# Scipy-Stats Jupyter Notebook
*****
*****

### Outline of this Notebook:

1. Overview of Scipy Stats Jupyter Notebook
2. Analysis of Variance (ANOVA)
3. Dataset: Diet
    - Importing Packages for the Notebook
    - Exploring the Dataset
    - Preprocessing the Data
4. Hypothesis and Assumption Testing of Dataset to meet ANOVA Requirements
5. Conducting the ANOVA Test
6. Conclusion
7. References

### Overview of Scipy Stats Jupyter Notebook

The scipy.stats module is a sub-package of the SciPy library that provides tools for statistical analysis, including probability distributions, random variables and statistical operations. https://data-flair.training/blogs/scipy-statistical-functions/ For example, it can be used to analyse normal distributions and calculate distribution values with a number of built-in methods. https://www.delftstack.com/api/scipy/scipy-scipy.stats.norm-method/
<br>
Within the library, there are functions for both continuous and discrete distributions, as well as functions that perform hypothesis tests such as t-tests. https://data-flair.training/blogs/scipy-statistical-functions/?ref=morioh.com&utm_source=morioh.com The library works seamlessly with other packages to enable statistical calculations, descriptive analysis and data visualisation. These include:
- pandas
<br>
- matplotlib
<br>
- seaborn
<br>
- numpy
<br>
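
As a brief illustration of the continuous and discrete distributions mentioned above, the following sketch evaluates one of each with `scipy.stats`. The parameter values are arbitrary choices made for this example and are not taken from the diet dataset.

```python
import scipy.stats as ss

# Continuous distribution: standard normal (mean 0, standard deviation 1)
prob_below_zero = ss.norm.cdf(0)   # P(X <= 0)
density_at_zero = ss.norm.pdf(0)   # height of the bell curve at its centre

# Discrete distribution: number of heads in 2 fair coin tosses (illustrative values)
prob_one_head = ss.binom.pmf(1, n=2, p=0.5)   # P(exactly 1 head)

print(prob_below_zero, density_at_zero, prob_one_head)
```

Continuous distributions expose `pdf` and `cdf`, while discrete ones expose `pmf` and `cdf`, so the same pattern carries over to the other distributions in the module.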


**Some of the Key Terms in Statistical Analysis that will be referred to in this notebook** https://realpython.com/python-statistics/
<br>
<br>
**Types of Variables**
<br>
- **Dependent Variable:** The data category that is examined to see whether the independent variables have any effect on it
- **Independent Variable:** The measured data points that may have an effect on the dependent variable
<br>
**Measures of Central Tendency**
<br>
- **Mean**: the average of all the items in the dataset.
- **Median**: the middle element of a sorted dataset.
<br>
**Measures of Variability**
<br>
 - **Variance**: the average of the squared differences from the mean 
$$Var(X) = E(X^2) - (E(X))^2$$
<br>
- **Standard Deviation**: a measure of how spread out the numbers are, calculated as the square root of the variance https://www.mathsisfun.com/data/standard-deviation.html
<br>
$$\sigma = \sqrt{Var(X)}$$
<br>
- **Percentiles** 
<br>
*is the element in the dataset such that p% of the elements in the dataset are less than or equal to that value.... Each dataset has three quartiles (first quartile is the sample 25th percentile, second quartile is the sample 50th percentile (median) and the third quartile is the sample 75th percentile)*
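
The key terms above can be checked on a small sample. This is a minimal sketch using NumPy; the sample values are made up for illustration, and `np.var` is used with its default population form so that it matches the formula $Var(X) = E(X^2) - (E(X))^2$.

```python
import numpy as np

# Hypothetical sample, made up for illustration only
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(sample)        # average of all the items
median = np.median(sample)    # middle element of the sorted data

# Population variance (ddof=0); pass ddof=1 for the sample variance instead
variance = np.var(sample)
# The same value from the identity Var(X) = E(X^2) - (E(X))^2
variance_identity = np.mean(sample ** 2) - np.mean(sample) ** 2

std_dev = np.sqrt(variance)   # standard deviation is the square root of the variance

# The three quartiles are the 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(sample, [25, 50, 75])

print(mean, median, variance, std_dev, q1, q2, q3)
```

Note that the second quartile and the median are the same value by definition.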

### Analysis of Variance (ANOVA)

**Analysis of Variance** can be defined as:
<br>
<br>
*a statistical formula used to compare variances across the means of different groups... where a range of scenarios use it to determine if there is any difference between the means of different groups*
<br>

https://www.tibco.com/reference-center/what-is-analysis-of-variance-anova. 
Because of its procedure, ANOVA helps select the best features when training a model, reduces complexity by limiting input variables and can determine whether an independent variable is influencing a target variable.
<br>
<br>
The outcome of the ANOVA is the 'F statistic', which enables the researcher to conclude whether or not the null hypothesis is supported. It is calculated as the ratio of the between-group variance to the within-group variance. ANOVA is important in ascertaining whether or not differences between mean values are statistically significant. ANOVA can also indirectly show if an independent variable is influencing the dependent variable.
<br>
<br>
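
To make the F statistic concrete, the sketch below computes it by hand as the between-group mean square divided by the within-group mean square, and compares the result with `scipy.stats.f_oneway`. The three groups are hypothetical values made up for this example, not the diet data.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical measurements for three groups (illustrative values only)
groups = [
    np.array([3.0, 2.5, 4.1, 3.6, 2.9]),
    np.array([3.2, 2.8, 3.9, 4.4, 3.0]),
    np.array([5.1, 4.8, 6.0, 5.5, 4.9]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)

f_manual = ms_between / ms_within
# p-value: upper tail of the F distribution with (k - 1, n - k) degrees of freedom
p_manual = ss.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = ss.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

A large F value means the group means differ by more than the spread within the groups would suggest, which is what leads to a small p-value.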

**Limitations of ANOVA**
<br>
- The test can only show whether there is a significant difference between the means of at least two groups; it can't identify which pairs of groups differ in their means. This requires ANOVA to be used in tandem with other statistical methods
- Assumes a normal distribution, limiting its ability to work with data that is not normally distributed and/or may contain outliers
- Assumes the standard deviation is similar across the groups, to avoid inaccurate conclusions being made
<br>
**Hypothesis Testing**
<br>
- **A Null Hypothesis (H0)** It is inferred there is no difference between the groups or means
<br>
- **An Alternative Hypothesis** It is inferred that there is a difference between the groups or means
<br>
**Types of ANOVA**
<br>
 - One Way ANOVA: The one-way ANOVA is suitable when there is only one independent variable with two or more levels.
 - Two Way ANOVA: Used when there are two independent variables, each of which may have multiple levels, and the design includes every possible combination of the variables and their levels.
 <br> 

## Diet Dataset

### Importing Packages for this notebook

In [None]:
import pandas as pd
import seaborn as sns
import scipy.stats as ss
import statsmodels.api as sm
import numpy as np
import collections as co
import scipy.special as spec
import matplotlib.pyplot as plt

### Exploring the Dataset
***

The Diet Dataset contains information on 76 participants who undertook one of 3 diets (A, B, C). At the beginning and end of the trial, the participants' weights were taken. The dataset contains information on their gender, allocated diet, height, initial weight and weight after six weeks. https://bioinformatics-core-shared-training.github.io/linear-models-r/anova.html This analysis will explore whether or not there is any correlation between the height of the participants and the weight lost. It can further explore which of the diets was most effective in assisting weight loss and whether the gender of the participants has an impact on the outcome of the diet.

### Preprocessing the Dataset
***

In [None]:
df = pd.read_csv('dietdataset.csv') #first look at the dataset
df

In [None]:
df.describe()

In [None]:
df = pd.read_csv('dietdataset.csv', na_values=' ') # treating blank (' ') cells as NaN
df

In [None]:
df.rename(columns={'pre.weight': 'initialWeight'}, inplace=True) #changing the name of the pre.weight column
df

In [None]:
df['Diet'] = df['Diet'].replace([1,2,3], ['Diet A', 'Diet B', 'Diet C']) #changing the names of the Diet variables 
df

In [None]:
df['weightloss'] = (df['initialWeight'] - df['weight6weeks']) #creating my dependent variable
df

In [None]:
df.isna().sum() # looking into the Nan values

There are two instances where the gender is NaN. It appears that these participants did not fully participate in the trial. As they make up only a small number of data points, these rows will be dropped.

In [None]:
df.dropna(axis=0, how='all', subset=['gender'], inplace=True)
df

In [None]:
df.groupby("Diet")['weightloss'].describe()

In [None]:
df['Diet'].value_counts()

In [None]:
# legend_elements() is a method of the scatter plot, so the plot is stored in a variable
scatter = plt.scatter(df.Height, df.initialWeight, c=df.gender, cmap="bwr")

# No arguments necessary, default is prop='colors'
handles, labels = scatter.legend_elements()

# Print out labels to see which appears first
print(labels)

# Rename the labels to the genders (0 is assumed to be Female and 1 Male)
labels = ['Female', 'Male']
leg = plt.legend(handles, labels, frameon=True)
leg.get_frame().set_linewidth(1.0)
leg.get_frame().set_edgecolor('b')
plt.xlabel("Height")
plt.ylabel("Initial weight")
plt.title("Height against initial weight by gender")
plt.show()
# Reference
# https://blog.finxter.com/matplotlib-scatter-plot/

In [None]:
points = plt.scatter(df.Height, df.initialWeight, c=df.gender, cmap="rainbow", lw=0) #colour the points by the gender codes (0 and 1) to see which cluster each code belongs to
plt.colorbar(points)

From the above chart, the two clusters suggest that one group tends to be taller and to have a higher initial weight. It is well known that males are on average taller and weigh more than their female counterparts. It is therefore assumed that this group is male, and the labels will be changed accordingly.
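
The visual impression from the chart can also be checked numerically by comparing the group means for each gender code. This is only a sketch: the rows below are hypothetical values made up for illustration, and only the column names (`gender`, `Height`, `initialWeight`) follow the notebook.

```python
import pandas as pd

# Hypothetical rows, made up for illustration; only the column names follow the notebook
example = pd.DataFrame({
    'gender': [0, 0, 0, 1, 1, 1],
    'Height': [160, 165, 158, 178, 182, 175],
    'initialWeight': [60, 64, 58, 80, 85, 78],
})

# Mean height and initial weight for each gender code
group_means = example.groupby('gender')[['Height', 'initialWeight']].mean()
print(group_means)

# The code with the larger mean height is the candidate for 'Male'
taller_code = group_means['Height'].idxmax()
print(taller_code)
```

Running the same `groupby` on the diet DataFrame before the codes are renamed would show which code corresponds to the taller, heavier cluster.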

In [None]:
df['gender'] = df['gender'].replace([0,1], ['Female', 'Male']) #changing the names of the gender variables
df

In [None]:
# count plot on single categorical variable
sns.countplot(x ='Diet', hue ='gender', data = df)
 
# Show the plot
plt.show()

In [None]:
df.Diet.describe()

## Hypothesis Testing

The following hypotheses are drawn from the initial exploratory data analysis:
<br>

#### Hypothesis 1

<br>

**A Null Hypothesis (H0)** The means of all diets are equal with respect to weightloss

<br>

**An Alternative Hypothesis** The mean of at least one diet is different with respect to weightloss

<br>

#### Hypothesis 2

<br>

**A Null Hypothesis (H0)** The means of all genders are equal with respect to weightloss

<br>

**An Alternative Hypothesis** The means of the genders are different with respect to weightloss

<br>
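
Hypothesis 2 compares only two groups. In that case a one-way ANOVA and an independent two-sample t test (with equal variances assumed) give the same answer, with the F statistic equal to the square of the t statistic. The sketch below illustrates this with `scipy.stats`; the weight loss values are hypothetical and are not taken from the diet dataset.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical weight losses for two groups (illustrative values only)
female = np.array([2.1, 3.4, 1.8, 4.0, 2.9, 3.3])
male = np.array([3.9, 4.6, 2.7, 5.1, 4.4, 3.8])

# Independent two-sample t test; equal variances are assumed by default
t_stat, t_p = ss.ttest_ind(female, male)

# One-way ANOVA on the same two groups
f_stat, f_p = ss.f_oneway(female, male)

print(t_stat ** 2, f_stat)   # the two statistics agree
print(t_p, f_p)              # and so do the p-values
```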

## Assumptions within the dataset
***
<br>
Limitations of Assumptions
https://www.statology.org/one-way-anova-r/

| Assumption | Explanation |
| :- | :- |
| **Your dependent variable should be measured at the interval or ratio level** | Dependent variables must be of 'metric measurements' and the values can take on any given number within a range. https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/assumptions-of-the-factorial-anova/ https://www.javatpoint.com/anova-test-in-python |
| **Your independent variable should consist of two or more categorical independent groups** | The categorical groups shouldn't overlap; being part of one group shouldn't affect the chance of being part of another group. |
| **You should have independence of observation** | There is no relationship between observations in each group or between the groups themselves. Each time there is a new datapoint in a group it is independent of all other datapoints in that group. |
| **There should be no significant outliers** | It can be difficult to define an outlier in the context of the data set. |
| **Your dependent variable should be approximately normally distributed for each category of the independent variable** | When measured, the data points should take the form of a bell shaped curve. https://www.statology.org/anova-assumptions/ |
| **There needs to be homogeneity of variances** | This assumption can be tested using Levene's test for homogeneity of variances. https://www.statology.org/anova-assumptions/ |

### Assumption 1: Dependent Variable
***

The chosen dependent variable in this instance is the metric weightloss, which is measured in kg.

In [None]:
#The dependent variable
dependent = df['weightloss']
dependent

### Assumption 2: Independent Variable
***

As two independent variables are examined in this notebook, the following have been chosen:
<br>
**Diet** - categories are: Diet A, Diet B and Diet C
<br>
**Gender** - categories are: 0 (which = Female) and 1 (which = Male)

In [None]:
#first independent variable
independent = df['Diet']
independent

### Assumption 3: Independence of Observation
***

This is a study design issue rather than something that can be tested for. For this assumption to be met, the "observations in each group are independent of each other and the observations within groups were obtained by a random sample". https://www.statology.org/anova-assumptions/ There is no standardised test to ensure independence of observation; nonetheless, if this assumption is violated, the results obtained from the analysis could be wrong. Strong, robust and ethical data collection is required. https://www.statisticshowto.com/assumption-of-independence/

### Assumption 4: There should be no significant outliers
***

Outliers are unusual values in a dataset which can impact the analysis and distort the findings from research. https://statisticsbyjim.com/basics/remove-outliers/ The diet dataset has already been preprocessed to remove any null values. Some causes of outliers include data entry errors, sampling errors and natural variation.

In [None]:
# https://seaborn.pydata.org/generated/seaborn.boxplot.html
sns.boxplot(x=dependent, y=independent)
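
A box plot by default draws as individual points any values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies the same rule directly with NumPy so that the flagged values can be listed. The numbers are hypothetical and made up for illustration, and the 1.5 multiplier is a convention rather than a strict definition of an outlier.

```python
import numpy as np

# Hypothetical weight losses for one group (illustrative values only)
values = np.array([1.2, 2.5, 3.1, 3.4, 3.8, 4.0, 4.4, 5.1, 9.6])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as a potential outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

In the notebook the same check would be applied separately to the weight losses of each diet, since the box plot draws one box per group.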

### Assumption 5: Normal Distribution with the Dataset
***

The normal distribution has two parameters: the mean and the standard deviation. The data points are centred around the mean. The higher the standard deviation, the flatter the curve will be. https://www.kaggle.com/gadaadhaarigeek/normal-distribution
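
The effect of the standard deviation on the shape of the curve can be seen by comparing the height of the peak for two normal distributions with the same mean. This is a small sketch with `scipy.stats.norm`; the mean and standard deviations are arbitrary values chosen for illustration.

```python
import numpy as np
import scipy.stats as ss

mean = 0.0
narrow = ss.norm(loc=mean, scale=1.0)   # standard deviation of 1
wide = ss.norm(loc=mean, scale=2.0)     # standard deviation of 2

# The curve is highest at the mean; a larger standard deviation lowers the peak
peak_narrow = narrow.pdf(mean)
peak_wide = wide.pdf(mean)
print(peak_narrow, peak_wide)

# Both curves still enclose the same total probability of 1,
# so the flatter curve has to spread further from the mean
print(narrow.cdf(1.0) - narrow.cdf(-1.0), wide.cdf(1.0) - wide.cdf(-1.0))
```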

In [None]:
#To explore the normal distribution of each of the Diets in respect to the weightloss category
#KDE of the three categories
sns.displot(x=dependent, hue=independent, kind="kde")

As the box plot for the previous assumption suggested, there are outliers in the dataset, and they can have a negative impact on the data analysis. Above, you can see that Diet A has a slightly positively skewed distribution, whereas the other two diets are minimally negatively skewed. However, they all appear to have a bell shaped curve. https://www.analyticsvidhya.com/blog/2020/04/statistics-data-science-normal-distribution/
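
The direction of skew read from the KDE plot can also be quantified with `scipy.stats.skew`, where a positive value indicates a longer right tail and a negative value a longer left tail. The samples below are hypothetical and exist only to show the sign convention.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical samples (illustrative values only)
right_tailed = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0])
left_tailed = -right_tailed                        # mirror image of the sample above
symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

skew_right = ss.skew(right_tailed)   # positive: long tail to the right
skew_left = ss.skew(left_tailed)     # negative: long tail to the left
skew_sym = ss.skew(symmetric)        # zero for a perfectly symmetric sample

print(skew_right, skew_left, skew_sym)
```

In the notebook this could be applied to the weight losses of each diet group to check the skew described above.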

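import statsmodels.api as sm

# Q-Q plot of the Diet A weight losses; fit=True standardises the data for the 45-degree line
data_points = df.loc[df['Diet'] == 'Diet A', 'weightloss']

sm.qqplot(data_points, fit=True, line='45')
plt.show()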

Since the dataset is small and there are potential outliers, a Shapiro-Wilk test will be performed on the weight loss of each diet group (Diet A, Diet B and Diet C).
Next, the weight losses for each diet will be extracted and a Shapiro-Wilk normality test run to see whether the data rejects the hypothesis of normality. https://variation.com/wp-content/distribution_analyzer_help/hs141.htm#:~:text=Shapiro%2DWilks%20Normality%20Test&text=The%20Shapiro%2DWilks%20test%20for,than%20or%20equal%20to%200.05. The Shapiro-Wilk test was selected because the sample size of the dataset is relatively small (*n = 76*). https://statistics.laerd.com/spss-tutorials/testing-for-normality-using-spss-statistics.php

In [None]:
#extract the Diet A weight losses
weightloss_dietA = dependent[independent == 'Diet A']
weightloss_dietA.head()

In [None]:
ss.shapiro(weightloss_dietA)

In [None]:
weightloss_dietB = dependent[independent == 'Diet B']
weightloss_dietB.head()

In [None]:
ss.shapiro(weightloss_dietB)

In [None]:
#extract the Diet C weight losses
weightloss_dietC = dependent[independent == 'Diet C']
weightloss_dietC.head()

In [None]:
ss.shapiro(weightloss_dietC)

In each case p > 0.05, which means that the test did not show evidence of non-normality. https://quantifyinghealth.com/report-shapiro-wilk-test/
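
The decision rule used above can be written as a small helper. This is a sketch only: `shows_non_normality` is a hypothetical function name introduced for this example, the 0.05 threshold is the significance level used in this notebook, and the two samples are made up to show both outcomes.

```python
import numpy as np
import scipy.stats as ss

def shows_non_normality(values, alpha=0.05):
    """Return True when the Shapiro-Wilk test rejects normality at level alpha."""
    result = ss.shapiro(values)
    return result.pvalue < alpha

# Hypothetical bell-shaped sample: evenly spaced quantiles of a normal distribution
bell_shaped = ss.norm.ppf(np.linspace(0.05, 0.95, 19))

# Hypothetical sample dominated by a single extreme value
one_extreme = np.array([1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 1.0, 0.9, 50.0])

print(shows_non_normality(bell_shaped))
print(shows_non_normality(one_extreme))
```

A result of `False` means only that the test found no evidence against normality, not that the data is proven to be normal.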

To visualise how closely the data follows a normal distribution, the Q-Q plot function is used

import numpy as np
import statsmodels.api as sm
import pylab as py

data_points = weightloss_dietB

# fit=True standardises the data so that it can be compared with the 45-degree line
sm.qqplot(data_points, fit=True, line='45')
py.show()


### Assumption 6: There needs to be homogeneity of variances
***

This assumption examines the spread of values around the means of continuous variables, with the aim of determining whether or not the spreads are relatively similar across groups. https://methods.sagepub.com/reference/encyc-of-research-design/n179.xml. In a test of this assumption, a p value of less than 0.05 indicates a violation. "Listwise deletion, logarithmic transformation or non parametric methods" should be considered as an alternative if it is violated. https://www.scalestatistics.com/homogeneity-of-variance.html https://www.statisticssolutions.com/the-assumption-of-homogeneity-of-variance/
<br>
Bartlett's test for homogeneity is also conducted. This focuses on determining a *"test-statistic and finding the p value for the test-statistic, given the degrees of freedom and significance level"* https://stattrek.com/anova/homogeneity/bartletts-test.aspx

In [None]:
#conducting Levene's test of homogeneity
ss.levene(
    dependent[independent == 'Diet A'],
    dependent[independent == 'Diet B'],
    dependent[independent == 'Diet C'],
)

The test shows a p value higher than 0.05, indicating that there is no evidence of a violation of this assumption.
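
For contrast, the sketch below shows what Levene's test reports when the assumption holds and when it is clearly violated. All three samples are hypothetical and made up for illustration; they are not the diet data.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical samples (illustrative values only)
tight = np.array([10.0, 10.1, 9.9, 10.05, 9.95, 10.0, 10.1, 9.9, 10.02, 9.98])
shifted = tight + 5.0   # different mean, same spread as 'tight'
wide = np.array([5.0, 15.0, 2.0, 18.0, 8.0, 12.0, 1.0, 19.0, 10.0, 11.0])

# Same spread: the test compares spreads, so a shift in the mean does not matter
stat_equal, p_equal = ss.levene(tight, shifted)

# Very different spread: the assumption of homogeneity is violated
stat_unequal, p_unequal = ss.levene(tight, wide)

print(p_equal, p_unequal)
```

The first comparison leaves the p value above 0.05 and the second pushes it below, which is the pattern to look for when checking the diet groups.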

In [None]:
# conducting Bartlett's test
from scipy.stats import bartlett # https://www.marsja.se/levenes-bartletts-test-of-equality-homogeneity-of-variance-in-python/

# subsetting the data:
DietA = df.query('Diet == "Diet A"')['weightloss']
DietB = df.query('Diet == "Diet B"')['weightloss']
DietC = df.query('Diet == "Diet C"')['weightloss']

# Bartlett's test in Python with SciPy:
stat, p = bartlett(DietA, DietB, DietC)

# Get the results:
print(stat, p)

#to get each individual group
df['Diet'].unique() #https://stattrek.com/anova/homogeneity/bartletts-test.aspx

The p value is greater than the significance level, indicating that the null hypothesis should not be rejected and the assumption is met.

## Performing the ANOVA

#### Fisher's One Way ANOVA

In [None]:
# ANOVA.
ss.f_oneway(
    dependent[independent == 'Diet A'],
    dependent[independent == 'Diet B'],
    dependent[independent == 'Diet C']
)

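The p-value is below the 0.05 significance level, indicating that there is a difference between the means of at least two of the groups. The ANOVA itself does not show which diets differ, and running an individual t test on each pair of diets would increase the chance of a false positive, so further tests are needed to explore the relationship between each diet and the accompanying weight loss. As the groups are also not the same size, Welch's ANOVA is run as an additional check.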

In [None]:
%pip install pingouin

In [None]:
import pingouin as pg
import pandas as pd
import numpy as np

#create an example DataFrame (named example_df so that the diet data in df is not overwritten)
example_df = pd.DataFrame({'score': [64, 66, 68, 75, 78, 94, 98, 79, 71, 80,
                                     91, 92, 93, 90, 97, 94, 82, 88, 95, 96,
                                     79, 78, 88, 94, 92, 85, 83, 85, 82, 81],
                           'group': np.repeat(['a', 'b', 'c'], repeats=10)})

#perform Welch's ANOVA on the example data
pg.welch_anova(dv='score', between='group', data=example_df)

In [None]:
import pingouin as pg

#perform the Games-Howell post hoc test on the diet data
pg.pairwise_gameshowell(dv='weightloss', between='Diet', data=df)

#perform Welch's ANOVA on the diet data
pg.welch_anova(dv='weightloss', between='Diet', data=df)

## Post Hoc Tests

In [None]:
# https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide-4.php
# https://www.statology.org/tukey-test-python/
# Import the Tukey HSD test from statsmodels
from statsmodels.stats.multicomp import pairwise_tukeyhsd


# endog is the dependent variable (weightloss).
# groups is the independent variable (Diet).
# alpha is the significance level. In this case a p-value below 0.05 will reject the null hypothesis.
tukey = pairwise_tukeyhsd(endog=df['weightloss'],
                          groups=df['Diet'],
                          alpha=0.05)

# Print the Tukey table
print(tukey)
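
The cell above uses statsmodels. As an alternative sketch, SciPy itself provides `scipy.stats.tukey_hsd`, assuming SciPy 1.8 or later is installed. The three groups below are hypothetical values made up for illustration, not the diet data.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical weight losses for three diets (illustrative values only)
diet_a = np.array([3.0, 2.5, 4.1, 3.6, 2.9])
diet_b = np.array([3.2, 2.8, 3.9, 4.4, 3.0])
diet_c = np.array([5.1, 4.8, 6.0, 5.5, 4.9])

# Tukey's HSD compares every pair of groups while controlling the overall error rate
result = ss.tukey_hsd(diet_a, diet_b, diet_c)

# result.pvalue[i, j] is the p-value for the comparison of group i with group j
p_a_vs_b = result.pvalue[0, 1]
p_a_vs_c = result.pvalue[0, 2]
p_b_vs_c = result.pvalue[1, 2]

print(result)
print(p_a_vs_b, p_a_vs_c, p_b_vs_c)
```

Groups are identified by their position in the call rather than by a label column, so the order of the arguments needs to be kept in mind when reading the table.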

## Conclusion

- Diet C demonstrated a greater weight loss after 6 weeks than the other two diets: its mean weightloss was greater than that of the other two.
- Ensuring the dataset meets the assumptions can be difficult, especially when the dataset may contain outliers. In this notebook, numerous tests were used to ensure the assumptions were met due to the nature of the outliers.
- Data collection is extremely important and should always be considered when first designing your hypothesis. In this dataset, as the number of participants differed between the diets, different tests had to be conducted.
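- For example, as it was a small dataset, a paired t test could have been utilised within each diet to compare the participants' initial weight with their weight after six weeks; comparing the means of the three diets themselves requires independent-samples tests such as those above.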

## References