### A Jupyter Notebook exploring the scipy.stats module for Python. [scipy.stats official documentation](https://docs.scipy.org/doc/scipy/reference/stats.html)
The scipy.stats module for Python offers a wide array of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more. Because statistics is such a broad discipline, other Python modules cover areas such as machine learning, classification, regression and model selection.<br> 
One particular area of interest for the purpose of this demonstration is statistical testing.
#### ANOVA Testing
One-way analysis of variance (ANOVA) tests whether there are any statistically significant differences between the means of two or more independent groups. It is generally used with three or more groups, with a t-test being used when there are only two; however, for the purpose of this example a one-way ANOVA will be applied to two groups. [Laerd Statistics](https://statistics.laerd.com/spss-tutorials/one-way-anova-using-spss-statistics.php)<br>
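
The relationship between the two tests mentioned above can be checked numerically. The following is a minimal sketch and not part of the original analysis. It uses synthetic, randomly generated data (the group means, spread, sample size and variable names are assumptions chosen only for illustration) to show that, for two groups, the one-way ANOVA F statistic equals the square of the equal-variance t statistic and that both tests give the same p-value.

```python
import numpy as np
import scipy.stats as ss

# Synthetic data for illustration only (NOT the golf ball dataset).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=270.0, scale=9.0, size=40)
group_b = rng.normal(loc=267.0, scale=9.0, size=40)

# One-way ANOVA on the two groups.
f_stat, p_anova = ss.f_oneway(group_a, group_b)

# Independent two-sample t-test (equal variances assumed by default).
t_stat, p_ttest = ss.ttest_ind(group_a, group_b)

print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
print(f"ANOVA p-value = {p_anova:.4f}, t-test p-value = {p_ttest:.4f}")
```

This is why applying a one-way ANOVA to two groups leads to the same conclusion as an independent two-sample t-test with equal variances assumed.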
#### Assumptions
As part of the one-way ANOVA process, the data must be checked against six assumptions to ensure that a one-way ANOVA is an appropriate analysis. Each of the six assumptions will be explored further in this notebook.
***
Import Python modules

In [1]:
# import modules
# numerical operations
import numpy as np

# general plotting
import matplotlib.pyplot as plt

# data frames
import pandas as pd

# statistical operations
import scipy.stats as ss

# statistical plots
import seaborn as sns

#### Example One-Way ANOVA: Golf Ball driving distance dataset
***
A golf ball manufacturer is testing whether driving distance differs significantly between the current golf ball design and the new golf ball design.
* **Null Hypothesis** (desired outcome) - The change in golf ball design has no effect on driving distance (the mean distances of the current and new balls are the same).
<br><br>
* **Alternative Hypothesis** - The change in golf ball design has a significant effect on driving distance (the mean distances of the current and new balls are significantly different).

In [2]:
# read in the dataset
df = pd.read_csv('https://raw.githubusercontent.com/killfoley/ML-and-Stats/main/data/golf_ball.csv')
df

Unnamed: 0,ball,distance
0,current,264
1,current,261
2,current,267
3,current,272
4,current,258
...,...,...
75,new,263
76,new,261
77,new,255
78,new,263
79,new,279


<br>

#### Assumption 1 - The dependent variable should be measured at the interval or ratio level  
(in this case metres)
***

In [3]:
# dependent variable
v_dep = df['distance']
v_dep

0     264
1     261
2     267
3     272
4     258
     ... 
75    263
76    261
77    255
78    263
79    279
Name: distance, Length: 80, dtype: int64

In [4]:
# describe the data
v_dep.describe()

count     80.000000
mean     268.887500
std        9.387568
min      250.000000
25%      262.000000
50%      267.500000
75%      275.250000
max      289.000000
Name: distance, dtype: float64
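
As a rough programmatic companion to this check, the dtype of the column can be tested. This is a minimal sketch under stated assumptions: `sample` is a hypothetical four-row frame copied by hand from rows shown above, and a numeric dtype is only a necessary condition. Whether a variable is truly interval or ratio level still depends on what was measured.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical small sample, hand-copied from rows displayed above.
sample = pd.DataFrame({
    'ball': ['current', 'current', 'new', 'new'],
    'distance': [264, 261, 263, 279],
})

# A dependent variable at interval/ratio level should at least be numeric.
dep_is_numeric = is_numeric_dtype(sample['distance'])

# The grouping column is categorical text, so it should not be numeric.
indep_is_numeric = is_numeric_dtype(sample['ball'])

print(dep_is_numeric, indep_is_numeric)
```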

<br>

#### Assumption 2 - The independent variable should consist of two or more categorical, independent groups.
***

In [5]:
# independent variable
v_indep = df['ball']
v_indep

0     current
1     current
2     current
3     current
4     current
       ...   
75        new
76        new
77        new
78        new
79        new
Name: ball, Length: 80, dtype: object

Note: There are two independent categories, 'current' and 'new'.
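
The number of categories and their sizes can also be counted directly rather than read off the printed output. The sketch below is illustrative only: `sample` is a hypothetical five-row stand-in for the `ball` column, so the counts it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the 'ball' column (not the full dataset).
sample = pd.Series(['current', 'current', 'current', 'new', 'new'],
                   name='ball')

# Number of distinct categories in the independent variable.
n_groups = sample.nunique()

# Number of observations in each category.
group_sizes = sample.value_counts()

print(n_groups)
print(group_sizes)
```

Applied to the notebook's own `v_indep`, the same two calls would confirm that there are at least two groups and show how many observations fall in each.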