# Class Notes on Statistical Tests

Statistical tests are used to answer very specific questions about a data set.  They only apply in certain conditions, and they only answer certain questions, but they have great *statistical strength*.  

## Basic Tests

These tests are used mostly when we are exploring a data set, and their results often determine which later tests we can apply.  While these are the only ones we will explore in this class, there are *several* other statistical tests that are frequently used in Data Science.

### 1) Test for Normality

- Many theories in Data Science assume that a data set is **Normally Distributed**
    - i.e., it has a bell shape.  
- A *test for normality* gives us a way to verify if a particular data set really **HAS** a Normal Distribution
- There are actually several such tests:
    - Lilliefors Test
    - Anderson-Darling Test
    - Cramer-von Mises Test
    
- One of the simplest to use is a **Visual Check**
    - This works by simply plotting the data and overlaying the equivalent Normal Distribution

In [None]:
import pandas as pd
hw_data = pd.read_csv("http://nur-socr-web-dev02.miserver.it.umich.edu:3000/datasets/Baseball_Players.csv", usecols=["Name","Height(inches)","Weight(pounds)"])
hw_data.head()

In [None]:
hw_data.shape

In [None]:
import matplotlib.pyplot as plt
plt.rc("figure", figsize=(10,4), dpi=100)

In [None]:
plt.hist(hw_data["Height(inches)"], bins=17, density=True)
plt.grid()
plt.xlabel("Height"); plt.ylabel("Fraction of People")
plt.show()

In [None]:
import numpy as np
from scipy import stats

In [None]:
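# Select the height column explicitly; the frame also holds the non-numeric "Name" column
d_mean = hw_data["Height(inches)"].mean()
d_std = hw_data["Height(inches)"].std()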

In [None]:
d_mean

In [None]:
d_std

In [None]:
plt.hist(hw_data["Height(inches)"], bins=17, density=True)
xline = np.linspace(65,85)
plt.plot(xline, stats.norm.pdf(xline,d_mean,d_std),linewidth=5)
plt.grid()
plt.xlabel("Height"); plt.ylabel("Fraction of People")
plt.show()

In [None]:
from statsmodels.graphics.gofplots import qqplot

In [None]:
qqplot(hw_data["Height(inches)"], line='s')
plt.grid()
plt.show()

- A statistical test that could tell us the same thing is the Shapiro-Wilk test
- When this test returns a small p-value (commonly below 0.05), it tells us the data is likely **not** normal

In [None]:
stats.shapiro(hw_data["Height(inches)"])

In [None]:
plt.hist(hw_data["Weight(pounds)"], bins=15, density=True)
plt.grid()
plt.show()

In [None]:
qqplot(hw_data["Weight(pounds)"], line='s')
plt.grid()
plt.show()

In [None]:
stats.shapiro(hw_data["Weight(pounds)"])
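
To make the decision rule concrete, here is a minimal sketch that runs the Shapiro-Wilk test on two synthetic samples (generated for illustration, not taken from the baseball data) and compares the p-value to a threshold. The helper name `looks_normal` and the 0.05 threshold are assumptions made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=73.7, scale=2.3, size=500)  # synthetic bell-shaped data
skewed_sample = rng.exponential(scale=2.0, size=500)       # synthetic right-skewed data

def looks_normal(sample, alpha=0.05):
    """Return (is_plausibly_normal, p_value) from a Shapiro-Wilk test."""
    stat, p = stats.shapiro(sample)
    # A p-value below alpha is evidence against normality
    return bool(p >= alpha), float(p)

print(looks_normal(normal_sample))
print(looks_normal(skewed_sample))
```

A p-value above the threshold does not prove the data is normal; it only means the test found no strong evidence against normality.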

#### What About Other Distributions?

- If we wanted to test for other distributional shapes, we can always use the **Kolmogorov-Smirnov Test**, which lets us test any general distribution
    - Often abbreviated as the KS Test

In [None]:
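# A bare distribution name means its standard form (e.g. mean 0, std 1 for 'norm'),
# so we fit the location and scale to the data first and pass them through args
stats.kstest(hw_data["Height(inches)"], 'norm', args=stats.norm.fit(hw_data["Height(inches)"]))

In [None]:
stats.kstest(hw_data["Height(inches)"], 'expon', args=stats.expon.fit(hw_data["Height(inches)"]))

In [None]:
stats.kstest(hw_data["Height(inches)"], 'laplace', args=stats.laplace.fit(hw_data["Height(inches)"]))

In [None]:
stats.kstest(hw_data["Height(inches)"], 'uniform', args=stats.uniform.fit(hw_data["Height(inches)"]))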
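
One detail worth keeping in mind with `kstest`: a distribution name on its own refers to the standard form of that distribution (for `'norm'`, mean 0 and standard deviation 1), so location and scale go in through `args`. The sketch below shows the difference on synthetic heights; the sample is generated for illustration and is only a stand-in for the real column.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
heights = rng.normal(loc=73.7, scale=2.3, size=1000)  # synthetic stand-in for the height column

# Compared against the standard normal N(0, 1): values near 74 can never match
unscaled = stats.kstest(heights, 'norm')

# Compared against a normal curve with the sample's own mean and standard deviation
scaled = stats.kstest(heights, 'norm', args=(heights.mean(), heights.std(ddof=1)))

print("unscaled:", unscaled.statistic, unscaled.pvalue)
print("scaled:  ", scaled.statistic, scaled.pvalue)
```

Because the parameters are estimated from the same data being tested, the p-value of the scaled test is only approximate; the Lilliefors test mentioned earlier exists to correct for this in the normal case.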

### 2) Test for Outliers

- Whenever we look at data, we need to be wary of outliers
    - They can really mess up predictions
    - They may need to be removed
- An outlier can be:
    - a **global** outlier if it is significantly different than all other data points
    - a **contextual** outlier if it is only an outlier in a certain context
        - These are much harder to find
- There are multiple definitions for global outliers:
    - If the data is roughly normal, then any data point more than 3 standard deviations from the mean is an outlier
    - Using the quartiles (which split the data into 25% chunks) we can calculate the Inter-Quartile Range (IQR) as the distance between the 25th percentile (Q1) and the 75th percentile (Q3)
        - Any data point more than 1.5 times the IQR below Q1 or above Q3 is an outlier
    - More generally, a global outlier is typically defined as lying *significantly* beyond the 1st or 99th percentile
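
The first two definitions above can be sketched in a few lines of NumPy. The function names and the small sample are hypothetical, chosen only to illustrate the rules; the sample is not taken from the baseball data.

```python
import numpy as np

def three_sigma_outliers(values):
    """Boolean mask of points more than 3 standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > 3

def iqr_outliers(values, k=1.5):
    """Boolean mask of points more than k * IQR below Q1 or above Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical heights in inches, with one implausible entry added on purpose
sample = np.array([70, 71, 72, 72, 73, 73, 73, 74, 74, 75, 75, 76, 77, 95])

print(sample[three_sigma_outliers(sample)])
print(sample[iqr_outliers(sample)])
```

Note that a single extreme value inflates the standard deviation, so in small samples the 3-standard-deviation rule can miss outliers that the IQR rule catches.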

In [None]:
plt.boxplot(hw_data["Height(inches)"],vert=False, patch_artist=True)
plt.show()

In [None]:
hw_data[hw_data["Height(inches)"] < 68]

In [None]:
hw_data[hw_data["Height(inches)"] > 79]

- The Tietjen-Moore test can find outliers
- It typically requires the exact specification of the NUMBER of supposed outliers

In [None]:
import scikit_posthocs

In [None]:
good_vals = scikit_posthocs.outliers_tietjen(hw_data["Height(inches)"], 12)

In [None]:
plt.boxplot([good_vals,hw_data["Height(inches)"]],vert=False, patch_artist=True)
plt.show()

### 3) Test for Correlation

- We have already looked at correlations in class, but we did not test if the correlation was **significant**
- This is different from having a large or small correlation
- The correlation is deemed *significant* if it is large enough given the *sample size*

In [None]:
from scipy.stats import pearsonr

In [None]:
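# Restrict to the numeric columns; the "Name" column cannot be correlated
hw_data[["Height(inches)", "Weight(pounds)"]].corr()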

In [None]:
stat, p = pearsonr(hw_data["Height(inches)"], hw_data["Weight(pounds)"])

In [None]:
stat

In [None]:
p

- This very small p-value indicates that the correlation between height and weight is **statistically significant**.
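
To see why significance depends on sample size and not only on the size of the correlation, here is a small sketch that computes the two-sided p-value for a fixed correlation at several sample sizes. It uses the standard t-statistic for a Pearson correlation, which assumes the two variables are roughly jointly normal; the helper name `pearson_p_value` and the value `r = 0.3` are assumptions made for this example.

```python
import numpy as np
from scipy import stats

def pearson_p_value(r, n):
    """Two-sided p-value for a Pearson correlation r observed on n pairs."""
    # Under the null hypothesis of no correlation, t follows a
    # Student t distribution with n - 2 degrees of freedom
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same modest correlation, observed on samples of different sizes
for n in (10, 100, 1000):
    print(n, pearson_p_value(0.3, n))
```

The same correlation of 0.3 is not significant with 10 points but becomes highly significant with 1000.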

### 4) Test for Homogeneity

- Sometimes we want to test if two sets of data have the same *frequency* of labels showing up
- For example, we could label all the baseball players above 200 lbs as "Big" and all other players as "Small"
- We could then check to see if the label "Big" occurs with the same frequency for tall people (above 74") as it does for shorter people

In [None]:
df = hw_data.copy()  # work on a copy so hw_data keeps its original columns

In [None]:
df['Size'] = pd.Categorical(["Big"]*len(df), categories=["Big", "Small"])

In [None]:
df.loc[df["Weight(pounds)"] > 200, 'Size'] = "Big"

In [None]:
df.loc[df["Weight(pounds)"] <= 200, 'Size'] = "Small"

In [None]:
# Fraction of each label among tall players
df[df["Height(inches)"] > 74].Size.value_counts(normalize=True)

In [None]:
len(df[df["Height(inches)"] > 74])

In [None]:
# Fraction of each label among shorter players
df[df["Height(inches)"] <= 74].Size.value_counts(normalize=True)

In [None]:
plt.hist(df.loc[df["Height(inches)"] > 74, "Weight(pounds)"], bins=10)
plt.hist(df.loc[df["Height(inches)"] <= 74, "Weight(pounds)"], bins=10, alpha=0.5)
plt.legend(["Tall", "Short"])
plt.grid()
plt.show()

In [None]:
df.loc[df["Height(inches)"] > 74, "Size"].value_counts().plot(kind='bar', rot=0)
plt.grid()
plt.show()

In [None]:
df.loc[df["Height(inches)"] <= 74, "Size"].value_counts().plot(kind='bar', rot=0)
plt.grid()
plt.show()

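- We can use a chi-square test to determine if they are in fact different

In [None]:
from scipy.stats import chi2_contingency

In [None]:
# The test needs raw counts, not proportions: rows are short/tall, columns are Big/Small
counts = pd.crosstab(df["Height(inches)"] > 74, df["Size"])
chi2_contingency(counts)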
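
A chi-square test works on raw counts, because the same proportions become more or less convincing as the sample grows. The sketch below uses `scipy.stats.chi2_contingency` on a hypothetical table of counts (not the baseball data) and then repeats the test with every count multiplied by 10, so the proportions stay identical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, made up for illustration.
# Rows: tall, short.  Columns: Big, Small.
counts = np.array([[120, 280],
                   [150, 450]])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.3f}, p={p:.4f}, dof={dof}")
print(expected)  # counts we would expect if both groups shared the same label frequencies

# Same proportions, ten times the data
chi2_big, p_big, _, _ = chi2_contingency(counts * 10)
print(f"chi2={chi2_big:.3f}, p={p_big:.3g}")
```

With ten times the data the p-value drops sharply even though the proportions are unchanged, which is why passing proportions instead of counts gives a misleading result.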