# IT Academy - Data Science with Python
## Sprint 8: Hypothesis Testing
### [Github Hypothesis Testing](https://github.com/jesussantana/Hypothesis-testing)

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)  
[![Made withJupyter](https://img.shields.io/badge/Made%20with-Jupyter-orange?style=for-the-badge&logo=Jupyter)](https://jupyter.org/try)  
[![wakatime](https://wakatime.com/badge/github/jesussantana/Hypothesis-testing.svg)](https://wakatime.com/badge/github/jesussantana/Hypothesis-testing)

### Exercise 1: 
  - Choose a sports-themed dataset and select one attribute from it. Calculate the p-value and state whether you reject the null hypothesis at a 5% significance level (alpha = 0.05).

In [None]:
import pandas as pd 
import numpy as np
import scipy as sp
import datetime
import warnings
import time
import math

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import IPython as ip

import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import researchpy as rp

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

mpl.style.use('ggplot')
mpl.rc('font', family='Noto Sans CJK TC')
ip.display.set_matplotlib_formats('svg')

warnings.filterwarnings('ignore')
sns.set_theme(style='darkgrid', palette='deep')

In [None]:
np.random.seed(20180701+3)

In [None]:
pd.set_option('display.max_columns', None)

path = '../data/'
file = 'raw/MLB_Stats.csv'

df_raw = pd.read_csv(path+file)

In [None]:
df = df_raw.copy()

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe().round(3)

In [None]:
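# Restrict to numeric columns; pandas >= 2.0 raises an error if text columns are included
corr = df.corr(numeric_only=True)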
corr.style.background_gradient(cmap="magma")
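
Exercise 1 asks for a p-value on a single attribute. One possible approach, shown here only as a minimal sketch, is a two-sided one-sample t-test with `scipy.stats.ttest_1samp`. The sample below is synthetic and the reference mean `mu0` is a hypothetical value chosen for illustration; with the real data, a numeric column of `df` would take the place of `sample`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one numeric attribute (NOT the MLB data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=700, scale=60, size=30)

mu0 = 700.0   # hypothetical value claimed by H0: population mean == mu0
alpha = 0.05  # significance level required by the exercise

# Two-sided one-sample t-test of H0 against H1: population mean != mu0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Decision rule: reject H0 only when the p-value falls below alpha.
reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

The t-test assumes the attribute is roughly normally distributed or the sample is large enough; if that is doubtful, a non-parametric test may be more appropriate.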

### Exercise 2: 
  - Continue with the same sports-themed dataset and select two attributes from it. Calculate the p-value and state whether you reject the null hypothesis at a 5% significance level (alpha = 0.05).

### Hits vs Runs Scored

In [None]:
# jointplot returns a JointGrid; recent seaborn versions require x and y as keyword arguments
g = sns.jointplot(x=df.H, y=df.R)
g.set_axis_labels('Hits', 'Runs Scored')
plt.show()

In [None]:
dfReg = df[['H', 'R']]
dfReg.std().round(2)

In [None]:
dfReg.corr().round(3)

In [None]:
# Mean point (x̄, ȳ); the least-squares regression line passes through it

dfReg.mean().round(2)

In [None]:
x = dfReg.H
y = dfReg.R

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# %g keeps very small p-values readable instead of printing 0.000000
print("intercept: %f;  slope: %f;  std. error: %f;  p-value: %g;  R value: %f;  R-square: %f." % 
      (intercept, slope, std_err, p_value, r_value, r_value**2))
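
The standard deviations, correlation, and means computed in the previous cells are enough to rebuild the fitted line by hand: the slope equals `r * s_y / s_x`, and the intercept follows from the line passing through the mean point. The sketch below checks this against `stats.linregress` on synthetic data; the numbers are made up for illustration and are not taken from the MLB dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for two numeric columns (NOT the MLB data).
rng = np.random.default_rng(1)
x = rng.normal(1400, 80, size=30)
y = 0.6 * x - 150 + rng.normal(0, 20, size=30)

# Reference fit from SciPy.
res = stats.linregress(x, y)

# Rebuild the same line from summary statistics only.
r = np.corrcoef(x, y)[0, 1]                # Pearson correlation
slope = r * y.std(ddof=1) / x.std(ddof=1)  # b1 = r * s_y / s_x
intercept = y.mean() - slope * x.mean()    # line passes through (x_mean, y_mean)

print(np.isclose(slope, res.slope), np.isclose(intercept, res.intercept))
```

The p-value reported by `linregress` tests the null hypothesis that the slope is zero, which for a single predictor is equivalent to testing that the correlation is zero.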

In [None]:
plt.plot(x, y, 'o', label='original data',color='darkblue')
plt.plot(x, intercept + slope * x, 'r', label='fitted line') # equation 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥
plt.legend()
plt.title('Hits vs Runs Scored')
plt.ylabel('Runs Scored')  # y holds R (runs scored)
plt.xlabel('Hits')         # x holds H (hits)
plt.show()

- Split the dataset into training and test sets

In [None]:
X = df.H.values.reshape(-1,1)
y = df.R.values.reshape(-1,1)

In [None]:
X

- We assign 80% of the data to the training set and the remaining 20% to the test set using the code below.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
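
A possible next step after the split, sketched here under stated assumptions, is to fit `LinearRegression` on the training set and score it on the held-out test set. The data below is synthetic rather than the MLB file, and the target is kept one-dimensional for simplicity, whereas the notebook reshapes `y` into a column vector.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic stand-in for the feature and target columns (NOT the MLB data).
rng = np.random.default_rng(2)
X = rng.normal(1400, 80, size=(30, 1))                # one feature, 2-D as sklearn expects
y = 0.6 * X[:, 0] - 150 + rng.normal(0, 20, size=30)  # 1-D target

# Same split settings as in the notebook: 80% train, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training data only, then predict the unseen test rows.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)
print(f"slope = {model.coef_[0]:.3f}, intercept = {model.intercept_:.3f}")
print(f"test MAE = {mae:.3f}, test R^2 = {r2:.3f}")
```

Scoring on the test set gives an estimate of how the fitted line generalizes, which the in-sample R-square from `linregress` cannot provide.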

In [None]:
pearson_coef, p_value = stats.pearsonr(df.H, df.R)

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

- Conclusion:
Since the p-value is below 0.05, we reject the null hypothesis of no linear correlation between Hits and Runs Scored. The correlation is statistically significant, and the linear relationship is very strong (r ≈ 0.973).
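
To make the decision rule concrete, the sketch below shows where the p-value of a correlation test comes from. Under the null hypothesis of zero correlation, and assuming normally distributed data, the statistic `t = r * sqrt((n - 2) / (1 - r**2))` follows a Student t distribution with `n - 2` degrees of freedom. The data is synthetic and only illustrates the calculation; it does not reproduce the notebook's result.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for two numeric columns (NOT the MLB data).
rng = np.random.default_rng(3)
n = 30
x = rng.normal(0, 1, size=n)
y = 0.5 * x + rng.normal(0, 1, size=n)

alpha = 0.05
r, p_value = stats.pearsonr(x, y)

# Manual two-sided p-value from the t statistic with n - 2 degrees of freedom.
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.3f}, p (pearsonr) = {p_value:.3g}, p (t statistic) = {p_manual:.3g}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Statistical significance and strength are separate questions: the p-value says whether a zero correlation is plausible, while `r` itself measures how strong the linear relationship is.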