
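# SciPy
SciPy (Scientific Python) is an open-source library for scientific computing built on NumPy.

It provides utility functions for optimization, statistics, signal processing, and more. This tutorial uses its `scipy.stats` module.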

# statsmodels

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. 

# Importing Libraries

In [48]:
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [49]:
# To read data as dataframe
df = pd.read_csv('https://raw.githubusercontent.com/philsaurabh/Tutorials/main/Advertising.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,TV,radio,newspaper,sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


# Percent Point Function
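The percent point function `ppf` is the inverse of the cumulative distribution function `cdf`: given a probability, it returns the value below which that fraction of the distribution lies. For example, `ppf(0.5)` is the median of the distribution.

By default, `st.norm` is the standard normal curve (mean = 0, standard deviation = 1). We provide the shaded area to the left of a point and get back the critical value, which is the reverse of what `cdf` does.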

In [50]:
st.norm.ppf(.95)
#st.norm.ppf([.95,0.05])

1.6448536269514722

In [51]:
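a=st.norm.ppf(.975)  # upper bound of the central 95% region
b=st.norm.ppf(.025)  # lower bound of the central 95% region
two_sided=a-b        # width of the central 95% region
two_sided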

3.9199279690801085
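
As a small sketch of the inverse relationship between `ppf` and `cdf` (the `loc` and `scale` values are illustrative assumptions, not part of the tutorial's data), `ppf` undoes `cdf`, and `ppf(0.5)` returns the median:

```python
import scipy.stats as st

# Median of the standard normal: half of the area lies below it
median = st.norm.ppf(0.5)

# ppf and cdf are inverses: cdf(ppf(q)) recovers q
q = 0.95
x = st.norm.ppf(q)
roundtrip = st.norm.cdf(x)

# For a non-standard normal, pass loc (mean) and scale (standard deviation).
# loc=10 and scale=2 are illustrative values only.
median_shifted = st.norm.ppf(0.5, loc=10, scale=2)

print(median, roundtrip, median_shifted)
```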

# Cumulative Distribution Function

In [52]:
st.norm.cdf(1.64)
#st.norm.cdf([1.,0,-1.])# For more than one point
#st.norm.cdf(df['TV'])

0.9494974165258963
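
The sketch below shows two common uses of `cdf` on the standard normal: the area above a point (via the survival function `sf`, which equals `1 - cdf`) and the area between two points. The cut-off values are chosen only for illustration.

```python
import scipy.stats as st

z = 1.64
p_below = st.norm.cdf(z)   # P(Z <= 1.64)
p_above = st.norm.sf(z)    # P(Z > 1.64); sf is the survival function, 1 - cdf

# Probability of falling between two points: difference of two cdf values
p_between = st.norm.cdf(1.96) - st.norm.cdf(-1.96)

print(p_below, p_above, p_between)
```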

# Critical value for z

```
z-critical = st.norm.ppf(1 - alpha) (use alpha/2 in place of alpha for two-sided)
t-critical = st.t.ppf(1 - alpha/numOfTails, df)
```

In [53]:
# alpha to critical
alpha = 0.05
n_sided = 2 # 2-sided test
z_crit = st.norm.ppf(1-alpha/n_sided)
print('Z critical value: ',z_crit) 

Z critical value:  1.959963984540054


In [54]:
# critical to alpha
alpha = st.norm.sf(z_crit) * n_sided
print('alpha: ',alpha) 

alpha:  0.05
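
The conversions between alpha and critical values can be wrapped in small helper functions. This is only a sketch: `z_critical`, `t_critical`, and `alpha_from_z` are hypothetical names introduced here, not SciPy functions.

```python
import scipy.stats as st

def z_critical(alpha, tails=2):
    """Upper critical z value for significance level alpha (hypothetical helper)."""
    return st.norm.ppf(1 - alpha / tails)

def t_critical(alpha, df, tails=2):
    """Upper critical t value for alpha and df degrees of freedom (hypothetical helper)."""
    return st.t.ppf(1 - alpha / tails, df)

def alpha_from_z(z_crit, tails=2):
    """Recover alpha from an upper critical z value using the survival function."""
    return st.norm.sf(z_crit) * tails

z = z_critical(0.05, tails=2)
print(z, alpha_from_z(z, tails=2))
print(t_critical(0.05, df=999, tails=2))
```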


Each z-score tells us how many standard deviations away an individual value is from the mean.

In [55]:
df.apply(st.zscore)
#st.zscore(df['TV'])

Unnamed: 0.1,Unnamed: 0,TV,radio,newspaper,sales
0,-1.723412,0.969852,0.981522,1.778945,1.552053
1,-1.706091,-1.197376,1.082808,0.669579,-0.696046
2,-1.688771,-1.516155,1.528463,1.783549,-0.907406
3,-1.671450,0.052050,1.217855,1.286405,0.860330
4,-1.654129,0.394182,-0.841614,1.281802,-0.215683
...,...,...,...,...,...
195,1.654129,-1.270941,-1.321031,-0.771217,-1.234053
196,1.671450,-0.617035,-1.240003,-1.033598,-0.830548
197,1.688771,0.349810,-0.942899,-1.111852,-0.234898
198,1.706091,1.594565,1.265121,1.640850,2.205347
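
To see what `st.zscore` computes, the sketch below applies the formula z = (x - mean) / std by hand to the first five `TV` values shown by `df.head()`. Because only five rows are used, the resulting z-scores differ from those in the full table.

```python
import numpy as np
import scipy.stats as st

# First five TV values from df.head()
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])

# st.zscore uses the population standard deviation (ddof=0) by default
manual = (tv - tv.mean()) / tv.std(ddof=0)
from_scipy = st.zscore(tv)

print(manual)
print(np.allclose(manual, from_scipy))
```

After standardization the values have mean 0 and standard deviation 1.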


# T-statistics

In [56]:
# df = 999, alpha = 0.05, two-tailed
# alpha/2 = 0.025 in each tail
print(st.t.ppf(1-0.025, 999))

# df = 999, alpha = 0.05: one-tailed critical value minus the two-tailed one
print(st.t.ppf(1-0.05, 999)-st.t.ppf(1-0.025, 999))

1.9623414611334487
-0.3159611157059137
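
With 999 degrees of freedom the t critical value is already close to the z critical value of about 1.96. A short sketch (the smaller df values are picked for illustration) shows how the t critical value shrinks toward z as the degrees of freedom grow:

```python
import scipy.stats as st

alpha = 0.05
z_crit = st.norm.ppf(1 - alpha / 2)

# Two-tailed critical t values for increasing degrees of freedom
t_crits = {df: st.t.ppf(1 - alpha / 2, df) for df in (5, 30, 999)}

for df, t_crit in t_crits.items():
    print(f"df={df:>4}: t critical = {t_crit:.4f} (z critical = {z_crit:.4f})")
```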


# One-Way Anova
Null hypothesis: all groups have the same population mean. Alternative hypothesis: at least one group mean differs from the others.

In [57]:
a= df['TV']
b= df['radio']
c= df['newspaper']
F,p=st.f_oneway(a,b,c)

In [58]:
F

358.8514595342597

In [59]:
p

4.552931539744962e-103
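
The F statistic is the ratio of between-group to within-group variance. The sketch below recomputes it by hand on three small hypothetical groups (made-up numbers, not the advertising data) and compares the result with `st.f_oneway`:

```python
import numpy as np
import scipy.stats as st

# Hypothetical groups, for illustration only
groups = [
    np.array([4.1, 5.0, 5.9, 4.7]),
    np.array([6.2, 7.1, 6.8, 7.5]),
    np.array([5.1, 5.6, 4.9, 6.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)          # number of groups
n = all_values.size      # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
F_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = st.f.sf(F_manual, k - 1, n - k)

F_scipy, p_scipy = st.f_oneway(*groups)
print(F_manual, F_scipy, p_manual, p_scipy)
```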

# Additional: One-Way ANOVA with statsmodels

In [60]:
# Reshape to long format: one row per (index, treatment, value)
df_melt = pd.melt(df.reset_index(), id_vars=['index'], value_vars=[ 'TV','radio', 'newspaper', 'sales'])
# replace column names
df_melt.columns = ['index', 'treatments', 'value']
# Ordinary Least Squares (OLS) model with treatments as a categorical factor
model = ols('value ~ C(treatments)', data=df_melt).fit()
# ANOVA table for the fitted linear model (four groups here, since 'sales' is included)
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

Unnamed: 0,sum_sq,df,F,PR(>F)
C(treatments),2349842.0,3.0,387.144309,5.150514e-155
Residual,1610489.0,796.0,,


# Two- Way ANOVA

“~” separates the left-hand side of the model from the right-hand side.
If treatments had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the C() operator.
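
As a sketch of the formula syntax, the example below fits the same model twice on hypothetical data: once with main effects and interaction written out (`a + b + a:b`) and once with the `*` shorthand. The data frame `toy` and its columns are made up for illustration; `level` is stored as integers and wrapped in `C()` so that it is treated as categorical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical data: a numeric response and two factors
toy = pd.DataFrame({
    'y':     [3.1, 2.9, 4.0, 4.2, 5.1, 4.8, 6.3, 6.0],
    'grp':   ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'level': [1, 1, 2, 2, 1, 1, 2, 2],   # integers, treated as categorical via C()
})

explicit = ols('y ~ C(grp) + C(level) + C(grp):C(level)', data=toy).fit()
shorthand = ols('y ~ C(grp)*C(level)', data=toy).fit()

# Both formulas produce the same terms and the same coefficients
print(list(shorthand.params.index))
print(np.allclose(explicit.params.values, shorthand.params.values))
```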

In [47]:
# To read data as dataframe
data = pd.read_csv('https://raw.githubusercontent.com/philsaurabh/Tutorials/main/Credit.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance,Defaultee
0,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333,0
1,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903,0
2,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580,0
3,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964,0
4,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331,0


A two-way ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups that have been split on two factors.

The purpose of a two-way ANOVA is to determine how two factors impact a response variable, and to determine whether or not there is an interaction between the two factors on the response variable.

In [83]:
model = ols('Defaultee ~ C(Gender) + C(Ethnicity) + C(Gender):C(Ethnicity)', data=data).fit()
sm.stats.anova_lm(model, typ=1)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
C(Gender),1.0,0.000677,0.000677,0.004523,0.946415
C(Ethnicity),2.0,0.007807,0.003903,0.026083,0.974256
C(Gender):C(Ethnicity),2.0,0.069887,0.034944,0.233504,0.791864
Residual,394.0,58.96163,0.149649,,


In [84]:
# a*b already expands to a + b + a:b, so the explicit main effects here are redundant but harmless
model = ols('Defaultee ~ C(Gender) + C(Ethnicity) + C(Gender)*C(Ethnicity)', data=data).fit()
sm.stats.anova_lm(model, typ=2)

Unnamed: 0,sum_sq,df,F,PR(>F)
C(Gender),0.000762,1.0,0.005089,0.943165
C(Ethnicity),0.007807,2.0,0.026083,0.974256
C(Gender):C(Ethnicity),0.069887,2.0,0.233504,0.791864
Residual,58.96163,394.0,,


In [80]:
model= ols('Defaultee ~ C(Gender)*C(Ethnicity)', data=data).fit()
sm.stats.anova_lm(model, typ=3)

Unnamed: 0,sum_sq,df,F,PR(>F)
Intercept,1.28,1.0,8.553359,0.003648
C(Gender),0.048089,1.0,0.321346,0.571123
C(Ethnicity),0.042944,2.0,0.143483,0.866381
C(Gender):C(Ethnicity),0.069887,2.0,0.233504,0.791864
Residual,58.96163,394.0,,


# T Test (Two-sided, independent samples)

In [61]:
# To read data as dataframe
data = pd.read_csv('https://raw.githubusercontent.com/philsaurabh/Tutorials/main/Credit.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance,Defaultee
0,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333,0
1,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903,0
2,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580,0
3,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964,0
4,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331,0


In [62]:
from scipy.stats import ttest_ind

male = data[data['Gender']=='Male']
female = data[data['Gender']=='Female']

ttest_ind(male['Defaultee'], female['Defaultee'])

Ttest_indResult(statistic=0.06754766913026408, pvalue=0.9461796342145767)
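
By default, `ttest_ind` runs a two-sided test with a pooled variance estimate. The sketch below reproduces that calculation by hand on two small hypothetical samples (made-up values, not the credit data) and also shows Welch's variant via `equal_var=False`:

```python
import numpy as np
import scipy.stats as st

# Hypothetical samples from two independent groups
x = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0])
y = np.array([4.2, 5.0, 4.8, 4.4, 5.1])

# Pooled-variance (Student) t statistic, the default of ttest_ind (equal_var=True)
nx, ny = x.size, y.size
pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / nx + 1 / ny))
p_manual = 2 * st.t.sf(abs(t_manual), nx + ny - 2)   # two-sided p-value

t_scipy, p_scipy = st.ttest_ind(x, y)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = st.ttest_ind(x, y, equal_var=False)

print(t_manual, t_scipy, p_manual, p_scipy)
```

For paired (dependent) measurements on the same subjects, `scipy.stats.ttest_rel` is the corresponding function.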

In [63]:
# Another example: critical t value for a confidence interval
# n = 9, degrees of freedom = n - 1 = 8
# for a 99% confidence interval, alpha = 1% = 0.01 and alpha/2 = 0.005
ci = 99
n = 9
t = st.t.ppf(1- ((100-ci)/2/100), n-1) # 99% CI: critical t with df = 8 and upper-tail area 0.005
print(t)

3.3553873313333957
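
The critical value is typically used to build a confidence interval of the form mean ± t × standard error. The sketch below does this for a hypothetical sample of nine measurements (made-up values) and compares the result with `st.t.interval`:

```python
import numpy as np
import scipy.stats as st

# Hypothetical sample of n = 9 measurements
sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.4, 11.9, 12.2])
n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)       # standard error of the mean

t_crit = st.t.ppf(1 - 0.01 / 2, n - 1)      # critical t for a 99% CI with df = 8
lower, upper = mean - t_crit * sem, mean + t_crit * sem

# scipy can compute the same interval directly
lower_sp, upper_sp = st.t.interval(0.99, n - 1, loc=mean, scale=sem)

print((lower, upper))
```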


# Chi-Square

In [64]:
from scipy.stats import chi2_contingency

We want to test whether there is a statistically significant association between gender (Male, Female) and marital status (Married: Yes, No).

In [65]:
contigency= pd.crosstab(data['Gender'], data['Married'])
contigency

Married,No,Yes
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,79,128
Male,76,117


`chi2_contingency` returns four values:

chi2: The test statistic (assigned to `c` in the code below)

p: The p-value of the test

dof: Degrees of freedom

expected: The expected frequencies, based on the marginal sums of the table


In [66]:
# Chi-square test of independence.
c, p, dof, expected = chi2_contingency(contigency)
c

0.02141530550698175

In [67]:
p,dof

(0.8836532332904176, 1)

In [68]:
expected

array([[ 80.2125, 126.7875],
       [ 74.7875, 118.2125]])
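
The expected frequencies and the test statistic can be reproduced by hand from the observed counts in the contingency table. The sketch below assumes SciPy's default behavior of applying Yates' continuity correction to 2x2 tables, which is why 0.5 is subtracted from each absolute difference.

```python
import numpy as np
import scipy.stats as st

# Observed counts from the contingency table (rows: Female, Male; columns: No, Yes)
observed = np.array([[79, 128],
                     [76, 117]])

row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_sums * col_sums / total

# Chi-square statistic with Yates' continuity correction (1 degree of freedom)
chi2_manual = ((np.abs(observed - expected_manual) - 0.5) ** 2 / expected_manual).sum()
p_manual = st.chi2.sf(chi2_manual, df=1)

chi2_sp, p_sp, dof_sp, expected_sp = st.chi2_contingency(observed)

print(expected_manual)
print(chi2_manual, p_manual)
```

Passing `correction=False` to `chi2_contingency` turns the continuity correction off. With a p-value of about 0.88, the test gives no evidence of an association between gender and marital status in this data.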