# <font color ="green"> z-Test on WineEnthusiast data set using pandas </font>

- Let's say you know the mean and standard deviation of a population.

- How can you tell if a sample is from this population or some other population?

- Although we may never know with 100% certainty, we can look for statistically significant differences between the sample statistics and the population parameters.

- This is done by first stating what is referred to as a null hypothesis, which in this scenario would be that there is no difference between the sample mean and the population mean.

- Then we look for statistical evidence to reject, or fail to reject, the null hypothesis.

## Load packages and data

In [40]:
import os
import numpy as np
import pandas as pd
import scipy.stats as stats # some useful stuff
os.chdir("/Users/paritoshgupta/Downloads/")
wine_data = pd.read_csv("winemag-data-130k-v2.csv")
print(wine_data.shape)
wine_data.sample(100).head(5)

(129971, 14)


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
107512,107512,Argentina,Earthy plum and red-berry aromas are suggestiv...,Yauquen,86,12.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Ruca Malen 2015 Yauquen Bonarda (Mendoza),Bonarda,Ruca Malen
31062,31062,Germany,"A whiff of evergreen lends a cool, alpine touc...",,90,22.0,Mosel,,,Anna Lee C. Iijima,,Fritz Haag 2012 Riesling (Mosel),Riesling,Fritz Haag
116505,116505,US,"This is an impressive Cab Franc from Red Newt,...",,87,20.0,New York,Finger Lakes,Finger Lakes,Susan Kostrzewa,@suskostrzewa,Red Newt Cellars 2005 Cabernet Franc (Finger L...,Cabernet Franc,Red Newt Cellars
26076,26076,Italy,This pretty Viognier shows bright tones of hon...,Astraio,87,15.0,Tuscany,Maremma Toscana,,,,Rocca di Montemassi 2011 Astraio Viognier (Mar...,Viognier,Rocca di Montemassi
124192,124192,Austria,The scent of lemon and orange flesh is contain...,Kellerberg Smaragd,94,50.0,Wachau,,,Anne Krebiehl MW,@AnneInVino,Tegernseerhof 2015 Kellerberg Smaragd Riesling...,Riesling,Tegernseerhof


- **Let's assume the WineEnthusiast point scores are interval-scaled, normally distributed data. Let's find the population mean and population standard deviation.**

### Question

- A sample of N=10 wine point scores yields a sample mean of x_bar = 90.2. Is this sample from the WineEnthusiast population?

To answer this question we will use what is referred to as a **one-sample z-test**. The null hypothesis and alternative hypothesis are:

H0: The sample is from the WineEnthusiast population, x_bar = μ.
HA: The sample is not from the WineEnthusiast population, x_bar ≠ μ.

## Steps

- First, we state the null hypothesis H0 and the alternative hypothesis HA, as given above.


- Then, we specify a significance (alpha) level. Usually, statistical significance is associated with an alpha level of α = 0.05 or smaller. 


- Next, we use a z table to look up the critical z value that corresponds to this α level.


- Here we are doing a two-tailed test because we don't care whether the sample mean is greater than or less than the population mean. We are only testing whether the two are equal or not (see the alternative hypothesis above).


- Next we calculate the z-statistic: the difference between the sample mean and the population mean, divided by the standard deviation of the sample mean, which is the standard error σ/sqrt(N).


- If the absolute value of this z-statistic is less than z-critical, we fail to reject the null hypothesis; otherwise we reject the null in favor of the alternative hypothesis.


- Let's do it!!

In [41]:
# population parameters
points = wine_data['points']
mu = points.mean()
sigma = points.std(ddof=0)
print("mu: ", mu, ", sigma:", sigma)


z_critical = 1.96 # alpha level of 0.05 and two-tailed test
x_bar = 90.2
N = 10
SE = sigma/np.sqrt(N)
z_stat = (x_bar - mu)/SE
print(z_stat)

mu:  88.44713820775404 , sigma: 3.0397185090150947
1.8235358539097541


**Result:** The z-statistic (1.82) is less than z-critical (1.96), so we fail to reject the null hypothesis at the α = 0.05 level. Statistically speaking, there is no significant evidence that this sample was drawn from a population other than the WineEnthusiast population.
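
As a cross-check, the critical value and a two-sided p-value can be computed with `scipy.stats.norm` instead of being read from a z table. This is only a minimal sketch: `mu` and `sigma` are copied from the printed output above rather than recomputed from `wine_data`, and the names `p_value` and `reject_null` are illustrative, not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Population parameters copied from the output above (assumed, not recomputed here)
mu = 88.44713820775404
sigma = 3.0397185090150947

alpha = 0.05
x_bar = 90.2
N = 10

# Two-tailed critical value: alpha is split equally between both tails
z_critical = stats.norm.ppf(1 - alpha / 2)

# z-statistic using the standard error of the mean
SE = sigma / np.sqrt(N)
z_stat = (x_bar - mu) / SE

# Two-sided p-value: probability of a |z| at least this large under H0
p_value = 2 * stats.norm.sf(abs(z_stat))

# Reject H0 only if |z| exceeds the critical value
reject_null = abs(z_stat) > z_critical
print("z_critical:", z_critical)
print("z_stat:", z_stat, ", p:", p_value)
print("reject H0:", reject_null)
```

Comparing `abs(z_stat)` with `z_critical` and comparing `p_value` with `alpha` are two views of the same decision, so they should always agree.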

**Note**: The statsmodels package provides functions for many statistical tests, so you do not have to write custom code or utilities.

# <font color ="green"> t-Test  </font>

## Question

Do you have one sample that you want to compare to some specified value? Do a one-sample t-test. For example, let's say it is well known that acorns have an average mass of 10 g, and you want to test whether the mass of acorns from a forest subjected to acid rain is significantly different.

###  One-sample location test on whether the mean of a population is equal to a value specified in null hypothesis

The masses of a sample of N=20 acorns from a forest subjected to acid rain from a coal power plant are m = 8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0, 7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, and 7.0 g. Is the average mass of this sample different from the average mass of all acorns, μ = 10.0 g?

- H0: x̄ - μ = 0, that is, there is no difference between the sample mean and the value of μ.
- Ha: x̄ - μ ≠ 0 (two-sided test)
- α = 0.05

- degrees of freedom: df = N-1


- t-critical for specified alpha level: t* = 2.093


- t-statistic: t = (x̄ - μ)/(s/sqrt(N)) where s is the sample standard deviation.
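
The tabulated critical value can also be obtained from the t-distribution in scipy. The sketch below assumes the same α = 0.05 and N = 20 as stated above and uses `stats.t.ppf`, the inverse CDF of the t-distribution; the result should agree with the t* = 2.093 quoted from the table.

```python
from scipy import stats

alpha = 0.05
N = 20
df = N - 1  # degrees of freedom for a one-sample t-test

# Two-tailed test: put alpha/2 in each tail, so look up the 1 - alpha/2 quantile
t_critical = stats.t.ppf(1 - alpha / 2, df)
print("df =", df, ", t* =", round(t_critical, 3))
```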

In [38]:
x = [8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0,
     7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, 7.0]
mu = 10
t_critical = 2.093
x_bar = np.array(x).mean()
s = np.array(x).std(ddof=1) # ddof=1 divides by N-1 (Bessel's correction) to give the sample standard deviation
N = len(x)
SE = s/np.sqrt(N)
t = (x_bar - mu)/SE
print("t-statistic: ",t)

# a one sample t-test that gives you the p-value too can be done with scipy as follows:
t, p = stats.ttest_1samp(x, mu)
print("t = ", t, ", p = ", p)

t-statistic:  -2.2491611580763977
t =  -2.2491611580763973 , p =  0.03655562279112415


**Result:** Note that t is greater in magnitude than t*, so there is a statistically significant difference at the α = 0.05 level between the sample mean and the stated population mean of 10 g.
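
To see where the p-value reported by `stats.ttest_1samp` comes from, it can be reproduced from the t-statistic and the t-distribution's survival function. This is a small sketch using the same acorn data; the names `t_stat` and `p_value` are chosen here for illustration.

```python
import numpy as np
from scipy import stats

x = np.array([8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0,
              7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, 7.0])
mu = 10
N = len(x)

# t-statistic: difference in means divided by the standard error
SE = x.std(ddof=1) / np.sqrt(N)
t_stat = (x.mean() - mu) / SE

# Two-sided p-value: area in both tails beyond |t| with N-1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), N - 1)
print("t =", t_stat, ", p =", p_value)
```

The factor of 2 reflects the two-sided alternative hypothesis; a one-sided test would use a single tail.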
    

**Note**: Statistical significance doesn't mean the effect is large. Let's report the 95% confidence interval too.

In [39]:
# margin of error
err = t_critical*SE

# lower bound (negative side)
x_low = x_bar - err
# upper bound (positive side)
x_high = x_bar + err
print("x_bar = {}, 95% CI [{}, {}]".format(x_bar.round(2), x_low.round(2), x_high.round(2)))

# you can also get CIs by using the built-in t-distribution function like this:
print("CI using scipy: ",stats.t.interval(0.95, N-1, loc=x_bar, scale=SE))

x_bar = 9.24, 95% CI [8.53, 9.95]
CI using scipy:  (8.532759313560822, 9.947240686439175)


In [None]:
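# Rearranging the test statistic gives the sample mean at the critical boundary:
#   (x - u)/SE = z_alpha/2   =>   x = z_alpha/2 * SE + u
# For a two-tailed test, alpha is split equally between the two tails,
# so on the negative side the boundary is x = u - z_alpha/2 * SE.
alpha = 0.05
remaining = 1 - alpha  # 0.95, the probability between the two boundaries
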

# <font color ="green"> Chi Square Test </font>

In [42]:
shopping_data = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female'], 
                   'Like Shopping? ': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No']})
shopping_data

Unnamed: 0,Gender,Like Shopping?
0,Male,No
1,Female,Yes
2,Male,Yes
3,Female,Yes
4,Female,Yes
5,Male,Yes
6,Male,No
7,Female,No
8,Female,No


In [43]:
#Contingency Table
contingency_table=pd.crosstab(shopping_data["Gender"],shopping_data["Like Shopping? "])
print('contingency_table :-\n',contingency_table)

contingency_table :-
 Like Shopping?   No  Yes
Gender                  
Female            2    3
Male              2    2


In [23]:
# Observed Values
Observed_Values = contingency_table.values 
print("Observed Values :-\n",Observed_Values)

Observed Values :-
 [[2 3]
 [2 2]]


In [24]:
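# Expected Values
# chi2_contingency returns (statistic, p-value, degrees of freedom, expected frequencies);
# index 3 holds the expected frequencies under independence.
import scipy.stats
b=scipy.stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :-\n",Expected_Values)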

Expected Values :-
 [[2.22222222 2.77777778]
 [1.77777778 2.22222222]]
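
The expected values can also be derived by hand: under independence, each expected cell count is (row total × column total) / grand total. The sketch below hard-codes the observed counts from the contingency table above and uses illustrative variable names.

```python
import numpy as np

# Observed counts from the contingency table (rows: Female, Male; columns: No, Yes)
observed = np.array([[2, 3],
                     [2, 2]])

row_totals = observed.sum(axis=1)   # totals per gender
col_totals = observed.sum(axis=0)   # totals per answer
grand_total = observed.sum()

# Expected count for each cell under independence
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```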


In [27]:
# Degree of Freedom: (rows - 1) * (columns - 1)
no_of_rows, no_of_columns = contingency_table.shape
df=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:",df)


Degree of Freedom: 1


In [28]:
#Significance Level 5%
alpha=0.05

In [32]:
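# chi-square statistic - χ2 = sum over all cells of (observed - expected)^2 / expected
from scipy.stats import chi2
# zip pairs up the rows, so sum() adds the rows together and yields one total per column
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
# add the two column totals to get the overall statistic
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)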

chi-square statistic:- 0.09000000000000008


In [45]:
sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])

array([0.05, 0.04])

In [33]:
#critical_value
critical_value=chi2.ppf(q=1-alpha,df=df)
print('critical_value:',critical_value)

critical_value: 3.841458820694124


In [34]:
#p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=df)
print('p-value:',p_value)

p-value: 0.7641771556220945


In [35]:
print('Significance level: ',alpha)
print('Degree of Freedom: ',df)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)

Significance level:  0.05
Degree of Freedom:  1
chi-square statistic: 0.09000000000000008
critical_value: 3.841458820694124
p-value: 0.7641771556220945


In [36]:
# Compare chi_square_statistic with critical_value, and p_value with alpha.
# The p-value is the probability of getting a chi-square statistic of at least 0.09 (chi_square_statistic) under H0.
if chi_square_statistic>=critical_value:
    print("Reject H0: there is evidence of a relationship between the 2 categorical variables")
else:
    print("Retain H0: no evidence of a relationship between the 2 categorical variables")

if p_value<=alpha:
    print("Reject H0: there is evidence of a relationship between the 2 categorical variables")
else:
    print("Retain H0: no evidence of a relationship between the 2 categorical variables")

Retain H0: no evidence of a relationship between the 2 categorical variables
Retain H0: no evidence of a relationship between the 2 categorical variables
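
The manual steps above can be cross-checked with a single call to `scipy.stats.chi2_contingency`. This is a sketch with the observed counts hard-coded from the contingency table; note the assumption that `correction=False` is passed, because by default scipy applies Yates' continuity correction to 2x2 tables, which would not match the uncorrected statistic computed by hand.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table (rows: Female, Male; columns: No, Yes)
observed = np.array([[2, 3],
                     [2, 2]])

# correction=False disables Yates' continuity correction so the result
# matches the plain sum of (observed - expected)^2 / expected
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("chi-square statistic:", chi2_stat)
print("p-value:", p_value)
print("Degree of Freedom:", dof)
```

With only nine observations every expected count is below 5, so the chi-square approximation is rough here; the example is best read as a walkthrough of the mechanics.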
