## Chi-square Test using:

In [16]:
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
from scipy.stats import chi2, chi2_contingency, ttest_ind

In [5]:
%%html
<style>
table {float:left}
</style>

A group of scientists gives a series of seminars on the advantages of getting vaccinated. After the seminars, the scientists want to know whether people have changed their opinion about getting vaccinated. The following tables give the proportions of responses from a survey of 800 randomly chosen people.

**Please conduct related hypothesis testing using a 0.01 significance level. Before doing hypothesis testing, check the related assumptions. Comment on the results.**

**Table 1: Before the Seminars**

| Positive | Negative | Neutral | No Opinion |
|----------|----------|---------|------------|
| 0.38     | 0.28     | 0.13    | 0.21       |   

**Table 2: After the Seminars** 

| Positive | Negative | Neutral | No Opinion |
|----------|----------|---------|------------|
| 0.49     | 0.26     | 0.09    | 0.16       |


In [8]:
# Create the crosstab of observed counts (each proportion multiplied by 800 respondents).
ar = np.array([[304, 224, 104, 168], [392, 208, 72, 128]])
df = pd.DataFrame(ar, columns=["Positive", "Negative", "Neutral", "No Opinion"])
df.index = ["Before", "After"]
df

|        | Positive | Negative | Neutral | No Opinion |
|--------|----------|----------|---------|------------|
| Before | 304      | 224      | 104     | 168        |
| After  | 392      | 208      | 72      | 128        |
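
The counts in the crosstab follow from multiplying each proportion in Tables 1 and 2 by the sample size. The sketch below reproduces them, assuming 800 respondents in each survey (as the crosstab implies); the names `n_respondents`, `before_props`, and `after_props` are illustrative and not part of the original notebook.

```python
import numpy as np

n_respondents = 800  # assumed sample size of each survey
before_props = np.array([0.38, 0.28, 0.13, 0.21])
after_props = np.array([0.49, 0.26, 0.09, 0.16])

# Round to the nearest integer to absorb floating-point error before casting.
before_counts = np.rint(before_props * n_respondents).astype(int)
after_counts = np.rint(after_props * n_respondents).astype(int)

print(before_counts)
print(after_counts)
```

Working with counts rather than proportions matters because the chi-square statistic depends on the sample size.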


The "Before" and "After" surveys sample different groups of people, so the two samples are independent and we can use a chi-square test. The chi-square test of independence checks whether there is a significant relationship between two categorical variables, here the survey time (before or after the seminars) and the opinion category.
The hypotheses being tested are:

The H0 (Null Hypothesis): There is no relationship between variable one and variable two (i.e. opinions about the vaccine did not change after the seminars).

The H1 (Alternative Hypothesis): There is a relationship between variable one and variable two (i.e. opinions about the vaccine differ after the seminars).

If the p-value is below the significance level, you can reject the null hypothesis and claim that the findings support the alternative hypothesis.

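First, check the assumptions:

Pearson’s chi-squared test requires independent observations, mutually exclusive categories, and, as a common rule of thumb, an expected count of at least 5 in every cell. The two surveys are independent random samples, each respondent falls into exactly one opinion category, and the smallest expected count computed below is 88, so the assumptions hold. We use the test to examine whether the two variables, survey time (“Before” or “After” the seminars) and opinion, are statistically linked (= alternative hypothesis) or statistically independent (= null hypothesis). In this example, we set our α-level at 0.01.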
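
One assumption that can be checked numerically is the usual rule of thumb that no cell has an expected count below 5. The following is a minimal sketch of that check, not part of the original notebook; the threshold of 5 is the conventional guideline, and `observed`, `min_expected`, and `assumption_ok` are illustrative names.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[304, 224, 104, 168],
                     [392, 208, 72, 128]])

# chi2_contingency returns the expected counts under independence as its fourth value.
_, _, _, expected = chi2_contingency(observed, correction=False)

min_expected = expected.min()
assumption_ok = bool(min_expected >= 5)  # rule-of-thumb threshold
print(min_expected, assumption_ok)
```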

In [9]:
df2=df.copy() # create contingency table with the marginal totals and the grand total. 
df2.loc['Column_Total']= df2.sum(numeric_only=True, axis=0)
df2.loc[:,'Row_Total'] = df2.sum(numeric_only=True, axis=1)
df2

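|              | Positive | Negative | Neutral | No Opinion | Row_Total |
|--------------|----------|----------|---------|------------|-----------|
| Before       | 304      | 224      | 104     | 168        | 800       |
| After        | 392      | 208      | 72      | 128        | 800       |
| Column_Total | 696      | 432      | 176     | 296        | 1600      |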


In [10]:
n = df2.at["Column_Total", "Row_Total"]  # grand total

exp = df2.copy()  # dataframe that will hold the expected counts
for x in exp.index[:-1]:
    for y in exp.columns[:-1]:
        # expected count = (row total * column total) / grand total
        v = (df2.at[x, "Row_Total"] * df2.at["Column_Total", y]) / n
        exp.at[x, y] = float(v)

exp = exp.iloc[[0, 1], [0, 1, 2, 3]]  # keep only the cells, dropping the marginal totals
exp

|        | Positive | Negative | Neutral | No Opinion |
|--------|----------|----------|---------|------------|
| Before | 348      | 216      | 88      | 148        |
| After  | 348      | 216      | 88      | 148        |
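
The same expected counts can be obtained without an explicit loop, since each one is the product of a row total and a column total divided by the grand total. The sketch below uses `np.outer` for this; it is an alternative formulation rather than the notebook's own method, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

observed = pd.DataFrame(
    [[304, 224, 104, 168], [392, 208, 72, 128]],
    index=["Before", "After"],
    columns=["Positive", "Negative", "Neutral", "No Opinion"],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
grand_total = observed.to_numpy().sum()

# The outer product gives row_total * col_total for every cell at once.
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / grand_total,
    index=observed.index,
    columns=observed.columns,
)
print(expected)
```

Because both surveys have the same row total (800), the two rows of expected counts are identical.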


In [11]:
# Let's first calculate our test statistic without Yates' correction, using χ² = Σ (O − E)² / E summed over all cells:
tstat = np.sum(((df-exp)**2/exp).values)
tstat

22.942616597789012

In [12]:
#After that we want to obtain our degrees of freedom:
dof = (len(df.columns)-1)*(len(df.index)-1) # determine degrees of freedom 
dof

3
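
With the test statistic and degrees of freedom in hand, an equivalent way to reach a decision is to compare the statistic with the critical value of the chi-square distribution at the chosen α. The sketch below is an illustrative addition that hard-codes the values from the cells above.

```python
from scipy.stats import chi2

alpha = 0.01
dof = 3
tstat = 22.942616597789012  # test statistic computed above

# Critical value: the point that leaves probability alpha in the upper tail.
critical_value = chi2.ppf(1 - alpha, dof)
reject_null = bool(tstat > critical_value)
print(critical_value, reject_null)
```

The critical value is roughly 11.34, so a statistic of about 22.94 falls well inside the rejection region.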

In [14]:
#Now we can calculate the corresponding p-value by using the cumulative distribution function given the degrees of freedom:
pval=1-chi2.cdf(tstat, dof) # subtract the cumulative distribution function from 1
pval

4.1510596975213865e-05
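
For very small p-values, `1 - chi2.cdf(...)` can lose precision because it subtracts two numbers that are both close to 1. SciPy's survival function `chi2.sf` computes the upper-tail probability directly. The comparison below is a small illustrative sketch using the values from the cells above.

```python
from scipy.stats import chi2

tstat = 22.942616597789012
dof = 3

pval_cdf = 1 - chi2.cdf(tstat, dof)  # approach used in the cell above
pval_sf = chi2.sf(tstat, dof)        # upper-tail probability computed directly
print(pval_cdf, pval_sf)
```

Here the two values differ only in the last few digits, but the gap grows as the p-value shrinks.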

In [17]:
# Finally, we want to verify our result by comparing it to SciPy's function scipy.stats.chi2_contingency.
# For now, we do not want to apply Yates' correction, therefore we choose "correction=False":

# The third return value is the degrees of freedom.
tstat_scipy, pval_scipy, dof_scipy, exp_scipy = chi2_contingency(df, correction=False)  # "correction=False" means no Yates' correction is used
print("Chi-squared test statistic without Yates correction (Scipy): " + str(tstat_scipy))
print("P-value without Yates correction (Scipy): " + str(pval_scipy))

Chi-squared test statistic without Yates correction (Scipy): 22.942616597789012
P-value without Yates correction (Scipy): 4.151059697524019e-05


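The manual calculation and SciPy agree: the test statistics are identical and the p-values match to within floating-point error. The p-value (about 4.15e-05) is less than our α-level of 0.01, so we reject the null hypothesis. At this significance level, we conclude that the distribution of opinions about getting vaccinated differs between the surveys taken before and after the seminars.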
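
To comment on where the difference comes from, one option is to look at how much each opinion category contributes to the χ² statistic. This goes slightly beyond the original analysis and is offered only as a sketch; the names `contributions` and `by_category` are illustrative.

```python
import numpy as np
import pandas as pd

observed = pd.DataFrame(
    [[304, 224, 104, 168], [392, 208, 72, 128]],
    index=["Before", "After"],
    columns=["Positive", "Negative", "Neutral", "No Opinion"],
)

# Expected counts under independence: row total * column total / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.to_numpy().sum()

# Per-cell contribution (O - E)^2 / E, then summed over rows for each category.
contributions = (observed - expected) ** 2 / expected
by_category = contributions.sum(axis=0)
print(by_category)
print(by_category.sum())
```

The category totals add up to the test statistic; the largest share comes from the "Positive" column, and the smallest from "Negative".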