In [1]:
## DS & visuals
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("pastel")
sns.set_theme(style="whitegrid")

## Stats 
import scipy
from scipy import stats
## For encoding
#from sklearn.preprocessing import LabelEncoder

## Linear Regression
from statsmodels.formula.api import ols
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse

## scikitlearn
from sklearn.model_selection import train_test_split

##
import warnings
warnings.filterwarnings("ignore")

## Shape statistics and math helpers (duplicate imports removed)
from scipy.stats import skew, kurtosis
import math

# **Chi-squared tests: Goodness of fit versus independence**
---
Hypothesis tests are used to determine whether differences among groups are statistically significant.

# 1. **Chi-squared tests** are used to determine whether one or more observed categorical variables follow expected distribution(s). 

For example, you may expect that 50% more moviegoers attend movies on weekends than on weekdays. After observing moviegoer attendance for a month, you can perform a chi-squared test to see whether the observed data are consistent with your initial hypothesis.

This reading will cover the two main chi-squared tests—Goodness of Fit and Test for Independence—which can be used to test your expected hypothesis against what actually occurred. Data professionals perform these hypothesis tests to offer organizations actionable insights that drive decision making.

### The Chi-squared Goodness of Fit Test 

Chi-squared (χ²) Goodness of Fit Test is a hypothesis test that determines whether an observed categorical variable follows an expected distribution. 

- The null hypothesis **(H0)** of the test is that the categorical variable follows the expected distribution. 
- The alternative hypothesis **(Ha)** is that the categorical variable does not follow the expected distribution. 

The scenario below walks through how to define the null and alternative hypotheses,
**set up a Goodness of Fit test, evaluate the test results, and draw a conclusion.**

#### Chi-squared Goodness of Fit scenario

Imagine that you work as a data professional for an online clothing company.

Your boss tells you that they expect the number of website visitors to be the same for each day of the week. You decide to test your boss’s hypothesis and pull data every day for the next week and record the number of website visitors in the table below:

In [2]:
## Observed website visitors for each day of the week.
boss_expectations = {}
boss_expectations['dayOfTheWeek']  = ['mon', 'tues', 'wed', 'thu', 'fri', 'sat', 'sun']
boss_expectations['observedValue'] = [650, 570, 420, 480, 510, 380, 490]
boss = pd.DataFrame.from_dict(boss_expectations)
boss

| | dayOfTheWeek | observedValue |
|---|---|---|
| 0 | mon | 650 |
| 1 | tues | 570 |
| 2 | wed | 420 |
| 3 | thu | 480 |
| 4 | fri | 510 |
| 5 | sat | 380 |
| 6 | sun | 490 |


In [3]:
## Total number of visitors observed during the week.
total = boss['observedValue'].sum()
print(f'Observed total for the week: {total}')

Observed total for the week: 3500


#### Main steps:

1. Identify the Null and Alternative Hypotheses 

2. Calculate the chi-squared test statistic ($\chi^2$)

3. Calculate the p-value 

4. Make a conclusion 

# *Step 1:*

Identify the null and alternative hypotheses

The first step in performing a chi-squared goodness of fit test is to determine your null and alternative hypotheses. Since you are testing whether the number of website visitors follows your boss’s expectations, your null and alternative hypotheses are:

- $H_0$: The week you observed follows your boss’s expectations that the number of website visitors is equal on any given day

- $H_a$: The week you observed does not follow your boss’s expectations; therefore, the number of website visitors is not equal across the days of the week

# *Step 2:*

Calculate the chi-squared test statistic ($\chi^2$)

Next, calculate a test statistic to determine if you should reject or fail to reject your null hypothesis. This test statistic is known as the chi-squared statistic and is calculated based on the following formula: 

$\chi^2 = \sum \frac{(O - E)^2}{E}$

where $O$ is the observed count and $E$ is the expected count for each category, and the sum runs over all categories.

Since you observed a total of 3,500 website visitors, your boss’s expectation is that 500 visitors would visit each day (3,500 / 7). In the formula above, 500 serves as the “expected” value for every day.

The cells below add a column to your original table with each weekday’s contribution, $(O - E)^2 / E$, to the test statistic. The chi-squared statistic is the sum of that column:

$\chi^2 = 45 + 9.8 + 12.8 + 0.8 + 0.2 + 28.8 + 0.2 = 97.6$
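As a rough cross-check of the arithmetic above, the expected counts and the statistic can be derived from the hypothesized proportions instead of being typed in by hand. The sketch below assumes a uniform distribution over the seven days; the names `proportions` and `contributions` are illustrative and not part of the original notebook.

```python
import numpy as np

# Observed visitors, Monday through Sunday (from the table above).
observed = np.array([650, 570, 420, 480, 510, 380, 490])

# Assumption: H0 says every day gets the same share of visitors.
proportions = np.full(observed.size, 1 / observed.size)

# Expected count per day = total visitors * hypothesized share.
expected = observed.sum() * proportions

# Per-day contribution (O - E)^2 / E, then the chi-squared statistic.
contributions = (observed - expected) ** 2 / expected
chi_squared = contributions.sum()

print(expected)
print(contributions.round(1))
print(round(chi_squared, 1))
```

Swapping in a different `proportions` array (one that still sums to 1) would test a non-uniform expectation with the same code.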

In [4]:
## Expectations.
boss['expectedValue'] = [500, 500, 500, 500, 500, 500, 500]

In [5]:
## Formula.
boss['ChiSquaredTestStatistic'] = (boss['observedValue'] - boss['expectedValue'])**2 / boss['expectedValue']

In [6]:
## Chi Squared
chi_chi = boss['ChiSquaredTestStatistic'].sum()
print(f'chi-squared statistic: {chi_chi}')

chi-squared statistic: 97.6


In [7]:
boss

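| | dayOfTheWeek | observedValue | expectedValue | ChiSquaredTestStatistic |
|---|---|---|---|---|
| 0 | mon | 650 | 500 | 45.0 |
| 1 | tues | 570 | 500 | 9.8 |
| 2 | wed | 420 | 500 | 12.8 |
| 3 | thu | 480 | 500 | 0.8 |
| 4 | fri | 510 | 500 | 0.2 |
| 5 | sat | 380 | 500 | 28.8 |
| 6 | sun | 490 | 500 | 0.2 |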


### Notes:

[chi-squared-wiki](https://en.wikipedia.org/wiki/Chi-squared_test)

[chi-squared-book](https://web.archive.org/web/20230322163848if_/http://vassarstats.net/textbook/)

[scipy function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html)

# *Step 3:*

Find the p-value 

First choose a significance level (alpha) for your hypothesis test; this reading uses 5%. You can then use the [chisquare function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) from SciPy’s `stats` module to obtain the test statistic and p-value. The following code passes your observed and expected values to it:


In [8]:
expectations = boss['expectedValue']
observations = boss['observedValue']

result = stats.chisquare(f_obs = observations, f_exp = expectations)
result

Power_divergenceResult(statistic=97.6, pvalue=7.943886923343835e-19)

The output confirms your calculation of the chi-square test statistic in Step 2 and
also gives you the associated p-value. Because the p-value is less than 
the significance level of 5%, you can **REJECT the null hypothesis.**
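To see where that p-value comes from, it can be reproduced directly from the chi-squared distribution. This is a minimal sketch, assuming the usual degrees of freedom for a goodness of fit test with no estimated parameters (number of categories minus one, so 6 here); `dof`, `alpha`, and `critical_value` are illustrative names.

```python
from scipy import stats

chi_squared = 97.6   # test statistic from Step 2
dof = 7 - 1          # assumption: categories - 1, no parameters estimated
alpha = 0.05         # significance level used in this reading

# Upper-tail probability of the chi-squared distribution at the statistic.
p_value = stats.chi2.sf(chi_squared, dof)

# Smallest statistic that would lead to rejecting H0 at this alpha.
critical_value = stats.chi2.ppf(1 - alpha, dof)

print(p_value)
print(critical_value)
print(chi_squared > critical_value)
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision, so they always agree.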

# *Step 4:*

Make a conclusion

Since the p-value is less than 0.05, there is sufficient evidence to suggest that the number of visitors is not equal across the days of the week. Having reached this conclusion, you are asked to expand your analysis to look at the relationship between the device a website visitor used and their membership status. To do so, you need the Chi-squared Test for Independence.

# **The Chi-Squared Test for Independence**
---

# 2. The Chi-squared ($\chi^2$) Test for Independence is a hypothesis test that **determines whether or not two categorical variables are associated with each other.**

- The null hypothesis ($H_0$) of the test is that the two categorical variables are independent.

- The alternative hypothesis ($H_a$) is that the two categorical variables are not independent.

You will use the chi-squared test of independence to test whether the type of device a visitor uses to visit the website (Mac or PC) is associated with whether the visitor has a membership account or browses as a guest (Member or Guest).

# *Step 1:* 

Identify the null and alternative hypotheses

Just like the Goodness of Fit scenario, the first step is to determine your null and alternative hypotheses. You are testing whether the device used to visit your clothing store (Mac or PC) is independent of the visitor’s membership status (Member or Guest). From that information you can determine that your null and alternative hypotheses are as follows:

$H_0$: The type of device a website visitor uses to visit the website is independent of the visitor’s membership status.

$H_a$: The type of device a website visitor uses to visit the website is not independent of the visitor’s membership status.

# *Step 2:* 

Calculate the chi-squared test statistic ($\chi^2$)

The table below now breaks down our website visitors based on the device they used and their membership status. 

In [9]:
visitors = {}
visitors['observedValues'] = ['mac', 'pc', 'total']
visitors['member'] = [850, 1300, 2150]
visitors['guest'] = [450, 900, 1350]
visitors['total'] = [1300, 2200, 3500]

visitors = pd.DataFrame.from_dict(visitors)
visitors

| | observedValues | member | guest | total |
|---|---|---|---|---|
| 0 | mac | 850 | 450 | 1300 |
| 1 | pc | 1300 | 900 | 2200 |
| 2 | total | 2150 | 1350 | 3500 |


To get the expected value of each cell under the independence assumption, use the following formula (shown here for the Mac/Member cell):

$$\text{expectedValue} = \frac{\text{columnTotal} \times \text{rowTotal}}{\text{overallTotal}}$$

$$\text{expectedValue} = \frac{2150 \times 1300}{3500} \approx 798.57 \approx 799$$
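The same formula can be applied to every cell at once. The sketch below is one way to do it with NumPy, taking the outer product of the row and column totals; the array names are illustrative.

```python
import numpy as np

# Observed counts: rows are devices (mac, pc), columns are (member, guest).
observed = np.array([[850, 450],
                     [1300, 900]])

row_totals = observed.sum(axis=1)   # visitors per device
col_totals = observed.sum(axis=0)   # visitors per membership status
grand_total = observed.sum()

# expected[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total

print(expected.round(2))
```

Rounding these values to whole numbers gives the 799, 501, 1351, and 849 used in the next cell.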

In [10]:
expVal = {}
expVal['expected Value'] = ['mac', 'pc']
expVal['member'] = [799, 1351]
expVal['guest'] = [501, 849]
expValdf = pd.DataFrame.from_dict(expVal)
expValdf

| | expected Value | member | guest |
|---|---|---|---|
| 0 | mac | 799 | 501 |
| 1 | pc | 1351 | 849 |


# *Step 3:*

Find the p-value

You can use SciPy’s [chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) function to obtain the **chi-square test statistic** and **p-value**. The function only needs the observed values; it calculates the expected values for you. Here is the Python code:

In [11]:
## Observed counts: rows are devices (mac, pc), columns are (member, guest).
Observations = np.array([[850, 450],[1300, 900]])
Result = stats.contingency.chi2_contingency(Observations)
print(Result)

(13.396423236539514, 0.0002521045757089368, 1, array([[ 798.57142857,  501.42857143],
       [1351.42857143,  848.57142857]]))


In [14]:
chiSquareStatistic, pVal, DOF, expectedVal = stats.contingency.chi2_contingency(Observations)

print(chiSquareStatistic)
print(pVal)
print(DOF)
print(expectedVal)


13.396423236539514
0.0002521045757089368
1
[[ 798.57142857  501.42857143]
 [1351.42857143  848.57142857]]
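The p-value above can be reproduced from the statistic and the degrees of freedom. This is a small sketch, assuming the standard rule that a test of independence has (rows - 1) × (columns - 1) degrees of freedom; the variable names are illustrative.

```python
from scipy import stats

chi_square_statistic = 13.396423236539514  # statistic reported above
n_rows, n_cols = 2, 2                       # devices x membership statuses

# Degrees of freedom for a test of independence.
dof = (n_rows - 1) * (n_cols - 1)

# Upper-tail probability of the chi-squared distribution.
p_value = stats.chi2.sf(chi_square_statistic, dof)

print(dof)
print(p_value)
```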


In [16]:
pd.DataFrame(expectedVal, index=['mac', 'pc'], columns=['member', 'guest'])

| | member | guest |
|---|---|---|
| mac | 798.571429 | 501.428571 |
| pc | 1351.428571 | 848.571429 |


**The output above is in the following order:**
the chi-square statistic, p-value, degrees of freedom, and expected values in array format. Because the p-value (about 0.00025) is below the 5% significance level, you can REJECT the null hypothesis in favor of the alternative.
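One detail worth knowing: when the table has one degree of freedom (as a 2×2 table does), `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the reported statistic is slightly smaller than what the plain $\sum (O - E)^2 / E$ formula gives. The sketch below, which assumes the same observed counts, compares the two; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

observed = np.array([[850, 450],
                     [1300, 900]])

# Default behaviour: Yates' continuity correction is applied when dof == 1.
stat_corrected, p_corrected, dof, expected = stats.chi2_contingency(observed)

# Turn the correction off to match the plain formula.
stat_uncorrected, p_uncorrected, _, _ = stats.chi2_contingency(
    observed, correction=False
)

# Manual computation of sum((O - E)^2 / E) from the expected counts.
manual = ((observed - expected) ** 2 / expected).sum()

print(stat_corrected, p_corrected)
print(stat_uncorrected, p_uncorrected)
print(manual)
```

Either way the p-value is far below 0.05 for this table, so the conclusion does not depend on the correction.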

# *Step 4:*

Make a conclusion 

Based on the above p-value, you conclude that the type of device a website visitor uses is not independent of their membership status. You may recommend that your boss look into why visitors sign up for paid memberships more often on a particular device. Is the sign-up button showing up differently on a particular device? Are there device-specific bugs that need to be fixed? These are a couple of the many questions you could look into next.

### Key takeaways

The Chi-squared Goodness of Fit test is used to test if an observed categorical variable follows an expected distribution.

The Chi-squared Test for Independence is used to test if two categorical variables are independent of each other or not. 

Both Chi-squared tests follow the same hypothesis testing steps to determine whether you should reject or fail to reject the null hypothesis to drive decision making, as you have explored elsewhere in this program.  