# Topic 2 - Chi-Squared $\chi^2$

In [1]:
import pandas as pd
import random
from scipy import stats as ss

The Ratcliffe book you have has a relatively simple explanation of chi-squared tests in chapter 15, while the *Essential Math for Data Science* book has some brief information on page 216.

From [Laerd](https://statistics.laerd.com/spss-tutorials/chi-square-test-for-association-using-spss-statistics.php): "The chi-square test for independence, also called [Pearson's](https://en.wikipedia.org/wiki/Karl_Pearson) chi-square test or the chi-square test of association, is used to discover if there is a relationship between two *categorical* variables." Chi-square is appropriate when both variables are categorical, whether nominal (unordered categories, for instance ethnicity or job) or ordinal (ordered categories, for instance education level).
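
For reference, the test statistic is conventionally written as below, where $O_{ij}$ is the observed count in row $i$ and column $j$ of the contingency table, $E_{ij}$ is the count expected if the two variables were independent, and $N$ is the total number of observations. The symbols are standard textbook notation, not something taken from the Laerd page.

$$
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}
$$

The statistic is compared against a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, where $r$ and $c$ are the numbers of rows and columns in the table.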

The code below replicates the demonstration found at the Laerd link.

In [2]:
male_books = [['Male','Books']] * 16
male_online = [['Male','Online']] * 24
female_books = [['Female','Books']] * 13
female_online = [['Female','Online']] * 27

# In Python it is quite simple to add lists together
raw_data = male_books + male_online + female_books + female_online

# Shuffling data purely to prove that Python will preserve the gender-medium combinations in each row
random.shuffle(raw_data)

# Unpack each row's items into two whole columns, a more Pandas-friendly format
gender, medium = list(zip(*raw_data))
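
The `zip(*...)` idiom in the cell above can be easier to follow on a tiny input. The sketch below uses a hypothetical three-row sample (the names `pairs`, `sample_gender` and `sample_medium` are made up for illustration) to show how a list of rows is turned into one tuple per column.

```python
# Hypothetical three-row sample, used only to illustrate the unpacking step.
pairs = [['Male', 'Books'], ['Female', 'Online'], ['Male', 'Online']]

# zip(*pairs) is equivalent to zip(pairs[0], pairs[1], pairs[2]):
# it groups all the first items together and all the second items together.
sample_gender, sample_medium = zip(*pairs)

print(sample_gender)
print(sample_medium)
```

Because each row travels through `zip` as a unit, shuffling the rows beforehand changes the order of the values in each column but never separates a gender from its medium.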

In [3]:
# create a Pandas dataframe using the two lists of data
df = pd.DataFrame({'gender': gender, 'medium': medium})

df

| | gender | medium |
|---|---|---|
| 0 | Male | Books |
| 1 | Female | Online |
| 2 | Female | Online |
| 3 | Male | Online |
| 4 | Female | Online |
| ... | ... | ... |
| 75 | Male | Online |
| 76 | Male | Online |
| 77 | Male | Books |
| 78 | Male | Online |


## Contingency Table

In [4]:
cross = ss.contingency.crosstab(df['gender'], df['medium'])

# Show.
cross.count

array([[13, 27],
       [16, 24]])
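
As a cross-check, the same table can be built with `pd.crosstab`, which keeps the row and column labels attached to the counts. This is a sketch of an alternative, not the approach the notebook uses; the frame is rebuilt from the counts above under the hypothetical name `df_check` so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the same 80 rows from the counts used earlier in the notebook.
rows = ([("Male", "Books")] * 16 + [("Male", "Online")] * 24
        + [("Female", "Books")] * 13 + [("Female", "Online")] * 27)
df_check = pd.DataFrame(rows, columns=["gender", "medium"])

# pd.crosstab returns a labelled DataFrame; margins=True appends an
# "All" row and column holding the totals.
table = pd.crosstab(df_check["gender"], df_check["medium"], margins=True)

print(table)
```

The labelled form makes it harder to misread which row is which, since `ss.contingency.crosstab` returns the bare counts and the labels separately.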

In [7]:
# The first variable values, and the second.
first, second = cross.elements

# Show.
first, second

(array(['Female', 'Male'], dtype=object),
 array(['Books', 'Online'], dtype=object))

In [9]:
# Find all rows in df with gender equal to the first value in first.
df[df['gender'] == first[0]]

| | gender | medium |
|---|---|---|
| 1 | Female | Online |
| 2 | Female | Online |
| 4 | Female | Online |
| 7 | Female | Books |
| 8 | Female | Online |
| 9 | Female | Online |
| 11 | Female | Books |
| 15 | Female | Online |
| 17 | Female | Online |
| 18 | Female | Online |


In [5]:
# Run Pearson's chi-squared test of independence on the contingency table,
# without Yates' continuity correction.
result = ss.chi2_contingency(cross.count, correction=False)

# Show.
result

Chi2ContingencyResult(statistic=0.486815415821501, pvalue=0.4853513240525321, dof=1, expected_freq=array([[14.5, 25.5],
       [14.5, 25.5]]))
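
To see where these numbers come from, the sketch below recomputes the expected frequencies, the statistic and the p-value by hand from the observed counts. The variable names and the 0.05 significance level are assumptions made for illustration; the final lines also run the test with `correction=True` to show what the Yates continuity correction changes.

```python
import numpy as np
from scipy import stats

# Observed counts from the contingency table (rows: Female, Male;
# columns: Books, Online).
observed = np.array([[13, 27],
                     [16, 24]])

# Expected count for each cell under independence:
# (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected = row_totals * col_totals / n

# Pearson's statistic: sum of (O - E)^2 / E over all cells.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper-tail probability of the chi-squared distribution.
p_value = stats.chi2.sf(chi2_stat, dof)

alpha = 0.05  # assumed conventional significance level
print(expected)
print(chi2_stat, dof, p_value)
print("reject independence" if p_value < alpha else "fail to reject independence")

# Same test with Yates' continuity correction, for comparison.
chi2_yates, p_yates, _, _ = stats.chi2_contingency(observed, correction=True)
print(chi2_yates, p_yates)
```

With a p-value of about 0.485, well above 0.05, these data give no evidence of an association between gender and preferred medium. The Yates correction shrinks each |O - E| by 0.5 before squaring, so it yields a smaller statistic and a larger p-value on this 2x2 table.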