## Compare Two Variables using Chi-Square Test

This notebook tests whether two categorical variables are related. It runs a chi-square test of independence, which checks whether the distribution of one categorical variable differs across the categories of another by more than chance alone would explain. In our article, we use a chi-square test to determine if there is a relationship between the subject an article is about (science or the humanities) and the type of newspaper that article appears in (top-circulating newspapers or student newspapers).

See this module's `README.md` file for more information about the data used and produced in this notebook and for more information about the chi-square test.

## Settings

This code block imports the required Python modules and defines the paths to the data, including the CSV file that holds the contingency table.

In [1]:
# Python imports
from scipy.stats import chi2_contingency
import numpy as np
from pathlib import Path

# Define paths
# directory this notebook is in (%pwd is an IPython magic, so this line only works in Jupyter/IPython)
current_dir       = %pwd
# directory of this repo on your machine
module_dir        = str(Path(current_dir).parent)
# directory of repo data
data_dir          = module_dir + '/data'

csv_file = data_dir + '/tables/chi-sq-contingency-notext.csv'

## Conduct the chi-square test

This cell runs the chi-square test and prints the results to the notebook. The null hypothesis for this comparison is that there is no relationship between the subject an article is about (science or the humanities) and the type of newspaper that article appears in (top-circulating newspapers or student newspapers). The p-value is compared against a significance level of 0.05.

"Independent (null hypothesis holds true)" = the test found no statistically significant evidence that the variables are related, so we fail to reject the null hypothesis (this does not prove that the variables are independent) <br/>
"Dependent (reject null hypothesis)" = the test found statistically significant evidence that the variables being tested are dependent on one another

In [2]:
# loading the contingency table from the csv into a NumPy array
with open(csv_file) as file_name:
    data = np.loadtxt(file_name, delimiter=",")

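# running the test; returns the test statistic, p-value,
# degrees of freedom, and expected counts
stat, p, dof, expected = chi2_contingency(data)

# interpreting p-value against a 5% significance level
alpha = 0.05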
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject null hypothesis)')
else:
    print('Independent (null hypothesis holds true)')

p value is 8.805779309195633e-49
Dependent (reject null hypothesis)
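
For readers curious where the p-value comes from: it is the upper-tail probability of a chi-square distribution with `dof` degrees of freedom, evaluated at the test statistic. The sketch below checks this on a small hypothetical table (the counts are made up for illustration and are not the article's data), using `scipy.stats.chi2.sf`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x2 table of counts, for illustration only
# (NOT the data used in this notebook)
observed = np.array([[20, 30],
                     [25, 25]])

stat, p, dof, expected = chi2_contingency(observed)

# Upper-tail (survival function) probability of the chi-square
# distribution at the test statistic
p_by_hand = chi2.sf(stat, dof)

print(p, p_by_hand)
```

The two printed values should agree, which is why a very large test statistic produces a p-value as small as the one reported above.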


### Learn more about the values used in the test

You can view the other values calculated as part of the chi-square test by running the cells below.

#### View the degrees of freedom of the contingency table

In [None]:
print(dof)
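
For a contingency table with `r` rows and `c` columns, the degrees of freedom are `(r - 1) * (c - 1)`, so a 2x2 table has one degree of freedom. A minimal sketch, again using a hypothetical table rather than the notebook's data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts, for illustration only
observed = np.array([[20, 30],
                     [25, 25]])

# Degrees of freedom from the table's shape
n_rows, n_cols = observed.shape
dof_by_hand = (n_rows - 1) * (n_cols - 1)

# Degrees of freedom as reported by SciPy
_, _, dof_scipy, _ = chi2_contingency(observed)

print(dof_by_hand, dof_scipy)
```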

#### View the expected values calculated from the contingency table

In [3]:
print(expected)

[[1945.52279473 8378.47720527]
 [ 530.47720527 2284.52279473]]
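
Each expected value is the count a cell would hold if the two variables were independent: the cell's row total times its column total, divided by the grand total. The sketch below reproduces SciPy's expected values by hand for a hypothetical table (illustrative counts only, not the notebook's data).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts, for illustration only
observed = np.array([[20, 30],
                     [25, 25]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# Expected count for each cell under independence
expected_by_hand = np.outer(row_totals, col_totals) / grand_total

# Expected counts as reported by SciPy
_, _, _, expected_scipy = chi2_contingency(observed)

print(expected_by_hand)
print(expected_scipy)
```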


#### View the test statistic

In [None]:
print(stat)
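
The test statistic sums the squared differences between observed and expected counts, each divided by the expected count. Note that `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom (as a 2x2 table does), which makes the statistic slightly smaller than the uncorrected Pearson value. The sketch below computes both versions by hand on a hypothetical table (illustrative counts, not the notebook's data) and compares them with SciPy.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts, for illustration only
observed = np.array([[20, 30],
                     [25, 25]])

_, _, _, expected = chi2_contingency(observed)
abs_diff = np.abs(observed - expected)

# Uncorrected Pearson chi-square statistic
stat_plain = np.sum(abs_diff**2 / expected)

# With Yates' correction: shrink each |observed - expected| by 0.5
# (but not below zero) before squaring
stat_yates = np.sum(np.maximum(abs_diff - 0.5, 0)**2 / expected)

# SciPy's values: the default applies the correction for 2x2 tables
stat_scipy_default = chi2_contingency(observed)[0]
stat_scipy_plain = chi2_contingency(observed, correction=False)[0]

print(stat_plain, stat_scipy_plain)
print(stat_yates, stat_scipy_default)
```

If you want the uncorrected statistic for the notebook's own table, passing `correction=False` to `chi2_contingency` is one option; with counts as large as those in the expected values above, the two versions are likely to differ only slightly.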