<h1>Multi-category chi-squared tests</h1>
<p>Chi-squared tests work for a single categorical variable (e.g. gender), but they can also be applied to look at the interaction between two categorical variables.</p>
<p>For instance, census data records both gender and income level (high or low) as categorical variables, so each person is described by a pair of categories, such as (male, high income). The table below cross-tabulates the observed counts.</p>

<table align="center">
<tr>
    <td></td>
    <th colspan="3">Sex</th>
    <td></td>
</tr>
<tr>
    <th rowspan="5">Income</th>
</tr>
<tr>
    <td>income cat</td>
    <th>Male</th>
    <th>Female</th>
    <th>Totals</th>
</tr>
<tr><td>&gt;50k</td><td>6662</td><td>1179</td><td>7841</td></tr>
<tr><td>&lt;=50k</td><td>15128</td><td>9592</td><td>24720</td></tr>
<tr><td>Totals</td><td>21790</td><td>10771</td><td>32561</td></tr>
</table>
<p>Next, we convert the counts to proportions by dividing each cell by the sample size (32561), so that the values can be compared across categories.</p>
<table align="center">
<tr><td></td><th colspan="3">Sex</th><td></td></tr>
<tr><th rowspan="5">Income</th></tr>
<tr><td>income cat</td><th>Male</th><th>Female</th><th>Totals</th></tr>
<tr><td>&gt;50k</td><td>.205</td><td>.036</td><td>.241</td></tr>
<tr><td>&lt;=50k</td><td>.465</td><td>.294</td><td>.759</td></tr>
<tr><td>Totals</td><td>.669</td><td>.331</td><td>1</td></tr>
</table>
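<p>The conversion above can be sketched in NumPy. This is only an illustration: the array layout (rows are income, columns are sex) and the variable names are assumptions, not part of the original notes.</p>

```python
import numpy as np

# Observed counts from the first table: rows are income (>50k, <=50k),
# columns are sex (Male, Female). This layout is an assumption of the sketch.
observed_counts = np.array([[6662, 1179],
                            [15128, 9592]])

n = observed_counts.sum()            # total sample size
proportions = observed_counts / n    # each cell as a share of the whole sample

# Marginal proportions: sum across columns for income, across rows for sex
income_marginals = proportions.sum(axis=1)
sex_marginals = proportions.sum(axis=0)

print(np.round(proportions, 3))
print(np.round(income_marginals, 3))
print(np.round(sex_marginals, 3))
```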
<p>Once the counts have been converted to proportions, we can determine the expected values under the assumption that the two variables are independent, by multiplying the marginal proportions of each category. For instance, 24.1% of all people in the data set earn &gt;50k, and 33.1% of all people in the data set are female. So we would expect the proportion of people who are both female and earning &gt;50k to be .241&#42;.331, which is 0.079771.</p>
<p>We can convert the expected proportion to an expected count by multiplying it by the sample size: 32561&#42;0.079771 is 2597.4.</p>

In [2]:
import numpy as np

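def compute_expected_value(proportion, n):
    """Scale an expected proportion up to an expected count."""
    return proportion * n

n = 32561
# Expected proportion of each cell = income marginal * sex marginal
males_over50k = compute_expected_value(.241 * .669, n)     # ~5249.8
males_under50k = compute_expected_value(.759 * .669, n)    # ~16533.5
females_over50k = compute_expected_value(.241 * .331, n)   # ~2597.4
females_under50k = compute_expected_value(.759 * .331, n)  # ~8180.3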
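
<p>The same expected counts can also be computed for the whole table at once from the row and column totals. The sketch below is one possible way to do it, assuming the observed counts are stored as a 2x2 array with income as rows and sex as columns. Because it uses unrounded marginals, its results differ slightly from the hand-rounded values above.</p>

```python
import numpy as np

# Observed counts: rows are income (>50k, <=50k), columns are sex (Male, Female)
observed_counts = np.array([[6662, 1179],
                            [15128, 9592]])

n = observed_counts.sum()
row_totals = observed_counts.sum(axis=1)   # income totals
col_totals = observed_counts.sum(axis=0)   # sex totals

# Under independence, expected count = row total * column total / n
expected_counts = np.outer(row_totals, col_totals) / n
print(np.round(expected_counts, 1))
```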


In [5]:
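# Cell order: male >50k, female >50k, male <=50k, female <=50k
observed = [6662, 1179, 15128, 9592]
expected = [5249.8, 2597.4, 16533.5, 8180.3]
values = []

# Each cell contributes (observed - expected)^2 / expected to the statistic
for obs, exp in zip(observed, expected):
    value = (obs - exp) ** 2 / exp
    values.append(value)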

chisq_gender_income = sum(values)
print(chisq_gender_income)

1517.5510981525103
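
<p>The statistic only means something relative to a chi-squared distribution. As a rough sketch, assuming a significance level of 0.05 (a conventional choice, not one made in these notes) and the (rows - 1) * (columns - 1) degrees of freedom of a 2x2 table, it can be compared against a critical value like this:</p>

```python
from scipy.stats import chi2

chisq_statistic = 1517.5510981525103   # value printed above
dof = (2 - 1) * (2 - 1)                # (rows - 1) * (columns - 1) for a 2x2 table

alpha = 0.05                           # assumed significance level
critical_value = chi2.ppf(1 - alpha, dof)   # statistic must exceed this to reject
p_value = chi2.sf(chisq_statistic, dof)     # probability of a statistic this large

print(critical_value)
print(p_value)
```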


<h2>Now with SciPy:</h2>

In [8]:
import numpy as np
from scipy.stats import chisquare

observed = np.array(observed)
expected = np.array(expected)

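# chisquare assumes k - 1 = 3 degrees of freedom by default; a 2x2
# contingency table has (2 - 1) * (2 - 1) = 1, so pass ddof=2
chisquare_val, pvalue = chisquare(observed, expected, ddof=2)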
print(chisquare_val)
print(pvalue)

1517.55109815
0.0
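
<p>As a cross-check, the same 2x2 table can be passed to scipy.stats.chi2_contingency, which derives the expected counts itself. This is a sketch under two assumptions: the table layout (income as rows, sex as columns) and <code>correction=False</code>, which turns off the Yates continuity correction so the result is comparable to the manual calculation. The statistic comes out slightly different from the one above because the expected counts are not rounded.</p>

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows are income (>50k, <=50k), columns are sex (Male, Female)
observed_table = np.array([[6662, 1179],
                           [15128, 9592]])

# correction=False disables the Yates continuity correction for 2x2 tables
stat_2x2, pvalue_2x2, dof_2x2, expected_2x2 = chi2_contingency(
    observed_table, correction=False
)

print(stat_2x2)
print(pvalue_2x2)
print(dof_2x2)
print(expected_2x2)
```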


<h2>Scaling up to more than two categories with cross tables</h2>
<p>In order to run a chi-squared test under these conditions, we need to find the observed frequency counts for each combination of the categorical variables. This is where <a href="http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html">pandas.crosstab</a> comes in. It returns a table that counts how many observations fall into each combination of categorical variables.</p>
<p>These will just be notes until I have a good data set to demonstrate with.</p>


````
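import pandas as pd

fake_data = pd.read_csv("fakedata.csv")
print(fake_data.columns.values)
# columns: gender, race, income

# Rows are gender, columns are race; each cell counts the matching observations
crosstab_by_gender_and_race = pd.crosstab(fake_data["gender"], fake_data["race"])
print(crosstab_by_gender_and_race)

output: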
race      Amer-Indian-Eskimo   Asian-Pac-Islander   Black   Other   White
sex                                                                      
 Female                  119                  346    1555     109    8642
 Male                    192                  693    1569     162   19174
````
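
<p>Since fakedata.csv is not included with these notes, here is a small self-contained sketch of the same call. The six-row <code>toy_data</code> frame is made up purely for illustration, and <code>margins=True</code> is added to show the row and column totals alongside the counts.</p>

```python
import pandas as pd

# Hypothetical toy data, invented only to show what crosstab returns
toy_data = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "race": ["White", "Black", "White", "White", "Black", "Black"],
})

# margins=True appends an "All" row and column holding the totals
toy_crosstab = pd.crosstab(toy_data["gender"], toy_data["race"], margins=True)
print(toy_crosstab)
```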

<h2>Finding Expected Values</h2>
<p>We can use <a href="http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html">scipy.stats.chi2_contingency</a> to calculate the chi-squared statistic, p-value, degrees of freedom, and expected frequencies from a table of observed values.</p>
<p>You can pass the result of pandas.crosstab directly into scipy.stats.chi2_contingency.</p>
````
from scipy.stats import chi2_contingency
# Returns the statistic, p-value, degrees of freedom, and expected counts
chisq_value, pvalue_gender_race, dof, expected = chi2_contingency(crosstab_by_gender_and_race)
print(chisq_value)
print(pvalue_gender_race)
print(dof)
print(expected)
Output:
454.267108913
5.19206130276e-97
4
[[   102.87709223    343.69549461   1033.40204539     89.64531188
    9201.3800559 ]
 [   208.12290777    695.30450539   2090.59795461    181.35468812
   18614.6199441 ]]
````
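
<p>To make the call above reproducible without the CSV file, the observed counts from the crosstab output can be typed in by hand. This sketch assumes the rows are Female and Male and the columns follow the order shown in that output.</p>

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the crosstab output: rows are Female, Male; columns are
# Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White
observed_gender_race = np.array([[119, 346, 1555, 109, 8642],
                                 [192, 693, 1569, 162, 19174]])

stat, pvalue, dof, expected = chi2_contingency(observed_gender_race)

# Degrees of freedom are (rows - 1) * (columns - 1) = 1 * 4
print(stat)
print(pvalue)
print(dof)
print(expected)
```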

<h2>Chi Squared Caveats</h2>
<ul>
    <li>Be mindful of significance: with a large sample, even a small difference between observed and expected values can produce a very small p-value, so a significant result is not necessarily a large effect</li>
    <li>Chi-Squared tests do not test for correlation; they only identify a difference between the expected and observed values. Further investigation is required to create a more complete picture of the relationship</li>
    <li>Chi-Squared tests can only be applied when the observations are independent and the categories are mutually exclusive; that is, each observation falls into exactly one cell of the cross table</li>
    <li>Chi-Squared tests are more reliable when the counts in each cell of the cross table are larger, i.e. when there are adequate examples of each combination of categorical variables</li>
</ul>
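
<p>The last caveat can be checked before trusting a result. A common rule of thumb, which is an assumption here and not something stated in these notes, is that every expected count should be at least 5. A minimal sketch using the gender and race counts from earlier:</p>

```python
import numpy as np
from scipy.stats import chi2_contingency

# Gender by race counts from the crosstab output (rows are Female, Male)
observed_counts = np.array([[119, 346, 1555, 109, 8642],
                            [192, 693, 1569, 162, 19174]])

_, _, _, expected_counts = chi2_contingency(observed_counts)

# Assumed rule of thumb: every expected count should be at least 5
rule_of_thumb_threshold = 5
min_expected = expected_counts.min()

if min_expected < rule_of_thumb_threshold:
    print("Some expected counts are small; treat the test with caution")
else:
    print("All expected counts meet the rule of thumb")
```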