# Mateusz Nowakowski, graded assignment 4

**Problem 5c (Chi-square independence test).** 
You are given the results of the IPSOS exit polls for the 2015 parliamentary elections in Poland in the table **data**. Decide whether we can assume that gender has no effect on voting preferences. To this end:
 * Compute row totals $r_i$, column totals $c_j$, and the overall total $N$.
 * If the variables are independent, we expect to see $f_{ij} = r_i c_j / N$ in the $i$-th row and $j$-th column.
 * Compute the test statistic as before, i.e. $$ S = \sum_{ij} \frac{\left(f_{ij}-X_{ij}\right)^2}{f_{ij}},$$ where $X_{ij}$ is the observed count.
 * Again test against the $\chi^2$ CDF. However, if the variables are independent, we only have $(r-1)(c-1)$ degrees of freedom here, where $r$ and $c$ are the numbers of rows and columns (we only need to know the row and column totals).
 * One obvious offender is the KORWiN party; try removing the last column and repeating the experiment.
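
The steps above can be sketched on a small made-up table before applying them to the real data. The 2x2 counts below are purely hypothetical and chosen so that the arithmetic is easy to check by hand; `np.outer` is used here as one convenient way to form all products $r_i c_j$ at once, and the `toy_*` names are ours.

```python
import numpy as np

# Hypothetical 2x2 table of counts (illustration only, not poll data)
toy = np.array([[10, 20],
                [30, 40]])

row_totals = toy.sum(axis=1)   # r_i
col_totals = toy.sum(axis=0)   # c_j
total = toy.sum()              # N

# f_ij = r_i * c_j / N for every cell
toy_expected = np.outer(row_totals, col_totals) / total

# S = sum over all cells of (f_ij - X_ij)^2 / f_ij
toy_S = np.sum((toy_expected - toy) ** 2 / toy_expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
toy_dof = (toy.shape[0] - 1) * (toy.shape[1] - 1)

print(toy_expected)
print(toy_S, toy_dof)
```

For this toy table the expected counts are 12, 18, 28, 42 and $S = 50/63 \approx 0.794$ with one degree of freedom.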
 
**Note:** This kind of data is (to the best of our knowledge) not available online. It has been recreated from
online infographics and other tidbits of information available online. It is definitely not completely accurate, but hopefully it is not very far off. Moreover, exit polls do not necessarily reflect the actual distribution of the population.

In [9]:
import numpy as np
import scipy.stats

# Rows: women, men
# Columns: PiS, PO, Kukiz, Nowoczesna, Lewica, PSL, Razem, KORWiN

#data = np.array([ [39.7,26.4,7.5,7.1,6.6,5.0,4.2,2.8], 
#                  [38.5,20.3,10.6,7.1,6.6,5.4,3.5,7.1]])

data = np.array(
    [[ 17508, 11642,  3308,  3131,  2911,  2205,  1852, 1235],
     [ 17672,  9318,  4865,  3259,  3029,  2479,  1606, 3259]]
)

data = data // 50  # It is good to lower the sample size to obtain more realistic results

In [13]:
def calculate(data):
    n_rows, n_cols = data.shape

    sum_total = np.sum(data)
    sum_columns = np.sum(data, axis = 0)
    sum_rows = np.sum(data, axis = 1)
    
    print("Overall total:", sum_total)
    print("Column totals: ", sum_columns)
    print("Row totals: ",sum_rows)

    # Expected counts under independence: f_ij = r_i * c_j / N
    expected = np.outer(sum_rows, sum_columns) / sum_total
    print()
    print("Expected f_{i, j}:")
    print(expected)
    print("")
    
    # Chi-square statistic: sum over all cells of (f_ij - X_ij)^2 / f_ij
    S = np.sum((expected - data) ** 2 / expected)
    print("S =", S)
    print()
    
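    alpha = 0.5  # significance level; note this is far more permissive than the conventional 0.05

    deg = (n_rows - 1) * (n_cols - 1)

    # p-value: probability of a statistic at least as large as S under independence
    p_value = scipy.stats.chi2.sf(S, deg)

    if p_value < alpha:
        print("We reject the hypothesis! Gender may have an effect on voting preferences.\n")
    else:
        print("We fail to reject the hypothesis! No evidence that gender affects voting preferences.\n")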
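
As a sanity check on `calculate`, the same quantities can be obtained from `scipy.stats.chi2_contingency`, which returns the statistic, the p-value, the degrees of freedom, and the expected frequencies in one call. This is only a sketch and not part of the original solution: the names `counts`, `stat`, `p_value`, `dof`, and `expected` are ours, and `correction=False` switches off Yates' continuity correction (which SciPy applies only when there is a single degree of freedom in any case).

```python
import numpy as np
import scipy.stats

# Same downsampled table as in the solution (rows: women, men)
counts = np.array(
    [[17508, 11642, 3308, 3131, 2911, 2205, 1852, 1235],
     [17672,  9318, 4865, 3259, 3029, 2479, 1606, 3259]]
) // 50

# Library implementation of the chi-square independence test
stat, p_value, dof, expected = scipy.stats.chi2_contingency(counts, correction=False)
print("S =", stat, "dof =", dof, "p-value =", p_value)
```

If the hand-written computation is correct, `stat` and `dof` should agree with the values of `S` and `deg` computed by `calculate(data)`.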

In [11]:
print("With KORWIN: \n")
calculate(data)

print("NO KORWIN: \n")
calculate(data[:, :-1])

With KORWIN: 

Overall total: 1780
Column totals:  [703 418 163 127 118  93  69  89]
Row totals:  [873 907]

Expected f_{i, j}:
[[344.78595506 205.00786517  79.94325843  62.28707865  57.87303371
   45.61179775  33.84101124  43.65      ]
 [358.21404494 212.99213483  83.05674157  64.71292135  60.12696629
   47.38820225  35.15898876  45.35      ]]

S = 29.955740932887522

We reject the hypothesis! Gender may have an effect on voting preferences.

NO KORWIN: 

Overall total: 1691
Column totals:  [703 418 163 127 118  93  69]
Row totals:  [849 842]

Expected f_{i, j}:
[[352.95505618 209.86516854  81.83737433  63.76286221  59.24423418
   46.69248965  34.6428149 ]
 [350.04494382 208.13483146  81.16262567  63.23713779  58.75576582
   46.30751035  34.3571851 ]]

S = 11.67783242040946

We reject the hypothesis! Gender may have an effect on voting preferences.
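
The verdict depends heavily on the chosen threshold, so it helps to look at the p-values themselves. The sketch below recomputes them from the two statistics printed above using `scipy.stats.chi2.sf` (the survival function, i.e. `1 - cdf`); the names `results` and `p_values` are ours and are not part of the original solution.

```python
import scipy.stats

# (statistic, degrees of freedom) copied from the output above
results = {
    "with KORWiN": (29.955740932887522, 7),
    "without KORWiN": (11.67783242040946, 6),
}

p_values = {}
for label, (S, deg) in results.items():
    # Probability of a statistic at least this large under independence
    p_values[label] = scipy.stats.chi2.sf(S, deg)
    print(f"{label}: p-value = {p_values[label]:.4g}")
```

With all eight parties the p-value is below 0.001, so independence is rejected at any usual level. Without KORWiN it falls between 0.05 and 0.10: the hypothesis is rejected at `alpha = 0.5`, but it would not be rejected at the conventional 0.05.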

