**Problem 5c (Chi-square independence test).** 
You are given the results of the IPSOS exit polls for the 2015 parliamentary elections in Poland in the table **data**. Decide whether we can assume that gender and voting preferences are independent. To this end:
 * Compute row totals $r_i$, column totals $c_j$, and overall total $N$.
 * If the variables are independent, we expect to see $f_{ij} = r_i c_j / N$ in the $i$-th row and $j$-th column.
 * Compute the test statistic as before, i.e. $$ S = \sum_{ij} \frac{\left(f_{ij}-X_{ij}\right)^2}{f_{ij}},$$ where $X_{ij}$ is the observed count.
 * Again, test against the $\chi^2$ CDF. However, if the variables are independent, we only have $(r-1)(c-1)$ degrees of freedom here, where $r$ and $c$ are the numbers of rows and columns: once the row and column totals are known, only that many cells can vary freely.
 * The KORWiN party looks like an obvious outlier. Note that when we work with categorical variables we should not simply remove a category; it is better to aggregate categories. Introduce an aggregated category by summing the votes for the parties with less than 5% of the total votes, and repeat the experiment.
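
The steps above can be sketched on a small made-up table before touching the real data. This is only an illustration under stated assumptions: the function name `chi2_independence` and the array `toy` are hypothetical and are not part of the assignment.

```python
import numpy as np
from scipy import stats

def chi2_independence(table):
    """Chi-square independence test following the steps listed above (illustrative sketch)."""
    table = np.asarray(table, dtype=float)
    row_totals = table.sum(axis=1)              # r_i
    col_totals = table.sum(axis=0)              # c_j
    n = table.sum()                             # N
    # Expected counts under independence: f_ij = r_i * c_j / N
    expected = np.outer(row_totals, col_totals) / n
    # Test statistic S = sum over cells of (f_ij - X_ij)^2 / f_ij
    statistic = ((expected - table) ** 2 / expected).sum()
    # Degrees of freedom: (rows - 1) * (columns - 1)
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    # Upper tail of the chi-square distribution, i.e. 1 - CDF
    p_value = stats.chi2.sf(statistic, dof)
    return statistic, dof, p_value, expected

# Made-up counts, used only to exercise the function.
toy = np.array([[30, 20, 10],
                [20, 30, 40]])
statistic, dof, p_value, expected = chi2_independence(toy)
print(statistic, dof, p_value)
```

The expected table keeps the same row and column totals as the observed one, which is a quick sanity check on the computation.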
 
**Note:** To the best of our knowledge, this kind of data is not available online. It has been recreated from online infographics and other scattered pieces of information. It is definitely not completely accurate, but hopefully it is not very far off. Moreover, exit polls do not necessarily reflect the actual distribution in the population.

In [17]:
import numpy as np
from scipy import stats
# Rows: women, men
# Columns: PiS, PO, Kukiz, Nowoczesna, Lewica, PSL, Razem, KORWiN
data = np.array([[ 17508, 11642,  3308,  3131,  2911,  2205,  1852, 1235],
 [ 17672,  9318,  4865,  3259,  3029,  2479,  1606, 3259]])

In [18]:
# data preprocessing

def get_percentage_val(data):
    total = np.sum(data)
    return np.sum(data, axis=0) / total * 100

def get_non_outliers(data, percent):
    percentage = get_percentage_val(data)
    boolean = percentage > percent
    return data[:, boolean]

def get_outliers(data, percent):
    total = np.sum(data)
    percentage = get_percentage_val(data)
    boolean = percentage <= percent
    
    outliers = data[:, boolean]
    outliers = np.sum(outliers, axis=1)
    outliers = outliers.reshape(len(outliers), 1)
    
    if np.sum(outliers) / total * 100 > percent:
        return outliers
    return None

def get_full_data(data, percent):
    non_outliers = get_non_outliers(data, percent)
    #print(non_outliers)
    outliers = get_outliers(data, percent)
    #print(outliers)
    if(outliers is not None):
        full_data = np.concatenate((non_outliers, outliers), axis=1)
    else:
        full_data = non_outliers
    return full_data

In [31]:
# the actual task

def get_E_table(data):
    # Expected counts under independence: e[i][j] = row_sum[i] * col_sum[j] / total_sum.
    # The result is a float array, since expected counts are generally not integers.
    row_sum = np.sum(data, axis=1)
    col_sum = np.sum(data, axis=0)
    total_sum = np.sum(data)
    return np.outer(row_sum, col_sum / total_sum)

def get_p_val(data):
    e = get_E_table(data)
    row_len = len(data)
    col_len = len(data[0])
    k = ((data - e)**2) / e
    K = np.sum(k)
    
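    # degrees of freedom: (rows - 1) * (columns - 1)
    parameter = (row_len - 1) * (col_len - 1)
    # the survival function equals 1 - CDF but stays accurate for very small p-values
    p_val = stats.chi2.sf(K, parameter)
    return p_val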

In [32]:
# Set the threshold slightly above 5 so that KORWiN is removed together
# with another party, giving it something to merge with.
new_data = get_full_data(data, 5.1)
print(new_data)
print("New data above \n")

boundary = 0.05
if get_p_val(new_data) > boundary:
    print("no grounds to reject the hypothesis")
else:
    print("hypothesis rejected")

[[17508 11642  3308  3131  2911  2205  3087]
 [17672  9318  4865  3259  3029  2479  4865]]
New data above 

hypothesis rejected
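
As a cross-check, the same test can be run with `scipy.stats.chi2_contingency`, on both the full table and the aggregated one. The sketch below is not the notebook's own method. The helper `aggregate_small` and the `parties` list are hypothetical names introduced here, and the helper simplifies the merging rule: it merges every column whose share is below the threshold, without the extra check that `get_outliers` performs on the merged column.

```python
import numpy as np
from scipy import stats

# Same table as in the notebook: rows are women, men.
parties = ["PiS", "PO", "Kukiz", "Nowoczesna", "Lewica", "PSL", "Razem", "KORWiN"]
data = np.array([[17508, 11642, 3308, 3131, 2911, 2205, 1852, 1235],
                 [17672,  9318, 4865, 3259, 3029, 2479, 1606, 3259]])

def aggregate_small(table, threshold_percent):
    """Merge all columns whose share of the total is below the threshold (hypothetical helper)."""
    share = table.sum(axis=0) / table.sum() * 100
    small = share < threshold_percent
    if not small.any():
        return table
    merged = table[:, small].sum(axis=1, keepdims=True)
    # Keep the large columns in order and append the merged column last.
    return np.hstack([table[:, ~small], merged])

aggregated = aggregate_small(data, 5.1)

# The result can be indexed as (statistic, p-value, dof, expected counts).
# correction=False disables the Yates correction, which applies only when dof == 1.
full_result = stats.chi2_contingency(data, correction=False)
agg_result = stats.chi2_contingency(aggregated, correction=False)
print("full table:       S =", full_result[0], "dof =", full_result[2], "p =", full_result[1])
print("aggregated table: S =", agg_result[0], "dof =", agg_result[2], "p =", agg_result[1])

# Per-party contribution to S on the full table shows which columns drive the result.
expected = full_result[3]
contribution = ((data - expected) ** 2 / expected).sum(axis=0)
for name, value in zip(parties, contribution):
    print(f"{name:>10}: {value:.1f}")
```

Running the test on the full table first and then on the aggregated one mirrors the "repeat the experiment" step of the problem statement. The per-party contributions give a rough, informal way to see which categories account for most of the statistic.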
