# 1. RELATED CATEGORICALS: Chi-Square

## Loading libraries

In [1]:
import pandas as pd
import numpy as np
import datetime
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Loading data

In [2]:
data = pd.read_csv('lesson_404.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,STATE,MDMAUD,DOMAIN,INCOME,HV1,HV2,HV3,HV4,IC1,...,CARDPROM,CARDPM12,NUMPRM12,MAXADATE,RFA_2,NGIFTALL,TIMELAG,AVGGIFT,IC2_,age
0,0,IL,XXXX,Town,4.0,479,635,3,2,307,...,27,6,14,9702,L4E,31,1.386294,7.741935,Low,60.0
1,1,CA,XXXX,Suburban,6.0,5468,5218,12,10,1088,...,12,6,13,9702,L2G,3,2.890372,15.666667,High,45.0
2,2,NC,XXXX,Rural,3.0,497,546,2,1,251,...,26,6,14,9702,L4E,27,2.484907,7.481481,Low,59.0
3,3,CA,XXXX,Rural,1.0,1000,1263,2,1,386,...,27,6,14,9702,L4E,16,2.197225,6.8125,Moderate,69.0
4,4,FL,XXXX,Suburban,3.0,576,594,4,3,240,...,43,10,25,9702,L2F,37,2.639057,6.864865,Low,77.0


## $\chi^{2}$ test in Python

Let's use the $\chi^{2}$ test to determine whether the columns `DOMAIN` and `RFA_2` are related.

In [3]:
data_crosstab = pd.crosstab(data['DOMAIN'], data['RFA_2'], margins = False)
data_crosstab

DOMAIN \ RFA_2,L1E,L1F,L1G,L2E,L2F,L2G,L3D,L3E,L3F,L3G,L4D,L4E,L4F,L4G
City,911,5580,2413,962,2063,868,450,1408,640,303,906,732,358,164
Rural,1092,5617,2035,1034,2010,801,557,1453,639,253,1042,739,378,146
Suburban,828,6533,2917,828,2293,1010,413,1511,795,323,811,749,404,189
Town,910,5652,2189,927,2018,866,436,1543,657,245,912,726,399,155
Urban,489,3736,1529,526,1258,524,217,830,367,180,399,435,250,123


In [4]:
data_crosstab.shape

(5, 14)

### How many degrees of freedom?

Based on the data in the contingency table, we calculate the expected frequency of each cell under the assumption that the two nominal variables are independent. From the observed and expected frequencies we compute the chi-square test statistic, which helps us decide whether the variables are independent or not. Technically, through the value of the test statistic we are testing hypotheses about the independence of the categorical variables.

* **H0** (Null Hypothesis) - assumes that there is no association between the two variables.

* **Ha** (Alternate Hypothesis) - assumes that there is an association between the two variables.
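To answer the question in the heading: for a contingency table with r rows and c columns, the degrees of freedom are (r - 1) * (c - 1), so the 5 x 14 table above has 4 * 13 = 52. The sketch below copies the observed counts from the crosstab by hand and shows how the expected frequencies and the statistic can be computed with plain NumPy. The names `observed`, `dof_manual`, `expected_manual` and `chi2_manual` are illustrative only, not part of the lesson code.

```python
import numpy as np

# Observed counts copied by hand from the DOMAIN x RFA_2 crosstab above
# (rows: City, Rural, Suburban, Town, Urban).
observed = np.array([
    [911, 5580, 2413, 962, 2063, 868, 450, 1408, 640, 303, 906, 732, 358, 164],
    [1092, 5617, 2035, 1034, 2010, 801, 557, 1453, 639, 253, 1042, 739, 378, 146],
    [828, 6533, 2917, 828, 2293, 1010, 413, 1511, 795, 323, 811, 749, 404, 189],
    [910, 5652, 2189, 927, 2018, 866, 436, 1543, 657, 245, 912, 726, 399, 155],
    [489, 3736, 1529, 526, 1258, 524, 217, 830, 367, 180, 399, 435, 250, 123],
])

# Degrees of freedom: (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
dof_manual = (n_rows - 1) * (n_cols - 1)

# Expected count of a cell under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected_manual = row_totals * col_totals / grand_total

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

print("dof:", dof_manual)
print("chi2:", round(chi2_manual, 2))
```

These are the same quantities that `chi2_contingency` returns when called with `correction=False`.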

In [5]:
from scipy.stats import chi2_contingency

chi2, p_value, dof, expected_freq = chi2_contingency(data_crosstab, correction=False)

print("The Chi2 value is: ",round(chi2,2))
print("The p-value is: ",p_value)
print("The number of degrees of freedom is: ",dof)
print("The expected frequencies are: ")
pd.DataFrame(expected_freq, columns=data_crosstab.columns, index=data_crosstab.index)

The Chi2 value is:  464.56
The p-value is:  1.349917635349138e-67
The number of degrees of freedom is:  52
The expected frequencies are: 


DOMAIN \ RFA_2,L1E,L1F,L1G,L2E,L2F,L2G,L3D,L3E,L3F,L3G,L4D,L4E,L4F,L4G
City,897.919336,5756.44836,2352.633571,907.896218,2046.746629,863.74321,440.044157,1431.788634,657.625084,276.805394,863.955484,717.698647,379.75832,164.936956
Rural,899.840777,5768.766472,2357.667926,909.839007,2051.126422,865.591518,440.985799,1434.852491,659.032323,277.397724,865.804246,719.234436,380.570957,165.289902
Suburban,991.260878,6354.849288,2597.197236,1002.274888,2259.512384,953.532036,485.788132,1580.62757,725.987281,305.580186,953.766377,792.30568,419.235393,182.082672
Town,891.699938,5716.576575,2336.338159,901.607715,2032.569929,857.760531,436.996211,1421.871414,653.070073,274.888113,857.971335,712.727539,377.127941,163.794528
Urban,549.279071,3521.359305,1439.163108,555.382172,1252.044635,528.372705,269.185701,875.859891,402.28524,169.328584,528.502558,439.033698,232.30739,100.895943


We can decide whether to reject the null hypothesis in two ways:

* Comparing the Chi2 against the Chi2_critical value
* Comparing the p-value against the chosen significance level --> that's alpha

Let's use the usual confidence level of 95%. --> That's alpha = 0.05

* Using the Chi2 table:

Looking up dof = 52 and alpha = 0.05 gives a critical value of 69.83.
Now we compare our Chi2 against this critical value:

464.56 > 69.83  -> We **reject** the null hypothesis **H0**

* Using the p_value:

We compare our p_value against alpha = 0.05 (95% confidence level)

1.35e-67 < 0.05 -> We **reject** the null hypothesis **H0**

Remember the meaning of the p-value, P(D|H0): the probability, assuming the null hypothesis is true, of observing data at least as extreme as ours.

Therefore, **columns `RFA_2` and `DOMAIN` are related.**
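Instead of reading the critical value from a printed table, both quantities can be obtained from SciPy's chi-square distribution. This is a minimal sketch that hard-codes the statistic and degrees of freedom reported above. The names `chi2_dist`, `chi2_stat`, `chi2_critical` and `p_value_manual` are illustrative only.

```python
from scipy.stats import chi2 as chi2_dist

alpha = 0.05        # significance level (95% confidence)
dof = 52            # degrees of freedom reported by chi2_contingency
chi2_stat = 464.56  # test statistic reported above (rounded)

# Critical value: the point that leaves a right-tail area of alpha.
chi2_critical = chi2_dist.ppf(1 - alpha, dof)

# p-value: right-tail area beyond the observed statistic.
p_value_manual = chi2_dist.sf(chi2_stat, dof)

print("critical value:", round(chi2_critical, 2))
print("reject H0 (statistic vs critical value):", chi2_stat > chi2_critical)
print("reject H0 (p-value vs alpha):", p_value_manual < alpha)
```

Both comparisons always lead to the same decision, because the p-value falls below alpha exactly when the statistic exceeds the critical value.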

* If the observed chi-square test statistic is greater than the critical value (which is fixed by the degrees of freedom and the chosen alpha), the null hypothesis can be rejected.

* If the observed chi-square test statistic is lower than the critical value, we fail to reject the null hypothesis (often loosely phrased as "accepting" it).

Chi2 <-> Chi2_critical

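```python
chi2_value = 464.56     # test statistic computed above
chi2_critical = 69.83   # critical value for dof = 52 and alpha = 0.05

if chi2_value >= chi2_critical:
    print("Reject H0")
else:
    print("Fail to reject H0")
```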

Based on the statistics we either reject H0 or we fail to reject H0. You can also use the p value directly as we will see later in the lesson.



We have seen that in our case:

**columns `RFA_2` and `DOMAIN` are related.**

so let's drop one of those columns, for example `RFA_2`.

In [None]:
data.drop('RFA_2', axis = 1, inplace = True)

### 404. Activity 1

#### Use the `unit4.csv` file that you already have locally

Use the Chi-Square test to check whether there are salary differences (i.e. differences in `INCOME`) between men and women (assume a confidence level of 95%).

First, split income into two groups:

* INCOME > 3 -> 'HIGH_INCOME'
* INCOME <= 3 -> 'LOW_INCOME'
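One possible way to build the two income groups and run the test is sketched below on a small made-up frame. The column name `GENDER` and every value in `toy` are assumptions for illustration; check the actual column names in `unit4.csv`. The toy sample is far too small for a meaningful test and only shows the mechanics.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data; GENDER is a hypothetical column name.
toy = pd.DataFrame({
    "GENDER": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "INCOME": [5, 2, 4, 3, 1, 6, 7, 2],
})

# INCOME > 3 -> 'HIGH_INCOME', INCOME <= 3 -> 'LOW_INCOME'
toy["INCOME_GROUP"] = np.where(toy["INCOME"] > 3, "HIGH_INCOME", "LOW_INCOME")

# Contingency table of the two categorical columns, then the test itself.
toy_crosstab = pd.crosstab(toy["GENDER"], toy["INCOME_GROUP"])
chi2_toy, p_toy, dof_toy, expected_toy = chi2_contingency(toy_crosstab, correction=False)

print(toy_crosstab)
print("dof:", dof_toy, "p-value:", p_toy)
```

Note that with `np.where`, rows with a missing `INCOME` would fall into `'LOW_INCOME'`, so missing values may need to be handled first.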

In [None]:
data = pd.read_csv('unit4.csv')
data.head()

### 404. Activity 2

Repeat the same steps already done in class for the columns `STATE` and `DOMAIN` to find the chi-square statistic.

Check the p-value to decide whether to reject or fail to reject the null hypothesis. If you reject the null hypothesis, then drop one of the columns (`STATE` here).

Our Chi2 is 15066.07 and the Chi2_critical_value is ([check here](https://web.ma.utexas.edu/users/davis/375/popecol/tables/chisq.html)):

Chi2_critical_value = 51.00


Our p-value = 0, therefore at a confidence level of 95%:

```python
p_value = 0.0  # p-value obtained for STATE vs DOMAIN
if p_value < 0.05:
    print("Reject H0")
else:
    print("Fail to reject H0")
```

We reject the null hypothesis.

The columns `STATE` and `DOMAIN` are related, therefore we can drop one of them.


In [None]:
data = data.drop(['STATE'], axis=1)

# 2. Cleaning and encoding categoricals

## Cleaning `DOMAIN` column

Grouping values with low frequency together.

In [None]:
vals_domain = pd.DataFrame(data['DOMAIN'].value_counts())
vals_domain = vals_domain.reset_index()
vals_domain.columns = ['domain', 'counts']
group_vals_domain_df = vals_domain[vals_domain['counts']<5000]
group_vals_domain = list(group_vals_domain_df['domain'])
group_vals_domain

In [None]:
def clean_vals_domain(x):
    if x in group_vals_domain:
        return 'other'
    else:
        return x

data['DOMAIN'] = list(map(clean_vals_domain, data['DOMAIN']))

## Encoding/Dummifying

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='error', drop='first').fit(data[['DOMAIN']])
encoded = encoder.transform(data[['DOMAIN']]).toarray()
encoded

# 3. Transforming Numericals

In [None]:
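# Keep only the numerical columns; the correlation matrix is just for inspection
numerical = data.select_dtypes(include=np.number)
data_corr = numerical.corr()
display(data_corr.head())
# Drop AVGGIFT from the features to be scaled (as in the original cell)
numerical = numerical.drop(['AVGGIFT'], axis=1)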

## Standardization/Standard Scaler

In [None]:
from sklearn.preprocessing import StandardScaler

transformer = StandardScaler().fit(numerical)
x_standardized = transformer.transform(numerical)
x_standardized

## Min-max scaler

In [None]:
from sklearn.preprocessing import MinMaxScaler

transformer = MinMaxScaler().fit(numerical)
x_min_max = transformer.transform(numerical)
x_min_max
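What the two scalers compute can be checked by hand on a small array. The sketch below uses made-up values (`toy_x`) and compares the textbook formulas with the scikit-learn output. The `manual_*` names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

toy_x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])  # made-up values

# Standardization: subtract the column mean, divide by the column standard deviation.
manual_standard = (toy_x - toy_x.mean(axis=0)) / toy_x.std(axis=0)

# Min-max scaling: map each column linearly onto the range [0, 1].
col_min, col_max = toy_x.min(axis=0), toy_x.max(axis=0)
manual_min_max = (toy_x - col_min) / (col_max - col_min)

# Compare the manual results with scikit-learn.
same_standard = np.allclose(manual_standard, StandardScaler().fit_transform(toy_x))
same_min_max = np.allclose(manual_min_max, MinMaxScaler().fit_transform(toy_x))
print(same_standard, same_min_max)
```

`StandardScaler` uses the population standard deviation (`ddof=0`), which is NumPy's default, whereas pandas' `.std()` defaults to `ddof=1`.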

### 404. Activity 3

Check the distributions of the numerical data (`numerical`) we got and decide which scaler should perform better with it.

**Hint:** You can also plot the scaled distributions.

We won't use `Normalizer` because it rescales each row (sample) to unit norm rather than each column (feature), which is not what we want here, so we will be using either the minmax or the standard scaler. Both perform fairly well, so you can choose the one you like the most. We could even use a robust scaler for some of the features.

### 404. Activity 4

Create one of the two scaling methods we have seen (standard and minmax) as a Python function. 

Remember that the scaler must be fitted on the training set only, and the test set must then be transformed **with the info from the train!**
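The train/test point can be sketched as follows: the statistics are learned from the training frame and reused unchanged on the test frame. This is one possible shape for such a function, not the official solution. The function names and the toy frames are made up for illustration.

```python
import pandas as pd

def fit_standard_scaler(train_df):
    """Learn per-column mean and standard deviation from the training data only."""
    return train_df.mean(), train_df.std(ddof=0)

def apply_standard_scaler(df, mean, std):
    """Scale any frame with statistics learned elsewhere (here: on the train set)."""
    return (df - mean) / std

# Made-up train/test split.
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})
toy_test = pd.DataFrame({"a": [2.0, 4.0], "b": [30.0, 0.0]})

train_mean, train_std = fit_standard_scaler(toy_train)
train_scaled = apply_standard_scaler(toy_train, train_mean, train_std)
test_scaled = apply_standard_scaler(toy_test, train_mean, train_std)  # train stats reused

print(test_scaled)
```

The scaled test columns will generally not have mean 0 and standard deviation 1, which is expected. A column with zero variance in the training set would cause a division by zero and needs separate handling.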