# Statistical Data Analysis 

In this section of the project, we use Pearson’s chi-squared test, a statistical hypothesis test for quantifying the independence of pairs of categorical variables.

In classification problems where the input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent on or independent of the input variables. If it is independent, the input variable is a candidate for a feature that may be irrelevant to the problem and can be removed from the dataset.

Therefore, our hypothesis statements are the following:

    H0: X and Y are independent
    H1: X and Y are dependent
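
For reference, the test statistic behind these hypotheses can be written as follows. The notation is introduced here for illustration only: $O_{ij}$ is the observed count in row $i$ and column $j$ of the contingency table, $E_{ij}$ is the count expected under H0, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.

$$
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{N}, \qquad \text{df} = (r-1)(c-1)
$$

A large value of $\chi^2$ relative to the chi-squared distribution with $\text{df}$ degrees of freedom is evidence against H0.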

Here, we explore whether the target variable, i.e. whether the user pays for apps, depends on the features Gender and Nationality.

In [1]:
import pandas as pd
import scipy.stats as stats
pd.options.mode.chained_assignment = None

In [2]:
clean = pd.read_excel('clean_data.xlsx', index_col=0)

In [3]:
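# Take an explicit copy so that later column assignments modify `complete` itself, not a view of `clean`
complete = clean[clean.response == 1].copy()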

In [4]:
complete.head()

id,startdate,enddate,response,participant_type,browser,browser_version,os,screen_resolution,flash_version,java_support,...,extraverted_enthusiastic,critical_quarrelsome,dependable_self_disciplined,anxious_easily_upset,open_to_new_experiences_complex,reserved_quiet,sympathetic_warm,disorganized_careless,calm_emotionally_stable,conventional_uncreative
2,2012-09-26 07:45:19,2012-09-26 07:56:41,1,1,Safari,6,CPU iPhone OS 6_0 like Mac OS X,320x480,-1,0.0,...,6.0,3.0,7.0,2.0,6.0,3.0,4.0,3.0,4.0,4.0
3,2012-09-26 07:45:35,2012-09-26 08:01:56,1,1,Safari,6,CPU OS 6_0 like Mac OS X,768x1024,-1,0.0,...,4.0,4.0,5.0,2.0,3.0,3.0,5.0,3.0,5.0,3.0
4,2012-09-26 16:58:29,2012-09-26 17:05:50,1,1,Firefox,15.0.1,Intel Mac OS X 10.6,1920x1200,11.4.402,1.0,...,4.0,3.0,6.0,3.0,5.0,5.0,5.0,2.0,5.0,3.0
5,2012-09-27 04:16:04,2012-09-27 04:24:56,1,1,Chrome,22.0.1229.79,Intel Mac OS X 10_7_4,1280x800,11.4.402,1.0,...,2.0,6.0,4.0,3.0,6.0,5.0,7.0,3.0,5.0,3.0
6,2012-09-27 08:50:34,2012-09-27 08:56:48,1,1,Chrome,21.0.1180.89,WOW64,1920x1080,11.3.31,1.0,...,3.0,2.0,6.0,2.0,6.0,4.0,3.0,2.0,5.0,2.0


In [5]:
complete.do_not_pay_for_apps = complete.do_not_pay_for_apps.fillna(0)

In [6]:
complete.do_not_pay_for_apps

id
2        1.0
3        1.0
4        1.0
5        0.0
6        1.0
7        0.0
8        1.0
9        0.0
11       1.0
13       0.0
16       0.0
17       1.0
18       0.0
19       1.0
20       1.0
21       1.0
22       0.0
23       1.0
25       1.0
26       1.0
27       1.0
28       1.0
33       0.0
34       0.0
36       1.0
38       1.0
39       1.0
40       1.0
41       0.0
42       0.0
        ... 
9933     1.0
9934     0.0
9936     0.0
9939     1.0
9942     1.0
9943     1.0
9944     0.0
9945     1.0
9959     1.0
9966     1.0
9971     1.0
9981     0.0
10007    0.0
10010    1.0
10017    1.0
10027    1.0
10030    1.0
10034    1.0
10041    1.0
10046    1.0
10056    0.0
10060    1.0
10069    1.0
10071    1.0
10073    1.0
10086    1.0
10092    1.0
10111    1.0
10112    1.0
10208    1.0
Name: do_not_pay_for_apps, Length: 4824, dtype: float64

**Gender**

    H0: Paying for apps is independent of the gender of the users
    H1: Paying for apps is dependent on the gender of the users

In [7]:
gender_pays_df = complete[['gender', 'do_not_pay_for_apps']]
gender_pays_df

id,gender,do_not_pay_for_apps
2,Male,1.0
3,Male,1.0
4,Male,1.0
5,Male,0.0
6,Female,1.0
7,Male,0.0
8,Male,1.0
9,Male,0.0
11,Female,1.0
13,Female,0.0


In [8]:
gender_pays = gender_pays_df.groupby('gender').sum()
gender_pays

gender,do_not_pay_for_apps
Female,1518.0
Male,1209.0


In [9]:
gender_pays['pay_for_apps'] = gender_pays_df.groupby('gender').count() - gender_pays_df.groupby('gender').sum()
gender_pays

gender,do_not_pay_for_apps,pay_for_apps
Female,1518.0,960.0
Male,1209.0,1137.0
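
As an aside, a contingency table like this one could also be built in a single step with `pd.crosstab`. The following is only a minimal sketch on hypothetical toy data (the `toy` frame and its six rows are invented for illustration and are not taken from the survey):

```python
import pandas as pd

# Hypothetical toy data standing in for the survey responses.
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'do_not_pay_for_apps': [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})

# Rows are genders; columns are the values of do_not_pay_for_apps
# (0.0 = pays for apps, 1.0 = does not pay for apps).
table = pd.crosstab(toy['gender'], toy['do_not_pay_for_apps'])
print(table)
```

The resulting table has one row per category and one column per outcome, which is the layout `stats.chi2_contingency` expects.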


In [10]:
stats.chi2_contingency(gender_pays)

(45.981011507353415,
 1.1940473451340507e-11,
 1,
 array([[1400.80970149, 1077.19029851],
        [1326.19029851, 1019.80970149]]))
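
To make the output above less of a black box, the expected frequencies and the statistic can be recomputed by hand from the observed counts. This is a sketch, not part of the original analysis; it relies on the fact that SciPy applies Yates' continuity correction by default when the table has one degree of freedom, which is the case for this 2x2 table.

```python
import numpy as np

# Observed counts from the gender table above
# (rows: Female, Male; columns: do_not_pay_for_apps, pay_for_apps).
observed = np.array([[1518.0, 960.0],
                     [1209.0, 1137.0]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Expected counts under independence: row total * column total / grand total.
expected = row_totals @ col_totals / total

# Chi-squared statistic with Yates' continuity correction (subtract 0.5 from |O - E|).
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

print(expected)
print(chi2_yates)
```

The `expected` array should agree with the array returned by `chi2_contingency`, and `chi2_yates` with its first return value.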

For 1 degree of freedom, we have a very high chi-squared value of 45.98, well above the table value of 3.84 (at the 0.05 significance level), and a p-value of about 1.19e-11. We therefore have enough statistical evidence to reject H0 and conclude that the two variables are dependent.
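
The table value of 3.84 can be reproduced with SciPy rather than looked up. The sketch below assumes a significance level of 0.05, which is the level that table value corresponds to; `alpha` and `reject_h0` are names chosen here for illustration.

```python
from scipy import stats

alpha = 0.05                    # assumed significance level
dof = 1                         # degrees of freedom reported by chi2_contingency
chi2_stat = 45.981011507353415  # statistic reported by chi2_contingency for gender

# Critical value: the (1 - alpha) quantile of the chi-squared distribution.
critical = stats.chi2.ppf(1 - alpha, dof)

# Reject H0 when the statistic exceeds the critical value.
reject_h0 = chi2_stat > critical
print(round(critical, 2), reject_h0)
```

Comparing the p-value with `alpha` leads to the same decision as comparing the statistic with the critical value.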

**Country**

    H0: Paying for apps is independent of the nationality of the users
    H1: Paying for apps is dependent on the nationality of the users

In [11]:
country_pays_df = complete[['nationality', 'do_not_pay_for_apps']]
country_pays_df

id,nationality,do_not_pay_for_apps
2,USA,1.0
3,Other,1.0
4,Italy,1.0
5,Mexico,0.0
6,UK,1.0
7,Other,0.0
8,Australia,1.0
9,Japan,0.0
11,Australia,1.0
13,Japan,0.0


In [12]:
country_pays = country_pays_df.groupby('nationality').sum()
country_pays

nationality,do_not_pay_for_apps
Australia,147.0
Brazil,252.0
Canada,246.0
China,162.0
France,124.0
Germany,150.0
India,204.0
Italy,166.0
Japan,160.0
Korea,203.0


In [13]:
country_pays['pay_for_apps'] = country_pays_df.groupby('nationality').count() - country_pays_df.groupby('nationality').sum()
country_pays

nationality,do_not_pay_for_apps,pay_for_apps
Australia,147.0,100.0
Brazil,252.0,176.0
Canada,246.0,180.0
China,162.0,395.0
France,124.0,109.0
Germany,150.0,100.0
India,204.0,161.0
Italy,166.0,100.0
Japan,160.0,54.0
Korea,203.0,61.0


In [14]:
stats.chi2_contingency(country_pays)

(303.2415112441552,
 1.1847806153179469e-55,
 15,
 array([[139.62873134, 107.37126866],
        [241.94776119, 186.05223881],
        [240.81716418, 185.18283582],
        [314.87126866, 242.12873134],
        [131.71455224, 101.28544776],
        [141.32462687, 108.67537313],
        [206.33395522, 158.66604478],
        [150.36940299, 115.63059701],
        [120.9738806 ,  93.0261194 ],
        [149.23880597, 114.76119403],
        [138.49813433, 106.50186567],
        [147.54291045, 113.45708955],
        [162.24067164, 124.75932836],
        [141.88992537, 109.11007463],
        [141.32462687, 108.67537313],
        [158.28358209, 121.71641791]]))

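For 15 degrees of freedom, we have a very high chi-squared value of 303.24, well above the table value of 25.00 (at the 0.05 significance level), and a p-value of about 1.18e-55. We therefore have enough statistical evidence to reject H0 and conclude that the two variables are dependent.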
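
The degrees of freedom for the nationality test follow from the shape of the contingency table: the expected-frequency array returned above has 16 rows and 2 columns. The sketch below recomputes the degrees of freedom, the critical value, and the p-value from the reported statistic, assuming a 0.05 significance level (an assumption, since the notebook does not state one).

```python
from scipy import stats

n_rows, n_cols = 16, 2              # shape of the expected-frequency array above
dof = (n_rows - 1) * (n_cols - 1)   # (r - 1) * (c - 1)

chi2_stat = 303.2415112441552       # statistic reported by chi2_contingency for nationality
critical = stats.chi2.ppf(0.95, dof)    # critical value at the assumed 0.05 level
p_value = stats.chi2.sf(chi2_stat, dof)  # upper-tail probability of the statistic

print(dof, round(critical, 2), p_value)
```

Because the statistic is far above the critical value, the p-value is vanishingly small and H0 is rejected.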