<a href="https://colab.research.google.com/github/halfChewedGum/python-practice/blob/main/testing_independence_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I test each factor of interest for a statistically significant relationship with the target variable (using mobile banking).

Our factors are of two types:
- Categorical (nominal)
- Numerical (continuous)

For the categorical variables, I will use the chi-squared test of independence.
For the continuous variables, I will use the Wilcoxon rank-sum test.
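
As a minimal sketch of the chi-squared test of independence mentioned above: the helper name `chi2_independence` and the toy columns are assumptions made for illustration and are not taken from the survey data. It builds a contingency table with `pd.crosstab` and passes it to `scipy.stats.chi2_contingency`.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_independence(factor, target):
    """Chi-squared test of independence between two categorical Series.

    Hypothetical helper: returns the chi-squared statistic, the p-value,
    and the degrees of freedom.
    """
    # Cross-tabulate the observed counts of each (factor, target) pair
    table = pd.crosstab(factor, target)
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value, dof


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "Gender": [1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "uses_banking": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
})

chi2, p_value, dof = chi2_independence(toy["Gender"], toy["uses_banking"])
print("chi2:", chi2, "p-value:", p_value, "dof:", dof)
```

A small p-value here would suggest that the factor and the target are not independent. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.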


In [1]:
#read data 

import pandas as pd 
import numpy as np 

from google.colab import files
uploaded = files.upload()

Saving data_dec_00.csv to data_dec_00.csv


In [2]:
import io
df = pd.read_csv(io.BytesIO(uploaded['data_dec_00.csv']))

In [3]:
df.head()

Unnamed: 0,ID,Province,Aboriginal,Education,House_size,Gender,Age_gp,House_type,Lang,Employment,...,Average_spending_online,Internet_income,Non_onlineShopper,OnlineShopper,Security_incident,vid_stream_user,gov_service_user,social_media_user,casual_sns_user,smartphone_user
0,100000,"""Alberta""",0,2,1,2,3,3,1,1,...,2040,0,0,1,0,1,1,1,1,1
1,100001,"""Saskatchewan""",0,1,1,1,5,3,1,1,...,2223,0,0,0,0,0,0,0,0,0
2,100002,"""Quebec""",0,3,1,1,5,3,3,1,...,2220,0,0,1,1,0,1,1,1,1
3,100003,"""Ontario""",0,2,2,2,4,2,1,1,...,1400,0,0,1,1,1,1,1,1,1
4,100004,"""Quebec""",0,1,4,1,1,1,2,0,...,0,0,0,1,1,1,0,1,1,1


In [4]:
#Environment settings (full option names, so nothing relies on pattern matching)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

In [5]:
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13810 entries, 0 to 13809
Data columns (total 184 columns):
 #    Column                     Non-Null Count  Dtype 
---   ------                     --------------  ----- 
 0    ID                         13810 non-null  int64 
 1    Province                   13810 non-null  object
 2    Aboriginal                 13810 non-null  int64 
 3    Education                  13810 non-null  int64 
 4    House_size                 13810 non-null  int64 
 5    Gender                     13810 non-null  int64 
 6    Age_gp                     13810 non-null  int64 
 7    House_type                 13810 non-null  int64 
 8    Lang                       13810 non-null  int64 
 9    Employment                 13810 non-null  int64 
 10   Current_student            13810 non-null  int64 
 11   immigrant                  13810 non-null  int64 
 12   IN_personal_use            13810 non-null  int64 
 13   IN_household_yn            13810 non-null  i

Almost every column (Province is the exception) was read as an integer. That is misleading, because most of these columns hold category codes rather than quantities, so I need to fix it. Apart from a few columns that are truly numeric, I will cast the columns to the "object" dtype.

In [6]:
#find out which ones are actually categorical 

df.nunique()

ID                           13810
Province                        10
Aboriginal                       3
Education                        4
House_size                       5
Gender                           2
Age_gp                           6
House_type                       5
Lang                             5
Employment                       3
Current_student                  3
immigrant                        3
IN_personal_use                  2
IN_household_yn                  3
IN_work                          3
IN_school                        3
IN_library                       3
IN_public                        3
IN_business                      3
IN_home                          3
IN_otherhome                     3
hours_online                     6
DVC_SMPH                         3
DVC_lptp                         3
DVC_tblt                         3
DVC_pc                           3
DVC_other                        3
COM_email                        3
COM_IM              

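Based on the result above, I treat a column with more than 10 unique values as continuous rather than categorical. ID also exceeds this threshold, but it is an identifier, not a measurement. The continuous columns are: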


ECOM_video_spend               113
ECOM_ebook_spend               146
ECOM_podcast_spend              74
ECOM_news_spend                 33
ECOM_giftcard_spend             92
ECOM_gambling_spend             46
ECOM_gaming_spend               41
ECOM_storage_spend              87
ECOM_software_spend             77
ECOM_other_spend               122
ECOM_totalspend                 96
PURCH_clothing                 758
RIDE_spend                      77
SRVC_total_spend               214
Average_spending_online       2369
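
The threshold rule above can also be applied programmatically instead of reading the columns off the `nunique()` output. The following is a sketch under assumptions: the helper name `split_by_cardinality`, its parameters, and the toy dataframe are hypothetical and only illustrate the idea.

```python
import pandas as pd


def split_by_cardinality(frame, threshold=10, id_cols=("ID",)):
    """Split column names into categorical and continuous by unique-value count.

    Hypothetical helper: columns with more than `threshold` unique values are
    treated as continuous; identifier columns are ignored.
    """
    # Drop identifier columns before counting unique values
    present_ids = [c for c in id_cols if c in frame.columns]
    counts = frame.drop(columns=present_ids).nunique()
    continuous = counts[counts > threshold].index.tolist()
    categorical = counts[counts <= threshold].index.tolist()
    return categorical, continuous


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "ID": range(20),
    "Gender": [1, 2] * 10,
    "spend": [i * 5 for i in range(20)],
})

categorical_cols, continuous_cols = split_by_cardinality(toy)
print(categorical_cols, continuous_cols)
```

With this rule a column with exactly 10 unique values, such as Province, stays categorical.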


In [7]:
#continuous variables

x_spendOnVid = df['ECOM_video_spend']
x_spendEBook = df['ECOM_ebook_spend']
x_spendPodcast = df['ECOM_podcast_spend']
x_spendNews = df['ECOM_news_spend']
x_spendGiftCard = df['ECOM_giftcard_spend']
x_spendGamble = df['ECOM_gambling_spend']
x_spendGaming = df['ECOM_gaming_spend']
x_spendStorage = df['ECOM_storage_spend']
x_spendSoftware = df['ECOM_software_spend']
x_spendOther = df['ECOM_other_spend']
x_totalSpend = df['ECOM_totalspend']
x_spendClothes = df['PURCH_clothing']
x_spendrideShare = df['RIDE_spend']
x_spendService = df['SRVC_total_spend']
x_averageSpend = df['Average_spending_online']

In [8]:
#keep a copy of dataframe 
df_copy01 = df.copy()

Keeping track: copy 1 of the dataframe (df_copy01) will contain neither the continuous variables nor ID.

In [13]:
columns_to_drop01 = ['ID','ECOM_video_spend', 'ECOM_ebook_spend','ECOM_podcast_spend', 'ECOM_news_spend','ECOM_giftcard_spend','ECOM_gambling_spend','ECOM_gaming_spend','ECOM_storage_spend','ECOM_software_spend','ECOM_other_spend','ECOM_totalspend','PURCH_clothing','RIDE_spend','SRVC_total_spend','Average_spending_online']

df_copy01 = df_copy01.drop(columns = columns_to_drop01)
df_copy01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13810 entries, 0 to 13809
Columns: 168 entries, Province to smartphone_user
dtypes: object(168)
memory usage: 17.7+ MB


In [14]:
#convert everything that is left to an object 
df_copy01 = df_copy01.astype(object)

In [15]:
df_copy01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13810 entries, 0 to 13809
Columns: 168 entries, Province to smartphone_user
dtypes: object(168)
memory usage: 17.7+ MB


- df_allObjectVars: dataframe with only the categorical variables
- df_allContinous: dataframe with only the continuous variables (plus the target, SM_USE_banking)

In [16]:
df_allObjectVars = df_copy01.copy()

In [23]:
columns_continousForm = ['SM_USE_banking','ECOM_video_spend', 'ECOM_ebook_spend','ECOM_podcast_spend', 'ECOM_news_spend','ECOM_giftcard_spend','ECOM_gambling_spend','ECOM_gaming_spend','ECOM_storage_spend','ECOM_software_spend','ECOM_other_spend','ECOM_totalspend','PURCH_clothing','RIDE_spend','SRVC_total_spend','Average_spending_online']

df_allContinous = df.loc[:, df.columns.isin(columns_continousForm)]


In [24]:
df_allObjectVars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13810 entries, 0 to 13809
Columns: 168 entries, Province to smartphone_user
dtypes: object(168)
memory usage: 17.7+ MB


In [25]:
df_allContinous.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13810 entries, 0 to 13809
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   SM_USE_banking           13810 non-null  int64
 1   ECOM_video_spend         13810 non-null  int64
 2   ECOM_ebook_spend         13810 non-null  int64
 3   ECOM_podcast_spend       13810 non-null  int64
 4   ECOM_news_spend          13810 non-null  int64
 5   ECOM_giftcard_spend      13810 non-null  int64
 6   ECOM_gambling_spend      13810 non-null  int64
 7   ECOM_gaming_spend        13810 non-null  int64
 8   ECOM_storage_spend       13810 non-null  int64
 9   ECOM_software_spend      13810 non-null  int64
 10  ECOM_other_spend         13810 non-null  int64
 11  ECOM_totalspend          13810 non-null  int64
 12  PURCH_clothing           13810 non-null  int64
 13  RIDE_spend               13810 non-null  int64
 14  SRVC_total_spend         13810 non-null  int64
 15  Av

For the continuous variables, I run the rank-sum test.

## Continuous Variables - Rank-Sum Test


Null hypothesis: the two samples come from the same distribution.
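
For reference, a common way to set up a rank-sum test against a binary target is to split a continuous variable by the target's value and compare the two groups. The sketch below shows that setup on made-up data. It assumes a 0/1 coding of the target, which is an assumption for illustration only, and it is an alternative formulation rather than the comparison run in the cells of this notebook.

```python
import numpy as np
from scipy.stats import ranksums

# Toy data (made up for illustration only)
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)          # assumed 0/1 coding
spend = rng.exponential(scale=50, size=200) + 20 * target

# Split the continuous variable by target group
group_yes = spend[target == 1]
group_no = spend[target == 0]

# Null hypothesis: both groups come from the same distribution
result = ranksums(group_yes, group_no)
z_stat, p_value = result.statistic, result.pvalue
print("z statistic:", z_stat, "p-value:", p_value)
```

Rejecting the null hypothesis in this setup means the spending distribution differs between the two target groups.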


In [77]:
from scipy.stats import wilcoxon 
from scipy.stats import ranksums 
import numpy as np


y_cont = np.array(df_allContinous['SM_USE_banking'])
x1_cont = np.array(df_allContinous['ECOM_video_spend'])
x2_cont = np.array(df_allContinous['ECOM_ebook_spend'])
x3_cont = np.array(df_allContinous['ECOM_podcast_spend'])
x4_cont = np.array(df_allContinous['ECOM_news_spend'])
x5_cont = np.array(df_allContinous['ECOM_giftcard_spend'])
x6_cont = np.array(df_allContinous['ECOM_gambling_spend'])
x7_cont = np.array(df_allContinous['ECOM_gaming_spend'])
x8_cont = np.array(df_allContinous['ECOM_storage_spend'])
x9_cont = np.array(df_allContinous['ECOM_software_spend'])
x10_cont = np.array(df_allContinous['ECOM_other_spend'])
x11_cont = np.array(df_allContinous['ECOM_totalspend'])
x12_cont = np.array(df_allContinous['PURCH_clothing'])
x13_cont = np.array(df_allContinous['RIDE_spend'])
x14_cont = np.array(df_allContinous['SRVC_total_spend'])
x15_cont = np.array(df_allContinous['Average_spending_online'])



In [78]:
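def do_rankSumTest(x, y):
  """
  Run the Wilcoxon rank-sum test on two samples and return the test statistic
  (a z-score under the normal approximation) and the p-value.
  """

  res = ranksums(x, y)
  # unpack the result object; the original returned undefined names
  z_val, p_val = res.statistic, res.pvalue

  return z_val, p_val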

In [79]:
def rankSum_decision(alpha, p):
  """
  Print the test decision at significance level alpha.
  If p < alpha, reject the null hypothesis that the two samples come from the same distribution.
  Otherwise, fail to reject it.
  """
  if p < alpha: 
    print("Reject the null hypothesis: the two samples come from different distributions.")
  else:
    print("Fail to reject the null hypothesis: no evidence that the two distributions differ.")

  

In [80]:
def final_results(var_name, x_, y_):
  # run the test once on the arguments passed in and reuse the result
  z_val, p_value = do_rankSumTest(x_, y_)
  print('For ', var_name, ' the results are as follows: ')
  print('Wilcoxon rank-sum statistic: ', z_val)
  print('P-Value: ', p_value)
  print('The decision is: ') 
  rankSum_decision(0.05, p_value)

In [112]:
print('ECOM_video_spend')
rs_01 = ranksums(x1_cont, y_cont)
print('z statistic: ', rs_01.statistic)
print('p-value: ', rs_01.pvalue)
rankSum_decision(0.05, rs_01.pvalue)


ECOM_video_spend
z statistic:  89.76772777245463
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [113]:
print('ECOM_ebook_spend')
rs_02 = ranksums(x2_cont, y_cont)
print('z statistic: ', rs_02.statistic)
print('p-value: ', rs_02.pvalue)
rankSum_decision(0.05, rs_02.pvalue)


ECOM_ebook_spend
z statistic:  93.68542080404748
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [114]:
print('ECOM_podcast_spend')
rs_03 = ranksums(x3_cont, y_cont)
print('z statistic: ', rs_03.statistic)
print('p-value: ', rs_03.pvalue)
rankSum_decision(0.05, rs_03.pvalue)


ECOM_podcast_spend
z statistic:  70.46259101153585
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [115]:
print('ECOM_news_spend')
rs_04 = ranksums(x4_cont, y_cont)
print('z statistic: ', rs_04.statistic)
print('p-value: ', rs_04.pvalue)
rankSum_decision(0.05, rs_04.pvalue)


ECOM_news_spend
z statistic:  22.619840988823146
p-value:  2.7647454245675254e-113
Reject the null hypothesis: the two samples come from different distributions.


In [116]:
print('ECOM_giftcard_spend')
rs_05 = ranksums(x5_cont, y_cont)
print('z statistic: ', rs_05.statistic)
print('p-value: ', rs_05.pvalue)
rankSum_decision(0.05, rs_05.pvalue)


ECOM_giftcard_spend
z statistic:  54.96039777964721
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [117]:
print('ECOM_gambling_spend')
rs_06 = ranksums(x6_cont, y_cont)
print('z statistic: ', rs_06.statistic)
print('p-value: ', rs_06.pvalue)
rankSum_decision(0.05, rs_06.pvalue)


ECOM_gambling_spend
z statistic:  57.50903874769051
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [118]:
print('ECOM_gaming_spend')
rs_07 = ranksums(x7_cont, y_cont)
print('z statistic: ', rs_07.statistic)
print('p-value: ', rs_07.pvalue)
rankSum_decision(0.05, rs_07.pvalue)


ECOM_gaming_spend
z statistic:  49.0197338653684
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [119]:
print('ECOM_storage_spend')
rs_08 = ranksums(x8_cont, y_cont)
print('z statistic: ', rs_08.statistic)
print('p-value: ', rs_08.pvalue)
rankSum_decision(0.05, rs_08.pvalue)


ECOM_storage_spend
z statistic:  64.87267316648206
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [120]:
print('ECOM_software_spend')
rs_09 = ranksums(x9_cont, y_cont)
print('z statistic: ', rs_09.statistic)
print('p-value: ', rs_09.pvalue)
rankSum_decision(0.05, rs_09.pvalue)


ECOM_software_spend
z statistic:  64.06599448711816
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [121]:
print('ECOM_other_spend')
rs_10 = ranksums(x10_cont, y_cont)
print('z statistic: ', rs_10.statistic)
print('p-value: ', rs_10.pvalue)
rankSum_decision(0.05, rs_10.pvalue)


ECOM_other_spend
z statistic:  69.56223877602092
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [122]:
print('ECOM_totalspend')
rs_11 = ranksums(x11_cont, y_cont)
print('z statistic: ', rs_11.statistic)
print('p-value: ', rs_11.pvalue)
rankSum_decision(0.05, rs_11.pvalue)


ECOM_totalspend
z statistic:  57.57007054696351
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [123]:
print('PURCH_clothing')
rs_12 = ranksums(x12_cont, y_cont)
print('z statistic: ', rs_12.statistic)
print('p-value: ', rs_12.pvalue)
rankSum_decision(0.05, rs_12.pvalue)


PURCH_clothing
z statistic:  140.40726524169006
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [124]:
print('RIDE_spend')
rs_13 = ranksums(x13_cont, y_cont)
print('z statistic: ', rs_13.statistic)
print('p-value: ', rs_13.pvalue)
rankSum_decision(0.05, rs_13.pvalue)


RIDE_spend
z statistic:  140.66604327033588
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [125]:
print('SRVC_total_spend')
rs_14 = ranksums(x14_cont, y_cont)
print('z statistic: ', rs_14.statistic)
print('p-value: ', rs_14.pvalue)
rankSum_decision(0.05, rs_14.pvalue)


SRVC_total_spend
z statistic:  137.1296159808691
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.


In [126]:
print('Average_spending_online')
rs_15 = ranksums(x15_cont, y_cont)
print('z statistic: ', rs_15.statistic)
print('p-value: ', rs_15.pvalue)
rankSum_decision(0.05, rs_15.pvalue)


Average_spending_online
z statistic:  129.1700910236407
p-value:  0.0
Reject the null hypothesis: the two samples come from different distributions.
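
The fifteen cells above repeat the same steps for each column. As a sketch, the same calls can be collected into one summary table with a loop. The helper name `ranksum_table`, the output column names, and the toy dataframe are assumptions made for illustration; the call to `ranksums` mirrors the cells above (each column against the target column).

```python
import numpy as np
import pandas as pd
from scipy.stats import ranksums


def ranksum_table(frame, target_col, alpha=0.05):
    """Run ranksums(column, target) for every non-target column.

    Hypothetical helper: returns one row per column with the statistic,
    the p-value, and whether the null hypothesis is rejected at `alpha`.
    """
    rows = []
    for col in frame.columns.drop(target_col):
        res = ranksums(frame[col], frame[target_col])
        rows.append({
            "variable": col,
            "z_statistic": res.statistic,
            "p_value": res.pvalue,
            "reject_null": res.pvalue < alpha,
        })
    return pd.DataFrame(rows)


# Toy data (made up for illustration only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "target": rng.integers(0, 2, size=100),
    "spend_a": rng.integers(0, 500, size=100),
    "spend_b": rng.integers(0, 50, size=100),
})

summary = ranksum_table(toy, "target")
print(summary)
```

In the notebook this would be called as `ranksum_table(df_allContinous, 'SM_USE_banking')`, assuming the helper is defined in an earlier cell.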


In [128]:
df_allContinous.to_csv('cont_data.csv')