## Statistical tests to derive relationships

In order to derive relationships between variables, two types of statistical tests were performed:

- Test of difference in proportion
- Test of independence

For the test of difference in proportion, the Wilcoxon signed-rank test was used; the objective was to identify the features whose responses changed significantly.

For the test of independence, the Chi-square test of independence was used to identify features that have a significant relationship with their target variables.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data_before=pd.read_excel('Full_before_knn.xlsx')

In [3]:
data_present=pd.read_excel('Full_present_knn.xlsx')

### Test of proportion

Wilcoxon signed-rank test

This test is used when the two samples are paired, for example two measurements taken on the same respondents.
Here each respondent answered the same question for both periods, so the observations are independent across respondents but matched within each pair.

    H0 : No significant difference in responses by individuals for a question before and during the pandemic

    Hn : Significant difference in responses by individuals for a question before and during the pandemic
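
To make the hypotheses concrete, here is a minimal sketch of the test on toy paired data. The arrays are hypothetical and do not come from the survey; only `scipy.stats.wilcoxon` is taken from the notebook.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired responses on a 1-5 scale: one value per respondent,
# before and during the pandemic (toy data, not from the survey).
before = np.array([1, 2, 2, 3, 1, 2, 3, 2, 1, 2, 3, 2])
present = np.array([3, 4, 3, 5, 2, 4, 4, 5, 3, 3, 5, 4])

# The test ranks the absolute paired differences and compares the rank sums
# of the positive and negative differences.
statistic, p_value = wilcoxon(before, present)

# Every toy respondent moved up the scale, so one rank sum is 0
# and H0 is rejected at the 5% level.
reject_h0 = p_value < 0.05
print(statistic, p_value, reject_h0)
```

Because the test works on paired differences, both arrays must have the same length and the same respondent order.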

In [4]:
# Preprocessed data is used here.
# Features that follow the same scale in both dataframes are paired in a dictionary (before column -> present column)

dict_columns={'Emp_stat_Before':'Emp_stat_Present',
              'Notice_things_Before':'Notice_things_Present',
             'All_angles_Before':'All_angles_Present',
             'Sincere_prod_Before':'Sincere_prod_Present',
             'Lonely_Before':'Lonely_Present',
              'Worry_health_Before':'Worry_health_Present',
              'Charity_Before':'Charity_Present',
              'New_env_Before':'New_env_Present',
              'Meeting_ppl_Before':'Meeting_ppl_Present',
              'Hob_interests_Before':'Hob_interests_Present',
              'Surveys_Before':'Surveys_Present',
              'Spent_onli_Before':'Spent_onli_Present',
              'WFH_office_Before':'WFH_office_Present',
              'Enthu_Before':'Ethu_Present',
              'Smoking_hab_Before':'Smoking_hab_Present',
              'Drinking_hab_Before':'Drinking_hab_Present',
              'Sleeping_hab_Before':'Sleeping_hab_Present',
              'Healthy_Lifestyle_Before':'Healthy_Lifestyle_Present',
              'Medi_pref_Before':'Medi_pref_Present',
              'Pol_interest_Before':'Pol_interest_Present',
              'Internet_interest_Before':'Internet_interest_Present',
              'Economy_Manag_intrst_Before':'Economy_Manag_intrst_Present',
              'Medicine_intrst_Before':'Medicine_intrst_Present',
              'Religion_intrst_Before':'Religion_intrst_Present',
              'Save_all_money_Before':'Save_all_money_Present',
              'Brand_non-brand_Before':'Brand_non-brand_Present',
              'Food_pref_Before':'Food_pref_Present',
              'Mode_of_transport_Before':'Mode_of_transport_Present',
              'Basic_medications_Before':'Basic_medications_Present',
              'Digital_content_Before':'Digital_content_Present',
              'Active_Investor_Before':'Active_Investor_Present',
              'Edu_instit_fee_Before':'Edu_instit_fee_Present',
              'Spend_on_gadgets_Before':'Spend_on_gadgets_Present',
              'spend_on_Luxury_Before':'Spend_on_Luxury_Present',
              'Track_Household_exp_Before':'Track_Household_exp_Present',
              'Spending_kind_Before':'Spending_kind_Present'
              }

In [5]:
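from scipy.stats import wilcoxon,chi2_contingency

# collect the "before" names of features whose paired responses differ significantly
diff_in_prop=[]
for item in dict_columns:
    # paired test: before column vs. its present counterpart
    stats,p=wilcoxon(data_before[item],data_present[dict_columns[item]])
    
    # reject H0 at the 5% significance level
    if p<0.05:
        diff_in_prop.append(item)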
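
The loop above keeps only the names of the significant features. A possible variant, sketched here on toy data, also records the statistic and p-value of every pair so the results can be inspected or sorted later. The helper name `paired_wilcoxon_table`, the toy column names, and the guard for identical columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import wilcoxon

def paired_wilcoxon_table(before_df, present_df, pairs, alpha=0.05):
    """Run one Wilcoxon signed-rank test per (before, present) column pair."""
    rows = []
    for before_col, present_col in pairs.items():
        x = before_df[before_col].to_numpy()
        y = present_df[present_col].to_numpy()
        if np.all(x == y):
            # No non-zero differences: the test is undefined, so record NaN.
            statistic, p_value = np.nan, np.nan
        else:
            statistic, p_value = wilcoxon(x, y)
        # A NaN p-value compares as False, so such pairs are never flagged.
        rows.append({"feature": before_col, "statistic": statistic,
                     "p_value": p_value, "significant": p_value < alpha})
    return pd.DataFrame(rows)

# Toy data with hypothetical column names.
toy_before = pd.DataFrame({"A_Before": [1, 2, 2, 3, 1, 2, 3, 2, 1, 2],
                           "B_Before": [3, 3, 4, 2, 5, 1, 2, 4, 3, 3]})
toy_present = pd.DataFrame({"A_Present": [3, 4, 3, 5, 2, 4, 4, 5, 3, 3],
                            "B_Present": [3, 3, 4, 2, 5, 1, 2, 4, 3, 3]})

result = paired_wilcoxon_table(
    toy_before, toy_present,
    {"A_Before": "A_Present", "B_Before": "B_Present"},
)
print(result)
```

In the toy data, `A` shifts upward for every respondent and is flagged, while `B` is unchanged and is reported with a missing p-value instead of raising an error.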

In [6]:
# Below are the 29 features whose responses differ significantly
diff_in_prop 

# the feature names are taken from the before dataset

['Emp_stat_Before',
 'Notice_things_Before',
 'All_angles_Before',
 'Sincere_prod_Before',
 'Lonely_Before',
 'Worry_health_Before',
 'Charity_Before',
 'New_env_Before',
 'Meeting_ppl_Before',
 'Spent_onli_Before',
 'WFH_office_Before',
 'Enthu_Before',
 'Smoking_hab_Before',
 'Drinking_hab_Before',
 'Sleeping_hab_Before',
 'Medi_pref_Before',
 'Pol_interest_Before',
 'Internet_interest_Before',
 'Economy_Manag_intrst_Before',
 'Medicine_intrst_Before',
 'Save_all_money_Before',
 'Food_pref_Before',
 'Mode_of_transport_Before',
 'Basic_medications_Before',
 'Digital_content_Before',
 'Edu_instit_fee_Before',
 'spend_on_Luxury_Before',
 'Track_Household_exp_Before',
 'Spending_kind_Before']

###  Test of independence


Chi-square test of independence with the hypotheses below

    H0 : No significant relationship between the variables. The variables are independent.
    
    Hn : A relationship between the variables exists.
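
As a small illustration of these hypotheses, the sketch below builds a contingency table from toy data and runs `chi2_contingency` on it. The column names and category labels are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (hypothetical): a feature that is strongly tied to the target.
toy = pd.DataFrame({
    "feature": ["low"] * 30 + ["high"] * 30,
    "target": ["saver"] * 25 + ["spender"] * 5 + ["saver"] * 5 + ["spender"] * 25,
})

# Observed counts for every (feature, target) combination.
table = pd.crosstab(index=toy["feature"], columns=toy["target"])

# The test compares the observed counts with the counts expected
# if the two variables were independent.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(chi2, p_value, dof)
```

For a 2x2 table the degrees of freedom are 1, and a p-value below 0.05 leads to rejecting H0 in favour of a relationship between the variables.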

In [7]:
sigini_vars=[]

for i in data_before.columns:
    #contingency table
    table=pd.crosstab(index=data_before[i],columns=data_before['Spending_kind_Before'])
    
    #chi-square test of independence
    stats,p_value,dof,ideal_exp=chi2_contingency(table)
    
    if p_value<0.05:
        sigini_vars.append(i)
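
The loop above also tests the target column against itself, which is why the target appears in the output list. A possible variant, sketched on toy data, skips the target and keeps the p-value of each selected feature. The helper name `chi2_screen` and the toy columns are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_screen(df, target, alpha=0.05):
    """Return {feature: p_value} for features significantly associated with target."""
    significant = {}
    for col in df.columns:
        if col == target:
            continue  # a column is trivially dependent on itself
        table = pd.crosstab(index=df[col], columns=df[target])
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < alpha:
            significant[col] = p_value
    return significant

# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "related": ["a"] * 30 + ["b"] * 30,
    "unrelated": ["x", "y"] * 30,
    "target": ["t1"] * 25 + ["t2"] * 5 + ["t1"] * 5 + ["t2"] * 25,
})

selected = chi2_screen(toy, target="target")
print(selected)
```

Only `related` is selected in the toy data; `unrelated` is spread almost evenly across the target classes.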


In [8]:
# Below are the 25 features (excluding the target variable) that have a significant relationship with the target variable

# These features could be used for feature selection during model building (possible application)
sigini_vars

['Income',
 'Gender',
 'Notice_things_Before',
 'All_angles_Before',
 'Lonely_Before',
 'Charity_Before',
 'Meeting_ppl_Before',
 'Spent_onli_Before',
 'WFH_office_Before',
 'Smoking_hab_Before',
 'Drinking_hab_Before',
 'Healthy_Lifestyle_Before',
 'Internet_interest_Before',
 'Religion_intrst_Before',
 'Save_all_money_Before',
 'Brand_non-brand_Before',
 'Food_pref_Before',
 'Basic_medications_Before',
 'Digital_content_Before',
 'Active_Investor_Before',
 'Edu_instit_fee_Before',
 'Spend_on_gadgets_Before',
 'spend_on_Luxury_Before',
 'Track_Household_exp_Before',
 'Domestic_help_Before',
 'Spending_kind_Before']

In [9]:
sigini_vars_pre=[]

for i in data_present.columns:
    #contingency table
    table=pd.crosstab(index=data_present[i],columns=data_present['Spending_kind_Present'])
    
    #chi-square test of independence
    stats,p_value,dof,ideal_exp=chi2_contingency(table)
    
    if p_value<0.05:
        sigini_vars_pre.append(i)


In [10]:
# Below are the 17 features (excluding the target variable) that have a significant relationship with the target variable

# These features could be used for feature selection during model building (possible application)
sigini_vars_pre

['Emp_stat_Present',
 'Gender',
 'All_angles_Present',
 'Hob_interests_Present',
 'Income_Change',
 'Smoking_hab_Present',
 'Internet_interest_Present',
 'Save_all_money_Present',
 'Brand_non-brand_Present',
 'Food_pref_Present',
 'Basic_medications_Present',
 'Digital_content_Present',
 'Active_Investor_Present',
 'Edu_instit_fee_Present',
 'Spend_on_gadgets_Present',
 'Spend_on_Luxury_Present',
 'Track_Household_exp_Present',
 'Spending_kind_Present']

In [11]:
# We also wanted to check for multicollinearity in the data, so the test was run on every pair of features
# (overlap keeps only the last significant partner found for each feature)

overlap={}

for i in data_before.columns:
    for j in data_before.columns:
        table=pd.crosstab(index=data_before[i],columns=data_before[j])
        
        stats,p_value,dof,ideal_exp=chi2_contingency(table)
        
        if p_value<0.05:
            overlap[i]=j

In [12]:
# Many features are related to the target feature, so entries whose recorded partner is the target are filtered out
for item in overlap:
    if overlap[item] != 'Spending_kind_Before' :
        print(item,':',overlap[item])
        print()

Age : Domestic_help_Before

Emp_stat_Before : Domestic_help_Before

Marital_status : Domestic_help_Before

Loc : Domestic_help_Before

Sincere_prod_Before : Track_Household_exp_Before

Worry_health_Before : spend_on_Luxury_Before

New_env_Before : Spend_on_gadgets_Before

Hob_interests_Before : Edu_instit_fee_Before

Surveys_Before : Track_Household_exp_Before

Enthu_Before : Track_Household_exp_Before

Sleeping_hab_Before : Track_Household_exp_Before

Medi_pref_Before : Food_pref_Before

Pol_interest_Before : spend_on_Luxury_Before

Economy_Manag_intrst_Before : Edu_instit_fee_Before

Medicine_intrst_Before : Edu_instit_fee_Before

Mode_of_transport_Before : Spend_on_gadgets_Before



- We can observe that most of the features are related to domestic help, household expense tracking and gadget spending.

- Hence, this suggests that there is very little multicollinearity in the before data.
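
The pairwise loop above stores one entry per feature, so `overlap[i]` ends up holding only the last column `j` that tested significant, and each column is also tested against itself. A minimal sketch, assuming toy data and a hypothetical helper name `associated_pairs`, that keeps every significant pair instead:

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def associated_pairs(df, alpha=0.05):
    """List every unordered pair of distinct columns with p_value < alpha."""
    pairs = []
    # combinations() visits each pair once and never pairs a column with itself.
    for col_a, col_b in combinations(df.columns, 2):
        table = pd.crosstab(index=df[col_a], columns=df[col_b])
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < alpha:
            pairs.append((col_a, col_b, p_value))
    return pairs

# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "related": ["a"] * 30 + ["b"] * 30,
    "unrelated": ["x", "y"] * 30,
    "target": ["t1"] * 25 + ["t2"] * 5 + ["t1"] * 5 + ["t2"] * 25,
})

pairs = associated_pairs(toy)
for col_a, col_b, p_value in pairs:
    print(col_a, ":", col_b, p_value)
```

Counting how many pairs each feature takes part in gives a fuller picture of overlap between features than a single stored partner.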

In [13]:
#for present dataset
overlap_pre={}

for i in data_present.columns:
    for j in data_present.columns:
        table=pd.crosstab(index=data_present[i],columns=data_present[j])
        
        stats,p_value,dof,ideal_exp=chi2_contingency(table)
        
        if p_value<0.05:
            overlap_pre[i]=j

In [14]:
# Many features are related to the target feature, so entries whose recorded partner is the target are filtered out
for item in overlap_pre:
    if overlap_pre[item] != 'Spending_kind_Present' :
        print(item,':',overlap_pre[item])
        print()

Age : Domestic_help_Present

Income : Domestic_help_Present

Marital_status : Domestic_help_Present

Loc : Domestic_help_Present

Notice_things_Present : Track_Household_exp_Present

Sincere_prod_Present : Edu_instit_fee_Present

Lonely_Present : Spend_on_gadgets_Present

Worry_health_Present : Spend_on_gadgets_Present

Charity_Present : Spend_on_gadgets_Present

New_env_Present : Spend_on_gadgets_Present

Meeting_ppl_Present : Track_Household_exp_Present

Surveys_Present : Spend_on_Luxury_Present

Spent_onli_Present : Spend_on_gadgets_Present

WFH_office_Present : Edu_instit_fee_Present

Ethu_Present : Track_Household_exp_Present

Drinking_hab_Present : Domestic_help_Present

Sleeping_hab_Present : Edu_instit_fee_Present

Healthy_Lifestyle_Present : Spend_on_Luxury_Present

Medi_pref_Present : Economy_Manag_intrst_Present

Pol_interest_Present : Spend_on_Luxury_Present

Economy_Manag_intrst_Present : Domestic_help_Present

Medicine_intrst_Present : Edu_instit_fee_Present

Religion_int

- We can observe that most of the features are related to domestic help, household expense tracking and gadget spending.

- Hence, this suggests that there is very little multicollinearity in the present data.

## For MICE (R) data

In [15]:
#similar steps were performed

data_before_mice=pd.read_excel('Before_mice.xlsx')

In [16]:
data_present_mice=pd.read_excel('Present_mice.xlsx')

In [17]:
# A few column names are different in these files, so the dictionary had to be defined again
dict_columns_mice={'Emp_stat_Before':'Emp_stat_Present',
              'Notice_things_Before':'Notice_things_Present',
             'All_angles_Before':'All_angles_Present',
             'Sincere_prod_Before':'Sincere_prod_Present',
             'Lonely_Before':'Lonely_Present',
              'Worry_health_Before':'Worry_health_Present',
              'Charity_Before':'Charity_Present',
              'New_env_Before':'New_env_Present',
              'Meeting_ppl_Before':'Meeting_ppl_Present',
              'Hob_interests_Before':'Hob_interests_Present',
              'Surveys_Before':'Surveys_Present',
              'Spent_onli_Before':'Spent_onli_Present',
              'WFH_office_Before':'WFH_office_Present',
              'Enthu_Before':'Ethu_Present',
              'Smoking_hab_Before':'Smoking_hab_Present',
              'Drinking_hab_Before':'Drinking_hab_Present',
              'Sleeping_hab_Before':'Sleeping_hab_Present',
              'Healthy_Lifestyle_Before':'Healthy_Lifestyle_Present',
              'Medi_pref_Before':'Medi_pref_Present',
              'Pol_interest_Before':'Pol_interest_Present',
              'Internet_interest_Before':'Internet_interest_Present',
              'Economy_Manag_intrst_Before':'Economy_Manag_intrst_Present',
              'Medicine_intrst_Before':'Medicine_intrst_Present',
              'Religion_intrst_Before':'Religion_intrst_Present',
              'Save_all_money_Before':'Save_all_money_Present',
              'Brand_non_brand_Before':'Brand_non_brand_Present',
              'Food_pref_Before':'Food_pref_Present',
              'Mode_of_transport_Before':'Mode_of_transport_Present',
              'Basic_medications_Before':'Basic_medications_Present',
              'Digital_content_Before':'Digital_content_Present',
              'Active_Investor_Before':'Active_Investor_Present',
              'Edu_instit_fee_Before':'Edu_instit_fee_Present',
              'Spend_on_gadgets_Before':'Spend_on_gadgets_Present',
              'spend_on_Luxury_Before':'Spend_on_Luxury_Present',
              'Track_Household_exp_Before':'Track_Household_exp_Present',
              'Spending_kind_Before':'Spending_kind_Present'
              }

In [18]:
diff_in_prop_mice=[]
for item in dict_columns_mice:
    stats,p=wilcoxon(data_before_mice[item],data_present_mice[dict_columns_mice[item]])
    
    if p<0.05:
        diff_in_prop_mice.append(item)

In [19]:
# Below are the 29 features whose responses differ significantly
# (this cell displays diff_in_prop; the MICE result is stored in diff_in_prop_mice)
diff_in_prop 

#took their names from before dataset

['Emp_stat_Before',
 'Notice_things_Before',
 'All_angles_Before',
 'Sincere_prod_Before',
 'Lonely_Before',
 'Worry_health_Before',
 'Charity_Before',
 'New_env_Before',
 'Meeting_ppl_Before',
 'Spent_onli_Before',
 'WFH_office_Before',
 'Enthu_Before',
 'Smoking_hab_Before',
 'Drinking_hab_Before',
 'Sleeping_hab_Before',
 'Medi_pref_Before',
 'Pol_interest_Before',
 'Internet_interest_Before',
 'Economy_Manag_intrst_Before',
 'Medicine_intrst_Before',
 'Save_all_money_Before',
 'Food_pref_Before',
 'Mode_of_transport_Before',
 'Basic_medications_Before',
 'Digital_content_Before',
 'Edu_instit_fee_Before',
 'spend_on_Luxury_Before',
 'Track_Household_exp_Before',
 'Spending_kind_Before']

In [20]:
#test of independence

In [21]:
sigini_vars_bef_mice=[]

for i in data_before_mice.columns:
    #contingency table
    table=pd.crosstab(index=data_before_mice[i],columns=data_before_mice['Spending_kind_Before'])
    
    #chi-square test of independence
    stats,p_value,dof,ideal_exp=chi2_contingency(table)
    
    if p_value<0.05:
        sigini_vars_bef_mice.append(i)


In [22]:
# Below are the 25 features (excluding the target variable) that have a significant relationship with the target variable

# These features could be used for feature selection during model building (possible application)
sigini_vars_bef_mice

['Income',
 'Gender',
 'Notice_things_Before',
 'All_angles_Before',
 'Lonely_Before',
 'Charity_Before',
 'Meeting_ppl_Before',
 'Spent_onli_Before',
 'WFH_office_Before',
 'Smoking_hab_Before',
 'Drinking_hab_Before',
 'Healthy_Lifestyle_Before',
 'Internet_interest_Before',
 'Religion_intrst_Before',
 'Save_all_money_Before',
 'Brand_non_brand_Before',
 'Food_pref_Before',
 'Basic_medications_Before',
 'Digital_content_Before',
 'Active_Investor_Before',
 'Edu_instit_fee_Before',
 'Spend_on_gadgets_Before',
 'spend_on_Luxury_Before',
 'Track_Household_exp_Before',
 'Domestic_help_Before',
 'Spending_kind_Before']

In [23]:
# We also wanted to check for multicollinearity in the data, so the test was run on every pair of features
# (overlap_mice keeps only the last significant partner found for each feature)

overlap_mice={}

for i in data_before_mice.columns:
    for j in data_before_mice.columns:
        table=pd.crosstab(index=data_before_mice[i],columns=data_before_mice[j])
        
        stats,p_value,dof,ideal_exp=chi2_contingency(table)
        
        if p_value<0.05:
            overlap_mice[i]=j

In [24]:
# Many features are related to the target feature, so entries whose recorded partner is the target are filtered out
for item in overlap_mice:
    if overlap_mice[item] != 'Spending_kind_Before' :
        print(item,':',overlap_mice[item])
        print()

Age : Domestic_help_Before

Emp_stat_Before : Domestic_help_Before

Marital_status : Domestic_help_Before

Loc : Domestic_help_Before

Sincere_prod_Before : Track_Household_exp_Before

Worry_health_Before : spend_on_Luxury_Before

New_env_Before : Spend_on_gadgets_Before

Hob_interests_Before : Edu_instit_fee_Before

Surveys_Before : Track_Household_exp_Before

Enthu_Before : Track_Household_exp_Before

Sleeping_hab_Before : Track_Household_exp_Before

Medi_pref_Before : Medi_pref_Before

Pol_interest_Before : spend_on_Luxury_Before

Economy_Manag_intrst_Before : Edu_instit_fee_Before

Medicine_intrst_Before : Edu_instit_fee_Before

Mode_of_transport_Before : Spend_on_gadgets_Before



In [25]:
sigini_vars_pre_mice=[]

for i in data_present_mice.columns:
    #contingency table
    table=pd.crosstab(index=data_present_mice[i],columns=data_present_mice['Spending_kind_Present'])
    
    #chi-square test of independence
    stats,p_value,dof,ideal_exp=chi2_contingency(table)
    
    if p_value<0.05:
        sigini_vars_pre_mice.append(i)


In [26]:
sigini_vars_pre_mice

['Gender',
 'All_angles_Present',
 'Hob_interests_Present',
 'Income_Change',
 'Smoking_hab_Present',
 'Internet_interest_Present',
 'Save_all_money_Present',
 'Brand_non_brand_Present',
 'Food_pref_Present',
 'Basic_medications_Present',
 'Digital_content_Present',
 'Active_Investor_Present',
 'Edu_instit_fee_Present',
 'Spend_on_gadgets_Present',
 'Spend_on_Luxury_Present',
 'Track_Household_exp_Present',
 'Spending_kind_Present']

In [27]:
#for present dataset
overlap_pre_mice={}

for i in data_present_mice.columns:
    for j in data_present_mice.columns:
        table=pd.crosstab(index=data_present_mice[i],columns=data_present_mice[j])
        
        stats,p_value,dof,ideal_exp=chi2_contingency(table)
        
        if p_value<0.05:
            overlap_pre_mice[i]=j

In [29]:
# Many features are related to the target feature, so entries whose recorded partner is the target are filtered out
for item in overlap_pre_mice:
    if overlap_pre_mice[item] != 'Spending_kind_Present' :
        print(item,':',overlap_pre_mice[item])
        print()

Age : Domestic_help_Present

Income : Domestic_help_Present

Emp_stat_Present : Domestic_help_Present

Marital_status : Domestic_help_Present

Loc : Domestic_help_Present

Notice_things_Present : Track_Household_exp_Present

Sincere_prod_Present : Edu_instit_fee_Present

Lonely_Present : Spend_on_gadgets_Present

Worry_health_Present : Spend_on_gadgets_Present

Charity_Present : Spend_on_gadgets_Present

New_env_Present : Spend_on_Luxury_Present

Meeting_ppl_Present : Track_Household_exp_Present

Surveys_Present : Spend_on_Luxury_Present

Spent_onli_Present : Spend_on_gadgets_Present

WFH_office_Present : Edu_instit_fee_Present

Ethu_Present : Track_Household_exp_Present

Drinking_hab_Present : Domestic_help_Present

Sleeping_hab_Present : Edu_instit_fee_Present

Healthy_Lifestyle_Present : Spend_on_Luxury_Present

Medi_pref_Present : Spend_on_Luxury_Present

Pol_interest_Present : Domestic_help_Present

Economy_Manag_intrst_Present : Edu_instit_fee_Present

Medicine_intrst_Present : E

Although the two datasets were imputed with different techniques (KNN and MICE), the statistical tests give similar results, which may be because only a very small percentage of values were null and had to be imputed.
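
One way to make "similar" concrete is to compare the sets of significant features obtained from the two imputations. The sketch below uses the two present-data lists printed earlier in this notebook; the `normalize` helper and the Jaccard index are illustrative choices, not part of the original analysis.

```python
def normalize(names):
    """Unify the hyphen/underscore spelling so equal features compare equal."""
    return {name.replace("-", "_") for name in names}

# Significant features for the present data, KNN-imputed file.
knn = normalize([
    "Emp_stat_Present", "Gender", "All_angles_Present", "Hob_interests_Present",
    "Income_Change", "Smoking_hab_Present", "Internet_interest_Present",
    "Save_all_money_Present", "Brand_non-brand_Present", "Food_pref_Present",
    "Basic_medications_Present", "Digital_content_Present",
    "Active_Investor_Present", "Edu_instit_fee_Present",
    "Spend_on_gadgets_Present", "Spend_on_Luxury_Present",
    "Track_Household_exp_Present", "Spending_kind_Present",
])

# Significant features for the present data, MICE-imputed file.
mice = normalize([
    "Gender", "All_angles_Present", "Hob_interests_Present",
    "Income_Change", "Smoking_hab_Present", "Internet_interest_Present",
    "Save_all_money_Present", "Brand_non_brand_Present", "Food_pref_Present",
    "Basic_medications_Present", "Digital_content_Present",
    "Active_Investor_Present", "Edu_instit_fee_Present",
    "Spend_on_gadgets_Present", "Spend_on_Luxury_Present",
    "Track_Household_exp_Present", "Spending_kind_Present",
])

shared = knn & mice
# Jaccard index: size of the intersection divided by size of the union.
jaccard = len(shared) / len(knn | mice)

print("only in KNN :", sorted(knn - mice))
print("only in MICE:", sorted(mice - knn))
print("Jaccard     :", round(jaccard, 3))
```

For these two lists the only difference is `Emp_stat_Present`, which is significant in the KNN-imputed data but not in the MICE-imputed data.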