3.	Build statistical model(s) – also known as econometric model(s) – to explain how a set of explanatory variables each affect annual net savings. Comment on the results. Which explanatory variables have robust statistically significant effects in your model(s)? Do these results make sense?

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

import statsmodels.api as sm


Build a simple regression model: ordinary least squares (OLS), in which net savings is represented as a linear combination of the independent variables and the coefficients are estimated by finding the best fit. In modelling it this way, we make the following assumptions:

1) linear relationship between net savings and independent (explanatory) variables 
2) residuals (errors) are normally distributed 
3) variance of the residuals is constant 
4) residuals are independent of each other and of the explanatory variables
5) no multicollinearity -- we know from our previous correlation analysis that some of the variables are *highly correlated*: a) `net_salary`, `observed_income`, `other_income`, `observed_surplus`; b) `age`, `savings_bal_lbg`. Keep `observed_income` and `age`, and drop the others.  

In [12]:
# read the cleaned dataset 
df = pd.read_pickle('data/customer_data_modified.pkl')

In [13]:
# drop the variables we don't need (this step is currently commented out, so all columns are kept)
# df = df_clean.drop(columns=['cust_unique_id', 'gender', 'ethnicity_group', 'geo_region', 'net_salary', 'other_income', 'observed_surplus', 'savings_bal_lbg'])

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4972 entries, 7 to 30304
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   cust_unique_id          4972 non-null   int64  
 1   age                     4972 non-null   int64  
 2   gender                  4972 non-null   object 
 3   gender_cat              4972 non-null   int64  
 4   ethnicity_group         4972 non-null   object 
 5   ethnicity_group_cat     4972 non-null   int64  
 6   geo_region              4972 non-null   object 
 7   geo_region_cat          4972 non-null   int64  
 8   net_salary              4972 non-null   float64
 9   other_income            4972 non-null   float64
 10  observed_income         4972 non-null   float64
 11  housing_spend           4972 non-null   float64
 12  childcare_spend         4972 non-null   float64
 13  gambling_spend          4972 non-null   float64
 14  observed_surplus        4972 non-null   floa

In [15]:
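# function to perform an OLS fit and return the fitted model

def regression_model(df, dependent_var, independent_vars):
    """Fit an OLS regression of `dependent_var` on `independent_vars`."""
    X = df[independent_vars]
    Y = df[dependent_var]

    # add a constant column so that the model estimates an intercept
    X = sm.add_constant(X)

    # fit by ordinary least squares; model.predict(X) returns the fitted values
    model = sm.OLS(Y, X).fit()

    return model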


In [20]:
X = ['age', 'gender_cat', 'ethnicity_group_cat', 'geo_region_cat', 'net_salary', 'other_income', 'housing_spend', 'childcare_spend', 'gambling_spend', 'credit_score']
model1 = regression_model(df, 'annual_net_savings_lbg', X)
print(model1.summary())

                              OLS Regression Results                              
Dep. Variable:     annual_net_savings_lbg   R-squared:                       0.271
Model:                                OLS   Adj. R-squared:                  0.269
Method:                     Least Squares   F-statistic:                     184.3
Date:                    Thu, 07 Dec 2023   Prob (F-statistic):               0.00
Time:                            08:27:24   Log-Likelihood:                -52049.
No. Observations:                    4972   AIC:                         1.041e+05
Df Residuals:                        4961   BIC:                         1.042e+05
Df Model:                              10                                         
Covariance Type:                nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------

# Summary:
* `model1` uses the set of independent variables shown above. 
* Adj. R^2: the model explains about 27% of the variance in annual net savings. 
* The F-statistic, together with Prob (F-statistic), strongly rejects the null hypothesis that all slope coefficients are zero, so the independent variables are jointly significant. 
* Age has the highest coefficient, with a p-value of effectively 0, indicating a robust and highly statistically significant effect on annual net savings. 
* `net_salary`, `other_income`, `housing_spend`, `childcare_spend` and `gambling_spend` also affect net savings with high statistical significance. 
* On the other hand, the categorical variables `gender`, `ethnicity` and `geo_region` have large coefficients (a large apparent "effect" on net savings), but these are not statistically significant. 

In [21]:
# let's create another model by dropping the variables that don't have a statistically significant effect on net savings 
X = ['age', 'net_salary', 'other_income', 'housing_spend', 'childcare_spend', 'gambling_spend']
model2 = regression_model(df, 'annual_net_savings_lbg', X)
print(model2.summary())

                              OLS Regression Results                              
Dep. Variable:     annual_net_savings_lbg   R-squared:                       0.270
Model:                                OLS   Adj. R-squared:                  0.269
Method:                     Least Squares   F-statistic:                     306.5
Date:                    Thu, 07 Dec 2023   Prob (F-statistic):               0.00
Time:                            08:30:15   Log-Likelihood:                -52051.
No. Observations:                    4972   AIC:                         1.041e+05
Df Residuals:                        4965   BIC:                         1.042e+05
Df Model:                               6                                         
Covariance Type:                nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
co

* The adjusted R^2 stays the same (0.269), which is consistent with the dropped variables having no significant effect on net savings.  

# Caveat/Drawback:

Despite the above results and insights, keep in mind that this model has the following drawbacks:
* The condition number, which measures how sensitive the estimated coefficients are to small changes in the input data, is very high, i.e. a small fluctuation in the inputs can cause a large fluctuation in the estimates. This is likely due to multicollinearity, which we already know about from our correlation matrix; large differences in scale between the variables can also inflate it. We only dropped the most significant correlations, yet OLS assumes no multicollinearity. For cases like this, a Principal Component Analysis followed by a linear regression would be the next approach to try. 

* Prob(Omnibus) and Prob(JB) also suggest that the assumption of normally distributed residuals is not valid.