# Lab 5 by Nicholas Fong, worked with Adrian Chavez

In [1]:
import os
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn import feature_selection, linear_model

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

In [2]:
df = pd.read_csv(os.path.join('credit.csv'))

In [3]:
df.head()

Unnamed: 0,Income,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,14.891,283,2,34,11,Male,No,Yes,Caucasian,333
1,106.025,483,3,82,15,Female,Yes,Yes,Asian,903
2,104.593,514,4,71,11,Male,No,No,Asian,580
3,148.924,681,3,36,11,Female,No,No,Asian,964
4,55.882,357,2,68,16,Male,No,Yes,Caucasian,331


A description of the dataset is as follows:

- Income (in thousands of dollars)
- Rating: Credit score rating
- Cards: Number of Credit cards owned
- Age
- Education: Years of Education
- Gender: Male/Female
- Student: Yes/No
- Married: Yes/No
- Ethnicity: African American/Asian/Caucasian
- Balance: Average credit card debt

## Question 1: Let's explore the quantitative variables that affect `Balance`.  From your preliminary analysis, which 2 variables seem to affect `Balance` the most?  Our goal is interpretation; can we use these 2 variables simultaneously?  Why or why not?

In [4]:
df.corr()

Unnamed: 0,Income,Rating,Cards,Age,Education,Balance
Income,1.0,0.791378,-0.018273,0.175338,-0.027692,0.463656
Rating,0.791378,1.0,0.053239,0.103165,-0.030136,0.863625
Cards,-0.018273,0.053239,1.0,0.042948,-0.051084,0.086456
Age,0.175338,0.103165,0.042948,1.0,0.003619,0.001835
Education,-0.027692,-0.030136,-0.051084,0.003619,1.0,-0.008062
Balance,0.463656,0.863625,0.086456,0.001835,-0.008062,1.0


In [5]:
# In formula syntax, Income * Rating already expands to Income + Rating + Income:Rating
model = smf.ols(formula = 'Balance ~ Income * Rating', data = df).fit()
model.summary()

0,1,2,3
Dep. Variable:,Balance,R-squared:,0.878
Model:,OLS,Adj. R-squared:,0.877
Method:,Least Squares,F-statistic:,946.9
Date:,"Mon, 16 May 2016",Prob (F-statistic):,3.3e-180
Time:,17:35:41,Log-Likelihood:,-2599.2
No. Observations:,400,AIC:,5206.0
Df Residuals:,396,BIC:,5222.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,-461.8984,33.272,-13.883,0.000,-527.310 -396.487
Income,-9.6035,0.772,-12.442,0.000,-11.121 -8.086
Rating,3.7952,0.101,37.589,0.000,3.597 3.994
Income:Rating,0.0034,0.001,2.863,0.004,0.001 0.006

0,1,2,3
Omnibus:,102.124,Durbin-Watson:,1.903
Prob(Omnibus):,0.0,Jarque-Bera (JB):,195.908
Skew:,1.402,Prob(JB):,2.88e-43
Kurtosis:,4.972,Cond. No.,139000.0


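Answer: The 2 variables that seem to affect Balance the most are Rating and Income, which have the highest correlations with Balance (0.864 and 0.464). When we run a model with both of them, every term is statistically significant, so both carry information about Balance. However, Income and Rating are also strongly correlated with each other (0.791), so if the goal is interpretation the individual coefficients should be read with caution: Income's coefficient is negative (-9.6035) in the joint model even though its correlation with Balance is positive.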
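
One way to gauge how much the overlap between Income and Rating matters is the variance inflation factor (VIF). The sketch below assumes a model with only these two predictors, in which case the VIF reduces to `1 / (1 - r^2)`, and copies `r` from the `df.corr()` output above. VIF is not part of the lab itself; it is added here as an illustration.

```python
# Correlation between Income and Rating, copied from the df.corr() output
r = 0.791378

# With exactly two predictors, regressing one on the other gives R^2 = r^2
r_squared = r ** 2
vif = 1.0 / (1.0 - r_squared)

print('shared variance: {:.3f}'.format(r_squared))
print('VIF: {:.2f}'.format(vif))
```

A VIF of about 2.7 means the variance of each coefficient estimate is inflated by roughly that factor compared with uncorrelated predictors.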

## Question 2: `Race`, `Gender`, `Married`, and `Student` are categorical variables.  Go ahead and create dummy variables for all of them.

In [6]:
race_df = pd.get_dummies(df.Ethnicity, prefix = 'Ethnicity')
df = df.join([race_df])
gender_df = pd.get_dummies(df.Gender, prefix = 'Gender')
df = df.join([gender_df])
df = df.join([pd.get_dummies(df.Married, prefix = 'Married')])
df = df.join([pd.get_dummies(df.Student, prefix = 'Student')])
df = df.rename(columns={'Ethnicity_African American': 'Ethnicity_AfricanAmerican'})
df.columns

Index([u'Income', u'Rating', u'Cards', u'Age', u'Education', u'Gender',
       u'Student', u'Married', u'Ethnicity', u'Balance',
       u'Ethnicity_AfricanAmerican', u'Ethnicity_Asian',
       u'Ethnicity_Caucasian', u'Gender_Female', u'Gender_Male', u'Married_No',
       u'Married_Yes', u'Student_No', u'Student_Yes'],
      dtype='object')
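
For comparison, `pd.get_dummies` can also encode several columns in one call, and `drop_first=True` drops one level per variable so the remaining dummies are not perfectly collinear with the intercept. A minimal sketch on a made-up three-row frame (`toy` is hypothetical, not the credit data):

```python
import pandas as pd

# Hypothetical stand-in for the categorical columns of the credit data
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Student': ['No', 'Yes', 'No'],
})

# Encode both columns at once; the first level of each (Female, No) is dropped
dummies = pd.get_dummies(toy, columns=['Gender', 'Student'], drop_first=True)
print(dummies.columns.tolist())
```

The dropped level becomes the reference category, which is the same idea as leaving `Gender_Male` and `Ethnicity_Asian` out of the regression in Question 3.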

## Question 3: Using sklearn and a linear regression, predict `Balance` using `Income`, `Cards`, `Age`, `Education`, `Gender`, and `Race`

First, find the coefficients of your regression line.

In [7]:
x = df[['Income', 'Cards', 'Age', 'Education', 'Gender_Female', 'Ethnicity_Caucasian', 'Ethnicity_AfricanAmerican']]
y = df[['Balance']]
model = linear_model.LinearRegression(fit_intercept = True)
model.fit(x, y)
print('- R^2 = {}'.format(model.score(x, y)))
print('- beta_0 (intercept) = {}'.format(model.intercept_))
# y is a one-column DataFrame, so coef_ has shape (1, n_features)
for name, coef in zip(x.columns, model.coef_[0]):
    print('{} = {}'.format(name, coef))

- R^2 = 0.232312608335
- beta_0 (intercept) = [ 223.49632361]
Income = 6.27995894353
Cards = 33.6295350792
Age = -2.32970547308
Education = 1.64553607303
Gender_Female = 27.1254312316
Ethnicity_Caucasian = 10.021007194
Ethnicity_AfricanAmerican = 6.546030781


Then, find the p-values of your estimates.  You have a few variables, so try to show your p-values alongside the names of the variables.

In [8]:
smfmodel = smf.ols(formula = 'Balance ~ Income + Cards + Age + Education + Gender_Female + Ethnicity_Caucasian + Ethnicity_AfricanAmerican', data = df).fit()
smfmodel.summary()

0,1,2,3
Dep. Variable:,Balance,R-squared:,0.232
Model:,OLS,Adj. R-squared:,0.219
Method:,Least Squares,F-statistic:,16.95
Date:,"Mon, 16 May 2016",Prob (F-statistic):,1.41e-19
Time:,17:35:42,Log-Likelihood:,-2966.5
No. Observations:,400,AIC:,5949.0
Df Residuals:,392,BIC:,5981.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,223.4963,128.463,1.740,0.083,-29.066 476.058
Income,6.2800,0.587,10.696,0.000,5.126 7.434
Cards,33.6295,14.881,2.260,0.024,4.373 62.887
Age,-2.3297,1.202,-1.938,0.053,-4.694 0.034
Education,1.6455,6.527,0.252,0.801,-11.187 14.478
Gender_Female,27.1254,40.695,0.667,0.505,-52.883 107.134
Ethnicity_Caucasian,10.0210,49.582,0.202,0.840,-87.459 107.501
Ethnicity_AfricanAmerican,6.5460,57.531,0.114,0.909,-106.562 119.654

0,1,2,3
Omnibus:,36.209,Durbin-Watson:,1.968
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18.357
Skew:,0.349,Prob(JB):,0.000103
Kurtosis:,2.216,Cond. No.,502.0
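
If only the names and p-values are wanted rather than the full summary, a fitted statsmodels result exposes them as `params` and `pvalues`, both indexed by term name. The sketch below uses synthetic data (`toy`, `x1`, `x2`, and `y` are made-up names, not columns of the credit data) so that it runs on its own:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: y depends on x1 but not on x2
rng = np.random.RandomState(0)
toy = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
toy['y'] = 3.0 * toy['x1'] + rng.normal(size=200)

fit = smf.ols(formula='y ~ x1 + x2', data=toy).fit()

# One row per term: coefficient next to its p-value
table = pd.DataFrame({'coef': fit.params, 'p_value': fit.pvalues})
print(table)
```

Applied to the notebook, the same two attributes on `smfmodel` would give the `coef` and `P>|t|` columns of the table above.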


## Question 4: Which of your coefficients are significant at the 5% significance level?

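Answer: Income (p < 0.001) and Cards (p = 0.024) are significant at the 5% significance level. The intercept (p = 0.083) and Age (p = 0.053) are close to the threshold but do not meet it.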
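
Applying the 5% threshold can be done mechanically. The sketch below copies the `P>|t|` column from the summary table in Question 3 by hand (values shown as 0.000 in the table are entered as 0.0) and keeps the terms below `alpha`:

```python
# p-values copied from the P>|t| column of the summary table
p_values = {
    'Intercept': 0.083, 'Income': 0.000, 'Cards': 0.024, 'Age': 0.053,
    'Education': 0.801, 'Gender_Female': 0.505,
    'Ethnicity_Caucasian': 0.840, 'Ethnicity_AfricanAmerican': 0.909,
}

alpha = 0.05
significant = sorted(name for name, p in p_values.items() if p < alpha)
print(significant)
```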

## Question 5: What is your model's $R^2$?

In [9]:
model.score(x,y)

0.23231260833540443

## Question 6: How do we interpret this value?

Answer: About 23.2% of the variability in Balance is accounted for by the linear model involving Income, Cards, Age, Education, Gender, and Ethnicity; the remaining 76.8% is left unexplained.
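
To make "variability accounted for" concrete, the sketch below computes $R^2$ by hand as one minus the ratio of residual to total sum of squares. It uses a made-up five-point data set with a single predictor, not the credit data, so the arithmetic is easy to follow:

```python
# Hypothetical toy data, chosen so the numbers are easy to check by hand
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = float(len(x))
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a single predictor
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # total variation
r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```

`model.score(x, y)` in scikit-learn reports the same quantity for the fitted regression.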

## Question 7: Now let's focus on the two most significant variables from your previous model and re-run your regression model.

In [10]:
x = df[['Income', 'Cards']]
y = df[['Balance']]
model = linear_model.LinearRegression(fit_intercept = True)
model.fit(x, y)
print('- R^2 = {}'.format(model.score(x, y)))
print('- beta_0 (intercept) = {}'.format(model.intercept_))
# y is a one-column DataFrame, so coef_ has shape (1, n_features)
for name, coef in zip(x.columns, model.coef_[0]):
    print('{} = {}'.format(name, coef))

- R^2 = 0.223991751622
- beta_0 (intercept) = [ 151.32994635]
Income = 6.07099859467
Cards = 31.8381289478


## Question 8: In comparison to the previous model, did the $R^2$ increase or decrease?  Why?

In [11]:
model.score(x, y)

0.22399175162249518

Answer: The $R^2$ decreased, from 0.2323 to 0.2240, because we removed variables from our model. For least-squares fits on the same data, dropping predictors can never increase $R^2$ (at best it stays the same), because the smaller model has fewer variables available to account for variation in Balance. The drop here is small, which suggests the removed variables contributed little.
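
Because $R^2$ cannot go up when predictors are removed, adjusted $R^2$, which penalizes the number of predictors, is a useful complement. The sketch below applies the standard formula `1 - (1 - R^2) * (n - 1) / (n - p - 1)` to the two $R^2$ values reported above, with `n = 400` observations; the helper name `adjusted_r2` is made up for this example.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1.0) / (n - p - 1.0)

n = 400
adj_full = adjusted_r2(0.23231260833540443, n, p=7)   # seven-predictor model
adj_small = adjusted_r2(0.22399175162249518, n, p=2)  # Income and Cards only

print('full model:  {:.3f}'.format(adj_full))
print('small model: {:.3f}'.format(adj_small))
```

The full-model value matches the `Adj. R-squared` of 0.219 in the Question 3 summary, and the two-variable model comes out marginally higher, so on this measure the simpler model gives up essentially nothing.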

## Question 9: Now let's regress `Balance` on `Gender` alone.  After running your linear regressions, do you have enough evidence to claim that females have more balance than males?  (Hint: Look at the p-value of the Gender coefficient.  If it is significant then you will have evidence to support that claim, otherwise you cannot support the statement.)

In [12]:
#Using smf.ols to analyze categorical data for all following entries
model = smf.ols(formula = 'Balance ~ Gender', data = df).fit()
model.summary()

0,1,2,3
Dep. Variable:,Balance,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.002
Method:,Least Squares,F-statistic:,0.1836
Date:,"Mon, 16 May 2016",Prob (F-statistic):,0.669
Time:,17:35:42,Log-Likelihood:,-3019.3
No. Observations:,400,AIC:,6043.0
Df Residuals:,398,BIC:,6051.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,529.5362,31.988,16.554,0.000,466.649 592.423
Gender[T.Male],-19.7331,46.051,-0.429,0.669,-110.267 70.801

0,1,2,3
Omnibus:,28.438,Durbin-Watson:,1.94
Prob(Omnibus):,0.0,Jarque-Bera (JB):,27.346
Skew:,0.583,Prob(JB):,1.15e-06
Kurtosis:,2.471,Cond. No.,2.58


Answer: No. The Gender coefficient is not significant (p = 0.669 > 0.05), so there is insufficient evidence that Balance differs between females and males.
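
The t statistic and confidence interval in the table can be rebuilt from the coefficient and its standard error, which shows why the result is not significant. In this sketch the critical value 1.966 is an approximation for 398 residual degrees of freedom, entered by hand rather than looked up in code:

```python
# Coefficient and standard error for Gender[T.Male], from the summary table
coef, std_err = -19.7331, 46.051

t_stat = coef / std_err

# Approximate two-sided 5% critical value for 398 degrees of freedom (assumed)
t_crit = 1.966
ci_low = coef - t_crit * std_err
ci_high = coef + t_crit * std_err

# An interval that contains 0 is consistent with no difference between groups
contains_zero = ci_low < 0 < ci_high
print(t_stat, ci_low, ci_high, contains_zero)
```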

## Question 10: Now let's regress `Balance` on `Ethnicity`.  After running your linear regressions, do you have enough evidence to claim that some ethnic groups carry more balance than others?

In [13]:
model = smf.ols(formula = 'Balance ~ Ethnicity', data = df).fit()
model.summary()

0,1,2,3
Dep. Variable:,Balance,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.005
Method:,Least Squares,F-statistic:,0.04344
Date:,"Mon, 16 May 2016",Prob (F-statistic):,0.957
Time:,17:35:42,Log-Likelihood:,-3019.3
No. Observations:,400,AIC:,6045.0
Df Residuals:,397,BIC:,6057.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,531.0000,46.319,11.464,0.000,439.939 622.061
Ethnicity[T.Asian],-18.6863,65.021,-0.287,0.774,-146.515 109.142
Ethnicity[T.Caucasian],-12.5025,56.681,-0.221,0.826,-123.935 98.930

0,1,2,3
Omnibus:,28.829,Durbin-Watson:,1.946
Prob(Omnibus):,0.0,Jarque-Bera (JB):,27.395
Skew:,0.581,Prob(JB):,1.13e-06
Kurtosis:,2.46,Cond. No.,4.39


Answer: No. Neither Ethnicity coefficient is significant (p = 0.774 and p = 0.826), and the overall F-test has p = 0.957, so there is insufficient evidence that Balance differs across ethnic groups.
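
With a single categorical predictor, the intercept is the mean Balance of the reference group (African American here, since it is the level without its own row in the table) and each coefficient is an offset from it. The sketch below turns the table into the three fitted group means:

```python
# Intercept and offsets copied from the summary table
intercept = 531.0000
offsets = {'African American': 0.0, 'Asian': -18.6863, 'Caucasian': -12.5025}

# Fitted mean Balance for each group
group_means = {group: intercept + delta for group, delta in offsets.items()}
for group in sorted(group_means):
    print('{}: {:.4f}'.format(group, group_means[group]))
```

The three means differ by less than 19, which is small next to standard errors of roughly 57 to 65 on the offsets.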

## Question 11: Finally let's regress `Balance` on `Student`.  After running your linear regressions, do you have enough evidence to claim that students carry more balance than non-students?

In [14]:
model = smf.ols(formula = 'Balance ~ Student', data = df).fit()
model.summary()

0,1,2,3
Dep. Variable:,Balance,R-squared:,0.067
Model:,OLS,Adj. R-squared:,0.065
Method:,Least Squares,F-statistic:,28.62
Date:,"Mon, 16 May 2016",Prob (F-statistic):,1.49e-07
Time:,17:35:42,Log-Likelihood:,-3005.5
No. Observations:,400,AIC:,6015.0
Df Residuals:,398,BIC:,6023.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,480.3694,23.434,20.499,0.000,434.300 526.439
Student[T.Yes],396.4556,74.104,5.350,0.000,250.771 542.140

0,1,2,3
Omnibus:,20.866,Durbin-Watson:,1.95
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21.92
Skew:,0.544,Prob(JB):,1.74e-05
Kurtosis:,2.637,Cond. No.,3.37


Answer: Yes, there is sufficient evidence that students carry a higher Balance: the Student coefficient is significant (p < 0.001). The model predicts that Balance = 480.3694 + 396.4556 \* Student, where Student is 1 for students and 0 otherwise. Thus, being a student is associated with a Balance that is higher by 396.4556 on average.

## Question 12: Now let's consider the effect of `Student` and `Income` on `Balance` simultaneously.  Are all the coefficients significant?

In [15]:
model = smf.ols(formula = 'Balance ~ Income + Student', data = df).fit()
model.summary()

0,1,2,3
Dep. Variable:,Balance,R-squared:,0.277
Model:,OLS,Adj. R-squared:,0.274
Method:,Least Squares,F-statistic:,76.22
Date:,"Mon, 16 May 2016",Prob (F-statistic):,9.64e-29
Time:,17:35:42,Log-Likelihood:,-2954.4
No. Observations:,400,AIC:,5915.0
Df Residuals:,397,BIC:,5927.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,211.1430,32.457,6.505,0.000,147.333 274.952
Student[T.Yes],382.6705,65.311,5.859,0.000,254.272 511.069
Income,5.9843,0.557,10.751,0.000,4.890 7.079

0,1,2,3
Omnibus:,119.719,Durbin-Watson:,1.951
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23.617
Skew:,0.252,Prob(JB):,7.44e-06
Kurtosis:,1.922,Cond. No.,192.0


Answer: Yes, all the coefficients are significant (p < 0.001 for the intercept, Student, and Income). The model predicts that Balance = 211.1430 + 382.6705 \* Student + 5.9843 \* Income. Holding Income fixed, being a student is associated with a Balance that is higher by 382.6705. Holding student status fixed, an increase in Income of 1 unit (one thousand dollars) is associated with an increase in Balance of 5.9843.
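
The additive model can be written as a small prediction function to see what "simultaneously" implies: the student gap is the same at every income. The function name `predict_balance` and the income levels tried are made up for this sketch; the coefficients come from the summary table above.

```python
def predict_balance(income, student):
    """Additive model; income in thousands of dollars, student is 1 or 0."""
    return 211.1430 + 382.6705 * student + 5.9843 * income

# The gap between students and non-students at a few income levels
gaps = []
for income in (20, 50, 100):
    gap = predict_balance(income, 1) - predict_balance(income, 0)
    gaps.append(gap)
    print(income, predict_balance(income, 0), predict_balance(income, 1), gap)
```

Every gap equals the Student coefficient, because the two fitted lines are parallel.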

## Question 13: Now let's consider the interaction effect of `Student` and `Income` on `Balance` simultaneously.  Are all the coefficients significant?  If they are, write down your regression model below

(First generate a new variable for the interaction term)

In [16]:
model = smf.ols(formula = 'Balance ~ Income * Student', data = df).fit()
model.summary()

0,1,2,3
Dep. Variable:,Balance,R-squared:,0.28
Model:,OLS,Adj. R-squared:,0.274
Method:,Least Squares,F-statistic:,51.3
Date:,"Mon, 16 May 2016",Prob (F-statistic):,4.94e-28
Time:,17:35:42,Log-Likelihood:,-2953.7
No. Observations:,400,AIC:,5915.0
Df Residuals:,396,BIC:,5931.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,200.6232,33.698,5.953,0.000,134.373 266.873
Student[T.Yes],476.6758,104.351,4.568,0.000,271.524 681.827
Income,6.2182,0.592,10.502,0.000,5.054 7.382
Income:Student[T.Yes],-1.9992,1.731,-1.155,0.249,-5.403 1.404

0,1,2,3
Omnibus:,107.788,Durbin-Watson:,1.952
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22.158
Skew:,0.228,Prob(JB):,1.54e-05
Kurtosis:,1.941,Cond. No.,309.0


Answer: No. The interaction term Income \* Student is not statistically significant (p = 0.249), so we revert to our previous model of: Balance = 211.1430 + 382.6705 \* Student + 5.9843 \* Income
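
For reference, an interaction model amounts to fitting a separate line for each group. The sketch below splits the four coefficients from the table above into an intercept and an Income slope for non-students and for students (the variable names are made up for this example):

```python
# Coefficients from the Income * Student summary table
b0, b_student, b_income, b_interaction = 200.6232, 476.6758, 6.2182, -1.9992

# (intercept, slope on Income) for each group
non_student_line = (b0, b_income)
student_line = (b0 + b_student, b_income + b_interaction)

print('non-students: {:.4f} + {:.4f} * Income'.format(*non_student_line))
print('students:     {:.4f} + {:.4f} * Income'.format(*student_line))
```

The estimated student slope is lower (about 4.22 versus 6.22), but the difference between the slopes is exactly the interaction coefficient, which the table shows is not distinguishable from zero.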

## Question 14: Is there any income level at which students and non-students on average carry same level of balance?

In [17]:
#Assuming we're using the model involving Income * Student:
#Student = Non-student
#200.6232 + 6.2182 * Income = 200.6232 + 476.6758 + (6.2182 - 1.9992) * Income
#1.9992 * Income = 476.6758
476.6758 / 1.9992

238.4332733093237

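Answer: Under the interaction model, students and non-students on average carry the same Balance when Income = 238.4333, that is, about 238 thousand dollars. This estimate should be treated with caution because the interaction term is not significant (p = 0.249). Under the additive model from Question 12, the two lines are parallel, so there is no income level at which the predicted balances are equal.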
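
As a check on the algebra, the interaction model can be evaluated for both groups at the computed income. The function name `balance_interaction` is made up for this sketch; the coefficients are those from the Question 13 table.

```python
def balance_interaction(income, student):
    """Interaction model; income in thousands of dollars, student is 1 or 0."""
    return (200.6232 + 476.6758 * student
            + 6.2182 * income - 1.9992 * income * student)

crossover = 476.6758 / 1.9992

# Student minus non-student prediction at the crossover and at Income = 50
gap_at_crossover = balance_interaction(crossover, 1) - balance_interaction(crossover, 0)
gap_at_50 = balance_interaction(50, 1) - balance_interaction(50, 0)
print(crossover, gap_at_crossover, gap_at_50)
```

The gap is zero (up to floating-point rounding) at the crossover and still about 377 at an income of 50.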