# DS-SF-27 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [214]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [215]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [216]:
pd.crosstab(df.admit, df.prestige)

admit,prestige 1.0,prestige 2.0,prestige 3.0,prestige 4.0
0,28,95,93,55
1,33,53,28,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [217]:
# One binary column per prestige level
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')
prestige_df

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We only need 3 of the 4 binary variables, because the fourth is fully determined by the other three (if all three are 0, the applicant belongs to the omitted level). Including all four alongside an intercept would make the columns perfectly collinear (the dummy variable trap), and the model's coefficients would no longer be uniquely identified.
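
As a rough illustration of the redundancy described above, the sketch below builds the dummies for a small made-up `prestige` column (not the admissions data) and checks the rank of the design matrix with and without the redundant column. The `drop_first = True` option and the helper `rank_with_intercept` are choices made for this example, not part of the project code.

```python
import numpy as np
import pandas as pd

# Small made-up sample containing all four prestige levels (illustrative only)
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 4.0, 2.0, 3.0, 2.0], name="prestige")

# All four dummies versus three dummies (prestige 1 dropped as the reference)
full = pd.get_dummies(prestige, prefix="prestige").astype(float)
reduced = pd.get_dummies(prestige, prefix="prestige", drop_first=True).astype(float)

# Each row of the full encoding sums to 1, i.e. it duplicates an intercept column
row_sums = full.sum(axis=1)

def rank_with_intercept(dummies):
    """Return (rank, number of columns) of the design matrix [1 | dummies]."""
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return np.linalg.matrix_rank(design), design.shape[1]

full_rank, full_cols = rank_with_intercept(full)
reduced_rank, reduced_cols = rank_with_intercept(reduced)

print("full encoding:    rank", full_rank, "of", full_cols, "columns")
print("reduced encoding: rank", reduced_rank, "of", reduced_cols, "columns")
```

With all four dummies the design matrix has one more column than its rank, which is the collinearity problem; with three dummies the rank equals the number of columns.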

> ### Question 4.  Why are we doing this?

Answer: We use binary variables instead of the original `prestige` variable because treating `prestige` as a number would force every one-step change in prestige to have the same effect on the log-odds of admission. With one binary variable per level, we can test whether going from a prestige 2 university to a prestige 3 university changes your chance of getting admitted to grad school by the same amount as going from a prestige 3 university to a prestige 4 university.

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [218]:
df = df.join([prestige_df])

In [219]:
df = df.drop('prestige', axis = 1)

In [220]:
df

Unnamed: 0,admit,gre,gpa,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [221]:
freq_pres_1=pd.crosstab(df.admit, df['prestige_1.0'])[1]
freq_pres_1

admit
0    28
1    33
Name: 1.0, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [222]:
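# Admission rate (in percent) for prestige 1 applicants.
# Note: this is a probability, not the odds (admitted / not admitted).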
p_adm_one = 100*float(freq_pres_1[1])/(freq_pres_1[1]+freq_pres_1[0])
p_adm_one

54.09836065573771

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [223]:
# Admission rate (in percent) for applicants who did not attend a prestige 1 school
freq_pres_not_1=pd.crosstab(df.admit, df['prestige_1.0'])[0]
freq_pres_not_1
p_adm_not_one = 100*float(freq_pres_not_1[1])/(freq_pres_not_1[1]+freq_pres_not_1[0])
p_adm_not_one

27.678571428571427

> ### Question 9.  Finally, what's the odds ratio?

In [224]:
# Ratio of the two admission rates (a relative risk, not an odds ratio)
p_adm_one/p_adm_not_one

1.954521417239556

> ### Question 10.  Write this finding in a sentence.

Answer: The probability of being admitted to grad school is almost twice as high (~1.95x) if you attended one of the most prestigious undergraduate schools (~54%) as if you did not (~28%). Note that this is a ratio of probabilities (a relative risk), not an odds ratio. In terms of odds, 33/28 ≈ 1.18 versus 93/243 ≈ 0.38, the odds ratio is about 3.08.
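
The cells above compute admission rates rather than odds, so a short sketch of the odds calculation may help. It uses only the counts from the frequency table in Part A (33 admitted and 28 not admitted for prestige 1; the other three columns summed for everyone else). The function names `odds` and `probability` are introduced here for illustration.

```python
# Counts taken from the frequency table in Part A
admitted_tier1, rejected_tier1 = 33, 28
admitted_other, rejected_other = 53 + 28 + 12, 95 + 93 + 55

def odds(successes, failures):
    """Odds = number of successes divided by number of failures."""
    return successes / failures

def probability(successes, failures):
    """Probability = number of successes divided by the total."""
    return successes / (successes + failures)

odds_tier1 = odds(admitted_tier1, rejected_tier1)
odds_other = odds(admitted_other, rejected_other)
odds_ratio = odds_tier1 / odds_other

# For comparison: the ratio of probabilities (relative risk)
relative_risk = (probability(admitted_tier1, rejected_tier1)
                 / probability(admitted_other, rejected_other))

print("odds (prestige 1):     %.3f" % odds_tier1)
print("odds (not prestige 1): %.3f" % odds_other)
print("odds ratio:            %.3f" % odds_ratio)
print("relative risk:         %.3f" % relative_risk)
```

The odds ratio (about 3.08) and the relative risk (about 1.95) answer different questions, which is why the two numbers disagree.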

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [225]:
freq_pres_4=pd.crosstab(df.admit, df['prestige_4.0'])[1]
freq_pres_4

p_adm_four=100*float(freq_pres_4[1])/(freq_pres_4[1]+freq_pres_4[0])
p_adm_four

17.91044776119403

In [226]:
freq_pres_not_4=pd.crosstab(df.admit, df['prestige_4.0'])[0]
freq_pres_not_4
p_adm_not_four = 100*float(freq_pres_not_4[1])/(freq_pres_not_4[1]+freq_pres_not_4[0])
p_adm_not_four

34.54545454545455

In [227]:
p_adm_four/p_adm_not_four

0.5184603299293008

Answer: The probability of being admitted to grad school is about half as high (~0.52x) if you attended one of the least prestigious undergraduate schools (~18%) as if you did not (~35%). As in Question 10, this is a ratio of probabilities. In terms of odds, 12/55 ≈ 0.22 versus 114/216 ≈ 0.53, the odds ratio is about 0.41.
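
The same one-versus-rest comparison can be run for every prestige level at once. The sketch below rebuilds the Part A frequency table by hand (so it runs without the CSV file) and computes each level's odds ratio against all other levels combined. The helper name `odds_ratio_vs_rest` is made up for this example.

```python
import pandas as pd

# Frequency table from Part A: rows are admit (0/1), columns are prestige levels
counts = pd.DataFrame(
    {1.0: [28, 33], 2.0: [95, 53], 3.0: [93, 28], 4.0: [55, 12]},
    index=pd.Index([0, 1], name="admit"),
)
totals = counts.sum(axis=1)

def odds_ratio_vs_rest(tier):
    """Odds of admission for one prestige level divided by the odds for all others."""
    inside = counts[tier]
    outside = totals - inside
    return (inside[1] / inside[0]) / (outside[1] / outside[0])

ratios = {tier: odds_ratio_vs_rest(tier) for tier in counts.columns}
for tier, ratio in ratios.items():
    print("prestige %d vs. rest: odds ratio = %.3f" % (tier, ratio))
```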

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [206]:
df.rename(columns = {'prestige_1.0': 'prestige_1',
                           'prestige_2.0': 'prestige_2',
                           'prestige_3.0': 'prestige_3',
                           'prestige_4.0': 'prestige_4'}, inplace = True)
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige_1', u'prestige_2', u'prestige_3',
       u'prestige_4'],
      dtype='object')

In [207]:
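# prestige_1 is left out so that it acts as the reference level.
# Note: Logit does not add an intercept on its own, so this model has no constant term.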
X_sm= df[['gre','gpa','prestige_2','prestige_3','prestige_4']]
y_sm = df['admit']
model = smf.Logit(y_sm, X_sm).fit()

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [208]:
# Coefficients in the summary are on the log-odds scale
model.summary()

0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,392.0
Method:,MLE,Df Model:,4.0
Date:,"Thu, 13 Oct 2016",Pseudo R-squ.:,0.05722
Time:,16:21:13,Log-Likelihood:,-233.88
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.039e-05

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
gre,0.0014,0.001,1.308,0.191,-0.001 0.003
gpa,-0.1323,0.195,-0.680,0.497,-0.514 0.249
prestige_2,-0.9562,0.302,-3.171,0.002,-1.547 -0.365
prestige_3,-1.5375,0.332,-4.627,0.000,-2.189 -0.886
prestige_4,-1.8699,0.401,-4.658,0.000,-2.657 -1.083


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [209]:
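# Note: conf_int() returns bounds on the coefficient (log-odds) scale;
# exponentiate them to get confidence intervals for the odds ratios.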
for i in range(5):
    print  "odds ratios ( exp(beta_j) ): ",X_sm.columns[i], np.exp(model.params[i])
for i in range(5):
    print  "95% confidence intervals is:",X_sm.columns[i], "[", model.conf_int()[0][i],";", model.conf_int()[1][i],"]"


odds ratios ( exp(beta_j) ):  gre 1.00136757945
odds ratios ( exp(beta_j) ):  gpa 0.876072621845
odds ratios ( exp(beta_j) ):  prestige_2 0.38434194315
odds ratios ( exp(beta_j) ):  prestige_3 0.214917812922
odds ratios ( exp(beta_j) ):  prestige_4 0.154134789101
95% confidence intervals is: gre [ -0.000680481001459 ; 0.00341377133496 ]
95% confidence intervals is: gpa [ -0.513657423237 ; 0.249044843573 ]
95% confidence intervals is: prestige_2 [ -1.54727949804 ; -0.365165793326 ]
95% confidence intervals is: prestige_3 [ -2.18876883998 ; -0.886230338853 ]
95% confidence intervals is: prestige_4 [ -2.65674316557 ; -1.08311244539 ]
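
The intervals printed above are for the coefficients, on the log-odds scale. Since the question asks about odds ratios, a minimal sketch of the conversion follows. It hard-codes the bounds from the output above instead of refitting the model, so it runs on its own.

```python
import numpy as np

names = ["gre", "gpa", "prestige_2", "prestige_3", "prestige_4"]

# 95% confidence bounds of the coefficients, copied from the output above
lower = np.array([-0.000680481001459, -0.513657423237, -1.54727949804,
                  -2.18876883998, -2.65674316557])
upper = np.array([0.00341377133496, 0.249044843573, -0.365165793326,
                  -0.886230338853, -1.08311244539])

# exp() is monotonic, so exponentiating the bounds gives the odds-ratio interval
or_lower = np.exp(lower)
or_upper = np.exp(upper)

for name, lo, hi in zip(names, or_lower, or_upper):
    print("%-10s odds ratio 95%% CI: [%.3f ; %.3f]" % (name, lo, hi))
```

An interval that contains 1 (as for `gre` and `gpa` here) means the data are compatible with no effect on the odds. With the fitted `model` object available, `np.exp(model.conf_int())` should give the same bounds in one step.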


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding `gre` and `gpa` constant, the odds of being admitted to the UCLA grad school are multiplied by ~0.38 (a ~62% reduction) if you went to a prestige 2 undergrad instead of a prestige 1 undergrad, which is the reference level.

> ### Question 16.  Interpret the odds ratio of `gpa`.

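Answer: Holding the other variables constant, the odds of being admitted to the UCLA grad school are multiplied by ~0.88 (exp(-0.1323)), i.e. divided by ~1.14, for each additional point of GPA. The effect is not statistically significant (p = 0.497, and the 95% confidence interval of the coefficient spans 0). The unexpected negative sign is probably a side effect of fitting this model without an intercept.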

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [148]:
# Logistic function of the linear predictor; all prestige dummies are 0 for tier-1 (the reference level)
print "probability of admission if he/she come from a tier-1 undergraduate school:"
print 1/(1+np.exp(-(sum([800,4,0,0,0]*model.params))))
print "probability of admission if he/she come from a tier-2 undergraduate school:"
print 1/(1+np.exp(-(sum([800,4,1,0,0]*model.params))))
print "probability of admission if he/she come from a tier-3 undergraduate school:"
print 1/(1+np.exp(-(sum([800,4,0,1,0]*model.params))))
print "probability of admission if he/she come from a tier-4 undergraduate school:"
print 1/(1+np.exp(-(sum([800,4,0,0,1]*model.params))))

probability of admission if he/she come from a tier-1 undergraduate school:
0.63739858332
probability of admission if he/she come from a tier-2 undergraduate school:
0.403204249654
probability of admission if he/she come from a tier-3 undergraduate school:
0.274201614468
probability of admission if he/she come from a tier-4 undergraduate school:
0.213184326729


Answer: (see above)
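
The calculation above applies the logistic function to the linear predictor. As a self-contained sketch, the snippet below recovers the coefficients from the odds ratios printed in Question 14 (by taking logs) and repeats the calculation with a dot product. The `sigmoid` helper and the `profiles` dictionary are names introduced here for illustration.

```python
import numpy as np

# Odds ratios printed in Question 14, in the order gre, gpa, prestige_2, prestige_3, prestige_4
odds_ratios = np.array([1.00136757945, 0.876072621845, 0.38434194315,
                        0.214917812922, 0.154134789101])
beta = np.log(odds_ratios)  # back to the coefficient (log-odds) scale

def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Feature vectors for a student with GRE 800 and GPA 4 from each tier
profiles = {
    "tier-1": [800, 4, 0, 0, 0],
    "tier-2": [800, 4, 1, 0, 0],
    "tier-3": [800, 4, 0, 1, 0],
    "tier-4": [800, 4, 0, 0, 1],
}

probabilities = {tier: sigmoid(np.dot(x, beta)) for tier, x in profiles.items()}
for tier, p in probabilities.items():
    print("%s: %.4f" % (tier, p))
```

No intercept term appears in the sum because the statsmodels model above was fit without one.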

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [120]:
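# Same features as the statsmodels model; prestige_1 is the reference level
X = df[ ['gre','gpa','prestige_2','prestige_3','prestige_4'] ]
y = df.admit

# A large C means weak L2 regularization; sklearn fits an intercept by default
model_sk = linear_model.LogisticRegression(C = 10 ** 2).fit(X, y)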

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
model_sk.coef_

In [121]:
# Exponentiate the sklearn coefficients to get odds ratios
for i in range(5): 
    print 'odds ratio ( exp(beta_j) )',X.columns[i],': ',np.exp(model_sk.coef_[0][i])

odds ratio ( exp(beta_j) ) gre :  1.00216054603
odds ratio ( exp(beta_j) ) gpa :  1.96041258389
odds ratio ( exp(beta_j) ) prestige_2 :  0.533219355935
odds ratio ( exp(beta_j) ) prestige_3 :  0.285867331187
odds ratio ( exp(beta_j) ) prestige_4 :  0.208296628667


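Answer: The odds ratio for `gre` is similar between sklearn and statsmodels. The odds ratios for `gpa`, `prestige_2`, `prestige_3`, and `prestige_4` are larger with sklearn than with statsmodels, and for `gpa` the direction of the effect flips (0.88 vs. 1.96). The two models are not the same specification: `LogisticRegression` fits an intercept by default and applies L2 regularization (weak here, with `C = 10 ** 2`), whereas the statsmodels `Logit` above was fit on the features alone, without a constant term.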

Variable | statsmodels | sklearn
---|---|---
`gre` | 1.00136757945 | 1.00216054603
`gpa` | 0.876072621845 | 1.96041258389
`prestige_2` | 0.38434194315 | 0.533219355935
`prestige_3` | 0.214917812922 | 0.285867331187
`prestige_4` | 0.154134789101 | 0.208296628667

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [212]:
# Same calculation as in Question 17, plus the intercept that sklearn fits by default
print "probability of admission if he/she come from a tier-1 undergraduate school:"
print 1/(1+np.exp(-(sum([800,4,0,0,0]*model_sk.coef_[0])+model_sk.intercept_[0])))
print "probability of admission if he/she come from a tier-2 undergraduate school:"
print 1/(1+np.exp(-(sum([800,4,1,0,0]*model_sk.coef_[0])+model_sk.intercept_[0])))
print "probability of admission if he/she come from a tier-3 undergraduate school:"
print 1/(1+np.exp(-(sum([800,4,0,1,0]*model_sk.coef_[0])+model_sk.intercept_[0])))
print "probability of admission if he/she come from a tier-4 undergraduate school:"
print 1/(1+np.exp(-(sum([800,4,0,0,1]*model_sk.coef_[0])+model_sk.intercept_[0])))

probability of admission if he/she come from a tier-1 undergraduate school:
0.711853946857
probability of admission if he/she come from a tier-2 undergraduate school:
0.568462979352
probability of admission if he/she come from a tier-3 undergraduate school:
0.413910637705
probability of admission if he/she come from a tier-4 undergraduate school:
0.339754859079


Answer: (See above)

Probability of admission | statsmodels | sklearn
---|---|---
`from a tier-1` | 0.63739858332 | 0.711853946857
`from a tier-2` | 0.403204249654 | 0.568462979352
`from a tier-3` | 0.274201614468 | 0.413910637705
`from a tier-4` | 0.213184326729 | 0.339754859079
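
One way to see where the gap between the two columns comes from is to back out the intercept that the sklearn model must be using. The sketch below works only from numbers printed in this notebook (the sklearn odds ratios from Question 19 and the sklearn probabilities above). It is a consistency check under those inputs, not a refit of either model.

```python
import numpy as np

# sklearn odds ratios from Question 19 (gre, gpa, prestige_2, prestige_3, prestige_4)
odds_ratios_sk = np.array([1.00216054603, 1.96041258389, 0.533219355935,
                           0.285867331187, 0.208296628667])
beta_sk = np.log(odds_ratios_sk)

# sklearn probabilities from Question 20 for GRE 800, GPA 4, tiers 1 to 4
probs_sk = np.array([0.711853946857, 0.568462979352, 0.413910637705, 0.339754859079])
features = np.array([[800, 4, 0, 0, 0],
                     [800, 4, 1, 0, 0],
                     [800, 4, 0, 1, 0],
                     [800, 4, 0, 0, 1]])

def logit(p):
    """Inverse of the logistic function: probability -> log-odds."""
    return np.log(p / (1.0 - p))

# log-odds = intercept + x . beta, so intercept = logit(p) - x . beta
implied_intercepts = logit(probs_sk) - features.dot(beta_sk)
print("implied intercept per tier:", implied_intercepts)
```

All four rows imply the same intercept of roughly -3.5, a term that the statsmodels fit in this notebook does not have. That difference in specification, more than the choice of library, is a plausible reason the coefficients and predicted probabilities disagree.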