# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [565]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf
import statsmodels.api as sm

from sklearn import linear_model

In [566]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [567]:
pd.crosstab(df.prestige, df.admit, dropna = False)

prestige,admit = 0,admit = 1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [568]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

prestige_df

,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: Three of the four binary variables are needed for modeling.

> ### Question 4.  Why are we doing this?

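Answer: One binary variable can be omitted because it can be derived from the remaining three; keeping all four alongside an intercept would make the features perfectly collinear. The omitted category becomes the baseline, and the coefficients of the remaining binary variables are interpreted relative to it.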
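The redundancy can be checked directly. The following is a minimal sketch in plain Python, using a handful of made-up `prestige` values (not the admissions data) to show that the four indicator columns always sum to 1, so any one of them can be rebuilt from the other three.

```python
# Hypothetical prestige ranks, for illustration only (not the admissions data).
prestige = [3, 3, 1, 4, 4, 2]
levels = [1, 2, 3, 4]

# One-hot encode: one 0/1 column per prestige level.
dummies = {level: [1 if p == level else 0 for p in prestige] for level in levels}

# Every row has exactly one 1, so the four columns always sum to 1 ...
row_sums = [sum(dummies[level][i] for level in levels) for i in range(len(prestige))]

# ... which means any one column can be rebuilt from the other three.
rebuilt_1 = [1 - (dummies[2][i] + dummies[3][i] + dummies[4][i])
             for i in range(len(prestige))]

print(row_sums)
print(rebuilt_1 == dummies[1])
```

Dropping one column therefore loses no information; it only fixes which category serves as the reference level.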

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [569]:
prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
                           'prestige_2.0': 'prestige_2',
                           'prestige_3.0': 'prestige_3',
                           'prestige_4.0': 'prestige_4'}, inplace = True)

df = df.join([prestige_df])

In [570]:
df.drop(['prestige'], axis = 1, inplace = True)

In [571]:
df

,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


In [572]:
df.isnull().sum()

admit         0
gre           0
gpa           0
prestige_1    0
prestige_2    0
prestige_3    0
prestige_4    0
dtype: int64

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [573]:
pd.crosstab(df.prestige_1, df.admit, dropna = False)

prestige_1,admit = 0,admit = 1
0.0,243,93
1.0,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [574]:
p1_a1 = 33.   # admitted, prestige = 1
p1_a0 = 28.   # not admitted, prestige = 1

# Odds = admitted / not admitted (a probability would divide by the total instead)
odds_p1 = p1_a1 / p1_a0

print("Odds of being admitted for applicants that attended the most prestigious undergrad schools: {0:.2f}".format(odds_p1))

Odds of being admitted for applicants that attended the most prestigious undergrad schools: 1.18


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [575]:
p234_a1 = 93.    # admitted, prestige = 2, 3, or 4
p234_a0 = 243.   # not admitted, prestige = 2, 3, or 4

odds_p234 = p234_a1 / p234_a0

print("Odds of being admitted for applicants that did not attend the most prestigious undergrad schools: {0:.2f}".format(odds_p234))

Odds of being admitted for applicants that did not attend the most prestigious undergrad schools: 0.38


> ### Question 9.  Finally, what's the odds ratio?

In [576]:
print("Odds Ratio: {0:.2f}".format(odds_p1 / odds_p234))

Odds Ratio: 3.08

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of being admitted to UCLA are about three times higher (odds ratio of about 3.08) for applicants who attended the most prestigious undergraduate schools than for all other applicants. Attending a top-ranked undergraduate school is positively associated with admission.
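Odds and probabilities are easy to mix up: odds divide the admitted count by the not-admitted count, while a probability divides it by the total. The sketch below uses the counts from the `prestige_1` frequency table to contrast the two; the helper names `odds` and `probability` are illustrative and not part of the assignment.

```python
def odds(admitted, rejected):
    """Odds of admission: admitted count divided by not-admitted count."""
    return admitted / float(rejected)


def probability(admitted, rejected):
    """Probability of admission: admitted count divided by the total."""
    return admitted / float(admitted + rejected)


# Counts from the prestige_1 frequency table.
top = {"admitted": 33, "rejected": 28}     # prestige = 1
rest = {"admitted": 93, "rejected": 243}   # prestige = 2, 3, or 4

odds_ratio = odds(**top) / odds(**rest)
prob_ratio = probability(**top) / probability(**rest)

print("odds ratio: {0:.2f}".format(odds_ratio))
print("ratio of probabilities: {0:.2f}".format(prob_ratio))
```

The ratio of probabilities (about 1.95) is sometimes called the relative risk; it is a different quantity from the odds ratio (about 3.08), and it is the odds ratio that logistic regression coefficients describe.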

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [577]:
pd.crosstab(df.prestige_4, df.admit, dropna = False)

prestige_4,admit = 0,admit = 1
0.0,216,114
1.0,55,12


In [578]:
p4_a1 = 12.   # admitted, prestige = 4
p4_a0 = 55.   # not admitted, prestige = 4

# Odds = admitted / not admitted
odds_p4 = p4_a1 / p4_a0

print("Odds of being admitted for applicants that attended the least prestigious undergrad schools: {0:.2f}".format(odds_p4))

Odds of being admitted for applicants that attended the least prestigious undergrad schools: 0.22


In [579]:
p123_a1 = 114.   # admitted, prestige = 1, 2, or 3
p123_a0 = 216.   # not admitted, prestige = 1, 2, or 3

odds_p123 = p123_a1 / p123_a0

print("Odds of being admitted for applicants that did not attend the least prestigious undergrad schools: {0:.2f}".format(odds_p123))

Odds of being admitted for applicants that did not attend the least prestigious undergrad schools: 0.53


In [580]:
print("Odds Ratio: {0:.2f}".format(odds_p4 / odds_p123))

Odds Ratio: 0.41


Answer: The odds of being admitted to UCLA are less than half as large (odds ratio of about 0.41) for applicants who attended the least prestigious undergraduate schools as for applicants from all other ranked schools. Attending a least-prestigious undergraduate school is negatively associated with admission.

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [581]:
df.drop(['prestige_1'], axis = 1, inplace = True)

In [512]:
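def Xy(df):
    # Keep only rows with complete data in the model's columns
    df = df.dropna(subset = ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4', 'admit'])

    X = df[ ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4'] ] # X is a DataFrame
    X = sm.add_constant(X)

    y = df.admit # y is a Series

    return X, y

X, y = Xy(df)

# Note: this fits ordinary least squares (a linear probability model), not a
# logistic regression; the logistic fit with sm.Logit appears under Question 14
model = sm.OLS(y, X).fit()

model.summary()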

Dep. Variable:,admit,R-squared:,0.099
Model:,OLS,Adj. R-squared:,0.087
Method:,Least Squares,F-statistic:,8.594
Date:,"Sat, 11 Feb 2017",Prob (F-statistic):,9.71e-08
Time:,03:22:06,Log-Likelihood:,-239.02
No. Observations:,397,AIC:,490.0
Df Residuals:,391,BIC:,513.9
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,-0.2377,0.217,-1.095,0.274,-0.665 0.189
gre,0.0004,0.000,1.997,0.047,6.48e-06 0.001
gpa,0.1508,0.064,2.349,0.019,0.025 0.277
prestige_2,-0.1635,0.068,-2.407,0.017,-0.297 -0.030
prestige_3,-0.2910,0.070,-4.139,0.000,-0.429 -0.153
prestige_4,-0.3240,0.079,-4.082,0.000,-0.480 -0.168

Omnibus:,152.312,Durbin-Watson:,1.946
Prob(Omnibus):,0.0,Jarque-Bera (JB):,50.314
Skew:,0.678,Prob(JB):,1.19e-11
Kurtosis:,1.904,Cond. No.,6070.0


In [514]:
modelols = smf.ols(formula = 'admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4', data = df).fit()

modelols.summary()

Dep. Variable:,admit,R-squared:,0.099
Model:,OLS,Adj. R-squared:,0.087
Method:,Least Squares,F-statistic:,8.594
Date:,"Sat, 11 Feb 2017",Prob (F-statistic):,9.71e-08
Time:,03:22:32,Log-Likelihood:,-239.02
No. Observations:,397,AIC:,490.0
Df Residuals:,391,BIC:,513.9
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,-0.2377,0.217,-1.095,0.274,-0.665 0.189
gre,0.0004,0.000,1.997,0.047,6.48e-06 0.001
gpa,0.1508,0.064,2.349,0.019,0.025 0.277
prestige_2,-0.1635,0.068,-2.407,0.017,-0.297 -0.030
prestige_3,-0.2910,0.070,-4.139,0.000,-0.429 -0.153
prestige_4,-0.3240,0.079,-4.082,0.000,-0.480 -0.168

Omnibus:,152.312,Durbin-Watson:,1.946
Prob(Omnibus):,0.0,Jarque-Bera (JB):,50.314
Skew:,0.678,Prob(JB):,1.19e-11
Kurtosis:,1.904,Cond. No.,6070.0


> ### Question 13.  Print the model's summary results.

In [515]:
model.summary()

Dep. Variable:,admit,R-squared:,0.099
Model:,OLS,Adj. R-squared:,0.087
Method:,Least Squares,F-statistic:,8.594
Date:,"Sat, 11 Feb 2017",Prob (F-statistic):,9.71e-08
Time:,03:22:37,Log-Likelihood:,-239.02
No. Observations:,397,AIC:,490.0
Df Residuals:,391,BIC:,513.9
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,-0.2377,0.217,-1.095,0.274,-0.665 0.189
gre,0.0004,0.000,1.997,0.047,6.48e-06 0.001
gpa,0.1508,0.064,2.349,0.019,0.025 0.277
prestige_2,-0.1635,0.068,-2.407,0.017,-0.297 -0.030
prestige_3,-0.2910,0.070,-4.139,0.000,-0.429 -0.153
prestige_4,-0.3240,0.079,-4.082,0.000,-0.480 -0.168

Omnibus:,152.312,Durbin-Watson:,1.946
Prob(Omnibus):,0.0,Jarque-Bera (JB):,50.314
Skew:,0.678,Prob(JB):,1.19e-11
Kurtosis:,1.904,Cond. No.,6070.0
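The summaries above come from an OLS fit, which treats `admit` as a continuous outcome. One practical consequence is that a linear fit can predict values outside the range 0 to 1, which a logistic model cannot do. The sketch below plugs the rounded coefficients from the summary table into the linear equation; the applicant's GRE and GPA are hypothetical values chosen only for illustration, and `ols_prediction` is an illustrative helper name.

```python
# Rounded OLS coefficients copied from the summary table.
ols_coef = {"const": -0.2377, "gre": 0.0004, "gpa": 0.1508,
            "prestige_2": -0.1635, "prestige_3": -0.2910, "prestige_4": -0.3240}


def ols_prediction(gre, gpa, prestige):
    """Linear prediction for one applicant; prestige is 1, 2, 3, or 4."""
    value = ols_coef["const"] + ols_coef["gre"] * gre + ols_coef["gpa"] * gpa
    if prestige in (2, 3, 4):
        # prestige = 1 is the reference level and adds nothing
        value += ols_coef["prestige_{0}".format(prestige)]
    return value


# Hypothetical weak applicant from a tier-4 school.
weak = ols_prediction(gre=220, gpa=2.26, prestige=4)
print("{0:.3f}".format(weak))
```

The result is negative, so it cannot be read as a probability of admission.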


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [586]:
names_X = ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']


def X_c(df):
    X = df[ names_X ]
    c = df.admit
    return X, c

df_X, df_c = X_c(df)


modellr = linear_model.LogisticRegression().\
    fit(df_X, df_c)

print("intercept: {}".format(modellr.intercept_))
print("coefficient: {}".format(modellr.coef_))

# Odds ratio = exp(coefficient)
print("Odds Ratios:")
for name, coef in zip(names_X, modellr.coef_[0]):
    print("{0}: {1:.4f}".format(name, np.exp(coef)))

intercept: [-1.81701706]
coefficient: [[ 0.00178497  0.23229458 -0.60347467 -1.17214957 -1.37729795]]
Odds Ratios:
gre: 1.0018
gpa: 1.2615
prestige_2: 0.5469
prestige_3: 0.3097
prestige_4: 0.2523


In [582]:
print(model.conf_int().rename(columns = {0: '2.5%', 1: '97.5%'}))

                2.5%     97.5%
const      -0.664512  0.189012
gre         0.000006  0.000837
gpa         0.024562  0.277086
prestige_2 -0.297107 -0.029977
prestige_3 -0.429169 -0.152753
prestige_4 -0.480056 -0.167932


In [611]:
df['intercept'] = 1.0
train_cols = df.columns[1:]
train_cols

Index([u'gre', u'gpa', u'prestige_2', u'prestige_3', u'prestige_4',
       u'intercept'],
      dtype='object')

In [717]:
logit = sm.Logit(df['admit'], df[train_cols])

# fit the model

result = logit.fit()

# Exponentiate the coefficients and their 95% confidence bounds to get odds ratios
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6
                2.5%     97.5%        OR
gre         1.000074  1.004372  1.002221
gpa         1.136120  4.183113  2.180027
prestige_2  0.272168  0.942767  0.506548
prestige_3  0.133377  0.515419  0.262192
prestige_4  0.093329  0.479411  0.211525
intercept   0.002207  0.194440  0.020716
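Because the odds ratios and their bounds are exponentiated log-odds, each interval is symmetric around the coefficient on the log scale rather than around the odds ratio itself. The sketch below checks this for the `gpa` row of the table and backs out an approximate standard error, assuming a normal-approximation interval of the form coefficient ± 1.96 × standard error.

```python
import math

# gpa row from the odds-ratio table: lower bound, upper bound, odds ratio.
lower, upper, odds_ratio = 1.136120, 4.183113, 2.180027

# Back on the log-odds scale, the interval is symmetric around the coefficient.
coef = math.log(odds_ratio)
half_width_low = coef - math.log(lower)
half_width_high = math.log(upper) - coef

# Assumption: the interval is coef +/- 1.96 * std_err, so the width gives std_err.
std_err = (math.log(upper) - math.log(lower)) / (2 * 1.96)

print("coefficient: {0:.4f}".format(coef))
print("half-widths: {0:.4f} {1:.4f}".format(half_width_low, half_width_high))
print("approx. std err: {0:.4f}".format(std_err))
```

An equivalent check is that the geometric mean of the two bounds, `math.sqrt(lower * upper)`, reproduces the odds ratio.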


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

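Answer: Holding GRE and GPA constant, the odds of being admitted to UCLA are about half as large (odds ratio of about 0.51, 95% confidence interval 0.27 to 0.94) for an applicant from an undergraduate school with a prestige ranking of 2 as for an applicant from the baseline, a school with a prestige ranking of 1.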

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding GRE and prestige constant, a one-point increase in GPA more than doubles the odds of being admitted to UCLA (odds ratio of about 2.18).
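Doubling the odds is not the same as doubling the probability; the effect on the probability depends on where the applicant starts. The sketch below applies the `gpa` odds ratio to a few baseline probabilities, which are hypothetical values picked only for illustration; `apply_odds_ratio` is an illustrative helper name.

```python
def apply_odds_ratio(prob, odds_ratio):
    """Return the probability after multiplying the odds by odds_ratio."""
    odds = prob / (1.0 - prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


gpa_odds_ratio = 2.180027  # gpa odds ratio from the statsmodels table

# Hypothetical baseline admission probabilities, for illustration only.
for baseline in (0.10, 0.30, 0.70):
    updated = apply_odds_ratio(baseline, gpa_odds_ratio)
    print("{0:.2f} -> {1:.3f}".format(baseline, updated))
```

A baseline of 0.30 rises to roughly 0.48, not 0.60, and the gain shrinks as the baseline approaches 1.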

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [810]:
# These cells were run after the sklearn model from Question 18 was fitted;
# predict_proba is a sklearn method, so the predictions use modelskl
predict_X = [ [800, 4, 0, 0, 0] ]

print "tier 1 admit", modelskl.predict(predict_X)
print "tier 1 prob", modelskl.predict_proba(predict_X)

tier 1 admit [1]
tier 1 prob [[ 0.28814605  0.71185395]]


In [806]:
predict_X = [ [800, 4, 1, 0, 0] ]

print "tier 2 admit", modelskl.predict(predict_X)
print "tier 2 prob", modelskl.predict_proba(predict_X)

tier 2 admit [1]
tier 2 prob [[ 0.43153702  0.56846298]]


In [796]:
predict_X = [ [800, 4, 0, 1, 0] ]

print "tier 3 admit", modelskl.predict(predict_X)
print "tier 3 prob", modelskl.predict_proba(predict_X)

tier 3 admit [0]
tier 3 prob [[ 0.58608936  0.41391064]]


In [797]:
predict_X = [ [800, 4, 0, 0, 1] ]

print "tier 4 admit", modelskl.predict(predict_X)
print "tier 4 prob", modelskl.predict_proba(predict_X)

tier 4 admit [0]
tier 4 prob [[ 0.66024514  0.33975486]]


Answer: See above for admission and probabilities at every tier.

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [785]:
X = df[ ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4'] ]
y = df.admit

modelskl = linear_model.LogisticRegression(C = 10 ** 2)
modelskl.fit(X,y)

print modelskl.intercept_
print modelskl.coef_

[-3.51478687]
[[ 0.00215822  0.67315495 -0.62882239 -1.25222745 -1.56879212]]


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [787]:
# Odds ratio = exp(coefficient); result is the statsmodels Logit fit from Question 14
for name, coef_skl in zip(names_X, modelskl.coef_[0]):
    print("{0}: sklearn {1:.4f}, statsmodels {2:.4f}".format(name, np.exp(coef_skl), np.exp(result.params[name])))

gre: sklearn 1.0022, statsmodels 1.0022
gpa: sklearn 1.9604, statsmodels 2.1800
prestige_2: sklearn 0.5332, statsmodels 0.5065
prestige_3: sklearn 0.2859, statsmodels 0.2622
prestige_4: sklearn 0.2083, statsmodels 0.2115


Answer: With `C = 10 ** 2` (weak regularization), the `sklearn` odds ratios are close to the `statsmodels` ones but not identical; for example, the odds ratio for `gpa` is 1.96 with `sklearn` versus 2.18 with `statsmodels`. The gap is most likely because `sklearn` still applies some regularization, while `statsmodels` fits an unpenalized model.

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [798]:
predict_X = [ [800, 4, 0, 0, 0] ]

print "tier 1 admit", modelskl.predict(predict_X)
print "tier 1 prob", modelskl.predict_proba(predict_X)

tier 1 admit [1]
tier 1 prob [[ 0.28814605  0.71185395]]


In [799]:
predict_X = [ [800, 4, 1, 0, 0] ]

print "tier 2 admit", modelskl.predict(predict_X)
print "tier 2 prob", modelskl.predict_proba(predict_X)

tier 2 admit [1]
tier 2 prob [[ 0.43153702  0.56846298]]


In [800]:
predict_X = [ [800, 4, 0, 1, 0] ]

print "tier 3 admit", modelskl.predict(predict_X)
print "tier 3 prob", modelskl.predict_proba(predict_X)

tier 3 admit [0]
tier 3 prob [[ 0.58608936  0.41391064]]


In [801]:
predict_X = [ [800, 4, 0, 0, 1] ]

print "tier 4 admit", modelskl.predict(predict_X)
print "tier 4 prob", modelskl.predict_proba(predict_X)

tier 4 admit [0]
tier 4 prob [[ 0.66024514  0.33975486]]


Answer: See above for admit and probabilities at every tier.
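The probabilities returned by `predict_proba` can be reproduced by hand from the intercept and coefficients printed under Question 18: add up the log-odds and pass them through the logistic function. The following is a minimal sketch in plain Python; `admit_probability` is an illustrative helper name, and small differences in the last digits are expected because the printed coefficients are rounded.

```python
import math

# Intercept and coefficients printed by the sklearn model under Question 18.
intercept = -3.51478687
coef = {"gre": 0.00215822, "gpa": 0.67315495,
        "prestige_2": -0.62882239, "prestige_3": -1.25222745,
        "prestige_4": -1.56879212}


def admit_probability(gre, gpa, tier):
    """Logistic prediction; tier 1 is the reference level and adds nothing."""
    log_odds = intercept + coef["gre"] * gre + coef["gpa"] * gpa
    if tier != 1:
        log_odds += coef["prestige_{0}".format(tier)]
    # Logistic function: maps log-odds to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-log_odds))


probs = {tier: admit_probability(800, 4, tier) for tier in (1, 2, 3, 4)}
for tier in (1, 2, 3, 4):
    print("tier {0}: {1:.4f}".format(tier, probs[tier]))
```

The values agree with the second column of each `predict_proba` output above to about four decimal places.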