# DS-SF-25 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [263]:
import os
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import linear_model

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

In [264]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


In [265]:
df.shape

(397, 4)

## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [266]:
df.prestige = df.prestige.astype(int)
pd.crosstab(df.prestige,df.admit)

admit,0,1
prestige,Unnamed: 1_level_1,Unnamed: 2_level_1
1,28,33
2,95,53
3,93,28
4,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [267]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')
prestige_df.head()

Unnamed: 0,prestige_1,prestige_2,prestige_3,prestige_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: 3 (= # categories - 1)

> ### Question 4.  Why are we doing this?

Answer: The effect of prestige between a 4 to 3 rank may not be the same as the difference between a 2 to 1 rank even though numerically they are both 1 unit apart.

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [268]:
df.drop('prestige',axis=1,inplace=True)
df = df.join([prestige_df])
df.head()

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [269]:
ct = pd.crosstab([df.prestige_1,df.prestige_2,df.prestige_3,df.prestige_4],df.admit)
ct["total"] = ct[0] + ct[1]
ct

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,admit,0,1,total
prestige_1,prestige_2,prestige_3,prestige_4,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,0.0,0.0,1.0,55,12,67
0.0,0.0,1.0,0.0,93,28,121
0.0,1.0,0.0,0.0,95,53,148
1.0,0.0,0.0,0.0,28,33,61


In [270]:
print('Sum 0:')
print(ct[0].sum())
print('Sum 1:')
print(ct[1].sum())

Sum 0:
271
Sum 1:
126


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [271]:
# Rank 1 is the most prestigous, Rank 4 is the least
nf = 28
f = 33
total = nf + f
p = f*1./total
print('Probability: %.1f%%' %(p*100.))
o1 = p / (1. - p)
print('Odds: %.1f' %(o1))

Probability: 54.1%
Odds: 1.2


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [272]:
# Ranks 2,3,4
nf = 271-28
f = 126-33
total = nf + f
p = f*1./total
print('Probability: %.1f%%' %(p*100.))
o234 = p / (1. - p)
print('Odds: %.1f' %(o234))

Probability: 27.7%
Odds: 0.4


> ### Question 9.  Finally, what's the odds ratio?

In [273]:
# Odds Rank 4 / Odds Ranks 1,2,3
o1/o234

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: Those that go to the highest ranking school have three times the odds of being admitted compared to everyone else.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentenance.

In [274]:
# Rank 4 is the least prestigous
nf = 55
f = 12
total = nf + f
p = f*1./total
print('Probability: %.1f%%' %(p*100.))
o4 = p / (1. - p)
print('Odds: %.1f' %(o4))

Probability: 17.9%
Odds: 0.2


In [275]:
# Ranks 1,2,3
nf = 271-55
f = 126-12
total = nf + f
p = f*1./total
print('Probability: %.1f%%' %(p*100.))
o123 = p / (1. - p)
print('Odds: %.1f' %(o234))

Probability: 34.5%
Odds: 0.4


In [276]:
o4/o123

0.4133971291866028

Answer: Those that go to the lowest prestige ranking school have less than half the odds at getting in as everyone else.

## Part C. Analysis using `statsmodel`

> ### Question 12.  Fit a logistic regression model prediting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [277]:
# TODO: Reference point?
#train_df = df.sample(frac = .8, random_state = 0)
#test_df = df.drop(train_df.index)

train_df = df
train_df['intercept']=1.
train_df.head()

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4,intercept
0,0,380.0,3.61,0.0,0.0,1.0,0.0,1.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0,1.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0,1.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0,1.0


In [278]:
columns = ['gre','gpa','prestige_2','prestige_3','prestige_4','intercept']
logit = sm.Logit(train_df['admit'], train_df[columns])
model = logit.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [279]:
model.summary()

0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Wed, 17 Aug 2016",Pseudo R-squ.:,0.08166
Time:,00:21:36,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
gre,0.0022,0.001,2.028,0.043,7.44e-05 0.004
gpa,0.7793,0.333,2.344,0.019,0.128 1.431
prestige_2,-0.6801,0.317,-2.146,0.032,-1.301 -0.059
prestige_3,-1.3387,0.345,-3.882,0.000,-2.015 -0.663
prestige_4,-1.5534,0.417,-3.721,0.000,-2.372 -0.735
intercept,-3.8769,1.142,-3.393,0.001,-6.116 -1.638


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [280]:
params_model = model.params
conf = model.conf_int()
conf['odds_ratio'] = params_model
conf.columns = ['minus', 'plus', 'odds_ratio']
print np.exp(conf)

               minus      plus  odds_ratio
gre         1.000074  1.004372    1.002221
gpa         1.136120  4.183113    2.180027
prestige_2  0.272168  0.942767    0.506548
prestige_3  0.133377  0.515419    0.262192
prestige_4  0.093329  0.479411    0.211525
intercept   0.002207  0.194440    0.020716


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: The odds of admittance would decrease by half for a prestige_2 school.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: The odds of admittance double for a unit increase in GPA (e.g. 3 to 4). 

> ### Question 17.  Assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [281]:
df_test = pd.DataFrame({'gre':[800.,800,800,800],'gpa':[4.,4,4,4],'prestige_1':[1,0,0,0],
                        'prestige_2':[0,1,0,0],'prestige_3':[0,0,1,0],'prestige_4':[0,0,0,1],
                       'intercept':[1.,1,1,1]})
df_test

Unnamed: 0,gpa,gre,intercept,prestige_1,prestige_2,prestige_3,prestige_4
0,4.0,800.0,1.0,1,0,0,0
1,4.0,800.0,1.0,0,1,0,0
2,4.0,800.0,1.0,0,0,1,0
3,4.0,800.0,1.0,0,0,0,1


In [282]:
columns = ['gre','gpa','prestige_2','prestige_3','prestige_4','intercept']
df_test['admit_predict'] = model.predict(df_test[columns])
df_test

Unnamed: 0,gpa,gre,intercept,prestige_1,prestige_2,prestige_3,prestige_4,admit_predict
0,4.0,800.0,1.0,1,0,0,0,0.73404
1,4.0,800.0,1.0,0,1,0,0,0.582995
2,4.0,800.0,1.0,0,0,1,0,0.419833
3,4.0,800.0,1.0,0,0,0,1,0.368608


Answer:

In [283]:
df_test2 = pd.DataFrame({'gre':[600.,600,600,600],'gpa':[3.,3,3,3],'prestige_1':[1,0,0,0],
                        'prestige_2':[0,1,0,0],'prestige_3':[0,0,1,0],'prestige_4':[0,0,0,1],
                       'intercept':[1.,1,1,1]})
df_test2

Unnamed: 0,gpa,gre,intercept,prestige_1,prestige_2,prestige_3,prestige_4
0,3.0,600.0,1.0,1,0,0,0
1,3.0,600.0,1.0,0,1,0,0
2,3.0,600.0,1.0,0,0,1,0
3,3.0,600.0,1.0,0,0,0,1


In [284]:
columns = ['gre','gpa','prestige_2','prestige_3','prestige_4','intercept']
df_test2['admit_predict'] = model.predict(df_test2[columns])
df_test2

Unnamed: 0,gpa,gre,intercept,prestige_1,prestige_2,prestige_3,prestige_4,admit_predict
0,3.0,600.0,1.0,1,0,0,0,0.448236
1,3.0,600.0,1.0,0,1,0,0,0.291536
2,3.0,600.0,1.0,0,0,1,0,0.175596
3,3.0,600.0,1.0,0,0,0,1,0.146639


## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  When creating the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [149]:
names_X = ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']

def X_y(df):
    X = df[ names_X ]
    y = df.admit
    return X, y

train_X, train_y = X_y(train_df)
test_X, test_y = X_y(test_df)

In [150]:
model = linear_model.LogisticRegression().\
    fit(train_X, train_y)

print model.intercept_
print model.coef_

[-1.95266679]
[[ 0.00172658  0.19596439  0.35943456 -0.31929464 -0.88821548 -1.10459123]]


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [151]:
# TODO

Answer:

> ### Question 20.  Again assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [152]:
# TODO

Answer: