# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [130]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [131]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [132]:
prestige = df.prestige
df.prestige.value_counts()

2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64
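The `value_counts()` call above tallies `prestige` on its own. A joint frequency table of `prestige` against `admit` could be built with `pd.crosstab`. The sketch below uses a small made-up frame (`toy` is hypothetical, not the UCLA admissions file), so its counts are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, NOT the admissions dataset.
toy = pd.DataFrame({
    'admit':    [0, 1, 1, 0, 0, 1, 0, 1],
    'prestige': [3., 3., 1., 4., 4., 2., 2., 1.],
})

# Rows: admit (0/1); columns: prestige tier; cells: number of applicants.
freq = pd.crosstab(toy['admit'], toy['prestige'])
print(freq)
```

On the notebook's data the equivalent call would be `pd.crosstab(df['admit'], df['prestige'])`.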

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [133]:
df_prestige_onehot = pd.get_dummies(df.prestige, prefix="prestige")
df_prestige_onehot

,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


> ### Question 3.  How many of these binary variables do we need for modeling?

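Answer: n - 1. With four prestige levels, three binary variables capture all the information; the fourth level is implied when the other three are all 0.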
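As a minimal sketch of why one dummy is redundant, the example below encodes a few made-up prestige values (hypothetical, not the dataset) twice: once in full and once with `drop_first=True`, which drops the first level and makes it the reference category.

```python
import pandas as pd

# Hypothetical toy values, NOT the admissions dataset.
prestige = pd.Series([3., 3., 1., 4., 2.], name='prestige')

full = pd.get_dummies(prestige, prefix='prestige')
reduced = pd.get_dummies(prestige, prefix='prestige', drop_first=True)

# Every row of the full encoding sums to 1, so any one column
# can be recovered from the other three.
row_sums = full.sum(axis=1)

# In the reduced encoding, an all-zero row means prestige == 1.
is_reference = reduced.sum(axis=1) == 0

print(list(reduced.columns))
```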

> ### Question 4.  Why are we doing this?

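Answer: One-hot encoding transforms categorical features into a numeric format that classification models can consume. Coding `prestige` as separate binary variables also avoids assuming that each step between tiers has the same effect.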

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [134]:
df.drop('prestige', axis=1, inplace=True)
df = df.join(df_prestige_onehot)

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [135]:
pd.crosstab(df['admit'], df['prestige_1.0'])

admit \ prestige_1.0,0,1
0,243,28
1,93,33


In [136]:
pd.crosstab(df['admit'], df['prestige_1.0'], normalize='index')

admit \ prestige_1.0,0,1
0,0.896679,0.103321
1,0.738095,0.261905


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [137]:
prob = 33. / (28 + 33)
oddsA = prob / (1 - prob)
oddsA

1.1785714285714288

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [138]:
prob = 93. / (243 + 93)
oddsB = prob / (1 - prob)
oddsB

0.3827160493827161

> ### Question 9.  Finally, what's the odds ratio?

In [139]:
oddsA / oddsB

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants from tier-1 (most prestigious) schools are about 3.08 times the odds for applicants from all other schools.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [140]:
pd.crosstab(df['admit'], df['prestige_4.0'])

admit \ prestige_4.0,0,1
0,216,55
1,114,12


In [141]:
pA = 12. / (12 + 55)
oddsA = pA / ( 1 - pA )

pB = 114. / (114 + 216)
oddsB = pB / (1 - pB)

oddsA / oddsB

0.4133971291866028

Answer: The odds of admission for applicants from tier-4 (least prestigious) schools are about 0.41 times the odds for applicants from all other schools.
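Both hand calculations above follow the same pattern, since the odds p / (1 - p) simplify to admitted / rejected. The sketch below wraps that pattern in two helper functions (`odds` and `odds_ratio` are names introduced here for illustration) and plugs in the counts from the two frequency tables.

```python
def odds(admitted, rejected):
    """Odds of admission: p / (1 - p), which equals admitted / rejected."""
    return float(admitted) / rejected


def odds_ratio(adm_group, rej_group, adm_rest, rej_rest):
    """Odds in the group of interest divided by odds in everyone else."""
    return odds(adm_group, rej_group) / odds(adm_rest, rej_rest)


# Counts taken from the crosstabs for prestige_1.0 and prestige_4.0.
tier1 = odds_ratio(33, 28, 93, 243)
tier4 = odds_ratio(12, 55, 114, 216)

print(round(tier1, 4), round(tier4, 4))  # 3.0795 0.4134
```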

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [142]:
# drop prestige_1.0 so that tier 1 becomes the reference category
df.drop('prestige_1.0', axis=1, inplace=True)

In [143]:
# manually add intercept
df['intercept'] = 1.0

In [144]:
train_cols = df.columns[1:]

In [145]:
# from patsy import dmatrices
# y, X = dmatrices('admit ~ gre + gpa + C(prestige)', df, return_type = 'dataframe')

import statsmodels.discrete.discrete_model as sm

y = df['admit']
X = df[train_cols]

logit = sm.Logit(y, X)
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [146]:
result.params

gre             0.002218
gpa             0.779337
prestige_2.0   -0.680137
prestige_3.0   -1.338677
prestige_4.0   -1.553411
intercept      -3.876854
dtype: float64

In [147]:
result.summary()

Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 06 Feb 2017",Pseudo R-squ.:,0.08166
Time:,23:15:51,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,95% CI lower,95% CI upper
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
prestige_2.0,-0.6801,0.317,-2.146,0.032,-1.301,-0.059
prestige_3.0,-1.3387,0.345,-3.882,0.000,-2.015,-0.663
prestige_4.0,-1.5534,0.417,-3.721,0.000,-2.372,-0.735
intercept,-3.8769,1.142,-3.393,0.001,-6.116,-1.638


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [149]:
logit.fit().conf_int(alpha=0.05)

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


,0,1
gre,7.4e-05,0.004362
gpa,0.127619,1.431056
prestige_2.0,-1.301337,-0.058936
prestige_3.0,-2.014579,-0.662776
prestige_4.0,-2.371624,-0.735197
intercept,-6.116077,-1.637631


In [150]:
# odds ratios
print(np.exp(result.params))

gre             1.002221
gpa             2.180027
prestige_2.0    0.506548
prestige_3.0    0.262192
prestige_4.0    0.211525
intercept       0.020716
dtype: float64
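The `conf_int()` output above is on the coefficient (log-odds) scale, while the question asks for intervals on the odds ratios. Exponentiating each bound converts one into the other. The sketch below does this with the bounds copied from that output; in the notebook itself, `np.exp(result.conf_int())` should give the same table directly.

```python
import math

# 95% bounds on the coefficient scale, copied from the conf_int() output.
coef_ci = {
    'gre':          (7.4e-05, 0.004362),
    'gpa':          (0.127619, 1.431056),
    'prestige_2.0': (-1.301337, -0.058936),
    'prestige_3.0': (-2.014579, -0.662776),
    'prestige_4.0': (-2.371624, -0.735197),
    'intercept':    (-6.116077, -1.637631),
}

# exp() is monotonic, so exponentiating each bound gives the
# corresponding bound on the odds-ratio scale.
or_ci = {name: (math.exp(lo), math.exp(hi))
         for name, (lo, hi) in coef_ci.items()}

for name in sorted(or_ci):
    lo, hi = or_ci[name]
    print('%-13s %.4f  %.4f' % (name, lo, hi))
```

An interval that lies entirely below 1 (as for the prestige dummies) or entirely above 1 (as for `gpa`) excludes "no effect" at the 5% level.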


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

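Answer: Holding GRE and GPA constant, the odds of admission for applicants from tier-2 schools are about 0.51 times the odds for applicants from tier-1 schools (the reference group), i.e., about 49% lower.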

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding the other features constant, each additional GPA point multiplies the odds of admission by about 2.18.
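An odds ratio can be restated as a percent change in the odds, which is sometimes easier to read. The sketch below applies that conversion to two of the odds ratios printed earlier and rescales the `gre` coefficient to a 100-point step; `pct_change_in_odds` is a helper name introduced here for illustration.

```python
import math


def pct_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratios copied from the np.exp(result.params) output.
tier2_change = pct_change_in_odds(0.506548)   # prestige_2.0 vs. tier 1
gpa_change = pct_change_in_odds(2.180027)     # per additional GPA point

# A one-point GRE change is tiny, so rescale the coefficient (0.002218)
# to a 100-point change before exponentiating.
gre_100 = math.exp(0.002218 * 100)

print(round(tier2_change, 1), round(gpa_change, 1), round(gre_100, 3))
```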

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [189]:
guess = pd.DataFrame({
        'gre': [800,800,800,800],
        'gpa': [4,4,4,4],
        'prestige': [1., 2., 3., 4.],
        'intercept': [1., 1., 1., 1.]
    })

dummy_pres = pd.get_dummies(guess['prestige'], prefix='prestige')

guess = guess.join(dummy_pres)
guess.drop('prestige', axis=1, inplace=True)

prediction = result.predict(guess[train_cols])

with_pred = guess.copy()
with_pred['prediction'] = prediction 

with_pred

,gpa,gre,intercept,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0,prediction
0,4,800,1.0,1,0,0,0,0.73404
1,4,800,1.0,0,1,0,0,0.582995
2,4,800,1.0,0,0,1,0,0.419833
3,4,800,1.0,0,0,0,1,0.368608


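Answer: The predicted probabilities of admission are about 0.73 for tier 1, 0.58 for tier 2, 0.42 for tier 3, and 0.37 for tier 4.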
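The predictions above can be checked by hand: the model's linear combination of features is a log-odds value, and the logistic function turns it into a probability. The sketch below does this with the rounded coefficients from `result.params`, so it matches the `prediction` column only to about three decimal places; `admit_probability` is a helper name introduced here for illustration.

```python
import math

# Coefficients copied (rounded) from result.params.
coef = {
    'gre': 0.002218,
    'gpa': 0.779337,
    'prestige_2.0': -0.680137,
    'prestige_3.0': -1.338677,
    'prestige_4.0': -1.553411,
    'intercept': -3.876854,
}


def admit_probability(gre, gpa, tier):
    """Logistic-regression probability for one applicant (tier in 1..4)."""
    z = coef['intercept'] + coef['gre'] * gre + coef['gpa'] * gpa
    if tier != 1:
        # Tier 1 is the reference category and has no dummy term.
        z += coef['prestige_%d.0' % tier]
    return 1.0 / (1.0 + math.exp(-z))


probs = [admit_probability(800, 4, t) for t in (1, 2, 3, 4)]
print([round(p, 3) for p in probs])
```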

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [182]:
model = linear_model.LogisticRegression(C = 10 ** 2)
skresult = model.fit(X, y=y)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [181]:
np.exp(skresult.coef_)

array([[ 1.0020999 ,  2.05051732,  0.48210089,  0.24652421,  0.20131997,
         0.16981759]])

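Answer: The odds ratios for `gre`, `gpa`, and the prestige dummies are close to the `statsmodels` values but not identical (for example, 2.05 vs. 2.18 for `gpa`). `LogisticRegression` applies L2 regularization by default, and `C = 10 ** 2` makes it weak but not zero, which is consistent with the small differences. The last value differs most (0.17 vs. 0.02) because `X` already contains a manual `intercept` column while `LogisticRegression` also fits its own intercept by default, so the constant term is split between the two.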
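One way to avoid fitting two constants is to pass `fit_intercept=False` when the design matrix already carries an intercept column. The sketch below shows this on synthetic data (the features `x1`, `x2` and the generating coefficients are made up for illustration, not taken from the admissions dataset).

```python
import numpy as np
from sklearn import linear_model

# Synthetic data, NOT the admissions dataset.
rng = np.random.RandomState(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
true_logit = -0.5 + 1.0 * x1 - 0.8 * x2
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

# Design matrix with a manual intercept column, as in the notebook.
X = np.column_stack([x1, x2, np.ones(n)])

# fit_intercept=False: the manual column is the only constant term.
model = linear_model.LogisticRegression(C=10 ** 2, fit_intercept=False)
model.fit(X, y)

odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)
```

With this setting the last entry of `coef_` plays the role of the `statsmodels` intercept, and `model.intercept_` stays at zero.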

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [196]:
prediction = skresult.predict_proba(guess[train_cols])
prediction

guess.join(pd.DataFrame(prediction))

,gpa,gre,intercept,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0,0,1
0,4,800,1.0,1,0,0,0,0.268065,0.731935
1,4,800,1.0,0,1,0,0,0.431714,0.568286
2,4,800,1.0,0,0,1,0,0.597686,0.402314
3,4,800,1.0,0,0,0,1,0.645289,0.354711


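Answer: The predictions are about the same as the `statsmodels` model's: roughly 0.73, 0.57, 0.40, and 0.35 for tiers 1 through 4, versus 0.73, 0.58, 0.42, and 0.37.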