# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [101]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [102]:
df.groupby(['prestige', 'admit']).count()

prestige,admit,gre,gpa
1.0,0,28,28
1.0,1,33,33
2.0,0,95,95
2.0,1,53,53
3.0,0,93,93
3.0,1,28,28
4.0,0,55,55
4.0,1,12,12


In [103]:
pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])

admit (rows) / prestige (columns),1.0,2.0,3.0,4.0
0,28,95,93,55
1,33,53,28,12
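
The raw counts can also be turned into admission rates per prestige rank. The following is a minimal sketch, assuming the counts from the table above are rebuilt into a small stand-in frame (the names `counts`, `applicants`, and `rates` are illustrative and not part of the original notebook); on the real data, `pd.crosstab(df['admit'], df['prestige'], normalize='columns')` should give the same proportions.

```python
import pandas as pd

# Counts copied from the frequency table above: {prestige: (not admitted, admitted)}
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

# Expand the counts back into one row per applicant so crosstab can be applied
rows = []
for prestige, (n_rejected, n_admitted) in counts.items():
    rows += [{'prestige': prestige, 'admit': 0}] * n_rejected
    rows += [{'prestige': prestige, 'admit': 1}] * n_admitted
applicants = pd.DataFrame(rows)

# normalize='columns' divides each cell by its column total,
# so the admit == 1 row holds the admission rate for each prestige rank
rates = pd.crosstab(applicants['admit'], applicants['prestige'], normalize='columns')
print(rates.round(3))
```

Reading the `admit = 1` row from left to right shows how the share of admitted applicants changes from rank 1 to rank 4.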


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [104]:
# Vectorized comparisons yield 0/1 indicator columns
# (no reliance on map() returning a list, which it only does on Python 2)
df['prestige1'] = (df['prestige'] == 1).astype(int)
df['prestige2'] = (df['prestige'] == 2).astype(int)
df['prestige3'] = (df['prestige'] == 3).astype(int)
df['prestige4'] = (df['prestige'] == 4).astype(int)

In [105]:
df

Unnamed: 0,admit,gre,gpa,prestige,prestige1,prestige2,prestige3,prestige4
0,0,380.0,3.61,3.0,0,0,1,0
1,1,660.0,3.67,3.0,0,0,1,0
2,1,800.0,4.00,1.0,1,0,0,0
3,1,640.0,3.19,4.0,0,0,0,1
4,0,520.0,2.93,4.0,0,0,0,1
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2.0,0,1,0,0
396,0,560.0,3.04,3.0,0,0,1,0
397,0,460.0,2.63,2.0,0,1,0,0
398,0,700.0,3.65,2.0,0,1,0,0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: 3. There are 4 possible prestige ranks, so 3 binary variables suffice: the 4th is implied when the other 3 are all 0. Including all 4 together with an intercept would make the columns perfectly collinear.
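
As a rough illustration of why three indicators are enough, the sketch below uses `pd.get_dummies` with `drop_first=True` on a toy series (the values are made up for illustration, and the resulting column names `prestige_2`, `prestige_3`, `prestige_4` differ from the `prestige2`-style names used in this notebook). The dropped rank-1 indicator is then recovered from the other three.

```python
import pandas as pd

# Toy column containing the four prestige ranks (illustrative values, not the real data)
prestige = pd.Series([1, 2, 3, 4, 2, 1], name='prestige')

# drop_first=True omits the first category (rank 1), which becomes the reference level
dummies = pd.get_dummies(prestige, prefix='prestige', drop_first=True, dtype=int)

# The dropped indicator is recoverable: it is 1 exactly when the other three are all 0
implied_prestige_1 = 1 - dummies.sum(axis=1)
print(dummies.assign(prestige_1=implied_prestige_1))
```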

> ### Question 4.  Why are we doing this?

Answer: Prestige ranks are categorical values. Some algorithms can handle categorical values directly; others, including logistic regression, need numeric inputs and therefore require one-hot encoding.

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [106]:
df = df.drop('prestige', axis=1)
df = df.drop('prestige1', axis=1)
df

Unnamed: 0,admit,gre,gpa,prestige2,prestige3,prestige4
0,0,380.0,3.61,0,1,0
1,1,660.0,3.67,0,1,0
2,1,800.0,4.00,0,0,0
3,1,640.0,3.19,0,0,1
4,0,520.0,2.93,0,0,1
...,...,...,...,...,...,...
395,0,620.0,4.00,1,0,0
396,0,560.0,3.04,0,1,0
397,0,460.0,2.63,1,0,0
398,0,700.0,3.65,1,0,0


In [111]:
df['intercept'] = 1.0

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [126]:
df[(df['prestige2'] == 0) & (df['prestige3'] == 0) & (df['prestige4'] == 0)].groupby('admit').count()['gre']

admit
0    28
1    33
Name: gre, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [128]:
odds_prestige_1 = 33.0 / 28
odds_prestige_1

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [129]:
odds_prestige_not_1 = (53.0 + 28 + 12) / (95 + 93 + 55)
odds_prestige_not_1

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [130]:
odds_ratio = odds_prestige_1 / odds_prestige_not_1
odds_ratio

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for an applicant who attended a #1 ranked college are approximately 3.1 times the odds for an applicant who did not attend a #1 ranked college.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [131]:
odds_prestige_4 = 12.0 / 55
odds_prestige_4

0.21818181818181817

In [132]:
odds_prestige_not_4 = (33.0 + 53 + 28) / (28 + 95 + 93)
odds_prestige_not_4

0.5277777777777778

In [133]:
odds_ratio2 = odds_prestige_4 / odds_prestige_not_4
odds_ratio2

0.4133971291866028

Answer: The odds of admission for an applicant who attended a #4 ranked college are approximately 0.4 times the odds for an applicant who did not attend a #4 ranked college. (Conversely, the odds of admission for an applicant who did not attend a #4 ranked college are approximately 2.4 times the odds for an applicant who did, since 1 / 0.413 ≈ 2.42.)
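
The hand calculations in this part all follow one pattern: compute the odds within a group, compute the odds in the remaining applicants, and divide. Here is a small sketch of that pattern, using the counts from the frequency table; the helper names `odds`, `odds_ratio`, and `one_vs_rest` are hypothetical and not part of the original notebook.

```python
def odds(n_admitted, n_rejected):
    """Odds of admission: admitted count divided by rejected count."""
    return float(n_admitted) / n_rejected

def odds_ratio(group, rest):
    """Odds ratio of `group` versus `rest`; each is an (admitted, rejected) pair."""
    return odds(*group) / odds(*rest)

# (admitted, rejected) counts per prestige rank, copied from the frequency table
counts = {1: (33, 28), 2: (53, 95), 3: (28, 93), 4: (12, 55)}

def one_vs_rest(rank):
    """Odds ratio for one rank against all other ranks pooled together."""
    rest_admitted = sum(a for r, (a, _) in counts.items() if r != rank)
    rest_rejected = sum(n for r, (_, n) in counts.items() if r != rank)
    return odds_ratio(counts[rank], (rest_admitted, rest_rejected))

for rank in sorted(counts):
    print(rank, round(one_vs_rest(rank), 3))
```

For ranks 1 and 4 this should reproduce the values computed by hand above (about 3.08 and 0.41).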

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [134]:
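import statsmodels.api as sm

# List the predictors explicitly so helper columns added later (e.g. pred_admit) are not picked up
train_cols = ['gre', 'gpa', 'prestige2', 'prestige3', 'prestige4', 'intercept']
logit = sm.Logit(df['admit'], df[train_cols])
result = logit.fit()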

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [135]:
result.summary()

Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Tue, 17 Jan 2017",Pseudo R-squ.:,0.08166
Time:,13:57:54,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
gre,0.0022,0.001,2.028,0.043,7.44e-05 0.004
gpa,0.7793,0.333,2.344,0.019,0.128 1.431
prestige2,-0.6801,0.317,-2.146,0.032,-1.301 -0.059
prestige3,-1.3387,0.345,-3.882,0.000,-2.015 -0.663
prestige4,-1.5534,0.417,-3.721,0.000,-2.372 -0.735
intercept,-3.8769,1.142,-3.393,0.001,-6.116 -1.638


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [136]:
np.exp(result.params)

gre          1.002221
gpa          2.180027
prestige2    0.506548
prestige3    0.262192
prestige4    0.211525
intercept    0.020716
dtype: float64

In [137]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
np.exp(conf)

Unnamed: 0,2.5%,97.5%,OR
gre,1.000074,1.004372,1.002221
gpa,1.13612,4.183113,2.180027
prestige2,0.272168,0.942767,0.506548
prestige3,0.133377,0.515419,0.262192
prestige4,0.093329,0.479411,0.211525
intercept,0.002207,0.19444,0.020716


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding `gre` and `gpa` constant, the odds of admission for an applicant from a #2 ranked school are roughly half (about 0.51 times) the odds for an applicant from a #1 ranked school, which is the reference category.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding the other features constant, a one-unit increase in GPA multiplies the odds of admission by roughly 2.18, i.e. an increase of about 118%. (If the odds of admission are 0.5 : 1 with a GPA of 2.0, the odds are (0.5 * 2.18) : 1 with a GPA of 3.0, or 1.09 : 1.)
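
The step from a fitted coefficient to a statement about odds can be made explicit: exponentiating a log-odds coefficient gives the multiplicative change in odds, and subtracting 1 gives the percentage change. The sketch below uses the rounded `gpa` and `gre` coefficients from the summary table; the helper `pct_change_in_odds` is a hypothetical name introduced only for illustration.

```python
import math

# Rounded log-odds coefficients copied from the statsmodels summary table
coef_gpa = 0.7793
coef_gre = 0.0022

def pct_change_in_odds(coef, delta=1.0):
    """Percentage change in odds when the feature increases by `delta` units."""
    return (math.exp(coef * delta) - 1.0) * 100.0

gpa_pct = pct_change_in_odds(coef_gpa)       # one-point GPA increase
gre_pct = pct_change_in_odds(coef_gre, 100)  # 100-point GRE increase

print(round(gpa_pct, 1), round(gre_pct, 1))
```

Scaling `delta` is useful for features such as `gre`, where a one-point change is too small to interpret comfortably.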

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [145]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige2,prestige3,prestige4,intercept,pred_admit
0,0,380.0,3.61,0,1,0,1.0,0
1,1,660.0,3.67,0,1,0,1.0,0
2,1,800.0,4.0,0,0,0,1.0,1
3,1,640.0,3.19,0,0,1,1.0,0
4,0,520.0,2.93,0,0,1,1.0,0


In [158]:
test_inputs = pd.DataFrame(columns=['gre', 'gpa', 'prestige2', 'prestige3', 'prestige4','intercept'])

In [172]:
s1 = pd.Series([800, 800, 800, 800])
s2 = pd.Series([4, 4, 4, 4])
s3 = pd.Series([0,1,0,0])
s4 = pd.Series([0,0,1,0])
s5 = pd.Series([0,0,0,1])
s6 = pd.Series([1,1,1,1])

In [182]:
test_inputs = pd.concat([s1, s2, s3, s4, s5, s6], axis=1)
test_inputs.columns = ['gre', 'gpa', 'prestige2', 'prestige3', 'prestige4','intercept']
test_inputs

Unnamed: 0,gre,gpa,prestige2,prestige3,prestige4,intercept
0,800,4,0,0,0,1
1,800,4,1,0,0,1
2,800,4,0,1,0,1
3,800,4,0,0,1,1


In [183]:
result.predict(test_inputs)

array([ 0.73403998,  0.58299512,  0.41983282,  0.36860803])

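Answer: According to the `statsmodels` model, the probability of admission is approximately 73% for a tier-1 school, 58% for tier 2, 42% for tier 3, and 37% for tier 4.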
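
The predicted probabilities above can be approximated by hand by plugging the coefficients into the logistic function. This is a sketch that uses the rounded coefficients from the summary table, so it should agree with `result.predict` only to about two decimal places; the function name `admit_probability` is hypothetical.

```python
import math

# Rounded coefficients copied from the statsmodels summary table
coefs = {'intercept': -3.8769, 'gre': 0.0022, 'gpa': 0.7793,
         'prestige2': -0.6801, 'prestige3': -1.3387, 'prestige4': -1.5534}

def admit_probability(gre, gpa, tier):
    """Logistic-regression probability of admission for one applicant."""
    # Linear predictor (log-odds); tier 1 is the reference level and adds nothing
    z = coefs['intercept'] + coefs['gre'] * gre + coefs['gpa'] * gpa
    if tier != 1:
        z += coefs['prestige%d' % tier]
    # The logistic function maps log-odds to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

probs = [admit_probability(800, 4.0, tier) for tier in (1, 2, 3, 4)]
print([round(p, 3) for p in probs])
```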

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [139]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C = 10 ** 2)
feature_cols = ['gre', 'gpa', 'prestige2', 'prestige3', 'prestige4']
X = df[feature_cols]
y = df['admit']
logreg.fit(X, y)
df['pred_admit'] = logreg.predict(X)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [140]:
# Wrap zip() in list() so the pairs are displayed on Python 3 as well
list(zip(feature_cols, np.exp(logreg.coef_[0])))

[('gre', 1.0021605460502792),
 ('gpa', 1.9604125876547573),
 ('prestige2', 0.53321935757234395),
 ('prestige3', 0.28586733162404843),
 ('prestige4', 0.20829662748851235)]

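Answer: They differ slightly but are close to the odds ratios calculated with `statsmodels`: `gre` is essentially identical (1.0022 in both), `gpa` is 1.96 versus 2.18, `prestige2` is 0.53 versus 0.51, `prestige3` is 0.29 versus 0.26, and `prestige4` is 0.21 in both. One likely source of the gap is that `LogisticRegression` applies L2 regularization (controlled by `C`), whereas the `statsmodels` fit is unpenalized maximum likelihood.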
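
One way to see how `C` influences the fitted coefficients is to refit the same model at several values of `C`. The sketch below does this on synthetic data (randomly generated here purely for illustration, not the admissions data) and records the size of the coefficient vector; it assumes the default L2 penalty of `LogisticRegression`, where smaller `C` means stronger regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data set (illustrative only, not the admissions data)
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
true_coef = np.array([1.5, -1.0])
p = 1.0 / (1.0 + np.exp(-(X @ true_coef)))
y = (rng.uniform(size=500) < p).astype(int)

# Fit at strong, weak, and near-zero regularization and compare coefficient sizes
norms = {}
for C in (0.01, 100, 1e6):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_[0]))
    print(C, model.coef_[0].round(3))
```

With strong regularization the coefficients shrink toward zero, while at `C = 100` they should already be close to the nearly unpenalized fit, which is consistent with the small differences seen above.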

> ### Question 20.  Again assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [186]:
test_inputs2 = test_inputs.drop('intercept', axis=1)

In [187]:
logreg.predict_proba(test_inputs2)

array([[ 0.28814605,  0.71185395],
       [ 0.43153702,  0.56846298],
       [ 0.58608936,  0.41391064],
       [ 0.66024514,  0.33975486]])

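Answer: According to the `sklearn` model (second column of `predict_proba`, i.e. the probability of `admit = 1`), the probability of admission is approximately 71% for a tier-1 school, 57% for tier 2, 41% for tier 3, and 34% for tier 4.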