# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [3]:
pd.crosstab(df['admit'], df['prestige'], rownames=['admit'], margins=True)

prestige,1.0,2.0,3.0,4.0,All
admit,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,28,95,93,55,271
1,33,53,28,12,126
All,61,148,121,67,397


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
df.prestige = df.prestige.astype(int) #convert prestige from float to integer

prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

In [5]:
prestige_df

Unnamed: 0,prestige_1,prestige_2,prestige_3,prestige_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: One 

> ### Question 4.  Why are we doing this?

Answer:  Logistic regression models the probability of the default class.  In order to model, we need to transform the feature(s) into a binary variable(s).

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [6]:
# TODO

df = df.join([prestige_df])
df = df.drop('prestige', 1)

In [7]:
df

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [8]:
pd.crosstab(df.admit, df.prestige_1, margins=True)

#pd.crosstab(df['admit'], df[['prestige_1','prestige_2','prestige_3','prestige_4']], margins=True)

prestige_1,0.0,1.0,All
admit,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,243,28,271
1,93,33,126
All,336,61,397


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [9]:
admit_pres_1 = 33 / 61.
odds_admit_pres_1 = admit_pres_1 / (1 - admit_pres_1)

print admit_pres_1
print odds_admit_pres_1

0.540983606557
1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [10]:
admit_pres_0 = 93 / 336.
odds_admit_pres_0 = admit_pres_0 / (1 - admit_pres_0)

print admit_pres_0
print odds_admit_pres_0


0.276785714286
0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [11]:
odds_admit_pres_1 / odds_admit_pres_0

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for an undergrad from a #1 ranked college is 3.1 times larger than one who is not from a #1 ranked college.


> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [12]:
# TODO

pd.crosstab(df.admit, df.prestige_4, margins=True)

prestige_4,0.0,1.0,All
admit,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,216,55,271
1,114,12,126
All,330,67,397


In [23]:
admit_pres_4 = 12 / 67.
odds_admit_pres_4 = admit_pres_4 / (1 - admit_pres_4)

admit_NotPres_4 = 114 / 330.
odds_admit_NotPres_4 = admit_NotPres_4 / (1 - admit_NotPres_4)


odds_admit_pres_4 / odds_admit_NotPres_4

0.4133971291866028

In [24]:
odds_admit_NotPres_4 / odds_admit_pres_4 

2.418981481481482

Answer:  The odds of admission from a college other than one ranked #4 is 2.4 times larger than one who is from a college ranked #4. 

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [50]:

result = smf.ols(formula = 'admit ~ gre + gpa + prestige_1 + prestige_2 + prestige_3 + prestige_4', data = df).fit()


> ### Question 13.  Print the model's summary results.

In [51]:
result.summary()

0,1,2,3
Dep. Variable:,admit,R-squared:,0.099
Model:,OLS,Adj. R-squared:,0.087
Method:,Least Squares,F-statistic:,8.594
Date:,"Thu, 09 Feb 2017",Prob (F-statistic):,9.71e-08
Time:,17:41:31,Log-Likelihood:,-239.02
No. Observations:,397,AIC:,490.0
Df Residuals:,391,BIC:,513.9
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,-0.3459,0.165,-2.092,0.037,-0.671 -0.021
gre,0.0004,0.000,1.997,0.047,6.48e-06 0.001
gpa,0.1508,0.064,2.349,0.019,0.025 0.277
prestige_1,0.1081,0.066,1.636,0.103,-0.022 0.238
prestige_2,-0.0554,0.053,-1.051,0.294,-0.159 0.048
prestige_3,-0.1828,0.056,-3.240,0.001,-0.294 -0.072
prestige_4,-0.2158,0.059,-3.668,0.000,-0.332 -0.100

0,1,2,3
Omnibus:,152.312,Durbin-Watson:,1.946
Prob(Omnibus):,0.0,Jarque-Bera (JB):,50.314
Skew:,0.678,Prob(JB):,1.19e-11
Kurtosis:,1.904,Cond. No.,4.51e+18


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [54]:
# odds ratios and 95% CI

params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
np.exp(conf)

Unnamed: 0,2.5%,97.5%,OR
Intercept,0.511242,0.97933,0.707584
gre,1.000006,1.000837,1.000422
gpa,1.024866,1.31928,1.162792
prestige_1,0.978418,1.268859,1.114214
prestige_2,0.852945,1.049459,0.946114
prestige_3,0.745463,0.930649,0.832925
prestige_4,0.717823,0.904694,0.80586


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer:  Prestige 2's odds ration is < 1 which means that, as there are more applicants from 2nd ranked undergraduate schools, the less probable of getting admitted into UCLA.


> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer:  The odds ration is > 1 which means, the higher the gpa, the greater the probability of being admitted into UCLA.

> ### Question 17.  Assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [55]:
# From a tier-1 undergrad school..

predict_X = [[800.0, 4.00, 1, 0, 0, 0]]

print model_prestige_1.predict(predict_X)
print model_prestige_1.predict_proba(predict_X)


NameError: name 'model_prestige_1' is not defined

In [None]:
# From a tier-2 undergrad school..

predict_X = [[800.0 , 4.00, 0, 1, 0, 0]]

print model_prestige_1.predict(predict_X)
print model_prestige_1.predict_proba(predict_X)


In [None]:
# From a tier-3 undergrad school..

predict_X = [[800.0 , 4.00, 0, 0, 1, 0]]

print model_prestige_1.predict(predict_X)
print model_prestige_1.predict_proba(predict_X)

In [None]:
# From a tier-4 undergrad school..

predict_X = [[800.0 , 4.00, 0, 0, 0, 1]]

print model_prestige_1.predict(predict_X)
print model_prestige_1.predict_proba(predict_X)

Answer:

Tier-1: 89%. 
Tier-2:  1%. 
Tier-3:  1%. 
Tier-4:  2%. 


## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  When creating the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [56]:
# TODO
X = df[['gre', 'gpa','prestige_1','prestige_2','prestige_3','prestige_4']]

model_prestige_1 = linear_model.LogisticRegression().fit(X, df.prestige_1)


In [57]:
print model_prestige_1.intercept_
print model_prestige_1.coef_

[-0.40364527]
[[ -2.26904711e-04  -5.25763886e-01   4.74517716e+00  -1.98093625e+00
   -1.79644243e+00  -1.37144376e+00]]


In [58]:
zip(X, np.exp(model_prestige_1.coef_[0]) - 1)

[('gre', -0.0002268789698139928),
 ('gpa', -0.40889634386194595),
 ('prestige_1', 114.02818163000299),
 ('prestige_2', -0.86205996917075911),
 ('prestige_3', -0.83411200139358777),
 ('prestige_4', -0.74625964460406968)]

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again, assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: