# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..','..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
freq_table=pd.crosstab(df['prestige'],df['admit'])
freq_table

prestige,admit = 0,admit = 1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


In [4]:
#Check percent admit by prestige
freq_table.iloc[:,1] / (freq_table.iloc[:,0]+freq_table.iloc[:,1])

prestige
1.0    0.540984
2.0    0.358108
3.0    0.231405
4.0    0.179104
dtype: float64
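The same row-wise proportions can be obtained by dividing each row of the frequency table by its row total. The sketch below rebuilds the table from the counts shown above, so it runs without the CSV file; the names `counts` and `rates` are illustrative. On the real data, `pd.crosstab(df['prestige'], df['admit'], normalize='index')` should give the same proportions in one call.

```python
import pandas as pd

# Counts copied from the frequency table above (rows: prestige, columns: admit).
counts = pd.DataFrame(
    {0: [28, 95, 93, 55], 1: [33, 53, 28, 12]},
    index=pd.Index([1.0, 2.0, 3.0, 4.0], name="prestige"),
)
counts.columns.name = "admit"

# Divide each row by its row total to get the share of each outcome per prestige level.
rates = counts.div(counts.sum(axis=1), axis=0)

# Column 1 holds the admitted share for each prestige level.
print(rates[1].round(3))
```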

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [5]:
pd.get_dummies(df['prestige'])

Unnamed: 0,1.0,2.0,3.0,4.0
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


> ### Question 3.  How many of these binary variables do we need for modeling?

**Answer:** We only need 3 (one fewer than the number of categories), because the 4th dummy is fully determined by the other three.

> ### Question 4.  Why are we doing this?

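**Answer:** The four dummies always sum to 1, so together with an intercept they are perfectly collinear and the model's coefficients cannot be uniquely estimated.  Dropping one dummy removes this redundancy and makes the dropped category the reference level against which the other categories are compared.  It also saves a little memory by avoiding an extra feature.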
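To make the redundancy concrete, the sketch below compares the rank of a design matrix built from all four dummies plus an intercept with one built from three dummies plus an intercept. The `prestige` values are toy data and `with_intercept` is a hypothetical helper, not part of the project code.

```python
import numpy as np
import pandas as pd

# Toy prestige values covering all four categories (illustrative only).
prestige = pd.Series([1, 2, 3, 4, 2, 3, 1, 4], name="prestige")

all_four = pd.get_dummies(prestige, dtype=float)                # 4 dummy columns
three = pd.get_dummies(prestige, drop_first=True, dtype=float)  # 3 dummy columns


def with_intercept(dummies):
    """Prepend a column of ones to the dummy columns."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])


full_design = with_intercept(all_four)     # 5 columns
reduced_design = with_intercept(three)     # 4 columns

rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)

# The 5-column matrix has rank 4 (one column is redundant);
# the 4-column matrix also has rank 4 (no redundancy left).
print(full_design.shape[1], rank_full)
print(reduced_design.shape[1], rank_reduced)
```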

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [6]:
prestige_onehot = pd.get_dummies(df['prestige'], drop_first=True)
prestige_onehot.head()

Unnamed: 0,2.0,3.0,4.0
0,0,1,0
1,0,1,0
2,0,0,0
3,0,0,1
4,0,0,1


In [7]:
df['prestige_2']=prestige_onehot[2.0]
df['prestige_3']=prestige_onehot[3.0]
df['prestige_4']=prestige_onehot[4.0]
df.head()

Unnamed: 0,admit,gre,gpa,prestige,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,3.0,0,1,0
1,1,660.0,3.67,3.0,0,1,0
2,1,800.0,4.0,1.0,0,0,0
3,1,640.0,3.19,4.0,0,0,1
4,0,520.0,2.93,4.0,0,0,1


In [8]:
df.drop('prestige',axis=1,inplace=True)
df.head()

Unnamed: 0,admit,gre,gpa,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0,1,0
1,1,660.0,3.67,0,1,0
2,1,800.0,4.0,0,0,0
3,1,640.0,3.19,0,0,1
4,0,520.0,2.93,0,0,1
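As an alternative to adding the dummy columns one by one and then dropping `prestige`, `pd.get_dummies` can do the encoding and the drop in a single call when given the `columns` argument. This is a minimal sketch on a toy frame (the rows are made up for illustration), not a replacement for the cells above.

```python
import pandas as pd

# Toy rows mimicking the admissions data; values are illustrative only.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "prestige": [3.0, 3.0, 1.0, 4.0, 2.0],
})

# Cast to int so the new columns are named prestige_2 rather than prestige_2.0.
toy["prestige"] = toy["prestige"].astype(int)

# Encode prestige, drop the first level (prestige = 1) as the reference,
# and remove the original prestige column in the same step.
encoded = pd.get_dummies(toy, columns=["prestige"], prefix="prestige",
                         drop_first=True, dtype=int)
print(encoded.columns.tolist())
```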


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [9]:
df.loc[(df['prestige_2']+df['prestige_3']+df['prestige_4']==0),'admit'].value_counts()

1    33
0    28
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [10]:
# ODDS of getting admitted when prestige=1
# ODDS = # admitted / #not admitted
odds_prestige1 = 33/28.
odds_prestige1

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [11]:
# Create frequency table for when prestige is 2,3 or 4
df.loc[(df['prestige_2']+df['prestige_3']+df['prestige_4']>0),'admit'].value_counts()

0    243
1     93
Name: admit, dtype: int64

In [12]:
# Using this frequency table, calculate the Odds of admission when prestige is 2,3, or 4
odds_prestige_not1 = 93/243.
odds_prestige_not1

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [13]:
odds_ratio = odds_prestige1 / odds_prestige_not1
odds_ratio

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

**Answer:** Applicants who graduated from the most prestigious undergraduate schools have about 3.08 times the odds of being admitted to UCLA graduate school, compared to applicants from all lower-prestige schools.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [14]:
df.loc[(df['prestige_4']==1),'admit'].value_counts()

0    55
1    12
Name: admit, dtype: int64

In [15]:
df.loc[(df['prestige_4']==0),'admit'].value_counts()

0    216
1    114
Name: admit, dtype: int64

In [16]:
odds_prestige4=12/55.
odds_prestige_not4=114/216.
odds_ratio4 = odds_prestige4/odds_prestige_not4
odds_ratio4

0.4133971291866028

**Answer:** For applicants who graduated from the least prestigious undergraduate schools, the odds of being admitted are about 0.41 times (59% lower than) the odds for applicants from all other schools.
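The two hand calculations in this part follow the same pattern, which can be sketched as a pair of small helpers. The function names `odds` and `odds_ratio` are illustrative; the counts are the ones from the frequency tables above.

```python
def odds(events, non_events):
    """Odds = number of events divided by number of non-events."""
    return events / non_events


def odds_ratio(a_events, a_non_events, b_events, b_non_events):
    """Odds of group A divided by odds of group B."""
    return odds(a_events, a_non_events) / odds(b_events, b_non_events)


# Prestige 1 (33 admitted, 28 not) versus all other tiers (93 admitted, 243 not).
or_top = odds_ratio(33, 28, 93, 243)

# Prestige 4 (12 admitted, 55 not) versus all other tiers (114 admitted, 216 not).
or_bottom = odds_ratio(12, 55, 114, 216)

print(round(or_top, 2), round(or_bottom, 2))
```

An odds ratio above 1 means the first group has higher odds of admission than the comparison group; a value below 1 means lower odds.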

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [17]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0,1,0
1,1,660.0,3.67,0,1,0
2,1,800.0,4.0,0,0,0
3,1,640.0,3.19,0,0,1
4,0,520.0,2.93,0,0,1


In [18]:
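# Take a copy so that adding a column does not raise SettingWithCopyWarning
X = df[['gre','gpa','prestige_2','prestige_3','prestige_4']].copy()

# statsmodels' Logit does not add an intercept automatically, so we create one explicitly
X['intercept'] = 1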
X.head()

Unnamed: 0,gre,gpa,prestige_2,prestige_3,prestige_4,intercept
0,380.0,3.61,0,1,0,1
1,660.0,3.67,0,1,0,1
2,800.0,4.0,0,0,0,1
3,640.0,3.19,0,0,1,1
4,520.0,2.93,0,0,1,1


In [19]:
y = df['admit']
y.head()

0    0
1    1
2    1
3    1
4    0
Name: admit, dtype: int64

In [20]:
logit_smf = smf.Logit(y,X).fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [21]:
logit_smf.summary()

Dep. Variable:    admit               No. Observations:    397
Model:            Logit               Df Residuals:        391
Method:           MLE                 Df Model:            5
Date:             Mon, 30 Jan 2017    Pseudo R-squ.:       0.08166
Time:             00:10:28            Log-Likelihood:      -227.82
converged:        True                LL-Null:             -248.08
                                      LLR p-value:         1.176e-07

                 coef    std err        z    P>|z|    [95.0% Conf. Int.]
gre            0.0022      0.001    2.028    0.043    7.44e-05     0.004
gpa            0.7793      0.333    2.344    0.019       0.128     1.431
prestige_2    -0.6801      0.317   -2.146    0.032      -1.301    -0.059
prestige_3    -1.3387      0.345   -3.882    0.000      -2.015    -0.663
prestige_4    -1.5534      0.417   -3.721    0.000      -2.372    -0.735
intercept     -3.8769      1.142   -3.393    0.001      -6.116    -1.638


In [22]:
logit_smf.params

gre           0.002218
gpa           0.779337
prestige_2   -0.680137
prestige_3   -1.338677
prestige_4   -1.553411
intercept    -3.876854
dtype: float64

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [23]:
# The exponential of each feature's coefficient is that feature's odds ratio
# Interpretation: the multiplicative change in odds for a one-unit increase in the feature, holding the others fixed
odds_ratios = np.exp(logit_smf.params)
odds_ratios = pd.DataFrame(odds_ratios,columns=['OR'])
odds_ratios

Unnamed: 0,OR
gre,1.002221
gpa,2.180027
prestige_2,0.506548
prestige_3,0.262192
prestige_4,0.211525
intercept,0.020716


In [24]:
# Now with the 95% confidence intervals
coeff_95conf = logit_smf.conf_int()
odds_ratios['2.5%']  = np.exp(coeff_95conf[0])
odds_ratios['97.5%'] = np.exp(coeff_95conf[1])
odds_ratios

Unnamed: 0,OR,2.5%,97.5%
gre,1.002221,1.000074,1.004372
gpa,2.180027,1.13612,4.183113
prestige_2,0.506548,0.272168,0.942767
prestige_3,0.262192,0.133377,0.515419
prestige_4,0.211525,0.093329,0.479411
intercept,0.020716,0.002207,0.19444
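The interval bounds above are consistent with a Wald interval computed on the log-odds scale and then exponentiated. The sketch below reproduces the `gpa` row approximately from the rounded coefficient and standard error in the summary table; treating the interval as `coef ± 1.96 * std err` is an assumption about how `conf_int()` works for this model, and the rounded inputs mean the result only matches to about two decimals.

```python
import numpy as np

# Rounded values read off the summary table above (gpa row).
coef, std_err = 0.7793, 0.333

# Assumption: a normal-based (Wald) 95% interval, z is roughly 1.96.
z = 1.96

# Build the interval for the coefficient (log-odds scale) first...
log_low = coef - z * std_err
log_high = coef + z * std_err

# ...then exponentiate the point estimate and both bounds to get odds ratios.
or_point = np.exp(coef)
or_low = np.exp(log_low)
or_high = np.exp(log_high)

print(or_point, or_low, or_high)
```

Because the exponential is monotonic, exponentiating the bounds of the coefficient interval gives a valid interval for the odds ratio; the resulting interval is not symmetric around the point estimate.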


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

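**Answer:** Holding GRE and GPA constant, the odds of being admitted are 49.3% lower for applicants from prestige=2 schools than for applicants from prestige=1 schools (the reference group).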

> ### Question 16.  Interpret the odds ratio of `gpa`.

**Answer:** Holding GRE and prestige constant, the odds of being admitted increase by 118% for every one-point increase in GPA (e.g. from 3.0 to 4.0).
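The percentages in the two answers above come from converting an odds ratio into a percent change in odds. A minimal sketch, using the odds ratios from the table in Question 14 (the helper name `percent_change_in_odds` is illustrative):

```python
def percent_change_in_odds(odds_ratio):
    """Convert an odds ratio into a percent change in odds.

    An odds ratio of 1.0 means no change; below 1.0 is a decrease.
    """
    return (odds_ratio - 1.0) * 100.0


change_prestige_2 = percent_change_in_odds(0.506548)  # odds ratio for prestige_2
change_gpa = percent_change_in_odds(2.180027)         # odds ratio for gpa

print(round(change_prestige_2, 1), round(change_gpa, 1))
```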

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [25]:
Xnew = pd.DataFrame({'gre': 800,
                     'gpa': 4,
                     'prestige_2': [0,1,0,0],
                     'prestige_3': [0,0,1,0],
                     'prestige_4': [0,0,0,1],
                     'intercept': 1},columns=['gre','gpa','prestige_2','prestige_3','prestige_4','intercept'])
Xnew

Unnamed: 0,gre,gpa,prestige_2,prestige_3,prestige_4,intercept
0,800,4,0,0,0,1
1,800,4,1,0,0,1
2,800,4,0,1,0,1
3,800,4,0,0,1,1


In [26]:
logit_smf.predict(exog=Xnew)

array([ 0.73403998,  0.58299512,  0.41983282,  0.36860803])

**Answer:** Probabilities of admission by undergraduate school tier (with GRE of 800 and GPA of 4):
* Tier 1: 73.4%
* Tier 2: 58.3%
* Tier 3: 42.0%
* Tier 4: 36.9%

In [27]:
#Double check answer for tier 2 above, manually (should be close, but not exact because of rounding)
logodds2 = 800*0.0022 + 4*0.7793 -0.6801 - 3.8769
odds2 = np.exp(logodds2)
prob2 = odds2 / (1+odds2)
prob2

0.57937299290792632

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [28]:
X = df[['gre','gpa','prestige_2','prestige_3','prestige_4']]
y = df['admit']

In [29]:
# Fit a logistic regression model (C is the inverse of the regularization strength)
logreg = linear_model.LogisticRegression(C = 10 ** 2)
logreg.fit(X, y)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [30]:
# Print coefficients and intercept
print(logreg.coef_)
print(logreg.intercept_)

[[ 0.00215822  0.67315495 -0.62882239 -1.25222745 -1.56879212]]
[-3.51478687]


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [31]:
odds_ratios_SKL = np.exp(np.transpose(logreg.coef_))
odds_ratios_SKL

array([[ 1.00216055],
       [ 1.96041259],
       [ 0.53321936],
       [ 0.28586733],
       [ 0.20829663]])

In [32]:
odds_ratio_compare = pd.DataFrame(odds_ratios_SKL,columns=['OR_sklearn'],index=['gre','gpa','prestige_2','prestige_3','prestige_4'])
odds_ratio_compare['OR_statsmodel'] = odds_ratios.iloc[0:5,0]
odds_ratio_compare

Unnamed: 0,OR_sklearn,OR_statsmodel
gre,1.002161,1.002221
gpa,1.960413,2.180027
prestige_2,0.533219,0.506548
prestige_3,0.285867,0.262192
prestige_4,0.208297,0.211525


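**Answer:** The odds ratios from `statsmodels` are slightly higher for `gre`, `gpa`, and `prestige_4`, and slightly lower for `prestige_2` and `prestige_3`.  The differences are small; a likely cause is that the `sklearn` model applies an L2 penalty (here with `C = 100`), while the `statsmodels` fit is unpenalized.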

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [33]:
Xnew2 = pd.DataFrame({'gre': 800.,
                     'gpa': 4.,
                     'prestige_2': [0.,1.,0.,0.],
                     'prestige_3': [0.,0.,1.,0.],
                     'prestige_4': [0.,0.,0.,1.]},columns=['gre','gpa','prestige_2','prestige_3','prestige_4'])
Xnew2

Unnamed: 0,gre,gpa,prestige_2,prestige_3,prestige_4
0,800.0,4.0,0.0,0.0,0.0
1,800.0,4.0,1.0,0.0,0.0
2,800.0,4.0,0.0,1.0,0.0
3,800.0,4.0,0.0,0.0,1.0


In [34]:
logreg.predict_proba(Xnew2)

array([[ 0.28814605,  0.71185395],
       [ 0.43153702,  0.56846298],
       [ 0.58608936,  0.41391064],
       [ 0.66024514,  0.33975486]])

**Answer:** Probabilities of admission by undergraduate school tier (with GRE of 800 and GPA of 4):
* Tier 1: 71.2%
* Tier 2: 56.8%
* Tier 3: 41.4%
* Tier 4: 34.0%
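As a check on the `predict_proba` output, the probabilities can be recomputed by hand from the coefficients and intercept printed in Question 18. The sketch below applies the logistic function to the linear predictor with plain `numpy`; the values are copied from the printed output, so agreement is only up to their printed precision. Note that the second column of `predict_proba` is the probability of `admit = 1`.

```python
import numpy as np

# Coefficients and intercept copied from the printed sklearn output above,
# in the order gre, gpa, prestige_2, prestige_3, prestige_4.
coef = np.array([0.00215822, 0.67315495, -0.62882239, -1.25222745, -1.56879212])
intercept = -3.51478687

# One row per tier (1 to 4) for a student with GRE 800 and GPA 4.
X_new = np.array([
    [800.0, 4.0, 0, 0, 0],
    [800.0, 4.0, 1, 0, 0],
    [800.0, 4.0, 0, 1, 0],
    [800.0, 4.0, 0, 0, 1],
])

# Linear predictor (log-odds), then the logistic function to get P(admit = 1).
log_odds = X_new @ coef + intercept
p_admit = 1.0 / (1.0 + np.exp(-log_odds))

# Same layout as predict_proba: column 0 is P(admit = 0), column 1 is P(admit = 1).
proba = np.column_stack([1.0 - p_admit, p_admit])
print(proba.round(3))
```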