# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
pd.crosstab(df['prestige'],df['admit'])

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
pd.get_dummies(df['prestige'], prefix ='prestige')

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

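Answer: Three. The fourth is redundant because it can be derived from the other three: an applicant with zeros in all three indicators must belong to the remaining category. Keeping all four alongside an intercept would make the columns perfectly collinear.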
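The redundancy can be checked directly. The sketch below uses a small made-up `prestige` column (the `toy` series is hypothetical, not the admissions data) and rebuilds the `prestige_1.0` indicator from the other three.

```python
import pandas as pd

# Hypothetical toy column, not the admissions data.
toy = pd.Series([3.0, 3.0, 1.0, 4.0, 2.0], name='prestige')

# Cast to int so the result is the same whether get_dummies returns 0/1 or booleans.
dummies = pd.get_dummies(toy, prefix='prestige').astype(int)

# Exactly one indicator is 1 in each row, so the four columns sum to 1
# and the dropped column is 1 minus the sum of the other three.
rebuilt = 1 - dummies[['prestige_2.0', 'prestige_3.0', 'prestige_4.0']].sum(axis=1)

print((rebuilt == dummies['prestige_1.0']).all())
```

This is the same reason `drop_first=True` loses no information: the dropped category becomes the reference level.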

> ### Question 4.  Why are we doing this?

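Answer: `prestige` is a categorical rank, not a quantity, so feeding its values 1 to 4 into the model as numbers would force equal spacing between tiers. One-hot encoding lets the model estimate a separate effect for each tier, in a numeric format that the classification and regression algorithms in scikit-learn and statsmodels accept.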

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [5]:
prestige_one_hot = pd.get_dummies(df['prestige'], prefix ='prestige', drop_first=True)
prestige_one_hot.head()

Unnamed: 0,prestige_2.0,prestige_3.0,prestige_4.0
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0


In [6]:
binary_df = df.join(prestige_one_hot)
binary_df.head()

Unnamed: 0,admit,gre,gpa,prestige,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,3.0,0.0,1.0,0.0
1,1,660.0,3.67,3.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,4.0,0.0,0.0,1.0
4,0,520.0,2.93,4.0,0.0,0.0,1.0


In [7]:
binary_df = binary_df.drop('prestige', axis=1)
binary_df.head()

Unnamed: 0,admit,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [8]:
prestige_1 = binary_df[(binary_df['prestige_2.0']==0.0) & (binary_df['prestige_3.0']==0.0) & (binary_df['prestige_4.0']==0.0)]
prestige_1['admit'].value_counts()

1    33
0    28
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [9]:
33.0/28.0

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [10]:
prestige_2to4 = binary_df[(binary_df['prestige_2.0']==1.0) | (binary_df['prestige_3.0']==1.0) | (binary_df['prestige_4.0']==1.0)]
prestige_2to4['admit'].value_counts()

0    243
1     93
Name: admit, dtype: int64

In [11]:
93.0/243.0

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [12]:
(33.0/28)/(93.0/243)


3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: Applicants from tier-1 (most prestigious) undergraduate schools have odds of admission about 3.08 times those of applicants from lower-tier undergraduate schools.
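Odds and probabilities are easy to mix up when wording a finding like this. As a minimal sketch, the helper functions below (`odds` and `probability` are illustrative names, not part of the project) recompute both from the counts in the frequency tables above.

```python
# Counts copied from the frequency tables above: (admitted, not admitted).
tier1 = (33, 28)
other = (93, 243)

def odds(admitted, rejected):
    """Odds: admitted applicants per rejected applicant."""
    return float(admitted) / rejected

def probability(admitted, rejected):
    """Probability: admitted applicants as a share of all applicants."""
    return float(admitted) / (admitted + rejected)

odds_ratio = odds(*tier1) / odds(*other)
probability_ratio = probability(*tier1) / probability(*other)

print('odds ratio: %.2f' % odds_ratio)
print('probability ratio: %.2f' % probability_ratio)
```

With these counts the odds ratio is about 3.08 while the ratio of admission probabilities is smaller, so the sentence should speak of odds rather than of being "three times as likely" to be admitted.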

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [13]:
prestige_4 = binary_df[(binary_df['prestige_4.0']==1.0)]
prestige_4['admit'].value_counts()


0    55
1    12
Name: admit, dtype: int64

In [14]:
prestige_not4 = binary_df[(binary_df['prestige_4.0']==0.0)]
prestige_not4['admit'].value_counts()

0    216
1    114
Name: admit, dtype: int64

In [15]:
prestige_4_odds =12.0/55.0
prestige_not4_odds = 114.0/216.0
odds_ratio = prestige_4_odds / prestige_not4_odds
print(odds_ratio)

0.413397129187


Answer: 
Applicants who attended the least prestigious undergraduate schools have odds of admission about 0.41 times those of applicants who attended more prestigious schools, i.e. roughly 59% lower odds.

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [16]:
binary_df.head()

Unnamed: 0,admit,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0


In [17]:
X=binary_df.drop(['admit'],axis=1)
y=binary_df['admit']

X['intercept']=1

results = smf.Logit(y,X).fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [18]:
results.summary()


Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 23 Jan 2017",Pseudo R-squ.:,0.08166
Time:,15:29:11,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,95% CI lower,95% CI upper
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
prestige_2.0,-0.6801,0.317,-2.146,0.032,-1.301,-0.059
prestige_3.0,-1.3387,0.345,-3.882,0.000,-2.015,-0.663
prestige_4.0,-1.5534,0.417,-3.721,0.000,-2.372,-0.735
intercept,-3.8769,1.142,-3.393,0.001,-6.116,-1.638


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [19]:
# Odds ratios
np.exp(results.params)

gre             1.002221
gpa             2.180027
prestige_2.0    0.506548
prestige_3.0    0.262192
prestige_4.0    0.211525
intercept       0.020716
dtype: float64

In [20]:
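# 95% confidence intervals for the coefficients (log-odds scale);
# np.exp(results.conf_int()) gives the intervals for the odds ratios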
results.conf_int()


Unnamed: 0,0,1
gre,7.4e-05,0.004362
gpa,0.127619,1.431056
prestige_2.0,-1.301337,-0.058936
prestige_3.0,-2.014579,-0.662776
prestige_4.0,-2.371624,-0.735197
intercept,-6.116077,-1.637631
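The intervals above are for the coefficients, which are on the log-odds scale, while the question asks about the odds ratios. A minimal sketch of the conversion follows, using the bounds copied from the output above (in the notebook itself, `np.exp(results.conf_int())` would do the same thing in one step).

```python
import math

# Coefficient-scale 95% bounds copied from the conf_int() output above.
coef_ci = {
    'gre': (7.4e-05, 0.004362),
    'gpa': (0.127619, 1.431056),
    'prestige_2.0': (-1.301337, -0.058936),
    'prestige_3.0': (-2.014579, -0.662776),
    'prestige_4.0': (-2.371624, -0.735197),
}

# exp() is monotonic, so exponentiating each bound gives the odds-ratio interval.
or_ci = {name: (math.exp(lo), math.exp(hi)) for name, (lo, hi) in coef_ci.items()}

for name in sorted(or_ci):
    print('%-13s %.3f to %.3f' % ((name,) + or_ci[name]))
```

A coefficient interval that excludes 0 corresponds to an odds-ratio interval that excludes 1, which is the case for every feature here.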


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding GRE and GPA constant, applicants from prestige-2 schools have about 49% lower odds of being admitted than applicants from prestige-1 schools (odds ratio 0.51).

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding the other features constant, each 1.0-point increase in GPA multiplies the odds of being admitted by about 2.18, an increase of 118%.
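The percentages in these two answers come from the relation percent change in odds = (odds ratio − 1) × 100. A short sketch, using the coefficients as printed in the summary table:

```python
import math

# Coefficients copied from the summary table above.
coefs = {'gpa': 0.7793, 'prestige_2.0': -0.6801}

# Odds ratio is exp(coefficient); percent change in odds is (odds ratio - 1) * 100.
pct_change = {name: (math.exp(b) - 1) * 100 for name, b in coefs.items()}

for name in sorted(pct_change):
    print('%s: %+.0f%% change in odds' % (name, pct_change[name]))
```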


> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [21]:
tier_1_prediction = results.predict((800,4,0,0,0,1))
tier_2_prediction = results.predict((800,4,1,0,0,1))
tier_3_prediction = results.predict((800,4,0,1,0,1))
tier_4_prediction = results.predict((800,4,0,0,1,1))

In [22]:
print(tier_1_prediction, tier_2_prediction, tier_3_prediction, tier_4_prediction)

[ 0.73403998] [ 0.58299512] [ 0.41983282] [ 0.36860803]


Answer:
Pres 1 = 73%
Pres 2 = 58%
Pres 3 = 42%
Pres 4 = 37%
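These probabilities can also be reproduced by hand, which makes clear what `predict` is doing: it sums the coefficients times the feature values to get the log-odds, then applies the logistic function. The sketch below recovers the coefficients from the odds ratios printed earlier; `admit_probability` is an illustrative helper, not a statsmodels function.

```python
import math

# Odds ratios copied from the np.exp(results.params) output above;
# taking logs recovers the fitted coefficients.
odds_ratios = {
    'gre': 1.002221, 'gpa': 2.180027,
    'prestige_2.0': 0.506548, 'prestige_3.0': 0.262192, 'prestige_4.0': 0.211525,
    'intercept': 0.020716,
}
coefs = {name: math.log(value) for name, value in odds_ratios.items()}

def admit_probability(gre, gpa, tier):
    """Logistic function applied to the linear predictor for one applicant."""
    log_odds = coefs['intercept'] + coefs['gre'] * gre + coefs['gpa'] * gpa
    if tier > 1:  # tier 1 is the reference level, so it adds nothing
        log_odds += coefs['prestige_%d.0' % tier]
    return 1.0 / (1.0 + math.exp(-log_odds))

probabilities = [admit_probability(800, 4, tier) for tier in (1, 2, 3, 4)]
print([round(p, 3) for p in probabilities])
```

Up to rounding in the copied odds ratios, the results should agree with the `results.predict` output above.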

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [23]:
from sklearn.linear_model import LogisticRegression


In [24]:
logreg = LogisticRegression(C=10**2)
logreg.fit(X, y)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [25]:
logreg.score(X,y)

0.7128463476070529

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [26]:
SK_odds = np.exp(np.transpose(logreg.coef_))
SK_odds = pd.DataFrame(SK_odds, columns = ['SK_odds'],index=['gre','gpa','prestige_2.0','prestige_3.0','prestige_4.0','intercept'])
print(SK_odds)

               SK_odds
gre           1.002100
gpa           2.050517
prestige_2.0  0.482101
prestige_3.0  0.246524
prestige_4.0  0.201320
intercept     0.169818


In [27]:
ST_odds = np.exp(np.transpose(results.params))
print(ST_odds)

gre             1.002221
gpa             2.180027
prestige_2.0    0.506548
prestige_3.0    0.262192
prestige_4.0    0.211525
intercept       0.020716
dtype: float64


In [28]:
odds_comp = pd.DataFrame(ST_odds, columns = ['ST_odds'], index=['gre','gpa','prestige_2.0','prestige_3.0','prestige_4.0','intercept'])
# Left join on the index keeps the row order of odds_comp
# (the join_axes argument of pd.concat was removed in pandas 1.0).
odds_comp = odds_comp.join(SK_odds)
odds_comp

Unnamed: 0,ST_odds,SK_odds
gre,1.002221,1.0021
gpa,2.180027,2.050517
prestige_2.0,0.506548,0.482101
prestige_3.0,0.262192,0.246524
prestige_4.0,0.211525,0.20132
intercept,0.020716,0.169818


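Answer: For `gre`, `gpa`, and the three prestige indicators, the odds ratios from the two models are very similar, with the statsmodels values slightly higher in each case. Small differences are to be expected because `LogisticRegression` applies an L2 penalty (weak here, with `C=100`), whereas statsmodels fits by unpenalized maximum likelihood. The `intercept` row is not comparable (0.0207 vs. 0.1698): `X` already contains a constant column and `LogisticRegression` also fits its own intercept by default (`fit_intercept=True`), so the constant term is split between `coef_` and `intercept_`.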

> ### Question 20.  Again assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [29]:
test = pd.DataFrame({'gre': 800.,
                     'gpa': 4.,
                     'prestige_2.0': [0.,1.,0.,0.],
                     'prestige_3.0': [0.,0.,1.,0.],
                     'prestige_4.0': [0.,0.,0.,1.],
                     'intercept':1.},
                    columns=['gre','gpa','prestige_2.0','prestige_3.0','prestige_4.0','intercept'])
test

Unnamed: 0,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0,intercept
0,800.0,4.0,0.0,0.0,0.0,1.0
1,800.0,4.0,1.0,0.0,0.0,1.0
2,800.0,4.0,0.0,1.0,0.0,1.0
3,800.0,4.0,0.0,0.0,1.0,1.0


In [31]:
logreg.predict_proba(test)

array([[ 0.26806485,  0.73193515],
       [ 0.43171408,  0.56828592],
       [ 0.59768586,  0.40231414],
       [ 0.64528942,  0.35471058]])

Answer: 
Pres 1 = 73%
Pres 2 = 57%
Pres 3 = 40%
Pres 4 = 35%