# DS-SF-33 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [2]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

import statsmodels.formula.api as smf
import statsmodels.api as sm

from sklearn import linear_model as lm

In [3]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [4]:
# TODO
pd.crosstab(df.prestige,df.admit)

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


In [5]:
pd.crosstab(df.prestige,df.admit,normalize=True)

prestige,admit=0,admit=1
1.0,0.070529,0.083123
2.0,0.239295,0.133501
3.0,0.234257,0.070529
4.0,0.138539,0.030227


In [6]:
pd.crosstab(df.prestige,df.admit,normalize=True).sum()

admit
0    0.68262
1    0.31738
dtype: float64

In [7]:
pd.crosstab(df['prestige'],df['admit'],normalize='index')

prestige,admit=0,admit=1
1.0,0.459016,0.540984
2.0,0.641892,0.358108
3.0,0.768595,0.231405
4.0,0.820896,0.179104


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [8]:
# Recast prestige as int so the dummy columns are labeled 1-4 rather than 1.0-4.0
df.prestige = df.prestige.astype(int)
df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3
1,1,660.0,3.67,3
2,1,800.0,4.00,1
3,1,640.0,3.19,4
4,0,520.0,2.93,4
...,...,...,...,...
395,0,620.0,4.00,2
396,0,560.0,3.04,3
397,0,460.0,2.63,2
398,0,700.0,3.65,2


In [9]:
one_hot = pd.get_dummies(df.prestige)
one_hot

Unnamed: 0,1,2,3,4
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


In [10]:
df = df.join(one_hot)
df

Unnamed: 0,admit,gre,gpa,prestige,1,2,3,4
0,0,380.0,3.61,3,0,0,1,0
1,1,660.0,3.67,3,0,0,1,0
2,1,800.0,4.00,1,1,0,0,0
3,1,640.0,3.19,4,0,0,0,1
4,0,520.0,2.93,4,0,0,0,1
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2,0,1,0,0
396,0,560.0,3.04,3,0,0,1,0
397,0,460.0,2.63,2,0,1,0,0
398,0,700.0,3.65,2,0,1,0,0


> ### Question 3.  How many of these binary variables do we need for modeling?

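Answer:  Three. With four prestige levels, one dummy is dropped and serves as the reference level; keeping all four alongside an intercept would make the columns perfectly collinear, since the four dummies always sum to 1.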
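
As a minimal sketch of dropping the reference level, `pd.get_dummies` accepts `drop_first=True`, which omits the first (lowest) category, here tier 1. The `prestige_` prefix and the six sample values are illustrative assumptions and are not how this notebook builds its dummy columns.

```python
import pandas as pd

# Illustrative sample of prestige values (not the full dataset).
prestige = pd.Series([3, 3, 1, 4, 4, 2], name="prestige")

# drop_first=True omits the lowest category (tier 1), which becomes the reference level.
dummies = pd.get_dummies(prestige, prefix="prestige", drop_first=True)

print(dummies.columns.tolist())
```

A row with all three remaining dummies equal to zero then represents a tier-1 applicant.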

> ### Question 4.  Why are we doing this?

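Answer:  `prestige` is a categorical (at best ordinal) variable rather than a true numeric scale, so we encode each level separately instead of assuming that the steps between tiers are equally spaced.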

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [11]:
# Only run once: drop() returns a new frame, so assign it back to keep the change
df = df.drop('prestige', axis = 1)
df


Unnamed: 0,admit,gre,gpa,1,2,3,4
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [16]:
# TODO
pd.crosstab(df[1], df['admit'])


prestige == 1,admit=0,admit=1
0,243,93
1,28,33


In [13]:
pd.crosstab(df[2], df['admit'])


prestige == 2,admit=0,admit=1
0,176,73
1,95,53
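
The tier-2 table above is not used by the questions that follow, but it works as a small illustration of the calculation they ask for. The sketch below assumes the usual definition of odds (admitted divided by not admitted, rather than admitted divided by the total); the helper name `odds_ratio` is hypothetical.

```python
import numpy as np

# Counts copied from the prestige = 2 crosstab above.
# Rows: dummy = 0, dummy = 1; columns: admit = 0, admit = 1.
table = np.array([[176, 73],
                  [95, 53]])


def odds_ratio(table_2x2):
    """Odds ratio of admission for row 1 versus row 0 of a 2x2 count table."""
    (rej_0, adm_0), (rej_1, adm_1) = table_2x2
    odds_0 = adm_0 / rej_0  # odds = admitted / not admitted
    odds_1 = adm_1 / rej_1
    return odds_1 / odds_0


tier2_or = odds_ratio(table)
print(round(float(tier2_or), 3))
```

An odds ratio above 1 means the group in row 1 has higher odds of admission than the group in row 0.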


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [20]:
# Odds = admitted / not admitted (admitted / total would be a probability).
# From the prestige = 1 table: 33 admitted, 28 not admitted.
odds_7 = 33 / 28
print(round(odds_7, 4), 'to one')

1.1786 to one


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [22]:
# From the prestige = 1 table: 93 admitted, 243 not admitted among all other schools.
odds_8 = 93 / 243
print(round(odds_8, 4), 'to one')

0.3827 to one


> ### Question 9.  Finally, what's the odds ratio?

In [25]:
# Odds ratio = odds for #1 ranked schools / odds for all other schools

print(round(odds_7/odds_8, 4), 'to 1 is the odds ratio between those who attended a #1 ranked college and those who did not.')


3.0795 to 1 is the odds ratio between those who attended a #1 ranked college and those who did not.


> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants who attended a #1 ranked college are about 3.08 times the odds for applicants who did not.


> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

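In [27]:
# The least prestigious schools are prestige = 4.
pd.crosstab(df[4], df['admit'])


prestige == 4,admit=0,admit=1
0,216,114
1,55,12


In [29]:
odds_tier4 = 12 / 55     # admitted / not admitted, prestige = 4
odds_other = 114 / 216   # admitted / not admitted, all other schools
odds_r11 = odds_tier4 / odds_other
print('Answer: The odds of admission to UCLA for applicants from the least prestigious schools are', round(odds_r11, 4), 'times the odds for all other applicants.')

Answer: The odds of admission to UCLA for applicants from the least prestigious schools are 0.4134 times the odds for all other applicants.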


## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [127]:
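# TODO
# Fit a logit model with statsmodels. No constant is added and all four prestige
# dummies are included, so each dummy coefficient acts as that tier's own intercept;
# effects relative to tier 1 are differences from the tier-1 coefficient.
features = ['gre','gpa',1,2,3,4]
model0 = sm.Logit(df['admit'], df[features])
results = model0.fit()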


Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [128]:
# TODO
results.summary()

Dep. Variable:,admit,No. Observations:,397
Model:,Logit,Df Residuals:,391
Method:,MLE,Df Model:,5
Date:,"Mon, 24 Apr 2017",Pseudo R-squ.:,0.08166
Time:,17:26:35,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,95% CI lower,95% CI upper
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
1,-3.8769,1.142,-3.393,0.001,-6.116,-1.638
2,-4.5570,1.113,-4.093,0.000,-6.739,-2.375
3,-5.2155,1.151,-4.530,0.000,-7.472,-2.959
4,-5.4303,1.140,-4.764,0.000,-7.664,-3.196


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [129]:
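# TODO
print('Odds Ratios')
print(np.exp(results.params))
print('        ')
# conf_int() returns intervals for the coefficients (log-odds);
# applying np.exp to them gives the odds-ratio intervals.
print('Confidence Intervals (log-odds scale)')
print(results.conf_int())

Odds Ratios
gre    1.002221
gpa    2.180027
1      0.020716
2      0.010494
3      0.005432
4      0.004382
dtype: float64
        
Confidence Intervals (log-odds scale)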
            0         1
gre  0.000074  0.004362
gpa  0.127619  1.431056
1   -6.116077 -1.637631
2   -6.739307 -2.374674
3   -7.472244 -2.958819
4   -7.664377 -3.196152
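
The intervals printed above are on the log-odds (coefficient) scale. A minimal sketch of converting them to odds-ratio intervals is to exponentiate both bounds; the `lower`/`upper` column names are assumptions for readability, and the numbers are copied from the output above.

```python
import numpy as np
import pandas as pd

# 95% confidence bounds for the coefficients, copied from the output above.
conf = pd.DataFrame(
    {
        "lower": [0.000074, 0.127619, -6.116077, -6.739307, -7.472244, -7.664377],
        "upper": [0.004362, 1.431056, -1.637631, -2.374674, -2.958819, -3.196152],
    },
    index=["gre", "gpa", "1", "2", "3", "4"],
)

# exp() is monotonic, so exponentiating each bound gives the odds-ratio interval.
odds_ratio_ci = np.exp(conf)
print(odds_ratio_ci)
```

With the fitted model in memory, the same result should come from `np.exp(results.conf_int())`.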


In [130]:
print('Params')
print(results.params)

Params
gre    0.002218
gpa    0.779337
1     -3.876854
2     -4.556991
3     -5.215531
4     -5.430265
dtype: float64


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

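Answer:  Because the model has no constant and includes all four prestige dummies, exp(-4.5570) ≈ 0.0105 is not an odds ratio against tier 1; it is the baseline odds for a tier-2 applicant (the odds at `gre = 0` and `gpa = 0`). The odds ratio relative to tier 1 comes from the difference in coefficients: exp(-4.5570 - (-3.8769)) ≈ 0.51. Holding GRE and GPA fixed, a tier-2 applicant has about half the odds of admission of a tier-1 applicant.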

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: The odds ratio for `gpa` is exp(0.7793) ≈ 2.18. Holding GRE and prestige tier fixed, each one-point increase in GPA multiplies the odds of admission by about 2.18.
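
Since the model was fit with all four prestige dummies and no constant, odds ratios relative to tier 1 can be recovered from differences between the tier coefficients. The sketch below uses the coefficients printed under "Params" above; the dictionary names are illustrative.

```python
import numpy as np

# Tier coefficients copied from results.params above.
tier_coefs = {1: -3.876854, 2: -4.556991, 3: -5.215531, 4: -5.430265}
reference = tier_coefs[1]  # tier 1 is the reference level

# Odds ratio of each tier versus tier 1 = exp(coefficient difference).
odds_ratios_vs_tier1 = {
    tier: float(np.exp(coef - reference)) for tier, coef in tier_coefs.items()
}

for tier, ratio in odds_ratios_vs_tier1.items():
    print(tier, round(ratio, 3))
```

Tier 1 comes out as exactly 1 by construction, and the ratios shrink as prestige decreases.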

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [131]:
# TODO
# Linear predictor shared by all tiers: GRE = 800, GPA = 4 (coefficients from results.params)
base = 0.002218*(800) + 0.779337*(4)
x = []
for i in [-3.876854,-4.556991,-5.215531,-5.430265]:
    # Logistic function: p = 1 / (1 + exp(-z)), with z = base + tier coefficient
    x.append(1/(1+np.exp(-(base+i))))
    
for n in x:
    print(f'{n:.3f}')

0.734
0.583
0.420
0.369


Answer: The predicted probabilities of admission are about 0.73 (tier 1), 0.58 (tier 2), 0.42 (tier 3), and 0.37 (tier 4).

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [77]:
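# TODO
# C is the inverse regularization strength, so C = 100 means weak regularization.
# sklearn also fits an intercept by default.
model = lm.LogisticRegression(C = 10**2)
X = df[features]
Y = df['admit']
model.fit(X,Y)

# Column 1 of predict_proba holds P(admit = 1)
df['probability'] = model.predict_proba(X)[:, 1]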

In [87]:
print(model.coef_)
print(model.intercept_)
print(model.n_iter_)



[[ 0.00215865  0.73484431 -0.05427652 -0.70070849 -1.34355883 -1.56529314]]
[-3.66383698]
[15]


In [122]:
odds = np.exp(model.coef_)

print(odds)

odds2 = np.exp(results.params)
print(odds2)

[[ 1.00216098  2.08515733  0.94717016  0.4962336   0.26091546  0.20902673]]
gre    1.002221
gpa    2.180027
1      0.020716
2      0.010494
3      0.005432
4      0.004382
dtype: float64


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [120]:
# TODO
coef_statsmodel = []
for j in results.params:
    coef_statsmodel.append(j)
    print(j)
    
coef_sklearn = []
for i in model.coef_:
    coef_sklearn.append(list(i))
    

0.00221840330635
0.779337228756
-3.87685407697
-4.55699068124
-5.21553127657
-5.43026458214
[[0.0021586522326353312, 0.73484431002824258, -0.054276517948434536, -0.70070849244835742, -1.343558827703081, -1.5652931410657156]]


Answer: The odds ratios for `gre` and `gpa` are very similar in both models (gre: 1.0022 in both; gpa: 2.18 in `statsmodels` vs. 2.09 in `sklearn`; the underlying coefficients are 0.00222 vs. 0.00216 and 0.779 vs. 0.735). The tier odds ratios differ a lot: 0.0207, 0.0105, 0.0054, and 0.0044 in `statsmodels` versus 0.947, 0.496, 0.261, and 0.209 in `sklearn`. In both models the tier coefficients are negative and decrease from tier 1 to tier 4. The main reason for the gap is that `LogisticRegression` fits an intercept by default (-3.66 here), which absorbs the baseline level that the `statsmodels` fit, with no constant, puts into the tier coefficients. In addition, `sklearn` applies regularization (weak here, since `C = 100`), while `sm.Logit` fits an unpenalized model.
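
The role of the intercept can be checked on synthetic data. The sketch below is an illustration under stated assumptions (simulated data, two tiers only, a very large `C` to make regularization negligible); it is not a refit of the admissions model. It compares a fit with all dummies and no intercept against a fit with an intercept and one dummy dropped.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one numeric feature and two tiers (all values are made up).
rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)
tier = rng.integers(1, 3, size=n)  # 1 or 2
d1 = (tier == 1).astype(float)
d2 = (tier == 2).astype(float)
logit = 0.8 * x + np.where(tier == 1, -0.5, -1.5)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Parameterization A: all dummies, no intercept (like the statsmodels fit above).
no_const = LogisticRegression(C=1e6, fit_intercept=False, max_iter=1000)
no_const.fit(np.column_stack([x, d1, d2]), y)

# Parameterization B: intercept plus the tier-2 dummy (tier 1 is the reference).
with_const = LogisticRegression(C=1e6, fit_intercept=True, max_iter=1000)
with_const.fit(np.column_stack([x, d2]), y)

b_x, b_d1, b_d2 = no_const.coef_[0]
print("tier-1 level:", b_d1, "vs intercept:", with_const.intercept_[0])
print("tier-2 minus tier-1:", b_d2 - b_d1, "vs tier-2 coef:", with_const.coef_[0][1])
```

The two fits describe the same model: the tier-1 coefficient in A plays the role of the intercept in B, and only coefficient differences between tiers are comparable across parameterizations.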

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [124]:
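# TODO
# Same student (GRE = 800, GPA = 4); the sklearn intercept must be included in the linear predictor
base = -3.66383698 + 0.0021586522326353312*(800) + 0.73484431002824258*(4)
y = []
for i in [-0.054276517948434536,-0.70070849244835742,-1.343558827703081,-1.5652931410657156]:
    # Logistic function: p = 1 / (1 + exp(-z)), with z = base + tier coefficient
    y.append(1/(1+np.exp(-(base+i))))
    
for n in y:
    print(f'{n:.3f}')

0.721
0.575
0.416
0.363


Answer:  With the `sklearn` model, the predicted probabilities of admission are about 0.72 (tier 1), 0.57 (tier 2), 0.42 (tier 3), and 0.36 (tier 4).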