# DS-SF-34 | Unit Project | 3 | Machine Learning Modeling and Executive Summary | Starter Code

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.  You will summarize and present your findings and the methods you used.

In [253]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import feature_selection, linear_model

import statsmodels.api as sm

In [254]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [255]:
pd.crosstab(df.prestige, df.admit, dropna = False)


admit,0,1
prestige,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Feature Engineering

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [256]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

In [257]:
prestige_df

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


In [258]:
prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
    'prestige_2.0': 'prestige_2',
    'prestige_3.0': 'prestige_3',
    'prestige_4.0': 'prestige_4'}, inplace = True)

In [259]:
prestige_df

Unnamed: 0,prestige_1,prestige_2,prestige_3,prestige_4
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We should use all the four dummy variables we just created for the best model. 

> ### Question 4.  Why are we doing this?

Answer: My creating dummy variables or binary variables, we're able to more accurately determine the best fit model for predicting if a applicant was admitted. 

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [260]:
df = df.join([prestige_df])

In [261]:
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige', u'prestige_1', u'prestige_2',
       u'prestige_3', u'prestige_4'],
      dtype='object')

In [262]:
df.drop('prestige', axis = 1, inplace = True)


In [263]:
df = df.dropna()


In [264]:
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige_1', u'prestige_2', u'prestige_3',
       u'prestige_4'],
      dtype='object')

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [265]:
pd.crosstab(df.prestige_1, df.admit, dropna = False)


admit,0,1
prestige_1,Unnamed: 1_level_1,Unnamed: 2_level_1
0,243,93
1,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [242]:
prob_A = 33/61.
odds_A = prob_A / (1 - prob_A)

print prob_A
print odds_A

0.540983606557
1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [243]:
prob_B = 93/336.
odds_B = prob_B / (1 - prob_B)

print prob_B
print odds_B

0.276785714286
0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [244]:
odds_A / odds_B


3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: A person who graduated from a high school ranked with a prestige of 1 is 3x as more likely to be admitted to UCLA as an applicate who did not come from a high school with a prestige of 1. 

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [245]:
pd.crosstab(df.prestige_4, df.admit, dropna = False)


admit,0,1
prestige_4,Unnamed: 1_level_1,Unnamed: 2_level_1
0,216,114
1,55,12


In [246]:
prob_C = 12/67.
odds_C = prob_C / (1 - prob_C)

print prob_C
print odds_C

0.179104477612
0.218181818182


In [247]:
prob_D = 114/330.
odds_D = prob_D / (1 - prob_D)

print prob_D
print odds_D

0.345454545455
0.527777777778


In [248]:
odds_C / odds_D


0.4133971291866028

An applicate that attended a high school with a prestige level of 4 (lowest level of prestige) is 59% less likely to be admitted to UCLA compared to an applicant that did not attend a high school with a prestige level of 4.

## Part D. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [249]:
X1 = df[ ['gre',
    'gpa',
    'prestige_1',
    ] ]

sm.Logit(c, sm.add_constant(X1)).fit()

Optimization terminated successfully.
         Current function value: 0.584850
         Iterations 5


<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x114cf08d0>

> ### Question 13.  Print the model's summary results.

In [250]:
sm.Logit(c, sm.add_constant(X1)).fit().summary()

Optimization terminated successfully.
         Current function value: 0.584850
         Iterations 5


0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,393.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 12 Jun 2017",Pseudo R-squ.:,0.06406
Time:,14:52:31,Log-Likelihood:,-232.19
converged:,True,LL-Null:,-248.08
,,LLR p-value:,5.815e-07

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
const,-4.8499,1.094,-4.434,0.000,-6.994 -2.706
gre,0.0025,0.001,2.282,0.022,0.000 0.005
gpa,0.7125,0.326,2.185,0.029,0.073 1.352
prestige_1,1.0567,0.292,3.619,0.000,0.484 1.629


Answer: It seems that prestige-1 affects an applicants acceptance rate significantly. 

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [148]:
train_cols = df.columns[1:]
# Index([gre, gpa, prestige_2, prestige_3, prestige_4], dtype=object)

logit = sm.Logit(df['admit'], df[train_cols])

# fit the model
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


In [149]:
print result.summary()

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Mon, 12 Jun 2017   Pseudo R-squ.:                 0.08166
Time:                        14:25:24   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa            0.7793      0.333      2.344      0.019         0.128     1.431
prestige_1    -3.8769      1.142     -3.393      0.0

In [111]:
print np.exp(result.params)

gre           1.002221
gpa           2.180027
prestige_1    0.020716
prestige_2    0.010494
prestige_3    0.005432
prestige_4    0.004382
dtype: float64


In [112]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print np.exp(conf)

                2.5%     97.5%        OR
gre         1.000074  1.004372  1.002221
gpa         1.136120  4.183113  2.180027
prestige_1  0.002207  0.194440  0.020716
prestige_2  0.001183  0.093045  0.010494
prestige_3  0.000569  0.051880  0.005432
prestige_4  0.000469  0.040919  0.004382


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

In [186]:
pd.crosstab(df.prestige_2, df.admit, dropna = False)


admit,0,1
prestige_2,Unnamed: 1_level_1,Unnamed: 2_level_1
0,176,73
1,95,53


In [187]:
prob_E = 53/148.
odds_E = prob_E / (1 - prob_E)

print prob_E
print odds_E

0.358108108108
0.557894736842


In [188]:
prob_F = 73/249.
odds_F = prob_F / (1 - prob_F)

print prob_F
print odds_F

0.293172690763
0.414772727273


In [189]:
odds_E/ odds_F


1.34506128334535

An applicant coming from a high school of a prestige of "2" is 1.35x more likely to be accepted by UCLA compared to an applicant that did not gradudate from a high school with a prestige level of "2." 

> ### Question 16.  Interpret the odds ratio of `gpa`.

I don't know how to do an odds ratio for a continuous variable.

> ### Question 17.  Assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [269]:
X1 = df[ ['gre',
    'gpa',
    'prestige_1',
    ] ]

In [270]:
predict_X1 = [ [800, 4, 1] ]

print model.predict(predict_X1)
print model.predict_proba(predict_X1)


ValueError: shapes (1,3) and (6,) not aligned: 3 (dim 1) != 6 (dim 0)

Answer: The applicant will be accepted with a 62% probability of success.

In [193]:
X2 = df[[
 'gre',
 'gpa',
 'prestige_2']]

In [194]:
predict_X2 = [ [800, 4, 2] ]

print model.predict(predict_X2)
print model.predict_proba(predict_X2)

ValueError: shapes (1,3) and (6,) not aligned: 3 (dim 1) != 6 (dim 0)

Answer: The applicant will be accepted with a 80% probability of success.

In [156]:
X3 = df[[
 'gre',
 'gpa',
 'prestige_3']]

In [157]:
predict_X3 = [ [800, 4, 3] ]

print model.predict(predict_X3)
print model.predict_proba(predict_X3)

ValueError: shapes (1,3) and (6,) not aligned: 3 (dim 1) != 6 (dim 0)

Answer: The applicant will be accepted with a 91% probability of success.

In [227]:
X4 = df[[
 'gre',
 'gpa',
 'prestige_4']]

In [228]:
predict_X4 = [ [800, 4, 4] ]

print model.predict(predict_X2)
print model.predict_proba(predict_X2)

ValueError: shapes (1,3) and (6,) not aligned: 3 (dim 1) != 6 (dim 0)

Answer: The applicant will be accepted with a 80% probability of success.

## Part E. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  When creating the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [274]:
X = df[ ['gre', 'gpa', 'prestige_1'] ]
y = df.admit

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_

-0.484495669176
[ 0.0004861   0.14126271  0.24028523]


In [275]:
zip(X.columns.values, feature_selection.f_regression(X, y)[1])


[('gre', 0.0002843803572874063),
 ('gpa', 0.00049220860020325889),
 ('prestige_1', 3.9716554934324774e-05)]

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer: TODO

> ### Question 20.  Again, assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [120]:
X1 = df[[
 'gre',
 'gpa',
 'prestige_1']]

In [121]:
predict_X1 = [ [800, 4, 1] ]

print model.predict(predict_X1)
print model.predict_proba(predict_X1)

ValueError: shapes (1,3) and (6,) not aligned: 3 (dim 1) != 6 (dim 0)

In [None]:
X2 = df[[
 'gre',
 'gpa',
 'prestige_2']]

In [None]:
predict_X2 = [ [800, 4, 2] ]

print model.predict(predict_X2)
print model.predict_proba(predict_X2)

In [None]:
X3 = df[[
 'gre',
 'gpa',
 'prestige_3']]

In [None]:
predict_X3 = [ [800, 4, 3] ]

print model.predict(predict_X3)
print model.predict_proba(predict_X3)

In [None]:
X4 = df[[
 'gre',
 'gpa',
 'prestige_4']]

In [None]:
predict_X4 = [ [800, 4, 4] ]

print model.predict(predict_X2)
print model.predict_proba(predict_X2)

## Part F.  Executive Summary

> ## Question 21.  Introduction
>
> Write a problem statement for this project.

Can we predict the likelihood that an applicant will be accepted to UCLA based on their GPA, GRE score, and the prestige of their high school. 

> ## Question 22.  Dataset
>
> Write up a description of your data and any cleaning that was completed.

I converted the prestige variable to 4 dummy variables and then removed the original prestige variable. 

> ## Question 23.  Demo
>
> Provide a table that explains the data by admission status.

In [73]:
df

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


In [83]:
print pd.crosstab(df['admit'], df['prestige_1'], rownames=['admit'])

prestige_1    0   1
admit              
0           243  28
1            93  33


> ## Question 24.  Methods
>
> Write up the methods used in your analysis.

1. I first convert the prestige variable into 4 dummy variables to create an improved predictive model.
2. I then calculated the odds ratios to get an idea of what I should predict using my statsmodels and sklearn. This step acts as a gauge to make sure my results from the statsmodels and sklearn make sense. 
3. Used statsmodel to come up with a predictive model.
4. Used sklearn to come up with a predictive model.
5. Compared the statsmodel with the sklearn model. 

> ## Question 25.  Results
>
> Write up your results.

The prestige of the high school of the applicant can have a large impact on whether or not the applicant is accepted by UCLA. The more prestigious the high school, the more likely the applicant is to be accepted by UCLA. 

> ## Question 26.  Visuals
>
> Provide a table or visualization of these results.

In [273]:
df

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


> ## Question 27.  Discussion
>
> Write up your discussion and future steps.

Answer: TODO