# DS-SF-34 | Unit Project | 3 | Machine Learning Modeling and Executive Summary | Starter Code

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.  You will summarize and present your findings and the methods you used.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

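| | admit | gre | gpa | prestige |
|:---|:---|:---|:---|:---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.00 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |
| ... | ... | ... | ... | ... |
| 395 | 0 | 620.0 | 4.00 | 2.0 |
| 396 | 0 | 560.0 | 3.04 | 3.0 |
| 397 | 0 | 460.0 | 2.63 | 2.0 |
| 398 | 0 | 700.0 | 3.65 | 2.0 |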


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [3]:
pd.crosstab(df.prestige, df.admit)

| prestige | admit = 0 | admit = 1 |
|:---|:---|:---|
| 1.0 | 28 | 33 |
| 2.0 | 95 | 53 |
| 3.0 | 93 | 28 |
| 4.0 | 55 | 12 |


## Part B.  Feature Engineering

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
    'prestige_2.0': 'prestige_2',
    'prestige_3.0': 'prestige_3',
    'prestige_4.0': 'prestige_4'}, inplace = True)

prestige_df

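| | prestige_1 | prestige_2 | prestige_3 | prestige_4 |
|:---|:---|:---|:---|:---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 395 | 0 | 1 | 0 | 0 |
| 396 | 0 | 0 | 1 | 0 |
| 397 | 0 | 1 | 0 | 0 |
| 398 | 0 | 1 | 0 | 0 |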


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: One-hot encoding produces four binary variables for `prestige`. We only need three of them.

> ### Question 4.  Why are we doing this?

Answer: We can omit one variable because it can be derived from the other three: exactly one of the four indicators equals 1 in every row. Including all four alongside an intercept makes the features perfectly multicollinear.
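
The redundancy can be checked directly. The following is a minimal plain-Python sketch (the helper `one_hot` and the sample values are illustrative, not part of the project code) showing that the four indicators always sum to 1, so any one of them can be recovered from the other three:

```python
def one_hot(values, levels):
    """Return one indicator row per value, with one column per level."""
    return [[1 if value == level else 0 for level in levels] for value in values]

# Illustrative prestige values (the first five rows of the dataset above).
prestige = [3, 3, 1, 4, 4]
rows = one_hot(prestige, levels=[1, 2, 3, 4])

# Every row sums to 1, which is exactly what an intercept column contains.
row_sums = [sum(row) for row in rows]

# Dropping the first column loses nothing: it equals 1 minus the sum of the rest.
recovered_first = [1 - sum(row[1:]) for row in rows]

print(row_sums)
print(recovered_first == [row[0] for row in rows])
```

In pandas, `pd.get_dummies(df.prestige, prefix = 'prestige', drop_first = True)` drops the first level for the same reason.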

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [5]:
df = df.join([prestige_df])
df.drop('prestige', axis=1, inplace=True)

df

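| | admit | gre | gpa | prestige_1 | prestige_2 | prestige_3 | prestige_4 |
|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 0 | 380.0 | 3.61 | 0 | 0 | 1 | 0 |
| 1 | 1 | 660.0 | 3.67 | 0 | 0 | 1 | 0 |
| 2 | 1 | 800.0 | 4.00 | 1 | 0 | 0 | 0 |
| 3 | 1 | 640.0 | 3.19 | 0 | 0 | 0 | 1 |
| 4 | 0 | 520.0 | 2.93 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 0 | 620.0 | 4.00 | 0 | 1 | 0 | 0 |
| 396 | 0 | 560.0 | 3.04 | 0 | 0 | 1 | 0 |
| 397 | 0 | 460.0 | 2.63 | 0 | 1 | 0 | 0 |
| 398 | 0 | 700.0 | 3.65 | 0 | 1 | 0 | 0 |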


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [6]:
pd.crosstab(df.prestige_1, df.admit)

| prestige_1 | admit = 0 | admit = 1 |
|:---|:---|:---|
| 0 | 243 | 93 |
| 1 | 28 | 33 |


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [13]:
prob_1 = 33. / (33+28)
odds_1 = prob_1 / (1-prob_1)

print(odds_1)

1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [14]:
prob_2 = 93. / (93+243)
odds_2 = prob_2 / (1-prob_2)

print(odds_2)

0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [15]:
odds_1 / odds_2

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: Applicants who attended a #1 ranked undergraduate college have about 3.1 times the odds of admission of applicants who did not.
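
An odds ratio is not the same as a ratio of probabilities. A minimal sketch using the counts from the frequency table above (the helper `odds` is illustrative) computes both for comparison:

```python
def odds(successes, failures):
    """Odds = successes / failures, equivalent to p / (1 - p)."""
    return successes / float(failures)

# Counts from the frequency table for prestige_1 above.
admitted_tier1, rejected_tier1 = 33, 28
admitted_other, rejected_other = 93, 243

odds_ratio = odds(admitted_tier1, rejected_tier1) / odds(admitted_other, rejected_other)

# Ratio of admission probabilities (relative risk), for comparison.
prob_tier1 = admitted_tier1 / float(admitted_tier1 + rejected_tier1)
prob_other = admitted_other / float(admitted_other + rejected_other)
prob_ratio = prob_tier1 / prob_other

print(round(odds_ratio, 3))
print(round(prob_ratio, 3))
```

The odds ratio comes out at about 3.08, while the ratio of admission probabilities is only about 1.95, so "three times the odds" should not be read as "three times the chance".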

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [18]:
pd.crosstab(df.prestige_4, df.admit)

| prestige_4 | admit = 0 | admit = 1 |
|:---|:---|:---|
| 0 | 216 | 114 |
| 1 | 55 | 12 |


In [21]:
prob_1 = 12. / (12+55)
odds_1 = prob_1 / (1-prob_1)

prob_2 = 114. / (114+216)
odds_2 = prob_2 / (1-prob_2)

print(odds_1 / odds_2)
print(1 - (odds_1 / odds_2))

0.413397129187
0.586602870813


Answer: Applicants who attended the least prestigious undergraduate schools have about 59% lower odds of admission to UCLA's graduate school than all other applicants (odds ratio ≈ 0.41).
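
The same odds ratio can be read straight off the 2x2 table with the cross-product form. This is a small sketch (the function name `odds_ratio_2x2` is illustrative) that also converts the ratio into a percent change in the odds:

```python
def odds_ratio_2x2(a, b, c, d):
    """Odds ratio for a 2x2 table: (a / b) / (c / d) = (a * d) / (b * c).

    a, b: admitted and rejected counts in the group of interest.
    c, d: admitted and rejected counts in the comparison group.
    """
    return (a * d) / float(b * c)

# Counts from the prestige_4 frequency table above.
ratio = odds_ratio_2x2(a=12, b=55, c=114, d=216)

# Percent change in the odds relative to the comparison group.
percent_change = (ratio - 1) * 100

print(round(ratio, 4))
print(round(percent_change, 1))
```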

## Part D. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [56]:
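# Note: all four prestige indicators are included and no constant is added,
# so each prestige coefficient acts as that tier's own intercept rather than
# a contrast against tier 1.
X1 = df[['gre','gpa','prestige_1', 'prestige_2', 'prestige_3', 'prestige_4']]
c = df.admit
results = smf.Logit(c, X1).fit()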

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [57]:
results.summary()

| | | | |
|:---|:---|:---|:---|
| Dep. Variable: | admit | No. Observations: | 397 |
| Model: | Logit | Df Residuals: | 391 |
| Method: | MLE | Df Model: | 5 |
| Date: | Sun, 11 Jun 2017 | Pseudo R-squ.: | 0.08166 |
| Time: | 21:48:35 | Log-Likelihood: | -227.82 |
| converged: | True | LL-Null: | -248.08 |
| | | LLR p-value: | 1.176e-07 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|:---|:---|:---|:---|:---|:---|
| gre | 0.0022 | 0.001 | 2.028 | 0.043 | 7.44e-05 0.004 |
| gpa | 0.7793 | 0.333 | 2.344 | 0.019 | 0.128 1.431 |
| prestige_1 | -3.8769 | 1.142 | -3.393 | 0.001 | -6.116 -1.638 |
| prestige_2 | -4.5570 | 1.113 | -4.093 | 0.000 | -6.739 -2.375 |
| prestige_3 | -5.2155 | 1.151 | -4.530 | 0.000 | -7.472 -2.959 |
| prestige_4 | -5.4303 | 1.140 | -4.764 | 0.000 | -7.664 -3.196 |


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [58]:
params = results.params
conf = results.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'Odds Ratio']
print(np.exp(conf))

                2.5%     97.5%  Odds Ratio
gre         1.000074  1.004372    1.002221
gpa         1.136120  4.183113    2.180027
prestige_1  0.002207  0.194440    0.020716
prestige_2  0.001183  0.093045    0.010494
prestige_3  0.000569  0.051880    0.005432
prestige_4  0.000469  0.040919    0.004382


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

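Answer: Because the model includes all four prestige indicators and no intercept, exp(coef) for `prestige_2` (0.0105) is the baseline odds of admission for a tier-2 applicant with a GRE and GPA of zero, not a comparison against tier 1 and not a probability. Relative to tier 1, the odds ratio is exp(-4.5570 - (-3.8769)) ≈ 0.51, so a tier-2 applicant has about half the odds of admission of a tier-1 applicant with the same GRE and GPA.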
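
Because the model above was fit with all four prestige indicators and no separate intercept, the odds ratio between two tiers comes from the difference of their coefficients. A minimal sketch using the rounded coefficients from the summary (the dictionary names are illustrative):

```python
import math

# Coefficients as reported in the summary table above (rounded to 4 decimals).
coef = {
    'prestige_1': -3.8769,
    'prestige_2': -4.5570,
    'prestige_3': -5.2155,
    'prestige_4': -5.4303,
}

# Odds ratio of each tier relative to tier 1: exp(b_tier - b_tier1).
relative_or = dict(
    (name, math.exp(value - coef['prestige_1'])) for name, value in coef.items()
)

for name in sorted(relative_or):
    print('%s %.3f' % (name, relative_or[name]))
```

Tier 1 comes out at exactly 1 by construction, and the ratios for tiers 2 through 4 fall progressively below it.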

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding GRE and prestige constant, each one-point increase in GPA multiplies the odds of admission to UCLA's graduate school by about 2.18 (95% confidence interval 1.14 to 4.18).
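
An odds ratio always refers to a one-unit change in the feature, which matters when comparing `gpa` (range of a few points) with `gre` (range of hundreds of points). The sketch below rescales both; the helper `odds_ratio_for_change` is illustrative, and the `gre` coefficient is recovered from its reported odds ratio as an assumption, since the summary rounds it to 0.0022:

```python
import math

# gpa coefficient as reported in the summary; gre recovered from its
# reported odds ratio (1.002221) to avoid the coarse rounding to 0.0022.
coef_gpa = 0.7793
coef_gre = math.log(1.002221)

def odds_ratio_for_change(coef, delta):
    """Multiplicative change in the odds when a feature increases by `delta`."""
    return math.exp(coef * delta)

or_gpa_full_point = odds_ratio_for_change(coef_gpa, 1.0)
or_gpa_half_point = odds_ratio_for_change(coef_gpa, 0.5)
or_gre_100_points = odds_ratio_for_change(coef_gre, 100)

print(round(or_gpa_full_point, 2))
print(round(or_gpa_half_point, 2))
print(round(or_gre_100_points, 2))
```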

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [62]:
predict_df = [(800,4., 1, 0, 0, 0),
(800,4., 0, 1, 0, 0),
(800,4., 0, 0, 1, 0),
(800,4., 0, 0, 0, 1)]

print(results.predict(predict_df))

[ 0.73403998  0.58299512  0.41983282  0.36860803]


Answer: Holding GRE at 800 and GPA at 4.0, the student has a 73.4% chance of being admitted to UCLA's graduate program if she attended a tier-1 school, 58.3% for a tier-2 school, 42.0% for a tier-3 school, and 36.9% for a tier-4 school.
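
These predictions can be reproduced by hand: the model's log-odds are a weighted sum of the features, and the logistic function turns log-odds into a probability. The following is a sketch under one assumption, namely that the `gre` coefficient is recovered from its reported odds ratio because the summary rounds it to 0.0022; the variable names are illustrative:

```python
import math

def logistic(z):
    """Map log-odds z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients from the summary above. `gre` is recovered from its reported
# odds ratio (1.002221) because the summary rounds the coefficient to 0.0022.
coef_gre = math.log(1.002221)
coef_gpa = 0.7793
coef_prestige = {1: -3.8769, 2: -4.5570, 3: -5.2155, 4: -5.4303}

gre, gpa = 800, 4.0
probabilities = {}
for tier in sorted(coef_prestige):
    # No intercept term: the tier coefficient plays that role in this model.
    log_odds = coef_gre * gre + coef_gpa * gpa + coef_prestige[tier]
    probabilities[tier] = logistic(log_odds)
    print('tier %d: %.3f' % (tier, probabilities[tier]))
```

Up to rounding in the reported coefficients, the results should agree with the output of `results.predict` above.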

## Part E. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [68]:
X1 = df[['gre','gpa','prestige_1', 'prestige_2', 'prestige_3', 'prestige_4']]
c = df.admit
model = linear_model.LogisticRegression(C=10**2)
result = model.fit(X1,c)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [76]:
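# sklearn's LogisticRegression has no `params` attribute or `conf_int()` method.
# The fitted coefficients live in `coef_` (shape: 1 x n_features), and sklearn
# does not compute confidence intervals.
odds_ratios = pd.Series(np.exp(result.coef_[0]), index = X1.columns, name = 'Odds Ratio')
print(odds_ratios)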

Answer: TODO

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [78]:
print(result.predict(predict_df))
print(result.predict_proba(predict_df))

[1 1 0 0]
[[ 0.27926916  0.72073084]
 [ 0.42538432  0.57461568]
 [ 0.58421161  0.41578839]
 [ 0.63723026  0.36276974]]


Answer: Holding GRE at 800 and GPA at 4.0, the student has a 72.1% chance of being admitted to UCLA's graduate program if she attended a tier-1 school and a 57.5% chance if she attended a tier-2 school.

She is more likely than not to be rejected if she attended a tier-3 or tier-4 school: her chance of admission is 41.6% for tier 3 and 36.3% for tier 4.
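
For a binary model, `predict` returns the class with the larger predicted probability, which amounts to thresholding the admit probability at 0.5. A small sketch of that relationship (the helper `to_class_labels` is illustrative, not an sklearn function):

```python
# Admission probabilities reported by predict_proba above (second column).
admit_probabilities = [0.72073084, 0.57461568, 0.41578839, 0.36276974]

def to_class_labels(probabilities, threshold=0.5):
    """Label each case 1 when its probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

labels = to_class_labels(admit_probabilities)
print(labels)  # [1, 1, 0, 0]
```

This matches the `[1 1 0 0]` returned by `predict`, and shows why the probabilities carry more information than the class labels alone.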

## Part F.  Executive Summary

> ## Question 21.  Introduction
>
> Write a problem statement for this project.

Answer: Predict whether a student will be admitted to UCLA's graduate program from their GRE score, GPA, and the prestige of their undergraduate school.

In particular, what is the probability of admission when GRE and GPA are held fixed and only prestige changes?

> ## Question 22.  Dataset
>
> Write up a description of your data and any cleaning that was completed.

Answer: The dataset contains graduate admissions records from UCLA. Rows with missing values were dropped with `dropna`, leaving 397 observations. We then one-hot encoded `prestige` into a binary variable for each of the four tiers.

> ## Question 23.  Demo
>
> Provide a table that explains the data by admission status.

Answer:

| Variable Name | Variable Description | Values/Labels |
|:---|:---|:---|
| `admit` | Whether the candidate was admitted to UCLA | Binary (0 or 1) |
| `gre` | GRE score | Integer (range 200-800) |
| `gpa` | GPA | Double (range 1.0-4.0) |
| `prestige` | Prestige of the applicant's alma mater, where 1 is the highest tier | Integer (range 1-4) |

> ## Question 24.  Methods
>
> Write up the methods used in your analysis.

Answer: We fit a logistic regression with two Python libraries, `statsmodels` and `sklearn`, to predict an applicant's probability of admission from `gre`, `gpa`, and `prestige`. Comparing the two fits lets us check that the predicted probabilities agree.

> ## Question 25.  Results
>
> Write up your results.

Answer: For a student with perfect scores (GRE 800, GPA 4.0), the predicted chance of admission to UCLA's graduate program rises with the prestige of the undergraduate school. Across the two models, the student is about 72-73% likely to be admitted coming from a tier-1 school, but only about 36-37% likely coming from a tier-4 school.

> ## Question 26.  Visuals
>
> Provide a table or visualization of these results.

Answer: 

| `gre` | `gpa` | `prestige` | admit_probability (statsmodels) | admit_probability (sklearn) |
|:---|:---|:---|:---|:---|
| 800 | 4.0 | 1 | 73.4% | 72.1% |
| 800 | 4.0 | 2 | 58.3% | 57.5% |
| 800 | 4.0 | 3 | 42.0% | 41.6% |
| 800 | 4.0 | 4 | 36.9% | 36.3% |

> ## Question 27.  Discussion
>
> Write up your discussion and future steps.

Answer: A natural next step is to see how the probability of admission changes when GRE or GPA varies while prestige stays constant. I would run the same model, possibly grouping GRE into bands of 100 points and GPA into whole-number bands.
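
As a starting point for that follow-up, the fitted coefficients can be used to sweep one feature while the others stay fixed. The sketch below is only illustrative: it reuses the rounded `statsmodels` coefficients from Part D, recovers the `gre` coefficient from its reported odds ratio, and picks tier 2 and a GRE of 800 arbitrarily:

```python
import math

def admit_probability(gre, gpa, prestige_coef, coef_gre, coef_gpa):
    """Predicted admission probability from the logistic model's log-odds."""
    log_odds = coef_gre * gre + coef_gpa * gpa + prestige_coef
    return 1.0 / (1.0 + math.exp(-log_odds))

# statsmodels coefficients from Part D; tier 2 is chosen only as an example.
coef_gre = math.log(1.002221)   # recovered from the reported odds ratio
coef_gpa = 0.7793
coef_tier2 = -4.5570

# Hold GRE and prestige fixed and vary GPA in half-point steps.
gpa_grid = [2.0, 2.5, 3.0, 3.5, 4.0]
sweep = [admit_probability(800, gpa, coef_tier2, coef_gre, coef_gpa) for gpa in gpa_grid]

for gpa, prob in zip(gpa_grid, sweep):
    print('gpa %.1f -> %.3f' % (gpa, prob))
```

The same loop works for a GRE grid by swapping which argument varies.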