# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
df.groupby(['prestige','admit']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,gre,gpa
prestige,admit,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,0,28,28
1.0,1,33,33
2.0,0,95,95
2.0,1,53,53
3.0,0,93,93
3.0,1,28,28
4.0,0,55,55
4.0,1,12,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
pdummies = pd.get_dummies(df['prestige'])
pdummies

Unnamed: 0,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

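Answer: We need 3. The four indicators always sum to 1, so any one of them is determined by the other three: when all three retained indicators are 0, the applicant belongs to the omitted level, which serves as the reference category. Keeping all four alongside an intercept would make the columns perfectly collinear.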
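
As a minimal sketch of why three indicators carry the same information as four, the snippet below one-hot encodes the first five `prestige` values from the table above in plain Python and recovers the dropped column from the remaining three. The helper name `one_hot` is made up for illustration; it is not part of pandas.

```python
# First five prestige values from the table above.
prestige = [3.0, 3.0, 1.0, 4.0, 4.0]
levels = [1.0, 2.0, 3.0, 4.0]

def one_hot(value, levels):
    """Return a list of 0/1 indicators, one per level (hypothetical helper)."""
    return [1 if value == level else 0 for level in levels]

full = [one_hot(p, levels) for p in prestige]

# Drop the first column (prestige = 1), as drop_first=True does.
reduced = [row[1:] for row in full]

# The dropped indicator is fully determined by the remaining three:
# it is 1 exactly when all the others are 0.
recovered = [1 - sum(row) for row in reduced]

print(recovered)                 # indicator for prestige = 1, rebuilt
print([row[0] for row in full])  # the same column taken from the full encoding
```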

> ### Question 4.  Why are we doing this?

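Answer: `prestige` is a categorical rank, not a quantity. Encoding it as binary indicators lets linear/logistic regression estimate a separate effect for each tier instead of assuming that every one-step change in rank has the same effect.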

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [5]:
pdummies2 = pd.get_dummies(df['prestige'],drop_first = True)
pdummies2

Unnamed: 0,2.0,3.0,4.0
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0
...,...,...,...
395,1.0,0.0,0.0
396,0.0,1.0,0.0
397,1.0,0.0,0.0
398,1.0,0.0,0.0


In [6]:
df = pd.concat([df,pdummies2],axis=1)
df = df.drop('prestige',axis=1)

In [7]:
df.rename(columns = {2:'prestige2',3:'prestige3',4:'prestige4'},inplace = True)

In [8]:
df

Unnamed: 0,admit,gre,gpa,prestige2,prestige3,prestige4
0,0,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.00,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0
...,...,...,...,...,...,...
395,0,620.0,4.00,1.0,0.0,0.0
396,0,560.0,3.04,0.0,1.0,0.0
397,0,460.0,2.63,1.0,0.0,0.0
398,0,700.0,3.65,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [9]:
# prestige = 1 is the reference level, i.e. all three dummy columns are 0.
df.groupby([(df[['prestige2', 'prestige3', 'prestige4']] == 0).all(axis=1).rename('prestige1'), 'admit']).size()

prestige1  admit
False      0        243
           1         93
True       0         28
           1         33
dtype: int64


In [10]:
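# prestige = 1 is the reference level, i.e. all three dummy columns are 0.
pd.crosstab((df[['prestige2', 'prestige3', 'prestige4']] == 0).all(axis=1).rename('prestige1'), df['admit'])

admit,0,1
prestige1,Unnamed: 1_level_1,Unnamed: 2_level_1
False,243,93
True,28,33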


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [11]:
# TODO
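
One possible way to approach Questions 7 and 8 is sketched below, using only the counts shown in the Question 1 frequency table. The dictionary layout and variable names are assumptions made for illustration; odds are taken here as admitted divided by not admitted.

```python
# Counts read off the Question 1 frequency table: (prestige, admit) -> n.
counts = {
    (1, 0): 28, (1, 1): 33,
    (2, 0): 95, (2, 1): 53,
    (3, 0): 93, (3, 1): 28,
    (4, 0): 55, (4, 1): 12,
}

# Odds = (number admitted) / (number not admitted).
odds_tier1 = counts[(1, 1)] / float(counts[(1, 0)])

# Pool tiers 2, 3 and 4 for "did not attend a #1 ranked college".
admitted_rest = sum(counts[(p, 1)] for p in (2, 3, 4))
rejected_rest = sum(counts[(p, 0)] for p in (2, 3, 4))
odds_rest = admitted_rest / float(rejected_rest)

print(round(odds_tier1, 3))  # 33 / 28  -> 1.179
print(round(odds_rest, 3))   # 93 / 243 -> 0.383
```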

> ### Question 9.  Finally, what's the odds ratio?

In [12]:
# TODO
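
A sketch of the odds ratio calculation, again using only counts from the Question 1 frequency table. The helper `odds` is a hypothetical name, and the comparison group is assumed to be tiers 2 to 4 pooled together.

```python
def odds(admitted, rejected):
    """Odds of admission: admitted / not admitted (hypothetical helper)."""
    return admitted / float(rejected)

# Counts from the Question 1 frequency table.
odds_tier1 = odds(33, 28)                     # prestige = 1
odds_rest = odds(53 + 28 + 12, 95 + 93 + 55)  # prestige = 2, 3, 4 pooled

# The odds ratio compares the two groups multiplicatively.
odds_ratio = odds_tier1 / odds_rest
print(round(odds_ratio, 2))  # 3.08
```

Read as a sentence, under these assumptions: in this sample, the odds of admission for applicants from tier-1 schools are roughly three times the odds for applicants from all other tiers.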

> ### Question 10.  Write this finding in a sentence.

Answer:

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [13]:
# TODO
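
The question does not say which group the tier-4 odds should be compared against, so the sketch below computes two candidate odds ratios from the Question 1 counts: tier 4 versus all other tiers pooled, and tier 4 versus tier 1. Both choices of baseline, and the helper name `odds`, are assumptions.

```python
def odds(admitted, rejected):
    """Odds of admission: admitted / not admitted (hypothetical helper)."""
    return admitted / float(rejected)

# Counts from the Question 1 frequency table.
odds_tier4 = odds(12, 55)
odds_not_tier4 = odds(33 + 53 + 28, 28 + 95 + 93)  # tiers 1-3 pooled: 114 / 216
odds_tier1 = odds(33, 28)

or_vs_rest = odds_tier4 / odds_not_tier4   # baseline: everyone not in tier 4
or_vs_tier1 = odds_tier4 / odds_tier1      # baseline: tier 1 only

print(round(odds_tier4, 3))   # 0.218
print(round(or_vs_rest, 3))   # 0.413
print(round(or_vs_tier1, 3))  # 0.185
```

With the pooled baseline, the odds of admission for tier-4 applicants in this sample are about 0.41 times those of applicants from the other tiers.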

Answer:

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [15]:
feature_cols = ['gre', 'gpa','prestige2','prestige3','prestige4']
X = df[feature_cols]
y = df['admit']

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [17]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# transform our training features
X_train_std = stdsc.fit_transform(X_train)
# transform the testing features in the same way
X_test_std = stdsc.transform(X_test)
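
To make the fit/transform split concrete, here is a plain-Python sketch of what standardization does for a single feature: the mean and (population) standard deviation are learned from the training data only and then reused on the test data. The GPA values are made up for illustration and the function names are hypothetical; this mirrors the idea behind `StandardScaler`, not its actual implementation.

```python
import math

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = sum(column) / float(len(column))
    variance = sum((x - mean) ** 2 for x in column) / float(len(column))
    return mean, math.sqrt(variance)

def transform(column, mean, scale):
    """Apply a previously learned mean and scale to any data."""
    return [(x - mean) / scale for x in column]

# Toy GPA values (made up for illustration, not taken from the dataset).
gpa_train = [2.0, 3.0, 4.0]
gpa_test = [3.5]

mean, scale = fit_scaler(gpa_train)          # statistics come from training data only
train_std = transform(gpa_train, mean, scale)
test_std = transform(gpa_test, mean, scale)  # reuse them; do not refit on test data

print(train_std)
print(test_std)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.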

In [18]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(penalty='l2', C=10)

> ### Question 13.  Print the model's summary results.

In [20]:
logreg.fit(X_train_std, y_train)
zip(feature_cols, logreg.coef_[0])

[('gre', 0.37496813294012232),
 ('gpa', 0.31756989530859919),
 ('prestige2', -0.2156055411197883),
 ('prestige3', -0.62238557701320429),
 ('prestige4', -0.62152306072838859)]

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [25]:
gre_odds = np.exp(0.37496813294012232)
gpa_odds = np.exp(0.31756989530859919)
prestige2_odds = np.exp(-0.2156055411197883)
prestige3_odds = np.exp(-0.62238557701320429)
prestige4_odds = np.exp(-0.62152306072838859)
print gre_odds, gpa_odds, prestige2_odds, prestige3_odds, prestige4_odds

1.45494504906 1.37378526265 0.806053194012 0.536662659208 0.53712573917
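
The same odds ratios can be computed in one pass instead of one variable per feature. The sketch below copies the coefficients from the Question 13 output; in the notebook they would presumably be read from `logreg.coef_[0]` directly. Note that scikit-learn's `LogisticRegression` does not report standard errors, so the 95% confidence intervals asked for here cannot be read off this fit; a `statsmodels` logit fit exposes them through `conf_int()`.

```python
import math

feature_cols = ['gre', 'gpa', 'prestige2', 'prestige3', 'prestige4']

# Coefficients copied from the Question 13 output (standardized features).
coefs = [0.37496813294012232, 0.31756989530859919, -0.2156055411197883,
         -0.62238557701320429, -0.62152306072838859]

# Odds ratio = exp(coefficient), one per feature.
odds_ratios = {name: math.exp(b) for name, b in zip(feature_cols, coefs)}

for name in feature_cols:
    print(name, round(odds_ratios[name], 3))
```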


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

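Answer: The odds ratio for `prestige2` is about 0.81, i.e. roughly a 19% reduction in the odds (not the probability) of admission. Because the features were standardized before fitting, this applies to a one-standard-deviation increase in the `prestige2` indicator, holding the other features fixed, rather than directly to moving from a tier-1 to a tier-2 school.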

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: The odds ratio for `gpa` is about 1.37: a one-standard-deviation increase in GPA multiplies the odds of admission by about 1.37 (a 37% increase in the odds), holding the other features fixed.
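
Turning a logistic regression coefficient into a percent change in the odds is a short calculation: exponentiate, subtract 1, multiply by 100. The sketch below applies it to the `prestige2` and `gpa` coefficients from the Question 13 output; the function name is hypothetical.

```python
import math

def percent_change_in_odds(coef):
    """Percent change in the odds for a one-unit increase in the feature."""
    return (math.exp(coef) - 1) * 100

# Coefficients copied from the Question 13 output (standardized features,
# so "one unit" means one standard deviation).
print(round(percent_change_in_odds(-0.2156055411197883), 1))  # prestige2: -19.4
print(round(percent_change_in_odds(0.31756989530859919), 1))  # gpa: 37.4
```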

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

Answer:
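
The notebook does not show the fitted intercept or the scaler's means and scales, so the probabilities cannot be computed from the text alone. The sketch below only illustrates the calculation: standardize the raw features, form the log-odds, and apply the logistic function. The function name and every numeric parameter are placeholders, not fitted values.

```python
import math

def admission_probability(raw, coefs, intercept, means, scales):
    """P(admit) for one applicant described by raw (unstandardized) features."""
    # Standardize each feature the same way the training data was standardized.
    z = [(x - m) / s for x, m, s in zip(raw, means, scales)]
    # Linear predictor (log-odds), then the logistic function.
    log_odds = intercept + sum(b * zi for b, zi in zip(coefs, z))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Sanity check with placeholder parameters (NOT the fitted values):
# an applicant exactly at the feature means, with a zero intercept,
# has log-odds 0 and therefore probability 0.5.
placeholder_coefs = [0.4, 0.3, -0.2, -0.6, -0.6]
placeholder_means = [600.0, 3.4, 0.4, 0.3, 0.2]
placeholder_scales = [100.0, 0.4, 0.5, 0.5, 0.4]

p = admission_probability(placeholder_means, placeholder_coefs, 0.0,
                          placeholder_means, placeholder_scales)
print(p)  # 0.5
```

With the fitted objects, the inputs would be `logreg.coef_[0]`, `logreg.intercept_[0]`, `stdsc.mean_` and `stdsc.scale_`, and the raw feature vectors (in the order `gre`, `gpa`, `prestige2`, `prestige3`, `prestige4`) would be `[800, 4, 0, 0, 0]` for tier 1, `[800, 4, 1, 0, 0]` for tier 2, and so on. Calling `logreg.predict_proba(stdsc.transform(profiles))[:, 1]` on those rows should give the same numbers.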

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [29]:
logreg2 = LogisticRegression(penalty='l2', C=10**2)

In [27]:
logreg2.fit(X_train_std, y_train)
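# NOTE: this inspects `logreg` (the C=10 model), not `logreg2`, so the output
# repeats the earlier coefficients; use `logreg2.coef_[0]` for the refit model.
zip(feature_cols, logreg.coef_[0])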

[('gre', 0.37496813294012232),
 ('gpa', 0.31756989530859919),
 ('prestige2', -0.2156055411197883),
 ('prestige3', -0.62238557701320429),
 ('prestige4', -0.62152306072838859)]

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [28]:
gre_odds = np.exp(0.37496813294012232)
gpa_odds = np.exp(0.31756989530859919)
prestige2_odds = np.exp(-0.2156055411197883)
prestige3_odds = np.exp(-0.62238557701320429)
prestige4_odds = np.exp(-0.62152306072838859)
print gre_odds, gpa_odds, prestige2_odds, prestige3_odds, prestige4_odds

1.45494504906 1.37378526265 0.806053194012 0.536662659208 0.53712573917


Answer:

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: