# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [66]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [23]:
df.groupby('prestige').count()

Unnamed: 0_level_0,admit,gre,gpa
prestige,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,61,61,61
2.0,148,148,148
3.0,121,121,121
4.0,67,67,67


In [54]:
df['prestige'].value_counts()

2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64
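Neither cell above splits the counts by `admit`, which is what the question asks for. A minimal sketch of one way to build the two-way table with `pd.crosstab`, run on a small made-up frame (`toy` and its values are illustrative, not the admissions data):

```python
import pandas as pd

# Hypothetical toy data standing in for the admissions DataFrame.
toy = pd.DataFrame({
    'prestige': [1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 4.0, 4.0],
    'admit':    [1,   0,   1,   0,   0,   0,   0,   1],
})

# Rows are prestige levels, columns are admit outcomes, cells are counts.
freq = pd.crosstab(toy['prestige'], toy['admit'])
print(freq)
```

On the real data the equivalent call would be `pd.crosstab(df['prestige'], df['admit'])`.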

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [24]:
import sklearn

In [67]:
pdummies = pd.get_dummies(df['prestige'])
pdummies

Unnamed: 0,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [68]:
pdummies2 = pd.get_dummies(df['prestige'])
pdummies2

Unnamed: 0,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [69]:
df.rename(columns = {2:'prestige2',3:'prestige3',4:'prestige4'},inplace = True)

In [70]:
df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


> ### Question 3.  How many of these binary variables do we need for modeling?

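Answer: 3. The four categories are mutually exclusive, so the fourth dummy is fully determined by the other three: if all 3 binary variables are 0, the remaining variable must be 1. Including all four alongside an intercept would make the columns perfectly collinear.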
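To see why three columns suffice, here is a short sketch on made-up prestige values (the `prestige` series below is hypothetical). It checks that the four dummies always sum to 1 across a row, then uses `drop_first=True` to keep three columns and treat the first level as the baseline:

```python
import pandas as pd

# Hypothetical prestige values, not the admissions data.
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 2.0], name='prestige')

dummies = pd.get_dummies(prestige, prefix='prestige').astype(int)

# Every row has exactly one 1, so the four columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Dropping the first level keeps three columns; prestige 1 becomes the baseline.
reduced = pd.get_dummies(prestige, prefix='prestige', drop_first=True).astype(int)
print(reduced.columns.tolist())
```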

> ### Question 4.  Why are we doing this?

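Answer: `prestige` is a categorical ranking, not a measured quantity. Converting it into binary dummy variables lets the regression estimate a separate effect for each level instead of assuming a constant step between ranks.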

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [71]:
df = pd.concat([df,pdummies2],axis=1)
df = df.drop('prestige',axis=1)

In [78]:
df.rename(columns = {1:'prestige1',2:'prestige2',3:'prestige3',4:'prestige4'},inplace = True)

In [79]:
df

Unnamed: 0,admit,gre,gpa,prestige1,prestige2,prestige3,prestige4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [87]:
df.groupby(['prestige1','admit']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,gre,gpa,prestige2,prestige3,prestige4
prestige1,admit,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,0,243,243,243,243,243
0.0,1,93,93,93,93,93
1.0,0,28,28,28,28,28
1.0,1,33,33,33,33,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [101]:
a=33/61.0

In [102]:
b=1-a

In [103]:
a/b

1.1785714285714288

Or

In [104]:
y=33/28.0
y

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [106]:
n=93/243.0
n

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [107]:
y/n

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission to UCLA for applicants who attended a #1 ranked college are about 3.08 times the odds for applicants who did not.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [109]:
df.groupby(['prestige4','admit']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,gre,gpa,prestige1,prestige2,prestige3
prestige4,admit,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,0,216,216,216,216,216
0.0,1,114,114,114,114,114
1.0,0,55,55,55,55,55
1.0,1,12,12,12,12,12


In [112]:
y2=12/55.0
y2

0.21818181818181817

In [113]:
n2=114/216.0
n2

0.5277777777777778

In [114]:
y2/n2

0.4133971291866028

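Answer: The odds of admission for applicants from the least prestigious (prestige 4) schools are about 0.22, and their odds ratio relative to all other applicants is about 0.41. An odds ratio between 0 and 1 means lower odds: these applicants have roughly 41% of the odds of admission of applicants from more prestigious schools.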
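The hand calculations in Questions 7 through 11 can be wrapped in a small helper. This is only a sketch; the function names `odds` and `odds_ratio` are made up for illustration, and the counts are the ones read off the two frequency tables above:

```python
def odds(admitted, rejected):
    """Odds of admission: admitted count divided by rejected count."""
    return admitted / float(rejected)


def odds_ratio(group_admitted, group_rejected, rest_admitted, rest_rejected):
    """Odds of the group divided by the odds of everyone else."""
    return odds(group_admitted, group_rejected) / odds(rest_admitted, rest_rejected)


# Counts read off the two frequency tables above.
or_prestige1 = odds_ratio(33, 28, 93, 243)
or_prestige4 = odds_ratio(12, 55, 114, 216)
print(round(or_prestige1, 4), round(or_prestige4, 4))
```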

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [124]:
feature_cols = ['gre','gpa','prestige2','prestige3','prestige4']
X = df[feature_cols]
y = df['admit']

In [130]:
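formula = 'admit ~ gre + gpa + prestige2 + prestige3 + prestige4'
# Note: smf.ols fits a linear probability model by least squares;
# smf.logit(formula=formula, data=df) would fit the logistic regression the question asks for.
mod = smf.ols(formula=formula, data=df)
res = mod.fit()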

> ### Question 13.  Print the model's summary results.

In [122]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                  admit   R-squared:                       0.099
Model:                            OLS   Adj. R-squared:                  0.087
Method:                 Least Squares   F-statistic:                     8.594
Date:                Tue, 17 Jan 2017   Prob (F-statistic):           9.71e-08
Time:                        23:55:06   Log-Likelihood:                -239.02
No. Observations:                 397   AIC:                             490.0
Df Residuals:                     391   BIC:                             513.9
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.2377      0.217     -1.095      0.2

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [None]:
# TODO
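Odds ratios come from exponentiating log-odds coefficients, so they apply to a logit fit and not to the OLS coefficients printed above. A minimal sketch of the usual idiom, `np.exp(res.params)` and `np.exp(res.conf_int())`, run here on synthetic data (the frame `toy`, its generating coefficients, and the reduced formula are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only; not the admissions data.
rng = np.random.RandomState(0)
n = 500
toy = pd.DataFrame({
    'gre': rng.normal(580, 115, n),
    'gpa': rng.normal(3.4, 0.38, n),
})
# Assumed "true" log-odds used only to generate plausible 0/1 labels.
log_odds = -4.0 + 0.002 * toy['gre'] + 0.8 * toy['gpa']
toy['admit'] = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

logit_res = smf.logit('admit ~ gre + gpa', data=toy).fit(disp=0)

# Exponentiating a log-odds coefficient and its interval gives the odds ratio.
odds_ratios = np.exp(logit_res.params)
conf = np.exp(logit_res.conf_int())
conf.columns = ['2.5%', '97.5%']
print(pd.concat([odds_ratios.rename('OR'), conf], axis=1))
```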

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer:

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer:

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [136]:
#prestige1
0.0004*800 + 0.1508*4 - 0.2377

0.6855

In [137]:
#prestige2
0.0004*800 + 0.1508*4 + -0.1635*1 - 0.2377 

0.522

In [138]:
#prestige3
0.0004*800 + 0.1508*4 + -0.2910*1 - 0.2377 

0.3945000000000001

In [139]:
#prestige4
0.0004*800 + 0.1508*4 + -0.3240*1 - 0.2377 

0.36149999999999993
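The four cells above repeat the same linear formula with a different prestige term each time. A compact sketch of the same arithmetic, using the rounded coefficients that appear in those cells (the dictionary layout is an illustrative choice). These values are linear-probability predictions; with a logit fit, the linear predictor would instead have to be passed through the logistic function to give a probability:

```python
# Coefficients as used in the four cells above (rounded values from the OLS fit).
intercept = -0.2377
coef = {'gre': 0.0004, 'gpa': 0.1508,
        'prestige2': -0.1635, 'prestige3': -0.2910, 'prestige4': -0.3240}

gre, gpa = 800, 4

predictions = {}
for tier in (1, 2, 3, 4):
    # Tier 1 is the reference level, so it adds no prestige term.
    prestige_term = coef.get('prestige%d' % tier, 0.0)
    predictions[tier] = intercept + coef['gre'] * gre + coef['gpa'] * gpa + prestige_term

for tier in sorted(predictions):
    print(tier, round(predictions[tier], 4))
```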

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [142]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [148]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

In [155]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=10**2)

In [160]:
logreg.fit(X_train_std, y_train)
list(zip(feature_cols, logreg.coef_[0]))

[('gre', 0.37561930833368101),
 ('gpa', 0.31820568345905215),
 ('prestige2', -0.21849094053796464),
 ('prestige3', -0.6260474455431464),
 ('prestige4', -0.62507506217801911)]

In [162]:
logreg.intercept_[0]

-0.80551529668293465

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO
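One possible way to get the `sklearn` odds ratios is to exponentiate the fitted coefficients. The sketch below copies the coefficient values from the output above rather than refitting. Because the model was trained on standardized features, each ratio describes a one-standard-deviation change, so the numbers are not directly comparable with odds ratios on the original units:

```python
import numpy as np

# Coefficients copied from the fitted model's output above.
feature_cols = ['gre', 'gpa', 'prestige2', 'prestige3', 'prestige4']
coefs = np.array([0.37561930833368101, 0.31820568345905215,
                  -0.21849094053796464, -0.6260474455431464,
                  -0.62507506217801911])

# exp(coefficient) is the odds ratio for a one-unit change in the feature.
# The features were standardized, so one unit here is one standard deviation.
sk_odds_ratios = dict(zip(feature_cols, np.exp(coefs)))
for name in feature_cols:
    print(name, round(sk_odds_ratios[name], 3))
```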

Answer:

> ### Question 20.  Again assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [163]:
#prestige1
0.37561930833368101*800 + 0.31820568345905215*4  + logreg.intercept_[0]

300.96275410409811

In [164]:
#prestige2
0.37561930833368101*800 + 0.31820568345905215*4 + -0.21849094053796464*1 + logreg.intercept_[0]

300.74426316356016

In [165]:
#prestige3
0.37561930833368101*800 + 0.31820568345905215*4 + -0.6260474455431464*1 + logreg.intercept_[0]

300.33670665855499

In [167]:
#prestige4
0.37561930833368101*800 + 0.31820568345905215*4 + -0.62507506217801911*1 + logreg.intercept_[0]

300.33767904192007
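The four values above are not probabilities. They plug raw GRE and GPA values into coefficients that were fit on standardized features, and the result of the linear formula is a log-odds in any case. A hedged sketch of the usual approach: transform the new rows with the same fitted `StandardScaler`, then call `predict_proba`. The training data here is synthetic with arbitrary labels (`X_toy`, `y_toy`, `scaler`, `model`, and `students` are illustrative names), so the printed numbers mean nothing for the admissions data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_cols = ['gre', 'gpa', 'prestige2', 'prestige3', 'prestige4']

# Synthetic stand-in for the training data; not the admissions data.
rng = np.random.RandomState(1)
n = 400
tier = rng.randint(1, 5, n)
X_toy = pd.DataFrame({
    'gre': rng.normal(580, 115, n),
    'gpa': rng.normal(3.4, 0.38, n),
    'prestige2': (tier == 2).astype(float),
    'prestige3': (tier == 3).astype(float),
    'prestige4': (tier == 4).astype(float),
})[feature_cols]
y_toy = rng.binomial(1, 0.3, n)  # arbitrary labels, for shape only

scaler = StandardScaler().fit(X_toy)
model = LogisticRegression(C=10 ** 2).fit(scaler.transform(X_toy), y_toy)

# One row per tier for a student with GRE 800 and GPA 4.
students = pd.DataFrame(
    [[800, 4.0, 0, 0, 0], [800, 4.0, 1, 0, 0],
     [800, 4.0, 0, 1, 0], [800, 4.0, 0, 0, 1]],
    columns=feature_cols)

# Apply the training scaler first; column 1 of predict_proba is P(admit = 1).
probs = model.predict_proba(scaler.transform(students))[:, 1]
print(probs)
```

In the notebook itself the corresponding call would be `logreg.predict_proba(stdsc.transform(students))`, reusing the fitted `stdsc` and `logreg` objects.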

Answer: