# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [2]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn import linear_model

In [41]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [5]:
pd.crosstab(df.prestige, df.admit, dropna = False)

admit,0,1
prestige,,
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [42]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

In [43]:
prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
                           'prestige_2.0': 'prestige_2',
                           'prestige_3.0': 'prestige_3',
                           'prestige_4.0': 'prestige_4'}, inplace = True)

prestige_df

,prestige_1,prestige_2,prestige_3,prestige_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

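Answer: We need 3 of the 4 binary variables. The four dummies always sum to 1, so including all of them alongside an intercept makes the design matrix perfectly collinear. The omitted tier becomes the reference category against which the other coefficients are interpreted.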

> ### Question 4.  Why are we doing this?

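Answer: `prestige` is a categorical variable: its codes 1 to 4 label tiers rather than measure a quantity. Entering it as a single number would force every one-step change in tier to have the same effect on admission. One-hot encoding lets the regression estimate a separate effect for each tier.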
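The effect of keeping all four dummies versus dropping one can be sketched on a toy column. The `prestige` values below are hypothetical, `design_rank` is a helper name introduced only for this illustration, and `drop_first=True` is one possible way to make tier 1 the reference level.

```python
import numpy as np
import pandas as pd

# Toy prestige column (hypothetical values covering all four tiers).
prestige = pd.Series([1, 2, 3, 4, 2, 3, 1, 4], name='prestige')

# All four indicator columns, as built in the cell above.
full = pd.get_dummies(prestige, prefix='prestige', dtype=int)

# Dropping the first level (tier 1) makes it the reference category.
reduced = pd.get_dummies(prestige, prefix='prestige', drop_first=True, dtype=int)

def design_rank(dummies):
    """Rank of the design matrix [intercept | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return np.linalg.matrix_rank(X)

# Five columns but rank 4: the dummies sum to the intercept column.
print(list(full.columns), design_rank(full))
# Four columns and rank 4: every coefficient is identifiable.
print(list(reduced.columns), design_rank(reduced))
```

With an intercept in the model, only the reduced encoding gives a full-rank design matrix.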

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [44]:
df = df.join([prestige_df])
df.columns

Index(['admit', 'gre', 'gpa', 'prestige', 'prestige_1', 'prestige_2',
       'prestige_3', 'prestige_4'],
      dtype='object')

In [45]:
df.drop(['prestige'], axis = 1, inplace = True)
df.columns

Index(['admit', 'gre', 'gpa', 'prestige_1', 'prestige_2', 'prestige_3',
       'prestige_4'],
      dtype='object')

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [9]:
pd.crosstab(df.prestige_1, df.admit, dropna = False)

admit,0,1
prestige_1,,
0.0,243,93
1.0,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [33]:
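# Odds = admitted / not admitted, using the prestige_1 = 1 row of the table
round(33 / 28, 4)

1.1786

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [34]:
# Same calculation for the prestige_1 = 0 row
round(93 / 243, 4)

0.3827

> ### Question 9.  Finally, what's the odds ratio?

In [35]:
# Odds ratio = odds for tier 1 / odds for everyone else
round((33 / 28) / (93 / 243), 4)

3.0795

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission are about 1.18 for applicants from a #1 ranked undergraduate college and about 0.38 for all other applicants, so attending a #1 ranked college multiplies the odds of admission to UCLA by roughly 3.1.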
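The distinction between a probability, odds, and an odds ratio can be sketched with two small helpers applied to the `prestige_1` frequency table above. The function names `odds` and `odds_ratio` are hypothetical and exist only for this illustration.

```python
def odds(admitted, rejected):
    """Odds of admission: admitted applicants per rejected applicant."""
    return admitted / rejected

def odds_ratio(group, rest):
    """Odds ratio of `group` relative to `rest`; each is (admitted, rejected)."""
    return odds(*group) / odds(*rest)

# Counts from the prestige_1 frequency table: (admitted, rejected).
tier_1 = (33, 28)
others = (93, 243)

print(round(odds(*tier_1), 4))               # 1.1786
print(round(odds(*others), 4))               # 0.3827
print(round(odds_ratio(tier_1, others), 4))  # 3.0795
```

Note that odds divide by the number rejected, not by the total number of applicants; dividing by the total gives a proportion instead.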

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [37]:
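# Odds for tier 4 (12 admitted, 55 not) divided by odds for tiers 1-3 (114 admitted, 216 not)
round((12 / 55) / (114 / 216), 4)

0.4134

Answer: The odds of admission for applicants from a #4 ranked undergraduate college are 12/55 ≈ 0.22, compared with 114/216 ≈ 0.53 for all other applicants. The odds ratio is about 0.41, so attending a #4 ranked college cuts the odds of admission to UCLA by roughly 59%.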

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [22]:
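# Note: smf.ols fits a linear probability model, not a logistic regression,
# and only prestige_4 is included (tiers 1-3 together form the reference).
model = smf.ols(formula = 'admit ~ gre + gpa + prestige_4', data = df).fit()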

> ### Question 13.  Print the model's summary results.

In [23]:
model.summary()

0,1,2,3
Dep. Variable:,admit,R-squared:,0.059
Model:,OLS,Adj. R-squared:,0.052
Method:,Least Squares,F-statistic:,8.181
Date:,"Tue, 07 Feb 2017",Prob (F-statistic):,2.7e-05
Time:,14:16:23,Log-Likelihood:,-247.69
No. Observations:,397,AIC:,503.4
Df Residuals:,393,BIC:,519.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,-0.4413,0.211,-2.092,0.037,-0.856 -0.027
gre,0.0005,0.000,2.443,0.015,0.000 0.001
gpa,0.1404,0.065,2.158,0.032,0.012 0.268
prestige_4,-0.1428,0.061,-2.337,0.020,-0.263 -0.023

0,1,2,3
Omnibus:,410.966,Durbin-Watson:,1.938
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58.503
Skew:,0.696,Prob(JB):,1.98e-13
Kurtosis:,1.735,Cond. No.,5740.0


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

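Answer: The model above was fit with `smf.ols`, so its coefficients are changes in admission probability, not log-odds, and odds ratios cannot be read from it. Odds ratios are the exponentiated coefficients of a logistic fit (for example `smf.logit`). What the summary does provide is the following.

P>|t| (p-values) of the OLS coefficients:
Intercept : 0.037
gre : 0.015
gpa : 0.032
prestige_4 : 0.020

95% confidence intervals of the OLS coefficients:
Intercept : -0.856 to -0.027
gre : 0.000 to 0.001
gpa : 0.012 to 0.268
prestige_4 : -0.263 to -0.023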
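As a sketch of how odds ratios and 95% confidence intervals can be obtained with the highest-prestige tier as the reference, the snippet below works directly from the Part A frequency table. The helper `odds_ratio_with_ci` is a hypothetical name, and the Wald interval on the log scale is an assumption about the interval type. These ratios are unadjusted: a logistic fit that also includes `gre` and `gpa` would generally give different values.

```python
import math

# (admitted, rejected) counts per prestige tier, from the Part A frequency table.
counts = {1: (33, 28), 2: (53, 95), 3: (28, 93), 4: (12, 55)}

def odds_ratio_with_ci(group, reference, z=1.96):
    """Odds ratio of `group` vs. `reference` with a Wald 95% interval.

    The interval is built on the log scale, where the estimate is
    approximately normal, and then exponentiated.
    """
    a, b = group
    c, d = reference
    log_or = math.log((a / b) / (c / d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return tuple(math.exp(v) for v in (log_or, log_or - z * se, log_or + z * se))

# Tier 1 is the reference, so it has no odds ratio of its own.
for tier in (2, 3, 4):
    est, lo, hi = odds_ratio_with_ci(counts[tier], counts[1])
    print(f"prestige_{tier}: OR = {est:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

An interval that excludes 1 indicates that the odds for that tier differ from the tier-1 odds at the 5% level.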

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

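Answer:
The model in Question 12 has no `prestige_2` term; the values below come from a fit that is not shown in this notebook, and the row labelled `admit` presumably refers to the intercept, since `admit` is the response:
admit : 0.010
gre : 0.017
gpa : 0.017
prestige_2 : 0.179

From the Part A frequency table, the unadjusted odds ratio of tier 2 relative to tier 1 is (53/95)/(33/28) ≈ 0.47: the odds of admission for applicants from tier-2 schools are a little under half the odds for applicants from tier-1 schools.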

> ### Question 16.  Interpret the odds ratio of `gpa`.

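Answer: The model above is an OLS fit, so the `gpa` coefficient of 0.1404 is a change in probability rather than a log-odds. Each additional GPA point is associated with an admission probability about 14 percentage points higher, holding `gre` and `prestige_4` fixed (95% CI 0.012 to 0.268). With p = 0.032, `gpa` is significant at the 5% level, but its p-value is the largest of the three predictors, so the output does not show it to be more significant than `gre` (0.015) or `prestige_4` (0.020).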

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [24]:
smf.ols(formula = 'admit ~ prestige_1 + prestige_2 + prestige_3 + prestige_4', data = df).fit().summary()

0,1,2,3
Dep. Variable:,admit,R-squared:,0.064
Model:,OLS,Adj. R-squared:,0.056
Method:,Least Squares,F-statistic:,8.899
Date:,"Tue, 07 Feb 2017",Prob (F-statistic):,1.02e-05
Time:,14:28:13,Log-Likelihood:,-246.67
No. Observations:,397,AIC:,501.3
Df Residuals:,393,BIC:,517.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,0.2619,0.019,13.439,0.000,0.224 0.300
prestige_1,0.2791,0.049,5.702,0.000,0.183 0.375
prestige_2,0.0962,0.035,2.764,0.006,0.028 0.165
prestige_3,-0.0305,0.037,-0.817,0.415,-0.104 0.043
prestige_4,-0.0828,0.047,-1.760,0.079,-0.175 0.010

0,1,2,3
Omnibus:,218.144,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,57.254
Skew:,0.725,Prob(JB):,3.69e-13
Kurtosis:,1.834,Cond. No.,2820000000000000.0


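Answer:
The `P>|t|` column holds p-values, not probabilities. In this OLS (linear probability) model, the predicted probability of admission for each tier is the intercept plus that tier's coefficient:
prestige : probability
1 : 0.541
2 : 0.358
3 : 0.231
4 : 0.179

These equal the raw admission rates in the Part A frequency table (for example, 33/61 for tier 1). The model has no `gre` or `gpa` term, so the prediction is the same for every GRE and GPA, including 800 and 4. Including all four dummies together with an intercept also makes the design matrix singular, which is why the condition number is about 2.8e15.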
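The fitted values of the prestige-only OLS model can be checked by hand. The sketch below copies the coefficients from the summary above and compares intercept plus coefficient with the observed admission rate in each tier; the variable names are illustrative only.

```python
# Coefficients copied from the OLS summary above (prestige-only model).
intercept = 0.2619
coef = {1: 0.2791, 2: 0.0962, 3: -0.0305, 4: -0.0828}

# (admitted, total) per tier, from the Part A frequency table.
observed = {1: (33, 61), 2: (53, 148), 3: (28, 121), 4: (12, 67)}

for tier in sorted(coef):
    fitted = intercept + coef[tier]
    admitted, total = observed[tier]
    print(f"tier {tier}: fitted = {fitted:.3f}, observed rate = {admitted / total:.3f}")
```

The two columns agree to three decimals, which is expected when the only predictors are indicators for every category.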

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [30]:
X = df[['gre', 'gpa', 'prestige_4']]
c = df.admit

# Note: this uses the default C = 1.0, not the C = 10 ** 2 given in the question
model_sk = linear_model.LogisticRegression().\
    fit(X, c)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [31]:
print(model_sk.coef_)
print(model_sk.intercept_)

[[ 0.00200944  0.0726541  -0.73971444]]
[-2.06971923]


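Answer:
`coef_` holds log-odds, so the odds ratios are the exponentials of the printed coefficients:
gre : exp(0.00201) ≈ 1.002
gpa : exp(0.0727) ≈ 1.075
prestige_4 : exp(-0.7397) ≈ 0.477

These cannot be compared directly with the `statsmodels` results above, because that model was fit with OLS and its coefficients are changes in probability, not log-odds. The `sklearn` fit also uses the default regularization strength (`C = 1.0`) rather than `C = 10 ** 2`, so its coefficients are shrunk toward zero more than the question intends.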
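Because `coef_` from a logistic regression is on the log-odds scale, exponentiating it converts each coefficient into an odds ratio. The sketch below copies the values printed by `model_sk.coef_` above; rescaling `gre` to a 100-point change is an illustrative choice, not something the question asks for.

```python
import math

# Log-odds coefficients printed by model_sk.coef_ above.
coef = {'gre': 0.00200944, 'gpa': 0.0726541, 'prestige_4': -0.73971444}

# exp(coefficient) is the multiplicative change in odds per one-unit increase.
odds_ratios = {name: math.exp(b) for name, b in coef.items()}
for name, value in odds_ratios.items():
    print(f"{name}: {value:.3f}")

# A one-point GRE change is tiny, so a 100-point change is easier to read.
gre_per_100 = math.exp(100 * coef['gre'])
print(f"gre (per 100 points): {gre_per_100:.3f}")
```

In the notebook itself, `np.exp(model_sk.coef_)` would perform the same conversion on the fitted model.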

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [29]:
X1 = df[['prestige_1', 'prestige_2', 'prestige_3','prestige_4']]
c1 = df.admit

model_sk = linear_model.LogisticRegression().\
    fit(X1, c1)

print(model_sk.coef_)
print(model_sk.intercept_)

[[ 0.73283094  0.03228189 -0.55783816 -0.82409122]]
[-0.61681655]


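Answer:
`coef_` and `intercept_` are log-odds, so the probability for each tier is the logistic function of the intercept plus that tier's coefficient:
prestige : probability
1 : 0.529
2 : 0.358
3 : 0.236
4 : 0.191

This model contains only the prestige dummies, so these probabilities are the same for every GRE and GPA. With the Question 18 coefficients (`gre`, `gpa`, `prestige_4`), a student with a GRE of 800 and a GPA of 4 has a predicted probability of about 0.46 from a tier-1, tier-2, or tier-3 school and about 0.29 from a tier-4 school.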
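Turning logistic-regression coefficients into probabilities requires the logistic (sigmoid) function. The sketch below applies it to the two sets of `sklearn` coefficients printed in this part; `sigmoid` and the variable names are introduced only for this illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Prestige-only model: values printed by coef_ and intercept_ above.
intercept = -0.61681655
coef = {1: 0.73283094, 2: 0.03228189, 3: -0.55783816, 4: -0.82409122}

probabilities = {tier: sigmoid(intercept + b) for tier, b in coef.items()}
for tier, p in probabilities.items():
    print(f"tier {tier}: {p:.3f}")

# Question 18 model (gre, gpa, prestige_4) for GRE = 800 and GPA = 4.
b0, b_gre, b_gpa, b_p4 = -2.06971923, 0.00200944, 0.0726541, -0.73971444
base = b0 + b_gre * 800 + b_gpa * 4
p_tiers_1_to_3 = sigmoid(base)      # prestige_4 = 0
p_tier_4 = sigmoid(base + b_p4)     # prestige_4 = 1
print(f"tiers 1-3: {p_tiers_1_to_3:.3f}, tier 4: {p_tier_4:.3f}")
```

On a fitted model, `model_sk.predict_proba` performs this calculation directly; the second column of its output is the probability of admission.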