# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', 'DAT-NYC-45','unit-project', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
# TODO
pres = df.groupby(['prestige', 'admit'])

In [4]:
pres['prestige'].value_counts()

prestige  admit  prestige
1.0       0      1.0         28
          1      1.0         33
2.0       0      2.0         95
          1      2.0         53
3.0       0      3.0         93
          1      3.0         28
4.0       0      4.0         55
          1      4.0         12
Name: prestige, dtype: int64
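
The same counts can also be laid out as a two-way table, which is easier to read when calculating odds later on. The sketch below is one way to do that with `pd.crosstab`; the `counts` frame is a stand-in rebuilt from the output above so the block runs without the CSV. With the real data, `pd.crosstab(df['prestige'], df['admit'])` should give the same layout.

```python
import pandas as pd

# Stand-in data: one row per (prestige, admit) pair, with the counts printed above.
counts = pd.DataFrame({
    'prestige': [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0],
    'admit':    [0, 1, 0, 1, 0, 1, 0, 1],
    'n':        [28, 33, 95, 53, 93, 28, 55, 12],
})

# Rows are prestige tiers, columns are admit = 0 / 1, cells are applicant counts.
freq = pd.crosstab(counts['prestige'], counts['admit'],
                   values=counts['n'], aggfunc='sum')
print(freq)
```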

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [5]:
# TODO
pres2 = pd.get_dummies(df['prestige'], drop_first= True, prefix= 'prestige')

> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We only need three of the four binary variables for modeling; the fourth is implied when the other three are all 0.

> ### Question 4.  Why are we doing this?

Answer: We one-hot encode `prestige` because it is a categorical rank, not a measured quantity. Linear models such as logistic regression learn a single weight per feature, and other algorithms rely on distances between samples. If `prestige` were left as one numeric column, the model would treat the tiers as evenly spaced and apply the same weight to every one-step change in tier, which is not something we know to be true. We drop one of the binary variables created by the one-hot encoding to avoid perfect multicollinearity among the dummy variables; the dropped category becomes the reference level.
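
As a small illustration of the point about dropping one dummy, the sketch below encodes a toy `prestige` column (made-up values, not the admissions data) with and without `drop_first`. The names `toy`, `full`, and `reduced` are only for this example.

```python
import pandas as pd

# Toy column with the four prestige tiers (illustrative values, not the real data).
toy = pd.DataFrame({'prestige': [1.0, 2.0, 3.0, 4.0, 2.0]})

# Full one-hot encoding: four columns, exactly one of which is set in every row.
full = pd.get_dummies(toy['prestige'], prefix='prestige')

# Dropping the first level leaves three columns; a row of all zeros means tier 1.
reduced = pd.get_dummies(toy['prestige'], prefix='prestige', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Because the four full columns always sum to 1, any one of them can be reconstructed from the other three, which is why only three are needed.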

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [6]:
# TODO
df = df.join(pres2)

In [7]:
df = df.drop('prestige', axis = 1)

In [8]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [9]:
# TODO
df2 = df[(df['prestige_2.0']==0.0) & (df['prestige_3.0']==0.0) & (df['prestige_4.0']==0.0)]

In [10]:
df3 = df2.groupby(['admit'])
df3['admit'].value_counts()

admit  admit
0      0        28
1      1        33
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [11]:
# TODO
33.0/28.0

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [12]:
# TODO
two_four= df[(df['prestige_2.0']==1.0) | (df['prestige_3.0']==1.0) | (df['prestige_4.0']==1.0)]

In [13]:
two_four2 = two_four.groupby(['admit'])

In [14]:
two_four2['admit'].value_counts()

admit  admit
0      0        243
1      1         93
Name: admit, dtype: int64

In [15]:
93.0/243.0

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [16]:
# TODO
(33.0/28.0)/(93.0/243.0)

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants from the most prestigious (tier-1) undergraduate schools are about 3.1 times the odds for applicants from all other schools (about 1.18 vs. 0.38). In probability terms, 33 of 61 tier-1 applicants (~54%) were admitted, compared with 93 of 336 (~28%) of the others.
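
An odds ratio compares two odds; it is not itself a probability. As a minimal sketch using only the tier-1 counts from Question 6 (the variable names are illustrative), the block below shows how odds and probability relate.

```python
# Counts for tier-1 applicants, taken from the frequency table in Question 6.
admitted, rejected = 33.0, 28.0

odds = admitted / rejected                     # admitted per rejected applicant
probability = admitted / (admitted + rejected) # share of tier-1 applicants admitted

# The two scales are linked by probability = odds / (1 + odds).
assert abs(probability - odds / (1 + odds)) < 1e-12

print(round(odds, 3))
print(round(probability, 3))
```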

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [17]:
# TODO
12.0/55.0 # Odds of being admitted to graduate school for least prestigious

0.21818181818181817

In [18]:
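# Odds ratio: tier-4 odds divided by the odds for all other tiers
# (81 + 33 = 114 admitted, 188 + 28 = 216 not admitted)
round((12.0/55.0)/((81.0+33.0)/(188.0+28.0)), 4)

0.4134

Answer: The odds of admission for applicants from the least prestigious undergraduate schools are about 0.22 (12 admitted for every 55 not admitted). That is about 0.41 times the odds for applicants from all other schools, i.e. roughly 59% lower.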
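
As a cross-check on the hand calculations, the sketch below recomputes the odds ratio of each tier against all other tiers pooled, using the counts from the Question 1 table. The helper `odds_ratio_vs_rest` is hypothetical and written only for this example; note that the comparison group's odds are admitted divided by not admitted (for tier 4, that is 114/216).

```python
# Admitted / not-admitted counts per prestige tier, copied from the Question 1 table.
counts = {
    1: {'admitted': 33.0, 'rejected': 28.0},
    2: {'admitted': 53.0, 'rejected': 95.0},
    3: {'admitted': 28.0, 'rejected': 93.0},
    4: {'admitted': 12.0, 'rejected': 55.0},
}

def odds_ratio_vs_rest(counts, tier):
    """Odds of admission for `tier` divided by the odds for all other tiers pooled."""
    own = counts[tier]
    rest_admitted = sum(c['admitted'] for t, c in counts.items() if t != tier)
    rest_rejected = sum(c['rejected'] for t, c in counts.items() if t != tier)
    own_odds = own['admitted'] / own['rejected']
    rest_odds = rest_admitted / rest_rejected
    return own_odds / rest_odds

for tier in sorted(counts):
    print(tier, round(odds_ratio_vs_rest(counts, tier), 3))
```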

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [19]:
# TODO
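# X holds only gre, gpa, and the three prestige dummies; no constant column is added,
# so the model below is fit without an intercept.
X, y = df.drop('admit', axis = 1), df['admit']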

In [20]:
results = smf.Logit(y, X).fit()

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [21]:
# TODO
print(results.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      392
Method:                           MLE   Df Model:                            4
Date:                Mon, 16 Jan 2017   Pseudo R-squ.:                 0.05722
Time:                        19:08:19   Log-Likelihood:                -233.88
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.039e-05
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
gre              0.0014      0.001      1.308      0.191        -0.001     0.003
gpa             -0.1323      0.195     -0.680      0.497        -0.514     0.249
prestige_2.0    -0.9562      0.302     -3.17

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [22]:
params = np.exp(results.params) # Odds ratios
conf = results.conf_int() # 95% conf interval, on the log-odds scale (not exponentiated)
conf['Odds Ratios'] = params
conf.columns = ['2.5%','97.5%', 'Odds Ratios']

In [23]:
conf

Unnamed: 0,2.5%,97.5%,Odds Ratios
gre,-0.00068,0.003414,1.001368
gpa,-0.513657,0.249045,0.876073
prestige_2.0,-1.547279,-0.365166,0.384342
prestige_3.0,-2.188769,-0.88623,0.214918
prestige_4.0,-2.656743,-1.083112,0.154135
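
The bounds in this table are on the log-odds scale, while the last column is an odds ratio. The sketch below, which copies the bounds from the table above into a stand-in frame, exponentiates them so that all three columns are comparable. With the fitted model, `np.exp(results.conf_int())` should give the same numbers.

```python
import numpy as np
import pandas as pd

# Log-odds confidence bounds copied from the table above.
log_ci = pd.DataFrame(
    {'2.5%': [-0.00068, -0.513657, -1.547279, -2.188769, -2.656743],
     '97.5%': [0.003414, 0.249045, -0.365166, -0.88623, -1.083112]},
    index=['gre', 'gpa', 'prestige_2.0', 'prestige_3.0', 'prestige_4.0'],
)

# Exponentiating each bound moves the interval onto the odds-ratio scale.
or_ci = np.exp(log_ci)

# An interval that contains 1 means "no effect" cannot be ruled out at the 5% level.
or_ci['contains 1'] = (or_ci['2.5%'] < 1) & (or_ci['97.5%'] > 1)
print(or_ci.round(3))
```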


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding `gre` and `gpa` constant, the odds of admission for an applicant from a prestige = 2 undergraduate school are about 0.38 times the odds for an applicant from a prestige = 1 school, i.e. roughly 62% lower.

> ### Question 16.  Interpret the odds ratio of `gpa`.

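Answer: Holding the other features constant, a one-point increase in `gpa` multiplies the odds of admission by about 0.88, i.e. roughly a 12% reduction. The 95% confidence interval for the coefficient spans zero (p = 0.497), so in this model the effect is not statistically distinguishable from no effect.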
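
An odds ratio is a multiplier on the odds, so the percent change it implies is `(odds ratio - 1) * 100`. The sketch below applies that to two odds ratios copied from the Question 14 table; the helper name is illustrative.

```python
# Odds ratios copied from the Question 14 table.
odds_ratios = {'gpa': 0.876073, 'prestige_2.0': 0.384342}

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio (negative means a decrease)."""
    return (odds_ratio - 1.0) * 100.0

for name in sorted(odds_ratios):
    print(name, round(percent_change_in_odds(odds_ratios[name]), 1))
```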

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [24]:
# TODO
# Select existing applicants with GRE 800 and GPA 4.0 in each tier.
# .copy() gives independent frames, so adding a column below does not
# trigger pandas' SettingWithCopyWarning.
pres_2 = X[(X['gre']==800) & (X['gpa']==4.0) & (X['prestige_2.0']==1.0)].copy()
pres_3 = X[(X['gre']==800) & (X['gpa']==4.0) & (X['prestige_3.0']==1.0)].copy()
pres_4 = X[(X['gre']==800) & (X['gpa']==4.0) & (X['prestige_4.0']==1.0)].copy()
pres_1 = X[(X['gre']==800) & (X['gpa']==4.0) & (X['prestige_2.0']==0.0) &
           (X['prestige_3.0']==0.0) & (X['prestige_4.0']==0.0)].copy()

In [25]:
pres_1['admit_prob'] = results.predict(pres_1)
pres_2['admit_prob'] = results.predict(pres_2)
pres_3['admit_prob'] = results.predict(pres_3)
pres_4['admit_prob'] = results.predict(pres_4)

In [26]:
case_study = pd.concat([pres_1, pres_2, pres_3, pres_4])

In [27]:
case_study

Unnamed: 0,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0,admit_prob
2,800.0,4.0,0.0,0.0,0.0,0.637399
377,800.0,4.0,1.0,0.0,0.0,0.403204
33,800.0,4.0,0.0,1.0,0.0,0.274202
77,800.0,4.0,0.0,1.0,0.0,0.274202
10,800.0,4.0,0.0,0.0,1.0,0.213184


Answer:
* If prestige = 1, probability is ~64%
* If prestige = 2, probability is ~40%
* If prestige = 3, probability is ~27%
* If prestige = 4, probability is ~21%
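
These probabilities can also be reproduced by hand with the logistic function, without relying on matching rows existing in the data. The sketch below is an approximation under two assumptions: the coefficients are recovered as the log of the rounded odds ratios in the Question 14 table, and no intercept is used because the fitted model lists none. The `applicants` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Coefficients recovered as the log of the odds ratios in the Question 14 table.
coef = np.log(pd.Series({
    'gre': 1.001368, 'gpa': 0.876073,
    'prestige_2.0': 0.384342, 'prestige_3.0': 0.214918, 'prestige_4.0': 0.154135,
}))

# Hypothetical applicants: GRE 800, GPA 4.0, one row per prestige tier.
applicants = pd.DataFrame(
    {'gre': 800.0, 'gpa': 4.0,
     'prestige_2.0': [0, 1, 0, 0],
     'prestige_3.0': [0, 0, 1, 0],
     'prestige_4.0': [0, 0, 0, 1]},
    index=['tier 1', 'tier 2', 'tier 3', 'tier 4'],
)

# Linear predictor (log-odds), then the logistic function to get probabilities.
log_odds = applicants[coef.index].dot(coef)
admit_prob = 1.0 / (1.0 + np.exp(-log_odds))
print(admit_prob.round(2))
```

Small differences from the `results.predict` values in the third decimal place are expected, since the odds ratios used here are rounded.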

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [34]:
# TODO
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(C= 20) # the prompt specifies C = 10 ** 2; the outputs below were produced with C = 20
logit.fit(X, y)

LogisticRegression(C=20, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [60]:
logit.coef_

array([[ 0.00215767,  0.6788388 , -0.65055814, -1.28611617, -1.56314553]])

In [45]:
coeff = np.exp(logit.coef_)
coeff

array([[ 1.00216   ,  1.971587  ,  0.52175448,  0.27634197,  0.20947612]])

In [58]:
# TODO
sum_tab = pd.DataFrame(coeff)
sum_tab.columns = 'gre', 'gpa', 'prestige_2.0', 'prestige_3.0', 'prestige_4.0'
sum_tab

Unnamed: 0,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0
0,1.00216,1.971587,0.521754,0.276342,0.209476


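Answer: The biggest difference between the two models' odds ratios is for `gpa`: statsmodels estimates an odds ratio of about 0.88 (a higher GPA lowers the odds of admission), while sklearn's model with C=20 estimates about 1.97 (a higher GPA raises them). The `gre` odds ratios are nearly identical (about 1.001 vs. 1.002), and the prestige odds ratios point the same way in both models, though sklearn's are closer to 1 (0.52, 0.28, 0.21 vs. 0.38, 0.21, 0.15). A likely reason for the gap is that sklearn fits an intercept by default (`fit_intercept=True`) and applies L2 regularization, whereas the statsmodels model above was fit on `X` without a constant term.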
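
To make the comparison concrete, the sketch below places the two sets of odds ratios side by side and flags any feature whose direction of effect differs. The values are copied from the Question 14 and Question 19 outputs; the frame and column names are illustrative.

```python
import pandas as pd

# Odds ratios copied from the two outputs above (Question 14 and Question 19).
comparison = pd.DataFrame(
    {'statsmodels': [1.001368, 0.876073, 0.384342, 0.214918, 0.154135],
     'sklearn':     [1.002160, 1.971587, 0.521754, 0.276342, 0.209476]},
    index=['gre', 'gpa', 'prestige_2.0', 'prestige_3.0', 'prestige_4.0'],
)

# Flag features whose direction differs (one ratio below 1, the other above).
comparison['direction differs'] = (
    (comparison['statsmodels'] < 1) != (comparison['sklearn'] < 1)
)
print(comparison)
```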

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [31]:
# TODO 
p_2 = X[(X['gre']==800) & (X['gpa']==4.0) & (X['prestige_2.0']==1.0)]
p_3 = X[(X['gre']==800) & (X['gpa']==4.0) & (X['prestige_3.0']==1.0)]
p_4 = X[(X['gre']==800) & (X['gpa']==4.0) & (X['prestige_4.0']==1.0)]
p_1 = X[(X['gre']==800) & (X['gpa']==4.0) & (X['prestige_2.0']==0.0) & 
           (X['prestige_3.0']==0.0) & (X['prestige_4.0']==0.0)]

In [32]:
case1 = logit.predict_proba(p_1)
case2 = logit.predict_proba(p_2)
case3 = logit.predict_proba(p_3)
case4 = logit.predict_proba(p_4)

In [33]:
# Each row is [P(admit = 0), P(admit = 1)]
print(case1)
print(case2)
print(case3)
print(case4)

[[ 0.28814605  0.71185395]]
[[ 0.43153702  0.56846298]]
[[ 0.58608936  0.41391064]
 [ 0.58608936  0.41391064]]
[[ 0.66024514  0.33975486]]


Answer: 
* If prestige = 1, probability is ~71%
* If prestige = 2, probability is ~57%
* If prestige = 3, probability is ~41%
* If prestige = 4, probability is ~34%