# DS-SF-25 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [21]:
import os
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import sklearn
from sklearn import linear_model, preprocessing

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

In [22]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [23]:
df.prestige.value_counts()

2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [24]:
df_dummies=pd.get_dummies(df.prestige)
df['prestige_1']=df_dummies[df_dummies.columns[0]]
df['prestige_2']=df_dummies[df_dummies.columns[1]]
df['prestige_3']=df_dummies[df_dummies.columns[2]]
df['prestige_4']=df_dummies[df_dummies.columns[2]]
df

Unnamed: 0,admit,gre,gpa,prestige,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,3.0,0.0,0.0,1.0,1.0
1,1,660.0,3.67,3.0,0.0,0.0,1.0,1.0
2,1,800.0,4.00,1.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,4.0,0.0,0.0,0.0,0.0
4,0,520.0,2.93,4.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2.0,0.0,1.0,0.0,0.0
396,0,560.0,3.04,3.0,0.0,0.0,1.0,1.0
397,0,460.0,2.63,2.0,0.0,1.0,0.0,0.0
398,0,700.0,3.65,2.0,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: three!

> ### Question 4.  Why are we doing this?

Answer: because prestige is an ordinal value, and so we should treat it as a categorical variable

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [25]:
#df.drop(['prestige'],axis=1,inplace=True)

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [26]:
print df[df.prestige_1==1].admit.sum()
#print df[df.prestige_1==1].prestige.sum()
print df.admit.sum()
print df[df.prestige_1!=1].admit.sum()
#print df[df.prestige_1!=1].prestige.sum()

33
126
93


In [27]:
pd.crosstab(df.prestige, df.admit, margins=True)

admit,0,1,All
prestige,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,28,33,61
2.0,95,53,148
3.0,93,28,121
4.0,55,12,67
All,271,126,397


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [28]:
33.0/61

0.5409836065573771

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [29]:
93.0/927

0.10032362459546926

> ### Question 9.  Finally, what's the odds ratio?

In [30]:
(33.0/61)/(93.0/927)

5.392384981491275

> ### Question 10.  Write this finding in a sentenance.

Answer: The odds of admission for someone in the top 1 school is 5 time greater than someone who is not in a top school

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentenance.

In [31]:
pd.crosstab(df.prestige, df.admit, margins=True)

admit,0,1,All
prestige,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,28,33,61
2.0,95,53,148
3.0,93,28,121
4.0,55,12,67
All,271,126,397


In [32]:
odds=(12.0/67)/((126-12.0)/(397.0-67))
odds

0.5184603299293008

Answer: The odds of someone in the lowest tier school is half that of someone in a higher tier school

## Part C. Analysis using `statsmodel`

> ### Question 12.  Fit a logistic regression model prediting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [84]:
df.drop(['prestige','prestige_1'],axis=1,inplace=True)
train_cols = df.columns[1:]

In [85]:
X=df[['gre','gpa','prestige_2','prestige_3','prestige_4']]
y=df['admit']

In [86]:
logit = smf.Logit(df['admit'], df[train_cols]).fit()


Optimization terminated successfully.
         Current function value: 0.620444
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [87]:
logit.summary()

0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,393.0
Method:,MLE,Df Model:,3.0
Date:,"Tue, 27 Sep 2016",Pseudo R-squ.:,0.007097
Time:,18:12:53,Log-Likelihood:,-246.32
converged:,True,LL-Null:,-248.08
,,LLR p-value:,0.318

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
gre,0.0013,0.001,1.323,0.186,-0.001 0.003
gpa,-0.3676,0.182,-2.024,0.043,-0.724 -0.012
prestige_2,-0.1401,0.248,-0.565,0.572,-0.626 0.346
prestige_3,-0.3530,,,,nan nan
prestige_4,-0.3530,,,,nan nan


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [88]:
print np.exp(logit.params)

gre           1.001326
gpa           0.692403
prestige_2    0.869286
prestige_3    0.702563
prestige_4    0.702563
dtype: float64


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: If a school is prestige=2, the odds of being admitted are 1.6 X greater

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer:  every point increase of gpa decreases the odds of being admitte by .69 X

> ### Question 17.  Assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [118]:
print "tier 1", logit.predict([[800,4,0,0,0]])
print "tier 3", logit.predict([[800,4,1,0,0]])
print "tier 3", logit.predict([[800,4,0,1,0]])
print "tier 3", logit.predict([[800,4,0,0,1]])

tier 1 [ 0.39879845]
tier 3 [ 0.36573509]
tier 3 [ 0.31788805]
tier 3 [ 0.31788801]


Answer:

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  When creating the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [89]:
model=linear_model.LogisticRegression(C=10**2).\
    fit(X, y)
    
print model.coef_
print model.intercept_
print model.score(X, y)

[[ 0.00241547  0.8281105   0.02894509 -0.31669404 -0.31669404]]
[-4.87207503]
0.67758186398


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [91]:
print np.exp(model.coef_)

[[ 1.00241839  2.2889896   1.02936807  0.72855363  0.72855363]]


Answer:

> ### Question 20.  Again assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [116]:
print "tier 1", model.predict([[800,4,0,0,0]])
print "tier 3", model.predict([[800,4,1,0,0]])
print "tier 3", model.predict([[800,4,0,1,0]])
print "tier 3", model.predict([[800,4,0,0,1]])

tier 1 [1]
tier 3 [1]
tier 3 [1]
tier 3 [1]


Answer: