# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

import matplotlib.pyplot as plt
%matplotlib inline


In [2]:
pwd

u'C:\\Users\\jeana\\Documents\\jeana-curro-portfolio'

In [3]:
df = pd.read_csv(os.path.join('..', 'DAT-NYC-45', 'unit-project','dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


In [4]:
df.describe()

Unnamed: 0,admit,gre,gpa,prestige
count,397.0,397.0,397.0,397.0
mean,0.31738,587.858942,3.392242,2.488665
std,0.466044,115.717787,0.380208,0.947083
min,0.0,220.0,2.26,1.0
25%,0.0,520.0,3.13,2.0
50%,0.0,580.0,3.4,2.0
75%,1.0,660.0,3.67,3.0
max,1.0,800.0,4.0,4.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [5]:
pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])

admit \ prestige,1.0,2.0,3.0,4.0
0,28,95,93,55
1,33,53,28,12
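
The counts in this frequency table can also be read as admission rates per tier. The following is only a minimal sketch: the counts are copied by hand from the table above, and the dictionary names are illustrative rather than part of the original notebook.

```python
# Counts copied from the admit-by-prestige frequency table above.
admitted = {1: 33, 2: 53, 3: 28, 4: 12}
rejected = {1: 28, 2: 95, 3: 93, 4: 55}

# Admission rate = admitted / (admitted + rejected) for each prestige tier.
rates = {}
for tier in admitted:
    total = admitted[tier] + rejected[tier]
    rates[tier] = float(admitted[tier]) / total

for tier in sorted(rates):
    print("prestige %d: %.3f" % (tier, rates[tier]))
```

The rates fall steadily from tier 1 to tier 4, which is the pattern the odds ratios in Part C quantify.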


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [6]:
pd.get_dummies(df['prestige'],prefix='prestige')

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

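Answer: Three.  One level serves as the reference category: if an applicant's prestige is not '2', '3', or '4', it must be '1'.  Including all four binary variables alongside an intercept would make the columns perfectly collinear.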

> ### Question 4.  Why are we doing this?

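Answer:  `prestige` is a categorical rank, not a quantity.  Converting it into binary indicator variables lets the model estimate a separate effect for each tier instead of assuming that every one-step change in rank has the same effect.  It also gives us purely numeric columns to pass to statsmodels and scikit-learn.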
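
Keeping only three of the four binary variables can also be done directly in pandas. This is a small sketch on a toy series standing in for the `prestige` column (the values are made up for illustration); `drop_first=True` omits the first level, which then acts as the reference category.

```python
import pandas as pd

# Toy stand-in for the prestige column (the real notebook reads it from the CSV).
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 4.0, 2.0], name='prestige')

# drop_first=True omits the first level (1.0), which becomes the reference category.
dummies = pd.get_dummies(prestige, prefix='prestige', drop_first=True)
print(list(dummies.columns))
```

This notebook instead keeps all four columns in `df2` and drops `prestige_1.0` later, when building `X`; both routes should lead to the same design matrix.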

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [7]:
df2=pd.concat([df, pd.get_dummies(df['prestige'],prefix='prestige')], axis=1)
df2=df2.drop('prestige',axis=1)
df2.head()


Unnamed: 0,admit,gre,gpa,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [8]:
pd.crosstab(df2['admit'], df2['prestige_1.0'], rownames=['admit'])

admit \ prestige_1.0,0.0,1.0
0,243,28
1,93,33


Read the table as follows: among applicants with `prestige = 1.0`, 33 were admitted and 28 were not.  

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [9]:
33.0/(28)

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [10]:
93.0/(243)

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [11]:
(33.0/28)/(93.0/243)

3.079493087557604

The odds ratio is roughly 3:1.

> ### Question 10.  Write this finding in a sentence.

Answer: In this data, the odds of admission to the UCLA graduate program for applicants from the most prestigious colleges were roughly three times the odds for applicants from all other colleges.  
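
The hand calculation can be wrapped in two small helper functions. This is a sketch only; `odds` and `odds_ratio` are hypothetical names introduced here, and the counts are the ones from the frequency tables in this part.

```python
def odds(successes, failures):
    """Odds = number of successes divided by number of failures."""
    return float(successes) / failures


def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds of group A divided by odds of group B."""
    return odds(a_yes, a_no) / odds(b_yes, b_no)


# Tier-1 applicants (33 admitted, 28 not) vs. everyone else (93 admitted, 243 not).
tier1_vs_rest = odds_ratio(33, 28, 93, 243)
print(round(tier1_vs_rest, 4))
```

The same function can be reused for any other tier by passing that tier's counts as group A and the pooled counts of the remaining tiers as group B.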

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [12]:
pd.crosstab(df2['admit'], df2['prestige_4.0'], rownames=['admit'])

admit \ prestige_4.0,0.0,1.0
0,216,55
1,114,12


In [13]:
# odds of being admitted if from a prestige_4.0 school
12.0/55

0.21818181818181817

In [14]:
# odds of being admitted for everyone else
114.0/216

0.5277777777777778

In [15]:
(12.0/55)/(114.0/216)

0.4133971291866028

Answer: For students who attended the least prestigious undergraduate schools, the odds of being admitted are only about 41% of the odds for students who attended all other schools.  
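
Odds and probabilities are easy to confuse, so it may help to convert between them explicitly. A minimal sketch, with hypothetical helper names, using the tier-4 counts from the table above:

```python
def odds_to_probability(odds):
    """Convert odds (p / (1 - p)) back to a probability p."""
    return odds / (1.0 + odds)


def probability_to_odds(p):
    """Convert a probability p to odds p / (1 - p)."""
    return p / (1.0 - p)


# Tier-4 applicants: 12 admitted, 55 not admitted.
tier4_odds = 12.0 / 55
tier4_prob = odds_to_probability(tier4_odds)
print("odds: %.3f, probability: %.3f" % (tier4_odds, tier4_prob))
```

The probability equals 12 out of the 67 tier-4 applicants, which is noticeably smaller than the odds of 12 to 55 would suggest if read as a probability.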

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [16]:
df2.head()

Unnamed: 0,admit,gre,gpa,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0


In [17]:
X=df2.drop(['admit','prestige_1.0'],axis=1)
y=df['admit']

X.head()

Unnamed: 0,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0
0,380.0,3.61,0.0,1.0,0.0
1,660.0,3.67,0.0,1.0,0.0
2,800.0,4.0,0.0,0.0,0.0
3,640.0,3.19,0.0,0.0,1.0
4,520.0,2.93,0.0,0.0,1.0


In [18]:
X['intercept']=1

In [19]:
X.head()

Unnamed: 0,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0,intercept
0,380.0,3.61,0.0,1.0,0.0,1
1,660.0,3.67,0.0,1.0,0.0,1
2,800.0,4.0,0.0,0.0,0.0,1
3,640.0,3.19,0.0,0.0,1.0,1
4,520.0,2.93,0.0,0.0,1.0,1


In [20]:
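# the array-based Logit class lives in statsmodels.api; newer statsmodels releases may not expose it via statsmodels.formula.api
import statsmodels.api as sm
logreg=sm.Logit(y,X)
result=logreg.fit()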

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [21]:
result.summary()

Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 16 Jan 2017",Pseudo R-squ.:,0.08166
Time:,15:09:10,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
gre,0.0022,0.001,2.028,0.043,7.44e-05 0.004
gpa,0.7793,0.333,2.344,0.019,0.128 1.431
prestige_2.0,-0.6801,0.317,-2.146,0.032,-1.301 -0.059
prestige_3.0,-1.3387,0.345,-3.882,0.000,-2.015 -0.663
prestige_4.0,-1.5534,0.417,-3.721,0.000,-2.372 -0.735
intercept,-3.8769,1.142,-3.393,0.001,-6.116 -1.638


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [22]:
result.params
#these are the coefficients

gre             0.002218
gpa             0.779337
prestige_2.0   -0.680137
prestige_3.0   -1.338677
prestige_4.0   -1.553411
intercept      -3.876854
dtype: float64

In [23]:
result.conf_int()
#these are the 95% confidence intervals for the coefficients (log-odds scale); exponentiate them to get intervals for the odds ratios

Unnamed: 0,0,1
gre,7.4e-05,0.004362
gpa,0.127619,1.431056
prestige_2.0,-1.301337,-0.058936
prestige_3.0,-2.014579,-0.662776
prestige_4.0,-2.371624,-0.735197
intercept,-6.116077,-1.637631


In [24]:
np.exp(result.params)
#these are the odds ratios

gre             1.002221
gpa             2.180027
prestige_2.0    0.506548
prestige_3.0    0.262192
prestige_4.0    0.211525
intercept       0.020716
dtype: float64
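
Question 14 also asks for the confidence intervals of the odds ratios, which follow from exponentiating the interval endpoints in the same way as the coefficients. The sketch below re-enters the rounded values printed above by hand rather than reading them from `result`, so the last digits may differ slightly from what `np.exp(result.conf_int())` would return in the notebook.

```python
import math

# (coefficient, CI lower, CI upper) on the log-odds scale, copied from the outputs above.
estimates = {
    'gre': (0.002218, 0.000074, 0.004362),
    'gpa': (0.779337, 0.127619, 1.431056),
    'prestige_2.0': (-0.680137, -1.301337, -0.058936),
    'prestige_3.0': (-1.338677, -2.014579, -0.662776),
    'prestige_4.0': (-1.553411, -2.371624, -0.735197),
}

# Exponentiate each value to move from the log-odds scale to the odds-ratio scale.
odds_ratios = {}
for name, values in estimates.items():
    odds_ratios[name] = tuple(math.exp(v) for v in values)

for name in sorted(odds_ratios):
    ratio, lower, upper = odds_ratios[name]
    print("%-13s OR=%.3f  95%% CI=(%.3f, %.3f)" % (name, ratio, lower, upper))
```

An interval that excludes 1 on the odds-ratio scale corresponds to an interval that excludes 0 on the coefficient scale.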

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding `gre` and `gpa` constant, we would expect the odds of admission for applicants with a prestige score of 2.0 to be about 51% of the odds for applicants with a prestige score of 1.0.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding the other features constant, we would expect the odds of being admitted to more than double (increase by 118%) for every one-point increase in GPA.  
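
A full one-point change in GPA is large, and a one-point change in GRE is tiny, so it can be more informative to rescale the odds ratios to realistic increments. A small sketch using the rounded coefficients printed above; `odds_multiplier` is a hypothetical helper name.

```python
import math


def odds_multiplier(coefficient, change):
    """Factor by which the odds are multiplied when a feature changes by `change` units."""
    return math.exp(coefficient * change)


gpa_coef = 0.779337  # from result.params above
gre_coef = 0.002218  # from result.params above

print("GPA +1.0 : x%.3f" % odds_multiplier(gpa_coef, 1.0))
print("GPA +0.1 : x%.3f" % odds_multiplier(gpa_coef, 0.1))
print("GRE +100 : x%.3f" % odds_multiplier(gre_coef, 100))
```

The multiplier compounds rather than adds: ten increases of 0.1 in GPA give the same factor as a single increase of 1.0.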

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [25]:
#probability if from tier 1.0
result.predict((800,4,0,0,0,1))

array([ 0.73403998])

In [26]:
#probability if from tier 2.0
result.predict((800,4,1,0,0,1))

array([ 0.58299512])

In [27]:
#probability if from tier 3.0
result.predict((800,4,0,1,0,1))

array([ 0.41983282])

In [28]:
#probability if from tier 4.0
result.predict((800,4,0,0,1,1))

array([ 0.36860803])
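
These predictions can be reproduced by hand, which makes clear what `predict` does: it sums the coefficients times the feature values and passes the result through the logistic function. The sketch below uses the rounded coefficients printed earlier, so it should agree with the outputs above only to about three decimal places; `admit_probability` is a hypothetical helper name.

```python
import math

# Coefficients copied (rounded) from result.params above.
coef = {
    'gre': 0.002218,
    'gpa': 0.779337,
    'prestige_2.0': -0.680137,
    'prestige_3.0': -1.338677,
    'prestige_4.0': -1.553411,
    'intercept': -3.876854,
}


def admit_probability(gre, gpa, tier):
    """Logistic-regression prediction: sigmoid of the linear combination."""
    z = coef['intercept'] + coef['gre'] * gre + coef['gpa'] * gpa
    if tier != 1:  # tier 1 is the reference category and adds nothing
        z += coef['prestige_%d.0' % tier]
    return 1.0 / (1.0 + math.exp(-z))


for tier in (1, 2, 3, 4):
    print("tier %d: %.4f" % (tier, admit_probability(800, 4.0, tier)))
```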

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [29]:
from sklearn.linear_model import LogisticRegression

In [30]:
df2.head()

Unnamed: 0,admit,gre,gpa,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0


In [31]:
X=df2.drop(['admit','prestige_1.0'],axis=1)
y=df['admit']

X.head()

Unnamed: 0,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0
0,380.0,3.61,0.0,1.0,0.0
1,660.0,3.67,0.0,1.0,0.0
2,800.0,4.0,0.0,0.0,0.0
3,640.0,3.19,0.0,0.0,1.0
4,520.0,2.93,0.0,0.0,1.0


In [32]:
logreg = LogisticRegression(C=10**2)
logreg.fit(X,y)
logreg.score(X,y)

0.70528967254408059

In [33]:
# quick check: predict_proba expects a 2D array with one row per applicant
logreg.predict_proba(np.array([[800,4.0,0,0,0]]))



array([[ 0.28814605,  0.71185395]])

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [34]:
logreg.coef_
#for coefficients

array([[ 0.00215822,  0.67315495, -0.62882239, -1.25222745, -1.56879212]])

In [35]:
np.exp(logreg.coef_)
#these are the odds ratios from scikit learn

array([[ 1.00216055,  1.96041259,  0.53321936,  0.28586733,  0.20829663]])

In [36]:
np.exp(result.params)
#these are the odds ratios from statsmodels from before


gre             1.002221
gpa             2.180027
prestige_2.0    0.506548
prestige_3.0    0.262192
prestige_4.0    0.211525
intercept       0.020716
dtype: float64

In [37]:
#doing this to show odds ratios side by side (rows: gre, gpa, prestige_2.0, prestige_3.0, prestige_4.0)
OddsRat=pd.DataFrame({'sklearn':[1.00216055,  1.96041259,  0.53321936,  0.28586733,  0.20829663],'statsmodels':[1.00221, 2.180027, 0.506548, 0.262192, 0.211525]})
OddsRat

Unnamed: 0,sklearn,statsmodels
0,1.002161,1.00221
1,1.960413,2.180027
2,0.533219,0.506548
3,0.285867,0.262192
4,0.208297,0.211525


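Answer: They are close but not identical.  All five odds ratios agree to within roughly 10%; `gpa` differs the most (1.96 vs. 2.18).  One likely reason is that scikit-learn's `LogisticRegression` applies L2 regularization by default, even with the weak penalty `C = 10 ** 2`, while the statsmodels fit is unpenalized.  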
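
To put a number on how close the two sets of odds ratios are, the relative differences can be computed directly. This sketch re-enters the values printed above by hand; the variable names are illustrative.

```python
features = ['gre', 'gpa', 'prestige_2.0', 'prestige_3.0', 'prestige_4.0']
sklearn_or = [1.00216055, 1.96041259, 0.53321936, 0.28586733, 0.20829663]
statsmodels_or = [1.002221, 2.180027, 0.506548, 0.262192, 0.211525]

# Relative difference of the sklearn odds ratio with respect to the statsmodels one.
rel_diff = {}
for name, s, m in zip(features, sklearn_or, statsmodels_or):
    rel_diff[name] = (s - m) / m

for name in features:
    print("%-13s %+.1f%%" % (name, 100 * rel_diff[name]))

# Feature whose odds ratios disagree the most between the two libraries.
largest = max(rel_diff, key=lambda name: abs(rel_diff[name]))
print("largest gap:", largest)
```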

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [38]:
#probability if from tier 1.0
logreg.predict_proba([[800,4.0,0,0,0]])



array([[ 0.28814605,  0.71185395]])

In [39]:
#probability if from tier 2.0
logreg.predict_proba([[800,4.0,1,0,0]])



array([[ 0.43153702,  0.56846298]])

In [40]:
#probability if from tier 3.0
logreg.predict_proba([[800,4.0,0,1,0]])



array([[ 0.58608936,  0.41391064]])

In [41]:
#probability if from tier 4.0
logreg.predict_proba([[800,4.0,0,0,1]])



array([[ 0.66024514,  0.33975486]])

Answer:  These probabilities of admission are very similar to those found using statsmodels (within about three percentage points for every tier); however, they seem low in both cases.  
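
The gap between the two libraries' predictions can be tabulated per tier. The sketch below copies the predicted probabilities from the cell outputs in Parts C and D by hand; the dictionary names are illustrative.

```python
# Predicted admission probabilities for GRE = 800, GPA = 4.0, copied from the outputs above.
statsmodels_prob = {1: 0.73403998, 2: 0.58299512, 3: 0.41983282, 4: 0.36860803}
sklearn_prob = {1: 0.71185395, 2: 0.56846298, 3: 0.41391064, 4: 0.33975486}

# Difference in percentage points (statsmodels minus sklearn) for each tier.
gaps = {}
for tier in statsmodels_prob:
    gaps[tier] = 100 * (statsmodels_prob[tier] - sklearn_prob[tier])

for tier in sorted(gaps):
    print("tier %d: %+.2f percentage points" % (tier, gaps[tier]))
```

For context, the overall admission rate in the data is about 32% (the mean of `admit` in `df.describe()`), so every tier of this hypothetical applicant is predicted to do better than the average applicant.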