# Project 3

In this project, you will perform a logistic regression on the admissions data we've been working with in projects 1 and 2.

In [187]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
import seaborn as sns

In [188]:
df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna() 
print df.head()

   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0


## Part 1. Frequency Tables

#### 1. Let's create a frequency table of our variables relative to whether someone got admitted or not. Think in terms of: for a given prestige level, how many people were admitted and how many were not?

In [189]:
# frequency table for prestige and whether or not someone was admitted
#answers below: 
#option #1: pd.crosstab(df['admit'], df['prestige']) -->also worked
#option #2: pd.pivot_table(df[['admit','prestige']], index=['admit'], columns=['prestige'],aggfunc=len)
#option #3: 
df.groupby(['prestige','admit']).size().unstack('prestige')

prestige  1.0  2.0  3.0  4.0
admit
0          28   95   93   55
1          33   53   28   12


## Part 2. Return of dummy variables

#### 2.1 Create class or dummy variables for prestige 

In [190]:
#df2 = df.join(pd.get_dummies(df['prestige'],prefix="prestige")) ->didn't work well with the preset code
dummy_ranks = pd.get_dummies(df['prestige'],prefix="prestige")

In [191]:
dummy_ranks.head()

   prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0             0             0             1             0
1             0             0             1             0
2             1             0             0             0
3             0             0             0             1
4             0             0             0             1


#### 2.2 When modeling our class variables, how many do we need? 



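Answer: We only need n-1 dummy variables for a categorical variable with n levels (here, 3 dummies for the 4 prestige levels). If we include all n alongside an intercept, the dummy columns sum to the intercept column and the design matrix is perfectly collinear.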
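
As a rough illustration of that collinearity, the sketch below builds a design matrix from a small made-up prestige column (not the admissions data) and compares its rank with and without the first dummy. The array names are ours and purely illustrative.

```python
import numpy as np

# Hypothetical prestige levels for six applicants (illustrative only).
prestige = np.array([1, 2, 3, 4, 2, 3])
levels = np.array([1, 2, 3, 4])

# One indicator column per level: shape (6, 4).
dummies = (prestige[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(prestige), 1))

# Intercept + all 4 dummies: the dummies sum to the intercept column.
full = np.hstack([intercept, dummies])
# Intercept + 3 dummies (level 1 dropped as the reference category).
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print("full: %d columns, rank %d" % (full.shape[1], rank_full))
print("reduced: %d columns, rank %d" % (reduced.shape[1], rank_reduced))
```

The full matrix has more columns than its rank, so its coefficients are not uniquely identified; dropping one level makes the column count and rank agree.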

## Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.

In [192]:
cols_to_keep = ['admit', 'gre', 'gpa']
handCalc = df[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_1.0':]) 
print handCalc.head()

   admit    gre   gpa  prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0      0  380.0  3.61             0             0             1             0
1      1  660.0  3.67             0             0             1             0
2      1  800.0  4.00             1             0             0             0
3      1  640.0  3.19             0             0             0             1
4      0  520.0  2.93             0             0             0             1


In [193]:
#crosstab prestige 1 admission 
#frequency table cutting prestige and whether or not someone was admitted
pd.crosstab(handCalc['admit'], handCalc['prestige_1.0'])

prestige_1.0    0   1
admit
0             243  28
1              93  33


#### 3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

In [194]:
odds_table_pres1 = pd.crosstab(handCalc['admit'], handCalc['prestige_1.0']).apply(lambda p: (p/p.sum())/(1-(p/p.sum())), axis=0)
#the odds of being admitted = p / (1-p) where p = (33/(28+33))
odds_table_pres1

prestige_1.0         0         1
admit
0             2.612903  0.848485
1             0.382716  1.178571


Answer: The odds of admission for someone attending a #1 ranked college = 1.178571

#### 3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

In [None]:
#the odds of being admitted here = p / (1-p) where p = (93/(93+243))

Using the same odds_table_pres1 table, we can see that the odds of admission are 0.382716 for someone not attending a #1 ranked college.

#### 3.3 Calculate the odds ratio

In [195]:
#odds ratio = 1.178571 / 0.382716
odds_ratio = odds_table_pres1[1][1] / odds_table_pres1[0][1]
odds_ratio

3.0794930875576041

#### 3.4 Write this finding in a sentence

Answer: The odds of getting admitted to grad school are about 3.08 times as high for someone who attended a #1 ranked college as for someone who did not attend a #1 ranked college.
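
The same arithmetic can be checked without pandas. This is a minimal sketch using the four counts from the prestige_1.0 cross tab above; the helper name `odds` is ours, not part of the project code.

```python
def odds(successes, failures):
    """Odds = p / (1 - p), which simplifies to successes / failures."""
    return float(successes) / failures

# Counts read from the prestige_1.0 cross tab above.
odds_rank1 = odds(33, 28)     # admitted vs. not, attended a #1 ranked college
odds_other = odds(93, 243)    # admitted vs. not, did not attend one
odds_ratio_rank1 = odds_rank1 / odds_other

print("odds (#1 ranked): %.6f" % odds_rank1)
print("odds (other):     %.6f" % odds_other)
print("odds ratio:       %.6f" % odds_ratio_rank1)
```

Swapping in the prestige_4.0 counts (12 and 55 versus 114 and 216) gives the odds ratio for section 3.6 in the same way.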

#### 3.5 Print the cross tab for prestige_4

In [196]:
pd.crosstab(handCalc['admit'], handCalc['prestige_4.0'], rownames=['admit'])

prestige_4.0    0   1
admit
0             216  55
1             114  12


#### 3.6 Calculate the OR 

In [197]:
odds_table_pres4 = pd.crosstab(handCalc['admit'], handCalc['prestige_4.0']).apply(lambda p: (p/p.sum())/(1-(p/p.sum())), axis=0)
odds_table_pres4

prestige_4.0         0         1
admit
0             1.894737  4.583333
1             0.527778  0.218182


In [198]:
odds_table_pres4[1][1] / odds_table_pres4[0][1]

0.41339712918660282

#### 3.7 Write this finding in a sentence

Answer: The odds of getting admitted to grad school are approximately 59% lower for someone who attended a #4 ranked college than for someone who did not attend a #4 ranked college.

## Part 4. Analysis

In [199]:
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_2':])
print data.head()

   admit    gre   gpa  prestige_2.0  prestige_3.0  prestige_4.0
0      0  380.0  3.61             0             1             0
1      1  660.0  3.67             0             1             0
2      1  800.0  4.00             0             0             0
3      1  640.0  3.19             0             0             1
4      0  520.0  2.93             0             0             1


We're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.

In [200]:
# manually add the intercept
data['intercept'] = 1.0

In [201]:
scaled_data = pd.DataFrame(data=scaled_var, columns=['gre_z','gpa_z'])
scaled_data.head()

      gre_z     gpa_z
0 -1.798524  0.573457
1  0.624209  0.731464
2  1.835576  1.600504
3  0.451157 -0.532595
4 -0.587158 -1.217294
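
The cell that defines scaled_var is not shown in this notebook. Assuming it holds ordinary z-scores of gre and gpa (our guess from the column names gre_z and gpa_z), a minimal sketch of that kind of scaling looks like this. It uses only the five rows from df.head(), so the numbers will not match the table above, which was presumably scaled over the full dataset.

```python
import numpy as np

# GRE and GPA values copied from df.head() above (illustrative subset only).
raw = np.array([[380.0, 3.61],
                [660.0, 3.67],
                [800.0, 4.00],
                [640.0, 3.19],
                [520.0, 2.93]])

# z-score each column: subtract the column mean, then divide by the
# column standard deviation (population std, ddof=0, is an assumption).
scaled_sketch = (raw - raw.mean(axis=0)) / raw.std(axis=0)
print(scaled_sketch)
```

After scaling, each column has mean 0 and standard deviation 1, so the coefficients of gre_z and gpa_z would be on a comparable scale.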


In [202]:
data2 = df[['admit']].join(scaled_data).join(dummy_ranks.ix[:, 'prestige_2':])
data2.head()

   admit     gre_z     gpa_z  prestige_2.0  prestige_3.0  prestige_4.0
0      0 -1.798524  0.573457             0             1             0
1      1  0.624209  0.731464             0             1             0
2      1  1.835576  1.600504             0             0             0
3      1  0.451157 -0.532595             0             0             1
4      0 -0.587158 -1.217294             0             0             1


#### 4.1 Set the covariates to a variable called train_cols

In [203]:
train_cols2 = data2.columns[1:] #excluding the admit column
train_cols2

Index([u'gre_z', u'gpa_z', u'prestige_2.0', u'prestige_3.0', u'prestige_4.0'], dtype='object')

#### 4.2 Fit the model

In [204]:
logit = sm.Logit(data2['admit'], data2[train_cols2])
result = logit.fit()
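# this fit fails; a likely cause (not verified here) is that scaled_data has a fresh
# 0..n-1 index while df keeps its original index after dropna(), so the join leaves NaNs in data2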

ValueError: On entry to DLASCL parameter number 5 had an illegal value
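
One plausible cause of this error (an assumption on our part, not something the notebook confirms) is index misalignment: df keeps its original row labels after dropna(), while a DataFrame built from a bare array gets a fresh 0..n-1 index, so joining the two can leave NaN values in the design matrix. The toy frame below is made up to show the effect; passing the original index when building the scaled frame avoids it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing row, mimicking df_raw -> df.dropna().
raw = pd.DataFrame({"admit": [0, 1, 1, 0],
                    "gre": [380.0, np.nan, 800.0, 520.0]})
clean = raw.dropna()                      # index is now [0, 2, 3]

# z-score the remaining values as a plain NumPy array.
z = (clean["gre"].values - clean["gre"].mean()) / clean["gre"].std(ddof=0)

scaled_reset = pd.DataFrame({"gre_z": z})                    # index [0, 1, 2]
scaled_kept = pd.DataFrame({"gre_z": z}, index=clean.index)  # index [0, 2, 3]

misaligned = clean[["admit"]].join(scaled_reset)
aligned = clean[["admit"]].join(scaled_kept)

n_missing_misaligned = int(misaligned["gre_z"].isnull().sum())
n_missing_aligned = int(aligned["gre_z"].isnull().sum())
print("NaNs after misaligned join: %d" % n_missing_misaligned)
print("NaNs after aligned join:    %d" % n_missing_aligned)
```

Note also that train_cols2 has no intercept column, unlike train_cols in the next cell, so the two models would differ even once the NaN values are gone.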

In [205]:
train_cols = data.columns[1:] #excluding the admit column
train_cols
logit = sm.Logit(data['admit'], data[train_cols])
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


#### 4.3 Print the summary results

In [206]:
result.summary()

Dep. Variable:                admit   No. Observations:        397
Model:                        Logit   Df Residuals:            391
Method:                         MLE   Df Model:                  5
Date:              Mon, 08 May 2017   Pseudo R-squ.:       0.08166
Time:                      00:13:41   Log-Likelihood:      -227.82
converged:                     True   LL-Null:             -248.08
                                      LLR p-value:       1.176e-07

                 coef   std err        z    P>|z|    [95.0% Conf. Int.]
gre            0.0022     0.001    2.028    0.043    7.44e-05     0.004
gpa            0.7793     0.333    2.344    0.019       0.128     1.431
prestige_2.0  -0.6801     0.317   -2.146    0.032      -1.301    -0.059
prestige_3.0  -1.3387     0.345   -3.882    0.000      -2.015    -0.663
prestige_4.0  -1.5534     0.417   -3.721    0.000      -2.372    -0.735
intercept     -3.8769     1.142   -3.393    0.001      -6.116    -1.638


#### 4.4 Calculate the odds ratios of the coefficients and their 95% confidence intervals

hint 1: np.exp(X)

hint 2: conf['OR'] = params
        
           conf.columns = ['2.5%', '97.5%', 'OR']

In [207]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
np.exp(conf)

                  2.5%     97.5%        OR
gre           1.000074  1.004372  1.002221
gpa            1.13612  4.183113  2.180027
prestige_2.0  0.272168  0.942767  0.506548
prestige_3.0  0.133377  0.515419  0.262192
prestige_4.0  0.093329  0.479411  0.211525
intercept     0.002207   0.19444  0.020716
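
To make the link between the summary table and this table explicit, here is a small sketch that exponentiates the gpa coefficient and its confidence bounds by hand. The inputs are the rounded values printed in the summary, so the results agree with the table above only to about three decimals.

```python
import math

# Coefficient and 95% CI bounds for gpa, copied (rounded) from the summary table.
coef, ci_low, ci_high = 0.7793, 0.128, 1.431

# Exponentiating moves from the log-odds scale to the odds-ratio scale.
or_gpa = math.exp(coef)
or_low = math.exp(ci_low)
or_high = math.exp(ci_high)
print("OR = %.3f, 95%% CI = (%.3f, %.3f)" % (or_gpa, or_low, or_high))
```

A negative coefficient such as the one for prestige_2.0 exponentiates to a value below 1, which is why those odds ratios read as a decrease in odds.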


#### 4.5 Interpret the OR of Prestige_2

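Answer: Holding GRE and GPA constant, the odds of getting admitted to grad school are approximately 49.4% lower for someone who attended a #2 ranked college than for someone who attended a #1 ranked college (the reference category).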

#### 4.6 Interpret the OR of GPA

Answer: For every one-unit increase in GPA, the odds of being admitted to grad school are multiplied by about 2.18 (equivalently, the log odds increase by about 0.78), holding GRE and prestige constant.

## Part 5: Predicted probabilities


As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases or decreases across different variables. First we're going to generate the combinations using a helper function called cartesian (defined below).

We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values from a specified min and maximum value--in our case just the min/max observed values.
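
As a quick sketch of what np.linspace returns, the snippet below uses the GRE range 220 to 800 (the observed min and max printed further down) and checks the spacing. The variable names are ours.

```python
import numpy as np

# 10 evenly spaced values from 220 to 800, endpoints included.
gre_grid = np.linspace(220, 800, 10)

# With 10 points there are 9 gaps, so the step is (800 - 220) / 9.
step = gre_grid[1] - gre_grid[0]
print(gre_grid)
print("step: %.6f" % step)
```

Both endpoints are included by default, which is why the grid starts and ends exactly at the observed minimum and maximum.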

In [208]:
def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.
    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.
    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.
    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])
    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size  # floor division keeps m an integer on Python 3 as well
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out
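
For comparison, a shorter alternative (a sketch, not the helper this project uses) gets the same rows in the same order from the standard library's itertools.product. The name `cartesian_product` is ours.

```python
import itertools
import numpy as np

def cartesian_product(arrays):
    """Return all combinations of the input sequences as a 2-D array."""
    return np.array(list(itertools.product(*arrays)))

# Same inputs as the docstring example of cartesian above.
grid = cartesian_product([[1, 2, 3], [4, 5], [6, 7]])
print(grid.shape)
print(grid[:3])
```

The recursive version fills a preallocated array in place, which may matter for very large grids; for the 400-row grid built here either approach should be adequate.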

In [232]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print gres
# array([ 220.        ,  284.44444444,  348.88888889,  413.33333333,
#         477.77777778,  542.22222222,  606.66666667,  671.11111111,
#         735.55555556,  800.        ])
gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print gpas
# array([ 2.26      ,  2.45333333,  2.64666667,  2.84      ,  3.03333333,
#         3.22666667,  3.42      ,  3.61333333,  3.80666667,  4.        ])


# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1, 2, 3, 4], [1.]]))

[ 220.          284.44444444  348.88888889  413.33333333  477.77777778
  542.22222222  606.66666667  671.11111111  735.55555556  800.        ]
[ 2.26        2.45333333  2.64666667  2.84        3.03333333  3.22666667
  3.42        3.61333333  3.80666667  4.        ]


#### 5.1 Recreate the dummy variables

In [233]:
combos = combos.dropna()

In [234]:
# recreate the dummy variables
combos.columns = ['gre', 'gpa', 'prestige', 'intercept']
dummy_ranks = pd.get_dummies(combos['prestige'], prefix='prestige')
dummy_ranks.columns = ['prestige_1.0', 'prestige_2.0', 'prestige_3.0', 'prestige_4.0']
# keep only what we need for making predictions
cols_to_keep = ['gre', 'gpa', 'prestige', 'intercept']
combos = combos[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_2.0':])

#### 5.2 Make predictions on the enumerated dataset

In [235]:
combos['admit_pred'] = result.predict(combos[train_cols])

In [244]:
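# note: this join matches observed admit values to grid rows by row label only;
# the grid rows are synthetic, so admit is not a true outcome for these combinations
combos2 = combos.join(df['admit'])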
combos2.tail()

       gre       gpa  prestige  intercept  prestige_2.0  prestige_3.0  prestige_4.0  admit_pred  admit
395  800.0  3.806667       4.0        1.0             0             0             1    0.334286    0.0
396  800.0       4.0       1.0        1.0             0             0             0     0.73404    0.0
397  800.0       4.0       2.0        1.0             1             0             0    0.582995    0.0
398  800.0       4.0       3.0        1.0             0             1             0    0.419833    0.0
399  800.0       4.0       4.0        1.0             0             0             1    0.368608    0.0
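
To see where the admit_pred values come from, the sketch below recomputes the last four predictions by hand with the inverse-logit formula p = 1 / (1 + exp(-xb)). It uses the rounded coefficients from the summary in 4.3, so the results match result.predict only to about two decimals; the function name `predict_admit` is ours.

```python
import math

# Rounded coefficients copied from the summary table in 4.3.
coefs = {"intercept": -3.8769, "gre": 0.0022, "gpa": 0.7793,
         "prestige_2.0": -0.6801, "prestige_3.0": -1.3387,
         "prestige_4.0": -1.5534}

def predict_admit(gre, gpa, prestige):
    """Inverse-logit of the linear predictor; prestige 1 is the baseline."""
    xb = coefs["intercept"] + coefs["gre"] * gre + coefs["gpa"] * gpa
    if prestige != 1:
        xb += coefs["prestige_%d.0" % prestige]
    return 1.0 / (1.0 + math.exp(-xb))

# GRE 800 and GPA 4.0 at each prestige level (rows 396 to 399 above).
preds = [predict_admit(800, 4.0, k) for k in (1, 2, 3, 4)]
for k, p in zip((1, 2, 3, 4), preds):
    print("prestige %d: %.3f" % (k, p))
```

Because prestige 1 is the omitted reference level, its prediction uses only the intercept, gre, and gpa terms.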


#### 5.3 Interpret findings for the last 4 observations

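Answer: The last four observations all have the maximum GRE (800) and GPA (4.0) and differ only in prestige. The predicted probability of admission falls from about 0.73 for a #1 ranked college to about 0.58, 0.42, and 0.37 for #2, #3, and #4 ranked colleges. So GRE and GPA alone are not a sufficient indication of admission: even with top scores, prestige changes the predicted outcome substantially.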

## Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.