# Project 3

In this project, you will perform a logistic regression on the admissions data we've been working with in projects 1 and 2.

In [32]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
import pylab as pl
import numpy as np
import scipy.stats as st

In [3]:
df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna()  # drop rows with any missing values
print(df.head())

   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0


## Part 1. Frequency Tables

#### 1. Let's create a frequency table of our variables

In [4]:
# frequency table for prestige and whether or not someone was admitted
df.groupby(['admit'])['prestige'].value_counts().sort_index()

admit  prestige
0      1.0         28
       2.0         95
       3.0         93
       4.0         55
1      1.0         33
       2.0         53
       3.0         28
       4.0         12
Name: prestige, dtype: int64

## Part 2. Return of dummy variables

#### 2.1 Create class or dummy variables for prestige 

In [5]:
df_dummies = pd.get_dummies(df.prestige,'prestige') 
df_dummies

   prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0             0             0             1             0
1             0             0             1             0
2             1             0             0             0
3             0             0             0             1
4             0             0             0             1
5             0             1             0             0
6             1             0             0             0
7             0             1             0             0
8             0             0             1             0
9             0             1             0             0


#### 2.2 When modeling our class variables, how many do we need? 



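Answer: Since prestige takes 4 values (1, 2, 3, 4), we only need 3 dummy variables. The omitted level serves as the reference category; including all 4 alongside an intercept would make the columns perfectly collinear.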
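
As a rough sketch of this point, `pd.get_dummies` accepts a `drop_first` argument that omits the first level so that it acts as the reference category. The small `prestige` series below is made-up illustration data, not the admissions file, and the variable names are assumptions.

```python
import pandas as pd

# hypothetical stand-in for df.prestige (illustration only)
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 4.0, 2.0], name='prestige')

# one indicator column per level: four columns in total
all_dummies = pd.get_dummies(prestige, prefix='prestige')

# drop_first=True omits prestige_1.0, leaving it as the reference level
reduced = pd.get_dummies(prestige, prefix='prestige', drop_first=True)

print(list(all_dummies.columns))
print(list(reduced.columns))
```

The notebook itself keeps all four dummies for the hand calculations and slices off the first one later, which achieves the same thing.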

## Part 3. Hand calculating odds ratios
Develop your intuition about expected outcomes by hand calculating odds ratios.

In [6]:
cols_to_keep = ['admit', 'gre', 'gpa']  # prestige is left out here and re-added as dummies
# keep all four dummies for the hand calculations; the reference level is only dropped for the regression
handCalc = df[cols_to_keep].join(df_dummies.loc[:, 'prestige_1.0':])
print(handCalc.head())

   admit    gre   gpa  prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0      0  380.0  3.61             0             0             1             0
1      1  660.0  3.67             0             0             1             0
2      1  800.0  4.00             1             0             0             0
3      1  640.0  3.19             0             0             0             1
4      0  520.0  2.93             0             0             0             1


In [7]:
cols_to_keep = ['admit', 'gre', 'gpa']  # prestige is left out here and re-added as dummies
handCalc = df[cols_to_keep].join(df_dummies)  # equivalent: join every dummy column without slicing
print(handCalc.head())

   admit    gre   gpa  prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0      0  380.0  3.61             0             0             1             0
1      1  660.0  3.67             0             0             1             0
2      1  800.0  4.00             1             0             0             0
3      1  640.0  3.19             0             0             0             1
4      0  520.0  2.93             0             0             0             1


In [8]:
# crosstab of prestige against admission:
# a frequency table of prestige level and whether or not someone was admitted

In [9]:
pd.crosstab(df.admit, df.prestige, margins=True)

prestige  1.0  2.0  3.0  4.0  All
admit
0          28   95   93   55  271
1          33   53   28   12  126
All        61  148  121   67  397


#### 3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

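Answer: odds_1 = 33 / 28, or about 1.18 (33 admitted against 28 not admitted among the 61 students from #1 ranked colleges)

#### 3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

Answer: odds_rest = 93 / 243, or about 0.38 (126 - 33 = 93 admitted against 271 - 28 = 243 not admitted)

#### 3.3 Calculate the odds ratio

In [10]:
OR_prestige = (33.0 / 28) / (93.0 / 243)  # about 3.08

#### 3.4 Write this finding in a sentence:

Answer: The odds ratio is greater than one: the odds of admission for a student from a #1 ranked school are about 3.08 times the odds for a student from any other school.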
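
The arithmetic for 3.1 to 3.3 can be sketched as follows, using the counts from the crosstab above. The variable names are illustrative, not part of the assignment.

```python
# counts read off the crosstab (margins=True) shown earlier
admitted_top, rejected_top = 33, 28        # students from #1 ranked colleges
admitted_rest = 126 - admitted_top         # admitted from all other colleges
rejected_rest = 271 - rejected_top         # not admitted from all other colleges

# odds = (number admitted) / (number not admitted)
odds_top = admitted_top / float(rejected_top)
odds_rest = admitted_rest / float(rejected_rest)

# odds ratio = odds in the group of interest / odds in the comparison group
odds_ratio = odds_top / odds_rest

print(round(odds_top, 2), round(odds_rest, 2), round(odds_ratio, 2))
```

Note that odds are a ratio of counts (admitted to not admitted), so they are not percentages and can exceed 1.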

#### 3.5 Print the cross tab for prestige_4

In [12]:
pd.crosstab(handCalc.admit, handCalc['prestige_4.0'], margins=True)

prestige_4.0    0   1  All
admit
0             216  55  271
1             114  12  126
All           330  67  397


#### 3.6 Calculate the OR 

In [14]:
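OR = (12.0 / 55) / (114.0 / 216)  # about 0.41

#### 3.7 Write this finding in a sentence

Answer: The OR is less than 1: the odds of admission for a student from a prestige 4 school are about 0.41 times the odds for a student from any other school. This is a ratio of odds, not a probability of admission.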
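
For reference, here is a minimal sketch of a reusable helper that computes the odds ratio from a 2x2 table of counts. The function name `odds_ratio_2x2` and the hand-built table are assumptions for illustration; the counts are copied from the prestige_4 crosstab above with the margins left out.

```python
import pandas as pd

def odds_ratio_2x2(table):
    """Odds ratio of outcome 1 for group 1 relative to group 0.

    `table` is a 2x2 frame of counts: rows are the outcome (0, 1),
    columns are the group indicator (0, 1), without margins.
    """
    odds_group1 = table.loc[1, 1] / float(table.loc[0, 1])
    odds_group0 = table.loc[1, 0] / float(table.loc[0, 0])
    return odds_group1 / odds_group0

# rows: admit = 0, 1; columns: prestige_4.0 = 0, 1
prestige4_table = pd.DataFrame({0: [216, 114], 1: [55, 12]}, index=[0, 1])
or_prestige4 = odds_ratio_2x2(prestige4_table)
print(round(or_prestige4, 2))
```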

## Part 4. Analysis

In [54]:
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
# drop prestige_1.0 so that prestige 1 is the reference level
data = df[cols_to_keep].join(df_dummies.loc[:, 'prestige_2.0':])
print(data.head())

   admit    gre   gpa  prestige_2.0  prestige_3.0  prestige_4.0
0      0  380.0  3.61             0             1             0
1      1  660.0  3.67             0             1             0
2      1  800.0  4.00             0             0             0
3      1  640.0  3.19             0             0             1
4      0  520.0  2.93             0             0             1


We're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.

In [72]:
# manually add the intercept column

data['intercept'] = 1.0

#### 4.1 Set the covariates to a variable called train_cols

In [73]:
# note: 'intercept' is not included here, so the model in 4.2 is fit without a constant term
train_cols = data[['gre','gpa','prestige_2.0','prestige_3.0','prestige_4.0']]
train_cols

     gre   gpa  prestige_2.0  prestige_3.0  prestige_4.0
0  380.0  3.61             0             1             0
1  660.0  3.67             0             1             0
2  800.0  4.00             0             0             0
3  640.0  3.19             0             0             1
4  520.0  2.93             0             0             1
5  760.0  3.00             1             0             0
6  560.0  2.98             0             0             0
7  400.0  3.08             1             0             0
8  540.0  3.39             0             1             0
9  700.0  3.92             1             0             0


#### 4.2 Fit the model

In [71]:
sm.Logit?

In [69]:
# use the statsmodels API to fit the logistic regression
import statsmodels.api as sm
logit = sm.Logit(data['admit'], train_cols)
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


#### 4.3 Print the summary results

In [74]:
# print the full summary
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      392
Method:                           MLE   Df Model:                            4
Date:                Tue, 21 Feb 2017   Pseudo R-squ.:                 0.05722
Time:                        18:47:21   Log-Likelihood:                -233.88
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.039e-05
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
gre              0.0014      0.001      1.308      0.191        -0.001     0.003
gpa             -0.1323      0.195     -0.680      0.497        -0.514     0.249
prestige_2.0    -0.9562      0.302     -3.17

#### 4.4 Calculate the odds ratios of the coefficients and their 95% confidence intervals

hint 1: np.exp(X)

hint 2: conf['OR'] = params
        conf.columns = ['2.5%', '97.5%', 'OR']

In [78]:
# the odds ratio is the exponential of each coefficient: OR = exp(coef)
coef = np.exp(result.params)
print(coef)

gre             1.001368
gpa             0.876073
prestige_2.0    0.384342
prestige_3.0    0.214918
prestige_4.0    0.154135
dtype: float64


In [79]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))

                  2.5%     97.5%        OR
gre           0.999320  1.003420  1.001368
gpa           0.598303  1.282800  0.876073
prestige_2.0  0.212826  0.694082  0.384342
prestige_3.0  0.112055  0.412207  0.214918
prestige_4.0  0.070176  0.338540  0.154135
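
To see where these intervals come from, here is a small sketch that rebuilds the prestige_2.0 row by hand. It assumes the interval is the usual normal-approximation (Wald) interval, coefficient plus or minus 1.96 standard errors, exponentiated afterwards; the inputs are the rounded values printed in the summary, so the result only matches the table to about three decimals.

```python
import math

# values as printed in the regression summary for prestige_2.0 (rounded there)
coef = -0.9562
std_err = 0.302
z = 1.96  # approximate 97.5th percentile of the standard normal

# interval on the log-odds scale
lower = coef - z * std_err
upper = coef + z * std_err

# exponentiate to move from log-odds to odds ratios
odds_ratio = math.exp(coef)
ci = (math.exp(lower), math.exp(upper))
print(odds_ratio, ci)
```

Because the exponential is applied to the endpoints, the interval is symmetric around the coefficient on the log scale but not around the odds ratio.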


#### 4.5 Interpret the OR of Prestige_2

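Answer: Holding GRE and GPA constant, the odds of admission for a student from a prestige 2 school are about 0.38 times the odds for a student from a prestige 1 school (the reference level), that is, about 62% lower. The 95% confidence interval (0.21 to 0.69) excludes 1. This is a ratio of odds, not a 38% chance of admission.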
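
An odds ratio multiplies odds, not probabilities, so turning it into a probability requires a baseline. The sketch below is only an illustration: it takes the raw admit rate at prestige 1 from the crosstab (33 of 61) as the baseline and applies the prestige_2.0 odds ratio from the table above, even though that odds ratio is adjusted for GRE and GPA and the baseline is not. The helper names are assumptions.

```python
def prob_to_odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

def odds_to_prob(odds):
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)

baseline_prob = 33 / 61.0    # observed admit rate at prestige 1 (from the crosstab)
or_prestige_2 = 0.384342     # odds ratio printed for prestige_2.0

baseline_odds = prob_to_odds(baseline_prob)
new_odds = baseline_odds * or_prestige_2   # the odds ratio scales the odds
new_prob = odds_to_prob(new_odds)

print(baseline_prob, new_prob)
```

The resulting probability is neither 38% nor 38% of the baseline probability, which is why the odds ratio should not be read as a chance of admission.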

#### 4.6 Interpret the OR of GPA

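Answer: Each one-point increase in GPA multiplies the odds of admission by about 0.88, holding GRE and prestige constant. The 95% confidence interval (0.60 to 1.28) includes 1, so in this model the GPA effect is not statistically distinguishable from no effect. Note that train_cols does not include the intercept column, which may be influencing this estimate.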

## Part 5. Predicted probabilities


As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases or decreases across different variables. First we're going to generate the combinations using a helper function called cartesian (defined below).

We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values from a specified min and maximum value--in our case just the min/max observed values.
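
As a quick illustration of `np.linspace`, the sketch below builds 10 evenly spaced GRE values. The endpoints 220 and 800 are the observed minimum and maximum printed further down; they are hard-coded here only so the snippet runs on its own.

```python
import numpy as np

gre_min, gre_max = 220.0, 800.0   # observed GRE range in the admissions data

# 10 points including both endpoints, so there are 9 equal gaps between them
grid = np.linspace(gre_min, gre_max, 10)
step = grid[1] - grid[0]

print(grid)
print(step)
```

Unlike `np.arange`, `np.linspace` takes the number of points rather than the step size, and it includes the upper endpoint by default.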

In [82]:
def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.
    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.
    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.
    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])
    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    # integer division so m can be used as a slice bound (also under Python 3)
    m = n // arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out

In [128]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print(gres)
# array([ 220.        ,  284.44444444,  348.88888889,  413.33333333,
#         477.77777778,  542.22222222,  606.66666667,  671.11111111,
#         735.55555556,  800.        ])
gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print(gpas)
# array([ 2.26      ,  2.45333333,  2.64666667,  2.84      ,  3.03333333,
#         3.22666667,  3.42      ,  3.61333333,  3.80666667,  4.        ])


# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1,2,3,4], [1.]]))

[ 220.          284.44444444  348.88888889  413.33333333  477.77777778
  542.22222222  606.66666667  671.11111111  735.55555556  800.        ]
[ 2.26        2.45333333  2.64666667  2.84        3.03333333  3.22666667
  3.42        3.61333333  3.80666667  4.        ]
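
As an alternative sketch, the same grid of combinations can be built with the standard library's `itertools.product` instead of the `cartesian` helper. This is not the notebook's method, just another way to get the same rows; the endpoints are hard-coded from the printed output so the snippet is self-contained.

```python
import itertools
import numpy as np
import pandas as pd

gres = np.linspace(220.0, 800.0, 10)   # observed GRE range
gpas = np.linspace(2.26, 4.0, 10)      # observed GPA range

# every combination of gre, gpa, prestige level and the constant 1.0;
# the last input varies fastest, as in the cartesian helper
rows = list(itertools.product(gres, gpas, [1, 2, 3, 4], [1.0]))
grid = pd.DataFrame(rows, columns=['gre', 'gpa', 'prestige', 'intercept'])

print(grid.shape)
print(grid.head())
```

One difference worth noting: here the prestige column keeps its integer type, whereas `cartesian` casts everything to the dtype of the first array (float).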


#### 5.1 Recreate the dummy variables

In [129]:
# recreate the dummy variables
combos.columns = ['gre','gpa','prestige','intercept']

In [140]:
# keep only what we need for making predictions
dummy_ranks = pd.get_dummies(combos['prestige'])
dummy_ranks.columns = ['prestige_1', 'prestige_2', 'prestige_3', 'prestige_4']
cols_to_keep_combo = ['gre','gpa','prestige','intercept']
# use the combo column list (cols_to_keep contains 'admit', which combos does not have)
combos_new = combos[cols_to_keep_combo].join(dummy_ranks.loc[:, 'prestige_2':])

In [141]:
combos_new

     gre       gpa  prestige  intercept  prestige_2  prestige_3  prestige_4
0  220.0  2.260000       1.0        1.0           0           0           0
1  220.0  2.260000       2.0        1.0           1           0           0
2  220.0  2.260000       3.0        1.0           0           1           0
3  220.0  2.260000       4.0        1.0           0           0           1
4  220.0  2.453333       1.0        1.0           0           0           0
5  220.0  2.453333       2.0        1.0           1           0           0
6  220.0  2.453333       3.0        1.0           0           1           0
7  220.0  2.453333       4.0        1.0           0           0           1
8  220.0  2.646667       1.0        1.0           0           0           0
9  220.0  2.646667       2.0        1.0           1           0           0


#### 5.2 Make predictions on the enumerated dataset

In [146]:
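# train_cols is a DataFrame, so it cannot be used to select columns directly;
# pick the predictors by name, in the same order as the columns used to fit the model
predict_cols = ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']
combos_new['admit_pred'] = result.predict(combos_new[predict_cols].astype(float))
print(combos_new.head())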
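
To make the prediction step less of a black box, here is a hedged sketch of what a logistic prediction computes: the inverse logit of the dot product between a row of inputs and the coefficients. The coefficients are recovered as the logs of the odds ratios printed in 4.4 (the fitted model there has five coefficients and no separate intercept term); the function name `predict_prob` and the two example applicants are assumptions for illustration.

```python
import numpy as np

# odds ratios as printed in section 4.4; their logs are the fitted coefficients
odds_ratios = np.array([1.001368, 0.876073, 0.384342, 0.214918, 0.154135])
coefs = np.log(odds_ratios)   # order: gre, gpa, prestige_2, prestige_3, prestige_4

def predict_prob(gre, gpa, prestige):
    """Logistic prediction for one applicant (prestige is 1, 2, 3 or 4)."""
    # prestige 1 is the reference level, so all three dummies are 0 for it
    dummies = [float(prestige == k) for k in (2, 3, 4)]
    x = np.array([gre, gpa] + dummies)
    linear = np.dot(x, coefs)               # log-odds
    return 1.0 / (1.0 + np.exp(-linear))    # inverse logit gives a probability

# same GRE and GPA, best and worst prestige levels
p_top = predict_prob(800.0, 4.0, 1)
p_low = predict_prob(800.0, 4.0, 4)
print(p_top, p_low)
```

Comparing the two applicants shows the effect of prestige alone, since GRE and GPA are held fixed.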

#### 5.3 Interpret findings for the last 4 observations

Answer: 

## Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.