# Project 3

In this project, you will perform a logistic regression on the admissions data we've been working with in projects 1 and 2.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
plt.style.use('ggplot')
import seaborn as sns

In [2]:
df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna() 
print df.head()

   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0


## Part 1. Frequency Tables

#### 1. Let's create a frequency table of our variables

In [3]:
# frequency table for prestige and whether or not someone was admitted
crosstab= pd.crosstab(df['admit'], df['prestige'], rownames=['admit']) 
print crosstab 


prestige  1.0  2.0  3.0  4.0
admit                       
0          28   95   93   55
1          33   53   28   12


In [4]:
# same frequency table as above, this time with row and column totals
crosstab2 = pd.crosstab(index=df["admit"], 
                            columns=df["prestige"],
                             margins=True)   # Include row and column totals

crosstab2.columns = ["prestige1","prestige2","prestige3","prestige4","rowtotal"]
crosstab2.index= ["0","1","coltotal"]

crosstab2

Unnamed: 0,prestige1,prestige2,prestige3,prestige4,rowtotal
0,28,95,93,55,271
1,33,53,28,12,126
coltotal,61,148,121,67,397


## Part 2. Return of dummy variables

#### 2.1 Create class or dummy variables for prestige 

In [5]:
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
print dummy_ranks.head()

   prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0             0             0             1             0
1             0             0             1             0
2             1             0             0             0
3             0             0             0             1
4             0             0             0             1


In [6]:
cols_to_keep = ['admit', 'gre', 'gpa']
# drop prestige_1.0 (the baseline) by slicing from the exact column label onward
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2.0':])
print data.head()

   admit    gre   gpa  prestige_2.0  prestige_3.0  prestige_4.0
0      0  380.0  3.61             0             1             0
1      1  660.0  3.67             0             1             0
2      1  800.0  4.00             0             0             0
3      1  640.0  3.19             0             0             1
4      0  520.0  2.93             0             0             1


In [7]:
data['intercept'] = 1.0

#### 2.2 When modeling our class variables, how many do we need? 



In [8]:
# We need one fewer dummy variable than the number of categories, so that one category serves as the baseline.
# We can drop prestige 1: if a record is not prestige 2, 3, or 4, we know it is prestige 1.
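
The baseline idea can be sketched without pandas. The helper `dummy_code` below is hypothetical (it is not a pandas function) and only illustrates that three indicator columns are enough to identify four prestige levels, with the baseline level encoded as all zeros.

```python
# Hypothetical helper, for illustration only: encode a category as k-1
# indicator variables by leaving out the baseline level.
def dummy_code(value, levels, baseline):
    return dict(('prestige_%s' % lvl, int(value == lvl))
                for lvl in levels if lvl != baseline)

levels = [1.0, 2.0, 3.0, 4.0]

# Prestige 1 is the baseline, so it is represented by all zeros.
baseline_row = dummy_code(1.0, levels, baseline=1.0)
prestige3_row = dummy_code(3.0, levels, baseline=1.0)

print(sorted(baseline_row.items()))
print(sorted(prestige3_row.items()))
```

Keeping all four indicators alongside an intercept would make the columns perfectly collinear, which is why one level is left out.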

## Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.

In [10]:
p1=df['admit'].groupby(df['prestige']).mean()  
print p1

prestige
1.0    0.540984
2.0    0.358108
3.0    0.231405
4.0    0.179104
Name: admit, dtype: float64


#### 3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

In [12]:
cross1=crosstab[1]/crosstab[1].sum() 
print cross1

admit
0    0.459016
1    0.540984
Name: 1.0, dtype: float64
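
Note that the cell above prints proportions (probabilities), while the question asks for odds. A minimal sketch of the conversion, using the prestige 1 counts from the crosstab; `prob_to_odds` is a helper name introduced here for illustration.

```python
def prob_to_odds(p):
    """Convert a probability to odds: p / (1 - p)."""
    return p / (1.0 - p)

# Counts from the prestige 1.0 column of the crosstab.
admitted, rejected = 33, 28

p_admit = float(admitted) / (admitted + rejected)  # the proportion printed above
odds_from_p = prob_to_odds(p_admit)
odds_from_counts = float(admitted) / rejected      # same quantity, straight from counts

print("P(admit | prestige 1) = %.6f" % p_admit)
print("odds(admit | prestige 1) = %.4f" % odds_from_counts)
```

Odds above 1 mean admission is more likely than rejection for this group.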


#### 3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

In [13]:
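# proportion admitted/rejected within each prestige column
# (to keep only some columns, index with a list, e.g. crosstab2[["prestige2", "prestige3", "prestige4"]])
crosstab2/crosstab2.loc["coltotal"]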

Unnamed: 0,prestige1,prestige2,prestige3,prestige4,rowtotal
0,0.459016,0.641892,0.768595,0.820896,0.68262
1,0.540984,0.358108,0.231405,0.179104,0.31738
coltotal,1.0,1.0,1.0,1.0,1.0
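
The table above gives per-column proportions, but the odds for applicants who did not attend a #1 ranked college require pooling prestige 2, 3, and 4. A small sketch with the counts copied by hand from the crosstab; in pandas, something like `crosstab[[2.0, 3.0, 4.0]].sum(axis=1)` should give the same pooled counts.

```python
# Counts copied from the crosstab; positions correspond to prestige 2, 3, 4.
admitted_other = [53, 28, 12]
rejected_other = [95, 93, 55]

total_admitted = sum(admitted_other)
total_rejected = sum(rejected_other)

# Odds = admitted / rejected; probability = admitted / everyone in the group.
odds_other = float(total_admitted) / total_rejected
p_other = float(total_admitted) / (total_admitted + total_rejected)

print("odds(admit | not prestige 1) = %.4f" % odds_other)
print("P(admit | not prestige 1) = %.4f" % p_other)
```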


#### 3.3 Calculate the odds ratio

In [14]:
df.admit.mean()

0.31738035264483627
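
The value above is the overall admission rate rather than an odds ratio. As a sketch of the odds ratio itself, the hypothetical helper `odds_ratio` below divides the odds for prestige 1 by the pooled odds for prestige 2 through 4, using counts from the crosstab.

```python
def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds ratio of group A relative to group B from a 2x2 table of counts."""
    return (float(a_yes) / a_no) / (float(b_yes) / b_no)

# Group A: prestige 1 (33 admitted, 28 rejected).
# Group B: prestige 2-4 pooled (93 admitted, 243 rejected).
or_prestige1 = odds_ratio(33, 28, 93, 243)

print("OR (prestige 1 vs. all others) = %.3f" % or_prestige1)
```

An odds ratio above 1 means the first group has higher odds of admission than the comparison group.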

#### 3.4 Write this finding in a sentence:

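Answer: Overall, 31.7% of applicants were admitted. Admission rates differ by prestige, though: applicants from prestige 1 schools had the best chance at 54.1%, compared with 35.8%, 23.1%, and 17.9% for prestige 2, 3, and 4. In odds terms, prestige 1 applicants have odds of 33/28 ≈ 1.18 versus 93/243 ≈ 0.38 for everyone else, an odds ratio of about 3.1.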

#### 3.5 Print the cross tab for prestige_4

In [55]:

prestige = pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])
# select the prestige 4 column; prestige[:4] slices rows and returns the whole table
prestige_4 = prestige[4.0]
print prestige_4

admit
0    55
1    12
Name: 4.0, dtype: int64


#### 3.6 Calculate the OR 

In [56]:
# `result` is the fitted Logit model from section 4.2, so that cell must be run first
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print np.exp(conf)

                  2.5%     97.5%        OR
gre           1.000074  1.004372  1.002221
gpa           1.136120  4.183113  2.180027
prestige_2.0  0.272168  0.942767  0.506548
prestige_3.0  0.133377  0.515419  0.262192
prestige_4.0  0.093329  0.479411  0.211525
intercept     0.002207  0.194440  0.020716


#### 3.7 Write this finding in a sentence

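Answer: Holding GRE and GPA constant, the odds of admission for an applicant from a prestige 4 school are about 0.21 times the odds for an applicant from a prestige 1 school, i.e. roughly 79% lower (95% CI for the OR: 0.09 to 0.48).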
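
For comparison with the model-based value in the table above, here is a rough hand calculation of the unadjusted odds ratio for prestige 4 relative to prestige 1, straight from the crosstab counts. It ignores GRE and GPA, so it is not expected to match the regression estimate exactly.

```python
# (admitted, rejected) counts per prestige level, copied from the crosstab.
counts = {1.0: (33, 28), 4.0: (12, 55)}

odds = dict((level, float(yes) / no) for level, (yes, no) in counts.items())
unadjusted_or = odds[4.0] / odds[1.0]

# Adjusted OR for prestige_4.0 as printed in the table above.
adjusted_or = 0.211525

print("unadjusted OR (prestige 4 vs. 1) = %.4f" % unadjusted_or)
print("adjusted OR from the model       = %.4f" % adjusted_or)
```

Both values are well below 1, so the direction of the effect is the same with or without controlling for GRE and GPA.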

## Part 4. Analysis

In [58]:
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(data.loc[:, 'prestige_2.0':])
print data.head()

   admit    gre   gpa  prestige_2.0  prestige_3.0  prestige_4.0  intercept
0      0  380.0  3.61             0             1             0        1.0
1      1  660.0  3.67             0             1             0        1.0
2      1  800.0  4.00             0             0             0        1.0
3      1  640.0  3.19             0             0             1        1.0
4      0  520.0  2.93             0             0             1        1.0


We're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.

In [59]:
# manually add the intercept to the modeling frame (`data`, not `df_raw`)
data['intercept'] = 1.0
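
To see why the constant matters, here is a minimal sketch using only the standard library. It assumes the standard logistic (sigmoid) link and borrows the intercept's exponentiated value, 0.020716, from the odds-ratio table in this notebook; `sigmoid` is a helper defined here for illustration.

```python
import math

def sigmoid(z):
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Without an intercept, a row whose covariates are all zero has a linear
# predictor of 0, which pins its predicted probability at one half.
p_no_intercept = sigmoid(0.0)

# With an intercept, that baseline can move. The intercept on the log-odds
# scale is log(0.020716), taken from the odds-ratio table.
intercept = math.log(0.020716)
p_with_intercept = sigmoid(intercept)

print("all-zero covariates, no intercept:   %.4f" % p_no_intercept)
print("all-zero covariates, with intercept: %.4f" % p_with_intercept)
```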

#### 4.1 Set the covariates to a variable called train_cols

In [60]:
train_cols = data.columns[1:]

#### 4.2 Fit the model

In [61]:
logit = sm.Logit(data['admit'], data[train_cols])

# fit the model
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


#### 4.3 Print the summary results

In [62]:
print result.summary()

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Wed, 08 Mar 2017   Pseudo R-squ.:                 0.08166
Time:                        19:18:57   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
gre              0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa              0.7793      0.333      2.344      0.019         0.128     1.431
prestige_2.0    -0.6801      0.317     -2.14

#### 4.4 Calculate the odds ratios of the coefficients and their 95% confidence intervals

hint 1: np.exp(X)

hint 2: conf['OR'] = params, then conf.columns = ['2.5%', '97.5%', 'OR']

In [63]:
print np.exp(result.params)

gre             1.002221
gpa             2.180027
prestige_2.0    0.506548
prestige_3.0    0.262192
prestige_4.0    0.211525
intercept       0.020716
dtype: float64


In [64]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print np.exp(conf)

                  2.5%     97.5%        OR
gre           1.000074  1.004372  1.002221
gpa           1.136120  4.183113  2.180027
prestige_2.0  0.272168  0.942767  0.506548
prestige_3.0  0.133377  0.515419  0.262192
prestige_4.0  0.093329  0.479411  0.211525
intercept     0.002207  0.194440  0.020716
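
The interval for a single coefficient can be approximated by hand. The sketch below assumes the intervals are Wald intervals (coefficient ± 1.96 standard errors on the log-odds scale) and uses the rounded `gpa` values from the summary table, so the result only approximately matches the row above.

```python
import math

coef, std_err = 0.7793, 0.333  # gpa row of the summary table (rounded values)
z = 1.96                       # approximate 97.5th percentile of the standard normal

# Confidence interval on the log-odds scale.
lower = coef - z * std_err
upper = coef + z * std_err

# Exponentiate to move from log-odds to odds ratios.
odds_ratio = math.exp(coef)
ci_low, ci_high = math.exp(lower), math.exp(upper)

print("OR = %.3f, 95%% CI = (%.3f, %.3f)" % (odds_ratio, ci_low, ci_high))
```

Because the exponential is monotonic, exponentiating the endpoints of the log-odds interval gives a valid interval for the odds ratio.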


#### 4.5 Interpret the OR of Prestige_2

Answer: The OR for prestige 2 (0.51) tells us that, holding GRE and GPA constant, the odds of admission for an applicant from a prestige 2 undergraduate school are about 49% lower than for an applicant from a prestige 1 school (the baseline).

#### 4.6 Interpret the OR of GPA

Answer: The OR for GPA (2.18) tells us that each one-point increase in GPA multiplies the odds of admission by about 2.18, holding GRE and prestige constant. This is a statement about odds, not about probability.
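
An odds ratio acts on odds rather than on probability, which the following sketch tries to make concrete. The starting probability of 0.30 is a made-up value chosen purely for illustration, and `apply_odds_ratio` is a hypothetical helper.

```python
def apply_odds_ratio(p, odds_ratio):
    """Return the probability after multiplying the odds of p by odds_ratio."""
    odds = p / (1.0 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

or_gpa = 2.180027        # from the odds-ratio table
or_prestige2 = 0.506548  # from the odds-ratio table

p_start = 0.30  # hypothetical starting probability, for illustration only
p_after_gpa = apply_odds_ratio(p_start, or_gpa)

# Percent change in odds implied by the prestige 2 odds ratio.
pct_change_prestige2 = (or_prestige2 - 1.0) * 100

print("probability after +1 GPA point: %.3f" % p_after_gpa)
print("probability ratio: %.2f (odds ratio: %.2f)" % (p_after_gpa / p_start, or_gpa))
print("change in odds for prestige 2: %.1f%%" % pct_change_prestige2)
```

The probability rises by a smaller factor than 2.18, so "2.18 times more likely" overstates the effect when read as a probability.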

## Part 5: Predicted probabilities


As a way of evaluating our classifier, we're going to recreate the dataset with every combination of a chosen set of input values. This will allow us to see how the predicted probability of admission increases or decreases across different variables. First we're going to generate the combinations using a helper function called cartesian (defined below).

We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values from a specified min and maximum value--in our case just the min/max observed values.
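
As a rough standard-library analogue of the two steps just described, the sketch below builds evenly spaced values and then takes their Cartesian product with `itertools.product`. The `linspace` function here is a simplified stand-in for `np.linspace`, and the min/max values are the observed ones that appear in the printed output further down.

```python
from itertools import product

def linspace(start, stop, num):
    """Simplified stand-in for np.linspace: num evenly spaced values, endpoints included."""
    step = (stop - start) / float(num - 1)
    return [start + i * step for i in range(num)]

gres = linspace(220.0, 800.0, 10)
gpas = linspace(2.26, 4.0, 10)

# Every combination of GRE, GPA, prestige level, and the constant intercept term.
grid = list(product(gres, gpas, [1.0, 2.0, 3.0, 4.0], [1.0]))

print(len(grid))
print(grid[0])
```

Like the cartesian helper, `itertools.product` varies the last input fastest, so the first four rows share the same GRE and GPA and differ only in prestige.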

In [38]:
def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.
    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.
    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.
    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])
    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size  # integer division, since m is used as a slice bound
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out

In [39]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print gres

gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print gpas





[ 220.          284.44444444  348.88888889  413.33333333  477.77777778
  542.22222222  606.66666667  671.11111111  735.55555556  800.        ]
[ 2.26        2.45333333  2.64666667  2.84        3.03333333  3.22666667
  3.42        3.61333333  3.80666667  4.        ]


In [40]:
# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1.0, 2.0, 3.0, 4.0], [1.]])) 
combos.columns = ['gre', 'gpa', 'prestige', 'intercept']

#### 5.1 Recreate the dummy variables

In [41]:
# recreate the dummy variables
dummy_ranks = pd.get_dummies(combos['prestige'], prefix='prestige')
dummy_ranks.columns = ['prestige_1.0', 'prestige_2.0', 'prestige_3.0', 'prestige_4.0']

# keep only what we need for making predictions
cols_to_keep = ['gre', 'gpa', 'intercept']
combos = combos[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2.0':])



#### 5.2 Make predictions on the enumerated dataset

In [42]:
combos['admit_pred'] = result.predict(combos[train_cols])

print combos.head()

     gre       gpa  intercept  prestige_2.0  prestige_3.0  prestige_4.0  \
0  220.0  2.260000        1.0             0             0             0   
1  220.0  2.260000        1.0             1             0             0   
2  220.0  2.260000        1.0             0             1             0   
3  220.0  2.260000        1.0             0             0             1   
4  220.0  2.453333        1.0             0             0             0   

   admit_pred  
0    0.164173  
1    0.090492  
2    0.048977  
3    0.039890  
4    0.185907  
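
To check what `result.predict` is doing, the first prediction can be approximately reproduced by hand. The sketch below recovers the coefficients by taking logs of the odds ratios printed in section 4.4 and assumes the standard logistic link; because those values are rounded, the result should agree only to a few decimal places.

```python
import math

# Exponentiated coefficients copied from the odds-ratio table in section 4.4.
odds_ratios = {
    'gre': 1.002221,
    'gpa': 2.180027,
    'prestige_2.0': 0.506548,
    'prestige_3.0': 0.262192,
    'prestige_4.0': 0.211525,
    'intercept': 0.020716,
}
coefs = dict((name, math.log(value)) for name, value in odds_ratios.items())

# Row 0 of combos: lowest GRE and GPA, prestige 1 (all dummies zero).
row = {'gre': 220.0, 'gpa': 2.26, 'prestige_2.0': 0,
       'prestige_3.0': 0, 'prestige_4.0': 0, 'intercept': 1.0}

# Linear predictor on the log-odds scale, then the logistic transform.
linear_predictor = sum(coefs[name] * row[name] for name in coefs)
admit_pred = 1.0 / (1.0 + math.exp(-linear_predictor))

print("hand-computed admit_pred for row 0: %.4f" % admit_pred)
```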


In [43]:
combos

Unnamed: 0,gre,gpa,intercept,prestige_2.0,prestige_3.0,prestige_4.0,admit_pred
0,220.0,2.260000,1.0,0,0,0,0.164173
1,220.0,2.260000,1.0,1,0,0,0.090492
2,220.0,2.260000,1.0,0,1,0,0.048977
3,220.0,2.260000,1.0,0,0,1,0.039890
4,220.0,2.453333,1.0,0,0,0,0.185907
5,220.0,2.453333,1.0,1,0,0,0.103682
6,220.0,2.453333,1.0,0,1,0,0.056492
7,220.0,2.453333,1.0,0,0,1,0.046078
8,220.0,2.646667,1.0,0,0,0,0.209795
9,220.0,2.646667,1.0,1,0,0,0.118543


#### 5.3 Interpret findings for the last 4 observations

Answer: Holding GRE and GPA fixed at their maximum observed values (GRE 800, GPA 4.0), the predicted probability of admission is about 73% for an applicant from a prestige 1 school, 58% for prestige 2, 42% for prestige 3, and 37% for prestige 4.

## Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.