# Project 4

In this project, you will summarize and present your analysis from Projects 1-3.

### Intro: Write a problem statement/specific aim for this project

Answer: We seek to understand how a student's GPA, GRE score, and undergraduate school prestige relate to the likelihood of admission to graduate school, using the data in admissions.csv. Modeling this relationship lets us infer which factors are associated with admission and predict whether a given student will be admitted.

### Dataset: Write up a description of your data and any cleaning that was completed

Answer: The data consist of 400 rows, one per applicant, with four variables: admission status (admit; binary categorical, 1 = admitted, 0 = not admitted), GPA (continuous, 0-4), GRE score (discrete, 0-800), and undergraduate school prestige (discrete, 1-4). To clean the data, I dropped every row with missing values (the per-prestige counts reported below sum to 397 rows) and created dummy variables for the categorical prestige variable. This prepares the data for logistic regression.

In [72]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna() 
df[pd.isnull(df).any(axis=1)]   #Checking that all of the NA values were dropped

 | admit | gre | gpa | prestige
---|---|---|---|---


### Demo: Provide a table that explains the data by admission status

Mean (SD) for GPA and GRE, and count (row %) for each prestige level, by admission status

Variable | Not Admitted | Admitted
---|---|---
GPA | 3.35 (0.38) | 3.49 (0.37)
GRE | 573.58 (116.05) | 618.57 (109.26)
Prestige 1 | 28 (46%) | 33 (54%)
Prestige 2 | 95 (64%) | 53 (36%)
Prestige 3 | 93 (77%) | 28 (23%)
Prestige 4 | 55 (82%) | 12 (18%)

In [73]:
# Frequency of being admitted / not admitted for each alma mater prestige level.
for rank in sorted(df['prestige'].unique()):
    admits = df.loc[df['prestige'] == rank, ['admit']]
    total = admits.count(axis=0)      # applicants at this prestige level
    admitted = admits.sum(axis=0)     # admit is coded 1/0, so the sum counts admits
    not_admitted = total - admitted

    # print() calls (rather than print statements) so the cell runs on Python 3
    print("Prestige %d admit frequency: " % rank, admitted / total)
    print("Prestige %d non-admit frequency: " % rank, not_admitted / total)

Prestige 1 admit frequency:  admit    0.540984
dtype: float64
Prestige 1 non-admit frequency:  admit    0.459016
dtype: float64
Prestige 2 admit frequency:  admit    0.358108
dtype: float64
Prestige 2 non-admit frequency:  admit    0.641892
dtype: float64
Prestige 3 admit frequency:  admit    0.231405
dtype: float64
Prestige 3 non-admit frequency:  admit    0.768595
dtype: float64
Prestige 4 admit frequency:  admit    0.179104
dtype: float64
Prestige 4 non-admit frequency:  admit    0.820896
dtype: float64


In [74]:
df.groupby(['admit']).mean()

admit | gre | gpa | prestige
---|---|---|---
0 | 573.579336 | 3.347159 | 2.645756
1 | 618.571429 | 3.489206 | 2.150794


In [75]:
df.groupby(['admit']).std()

admit | gre | gpa | prestige
---|---|---|---
0 | 116.052798 | 0.376355 | 0.918922
1 | 109.257233 | 0.371655 | 0.921455
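
The per-prestige frequencies and the group means and standard deviations computed in the cells above could also be obtained more compactly. The following is a minimal sketch, assuming a DataFrame with the same column names (admit, gre, gpa, prestige). The `toy` data are made up for illustration and are not the admissions data.

```python
import pandas as pd

# Toy data with the same column names as admissions.csv (values are made up).
toy = pd.DataFrame({
    "admit":    [1, 0, 1, 0, 0, 0],
    "gre":      [700, 500, 660, 520, 580, 480],
    "gpa":      [3.9, 3.1, 3.6, 3.2, 3.4, 3.0],
    "prestige": [1, 1, 2, 2, 2, 3],
})

# Counts of not admitted (0) and admitted (1) within each prestige level.
counts = pd.crosstab(toy["prestige"], toy["admit"])

# Row percentages: the share of each prestige level that was / was not admitted.
row_pct = pd.crosstab(toy["prestige"], toy["admit"], normalize="index")

# Mean and standard deviation of GRE and GPA by admission status.
summary = toy.groupby("admit")[["gre", "gpa"]].agg(["mean", "std"])

print(counts)
print(row_pct.round(2))
print(summary.round(2))
```

Using `normalize="index"` gives row percentages, which matches how the prestige rows of the demographic table are reported (percent admitted within each prestige level).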


### Methods: Write up the methods used in your analysis

In [76]:
df = pd.get_dummies(df,columns=['prestige'],drop_first=True)

Answer:

1. Create dummy variables for the categorical prestige variable to prepare it for logistic regression. Because drop_first=True is used, prestige 1 is dropped and serves as the reference level.
2. Calculate and examine the characteristics of the data. This includes:
   - Calculating the mean, standard deviation, range, and quartiles of the data
   - Plotting histograms and boxplots of the variables
   - Plotting scatter plots of the dependent variable against the independent variables to determine whether transformations are required
3. Test the variables for multicollinearity, which, if present, may reduce confidence in the model and in the conclusions reached. We used correlation tables to identify possible collinearity.
4. Fit a logistic regression model to the data, regressing admittance on GRE, GPA, and the prestige dummy variables.
5. Examine the test statistics to identify which predictor variables are significant, relying on odds ratios and their confidence intervals to determine each predictor's effect on the outcome variable.

### Results: Write up your results

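Answer: All of the explanatory variables (GPA, GRE, and prestige) were statistically significant in our regression model: none of the 95% confidence intervals for the odds ratios contains 1. The odds ratios for GRE and GPA are greater than 1 (positive coefficients), so a higher GRE score or GPA is associated with higher odds of admission to graduate school. The odds ratios for the prestige 2, 3, and 4 dummy variables are all less than 1 (negative coefficients) and shrink as we move from prestige 2 to prestige 4. Relative to applicants from prestige 1 schools (the reference level), applicants from less prestigious schools therefore have progressively lower odds of admission, ceteris paribus (holding all else constant). In other words, moving from prestige 4 toward prestige 1 is positively related to the chances of admission.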

### Visuals: Provide a table or visualization of these results

Odds ratios (OR) with 95% confidence intervals for each model term (prestige 1 is the reference level)

Variable | 2.5% | 97.5% | OR
---|---|---|---
GRE | 1.000074 | 1.004372 | 1.002221
GPA | 1.136120 | 4.183113 | 2.180027
Prestige 2 | 0.272168 | 0.942767 | 0.506548
Prestige 3 | 0.133377 | 0.515419 | 0.262192
Prestige 4 | 0.093329 | 0.479411 | 0.211525
Intercept | 0.002207 | 0.194440 | 0.020716
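
To make the odds ratios easier to interpret, the short sketch below converts them back to log-odds coefficients, rescales the GRE effect to a 100-point change (about 1.25, since a 1-point change is tiny), and computes predicted admission probabilities for a hypothetical applicant. The dictionary keys, the function name `admit_probability`, and the example applicant (GRE 600, GPA 3.5) are illustrative assumptions. Only the odds ratios themselves come from the table above.

```python
import math

# Odds ratios copied from the table above (keys are illustrative labels).
odds_ratios = {
    "intercept": 0.020716,
    "gre": 1.002221,
    "gpa": 2.180027,
    "prestige_2": 0.506548,
    "prestige_3": 0.262192,
    "prestige_4": 0.211525,
}

# Logistic regression coefficients are the natural logs of the odds ratios.
coefs = {name: math.log(value) for name, value in odds_ratios.items()}

# Odds ratio for a 100-point GRE increase: raise the 1-point OR to the 100th power.
or_gre_100 = odds_ratios["gre"] ** 100


def admit_probability(gre, gpa, prestige):
    """Predicted admission probability for a hypothetical applicant."""
    log_odds = coefs["intercept"] + coefs["gre"] * gre + coefs["gpa"] * gpa
    if prestige != 1:                     # prestige 1 is the reference level
        log_odds += coefs["prestige_%d" % prestige]
    return 1.0 / (1.0 + math.exp(-log_odds))


print("OR per 100 GRE points: %.3f" % or_gre_100)
for level in (1, 2, 3, 4):
    print("prestige %d: %.3f" % (level, admit_probability(600, 3.5, level)))
```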

### Discussion: Write up your discussion and future steps

Answer: We could extend this analysis by testing the model on an out-of-sample data set. Cross-validation would let us perform this testing with less data than a separate hold-out set would require. We could also add other candidate predictors in order to increase the model's accuracy. In doing so, though, we run the risk of increasing complexity to the point of over-fitting, which adds variance to the model. As the model incorporates new variables, we should "penalize" this increase in complexity by regularizing the regression. Potential additional predictors include the type of graduate program the applicant is applying to, whether any family members attended the school, and the extracurricular activities the applicant was involved in.
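
As a rough illustration of these two future steps, the sketch below estimates out-of-sample accuracy with k-fold cross-validation for an L2-regularized logistic regression. It is a NumPy-only sketch under stated assumptions: the model is fitted by plain gradient descent rather than with statsmodels, the data are simulated stand-ins for GRE, GPA, and admit, and the helper names (`fit_logistic_l2`, `kfold_accuracy`) and penalty values are hypothetical.

```python
import numpy as np


def fit_logistic_l2(X, y, lam=1.0, lr=0.1, n_iter=2000):
    """Fit L2-penalized logistic regression by gradient descent.

    lam is the penalty strength; the intercept b is not penalized.
    """
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        p_hat = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))
        w -= lr * (X.T.dot(p_hat - y) + lam * w) / n
        b -= lr * np.mean(p_hat - y)
    return w, b


def kfold_accuracy(X, y, k=5, lam=1.0, seed=0):
    """Mean out-of-sample accuracy over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        # Standardize with training-fold statistics only, to avoid leakage.
        mu, sd = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
        w, b = fit_logistic_l2((X[train_idx] - mu) / sd, y[train_idx], lam=lam)
        pred = ((X[test_idx] - mu) / sd).dot(w) + b > 0   # log-odds > 0 means p > 0.5
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores))


# Simulated predictors and outcome (stand-ins for GRE, GPA, and admit).
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(580, 115, 400), rng.normal(3.4, 0.38, 400)])
true_log_odds = -6.0 + 0.004 * X[:, 0] + 0.8 * X[:, 1]
y = (rng.uniform(size=400) < 1.0 / (1.0 + np.exp(-true_log_odds))).astype(float)

for lam in (0.0, 1.0, 100.0):
    print("lambda=%g  5-fold accuracy=%.3f" % (lam, kfold_accuracy(X, y, lam=lam)))
```

Comparing the cross-validated accuracy across penalty strengths is one simple way to choose how strongly to regularize once more predictors are added.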