# DS-NYC-45 | Unit Project 4: Notebook with Executive Summary

In this project, you will summarize and present your analysis from Unit Projects 1-3.

> ## Question 1.  Introduction
> Write a problem statement for this project.

Using UCLA's graduate school admissions data, determine how GRE score (gre), GPA (gpa), and alma mater prestige (prestige) influence admission into UCLA's graduate school.

> ## Question 2.  Dataset
> Write up a description of your data and any cleaning that was completed.

The data has three predictors (gre, gpa, and prestige) and one dependent variable (admit). Both gre and gpa are continuous variables, while admit and prestige are categorical. admit is a binary variable that denotes whether an applicant was accepted into UCLA's graduate school (accepted = 1, not accepted = 0). gre is the applicant's score on the Graduate Record Examination (range 200-800, mean of ~588), gpa is the applicant's Grade Point Average (max of 4.0, mean of ~3.39), and prestige is the prestige of the applicant's alma mater, with 1 as the highest tier and 4 as the lowest tier. gpa is approximately normally distributed, while gre is left-skewed. admit is binary, so it follows a Bernoulli distribution; prestige takes four ordered levels, so it follows a categorical distribution rather than a Bernoulli one. The cleaning required for this dataset was removing the rows with null values and, for modelling purposes, creating dummy variables from prestige.
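The null-value cleaning described above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (`toy` and its rows are made up for the example and are not records from the UCLA file); it assumes the cleaning was a plain row-wise `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows shaped like the admissions data (not the real records).
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 0, 1],
    "gre":      [380.0, 660.0, np.nan, 640.0, 520.0],
    "gpa":      [3.61, 3.67, 4.00, np.nan, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, np.nan],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
clean = toy.dropna()
print(clean)
```

Checking `isnull().sum()` first shows how many rows the drop will cost, which matters if the missing values are concentrated in one column.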

> ## Question 3.  Demo
> Provide a table that explains the data by admission status.

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
data = pd.read_csv(os.path.join('..', 'DAT-NYC-45','unit-project', 'dataset', 'ucla-admissions.csv'))

In [4]:
admit_sum = data.groupby(['admit', 'gre', 'gpa', 'prestige'])

In [7]:
admit_sum['admit'].value_counts()

admit  gre    gpa   prestige  admit
0      220.0  2.83  3.0       0        1
       300.0  2.92  4.0       0        1
              3.01  3.0       0        1
       340.0  2.90  1.0       0        1
              2.92  3.0       0        1
              3.15  3.0       0        1
       360.0  2.56  3.0       0        1
              3.00  3.0       0        1
              3.14  1.0       0        1
              3.27  3.0       0        1
       380.0  2.91  4.0       0        1
              2.94  3.0       0        1
              3.33  4.0       0        1
              3.34  3.0       0        1
              3.38  2.0       0        1
              3.43  3.0       0        1
              3.59  4.0       0        1
              3.61  3.0       0        1
       400.0  2.93  3.0       0        1
              3.05  2.0       0        1
              3.08  2.0       0        1
              3.31  3.0       0        1
              3.35  3.0       0        1
              3.36  2
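Grouping on all four columns at once yields one row per unique applicant profile, so the output above is hard to read as a summary. A more compact table by admission status could be sketched like this; the frame `toy` is hypothetical example data, and the choice of means and tier counts is an assumption about what a useful summary would contain.

```python
import pandas as pd

# Hypothetical toy rows shaped like the admissions data (not the real records).
toy = pd.DataFrame({
    "admit":    [0, 0, 0, 1, 1, 1],
    "gre":      [380.0, 640.0, 520.0, 660.0, 800.0, 760.0],
    "gpa":      [3.61, 3.19, 2.93, 3.67, 4.00, 3.00],
    "prestige": [3.0, 4.0, 4.0, 3.0, 1.0, 2.0],
})

# Mean gre and gpa for each admission outcome (one row per value of admit).
by_admit = toy.groupby("admit")[["gre", "gpa"]].mean()
print(by_admit)

# Number of applicants in each prestige tier, split by admission outcome.
prestige_counts = pd.crosstab(toy["admit"], toy["prestige"])
print(prestige_counts)
```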

> ## Question 4. Methods
> Write up the methods used in your analysis.

First, I dropped observations with missing ("na") values from the dataset, because I did not want null values affecting the exploratory analysis or the modelling. Before fitting the logistic regression models, I one-hot encoded the feature 'prestige', creating one binary variable per prestige tier. This makes the feature machine legible: many learning algorithms either learn a single weight per feature (for example, linear models such as logistic regression) or use distances between samples. If 'prestige' is left as the numbers 1 through 4, the algorithm treats the tiers as evenly spaced quantities and applies incorrect weights and distances. I then dropped one of the binary variables (prestige = 1, which becomes the reference category) to avoid perfect multicollinearity among the dummy variables, leaving three prestige indicators in the model.
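The encoding step can be sketched with `pandas.get_dummies`. The toy frame below is hypothetical, and dropping the `prestige_1.0` column is an assumption consistent with the results, which report coefficients only for tiers 2 through 4.

```python
import pandas as pd

# Hypothetical toy column with the four prestige tiers (not the real records).
toy = pd.DataFrame({"prestige": [1.0, 2.0, 3.0, 4.0, 2.0]})

# One indicator column per prestige tier.
dummies = pd.get_dummies(toy["prestige"], prefix="prestige")

# Drop the prestige = 1 column so that tier becomes the reference category.
dummies = dummies.drop(columns="prestige_1.0")

# Attach the indicators to the original frame.
encoded = toy.join(dummies)
print(encoded)
```

With prestige = 1 as the reference, each remaining coefficient is read as a comparison against applicants from the top tier.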

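I fit logistic regression models with both statsmodels and scikit-learn so that potential future applicants can be classified as accepted or not accepted. Both libraries estimate the coefficients by maximum likelihood; note that scikit-learn's LogisticRegression applies L2 regularization by default, so its fit is a penalized maximum likelihood estimate unless the penalty is turned off.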
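To make "maximum likelihood estimation" concrete, here is a minimal NumPy sketch that fits a logistic regression by Newton's method on the log-likelihood. It is an illustration only, not how statsmodels or scikit-learn are implemented internally; the function name `fit_logistic_mle` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_logistic_mle(X, y, n_iter=25):
    """Fit logistic regression (with intercept) by Newton's method."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # predicted probabilities
        gradient = X.T @ (y - p)                       # gradient of the log-likelihood
        neg_hessian = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(neg_hessian, gradient)
    return beta

# Synthetic data with a known intercept and slope (not the UCLA data).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
true_beta = np.array([-0.5, 1.5])
prob = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = (rng.random(500) < prob).astype(float)

beta_hat = fit_logistic_mle(x.reshape(-1, 1), y)
print(beta_hat)
```

At the maximum likelihood estimate the gradient of the log-likelihood is zero, which is the condition the loop drives toward.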

> ## Question 5. Results
> Write up your results.

Answer: statsmodels estimated the following log-odds coefficients: gre = 0.0014, gpa = -0.1323, prestige of 2.0 = -0.9562, prestige of 3.0 = -1.5375, and prestige of 4.0 = -1.8699.

These coefficients translate to odds ratios of: gre = 1.001368, gpa = 0.876073, prestige of 2.0 = 0.384342, prestige of 3.0 = 0.214918, and prestige of 4.0 = 0.154135.

Thus, holding the other features fixed, a one-point increase in gre multiplies the odds of admittance by about 1.0014, an increase of roughly 0.14%. A one-point increase in gpa multiplies the odds by about 0.876, a reduction of roughly 12%. Regarding prestige, the odds of admission for an applicant from a prestige = 2.0 undergraduate school are about 62% lower than for an applicant from a prestige = 1 undergraduate school; for prestige = 3.0 applicants the odds are about 79% lower than for prestige = 1 applicants; and for prestige = 4.0 applicants the odds are about 85% lower than for prestige = 1 applicants.
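The conversion from log-odds coefficients to odds ratios and percent changes in odds is exp(coefficient) and (exp(coefficient) - 1) × 100. A small sketch using the rounded coefficients reported above; because the inputs are rounded to four decimals, the last digits may differ slightly from the reported odds ratios.

```python
import math

# Rounded log-odds coefficients as reported in the results.
coefficients = {
    "gre": 0.0014,
    "gpa": -0.1323,
    "prestige_2.0": -0.9562,
    "prestige_3.0": -1.5375,
    "prestige_4.0": -1.8699,
}

# Odds ratio = exp(coefficient); percent change in odds = (odds ratio - 1) * 100.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
percent_change = {name: (ratio - 1.0) * 100.0 for name, ratio in odds_ratios.items()}

for name in coefficients:
    print(f"{name:>12}: odds ratio = {odds_ratios[name]:.4f}, "
          f"change in odds = {percent_change[name]:+.1f}%")
```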

> ## Question 6. Visuals
> Provide a table or visualization of these results.

Feature | Coefficient (log-odds) | Odds ratio
--- | --- | ---
gre | 0.0014 | 1.001368
gpa | -0.1323 | 0.876073
prestige_2.0 | -0.9562 | 0.384342
prestige_3.0 | -1.5375 | 0.214918
prestige_4.0 | -1.8699 | 0.154135


> ## Question 7.  Discussion
> Write up your discussion and future steps.

Answer: Overall, alma mater prestige shows the largest effects on the odds of admission to UCLA's graduate school in this model. The gre coefficient looks small, but it is measured per point on a 200-800 scale, so it cannot be compared directly with the gpa and prestige coefficients without first putting the features on a common scale. While the results give us insight into the likelihood that an applicant will be admitted, we have yet to evaluate the model itself. The next step is to evaluate the accuracy of the model on a training set and then on a held-out test set. If the accuracy is acceptable, the model can be considered for production use. If it is not, further feature engineering or new features may be necessary. Without scoring the model, it is difficult to judge how meaningful the log-odds coefficients and odds ratios are.
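The evaluation step described above could look roughly like the sketch below. It uses synthetic data rather than the admissions file, and the 75/25 split, the `random_state` values, and the use of plain accuracy as the metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic features and binary labels (stand-ins for the real predictors and admit).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
logits = X @ np.array([1.0, -0.5, 0.25])
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Hold out 25% of the rows; stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training rows only, then score both splits.
model = LogisticRegression()
model.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy = {train_accuracy:.3f}, test accuracy = {test_accuracy:.3f}")
```

A large gap between training and test accuracy would suggest overfitting. With an imbalanced outcome such as admissions, comparing accuracy against the majority-class baseline is also worth doing before trusting the number.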

The biggest risk with the data and the model is that there is no stated timeframe. The values of the variables depend on when the data was collected, and past admissions are not necessarily indicative of future admissions. The prestige variable also brings inherent risk: assigning prestige is subjective, so it would be best to have the rubric that was used to assign the prestige tiers. Finally, one must question whether all GPAs are created equal. Is there a difference in the mean, maximum, or minimum GPA among applicants across different majors or different colleges?