# DS-NYC-45 | Unit Project 4: Notebook with Executive Summary

In this project, you will summarize and present your analysis from Unit Projects 1-3.

> ## Question 1.  Introduction
> Write a problem statement for this project.

### Problem Statement
Determine the various factors that may influence admission into UCLA's graduate school. Using a sample of cross-sectional UCLA admissions data, we would like to explore the association of GRE score, GPA score, and Prestige with admission to UCLA's graduate school. We will test whether applicants will be more likely to be admitted into UCLA's graduate school when they have higher GRE and GPA scores, and when they come from a more prestigious alma mater.

> ## Question 2.  Dataset
> Write up a description of your data and any cleaning that was completed.

### Dataset

The dataset we will be analyzing includes 400 observations of applicants to UCLA's graduate school, with the following variables:

Variable | Description | Type of Variable | Range
---|---|---|---
admit | 0 = Not admitted into UCLA, 1 = admitted into UCLA | Categorical | N/A
gre | GRE (Graduate Record Examination) score | Continuous | 220-800
gpa | GPA (Grade Point Average) score | Continuous | 2.26-4.00
prestige | Prestige of applicant's alma mater, with 1 as the highest tier (most prestigious)<br> and 4 as the lowest tier (least prestigious). | Categorical | N/A

**Data Cleaning Steps**
* 3 observations were excluded from the analysis due to missing data for GRE, GPA, or Prestige
* Created 4 new features based on the categorical feature **Prestige** (i.e. one-hot encoding):
    * **prestige_1**: 1 if student from tier-1 undergraduate school, otherwise 0
    * **prestige_2**: 1 if student from tier-2 undergraduate school, otherwise 0
    * **prestige_3**: 1 if student from tier-3 undergraduate school, otherwise 0
    * **prestige_4**: 1 if student from tier-4 undergraduate school, otherwise 0

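The two cleaning steps above can be sketched without any dependencies. This is only an illustration of what the steps do, not the code used in the project: the rows are made up, and `clean_and_encode` is a hypothetical helper name. In pandas the same steps are usually written with `DataFrame.dropna` and `pandas.get_dummies`.

```python
# Hypothetical rows for illustration only; these are not from the real dataset.
rows = [
    {"admit": 0, "gre": 500, "gpa": 3.10, "prestige": 3},
    {"admit": 1, "gre": 700, "gpa": 3.80, "prestige": 1},
    {"admit": 1, "gre": 780, "gpa": None, "prestige": 2},   # missing GPA
    {"admit": 0, "gre": 560, "gpa": 2.90, "prestige": None},  # missing prestige
    {"admit": 1, "gre": 640, "gpa": 3.50, "prestige": 4},
    {"admit": 0, "gre": 600, "gpa": 3.30, "prestige": 2},
]

REQUIRED = ("gre", "gpa", "prestige")
TIERS = (1, 2, 3, 4)


def clean_and_encode(rows):
    """Drop rows with a missing required field, then one-hot encode prestige."""
    encoded = []
    for row in rows:
        if any(row.get(col) is None for col in REQUIRED):
            continue  # exclude incomplete observations
        out = {key: value for key, value in row.items() if key != "prestige"}
        for tier in TIERS:
            # Exactly one of the four indicator columns is 1 for each row.
            out[f"prestige_{tier}"] = 1 if row["prestige"] == tier else 0
        encoded.append(out)
    return encoded


cleaned = clean_and_encode(rows)
for row in cleaned:
    print(row)
```

When the encoded features feed a regression with an intercept, one indicator column is left out as the reference level (tier 1 in this write-up).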
> ## Question 3.  Demo
> Provide a table that explains the data by admission status.

### Description of Data by Admission Status

Feature | Admitted | Not Admitted
---|---|---
**admit**     |count=127|count=273
**gre**       |mean=619, std=109  |mean=574, std=116
**gpa**       |mean=3.49, std=0.37|mean=3.35, std=0.38
**prestige_1**|count=33|count=28
**prestige_2**|count=53|count=97
**prestige_3**|count=28|count=93
**prestige_4**|count=12|count=55

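The prestige counts in the table are enough to work out unadjusted odds of admission per tier by hand. The sketch below shows one way to do that arithmetic, with tier 1 as the reference; the variable and function names are illustrative.

```python
# (admitted, not admitted) counts per prestige tier, copied from the table above.
counts = {1: (33, 28), 2: (53, 97), 3: (28, 93), 4: (12, 55)}


def odds(admitted, not_admitted):
    """Odds of admission: admitted applicants per non-admitted applicant."""
    return admitted / not_admitted


reference = odds(*counts[1])  # tier 1 is the reference level
odds_ratios = {tier: odds(*c) / reference for tier, c in counts.items()}

for tier, ratio in odds_ratios.items():
    print(f"tier {tier}: odds={odds(*counts[tier]):.3f}, "
          f"odds ratio vs tier 1={ratio:.3f}")
```

These ratios come from the counts alone, so they will not match model-based odds ratios exactly, since a fitted model also adjusts for GRE and GPA.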

> ## Question 4. Methods
> Write up the methods used in your analysis.

### Analytical Methods
**Exploratory Analysis**
* Understand distribution of each feature:
    * For GRE & GPA (Continuous Features):
        * Calculated Mean, Min, Max, Standard Deviation
        * Created Box plots to understand distribution and identify possible outliers
        * Created Histograms for a more detailed look at the distribution
    * For Prestige (Categorical Feature):
        * Created Histogram to understand the distribution of tiers
* Calculated correlations between each pair of features to identify any potential collinearity.

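For reference, a pairwise correlation of the kind described above can be computed as a Pearson coefficient. The sketch below is a plain-Python version on made-up values (the `gre` and `gpa` lists are hypothetical, not the project data); in pandas, `DataFrame.corr` computes the same quantity by default.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical values for illustration only.
gre = [520, 600, 640, 700, 760]
gpa = [3.0, 3.4, 3.2, 3.7, 3.9]

r = pearson(gre, gpa)
print(f"correlation(gre, gpa) = {r:.3f}")
```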
**Logistic Regression**
* Potential Risks:
    * We are assuming that the GRE and GPA variables are uniformly distributed. Their observed distributions were close enough to support this assumption.
    * We are also assuming no collinearity, but we did see some correlation between the features (for example, 0.38 between GRE and GPA).
* Manually calculated odds ratios for Prestige.
* Fit Logistic Regression using GRE, GPA, and Prestige (using tier-1 as our reference). Used both **`statsmodels`** and **`sklearn`** to fit the model.
* Calculated odds ratios based on model coefficients for each feature.
* Predicted the probability of admission by undergraduate school tier (with GRE of 800 and GPA of 4) using our fitted models.

> ## Question 5. Results
> Write up your results.

### Results

Our goal was to determine how GRE, GPA, and undergraduate school Prestige influence admission into UCLA's graduate school. Below are our findings based on our Logistic Regression model.

*Note: Results below are based on the model using `sklearn`*

The odds of being admitted into UCLA's graduate school, holding the other features constant:
* Increase by 0.216% for every 1-point increase in GRE
* Increase by 96.0% for every 1-point increase in GPA
* Decrease by 46.7% when Prestige=2 (instead of Prestige=1)
* Decrease by 71.4% when Prestige=3 (instead of Prestige=1)
* Decrease by 79.2% when Prestige=4 (instead of Prestige=1)

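A logistic regression coefficient is converted to a percent change in odds by exponentiating it to get the odds ratio and subtracting 1. The sketch below applies that conversion to the `sklearn` coefficients reported in this write-up; the helper name `percent_change_in_odds` is illustrative.

```python
import math

# sklearn coefficients copied from the coefficient table in this write-up.
coefficients = {
    "gre": 0.00215822,
    "gpa": 0.67315495,
    "prestige_2": -0.62882239,
    "prestige_3": -1.25222745,
    "prestige_4": -1.56879212,
}


def percent_change_in_odds(coef):
    """Percent change in odds for a one-unit increase in the feature."""
    odds_ratio = math.exp(coef)
    return (odds_ratio - 1.0) * 100.0


pct = {name: percent_change_in_odds(c) for name, c in coefficients.items()}
for name, value in pct.items():
    print(f"{name}: odds ratio={math.exp(coefficients[name]):.6f}, "
          f"change in odds={value:+.3f}%")
```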
> ## Question 6. Visuals
> Provide a table or visualization of these results.

### Visuals

**Feature Distributions**
![Distribution: Admit](../images/Admit_Distribution.png)
![Distribution: GRE](../images/GRE_Dist.png)
![Distribution: GPA](../images/GPA_Dist.png)
![Distribution: Prestige](../images/Prestige_Dist.png)

**Feature Correlations**

Feature|admit|gre|gpa|prestige
---|---|---|---|---
**admit**|1.000000|0.181202|0.174116|-0.243563
**gre**|0.181202|1.000000|0.382408|-0.124533
**gpa**|0.174116|0.382408|1.000000|-0.060976
**prestige**|-0.243563|-0.124533|-0.060976|1.000000

**Logistic Regression Coefficients & Odds Ratios**

*Note: Model coefficients and odds ratios are relative to prestige=1*

Feature|Coefficient (`sklearn`)|Odds Ratio (`sklearn`)|Coefficient (`statsmodels`)|Odds Ratio (`statsmodels`)|
----------|:---------:|:------:|:-------:|:------:
gre       | 0.00215822|1.002161| 0.002218|1.002221
gpa       | 0.67315495|1.960413| 0.779337|2.180027
prestige_2|-0.62882239|0.533219|-0.680137|0.506548
prestige_3|-1.25222745|0.285867|-1.338677|0.262192
prestige_4|-1.56879212|0.208297|-1.553411|0.211525
intercept |-3.51478687|N/A     |-3.876854|N/A

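As a worked reading of the table, the sketch below plugs the `sklearn` coefficients into the logistic function to estimate the probability of admission for an applicant with a GRE of 800 and a GPA of 4.0 at each prestige tier. It is a hand calculation from the reported coefficients, not a call into the fitted model, and `admission_probability` is a hypothetical helper name.

```python
import math

# sklearn coefficients from the table above; tier 1 is the reference level.
INTERCEPT = -3.51478687
COEF = {
    "gre": 0.00215822,
    "gpa": 0.67315495,
    "prestige_2": -0.62882239,
    "prestige_3": -1.25222745,
    "prestige_4": -1.56879212,
}


def admission_probability(gre, gpa, tier):
    """Predicted probability of admission from the linear predictor (log-odds)."""
    log_odds = INTERCEPT + COEF["gre"] * gre + COEF["gpa"] * gpa
    if tier != 1:  # the reference tier contributes no extra term
        log_odds += COEF[f"prestige_{tier}"]
    return 1.0 / (1.0 + math.exp(-log_odds))


probs = {tier: admission_probability(800, 4.0, tier) for tier in (1, 2, 3, 4)}
for tier, p in probs.items():
    print(f"tier {tier}: P(admit) = {p:.3f}")
```

Even with the top GRE and GPA values, the estimated probability falls steadily from tier 1 to tier 4, which is the prestige effect expressed on the probability scale.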
> ## Question 7.  Discussion
> Write up your discussion and future steps.

### Discussion and Next Steps

**Conclusion**

During this project we were able to estimate the degree of influence that GRE scores, GPA, and the Prestige of the applicant's undergraduate school have on the probability of being admitted into UCLA. Our results show that every feature influences the odds of admission in the expected direction: higher GRE and GPA scores raise the odds, and lower-prestige tiers reduce them. These results are consistent with our hypothesis that applicants are more likely to be admitted into UCLA's graduate school when they have higher GRE and GPA scores, and when they come from a more prestigious alma mater.

**Next Steps**

Now that we understand the association of GRE scores, GPA, and Prestige with admission, our next goal could be to build a robust predictive model. Below are some considerations and next steps to achieve this:
* We will need to split our data into Train and Test sets to properly evaluate the predictive power of our model. We can use cross-validation to assess models on the training data, but we will assess the performance of the final model on the Test data.
* We can adjust our Logistic Regression predictor by using L1 or L2 regularization. We can also adjust the threshold for classifying admission based on the predicted probability of admission (e.g. it doesn't need to be 0.5).
* We can plot learning curves and determine whether we need more observations or new features to improve the predictive power of our model.
* Also, in this study it was important to have an intuitive understanding of the model, but going forward we can test other predictors such as Random Forest and potentially try to improve model results by leveraging PCA.
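
To illustrate the point about the classification threshold, the sketch below shows how precision and recall shift as the cutoff moves away from 0.5. The labels and probabilities are made up for illustration, and `precision_recall` is a hypothetical helper; in practice the probabilities would come from the fitted model on held-out data.

```python
# Hypothetical true labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 1]
y_prob = [0.81, 0.42, 0.55, 0.30, 0.62, 0.38, 0.15, 0.47]


def precision_recall(y_true, y_prob, threshold):
    """Precision and recall when predicting admit for probability >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}: precision={precision:.2f}, recall={recall:.2f}")
```

A lower threshold catches more of the admitted applicants at the cost of more false positives, so the right cutoff depends on which error matters more for the use case.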