# DS-SF-27 | Unit Project 4: Notebook with Executive Summary

In this project, you will summarize and present your analysis from Unit Projects 1-3.

In [2]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [3]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

> ## Question 1.  Introduction
> Write a problem statement for this project.

Answer: Using the UCLA admissions dataset, determine whether admission to UCLA is associated with the prestige of a student's high school.

> ## Question 2.  Dataset
> Write up a description of your data and any cleaning that was completed.

Answer: The dataset initially had 400 observations; dropping the rows that contained missing (NaN) values reduced it to 397. It includes the four variables shown below.

Variable | Description | Type of Variable
---|---|---
`admit` | 0 = Not admitted, 1 = Admitted | Categorical
`gre` | GRE (range: 200-800) | Continuous
`gpa` | GPA (range: 0-4.0) | Continuous
`prestige` | 1 = Highest prestige, 2, 3, 4 = Not prestigious | Categorical
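
The cleaning step described above can be sketched as follows. This is a minimal illustration on a small made-up frame (the `toy` data is hypothetical, not rows from the UCLA file). It counts the rows that contain any missing value before dropping them, which is one way a reduction such as 400 to 397 could be verified.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the admissions file (not real rows)
toy = pd.DataFrame({'admit': [0, 1, 1, 0, 1],
                    'gre': [380.0, 660.0, np.nan, 640.0, 520.0],
                    'gpa': [3.61, 3.67, 4.00, np.nan, 2.93],
                    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0]})

rows_before = len(toy)

# Number of rows that contain at least one NaN in any column
rows_with_missing = int(toy.isnull().any(axis = 1).sum())

# Drop those rows; dropna() returns a new frame and leaves `toy` unchanged
cleaned = toy.dropna()
rows_after = len(cleaned)

print(rows_before, rows_with_missing, rows_after)
```

Checking that `rows_before - rows_with_missing == rows_after` confirms that only rows with missing values were removed.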

> ## Question 3.  Demo
> Provide a table that explains the data by admission status.

Answer:

In [4]:
pd.crosstab([df.prestige, df.gre, df.gpa], df.admit)

prestige | gre | gpa | admit = 0 | admit = 1
---|---|---|---|---
1.0 | 340.0 | 2.90 | 1 | 0
1.0 | 360.0 | 3.14 | 1 | 0
1.0 | 420.0 | 2.96 | 1 | 0
1.0 | 420.0 | 3.02 | 1 | 0
1.0 | 440.0 | 3.22 | 1 | 0
... | ... | ... | ... | ...
4.0 | 740.0 | 3.74 | 1 | 0
4.0 | 780.0 | 3.63 | 0 | 1
4.0 | 780.0 | 3.87 | 1 | 0
4.0 | 800.0 | 3.15 | 1 | 0
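
Because the crosstab above has one row per distinct (prestige, GRE, GPA) combination, a more compact view by admission status may be easier to read. The sketch below is one possible alternative, not the table used in this project; the `toy` frame is hypothetical illustration data.

```python
import pandas as pd

# Hypothetical toy data (not rows from the UCLA file)
toy = pd.DataFrame({'admit': [0, 0, 0, 1, 1, 1],
                    'gre': [340.0, 420.0, 780.0, 780.0, 660.0, 800.0],
                    'gpa': [2.90, 2.96, 3.87, 3.63, 3.67, 4.00],
                    'prestige': [1.0, 1.0, 4.0, 4.0, 3.0, 1.0]})

# Mean GRE and GPA for each admission status
means_by_admit = toy.groupby('admit')[ ['gre', 'gpa'] ].mean()

# Number of applicants at each prestige level, split by admission status
counts_by_prestige = pd.crosstab(toy.prestige, toy.admit)

print(means_by_admit)
print(counts_by_prestige)
```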


> ## Question 4. Methods
> Write up the methods used in your analysis.

Answer:
1. Check for missing data and remove the affected observations.
2. Explore the data and the relationships between variables using boxplots and histograms.
3. Check for collinearity using a correlation matrix.
4. Check whether the variables are normally distributed using Q-Q plots.
5. Model the association between admission and high school prestige using statsmodels and sklearn.
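
The collinearity check in step 3 can be sketched as below. The data is synthetic and purely illustrative: `gre` is generated as a noisy linear function of `gpa` so that the two columns are visibly correlated. The coefficients and the random seed are arbitrary assumptions, not estimates from the UCLA data.

```python
import numpy as np
import pandas as pd

# Synthetic illustration data (assumed values, not the UCLA file)
rng = np.random.default_rng(0)
n = 200
gpa = rng.uniform(2.0, 4.0, n)
gre = 200 + 150 * gpa + rng.normal(0, 80, n)  # noisy linear dependence on gpa

toy = pd.DataFrame({'gre': gre, 'gpa': gpa})

# Pairwise Pearson correlations; off-diagonal values close to +1 or -1
# would indicate that two predictors carry largely the same information
corr = toy.corr()
print(corr)
```

If two predictors turned out to be very highly correlated, one option would be to drop one of them before fitting the model.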

> ## Question 5. Results
> Write up your results.

Answer: Holding GPA and GRE constant, admission is associated with high school prestige. As expected, students from prestige 1 schools are the most likely to be admitted to UCLA. The sklearn model, evaluated at a GRE of 800 and a GPA of 4.0, predicts a 71% probability of admission for a prestige 1 student, falling to 57%, 41%, and 34% for students from prestige 2, 3, and 4 high schools, respectively. The statsmodels model gives similar estimates: 73%, 58%, 42%, and 37% for prestige 1, 2, 3, and 4, respectively.
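
To make "similar" concrete, the two sets of percentages quoted above can be placed side by side. The sketch below uses only the numbers reported in this answer; the column and variable names are illustrative.

```python
import pandas as pd

# Predicted admission probabilities (in percent) as reported in the Results text
comparison = pd.DataFrame({'sklearn': [71, 57, 41, 34],
                           'statsmodels': [73, 58, 42, 37]},
                          index = pd.Index([1, 2, 3, 4], name = 'prestige'))

# Gap between the two libraries, in percentage points
comparison['difference'] = comparison.statsmodels - comparison.sklearn

# Largest gap across the four prestige levels
max_gap = comparison.difference.abs().max()

print(comparison)
print(max_gap)
```

One plausible source of the gap is regularization: sklearn's `LogisticRegression` applies an L2 penalty by default, and `C = 10 ** 2` makes that penalty weak but not zero.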

> ## Question 6. Visuals
> Provide a table or visualization of these results.

Answer:

High School Prestige | Probability of Admittance | GPA | GRE
---|---|---|---
`1` | 71% | 4.0 | 800
`2` | 57% | 4.0 | 800
`3` | 41% | 4.0 | 800
`4` | 34% | 4.0 | 800

In [5]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
                              'prestige_2.0': 'prestige_2',
                              'prestige_3.0': 'prestige_3',
                              'prestige_4.0': 'prestige_4'}, inplace = True)

df = df[ ['admit', 'gre', 'gpa'] ].join(prestige_df)

X = df[ ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4'] ]
y = df.admit

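# One hypothetical applicant per prestige level, all with GRE 800 and GPA 4.0.
# prestige_1 is the reference level, so its row has all three dummies set to 0.
predict_X = pd.DataFrame({'gre': [800, 800, 800, 800],
    'gpa': [4, 4, 4, 4],
    'prestige_2': [0, 1, 0, 0],
    'prestige_3': [0, 0, 1, 0],
    'prestige_4': [0, 0, 0, 1]})

# sklearn fits its own intercept, so no explicit intercept column is needed.
# C is the inverse regularization strength; C = 100 keeps the default L2 penalty weak.
model = linear_model.LogisticRegression(C = 10 ** 2).fit(X, y)

# Columns of the result: P(admit = 0), P(admit = 1)
model.predict_proba(predict_X[ ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4'] ])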

array([[ 0.28814605,  0.71185395],
       [ 0.43153702,  0.56846298],
       [ 0.58608936,  0.41391064],
       [ 0.66024514,  0.33975486]])
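
The raw array is easier to read once each row is labeled with its prestige level. The sketch below copies the output above into a labeled frame and also expresses each probability as odds relative to prestige 1. The column names are illustrative, and the row order (prestige 1 through 4) is taken from how `predict_X` was built.

```python
import numpy as np
import pandas as pd

# predict_proba output copied from the cell above:
# column 0 is P(admit = 0), column 1 is P(admit = 1)
proba = np.array([[0.28814605, 0.71185395],
                  [0.43153702, 0.56846298],
                  [0.58608936, 0.41391064],
                  [0.66024514, 0.33975486]])

results = pd.DataFrame({'p_admit': proba[:, 1]},
                       index = pd.Index([1, 2, 3, 4], name = 'prestige'))

# Probability of admission as a whole-number percentage
results['percent'] = (100 * results.p_admit).round().astype(int)

# Odds of admission: p / (1 - p)
results['odds'] = results.p_admit / (1 - results.p_admit)

# Odds relative to prestige 1, the reference level
results['odds_ratio_vs_1'] = results.odds / results.odds.loc[1]

print(results)
```

An odds ratio below 1 for prestige 2, 3, or 4 means lower odds of admission than for a prestige 1 applicant with the same GRE and GPA.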

> ## Question 7.  Discussion
> Write up your discussion and future steps.

Answer: Based on the UCLA dataset, the fitted models indicate that students who graduate from more prestigious undergraduate programs are more likely to be admitted to UCLA when GRE and GPA are held constant. An interesting next step would be to examine how these probabilities change as a student's GRE and GPA scores decrease.