# DS-SF-36 | Unit Project | 3 | Machine Learning Modeling and Executive Summary | Starter Code

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.  You will summarize and present your findings and the methods you used.

In [23]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf
import scipy.stats as stats
from sklearn import linear_model

In [24]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [25]:
# TODO
pd.crosstab(df.prestige, df.admit)

prestige,admit = 0,admit = 1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Feature Engineering

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [26]:
# TODO
prestige_df = pd.get_dummies(df.prestige, prefix = 'Prestige')
prestige_df

Unnamed: 0,Prestige_1.0,Prestige_2.0,Prestige_3.0,Prestige_4.0
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We need any 3 of the 4 dummy variables. 

> ### Question 4.  Why are we doing this?

Answer: From a categorical variable with n possible values, one-hot encoding produces n binary variables. For each sample, exactly one binary variable is 1 and all the others are 0. Therefore, if you know any n - 1 of these binary variables, you can derive the remaining one.
For this reason, when modeling with regression, omit one binary variable; whichever one you omit becomes the reference category (baseline) against which the coefficients of the other binary variables are interpreted. Moreover, since the binary variables of a one-hot encoding always sum to 1, including all of them alongside an intercept, or including all binary variables of two or more one-hot encodings, makes the features perfectly multicollinear.
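
The redundancy described above can be checked directly. The following is a minimal sketch in plain Python, using the `prestige` values of the first five rows shown earlier; the helper `one_hot` is a hypothetical name introduced for illustration and is not part of pandas.

```python
def one_hot(values, categories):
    """Return one dict per value, with a 0/1 indicator for every category."""
    return [{c: int(v == c) for c in categories} for v in values]

# prestige values of the first five rows of the dataset shown above
prestige = [3, 3, 1, 4, 4]
rows = one_hot(prestige, categories=[1, 2, 3, 4])

# Exactly one indicator is set per row, so the four columns always sum to 1.
row_sums = [sum(r.values()) for r in rows]

# Dropping the prestige = 1 column loses no information:
# it equals 1 minus the sum of the remaining three columns.
recovered = [1 - (r[2] + r[3] + r[4]) for r in rows]
original = [r[1] for r in rows]

print(row_sums)               # [1, 1, 1, 1, 1]
print(recovered == original)  # True
```

In pandas, `pd.get_dummies(..., drop_first=True)` applies the same idea by dropping the first category, which then acts as the reference level.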

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [27]:
prestige_df.rename(columns = {'Prestige_1.0': 'Prestige_1',
    'Prestige_2.0': 'Prestige_2',
    'Prestige_3.0': 'Prestige_3',
    'Prestige_4.0': 'Prestige_4'}, inplace = True)
prestige_df

Unnamed: 0,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


In [28]:
# TODO

df = df.join([prestige_df])
df

Unnamed: 0,admit,gre,gpa,prestige,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0,380.0,3.61,3.0,0,0,1,0
1,1,660.0,3.67,3.0,0,0,1,0
2,1,800.0,4.00,1.0,1,0,0,0
3,1,640.0,3.19,4.0,0,0,0,1
4,0,520.0,2.93,4.0,0,0,0,1
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2.0,0,1,0,0
396,0,560.0,3.04,3.0,0,0,1,0
397,0,460.0,2.63,2.0,0,1,0,0
398,0,700.0,3.65,2.0,0,1,0,0


In [29]:
df.drop('prestige', axis=1, inplace=True)
df

Unnamed: 0,admit,gre,gpa,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [32]:
# TODO
table = pd.crosstab(df.Prestige_1, df.admit)
table

Prestige_1,admit = 0,admit = 1
0,243,93
1,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [39]:
# TODO
# Odds = admitted / not admitted (admitted / total would be the probability)
odds_prestigious = 33.0 / 28
round(odds_prestigious, 4)

1.1786

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [38]:
# TODO
# Odds = admitted / not admitted (admitted / total would be the probability)
odds_nonprestigious = 93.0 / 243
round(odds_nonprestigious, 4)

0.3827

> ### Question 9.  Finally, what's the odds ratio?

In [40]:
# TODO
oddsratio, pvalue = stats.fisher_exact(table)
print("OddsR: ", oddsratio, "p-Value:", pvalue)

('OddsR: ', 3.0794930875576036, 'p-Value:', 8.4926306424423656e-05)


> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants who attended the most prestigious undergraduate schools are about 3.08 times the odds for applicants from all other schools (Fisher's exact test, p < 0.001).
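
As a cross-check on Questions 7 to 9, the odds ratio can also be computed by hand from the `Prestige_1` frequency table. This is a minimal sketch that uses only the four counts from that table; odds are admitted / not admitted, whereas admitted / total is the admission probability. The variable names are illustrative.

```python
# Counts from the Prestige_1 frequency table above (floats avoid integer division)
admitted_p1, rejected_p1 = 33.0, 28.0         # Prestige_1 == 1
admitted_other, rejected_other = 93.0, 243.0  # Prestige_1 == 0

# Odds = admitted / not admitted
odds_p1 = admitted_p1 / rejected_p1
odds_other = admitted_other / rejected_other

# Odds ratio of the most prestigious tier relative to everyone else
odds_ratio = odds_p1 / odds_other

# Odds and probability are related by p = odds / (1 + odds)
prob_p1 = odds_p1 / (1 + odds_p1)

print(round(odds_p1, 4))     # 1.1786
print(round(odds_other, 4))  # 0.3827
print(round(odds_ratio, 4))  # 3.0795
```

The hand-computed ratio agrees with the sample odds ratio that `stats.fisher_exact` reports for the same table.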

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [42]:
# TODO
table2 = pd.crosstab(df.Prestige_4, df.admit)
table2

Prestige_4,admit = 0,admit = 1
0,216,114
1,55,12


In [43]:
oddsratio, pvalue = stats.fisher_exact(table2)
print("OddsR: ", oddsratio, "p-Value:", pvalue)

('OddsR: ', 0.41339712918660287, 'p-Value:', 0.0091089814685382381)


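Answer: The odds of admission for applicants from the least prestigious schools are 12/55 ≈ 0.22, versus 114/216 ≈ 0.53 for all other applicants. The odds ratio of about 0.41 is less than 1, so applicants from the least prestigious schools have lower odds of admission to graduate school (Fisher's exact test, p ≈ 0.009).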

## Part D. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [None]:
# TODO

> ### Question 13.  Print the model's summary results.

In [None]:
# TODO

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [None]:
# TODO
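
For orientation, a logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio, and exponentiating the endpoints of its confidence interval gives the interval for the odds ratio. The sketch below illustrates only this conversion; the function name `odds_ratio_with_ci` and the coefficient and standard error are hypothetical and are not results from the admissions model.

```python
import math

def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Convert a log-odds coefficient and its standard error
    into (odds ratio, lower bound, upper bound) for a ~95% interval."""
    return (math.exp(coef),
            math.exp(coef - z * std_err),
            math.exp(coef + z * std_err))

# Hypothetical values for illustration only, NOT fitted results.
coef, std_err = 0.5, 0.2
odds_ratio, lower, upper = odds_ratio_with_ci(coef, std_err)

print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```

With a fitted `statsmodels` result, the same conversion would typically be applied to `result.params` and `result.conf_int()`, which are both on the log-odds scale.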

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: TODO

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: TODO

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: TODO
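
One way to approach this question is to plug the student's features into the fitted linear predictor and apply the inverse logit. The sketch below shows that mechanism only; `admission_probability` is a hypothetical helper, and the intercept and coefficients are made-up placeholders that must be replaced with the values from the fitted model.

```python
import math

def admission_probability(intercept, coefs, features):
    """Inverse logit of intercept + sum(coef * feature value)."""
    log_odds = intercept + sum(coefs[name] * value
                               for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical placeholder coefficients, NOT fitted results.
intercept = -4.0
coefs = {'gre': 0.002, 'gpa': 0.8,
         'Prestige_2': -0.7, 'Prestige_3': -1.3, 'Prestige_4': -1.6}

probabilities = {}
for tier in (1, 2, 3, 4):
    features = {'gre': 800, 'gpa': 4.0}
    # Tier 1 is the reference category, so all of its dummies stay at 0.
    if tier > 1:
        features['Prestige_%d' % tier] = 1
    probabilities[tier] = admission_probability(intercept, coefs, features)

for tier in sorted(probabilities):
    print(tier, round(probabilities[tier], 3))
```

Because tier 1 is the omitted reference level, its prediction uses only the intercept, `gre`, and `gpa` terms.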

## Part E. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [None]:
# TODO

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer: TODO

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: TODO

## Part F.  Executive Summary

> ## Question 21.  Introduction
>
> Write a problem statement for this project.

Answer: TODO

> ## Question 22.  Dataset
>
> Write up a description of your data and any cleaning that was completed.

Answer: TODO

> ## Question 23.  Demo
>
> Provide a table that explains the data by admission status.

Answer: TODO

> ## Question 24.  Methods
>
> Write up the methods used in your analysis.

Answer: TODO

> ## Question 25.  Results
>
> Write up your results.

Answer: TODO

> ## Question 26.  Visuals
>
> Provide a table or visualization of these results.

Answer: TODO

> ## Question 27.  Discussion
>
> Write up your discussion and future steps.

Answer: TODO