- check final regression example
- query why dataset appears to be missing values, but in actuality is not (398 obs for some predictors)

# Project 1

In this first project you will create a framework to scope out data science projects. This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

### Read and evaluate the following problem statement: 
Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last log in, and activity score `1 = active user`, `0 = inactive user`) based on Hooli data from Jan-Apr 2015. 


#### 1. What is the outcome?

Answer: predict/determine which customers will convert from free-tier to paying

#### 2. What are the predictors/covariates? 

Answer: age, gender, location, and profession (demographic data), plus days since last log in and activity score (usage data)

#### 3. What timeframe is this data relevant for?

Answer: Jan-Apr 2015

#### 4. What is the hypothesis?

Answer: Customers with more favorable usage data (a recent log in and/or an activity score of 1) will be more likely to convert from the free tier.

## Let's get started with our dataset

#### 1. Create a data dictionary 

Answer: 

Variable | Description | Type of Variable
---| ---| ---
admit | 1 - admitted, 0 - not admitted | binary
gpa | floating point indicating grade point average | continuous 
gre | integer indicating score on graduate exam | continuous
prestige | provides prestige on scale of 1-4 for school | categorical 


In [1]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("/Users/antuanweeks/DAT-NYC-37/projects/unit-projects/project-1/assets/admissions.csv")

In [3]:
df.dtypes # checking the variable types in the dataset

admit         int64
gre         float64
gpa         float64
prestige    float64
dtype: object

In [4]:
df.describe() # there are 400 observations, but the lower counts show some rows are missing independent variable values

statistic | admit | gre | gpa | prestige
---| ---| ---| ---| ---
count | 400.0 | 398.0 | 398.0 | 399.0
mean | 0.3175 | 588.040201 | 3.39093 | 2.486216
std | 0.466087 | 115.628513 | 0.38063 | 0.945333
min | 0.0 | 220.0 | 2.26 | 1.0
25% | 0.0 | 520.0 | 3.13 | 2.0
50% | 0.0 | 580.0 | 3.395 | 2.0
75% | 1.0 | 660.0 | 3.67 | 3.0
max | 1.0 | 800.0 | 4.0 | 4.0


In [5]:
df.hist() # visualization of frequencies of variables within dataset

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x117ba7b50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11840e390>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1183c1490>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x118610a90>]], dtype=object)

In [6]:
null_data = df[df.isnull().any(axis=1)] # rows with at least one missing value; the result is only assigned here, so display null_data to inspect it
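
The `count` row of `describe()` excludes missing values, so a count of 398 against 400 rows points to two `NaN` entries in that column. A minimal sketch of one way to reconcile the counts, using a small hypothetical DataFrame (`toy`, not the admissions file) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for the admissions data, with two NaN cells.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 0],
    "gre": [380.0, np.nan, 800.0, 640.0],
    "gpa": [3.61, 3.67, np.nan, 3.19],
    "prestige": [3.0, 3.0, 1.0, 4.0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```

Running the same two expressions on `df` should list the rows behind the lower counts.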

In [7]:
with pd.option_context('display.max_rows', 999, 'display.max_columns', 5): # print every row to look for missing values by eye
    print(df)

     admit    gre   gpa  prestige
0        0  380.0  3.61       3.0
1        1  660.0  3.67       3.0
2        1  800.0  4.00       1.0
3        1  640.0  3.19       4.0
4        0  520.0  2.93       4.0
5        1  760.0  3.00       2.0
6        1  560.0  2.98       1.0
7        0  400.0  3.08       2.0
8        1  540.0  3.39       3.0
9        0  700.0  3.92       2.0
10       0  800.0  4.00       4.0
11       0  440.0  3.22       1.0
12       1  760.0  4.00       1.0
13       0  700.0  3.08       2.0
14       1  700.0  4.00       1.0
15       0  480.0  3.44       3.0
16       0  780.0  3.87       4.0
17       0  360.0  2.56       3.0
18       0  800.0  3.75       2.0
19       1  540.0  3.81       1.0
20       0  500.0  3.17       3.0
21       1  660.0  3.63       2.0
22       0  600.0  2.82       4.0
23       0  680.0  3.19       4.0
24       1  760.0  3.35       2.0
25       1  800.0  3.66       1.0
26       1  620.0  3.61       1.0
27       1  520.0  3.74       4.0
28       1  78

In [8]:
gre_count = df['gre'].value_counts() #showing distribution of gre scores

gre_count.head(10).plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x118610a90>

#### What is the impact of GRE score on admission (see graph below)?

In [9]:
df800 = df[df['gre'] > 700]['admit'] # computing the counts for admission based on score range 
df700 = df[(df['gre'] > 600) & (df['gre'] <= 700)]['admit']
df600 = df[(df['gre'] > 500) & (df['gre'] <= 600)]['admit']
df500 = df[(df['gre'] > 400) & (df['gre'] <= 500)]['admit']
df800_1 = df[(df['gre'] > 700) & (df['admit'] == 1)]['admit'].count()
df700_1 = df[(df['gre'] > 600) & (df['gre'] <= 700) & (df['admit'] == 1)]['admit'].count()
df600_1 = df[(df['gre'] > 500) & (df['gre'] <= 600) & (df['admit'] == 1)]['admit'].count()
df500_1 = df[(df['gre'] > 400) & (df['gre'] <= 500) & (df['admit'] == 1)]['admit'].count()
df800_0 = df[(df['gre'] > 700) & (df['admit'] == 0)]['admit'].count()
df700_0 = df[(df['gre'] > 600) & (df['gre'] <= 700) & (df['admit'] == 0)]['admit'].count()
df600_0 = df[(df['gre'] > 500) & (df['gre'] <= 600) & (df['admit'] == 0)]['admit'].count()
df500_0 = df[(df['gre'] > 400) & (df['gre'] <= 500) & (df['admit'] == 0)]['admit'].count()
df800_std = df800.std()
df700_std = df700.std()
df600_std = df600.std()
df500_std = df500.std()
agg_admit = [df500_1, df600_1, df700_1, df800_1]
agg_nonadmit = [df500_0, df600_0, df700_0, df800_0]
agg_std = [df500_std, df600_std, df700_std, df800_std] # same low-to-high order as the two lists above
loc = np.arange(4)
width = 0.6

The graph below shows that, within the lower GRE score ranges, a smaller proportion of applicants is admitted than within the higher score ranges.

In [10]:
plot1 = plt.bar(loc, agg_admit, width, color='b')
plot2 = plt.bar(loc, agg_nonadmit, width, color='r', bottom=agg_admit)
plt.ylabel('Number of Applicants')
plt.title('Admission by GRE Score Range')
plt.xticks(loc + width/2., ('401-500', '501-600', '601-700', '701-800'))
plt.legend((plot1[0], plot2[0]), ('Admitted', 'Not Admitted'))

<matplotlib.legend.Legend at 0x102632090>
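
The per-range counts can also be computed without one filter per range. The following is a sketch only, assuming the same right-closed score ranges as the chart and using a small hypothetical DataFrame (`toy`) in place of the admissions data:

```python
import pandas as pd

# Hypothetical stand-in data; the real analysis would use df.
toy = pd.DataFrame({
    "gre":   [420, 480, 500, 520, 580, 600, 640, 700, 720, 800],
    "admit": [0,   0,   1,   0,   1,   0,   1,   1,   1,   0],
})

bins = [400, 500, 600, 700, 800]
labels = ["401-500", "501-600", "601-700", "701-800"]

# pd.cut uses right-closed intervals by default: (400, 500], (500, 600], ...
toy["gre_range"] = pd.cut(toy["gre"], bins=bins, labels=labels)

# Admitted count, applicant count, and admission rate for each range.
summary = toy.groupby("gre_range", observed=False)["admit"].agg(["sum", "count", "mean"])
summary.columns = ["admitted", "applicants", "admit_rate"]
print(summary)
```

The `admit_rate` column states the proportion directly, which a stacked bar of counts only shows visually.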

We would like to explore the association between 'admit' and 'gpa,' 'gre,' and 'prestige.'

#### 2. What is the outcome?

Answer: determine probability of admission based on underlying characteristics/determinants

#### 3. What are the predictors/covariates? 

Answer: 'gpa,' 'gre,' and 'prestige' are the predictors/covariates, as these variables are used in the admissions process to determine whether or not an applicant will be admitted 

#### 4. What timeframe is this data relevant for?

Answer: there is no explicit timeframe given with the dataset (we assume one application cycle). The generated data is hypothetical.

#### 5. What is the hypothesis?

Answer: No single predictor alone will predict graduate school admission. The likelihood of admission will fall as undergraduate college prestige decreases and will be positively correlated with gpa and gre, so that students with good scores from prestigious schools are the most likely to be admitted.

    Using the above information, write a well-formed problem statement. 


## Problem Statement

Determine the association between UCLA graduate applicant characteristics and admission. Using an admissions dataset published by UCLA, I will test for the drivers of graduate school admission. The data is hypothetical and was generated to provide an example for data analysis in R using logit regression. The dataset includes four variables ('admit,' 'gre,' 'gpa,' and 'prestige'), in which 'admit' is the dependent variable and 'gre,' 'gpa,' and 'prestige' are predictor variables. I believe no single predictor alone can sufficiently predict graduate school admission. My hypothesis is that the likelihood of admission will fall as undergraduate college prestige decreases and will be positively correlated with gpa and gre, so that students with good scores from prestigious schools are the most likely to be admitted.

### Exploratory Analysis Plan

Using the lab from a class as a guide, create an exploratory analysis plan. 

#### 1. What are the goals of the exploratory analysis? 

Answer: One goal of the exploratory analysis is to become familiar with the dataset: its variables and the completeness of the dataset. It is important to understand what the variable values represent (e.g. is a prestige score of 1 good or bad) so that interpretations from analysis are accurate. It is also important to determine the completeness of the dataset in order to gauge how much confidence you can place in some of your findings. If half of the observations are null, or there are few observations in the first place, it will be challenging/not possible to establish causal links or correlation.

#### 2a. What are the assumptions of the distribution of data? 

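Answer: There will be multiple distribution types since the variable type varies. We assume a Bernoulli (binomial) distribution for the binary 'admit' variable and approximately normal distributions for the continuous 'gre' and 'gpa' variables. The categorical 'prestige' variable takes one of four discrete levels, so a normal distribution does not strictly apply to it. 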
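
As a quick arithmetic check on the binary variable: a 0/1 variable with mean p has variance p(1 - p). The sketch below plugs in the mean and count that `describe()` reported for 'admit' and applies the n/(n - 1) sample correction that pandas uses for `std` by default.

```python
import math

n = 400        # count of 'admit' from describe()
p = 0.3175     # mean of 'admit' from describe()

# Variance of a 0/1 variable with success probability p.
population_var = p * (1 - p)

# Sample standard deviation (ddof=1) implied by that variance.
implied_std = math.sqrt(population_var * n / (n - 1))
print(round(implied_std, 6))
```

The result can be compared with the `std` of 0.466087 reported for 'admit'.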

#### 2b. How will you determine the distribution of your data? 

Answer: We can plot the frequencies of each variable using .hist() to see the distribution of each variable (see below).

In [11]:
df.hist()

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1188f18d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1189af910>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x118965dd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x118b94e50>]], dtype=object)

#### 3a. How might outliers impact your analysis? 

Answer: The data does not appear to be skewed much by outliers. Removal of any outliers might actually impact the efficacy of a predictive model. The box plots below provide some insight into why we may not remove outliers in our analyses. 

In [12]:
df.boxplot('gre', return_type='axes')

<matplotlib.axes._subplots.AxesSubplot at 0x118b94e50>

In [13]:
df.boxplot('gpa', return_type='axes')

<matplotlib.axes._subplots.AxesSubplot at 0x118b94e50>

In [14]:
df.boxplot('prestige', return_type='axes')

<matplotlib.axes._subplots.AxesSubplot at 0x118b94e50>

#### 3b. How will you test for outliers? 

Answer: We can test for outliers using standard deviation testing. If a value is more than 3 standard deviations from the mean, we can choose to omit the value (see below). 

In [15]:
df[np.abs(df.gre-df.gre.mean())>=(3*df.gre.std())] # this value is the only outlier with this method (I tested other 
# variables as well). Below I will test if removing the value has any significant effect. If not, we will keep the value.

index | admit | gre | gpa | prestige
---| ---| ---| ---| ---
304 | 0 | 220.0 | 2.83 | 3.0


In [16]:
df2 = df[np.abs(df.gre-df.gre.mean())<=(3*df.gre.std())] # new data frame without the outlier; rows with a missing gre are also dropped because NaN comparisons evaluate to False

In [17]:
df.corr() # original dataframe correlation matrix

variable | admit | gre | gpa | prestige
---| ---| ---| ---| ---
admit | 1.0 | 0.182919 | 0.175952 | -0.241355
gre | 0.182919 | 1.0 | 0.382408 | -0.124533
gpa | 0.175952 | 0.382408 | 1.0 | -0.059031
prestige | -0.241355 | -0.124533 | -0.059031 | 1.0


In [18]:
df2.corr() # correlation matrix with outlier removed

variable | admit | gre | gpa | prestige
---| ---| ---| ---| ---
admit | 1.0 | 0.179844 | 0.172145 | -0.242864
gre | 0.179844 | 1.0 | 0.376383 | -0.1218
gpa | 0.172145 | 0.376383 | 1.0 | -0.059141
prestige | -0.242864 | -0.1218 | -0.059141 | 1.0


The two correlation matrices above show that removing the outlier has no material effect. `df` is the original dataset, while `df2` is the dataset with the outlier removed. The magnitudes are similar and all signs are the same, so we keep the value.
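
One way to quantify how close the two matrices are is to take the largest absolute difference between corresponding entries. The sketch below re-enters the coefficients printed above rather than recomputing them from the data:

```python
import pandas as pd

cols = ["admit", "gre", "gpa", "prestige"]

# Coefficients copied from df.corr() (original data).
corr_all = pd.DataFrame(
    [[1.0, 0.182919, 0.175952, -0.241355],
     [0.182919, 1.0, 0.382408, -0.124533],
     [0.175952, 0.382408, 1.0, -0.059031],
     [-0.241355, -0.124533, -0.059031, 1.0]],
    index=cols, columns=cols)

# Coefficients copied from df2.corr() (outlier removed).
corr_trimmed = pd.DataFrame(
    [[1.0, 0.179844, 0.172145, -0.242864],
     [0.179844, 1.0, 0.376383, -0.1218],
     [0.172145, 0.376383, 1.0, -0.059141],
     [-0.242864, -0.1218, -0.059141, 1.0]],
    index=cols, columns=cols)

# Largest change in any coefficient, and whether every sign is preserved.
max_diff = (corr_all - corr_trimmed).abs().values.max()
signs_match = ((corr_all * corr_trimmed) > 0).values.all()
print(round(max_diff, 6), signs_match)
```

With the notebook's own objects, the same comparison is `(df.corr() - df2.corr()).abs().values.max()`.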

#### 4a. What is collinearity? 

Answer: Collinearity occurs when two or more predictor variables are highly correlated, so that one can be largely predicted from the others. This is an issue in regression because it becomes hard to attribute a change in the dependent variable to any one predictor when several correlated predictors move together.

#### 4b. How will you test for collinearity? 

Answer: The above tests provide a matrix of correlation coefficients for the variables in the dataset. At first glance, 'gre' and 'gpa' show the strongest correlation among the predictors (about 0.38), which is moderate and may indicate some collinearity. We can attempt to isolate one of the variables by running the model without the other in order to tease out the effect of an individual variable.
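
A common complement to reading pairwise correlations is the variance inflation factor (VIF), which for each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below is an illustration, not part of the original analysis: it re-enters the predictor correlations from the `df.corr()` output above, which pandas computes on pairwise-complete observations, so the values are approximate.

```python
import numpy as np
import pandas as pd

predictors = ["gre", "gpa", "prestige"]

# Pairwise predictor correlations copied from the df.corr() output.
corr = pd.DataFrame(
    [[1.0, 0.382408, -0.124533],
     [0.382408, 1.0, -0.059031],
     [-0.124533, -0.059031, 1.0]],
    index=predictors, columns=predictors)

# The diagonal of the inverse correlation matrix gives each predictor's VIF.
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=predictors)
print(vif.round(3))
```

A VIF of 1 means a predictor is uncorrelated with the others. Thresholds of 5 or 10 are often quoted as rules of thumb for problematic collinearity, though the cut-off is a judgment call.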

#### 5. What is your exploratory analysis plan?
Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now. 

Answer: To begin, we will become familiar with the dataset. To do this, we will plot histograms of each variable to see the data distributions, and test for outliers within the dataset. We will flag as outliers any values that are more than three standard deviations away from the mean. If we find that the outliers significantly impact the analysis, we will drop them. Following our test for outliers, we will test for multicollinearity by generating a correlation matrix. If we find that any of the predictor variables are highly correlated, we will attempt to discern the true impact by dropping one or more of the potential candidates for multicollinearity. This will allow us to more accurately tease out the effects of individual predictors.

## Bonus Questions:
1. Outline your analysis method for predicting your outcome
2. Write an alternative problem statement for your dataset
3. Articulate the assumptions and risks of the alternative model

#### 1. Analysis Method Outline

To predict an outcome, we will use logistic regression. This will allow us to estimate the association between each predictor variable and our dependent variable while holding the other predictors constant. Below are the steps to prepare and analyze the dataset:
- acquire dataset
- explore dataset in order to understand variables
- modify dataset in preparation for logistic regression (linear regression has a continuous dependent variable; logistic regression has a limited number of outcomes. Since our dependent variable is binary, logistic regression fits well)
    - create dummy variables for categorical 'prestige' variable
    - determine if any variables are affected by multicollinearity; if so, determine if variables will be analyzed separately
- from regression results, determine if results are significant
- interpret and summarize results

#### 2. Alternative Problem Statement
Determine the prestige of a UCLA graduate school applicant's undergraduate school. Using an admissions dataset published by UCLA, I will test to see if the prestige of an applicant's undergraduate college is inferable from other variables within the dataset. The data is hypothetical and was generated to provide an example for data analysis in R using logit regression. The dataset includes four variables ('admit,' 'gre,' 'gpa,' and 'prestige'), in which 'prestige' is the dependent variable and 'gre,' 'gpa,' and 'admit' are predictor variables. My hypothesis is that there will be a positive correlation between prestige and the various predictor variables, and that GRE score will contribute most to the predictive model: a higher GRE score coupled with the other variables being favorable will imply an applicant from a more prestigious undergraduate school.

#### 3. Assumptions and Risks of Alternative Model
We will assume that the continuous variables are distributed normally. If the dataset is not a random sampling, there might be biases introduced into our interpretations and analysis. We must test for multicollinearity to make sure that we are determining the true effect of predictor variables and not conflating the impact of a particular set of variables. 

In [19]:
import statsmodels.api as sm

In [20]:
dummy_prestige = pd.get_dummies(df['prestige'], prefix='prestige', drop_first=True) # dummies with k-1

In [21]:
dummy_prestige.head() # view dummy DataFrame to make sure we have what we expected

index | prestige_2.0 | prestige_3.0 | prestige_4.0
---| ---| ---| ---
0 | 0.0 | 1.0 | 0.0
1 | 0.0 | 1.0 | 0.0
2 | 0.0 | 0.0 | 0.0
3 | 0.0 | 0.0 | 1.0
4 | 0.0 | 0.0 | 1.0


In [22]:
kept_columns = ['admit', 'gre', 'gpa']

In [23]:
df3 = df[kept_columns].join(dummy_prestige) # join on the shared row index; the deprecated .ix indexer is not needed (help from yhat)

In [24]:
df3.head() # viewing the new dataframe to make sure the data is in correct form

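index | admit | gre | gpa | prestige_2.0 | prestige_3.0 | prestige_4.0
---| ---| ---| ---| ---| ---| ---
0 | 0 | 380.0 | 3.61 | 0.0 | 1.0 | 0.0
1 | 1 | 660.0 | 3.67 | 0.0 | 1.0 | 0.0
2 | 1 | 800.0 | 4.0 | 0.0 | 0.0 | 0.0
3 | 1 | 640.0 | 3.19 | 0.0 | 0.0 | 1.0
4 | 0 | 520.0 | 2.93 | 0.0 | 0.0 | 1.0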


In [25]:
predictors = df3.columns[1:]

In [26]:
df3[predictors].head()

index | gre | gpa | prestige_2.0 | prestige_3.0 | prestige_4.0
---| ---| ---| ---| ---| ---
0 | 380.0 | 3.61 | 0.0 | 1.0 | 0.0
1 | 660.0 | 3.67 | 0.0 | 1.0 | 0.0
2 | 800.0 | 4.0 | 0.0 | 0.0 | 0.0
3 | 640.0 | 3.19 | 0.0 | 0.0 | 1.0
4 | 520.0 | 2.93 | 0.0 | 0.0 | 1.0


In [27]:
df3.describe()

statistic | admit | gre | gpa | prestige_2.0 | prestige_3.0 | prestige_4.0
---| ---| ---| ---| ---| ---| ---
count | 400.0 | 398.0 | 398.0 | 400.0 | 400.0 | 400.0
mean | 0.3175 | 588.040201 | 3.39093 | 0.375 | 0.3025 | 0.1675
std | 0.466087 | 115.628513 | 0.38063 | 0.484729 | 0.459916 | 0.373889
min | 0.0 | 220.0 | 2.26 | 0.0 | 0.0 | 0.0
25% | 0.0 | 520.0 | 3.13 | 0.0 | 0.0 | 0.0
50% | 0.0 | 580.0 | 3.395 | 0.0 | 0.0 | 0.0
75% | 1.0 | 660.0 | 3.67 | 1.0 | 1.0 | 0.0
max | 1.0 | 800.0 | 4.0 | 1.0 | 1.0 | 1.0


In [28]:
logit = sm.Logit(df3['admit'], df3[predictors]) # raises the error below; the missing gre/gpa values (count of 398 in describe()) are the likely cause
result = logit.fit()

ValueError: On entry to DLASCL parameter number 5 had an illegal value