# DS-NYC-45 | Unit Project 1: Research Design Write-Up

In this first unit project you will create a framework to scope out data science projects.  This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Part A.  Evaluate the following problem statement:

> "Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last log in, and `activity score 1 = active user`, `0 = inactive user`) based on Hooli data from January-April 2015."

> ### Question 1.  What is the outcome?

Answer: Convert to paying customer indicator (yes or no).

> ### Question 2.  What are the predictors/covariates?

Answer: age, gender, location, profession, days since last log in, activity score.

> ### Question 3.  What timeframe is this data relevant for?

Answer: January-April 2015.

> ### Question 4.  What is the hypothesis?

Answer: Demographic data collected at signup and customer usage data will allow us to predict if a customer will convert to a paying customer.

## Part B.  Let's start exploring our UCLA dataset and answer some simple questions:

In [1]:
import os
import pandas as pd

df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))

df.head()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |


> ### Question 5.  Create a data dictionary.

Answer: 

Variable | Description | Type of Variable
---|---|---
admit | 0 = Not admitted into UCLA, 1 = admitted into UCLA | Categorical
gre | GRE (Graduate Record Examination) score | Continuous
gpa | GPA (Grade Point Average) score | Continuous
prestige | Prestige of an applicant's alma mater, with 1 as the highest tier (most prestigious) and 4 as the lowest tier (least prestigious). | Categorical
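
A quick way to support the data dictionary is to inspect each column's dtype and number of distinct values. The sketch below is illustrative only: `sample` holds just the five rows shown in the `df.head()` output, and `summarize_columns` is a hypothetical helper name, not part of the project.

```python
import pandas as pd

# The five rows from the df.head() output above (not the full dataset).
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

def summarize_columns(frame):
    """Return the dtype and number of distinct values for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
    })

summary = summarize_columns(sample)
print(summary)
```

Columns with only a handful of distinct values (such as `admit` and `prestige`) are candidates for the categorical type, even when they are stored as numbers.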

We would like to explore the association of GRE score, GPA score, and Prestige with admission to UCLA's graduate school.

> ### Question 6.  What is the outcome?

Answer: Indicator of whether an applicant was admitted into UCLA's graduate school (yes or no)

> ### Question 7.  What are the predictors/covariates?

Answer: GRE score, GPA score, Prestige

> ### Question 8.  What timeframe is this data relevant for?

Answer: Cross-sectional sample of historical UCLA application data.  Exact dates unknown.

> ### Question 9.  What is the hypothesis?

Answer: Applicants are more likely to be admitted into UCLA's graduate school when they have higher GRE and GPA scores, and when they come from a more prestigious alma mater.

> ### Question 10.  What's the problem statement?

> Using your answers to the above questions, write a well-formed problem statement.

Answer: Determine the various factors that may influence admission into UCLA's graduate school.  Using a sample of cross-sectional UCLA admissions data, we would like to explore the association of GRE score, GPA score, and Prestige with admission to UCLA's graduate school. We will test whether applicants will be more likely to be admitted into UCLA's graduate school when they have higher GRE and GPA scores, and when they come from a more prestigious alma mater. 

## Part C.  Create an exploratory analysis plan by answering the following questions:

Because these topics have not yet been covered in class, this section is optional.  This is by design: guessing or looking around for the answers now will help the material make sense once we cover it in class.  You will not be penalized for wrong answers, but we encourage you to give it a try!

> ### Question 11. What are the goals of the exploratory analysis?

Answer:
* Gain intuition
* Sanity check
* Handle variable types
* Identify and treat missing data
* Identify and treat outliers
* Summarize the data
* Visualize
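
As one illustration of the "identify and treat missing data" goal, the sketch below counts missing values per column and keeps only complete rows. The `toy` frame is hypothetical: it reuses the column names from the admissions data, but the `NaN` entries are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; the NaN entries are invented
# to show what a missing-value report looks like.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, np.nan, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, np.nan, 3.19, np.nan],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

missing_counts = toy.isnull().sum()  # number of missing values per column
complete_rows = toy.dropna()         # keep only rows with no missing values

print(missing_counts)
print(len(complete_rows), "of", len(toy), "rows are complete")
```

Dropping incomplete rows is only one possible treatment; whether it is appropriate depends on how much data is missing and why.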

> ### Question 12.  What are the assumptions of the distribution of data?

Answer:
The following assumptions about the distribution of the data are made for a logistic regression:
* The outcome variable must be binary, where 1=admitted and 0=not admitted
* The predictors are independent of one another (no strong collinearity)
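
The first assumption is easy to check in code. This is a minimal sketch, assuming the outcome is coded as 0/1; `is_binary` is a hypothetical helper, and the series values are the ones shown in the `df.head()` output.

```python
import pandas as pd

def is_binary(series):
    """True if every non-missing value in the series is 0 or 1."""
    return set(series.dropna().unique()) <= {0, 1}

# Values from the df.head() output above.
admit = pd.Series([0, 1, 1, 1, 0], name="admit")
gre = pd.Series([380.0, 660.0, 800.0, 640.0, 520.0], name="gre")

print("admit is binary:", is_binary(admit))
print("gre is binary:", is_binary(gre))
```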

> ### Question 13.  How will you determine the distribution of your data?

Answer: We can use Pandas/Matplotlib to determine the distribution of the data.
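
A minimal sketch of the numeric side of this step, using only the five rows from the `df.head()` output: summary statistics for continuous columns, counts for categorical ones, and histogram bin counts computed without plotting. The GRE score range of 200 to 800 used for the bins is an assumption, not something stated in the dataset.

```python
import numpy as np
import pandas as pd

# The five rows from the df.head() output above (not the full dataset).
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Count, mean, std, min, quartiles, and max for the continuous columns.
stats = sample[["gre", "gpa"]].describe()

# Frequency of each level for a categorical column.
prestige_counts = sample["prestige"].value_counts()

# Histogram bin counts; the (200, 800) range is an assumed GRE scale.
counts, edges = np.histogram(sample["gre"], bins=4, range=(200, 800))

print(stats)
print(prestige_counts)
print(counts, edges)
```

With plotting available, the same bin counts are what a histogram of `gre` would draw as bar heights.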

> ### Question 14.  How might outliers impact your analysis?

Answer:
* Outliers could be values that are outside the range of what we expect to be real (e.g. negative GPA or GRE scores), which would skew the data artificially in one direction
* Other outliers may be genuine but rare values, which can still skew results in the same way.
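
To make the skewing effect concrete, the sketch below compares the mean and median of the five GPA values from the `df.head()` output before and after adding one impossible value. The `-3.0` entry is hypothetical, standing in for a data-entry error.

```python
import pandas as pd

# GPA values from the df.head() output above.
gpa = pd.Series([3.61, 3.67, 4.0, 3.19, 2.93])

# Append one hypothetical invalid value (a negative GPA).
gpa_bad = pd.concat([gpa, pd.Series([-3.0])], ignore_index=True)

print("mean before:", gpa.mean(), "after:", gpa_bad.mean())
print("median before:", gpa.median(), "after:", gpa_bad.median())
```

The mean moves by more than a full grade point while the median barely changes, which is why a single bad value can distort any analysis that relies on means.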

> ### Question 15.  How will you test for outliers?

Answer: Using `df.describe()` we can check whether the min or max values are far from the mean (e.g. multiple standard deviations away).
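
The "multiple standard deviations away" idea can be sketched as a z-score filter. The helper name `flag_outliers`, the threshold of 3, and the GPA values below (including the `40.0` entry, imagined as a typo for 4.0) are all assumptions made for illustration.

```python
import pandas as pd

def flag_outliers(series, z_threshold=3.0):
    """Flag values more than z_threshold sample standard deviations from the mean."""
    z = (series - series.mean()) / series.std()
    return z.abs() > z_threshold

# Hypothetical GPA values: 20 plausible scores plus one data-entry error.
gpa = pd.Series([2.9, 3.2, 3.4, 3.6, 3.9] * 4 + [40.0])

mask = flag_outliers(gpa)
print(gpa[mask])
```

In very small samples a single extreme value inflates the standard deviation enough that its own z-score stays low, so this check is more reliable on the full dataset than on a handful of rows.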

> ### Question 16.  What is collinearity?

Answer: Collinearity is when two or more predictors are highly correlated with each other.
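
One common diagnostic, which is not part of the original answer, is the variance inflation factor (VIF). For standardized predictors, the VIFs are the diagonal entries of the inverse of the predictor correlation matrix. The sketch below uses the predictor correlations copied from the `df.corr()` output shown later in this write-up.

```python
import numpy as np
import pandas as pd

predictors = ["gre", "gpa", "prestige"]

# Predictor-to-predictor correlations copied from the df.corr() output.
corr = pd.DataFrame(
    [[1.0, 0.382408, -0.124533],
     [0.382408, 1.0, -0.059031],
     [-0.124533, -0.059031, 1.0]],
    index=predictors,
    columns=predictors,
)

# VIF for each predictor: diagonal of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=predictors)
print(vif)
```

A VIF of 1 means a predictor is uncorrelated with the others; values well above 1 indicate that the predictor is largely explained by the rest.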

> ### Question 17.  How will you test for covariance?

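Answer: Pandas provides a covariance method, `DataFrame.cov()`, that we can call on a dataframe.  For example, GRE and GPA have a positive covariance (about 16.8).  Because covariance depends on each variable's units, the correlation matrix from `DataFrame.corr()` is easier to compare across pairs of variables.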

In [2]:
#covariances
df.cov()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| admit | 0.217237 | 9.871271 | 0.031191 | -0.106189 |
| gre | 9.871271 | 13369.95304 | 16.824761 | -13.648068 |
| gpa | 0.031191 | 16.824761 | 0.144879 | -0.02126 |
| prestige | -0.106189 | -13.648068 | -0.02126 | 0.893654 |


In [3]:
#correlations
df.corr()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| admit | 1.0 | 0.182919 | 0.175952 | -0.241355 |
| gre | 0.182919 | 1.0 | 0.382408 | -0.124533 |
| gpa | 0.175952 | 0.382408 | 1.0 | -0.059031 |
| prestige | -0.241355 | -0.124533 | -0.059031 | 1.0 |
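
The two matrices are linked: a correlation is a covariance divided by the product of the two standard deviations. The sketch below recomputes the GRE/GPA correlation from the covariance output, with all numbers copied from the `df.cov()` and `df.corr()` results above; `corr_from_cov` is a hypothetical helper name.

```python
import math

# Values copied from the df.cov() and df.corr() outputs above.
cov_gre_gpa = 16.824761
var_gre = 13369.95304
var_gpa = 0.144879
reported_corr = 0.382408

def corr_from_cov(cov_xy, var_x, var_y):
    """Pearson correlation: covariance over the product of standard deviations."""
    return cov_xy / math.sqrt(var_x * var_y)

r = corr_from_cov(cov_gre_gpa, var_gre, var_gpa)
print(round(r, 3), "vs reported", reported_corr)
```

The recomputed value agrees with the reported one to about three decimal places; small differences may come from rounding or from how missing values are handled in each calculation.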


> ### Question 18.  What is your exploratory analysis plan?

> Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now.

Answer:
* Check for missing values and outliers in the data
* Use Pandas to understand high-level distribution of each variable separately (min, max, mean, standard dev., etc.)
* Use Matplotlib to get visuals of each variable's distribution, as well as plot each variable against each other
* Use Pandas to determine if the predictors are highly correlated with each other (since we are assuming independence)
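
For reproducibility, the non-visual steps of this plan could be collected into a single function. This is a sketch under stated assumptions: `explore` is a hypothetical helper, and it is run here on only the five rows from the `df.head()` output rather than the full dataset.

```python
import pandas as pd

def explore(frame, outcome="admit"):
    """Run the non-visual steps of the exploratory plan and return the results."""
    predictors = frame.drop(columns=[outcome])
    return {
        "missing": frame.isnull().sum(),       # missing values per column
        "summary": frame.describe(),           # min, max, mean, std, quartiles
        "predictor_corr": predictors.corr(),   # correlation among predictors only
    }

# The five rows from the df.head() output above (not the full dataset).
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

report = explore(sample)
for name, result in report.items():
    print(name)
    print(result)
```

Keeping the steps in one function means a colleague can rerun the same checks on a refreshed dataset by passing in a new dataframe.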