# Project 1

In this first project you will create a framework to scope out data science projects. This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Exercise 1: Read and evaluate the following problem statement: 
Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last login, and activity score: 1 = active user, 0 = inactive user), based on Hooli data from Jan-Apr 2015.


#### 1. What is the outcome?

Answer: 
 - quantitative deliverable: Count of conversion to paying customers
 - prediction deliverable: Likelihood of conversion to paying customers

#### 2. What are the predictors/covariates? 

Answer: 
- predictors are age, gender, location, and profession
- days since last login should be a covariate, since it is an independent variable and relates to time

#### 3. What timeframe is this data relevant for?

Answer: Jan-Apr 2015

#### 4. What is the hypothesis?

Answer: Using a null hypothesis...    
H0: The features, age, gender, location, and profession of free-tier customers are not related to their eventual choice to become paying customers

## Exercise 2: Let's get started with our dataset (use admissions.csv)

#### 1. Create a data dictionary 

Answer: 

Variable | Description | Type of Variable
---| ---| ---
Var 1 | 0 = not thing 1 = thing | categorical
Var 2 | thing in unit X | continuous 


In [6]:
import numpy as np
import pandas as pd
%matplotlib inline

In [7]:
df = pd.read_csv('../assets/admissions.csv')

In [8]:
gres = list(df['gre'])
gpas = list(df['gpa'])
prestiges = list(df['prestige'])
admits = list(df['admit'])

In [18]:
data_dictionary = {}
data_dictionary['Variable'] = ["admit", "gre", "gpa", "prestige"]
data_dictionary['Data Type'] = ["Boolean", "Float", "Float", "Float"]
data_dictionary['Type of Variable'] = ["categorical", "continuous", "continuous", "continuous"]
data_dictionary['Description'] = ["True/False", "values between 130 - 170", "4.0 Scale", "1 - 4, 1 is highest"]


In [19]:
new_ddata = pd.DataFrame(data_dictionary)
new_ddata

Index | Data Type | Description | Type of Variable | Variable
---| ---| ---| ---| ---
0 | Boolean | True/False | categorical | admit
1 | Float | values between 130 - 170 | continuous | gre
2 | Float | 4.0 Scale | continuous | gpa
3 | Float | 1 - 4, 1 is highest | continuous | prestige


We would like to explore the association between X and Y 

#### 2. What is the outcome?

Answer:  We are trying to predict admit

#### 3. What are the predictors/covariates? 

Answer:  gpa, gre, prestige

#### 4. What timeframe is this data relevant for?

Answer:  There is no time interval indicated in the data

#### 5. What is the hypothesis?

Answer:  
H0:  All predictor variable coefficients are equal to zero  
$\beta_{gre} = \beta_{gpa} = \beta_{prestige} = 0$

#### 6. Using the above information, write a well-formed problem statement. 


Answer:
- We want to determine whether a student's GPA, GRE score, and college prestige can be used to predict whether that student will be admitted



We have data on gre, gpa, and prestige, but we don't know whether they are related to admission:
 - Do gre scores, gpas, and college prestige affect a school's decision to admit?
 - Can we predict admission to this school based on three factors?
 - Can we predict probability of admission to this school based on 3 factors?

## Exercise 3: Exploratory Analysis Plan (Materials will be covered on Tuesday's class)

Using the lab from a class as a guide, create an exploratory analysis plan. 

#### 1. What are the goals of the exploratory analysis? 

Answer: We want to load the data in a suitable format and create a data dictionary that describes the features and the data points.  Then we want to sort the variables into categorical and continuous.  We can also group the data into meaningful comparisons in order to investigate how each feature relates to the target.  
We also look at summary statistics, such as mean, median, mode, and standard deviation.  Finally, we look at the distribution of the data and decide on a model that relates the predictors to the target.
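
The goals above can be sketched in a few lines of pandas. This is only an illustration: the rows are made up rather than read from admissions.csv, and the split of columns into `categorical` and `continuous` is an assumption, not something the dataset itself declares.

```python
import pandas as pd

# Hypothetical rows standing in for admissions.csv (values are made up).
df = pd.DataFrame({
    "admit":    [0, 1, 1, 0, 1, 0],
    "gre":      [380, 660, 800, 640, 520, 760],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, 3.00],
    "prestige": [3, 3, 1, 4, 4, 2],
})

# Assumed split of the columns by variable type.
categorical = ["admit", "prestige"]
continuous = ["gre", "gpa"]

# Summary statistics for the continuous columns.
summary = df[continuous].agg(["mean", "median", "std"])

# Mean of each continuous feature within each outcome group (admit = 0 or 1).
by_outcome = df.groupby("admit")[continuous].mean()

print(summary)
print(by_outcome)
```

Comparing the group means in `by_outcome` is one simple way to see whether a feature differs between admitted and rejected applicants before any model is fit.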

#### 2a. What are the assumptions of the distribution of data? 

Answer: Assume the continuous variables (gre and gpa) are approximately normally distributed

#### 2b. How will you determine the distribution of your data? 

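Answer: df.describe() will show the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of each numeric column. The 50% quartile is the median, and the interquartile range can be computed from the 25% and 75% values.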
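
As a rough sketch of that check, the snippet below runs `describe()` on a made-up `gre` column (the values are hypothetical, not taken from admissions.csv), derives the interquartile range from its output, and adds the sample skewness as one possible indicator of departure from a symmetric, normal-like shape.

```python
import pandas as pd

# Hypothetical GRE scores; not the real admissions data.
gre = pd.Series([380, 660, 800, 640, 520, 760, 560, 400], name="gre")

stats = gre.describe()               # count, mean, std, min, 25%, 50%, 75%, max
iqr = stats["75%"] - stats["25%"]    # interquartile range from the quartiles
skewness = gre.skew()                # close to 0 for a roughly symmetric distribution

print(stats)
print("IQR:", iqr)
print("skewness:", skewness)
```

A histogram or box plot of the same column would complement these numbers, since summary statistics alone can hide the shape of the distribution.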

#### 3a. How might outliers impact your analysis? 

Answer: Outliers stretch the range shown in a box plot and can have an extreme impact on the mean.

#### 3b. How will you test for outliers? 

Answer:  Compute the interquartile range (IQR = Q3 - Q1) and flag any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
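
A minimal sketch of the common 1.5 × IQR rule for flagging outliers follows. The helper name `iqr_outliers` and the sample GPA values are hypothetical and chosen only for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up GPA values with one obviously low entry.
gpa = pd.Series([3.2, 3.4, 3.5, 3.6, 3.7, 3.8, 0.5])
flagged = iqr_outliers(gpa)
print(flagged)
```

The multiplier `k = 1.5` is a convention, not a law; a larger value flags fewer points.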

#### 4a. What is collinearity? 

Answer: Collinearity is when two or more predictors are linearly related to each other.

#### 4b. How will you test for collinearity? 

Answer:  Use a scatter plot to see the relationship between two predictors, or use df.corr()
- collinear data points will appear to fall along a line
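
The `df.corr()` check might look like the sketch below. The rows are made up, and the cutoff of 0.8 is an arbitrary assumption for illustration, not a standard threshold.

```python
import pandas as pd

# Hypothetical predictor values; not the real admissions data.
df = pd.DataFrame({
    "gre":      [380, 660, 800, 640, 520, 760],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, 3.00],
    "prestige": [3, 3, 1, 4, 4, 2],
})

corr = df.corr()   # pairwise Pearson correlation between the columns

threshold = 0.8    # assumed cutoff for "strongly correlated"
cols = list(corr.columns)
strong_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr)
print(strong_pairs)
```

Pairwise correlation only detects linear relationships between two predictors at a time, so it is a first check rather than a complete test.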


#### 5. What is your exploratory analysis plan?
Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis 1 year from now. 

Answer: Download the same data, create a data dictionary, and look at summary statistics, searching for statistically significant patterns of correlation.  As mentioned earlier, there is no time data associated with the dataframe in question.  So simply downloading the same data and using the same Python packages to visualize it and compute the measures of statistical significance would likely reproduce the same result.

## Bonus Questions:
1. Outline your analysis method for predicting your outcome
2. Write an alternative problem statement for your dataset
3. Articulate the assumptions and risks of the alternative model

##### 1. EDA Outline for predicting outcome

Identify the Problem
 - Identify Business/Product Objectives
 - Identify and Hypothesize goals and criteria for success
 - Create a set of questions for identifying correct data set
 
Acquire the Data
 - identify the "right" data set or sets
 - Import data and set up local or remote data structure
 - Determine the Most Appropriate tools to work with the data
 
Parse the Data
 - Read any documentation provided with the data
 - Perform EDA (Exploratory Data Analysis)
 - Verify the Quality of the Data
 
Mine the Data
 - Determine the sampling methodology and sample data
 - Format, clean, slice, and combine data in Python
 - Create necessary derived columns from the data (new data)


Refine the Data
 - Identify Trends and Outliers
 - Apply descriptive and inferential statistics
 - Document and transform data
 
Build a Data Model
 - Select Appropriate Model
 - Build Model
 - Evaluate and Refine Model
 
Present the Results
 - Summarize Findings with narrative storytelling techniques
 - Present Limitations and assumptions of your analysis
 - Identify follow-up problems and questions for future analysis

##### 2. Alternate Problem Statement

H1:  The alternative hypothesis is that at least one of the predictor coefficients is not zero  
 - $\beta_{j} \neq 0$ for at least one $j \in \{gre, gpa, prestige\}$

##### 3. Assumptions and risks of alternative model

- There is no time data, so this snapshot dataset could be outdated
- The dataset is small, and so is the number of variables

##### Limitations:
- There are few (4) features in this dataset; any one of them or all of them could be covariates
- time is a convenient covariate, but we do not have it in this data set.

The scales of the features are irregular:
- we need to treat each feature separately, and
- for prestige, make it into four indicator variables

compare gpa range (0 - 4) and prestige range (1 - 4) 
- they are on different scales, and reversed, since highest gpa is 4.0 and highest prestige is 1.0
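
One way to turn prestige into four variables is `pandas.get_dummies`, sketched below on made-up values. Treating prestige as categorical is an assumption about how the column should be modeled.

```python
import pandas as pd

# Hypothetical prestige values (1 is highest, 4 is lowest).
df = pd.DataFrame({"prestige": [3, 3, 1, 4, 4, 2]})

# One indicator column per prestige level.
dummies = pd.get_dummies(df["prestige"], prefix="prestige")

# Dropping the first level leaves three columns; the omitted level
# becomes the baseline, which avoids perfectly collinear indicators
# when a regression also includes an intercept.
dummies_reduced = pd.get_dummies(df["prestige"], prefix="prestige", drop_first=True)

print(dummies)
print(dummies_reduced)
```

With indicator columns, the model no longer assumes that the step from prestige 1 to 2 has the same effect as the step from 3 to 4, and the reversed ordering of the scale stops mattering.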