## Problem: ##
Udacity has a lot of data about its users' habits on the platform, but doesn’t know how to use it to improve its business

## Solution: ## 
Hire Chris as a consultant to tell them how ;)

## Stakeholder Analysis: ##
Udacity makes money from people paying to use its educational tools (and, probably, from partnerships with tech companies)

From user engagement data, we can look at user retention (which, again, is how they make $$$)

Any action we can suggest that increases user retention impacts the company's bottom line

## Data at a Glance: ##
3 CSV files taken from the Udacity Intro to Data Analysis Course (also available on their GitHub page)

Enrollment Data, Daily User Engagement Data, Project Data

## Enrollment Data: ##

account key, cancel date, days to cancel, is canceled, is udacity, join date, status

#### account key: ###
The unique identifier for each person

#### cancel date: ###
The date a user canceled their subscription, blank if the user is current

#### days to cancel: ###
The duration of a user's subscription, blank if the user is current

#### is canceled: ###
True/False if the user has canceled

#### is udacity: ###
True/False if the account is a Udacity test account

#### join date: ###
The date a user began their subscription

#### status: ###
Current/Canceled, redundant with is canceled

for example...<img src = "Desktop/enrollment.png" width="500" height="400">

## Daily User Engagement Data: ##
account key, lessons completed, number of courses visited, projects completed, total minutes visited, utc_date

#### account key: ####
The unique identifier for each person
#### lessons completed: ###
Number of lessons completed by the user on this day
#### number of courses visited: ###
Number of courses visited by the user on this day
#### projects completed: ###
Number of projects completed by the user on this day
#### total minutes visited: ###
Minutes spent by the user on this day
#### utc_date: ###
The date of the observation

<img src = "Desktop/daily2.png" width="500" height="500">

## Project Data: ##
account key, rating, creation date, completion date, lesson, processing state

#### account key: ###
The unique identifier for each person

#### rating: ###
UNGRADED/INCOMPLETE/PASSED/DISTINCTION, self-explanatory

#### creation date: ###
Date the user began working on the project

#### completion date: ###
Date the user finished working on the project

#### lesson: ###
Identifier for which lesson the project was for

#### processing state: ###
Evaluated/Created, Udacity's grading state

<img src = "Desktop/projects.png" width="500" height="500">

## Analysis of Data Quality:

As the pictures suggest, this data was very raw; some of it was useless, some needed cleaning, and none of it was model-ready.

I spent a lot of time doing exploratory data analysis, looking at different slices of the data, the histogram of each variable, and the scatter plots of some choice combinations. This helped me get a sense of which features needed to be created. (see below)

<img src = "Desktop/days_to_cancel.png" width="500" height="500">

<img src = "Desktop/visits_to_completion.png" width="500" height="500">

<img src = "Desktop/minutes_to_completion.png" width="500" height="500">

<img src = "Desktop/days_to_completion.png" width="500" height="500">

<img src = "Desktop/obs_per_day.png" width="500" height="500">

Using the daily and project data, I aggregated the data for each account. I created features for the number of visits, total/average/max days between visits, total/average minutes spent, average courses visited, how many days the account had been active, the number of projects completed, and whether the account completed one or more projects.
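As a rough illustration of that aggregation step, here is a minimal pandas sketch. It is not the code used for this project: the column names are assumed snake_case versions of the fields described above, the sample rows are made up, and "a visit" is assumed to mean a day on which the user visited at least one course.

```python
import pandas as pd

# Tiny made-up sample standing in for the daily engagement CSV.
# Column names are assumptions, not necessarily the real CSV headers.
daily = pd.DataFrame({
    "account_key": [1, 1, 1, 2, 2],
    "utc_date": pd.to_datetime(
        ["2015-01-01", "2015-01-02", "2015-01-05", "2015-01-01", "2015-01-03"]),
    "total_minutes_visited": [30.0, 0.0, 45.0, 10.0, 0.0],
    "num_courses_visited": [1, 0, 2, 1, 0],
    "projects_completed": [0, 0, 1, 0, 0],
})


def build_account_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Collapse one-row-per-day engagement data into one row per account."""
    daily = daily.sort_values(["account_key", "utc_date"])

    # Assumption: a "visit" is a day with at least one course visited.
    visits = daily[daily["num_courses_visited"] > 0].copy()
    # Days since the same account's previous visit (NaN for the first visit).
    visits["gap_days"] = visits.groupby("account_key")["utc_date"].diff().dt.days

    features = daily.groupby("account_key").agg(
        total_minutes=("total_minutes_visited", "sum"),
        avg_minutes=("total_minutes_visited", "mean"),
        avg_courses_visited=("num_courses_visited", "mean"),
        days_active=("utc_date", "nunique"),  # distinct days with a record
        projects_completed=("projects_completed", "sum"),
    )

    # Visit counts and gaps between visits, aligned on account_key.
    features["num_visits"] = visits.groupby("account_key").size()
    features["num_visits"] = features["num_visits"].fillna(0).astype(int)
    gaps = visits.groupby("account_key")["gap_days"].agg(["mean", "max"])
    features["avg_gap_days"] = gaps["mean"]
    features["max_gap_days"] = gaps["max"]

    # Binary target: did the account complete one or more projects?
    features["completed_any_project"] = features["projects_completed"] > 0
    return features


features = build_account_features(daily)
print(features)
```

Accounts with a single visit get `NaN` for the gap features, so those would need to be filled or dropped before modeling.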

<img src = "Desktop/avg_minutes_spent.png" width="500" height="500">

<img src = "Desktop/completed_projects.png" width="500" height="500">

## Domain Knowledge:
#### 1) Projects are the "capstone" of each of the courses
So project data will act as a reasonable proxy for course completion

#### 2) Udacity was offering a one-week free trial during the time this data was taken
So data for users who signed up and canceled within the first week is excluded
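One way the free-trial filter could look in pandas is sketched below. The column names and the exact cutoff (`days_to_cancel <= 7`) are assumptions for illustration; the sample rows are made up.

```python
import pandas as pd

# Made-up rows standing in for the enrollment CSV (column names assumed).
enrollments = pd.DataFrame({
    "account_key": [1, 2, 3, 4],
    "days_to_cancel": [3.0, 40.0, None, 7.0],  # None = still enrolled
})


def drop_free_trial_cancellations(enrollments: pd.DataFrame,
                                  trial_days: int = 7) -> pd.DataFrame:
    """Remove accounts that canceled during the free trial."""
    # NaN <= trial_days evaluates to False, so current users are kept.
    canceled_in_trial = enrollments["days_to_cancel"] <= trial_days
    return enrollments[~canceled_in_trial]


paid = drop_free_trial_cancellations(enrollments)
print(paid["account_key"].tolist())
```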

## Thought Process: 
I wanted to see how well I could predict if a student will pass at least one project, so I figured I’d start by getting a benchmark by making a prediction using the total data for each user. 

Obviously I couldn’t include the number of completed projects in the model because the dependent variable is derived from it, but I included all the other features, knowing that regularization would knock out the unimportant/duplicative ones.

<img src = "Desktop/heat_map.png" width="500" height="500">

## Benchmark Model:
I threw together a grid-searched, cross-validated logistic regression, which correctly classified each account ~90% of the time, with 80% specificity.

<img src = "Desktop/classification.png" width="500" height="500">
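For readers unfamiliar with the pattern, a grid-searched, cross-validated logistic regression in scikit-learn looks roughly like the sketch below. It runs on synthetic data from `make_classification`, not the Udacity data, and the parameter grid (regularization strength `C` only, with the default L2 penalty) is an assumption rather than the grid used for the benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the per-account feature table.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Scale features first so the regularization treats them comparably.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation over every value in the grid.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```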

## What's specificity, and why do I care?

Specificity is the name of the game here. In simple terms, specificity is the percentage of users who didn't pass that we correctly predicted wouldn’t pass. I’m trying to identify which users aren’t likely to pass the project (my proxy for retention), so Udacity knows who to target with its user retention efforts.

https://en.wikipedia.org/wiki/Sensitivity_and_specificity
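In confusion-matrix terms, specificity = TN / (TN + FP), where "negative" here means "did not pass". A small sketch with made-up labels (0 = did not pass, 1 = passed):

```python
def specificity(y_true, y_pred, negative_label=0):
    """Share of actual negatives that were predicted as negative."""
    true_negatives = sum(
        1 for t, p in zip(y_true, y_pred)
        if t == negative_label and p == negative_label
    )
    actual_negatives = sum(1 for t in y_true if t == negative_label)
    if actual_negatives == 0:
        raise ValueError("no negative examples in y_true")
    return true_negatives / actual_negatives


# Made-up example: 5 users did not pass (0), 5 passed (1).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# 4 of the 5 non-passing users were flagged correctly.
print(specificity(y_true, y_pred))  # 0.8
```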

## Thought Process (again): 

Having the benchmark as something to get a feel for the data and compare against is nice, but its practical use is low. Predicting whether someone will or will not pass after 9 months isn't very useful because they've almost invariably either passed the course and moved on to the next one, or failed and canceled their subscription.

From Udacity's point of view, it'd be best to know as soon as possible which users are at risk of not completing the course (and therefore of no longer paying Udacity, and probably not paying again in the future). So I built a model based on the cumulative user data as of day 8 (the first day after the week-long free trial), but admittedly I was worried that might not be very predictive, and I was interested in figuring out the earliest day at which it might be. So I built more models, one for each day I had data for (roughly 270).

## Daily Models:

## A MODEL PER DAY?! THAT'S A LOT OF MODELS!!!

Confession: I didn't build one model for each day; I built **4** (with a 5th in production!), with some clever (well, maybe not) for loops to run each model for each day and capture the output.
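The loop structure can be sketched as follows. This is an illustration on synthetic data, not the project code: the minutes matrix, the target, the three cumulative features, the model hyperparameters, and the choice of days are all made up. Specificity is computed as recall on the "did not pass" class (label 0).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_accounts, n_days = 200, 30

# Synthetic stand-in: minutes[i, d] = minutes spent by account i on day d + 1.
engagement = rng.gamma(2.0, 10.0, size=n_accounts)
minutes = rng.poisson(engagement[:, None], size=(n_accounts, n_days)).astype(float)

# Made-up target: more total engagement (plus noise) means more likely to pass.
total = minutes.sum(axis=1)
passed = (total + rng.normal(0, 100, n_accounts) > np.median(total)).astype(int)


def features_as_of(day: int) -> np.ndarray:
    """Cumulative features using only data from days 1..day."""
    window = minutes[:, :day]
    return np.column_stack([
        window.sum(axis=1),        # total minutes so far
        window.max(axis=1),        # longest single day so far
        (window > 0).sum(axis=1),  # number of days with any activity
    ])


models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

rows = []
for day in (8, 15, 30):  # the real run would cover every day with data
    X = features_as_of(day)
    for name, model in models.items():
        # Out-of-fold predictions so each account is scored by a model
        # that never saw it during training.
        pred = cross_val_predict(model, X, passed, cv=5)
        rows.append({
            "day": day,
            "model": name,
            "specificity": recall_score(passed, pred, pos_label=0),
        })

results = pd.DataFrame(rows)
print(results.pivot(index="day", columns="model", values="specificity"))
```

Collecting the output in one long table makes it straightforward to pivot into a day-by-model grid for comparison.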

## List of Models:

- Logistic Regression

- Gradient Boosting Classifier

- AdaBoosted Decision Tree Classifier

- Random Forest Classifier

- Support Vector Classifier (In production! But keep this between us. You could say it's "Classified")

## Discussion of Model Choices:

1) Having options is good. The code for the different models in scikit-learn is only nominally different, so the time cost is low and it's a good way to practice my machine learning techniques

2) Natural Experiment: I can compare and contrast the model performances

3) Diversity: Check/Confirm model selection

- Logistic Regression:
Nothing fancy, baseline classification technique
- Gradient Boosting Classifier:
Boosting is useful when there is a disparity in the number of observations in each class. Plus it's hot on Kaggle right now.
- AdaBoosted Decision Tree Classifier:
Again, boosting
- Random Forest Classifier:
Uses bagging, which makes a good comparison to the boosting models. It also builds many small trees on different subsets of features, which together tend to be a stronger learner than a single tree

## Visual Representation of the difference between models (and which to prefer):

Yeah, being honest, I haven't actually gotten this far yet. This is where you would see a stacked specificity over days graph, with a line for each model. The analysis in my presentation hinges on this, so look forward to seeing it tomorrow!

## Current Stakeholder Value:

- Can correctly identify ~70% of at-risk users as of day 8
- Model accuracy increases as days increase
- "Actionable" information

## Looking to the Future:

- Use AWS EC2 to optimize different models
- Build a Bayesian updating model to predict probability of passing the course on each day
- Build a Bayesian model to predict how many courses a user will pass/how long they will be on the site, and estimate $ value of each student
- Automate data pipelines