## Requirements:
* Create a Jupyter notebook with code, visualizations, and markdown, fully run from top to bottom.
* Summarize your exploratory data analysis.
* Frame code so as to enhance your explanations.
* Explain your choice of validation and prediction metrics.
* Visualize relationships between your dependent variable and two of the strongest independent variables.
* Identify areas where new data could help improve the model.

## Summary of exploratory data analysis

## Approach
1. Describe general model selection and success metrics
    * Logistic Regression
    * Model selection will start with recall; plan to test a custom cost-benefit model (cost of impression vs. LTV of application)
    * Benchmark is the campaign's average conversion rate, or maybe "ROI" (applications * value / impression cost)
        * Conversion rate behaves like a precision metric, though (converters as a share of the users shown ads)
2. Data prep
    * Split data between train and test
    * Standardize features
3. 
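
The success metrics in item 1 can be sketched in plain Python. This is only an illustration, assuming a flat cost per impression and a fixed value per application; the function names and the numbers in the usage lines are hypothetical and are not taken from the campaign data.

```python
def campaign_roi(applications, value_per_application, impressions, cost_per_impression):
    """ROI as written in the approach: applications * value / impression cost."""
    impression_cost = impressions * cost_per_impression
    if impression_cost <= 0:
        raise ValueError("impression cost must be positive")
    return applications * value_per_application / impression_cost


def recall_and_precision(y_true, y_pred):
    """Recall and precision for binary labels (1 = converted)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical numbers, for illustration only
roi = campaign_roi(applications=10, value_per_application=50.0,
                   impressions=1000, cost_per_impression=0.01)
recall, precision = recall_and_precision([1, 1, 0, 0], [1, 0, 1, 0])
```

Recall rewards finding every converter, while precision (like conversion rate) rewards spending impressions only on users who convert. A cost-benefit metric would weigh the two against each other.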

### 1. Write up approach (come back to this)
* Import data and clean data
    * Clean data for clicks/conversions
    * Clean data for TimeDiff (so that we handle the nulls)
* Run L1 logistic regression to reduce set of variables
    * Split into test/train
    * Standardize features
    * Define CV
    * V1 -> L1 Logistic Regression
    * V2 -> RF??? (if you want)
    * Compare these to our baseline
* Feature Engineering
    * Use domain knowledge to create new features:
        * Halo effect of exposure across the funnel (U + L)
        * Others?
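
The halo-effect idea (U + L) might be turned into features along the lines below. This sketch assumes "U + L" means a user was exposed to both upper- and lower-funnel impressions. The new column names `Funnel_Upper_And_Lower_FLAG` and `Funnel_Stages_Reached` are hypothetical, and the toy frame is made up for illustration.

```python
import pandas as pd


def add_funnel_halo_features(df):
    """Return a copy of df with two illustrative cross-funnel exposure features."""
    out = df.copy()
    stage_cols = ["Funnel_Upper_Imp", "Funnel_Middle_Imp", "Funnel_Lower_Imp"]
    # 1 if the user saw at least one upper-funnel AND one lower-funnel impression
    out["Funnel_Upper_And_Lower_FLAG"] = (
        (out["Funnel_Upper_Imp"] > 0) & (out["Funnel_Lower_Imp"] > 0)
    ).astype(int)
    # Number of distinct funnel stages the user was exposed to (0 to 3)
    out["Funnel_Stages_Reached"] = (out[stage_cols] > 0).sum(axis=1)
    return out


# Toy data (not from the campaign file)
toy = pd.DataFrame({
    "Funnel_Upper_Imp": [0, 2, 1],
    "Funnel_Middle_Imp": [3, 0, 1],
    "Funnel_Lower_Imp": [0, 5, 0],
})
toy_features = add_funnel_halo_features(toy)
```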


## Import Data and Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

In [25]:
data = pd.read_csv('../data/DATA_FOR_MODEL_20perc.csv', sep=',')
data.head(10)

Unnamed: 0,User_ID,Impressions,TimeDiff_Minutes,TimeDiff_Minutes_AVG,Funnel_Upper_Imp,Funnel_Middle_Imp,Funnel_Lower_Imp,Campaign_Message_Travel_Imp,Campaign_Message_Service_Imp,Campaign_Message_Family_Travel_Imp,...,Creative_Size_320x480_Imp,Creative_Size_Uknown_Imp,Device_Desktop_Imp,Device_Other_Imp,Device_Mobile_Imp,Active_View_Eligible_Impressions,Active_View_Measurable_Impressions,Active_View_Viewable_Impressions,Clicks,Conversions
0,AMsySZYlP3l94iys9P9WaBXIWE6B,3,2.0,1.0,0,3,0,0,3,0,...,0,0,0,0,3,0,0,0,,
1,AMsySZbdCy7kK0BqCq38AvgzDJ7y,3,1.0,0.5,0,3,0,3,0,0,...,0,0,3,0,0,3,3,1,,
2,AMsySZbsx0jjk_iOfpRCVx2ss3v8,2,12.0,12.0,0,2,0,0,0,0,...,0,0,2,0,0,2,2,1,,
3,AMsySZZnAR-zSA0aCGVZkfhupUhU,5,23033.0,5758.25,0,5,0,0,0,0,...,0,0,0,0,5,5,5,5,,
4,AMsySZb1yHw6ewPTnb7h39vBdCh8,2,188.0,188.0,0,2,0,0,0,0,...,0,0,2,0,0,2,2,1,,
5,AMsySZbMUcclS_w01crSZ23CgZsa,1,,,0,1,0,0,0,0,...,0,0,0,0,1,1,1,0,,
6,AMsySZbN3U0Y-dpMwJBHb5KWpB6S,444,73303.0,165.469526,0,85,359,283,161,0,...,0,85,444,0,0,359,359,314,,
7,AMsySZZYbLJhgekE6XQ52ea1bpKb,2,80.0,80.0,0,2,0,0,0,0,...,0,0,2,0,0,2,2,1,,
8,AMsySZaWJ0IyUoZmTqnYg4_LsylU,1,,,0,1,0,0,0,0,...,0,0,1,0,0,1,1,0,,
9,AMsySZbQxjkNdQtKIjG1m3fso1u_,9,56231.0,7028.875,0,9,0,0,0,0,...,0,0,0,0,9,7,7,3,,


In [26]:
# For Clicks and Conversions, convert NULL values to zero
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
data['Clicks'] = data['Clicks'].fillna(0)
data['Conversions'] = data['Conversions'].fillna(0)

# Create a new categorical variable Converted, which is 1 if the user converted at least once
# and 0 if the user did not convert.
data['Converted'] = pd.Categorical((data['Conversions'] > 0).astype(int))

# TimeDiff_Minutes and TimeDiff_Minutes_AVG are NULL when we only have 1 impression.
# For now, replace NULLs with the median value and add a column flagging the rows where we did this.
# We will explore other options for handling this data in the feature engineering section.
data['TimeDiff_NULL_FLAG'] = pd.Categorical(data['TimeDiff_Minutes'].isnull())

data['TimeDiff_Minutes'] = data['TimeDiff_Minutes'].fillna(data['TimeDiff_Minutes'].median())
data['TimeDiff_Minutes_AVG'] = data['TimeDiff_Minutes_AVG'].fillna(data['TimeDiff_Minutes_AVG'].median())
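
One of the other options mentioned above is to compute the median on the training rows only and reuse it for the test rows, so the test set does not influence the imputed value. The sketch below assumes the data has already been split. The helper name `impute_with_train_median`, the per-column `_NULL_FLAG` suffix, and the toy values are hypothetical.

```python
import pandas as pd


def impute_with_train_median(train, test, columns):
    """Flag NULLs, then fill them with the median computed on the training frame only."""
    train, test = train.copy(), test.copy()
    for col in columns:
        median = train[col].median()  # NaNs are skipped by default
        for frame in (train, test):
            frame[col + "_NULL_FLAG"] = frame[col].isnull().astype(int)
            frame[col] = frame[col].fillna(median)
    return train, test


# Toy data (not from the campaign file)
train_toy = pd.DataFrame({"TimeDiff_Minutes": [2.0, None, 12.0, 188.0]})
test_toy = pd.DataFrame({"TimeDiff_Minutes": [None, 80.0]})
train_imp, test_imp = impute_with_train_median(train_toy, test_toy, ["TimeDiff_Minutes"])
```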

In [28]:
# Confirm no nulls are left in the dataset
data.isnull().sum()

User_ID                                 0
Impressions                             0
TimeDiff_Minutes                        0
TimeDiff_Minutes_AVG                    0
Funnel_Upper_Imp                        0
Funnel_Middle_Imp                       0
Funnel_Lower_Imp                        0
Campaign_Message_Travel_Imp             0
Campaign_Message_Service_Imp            0
Campaign_Message_Family_Travel_Imp      0
Campaign_Card_Cash_Rewards_Imp          0
Campaign_Card_Premium_Rewards_Imp       0
Campaign_Card_Other_Imp                 0
Creative_Type_Display_Imp               0
Creative_Type_TrueView_Imp              0
Creative_Type_RichMediaExpanding_Imp    0
Creative_Type_RichMedia_Imp             0
Creative_Size_728x90_Imp                0
Creative_Size_300x600_Imp               0
Creative_Size_300x250_Imp               0
Creative_Size_160x600_Imp               0
Creative_Size_468x60_Imp                0
Creative_Size_300x50_Imp                0
Creative_Size_320x50_Imp          

In [None]:
# Drop dummies with redundant information
# Funnel

# Campaign

# Creative Type/Size

# Device
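
The cell above is still a placeholder. A possible way to fill it in is sketched below, on the assumption that the per-group impression counts (for example the three `Funnel_*` columns) add up to `Impressions`, which makes one column per group linearly redundant. The helper name, the prefix list, and the choice to drop the last column of each group are all assumptions. A group is only touched when the sum check actually holds.

```python
import pandas as pd

# Column-name prefixes taken from the dataset header; grouping them this way is an assumption
GROUP_PREFIXES = ["Funnel_", "Campaign_Message_", "Campaign_Card_",
                  "Creative_Type_", "Creative_Size_", "Device_"]


def drop_one_column_per_group(df, prefixes, total_col="Impressions"):
    """Drop the last column of each group whose columns sum to total_col on every row."""
    to_drop = []
    for prefix in prefixes:
        cols = [c for c in df.columns if c.startswith(prefix)]
        if len(cols) > 1 and (df[cols].sum(axis=1) == df[total_col]).all():
            to_drop.append(cols[-1])
    return df.drop(columns=to_drop), to_drop


# Toy data (not from the campaign file)
toy = pd.DataFrame({
    "Impressions": [3, 2],
    "Funnel_Upper_Imp": [0, 1],
    "Funnel_Lower_Imp": [3, 1],
    "Device_Desktop_Imp": [1, 0],
    "Device_Mobile_Imp": [1, 1],
})
reduced, dropped = drop_one_column_per_group(toy, GROUP_PREFIXES)
```

In the toy frame only the funnel group sums to `Impressions`, so only one funnel column is removed and the device columns are left alone.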

In [30]:
X = data.drop(['User_ID', 'Conversions', 'Converted'], axis=1)
y = data['Converted']

## Feature Selection

In [None]:
# Use L1 regularization with Logistic Regression to Identify Important/Non-Important variables

In [34]:
# Split between train vs test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=7)
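
Conversions look rare in this data, so a plain random split can leave the test set with very few positives. A stratified split is one option worth considering. The sketch below uses synthetic data with a made-up 2% positive rate to show the effect of the `stratify` argument; none of the numbers come from the campaign file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 20 positives (2%), for illustration only
rng = np.random.RandomState(7)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy keeps the positive rate the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7, stratify=y_toy
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```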

In [35]:
# standardization: bring all of our features onto the same scale
# this makes it easier for ML algorithms to learn
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# transform our training features
X_train_std = stdsc.fit_transform(X_train)
# transform the testing features in the same way
X_test_std = stdsc.transform(X_test)

In [None]:
# 20 cross validation iterations with 30% test / 70% train
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)

In [None]:
# Define Logistic Regression


In [None]:
# Grid Search
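
The two placeholder cells above (model definition and grid search) could be filled in roughly as follows. This is a sketch on synthetic data, not the notebook's actual model: the `C` grid, `class_weight="balanced"`, and the use of a `Pipeline` (so that scaling is refit inside each CV split) are assumptions, while recall as the scoring metric and `ShuffleSplit` with 30% test follow the approach described earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for X_train / y_train
X_toy, y_toy = make_classification(
    n_samples=600, n_features=12, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# L1-penalized logistic regression; liblinear is one solver that supports the L1 penalty
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear",
                               class_weight="balanced")),
])

# Fewer splits than the notebook's 20, only to keep the sketch fast
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

# Smaller C means stronger regularization and more coefficients driven to zero
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1.0]},
                    scoring="recall", cv=cv)
grid.fit(X_toy, y_toy)

coefs = grid.best_estimator_.named_steps["clf"].coef_.ravel()
kept_features = np.flatnonzero(coefs)  # indices of features with non-zero weight
print("best C:", grid.best_params_["clf__C"])
print("features kept:", kept_features)
```

Features whose coefficients are exactly zero at the selected `C` are candidates for removal, which is the variable-reduction step the approach describes.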