## Summary of exploratory data analysis

The bullets that matter:
    

## Approach
* L1 logistic regression
    * Chosen so that the coefficients are understandable and intuitive to a non-technical audience
    * L1 penalty to reduce the number of features, given the correlations identified in EDA
* Grid search on the C value
    * Use AUC as the metric for the quality of the resulting signal: the **predicted probabilities** that a user applies for a credit card
    * AUC is the scoring metric passed to `GridSearchCV`, so the selected model is the one that gives the best signal for identifying converters
    * Plot the ROC curve to visualize
* Experiment/Feature Engineering
    * Based on the above, remove the features identified as non-important by L1 LogReg before adding new features to experiment with
    * Funnel Halo Effect
    * Categorical for Time Diff
    * % Viewable Impressions
* Business Performance Metrics
    * Baseline will be provided for the campaign that ran in Q4 (see more details below)
    * Model can be compared to this by using our model signal to 


## Define Business Performance Metrics

While we will use AUC to evaluate the quality of our models with `GridSearchCV`, we also want to calculate the business performance metrics **ROI** and **Net Value**, as defined below.  Our focus will be to increase the **Net Value** of an ad campaign, but we will also calculate ROI for reference.

To calculate these metrics, we need to define how we will interpret the confusion matrix:

| | True Class: Positive | True Class: Negative |
|---|:---:|:---:|
| **Predicted Class: Positive** | True Positives | False Positives |
| **Predicted Class: Negative** | False Negatives | True Negatives |

Since we plan to use our model to decide how we should design our ad campaigns, we can interpret the predicted class as a signal for whether or not we'd like to reach similar users in a similar way:
* Positive: Signal to spend on user because of high likelihood to convert
* Negative: Signal to **not** spend on user

For example, if we predict a positive class for an observation in our dataset that was reached by 10 video ads and 15 display ads, we would plan to spend on similar users.

Based on this interpretation, we can develop a cost-benefit methodology based on the confusion matrix:

| | True Class: Positive | True Class: Negative |
|---|:---:|:---:|
| **Predicted Class: Positive** | Application Value<br>- Cost of Reaching User | - Cost of Reaching User |
| **Predicted Class: Negative** | 0 | 0 |

**Assumptions**
* **ApplicationValue** = \$500 (The average value of a credit card application)
* **ReachCost** = \$0.033827 (The average cost of reaching a single user for our campaign)
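
Under these two assumptions, reaching a user pays off in expectation only when that user's conversion probability times ApplicationValue exceeds ReachCost. The sketch below works out that break-even probability. The variable and function names are illustrative only, and the reasoning assumes the predicted probabilities are reasonably well calibrated, which this notebook does not verify.

```python
# Assumed values, taken from the Assumptions list above
application_value = 500.0   # value of one credit card application, in dollars
reach_cost = 0.033827       # average cost of reaching one user, in dollars


def expected_net_value(p):
    """Expected net value (dollars) of reaching one user who converts with probability p."""
    return p * application_value - reach_cost


# Setting the expected net value to zero gives the break-even probability
break_even_p = reach_cost / application_value

print("Break-even conversion probability: {:.6%}".format(break_even_p))
print("Expected net value at p = 0.001: {:.4f}".format(expected_net_value(0.001)))
```

Because the break-even probability is so small, a useful threshold on the predicted probabilities is likely to sit far below the conventional 0.5 cut-off.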


Using this interpretation allows us to define the following **ROI** and **Net Value** functions:

\begin{equation*}
\text{ROI} = \frac{\text{Total Value}}{\text{Total Cost}} - 1
\end{equation*}

\begin{equation*}
\text{Net Value} = \text{Total Value} - \text{Total Cost}
\end{equation*}

\begin{equation*}
\text{Total Value} = \text{ApplicationValue} \times TP
\end{equation*}

\begin{equation*}
\text{Total Cost} = \text{ReachCost} \times (TP + FP)
\end{equation*}
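
As a worked example of the four formulas above, the sketch below plugs in made-up confusion-matrix counts. The `tp` and `fp` values are hypothetical and are not results from this campaign.

```python
# Assumptions from the section above
application_value = 500.0   # dollars per credit card application
reach_cost = 0.033827       # dollars per user reached

# Hypothetical counts, for illustration only
tp = 20        # users we targeted who converted
fp = 99980     # users we targeted who did not convert

total_value = application_value * tp      # value comes only from true positives
total_cost = reach_cost * (tp + fp)       # we pay for every user we target
roi = total_value / total_cost - 1
net_value = total_value - total_cost

print("Total Value: {:.2f}".format(total_value))
print("Total Cost:  {:.2f}".format(total_cost))
print("ROI:         {:.4f}".format(roi))
print("Net Value:   {:.2f}".format(net_value))
```

False negatives and true negatives do not appear in either formula, since users we do not target contribute neither cost nor value under this methodology.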

Once we have determined our best estimator with AUC and `GridSearchCV`, we can then tune our **threshold** for a positive signal based on the predicted probability from our logistic regression model.  We will tune the **threshold** to maximize **Net Value**.
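
The threshold tuning described here can be sketched on toy data as follows. This is a simplified illustration, not the procedure used later in the notebook: the labels and scores are invented, the helper names are hypothetical, and the candidate thresholds are simply the distinct scores rather than points taken from the ROC curve.

```python
def net_value(y_true, y_pred, app_value=500.0, reach_cost=0.033827):
    """Net Value = ApplicationValue * TP - ReachCost * (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return app_value * tp - reach_cost * (tp + fp)


def best_threshold(y_true, scores, thresholds):
    """Return the (threshold, net value) pair with the highest Net Value."""
    best = None
    for thr in thresholds:
        # A user is predicted positive when the score reaches the threshold
        y_pred = [1 if s >= thr else 0 for s in scores]
        value = net_value(y_true, y_pred)
        if best is None or value > best[1]:
            best = (thr, value)
    return best


# Invented labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.00001, 0.00002, 0.00003, 0.00005, 0.0002, 0.0003, 0.004, 0.00004]

thr, value = best_threshold(y_true, scores, sorted(set(scores)))
print("Best threshold: {}  Net Value: {:.2f}".format(thr, value))
```

On this toy data the chosen threshold is the highest one that still captures both converters, because a missed application costs far more than the few extra users reached.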

## Import Data and Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc
%matplotlib inline

In [2]:
#data = pd.read_csv('../data/DATA_FOR_MODEL_20perc.csv', sep=',')
data = pd.read_csv('../data/DATA_FOR_MODEL_FULL.csv', sep=',')
data.head(10)

Unnamed: 0,User_ID,Impressions,TimeDiff_Minutes,TimeDiff_Minutes_AVG,Funnel_Upper_Imp,Funnel_Middle_Imp,Funnel_Lower_Imp,Campaign_Message_Travel_Imp,Campaign_Message_Service_Imp,Campaign_Message_Family_Travel_Imp,...,Creative_Size_320x480_Imp,Creative_Size_Uknown_Imp,Device_Desktop_Imp,Device_Other_Imp,Device_Mobile_Imp,Active_View_Eligible_Impressions,Active_View_Measurable_Impressions,Active_View_Viewable_Impressions,Clicks,Conversions
0,AMsySZb5URoHQAqFtc2yx7eWq2AQ,4,9.0,3.0,0,4,0,0,0,0,...,0,0,4,0,0,4,4,1,,
1,AMsySZZBemBdfIkICNi3QoUi495D,2,39.0,39.0,0,2,0,0,0,0,...,0,0,2,0,0,2,2,2,,
2,AMsySZYC0gKN-GlCxK2WHC9VbmRV,4,301.0,100.333333,0,4,0,0,0,0,...,0,0,4,0,0,4,4,3,,
3,AMsySZZYuKRxsvW7VFSOGRWlsYZ6,1,,,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,,
4,AMsySZarmBmNJttVh1RdvZNlN7d5,3,103.0,51.5,0,3,0,0,0,0,...,0,0,3,0,0,3,3,1,,
5,AMsySZZF6A8-Mo46fGpuijpIL7cP,1,,,1,0,0,1,0,0,...,0,1,0,1,0,0,0,0,,
6,AMsySZaOxWidhMNLX5hVPrNdHPc7,1,,,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,,
7,AMsySZYY1IxHAlXNUPDIIleOGYBu,3,1.0,0.5,0,3,0,0,0,0,...,0,0,3,0,0,3,2,2,,
8,AMsySZaGCfGA6E4bWakJjek7gQQo,1,,,0,1,0,1,0,0,...,0,0,1,0,0,1,1,0,,
9,AMsySZb1P1fsmv4REnCL0Ah_9Hcl,2,4.0,4.0,2,0,0,1,1,0,...,0,2,2,0,0,0,0,0,,


## Handle Missing Data

In [3]:
# For Clicks and Conversions, convert NULL values to zero
# (assign the result back instead of using inplace=True on a column selection)
data['Clicks'] = data['Clicks'].fillna(0)
data['Conversions'] = data['Conversions'].fillna(0)

# Create a new categorical variable Converted, which will be 1 if the user converted at least once,
# and 0 if the user did not convert.
data['Converted'] = pd.Categorical([1 if x>0 else 0 for x in data['Conversions']])

# TimeDiff_Minutes and TimeDiff_Minutes_AVG are NULL when we only have 1 impression
# For now, flag the rows where this happens and then replace the NULLs with the median value
# We will explore other options for handling this data in the feature engineering section
data['TimeDiff_NULL_FLAG'] = pd.Categorical(data['TimeDiff_Minutes'].isnull())

data['TimeDiff_Minutes'] = data['TimeDiff_Minutes'].fillna(data['TimeDiff_Minutes'].median())
data['TimeDiff_Minutes_AVG'] = data['TimeDiff_Minutes_AVG'].fillna(data['TimeDiff_Minutes_AVG'].median())

In [4]:
# Confirm no nulls are left in the dataset
data.isnull().sum()

User_ID                                 0
Impressions                             0
TimeDiff_Minutes                        0
TimeDiff_Minutes_AVG                    0
Funnel_Upper_Imp                        0
Funnel_Middle_Imp                       0
Funnel_Lower_Imp                        0
Campaign_Message_Travel_Imp             0
Campaign_Message_Service_Imp            0
Campaign_Message_Family_Travel_Imp      0
Campaign_Card_Cash_Rewards_Imp          0
Campaign_Card_Premium_Rewards_Imp       0
Campaign_Card_Other_Imp                 0
Creative_Type_Display_Imp               0
Creative_Type_TrueView_Imp              0
Creative_Type_RichMediaExpanding_Imp    0
Creative_Type_RichMedia_Imp             0
Creative_Size_728x90_Imp                0
Creative_Size_300x600_Imp               0
Creative_Size_300x250_Imp               0
Creative_Size_160x600_Imp               0
Creative_Size_468x60_Imp                0
Creative_Size_300x50_Imp                0
Creative_Size_320x50_Imp          

In [5]:
# Show the distribution of impressions (based on the exploratory analysis, it should affect all other features)
data['Impressions'].describe()

count    1.517734e+07
mean     4.881202e+00
std      1.534301e+01
min      1.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      3.000000e+00
max      5.600000e+03
Name: Impressions, dtype: float64

In [6]:
# Remove outliers (users with more than ~50 impressions, about 3 standard deviations above the mean of 4.9 impressions)
print(data['Impressions'].max())
data = data[data['Impressions']<=50]
print(data['Impressions'].max())

5600
50


In [7]:
X=data.drop(['User_ID','Conversions','Converted'],axis=1)
y=data['Converted']

## Business Performance Functions and Baseline Benchmark

In [9]:
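# Define ROI function
# *See section "Define Business Performance Metrics" for more details
def ROI(y,y_pred,app_value,avg_user_cost):
    # labels=[0,1] guarantees a 2x2 matrix even if one class is missing
    tn, fp, fn, tp = confusion_matrix(y_true=y, y_pred=y_pred, labels=[0,1]).ravel()
    
    total_value=float(tp*app_value)
    total_cost=float((tp+fp)*avg_user_cost)
    # ROI is undefined when no users are targeted (zero cost); return NaN instead of raising
    if total_cost==0:
        return np.nan
    
    return total_value/total_cost-1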

In [10]:
# Define Net Value function
# *See section "Define Business Performance Metrics" for more details
def Net_Value(y,y_pred,app_value,avg_user_cost):
    # labels=[0,1] guarantees a 2x2 matrix even if one class is missing
    tn, fp, fn, tp = confusion_matrix(y_true=y, y_pred=y_pred, labels=[0,1]).ravel()
    
    total_value=float(tp*app_value)
    total_cost=float((tp+fp)*avg_user_cost)
    
    return total_value-total_cost

#### Baseline Performance
Get baseline performance by assuming our classifier predicted "1" (or "positive") for every observation in our dataset.
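
When every user is predicted positive, TP is simply the number of converters and FP the number of non-converters, so the baseline depends only on the audience size and the conversion rate. A minimal sketch with invented counts (the function name and the numbers are hypothetical and not taken from the campaign data):

```python
def baseline_metrics(n_users, n_converters, app_value=500.0, reach_cost=0.033827):
    """ROI and Net Value when every user in the dataset is targeted."""
    tp = n_converters              # all converters are targeted
    fp = n_users - n_converters    # all non-converters are targeted too
    total_value = app_value * tp
    total_cost = reach_cost * (tp + fp)
    return total_value / total_cost - 1, total_value - total_cost


# Invented audience: 1,000,000 users reached, 100 of whom converted
baseline_roi, baseline_net = baseline_metrics(n_users=1000000, n_converters=100)

# Equivalent form of the baseline ROI in terms of the conversion rate
conversion_rate = 100 / 1000000.0
roi_from_rate = 500.0 * conversion_rate / 0.033827 - 1

print("Baseline ROI: {:.4f}  Baseline Net Value: {:.2f}".format(baseline_roi, baseline_net))
```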

In [12]:
# Assumptions:
#    Value of credit card application=$500
#    Avg cost of reaching a user=$0.033827
ROI_benchmark = ROI(y=data['Converted'],
                    y_pred=np.ones(data['Converted'].size),
                    app_value=500,
                    avg_user_cost=0.033827)
Net_Value_benchmark = Net_Value(y=data['Converted'],
                                y_pred=np.ones(data['Converted'].size),
                                app_value=500,
                                avg_user_cost=0.033827)


In [None]:
# Convert ROI function into a model scorer
# This factory function wraps scoring functions for use in GridSearchCV and cross_val_score
ROI_scorer = make_scorer(ROI, app_value=500, avg_user_cost=0.033827, greater_is_better=True)

## L1 Regressions (before new features)

In [None]:
# Use L1 regularization with Logistic Regression to Identify Important/Non-Important variables

In [None]:
# Split between train vs test
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=7)

In [None]:
# standardization: bring all of our features onto the same scale
# this makes it easier for ML algorithms to learn
stdsc = StandardScaler()
# transform our training features
X_train_std = stdsc.fit_transform(X_train)
# transform the testing features in the same way
X_test_std = stdsc.transform(X_test)

In [None]:
# 5 cross validation iterations with 30% test / 70% train
#from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

In [None]:
%%time
from sklearn.model_selection import GridSearchCV
# the parameters we want to search in a dictionary
# use the parameter name from sklearn as the key
# and the possible values you want to test as the values

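# solver='liblinear' supports the L1 penalty (the default solver in newer scikit-learn versions does not)
logregL1 = LogisticRegression(penalty='l1', solver='liblinear')
parameters = {'C': [0.001,0.05,0.1,0.3,0.5]}
clf = GridSearchCV(logregL1, parameters, cv=cv,scoring="roc_auc")  #other options are "recall" and "f1"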
clf.fit(pd.DataFrame(X_train_std), pd.Series(y_train))   #turning into pandas dataframe or series prevents an issue

In [None]:
print("{} {}".format(clf.best_params_, clf.best_score_))

In [None]:
clf.cv_results_

In [None]:
best_logreg = clf.best_estimator_

In [None]:
# Look at coefficients that are zero
#lgl1_coeff = pd.DataFrame(zip(X.columns,logreg.coef_),columns=['Features','Coefficients'])
best_coeff=pd.DataFrame({'Features':X.columns,
                         'Coefficients':best_logreg.coef_[0]})
#lgl1_coeff

In [None]:
#Show features with coefficients as zero
best_coeff[best_coeff['Coefficients']==0]

In [None]:
best_coeff[best_coeff['Coefficients']!=0]

In [None]:
#ROC curve (change this to Test?)
scores = best_logreg.predict_proba(X_test_std)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, scores)# pos_label=2)
plt.plot(fpr,tpr)

z=np.linspace(0,1,20)
plt.plot(z,z);

In [None]:
auc(fpr, tpr)

In [None]:
#Distribution of predictions
plt.hist(scores);

In [None]:
#Distribution of predictions (Zoom in)
max_in_chart=0.0003
plt.hist(scores[scores<max_in_chart],bins=20);
#plt.xlim(xmin=0,xmax=.005)

In [None]:
scores.shape

In [None]:
roc_df = pd.DataFrame({'thresholds':thresholds,
                  'tpr':tpr,
                  'fpr':fpr}, columns=['thresholds','tpr','fpr'])
roc_df.head()

In [None]:
#Get the threshold at various TPR cut-offs
breakouts=100
rows=[]
for i in range(0,breakouts):
    tpr_cut = i/float(breakouts)+0.1
    # among ROC points with tpr <= tpr_cut, keep the one with the lowest threshold
    # (cut-offs of 1.0 and above, i.e. i >= 90, all select the same final ROC point)
    rows.append(roc_df[roc_df['tpr']<=tpr_cut].sort_values(by='thresholds',ascending=True).head(1))
roc_cuts=pd.concat(rows)
roc_cuts

In [None]:
#ROI benchmark
# Get baseline value of campaign AS IS
# Baseline scenario is we go after ALL the users in the dataset
# Value of credit card application=$500
# Avg cost of reaching user=$0.033827
ROI(y=y_test,
    y_pred=np.ones(y_test.size),  #predicting all will convert (or worthwhile going after)
    app_value=500,
    avg_user_cost=0.033827)

In [None]:
%%time
# Evaluate ROI and Net Value at each candidate probability threshold
ROIs=[]
Net_Values=[]

# The predicted probabilities do not depend on the threshold, so compute them once
scores = best_logreg.predict_proba(X_test_std)[:,1]

for i in range(0,breakouts):
    prob_threshold=roc_cuts.iloc[i,0]   # column 0 is 'thresholds'
    y_pred = (scores>=prob_threshold).astype(int)
    
    ROIs.append(ROI(y=y_test,y_pred=y_pred,app_value=500,avg_user_cost=0.033827))
    Net_Values.append(Net_Value(y=y_test,y_pred=y_pred,app_value=500,avg_user_cost=0.033827))

In [None]:
value_df=pd.DataFrame({'Thresholds':roc_cuts['thresholds'],
                       'tpr':roc_cuts['tpr'],
                       'fpr':roc_cuts['fpr'],
                       'ROI':ROIs,
                       'Net_Value':Net_Values})


value_df

In [None]:
fig, ax1 = plt.subplots()
# ROI on the left y-axis
ax1.plot(value_df['Thresholds'],value_df['ROI'],'b-')
ax1.set_xlabel('Probability threshold')
# Make the y-axis label, ticks and tick labels match the line color.
ax1.set_ylabel('ROI', color='b')
ax1.tick_params('y', colors='b')

# Net Value on a second y-axis that shares the same x-axis
ax2 = ax1.twinx()
ax2.plot(value_df['Thresholds'],value_df['Net_Value'],'r-')
ax2.set_ylabel('Net Value', color='r')
ax2.tick_params('y', colors='r')

fig.tight_layout()

In [None]:
plt.plot(value_df['tpr'],value_df['Net_Value']);

In [None]:
plt.plot(value_df['tpr'],value_df['ROI']);

In [None]:
#Plot the distribution of scores by actual 1 or 0 on test data? (like in ROC YouTube video)

In [None]:
#np.exp(-0.167144)

In [None]:
X.describe().transpose()