# <center> Lord of the Machines - Data Science Hackathon

<b>Problem Statement</b><br>
Email marketing is still the most successful marketing channel and an essential element of any digital marketing strategy. Marketers spend a lot of time writing that perfect email, labouring over each word and designing catchy layouts for multiple devices, to get best-in-industry open and click rates.

How can I build my campaign to increase the click-through rates of email? - a question that is often heard when marketers are creating their email marketing plans.

Can we optimize our email marketing campaigns with Data Science?

It's time to unlock marketing potential and build some exceptional data-science products for email marketing.

Analytics Vidhya sends out marketing emailers for various events such as conferences, hackathons, etc. We have provided a sample of user-email interaction data from July 2017 to December 2017. You are required to predict the click probability of links inside a mailer for email campaigns from January 2018 to March 2018.

<b>Evaluation Metric</b><br>
The evaluation metric for this competition is AUC-ROC score.
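
AUC-ROC can be read as the probability that a randomly chosen clicked email gets a higher predicted score than a randomly chosen non-clicked one, with ties counting as half. A minimal pure-Python sketch of that pairwise definition follows; the function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data (made up for illustration)
toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because only the ranking of the scores matters, a model that outputs the same probability for every row scores 0.5 regardless of how imbalanced the classes are.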

<b>Datasets:</b><br>
We are provided with 3 datasets:
1. campaign_data.csv -  Contains the features related to 52 email Campaigns
2. train.csv - Contains the click and open information for each user corresponding to given campaign id (Jul 17 - Dec 17)
3. test.csv - Contains the user and campaigns for which is_click needs to be predicted (Jan 18 - Mar 18)

For more information on this hackathon and for downloading datasets:<br>
https://datahack.analyticsvidhya.com/contest/lord-of-the-machines/?utm_source=sendinblue&utm_campaign=Lord_Of_The_Machines__Go_live&utm_medium=email

In [1]:
#Import the basic libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell #To print multiple outputs
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
#Read the campaign, train and test csv files provided and view the data
campaign = pd.read_csv('campaign_data.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')
campaign.shape
campaign.head(3)
train.shape
train.head()
test.shape
test.head()
sample.shape
sample.head()

(52, 9)

Unnamed: 0,campaign_id,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,email_body,subject,email_url
0,29,Newsletter,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
1,30,Upcoming Events,18,14,7,1,"Dear AVians,\r\n \r\nAre your eager to know wh...",[July] Data Science Expert Meetups & Competiti...,http://r.newsletters.analyticsvidhya.com/7up0e...
2,31,Conference,15,13,5,1,Early Bird Pricing Till August 07  Save upto ...,Last chance to convince your boss before the E...,http://r.newsletters.analyticsvidhya.com/7usym...


(1023191, 6)

Unnamed: 0,id,user_id,campaign_id,send_date,is_open,is_click
0,42_14051,14051,42,01-09-2017 19:55,0,0
1,52_134438,134438,52,02-11-2017 12:53,0,0
2,33_181789,181789,33,24-07-2017 15:15,0,0
3,44_231448,231448,44,05-09-2017 11:36,0,0
4,29_185580,185580,29,01-07-2017 18:01,0,0


(773858, 4)

Unnamed: 0,id,campaign_id,user_id,send_date
0,63_122715,63,122715,01-02-2018 22:35
1,56_76206,56,76206,02-01-2018 08:15
2,57_96189,57,96189,05-01-2018 18:25
3,56_166917,56,166917,02-01-2018 08:15
4,56_172838,56,172838,02-01-2018 08:12


(773858, 2)

Unnamed: 0,id,is_click
0,63_122715,0.516037
1,56_76206,0.535007
2,57_96189,0.481492
3,56_166917,0.527826
4,56_172838,0.525384


In [3]:
#Merge train and test datasets with campaign dataset on 'campaign_id' feature
train_merged = pd.merge(train, campaign, on='campaign_id', how='left').reset_index(drop=True)
test_merged = pd.merge(test, campaign, on='campaign_id', how='left').reset_index(drop=True)
train_merged.shape
test_merged.shape
train_merged.head(3)

(1023191, 14)

(773858, 12)

Unnamed: 0,id,user_id,campaign_id,send_date,is_open,is_click,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,email_body,subject,email_url
0,42_14051,14051,42,01-09-2017 19:55,0,0,Newsletter,88,79,13,4,"September Newsletter\r\n \r\nDear AVians,\r\n ...",[September] Exciting days ahead with DataHack ...,http://r.newsletters.analyticsvidhya.com/7v3rd...
1,52_134438,134438,52,02-11-2017 12:53,0,0,Newsletter,67,62,10,4,"November Newsletter\r\n \r\nDear AVians,\r\n \...",[Newsletter] Stage for DataHack Summit 2017 is...,http://r.newsletters.analyticsvidhya.com/7vtb2...
2,33_181789,181789,33,24-07-2017 15:15,0,0,Others,7,3,1,1,Fireside Chat with DJ Patil - the master is he...,"[Delhi NCR] Fireside Chat with DJ Patil, Forme...",http://r.newsletters.analyticsvidhya.com/7uvlg...


In [4]:
#Check datatypes of the features in train data
train_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1023191 entries, 0 to 1023190
Data columns (total 14 columns):
id                      1023191 non-null object
user_id                 1023191 non-null int64
campaign_id             1023191 non-null int64
send_date               1023191 non-null object
is_open                 1023191 non-null int64
is_click                1023191 non-null int64
communication_type      1023191 non-null object
total_links             1023191 non-null int64
no_of_internal_links    1023191 non-null int64
no_of_images            1023191 non-null int64
no_of_sections          1023191 non-null int64
email_body              1023191 non-null object
subject                 1023191 non-null object
email_url               1023191 non-null object
dtypes: int64(8), object(6)
memory usage: 109.3+ MB


In [None]:
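#The feature 'send_date' is of type object and needs to be converted into datetime.
#The raw strings are day-first (dd-mm-yyyy hh:mm), so parse them with dayfirst=True
train_merged['send_date'] = pd.to_datetime(train_merged.send_date, dayfirst=True)
test_merged['send_date'] = pd.to_datetime(test_merged.send_date, dayfirst=True)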
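
The `send_date` strings appear to be day-first: train.csv contains 24-07-2017, which cannot be a month-first date, and the row dated 01-09-2017 belongs to the "September Newsletter" campaign. Which reading is used changes every date feature derived later. A small standard-library check on that one value (the variable names are illustrative):

```python
from datetime import datetime

raw = "01-09-2017 19:55"  # first send_date value shown for train.csv

# Day-first reading: 1 September 2017
day_first = datetime.strptime(raw, "%d-%m-%Y %H:%M")
# Month-first reading: 9 January 2017
month_first = datetime.strptime(raw, "%m-%d-%Y %H:%M")

# weekday() uses the same convention as pandas dayofweek: Monday=0 ... Sunday=6
print(day_first.month, day_first.day, day_first.weekday())
print(month_first.month, month_first.day, month_first.weekday())
```

The two readings disagree on month, day and weekday, so it is worth confirming the parsed values against a few known campaigns before building features on them.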

In [5]:
#Check unique communication types in train and test datasets
train_merged.communication_type.unique()
test_merged.communication_type.unique()

array(['Newsletter', 'Others', 'Upcoming Events', 'Conference',
       'Corporate', 'Hackathon', 'Webinar'], dtype=object)

array(['Newsletter', 'Upcoming Events', 'Hackathon', 'Corporate'], dtype=object)

Note that the communication types Others, Conference and Webinar are not present in the test dataset.

In [6]:
#Let us encode the labels of communication types using sklearn's LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_merged.communication_type = le.fit_transform(train_merged.communication_type)
#Reuse the mapping fitted on train so that each code means the same category in both datasets
test_merged.communication_type = le.transform(test_merged.communication_type)
train_merged.communication_type.unique()
test_merged.communication_type.unique()

array([3, 4, 5, 0, 1, 2, 6], dtype=int64)

array([3, 5, 2, 1], dtype=int64)

## Feature Engineering

In [None]:
#Let's split the feature 'send_date' into its parts to gain more insights
train_merged['day_of_week'] = train_merged['send_date'].dt.dayofweek
train_merged['hour'] = train_merged['send_date'].dt.hour
train_merged['day'] = train_merged['send_date'].dt.day
train_merged['month'] = train_merged['send_date'].dt.month
#pandas dayofweek runs Monday=0 ... Sunday=6, so Saturday and Sunday are 5 and 6
train_merged['IsWeekend'] = train_merged['day_of_week'].isin([5, 6]).astype(int)

In [None]:
test_merged['day_of_week'] = test_merged['send_date'].dt.dayofweek
test_merged['hour'] = test_merged['send_date'].dt.hour
test_merged['day'] = test_merged['send_date'].dt.day
test_merged['month'] = test_merged['send_date'].dt.month
#pandas dayofweek runs Monday=0 ... Sunday=6, so Saturday and Sunday are 5 and 6
test_merged['IsWeekend'] = test_merged['day_of_week'].isin([5, 6]).astype(int)
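
Two details matter when building a weekend flag from `dayofweek`: pandas numbers the days Monday=0 through Sunday=6, and in Python `|` binds more tightly than `==`, so an expression like `x==0 | x==6` is parsed as the chained comparison `x == (0 | x) == 6`. A plain-Python sketch comparing that expression with an explicit membership test (both function names are hypothetical):

```python
def precedence_pitfall(x):
    # Parsed as x == (0 | x) == 6, which is true only when x == 6
    return 0 if x == 0 | x == 6 else 1

def is_weekend(x):
    # Monday=0 ... Saturday=5, Sunday=6
    return 1 if x in (5, 6) else 0

days = list(range(7))
pitfall_flags = [precedence_pitfall(d) for d in days]
weekend_flags = [is_weekend(d) for d in days]
print(pitfall_flags)
print(weekend_flags)
```

The first list singles out only Sunday (and with inverted polarity), while the second marks Saturday and Sunday as 1.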

In [7]:
#Load the merged, feature-engineered datasets from previously saved csv files
train_merged = pd.read_csv('train_merged.csv')
test_merged = pd.read_csv('test_merged.csv')

In [8]:
train_merged.head(2)

Unnamed: 0,id,user_id,campaign_id,send_date,is_open,is_click,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,email_body,subject,email_url,day_of_week,hour,day,month,IsWeekend
0,42_14051,14051,42,2017-01-09 19:55:00,0,0,3,88,79,13,4,"September Newsletter\r\r\n \r\r\nDear AVians,\...",[September] Exciting days ahead with DataHack ...,http://r.newsletters.analyticsvidhya.com/7v3rd...,0,19,9,1,1
1,52_134438,134438,52,2017-02-11 12:53:00,0,0,3,67,62,10,4,"November Newsletter\r\r\n \r\r\nDear AVians,\r...",[Newsletter] Stage for DataHack Summit 2017 is...,http://r.newsletters.analyticsvidhya.com/7vtb2...,5,12,11,2,1


In [9]:
#Check the spread of target classes
train_merged.is_click.value_counts()

0    1010409
1      12782
Name: is_click, dtype: int64

## Danger of Imbalanced Classes

Class 0 heavily outnumbers class 1 (1,010,409 rows against 12,782, roughly 1.25% positives), so any model trained on this data is biased towards class 0.<br>
As a result, class 1 observations also tend to be predicted as class 0. Standard accuracy no longer reliably measures performance, which makes model training much trickier.
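
To see why accuracy is misleading here, it helps to compute what a trivial "always predict 0" rule would score on the class counts reported above. A short sketch, using only those two counts (variable names are illustrative):

```python
n_negative = 1010409  # rows with is_click == 0
n_positive = 12782    # rows with is_click == 1

total = n_negative + n_positive
positive_rate = n_positive / total
# Accuracy of a rule that predicts class 0 for every row
always_zero_accuracy = n_negative / total

print(f"positive rate: {positive_rate:.4%}")
print(f"always-0 accuracy: {always_zero_accuracy:.4%}")
```

Any model whose accuracy sits near this baseline may simply be predicting the majority class, which is why a ranking metric such as ROC AUC is more informative on this data.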

In [10]:
#Take the necessary features only for modelling
feature_cols = train_merged.columns.drop(['id','user_id','send_date','email_body', 'subject', 'email_url','is_open','is_click'])

In [55]:
#First, let's import the Logistic Regression algorithm and the accuracy metric from Scikit-Learn.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics

# Separate input features (X) and target variable (y)
X = train_merged[feature_cols]
y = train_merged.is_click

#train-test split
x_train, x_test, y_train, y_test = train_test_split(X, y)

#cross validation
logreg = LogisticRegression().fit(x_train,y_train)
logreg_score = cross_val_score(logreg, x_train, y_train, cv=10, scoring='accuracy')
logreg_score.mean() #mean cross-validated accuracy on the training split

# Predict on the held-out test split
y_pred = logreg.predict(x_test)
 
# How's our accuracy?
metrics.accuracy_score(y_test, y_pred) # testing accuracy

0.98751487180816788

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

0.98748621959514926

As mentioned, many machine learning algorithms are designed to maximize overall accuracy by default.<br>
So our model has about 98.7% overall accuracy, but is that only because it is predicting a single class?

In [56]:
# Is our model still predicting just one class?
np.unique(y_pred)

array([0], dtype=int64)

As you can see, this model is only predicting class 0, which means it's completely ignoring the minority class in favor of the majority class.

Dealing with imbalanced datasets usually means balancing the classes in the training data (a data preprocessing step) before passing the data to the machine learning algorithm.

The main objective of balancing classes is either to increase the frequency of the minority class or to decrease the frequency of the majority class, so that both classes end up with approximately the same number of instances. Let us look at a few resampling techniques.

## Random Over-Sampling

Random over-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal. There are several heuristics for doing so, but the most common way is to simply resample with replacement.

In [57]:
# Separate majority and minority classes
df_majority = train_merged[train_merged.is_click==0]
df_minority = train_merged[train_merged.is_click==1]

In [32]:
from sklearn.utils import resample #import the resampling module from Scikit-Learn

# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,                               # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class
                                 random_state=123)                           # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled.is_click.value_counts()

1    1010409
0    1010409
Name: is_click, dtype: int64

As you can see, the new upsampled DataFrame has more observations than the original, and the ratio of the two classes is now 1:1.<br>
Now, let's train a Logistic Regression model on the upsampled data.

In [58]:
# Separate input features (X) and target variable (y)
y = df_upsampled.is_click
X = df_upsampled[feature_cols]

#train-test split
x_train, x_test, y_train, y_test = train_test_split(X, y)

#cross validation
logreg1 = LogisticRegression().fit(x_train,y_train)
logreg1_score = cross_val_score(logreg1, x_train, y_train, cv=10, scoring='accuracy')
logreg1_score.mean() # mean cross-validated accuracy on the training split

#prediction on test data
y_pred = logreg1.predict(x_test)
metrics.accuracy_score(y_test, y_pred)

0.54581611753134363

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

0.54645935808236257

In [59]:
# Check if our model is still predicting one class
np.unique(y_pred)

array([0, 1], dtype=int64)

In [60]:
from collections import Counter
Counter(y_pred).items()

dict_items([(1, 285910), (0, 219295)])

Now the model is no longer predicting just one class, and accuracy is a more meaningful evaluation metric on the balanced data.<br>
But for this hackathon, the evaluation metric is the AUC-ROC score.

In [61]:
#predict click probabilities on the held-out test split and score them against the true labels
y_pred = logreg1.predict_proba(x_test)
metrics.roc_auc_score(y_test,y_pred[:,1])

0.57835772890656112
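
One caveat about the evaluation above: because minority rows are duplicated before `train_test_split`, copies of the same row can land in both the training and the test portion, which may make held-out scores optimistic. A hedged sketch of the alternative ordering, splitting first and upsampling only the training part, on synthetic data (array names, sizes and the 25% test share are illustrative assumptions, not the competition data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 1000 rows, 20 positives
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# 1) Split first, so test rows never take part in the resampling
idx = rng.permutation(len(y))
test_idx, train_idx = idx[:250], idx[250:]

# 2) Upsample the minority class inside the training part only
train_pos = train_idx[y[train_idx] == 1]
train_neg = train_idx[y[train_idx] == 0]
pos_upsampled = rng.choice(train_pos, size=len(train_neg), replace=True)
balanced_idx = np.concatenate([train_neg, pos_upsampled])

X_train_bal, y_train_bal = X[balanced_idx], y[balanced_idx]
X_test, y_test = X[test_idx], y[test_idx]  # keeps the original class ratio

train_counts = np.bincount(y_train_bal)
n_shared = np.intersect1d(balanced_idx, test_idx).size
print(train_counts, n_shared)
```

With this ordering the test portion keeps the original class ratio, so its ROC AUC is closer to what can be expected on unseen campaigns.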

## Random Under-Sampling

Random Under-Sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm. The most common heuristic for doing so is resampling without replacement.

In [51]:
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,                   # sample without replacement
                                 n_samples=len(df_minority),      # to match minority class
                                 random_state=123)                # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.is_click.value_counts()

1    12782
0    12782
Name: is_click, dtype: int64

In [63]:
#Again apply Logistic Regression model on downsampled data

y = df_downsampled.is_click
X = df_downsampled[feature_cols]

#train-test split
x_train, x_test, y_train, y_test = train_test_split(X, y)

#cross validation
logreg2 = LogisticRegression().fit(x_train,y_train)
logreg2_score = cross_val_score(logreg2, x_train, y_train, cv=10, scoring='roc_auc')
logreg2_score.mean()

#prediction on test data
y_pred = logreg2.predict_proba(x_test)
metrics.roc_auc_score(y_test, y_pred[:,1])

0.57306561014374335

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

0.58434883986042829

Down-sampling gives a slightly better ROC AUC score (0.584) than up-sampling (0.578).<br>
Let us try more advanced sampling methods and check for improvement of roc_auc score.

## SMOTE - Synthetic Minority Over-sampling Technique

This technique aims to avoid the overfitting that can occur when exact replicas of minority instances are added to the dataset. Instead of duplicating rows, it takes a minority-class instance, picks one of its nearest minority-class neighbours, and creates a new synthetic instance on the line segment between them. These synthetic instances are added to the original dataset, and the result is used to train the classification models.
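
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (brute-force neighbour search, numeric features only, a hypothetical function name `smote_like_samples`), not the imbalanced-learn implementation:

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)             # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)       # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: the four corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_minority, n_new=6, k=2)
print(synthetic.shape)
```

Every synthetic point lies on a segment between two existing minority points, so the new rows are similar to the originals without being exact copies.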

In [66]:
#Take independent features into x and dependent feature into y
x = train_merged[feature_cols]
y = train_merged.is_click

In [68]:
from imblearn.over_sampling import SMOTE
sm = SMOTE()
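#fit_sample is the method name in the imbalanced-learn release used here; newer releases call it fit_resample
x_res, y_res = sm.fit_sample(x, y)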
Counter(y_res).items() #To print value counts of target classes

dict_items([(0, 1010409), (1, 1010409)])

In [69]:
#train-test split
x_train, x_test, y_train, y_test = train_test_split(x_res, y_res)
x_train.shape
x_test.shape
y_train.shape
y_test.shape

(1515613, 11)

(505205, 11)

(1515613,)

(505205,)

In [70]:
#Apply Logistic Regression model on data upsampled using SMOTE

#cross validation
logreg3 = LogisticRegression().fit(x_train,y_train)
logreg3_score = cross_val_score(logreg3, x_train, y_train, cv=10, scoring='roc_auc')
logreg3_score.mean()

#prediction on test data
y_pred = logreg3.predict_proba(x_test)
metrics.roc_auc_score(y_test, y_pred[:,1])

0.57667426285942702

0.57650612027061232

The ROC_AUC score of SMOTE is slightly lower than the Random Under-Sampling ROC_AUC score of 0.58434883986042829.

## ADASYN - Adaptive Synthetic Sampling Approach

The purpose of the ADASYN algorithm is to improve class balance by synthetically creating new examples from the minority class via linear interpolation between existing minority class examples. This approach by itself is known as the SMOTE method (Synthetic Minority Oversampling TEchnique). ADASYN is an extension of SMOTE that creates more examples in the vicinity of the boundary between the two classes than in the interior of the minority class.
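
What distinguishes ADASYN is how it decides where to place the synthetic points: each minority point gets a share proportional to the fraction of majority-class points among its nearest neighbours. A simplified NumPy sketch of that allocation step only (the function name `adasyn_allocation`, the brute-force neighbour search and the toy data are assumptions for illustration; a full version would then interpolate as in SMOTE):

```python
import numpy as np

def adasyn_allocation(X, y, n_new, k=5):
    """Number of synthetic samples to generate around each minority point."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = np.flatnonzero(y == 1)
    # Distances from every minority point to every point in the data
    dists = np.linalg.norm(X[minority][:, None, :] - X[None, :, :], axis=2)
    dists[np.arange(len(minority)), minority] = np.inf  # exclude the point itself
    nn = np.argsort(dists, axis=1)[:, :k]               # k nearest in the full data
    # Share of majority-class neighbours: high near the class boundary
    r = (y[nn] == 0).mean(axis=1)
    if r.sum() == 0:
        # No minority point has majority neighbours: fall back to equal shares
        weights = np.full(len(minority), 1.0 / len(minority))
    else:
        weights = r / r.sum()
    return np.rint(weights * n_new).astype(int)

# Toy 1-D data: three minority points far from the majority, one right next to it
X_toy = np.array([[0.0], [0.1], [0.2], [2.0], [2.1], [2.2], [2.3], [2.4], [2.5]])
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])
allocation = adasyn_allocation(X_toy, y_toy, n_new=10, k=2)
print(allocation)
```

In this toy case all ten synthetic samples are assigned to the minority point at 2.0, the only one whose neighbours belong to the majority class.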

In [71]:
from imblearn.over_sampling import ADASYN #import ADASYN package from imblearn.over_sampling
from collections import Counter 
x_res, y_res = ADASYN().fit_sample(x, y)
Counter(y_res).items()

dict_items([(0, 1010409), (1, 1007333)])

In [73]:
#train-test split
x_train, x_test, y_train, y_test = train_test_split(x_res, y_res)
x_train.shape
x_test.shape
y_train.shape
y_test.shape

(1513306, 11)

(504436, 11)

(1513306,)

(504436,)

In [74]:
#Apply Logistic Regression model on data over-sampled using ADASYN

#cross validation
logreg4 = LogisticRegression().fit(x_train,y_train)
logreg4_score = cross_val_score(logreg4, x_train, y_train, cv=10, scoring='roc_auc')
logreg4_score.mean()

#prediction on test data
y_pred = logreg4.predict_proba(x_test)
metrics.roc_auc_score(y_test, y_pred[:,1])

0.57537144176039556

0.57536879432242283

The ROC_AUC score of ADASYN is slightly lower than the Random Under-Sampling ROC_AUC score of 0.58434883986042829.

## Ensemble of Samplers - EasyEnsemble

EasyEnsemble creates an ensemble of data sets by randomly under-sampling the original set several times.
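
The idea can be sketched with NumPy alone: each subset keeps every minority row and adds an equally sized random draw of majority rows. The function name `balanced_subsets` and the toy labels below are illustrative assumptions, not the imbalanced-learn implementation:

```python
import numpy as np

def balanced_subsets(y, n_subsets=10, seed=0):
    """Return index arrays, each holding all minority rows plus an equally
    sized random draw (without replacement) of majority rows."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.choice(majority, size=len(minority), replace=False)
        subsets.append(np.concatenate([drawn, minority]))
    return subsets

# Toy labels: 95 negatives, 5 positives
y_toy = np.array([0] * 95 + [1] * 5)
subsets = balanced_subsets(y_toy, n_subsets=3)
class_counts = [np.bincount(y_toy[s]).tolist() for s in subsets]
print(class_counts)
```

Note that a model trained on only the first subset behaves like a single random under-sample; the benefit of the ensemble comes from training one model per subset and combining their predictions.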

In [77]:
from imblearn.ensemble import EasyEnsemble #import EasyEnsemble package from imblearn.ensemble
x_res, y_res = EasyEnsemble().fit_sample(x, y)
Counter(y_res[0]).items()

dict_items([(0, 12782), (1, 12782)])

In [79]:
#train-test split
x_train, x_test, y_train, y_test = train_test_split(x_res[0], y_res[0])
x_train.shape
x_test.shape
y_train.shape
y_test.shape

(19173, 11)

(6391, 11)

(19173,)

(6391,)

In [80]:
#cross validation
logreg5 = LogisticRegression().fit(x_train,y_train)
logreg5_score = cross_val_score(logreg5, x_train, y_train, cv=10, scoring='roc_auc')
logreg5_score.mean()

#prediction on test data
y_pred = logreg5.predict_proba(x_test)
metrics.roc_auc_score(y_test, y_pred[:,1])

0.572294328023504

0.57374436369740278

The ROC_AUC score of EasyEnsemble is slightly lower than the Random Under-Sampling ROC_AUC score of 0.58434883986042829.

## Ensemble of Samplers - BalanceCascade

BalanceCascade differs from the previous method by using a classifier (set through the parameter estimator) to ensure that misclassified samples can again be selected for the next subset. In effect, the classifier plays the role of a “smart” replacement method.

In [83]:
from imblearn.ensemble import BalanceCascade #import BalanceCascade package from imblearn.ensemble
bc = BalanceCascade(estimator=LogisticRegression())
x_res, y_res = bc.fit_sample(x, y)
Counter(y_res[0]).items()

dict_items([(0, 12782), (1, 12782)])

In [84]:
#train-test split
x_train, x_test, y_train, y_test = train_test_split(x_res[0], y_res[0])
x_train.shape
x_test.shape
y_train.shape
y_test.shape

(19173, 11)

(6391, 11)

(19173,)

(6391,)

In [85]:
#cross validation
logreg6 = LogisticRegression().fit(x_train,y_train)
logreg6_score = cross_val_score(logreg6, x_train, y_train, cv=10, scoring='roc_auc')
logreg6_score.mean()

#prediction on test data
y_pred = logreg6.predict_proba(x_test)
metrics.roc_auc_score(y_test, y_pred[:,1])

0.57810171672976129

0.57246687815804653

The ROC_AUC score of BalanceCascade is slightly lower than the Random Under-Sampling ROC_AUC score of 0.58434883986042829.

## Ensemble of Samplers - BalancedBaggingClassifier

BalancedBaggingClassifier resamples each subset of the data before training each estimator of the ensemble. In short, it combines the output of an EasyEnsemble sampler with an ensemble of classifiers (i.e. BaggingClassifier). It therefore takes the same parameters as the scikit-learn BaggingClassifier, plus sampling parameters such as ratio and replacement.
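
As a rough illustration of the idea, the sketch below trains one decision tree per balanced under-sample and averages the predicted click probabilities. It is a simplification under stated assumptions (no bootstrap of the rows, synthetic data, hypothetical function names `fit_balanced_bag` and `predict_click_proba`), not a reimplementation of the imbalanced-learn class:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_bag(X, y, n_estimators=10, seed=0):
    """Fit one tree per balanced subset (all minority rows + equal majority draw)."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    trees = []
    for _ in range(n_estimators):
        drawn = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([drawn, minority])
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return trees

def predict_click_proba(trees, X):
    # Average the class-1 probability over all trees
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)

# Synthetic imbalanced data (roughly one positive in ten)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=500) > 1.5).astype(int)

trees = fit_balanced_bag(X_toy, y_toy)
proba = predict_click_proba(trees, X_toy)
print(proba.shape)
```

Averaging over several balanced subsets lets the ensemble use far more of the majority class than a single under-sample does, while each individual tree still sees a 1:1 class ratio.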

In [92]:
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

#train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y)

bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                ratio='auto',
                                replacement=False,
                                random_state=0).fit(x_train, y_train)
#cross-validation
bbc_score = cross_val_score(bbc, x_train, y_train, cv=10, scoring='roc_auc')
bbc_score.mean()

#predict on test data
y_pred = bbc.predict_proba(x_test)
metrics.roc_auc_score(y_test,y_pred[:,1])

0.58968331422425724

0.59995108831399091

The ROC_AUC score of BalancedBaggingClassifier is slightly higher than the Random Under-Sampling ROC_AUC score of 0.58434883986042829.