# <center> Lord of the Machines - Data Science Hackathon </center>

<b>Problem Statement</b><br>
Email marketing is still the most successful marketing channel and an essential element of any digital marketing strategy. Marketers spend a lot of time writing that perfect email, labouring over each word and over catchy layouts for multiple devices, to get best-in-industry open rates and click rates.

"How can I build my campaign to increase the click-through rates of email?" is a question often heard when marketers are creating their email marketing plans.

Can we optimize our email marketing campaigns with Data Science?

It's time to unlock marketing potential and build some exceptional data-science products for email marketing.

Analytics Vidhya sends out marketing emailers for various events such as conferences, hackathons, etc. We have provided a sample of user-email interaction data from July 2017 to December 2017. You are required to predict the click probability of links inside a mailer for email campaigns from January 2018 to March 2018.

<b>EVALUATION METRIC</b><br>
The evaluation metric for this competition is the AUC-ROC score.
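
To make the metric concrete: AUC-ROC equals the probability that a randomly chosen clicked row receives a higher score than a randomly chosen non-clicked row, with ties counted as one half. The sketch below computes it by brute force on made-up labels and scores; the function name `pairwise_auc` and the data are illustrative and not part of the competition kit.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: two clicks among five rows
labels = [0, 0, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80]
auc_scores = pairwise_auc(labels, scores)

# Predicting the same value for every row ties every pair
auc_constant = pairwise_auc(labels, [0, 0, 0, 0, 0])
```

Because only the ordering of the scores matters, a submission benefits from graded scores (probabilities) rather than hard 0/1 labels.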

https://datahack.analyticsvidhya.com/contest/lord-of-the-machines/?utm_source=sendinblue&utm_campaign=Lord_Of_The_Machines__Go_live&utm_medium=email

In [2]:
#Import the basic libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell #To print multiple outputs
InteractiveShell.ast_node_interactivity = 'all'

We are provided with 3 datasets:
1. Campaign details of AV
2. Train dataset, which records whether each user opened the campaign mail and clicked a link in it
3. Test dataset - to submit with click predictions

In [7]:
#Read the datasets provided and view the data
campaign = pd.read_csv('campaign_data.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')
campaign.shape
campaign.head()
train.shape
train.head()
test.shape
test.head()
sample.shape
sample.head()

(52, 9)

Unnamed: 0,campaign_id,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,email_body,subject,email_url
0,29,Newsletter,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
1,30,Upcoming Events,18,14,7,1,"Dear AVians,\r\n \r\nAre your eager to know wh...",[July] Data Science Expert Meetups & Competiti...,http://r.newsletters.analyticsvidhya.com/7up0e...
2,31,Conference,15,13,5,1,Early Bird Pricing Till August 07  Save upto ...,Last chance to convince your boss before the E...,http://r.newsletters.analyticsvidhya.com/7usym...
3,32,Conference,24,19,7,1,\r\n \r\nHi ?\r\n \r\nBefore I dive into why y...,A.I. & Machine Learning: 5 reasons why you sho...,http://r.newsletters.analyticsvidhya.com/7uthl...
4,33,Others,7,3,1,1,Fireside Chat with DJ Patil - the master is he...,"[Delhi NCR] Fireside Chat with DJ Patil, Forme...",http://r.newsletters.analyticsvidhya.com/7uvlg...


(1023191, 6)

Unnamed: 0,id,user_id,campaign_id,send_date,is_open,is_click
0,42_14051,14051,42,01-09-2017 19:55,0,0
1,52_134438,134438,52,02-11-2017 12:53,0,0
2,33_181789,181789,33,24-07-2017 15:15,0,0
3,44_231448,231448,44,05-09-2017 11:36,0,0
4,29_185580,185580,29,01-07-2017 18:01,0,0


(773858, 4)

Unnamed: 0,id,campaign_id,user_id,send_date
0,63_122715,63,122715,01-02-2018 22:35
1,56_76206,56,76206,02-01-2018 08:15
2,57_96189,57,96189,05-01-2018 18:25
3,56_166917,56,166917,02-01-2018 08:15
4,56_172838,56,172838,02-01-2018 08:12


(773858, 2)

Unnamed: 0,id,is_click
0,63_122715,0
1,56_76206,0
2,57_96189,0
3,56_166917,0
4,56_172838,0


In [8]:
#Merge train and test datasets with campaign dataset on 'campaign_id' feature
train_merged = pd.merge(train, campaign, on='campaign_id', how='left').reset_index()
test_merged = pd.merge(test, campaign, on='campaign_id', how='left').reset_index()
train_merged.shape
test_merged.shape
train_merged.head()

(1023191, 15)

(773858, 13)

Unnamed: 0,index,id,user_id,campaign_id,send_date,is_open,is_click,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,email_body,subject,email_url
0,0,42_14051,14051,42,01-09-2017 19:55,0,0,Newsletter,88,79,13,4,"September Newsletter\r\n \r\nDear AVians,\r\n ...",[September] Exciting days ahead with DataHack ...,http://r.newsletters.analyticsvidhya.com/7v3rd...
1,1,52_134438,134438,52,02-11-2017 12:53,0,0,Newsletter,67,62,10,4,"November Newsletter\r\n \r\nDear AVians,\r\n \...",[Newsletter] Stage for DataHack Summit 2017 is...,http://r.newsletters.analyticsvidhya.com/7vtb2...
2,2,33_181789,181789,33,24-07-2017 15:15,0,0,Others,7,3,1,1,Fireside Chat with DJ Patil - the master is he...,"[Delhi NCR] Fireside Chat with DJ Patil, Forme...",http://r.newsletters.analyticsvidhya.com/7uvlg...
3,3,44_231448,231448,44,05-09-2017 11:36,0,0,Upcoming Events,60,56,19,6,"[September Events]\r\n \r\nDear AVians,\r\n \r...","[September] Data Science Hackathons, Meetups a...",http://r.newsletters.analyticsvidhya.com/7veam...
4,4,29_185580,185580,29,01-07-2017 18:01,0,0,Newsletter,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
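
A left merge keeps every row of the interaction data and attaches the matching campaign attributes. One optional safeguard, sketched here on toy frames with made-up values, is pandas' `validate='many_to_one'` argument, which raises a `MergeError` if `campaign_id` is not unique in the campaign table and would otherwise silently duplicate rows.

```python
import pandas as pd

# Toy stand-ins for the real train and campaign tables (values are made up)
interactions = pd.DataFrame({
    "id": ["29_1", "29_2", "30_1"],
    "campaign_id": [29, 29, 30],
    "is_click": [0, 1, 0],
})
campaigns = pd.DataFrame({
    "campaign_id": [29, 30],
    "communication_type": ["Newsletter", "Upcoming Events"],
})

# many_to_one: many interaction rows per campaign, exactly one campaign row per id
merged = pd.merge(interactions, campaigns, on="campaign_id", how="left",
                  validate="many_to_one")

# The row count is unchanged and no campaign attribute is missing
rows_preserved = len(merged) == len(interactions)
unmatched = int(merged["communication_type"].isna().sum())
```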


In [9]:
train_merged.index

RangeIndex(start=0, stop=1023191, step=1)

In [10]:
#Check datatypes of the features in train data
train_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1023191 entries, 0 to 1023190
Data columns (total 15 columns):
index                   1023191 non-null int64
id                      1023191 non-null object
user_id                 1023191 non-null int64
campaign_id             1023191 non-null int64
send_date               1023191 non-null object
is_open                 1023191 non-null int64
is_click                1023191 non-null int64
communication_type      1023191 non-null object
total_links             1023191 non-null int64
no_of_internal_links    1023191 non-null int64
no_of_images            1023191 non-null int64
no_of_sections          1023191 non-null int64
email_body              1023191 non-null object
subject                 1023191 non-null object
email_url               1023191 non-null object
dtypes: int64(9), object(6)
memory usage: 117.1+ MB


In [11]:
#The feature 'send_date' is of type object which needs to be converted into datetime
train_merged.send_date = pd.to_datetime(train_merged.send_date)
test_merged.send_date = pd.to_datetime(test_merged.send_date)
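
The `send_date` strings shown earlier appear to be day-first: `01-09-2017 19:55` belongs to campaign 42, whose body starts with "September Newsletter". Without a format hint, `pd.to_datetime` may read such a string month-first, as 9 January. A minimal sketch, assuming every value follows the `%d-%m-%Y %H:%M` pattern seen in the samples:

```python
import pandas as pd

# Two send_date strings copied from the train sample above
raw = pd.Series(["01-09-2017 19:55", "24-07-2017 15:15"])

# An explicit format removes the day/month ambiguity
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M")

months = parsed.dt.month.tolist()   # September and July under this format
days = parsed.dt.day.tolist()
```

`pd.to_datetime(raw, dayfirst=True)` is a looser alternative when the exact pattern is not known in advance.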

In [12]:
train_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1023191 entries, 0 to 1023190
Data columns (total 15 columns):
index                   1023191 non-null int64
id                      1023191 non-null object
user_id                 1023191 non-null int64
campaign_id             1023191 non-null int64
send_date               1023191 non-null datetime64[ns]
is_open                 1023191 non-null int64
is_click                1023191 non-null int64
communication_type      1023191 non-null object
total_links             1023191 non-null int64
no_of_internal_links    1023191 non-null int64
no_of_images            1023191 non-null int64
no_of_sections          1023191 non-null int64
email_body              1023191 non-null object
subject                 1023191 non-null object
email_url               1023191 non-null object
dtypes: datetime64[ns](1), int64(9), object(5)
memory usage: 117.1+ MB


In [13]:
#Check unique communication types in train and test datasets
train_merged.communication_type.unique()
test_merged.communication_type.unique()

array(['Newsletter', 'Others', 'Upcoming Events', 'Conference',
       'Corporate', 'Hackathon', 'Webinar'], dtype=object)

array(['Newsletter', 'Upcoming Events', 'Hackathon', 'Corporate'],
      dtype=object)

Note that the communication types Others, Conference and Webinar are not present in the test dataset.

In [15]:
#Encode the communication type labels with sklearn's LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train_merged.communication_type)
train_merged.communication_type = le.transform(train_merged.communication_type)
train_merged.communication_type.unique()

LabelEncoder()

array([3, 4, 5, 0, 1, 2, 6], dtype=int64)

In [16]:
le.fit(test_merged.communication_type)
test_merged.communication_type = le.transform(test_merged.communication_type)
test_merged.communication_type.unique()

LabelEncoder()

array([2, 3, 1, 0], dtype=int64)
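
Refitting the encoder on the test column assigns codes from the test labels alone, so the same label can receive different integers in the two frames: `Newsletter` is 3 in train (first array above) but 2 in test. A sketch of fitting once and reusing the mapping, which works here because every test label also occurs in train:

```python
from sklearn.preprocessing import LabelEncoder

# Labels in the order reported by unique() above
train_types = ["Newsletter", "Others", "Upcoming Events", "Conference",
               "Corporate", "Hackathon", "Webinar"]
test_types = ["Newsletter", "Upcoming Events", "Hackathon", "Corporate"]

# Fit on the training labels only, then apply the same mapping to both sets
encoder = LabelEncoder().fit(train_types)
train_codes = encoder.transform(train_types).tolist()
test_codes = encoder.transform(test_types).tolist()

# For comparison: an encoder refit on the test labels yields different codes
refit_codes = LabelEncoder().fit(test_types).transform(test_types).tolist()
```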

## Feature Engineering

In [17]:
#Let's split 'send_date' into day of week, hour, day and month features
train_merged['day_of_week'] = train_merged['send_date'].dt.dayofweek
train_merged['hour'] = train_merged['send_date'].dt.hour
train_merged['day'] = train_merged['send_date'].dt.day
train_merged['month'] = train_merged['send_date'].dt.month
train_merged['IsWeekend'] = train_merged['day_of_week'].apply(lambda x : 0 if x==0 | x==6 else 1)

In [18]:
test_merged['day_of_week'] = test_merged['send_date'].dt.dayofweek
test_merged['hour'] = test_merged['send_date'].dt.hour
test_merged['day'] = test_merged['send_date'].dt.day
test_merged['month'] = test_merged['send_date'].dt.month
test_merged['IsWeekend'] = test_merged['day_of_week'].apply(lambda x : 0 if x==0 | x==6 else 1)
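
Two details of the `IsWeekend` expression are worth checking. In Python `|` binds more tightly than `==`, so `x==0 | x==6` is evaluated as the chained comparison `x == (0 | x) == 6`, which is true only when `x` is 6. In addition, pandas numbers the days Monday=0 through Sunday=6, so Saturday and Sunday are 5 and 6. The sketch below compares the original lambda with one possible weekend flag (1 = weekend); the dates are made up for illustration.

```python
import pandas as pd

# Made-up sample: Friday 5 Jan 2018 through Monday 8 Jan 2018
dates = pd.Series(pd.to_datetime(["2018-01-05", "2018-01-06",
                                  "2018-01-07", "2018-01-08"]))
day_of_week = dates.dt.dayofweek  # Monday=0 ... Sunday=6

# The original lambda: returns 0 only for Sunday (6) and 1 for every other day
original_flag = day_of_week.apply(lambda x: 0 if x == 0 | x == 6 else 1)

# One possible alternative: 1 for Saturday or Sunday, 0 otherwise
is_weekend = day_of_week.isin([5, 6]).astype(int)
```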

In [19]:
train_merged.head(2)

Unnamed: 0,index,id,user_id,campaign_id,send_date,is_open,is_click,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,email_body,subject,email_url,day_of_week,hour,day,month,IsWeekend
0,0,42_14051,14051,42,2017-01-09 19:55:00,0,0,3,88,79,13,4,"September Newsletter\r\n \r\nDear AVians,\r\n ...",[September] Exciting days ahead with DataHack ...,http://r.newsletters.analyticsvidhya.com/7v3rd...,0,19,9,1,1
1,1,52_134438,134438,52,2017-02-11 12:53:00,0,0,3,67,62,10,4,"November Newsletter\r\n \r\nDear AVians,\r\n \...",[Newsletter] Stage for DataHack Summit 2017 is...,http://r.newsletters.analyticsvidhya.com/7vtb2...,5,12,11,2,1


In [20]:
train_merged.columns

Index(['index', 'id', 'user_id', 'campaign_id', 'send_date', 'is_open',
       'is_click', 'communication_type', 'total_links', 'no_of_internal_links',
       'no_of_images', 'no_of_sections', 'email_body', 'subject', 'email_url',
       'day_of_week', 'hour', 'day', 'month', 'IsWeekend'],
      dtype='object')

In [22]:
feature_cols = train_merged.columns.drop(['id', 'user_id', 'campaign_id', 'send_date','email_body', 'subject', 'email_url','is_open','is_click'])
#feature_cols
train_merged[feature_cols]

Unnamed: 0,index,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,day_of_week,hour,day,month,IsWeekend
0,0,3,88,79,13,4,0,19,9,1,1
1,1,3,67,62,10,4,5,12,11,2,1
2,2,4,7,3,1,1,0,15,24,7,1
3,3,5,60,56,19,6,1,11,9,5,1
4,4,3,67,61,12,3,5,18,7,1,1
5,5,0,119,117,16,1,3,15,28,9,1
6,6,3,88,79,13,4,0,20,9,1,1
7,7,3,88,79,13,4,0,20,9,1,1
8,8,0,119,117,16,1,3,15,28,9,1
9,9,5,18,14,7,1,6,14,7,5,0
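
The `index` column in this frame is the row counter added by `reset_index()` after the merge, so it carries no information about the email or the user. One way to keep it out of the model inputs, shown on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "index": [0, 1, 2],
    "id": ["42_1", "52_2", "33_3"],
    "total_links": [88, 67, 7],
    "hour": [19, 12, 15],
    "is_click": [0, 0, 1],
})

# Columns that are identifiers, targets or bookkeeping rather than features
non_features = ["index", "id", "is_click"]
toy_feature_cols = toy.columns.drop(non_features)

feature_names = list(toy_feature_cols)
```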


In [34]:
train_merged[feature_cols].to_csv('train_merged.csv', index=False)

In [30]:
x = train_merged[feature_cols]
y = train_merged.is_click
x.head()
y.head()

Unnamed: 0,index,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,day_of_week,hour,day,month,IsWeekend
0,0,3,88,79,13,4,0,19,9,1,1
1,1,3,67,62,10,4,5,12,11,2,1
2,2,4,7,3,1,1,0,15,24,7,1
3,3,5,60,56,19,6,1,11,9,5,1
4,4,3,67,61,12,3,5,18,7,1,1


0    0
1    0
2    0
3    0
4    0
Name: is_click, dtype: int64

In [31]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(x)
x_scaled = pd.DataFrame(columns=x.columns, data=scaled)
x_scaled.head()

Unnamed: 0,index,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,day_of_week,hour,day,month,IsWeekend
0,-1.732049,0.370532,0.427526,0.359969,0.628424,0.795973,-1.694876,0.665775,-0.556001,-1.091076,0.535918
1,-1.732046,0.370532,-0.016091,-0.009085,0.010564,0.795973,0.785045,-1.089739,-0.245041,-0.740702,0.535918
2,-1.732042,0.93009,-1.283568,-1.28992,-1.843018,-0.98092,-1.694876,-0.337376,1.776202,1.011166,0.535918
3,-1.732039,1.489648,-0.163964,-0.13934,1.864145,1.980569,-1.198892,-1.340527,-0.556001,0.310419,0.535918
4,-1.732036,0.370532,-0.016091,-0.030794,0.422471,0.203675,0.785045,0.414987,-0.866961,-1.091076,0.535918
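
Here the scaler is fitted on all of `x` before any split, so the rows later used for validation influence the scaling statistics. The effect is usually small for standardisation, but it can be avoided by putting the scaler and the model in one pipeline so that each cross-validation fold fits its own scaler. A sketch on synthetic data (the data and variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 400 rows, 3 features on very different scales
X = rng.normal(size=(400, 3)) * np.array([1.0, 50.0, 1000.0])
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)

# The scaler is refitted on the training part of every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
fold_auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
```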


In [32]:
train_merged.is_click.value_counts()

0    1010409
1      12782
Name: is_click, dtype: int64

We see that there is a huge imbalance between classes 0 and 1: only 12,782 of the 1,023,191 rows (about 1.25%) are clicks.
To address the class imbalance, we can use SMOTE to oversample the minority class.

In [33]:
from kmeans_smote import KMeansSMOTE

kmeans_smote = KMeansSMOTE(
    kmeans_args={
        'n_clusters': 100
    },
    smote_args={
        'k_neighbors': 10
    }
)
x_resampled, y_resampled = kmeans_smote.fit_sample(x_scaled, y)

for label, count in zip(*np.unique(y_resampled, return_counts=True)):
    print('Class {} has {} instances after oversampling'.format(label, count))

ImportError: cannot import name 'logsumexp'
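
Because the import fails, no resampling is applied in the cells that follow. One alternative that needs no extra package is class weighting (a different technique from SMOTE), which weights the rare class more heavily during training. scikit-learn's `class_weight='balanced'` option uses the heuristic `n_samples / (n_classes * count_per_class)`; the sketch below reproduces that arithmetic with the class counts reported above.

```python
import numpy as np

# Class counts from train_merged.is_click.value_counts()
counts = np.array([1010409, 12782])
n_samples = counts.sum()
n_classes = len(counts)

# Same formula scikit-learn documents for class_weight='balanced'
balanced_weights = n_samples / (n_classes * counts)

# Share of rows that are clicks
click_rate = counts[1] / n_samples
```

Under this heuristic each click row weighs roughly 40 times as much as it would unweighted, and each non-click row about half. The weights could be applied with, for example, `LogisticRegression(class_weight='balanced')`.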

In [43]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y)
x_train.shape
x_test.shape
y_train.shape
y_test.shape

(767393, 10)

(255798, 10)

(767393,)

(255798,)

In [59]:
y_train.value_counts()
y_test.value_counts()

0    757855
1      9538
Name: is_click, dtype: int64

0    252554
1      3244
Name: is_click, dtype: int64
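
With roughly 1% positives, a purely random split can shift the click rate between the two parts (here they happen to come out close). Passing `stratify=y` keeps the class ratio the same in both parts, and `random_state` makes the split reproducible. A sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 2% positives
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000, dtype=int)
y[:20] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Positive rate in each part after the stratified split
train_rate = y_tr.mean()
test_rate = y_te.mean()
```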

In [51]:
test_merged[feature_cols].head()

Unnamed: 0,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,day_of_week,hour,day,month,IsWeekend
0,2,68,64,15,5,1,22,2,1,1
1,2,42,38,10,4,3,8,1,2,1
2,3,40,36,15,4,1,18,1,5,1
3,2,42,38,10,4,3,8,1,2,1
4,2,42,38,10,4,3,8,1,2,1


In [55]:
test_scaled = pd.DataFrame(columns=feature_cols, data=scaler.fit_transform(test_merged[feature_cols]))
test_scaled.head()

Unnamed: 0,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,day_of_week,hour,day,month,IsWeekend
0,-0.129635,0.405615,0.44806,0.684008,0.794145,-0.790201,1.33103,-0.436621,-0.987799,0.145226
1,-0.129635,-0.245307,-0.234024,-0.370742,0.099247,0.446796,-1.888278,-0.608705,-0.628489,0.145226
2,1.026943,-0.295378,-0.286492,0.684008,0.099247,-0.790201,0.411228,-0.608705,0.449443,0.145226
3,-0.129635,-0.245307,-0.234024,-0.370742,0.099247,0.446796,-1.888278,-0.608705,-0.628489,0.145226
4,-0.129635,-0.245307,-0.234024,-0.370742,0.099247,0.446796,-1.888278,-0.608705,-0.628489,0.145226
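
Calling `fit_transform` on the test frame re-estimates the mean and standard deviation from the test data, so train and test end up on different scales. The usual pattern is to fit the scaler on the training data and only `transform` the test data. A tiny sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_values = np.array([[0.0], [2.0]])   # mean 1, standard deviation 1
test_values = np.array([[3.0], [5.0]])

train_scaler = StandardScaler().fit(train_values)

# Reuses the training mean and standard deviation
test_with_train_stats = train_scaler.transform(test_values).ravel().tolist()

# Re-fitting on the test data centres the test set on itself instead
test_refit = StandardScaler().fit_transform(test_values).ravel().tolist()
```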


In [61]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
lr = LogisticRegression()
lr_cv_score = cross_val_score(lr, x_train,y_train,cv=10,scoring='roc_auc')
lr_cv_score.mean()
lr.fit(x_train, y_train)
lr_pred = lr.predict(x_test)
metrics.roc_auc_score(y_test, lr_pred)

0.575881400543178

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

0.5
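
The hold-out AUC of exactly 0.5 is likely an artefact of scoring hard labels: with about 1% clicks, `predict` tends to return 0 for every row, and a constant prediction always has AUC 0.5. `cross_val_score` with `scoring='roc_auc'` instead ranks rows by the model's continuous scores, which is why it reports 0.576. A sketch on synthetic data showing both ways of scoring the same fitted model (data and names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic imbalanced data: about 2% positives, one weakly informative feature
y = (rng.random(5000) < 0.02).astype(int)
X = (rng.normal(size=5000) + 1.0 * y).reshape(-1, 1)

model = LogisticRegression().fit(X, y)

hard_labels = model.predict(X)              # 0/1 decisions at the 0.5 threshold
click_prob = model.predict_proba(X)[:, 1]   # probability of class 1

auc_hard = roc_auc_score(y, hard_labels)
auc_prob = roc_auc_score(y, click_prob)
```

The same substitution, `predict_proba(x_test)[:, 1]` in place of `predict(x_test)`, applies to the decision tree and random forest cells.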

In [67]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc_cv_score = cross_val_score(dtc, x_train, y_train, cv=10, scoring='roc_auc')
dtc_cv_score.mean()
dtc.fit(x_train, y_train)
dtc_pred = dtc.predict(x_test)
metrics.roc_auc_score(y_test, dtc_pred)

0.5939141049919565

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

0.5

In [64]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc_cv_score = cross_val_score(rfc, x_train, y_train, cv=10, scoring='roc_auc')
rfc_cv_score.mean()
rfc.fit(x_train, y_train)
rfc_pred = rfc.predict(x_test)
metrics.roc_auc_score(y_test, rfc_pred)

0.5939386260433296

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

0.5

In [68]:
test_pred = dtc.predict(test_scaled)
test_pred.max()

0

In [72]:
sample['is_click'] = test_pred
sample.head()
sample.to_csv('sample_submission.csv', index=False)

Unnamed: 0,id,is_click
0,63_122715,0
1,56_76206,0
2,57_96189,0
3,56_166917,0
4,56_172838,0


AV Score - 0.5
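
A score of 0.5 is what AUC assigns to a constant prediction, and `test_pred.max()` being 0 shows that every submitted value was 0. Since the task asks for click probabilities, one option is to submit `predict_proba(...)[:, 1]` instead of `predict(...)`. The helper below is a hypothetical sketch of assembling such a file; `make_submission` and the example probabilities are not from the original notebook.

```python
import pandas as pd


def make_submission(ids, click_probabilities):
    """Build a submission frame with one probability per id."""
    if len(ids) != len(click_probabilities):
        raise ValueError("ids and click_probabilities must have the same length")
    return pd.DataFrame({"id": list(ids), "is_click": list(click_probabilities)})


# Made-up probabilities for three ids from the sample file
submission = make_submission(["63_122715", "56_76206", "57_96189"],
                             [0.012, 0.034, 0.008])
csv_text = submission.to_csv(index=False)
```

Writing the result to a new file name would also avoid overwriting the provided `sample_submission.csv`.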