---
---

## DHS2019 CROSS VALIDATION PART 1


#### PROBLEM STATEMENT

Email marketing is still the most successful marketing channel and an essential element of any digital marketing strategy. Marketers spend a lot of time writing the perfect email, labouring over each word and designing catchy layouts for multiple devices, to achieve best-in-industry open rates and click rates.

"How can I build my campaign to increase email click-through rates?" is a question often heard when marketers are creating their email marketing plans.
 
Can we optimize our email marketing campaigns with Data Science?

It's time to unlock marketing potential and build some exceptional data-science products for email marketing.

Analytics Vidhya sends out marketing emailers for various events such as conferences, hackathons, etc. We have provided a sample of user-email interaction data from July 2017 to December 2017. You are required to predict the click probability of links inside a mailer for email campaigns from January 2018 to March 2018. 


---

***Dataset URL: https://datahack.analyticsvidhya.com/contest/workshop_enigma-codefest-machine-learning/***


---

In [1]:
# importing required libraries
import numpy as np
import pandas as pd
import time
from sklearn import model_selection, preprocessing, metrics, ensemble
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [9]:
## Read the input files 
train_df = pd.read_csv("dataset/train.csv")
test_df = pd.read_csv("dataset/test.csv")
camp_df = pd.read_csv("dataset/campaign_data.csv")

In [10]:
train_df.head()

| | id | user_id | campaign_id | send_date | is_open | is_click |
|---|---|---|---|---|---|---|
| 0 | 42_14051 | 14051 | 42 | 01-09-2017 19:55 | 0 | 0 |
| 1 | 52_134438 | 134438 | 52 | 02-11-2017 12:53 | 0 | 0 |
| 2 | 33_181789 | 181789 | 33 | 24-07-2017 15:15 | 0 | 0 |
| 3 | 44_231448 | 231448 | 44 | 05-09-2017 11:36 | 0 | 0 |
| 4 | 29_185580 | 185580 | 29 | 01-07-2017 18:01 | 0 | 0 |


In [11]:
test_df.head()

| | id | campaign_id | user_id | send_date |
|---|---|---|---|---|
| 0 | 63_122715 | 63 | 122715 | 01-02-2018 22:35 |
| 1 | 56_76206 | 56 | 76206 | 02-01-2018 08:15 |
| 2 | 57_96189 | 57 | 96189 | 05-01-2018 18:25 |
| 3 | 56_166917 | 56 | 166917 | 02-01-2018 08:15 |
| 4 | 56_172838 | 56 | 172838 | 02-01-2018 08:12 |


In [12]:
camp_df.head()

| | campaign_id | communication_type | total_links | no_of_internal_links | no_of_images | no_of_sections | email_body | subject | email_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | Newsletter | 67 | 61 | 12 | 3 | Dear AVians,\r\n \r\nWe are shaping up a super... | Sneak Peek: A look at the emerging data scienc... | http://r.newsletters.analyticsvidhya.com/7um44... |
| 1 | 30 | Upcoming Events | 18 | 14 | 7 | 1 | Dear AVians,\r\n \r\nAre your eager to know wh... | [July] Data Science Expert Meetups & Competiti... | http://r.newsletters.analyticsvidhya.com/7up0e... |
| 2 | 31 | Conference | 15 | 13 | 5 | 1 | Early Bird Pricing Till August 07  Save upto ... | Last chance to convince your boss before the E... | http://r.newsletters.analyticsvidhya.com/7usym... |
| 3 | 32 | Conference | 24 | 19 | 7 | 1 | \r\n \r\nHi ?\r\n \r\nBefore I dive into why y... | A.I. & Machine Learning: 5 reasons why you sho... | http://r.newsletters.analyticsvidhya.com/7uthl... |
| 4 | 33 | Others | 7 | 3 | 1 | 1 | Fireside Chat with DJ Patil - the master is he... | [Delhi NCR] Fireside Chat with DJ Patil, Forme... | http://r.newsletters.analyticsvidhya.com/7uvlg... |


In [13]:
### EXERCISE-1
## Convert the date to datetime format
train_df["send_date"] = pd.to_datetime(train_df["send_date"], format="%d-%m-%Y %H:%M")
test_df["send_date"] = pd.to_datetime(test_df["send_date"], format="%d-%m-%Y %H:%M")
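
The explicit format string matters here because the `send_date` values are day-first: without it, a value such as `02-11-2017` may be read as 11 February rather than 2 November. The following is a minimal sketch, using one value copied from the `train_df.head()` output, of two ways to state the day-first convention; it is an illustration, not part of the original exercise.

```python
import pandas as pd

# A send_date value copied from the train_df.head() output
raw = "02-11-2017 12:53"

# Option 1: spell out the full format (day, month, year, hour, minute)
explicit = pd.to_datetime(raw, format="%d-%m-%Y %H:%M")

# Option 2: let pandas infer the format, but tell it the day comes first
day_first = pd.to_datetime(raw, dayfirst=True)

print(explicit)
print(day_first)
```

Both calls should yield the same timestamp in November; the explicit format is usually the safer choice on a full column because every row is parsed the same way.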

In [14]:
### EXERCISE-2
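## Create a numeric date column: seconds since the Unix epoch, with send_date interpreted in the machine's local timezone
## (the column is named ordinal_date, but it holds a timestamp, not a calendar ordinal)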
train_df["ordinal_date"] = train_df["send_date"].apply(lambda x: time.mktime(x.timetuple()))
test_df["ordinal_date"] = test_df["send_date"].apply(lambda x: time.mktime(x.timetuple()))

In [15]:
## Sort values by date
train_df = train_df.sort_values(by="ordinal_date").reset_index(drop=True)
test_df = test_df.sort_values(by="ordinal_date").reset_index(drop=True)
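
Because `time.mktime` interprets each value in the local timezone of the machine and is applied row by row, the resulting numbers can differ between machines. As a hedged alternative, the sketch below computes epoch seconds in a vectorized way by subtracting the epoch and dividing by one second; the small `dates` series is made up from two values in the `train_df.head()` output, and the timestamps are treated as UTC, which is an assumption.

```python
import pandas as pd

# Two send_date values copied from the train_df.head() output
dates = pd.to_datetime(
    pd.Series(["01-09-2017 19:55", "02-11-2017 12:53"]),
    format="%d-%m-%Y %H:%M",
)

# Seconds since 1970-01-01, computed for the whole column at once
epoch = pd.Timestamp("1970-01-01")
epoch_seconds = (dates - epoch) // pd.Timedelta(seconds=1)

print(epoch_seconds.tolist())
```

For sorting alone, either approach gives the same order as sorting on `send_date` directly, so the numeric column is mainly useful as a model feature.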

---
---

## Check for categorical variable shift between train and test

### Campaign ID
---

In [16]:
### EXERCISE-3
# unique campaign id train data
train_df["campaign_id"].unique()

array([29, 30, 32, 33, 35, 34, 31, 36, 38, 37, 39, 40, 41, 43, 42, 44, 45,
       46, 47, 49, 48, 50, 51, 52, 53, 54])

In [17]:
# unique campaign id test data
test_df["campaign_id"].unique()

array([56, 57, 55, 59, 58, 60, 61, 62, 63, 64, 65, 67, 68, 66, 69, 70, 71,
       72, 73, 74, 75, 76, 77, 78, 80, 79])

In [18]:
# common user ids in train and test data
train_users = set(train_df['user_id'].unique())
test_users = set(test_df['user_id'].unique())

print("Train users count : ", len(train_users))
print("Test users count : ",len(test_users))
print("Common users count : ", len(train_users.intersection(test_users)))

Train users count :  168236
Test users count :  198219
Common users count :  145737
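
The outputs above show two different situations: no campaign ID in test (55 to 80) appears in train (29 to 54), while 145737 of the 198219 test users (roughly 73.5%) also appear in train. The same check can be wrapped in a small helper and run on any categorical column. The function name `category_overlap` and the toy frames below are hypothetical, introduced only for illustration.

```python
import pandas as pd

def category_overlap(train, test, column):
    """Count distinct values of `column` in train, in test, and in both."""
    train_values = set(train[column].unique())
    test_values = set(test[column].unique())
    common = train_values & test_values
    return {
        "train": len(train_values),
        "test": len(test_values),
        "common": len(common),
        # Share of distinct test values that were already seen in train
        "test_seen_share": len(common) / len(test_values) if test_values else 0.0,
    }

# Toy data (made up): campaigns do not overlap, users partly do
toy_train = pd.DataFrame({"campaign_id": [29, 29, 30], "user_id": [1, 2, 3]})
toy_test = pd.DataFrame({"campaign_id": [55, 56, 56], "user_id": [2, 3, 4]})

for col in ["campaign_id", "user_id"]:
    print(col, category_overlap(toy_train, toy_test, col))
```

A column with zero overlap, like `campaign_id` here, is a natural candidate for the grouping variable in a grouped cross-validation scheme.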


In [19]:
### EXERCISE-4
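## Loop over folds with sklearn.model_selection.GroupKFold, grouping by campaign_id (the group identified in the previous section):
## test campaigns (55-80) never appear in train (29-54), so each validation fold should also hold out whole campaigns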
kf = model_selection.GroupKFold(n_splits=5)

# train_df is the data to be split
# train_df["is_click"].values is the target variable
# train_df["campaign_id"].values holds the group labels; each campaign ends up entirely in dev or entirely in val
for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
    # split() yields positional indices, so select rows with .iloc
    dev_df, val_df = train_df.iloc[dev_index, :], train_df.iloc[val_index, :]
    print("Dev camps : ", dev_df["campaign_id"].unique())
    print("Val camps : ", val_df["campaign_id"].unique())
    print()

Dev camps :  [30 32 33 34 31 36 38 39 41 42 44 45 49 48 50 51 52 53]
Val camps :  [29 35 37 40 43 46 47 54]

Dev camps :  [29 30 33 35 34 31 36 38 37 39 40 41 43 42 46 47 49 48 50 52 54]
Val camps :  [32 44 45 51 53]

Dev camps :  [29 32 33 35 34 31 36 38 37 39 40 43 42 44 45 46 47 49 50 51 53 54]
Val camps :  [30 41 48 52]

Dev camps :  [29 30 32 35 31 36 38 37 40 41 43 42 44 45 46 47 48 50 51 52 53 54]
Val camps :  [33 34 39 49]

Dev camps :  [29 30 32 33 35 34 37 39 40 41 43 44 45 46 47 49 48 51 52 53 54]
Val camps :  [31 36 38 42 50]
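
The property that makes `GroupKFold` suitable here is that a group never appears on both sides of a split, and every group is held out exactly once. The sketch below checks both properties on synthetic data; the ten group labels stand in for campaign IDs, and the features and target are random values made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# 10 hypothetical campaigns with 20 rows each
groups = np.repeat(np.arange(10), 20)
X = rng.normal(size=(len(groups), 3))      # random features
y = rng.integers(0, 2, size=len(groups))   # random binary target

kf = GroupKFold(n_splits=5)
fold_val_groups = []
for dev_idx, val_idx in kf.split(X, y, groups):
    dev_groups = {int(g) for g in groups[dev_idx]}
    val_groups = {int(g) for g in groups[val_idx]}
    # No group may appear in both the dev and the validation part
    assert dev_groups.isdisjoint(val_groups)
    fold_val_groups.append(val_groups)
    print("val groups:", sorted(val_groups), "| rows:", len(val_idx))
```

Note that the folds printed for `train_df` mix early and late campaigns (for example 29 and 54 in the same validation fold), so this scheme accounts for unseen campaigns but not for the time ordering of the data.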

