---
---

## DHS2019 FEATURE ENGINEERING
---

#### PROBLEM STATEMENT
---


Email marketing is still the most successful marketing channel and an essential element of any digital marketing strategy. Marketers spend a lot of time writing that perfect email, labouring over each word and over catchy layouts for multiple devices, to get best-in-industry open rates and click rates.

"How can I build my campaign to increase email click-through rates?" is a question often heard when marketers are creating their email marketing plans.

Can we optimize our email marketing campaigns with Data Science?

It's time to unlock marketing potential and build some exceptional data-science products for email marketing.

Analytics Vidhya sends out marketing emailers for various events such as conferences, hackathons, etc. We have provided a sample of user-email interaction data from July 2017 to December 2017. You are required to predict the click probability of links inside a mailer for email campaigns from January 2018 to March 2018.


#### **Contest URL: https://datahack.analyticsvidhya.com/contest/workshop_lord-of-the-machines-data-science-hackath/**



---

In [1]:
# importing required libraries
import numpy as np
import pandas as pd
import time
from sklearn import model_selection, preprocessing, metrics, ensemble
import lightgbm as lgb

In [31]:
# read the datasets
train_df = pd.read_csv("dataset/train.csv")
test_df = pd.read_csv("dataset/test.csv")
campaign_df = pd.read_csv("dataset/campaign_data.csv")

# create another column to label train data as 1 and test data as 0
train_df["train_set"] = 1
test_df["train_set"] = 0
print("Train and test shape : ", train_df.shape, test_df.shape)

Train and test shape :  (1023191, 7) (773858, 5)


In [35]:
# view the train data
train_df.head()

Unnamed: 0,id,user_id,campaign_id,send_date,is_open,is_click,train_set
0,42_14051,14051,42,01-09-2017 19:55,0,0,1
1,52_134438,134438,52,02-11-2017 12:53,0,0,1
2,33_181789,181789,33,24-07-2017 15:15,0,0,1
3,44_231448,231448,44,05-09-2017 11:36,0,0,1
4,29_185580,185580,29,01-07-2017 18:01,0,0,1


In [36]:
# view the test data
test_df.head()

Unnamed: 0,id,campaign_id,user_id,send_date,train_set
0,63_122715,63,122715,01-02-2018 22:35,0
1,56_76206,56,76206,02-01-2018 08:15,0
2,57_96189,57,96189,05-01-2018 18:25,0
3,56_166917,56,166917,02-01-2018 08:15,0
4,56_172838,56,172838,02-01-2018 08:12,0


In [38]:
# view the campaign data
campaign_df.head(3)

Unnamed: 0,campaign_id,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,email_body,subject,email_url
0,29,Newsletter,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
1,30,Upcoming Events,18,14,7,1,"Dear AVians,\r\n \r\nAre your eager to know wh...",[July] Data Science Expert Meetups & Competiti...,http://r.newsletters.analyticsvidhya.com/7up0e...
2,31,Conference,15,13,5,1,Early Bird Pricing Till August 07  Save upto ...,Last chance to convince your boss before the E...,http://r.newsletters.analyticsvidhya.com/7usym...


In [41]:
# convert send date to date time format in both train and test data
train_df["send_date"] = pd.to_datetime(train_df["send_date"], format="%d-%m-%Y %H:%M")
test_df["send_date"] = pd.to_datetime(test_df["send_date"], format="%d-%m-%Y %H:%M")

# create a new feature, ordinal_date, holding the send time as a Unix timestamp:
# the number of seconds between the Unix epoch (January 1, 1970, 00:00:00 UTC)
# and the specified time. time.mktime returns a float and reads the time tuple as local time.
train_df["ordinal_date"] = train_df["send_date"].apply(lambda x: time.mktime(x.timetuple()))
test_df["ordinal_date"] = test_df["send_date"].apply(lambda x: time.mktime(x.timetuple()))

# sort the dataframes using ordinal date
train_df = train_df.sort_values(by="ordinal_date").reset_index(drop=True)
test_df = test_df.sort_values(by="ordinal_date").reset_index(drop=True)

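# add more features to train and test data by merging them with campaign data
# (run this cell only once: re-running it merges campaign_df again and creates duplicate _x/_y columns)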
train_df = pd.merge(train_df, campaign_df, on="campaign_id")
test_df = pd.merge(test_df, campaign_df, on="campaign_id")
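The `apply`/`time.mktime` approach above works row by row and depends on the machine's local time zone. As a possible alternative, the following minimal sketch computes the same kind of epoch-seconds feature with vectorized pandas arithmetic, treating the timestamps as UTC. The sample dates and the `epoch_seconds` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# two sample send dates in the dataset's day-month-year format (illustrative values)
send_date = pd.to_datetime(
    pd.Series(["01-09-2017 19:55", "02-11-2017 12:53"]),
    format="%d-%m-%Y %H:%M",
)

# seconds elapsed since the Unix epoch, with the naive timestamps read as UTC
epoch_seconds = (send_date - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(epoch_seconds.tolist())
```

Because the result is an integer series, it can be used for sorting in the same way as `ordinal_date`.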

In [44]:
# Label encode the column: [communication_type]
for c in ["communication_type"]:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(train_df[c].values.astype('str')) + list(test_df[c].values.astype('str')))
    train_df[c] = lbl.transform(list(train_df[c].values.astype('str')))
    test_df[c] = lbl.transform(list(test_df[c].values.astype('str')))

train_df.head()

Unnamed: 0,id,user_id,campaign_id,send_date,is_open,is_click,train_set,ordinal_date,communication_type_x,total_links_x,...,subject_y,email_url_y,communication_type,total_links,no_of_internal_links,no_of_images,no_of_sections,email_body,subject,email_url
0,29_170634,170634,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...,3,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
1,29_95676,95676,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...,3,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
2,29_7740,7740,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...,3,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
3,29_20155,20155,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...,3,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
4,29_134093,134093,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...,3,67,61,12,3,"Dear AVians,\r\n \r\nWe are shaping up a super...",Sneak Peek: A look at the emerging data scienc...,http://r.newsletters.analyticsvidhya.com/7um44...
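The label-encoding cell fits the encoder on the union of train and test values so that a category seen only in the test data still receives a code. A small sketch of that idea on toy values follows; the category names and variable names here are assumptions chosen for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# toy category values (illustrative); "Webinar" appears only in the test list
train_types = ["Newsletter", "Conference", "Newsletter"]
test_types = ["Webinar", "Newsletter"]

# fit on both lists so every category gets an integer code
encoder = LabelEncoder()
encoder.fit(train_types + test_types)

train_codes = encoder.transform(train_types)
test_codes = encoder.transform(test_types)
print(list(encoder.classes_), train_codes.tolist(), test_codes.tolist())
```

Fitting on the train list alone would make `transform` raise an error for the unseen test category.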


### Baseline Model

In [45]:
# define a function to train the LightGBM model
def runLGB(train_X, train_y, test_X, test_y=None, test_X2=None):
    params = {}
    params["objective"] = "binary"
    params["metric"] = "auc"
    params["max_depth"] = 4
    params["min_data_in_leaf"] = 100
    params["learning_rate"] = 0.001
    params["bagging_fraction"] = 0.7
    params["feature_fraction"] = 0.7
    params["bagging_freq"] = 5
    params["bagging_seed"] = 0
    params["verbosity"] = -1
    num_rounds = 10000

    lgtrain = lgb.Dataset(train_X, label=train_y)

    if test_y is not None:
        # with validation labels: stop once validation AUC has not improved for 100 rounds.
        # LightGBM >= 4.0 expects callbacks here; older releases took
        # early_stopping_rounds=100 and verbose_eval=50 as arguments to lgb.train.
        lgtest = lgb.Dataset(test_X, label=test_y)
        model = lgb.train(params, lgtrain, num_rounds, valid_sets=[lgtest],
                          callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=50)])
    else:
        # no validation labels: train for the full number of rounds
        model = lgb.train(params, lgtrain, num_rounds)

    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)
    # test_X2 is optional, so only score it when it is provided
    pred_test_y2 = None
    if test_X2 is not None:
        pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)

    # loss holds the validation ROC AUC (0 when no labels are given)
    loss = 0
    if test_y is not None:
        loss = metrics.roc_auc_score(test_y, pred_test_y)
        print(loss)
    return pred_test_y, loss, pred_test_y2
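The model is evaluated with ROC AUC, which depends only on how the predicted scores rank clicked rows against non-clicked rows. The following sketch, on made-up labels and scores, illustrates that property with `roc_auc_score`; the values are assumptions for demonstration only.

```python
from sklearn.metrics import roc_auc_score

# toy labels and predicted scores (illustrative values)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)

# rescaling the scores keeps their order, so the AUC does not change
auc_scaled = roc_auc_score(y_true, [s / 10 for s in y_score])
print(auc, auc_scaled)
```

This is why averaging fold predictions for the submission is reasonable here: only the ordering of the averaged scores matters for the metric.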

In [4]:
# filter the columns that you want to use to train the model
usecols = ["user_id", "communication_type", "total_links", "no_of_internal_links", "no_of_images", "no_of_sections"]
train_X = train_df[usecols]
test_X = test_df[usecols]
train_y = train_df["is_click"].values

In [5]:
# GroupKFold with n_splits=5
cv_scores = []
pred_test_full = 0
kf = model_selection.GroupKFold(n_splits=5)
for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
    # print the campaign IDs of the current validation fold
    print(train_df["campaign_id"].loc[val_index].unique())
    dev_X, val_X = train_X.loc[dev_index,:], train_X.loc[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    # train a lgb model 
    pred_val, loss, pred_test = runLGB(dev_X, dev_y, val_X, val_y, test_X)
    cv_scores.append(loss)
    pred_test_full += pred_test
    print(cv_scores)
pred_test_full /= 5.

[29 35 37 40 43 46 47 54]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.503927
[100]	valid_0's auc: 0.5043
Early stopping, best iteration is:
[1]	valid_0's auc: 0.535532
0.535532069116465
[0.535532069116465]
[32 44 45 51 53]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.579146
[100]	valid_0's auc: 0.579145
Early stopping, best iteration is:
[40]	valid_0's auc: 0.579148
0.579147651136752
[0.535532069116465, 0.579147651136752]
[30 41 48 52]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.501905
[100]	valid_0's auc: 0.501894
[150]	valid_0's auc: 0.502117
[200]	valid_0's auc: 0.502108
[250]	valid_0's auc: 0.502523
[300]	valid_0's auc: 0.502825
[350]	valid_0's auc: 0.502732
[400]	valid_0's auc: 0.502571
Early stopping, best iteration is:
[316]	valid_0's auc: 0.502826
0.5028262333534865
[0.535532069116465, 0.579147651136752, 0.5028262333534865]
[33 34 39 49]
Training until validati
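The loop above passes `campaign_id` as the group key, so every campaign lands entirely in either the development part or the validation part of a fold. That mimics the contest setup, where test campaigns are unseen during training. A minimal sketch on toy campaign IDs (assumed values) checks this property:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# toy data: two rows for each of five campaigns (illustrative IDs)
campaign_id = np.array([29, 29, 30, 30, 31, 31, 32, 32, 33, 33])
X = np.arange(len(campaign_id)).reshape(-1, 1)
y = np.zeros(len(campaign_id))

shared = []
for dev_index, val_index in GroupKFold(n_splits=5).split(X, y, groups=campaign_id):
    dev_camps = set(campaign_id[dev_index])
    val_camps = set(campaign_id[val_index])
    # campaigns present on both sides of the split (expected to be empty)
    shared.append(dev_camps & val_camps)
    print("validation campaigns:", sorted(int(c) for c in val_camps))
```

A plain `KFold` would mix rows of the same campaign across both sides, which tends to give more optimistic validation scores for this kind of problem.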

In [6]:
# save the calculated results in the csv file and submit it on the contest page to get the results
# to create the sample submission file refer to the sample submission file as given on the contest page
sub_df = pd.DataFrame({"id":test_df["id"].values})
sub_df["is_click"] = pred_test_full
sub_df.to_csv("sub1.csv", index=False)

#### Score is: 0.53

### Check if you can remove some variables and still the model results are similar

If you think some of the variables above might not add much value, remove them and re-run the models.

In [7]:
# filter the columns that you want to use to train the model
# ******* EXERCISE 1 **********

usecols = ["user_id", "communication_type"] 
train_X = train_df[usecols]
test_X = test_df[usecols]
train_y = train_df["is_click"].values

In [8]:
# GroupKFold with n_splits=5
cv_scores = []
pred_test_full = 0
kf = model_selection.GroupKFold(n_splits=5)
for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
    # print the campaign IDs of the current validation fold
    print(train_df["campaign_id"].loc[val_index].unique())
    dev_X, val_X = train_X.loc[dev_index,:], train_X.loc[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    # train a lgb model
    pred_val, loss, pred_test = runLGB(dev_X, dev_y, val_X, val_y, test_X)
    pred_test_full += pred_test
    cv_scores.append(loss)
    print(cv_scores)
pred_test_full /= 5.

[29 35 37 40 43 46 47 54]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.54335
[100]	valid_0's auc: 0.540487
Early stopping, best iteration is:
[15]	valid_0's auc: 0.543784
0.5437840421618861
[0.5437840421618861]
[32 44 45 51 53]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.593185
[100]	valid_0's auc: 0.591552
[150]	valid_0's auc: 0.593148
Early stopping, best iteration is:
[55]	valid_0's auc: 0.593937
0.5939369648958792
[0.5437840421618861, 0.5939369648958792]
[30 41 48 52]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.495496
[100]	valid_0's auc: 0.494837
Early stopping, best iteration is:
[1]	valid_0's auc: 0.503977
0.5039767329895155
[0.5437840421618861, 0.5939369648958792, 0.5039767329895155]
[33 34 39 49]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.49383
[100]	valid_0's auc: 0.493111
Early stopping, best iteration is:
[1]	valid_0

In [9]:
# submit the results to check the scores
sub_df = pd.DataFrame({"id":test_df["id"].values})
sub_df["is_click"] = pred_test_full
sub_df.to_csv("sub2.csv", index=False)

## This has reduced the score
### Score is: 0.49

---
### Temporal variables 

Which temporal variables can be created for this problem? Extract them and use them to rebuild the models.

---

In [10]:
# extract a new feature hour from the send_date
train_df["hour"] = pd.to_datetime(train_df["send_date"]).dt.hour
test_df["hour"] = pd.to_datetime(test_df["send_date"]).dt.hour
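Beyond the hour, other temporal variables can be derived from `send_date` with the same `.dt` accessor. The sketch below shows a few candidates on two sample dates; the column names `day_of_week`, `day_of_month` and `is_weekend` are suggestions, not features used in the original notebook.

```python
import pandas as pd

# two sample send dates in the dataset's day-month-year format (illustrative values)
sends = pd.DataFrame({
    "send_date": pd.to_datetime(
        ["01-09-2017 19:55", "01-07-2017 18:01"], format="%d-%m-%Y %H:%M"
    )
})

sends["hour"] = sends["send_date"].dt.hour
sends["day_of_week"] = sends["send_date"].dt.dayofweek  # Monday=0 ... Sunday=6
sends["day_of_month"] = sends["send_date"].dt.day
sends["is_weekend"] = (sends["day_of_week"] >= 5).astype(int)
print(sends)
```

Whether any of these help would need to be checked with the same GroupKFold validation used elsewhere in this notebook.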

In [11]:
# use this new column also and check if it helps to improve the scores
usecols = ["user_id", "communication_type", "hour"] 
train_X = train_df[usecols]
test_X = test_df[usecols]
train_y = train_df["is_click"].values

In [12]:
# GroupKFold with n_splits=5
cv_scores = []
kf = model_selection.GroupKFold(n_splits=5)
for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
    # print the campaign IDs of the current validation fold
    print(train_df["campaign_id"].loc[val_index].unique())
    dev_X, val_X = train_X.loc[dev_index,:], train_X.loc[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    # train a lgb model
    pred_val, loss, pred_test = runLGB(dev_X, dev_y, val_X, val_y, test_X)
    cv_scores.append(loss)
    print(cv_scores)

[29 35 37 40 43 46 47 54]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.577333
[100]	valid_0's auc: 0.574894
Early stopping, best iteration is:
[17]	valid_0's auc: 0.580598
0.5805976041918889
[0.5805976041918889]
[32 44 45 51 53]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.598628
[100]	valid_0's auc: 0.599174
Early stopping, best iteration is:
[2]	valid_0's auc: 0.608676
0.6086759342199294
[0.5805976041918889, 0.6086759342199294]
[30 41 48 52]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.494414
[100]	valid_0's auc: 0.494412
Early stopping, best iteration is:
[2]	valid_0's auc: 0.494592
0.49459173368654896
[0.5805976041918889, 0.6086759342199294, 0.49459173368654896]
[33 34 39 49]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.514315
[100]	valid_0's auc: 0.531783
[150]	valid_0's auc: 0.529677
Early stopping, best iteration is:
[92]	val

### Target Encoding

In [13]:
def getDVEncodeVar(compute_df, target_df, var_name, target_var="is_click", min_cutoff=1):
    # mean of target_var per group in target_df, looked up for each row of compute_df
    # (min_cutoff is accepted but not used)
    if not isinstance(var_name, list):
        var_name = [var_name]
    grouped_df = target_df.groupby(var_name)[target_var].agg(["mean"]).reset_index()
    grouped_df.columns = var_name + ["mean_value"]
    merged_df = pd.merge(compute_df, grouped_df, how="left", on=var_name)
    # groups unseen in target_df fall back to the overall mean of the target
    merged_df.fillna(np.mean(target_df[target_var].values), inplace=True)
    return list(merged_df["mean_value"])

cols_to_use = []
kf = model_selection.GroupKFold(n_splits=5)
#### columns to target encode 
# ****** EXERCISE 3*******
for col in [["user_id"], ["user_id", "communication_type"]]:
        train_enc_values = np.zeros(train_df.shape[0])
        test_enc_values = 0
        for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
            dev_X, val_X = train_df.loc[dev_index], train_df.loc[val_index]
            train_enc_values[val_index] = np.array( getDVEncodeVar(val_X[col], dev_X, col, 'is_click'))
            test_enc_values += np.array( getDVEncodeVar(test_df[col], dev_X, col, 'is_click'))
        test_enc_values /= 5.
        if isinstance(col, list):
            col = "_".join(col)
        train_df[col + "_enc"] = train_enc_values
        test_df[col + "_enc"] = test_enc_values
        cols_to_use.append(col + "_enc")
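The cell above computes the encoding out of fold: the value for a validation row is the mean click rate of its key in the other folds, so a row never sees its own target. The following self-contained sketch shows the same idea on a toy frame. The helper name `oof_target_encode` and the toy data are hypothetical and only illustrate the technique.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# toy interaction log (illustrative values)
toy = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 1, 2, 3, 3],
    "campaign_id": [10, 11, 10, 11, 12, 12, 10, 13],
    "is_click":    [1, 0, 0, 0, 1, 0, 1, 0],
})

def oof_target_encode(df, key, target, groups, n_splits=2):
    """Mean of `target` per `key`, computed only from rows outside each row's fold."""
    enc = np.zeros(len(df))
    splitter = GroupKFold(n_splits=n_splits)
    for dev_idx, val_idx in splitter.split(df, df[target], groups):
        dev = df.iloc[dev_idx]
        means = dev.groupby(key)[target].mean()
        # keys unseen in the development part fall back to its overall mean
        fallback = dev[target].mean()
        enc[val_idx] = df.iloc[val_idx][key].map(means).fillna(fallback).to_numpy()
    return enc

toy["user_id_enc"] = oof_target_encode(toy, "user_id", "is_click", toy["campaign_id"])
print(toy)
```

Computing the mean on the full training set instead would leak each row's own label into its feature.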

In [14]:
#### EXERCISE 4
usecols = ["user_id", "communication_type"] + cols_to_use
print(usecols)
train_X = train_df[usecols]
test_X = test_df[usecols]
train_y = train_df["is_click"].values

['user_id', 'communication_type', 'user_id_enc', 'user_id_communication_type_enc']


In [15]:
cv_scores = []
pred_test_full = 0
kf = model_selection.GroupKFold(n_splits=5)
for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
    print(train_df["campaign_id"].loc[val_index].unique())
    dev_X, val_X = train_X.loc[dev_index,:], train_X.loc[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    
    pred_val, loss, pred_test = runLGB(dev_X, dev_y, val_X, val_y, test_X)
    cv_scores.append(loss)
    pred_test_full += pred_test
    print(cv_scores)
pred_test_full /= 5.

[29 35 37 40 43 46 47 54]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.633344
[100]	valid_0's auc: 0.633874
Early stopping, best iteration is:
[5]	valid_0's auc: 0.63756
0.6375598817669201
[0.6375598817669201]
[32 44 45 51 53]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.661916
[100]	valid_0's auc: 0.662224
Early stopping, best iteration is:
[20]	valid_0's auc: 0.664003
0.6640025317157285
[0.6375598817669201, 0.6640025317157285]
[30 41 48 52]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.611326
[100]	valid_0's auc: 0.611038
[150]	valid_0's auc: 0.611964
[200]	valid_0's auc: 0.610521
Early stopping, best iteration is:
[130]	valid_0's auc: 0.611994
0.6119938319722543
[0.6375598817669201, 0.6640025317157285, 0.6119938319722543]
[33 34 39 49]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.618856
[100]	valid_0's auc: 0.620204
Early stopping

In [16]:
# submit the results on the contest page
sub_df = pd.DataFrame({"id":test_df["id"].values})
sub_df["is_click"] = pred_test_full
sub_df.to_csv("sub3.csv", index=False)

### Score is: 0.581

### Target encoding with open as target

In [17]:
cols_to_use = []
kf = model_selection.GroupKFold(n_splits=5)
#### EXERCISE 5
for col in [["user_id"], ["user_id", "communication_type"]]:
        train_enc_values = np.zeros(train_df.shape[0])
        test_enc_values = 0
        for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
            dev_X, val_X = train_df.loc[dev_index], train_df.loc[val_index]
            train_enc_values[val_index] = np.array( getDVEncodeVar(val_X[col], dev_X, col, 'is_open'))
            test_enc_values += np.array( getDVEncodeVar(test_df[col], dev_X, col, 'is_open'))
        test_enc_values /= 5.
        if isinstance(col, list):
            col = "_".join(col)
        train_df[col + "_open_enc"] = train_enc_values
        test_df[col + "_open_enc"] = test_enc_values
        cols_to_use.append(col + "_open_enc")

In [18]:
#### EXERCISE 6
usecols = ["user_id", "communication_type", 'user_id_enc', 'user_id_communication_type_enc'] + cols_to_use
print(usecols)
train_X = train_df[usecols]
test_X = test_df[usecols]
train_y = train_df["is_click"].values

['user_id', 'communication_type', 'user_id_enc', 'user_id_communication_type_enc', 'user_id_open_enc', 'user_id_communication_type_open_enc']


In [19]:
cv_scores = []
pred_test_full = 0
kf = model_selection.GroupKFold(n_splits=5)
for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
    print(train_df["campaign_id"].loc[val_index].unique())
    dev_X, val_X = train_X.loc[dev_index,:], train_X.loc[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    
    pred_val, loss, pred_test = runLGB(dev_X, dev_y, val_X, val_y, test_X)
    cv_scores.append(loss)
    pred_test_full += pred_test
    print(cv_scores)
pred_test_full /= 5.

[29 35 37 40 43 46 47 54]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.700779
[100]	valid_0's auc: 0.700296
[150]	valid_0's auc: 0.700404
Early stopping, best iteration is:
[57]	valid_0's auc: 0.700952
0.70095210412181
[0.70095210412181]
[32 44 45 51 53]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.695309
[100]	valid_0's auc: 0.695144
Early stopping, best iteration is:
[37]	valid_0's auc: 0.696679
0.696678575834816
[0.70095210412181, 0.696678575834816]
[30 41 48 52]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.706568
[100]	valid_0's auc: 0.707789
[150]	valid_0's auc: 0.707884
[200]	valid_0's auc: 0.707921
[250]	valid_0's auc: 0.708038
[300]	valid_0's auc: 0.708031
[350]	valid_0's auc: 0.708183
[400]	valid_0's auc: 0.708186
Early stopping, best iteration is:
[334]	valid_0's auc: 0.708295
0.7082953026851153
[0.70095210412181, 0.696678575834816, 0.7082953026851153]
[33 34 3

In [20]:
sub_df = pd.DataFrame({"id":test_df["id"].values})
sub_df["is_click"] = pred_test_full
sub_df.to_csv("sub4.csv", index=False)

### Score is: 0.666

### Count for each target encoding variables

In [21]:
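def getCountVar(compute_df, count_df, var_name, count_var="v1"):
    # number of non-null count_var values per group in count_df,
    # looked up for each row of compute_df; var_name must be a list of column names
    grouped_df = count_df.groupby(var_name)[count_var].agg('count').reset_index()
    grouped_df.columns = var_name + ["var_count"]

    merged_df = pd.merge(compute_df, grouped_df, how="left", on=var_name)
    # groups unseen in count_df fall back to the mean group count
    merged_df.fillna(np.mean(grouped_df["var_count"].values), inplace=True)
    return list(merged_df["var_count"])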

cols_to_use = []
kf = model_selection.GroupKFold(n_splits=5)
#### EXERCISE 7
for col in [["user_id"], ["user_id", "communication_type"]]:
        train_enc_values = np.zeros(train_df.shape[0])
        test_enc_values = 0
        for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
            dev_X, val_X = train_df.loc[dev_index], train_df.loc[val_index]
            train_enc_values[val_index] = np.array( getCountVar(val_X[col], dev_X, col, 'is_open'))
            test_enc_values += np.array( getCountVar(test_df[col], dev_X, col, 'is_open'))
        test_enc_values /= 5.
        if isinstance(col, list):
            col = "_".join(col)
        train_df[col + "_count"] = train_enc_values
        test_df[col + "_count"] = test_enc_values
        cols_to_use.append(col + "_count")
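A count feature tells the model how much evidence stands behind each target-encoded value: a click rate based on one email is less reliable than one based on twenty. The sketch below reproduces the lookup logic of `getCountVar` on toy data using `value_counts` and `map`; the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# rows used to compute the counts, and rows that receive the feature (illustrative values)
history = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 3]})
to_encode = pd.DataFrame({"user_id": [1, 2, 4]})

# number of history rows per user
counts = history["user_id"].value_counts()

# look up each user's count; users absent from history get the mean count
user_count = to_encode["user_id"].map(counts).fillna(counts.mean())
print(user_count.tolist())
```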

In [22]:
### EXERCISE 8
usecols = ['user_id_enc', 'user_id_communication_type_enc', 'user_id_open_enc', 'user_id_communication_type_open_enc'] + cols_to_use
print(usecols)
train_X = train_df[usecols]
test_X = test_df[usecols]
train_y = train_df["is_click"].values

['user_id_enc', 'user_id_communication_type_enc', 'user_id_open_enc', 'user_id_communication_type_open_enc', 'user_id_count', 'user_id_communication_type_count']


In [23]:
cv_scores = []
pred_test_full = 0
kf = model_selection.GroupKFold(n_splits=5)
for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
    print(train_df["campaign_id"].loc[val_index].unique())
    dev_X, val_X = train_X.loc[dev_index,:], train_X.loc[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    
    pred_val, loss, pred_test = runLGB(dev_X, dev_y, val_X, val_y, test_X)
    cv_scores.append(loss)
    pred_test_full += pred_test
    print(cv_scores)
pred_test_full /= 5.

[29 35 37 40 43 46 47 54]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.700923
[100]	valid_0's auc: 0.70188
Early stopping, best iteration is:
[20]	valid_0's auc: 0.702435
0.7024346893239648
[0.7024346893239648]
[32 44 45 51 53]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.698061
[100]	valid_0's auc: 0.698116
Early stopping, best iteration is:
[46]	valid_0's auc: 0.699058
0.6990579354329693
[0.7024346893239648, 0.6990579354329693]
[30 41 48 52]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.722059
[100]	valid_0's auc: 0.72251
Early stopping, best iteration is:
[39]	valid_0's auc: 0.723146
0.723146388239573
[0.7024346893239648, 0.6990579354329693, 0.723146388239573]
[33 34 39 49]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.722793
[100]	valid_0's auc: 0.723697
[150]	valid_0's auc: 0.72353
[200]	valid_0's auc: 0.724025
[250]	valid_0's au

In [24]:
sub_df = pd.DataFrame({"id":test_df["id"].values})
sub_df["is_click"] = pred_test_full
sub_df.to_csv("sub5.csv", index=False)

### Score is: 0.6609

### More Features

So far we have created features based on interactions and textbook ideas. Now let us put on our business hat and come up with new features.

In [25]:
#### EXERCISE 9

train_df2 = pd.read_csv("dataset/train.csv")
test_df2 = pd.read_csv("dataset/test.csv")

train_df2["send_date"] = pd.to_datetime(train_df2["send_date"], format="%d-%m-%Y %H:%M")
test_df2["send_date"] = pd.to_datetime(test_df2["send_date"], format="%d-%m-%Y %H:%M")

train_df2["ordinal_date"] = train_df2["send_date"].apply(lambda x: time.mktime(x.timetuple()))
test_df2["ordinal_date"] = test_df2["send_date"].apply(lambda x: time.mktime(x.timetuple()))

train_df2 = train_df2.sort_values(by="ordinal_date").reset_index(drop=True)
test_df2 = test_df2.sort_values(by="ordinal_date").reset_index(drop=True)

## Combine both the datasets 
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
full_df = pd.concat([train_df2, test_df2]).reset_index(drop=True)

## User cumulative count
full_df["user_cum_count"] = full_df.groupby("user_id")["id"].cumcount()

## User count
gdf = full_df.groupby("user_id")["id"].size().reset_index()
gdf.columns = ["user_id", "user_count"]
full_df = pd.merge(full_df, gdf, on="user_id", how="left")

## User previous date diff and camp diff
full_df["user_prev_date"] = full_df.groupby("user_id")["ordinal_date"].shift(1)
full_df["user_prev_camp"] = full_df.groupby("user_id")["campaign_id"].shift(1)
full_df["user_date_diff"] = full_df["ordinal_date"] - full_df["user_prev_date"]
full_df["user_camp_diff"] = full_df["campaign_id"] - full_df["user_prev_camp"]

## Per-user min, mean, max and std of send dates
gdf = full_df.groupby("user_id")["ordinal_date"].agg(["min", "mean", "max", "std"]).reset_index()
gdf.columns = ["user_id", "user_min_date", "user_mean_date", "user_max_date", "user_std_date"]
full_df = pd.merge(full_df, gdf, on="user_id")

## Campaigns received by each user (one count column per campaign)
pivot_df = pd.pivot_table(full_df, values="ordinal_date", index="user_id", columns="campaign_id", aggfunc="count", fill_value=0).reset_index()
pivot_df.columns = ["user_id"] + ["camp_"+str(i) for i in range(29,81)]
full_df = pd.merge(full_df, pivot_df, on="user_id")

#train_df = pd.merge(train_df, full_df, on="id", how="left")
#test_df = pd.merge(test_df, full_df, on="id", how="left")
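The per-user history features rely on the rows being sorted by time before `cumcount`, `shift` and the differences are computed. The following sketch walks through the same steps on a small toy log so the resulting values can be inspected by hand. The frame and its values are assumptions, and `diff` is used here as a shorthand for the shift-and-subtract pattern in the cell above.

```python
import pandas as pd

# toy send log for two users (illustrative values), sorted by send time
events = pd.DataFrame({
    "user_id":      [1, 2, 1, 1, 2],
    "campaign_id":  [29, 29, 30, 32, 33],
    "ordinal_date": [100.0, 100.0, 160.0, 400.0, 700.0],
})
events = events.sort_values("ordinal_date", kind="stable").reset_index(drop=True)

# how many emails the user had received before this one
events["user_cum_count"] = events.groupby("user_id").cumcount()
# total number of emails sent to the user
events["user_count"] = events.groupby("user_id")["campaign_id"].transform("size")
# seconds since the user's previous email (NaN for the first email)
events["user_date_diff"] = events.groupby("user_id")["ordinal_date"].diff()
# gap in campaign IDs since the user's previous email (NaN for the first email)
events["user_camp_diff"] = events.groupby("user_id")["campaign_id"].diff()
print(events)
```

The first email of each user has no predecessor, so the two difference columns start with missing values, which LightGBM can handle directly.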



In [26]:
fc = ["id", "user_cum_count", "user_count", "user_date_diff", "user_camp_diff", "user_min_date", "user_mean_date", "user_max_date", "user_std_date"]
fc = fc + ["camp_"+str(i) for i in range(29,81)]
train_df = pd.merge(train_df, full_df[fc], on="id", how="left")
test_df = pd.merge(test_df, full_df[fc], on="id", how="left")
train_df.head()

Unnamed: 0,id,user_id,campaign_id,send_date,is_open,is_click,train_set,ordinal_date,communication_type,total_links,...,camp_71,camp_72,camp_73,camp_74,camp_75,camp_76,camp_77,camp_78,camp_79,camp_80
0,29_170634,170634,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,1,0,0,1,1,0,0,0,0,0
1,29_7740,7740,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,0,0,0,0,1,0,0,0,0,0
2,29_20155,20155,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,0,0,0,0,0,1,0,0,0,0
3,29_134093,134093,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,1,0,0,1,1,0,0,0,0,0
4,29_82171,82171,29,2017-07-01 18:01:00,0,0,1,1498912000.0,3,67,...,0,0,0,0,0,0,0,0,0,0


In [27]:
usecols = ['user_id_enc', 'user_id_communication_type_enc', 
           'user_id_open_enc', 'user_id_communication_type_open_enc',
           'user_id_count', 'user_id_communication_type_count'
          ]
usecols = usecols + ["user_cum_count", "user_count", "user_date_diff", "user_camp_diff", 
                     "user_min_date", "user_mean_date", "user_max_date", "user_std_date"]
usecols = usecols + ["camp_"+str(i) for i in range(29,81)]
train_X = train_df[usecols]
test_X = test_df[usecols]
train_y = train_df["is_click"].values
test_id = test_df["id"].values
print(usecols)

['user_id_enc', 'user_id_communication_type_enc', 'user_id_open_enc', 'user_id_communication_type_open_enc', 'user_id_count', 'user_id_communication_type_count', 'user_cum_count', 'user_count', 'user_date_diff', 'user_camp_diff', 'user_min_date', 'user_mean_date', 'user_max_date', 'user_std_date', 'camp_29', 'camp_30', 'camp_31', 'camp_32', 'camp_33', 'camp_34', 'camp_35', 'camp_36', 'camp_37', 'camp_38', 'camp_39', 'camp_40', 'camp_41', 'camp_42', 'camp_43', 'camp_44', 'camp_45', 'camp_46', 'camp_47', 'camp_48', 'camp_49', 'camp_50', 'camp_51', 'camp_52', 'camp_53', 'camp_54', 'camp_55', 'camp_56', 'camp_57', 'camp_58', 'camp_59', 'camp_60', 'camp_61', 'camp_62', 'camp_63', 'camp_64', 'camp_65', 'camp_66', 'camp_67', 'camp_68', 'camp_69', 'camp_70', 'camp_71', 'camp_72', 'camp_73', 'camp_74', 'camp_75', 'camp_76', 'camp_77', 'camp_78', 'camp_79', 'camp_80']


In [28]:
cv_scores = []
pred_test_full = 0
kf = model_selection.GroupKFold(n_splits=5)
for dev_index, val_index in kf.split(train_df, train_df["is_click"].values, train_df["campaign_id"].values):
    print(train_df["campaign_id"].loc[val_index].unique())
    dev_X, val_X = train_X.loc[dev_index,:], train_X.loc[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    
    pred_val, loss, pred_test = runLGB(dev_X, dev_y, val_X, val_y, test_X)
    pred_test_full += pred_test
    cv_scores.append(loss)
    print(cv_scores)
pred_test_full /= 5.

[29 35 37 40 43 46 47 54]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.724284
[100]	valid_0's auc: 0.727106
[150]	valid_0's auc: 0.728877
[200]	valid_0's auc: 0.729128
[250]	valid_0's auc: 0.729
[300]	valid_0's auc: 0.728296
Early stopping, best iteration is:
[213]	valid_0's auc: 0.729415
0.7294152523130001
[0.7294152523130001]
[32 44 45 51 53]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.726925
[100]	valid_0's auc: 0.729549
[150]	valid_0's auc: 0.728537
Early stopping, best iteration is:
[97]	valid_0's auc: 0.729634
0.7296338112087778
[0.7294152523130001, 0.7296338112087778]
[30 41 48 52]
Training until validation scores don't improve for 100 rounds
[50]	valid_0's auc: 0.741775
[100]	valid_0's auc: 0.741272
Early stopping, best iteration is:
[48]	valid_0's auc: 0.741808
0.741807628073808
[0.7294152523130001, 0.7296338112087778, 0.741807628073808]
[33 34 39 49]
Training until validation scores don't improve f

In [29]:
sub_df = pd.DataFrame({"id":test_df["id"].values})
sub_df["is_click"] = pred_test_full
sub_df.to_csv("sub6.csv", index=False)

### Score is: 0.7014