## Introduction

In this exercise, I will develop a baseline model for predicting whether a customer will buy an app after clicking on an ad. With this baseline in place, I'll be able to see how later feature engineering and selection efforts improve the model's performance.

In [1]:
import pandas as pd

In [18]:
data = pd.read_csv('train_sample.csv', parse_dates=['click_time'])  # parse the click_time column as datetime
data.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,89489,3,1,13,379,2017-11-06 15:13:23,,0
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1
2,3437,6,1,13,459,2017-11-06 15:42:32,,0
3,167543,3,1,13,379,2017-11-06 15:56:17,,0
4,147509,3,1,13,379,2017-11-06 15:57:01,,0


In [19]:
print("Total samples are ====> {}".format(data.shape[0]))

Total samples are ====> 2300561


#### Creating Features from the Timestamp Column

In [20]:
columns = data.columns
columns

Index(['ip', 'app', 'device', 'os', 'channel', 'click_time', 'attributed_time',
       'is_attributed'],
      dtype='object')

In [21]:
copy_data = data.copy()  # work on a copy so the raw data stays untouched
# Derive day, hour, minute, and second columns from click_time
copy_data['day'] = copy_data['click_time'].dt.day.astype('uint8')
copy_data['hour'] = copy_data['click_time'].dt.hour.astype('uint8')
copy_data['minute'] = copy_data['click_time'].dt.minute.astype('uint8')
copy_data['second'] = copy_data['click_time'].dt.second.astype('uint8')
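
As a quick sanity check on the four derived columns, the sketch below pulls the same components out of a single timestamp using only the standard library. The sample value is copied from the second row of the table shown earlier; the variable names are illustrative and not part of the notebook.

```python
from datetime import datetime

# One click_time value copied from the sample rows shown earlier.
click_time = datetime.strptime("2017-11-06 15:41:07", "%Y-%m-%d %H:%M:%S")

# The same components that the .dt accessor extracts column-wise in pandas.
parts = {
    "day": click_time.day,
    "hour": click_time.hour,
    "minute": click_time.minute,
    "second": click_time.second,
}
print(parts)

# Day is at most 31 and the others at most 59, so each fits in uint8 (0 to 255).
fits_uint8 = all(0 <= value <= 255 for value in parts.values())
```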

In [22]:
copy_data.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,13,23
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,42,32
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,56,17
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,57,1


#### Label Encoding

For each of the categorical features ['ip', 'app', 'device', 'os', 'channel'], use scikit-learn's LabelEncoder to create a new feature in the copy_data DataFrame. Each new column name is the original column name with '_label' appended, like ip_label.

In [26]:
from sklearn.preprocessing import LabelEncoder
categories = ['ip', 'app', 'device', 'os', 'channel']
enc = LabelEncoder()  # one encoder, refitted for each feature
for feature in categories:
    encoded_feature = enc.fit_transform(copy_data[feature])  # integer codes for this feature
    col_name = feature + "_label"  # e.g. 'ip' becomes 'ip_label'
    copy_data[col_name] = encoded_feature  # add the encoded column
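
To make the encoding concrete, here is a minimal pure-Python sketch of what this step does for one column: each distinct value is replaced by its position in the sorted list of distinct values. The helper label_encode is a hypothetical illustration, not scikit-learn's implementation, and the toy IP values are taken from the sample rows.

```python
def label_encode(values):
    """Map each value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    index_of = {value: index for index, value in enumerate(classes)}
    return [index_of[value] for value in values], classes

ips = [89489, 204158, 3437, 167543, 89489]
codes, classes = label_encode(ips)
print(classes)  # distinct values in sorted order
print(codes)    # one integer code per input row
```

The codes depend only on which values are present, so encoders fitted on different data can assign different integers to the same value.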

In [27]:
copy_data

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_label,app_label,device_label,os_label,channel_label
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,13,23,27226,3,1,13,120
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,110007,35,1,13,10
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,42,32,1047,6,1,13,157
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,56,17,76270,3,1,13,120
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,57,1,57862,3,1,13,120
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2300556,32457,2,1,19,477,2017-11-09 15:59:59,,0,9,15,59,59,9791,2,1,19,166
2300557,20266,14,1,13,446,2017-11-09 15:59:59,,0,9,15,59,59,6240,14,1,13,146
2300558,49383,12,2,17,178,2017-11-09 16:00:00,,0,9,16,0,0,15098,12,2,17,50
2300559,34894,12,1,15,145,2017-11-09 16:00:00,,0,9,16,0,0,10538,12,1,15,41


#### Creating Train/Test Splits

Here we'll create training, validation, and test splits. First, the DataFrame is sorted in order of increasing click time. The first 80% of the rows form the training set, the next 10% the validation set, and the last 10% the test set.

In [42]:
feature_cols = ['day', 'hour', 'minute', 'second',
                'ip_label', 'app_label', 'device_label',
                'os_label', 'channel_label']  # features used to train the model
valid_fraction = 0.1
sorted_data = copy_data.sort_values('click_time')  # order rows by click time
valid_rows = int(len(sorted_data) * valid_fraction)  # 10% of the rows (230056)
# Slice the sorted frame, not copy_data, so the split respects time order
train_data = sorted_data[:-valid_rows * 2]  # everything except the last 20%
# valid size == test size: the last two 10% sections of the data
valid_data = sorted_data[-valid_rows * 2:-valid_rows]  # second-to-last 10%
test_data = sorted_data[-valid_rows:]  # last 10%
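
The slice boundaries are easier to check with explicit numbers. The sketch below computes them for the row count reported earlier; split_bounds is a hypothetical helper introduced only for this illustration.

```python
def split_bounds(n_rows, valid_fraction):
    """Return (train_end, valid_end) for a train/validation/test split
    where the validation and test sets each take valid_fraction of the rows."""
    valid_rows = int(n_rows * valid_fraction)
    train_end = n_rows - 2 * valid_rows
    valid_end = n_rows - valid_rows
    return train_end, valid_end

n_rows = 2300561  # row count printed earlier in the notebook
train_end, valid_end = split_bounds(n_rows, 0.1)
print("train rows:", train_end)
print("valid rows:", valid_end - train_end)
print("test rows: ", n_rows - valid_end)
```

A DataFrame's .iloc[:train_end], .iloc[train_end:valid_end], and .iloc[valid_end:] then select the same row positions as the negative slices above.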

### Train with LightGBM (Light Gradient Boosting Machine)

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. Models of this kind often perform very well on tabular data, frequently on par with XGBoost, and LightGBM is also relatively fast to train.

In [43]:
import lightgbm as lgb

In [47]:
# Wrap each split in a LightGBM Dataset
dtrain = lgb.Dataset(train_data[feature_cols],           # feature matrix X
                     label=train_data['is_attributed'])  # target y
dvalid = lgb.Dataset(valid_data[feature_cols], label=valid_data['is_attributed'])
dtest = lgb.Dataset(test_data[feature_cols], label=test_data['is_attributed'])
param = {'num_leaves': 64, 'objective': 'binary', 'metric': 'auc'}
num_round = 1000  # upper bound on the number of boosting rounds
# Stop once the validation AUC has not improved for 10 rounds.
# Note: LightGBM 4.0+ removed early_stopping_rounds from lgb.train;
# there, pass callbacks=[lgb.early_stopping(stopping_rounds=10)] instead.
bst = lgb.train(params=param,
                train_set=dtrain,
                num_boost_round=num_round,
                valid_sets=[dvalid],
                early_stopping_rounds=10)

[1]	valid_0's auc: 0.949852
Training until validation scores don't improve for 10 rounds
[2]	valid_0's auc: 0.950133
[3]	valid_0's auc: 0.951059
[4]	valid_0's auc: 0.951016
[5]	valid_0's auc: 0.951419
[6]	valid_0's auc: 0.951832
[7]	valid_0's auc: 0.951894
[8]	valid_0's auc: 0.952461
[9]	valid_0's auc: 0.9524
[10]	valid_0's auc: 0.952548
[11]	valid_0's auc: 0.952519
[12]	valid_0's auc: 0.952673
[13]	valid_0's auc: 0.952653
[14]	valid_0's auc: 0.952794
[15]	valid_0's auc: 0.952972
[16]	valid_0's auc: 0.953534
[17]	valid_0's auc: 0.953712
[18]	valid_0's auc: 0.953945
[19]	valid_0's auc: 0.954136
[20]	valid_0's auc: 0.954266
[21]	valid_0's auc: 0.954338
[22]	valid_0's auc: 0.954449
[23]	valid_0's auc: 0.954532
[24]	valid_0's auc: 0.954739
[25]	valid_0's auc: 0.955312
[26]	valid_0's auc: 0.95555
[27]	valid_0's auc: 0.955692
[28]	valid_0's auc: 0.95577
[29]	valid_0's auc: 0.956254
[30]	valid_0's auc: 0.956293
[31]	valid_0's auc: 0.956606
[32]	valid_0's auc: 0.956718
[33]	valid_0's auc: 0.95

[278]	valid_0's auc: 0.962167
[279]	valid_0's auc: 0.962174
[280]	valid_0's auc: 0.962172
Early stopping, best iteration is:
[270]	valid_0's auc: 0.962181
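
The log above reflects early stopping with a patience of 10 rounds: training halts once the validation AUC has gone 10 consecutive rounds without beating its best value, and the best round is kept. A simplified sketch of that rule, assuming higher scores are better, follows. The function early_stop and the toy scores are hypothetical; this is not LightGBM's internal code.

```python
def early_stop(scores, patience):
    """Return (best_round, stop_round), both 1-based.

    Stops at the first round that is `patience` rounds past the best
    score seen so far; if that never happens, stops at the last round.
    """
    best_round, best_score = 0, float("-inf")
    for round_number, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = round_number, score
        elif round_number - best_round >= patience:
            return best_round, round_number
    return best_round, len(scores)

toy_scores = [0.90, 0.92, 0.95, 0.94, 0.95, 0.93, 0.94]
best_round, stop_round = early_stop(toy_scores, patience=3)
print(best_round, stop_round)
```

In the log, the same pattern appears at a larger scale: the best round is 270 and training stops at round 280, ten rounds later.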


### Evaluating the model

In [48]:
from sklearn import metrics

In [53]:
ypred = bst.predict(test_data[feature_cols])  # predicted probabilities for the test set
# Compute the area under the ROC curve (ROC AUC) from the predicted scores
score = metrics.roc_auc_score(test_data['is_attributed'], ypred)
print("Test Score ===> {}".format(score))

Test Score ===> 0.9726035015157417
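
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The brute-force sketch below computes it that way on a toy input. pairwise_auc is a hypothetical helper for illustration; it is quadratic in the number of rows and is not how scikit-learn computes the metric.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

On this toy input, three of the four positive/negative pairs are ordered correctly, so the result is 0.75.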
