# AV-WNS Analytics Wizard 2019-Hackathon

**Approach and Codes for the Analytics Vidhya WNS Analytics Wizard 2019 Hackathon**

## Competition Link
https://datahack.analyticsvidhya.com/contest/wns-analytics-wizard-2019/

## Problem Statement
Zbay is an e-commerce website that sells a variety of products on its online platform. Zbay records the behaviour of its customers and stores it as a log. However, most of the time users do not buy products instantly; there is a time gap during which the customer might surf the internet and perhaps visit competitor websites.

Now, to improve sales, Zbay has hired Adiza, an adtech company that built a system to show ads for Zbay’s products on its partner websites.

If a user comes to Zbay’s website and searches for a product, and then visits these partner websites or apps, the previously viewed items or similar items are shown as an ad. If the user clicks this ad, he/she is redirected to Zbay’s website and might buy the product.

In this problem, the task was to predict the click probability, i.e. the probability that a user clicks an ad shown on the partner websites over the next 7 days, on the basis of historical view log data, ad impression data and user data.

## Available Data Sources:
* View Logs- User-level view data containing information such as User ID, Item ID, timestamp of view, device used, etc.
* Item Logs- Item data containing various details about each item in the merchant website catalog
* Train data- Actual click information

## Approach 
**(Zbay)**

## Final Ranks Obtained
Public Leaderboard-
Private Leaderboard-

#### Importing Required Libraries

In [8]:
# Importing required libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
import datetime 
from collections import Counter
import collections
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from IPython.core.interactiveshell import InteractiveShell
from math import sqrt
import time
import random
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
pd.options.mode.chained_assignment = None
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

#### Retrieving Data

In [9]:
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')

In [10]:
train.head()
test.head()

Unnamed: 0,impression_id,impression_time,user_id,app_code,os_version,is_4G,is_click
0,c4ca4238a0b923820dcc509a6f75849b,2018-11-15 00:00:00,87862,422,old,0,0
1,45c48cce2e2d7fbdea1afc51c7c6ad26,2018-11-15 00:01:00,63410,467,latest,1,1
2,70efdf2ec9b086079795c442636b55fb,2018-11-15 00:02:00,71748,259,intermediate,1,0
3,8e296a067a37563370ded05f5a3bf3ec,2018-11-15 00:02:00,69209,244,latest,1,0
4,182be0c5cdcd5072bb1864cdee4d3d6e,2018-11-15 00:02:00,62873,473,latest,0,0


Unnamed: 0,impression_id,impression_time,user_id,app_code,os_version,is_4G
0,a9e7126a585a69a32bc7414e9d0c0ada,2018-12-13 07:44:00,44754,127,latest,1
1,caac14a5bf2ba283db7708bb34855760,2018-12-13 07:45:00,29656,44,latest,0
2,13f10ba306a19ce7bec2f3cae507b698,2018-12-13 07:46:00,25234,296,latest,1
3,39c4b4dc0e9701b55a0a4f072008fb3f,2018-12-13 07:47:00,22988,207,latest,1
4,bf5a572cca75f5fc67f4b14e58b11d70,2018-12-13 07:48:00,35431,242,latest,1


#### Data Preparation

In [11]:
train['datetime'] = pd.to_datetime(train['impression_time'])

test['datetime'] = pd.to_datetime(test['impression_time'])


train.head()

Unnamed: 0,impression_id,impression_time,user_id,app_code,os_version,is_4G,is_click,datetime
0,c4ca4238a0b923820dcc509a6f75849b,2018-11-15 00:00:00,87862,422,old,0,0,2018-11-15 00:00:00
1,45c48cce2e2d7fbdea1afc51c7c6ad26,2018-11-15 00:01:00,63410,467,latest,1,1,2018-11-15 00:01:00
2,70efdf2ec9b086079795c442636b55fb,2018-11-15 00:02:00,71748,259,intermediate,1,0,2018-11-15 00:02:00
3,8e296a067a37563370ded05f5a3bf3ec,2018-11-15 00:02:00,69209,244,latest,1,0,2018-11-15 00:02:00
4,182be0c5cdcd5072bb1864cdee4d3d6e,2018-11-15 00:02:00,62873,473,latest,0,0,2018-11-15 00:02:00


In [12]:
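# day: day of the week (Monday=0 ... Sunday=6); hour: hour of the day (0-23)
train['day'] = train['datetime'].dt.weekday
train['hour'] = train['datetime'].dt.hour

test['day'] = test['datetime'].dt.weekday
test['hour'] = test['datetime'].dt.hour

# hour1: hour modulo 4, so it takes the values 0-3
train['hour1'] = train['hour']%4
test['hour1'] = test['hour']%4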


train.head()

Unnamed: 0,impression_id,impression_time,user_id,app_code,os_version,is_4G,is_click,datetime,day,hour,hour1
0,c4ca4238a0b923820dcc509a6f75849b,2018-11-15 00:00:00,87862,422,old,0,0,2018-11-15 00:00:00,3,0,0
1,45c48cce2e2d7fbdea1afc51c7c6ad26,2018-11-15 00:01:00,63410,467,latest,1,1,2018-11-15 00:01:00,3,0,0
2,70efdf2ec9b086079795c442636b55fb,2018-11-15 00:02:00,71748,259,intermediate,1,0,2018-11-15 00:02:00,3,0,0
3,8e296a067a37563370ded05f5a3bf3ec,2018-11-15 00:02:00,69209,244,latest,1,0,2018-11-15 00:02:00,3,0,0
4,182be0c5cdcd5072bb1864cdee4d3d6e,2018-11-15 00:02:00,62873,473,latest,0,0,2018-11-15 00:02:00,3,0,0
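
The time features derived above can be reproduced with the standard library alone, which makes it easier to see what each column encodes. The sketch below is illustrative only: `time_features` is a hypothetical helper, and `hour_block` is an assumed alternative (contiguous 6-hour blocks) that the notebook does not use. It is shown next to `hour1` because `hour % 4` cycles through 0 to 3 every four hours rather than grouping neighbouring hours.

```python
from datetime import datetime


def time_features(ts: str) -> dict:
    """Derive the notebook's time columns from an impression_time string."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "day": dt.weekday(),         # Monday=0 ... Sunday=6, same convention as pandas .dt.weekday
        "hour": dt.hour,             # 0-23
        "hour1": dt.hour % 4,        # as in the notebook: repeats 0,1,2,3 every four hours
        "hour_block": dt.hour // 6,  # hypothetical alternative: 0-5h -> 0, 6-11h -> 1, ...
    }


# First rows of train.csv and test.csv as shown in the outputs above
first_train = time_features("2018-11-15 00:00:00")
first_test = time_features("2018-12-13 07:44:00")

# Hours far apart in the day can share the same hour1 value
early = time_features("2018-11-15 03:00:00")
late = time_features("2018-11-15 23:00:00")
print(first_train, first_test)
print(early["hour1"], late["hour1"], early["hour_block"], late["hour_block"])
```

Here 03:00 and 23:00 both get `hour1 = 3`, while the assumed `hour_block` keeps them apart. Whether that matters for the model is an open question that this sketch does not answer.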


#### Cleansing Data

In [13]:
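# Drop the raw time columns now that day and hour1 are derived;
# impression_id is kept in test because the submission file needs it
train = train.drop(columns=['impression_id', 'impression_time', 'datetime', 'hour'])
test = test.drop(columns=['impression_time', 'datetime', 'hour'])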
train.head()

Unnamed: 0,user_id,app_code,os_version,is_4G,is_click,day,hour1
0,87862,422,old,0,0,3,0
1,63410,467,latest,1,1,3,0
2,71748,259,intermediate,1,0,3,0
3,69209,244,latest,1,0,3,0
4,62873,473,latest,0,0,3,0


#### Missing Value Analysis

In [14]:
print(train.isnull().sum(axis=0))
print(test.isnull().sum(axis = 0))

user_id       0
app_code      0
os_version    0
is_4G         0
is_click      0
day           0
hour1         0
dtype: int64
impression_id    0
user_id          0
app_code         0
os_version       0
is_4G            0
day              0
hour1            0
dtype: int64


#### Feature Statistics Summary

In [15]:
train.describe()
test.describe()

Unnamed: 0,user_id,app_code,is_4G,is_click,day,hour1
count,237609.0,237609.0,237609.0,237609.0,237609.0,237609.0
mean,46454.526828,249.099971,0.361312,0.045714,2.852977,1.418141
std,26802.726666,135.213609,0.480382,0.208864,2.007192,1.124943
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,23197.0,163.0,0.0,0.0,1.0,0.0
50%,46597.0,213.0,0.0,0.0,3.0,1.0
75%,69684.0,385.0,1.0,0.0,5.0,2.0
max,92586.0,522.0,1.0,1.0,6.0,3.0


Unnamed: 0,user_id,app_code,is_4G,day,hour1
count,90675.0,90675.0,90675.0,90675.0,90675.0
mean,46417.71013,254.793703,0.357872,3.089032,1.424097
std,26835.33356,133.411434,0.479377,2.061584,1.131886
min,1.0,0.0,0.0,0.0,0.0
25%,23029.0,190.0,0.0,1.0,0.0
50%,46557.0,213.0,0.0,3.0,1.0
75%,69764.5,386.0,1.0,5.0,2.0
max,92586.0,522.0,1.0,6.0,3.0


#### Data Modeling

In [16]:
from sklearn.preprocessing import LabelEncoder
train = train.astype('object')
test = test.astype('object')
col_with_null = ['user_id', 'os_version']
cols = ['app_code', 'is_4G']
for col in col_with_null:
    lbl = LabelEncoder()
    temp_train = train[col]
    temp_test = test[col]
    mask = train[col].isnull()
    mask1 = test[col].isnull()
    train[col] = train[col].astype(str)
    test[col] = test[col].astype(str)
    lbl.fit(list(train[col].values) + list(test[col].values))
    train[col] = lbl.transform(list(train[col].values))
    test[col] = lbl.transform(list(test[col].values))
    train[col] = train[col].where(~mask, temp_train)
    test[col] = test[col].where(~mask1, temp_test)
    
for col in cols:
    lbl = LabelEncoder()
    lbl.fit(list(train[col].values) + list(test[col].values))
    train[col] = lbl.transform(list(train[col].values))
    test[col] = lbl.transform(list(test[col].values))

LabelEncoder()


In [17]:
print(test.head())
print(train.head())

                      impression_id  user_id  app_code  os_version  is_4G day  \
0  a9e7126a585a69a32bc7414e9d0c0ada    37187       127           1      1   3   
1  caac14a5bf2ba283db7708bb34855760    21032        44           1      0   3   
2  13f10ba306a19ce7bec2f3cae507b698    16290       296           1      1   3   
3  39c4b4dc0e9701b55a0a4f072008fb3f    13892       207           1      1   3   
4  bf5a572cca75f5fc67f4b14e58b11d70    27218       242           1      1   3   

  hour1  
0     3  
1     3  
2     3  
3     3  
4     3  
   user_id  app_code  os_version  is_4G is_click day hour1
0    83322       421           2      0        0   3     0
1    57116       466           1      1        1   3     0
2    66041       259           0      1        0   3     0
3    63317       244           1      1        0   3     0
4    56541       472           1      0        0   3     0
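
The encoded values printed above follow from how `LabelEncoder` assigns codes: each distinct value gets its position in the sorted list of classes. The sketch below mimics that behaviour in plain Python. `fit_label_map` is a hypothetical stand-in written for illustration, not part of scikit-learn. It also shows a side effect of casting `user_id` to `str` before encoding: the ordering becomes lexicographic rather than numeric.

```python
def fit_label_map(values):
    """Map each distinct value to its rank in sorted order (0 .. n_classes-1)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


# os_version: alphabetical order decides the code, not any notion of "older" or "newer"
os_map = fit_label_map(["old", "latest", "intermediate", "latest"])

# user_id is converted to str first, so '10' sorts before '2' and '9'
id_map = fit_label_map([str(u) for u in [9, 10, 2]])

print(os_map)
print(id_map)
```

This matches the output above, where `old`, `latest` and `intermediate` appear as 2, 1 and 0. Tree-based models only split on the integer codes, so the arbitrary order is usually tolerable, but the codes should not be read as a meaningful scale.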


In [18]:
train = train.astype(int)
for col in test.columns[1:]: test[col] = test[col].astype(int)

In [19]:
from sklearn.model_selection import train_test_split

Y_t = train['is_click'].astype('float')
X_t = train.drop(columns=['is_click'])
X_train, X_test, y_train, y_test = train_test_split(X_t, Y_t, test_size=0.25, random_state=69)

In [20]:
import xgboost as xgb
num_round = 1000
param = {'objective': 'binary:logistic',  
         'tree_method': 'hist', 
         'max_depth': 10,
         'eta': 0.2,
         'eval_metric': 'auc',
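         # NOTE: early stopping is an argument of xgb.train(), not a booster parameter,
         # so this entry does not stop training early
         'early_stopping_rounds':50,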
         'nthread': 4,
         'max_bin':512,
         #'grow_policy':'lossguide',
         'random_state':69,
         }
dtrain = xgb.DMatrix(X_train, label=y_train, missing = -1)
dtest = xgb.DMatrix(X_test, label=y_test, missing = -1)
gbm = xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')])
predicted = gbm.predict(xgb.DMatrix(X_test, missing = -1))


[0]	test-auc:0.680891
[1]	test-auc:0.683167
[2]	test-auc:0.684971
[3]	test-auc:0.685745
[4]	test-auc:0.685745
[5]	test-auc:0.686008
[6]	test-auc:0.68749
[7]	test-auc:0.688189
[8]	test-auc:0.690281
[9]	test-auc:0.691425
[10]	test-auc:0.691853
[11]	test-auc:0.69408
[12]	test-auc:0.695143
[13]	test-auc:0.694854
[14]	test-auc:0.693788
[15]	test-auc:0.694062
[16]	test-auc:0.694445
[17]	test-auc:0.69473
[18]	test-auc:0.694431
[19]	test-auc:0.693876
[20]	test-auc:0.693784
[21]	test-auc:0.694048
[22]	test-auc:0.693912
[23]	test-auc:0.694048
[24]	test-auc:0.693798
[25]	test-auc:0.692106
[26]	test-auc:0.691561
[27]	test-auc:0.691141
[28]	test-auc:0.691045
[29]	test-auc:0.691871
[30]	test-auc:0.691935
[31]	test-auc:0.691521
[32]	test-auc:0.692017
[33]	test-auc:0.691847
[34]	test-auc:0.692004
[35]	test-auc:0.691606
[36]	test-auc:0.691839
[37]	test-auc:0.691888
[38]	test-auc:0.69147
[39]	test-auc:0.691601
[40]	test-auc:0.691497
[41]	test-auc:0.691242
[42]	test-auc:0.691726
[43]	test-auc:0.691643
[4

#### Confusion Matrix

In [21]:
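from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score
predicted = gbm.predict(xgb.DMatrix(X_test, missing = -1))
# Threshold the predicted probabilities at 0.5 to get hard 0/1 labels
predicted_labels = np.round(predicted)
print(confusion_matrix(y_test, predicted_labels))
# NOTE: the AUC here is computed on the hard labels, not on the probabilities
print(roc_auc_score(y_test, predicted_labels))
print(accuracy_score(y_test, predicted_labels))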

[[55977   653]
 [ 2662   111]]
0.5142489294901729
0.9441947376395131
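
The cell above passes rounded predictions to `roc_auc_score`, which may explain why the printed value (about 0.51) is far below the `test-auc` of roughly 0.69 logged during training. AUC measures how well the scores rank positives above negatives, and rounding throws most of that ranking away. The sketch below illustrates the effect on made-up numbers. `auc_from_scores` is a hypothetical pure-Python helper (a pairwise-comparison form of AUC), and the labels and scores are invented for the example, not taken from the competition data.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: clicks are rare and all predicted probabilities are below 0.5
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.02, 0.05, 0.10, 0.30, 0.20, 0.40]

auc_prob = auc_from_scores(y_true, scores)

# Rounding at 0.5 turns every score into 0, so all pairs become ties
rounded = [round(s) for s in scores]
auc_rounded = auc_from_scores(y_true, rounded)

print(auc_prob, auc_rounded)
```

In this toy case the probabilities rank seven of the eight positive/negative pairs correctly, while the rounded labels carry no ranking information at all. Passing the raw `predicted` array to `roc_auc_score` would be the usual way to get a figure comparable to the training log.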


In [22]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50, random_state=23)
model = clf.fit(X_train, y_train)

# Use predict_proba to predict probability of the class
y_pred = clf.predict_proba(X_test)[:,1]

In [23]:
from plot_metric.functions import BinaryClassification
# Visualisation with plot_metric
bc = BinaryClassification(y_test, y_pred, labels=["Class 1", "Class 2"])

# Figures
plt.figure(figsize=(5,5))
bc.plot_roc_curve()
plt.show()

<Figure size 360x360 with 0 Axes>

(array([0.00000000e+00, 1.76584849e-05, 5.29754547e-05, ...,
        4.41638707e-01, 4.41674024e-01, 1.00000000e+00]),
 array([0.        , 0.        , 0.        , ..., 0.61377569, 0.61377569,
        1.        ]),
 array([1.98      , 0.98      , 0.96      , ..., 0.00285714, 0.00222222,
        0.        ]),
 0.6000304231560112)

In [25]:
preds = gbm.predict(xgb.DMatrix(test[test.columns[1:]], missing = -1))
sub = pd.DataFrame({'impression_id':test['impression_id'], 'is_click':preds})
sub.to_csv('F:/My Projects/Zbay/submission.csv', index=False)

### Results
The results from evaluation are as follows:

#### Confusion Matrix:

[[55977   653]
 [ 2662   111]]

#### ROC AUC Score:
0.51 (computed on predictions rounded to 0/1, not on the predicted probabilities)
#### Accuracy Score:
0.94, i.e. 94%
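
The reported scores can be re-derived from the confusion matrix alone, which also helps put the 94% accuracy in context. The sketch below assumes scikit-learn's layout for binary labels (rows are true classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`). The variable names are chosen for illustration.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 55977, 653
fn, tp = 2662, 111

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)          # share of actual clicks that were predicted as clicks
specificity = tn / (tn + fp)     # share of non-clicks that were predicted as non-clicks
precision = tp / (tp + fp)       # share of predicted clicks that were real clicks

# For hard 0/1 predictions, ROC AUC reduces to the mean of recall and specificity
balanced_accuracy = (recall + specificity) / 2

# Accuracy of a trivial model that always predicts "no click"
majority_baseline = (tn + fp) / total

print(f"accuracy          {accuracy:.4f}")
print(f"recall            {recall:.4f}")
print(f"precision         {precision:.4f}")
print(f"balanced accuracy {balanced_accuracy:.4f}")
print(f"always-0 baseline {majority_baseline:.4f}")
```

Two things stand out under these assumptions. The 0.51 score equals the balanced accuracy at a 0.5 threshold, and a model that never predicts a click would reach about 95.3% accuracy on this split, slightly above the 94.4% reported. With only about 4.6% positives, accuracy is therefore a weak indicator here, and the probability-based AUC is the more informative number.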