<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [81]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import pickle
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

Reading original data

In [16]:
# Read the training and test data sets

PATH_TO_DATA = ('')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [17]:
# Switch time1, ..., time10 columns to datetime type
times = ['time%s' % i for i in range(1, 11)]
train_df[times] = train_df[times].apply(pd.to_datetime)
test_df[times] = test_df[times].apply(pd.to_datetime)

# Sort the data by time
train_df = train_df.sort_values(by='time1')

# Look at the first rows of the training set
train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21669,56,2013-01-12 08:05:57,55.0,2013-01-12 08:05:57,,NaT,,NaT,,NaT,...,NaT,,NaT,,NaT,,NaT,,NaT,0
54843,56,2013-01-12 08:37:23,55.0,2013-01-12 08:37:23,56.0,2013-01-12 09:07:07,55.0,2013-01-12 09:07:09,,NaT,...,NaT,,NaT,,NaT,,NaT,,NaT,0
77292,946,2013-01-12 08:50:13,946.0,2013-01-12 08:50:14,951.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:16,...,2013-01-12 08:50:16,948.0,2013-01-12 08:50:16,784.0,2013-01-12 08:50:16,949.0,2013-01-12 08:50:17,946.0,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948.0,2013-01-12 08:50:17,949.0,2013-01-12 08:50:18,948.0,2013-01-12 08:50:18,945.0,2013-01-12 08:50:18,...,2013-01-12 08:50:18,947.0,2013-01-12 08:50:19,945.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950.0,2013-01-12 08:50:20,948.0,2013-01-12 08:50:20,947.0,2013-01-12 08:50:21,950.0,2013-01-12 08:50:21,...,2013-01-12 08:50:21,946.0,2013-01-12 08:50:21,951.0,2013-01-12 08:50:22,946.0,2013-01-12 08:50:22,947.0,2013-01-12 08:50:22,0


Separate the target variable.

In [85]:
y = train_df['target']

In [86]:
y.head()

session_id
21669     0
54843     0
77292     0
114021    0
146670    0
Name: target, dtype: int64

In [19]:
# Change site1, ..., site10 columns type to integer and fill NA-values with zeros
sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')

# Load websites dictionary
with open(r"site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


Unnamed: 0,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [158]:
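# Fill missing timestamps: fillna(0) turns NaT into 1970-01-01 00:00:00 (the Unix epoch),
# so any hour-based feature computed later will treat a missing visit as hour 0
times = ['time%s' % i for i in range(1, 11)]
train_df[times] = train_df[times].fillna(0).astype('datetime64[ns]')
test_df[times] = test_df[times].fillna(0).astype('datetime64[ns]')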

In [159]:
# Our target variable
y_train = train_df['target']

# Combined dataframe of the training and test data (without the target)
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# Index to split the training and test data sets
idx_split = train_df.shape[0]

full_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10
21669,56,2013-01-12 08:05:57,55,2013-01-12 08:05:57,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00
54843,56,2013-01-12 08:37:23,55,2013-01-12 08:37:23,56,2013-01-12 09:07:07,55,2013-01-12 09:07:09,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00
77292,946,2013-01-12 08:50:13,946,2013-01-12 08:50:14,951,2013-01-12 08:50:15,946,2013-01-12 08:50:15,946,2013-01-12 08:50:16,945,2013-01-12 08:50:16,948,2013-01-12 08:50:16,784,2013-01-12 08:50:16,949,2013-01-12 08:50:17,946,2013-01-12 08:50:17
114021,945,2013-01-12 08:50:17,948,2013-01-12 08:50:17,949,2013-01-12 08:50:18,948,2013-01-12 08:50:18,945,2013-01-12 08:50:18,946,2013-01-12 08:50:18,947,2013-01-12 08:50:19,945,2013-01-12 08:50:19,946,2013-01-12 08:50:19,946,2013-01-12 08:50:20
146670,947,2013-01-12 08:50:20,950,2013-01-12 08:50:20,948,2013-01-12 08:50:20,947,2013-01-12 08:50:21,950,2013-01-12 08:50:21,952,2013-01-12 08:50:21,946,2013-01-12 08:50:21,951,2013-01-12 08:50:22,946,2013-01-12 08:50:22,947,2013-01-12 08:50:22


In [160]:
# Dataframe with indices of visited websites in session
full_sites = full_df[sites]
full_sites.head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
21669,56,55,0,0,0,0,0,0,0,0
54843,56,55,56,55,0,0,0,0,0,0
77292,946,946,951,946,946,945,948,784,949,946
114021,945,948,949,948,945,946,947,945,946,946
146670,947,950,948,947,950,952,946,951,946,947


In [161]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 336358 entries, 21669 to 82797
Data columns (total 20 columns):
site1     336358 non-null int32
time1     336358 non-null datetime64[ns]
site2     336358 non-null int32
time2     336358 non-null datetime64[ns]
site3     336358 non-null int32
time3     336358 non-null datetime64[ns]
site4     336358 non-null int32
time4     336358 non-null datetime64[ns]
site5     336358 non-null int32
time5     336358 non-null datetime64[ns]
site6     336358 non-null int32
time6     336358 non-null datetime64[ns]
site7     336358 non-null int32
time7     336358 non-null datetime64[ns]
site8     336358 non-null int32
time8     336358 non-null datetime64[ns]
site9     336358 non-null int32
time9     336358 non-null datetime64[ns]
site10    336358 non-null int32
time10    336358 non-null datetime64[ns]
dtypes: datetime64[ns](10), int32(10)
memory usage: 41.1 MB


Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [None]:
# Your code here
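
One possible way to build these features is sketched below on a toy frame. It assumes each session is first turned into a space-separated string of site IDs, with the zero padding skipped; the names `toy_sites` and `sessions_as_text` are hypothetical and not part of the assignment.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for full_sites: one row per session, zeros mark missing sites
toy_sites = pd.DataFrame(
    [[56, 55, 0, 0], [56, 55, 56, 55], [946, 946, 951, 946]],
    columns=['site1', 'site2', 'site3', 'site4'])

# Represent every session as a "sentence" of site IDs, skipping the 0 padding
sessions_as_text = toy_sites.apply(
    lambda row: ' '.join(str(s) for s in row if s != 0), axis=1)

# token_pattern=r'\S+' keeps single-character IDs, which the default pattern drops
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000,
                             token_pattern=r'\S+')
tfidf_sites = vectorizer.fit_transform(sessions_as_text)
print(tfidf_sites.shape)
```

The result is a sparse matrix with one row per session, so it can be sliced with `idx_split` in the same way as the dense features.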

Add features based on the session start time: hour, whether it's morning, day or night and so on.

They say binary features are enough: early morning, morning, lunch, evening, and Sunday.

In [162]:
%%time
# Note: this is an alias of full_df, not a copy, so the new columns are added to full_df as well
new_full_df = full_df

# Binary time-of-day and Sunday indicators for each of the 10 timestamps in a session
for i in range(1, 11):
    #new_full_df['hour%s' % i] = new_full_df['time%s' % i].apply(lambda time: pd.Timestamp(time).hour)
    new_full_df['early_morning%s' % i] = new_full_df['time%s' % i].apply(lambda time: 1 if pd.Timestamp(time).hour < 8 else 0)
    new_full_df['morning%s' % i] = new_full_df['time%s' % i].apply(lambda time: 1 if (pd.Timestamp(time).hour >= 8) and (pd.Timestamp(time).hour < 12) else 0)
    new_full_df['lunch%s' % i] = new_full_df['time%s' % i].apply(lambda time: 1 if (pd.Timestamp(time).hour >= 12) and (pd.Timestamp(time).hour < 16) else 0)
    new_full_df['evening%s' % i] = new_full_df['time%s' % i].apply(lambda time: 1 if (pd.Timestamp(time).hour >= 16) and (pd.Timestamp(time).hour < 24) else 0)
    new_full_df['sunday%s' % i] = new_full_df['time%s' % i].apply(lambda time: 1 if pd.Timestamp(time).dayofweek == 6 else 0)

Wall time: 1min 8s
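
Most of that wall time likely comes from calling `pd.Timestamp` once per cell inside `apply`. The sketch below computes the same five indicators with the vectorized `.dt` accessor on a toy frame; `toy_times` and `time_flags` are hypothetical names, and the thresholds are copied from the cell above.

```python
import pandas as pd

# Toy stand-in for the time1..time10 columns (the epoch value mimics a filled NaT)
toy_times = pd.DataFrame({
    'time1': pd.to_datetime(['2013-01-12 08:05:57', '2013-01-13 17:30:00']),
    'time2': pd.to_datetime(['2013-01-12 13:05:57', '1970-01-01 00:00:00']),
})

flags = {}
for col in toy_times.columns:
    i = col.replace('time', '')
    hour = toy_times[col].dt.hour  # whole column at once, no per-row lambda
    flags['early_morning' + i] = (hour < 8).astype(int)
    flags['morning' + i] = ((hour >= 8) & (hour < 12)).astype(int)
    flags['lunch' + i] = ((hour >= 12) & (hour < 16)).astype(int)
    flags['evening' + i] = (hour >= 16).astype(int)
    flags['sunday' + i] = (toy_times[col].dt.dayofweek == 6).astype(int)

time_flags = pd.DataFrame(flags, index=toy_times.index)
print(time_flags)
```

As in the loop version, a timestamp that was filled with the epoch value has hour 0 and is therefore flagged as early morning.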


In [187]:
for i in range(1, 11):
    new_full_df.drop('time%s' % i, axis = 1, inplace=True)
    new_full_df.drop('site%s' % i, axis = 1, inplace=True)

In [188]:
new_full_df.head()

session_id,early_morning1,morning1,lunch1,evening1,sunday1,early_morning2,morning2,lunch2,evening2,sunday2,...,early_morning9,morning9,lunch9,evening9,sunday9,early_morning10,morning10,lunch10,evening10,sunday10
21669,0,1,0,0,0,0,1,0,0,0,...,1,0,0,0,0,1,0,0,0,0
54843,0,1,0,0,0,0,1,0,0,0,...,1,0,0,0,0,1,0,0,0,0
77292,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,1,0,0,0
114021,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,1,0,0,0
146670,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,1,0,0,0


In [189]:
new_full_df.shape

(336358, 50)

In [190]:
X_test = new_full_df.values[idx_split:, :]
X_train = new_full_df.values[:idx_split, :]
y_train = train_df['target']

In [191]:
y_train.shape

(253561,)

In [192]:
X_train.shape

(253561, 50)

In [193]:
X_test.shape

(82797, 50)

In [194]:
new_full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 336358 entries, 21669 to 82797
Data columns (total 50 columns):
early_morning1     336358 non-null int64
morning1           336358 non-null int64
lunch1             336358 non-null int64
evening1           336358 non-null int64
sunday1            336358 non-null int64
early_morning2     336358 non-null int64
morning2           336358 non-null int64
lunch2             336358 non-null int64
evening2           336358 non-null int64
sunday2            336358 non-null int64
early_morning3     336358 non-null int64
morning3           336358 non-null int64
lunch3             336358 non-null int64
evening3           336358 non-null int64
sunday3            336358 non-null int64
early_morning4     336358 non-null int64
morning4           336358 non-null int64
lunch4             336358 non-null int64
evening4           336358 non-null int64
sunday4            336358 non-null int64
early_morning5     336358 non-null int64
morning5           336358

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [None]:
# Your code here
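
A minimal sketch of this step on toy data follows. It assumes the scaler is fitted on the training rows only and that the dense features are converted to sparse before stacking; `toy_tfidf`, `toy_time_feats` and `idx_split_toy` are hypothetical stand-ins for the real matrices and for `idx_split`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Toy stand-ins: a sparse Tf-Idf-like matrix and dense time-based features
toy_tfidf = csr_matrix(np.array([[0.0, 0.7, 0.7], [1.0, 0.0, 0.0],
                                 [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]))
toy_time_feats = np.array([[8.0, 0.0], [17.0, 1.0], [13.0, 0.0], [9.0, 0.0]])
idx_split_toy = 3  # the first 3 rows play the role of the training part

# Fit the scaler on the training rows only, then transform all rows
scaler = StandardScaler().fit(toy_time_feats[:idx_split_toy])
scaled = scaler.transform(toy_time_feats)

# hstack keeps the result sparse; CSR format allows efficient row slicing
X_full = hstack([toy_tfidf, csr_matrix(scaled)]).tocsr()
X_train_toy, X_test_toy = X_full[:idx_split_toy], X_full[idx_split_toy:]
print(X_train_toy.shape, X_test_toy.shape)
```

Scaling matters mostly for non-binary features such as the raw hour; 0/1 indicators can usually be stacked as they are.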

Perform cross-validation with logistic regression.

In [None]:
%%time
idx = int(round(X_train.shape[0] * 0.9))

tss = TimeSeriesSplit(n_splits=7)

#logit_pipe = Pipeline([
    #('vectorizer', CountVectorizer(max_features = 100000, ngram_range = (1, 3))),
    #('clf', LogisticRegression(random_state=17))])
#logit_pipe_params = {'clf__C': np.logspace(-8, 8, 17)}

logit_cv = LogisticRegressionCV(Cs = list(np.logspace(-8, 8, 17)), n_jobs=-1, scoring ='roc_auc', cv=tss)
logit_cv.fit(X_train, y_train)

[1e-08,
 9.9999999999999995e-08,
 9.9999999999999995e-07,
 1.0000000000000001e-05,
 0.0001,
 0.001,
 0.01,
 0.10000000000000001,
 1.0,
 10.0,
 100.0,
 1000.0,
 10000.0,
 100000.0,
 1000000.0,
 10000000.0,
 100000000.0]

In [200]:
logit_cv.scores_

{1: array([[ 0.69352879,  0.69692897,  0.7326676 ,  0.7443924 ,  0.74739275,
          0.74857247,  0.74618874,  0.74548932,  0.7445619 ,  0.74335881],
        [ 0.87504697,  0.87593787,  0.87399357,  0.87283353,  0.86999479,
          0.86543562,  0.86236909,  0.86224544,  0.86224544,  0.86059318],
        [ 0.70576924,  0.71121776,  0.71418643,  0.71214807,  0.70936194,
          0.70953618,  0.71405503,  0.71409563,  0.71412728,  0.71435363],
        [ 0.92579001,  0.92523153,  0.92301103,  0.92331327,  0.9226634 ,
          0.92248923,  0.92249466,  0.92077014,  0.9206806 ,  0.92064884],
        [ 0.72614468,  0.7277095 ,  0.73632618,  0.75471177,  0.73934987,
          0.72171816,  0.72417643,  0.72481418,  0.72308025,  0.72490587],
        [ 0.89224907,  0.89308894,  0.89405806,  0.89644096,  0.89654697,
          0.89647832,  0.89633947,  0.89627607,  0.89627566,  0.89627057],
        [ 0.90169757,  0.90247399,  0.90049554,  0.89726888,  0.89731613,
          0.8960465 ,  0.8947
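
To read a `scores_` output like the one above, it can help to average the ROC AUC over folds for each candidate `C`. The sketch below does this on synthetic data; it assumes a scikit-learn version in which `scores_` is a dict keyed by class label with an array of shape `(n_folds, n_Cs)`, as in the output above. The names `X_toy`, `y_toy` and `toy_cv` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import TimeSeriesSplit

# Synthetic binary problem standing in for the real (X_train, y_train)
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=17)

Cs = np.logspace(-2, 2, 5)
toy_cv = LogisticRegressionCV(Cs=Cs, cv=TimeSeriesSplit(n_splits=3),
                              scoring='roc_auc', max_iter=1000)
toy_cv.fit(X_toy, y_toy)

# scores_[1] has one row per fold and one column per candidate C
fold_scores = toy_cv.scores_[1]
mean_auc_per_C = fold_scores.mean(axis=0)
for C, auc in zip(toy_cv.Cs_, mean_auc_per_C):
    print('C = %g: mean ROC AUC = %.4f' % (C, auc))
print('Selected C:', toy_cv.C_[0])
```

With `TimeSeriesSplit`, earlier folds train on less data, so a large spread between rows is expected and the per-column mean is usually more informative than any single fold.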

In [182]:
y_pred = logit_cv.predict_proba(X_test)[:, 1]
y_pred.shape

(82797,)

Make predictions for the test set and create a submission file.

In [183]:
test_pred = logit_cv.predict_proba(X_test)[:, 1]

In [184]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    """Write predictions to a CSV file with a 1-based session_id index."""
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [185]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
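
Before uploading, it may be worth reading the file back and checking its layout. The sketch below writes toy predictions in the same layout as `write_to_submission_file` to a temporary file and verifies it; `toy_pred` and `out_path` are hypothetical names, and the checks are an assumption about what is useful to verify rather than a Kaggle requirement.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy_pred = np.array([0.01, 0.85, 0.10])
out_path = os.path.join(tempfile.mkdtemp(), 'toy_submission.csv')

# Same layout as write_to_submission_file: 1-based session_id index, one target column
pd.DataFrame(toy_pred, index=np.arange(1, toy_pred.shape[0] + 1),
             columns=['target']).to_csv(out_path, index_label='session_id')

# Read the file back and check its structure
check = pd.read_csv(out_path, index_col='session_id')
assert list(check.columns) == ['target']
assert check.index[0] == 1 and check.index[-1] == len(toy_pred)
assert check['target'].between(0, 1).all()
print(check)
```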