<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-comercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [2]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import pickle
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

Reading original data

In [4]:
# Read the training and test data sets

PATH_TO_DATA = ('')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [170]:
# Switch time1, ..., time10 columns to datetime type
times = ['time%s' % i for i in range(1, 11)]
train_df[times] = train_df[times].apply(pd.to_datetime)
test_df[times] = test_df[times].apply(pd.to_datetime)

# Sort the data by time
train_df = train_df.sort_values(by='time1')

# Look at the first rows of the training set
train_df.head()

Unnamed: 0_level_0,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
21669,56,2013-01-12 08:05:57,55,2013-01-12 08:05:57,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,...,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0
54843,56,2013-01-12 08:37:23,55,2013-01-12 08:37:23,56,2013-01-12 09:07:07,55,2013-01-12 09:07:09,0,1970-01-01 00:00:00,...,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0
77292,946,2013-01-12 08:50:13,946,2013-01-12 08:50:14,951,2013-01-12 08:50:15,946,2013-01-12 08:50:15,946,2013-01-12 08:50:16,...,2013-01-12 08:50:16,948,2013-01-12 08:50:16,784,2013-01-12 08:50:16,949,2013-01-12 08:50:17,946,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948,2013-01-12 08:50:17,949,2013-01-12 08:50:18,948,2013-01-12 08:50:18,945,2013-01-12 08:50:18,...,2013-01-12 08:50:18,947,2013-01-12 08:50:19,945,2013-01-12 08:50:19,946,2013-01-12 08:50:19,946,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950,2013-01-12 08:50:20,948,2013-01-12 08:50:20,947,2013-01-12 08:50:21,950,2013-01-12 08:50:21,...,2013-01-12 08:50:21,946,2013-01-12 08:50:21,951,2013-01-12 08:50:22,946,2013-01-12 08:50:22,947,2013-01-12 08:50:22,0


Separate target feature 

In [171]:
y = train_df['target']

In [172]:
y.head()

session_id
21669     0
54843     0
77292     0
114021    0
146670    0
Name: target, dtype: int64

In [173]:
# Change site1, ..., site10 columns type to integer and fill NA-values with zeros
sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')

# Load websites dictionary
with open(r"site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


Unnamed: 0,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [174]:
times = ['time%s' % i for i in range(1, 11)]
train_df[times] = train_df[times].fillna(0).astype('datetime64[ns]')
test_df[times] = test_df[times].fillna(0).astype('datetime64[ns]')

In [175]:
# Our target variable
y_train = train_df['target']

# United dataframe of the initial data 
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# Index to split the training and test data sets
idx_split = train_df.shape[0]

full_df.head()

Unnamed: 0_level_0,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
21669,56,2013-01-12 08:05:57,55,2013-01-12 08:05:57,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00
54843,56,2013-01-12 08:37:23,55,2013-01-12 08:37:23,56,2013-01-12 09:07:07,55,2013-01-12 09:07:09,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00,0,1970-01-01 00:00:00
77292,946,2013-01-12 08:50:13,946,2013-01-12 08:50:14,951,2013-01-12 08:50:15,946,2013-01-12 08:50:15,946,2013-01-12 08:50:16,945,2013-01-12 08:50:16,948,2013-01-12 08:50:16,784,2013-01-12 08:50:16,949,2013-01-12 08:50:17,946,2013-01-12 08:50:17
114021,945,2013-01-12 08:50:17,948,2013-01-12 08:50:17,949,2013-01-12 08:50:18,948,2013-01-12 08:50:18,945,2013-01-12 08:50:18,946,2013-01-12 08:50:18,947,2013-01-12 08:50:19,945,2013-01-12 08:50:19,946,2013-01-12 08:50:19,946,2013-01-12 08:50:20
146670,947,2013-01-12 08:50:20,950,2013-01-12 08:50:20,948,2013-01-12 08:50:20,947,2013-01-12 08:50:21,950,2013-01-12 08:50:21,952,2013-01-12 08:50:21,946,2013-01-12 08:50:21,951,2013-01-12 08:50:22,946,2013-01-12 08:50:22,947,2013-01-12 08:50:22


In [176]:
# Dataframe with indices of visited websites in session
full_sites = full_df[sites]
full_sites.head()

Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
21669,56,55,0,0,0,0,0,0,0,0
54843,56,55,56,55,0,0,0,0,0,0
77292,946,946,951,946,946,945,948,784,949,946
114021,945,948,949,948,945,946,947,945,946,946
146670,947,950,948,947,950,952,946,951,946,947


In [177]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 336358 entries, 21669 to 82797
Data columns (total 20 columns):
site1     336358 non-null int32
time1     336358 non-null datetime64[ns]
site2     336358 non-null int32
time2     336358 non-null datetime64[ns]
site3     336358 non-null int32
time3     336358 non-null datetime64[ns]
site4     336358 non-null int32
time4     336358 non-null datetime64[ns]
site5     336358 non-null int32
time5     336358 non-null datetime64[ns]
site6     336358 non-null int32
time6     336358 non-null datetime64[ns]
site7     336358 non-null int32
time7     336358 non-null datetime64[ns]
site8     336358 non-null int32
time8     336358 non-null datetime64[ns]
site9     336358 non-null int32
time9     336358 non-null datetime64[ns]
site10    336358 non-null int32
time10    336358 non-null datetime64[ns]
dtypes: datetime64[ns](10), int32(10)
memory usage: 41.1 MB


Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [None]:
# You code here

Add features based on the session start time: hour, whether it's morning, day or night and so on.

говорят хватит бинарных ранее утро, утро, вечер, обед и воскресенье

In [178]:
%%time
new_full_df = full_df 

for i in range(1, 11):
    new_full_df['hour']          = new_full_df['time1'].apply(lambda time: pd.Timestamp(time).hour)
    new_full_df['early_morning'] = new_full_df['time1'].apply(lambda time: 1 if pd.Timestamp(time).hour < 10 else 0)
    new_full_df['morning']       = new_full_df['time1'].apply(lambda time: 1 if (pd.Timestamp(time).hour >= 10) and (pd.Timestamp(time).hour < 12) else 0)
    new_full_df['lunch']         = new_full_df['time1'].apply(lambda time: 1 if (pd.Timestamp(time).hour >= 12) and (pd.Timestamp(time).hour < 16) else 0)
    new_full_df['evening']       = new_full_df['time1'].apply(lambda time: 1 if (pd.Timestamp(time).hour >= 16) and (pd.Timestamp(time).hour < 24) else 0)
    new_full_df['sunday']        = new_full_df['time1'].apply(lambda time: 1 if pd.Timestamp(time).dayofweek == 6 else 0)
    

Wall time: 6min 47s


In [179]:
for i in range(1, 11):
    new_full_df.drop('time%s' % i, axis = 1, inplace=True)

In [180]:
# последовательность с индексами
sites_flatten = full_sites.values.flatten()

# искомая матрица
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0] + 10, 10)))[:, 1:]

In [139]:
for i in range(1, 11):    
    new_full_df.drop('site%s' % i, axis = 1, inplace=True)

In [181]:
new_full_df.head()

Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,...,morning,lunch,evening,monday,tuesday,wednes,thursday,friday,saturnday,sunday
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
21669,56,55,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
54843,56,55,56,55,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
77292,946,946,951,946,946,945,948,784,949,946,...,0,0,0,0,0,0,0,0,1,0
114021,945,948,949,948,945,946,947,945,946,946,...,0,0,0,0,0,0,0,0,1,0
146670,947,950,948,947,950,952,946,951,946,947,...,0,0,0,0,0,0,0,0,1,0


In [182]:
full_sites_sparse.shape

(336358, 48371)

In [183]:
new_full_df.values.shape #наш датафрейм, содержащий временные признаки, в формате массива

(336358, 22)

In [184]:
result_crx = hstack([new_full_df.values, full_sites_sparse]).tocsr()

In [185]:
scaler = StandardScaler(copy=False, with_mean=False)
scaler.fit_transform(result_crx)

<336358x48393 sparse matrix of type '<class 'numpy.float64'>'
	with 6071402 stored elements in Compressed Sparse Row format>

In [186]:
X_test = result_crx[idx_split:, :]
X_train = result_crx[:idx_split, :]
y_train = train_df['target']

In [187]:
y_train.shape

(253561,)

In [188]:
X_train.shape

(253561, 48393)

In [189]:
X_test.shape

(82797, 48393)

Scale this features and combine then with Tf-Idf based on sites (you'll need `scipy.sparse.hstack`)

In [None]:
# You code here

Perform cross-validation with logistic regression.

In [190]:
%%time
idx = int(round(X_train.shape[0] * 0.9))

tss = TimeSeriesSplit(n_splits=7)

#logit_pipe = Pipeline([
    #('vectorizer', CountVectorizer(max_features = 100000, ngram_range = (1, 3))),
    #('clf', LogisticRegression(random_state=17))])
#logit_pipe_params = {'clf__C': np.logspace(-8, 8, 17)}

logit_cv = LogisticRegressionCV(Cs = list(np.logspace(-8, 8, 17)), n_jobs=-1, scoring ='roc_auc', cv=tss)
#logit_cv = LogisticRegressionCV(Cs = [1], n_jobs=-1, scoring ='roc_auc', cv=tss)
logit_cv.fit(X_train, y_train)

Wall time: 1min 51s


In [191]:
logit_cv.scores_

{1: array([[ 0.44904719,  0.45108739,  0.467634  ,  0.49996078,  0.51205814,
          0.52844787,  0.57266737,  0.61237179,  0.63019994,  0.63019697,
          0.63019743,  0.63019641,  0.63019781,  0.63019632,  0.63019799,
          0.63019614,  0.63019781],
        [ 0.43963007,  0.4419699 ,  0.46225583,  0.54895794,  0.64678367,
          0.77368708,  0.81413032,  0.84369683,  0.84709125,  0.86304632,
          0.87427409,  0.87219829,  0.87220182,  0.87219777,  0.87220248,
          0.87219777,  0.87220209],
        [ 0.38864464,  0.38953818,  0.39832078,  0.42178047,  0.49547987,
          0.69558799,  0.73618256,  0.77809555,  0.78048398,  0.78054536,
          0.78054312,  0.78054056,  0.7805444 ,  0.7805412 ,  0.78054472,
          0.78054088,  0.7805444 ],
        [ 0.43168043,  0.44348719,  0.49674328,  0.60924833,  0.73443264,
          0.91851922,  0.95304763,  0.95304815,  0.95304866,  0.95304876,
          0.95304886,  0.95304866,  0.95304876,  0.95304856,  0.95304897,
 

In [164]:
y_pred = logit_cv.predict_proba(X_test)[:, 1]
y_pred.shape

(82797,)

Make prediction for the test set and form a submission file.

In [165]:
test_pred = logit_cv.predict_proba(X_test)[:, 1]

In [166]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [167]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")