<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [21]:
%load_ext jupyternotify



In [22]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
import pickle
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer

In [23]:
prev_auc_score = 0
prev_cv_score = 0

Reading original data

In [158]:
PATH_TO_DATA = ('../../data')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'websites_train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'websites_test_sessions.csv'), index_col='session_id')

# Sort the data by time
train_df = train_df.sort_values(by='time1')


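# Keep only sessions from September 2013 onwards; time1 is still an ISO-formatted
# string at this point, so string comparison follows chronological order
train_df.drop(train_df[train_df['time1']<'2013-09-01'].index, axis=0, inplace=True)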

# Load websites dictionary (site name -> integer id)
with open(os.path.join(PATH_TO_DATA, 'site_dic.pkl'), 'rb') as input_file:
    site_dict = pickle.load(input_file)

# Inverse mapping: integer id -> site name
inv_site_dict = {v: k for k, v in site_dict.items()}
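
The inverse dictionary is handy for turning a session's site ids back into readable names. Below is a minimal sketch on a made-up three-site dictionary; `toy_site_dict`, `toy_inv` and `decode_session` are hypothetical names, and treating id 0 as an empty slot is an assumption that mirrors the `fillna(0)` used later in this notebook.

```python
# Hypothetical miniature of site_dict: site name -> integer id.
toy_site_dict = {"www.google.com": 1, "mail.ru": 2, "github.com": 3}
toy_inv = {site_id: name for name, site_id in toy_site_dict.items()}


def decode_session(site_ids, inv_dict):
    """Translate site ids to names; 0 is assumed to mark an empty slot."""
    return [inv_dict[s] for s in site_ids if s != 0]


decoded = decode_session([3, 1, 3, 0, 0], toy_inv)
print(decoded)
```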

Helper function to save predictions

In [159]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    """Write predictions to a CSV file with session ids numbered from 1."""
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)
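
To see what the submission file looks like, the same DataFrame construction can be pointed at an in-memory buffer instead of a file. This is only a sketch: `toy_pred` holds made-up probabilities, not model output.

```python
import io

import numpy as np
import pandas as pd

toy_pred = np.array([0.1, 0.9, 0.5])  # made-up probabilities

# Same construction as in write_to_submission_file, but written to memory.
buf = io.StringIO()
pd.DataFrame(toy_pred,
             index=np.arange(1, toy_pred.shape[0] + 1),
             columns=["target"]).to_csv(buf, index_label="session_id")

lines = buf.getvalue().splitlines()
print(lines)
```

The first line is the header, followed by one row per test session in the original order.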

    

Separate the target variable

In [160]:
y = train_df['target']

In [161]:
# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [162]:
# Column names for sites and times
sitescolumns = ['site%s' % i for i in range(1, 11)]
timescolumns = ['time%s' % i for i in range(1, 11)]

train_df[timescolumns] = train_df[timescolumns].apply(pd.to_datetime)
test_df[timescolumns] = test_df[timescolumns].apply(pd.to_datetime)

In [163]:
# Dump each session as a line of space-separated site ids (0 = empty slot) for CountVectorizer
train_df[sitescolumns].fillna(0).astype(np.int32).to_csv('train_sessions_text.txt', sep=' ', index=None, header=None)
test_df[sitescolumns].fillna(0).astype(np.int32).to_csv('test_sessions_text.txt', sep=' ', index=None, header=None)

In [164]:
# Index to split the training and test data sets
idx_split = train_df.shape[0]

# Empty slots (NaN) become 0 so that every entry is a valid integer column index
full_sites = pd.concat([train_df[sitescolumns], test_df[sitescolumns]]).fillna(0).astype(np.int32)
# Flattened sequence of site ids, 10 per session
sites_flatten = full_sites.values.flatten()

# Sparse session-by-site count matrix; column 0 (the empty-slot placeholder) is dropped
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0] + 10, 10)))[:, 1:]
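
The `csr_matrix((data, indices, indptr))` constructor used above can be hard to read at first sight. The sketch below repeats the same trick on two made-up sessions of three sites each (`toy_sessions` is hypothetical data), so the resulting matrix is small enough to print.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two made-up sessions of 3 sites each; 0 stands for "no site".
toy_sessions = np.array([[1, 2, 0],
                         [3, 1, 1]])
flat = toy_sessions.flatten()

# Row i owns the slice indptr[i]:indptr[i + 1] of `flat`,
# so stepping by the session length gives one session per row.
indptr = np.arange(0, flat.shape[0] + 1, toy_sessions.shape[1])
toy_sparse = csr_matrix((np.ones(flat.shape[0], dtype=int), flat, indptr))[:, 1:]

# Repeated site ids within a session are summed into visit counts.
toy_dense = toy_sparse.toarray()
print(toy_dense)
```

Each row is a session and each column a site id (starting from 1, because the placeholder column 0 is sliced away); the second session visited site 1 twice, hence the 2.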

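Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more. Note that the cell below uses `CountVectorizer`, i.e. raw n-gram counts; `TfidfVectorizer` accepts the same two parameters.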

In [165]:
%%time
cv = CountVectorizer(ngram_range=(1, 3), max_features=100000)
with open('train_sessions_text.txt') as inp_train_file:
    X_train_sites_vect = cv.fit_transform(inp_train_file)
with open('test_sessions_text.txt') as inp_test_file:
    X_test_sites_vect = cv.transform(inp_test_file)
print(X_train_sites_vect.shape, X_test_sites_vect.shape)

(236183, 100000) (82797, 100000)
CPU times: user 13.8 s, sys: 83.4 ms, total: 13.9 s
Wall time: 13.9 s
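
To make the n-gram features less abstract, here is a small sketch of `CountVectorizer` on two made-up "sessions" written as strings of site ids (`toy_docs` is hypothetical). It also illustrates a detail of the default settings: the default token pattern only keeps tokens of two or more characters, so single-digit ids such as `7` are silently dropped unless `token_pattern` is changed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sessions: space-separated site ids, one session per string.
toy_docs = ["12 34 12", "34 56 7"]

toy_cv = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
toy_counts = toy_cv.fit_transform(toy_docs)

# Feature names ordered by their column index in the count matrix.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```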


Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [166]:
def get_time_features(df):
    time_df = pd.DataFrame(index=df.index)
    
    time_df['start_month'] = df['time1'].apply(lambda ts: 100 * ts.year + ts.month)
    
    hour = df['time1'].apply(lambda ts: ts.hour)
    time_df['morning'] = ((hour >= 7) & (hour <= 11)).astype('int')
    time_df['day'] = ((hour >= 12) & (hour <= 18)).astype('int')
    time_df['evening'] = ((hour >= 19) & (hour <= 23)).astype('int')
    time_df['night'] = ((hour >= 0) & (hour <= 6)).astype('int')
    
    mincolumn = df[timescolumns].min(axis=1)
    maxcolumn = df[timescolumns].max(axis=1)
    # Calculate sessions' duration in seconds
    time_df['seconds'] = (maxcolumn - mincolumn) / np.timedelta64(1, 's')
    time_df['hour'] = hour
    time_df = pd.get_dummies(time_df, columns = ['hour'])
    return time_df
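
The same kind of time features can also be computed without `apply`, using the vectorized `.dt` accessor. This is an alternative sketch on three made-up timestamps (`toy_times` is hypothetical), not a replacement for the function above.

```python
import pandas as pd

toy_times = pd.to_datetime(pd.Series(
    ["2013-11-15 08:12:00", "2014-01-04 13:40:00", "2014-03-28 21:05:00"]))

hour = toy_times.dt.hour
# Year and month packed into one number, e.g. 201311 for November 2013.
start_month = 100 * toy_times.dt.year + toy_times.dt.month
# between() is inclusive on both ends by default, matching (hour >= 7) & (hour <= 11).
morning = hour.between(7, 11).astype(int)

print(hour.tolist(), start_month.tolist(), morning.tolist())
```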

In [167]:
columns_to_scale = ['seconds', 'start_month']

X_train_time = get_time_features(train_df)
X_test_time = get_time_features(test_df)

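scaler = StandardScaler()
for col in columns_to_scale:
    # Fit on the training column only, then apply the same scaling to the test column.
    # Double brackets keep the 2-D shape that StandardScaler expects
    # (Series.reshape is not available in current pandas).
    X_train_time[col] = scaler.fit_transform(X_train_time[[col]]).ravel()
    X_test_time[col] = scaler.transform(X_test_time[[col]]).ravel()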

print(X_train_time.shape, X_test_time.shape)
X_train_time.head()

(236183, 23) (82797, 23)


session_id,start_month,morning,day,evening,night,seconds,hour_7,hour_8,hour_9,hour_10,...,hour_14,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23
4633,-1.820538,1,0,0,0,0.494621,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
124706,-1.820538,1,0,0,0,1.998222,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
239542,-1.820538,1,0,0,0,-0.152199,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
173721,-1.820538,1,0,0,0,-0.118334,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
12984,-1.820538,1,0,0,0,-0.341842,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
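
The scaling step follows the usual pattern of fitting on the training data and reusing the fitted statistics on the test data. A minimal sketch with made-up numbers (`toy_train` and `toy_test` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[20.0], [50.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
test_scaled = toy_scaler.transform(toy_test)        # reuses the train mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 20 maps to 0 because it equals the training mean, while 50 lands well outside the training range; the test set never influences the scaling parameters.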


In [168]:
X_train = hstack([X_train_time, X_train_sites_vect, full_sites_sparse[:idx_split,:]])
X_test = hstack([X_test_time, X_test_sites_vect, full_sites_sparse[idx_split:,:]])
print(X_train.shape, X_test.shape)

(236183, 148394) (82797, 148394)
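
Here is a small sketch of what `scipy.sparse.hstack` does when a dense block is combined with a sparse one, using made-up blocks. The format of the stacked matrix is not always row-sliceable (it is often COO), so this sketch requests CSR explicitly via `format="csr"`; the notebook achieves the same later by calling `.tocsr()`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

dense_block = np.array([[0.5, 1.0],
                        [-0.5, 0.0]])           # e.g. scaled time features
sparse_block = csr_matrix(np.array([[0, 2, 0],
                                    [1, 0, 0]]))  # e.g. site counts

# Columns are concatenated left to right; rows must match across blocks.
stacked = hstack([dense_block, sparse_block], format="csr")

print(stacked.shape)
print(stacked[:1].toarray())
```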


Scale these features and combine them with Tf-Idf based on sites (you'll need `scipy.sparse.hstack`)

Perform cross-validation with logistic regression.

In [179]:
%%time
tss = TimeSeriesSplit(n_splits=5)

lrcv = LogisticRegressionCV(scoring='roc_auc',
                            #penalty='l1', solver='saga',
                            Cs=np.logspace(-3,1,40), 
                            #class_weight='balanced',
                            #Cs=10,
                            random_state=42,
                            cv=tss, n_jobs=-1, verbose = 20)
lrcv.fit(X_train, y)

[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  2.3min remaining:  3.5min
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:  2.9min remaining:  2.0min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  4.1min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  4.1min finished


CPU times: user 11 s, sys: 1.71 s, total: 12.7 s
Wall time: 4min 17s
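
Because the sessions are sorted by time, `TimeSeriesSplit` always validates on data that comes after the training part. The sketch below prints the folds for 12 made-up samples (`toy_X` is hypothetical) to show how the training window grows from fold to fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy_X = np.arange(12).reshape(-1, 1)  # 12 samples already in time order

folds = list(TimeSeriesSplit(n_splits=5).split(toy_X))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

The first fold trains on the smallest slice of history, which is one plausible reason for a noticeably lower score on the first fold than on the later ones.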


In [180]:
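print(lrcv.Cs_)
print('Best C:', lrcv.C_[0])
# scores_[1] has shape (n_folds, n_Cs): average over folds, then take the best C
print('Max auc_roc:', lrcv.scores_[1].mean(axis=0).max())
try:
    # prev_auc_roc_cv only exists after the next cell has been run at least once
    print('Prev max auc_roc:', prev_auc_roc_cv)
except NameError:
    pass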

[1.00000000e-03 1.26638017e-03 1.60371874e-03 2.03091762e-03
 2.57191381e-03 3.25702066e-03 4.12462638e-03 5.22334507e-03
 6.61474064e-03 8.37677640e-03 1.06081836e-02 1.34339933e-02
 1.70125428e-02 2.15443469e-02 2.72833338e-02 3.45510729e-02
 4.37547938e-02 5.54102033e-02 7.01703829e-02 8.88623816e-02
 1.12533558e-01 1.42510267e-01 1.80472177e-01 2.28546386e-01
 2.89426612e-01 3.66524124e-01 4.64158883e-01 5.87801607e-01
 7.44380301e-01 9.42668455e-01 1.19377664e+00 1.51177507e+00
 1.91448198e+00 2.42446202e+00 3.07029063e+00 3.88815518e+00
 4.92388263e+00 6.23550734e+00 7.89652287e+00 1.00000000e+01]
Best C: 0.46415888336127775
Max auc_roc: 0.945364620230589
Prev max auc_roc: 0.9184291952249257


In [181]:
prev_auc_roc_cv = lrcv.scores_[1].mean(axis=0).max()

In [182]:
def get_auc_lr_valid(X, y, C=1.0, seed=42, ratio = 0.9):
    # Split the data into the training and validation sets
    idx = int(round(X.shape[0] * ratio))
    # Classifier training
    lr = LogisticRegression(C=C, 
                            #class_weight='balanced', 
                            random_state=seed, n_jobs=-1).fit(X[:idx, :], y[:idx])
    # Prediction for validation set
    y_pred = lr.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)
    
    return score
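
As a reminder of what `roc_auc_score` measures: ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The helper `pairwise_auc` below is a hypothetical, quadratic-time illustration of that definition on four made-up predictions, checked against scikit-learn.

```python
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_true = [0, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.8]

auc_manual = pairwise_auc(toy_true, toy_score)
auc_sklearn = roc_auc_score(toy_true, toy_score)
print(auc_manual, auc_sklearn)
```

Three of the four pairs are ordered correctly, so both values come out as 0.75. Since only the ranking matters, the metric is unaffected by the heavy class imbalance typical for this kind of task.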

In [183]:
%%time
auc_score = get_auc_lr_valid(X_train.tocsr(), y, C=lrcv.C_[0], ratio = 0.75)
print(auc_score)

try:
    print(prev_auc_score)
except NameError:
    pass

0.961670701347521
0.961670701347521
CPU times: user 42.3 s, sys: 1min, total: 1min 42s
Wall time: 13.3 s


In [184]:
prev_auc_score = auc_score

In [185]:
cv_scores = cross_val_score(LogisticRegression(C=lrcv.C_[0], random_state=42, n_jobs=-1), 
                            X_train, y, 
                            scoring='roc_auc', cv=tss, 
                            n_jobs=-1, verbose=10)
print(np.mean(cv_scores))
try:
    print(prev_cv_score)
except NameError:
    pass

[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV] ....................... , score=0.6695689129913087, total=   5.6s
[CV] ....................... , score=0.9416486209834728, total=  12.8s


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   13.2s remaining:   19.8s


[CV] ....................... , score=0.9702064479638008, total=  24.2s


[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:   24.7s remaining:   16.5s


[CV] ....................... , score=0.9594850140722488, total=  26.4s
[CV] ....................... , score=0.9766645984594703, total=  31.7s
0.9035147188940602
0.9020929348781814


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   32.6s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   32.6s finished


In [186]:
prev_cv_score = np.mean(cv_scores)

Make predictions for the test set and write the submission file.

In [187]:
test_pred = lrcv.predict_proba(X_test)[:, 1]

In [188]:
%%notify
write_to_submission_file(test_pred, "alice_submission_base.csv")
