<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose, provided that the names of the authors and the source are indicated.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [3]:
PATH_TO_DATA = '../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

Separate the target feature.

In [4]:
y = train_df['target']

Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [6]:
import math
def to_str(a):
    if not a or math.isnan(a):
        return "0"
    return str(int(a))

def df_to_sites(df):
    col_names = ['site'+str(i) for i in range(1,11)]
    for idx, row in df.iterrows():
        arr = list(map(to_str, row[col_names]))
        yield ' '.join(arr)


In [7]:
%%time
vct = TfidfVectorizer(ngram_range=(1,3), max_features=100000, stop_words=['0'])
rez = vct.fit_transform(df_to_sites(train_df))

CPU times: user 3min 29s, sys: 503 ms, total: 3min 29s
Wall time: 3min 29s
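Iterating with `iterrows` over roughly 250k sessions is slow. A minimal sketch of building the same space-separated session strings column-wise, assuming the `site1`…`site10` layout used above with missing sites stored as NaN. The toy frame `toy_df` and the helper `sites_to_text` are hypothetical names used only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def sites_to_text(df, n_sites=10):
    """Join the site IDs of each session into one string (missing site -> '0')."""
    col_names = ['site%d' % i for i in range(1, n_sites + 1)]
    # Convert whole columns at once instead of converting value by value
    sites = df[col_names].fillna(0).astype(int).astype(str)
    return [' '.join(row) for row in sites.values]

# Toy data (hypothetical): 3 sessions with 3 site columns
toy_df = pd.DataFrame({
    'site1': [101, 205, 101],
    'site2': [205, np.nan, 333],
    'site3': [np.nan, np.nan, 101],
})
toy_text = sites_to_text(toy_df, n_sites=3)
print(toy_text)  # ['101 205 0', '205 0 0', '101 333 101']

toy_vct = TfidfVectorizer(ngram_range=(1, 3), stop_words=['0'])
toy_tfidf = toy_vct.fit_transform(toy_text)
print(toy_tfidf.shape[0])  # one row per session
```

Note that the default `token_pattern` of `TfidfVectorizer` keeps only tokens of two or more characters, so the padding token `0` (and any single-digit site ID) is dropped even without `stop_words`.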


Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [None]:
from datetime import datetime
from scipy.sparse import coo_matrix

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'

def last_ts(row):
    # Return the last non-empty timestamp among time10..time2, or None
    for idx in range(10, 1, -1):
        val = row['time' + str(idx)]
        if isinstance(val, str) and val:
            return val
    return None

def new_features(df):
    # Yield [start hour, duration in seconds, is_day, is_night] for each session
    for idx, row in df.iterrows():
        ts_start = datetime.strptime(row['time1'], TIME_FORMAT)
        ts_end = last_ts(row)
        if ts_end:
            ts_end = datetime.strptime(ts_end, TIME_FORMAT)
            # total_seconds() also counts whole days, unlike .seconds
            duration = int((ts_end - ts_start).total_seconds())
        else:
            duration = 0
        features = [ts_start.hour, duration]
        if ts_start.hour < 8 or ts_start.hour > 22:
            features.extend([0, 1])  # night session
        else:
            features.extend([1, 0])  # daytime session
        yield features

add_features = coo_matrix(list(new_features(train_df)), shape=(len(train_df), 4))

In [199]:
add_feat = coo_matrix(list(new_features(train_df.head(10))), shape=(10, 4))

In [202]:
add_feat.todense()

matrix([[ 10,   0,   1,   0],
        [ 11,  26,   1,   0],
        [ 16,   7,   1,   0],
        [ 10, 270,   1,   0],
        [ 10, 246,   1,   0],
        [ 15, 686,   1,   0],
        [ 16, 102,   1,   0],
        [ 10,   6,   1,   0],
        [ 16,  45,   1,   0],
        [ 16,  87,   1,   0]])
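The same four columns (start hour, duration in seconds, daytime flag, night flag) can also be computed column-wise with pandas. The sketch below assumes that `time1`…`time10` hold timestamp strings with missing values as NaN and that timestamps within a session never decrease, so the row maximum is the last visit. `toy_times` and `time_features` are hypothetical names used only for illustration:

```python
import numpy as np
import pandas as pd

def time_features(df, n_times=10):
    """Return a DataFrame with hour, duration, is_day and is_night per session."""
    time_cols = ['time%d' % i for i in range(1, n_times + 1)]
    times = df[time_cols].apply(pd.to_datetime)
    start = times['time1']
    end = times.max(axis=1)  # missing timestamps (NaT) are skipped
    hour = start.dt.hour
    duration = (end - start).dt.total_seconds().astype(int)
    is_night = ((hour < 8) | (hour > 22)).astype(int)
    return pd.DataFrame({
        'hour': hour,
        'duration': duration,
        'is_day': 1 - is_night,
        'is_night': is_night,
    })

# Toy data (hypothetical): 2 sessions with 3 time columns
toy_times = pd.DataFrame({
    'time1': ['2014-02-20 10:02:45', '2014-02-22 23:55:00'],
    'time2': ['2014-02-20 10:03:11', np.nan],
    'time3': [np.nan, np.nan],
})
toy_feats = time_features(toy_times, n_times=3)
print(toy_feats)
```

A session with a single visit gets a duration of 0, which matches the behavior of `new_features`.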

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [152]:
# Your code here
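One possible way to approach this step, shown on toy data rather than as the reference solution: standardize the dense extra features with `StandardScaler`, then stack them next to the sparse Tf-Idf matrix. The arrays `toy_tfidf` and `toy_extra` are hypothetical stand-ins for the Tf-Idf matrix and the four time-based features:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins: 3 sessions, 3 Tf-Idf columns, 4 extra features
toy_tfidf = csr_matrix(np.array([[0.0, 0.7, 0.7],
                                 [1.0, 0.0, 0.0],
                                 [0.0, 0.0, 1.0]]))
toy_extra = np.array([[10, 0, 1, 0],
                      [11, 26, 1, 0],
                      [23, 7, 0, 1]], dtype=float)

# Fit the scaler on the training features only
scaler = StandardScaler()
toy_extra_scaled = scaler.fit_transform(toy_extra)

# hstack requires the same number of rows in every block
X_toy = hstack([toy_tfidf, csr_matrix(toy_extra_scaled)]).tocsr()
print(X_toy.shape)  # (3, 7)
```

For the test set, the already fitted scaler would be reused with `scaler.transform(...)` so that both sets share the same mean and variance.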

Perform cross-validation with logistic regression.

In [9]:
%%time
regr = LogisticRegressionCV(n_jobs=-1)
regr.fit(rez, y)
# for train_index, test_index in skf.split(X, y):
#     X_train, X_test = X[train_index], X[test_index]
#     y_train, y_test = y[train_index], y[test_index]
#     print(y_train)
#     regr.fit(X_train, y_train)
#     print(regr.score(X_test, y_test))
    

CPU times: user 7.98 s, sys: 420 ms, total: 8.4 s
Wall time: 51.6 s
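The commented-out loop above refers to `skf` and `X`, which are never defined in this notebook. A minimal sketch of an explicit stratified cross-validation loop, assuming ROC AUC as the metric (the notebook imports `roc_auc_score`). The synthetic `X_toy` and `y_toy`, the number of folds, and `C=1.0` are illustrative choices, not values from the assignment:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic data (hypothetical): the target depends mostly on the first feature
rng = np.random.RandomState(17)
dense = rng.rand(200, 5)
y_toy = (dense[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)
X_toy = csr_matrix(dense)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
scores = []
for train_idx, valid_idx in skf.split(X_toy, y_toy):
    model = LogisticRegression(C=1.0, random_state=17)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    # ROC AUC is computed from class-1 probabilities, not from hard labels
    valid_proba = model.predict_proba(X_toy[valid_idx])[:, 1]
    scores.append(roc_auc_score(y_toy[valid_idx], valid_proba))

print(np.mean(scores), np.std(scores))
```

On the real data, `X_toy` and `y_toy` would be replaced by the combined feature matrix and `y`. Indexing `y` by position (for example `y.values[train_idx]`) avoids surprises when the index of `train_df` does not start at 0.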


Make prediction for the test set and form a submission file.

In [157]:
a = vct.transform(df_to_sites(test_df))

In [166]:
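# Class-1 probabilities rank sessions better than hard 0/1 labels for an AUC-type metric
test_pred = regr.predict_proba(a)[:, 1]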

In [172]:
len(test_pred)

82797

In [173]:
set(y)

{0, 1}

In [161]:
rez.shape

(253561, 100000)

In [160]:
train_df.shape

(253561, 21)

In [162]:
a.shape

(82797, 100000)

In [178]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

write_to_submission_file(test_pred, "assignment6_alice_submission.csv")