<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [52]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from datetime import datetime, time
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [53]:
PATH_TO_DATA = '../../data/user_identification'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

Separate the target variable.

In [54]:
y = train_df['target']

In [69]:
import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker
%matplotlib inline

Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.
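To make the input of the vectorizer concrete, here is a minimal sketch on three made-up sessions (the site IDs and the `toy_` names are illustrative assumptions, not competition data). Each session becomes a space-separated string of site IDs, so word n-grams correspond to runs of consecutively visited sites.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sessions: site IDs, with 0 standing for "no site visited"
toy_sessions = [
    [21, 35, 21, 0],
    [35, 21, 0, 0],
    [70, 70, 35, 21],
]

# Drop the zero padding and join the IDs into one "sentence" per session
toy_docs = [' '.join(str(s) for s in session if s != 0) for session in toy_sessions]

# Unigrams, bigrams and trigrams of site IDs
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

print(toy_docs)
print(sorted(toy_vectorizer.vocabulary_))
print(toy_tfidf.shape)
```

Note that the default `token_pattern` of `TfidfVectorizer` keeps only tokens of two or more characters, so single-digit site IDs are ignored unless the pattern is changed.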

In [207]:
def prepare_features(df, scaler=None, vectorizer=None):
    sites_df = df[['site%d' % i for i in range(1, 11)]].fillna(0).astype('int').values
    
    # Join the non-zero site IDs of each session into a space-separated "document" for Tf-Idf
    sessions = [' '.join(str(site_id) for site_id in sites_ids if site_id != 0) for sites_ids in sites_df]
    if vectorizer is None:
        vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000, sublinear_tf=True)
        vectorizer.fit(sessions)
    tfidf_features = vectorizer.transform(sessions)
    
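    # Note: the scaler is fit on raw site IDs, and the scaled array is not part of the returned features
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(sites_df)
    sites_df = scaler.transform(sites_df)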
    
    df['session_start'] = df['time1'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
    df['year'] = df['session_start'].apply(lambda x: int(x.year))
    df['month'] = df['session_start'].apply(lambda x: int(x.month))
    df['day'] = df['session_start'].apply(lambda x: int(x.day))
    df['dow'] = df['session_start'].apply(lambda x: int(x.weekday()))
    df['is_weekend'] = df['dow'].apply(lambda x: 1 if x in (5, 6) else 0)
    df['tod'] = df['session_start'].apply(lambda x: int(x.hour))
    
    # Hours run from 0 to 23, so "night" is 22:00-07:59
    df['is_night'] = df['tod'].apply(lambda x: 1 if x >= 22 or x < 8 else 0)
    df['is_morning'] = df['tod'].apply(lambda x: 1 if 8 <= x < 10 else 0)
    df['is_before_dinner'] = df['tod'].apply(lambda x: 1 if 10 <= x < 13 else 0)
    df['is_dinner'] = df['tod'].apply(lambda x: 1 if 13 <= x < 15 else 0)
    df['is_after_dinner'] = df['tod'].apply(lambda x: 1 if 15 <= x < 19 else 0)
    df['is_evening'] = df['tod'].apply(lambda x: 1 if 19 <= x < 22 else 0)
    df['is_alice_dow'] = df['dow'].apply(lambda x: 1 if x == 0 or x == 1 or x == 3 or x == 4 else 0)
    df['is_alice_time'] = df['session_start'].apply(lambda x: 1 if 12 <= x.hour <= 13 or time(hour=15, minute=50) <= x.time() <= time(hour=18, minute=20) else 0)
    
    time_bool_features = df[['is_weekend', 'is_night', 'is_before_dinner', 'is_dinner', 'is_after_dinner', 'is_evening', 'is_alice_dow', 'is_alice_time']]
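    # Fix the category sets (7 weekdays, 24 hours) so train and test get identical columns;
    # the `n_values` argument was removed from recent scikit-learn versions in favor of `categories`
    time_categorical_features = OneHotEncoder(categories=[np.arange(7), np.arange(24)]).fit_transform(df[['dow', 'tod']])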
    
    features = hstack([tfidf_features, time_bool_features, time_categorical_features]).tocsr()
    return (scaler, vectorizer, features)

Add features based on the session start time: hour, whether it's morning, day or night and so on.
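As an illustration of this step, the sketch below derives a few start-time features from a single timestamp string using only the standard library. The function name `toy_time_features` is hypothetical; the hour boundaries mirror the ones used in `prepare_features`.

```python
from datetime import datetime

def toy_time_features(timestamp):
    """Derive a few start-time features from a 'YYYY-MM-DD HH:MM:SS' string."""
    start = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    hour = start.hour
    dow = start.weekday()  # Monday is 0, Sunday is 6
    return {
        'hour': hour,
        'dow': dow,
        'is_weekend': int(dow >= 5),
        'is_morning': int(8 <= hour < 10),
        'is_night': int(hour >= 22 or hour < 8),
    }

print(toy_time_features("2014-02-20 10:02:45"))
print(toy_time_features("2014-02-22 23:15:00"))
```

On a full DataFrame the same quantities can be obtained column-wise, which is usually faster than applying a Python function row by row.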

In [208]:
%%time
(scaler, vectorizer, X) =  prepare_features(train_df)

CPU times: user 36 s, sys: 250 ms, total: 36.3 s
Wall time: 36.3 s


Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).
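A minimal sketch of scaling and stacking, assuming a tiny hand-made matrix in place of the real Tf-Idf output (all `toy_` names and values are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a sparse Tf-Idf matrix: 4 sessions, 3 terms
toy_tfidf = csr_matrix(np.array([
    [0.0, 0.7, 0.7],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
]))

# Hypothetical dense time features: start hour and day of week
toy_time = np.array([[9, 0], [13, 2], [17, 4], [21, 6]], dtype=float)

# Standardize each dense column to zero mean and unit variance
toy_scaler = StandardScaler()
toy_time_scaled = toy_scaler.fit_transform(toy_time)

# Stack column-wise and convert to CSR so that rows can be sliced later
toy_X = hstack([toy_tfidf, toy_time_scaled]).tocsr()

print(toy_X.shape)
```

For the test set, the scaler fitted on the training data should be reused via `toy_scaler.transform` rather than fitted again, so that both sets share the same scaling.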

In [209]:
train_share = int(.7 * X.shape[0])
X_train, y_train = X[:train_share, :], y[:train_share]
X_valid, y_valid = X[train_share:, :], y[train_share:]

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

Perform cross-validation with logistic regression.
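The notebook imports `cross_val_score` but scores the folds by accuracy inside `LogisticRegressionCV`. As a hedged alternative, the sketch below runs stratified 3-fold cross-validation scored by ROC AUC, the metric reported on the hold-out part further down. The data is synthetic (an assumption made purely for illustration), and `C=1.0` is an arbitrary choice, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the session features (an assumption)
toy_X, toy_y = make_classification(n_samples=500, n_features=10,
                                   weights=[0.9, 0.1], random_state=17)

toy_skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
toy_logit = LogisticRegression(C=1.0, class_weight='balanced', random_state=17)

# One ROC AUC value per fold
toy_scores = cross_val_score(toy_logit, toy_X, toy_y, cv=toy_skf, scoring='roc_auc')
print(toy_scores.mean(), toy_scores.std())
```

Stratified folds keep the share of the rare positive class roughly equal across folds, which matters when the target is as imbalanced as it is in this task.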

In [210]:
%%time
logit_searcher = LogisticRegressionCV(Cs=[0.3484], cv=skf, scoring='accuracy', class_weight='balanced', random_state=17, n_jobs=-1)
logit_searcher.fit(X_train, y_train)

# scores_ maps each class label to an array of shape (n_folds, n_Cs); average over folds
logit_scores = np.mean(list(logit_searcher.scores_.values())[0], axis=0)
best_score_index = np.argmax(logit_scores)
print('Best C: {0}, best score: {1}'.format(logit_searcher.Cs_[best_score_index], logit_scores[best_score_index]))

Best C: 0.3484, best score: 0.9626800081130419
CPU times: user 4.93 s, sys: 140 ms, total: 5.07 s
Wall time: 12.8 s


In [211]:
%%time
logit_valid_pred_proba = logit_searcher.predict_proba(X_valid)[:, 1]

print(accuracy_score(y_valid, logit_searcher.predict(X_valid)))
# Standard deviation of the fold scores as a percentage of the mean score
print(np.std(list(logit_searcher.scores_.values())[0], axis=0)[best_score_index] / logit_scores[best_score_index] * 100)
print(roc_auc_score(y_valid, logit_valid_pred_proba))

0.9627969343622238
0.055908713140284706
0.9852330819393322
CPU times: user 50 ms, sys: 0 ns, total: 50 ms
Wall time: 52.2 ms


Make predictions for the test set and form a submission file.

In [212]:
%%time
logit = LogisticRegression(C=logit_searcher.Cs_[best_score_index], class_weight='balanced', random_state=17, n_jobs=-1)
logit.fit(X, y)

(scaler, vectorizer, X_test) =  prepare_features(test_df, scaler, vectorizer)
logit_test_pred_proba = logit.predict_proba(X_test)[:, 1]

CPU times: user 11.8 s, sys: 20 ms, total: 11.8 s
Wall time: 12 s


In [213]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    """Write predictions to a CSV file indexed by session IDs starting at 1."""
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [214]:
write_to_submission_file(logit_test_pred_proba, "assignment6_alice_submission.csv")