<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [169]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Imputer
from sklearn.model_selection import GridSearchCV

Reading original data

In [170]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [171]:
train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
1,718,2014-02-20 10:02:45,,,,,,,,,...,,,,,,,,,,0
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,...,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,...,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,...,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,...,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0


Separate the target variable.

In [172]:
y = train_df['target']

Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

Select site columns

In [173]:
site_columns = [col for col in train_df.columns if col.startswith("site")]
time_columns = [col for col in train_df.columns if col.startswith("time")]
site_columns, time_columns

(['site1',
  'site2',
  'site3',
  'site4',
  'site5',
  'site6',
  'site7',
  'site8',
  'site9',
  'site10'],
 ['time1',
  'time2',
  'time3',
  'time4',
  'time5',
  'time6',
  'time7',
  'time8',
  'time9',
  'time10'])

In [174]:
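# Selecting columns with a list returns a copy, so fillna(..., inplace=True)
# on that selection would not change the original frame; assign the result back.
train_df[site_columns] = train_df[site_columns].fillna(0)
test_df[site_columns] = test_df[site_columns].fillna(0)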

In [175]:
sites_train = train_df[site_columns].to_string(index = False, header = False).split('\n')
sites_test = test_df[site_columns].to_string(index = False, header = False).split('\n')

In [176]:
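# Your code here
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
# Fit the vocabulary on the training sessions and transform them in one step
X_train_tfidf = vectorizer.fit_transform(sites_train)
# Reuse the fitted vocabulary and IDF weights for the test sessions
X_test_tfidf = vectorizer.transform(sites_test)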
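
To make the "sessions as text" idea concrete, here is a minimal sketch on a toy frame. The site IDs, the `toy_*` names, and the `token_pattern` choice are assumptions made for illustration; they are not taken from the competition data or from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sessions with hypothetical site IDs; NaN marks a session shorter than 3 sites.
toy = pd.DataFrame({
    "site1": [718, 890, 782],
    "site2": [np.nan, 941, 782],
    "site3": [np.nan, 3847, 782],
})

# Turn each row into a space-separated "sentence" of integer site IDs,
# using 0 as a placeholder for missing visits.
toy_sessions = (
    toy.fillna(0).astype(int).astype(str).apply(" ".join, axis=1).tolist()
)

# token_pattern=r"\S+" keeps every whitespace-separated ID as a token,
# including one-character ones that the default pattern would drop.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
toy_tfidf = toy_vectorizer.fit_transform(toy_sessions)

print(toy_sessions)
print(toy_tfidf.shape)
```

With `ngram_range=(1, 3)` the vocabulary contains single sites as well as pairs and triples of consecutive sites, so the model can pick up short browsing sequences and not only individual sites.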

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [177]:
X_train_hour = np.array([pd.to_datetime(train_df[col]).dt.hour for col in time_columns]).transpose()
X_test_hour = np.array([pd.to_datetime(test_df[col]).dt.hour for col in time_columns]).transpose()

hour_imputer = Imputer(missing_values='NaN', strategy='mean')
# Fit the imputer on the training data only and reuse it for the test set,
# so both sets are filled with the same column means
hour_imputer.fit(X_train_hour)
X_train_hour = hour_imputer.transform(X_train_hour)
X_test_hour = hour_imputer.transform(X_test_hour)

X_train_hour

array([[ 10.        ,  12.28118178,  12.27228362, ...,  12.24498351,
         12.24227746,  12.23774302],
       [ 11.        ,  11.        ,  11.        , ...,  11.        ,
         11.        ,  11.        ],
       [ 16.        ,  16.        ,  16.        , ...,  16.        ,
         16.        ,  16.        ],
       ..., 
       [ 14.        ,  14.        ,  14.        , ...,  12.24498351,
         12.24227746,  12.23774302],
       [ 15.        ,  15.        ,  15.        , ...,  15.        ,
         15.        ,  15.        ],
       [  9.        ,   9.        ,   9.        , ...,   9.        ,
          9.        ,   9.        ]])
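
The task text also mentions indicators such as morning, day, or night. One possible way to derive them from the session start time is sketched below. The hour boundaries, the extra weekday feature, and all `toy_*` names are assumptions chosen for illustration, not part of the assignment.

```python
import pandas as pd

# Hypothetical session start times (the role played by `time1`).
toy_time1 = pd.Series(pd.to_datetime([
    "2014-02-20 10:02:45",
    "2013-12-16 16:40:17",
    "2014-03-28 21:15:00",
    "2014-01-05 03:30:00",
]))

start_hour = toy_time1.dt.hour

# Series.between is inclusive on both ends by default, so the four
# ranges below cover hours 0..23 without overlapping.
toy_time_features = pd.DataFrame({
    "start_hour": start_hour,
    "is_morning": start_hour.between(7, 11).astype(int),
    "is_day": start_hour.between(12, 18).astype(int),
    "is_evening": start_hour.between(19, 23).astype(int),
    "is_night": start_hour.between(0, 6).astype(int),
    "weekday": toy_time1.dt.dayofweek,  # Monday = 0
})

print(toy_time_features)
```

Binary indicators like these let a linear model treat parts of the day independently, instead of assuming the target changes monotonically with the hour.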

In [178]:
# Your code here
scaler = MinMaxScaler()
scaler.fit(X_train_hour)
X_train_hour = scaler.transform(X_train_hour)
X_test_hour = scaler.transform(X_test_hour)
X_test_hour

array([[ 0.25  ,  0.25  ,  0.25  , ...,  0.25  ,  0.25  ,  0.25  ],
       [ 0.25  ,  0.25  ,  0.25  , ...,  0.25  ,  0.25  ,  0.25  ],
       [ 0.5   ,  0.5   ,  0.5   , ...,  0.5   ,  0.5   ,  0.5   ],
       ..., 
       [ 0.25  ,  0.25  ,  0.25  , ...,  0.25  ,  0.25  ,  0.25  ],
       [ 0.1875,  0.1875,  0.1875, ...,  0.1875,  0.1875,  0.1875],
       [ 0.1875,  0.1875,  0.1875, ...,  0.1875,  0.1875,  0.1875]])
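
The imputation and scaling steps can also be chained so that every statistic is learned from the training set only. The sketch below assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer` (0.20 or later); the toy arrays and the `hour_pipeline` name are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy hour matrices: two time columns, NaN where the session had no second visit.
toy_train_hours = np.array([[10.0, np.nan], [11.0, 11.0], [16.0, 16.0]])
toy_test_hours = np.array([[9.0, np.nan], [15.0, 15.0]])

# Mean imputation followed by min-max scaling, both fitted on the training data.
hour_pipeline = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler())
toy_train_scaled = hour_pipeline.fit_transform(toy_train_hours)

# The test set is only transformed: it reuses the training means and ranges,
# so scaled test values may fall slightly outside [0, 1].
toy_test_scaled = hour_pipeline.transform(toy_test_hours)

print(toy_train_scaled)
print(toy_test_scaled)
```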

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [180]:
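# Only the Tf-Idf matrices are stacked here; to include the scaled hour
# features, add them to the lists, e.g. hstack([X_train_tfidf, X_train_hour])
X_train = hstack([X_train_tfidf])
X_test = hstack([X_test_tfidf])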
X_train.shape, X_test.shape

((253561, 100000), (82797, 100000))
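
As a small illustration of how `scipy.sparse.hstack` joins a sparse Tf-Idf block with dense extra columns, consider the following sketch. The matrices and `toy_*` names are invented for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# A tiny stand-in for a sparse Tf-Idf matrix: 3 sessions, 3 "terms".
toy_sparse = csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.7, 0.0, 0.1],
    [0.0, 0.0, 0.9],
]))

# A dense block with one already-scaled feature per session.
toy_extra = np.array([[0.25], [0.50], [0.1875]])

# Wrap the dense block in a sparse matrix and stack column-wise; the row
# counts must match. tocsr() gives a format that supports efficient row slicing.
toy_combined = hstack([toy_sparse, csr_matrix(toy_extra)]).tocsr()

print(toy_combined.shape)
```

The result stays sparse, which matters when the Tf-Idf block has on the order of 100000 columns.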

Perform cross-validation with logistic regression.

Make prediction for the test set and form a submission file.

In [181]:
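lr = LogisticRegression()
params = {'C': [9]}
# Use `scoring` (not `error_score`) to select ROC AUC as the cross-validation metric.
# The outputs recorded below come from a run where `scoring` was left at its
# default, so the reported best score is mean accuracy.
clf = GridSearchCV(lr, params, n_jobs=-1, scoring='roc_auc', cv=5)
clf.fit(X_train, y)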

GridSearchCV(cv=5, error_score='roc_auc',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1, param_grid={'C': [9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [182]:
clf.best_params_, clf.best_score_

({'C': 9}, 0.99228587992632933)
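
For reference, cross-validated ROC AUC can be computed directly with `cross_val_score`. The sketch below uses synthetic, imbalanced data from `make_classification`; the data, the fold settings, and the `toy_*` names are assumptions for illustration and say nothing about the scores achievable in the competition.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem with roughly 10% positives.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=17
)

# Stratified folds keep the class ratio similar in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
toy_lr = LogisticRegression(C=1.0, solver="liblinear", random_state=17)

# scoring="roc_auc" ranks by area under the ROC curve instead of accuracy.
toy_auc = cross_val_score(toy_lr, X_toy, y_toy, cv=skf, scoring="roc_auc")

print(toy_auc)
print(toy_auc.mean())
```

With a rare positive class, accuracy can look high even when the model ranks sessions poorly, which is why a ranking metric such as ROC AUC is often the more informative choice here.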

In [183]:
# Your code here
# Predicted probability of the positive class (target = 1) for each test session
test_pred = clf.best_estimator_.predict_proba(X_test)[:, 1]

In [62]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [184]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
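
Before uploading, it can help to read a submission file back and check its shape and index. The sketch below writes a toy file in the same layout (a `session_id` index starting at 1 and a single `target` column) to a temporary directory; the predictions and file name are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predicted probabilities for three test sessions.
toy_pred = np.array([0.01, 0.80, 0.15])

toy_submission = pd.DataFrame(
    toy_pred,
    index=np.arange(1, toy_pred.shape[0] + 1),  # session IDs start at 1
    columns=["target"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_submission.csv")
    toy_submission.to_csv(toy_path, index_label="session_id")
    # Read the file back to confirm the header and row count.
    loaded = pd.read_csv(toy_path, index_col="session_id")

print(loaded)
```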