<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
%load_ext jupyternotify


In [24]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
import pickle
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer

In [3]:
prev_auc_score = 0
prev_cv_score = 0
prev_auc_roc_cv = 0  # printed after the LogisticRegressionCV run below

Read the original data.

In [102]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'websites_train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'websites_test_sessions.csv'), index_col='session_id')

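# Sort the data by time
train_df = train_df.sort_values(by='time1')
# Keep only sessions that start on or after 2013-09-01
# (time1 is still an ISO-formatted string here, so string comparison is chronological)
train_df.drop(train_df[train_df['time1']<'2013-09-01'].index, axis=0, inplace=True)

# Load websites dictionary
with open(os.path.join(PATH_TO_DATA, 'site_dic.pkl'), "rb") as input_file:
    site_dict = pickle.load(input_file)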

Separate the target variable.

In [103]:
y = train_df['target']

In [104]:
# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


site_id,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [105]:
# Combined dataframe of the training and test data
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# Index to split the training and test data sets
idx_split = train_df.shape[0]

# Column names for sites and times
sitescolumns = ['site%s' % i for i in range(1, 11)]
timescolumns = ['time%s' % i for i in range(1, 11)]

full_df[timescolumns] = full_df[timescolumns].apply(pd.to_datetime)
full_df[sitescolumns] = full_df[sitescolumns].fillna(-1).astype(np.int32)

In [106]:
# Do the same as in Assignment 4; copy so that a column can be added later
# without writing into a view of full_df
full_sites = full_df[sitescolumns].copy()
print(full_sites.shape)
full_sites.head()

(318980, 10)


session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
4633,41475,41475,41476,41475,6725,41475,41475,41475,6725,41475
124706,41476,41475,41475,6725,41476,6725,41475,41476,41476,41476
239542,21,21,22,23,21,22,23,21,722,-1
173721,820,21,21,23,22,23,22,21,-1,-1
12984,982,812,39,676,812,5932,679,812,679,676


In [107]:
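# Flattened sequence of site IDs, 10 per session (missing sites were filled with -1)
sites_flatten = full_sites.values.flatten()
# Map the -1 placeholder to column 0 so that every column index is valid
sites_flatten = np.where(sites_flatten < 0, 0, sites_flatten)

# Sparse matrix of site counts: row i counts the visits of session i to each site.
# Column 0 collects the placeholders and is dropped by the final slice.
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0] + 10, 10)))[:, 1:]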
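
The constructor call above uses the `(data, indices, indptr)` form of `csr_matrix`: row `i` owns the entries between `indptr[i]` and `indptr[i + 1]`, and repeated column indices within a row are summed when the matrix is converted to a dense array. A minimal sketch on hypothetical toy data (two sessions of three sites each, with `0` assumed as the placeholder for a missing site) may make the layout easier to follow:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 2 sessions, 3 site slots each; 0 marks a missing site
sessions = np.array([[1, 1, 3],
                     [2, 0, 0]])
flat = sessions.flatten()

data = np.ones(flat.shape[0], dtype=np.int64)   # one count per visited slot
indptr = np.arange(0, flat.shape[0] + 3, 3)      # row i spans flat[indptr[i]:indptr[i + 1]]
counts = csr_matrix((data, flat, indptr))

# Column 0 only counts placeholders, so it is sliced away
site_counts = counts[:, 1:].toarray()
print(site_counts)
```

Each row of `site_counts` then holds the number of visits to sites 1, 2 and 3 within one session.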

Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [108]:
full_sites['sites'] = full_sites.astype(str).apply(lambda x: ' '.join(x), axis=1)

In [109]:
full_sites.head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,sites
4633,41475,41475,41476,41475,6725,41475,41475,41475,6725,41475,41475 41475 41476 41475 6725 41475 41475 41475...
124706,41476,41475,41475,6725,41476,6725,41475,41476,41476,41476,41476 41475 41475 6725 41476 6725 41475 41476 ...
239542,21,21,22,23,21,22,23,21,722,-1,21 21 22 23 21 22 23 21 722 -1
173721,820,21,21,23,22,23,22,21,-1,-1,820 21 21 23 22 23 22 21 -1 -1
12984,982,812,39,676,812,5932,679,812,679,676,982 812 39 676 812 5932 679 812 679 676


In [170]:
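# Your code here
# Earlier Tf-Idf attempts, kept for reference:
#tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features = 100000) #change to None
#tfidf_matrix = tfidf.fit_transform(full_sites['sites'])
#vect_df = vect.fit_transform(full_sites['site1'].astype(str))
#tfidf_matrix = tfidftrans.fit_transform(vect_df)

# Note: CountVectorizer produces raw n-gram counts, not Tf-Idf weights;
# the tfidf_* names are kept only for continuity with the cells below.
# The vocabulary is fitted on the training sessions only.
vect = CountVectorizer(ngram_range=(1, 4), max_features=50000)
tfidf_train = vect.fit_transform(full_sites[:idx_split]['sites'])
tfidf_test = vect.transform(full_sites[idx_split:]['sites'])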
print(tfidf_train.shape)
print(tfidf_test.shape)
tfidf_matrix = vstack([tfidf_train, tfidf_test])
print(tfidf_matrix.shape)

(236183, 50000)
(82797, 50000)
(318980, 50000)
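
One detail worth checking when site IDs are vectorized as text: the default `token_pattern` of scikit-learn's text vectorizers only keeps tokens of two or more word characters, so one-digit site IDs are silently dropped. The sketch below uses hypothetical toy sessions to show the effect, and how a wider pattern (an assumption of this example, not a setting used in the notebook) keeps them when Tf-Idf weights are built:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy sessions written as space-separated site IDs
docs = ["21 5 21 722", "5 9 21 21"]

# Default token_pattern: tokens need at least two word characters
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# A pattern that also keeps one-character tokens such as "5" and "9"
keep_all = r"(?u)\b\w+\b"
full_vocab = sorted(CountVectorizer(token_pattern=keep_all).fit(docs).vocabulary_)

# Tf-Idf weights over unigrams and bigrams of site IDs
tfidf = TfidfVectorizer(token_pattern=keep_all, ngram_range=(1, 2))
weights = tfidf.fit_transform(docs)

print(default_vocab)
print(full_vocab)
print(weights.shape)
```

Whether keeping the short IDs helps the score is an empirical question; the sketch only shows where the tokens go.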


Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [247]:
# create separate dataframes for numerical and for binary (OHE) features
numerical_df = pd.DataFrame(index=full_df.index)

numerical_df['start_month'] = full_df['time1'].apply(lambda ts: 100 * ts.year + ts.month)
numerical_df['hour'] = full_df['time1'].apply(lambda ts: ts.hour)

# Minute of the day slightly reduces the CV score
#numerical_df['minute'] = full_df['time1'].apply(lambda ts: ts.hour*60+ts.minute)
# Find the start and end of each session
numerical_df['min'] = full_df[timescolumns].min(axis=1)
numerical_df['max'] = full_df[timescolumns].max(axis=1)

# Calculate each session's duration in seconds
numerical_df['seconds'] = (numerical_df['max'] - numerical_df['min']) / np.timedelta64(1, 's')

# Calculate differences between times
#from __future__ import division
#numerical_df['diffs'] =full_df[timescolumns].apply(lambda row: [pd.Timedelta(sorted(row)[n] - sorted(row)[n-1]).seconds for n in range(1,10)], axis=1)
#numerical_df['max_interval'] = numerical_df['diffs'].apply(lambda lst: max(lst)).fillna(0)
#numerical_df['min_interval'] = numerical_df['diffs'].apply(lambda lst: min(lst)).fillna(0)
#numerical_df['mean_interval'] = numerical_df['diffs'].apply(lambda lst: np.mean(lst)).fillna(0)
#numerical_df.drop(['diffs'], axis=1, inplace=True)

numerical_df.drop(['max', 'min'], axis=1, inplace=True)
numerical_df.head()

session_id,start_month,hour,seconds
4633,201309,7,285.0
124706,201309,8,729.0
239542,201309,8,94.0
173721,201309,8,104.0
12984,201309,8,38.0


In [248]:
categ_df = pd.DataFrame(index=full_df.index)

categ_df['morning'] = ((numerical_df['hour'] >= 7) & (numerical_df['hour'] < 11)).astype(np.int32)
categ_df['day'] = ((numerical_df['hour'] >= 11) & (numerical_df['hour'] < 17)).astype(np.int32)
categ_df['evening'] = ((numerical_df['hour'] >= 17) & (numerical_df['hour'] <= 21)).astype(np.int32)
categ_df['night'] = ((numerical_df['hour'] >= 22) | (numerical_df['hour'] < 7)).astype(np.int32)
categ_df['is_weekend'] = full_df['time1'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)

ohe_weekday_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.dayofweek), prefix='dayofweek')
ohe_hour_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.hour), prefix='hour')
ohe_daymonth_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.day), prefix='day')
ohe_month_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.month), prefix='month')

categ_df = pd.concat([categ_df, 
                      ohe_weekday_df
#                      ohe_hour_df, 
                     #ohe_daymonth_df, 
#                      ohe_month_df
], axis=1)

#for col in ['morning', 'day', 'evening', 'night']:
#    categ_df[col + '_weekend'] = categ_df[col] * categ_df['is_weekend']
    
#for i in range(7, 24):
#    categ_df['weekend_hour_' + str(i)] = categ_df['hour_' + str(i)] * categ_df['is_weekend']
numerical_df.drop(['hour'], axis=1, inplace=True)
categ_df.head()

session_id,morning,day,evening,night,is_weekend,dayofweek_0,dayofweek_1,dayofweek_2,dayofweek_3,dayofweek_4,dayofweek_5,dayofweek_6
4633,1,0,0,0,0,0,0,0,1,0,0,0
124706,1,0,0,0,0,0,0,0,1,0,0,0
239542,1,0,0,0,0,0,0,0,1,0,0,0
173721,1,0,0,0,0,0,0,0,1,0,0,0
12984,1,0,0,0,0,0,0,0,1,0,0,0
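
The time-of-day and weekend flags above can also be computed with the vectorized `.dt` accessor instead of `apply` with a lambda. A small sketch on hypothetical start times, using the same hour boundaries as the cell above:

```python
import pandas as pd

# Hypothetical session start times (Monday, Saturday, Sunday)
start = pd.Series(pd.to_datetime(["2013-09-02 07:30:00",
                                  "2013-09-07 12:00:00",
                                  "2013-09-08 23:15:00"]))

hour = start.dt.hour
features = pd.DataFrame({
    "morning": ((hour >= 7) & (hour < 11)).astype(int),
    "day": ((hour >= 11) & (hour < 17)).astype(int),
    "evening": ((hour >= 17) & (hour <= 21)).astype(int),
    "night": ((hour >= 22) | (hour < 7)).astype(int),
    # dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday
    "is_weekend": (start.dt.dayofweek >= 5).astype(int),
})
print(features)
```

The four time-of-day intervals cover all 24 hours without overlap, so exactly one of those flags is set in every row.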


In [249]:
#from sklearn.preprocessing import PolynomialFeatures

#poly = PolynomialFeatures(2)
#categ_df = poly.fit_transform(categ_df)

Scale these features and combine them with Tf-Idf based on sites (you'll need `scipy.sparse.hstack`)

In [250]:
# add quadratic features for numericals
from sklearn.preprocessing import PolynomialFeatures

#poly = PolynomialFeatures(2)
#numerical_df = poly.fit_transform(numerical_df)
#print(numerical_df.shape)
#print(poly_df.shape)

#scaler = StandardScaler().fit(poly_df[:idx_split])
#numerical_scaled_df = scaler.transform(poly_df)

In [251]:
scaler = StandardScaler().fit(numerical_df[:idx_split])
numerical_df = scaler.transform(numerical_df)

In [252]:
#catscaler = StandardScaler().fit(categ_df[:idx_split])
#categ_df = catscaler.transform(categ_df)

In [253]:
full_matrix = hstack([tfidf_matrix, numerical_df, categ_df, full_sites_sparse]).tocsr()
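
The pattern used here, fitting the scaler on the training rows only and then stacking dense columns next to a sparse block, can be sketched on hypothetical toy data as follows (the array contents and `n_train` are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 rows, of which the first 3 play the role of the training set
sparse_part = csr_matrix(np.array([[1, 0], [0, 2], [3, 0], [0, 0]]))
dense_part = np.array([[10.0], [20.0], [30.0], [40.0]])
n_train = 3

# Mean and variance come from the training rows only, so no test statistics leak in
scaler = StandardScaler().fit(dense_part[:n_train])
dense_scaled = scaler.transform(dense_part)

# hstack accepts a mix of sparse and dense blocks; tocsr() allows row slicing later
combined = hstack([sparse_part, dense_scaled]).tocsr()
print(combined.shape)
```

Because the statistics come from the training rows alone, the scaled test row can fall well outside the training range.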

Perform cross-validation with logistic regression.

In [254]:
full_matrix.shape

(318980, 98385)

In [255]:
%%time
X_train = full_matrix[:idx_split,:]

tss = TimeSeriesSplit(n_splits=10)

lrcv = LogisticRegressionCV(scoring='roc_auc',
                            #penalty='l1', solver='saga',
                            Cs=np.logspace(-3,1,40), 
                            #class_weight='balanced',
                            #Cs=10,
                            random_state=42,
                            cv=tss, n_jobs=-1, verbose = 20)
lrcv.fit(X_train, y)

[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done   2 out of  10 | elapsed:  1.8min remaining:  7.3min
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:  2.2min remaining:  5.2min
[Parallel(n_jobs=-1)]: Done   4 out of  10 | elapsed:  2.6min remaining:  3.9min
[Parallel(n_jobs=-1)]: Done   5 out of  10 | elapsed:  3.0min remaining:  3.0min
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:  3.3min remaining:  2.2min
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:  3.5min remaining:  1.5min
[Parallel(n_jobs=-1)]: Done   8 out of  10 | elapsed:  3.7min remaining:   56.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  5.5min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  5.5min finished


CPU times: user 9.32 s, sys: 2.03 s, total: 11.3 s
Wall time: 5min 35s
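
`TimeSeriesSplit` differs from ordinary K-fold in that every validation fold lies strictly after its training rows, which matches data sorted by session start time. A minimal sketch with 8 hypothetical samples and 3 splits shows the expanding training window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # 8 hypothetical samples, already sorted by time

splits = [(train.tolist(), test.tolist())
          for train, test in TimeSeriesSplit(n_splits=3).split(X)]

for train, test in splits:
    print("train:", train, "test:", test)
```

With 10 splits, as in the cell above, the earliest folds train on a small share of the data, so their scores tend to be noisier than the later ones.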


In [256]:
print(lrcv.Cs_)
print('Best C:', lrcv.C_[0])
print('Max auc_roc:', lrcv.scores_[1].mean(axis=0).max())
print('Prev max auc_roc:', prev_auc_roc_cv)

[1.00000000e-03 1.26638017e-03 1.60371874e-03 2.03091762e-03
 2.57191381e-03 3.25702066e-03 4.12462638e-03 5.22334507e-03
 6.61474064e-03 8.37677640e-03 1.06081836e-02 1.34339933e-02
 1.70125428e-02 2.15443469e-02 2.72833338e-02 3.45510729e-02
 4.37547938e-02 5.54102033e-02 7.01703829e-02 8.88623816e-02
 1.12533558e-01 1.42510267e-01 1.80472177e-01 2.28546386e-01
 2.89426612e-01 3.66524124e-01 4.64158883e-01 5.87801607e-01
 7.44380301e-01 9.42668455e-01 1.19377664e+00 1.51177507e+00
 1.91448198e+00 2.42446202e+00 3.07029063e+00 3.88815518e+00
 4.92388263e+00 6.23550734e+00 7.89652287e+00 1.00000000e+01]
Best C: 1.1937766417144358
Max auc_roc: 0.8960403786685441
Prev max auc_roc: 0.925152735884005


In [257]:
prev_auc_roc_cv = lrcv.scores_[1].mean(axis=0).max()

In [258]:
def get_auc_lr_valid(X, y, C=1.0, seed=42, ratio = 0.9):
    """Fit on the first `ratio` share of the rows and return ROC AUC on the rest."""
    # Index of the first validation row; the rows are assumed to be sorted by time
    idx = int(round(X.shape[0] * ratio))
    # Classifier training
    lr = LogisticRegression(C=C, 
                            #class_weight='balanced', 
                            random_state=seed, n_jobs=-1).fit(X[:idx, :], y[:idx])
    # Predicted probability of the positive class for the validation set
    y_pred = lr.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)
    
    return score

In [259]:
%%time
auc_score = get_auc_lr_valid(X_train, y, C=lrcv.C_[0], ratio = 0.75)
print(auc_score)
print(prev_auc_score)

0.9405739931287873
0.9483379152576903
CPU times: user 54.8 s, sys: 1min 23s, total: 2min 17s
Wall time: 17.5 s
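
ROC AUC, the metric reported above, can be read as the probability that a randomly chosen positive session receives a higher score than a randomly chosen negative one, with ties counted as one half. The helper below, `pairwise_auc`, is a hypothetical name introduced here to illustrate that definition; it is quadratic in the number of samples and not a replacement for `roc_auc_score`:

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Since only the ordering of the scores matters, predicted probabilities can be submitted without choosing a decision threshold.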


In [260]:
prev_auc_score = auc_score

In [261]:
cv_scores = cross_val_score(LogisticRegression(C=lrcv.C_[0], random_state=42, n_jobs=-1), 
                            X_train, y, 
                            scoring='roc_auc', cv=tss, 
                            n_jobs=-1, verbose=10)
print(np.mean(cv_scores))
print(prev_cv_score)

0.8912782799938197
0.9161734457241405


In [262]:
prev_cv_score = np.mean(cv_scores)

Make predictions for the test set and create a submission file.

In [263]:
X_test = full_matrix[idx_split:,:]
test_pred = lrcv.predict_proba(X_test)[:, 1]

In [264]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)
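
To see what the function above writes, the same steps can be run against an in-memory buffer. The predictions below are made up, and the sketch assumes, as the function does, that the rows are in `session_id` order starting from 1:

```python
import io

import numpy as np
import pandas as pd

preds = np.array([0.1, 0.9, 0.5])  # hypothetical predicted probabilities

# Same construction as in write_to_submission_file, with a 1-based index
predicted_df = pd.DataFrame(preds,
                            index=np.arange(1, preds.shape[0] + 1),
                            columns=["target"])

buffer = io.StringIO()
predicted_df.to_csv(buffer, index_label="session_id")
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```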

    

In [265]:
%%notify
write_to_submission_file(test_pred, "assignment6_alice_submission_tss.csv")
