<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [37]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
import pickle
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer

Reading original data

In [38]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'websites_train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'websites_test_sessions.csv'), index_col='session_id')

# Load websites dictionary (maps site name -> integer site ID)
with open(os.path.join(PATH_TO_DATA, 'site_dic.pkl'), 'rb') as input_file:
    site_dict = pickle.load(input_file)

Separate the target variable

In [39]:
y = train_df['target']

In [40]:
# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


| | site |
|---|---|
| 25075 | www.abmecatronique.com |
| 13997 | groups.live.com |
| 42436 | majeureliguefootball.wordpress.com |
| 30911 | cdt46.media.tourinsoft.eu |
| 8104 | www.hdwallpapers.eu |


In [41]:
# Combined dataframe of the training and test sessions (target column dropped)
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# Index to split the training and test data sets
idx_split = train_df.shape[0]

# Column names for sites and times
sitescolumns = ['site%s' % i for i in range(1, 11)]
timescolumns = ['time%s' % i for i in range(1, 11)]

full_df[timescolumns] = full_df[timescolumns].apply(pd.to_datetime)
full_df[sitescolumns] = full_df[sitescolumns].fillna(0).astype(np.int32)

In [42]:
# Do the same as in Assignment 4.
# Take a copy so that a new column can be added later without modifying a view of full_df
full_sites = full_df[sitescolumns].copy()
print(full_sites.shape)
full_sites.head()

(336358, 10)


| session_id | site1 | site2 | site3 | site4 | site5 | site6 | site7 | site8 | site9 | site10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 718 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 890 | 941 | 3847 | 941 | 942 | 3846 | 3847 | 3846 | 1516 | 1518 |
| 3 | 14769 | 39 | 14768 | 14769 | 37 | 39 | 14768 | 14768 | 14768 | 14768 |
| 4 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 |
| 5 | 22 | 177 | 175 | 178 | 177 | 178 | 175 | 177 | 177 | 178 |


Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [43]:
full_sites['sites'] = full_sites.astype(str).apply(lambda x: ' '.join(x), axis=1)

In [44]:
full_sites.head()

| session_id | site1 | site2 | site3 | site4 | site5 | site6 | site7 | site8 | site9 | site10 | sites |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 718 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 718 0 0 0 0 0 0 0 0 0 |
| 2 | 890 | 941 | 3847 | 941 | 942 | 3846 | 3847 | 3846 | 1516 | 1518 | 890 941 3847 941 942 3846 3847 3846 1516 1518 |
| 3 | 14769 | 39 | 14768 | 14769 | 37 | 39 | 14768 | 14768 | 14768 | 14768 | 14769 39 14768 14769 37 39 14768 14768 14768 1... |
| 4 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 782 782 782 782 782 782 782 782 782 |
| 5 | 22 | 177 | 175 | 178 | 177 | 178 | 175 | 177 | 177 | 178 | 22 177 175 178 177 178 175 177 177 178 |


In [45]:
# Your code here
# max_features=None keeps every n-gram instead of capping the vocabulary size
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=None)
tfidf_matrix = tfidf.fit_transform(full_sites['sites'])
#vect_df = vect.fit_transform(full_sites['site1'].astype(str))
#tfidf_matrix = tfidftrans.fit_transform(vect_df)
print(tfidf_matrix.shape)

(336358, 1259706)
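To see what the vectorizer does with sessions written as space-separated site IDs, here is a minimal sketch on two made-up sessions (the `toy_*` names are illustrative and not part of the assignment). It relies on the default `token_pattern` of `TfidfVectorizer`, which keeps only tokens of two or more characters, so the padding value `0` (and any single-digit site ID) does not enter the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical sessions as space-separated site IDs; 0 marks an empty slot
toy_sessions = ["718 0 0 0", "890 941 3847 941"]

toy_tfidf = TfidfVectorizer(ngram_range=(1, 3))
toy_matrix = toy_tfidf.fit_transform(toy_sessions)

# vocabulary_ maps each unigram, bigram and trigram to its column index
toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)
```

If single-digit IDs should be kept, one option is to pass a custom `token_pattern` such as `r"\b\w+\b"`; whether that helps the score here is untested.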


Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [64]:
# Your code here
times_df = pd.DataFrame(index=full_df.index)
times_df['start_month'] = full_df['time1'].apply(lambda ts: 100 * ts.year + ts.month)
times_df['hour'] = full_df['time1'].apply(lambda ts: ts.hour)
times_df['morning'] = ((times_df['hour'] >= 7) & (times_df['hour'] < 10)).astype(np.int32)
times_df['day'] = ((times_df['hour'] >= 10) & (times_df['hour'] < 19)).astype(np.int32)
times_df['evening'] = ((times_df['hour'] >= 19) & (times_df['hour'] < 22)).astype(np.int32)
times_df['night'] = ((times_df['hour'] >= 22) | (times_df['hour'] < 7)).astype(np.int32)
times_df['is_weekend'] = full_df['time1'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)
ohe_weekday_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.dayofweek), prefix='dayofweek')
ohe_hour_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.hour), prefix='hour')
ohe_daymonth_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.day), prefix='day')
ohe_month_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.month), prefix='month')
times_df = pd.concat([times_df, ohe_weekday_df, ohe_hour_df, ohe_daymonth_df, ohe_month_df], axis=1)

# Find each session's start and end time
times_df['min'] = full_df[timescolumns].min(axis=1)
times_df['max'] = full_df[timescolumns].max(axis=1)

# Calculate sessions' duration in seconds
times_df['seconds'] = (times_df['max'] - times_df['min']) / np.timedelta64(1, 's')

# Calculate differences between times
#from __future__ import division
#times_df['diffs'] =full_df[timescolumns].apply(lambda row: [pd.Timedelta(sorted(row)[n] - sorted(row)[n-1]).seconds for n in range(1,10)], axis=1)
#times_df['max_interval'] = times_df['diffs'].apply(lambda lst: max(lst)).fillna(0)
#times_df['min_interval'] = times_df['diffs'].apply(lambda lst: min(lst)).fillna(0)
#times_df.drop(['diffs'], axis=1, inplace=True)
times_df.drop(['max', 'min'], axis=1, inplace=True)
times_df.head()

| session_id | start_month | hour | morning | day | evening | night | is_weekend | dayofweek_0 | dayofweek_1 | dayofweek_2 | ... | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 | seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 201402 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 2 | 201402 | 11 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.0 |
| 3 | 201312 | 16 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7.0 |
| 4 | 201403 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 270.0 |
| 5 | 201402 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 246.0 |
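The start-time features above are built with `apply(lambda ...)` on each timestamp. The same kind of columns can also be derived with the vectorized `.dt` accessor of pandas, which is usually faster on a few hundred thousand rows. The sketch below uses three made-up timestamps and illustrative `toy_*` names; it is an alternative way to write a subset of the features, not the code used for the results in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical session start times (illustrative values only)
toy_times = pd.DataFrame(
    {"time1": pd.to_datetime(["2014-02-20 10:02:45",
                              "2014-02-22 11:19:50",
                              "2013-12-16 16:40:17"])},
    index=pd.Index([1, 2, 3], name="session_id"),
)

start = toy_times["time1"]
toy_features = pd.DataFrame(index=toy_times.index)
# Year and month packed into one integer, e.g. 201402
toy_features["start_month"] = 100 * start.dt.year + start.dt.month
toy_features["hour"] = start.dt.hour
# between() is inclusive on both ends, so this is 7 <= hour < 10 for integer hours
toy_features["morning"] = start.dt.hour.between(7, 9).astype(np.int32)
# dayofweek: Monday is 0, so Saturday and Sunday are 5 and 6
toy_features["is_weekend"] = (start.dt.dayofweek >= 5).astype(np.int32)
print(toy_features)
```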


Scale these features and combine them with Tf-Idf based on sites (you'll need `scipy.sparse.hstack`)

In [65]:
for col in ['morning', 'day', 'evening', 'night']:
    times_df[col + '_weekend'] = times_df[col] * times_df['is_weekend']

In [48]:
#for i in range(7, 24):
#    times_df['weekend_hour_' + str(i)] = times_df['hour_' + str(i)] * times_df['is_weekend']

In [66]:
times_df.shape

(336358, 73)

In [50]:
# add quadratic features for times
#from sklearn.preprocessing import PolynomialFeatures
#poly = PolynomialFeatures(2)
#poly_df = poly.fit_transform(times_df)
#poly_df.shape

In [67]:
# Your code here
scaled_times_df = StandardScaler().fit_transform(times_df)

In [68]:
full_matrix = hstack([tfidf_matrix, scaled_times_df]).tocsr()
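As a small illustration of the scale-then-stack step, the sketch below combines a toy sparse block with a standardized dense block. All `toy_*` arrays are made up. `scipy.sparse.hstack` accepts a mix of sparse and dense inputs and returns a sparse result, which is converted to CSR so that it can be sliced by rows afterwards.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Tf-Idf matrix (3 sessions, 3 n-gram columns)
toy_sparse = csr_matrix(np.array([[0.0, 1.0, 0.0],
                                  [0.5, 0.0, 0.0],
                                  [0.0, 0.0, 2.0]]))
# Toy stand-in for the time features (hour, is_weekend)
toy_dense = np.array([[10.0, 0.0],
                      [11.0, 1.0],
                      [16.0, 0.0]])

# Standardize each dense column to zero mean and unit variance
toy_scaled = StandardScaler().fit_transform(toy_dense)

# Stack column-wise; tocsr() makes row slicing such as toy_full[:2, :] efficient
toy_full = hstack([toy_sparse, toy_scaled]).tocsr()
print(toy_full.shape)
```

Note that standardization turns the zeros of binary columns into non-zero values, so the stacked time block is effectively dense; with a few dozen columns that is cheap, but it would matter for wide one-hot blocks.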

Perform cross-validation with logistic regression.

In [69]:
full_matrix.shape

(336358, 1259779)

In [70]:
%%time
X_train = full_matrix[:idx_split,:]

skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

lrcv = LogisticRegressionCV(scoring='roc_auc',
                            Cs=np.logspace(0, 2, 20),
                            cv=skf, n_jobs=-1, verbose=10)
lrcv.fit(X_train, y)

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed: 23.2min remaining: 34.7min
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed: 23.2min remaining: 15.5min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 23.3min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 23.3min finished


CPU times: user 1min 17s, sys: 59.6 s, total: 2min 17s
Wall time: 24min 8s


In [71]:
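print(lrcv.Cs_)
print('Best C:', lrcv.C_[0])
# scores_[1] holds the per-fold ROC AUC for class 1, one column per value of C;
# average over folds, then take the best value of C
print('Max auc_roc:', lrcv.scores_[1].mean(axis=0).max())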

[  1.           1.27427499   1.62377674   2.06913808   2.6366509
   3.35981829   4.2813324    5.45559478   6.95192796   8.8586679
  11.28837892  14.38449888  18.32980711  23.35721469  29.76351442
  37.92690191  48.32930239  61.58482111  78.47599704 100.        ]
Best C: 18.329807108324356
Max auc_roc: 0.991018995185238
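To make the layout of `scores_` concrete, here is a minimal sketch on synthetic data (generated with `make_classification`; all `toy_*` names are illustrative). It assumes the binary-classification layout that the cell above relies on, where `scores_` is a dictionary keyed by the positive class and each value has one row per fold and one column per candidate `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# Synthetic binary problem, unrelated to the competition data
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)

toy_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
toy_lrcv = LogisticRegressionCV(Cs=np.logspace(0, 2, 5), cv=toy_cv,
                                scoring="roc_auc", max_iter=1000)
toy_lrcv.fit(X_toy, y_toy)

fold_scores = toy_lrcv.scores_[1]       # rows: folds, columns: values of C
mean_scores = fold_scores.mean(axis=0)  # fold-averaged ROC AUC for each C
print(fold_scores.shape)
print("Best C:", toy_lrcv.C_[0])
print("Best mean ROC AUC:", mean_scores.max())
```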


In [57]:
def get_auc_lr_valid(X, y, C=1.0, seed=42, ratio=0.9):
    """Fit on the first `ratio` share of rows and return ROC AUC on the rest."""
    # Row position that separates the training part from the validation part
    idx = int(round(X.shape[0] * ratio))
    # Convert to an array so that y is sliced by position, in step with the rows of X
    y = np.asarray(y)
    # Classifier training
    lr = LogisticRegression(C=C, random_state=seed, n_jobs=-1).fit(X[:idx, :], y[:idx])
    # Prediction for validation set
    y_pred = lr.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)

    return score

In [72]:
%%time
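auc_score = get_auc_lr_valid(X_train, y, C=lrcv.C_[0], ratio=0.9)
print(auc_score)
# prev_auc_score holds the result of an earlier run (assigned in cell In [59]);
# on a fresh top-to-bottom run it is not defined yet
print(prev_auc_score)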

0.9919422289258788
0.9909802556183095
CPU times: user 8min 16s, sys: 8min 25s, total: 16min 42s
Wall time: 4min 55s


In [59]:
prev_auc_score = auc_score

In [60]:
%%time
#lrfinal = LogisticRegression(C=lrcv.C_[0], random_state=42, n_jobs=-1).fit(X_train, y)

CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 7.39 µs


Make prediction for the test set and form a submission file.

In [61]:
X_test = full_matrix[idx_split:,:]
#test_pred = lrfinal.predict_proba(X_test)[:, 1]
test_pred = lrcv.predict_proba(X_test)[:, 1]

In [62]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

    

In [63]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
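Before uploading, it can help to check that the file has the expected layout. The sketch below repeats the helper so that it runs on its own, writes three made-up probabilities to a temporary directory, and reads them back. The file name `toy_submission.csv` and the `toy_*` variables are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    # Session IDs in the submission are numbered from 1
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


# Hypothetical predicted probabilities for three test sessions
toy_pred = np.array([0.1, 0.7, 0.02])

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_submission.csv")
    write_to_submission_file(toy_pred, toy_path)
    # Read the file back to inspect its header and index
    toy_submission = pd.read_csv(toy_path, index_col="session_id")

print(toy_submission)
```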