<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-comercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
import pickle
import lightgbm as lgb
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.model_selection import GridSearchCV

Reading original data

In [2]:
PATH_TO_DATA = ('../../data')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'websites_train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'websites_test_sessions.csv'), index_col='session_id')

# Load websites dictionary
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

Separate target feature 

In [3]:
y_train = train_df['target']

In [4]:
# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


Unnamed: 0,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [5]:
# United dataframe of the initial data 
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# Index to split the training and test data sets
idx_split = train_df.shape[0]

# Column names for sites and times
sitescolumns = ['site%s' % i for i in range(1, 11)]
timescolumns = ['time%s' % i for i in range(1, 11)]

full_df[timescolumns] = full_df[timescolumns].apply(pd.to_datetime)
full_df[sitescolumns] = full_df[sitescolumns].fillna(-1).astype(np.int32)

In [6]:
# do the same as in Assignment 4
full_sites = full_df[sitescolumns]
print(full_sites.shape)
full_sites.head()

(336358, 10)


Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,718,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,890,941,3847,941,942,3846,3847,3846,1516,1518
3,14769,39,14768,14769,37,39,14768,14768,14768,14768
4,782,782,782,782,782,782,782,782,782,782
5,22,177,175,178,177,178,175,177,177,178


In [7]:
# sequence of indices
sites_flatten = full_sites.values.flatten()

# and the matrix we are looking for
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0]  + 10, 10)))[:, 1:]
print(full_sites_sparse.shape)

(336358, 48371)


Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [8]:
full_sites['sites'] = full_sites.astype(str).apply(lambda x: ' '.join(x), axis=1)

In [9]:
full_sites.head()

Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,sites
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,718,-1,-1,-1,-1,-1,-1,-1,-1,-1,718 -1 -1 -1 -1 -1 -1 -1 -1 -1
2,890,941,3847,941,942,3846,3847,3846,1516,1518,890 941 3847 941 942 3846 3847 3846 1516 1518
3,14769,39,14768,14769,37,39,14768,14768,14768,14768,14769 39 14768 14769 37 39 14768 14768 14768 1...
4,782,782,782,782,782,782,782,782,782,782,782 782 782 782 782 782 782 782 782 782
5,22,177,175,178,177,178,175,177,177,178,22 177 175 178 177 178 175 177 177 178


In [10]:
# You code here
vect = CountVectorizer(ngram_range=(1, 4), max_features=250000)
tfidf = TfidfVectorizer(ngram_range=(1, 4), max_features=250000)
tfidftrans = TfidfTransformer()
tfidf_matrix = tfidf.fit_transform(full_sites['sites'])
#vect_df = vect.fit_transform(full_sites['site1'].astype(str))
#tfidf_matrix = tfidftrans.fit_transform(vect_df)
print(tfidf_matrix.shape)

(336358, 250000)


Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [11]:
# You code here
times_df = pd.DataFrame(index=full_df.index)
times_df['start_month'] = full_df['time1'].apply(lambda ts: 100 * ts.year + ts.month)
times_df['hour'] = full_df['time1'].apply(lambda ts: ts.hour)
times_df['morning'] = ((times_df['hour'] > 7) & (times_df['hour'] <= 10)).astype(np.int32)
times_df['day'] = ((times_df['hour'] > 10) & (times_df['hour'] <= 19)).astype(np.int32)
times_df['evening'] = ((times_df['hour'] > 19) & (times_df['hour'] <= 22)).astype(np.int32)
times_df['night'] = ((times_df['hour'] > 22) | (times_df['hour'] <= 7)).astype(np.int32)
times_df['is_weekend'] = full_df['time1'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)
ohe_df = pd.get_dummies(full_df['time1'].apply(lambda ts: ts.dayofweek), prefix='dayofweek')
times_df = pd.concat([times_df, ohe_df], axis=1)

# Find sessions' starting and ending
times_df['min'] = full_df[timescolumns].min(axis=1)
times_df['max'] = full_df[timescolumns].max(axis=1)

# Calculate sessions' duration in seconds
times_df['seconds'] = (times_df['max'] - times_df['min']) / np.timedelta64(1, 's')


times_df.drop(['max', 'min'], axis=1, inplace=True)
times_df.head()

Unnamed: 0_level_0,start_month,hour,morning,day,evening,night,is_weekend,dayofweek_0,dayofweek_1,dayofweek_2,dayofweek_3,dayofweek_4,dayofweek_5,dayofweek_6,seconds
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1,201402,10,1,0,0,0,0,0,0,0,1,0,0,0,0.0
2,201402,11,0,1,0,0,1,0,0,0,0,0,1,0,26.0
3,201312,16,0,1,0,0,0,1,0,0,0,0,0,0,7.0
4,201403,10,1,0,0,0,0,0,0,0,0,1,0,0,270.0
5,201402,10,1,0,0,0,0,0,0,0,0,1,0,0,246.0


Scale these features and combine them with Tf-Idf based on sites (you'll need `scipy.sparse.hstack`)

In [12]:
for col in ['morning', 'day', 'evening', 'night']:
    times_df[col + '_weekend'] = times_df[col] * times_df['is_weekend']

In [13]:
times_df.shape

(336358, 19)

In [14]:
# You code here
scaled_times_df = StandardScaler().fit_transform(times_df)

In [15]:
full_matrix = hstack([tfidf_matrix, scaled_times_df]).tocsr()

Perform cross-validation with logistic regression.

In [16]:
full_matrix.shape

(336358, 250019)

In [17]:
X_train = full_matrix[:idx_split,:]

In [None]:
param_grid = {
    'num_leaves': [127, 64],
    'feature_fraction': [0.5, 0.75],
    'bagging_fraction': [0.75], 
    'reg_alpha': [0.0],
    'reg_lambda': [0.0, 0.1],
    'learning_rate': [0.01]}

gbm = lgb.LGBMClassifier(objective='binary', 
                         boosting_type='rf',
                          n_jobs=-1, 
                          is_unbalance=True, 
                          two_round=True,
                          bagging_freq=1,
                          min_child_samples=10,
                          min_child_weight=5,
                          min_data_in_leaf=20,
                          min_split_gain=0.0,
                          n_estimators=10,
                          subsample=1.0,
                          silent=False)

gsearch = GridSearchCV(estimator=gbm, 
                       param_grid=param_grid, 
                       cv=5,
                       scoring='roc_auc',
                       n_jobs=-1,
                       verbose=5) 

lgb_model = gsearch.fit(X_train, y_train)
print(lgb_model.best_params_, lgb_model.best_score_)

In [23]:
# You code here


#lrcv = LogisticRegressionCV(scoring='roc_auc',
                            #class_weight='balanced',
                            #Cs=10, 
                            #cv=5, n_jobs=-1, verbose = 1, max_iter=1000)
#lrcv.fit(X_train, y)

In [26]:
#print(lrcv.Cs_)
#print('Best C:', lrcv.C_[0])
#print ('Max auc_roc:', lrcv.scores_[1].mean(axis=0).max()) 

In [27]:
def get_auc_lgbm_valid(X, y, params, seed=17, ratio = 0.9):
    # Split the data into the training and validation sets
    idx = int(round(X.shape[0] * ratio))
    # Classifier training
    gbm = lgb.LGBMClassifier(objective='binary', 
                          n_jobs=-1, 
                          is_unbalance=True,  
                          two_round=True,
                          bagging_fraction=params['bagging_fraction'],
                          bagging_freq=1,
                          boosting_type='rf',
                          feature_fraction=params['feature_fraction'],
                          learning_rate=params['learning_rate'],
                          min_child_samples=10,
                          min_child_weight=5,
                          min_data_in_leaf=20,
                          min_split_gain=0.0,
                          n_estimators=10,
                          num_leaves=params['num_leaves'],
                          reg_alpha=params['reg_alpha'],
                          reg_lambda=0.0,
                          subsample=1.0)
    gbm.fit(X[:idx, :], y[:idx])
    # Prediction for validation set
    y_pred = gbm.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)
    
    return score

In [35]:
auc_score = get_auc_lgbm_valid(X_train, y_train, lgb_model.best_params_)
print(auc_score)
#print(prev_auc_score)

0.987578499591


In [29]:
prev_auc_score = auc_score

In [30]:
#lrfinal = LogisticRegression(C=lrcv.C_[0], random_state=42, n_jobs=-1).fit(X_train, y)

Make prediction for the test set and form a submission file.

In [31]:
X_test = full_matrix[idx_split:,:]
test_pred = lgb_model.predict_proba(X_test)[:, 1]

In [32]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

    

In [33]:
write_to_submission_file(test_pred, "assignment6_alice_submission_lgbm.csv")