<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [65]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit
from collections import Counter 

In [66]:
# helper that writes predictions to a submission file
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [67]:
def get_auc_lr_valid(X, y, C=1.0, ratio = 0.9, seed=17):
    '''
    X, y: the dataset (rows are assumed to be sorted by time)
    ratio: fraction of rows used for training; the rest is held out for validation
    C, seed: regularization coefficient and random_state
             of the logistic regression
    '''
    train_len = int(ratio * X.shape[0])
    X_train = X[:train_len, :]
    X_valid = X[train_len:, :]
    y_train = y[:train_len]
    y_valid = y[train_len:]
    
    logit = LogisticRegression(C=C, n_jobs=-1, random_state=seed)
    
    logit.fit(X_train, y_train)
    
    valid_pred = logit.predict_proba(X_valid)[:,1]
    
    return roc_auc_score(y_valid, valid_pred)

Reading original data

In [68]:
# load the training and test sets
train_df = pd.read_csv('../../data/train_sessions.csv',
                       index_col='session_id')
test_df = pd.read_csv('../../data/test_sessions.csv',
                      index_col='session_id')

# convert the columns time1, ..., time10 to datetime
times = ['time%s' % i for i in range(1, 11)]
train_df[times] = train_df[times].apply(pd.to_datetime).ffill(axis=1)
test_df[times] = test_df[times].apply(pd.to_datetime).ffill(axis=1)

# sort the data by time
train_df = train_df.sort_values(by='time1')

# look at the first rows of the training set
train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21669,56,2013-01-12 08:05:57,55.0,2013-01-12 08:05:57,,2013-01-12 08:05:57,,2013-01-12 08:05:57,,2013-01-12 08:05:57,...,2013-01-12 08:05:57,,2013-01-12 08:05:57,,2013-01-12 08:05:57,,2013-01-12 08:05:57,,2013-01-12 08:05:57,0
54843,56,2013-01-12 08:37:23,55.0,2013-01-12 08:37:23,56.0,2013-01-12 09:07:07,55.0,2013-01-12 09:07:09,,2013-01-12 09:07:09,...,2013-01-12 09:07:09,,2013-01-12 09:07:09,,2013-01-12 09:07:09,,2013-01-12 09:07:09,,2013-01-12 09:07:09,0
77292,946,2013-01-12 08:50:13,946.0,2013-01-12 08:50:14,951.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:16,...,2013-01-12 08:50:16,948.0,2013-01-12 08:50:16,784.0,2013-01-12 08:50:16,949.0,2013-01-12 08:50:17,946.0,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948.0,2013-01-12 08:50:17,949.0,2013-01-12 08:50:18,948.0,2013-01-12 08:50:18,945.0,2013-01-12 08:50:18,...,2013-01-12 08:50:18,947.0,2013-01-12 08:50:19,945.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950.0,2013-01-12 08:50:20,948.0,2013-01-12 08:50:20,947.0,2013-01-12 08:50:21,950.0,2013-01-12 08:50:21,...,2013-01-12 08:50:21,946.0,2013-01-12 08:50:21,951.0,2013-01-12 08:50:22,946.0,2013-01-12 08:50:22,947.0,2013-01-12 08:50:22,0


In [69]:
# convert the columns site1, ..., site10 to integers and replace missing values with zeros
sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')

In [70]:
train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21669,56,2013-01-12 08:05:57,55,2013-01-12 08:05:57,0,2013-01-12 08:05:57,0,2013-01-12 08:05:57,0,2013-01-12 08:05:57,...,2013-01-12 08:05:57,0,2013-01-12 08:05:57,0,2013-01-12 08:05:57,0,2013-01-12 08:05:57,0,2013-01-12 08:05:57,0
54843,56,2013-01-12 08:37:23,55,2013-01-12 08:37:23,56,2013-01-12 09:07:07,55,2013-01-12 09:07:09,0,2013-01-12 09:07:09,...,2013-01-12 09:07:09,0,2013-01-12 09:07:09,0,2013-01-12 09:07:09,0,2013-01-12 09:07:09,0,2013-01-12 09:07:09,0
77292,946,2013-01-12 08:50:13,946,2013-01-12 08:50:14,951,2013-01-12 08:50:15,946,2013-01-12 08:50:15,946,2013-01-12 08:50:16,...,2013-01-12 08:50:16,948,2013-01-12 08:50:16,784,2013-01-12 08:50:16,949,2013-01-12 08:50:17,946,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948,2013-01-12 08:50:17,949,2013-01-12 08:50:18,948,2013-01-12 08:50:18,945,2013-01-12 08:50:18,...,2013-01-12 08:50:18,947,2013-01-12 08:50:19,945,2013-01-12 08:50:19,946,2013-01-12 08:50:19,946,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950,2013-01-12 08:50:20,948,2013-01-12 08:50:20,947,2013-01-12 08:50:21,950,2013-01-12 08:50:21,...,2013-01-12 08:50:21,946,2013-01-12 08:50:21,951,2013-01-12 08:50:22,946,2013-01-12 08:50:22,947,2013-01-12 08:50:22,0


Separate the target feature.

In [71]:
# our target variable
y_train = train_df['target']

alice_df = train_df[train_df['target']==1].drop('target', axis=1)
alice_df = alice_df[['site1', 'site2', 'site3','site4','site5','site6','site7','site8','site9','site10']]

# combined table of the original data (train followed by test)
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# index at which the training set is separated from the test set
idx_split = train_df.shape[0]

In [72]:
alice_df.head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
251175,270,270,270,21,21,7832,21,7832,30,7832
196388,29,7832,37,7832,7832,29,7832,29,7832,7832
172448,29,7832,7832,29,37,7832,29,7832,29,270
70129,167,167,1515,167,37,1514,855,1515,855,1514
206254,1520,1522,1522,1515,1515,1524,1514,1515,1520,1521


In [73]:
# table with the indices of the sites visited in each session
full_sites = full_df[sites]
full_sites.head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
21669,56,55,0,0,0,0,0,0,0,0
54843,56,55,56,55,0,0,0,0,0,0
77292,946,946,951,946,946,945,948,784,949,946
114021,945,948,949,948,945,946,947,945,946,946
146670,947,950,948,947,950,952,946,951,946,947


In [74]:
full_sites.shape

(336358, 10)

Build TF-IDF features from the sequences of visited sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [75]:
%%time
str_df = full_sites[sites]\
               .apply(lambda x: " ".join([str(a) for a in x.values if not a==0]), axis=1)

CPU times: user 10.6 s, sys: 55.1 ms, total: 10.6 s
Wall time: 10.6 s


In [76]:
str_df.head()

session_id
21669                                       56 55
54843                                 56 55 56 55
77292     946 946 951 946 946 945 948 784 949 946
114021    945 948 949 948 945 946 947 945 946 946
146670    947 950 948 947 950 952 946 951 946 947
dtype: object

In [132]:
%%time
full_sites_sparse = TfidfVectorizer(ngram_range=(1,2), max_df=0.5,
                                    max_features=300000, token_pattern='(?u)\\b\\w+\\b').fit_transform(str_df)

CPU times: user 9.41 s, sys: 104 ms, total: 9.52 s
Wall time: 9.51 s


In [133]:
X_train_sparse = full_sites_sparse[:idx_split]
X_test_sparse = full_sites_sparse[idx_split:]

In [134]:
full_sites_sparse.shape, X_train_sparse.shape, X_test_sparse.shape

((336358, 300000), (253561, 300000), (82797, 300000))
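
The TF-IDF step above can be illustrated on a toy input. The sketch below is only an illustration: the three session strings are made up, and `max_df` / `max_features` are left out because they would filter almost everything on such a tiny corpus. It shows why a custom `token_pattern` is passed: the default pattern of `TfidfVectorizer` only keeps tokens of two or more characters, so single-digit site IDs would be dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sessions (hypothetical data): space-separated site IDs, as in str_df.
toy_sessions = ["56 55", "56 55 56 55", "946 946 951"]

# The pattern below also accepts one-character tokens such as "7".
vectorizer = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
X_toy = vectorizer.fit_transform(toy_sessions)

# Vocabulary holds both single sites and ordered pairs of consecutive sites.
print(sorted(vectorizer.vocabulary_))
print(X_toy.shape)
```

Each row is one session, and each column is either a single site or a pair of consecutively visited sites.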

In [135]:
%%time
logit = LogisticRegression(n_jobs=-1, random_state=17)
logit.fit(X_train_sparse, y_train)

CPU times: user 13.4 s, sys: 220 ms, total: 13.6 s
Wall time: 3.5 s


In [136]:
get_auc_lr_valid(X_train_sparse, y_train, ratio=0.7)

0.88633937809437469

0.92729257732863013

In [137]:
tscv = TimeSeriesSplit(n_splits=7)
scores = cross_val_score(logit, X_train_sparse, y_train, cv=tscv, scoring='roc_auc')
np.mean(scores)

0.84059928044284671
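
To make the time-aware cross-validation above more concrete, here is a small sketch of how `TimeSeriesSplit` partitions rows. The eight-row array is toy data, not the competition data. Every fold trains on an initial block of rows and validates on the block that immediately follows it, which is why the training set has to be sorted by time first.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered samples (toy data).
X_toy = np.arange(8).reshape(-1, 1)

folds = [(train.tolist(), test.tolist())
         for train, test in TimeSeriesSplit(n_splits=3).split(X_toy)]

for train, test in folds:
    # Training indices always precede validation indices.
    print("train:", train, "test:", test)
```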

Add features based on the session start time: hour, whether it's morning, day or night and so on.
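
As a compact illustration of this kind of start-time feature, the sketch below uses the pandas `.dt` accessor on two made-up timestamps. The column names mirror the ones used in this notebook, but the data and the `toy_` variables are assumptions made for the example.

```python
import pandas as pd

# Two hypothetical session start times.
toy_start = pd.Series(pd.to_datetime(["2013-01-12 08:05:57",
                                      "2014-03-28 16:40:00"]))

toy_feats = pd.DataFrame({
    "year_month": toy_start.dt.year * 100 + toy_start.dt.month,
    "start_hour": toy_start.dt.hour,
    "weekday": toy_start.dt.weekday,  # Monday is 0, Sunday is 6
})
# Binary flag: the session started at 11:59 or earlier.
toy_feats["morning"] = (toy_feats["start_hour"] <= 11).astype(int)

print(toy_feats)
```

The vectorized accessors give the same values as `apply` with a lambda and are usually faster on large frames.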

In [138]:
new_feat_train = pd.DataFrame(index=train_df.index)
new_feat_test = pd.DataFrame(index=test_df.index) 

new_feat_train['year_month'] = train_df['time1'].apply(lambda ts: ts.year * 100 + ts.month)
new_feat_test['year_month'] = test_df['time1'].apply(lambda ts: ts.year * 100 + ts.month)

In [139]:
scaler = StandardScaler()
scaler.fit(new_feat_train['year_month'].values.reshape(-1,1))

new_feat_train['year_month_scaled'] = scaler.transform(new_feat_train['year_month'].values.reshape(-1,1))
new_feat_test['year_month_scaled'] = scaler.transform(new_feat_test['year_month'].values.reshape(-1,1))

In [140]:
new_feat_train.head()

session_id,year_month,year_month_scaled
21669,201301,-1.744405
54843,201301,-1.744405
77292,201301,-1.744405
114021,201301,-1.744405
146670,201301,-1.744405


In [141]:
# Create the start_hour feature:
new_feat_train['start_hour'] = train_df['time1'].apply(lambda ts: ts.hour)
new_feat_test['start_hour'] = test_df['time1'].apply(lambda ts: ts.hour)

In [142]:
scaler = StandardScaler()
scaler.fit(new_feat_train['start_hour'].values.reshape(-1,1))

new_feat_train['start_hour_scaled'] = scaler.transform(new_feat_train['start_hour'].values.reshape(-1,1))
new_feat_test['start_hour_scaled'] = scaler.transform(new_feat_test['start_hour'].values.reshape(-1,1))

In [143]:
# Create the time-of-day features. Morning:
new_feat_train['morning'] = new_feat_train['start_hour'].apply(lambda x: 1 if x<=11 else 0)
new_feat_test['morning'] = new_feat_test['start_hour'].apply(lambda x: 1 if x<=11 else 0)

# Day:
new_feat_train['day'] = new_feat_train['start_hour'].apply(lambda x: 1 if x>11 and x<=18 else 0)
new_feat_test['day'] = new_feat_test['start_hour'].apply(lambda x: 1 if x>11 and x<=18 else 0)

# Noon (the feature name used here; the window covers start hours 15 to 19):
new_feat_train['noon'] = new_feat_train['start_hour'].apply(lambda x: 1 if x>14 and x<=19 else 0)
new_feat_test['noon'] = new_feat_test['start_hour'].apply(lambda x: 1 if x>14 and x<=19 else 0)

In [144]:
# Create the weekday feature:
new_feat_train['weekday'] = train_df['time1'].apply(lambda ts: ts.weekday())
new_feat_test['weekday'] = test_df['time1'].apply(lambda ts: ts.weekday())

In [145]:
# Create the duration feature (session length in seconds):
new_feat_train['duration'] = (train_df['time10'] - train_df['time1']).dt.total_seconds()
new_feat_test['duration'] = (test_df['time10'] - test_df['time1']).dt.total_seconds()

In [146]:
new_feat_train.head()

session_id,year_month,year_month_scaled,start_hour,start_hour_scaled,morning,day,noon,weekday,duration
21669,201301,-1.744405,8,-1.357366,1,0,0,5,0.0
54843,201301,-1.744405,8,-1.357366,1,0,0,5,1786.0
77292,201301,-1.744405,8,-1.357366,1,0,0,5,4.0
114021,201301,-1.744405,8,-1.357366,1,0,0,5,3.0
146670,201301,-1.744405,8,-1.357366,1,0,0,5,2.0


In [147]:
from scipy import stats
duration_1 = new_feat_train[y_train==1]['duration']
duration_0 = new_feat_train[y_train==0]['duration']
stats.ttest_ind(duration_0, duration_1)

Ttest_indResult(statistic=14.03620484562515, pvalue=9.727912658574387e-45)

In [148]:
np.median(duration_1), np.median(duration_0)

(11.0, 28.0)

In [149]:
pd.concat([new_feat_train['duration'], new_feat_test['duration']]).quantile(q=0.9)

354.0

In [150]:
new_feat_train['duration_10'] = new_feat_train['duration'].apply(lambda x: 1 if x<=2.0 else 0)
new_feat_test['duration_10'] = new_feat_test['duration'].apply(lambda x: 1 if x<=2.0 else 0)

new_feat_train['duration_10-20'] = new_feat_train['duration'].apply(lambda x: 1 if x>2 and x<=5 else 0)
new_feat_test['duration_10-20'] = new_feat_test['duration'].apply(lambda x: 1 if x>2 and x<=5 else 0)

new_feat_train['duration_20-30'] = new_feat_train['duration'].apply(lambda x: 1 if x>5 and x<=9 else 0)
new_feat_test['duration_20-30'] = new_feat_test['duration'].apply(lambda x: 1 if x>5 and x<=9 else 0)

new_feat_train['duration_30-40'] = new_feat_train['duration'].apply(lambda x: 1 if x>9 and x<=17 else 0)
new_feat_test['duration_30-40'] = new_feat_test['duration'].apply(lambda x: 1 if x>9 and x<=17 else 0)

new_feat_train['duration_40-50'] = new_feat_train['duration'].apply(lambda x: 1 if x>17 and x<=28 else 0)
new_feat_test['duration_40-50'] = new_feat_test['duration'].apply(lambda x: 1 if x>17 and x<=28 else 0)

new_feat_train['duration_50-60'] = new_feat_train['duration'].apply(lambda x: 1 if x>28 and x<=48 else 0)
new_feat_test['duration_50-60'] = new_feat_test['duration'].apply(lambda x: 1 if x>28 and x<=48 else 0)

new_feat_train['duration_60-70'] = new_feat_train['duration'].apply(lambda x: 1 if x>48 and x<=82 else 0)
new_feat_test['duration_60-70'] = new_feat_test['duration'].apply(lambda x: 1 if x>48 and x<=82 else 0)

new_feat_train['duration_70-80'] = new_feat_train['duration'].apply(lambda x: 1 if x>82 and x<=152 else 0)
new_feat_test['duration_70-80'] = new_feat_test['duration'].apply(lambda x: 1 if x>82 and x<=152 else 0)

new_feat_train['duration_80-90'] = new_feat_train['duration'].apply(lambda x: 1 if x>152 and x<=354 else 0)
new_feat_test['duration_80-90'] = new_feat_test['duration'].apply(lambda x: 1 if x>152 and x<=354 else 0)
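
The bin indicators above can also be produced in one step. The following is a sketch under stated assumptions rather than the notebook's own method: it reuses the same thresholds (2, 5, 9, ..., 354 seconds) with `pd.cut`, whose default right-closed intervals match the `x > a and x <= b` conditions, and it adds an explicit last bin for durations above 354 seconds, which the cell above leaves as all zeros. The names `edges`, `toy_duration`, and `duration_bin` are hypothetical.

```python
import numpy as np
import pandas as pd

# Same thresholds as above, padded with -inf / +inf so every value gets a bin.
edges = [-np.inf, 2, 5, 9, 17, 28, 48, 82, 152, 354, np.inf]

toy_duration = pd.Series([0.0, 2.0, 3.0, 1786.0])  # toy durations in seconds

# labels=False returns the integer index of the (left, right] interval.
bin_idx = pd.cut(toy_duration, bins=edges, labels=False)

# One indicator column per bin that actually occurs in the data.
dummies = pd.get_dummies(bin_idx, prefix="duration_bin")
print(bin_idx.tolist())
print(dummies.columns.tolist())
```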

In [151]:
new_feat_train = pd.DataFrame(pd.concat([new_feat_train,
                           pd.get_dummies(new_feat_train['weekday'], prefix='d', drop_first=False)], axis=1))
new_feat_test = pd.concat([new_feat_test,
                           pd.get_dummies(new_feat_test['weekday'], prefix='d', drop_first=False)], axis=1)

In [152]:
new_feat_train.head()

session_id,year_month,year_month_scaled,start_hour,start_hour_scaled,morning,day,noon,weekday,duration,duration_10,...,duration_60-70,duration_70-80,duration_80-90,d_0,d_1,d_2,d_3,d_4,d_5,d_6
21669,201301,-1.744405,8,-1.357366,1,0,0,5,0.0,1,...,0,0,0,0,0,0,0,0,1,0
54843,201301,-1.744405,8,-1.357366,1,0,0,5,1786.0,0,...,0,0,0,0,0,0,0,0,1,0
77292,201301,-1.744405,8,-1.357366,1,0,0,5,4.0,0,...,0,0,0,0,0,0,0,0,1,0
114021,201301,-1.744405,8,-1.357366,1,0,0,5,3.0,0,...,0,0,0,0,0,0,0,0,1,0
146670,201301,-1.744405,8,-1.357366,1,0,0,5,2.0,1,...,0,0,0,0,0,0,0,0,1,0


In [153]:
new_feat_train['year_month'][y_train==1].value_counts()

201311    446
201402    410
201403    400
201309    377
201404    302
201312    134
201401    129
201302     61
201304     38
Name: year_month, dtype: int64

In [154]:
new_feat_train['month'] = train_df['time1'].apply(lambda ts: ts.month)
new_feat_test['month'] = test_df['time1'].apply(lambda ts: ts.month)

In [155]:
new_feat_train = pd.DataFrame(pd.concat([new_feat_train,
                           pd.get_dummies(new_feat_train['month'], prefix='m', drop_first=False)], axis=1))
new_feat_test = pd.DataFrame(pd.concat([new_feat_test,
                           pd.get_dummies(new_feat_test['month'], prefix='m', drop_first=False)], axis=1))
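
When `pd.get_dummies` is applied to train and test separately, as above, the two frames can end up with different columns if a category occurs in only one of them. A minimal sketch of one way to align them, on toy data with hypothetical names, is to reindex the test dummies to the training columns:

```python
import pandas as pd

toy_train = pd.DataFrame({"month": [1, 4, 11]})
toy_test = pd.DataFrame({"month": [5, 11]})  # month 5 never occurs in train

train_d = pd.get_dummies(toy_train["month"], prefix="m")
test_d = pd.get_dummies(toy_test["month"], prefix="m")

# Keep exactly the training columns: missing ones are filled with 0,
# and test-only columns (here m_5) are dropped.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(test_d.astype(int))
```

This only matters for dummy columns that are actually passed to the model; columns that are never selected can differ without harm.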

In [156]:
%%time
new_feat_train['num_unique_sites'] = [np.unique(full_sites.iloc[:idx_split,:].values[i, :]).shape[0] 
                    for i in range(full_sites.iloc[:idx_split,:].shape[0])]
new_feat_test['num_unique_sites'] = [np.unique(full_sites.iloc[idx_split:,:].values[i, :]).shape[0] 
                    for i in range(full_sites.iloc[idx_split:,:].shape[0])]

CPU times: user 36.3 s, sys: 25.4 ms, total: 36.3 s
Wall time: 36.2 s
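
The per-row loop above can likely be replaced by a vectorized call. The sketch below, on toy data, uses `DataFrame.nunique(axis=1)`. Like `np.unique` in the cell above, it counts the padding value 0 as a distinct value, so the two approaches should agree.

```python
import pandas as pd

# Toy sessions (hypothetical data); 0 marks an empty slot, as in full_sites.
toy_sites = pd.DataFrame([[56, 55, 0, 0],
                          [946, 946, 951, 946]],
                         columns=["site1", "site2", "site3", "site4"])

# Number of distinct values per row, with 0 counted like any other value.
num_unique = toy_sites.nunique(axis=1)
print(num_unique.tolist())
```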


In [157]:
new_feat_train.head()

session_id,year_month,year_month_scaled,start_hour,start_hour_scaled,morning,day,noon,weekday,duration,duration_10,...,m_4,m_5,m_6,m_7,m_8,m_9,m_10,m_11,m_12,num_unique_sites
21669,201301,-1.744405,8,-1.357366,1,0,0,5,0.0,1,...,0,0,0,0,0,0,0,0,0,3
54843,201301,-1.744405,8,-1.357366,1,0,0,5,1786.0,0,...,0,0,0,0,0,0,0,0,0,3
77292,201301,-1.744405,8,-1.357366,1,0,0,5,4.0,0,...,0,0,0,0,0,0,0,0,0,6
114021,201301,-1.744405,8,-1.357366,1,0,0,5,3.0,0,...,0,0,0,0,0,0,0,0,0,5
146670,201301,-1.744405,8,-1.357366,1,0,0,5,2.0,1,...,0,0,0,0,0,0,0,0,0,6


In [158]:
new_feat_train['weekday'].value_counts()

2    55971
1    48659
3    44147
4    41140
0    40513
5    15799
6     7332
Name: weekday, dtype: int64

In [159]:
new_feat_train = pd.DataFrame(pd.concat([new_feat_train,
                           pd.get_dummies(new_feat_train['num_unique_sites'], prefix='u', drop_first=False)], axis=1))
new_feat_test = pd.DataFrame(pd.concat([new_feat_test,
                           pd.get_dummies(new_feat_test['num_unique_sites'], prefix='u', drop_first=False)], axis=1))

In [160]:
new_feat_train = pd.DataFrame(pd.concat([new_feat_train,
                           pd.get_dummies(new_feat_train['start_hour'], prefix='sh', drop_first=False)], axis=1))
new_feat_test = pd.DataFrame(pd.concat([new_feat_test,
                           pd.get_dummies(new_feat_test['start_hour'], prefix='sh', drop_first=False)], axis=1))

In [161]:
alice_df.values.flatten()

array([270, 270, 270, ...,   0,   0,   0])

In [162]:
alice_site_dict = dict(sorted(Counter(alice_df.values.flatten()).items(), key=lambda t: t[1], reverse=True))

In [163]:
top10 = [key for key, value in alice_site_dict.items()][:10]

In [164]:
top10

[77, 80, 76, 29, 21, 81, 22, 879, 75, 82]
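
The sort-and-slice steps above can be expressed more directly with `Counter.most_common`. The sketch below uses a toy array and, as an optional extra step that the cells above do not perform, removes the padding value 0 before ranking.

```python
from collections import Counter

import numpy as np

# Toy stand-in for alice_df.values (hypothetical data); 0 is padding.
toy_alice = np.array([[270, 270, 21],
                      [21, 270, 0],
                      [30, 0, 0]])

counts = Counter(toy_alice.ravel().tolist())
counts.pop(0, None)  # optional assumption: ignore empty slots

# most_common(k) returns the k (value, count) pairs with the highest counts.
top2 = [site for site, _ in counts.most_common(2)]
print(top2)
```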

In [165]:
%%time
new_feat_train['n_top10'] = [len(set(list(full_sites.iloc[:idx_split,:].iloc[i]))&set(top10))
                             for i in range(full_sites.iloc[:idx_split,:].shape[0])]
new_feat_test['n_top10'] = [len(set(list(full_sites.iloc[idx_split:,:].iloc[i]))&set(top10))
                             for i in range(full_sites.iloc[idx_split:,:].shape[0])]

CPU times: user 1min 15s, sys: 1.29 s, total: 1min 16s
Wall time: 1min 14s
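
The row-by-row set intersection above is slow because `iloc` is called once per session. A possible vectorized sketch, on toy data with a hypothetical shortlist, checks each shortlisted site against the whole array at once:

```python
import numpy as np

# Toy sessions and a hypothetical shortlist of frequent sites.
toy_sites = np.array([[77, 80, 77, 0],
                      [1, 2, 3, 4],
                      [29, 21, 76, 29]])
top_sites = [77, 80, 76, 29, 21]

# For each shortlisted site, flag the sessions that contain it at least once,
# then add the flags up. The result equals len(set(session) & set(top_sites)).
n_top = sum((toy_sites == s).any(axis=1).astype(int) for s in top_sites)
print(n_top.tolist())
```

Repeated visits to the same site are counted once, which matches the set-based definition used above.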


In [166]:
new_feat_train['n_top10'].value_counts()

0    128134
1     77665
2     34036
3      7613
4      3992
5      1804
6       306
7        11
Name: n_top10, dtype: int64

In [167]:
new_feat_train = pd.DataFrame(pd.concat([new_feat_train,
                           pd.get_dummies(new_feat_train['n_top10'], prefix='top', drop_first=False)], axis=1))
new_feat_test = pd.DataFrame(pd.concat([new_feat_test,
                           pd.get_dummies(new_feat_test['n_top10'], prefix='top', drop_first=False)], axis=1))

In [168]:
new_feat_test.columns

Index(['year_month', 'year_month_scaled', 'start_hour', 'start_hour_scaled',
       'morning', 'day', 'noon', 'weekday', 'duration', 'duration_10',
       'duration_10-20', 'duration_20-30', 'duration_30-40', 'duration_40-50',
       'duration_50-60', 'duration_60-70', 'duration_70-80', 'duration_80-90',
       'd_0', 'd_1', 'd_2', 'd_3', 'd_4', 'd_5', 'd_6', 'month', 'm_5', 'm_6',
       'm_7', 'm_8', 'm_9', 'm_10', 'm_11', 'm_12', 'num_unique_sites', 'u_1',
       'u_2', 'u_3', 'u_4', 'u_5', 'u_6', 'u_7', 'u_8', 'u_9', 'u_10', 'sh_7',
       'sh_8', 'sh_9', 'sh_10', 'sh_11', 'sh_12', 'sh_13', 'sh_14', 'sh_15',
       'sh_16', 'sh_17', 'sh_18', 'sh_19', 'sh_20', 'sh_21', 'sh_22', 'sh_23',
       'n_top10', 'top_0', 'top_1', 'top_2', 'top_3', 'top_4', 'top_5',
       'top_6', 'top_7'],
      dtype='object')

In [169]:
features = ['year_month_scaled', 'morning', 'day', 'duration_10', 'duration_10-20', 'duration_20-30',
            'duration_30-40', 'duration_40-50', 'duration_50-60', 'duration_60-70',
            'duration_70-80', 'd_0', 'd_1', 'd_2', 'd_3', 'd_4','d_5', 'd_6',
            'u_1', 'u_2', 'u_3', 'u_4', 'u_5', 'u_6', 'u_7', 'u_8', 'u_9', 'u_10',
            'top_0', 'top_1', 'top_2', 'top_3', 'top_4', 'top_5',
            'top_6', 'top_7']

In [170]:
X_train = csr_matrix(hstack([X_train_sparse,
                            new_feat_train[features].values.reshape(-1,len(features))]))

X_test = csr_matrix(hstack([X_test_sparse,
                            new_feat_test[features].values.reshape(-1,len(features))]))

In [171]:
X_train.shape, X_test.shape

((253561, 300036), (82797, 300036))
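
The stacking step above appends the dense feature columns to the right of the sparse TF-IDF matrix. A small sketch on toy matrices follows; passing `format="csr"` to `scipy.sparse.hstack` is an alternative to wrapping the result in `csr_matrix`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: 3 sessions, 2 text features, 3 extra features.
X_text = csr_matrix(np.array([[0.0, 0.5],
                              [0.7, 0.0],
                              [0.0, 0.0]]))
X_extra = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1]], dtype=float)

# Column-wise concatenation; the row count must match in both blocks.
X_all = hstack([X_text, csr_matrix(X_extra)], format="csr")
print(X_all.shape)
```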

In [172]:
idx = len(new_feat_train)- len(new_feat_train[new_feat_train['year_month']>=201305])

In [173]:
new_feat_train.iloc[idx:,:].head()

session_id,year_month,year_month_scaled,start_hour,start_hour_scaled,morning,day,noon,weekday,duration,duration_10,...,sh_23,n_top10,top_0,top_1,top_2,top_3,top_4,top_5,top_6,top_7
27876,201305,-1.65019,8,-1.357366,1,0,0,6,56.0,0,...,0,0,1,0,0,0,0,0,0,0
113470,201305,-1.65019,8,-1.357366,1,0,0,6,18.0,0,...,0,2,0,0,1,0,0,0,0,0
248880,201305,-1.65019,8,-1.357366,1,0,0,6,243.0,0,...,0,0,1,0,0,0,0,0,0,0
168463,201305,-1.65019,8,-1.357366,1,0,0,6,362.0,0,...,0,2,0,0,1,0,0,0,0,0
60553,201305,-1.65019,8,-1.357366,1,0,0,6,357.0,0,...,0,1,0,1,0,0,0,0,0,0


In [174]:
X_train[idx:,:]

<241935x300036 sparse matrix of type '<class 'numpy.float64'>'
	with 4333347 stored elements in Compressed Sparse Row format>

In [175]:
%%time
logit = LogisticRegression(n_jobs=-1, random_state=17)
logit.fit(X_train, y_train)

CPU times: user 38.8 s, sys: 559 ms, total: 39.4 s
Wall time: 9.95 s


In [176]:
get_auc_lr_valid(X_train, y_train, ratio=0.7)

0.94301690565633178

In [177]:
scores = cross_val_score(logit, X_train, y_train, cv=tscv, scoring='roc_auc')
np.mean(scores)

0.89737821048198485

In [178]:
C = np.logspace(-1, 2, 20)

In [179]:
%%time
searchCV = LogisticRegressionCV(Cs=C, cv=tscv, penalty='l2', scoring='roc_auc', random_state=17, n_jobs=-1, verbose=10)
searchCV.fit(X_train[idx:], y_train[idx:])
# Note: this is the highest score of any single (fold, C) pair, not a fold-averaged score
print('Max auc_roc:', searchCV.scores_[1].max())

[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done   2 out of   7 | elapsed:  8.2min remaining: 20.4min
[Parallel(n_jobs=-1)]: Done   3 out of   7 | elapsed:  8.6min remaining: 11.5min
[Parallel(n_jobs=-1)]: Done   4 out of   7 | elapsed:  9.5min remaining:  7.1min
[Parallel(n_jobs=-1)]: Done   5 out of   7 | elapsed: 10.2min remaining:  4.1min
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed: 10.7min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed: 10.7min finished


Max auc_roc: 0.977127586026
CPU times: user 40.5 s, sys: 2.02 s, total: 42.5 s
Wall time: 10min 54s


In [180]:
best_c = searchCV.C_[0]
best_c

16.237767391887211
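
To make the selection criterion explicit, the sketch below tunes `C` by hand on synthetic data: for every candidate value it averages the ROC AUC over the time-ordered folds and keeps the value with the highest mean. This is an illustration with assumed toy data from `make_classification` and a short grid, not a reproduction of the search above. The fold-averaged score it reports is generally lower than the best single-fold score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in for the training data (assumption for this sketch).
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=17)

tscv_toy = TimeSeriesSplit(n_splits=4)
Cs = np.logspace(-1, 2, 5)

# Mean validation AUC over the folds for each candidate C.
mean_auc = [
    cross_val_score(LogisticRegression(C=c, max_iter=1000),
                    X_toy, y_toy, cv=tscv_toy, scoring="roc_auc").mean()
    for c in Cs
]

best_c_toy = Cs[int(np.argmax(mean_auc))]
print(best_c_toy, max(mean_auc))
```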

In [181]:
get_auc_lr_valid(X_train, y_train, ratio=0.7, C=best_c)

0.94374641227803058

0.93372138833945972

In [185]:
test_pred = searchCV.predict_proba(X_test)[:,1]

In [186]:
write_to_submission_file(test_pred, 'n_gram1_2_max300k_36_feat_tuned_C_clipped.csv')