# Catch Me If You Can ("Alice")
### Intruder Detection through Webpage Session Tracking

***(Based on mlcourse.ai training materials)***

In [2]:
import pickle

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm_notebook

%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt

In [3]:
from scipy import sparse

In [4]:
from sklearn.model_selection import train_test_split

### 1. Loading and transforming data


In [5]:
train_df = pd.read_csv("/content/drive/MyDrive/Py/mlcourse.ai/project_alice/train_sessions.csv", index_col="session_id")
test_df = pd.read_csv("/content/drive/MyDrive/Py/mlcourse.ai/project_alice/test_sessions.csv", index_col="session_id")

times = ["time%s" % i for i in range(1, 11)]
train_df[times] = train_df[times].apply(pd.to_datetime)
test_df[times] = test_df[times].apply(pd.to_datetime)

train_df = train_df.sort_values(by="time1")

train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21669,56,2013-01-12 08:05:57,55.0,2013-01-12 08:05:57,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,0
54843,56,2013-01-12 08:37:23,55.0,2013-01-12 08:37:23,56.0,2013-01-12 09:07:07,55.0,2013-01-12 09:07:09,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,0
77292,946,2013-01-12 08:50:13,946.0,2013-01-12 08:50:14,951.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:16,945.0,2013-01-12 08:50:16,948.0,2013-01-12 08:50:16,784.0,2013-01-12 08:50:16,949.0,2013-01-12 08:50:17,946.0,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948.0,2013-01-12 08:50:17,949.0,2013-01-12 08:50:18,948.0,2013-01-12 08:50:18,945.0,2013-01-12 08:50:18,946.0,2013-01-12 08:50:18,947.0,2013-01-12 08:50:19,945.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950.0,2013-01-12 08:50:20,948.0,2013-01-12 08:50:20,947.0,2013-01-12 08:50:21,950.0,2013-01-12 08:50:21,952.0,2013-01-12 08:50:21,946.0,2013-01-12 08:50:21,951.0,2013-01-12 08:50:22,946.0,2013-01-12 08:50:22,947.0,2013-01-12 08:50:22,0


  
**Sessions are defined so that none lasts longer than half an hour or contains more than 10 sites. That is, a session ends either when the user has visited 10 sites in a row or when it has lasted more than 30 minutes.**

**The table contains missing values, meaning that some sessions consist of fewer than 10 sites. Replace the missing values with zeros and convert the site columns to an integer type.**

In [38]:
sites = ["site%s" % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype("int")
test_df[sites] = test_df[sites].fillna(0).astype("int")

with open(r"/content/drive/MyDrive/Py/mlcourse.ai/project_alice/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

sites_dict_df = pd.DataFrame(
    list(site_dict.keys()), index=list(site_dict.values()), columns=["site"]
)
sites_dict_df.head()

,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


**Extract the target variable and concatenate the training and test samples so that both can be converted to a sparse format together.**

In [7]:
y_train = train_df["target"]

# concatenated table of the initial data
full_df = pd.concat([train_df.drop("target", axis=1), test_df])

# index at which to split the concatenated table back into train and test samples
idx_split = train_df.shape[0]

In [8]:
y_train.shape

(253561,)

**The very first model uses only the sites visited in a session and ignores the time features. The idea behind this choice is that Alice has her favorite sites: the more often these sites appear in a session, the more likely it is that the session is Alice's, and vice versa.**

In [9]:
# table with the indices of the sites visited in each session
full_sites = full_df[sites]
full_sites.head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
21669,56,55,0,0,0,0,0,0,0,0
54843,56,55,56,55,0,0,0,0,0,0
77292,946,946,951,946,946,945,948,784,949,946
114021,945,948,949,948,945,946,947,945,946,946
146670,947,950,948,947,950,952,946,951,946,947


**A session is a sequence of site indices, and data in this form is inconvenient for linear methods. In line with our hypothesis (Alice has favorite sites), we transform the table so that each possible site gets its own feature (column) whose value equals the number of visits to that site within the session.**

**This transformation is done using a sparse matrix.**

In [10]:
from scipy.sparse import csr_matrix

In [11]:
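# flattened sequence of site indices (10 per session)
sites_flatten = full_sites.values.flatten()

# CSR layout: data = ones, indices = site ids, indptr = row boundaries every 10 entries;
# column 0 (the zero padding for missing sites) is dropped
full_sites_sparse = csr_matrix(
    (
        [1] * sites_flatten.shape[0],
        sites_flatten,
        range(0, sites_flatten.shape[0] + 10, 10),
    )
)[:, 1:]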

In [12]:
full_sites_sparse.shape

(336358, 48371)
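
The construction above can be checked on a toy example. The sketch below uses made-up data (`toy_sites` is hypothetical, with 3 slots per session instead of 10) and the same `(data, indices, indptr)` form of `csr_matrix`; repeated site indices within a row are summed when the matrix is densified, which is what turns visits into counts.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 2 sessions with 3 slots each; 0 means "no site visited"
toy_sites = np.array([[1, 2, 1],
                      [3, 0, 0]])

flat = toy_sites.flatten()                   # site indices act as column indices
data = np.ones(flat.shape[0], dtype=int)     # every visit contributes 1
# row boundaries: one row per session, toy_sites.shape[1] entries per row
indptr = np.arange(0, flat.shape[0] + 1, toy_sites.shape[1])

# Drop column 0, which only holds the zero padding
toy_sparse = csr_matrix((data, flat, indptr))[:, 1:]

# Duplicate (row, column) entries are added together in the dense view
print(toy_sparse.toarray())
```

Here the first session visits site 1 twice and site 2 once, and the second session visits only site 3.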

In [13]:
full_sites_sparse_train = full_sites_sparse[:idx_split]

In [14]:
full_sites_sparse_test = full_sites_sparse[idx_split:]


### 2. Making the first model

**The first model is logistic regression from the sklearn package with default parameters. The first 90% of the data is used for training (the training sample is sorted by time) and the remaining 10% for validation.**

**Define a function that returns the model's quality on the hold-out sample, and train our first classifier.**

In [15]:
def get_auc_lr_valid(X, y, C=1.0, ratio=0.9, seed=17):

    train_len = int(ratio * X.shape[0])
    X_train = X[:train_len, :]
    X_valid = X[train_len:, :]
    y_train = y[:train_len]
    y_valid = y[train_len:]
    
    logit = LogisticRegression(C=C, n_jobs=-1, random_state=seed)
    
    logit.fit(X_train, y_train)
    
    valid_pred = logit.predict_proba(X_valid)[:, 1]
    
    return roc_auc_score(y_valid, valid_pred)
 

**ROC AUC on the hold-out sample**

In [16]:
%%time
get_auc_lr_valid(full_sites_sparse_train, y_train)

CPU times: user 151 ms, sys: 62.8 ms, total: 214 ms
Wall time: 4.52 s


0.9197955574958127
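
As a reminder of what this number measures, here is a minimal sketch on made-up labels and scores (`toy_y` and `toy_scores` are hypothetical, not taken from the dataset). ROC AUC can be read as the share of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities (not from the dataset)
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = roc_auc_score(toy_y, toy_scores)

# The same quantity by hand: fraction of (positive, negative) pairs where
# the positive example is scored higher (this toy data has no ties)
pos = [s for s, y in zip(toy_scores, toy_y) if y == 1]
neg = [s for s, y in zip(toy_scores, toy_y) if y == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(toy_auc, pairwise)
```

Because the metric depends only on the ordering of the scores, it is a convenient choice for a heavily imbalanced target.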


**To make predictions on the test sample, the model is retrained on the entire training sample, which should improve its ability to generalize:**

In [17]:
logit = LogisticRegression(C=1.0, n_jobs=-1, random_state=17)
    
logit.fit(full_sites_sparse_train, y_train)

# make a prediction on a test sample

logit_pred = logit.predict_proba(full_sites_sparse_test)[:, 1]

**Write the predictions to a file and make a submission to Kaggle**

In [18]:
# function for writing forecasts to a file
def write_to_submission_file(
    predicted_labels, out_file, target="target", index_label="session_id"
):
    predicted_df = pd.DataFrame(
        predicted_labels,
        index=np.arange(1, predicted_labels.shape[0] + 1),
        columns=[target],
    )
    predicted_df.to_csv(out_file, index_label=index_label)
    print(predicted_df)

In [19]:
write_to_submission_file(
    logit_pred, r'C:\Users\Pav\Desktop\Py\mlcourse.ai\predictions1.csv')

             target
1      2.219764e-03
2      2.518962e-09
3      6.160276e-09
4      1.322690e-08
5      2.729067e-05
...             ...
82793  1.330106e-05
82794  1.242971e-05
82795  8.433158e-03
82796  3.878555e-04
82797  1.295467e-05

[82797 rows x 1 columns]


The score on Kaggle is 0.90734.

### 3. Model tuning

**Create a feature `month&year`: a number of the form YYYYMM taken from the date on which the session took place, for example 201407 for July 2014**

In [20]:
train_df['month&year'] = train_df['time1'].dt.year*100 + train_df['time1'].dt.month
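
The encoding can be verified on a couple of made-up timestamps (`toy_times` is hypothetical and not part of the dataset):

```python
import pandas as pd

# Hypothetical session start times (not taken from the dataset)
toy_times = pd.Series(pd.to_datetime(["2013-01-12 08:05:57", "2014-07-30 23:59:59"]))

# year * 100 + month gives an integer of the form YYYYMM
toy_yyyymm = toy_times.dt.year * 100 + toy_times.dt.month

print(toy_yyyymm.tolist())
```

Note that the gap between consecutive values is not uniform: December to January jumps by 89 (for example, 201312 to 201401), while months within a year differ by 1.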

**Scale the feature using `StandardScaler`**

In [21]:
ss = StandardScaler()
a_scaled = ss.fit_transform(train_df['month&year'].values.reshape(-1, 1))
train_df['month&year'] = a_scaled
train_df['month&year']

session_id
21669    -1.744405
54843    -1.744405
77292    -1.744405
114021   -1.744405
146670   -1.744405
            ...   
12224     0.681626
164438    0.681626
12221     0.681626
156968    0.681626
204762    0.681626
Name: month&year, Length: 253561, dtype: float64
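
For reference, `StandardScaler` subtracts the mean and divides by the standard deviation learned in `fit`. The sketch below uses made-up values (`toy_train` and `toy_test` are hypothetical) and shows one common convention, which is an assumption here rather than a description of this notebook: the statistics are learned on the training data and reused for new data through `transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical YYYYMM values (not from the dataset)
toy_train = np.array([[201301.0], [201307.0], [201401.0]])
toy_test = np.array([[201405.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)   # learns mean_ and scale_ from toy_train
test_scaled = scaler.transform(toy_test)         # reuses the statistics learned above

# The result matches the z-score computed by hand
manual = (toy_train - toy_train.mean()) / toy_train.std()
print(np.allclose(train_scaled, manual))
print(test_scaled)
```

With this convention the same raw value maps to the same scaled value in both samples.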

In [22]:
# morning feature

train_df['morning'] = (train_df['time1'].dt.hour <= 11).astype(int)

In [23]:
# start_hour feature

train_df['start_hour'] = train_df['time1'].dt.hour
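
A small check of the two time-of-day features on made-up timestamps (`toy_times` is hypothetical), including the boundary between 11:59 and 12:00:

```python
import pandas as pd

# Hypothetical session start times (not taken from the dataset)
toy_times = pd.Series(pd.to_datetime([
    "2013-01-12 08:05:57",
    "2013-01-12 11:59:00",
    "2013-01-12 12:00:00",
    "2013-01-12 23:15:00",
]))

toy_feats = pd.DataFrame({
    "start_hour": toy_times.dt.hour,                    # hour of day, 0..23
    "morning": (toy_times.dt.hour <= 11).astype(int),   # 1 if the hour is 11 or less
})

print(toy_feats)
```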

In [24]:
# concatenating all new features in a separate dataframe

new_feat_train_full = train_df[['month&year', 'start_hour', 'morning']]

In [25]:
new_feat_train_full

session_id,month&year,start_hour,morning
21669,-1.744405,8,1
54843,-1.744405,8,1
77292,-1.744405,8,1
114021,-1.744405,8,1
146670,-1.744405,8,1
...,...,...,...
12224,0.681626,23,0
164438,0.681626,23,0
12221,0.681626,23,0
156968,0.681626,23,0


**Add two new features: `start_hour` and `morning`**

The `start_hour` feature is the hour at which the session started (from 0 to 23), and the binary feature `morning` is 1 if the session started in the morning and 0 if it started later (we treat a session as a morning one if `start_hour` is 11 or less).

**Calculate ROC AUC on the hold-out sample for the following feature sets:**
- sites, `month&year` and `start_hour`
- sites, `month&year` and `morning`
- sites, `month&year`, `start_hour` and `morning`

In [26]:
# Function to calculate ROC AUC on the hold-out sample for different feature combinations

def roc_auc_calculation(dropped_feat):
    # drop one extra feature and stack the rest next to the sparse site counts
    new_feats_train = new_feat_train_full.drop([dropped_feat], axis=1)
    sites_and_feats_train = sparse.hstack([full_sites_sparse_train, new_feats_train])
    sites_and_feats_train = sites_and_feats_train.tocsr()
    return get_auc_lr_valid(sites_and_feats_train, y_train)

In [27]:
# sites, month&year & start_hour

roc_auc_calculation('morning')

0.9566499742171244

In [28]:
# sites, month&year & morning

roc_auc_calculation('start_hour')

0.9477508063941531

In [29]:
# sites, month&year, start_hour & morning

new_feats_train = new_feat_train_full
sites_and_feats_train = sparse.hstack([full_sites_sparse_train, new_feats_train])
sites_and_feats_train = sites_and_feats_train.tocsr()
get_auc_lr_valid(sites_and_feats_train, y_train)

0.9582227598183243

In [30]:
sites_and_feats_train.shape

(253561, 48374)
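
The three extra columns in this shape come from the stacking step. A minimal sketch with made-up matrices (`toy_counts` and `toy_extra` are hypothetical) shows how `sparse.hstack` appends columns to the right and why the result is converted back to CSR, which supports the row slicing used for the time-ordered split:

```python
import numpy as np
from scipy import sparse

# Hypothetical site-count matrix (2 sessions, 3 sites) and two extra feature columns
toy_counts = sparse.csr_matrix(np.array([[2, 1, 0],
                                         [0, 0, 1]]))
toy_extra = sparse.csr_matrix(np.array([[-1.7, 8.0],
                                        [0.7, 23.0]]))

# Columns are appended to the right; tocsr() gives a format that supports row slicing
toy_X = sparse.hstack([toy_counts, toy_extra]).tocsr()

print(toy_X.shape)
print(toy_X[:1].toarray())   # first row only, as in a time-ordered train/validation split
```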

**The highest score is obtained when all the features are used**

### 4. Selecting the regularization coefficient from `np.logspace(-3, 1, 10)`



**Find the `C` from `np.logspace(-3, 1, 10)` at which the ROC AUC on the hold-out sample is the highest.**

In [31]:
# split the sample into training and hold-out parts

X_train1, X_valid1, y_train1, y_valid1 = train_test_split(sites_and_feats_train, y_train, test_size=0.1)


In [32]:
a = []
C = list(np.logspace(-3, 1, 10)) 

# Loop for selection of the best regularization param
for i in C:
    logit = LogisticRegression(C=i, n_jobs=-1, random_state=17)
    logit.fit(X_train1, y_train1)
    valid_pred = logit.predict_proba(X_valid1)[:, 1]
    a.append(roc_auc_score(y_valid1, valid_pred))

In [33]:
# Get the best regularization parameter
index = a.index(max(a))
C[index]

1.2915496650148828
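
The same selection can be written with `np.argmax`. In the sketch below the scores are made up purely for illustration (`toy_scores` is hypothetical and does not reproduce the values computed in the loop); only the grid matches the one used here.

```python
import numpy as np

C_grid = np.logspace(-3, 1, 10)   # 10 values from 0.001 to 10, evenly spaced in log10

# Hypothetical validation scores, one per value of C (not the notebook's results)
toy_scores = [0.80, 0.85, 0.90, 0.93, 0.95, 0.96, 0.965, 0.97, 0.968, 0.96]

best_idx = int(np.argmax(toy_scores))   # position of the highest score
best_C = C_grid[best_idx]

print(best_idx, best_C)
```

If several values of `C` tie for the best score, both `np.argmax` and `list.index` return the first one.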

**Train the model with the optimal value of the regularization parameter found above and with all the features**

**First, get all the test data in sparse matrix form**

In [34]:
full_sites_sparse_test = full_sites_sparse[idx_split:]

# month&year

test_df['month&year'] = test_df['time1'].dt.year*100 + test_df['time1'].dt.month

ss = StandardScaler()
a_scaled = ss.fit_transform(test_df['month&year'].values.reshape(-1, 1))
test_df['month&year'] = a_scaled

# morning

test_df['morning'] = (test_df['time1'].dt.hour <= 11).astype(int)

# start_hour

test_df['start_hour'] = test_df['time1'].dt.hour

# All the new features

new_feat_test_full = test_df[['month&year', 'start_hour', 'morning']]

# concatenate all the new features with the sites features

new_feats_test = new_feat_test_full
sites_and_feats_test = sparse.hstack([full_sites_sparse_test, new_feats_test])
sites_and_feats_test = sites_and_feats_test.tocsr()

**Second, train on the whole training sample without splitting it into training and hold-out sets, and make the final prediction on the test set**

In [35]:
logit = LogisticRegression(C=C[index], n_jobs=-1, random_state=17)
logit.fit(sites_and_feats_train, y_train)
Test_pred = logit.predict_proba(sites_and_feats_test)[:, 1]


**Write the predictions to a file and make a submission to Kaggle**

In [36]:
write_to_submission_file(
    Test_pred, r'/content/drive/MyDrive/Py/mlcourse.ai/project_alice/final_prediction.csv')

             target
1      6.803758e-05
2      4.674615e-14
3      3.482106e-10
4      2.360167e-09
5      2.193073e-05
...             ...
82793  3.443585e-05
82794  3.052532e-05
82795  6.484673e-04
82796  2.495609e-05
82797  1.669314e-07

[82797 rows x 1 columns]


**This prediction scored higher than the previous one, which used only the sites feature (0.92889 versus 0.90734)**