# Project: "Identification of Internet Users"
# Part 6. Vowpal Wabbit

### 1. Data preparation

Next, let's look at Vowpal Wabbit in action. On the binary classification of web sessions from our earlier competition, VW would show no noticeable difference in either quality or speed, so we demonstrate its speed on a classification task with 400 classes instead. The source data are the same, but 400 users have been selected, and the task is to identify which of them a session belongs to. The data come from the Kaggle competition ["Identify Me If You Can"](https://www.kaggle.com/c/identify-me-if-you-can4).

In [1]:
import os
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import LabelEncoder

In [3]:
PATH_TO_DATA = 'identify_me_if_you_can'

In [4]:
train_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA,'train_sessions_400users.csv'), 
                           index_col='session_id')

In [5]:
test_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA,'test_sessions_400users.csv'), 
                           index_col='session_id')

In [6]:
train_df_400.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,user_id
1,23713,2014-03-24 15:22:40,23720.0,2014-03-24 15:22:48,23713.0,2014-03-24 15:22:48,23713.0,2014-03-24 15:22:54,23720.0,2014-03-24 15:22:54,...,2014-03-24 15:22:55,23713.0,2014-03-24 15:23:01,23713.0,2014-03-24 15:23:03,23713.0,2014-03-24 15:23:04,23713.0,2014-03-24 15:23:05,653
2,8726,2014-04-17 14:25:58,8725.0,2014-04-17 14:25:59,665.0,2014-04-17 14:25:59,8727.0,2014-04-17 14:25:59,45.0,2014-04-17 14:25:59,...,2014-04-17 14:26:01,45.0,2014-04-17 14:26:01,5320.0,2014-04-17 14:26:18,5320.0,2014-04-17 14:26:47,5320.0,2014-04-17 14:26:48,198
3,303,2014-03-21 10:12:24,19.0,2014-03-21 10:12:36,303.0,2014-03-21 10:12:54,303.0,2014-03-21 10:13:01,303.0,2014-03-21 10:13:24,...,2014-03-21 10:13:36,303.0,2014-03-21 10:13:54,309.0,2014-03-21 10:14:01,303.0,2014-03-21 10:14:06,303.0,2014-03-21 10:14:24,34
4,1359,2013-12-13 09:52:28,925.0,2013-12-13 09:54:34,1240.0,2013-12-13 09:54:34,1360.0,2013-12-13 09:54:34,1344.0,2013-12-13 09:54:34,...,2013-12-13 09:54:34,1346.0,2013-12-13 09:54:34,1345.0,2013-12-13 09:54:34,1344.0,2013-12-13 09:58:19,1345.0,2013-12-13 09:58:19,601
5,11,2013-11-26 12:35:29,85.0,2013-11-26 12:35:31,52.0,2013-11-26 12:35:31,85.0,2013-11-26 12:35:32,11.0,2013-11-26 12:35:32,...,2013-11-26 12:35:32,11.0,2013-11-26 12:37:03,85.0,2013-11-26 12:37:03,10.0,2013-11-26 12:37:03,85.0,2013-11-26 12:37:04,273


**The training sample contains 182793 sessions, the test sample contains 46473, and the sessions do belong to 400 different users.**

In [7]:
train_df_400.shape, test_df_400.shape, train_df_400['user_id'].nunique()

((182793, 21), (46473, 20), 400)

**Vowpal Wabbit expects class labels to be integers from 1 to K, where K is the number of classes in the classification problem (in our case, 400). Therefore, we apply `LabelEncoder` and then add 1. Later, the predictions will need the inverse conversion.**

In [8]:
y = train_df_400['user_id']
class_encoder = LabelEncoder()
y_for_vw = class_encoder.fit_transform(y) + 1
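
As a plain-Python illustration of this mapping (a sketch, not the scikit-learn implementation): `LabelEncoder` assigns each distinct label its position in sorted order, so adding 1 gives labels from 1 to K, and subtracting 1 before the reverse lookup restores the original ids. The `user_ids` list below is a made-up toy example.

```python
# Toy user ids (hypothetical values chosen for illustration).
user_ids = [653, 198, 34, 601, 198, 34]

# LabelEncoder-style mapping: sorted unique labels -> 0..K-1.
classes = sorted(set(user_ids))
to_index = {label: i for i, label in enumerate(classes)}

# Shift by one so that labels run from 1 to K, as needed for VW.
labels_for_vw = [to_index[u] + 1 for u in user_ids]

# Inverse conversion: subtract 1, then look up the original id.
restored = [classes[v - 1] for v in labels_for_vw]

print(labels_for_vw)         # [4, 2, 1, 3, 2, 1]
print(restored == user_ids)  # True
```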

**Next, we will compare VW with `SGDClassifier` and with logistic regression. All of these models need preprocessed input data. Let's prepare sparse matrices for the sklearn models, as was done in Part 5 of the project:**
- combine the training and test samples
- select only the sites (features from 'site1' to 'site10')
- replace missing values with zeros (a zero marks an empty slot and is skipped when the matrix is built)
- convert to the sparse `csr_matrix` format
- split back into training and test parts

In [9]:
sites = ['site' + str(i) for i in range(1, 11)]

In [10]:
def sparse_matr_gen(X):
    indptr = [0]
    indices = []
    data = []    

    for session in X:
        for site in session:
            if site == 0: continue # skip site id 0 (it marks a missing value)
            indices.append(site-1) # shift ids so that columns start at 0, i.e. drop the empty column 0
            data.append(1)
        indptr.append(len(indices))    
    
    return csr_matrix((data, indices, indptr))


train_test_df = pd.concat([train_df_400, test_df_400])
train_test_df_sites = train_test_df[sites].fillna(0).astype('int')
train_test_sparse = sparse_matr_gen(train_test_df_sites.values)

X_train_sparse = train_test_sparse[:train_df_400.shape[0]]
X_test_sparse = train_test_sparse[train_df_400.shape[0]:]

### 2. Testing using a validation sample

**Split the original training sample into a training part (70%) and a validation part (30%). Do not shuffle the data: assume that the sessions are sorted by time.**

In [11]:
train_share = int(.7 * train_df_400.shape[0])
train_df_part = train_df_400[sites].iloc[:train_share, :]
valid_df = train_df_400[sites].iloc[train_share:, :]
X_train_part_sparse = X_train_sparse[:train_share, :]
X_valid_sparse = X_train_sparse[train_share:, :]

In [12]:
y_train_part = y[:train_share]
y_valid = y[train_share:]
y_train_part_for_vw = y_for_vw[:train_share]
y_valid_for_vw = y_for_vw[train_share:]
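
The split above is purely positional. A minimal sketch of the same idea on a plain list (the helper `ordered_split` is a hypothetical name, not part of the notebook) shows that every training item precedes every validation item, and reproduces the cut point for the 182793 training sessions reported earlier.

```python
def ordered_split(items, train_fraction=0.7):
    """Split a sequence into a leading and a trailing part without shuffling."""
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]

# Stand-in for 10 sessions already sorted by time.
sessions = list(range(10))
train_part, valid_part = ordered_split(sessions)

# No shuffling: the last training item comes before the first validation item.
print(train_part[-1] < valid_part[0])  # True

# Cut point for the full training sample of 182793 sessions.
train_share = int(0.7 * 182793)
print(train_share, 182793 - train_share)  # 127955 54838
```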

In [30]:
def arrays_to_vw(X, y=None, train=True, out_file='tmp.vw'):
    '''
    Convert a sample to the Vowpal Wabbit format and save it to out_file.
    
    Parameters:
    X: NumPy matrix of features (one row per session)
    y: NumPy vector of labels (optional); if None, a dummy label 1 is written
    train: flag, True for the training sample, False for the test sample
           (currently unused by the function body)
    out_file: path of the .vw file to write
    '''
    with open(out_file, 'w') as f:
        for i, row in enumerate(X):
            f.write(str(1 if y is None else y[i])+ ' | ' + ' '.join(map(str, row)) + '\n')            

In [31]:
%%time
arrays_to_vw(train_df_part.values, y_train_part_for_vw, out_file=os.path.join(PATH_TO_DATA,'train_part.vw'))
arrays_to_vw(valid_df.values, y_valid_for_vw, out_file=os.path.join(PATH_TO_DATA,'valid.vw'))
arrays_to_vw(train_df_400[sites].values, y_for_vw, out_file=os.path.join(PATH_TO_DATA,'train.vw'))
arrays_to_vw(test_df_400[sites].values, out_file=os.path.join(PATH_TO_DATA,'test.vw'))

Wall time: 3.44 s


**Result:**

In [36]:
!head -3 $PATH_TO_DATA/train_part.vw

262 | 23713.0 23720.0 23713.0 23713.0 23720.0 23713.0 23713.0 23713.0 23713.0 23713.0
82 | 8726.0 8725.0 665.0 8727.0 45.0 8725.0 45.0 5320.0 5320.0 5320.0
16 | 303.0 19.0 303.0 303.0 303.0 303.0 303.0 309.0 303.0 303.0


In [33]:
!head -3  $PATH_TO_DATA/valid.vw

4 | 7.0 923.0 923.0 923.0 11.0 924.0 7.0 924.0 838.0 7.0
160 | 91.0 198.0 11.0 11.0 302.0 91.0 668.0 311.0 310.0 91.0
312 | 27085.0 848.0 118.0 118.0 118.0 118.0 11.0 118.0 118.0 118.0


In [34]:
!head -3 $PATH_TO_DATA/test.vw

1 | 9.0 304.0 308.0 307.0 91.0 308.0 312.0 300.0 305.0 309.0
1 | 838.0 504.0 68.0 11.0 838.0 11.0 838.0 886.0 27.0 305.0
1 | 190.0 192.0 8.0 189.0 191.0 189.0 190.0 2375.0 192.0 8.0
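
The files above store site ids as floats (`23713.0`), because the site columns contain missing values; a missing site would be written as a `nan` token. A hedged sketch of an alternative line formatter follows. `format_vw_line` is a hypothetical helper, not the function used in this notebook: it writes integer ids and skips missing sites.

```python
import math

def format_vw_line(sites, label=1):
    """Format one session as 'label | site site ...' with integer site ids.

    Hypothetical variant of the line format written by arrays_to_vw:
    missing sites (None, NaN or 0) are skipped and ids lose the '.0' suffix.
    """
    tokens = []
    for s in sites:
        if s is None or (isinstance(s, float) and math.isnan(s)) or s == 0:
            continue
        tokens.append(str(int(s)))
    return f"{label} | " + " ".join(tokens)

line = format_vw_line([8726, 8725.0, 665.0, float("nan")], label=82)
print(line)  # 82 | 8726 8725 665
```

Changing the tokens changes the feature names that VW hashes, so a model trained on such files would not reproduce the numbers reported in this section exactly.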


**Train the Vowpal Wabbit model on the `train_part.vw` sample. Specify that a classification problem with 400 classes is being solved (`--oaa 400`) and make 3 passes over the sample (`--passes 3`). Specifying a cache file (`-c`) speeds up every pass after the first one; the `-k` argument deletes any previous cache file. The `-b 26` parameter sets the number of bits used for feature hashing.**
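
For readers who prefer to assemble the command in Python rather than with the `!` shell escape, here is a minimal sketch that builds the same argument list. The function name `vw_train_args` and its defaults are assumptions for illustration; only flags that already appear in this section are used.

```python
def vw_train_args(data_path, model_path, n_classes=400, passes=3,
                  bits=26, seed=17):
    """Assemble the training command of this section as an argument list."""
    return [
        "vw",
        "--oaa", str(n_classes),    # multiclass problem with n_classes classes
        "-d", data_path,            # input file in VW format
        "--passes", str(passes),    # number of passes over the data
        "-c", "-k",                 # use a cache file, deleting any old one
        "-f", model_path,           # where to save the trained model
        "-b", str(bits),            # number of bits used for feature hashing
        "--random_seed", str(seed),
    ]

args = vw_train_args("train_part.vw", "train_part_model.vw")
print(" ".join(args))

# 26 hash bits address 2**26 distinct buckets.
n_buckets = 2 ** 26
print(n_buckets)  # 67108864
```

If `vw` is on the `PATH`, the list could be passed to `subprocess.run(args, check=True)`; that step is left out here so that the sketch runs without VW installed.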

In [37]:
train_part_vw = os.path.join(PATH_TO_DATA, 'train_part.vw')
valid_vw = os.path.join(PATH_TO_DATA, 'valid.vw')
train_vw = os.path.join(PATH_TO_DATA, 'train.vw')
test_vw = os.path.join(PATH_TO_DATA, 'test.vw')
model = os.path.join(PATH_TO_DATA, 'vw_model.vw')
pred = os.path.join(PATH_TO_DATA, 'vw_pred.csv')

In [40]:
%%time
!vw --oaa 400 $PATH_TO_DATA/train_part.vw --passes 3 -c -k -f $PATH_TO_DATA/train_part_model.vw -b 26 --random_seed 17

Wall time: 33.1 s


final_regressor = identify_me_if_you_can/train_part_model.vw
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = identify_me_if_you_can/train_part.vw.cache
Reading datafile = identify_me_if_you_can/train_part.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0      262        1       11
1.000000 1.000000            2            2.0       82      262       11
1.000000 1.000000            4            4.0      241      262       11
1.000000 1.000000            8            8.0      352      262       11
1.000000 1.000000           16           16.0      135       16       11
1.000000 1.000000           32           32.0       71      112       11
0.968750 0.937500           64           64.0      358      231       11
0.976563 0.984375          128          128.0      3

**We write the predictions for the `valid.vw` sample to `vw_valid_pred.csv`.**

In [41]:
%%time
!vw -i $PATH_TO_DATA/train_part_model.vw -t -d $PATH_TO_DATA/valid.vw -p $PATH_TO_DATA/vw_valid_pred.csv

Wall time: 931 ms


only testing
predictions = identify_me_if_you_can/vw_valid_pred.csv
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = identify_me_if_you_can/valid.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0        4      188       11
1.000000 1.000000            2            2.0      160      220       11
0.750000 0.500000            4            4.0      143      143       11
0.750000 0.750000            8            8.0      247      247       11
0.687500 0.625000           16           16.0      341       30       11
0.593750 0.500000           32           32.0      237      237       11
0.609375 0.625000           64           64.0      178      178       11
0.656250 0.703125          128          128.0      132      228       11
0.664063 0.671875          256          256.0      

**We read the predictions from the file `vw_valid_pred.csv` in `PATH_TO_DATA` and look at the proportion of correct answers on the validation sample.**

In [45]:
from sklearn.metrics import accuracy_score

vw_valid_pred = pd.read_csv(os.path.join(PATH_TO_DATA,'vw_valid_pred.csv'), header=None)
np.around(accuracy_score(y_valid_for_vw, vw_valid_pred.values), 3)

0.343
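
The accuracy here is simply the share of sessions whose predicted class equals the true one. A dependency-free sketch of the same computation is shown below; it assumes that each line of the prediction file starts with the predicted class label, and the sample lines and labels are made up.

```python
def read_vw_predictions(lines):
    """Parse predictions, taking the first token of each non-empty line."""
    return [int(float(line.split()[0])) for line in lines if line.strip()]

def accuracy(y_true, y_pred):
    """Share of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up prediction file contents and true labels.
pred_lines = ["188\n", "220\n", "143\n", "247\n"]
y_true = [4, 160, 143, 247]

y_pred = read_vw_predictions(pred_lines)
score = accuracy(y_true, y_pred)
print(score)  # 0.5
```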

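**Now we will train `SGDClassifier` (3 passes over the sample) and `LogisticRegression` on 70% of the sparse training sample (`X_train_part_sparse`, `y_train_part`), make predictions for the validation sample (`X_valid_sparse`, `y_valid`) and calculate the proportion of correct answers. The task calls for the logistic loss in `SGDClassifier`, but the cell below leaves `loss` at its default, so the model is actually trained with the hinge loss (see the printed parameters).**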

In [46]:
logit = LogisticRegression(random_state=17, n_jobs=-1)
sgd_logit = SGDClassifier(random_state=17, n_jobs=-1, max_iter=3)

In [47]:
%%time
logit.fit(X_train_part_sparse, y_train_part)

Wall time: 8min 1s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=-1, penalty='l2', random_state=17,
                   solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

In [48]:
%%time
sgd_logit.fit(X_train_part_sparse, y_train_part)

Wall time: 6.29 s




SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=3,
              n_iter_no_change=5, n_jobs=-1, penalty='l2', power_t=0.5,
              random_state=17, shuffle=True, tol=0.001, validation_fraction=0.1,
              verbose=0, warm_start=False)

Proportion of correct answers on the validation sample for Vowpal Wabbit:

In [52]:
np.around(accuracy_score(y_valid_for_vw, vw_valid_pred.values), 3)

0.343

Proportion of correct answers on the validation sample for SGD:

In [53]:
np.around(accuracy_score(y_valid, sgd_logit.predict(X_valid_sparse)), 3)

0.282

Proportion of correct answers on the validation sample for logistic regression:

In [54]:
np.around(accuracy_score(y_valid, logit.predict(X_valid_sparse)), 3)

0.353
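
To put the three validation results side by side, the short sketch below collects the training wall times and accuracies reported in the cells above and prints them sorted by training time. The timings come from single `%%time` runs, so they are only a rough guide.

```python
# (model, training wall time in seconds, validation accuracy), as reported above.
results = [
    ("Vowpal Wabbit", 33.1, 0.343),
    ("SGDClassifier", 6.29, 0.282),
    ("LogisticRegression", 8 * 60 + 1, 0.353),
]

best = max(results, key=lambda r: r[2])       # highest validation accuracy
by_time = sorted(results, key=lambda r: r[1])  # fastest training first

for name, seconds, acc in by_time:
    print(f"{name:<20} {seconds:>7.1f} s  accuracy {acc:.3f}  "
          f"gap to best {best[2] - acc:.3f}")
```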

### 3. Testing using a public sample (Public Leaderboard)

We train the VW model with the same parameters on the entire training sample - `train.vw`:

In [55]:
%%time
!vw --oaa 400 $PATH_TO_DATA/train.vw --passes 3 -c -k -f $PATH_TO_DATA/train_model.vw -b 26 --random_seed 17

Wall time: 39.3 s


final_regressor = identify_me_if_you_can/train_model.vw
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = identify_me_if_you_can/train.vw.cache
Reading datafile = identify_me_if_you_can/train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0      262        1       11
1.000000 1.000000            2            2.0       82      262       11
1.000000 1.000000            4            4.0      241      262       11
1.000000 1.000000            8            8.0      352      262       11
1.000000 1.000000           16           16.0      135       16       11
1.000000 1.000000           32           32.0       71      112       11
0.968750 0.937500           64           64.0      358      231       11
0.976563 0.984375          128          128.0      348      346    

Let's make predictions for the public (test) sample:

In [56]:
%%time
!vw -i $PATH_TO_DATA/train_model.vw -t -d $PATH_TO_DATA/test.vw -p $PATH_TO_DATA/vw_test_pred.csv

Wall time: 838 ms


only testing
predictions = identify_me_if_you_can/vw_test_pred.csv
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = identify_me_if_you_can/test.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0        1       90       11
1.000000 1.000000            2            2.0        1       21       11
1.000000 1.000000            4            4.0        1      265       11
1.000000 1.000000            8            8.0        1      137       11
1.000000 1.000000           16           16.0        1      273       11
1.000000 1.000000           32           32.0        1      384       11
1.000000 1.000000           64           64.0        1      139       11
1.000000 1.000000          128          128.0        1       85       11
1.000000 1.000000          256          256.0        

We read the predictions, apply the inverse label conversion (the labels went through `LabelEncoder` and then +1), write the result to a file and submit the solution to Kaggle.

In [57]:
def write_to_submission_file(predicted_labels, out_file,
                             target='user_id', index_label="session_id"):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [62]:
vw_test_pred = pd.read_csv(os.path.join(PATH_TO_DATA,'vw_test_pred.csv'), header=None)
vw_pred = class_encoder.inverse_transform(vw_test_pred - 1)
vw_pred

array([224,  48, 795, ..., 107, 387, 179], dtype=int64)

In [63]:
write_to_submission_file(vw_pred, os.path.join(PATH_TO_DATA, 'vw_400_users.csv'))

We will do the same for SGD and logistic regression.

In [64]:
logit.fit(X_train_sparse, y)
sgd_logit.fit(X_train_sparse, y)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=3,
              n_iter_no_change=5, n_jobs=-1, penalty='l2', power_t=0.5,
              random_state=17, shuffle=True, tol=0.001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [65]:
logit_test_pred = logit.predict(X_test_sparse)
sgd_logit_test_pred = sgd_logit.predict(X_test_sparse)

In [66]:
write_to_submission_file(logit_test_pred, os.path.join(PATH_TO_DATA, 'logit_400_users.csv'))
write_to_submission_file(sgd_logit_test_pred, os.path.join(PATH_TO_DATA, 'sgd_400_users.csv'))

Let's look at the proportions of correct answers on the public sample of [this competition](https://inclass.kaggle.com/c/identify-me-if-you-can4).

Proportion of correct answers on the public sample (public leaderboard) for Vowpal Wabbit: **0.18656.**

Proportion of correct answers on the public sample (public leaderboard) for SGD: **0.19409.**

Proportion of correct answers on the public sample (public leaderboard) for logistic regression: **0.19409.**

# Project summary (parts 1-6)

This project addressed the problem of identifying a user from the sequence of sites they visit. In parts 1-4, the best method for solving the problem was selected. In part 5, the conclusions from parts 1-4 were applied in the Kaggle competition "Catch Me If You Can", which allowed us to achieve a high degree of confidence in the algorithm's predictions. In part 6, using data from the Kaggle competition "Identify Me If You Can", we demonstrated the advantages of the Vowpal Wabbit library for online classification on large samples with many classes, namely its speed and low consumption of computing resources.