<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-comercial purpose with an obligatory indication of the names of the authors and of the source.

# <center> Assignment № 8
## <center> Vowpal Wabbit for Stackoverflow question tag classification

## Plan
    1. Introduction
    2. Data description
    3. Data preprocessing
    4. Training and validation of models
    5. Summary

### 1. Introduction

In this task, you will do something that we do every week at Mail.Ru Group: train models on several GBs of data. You might cope with Python in Windows, but we strongly recommend some \*NIX-system (for instance, with Docker) and use bash utils.
A sad, but true, fact is that, if you want to work in the best companies in the world in ML, you will need experience with UNIX bash. Here is an interactive [tutorial](https://www.codecademy.com/en/courses/learn-the-command-line/lessons/environment/exercises/bash-profile) from CodeAcademy on UNIX command line (1-2 hours).

Submit your answers through the [web-form](https://docs.google.com/forms/d/14adHGB-XKtpHlG9JJgog3DUzMUabd4y1YWG3b866m54/edit).

For this particular task, you will need Vowpal Wabbit installed (we already have it inside the docker-container of our course. Check out instructions in the README in our course [repo](https://github.com/Yorko/mlcourse_open)). Make sure you have approximately 70 GB of disk space. I have tested the solution on an ordinary Macbook Pro 2015 (8 kernels, 16GB RAM), and the heaviest model was trained in ~ 12 min, so this task is doable with ordinary hardware. Still, if you have plans to rent Amazon servers, right now is a good time to do it.

### 2. Data description

We have 10 GB of questions from StackOverflow – [download](https://drive.google.com/file/d/1ZU4J3KhJDrHVMj48fROFcTsTZKorPGlG/view) and unpack the archive. 

The data format is simple:<br>
<center>*question text* (space dilimited words) TAB *question tags* (space delimited)

TAB is the tabulation symbol.
Let's see the first sample from the training set:

In [2]:
PATH_TO_DATA = "../../raw_data"

In [None]:
!head -1 $PATH_TO_DATA/stackoverflow.10kk.tsv

Here, we have the question text, followed by a tab and the question tags: *css, css3* and *css-selectors*. There are 10 billion of such questions in our dataset.

In [None]:
%%time
!wc -l $PATH_TO_DATA/stackoverflow.10kk.tsv

Note, that we do not want to overload memory with this amount of data, so we will use the following Unix utilities - `head`, `tail`, `wc`, `cat`, `cut`, etc.

### 3. Data preprocessing

Let's select all questions with the tags *javascript, java, python, ruby, php, c++, c#, go, scala*, and *swift* from the data source, and prepare the training set in Vowpal Wabbit's data format. We will perform 10-class question classification over the tags we've selected.

In general, questions may have several tags, but we will simplify our task by selecting only one of the listed tags or dropping questions in case of no such tags.
Note that VW supports multilabel classification (`--multilabel_oaa` parameter).
<br>
<br>
Implement your data preprocessing code in a separate file `preprocess.py`. Your code must select lines with our tags and write them to a separate file in Vowpal Wabbit format. Details are as follows:
 - script must work with command line arguments: file paths for input and output
 - lines are processed one-by-one (there is a wonderful `tqdm` module for iterations counting)
 - if a line has no tab symbols or more than one tab symbol - then the line is broken, skip it
 - if a line has exactly one tab symbol, check how many tags are from our list *javascript, java, python, ruby, php, c++, c#, go, scala* or  *swift*. If there is only one tag, write the string to output with VW format: `label | text`, where `label` is a number from 1 to 10 (1 - *javascript*, ... 10 – *swift*). Skip strings with more than 1 or no tags.
 - remove `:` and `|` symbols from the question text - they have special meaning for VW

In [1]:
import os
from tqdm import tqdm
from time import time
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

You should have 4389054 lines in the preprocessed data file. We can see that VW can process 10 GB of data in roughly 1-2 minutes.

In [None]:
!python preprocess.py $PATH_TO_DATA/stackoverflow.10kk.tsv $PATH_TO_DATA/stackoverflow.vw

In [None]:
!wc -l $PATH_TO_DATA/stackoverflow.vw

Split the dataset into training, validation, and test sets in equal proportions with 1463018 lines in each file. We don't need to shuffle the data, the first 1463018 lines must go into training `stackoverflow_train.vw`, the last 1463018 lines to test `stackoverflow_test.vw`, and the rest to validation `stackoverflow_valid.vw`. 

Save answer vectors for validation and test sets into separate files: `stackoverflow_valid_labels.txt` and `stackoverflow_test_labels.txt`, respectively.

Do not hesitate to use `head`, `tail`, `split`, `cat` and `cut` linux utils.

In [None]:
# Your code here
!split -l 1463018 $PATH_TO_DATA/stackoverflow.vw stackoverflow_
!mv stackoverflow_aa stackoverflow_train.vw
!mv stackoverflow_ab stackoverflow_test.vw
!mv stackoverflow_ac stackoverflow_valid.vw
!cut -d'|' -f1 stackoverflow_valid.vw > stackoverflow_valid_labels.txt
!cut -d'|' -f1 stackoverflow_test.vw > stackoverflow_test_labels.txt

### 4. Training and validation of models

Train Vowpal Wabbit with `stackoverflow_train.vw` 9 times with (1,3,5) iterating passes and n-gram (n=1,2,3) parameters.
The rest of the parameters are `bit_precision=28` and `seed=17`. Don't forget to tell VW that we have a 10-class problem.

Evaluate accuracy on `stackoverflow_valid.vw`. Choose the model with the best parameters, and test it on `stackoverflow_test.vw` set.

In [3]:
y_true = pd.read_csv('stackoverflow_valid_labels.txt', names=['pred'])
y_true.head()

Unnamed: 0,pred
0,1
1,5
2,7
3,2
4,9


In [4]:
accuracies = []
settings = []
for ngram in [1, 2, 3]:
    for passes in [1,3,5]:
        train_cmd = "vw -d stackoverflow_train.vw --loss_function hinge --oaa 10 --ngram {ng} --passes {ps} \
        -b 28 --random_seed 17 --readable_model {ng}_{ps}readable.vw.model -f {ng}_{ps}vw.model -c".format(ng=ngram, ps=passes)
        ! echo $train_cmd
        ! $train_cmd
        test_cmd = "vw -d stackoverflow_valid.vw -t -i {ng}_{ps}vw.model -p stackoverflow_valid_preds.txt".format(ng=ngram, ps=passes)
        ! echo $test_cmd
        ! $test_cmd
        y_pred = pd.read_csv('stackoverflow_valid_preds.txt', names=['pred'])
        acc = accuracy_score(y_true, y_pred)
        accuracies.append(acc)
        settings.append("n%d p%d" % (ngram, passes))
        print("With %d ngrams and %d passes got %f" % (ngram, passes, acc))

vw -d stackoverflow_train.vw --loss_function hinge --oaa 10 --ngram 1 --passes 1 -b 28 --random_seed 17 --readable_model 1_1readable.vw.model -f 1_1vw.model -c
Generating 1-grams for all namespaces.
final_regressor = 1_1vw.model
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
using cache_file = stackoverflow_train.vw.cache
ignoring text input in favor of cache input
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0        1        1      161
0.500000 1.000000            2            2.0        4        1       68
0.750000 1.000000            4            4.0        7        1       88
0.750000 0.750000            8            8.0        7        1       95
0.750000 0.750000           16           16.0        7        7      209
0.781250 0.812500           32           32.0        7        2      174
0.765625 0

0.079224 0.081299         8192         8192.0        1        1      201
0.080139 0.081055        16384        16384.0        2        2      132
0.081635 0.083130        32768        32768.0        6        6      170
0.081863 0.082092        65536        65536.0        1        1     3000
0.081352 0.080841       131072       131072.0        7        7      111
0.081676 0.082001       262144       262144.0        7        7      114
0.081945 0.082214       524288       524288.0        7        7      375
0.081896 0.081846      1048576      1048576.0        7        7       25

finished run
number of examples = 1463018
weighted example sum = 1463018.000000
weighted label sum = 0.000000
average loss = 0.081903
total feature number = 292619465
With 1 ngrams and 3 passes got 0.918097
vw -d stackoverflow_train.vw --loss_function hinge --oaa 10 --ngram 1 --passes 5 -b 28 --random_seed 17 --readable_model 1_5readable.vw.model -f 1_5vw.model -c
Generating 1-grams for all namespaces.
final_reg

0.059570 0.046875         1024         1024.0        2        2      204
0.065918 0.072266         2048         2048.0        6        6      294
0.069580 0.073242         4096         4096.0        1        1      250
0.067383 0.065186         8192         8192.0        1        1      400
0.067017 0.066650        16384        16384.0        2        2      262
0.067810 0.068604        32768        32768.0        6        6      338
0.068176 0.068542        65536        65536.0        1        1     5998
0.067429 0.066681       131072       131072.0        7        7      220
0.067318 0.067207       262144       262144.0        7        7      226
0.067663 0.068008       524288       524288.0        7        7      748
0.067591 0.067518      1048576      1048576.0        7        2       48

finished run
number of examples = 1463018
weighted example sum = 1463018.000000
weighted label sum = 0.000000
average loss = 0.067465
total feature number = 582312894
With 2 ngrams and 1 passes go

using no cache
Reading datafile = stackoverflow_valid.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0        1        1      242
0.000000 0.000000            2            2.0        5        5      372
0.250000 0.500000            4            4.0        2        2      270
0.125000 0.000000            8            8.0        5        5      158
0.125000 0.125000           16           16.0        6        6      346
0.093750 0.062500           32           32.0        2        2     1296
0.078125 0.062500           64           64.0        1        5      112
0.078125 0.078125          128          128.0        1        1      182
0.070312 0.062500          256          256.0        2        1       82
0.072266 0.074219          512          512.0        5        5      302
0.063477 0.054688         1024         1024.0      

0.135880 0.112335       131072       131072.0        2        2      280
0.117374 0.098869       262144       262144.0        5        5      691
0.102125 0.086876       524288       524288.0        6        6      421
0.090308 0.078491      1048576      1048576.0        1        1     1261
0.081231 0.081231      2097152      2097152.0        5        5     2083 h

finished run
number of examples per pass = 1316717
passes used = 3
weighted example sum = 3950151.000000
weighted label sum = 0.000000
average loss = 0.073110 h
total feature number = 2345071710
vw -d stackoverflow_valid.vw -t -i 3_3vw.model -p stackoverflow_valid_preds.txt
Generating 3-grams for all namespaces.
only testing
predictions = stackoverflow_valid_preds.txt
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = stackoverflow_valid.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight 

In [None]:
print(list(zip(settings,accuracies)))
print(settings[np.argmax(accuracies)])
acc_valid_max = max(accuracies)
print(acc_valid_max)

In [10]:
!vw -d stackoverflow_valid.vw -t -i 2_1vw.model -p stackoverflow_valid_preds.txt --loss_function hinge --ngram 2 --passes 1 --random_seed 17
y_pred = pd.read_csv('stackoverflow_valid_preds.txt', names=['pred'])
accuracy_score(y_true, y_pred)

args =  -d stackoverflow_valid.vw -t --initial_regressor 2_1vw.model -p stackoverflow_valid_preds.txt --loss_function hinge --oaa 10 --ngram 2 --passes 1 -b 28 --random_seed 17 --bit_precision 28 --ngram 2 --hash_seed 0 --oaa 10 --link identity
ignoring duplicate option: '--bit_precision 28'
Generating 2-grams for all namespaces.
only testing
predictions = stackoverflow_valid_preds.txt
args =  -d stackoverflow_valid.vw --testonly --initial_regressor 2_1vw.model --predictions stackoverflow_valid_preds.txt --loss_function hinge --oaa 10 --ngram 2 --passes 1 --bit_precision 28 --random_seed 17 --ngram 2 --hash_seed 0 --oaa 10 --link identity
ignoring duplicate option: '--oaa 10'
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = stackoverflow_valid.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000          

0.9325346646452743

**Question 1.** Which parameter set provides the best accuracy on the validation set `stackoverflow_valid.vw`?
- bigrams and 3 passes
- trigrams and 5 passes
- ##### bigrams and 1 pass
- unigrams and 1 pass

Check the best (according to validation accuracy) model on the test set. 

In [11]:
# Your code here
y_true_test = pd.read_csv('stackoverflow_test_labels.txt', names=['pred'])
test_cmd = "vw -d stackoverflow_test.vw -t -i 2_1vw.model -p stackoverflow_test_preds.txt --loss_function hinge --ngram 2 --passes 1 --random_seed 17"
! echo $test_cmd
! $test_cmd
y_pred = pd.read_csv('stackoverflow_test_preds.txt', names=['pred'])
acc_test = accuracy_score(y_true_test, y_pred)
print(acc_test)

vw -d stackoverflow_test.vw -t -i 2_1vw.model -p stackoverflow_test_preds.txt --loss_function hinge --ngram 2 --passes 1 --random_seed 17
Generating 2-grams for all namespaces.
only testing
predictions = stackoverflow_test_preds.txt
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = stackoverflow_test.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0        2        7      354
0.500000 0.000000            2            2.0        7        7      146
0.250000 0.000000            4            4.0        5        5      516
0.125000 0.000000            8            8.0        7        7      286
0.125000 0.125000           16           16.0        6        6      716
0.062500 0.000000           32           32.0        2        2      798
0.062500 0.062500           64           6

In [12]:
(acc_test-acc_valid_max)*100

-0.01797653890792672

**Question 2.** Compare best validation and test accuracies. Choose the correct answer (% is a percent here i.e. a drop from 50% to 40% would be 10%, not 20%).
- Test accuracy is lower by approx. 2%
- Test accuracy is lower by approx. 3%
- ##### difference is less than 0.5%

Train VW with parameters selected on the validation set, but first merge the training and validation sets. Evaluate the share of correct answers on the test set. 

In [13]:
# Your code here
!cat stackoverflow_train.vw stackoverflow_valid.vw > stackoverflow_train_big.vw

In [14]:
!vw -d stackoverflow_train_big.vw --loss_function hinge --oaa 10 --ngram 2 --passes 1 -b 28 --random_seed 17 --readable_model readable.vw.model -f vw.model -c
!vw -d stackoverflow_test.vw -t -i vw.model -p stackoverflow_test_preds.txt
y_pred = pd.read_csv('stackoverflow_test_preds.txt', names=['pred'])
acc_final = accuracy_score(y_true_test, y_pred)

Generating 2-grams for all namespaces.
final_regressor = vw.model
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
using cache_file = stackoverflow_train_big.vw.cache
ignoring text input in favor of cache input
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0        1        1      320
0.500000 1.000000            2            2.0        4        1      134
0.750000 1.000000            4            4.0        7        1      174
0.750000 0.750000            8            8.0        7        1      188
0.750000 0.750000           16           16.0        7        7      416
0.781250 0.812500           32           32.0        7        2      346
0.750000 0.718750           64           64.0        3        3      406
0.648438 0.546875          128          128.0        1        7       56
0.617188 0.585938      

In [15]:
print(acc_final)
print(round((acc_final-acc_test)*100, 1))

0.9366972928562738
0.4


**Question 3.** How large is the gain after training with 2x the data (training `stackoverflow_train.vw` + validation `stackoverflow_valid.vw`) versus the model trained solely on `stackoverflow_train.vw`?
 - 0.1%
 - ##### 0.4%
 - 0.8%
 - 1.2%

### 5. Conclusion

We have only just scratched the surface with Vowpal Wabbit in this assignment. Here are some hints on what to do next:
 - multilabel classification (`multilabel_oaa` argument) – data format perfectly matches with this type of problem
 - Tuning VW parameters with hyperopt. VW developers say that the accuracy strongly depends on gradient descent (`initial_t` and `power_t`) parameters. Also, we can test different loss functions i.e. train logistic regression and linear SVM
 - Learn about factorization machines and its implementation in VW (the `lrq` argument)