<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Authors: [Yury Kashnitsky](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

# <center> Assignment 5. Optional part
## <center> Vowpal Wabbit for Stackoverflow question tag classification
    
        
#  <center>  <font color = 'red'> Warning! </font>This is a very useful but ungraded assignment

## Plan
   1. [Introduction](#1.-Introduction)
   2. [Data description](#2.-Data-description)
   3. [Data preprocessing](#3.-Data-preprocessing)
   4. [Training and validation](#4.-Training-and-validation)
   5. [Notes](#5.-Notes)

### 1. Introduction

In this task you'll be doing the same thing that I did at Mail.ru Group – training models with gigabytes of data. You can try to stick to Python and Windows environment, but we strongly recommend some \*NIX-system (with, for instance, Docker) and use bash actively there. Having some experience with bash and UNIX utils is a very important skill for a data scientist.

For this particular task we need Vowpal Wabbit installed (we provide it with docker-container, instructions are given [here](https://mlcourse.ai/prerequisites)) and approximately  50 GB of disk space. I tested solution on ordinary Macbook Pro 2015 (8 cores, 16 GB RAM), the heaviest model was trained in under 12 min, so the task is doable with quite usual hardware.


Supplementary stuff:
 - interactive [tutorial](https://www.codecademy.com/en/courses/learn-the-command-line/lessons/environment/exercises/bash-profile) from CodeAcademy on UNIX command line (1-2 hours)

### 2. Data description

We have 10 GB of questions from StackOverflow split into 75% train and 25% test parts. You can download the training part [from here](https://drive.google.com/file/d/1w8z6HmFe4oCQSG6DjomSRUWvJ-gK0LTe/view?usp=sharing) (~2.5 GB archived, ~8 GB unpacked).

Data format is simple:<br>
<center>*question text* (space delimited words) TAB *question tags* (space delimited)

TAB – is a tabulation symbol.

In [1]:
# customize this
PATH_TO_DATA = '../../data/stackoverflow'

First sample from training set for example:

In [None]:
!head -1 $PATH_TO_DATA/stackoverflow_raw_train_7500k.tsv

Here we have question text, then tab and question tags: *css, css3* and *css-selectors*. And so we have 7.5 mln such questions.

In [None]:
%%time
!wc -l $PATH_TO_DATA/stackoverflow_raw_train_7500k.tsv

Note, that we are not going to load this dataset into memory at any point, feel free to use Unix utilities - `head`, `tail`, `wc`, `cat`, `cut`, etc. to explore and process the dataset. 

### 3. Data preprocessing

Let's select all questions with tags *javascript, java, python, ruby, php, c++, c#, go, scala* and  *swift* from the data source and prepare a training set in the Vowpal Wabbit data format. We be solving a 10-class classification problem: each question can be tagged with one of these tags. 

Generally, as we see, questions may have several tags, but we will simplify our task selecting only questions having one of the tags from the list.

However, it's good to know that VW supports multilabel classification (`--multilabel_oaa` parameter).
<br>
<br>
Implement data preprocessing code in separate file `preprocess.py`. This script is going to select all lines with tags *javascript, java, python, ruby, php, c++, c#, go, scala*, *swift* and write them to a file in VW format. Details:
 - the script takes command line arguments: input and output file paths 
 - lines are processed one-by-one (you can use `tqdm` to track iterations)
 - if a line has no tabs or more than one tab  - then the line is broken, skip it
 - if line has exactly one tab symbol, check, how many tags from list *javascript, java, python, ruby, php, c++, c#, go, scala* or  *swift* are there. If there is only one tag - write string to output with VW format: `label | text`, where `label` – number from 1 to 10 (1 - *javascript*, ... 10 – *swift*). Skip strings with no tags or more than one tag from our list.
 - remove `:` and `|` symbols from question text - these are reserved VW symbols

In [None]:
import os
from tqdm import tqdm
from time import time
import numpy as np
from sklearn.metrics import accuracy_score

You should get 3291403 lines in the processed data file. In our case Python processes 8 GB in ~1.5 min.

In [None]:
%%time
!python preprocessor.py -ip_fp $PATH_TO_DATA/stackoverflow_raw_train_7500k.tsv \
    -op_fp $PATH_TO_DATA/stackoverflow_raw_train_7500k.vw

In [None]:
%%time
!wc -l $PATH_TO_DATA/stackoverflow_raw_train_7500k.vw

In [None]:
!head -2 $PATH_TO_DATA/stackoverflow_raw_train_7500k.vw

Split dataset into training, and validation parts  -  approx. 2/3 shall go to the training - 2194270 lines. We don't need to shuffle the data, first 2194270 lines go into training part `stackoverflow_train.vw`, last 1097133 lines – to the validation test part `stackoverflow_valid.vw`. 

Also, save a vector with correct labels for the validation set into a separate files `stackoverflow_valid_labels.txt`.

Use `head`, `tail`, `split`, `cat` and `cut` linux utils.

In [9]:
!split $PATH_TO_DATA/stackoverflow_raw_train_7500k.vw  $PATH_TO_DATA/stackoverflow_processed -l 2038856

In [10]:
!mv $PATH_TO_DATA/stackoverflow_processedaa $PATH_TO_DATA/stackoverflow_train.vw

In [21]:
!cat $PATH_TO_DATA/stackoverflow_processedab | cut -d ' ' -f 2- > $PATH_TO_DATA/stackoverflow_valid.vw

In [22]:
!cat $PATH_TO_DATA/stackoverflow_processedab | cut -d ' ' -f 1 > $PATH_TO_DATA/stackoverflow_valid_labels.txt

### 4. Training and validation

Train Vowpal Wabbit with `stackoverflow_train.vw` 9 times, changing the number of `passes` (1,3,5) and `ngram` (1,2,3). The rest parameters are: `bit_precision`=28 and `seed`=17. Also tell VW, that it is a 10-class classification problem that we have.

Evaluate accuracy with `stackoverflow_valid.vw` and select best hyperparams.

<font color='red'> Question.</font> Which parameter set provides the best accuracy on validation set `stackoverflow_valid.vw`?
- bigrams (`ngram`=2) and 3 epochs (`passes`=3)
- trigrams and 5 epochs
- bigrams and 1 epoch
- unigrams and 1 epoch

**bigrams and 1 epoch**
**93.3%**

In [25]:
!$PATH_TO_DATA/train.sh

In [31]:
import os
from sklearn.metrics import accuracy_score, roc_auc_score

In [33]:
with open(os.path.join(PATH_TO_DATA, 'stackoverflow_valid_labels.txt')) as pred_file:
    y_valid = [float(label) for label in pred_file.readlines()]

for n_pass in [1, 3, 5]:
    for n_gram in [1, 2, 3]:
        file_name = 'stackoverflow_valid_pred_' + str(n_pass) + str(n_gram) + '.txt'
        with open(os.path.join(PATH_TO_DATA, file_name)) as pred_file:
            test_prediction = [float(label) for label in pred_file.readlines()]
        print("Epochs: {}, N-grams: {}".format(n_pass, n_gram))
        print("Accuracy: {}".format(round(accuracy_score(y_valid, test_prediction), 3)))

Epochs: 1, N-grams: 1
Accuracy: 0.919
Epochs: 1, N-grams: 2
Accuracy: 0.933
Epochs: 1, N-grams: 3
Accuracy: 0.932
Epochs: 3, N-grams: 1
Accuracy: 0.919
Epochs: 3, N-grams: 2
Accuracy: 0.931
Epochs: 3, N-grams: 3
Accuracy: 0.929
Epochs: 5, N-grams: 1
Accuracy: 0.919
Epochs: 5, N-grams: 2
Accuracy: 0.932
Epochs: 5, N-grams: 3
Accuracy: 0.929


### 5. Notes

A note on this task:
- in future, there'll be a Kaggle competition organized with this data
- we could've used  `sklearn` wrapper for Vowpal Wabbit as shown in [this Kernel](https://www.kaggle.com/kashnitsky/training-while-reading-vowpal-wabbit-starter)
- we did not use `hyperopt` package for parameter tuning
- it is better to write results in a log file instead of printing
- for data preprocessing tasks Linux shell utilities are faster than `Python` scripts

However, the solution that you'll get is quite good. And, keeping the data set size in mind, there is no point in heavy hyperparameter tuning. In general, with Vowpal Wabbit you can get reasonable baselines very fast, even in tasks where dataset sizes look intimidating at a first glance.