# install binary into system path
echo "PATH=/usr/local/bin:$PATH" >> ~/.bash_profile && \
source ~/.bash_profile

# Vowpal Wabbit
VW (Vowpal Wabbit) is a flexible tool for online and batch learning of (mostly) linear models. It has been designed with efficiency and simplicity as its core principles. These notes are not a complete description of VW, but they point to more information wherever possible.

These notes cover what I found important after reading the [Vowpal Wabbit tutorial for the Uninitiated](https://www.zinkov.com/posts/2013-08-13-vowpal-tutorial/), the [wiki](https://github.com/JohnLangford/vowpal_wabbit/wiki) and Yahoo's Scalable Machine Learning Wiki.

## Installation on Mac
```shell
# install the boost library
brew install boost

# get vw
git clone https://github.com/JohnLangford/vowpal_wabbit.git

# change directory
cd vowpal_wabbit

# install
make

# test
make test


```


## Input Format

    [key \t] label [importance] [base] [tag]|namespace feature ... |namespace feature ... \n
where namespace = `string[:float]` and feature = `string[:float]`

Example:
```
1.0 1.0 w1;1.0;1.0;W6K6lkoG7v5SHMYjTWrTwQbhrV.h5E3Dcj8ADTnv|a gpos3ttfc:0.218977 gpos6ttc gpos8ctr:0.275933  |q int2_spans_count:0.255002 max_df_localdcat:0.307283 min_idf_webf1:0.351295 int2_class_place_name vikings int3_spans_count:0.130118 max_idf_webf1:0.283068 max_idf_webf2:0.262892 nwords:0.231262 avg_idf_webf2:0.249664 minnesota qnav:0.496886
```

Here, the label is 1.0, the importance weight is 1.0, and the tag `w1;1.0;1.0;W6K6lkoG7v5SHMYjTWrTwQbhrV.h5E3Dcj8ADTnv` carries assorted meta-data, such as the page-view ID of the page corresponding to these features. The example has two namespaces: `a` and `q`.

Note that VW admits sparse representations and string feature names. In essence, the input format is little more than groups of key-value pairs, where the value can be omitted if it is equal to 1 and the whole pair can be omitted if the value is zero (thus, implicitly, a missing feature is assumed to have a value of zero).
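The sparse encoding rules above can be sketched in a few lines of shell. This is only an illustration, not part of VW: the helper name `to_vw_features`, the `name=value` input layout, the tag `example_tag` and the zero-valued feature `gpos9` are all made up for this example.

```shell
# to_vw_features: read space-separated "name=value" pairs on stdin
# (one example per line, every pair must contain "=") and print them
# as VW features with a leading space before each feature.
to_vw_features() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[2] + 0 == 0) continue     # zero-valued features are left out
      # a value of exactly 1 is implied, so only the name is written
      out = out " " ((kv[2] + 0 == 1) ? kv[1] : kv[1] ":" kv[2])
    }
    print out
  }'
}

# Hypothetical dense input for namespace "a".
dense='gpos3ttfc=0.218977 gpos6ttc=1 gpos8ctr=0.275933 gpos9=0'
features=$(printf '%s\n' "$dense" | to_vw_features)

# label, importance, tag, then the namespace directly after "|"
printf '1.0 1.0 %s|a%s\n' "example_tag" "$features"
```

The printed line has the same shape as the namespace `a` part of the example above: `gpos6ttc` appears without a value and `gpos9` is dropped entirely.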

Tags are important for passing identifiers and meta-data through to the predictions. In the absence of a tag, only the output of the model (the score) will be written. Namespaces are useful for grouping features, which comes in handy when constructing quadratic pairs.
An optional but important piece of an example is the tab-separated "key" at the beginning: this key is completely ignored by VW, but is useful for data generation with Hadoop streaming, because during the reduce phase the outputs will be sorted by this key. Typical values are a random integer (to shuffle the data) or the time-stamp of the page-view (to have the data in sequential order).
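Outside of Hadoop, the same key idea can be imitated locally. The sketch below is an assumption-laden illustration: the helper names `add_key_and_sort` and `strip_key` and the three input lines are invented here. It prepends a random integer key and a tab, sorts by the key, and then strips the key again.

```shell
# Prepend a pseudo-random integer key and a tab to every line, then sort
# numerically on that key. This shuffles the examples.
add_key_and_sort() {
  awk 'BEGIN { srand(42) } { printf "%d\t%s\n", int(rand() * 1000000), $0 }' |
    sort -n -k1,1
}

# Drop the key again (everything up to and including the first tab).
strip_key() {
  cut -f2-
}

# Hypothetical input examples.
input=$(printf '%s\n' '1 |a x' '0 |a y' '1 |a z')

printf '%s\n' "$input" | add_key_and_sort | strip_key
```

Replacing the random key with a time-stamp column would give sequential order instead of a shuffle. The resulting order depends on the awk implementation's random number generator, so no particular output is shown.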

Note that the tag must be adjacent to the namespace separator '|' with no whitespace in between. Similarly, the namespace must follow '|' without any whitespace.

## Training
```
zcat train.vw.gz | vw --cache_file train.cache -f data.model --compressed
```
`--cache_file arg` is where the data is stored in a format that is easier for vw to reuse. `-f arg` specifies the filename of the output model/predictor; by default none is created. `--compressed` makes vw try to read the data, and store caches and models, in a gzip-compressed format.

## Validation and Testing
```
zcat test.vw.gz | vw -t --cache_file test.cache -i data.model -p test.pred
```
The `-t` option tells vw to ignore the labels and not train on the data. The `-i arg` option specifies the model to load. The `-p arg` option specifies the output file for the predictions.
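Putting training and testing together, a minimal end-to-end sketch might look like the following. The dataset is made up (namespace `f`, features `a` and `b`), the files live in a scratch directory, and the vw steps are skipped when vw is not installed. Only flags already mentioned in these notes are used.

```shell
# Scratch directory so that no existing files are overwritten.
workdir=$(mktemp -d)

# Tiny made-up dataset: a label, then namespace "f" with two features.
printf '%s\n' '1 |f a:1.0 b:0.5' '0 |f a:0.2 b:0.9' '1 |f a:0.8 b:0.1' > "$workdir/train.vw"
printf '%s\n' '1 |f a:0.9 b:0.3' '0 |f a:0.1 b:0.7' > "$workdir/test.vw"

if command -v vw >/dev/null 2>&1; then
  # Train and save the model.
  vw -d "$workdir/train.vw" -f "$workdir/data.model"
  # Load the model, predict without training, write one score per line.
  vw -d "$workdir/test.vw" -t -i "$workdir/data.model" -p "$workdir/test.pred"
  # Show each prediction next to the example it belongs to.
  paste -d ' ' "$workdir/test.pred" "$workdir/test.vw"
else
  echo "vw not found on PATH; skipping the training and test steps"
fi
```

With only three training examples the scores are not meaningful; the point is the order of the flags and which file each one reads or writes.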

## Model Inspection
If you want a human-readable model, use `--readable_model arg` instead of `-f arg`. To preserve the feature names instead of seeing hashes, use `--invert_hash arg`.

## Example - Boston Housing Prices
We will predict housing prices in Boston.

```
vw boston.data.vw --invert_hash boston.model
```
`boston.model`:
```
Version 7.3.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:0.013058
^B:0.013684
^CHAS:3.058681
^CRIM:-0.047248
^DIS:0.385468
^INDUS:-0.052709
^LSTAT:-0.165589
^NOX:3.014200
^PTRATIO:0.124905
^RAD:-0.072961
^RM:0.713633
^TAX:-0.000079
^ZN:0.054472
Constant:4.484257
```
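A quick way to read such a file is to rank the features by the magnitude of their weights. The sketch below assumes the `name:weight` layout shown in the listing above (other vw versions may add more columns). The helper name `top_weights` is made up, and the sample file is an abbreviated copy of that listing.

```shell
# Abbreviated copy of the readable model above, written to a temp file.
model=$(mktemp)
cat > "$model" <<'EOF'
Version 7.3.0
Min label:0.000000
Max label:50.000000
bits:18
options:
:0
^CHAS:3.058681
^CRIM:-0.047248
^LSTAT:-0.165589
^NOX:3.014200
Constant:4.484257
EOF

# top_weights FILE N: print the N "name weight" pairs with the largest |weight|.
top_weights() {
  awk -F: '
    /^options:/ { body = 1; next }     # weights start after the header
    body && $1 != "" {                 # skip the unnamed ":0" line
      w = $2 + 0
      printf "%f\t%s\t%s\n", (w < 0 ? -w : w), $1, $2
    }' "$1" |
    sort -t "$(printf '\t')" -k1,1nr |
    awk -F'\t' '{ print $2, $3 }' |
    head -n "$2"
}

top_weights "$model" 3
# Constant 4.484257
# ^CHAS 3.058681
# ^NOX 3.014200
```

Raw weights are only comparable between features on similar scales, so treat the ranking as a first look rather than a measure of importance.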
```
^AGE:104042:28.8:0.0188122@1.18149e+09
```
The line above is a per-feature entry in the format vw prints in `--audit` mode. `^AGE` is the feature name, 104042 is the hashed value for the feature, 28.8 is the value of that feature for this example, 0.0188122 is the weight we have learned thus far for the feature, and 1.18149e+09 is the sum of squared gradients over the feature values, which is used for adjusting the weight on the feature.
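To make the field layout concrete, the entry can be split apart in plain shell. This is a sketch that assumes exactly the `name:hash:value:weight@sum` layout of the line above; the variable names are mine. Because the model is linear, the value times the weight is what this feature adds to the prediction for this example.

```shell
entry='^AGE:104042:28.8:0.0188122@1.18149e+09'

# Split on ":" into the first three fields; the remainder is "weight@sum".
IFS=: read -r name hash value rest <<EOF
$entry
EOF
weight=${rest%@*}         # part before the "@"
sum_sq_grad=${rest#*@}    # part after the "@"

# Contribution of this feature to the linear prediction: value * weight.
contribution=$(awk -v v="$value" -v w="$weight" 'BEGIN { printf "%.6f", v * w }')

echo "name=$name hash=$hash value=$value weight=$weight sum_sq_grad=$sum_sq_grad"
echo "contribution=$contribution"
# contribution=0.541791
```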

## Tuning the learning algorithm
By default, vw minimizes the following function:
<img src='./img/minfunction.png' />
Here λ1 and λ2 are the L1 and L2 regularization coefficients, and ℓ is the loss function.
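In case the image does not render, the objective can be written out as below. This is a sketch of the usual L1/L2-regularized form, on the assumption that the image shows the same expression; the symbols $w$ (weight vector), $x_i$ (features of example $i$) and $y_i$ (its label) are introduced here and do not appear elsewhere in these notes.

$$
\min_{w} \; \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2 + \sum_i \ell(x_i, y_i, w)
$$

With both coefficients at zero, the objective reduces to the plain sum of per-example losses.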

`--l1 arg` and `--l2 arg` control the L1 and L2 regularization lambda values

`--loss_function arg` defaults to `squared`, but can also be `logistic`, `hinge`, `quantile` or `poisson`

`-l arg` [`--learning_rate arg`] (=0.5) sets the initial learning rate

`--decay_learning_rate arg` (=1) sets the decay factor for the learning rate between passes

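`--feature_mask arg` is usually used together with `--l1` to learn which features to include: the first run selects features with L1 regularization, and the second run trains only on the features that survived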
```shell
vw -d data.vw --l1 0.001 -f features.model
vw -d data.vw -i features.model --feature_mask features.model -f final.model
```

`--keep arg` keep namespaces beginning with character arg

`--ignore arg` ignore namespaces beginning with character arg

`--adaptive` (on by default) use adaptive learning rate $\alpha$ for each parameter in the model 

`--normalized` (on by default) use per feature normalized updates

`--invariant` (on by default) use safe, importance-aware updates

## Multiclass
### One vs all
`--oaa k` trains a one-against-all classifier over k classes. The only requirement is that the labels are integers from 1 to k (inclusive).
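Data with string class names has to be relabelled first. A minimal sketch, assuming each input line starts with a class name followed by its features (the helper name `to_oaa_labels` and the sample data are invented here):

```shell
# Replace the leading class name with an integer id, assigned in order of
# first appearance, and add the "|" separator after the label.
to_oaa_labels() {
  awk '{
    if (!($1 in id)) id[$1] = ++k     # first time this class is seen
    $1 = id[$1] " |"
    print
  }'
}

printf '%s\n' 'cat fur whiskers' 'dog fur bark' 'cat purr' | to_oaa_labels
# 1 | fur whiskers
# 2 | fur bark
# 1 | purr
```

The number of distinct classes found this way (2 here) is the value to pass to `--oaa`.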

## Nonlinear (Quadratic and Cubic)
`--quadratic arg` [`-q arg`] creates quadratic features between 2 namespaces beginning with characters in arg (limitation: only first letters of the namespace can be used)

`--cubic arg` creates cubic features between 3 namespaces beginning with characters in arg (limitation: only first letters of the namespace can be used)
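To see which pairs `-q ab` would cross, the sketch below enumerates them for one made-up example line. It only illustrates the idea: vw builds these interactions internally on hashed features, and the `x*z` naming is just for display.

```shell
# Hypothetical example line with two namespaces, a and b.
line='1 |a x:2 y |b z:3 w'

# Cross every feature of the namespace starting with "a" with every
# feature of the namespace starting with "b"; a missing value means 1.
pairs=$(printf '%s\n' "$line" | awk -F'|' '
{
  for (s = 2; s <= NF; s++) {
    n = split($s, tok, " ")
    ns = substr(tok[1], 1, 1)          # only the first letter matters
    for (i = 2; i <= n; i++) {
      m = split(tok[i], kv, ":")
      val = (m > 1) ? kv[2] : 1
      if (ns == "a") { an[++na] = kv[1]; av[na] = val }
      if (ns == "b") { bn[++nb] = kv[1]; bv[nb] = val }
    }
  }
  for (i = 1; i <= na; i++)
    for (j = 1; j <= nb; j++)
      print an[i] "*" bn[j] ":" av[i] * bv[j]
}')

printf '%s\n' "$pairs"
# x*z:6
# x*w:2
# y*z:3
# y*w:1
```

Each pair's value is the product of the two original values, which is why the number of features grows quickly when large namespaces are crossed.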

## Active Learning
For unlabelled data, vw can ask for a label whenever it decides it needs one. The labeller can be a human, or a process listening on a certain port.

```shell
cat data.vw | vw --active_simulation --active_mellowness 0.000001
```