# Welcome to "NLP in VW"!

The goal here is to get you comfortable with using `vw` for basic NLP tasks, like binary classification. We will explore some of the `vw` options that are particularly useful for language problems and then future notebooks will go beyond binary classification. The topics we'll cover are:

* [What is binary classification?](#binary)
* [Constructing a data set](#data)
* [Running vw](#run)
* [Multiple passes over the data](#passes)
* [Saving the model](#save)
* [Making predictions on test data](#test)
* [Cheat sheet and next steps](#cheat)

# <a id='binary'></a> What is Binary Classification?

The job of a binary classifier is to learn to map inputs (usually called $\mathbf x$) to binary labels (usually called $+1$ and $-1$). A simple example is sentiment classification. Given some text (perhaps a movie review), determine whether the overall sentiment expressed by that review is positive or negative toward the movie.

Because this is a machine learning application, this mapping is **induced** from training data. The training data consists of a (hopefully large!) set of labeled examples: movie reviews **paired** with the correct label (positive or negative). The classic data set for this is from [Pang and Lee](http://www.cs.cornell.edu/people/pabo/movie-review-data/); this is the data we will work with later. For comparison purposes, it's worth keeping in mind that the best performance Pang and Lee achieve on this data, in the [2004 paper](http://www.cs.cornell.edu/home/llee/papers/cutsent.pdf) that introduced it, is about 13% error. In this tutorial we'll get 15% error, and in the [subsequent tutorial](GettingTheMost.ipynb) we'll get 10%.

Once the classifier (mapping from review to sentiment) has been learned, we can apply it to new reviews that are missing ratings to predict what the rating probably would have been. We usually care about the **accuracy** of this classifier: what percentage of its predictions are correct. Equivalently, we can talk about its **error rate**: what percentage of predictions it gets wrong. Of course, we want to be able to measure this, so we hold out some test data on which to evaluate the classifier.
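To make the relationship between error rate and accuracy concrete, here is a minimal sketch in plain Python. The function name `error_rate` and the toy label lists are made up for illustration; nothing here is part of `vw`.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of predictions that disagree with the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('need exactly one prediction per label')
    mistakes = sum(1 for y, p in zip(true_labels, predicted_labels) if y != p)
    return mistakes / float(len(true_labels))

# Toy example: five reviews, one of which is misclassified.
truth = [+1, -1, +1, +1, -1]
guess = [+1, -1, -1, +1, -1]
err = error_rate(truth, guess)
print('error = {0:.2f}, accuracy = {1:.2f}'.format(err, 1 - err))
```

Accuracy is just one minus the error rate, so the two numbers carry the same information. The rest of this notebook reports error.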

## Sounds Great, Let's Do It!

There are two prerequisites: we need to make sure `vw` is installed and we need some data. If `vw` is installed correctly, and is in your path, the following should work:

In [3]:
!vw --version

8.1.1


If you get some error like "vw: not found" then `vw` is either not installed correctly or is not in your path.

# <a id='data'></a>  Constructing Your Data Set

We also need some data on which to train and test a classifier. We'll download the Pang and Lee data referenced above.

In [4]:
!mkdir data
!curl -o data/review_polarity.tar.gz http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
!tar zxC data -f data/review_polarity.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3053k  100 3053k    0     0  1045k      0  0:00:02  0:00:02 --:--:-- 1045k


We can take a look at the beginning of one of the positive reviews and one of the negative reviews:

In [5]:
!head -n3 data/txt_sentoken/pos/cv000_29590.txt

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 


In [6]:
!head -n3 data/txt_sentoken/neg/cv000_29416.txt

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 


Okay, so our first job is to put this data into `vw` format. Luckily this data is already lowercased and tokenized (words are separated from punctuation by extra spaces), so we don't have to deal with that issue.

This format is quite flexible, and we'll see additional fun things you can do later, but for now, the basic file format is one example per line, with the label first, then a vertical bar (pipe), and then all of the features. If we're using a bag of words representation (a good starting point for text data), the features are just the individual words in the text. For example, for the two files above, we'd want to create two `vw` examples like:

    +1 | films adapted from comic books have had plenty of success , whether they're ...
    -1 | plot : two teen couples go to a church party , drink and then drive . they get into ...
    
However, there's an issue here. There are two **reserved characters** in the `vw` example format: colon (`:`) and pipe (`|`). This means we need to convert these characters to anything else.

Let's write a little python to do this conversion. You could do it just with `sed` and friends, but this is an iPython notebook, so why not do it that way?

In [7]:
from __future__ import print_function

def textToVW(lines):
    return ' '.join([l.strip() for l in lines]).replace(':','COLON').replace('|','PIPE')

def fileToVW(inputFile):
    # use a context manager so the file handle is closed after reading
    with open(inputFile, 'r') as f:
        return textToVW(f.readlines())

print(fileToVW('data/txt_sentoken/neg/cv000_29416.txt')[:50])

plot COLON two teen couples go to a church party ,


Here, we see the first few words of the negative review, with ':' replaced by COLON (this is safe because all the other text is lowercased) and '|' replaced by PIPE.
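As a sanity check on the format, a converted line can be split back into its label and its bag of words. The sketch below is only an illustration: `parseVWLine` is a hypothetical helper, and it handles just the plain format used in this notebook (one pipe, no namespaces, no explicit feature values).

```python
from collections import Counter

def parseVWLine(line):
    """Split a simple 'LABEL | tok tok ...' line into (label, bag of words)."""
    label, _, text = line.partition('|')
    # int() accepts '+1 ' and '-1 ' (surrounding whitespace and a leading sign are fine)
    return int(label), Counter(text.split())

label, bag = parseVWLine('-1 | plot COLON two teen couples go to a church party , drink and then drive .')
print('{0} {1} {2}'.format(label, bag['COLON'], bag['plot']))
```

If the escaping had been skipped, the original `:` would still be sitting in the text, which is exactly the situation the `COLON` replacement avoids.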

Now we just need to read in all the positive examples and all the negative examples:

In [8]:
import os

def readTextFilesInDirectory(directory):
    return [fileToVW(directory + os.sep + f) 
            for f in os.listdir(directory)
            if  f.endswith('.txt')]

examples = ['+1 | ' + s for s in readTextFilesInDirectory('data/txt_sentoken/pos')] + \
           ['-1 | ' + s for s in readTextFilesInDirectory('data/txt_sentoken/neg')]

print('{0} total examples read'.format(len(examples)))

2000 total examples read


Now that we've got all the files, we put "`+1 | `" at the beginning of the positive ones and "`-1 | `" at the beginning of the negative ones. *Voila*, we have our `vw` data.

We'll now generate some training data and some test data. To achieve this, we're going to permute the examples (after setting a random seed for reproducibility, [hopefully okay cross-platform](http://stackoverflow.com/questions/9023660/how-to-generate-a-repeatable-random-number-sequence)) and then take the first 80% as training data and the last 20% as test data.

The fact that we're permuting the data is **very important**. By default, `vw` uses an online learning strategy, and if we did something silly like putting all the positive examples before the negative examples, learning would take a LONG time. More on this later.

In [9]:
import random
random.seed(1234)
random.shuffle(examples)   # this does in-place shuffling
# print out the labels of the first 50 examples to be sure they're sane:
print(''.join(s[0] for s in examples[:50]))

+---++-+-+++-+-+++--+++-+++++++-++--+-+----++++-++


Now, we can write the first 1600 to a training file and the last 400 to a test file.

In [10]:
def writeToVWFile(filename, examples):
    with open(filename, 'w') as h:
        for ex in examples:
            print('{0}'.format(ex), file=h)
            
writeToVWFile('data/sentiment.tr', examples[:1600])
writeToVWFile('data/sentiment.te', examples[1600:])

!wc -l data/sentiment.tr data/sentiment.te

    1600 data/sentiment.tr
     400 data/sentiment.te
    2000 total


At this point, everything is properly set up and we can run `vw`!

# <a id='run'></a>  Running VW for the First Time

In [11]:
!vw --binary data/sentiment.tr

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/sentiment.tr
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000  -1.0000      707
1.000000 1.000000            2            2.0  -1.0000   1.0000      573
0.500000 0.000000            4            4.0  -1.0000  -1.0000      558
0.625000 0.750000            8            8.0   1.0000  -1.0000      356
0.750000 0.875000           16           16.0   1.0000  -1.0000     1043
0.531250 0.312500           32           32.0  -1.0000   1.0000     1034
0.578125 0.625000           64           64.0   1.0000  -1.0000      472
0.515625 0.453125          128          128.0  -1.0000   1.0000      480
0.449219 0.382812          256          256.0   1.0000   1.0000     1645
0.373047 0.296875          512          512.0  -1.0000  -1.0

This output consists of three parts:

1. The header, which displays some information about the parameters `vw` is using to do the learning (number of bits, learning rate, ..., number of sources). We'll discuss (some) of these later.
2. The progress list (the lines with lots of numbers); much more on this below.
3. The footer, which displays some statistics about the success (or failure) of learning. In this case, it says, among other things, that it made one pass over the data, encountered 1600 training examples (yay!) and found a model with an average loss of 28.06%. It also says that it processed 1.2m features (summed over all training examples), which gives some sense of the data size.

One important note is that when we ran `vw`, we added the flag `--binary`, which instructs `vw` to report all losses as zero-one loss.

Let's look first at the first four lines of the progress list:

    average  since         example        example  current  current  current
    loss     last          counter         weight    label  predict features
    1.000000 1.000000            1            1.0   1.0000  -1.0000      740
    0.500000 0.000000            2            2.0   1.0000   1.0000      630
    0.750000 1.000000            4            4.0   1.0000  -1.0000      870
    0.500000 0.250000            8            8.0  -1.0000  -1.0000      526
    
The columns are labeled, which gives some clue as to what's being printed out. The way `vw` works internally is that it processes one example at a time. At every $2^k$th example (examples 1, 2, 4, 8, 16, ...), it prints out a status update. This way you get lots of updates early (as a sanity check) and fewer as time goes on. The third column gives you the example number. The fourth column tells you the total "weight" of examples so far; right now all examples have a weight of 1.0, but for some problems (e.g., imbalanced data), you might want to give different weight to different examples. The fifth column tells you the true current label (+1 or -1) and the sixth column tells you the model's current prediction. Lastly, the seventh column tells you how many features there are in this example.

The first two columns deserve some explanation. In "default" mode, `vw` reports "progressive validation loss." This means that when `vw` sees a training example, it *first* makes a prediction. It then computes a loss on that single prediction. Only after that does it "learn". The average loss computed in this way is the **progressive validation loss.** It has a nice property that it's a good estimate of test loss, *provided you only make one pass over the data*, **and** it's efficient to compute. The first column tells you the average progressive loss over the *entire* run of `vw`; the second column tells you the average progressive loss *since the last time `vw` printed something*.

In practice, this second column is what you want to look at for telling how well your model is doing.
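The predict-then-learn loop is easy to sketch in plain Python. The code below is only an illustration of how the first two columns come about: the learner is a toy perceptron (not the algorithm `vw` actually runs), and the function name and the eight-example data set are made up.

```python
def progressive_validation(examples, report=True):
    """Run a toy perceptron online, predicting on each example before learning from it.

    examples: list of (label, tokens) pairs with label in {+1, -1}.
    Returns the average zero-one loss of the predictions made before each update.
    """
    weights = {}
    total_loss = 0      # mistakes since the start ("average loss" column)
    since_last = 0      # mistakes since the last report ("since last" column)
    last_report = 0     # example counter at the last report
    next_report = 1     # report at examples 1, 2, 4, 8, ...
    for n, (label, tokens) in enumerate(examples, start=1):
        # 1. predict with the current weights
        score = sum(weights.get(t, 0.0) for t in tokens)
        prediction = 1 if score > 0 else -1
        # 2. record the loss on that prediction
        loss = int(prediction != label)
        total_loss += loss
        since_last += loss
        # 3. only now learn from the example (perceptron update on mistakes)
        if loss:
            for t in tokens:
                weights[t] = weights.get(t, 0.0) + label
        if report and n == next_report:
            print('{0:.6f} {1:.6f} {2:>4}'.format(
                total_loss / float(n), since_last / float(n - last_report), n))
            since_last, last_report, next_report = 0, n, 2 * n
    return total_loss / float(len(examples))

toy = [(+1, ['great', 'fun']), (-1, ['dull', 'plot']),
       (+1, ['great', 'cast']), (-1, ['dull', 'mess'])] * 2
pv_loss = progressive_validation(toy)
```

On this toy data the only mistake is on the very first example, so the "since last" column drops to zero immediately while the running average decays more slowly. That lag is why the second column is the more useful one to watch.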

# Your Second Run of VW

There are a couple of things we need to do to get a useful system. The first is that for most data sets, a single online pass over the data is insufficient -- we need to run more than one. The second is that we actually need to store the model somewhere so that we can make predictions on test data! We'll go through these in order.

## <a id='passes'></a>  Running More than One Pass

On the surface, running more than one pass seems like an easy thing to ask `vw` to do. It's a bit more complicated than it might appear.

The first issue is that one of the main speed bottlenecks for `vw` is file IO. Reading, and parsing, your input data is incredibly time consuming. In order to get around this, when multiple passes over the data are requested, `vw` will create and use a **cache file**, which is basically a second copy of your data stored in a `vw`-friendly, efficient, binary format. So if you want to run more than one pass, you have to tell `vw` to create a cache file.

Here's an example running 5 passes:

In [10]:
!vw --binary data/sentiment.tr --passes 5 -c -k

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = data/sentiment.tr.cache
Reading datafile = data/sentiment.tr
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000  -1.0000      740
0.500000 0.000000            2            2.0   1.0000   1.0000      630
0.750000 1.000000            4            4.0   1.0000  -1.0000      870
0.500000 0.250000            8            8.0  -1.0000  -1.0000      526
0.687500 0.875000           16           16.0   1.0000  -1.0000      490
0.562500 0.437500           32           32.0  -1.0000   1.0000      454
0.515625 0.468750           64           64.0  -1.0000   1.0000      520
0.398438 0.281250          128          128.0   1.0000   1.0000      563
0.382812 0.367188          256          256.0   1.0000   1.0000     1311
0.357

In this command, we added three new command-line options:

* `--passes 5`: this is the most obvious one: it tells `vw` to run five passes over the data.
* `-c`: this tells `vw` to automatically create and use a cache file; `vw` constructs this cache file in `foo.cache` where `foo` is the name of your input data (in the `vw` header it informs you that it's creating a file called `data/sentiment.tr.cache` for caching)
* `-k`: by default, if `vw` uses a cache file, it *first* checks to see if the file exists. If the cache file already exists, it completely ignores the data file (`sentiment.tr`) and *just* uses the cache file. This is great if your data never changes because it makes the first pass slightly faster. However, I often change my data between `vw` runs and it's *really* annoying to spend two hours debugging only to find out that `vw` is ignoring the new data in favor of its old cache file. `-k` tells `vw` to "kill" the old cache file: even if it exists, it should be recreated from scratch.

(Warning: if you're running multiple jobs on the same file in parallel, you will get clashes on the cache file. You should either create a single cache file ahead of time and use it for all jobs [remove `-k` in that case], *or* you should explicitly give your own file names to the cache by saying `--cache_file myfilename0.cache` instead of `-c`.)
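One way to avoid cache clashes between parallel jobs is to generate the command lines programmatically, giving each job its own cache and model file. The sketch below only builds the argument lists; it does not run `vw`. The helper name `trainingCommand` and the file-naming scheme are hypothetical, and it relies on `vw`'s `--cache_file` option for naming the cache explicitly (check `vw --help` on your installed version).

```python
def trainingCommand(dataFile, jobName, passes):
    """Build a vw argument list that uses its own cache and model file."""
    return ['vw', '--binary', dataFile,
            '--passes', str(passes),
            '--cache_file', '{0}.{1}.cache'.format(dataFile, jobName),
            '-k',
            '-f', '{0}.{1}.model'.format(dataFile, jobName)]

# Three hypothetical jobs that differ only in the number of passes.
commands = [trainingCommand('data/sentiment.tr', 'job{0}'.format(i), passes)
            for i, passes in enumerate([5, 10, 20])]
for cmd in commands:
    print(' '.join(cmd))
```

Each list could then be handed to something like `subprocess.Popen` to launch the jobs side by side without them overwriting each other's cache.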

If you're particularly attentive, you might have noticed that there are a few "`h`"s in the progress list (and in the printing of the average loss at the end).

This is **holdout** loss. Remember all that discussion of progressive validation loss? Well, it's useless when you're making more than one pass. That's because on the second pass, you'll already have trained on all the training data, so your model is going to be exceptionally good at making predictions.

`vw`'s default solution to this is to hold out a fraction of the training data as validation data. By default, it will hold out **every 10th example** for validation. The holdout loss (signaled by the `h`) is then the average loss, *limited to these 10% of the training examples*. (Note that on the first pass, it still prints progressive validation loss because this is a safe thing to do.)
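The "every 10th example" rule can be pictured with a few lines of Python. This is only a sketch of the idea: `splitHoldout` is a made-up helper, and which position within each block of ten `vw` actually picks is an assumption here.

```python
def splitHoldout(examples, period=10):
    """Put every `period`-th example in the holdout set, the rest in training."""
    train, holdout = [], []
    for i, ex in enumerate(examples, start=1):
        (holdout if i % period == 0 else train).append(ex)
    return train, holdout

# Stand-in for our 1600 training examples: just their positions.
train, holdout = splitHoldout(list(range(1, 1601)))
print('{0} training, {1} held out'.format(len(train), len(holdout)))
```

With 1600 training examples, this leaves 1440 to learn from and 160 on which the holdout loss is measured.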

## <a id='save'></a>  Saving the Model and Making Test Predictions

Now that we know how to do several passes and get heldout losses, we might want to actually save the learned model to a file so we can make predictions on test data! This is easy: we just tell `vw` where to save the final model using `-f file` (`-f` means "final"). Let's do this, and crank up the number of passes to 20:

In [11]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model

final_regressor = data/sentiment.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = data/sentiment.tr.cache
Reading datafile = data/sentiment.tr
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000  -1.0000      740
0.500000 0.000000            2            2.0   1.0000   1.0000      630
0.750000 1.000000            4            4.0   1.0000  -1.0000      870
0.500000 0.250000            8            8.0  -1.0000  -1.0000      526
0.687500 0.875000           16           16.0   1.0000  -1.0000      490
0.562500 0.437500           32           32.0  -1.0000   1.0000      454
0.515625 0.468750           64           64.0  -1.0000   1.0000      520
0.398438 0.281250          128          128.0   1.0000   1.0000      563
0.382812 0.367188          256         

And now, we have a model:

In [12]:
!ls -l data/sentiment.model

-rw-r--r-- 1 hal hal 283246 Jan  8 15:13 data/sentiment.model


One thing you might have noticed is that even though we asked `vw` for 20 passes, it actually only did 9! (It tells you this in the footer.) This happens because by default `vw` does early stopping: if the holdout loss ceases to improve for three passes over the data, it stops optimizing and stores the *best* model found so far. We will later see how to adjust these defaults.
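The early stopping rule can be sketched as follows. This is an illustrative reading of "ceases to improve for three passes", not `vw`'s actual implementation; the function name and the holdout losses are made up.

```python
def earlyStopping(holdoutLosses, patience=3):
    """Return (best pass, number of passes run) for a list of per-pass holdout losses.

    Training stops once `patience` consecutive passes fail to improve on
    the best holdout loss seen so far.
    """
    bestLoss, bestPass, badPasses = float('inf'), 0, 0
    passesRun = 0
    for passNumber, loss in enumerate(holdoutLosses, start=1):
        passesRun = passNumber
        if loss < bestLoss:
            bestLoss, bestPass, badPasses = loss, passNumber, 0
        else:
            badPasses += 1
            if badPasses >= patience:
                break
    return bestPass, passesRun

# Made-up holdout losses for 20 requested passes.
losses = [0.30, 0.24, 0.21, 0.19, 0.18, 0.17, 0.18, 0.17, 0.19] + [0.20] * 11
bestPass, passesRun = earlyStopping(losses)
print('best model after pass {0}; stopped after pass {1}'.format(bestPass, passesRun))
```

In this made-up run the best holdout loss occurs on pass 6, the next three passes fail to beat it, and training stops after pass 9 even though 20 passes were requested.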

## <a id='test'></a> Making Predictions

Now we want to make predictions. In order to do this, we have to (a) tell `vw` to load a model, (b) tell it only to make predictions (and not to learn), and (c) tell it where to store the predictions. (Ok, technically we don't need to store the predictions anywhere if all we want to know is our error rate, but I'll assume we actually care about the output of our system.)

In [13]:
!vw --binary -t -i data/sentiment.model -p data/sentiment.te.pred data/sentiment.te

only testing
predictions = data/sentiment.te.pred
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/sentiment.te
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0   1.0000   1.0000      967
0.000000 0.000000            2            2.0   1.0000   1.0000     1043
0.000000 0.000000            4            4.0   1.0000   1.0000      757
0.000000 0.000000            8            8.0  -1.0000  -1.0000      243
0.062500 0.125000           16           16.0  -1.0000   1.0000      345
0.156250 0.250000           32           32.0   1.0000  -1.0000      572
0.125000 0.093750           64           64.0   1.0000   1.0000     1517
0.140625 0.156250          128          128.0  -1.0000  -1.0000      575
0.160156 0.179688          256          256.0  -1.0000  -1.0000 

Let's go through these options in turn:

* `--binary`: as before, tell `vw` that this is a binary classification problem and to report loss as a zero-one value
* `-t`: put `vw` in test mode. You might assume that because we're loading a model to start with, `vw` would be in test mode by default. You would be wrong. Sometimes it's useful to start from a pre-trained model and continue training later.
* `-i data/sentiment.model`: tell `vw` to load an **i**nitial model from the specified file
* `-p data/sentiment.te.pred`: store the predictions in the specified file
* `data/sentiment.te`: the data on which to make predictions

One of the most important bits of information in the output is the `average loss` which tells us our test error rate: in this case, 15% error.

We can now take a look at the predictions:

In [14]:
!head data/sentiment.te.pred

1
1
-1
1
-1
-1
-1
-1
-1
1


And yay, we've successfully made predictions!

Because `vw` knows this is a binary classification problem, it's just giving you +1/-1 outputs. In many cases, we want a scalar value, before thresholding occurs. We can do this by asking `vw` for **raw** predictions, using `-r` in lieu of (or in addition to) `-p`:

In [15]:
!vw --binary -t -i data/sentiment.model -r data/sentiment.te.raw data/sentiment.te

only testing
raw predictions = data/sentiment.te.raw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/sentiment.te
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0   1.0000   1.0000      967
0.000000 0.000000            2            2.0   1.0000   1.0000     1043
0.000000 0.000000            4            4.0   1.0000   1.0000      757
0.000000 0.000000            8            8.0  -1.0000  -1.0000      243
0.062500 0.125000           16           16.0  -1.0000   1.0000      345
0.156250 0.250000           32           32.0   1.0000  -1.0000      572
0.125000 0.093750           64           64.0   1.0000   1.0000     1517
0.140625 0.156250          128          128.0  -1.0000  -1.0000      575
0.160156 0.179688          256          256.0  -1.0000  -1.00

In [16]:
!head data/sentiment.te.raw

0.786418
1.720858
-0.315573
0.386969
-1.752520
-1.432538
-0.474776
-0.189435
-2.108955
1.990319


The `.raw` file now contains the un-thresholded predictions. Anything greater than 0 gets mapped to +1 and anything less than zero gets mapped to -1.
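The thresholding step itself is a one-liner. The sketch below applies it to the ten raw scores printed above. The function name is made up, and sending a score of exactly zero to -1 is an arbitrary choice, since the text only covers the strictly positive and strictly negative cases.

```python
def threshold(rawScore):
    """Map a raw score to a +1/-1 label (a score of exactly 0 goes to -1 here)."""
    return 1 if rawScore > 0 else -1

raw = [0.786418, 1.720858, -0.315573, 0.386969, -1.752520,
       -1.432538, -0.474776, -0.189435, -2.108955, 1.990319]
labels = [threshold(r) for r in raw]
print(labels)
```

The resulting labels match the first ten lines of `data/sentiment.te.pred` shown earlier. The magnitude of the raw score gives a rough sense of how far each example is from the decision boundary.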

For fun, we can also compute our error rate on the training data. This should be lower:

In [17]:
!vw --binary -t -i data/sentiment.model data/sentiment.tr

only testing
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/sentiment.tr
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0   1.0000   1.0000      740
0.000000 0.000000            2            2.0   1.0000   1.0000      630
0.000000 0.000000            4            4.0   1.0000   1.0000      870
0.000000 0.000000            8            8.0  -1.0000  -1.0000      526
0.000000 0.000000           16           16.0  -1.0000  -1.0000      529
0.000000 0.000000           32           32.0   1.0000   1.0000     1188
0.000000 0.000000           64           64.0  -1.0000  -1.0000      931
0.015625 0.031250          128          128.0   1.0000   1.0000      662
0.011719 0.007812          256          256.0   1.0000   1.0000      922
0.009766 0.007812          512          512.0  

This is, indeed, quite a bit lower: a 1.56% error rate! Of course, this is cheating.

Sometimes, especially at test time, you don't want `vw` to produce output while running. You can tell it to be quiet with `--quiet`.

Finally, when we're making real predictions on real test data, we often don't have labels. That's fine. If you give `vw` an example without a label, it won't learn on it, but it can still make predictions. We can simulate this on the beginning of the test data, for instance by looking at:

In [18]:
!head data/sentiment.te | cut -d' ' -f2-20

| after watching " rat race " last week , i noticed my cheeks were sore and realized that
| when andy leaves for cowboy camp , his mother holds a yard sale and scrounges in his room
| of course i knew this going in . why is it that whenever a tv-star makes a movie
| the film " magnolia " can be compared to a simple flower as its title and movie poster
| some movies ask you to leave your brain at the door , some movies ask you to believe
| the high school comedy seems to be a hot genre of the moment . with she's all that
| in double jeopardy , the stakes are high . think of the plot as a rehash of sleeping
| its a stupid little movie that trys to be clever and sophisticated , yet trys a bit too
| " goodbye , lover " sat on the shelf for almost a year since its lukewarm reception at
| those of you who frequently read my reviews are not likely to be surprised by the fact that


These are the first ten test examples with their labels (but not the pipe) removed, and only the first 18 tokens kept (`cut -f2-20` keeps 19 space-separated fields, the first of which is the pipe). When making real predictions we'll use all the words, but for printing on the screen this keeps the output small.

We can pipe this directly into `vw`:

In [19]:
!head data/sentiment.te | cut -d' ' -f2- | vw --binary -t -i data/sentiment.model -r /dev/stdout --quiet

0.786418
1.720858
-0.315573
0.386969
-1.752520
-1.432538
-0.474776
-0.189435
-2.108955
1.990319


Here, you can see that (a) `vw` can read data from standard input (in this case, the `head` of the test data), and (b) it can write its output to `/dev/stdout`. Because we ran in `--quiet` mode, all we got were the predictions. And note these are the same predictions as before: `vw` isn't cheating by looking at the correct label when it's in `-t` (test) mode.
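If you would rather strip labels in Python than with `cut`, a sketch along these lines does the same job. `stripLabel` is a hypothetical helper and the two example lines are abbreviated; it assumes every line contains a pipe, as all of ours do.

```python
def stripLabel(vwLine):
    """Drop everything before the first pipe, keeping the pipe and the features."""
    return vwLine[vwLine.index('|'):]

labeled = ['+1 | after watching " rat race " last week',
           '-1 | its a stupid little movie']
unlabeled = [stripLabel(line) for line in labeled]
print('\n'.join(unlabeled))
```

Unlike `cut -d' ' -f2-`, this does not depend on the label being exactly one space-separated field, so it would keep working if something extra were placed before the pipe.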

# <a id='cheat'></a> Cheat Sheet and Next Steps

Train with:

    vw --binary --passes 20 -c -k -f MODEL DATA

Predict with:

    vw --binary -t -i MODEL -r RAWOUTPUT DATA

You're now in a position where you can successfully: download data, process it into `vw` format, train a predictor on it, and use that predictor to make test predictions.

From here, you can:

* Learn how to [adjust some of the default arguments to try to get better performance](GettingTheMost.ipynb)
* Learn how to [adjust example weights for rare category detection and related problems](RareCategory.ipynb)
* Learn how to [do more complicated classification like multiclass classification](MulticlassClassification.ipynb)
* Learn how to [do multiclass classification with label-dependent features / solve ranking problems](MulticlassLDF.ipynb)
* Learn how to [do unsupervised learning, like topic modeling and autoencoding](UnsupervisedNLP.ipynb)
* Learn how to [do structured prediction, like part of speech tagging or dependency parsing](StructuredPrediction.ipynb)
