# Getting the Most Out Of VW

`vw` has *lots* of command-line arguments. For some of them you have to learn a little bit about how `vw` works internally. This notebook assumes that you've made your way successfully through the [Getting Started](GettingStarted.ipynb) notebook first. In this notebook we'll make our way through the following topics:

* [Adjusting the number of bits used to store models](#capacity)
* [Using some NLP-style feature extractors](#nlp) ([word affixes](#affixes), [spelling](#spelling) and [n-grams](#ngram))
* [Changing the loss function that's being optimized](#loss) and [probabilistic output](#prob)
* [Getting a human-readable model out of `vw`](#human)
* [Changing the default holdout settings (e.g., to use NLP-style "development data")](#holdout)
* [Namespaces](#ns) and [quadratic features](#quad)
* [Regularization](#reg)
* [Neural networks](#nn)
* [Summary](#summary)

# <a id='capacity'></a> Increasing Representational Capacity (and memory usage)

Let's start with our previous training example:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model

Internally, in order to be fast, `vw` never stores any strings. When it reads your input (in which your features were represented as strings!), it *immediately* hashes the strings to some feature index. By default, it uses $2^{18}$ possible feature indices; this magic number $18$ is the "number of bits" used to store the weights in the model. (This is something of a misnomer: it's really the number of parameters in the learning algorithm, which is roughly the number of floats.)
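
To make the hashing step concrete, here is a minimal sketch of mapping feature strings to one of $2^b$ indices. The function name `feature_index` is made up for this example, and `zlib.crc32` is only a stand-in hash: it is an assumption for illustration, not the hash function `vw` actually uses.

```python
import zlib

def feature_index(feature, bits=18):
    """Map a feature string to an index in [0, 2**bits).

    zlib.crc32 is a stand-in for illustration; it is NOT vw's hash function.
    """
    mask = (1 << bits) - 1                        # keep only the low `bits` bits
    return zlib.crc32(feature.encode("utf-8")) & mask

for word in ["awesome", "terrible", "monster"]:
    # the same string always lands on the same index; more bits means more room
    print(word, feature_index(word), feature_index(word, bits=24))
```

Once a string has been turned into an index like this, the learner only ever touches the index, which is why two strings that share an index look identical to it.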

What does this hashing accomplish? Well, it accomplishes speed because no string manipulation ever happens. However, it comes with two downsides. The first is that you can get hash collisions. In particular, you might have two different features (i.e., different strings) that hash to the same location. From the learning algorithm's perspective, this means these two features are indistinguishable. The second, which we'll come back to [later](#human), is that because the strings are thrown away, getting a human-readable model out of `vw` takes extra work.

Remember that hash collisions are incredibly common. In NLP land, we often have several hundred thousand unique features. By a simple birthday-paradox type argument, we know that the probability of at least one collision when you hash $k$ items into $N$ buckets is approximately $1-\exp\left[-\frac {k(k-1)} {2N}\right]$. In this case, with $N=2^{18}$, even with only $2000$ unique features, the probability of collision is already 99.95%. With $100k$ features it's basically guaranteed.
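
The birthday-paradox estimate is easy to check numerically. This is a small sketch of the approximation $1-\exp[-k(k-1)/(2N)]$ with $N=2^b$; the function name `collision_probability` is my own, and the estimate assumes the hash spreads features uniformly over the buckets.

```python
import math

def collision_probability(k, bits):
    """Birthday-paradox approximation: chance that at least two of k
    distinct features share one of N = 2**bits buckets."""
    n_buckets = 2 ** bits
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * n_buckets))

# 2000 unique features with the default 18 bits
print("{:.2%}".format(collision_probability(2000, 18)))  # -> 99.95%
```

You can plug in your own feature count and a few values of `bits` to see how slowly the collision probability falls as you add bits.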

The solution is to increase the number of bits used in the representation. This will (a) reduce the number of collisions, (b) make `vw` take more RAM, (c) make `vw` somewhat slower, and (d) make the resulting models larger on disk. Currently, the maximum number of bits that `vw` will let you use is 31. I don't suggest using this; it means `vw` will consume about 8GB of memory while running and the resulting file may take as much as 2GB of disk space. [Runtime memory will be 4 times larger than disk space because the optimization algorithms need extra working memory.] But note that with 100k unique features, even with 31 bits, the same birthday-paradox approximation puts the probability of at least one collision at about 90%. Collisions are very hard to avoid entirely.

In practice I usually use around 24 bits.

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model -b 24

In this case, increasing the number of bits did not help test accuracy, but we will see that, when we add additional features, it becomes more important.

# <a id='nlp'></a>  Fun NLP-esque Features for Free

One nice thing about `vw` is that it internally supports "extra feature" generation. The main useful features are: word prefixes and suffixes, spelling features, and ngram features.

## <a id='affixes'></a> Word Affixes

For NLP tasks that mostly depend on the *meaning* (aka "semantics") of words, we often don't care about the funny little things that come at the ends of words. For instance, for sentiment classification, the words `awesome` and `awesomeness` are likely to roughly mean the same thing. For other tasks, like part of speech tagging, it's the suffixes that might matter most: words that end in `-ness` are much more likely to be nouns than anything else.

`vw` can automatically generate word prefixes and suffixes for you, using the `--affix` feature. For instance, if you add "`--affix +5,-3`" to the command line, this says to automatically compute (and add as new features) 5-character prefixes (that's the `+`) and three character suffixes (that's the `-`).

Let's try it:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model -b 24 --affix +6

That was (somewhat) helpful -- the holdout loss dropped from 14.4% to 13.8%. Perhaps not an ACL paper, but at least it's doing something! Note, satisfyingly, that the number of features approximately doubled. (In fact, where the old feature count was 955, the new feature count is 1909. This is because $1909=1 + 954 * 2$. The 955 old features include 954 words and a bias feature. Each word gets an affix so we have $954 * 2 = 1908$ real features, plus a bias.)
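
If you want to see what affix features look like, here is a rough sketch of generating them by hand. The function `affix_features` and the `+5=` / `-3=` feature names are made up for this example (they are not what `vw` calls these features internally), and I'm assuming a word shorter than the affix length simply contributes the whole word.

```python
def affix_features(words, prefix_len=0, suffix_len=0):
    """Return extra prefix/suffix features for a list of words.

    The "+5=" / "-3=" names are hypothetical, chosen to echo `--affix +5,-3`.
    """
    extra = []
    for w in words:
        if prefix_len > 0:
            extra.append("+{0}={1}".format(prefix_len, w[:prefix_len]))
        if suffix_len > 0:
            extra.append("-{0}={1}".format(suffix_len, w[-suffix_len:]))
    return extra

print(affix_features(["awesome", "awesomeness"], prefix_len=5, suffix_len=3))
# -> ['+5=aweso', '-3=ome', '+5=aweso', '-3=ess']
```

Notice that `awesome` and `awesomeness` share the prefix feature, which is exactly the kind of sharing we hoped for in sentiment classification.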

## <a id='spelling'></a> Spelling Features

Spelling features are *super* useful for tasks where things like capitalization, years, numbers, etc. matter a lot. In other words, tasks *not at all* like sentiment classification.

In `vw`, the spelling features option tells it to generate new features based on the word forms seen. For example, a word "Alice" has the word form "Aaaaa" (meaning: a capital letter followed by four lowercase letters); "VanBuren" has the form "AaaAaaaa". The general rule is that digits 0-9 get mapped to "0", letters a-z to "a", letters A-Z to "A", period to "." and anything else to "#". Thus, "xY9s,3.80vaq" gets mapped to "aA0a#0.00aaa" and this new "word form" is used as a new feature.
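
The mapping rule above is simple enough to write down directly. This is a sketch of the rule as described here (the name `word_form` is mine), not `vw`'s own code:

```python
def word_form(word):
    """Map each character to its class: digit -> 0, a-z -> a, A-Z -> A,
    period stays as ".", and anything else becomes "#"."""
    out = []
    for ch in word:
        if "0" <= ch <= "9":
            out.append("0")
        elif "a" <= ch <= "z":
            out.append("a")
        elif "A" <= ch <= "Z":
            out.append("A")
        elif ch == ".":
            out.append(".")
        else:
            out.append("#")
    return "".join(out)

print(word_form("Alice"))         # -> Aaaaa
print(word_form("VanBuren"))      # -> AaaAaaaa
print(word_form("xY9s,3.80vaq"))  # -> aA0a#0.00aaa
```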

To turn on spelling features, you simply add `--spelling _` to the command line. We can do this with little expectation that it will help:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model -b 24 --spelling _

Nope, indeed it did not help and it actually hurt slightly (14.4% error to 15.0% error).

You might wonder what the "`_`" in the command line means; for now, don't worry about it. We'll come back to this when we talk about namespaces.

## <a id='ngram'></a> N-gram Features

Our current representation for learning is bag of words. Sometimes looking at a single word at a time is insufficient and we want to, instead, look at (contiguous) sequences of words: ngrams.

Given an example text "the monster ate a sandwich", if we were to augment this with bigram features, we would get "the_monster monster_ate ate_a a_sandwich". In cases where word bigrams are useful, this is going to be helpful.

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model -b 24 --ngram 2

Wow, that was super useful! Loss dropped from 14.4% to 11.9%! We can try trigram features too:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model -b 24 --ngram 3

Okay, that didn't help any more.

We can, however, now see that the number of bits matters. If we drop the number of bits back down to 18, we get:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model -b 18 --ngram 3

This is no better than we started with.

More specifically: **if we hadn't increased the number of bits, we would have concluded that ngram features weren't useful!** This is why I always use as many bits as I can tolerate (sometimes up to 27 or 29).

Finally, `vw` can do "skip ngrams" too. This means that instead of only looking at bigrams of adjacent words, you can look at bigrams with some gap. For instance, if you say `--ngram 2 --skips 1`, this means "compute all bigrams that have at most one gap in them." For our favorite sentence "the monster ate a sandwich", you would get the default bigram features ("the_monster monster_ate ate_a a_sandwich") and *also* the skips ("the_ate monster_a ate_sandwich").

Note: as you increase to, say, four-grams, this automatically includes bigrams and trigrams. As you increase the number of skips, you get all the lower order skips too.
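
Here is a small sketch of bigram and skip-bigram generation for the running example sentence. The function `skip_bigrams` is hypothetical, and joining words with `_` just follows the notation used in this section; it only handles bigrams, to keep things short.

```python
def skip_bigrams(words, skips=0):
    """All bigrams whose two words are separated by at most `skips` words."""
    feats = []
    for gap in range(skips + 1):                  # gap 0 = ordinary bigrams
        for i in range(len(words) - gap - 1):
            feats.append(words[i] + "_" + words[i + gap + 1])
    return feats

sentence = "the monster ate a sandwich".split()
print(skip_bigrams(sentence))
# -> ['the_monster', 'monster_ate', 'ate_a', 'a_sandwich']
print(skip_bigrams(sentence, skips=1)[4:])
# -> ['the_ate', 'monster_a', 'ate_sandwich']
```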

# <a id='loss'></a> Changing the Loss Function

By default, `vw` optimizes squared loss. This means that if the correct label for an example is +1, and the model predicts -1, the error is 4.0. However, if the model predicts +3, the error is still 4.0, even though it's making the right binary prediction. Squared loss has the nice property that it estimates means. But it's not necessarily the most natural loss for classification problems.
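
To see the difference numerically, here is a sketch comparing squared loss with the two alternatives discussed in this section, logistic and hinge, using their standard textbook definitions for labels in {-1, +1}. The function names are my own, and `vw`'s internal scaling may differ.

```python
import math

def squared_loss(y, p):
    return (y - p) ** 2

def logistic_loss(y, p):
    return math.log(1.0 + math.exp(-y * p))

def hinge_loss(y, p):
    return max(0.0, 1.0 - y * p)

# true label +1; one wrong prediction (-1) and one confidently right one (+3)
for p in (-1.0, 3.0):
    print(p, squared_loss(1.0, p), logistic_loss(1.0, p), hinge_loss(1.0, p))
```

Squared loss charges 4.0 in both cases, while logistic and hinge loss charge much less (hinge charges nothing) for the confident correct prediction.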

Many people prefer logistic loss (which gives a nice probabilistic interpretation) or hinge loss (which, when combined with regularization, yields support vector machines). You can switch the loss function with a simple command-line flag:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model --loss_function logistic

<a id='prob'></a> If you're using logistic loss, you can get probabilistic predictions out of your model by using `-r`(aw) output; for example:

In [None]:
!vw --binary -i data/sentiment.model -t -r data/sentiment.te.pred data/sentiment.te --quiet
!head data/sentiment.te.pred

These predictions are values *before* being hit with a logistic function. To get probabilities, map $z \mapsto \frac 1 {1 + \exp(-z)}$, as in:

In [None]:
!head data/sentiment.te.pred | perl -ne '$a = 1/(1+exp(-$_)); print "$a\n";'
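
If you'd rather stay in python, the same mapping can be sketched as below. I'm assuming each line of the raw prediction file starts with the raw score (anything after it is ignored); the function name and the example scores are made up.

```python
import math

def raw_to_probability(line):
    """Convert one line of raw output to a probability of the +1 class."""
    z = float(line.split()[0])        # the raw score is assumed to come first
    return 1.0 / (1.0 + math.exp(-z))

for line in ["0.0", "2.5", "-1.3"]:   # made-up raw scores, not real vw output
    print(raw_to_probability(line))
```

To apply it to real predictions, loop over the lines of `data/sentiment.te.pred` instead of the made-up list.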

We can alternatively switch to hinge loss:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model --loss_function hinge

In this case, logistic loss does awesome (dropping the error from 14.4% to 11.3%) and hinge does crummy (increasing to 15.6%). This is perhaps because with hinge we probably want to regularize, which we'll get back to later.

# <a id='human'></a> Getting a Human Readable Model

Admittedly one of the most annoying things with `vw` is getting a human-readable model out. In some more complex cases, this is nearly impossible. And even in simple cases (like those here), it's cumbersome. This is because of the fact that `vw` doesn't store strings. So if you want to get a mapping from features-to-weights out of `vw` you have to jump through some hoops.

What are those hoops?

First you have to learn a model and save it to disk. Fine, we know how to do that. You then have to instruct `vw` to load the saved model and take another pass over your training data, and save the results to disk. It has to take another pass over the data because as it makes that final pass, it actually *does* store the strings in memory (if you tell it to) so it can generate the human readable file.

What does this look like in practice? First, we generate a model. For a warm-up, we won't use any fancy features. Then, we have to do a second incantation of `vw` and tell it where to store the resulting human-readable model.

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model --loss_function logistic --quiet
!vw -i data/sentiment.model -t --invert_hash data/sentiment.model.readable data/sentiment.tr --quiet

The first command here trains the model as before. The second says: start from that pre-trained model; go into test mode (so that you don't adjust any of the weights of the model); store the resulting readable model (`--invert_hash`) into the specified file; and read from `data/sentiment.tr` (you have to re-read from the same training data).

We can now look at `data/sentiment.model.readable` to see what's going on.

In [None]:
!head -n40 data/sentiment.model.readable

The beginning of this output is some header information that tells you a bit about the type of model that was stored. After the `options:` line there's the `:0` line, and then after that you get a list of `feature:hash:weight` triples. For instance, the feature "`earth`" was hashed to position 20130, and has a feature weight -0.056972, which means it's (very) mildly indicative of the negative class. These words are actually sorted; the reason "earth" pops up at the top is because the "e" in the string is some weird unicode "e" and not the normal ascii "e".

We can extract just the features by dropping off the first 12 lines of the output, and then sort by the feature weights:

In [None]:
!cat data/sentiment.model.readable  | tail -n+13 | sort -t: -k3nr | head

(Ignore the broken pipe issues, which are an IPython notebook joy.)

Here, we've dropped the first 12 lines, then sorted the remaining by the third column in reverse numerical order (`-k3nr`) where columns are separated by colons (`-t:`) and then looked at the top 10.

These are kind of weird features to see at the top of a sentiment classification data set. Presumably this means that people like Hamlet and probably also like Matt Damon. But there's very little more we can get from this.

We can look at the most negative features too:

In [None]:
!cat data/sentiment.model.readable  | tail -n+13 | sort -t: -k3nr | tail

These are much more reasonable, and the magnitudes are significantly larger (if negative), suggesting that these are probably what is really being used to make good predictions.
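
The same sorting can be done in python, which is handy if you want to do more with the weights than eyeball them. This sketch assumes the layout described above: a 12-line header followed by `feature:hash:weight` lines. The function `read_weights` and the example lines are made up.

```python
def read_weights(lines, header_lines=12):
    """Parse `feature:hash:weight` lines into (feature, weight) pairs,
    sorted from most positive to most negative weight."""
    weights = []
    for line in lines[header_lines:]:
        line = line.strip()
        if not line:
            continue
        # split from the right in case the feature name itself contains ':'
        name, _hash, weight = line.rsplit(":", 2)
        weights.append((name, float(weight)))
    return sorted(weights, key=lambda nw: nw[1], reverse=True)

# made-up stand-in for the contents of a readable model file
fake_model = ["header"] * 12 + ["good:10:0.5", "bad:11:-0.7", "meh:12:0.01"]
ranked = read_weights(fake_model)
print(ranked[0], ranked[-1])
```

For the real thing, pass in the lines of `data/sentiment.model.readable`; the start of the returned list plays the role of `head` and the end plays the role of `tail`.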

We can do the same thing with the model with ngram features:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model --loss_function logistic --ngram 3 -b 24 --quiet
!vw -i data/sentiment.model -t --invert_hash data/sentiment.model.readable data/sentiment.tr --quiet

You'll notice this took a bit longer to run because there are *lots* of ngrams. We can see the output as follows:

In [None]:
!echo "Top positive features"
!cat data/sentiment.model.readable  | tail -n+13 | sort -t: -k3nr | head
!echo ""
!echo "Top negative features"
!cat data/sentiment.model.readable  | tail -n+13 | sort -t: -k3nr | tail

In all of these, there's only one ngram feature that pops up as negative, "`puppetry^and/or^hate`". You can google this trigram if you really want to know what it's from. Amusingly, "attempt" also comes up as a negative word. I saw this once in a review for one of my papers: "this paper attempts to do XYZ." Not a good sign.

# <a id='holdout'></a> Changing `vw`'s Default Holdout Settings

If you recall from the introduction, the default way `vw` works for doing multiple passes is: on the first pass, perform progressive validation; on subsequent passes, use every 10th example as a heldout "validation" example; and stop optimizing when things don't improve for three passes.

These are reasonable defaults, but somewhat at odds with the behavior I often want.

First, I often *don't* want `vw` to do early stopping. If I tell it to do 20 passes, then by golly it should do 20 passes. This is easy. I just say `--early_terminate 999`. This means that instead of needing 3 passes of no-improvement in order to terminate, it now needs 999. Since I never run that many passes, this is a good default to say "don't stop early." However, it *will* still output only the best model found.

More relevant, often in NLP we have training data, development data, and test data. And I want to get validation performance on the development data rather than every-10th-example. You can accomplish this with `--holdout_after N`. What this means is: instead of doing every-10th-example as validation, use the first (N-1) examples as training data, and anything after that as development data.

Putting these together, I usually do something like:

In [None]:
!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model --early_terminate 999 --holdout_after 1401

Here, the important thing is that the first 1400 examples are used as training data, and the remaining examples (in this case, 200) are used as heldout data. The average loss reported is then *precisely* the average loss on this heldout data.
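
In python terms, the split that `--holdout_after N` describes looks like the sketch below. It only mirrors the description in this section (first N-1 examples for training, the rest held out); `split_holdout` is a made-up helper, not something `vw` provides.

```python
def split_holdout(examples, holdout_after):
    """First (holdout_after - 1) examples are training data, the rest are held out."""
    n_train = holdout_after - 1
    return examples[:n_train], examples[n_train:]

train, dev = split_holdout(list(range(1600)), 1401)
print("{0} training, {1} heldout".format(len(train), len(dev)))
# -> 1400 training, 200 heldout
```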

# <a id='ns'></a>  Namespaces and quadratic features

For this part of the tutorial to make sense, we have to make our task a little more interesting.

Many people who do sentiment analysis start from a *sentiment lexicon*: basically, a list of positive-ish and negative-ish words. There are [lots of sentiment lexicons](http://sentiment.christopherpotts.net/lexicons.html). We will use [the one from Bing Liu](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html). First, let's download it and decompress it:

In [1]:
!rm -f data/*words.txt
!curl -o data/opinion-lexicon-English.rar https://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
!rar x data/opinion-lexicon-English.rar  data
!ls -l data/*words.txt

We can look at some of the positive and negative words:

In [None]:
!head -n50 data/*-words.txt | grep -v '^;'

In this, I dropped lines that begin with "`;`" because these are comments in the files.

We now want to go back and generate some new data files for `vw` that include lexicon features. In particular, we will include *both* the bag of words representation *as well as* lexicon features. The lexicon features we will use are very simple: the log of the count of words in the document that are on the positive list, and the log of the count on the negative list. We use logs because getting more positive words has diminishing returns.

To do this, we'll write a bit more python:

In [None]:
import io
def loadLexicon(filename):
    with io.open(filename, 'r', encoding='iso-8859-1') as h:
        return set(l.strip() 
                    for l in h.readlines()
                    if  not l.startswith(';') and len(l) > 1)

import math
def countLexiconWords(text, lexicon):
    return math.log(1.0 + len([w for w in text if w in lexicon]))

positiveLexicon = loadLexicon('data/positive-words.txt')
negativeLexicon = loadLexicon('data/negative-words.txt')

We now have a copy of the two lexicons and we want to generate `vw` examples.

But we have two different types of features. We have the original bag of words features. And we have these lexicon features. We'd like to keep them separate.

This is where feature namespaces come in. We're going to create examples with *two* namespaces, one for the bag of words (let's call it the `w` namespace) and one for the lexicon features (let's call that the `l` namespace). In `vw`, namespaces are separated by pipes, so an example might look like:

    +1 |l pos:5 neg:2 |w some words might go here ...
    
In addition to having two namespaces, this example also shows how to use feature values. By default, all features in a `vw` example get a value of one. If you want to override this, you can say something like "`pos:5`", which means that there's a single feature (called "`pos`") that has a feature value of 5.
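
To make the format concrete, here is a sketch of a parser for example lines like the one above. It is a simplification I made up for illustration: it assumes every pipe is immediately followed by a namespace name and that feature names contain no `:`; it is not `vw`'s real parser.

```python
def parse_example(line):
    """Split a vw-style line into (label, {namespace: {feature: value}})."""
    parts = line.split("|")
    label = parts[0].strip()
    namespaces = {}
    for part in parts[1:]:
        tokens = part.split()
        name, feats = tokens[0], {}            # first token names the namespace
        for tok in tokens[1:]:
            feat, sep, value = tok.partition(":")
            feats[feat] = float(value) if sep else 1.0   # default value is one
        namespaces[name] = feats
    return label, namespaces

print(parse_example("+1 |l pos:5 neg:2 |w some words might go here"))
```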

Let's generate data like this. Some of the code is copied from the Getting Started tutorial.

In [None]:
def textToVW(lines):
    return ' '.join([l.strip() for l in lines]).replace(':','COLON').replace('|','PIPE')

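def fileToVW(inputFile, posLex, negLex):
    # read the whole review, closing the file handle as soon as we're done
    with open(inputFile, 'r') as h:
        text = textToVW(h.readlines())
    words    = text.split()
    posCount = countLexiconWords(words, posLex)
    negCount = countLexiconWords(words, negLex)
    # lexicon features go in the `l` namespace, the bag of words in `w`
    return '|l pos:%g neg:%g |w ' % (posCount,negCount) + text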

import os
def readTextFilesInDirectory(directory):
    return [fileToVW(directory + os.sep + f, positiveLexicon, negativeLexicon) 
            for f in os.listdir(directory)
            if  f.endswith('.txt')]

examples = ['+1 ' + s for s in readTextFilesInDirectory('data/txt_sentoken/pos')] + \
           ['-1 ' + s for s in readTextFilesInDirectory('data/txt_sentoken/neg')]

print('{0} total examples read'.format(len(examples)))
print('first example: {0}...'.format(str(examples[ 0][:70])))
print('last  example: {0}...'.format(str(examples[-1][:70])))

At least based on these two examples, this seems promising: the positive/negative lexicon features seem to correlate with the labels!

Let's generate a new `vw` training file:

In [None]:
from __future__ import print_function   # must be the first statement in the cell
import random

random.seed(1234)
random.shuffle(examples)   # this does in-place shuffling


def writeToVWFile(filename, examples):
    with open(filename, 'w') as h:
        for ex in examples:
            print("{0}".format(ex), file=h)
            
writeToVWFile('data/sentiment-lex.tr', examples[:1600])
writeToVWFile('data/sentiment-lex.te', examples[1600:])

Now, we're in a position where we can train a model. Let's use exactly the same command as earlier, which got us 11.9% error rate, but using the new data:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function logistic -b 24

And, disappointingly, we still get 11.9% error rate!

One thing we can do that is useful is *turn off* a subset of the namespaces. For instance, if we want to *only* use the lexicon features, we can tell `vw` to ignore the `w` namespace:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function logistic -b 24 --ignore w

Overall that's not too impressive: 31.9% error. On the other hand, this is just using two features (plus a bias).

## <a id='quad'></a> Quadratic features

The real magic comes from *feature combination*. For instance, the first example from above looks like:

    +1 |l pos:4.54329 neg:3.4012 |w note COLON some may consider portions ...
    
There might be reason to believe that looking at *pairs* of features between the `l` and `w` namespaces would be useful. In this case, these features would be things like:

    note_pos:4.5 note_neg:3.4 COLON_pos:4.5 COLON_neg:3.4 some_pos:4.5 ...
    
(I've rounded 4.54329 to 4.5 and 3.4012 to 3.4 for brevity.)

This allows you to model interactions among features. `vw` will do this automatically for you with `-q` (quadratic) features. For example, you can ask for all pairs of features between these two namespaces as:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function logistic -b 24 -q wl

And with that, our loss has dropped from 11.9% to 11.3%. Not a huge win, but something. Note that the number of features per example has approximately tripled here.
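
To make the `-q wl` expansion concrete, here is a sketch of crossing two namespaces by hand. The `word_lexicon` naming follows the illustration above rather than `vw`'s internal names, and I'm taking the value of each pair to be the product of the two feature values, which matches the example (1 times 4.5).

```python
def quadratic_features(ns_a, ns_b):
    """Cross every feature in ns_a with every feature in ns_b.

    Each namespace is a dict {feature: value}; the value of a pair is the
    product of the two values.
    """
    return dict((a + "_" + b, va * vb)
                for a, va in ns_a.items()
                for b, vb in ns_b.items())

w = {"note": 1.0, "some": 1.0}
l = {"pos": 4.5, "neg": 3.4}
print(sorted(quadratic_features(w, l).items()))
# -> [('note_neg', 3.4), ('note_pos', 4.5), ('some_neg', 3.4), ('some_pos', 4.5)]
```

Each word picks up two extra features here, which is why the per-example feature count roughly triples.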

You can go crazy if you want and add quadratic features between the `l` namespace and itself and the `w` namespace and itself too:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function logistic -b 24 -q wl -q ll -q ww

This is significantly slower *and* significantly worse, basically because there are now hundreds of thousands of features, and the model has overfit.

Minor note: when creating quadratic features, you can use `:` as a wildcard to refer to "any namespace", for instance "`-q l:`" pairs `l` with all other namespaces; "`-q ::`" pairs all namespaces with all other namespaces.

# <a id='reg'></a> Regularization

Regularization is a sometimes-helpful method for preventing your model from overfitting to the training data. Once you have a reasonable amount of data, I find regularization in `vw` to be relatively **un**helpful, largely because the underlying learning algorithm is quite good. But for small or modest data set sizes, like the sentiment data, it is plausibly useful.

`vw` has two built-in forms of regularization: $\ell_2$ ("Gaussian") regularization and $\ell_1$ ("sparse") regularization. You can combine them if you want to get "elastic net" regularization. Both forms of regularization require a strength parameter, which usually should be quite small and must be tuned carefully. Doing $\ell_1$ has the advantage of often producing models with lots of zeros. Here are some runs with both, where I've chosen regularization strengths by hand that work well. We're also going to save a readable model to disk so we can look at how many features are being used.

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function logistic -b 24 -q wl --l2 0.0001 -f data/sentiment-lex.model
!vw -i data/sentiment-lex.model -t --invert_hash data/sentiment-lex.model.readable data/sentiment-lex.tr --quiet
!echo ""
!echo "total number of features:"
!tail -n+13 data/sentiment-lex.model.readable | wc -l

With this, a regularization strength of 0.0001 has dropped the error rate from 11.3% to 10.0%! And the model is using 130k features.

We can also try $\ell_1$:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function logistic -b 24 -q wl --l1 0.000001 -f data/sentiment-lex.model
!vw -i data/sentiment-lex.model -t --invert_hash data/sentiment-lex.model.readable data/sentiment-lex.tr --quiet
!echo ""
!echo "total number of features:"
!tail -n+13 data/sentiment-lex.model.readable | wc -l

This gives us a slightly higher loss (the original value of 11.9%) but using fewer features: only 97k in this case. We can increase the strength of the regularizer and get fewer features at the cost of higher error:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function logistic -b 24 -q wl --l1 0.00001 -f data/sentiment-lex.model
!vw -i data/sentiment-lex.model -t --invert_hash data/sentiment-lex.model.readable data/sentiment-lex.tr --quiet
!echo ""
!echo "total number of features:"
!tail -n+13 data/sentiment-lex.model.readable | wc -l

This shows that if we're willing to suffer a bit more loss (in this case, 15.0%), we can get a model that's about a quarter of the size.

If you add both `--l1` and `--l2` (and tune the two corresponding hyperparameters), you get elastic net. I've never had this be particularly effective.
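
For reference, here is a sketch of what the two penalties compute, using the generic textbook elastic-net form $\lambda_1 \sum_i |w_i| + \frac{\lambda_2}{2} \sum_i w_i^2$. This is an assumption for illustration: the exact scaling `vw` applies internally may differ, and `penalty` is a made-up helper.

```python
def penalty(weights, l1=0.0, l2=0.0):
    """Textbook elastic-net penalty: l1 * sum|w| + (l2 / 2) * sum w^2."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = 0.5 * l2 * sum(w * w for w in weights)
    return l1_term + l2_term

weights = [1.0, -2.0, 0.0]
print(penalty(weights, l1=0.1))           # l1 only
print(penalty(weights, l2=0.2))           # l2 only
print(penalty(weights, l1=0.1, l2=0.2))   # both: elastic net
```

The $\ell_1$ term grows linearly even for tiny weights, which is what pushes many of them to exactly zero; the $\ell_2$ term mostly shrinks the large ones.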

# <a id='nn'></a>  Neural networks

We've already seen several ways of achieving non-linearity in `vw`: quadratic features, ngrams, etc. A more traditional approach is to add a hidden layer of representation, yielding a two-layer feed-forward neural network. To achieve this, you simply say `--nn 10`, to get a hidden layer with ten units:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function hinge -b 24 --nn 10

(Note: I switched to hinge loss. With logistic loss, you get about 50% error which is horrible. This has to do with scaling of the gradients.) Well, that is disappointing. Our loss went from around 12% to around 16%.

One reason is that the original bag of words features are actually quite useful, and by forcing them through a 10-unit hidden layer you lose a lot of information. A solution to this problem is `--inpass` which adds additional edges to the neural network directly from the input to the output. So you basically get the best of both worlds:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function hinge -b 24 --nn 10 --inpass

Okay, so this is better, we're down to 13.8% error. But what else can we do? **DROPOUT**! This is a particularly useful method of regularization for neural networks.

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function hinge -b 24 --nn 10 --inpass --dropout

Well, disappointed once again: loss is back up to 14.4%.

By playing around with: (a) loss function, (b) number of hidden units and (c) dropout, I **was** able to get a model that's slightly better:

In [None]:
!vw --binary data/sentiment-lex.tr --passes 20 -c -k --loss_function logistic -b 24 --nn 10 --inpass

Note, here, that we've switched *back* to logistic loss. When you use `--inpass` and get features directly from the input, you don't have the same problems with logistic loss that you do when you don't use `--inpass`. It definitely requires some fiddling.
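
To picture what the direct input-to-output edges of `--inpass` add, here is a sketch of the forward pass of a one-hidden-layer network with and without them. This is a generic illustration, not `vw`'s implementation: the `tanh` activation, the missing bias terms, and the name `nn_forward` are all assumptions on my part.

```python
import math

def nn_forward(x, hidden_w, out_w, direct_w=None):
    """Score an input with a one-hidden-layer network.

    x        : list of input feature values
    hidden_w : one list of input weights per hidden unit
    out_w    : one output weight per hidden unit
    direct_w : optional input-to-output weights (the "inpass" edges)
    """
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x))) for ws in hidden_w]
    score = sum(w * h for w, h in zip(out_w, hidden))
    if direct_w is not None:
        # the linear model on the raw inputs is added on top of the hidden layer
        score += sum(w * xi for w, xi in zip(direct_w, x))
    return score

x = [1.0, 2.0]
hidden_w = [[0.5, -0.5], [0.1, 0.2]]   # two hidden units, made-up weights
out_w = [1.0, -1.0]
print(nn_forward(x, hidden_w, out_w))
print(nn_forward(x, hidden_w, out_w, direct_w=[0.5, 0.25]))
```

With the direct edges, the network can fall back on the plain linear model whenever the small hidden layer throws away too much information.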

# <a id='summary'></a>Summary

In this notebook, we've learned lots of ways to get extra features from `vw`. Here's a brief summary:

* Using `-b 24` to increase the size of the model, something you should always do
* `--affix +6,-2w` to add six character prefixes to features from all namespaces and two character suffixes to features from the w namespace
* `--spelling w` to add spelling features to the w namespace (use `--spelling _` to add spelling features to the default namespace)
* `--ngram 3 --skips 1` to add one-skip, trigram features to all namespaces
* `--loss_function logistic/hinge` to switch the loss function
* How to get a human readable model using `--invert_hash`
* Using `--early_terminate 999 --holdout_after 1401` to treat the last 200 examples as development data and turn off early stopping
* Using namespaces and `-q` for quadratic features (there's also `--cubic` for cubic features!)
* Using `--l2` or `--l1` to regularize the model
* Running in neural networks mode with `--nn 10 --inpass --dropout`, and then playing around with other parameters

One important thing to remember is that arguments that affect the features that `vw` uses get stored in saved models. This means that if you train with `-f model` and then test with `-t -i model`, when you load the model (`-i model`), you *also* load all of the feature generators. This ensures that training and testing use a consistent feature representation, and also means you don't have to remember what arguments you used to train the model.