### Timing POS taggers

Comparing the time each library takes to POS-tag the text of *Crime and Punishment*:

* NLTK
* TextBlob
* spaCy

In [1]:
# imports

import timeit
import nltk
from textblob import TextBlob
import spacy
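# Skip the parser and entity recognizer, which POS tagging does not need.
# Note: these keyword arguments are the spaCy 1.x form; spaCy 2.x and later
# express the same request as disable=['parser', 'ner'].
nlp = spacy.load('en_core_web_sm', parser=False, entity=False)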

In [2]:
# read text

with open('crime.txt', 'r') as f:
    raw_text = f.read()

# keep only the first 1,000,000 characters (see the limitations section)
raw_text = raw_text[:1000000]

In [3]:
# NLTK

start_time = timeit.default_timer()

tokens = nltk.word_tokenize(raw_text)
tags = nltk.pos_tag(tokens)
#print(tags[:5])

print("NLTK time = ", timeit.default_timer() - start_time)


NLTK time =  7.762525241999999


In [4]:
# TextBlob

start_time = timeit.default_timer()

blob = TextBlob(raw_text)
#print(blob.tags[:5])

print("TextBlob time = ", timeit.default_timer() - start_time)


TextBlob time =  0.02342948899999797


In [7]:
# spaCy

start_time = timeit.default_timer()

doc = nlp(raw_text)
#print([(token.text, token.tag_) for token in doc[:5]])

print("spaCy time = ", timeit.default_timer() - start_time)


spaCy time =  24.128748495999986


### Comparisons

Each of the three code cells above contains a print statement that shows the tags of the first 5 tokens. With that line commented out, the taggers averaged the following times:

* NLTK 7.8 seconds
* TextBlob 0.2 seconds
* spaCy 26.7 seconds

However, when the print line is not commented out, here are the times:

* NLTK 8 seconds
* TextBlob 10.9 seconds
* spaCy 33.9 seconds
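
Averages like the ones above could be collected with a small helper rather than by re-running cells by hand. The following is a minimal sketch using only the standard library; `average_time` and `dummy_tagger` are hypothetical names introduced for illustration, and the dummy workload stands in for a real tagging call such as the NLTK cell.

```python
import statistics
import timeit


def average_time(fn, repeats=5):
    """Return the mean wall-clock time in seconds of one call to fn(), over `repeats` runs."""
    # timeit.repeat returns one total per repetition; number=1 means each
    # repetition calls fn exactly once.
    times = timeit.repeat(fn, number=1, repeat=repeats)
    return statistics.mean(times)


# Stand-in workload (an assumption) so the sketch runs without any NLP library.
def dummy_tagger():
    return [(word, "NN") for word in "the quick brown fox".split()]


print("dummy time = ", average_time(dummy_tagger))
```

To time a real tagger, the dummy function would be replaced by a zero-argument function that wraps the tagging call.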

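The blazingly fast time for TextBlob is misleading: constructing a `TextBlob` object does not tag anything. Tagging is deferred until the `tags` property is first accessed, so the cost shows up in the print line instead. Once that work is counted, TextBlob is slower than NLTK. That's disappointing.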
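
The pattern behind these numbers, cheap construction followed by an expensive first access, can be illustrated without any NLP library. This is only a sketch: `LazyBlob` is a hypothetical stand-in, not TextBlob's actual implementation, and the `time.sleep` call and the placeholder `"NN"` tags simulate real tagging work.

```python
import time
import timeit
from functools import cached_property


class LazyBlob:
    """Hypothetical object that defers tagging until .tags is first read."""

    def __init__(self, text):
        self.text = text  # construction only stores the text

    @cached_property
    def tags(self):
        time.sleep(0.05)  # simulated expensive tagging work (an assumption)
        return [(word, "NN") for word in self.text.split()]


start = timeit.default_timer()
blob = LazyBlob("He was crushed by poverty")
construct_time = timeit.default_timer() - start

start = timeit.default_timer()
first_tags = blob.tags[:5]  # the work happens here
first_access_time = timeit.default_timer() - start

start = timeit.default_timer()
second_tags = blob.tags[:5]  # served from the cache
second_access_time = timeit.default_timer() - start

print("construct     = ", construct_time)
print("first access  = ", first_access_time)
print("second access = ", second_access_time)
```

A timing that stops right after construction measures almost nothing, which is why the print line has to be included for a fair comparison.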

### Limitations of this comparison

These comparisons are meant to compare usage as much as time, and they are not apples to apples. They are really comparing an apple, an apple salad, and a full fruit buffet.

* NLTK does what we ask: tokenize, then tag. Ask for an apple and you get one. In contrast, both TextBlob and spaCy are creating objects with many other features. The TextBlob object is the apple salad in the analogy, giving many other annotation features, and spaCy is a full buffet table of annotations.

* Two things were done to help spaCy. First, the raw text was limited to its first 1,000,000 characters, since anything over this limit caused memory warnings to be thrown. There are ways around this, as suggested in the warning message. Second, parameters (`parser=False, entity=False`) were passed when loading the model to avoid loading components that are not needed, as suggested in [this post](https://github.com/explosion/spaCy/issues/430).
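
One possible way around the 1,000,000-character limit, other than raising spaCy's `nlp.max_length` setting, is to split the text into smaller pieces and process them one at a time. The sketch below is an assumption about how that could be done, not what the notebook above does; `split_into_chunks` is a hypothetical helper that prefers to cut at blank-line paragraph breaks so that no sentence is split in the middle.

```python
def split_into_chunks(text, max_chars=1_000_000):
    """Split text into pieces of at most max_chars characters.

    Cuts just after the last paragraph break ("\\n\\n") inside each window
    when one exists, otherwise cuts at exactly max_chars.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last paragraph break that fits in this window.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut + 2  # keep the break with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks


sample = "aaa\n\nbbb\n\nccc"
print(split_into_chunks(sample, max_chars=6))
```

Each chunk could then be passed to `nlp(...)` separately and the resulting tags concatenated. Joining the chunks reproduces the original text exactly.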

### The take-away

The upshot is that if you just need POS tagging, NLTK is probably your best bet. If you want further annotations, use TextBlob. If you want the full buffet, use spaCy.
