Welcome to the first leg of my Natural Language Processing (NLP) journey. I will be covering practical use cases for NLP along with common techniques to achieve them. My examples will use Python, along with [Jupyter](http://jupyter.org/) and [spaCy](https://spacy.io/).

First, I'll start off with some common components of a language processing pipeline.

# Tokenization

Tokenization is the process of breaking up text into smaller pieces, called _tokens_. Tokens may or may not be words. There is no single correct way to tokenize, since different problems call for different token granularity. Below are two examples of tokenizers.

In [1]:
text = "I'm against picketing, but I don't know how to show it."

In [2]:
# A naïve tokenizer that splits on spaces.
tokens = text.split(" ")
print(tokens)

["I'm", 'against', 'picketing,', 'but', 'I', "don't", 'know', 'how', 'to', 'show', 'it.']


Just splitting on spaces gets you most of the way there, but it leaves punctuation glued to the neighboring word, as in `'picketing,'` and `'it.'`.
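
A middle ground that needs only the standard library is a regular expression that pulls punctuation off as its own token. The sketch below is an illustration, not how spaCy tokenizes; the pattern and the `regex_tokenize` name are assumptions made for this example, and it deliberately keeps contractions like `don't` whole.

```python
import re

# Hypothetical pattern (an assumption for illustration): a run of word
# characters, optionally followed by an apostrophe and more word characters,
# or else any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = "I'm against picketing, but I don't know how to show it."
tokens = regex_tokenize(text)
print(tokens)
```

Here `picketing` and `,` come out as separate tokens. Cases such as curly apostrophes or hyphenated words would need extra rules.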

In [3]:
# The default tokenizer in spaCy is a bit more robust.
import spacy
from spacy.lang.en import English

english = spacy.load('en')
tokenizer = English().Defaults.create_tokenizer(english)

tokens = tokenizer(text)
tokens = [token.text for token in tokens]

print(tokens)

['I', "'m", 'against', 'picketing', ',', 'but', 'I', 'do', "n't", 'know', 'how', 'to', 'show', 'it', '.']


spaCy's tokenizer treats words and punctuation as separate tokens, and even splits up contractions: `don't` becomes `do` and `n't`.

# Stop Words

Stop words are merely words you want your NLP pipeline to ignore. Stop lists usually contain common words like 'the', but they may also contain domain-specific words like 'Cerner' or 'DevCon'.

In [2]:
from spacy.lang.en import English

stopwords = list(English.Defaults.stop_words)

print(f"Number of stopwords in spaCy: {len(stopwords)}")
print(f"Examples: {' '.join(stopwords[0:10])}")

Number of stopwords in spaCy: 305
Examples: no to meanwhile latter over none seem ten thus within


In [5]:
text = "I'm against picketing, but I don't know how to show it."

english = spacy.load('en')
doc = english(text)

# spaCy flags stop words through the `is_stop` attribute on each token.
tokens = [token for token in doc if not token.is_stop]

print(tokens)

[I, 'm, picketing, ,, I, n't, know, .]
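
To make the idea of a domain-specific stop list concrete, here is a minimal sketch in plain Python. It does not use spaCy; the tiny base list and the names `BASE_STOP_WORDS`, `DOMAIN_STOP_WORDS`, and `remove_stop_words` are assumptions made for illustration.

```python
# A tiny hand-picked base list (an assumption; real stop lists are much longer).
BASE_STOP_WORDS = {"the", "a", "is", "at", "to", "in"}

# Domain-specific additions, using the two examples mentioned earlier.
DOMAIN_STOP_WORDS = {"cerner", "devcon"}

STOP_WORDS = BASE_STOP_WORDS | DOMAIN_STOP_WORDS

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = "The keynote at DevCon is in the Cerner auditorium".split(" ")
filtered = remove_stop_words(tokens)
print(filtered)
```

Comparing on the lowercase form means capitalized tokens such as `The` are filtered too.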


# Stemming and Lemmatization

Two common ways of combining similar words are _stemming_ and _lemmatization_.

## Stemming

The simpler of the two, stemming works by chopping off the end of each word. Porter's algorithm is a popular implementation of stemming. It would convert `ponies` -> `poni` and `cats` -> `cat`. Stemming is fast, but it can have a high false positive rate, meaning it merges words that do not actually share a meaning.

In [1]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

text = "This shirt is dry-clean only which means it's dirty"
tokens = text.split(" ")
# Stem every token once, then print the results one per line.
stems = [stemmer.stem(token) for token in tokens]
for stem in stems:
    print(stem)

thi
shirt
is
dry-clean
onli
which
mean
it'
dirti


Stemming doesn't take parts of speech into account, so it can incorrectly group different words together. [This is why](https://github.com/explosion/spaCy/issues/327#issuecomment-208658745) spaCy doesn't include a stemmer; instead, it provides lemmatization.
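
The over-grouping problem is easy to reproduce with a toy suffix stripper. The sketch below is not Porter's algorithm; the suffix list, the three-character minimum, and the `naive_stem` name are assumptions chosen to keep the example short.

```python
from collections import defaultdict

# Toy suffix list (an assumption for illustration; Porter's algorithm uses
# many more rules and conditions than this).
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["walking", "walked", "walks", "new", "news"]

# Group the original words by the stem they collapse to.
groups = defaultdict(list)
for word in words:
    groups[naive_stem(word)].append(word)

print(dict(groups))
```

The three forms of `walk` are grouped as intended, but `news` also lands in the same group as `new`, which is exactly the kind of false positive stemming is prone to.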

## Lemmatization

Lemmatization groups words by their _lemma_, the standard (dictionary) form of the word. For example, `better` and `best` have the lemma `good`, whereas `walk` is the lemma for `walking`, `walked`, and `walks`. Lemmatization is more complicated than stemming, but has fewer false positives.
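
One way to picture lemmatization is as a dictionary lookup keyed on both the word and its part of speech. The sketch below is a toy with a hand-written table, not how spaCy's lemmatizer is implemented; `LEMMA_TABLE` and `lookup_lemma` are hypothetical names, and the `meeting` entries are an extra example added to show why the part of speech matters.

```python
# Hypothetical lookup table (an assumption for illustration): keys are
# (lowercase word, part of speech), values are lemmas.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("best", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("walked", "VERB"): "walk",
    ("walks", "VERB"): "walk",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def lookup_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercase word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lookup_lemma("Best", "ADJ"))
print(lookup_lemma("meeting", "VERB"))
print(lookup_lemma("meeting", "NOUN"))
```

The same surface form `meeting` maps to different lemmas depending on its part of speech, which is information a stemmer never looks at.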

In [7]:
import spacy

text = "I haven't slept for ten days, because that would be too long."

nlp = spacy.load('en')
doc = nlp(text)
for token in doc:
    if not token.is_punct and token.text != token.lemma_:
        print(token.text, "->", token.lemma_)

I -> -PRON-
n't -> not
slept -> sleep
days -> day


You can see that plurals and different tenses are collapsed into a single base form. spaCy also introduces a special case for pronouns, because there is no clear lemma for them. Should "I" become "me", "it", or "they"? To avoid this ambiguity, spaCy assigns every pronoun the special lemma `-PRON-`.
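
If a later step needs to tell pronouns apart, one option is to fall back to the lowercased token text whenever the lemma is `-PRON-`. This is a sketch of that post-processing idea, not part of spaCy; `normalized_form` is a hypothetical helper, and it works on plain `(text, lemma)` pairs so it runs without a model.

```python
PRONOUN_LEMMA = "-PRON-"

def normalized_form(text, lemma):
    """Use the lemma, except for pronouns, where the lowercase text is kept."""
    return text.lower() if lemma == PRONOUN_LEMMA else lemma

# (text, lemma) pairs copied from the lemmatization output shown earlier.
pairs = [("I", "-PRON-"), ("n't", "not"), ("slept", "sleep"), ("days", "day")]
normalized = [normalized_form(text, lemma) for text, lemma in pairs]
print(normalized)
```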