# Basic preprocessing in natural language processing

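A common NLP library in Python is *spaCy*. You may need to install it first with <code>pip install spacy</code> or, if you're running this in Google Colab, with <code>!pip install spacy</code>. The English pipeline used below may also need to be downloaded once with <code>python -m spacy download en_core_web_sm</code>. First we import *spaCy* and load a trained pipeline, i.e. the model that carries out tokenisation, lemmatisation and the other steps shown here.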

In [1]:
# uncomment the next two lines if you need to install spaCy and the English pipeline
#!pip install spacy
#!python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')  # a small trained pipeline for English

## Tokenisation

First step: split the text sample into tokens.

In [17]:
doc=nlp("I'm afraid of bears and can't stand their smell.")
[w.text for w in doc]

['I',
 "'m",
 'afraid',
 'of',
 'bears',
 'and',
 'ca',
 "n't",
 'stand',
 'their',
 'smell',
 '.']

What might seem odd is that "'m" and "n't" have each become a token of their own. The reason is that each stands for a separate word ("am" and "not") that is merely attached to its neighbour in writing. The next step shows why treating them separately is a good idea.
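
To make the splitting of contractions concrete, here is a minimal sketch of a regex-based tokeniser. It is a toy written for this one example, not spaCy's actual tokeniser (which works with language-specific rules and exception lists); the name `toy_tokenise` and the pattern itself are assumptions made for illustration.

```python
import re

# Alternatives are tried from left to right:
#   n't          the negation clitic, as in "can't"
#   '\w+         other clitics that start with an apostrophe, such as "'m"
#   \w+(?=n't)   the word stem directly in front of "n't"
#   \w+          any other word
#   [^\w\s]      a single punctuation character
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

def toy_tokenise(text):
    """Split text into word, clitic and punctuation tokens (toy version)."""
    return TOKEN_PATTERN.findall(text)

tokens = toy_tokenise("I'm afraid of bears and can't stand their smell.")
print(tokens)
```

On the sample sentence this reproduces the token list above, but it will not match spaCy on arbitrary text.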

## Lemmatisation

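Lemmatisation transforms a token into its base form, the lemma: "am" and "'m" become "be", "bears" becomes "bear", and so on. In the spaCy version used here, pronouns such as "I", "you" or "she" become "-PRON-", a marker for "pronoun"; newer versions (spaCy 3 and later) keep the pronoun itself as the lemma, so your output may differ.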

In [18]:
[w.lemma_ for w in doc]

['-PRON-',
 'be',
 'afraid',
 'of',
 'bear',
 'and',
 'can',
 'not',
 'stand',
 '-PRON-',
 'smell',
 '.']
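
Conceptually, the simplest lemmatiser is a lookup table from token to base form. The sketch below uses a hypothetical, hand-written table that covers only this sentence; spaCy's own lemmatiser relies on far larger tables and rules, so `LEMMA_TABLE` and `toy_lemmatise` are illustrative names, not part of the library. It also shows why the earlier split pays off: "'m" and "n't" get entries of their own, so "I'm" and "can't" need no special handling.

```python
# Hand-written lookup table for the sample sentence only (an assumption,
# not spaCy data). Keys are lower-cased token texts.
LEMMA_TABLE = {
    "i": "-PRON-",
    "their": "-PRON-",
    "'m": "be",
    "bears": "bear",
    "ca": "can",
    "n't": "not",
}

def toy_lemmatise(tokens):
    """Return the table entry for each token, or the token itself if unknown."""
    return [LEMMA_TABLE.get(token.lower(), token) for token in tokens]

tokens = ["I", "'m", "afraid", "of", "bears", "and",
          "ca", "n't", "stand", "their", "smell", "."]
lemmas = toy_lemmatise(tokens)
print(lemmas)
```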

## Stop words

Another common step is to filter out very common words, called stop words. Words such as "I", "the" and "or" appear in so many sentences that they carry little information about the content of any particular text.

In [19]:
[w.lemma_ for w in doc if not w.is_stop]

['afraid', 'bear', 'stand', 'smell', '.']
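
The filter in the cell above tests the token and keeps the lemma. The same logic can be sketched without spaCy, assuming a tiny hand-picked stop list; `TOY_STOP_WORDS` is hypothetical and covers only this sentence, whereas spaCy ships a much longer list.

```python
# (token text, lemma) pairs copied from the outputs above
pairs = [
    ("I", "-PRON-"), ("'m", "be"), ("afraid", "afraid"), ("of", "of"),
    ("bears", "bear"), ("and", "and"), ("ca", "can"), ("n't", "not"),
    ("stand", "stand"), ("their", "-PRON-"), ("smell", "smell"), (".", "."),
]

# Hypothetical stop list for illustration only
TOY_STOP_WORDS = {"i", "'m", "of", "and", "ca", "n't", "their"}

# Keep the lemma of every token whose lower-cased text is not a stop word
content_lemmas = [lemma for text, lemma in pairs
                  if text.lower() not in TOY_STOP_WORDS]
print(content_lemmas)
```

Note that the full stop survives, because punctuation is not a stop word. In spaCy, an additional check on `w.is_punct` can be used to drop it as well.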

A word of warning: all these steps destroy information. If "am" and "was" collapse to "be", and if "they", "she" and "I" all become "-PRON-", the distinctions between them are lost. Depending on the application this may be acceptable because we gain something in return, namely a substantial reduction in vocabulary size. In this sense, these preprocessing steps act as a form of dimensionality reduction.
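
The reduction can be checked by counting distinct entries after each step. The sketch below simply reuses the three output lists shown earlier; with a single sentence the effect is modest, and it is only an illustration of the bookkeeping, not a measurement on a real corpus.

```python
# Output lists copied from the cells above
tokens = ["I", "'m", "afraid", "of", "bears", "and",
          "ca", "n't", "stand", "their", "smell", "."]
lemmas = ["-PRON-", "be", "afraid", "of", "bear", "and",
          "can", "not", "stand", "-PRON-", "smell", "."]
content_lemmas = ["afraid", "bear", "stand", "smell", "."]

# Vocabulary size = number of distinct entries at each stage
sizes = {
    "tokens": len(set(tokens)),
    "lemmas": len(set(lemmas)),
    "lemmas without stop words": len(set(content_lemmas)),
}
for stage, size in sizes.items():
    print(f"{stage}: {size}")
```

Here "I" and "their" share the lemma "-PRON-", so lemmatisation alone already merges two vocabulary entries into one, and stop-word removal discards most of the rest.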