# Text Normalization with Spacy

### Introduction

In the last lesson, we saw how we can represent similar words with the same encodings by first normalizing the text through stemming or lemmatization.  In this lesson, we'll use the Spacy library to perform lemmatization and to inspect other characteristics of words.  Let's get started.

### Using spacy

The sklearn library does not perform lemmatization for us, so instead we can use the [spacy](https://spacy.io/api/doc) library.  Spacy is already installed in a colab environment; to install it locally, along with the small English model, run the following:

```
pip3 install spacy
python3 -m spacy download en_core_web_sm
```

We begin by loading up the English core small model, [referenced here](https://spacy.io/models/en).

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")

To tokenize and lemmatize a document, we pass its text to `nlp`:

In [2]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.  It is cool.")

In [6]:
doc

Apple is looking at buying U.K. startup for $1 billion.  It is cool.

We have just created a `Doc` (document) object in spacy.

In [4]:
type(doc)

spacy.tokens.doc.Doc

In the rest of this lesson, we'll explore what this accomplished.

### Spacy Tokens

If we look at the different elements of a document, we see that a document consists of a collection of tokens.

In [5]:
first_token = doc[0]
first_token

Apple

In [34]:
type(first_token)

spacy.tokens.token.Token

And each token has different attributes.  

> Check the different attributes below by pressing tab.

In [None]:
first_token.
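
Outside a notebook, where tab completion isn't available, a few of these attributes can be printed directly. The following is a minimal sketch that assumes only the spacy package is installed: it uses `spacy.blank("en")`, a tokenizer-only pipeline, so no model download is needed. The attributes shown are lexical ones (`is_alpha`, `is_punct`, `like_num`, `is_currency`), while `lemma_` still relies on a loaded model such as the `en_core_web_sm` model used above.

```python
import spacy

# A blank English pipeline only tokenizes, so it runs without a model download.
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Collect a few lexical attributes for every token.
rows = [
    (tok.text, tok.is_alpha, tok.is_punct, tok.like_num, tok.is_currency)
    for tok in sample_doc
]

print("text, is_alpha, is_punct, like_num, is_currency")
for row in rows:
    print(row)
```

Attributes like these are handy for deciding which tokens to keep, for example skipping punctuation or currency symbols.
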

What we'll use most often is a token's lemma.

In [82]:
[token.lemma_ for token in doc]

['Apple',
 'be',
 'look',
 'at',
 'buy',
 'U.K.',
 'startup',
 'for',
 '$',
 '1',
 'billion',
 '.',
 ' ',
 '-PRON-',
 'be',
 'cool',
 '.']

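So we can see that many of the `lemma_` attributes are identical to the word, but others, like those for `is` and `buying`, are not.  Also notice that this version of spacy returns the string `-PRON-` when it finds a pronoun like `she` or `it`.  Newer spacy releases (v3 onward) no longer use this placeholder, so the output may differ.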

> We can keep the original pronoun with a conditional expression that falls back to the token's text.

In [83]:
[tok.lemma_ if tok.lemma_ != "-PRON-" else tok.text for tok in doc]

['Apple',
 'be',
 'look',
 'at',
 'buy',
 'U.K.',
 'startup',
 'for',
 '$',
 '1',
 'billion',
 '.',
 ' ',
 'It',
 'be',
 'cool',
 '.']
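
The pronoun fallback can also be wrapped in a small helper that always returns strings. This is a sketch rather than anything spacy provides: the name `lemmas_keeping_pronouns` is hypothetical, and it assumes a spacy version that emits the `-PRON-` placeholder, as in the output above. To keep the example runnable without a model, it is exercised on stand-in objects that expose only `text` and `lemma_`. With a real document, we would call it as `lemmas_keeping_pronouns(doc)`.

```python
from types import SimpleNamespace

PRON_PLACEHOLDER = "-PRON-"  # placeholder lemma seen in this lesson's output

def lemmas_keeping_pronouns(tokens):
    """Return each token's lemma, or its original text when the lemma is the placeholder."""
    return [
        tok.text if tok.lemma_ == PRON_PLACEHOLDER else tok.lemma_
        for tok in tokens
    ]

# Hypothetical stand-in tokens exposing only the two attributes the helper reads.
sample = [
    SimpleNamespace(text="It", lemma_="-PRON-"),
    SimpleNamespace(text="is", lemma_="be"),
    SimpleNamespace(text="cool", lemma_="cool"),
]

print(lemmas_keeping_pronouns(sample))  # ['It', 'be', 'cool']
```
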

### Spacy and Scikit Learn

Scikit learn lets us control how documents are tokenized: its `CountVectorizer` accepts a function that takes in a document and returns a list of tokens.

In [84]:
import spacy
nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(document):
    return [token.lemma_ for token in nlp(document)]

> In the cell above, we just wrapped our lemmatization code inside of a function.

And then we can pass that function into our count vectorizer.

In [85]:
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizor = CountVectorizer(tokenizer = spacy_tokenizer)

Now our `CountVectorizer` will apply the function to each document before encoding a bag of words model.

In [86]:
documents = ["Apple is looking at buying U.K. startup for $1 billion.  It is cool."]
vectors = bow_vectorizor.fit_transform(documents)

In [87]:
bow_vectorizor.inverse_transform(vectors)

[array(['apple', 'be', 'look', 'at', 'buy', 'u.k', '.', 'startup', 'for',
        '$', '1', 'billion', ' ', '-PRON-', 'cool'], dtype='<U7')]
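
Notice that the vocabulary above includes the period and the whitespace token, which rarely help a bag of words model. One possible refinement, sketched here under stated assumptions, is to drop those tokens inside the tokenizer function. The function name `content_tokens` is hypothetical. To stay runnable without a model download, the sketch uses `spacy.blank("en")` and returns `tok.text`. With the `en_core_web_sm` model loaded as above, `tok.lemma_` could be returned instead. Passing `token_pattern=None` tells `CountVectorizer` that its default pattern is intentionally unused.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

# Tokenizer-only pipeline (assumption: swap in spacy.load("en_core_web_sm") for lemmas).
blank_nlp = spacy.blank("en")

def content_tokens(document):
    """Tokenize with spacy, skipping punctuation and whitespace tokens."""
    return [
        tok.text
        for tok in blank_nlp(document)
        if not (tok.is_punct or tok.is_space)
    ]

documents = ["Apple is looking at buying U.K. startup for $1 billion.  It is cool."]

# token_pattern=None signals that the custom tokenizer replaces the default pattern.
filtered_vectorizer = CountVectorizer(tokenizer=content_tokens, token_pattern=None)
filtered_vectorizer.fit_transform(documents)

# vocabulary_ maps each kept token to its column index.
vocab = sorted(filtered_vectorizer.vocabulary_)
print(vocab)
```

Because `CountVectorizer` lowercases each document before calling the tokenizer, the resulting vocabulary is lowercase, just as in the output above.
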

### Summary

In this lesson, we saw how to use the spacy library.  We begin by importing the library and loading a language model, which we assign to `nlp`.  We then pass the text of a document to `nlp` to parse it.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

This creates a spacy document, which contains a collection of many tokens.  Each token has different attributes.  The one we'll use most is `lemma_`, which returns to us the lemma of the token. 

Finally, we saw that we can have spacy tokenize and lemmatize each document by defining a function that we then pass into our `CountVectorizer`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def spacy_tokenizer(document):
    return [token.lemma_ for token in nlp(document)]

bow_vectorizor = CountVectorizer(tokenizer = spacy_tokenizer)

### Resources

[Spacy](https://towardsdatascience.com/a-short-introduction-to-nlp-in-python-with-spacy-d0aa819af3ad)