# Spacy Text Normalization Lab

### Introduction

In this lab, we'll become more familiar with the spacy library and use it to normalize text, so that we can then represent that text with a vector space model such as bag of words.

### Loading our Data

Let's begin by loading our dataset of airline tweets.

In [9]:
import pandas as pd

df = pd.read_csv('./Tweets.csv')

In [11]:
documents = df.text

Let's take a look at some of the documents.

In [14]:
documents[:10].tolist()

['@VirginAmerica What @dhepburn said.',
 "@VirginAmerica plus you've added commercials to the experience... tacky.",
 "@VirginAmerica I didn't today... Must mean I need to take another trip!",
 '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse',
 "@VirginAmerica and it's a really big bad thing about it",
 "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA",
 '@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)',
 '@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP',
 "@virginamerica Well, I didn't…but NOW I DO! :-D",
 "@VirginAmerica it was amazing, and arrived an hour early. You're too good to me."]
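
Before passing the text column to spacy, it can help to confirm that every document is a string, since a missing value is not valid input for `nlp`. Here is a minimal sketch, assuming a small hypothetical `sample_df` in place of `Tweets.csv` (the names `sample_df` and `sample_documents` are illustrative and not part of the lab):

```python
import pandas as pd

# Hypothetical stand-in for Tweets.csv, with one missing value on purpose.
sample_df = pd.DataFrame({
    "text": ["@VirginAmerica What @dhepburn said.", None, "Great flight!"]
})

# Drop rows with missing text so that every remaining document is a string,
# then renumber the index so that sample_documents[0] is the first tweet.
sample_documents = sample_df["text"].dropna().reset_index(drop=True)

print(len(sample_documents))
print(sample_documents.tolist())
```

Resetting the index matters here because the lab later selects the first tweet with `documents[0]`, which looks up the index label `0` rather than the first position.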

### Creating Spacy Documents

Now, we can begin to normalize some of the text by first creating a document and then finding the lemma of each word.  Begin by creating a spacy document from the text of the first tweet.

In [23]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [29]:
first_doc = documents[0]

In [30]:
first_spacy_doc = nlp(first_doc)

In [31]:
type(first_spacy_doc)

# spacy.tokens.doc.Doc

spacy.tokens.doc.Doc
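
A `Doc` can be treated as a sequence of `Token` objects, which is why we can loop over it in the next section. The sketch below illustrates this under one assumption: it uses `spacy.blank("en")`, a tokenizer-only pipeline, instead of `en_core_web_sm`, so that it runs without a downloaded model. The names `blank_nlp`, `sample_text`, and `sample_doc` are illustrative.

```python
import spacy
from spacy.tokens import Doc, Token

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_sm so that the sketch runs without a downloaded model.
blank_nlp = spacy.blank("en")

sample_text = "You're too good to me."
sample_doc = blank_nlp(sample_text)

# A Doc behaves like a sequence of Token objects.
print(type(sample_doc))
print(len(sample_doc))
print([token.text for token in sample_doc])

# Tokenization is non-destructive: the original text is still available.
print(sample_doc.text == sample_text)
```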

### Viewing Tokens

Next, let's display the lemma of each word.

In [38]:
[token.lemma_ for token in first_spacy_doc]

# ['@VirginAmerica', 'what', '@dhepburn', 'say', '.']

['@VirginAmerica', 'what', '@dhepburn', 'say', '.']

Notice that the result includes punctuation, such as the period.  Let's say that we don't want to include this, so we'll only include a token if its `is_alpha` attribute is true.

In [46]:
[token.lemma_ for token in first_spacy_doc if token.is_alpha]

['what', 'say']

We can see that this removed too much information, as it eliminated the twitter handles as well.  Instead, use the `is_punct` attribute to exclude only the tokens that are punctuation.

In [47]:
[token.lemma_ for token in first_spacy_doc if not token.is_punct]

# ['@VirginAmerica', 'what', '@dhepburn', 'say']

['@VirginAmerica', 'what', '@dhepburn', 'say']

Ok, this is starting to look better.  Next, let's also get rid of stop words by using the `is_stop` attribute.

In [48]:
[token.lemma_ for token in first_spacy_doc if not token.is_punct and not token.is_stop]

['@VirginAmerica', '@dhepburn', 'say']
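
To see how the three flags differ, it can help to print them side by side for each token. The sketch below is one way to do that. It assumes a blank English pipeline (`spacy.blank("en")`) rather than `en_core_web_sm`, which should be enough here because these flags are lexical attributes and do not depend on a trained model. The sentence and the `flags` dictionary are illustrative.

```python
import spacy

# Assumption: spacy.blank("en") stands in for en_core_web_sm; the flags
# below are lexical attributes, so no trained components are needed.
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("the flight was late.")

# Collect the three boolean flags for each token, keyed by the token text.
flags = {
    token.text: (token.is_alpha, token.is_punct, token.is_stop)
    for token in sample_doc
}

for text, (is_alpha, is_punct, is_stop) in flags.items():
    print(f"{text!r:10} alpha={is_alpha} punct={is_punct} stop={is_stop}")
```

Filtering on `is_alpha` drops anything that is not purely alphabetic, while `is_punct` and `is_stop` each remove one specific kind of token, which is why combining the latter two keeps the twitter handles.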

### Working with Sklearn

Let's move on to combining spacy with sklearn. First, we define a tokenizer function that takes a document as a string and returns a list of tokens.  We want our tokenizer to return the lemmas of the document's tokens, excluding punctuation and stop words.

In [51]:
def lemma_tokenizer(document):
    # Return the lemma of every token that is neither punctuation nor a stop word.
    return [token.lemma_ for token in nlp(document)
            if not token.is_punct and not token.is_stop]

In [54]:
lemma_tokenizer(first_doc)

# ['@VirginAmerica', '@dhepburn', 'say']

['@VirginAmerica', '@dhepburn', 'say']
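
Some tweets in this dataset contain newlines (the sixth tweet shown earlier is one example). spacy can emit whitespace-only tokens for such characters, and these are neither punctuation nor stop words, so they may pass the filter above. One possible extension, sketched below, is to also check `is_space`. This is an assumption about what we may want, not part of the lab's tokenizer: `text_tokenizer` is a hypothetical variant that returns `token.text` instead of `token.lemma_`, because it uses a blank pipeline that has no lemmatizer.

```python
import spacy

# Assumption: a blank English pipeline keeps the sketch runnable without
# en_core_web_sm; it has a tokenizer but no lemmatizer.
blank_nlp = spacy.blank("en")

def text_tokenizer(document):
    # Hypothetical variant of lemma_tokenizer: returns token text and
    # additionally drops whitespace-only tokens such as "\n".
    return [
        token.text
        for token in blank_nlp(document)
        if not (token.is_punct or token.is_stop or token.is_space)
    ]

kept = text_tokenizer("seats were bad.\nflight delayed")
print(kept)
```

With `en_core_web_sm` loaded as `nlp`, the same extra condition could be added to `lemma_tokenizer` while still returning `token.lemma_`.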

Next, we can pass our tokenizer to `CountVectorizer`, so that the vectorizer uses it to tokenize each document.

In [57]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(tokenizer=lemma_tokenizer)

vectors = cv.fit_transform(documents)
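
To make the role of the `tokenizer` argument concrete, here is a small sketch on a two-document corpus. It makes a few assumptions: `simple_tokenizer` is a hypothetical whitespace tokenizer standing in for `lemma_tokenizer`, the corpus is made up, and `token_pattern=None` is passed only to signal that the default regular expression is not used when a custom tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer

def simple_tokenizer(document):
    # Hypothetical stand-in for lemma_tokenizer: split on whitespace only.
    return document.split()

tiny_corpus = ["late flight late", "great flight"]

# token_pattern=None because the custom tokenizer replaces the default pattern.
tiny_cv = CountVectorizer(tokenizer=simple_tokenizer, token_pattern=None)
tiny_vectors = tiny_cv.fit_transform(tiny_corpus)

# The vocabulary maps each term to a column; columns follow sorted term order.
print(sorted(tiny_cv.vocabulary_))
# Each row is a document, each column the count of one term in that document.
print(tiny_vectors.toarray())
```

Each row of the resulting matrix is the bag of words representation of one document: word order is discarded and only the count of each vocabulary term remains.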

We can get a sense of how well this worked by applying an inverse transform to the returned vectors, which shows the terms recorded for each document.

In [59]:
cv.inverse_transform(vectors)[:3]

[array(['@virginamerica', '@dhepburn', 'say'], dtype='<U52'),
 array(['@virginamerica', 'plus', 'add', 'commercial', 'experience',
        'tacky'], dtype='<U52'),
 array(['@virginamerica', 'today', 'mean', 'need', 'trip'], dtype='<U52')]
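
The lowercase handles in this output come from `CountVectorizer` itself: its `lowercase` parameter defaults to `True`, and the lowercasing is applied to each document before the tokenizer runs. The sketch below compares the default with `lowercase=False` on a made-up one-document corpus. `whitespace_tokenizer` is a hypothetical stand-in for `lemma_tokenizer`, and the other names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

def whitespace_tokenizer(document):
    # Hypothetical stand-in for lemma_tokenizer: split on whitespace only.
    return document.split()

corpus = ["@VirginAmerica Late late flight"]

# Default behaviour: each document is lowercased before tokenization.
lower_cv = CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)
# Same setup, but the original casing is preserved.
cased_cv = CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None,
                           lowercase=False)

lower_terms = sorted(lower_cv.inverse_transform(lower_cv.fit_transform(corpus))[0].tolist())
cased_terms = sorted(cased_cv.inverse_transform(cased_cv.fit_transform(corpus))[0].tolist())

print(lower_terms)
print(cased_terms)
```

Note also that `inverse_transform` lists each term once per document, so it shows which terms were kept but not how often they occurred or in what order.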

### Summary

In this lab, we worked with the spacy library and saw how we can use the `is_punct` and `is_stop` attributes to select certain tokens, and the `lemma_` attribute to return the lemma of each one.  We also saw how we can define this logic in a function and pass it to our vectorizer as its tokenizer.

### Resources

[Spacy Kaggle Tutorial](https://www.kaggle.com/honeysingh/spacy-tutorial)

[Spellchecker](http://theautomatic.net/2019/12/10/3-packages-to-build-a-spell-checker-in-python/)