# Spacy Text Normalization Lab

### Introduction

In this lab, we'll become more familiar with the spacy library, and use it to perform text normalization, so that we can then use a vector space model like bag of words to represent our text.

### Loading our Data

Let's begin by loading our dataset of airline tweets.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/nlp-text-representation/master/Tweets.csv"
df = pd.read_csv(url)

Let's again assign the column `text` to the variable `documents`.

In [2]:
documents = df.text

Now, let's take a look at some of the documents.

In [3]:
[document for document in documents][:10]

['@VirginAmerica What @dhepburn said.',
 "@VirginAmerica plus you've added commercials to the experience... tacky.",
 "@VirginAmerica I didn't today... Must mean I need to take another trip!",
 '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse',
 "@VirginAmerica and it's a really big bad thing about it",
 "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA",
 '@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)',
 '@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP',
 "@virginamerica Well, I didn't…but NOW I DO! :-D",
 "@VirginAmerica it was amazing, and arrived an hour early. You're too good to me."]

### Creating Spacy Documents

Previously, from here, we simply initialized our count vectorizer.  But because we do not want our CountVectorizer to represent similar words as completely different features, we'll first use spacy to transform each word into its corresponding lemma.
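To see why this matters for a bag of words representation, here is a minimal sketch that uses only the standard library. The `toy_lemmas` lookup is a hand-written stand-in for a real lemmatizer (it is not how spacy works internally), and the example sentence is made up.

```python
from collections import Counter

# Hypothetical, hand-written lookup standing in for a real lemmatizer.
toy_lemmas = {"delayed": "delay", "delays": "delay", "flights": "flight"}

# Made-up example text, split on whitespace.
words = "flight delayed again more delays on later flights".split()

# Without normalization, each surface form gets its own count.
raw_counts = Counter(words)

# With normalization, related forms collapse into a single feature.
lemma_counts = Counter(toy_lemmas.get(word, word) for word in words)

print(raw_counts["delayed"], raw_counts["delays"])    # 1 1
print(lemma_counts["delay"], lemma_counts["flight"])  # 2 2
```

The raw counts spread the same idea across several columns, while the lemma counts gather it into one.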

Begin by loading spacy's `en_core_web_sm` model, and assigning it to the variable `nlp`.

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [6]:
type(nlp)
# spacy.lang.en.English

spacy.lang.en.English

Next, let's practice working with spacy on a single document.  Begin by selecting the first document, and then pass this string into the model `nlp` (which we can think of as a parser).

In [8]:
first_doc = documents[0]
first_doc
# '@VirginAmerica What @dhepburn said.'

'@VirginAmerica What @dhepburn said.'

Assign the parsed document to `first_spacy_doc`.

In [9]:
first_spacy_doc = nlp(first_doc)

In [10]:
type(first_spacy_doc)

# spacy.tokens.doc.Doc

spacy.tokens.doc.Doc

### Viewing Tokens

Next, let's display the lemma of each word.  Use a list comprehension to collect the lemma of each token.

In [38]:
[token.lemma_ for token in first_spacy_doc]

# ['@VirginAmerica', 'what', '@dhepburn', 'say', '.']

['@VirginAmerica', 'what', '@dhepburn', 'say', '.']

Notice that this includes punctuation marks, like the period.  Let's say that we don't want to include these, so we'll only keep a token if its `is_alpha` attribute is true.

In [46]:
[token.lemma_ for token in first_spacy_doc if token.is_alpha]

# ['what', 'say']

['what', 'say']

We can see that this got rid of too much information, as it eliminated the Twitter handles as well.  Instead, use the `is_punct` attribute to exclude only the tokens that are punctuation.

In [47]:
[token.lemma_ for token in first_spacy_doc if not token.is_punct]

# ['@VirginAmerica', 'what', '@dhepburn', 'say']

['@VirginAmerica', 'what', '@dhepburn', 'say']

Ok, this is starting to look better.  Next, let's also get rid of stop words by using the `is_stop` attribute.

In [48]:
[token.lemma_ for token in first_spacy_doc if not token.is_punct and not token.is_stop]
# ['@VirginAmerica', '@dhepburn', 'say']

['@VirginAmerica', '@dhepburn', 'say']
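The filters above can be wrapped in a small helper. The sketch below is one possible way to do this; the name `normalized_lemmas` is our own, and the stand-in tokens built with `SimpleNamespace` only copy the attribute values implied by the outputs above, so that the example runs without loading a spacy model.

```python
from types import SimpleNamespace


def normalized_lemmas(doc):
    """Return lemmas for tokens that are neither punctuation nor stop words."""
    return [
        token.lemma_
        for token in doc
        if not token.is_punct and not token.is_stop
    ]


def fake_token(lemma, is_punct=False, is_stop=False):
    """Build a stand-in object exposing the three token attributes used above."""
    return SimpleNamespace(lemma_=lemma, is_punct=is_punct, is_stop=is_stop)


# Stand-in for the parsed first tweet: 'what' is a stop word, '.' is punctuation.
fake_doc = [
    fake_token("@VirginAmerica"),
    fake_token("what", is_stop=True),
    fake_token("@dhepburn"),
    fake_token("say"),
    fake_token(".", is_punct=True),
]

print(normalized_lemmas(fake_doc))  # ['@VirginAmerica', '@dhepburn', 'say']
```

With the model loaded, the same helper should accept a real document, as in `normalized_lemmas(first_spacy_doc)`.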

### Working with Sklearn

Let's move on to combining spacy with sklearn. First, we define a `tokenizer` function that takes in a document as a string, and returns a list of tokens.  We want our tokenizer to return a list of the lemmas in the document.

> **Do not** remove punctuation or stop words in the tokenizer.

In [26]:
def tokenizer(document):
    return [token.lemma_ for token in nlp(document)]

In [27]:
tokenizer(first_doc)

# ['@VirginAmerica', 'what', '@dhepburn', 'say', '.']

['@VirginAmerica', 'what', '@dhepburn', 'say', '.']

Next, we can have our CountVectorizer use our tokenizer to tokenize each document.  Initialize the count vectorizer, setting `tokenizer` to the function defined above.  Notice that if we wanted to remove stop words, we could have taken care of this in our tokenizer.

Set the `ngram_range` to (1, 2).

In [38]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(tokenizer = tokenizer, ngram_range= (1, 2))

vectors = cv.fit_transform(documents)

We can get a sense of how well this worked by looking at the returned vectors after an inverse transform.

In [39]:
cv.inverse_transform(vectors)[:3]

[array(['@virginamerica', 'what', '@dhepburn', 'say', '.',
        '@virginamerica what', 'what @dhepburn', '@dhepburn say', 'say .'],
       dtype='<U60'),
 array(['@virginamerica', '.', 'plus', '-PRON-', 'have', 'add',
        'commercial', 'to', 'the', 'experience', '...', 'tacky',
        '@virginamerica plus', 'plus -PRON-', '-PRON- have', 'have add',
        'add commercial', 'commercial to', 'to the', 'the experience',
        'experience ...', '... tacky', 'tacky .'], dtype='<U60'),
 array(['@virginamerica', 'to', '...', 'i', 'do', 'not', 'today', 'must',
        'mean', 'need', 'take', 'another', 'trip', '!', '@virginamerica i',
        'i do', 'do not', 'not today', 'today ...', '... must',
        'must mean', 'mean i', 'i need', 'need to', 'to take',
        'take another', 'another trip', 'trip !'], dtype='<U60')]
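The first array above contains five single tokens followed by four pairs of neighboring tokens, which is what `ngram_range=(1, 2)` asks for. The sketch below reproduces that combination in plain Python. It is a simplified illustration, not scikit-learn's own code, and `word_ngrams` is a name we made up. The tokens are lowercase because CountVectorizer lowercases each document by default before calling the tokenizer.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """List every run of n consecutive tokens, for each n in the inclusive range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Lemmas of the first tweet, lowercased.
tokens = ["@virginamerica", "what", "@dhepburn", "say", "."]

for ngram in word_ngrams(tokens):
    print(ngram)
```

Because punctuation was left in by the tokenizer, pairs such as `say .` become features as well.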

### Training a model

Let's see if this improves the accuracy of our model at all.  Assign the `airline_sentiment` column to the variable `sentiment`.

In [40]:
sentiment = df.airline_sentiment

In [41]:
sentiment[:2]

# 0     neutral
# 1    positive
# Name: airline_sentiment, dtype: object

0     neutral
1    positive
Name: airline_sentiment, dtype: object

Then convert the labels to numeric values using the following mapping:

In [42]:
sentiment_map = {'negative': 0, 'neutral': 1, 'positive': 2}

Assign the mapped target to `y`.

In [43]:
y = sentiment.map(sentiment_map)

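Let's split our data into training and test data.  Stratify the data on `y` and set the `random_state` to 2.  The cell below does not pass `test_size`, so `train_test_split` falls back to its default of holding out 25% of the rows for testing.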

In [44]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, y, stratify = y, random_state = 2)
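To get a feel for what stratifying does, here is a minimal sketch written with the standard library only. It is not how scikit-learn implements `train_test_split`; the function `stratified_split` and the labels are made up for illustration, and `test_size=0.25` mirrors scikit-learn's default.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.25, seed=2):
    """Split row indices so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        # Shuffle within one class, then hold out the same fraction of it.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Made-up labels: 60 negative (0), 20 neutral (1), 20 positive (2).
labels = [0] * 60 + [1] * 20 + [2] * 20
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts[0], test_counts[1], test_counts[2])  # 15 5 5
```

The held-out rows keep the same 60/20/20 mix as the full set of labels, which is the property we want when one sentiment is much more common than the others.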

Then let's train a logistic regression model, setting the `random_state` to 2 and `max_iter` to 1000.

In [45]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state = 2, max_iter = 1000).fit(X_train, y_train)

In [46]:
lr.score(X_test, y_test)
# 0.8106557377049181

0.8106557377049181

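We see that our model performs similarly to its performance before lemmatization.

> However, the most heavily weighted features are slightly different, and we no longer see repeated forms of the same word showing up as separate features, as we did previously.  Since `lr.coef_[0]` holds the weights for class 0 (negative), the features below are the ones that most strongly indicate a negative tweet.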

In [49]:
# On scikit-learn 1.2 and later, call cv.get_feature_names_out() instead of cv.get_feature_names()
pd.Series(lr.coef_[0], cv.get_feature_names()).sort_values(ascending = False)[:30]

delay          1.452159
suck           1.312443
lose           1.186573
not            1.068794
hour           1.062909
nothing        1.059326
no             1.033789
bad            1.021676
terrible       0.976796
the bad        0.906495
on hold        0.885917
bag            0.880149
rude           0.855303
why            0.834972
hrs            0.805047
fuck           0.790441
since          0.783878
cancel         0.783625
stop           0.761884
website        0.759926
. help         0.747833
just dm'd      0.739816
:(             0.736653
ridiculous     0.734676
@united dme    0.727143
give up        0.719622
again          0.714931
min            0.708890
because        0.686110
why do         0.683581
dtype: float64
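In the cell above, `lr.coef_[0]` selects the first row of the coefficient matrix. With the mapping we used (negative is 0, neutral is 1, positive is 2), each row belongs to one sentiment class. The sketch below shows the idea on a tiny coefficient table. All of the numbers and the helper `strongest_feature` are made up for illustration and are not taken from the fitted model.

```python
# Made-up coefficient rows, one per class, in the order of the mapped labels 0, 1, 2.
feature_names = ["delay", "thank", "what"]
toy_coef = [
    [1.4, -1.1, -0.2],  # class 0: negative
    [-0.5, -0.3, 0.6],  # class 1: neutral
    [-0.9, 1.5, -0.4],  # class 2: positive
]
class_names = {0: "negative", 1: "neutral", 2: "positive"}


def strongest_feature(row, names):
    """Return the feature name with the largest coefficient in one class's row."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return names[best_index]


strongest = {
    class_names[label]: strongest_feature(row, feature_names)
    for label, row in enumerate(toy_coef)
}
print(strongest)  # {'negative': 'delay', 'neutral': 'what', 'positive': 'thank'}
```

With the fitted model, swapping `lr.coef_[0]` for `lr.coef_[2]` in the cell above should rank the features for the positive class instead.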

### Summary

In this lesson, we worked with the spacy library and saw how we can use the `lemma_`, `is_stop`, and `is_punct` token attributes to select certain words and return their lemmas.  We also saw how we can define this logic in a tokenizer function and pass it to our vectorizer.

### Resources

[Spacy Kaggle Tutorial](https://www.kaggle.com/honeysingh/spacy-tutorial)

[Spellchecker](http://theautomatic.net/2019/12/10/3-packages-to-build-a-spell-checker-in-python/)