<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP I: Language Data Pre-Processing and Sentiment Analysis

_Authors: Matt Brems, Noelle Brown_

---

### Learning Objectives

1. Define and implement tokenizing, lemmatizing, and stemming.
2. Preprocess text data.
3. Define and apply sentiment analysis.

#### Before we begin, try running this:

In [1]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")

'cat'

In [2]:
# If you get an error with the above code, run the lines below and follow the directions underneath:
# import nltk
# nltk.download()

If you ran into issues with the above:

1. Run `nltk.download()`. A new screen will pop up outside your Jupyter notebook. (It may be hidden behind other windows.)
2. Once this box opens up, click `all`, then `download`. Once this is done, restart your Jupyter notebook and try running the first three cells again.
3. Run:
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")
```

    - If this returns `cat`, then fantastic! You’re done. 
    - If not, head to http://www.nltk.org/install.html and follow instructions for your computer, then try running the first three cells again.
    
---

**Natural language processing** (NLP) is the field concerned with getting computers to understand language the way humans do. It has many applications, including:

- identifying trending topics on a social media website
- automatically detecting spam
- voice-to-text services
- chatbots
- translation

Generally when we get text data, strings aren't broken out into individual words or even sentences. We might have a full tweet, full chapter of a book, or full .pdf file all in one long string.

Today, we're diving into the practical side of NLP: taking text data and breaking it out into words that we can then leverage in machine learning.

In [3]:
# Imports
import pandas as pd
import re

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [4]:
# Define spam text.
spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

print(spam)

Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.


# Pre-Processing 

When dealing with text data, there are common pre-processing steps. We won't necessarily use all of them every time we deal with text data.

- Remove special characters
- Tokenizing
- Lemmatizing/Stemming
- Stop word removal

## Removing special characters & Tokenizing

We may need to remove unnecessary characters when cleaning text data (punctuation, symbols, etc.), depending on how we're going to use it. There are many ways to do this.

When we "**tokenize**" data, we take it and split it up into distinct chunks based on some pattern.

Here we'll use a `RegexpTokenizer` to do these steps together.

**Note**: Later we'll learn a different tool that we will use more frequently. It will allow us to preprocess, tokenize, and assemble the data in a way that scikit-learn will be able to easily handle.

In [5]:
sent_tokenize(spam.lower())

['hello,\ni saw your contact information on linkedin.',
 'i have carefully read through your profile and you seem to have an outstanding personality.',
 'this is one major reason why i am in contact with you.',
 'my name is mr. valery grayfer chairman of the board of directors of pjsc "lukoil".',
 'i am 86 years old and i was diagnosed with cancer 2 years ago.',
 'i will be going in for an operation later this week.',
 'i decided to will/donate the sum of 8,750,000.00 euros(eight million seven hundred and fifty thousand euros only etc.',
 'etc.']

In [6]:
word_tokenize(spam.lower())

['hello',
 ',',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 '.',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 '.',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 '.',
 'my',
 'name',
 'is',
 'mr.',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 '``',
 'lukoil',
 "''",
 '.',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 '.',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 '.',
 'i',
 'decided',
 'to',
 'will/donate',
 'the',
 'sum',
 'of',
 '8,750,000.00',
 'euros',
 '(',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'etc',
 '.',
 'etc',
 '.']

In [7]:
# Instantiate Tokenizer
tokenizer = RegexpTokenizer(r'\w+') # We'll talk about this in a moment.

In [8]:
# "Run" Tokenizer
spam_tokens = tokenizer.tokenize(spam.lower())

In [9]:
# Show Results
print(spam_tokens)

['hello', 'i', 'saw', 'your', 'contact', 'information', 'on', 'linkedin', 'i', 'have', 'carefully', 'read', 'through', 'your', 'profile', 'and', 'you', 'seem', 'to', 'have', 'an', 'outstanding', 'personality', 'this', 'is', 'one', 'major', 'reason', 'why', 'i', 'am', 'in', 'contact', 'with', 'you', 'my', 'name', 'is', 'mr', 'valery', 'grayfer', 'chairman', 'of', 'the', 'board', 'of', 'directors', 'of', 'pjsc', 'lukoil', 'i', 'am', '86', 'years', 'old', 'and', 'i', 'was', 'diagnosed', 'with', 'cancer', '2', 'years', 'ago', 'i', 'will', 'be', 'going', 'in', 'for', 'an', 'operation', 'later', 'this', 'week', 'i', 'decided', 'to', 'will', 'donate', 'the', 'sum', 'of', '8', '750', '000', '00', 'euros', 'eight', 'million', 'seven', 'hundred', 'and', 'fifty', 'thousand', 'euros', 'only', 'etc', 'etc']


<details><summary>In comparing the original text to our tokenized version of the text, we converted one long string into a list of strings. What other changes occurred?</summary>

- All strings were converted to lower case.
- All punctuation was removed. (This was done using **regular expressions**.)
</details>

### Briefly: Regular Expressions

Regular Expressions, or RegEx, is a helpful tool for detecting patterns in text. 

RegEx is extremely powerful, and also extremely finicky. Many people use interactive tools to help them build their regular expressions.

Do not use RegEx for [parsing HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)!

When we instantiate `RegexpTokenizer(r'\w+')`, the `\w+` portion is the **regular expression** which tells Python to find "one or more of any word character."

Let's visit [RegEx101](https://regex101.com/) to investigate what that means.

**Note**: For most of the work you do, you **will not need** regular expressions. RegEx is most useful when you need to find strings that match a certain variety of patterns, for example "all things that might be phone numbers," "all things that might be email addresses," or "all things that might be social security numbers." It's important to know that RegEx is one tool you have, but using it here is overkill - we could get the same effect using inbuilt Python string methods.
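As a minimal sketch of the two points above, the snippet below applies the `\w+` pattern with Python's built-in `re` module and then gets the same tokens using only string methods. The sample sentence is a fragment of the spam text; `re.findall` is used here as a stand-in that should behave comparably to `RegexpTokenizer(r'\w+')` for this input, which is an assumption rather than a statement about NLTK's internals.

```python
import re
import string

text = 'I decided to WILL/Donate the sum of 8,750,000.00 Euros.'

# \w+ matches runs of letters, digits, and underscores;
# every other character acts as a separator.
regex_tokens = re.findall(r'\w+', text.lower())

# Same idea with string methods only: turn each punctuation
# character into a space, then split on whitespace.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
method_tokens = text.lower().translate(table).split()

print(regex_tokens)
print(regex_tokens == method_tokens)
```

The two approaches can differ on edge cases. For example, `\w` treats the underscore as a word character, while `string.punctuation` includes it, so text containing underscores would be split differently.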

![](../images/regex.png)

[_from xkcd_](https://xkcd.com/1171/)

---

## Lemmatizing & Stemming

- "He is *running* really fast!"
- "He *ran* the race."
- "He *runs* a five-minute mile."

If we wanted a computer to interpret these sentences, we might count up how many times we see each word. The computer will treat words like "running," "ran," and "runs" differently... but they mean very similar things (in this context)!

**Lemmatizing** and **stemming** are two forms of shortening words so we can combine similar forms of the same word.

When we "**lemmatize**" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

In [10]:
# Instantiate lemmatizer.
lemmatizer = WordNetLemmatizer()

In [11]:
# Lemmatize tokens.
tokens_lem = [lemmatizer.lemmatize(i) for i in spam_tokens]

In [12]:
# Compare tokens to lemmatized version.
list(zip(spam_tokens, tokens_lem))

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'information'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'carefully'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profile'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstanding'),
 ('personality', 'personality'),
 ('this', 'this'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'why'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valery'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i'

In [13]:
# Print only those lemmatized tokens that are different.
for i in range(len(spam_tokens)):
    if spam_tokens[i] != tokens_lem[i]:
        print((spam_tokens[i], tokens_lem[i]))

('directors', 'director')
('years', 'year')
('was', 'wa')
('years', 'year')
('euros', 'euro')
('euros', 'euro')


In [14]:
[(spam_tokens[i], tokens_lem[i]) for i in range(len(spam_tokens)) if spam_tokens[i] != tokens_lem[i]]

[('directors', 'director'),
 ('years', 'year'),
 ('was', 'wa'),
 ('years', 'year'),
 ('euros', 'euro'),
 ('euros', 'euro')]

Lemmatizing is sometimes very useful and sometimes makes the task at hand harder.

As an example, imagine searching a database of academic papers for the string "democracy." We would probably be interested in seeing papers on "democratically elected governments," "democratic trends," and even "democracies." It might be appropriate for the database search engine to lemmatize the query and match against lemmatized paper titles.

Let's try a few examples:

In [15]:
# Lemmatize the word "computers."
lemmatizer.lemmatize("computers")

'computer'

In [16]:
lemmatizer.lemmatize("computer")

'computer'

In [17]:
lemmatizer.lemmatize("computation")

'computation'

In [18]:
lemmatizer.lemmatize("computationally")

'computationally'

## Stemming

When we "**stem**" data, we take words and attempt to return a base form of the word. Stemming tends to be cruder than lemmatization: it strips suffixes by rule, so the result is not always a real word. The algorithm used below is described in [Porter's 1980 paper](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf).

In [19]:
# Instantiate PorterStemmer.
p_stemmer = PorterStemmer()

In [20]:
# Stem tokens.
stem_spam = [p_stemmer.stem(i) for i in spam_tokens]

In [21]:
# Compare tokens to stemmed version.
list(zip(spam_tokens, stem_spam))

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'inform'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'care'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profil'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstand'),
 ('personality', 'person'),
 ('this', 'thi'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'whi'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valeri'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i', 'i'),
 ('am', 'am'

In [22]:
# Print only those stemmed tokens that are different.

[(spam_tokens[i], stem_spam[i]) for i in range(len(spam_tokens)) if spam_tokens[i] != stem_spam[i]]

[('information', 'inform'),
 ('carefully', 'care'),
 ('profile', 'profil'),
 ('outstanding', 'outstand'),
 ('personality', 'person'),
 ('this', 'thi'),
 ('why', 'whi'),
 ('valery', 'valeri'),
 ('directors', 'director'),
 ('years', 'year'),
 ('was', 'wa'),
 ('diagnosed', 'diagnos'),
 ('years', 'year'),
 ('going', 'go'),
 ('operation', 'oper'),
 ('this', 'thi'),
 ('decided', 'decid'),
 ('donate', 'donat'),
 ('euros', 'euro'),
 ('hundred', 'hundr'),
 ('fifty', 'fifti'),
 ('euros', 'euro'),
 ('only', 'onli')]

We can also stem individual words.

In [23]:
# Stem the word "computers"
p_stemmer.stem("computers")

'comput'

In [24]:
# Stem the word "computer."
p_stemmer.stem("computer")

'comput'

In [25]:
# Stem the word "computation."
p_stemmer.stem("computation")

'comput'

In [26]:
# Stem the word "computationally."
p_stemmer.stem("computationally")

'comput'
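To connect this back to the word-counting motivation at the top of the section, here is a small sketch that counts words by their stem instead of their surface form. The word list is made up from examples already used in this lesson, and `Counter` is just one convenient way to tally.

```python
from collections import Counter

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

# Words taken from the examples in this section.
words = ["running", "ran", "runs", "computers", "computer", "computation"]

# Count occurrences per stem rather than per original word.
stem_counts = Counter(p_stemmer.stem(word) for word in words)

print(stem_counts)
```

Note that "ran" keeps its own entry: the Porter stemmer works by stripping suffixes, so irregular forms are not merged with "run". This is one way in which stemming is cruder than lemmatizing.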

## Stop Word Removal

The following quote has had stop words (and punctuation) removed:

"Answer great question life universe everything said deep thought said deep thought paused forty two said deep thought infinite majesty calm."

<details><summary>What book is the above sentence from?</summary>

The Hitchhiker's Guide to the Galaxy!
    
![](../images/hgg.jpg)
    
The original quote reads:  
..."The Answer to the Great Question..."  
"Yes..!"  
"Of Life, the Universe and Everything..." said Deep Thought.  
"Yes...!"  
"Is..." said Deep Thought, and paused.  
"Yes...!"  
"Is..."  
"Yes...!!!...?"  
"Forty-two," said Deep Thought, with infinite majesty and calm.”
</details>

<details><summary>If you were familiar with the book, how did you know what book the sentence was from?</summary>

Removing stop words did not remove key identifying words such as "life", "universe", "everything", and "forty-two".
</details>

<details><summary>Based on this, how would you define stop words?</summary>

Stop words are words that have little to no significance or meaning. They are common words that only add to the grammatical structure and flow of the sentence, so it is still relatively easy to identify the contents of sentences without stop words.
</details>
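A minimal sketch of the idea, using part of the Hitchhiker's Guide quote and a tiny hand-picked stop word set (an assumption for illustration; NLTK's English list, printed below, is much longer):

```python
import re

# A tiny hand-picked stop word set, for illustration only.
stop_words = {"the", "to", "of", "and"}

quote = "The Answer to the Great Question of Life, the Universe and Everything"

# Lowercase first, because the stop word set is lowercase,
# then tokenize on runs of word characters.
tokens = re.findall(r"\w+", quote.lower())

# Keep only the tokens that are not stop words.
kept = [token for token in tokens if token not in stop_words]

print(kept)
```

The remaining tokens are the content words that made the quote recognizable.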

In [27]:
# Print English stopwords.
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [28]:
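# Remove stopwords from "spam_tokens."
# Build the stop word set once so the list isn't reloaded for every token.
stop_words = set(stopwords.words('english'))
no_stop_words = [token for token in spam_tokens if token not in stop_words]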

In [29]:
# Check it
print(no_stop_words)

['hello', 'saw', 'contact', 'information', 'linkedin', 'carefully', 'read', 'profile', 'seem', 'outstanding', 'personality', 'one', 'major', 'reason', 'contact', 'name', 'mr', 'valery', 'grayfer', 'chairman', 'board', 'directors', 'pjsc', 'lukoil', '86', 'years', 'old', 'diagnosed', 'cancer', '2', 'years', 'ago', 'going', 'operation', 'later', 'week', 'decided', 'donate', 'sum', '8', '750', '000', '00', 'euros', 'eight', 'million', 'seven', 'hundred', 'fifty', 'thousand', 'euros', 'etc', 'etc']


---

# Sentiment Analysis

![](../images/sent.jpeg)

[Sentiment analysis](https://www.kdnuggets.com/2018/08/emotion-sentiment-analysis-practitioners-guide-nlp-5.html) is an area of natural language processing in which we seek to classify text as having positive or negative emotion.

We'll look today at the [VADER sentiment analyzer](https://github.com/cjhutto/vaderSentiment). VADER stands for "valence-aware dictionary and sentiment reasoner." We won't have to download anything new; NLTK comes with VADER. You can read the [VADER paper](https://www.scinapse.io/papers/2099813784#fullText) (pdf link) for more details on how VADER was created.

In [30]:
review_1 = '''Took a chance on this blouse and so glad i did. i wasn't crazy about how the blouse is photographed on the model. i paired it whit white pants and it worked perfectly. crisp and clean is how i would describe it. launders well. fits great. drape is perfect. wear tucked in or out - can't go wrong.'''

review_2 = '''First of all, this is not pullover styling. there is a side zipper. i wouldn't have purchased it if i knew there was a side zipper because i have a large bust and side zippers are next to impossible for me.

second of all, the tulle feels and looks cheap and the slip has an awkward tight shape underneath.

not at all what is looks like or is described as. sadly will be returning, but i'm sure i will find something to exchange it for!'''

review_3 = '''I don't normally review my purchases, but i was so amazed at how poorly this dress was made, i couldn't help myself but to post a review. the neck line isn't even hemmed down so it flaps up. the material is thin and feel cheap. this dress isnt even worth $20 in my opinion. i was expecting a well made, good quality dress for the high price tag.'''

In [31]:
print(review_1)

Took a chance on this blouse and so glad i did. i wasn't crazy about how the blouse is photographed on the model. i paired it whit white pants and it worked perfectly. crisp and clean is how i would describe it. launders well. fits great. drape is perfect. wear tucked in or out - can't go wrong.


In [32]:
print(review_2)

First of all, this is not pullover styling. there is a side zipper. i wouldn't have purchased it if i knew there was a side zipper because i have a large bust and side zippers are next to impossible for me.

second of all, the tulle feels and looks cheap and the slip has an awkward tight shape underneath.

not at all what is looks like or is described as. sadly will be returning, but i'm sure i will find something to exchange it for!


In [33]:
print(review_3)

I don't normally review my purchases, but i was so amazed at how poorly this dress was made, i couldn't help myself but to post a review. the neck line isn't even hemmed down so it flaps up. the material is thin and feel cheap. this dress isnt even worth $20 in my opinion. i was expecting a well made, good quality dress for the high price tag.


In [34]:
# Instantiate Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()

In [35]:
# Calculate sentiment of review_1.
sia.polarity_scores(review_1)

{'neg': 0.0, 'neu': 0.612, 'pos': 0.388, 'compound': 0.9782}

In [36]:
# Calculate sentiment of review_2.
sia.polarity_scores(review_2)

{'neg': 0.039, 'neu': 0.9, 'pos': 0.061, 'compound': 0.4199}

In [37]:
# Calculate sentiment of review_3.
sia.polarity_scores(review_3)

{'neg': 0.066, 'neu': 0.77, 'pos': 0.164, 'compound': 0.8515}

What limitations have you noticed?

Let's try a few random examples ourselves:

In [38]:
sia.polarity_scores('this is awesome')

{'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}

In [39]:
sia.polarity_scores('this is AWESOME')

{'neg': 0.0, 'neu': 0.293, 'pos': 0.707, 'compound': 0.7034}

In [40]:
sia.polarity_scores(':)')

{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4588}

In [41]:
sia.polarity_scores(':(')

{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.4404}
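To turn these dictionaries into a single label, one option is to threshold the `compound` score. The sketch below uses the cutoffs suggested in the VADER project's README (at least 0.05 for positive, at most -0.05 for negative, neutral in between). The helper `label_from_compound` is hypothetical and not part of NLTK, and the scores are copied from the outputs above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the outputs above.
compound_scores = {
    "review_1": 0.9782,
    "review_2": 0.4199,
    "review_3": 0.8515,
    ":(": -0.4404,
}

labels = {name: label_from_compound(score) for name, score in compound_scores.items()}
print(labels)
```

Under this rule, all three reviews are labeled positive, even though `review_2` and `review_3` read as complaints. That is one concrete way to see the limitations asked about above.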

Let's look at the whole reviews dataset:

In [42]:
# Read in training data.
reviews = pd.read_csv("../data/womens-clothing-reviews.csv")[['Review Text', 'Rating']]

In [43]:
# View the first five rows.
reviews.head()

| | Review Text | Rating |
|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comf... | 4 |
| 1 | Love this dress! it's sooo pretty. i happene... | 5 |
| 2 | I had such high hopes for this dress and reall... | 3 |
| 3 | I love, love, love this jumpsuit. it's fun, fl... | 5 |
| 4 | This shirt is very flattering to all due to th... | 5 |


In [44]:
# Examine a review.
reviews['Review Text'][2]

'I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c'

In [45]:
# Sentiment of the review.
sia.polarity_scores(reviews['Review Text'][2])

{'neg': 0.027, 'neu': 0.792, 'pos': 0.181, 'compound': 0.9427}

In [46]:
# Does this match the sentiment given in the training data?
reviews['Rating'][2]

3

## Interview Question

<details><summary>When processing text, what can you do about frequently occurring words, like "a," "and," "the," etc.?</summary>

- These words, called "stopwords," can either be kept or removed.
    - If we think these words do help explain our $Y$ variable, we might keep them. (For example, if we're classifying the era of a poem, the frequency of the word "the" may be helpful information!)
    - If we think these words don't help explain our $Y$ variable, we might remove them. (For example, in sentiment analysis, we might not think that people who use "the" more or less frequently are happier or angrier.)
</details>

A couple things to note:
1. NLP broadly describes: 
    - how we can get unstructured text data into a more structured form that can be interpreted by computers, and 
    - algorithms for interpreting text data.
2. That does not mean these tools we used today work to the exclusion of other methods. You can and should include other variables in your model!
    - For example, maybe the length of a review tells us something about how much people liked/disliked the product, or maybe additional information about the reviewer (e.g. geography, age, how many reviews they had submitted) has predictive value.
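As a rough sketch of the review-length idea, the snippet below adds character and word counts as numeric columns with pandas. The two-row DataFrame is a stand-in built from short excerpts of the reviews above, and the column names `char_count` and `word_count` are assumptions chosen for illustration.

```python
import pandas as pd

# Two short excerpts taken from the reviews above, used here only for illustration.
df = pd.DataFrame({
    "Review Text": [
        "launders well. fits great.",
        "the material is thin and feel cheap.",
    ]
})

# Character count and whitespace-separated word count as simple numeric features.
df["char_count"] = df["Review Text"].str.len()
df["word_count"] = df["Review Text"].str.split().str.len()

print(df)
```

The same two lines should work on the full `reviews` DataFrame. Rows with missing review text would produce missing values in the new columns, so those would need handling before modeling.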