# ![GA Logo](https://camo.githubusercontent.com/6ce15b81c1f06d716d753a61f5db22375fa684da/68747470733a2f2f67612d646173682e73332e616d617a6f6e6177732e636f6d2f70726f64756374696f6e2f6173736574732f6c6f676f2d39663838616536633963333837313639306533333238306663663535376633332e706e67) 

## NLP I: Tokenizing/Lemmatization and Sentiment Analysis
---

#### Before we begin, try running this:

In [None]:
import nltk

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
lemmatizer.lemmatize("cats")

If you ran into issues with the above:

1. Run `nltk.download()`. A new screen will pop up outside your Jupyter notebook. (It may be hidden behind other windows.)
2. Once this box opens up, click `all`, then `download`. Once this is done, restart your Jupyter notebook and try running the first three cells again.
3. Run:
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")
```

    - If this returns `cat`, then fantastic! You’re done. 
    - If not, head to http://www.nltk.org/install.html and follow instructions for your computer, then try running the first three cells again.
    
#### Also run this:

In [None]:
!pip install regex

### Which of these was machine generated?

- A: "Kilimanjaro is a mountain of 19,710 feet covered with snow and is said to be the highest mountain in Africa. The summit of the west is called “Ngaje Ngai” in Masai, the house of God. Near the top of the west there is a dry and frozen dead body of leopard. No one has ever explained what leopard wanted at that altitude."

- B: "Kilimanjaro is a snow-covered mountain 19,710 feet high, and is said to be the highest mountain in Africa. Its western summit is called the Masai “Ngaje Ngai,” the House of God. Close to the western summit there is the dried and frozen carcass of a leopard. No one has explained what the leopard was seeking at that altitude."

<details><summary>Answer:</summary>

- Item B was written by [Ernest Hemingway](https://en.wikipedia.org/wiki/Ernest_Hemingway) in "The Snows of Kilimanjaro."

- Item A was produced by a Japanese author translating "The Snows of Kilimanjaro" into Japanese, then this Japanese version was passed through Google Translate so that it could be "translated back" into English.
</details>

**Natural language processing** (NLP) is the field concerned with getting computers to understand language the way humans do. It has many applications, including:
- voice-to-text services for people who are hard of hearing.
- text-to-voice services for people who have difficulty reading.
- automated chatbots for organizations.
- translation services.

Generally when we get text data, strings aren't broken out into individual words or even sentences. We might have a full tweet, full chapter of a book, or full .pdf file all in one long string.

Today, we're diving into the practical side of NLP - taking text data and breaking it out into words that we can then leverage into $n$-grams or $tf$-$idf$ vectorizers.

## Agenda
1. Pre-Processing
2. Sentiment Analysis

## Learning Objectives

1. **Define** and **implement** tokenizing, lemmatizing, and stemming.
2. **Describe** what RegEx does.
3. **Apply** sentiment analysis.
4. **Preprocess** text data.

In [None]:
# Define spam text.
spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

print(spam)

## Pre-Processing 

When dealing with text data, there are some common pre-processing steps we might use. However, we won't necessarily use all of them every time we deal with text data.

- Tokenizing
- Regular Expressions
- Lemmatizing/Stemming
- Cleaning (e.g., removing HTML)

### Tokenizing

When we "tokenize" data, we take it and split it up into distinct chunks based on some pattern.

In [None]:
# Import Tokenizer
from nltk.tokenize import RegexpTokenizer

In [None]:
# Instantiate Tokenizer
tokenizer = RegexpTokenizer(r'\w+') ## We'll talk about this in a moment.

In [None]:
# "Run" Tokenizer
spam_tokens =

In [None]:
# Show Results


<details><summary>In comparing the original text to our tokenized version of the text, we converted one long string into a list of strings. What other changes occurred?</summary>

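- All strings were converted to lower case. (This happens only if we call `.lower()` on the text before tokenizing; the `\w+` pattern itself does not change case.)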
- All punctuation was removed.
- This was done using **regular expressions**.
</details>
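
As a rough illustration of what the `\w+` pattern does, here is a minimal sketch that uses Python's built-in `re` module instead of NLTK. It assumes the text is lower-cased before matching, and `sample_text` is a made-up string rather than our spam email.

```python
import re

# Made-up sample text, for illustration only.
sample_text = "Hello,\nI saw your contact information on LinkedIn."

# \w+ matches runs of letters, digits, and underscores, so punctuation
# and whitespace never end up inside a token.
sample_tokens = re.findall(r"\w+", sample_text.lower())

print(sample_tokens)
```

The comma, the newline, and the period all disappear because none of them is a "word" character.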

### Regular Expressions

Regular Expressions, or RegEx, are an extraordinarily helpful way for us to detect patterns in text. 
- This is a tool of which you should be aware, but you'll learn more about it later!

In [None]:
import regex as re

In [None]:
for i in spam_tokens:
    print(re.findall(r'\d+', i), i)

In a regular expression, `\d+` matches one or more consecutive digits. The code above therefore checks each token in `spam_tokens` and prints any runs of digits it contains (or an empty list if there are none).
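
To see the digit pattern on its own, here is a small sketch using the built-in `re` module. The tokens in `example_tokens` are made up for illustration.

```python
import re

# Made-up tokens, for illustration only.
example_tokens = ["euros", "86", "8", "750", "mr", "room101b"]

for token in example_tokens:
    # findall returns every non-overlapping run of digits in the token,
    # or an empty list when the token contains no digits.
    print(re.findall(r"\d+", token), token)

# Keep the results in a dictionary so they are easy to inspect.
digit_runs = {token: re.findall(r"\d+", token) for token in example_tokens}
```

Note that a token does not have to be entirely numeric to match: the digits inside `room101b` are still found.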

A `RegexpTokenizer` splits a string into substrings using regular expressions.

The following example is pulled from [this site](http://www.nltk.org/_modules/nltk/tokenize/regexp.html).

In [None]:
# Define and print string.
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

print(s)

In [None]:
# Instantiate tokenizer.
tokenizer_1 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

In [None]:
# Run tokenizer.


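`tokenizer_1` matches, in order of preference, a run of word characters, a dollar amount such as `$3.88`, or any other run of non-whitespace characters. As a result, words and prices stay whole, while punctuation such as a sentence-ending period becomes its own token.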

In [None]:
# Instantiate tokenizer.
tokenizer_2 = RegexpTokenizer(r'\s+', gaps=True)

# Run tokenizer.


`tokenizer_2` matches runs of whitespace. By setting `gaps=True`, we tell the tokenizer that the pattern describes the separators rather than the tokens, so we split our text up on whitespace.
- If you change to `gaps=False`, you'll get back only the whitespace!

In [None]:
# Instantiate tokenizer.
tokenizer_3 = RegexpTokenizer(r'[A-Z]\w+')

# Run tokenizer.


`tokenizer_3` returns only words that begin with a capital letter and are at least two characters long.
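
The three patterns above can also be tried with the built-in `re` module. This is a stand-in for the NLTK tokenizers, not a replacement: it assumes that matching with `re.findall` (or splitting with `re.split` for the `gaps=True` case) behaves the same way on this particular string.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Pattern 1: a word, or a dollar amount, or any other run of non-whitespace.
tokens_1 = re.findall(r"\w+|\$[\d\.]+|\S+", s)

# Pattern 2: treat whitespace as the separator (the idea behind gaps=True).
tokens_2 = re.split(r"\s+", s)

# Pattern 3: a capital letter followed by at least one more word character.
tokens_3 = re.findall(r"[A-Z]\w+", s)

print(tokens_1)
print(tokens_2)
print(tokens_3)
```

Compare the first two results: pattern 1 separates the period from `York`, while pattern 2 keeps `York.` together because it only looks at whitespace.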

As you can imagine, using RegEx can be incredibly helpful if you want to find text matching a specific pattern.
- People used to use two spaces after a period to split sentences up; you could use RegEx to detect that pattern and tokenize on entire sentences.
- Chapters in a book could be titled "Chapter" followed by a number; you could use RegEx to detect that pattern and tokenize a book by its chapters.
- When Python libraries are upgraded, syntax changes! Perhaps you want to detect a certain pattern of syntax so you can update your code efficiently.

### Lemmatizing & Stemming

- "He is *running* really fast!"
- "He *ran* the race."
- "He *runs* a five-minute mile."

If we wanted a computer to interpret these sentences, we might count up how many times each word appears. The computer will treat words like "running," "ran," and "runs" differently although they mean about the same thing (in this context).

**Lemmatizing** and **stemming** are two forms of shortening words so we can combine similar forms of the same word.

When we "lemmatize" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

In [None]:
# Import lemmatizer. (Same as above.)
from nltk.stem import WordNetLemmatizer

# Instantiate lemmatizer. (Same as above.)
lemmatizer =

In [None]:
# Lemmatize tokens.
tokens_lem =

In [None]:
# Compare tokens to lemmatized version.
list(zip(spam_tokens, tokens_lem))

In [None]:
# Print only those lemmatized tokens that are different.


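Lemmatizing is usually the more correct and precise way of handling things from a grammatical/morphological point of view, but it might not change much. (By default, `WordNetLemmatizer` treats every word as a noun, so verb forms such as "running" are left as-is unless you pass a part-of-speech tag.)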

We can also do this on individual words.

In [None]:
# Lemmatize the word "computers."


When we "stem" data, we take words and attempt to return a base form of the word. Stemming tends to be cruder than lemmatization because it strips suffixes by rule rather than looking words up in a dictionary. The stemmer used below implements [an algorithm published by Porter in 1980](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf).
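
Here is a quick sketch of stemming on the three "run" words from the start of this section. It assumes NLTK is installed; `example_words` is just an illustrative list.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# The three forms of "run" from the example sentences.
example_words = ["running", "runs", "ran"]
example_stems = [stemmer.stem(word) for word in example_words]

for word, stem in zip(example_words, example_stems):
    print(word, "->", stem)
```

The stemmer only strips suffixes, so the regular forms collapse to the same stem while the irregular past tense "ran" is left alone.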

In [None]:
# Import stemmer.
from nltk.stem.porter import PorterStemmer

In [None]:
# Instantiate object of class PorterStemmer.
p_stemmer =

In [None]:
# Stem tokens.
stem_spam =

In [None]:
# Compare tokens to stemmed version.
list(zip(spam_tokens, stem_spam))

In [None]:
# Print only those stemmed tokens that are different.


We can also do this on individual words.

In [None]:
# Stem the word "computer."


In [None]:
# Stem the word "computation."


In [None]:
# Stem the word "computationally."


# Let's start with a very simple example.

[Sentiment analysis](https://www.kdnuggets.com/2018/08/emotion-sentiment-analysis-practitioners-guide-nlp-5.html) is an area of natural language processing in which we seek to classify text as having positive or negative emotion.

Let's build a simple function that can classify text as either having positive or negative sentiment.

What words tell us whether certain text is positive?

In [None]:
# Let's come up with a list of positive and negative words we might observe.

positive_words =
negative_words =

In [None]:
def simple_sentiment(text):
    # Instantiate tokenizer.
    tokenizer = RegexpTokenizer(r'\w+')
    
    # Tokenize text.
    tokens = 
    
    # Instantiate stemmer.
    p_stemmer = 
    
    # Stem words.
    stemmed_words =
    
    # Stem our positive/negative words.
    positive_stems =
    negative_stems =

    # Count "positive" words.
    positive_count =
    
    # Count "negative" words
    negative_count =
    
    # Calculate Sentiment Percentage 
    # (Positive Count - Negative Count) / (Total Count)

    return 

In [None]:
# Run our sentiment analyzer on our spam email.


In [None]:
# Three not-so-random Chipotle reviews.

yelp_1 = "No Chipotle should ever have a 2 out of 5 star rating on Yelp. Especially not this one. As a regular (usually two or three visits a week), I have never been dissatisfied with a single meal here. It's Chipotle, so you know you'll pay $8 (after tax) for a chicken bowl and be full and satisfied afterwards. \n The employees are friendly and give generous portions. Seating is limited, but there is a place you can stand and eat near the window, which is where I always eat. I'm sitting down eight hours a day at the office anyway - standing and eating here is probably extending my lifespan. \nThe line gets line long during lunch, but it moves fast. Dinner time is amazing - rarely a line and the portions are extra generous during this time.\n This fairly new Chipotle is at a great location, near McPherson Square. It's right next to my office and gym so it's perfect for me. \nBottom line: if you're craving Chipotle and are worried about the other reviews and low ratings for this location, don't be. It's my favorite Chipotle location in the DC area, and that's not an exaggeration."

yelp_2 = "DISGUSTING LONG HAIR THREADED THROUGH CHICKEN IN BURRITO BOWL.\n There was a long blonde hair threaded through my chicken as I was eating a burrito bowl.  I did not notice until it was too late and the HAIR ENTERED MY MOUTH as I was eating and I grossly pulled the hair out.\n I calmly walked up to the register to inform them that there was hair in my food. The register person was busy, I understand that, but I was promptly ignored like my issue was not a big deal.  He proceeded to get his manager, 'Leslie' I believe.  She was not apologetic at all and offered no condolences. She did however offer a refund, but I didn't care about the money, I just wanted to eat food without eating someone's hair as a side dish.\n The second time I went back up, a different person, the general manager Peris, was more apologetic and handled the situation better. He ended up getting Leslie to file a report, but who knows if they submitted it or not.\n Suffice to say, if you dont want food in your hair, dont eat here."

yelp_3 = "First time going to this Chipotle.  The line was very quick and the food was fresh.  But as I started eating a notice that the food was very salty.  I started separating my bowl after two bites.  I ordered a bowl with white rice, black beans, chicken, sour cream, cheese and lettuce.  I tasted everything separately.  Once I tasted the Chicken by it self it was unbearable.  It taste like someone pouring the entire bottle of salt on tge chicken.  I tried to take most the chicken out the bowl but still I could not bear the taste of the salt.  So I ended up throwing the damn bowl away.  $8.00 down the drain.  SMH."

In [None]:
yelp_1

In [None]:
yelp_2

In [None]:
yelp_3

In [None]:
# Calculate sentiment of yelp_1.


In [None]:
# Calculate sentiment of yelp_2.


In [None]:
# Calculate sentiment of yelp_3.


<details><summary> What are some shortcomings of this method? </summary>

- Primarily, we're limited to the positive/negative words we came up with.
- If someone wrote "not good" or "not bad," our sentiment function would probably treat "not good" as positive or neutral instead of negative.
- The ordering of the words doesn't matter here, which is not how language generally works.
- We haven't been able to correct for misspellings.
</details>
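
For reference, here is one possible way the simple sentiment function could look. Treat it as a sketch rather than the answer: the word lists and the name `simple_sentiment_sketch` are made up for illustration, and the empty-text check is an extra assumption to avoid dividing by zero.

```python
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical word lists, chosen only for illustration.
EXAMPLE_POSITIVE = ["good", "great", "amazing", "friendly", "fresh", "love"]
EXAMPLE_NEGATIVE = ["bad", "terrible", "disgusting", "salty", "awful", "hate"]


def simple_sentiment_sketch(text, positive_words=EXAMPLE_POSITIVE,
                            negative_words=EXAMPLE_NEGATIVE):
    # Tokenize the lower-cased text into words.
    tokenizer = RegexpTokenizer(r"\w+")
    tokens = tokenizer.tokenize(text.lower())

    # Assumption: text with no tokens gets a neutral score.
    if not tokens:
        return 0.0

    # Stem the text and both word lists so that different forms match.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(token) for token in tokens]
    positive_stems = {stemmer.stem(word) for word in positive_words}
    negative_stems = {stemmer.stem(word) for word in negative_words}

    # Count how many stems fall into each list.
    positive_count = sum(stem in positive_stems for stem in stems)
    negative_count = sum(stem in negative_stems for stem in stems)

    # (Positive Count - Negative Count) / (Total Count)
    return (positive_count - negative_count) / len(stems)


print(simple_sentiment_sketch("The food was great"))
print(simple_sentiment_sketch("The food was terrible"))
print(simple_sentiment_sketch("not good"))
```

The last call shows the "not good" shortcoming directly: the function sees "good," ignores "not," and returns a positive score.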

# Sorting Positive from Negative Reviews

The easiest way to do sentiment analysis or classification is by training a model on data we've already labeled.
- This is a huge consideration for your capstone, even if it isn't related to NLP!

Today, we will begin by reviewing the basic NLP techniques we learned yesterday to create a sentiment analyzer from IMDb movie reviews. This code-along is adapted from Kaggle's tutorial, available [here](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words).


## Step One: Import The Data

In [None]:
# Import pandas.
import pandas as pd       

# Read in training data.
train = pd.read_csv("../labeledTrainData.tsv",
                    header=0,
                    delimiter="\t",
                    quoting=3)

In [None]:
# View the first five rows.
train.head()

In [None]:
# Examine the first review.


### Train/Test Split

In [None]:
# Read in testing data.
test = pd.read_csv("../testData.tsv",
                   header=0, 
                   delimiter="\t",
                   quoting=3)

In [None]:
# View the first five rows.
test.head()

Remember that our Kaggle data is organized into `labeledTrainData.tsv` and `testData.tsv`. However, `testData.tsv` doesn't contain any labels!

Let's do a train/test split by splitting up the labeled data in `train`.

In [None]:
# Import train_test_split.
from sklearn.model_selection import train_test_split

# Create train_test_split.
X_train, X_test, y_train, y_test = train_test_split(train[['id','review']],
                                                    train['sentiment'],
                                                    test_size = 0.25,
                                                    random_state = 42)

There are a few steps we'll take to clean up the text data before it's ready for processing.

- Remove the HTML code artifacts from the text.
- Remove punctuation.
- Remove stopwords. (We'll cover these shortly.)

## Step One: Remove HTML code artifacts

In [None]:
from bs4 import BeautifulSoup             

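# Initialize the BeautifulSoup object on a single movie review.
# Use .iloc[0] for the first row: the split shuffled the rows, so the
# index label 0 is not guaranteed to be in X_train.
example1 = BeautifulSoup(X_train['review'].iloc[0], 'html.parser')

# Print the raw review and then the output of get_text(), for 
# comparison
print(X_train['review'].iloc[0])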
print()
print(example1.get_text())

## Step Two: Remove Punctuation

Punctuation can be removed using regular expressions. Note that the pattern below keeps only letters, so digits are removed as well.
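
Here is a small sketch of that idea on a made-up review (`example_raw` is not from our data set), using the built-in `re` module.

```python
import re

# Made-up review text, for illustration only.
example_raw = "Great movie!!! 10/10, would watch again."

# Replace every character that is not a letter with a space.
example_letters = re.sub(r"[^a-zA-Z]", " ", example_raw)

# Lower-case, then split on whitespace. split() with no arguments
# collapses runs of spaces, so no empty strings are produced.
example_words = example_letters.lower().split()

print(example_words)
```

The rating "10/10" vanishes entirely, which may or may not be what you want for a given problem.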

In [None]:
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text())   # The text to search

In [None]:
# Show first fifty letters of letters_only.


In [None]:
# Convert letters_only to lower case.
lower_case =

# Split lower_case up at each space.
words = 

In [None]:
# Check first ten words.


## Step Three: Remove Stop Words

With our Yelp reviews above, you noticed that our sentiment scores were right around zero. While there were some positive and negative words, the vast majority of the words had neither a positive sentiment nor negative sentiment!
- Examples include "the," "of," "and," "a," "to," and "in."
    
Stopwords are very common words that are often removed because they carry little information, and removing them can dramatically speed up processing.
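
To show the idea without needing the NLTK download, here is a sketch that filters a made-up word list against a tiny, hand-picked stop word set. Both `example_stops` and `example_review_words` are assumptions for illustration; NLTK's English list is much longer.

```python
# A tiny, hand-picked stop word set (illustration only).
example_stops = {"the", "of", "and", "a", "to", "in", "was", "is"}

# Made-up tokens from a short review.
example_review_words = ["the", "line", "was", "quick", "and",
                        "the", "food", "was", "fresh"]

# Keep only the words that are not stop words.
kept_words = [w for w in example_review_words if w not in example_stops]

print(kept_words)
```

Nine words shrink to four, and the four that remain are the ones that say something about the review.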

If you didn't complete the NLTK download, you may run into some issues here.

In [None]:
# Download text data sets, including stop words. Uncomment this if you did not download!

# nltk.download()  

In [None]:
# Import stopwords.
from nltk.corpus import stopwords # Import the stop word list

In [None]:
# Print English stopwords.


In [None]:
# Remove stop words from "words."
words =

In [None]:
# Check "words" to make sure we did this properly.
print(words)

## Step Four: Combine our cleaning into one function

**Check**: Why should we do everything with one function?

In [None]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    
    # 1. Remove HTML.
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    
    # 2. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    
    # 3. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    # Notice that we did this in one line!
    
    # 4. In Python, searching a set is much faster than searching
    # a list, so convert the stop words to a set.
    stops = set(stopwords.words('english'))
    
    # 5. Remove stop words.
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)

## Step Five (Finally!) Applying our Function

In [None]:
# Get the number of reviews based on the dataframe size.
total_reviews = train.shape[0]
print(f'There are {total_reviews} reviews.')

# Initialize an empty list to hold the clean reviews.
clean_train_reviews = []
clean_test_reviews = []

In [None]:
print("Cleaning and parsing the training set movie reviews...")

j = 0
for train_review in X_train['review']:
    # Convert review to words, then append to clean_train_reviews.
    clean_train_reviews.append(review_to_words(train_review))
    
    # If the index is divisible by 1000, print a message
    if (j + 1) % 1000 == 0:
        print(f'Review {j + 1} of {total_reviews}.')
    
    j += 1

# Let's do the same for our testing set.

print("Cleaning and parsing the testing set movie reviews...")

for test_review in X_test['review']:
    # Convert review to words, then append to clean_test_reviews.
    clean_test_reviews.append(review_to_words(test_review))
    
    # If the index is divisible by 1000, print a message
    if (j + 1) % 1000 == 0:
        print(f'Review {j + 1} of {total_reviews}.')
        
    j += 1

## Our data is finally ready.....

In [None]:
# Import CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000) 

We'll describe this in greater detail in a later lesson, but CountVectorizer will transform the lists of cleaned reviews above into features that we can pass into a model.
- It will create columns (also known as vectors), where each column counts how many times a given word is observed in each review.
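
To make the "one column per word" idea concrete, here is a hand-rolled bag of words built with `collections.Counter`. This is not how `CountVectorizer` is implemented, and it skips details such as the `max_features` cap; `example_docs` is made up for illustration.

```python
from collections import Counter

# Two made-up, already-cleaned reviews.
example_docs = ["great movie great cast", "terrible movie"]

# Vocabulary: every distinct word, in sorted order (one column per word).
example_vocab = sorted({word for doc in example_docs for word in doc.split()})

# One row per review: how many times each vocabulary word appears in it.
example_counts = [
    [Counter(doc.split())[word] for word in example_vocab]
    for doc in example_docs
]

print(example_vocab)
print(example_counts)
```

Each row has the same length as the vocabulary, which is why the feature matrix for our reviews ends up with 5000 columns.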

In [None]:
# fit_transform() does two things: First, it fits the vectorizer
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.

train_data_features = vectorizer.fit_transform(clean_train_reviews)

test_data_features = vectorizer.transform(clean_test_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array.
train_data_features = train_data_features.toarray()

In [None]:
print(train_data_features.shape)

In [None]:
print(test_data_features.shape)

In [None]:
train_data_features[0:6]

In [None]:
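# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement (available since 1.0).
vocab = vectorizer.get_feature_names_out()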
print(vocab)

### Now we have an array that we can use for classification!

In [None]:
# Import logistic regression.

from sklearn.linear_model import LogisticRegression

In [None]:
# Instantiate logistic regression model.

lr = LogisticRegression()

In [None]:
# Fit model to training data.

lr.fit(train_data_features, y_train)

In [None]:
# Evaluate model on training data.
lr.score(train_data_features, y_train)

<details><summary>What is this score?</summary>

- Accuracy.
- Remember that the default metric for classification models in `scikit-learn` is accuracy.
</details>
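
As a reminder of what accuracy measures, here is a tiny sketch with made-up labels and predictions (`y_true_example` and `y_pred_example` are not from our model).

```python
# Made-up labels and predictions, for illustration only.
y_true_example = [1, 0, 1, 1, 0]
y_pred_example = [1, 0, 0, 1, 0]

# Accuracy is the share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true_example, y_pred_example))
example_accuracy = correct / len(y_true_example)

print(example_accuracy)
```

Four of the five predictions match, so the accuracy is 0.8.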

In [None]:
# Evaluate model on testing data.

lr.score(test_data_features, y_test)

<details><summary>What would we conclude about our model based on these scores?</summary>

- Since our model has a higher accuracy on the training set than on the testing set, our model is overfit.
- We could combat this overfitting by removing features (i.e. setting `max_features` to be lower than 5000) and/or by regularizing our model.
</details>

A couple things to note:
1. NLP broadly describes: 
    - how we can get unstructured text data into a more structured form that can be interpreted by computers and 
    - algorithms for interpreting text data.
2. That does not mean these tools we used today work to the exclusion of other methods. You can and should include other variables in your model!
    - For example, maybe the length of a review tells us something about how much people liked/disliked the movie, or maybe additional information about the reviewer (e.g., geography, age, how many reviews they had submitted) has predictive value.