# Natural Language Processing
So far, we have focused on data science involving quantitative and categorical variables. Today, we will learn how to analyze bodies of text with Natural Language Processing (NLP).

In [None]:
# Imports
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords

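# Download nltk resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Recent nltk releases look up these resource names instead of the two above
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')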

In [None]:
# Read in the IMDB Dataset and look at the first few rows
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-9/workshop/IMDB.csv'
reviews = pd.read_csv(url)
reviews.head()

In its raw form, the data has one feature (review) and a label (sentiment). Each row is a movie review that is either positive or negative.

As a review (ba dum tsss), let's start by creating our own feature: the length of the review in characters. We're going to use an apply function on the 'review' column.
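Here's a minimal sketch of that idea on a tiny made-up DataFrame. The name `toy_reviews` and its two rows are placeholders, not part of the IMDB data, and it assumes `apply(len)` is the approach we want (`.str.len()` would give the same numbers).

```python
import pandas as pd

# Hypothetical stand-in for the IMDB data: two short made-up reviews
toy_reviews = pd.DataFrame({
    "review": ["Loved it!", "Too long and too slow."],
    "sentiment": ["positive", "negative"],
})

# apply() calls len on each review string, giving its length in characters
toy_reviews["length"] = toy_reviews["review"].apply(len)
print(toy_reviews[["review", "length"]])
```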

In [None]:
# Create a 'length' column


Let's find the longest review (in terms of number of characters) and use that to learn some NLP skills.
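As a rough sketch of how this could look, the example below uses a three-row made-up DataFrame (`toy_reviews`, `longest_idx`, and `longest_text` are illustrative names). `idxmax` returns the index label of the largest value, and `.loc` then pulls out the matching text.

```python
import pandas as pd

# Hypothetical mini dataset with a 'length' column already computed
toy_reviews = pd.DataFrame(
    {"review": ["Fine.", "A slow start, but the ending is great.", "Meh"]}
)
toy_reviews["length"] = toy_reviews["review"].str.len()

# idxmax gives the index label of the row with the largest length
longest_idx = toy_reviews["length"].idxmax()

# Use that label to look up the review text itself
longest_text = toy_reviews.loc[longest_idx, "review"]
print(longest_idx, longest_text)
```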

In [None]:
# Use idxmax to find the index of the longest review


In [None]:
# Store the text of the longest review


Let's lowercase all the words so it's easier to identify words with different capitalizations as the same.

In [None]:
# Lowercase txt

# Tokenize
"Tokenizing" is breaking a piece of text into smaller parts. The smaller parts of text are called "tokens."

In [None]:
# Import sent_tokenize and word_tokenize from nltk.tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
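To make the idea of tokens concrete before running nltk's tokenizers, here is a rough sketch that uses only Python's built-in `re` module. The two splitting rules are simplifying assumptions, not what `sent_tokenize` and `word_tokenize` do internally; the nltk functions handle tricky cases like abbreviations much better.

```python
import re

# Made-up example text, lowercased like our review
text = "The plot was thin. The acting, however, was superb! Would I watch it again?"
text = text.lower()

# Rough sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Rough word split: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

print(len(sentences), sentences)
print(len(words), words[:10])
```

Notice that punctuation marks come out as their own tokens, which matters when we count words later.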

In [None]:
# Tokenize by sentence


# See how many sentences are in the review and preview the first 10


In [None]:
# Tokenize by word


# See how many words are in the review and preview the first 10


Tokenizing by word gives a list of all the words in the text. This allows us to get value counts of all of the words to get a sense of the main ideas.
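A small sketch of the counting step, assuming the tokens are already in a Python list (the `tokens` list here is made up for illustration):

```python
import pandas as pd

# Hypothetical word tokens from a short review
tokens = ["the", "movie", "was", "long", ",", "but",
          "the", "ending", "was", "the", "best", "."]

# value_counts tallies each distinct token, most frequent first
counts = pd.Series(tokens).value_counts()
print(counts.head())
```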

In [None]:
# Convert to pd.Series and look at the value counts


Hmmmm. That's not very helpful. We don't really care about words like "the," "and," and "to," and punctuation tokens tell us nothing about what the review is about. Luckily, there's a really easy way to improve on this!

We'll start by removing the punctuation using *punctuation* from the "string" library.
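One way this might look, sketched on a made-up token list. `string.punctuation` is just a string of punctuation characters, so the loop below drops a token only when every character in it is punctuation (that also catches multi-character tokens like "...").

```python
import string

# Hypothetical word tokens, including some punctuation tokens
tokens = ["great", "movie", "!", "really", ",", "really", "great", "..."]

no_punc = []
for tok in tokens:
    # Keep the token unless every character in it is punctuation
    if not all(ch in string.punctuation for ch in tok):
        no_punc.append(tok)

print(no_punc)
```

A token like "'s" survives this filter because it contains a letter.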

In [None]:
# Remove punctuation with for loop (string.punctuation)
import string


# Stop Words
A stop word is a commonly used word that does not convey much meaning.

Let's take a look at the stop words that nltk provides for us.

In [None]:
# Take a look at the nltk stopwords


Let's remove these from our tokenized text.
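Here's a sketch of the filtering step. To keep it runnable without any downloads, it uses a tiny hand-picked set called `toy_stop_words`, which is an assumption for illustration; in the notebook, the real list comes from `stopwords.words('english')`.

```python
# Hypothetical mini stop word list (nltk's real list is much longer)
toy_stop_words = {"the", "and", "to", "a", "of", "was", "it"}

# Made-up tokens with punctuation already removed
no_punc = ["the", "story", "was", "slow", "and", "the", "ending", "was", "rushed"]

# Keep only the tokens that are not stop words
filtered_txt = [word for word in no_punc if word not in toy_stop_words]
print(filtered_txt)
```

Using a set for the stop words keeps each membership check fast, which helps on long reviews.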

In [None]:
# Remove stop words


We can see that 'and' was removed because it is a stop word.

In [None]:
# Get value counts


That's more like it! We're getting a sense of the important words in the text. We can still do better though!

If "sneak" is in the text, we would want to group it with "sneaking" because they express the same concept. We can do this with stemming!

# Stemming
Stemming is reducing words to their stem. Words like "start," "started," "starting," and "starts" all have the same stem. After stemming them, they will all become "start."
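A quick sketch with nltk's `PorterStemmer` on the example words from the text (the `words` list is made up for illustration; the stemmer itself needs no extra downloads):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Example words from the explanation above
words = ["start", "started", "starting", "starts", "sneak", "sneaking"]

# stem() strips common suffixes according to the Porter rules
stemmed = [stemmer.stem(word) for word in words]
print(stemmed)
```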

In [None]:
# Look at words 15-30 before stemming


In [None]:
# Import the PorterStemmer
from nltk.stem import PorterStemmer


# Stem all words in filtered_txt


# Compare the previous output with the same words after stemming


We can see that "according" has been shortened to "accord." "Started" has been stemmed to "start."

In [None]:
# Look at the value counts of stemmed


This gives us a much clearer picture of the important concepts in the text. However, we can see that it isn't perfect. Using computers to process text rarely is. The "'s" isn't helpful. And interestingly "booker" did not get stemmed to "book." The natural language libraries do the best they can, but the English language is complicated.

# Part of Speech Tagging
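Another cool feature of nltk is part-of-speech (POS) tagging, which labels each token with its grammatical role, such as noun, verb, or conjunction.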

In [None]:
# Do POS tagging on no_punc


NN means singular noun. CD is a cardinal number. CC is a coordinating conjunction.

Here you can find a description of what the acronyms mean: https://www.guru99.com/pos-tagging-chunking-nltk.html#:~:text=POS%20Tagging%20in%20NLTK%20is%20a%20process%20to,grammatical%20information%20of%20each%20word%20of%20the%20sentence.

POS tagging comes in handy for chunking. "Chunking" is grouping similar words or phrases together based on the nature of the word or phrase. You can search for sequences of words of different types. For example, you could find verbs followed by nouns to get more information on the actions in the text. It involves regular expressions, which we have not covered yet. If NLP is something that interests you, "chunking" is definitely something to look into.
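For the curious, here is a small sketch of the verb-followed-by-noun idea using nltk's `RegexpParser`. The tokens are tagged by hand so the example runs without the tagger download, and the chunk label `VN` and the grammar are assumptions made up for this illustration.

```python
from nltk import RegexpParser

# Hand-tagged tokens (in the notebook these would come from nltk.pos_tag)
tagged = [
    ("critics", "NNS"), ("praised", "VBD"), ("performances", "NNS"),
    ("and", "CC"),
    ("audiences", "NNS"), ("loved", "VBD"), ("surprises", "NNS"),
]

# Hypothetical grammar: any verb tag (VB...) followed by any noun tag (NN...)
grammar = "VN: {<VB.*><NN.*>}"
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every chunk labeled VN
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "VN"
]
print(chunks)
```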

# Sentiment Analysis
Sentiment analysis is using computers to categorize opinions in text, especially to determine whether the attitude of the text is positive, negative, or neutral.
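TextBlob reports a polarity between -1 (negative) and 1 (positive), plus a subjectivity between 0 and 1. To get a feel for what a polarity score means, here is a toy scorer written from scratch. Everything in it (`toy_lexicon`, `toy_polarity`, the word scores) is made up for illustration and is much cruder than what TextBlob actually does.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1, 1]
toy_lexicon = {"wonderful": 1.0, "good": 0.7, "boring": -0.6, "terrible": -1.0}

def toy_polarity(sentence):
    """Average the polarity of known words; 0.0 means neutral or unknown."""
    cleaned = sentence.lower().replace(".", " ").replace("!", " ")
    scores = [toy_lexicon[word] for word in cleaned.split() if word in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

positive_score = toy_polarity("A wonderful film with a good cast.")
negative_score = toy_polarity("Terrible pacing and a boring script.")
neutral_score = toy_polarity("The film is two hours long.")
print(positive_score, negative_score, neutral_score)
```

A scorer this simple would rate "not good" as positive, which hints at why real sentiment tools need to do more than look up words.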

In [None]:
# Import TextBlob
from textblob import TextBlob

# Take a look at the review at index 1


# Tokenize by sentence


In [None]:
# Get the sentiment of the first sentence


## Your turn
Before having TextBlob tell you the sentiment of the sentences, take a look at the sentences for yourself and decide what you think the sentiment should be.

1. Find the polarity and subjectivity of the last sentence in the review we just did. 
2. Find the polarity and subjectivity of the first sentence (index 0) of the review at index 49996.

In [None]:
# Take a look at the last sentence in positive_sent_tokens and judge the sentiment for yourself
# Remember that the index of the last element is "-1"


In [None]:
# Get the sentiment of the last sentence in review 1


In [None]:
# Get the review at index 49996


# Tokenize by sentence


In [None]:
# Take a look at the first sentence (index 0) and judge the sentiment for yourself


In [None]:
# Get the sentiment of the first sentence


As we can see, sentiment analysis works very well for text that is clearly positive or negative. However, in some situations, it can get confused.

# Bag of Words
A bag-of-words (BOW) represents a text by how often certain words appear in it. It involves a vocabulary of known words and a count for each of those words. The representation ignores the order of the words in the text; it is only concerned with the frequency of each word from the vocabulary.

In [None]:
# sklearn imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Convert the review column to a list of reviews


In [None]:
# Use the CountVectorizer to convert to a BOW representation


# Look at the feature names


The feature names above are the words in the vocabulary. They were selected because they are the most common words in the column.

In [None]:
# Take a look at X to see what's going on


In [None]:
# Convert to a DataFrame


Now we can see the BOW much more clearly. Each row is still a review. But, instead of containing text, there is a column for each word in the vocabulary. Each cell represents the count of that vocabulary word in the review.

In [None]:
# Map y so that 'positive' is 1 and 'negative' is 0


In [None]:
# Train test split
# X_train, X_test, y_train, y_test = 

While the cell below is running, it's a good time to talk about the advantages and disadvantages of different models.

In [None]:
# Fit a RandomForestClassifier (this takes a little while)


In [None]:
# Predict on the testing data and compare to the actual


In [None]:
# Get the accuracy score


Not bad, considering that the reviews are split 50% positive and 50% negative, so guessing at random would be right only about half the time!
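To recap the whole modeling pipeline in one place, here is a compact sketch on eight made-up reviews. The data, the `n_estimators` value, the split size, and the `random_state` are all assumptions for illustration; with so few rows the accuracy it prints means very little, so treat it as a walkthrough of the steps rather than a result.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical mini dataset, balanced like the IMDB reviews
toy = pd.DataFrame({
    "review": [
        "great acting and a great story", "boring plot and bad acting",
        "loved the ending", "terrible pacing and a dull script",
        "a wonderful and funny film", "bad jokes and a boring cast",
        "great fun from start to finish", "dull and terrible",
    ],
    "sentiment": ["positive", "negative"] * 4,
})

# Bag of words features and 0/1 labels
X = CountVectorizer().fit_transform(toy["review"])
y = toy["sentiment"].map({"positive": 1, "negative": 0})

# Hold out a quarter of the rows, keeping the classes balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```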