# Introduction to Data Science – Basic Practical Natural Language Processing (NLP)
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/* 

In this lecture, we introduce some practical NLP following up on the theoretical lecture. We will do some basic text processing followed by analyzing sentiment for movie reviews. For this purpose, we'll introduce  the [Natural Language Toolkit (NLTK)](http://www.nltk.org/), a Python library for natural language processing. 

We won't cover NLTK or NLP extensively here – this lecture is meant to give you a few pointers if you want to use NLP in the future, e.g., for your project.

Also, there is a well-regarded alternative to NLTK: [spaCy](https://spacy.io/). If you're planning to use a lot of NLP in your project, that might be worth checking out.

If you're interested in using LLMs, you may want to try [LangChain](https://python.langchain.com/docs/get_started/). Note that in most cases it needs to be hooked up to a *paid* API. You can run some models locally, but they require a capable GPU.

**Reading:** 

[S. Bird, E. Klein, and E. Loper, *Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit*](http://www.nltk.org/book/). 


[C. Manning and H. Schütze, *Foundations of Statistical Natural Language Processing* (1999).](http://nlp.stanford.edu/fsnlp/)

[D. Jurafsky and J. H. Martin, *Speech and Language Processing* (2016).](https://web.stanford.edu/~jurafsky/slp3/)

**Next week:** guest lecturer Dr. Vivek Srikumar will be speaking about state-of-the-art techniques!   

### NLP Tasks

There are several tasks we might perform to understand text data:
* Part of speech tagging (what are the nouns, verbs, adjectives, prepositions).
* Information Extraction (pull structured facts, such as names, dates, and relations, out of text).
* Sentiment Analysis (determine the attitude of text, e.g., is it positive or negative).
* Semantic Parsing (translate natural language into a formal meaning representation).

The current strategy for many NLP tasks is to find a good way to represent the text ("extract features") and then to use machine learning / statistics tools, such as classification or clustering. 
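
As a minimal sketch of what "extracting features" can mean, the snippet below turns two made-up sentences into word-count vectors (a "bag of words"). The sentences and the whitespace tokenization are assumptions for illustration only; real pipelines use proper tokenizers.

```python
from collections import Counter

# Two made-up example sentences (not taken from any NLTK corpus).
docs = ["the movie was great", "the movie was terrible"]

# Naive whitespace tokenization; real tokenizers also handle punctuation.
tokenized = [d.lower().split() for d in docs]

# The vocabulary is the sorted set of all words seen in any document.
vocab = sorted({w for toks in tokenized for w in toks})

# Each document becomes a vector of word counts over the vocabulary.
vectors = [[Counter(toks)[w] for w in vocab] for toks in tokenized]

print(vocab)
for v in vectors:
    print(v)
```

Once every text is a fixed-length numeric vector like this, any standard classifier or clustering method can be applied to it.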

Our goal today is to use NLTK + scikit-learn to do some basic NLP tasks.

### Install datasets and models

To use NLTK, you must first download and install the datasets and models. The following cell does this. It will take some time, so be careful when you clear outputs or re-run cells. It's better if you can preserve this one.

In [None]:
# Note this part can take some time! Try to run it only once.
# Be careful when you clear outputs so you don't need to re-run it.
import nltk
nltk.download('all')

In [None]:
# imports and setup
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('ggplot')

## Basics of NLTK

We have downloaded a set of text corpora above. Here is a list of these texts:

In [None]:
from nltk.book import *

Let's look at the first 20 words of text1 – Moby Dick:

In [None]:
text1[0:20]

### Text Statistics

We can check the length of a text. The text of Moby Dick is 260,819 words, whereas Monty Python and the Holy Grail has 16,967 words. 

In [None]:
len(text1)

In [None]:
len(text6)

We can check for the frequency of a word. The word "swallow" appears 10 times in Monty Python.

In [None]:
text6.count("swallow")

We might want to know the context in which "swallow" appears in the text.

"You shall know a word by the company it keeps." – John Firth

Use the [`concordance`](http://www.nltk.org/api/nltk.html#nltk.text.Text.concordance) function to print out the words just before and after all occurrences of the word "swallow". 

In [None]:
text6.concordance("swallow")

Words that occur with notable frequency are "fly" or "flight", "unladen", "air", "African", "European". From these neighbors we can learn what a swallow can do and what properties it has.
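
To make the idea of a concordance concrete, here is a rough sketch of how one could be computed by hand. The helper `toy_concordance` and the example sentence are hypothetical and written for this illustration; NLTK's own implementation works on character widths and formats its output differently.

```python
def toy_concordance(tokens, target, width=3):
    """Return (left, word, right) context windows for each match of target."""
    target = target.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]      # up to `width` tokens before
            right = tokens[i + 1:i + 1 + width]     # up to `width` tokens after
            hits.append((left, tok, right))
    return hits

# A hand-written token list, used only for illustration.
tokens = ("what is the air speed velocity of an unladen swallow ? "
          "african or european swallow ?").split()

for left, word, right in toy_concordance(tokens, "swallow"):
    print(" ".join(left), f"[{word}]", " ".join(right))
```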

And if we look for Ishmael in Moby Dick:

In [None]:
text1.concordance("Ishmael")

Here, we see a lot of "I"s. We could probably infer that Ishmael is the narrator based on that. 

We can see what other words frequently appear in the same context using the  [`similar`](http://www.nltk.org/api/nltk.html#nltk.text.Text.similar) function.  

In [None]:
text6.similar("swallow")

In [None]:
text6.similar("african")

In [None]:
text6.similar("coconut")

This means that 'african' and 'unladen' both appeared in the text with the same word just before and just after. To see what the phrase is, we can use the [`common_contexts`](http://www.nltk.org/api/nltk.html#nltk.text.Text.common_contexts) function.

In [None]:
text6.common_contexts(["African", "unladen"])

We see that both "an unladen swallow" and "an african swallow" appear in the text. 
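
The notion of a shared context can be sketched in a few lines: collect the (previous word, next word) pairs around each occurrence of a word, then intersect the sets for two words. The function `contexts` and the token list are assumptions made up for this example, not NLTK's implementation.

```python
def contexts(tokens, word):
    """Set of (previous, next) word pairs surrounding each occurrence of word."""
    word = word.lower()
    toks = [t.lower() for t in tokens]
    return {(toks[i - 1], toks[i + 1])
            for i in range(1, len(toks) - 1) if toks[i] == word}

# A hand-written token list, used only for illustration.
tokens = "is it an african swallow or an unladen swallow".split()

# Contexts that both words share.
shared = contexts(tokens, "african") & contexts(tokens, "unladen")
print(shared)
```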

In [None]:
text6.concordance("unladen")
print()
text6.concordance("african")

### Dispersion plot

`text4` is the Inaugural Address Corpus which includes inaugural addresses going back to 1789. 
We can use a dispersion plot to see where in a text certain words appear, and hence how the language of the address has changed over time. 


In [None]:
# Let's add back in the spaces so we can read it first.
print(" ".join(text4[:100]))
print("")

print(" ".join(text4[-100:]))

In [None]:
# Based on the above, let's see how some words have changed over time.
text4.dispersion_plot(["citizens", "democracy", "freedom", "duty", "America", "nation", "God", "military"])

### Exploring texts using statistics

We'll explore a text by counting the frequency of different words.

The total number of words ("outcomes") in Moby Dick is 260,819 and the number of unique words ("samples") is 19,317. 

In [None]:
frequency_dist = FreqDist(text1)
print(frequency_dist)

# find 50 most common words
print('\n',frequency_dist.most_common(50))

# not surprisingly, whale occurs quite frequently (906 times!)
print('\n', frequency_dist['whale'])
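
For intuition, the same kind of counting can be sketched with the standard library's `collections.Counter`, which `FreqDist` closely resembles. The short token list is made up for illustration.

```python
from collections import Counter

# A made-up token list standing in for a real text.
tokens = "the whale , the white whale , swam past the ship".split()
counts = Counter(tokens)

num_outcomes = sum(counts.values())  # total number of tokens
num_samples = len(counts)            # number of distinct tokens

print(num_samples, "samples and", num_outcomes, "outcomes")
print(counts.most_common(2))
print(counts["whale"])
```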

We can find all the words in Moby Dick with more than 15 characters:

In [None]:
unique_words = set(text1)
long_words = [w.lower() for w in unique_words if len(w) > 15]
long_words

### Stopword Removal

Often, it is useful to ignore frequently used words, to concentrate on the meaning of the remaining words. These are referred to as *stopwords*. Examples are "the", "was", "is", etc. 

NLTK comes with a stopword corpus. 

In [None]:
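from nltk.corpus import stopwords
# note: this rebinds the name `stopwords` from the corpus reader to a plain
# list of English stopwords; later cells rely on that list
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)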

Depending on the task, these stopwords are important modifiers, or superfluous content. 

### Exercise 1.1: Frequent Words
Find the most frequently used words in Moby Dick that are not stopwords and not punctuation. Hint: [`str.isalpha()`](https://docs.python.org/3/library/stdtypes.html#str.isalpha) could be useful here.

In [None]:
# your code here


### Stopwords in different corpora
Is there a difference in how frequently stopwords appear in the different texts?

In [None]:
def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

for i,t in enumerate([text1,text2,text3,text4,text5,text6,text7,text8,text9]):
    print(i+1,content_fraction(t))

Apparently, "text8: Personals Corpus" has the most content. Why is that?

In [None]:
print(" ".join(text8[:100]))

### Collocations
A *collocation* is a sequence of words that occur together unusually often. We can retrieve these using the [`collocations()`](http://www.nltk.org/api/nltk.html#nltk.text.Text.collocations) function.

In [None]:
text2.collocations()
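
A very rough approximation of the idea is to count adjacent word pairs that do not involve stopwords. The sketch below does only that, on a made-up token list with a tiny hand-picked stop list; NLTK scores candidate pairs with a statistical association measure rather than raw counts, so treat this as intuition only.

```python
from collections import Counter

# Made-up tokens for illustration; not an excerpt from the corpus.
tokens = ("colonel brandon met sir john and lady middleton ; "
          "sir john praised colonel brandon to lady middleton").split()
stop = {"and", "to", ";"}  # tiny hand-picked stop list (an assumption)

# Pair each token with its successor, skipping pairs that touch a stop word.
bigrams = [(a, b) for a, b in zip(tokens, tokens[1:])
           if a not in stop and b not in stop]

bigram_counts = Counter(bigrams)
for pair, n in bigram_counts.most_common(3):
    print(" ".join(pair), n)
```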

## Sentiment analysis for movie reviews
When analyzing movie reviews, we can ask the simple question: Is the attitude of a movie review positive or negative? If you're developing [Rotten Tomatoes](https://www.rottentomatoes.com/), that's what you want to know to certify whether a review is "fresh" or "rotten".

How can we approach this question?

Our data is a corpus consisting of 2000 movie reviews together with the user's sentiment polarity (positive or negative). More information about this dataset is available [from this website](https://www.cs.cornell.edu/people/pabo/movie-review-data/).

Our goal is to predict the sentiment polarity from just the review. 

Of course, this is something that we humans can do very easily: 
1. That movie was terrible. -> negative
2. That movie was great! -> positive





In [None]:
from nltk.corpus import movie_reviews as reviews

The dataset contains 1000 positive and 1000 negative movie reviews. 

The paths to / IDs for the individual reviews are accessible via the fileids() call:

In [None]:
reviews.fileids()[0:5]

We can access the positives or negatives explicitly:

In [None]:
reviews.fileids('pos')[0:5]

There are in fact 1000 positive and 1000 negative reviews:

In [None]:
num_reviews = len(reviews.fileids())
print(num_reviews)
print(len(reviews.fileids('pos')),len(reviews.fileids('neg')))

Let's see the review for the third movie. It's a negative review for [The Mod Squad](https://www.rottentomatoes.com/m/mod_squad/) (see the [trailer](https://www.youtube.com/watch?v=67cdXuWnRKs)), which has a "rotten" rating on Rotten Tomatoes. 

![Mod Squad at Rotten Tomatoes](mod_squad.png)

In [None]:
# the name of the file 
fid = reviews.fileids()[2]
print(fid)

print('\n', reviews.raw(fid))


print('\n', "The Category:", reviews.categories(fid) )

print('\n', "Individual Words:",reviews.words(fid))

Let's look at some sentences that indicate that this is a negative review:

 * "it is movies like these that make a jaded movie viewer thankful for the invention of the timex indiglo watch"
 * "sounds like a cool movie , does it not ? after the first fifteen minutes , it quickly becomes apparent that it is not ." 
 * "nothing spectacular"
 * "avoid this film at all costs"
 * "unfortunately , even he's not enough to save this convoluted mess"

### A Custom Algorithm
We'll build a sentiment classifier, using methods we already know, to predict the label ('neg' or 'pos') from the review text.

`reviews.categories(file_id)` returns the label for that review as a one-element list, either `['neg']` or `['pos']`.

In [None]:
categories = [reviews.categories(fid) for fid in reviews.fileids()]
print(categories[0:10])
labels = {'pos':1, 'neg':0}
# create the labels: 1 for positive, 0 for negative
y = [labels[x[0]] for x in categories]
# output labels for the first review (a negative one) and the review at index 1000 (the first positive one)
y[0], y[1000]

Here, we collect all words into a nested list (one list of words per review):

In [None]:
doc_words = [list(reviews.words(fid)) for fid in reviews.fileids()]

In [None]:
# first 10 words of the third document - mod squad
doc_words[2][:10]

Here we build a FreqDist over all of the words in the reviews, pick the 2000 most common words, and then remove stopwords and punctuation.

In [None]:
# get the 2000 most common words in lowercase
most_common = nltk.FreqDist(w.lower() for w in reviews.words()).most_common(2000)

# remove stopwords
filtered_words = [word_tuple for word_tuple in most_common if word_tuple[0].lower() not in stopwords]
# remove punctuation marks
filtered_words = [word_tuple for word_tuple in filtered_words if word_tuple[0].isalpha()]
print(len(filtered_words))
filtered_words[0:50]

We extract the word list from the (word, frequency) tuples.

In [None]:
word_features =  [word_tuple[0] for word_tuple in filtered_words]
print(word_features[:5])
len(word_features)

We define a function that takes a document and returns a list of zeros and ones indicating which of the words in  `word_features` appears in that document. 

In [None]:
def document_features(document):
    # convert each document into a set of its words 
    # this removes duplicates and makes "existence" tests efficient
    document_words = set(document)
    # an array, initialized with 0s, that we'll set to 1 for each of the words that exists in the document
    features = np.zeros(len(word_features))
    for i, word in enumerate(word_features):
        features[i] = (word in document_words)
    return features
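
To see the encoding on a small scale, here is a toy version with a four-word vocabulary. The function `toy_document_features`, the vocabulary, and the review text are all made up for this illustration and use plain lists instead of NumPy arrays.

```python
def toy_document_features(document, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in document."""
    present = set(document)  # set membership tests are fast and ignore duplicates
    return [int(word in present) for word in vocabulary]

# Hypothetical vocabulary and review, for illustration only.
vocabulary = ["plot", "bad", "great", "boring"]
review = "the plot was bad , really bad".split()

toy_vector = toy_document_features(review, vocabulary)
print(toy_vector)
```

Note that "bad" appears twice in the review but still contributes a single 1: this representation records presence, not counts.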

Let's just focus on the third document. Which words from `word_features` are in this document? 

In [None]:
words_in_doc_2 = document_features(doc_words[2])
print(words_in_doc_2)

inds = np.where(words_in_doc_2 == 1)[0]
print('\n', [word_features[i] for i in inds])

Now we build our feature set for all the reviews.

In [None]:
X = np.zeros([num_reviews,len(word_features)])
for i in range(num_reviews):
    X[i,:] = document_features(doc_words[i])

X[0:5]

The result is a feature vector for each of these reviews that we can use in classification.

Now that we have features for each document and labels, **we have a classification problem!** 

NLTK has a built-in classifier, but we'll use the scikit-learn classifiers we're already familiar with. 

Let's try k-nearest neighbors:

In [None]:
k = 30
model = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(model, X, y, cv=10)
print(scores)

And SVM:

In [None]:
model = svm.SVC(kernel='rbf', C=30, gamma="auto")
scores = cross_val_score(model, X, y, cv=10)
print(scores)

Here we can see that kNN with these parameters is less accurate than SVM, which is about 80% accurate. Of course, we could now use cross validation to find the optimal parameters, `k` and `C`, but as always, SVM is slow... 
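
As a reminder of what `cv=10` does conceptually, the sketch below splits sample indices into k contiguous folds, each used once as the test set. The helper `k_fold_indices` is hypothetical and deliberately simplified. As far as I know, scikit-learn uses stratified folds for classifiers when `cv` is an integer, which matters here because our reviews are ordered by label.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_indices(10, 5):
    print("test:", test_idx, "train size:", len(train_idx))
```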

So, let's see what our algorithm thinks about the Mod Squad! 

In [None]:
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1, test_size=0.2)

model.fit(XTrain, yTrain)

In [None]:
mod_squad = [X[2]]
mod_squad

In [None]:
model.predict(mod_squad)

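Our model says 0 - so a bad review! We have successfully built a classifier that can detect the Mod Squad review as a bad review! Keep in mind, though, that this review may have ended up in the training split, so this is not necessarily an out-of-sample test.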

Let's take a look at a misclassified movie. Remember that the first 1000 reviews are negative, so among them we can just look for the first one that the model predicts as positive (1):

In [None]:
model.predict(X[0:10])

Review 9, which was misclassified, is for Aberdeen, which has [generally favorable reviews](https://www.rottentomatoes.com/m/aberdeen/) with about 80% positive. Let's look at the review:

In [None]:
fid = reviews.fileids()[8]

print('\n', reviews.raw(fid))
print('\n', reviews.categories(fid) )

So if we read this, we can see that this is a negative review, but not a terrible review. Take these sentences, for example: 

 * "if signs & wonders sometimes feels overloaded with ideas , at least it's willing to stretch beyond what we've come to expect from traditional drama"
 * "yet this ever-reliable swedish actor adds depth and significance to the otherwise plodding and forgettable aberdeen , a sentimental and painfully mundane european drama"

## We could have also used the Classifier from the NLTK library

Below is the sentiment analysis from [Ch. 6 of the NLTK book](http://www.nltk.org/book/ch06.html). 



In [None]:
documents = [(list(reviews.words(fileid)), category)
             for category in reviews.categories() 
             for fileid in reviews.fileids(category)]

This list contains tuples where the review, stored as an array of words, is the first item in the tuple and the category is the second. 

In [None]:
documents[1]

Extract the features from all of the documents

In [None]:
def document_features(document):    
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains('+ word +')'] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]

In [None]:
featuresets[2]

Split into train_set, test_set and perform classification 

In [None]:
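# documents are ordered by category, so shuffle before slicing;
# otherwise the test set would contain only negative reviews
import random
random.shuffle(featuresets)
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)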

print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(10)

In our run, NLTK's naive Bayes classifier reaches about 88% accuracy, which isn't bad, but our home-made approach also achieved a respectable 80%. The exact numbers depend on how the data is split.
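
To make the naive Bayes step less of a black box, here is a minimal sketch of a naive Bayes classifier over boolean features, using add-one smoothing and log-probabilities. The functions `train_nb` and `predict_nb` and the four training examples are made up for this illustration; NLTK's implementation differs in details such as its smoothing.

```python
import math
from collections import defaultdict

def train_nb(featuresets):
    """featuresets: list of (dict feature -> bool, label) pairs."""
    label_counts = defaultdict(int)
    true_counts = defaultdict(lambda: defaultdict(int))  # label -> feature -> #True
    features = set()
    for feats, label in featuresets:
        label_counts[label] += 1
        for name, value in feats.items():
            features.add(name)
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts, features

def predict_nb(model, feats):
    """Return the label with the highest (log) posterior score."""
    label_counts, true_counts, features = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name in features:
            # add-one smoothed estimate of P(feature is True | label)
            p_true = (true_counts[label][name] + 1) / (n + 2)
            score += math.log(p_true if feats.get(name, False) else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Four made-up training examples in the same dict format as above.
toy_train = [
    ({"contains(great)": True, "contains(bad)": False}, "pos"),
    ({"contains(great)": True, "contains(bad)": False}, "pos"),
    ({"contains(great)": False, "contains(bad)": True}, "neg"),
    ({"contains(great)": False, "contains(bad)": True}, "neg"),
]
toy_model = train_nb(toy_train)
print(predict_nb(toy_model, {"contains(great)": True, "contains(bad)": False}))
```

The "naive" part is the assumption that features are independent given the label, which is why the per-feature log-probabilities can simply be summed.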


What improvements could we have made? Obviously, we could have used more data, or, in our home-grown model, selected words that discriminate between good and bad reviews. We could also have used n-grams, e.g., to catch "not bad" as a positive sentiment.
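
As a small sketch of the n-gram idea, the snippet below extracts bigrams from a made-up review and turns them into `contains(...)` features in the same style as above. The helper `ngrams` is written out here for illustration (NLTK ships similar utilities), and the review text is hypothetical.

```python
def ngrams(tokens, n):
    """Return the list of n-token tuples found in tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A made-up review, for illustration only.
review = "the plot was not bad at all".split()

# One boolean feature per bigram that occurs in the review.
bigram_features = {"contains(" + " ".join(bg) + ")": True
                   for bg in ngrams(review, 2)}

print(ngrams(review, 2))
print("contains(not bad)" in bigram_features)
```

With single-word features the model only sees "bad"; with bigrams it can learn that "not bad" points in the opposite direction.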