# Intro to Natural Language Processing (NLP)




(based on Ch 12 of Deitel and Deitel)

# Tasks in natural language processing (NLP)

What might we want to do with text in an automated way?  Understanding the full meaning of text is still beyond our capabilities, but we could still analyze which words or sequences of words are common in the text.  That is itself a stepping stone to...



* Sentiment analysis - determining whether people are happy or unhappy. 



* Topic classification - determining whether a document is relevant to a topic.



* Named entity recognition - identifying which proper nouns are being talked about, for example as input to sentiment analysis or stock prediction.



* Translation



* Exploratory visualization - a word cloud or visualization in a "semantic space"



* Chatbots - for help or entertainment

# TextBlobs and their properties

TextBlobs are sort of like strings with many bells and whistles attached to them.  The TextBlob module provides an easy-to-use interface on top of two powerful tools for NLP, the nltk module and the pattern module.  After a TextBlob is created, many of its features are usable by just accessing attributes of the TextBlob.



Normally, breaking strings down into words might be done with the split() method, and we'd still have punctuation lying around.  A TextBlob immediately knows what sentences it has and what words it has, and the words aren't attached to their punctuation.

These features use the nltk tokenizer, and so the data used for that needs to be downloaded first.  (A tokenizer breaks a sentence into meaningful words and punctuation - tokens.)

In [None]:
import nltk
nltk.download('punkt')

In [None]:
!pip install textblob

In [None]:
from textblob import TextBlob
text = TextBlob('Hello, textblob!  Hello, sentences and words!')

print(text.sentences)
print(text.words)
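
For contrast, here is a minimal standard-library sketch of what plain split() gives on the same string.  The regular expression is only an illustrative stand-in for a tokenizer (an assumption for this example, not what TextBlob or nltk actually uses).

```python
import re

raw = 'Hello, textblob!  Hello, sentences and words!'

# str.split() only breaks on whitespace, so punctuation stays glued to the words
split_words = raw.split()
print(split_words)

# A crude stand-in for a tokenizer: keep runs of letters, digits, and apostrophes
crude_words = re.findall(r"[A-Za-z0-9']+", raw)
print(crude_words)
```

A pattern this crude would also break "$1,000" into "1" and "000", which is the kind of case where a trained tokenizer earns its keep.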

The TextBlob is already doing some nontrivial tokenizing work for us, since detecting sentence and token boundaries has some subtle issues.  This work is often done by a lightweight machine learning algorithm, like a perceptron (a single-unit neural network).  In the next example, the comma in $1,000 and the period in $0.02 are both correctly interpreted as part of a number, not as punctuation that separates tokens or ends the sentence.

In [None]:
text = TextBlob('I can\'t decide whether I want to get $1,000 or give my $0.02.')
print(text.sentences)
print(text.words)


Part-of-speech (POS) tagging is another task that comes built in to every TextBlob.  Just accessing the .tags attribute gives a list of (word, part of speech) tuples.  Like the tokenizing, this borrows a trained classifier from nltk: a perceptron trained to predict part of speech from nearby words.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

text.tags

Umm, what?  Thankfully, nltk (which TextBlobs use) comes with a command to see what each of the abbreviations means.

In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

TextBlobs even come with a ready-to-use "sentiment" attribute.  By default it uses a simple lexicon-based method (averaging word sentiment scores) to determine the polarity of the sentiment (negative or positive, in range [-1,1]) and the subjectivity of the text (range [0,1], with 1 most subjective).  The second number can tell you whether you ought to use this sentence for sentiment analysis at all, or how much you might weight it in a classification of the document.

In [None]:
happy_text = TextBlob("This is a pretty neat module.")
happy_text.sentiment
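
To make "averaging word sentiment scores" concrete, here is a minimal sketch of the idea.  The tiny lexicon, its scores, and the average_polarity helper are all made up for illustration; TextBlob's real analyzer uses a much larger lexicon and extra rules, so this shows only the averaging step.

```python
# Hypothetical mini-lexicon mapping words to polarity scores in [-1, 1]
LEXICON = {'neat': 0.5, 'great': 0.8, 'stupid': -0.8, 'hate': -0.8}

def average_polarity(words, lexicon=LEXICON):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    scores = [lexicon[w.lower()] for w in words if w.lower() in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# Only "neat" is in the lexicon, so it alone determines the score
print(average_polarity(['This', 'is', 'a', 'neat', 'module']))

# No lexicon words at all: fall back to a neutral 0.0
print(average_polarity(['No', 'opinion', 'words', 'here']))
```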

This doesn't work all the time (it allegedly gets 75% accuracy on movie reviews, and less on text that isn't reviews), and you'll probably get better performance out of a more powerful sentiment analysis technique.  But it is kind of cute.

In [None]:
unhappy_text = TextBlob("But, I don't like how I can't really tell how it makes its decisions.")
unhappy_text.sentiment

There are a few different analyzers to try, including a Naive Bayes analyzer (see DS120) trained on movie reviews.

In [None]:
from textblob.sentiments import NaiveBayesAnalyzer
nltk.download('movie_reviews') # Data for Naive Bayes training
unhappy_text = TextBlob("But, I don't like how I can't really tell how it makes its decisions.", analyzer=NaiveBayesAnalyzer())
# This analyzer reports a classification ('pos' or 'neg') with p_pos and p_neg probabilities
unhappy_text.sentiment

If you filter to get just strong polarity and subjectivity, you may have better luck than trying to classify subtle sentences.

In [None]:
unhappy_text = TextBlob("I hate this stupid thing!")
unhappy_text.sentiment
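
Here is a minimal sketch of that filtering step.  The Scored container, the keep_strong helper, the 0.5 thresholds, and the scores themselves are all assumptions made up for illustration; they are not TextBlob outputs.

```python
from collections import namedtuple

# Hypothetical container mirroring the (polarity, subjectivity) pair described above
Scored = namedtuple('Scored', ['text', 'polarity', 'subjectivity'])

def keep_strong(scored, min_abs_polarity=0.5, min_subjectivity=0.5):
    """Keep only items whose sentiment is both strong and subjective."""
    return [s for s in scored
            if abs(s.polarity) >= min_abs_polarity
            and s.subjectivity >= min_subjectivity]

# Made-up scores for illustration only
examples = [
    Scored('strongly negative opinion', -0.8, 0.9),
    Scored('mild, mostly factual remark', 0.1, 0.2),
    Scored('strongly positive opinion', 0.7, 0.8),
]
strong = keep_strong(examples)
print([s.text for s in strong])
```

With real text, the scores could come from each sentence in blob.sentences, using sentence.sentiment.polarity and sentence.sentiment.subjectivity.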

# Cleaning for machine learning

If you're planning on feeding text to a machine learning algorithm, you may want to standardize the text in some ways first.



If your algorithm counts "swim" and "swimming" as two totally different things, you may lose out on an algorithmic realization that both passages were talking about the same thing.  However, being too aggressive, and doing something like deleting "ing" from the ends of all words, could result in genuinely different words being treated as the same.
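
A minimal sketch of the "too aggressive" approach shows both problems at once.  The naive_strip_ing helper is hypothetical and deliberately crude; it is not how a real stemmer works.

```python
def naive_strip_ing(word):
    """Deliberately crude: chop a trailing 'ing' with no other checks."""
    return word[:-3] if word.endswith('ing') else word

for word in ['swim', 'swimming', 'even', 'evening', 'king']:
    print(word, '->', naive_strip_ing(word))
```

"swimming" becomes "swimm", which still doesn't match "swim", while "evening" collapses into the unrelated word "even" and "king" is reduced to "k".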



In general, this kind of fixing matters more for small texts than very large ones; a very big machine learning corpus could have enough mentions of swim, swims, swimming, and swam that the learner can realize their connection on its own.  The big algorithm then can benefit from using the subtle distinctions between the words.  But for many more humble projects, it can give the algorithm a boost to "normalize" the words.



TextBlobs offer both "stemming" and "lemmatizing."  "Stemming" is a less nuanced approach that strips prefixes and suffixes by rule and returns a piece of a word that may not be a real word.  "Lemmatizing" looks the word up in a dictionary (WordNet) and returns a real word; it works best when it knows the word's part of speech, and in TextBlobs the default call treats each word as a noun, so it can be conservative and tends to leave verbs alone.  Neither is likely to be perfect, but either can aid machine learning when the training data is small.

In [None]:
text = TextBlob("I am enjoying these delicious strawberries")
text.words.stem()

In [None]:
nltk.download('wordnet')
text.words.lemmatize()

To aid machine learning, you may want to disregard some "filler" words entirely, such as "a" and "the."  These are called "stop words," and nltk provides lists for several languages, including English.  As with stemming and lemmatizing, a learner with quite a lot of data might benefit from the subtle information these stop words provide, but for smaller projects, it's probably best to leave them out and avoid confusing the learner with them.

In [None]:
nltk.download('stopwords')

from nltk.corpus import stopwords
stops = stopwords.words('english')
stops

In [None]:
text = TextBlob("I was going to say a lot, but maybe not now")
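# Notice how you can apply a filter within a list comprehension using "if"
# The stop word list is lowercase, so compare lowercased words (otherwise "I" slips through)
nonstop = [word for word in text.words if word.lower() not in stops]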
print(nonstop)

Rather than learning on the basis of individual word frequencies, many machine learning algorithms benefit from seeing 2- or 3-word phrases and using the frequencies of those phrases instead.  TextBlobs possess an ngrams method that generates the list of n-word phrases for a sentence.  In this way, a word "the" that might have been discarded as a stop word could gain new life as part of *The Great Escape*, a famous movie title and "trigram."  (Here each "gram" is a word, although n-grams can also be built from other units, such as characters.)

In [None]:
blob = TextBlob("Let's go watch 'The Great Escape.'")
blob.ngrams(n = 3)
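
The idea behind n-grams is just a sliding window over the word list.  Here is a minimal standard-library sketch; the ngrams function below is a hypothetical stand-in written for illustration, not TextBlob's implementation, and the word list is typed in by hand.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = ['Let', 'us', 'watch', 'The', 'Great', 'Escape']
trigrams = ngrams(words, n=3)
print(trigrams)
```

A list of six words yields four trigrams, and a list shorter than n yields none.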

# WordNet

Another functionality that TextBlobs borrow from NLTK is the ability to access WordNet, a database that contains definitions and lists of synonyms for each word.  This can be used as a stepping stone for more machine learning, although recently machine learning has steered towards large amounts of unstructured data over structured data like this.

In [None]:
nltk.download('wordnet')

from textblob import Word

net = Word('net')

net.definitions

A "synset" is a set of synonyms, indexed by a word, the part of speech of the word, and the definition number of the word that is being referred to.

In [None]:
net.synsets
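
Each synset's name packs those three pieces into one dotted string.  Here is a minimal sketch of pulling them apart, assuming a name in the word.pos.number format; the parse_synset_name helper and the example string are illustrative and not part of nltk.

```python
def parse_synset_name(name):
    """Split a 'word.pos.number' synset name into its three parts."""
    # Split from the right, in case the word part itself contains a dot
    word, pos, number = name.rsplit('.', 2)
    return word, pos, int(number)

print(parse_synset_name('net.n.02'))
```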

We can get words back out by iterating through synset.lemmas(), each of which is a lemmatized synonym that falls under that synset's definition.

In [None]:
syns = set()

for lemma in net.synsets[0].lemmas(): # Just the Internet category of synonyms
    syns.add(lemma.name())
print(syns)

# Top words

Here's an example from our textbook that plots the frequencies of the top 20 words in *Romeo and Juliet*, not counting stop words.
(The text is from [Project Gutenberg](https://www.gutenberg.org), which contains simple text versions of many classics.)  Visualization like this can raise interesting questions for further exploration.



First, we need to read in the text from a file.  

In [None]:
# Google colab specific upload
from google.colab import files

uploaded = files.upload()

In [None]:
with open('RomeoAndJuliet.txt', 'r') as myfile:
    text = myfile.read() # reads all of it
blob = TextBlob(text) # built after the file has been closed

An additional feature of a blob is that it contains a dictionary from words to counts - albeit one that isn't stemmed or lemmatized.

In [None]:
print(blob.word_counts['thy'])

We can get (word, count) pairs using blob.word_counts.items(), and drop tuples with words that are also stopwords.

In [None]:
items = [item for item in blob.word_counts.items() if item[0] not in stops]
print(items[1])

We'll now sort the tuples that didn't contain stopwords, from highest count to lowest count.  sorted is a built-in function for sorting, and its key keyword takes a function to apply to each item to get the value to sort by, so we'll use a lambda to grab the count part of the tuple.

In [None]:
sorted_items = sorted(items, key=lambda x: x[1], reverse=True)
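
To see the sort in isolation, here is a minimal sketch on a handful of made-up (word, count) pairs.  The data is invented for illustration; Counter.most_common is shown as an alternative from the standard library, not as something the textbook example uses.

```python
from collections import Counter

# Made-up (word, count) pairs for illustration
items = [('love', 3), ('night', 7), ('death', 5)]

# key extracts the value to sort by; reverse=True puts the largest first
by_count = sorted(items, key=lambda pair: pair[1], reverse=True)
print(by_count)

# Counter.most_common gives the same ordering when the counts live in a Counter
counts = Counter(dict(items))
print(counts.most_common(2))
```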

We'll then use pandas to do the final plotting, using its built-in bar chart creator.  We'll skip item 0, which happens to be an apostrophe and is therefore a little underwhelming.

In [None]:
import pandas as pd
df = pd.DataFrame(sorted_items[1:21],columns=['word','count'])
df.head()

In [None]:
axes = df.plot.bar(x='word',y='count',legend=False)

# Wordcloud module

The wordcloud module builds a word cloud image straight from raw text, drawing more frequent words larger.


In [None]:
!pip install wordcloud

with open('RomeoAndJuliet.txt', 'r') as myfile:
    text = myfile.read() # reads all of it

from wordcloud import WordCloud
wordcloud = WordCloud(background_color = 'white', height=800, width=800)
wordcloud = wordcloud.generate(text)

In [None]:
import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis('off') # hide the pixel-coordinate axes around the image

In [None]:
wordcloud.to_file('RandJWordCloud.png') # saves the image; no need to reassign

In [None]:
# Google colab - "how do we get it down from there?"
files.download('RandJWordCloud.png')