# NLP with NLTK Workshop

Today's workshop will address various concepts in Natural Language Processing, primarily through the use of NLTK. A fundamental understanding of Python is necessary. We will cover:

1. Preparing your own corpus
2. Tagging
3. Chunking
4. Document classification

You will need:

* NLTK ( \$ pip install nltk)
* Brown corpus from NLTK ( >>> nltk.download() )
* Punkt tokenizer from NLTK ( >>> nltk.download() )
* Movie reviews corpus from NLTK ( >>> nltk.download() )
* BeautifulSoup ( \$ pip install beautifulsoup4)

This workshop will also help to solidify your understanding of regex and list comprehensions.

Much of today's work will be adapted, or taken directly, from the NLTK book found here: http://www.nltk.org/book/ . The guide for BeautifulSoup is here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ . For further explanation of grammars see *Data Science from Scratch*: http://shop.oreilly.com/product/0636920033400.do .

# 1) Preparing your own corpus

We are going to take Jonathan Swift's *Gulliver's Travels* from archive.org to use as our text throughout today's workshop. Although we will use pre-made corpora to explore more robust options, it is useful to know how to clean your own text files, build and declare a corpus from them, and run analyses, so we will start from scratch.

## String manipulation and cleaning

Let's first use Beautiful Soup to grab only the text. There are packages that exist to clean texts from standard sites such as a Gutenberg package for gutenberg.org, but today we'll clean it as best we can manually:

In [None]:
import requests
from bs4 import BeautifulSoup

url = "http://tinyurl.com/gullivert"#"https://ia801404.us.archive.org/2/items/gulliverstravels17157gut/17157-h/17157-h.htm"

f = requests.get(url)
html = f.content

print (f.content)

Create bs object and trim:

In [None]:
#clean and extract only raw text
bspage = BeautifulSoup(html, "lxml") #or "html.parser"
rawtext = bspage.get_text()

#slice at beginning and end of book
beginning = "My father had"
end = "of my unfortunate voyages."
gtravels = rawtext[rawtext.find(beginning):rawtext.find(end)+len(end)]

print (gtravels)
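
The slice above relies on `str.find`, which returns -1 instead of raising an error when a marker is missing, so a changed page could silently produce the wrong slice. A minimal sketch of a stricter version follows; `slice_between` and the sample string are hypothetical and only for illustration:

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker through end_marker, inclusive.

    Hypothetical helper: raises ValueError when a marker is missing
    instead of silently slicing from index -1.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found: %r" % start_marker)
    # search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found: %r" % end_marker)
    return text[start:end + len(end_marker)]

# made-up stand-in for the downloaded page text
sample = "LICENSE ... My father had a small estate. THE END of my unfortunate voyages. FOOTER"
book = slice_between(sample, "My father had", "of my unfortunate voyages.")
print(book)
```

In the workshop code this would be called as `slice_between(rawtext, beginning, end)`.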

You'll notice there are still page numbers and chapter headings in our text, and you might have other pieces you want to clean. Recalling your regex work from Day 3 of the intro series, how can we get rid of all the page numbers within brackets?

In [None]:
import re

#regex for page numbers in brackets
gtravels = re.sub(r"\[[0-9]+\]", "", gtravels)

#regex to remove an all-caps word followed by a Roman numeral (I to VIII, as there are only 8 chapters) and a period
gtravels = re.sub(r"[A-Z]+ (I?V|V?I{1,3})\.", "", gtravels)

print (gtravels)
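
To see what these substitutions do, it can help to run them on a short string first. The sketch below uses a made-up sample and assumes headings of the form `CHAPTER I.`; the final whitespace cleanup is an extra step that is not part of the cell above:

```python
import re

# made-up sample with two chapter headings and two bracketed page numbers
sample = ("CHAPTER I. My father had a small estate [3] in Nottinghamshire. "
          "CHAPTER VIII. I swam [12] ashore.")

# drop page numbers such as [3]
no_pages = re.sub(r"\[[0-9]+\]", "", sample)

# drop an all-caps word followed by a Roman numeral from I to VIII and a period
no_headings = re.sub(r"[A-Z]+ (?:I?V|V?I{1,3})\.", "", no_pages)

# extra step: collapse the doubled spaces left behind by the removals
cleaned = re.sub(r"\s+", " ", no_headings).strip()
print(cleaned)
```

Testing a pattern on a few lines like this makes it easier to spot false positives before running it over the whole book.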

Let's save this text so we can read it in the corpus later:

In [None]:
import codecs
with codecs.open("gulliver.txt", "w","utf-8") as f:
    f.write(gtravels)

## Declaring a corpus in NLTK

While you can use NLTK on strings and lists of sentences, it's better to formally declare your corpus.

In [None]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = "" #rel. path
my_texts = PlaintextCorpusReader(corpus_root, r'.*\.txt') #every file ending in .txt

We now have a text corpus, on which we can run all the basic methods you learned in the introductory sequence. To list all the files in our corpus:

In [None]:
my_texts.fileids()

We can also extract either all the words or all the sentences in list format:

In [None]:
my_texts.words('gulliver.txt')

In [None]:
gsents = my_texts.sents('gulliver.txt')
print (gsents)

We now have a corpus, or text, from which we can get any of the statistics you learned in Day 3 of the Python workshop. We will review some of these functions once we have added more information to the text.

# 2) Tagging

There are many situations in which "tagging" words (or really anything) is useful, whether to calculate trends or to support further text analysis that extracts meaning. We will cover three methods of tagging: simple regex, n-gram, and Brill transformation-based tagging. HMMs, CRFs, and neural networks will not be covered today, but they are mentioned briefly as additional machine learning models.

It is important to note that in Natural Language Processing (NLP), POS (Part of Speech) tagging is the most common use for tagging, but the actual tag can be anything. Other applications include sentiment analysis and NER (Named Entity Recognition). Tagging is simply labeling a word to a specific category via a tuple.

Nevertheless, for training more advanced tagging models, POS tagging is nearly essential. If you are defining a machine learning model to predict patterns in your text, these patterns will most likely rely on, among other things, POS features. You will therefore first tag POS and then use the POS as a feature in your model.

## At a low level

Tagging is creating a tuple of (word, tag) for every word in a text or corpus. For example: "My name is Chris" may be tagged for POS as:

My/PossessivePronoun name/Noun is/Verb Chris/ProperNoun ./Period

*NB: type 'nltk.data.path' to find the path on your computer to your downloaded nltk corpora. You can explore these files to see how large corpora are formatted.*

You'll notice how the text is annotated, using a forward slash to match the word to its tag. So how can we get this to a useful form for Python?

In [None]:
from nltk.tag import str2tuple

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [str2tuple(t) for t in line.split()]

print (tagged_sent)

Further analysis of tags with NLTK requires a *list* of sentences, each of which is itself a list of (word, tag) tuples. Otherwise you will get an index error on higher-level methods.
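
As a minimal sketch of that shape, the snippet below turns two annotated lines into a list of tagged sentences. The annotated text and its Brown-style tags are made up for illustration. Note that `str2tuple` uppercases the tag:

```python
from nltk.tag import str2tuple, tuple2str

# made-up annotated text, one sentence per line
annotated = """My/PP$ name/NN is/BEZ Chris/NP ./.
I/PPSS like/VB corpora/NNS ./."""

# one inner list of (word, tag) tuples per sentence
toy_tagged_sents = [[str2tuple(tok) for tok in line.split()]
                    for line in annotated.splitlines()]
print(toy_tagged_sents)

# str2tuple uppercases the tag; tuple2str reverses the conversion
print(str2tuple("is/verb"))
print(" ".join(tuple2str(pair) for pair in toy_tagged_sents[0]))
```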

Naturally, these tags are a bit verbose. A widely used convention is the Penn Treebank tagset: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html . The Brown corpus used below has its own, closely related tagset.

## Working with a tagged corpus

Now that we know how tagging works, let's import a tagged corpus from the NLTK database and see what we can do.

In [None]:
from nltk.corpus import brown #if you don't have this downloaded, type nltk.download()
brown.tagged_words()

*NB: the argument tagset = "universal" simplifies the tagset.*

Let's find the most frequent parts of speech in the corpus:

In [None]:
import nltk

brown_news_tagged = brown.tagged_words(categories='news') #not universal tagset
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

So what do these tags mean?

In [None]:
nltk.help.brown_tagset() #the tags above are Brown tags; nltk.help.upenn_tagset() describes the Penn Treebank tags

We can also find out what the most common nouns are. For the linguists: there are naturally many subgroups of nouns, so let's see what we can get:

In [None]:
def find_tags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions()) #cfd.conditions() yields all tag possibilities

tagdict = find_tags('NN', brown_news_tagged)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

We can also look at the linguistic environment in which words occur. The code below lists all the words that follow "President":

In [None]:
brown_news_text = brown.words(categories='news')
sorted(set(b for (a, b) in nltk.bigrams(brown_news_text) if a == 'President'))

If we are looking to build a classifier, perhaps for author identification, it may be useful to quantify the syntax.

In [None]:
tags = [b[1] for (a, b) in nltk.bigrams(brown_news_tagged) if a[1].startswith('VB')]
fd1 = nltk.FreqDist(tags)
fd1.tabulate(10)

## Automatic Tagging

Now that we know some things we can do with a tagged corpus, how can we tag our own corpus? We will work through regex models, n-gram models, and discuss a couple more advanced models.

### Regex Tagger

Let's write a simple regex tagger for five parts of speech. First we need to define the patterns for each part:

In [None]:
patterns = [
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # simple past
     (r'.*\'s$', 'NN$'),               # possessive nouns
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'.*', 'NN')                     # nouns (default)
 ]
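
The order of the patterns matters, because the tagger tries them from top to bottom and the first match wins. The pure-Python sketch below mimics that behaviour on a few made-up words; `first_match_tag` is a hypothetical helper written for illustration and is not NLTK's implementation:

```python
import re

# same patterns as above, under a different name so nothing is overwritten
toy_patterns = [
    (r".*ing$", "VBG"),
    (r".*ed$", "VBD"),
    (r".*'s$", "NN$"),
    (r".*s$", "NNS"),
    (r".*", "NN"),
]

def first_match_tag(word, patterns):
    """Return the tag of the first pattern that matches the word."""
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None

words = ["sailing", "wrecked", "captain's", "voyages", "ship", "thing", "his"]
toy_tagged = [(w, first_match_tag(w, toy_patterns)) for w in words]
print(toy_tagged)
```

"captain's" only receives `NN$` because the possessive pattern is listed before the plural one. "thing" and "his" show how easily suffix rules misfire.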

Now we build the tagger and we can test it on the first sentence of our *Gulliver's Travels*.

In [None]:
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(gsents[0])

That didn't work so well. No worries, this was a very naïve attempt. But we can evaluate the accuracy nonetheless:

In [None]:
brown_tagged_sents = brown.tagged_sents(categories='news')
regexp_tagger.evaluate(brown_tagged_sents) #recent NLTK releases prefer the equivalent .accuracy()

### N-Gram Tagging

N-gram tagging is a very basic supervised machine learning technique. An n-gram tagger looks at a word and the tags of the *n*-1 preceding words to determine the best tag for the focal word. N-gram taggers and similar models are called "supervised" because they learn from data that has already been labeled with the correct answers. This also means that we must divide the data into training and testing sets: if you test a model on the same data it was trained on, the accuracy estimate will be biased upward. A simple 90-10 split was the traditional recommendation, but k-fold cross-validation, usually with 10 folds, is now the more common standard.

In [None]:
#divide tagged data
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#train bigram tagger
bigram_tagger = nltk.BigramTagger(train_sents) #word and tag of prev word
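
For comparison with the single 90-10 split, here is a minimal sketch of how k-fold splits could be produced by hand. `k_fold_splits` is a hypothetical helper and the "sentences" are placeholder strings. It assumes the data is a plain list:

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; every item lands in exactly one test fold."""
    fold_size, remainder = divmod(len(items), k)
    start = 0
    for i in range(k):
        # the first `remainder` folds get one extra item
        stop = start + fold_size + (1 if i < remainder else 0)
        yield items[:start] + items[stop:], items[start:stop]
        start = stop

# placeholder data standing in for tagged sentences
sentences = ["sent%d" % i for i in range(23)]
folds = list(k_fold_splits(sentences, k=10))
for train, test in folds[:3]:
    print(len(train), len(test))
```

With the Brown data you would first convert the corpus view to a list, for example `list(brown_tagged_sents)`. You would then train and evaluate one tagger per fold and average the ten scores.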

We can now try this tagger on that sentence again:

In [None]:
bigram_tagger.tag(gsents[0])

Every `None` marks a word the tagger could not tag. A bigram tagger has no entry for a word it never saw in training, and once it emits `None`, the next word's context (a previous tag of `None`) is also unseen, so the rest of the sentence usually goes untagged too. To fix this we implement backoff tagging, also called cascading taggers:

In [None]:
t0 = nltk.RegexpTagger(patterns)
t1 = nltk.UnigramTagger(train_sents, backoff=t0) #only current word, most likely tag
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
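
The effect of backoff is easiest to see on a tiny example. The sketch below uses two made-up training sentences and a `DefaultTagger` as the last resort in place of the regex tagger. The names `toy_train`, `bigram_only` and `cascade` are only for illustration:

```python
import nltk

# two made-up training sentences with Brown-style tags
toy_train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# "bird" never occurs in the training data
sentence = ["the", "bird", "sleeps"]

# without backoff, the unknown word and everything after it get None
bigram_only = nltk.BigramTagger(toy_train)
print(bigram_only.tag(sentence))

# with a cascade, each tagger hands unknown cases to the next one
default = nltk.DefaultTagger("NN")
unigram = nltk.UnigramTagger(toy_train, backoff=default)
cascade = nltk.BigramTagger(toy_train, backoff=unigram)
print(cascade.tag(sentence))
```

In the cascade, "bird" falls all the way through to the default tag, while "sleeps" is recovered by the unigram tagger.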

Now let's try to tag that sentence again:

In [None]:
t3.tag(gsents[0])

In [None]:
t3.evaluate(test_sents)

### Transformation-based Brill Tagging

There are many different machine learning algorithms out there. The current "hot" choice is neural networks, but that is beyond the scope of this workshop. Let's look at a transformation-based tagger included in NLTK, which will help us understand how many machine learning models make decisions.

In [None]:
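from nltk.tag.brill import brill24 #or fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tbl.template import Template

def train_brill_tagger(tagged_sents):
    #initial tagger: the same backoff chain as above, trained only on the sentences passed in
    t0 = nltk.RegexpTagger(patterns)
    t1 = nltk.UnigramTagger(tagged_sents, backoff=t0)
    t2 = nltk.BigramTagger(tagged_sents, backoff=t1)
    t3 = nltk.TrigramTagger(tagged_sents, backoff=t2)
    Template._cleartemplates() #reset the templates so this cell can be re-run
    templates = brill24() #or fntbl37()
    trainer = BrillTaggerTrainer(t3, templates, trace=3)
    return trainer.train(tagged_sents, max_rules=100)

#train on the training split only, so that test_sents stays unseen for evaluation
tagger = train_brill_tagger(train_sents)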


We see that the Brill tagger corrects itself up to a certain threshold, using rules it generated from the data we gave it. Other machine learning models such as Conditional Random Fields (CRF) work in a similar way: you tell the model which features are important to look at, and it weights these features in making its decisions. Neural networks take a different approach, relying more on linear algebra and matrix multiplication. Libraries exist for easy implementation of neural nets, such as pybrain (http://pybrain.org) for general advanced modelling and nlpnet (http://nilc.icmc.usp.br/nlpnet/index.html) for POS tagging or SRL (Semantic Role Labeling).
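
To make the idea of a transformation rule concrete, here is a minimal sketch that applies one hand-written rule of the form "change tag A to tag B when the previous tag is C". The rule, the sentence and the helper `apply_rule` are made up for illustration. The real trainer discovers and ranks such rules itself by how many errors each one fixes:

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Retag from_tag as to_tag wherever the preceding tag is prev_tag."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == from_tag and result[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result

# made-up output of an initial tagger that tags every "race" as a noun
initial = [("I", "PPSS"), ("want", "VB"), ("to", "TO"),
           ("race", "NN"), ("the", "AT"), ("race", "NN")]

# hypothetical rule: NN becomes VB when the previous tag is TO
corrected = apply_rule(initial, from_tag="NN", to_tag="VB", prev_tag="TO")
print(corrected)
```

Only the first "race" changes, since the second one follows an article rather than "to".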

So let's tag that sentence again with our Brill tagger:

In [None]:
gtagged_sent = tagger.tag(gsents[0])
print (gtagged_sent)

In [None]:
tagger.evaluate(test_sents)

Let's POS tag our entire text:

In [None]:
g_tagged_all = [tagger.tag(sent) for sent in gsents]

In [None]:
g_tagged_all[:3]

What types of adjectives are used?

In [None]:
g_tagged_words = [item for sublist in g_tagged_all for item in sublist]

tagdict = find_tags('JJ', g_tagged_words)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

How about we compare the syntax of Gulliver's Travels to the news corpus:

In [None]:
tags = [b[1] for (a, b) in nltk.bigrams(g_tagged_words) if a[1].startswith('VB')]
fd2 = nltk.FreqDist(tags)

print ("Gulliver")
fd2.tabulate(10)
print ()
print ("News")
fd1.tabulate(10)

There are several possible explanations for the difference. Perhaps because a novel's characters are already familiar to the reader, verbs are more often followed by personal pronoun objects ("me", "him", "her", etc.) than by articles, whereas a news source more often uses an article to introduce something the reader does not yet know.

In developing machine learning models, you may want to know where the model is making errors. This can be done by examining the Confusion Matrix:

In [None]:
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent] #just grabbing a list of all the tags
def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus] #notice we first untag the sentence

gold = tag_list(test_sents) #held-out sentences, so the errors are not masked by training data
test = tag_list(apply_tagger(tagger, test_sents))

cm = nltk.ConfusionMatrix(gold, test)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=10))
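
A confusion matrix is just a count of (gold tag, predicted tag) pairs. The sketch below builds those counts by hand with `collections.Counter` on two short made-up tag lists. It only illustrates the idea and is not how `nltk.ConfusionMatrix` is implemented:

```python
from collections import Counter

# made-up gold-standard tags and tagger output for six tokens
toy_gold = ["NN", "VB", "NN", "JJ", "NN", "VB"]
toy_pred = ["NN", "NN", "NN", "JJ", "VB", "VB"]

# count every (gold, predicted) pair
confusion = Counter(zip(toy_gold, toy_pred))

# the off-diagonal cells are the errors
errors = Counter({pair: n for pair, n in confusion.items() if pair[0] != pair[1]})

# the diagonal cells give the accuracy
accuracy = sum(n for (g, p), n in confusion.items() if g == p) / len(toy_gold)

for (g, p), n in errors.most_common():
    print("gold %s tagged as %s: %d" % (g, p, n))
print("accuracy: %.2f" % accuracy)
```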

### Pickling

If you want to save your model, or any complex variable in Python, you can use pickle:

In [None]:
from pickle import dump,load

with open("brilltagger.pkl", "wb") as f:
    dump (tagger, f, -1) #-1 calls for a more efficient binary protocol

In [None]:
with open('brilltagger.pkl', 'rb') as f:
    tagger = load(f)
    
type (tagger)

# 3) Chunking, grammars, and Named Entity Recognition

At a low linguistic level, you may want to map out the structure of a sentence based on its parts of speech. The resulting visualization is really just a navigable data type (a tree), which can also be mined for statistics. We first have to define a grammar. Here a noun phrase consists of an optional determiner, article, cardinal number, or possessive pronoun, followed by any number of adjectives and then a noun or personal pronoun. Defining the grammar is much like writing regular expressions. We can then draw the map.

In [None]:
grammar = r"""
  NP: {<DT|AT|CD|PP\$>?<JJ>*<PPSS|NN.*>}       
  PP: {<IN><NP>}            
  VP: {<BEDZ|HVD|VB.*><AT>?<OD>?<NP|PP|CLAUSE>+} 
  CLAUSE: {<NP><VP>}        
  """
# | is "or"; a following ? means optional; * means zero or more; .* matches any remaining characters in the tag

cp = nltk.RegexpParser(grammar)
result = cp.parse(gtagged_sent)
result #outside a notebook, use result.draw() to open the tree in a window

In [None]:
print (result) #can be traversed using indexes, obviously searched as well
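
As a minimal sketch of that traversal, the snippet below chunks a hand-tagged toy sentence with a one-rule grammar and collects the text of every noun phrase. The sentence, its tags and the `toy_` names are made up for illustration:

```python
import nltk

# hand-tagged toy sentence with Brown-style tags
toy_sent = [("the", "AT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
            ("at", "IN"), ("the", "AT"), ("cat", "NN")]

# one rule: optional article, any number of adjectives, then a noun
toy_parser = nltk.RegexpParser(r"NP: {<AT>?<JJ>*<NN.*>}")
toy_tree = toy_parser.parse(toy_sent)

# walk the tree and join the words of every NP subtree
toy_nps = [" ".join(word for word, tag in subtree.leaves())
           for subtree in toy_tree.subtrees(filter=lambda t: t.label() == "NP")]
print(toy_nps)
```

The same `subtrees` call could be run over every chunked sentence of a text to count or list its noun phrases.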

With this information, we can then train classifiers for Named Entity Recognition (NER), i.e. identifying people, places, and things. We won't go into detail today, but NLTK already has a trained classifier we can use off-the-shelf:

In [None]:
print (nltk.ne_chunk(gtagged_sent, binary=True))

# 4) Document Classification

We now turn to a corpus of movie reviews, each already labeled as positive or negative. We can build a Naive Bayes classifier that learns from the annotated data and then predicts whether unseen reviews are positive or negative.

In [None]:
import random
from nltk.corpus import movie_reviews

movie_reviews.categories()

In [None]:
movie_reviews.fileids()

In [None]:
len(movie_reviews.fileids())

In [None]:
movie_reviews.words("neg/cv000_29416.txt")

In [None]:
documents = [(list(movie_reviews.words(fileid)), category)
                for category in movie_reviews.categories()
                for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [None]:
documents[0]

We will use a relatively simple model with only two kinds of features: single words and bigrams. First we need to get a list of the most common words and bigrams in the entire corpus.

In [None]:
movie_words = [w.lower() for w in movie_reviews.words()]
all_words = nltk.FreqDist(movie_words)
word_features = [word for (word, count) in all_words.most_common(2000)]

all_bis = nltk.FreqDist(nltk.bigrams(movie_words))
bi_features = [bi for (bi, count) in all_bis.most_common(2000)]

Now we define the features we deem relevant for classifying a document. In our case we will only use the words and bigrams generated above.

In [None]:
def document_features(document):
    document_words = set(document)
    document_bis = set(nltk.bigrams(document))
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    for bi in bi_features:
        features['contains({})'.format(str(bi))] = (bi in document_bis)
    return features
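
To see what such a feature dictionary looks like, here is a minimal sketch with a tiny made-up vocabulary. `toy_document_features` is a hypothetical stand-in that takes the vocabularies as arguments and builds bigrams with `zip` instead of `nltk.bigrams`:

```python
def toy_document_features(document, vocab_words, vocab_bigrams):
    """Map each vocabulary item to True or False for one tokenized document."""
    words = set(document)
    bigrams = set(zip(document, document[1:]))  # adjacent word pairs
    features = {}
    for word in vocab_words:
        features["contains({})".format(word)] = word in words
    for bi in vocab_bigrams:
        features["contains({})".format(bi)] = bi in bigrams
    return features

# made-up vocabulary and review
toy_words = ["good", "bad"]
toy_bigrams = [("not", "good")]
review = "this film is not good".split()

toy_feats = toy_document_features(review, toy_words, toy_bigrams)
print(toy_feats)
```

The word feature alone reports that "good" is present, while the bigram feature records that it was negated. That is the motivation for including bigrams.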

We then pair each document's features with its classification in a tuple, divide the results into training and testing sets, and train the classifier.

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

In [None]:
classifier.show_most_informative_features(30)

Now go to the internet, find a movie review, tokenize it, and try to classify it!

Hint:

In [None]:
my_review = nltk.word_tokenize("STRING OF MOVIE REVIEW".lower()) #lowercase to match the lowercased features

In [None]:
rev_doc_feats = document_features(my_review)

In [None]:
classifier.classify(rev_doc_feats)