# NLP with NLTK

Today's workshop covers several concepts in Natural Language Processing (NLP), primarily through NLTK. A fundamental understanding of Python is necessary. We will cover:

1. Pre-processing
2. Preparing and declaring your own corpus
3. POS-Tagging
4. NER
5. Sentiment Analysis


You will need:

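* NLTK (`pip install nltk`)
* requests, Beautiful Soup, and lxml for the pre-processing section (`pip install requests beautifulsoup4 lxml`)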
* the NER wrapper requires the [Java Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml#Download)

# 1) Pre-processing

We are going to take Jonathan Swift's *Gulliver's Travels* from archive.org as our text for today's workshop. Although we will also use pre-made corpora to explore more robust options, it is useful to know how to clean your own text files, create and properly declare your own corpus, and run analyses on it, so we will start from scratch.

## String manipulation and cleaning

Let's first use Beautiful Soup to grab only the text. Packages exist for cleaning texts from standard sites, such as a Gutenberg package for gutenberg.org, but today we'll clean the text manually as best we can:

In [None]:
import requests
from bs4 import BeautifulSoup

url = "http://tinyurl.com/gullivert"#"https://ia801404.us.archive.org/2/items/gulliverstravels17157gut/17157-h/17157-h.htm"

f = requests.get(url)
html = f.content

print (f.content)

Create a BeautifulSoup object and trim the text:

In [None]:
#clean and extract only raw text 
bspage = BeautifulSoup(html, "lxml") #or "html.parser"
rawtext = bspage.get_text()

#slice at beginning and end of book
beginning = "My father had"
end = "of my unfortunate voyages."
gtravels = rawtext[rawtext.find(beginning):rawtext.find(end)+len(end)]

print (gtravels)
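
One caveat with the slice above: `str.find` returns -1 when a marker is missing, so a typo in `beginning` or `end` would silently give you the wrong text. Here is a minimal sketch of a stricter version. The helper name `slice_between` and the sample string are made up for illustration, and unlike the cell above it looks for the end marker only after the start marker.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker through end_marker, inclusive."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end + len(end_marker)]


# Made-up sample standing in for the raw page text.
sample = "Preface. My father had a small estate. End of my unfortunate voyages. Footer."
book = slice_between(sample, "My father had", "of my unfortunate voyages.")
print(book)
```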

You'll notice there are still page numbers and chapter headings in our text, and you might have other pieces you want to clean. Recalling your regex work from Day 3 of the intro series, how can we get rid of all the page numbers within brackets?

In [None]:
import re

#regex for page numbers in brackets (raw strings keep the backslashes literal)
gtravels = re.sub(r"\[[0-9]+\]", "", gtravels)

#regex to remove an all-caps word followed by a Roman numeral I-VIII and a period (only 8 chapters)
gtravels = re.sub(r"[A-Z]+ (I?V|V?I{1,3})\.", "", gtravels)

print (gtravels)
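
To see what the two substitutions do, here is a small sketch on a made-up sample string. It assumes chapter headings look like `CHAPTER IV.` (an all-caps word, a Roman numeral, and a period); check a few headings in your own text before relying on the pattern.

```python
import re

# Made-up sample imitating the page numbers and chapter headings in the text.
sample = "CHAPTER II. The emperor arrives [12] with his court. CHAPTER VIII. The end [301]."

# Remove bracketed page numbers such as [12].
no_pages = re.sub(r"\[[0-9]+\]", "", sample)

# Remove an all-caps word followed by a Roman numeral I-VIII and a period.
no_headings = re.sub(r"[A-Z]+ (I?V|V?I{1,3})\.", "", no_pages)

# Collapse the leftover runs of whitespace.
cleaned = " ".join(no_headings.split())
print(cleaned)
```

Note that the heading pattern would also remove a phrase such as "AND I." in running text, so it is worth printing the matches first with `re.findall` on a real book.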

Let's save this text so we can read it in the corpus later:

In [None]:
import codecs
with codecs.open("gulliver.txt", "w","utf-8") as f:
    f.write(gtravels)

# 2) Declaring a corpus in NLTK

While you can use NLTK on strings and lists of sentences, it's better to formally declare your corpus.

In [None]:
from nltk.corpus import PlaintextCorpusReader

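corpus_root = "" #relative path to the folder holding gulliver.txt; empty means the current directory
my_texts = PlaintextCorpusReader(corpus_root, r'.*\.txt') #second argument is a regex over file names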

We now have a text corpus, on which we can run all the basic methods you learned in the introductory sequence. To list all the files in our corpus:

In [None]:
my_texts.fileids()

We can also extract either all the words or all the sentences in list format:

In [None]:
my_texts.words('gulliver.txt')  # tokenized with NLTK's WordPunctTokenizer by default

In [None]:
my_texts.sents('gulliver.txt')

In [None]:
my_texts.paras('gulliver.txt')[0]

In [None]:
gsents = my_texts.sents('gulliver.txt')
print (gsents)
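
If you want to try the corpus reader without the Gulliver file, the following sketch builds a throwaway corpus in a temporary directory. The directory, the file name `sample.txt`, and its contents are made up for illustration. It only calls `fileids()` and `words()`, which need no downloaded NLTK data (`sents()` and `paras()` rely on the punkt sentence tokenizer).

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Build a throwaway corpus directory with one small text file.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "sample.txt"), "w", encoding="utf-8") as out:
    out.write("My father had a small estate. I was the third of five sons.")

# The second argument is a regular expression matched against file names.
toy_corpus = PlaintextCorpusReader(corpus_dir, r".*\.txt")

toy_fileids = toy_corpus.fileids()
toy_words = list(toy_corpus.words("sample.txt"))
print(toy_fileids)
print(toy_words[:7])
```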

We now have a corpus, or text, from which we can get any of the statistics you learned in Day 3 of the Python workshop. We will review some of these functions once we have gathered some more information.

# 3) POS-Tagging

There are many situations in which "tagging" words (or really anything) is useful, whether to identify and quantify trends or as a step in further text analysis. We will cover three methods of tagging: simple regex, n-gram, and Brill transformation-based tagging. HMMs, CRFs, and neural networks will not be covered today, but we will briefly mention them as additional machine learning models.
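
As a rough preview of the first two methods, here is a minimal sketch that chains a unigram tagger, a regex tagger, and a default tagger through NLTK's backoff mechanism. The patterns and the two training sentences are made up for illustration; a real tagger would be trained on a tagged corpus.

```python
from nltk.tag import DefaultTagger, RegexpTagger, UnigramTagger

# Last resort: call everything a noun.
default_tagger = DefaultTagger("NN")

# Simple regex rules, tried in order; unmatched words fall back to the default.
patterns = [
    (r"^[0-9]+$", "CD"),   # cardinal numbers
    (r".*ing$", "VBG"),    # gerunds
    (r".*ed$", "VBD"),     # simple past
    (r".*s$", "NNS"),      # plural nouns
]
regex_tagger = RegexpTagger(patterns, backoff=default_tagger)

# A unigram (1-gram) tagger memorizes the most frequent tag for each word
# in the training data and defers to the regex tagger for unseen words.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "DT"), ("cats", "NNS"), ("sleep", "VBP")],
]
unigram_tagger = UnigramTagger(train_sents, backoff=regex_tagger)

tagged = unigram_tagger.tag(["the", "dogs", "sleep", "walking", "42"])
print(tagged)
```

Here "the" and "sleep" are tagged from the training data, while "dogs", "walking", and "42" are unseen and fall through to the regex rules.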

In Natural Language Processing (NLP), part-of-speech (POS) tagging is the most common use of tagging, but the tag itself can be anything. Other applications include sentiment analysis and Named Entity Recognition (NER). Tagging simply assigns each word to a category, stored as a (word, tag) tuple.

Nevertheless, POS tagging is nearly essential for training more advanced tagging models. If you are defining a machine learning model to predict patterns in your text, these patterns will most likely rely on POS features, among other things. You will therefore tag POS first and then use the POS tags as features in your model.
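
One way to picture "POS as a feature" is a function that turns each token of a tagged sentence into a dictionary of features. The sketch below is an assumption about what such a function could look like, not a fixed recipe: the name `pos_features`, the feature names, and the `<START>`/`<END>` sentinels are all made up for illustration.

```python
def pos_features(tagged_sent, i):
    """Build a feature dict for the token at position i of a tagged sentence."""
    word, tag = tagged_sent[i]
    return {
        "word": word.lower(),
        "pos": tag,
        # Tags of the neighbours; sentinel values mark the sentence boundaries.
        "prev_pos": tagged_sent[i - 1][1] if i > 0 else "<START>",
        "next_pos": tagged_sent[i + 1][1] if i < len(tagged_sent) - 1 else "<END>",
    }


example = [("My", "PRP$"), ("name", "NN"), ("is", "VBZ"), ("Chris", "NNP")]
features = [pos_features(example, i) for i in range(len(example))]
print(features[3])
```

Dictionaries of this shape are what NLTK's classifiers accept as feature sets, so the output could be paired with labels and passed to a classifier.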

## At a low level

Tagging creates a (word, tag) tuple for every word in a text or corpus. For example, "My name is Chris." may be tagged for POS as:

My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period

*NB: type 'nltk.data.path' to find the path on your computer to your downloaded nltk corpora. You can explore these files to see how large corpora are formatted.*

You'll notice how the text is annotated, using a forward slash to match the word to its tag. So how can we get this to a useful form for Python?

In [None]:
from nltk.tag import str2tuple

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [str2tuple(t) for t in line.split()]

print (tagged_sent)
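
A few details of these helpers are easy to miss, so here is a short sketch of how `str2tuple` behaves, together with its companions `tuple2str` and `untag` from the same module. The inputs are made up for illustration.

```python
from nltk.tag import str2tuple, tuple2str, untag

# str2tuple splits on the last "/" and upper-cases the tag.
pair = str2tuple("name/Noun")

# Without a separator the tag is None.
no_tag = str2tuple("name")

# tuple2str goes the other way, back to the word/TAG notation.
back = tuple2str(pair)

# untag strips the tags from a tagged sentence.
words = untag([("My", "PRP$"), ("name", "NN")])

print(pair, no_tag, back, words)
```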

Further analysis of tags with NLTK requires a *list* of sentences; otherwise, higher-level methods will raise an index error.

Naturally, these tags are a bit verbose. The standard tagging conventions follow the Penn Treebank (more in a second): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

## Automatic Tagging

In [None]:
from nltk import pos_tag

#pos_tag needs NLTK's pre-trained tagger model (type nltk.download() if it is missing)
gtagged_sent = pos_tag(gsents[0])
print (gtagged_sent)

In [None]:
g_tagged_all = [pos_tag(sent) for sent in gsents]

In [None]:
g_tagged_all[:3]

In [None]:
import nltk
def find_tags(tag_prefix, tagged_text):
    #condition on the tag and count the words that carry it
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    #cfd.conditions() yields every tag that matched the prefix
    return {tag: cfd[tag].most_common(5) for tag in cfd.conditions()}

In [None]:
g_tagged_words = [item for sublist in g_tagged_all for item in sublist]

tagdict = find_tags('JJ', g_tagged_words)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

In [None]:
tags = [b[1] for (a, b) in nltk.bigrams(g_tagged_words) if a[1].startswith('VB')]
fd2 = nltk.FreqDist(tags)

print ("Gulliver")
fd2.tabulate(10)
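
The bigram comprehension above is dense, so here is the same idea on a tiny hand-tagged list. The sentence and its tags are made up for illustration.

```python
import nltk

toy_tagged = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("island", "NN"),
              ("and", "CC"), ("left", "VBD"), ("the", "DT"), ("shore", "NN")]

# nltk.bigrams yields consecutive pairs of (word, tag) tuples.
# Keep the tag of the second item whenever the first item's tag is a verb tag.
after_verbs = [b[1] for (a, b) in nltk.bigrams(toy_tagged) if a[1].startswith("VB")]

toy_fd = nltk.FreqDist(after_verbs)
print(toy_fd.most_common())
```

Both verbs are followed by a determiner, so the distribution holds a single tag, `DT`, with a count of 2.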

## Working with a tagged corpus

Now that we know how tagging works, let's import a tagged corpus from the NLTK database and see what we can do.

In [None]:
from nltk.corpus import brown #if you don't have this downloaded, type nltk.download()
brown.tagged_words()

*NB: the argument `tagset="universal"` simplifies the tagset.*

Let's find the most frequent parts of speech in the corpus:

In [None]:
import nltk

brown_news_tagged = brown.tagged_words(categories='news') #not universal tagset
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

So what do these tags mean?

In [None]:
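nltk.help.brown_tagset()  #the Brown corpus uses its own tagset
#nltk.help.upenn_tagset() describes the Penn Treebank tags that pos_tag returns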

We can also find the most common nouns. For the linguists: there are naturally many subgroups of nouns, so let's see what we can get:

In [None]:
tagdict = find_tags('NN', brown_news_tagged)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

We can also look at the linguistic environment a word appears in. The code below lists all the words that follow "President":

In [None]:
brown_news_text = brown.words(categories='news')
sorted(set(b for (a, b) in nltk.bigrams(brown_news_text) if a == 'President'))

If we are looking to build a classifier, perhaps for author identification, it may be useful to quantify the syntax.

In [None]:
tags = [b[1] for (a, b) in nltk.bigrams(brown_news_tagged) if a[1].startswith('VB')]
fd1 = nltk.FreqDist(tags)

print ("Gulliver")
fd2.tabulate(10)
print ()
print ("News")
fd1.tabulate(10)
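
The two tables show raw counts, and the two texts differ in size, so the counts are hard to compare directly. One option, sketched here on made-up tag lists, is to compare relative frequencies with `FreqDist.freq`, which divides a count by the total number of samples.

```python
import nltk

# Toy tag lists standing in for the tags that follow verbs in two corpora.
tags_a = ["DT", "DT", "IN", "PRP"]
tags_b = ["DT", "IN", "IN", "IN", "PRP", "PRP", "RB", "RB"]

fd_a = nltk.FreqDist(tags_a)
fd_b = nltk.FreqDist(tags_b)

# freq() returns count / total, so corpora of different sizes become comparable.
for tag in sorted(set(fd_a) | set(fd_b)):
    print(f"{tag:4} {fd_a.freq(tag):.2f} {fd_b.freq(tag):.2f}")
```

A vector of such relative frequencies is one possible way to quantify syntax as input for a classifier.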

# 4) Named Entity Recognition

In [None]:
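from nltk.tag.stanford import StanfordNERTagger

#both paths are local to the author's machine; point them at your own Stanford NER download
st = StanfordNERTagger(
        '/Users/chench/Documents/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz', #trained 3-class model
        '/Users/chench/Documents/stanford-ner-2015-12-09/stanford-ner.jar') #Stanford NER jar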

In [None]:
ner_sents = st.tag_sents(gsents) #tags all sentences in one call rather than starting Java once per sentence
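
The tagger returns one (token, tag) tuple per token, with "O" for tokens outside any entity, so a multi-word name comes back as several tuples. The sketch below merges consecutive tokens that share a tag. The helper name `group_entities` and the hand-written input are made up for illustration, and no Java is needed to run it.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside="O"):
    """Merge runs of consecutive tokens that share a non-"O" tag."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(word for word, _ in run), tag))
    return entities


# Hand-written example in the (token, tag) shape described above.
toy_ner = [("Lemuel", "PERSON"), ("Gulliver", "PERSON"), ("sailed", "O"),
           ("from", "O"), ("Bristol", "LOCATION"), (".", "O")]
entities = group_entities(toy_ner)
print(entities)
```

One limitation of this simple grouping: two different entities of the same type that sit directly next to each other would be merged into one.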

# 5) Sentiment Analysis

In [None]:
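from nltk.sentiment.vader import SentimentIntensityAnalyzer

#requires the vader_lexicon resource (type nltk.download() if it is missing)
sid = SentimentIntensityAnalyzer()

#polarity_scores expects a string, so join the tokens of one sentence
s = " ".join(gsents[0])
sent_pol = sid.polarity_scores(s)["compound"]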
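
The compound score is a single number between -1 and 1. If you want labels instead, one option is to threshold it. The sketch below uses a cutoff of 0.05 on either side, a commonly used convention that you should treat as an assumption and tune for your own text. The helper name `label_compound` and the sample scores are made up for illustration.

```python
def label_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores standing in for the analyzer's output.
toy_scores = [0.62, -0.4, 0.0, 0.03]
labels = [label_compound(score) for score in toy_scores]
print(labels)
```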