# Introduction

Today's workshop will address various concepts in Natural Language Processing, primarily through the use of NLTK. Students should already have a fundamental understanding of Python. We'll work with a corpus of documents and learn how to identify different types of linguistic structure in the text, which can help in classifying the documents or extracting useful information from them. We'll cover:

1. NLTK Corpora
2. Tokenization
3. Part-of-Speech (POS) Tagging
4. Phrase Chunking
5. Named Entity Recognition (NER)
6. Dependency Parsing

You will need:

* NLTK (in Bash `$ pip install nltk`)

* NLTK Book corpora and packages (in Python `>>> nltk.download()`)

* NumPy package (in Bash `$ pip install numpy`)

* scikit-learn (in Bash `$ pip install scikit-learn`)

* Stanford Parser: Download Stanford Parser 3.6.0 and unzip to a location that's easy for you to find (e.g. a folder called SourceCode in your Documents folder). Link: http://nlp.stanford.edu/software/lex-parser.shtml#Download

Much of today's work will be adapted, or taken directly, from the NLTK book found here: http://www.nltk.org/book/ .

## Motivation

Why would we use natural language processing? How does it relate to other things we might be doing -- or trying to do -- with text in our research?

Natural language processing is a field of computer science and linguistics; it aims to enable computers to process and derive meaning from input in human language. NLP research is being used to automate tasks like translation, question answering, voice recognition, and language generation.

### Task: Document Classification

We're often interested in characterizing text from different sources, e.g. measuring the ideology of different politicians based on the language used in their speeches. A simple case would be a situation in which we have a bunch of documents that we want to label as "positive" or "negative". This is often called "sentiment analysis", and it can be very difficult, despite only having two categories, because sentiment is a subjective and often subtle idea.

Since sentiment analysis involves human judgment about the meaning of language, we'll need to do this in a supervised manner, using training data that has already been labeled. We'll need to use a bit of machine learning for this task, but we'll use one of the existing classifiers provided by `scikit-learn`. These classifiers take a set of training documents that have already been categorized, and learn how to predict the categories of other documents. We'll use the NLTK Movie Reviews corpus for our training data.

We can't give a classifier raw text, because it wouldn't know how to use human language in its calculations. Instead, we represent each document by a vector of features (numeric or boolean values) and then the machine learns what combinations of those features fall into each category. Throughout the workshop today, we'll learn how NLP tools can help us extract different features from documents, to represent the documents with more meaningful or relevant information that might help classify them more accurately.
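As a minimal sketch of what "a vector of features" means, the toy example below turns two made-up documents into boolean vectors recording which vocabulary words each one contains. The documents, the `vocabulary` list, and the `to_feature_vector` helper are illustrative assumptions, not part of NLTK or scikit-learn.

```python
# Toy documents (made up for illustration), already lowercased.
docs = ["a great fun movie", "a bad boring movie"]

# Vocabulary: every distinct word across the documents, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_feature_vector(doc, vocabulary):
    """Return one boolean per vocabulary word: does the document contain it?"""
    words = set(doc.split())
    return [word in words for word in vocabulary]

vectors = [to_feature_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
for vec in vectors:
    print(vec)
```

Every document gets a vector of the same length, so a classifier can compare documents position by position even though the original texts differ in length.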

Let's first import `nltk` and download some of the tools we'll use.

In [1]:
import nltk

nltk.download('averaged_perceptron_tagger')  # download POS tagger
nltk.download('punkt')  # download tokenizer
nltk.download('brown')  # download brown news corpus
nltk.download('movie_reviews')  # download IMDB movie reviews corpus

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /global/homes/k/khcheung/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /global/homes/k/khcheung/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package brown to
[nltk_data]     /global/homes/k/khcheung/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /global/homes/k/khcheung/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

## 1) Starting with a Corpus

We can use NLTK on strings, lists or dictionaries of strings, or files containing text. We call the overall body of texts we're working with a "corpus", which is a collection of written documents or texts (plural "corpora"). You might have a single .csv file of sentences, titles, tweets, etc. that you want to read into Python all at once and then analyze. If you have your documents in different files, however, NLTK provides a class called `PlaintextCorpusReader` for working with a corpus as a group of text files. We can declare an NLTK corpus object containing all text files in the current working directory or subdirectories as follows:

In [2]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = "" # relative path, i.e. current working directory
my_corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')  # match any file ending in .txt

We can list all of the files we've just included in the corpus as follows:

In [3]:
my_corpus.fileids()

['example.txt']

We can read in the contents of a file in our corpus as a string using the .raw() method:

In [4]:
my_corpus.raw('example.txt')

"Welcome to natural language processing! Is it NLP or N.L.P.? Let's work with NLTK to process, classify, and extract information from texts.\n"

We can also extract either all the words, or sentences and their words, as lists of strings:

In [5]:
my_corpus.words('example.txt')

['Welcome', 'to', 'natural', 'language', 'processing', ...]

In [6]:
sents = my_corpus.sents('example.txt')
print(sents)

[['Welcome', 'to', 'natural', 'language', 'processing', '!'], ['Is', 'it', 'NLP', 'or', 'N', '.', 'L', '.', 'P', '.?'], ...]


NLTK comes with a variety of downloadable corpora that can be used for trying out the methods in the toolkit. We'll use two of these corpora in just a bit, when we set up our practical tasks.

## 2) Tokenization

We usually look at grammar and meaning at the level of words, related to each other within sentences, within each document. So if we're starting with raw text, we first need to split the text into sentences, and those sentences into words -- which we call "tokens". An NLTK corpus object does this for us, allowing us to read in lists of words from our text files. But NLTK also provides tools that enable us to "tokenize" strings ourselves, if we've read them in in longer form.

To understand how to pre-process raw text, let's read in the contents of 'example.txt' in your current directory, using the .raw() method to get one long string.

In [7]:
example_text = my_corpus.raw('example.txt')
print(example_text)

Welcome to natural language processing! Is it NLP or N.L.P.? Let's work with NLTK to process, classify, and extract information from texts.



Now, you might imagine that the easiest way to identify sentences is to split the document at every period '.', and to split the sentences using white space to get the words.

In [8]:
example_sents = example_text.split('.')
example_sents_toks = [sent.split(' ') for sent in example_sents]

for sent_toks in example_sents_toks:
    print(sent_toks)

['Welcome', 'to', 'natural', 'language', 'processing!', 'Is', 'it', 'NLP', 'or', 'N']
['L']
['P']
['?', "Let's", 'work', 'with', 'NLTK', 'to', 'process,', 'classify,', 'and', 'extract', 'information', 'from', 'texts']
['\n']


This doesn't look right. Not all periods divide sentences (periods may also be used in abbreviations), and not all sentences end in a period (some end in question marks or exclamation points). Words might be separated not only by single spaces, but also by tabs or newlines. We can use the split function from the 're' module with regular expressions that capture these various possibilities:

In [9]:
import re

# raw strings keep Python from treating '\s' as a string escape sequence
example_sents = re.split(r'(?<=[a-z])[.?!]\s', example_text)
example_sents_toks = [re.split(r'\s+', sent) for sent in example_sents]

for sent_toks in example_sents_toks:
    print(sent_toks)

['Welcome', 'to', 'natural', 'language', 'processing']
['Is', 'it', 'NLP', 'or', 'N.L.P.?', "Let's", 'work', 'with', 'NLTK', 'to', 'process,', 'classify,', 'and', 'extract', 'information', 'from', 'texts']
['']


This looks better, but it still isn't right. We've lost the punctuation at the end of each sentence, and splitting on the final period and newline left an empty string at the end of the list. The second and third sentences weren't separated at all, because our pattern only splits on punctuation that directly follows a lowercase letter, and the '?' follows a period. Commas have also remained attached to the words 'process' and 'classify', since we only split words on white space. We could instead use 're.findall()' to search for all sequences of alphanumeric characters. This would also split apart contractions, which might be useful if we want to consider 'Let' and 's' (short for 'us') to represent separate words.
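Here is a minimal sketch of the 're.findall()' approach, run on a shortened version of the example text (the shortened string is an assumption made to keep the output small):

```python
import re

text = "Is it NLP or N.L.P.? Let's work with NLTK."

# \w+ matches runs of letters, digits, and underscores; all punctuation is dropped.
tokens = re.findall(r"\w+", text)
print(tokens)
```

Note the trade-off: the contraction is split apart, but so is the abbreviation 'N.L.P.', and sentence boundaries are lost entirely.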

We'll stop there, because NLTK provides handy functions to do this for us:

In [10]:
import nltk

example_sents = nltk.sent_tokenize(example_text)
example_sents_toks = [nltk.word_tokenize(sent) for sent in example_sents]

for sent_toks in example_sents_toks:
    print(sent_toks)

['Welcome', 'to', 'natural', 'language', 'processing', '!']
['Is', 'it', 'NLP', 'or', 'N.L.P', '.', '?']
['Let', "'s", 'work', 'with', 'NLTK', 'to', 'process', ',', 'classify', ',', 'and', 'extract', 'information', 'from', 'texts', '.']


Lists of tokens like these are what we get from the words() and sents() methods of an NLTK corpus object (although, as we saw above with 'N.L.P.', the corpus reader's default tokenizer splits some tokens differently). We'll work with the NLTK corpora from here on, but now you know how to turn your own documents into lists of words, either by creating an NLTK corpus object containing your own text files, or by reading in longer strings and then using NLTK functions to tokenize them.

## Putting into Practice

### Task: Classifying Documents

Now we're ready to set up our first task: classifying documents as positive or negative. For this task, we'll use the NLTK Movie Reviews corpus, which contains 2,000 movie reviews already categorized as positive or negative.

In [11]:
from nltk.corpus import movie_reviews

movie_reviews.categories()

['neg', 'pos']

In [12]:
movie_reviews.fileids()[:10]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt']

In [13]:
print(movie_reviews.raw('neg/cv000_29416.txt'))

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no id

First we need to collect our reviews into two lists, `X` and `y`. Each review's text will be added to `X`, and its corresponding category (0 for negative, 1 for positive) to `y`.

In [14]:
X = []
y = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        X.append(movie_reviews.raw(fileid))
        if category == 'neg':
            y.append(0)
        else:
            y.append(1)

Then we'll shuffle our arrays since currently they're in order of category. This way when we split the data to train we don't get stuck with only positive or only negative reviews to train on.

In [15]:
from sklearn.utils import shuffle
import numpy as np

np.random.seed(1)

X, y = shuffle(X, y, random_state=0)

With `sklearn`'s text pipelines, we can quickly build and test a classifier:

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# count unigrams and bigrams, reweight the counts with tfidf, then fit a linear SVM
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC(random_state=0))
                     ])
# 10-fold cross-validation: one accuracy score per fold
scores = cross_val_score(text_clf, X, y, cv=10)
print(scores, np.mean(scores))



[ 0.86   0.88   0.88   0.805  0.785  0.86   0.86   0.855  0.825  0.865] 0.8475


Whoa! What just happened?!? The pipeline tells us three things happened:

1. CountVectorizer

2. TfidfTransformer

3. LinearSVC

Let's walk through this step by step.

1. A count vectorizer does essentially what we did above with tokenization: it splits each text into words, and then counts how often each word occurs in each document. (Because we set `ngram_range=(1, 2)`, it also counts two-word sequences, or bigrams.) At this point, the feature array for each document is a vector with one entry per unique term in the corpus, holding the number of times that term occurs in the document. This is the most basic way to provide features for a classifier.

2. tfidf (term frequency-inverse document frequency) is a weighting scheme that aims to find words that are important to specific documents. It does this by taking the term frequency (tf) of a specific term in a specific document, and multiplying it by the term's inverse document frequency (idf), which is the logarithm of the total number of documents divided by the number of documents that contain the term at least once. For a term $t$ and a corpus of documents $D$, idf is defined as:

$$idf(t, D)= \log\left(\frac{\mid D \mid}{\mid \{d \in D : t \in d \} \mid}\right )$$

So tfidf for term $t$ in document $d$, where $f_{t,d}$ is the number of times $t$ occurs in $d$, is simply:

$$tfidf(t, d, D)= f_{t,d} \cdot \log\left(\frac{\mid D \mid}{\mid \{d \in D : t \in d \} \mid}\right )$$

A tfidf value is calculated for each term in each document. The feature array for a document now consists of these tfidf values.
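To make the formulas concrete, here is a small worked sketch in plain Python on three made-up, pre-tokenized documents. The helper names `idf` and `tfidf` are illustrative, and the natural logarithm is an assumption, since the formula above doesn't fix a base.

```python
import math
from collections import Counter

# Three tiny tokenized documents (made up for illustration).
docs = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie"],
    ["bad", "plot", "bad", "cast"],
]

def idf(term, docs):
    """log(|D| / number of documents containing the term); the term must occur somewhere."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    """Raw count of the term in the document times the term's idf."""
    return Counter(doc)[term] * idf(term, docs)

for term in ["great", "movie"]:
    print(term, round(tfidf(term, docs[0], docs), 3))
```

In the first document, 'great' scores higher than 'movie' because it occurs twice there and appears in no other document. By default, scikit-learn's `TfidfTransformer` uses a smoothed variant of idf and normalizes each document vector, so its values will differ from this plain version.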

3. Finally, the pipeline sends these tfidf feature arrays to a **Linear Support Vector classifier**, a common machine learning algorithm that is particularly effective on text data.

The code below breaks this down by each step, but combines the `CountVectorizer` and `TfidfTransformer` in the `TfidfVectorizer`.

In [17]:
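from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

# get tfidf values (note: fitting on all of X lets test-set vocabulary and
# document frequencies leak into the features; fit on X_train alone to avoid this)
tfidf = TfidfVectorizer()
tfidf.fit(X)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)

# build and test SVC
svc_class = LinearSVC()
model = svc_class.fit(X_train, y_train)
model.score(X_test, y_test)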

0.84250000000000003

We can then look up the vocabulary terms with the largest positive and negative model coefficients to get the most helpful features for each category:

In [18]:
feature_names = tfidf.get_feature_names_out()  # replaces get_feature_names() in recent scikit-learn
top10pos = np.argsort(model.coef_[0])[-10:]
print("Top features for positive reviews:")
print(list(feature_names[j] for j in top10pos))
print()
print("Top features for negative reviews:")
top10neg = np.argsort(model.coef_[0])[:10]
print(list(feature_names[j] for j in top10neg))

Top features for positive reviews:
['very', 'true', 'seen', 'he', 'fun', 'life', 'well', 'is', 'great', 'and']

Top features for negative reviews:
['bad', 'worst', 'plot', 'any', 'unfortunately', 'script', 'only', 'nothing', 'have', 'looks']


We can also use our model to classify new reviews. All we have to do is extract the tfidf features from the raw text and send them to the model:

In [19]:
new_bad_review = "This movie really sucked. I can't believe how long it dragged on. The actors are absolutely terrible."

features = tfidf.transform([new_bad_review])

model.predict(features)

array([0])

In [20]:
new_good_review = "I loved this film! The cinematography was incredible, and Leonardo Dicarpio is flawless."

features = tfidf.transform([new_good_review])

model.predict(features)

array([1])

## 3) Part-of-Speech Tagging

One of the fundamental aspects of words that we can use to begin to understand them is their parts of speech -- whether a word is a noun, verb, adjective, etc. Labeling each word in a sequence with its part of speech is called "part-of-speech tagging" or "POS tagging". A part of speech represents a syntactic function; the aim here is to identify the grammatical components of a sentence.

State-of-the-art POS taggers rely on probabilistic models and machine learning to tag tokens sequentially, or jointly across the whole sentence at the same time (finding the most likely combination of tags that make sense in relation to each other). Fortunately, you don't need to do this yourself. NLTK comes with an off-the-shelf POS tagger for English language text:

In [21]:
sent = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit.")
sent_tagged = nltk.pos_tag(sent)
print(sent_tagged)

[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN'), ('.', '.')]
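To see why context matters here, consider a deliberately naive baseline that ignores it: tag every word with the single tag it was seen with most often. The sketch below is a hypothetical illustration in plain Python, not how `nltk.pos_tag` works; its "training data" is just the tagged sentence printed above.

```python
from collections import Counter, defaultdict

# Training data: the sentence above, with the tags printed by nltk.pos_tag.
training = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
            ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
            ('refuse', 'NN'), ('permit', 'NN'), ('.', '.')]

# Count how often each word appears with each tag.
tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def unigram_tag(tokens):
    """Tag each token with its most frequent training tag, ignoring context.
    Only works for words that were seen in the training data."""
    return [(tok, tag_counts[tok].most_common(1)[0][0]) for tok in tokens]

tokens = [word for word, _ in training]
baseline = unigram_tag(tokens)

# Both occurrences of 'refuse' necessarily receive the same tag.
print(baseline[1], baseline[8])
mistakes = [(b, g) for b, g in zip(baseline, training) if b != g]
print(len(mistakes))
```

Because 'refuse' and 'permit' each appear once as a verb and once as a noun, a context-free tagger has to get one occurrence of each wrong, even on the very sentence it was built from.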


Because the POS tagger depends on the context of the sentence, it must be run on one tokenized sentence at a time. To tag several sentences at once, we pass a list of tokenized sentences to `nltk.pos_tag_sents()`:

In [22]:
sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize("They refuse to permit us to obtain the refuse permit. It's a shame.")]
sents_tagged = nltk.pos_tag_sents(sents)
print(sents_tagged)

[[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN'), ('.', '.')], [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('shame', 'NN'), ('.', '.')]]


Some corpora come already tagged. If we read in the Brown Corpus in raw text format (rather than tokenized), we'll actually see pairs of tokens and tags, separated by a '/'.

In [23]:
from nltk.corpus import brown

news_raw = brown.raw('ca01').strip()
print(news_raw[:200])

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/c


We can read these tagged tokens in as a list of tuples using the corpus object's tagged_words() method:

In [24]:
news_tagged = brown.tagged_words('ca01')
print(news_tagged[:10])

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]


Different POS taggers use different tags (collectively a "tagset"). NLTK's `pos_tag()` function uses the Penn Treebank POS tagset. Good documentation can be found here: http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html.

The Brown Corpus is tagged with a different tagset, but you can see the similarities.

## Putting into Practice

### CHALLENGE: Task: Classifying Documents

Part-of-speech tags may be useful in document classification. For sentiment analysis in particular, we might care more about adjectives than about nouns and verbs, and probably much more than about articles or prepositions.

Let's modify the code for our classification task above to only send adjective tfidf values as features and get the most helpful features.
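As a possible starting point (a sketch, not the full solution), the hypothetical helper below keeps only the adjectives from a list of tagged tokens and joins them back into a string that could be passed to the vectorizer. The tagged input is written by hand for illustration; in the challenge, these pairs would come from `nltk.pos_tag`.

```python
def keep_adjectives(tagged_tokens):
    """Join the tokens whose Penn Treebank tag starts with 'JJ' (JJ, JJR, JJS)."""
    return ' '.join(tok for tok, tag in tagged_tokens if tag.startswith('JJ'))

# Hand-tagged example (illustrative only).
tagged = [('a', 'DT'), ('very', 'RB'), ('cool', 'JJ'), ('idea', 'NN'), (',', ','),
          ('but', 'CC'), ('a', 'DT'), ('very', 'RB'), ('bad', 'JJ'), ('package', 'NN')]
print(keep_adjectives(tagged))  # cool bad
```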

In [25]:
# Your code here

What does this result tell us about how important adjectives are to the sentiment of a movie review?

## 4) Phrase Chunking

We may want to work with larger segments of text than single words (but still smaller than a sentence). For instance, in the sentence "The black cat climbed over the tall fence", we might want to treat "The black cat" as one thing (the subject), "climbed over" as a distinct act, and "the tall fence" as another thing (the object). The first and third sequences are noun phrases, and the second is a verb phrase.

We can separate these phrases by "chunking" the sentence, i.e. splitting it into larger chunks than individual tokens. This is also an important step toward identifying entities, which are often represented by more than one word. You can probably imagine certain patterns that would define a noun phrase, using part of speech tags. For instance, a determiner (e.g. an article like "the") could be concatenated onto the noun that follows it. If there's an adjective between them, we can include that too.

To define rules about how to structure words based on their part of speech tags, we use a grammar (in this case, a "chunk grammar"). NLTK provides a RegexpParser that takes as input a grammar composed of regular expressions. The grammar is defined as a string, with one line for each rule we define. Each rule starts with the label we want to assign to the chunk (e.g. NP for "noun phrase"), followed by a colon, then an expression in regex-like notation that will be matched to tokens' POS tags.

We can define a single rule for a noun phrase like this. The rule allows 0 or 1 determiner, then 0 or more adjectives, and finally at least 1 noun. (By using 'NN.*' as the last POS tag, we can match 'NN', 'NNP' for a proper noun, or 'NNS' for a plural noun.) If a matching sequence of tokens is found, it will be labeled 'NP'.

In [26]:
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
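To build some intuition for how such a rule is matched against POS tags, here is a rough standard-library sketch of the same idea: write the tag sequence as one string, then search it with an ordinary regular expression. This is an illustration under simplifying assumptions, not NLTK's implementation, and `chunk_noun_phrases` is a hypothetical helper. The sentence is tagged by hand.

```python
import re

def chunk_noun_phrases(tagged):
    """Return the token sequences whose tags match <DT>?<JJ>*<NN.*>+."""
    # Write the tags as one string, e.g. '<DT><JJ><NN><VBD>...'
    tag_string = ''.join('<%s>' % tag for _, tag in tagged)
    pattern = re.compile(r'(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+')
    phrases = []
    for match in pattern.finditer(tag_string):
        start = tag_string[:match.start()].count('<')   # index of the first token matched
        length = match.group().count('<')               # number of tokens matched
        phrases.append(' '.join(tok for tok, _ in tagged[start:start + length]))
    return phrases

tagged = [('The', 'DT'), ('black', 'JJ'), ('cat', 'NN'), ('climbed', 'VBD'),
          ('over', 'IN'), ('the', 'DT'), ('tall', 'JJ'), ('fence', 'NN')]
print(chunk_noun_phrases(tagged))  # ['The black cat', 'the tall fence']
```

The verb and preposition match nothing in the pattern, so they stay outside the chunks.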

We create a chunk parser object by supplying this grammar, then use it to parse a sentence into chunks. The sentence we want to parse must already be POS-tagged, since our grammar uses those POS tags to identify chunks. Let's try this on a sentence from the Brown Corpus.

In [27]:
from nltk import RegexpParser

cp = RegexpParser(grammar)

sent = brown.sents()[100]
sent_tagged = nltk.pos_tag(sent)
sent_chunked = cp.parse(sent_tagged)

print(sent_chunked)

(S
  (NP Daniel/NNP)
  personally/RB
  led/VBD
  (NP the/DT fight/NN)
  for/IN
  (NP the/DT measure/NN)
  ,/,
  which/WDT
  he/PRP
  had/VBD
  watered/VBN
  down/RP
  considerably/RB
  since/IN
  its/PRP$
  (NP rejection/NN)
  by/IN
  two/CD
  (NP previous/JJ Legislatures/NNS)
  ,/,
  in/IN
  (NP a/DT public/JJ hearing/NN)
  before/IN
  (NP the/DT House/NNP Committee/NNP)
  on/IN
  (NP Revenue/NNP)
  and/CC
  (NP Taxation/NNP)
  ./.)


When we called print() on this chunked sentence, it printed out a nested list of nodes. Some are phrases (labeled 'NP'); others that didn't get chunked into a phrase are just the original tagged tokens (e.g. the verb 'led').

The chunked sentence is actually an NLTK tree object, as we can find out by calling type() on the output from the RegexpParser:

In [28]:
type(sent_chunked)

nltk.tree.Tree

The tree object has a number of methods we can use to interact with its components. For instance, we can use the method draw() to see a more graphical representation. This will open a separate window.

The tree is pretty flat, because we defined a grammar that only grouped words into non-overlapping noun phrases, with no additional hierarchy above them. This is sometimes referred to as "shallow parsing". We'll get to more complex parsing later.

In [29]:
sent_chunked.draw()

TclError: no display name and no $DISPLAY environment variable

If we want to move through the chunks and look at certain phrases, since the tree is essentially flat, we can use a 'for' loop to iterate through all of the nodes in the order they were printed above. Some of the nodes are themselves NLTK tree objects, containing the noun phrases we chunked. Other nodes are just tuples with a token and tag, that didn't make it into a chunk.

If a node is a tree object, it has a method label(), in this case marked 'NP'. It also has a method leaves() that will give us the list of tagged tokens (tuples) in the phrase. If we pull out the first token from each tuple, and concatenate these, we can get the original phrase back.

In [None]:
for node in sent_chunked:
    if isinstance(node, nltk.tree.Tree) and node.label() == 'NP':
        phrase = [tok for (tok, tag) in node.leaves()]
        print(' '.join(phrase))

## 5) Named Entity Recognition

Once we have noun phrases separated out, we might find it useful to figure out what categories of things these nouns refer to. Especially if the noun phrase is a proper noun, i.e. a name of something, we might be able to tell if it is the name of a person, an organization, a place, or some other thing. Labeling noun phrases as different types of named entities is called "Named Entity Recognition" or "NER".

Named Entity Recognition involves meaning (semantics) as well as grammar (syntax). The name of a person or an organization might appear in the same place in the exact same sentence, so we also have to know something about existing person and organization names to be able to tell them apart. For that reason, NER taggers are usually trained from labeled training data, using supervised machine learning techniques. NLTK comes with a pre-trained NER tagger we can use for general English text:

In [None]:
sent_nes = nltk.ne_chunk(sent_tagged)
print(sent_nes)

In [None]:
entities = {'ORGANIZATION':[], 'PERSON':[], 'LOCATION':[]}
for node in sent_nes:
    if isinstance(node, nltk.tree.Tree):
        phrase = [tok for (tok, tag) in node.leaves()]
        if node.label() in entities:
            entities[node.label()].append(' '.join(phrase))

for key, value in entities.items():
    print(key, value)

## 6) Parsing

Breaking down parts of a sentence and identifying their grammatical roles constitutes parsing. There are two main types of parsing in NLP: constituency parsing and dependency parsing. We'll just discuss dependency parsing for today.

### Dependency Parsing

Another way to parse sentences is to identify which words are syntactically dependent on other words, and what their dependency relationship is. Dependency parsing usually places the main verb of a sentence at the root of the tree, then assigns the verb's subject, direct object, and indirect objects as dependents. An indirect object will usually be connected to a root verb through a preposition. And nouns can have dependents too, which modify or are about some aspect of the noun.

> #### Prepositional phrase attachment
> 
> Dependency parsing is very complex; determining which words depend on which other words 
> involves not only part-of-speech tags, but other information that's more specific to each 
> verb or noun in the given sequence. Here's a classic example:
> 
> * "He ate pizza with olives."
> * "He ate pizza with a fork."
> 
> Which word in the sentence does the last word modify? In the first sentence, the olives are 
> on the pizza, so they modify the noun. Saying "He ate with olives" wouldn't make sense without 
> the pizza. In the second sentence, we aren't talking about a thing called "a pizza with a 
> fork", which doesn't make sense. The fork modifies the verb "ate": "He ate with a fork".

Because of these nuances, dependency parsers are usually built using extensive training data, in the form of "treebanks" of sentences annotated with dependency relations. Several major dependency parsers are available in pre-trained form for English language text. It is also possible to train open-source dependency parsers on other publicly available treebanks (such as from the Universal Dependencies project, which offers annotated treebanks in many languages).
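One way to write down the two readings from the pizza example is as (head, relation, dependent) triples. The triples below are written by hand for illustration, they are not parser output, and the relation labels are assumptions. The only difference that matters is which word 'olives' and 'fork' attach to.

```python
# Hand-written dependency triples (illustrative; relation labels are assumed).
olives = [('ate', 'nsubj', 'He'),
          ('ate', 'dobj', 'pizza'),
          ('pizza', 'nmod', 'olives'),   # the olives modify the noun
          ('olives', 'case', 'with')]
fork = [('ate', 'nsubj', 'He'),
        ('ate', 'dobj', 'pizza'),
        ('ate', 'nmod', 'fork'),         # the fork modifies the verb
        ('fork', 'case', 'with'),
        ('fork', 'det', 'a')]

def head_of(word, triples):
    """Return (head, relation) for the given dependent word, or None if it has no head."""
    for head, rel, dep in triples:
        if dep == word:
            return head, rel
    return None

print(head_of('olives', olives))  # ('pizza', 'nmod')
print(head_of('fork', fork))      # ('ate', 'nmod')
```

The part-of-speech sequences of the two sentences are nearly identical, so a parser cannot choose between these attachments from tags alone.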

Today, we'll work with the Stanford Parser, which is part of the Stanford CoreNLP toolkit. Stanford CoreNLP provides a number of state-of-the-art NLP tools and is widely used by computer scientists as well as social scientists and humanists. It is written in Java, but there are APIs that enable you to access some of the tools from Python. Several of the most popular tools can be used through NLTK.

### Stanford CoreNLP

To get started, you'll need to have downloaded the Stanford Parser from this website: http://nlp.stanford.edu/software/lex-parser.shtml#Download and unzipped it to a location on your computer that's easy to find (e.g. a folder called SourceCode in your Documents folder).

Then in Python, import the StanfordDependencyParser class from NLTK's parser package. You'll also need to import the module 'os' and set the following environment variables to the location on your computer where you put the unzipped Stanford Parser folder.

In [None]:
# !wget https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
# !unzip stanford-parser-full-2017-06-09.zip

In [30]:
import os
from nltk.parse.stanford import StanfordDependencyParser

os.environ['STANFORD_PARSER'] = 'stanford-parser-full-2017-06-09'
os.environ['STANFORD_MODELS'] = 'stanford-parser-full-2017-06-09'

Now let's create a dependency parser object and try parsing the first five sentences of the Brown Corpus, which come from a news report about an election investigation.

In [31]:
dependency_parser = StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
sents_parsed = dependency_parser.parse_sents(brown.sents()[:5])

LookupError: 

===========================================================================
  NLTK was unable to find stanford-parser\.jar! Set the CLASSPATH
  environment variable.

  For more information, on stanford-parser\.jar, see:
    <https://nlp.stanford.edu/software/lex-parser.shtml>
===========================================================================

The NLTK interface to the Stanford Parser returns an iterator (over iterators) over NLTK Dependency Graph objects. To be able to access the graph objects more than once, we can convert this into a list:

In [None]:
sents_parseobjs = [obj for sent in sents_parsed for obj in sent]

In [None]:
len(sents_parseobjs)

The graph object contains a method .tree() to depict the parse tree. (If we add .draw(), it will open in a separate window.)

In [None]:
sents_parseobjs[0].tree()

This tree shows us the dependencies (i.e. the arcs), but it doesn't show us the labeled dependency relations, which are a huge part of the value of dependency parsing. In other words, it shows us that "investigation" and "evidence" are both dependents of the verb "produced", but it doesn't show which was the subject and which the object of the action.

The method .triples() extracts dependency triples of the form: ((head word, head tag), rel, (dep word, dep tag)). So for every head word - dependent word pair, it will give us a triple, with the dependency relation label in between. (The method .triples() also returns an iterator; here we'll just use a for loop to print out each triple.)

In [None]:
for triple in sents_parseobjs[0].triples():
    print(triple)

This list of triples repeats a lot of head words, in order to capture all of their relations. Another format in which we can view the parse information is to convert it into CoNLL format. (CoNLL stands for the SIGNLL Conference on Computational Natural Language Learning; it organizes annual shared tasks relating to syntactic and semantic parsing.) The CoNLL formatted output is a string with one line for each word in the original sentence. The lines contain the word, its part-of-speech tag (two versions), the line number for the head word it is directly dependent on, and the label for that dependency relation.

In [None]:
print(sents_parseobjs[0].to_conll(10))
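If you want to work with the CoNLL string programmatically, it can be split back into fields. The sketch below assumes a 10-column, tab-separated layout with the line number in column 1, the word in column 2, a POS tag in column 5, the head's line number in column 7, and the relation label in column 8. The sample lines are written by hand rather than taken from the parser, so check the column positions against your own output.

```python
# Hand-written sample in the assumed 10-column layout (not real parser output).
conll = (
    "1\tHe\t_\tPRP\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tate\t_\tVBD\tVBD\t_\t0\troot\t_\t_\n"
    "3\tpizza\t_\tNN\tNN\t_\t2\tdobj\t_\t_\n"
)

def read_conll(text):
    """Return (word, tag, head word, relation) for each non-empty line."""
    rows = [line.split('\t') for line in text.splitlines() if line.strip()]
    words = {row[0]: row[1] for row in rows}   # line number -> word
    words['0'] = 'ROOT'                        # head 0 marks the root of the sentence
    return [(row[1], row[4], words[row[6]], row[7]) for row in rows]

for entry in read_conll(conll):
    print(entry)
```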

In [None]:
print(dir(sents_parseobjs[0]))