# Lecture 4.1 - NLP with NLTK

## Natural Language Processing

Many of the examples below are taken from the [NLTK book](http://www.nltk.org/book/). Before we start, we should install all the required material. Run the cell below to install the tools and corpora. This can take a minute...

In [None]:
import nltk
nltk.download('book')

# 1. Introduction to Python's Natural Language Toolkit (NLTK)

First, I demonstrate the power of the NLTK by inspecting some of the **prepared corpora** of this library. Later on, I show how you can build your own corpus, and unleash all the nice tools on **your own data**.

In the Digital Humanities, we often treat texts as *raw data*, as input for our programs. Interpretations arise from abstraction, for example, by counting word frequencies, analysing specific segments of a corpus (e.g. Key Word In Context, or KWIC analysis) or searching for patterns (e.g. collocations).
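
To make "abstraction by counting" concrete, here is a minimal sketch in plain Python. The toy sentence and the variable names are my own assumptions for illustration; a real project would load a full text instead.

```python
from collections import Counter

# A toy "corpus" (made up for this example).
text = "the grail is the quest and the quest is the grail"
tokens = text.split()

# Counter maps every distinct token to the number of times it occurs.
frequencies = Counter(tokens)

# The three most frequent tokens with their counts.
top_three = frequencies.most_common(3)
print(top_three)
```

The frequency table throws away word order entirely, which is exactly the kind of abstraction we rely on when reading at a distance.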

NLTK provides several tools for both **processing** data and **interpreting** texts.

Let's see what corpora NLTK provides by loading the `book` module from the library.

In [None]:
from nltk.book import *

`from nltk.book import *` says as much as "from NLTK's book module, load all items." This loads all the books that are processed in advance for further analysis.

From the above output, we discover that NLTK includes the script of 'Monty Python and the Holy Grail'. When we `print` text6 we (surprisingly) cannot see the actual content yet.

In [None]:
print(text6)

As a standard procedure, we should uncover the data type of the object we are dealing with.

In [None]:
print(type(text6))

In [None]:
dir(text6)

Let's print the first hundred tokens.

In [None]:
print(text6.tokens[:100])

We see that the text is already properly tokenized (i.e. words and punctuation marks are properly separated from each other).

But we can do more with this `Text` object. To view all the methods attached to this object, use Python's help function. You can ignore all those that start with a double underscore and scroll down.

In [None]:
help(nltk.text.Text)

Let's inspect some of the methods attached to the `Text` object.

### `.concordance()`

An oft-used technique for distant reading is **Key Word In Context (KWIC) analysis**, in which we centre a whole corpus on a specific word of interest. NLTK comes with a `concordance()` method that allows you to do just this. For example, how is the word 'grail' used in 'Monty Python and the Holy Grail'?

In [None]:
help(nltk.text.Text.concordance)

In [None]:
text6.concordance('grail')
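
To see what a concordance boils down to, here is a rough sketch in plain Python. The function `kwic` and its `width` parameter are hypothetical names of my own, not part of NLTK, and the toy sentence is made up.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, match, right context) for every hit of `keyword`."""
    hits = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively so 'Grail' and 'grail' both match.
        if token.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((' '.join(left), token, ' '.join(right)))
    return hits

tokens = "we seek the Holy Grail and the Grail is not here".split()
hits = kwic(tokens, 'grail', width=2)
for left, match, right in hits:
    # Right-align the left context so the keyword lines up in one column.
    print(f'{left:>15} [{match}] {right}')
```

The only real work is slicing the token list around each hit; the alignment in the printout is what makes the patterns visible to the eye.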

A more realistic research question would be: how have American presidents used 'democracy' in their Inaugural Addresses since 1861? Try to do this in the cell below.

In [None]:
text4.concordance('democracy')

We can specify the number of hits to print with the `lines` argument.

In [None]:
text4.concordance('democracy',lines=100)

What about 'monstrous' in Moby Dick?

In [None]:
text1.concordance('monstrous')

### --Exercise--

Compare the use of 'love' in Melville's Moby Dick to Jane Austen's Sense and Sensibility.

In [None]:
# Exercise

### `.similar()`

`concordance()` shows words in their context. For example, we saw that monstrous occurred in contexts such as the \_\_\_ pictures and a \_\_\_ size. What other words appear in a **similar context**? We can find out by applying the `.similar()` method to an NLTK text and entering the word you want to analyse within parentheses (don't forget to put the string between quotation marks):

In [None]:
help(text1.similar)

`.similar()` prints a list of semantically related words from one text (based on the intuition that words which share common contexts are related).

In [None]:
#text1: Moby Dick by Herman Melville 1851
text1.similar("monstrous")
print('\n')
#text2: Sense and Sensibility by Jane Austen 1811
text2.similar("monstrous")

Observe that we get **different results for different texts**. Austen uses this word quite differently from Melville; for her, monstrous has positive connotations and sometimes functions as an intensifier like the word very.
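
The intuition behind "words that share contexts" can be sketched in a few lines of plain Python. This is not NLTK's implementation: the helper names `contexts_by_word` and `share_contexts` are my own, and I simply count shared (previous word, next word) pairs in a made-up sentence.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (previous, next) word pairs around it."""
    contexts = defaultdict(set)
    # Skip the first and last token: they lack a left or right neighbour.
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return contexts

def share_contexts(tokens, target):
    """Rank other words by how many contexts they share with `target`."""
    contexts = contexts_by_word(tokens)
    target_contexts = contexts.get(target, set())
    scores = {}
    for word, ctx in contexts.items():
        shared = ctx & target_contexts
        if word != target and shared:
            scores[word] = len(shared)
    return sorted(scores, key=scores.get, reverse=True)

tokens = ("a monstrous size and a great size and "
          "a monstrous whale and a small boat").split()
similar_words = share_contexts(tokens, 'monstrous')
print(similar_words)
```

Here 'great' is found because it also occurs between 'a' and 'size'. Since the contexts come from one text only, the same word can have very different neighbours in another text.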

In [None]:
text5.similar("cool")
print('\n')
text3.similar("cool")

The method `common_contexts` allows us to **examine just the contexts** that are shared by two or more words, such as monstrous and very. We have to enclose these words in square brackets as well as parentheses, and separate them with a comma:

In [None]:
text1.common_contexts(["monstrous", "true"])

### --Exercise--

- Apply the `.similar()` and `.common_contexts()` methods to the Wall Street Journal (`text7`).

- What are the benefits/limits of this tool?

### `.dispersion_plot()`

We can also determine the **location** of a word in the text: how many words from the beginning it appears. This **positional information** can be displayed using a dispersion plot. Each **stripe** represents an instance of a word, and each **row** represents the entire text. The plot produced below shows some striking patterns of word usage over the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end). You might like to try more words (e.g., liberty, constitution), and different texts. Can you predict the dispersion of a word before you view it? As before, take care to get the quotes, commas, brackets and parentheses exactly right.

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
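
The information behind such a plot is simply the list of offsets at which each word occurs. A minimal sketch in plain Python, where `word_offsets` is a hypothetical helper of my own and the sentence is made up:

```python
def word_offsets(tokens, word):
    """Return the positions (counted from 0) at which `word` occurs."""
    return [i for i, token in enumerate(tokens) if token == word]

tokens = "freedom for citizens means duties for citizens and freedom for all".split()
for word in ['citizens', 'freedom', 'duties']:
    # Each offset would become one stripe in the row for this word.
    print(word, word_offsets(tokens, word))
```

Note that the comparison is case-sensitive in this sketch, so 'America' and 'america' would end up in different rows.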

### --Exercise--

Explore the thematic shifts in Genesis using the dispersion plot.

In [None]:
# Exercise

### `.collocations()`

A collocation is a **sequence of words that occur together unusually often**. Thus *red wine* is a collocation, whereas *the wine* is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds definitely odd.
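
Why is *red wine* a collocation and *the wine* not? One common way to quantify "unusually often" is pointwise mutual information (PMI), which compares how often a pair occurs with how often we would expect it by chance. The sketch below is my own simplification on a made-up sentence. NLTK relies on its own association measures, so do not expect identical rankings.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score every adjacent word pair by pointwise mutual information."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_pair = count / n_bigrams
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # PMI > 0 means the pair occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = ("we drank red wine then the wine ran out so the host brought "
          "the red wine and the bread and the cheese").split()
scores = pmi_scores(tokens)
print(round(scores[('red', 'wine')], 2))
print(round(scores[('the', 'wine')], 2))
```

Because 'the' occurs everywhere, its pairing with 'wine' is much less surprising than the pairing of 'red' with 'wine', and the scores reflect that.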

### --Exercise--

- Apply `.collocations()` to some of the prepared texts. Assess the outcomes and their usefulness for research.

In [None]:
# Exercise

### Back to the tokens

We still have not inspected the actual text. NLTK stores the tokens of a text in a list (a built-in data type we encountered earlier). Let's find out where this information is hidden.

In [None]:
dir(text1)

In [None]:
type(text1)

We can use index notation to retrieve the first 100 tokens.

In [None]:
print(text1[:100])

Let's determine the length of a text from start to finish, in terms of the **words and punctuation** symbols that appear--if you have a closer look at the output of the previous print statement, you'll see that it comprises punctuation marks as individual items. We use the function len to obtain the length of a list, which we'll apply here to the book of Moby Dick:

In [None]:
print(len(text1))

**Exercise**: Is "Sense and Sensibility" longer than "Moby Dick"?

In [None]:
len(text2) > len(text1)

A **token** is the technical name for a **sequence of characters** — such as hairy, his, or :) — that we want to **treat as a group** (or **unit** of our analysis). When we count the number of tokens in a text, say, the phrase "to be or not to be", we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of "to", two of "be", and one each of "or" and "not". 

In [None]:
sentence = 'to be or not to be'
tokens = sentence.split()
print(tokens)

But there are only **four distinct** vocabulary items in this phrase. In Python we can use the `set()` function to collect the unique words, also called *types*, and `len()` to count them.

In [None]:
types = set(tokens)
print(types)
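
Putting tokens and types together for the toy phrase gives the following sketch. The variable names are my own; the calculation is just the number of types divided by the number of tokens.

```python
sentence = 'to be or not to be'
tokens = sentence.split()
types = set(tokens)

n_tokens = len(tokens)   # every occurrence counts
n_types = len(types)     # every distinct word counts once
type_token_ratio = n_types / n_tokens

print(n_tokens, n_types, round(type_token_ratio, 2))
```

A ratio close to 1 means that almost every token is a new word, while a low ratio points to a lot of repetition.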

### --Exercise--

We can apply this to larger texts!

- How many tokens does the book of Genesis contain?
- How many distinct words does the book of Genesis contain? 
- Can you print the type-token ratio? What do you think it means?

In [None]:
# Exercise

### **Exercise**
Is the type-token ratio of Moby Dick greater than that of Sense and Sensibility?

In [None]:
# Exercise

## Intermezzo: Emotion Mining

## VADER Sentiment Analyzer
So far we have counted and located words; let's now have a look at the sentiment values of a text.

For this we use **VADER**.

[From GitHub](https://github.com/cjhutto/vaderSentiment): VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool.

VADER uses a lexicon (a mapping of words to sentiment values, e.g. bad=-1.0, good=+1.0) to compute the sentiment (positivity or negativity) of a text.
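
To get a feeling for the lexicon idea, here is a deliberately naive scorer in plain Python. It is not VADER's algorithm: `TOY_LEXICON` and `toy_sentiment` are made up for this illustration, and the values are arbitrary.

```python
# A tiny hand-made lexicon; the values are invented for this example.
TOY_LEXICON = {'good': 1.0, 'bad': -1.0, 'great': 2.0, 'awful': -2.0}

def toy_sentiment(text):
    """Sum the lexicon values of all words; unknown words count as 0."""
    words = text.lower().split()
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)

print(toy_sentiment('a good film with a great cast'))
print(toy_sentiment('not good'))
```

The second call shows the weakness of pure word lookup: 'not good' still gets a positive score. This is where the *rule-based* part of a tool like VADER comes in, for example to deal with negation.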

In [None]:
# we need to install the vader lexicon first
import nltk
nltk.download('vader_lexicon')

Now load the VADER sentiment analyzer:

In [None]:
from nltk.sentiment import vader
analyzer = vader.SentimentIntensityAnalyzer()

Below you can test VADER yourself by changing the value of the ``text`` variable, and running the code block. 

Can you trick the system? Not very easy, is it?!

In [None]:
text = "Not interesting."
sentiments_analysis = analyzer.polarity_scores(text)
print(sentiments_analysis)

We are interested here in the `compound` score, which combines the positive and negative sentiments into a single value. We can select it by putting the string 'compound' between square brackets:

In [None]:
sentiments_analysis['compound']
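
A compound score is easier to read once it is turned into a label. The sketch below uses cut-offs of plus and minus 0.05, a convention suggested in the VADER documentation; treat the threshold as an assumption you may want to tune. The function `label_compound` is a hypothetical helper of my own.

```python
def label_compound(compound, threshold=0.05):
    """Turn a compound score between -1 and 1 into a coarse label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

for score in [0.6, -0.3, 0.0]:
    print(score, label_compound(score))
```

You could apply it to the output above with `label_compound(sentiments_analysis['compound'])`.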

# 2. Working with your own corpus

## 2.1 Preprocessing: Tokenization

Up to this point, you might wonder: what if I want to investigate *other* texts? Of course, this is possible but requires some *preprocessing* steps. We have to **tokenize** a document stored on your computer or on the Web (which is just a sequence of characters) and convert it to an NLTK `Text` object.

Before we do this, let's inspect some of NLTK's preprocessing tools.

In the previous lecture, we already covered a few common preprocessing steps such as removing punctuation and lowercasing. Here we will take a slightly different route because NLTK takes care of many of the issues that required these steps.

### Sentence Tokenization

Often it is useful to process a text sentence by sentence, for example if we want to study the use of different words within their meaningful context.

In [None]:
from nltk.tokenize import sent_tokenize
book = 'This is a sentence. And this another one!'
sent_tokenize(book)
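
Why not simply split on full stops? The naive splitter below, my own sketch using a regular expression (the name `naive_sent_split` is hypothetical), works for the toy example but stumbles over abbreviations.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(naive_sent_split('This is a sentence. And this another one!'))
# The abbreviation 'Dr.' is wrongly treated as the end of a sentence.
print(naive_sent_split('Dr. Smith arrived. He was late.'))
```

Handling cases like this is the reason to prefer a dedicated sentence tokenizer over a home-made rule.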

### --Exercise-- 
How many sentences does Tolstoy's 'War and Peace' contain (approximately)?

In [None]:
import requests
data = requests.get('http://www.gutenberg.org/files/2600/2600-0.txt').text
# add your code here

We can now score the sentiment of each sentence:

In [None]:
for sent in sent_tokenize(data)[1000:1005]: # we only use a small subset of the sentences
    sent_sentiments_analysis = analyzer.polarity_scores(sent)
    print(sent)
    print(sent_sentiments_analysis['compound'])

### --Exercise--

Experiment with a snippet from The Guardian:
- tokenize by sentence with `sent_tokenize`
- compute emotion for each sentence using a `for` loop

In [None]:
# add your code here

### Word Tokenization

As alluded to earlier, 'tokens' are the minimal units for the machine to process. We often simply equated these with words, which, in turn, were defined as everything between two whitespaces, but the relationship is more complex. Luckily, NLTK comes with many ready-made tools for splitting strings into tokens.

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
sentence = "On the 12th of August, 18-- (just three days after my tenth birthday, when I had been given such wonderful presents), I was awakened at seven o’clock in the morning by Karl Ivanitch slapping the wall close to my head with a fly-flap made of sugar paper and a stick."

In [None]:
print(word_tokenize(sentence))

There is not one "correct" method for tokenizing texts. Therefore NLTK comes with many different tokenizers. What are their differences?

In [None]:
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize

In [None]:
print(sentence.split())

In [None]:
print(regexp_tokenize(sentence, pattern=r'\w+')) # raw string, so that \w is passed on to the regex engine untouched

In [None]:
print(wordpunct_tokenize(sentence))
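
The differences are easier to see on a short sentence. The sketch below mimics three strategies with plain Python and the `re` module. The last pattern is, as far as I know, the one behind `wordpunct_tokenize`, but treat that as an assumption and compare with the output above.

```python
import re

sample = "Don't panic, it's only a fly-flap."

# 1. Split on whitespace: punctuation stays glued to the words.
by_whitespace = sample.split()

# 2. Keep only runs of word characters: punctuation disappears.
word_chars_only = re.findall(r'\w+', sample)

# 3. Keep runs of word characters AND runs of punctuation as separate tokens.
words_and_punct = re.findall(r'\w+|[^\w\s]+', sample)

for name, result in [('whitespace', by_whitespace),
                     ('word characters', word_chars_only),
                     ('words + punctuation', words_and_punct)]:
    print(name, len(result), result)
```

The same sentence yields 6, 9 or 14 tokens depending on the strategy, so any count you report depends on the tokenizer you chose.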

### --Exercise--

- How many tokens are there in 'War and Peace'? How many unique words?
- Does the type of tokenization change these results?

## 2.2. Converting data to NLTK Text

To convert your own text to an NLTK corpus, you only need to tokenize the text. Let's download Tolstoy's "Childhood".

In [None]:
import nltk
import requests
url = 'http://www.gutenberg.org/files/2142/2142-0.txt'
text = requests.get(url).text
tokens = word_tokenize(text)
nltk_text = nltk.text.Text(tokens)
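
The same conversion works for any list of strings, which is handy for quick experiments without downloading anything. A small sketch on a toy phrase (my own example, tokenized with `split()` to keep it self-contained):

```python
from nltk.text import Text

tokens = 'to be or not to be that is the question'.split()
toy_text = Text(tokens)

print(len(toy_text))         # number of tokens
print(toy_text.count('be'))  # how often 'be' occurs
print(toy_text[:3])          # slicing returns the underlying tokens
```

Everything `Text` does is derived from this token list, so the quality of your tokenization determines the quality of all later analyses.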

Now you can apply all the NLTK methods to this book! Enjoy!

### --Exercise--
Apply the `concordance`, `similar`, `collocations` and `dispersion_plot` methods to this book (or another book of your choice, preferably one you are familiar with).

In [None]:
# Exercise

What to do with texts that are only stored on your own computer? The Python syntax is slightly more complicated here.

In [None]:
with open('./texts/katholicismeenfascisme.txt', 'r') as f: # the with-block closes the file for us
    text = f.read()

In [None]:
tokens = word_tokenize(text)

In [None]:
nltk_text_2 = nltk.text.Text(tokens)

In [None]:
nltk_text_2.concordance('fascisme')

# 3. Advanced Topics: Normalising and Enriching Text Data

### Stemming

Stemming, in its literal sense, amounts to cutting down the branches of a tree to its stem. Tokens, too, can be reduced to their stem. **Stemming is a crude, rule-based process by which we group together different variations of a token.** For example, the word 'eat' has variations like 'eating', 'eaten', 'eats', and so on. In some applications, when it does **not make sense to differentiate between 'eat' and 'eaten'**, we typically use stemming to reduce these grammatical variants to the root of the word.

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [None]:
pst = PorterStemmer()
print(pst.stem('loving'))
print(pst.stem('loved'))

Above, we create a stemmer object and apply the `stem()` method to a string.

In [None]:
tokens = word_tokenize('love loved loving flower flowers dogs dog')
stemmed = []
for token in tokens:
    stemmed.append(pst.stem(token))
print(stemmed)
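
What does "crude and rule-based" mean in practice? The toy stemmer below is my own sketch and far simpler than the Porter algorithm; the names `SUFFIXES` and `toy_stem` are hypothetical. It just chops off a few suffixes.

```python
SUFFIXES = ('ing', 'ed', 's')  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['loving', 'loved', 'flowers', 'dogs', 'dog', 'sing']
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)
```

Note that 'loving' and 'loved' end up as 'lov', which is not a word. A stem only has to be the same for all variants; it does not have to be readable.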

### **Exercise**

Download the Bible, stem it, and compute how often the stem 'love' appears as a percentage of all tokens. Compare this percentage with a non-stemmed version of the Bible.
> Tip: use the count method

> l = ['a','a','b','c']

> l.count('a')

In [None]:
bible_url = 'http://www.gutenberg.org/cache/epub/10/pg10.txt'

text_lower = requests.get(bible_url).text.lower() # download and lowercase the Bible

# complete exercise

### Lemmatization

Lemmatization is a more **methodical** way of converting all the **grammatical/inflected** forms of a word to its root. Lemmatization uses context and **part of speech** (see below) to determine the inflected form of the word and applies **different normalization** rules for each part of speech to get the root word.

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
wlem = WordNetLemmatizer()

In [None]:
print(wlem.lemmatize("flowers",pos='n'))
print(wlem.lemmatize("was",pos='v'))

In [None]:
print(wlem.lemmatize("run",pos='v'))
print(wlem.lemmatize("ran",pos='v'))

In [None]:
print(pst.stem('run'))
print(pst.stem('ran'))

In [None]:
print(wlem.lemmatize("mouse",pos='n'))
print(wlem.lemmatize("mice",pos='n'))

In [None]:
print(pst.stem('mouse'))
print(pst.stem('mice'))
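
The key difference with stemming is that a lemmatizer looks words up instead of chopping them, and that the answer can depend on the part of speech. The dictionary below is a hand-made toy of my own (`TOY_LEMMAS` and `toy_lemmatize` are hypothetical names), whereas WordNet contains many thousands of entries.

```python
# (word, part of speech) -> lemma; a tiny hand-made sample.
TOY_LEMMAS = {
    ('was', 'v'): 'be',
    ('ran', 'v'): 'run',
    ('mice', 'n'): 'mouse',
    ('saw', 'v'): 'see',
}

def toy_lemmatize(word, pos):
    """Look up the lemma; fall back to the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize('mice', 'n'))
print(toy_lemmatize('saw', 'v'))  # the verb: past tense of 'see'
print(toy_lemmatize('saw', 'n'))  # the noun: a tool, left unchanged
```

This is why the `pos` argument matters: without it, a form like 'saw' cannot be resolved reliably.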

## Syntactic Analysis

### Part of Speech Tagging

A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word:

In [None]:
from nltk import pos_tag
tokens = word_tokenize("And now for something completely different")
tagged = pos_tag(tokens)
tagged

We have seen how to iterate over a list. `pos_tag` returns a list of **tuples**, which we can **unpack** using the following notation:

In [None]:
first_element = tagged[0]
print('Tuple = ',first_element)
word,tag = first_element
print('Word = ',word)
print('Tag = ',tag)

In [None]:
for word,tag in tagged:
    print('This is a word',word)
    print('This is a tag',tag)

This is similar but slightly more elegant than:

In [None]:
for element in tagged:
    print('This is a word',element[0])
    print('This is a tag',element[1])
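
Tuple unpacking also works inside list comprehensions and generator expressions, which makes it easy to filter or count tags. The tagged list below is typed in by hand for illustration, so the sketch runs without a tagger.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the same shape as the output of pos_tag.
sample_tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
                 ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

# Count how often each tag occurs.
tag_counts = Counter(tag for word, tag in sample_tagged)

# Keep only the words tagged as adverbs.
adverbs = [word for word, tag in sample_tagged if tag == 'RB']

print(tag_counts.most_common(1))
print(adverbs)
```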

NLTK provides documentation for each tag, which can be queried using the tag, e.g. `nltk.help.upenn_tagset('RB')`. You'll find an overview of all the Part-of-Speech tags [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

In [None]:
# what does 'RB' mean
nltk.help.upenn_tagset('RB')

### --Exercise--

Collect all the nouns in "The Communist Manifesto" by Marx and Engels.
> Tip: use the `startswith()` method!

In [None]:
url = 'http://www.gutenberg.org/cache/epub/61/pg61.txt'

In [None]:
text = requests.get(url).text # load the text

tokens = word_tokenize(text) # tokenize the text

pos_tagged = pos_tag(tokens) # part of speech tag the tokenized text, this can take a while

In [None]:
print(pos_tagged[100:120]) # inspect the data

In [None]:
nouns = []
for word,tag in pos_tagged:
    # if tag.startswith ...
    # use append
    pass # placeholder so the cell runs; replace it with your own code

**Difficult Exercise**: Collect all [bigrams](https://en.wikipedia.org/wiki/Bigram) (sequences of two words) that start with an adjective and end with a noun.
> Tip: use index notation to look at neighbouring items: for a sequence `s`, `s[i]` and `s[i + 1]` are adjacent (this works for strings as well as lists).

### Named Entity Recognition

NLTK also has functions for more refined syntactic analysis and named entity recognition.

In [None]:
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "Mark and John are working at Google."
ne = ne_chunk(pos_tag(word_tokenize(sentence)))
ne
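
The result of `ne_chunk` is a tree in which named entities are subtrees carrying a label. To show how you could collect them, the sketch below builds a small tree by hand. The labels and tags are my own assumptions about the general shape, not actual output, so compare them with what the cell above gives you.

```python
from nltk.tree import Tree

# A hand-built tree: entities are labelled subtrees, other tokens are plain tuples.
toy_tree = Tree('S', [
    Tree('PERSON', [('Mark', 'NNP')]),
    ('and', 'CC'),
    Tree('PERSON', [('John', 'NNP')]),
    ('work', 'VBP'),
    ('at', 'IN'),
    Tree('ORGANIZATION', [('Google', 'NNP')]),
])

entities = []
for child in toy_tree:
    # Only subtrees are entities; ordinary tokens are (word, tag) tuples.
    if isinstance(child, Tree):
        name = ' '.join(word for word, tag in child.leaves())
        entities.append((child.label(), name))

print(entities)
```

The same loop should work on the `ne` object from the cell above, since it has the same structure.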