# Lecture 4.1 - NLP with NLTK

## Natural Language Processing

Natural language is the language used for everyday communication.
Natural Language Processing (NLP) is the computational manipulation of natural language in any form.
We have already covered some aspects of it in the previous lectures.
Language technologies are prevalent in everyday life.
In the Humanities, too, we often deal with texts: we extract and organize information from them.


## Introduction to NLTK: Basic text analytics

First we demonstrate the power of this module by inspecting some of the prepared corpora that NLTK provides. Later we show how you can build your own corpus, and unleash all the nice tools on your own data.

Many of the examples below are taken from the [NLTK book](http://www.nltk.org/book/). Before we start, we should install all the required material. Run the cell below to install the tools and corpora. This can take a minute...

In [None]:
import nltk
nltk.download('book')

In the Digital Humanities we often treat text as *raw data*, that is, as input for our programs. Interpretation arises from abstraction, for example from counting word frequencies. This is a radically different approach from the close reading of texts.

Let's see what corpora NLTK provides us with.

In [None]:
from nltk.book import *

`from nltk.book import *` says as much as "from NLTK's book module, load all items."

Nice! We can see that the list includes the script of 'Monty Python and the Holy Grail' (`text6`). But if we print `text6`, we do not get the actual content yet.

In [None]:
print(text6)

As a standard procedure, we should check the data type of these objects.

In [None]:
print(type(text6))

From what we learned previously, we might have expected these books to be strings. Instead we see a totally new type, the NLTK `Text`. As we will see below, converting your text to this data type has many benefits: it facilitates the 'distant reading' of your corpus. To explain how, we discuss some popular methods below.

An oft-used technique for distant reading is Keyword In Context (KWIC) analysis, in which we center a whole corpus on specific words of interest. NLTK comes with a `concordance()` method that allows you to do just this. For example, how is the word 'grail' used in Monty Python and the Holy Grail?

In [None]:
text6.concordance('grail')

A more realistic research question would be: how have American presidents used the word 'democracy' in their inaugural addresses? Try to do this in the cell below.

In [None]:
text4.concordance('democracy')

Or the word 'monstrous' in Moby Dick?

In [None]:
text1.concordance('monstrous')
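
To make the idea behind Keyword In Context analysis concrete, the following is a minimal sketch in plain Python. It is not how NLTK implements `concordance()`; the function name `kwic`, its `window` parameter and the toy sentence are made up for illustration.

```python
def kwic(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens of context on both sides."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}")
    return lines


# Toy example (hypothetical data, not taken from an NLTK corpus)
toy_tokens = "we seek the grail and the grail is hidden".split()
for line in kwic(toy_tokens, "grail"):
    print(line)
```

Each printed line places the keyword in brackets with its neighbouring tokens to the left and right, which is the same kind of view that `concordance()` offers.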

[B]A concordance permits us to see words in context. For example, we saw that monstrous occurred in contexts such as the \_\_\_ pictures and a \_\_\_ size . What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses:

In [None]:
text1.similar("monstrous")
text2.similar("monstrous")

[B]Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.

In [None]:
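# similar() prints its results itself and returns None,
# so wrapping it in print() would only add a line saying "None"
text5.similar("cool")
print()
text1.similar("cool")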

[B]The term common_contexts allows us to examine just the contexts that are shared by two or more words, such as monstrous and very. We have to enclose these words by square brackets as well as parentheses, and separate them with a comma:

In [None]:
text2.common_contexts(["monstrous", "very"])
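
What counts as a "context" here can be sketched in plain Python: for every occurrence of a word, take the pair of its left and right neighbour, and intersect these sets for two words. This is a simplified illustration under that assumption, not NLTK's actual implementation; the helper `contexts` and the toy tokens are hypothetical.

```python
def contexts(tokens, word):
    """Collect the (previous token, next token) pairs around every occurrence of word.

    Occurrences at the very start or end of the list are skipped,
    because they lack a neighbour on one side.
    """
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


# Toy example (made-up tokens)
toy_tokens = "a monstrous pretty face and a very pretty face and a very large size".split()
shared = contexts(toy_tokens, "monstrous") & contexts(toy_tokens, "very")
print(shared)
```

In this toy text both words occur between `a` and `pretty`, so that pair is the only shared context.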

[B]It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text. The plot produced by the cell below shows some striking patterns of word usage over the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end). You might like to try more words (e.g., liberty, constitution), and different texts. Can you predict the dispersion of a word before you view it? As before, take care to get the quotes, commas, brackets and parentheses exactly right.

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
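
The positional information behind such a plot is simply the list of token offsets at which a word occurs. A minimal sketch, using a hypothetical helper `offsets` and made-up tokens rather than the inaugural corpus:

```python
def offsets(tokens, word):
    """Return the positions (counted in tokens from the start) where word occurs."""
    return [i for i, token in enumerate(tokens) if token == word]


# Toy example (made-up tokens)
toy_tokens = "freedom and duties of citizens , freedom for all citizens".split()
for word in ["citizens", "freedom", "democracy"]:
    print(word, offsets(toy_tokens, word))
```

A dispersion plot draws one stripe per offset, with one row per word; a word that never occurs gets an empty row.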

Still, we have not accessed the actual text. NLTK represents these texts as a list (a built-in data type we encountered earlier). Let's find out where this information is hidden.

In [None]:
dir(text1)

This overview shows quite a lot of attributes and methods associated with the NLTK `Text` object. We will come back to these later, but with respect to the previous question (where is the text hidden?), `tokens` seems a good candidate. What type does this attribute have?

In [None]:
type(text1.tokens)

`text1.tokens` returns a list, something we are familiar with by now. So let's print the first hundred tokens of Moby Dick.

In [None]:
print(text1.tokens[:100])

How many tokens does Moby Dick comprise? [B] Let's determine the length of a text from start to finish, in terms of the words and punctuation symbols that appear--if you have a closer look at the output of the previous print statement, you'll see that it comprises punctuation marks as individual items. We use the function len to obtain the length of a list, which we'll apply here to the book of Moby Dick:

In [None]:
print(len(text1.tokens))

[B]A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to be or not to be, we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of to, two of be, and one each of or and not. But there are only four distinct vocabulary items in this phrase. How many distinct words does the book of Genesis contain? To work this out in Python, we have to pose the question slightly differently. The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary items of text3 with the command: set(text3). When you do this, many screens of words will fly past. Now try the following:

In [None]:
len(set(text3.tokens))

[B]Although it has 44,764 tokens, this book has only 2,789 distinct words, or "word types." A word type is the form or spelling of the word independently of its specific occurrences in a text — that is, the word considered as a unique item of vocabulary. Our count of 2,789 items will include punctuation symbols, so we will generally call these unique items types instead of word types.
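
The difference between tokens and types can be checked on the example phrase itself. A small sketch in plain Python (the variable names are chosen for this illustration):

```python
phrase = "to be or not to be"
tokens = phrase.split()   # whitespace splitting is enough for this phrase
types = set(tokens)       # a set keeps each distinct item only once

print(len(tokens))        # number of tokens
print(len(types))         # number of types
print(sorted(types))
```

The phrase has six tokens but only four types, because `to` and `be` each occur twice.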

Now, let's calculate a measure of the lexical richness of the text. The next example shows us that, for Genesis (`text3`), the number of distinct words is just 6% of the total number of words, or equivalently that each word is used 16 times on average.

In [None]:
len(set(text3.tokens)) / len(text3.tokens)

[B] Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:

In [None]:
print(100 * text1.count('whale') / len(text1))
print(100 * text3.count('whale') / len(text3))

[B] You may want to repeat such calculations on several texts, but it is tedious to keep retyping the formula. Instead, you can come up with your own name for a task, like "lexical_diversity" or "percentage", and associate it with a block of code. Now you only have to type a short name instead of one or more complete lines of Python code, and you can re-use it as often as you like. The block of code that does a task for us is called a function, and we define a short name for our function with the keyword def. The next example shows how to define two new functions, lexical_diversity() and  percentage():

In [None]:
def lexical_diversity(text):
    # ratio of distinct items (types) to all items (tokens)
    return len(set(text)) / len(text)

def percentage(count, total):
    # count expressed as a percentage of total
    return 100 * count / total
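
As a quick check of how such functions are called, here is a small self-contained sketch that repeats the two definitions and applies them to a made-up token list instead of an NLTK text:

```python
def lexical_diversity(text):
    # ratio of distinct items to all items
    return len(set(text)) / len(text)


def percentage(count, total):
    # count expressed as a percentage of total
    return 100 * count / total


# Toy example (made-up tokens)
toy_tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]
print(lexical_diversity(toy_tokens))
print(percentage(toy_tokens.count("the"), len(toy_tokens)))
```

Once the NLTK texts are loaded, the same calls can be made on them, for example `lexical_diversity(text3)` or `percentage(text4.count('a'), len(text4))`.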


Exercise: Which text has the highest lexical diversity?

## Preprocessing

Up to this point, you might wonder: what if I want to investigate *other* texts? Of course, this is possible, but it requires some *preprocessing* steps to transform a document on your disk or on the Web (which is just a sequence of characters) into an NLTK `Text` object.

In the previous lecture we already covered a few common preprocessing steps, such as removing punctuation and lowercasing. Here we will take a slightly different route, because NLTK takes care of many of the issues that required these steps.

### Tokenization

As alluded to earlier, 'tokens' are the minimal units for the machine to process. We often simply equated these with words (which, in turn, were defined as everything between two whitespaces), but the relation is more complex. Luckily, NLTK comes with many ready-made tools for splitting strings into tokens.

In [3]:
from nltk.tokenize import word_tokenize

In [1]:
sentence = "On the 12th of August, 18-- (just three days after my tenth birthday, when I had been given such wonderful presents), I was awakened at seveno’clock in the morning by Karl Ivanitch slapping the wall close to my head with a fly-flap made of sugar paper and a stick."

In [5]:
print(word_tokenize(sentence))

['On', 'the', '12th', 'of', 'August', ',', '18', '--', '(', 'just', 'three', 'days', 'after', 'my', 'tenth', 'birthday', ',', 'when', 'I', 'had', 'been', 'given', 'such', 'wonderful', 'presents', ')', ',', 'I', 'was', 'awakened', 'at', 'seveno', '’', 'clock', 'in', 'the', 'morning', 'by', 'Karl', 'Ivanitch', 'slapping', 'the', 'wall', 'close', 'to', 'my', 'head', 'with', 'a', 'fly-flap', 'made', 'of', 'sugar', 'paper', 'and', 'a', 'stick', '.']


In [7]:
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize

In [11]:
print(sentence.split())

['On', 'the', '12th', 'of', 'August,', '18--', '(just', 'three', 'days', 'after', 'my', 'tenth', 'birthday,', 'when', 'I', 'had', 'been', 'given', 'such', 'wonderful', 'presents),', 'I', 'was', 'awakened', 'at', 'seveno’clock', 'in', 'the', 'morning', 'by', 'Karl', 'Ivanitch', 'slapping', 'the', 'wall', 'close', 'to', 'my', 'head', 'with', 'a', 'fly-flap', 'made', 'of', 'sugar', 'paper', 'and', 'a', 'stick.']


In [9]:
# raw string, so that \w reaches the regex engine without an invalid-escape warning
print(regexp_tokenize(sentence, pattern=r'\w+'))

['On', 'the', '12th', 'of', 'August', '18', 'just', 'three', 'days', 'after', 'my', 'tenth', 'birthday', 'when', 'I', 'had', 'been', 'given', 'such', 'wonderful', 'presents', 'I', 'was', 'awakened', 'at', 'seveno', 'clock', 'in', 'the', 'morning', 'by', 'Karl', 'Ivanitch', 'slapping', 'the', 'wall', 'close', 'to', 'my', 'head', 'with', 'a', 'fly', 'flap', 'made', 'of', 'sugar', 'paper', 'and', 'a', 'stick']


In [10]:
print(wordpunct_tokenize(sentence))

['On', 'the', '12th', 'of', 'August', ',', '18', '--', '(', 'just', 'three', 'days', 'after', 'my', 'tenth', 'birthday', ',', 'when', 'I', 'had', 'been', 'given', 'such', 'wonderful', 'presents', '),', 'I', 'was', 'awakened', 'at', 'seveno', '’', 'clock', 'in', 'the', 'morning', 'by', 'Karl', 'Ivanitch', 'slapping', 'the', 'wall', 'close', 'to', 'my', 'head', 'with', 'a', 'fly', '-', 'flap', 'made', 'of', 'sugar', 'paper', 'and', 'a', 'stick', '.']
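
The outputs above differ mainly in how punctuation and hyphens are treated: `split()` leaves punctuation attached to words, the pattern `\w+` drops it and breaks up `fly-flap`, and `wordpunct_tokenize` separates every run of punctuation. If none of these fits, a custom pattern is an option. The sketch below uses Python's standard `re.findall`, which for a simple pattern like this should give the same matches as `regexp_tokenize`; the pattern itself and the shortened sample sentence are only illustrative assumptions.

```python
import re

# one or more word characters, optionally joined by hyphens, OR a single punctuation mark
pattern = r"\w+(?:-\w+)*|[^\w\s]"

sample = "I was awakened by a fly-flap made of sugar paper."
print(re.findall(pattern, sample))
```

With this pattern `fly-flap` stays in one piece while the final full stop becomes a token of its own.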


### Stemming

Stemming, in its literal sense, amounts to cutting down the branches of a tree to its stem. Tokens can likewise be reduced to their stem. Stemming is a crude, rule-based process by which we group together different variations of a token. [B2] For example, the word eat will have variations like eating, eaten, eats, and so on. In some applications, as it does not make sense to differentiate between eat and eaten, we typically use stemming to reduce both grammatical variants to the root of the word.

In [13]:
from nltk.stem import PorterStemmer

In [15]:
pst = PorterStemmer()
print(pst.stem('loving'))
print(pst.stem('loved'))

love
love


[B2] Above, we created a stemmer object and applied its `stem()` method to a string. The same method can be applied to every token of our sentence:

In [16]:
tokens = word_tokenize(sentence)
stems = [pst.stem(w) for w in tokens]
print(stems)

['On', 'the', '12th', 'of', 'august', ',', '18', '--', '(', 'just', 'three', 'day', 'after', 'my', 'tenth', 'birthday', ',', 'when', 'I', 'had', 'been', 'given', 'such', 'wonder', 'present', ')', ',', 'I', 'wa', 'awaken', 'at', 'seveno', '’', 'clock', 'in', 'the', 'morn', 'by', 'karl', 'ivanitch', 'slap', 'the', 'wall', 'close', 'to', 'my', 'head', 'with', 'a', 'fly-flap', 'made', 'of', 'sugar', 'paper', 'and', 'a', 'stick', '.']
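
To illustrate what "crude, rule-based" means, here is a deliberately naive suffix stripper. It is a toy written for this explanation, not the Porter algorithm, and its rule list is an arbitrary assumption.

```python
# (suffix, replacement) rules, tried in order; the first match wins
RULES = [("ing", ""), ("ed", ""), ("s", "")]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix, replacement in RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:len(token) - len(suffix)] + replacement
    return token


for word in ["eating", "eats", "eaten", "loved", "was"]:
    print(word, "->", toy_stem(word))
```

Note how `loved` becomes `lov`, which is not a word, and how `eaten` is left untouched. The Porter stemmer has many more rules, but as the output above shows (`wa`, `morn`), it makes similar trade-offs.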


### Lemmatization

Lemmatization is a more methodical way of converting all the grammatical/inflected forms of a word to its root. Lemmatization uses context and part of speech to determine the inflected form of the word and applies different normalization rules for each part of speech to get the root word. Note that NLTK's `WordNetLemmatizer` does not infer the part of speech itself: it treats every word as a noun unless a `pos` argument is given.

In [29]:
from nltk.stem.wordnet import WordNetLemmatizer

In [30]:
wlem = WordNetLemmatizer()

In [35]:
wlem.lemmatize("was")
#print(pst.stem("ate"))

'wa'

### From File to Text

In [37]:
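import nltk
# Read the whole file and split it into tokens
# (encoding='utf-8' is an assumption about how the file was saved)
with open('data/Tolstoy-Childhood.txt', 'r', encoding='utf-8') as doc:
    tokens = word_tokenize(doc.read())

# Wrap the token list in an NLTK Text object
text = nltk.text.Text(tokens)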

In [38]:
text.concordance('nose')

Displaying 13 of 13 matches:
 . Karl Ivanitch sneezed , wiped his nose , flicked his fingers , and began am
his call . Karl , with spectacles on nose and a book in his hand , was sitting
 had slipped down his large aquiline nose , and the blue , half-closed eyes an
ied as I caressed her and kissed her nose , “ we are going away today . Good-b
mall and perpetually twinkling , his nose large and aquiline , his lips irregu
eginning to stand out on my brow and nose . My ears were burning , I trembled 
hat no human being with such a large nose , such thick lips , and such small g
 , a turned-up , strongly pronounced nose , very bright red lips ( which , nev
iously , as well as of twitching his nose and eyebrows . Consequently every on
from mine , she scratched her little nose with her glove ! All this I can see 
 was left visible but the tip of her nose . Indeed , I could see that , if her
ss her fingers and eyes and lips and nose and feet -- kiss all of her. ” “ How
d , Natalia Savishna , 

Exercise: **to do**

### Stop word and rare word removal

**to do**

## Enrichment

Topics:
- Part-of-Speech Tagging
- Named Entity Recognition
- Information Extraction: NER and Relation Extraction

### Regular Expressions

**to do**