# Text Processing

We will use the NLTK library to illustrate basic text processing functionality: tokenization, lemmatization, stop words, and so on.


In [67]:
import nltk

## Text Corpora

A text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres.  

Let's start by loading the Gutenberg corpus. Project Gutenberg is an electronic text archive that contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. NLTK includes a small selection of these texts. We begin by calling nltk.corpus.gutenberg.fileids() to list the file identifiers in this corpus:

[Reference](https://www.sketchengine.eu/gutenberg-corpora-2020/)

In [69]:
nltk.download('gutenberg')
nltk.download('punkt')
from nltk.corpus import gutenberg
gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /Users/pmui/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /Users/pmui/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

The first text is Emma by Jane Austen.  How many words does it contain?

In [70]:
emma = gutenberg.words('austen-emma.txt')
len(emma)

192427

In [71]:
emma

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

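Let's print summary statistics for each text in the Gutenberg corpus by looping over the file identifiers listed earlier. For each text we compute three ratios: the average word length (characters per word), the average sentence length (words per sentence), and the average number of times each vocabulary item appears (words per distinct lowercased word).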

In [72]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print("chars/word, words/sent, words/vocab")
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

chars/word, words/sent, words/vocab
5 25 26 austen-emma.txt
chars/word, words/sent, words/vocab
5 26 17 austen-persuasion.txt
chars/word, words/sent, words/vocab
5 28 22 austen-sense.txt
chars/word, words/sent, words/vocab
4 34 79 bible-kjv.txt
chars/word, words/sent, words/vocab
5 19 5 blake-poems.txt
chars/word, words/sent, words/vocab
4 19 14 bryant-stories.txt
chars/word, words/sent, words/vocab
4 18 12 burgess-busterbrown.txt
chars/word, words/sent, words/vocab
4 20 13 carroll-alice.txt
chars/word, words/sent, words/vocab
5 20 12 chesterton-ball.txt
chars/word, words/sent, words/vocab
5 23 11 chesterton-brown.txt
chars/word, words/sent, words/vocab
5 19 11 chesterton-thursday.txt
chars/word, words/sent, words/vocab
4 21 25 edgeworth-parents.txt
chars/word, words/sent, words/vocab
5 26 15 melville-moby_dick.txt
chars/word, words/sent, words/vocab
5 52 11 milton-paradise.txt
chars/word, words/sent, words/vocab
4 12 9 shakespeare-caesar.txt
chars/word, words/sent, words/vocab
4 12 8 
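The three ratios reported above can be wrapped in a small helper. The following is a minimal sketch on hand-made toy data rather than a corpus; the function name `text_stats` and the sample sentences are assumptions introduced for illustration.

```python
def text_stats(raw, words, sents):
    """Return (chars per word, words per sentence, words per vocabulary item)."""
    # Lowercase before building the vocabulary so 'The' and 'the' count once.
    vocab = set(w.lower() for w in words)
    return (len(raw) / len(words),
            len(words) / len(sents),
            len(words) / len(vocab))

# Toy stand-ins for gutenberg.raw(), gutenberg.sents() and gutenberg.words().
raw = "The cat sat. The dog sat."
sents = [['The', 'cat', 'sat', '.'], ['The', 'dog', 'sat', '.']]
words = [w for s in sents for w in s]

avg_word_len, avg_sent_len, lexical_diversity = text_stats(raw, words, sents)
print(avg_word_len, avg_sent_len, lexical_diversity)
```

Because the character count comes from the raw string, it includes spaces, while the word list includes punctuation tokens; the ratios are rough indicators rather than exact averages.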

The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including whitespace and punctuation.

In [73]:
len(gutenberg.raw('blake-poems.txt'))

38153

The sents() function divides the text up into its sentences, where each sentence is a list of words:

In [74]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sentences

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]

In [75]:
macbeth_sentences[1116]

['Double',
 ',',
 'double',
 ',',
 'toile',
 'and',
 'trouble',
 ';',
 'Fire',
 'burne',
 ',',
 'and',
 'Cauldron',
 'bubble']

In [76]:
longest_len = max(len(s) for s in macbeth_sentences)
longest_len

158
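Finding the longest sentence is a two-step pattern: compute the maximum length, then filter for sentences of that length. A minimal sketch on a toy list of tokenized sentences (the list itself is an assumption made for illustration) shows how this differs from calling `max` with `key=len`, which returns only the first longest item:

```python
# Toy tokenized sentences; two of them are tied for longest.
sentences = [
    ['Actus', 'Primus', '.'],
    ['Double', ',', 'double', ',', 'toile', 'and', 'trouble'],
    ['Fire', 'burne', ',', 'and', 'Cauldron', 'bubble', '.'],
    ['Exeunt', '.'],
]

# Step 1: length of the longest sentence.
longest_len = max(len(s) for s in sentences)

# Step 2: every sentence that reaches that length (keeps ties).
all_longest = [s for s in sentences if len(s) == longest_len]

# Alternative: a single longest sentence (the first one if there are ties).
first_longest = max(sentences, key=len)

print(longest_len, len(all_longest), first_longest[0])
```

The filtering approach is the safer choice when ties matter.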

In [77]:
[s for s in macbeth_sentences if len(s) == longest_len]

[['Doubtfull',
  'it',
  'stood',
  ',',
  'As',
  'two',
  'spent',
  'Swimmers',
  ',',
  'that',
  'doe',
  'cling',
  'together',
  ',',
  'And',
  'choake',
  'their',
  'Art',
  ':',
  'The',
  'mercilesse',
  'Macdonwald',
  '(',
  'Worthie',
  'to',
  'be',
  'a',
  'Rebell',
  ',',
  'for',
  'to',
  'that',
  'The',
  'multiplying',
  'Villanies',
  'of',
  'Nature',
  'Doe',
  'swarme',
  'vpon',
  'him',
  ')',
  'from',
  'the',
  'Westerne',
  'Isles',
  'Of',
  'Kernes',
  'and',
  'Gallowgrosses',
  'is',
  'supply',
  "'",
  'd',
  ',',
  'And',
  'Fortune',
  'on',
  'his',
  'damned',
  'Quarry',
  'smiling',
  ',',
  'Shew',
  "'",
  'd',
  'like',
  'a',
  'Rebells',
  'Whore',
  ':',
  'but',
  'all',
  "'",
  's',
  'too',
  'weake',
  ':',
  'For',
  'braue',
  'Macbeth',
  '(',
  'well',
  'hee',
  'deserues',
  'that',
  'Name',
  ')',
  'Disdayning',
  'Fortune',
  ',',
  'with',
  'his',
  'brandisht',
  'Steele',
  ',',
  'Which',
  'smoak',
  "'",
  'd',
 

## Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, etc.  A complete list of genres for the Brown Corpus can be found at: http://icame.uib.no/brown/bcm-los.html.

We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

In [78]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before doing the following:

In [79]:
news_text = brown.words(categories='news')

# let's find the frequency of words within a text
news_dist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', news_dist[m], end=' ')

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389 

In [80]:
five_w = ['what', 'when', 'where', 'who', 'why']
for f in five_w:
    print(f + ':', news_dist[f], end=' ')

what: 95 when: 169 where: 59 who: 268 why: 14 

In [81]:
fiction_text = brown.words(categories='fiction')
fiction_dist = nltk.FreqDist(w.lower() for w in fiction_text)
for m in modals:
    print(m + ':', fiction_dist[m], end=' ')

can: 39 could: 168 may: 10 might: 44 must: 55 will: 56 

In [82]:
for f in five_w:
    print(f + ':', fiction_dist[f], end=' ')

what: 186 when: 192 where: 89 who: 112 why: 42 
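Raw counts like these depend on how much text each genre contains, so a genre with more words will tend to show larger numbers. One common remedy is to report occurrences per 1,000 tokens. The sketch below uses `collections.Counter` on two tiny made-up token lists; `nltk.FreqDist` is a `Counter` subclass, so the same indexing should carry over, but the helper name `per_thousand` and the sample texts are assumptions for illustration.

```python
from collections import Counter

def per_thousand(tokens, targets):
    """Occurrences of each target word per 1,000 tokens (case-insensitive)."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    # Counter returns 0 for words that never occurred.
    return {w: 1000 * counts[w] / total for w in targets}

# Made-up stand-ins for brown.words(categories=...).
news_like = "The council will meet and will vote . It may adjourn .".split()
fiction_like = "She could not see what he could see .".split()

rates_news = per_thousand(news_like, ['will', 'could'])
rates_fiction = per_thousand(fiction_like, ['will', 'could'])
print(rates_news)
print(rates_fiction)
```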

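We would like to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. Note that the words are not lowercased here, so the counts can differ from the lowercased counts above: for example, "may" in news drops from 93 to 66 once capitalized forms such as "May" are no longer folded in.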

In [83]:
cfd = nltk.ConditionalFreqDist((genre, word)
                                for genre in brown.categories()
                                for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']

In [84]:
cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 


In [85]:
cfd.tabulate(conditions=genres, samples=five_w)

                 what  when where   who   why 
           news    76   128    58   268     9 
       religion    64    53    20   100    14 
        hobbies    78   119    72   103    10 
science_fiction    27    21    10    13     4 
        romance   121   126    54    89    34 
          humor    36    52    15    48     9 
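Conceptually, a conditional frequency distribution maps each condition (here, a genre) to its own frequency distribution. The following is a minimal sketch of that idea using only the standard library; it is not NLTK's implementation, and the `(genre, word)` pairs and the `tabulate` helper are made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs, shaped like the generator passed above.
pairs = [
    ('news', 'will'), ('news', 'can'), ('news', 'will'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'will'),
]

# One Counter per condition, created on first use.
cfd_sketch = defaultdict(Counter)
for genre, word in pairs:
    cfd_sketch[genre][word] += 1

def tabulate(table, conditions, samples):
    """Format counts as a small right-aligned text table."""
    width = max(len(c) for c in conditions)
    lines = [' ' * width + ''.join(f'{s:>6}' for s in samples)]
    for c in conditions:
        lines.append(f'{c:>{width}}' + ''.join(f'{table[c][s]:>6}' for s in samples))
    return '\n'.join(lines)

print(tabulate(cfd_sketch, ['news', 'romance'], ['can', 'could', 'will']))
```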


## WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We'll begin by looking at synonyms and how they are accessed in WordNet.

### Senses and Synonyms

Let's consider the words: "car", "motorcar", and "automobile":

In [86]:
from nltk.corpus import wordnet as wn
wn.synsets('motorcar')

[Synset('car.n.01')]

Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. The entity car.n.01 is called a synset, or "synonym set", a collection of synonymous words (or "lemmas"):

In [87]:
wn.synset('car.n.01').lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

Each word of a synset can have several meanings, e.g., car can also signify a train carriage, a gondola, or an elevator car. However, we are only interested in the single meaning that is common to all words of the above synset. Synsets also come with a prose definition and some example sentences:

In [88]:
wn.synset('car.n.01').definition()

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [89]:
wn.synset('car.n.01').examples()

['he needs a car to get to work']

Although definitions help humans to understand the intended meaning of a synset, the words of the synset are often more useful for our programs. To eliminate ambiguity, we will identify these words as car.n.01.automobile, car.n.01.motorcar, and so on. This pairing of a synset with a word is called a lemma. 

### Lemma

So a lemma, again, is the __pairing of a word with a synset__.

Let's see what we can do with a word's lemmas:
- get all the lemmas for a given synset
- look up a particular lemma
- get the synset corresponding to a lemma
- get the "name" of a lemma

In [90]:
wn.synset('car.n.01').lemmas()

[Lemma('car.n.01.car'),
 Lemma('car.n.01.auto'),
 Lemma('car.n.01.automobile'),
 Lemma('car.n.01.machine'),
 Lemma('car.n.01.motorcar')]

In [91]:
wn.lemma('car.n.01.automobile')

Lemma('car.n.01.automobile')

In [92]:
wn.lemma('car.n.01.automobile').synset()

Synset('car.n.01')

In [93]:
wn.lemma('car.n.01.automobile').name()

'automobile'
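The identifiers seen so far follow a visible naming pattern: a synset name such as car.n.01 combines a head word, a part-of-speech letter, and a sense number, and a lemma identifier appends the lemma's word. The sketch below only takes such strings apart; it does not call WordNet, the helper names are hypothetical, and it assumes the head word contains no period.

```python
def split_lemma_id(identifier):
    """Split '<word>.<pos>.<nn>.<lemma>' into (synset_name, lemma_name)."""
    # Assumption: the head word has no '.', so the first three fields
    # always form the synset name.
    word, pos, number, lemma_name = identifier.split('.', 3)
    return f'{word}.{pos}.{number}', lemma_name

def split_synset_name(name):
    """Split '<word>.<pos>.<nn>' into (word, pos, sense_number)."""
    word, pos, number = name.rsplit('.', 2)
    return word, pos, int(number)

print(split_lemma_id('car.n.01.automobile'))
print(split_synset_name('cable_car.n.01'))
```

In practice, `wn.lemma(...)` and `.synset()` remain the reliable way to move between the two.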

Unlike the word motorcar, which is unambiguous and has one synset, the word car is ambiguous, having five synsets:

In [94]:
wn.synsets('motorcar')

[Synset('car.n.01')]

In [95]:
wn.synsets('car')

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

For convenience, we can access all the lemmas involving the word car as follows.

In [96]:
wn.lemmas('car')

[Lemma('car.n.01.car'),
 Lemma('car.n.02.car'),
 Lemma('car.n.03.car'),
 Lemma('car.n.04.car'),
 Lemma('cable_car.n.01.car')]

### WordNet Hierarchy

WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event — these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific. A small portion of a concept hierarchy is illustrated here: nodes correspond to synsets; edges indicate the hypernym/hyponym relation, i.e. the relation between superordinate and subordinate concepts.

<img src="./images/wordnet-hierarchy.png" width=500px>

WordNet makes it easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific; the (immediate) hyponyms.


In [97]:
motorcar = wn.synset('car.n.01')

In [98]:
types_of_motorcar = motorcar.hyponyms()
types_of_motorcar

[Synset('ambulance.n.01'),
 Synset('beach_wagon.n.01'),
 Synset('bus.n.04'),
 Synset('cab.n.03'),
 Synset('compact.n.03'),
 Synset('convertible.n.01'),
 Synset('coupe.n.01'),
 Synset('cruiser.n.01'),
 Synset('electric.n.01'),
 Synset('gas_guzzler.n.01'),
 Synset('hardtop.n.01'),
 Synset('hatchback.n.01'),
 Synset('horseless_carriage.n.01'),
 Synset('hot_rod.n.01'),
 Synset('jeep.n.01'),
 Synset('limousine.n.01'),
 Synset('loaner.n.02'),
 Synset('minicar.n.01'),
 Synset('minivan.n.01'),
 Synset('model_t.n.01'),
 Synset('pace_car.n.01'),
 Synset('racer.n.02'),
 Synset('roadster.n.01'),
 Synset('sedan.n.01'),
 Synset('sport_utility.n.01'),
 Synset('sports_car.n.01'),
 Synset('stanley_steamer.n.01'),
 Synset('stock_car.n.01'),
 Synset('subcompact.n.01'),
 Synset('touring_car.n.01'),
 Synset('used-car.n.01')]

We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified as both a vehicle and a container.

In [99]:
motorcar.hypernyms()

[Synset('motor_vehicle.n.01')]

In [100]:
paths = motorcar.hypernym_paths()
len(paths)

2

In [101]:
[synset.name() for synset in paths[0]]

['entity.n.01',
 'physical_entity.n.01',
 'object.n.01',
 'whole.n.02',
 'artifact.n.01',
 'instrumentality.n.03',
 'container.n.01',
 'wheeled_vehicle.n.01',
 'self-propelled_vehicle.n.01',
 'motor_vehicle.n.01',
 'car.n.01']

In [102]:
[synset.name() for synset in paths[1]]

['entity.n.01',
 'physical_entity.n.01',
 'object.n.01',
 'whole.n.02',
 'artifact.n.01',
 'instrumentality.n.03',
 'conveyance.n.03',
 'vehicle.n.01',
 'wheeled_vehicle.n.01',
 'self-propelled_vehicle.n.01',
 'motor_vehicle.n.01',
 'car.n.01']

We can get the most general hypernyms (or root hypernyms) of a synset as follows:

In [103]:
motorcar.root_hypernyms()

[Synset('entity.n.01')]
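To see why a single synset can have more than one hypernym path, it helps to model the hierarchy as a mapping from each concept to its parents. The sketch below hard-codes the concepts from the two paths shown above (with the `.n.NN` suffixes dropped) and enumerates the paths recursively. It is a toy model, not how NLTK stores WordNet, and the names `HYPERNYMS` and `hypernym_paths` are chosen for illustration.

```python
# Each concept maps to its direct hypernyms (parents).
# wheeled_vehicle has two parents, which is what produces two paths.
HYPERNYMS = {
    'car': ['motor_vehicle'],
    'motor_vehicle': ['self-propelled_vehicle'],
    'self-propelled_vehicle': ['wheeled_vehicle'],
    'wheeled_vehicle': ['container', 'vehicle'],
    'container': ['instrumentality'],
    'vehicle': ['conveyance'],
    'conveyance': ['instrumentality'],
    'instrumentality': ['artifact'],
    'artifact': ['whole'],
    'whole': ['object'],
    'object': ['physical_entity'],
    'physical_entity': ['entity'],
    'entity': [],
}

def hypernym_paths(node):
    """Return every root-to-node path as a list of concept names."""
    parents = HYPERNYMS[node]
    if not parents:
        return [[node]]  # reached a root
    return [path + [node]
            for parent in parents
            for path in hypernym_paths(parent)]

paths = hypernym_paths('car')
for path in paths:
    print(len(path), ' > '.join(path))
```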

### More Lexical Relations

Hypernyms and hyponyms are called lexical relations because they relate one synset to another. These two relations navigate up and down the "is-a" hierarchy. Another important way to navigate the WordNet network is from items to their components (__meronyms__) or to the things they are contained in (__holonyms__). For example, the parts of a tree are its trunk, crown, and so on; these are returned by part_meronyms(). The substances a tree is made of include heartwood and sapwood; these are returned by substance_meronyms(). A collection of trees forms a forest; this is returned by member_holonyms():

In [None]:
wn.synset('tree.n.01').part_meronyms()

In [None]:
wn.synset('tree.n.01').substance_meronyms()

In [None]:
wn.synset('tree.n.01').member_holonyms()
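Meronyms and holonyms are two directions of the same relation: if a trunk is a part of a tree, then the tree is a whole that contains the trunk. A minimal sketch with plain dictionaries illustrates the inversion; the data is limited to the tree example from the text, and the `invert` helper is hypothetical rather than part of NLTK.

```python
from collections import defaultdict

def invert(relation):
    """Turn a whole -> parts mapping into a part -> wholes mapping."""
    inverse = defaultdict(list)
    for whole, parts in relation.items():
        for part in parts:
            inverse[part].append(whole)
    return dict(inverse)

# Whole -> parts, and group -> members, taken from the tree example.
part_meronyms = {'tree': ['trunk', 'crown']}
member_meronyms = {'forest': ['tree']}

part_holonyms = invert(part_meronyms)      # part -> wholes
member_holonyms = invert(member_meronyms)  # member -> groups

print(part_holonyms)
print(member_holonyms)
```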

To see just how intricate things can get, consider the word mint, which has several closely-related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.

In [None]:
for synset in wn.synsets('mint', wn.NOUN):
     print(synset.name() + ':', synset.definition())

In [None]:
wn.synset('mint.n.04').part_holonyms()

In [None]:
wn.synset('mint.n.04').substance_holonyms()

There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:

In [None]:
wn.synset('walk.v.01').entailments()

In [None]:
wn.synset('eat.v.01').entailments()

In [None]:
wn.synset('tease.v.03').entailments()

Some lexical relationships hold between lemmas, e.g., antonymy:

In [None]:
wn.lemma('supply.n.02.supply').antonyms()

In [None]:
wn.lemma('rush.v.01.rush').antonyms()

In [None]:
wn.lemma('horizontal.a.01.horizontal').antonyms()

In [None]:
wn.lemma('staccato.r.01.staccato').antonyms()

Additional methods of a synset can be viewed using dir():

In [None]:
dir(wn.synset('harmony.n.02'))

### Semantic Similarity

We have seen that synsets are linked by a complex network of lexical relations. Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term like vehicle will match documents containing specific terms like limousine.

Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common (see the WordNet hierarchy figure above). If two synsets share a very specific hypernym, one that is low down in the hypernym hierarchy, they must be closely related.

In [None]:
right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')
right.lowest_common_hypernyms(minke)

In [None]:
right.lowest_common_hypernyms(orca)

In [None]:
right.lowest_common_hypernyms(tortoise)

In [None]:
right.lowest_common_hypernyms(novel)
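The idea behind a lowest common hypernym can be sketched on a small tree: walk upward from one concept and stop at the first ancestor that the other concept also has. The hierarchy below is a simplified, hand-made stand-in with one parent per node; real WordNet has many more intermediate levels, and the names `PARENT`, `path_to_root`, and `lowest_common_hypernym` are assumptions for illustration.

```python
# Simplified toy hierarchy: child -> parent (None marks the root).
PARENT = {
    'right_whale': 'baleen_whale',
    'minke_whale': 'baleen_whale',
    'baleen_whale': 'whale',
    'orca': 'toothed_whale',
    'toothed_whale': 'whale',
    'whale': 'mammal',
    'mammal': 'vertebrate',
    'tortoise': 'reptile',
    'reptile': 'vertebrate',
    'vertebrate': 'animal',
    'animal': 'entity',
    'novel': 'artifact',
    'artifact': 'entity',
    'entity': None,
}

def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def lowest_common_hypernym(a, b):
    """First ancestor of a (searching upward, a included) shared with b."""
    ancestors_of_b = set(path_to_root(b))
    for node in path_to_root(a):
        if node in ancestors_of_b:
            return node

for other in ['minke_whale', 'orca', 'tortoise', 'novel']:
    print(other, '->', lowest_common_hypernym('right_whale', other))
```

The more distant the second concept, the higher in the tree the shared ancestor sits.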

Of course we know that whale is very specific (and baleen whale even more so), while vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:

In [None]:
wn.synset('baleen_whale.n.01').min_depth()

In [None]:
wn.synset('whale.n.02').min_depth()

In [None]:
wn.synset('vertebrate.n.01').min_depth()

In [None]:
wn.synset('entity.n.01').min_depth()

Similarity measures have been defined over the collection of WordNet synsets which incorporate the above insight. For example, path_similarity assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (in current NLTK releases, None is returned when no path can be found). Comparing a synset with itself will return 1. Consider the following similarity scores, relating right whale to minke whale, orca, tortoise, and novel. The absolute values are hard to interpret on their own, but they decrease as we move away from the semantic space of sea creatures to inanimate objects.

In [None]:
right.path_similarity(minke)

In [None]:
right.path_similarity(orca)

In [None]:
right.path_similarity(tortoise)

In [None]:
right.path_similarity(novel)
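NLTK computes path similarity as 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets. A minimal sketch of that formula on a simplified, hand-made tree shows why the score falls as concepts move apart. The hierarchy is a toy stand-in, not WordNet's actual structure, so the numbers will not match what the cells above return.

```python
# Simplified toy hierarchy: child -> parent (None marks the root).
PARENT = {
    'right_whale': 'baleen_whale',
    'minke_whale': 'baleen_whale',
    'baleen_whale': 'whale',
    'orca': 'toothed_whale',
    'toothed_whale': 'whale',
    'whale': 'mammal',
    'mammal': 'vertebrate',
    'tortoise': 'reptile',
    'reptile': 'vertebrate',
    'vertebrate': 'animal',
    'animal': 'entity',
    'novel': 'artifact',
    'artifact': 'entity',
    'entity': None,
}

def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def path_similarity(a, b):
    """1 / (edges on the shortest path between a and b, plus 1)."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # In a tree, the shortest path runs through the lowest shared ancestor.
    common = next(n for n in up_a if n in up_b)
    distance = up_a.index(common) + up_b.index(common)
    return 1 / (distance + 1)

scores = {other: path_similarity('right_whale', other)
          for other in ['right_whale', 'minke_whale', 'orca', 'tortoise', 'novel']}
for other, score in scores.items():
    print(f'{other:12} {score:.3f}')
```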