# Corpus and Lexicon

## Objectives
- Understanding: 
    - relation between corpus and lexicon
    - effects of pre-processing (tokenization) on lexicon
    
- Learning how to:
    - load basic corpora for processing
    - compute basic descriptive statistics of a corpus
    - build a lexicon and frequency lists from a corpus
    - perform basic lexicon operations
    - perform basic text pre-processing (tokenization and sentence segmentation) using python libraries

### Recommended Reading
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- Steven Bird, Ewan Klein, and Edward Loper. [__Natural Language Processing with Python__ (NLTK)](https://www.nltk.org/book/)

### Covered Material
- SLP
    - [Chapter 2: Regular Expressions, Text Normalization, Edit Distance](https://web.stanford.edu/~jurafsky/slp3/2.pdf) 
- NLTK 
    - [Chapter 2: Accessing Text Corpora and Lexical Resources](https://www.nltk.org/book/ch02.html)
    - [Chapter 3: Processing Raw Text](https://www.nltk.org/book/ch03.html)

### Requirements

- [NLTK](http://www.nltk.org/)
    - run `pip install nltk`
    
- [spaCy](https://spacy.io/)
    - run `pip install spacy`
    - run `python -m spacy download en_core_web_sm` to install the English language model (`spacy>=3.0`)

- [scikit-learn](https://scikit-learn.org/)
    - run `pip install scikit-learn`
    

## 1. Corpora and Counting

### 1.1. Corpus

A [corpus](https://en.wikipedia.org/wiki/Text_corpus) is a collection of written or spoken texts that is used for language research. Before doing anything with a corpus, we need to know its properties:

__Corpus Properties__:
- *Format* -- how to read/load it?
- *Language* -- which tools/models can I use?
- *Annotation* -- what is it intended for?
- *Split* for __Evaluation__: (terminology varies from source to source)

| Set         | Purpose                                       |
|:------------|:----------------------------------------------|
| Training    | training model, extracting rules, etc.        |
| Development | tuning, optimization, intermediate evaluation |
| Test        | final evaluation (remains unseen)             |
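
As a rough illustration of such a split, the sketch below divides a list of sentences into training, development, and test portions. The 80/10/10 proportions, the fixed seed, and the helper name `split_corpus` are assumptions made for this example; many corpora ship with their own predefined split, which should be preferred when available.

```python
import random

def split_corpus(sentences, train=0.8, dev=0.1, seed=0):
    """Shuffle sentences and cut them into train/dev/test lists (hypothetical helper)."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * train)
    n_dev = int(len(items) * dev)
    # whatever is left after train and dev becomes the test set
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]

toy_sentences = [f"sentence {i}" for i in range(10)]
train_set, dev_set, test_set = split_corpus(toy_sentences)
print(len(train_set), len(dev_set), len(test_set))
```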


#### 1.1.1. Text Corpora in NLTK
NLTK provides several corpora with loading functions. The plain text corpora used here come from _Project Gutenberg_.

`nltk.corpus.gutenberg.fileids()` lists available books.

In [1]:
import numpy as np

In [2]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\xdieg\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\xdieg\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

#### 1.1.2. Units of Text Corpus
Depending on the goal, a corpus can be seen as a sequence of:
- characters
- words (tokens)
- sentences
- paragraphs
- documents

Each level, in turn, can be seen as a sequence of elements of the previous level.

- word -- a sequence of characters
- sentence -- a sequence of words
- paragraph -- a sequence of sentences
- document -- a sequence of paragraphs (or sentences, depending on our purpose)
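
The nesting of these levels can be sketched with plain Python lists on a toy document. The segmentation rules used here (blank lines between paragraphs, `". "` between sentences, white space between words) are simplifying assumptions for the example, not how NLTK segments text.

```python
# Toy document: paragraphs separated by blank lines (an assumption of this sketch).
document = "Alice was tired. She saw a rabbit.\n\nThe rabbit ran. Alice followed."

paragraphs = document.split("\n\n")
# Naive segmentation: sentences end with ". ", words are separated by white space.
sentences = [[s.rstrip(".") for s in p.split(". ")] for p in paragraphs]
words = [[s.split() for s in para] for para in sentences]

print(len(paragraphs))       # paragraphs in the document
print(len(sentences[0]))     # sentences in the first paragraph
print(words[0][0])           # words of the first sentence
print(list(words[0][0][0]))  # characters of the first word
```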

#### 1.1.3. Loading NLTK Corpora

NLTK provides functions to load a corpus at these different levels: `raw` (characters), `words`, and `sents` (sentences).

In [4]:
alice_chars = nltk.corpus.gutenberg.raw('carroll-alice.txt')
print('chars:', alice_chars[0])
alice_words = nltk.corpus.gutenberg.words('carroll-alice.txt')
print('words:', alice_words[0])
alice_sents = nltk.corpus.gutenberg.sents('carroll-alice.txt')
print('sents:', alice_sents[0])

chars: [
words: [
sents: ['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']']


### 1.2. Corpus Descriptive Statistics (Counting)

A *corpus* can be described in terms of:

- total number of characters
- total number of words (_tokens_: includes punctuation, etc.)
- total number of sentences

- minimum/maximum/average number of characters per token
- minimum/maximum/average number of words per sentence
- minimum/maximum/average number of sentences per document


__Example__

$$\text{Avg. Sentence Length} = \frac{\text{count}(\text{tokens})}{\text{count}(\text{sentences})}$$


In [5]:
# let's compute average sentence length & round to the closest integer
round(len(alice_words)/len(alice_sents))

20

In [6]:
# let's compute length of each sentence
sent_lens = [len(sent) for sent in alice_sents]
# let's compute length of each word
word_lens = [len(word) for word in alice_words]
# let's compute the number of characters in each sentence
chars_lens = [len(''.join(sent)) for sent in alice_sents]

avg_sent_len = round(sum(sent_lens)/len(sent_lens))
min_sent_len = min(sent_lens)
max_sent_len = max(sent_lens)
print("AVG sent len", avg_sent_len)
print("MIN sent len", min_sent_len)
print("MAX sent len", max_sent_len)

AVG sent len 20
MIN sent len 2
MAX sent len 204


In [7]:
# JOIN built-in function example
tmp = ['H', 'e', 'l', 'l', 'o']
print(''.join(tmp))
print('⭐'.join(tmp))

Hello
H⭐e⭐l⭐l⭐o


#### Exercise

- Define a function to compute corpus descriptive statistics

    - input:
        - raw text (Chars)
        - words
        - sentences
    - output (print): 
        - average number of:
            - chars per word
            - words per sentence
            - chars per sentence
        - Size of the longest word and sentence


In [8]:
def statistics(chars, words, sents):
    word_lens = [len(w) for w in words]  # characters per word
    sent_lens = [len(s) for s in sents]  # words per sentence
    chars_in_sents = [len("".join(s)) for s in sents]  # characters per sentence (spaces excluded)
    
    word_per_sent = round(sum(sent_lens) / len(sent_lens))
    char_per_word = round(sum(word_lens) / len(word_lens))
    char_per_sent = round(sum(chars_in_sents) / len(chars_in_sents))
    
    longest_sentence = max(sent_lens)
    longest_word = max(word_lens)
    
    return word_per_sent, char_per_word, char_per_sent, longest_sentence, longest_word

word_per_sent, char_per_word, char_per_sent, longest_sent, longest_word = statistics(alice_chars, alice_words, alice_sents)

print('Word per sentence', word_per_sent)
print('Char per word', char_per_word)
print('Char per sentence', char_per_sent)
print('Longest sentence', longest_sent)
print('Longest word', longest_word)

Word per sentence 20
Char per word 3
Char per sentence 68
Longest sentence 204
Longest word 14


## 2. Lexicon

A [lexicon](https://en.wikipedia.org/wiki/Lexicon) is the *vocabulary* of a language. In linguistics, a lexicon is a language's inventory of lexemes.

Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalog of a language's words; and a grammar, a system of rules which allow for the combination of those words into meaningful sentences. 

*Lexicon (or Vocabulary) Size* is one of the statistics reported for corpora. While *Word Count* is the number of __tokens__, *Lexicon Size* is the number of __types__ (unique words).
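
The difference between tokens and types can be seen on a toy sentence. This is only a sketch: splitting on white space stands in for a real tokenizer.

```python
# White space splitting is a stand-in for real tokenization.
tokens = "the cat sat on the mat and the dog sat too".split()
types = set(tokens)  # unique tokens

print(len(tokens))  # word count (tokens)
print(len(types))   # lexicon size (types)
```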


### 2.1. Lexicon and Its Size

#### 2.1.1. Constructing Lexicon and Computing its Size

Since a lexicon is a collection of unique elements, it can be built as a `set` of corpus words (i.e. tokens).
Consequently, its size is the size of the set.

In [9]:
alice_lexicon = set(alice_words)
len(alice_lexicon)

3016

__NOTE__:
We did not process our corpus in any way. Consequently, words with case variations are different entries in our lexicon.

In [10]:
print('ALL' in alice_lexicon)
print('All' in alice_lexicon)
print('all' in alice_lexicon)

True
True
True


#### 2.1.2. Lowercased Lexicon
Let's lowercase our corpus and re-compute the lexicon size.

In [11]:
alice_lexicon = set([w.lower() for w in alice_words])
len(alice_lexicon)

2636

In [12]:
print('ALL' in alice_lexicon)
print('All' in alice_lexicon)
print('all' in alice_lexicon)

False
False
True


### 2.2. Frequency List

In Natural Language Processing (NLP), [a frequency list](https://en.wikipedia.org/wiki/Word_lists_by_frequency) is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

What is a "word"?

- case sensitive counts
- case insensitive counts

#### 2.2.1. Computing Frequency List with python

In Python, a frequency list can be constructed in several ways. The most convenient is `collections.Counter`.

In [13]:
from collections import Counter
alice_freq_list = Counter(alice_words)

In [14]:
print(alice_freq_list.get('ALL', 0))
print(alice_freq_list.get('All', 0))
print(alice_freq_list.get('all', 0))

4
5
173
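
The counts above are case sensitive. As a small sketch on toy data (the token list is invented for the example), the block below contrasts case-sensitive and case-insensitive counting, and shows how a rank can be derived from the position in the sorted list.

```python
from collections import Counter

tokens = ["All", "all", "ALL", "the", "all", "The", "the"]

case_sensitive = Counter(tokens)
case_insensitive = Counter(t.lower() for t in tokens)

# 'all' is counted separately from 'All' and 'ALL' unless we lowercase first.
print(case_sensitive["all"], case_insensitive["all"])

# Rank is the 1-based position in the list sorted by descending frequency.
for rank, (word, freq) in enumerate(case_insensitive.most_common(), start=1):
    print(rank, word, freq)
```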


#### 2.2.2. Computing Frequency List with NLTK
NLTK provides the `FreqDist` class to construct a frequency list (`FreqDist` == _Frequency Distribution_).

In [15]:
alice_freq_dist = nltk.FreqDist(alice_words)

In [16]:
print(alice_freq_dist.get('ALL', 0))
print(alice_freq_dist.get('All', 0))
print(alice_freq_dist.get('all', 0))

4
5
173


#### Exercise

- compute frequency list of __lowercased__ "alice" corpus (you can use either method)
- report the `5` most frequent words (you can use the provided `nbest` function to get a dict of the top N items)
- compare the frequencies to the reference values below

| Word   | Frequency |
|--------|----------:|
| ,      |     1,993 |
| '      |     1,731 |
| the    |     1,642 |
| and    |       872 |
| .      |       764 |


In [17]:
def nbest(d, n=1):
    """
    get n max values from a dict
    :param d: input dict (values are numbers, keys are strings)
    :param n: number of values to get (int)
    :return: dict of top n key-value pairs
    """
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

In [18]:
alice_lowercase_freq_list = Counter([w.lower() for w in alice_words])  # frequency list of the lowercased corpus
nbest(alice_lowercase_freq_list, n=5)  # 5 most frequent words

{',': 1993, "'": 1731, 'the': 1642, 'and': 872, '.': 764}
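
When the frequency list is a `Counter`, its `most_common` method gives the same top-N items as `nbest`. The sketch below checks this on a toy frequency list; `nbest` is copied from above so that the block runs on its own.

```python
from collections import Counter

def nbest(d, n=1):
    """Top n key-value pairs of a dict, sorted by descending value."""
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

toy_freq = Counter("a b a c a b d".split())

print(nbest(toy_freq, n=2))
print(dict(toy_freq.most_common(2)))  # same items via the Counter API
```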

### 2.3. Lexicon Operations

It is common to process the lexicon according to the task at hand (not every transformation makes sense for all tasks). The common operations are removing words by frequency (minimum or maximum, i.e. *Frequency Cut-Off*) and removing words from a specific list (i.e. *Stop Word Removal*).

#### 2.3.1. Frequency Cut-Off

##### Exercise

<!-- - define a function to compute a lexicon from a frequency list applying minimum and maximum frequency cut-offs
    
    - input: frequence list (dict)
    - output: list
    - use default values for min and max
     -->
- Using the function `cut_off`
    
    - compute lexicon applying:
    
        - minimum cut-off 2 (remove words that appear less than 2 times, i.e. remove [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon))
        - maximum cut-off 100 (remove words that appear more than 100 times)
        - both minimum and maximum thresholds together
        
    - report size for each comparing to the reference values in the table (on the lowercased lexicon)

| Operation  | Min | Max | Size |
|------------|----:|----:|-----:|
| original   | N/A | N/A | 2636 |
| cut-off    |   2 | N/A | 1503 |
| cut-off    | N/A | 100 | 2586 |
| cut-off    |   2 | 100 | 1453 |


In [19]:
def cut_off(vocab, n_min=1, n_max=float("inf")):
    """Return the words of a frequency dict whose count is within [n_min, n_max]."""
    new_vocab = []
    # vocab maps each word to its frequency
    for word, count in vocab.items():
        if n_min <= count <= n_max:
            new_vocab.append(word)
    return new_vocab

lower_bound = 2.0  # minimum cut-off; use 1 to disable it
upper_bound = 100.0  # maximum cut-off; use float("inf") to disable it
lexicon_cut_off = len(cut_off(alice_lowercase_freq_list, n_min=lower_bound, n_max=upper_bound))

print('Original', len(alice_lowercase_freq_list))
print('CutOFF Min:', lower_bound, 'MAX:', upper_bound, ' Lexicon Size:', lexicon_cut_off)

Original 2636
CutOFF Min: 2.0 MAX: 100.0  Lexicon Size: 1453
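
The three cut-off configurations from the table can be sketched on a toy frequency list. The counts below are invented, and this variant of `cut_off` assumes default bounds that keep every word (`n_min=1`, `n_max=float("inf")`), so that a missing bound corresponds to "N/A".

```python
from collections import Counter

def cut_off(vocab, n_min=1, n_max=float("inf")):
    """Keep words whose count lies in [n_min, n_max] (defaults keep everything)."""
    return [w for w, c in vocab.items() if n_min <= c <= n_max]

# Invented counts, for illustration only.
toy_freq = Counter({"the": 120, "alice": 40, "rabbit": 5, "hookah": 1})

print(len(cut_off(toy_freq)))                      # no cut-off
print(len(cut_off(toy_freq, n_min=2)))             # drops hapax legomena
print(len(cut_off(toy_freq, n_max=100)))           # drops very frequent words
print(len(cut_off(toy_freq, n_min=2, n_max=100)))  # both thresholds
```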


#### 2.3.2. StopWord Removal

In computing, [stop words](https://en.wikipedia.org/wiki/Stop_words) are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose.
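
For instance, a task-specific stop list can be applied with a simple membership test. The list and the tokens below are hypothetical and chosen only for illustration.

```python
# Hypothetical task-specific stop list; any set of words could be used instead.
custom_stop_words = {"the", "a", "of", "said"}

tokens = ["the", "queen", "said", "off", "with", "a", "head"]
# Keep only tokens that are not in the stop list.
content_tokens = [t for t in tokens if t not in custom_stop_words]
print(content_tokens)
```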

Let's check the stop word lists from the popular python libraries.

- spaCy
- NLTK
- scikit-learn

    
For NLTK we need to download them first

```python
import nltk
nltk.download('stopwords')
```

In [20]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\xdieg\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOP_WORDS
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as SKLEARN_STOP_WORDS
from nltk.corpus import stopwords

NLTK_STOP_WORDS = set(stopwords.words('english'))

print('spaCy: {}'.format(len(SPACY_STOP_WORDS)))
print('NLTK: {}'.format(len(NLTK_STOP_WORDS)))
print('sklearn: {}'.format(len(SKLEARN_STOP_WORDS)))


spaCy: 326
NLTK: 179
sklearn: 318


##### Exercise
- using Python's built-in `set` [methods](https://docs.python.org/2/library/stdtypes.html#set):
    - compute the intersection between the 100 most frequent words in the frequency list of the alice corpus and the list of stopwords (report count)
    - remove stopwords from the lexicon
    - print the size of:
        - original lexicon
        - lexicon without stopwords
        - overlap between 100 most freq. words and stopwords

| Operation       | Size |
|-----------------|-----:|
| original        | 2636 |
| no stop words   | 2490 |
| top 100 overlap |   65 |

In [22]:
# Set built-in Function
set_a = set(['a', 'b', 'c', 'd', 'e'])
set_b = set(['a', 'b', 'f'])

print(set_a.intersection(set_b)) # Compute overlap
print(set_a.difference(set_b)) # Remove Elements by computing the set diff

{'b', 'a'}
{'c', 'd', 'e'}


In [23]:
alice_vocab = set([w.lower() for w in alice_words])
top100 = list(nbest(alice_lowercase_freq_list, n=100).keys())

stop_words = NLTK_STOP_WORDS

# overlap: the most frequent tokens of the corpus (top100) that are also stop words
overlap = set(top100).intersection(stop_words)
# remove every stop word from the lexicon, not only those in the top 100
alice_vocab_no_stopwords = alice_vocab.difference(stop_words)

print('Original', len(alice_vocab))
print('No stopwords', len(alice_vocab_no_stopwords))
print('Top100 overlap', len(overlap))

Original 2636
No stopwords 2490
Top100 overlap 65


## 3. Basic Text Pre-processing

Both frequency cut-off and stop word removal are frequently used text pre-processing steps. Depending on the application, several other pre-processing steps are commonly applied to transform text for Machine Learning tasks.

__Text Normalization Steps__

- removing extra white spaces

- tokenization
    - documents to sentences (sentence segmentation/tokenization)
    - sentences to tokens

- lowercasing/uppercasing


- removing punctuation

- removing accent marks and other diacritics 

- removing stop words (see above)

- removing sparse terms (frequency cut-off)

- number normalization
    - numbers to words (e.g. `10` to `ten`)
    - number words to numbers (e.g. `ten` to `10`)
    - removing numbers

- verbalization (specifically for speech applications)

    - numbers to words
    - expanding abbreviations (or spelling out)
    - reading out dates, etc.
    

- [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation)
    - the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

- [stemming](https://en.wikipedia.org/wiki/Stemming)
    - the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
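
A few of these steps can be sketched with the standard library alone. The function name `normalize`, the selection of steps, and their order are assumptions made for this example; the right combination depends on the task.

```python
import re
import string
import unicodedata

def normalize(text):
    """Apply a few normalization steps; choice and order are illustrative only."""
    text = re.sub(r"\s+", " ", text).strip()    # collapse extra white space
    text = text.lower()                          # lowercase
    text = unicodedata.normalize("NFKD", text)   # separate letters from accent marks
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accent marks
    text = text.translate(str.maketrans("", "", string.punctuation))    # drop ASCII punctuation
    return text

print(normalize("  The   Café is OPEN,  isn't it?  "))
```

Note that removing punctuation turns `isn't` into `isnt`, which shows why the order and selection of steps matter for the resulting lexicon.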


### 3.1. Tokenization and Sentence Segmentation

Given a "clean" text, in order to allow any analysis, we need to identify its units.
In other words, we need to _segment_ the text into sentences and words.

__NOTE__:
Since both _tokenization_ and _sentence segmentation_ are automatic, different tools yield different results.
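
To see how much the choice of tokenizer matters, the sketch below applies two naive tokenizers to the same toy sentence. Neither is the tokenizer used by spaCy or NLTK; they only illustrate that the token count depends on the rules.

```python
import re

text = "Alice's cat didn't move."

# Tokenizer 1: split on white space only.
whitespace_tokens = text.split()
# Tokenizer 2: runs of word characters, or single non-space symbols.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```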

#### 3.1.1. Tokenization and Sentence Segmentation with spaCy
The default spaCy NLP pipeline does several processing steps including __tokenization__, *part of speech tagging*, lemmatization, *dependency parsing* and *Named Entity Recognition* (ignore the ones in *italics* for today). 


spaCy produces a `Doc` object that contains `Span`s (sentences) and `Token`s (words).

*Installing the model*
```bash
python -m spacy download en_core_web_sm
```

In [24]:
import spacy
nlp = spacy.load("en_core_web_sm")
# If spacy.load cannot find the model, load it as a package instead:
# import en_core_web_sm
# nlp = en_core_web_sm.load()
txt = alice_chars

In [25]:
# process the document
doc = nlp(txt, disable=["tagger", "ner"])



In [26]:
print("first token: '{}'".format(doc[0]))
print("first sentence: '{}'".format(list(doc.sents)[0]))

first token: '['
first sentence: '[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

'


In [27]:
# access list of tokens (Token objects)
print(len(doc))
# access list of sentences (Span objects)
print(len(list(doc.sents)))

37033
1558


#### 3.1.2. Tokenization and Sentence Segmentation with NLTK
NLTK's [tokenize](https://www.nltk.org/api/nltk.tokenize.html) package provides similar functionality using the methods below.

- `word_tokenize` 
- `sent_tokenize`

There are several tokenizers available (read the documentation for more information).

In [28]:
# download NLTK tokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\xdieg\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [29]:
alice_words_nltk = nltk.word_tokenize(alice_chars) # alice_chars is a book string
alice_sents_nltk = nltk.sent_tokenize(alice_chars)
print(len(alice_words_nltk))
print(len(alice_sents_nltk))

33494
1625


In [30]:
print("first token: '{}'".format(alice_words_nltk[0]))
print("first sentence: '{}'".format(alice_sents_nltk[0]))

first token: '['
first sentence: '[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I.'


## Lab Exercise
- Load another corpus from Gutenberg (e.g. `milton-paradise.txt`)
- Compute descriptive statistics on the __reference__ sentences and tokens.
- Compute descriptive statistics in the __automatically__ processed corpus
    - both with `spacy` and `nltk`
- Compute lowercased lexicons for all 3 versions of the corpus
    - compare lexicon sizes
- Compute frequency distribution for all 3 versions of the corpus
    - compare top N frequencies

In [31]:
milton_chars = nltk.corpus.gutenberg.raw('milton-paradise.txt')

**Descriptive statistics on reference sentences and tokens**

In [32]:
milton_words = nltk.corpus.gutenberg.words('milton-paradise.txt')
milton_sents = nltk.corpus.gutenberg.sents('milton-paradise.txt') # joined sentences =/= whole raw text

# Or: word_per_sent, char_per_word, char_per_sent, longest_sent, longest_word = statistics(milton_chars, milton_words, milton_sents)

word_lens = [len(w) for w in milton_words] # chars in words
sents_lens = [len(s) for s in milton_sents] # words in sents
chars_sents_lens = [len("".join(s)) for s in milton_sents] # chars in sents

word_per_sent = round(sum(sents_lens) / len(sents_lens))
char_per_word = round(len(milton_chars) / len(milton_words))  # NOTE: raw length includes white space, so this is higher than the mean of word_lens
char_per_sent = round(sum(chars_sents_lens) / len(milton_sents))
longest_sent = max(sents_lens)
longest_word = max(word_lens)

print('Word per sentence:\t', word_per_sent)
print('Char per word:\t\t', char_per_word)
print('Char per sentence:\t', char_per_sent)
print('Longest sentence len:\t', longest_sent)
print('Longest word len:\t', longest_word)

Word per sentence:	 52
Char per word:		 5
Char per sentence:	 203
Longest sentence len:	 533
Longest word len:	 16


In [33]:
# Comparing reference with automatic fns
print(f"1. Reference #corpus words: {len(milton_words)}")
print(f"1. Tokenized #corpus words: {len(nltk.word_tokenize(milton_chars))}\n")

print(f"2. Reference #corpus sents: {len(milton_sents)}")
print(f"2. Tokenized #corpus sents: {len(nltk.sent_tokenize(milton_chars))}")

1. Reference #corpus words: 96825
1. Tokenized #corpus words: 95716

2. Reference #corpus sents: 1851
2. Tokenized #corpus sents: 1835


**Descriptive statistics on NLTK sentences and tokens**

In [34]:
milton_words_nltk = nltk.word_tokenize(milton_chars)
milton_sents_nltk = nltk.sent_tokenize(milton_chars)

word_per_sent, char_per_word, char_per_sent, longest_sent, longest_word = statistics(milton_chars, milton_words_nltk, milton_sents_nltk)

print('Word per sentence:\t', word_per_sent)
print('Char per word:\t\t', char_per_word)
print('Char per sentence:\t', char_per_sent)
print('Longest sentence len:\t', longest_sent)
print('Longest word len:\t', longest_word)

Word per sentence:	 253
Char per word:		 4
Char per sentence:	 253
Longest sentence len:	 2658
Longest word len:	 18


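The NLTK figures are not directly comparable with the reference ones: `sent_tokenize` returns each sentence as a single string rather than a list of tokens, so `statistics` measures sentence length in characters. This is why "Word per sentence" and "Char per sentence" are both 253. The segmentation itself also differs from the reference, as the first sentence of each version shows: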

In [35]:
milton_sents_nltk[0]

"[Paradise Lost by John Milton 1667] \n \n \nBook I \n \n \nOf Man's first disobedience, and the fruit \nOf that forbidden tree whose mortal taste \nBrought death into the World, and all our woe, \nWith loss of Eden, till one greater Man \nRestore us, and regain the blissful seat, \nSing, Heavenly Muse, that, on the secret top \nOf Oreb, or of Sinai, didst inspire \nThat shepherd who first taught the chosen seed \nIn the beginning how the heavens and earth \nRose out of Chaos: or, if Sion hill \nDelight thee more, and Siloa's brook that flowed \nFast by the oracle of God, I thence \nInvoke thy aid to my adventurous song, \nThat with no middle flight intends to soar \nAbove th' Aonian mount, while it pursues \nThings unattempted yet in prose or rhyme."

In [36]:
milton_sents[0]

['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']']
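
To obtain word-level sentence statistics from sentences that are plain strings, each sentence has to be tokenized first. The sketch below shows the idea on toy data; the helper name `words_per_sentence` is hypothetical, and `str.split` is used only to keep the example free of dependencies (in the notebook, `nltk.word_tokenize` could be passed instead).

```python
def words_per_sentence(sentence_strings, tokenize):
    """Average number of tokens per sentence, for sentences given as strings."""
    tokenized = [tokenize(s) for s in sentence_strings]
    return sum(len(s) for s in tokenized) / len(tokenized)

# Toy stand-in for the output of a sentence segmenter that returns strings.
toy_sents = ["Of Man's first disobedience.", "Sing, Heavenly Muse."]

avg_words = words_per_sentence(toy_sents, str.split)
# Without tokenization, len() counts characters, not words.
avg_chars = sum(len(s) for s in toy_sents) / len(toy_sents)

print(avg_words)
print(avg_chars)
```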

**Descriptive statistics on SpaCy sentences and tokens**

In [37]:
# Statistics SpaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(milton_chars, disable=["tagger", "ner"])

# SpaCy words
milton_words_spacy = list(doc)

# SpaCy sentences: convert each Span into a list of token strings
milton_sents_spacy = [[w.orth_ for w in s] for s in doc.sents]

word_per_sent, char_per_word, char_per_sent, longest_sent, longest_word = statistics(milton_chars, milton_words_spacy, milton_sents_spacy)

print('Word per sentence:\t', word_per_sent)
print('Char per word:\t\t', char_per_word)
print('Char per sentence:\t', char_per_sent)
print('Longest sentence len:\t', longest_sent)
print('Longest word len:\t', longest_word)

Word per sentence:	 48
Char per word:		 4
Char per sentence:	 174
Longest sentence len:	 321
Longest word len:	 63


**Compute lowercased lexicons for all 3 versions of the corpus**

In [38]:
lex_milton = set([w.lower() for w in milton_words])
lex_milton_nltk = set([w.lower() for w in milton_words_nltk])
lex_milton_spacy = set([w.orth_.lower() for w in milton_words_spacy])

print(f"Reference milton-paradise lexicon size: \t{len(lex_milton)}")
print(f"NLTK milton-paradise lexicon size: \t\t{len(lex_milton_nltk)}")
print(f"SpaCy milton-paradise lexicon size: \t\t{len(lex_milton_spacy)}")

Reference milton-paradise lexicon size: 	9021
NLTK milton-paradise lexicon size: 		9280
SpaCy milton-paradise lexicon size: 		9144


**Compute frequency distribution for all 3 versions of the corpus**

In [39]:
milton_freqlist = nltk.FreqDist([w.lower() for w in milton_words])
milton_freqlist_nltk = nltk.FreqDist([w.lower() for w in milton_words_nltk])
milton_freqlist_spacy = nltk.FreqDist([w.orth_.lower() for w in milton_words_spacy])

In [40]:
nbest(milton_freqlist, n=5)

{',': 10198, 'and': 3395, 'the': 2968, ';': 2317, 'to': 2228}

In [41]:
nbest(milton_freqlist_nltk, n=5)

{',': 10228, 'and': 3391, 'the': 2964, ';': 2326, 'to': 2223}

In [42]:
# Here we notice spacy keeps line breaks!
nbest(milton_freqlist_spacy, n=5)

{'\n': 10425, ',': 10224, 'and': 3392, 'the': 2965, ';': 2303}
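
If line breaks should not count as words, white space tokens can be filtered out before counting. The sketch below uses a toy list of strings; with spaCy `Token` objects, the `is_space` attribute can play the same role as `str.isspace()` here.

```python
from collections import Counter

# Toy token list imitating a tokenizer that keeps line breaks as tokens.
tokens = ["Of", "Man", "\n", "and", "the", "\n", "and", " \n"]

# str.isspace() is True for strings made only of white space characters.
content = [t for t in tokens if not t.isspace()]
freq = Counter(content)

print(freq.most_common(1))
```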