# Word tokenization and frequencies with NLTK

by Koenraad De Smedt at UiB

---

This notebook introduces [NLTK](https://www.nltk.org/), the Natural Language Toolkit. It shows how to use the toolkit to do the following:

1.  *Word-tokenize* a text, i.e. make a list of tokens (words but also punctuation) obtained from a text string.
2.  Compute the *vocabulary*, also called the *types*, i.e. the set of unique tokens.
3.  Make a *frequency distribution*, i.e. a counter of token occurrences.

In later notebooks, these techniques will be applied to larger texts read from the Web.

For those who want to know more about NLTK, there is an [online book](https://www.nltk.org/book/).

---

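The NLTK module provides several useful functions for manipulating text. See the [documentation](https://www.nltk.org/) if you want to know more. The tokenizer used below relies on the `punkt` models, which have to be downloaded once. Some recent NLTK releases may ask for the resource `punkt_tab` instead; in that case, download that resource.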

In [None]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize, FreqDist

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Let's use Shakespeare's sonnet 141 as an example text.

In [None]:
sonnet = '''In faith I do not love thee with mine eyes,
For they in thee a thousand errors note;
But 'tis my heart that loves what they despise,
Who, in despite of view, is pleased to dote.
Nor are mine ears with thy tongue's tune delighted;
Nor tender feeling, to base touches prone,
Nor taste, nor smell, desire to be invited
To any sensual feast with thee alone:
But my five wits nor my five senses can
Dissuade one foolish heart from serving thee,
Who leaves unswayed the likeness of a man,
Thy proud heart's slave and vassal wretch to be:
   Only my plague thus far I count my gain,
   That she that makes me sin awards me pain.'''

## Tokenization

NLTK provides the function `word_tokenize`, which splits a string into tokens and returns them as a list. This tokenizer is somewhat more sophisticated than the simple tokenizer from the previous notebook. Hyphenated words are kept together. Punctuation is split off, and the punctuation marks are included in the list as tokens of their own. Notice how *'tis* and *heart's* are tokenized: *'tis* becomes `'t` and `is`, and *heart's* becomes `heart` and `'s`.

In [None]:
tokens = word_tokenize(sonnet)
print(tokens)

['In', 'faith', 'I', 'do', 'not', 'love', 'thee', 'with', 'mine', 'eyes', ',', 'For', 'they', 'in', 'thee', 'a', 'thousand', 'errors', 'note', ';', 'But', "'t", 'is', 'my', 'heart', 'that', 'loves', 'what', 'they', 'despise', ',', 'Who', ',', 'in', 'despite', 'of', 'view', ',', 'is', 'pleased', 'to', 'dote', '.', 'Nor', 'are', 'mine', 'ears', 'with', 'thy', 'tongue', "'s", 'tune', 'delighted', ';', 'Nor', 'tender', 'feeling', ',', 'to', 'base', 'touches', 'prone', ',', 'Nor', 'taste', ',', 'nor', 'smell', ',', 'desire', 'to', 'be', 'invited', 'To', 'any', 'sensual', 'feast', 'with', 'thee', 'alone', ':', 'But', 'my', 'five', 'wits', 'nor', 'my', 'five', 'senses', 'can', 'Dissuade', 'one', 'foolish', 'heart', 'from', 'serving', 'thee', ',', 'Who', 'leaves', 'unswayed', 'the', 'likeness', 'of', 'a', 'man', ',', 'Thy', 'proud', 'heart', "'s", 'slave', 'and', 'vassal', 'wretch', 'to', 'be', ':', 'Only', 'my', 'plague', 'thus', 'far', 'I', 'count', 'my', 'gain', ',', 'That', 'she', 'that', 
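
To get a feeling for what splitting off punctuation involves, here is a minimal regular-expression sketch. It is not the algorithm used by `word_tokenize`; the pattern and the helper name `simple_tokenize` are assumptions made for illustration only.

```python
import re

# Simplified pattern: a run of word characters (optionally joined by hyphens),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Return word and punctuation tokens (hypothetical helper, not part of NLTK)."""
    return TOKEN_PATTERN.findall(text)

sample = "But 'tis my well-known heart's tune."
print(simple_tokenize(sample))
# ['But', "'", 'tis', 'my', 'well-known', 'heart', "'", 's', 'tune', '.']
```

Unlike `word_tokenize`, this sketch treats every apostrophe as a separate token, so it cannot produce clitic tokens such as `'t` or `'s`.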

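Use `casefold` if you want to convert everything to lowercase before tokenizing. This merges variants such as *But* and *but* into a single type, which usually gives more useful counts. On the other hand, it erases any distinction that capitalization may carry, for instance between a proper name and a common noun.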

In [None]:
tokens = word_tokenize(sonnet.casefold())
print(tokens)

['in', 'faith', 'i', 'do', 'not', 'love', 'thee', 'with', 'mine', 'eyes', ',', 'for', 'they', 'in', 'thee', 'a', 'thousand', 'errors', 'note', ';', 'but', "'t", 'is', 'my', 'heart', 'that', 'loves', 'what', 'they', 'despise', ',', 'who', ',', 'in', 'despite', 'of', 'view', ',', 'is', 'pleased', 'to', 'dote', '.', 'nor', 'are', 'mine', 'ears', 'with', 'thy', 'tongue', "'s", 'tune', 'delighted', ';', 'nor', 'tender', 'feeling', ',', 'to', 'base', 'touches', 'prone', ',', 'nor', 'taste', ',', 'nor', 'smell', ',', 'desire', 'to', 'be', 'invited', 'to', 'any', 'sensual', 'feast', 'with', 'thee', 'alone', ':', 'but', 'my', 'five', 'wits', 'nor', 'my', 'five', 'senses', 'can', 'dissuade', 'one', 'foolish', 'heart', 'from', 'serving', 'thee', ',', 'who', 'leaves', 'unswayed', 'the', 'likeness', 'of', 'a', 'man', ',', 'thy', 'proud', 'heart', "'s", 'slave', 'and', 'vassal', 'wretch', 'to', 'be', ':', 'only', 'my', 'plague', 'thus', 'far', 'i', 'count', 'my', 'gain', ',', 'that', 'she', 'that', 
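
For plain English text, `casefold` and `lower` give the same result. The difference shows up in other languages, as in this small sketch with a German word chosen for illustration.

```python
word = "Straße"

# lower() only maps uppercase letters to lowercase.
print(word.lower())     # straße

# casefold() is more aggressive and is meant for caseless comparison.
print(word.casefold())  # strasse

# Both spellings compare equal after case folding.
print("STRASSE".casefold() == word.casefold())  # True
```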

Make the vocabulary, i.e. the word types, by computing the set of unique tokens in the text.

In [None]:
vocab = set(tokens)
print(vocab)

{'to', 'man', 'pleased', 'any', 'far', 'base', 'the', 'prone', 'with', 'dissuade', 'sensual', 'ears', 'despite', 'thee', 'only', 'be', 'faith', 'they', 'i', 'one', 'feeling', 'gain', 'and', 'serving', 'wits', 'alone', 'is', 'are', 'love', "'s", 'wretch', 'thus', 'nor', 'from', 'heart', 'unswayed', 'mine', 'view', '.', 'in', 'taste', ';', 'she', 'desire', 'feast', 'that', 'vassal', 'tender', 'touches', 'slave', 'likeness', 'note', 'do', 'five', 'eyes', 'leaves', 'awards', 'plague', 'pain', 'for', 'smell', 'dote', 'not', ',', 'invited', 'thy', 'can', 'foolish', 'proud', 'tongue', 'what', 'delighted', 'thousand', 'errors', 'sin', 'a', ':', 'my', 'of', 'me', 'but', 'tune', 'despise', 'makes', "'t", 'loves', 'senses', 'who', 'count'}
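
A set has no defined order, so the types may be printed in a different order each time the notebook is run. If a reproducible listing is wanted, the set can be sorted into a list, as in this sketch with a small hand-made token list.

```python
# Toy tokens, made up for illustration.
toy_tokens = ["nor", "taste", ",", "nor", "smell", ","]

toy_vocab = set(toy_tokens)      # unique tokens, in arbitrary order
sorted_vocab = sorted(toy_vocab) # a list in a fixed (code point) order

print(sorted_vocab)
# [',', 'nor', 'smell', 'taste']
```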


Print the number of tokens in the text and the number of types in the vocabulary.

In [None]:
print(len(tokens), len(vocab))

138 89


Make a list of types with more than five characters.

In [None]:
print([word for word in vocab if len(word) > 5])

['pleased', 'dissuade', 'sensual', 'despite', 'feeling', 'serving', 'wretch', 'unswayed', 'desire', 'vassal', 'tender', 'touches', 'likeness', 'leaves', 'awards', 'plague', 'invited', 'foolish', 'tongue', 'delighted', 'thousand', 'errors', 'despise', 'senses']


## Distribution of word counts

In computational and corpus linguistics, the term *frequency* is used in two different ways:

*   *Absolute frequencies* are simply counts, that is, the number of times something occurs in a text or corpus.
*   *Relative frequencies* are proportions of the number of occurrences to a certain amount of text (such as the length of a given text, or a million words).

The NLTK class `FreqDist` computes the distribution of tokens in a text in terms of absolute frequencies. It is a kind of *counter*, which is a special *dict* in which each token is a key associated with its number of occurrences.

In [None]:
freq_dist = FreqDist(tokens)
freq_dist

FreqDist({',': 11, 'my': 5, 'to': 5, 'nor': 5, 'thee': 4, 'in': 3, 'with': 3, 'heart': 3, 'that': 3, 'i': 2, ...})

We can find the count of a token by using the token as a key.

In [None]:
freq_dist['heart']

3
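
An absolute frequency can be turned into a relative frequency by dividing it by the total number of tokens. The sketch below uses `collections.Counter` from the standard library and a small hand-made token list, both chosen for illustration; NLTK's `FreqDist` has the methods `N()` and `freq()` that serve the same purpose.

```python
from collections import Counter

# Toy tokens, made up for illustration.
toy_tokens = ["my", "heart", ",", "my", "eyes", ",", "my", "gain"]
counts = Counter(toy_tokens)

total = sum(counts.values())        # total number of tokens: 8
absolute = counts["my"]             # absolute frequency: 3
relative = absolute / total         # proportion of all tokens: 0.375
per_million = relative * 1_000_000  # normalized to a million tokens

print(absolute, relative, per_million)
# 3 0.375 375000.0
```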

The method `most_common` sorts the items by count, with the most common first. It returns a list of (token, count) tuples.

In [None]:
freq_list = freq_dist.most_common(10)
print(freq_list)

[(',', 11), ('my', 5), ('to', 5), ('nor', 5), ('thee', 4), ('in', 3), ('with', 3), ('heart', 3), ('that', 3), ('i', 2)]


Print the ten most common items in the list, with the counts.

In [None]:
for (item, freq) in freq_dist.most_common(10):
  print(freq, ':', item)

11 : ,
5 : my
5 : to
5 : nor
4 : thee
3 : in
3 : with
3 : heart
3 : that
2 : i
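
Several tokens above share the same count. In Python's `collections.Counter`, tied items come out of `most_common` in the order in which they were first seen; the sketch assumes that `FreqDist` inherits this behaviour. If a ranking that does not depend on text order is preferred, one option is to sort by descending count and then alphabetically.

```python
from collections import Counter

# Toy tokens, made up for illustration.
toy_tokens = ["to", "my", "nor", "my", "to", "nor", "thee"]
counts = Counter(toy_tokens)

# Ties keep the order of first occurrence.
print(counts.most_common(3))
# [('to', 2), ('my', 2), ('nor', 2)]

# Sort by descending count, then alphabetically by token.
ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
print(ranked[:3])
# [('my', 2), ('nor', 2), ('to', 2)]
```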


NLTK also provides lists of stopwords, i.e. very frequent function words that are often filtered out before further analysis.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_eng = stopwords.words('english')
print(stop_eng[:12])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll"]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


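Compute the set of words in `vocab` which are not stopwords by using the minus sign `-` for set difference. Note that the result still contains punctuation tokens, and also archaic forms such as *thee* and *thy*, which are not in the list of modern English stopwords.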

In [None]:
non_stop = vocab - set(stop_eng)
print(non_stop)
len(non_stop)

{'slave', 'likeness', 'note', 'five', 'proud', 'tongue', 'gain', 'man', 'serving', 'wits', 'delighted', 'pleased', 'eyes', 'touches', 'leaves', 'thousand', 'errors', 'alone', 'awards', 'sin', 'plague', 'far', 'base', 'love', "'s", 'wretch', 'thus', ':', 'pain', 'heart', 'smell', 'prone', 'unswayed', 'mine', 'view', 'dissuade', '.', 'taste', 'dote', 'count', 'sensual', 'tune', 'ears', ';', 'despite', 'thee', 'desire', 'despise', ',', 'makes', 'feast', 'faith', 'invited', "'t", 'thy', 'loves', 'vassal', 'foolish', 'senses', 'one', 'tender', 'feeling'}


62
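
Set difference works on types, so the counts are lost. To count only the non-stopwords, the token list itself can be filtered with a comprehension. The sketch below uses a tiny hand-picked stopword set and a toy token list, both assumptions for illustration; in the notebook one would use `stop_eng` and `tokens` instead.

```python
from collections import Counter

# Hypothetical mini stopword set; NLTK's English list is much longer.
toy_stop = {"but", "my", "nor", "can"}
toy_tokens = ["but", "my", "five", "wits", "nor", "my", "five", "senses", "can"]

# Set difference: types only, no counts, no order.
print(sorted(set(toy_tokens) - toy_stop))
# ['five', 'senses', 'wits']

# Comprehension: repeated tokens are kept, so they can still be counted.
content_tokens = [tok for tok in toy_tokens if tok not in toy_stop]
content_counts = Counter(content_tokens)
print(content_counts.most_common(1))
# [('five', 2)]
```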

## Languages other than English

NLTK supports some other languages, but the coverage of some forms, such as elisions, is not complete. Consider the following lines from [Kindertotenlieder](https://oxfordsong.org/song/kindertotenlieder), which contain the elided forms *ew’ge* and *künft’gen*, and check how these are tokenized.

In [None]:
verses = '''Du mußt nicht die Nacht in dir verschränken,
Mußt sie ins ew’ge Licht versenken!
[...]
Was dir nur Augen sind in diesen Tagen:
In künft’gen Nächten sind es dir nur Sterne.'''

gtokens = word_tokenize(verses, language='german')
print(gtokens)
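
If elided forms such as *ew’ge* turn out to be split at the apostrophe, one possible workaround, independent of NLTK, is a regular expression that allows an apostrophe inside a word. This is only a sketch; the pattern is an assumption and has not been tuned for other kinds of text.

```python
import re

# Word characters, optionally continued after a straight or typographic
# apostrophe (so that "ew’ge" stays whole), or a single punctuation character.
ELISION_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

line = "Mußt sie ins ew’ge Licht versenken!"
elision_tokens = ELISION_PATTERN.findall(line)
print(elision_tokens)
# ['Mußt', 'sie', 'ins', 'ew’ge', 'Licht', 'versenken', '!']
```

The trade-off is that English forms such as *heart's* would also be kept as one token, so this pattern suits the German verses better than the sonnet.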

### Exercises

1.   Compute the lexical variation, i.e. the ratio of types to tokens, of the sonnet (or another text).
2.   Print the nine most common tokens with their counts, and also print the length of each token on the same line.
3.   Instead of printing in the previous exercise, use a comprehension to make a list of triples containing the count, the item and the item's length.
4.   Print the five most common tokens with at least three characters. The easiest way is to first use a comprehension that keeps only the tokens with at least three characters, and then make a frequency distribution of that list.
5.   (optional) For some purposes, one wants a list containing only words, excluding tokens that consist of just punctuation marks. What needs to be done to keep only words?
> Tip: if you `import string`, you can use the variable `string.punctuation` which has all punctuation marks. So you can write a function that checks if all characters of a token are in `string.punctuation` or not. Then you can use that function in a comprehension over all tokens.