# 8. Tokenisation

## Introduction

Quantitative analyses of texts are often based initially on counts of the smaller linguistic units that occur within texts, such as words, sentences, or paragraphs. The process of dividing a text into such smaller components is referred to as tokenisation.

Word tokenisation generally takes place on the basis of the spaces that occur in between the words. The individual words that are found in a text are called “**tokens**”. When this full list is deduplicated, leaving only the unique words, the items in the resulting list are called “**types**”.
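
The difference between tokens and types can be illustrated with a minimal sketch that uses only Python's built-in `str.split()` rather than `nltk`. The sample sentence and the variable names are assumptions made for this illustration.

```python
# A rough illustration: split on whitespace, then deduplicate
sample = 'the cat sat on the mat and the dog sat too'

tokens = sample.split()   # every occurrence of a word counts as a token
types = set(tokens)       # a set keeps each distinct word only once

print(len(tokens))   # number of tokens
print(len(types))    # number of types
```

The word 'the' occurs three times and 'sat' twice, so the number of types is lower than the number of tokens.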

This tutorial explains how you can tokenise a text using the `nltk` package, a toolkit that enables you to work with texts in natural languages.

To make use of this package, you first need to install it, if it is not yet available on your computer, and then import it.

In [9]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable


In [3]:
import nltk


The methods that are discussed in this tutorial make use of a number of additional resources which are not installed by default. If you have never used `nltk` before, you need to run the code below to install these resources.

In [10]:
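import ssl
import nltk

# Workaround for machines on which certificate verification fails during the
# download. Note that this switches off verification for the whole session.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Download the resources used in this tutorial.
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'; if
# word_tokenize() raises a LookupError, download the resource it names.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('wordnet')
nltk.download('sentiwordnet')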

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\s2929392\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\s2929392\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\s2929392\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping help\tagsets.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\s2929392\AppData\Roaming\nltk_data...
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\s2929392\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\sentiwordnet.zip.


True

## Word tokenisation 

The `nltk.tokenize` module has a method named `word_tokenize()`. This method requires a string as input, which may be a sentence or a whole paragraph. When it is run, the method returns a Python list containing all the individual tokens found in the string that is provided. As mentioned, the individual words are identified mainly on the basis of the spaces found in this string.

In [3]:
from nltk.tokenize import word_tokenize

quote = '''
It was the best of times, 
it was the worst of times'''

words = word_tokenize(quote)

print(words)

['It', 'was', 'the', 'best', 'of', 'times', ',', 'it', 'was', 'the', 'worst', 'of', 'times']


When you look closely at the output of the code above, you can see that the `word_tokenize()` method treats punctuation marks as separate tokens. In the example above, the comma following the first occurrence of the word 'times' is actually a separate item in the list. 

The function `remove_punctuation`, defined below, can be used to remove the tokens that consist of punctuation only. As input, the function requires a list of tokens. For each string in this list, the function tests whether it consists solely of alphanumeric characters, using the `isalnum()` method. The function returns only those tokens which pass this test.

In [43]:
def remove_punctuation(words):
    # Keep only the tokens that consist entirely of letters and digits
    new_list = []
    for w in words:
        if w.isalnum():
            new_list.append(w)
    return new_list

This function can be used as follows:

In [5]:
words = remove_punctuation(words)
print(words)

['It', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
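
Note that `isalnum()` returns `False` for any token containing a character that is neither a letter nor a digit, so hyphenated words and clitics such as `n't` are discarded as well. The sketch below shows this effect on a hand-written token list and contrasts it with a more lenient variant. `remove_punctuation_lenient` is a hypothetical helper introduced for this illustration and is not part of `nltk`. `remove_punctuation` is restated in compact form so that the block can be run on its own.

```python
def remove_punctuation(words):
    # Same behaviour as the function defined above
    return [w for w in words if w.isalnum()]

def remove_punctuation_lenient(words):
    # Hypothetical variant: keep a token as soon as it contains
    # at least one letter or digit
    return [w for w in words if any(ch.isalnum() for ch in w)]

# Hand-written token list, used for illustration only
tokens = ['A', 'well-known', 'story', ',', 'is', "n't", 'it', '?']

strict = remove_punctuation(tokens)
lenient = remove_punctuation_lenient(tokens)

print(strict)
print(lenient)
```

Which of the two variants is more suitable depends on whether hyphenated words and contracted forms ought to be included in your counts.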


Once you have created a list containing all the individual tokens, you can easily count the total number of tokens, using the `len()` function.

In [6]:
nr_tokens = len( words )

print( f'The text fragment contains {nr_tokens} tokens.' )

The text fragment contains 12 tokens.


## Sentence tokenisation

The method `sent_tokenize()` from the `nltk` package can be used to divide a text into its separate sentences. This type of tokenisation takes place mainly on the basis of sentence-final punctuation, such as full stops, and upper case characters.

The cell below contains an illustration of how the `sent_tokenize()` method can be used.

In [11]:
import nltk
from nltk.tokenize import sent_tokenize

## First paragraph of "A Farewell to Arms"
quote = '''In the late summer of that year we lived in a house in a village that looked across the river and the plain to the mountains. In the bed of the river there were pebbles and boulders, dry and white in the sun, and the water was clear and swiftly moving and blue in the channels. Troops went by the house and down the road and the dust they raised powdered the leaves of the trees. The trunks of the trees too were dusty and the leaves fell early that year and we saw the troops marching along the road and the dust rising and leaves, stirred by the breeze, falling and the soldiers marching and afterward the road bare and white except for the leaves.'''

sentences = sent_tokenize(quote)

print( f'The fragment contains { len(sentences) } sentences.\n' )

for s in sentences:
    print(s + '\n') 

The fragment contains 4 sentences.

In the late summer of that year we lived in a house in a village that looked across the river and the plain to the mountains.

In the bed of the river there were pebbles and boulders, dry and white in the sun, and the water was clear and swiftly moving and blue in the channels.

Troops went by the house and down the road and the dust they raised powdered the leaves of the trees.

The trunks of the trees too were dusty and the leaves fell early that year and we saw the troops marching along the road and the dust rising and leaves, stirred by the breeze, falling and the soldiers marching and afterward the road bare and white except for the leaves.
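
To see why a dedicated sentence tokeniser is useful, it can be compared with a naive rule that splits after every full stop, exclamation mark or question mark that is followed by whitespace and an upper case letter. The sketch below uses only the `re` module and does not reproduce how `sent_tokenize()` works internally. The function `naive_sent_split` and the sample text are assumptions made for this illustration.

```python
import re

def naive_sent_split(text):
    # Split at whitespace that follows '.', '!' or '?' and
    # precedes an upper case letter
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

sample = 'Mr. Darcy smiled. Elizabeth did not.'

for s in naive_sent_split(sample):
    print(s)
```

The naive rule wrongly treats the abbreviation 'Mr.' as the end of a sentence. The tokeniser in `nltk` is designed to handle many abbreviations of this kind.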



### Exercise 8.1.

The 'Corpus' that you have downloaded as part of this tutorial contains the full text of 'Pride and Prejudice'. 
Write code that can help you to answer the following questions about this text:

* How many characters are there in the novel?
* How many words does the novel contain?
* How often does the novel mention the word "marriage"?
* How long are the words in *Pride and Prejudice* on average (measured in number of characters)?

In [11]:
import nltk

# Read the full text of the novel; the 'with' block closes the file afterwards
with open('corpus/PrideandPrejudice.txt', encoding='utf-8') as text_file:
    full_text = text_file.read()

words = nltk.word_tokenize(full_text)

In [31]:
print(len(words))

143569


In [33]:
words = nltk.word_tokenize(full_text)

words.count('marriage')

66

In [45]:
words = nltk.word_tokenize(full_text)
words = remove_punctuation(words)
print(len(words))


120081


In [46]:
from collections import Counter 

frequencies = Counter(words)

for word,count in frequencies.most_common(20):
    print(f"{word} => {count}")

to => 4076
the => 4056
of => 3605
and => 3371
her => 2131
I => 2020
a => 1874
was => 1842
in => 1783
that => 1520
not => 1506
she => 1378
it => 1273
be => 1230
his => 1191
had => 1148
you => 1137
as => 1132
he => 1096
for => 1029
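
The questions about the number of characters and the average word length can be approached along the following lines. This is a minimal sketch on a short sample string rather than on the novel itself; `average_word_length`, `sample_text` and `sample_words` are names assumed for this illustration.

```python
def average_word_length(words):
    # Total number of characters in all tokens, divided by the number of tokens
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

sample_text = 'It was the best of times'
sample_words = sample_text.split()

print(len(sample_text))                    # characters, spaces included
print(average_word_length(sample_words))   # mean token length
```

For the novel, `len(full_text)` gives the number of characters, and the list returned by `remove_punctuation(words)` can be passed to `average_word_length()`.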


### Exercise 8.2.

Can you print the first five sentences of *Pride and Prejudice*?
How many sentences does the novel contain in total?
Building on the results of the previous exercise, can you calculate the average length of sentences? In other words, how many tokens do the sentences contain on average?

In [3]:
from nltk import sent_tokenize
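
One possible way to calculate the average sentence length is sketched below. To keep the block runnable without any downloaded resources, it tokenises with `str.split` by default. The function `average_sentence_length`, its `tokenize` parameter and the sample sentences are assumptions made for this illustration.

```python
def average_sentence_length(sentences, tokenize=str.split):
    # Tokenise each sentence and average the number of tokens
    if not sentences:
        return 0.0
    lengths = [len(tokenize(s)) for s in sentences]
    return sum(lengths) / len(lengths)

sample_sentences = ['It was the best of times.',
                    'It was the worst of times.']

print(average_sentence_length(sample_sentences))
```

For the novel, you could pass the list returned by `sent_tokenize(full_text)` as the first argument and `word_tokenize` as the `tokenize` argument. In that case punctuation marks are counted as tokens, unless they are removed first.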

### Exercise 8.3.

What is the longest word in the novel *Ulysses*?

In [10]:
import nltk

# Read the novel and tokenise it, so that 'words' refers to this text
with open('corpus/Ullyses.txt', encoding='utf-8') as text_file:
    full_text = text_file.read()

words = nltk.word_tokenize(full_text)
print(len(words))
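
Once a list of tokens is available, the longest word can be found by comparing token lengths. The sketch below works on a small hand-written list; `longest_words` and `sample_words` are names assumed for this illustration.

```python
def longest_words(words):
    # Return every distinct token that has the maximum length
    if not words:
        return []
    max_len = max(len(w) for w in words)
    return sorted({w for w in words if len(w) == max_len})

sample_words = ['stately', 'plump', 'Buck', 'Mulligan',
                'came', 'from', 'the', 'stairhead']

print(longest_words(sample_words))
```

On a full novel, the longest tokens may turn out to be hyphenated compounds or other strings that are not words in the usual sense, so it can help to filter the list first, for instance with `remove_punctuation()`.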
