# Tokenisation



Many of the functionalities that can be offered by text analysis tools are based on counts of the smaller linguistic units that occur within texts, such as their words or their sentences. 
This preparatory process, in which the program divides the linear texts into discrete units, is generally referred to as “segmentation” or “tokenisation”. Segmentation generally takes place on the basis of the spaces, the punctuation marks and the line breaks that can be found within texts. Utilising such notational conventions, which can be referred to as ‘soft markup’, text mining applications can be developed for the recognition of units such as words, sentences or paragraphs. 

The individual words that are found are commonly referred to as “**tokens**”. When the full list of tokens is deduplicated, leaving only the unique words, the items in the resulting list are called “**types**”. Frequency lists, counting occurrences of types, often form the basis for further statistical analyses. 
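
The distinction between tokens and types can be illustrated with a minimal sketch. The sentence and the variable names are invented for illustration, and a plain whitespace split stands in for a proper tokeniser.

```python
# Illustrative sentence; a plain whitespace split stands in for a real tokeniser
sentence = 'the cat sat on the mat and the dog sat too'

tokens = sentence.split()   # every occurrence of a word counts as a token
types = set(tokens)         # deduplicating the tokens leaves the types

print( f'{len(tokens)} tokens, {len(types)} types' )
```

Words such as "the" and "sat" occur more than once, so the number of types is lower than the number of tokens.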

The cell below contains a function named `word_tokenise()`. It takes a string as input. This may be a sentence or a whole paragraph. The function returns a list containing all the individual words found in this string.  

The first line of the function creates an empty list, which will eventually receive all the tokens. In the next two lines, the string that is supplied is converted into lower case, and all occurrences of a double hyphen (two consecutive hyphens, often used to represent an em-dash) are surrounded by spaces. In some texts, such dashes separate words without any intervening spaces. 

Next, the string is divided into its separate words using the `split()` function from the `re` module, which here splits the string on whitespace. All leading and trailing punctuation is removed from the words that are found in this way, using the standard `strip()` method, which is available for all strings. The `strip()` call in the code below removes all the symbols that Python predefines as `string.punctuation`. 

It was decided, finally, that words should minimally contain one alphabetic character. The condition given after the `if` keyword stipulates that substrings can only be appended to the `tokens` list when they contain at least one character from our alphabet, either in upper or lower case. Strings which only consist of numbers are disregarded in this function. 

In [None]:
import re
import string

# function to tokenise a string into words
def word_tokenise( text ):
    tokens = []
    text = text.lower()
    text = re.sub( '--' , ' -- ' , text)
    words = re.split( r'\s+' , text )
    for w in words:
        w = w.strip( string.punctuation )
        if re.search( r"[a-zA-Z]" , w ):
            tokens.append(w)
    return tokens

You can use this function as follows:

In [None]:
sentence = 'How many words are there in this sentence?'


words = word_tokenise(sentence)

    
print( f'The sentence contains {len(words)} tokens:' )

for w in words:
    print(w)
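
To see how the function deals with punctuation, double hyphens and numbers, it may help to try it on a less tidy string. The sketch below repeats the logic of `word_tokenise()` so that it can run on its own; the sample sentence is invented for illustration.

```python
import re
import string

# Same logic as the word_tokenise() function above, repeated so this sketch runs on its own
def word_tokenise( text ):
    tokens = []
    text = text.lower()
    text = re.sub( '--' , ' -- ' , text )
    for w in re.split( r'\s+' , text ):
        w = w.strip( string.punctuation )
        if re.search( r'[a-zA-Z]' , w ):
            tokens.append(w)
    return tokens

# Invented sample containing a number, double hyphens, quotes and a contraction
sample = 'In 1908, Lucy said--quietly--"I don\'t know!"'
result = word_tokenise( sample )
print( result )
```

The number is dropped because it contains no alphabetic character, the double hyphens disappear once the punctuation has been stripped, and the apostrophe inside `don't` is kept because `strip()` only removes characters at the start and the end of a string.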


The `word_tokenise()` function can subsequently be used to calculate word frequencies. 

To keep track of the word counts, it can be very useful to work with a Python dictionary. As explained in the tutorial, a dictionary is a variable which can hold key-value pairs. In this situation, we want to store the following pairs: 

* As keys: the types found in the text and 
* As values: the number of times these types occur. 

The code below uses the words (i.e. the types) as the keys of the dictionary named `freq`, and the values that these keys are associated with represent the number of times these words are found in the text. 

The following code reads in a specific text line by line. It first tokenises each line, using the `word_tokenise()` function. Each word that is found in this way becomes a key in the `freq` dictionary. At its first occurrence, the word is assigned the value 1. This value is incremented with every new occurrence of the word in the text. Once Python has read the full text, the dictionary holds the number of occurrences of every word in this text. 

The final few lines print the 30 most frequent words. The number of words to be shown can be manipulated by changing the value of the variable named `max`.  

To calculate a frequency list for one of the texts in your own corpus, you obviously need to change the value of the variable named `file`. 

In [None]:
import os
from os.path import join
import re
import tdm

# create variable for the name of the folder containing your texts
dir = 'Corpus'

# create variable for the name of file you want to analyse
file = 'ARoomWithAView.txt'
# create a dictionary which will count the tokens
# word as index, and frequency as value
freq = dict()

# read the text
text = open( join( dir, file ) )


# Read the text, and iterate through the text line by line. 
for line in text:
    # Tokenise each line, and update the dictionary as you go along
    words = word_tokenise( line )  
    for w in words:
        freq[w] = freq.get( w, 0 ) + 1
         


            
count = 0 
max = 30 

for word in reversed( tdm.sortedByValue(freq) ):
    count += 1
    print( '{}. {} => {}'.format( count , word , freq[word] ) )
    if count == max:
        break
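
The `sortedByValue()` function comes from the `tdm` module and is not shown in this notebook. Assuming that it returns the keys ordered from the lowest to the highest frequency (which the use of `reversed()` suggests), a comparable ranking can be sketched with the standard library only. The small dictionary below is invented for illustration.

```python
from collections import Counter

# Toy frequency dictionary, invented for illustration
toy_freq = { 'the': 12 , 'room': 3 , 'view': 5 , 'lucy': 7 }

# Option 1: sort the keys, using the frequency as sort key, highest first
ranked = sorted( toy_freq , key=toy_freq.get , reverse=True )

# Option 2: Counter.most_common() returns (word, count) pairs, highest first
top_two = Counter( toy_freq ).most_common(2)

for rank , word in enumerate( ranked , start=1 ):
    print( f'{rank}. {word} => {toy_freq[word]}' )
```

Both options avoid the manual counter and the `break` statement, since the number of items to show can be passed to `most_common()` or applied as a slice on the sorted list.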

Once you have managed to run the code that calculates the word frequencies, you will probably see that the list is headed by so-called function words. These are words such as articles, prepositions and pronouns. These words are important for producing grammatically correct sentences, but they mostly have little expressive value on their own. 

If you are interested in studying the contents of a text, it can be useful to remove such frequently used function words. One option is to download a text file containing stopwords from the web and to copy all of these words into a list. The code below copies stopwords from a file created by the [*Information Retrieval Group*](http://ir.dcs.gla.ac.uk/) at the University of Glasgow.


In [None]:
import requests
import re

response = requests.get('http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words')

stopwords = []
if response.status_code == 200:
    response.encoding = 'utf-8'
    contents = response.text
    lines = re.split(r'\n' , contents)
    for word in lines:
        if re.search( r'\w' , word ):
            stopwords.append(word.strip())

print(stopwords)

Alternatively, you can make use of the module named `stopwords` from the `nltk` library. The code that follows creates a list named `stopwords`, on the basis of the predefined list from the `nltk` library. 

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('wordnet')
# the stopword lists need to be downloaded before they can be imported
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords = stopwords.words('english')

print(stopwords)

The code below is a revision of the code that was given earlier for calculating word frequencies. The only difference is that there is a condition in the second for-loop. A word will be added to the dictionary named `freq` only if it is *not* in the list of stopwords. 

In [None]:
from tdm import *
import os
from os.path import join


freq = dict()


dir = 'Corpus'
file = 'ARoomWithAView.txt'

text = open( join( dir, file ) )

for line in text:
    words = word_tokenise( line )  
    for w in words:
        if w not in stopwords:
            freq[w] = freq.get( w, 0 ) + 1
            
max = 30 
count = 0

for word in reversed( sortedByValue(freq) ):
    count += 1
    print( '{}. {} => {}'.format( count , word , freq[word] ) )
    if count == max:
        break
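
The test `w not in stopwords` is carried out once for every token in the text. When the stopwords are held in a list, each test may have to scan the whole list, so it can be worthwhile to convert the list to a set first. The sketch below shows the idea on a tiny scale; the stop list and the sentence are invented for illustration, and a real stop list is much longer.

```python
# Tiny hand-made stop list, for illustration only
stopword_set = set( [ 'the' , 'a' , 'of' , 'and' , 'in' ] )

# Invented sentence; a plain whitespace split stands in for word_tokenise()
tokens = 'the view of the river and the hills in a mist'.split()

content_freq = dict()
for w in tokens:
    # membership tests on a set do not depend on the length of the stop list
    if w not in stopword_set:
        content_freq[w] = content_freq.get( w , 0 ) + 1

print( content_freq )
```

Only the content words remain in the dictionary, while the function words on the stop list are skipped.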


## A word cloud

The code below can be used to visualise a given dictionary with word frequencies into a word cloud. 

To make sure that the word cloud works with the correct dictionary, first run the code in one of the cells above to ensure that the dictionary named `freq` actually exists. 

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt 
from wordcloud import WordCloud 

wordcloud = WordCloud( background_color="white",  width=1500,height=1000, max_words= 100,relative_scaling=1,normalize_plurals=False).generate_from_frequencies(freq)


plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


If the code above produced an error message, this may be because the `wordcloud` module has not been installed on your computer yet. To get the code above to work, you may need to install the `wordcloud` module using the commands in the cell below:

In [None]:
import sys
# install into the same Python environment that runs this notebook
!{sys.executable} -m pip install wordcloud

## Dispersion graphs

Frequencies can also be clarified in dispersion graphs. Instead of giving information about the total number of occurrences, dispersion graphs (or distribution graphs) indicate the number of occurrences in separate sections of the texts. While it is possible to calculate frequencies within, for example, the separate chapters of a novel, such dispersion graphs often work with arbitrary segments, simply created by chopping up the text into chunks of a given number of words. 

To produce such graphs, a full text firstly needs to be divided into segments. The graph that is generated provides information on the frequency of a particular word within each of these segments. The graph can be used to locate those sections in which a given term is used most frequently.

In the code below, the function `divideIntoSegments()` takes two parameters. The first of these is the text to be analysed. Via the second parameter, you can specify the number of segments that need to be created. The number of segments also determines the size of the segments. The size of these segments is calculated by dividing the total number of tokens by the number of segments. 

Note this code also makes use of the `word_tokenise()` function that was discussed above. Make sure that the cell containing this function has been executed before running the cell below. 

In [None]:
import re
import string
from os.path import join


def divideIntoSegments( full_text , nr_segments ):

    segments = []


    all_words = word_tokenise( full_text )

    segmentSize = int( len(all_words) / nr_segments )

    count_words = 0
    text = ''

    for word in all_words:
        count_words += 1
        text += word + ' '

        ## The line below uses the modulo operator:
        ## it tests whether the first number is
        ## divisible by the second number
        if count_words % segmentSize == 0:
            segments.append(text.strip())
            text = ''
    return segments



dir = 'Corpus'
file = 'MobyDick.txt'

text_file = open( join( dir, file ) )
full_text = text_file.read()
text_file.close()


segments = divideIntoSegments( full_text , 30 )

data = dict()

count = 0
for s in segments:
    count += 1
    hits = 0 
    for w in word_tokenise(s):
        # match 'whale' or 'whales' ('s?' makes the final 's' optional)
        if re.search( r'\bwhales?\b' , w , re.IGNORECASE ):
            hits += 1
    data[ count ] = hits
    

%matplotlib inline
import matplotlib.pyplot as plt

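# newer matplotlib versions renamed this style to 'seaborn-v0_8-whitegrid'
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)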

fig = plt.figure( figsize=( 12 , 6 ) )
ax = plt.axes()

ax.plot( list(data.keys() ) , list( data.values() ) , color = '#930d08' , linestyle = 'solid')

ax.set_xlabel('Section')
ax.set_ylabel('Frequency')

ax.set_title( 'Dispersion graph for "Moby Dick" ' , fontsize=20)
plt.show()
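
The segment size used by `divideIntoSegments()` can be checked by hand on a small case. The sketch below mirrors the calculation in the function, using integer division; the token count is invented for illustration.

```python
# Invented numbers: a text of 1,000 tokens divided into 30 segments
total_tokens = 1000
nr_segments = 30

segment_size = total_tokens // nr_segments        # whole tokens per segment
complete_segments = total_tokens // segment_size  # segments that are filled completely
leftover = total_tokens % segment_size            # tokens after the last complete segment

print( segment_size , complete_segments , leftover )
```

Because a segment is only appended once it is full, the tokens left over at the end of the text do not end up in any segment. For a long novel this remainder is negligible, but it is worth keeping in mind for short texts.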



The functions that have been discussed in this notebook (`word_tokenise()` and `divideIntoSegments()`) can also be found in the module named `tdm`, so that they can easily be reused in other programs.  

## Other tokenisers

Finally, it is important to emphasise that there are many other libraries and modules that enable you to tokenise a text. The `nltk` library, for instance, includes a function named `word_tokenize()`, which can be used as follows:

In [None]:
from nltk import word_tokenize
from os.path import join

dir = 'Corpus'
text = 'ARoomWithAView.txt'
path = join( dir, text )

with open( path , encoding = 'utf-8' ) as file:
    full_text = file.read()
    full_text = full_text.lower()

words_nltk = word_tokenize(full_text)

Note that the `word_tokenize()` function leaves the case as it is in the original text. To calculate frequencies in a case-insensitive manner, you need to convert the text either to lower case (using the `lower()` method) or to upper case (using `upper()`). The code above therefore converts the full text to lower case before tokenising it. 

The two tokenisers do not produce identical results, however. We can compare the two approaches by comparing the word lists that are created using `nltk` to the words that are recognised by the `word_tokenise()` function that was discussed at the beginning of this notebook. The code of this function has been copied and pasted into the `tdm` module, which is imported in the code below. 

In [None]:
import tdm
from os.path import join

dir = 'Corpus'
text = 'ARoomWithAView.txt'
path = join( dir, text )

with open( path , encoding = 'utf-8' ) as file:
    full_text = file.read()

words = tdm.word_tokenise(full_text)


An important difference between the two approaches is that `nltk` treats all the punctuation marks as separate tokens. When you run the code below, you should see many occurrences of semi-colons, dots, commas and quotes. Arguably, such punctuation marks should not be in a frequency list.

In [None]:
# use a set for fast membership tests
words_set = set(words)
for word in words_nltk:
    if word not in words_set:
        print(word)
    

Another difference is that `nltk` aims to separate the genitival 's' from tokens such as the following:

* people's 
* father's
* child's

`nltk` likewise aims to separate the various parts in contracted verb forms. The `word_tokenise()` function retained character sequences such as `don't`, `i'm` and `weren't` as tokens in their own right. `nltk` aims to separate the stem from the rest of such tokens, but this also results in tokens such as 

* 'm
* n't
* 's
* 't

The question whether words in the genitival forms and contracted verb forms must be counted separately is of course open to debate. It is useful to be aware of the different approaches that may be followed while tokenising words, however.

In [None]:
# use a set for fast membership tests
words_nltk_set = set(words_nltk)
for word in words:
    if word not in words_nltk_set:
        print(word)
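
For a full novel, printing every differing token produces a very long list. As a minimal sketch, the differences can instead be summarised by counting them. The two token lists below are hand-written and only mimic the kind of output the two tokenisers give for the sentence "I don't like the child's toy."; they were not produced by running `nltk`.

```python
from collections import Counter

# Hand-written lists mimicking the two tokenisers (illustrative only)
words_a = [ 'i' , "don't" , 'like' , 'the' , "child's" , 'toy' ]
words_b = [ 'i' , 'do' , "n't" , 'like' , 'the' , 'child' , "'s" , 'toy' , '.' ]

# Convert each list to a set once, so that every membership test is fast
set_a = set( words_a )
set_b = set( words_b )

# Count the tokens that occur in one list but not in the other
only_in_a = Counter( w for w in words_a if w not in set_b )
only_in_b = Counter( w for w in words_b if w not in set_a )

print( only_in_a.most_common() )
print( only_in_b.most_common() )
```

On real data, `most_common()` shows at a glance which punctuation marks and clitics account for most of the differences between the two lists.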