# Vocabulary

## Tokenisation

Many of the functions offered by text analysis tools are based on counts of the smaller linguistic units that occur within texts, such as words or sentences. The preparatory process in which a program divides a linear text into such discrete units is generally referred to as “segmentation” or “tokenisation”. Segmentation generally takes place on the basis of the spaces, the punctuation marks and the line breaks found within texts. Using such notational conventions, which can be referred to as ‘soft markup’, text mining applications can recognise units such as words, sentences or paragraphs.

The individual words that are found are commonly referred to as “**tokens**”. When the full list of tokens is deduplicated, leaving only the unique words, the items in the resulting list are called “**types**”. Frequency lists, which count the occurrences of each type, often form the basis for further statistical analyses.
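
The difference between tokens and types can be illustrated with a short sketch. It uses a plain `split()` on whitespace rather than a full tokeniser, and the example sentence is made up for the occasion.

```python
# a made-up example sentence (not taken from the corpus)
sentence = 'the cat sat on the mat and the dog sat too'

# tokens: every running word, including repetitions
tokens = sentence.split()

# types: the unique words only
types = set(tokens)

print('tokens:', len(tokens))
print('types:', len(types))
```

The sentence has 11 tokens but only 8 types, because 'the' and 'sat' occur more than once.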

The cell below contains the function named `tokenise()` from the `dtdpTdm` module. It expects a string as input. The first line of the function creates an empty list named `tokens`, which will eventually receive all the tokens. Next, the string that is supplied is converted into lower case, and all occurrences of hyphens or dashes are surrounded by spaces, because in some texts these punctuation marks are used to separate words. The string is then divided into its separate words using the `split()` function from the `re` module. From all the words that are found in this way, leading and trailing punctuation is removed using the standard `strip()` method, which is available for all strings. In the code below, `strip()` removes all the symbols that are predefined in Python as `string.punctuation`. Finally, it was decided that words should contain at least one alphabetic character. The condition given after the `if` keyword stipulates that substrings are only appended to the `tokens` list when they contain at least one letter of the alphabet, in either upper or lower case. Note that the character class in the code also accepts an apostrophe.

In [None]:
import string
import re

# function to tokenise a string into words
def tokenise( text ):
    tokens = []
    text = text.lower()
    text = re.sub( '--' , ' -- ' , text)
    text = re.sub( '-' , ' - ' , text)
    words = re.split( r'\s+' , text )
    for w in words:
        w = w.strip( string.punctuation )
        if re.search( r"[a-zA-Z']" , w ):
            tokens.append(w)
    return tokens


You can test this function as follows:

In [None]:
import dtdpTdm as tdm
import string

sentence = 'How many words are there in this sentence?'

words = tdm.tokenise(sentence)

    
print('The sentence contains {} tokens:'.format( len(words) ) )

for w in words:
    print(w)

print( '\nThe variable string.punctuation contains the following characters: \n{}'.format( string.punctuation )   )

The `tokenise()` function, defined in the `dtdpTdm` module, is very helpful when we want to calculate word frequencies. To keep track of the word counts, it is useful to work with a Python dictionary. As explained in the tutorial, a dictionary is a variable which can hold pairs of keys and values. In this situation, we want to store exactly such pairs: (1) the types found in the text and (2) the number of times these types occur. The code below uses the words (i.e. the types) as the keys of the dictionary named `freq`, and the values associated with these keys represent the number of times these words are found in the text.
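
The idea of a frequency dictionary can be tried out on a single short string first. The sketch below uses a made-up phrase and a simple `split()` on whitespace. The variable name `counts` is chosen only for this illustration.

```python
# a made-up phrase, split on whitespace for simplicity
phrase = 'it was the best of times it was the worst of times'

counts = {}
for w in phrase.split():
    # get() returns 0 when the word has not been seen before
    counts[w] = counts.get(w, 0) + 1

for w in counts:
    print(w, counts[w])
```

The `get()` method makes a separate check for the first occurrence of a word unnecessary, because it supplies a default value of 0 for words that are not yet in the dictionary.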

The following code reads in a specific text line by line. It first tokenises each line, using the `tokenise()` function. Each word that is found in this way becomes a key in the `freq` dictionary. At the first occurrence of a word, the word is assigned the value 1. This value is then incremented with every new occurrence of the word in the text. Once Python has read the full text, the dictionary holds the number of occurrences of every word in this text.

The final few lines print the 30 most frequent words. The number of words to be shown can be manipulated by changing the value of the variable named `max`.  

To calculate a frequency list for one of the texts in your own corpus, you need to change the value of the `file` variable.

In [None]:
import os
from os.path import join
import re
import dtdpTdm as tdm

# create variable for the name of the folder containing your texts
dir = 'Corpus'

# create variable for the name of file you want to analyse
file = 'ATaleofTwoCities.txt'

# create a dictionary which will count the tokens
# word as index, and frequency as value
freq = dict()

# open the text; the 'with' block closes the file automatically when it ends
with open( join( dir, file ) ) as text:
    # iterate through the text line by line
    for line in text:
        # tokenise each line, and update the dictionary as you go along
        words = tdm.tokenise( line )
        for w in words:
            freq[w] = freq.get( w, 0 ) + 1
         
count = 0 
max = 30        
for word in reversed( tdm.sortedByValue(freq) ):
    count += 1
    print( '{}. {} => {}'.format( count , word , freq[word] ) )
    if count == max:
        break
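
The sorting step above relies on the `sortedByValue()` helper from the `dtdpTdm` module. As a sketch of an alternative that needs only the standard library, the `Counter` class from `collections` can both count tokens and return the most frequent ones. The token list here is a made-up stand-in for the output of `tokenise()`, and `top_words()` is a hypothetical helper name.

```python
from collections import Counter

def top_words(tokens, n):
    # Counter counts the tokens; most_common(n) returns the n most
    # frequent items as (word, count) pairs, highest count first
    return Counter(tokens).most_common(n)

# a made-up token list, standing in for the output of tokenise()
sample_tokens = 'a b a c b a'.split()

for rank, (word, count) in enumerate(top_words(sample_tokens, 2), start=1):
    print('{}. {} => {}'.format(rank, word, count))
```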

When you have managed to run the code above, you will probably see that the list that is created is headed by so-called function words. These are words such as articles, prepositions and pronouns, which are important for producing grammatically correct sentences, but which mostly have little expressive value by themselves. If you are interested in studying the contents of a text, it can be useful to remove such frequently used function words. The `nltk` library can help you with this. It contains a module named `stopwords`, which includes a predefined list of frequently used function words.

The code below gives a demonstration of this. It first creates a set named `stopwords`, on the basis of the predefined list from the `nltk` library. Next, it creates a copy of the `freq` dictionary that was created earlier, and it saves the copy under the name `freq_copy`. Then, while iterating through the `freq_copy` dictionary, all the words that are on the list of stopwords are removed from the original `freq` dictionary. The copy is needed because Python does not allow a dictionary to change size while it is being iterated over.

When you run the code below, you should see an updated list, without the frequently used function words. If `nltk` reports that the stopwords resource cannot be found, download it once with `nltk.download('stopwords')`.

In [None]:
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
# print(stopwords)


freq_copy = freq.copy()

for w in freq_copy:
    if w in stopwords:
        del freq[w]
        
        
count = 0 
max = 30        
# print the 30 most common words       
for word in reversed( tdm.sortedByValue(freq) ):
    count += 1
    print( '{}. {} => {}'.format( count , word , freq[word] ) )
    if count == max:
        break
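
As a sketch of an alternative to copying the dictionary and deleting keys, a new dictionary can be built with a comprehension that keeps only the words that are not stopwords. To keep the example self-contained, it uses a tiny hand-written stopword set and made-up counts instead of the `nltk` list and a real text.

```python
# illustrative values only: a tiny frequency dictionary and stopword set
word_counts = {'the': 12, 'whale': 5, 'of': 9, 'sea': 3, 'and': 7}
small_stoplist = {'the', 'of', 'and'}

# keep only the words that are not in the stopword set
content_words = {w: n for w, n in word_counts.items() if w not in small_stoplist}

print(content_words)
```

This leaves the original dictionary untouched, which can be convenient when both the full and the filtered list are needed later on.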

In regular word lists, frequencies are usually distributed according to Zipf’s Law, which states that the frequency of a word is roughly inversely proportional to its rank in the frequency list. In practice, this means that there is normally a small number of words that occur very frequently, and a large number of words with low frequencies, including the so-called hapax legomena, which are words that occur only once. As was explained, the high-frequency words are usually function words.
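
The two ends of such a distribution can be inspected with a few lines of code. The sketch below ranks the words of a made-up token list and collects the hapax legomena. A text this short cannot show a genuine Zipfian distribution, so the example only illustrates the mechanics.

```python
from collections import Counter

# a made-up token list, far too short to show a real Zipfian curve
sample = 'the whale and the sea and the ship and the captain'.split()

counts = Counter(sample)

# words ordered by frequency, highest first
ranked = counts.most_common()

# hapax legomena: words that occur exactly once
hapaxes = sorted(w for w, n in counts.items() if n == 1)

print('Most frequent:', ranked[:2])
print('Hapax legomena:', hapaxes)
```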

These highly frequent words can indeed be removed by working with a list of stopwords. As an alternative, we can work with the term frequency-inverse document frequency (tf-idf) formula. It is a statistical measure which was designed to indicate the significance or the relative uniqueness of a specific term within the context of a corpus.

<img src="https://www.seoadviesmkb.nl/images/blog/tf-idf.png" style="width: 250px; "/>

The tf-idf formula assigns weights to the bare counts of the words. These weights are calculated by first dividing the total number of texts in the corpus by the number of texts that contain a given word. If the word occurs in all the texts of the corpus, this number will be one. If the term occurs in only one text, this number will be higher. The formula then takes the logarithm of this quotient, which tempers the effect of very large ratios. As log(1) equals zero, words which occur in all texts receive a weight of zero and are effectively removed. The formula can thus be used to retrieve the rarer, more distinctive words from a certain text.
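
The calculation can be followed step by step in a small sketch. It assumes one common variant of the formula, in which the raw count of a word is multiplied by the natural logarithm of the number of texts divided by the number of texts containing the word. The `dtdpTdm` module may differ in details such as the base of the logarithm. The three-text corpus and the function name `tf_idf()` are invented for this illustration.

```python
import math

# a made-up corpus of three very short 'texts', already tokenised
toy_corpus = {
    'text1': 'the whale swims in the sea'.split(),
    'text2': 'the ship sails on the sea'.split(),
    'text3': 'the captain hunts the whale'.split(),
}

def tf_idf(corpus, name):
    n_texts = len(corpus)
    tokens = corpus[name]
    weights = {}
    for word in set(tokens):
        # term frequency: raw count of the word in the chosen text
        tf = tokens.count(word)
        # document frequency: number of texts that contain the word
        df = sum(1 for t in corpus.values() if word in t)
        weights[word] = tf * math.log(n_texts / df)
    return weights

weights = tf_idf(toy_corpus, 'text1')

for word in sorted(weights, key=weights.get, reverse=True):
    print('{} => {:.3f}'.format(word, weights[word]))
```

The word 'the' occurs in all three texts and therefore receives a weight of zero, while 'swims', which occurs in one text only, receives the highest weight.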

The `dtdpTdm` module has a function which can be used to calculate frequencies following this tf-idf formula. As parameters, it expects (1) a reference to the directory containing the full corpus, and (2) the name of the file containing the text you want to analyse. The function returns a dictionary containing all the weighted word frequencies.

In [None]:
import dtdpTdm as tdm

dir = 'Corpus'
text = 'ATaleofTwoCities.txt'

freq = tdm.tdIdf( dir , text )

for w in reversed( tdm.sortedByValue(freq) ):
    print( w + ' => ' + str(freq[w]) )


## Concordances

To examine *how* words are used in a text, it can be useful to create a concordance. Such resources are sometimes referred to as keyword-in-context (KWIC) lists. In a concordance, all the occurrences of a given search term are shown in combination with the words that occur before and after this term.

The `dtdpTdm` module has a function named `concordance()` which can help you to create such a concordance. To work with it, you need to supply three parameters: (1) the filename of the text that you want to analyse; (2) a regular expression representing the word(s) whose contexts you want to examine; and (3) a number that specifies the extent of the context. This third parameter determines the number of words that will be shown before and after the search result.


In [None]:
from os.path import join
import dtdpTdm as tdm

dir = 'Corpus'
fileName = 'MobyDick.txt'

fullPath = join( dir , fileName )

## the next line produces a list of all lines containing the regular expression
## formatted as a KWIC list
c = tdm.concordance( fullPath , r'\bwhales?' , 25 )

for line in c:
    print(line)
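
To give an impression of what happens inside such a function, the sketch below builds a very small KWIC list with the `re` module. It is not the code of `concordance()`: the function name `kwic()` is hypothetical, the sample string is made up, and, as a simplification, the context is measured in characters rather than in words.

```python
import re

def kwic(text, pattern, width):
    """Return one line per match, with `width` characters of context on either side."""
    lines = []
    for m in re.finditer(pattern, text, re.IGNORECASE):
        # make sure the left context does not start before the text does
        start = m.start() - width
        if start < 0:
            start = 0
        left = text[start:m.start()]
        right = text[m.end():m.end() + width]
        # pad the left context so that all keywords line up in one column
        lines.append(left.rjust(width) + '[' + m.group() + ']' + right)
    return lines

sample_text = 'The whale surfaced. Two whales followed the white whale.'

for line in kwic(sample_text, r'\bwhales?', 12):
    print(line)
```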


## Collocation analysis

Collocation analyses focus on the words that are used in the vicinity of a given search term. Such analyses give an impression of the semantic contexts of this term. In the `dtdpTdm` module, collocations can be analysed using the `collocation()` function. Its parameters are the same as those of `concordance()`.

The function `removeStopWords()` can also be useful in this context. It removes the most common words from a given dictionary. Analogous to the situation that was discussed earlier, it makes use of the list of stopwords defined in the `nltk` library.

In [None]:
from os.path import join
import dtdpTdm as dtdp

dir = 'Corpus'
fileName = 'MobyDick.txt'
fullPath = join( dir , fileName )


freq = dtdp.collocation( fullPath , r'whales?' , 30 )
freq = dtdp.removeStopWords(freq)

for w in freq:
    print(w , freq[w])
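
The principle behind a collocation analysis can be sketched as follows: for every token that matches the search term, the tokens within a fixed window before and after it are counted. This is a simplified illustration and not the code of `collocation()`. The function name `collocates()` and the token list are invented for the example.

```python
import re
from collections import Counter

def collocates(tokens, pattern, window):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if re.fullmatch(pattern, tok):
            # do not let the window start before the first token
            start = i - window if i > window else 0
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# a made-up token list, standing in for a tokenised text
sample_tokens = 'the white whale and the sperm whale chased the ship'.split()

result = collocates(sample_tokens, r'whales?', 2)

for word, n in result.most_common():
    print(word, n)
```

In this sketch, 'ship' is not counted, because it lies more than two tokens away from the nearest occurrence of 'whale'.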

## A word cloud

The code below can be used to visualise a given dictionary with word frequencies as a word cloud. It can be used, in fact, to visualise the results of the basic frequency lists, the lists created using the tf-idf formula, and the collocation analysis.

To make sure that the word cloud works with the correct dictionary, first run the code in one of the cells above, and run the code for making the word cloud immediately after this.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt 
from wordcloud import WordCloud 

wordcloud = WordCloud( background_color="white",  width=1500,height=1000, max_words= 100,relative_scaling=1,normalize_plurals=False).generate_from_frequencies(freq)


plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


If the code above produced an error message, this may be caused by the fact that the `wordcloud` module has not been installed yet on your computer. To get the code above to work, you may need to install the `wordcloud` module using the commands in the cell below:

In [None]:
import sys
# install wordcloud into the same Python environment that runs this notebook
!{sys.executable} -m pip install wordcloud

Using code that was explained in Part 4 of the Python tutorial, can you visualise the 50 most frequent words as a bar chart?

## Dispersion graphs

Frequencies can also be clarified in dispersion graphs. Instead of giving information about the total number of occurrences, dispersion graphs indicate the number of occurrences in separate sections of the texts. While it is possible to calculate frequencies within, for example, the separate chapters of a novel, such dispersion graphs often work with arbitrary segments, simply created by chopping up the text into chunks of a given number of words.

To produce such graphs, a full text first needs to be divided into segments. The graph that is generated provides information about the frequency of a particular word within each of these segments. The graph can be used to locate those sections in which a given term is used most frequently.

The function `divideIntoSegments()` expects two parameters. The first of these is the text to be analysed. Via the second parameter, you can specify the number of segments that need to be created. The number of segments also determines the size of the segments: this size is calculated by dividing the total number of tokens by the number of segments.
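
The calculation of the segment size can be sketched in a few lines. The function below is a hypothetical stand-in, not the code of `divideIntoSegments()`: it works on a list of tokens rather than on a file, it assumes that there are at least as many tokens as segments, and it simply adds any leftover tokens to the last segment.

```python
def divide_into_segments(tokens, n_segments):
    # segment size: total number of tokens divided by the number of segments
    size = len(tokens) // n_segments
    segments = []
    for i in range(n_segments):
        start = i * size
        # the last segment also takes any leftover tokens
        end = (i + 1) * size if i < n_segments - 1 else len(tokens)
        segments.append(' '.join(tokens[start:end]))
    return segments

# ten made-up tokens, divided into three segments
sample_tokens = 'a b c d e f g h i j'.split()

for s in divide_into_segments(sample_tokens, 3):
    print(s)
```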

In [None]:
import re
import string
from os.path import join
import dtdpTdm as tdm

dir = 'Corpus'
fileName = 'MobyDick.txt'
fullPath = join( dir , fileName )

segments = tdm.divideIntoSegments( fullPath , 30 )

data = dict()

count = 0
for s in segments:
    count += 1
    hits = re.findall( r'\bwhale' , s , re.IGNORECASE )
    data[ count ] = len( hits )
    

%matplotlib inline
import matplotlib.pyplot as plt

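# the style was renamed in matplotlib 3.6; fall back to the old name on older versions
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use( style )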

fig = plt.figure( figsize=( 12 , 4 ) )
ax = plt.axes()

ax.plot( list(data.keys() ) , list( data.values() ) , color = '#930d08' , linestyle = 'solid')

ax.set_xlabel('Section')
ax.set_ylabel('Frequency')

ax.set_title( 'Dispersion graph for "Moby Dick" ')
plt.show()



The code in this notebook can help you to examine the vocabulary in a single text. The notebook named '[NLTK.ipynb](NLTK.ipynb)' enables you to create quantitative data about all the texts in your corpus, and to compare your texts on the basis of these data.