# 12. Type-token ratio

As was discussed in one of the previous notebooks, the individual words that are found in a text are referred to as “tokens”, and the unique words are called “types”. Frequency lists count occurrences of types. 

The ratio between the number of types and the number of tokens can, under certain conditions, offer useful information about texts as well. The type-token ratio is calculated by dividing the number of types by the number of tokens. Because a text can never contain more types than tokens, this division results in a number greater than 0 and at most 1. This number gives an indication of the lexical diversity of the text: the extent to which the author varies the vocabulary.

If the type-token ratio is high, this indicates that the author uses many unique words and that the text contains very little lexical repetition. If, by contrast, the type-token ratio is low, this implies that there is a low level of lexical diversity: the same words are used over and over.
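
As a minimal illustration, the ratio can be worked out for a single made-up sentence. The sketch below assumes a naive tokenisation that lowercases the sentence and splits it on whitespace, which is a simplification and not the tokeniser used in the rest of this notebook. The sentence and the variable names are invented for this example.

```python
# Toy example: naive whitespace tokenisation, assumed here for illustration only
sentence = 'the cat sat on the mat and the dog sat on the rug'
toy_words = sentence.lower().split()

toy_tokens = len(toy_words)        # every word counts, repeated or not
toy_types = len(set(toy_words))    # every distinct word counts once

toy_ttr = toy_types / toy_tokens

print( f'Tokens: {toy_tokens}' )
print( f'Types: {toy_types}' )
print( f'Type-token ratio: {toy_ttr:.2f}' )
```

The sentence contains 13 tokens but only 8 types, because 'the', 'sat' and 'on' are repeated. The type-token ratio is therefore 8 / 13, or roughly 0.62.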

The type-token ratio can be calculated as follows. The code below splits the text into words using the `tokenise()` function, which is imported from the `tdmh` module.

In [None]:
from os.path import join
import re
from nltk import word_tokenize
from tdmh import *

dir = 'Corpus'
text = 'ARoomWithAView.txt'
path = join( dir, text )

with open( path , encoding = 'utf-8' ) as file:
    full_text = file.read()

words = tokenise(full_text)

tokens = len(words)
unique_words = set(words)
types = len(unique_words)

ttr = types / tokens

print( f'Types: {types}' )
print( f'Tokens: {tokens}' )
print( f'Type-token ratio: {ttr}' )

The code above makes use of the function `set()`, which converts a Python list into a set. A set is a built-in data structure in Python, similar to a list. An important difference, however, is that a list may contain the same item multiple times, whereas a set can only contain unique items. A list also stores its items in a specific order, while a set is **unordered**. This makes `set()` an effective way to deduplicate a Python list.
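
The difference between a list and a set can be seen in a small sketch. The list of words below is made up for this example.

```python
# A made-up list of words in which some items occur more than once
sample_words = ['tea', 'room', 'view', 'tea', 'room', 'tea']

# Converting the list into a set discards the duplicates
sample_set = set(sample_words)

print( len(sample_words) )   # 6 items, duplicates included
print( len(sample_set) )     # 3 unique items

# A set has no fixed order; sorted() returns its items as an ordered list
print( sorted(sample_set) )  # ['room', 'tea', 'view']
```

Because a set is unordered, printing `sample_set` directly may show the items in any order. If the order matters, convert the set back into a list with `sorted()`.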

## Comparing the lexical diversity of different texts

The cell below defines a function named `ttr`. It wraps the code explained above for calculating the type-token ratio, and additionally removes punctuation from the list of words before counting. The function only needs the text as input.

In [None]:
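def ttr(full_text):
    # Split the text into words and discard punctuation marks
    words = tokenise(full_text)
    words = remove_punctuation(words)

    # Tokens: all the words; types: the unique words only
    tokens = len(words)
    unique_words = set(words)
    types = len(unique_words)

    return types / tokens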


With this code wrapped into a single function, the calculation of the type-token ratio can easily be applied to all the texts in a given corpus.

In [None]:
import os
import re
from os.path import join
from tdmh import *

dir = 'Corpus'
for text in os.listdir(dir):
    # Only process the plain text files in the folder
    if re.search( r'\.txt$' , text ):
        path = join( dir , text )
        with open( path , encoding = 'utf-8' ) as file:
            full_text = file.read()
            full_text = full_text.lower()

        print( f'{ remove_extension(text) }: {ttr(full_text)}' )


Whenever you work with type-token ratios, it is important to realise that the results of such calculations tend to vary with the total length of the text. In a relatively short text, it is easier for an author to continue to introduce new words as the text progresses. When texts become much longer, however, the chances that words will be repeated increase accordingly. As a result, shorter texts generally have much higher type-token ratios than longer texts.
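
This effect can already be observed on a very small scale. The sketch below calculates the type-token ratio for increasingly long parts of a made-up passage, assuming again a naive tokenisation on whitespace. The passage and the helper function `prefix_ttr` are invented for this example and are not part of `tdmh`.

```python
# A made-up passage; naive whitespace tokenisation is assumed for illustration
passage = ( 'the sun rose over the hills and the birds began to sing '
            'the sun set over the hills and the birds began to sleep' )
passage_words = passage.split()

def prefix_ttr(word_list, n):
    # Type-token ratio of the first n words only
    first_words = word_list[ 0 : n ]
    return len(set(first_words)) / len(first_words)

for n in [ 4 , 12 , 24 ]:
    print( f'First {n} words: {prefix_ttr(passage_words, n):.2f}' )
```

The ratio drops from 1.00 for the first 4 words to 0.83 for the first 12 words and to 0.50 for all 24 words, even though the style of the passage does not change.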

One solution is to ensure that all the texts have the same length before calculating the type-token ratios. We can do this by first determining the length (i.e. the total number of words) of the **shortest text in the corpus**. Next, we can artificially harmonise the lengths of all the texts by truncating the longer texts, so that each text contains exactly the same number of words as the shortest text in the corpus. The code below illustrates this principle.

In [None]:
import tdmh

dir = 'Corpus'
texts = []
min_tokens = 0

for text in os.listdir(dir):
    if re.search( r'\.txt' , text ):
        texts.append(text)
        path = join( dir , text) 
        with open( path , encoding = 'utf-8' ) as file:
            full_text = file.read()
            words = tdmh.tokenise(full_text)
            tokens = len(words)
            print( f'{text} contains {tokens} words.' )
            if min_tokens == 0:
                min_tokens = tokens
            elif tokens < min_tokens:
                min_tokens = tokens
                
print( f'\nShortest text has {min_tokens} words.\n' )

ttr_scores = dict()
                
            
for text in texts:
    if re.search( r'\.txt' , text ):
        path = join( dir , text) 
        print( f'Calculating the TTR of {path}' )
        with open( path , encoding = 'utf-8' ) as file:
            full_text = file.read()
            full_text = full_text.lower()
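            # Keep the first min_tokens words; slicing the string itself would count characters
            words = tdmh.tokenise(full_text)[ 0 : min_tokens ]
            full_text = ' '.join(words)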
        
        title = remove_extension(text)
        ttr_scores[title] = ttr(full_text)
        print( f'{title}: {ttr_scores[title]}' )

# Exercises

## Exercise 5.1

Try to create a bar chart which visualises the type-token ratios of all the texts in the folder 'Corpus'. 

As you do this, you can make use of the dictionary named `ttr_scores`, which is created in the last cell under the section 'Comparing the lexical diversity of different texts'. The titles of the texts in the corpus serve as keys, and the type-token ratios that are calculated are stored as the values. 

You can plot the keys, `list(ttr_scores.keys())`, on the X-axis, and the values, `list(ttr_scores.values())`, on the Y-axis.



## Exercise 5.2

Calculate the type-token ratios of *Sons and Lovers*, *Ivanhoe* and *Through the Looking Glass*. Use the function `ttr()` that is defined in this notebook, but in this case, consider only the first 3000 words of each of these novels.

Does this result in different type-token ratios? If so, do these different numbers also prompt different conclusions about the lexical diversity of these novels?