## Type/Token Ratio - lexical variety

Token: every occurrence of a word in the text  
Type: a distinct (unique) word  
e.g. "pear pear apple" has 3 tokens and 2 types

The Type/Token Ratio (TTR) is the number of types (unique words) divided by the number of tokens (total number of all words). In the code that follows, it is multiplied by 100 to express it as a percentage.

A high TTR = high degree of lexical variation/vocabulary richness  
A low TTR = low degree of lexical variation/vocabulary richness

If TTR is high, a greater proportion of the words in the text are unique; that is, more different words are being used and the vocabulary is more varied.
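As a quick check of the definition, the toy example above can be computed directly. This is a minimal sketch: the helper name `type_token_ratio` is an assumption introduced for illustration, and plain whitespace splitting stands in for a real tokenizer.

```python
def type_token_ratio(tokens):
    """Return types / tokens as a fraction between 0 and 1."""
    if not tokens:
        return 0.0  # assumption: treat an empty text as having no variety
    return len(set(tokens)) / len(tokens)

toy_tokens = "pear pear apple".split()
toy_ttr = type_token_ratio(toy_tokens)

# 3 tokens, 2 types, ratio of about 0.67
print(len(toy_tokens), len(set(toy_tokens)), round(toy_ttr, 2))
```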

In [None]:
import re

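def tokenize(text):
    # Lowercase so that "The" and "the" count as the same type
    lowercase_text = text.lower()
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    split_words = re.split(r'\W+', lowercase_text)
    # re.split leaves empty strings when the text starts or ends with punctuation
    return [word for word in split_words if word]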

# Use a context manager so the file is closed after reading
with open('kafka_metamorphosis.txt', encoding="utf-8") as f:
    text = f.read()

all_the_words = tokenize(text)

unique_words = set(all_the_words)

# TTR as a percentage: number of types divided by number of tokens
TTR = (len(unique_words) / len(all_the_words)) * 100
TTR

In [None]:
# TTR with stopwords removed
import re

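def tokenize(text):
    # Lowercase so that "The" and "the" count as the same type
    lowercase_text = text.lower()
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    split_words = re.split(r'\W+', lowercase_text)
    # re.split leaves empty strings when the text starts or ends with punctuation
    return [word for word in split_words if word]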

# Use a context manager so the file is closed after reading
with open('kafka_metamorphosis.txt', encoding="utf-8") as f:
    text = f.read()

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll']

all_the_words = tokenize(text)

# Keep only the tokens that are not in the stopword list
meaningful_words = [word for word in all_the_words if word not in stopwords]

unique_words = set(meaningful_words)

# TTR as a percentage, computed over the non-stopword tokens only
TTR = (len(unique_words) / len(meaningful_words)) * 100
TTR

Is this surprising? Does this prompt you to ask other questions? For example, does lexical variety change over the course of the text? How would you go about analyzing this? (Note that TTR depends on sample size: because language repeats many words (e.g. function words), a sample of 100 words will have a higher proportion of unique words than a sample of 1,000. So it is possible to compare 100 words from one chunk of text to 100 words from another, but analyses will be skewed if we compare 100 words from one to 200 words from another.)
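One way to approach this question is to cut the token list into equal-sized chunks and compute a TTR for each, so that every value is based on the same number of words. The sketch below is one possible approach, not the only one; the function name `chunked_ttr`, the `chunk_size` parameter, and the sample sentence are assumptions made for illustration.

```python
import re

def tokenize(text):
    # Lowercase, split on non-word characters, and drop empty strings
    return [word for word in re.split(r'\W+', text.lower()) if word]

def chunked_ttr(tokens, chunk_size=100):
    """Return one TTR percentage per consecutive chunk of chunk_size tokens.

    A trailing chunk shorter than chunk_size is dropped so that every
    value is computed over the same number of tokens.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / len(chunk) * 100)
    return ratios

# Hypothetical sample sentence; use the tokens from the full text instead
sample_tokens = tokenize("the cat sat on the mat and the dog sat on the log")
sample_ratios = chunked_ttr(sample_tokens, chunk_size=4)
print(sample_ratios)  # [100.0, 75.0, 100.0]
```

With the text loaded as in the earlier cells, `chunked_ttr(all_the_words, chunk_size=1000)` would give a list of values that could be plotted to see whether variety rises or falls across the text. The chunk size is a judgment call: smaller chunks give more data points but noisier values.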