[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/1.words/HW1_Tokenization.ipynb)

# Homework 1: Tokenization

In this homework, you'll compare the tokenization outputs from different classes of tokenizers. This homework is also an opportunity for you to check in on your Python proficiency; for all of the operations below (downloading a file, reading it in, counting objects), you should either be comfortable implementing them already or know how to find out how to do so yourself. If you find yourself struggling with them, we encourage you to take this class at a later date, with a bit more Python experience under your belt.

We've added some space for you to write the code for each section, but feel free to create more code cells if you'd like.

## Part 1

Tokenize the following document with each of these tokenizers. Feel free to use the linked documentation (and AI assistance) for this low-level operation, but remember that you have to be able to explain what your code is doing. For each of the tokenizers below, we want to see a list of tokens for this document (not numeric token IDs, but legible words) -- e.g., \["London", ".", ...\]

* NLTK `word_tokenize` (https://www.nltk.org/book/ch03.html)
* Spacy `tokenize` (https://spacy.io/usage/spacy-101#annotations-token)
* Tiktoken BPE tokenization (https://github.com/openai/tiktoken) -- cl100k_base (GPT-3.5, GPT-4).



In [2]:
!pip install nltk
!pip install spacy
!pip install tiktoken



In [3]:
from nltk import word_tokenize
import spacy
import tiktoken

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [17]:
nlp = spacy.load("en_core_web_sm")

def nlkt_tokenise(text):
  return word_tokenize(text)

def spacy_tokenise(text):
  tokens = nlp(text)
  return [token.text for token in tokens]


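def tiktoken_tokenise(text):
  # NOTE: "gpt-4o" resolves to the o200k_base encoding; the assignment names
  # cl100k_base, which tiktoken.get_encoding("cl100k_base") would load instead.
  encoding = tiktoken.encoding_for_model("gpt-4o")
  token_ids = encoding.encode(text)
  # Decode each ID on its own so we get legible strings rather than numbers.
  # errors='replace' guards against a token that holds only part of a multi-byte character.
  token_strings = [encoding.decode_single_token_bytes(token_id).decode('utf-8', errors='replace') for token_id in token_ids]
  return token_strings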


def tokenise(document):
  if document is not None:
    # get nltk tokens
    nltk_tokens = nlkt_tokenise(document)
    # print(f'NLTK Tokens:{nltk_tokens}\n')

    # get spacy tokens
    spacy_tokens = spacy_tokenise(document)
    # print(f'Spacy Tokens:{spacy_tokens}\n')

    # get tiktoken tokens
    tiktoken_tokens = tiktoken_tokenise(document)
    # print(f'Tiktoken Tokens:{tiktoken_tokens}\n')

    return nltk_tokens, spacy_tokens, tiktoken_tokens

In [18]:
def main():
  document = "London. Michaelmas term lately over, and the Lord Chancellor sitting in Lincoln’s Inn Hall. Implacable November weather. As much mud in the streets as if the waters had but newly retired from the face of the earth, and it would not be wonderful to meet a Megalosaurus, forty feet long or so, waddling like an elephantine lizard up Holborn Hill. Smoke lowering down from chimney-pots, making a soft black drizzle, with flakes of soot in it as big as full-grown snowflakes—gone into mourning, one might imagine, for the death of the sun. Dogs, undistinguishable in mire. Horses, scarcely better; splashed to their very blinkers. Foot passengers, jostling one another’s umbrellas in a general infection of ill temper, and losing their foot-hold at street-corners, where tens of thousands of other foot passengers have been slipping and sliding since the day broke (if this day ever broke), adding new deposits to the crust upon crust of mud, sticking at those points tenaciously to the pavement, and accumulating at compound interest."

  nltk_tokens,spacy_tokens,tiktoken_tokens = tokenise(document)

  print(f'NLTK Tokens:{nltk_tokens}\n')
  print(f'Spacy Tokens:{spacy_tokens}\n')
  print(f'Tiktoken Tokens:{tiktoken_tokens}\n')


if __name__ == "__main__":
  main()


NLTK Tokens:['London', '.', 'Michaelmas', 'term', 'lately', 'over', ',', 'and', 'the', 'Lord', 'Chancellor', 'sitting', 'in', 'Lincoln', '’', 's', 'Inn', 'Hall', '.', 'Implacable', 'November', 'weather', '.', 'As', 'much', 'mud', 'in', 'the', 'streets', 'as', 'if', 'the', 'waters', 'had', 'but', 'newly', 'retired', 'from', 'the', 'face', 'of', 'the', 'earth', ',', 'and', 'it', 'would', 'not', 'be', 'wonderful', 'to', 'meet', 'a', 'Megalosaurus', ',', 'forty', 'feet', 'long', 'or', 'so', ',', 'waddling', 'like', 'an', 'elephantine', 'lizard', 'up', 'Holborn', 'Hill', '.', 'Smoke', 'lowering', 'down', 'from', 'chimney-pots', ',', 'making', 'a', 'soft', 'black', 'drizzle', ',', 'with', 'flakes', 'of', 'soot', 'in', 'it', 'as', 'big', 'as', 'full-grown', 'snowflakes—gone', 'into', 'mourning', ',', 'one', 'might', 'imagine', ',', 'for', 'the', 'death', 'of', 'the', 'sun', '.', 'Dogs', ',', 'undistinguishable', 'in', 'mire', '.', 'Horses', ',', 'scarcely', 'better', ';', 'splashed', 'to', 't

## Part 2

Examine the different tokenizations for the passage above -- i.e., actually read through them and see how they differ. In a paragraph or two, characterize the salient differences in tokenization between a.) NLTK and Spacy and b.) NLTK and BPE.  Reference real examples in the text. At the end of this homework, you want to be able to discuss the practical differences between tokenization methods.

**Response**:

NLTK and spaCy generally agree on how to split the passage into words and punctuation. A key difference is how they handle hyphenated words. For instance, in the phrase "foot-hold at street-corners", spaCy splits "foot-hold" and "street-corners" at the hyphen, giving ["foot", "-", "hold", "at", "street", "-", "corners"], while NLTK keeps each hyphenated term as a single token: ["foot-hold", "at", "street-corners"].
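
The hyphen contrast described above can be imitated with two small regular expressions. This is only a sketch of the two behaviours: the function names `keep_hyphens` and `split_hyphens` are made up for illustration, and the patterns are not the actual rules used by NLTK or spaCy.

```python
import re


def keep_hyphens(text):
    # A word may contain internal hyphens; any other non-space symbol is its own token.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def split_hyphens(text):
    # Every run of word characters is a token, and so is every punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


phrase = "foot-hold at street-corners"
print(keep_hyphens(phrase))   # ['foot-hold', 'at', 'street-corners']
print(split_hyphens(phrase))  # ['foot', '-', 'hold', 'at', 'street', '-', 'corners']
```

The same phrase yields three tokens under one rule and seven under the other, so a choice this small can change token counts noticeably.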


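In contrast, the difference between these two and Byte Pair Encoding (BPE) tokenization is more substantial. BPE breaks words into smaller subword units, especially for rare or compound words. For example, the same phrase might be split by BPE into [' foot', '-h', 'old', ' at', ' street', '-c', 'orners']. Here the hyphen is attached to the first letter of the following piece, and "hold" and "corners" are cut mid-word, which neither NLTK nor spaCy does: NLTK keeps "foot-hold" whole, and spaCy splits only at the hyphen. The BPE tokens also carry the preceding space (' foot', ' at'), which the other two discard. As a result, BPE can produce more tokens than a word-level tokenizer for the same text: seven for this phrase, versus three from NLTK.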
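
To make the subword behaviour concrete, here is a toy character-level BPE learner. It is a minimal sketch on a made-up corpus, not tiktoken's implementation (which uses pretrained merges over bytes); `learn_merges`, `apply_merge`, `segment`, and `toy_corpus` are hypothetical names used only for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into characters, then replay the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


toy_corpus = ["street"] * 6 + ["foot"] * 4
merges = learn_merges(toy_corpus, num_merges=10)
print(segment("street", merges))    # ['street']
print(segment("foothold", merges))  # ['foot', 'h', 'o', 'l', 'd']
```

Words seen often during training end up as single tokens, while an unseen compound falls back to a known piece plus leftover characters, which is the pattern visible in pieces like '-h' and 'old' above.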

## Part 3

Download the full text of *Pride and Prejudice* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt) and tokenize it using each of the methods above. How many word types (in the formal sense we discussed in class) does each tokenization method have for that complete file?

In [None]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/158_emma.txt

In [19]:
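# Read the whole novel into one string; the context manager closes the file.
with open('1342_pride_and_prejudice.txt', 'r', encoding='utf-8') as f:
  pride_and_prejudice = f.read()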

nltk_tokens, spacy_tokens, tiktoken_tokens = tokenise(pride_and_prejudice)

print(f'Number of NLTK Tokens word types: {len(set(nltk_tokens))}')
print(f'Number of Spacy Tokens word types: {len(set(spacy_tokens))}')
print(f'Number of Tiktoken Tokens word types: {len(set(tiktoken_tokens))}')

Number of NLTK Tokens word types: 7475
Number of Spacy Tokens word types: 6780
Number of Tiktoken Tokens word types: 8488
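
The counts above treat every distinct token string as its own type, so "The" and "the" are counted separately. The sketch below, on a made-up token list, shows how much that choice can matter; `type_token_counts` and its `lowercase` flag are hypothetical names, and whether case folding matches the definition of a type used in class is an assumption to check.

```python
from collections import Counter


def type_token_counts(tokens, lowercase=False):
    """Return (number of types, number of tokens) for a list of token strings."""
    if lowercase:
        tokens = [token.lower() for token in tokens]
    counts = Counter(tokens)
    return len(counts), sum(counts.values())


sample = ["The", "cat", "saw", "the", "other", "cat", "."]
print(type_token_counts(sample))                  # (6, 7)
print(type_token_counts(sample, lowercase=True))  # (5, 7)
```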


## Part 4

Which text has the greater type-token ratio, *Pride and Prejudice* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt) or *Emma* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/158_emma.txt)?  Calculate the TTR for both texts using the NLTK tokenizer, but only use the first 1,000 tokens from each text when calculating its TTR.

In [20]:
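# Read the whole novel into one string; the context manager closes the file.
with open('158_emma.txt', 'r', encoding='utf-8') as f:
  emma = f.read()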

pp_tokens = nlkt_tokenise(pride_and_prejudice)[:1000]
emma_tokens = nlkt_tokenise(emma)[:1000]
pp_types = len(set(pp_tokens))
emma_types = len(set(emma_tokens))


In [21]:
pp_ttr = pp_types / len(pp_tokens)
emma_ttr = emma_types / len(emma_tokens)
answer = "Pride and Prejudice"  if pp_ttr > emma_ttr else "Emma"

print("The TTR for 'Pride and Prejudice' is", pp_ttr)
print("The TTR for 'Emma' is", emma_ttr)
print(f"{answer} has the higher TTR.")

The TTR for 'Pride and Prejudice' is 0.36
The TTR for 'Emma' is 0.41
Emma has the higher TTR.
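
Restricting both texts to the first 1,000 tokens matters because TTR generally falls as a text gets longer: common words keep repeating while new types arrive more slowly. A minimal sketch on a made-up token sequence, with `type_token_ratio` and its `limit` parameter as hypothetical names:

```python
def type_token_ratio(tokens, limit=None):
    """TTR over the first `limit` tokens, or over all tokens if `limit` is None."""
    window = tokens if limit is None else tokens[:limit]
    if not window:
        raise ValueError("need at least one token to compute a TTR")
    return len(set(window)) / len(window)


tokens = "a b c a b d a b e a b f".split()
print(type_token_ratio(tokens, limit=4))  # 0.75
print(type_token_ratio(tokens))           # 0.5
```

The same sequence scores 0.75 over four tokens and 0.5 over twelve, so comparing texts at different lengths would mostly measure length rather than vocabulary richness.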
