---
title: Model to Correct Homophone Misue
author: Marion Bauman
---


For our natural language processing model, we will be building a tool that corrects homophone misuse.

Our model will compute the likelihood that a given word is correct, and if it is not, it will suggest a replacement word. This will be tested on a corpus of data including some intentional homophone misuse.

In [28]:
from transformers import BertForMaskedLM, BertTokenizer, AdamW, get_linear_schedule_with_warmup

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertForMaskedLM.from_pretrained('bert-base-cased')

Downloading tokenizer_config.json: 100%|██████████| 29.0/29.0 [00:00<00:00, 1.95kB/s]
Downloading vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 3.97MB/s]
Downloading tokenizer.json: 100%|██████████| 436k/436k [00:00<00:00, 13.7MB/s]
Downloading config.json: 100%|██████████| 570/570 [00:00<00:00, 123kB/s]
Downloading model.safetensors: 100%|██████████| 436M/436M [00:07<00:00, 56.0MB/s] 
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a B

In [30]:
# Chat GPT Prompt: Give me a list of lists in python of 100 sets of homophones
homophones_list = [
    ["ate", "eight"],
    ["bare", "bear"],
    ["brake", "break"],
    ["capital", "capitol"],
    ["cell", "sell"],
    ["cite", "site", "sight"],
    ["complement", "compliment"],
    ["desert", "dessert"],
    ["die", "dye"],
    ["flour", "flower"],
    ["hear", "here"],
    ["hour", "our"],
    ["knight", "night"],
    ["know", "no"],
    ["mail", "male"],
    ["meat", "meet"],
    ["morning", "mourning"],
    ["one", "won"],
    ["pair", "pear"],
    ["peace", "piece"],
    ["principal", "principle"],
    ["rain", "reign", "rein"],
    ["right", "write"],
    ["sea", "see"],
    ["serial", "cereal"],
    ["sole", "soul"],
    ["stationary", "stationery"],
    ["tail", "tale"],
    ["threw", "through"],
    ["to", "too", "two"],
    ["weather", "whether"],
    ["week", "weak"],
    ["wear", "where"],
    ["which", "witch"],
    ["your", "you're"],
    ["allowed", "aloud"],
    ["board", "bored"],
    ["brake", "break"],
    ["capital", "capitol"],
    ["compliment", "complement"],
    ["desert", "dessert"],
    ["dual", "duel"],
    ["fair", "fare"],
    ["genre", "jinja"],
    ["hare", "hair"],
    ["here", "hear"],
    ["hoard", "horde"],
    ["loan", "lone"],
    ["pail", "pale"],
    ["peak", "peek", "pique"],
    ["profit", "prophet"],
    ["role", "roll"],
    ["root", "route"],
    ["sail", "sale"],
    ["scene", "seen"],
    ["serial", "cereal"],
    ["so", "sow"],
    ["stare", "stair"],
    ["steal", "steel"],
    ["their", "there", "they're"],
    ["throne", "thrown"],
    ["vain", "vein", "vane"],
    ["weak", "week"],
    ["wood", "would"],
    ["yew", "you"],
    ["bridal", "bridle"],
    ["cereal", "serial"],
    ["chord", "cord"],
    ["compliment", "complement"],
    ["dew", "due"],
    ["foul", "fowl"],
    ["grate", "great"],
    ["groan", "grown"],
    ["heal", "heel"],
    ["him", "hymn"],
    ["lay", "lie"],
    ["main", "mane"],
    ["marry", "merry"],
    ["mite", "might"],
    ["moose", "mousse"],
    ["mourn", "morn"],
    ["peace", "piece"],
    ["plum", "plumb"],
    ["pour", "pore"],
    ["rap", "wrap"],
    ["scene", "seen"],
    ["scent", "cent", "sent"],
    ["serial", "cereal"],
    ["shear", "sheer"],
    ["soar", "sore"],
    ["sow", "sew"],
    ["stake", "steak"],
    ["tide", "tied"],
    ["toe", "tow"],
    ["there", "their", "they're"],
    ["waist", "waste"],
    ["week", "weak"],
    ["write", "right", "rite"],
]

In [4]:
import pandas as pd

gutenberg_homophone_data = pd.read_csv('../data/gutenberg-homophone-errors.csv')

In [10]:
test_sentence = gutenberg_homophone_data['sentences'][0]

In [40]:
tokenized_sentence = tokenizer.tokenize(test_sentence)
print(tokenized_sentence)

['the', 'project', 'gut', '##enberg', 'e', '##book', 'of', 'f', '##rank', '##enstein', ';', 'or', ',', 'the', 'modern', 'pro', '##met', '##heus', 'this', 'e', '##book', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', '##most', 'other', 'parts', 'of', 'the', 'world', 'at', 'know', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', '##w', '##hat', '##so', '##ever', '.']


In [34]:
# flatten the list of lists
homophones_list_flat = [item for sublist in homophones_list for item in sublist]

In [47]:
outs = []
for ts in range(1, len(tokenized_sentence)):
    ts_full = tokenized_sentence[0:ts+1]
    if ts_full[ts] in homophones_list_flat:
        print(ts_full[ts])
        ts_full[ts] = '[MASK]'
        tokenize_ts_full = tokenizer(ts_full, return_tensors='pt', padding=True, truncation=True)
        model_output = model(**tokenize_ts_full)
        outs.append(model_output)

know


no


In [48]:
outs

[MaskedLMOutput(loss=None, logits=tensor([[[ -7.2337,  -7.1655,  -7.2393,  ...,  -5.9906,  -5.7929,  -6.1766],
          [ -7.8210,  -7.9910,  -7.8512,  ...,  -6.4410,  -6.3874,  -6.7109],
          [-10.6334, -10.4260, -10.4234,  ...,  -8.4052,  -9.2172,  -9.4748],
          [ -5.6448,  -5.4847,  -5.7676,  ...,  -3.5781,  -5.0492,  -4.3255],
          [ -5.3552,  -5.2033,  -5.4628,  ...,  -3.5359,  -4.8507,  -4.1003],
          [ -5.4304,  -5.3131,  -5.5703,  ...,  -3.4580,  -5.1538,  -4.4302]],
 
         [[ -7.2990,  -7.1915,  -7.2768,  ...,  -6.1238,  -5.7690,  -6.2951],
          [ -7.9527,  -8.1324,  -7.9212,  ...,  -6.6422,  -6.5342,  -7.0476],
          [-11.1965, -11.2232, -11.4031,  ...,  -8.1919, -10.1863, -10.9385],
          [ -4.9157,  -4.6635,  -4.8502,  ...,  -3.1893,  -4.0294,  -5.2144],
          [ -4.6782,  -4.4324,  -4.6052,  ...,  -3.2563,  -3.8875,  -5.3541],
          [ -5.0233,  -4.7579,  -5.0839,  ...,  -3.6479,  -4.0018,  -5.2299]],
 
         [[ -7.2721,  -7.