# CS4765/6765 Assignment 1: Spelling Correction (with BERT and GPT2)

Here is an example of how BERT and GPT-2 can be applied to the spelling correction task from assignment 1. BERT is a masked language model and was the state of the art for many NLP tasks around 2019. GPT-2 is an autoregressive language model and was the precursor to GPT-3 and ChatGPT. The code below uses the smaller distilled versions of both models (distilbert-base-uncased and distilgpt2).

## Candidate model

We use the same candidate model as for assignment 1. The code in this section is taken straight from the assignment 1 starter code.

In [1]:
# A model of candidate in-vocabulary corrections for a spelling error.
# See below for an example of how to use it.
class CandidateModel:
    def __init__(self, train_corpus_fname):
        self.ALPHABET = 'abcdefghijklmnopqrstuvwxyz'
        self.vocabulary = set()
        for line in open(train_corpus_fname):
            line = line.split()
            # Ignore start and end of sentence markers
            words = line[1:len(line) - 1]
            for w in words:
                self.vocabulary.add(w)

    def delete_edits(self, w):
        # Return a list of the strings that can be formed by applying one
        # delete operation to word w (duplicates are removed later, in
        # candidates())
        result = []
        for i in range(len(w)):
            candidate = w[:i] + w[i+1:]
            result.append(candidate)
        return result

    def insert_edits(self, w):
        result = []
        for i in range(len(w) + 1):
            for c in self.ALPHABET:
                candidate = w[:i] + c + w[i:]
                result.append(candidate)
        return result

    def transpose_edits(self, w):
        result = []
        for i in range(1, len(w)):
            transposed_letters = w[i] + w[i-1]
            candidate = w[:i-1] + transposed_letters + w[i+1:]
            result.append(candidate)
        return result

    def replace_edits(self, w):
        result = []
        for i in range(len(w)):
            for c in self.ALPHABET:
                if c != w[i]:
                    candidate = w[:i] + c + w[i+1:]
                    result.append(candidate)
        return result

    def candidates(self, w, in_vocabulary=True):
        # Return the set of strings within one edit of w. If in_vocabulary
        # is True, keep only strings seen in the training corpus.
        all_candidates = self.delete_edits(w) + self.insert_edits(w) + self.transpose_edits(w) + self.replace_edits(w)
        if in_vocabulary:
            all_candidates = [x for x in all_candidates if x in self.vocabulary]
        return set(all_candidates)

In [2]:
train_fname = 'data/corpus.txt'
candidate_model = CandidateModel(train_fname)

In [3]:
# candidate_model can be used to get the set of in-vocabulary candidate corrections for a spelling error.
# All candidate corrections are within edit distance one of the spelling error.
candidate_model.candidates('frend')

{'fiend', 'fred', 'freed', 'freud', 'friend', 'rend', 'trend'}
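
To get a feel for how much work the vocabulary filter does, the sketch below regenerates the unfiltered edit-distance-one strings for 'frend' without needing the training corpus. It is a standalone illustration, not part of the starter code: the function name `edits1` is hypothetical, and the logic is intended to mirror the four edit methods of `CandidateModel` above.

```python
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def edits1(w):
    """Return every string one edit away from w, as a list (duplicates kept)."""
    # One string per deleted position
    deletes = [w[:i] + w[i+1:] for i in range(len(w))]
    # 26 letters at each of len(w) + 1 insertion points
    inserts = [w[:i] + c + w[i:] for i in range(len(w) + 1) for c in ALPHABET]
    # Swap each pair of adjacent letters
    transposes = [w[:i-1] + w[i] + w[i-1] + w[i+1:] for i in range(1, len(w))]
    # 25 alternative letters at each position
    replaces = [w[:i] + c + w[i+1:]
                for i in range(len(w)) for c in ALPHABET if c != w[i]]
    return deletes + inserts + transposes + replaces

raw = edits1('frend')
print(len(raw))          # 5 + 26*6 + 4 + 25*5 = 290
print('friend' in raw)   # True
```

Of these 290 strings, only the seven shown in the output above are in the vocabulary, so the language models below only ever have to choose among a handful of real words.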

In [4]:
def print_accuracy_for_predictions(predictions, keys):
    # If the length of the output and keys are not the same, something went
    # wrong...
    assert len(predictions) == len(keys)

    num_correct = 0
    total = 0
    for p,k in zip(predictions,keys):
        if p == k:
            num_correct += 1
        total += 1
    accuracy = num_correct / total
    print("Num correct: ", num_correct)
    print("Total: ", total)
    print("Accuracy:", round(accuracy, 3))

In [5]:
dev_fname = 'data/dev.txt'
dev_keys_fname = 'data/dev.keys.txt'
dev_keys = [x.strip() for x in open(dev_keys_fname)]

test_fname = 'data/test.txt'
test_keys_fname = 'data/test.keys.txt'
test_keys = [x.strip() for x in open(test_keys_fname)]

## BERT

This example applies BERT to spelling correction through a HuggingFace pipeline, with the aim of keeping the implementation simple. It replaces the target word with [MASK] and uses a fill-mask pipeline to get BERT's top 100 predicted words for the masked position. It then keeps only the predicted words that are also among the in-vocabulary candidate corrections from the candidate model. The highest-scoring remaining word (according to BERT) is returned as the predicted correction; if none of the candidates appears in the top 100, the original word is returned unchanged.
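
The filtering step can be tried on its own, without loading a model. The sketch below is an illustration only: `pick_correction` is a hypothetical helper, and the prediction list is made up by hand to look like fill-mask output (a list of dicts with `token_str` and `score` keys), so the scores are not real BERT scores.

```python
def pick_correction(predictions, iv_candidates, fallback):
    """Return the highest-scoring predicted token that is also a candidate."""
    # Sort by score so the function does not depend on the input order
    ranked = sorted(predictions, key=lambda p: p['score'], reverse=True)
    for p in ranked:
        if p['token_str'] in iv_candidates:
            return p['token_str']
    # No candidate was among the predictions: leave the word unchanged
    return fallback

# Hand-written stand-in for pipeline output; the scores are invented
mock_predictions = [
    {'token_str': 'man', 'score': 0.21},
    {'token_str': 'friend', 'score': 0.18},
    {'token_str': 'trend', 'score': 0.001},
]
iv_candidates = {'fiend', 'fred', 'freed', 'freud', 'friend', 'rend', 'trend'}

print(pick_correction(mock_predictions, iv_candidates, fallback='frend'))
print(pick_correction(mock_predictions, {'fiend'}, fallback='frend'))
```

The first call prints `friend`: 'man' scores higher but is not a candidate correction, so it is skipped. The second call prints `frend`, because none of the predictions is in the candidate set.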

In [6]:
from transformers import pipeline
bert_unmasker = pipeline('fill-mask', model='distilbert-base-uncased')

Device set to use mps:0


In [7]:
def get_bert_prediction(line):
    # Split the line on a tab; get the target word to correct and
    # the sentence it's in
    target_index,sentence = line.split('\t')
    target_index = int(target_index)
    sentence = sentence.split()
    target_word = sentence[target_index]

    # Get the in-vocabulary candidates 
    iv_candidates = candidate_model.candidates(target_word)

    masked_sentence = list(sentence)
    masked_sentence[target_index] = "[MASK]"
    masked_sentence = " ".join(masked_sentence)

    # Here we only get the top 100 predictions from BERT. To get predictions for
    # all items in the vocabulary, we could set top_k to the vocab size, in
    # this case 30522. This would allow the spelling correction to consider
    # more possible corrections and potentially achieve a higher accuracy,
    # but it slows things down quite a bit.
    predictions = bert_unmasker(masked_sentence, top_k=100)
    predictions = [x for x in predictions if x['token_str'] in iv_candidates]

    if len(predictions) == 0:
        return sentence[target_index]
    else: 
        return predictions[0]['token_str']

In [8]:
def get_bert_predictions(predict_dataset_fname):
    predictions = []
    # Use a context manager so the file is closed after reading
    with open(predict_dataset_fname) as f:
        for line in f:
            curr_prediction = get_bert_prediction(line)
            predictions.append(curr_prediction)
    return predictions

In [9]:
bert_dev_predictions = get_bert_predictions(dev_fname)
bert_test_predictions = get_bert_predictions(test_fname)

In [10]:
print('BERT:')
print('Dev:')
print_accuracy_for_predictions(bert_dev_predictions, dev_keys)

print()
print('Test:')
print_accuracy_for_predictions(bert_test_predictions, test_keys)

BERT:
Dev:
Num correct:  321
Total:  421
Accuracy: 0.762

Test:
Num correct:  238
Total:  332
Accuracy: 0.717


## GPT-2

This application of GPT-2 to spelling correction takes the following approach. For each in-vocabulary candidate correction from the candidate model, do the following:

- substitute the candidate for the target word to correct
- compute GPT-2's language modelling loss (the average per-token negative log-likelihood) for the resulting sentence

Then return the best candidate correction (i.e., the candidate correction with the lowest loss). If the candidate model returns no candidates, the original word is kept.
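
The selection rule can be sketched without a model by working from per-token log-probabilities. Everything below is illustrative: `mean_loss` and `pick_lowest_loss` are hypothetical helpers, and the log-probabilities are invented rather than produced by GPT-2. The sketch assumes the loss is the mean negative log-probability per predicted token, which is our understanding of what the model returns when labels are passed.

```python
def mean_loss(token_log_probs):
    """Average negative log-probability per token (lower is better)."""
    return -sum(token_log_probs) / len(token_log_probs)

def pick_lowest_loss(scored_candidates, fallback):
    """Return the candidate whose sentence has the lowest mean loss."""
    best, best_loss = fallback, float('inf')
    for candidate, log_probs in scored_candidates.items():
        loss = mean_loss(log_probs)
        if loss < best_loss:
            best, best_loss = candidate, loss
    return best

# Invented per-token log-probabilities for the sentence with each candidate
# substituted in. The lists differ in length because candidates may be split
# into different numbers of subword tokens.
scored = {
    'friend': [-2.0, -1.0, -0.5, -1.5],        # mean loss 1.25
    'fiend':  [-2.0, -1.0, -4.0, -3.0, -1.5],  # mean loss 2.3
    'trend':  [-2.0, -1.0, -3.5, -1.5],        # mean loss 2.0
}

print(pick_lowest_loss(scored, fallback='frend'))
```

This prints `friend`. Because the loss is averaged over tokens, candidates are compared per token rather than by total sentence probability, and the two orderings can differ when candidates tokenize to different lengths.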


In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

In [12]:
import torch

def get_gpt_prediction(line):
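    # Select the in-vocabulary candidate that gives the lowest loss when
    # substituted for the target word in the sentence. The loss is the
    # average per-token cross-entropy, so a lower loss means GPT-2 finds
    # the sentence more probable. We could compute the probability
    # directly instead, but for this demo it was easier to use the loss.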
    
    target_index,sentence = line.split('\t')
    target_index = int(target_index)
    sentence = sentence.split()
    target_word = sentence[target_index]
    
    # Get the in-vocabulary candidates 
    iv_candidates = candidate_model.candidates(target_word)

    best_loss = float("inf")
    best_correction = sentence[target_index]
    for iv_candidate in iv_candidates:
        sentence_with_candidate = list(sentence)
        sentence_with_candidate[target_index] = iv_candidate
        sentence_with_candidate = " ".join(sentence_with_candidate)

        inputs = tokenizer(sentence_with_candidate, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss.item()

        if loss < best_loss:
            best_loss = loss
            best_correction = iv_candidate

    return best_correction

In [13]:
def get_gpt_predictions(predict_dataset_fname):
    predictions = []
    # Use a context manager so the file is closed after reading
    with open(predict_dataset_fname) as f:
        for line in f:
            curr_prediction = get_gpt_prediction(line)
            predictions.append(curr_prediction)
    return predictions

In [14]:
gpt_dev_predictions = get_gpt_predictions(dev_fname)
gpt_test_predictions = get_gpt_predictions(test_fname)

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


In [15]:
print('GPT:')
print('Dev:')
print_accuracy_for_predictions(gpt_dev_predictions, dev_keys)

print()
print('Test:')
print_accuracy_for_predictions(gpt_test_predictions, test_keys)

GPT:
Dev:
Num correct:  346
Total:  421
Accuracy: 0.822

Test:
Num correct:  266
Total:  332
Accuracy: 0.801
