# **Legal Grammar Error Corrector**
Note: We are only interested in correcting the following error types in English: non-words, morphology, articles and prepositions.

## TODO List
- TODO: (Isaac): Create validation set and evaluate results with metrics
- TODO: Tune tau
- TODO: Test results on test set
- TODO: Record presentation
- TODO: Edit together presentations.
- TODO: (Emily): Writeup results.

In [1]:
!pip install transformers torch datasets numpy gradio cyhunspell lemminflect errant rich
!python3 -m spacy download en
import torch
import numpy as np


In [2]:
import random
import sys
import os

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Set the seeds for reproducibility (Python, NumPy and PyTorch)
seed = 123
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Mount files
from google.colab import drive
drive.mount("/content/drive")

dir_path = 'MyDrive/Colab Notebooks/LGEC'

env_path = f'/content/drive/{dir_path}'
print(os.listdir(env_path))

if env_path not in sys.path:
    sys.path.append(env_path)


Mounted at /content/drive
['model_outputs', 'Data', 'runs', 'Presentation', 'Emily', 'GPT_trainer.ipynb', 'BERT_trainer.ipynb', 'corrector.ipynb']


## Loading our Trained Models

In [3]:
from transformers import AutoTokenizer, DistilBertForMaskedLM, GPT2LMHeadModel 
# Load pre-trained BERT model
BERT_model_checkpoint = "distilbert-base-cased"
BERT_tokenizer = AutoTokenizer.from_pretrained(BERT_model_checkpoint)
BERT_base_model = DistilBertForMaskedLM.from_pretrained(BERT_model_checkpoint).to(device)
BERT_tuned_model = DistilBertForMaskedLM.from_pretrained("isaacjeffersonlee/distilbert-for-legal-grammar-error-correction").to(device)
# BERT_state_dict_path = f"{env_path}/model_outputs/BERT_trained_state_dict_cased.pt"
# BERT_tuned_model.load_state_dict(torch.load(BERT_state_dict_path, map_location=torch.device(device)))

# Load pre-trained GPT2 model
GPT_model_checkpoint = "distilgpt2"
GPT_tokenizer = AutoTokenizer.from_pretrained(GPT_model_checkpoint)
GPT_base_model = GPT2LMHeadModel.from_pretrained(GPT_model_checkpoint).to(device)
GPT_tuned_model = GPT2LMHeadModel.from_pretrained("isaacjeffersonlee/distilgpt2-for-legal-grammar-error-correction").to(device)
# GPT_state_dict_path = f"{env_path}/model_outputs/GPT_trained_state_dict_cased.pt"
# GPT_tuned_model.load_state_dict(torch.load(GPT_state_dict_path, map_location=torch.device(device)))


# Generating Correction Candidates

### Correction of non-words with CyHunspell
We use CyHunspell to generate correction candidates for **non-words**.

We will also add in special legal words, which would normally be mistaken for spelling errors.
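
As a rough illustration of the idea (not of CyHunspell itself), the sketch below uses a toy word set and the standard library's `difflib.get_close_matches` as a stand-in for a spell checker's suggestion step. The word lists and the `is_known` and `suggest` helpers are hypothetical; only the overall pattern of extending a general vocabulary with domain terms mirrors what we do with Hunspell.

```python
from difflib import get_close_matches

# Toy general-purpose vocabulary (hypothetical; a real dictionary is far larger).
general_vocab = {"the", "defendant", "was", "guilty", "court", "battery"}
# Toy domain-specific additions, standing in for a legal word list.
legal_vocab = {"barratry", "estoppel", "tortfeasor"}


def is_known(word, vocab):
    """Return True if the lower-cased word is in the vocabulary."""
    return word.lower() in vocab


def suggest(word, vocab, n=3):
    """Return up to n vocabulary entries that are close to the given word."""
    return get_close_matches(word.lower(), sorted(vocab), n=n, cutoff=0.8)


print(is_known("barratry", general_vocab))                # False: flagged as a non-word
print(is_known("barratry", general_vocab | legal_vocab))  # True once legal terms are added
print(suggest("defendent", general_vocab))                # ['defendant']
```

Without the domain list, a legitimate legal term is treated like a typo and would be "corrected" to some unrelated word, which is why the legal word list is loaded before any correction is attempted.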

In [4]:
from hunspell import Hunspell
h = Hunspell()
h.spell("barratry")  # Returns False

False

In [6]:
# Add legal word list scraped from the Merriam-Webster legal dictionary.
legal_word_list_path = os.path.join(f"{env_path}/Data", "legal.dic")
h.add_dic(legal_word_list_path)

0

In [7]:
h.spell("barratry")

True

## Morphological Errors
E.g. cat -> cats, eat -> ate, big -> bigger.

### Approach 1: AGID
We can use the Automatically Generated Inflection Database (AGID) to look up the inflected forms of a word:
http://wordlist.aspell.net/other/

In [8]:
infl_file = f"{env_path}/Data/infl.txt"
print(f"Reading {infl_file}...")
with open(infl_file, "r") as f:
    infl_list = f.read().split("\n")

print(f"Inflections for {len(infl_list)} words found!")

Reading /content/drive/MyDrive/Colab Notebooks/LGEC/Data/infl.txt...
Inflections for 112506 words found!


In [9]:
import re

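def parse_line(text):
    # Strip AGID annotation characters and digits, leaving only the word, its POS tag and its inflections.
    # Note: "-" is escaped so that it is a literal hyphen rather than an accidental "+-1" character range.
    parsed_str = re.sub(r"[:?,~<.!>|+\-/0-9]", "", text).strip(" ")
    return [item for item in parsed_str.split(" ") if item != ""]


infl_list = list(map(parse_line, infl_list))
infl_list = [infl for infl in infl_list if infl]  # Remove any empty lists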
print(infl_list[1000])

['Assamese', 'N', 'Assameses']


In [10]:
infl_map = {infl[0] : set(infl[2:]) for infl in infl_list}
def get_inflections(lemma):
    # If no inflections were found, return an empty set (same type as a successful lookup)
    return infl_map.get(lemma, set())

In [11]:
get_inflections("watch")

{'watched', 'watches', 'watching'}

### Approach 2: LemmInflect
AGID contains many incorrect inflections (see the example below), so we instead turn to the
Python package LemmInflect.

In [12]:
# Example incorrect inflection
get_inflections("all")

{'alls'}

In [13]:
from lemminflect import getLemma, getAllInflections

def get_inflections_and_lemmas(word):
    lemmas = list(getLemma(word, upos="VERB"))
    if len(lemmas) == 1:
        if lemmas[0] == word:
            lemmas = []  # Return empty list if the only word found was the original word
    infl = []
    for val in getAllInflections(word).values():  # Flatten nested list
        infl += list(val)
    return set(lemmas + infl)

In [14]:
# Yay no incorrect all inflection this time!
get_inflections_and_lemmas("all")

set()

In [15]:
get_inflections_and_lemmas("know")

{'knew', 'know', 'knowing', 'known', 'knows'}

In [16]:
get_inflections_and_lemmas("watches")  # Example of Lemmatization

{'watch'}


## Articles and Prepositions
Since there are only a small number of articles and
prepositions, we can define them manually here.
(Note the prepositions defined here are the top ten most frequent prepositions).

TODO: Experiment with using more prepositions.

In [17]:
eps = "ε"  # Empty string, representing a deletion
articles = {eps, "a", "an", "the"}
preps = {eps, "about", "at", "by", "for", "from", "in", "of", "on", "to", "with"}

## Generating the Confusion Sets
For each word in our sentence, we combine the above
methods to generate a confusion set.
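
To make the role of the confusion set concrete, the following sketch enumerates the candidate sentences produced by a closed-class confusion set. It is a simplified, standard-library-only illustration: `candidate_sentences` is a hypothetical helper, it works on whitespace-split tokens rather than raw text, and it only handles the article set.

```python
EPS = "ε"  # Marker for "delete this word"
ARTICLES = {EPS, "a", "an", "the"}


def candidate_sentences(tokens, position, confusion_set):
    """Return one candidate sentence per alternative for tokens[position]."""
    candidates = []
    for alternative in sorted(confusion_set):
        new_tokens = list(tokens)
        if alternative == EPS:
            del new_tokens[position]  # Deletion: drop the token entirely
        else:
            new_tokens[position] = alternative
        candidates.append(" ".join(new_tokens))
    return candidates


tokens = "I am an man .".split()
for sentence in candidate_sentences(tokens, 2, ARTICLES):
    print(sentence)
# I am a man .
# I am an man .
# I am the man .
# I am man .
```

Each candidate would then be scored by the language model, and the highest-scoring one kept.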


In [18]:
def generate_confusion_set(word):
    # Deal with articles and prepositions
    articles = {"ε", "a", "an", "the"}
    if word in articles:
        return articles
        
    preps = {"ε", "about", "at", "by", "for", "from", "in", "of",  "on", "to", "with"}
    if word in preps:
        return preps

    if not h.spell(word):
        confusion_set = set()
        suggested_words = h.suggest(word)
        confusion_set = confusion_set.union(suggested_words)
        for suggested_word in suggested_words:  # Add inflections of suggested words
            confusion_set = confusion_set.union(get_inflections_and_lemmas(suggested_word))
        return confusion_set
    else:  # word is a valid word in our Hunspell dict
        return get_inflections_and_lemmas(word).union(set([word]))


generate_confusion_set("committed")

{'commit', 'committed'}

# Log-likelihood
Given some text, what is the log-likelihood of that text according to our language model?
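
As a small worked example of scoring a sentence, the sketch below computes a chain-rule log-likelihood under a hand-written bigram table. The probabilities are invented purely for illustration, and `BIGRAM_PROBS` and `toy_log_likelihood` are hypothetical names; the actual scores in this notebook come from the neural models.

```python
import math

# Invented conditional probabilities P(next | previous); "<s>" marks the sentence start.
BIGRAM_PROBS = {
    ("<s>", "the"): 0.5,
    ("the", "defendant"): 0.2,
    ("defendant", "was"): 0.4,
    ("defendant", "were"): 0.05,
    ("was", "acquitted"): 0.1,
    ("were", "acquitted"): 0.1,
}


def toy_log_likelihood(tokens, unseen_prob=1e-6):
    """Sum log P(token | previous token) over the sentence (chain rule)."""
    total = 0.0
    previous = "<s>"
    for token in tokens:
        total += math.log(BIGRAM_PROBS.get((previous, token), unseen_prob))
        previous = token
    return total


good = toy_log_likelihood("the defendant was acquitted".split())
bad = toy_log_likelihood("the defendant were acquitted".split())
print(f"was:  {good:.3f}")  # log(0.004)
print(f"were: {bad:.3f}")   # log(0.0005)
print(good > bad)  # True: the more plausible sentence receives the higher score
```

Because every factor is a probability below one, log-likelihoods are negative and longer sentences tend to score lower, which is why some form of length adjustment is usually applied before comparing scores.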

In [19]:
import math

def log_likelihood(text, model, tokenizer):
    encoded = tokenizer(text, return_tensors="pt")
    for key in encoded:
        encoded[key] = encoded[key].to(device)
    N = len(encoded.input_ids[0])
    log_prob = 0.0
    if isinstance(model, GPT2LMHeadModel):
        # Causal LM: sum log P(token_i | tokens_<i) over the sequence.
        with torch.no_grad():
            outputs = model(**encoded)
        for idx in range(N - 1):  # Offset because first token is not predicted
            token_id = encoded.input_ids[0][idx + 1]
            log_prob += torch.log_softmax(outputs.logits[0][idx], dim=-1)[token_id].item()
    elif isinstance(model, DistilBertForMaskedLM):
        # Masked LM: mask each position in turn (one row per position) and
        # sum the log-probability of the true token at the masked position.
        stacked_masked = encoded.input_ids.repeat((N, 1))
        for idx in range(N):
            stacked_masked[idx, idx] = tokenizer.mask_token_id  # Avoid hard-coding the [MASK] id
        with torch.no_grad():
            outputs = model(stacked_masked)
        for token_idx, token_id in enumerate(encoded.input_ids[0]):
            log_prob += torch.log_softmax(outputs.logits[token_idx][token_idx], dim=-1)[token_id].item()
    else:
        raise TypeError(f"Unsupported model type: {type(model).__name__}")

    # Length adjustment: subtracts log(N) from the summed log-probability
    # (note that this is not a per-token average).
    return log_prob - math.log(N)


In [20]:
ll = log_likelihood("The defendant was acquitted.", BERT_tuned_model, BERT_tokenizer)
print(f"log-likelihood: {ll}")

log-likelihood: -28.518709796973457


In [21]:
ll = log_likelihood("The defendant was acquitted.", GPT_base_model, GPT_tokenizer)
print(f"log-likelihood: {ll}")

log-likelihood: -26.38870260092592


In [22]:
# TODO: Fix Errors Vs. Replace with [UNK] >> Test both
# i.e Test Fix Errors Vs. if not h.spell(word): word = '[UNK]'

In [23]:
"barratry" in BERT_tokenizer.vocab

False

In [24]:
h.spell("barratry")

True

# Putting it all together
So now we have all the components we need to define our GEC iterative algorithm that takes in text and suggests a more grammatically correct version.
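
The selection step at the heart of the algorithm can be sketched in isolation as follows. This is a simplified illustration rather than the notebook's implementation: `pick_best` is a hypothetical helper, and `score` stands in for a language-model log-likelihood (here a toy lookup table with invented values).

```python
def pick_best(candidates, score, threshold):
    """Return the highest-scoring candidate whose score exceeds the threshold, or None."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        candidate_score = score(candidate)
        if candidate_score > threshold and candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


# Invented scores standing in for model log-likelihoods.
toy_scores = {
    "The defendant were guilty.": -42.0,
    "The defendant was guilty.": -30.0,
    "The defendant be guilty.": -55.0,
}

print(pick_best(toy_scores, toy_scores.get, threshold=-200.0))  # The defendant was guilty.
print(pick_best(toy_scores, toy_scores.get, threshold=-10.0))   # None
```

A very negative threshold accepts the best candidate regardless of its absolute score, while a threshold closer to zero rejects every candidate and leaves the text unchanged.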

In [25]:
# TODO: Refactor this
import re

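def replace_word(text, from_word, to_word):
    if not from_word:  # Nothing to replace
        return text
    # Escape the word and anchor it at a word boundary so that, e.g., "an" does not match inside "man".
    pattern = fr'\b{re.escape(from_word)}(?=[!()\'\"?.,\s]|$)'
    if to_word == "ε":  # Deletion special case: drop the word and its preceding space
        return re.sub(' ' + pattern, '', text)
    return re.sub(pattern, lambda _: to_word, text)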

def highlight_words(text, words):
    highlighted_text = text
    for word in words:
        highlighted_text = replace_word(highlighted_text, word, "\033[92m" + word + "\033[00m")
    return highlighted_text

def capitalize_sentence(sentence):
    sentence = sentence.lstrip()  # Remove any whitespace from the start of a sentence.
    # Capitalize only if the first character is an ASCII lowercase letter.
    if sentence and "a" <= sentence[0] <= "z":
        sentence = sentence[0].upper() + sentence[1:]
    return sentence

def correct_word(text, word, idx, max_idx, ll_threshold, model, tokenizer):
    confusion_set = generate_confusion_set(word)
    # If no other valid alternatives are found
    if len(confusion_set) == 1 and word in confusion_set:
        return text, None
    max_ll = -999999.999
    best_alt_text = None
    best_alt_word = None
    # sum_ll = 0.0 # Useful for tuning
    for alt_word in confusion_set:
        alt_text = replace_word(text, word, alt_word)
        ll = log_likelihood(alt_text, model, tokenizer)
        # sum_ll += ll
        # print(ll > ll_threshold, ll, ll_threshold)
        if ll > max_ll and ll > ll_threshold:
            max_ll = ll
            best_alt_text = alt_text
            best_alt_word = alt_word
    
    if best_alt_text is not None:
        # if len(confusion_set) > 0:
        #     print(f"Average ll: {sum_ll / len(confusion_set)}")
        return best_alt_text, best_alt_word
    else:
        return text, None # Failed to find a better replacement

def correct_text(text, model, tokenizer, ll_threshold=-200.0, highlight_changes=False, capitalize_first=True):
    # First we want to break the text down into a list of words.
    words = re.sub(r'[^\w\s]', '', text).split(" ")
    corrected_text = text
    # Indices of words spelled incorrectly, we will address these first.
    spelling_errors = [not h.spell(word) for word in words]
    spelled_wrong_idx = [idx for idx, error in enumerate(spelling_errors) if error]
    spelled_correct_idx = [idx for idx, error in enumerate(spelling_errors) if not error]
    corrections = {}
    for idx in spelled_wrong_idx:  # First iterate over incorrectly spelled words.
        word = words[idx]
        corrected_text, corrected_word = correct_word(corrected_text, word, idx, len(words)-1, ll_threshold, model, tokenizer)
        if corrected_word != word and corrected_word is not None:
            corrections[word] = corrected_word
    for idx in spelled_correct_idx:
        word = words[idx]
        corrected_text, corrected_word = correct_word(corrected_text, word, idx, len(words)-1, ll_threshold, model, tokenizer)
        if corrected_word != word and corrected_word is not None:
            corrections[word] =  corrected_word

    if capitalize_first:
        corrected_text = capitalize_sentence(corrected_text)

    if highlight_changes:
        if not corrections:
            print("No corrections!")
        for original_word, corrected_word in corrections.items():
            if corrected_word is not None:
                print(f"\033[91m{original_word}\033[00m -> \033[92m{corrected_word}\033[00m")
        corrected_text = highlight_words(corrected_text, corrections.values())     

    return corrected_text


In [26]:
# gold = "The defendent were guilty."
gold = "It will start by a speech from the Director of the conference, followed by a meal."

In [27]:
import time

In [28]:
start_time = time.perf_counter()
cor = correct_text(gold, BERT_base_model, BERT_tokenizer, highlight_changes=True)
end_time = time.perf_counter()
print(f"Correcting took: {round(end_time-start_time, 2)}s")
print(cor)

No corrections!
Correcting took: 1.57s
It will start by a speech from the Director of the conference, followed by a meal.


In [29]:
start_time = time.perf_counter()
cor = correct_text(gold, BERT_tuned_model, BERT_tokenizer, highlight_changes=True)
end_time = time.perf_counter()
print(f"Correcting took: {round(end_time-start_time, 2)}s")
print(cor)

by -> with
from -> by
of -> at
Correcting took: 1.54s
It will start with a speech by the Director at the conference, followed with a meal.


In [30]:
start_time = time.perf_counter()
cor = correct_text(gold, GPT_base_model, GPT_tokenizer, highlight_changes=True)
end_time = time.perf_counter()
print(f"Correcting took: {round(end_time-start_time, 2)}s")
print(cor)

by -> with
from -> by
of -> to
Correcting took: 1.01s
It will start with a speech by the Director to the conference, followed with a meal.


In [31]:
start_time = time.perf_counter()
cor = correct_text(gold, GPT_tuned_model, GPT_tokenizer, highlight_changes=True)
end_time = time.perf_counter()
print(f"Correcting took: {round(end_time-start_time, 2)}s")
print(cor)

will -> would
from -> to
Director -> Directors
of -> at
Correcting took: 1.0s
It would start by a speech to the Directors at the conference, followed by a meal.


In [32]:
print(correct_text("I am an man.", GPT_tuned_model, GPT_tokenizer, highlight_changes=True))

an -> ε
man -> men
I am men.


## Interactive Output

In [33]:
from difflib import Differ

import gradio as gr


def diff_texts(text1):
    d = Differ()
    text1 = text1.strip("\n")
    text2 = correct_text(text1, GPT_tuned_model, GPT_tokenizer)
    diffs = [
        (token[2:], token[0] if token[0] != " " else None)
         for token in d.compare(text1, text2)
    ]
    change = [(text1, "Original"), (text2, "Corrected")]
    return diffs, change

demo = gr.Interface(
    fn=diff_texts,
    inputs=gr.Textbox(
            label="Input Text",
            lines=1,
            value="The defendant wos guilty.",
        ),
    outputs=[
        gr.HighlightedText(
        label="Diff",
        combine_adjacent=True,
        ).style(color_map={"-": "red", "+": "green"}),
        gr.HighlightedText(
            label="Change",
            combine_adjacent=True,
        ).style(color_map={"Original": "red", "Corrected": "green"}),
    ]
)

demo.launch(share=False, debug=True)


# Evaluation
We will evaluate our results using the ERRANT toolkit: https://github.com/chrisjbryant/errant
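
ERRANT reports precision, recall and $F_{0.5}$, a score that places more weight on precision than on recall. As a quick check of how those figures relate to the raw counts, the sketch below recomputes them with the standard $F_\beta$ formula. `f_beta` is a hypothetical helper (not part of ERRANT), and the counts are taken from the first validation result for GPT_tuned reported later in this notebook.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Return (precision, recall, F_beta) computed from raw edit counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, score


precision, recall, f05 = f_beta(tp=103, fp=47, fn=33)
print(round(precision, 4), round(recall, 4), round(f05, 4))  # 0.6867 0.7574 0.6997
```

With $\beta = 0.5$, a false positive (an unnecessary or wrong edit) costs more than a missed error, which suits a corrector that should avoid damaging text that was already correct.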

### Generating Sentences with Errors.

In [45]:
import random

def generate_incorrect_word(word):
    # Flip a coin, whether to generate an incorrect spelling, or confusion set.
    if random.randint(0, 1) == 1:
        char_list = list(word)
        if not char_list:
            return word
        rand_idx_1 = random.randint(0, len(char_list)-1)
        first_char = char_list[rand_idx_1]
        rand_idx_2 = random.randint(0, len(char_list)-1)
        if rand_idx_2 == rand_idx_1:
            rand_idx_2 = (rand_idx_1 + 1) % len(char_list)
        second_char = char_list[rand_idx_2]
        # Swap chars, simulating a typo
        char_list[rand_idx_1] = second_char
        char_list[rand_idx_2] = first_char
        return "".join(char_list)
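    else:
        conf_set = generate_confusion_set(word)
        if not conf_set:  # e.g. a non-word with no Hunspell suggestions
            return word
        # Sort first so that the choice is reproducible for a fixed random seed
        return random.choice(sorted(conf_set))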

def generate_incorrect_sentence(text):
    words = re.sub(r'[^\w\s]', '', text).split()  # split() drops empty tokens
    if not words:  # Nothing to corrupt (e.g. punctuation-only text)
        return text
    num_words_to_errorize = random.randint(1, 3)
    for _ in range(num_words_to_errorize):
        og_word = random.choice(words)
        incorrect_word = generate_incorrect_word(og_word)
        text = replace_word(text, og_word, incorrect_word)
    return text

In [35]:
generate_incorrect_sentence("The Question of Jurisdiction .")

'The Questioning fo Jurisdiction .'

In [37]:
val_file = f"{env_path}/Data/all_val.txt"
with open(val_file, "r") as f:
    val_text = f.read().split("\n")

test_file = f"{env_path}/Data/all_test.txt"
with open(test_file, "r") as f:
    test_text = f.read().split("\n")

### Tuning the Threshold $\tau$
In order to tune the log-likelihood threshold $\tau$ on the validation set,
we combine all of the above evaluation steps into a single function.

In [40]:
import subprocess
def run_cmd(cmd):
    return subprocess.run(cmd, shell=True, capture_output=True, encoding='UTF-8').stdout

In [43]:
from rich.progress import track

def evaluate_models(mode, params=None, max_sentence_length=50, num_sentences=1000):
    if mode == "val":
        eval_text = val_text
    elif mode == "test":
        eval_text = test_text
    else:
        raise ValueError(f"mode must be 'val' or 'test', got {mode!r}")
    gold_sentences = [sentence for sentence in eval_text if len(sentence) < max_sentence_length and sentence][:num_sentences]
    print("Generating incorrect sentences...")
    incorrect_sentences = [generate_incorrect_sentence(sentence) for sentence in gold_sentences]
    print("Saving results...")
    with open(f"{env_path}/Data/gold_sentences_{mode}.txt", "w") as f:
        f.write("\n".join(gold_sentences))
    with open(f"{env_path}/Data/incorrect_sentences_{mode}.txt", "w") as f:
        f.write("\n".join(incorrect_sentences))
    true_edits_path = f"'{env_path}/Data/true_edits_{mode}.m2'"
    incorrect_path = f"'{env_path}/Data/incorrect_sentences_{mode}.txt'"
    gold_path = f"'{env_path}/Data/gold_sentences_{mode}.txt'"
    print("Generating gold reference edits m2 file...")
    cmd = f"errant_parallel -orig {incorrect_path} -cor {gold_path} -out {true_edits_path}"
    print(run_cmd(cmd))
    model_tokenizer_triplets = (
        (GPT_tuned_model, GPT_tokenizer, "GPT_tuned"),
        (GPT_base_model, GPT_tokenizer, "GPT_base"),
        (BERT_tuned_model, BERT_tokenizer, "BERT_tuned"),
        (BERT_base_model, BERT_tokenizer, "BERT_base")
    )
    for (model, tokenizer, model_name) in model_tokenizer_triplets: 
        if params:
            tau_vals = (params[model_name], -1000.0)
        else:
            tau_vals = (-100.0,)
        for tau in tau_vals:
            print(f"Starting model evaluation for {model_name}, on the {mode} set, with τ={tau}.")
            corrected_sentences = []
            for idx in track(range(len(incorrect_sentences)), description=f"Correcting {mode} sentences using {model_name}..."):
                sentence = incorrect_sentences[idx]
                corrected_sentences.append(correct_text(sentence, model, tokenizer, ll_threshold=tau))
            with open(f"{env_path}/Data/{model_name}_corrected_sentences_{mode}.txt", "w") as f:
                f.write("\n".join(corrected_sentences))
        
            corrected_path = f"'{env_path}/Data/{model_name}_corrected_sentences_{mode}.txt'"
            pred_edits_path = f"'{env_path}/Data/{model_name}_pred_edits_{mode}.m2'"
            cmd = f"errant_parallel -orig {incorrect_path} -cor {corrected_path} -out {pred_edits_path}"
            print(run_cmd(cmd))
            cmd = f"errant_compare -hyp {pred_edits_path} -ref {true_edits_path}"
            print(f"{mode} results for {model_name}, τ={tau}:")
            print(run_cmd(cmd))

In [None]:
# TODO: Tuning Tau
evaluate_models("val", num_sentences=100, params={"GPT_tuned": -100.0, "GPT_base": -200.0, "BERT_tuned": -100.0, "BERT_base": -200.0})

Validation results:

| Model | τ | TP | FP | FN | Prec | Rec | F0.5 |
|---|---|---|---|---|---|---|---|
| GPT_tuned | -80.0 | 103 | 47 | 33 | 0.6867 | 0.7574 | 0.6997 |
| GPT_tuned | -1000.0 | 106 | 47 | 30 | 0.6928 | 0.7794 | 0.7086 |
| BERT_tuned | -90.0 | 95 | 55 | 41 | 0.6333 | 0.6985 | 0.6454 |
| BERT_tuned | -1000.0 | 100 | 58 | 36 | 0.6329 | 0.7353 | 0.651 |


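Interestingly, for our data, higher (less negative) values of $\tau$ only make the $F_{0.5}$ score worse: for GPT_tuned it drops from 0.7086 at $\tau=-1000$ to 0.6997 at $\tau=-80$, and for BERT_tuned from 0.651 at $\tau=-1000$ to 0.6454 at $\tau=-90$. We therefore leave $\tau$ at a relatively low value (that is, a large negative number).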

### Final Evaluation

In [44]:
num_runs = 10
for i in range(num_runs):
    print("")
    print("")
    print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
    print(f"============================ Run {i} / {num_runs-1} =============================================")
    print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
    evaluate_models("test", num_sentences=1000)



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Output()

Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
941	451	287	0.676	0.7663	0.6923


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
878	636	350	0.5799	0.715	0.6027


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
901	502	327	0.6422	0.7337	0.6586


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
840	652	388	0.563	0.684	0.5837




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Output()

Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
926	472	302	0.6624	0.7541	0.6789


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
851	654	377	0.5654	0.693	0.5871


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
862	521	366	0.6233	0.702	0.6376


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
790	685	438	0.5356	0.6433	0.5542




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Output()

Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
946	502	330	0.6533	0.7414	0.6692


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
896	661	380	0.5755	0.7022	0.597


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
923	525	353	0.6374	0.7234	0.6529


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
842	678	434	0.5539	0.6599	0.5723




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Output()

Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
977	496	313	0.6633	0.7574	0.6802


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
900	676	390	0.5711	0.6977	0.5926


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
943	553	347	0.6303	0.731	0.6482


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
862	705	428	0.5501	0.6682	0.5703




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Output()

Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
942	511	324	0.6483	0.7441	0.6654


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
877	662	389	0.5699	0.6927	0.5908


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
902	557	364	0.6182	0.7125	0.635


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
815	711	451	0.5341	0.6438	0.5529




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Output()

Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
949	485	308	0.6618	0.755	0.6785


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
885	662	372	0.5721	0.7041	0.5944


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:


Output()


TP	FP	FN	Prec	Rec	F0.5
921	546	336	0.6278	0.7327	0.6463


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
850	720	407	0.5414	0.6762	0.5639




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Output()

Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
950	504	323	0.6534	0.7463	0.6701


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
883	697	390	0.5589	0.6936	0.5815


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
914	530	359	0.633	0.718	0.6483


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
827	704	446	0.5402	0.6496	0.559




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
963	500	318	0.6582	0.7518	0.675


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
913	664	368	0.5789	0.7127	0.6015


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
958	525	323	0.646	0.7479	0.6641


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
872	696	409	0.5561	0.6807	0.5773




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
931	513	354	0.6447	0.7245	0.6593


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
891	672	394	0.5701	0.6934	0.5911


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
921	542	364	0.6295	0.7167	0.6452


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
827	702	458	0.5409	0.6436	0.5587




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Generating incorrect sentences...
Saving results...
Generating gold reference edits m2 file...


Loading resources...
Processing parallel files...

Starting model evaluation for GPT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_tuned, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
976	522	331	0.6515	0.7467	0.6686


Starting model evaluation for GPT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for GPT_base, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
921	688	386	0.5724	0.7047	0.5947


Starting model evaluation for BERT_tuned, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_tuned, τ=-100.0:



TP	FP	FN	Prec	Rec	F0.5
950	563	357	0.6279	0.7269	0.6455


Starting model evaluation for BERT_base, on the test set, with τ=-100.0.


Loading resources...
Processing parallel files...

test results for BERT_base, τ=-100.0:

TP	FP	FN	Prec	Rec	F0.5
862	720	445	0.5449	0.6595	0.5645




## Final Results
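### ERRANT Metrics
All scores below are span-based correction metrics (TP, FP, FN, precision, recall, F0.5) as reported by ERRANT: https://github.com/chrisjbryant/errant

Each table lists one row per evaluation run.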
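
As a sanity check on how the Prec, Rec and F0.5 columns relate to the raw counts, the sketch below recomputes them from TP, FP and FN with the standard F-beta definition (β = 0.5, which weights precision more heavily than recall). The helper name `prf` is hypothetical and is not part of ERRANT; mapping zero denominators to 0.0 is also an assumption made here for simplicity.

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int, beta: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F-beta), each rounded to 4 decimals.

    Hypothetical helper: zero denominators are mapped to 0.0 (an assumption).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return round(precision, 4), round(recall, 4), round(f_beta, 4)


# First GPT_tuned test run reported in this notebook: TP=941, FP=451, FN=287
print(prf(941, 451, 287))  # -> (0.676, 0.7663, 0.6923)
```

The printed tuple agrees with the corresponding row in the test results, which suggests the reported columns follow this definition.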


### Validation Results

```
val results for GPT_tuned, τ=-100.0:
=========== Span-Based Correction ============
TP	FP	FN	Prec	Rec	F0.5
1028	459	296	0.6913	0.7764	0.7068
989	498	313	0.6651	0.7596	0.6821
1000	507	315	0.6636	0.7605	0.6809
1000	521	334	0.6575	0.7496	0.674
1051	461	279	0.6951	0.7902	0.7123
1065	468	300	0.6947	0.7802	0.7103
1025	506	299	0.6695	0.7742	0.6881
997	494	317	0.6687	0.7588	0.6849
980	508	309	0.6586	0.7603	0.6767
965	490	284	0.6632	0.7726	0.6826
==============================================

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

val results for GPT_base, τ=-100.0:
=========== Span-Based Correction ============
TP	FP	FN	Prec	Rec	F0.5
942	695	382	0.5754	0.7115	0.5983
893	736	409	0.5482	0.6859	0.5711
946	698	369	0.5754	0.7194	0.5994
915	741	419	0.5525	0.6859	0.5749
934	700	396	0.5716	0.7023	0.5937
969	677	396	0.5887	0.7099	0.6095
952	708	372	0.5735	0.719	0.5977
942	687	372	0.5783	0.7169	0.6015
884	726	405	0.5491	0.6858	0.5719
896	675	353	0.5703	0.7174	0.5947
==============================================

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

val results for BERT_tuned, τ=-100.0:
=========== Span-Based Correction ============
TP	FP	FN	Prec	Rec	F0.5
983	516	341	0.6558	0.7424	0.6714
943	555	359	0.6295	0.7243	0.6464
966	547	349	0.6385	0.7346	0.6556
972	572	362	0.6295	0.7286	0.6471
994	544	336	0.6463	0.7474	0.6643
1010	527	355	0.6571	0.7399	0.6722
986	538	338	0.647	0.7447	0.6644
963	533	351	0.6437	0.7329	0.6598
966	555	323	0.6351	0.7494	0.6551
938	521	311	0.6429	0.751	0.662
==============================================

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

val results for BERT_base, τ=-100.0:
=========== Span-Based Correction ============
TP	FP	FN	Prec	Rec	F0.5
871	753	453	0.5363	0.6579	0.5569
834	778	468	0.5174	0.6406	0.5381
847	773	468	0.5228	0.6441	0.5433
877	738	457	0.543	0.6574	0.5626
879	755	451	0.5379	0.6609	0.5587
893	767	472	0.538	0.6542	0.5578
887	762	437	0.5379	0.6699	0.56
874	741	440	0.5412	0.6651	0.5621
853	754	436	0.5308	0.6618	0.5527
851	735	398	0.5366	0.6813	0.5604
==============================================
```
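
To work with these tables programmatically, a block in the format above can be parsed by splitting each line on whitespace and keeping only the six-column numeric rows. This is a minimal sketch: `Row`, `parse_rows` and the `sample` string are hypothetical names introduced for illustration, and the sample contains only the first two BERT_base validation rows.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    tp: int
    fp: int
    fn: int
    prec: float
    rec: float
    f05: float


def parse_rows(block: str) -> List[Row]:
    """Extract the numeric result rows from an ERRANT-style text block."""
    rows = []
    for line in block.splitlines():
        parts = line.split()  # splits on tabs and spaces alike
        # Skip titles, separators and the column header line.
        if len(parts) != 6 or not parts[0].isdigit():
            continue
        tp, fp, fn = (int(p) for p in parts[:3])
        prec, rec, f05 = (float(p) for p in parts[3:])
        rows.append(Row(tp, fp, fn, prec, rec, f05))
    return rows


sample = """val results for BERT_base, τ=-100.0:
=========== Span-Based Correction ============
TP\tFP\tFN\tPrec\tRec\tF0.5
871\t753\t453\t0.5363\t0.6579\t0.5569
834\t778\t468\t0.5174\t0.6406\t0.5381
=============================================="""

rows = parse_rows(sample)
# Pool the raw counts across the parsed runs.
pooled = (sum(r.tp for r in rows), sum(r.fp for r in rows), sum(r.fn for r in rows))
print(len(rows), pooled)  # -> 2 (1705, 1531, 921)
```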

### Test Results

```
test results for GPT_tuned, τ=-100.0:
=========== Span-Based Correction ============
TP	FP	FN	Prec	Rec	F0.5
941	451	287	0.676	0.7663	0.6923
926	472	302	0.6624	0.7541	0.6789
946	502	330	0.6533	0.7414	0.6692
977	496	313	0.6633	0.7574	0.6802
942	511	324	0.6483	0.7441	0.6654
949	485	308	0.6618	0.755	0.6785
950	504	323	0.6534	0.7463	0.6701
963	500	318	0.6582	0.7518	0.675
931	513	354	0.6447	0.7245	0.6593
976	522	331	0.6515	0.7467	0.6686
==============================================

test results for GPT_base, τ=-100.0:
=========== Span-Based Correction ============
TP	FP	FN	Prec	Rec	F0.5
878	636	350	0.5799	0.715	0.6027
851	654	377	0.5654	0.693	0.5871
896	661	380	0.5755	0.7022	0.597
900	676	390	0.5711	0.6977	0.5926
877	662	389	0.5699	0.6927	0.5908
885	662	372	0.5721	0.7041	0.5944
883	697	390	0.5589	0.6936	0.5815
913	664	368	0.5789	0.7127	0.6015
891	672	394	0.5701	0.6934	0.5911
921	688	386	0.5724	0.7047	0.5947
==============================================

test results for BERT_tuned, τ=-100.0:
=========== Span-Based Correction ============
TP	FP	FN	Prec	Rec	F0.5
901	502	327	0.6422	0.7337	0.6586
862	521	366	0.6233	0.702	0.6376
923	525	353	0.6374	0.7234	0.6529
943	553	347	0.6303	0.731	0.6482
902	557	364	0.6182	0.7125	0.635
921	546	336	0.6278	0.7327	0.6463
914	530	359	0.633	0.718	0.6483
958	525	323	0.646	0.7479	0.6641
921	542	364	0.6295	0.7167	0.6452
950	563	357	0.6279	0.7269	0.6455
==============================================

test results for BERT_base, τ=-100.0:
=========== Span-Based Correction ============
TP	FP	FN	Prec	Rec	F0.5
840	652	388	0.563	0.684	0.5837
790	685	438	0.5356	0.6433	0.5542
842	678	434	0.5539	0.6599	0.5723
862	705	428	0.5501	0.6682	0.5703
815	711	451	0.5341	0.6438	0.5529
850	720	407	0.5414	0.6762	0.5639
827	704	446	0.5402	0.6496	0.559
872	696	409	0.5561	0.6807	0.5773
827	702	458	0.5409	0.6436	0.5587
862	720	445	0.5449	0.6595	0.5645
==============================================
```
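
Since each model has ten test runs, one way to compare them is to summarize F0.5 by its mean and sample standard deviation across runs. The sketch below does this with the standard library; the values are copied from the test tables above, and `test_f05` and `summarize` are hypothetical names, not part of the notebook's code.

```python
from statistics import mean, stdev

# F0.5 per run on the test set (τ=-100.0), copied from the tables above.
test_f05 = {
    "GPT_tuned": [0.6923, 0.6789, 0.6692, 0.6802, 0.6654,
                  0.6785, 0.6701, 0.675, 0.6593, 0.6686],
    "GPT_base": [0.6027, 0.5871, 0.597, 0.5926, 0.5908,
                 0.5944, 0.5815, 0.6015, 0.5911, 0.5947],
    "BERT_tuned": [0.6586, 0.6376, 0.6529, 0.6482, 0.635,
                   0.6463, 0.6483, 0.6641, 0.6452, 0.6455],
    "BERT_base": [0.5837, 0.5542, 0.5723, 0.5703, 0.5529,
                  0.5639, 0.559, 0.5773, 0.5587, 0.5645],
}


def summarize(scores):
    """Return {model: (mean, sample standard deviation)} over runs."""
    return {model: (mean(vals), stdev(vals)) for model, vals in scores.items()}


summary = summarize(test_f05)
# Print models from highest to lowest mean F0.5.
for model, (mu, sigma) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{model:<11} mean F0.5 = {mu:.4f}  std = {sigma:.4f}")
```

The same function can be applied to the validation tables by swapping in their F0.5 columns.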