# spellchk: default program

In [26]:
from default import *

## Documentation

Read `answer/default.py` starting with the `spellchk` function and see how it solves the task of spell correction using a pre-trained language model that can predict a replacement token for a masked token in the input.

In your submission, write some beautiful documentation of your program here.

In [27]:
from io import StringIO
with StringIO("4\tit will put your maind into non-stop learning.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

4	it will put your mind into non-stop learning.


## Analysis

Do some analysis of the results. What ideas did you try? What worked and what did not?

The dev score of the default solution was 0.23.

Our first change was to `select_correction`, which is supposed to return the most likely correction for a typo from a given list of predictions. The initial version simply returned the first item in the list, with no scoring or sorting.

In [28]:
def select_correction(typo, predict):
    # return the most likely prediction for the mask token
    for val in predict:
        #first change
        val["combined_len_diff"] = max(len(set(val["token_str"]) - set(typo)), len(set(typo) - set(val["token_str"]))) + abs(len(val["token_str"]) - len(typo))
        val["overall_score"] = - val["combined_len_diff"]
    predict = sorted(predict, key = lambda x: x["overall_score"], reverse = True)
    return predict[0]['token_str']

In [29]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	therefore I decided to remove the symbols The objects were displaying including ancient artifacts and modren art. whereas the solution for me was the traditional cosmic symbolism where I tried a lot abuot ##uration and biollogy.


Our version of `select_correction` loops over the list of predictions and, for each one, computes a combined difference from the typo: the larger of the two letter-set differences (letters in the prediction but not in the typo, and letters in the typo but not in the prediction) plus the absolute difference in length. The negative of this value becomes the overall score, so higher scores indicate better matches. After calculating all the scores, we sort the list in descending order of score and return the top item as the prediction.

Our reasoning was that simply choosing the first item in the list was clearly not going to work, so we changed `select_correction` first. A misspelled word is usually still very similar to the intended word. For example, "farther" and "father" share the same letter set, so it is useful to compare the letter set of each prediction with that of the typo. As another example, for the typo "wera" the model could propose either "wear" or "pear", and comparing letter sets favors "wear". The next issue was candidates with the same letter set but very different lengths, where comparing letter sets alone would fail. For example, for the typo "biilling" the model could propose both "bling" and "billing", which have the same letter set, so we differentiate them by length, and "billing" wins as the correct solution.
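The two examples in the paragraph above can be checked directly. The sketch below re-implements the scoring from the `select_correction` cell as a standalone helper; the names `letter_set_score` and `rank` are ours for illustration and are not part of `default.py`.

```python
def letter_set_score(typo, candidate):
    """Negative combined difference, mirroring the select_correction cell above."""
    extra = len(set(candidate) - set(typo))      # letters only in the candidate
    missing = len(set(typo) - set(candidate))    # letters only in the typo
    length_gap = abs(len(candidate) - len(typo))
    return -(max(extra, missing) + length_gap)

def rank(typo, candidates):
    """Sort candidates from best (highest score) to worst."""
    return sorted(candidates, key=lambda c: letter_set_score(typo, c), reverse=True)

print(rank("wera", ["pear", "wear"]))          # letter sets separate these two
print(rank("biilling", ["bling", "billing"]))  # same letter set, length decides
```

For "wera", the candidate "wear" scores 0 and "pear" scores -1. For "biilling", both candidates have the same letter set, so only the length gap matters: "billing" scores -1 and "bling" scores -3.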

After this change our dev score rose to 0.45. As the output above shows, the typos are now being replaced, but often not with the correct word, so there was clearly more work to do.

The next obvious area for improvement was the number of predictions (the `top_k` value). We increased it from 20 to 28, a value found by trial and error. Of the values we tested, it gave the best trade-off between precision and running time.

In [30]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	therefore I decided to remove the symbols The objects were displaying including ancient artifacts and modren art. whereas the solution for me was the traditional cosmic symbolism where I tried a lot abuot ##uration and biollogy.


Changing `top_k` to 28 raised our dev score to 0.46.

Next we looked at how the `spellchk` function works: it corrects the typos in the original sentence by replacing them with the predicted corrections, then yields the result for each line of the input file. The next change we tried was:

In [33]:
def spellchk(fh):
     for (locations, sent) in get_typo_locations(fh):
        spellchk_sent = sent
        for i in locations:
            # predict top_k replacements only for the typo word at index i
            predict = fill_mask(
                " ".join([ sent[j] if j != i else mask for j in range(len(sent)) ]), 
                top_k=28
            )
            if i < len(sent):
                predict+=fill_mask(
                    " ".join([ sent[:i+1][j] if j != i else mask for j in range(len(sent[:i+1])) ]), 
                    top_k=28
                )
            logging.info(predict)
            spellchk_sent[i] = select_correction(sent[i], predict)
        yield(locations, spellchk_sent)

We extended the candidate pool for each typo by querying the model twice: once with the full sentence, so the mask sees both left and right context, and once with only the tokens to the left of the typo, so the mask is the final token. The predictions from both queries are concatenated before being passed to `select_correction`. The result is a larger and more varied set of candidates, which can give the selection step a better chance of finding a close match. This modification improved the dev score to 0.5.
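To make the two queries concrete, the sketch below builds the two masked strings that the cell above would send to `fill_mask` for a single typo, without calling the model. The helper name `masked_inputs` and the literal `"[MASK]"` are assumptions for illustration; the notebook uses whatever `mask` is defined in `default.py`.

```python
MASK = "[MASK]"  # assumed placeholder; default.py defines the real `mask` value

def masked_inputs(sent, i, mask=MASK):
    """Return (full-context input, left-context-only input) for the typo at index i."""
    full = " ".join(tok if j != i else mask for j, tok in enumerate(sent))
    left_only = " ".join(sent[:i] + [mask])  # same string as masking sent[:i+1] at i
    return full, left_only

sent = "it will put your maind into non-stop learning.".split()
full, left_only = masked_inputs(sent, 4)
print(full)       # it will put your [MASK] into non-stop learning.
print(left_only)  # it will put your [MASK]
```

Since every typo location is a valid index into the sentence, the `i < len(sent)` guard in the cell above always holds, so both queries run for every typo.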

In [38]:
def spellchk(fh):
     for (locations, sent) in get_typo_locations(fh):
        spellchk_sent = sent
        for i in locations:
            # predict top_k replacements only for the typo word at index i
            predict = fill_mask(
                " ".join([ sent[j] if j != i else mask for j in range(len(sent)) ]), 
                top_k=35
            )
            if i < len(sent):
                predict+=fill_mask(
                    " ".join([ sent[:i+1][j] if j != i else mask for j in range(len(sent[:i+1])) ]), 
                    top_k=35
                )
            logging.info(predict)
            spellchk_sent[i] = select_correction(sent[i], predict)
        yield(locations, spellchk_sent)

In [39]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	yesterday I decided to visit the museum The exhibits were interesting showing things renaissance and modren art. before the thing for me was the picture in question where I learned a lot abuot anatomy and biollogy.


Raising `top_k` to 35, as in the cell above, got us to 0.52. While this adjustment helped us select more suitable replacements for the masked words, the test sentence output shows that errors remained: in some instances the replacement was not the word we intended.

To address this issue, we revised our code further. The aim was for the length of the replacement to stay closer to that of the word being replaced, reducing the cases where the replacement's length deviates from the intended target and so improving overall accuracy.

In [44]:
def spellchk(fh):
     for (locations, sent) in get_typo_locations(fh):
        corrected_sent = sent[:]
        for i in locations:
            masked_sentence = " ".join([sent[j] if j != i else mask for j in range(len(sent))])
            predictions = fill_mask(masked_sentence, top_k=28)
            corrected_sent[i] = select_correction(sent[i], predictions)
        yield (locations, corrected_sent)

In [45]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	here I decided to find the dust The effects were plainly suggesting the ##ics and modren art. for the sight for me was the perfect ##ne ##n where I learned a lot abuot ##ure and biollogy.


The dev score remained at 0.52, and the word replacements still did not match what we intended. Additionally, some replacements contain "##". In response, we did some research and found that the Levenshtein distance would be useful. To use it, we had to choose between writing a custom helper function and importing a library. Since we were unsure whether we could add another library, we opted to implement our own helper function.

In [None]:
def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]

def select_correction(typo, predict):
    # return the most likely prediction for the mask token
    # Apply Levenshtein distance to find the closest prediction
    closest_prediction = min(predict, key=lambda x: levenshtein_distance(typo, x["token_str"]))
    return closest_prediction["token_str"]

In [47]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	here I decided to find the dust The effects were plainly suggesting the ##ics and modren art. for the sight for me was the perfect ##ne ##n where I learned a lot abuot ##ure and biollogy.


The updated score is now 0.56, an improvement, but the same problem remains: the words are not being replaced properly. So we decided to improve our `spellchk` function.





In [48]:
def spellchk(fh):
     for (locations, sent) in get_typo_locations(fh):
        spellchk_sent = sent.copy()  # Create a copy of the sentence tokens
        for i in locations:
            # Formulate the sentence with a masked token for the current typo
            masked_sentence = " ".join(sent[j] if j != i else mask for j in range(len(sent)))
            # Generate predictions for the masked token
            predict = fill_mask(masked_sentence, top_k=50)  # Increased top_k for a broader selection

            # Select the best correction based on the enhanced selection logic
            correction = select_correction(sent[i], predict)
            spellchk_sent[i] = correction
        
        yield (locations, spellchk_sent)

In [49]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	yesterday I decided to view the dust The effects were painted showing the ##ics and modren art. for the sight for me was the perfect ##ence ##n where I learned a lot abuot ##ron and biollogy.




Increasing `top_k` to 50 raised our dev score to 0.57, so we had made progress, but the primary concern persisted: replacements containing "##", the prefix that WordPiece-style tokenizers use for pieces that continue a word. To address this, we implemented the following.

In [55]:
def select_correction(typo, predict):
    # Assuming 'predict' includes a confidence score along with the token string
    filtered_predict = [p for p in predict if "##" not in p["token_str"]]
    
    if not filtered_predict:
        filtered_predict = predict
    
    # Combine Levenshtein distance and model confidence to select the best correction
    def combined_score(prediction):
        levenshtein_score = levenshtein_distance(typo, prediction["token_str"])
        confidence_score = prediction.get("score", 0)  # Use model confidence if available
        return levenshtein_score - confidence_score  # Adjust weights as necessary
    
    closest_prediction = min(filtered_predict, key=combined_score)
    return closest_prediction["token_str"]

In [56]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	yesterday I decided to view the dust The effects were painted showing the shapes and modren art. for the sight for me was the perfect lens , where I learned a lot abuot , and biollogy.


The dev score stayed at 0.57. We did successfully solve the issue with the "##" strings, but the score did not go up as much as we expected, so while fine-tuning we increased the `top_k` value further.
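The filtering and combined scoring described above can be tried on a small hand-made prediction list, without the model. This is a minimal sketch: `edit_distance` and `pick` are illustrative names, the prediction dictionaries are hypothetical data, and the `score` field is assumed to be a probability between 0 and 1.

```python
def edit_distance(a, b):
    """Levenshtein distance via the usual row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def pick(typo, predictions):
    """Drop '##' pieces (unless nothing else is left), then minimise distance - score."""
    whole_words = [p for p in predictions if "##" not in p["token_str"]] or predictions
    best = min(whole_words,
               key=lambda p: edit_distance(typo, p["token_str"]) - p.get("score", 0.0))
    return best["token_str"]

# Hypothetical predictions, for illustration only.
print(pick("astronomey", [
    {"token_str": "##ron", "score": 0.30},
    {"token_str": "anatomy", "score": 0.05},
    {"token_str": "astronomy", "score": 0.01},
]))
print(pick("wher", [
    {"token_str": "when", "score": 0.20},
    {"token_str": "where", "score": 0.30},
]))
```

Under the assumption that `score` is a probability, subtracting it from a whole-number edit distance can only break ties between candidates at the same distance: "when" and "where" are both one edit away from "wher", so the higher score decides. A candidate that is one edit further away can never overtake a closer one this way.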

In [63]:
def spellchk(fh):
     for (locations, sent) in get_typo_locations(fh):
        spellchk_sent = sent.copy()  # Create a copy of the sentence tokens
        for i in locations:
            # Formulate the sentence with a masked token for the current typo
            masked_sentence = " ".join(sent[j] if j != i else mask for j in range(len(sent)))
            # Generate predictions for the masked token
            predict = fill_mask(masked_sentence, top_k=180)  # Increased top_k for a broader selection

            # Select the best correction based on the enhanced selection logic
            correction = select_correction(sent[i], predict)
            spellchk_sent[i] = correction
        
        yield (locations, spellchk_sent)

In [62]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	yesterday I decided to visit the masses The bits were painting showing ancient crystals and modren art. however the highlight for me was the treatise lens thing where I learned a lot abuot anatomy and biollogy.


Increasing `top_k` further, to 180, got us to 0.65. We tried a few other methods after this, but none were very rewarding, and we unfortunately ran out of time to improve the code further.

In [None]:
def select_correction(typo, predict):
    def combined_score(prediction):
        weight_levenshtein = 1.0
        weight_confidence = 1.0 
        levenshtein_score = levenshtein_distance(typo, prediction["token_str"])
        confidence_score = prediction.get("score", 0) 
        normalized_levenshtein = levenshtein_score / max(len(typo), len(prediction["token_str"]))
        return (weight_levenshtein * normalized_levenshtein) - (weight_confidence * confidence_score)
    
    filtered_predict = [p for p in predict if "##" not in p["token_str"]]

    if not filtered_predict:
        filtered_predict = predict
        
    closest_prediction = min(filtered_predict, key=combined_score)
    return closest_prediction["token_str"]

In [65]:
from io import StringIO
with StringIO("0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33\tYestrday, I decded to vissit the musuem. The exhbits were fasinating, showcassing anshent artifacs and modren art. Howevr, the highligt for me was the interractive siense secction, wher I lernd a lot abuot astronomey and biollogy.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

0,2,4,6,8,10,11,12,13,17,19,24,25,26,27,29,30,31,33	yesterday I decided to visit the masses The bits were painting showing ancient crystals and modren art. however the sight for me was the treatise lens thing where I learned a lot abuot anatomy and biollogy.


Above you can see that we tried normalizing the Levenshtein distance and adjusting the weights to fine-tune further, but our results remained the same or got lower.
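One possible explanation for results staying the same or getting lower is that normalizing the distance puts it on the same 0 to 1 scale as the model confidence, so a confident but distant candidate can overtake a close one. The sketch below illustrates this effect with hypothetical scores; it is not a record of what the model actually returned, and `edit_distance` and `pick` are illustrative names.

```python
def edit_distance(a, b):
    """Levenshtein distance via the usual row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def pick(typo, predictions, normalize):
    """Minimise (distance - score), optionally dividing distance by the longer length."""
    def cost(p):
        d = edit_distance(typo, p["token_str"])
        if normalize:
            d = d / max(len(typo), len(p["token_str"]))
        return d - p["score"]
    return min(predictions, key=cost)["token_str"]

# Hypothetical confidences: the close match is rare, the distant one is likely.
predictions = [
    {"token_str": "highlight", "score": 0.01},  # 1 edit from "highligt"
    {"token_str": "sight", "score": 0.45},      # 4 edits from "highligt"
]
print(pick("highligt", predictions, normalize=False))  # raw distance dominates
print(pick("highligt", predictions, normalize=True))   # confidence can now win
```

With the raw distance the costs are 0.99 and 3.55, so "highlight" is chosen. With normalization they become roughly 0.10 and 0.05, so "sight" is chosen instead. Lowering the confidence weight relative to the distance weight would be one way to counteract this.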

## Group work
We worked together in the same room for much of the project; the split of the work is below.
* mkubavat mostly did the testing and coding for default.py and helped with comments in the Jupyter notebook.
* yserai helped with fine-tuning and idea generation, and mostly wrote the report and Jupyter notebook.
