Name: Marcel Aguilar Garcia

Student ID: 20235620

# Assignment 1

This assignment involves creating a spellchecking system and evaluating its performance. You may use the Python code snippets provided, or the programming language or environment of your choice.

Please start by downloading the corpus `holbrook.txt` from Blackboard.

The file consists of lines of text, with one sentence per line. Errors in a line are marked with a `|` as follows:

    My siter|sister go|goes to Tonbury .
    
In this case the word 'siter' was corrected to 'sister' and the word 'go' was corrected to 'goes'.

In some places in the corpus, two words may be corrected to a single word, or one word to multiple words. This is denoted in the data using underscores, e.g.,

    My Mum goes out some_times|sometimes .
    
For the purpose of this assignment you do not need to separate these words, but instead you may treat them like a single token.

*Note: you may use any functions from NLTK to complete the assignment. It should not be necessary to use other libraries and so please consult with us if your solution involves any other external library. If you use any function from NLTK in Task 6 please include a brief description of this function and how it contributes to your solution.*

## Task 1 (10 Marks)

Write a parser that can read all the lines of the file `holbrook.txt` and print out, for each line, the original (misspelled) text, the corrected text and the indexes of any changes. The indexes refer to the positions of the words in the sentence. In the example given, there is only an error in the 10th word and so the list of indexes is [9]. It is not necessary to analyze where the error occurs inside the word.

Then split your data into a test set of 100 lines and a training set.
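
As an illustration of the line format described above, here is a minimal sketch that parses a single annotated line. It assumes plain whitespace splitting via `str.split` rather than an NLTK tokenizer, and `parse_line` is a hypothetical helper name used only for this example.

```python
def parse_line(line):
    """Split one annotated line into original tokens, corrected tokens and error indexes."""
    original, corrected, indexes = [], [], []
    for index, token in enumerate(line.split()):
        if '|' in token:
            # Annotated tokens look like "misspelled|corrected"
            wrong, right = token.split('|', 1)
            indexes.append(index)
        else:
            wrong = right = token
        original.append(wrong)
        corrected.append(right)
    return {'original': original, 'corrected': corrected, 'indexes': indexes}

example = parse_line("My siter|sister go|goes to Tonbury .")
print(example['original'])
print(example['corrected'])
print(example['indexes'])
```

Underscore-joined tokens such as `some_times` are left as single tokens by this sketch, as the assignment text permits.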

In [1]:
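import nltk

# Read the whole corpus; the `with` block closes the file afterwards
with open("holbrook.txt") as corpus_file:
    lines = corpus_file.readlines()
data = []
# Write your code here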

for line in lines:
    indexes = []
    tokenize_line = nltk.word_tokenize(line)
    misspelled_line = tokenize_line.copy()
    corrected_line  = tokenize_line.copy()
    for index,word in enumerate(tokenize_line):
        if '|' in word:
            misspelled_line[index] = word.split('|')[0]
            corrected_line[index]  = word.split('|')[1]    
            indexes.append(index)
    data.append({'original':misspelled_line,'corrected':corrected_line,'indexes':indexes})

# assert(data[2] == {
#    'original': ['I', 'have', 'four', 'in', 'my', 'Family', 'Dad', 'Mum', 'and', 'siter', '.'], 
#    'corrected': ['I', 'have', 'four', 'in', 'my', 'Family', 'Dad', 'Mum', 'and', 'sister', '.'], 
#    'indexes': [9]
# })

The counts and assertions given in the following sections are based on splitting the training and test set as follows:

In [2]:
test = data[:100]
train = data[100:]

## **Task 2** (10 Marks): 
Calculate the frequency (number of occurrences), *ignoring case*, of all words and their unigram probability from the corrected *training* sentences.

*Hint: use `Counter` to implement this so it may be called many times*
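
A minimal sketch of the idea on a toy corpus may help: the `Counter` is built once and then looked up on every call. The names `toy_sentences`, `toy_counts`, `toy_unigram` and `toy_prob` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Two toy "corrected" sentences standing in for the training data
toy_sentences = [['The', 'cat', 'sat', '.'], ['the', 'dog', 'sat', '.']]

# Build the case-insensitive frequency table once
toy_counts = Counter(word.lower() for sentence in toy_sentences for word in sentence)
toy_total = sum(toy_counts.values())

def toy_unigram(word):
    """Number of occurrences of `word`, ignoring case."""
    return toy_counts[word.lower()]

def toy_prob(word):
    """Unigram probability: occurrences divided by the total number of tokens."""
    return toy_unigram(word) / toy_total

print(toy_unigram('The'))  # 2
print(toy_prob('sat'))     # 0.25
```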

In [3]:
from collections import Counter

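# Lower-cased tokens from the corrected training sentences
corrected_sentences = [sentence['corrected'] for sentence in train]
corrected_words = [word.lower() for sentence in corrected_sentences for word in sentence]
unique_corrected_words = set(corrected_words)
# Build the frequency table once so that unigram() is cheap to call many times
unigram_counts = Counter(corrected_words)

def unigram(word):
    # Write your code here.
    return unigram_counts[word.lower()]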
    

def prob(word):
    # Write your code here.
    word = word.lower()
    word_counts = unigram(word)
    total_number_of_words = len(corrected_words)
    return word_counts / total_number_of_words

# Test your code with the following
# assert(unigram("me")==87)

## **Task 3** (15 Marks): 
[Edit distance](https://en.wikipedia.org/wiki/Edit_distance) is a method that calculates how similar two strings are to one another by counting the minimum number of operations required to transform one string into the other. There is a built-in implementation in NLTK that works as follows:


In [4]:
from nltk.metrics.distance import edit_distance

# Edit distance returns the number of changes to transform one word to another
print(edit_distance("hello", "hi"))

4


Write a function that calculates all words with *minimal* edit distance to the misspelled word. You should do this as follows:

1. Collect the set of all unique tokens in `train`
2. Find the minimal edit distance, that is the lowest value for the function `edit_distance` between `token` and a word in `train`
3. Output all unique words in `train` that have this same (minimal) `edit_distance` value

*Do not implement edit distance, use the built-in NLTK function `edit_distance`*
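
The three steps above can be sketched on a toy vocabulary as follows. The sketch uses NLTK's `edit_distance` when NLTK is installed; the fallback Levenshtein function is only a stand-in so that the example runs without NLTK, and is not meant to replace the built-in function in the assignment. `toy_vocabulary` and `toy_candidates` are hypothetical names.

```python
try:
    from nltk.metrics.distance import edit_distance
except ImportError:
    def edit_distance(s1, s2):
        """Stand-in Levenshtein distance (insert, delete, substitute each cost 1)."""
        previous = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, start=1):
            current = [i]
            for j, c2 in enumerate(s2, start=1):
                current.append(min(previous[j] + 1,                # delete c1
                                   current[j - 1] + 1,             # insert c2
                                   previous[j - 1] + (c1 != c2)))  # substitute
            previous = current
        return previous[-1]

# Step 1: the set of unique tokens (a toy stand-in for the training vocabulary)
toy_vocabulary = {'mind', 'mine', 'made', 'cat'}

def toy_candidates(token):
    distances = {word: edit_distance(word, token.lower()) for word in toy_vocabulary}
    best = min(distances.values())                                     # Step 2
    return sorted(word for word, d in distances.items() if d == best)  # Step 3

print(toy_candidates('minde'))  # ['mind', 'mine']
```

Here the result is sorted alphabetically; any deterministic ordering of the tied candidates would serve equally well.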

In [5]:
def get_candidates(token):
    # Write your code here.
    distance_token_to_words = {word:edit_distance(word,token.lower()) for word in unique_corrected_words}
    minimum_distance = min(distance_token_to_words.values())
    return sorted([word for word, distance in distance_token_to_words.items() 
                   if distance == minimum_distance], reverse=True)
        
# Test your code as follows
# assert get_candidates("minde") == ['mine', 'mind']

## Task 4 (15 Marks):

Write a function that takes a (misspelled) sentence and returns the corrected version of that sentence. The system should scan the sentence for words that are not in the dictionary (set of unique words in the training set) and for each word that is not in the dictionary choose a word in the dictionary that has minimal edit distance and has the highest *unigram probability*. 

*Your solution to this should involve `get_candidates`*
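
As a small illustration of the selection rule, the sketch below breaks ties between candidates by frequency. It assumes the candidate lists have already been computed (here hard-coded in the hypothetical `toy_candidate_table`), and it compares raw counts, which gives the same ranking as unigram probability because all candidates share the same denominator.

```python
from collections import Counter

# Toy frequency table standing in for the training counts
toy_counts = Counter({'this': 4, 'white': 5, 'whit': 1, 'cat': 2})
toy_dictionary = set(toy_counts)

# Assumed output of a candidate search: both words are one edit away from "whitr"
toy_candidate_table = {'whitr': ['whit', 'white']}

def toy_correct(sentence):
    """Return a corrected copy of `sentence`, leaving the input list untouched."""
    corrected = list(sentence)
    for index, word in enumerate(sentence):
        if word.lower() not in toy_dictionary:
            candidates = toy_candidate_table[word.lower()]
            # Pick the most frequent of the minimal-distance candidates
            corrected[index] = max(candidates, key=lambda c: toy_counts[c])
    return corrected

print(toy_correct(['this', 'whitr', 'cat']))  # ['this', 'white', 'cat']
```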


In [6]:
def correct(sentence):
    # Write your code here
    for index,word in enumerate(sentence):
        if word.lower() not in unique_corrected_words:
            candidates = {candidate:prob(candidate) for candidate in get_candidates(word)}
            best_candidate = max(candidates, key=candidates.get)
            sentence[index] = best_candidate
    return sentence

# assert(correct(["this","whitr","cat"]) == ['this','white','cat'])   

## **Task 5** (10 Marks): 
Using the test corpus, evaluate the *accuracy* of your method, i.e., how many words from your system's output match the corrected sentence (you should count words that are already spelled correctly and not changed by the system).
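
To make the metric concrete, here is a minimal sketch of token-level accuracy on two toy sentences. It assumes each predicted sentence has the same number of tokens as its gold (corrected) counterpart; `token_accuracy` is a hypothetical helper name.

```python
def token_accuracy(predicted_sentences, gold_sentences):
    """Fraction of tokens whose predicted form equals the gold form."""
    matches = 0
    total = 0
    for predicted, gold in zip(predicted_sentences, gold_sentences):
        matches += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return matches / total

predicted = [['this', 'white', 'cat'], ['my', 'sitar', '.']]
gold = [['this', 'white', 'cat'], ['my', 'sister', '.']]
print(token_accuracy(predicted, gold))  # 5 of the 6 tokens match
```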

In [7]:
def accuracy(test):
    # Write your code here
    count_total_words = 0
    count_corrected_words = 0
    for sentence in test:
        corrected_sentence = correct(sentence['original'].copy())
        count_total_words+=len(sentence['corrected'])
        count_corrected_words += sum(corrected_sentence[n] == sentence['corrected'][n] 
                                     for n in range(len(sentence['corrected'])))
    return count_corrected_words/count_total_words

print(accuracy(test))

0.8380281690140845


## **Task 6 (35 Marks):**

Consider a modification to your algorithm that would improve the accuracy of the algorithm developed in Tasks 3 and 4.

* You may use resources beyond those provided here.
* You must **not use the test data** in this task.
* Provide a short text describing what you intend to do and why. 
* Full marks for this section may be obtained without an implementation, but an implementation is preferred.
* Your implementation should not consist of more than 50 lines of code.

Please note this task is marked according to: demonstration of knowledge from the lectures (10), originality and appropriateness of solution (10), completeness of description (10) and technical correctness (5)


## **Explanation:**
* Because the training set is relatively small, I changed the unigram probability function to use add-one smoothing. When I tested this on the test set, it gave a significant improvement over the previous algorithm.

* I also extracted all bigrams from the training set, together with their frequencies, in order to calculate bigram probabilities.

* I then interpolated the unigram probability with the bigram probability in the function `interpolation_probability`. After trying several values for lambda, I settled on giving 70% of the weight to the bigram probability and 30% to the unigram probability.

* Because the training set is not large, the previous algorithm attempted to correct too many proper nouns. Most of these nouns do not appear in the training set and were being corrected to the wrong word, so I decided to ignore words that start with a capital letter.

* For the same reason, I limited the edit distance allowed in `get_candidates`: if the closest dictionary word is at edit distance 2 or more, the token is left unchanged.

* With these changes, accuracy rose from 83.8% to 90.14%, an improvement of about 6.34 percentage points.

* Please note that, in an attempt to keep the code within 50 lines, I have condensed some functions more than I would like.
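
The smoothing and interpolation described above can be illustrated on a toy corpus. This is a minimal sketch under stated assumptions, not the submitted implementation: it builds bigram counts with `zip` instead of NLTK, and the names `toy_words`, `smoothed_unigram`, `bigram_probability` and `interpolated` are hypothetical.

```python
from collections import Counter

toy_words = ['the', 'cat', 'sat', '.', 'the', 'dog', 'sat', '.']
unigram_counts = Counter(toy_words)
bigram_counts = Counter(zip(toy_words, toy_words[1:]))
vocabulary_size = len(unigram_counts)

def smoothed_unigram(word):
    """Add-one smoothing: every word, seen or unseen, gets one extra count."""
    return (unigram_counts[word] + 1) / (len(toy_words) + vocabulary_size)

def bigram_probability(previous, word):
    """Count of (previous, word) divided by the count of bigrams starting with previous."""
    starting_with_previous = sum(count for (first, _), count in bigram_counts.items()
                                 if first == previous)
    if starting_with_previous == 0:
        return 0.0
    return bigram_counts[(previous, word)] / starting_with_previous

def interpolated(previous, word, lambda_1=0.3):
    """Weight the bigram estimate by 1 - lambda_1 and the smoothed unigram by lambda_1."""
    return (1 - lambda_1) * bigram_probability(previous, word) + lambda_1 * smoothed_unigram(word)

print(interpolated('the', 'cat'))   # seen bigram: both terms contribute
print(interpolated('the', 'bird'))  # unseen word: only the smoothed unigram term is non-zero
```

Because of the smoothed unigram term, a candidate that never follows the previous word in the training data still receives a small non-zero score, so candidates can always be ranked.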

In [8]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics.distance import edit_distance
from collections import Counter
corrected_sentences = [train[n]['corrected'] for n in range(len(train))]
corrected_words = [word.lower() for sentence in corrected_sentences for word in sentence]
unique_corrected_words = set(corrected_words)
finder = BigramCollocationFinder.from_words(corrected_words)
bigram_freq_dictionary = dict(finder.ngram_fd.items())

unigram_counts = Counter(corrected_words)  # built once and reused by prob()

def prob(word): 
    return (unigram_counts[word]+1) / (len(corrected_words)+len(unique_corrected_words))

def bigrams_starting_by(word): 
    return [t for t in list(bigram_freq_dictionary.keys()) if t[0] == word]

def return_dictionary_value(bigram):
    try:
        return bigram_freq_dictionary[bigram]
    except KeyError:
        return 0

def count_bigrams(list_bigrams): 
    return sum([return_dictionary_value(bigram) for bigram in list_bigrams])

def probability_bigram(word1,word2):
    if count_bigrams([(word1,word2)]) == 0:
        return 0
    else:
        return count_bigrams([(word1,word2)])/count_bigrams(bigrams_starting_by(word1))

def interpolation_probability(word1,word2,lambda_1 = 0.3): 
    return (1-lambda_1)*probability_bigram(word1,word2)+lambda_1*prob(word2)
    
def get_candidates(token):
    distance_token_to_words = {word:edit_distance(word,token.lower()) for word in unique_corrected_words}
    minimum_distance = min(distance_token_to_words.values())
    if minimum_distance < 2:
        return sorted([word for word, distance in distance_token_to_words.items() if distance == minimum_distance])
    return [token]

def correct(sentence):
    for index,word in enumerate(sentence):
        if word.lower() not in unique_corrected_words and not word[0].isupper():
            if index == 0: 
                previous_word = '.'
            else:
                previous_word = sentence[index-1].lower()
            candidates = {candidate:interpolation_probability(previous_word,candidate) for candidate in get_candidates(word)}
            sentence[index] = max(candidates, key=candidates.get)
    return sentence

## **Task 7 (5 Marks):**

Repeat the evaluation (as in Task 5) of your new algorithm and show that it outperforms the algorithm from Tasks 3 and 4.

In [9]:
def accuracy(test):
    # Write your code here
    count_total_words = 0
    count_corrected_words = 0
    for sentence in test:
        corrected_sentence = correct(sentence['original'].copy())
        count_total_words+=len(sentence['corrected'])
        count_corrected_words += sum(corrected_sentence[n] == sentence['corrected'][n] 
                                     for n in range(len(sentence['corrected'])))
    return count_corrected_words/count_total_words

print(accuracy(test))

0.9014084507042254
