# Jupyter assignment 4: Spell checking and edit distances
## LING-UA 6 

Written by Lucas Champollion - Based in part on questions by Kushal Chattopadhyay


At the end of this lesson, you will be able to:


- Calculate conditional probabilities using tree-drawing methods and Bayes' rule
- Look up frequencies of English words in the Corpus of Contemporary American English
- Use these frequencies to estimate unigram probabilities of English words within an isolated-word spell checker
- Look up frequencies of English letter-based typos
- Look up frequencies of English word-based bigrams on Google
- Combine these frequencies to estimate the most likely correction of a typo within a context-dependent spell checker
- Calculate edit distances for strings and change the weights associated with various types of edit operations

# 1. Probability quiz



Answer the questions below. They are a bit more involved than in previous assignments. Feel free to use a sheet of paper to write out the formulas and probabilities.

Give all the answers as numbers between 0 and 1, expressed as decimals or as fractions of integers (e.g. 1/2). Answers should not be strings, and should not be fractions of decimals (so, not 0.5/0.3). Percentages should be converted into decimal numbers between 0 and 1. E.g. 50 percent should be written as .5 or 0.5 or 1/2. If you use decimal numbers, you may round to two decimal places. E.g. 2/3 can be written as 0.67.

If you are familiar with Google Sheets or Excel, this is a good opportunity to use it.

**Question 1.1 (5 points)** Bob, as he is prone to do, walks into a dark room with a bag in his hands that contains two coins. On this particular day, both coins are biased but in different directions. Coin A comes up heads 30% of the time. Coin B comes up heads 90% of the time. Bob reaches into his bag and picks a coin at random (so, he has a 50% chance of picking either coin). He throws it and it lands on heads. 

What is the probability that the coin he picked is coin A? Give the result either as a fraction or as a decimal number (you may round to two decimal places).

$P(\text{coin A } | \text{ heads}) = x$

**Hint**: You can solve this question either by drawing balanced trees or by applying Bayes' theorem, or both (this can help you make the connection and prepare you for the subsequent questions). If you decide to apply Bayes' theorem, you'll need to know $P(\text{heads})$, the probability that a randomly chosen coin lands on heads. (This is called the _prior_.) Either coin might land on heads. You can calculate the prior by conditioning on whether Bob has picked coin A or coin B, like this:

$P(\text{heads})=[P(\text{heads}|\text{coin A}) \times P(\text{coin A})] + [P(\text{heads}|\text{coin B}) \times P(\text{coin B})]$

This is called marginalization. It's essentially the same as balancing a tree.
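
As a minimal sketch of how marginalization and Bayes' rule fit together, the following Python snippet works through a two-coin setup with made-up numbers (a hypothetical coin X that lands heads 20% of the time and a coin Y that lands heads 60% of the time, not the coins from Question 1.1). The function name `posterior` and its parameters are illustrative assumptions, not part of the assignment.

```python
from fractions import Fraction

def posterior(prior_a, likelihood_a, likelihood_b):
    """P(A | evidence) when A and B are the only two possibilities."""
    prior_b = 1 - prior_a
    # Marginalization: P(evidence) = P(evidence|A)P(A) + P(evidence|B)P(B)
    evidence = likelihood_a * prior_a + likelihood_b * prior_b
    # Bayes' rule: P(A|evidence) = P(evidence|A)P(A) / P(evidence)
    return likelihood_a * prior_a / evidence

# Hypothetical coins: X lands heads 1/5 of the time, Y 3/5 of the time,
# and each coin is picked with probability 1/2.
p_x_given_heads = posterior(Fraction(1, 2), Fraction(1, 5), Fraction(3, 5))
print(p_x_given_heads)  # 1/4
```

Using `Fraction` keeps the arithmetic exact, which matches the "fractions of integers" answer format described above.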


In [None]:
solution_q1_1 = ...

**Question 1.2 (10 points)** You live in a community within which one out of ten people is infected with the coronavirus at a given point in time (as was the case for people living in New York City at the beginning of the pandemic in Spring 2020). To make things simpler, pretend that the virus occurs randomly across the general population.

$P(\text{virus}) = 0.1$ (This is called the _prevalence_ of the disease.)

You decide to get tested using a coronavirus test that has the following properties: 

- Out of any 100 people in your community who are infected with the virus and who take this test, 70 receive a positive test result; that is, the test correctly detects the virus. The remaining 30 are given a false negative test result, that is, they are told they don't have the virus even though they really do. 

$P(\text{positive test}|\text{virus}) = 0.7$ (This is called the _sensitivity_ of the test.)

- Out of any 100 people in your community who are not infected with the virus and who take this test, 99 receive a negative test result; that is, the test correctly detects the absence of the virus. The remaining person is given a false positive test result, that is, this person is told they have the virus even though they really don't. 

$P(\text{negative test}|\text{no virus}) = 0.99$ (This is called the _specificity_ of the test.)

The coronavirus test comes back negative. What is the probability that you don't have the virus, given this result?

$P(\text{no virus}|\text{negative test}) = x$ (This is called the _posterior probability_.)

**Hint 1**: Drawing trees for this question isn't practical because the trees get very large, so you'll want to apply Bayes' theorem:

$$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

This will require you to derive various probabilities from those given above.

**Hint 2**: When you have two events that between them cover the entire space of possibilities (e.g. heads/tails, virus/no virus, positive/negative test), their probabilities always add up to 1. E.g. $P(\text{no virus}) + P(\text{virus}) = 1$. So you can get either probability by subtracting the other one from 1. This is also true for conditional probabilities when the two events in question are on the left of the line and there is the same event on the right of the line: e.g. $P(\text{negative test}|\text{virus}) + P(\text{positive test}|\text{virus}) = 1$. It doesn't work on the right of the line though: in our example, $P(\text{negative test}|\text{virus}) + P(\text{negative test}|\text{no virus}) = 0.3 + 0.99 = 1.29$ which is different from 1.

**Hint 3**: To apply Bayes' theorem, you'll also need to know $P(\text{negative test})$, the probability that a randomly chosen person tests negative regardless of whether they have the virus. (This is called the _prior_.) Someone might test negative either because they really don't have the virus, or because they do but the test doesn't work. So we can calculate the prior by conditioning on whether the person has the virus or not, i.e. by doing marginalization, like this:

$P(\text{negative test})=[P(\text{negative test}|\text{virus}) \times P(\text{virus})] + [P(\text{negative test}|\text{no virus}) \times P(\text{no virus})]$

As in the previous question, marginalization is essentially the same as balancing a tree.
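
One way to check a Bayes' theorem calculation is to count people instead of multiplying probabilities. The sketch below does this for a made-up test (20% prevalence, 80% sensitivity, 90% specificity) in an imaginary population of 1,000 people. All of these numbers and variable names are assumptions chosen for illustration and deliberately differ from the ones in Question 1.2.

```python
from fractions import Fraction

population = 1000
infected = 200                          # hypothetical prevalence of 20%
healthy = population - infected

true_positives = infected * 80 // 100   # hypothetical sensitivity of 80%
false_negatives = infected - true_positives
true_negatives = healthy * 90 // 100    # hypothetical specificity of 90%
false_positives = healthy - true_negatives

# Everyone who tests negative, whether or not they are infected
# (this is the marginalization step, done by counting).
negatives = false_negatives + true_negatives

# Share of the negative testers who really are healthy.
p_healthy_given_negative = Fraction(true_negatives, negatives)
print(negatives, p_healthy_given_negative)  # 760 18/19
```

The count of all negative testers plays the role of $P(\text{negative test})$ in the formula above.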

**Background on this question:** In real life, the precise sensitivity and specificity of coronavirus tests depend on many factors, but the specificity is typically higher than the sensitivity. In any case, these numbers vary a lot depending on the kind of test (with PCR tests being more accurate than antigen tests), and on when and how the sample is taken. We ignore all of this here. If you're curious, see this New York Times article ["How to Think Like an Epidemiologist - Don’t worry, a little Bayesian analysis won’t hurt you."](https://www.nytimes.com/2020/08/04/science/coronavirus-bayes-statistics-math.html) and [this Harvard Health blog post](https://www.health.harvard.edu/blog/which-test-is-best-for-covid-19-2020081020734) for more information.



In [None]:
solution_q1_2 = ...

**Question 1.3 (10 points)** Bob is working on a book of Existentialist poetry and has hired you as his secretary to correct his spelling. Bob is not around, so you need to make decisions without checking with him. (You are working essentially in the same way as an autocorrect system.) Bob's writing looks a bit strange to you, almost as if he were picking each word out of a bag at random. His spelling, unfortunately, is absolutely terrible. One of his favorite words is "exist". 

You come across a typo in Bob's writing: "exitt". Unfortunately the context doesn't make it clear what he means. You surmise that this could be a typo either for "exist" or for "exit". (To make things simpler, we'll assume that these are the only two things he could have intended to type.) But which one?

On the one hand, Bob is an Existentialist and he likes to talk about existence. So he is much more likely to have intended "exist" than "exit", say three times as likely:

$P(\text{Bob intends "exist"}) = 3/4$

$P(\text{Bob intends "exit"}) = 1/4$

On the other hand, "s" and "t" are not really close to each other on Bob's keyboard, and you know that the "t" on his keyboard sometimes sticks because of an old piece of gum. Since he writes "exist" so often, you have a good idea of how often he mistypes it as "exitt". You count the number of times in the past that he intended to type "exist" but mistyped it as "exitt" and you find that it's about 10% of the time. That is to say, on average, if Bob tries to type the word "exist" on 100 separate occasions, he will mistype it as "exitt" on ten of these occasions. So you estimate that given that he intended to type "exist", the probability that he types "exitt" is 10%: 

$P(\text{Bob types "exitt"|Bob intends "exist"}) = 1/10$

On the third hand :), suppose that Bob intended to type "exit". Given this, how likely is it that he would have typed "exitt" by mistake? You think of the piece of gum underneath Bob's "t" key and you estimate about 20%:

$P(\text{Bob types "exitt"|Bob intends "exit"}) = 1/5$

Of course, you don't actually know whether Bob intended to write "exist" or "exit". That's what you want to know, since you need to correct his spelling. Based on your assumptions so far, given that Bob wrote "exitt", what is the probability that he meant to write "exist"? (This is called the _posterior probability_.)

$P(\text{Bob intends "exist"|Bob types "exitt"}) = x$

**Hint**: You can solve this problem either via balanced trees or via Bayes' rule. If you use trees, note that the two words correspond to the two coins, but that Bob doesn't pick between them with equal probability. So you'll want to balance the first level of the tree as well, not just the second. If you use Bayes' theorem, you'll want to calculate $P(\text{Bob types "exitt"})$ (this is called the _prior probability_) via marginalization as in the previous questions. 

In [None]:
solution_q1_3 = ...

**Question 1.4 (5 points)** And based on the same assumptions as in Question 1.3, given that Bob wrote "exitt", what is the probability that he meant to write "exit"? (This is called the _posterior probability_.)

$P(\text{Bob intends "exit"|Bob types "exitt"}) = x$


**Hint**: This is very easy if you've solved Question 1.3.

In [None]:
solution_q1_4 = ...

**Question 1.5 (5 points)** Based on your results in Questions 1.3 and 1.4, how would you correct Bob's spelling? Your answer should consist of either the word "exist" or the word "exit", in quotation marks.

**Hint**: This is very easy if you've solved Questions 1.3 and 1.4.

In [None]:
solution_q1_5 = ...

# 2. Spell check: Isolated-word error correction

Based on a question by Kushal Chattopadhyay

An automatic spelling corrector detects the string "poice" as a typo in a document which contains the following sentence:

*The woman’s melodious poice led her to a life of fame as a singer.*

This exercise retraces the steps of the spelling corrector as it fixes this typo.

**Question 2.1 (5 points)** Create a table including four hypothesized correct words (i.e. candidates for what the user might have intended to type; to hypothesize something is to consider it as a possible explanation). We will assume that these hypothesized correct words are "voice", "police", "poise", and "price". The answer to this question will consist of four lists of four items each. Each list is a row, and each list item is a cell. The first list is already filled out for you as an example. The list items should be the following from left to right in each list:

1. First: the required fix, "insertion", "deletion", or "substitution". What modifications would we need to make to the misspelled word to turn it into the correct spelling? 
- If the mistyped string ("poice") and the intended string have the same length, the required fix is a "substitution". E.g. if someone means to type "acquire" but writes "akquire", that calls for a substitution.
- If the mistyped string ("poice") is shorter than the string that was intended, the required fix is an "insertion". E.g. if someone means to type "acquire" but writes "aquire", that calls for an insertion. 
- If the mistyped string ("poice") is longer than the string that was intended, the required fix is a "deletion". E.g. if someone means to type "acquire" but writes "ackquire", that calls for a deletion. 


2. Second: the hypothesized correct word, i.e. what the user might have meant to type. This is one of the strings "voice", "police", "poise", and "price".


3. Third: the mistyped portion of the input, given what the user might have meant to type. 
- If the required fix is a substitution, give the letter that has actually been typed. E.g. if "acquire" was mistyped as "akquire", this is "k".
- If the required fix is an insertion, give the letter immediately *before* the letter that would need to be inserted to fix the typo. E.g. if someone mistypes "acquire" as "aquire", this is "a". 
- If the required fix is a deletion, give a sequence of two letters: the letter just before the deletion, and the letter that would need to be deleted to fix the typo. E.g. if someone mistypes "acquire" as "ackquire", this is "ck".

4. Fourth: the intended portion of the input.
- If the required fix is a substitution, give the letter that was intended. E.g. if "acquire" was mistyped as "akquire", this is "c".  
- If the required fix is an insertion, give a sequence of two letters consisting of the letter immediately *before* the letter that would need to be inserted to fix the typo, followed by that letter. E.g. if someone mistypes "acquire" as "aquire", this is "ac". 
- If the required fix is a deletion, give the letter immediately *before* the letter that would need to be deleted to fix the typo. E.g. if someone mistypes "acquire" as "ackquire", this is "c".
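
The row format described above can be sketched in Python using the "acquire" examples from the text. The helper name `describe_fix` is hypothetical, and the sketch assumes that the typed and intended strings differ by exactly one insertion, deletion, or substitution that is not at the very start of the word.

```python
def describe_fix(typed, intended):
    """Return [fix type, intended word, mistyped portion, intended portion].

    Assumes the two strings differ by exactly one edit and that the edit
    is not at the very start of the word.
    """
    # Index of the first position where the two strings disagree
    # (or the end of the shorter string if one is a prefix of the other).
    shorter = min(len(typed), len(intended))
    i = next((k for k in range(shorter) if typed[k] != intended[k]), shorter)

    if len(typed) == len(intended):
        # Same length: one letter was typed in place of another.
        return ["substitution", intended, typed[i], intended[i]]
    if len(typed) < len(intended):
        # A letter is missing from the typed string and must be inserted.
        return ["insertion", intended, intended[i - 1], intended[i - 1:i + 1]]
    # A superfluous letter was typed and must be deleted.
    return ["deletion", intended, typed[i - 1:i + 1], typed[i - 1]]

for typo in ["akquire", "aquire", "ackquire"]:
    print(describe_fix(typo, "acquire"))
```

The three printed rows should match the "k"/"c", "a"/"ac", and "ck"/"c" examples given in the bullet points.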


You can check how your table looks in the test cell. 

First run the following cell so that the table can be created properly. Then fill in the rows. Each row comes with a test that checks that you used the right format. Finally, run the cell below the checks to generate the table in a nice format.

If you get an error on Gradescope, check that you haven't inverted the third and fourth columns, and check that you haven't confused insertion and deletion.

Remember that the example sentence is:

*The woman’s melodious poice led her to a life of fame as a singer.*

In [None]:
correctionTable_voice = ["substitution", "voice", "p", "v"] # This row has been filled out for you.

In [None]:
correctionTable_police = ...

In [None]:
correctionTable_poise = ...

In [None]:
correctionTable_price = ...

In [None]:
# Check out how your table looks!
import pandas as pd
import numpy as np
correctionTable = pd.DataFrame(np.array([correctionTable_voice,
                                         correctionTable_police,
                                         correctionTable_poise,
                                         correctionTable_price]), 
                               columns=["type of required fix",
                                        "hypothesized correct word", 
                                        "mistyped", 
                                        "intended"])
correctionTable

#### Code Explanation
_The following explanation refers to the above code block._

Using a library called `pandas`, the above segment of code takes the `correctionTable_` lists above and organizes them into a neatly labeled table! The column headers come from the `columns=[...]` section of the code. The `np` code refers to a library called `numpy`, which makes many list operations easier and faster in Python.

**Question 2.2 (5 points)** Next, let's guess which of these four hypothesized correct words is most likely the intended one, based on how common each of the possible corrections is in the English vocabulary. To this end, we'll use a corpus (a large and structured collection of texts; the plural of "corpus" is "corpora").

You will use the Corpus of Contemporary American English (COCA) to calculate the frequencies of each of your words. COCA describes itself as "the only large, genre-balanced corpus of American English. COCA is probably the most widely-used corpus of English, and it is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English. The corpus contains more than one billion words of text (25+ million words each year 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (with the update in March 2020): TV and Movies subtitles, blogs, and other web pages." 

To access COCA, go to [https://persistent.library.nyu.edu/arch/NYU06294](https://persistent.library.nyu.edu/arch/NYU06294) and select "Corpus of Contemporary American English (COCA)". If you are off-campus and not using the NYU VPN, this link will put you through the proxy server so you can get access. When you access COCA the first time, you will need to register, which is free. Please use your NYU email address when registering. If you have problems registering, please post on the Brightspace discussions board under ["Registering for COCA"](https://brightspace.nyu.edu/d2l/le/120948/discussions/topics/271024/View).

1. To find the frequency of a word, type it into the COCA search box and click "Find matching strings", then look under "FREQ". For example, the word "voice" has a frequency of 156,861.

2. To estimate the probability of a word, you need its frequency and the total number of words in the corpus. As of the time of writing, the total number of words in COCA is 1,001,610,938, about one billion. Use this number in this assignment. (For reference, we looked this up by clicking on the page icon in the top row whose tooltip is "See texts and registers" and looking at the rightmost cell in the top row of the table there.)

3. For each word, estimate its unigram probability by dividing its frequency by the total number of words in the corpus. Since the resulting values are very small, scale them by multiplying them by one thousand (1,000), then round the result to three decimals. For example, the word "voice" has an estimated probability of 156,861/1,001,610,938. This number multiplied by one thousand, rounded to three decimals, is 0.157. As long as you do this scaling consistently for every word, this will not affect the result of the upcoming computations.
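
The scaling step can be sketched in a few lines of Python, using the "voice" figures given above. The function name `scaled_unigram_probability` and the constant name are assumptions made for this illustration.

```python
COCA_TOTAL_WORDS = 1_001_610_938  # corpus size given in the text

def scaled_unigram_probability(frequency, total=COCA_TOTAL_WORDS, scale=1000):
    """Frequency divided by corpus size, multiplied by `scale`, rounded to 3 decimals."""
    return round(frequency / total * scale, 3)

print(scaled_unigram_probability(156_861))  # 0.157
```

Because every word is multiplied by the same factor of 1,000, the ranking of the words is the same as it would be with the unscaled probabilities.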

Create a table of probabilities $P(\text{intended})$. The columns of this table should be: 

1. hypothesized correct word
2. frequency of hypothesized correct word
3. P(hypothesized correct word). 

Generate your table in a nice format by running the code below the following cell and its check.

Remember that the example sentence is:

*The woman’s melodious poice led her to a life of fame as a singer.*

In [None]:
probabilityTable_voice = ["voice", 156861, 0.157]
probabilityTable_police = ...
probabilityTable_poise = ...
probabilityTable_price = ...

In [None]:
# Check out how your table looks!

probabilityTable = pd.DataFrame(np.array([probabilityTable_voice,
                                          probabilityTable_police,
                                          probabilityTable_poise,
                                          probabilityTable_price]), 
                                columns=["intended word", 
                                         "frequency of this word", 
                                         "scaled unigram probability of this word"])
probabilityTable

<!-- BEGIN QUESTION -->

**Question 2.3 (5 points)** If you were to base your decision solely on these unigram probabilities, which of your hypothesized correct words would come out as most likely to be correct? That is to say, which of them has the highest value in the rightmost column? Did this match what you expect to be the most likely intended word? If so, why do you think this approach works? If not, why do you think the guess failed?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.4 (5 points)** Now, we will refine our guess by taking into account how common each letter-level typo is. Create a table for the noisy channel probabilities of the possible corrections. 

To calculate the error model probabilities, use the table at http://norvig.com/ngrams/count_1edit.txt, which is derived by Peter Norvig from the Google Web Trillion Word Corpus. You can search through the table by using your browser's search function (usually Cmd-F on Mac and Ctrl-F on Windows and Unix). Here is how to read this table:

- An entry like "e|i	917" means that 917 times in the corpus, there was a typo in which the letter "e" was mistyped (seen) where the letter "i" would have been intended (correct). 
- An entry like "t|te	478" stands for a typo in which the letter "e" should have been typed after the letter "t" but was left out by the user, and says that this happened 478 times. 
- An entry like "re|r	299" stands for a typo in which a superfluous letter "e" was inserted after "r" by the user, and says that this happened 299 times. 

This is the same format as you used in the "mistyped" and "intended" columns in Question 2.1 above.

As a quick-and-dirty way to estimate probabilities, take each of these numbers and divide it by 1000. This number is chosen somewhat arbitrarily; as before, it doesn't matter for the computations to follow, as long as it is applied consistently. If a specific transition doesn't occur in Norvig's file, assume that it has happened exactly once (so divide 1 by 1000). This is called "smoothing".
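
The lookup-and-smooth step described above might look as follows in Python. The dictionary holds only the three counts quoted in the text. A real run would read the whole file, and the names `edit_counts` and `typo_frequency` are assumptions made for this sketch.

```python
# The three counts quoted above from Norvig's count_1edit.txt; this tiny
# excerpt stands in for the full file.
edit_counts = {"e|i": 917, "t|te": 478, "re|r": 299}

def typo_frequency(edit, counts=edit_counts, divisor=1000):
    """Count of the edit divided by 1000; unseen edits are smoothed to a count of 1."""
    return counts.get(edit, 1) / divisor

print(typo_frequency("e|i"))   # 0.917
print(typo_frequency("q|z"))   # 0.001 ("q|z" is absent from this excerpt)
```

The smoothing default keeps an unseen edit from receiving a frequency of zero, which would otherwise wipe out the whole product in the final column of the table.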

Create a table of error model probabilities $P(\text{mistyped letter}|\text{correct letter})$. The columns of this table should be: 

1. hypothesized correct word

2. a string of the form "mistyped|intended" (from Peter Norvig's data)

3. Frequency of this letter typo (estimated from Peter Norvig's data as described above)

4. Scaled unigram probability of this word (copied from question 2.2 above)

5. the result of multiplying the last two columns

As before, generate your table in a nice format by running the code below the check. 

Remember that the example sentence is:

*The woman’s melodious poice led her to a life of fame as a singer.*

In [None]:
noisyChannelTable_voice = ["voice",  "p|v", 0.001, 0.157, 0.001*0.157]
noisyChannelTable_police = ...
noisyChannelTable_poise = ...
noisyChannelTable_price = ...

In [None]:
# Check out how your table looks!

noisyChannelTable = pd.DataFrame(np.array([noisyChannelTable_voice,
                                           noisyChannelTable_police,
                                           noisyChannelTable_poise,
                                           noisyChannelTable_price]), 
                                 columns=["hypothesized correct word", 
                                          "mistyped letter | intended letter",
                                          "Frequency of this letter typo",
                                          "Scaled unigram probability of this word", 
                                          "Multiplying the last two columns"])
noisyChannelTable

<!-- BEGIN QUESTION -->

**Question 2.5 (5 points)** Which word did you find had the highest noisy channel probability on this approach? That is to say, which one has the highest value in the rightmost column? Did this match your expectations? If so, why do you think this approach works? If not, why do you think the guess failed? What does this tell you about the limits of isolated-word error correction?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

# 3. Spell check: Context-dependent word correction


Based on a question by Kushal Chattopadhyay

In this exercise, we will move from isolated-word error correction to context-dependent word correction. We will use bigram counts to find the best correction while taking the context into account. For each of the four hypothesized correct words you considered in part 2, we will estimate the probability that it is the right word, given that the previous word is "melodious". 

To this end, we will multiply the bigram probability by the frequency of the letter typo (the error model value from Question 2.4) for each word. First, though, we need an estimate of bigram probabilities. Given that the current word is "melodious", what is the relative probability that the next word is "voice" as opposed to "police", "poise", or "price"? 

To do this, do a Google search for the strings "melodious voice", "melodious police", "melodious poise", and "melodious price" (don't forget to include quotation marks in each case) and report the number of results. (If you don't see the results, this is due to a Google bug. As a workaround, try running your search from a Private Mode or Incognito Mode window.) Make sure to ignore Google's own autocorrection tools: If Google tells you something like "Showing results for X, search instead for Y", click on "search instead for Y".

The resulting numbers are raw counts rather than probabilities, but since the only thing that matters is their proportion, we won't worry about this.  (Using Google search results is only good as a quick and dirty approximation, and can return different results for the same searches at different times.)
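
To see why raw counts are enough, here is a small sketch with made-up scores for two placeholder candidates (these are not real search counts or typo frequencies). Dividing every score by the same constant changes the numbers but not which candidate wins. The names `scores` and `best_candidate` are hypothetical.

```python
# Made-up values: typo frequency * bigram count for two placeholder words.
scores = {"candidate_a": 0.002 * 500_000, "candidate_b": 0.010 * 20_000}

def best_candidate(scores):
    """Return the key with the highest score."""
    return max(scores, key=scores.get)

# Dividing every score by the same constant leaves the ranking unchanged.
rescaled = {word: value / 1_000_000 for word, value in scores.items()}
print(best_candidate(scores), best_candidate(rescaled))
```

Both calls return the same candidate, which is why the counts can be used in place of true probabilities here.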

**Question 3.1 (5 points)** Create a table of bigram counts $P(\text{correct word is }x|\text{previous word was "melodious"})$. The columns of this table should be: 

1. hypothesized correct word (e.g. "voice")

2. frequency of this letter typo (copied over from the previous question)

3. number of times that the hypothesized correct word follows "melodious" according to Google

4. multiplying the last two columns

The first row is filled out for you.

As before, generate your table in a nice format by running the cell below the check.


In [None]:
googleTable_voice = ["voice", 0.001, 827000, 0.001*827000]
googleTable_police = ...
googleTable_poise = ...
googleTable_price = ...

In [None]:
# Check out how your table looks!

googleTable = pd.DataFrame(np.array([googleTable_voice,
                                     googleTable_police,
                                     googleTable_poise,
                                     googleTable_price]), 
                           columns=["hypothesized correct word",
                                    "frequency of this letter typo",
                                    "'melodious X' count on Google",
                                    "Multiplying the last two columns"])
googleTable

<!-- BEGIN QUESTION -->

**Question 3.2 (5 points)** Which word did you find is the most likely candidate on this approach? That is to say, which one has the highest value in the rightmost column? Did this match your expectations? If so, why do you think this approach works? If not, why do you think the guess failed?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

# 4. Edit Distances

(20 points - all manual)

Based on a question by Kushal Chattopadhyay

A user types "halvs" in a document. The computer comes up with four hypothesized correct words:
- halves 
- calves
- halts
- helps

<!-- BEGIN QUESTION -->

**Question 4.1 (5 points)** Start by ranking the 4 hypothesized correct words simply by following your intuition about which word you think the writer intended. List them after the colon of each prompt. All responses are accepted.


1. Most likely: 

2. Second most likely: 

3. Third most likely: 

4. Least likely:

<!-- END QUESTION -->

The following line imports an implementation of the textbook's edit distance algorithm from the NLTK library so that we can use it here:

In [None]:
from nltk.metrics.distance import edit_distance  # we can now use the edit_distance function

In [None]:
# The non-word string that was typed:
typed_input = "halvs"
# Words that might have been intended:
candidates = ["halves","halts","calves","helps"]

for word in candidates:
    print(typed_input, '-', word)
    print("Edit distance:", edit_distance(typed_input,word))
    print()

#### Code Explanation
_The following explanation refers to the above code block._

The above code segment declares a variable `typed_input`, which represents a word with a potential typo, and a list `candidates` of words to compare the typo to. The for loop then says "for every candidate, print how different it is from the `typed_input` string". It measures this difference using the `edit_distance` function from the `nltk` library (imported in the code cell above). 

<!-- BEGIN QUESTION -->

**Question 4.2 (5 points)** Run the code above to calculate the minimum edit distance from the string "halvs" to each of the hypothesized correct words. Rank them in terms of their edit distance. Which one(s) has/have the smallest edit distance? 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

An alternative version of the edit distance metric has different costs associated with different operations. In this version, substitutions have a cost of 2, whereas deletions and insertions have a cost of 1. (The logic behind this is that a substitution can be described as a combination of a deletion and an insertion.)

In the code below, find the line of code where the substitution cost is defined, and change it to this alternative version.

The function `edit_distance2` goes through a pair of strings character by character to compute the edit distance. To do this, it uses a second function `edit_dist_step` to update the cost of an edit at each step through the strings. 

You will need to edit the function `edit_dist_step` below so that a substitution has a cost of 2 rather than 1. The new edit distance function will be called `edit_distance2` so that we can still use the old `edit_distance` function for comparison.

Don't change the following code cell:

In [None]:
# Don't change this code cell
def edit_distance2(word1, word2): # 1
    # Set up a 2-D array to store the edit distance costs. The rows
    # correspond to prefixes of word1, the columns to prefixes of word2.
    # The extra row 0 and column 0 stand for the empty prefix.
    len1 = len(word1)
    len2 = len(word2)
    costs = []
    for i in range(len1 + 1): # 2
        costs.append([0] * (len2 + 1))  # initialize 2-D array to zero
    for i in range(len1 + 1):
        costs[i][0] = i           # column 0: 0,1,2,3,4,...
    for j in range(len2 + 1):
        costs[0][j] = j           # row 0: 0,1,2,3,4,...

    # iterate over the array, skipping row 0 and column 0
    for i in range(1, len1 + 1): # 3
        for j in range(1, len2 + 1):
            edit_dist_step(costs, i, j, word1, word2)
    return costs[len1][len2] # 4


#### Code Explanation
_The following explanation refers to the above code block, referencing parts in parentheses, e.g. `(1)`._

(1) Here we declare a function called `edit_distance2`. Like all functions, it allows us to pass in arguments: in this case, `word1` and `word2`. These arguments, whatever they may be when the function is called, are then referred to by these names within the body of the function.

(2) Here, we have a couple of for loops. We first fill up the list called `costs` with zeros. The second and third for loops then fill the first column and the first row with 0, 1, 2, 3, 4, and so on, up to the lengths of `word1` and `word2`. In this 2D array, each cell will hold the edit distance between the beginning of `word1` and the beginning of `word2` up to that point.

(3) Now, we iterate over the entire 2D costs array with these two for loops. Using the `edit_dist_step` function, we fill in each cell with the cost of the cheapest way of reaching it.

(4) Lastly, we return a value from our function. If we don't remember to return values from functions, they return nothing! It's like calling a friend, asking them to solve something for you, and then having them hang up before they give you the answer. Just like a friend, a function can be called multiple times, but you need to make sure it returns something!
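
To make steps (2) to (4) concrete, here is a self-contained sketch that builds and prints the whole cost table for a small pair of words. It is not the NLTK implementation. The function name `cost_table` and its `substitution_cost` parameter are assumptions made for this illustration, and the table has one extra row and column for the empty string.

```python
def cost_table(word1, word2, substitution_cost=1):
    """Build the table of edit distances between all prefixes of the two words."""
    rows, cols = len(word1) + 1, len(word2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i   # turning a prefix of word1 into "" takes i deletions
    for j in range(cols):
        table[0][j] = j   # turning "" into a prefix of word2 takes j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            same = word1[i - 1] == word2[j - 1]
            table[i][j] = min(
                table[i - 1][j] + 1,   # deletion
                table[i][j - 1] + 1,   # insertion
                table[i - 1][j - 1] + (0 if same else substitution_cost),
            )
    return table

table = cost_table("cat", "cart")
for row in table:
    print(row)
print("Edit distance:", table[-1][-1])  # Edit distance: 1
```

The bottom-right cell is the edit distance between the two full words, which is the value that step (4) returns.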

**A note on "costs"**
When we talk about costs in natural language processing (or in machine learning more broadly), we mean how "costly" an operation is. For example, I might say that lifting my hand, moving it above a button on my desk, and then pressing the button has a cost of 3 (one for each movement). Walking around my room, taking the button from the desk, putting it on another table, and then pressing it might have a cost of 10 (chosen arbitrarily). That higher cost represents more physical effort. Costs in machine learning are similar. If an operation has a higher cost, e.g. replacing one letter with another, then we want to avoid it. The lower the cost, the better!
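As a rough illustration of how costs add up, the sketch below totals the cost of a sequence of edit operations under two different weightings. The names `uniform_costs`, `pricey_substitution`, and `total_cost` are hypothetical, and the numbers are made up purely for illustration.

```python
# Hypothetical cost tables: the numbers are made up for illustration.
uniform_costs = {"delete": 1, "insert": 1, "substitute": 1}
pricey_substitution = {"delete": 1, "insert": 1, "substitute": 3}


def total_cost(operations, costs):
    """Add up the cost of every operation in a sequence of edits."""
    return sum(costs[op] for op in operations)


# Two ways of turning "cat" into "cut"
one_substitution = ["substitute"]          # replace a with u
delete_then_insert = ["delete", "insert"]  # drop a, then add u

for name, costs in [("uniform", uniform_costs), ("pricey", pricey_substitution)]:
    print(name,
          total_cost(one_substitution, costs),
          total_cost(delete_then_insert, costs))
```

Under the uniform weights the single substitution is the cheaper route, while under the second weighting the deletion followed by an insertion is cheaper. Which sequence of edits counts as "best" therefore depends on the weights chosen.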

Modify the following code cell in the place indicated (the third line from the bottom, where it says "MODIFY THIS ROW"):

In [None]:
## MODIFY THIS CODE CELL IN THE PLACE INDICATED
def edit_dist_step(graph, row, col, word1, word2):
    """Calculates the minimum edit distance to the cell at (row, column) in the
    edit distance graph from word1 -> word2.

    `graph` is a two-dimensional list

    """
    character_1 = word1[row - 1]
    character_2 = word2[col - 1]

    # deleting a letter from word1
    deletion_cost = graph[row - 1][col] + 1

    # inserting a letter from word2
    insertion_cost = graph[row][col - 1] + 1

    if character_1 == character_2:
        # "substituting" a character for itself, a -> a
        substitution_cost = graph[row - 1][col - 1] + 0
    else:
        # "substituting" a character for another character, a -> b
        substitution_cost = graph[row - 1][col - 1] + 1 ########## MODIFY THIS ROW ##########

    # pick the cheapest of the three options
    graph[row][col] = min(deletion_cost, insertion_cost, substitution_cost)

Test your modified code by running the following code cell.

In [None]:
# Don't change this code cell
# The non-word string that was typed:
typed_input = "halvs"
# Words that might have been intended:
candidates = ["halves","halts","calves","helps"]

for word in candidates:
    print(typed_input, '-', word)
    print("Original edit distance: ",edit_distance(typed_input,word))
    print("New edit distance     : ",edit_distance2(typed_input, word))
    print()


If the two edit distance functions give the same result, you probably skipped the part of the exercise that asks you to edit the code above, or your change didn't go through. Go back and fix the issue. If the two functions give different results, answer the following questions.

#### Code Explanation
_The following explanation refers to the above code block._

Heed the warning above if your edit distances are the same. All this code block does is say "for every word in the `candidates` list, let's see what the cost is using our `edit_distance` function, and then what the cost is using our `edit_distance2` function". As the explanation block above mentioned: the lower the cost, the better!
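If you would rather see candidates sorted than read the distances off one by one, a loop like the one above can be turned into a ranking. The sketch below is only an illustration: `toy_edit_distance` is a hypothetical stand-in for the notebook's `edit_distance` (unit costs, written recursively), and the typo and word list are made up and are not the ones from the exercise.

```python
from functools import lru_cache


def toy_edit_distance(word1, word2):
    """Unit-cost edit distance; a stand-in for the notebook's edit_distance."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # i and j are the lengths of the prefixes being compared
        if i == 0:
            return j   # insert the remaining j characters
        if j == 0:
            return i   # delete the remaining i characters
        substitution = 0 if word1[i - 1] == word2[j - 1] else 1
        return min(dist(i - 1, j) + 1,                  # deletion
                   dist(i, j - 1) + 1,                  # insertion
                   dist(i - 1, j - 1) + substitution)   # substitution

    return dist(len(word1), len(word2))


typed = "teh"
options = ["the", "ten", "tea", "tech"]

# sorted() keeps the original order among words with equal distance
ranked = sorted(options, key=lambda word: toy_edit_distance(typed, word))
for word in ranked:
    print(word, toy_edit_distance(typed, word))
```

With unit costs and no transposition operation, "the" ends up last in this ranking even though many people would consider it the most plausible correction of "teh", which shows how the choice of operations and costs can disagree with intuition.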

<!-- BEGIN QUESTION -->

**Question 4.3 (10 points)**
- Rank the hypothesized correct words in terms of their edit distance. Which of them has/have the smallest edit distance?
- Which of the two cost functions (if any) better matches your intuitions and why? Do you think that one of the two versions is better than the other? Why/why not? (This is not discussed in the textbook. Just think for yourself and give a well-reasoned response.)


_Type your answer here, replacing this text._

<!-- END QUESTION -->

# 5. Reflection


<!-- BEGIN QUESTION -->

**Question 5 (10 points)** Tell us what you thought of the exercise and what questions or problems you had. So long as you give an answer, its content will not determine or affect your grade. The answer is only designed to help us develop future problem sets for this and subsequent semesters. (Yes, this question is worth points).


_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(force_save=True, run_tests=True)