# Levenshtein Distance

From [RWS Moravia](https://www.rws.com/insights/rws-moravia-blog/interview-with-an-expert-how-do-you-measure-mt/):

Levenshtein distance calculates the difference between the MT output and the post-edited translation. It shows what the post-editor did to the original MT output. Let's say the machine translation output is "the cat is barking," and a post-editor changes this to "the dog is barking." The difference would be six, because you include the three letters deleted and three letters added when editing from "cat" to "dog." Then you divide six by the number of letters in the whole segment to come up with a result that is a percentage.

# TER

From [RWS Moravia](https://www.rws.com/insights/rws-moravia-blog/interview-with-an-expert-how-do-you-measure-mt/):  

The second metric we use to measure the human effort of post-editing is the TER score. Whereas the Levenshtein distance counts on a character level—which characters are deleted, added or replaced—the TER score tries to account for the kinds of changes made and makes a calculation based on the number of edits rather than the number of character changes.

Again, take the example "the cat is barking" and "the dog is barking." The Levenshtein distance counts both the three letters deleted and the three letters added. When you calculate TER, it recognizes a single replacement: one string is replaced with another. That string has a length of three. So, it calculates a single edit with a length of three characters.

Therefore, Levenshtein can actually overestimate the effort of making long edits that are in fact only single edits—for example, if you replace one or two characters here and there throughout a long sentence. Levenshtein wouldn't be able to tell the difference in effort between that and overwriting full words. In this case, TER is more reliable because its logic is closer to the actual post-editing effort.
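
To illustrate the idea of counting edits rather than characters, here is a minimal sketch that compares two segments word by word using Python's standard `difflib` module. It is not a TER implementation: real TER also allows block shifts and searches for the cheapest edit sequence, whereas `difflib` only reports one plausible alignment. The function name `word_edit_rate` and the choice to divide by the number of words in the post-edited segment are assumptions made for this example.

```python
from difflib import SequenceMatcher


def word_edit_rate(raw_mt, post_edited):
    """Rough word-level edit rate (illustrative only, not real TER)."""
    mt_words = raw_mt.split()
    pe_words = post_edited.split()
    matcher = SequenceMatcher(None, mt_words, pe_words, autojunk=False)

    edits = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced, inserted, or deleted run of n words counts as n edits.
            edits += max(i2 - i1, j2 - j1)

    # Assumption: normalize by the length of the post-edited segment.
    return edits, edits / max(len(pe_words), 1)


edits, rate = word_edit_rate("the cat is barking", "the dog is barking")
print(edits, rate)  # 1 0.25
```

Swapping "cat" for "dog" registers as one edit out of four words, regardless of how many letters the words contain.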

## Procedure

I don't know of any good Python modules for evaluating TER, but we do have a good method for evaluating Levenshtein edit distance. Note: the code used here gives the result that the quote above attributes to TER. In other words, it evaluates the distance between "the cat is barking" and "the dog is barking" as three, not six, because standard Levenshtein distance counts each substituted character as a single edit.
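
The difference between "six" and "three" comes down to how a substitution is counted. The sketch below is a plain-Python edit distance written for this explanation. It is not the strsimpy implementation, and the `sub_cost` parameter is an assumption introduced here. With a substitution cost of 1 it behaves like standard Levenshtein. With a cost of 2, replacing a letter is as expensive as deleting it and adding another, which reproduces the count in the quoted example.

```python
def edit_distance(a, b, sub_cost=1):
    """Character edit distance with unit insert/delete cost.

    sub_cost=1 gives the standard Levenshtein distance.
    sub_cost=2 makes a substitution cost the same as a delete plus an insert.
    """
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else sub_cost
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


mt = "the cat is barking"
pe = "the dog is barking"
print(edit_distance(mt, pe))              # 3
print(edit_distance(mt, pe, sub_cost=2))  # 6
```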

Collect the following:

1.  A source text file of at least a couple thousand words.
2.  The raw machine translation of the source text.
3.  The post-edited version of the raw machine translation.

Requirements for the files to be tested:

*   Each file should be a text file, UTF-8 encoded.
*   We will perform our Levenshtein calculation on text in the Spanish language.
*   Each line should have only one sentence.
*   Each file should have exactly the same number of lines.
*   Each line should be aligned across files: line N of one file corresponds to line N of the other two.
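
A quick way to catch misaligned files before scoring is to compare their line counts. The helper names below (`count_lines`, `check_line_counts`) are hypothetical and only sketch the idea. They check the number of lines, not whether the sentences actually correspond.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    # A file that is not valid UTF-8 raises UnicodeDecodeError here.
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_line_counts(*paths):
    """Raise ValueError unless every file has the same number of lines."""
    counts = {str(path): count_lines(path) for path in paths}
    if len(set(counts.values())) > 1:
        raise ValueError(f"Line counts differ: {counts}")
    return counts
```

Calling `check_line_counts` with the source, raw MT, and post-edited file paths returns a dictionary of counts when they agree and raises an error when they do not.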

Requirements for our Python environment:

Make sure pandas and the string similarity package (strsimpy) are installed in your Python environment.

To install pandas:

`pip install pandas`

To install string similarity:

`pip install -U strsimpy`

## Importing necessary modules

We will need pathlib, pandas, and the `Levenshtein` and `NormalizedLevenshtein` classes from strsimpy. Import them with the statements below:

```python
from pathlib import Path
import pandas as pd
# The package installed with `pip install -U strsimpy` is imported as `strsimpy`.
from strsimpy.levenshtein import Levenshtein
from strsimpy.normalized_levenshtein import NormalizedLevenshtein

levenshtein = Levenshtein()
normalized_levenshtein = NormalizedLevenshtein()
```

## Set paths of files

Tell Python where to find the files you'll be evaluating.

```python
SOURCE = Path(r"Levenshtein_files/source.txt")
RAW_MT = Path(r"Levenshtein_files/raw_mt.txt")
PEMT = Path(r"Levenshtein_files/post-edited.txt")
```

## Set up a table for results

The `result` list stores the rows that contain the results of the Levenshtein evaluation. The column headers are given by `COLUMNS`.

```python
result = []
COLUMNS = ("Source",
           "Machine Translation",
           "Post-edited",
           "Levenshtein",
           "Normalized Levenshtein")
```

## Open the files

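Now, open each file. You can open several files in a single `with` statement by ending each line with a backslash, which continues the statement onto the next line. The code for this step is combined with the next step below.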

## Read the lines and evaluate Levenshtein

Read all the lines into a tuple called `lines`. Then pass that tuple to the `zip` function. Putting an asterisk `*` in front of the argument unpacks the tuple, so each list of lines is passed as its own argument and `zip` yields one (source, raw MT, post-edited) triple per line.
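
As a small illustration of the unpacking, the sketch below uses three made-up two-line lists in place of the real files. The sample sentences are invented for this example.

```python
source_lines = ["Hello.\n", "Thanks.\n"]
raw_mt_lines = ["Hola.\n", "Gracia.\n"]
pemt_lines = ["Hola.\n", "Gracias.\n"]

lines = source_lines, raw_mt_lines, pemt_lines

# zip(*lines) is the same as zip(source_lines, raw_mt_lines, pemt_lines).
rows = list(zip(*lines))
for source, raw_mt, pemt in rows:
    print(repr(source), repr(raw_mt), repr(pemt))
```

Note that `zip` stops at the shortest input without raising an error, which is one reason the files need the same number of lines.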

Levenshtein gives us the number of character edits that occurred between the `raw_mt` and the `pemt`. However, we also want to express this relative to the length of the segment.

Normalized Levenshtein divides the edit distance by the length of the longer of the two segments, giving a value between 0 and 1 that can be read as a percentage. This is often more useful than a raw number of edits, because it lets you compare segments of different lengths.

Append each row to the `result` list of rows.

```python
# Open the files
with open(SOURCE, encoding="UTF-8") as a, \
     open(RAW_MT, encoding="UTF-8") as b, \
     open(PEMT, encoding="UTF-8") as c:

    # Read the lines and evaluate Levenshtein
    lines = a.readlines(), b.readlines(), c.readlines()
    for source, raw_mt, pemt in zip(*lines):
        lv = levenshtein.distance(raw_mt, pemt)
        norm = normalized_levenshtein.distance(raw_mt, pemt)

        # Append each row to the results list of rows
        result.append((source, raw_mt, pemt, lv, norm))
```
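
To make the normalized value concrete, the arithmetic for the example pair can be written out by hand. This sketch hard-codes the distance of 3 rather than calling strsimpy, and it assumes the normalization described above (distance divided by the length of the longer string).

```python
raw_mt = "the cat is barking"
pemt = "the dog is barking"
distance = 3  # three substituted characters: c->d, a->o, t->g

longest = max(len(raw_mt), len(pemt))
normalized = distance / longest

print(longest)              # 18
print(f"{normalized:.3f}")  # 0.167
print(f"{normalized:.1%}")  # 16.7%
```

In the loop above, each line read with `readlines()` still ends in a newline character, so the lengths there are one character longer than in this hand-worked example.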

## Write the results

Write the results to an Excel file. Note that pandas needs the `openpyxl` package installed to write `.xlsx` files.

```python
df = pd.DataFrame(result, columns=COLUMNS)
df.to_excel("Levenshtein.xlsx", index=False)
print(df.head(5))
```

```
                                              Source  \
0  At this stage you are using documented process...   
1  Ultimately, however, reports have limited inhe...   
2  Some organizations are just getting started an...   
3  “To sit back and to see a full year of data, t...   
4                                 What’s possible?\n   

                                 Machine Translation  \
0  En esta etapa, usará un proceso de documentaci...   
1  Sin embargo, en última instancia, el significa...   
2  Algunas organizaciones son incipientes y neces...   
3  "Es impresionante sentarse a examinar los dato...   
4                          ¿Qué posibilidades hay?\n   

                                         Post-edited  Levenshtein  \
0  En esta etapa, está utilizando procesos docume...        163.0   
1  Sin embargo, en última instancia, los informes...         70.0   
2  Algunas organizaciones recién comienzan y nece...         26.0   
3  "Para sentarse y ver un año completo de datos,.
```

Open the file and view the results. Format the Normalized Levenshtein column as a percentage. What do you think? Are these the scores you expected? Remember, it's a good idea to track the time spent on the post-editing task as well. The edit distance only tells you half the story: although an edit may be minimal, a disproportionate amount of time may have been required to make it.