In this notebook, we compare the Moses, spaCy, and NLTK tokenizers on several languages (English, French, German, Spanish, Polish, and Russian).

In [71]:
%pip install -q mosestokenizer nltk spacy pandas



In [73]:
from mosestokenizer import MosesTokenizer
import spacy
from nltk.tokenize.destructive import NLTKWordTokenizer
from itertools import islice
from collections import defaultdict

In [137]:
first_lines = 10000
data_paths = {
    "en": "./data/parallel/en-fr/original_data/train-parts/OPUS-opus100_train-1-eng-fra.eng",
    "fr": "./data/parallel/en-fr/original_data/train-parts/OPUS-opus100_train-1-eng-fra.fra",
    "de": "./data/parallel/en-de/original_data/train-parts/OPUS-opus100_train-1-deu-eng.deu",
    "es": "./data/parallel/en-es/original_data/train-parts/OPUS-opus100_train-1-eng-spa.spa",
    "pl": "./data/parallel/en-pl/original_data/train-parts/OPUS-opus100_train-1-eng-pol.pol",
    "ru": "./data/parallel/ru-cs/original_data/train-parts/Statmt-news_commentary-16-ces-rus.rus",
}
lang = "en"
data_path = data_paths[lang]

In [138]:
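# Read only the first `first_lines` lines; the context manager closes the file
with open(data_path, encoding="utf-8") as f:
    lines = [line.strip() for line in islice(f, first_lines)]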
spacyTokenizer = spacy.blank(lang)
def create_moses(lang):
    moses = MosesTokenizer(lang, no_escape=True) # Turn off html escaping
    def _tokenize(line):
        output = moses(line)
        # Replace @-@ with -, Moses adds the @ symbols because of aggressive hyphen splitting
        output = [x if x != "@-@" else "-" for x in output]
        return output
    return _tokenize


tokenizers = {
    "moses": create_moses(lang),
    "nltk": NLTKWordTokenizer().tokenize,
    "spacy": lambda text: list(map(str, spacyTokenizer.tokenizer(text)))
}
outputs = defaultdict(list)
for tokenizer_name, tokenizer in tokenizers.items():
    print(tokenizer_name)
    for line in lines:
        # Store each tokenization as one space-joined string so outputs can be compared directly
        outputs[tokenizer_name].append(" ".join(tokenizer(line)))

moses
nltk
spacy
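
The `@-@` replacement inside `create_moses` can be checked in isolation, without running Moses itself. The following is a minimal sketch: the helper name `restore_hyphens` is hypothetical, and the token list is hand-written to resemble Moses output with aggressive hyphen splitting rather than captured from a real run.

```python
def restore_hyphens(tokens):
    """Map the Moses hyphen marker "@-@" back to "-"; leave all other tokens unchanged."""
    return ["-" if token == "@-@" else token for token in tokens]


# Hand-written token list (an assumption, not captured Moses output)
sample = ["non", "@-@", "prefixed", "multi", "@-@", "value", "attributes", "."]
print(" ".join(restore_hyphens(sample)))
```

Only tokens that are exactly `@-@` are rewritten, so a token that merely contains the marker is left untouched.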


In [139]:
import pandas as pd
df = pd.DataFrame({
    "original": lines,
    "moses": outputs["moses"],
    "nltk": outputs["nltk"],
    "spacy": outputs["spacy"]
})

In [140]:
# Per-column statistics of whitespace-separated token counts (DataFrame.map requires pandas >= 2.1)
df.map(lambda x: len(x.split())).describe()

          original       moses        nltk       spacy
count      10000.0     10000.0     10000.0     10000.0
mean       13.7551     16.6635      16.148     16.4548
std      16.347992    18.77709   18.040129   18.477735
min            1.0         1.0         1.0         1.0
25%            4.0         6.0         6.0         6.0
50%            8.0        10.0        10.0        10.0
75%           18.0        22.0        21.0        21.0
max          651.0       713.0       693.0       713.0
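
To make the means easier to compare, each tokenizer's mean token count can be expressed relative to plain whitespace splitting of the original text. This is a small sketch using only the means reported above; the function name `inflation` is a hypothetical helper, not part of the notebook.

```python
# Mean tokens per line, copied from the describe() output above
mean_tokens = {"original": 13.7551, "moses": 16.6635, "nltk": 16.148, "spacy": 16.4548}


def inflation(means, baseline="original"):
    """Return each tokenizer's mean token count divided by the baseline column's mean."""
    base = means[baseline]
    return {name: value / base for name, value in means.items() if name != baseline}


for name, ratio in inflation(mean_tokens).items():
    print(f"{name}: {ratio:.3f}x the whitespace-split token count")
```

On this English sample, all three tokenizers produce roughly 17 to 21 percent more tokens than whitespace splitting, with Moses splitting the most and NLTK the least.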


In [152]:
import difflib
from compare_tokenizers_utils import display_diff

fst = "moses"
snd = "spacy"
# Select the lines on which the two tokenizers disagree and show a random sample
differing = df[df[fst] != df[snd]]
print("Differences", len(differing))
df_filtered = differing.sample(20)
diffs = df_filtered.apply(lambda row: display_diff(difflib.SequenceMatcher(None, row[fst], row[snd])), axis=1)
pd.DataFrame({
    "original": df_filtered.original,
    fst: df_filtered[fst],
    f"{fst} versus {snd}": diffs,
    snd: df_filtered[snd],
}).style

Differences 1388


,original,moses,moses versus spacy,spacy
5742,"If the optional parameter is missing the attributes' Title', 'Description 'and'Keyword' are treated as language attributes and the attributes'Group', 'Parent 'and'HtmlAttr' as non-prefixed multi-value attributes.","If the optional parameter is missing the attributes ' Title ' , ' Description ' and 'Keyword ' are treated as language attributes and the attributes 'Group ' , ' Parent ' and 'HtmlAttr ' as non - prefixed multi - value attributes .","If the optional parameter is missing the attributes ' Title ' , ' Description ' and'Keyword ' are treated as language attributes and the attributes'Group ' , ' Parent ' and'HtmlAttr ' as non - prefixed multi - value attributes .","If the optional parameter is missing the attributes ' Title ' , ' Description ' and'Keyword ' are treated as language attributes and the attributes'Group ' , ' Parent ' and'HtmlAttr ' as non - prefixed multi - value attributes ."
9286,Avg. wind speed 5.7 km/h 6.4 km/h 6 km/h 5.7 km/h 6.9 km/h 5.3 km/h 6.2 km/h 3.8 km/h,Avg. wind speed 5.7 km / h 6.4 km / h 6 km / h 5.7 km / h 6.9 km / h 5.3 km / h 6.2 km / h 3.8 km / h,Avg . wind speed 5.7 km / h 6.4 km / h 6 km / h 5.7 km / h 6.9 km / h 5.3 km / h 6.2 km / h 3.8 km / h,Avg . wind speed 5.7 km / h 6.4 km / h 6 km / h 5.7 km / h 6.9 km / h 5.3 km / h 6.2 km / h 3.8 km / h
2612,http://www.akalmy.com Webdesigner HTML5,http : / / www.akalmy.com Webdesigner HTML5,http://www.akalmy.com Webdesigner HTML5,http://www.akalmy.com Webdesigner HTML5
1885,"He begged the priests therefore to read the governor’s proclamation in the churches and to preach loyalty, “which may very well be done with commonplaces.”","He begged the priests therefore to read the governor ’ s proclamation in the churches and to preach loyalty , “ which may very well be done with commonplaces . ”","He begged the priests therefore to read the governor ’s proclamation in the churches and to preach loyalty , “ which may very well be done with commonplaces . ”","He begged the priests therefore to read the governor ’s proclamation in the churches and to preach loyalty , “ which may very well be done with commonplaces . ”"
2617,"Cathy, Sheila doesn't drive.","Cathy , Sheila doesn 't drive .","Cathy , Sheila does n't drive .","Cathy , Sheila does n't drive ."
5608,- Aren't you in Florodora anymore?,- Aren 't you in Florodora anymore ?,- Are n't you in Florodora anymore ?,- Are n't you in Florodora anymore ?
1986,"Informal consultations on agenda item 112 (Programme budget for the biennium 2002-2003: Cooperation between headquarters departments and regional commissions (A/57/361, E/2002/15, E/2002/15/Add.1, E/2002/15/Add.2, E/2002/15/Add.3 and E/2002/15/Add.3/Corr.1 and A/57/7/Add.3))","Informal consultations on agenda item 112 ( Programme budget for the biennium 2002 - 2003 : Cooperation between headquarters departments and regional commissions ( A / 57 / 361 , E / 2002 / 15 , E / 2002 / 15 / Add.1 , E / 2002 / 15 / Add.2 , E / 2002 / 15 / Add.3 and E / 2002 / 15 / Add.3 / Corr.1 and A / 57 / 7 / Add.3 ) )","Informal consultations on agenda item 112 ( Programme budget for the biennium 2002 - 2003 : Cooperation between headquarters departments and regional commissions ( A/57/361 , E/2002/15 , E/2002/15 / Add.1 , E/2002/15 / Add.2 , E/2002/15 / Add.3 and E/2002/15 / Add.3 / Corr.1 and A/57/7 / Add.3 ) )","Informal consultations on agenda item 112 ( Programme budget for the biennium 2002 - 2003 : Cooperation between headquarters departments and regional commissions ( A/57/361 , E/2002/15 , E/2002/15 / Add.1 , E/2002/15 / Add.2 , E/2002/15 / Add.3 and E/2002/15 / Add.3 / Corr.1 and A/57/7 / Add.3 ) )"
597,Isn't this great?,Isn 't this great ?,Is n't this great ?,Is n't this great ?
3380,"15:23And they all wept with a loud voice, and all the people passed over: the king also himself went over the brook Cedron, and all the people marched towards the way that looketh to the desert.","15 : 23And they all wept with a loud voice , and all the people passed over : the king also himself went over the brook Cedron , and all the people marched towards the way that looketh to the desert .","15:23And they all wept with a loud voice , and all the people passed over : the king also himself went over the brook Cedron , and all the people marched towards the way that looketh to the desert .","15:23And they all wept with a loud voice , and all the people passed over : the king also himself went over the brook Cedron , and all the people marched towards the way that looketh to the desert ."
7994,"39 According to the Americans, the official report of their losses gave 30 dead, 42 wounded and 389 prisoners, but in reality there were more dead (and no doubt also more wounded), as demonstrated by Stanley, George F. G., in Canada Invaded (Toronto: 1973), pp. 103-104.","39 According to the Americans , the official report of their losses gave 30 dead , 42 wounded and 389 prisoners , but in reality there were more dead ( and no doubt also more wounded ) , as demonstrated by Stanley , George F. G. , in Canada Invaded ( Toronto : 1973 ) , pp. 103 - 104 .","39 According to the Americans , the official report of their losses gave 30 dead , 42 wounded and 389 prisoners , but in reality there were more dead ( and no doubt also more wounded ) , as demonstrated by Stanley , George F. G. , in Canada Invaded ( Toronto : 1973 ) , pp . 103 - 104 .","39 According to the Americans , the official report of their losses gave 30 dead , 42 wounded and 389 prisoners , but in reality there were more dead ( and no doubt also more wounded ) , as demonstrated by Stanley , George F. G. , in Canada Invaded ( Toronto : 1973 ) , pp . 103 - 104 ."


<h2>Legend for the detailed view</h2>
For tokenizer_A versus tokenizer_B:<br>
<span style='background-color: #FFFF88;'>yellow</span> = separating space between tokens present in both A and B<br>
<span style='background-color: #88FF88;'>green</span> = space present in B, missing in A<br>
<span style='background-color: #FF0000;'>red line</span> = space present in A, missing in B<br>
<span style='background-color: #FF88FF;'>purple</span> = replaced substring<br>
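
The source of `display_diff` is not shown in this notebook, so the following is only a sketch of how the legend categories could be derived from the opcodes of `difflib.SequenceMatcher`. The function `summarize_space_diffs` and its dictionary keys are hypothetical names, and the real helper may work differently.

```python
import difflib


def summarize_space_diffs(a: str, b: str) -> dict:
    """Count legend categories for two space-joined tokenizations of the same line."""
    counts = {"shared_space": 0, "space_only_in_b": 0, "space_only_in_a": 0, "replaced": 0}
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":      # yellow: spaces present in both A and B
            counts["shared_space"] += a[i1:i2].count(" ")
        elif tag == "insert":   # green: spaces present in B, missing in A
            counts["space_only_in_b"] += b[j1:j2].count(" ")
        elif tag == "delete":   # red: spaces present in A, missing in B
            counts["space_only_in_a"] += a[i1:i2].count(" ")
        else:                   # purple: one replaced substring
            counts["replaced"] += 1
    return counts


# Row 597 from the table above: Moses output versus spaCy output
print(summarize_space_diffs("Isn 't this great ?", "Is n't this great ?"))
```

For this row the two tokenizers differ only in where they split the contraction: spaCy has a space inside "Is n't" that Moses lacks, and Moses has one in "Isn 't" that spaCy lacks.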