This repository contains helper scripts and tools for the DGT task in WNGT 2019 at EMNLP19.
Participants may want to use external resources from the list on the task website. We provide the word tokenizer and sentence tokenizer that were used to create the RotoWire Parallel dataset. Applying them to raw text yields tokenization consistent with the provided dataset.
Please use Python >= 3.5. Run `pip install nltk` to install NLTK, then download the punkt models from a Python interpreter:

```
$ python
>>> import nltk
>>> nltk.download("punkt")
```
You can use the tokenizer either as a script or from your Python code. The following command tokenizes the sentences in `input.txt` line by line and writes space-delimited sentences to `output.txt`:

```
$ python tools/tokenizer.py [english|german] < input.txt > output.txt
```
For use in Python code, the following two functions are defined in `tools/tokenizer.py`:

- `word_tokenize(string: str, language: str) -> List[str]`: tokenize a string into a list of words.
- `sent_tokenize(string: str, language: str) -> List[str]`: tokenize a string into a list of sentences.

The second argument can be either `english` or `german`, depending on the language you want to tokenize. The default language is `english`. See the examples below:
```python
# Copy and place the file in your project directory, and import in your code
from tokenizer import word_tokenize, sent_tokenize

word_tokenize("Vince Carter is a basketball player.", language="english")
# ['Vince', 'Carter', 'is', 'a', 'basketball', 'player', '.']

sent_tokenize("Vince Carter is a basketball player. Michael Jordan is a basketball player.")
# ['Vince Carter is a basketball player.', 'Michael Jordan is a basketball player.']
```
If interested, more details on the construction of the tokenizer are discussed here.
We include the following helper scripts for processing the outputs before submission. Both scripts have been tested with Python 2 and 3.
Use `scripts/plain2json.py` to convert plain text to the JSON format. This script may be useful for participants in the MT track:

```
$ python plain2json.py --source-dir /path/to/sentence-by-sentence/translations --target-json output.json
```
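To illustrate the general shape of such a conversion, here is a hypothetical sketch. The actual JSON schema expected by the task is defined by `scripts/plain2json.py` itself; this example only assumes one file per game under the source directory, with one translated sentence per line.

```python
# Hypothetical sketch of a plain-text-to-JSON conversion. The authoritative
# schema is defined by scripts/plain2json.py; the mapping used here
# (file name -> list of lines) is an illustrative assumption.
import json
import os

def plain_to_json(source_dir, target_json):
    data = {}
    for name in sorted(os.listdir(source_dir)):
        with open(os.path.join(source_dir, name), encoding="utf-8") as f:
            data[name] = [line.rstrip("\n") for line in f]
    with open(target_json, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```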
Use `scripts/validate_outputs.py` to confirm that the submission file is valid and contains all the outputs required for evaluation:

```
$ python validate_outputs.py /path/to/your/submission/file
```
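The core idea of such a check can be sketched as follows. The real checks in `scripts/validate_outputs.py` are authoritative and may be stricter; this hypothetical sketch only illustrates the completeness requirement, assuming a JSON submission that maps example ids to outputs.

```python
# Hypothetical completeness check: every expected example id must map to a
# non-empty output in the submission JSON. The real validator defines the
# authoritative format and checks.
import json

def find_missing_outputs(submission_path, expected_ids):
    with open(submission_path, encoding="utf-8") as f:
        outputs = json.load(f)
    # An id counts as missing if it is absent or mapped to an empty output.
    return [i for i in expected_ids if not outputs.get(i)]
```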
Contact: wngt2019-organizers[at]googlegroups.com