
DGT Task for WNGT 2019 at EMNLP19

This repository contains helper scripts and tools for the DGT task in WNGT 2019 at EMNLP19.

Tools

Tokenization

Participants might want to utilize external resources from the list on the task website. We provide a word tokenizer and a sentence tokenizer, which were used to create the RotoWire Parallel dataset. You can apply them to raw text to obtain tokenization consistent with the provided dataset.

Requirements

Please use Python >= 3.5. Run pip install nltk to install NLTK, then download the punkt models from a Python interpreter:

$ python
>>> import nltk
>>> nltk.download("punkt")
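
Equivalently, you can download the models non-interactively with a shell one-liner (same nltk.download call):

$ python -c "import nltk; nltk.download('punkt')"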

Usage

You can use the tokenizer either as a script or from your Python code. The following command tokenizes the sentences in input.txt line by line and writes them, space-delimited, to output.txt.

$ python tools/tokenizer.py [english|german] < input.txt > output.txt

To use it from Python code, import the following two functions, which are defined in tools/tokenizer.py:

  • word_tokenize(string: str, language: str) -> List[str]: Tokenize a string into a list of words.
  • sent_tokenize(string: str, language: str) -> List[str]: Tokenize a string into a list of sentences.

The second argument can be either english or german, depending on the language you want to tokenize; the default is english. See the examples below:

# Copy and place the file in your project directory, and import in your code
from tokenizer import word_tokenize, sent_tokenize

word_tokenize("Vince Carter is a basketball player.", language="english")
# ['Vince', 'Carter', 'is', 'a', 'basketball', 'player', '.']

sent_tokenize("Vince Carter is a basketball player. Michael Jordan is a basketball player.")
# ['Vince Carter is a basketball player.', 'Michael Jordan is a basketball player.']
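
German input works the same way by passing language="german". A small sketch (the output shown is illustrative; the exact tokens depend on the punkt models and the tokenizer's rules):

word_tokenize("Vince Carter ist ein Basketballspieler.", language="german")
# e.g. ['Vince', 'Carter', 'ist', 'ein', 'Basketballspieler', '.']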

If interested, more details on the construction of the tokenizer are discussed here.

Helper Scripts

We include the following helper scripts for processing outputs before submission. Both scripts are tested with Python 2 and 3.

Convert from plain text to JSON format

Use scripts/plain2json.py to convert from plain text to JSON format. This script might be useful for the participants in the MT track.

$ python plain2json.py --source-dir /path/to/sentence-by-sentence/translations --target-json output.json
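
To sanity-check the converted file, you can load it back with the standard json module (a minimal sketch; the structure of the parsed object follows the task's submission format):

import json

# Load the converted submission and inspect its top-level structure
with open("output.json") as f:
    submission = json.load(f)
print(type(submission))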

Validate the output before submission

Use scripts/validate_outputs.py to confirm that the submission file is valid and contains all the outputs for evaluation.

$ python validate_outputs.py /path/to/your/submission/file
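
If you would rather run the validation from Python, e.g. as the last step of an output pipeline, a minimal sketch wrapping the same command (Python 3.5+; the submission path is a placeholder):

import subprocess

# Invoke the provided validator; check=True raises CalledProcessError
# if the script exits with a non-zero status.
subprocess.run(
    ["python", "scripts/validate_outputs.py", "path/to/submission.json"],
    check=True,
)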

Contacts

  • wngt2019-organizers [at] googlegroups.com