
English-Chinese neural machine translation sample data format.

This directory contains sample data for English-Chinese neural machine translation (NMT). Use it as a reference for the expected data format when preparing your own data.

Note that this sample data should not be used to train the NMT model.

- `en_yttm.model`: English YouTokenToMe model for the source (src) tokenizer.
- `zh_vocab.txt`: Chinese character vocabulary for the target (tgt) tokenizer.
- `train.en`: English sentence data for training.
- `train.zh`: Chinese sentence data for training.
- `valid.en`: English sentence data for validation.
- `valid.zh`: Chinese sentence data for validation.
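
When preparing your own data in this layout, a quick sanity check on each `.en`/`.zh` pair can catch misaligned files early. The sketch below is not part of NeMo. It assumes that each file is UTF-8 text with one sentence per line and that line *i* of the English file is the translation of line *i* of the Chinese file, which this README does not state explicitly. The helper name `count_parallel_pairs` is hypothetical.

```python
def read_sentences(path):
    """Read a text file as a list of lines, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def count_parallel_pairs(src_path, tgt_path):
    """Return the number of sentence pairs in two line-aligned files.

    Assumption (not confirmed by the README): one sentence per line,
    and line i of src_path corresponds to line i of tgt_path.
    """
    src = read_sentences(src_path)
    tgt = read_sentences(tgt_path)

    # Line-aligned parallel files must have the same number of lines.
    if len(src) != len(tgt):
        raise ValueError(
            f"{src_path} has {len(src)} lines but {tgt_path} has {len(tgt)}"
        )

    # An empty line on either side usually means a dropped sentence.
    for i, (s, t) in enumerate(zip(src, tgt), start=1):
        if not s.strip() or not t.strip():
            raise ValueError(f"empty sentence on line {i}")

    return len(src)
```

For example, `count_parallel_pairs("train.en", "train.zh")` and `count_parallel_pairs("valid.en", "valid.zh")` would each return the number of sentence pairs, or raise `ValueError` if the two files do not line up.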