🔨 A Simple and Open Source Machine Translation Experiment Auxiliary Tools

junchaoIU/OpenMT

OpenMT: A Simple, Open-Source Set of Auxiliary Tools for Machine Translation Experiments

Installation

All dependencies can be installed via:

pip3 install -r requirements.txt

Basic Environments

mkdir "Envs"
cd "Envs"

# fairseq
echo setup fairseq...
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
# return to Envs so that moses is cloned next to fairseq, not inside it
cd ..

# moses
echo setup moses...
git clone https://github.com/moses-smt/mosesdecoder.git

MT-Scripts

Dictionary Generation: Generate a new dictionary from prepared bi-text.

python3 Scripts/generate_dict.py dicts_path

Example: 
# generate dict for pair data in Dictionaries/pair_dict
python3 Scripts/generate_dict.py Dictionaries/pair_dict

Sentences AA: Use the bi-text dictionary to translate text from one language into another. Note that the proportion is 100% in our script; you can change it in the script easily.

python3 Scripts/add_AA.py src_lang tgt_lang dicts_path input_file output_file

Example: 
# translate train.en_XX to train.ro_RO by Dictionaries/pair_dict
python3 Scripts/add_AA.py EN RO Dictionaries/pair_dict train.en_XX train.ro_RO
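The dictionary-based translation described above can be sketched roughly as follows. This is a minimal illustration, not the code in Scripts/add_AA.py: the helper names `load_pair_dict` and `replace_with_dict`, the whitespace-separated "source target" dictionary format, and the `proportion` parameter are all assumptions made for the example.

```python
import random


def load_pair_dict(lines):
    """Parse 'source target' entries into a dict (entry format is an assumption)."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            # Keep the first translation seen for each source word.
            table.setdefault(parts[0], parts[1])
    return table


def replace_with_dict(sentence, table, proportion=1.0, rng=None):
    """Replace dictionary words in a sentence with their translations.

    Each word found in the table is replaced with probability `proportion`;
    1.0 replaces every dictionary word, 0.0 replaces none.
    """
    rng = rng or random.Random(0)
    out = []
    for tok in sentence.split():
        if tok in table and rng.random() < proportion:
            out.append(table[tok])
        else:
            out.append(tok)
    return " ".join(out)
```

With `proportion=1.0`, every word that has a dictionary entry is replaced, which corresponds to the 100% setting mentioned above; words without an entry are left unchanged.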

Data Split: Divide the source and target corpora into train, valid, and test data. Note that the valid and test sets are both 2k; you can change this in the script easily.

python3 Scripts/split_data.py src_lang tgt_lang src_corpus tgt_corpus

Example: 
# divide corpus.en and corpus.ro to train, valid and test data 
# valid and test data are both 2k
python3 Scripts/split_data.py en_XX ro_RO corpus.en corpus.ro
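As an illustration of the split, the sketch below holds out the last lines of a parallel corpus as valid and test data. It is not the code in Scripts/split_data.py: the function name `split_parallel`, the absence of shuffling, and taking the held-out sets from the end of the corpus are assumptions.

```python
def split_parallel(src_lines, tgt_lines, valid_size=2000, test_size=2000):
    """Split aligned source/target lines into train, valid and test pairs."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target corpora must have the same number of lines")
    n = len(src_lines)
    if valid_size + test_size >= n:
        raise ValueError("corpus is too small for the requested valid/test sizes")

    pairs = list(zip(src_lines, tgt_lines))
    train_end = n - valid_size - test_size
    valid_end = n - test_size
    return {
        "train": pairs[:train_end],
        "valid": pairs[train_end:valid_end],
        "test": pairs[valid_end:],
    }
```

Keeping source and target lines zipped together during the split guarantees that the two sides stay aligned in every subset.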

SentencePiece Sub-word: Apply SentencePiece sub-word segmentation to the experimental data.

python3 Scripts/spm.py model_path < input_file > output_file

Example: 
SPM=Scripts/spm.py
MODEL=mbart.cc25/sentence.bpe.model
DATA=prepare_data
TRAIN=train
VALID=valid
TEST=test
SRC=en_XX
TGT=ro_RO

python3 ${SPM} ${MODEL} < ${DATA}/${TRAIN}.${SRC} > ${DATA}/${TRAIN}.spm.${SRC} 
python3 ${SPM} ${MODEL} < ${DATA}/${TRAIN}.${TGT} > ${DATA}/${TRAIN}.spm.${TGT} 
python3 ${SPM} ${MODEL} < ${DATA}/${VALID}.${SRC} > ${DATA}/${VALID}.spm.${SRC} 
python3 ${SPM} ${MODEL} < ${DATA}/${VALID}.${TGT} > ${DATA}/${VALID}.spm.${TGT} 
python3 ${SPM} ${MODEL} < ${DATA}/${TEST}.${SRC} > ${DATA}/${TEST}.spm.${SRC} 
python3 ${SPM} ${MODEL} < ${DATA}/${TEST}.${TGT} > ${DATA}/${TEST}.spm.${TGT} 
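A filter of this kind, which reads raw text line by line and writes space-joined sub-word pieces, might look like the sketch below. It assumes the `sentencepiece` Python package is installed; the helper names `make_encoder` and `encode_lines` are hypothetical and are not taken from Scripts/spm.py.

```python
def make_encoder(model_path):
    """Return a function that maps a sentence to a list of sub-word pieces."""
    import sentencepiece as spm  # third-party package, imported only when needed

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return lambda text: sp.encode(text, out_type=str)


def encode_lines(encode, lines):
    """Yield one space-joined line of pieces per input line."""
    for line in lines:
        yield " ".join(encode(line.strip()))


# Possible usage as a stdin-to-stdout filter:
#   import sys
#   encode = make_encoder("mbart.cc25/sentence.bpe.model")
#   for out in encode_lines(encode, sys.stdin):
#       print(out)
```

Passing the encoder in as a function keeps the line-handling logic independent of the model, so it can be checked with a stand-in tokenizer.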

MT-Evaluation

To calculate the MT evaluation metrics on your machine translation output, you need two files:

  • ref.txt: The human translation (target) file of your test dataset.
  • hyp.txt: The machine translation (prediction) generated by the MT model for the source side of the same test dataset used for the reference.
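Because the two files are compared line by line, it can help to confirm that they are aligned before computing any metric. The sketch below is one way to do that; the function name `load_pairs` is hypothetical and is not part of this repository.

```python
def load_pairs(ref_lines, hyp_lines):
    """Pair each reference line with its hypothesis; fail if the counts differ."""
    refs = [line.strip() for line in ref_lines]
    hyps = [line.strip() for line in hyp_lines]
    if len(refs) != len(hyps):
        raise ValueError(
            "ref has %d lines but hyp has %d lines" % (len(refs), len(hyps))
        )
    return list(zip(refs, hyps))


# Possible usage with the two files:
#   with open("ref.txt", encoding="utf-8") as r, open("hyp.txt", encoding="utf-8") as h:
#       pairs = load_pairs(r, h)
```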

Corpus BLEU: Calculates the BLEU score for the whole corpus and prints the result.

python3 MT-Evaluation/BLEU/compute-bleu.py ref.txt hyp.txt

Sentence BLEU: Calculates the BLEU score sentence by sentence and saves the result to a file.

python3 MT-Evaluation/BLEU/compute-bleu-sentence.py ref.txt hyp.txt

Corpus TER: Calculates the TER score for the whole corpus and prints the result.

python3 MT-Evaluation/TER/compute-ter.py ref.txt hyp.txt

Sentence TER: Calculates the TER score sentence by sentence and saves the result to a file.

python3 MT-Evaluation/TER/compute-ter-sentence.py ref.txt hyp.txt

Corpus CHRF: Calculates the CHRF score for the whole corpus and prints the result.

python3 MT-Evaluation/CHRF/compute-chrf.py ref.txt hyp.txt

Sentence CHRF: Calculates the CHRF score sentence by sentence and saves the result to a file.

python3 MT-Evaluation/CHRF/compute-chrf-sentence.py ref.txt hyp.txt

Sentence METEOR: Note that METEOR works on the sentence level only.

python3 MT-Evaluation/METEOR/sentence-meteor.py ref.txt hyp.txt

Corpus WER: Calculates the WER score for the whole corpus and prints the result.

python3 MT-Evaluation/WER/corpus-wer.py ref.txt hyp.txt

Sentence WER: Calculates the WER score sentence by sentence and saves the result to a file.

python3 MT-Evaluation/WER/sentence-wer.py ref.txt hyp.txt
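To make the difference between corpus-level and sentence-level WER concrete, here is a small pure-Python sketch of the standard definition (word-level edit distance divided by the number of reference words). It is an illustration only; the scripts in MT-Evaluation/WER may rely on a library and may tokenize or normalize differently. All function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def sentence_wer(ref, hyp):
    """WER of one sentence; an empty reference counts as length 1 (assumption)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def corpus_wer(refs, hyps):
    """Total edits over total reference words, pooled across all sentences."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return edits / max(words, 1)
```

The corpus score pools edits and reference lengths before dividing, so it is generally not the same as the average of the per-sentence scores: long sentences weigh more in the corpus figure.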
