All dependencies can be installed via:
pip3 install -r requirements.txt
mkdir "Envs"
cd "Envs"
# fairseq
echo setup fairseq...
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
# moses
echo setup moses...
git clone https://github.com/moses-smt/mosesdecoder.git
Dictionary Generation: Generates a new dictionary from the prepared bi-text
python3 Scripts/generate_dict.py dicts_path
Example:
# generate dict for pair data in Dictionaries/pair_dict
python3 Scripts/generate_dict.py Dictionaries/pair_dict
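The exact format and method used by generate_dict.py are defined by the script itself; as a rough illustration (function and variable names here are hypothetical), a bilingual dictionary can be approximated from sentence-aligned bi-text by mapping each source word to its most frequent co-occurring target word:

```python
from collections import Counter, defaultdict

def build_dictionary(src_sents, tgt_sents):
    """Map each source word to its most frequent co-occurring target word.
    A crude stand-in for proper word alignment (e.g. fast_align or GIZA++)."""
    cooc = defaultdict(Counter)
    for src, tgt in zip(src_sents, tgt_sents):
        for s in src.split():
            cooc[s].update(tgt.split())
    return {s: counts.most_common(1)[0][0] for s, counts in cooc.items()}

pairs = [
    ("the house", "la casa"),
    ("the dog", "el perro"),
    ("the house is big", "la casa es grande"),
]
d = build_dictionary([p[0] for p in pairs], [p[1] for p in pairs])
```

Plain co-occurrence counting is noisy; real pipelines run a word aligner first, but the idea of distilling bi-text into a word-level dictionary is the same.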
Sentences AA: Uses the bi-text dictionary to translate words from one language into the other. Note that the replacement proportion is 100% in our script; you can easily adjust it in the script
python3 Scripts/add_AA.py src_lang tgt_lang dicts_path input_file output_file
Example:
# translate train.en_XX to train.ro_RO by Dictionaries/pair_dict
python3 Scripts/add_AA.py EN RO Dictionaries/pair_dict train.en_XX train.ro_RO
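add_AA.py's exact behavior is defined by the script itself; conceptually, dictionary-based augmentation replaces a proportion of the tokens in each sentence with their dictionary translations. A hypothetical sketch with a configurable proportion (the script defaults to 100%):

```python
import random

def dict_translate(sentence, dictionary, proportion=1.0, seed=0):
    """Replace a fraction of tokens with their dictionary translations.
    Tokens missing from the dictionary are left unchanged."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if tok in dictionary and rng.random() < proportion:
            out.append(dictionary[tok])
        else:
            out.append(tok)
    return " ".join(out)

en_ro = {"the": "cel", "house": "casa"}  # toy dictionary, not real data
print(dict_translate("the house is big", en_ro))
```

With proportion=1.0 every dictionary hit is replaced; lowering it yields code-switched sentences that mix both languages.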
Data Split: Divides the source and target corpora into train, valid, and test sets. Note that the valid and test sets are 2k sentences each; you can easily adjust this in the script
python3 Scripts/split_data.py src_lang tgt_lang src_corpus tgt_corpus
Example:
# divide corpus.en and corpus.ro to train, valid and test data
# valid and test data are both 2k
python3 Scripts/split_data.py en_XX ro_RO corpus.en corpus.ro
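How split_data.py slices the corpus is defined by the script; a minimal sketch of the underlying operation, keeping the source and target sides aligned and taking the valid and test sets (2k lines each by default, mirroring the script) from the end of the corpus (an assumption for illustration):

```python
def split_corpus(src_lines, tgt_lines, n_valid=2000, n_test=2000):
    """Split parallel lines into train/valid/test, keeping pairs aligned."""
    assert len(src_lines) == len(tgt_lines), "corpora must be parallel"
    n_train = len(src_lines) - n_valid - n_test
    assert n_train > 0, "corpus too small for the requested splits"
    splits = {}
    for name, lo, hi in [("train", 0, n_train),
                         ("valid", n_train, n_train + n_valid),
                         ("test", n_train + n_valid, len(src_lines))]:
        splits[name] = (src_lines[lo:hi], tgt_lines[lo:hi])
    return splits
```

The key point is that source and target lines are sliced with identical indices, so sentence pairs never fall into different splits.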
SentencePiece Sub-word: Applies SentencePiece sub-word tokenization to the experimental data
python3 Scripts/spm.py model_path < input_file > output_file
Example:
SPM=spm.py
MODEL=mbart.cc25/sentence.bpe.model
DATA=prepare_data
TRAIN=train
VALID=valid
TEST=test
SRC=en_XX
TGT=ro_RO
python3 ${SPM} ${MODEL} < ${DATA}/${TRAIN}.${SRC} > ${DATA}/${TRAIN}.spm.${SRC}
python3 ${SPM} ${MODEL} < ${DATA}/${TRAIN}.${TGT} > ${DATA}/${TRAIN}.spm.${TGT}
python3 ${SPM} ${MODEL} < ${DATA}/${VALID}.${SRC} > ${DATA}/${VALID}.spm.${SRC}
python3 ${SPM} ${MODEL} < ${DATA}/${VALID}.${TGT} > ${DATA}/${VALID}.spm.${TGT}
python3 ${SPM} ${MODEL} < ${DATA}/${TEST}.${SRC} > ${DATA}/${TEST}.spm.${SRC}
python3 ${SPM} ${MODEL} < ${DATA}/${TEST}.${TGT} > ${DATA}/${TEST}.spm.${TGT}
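spm.py applies a trained SentencePiece model, which segments text with a learned BPE or unigram model. As a rough standard-library illustration of the sub-word idea only (not the actual SentencePiece algorithm, and the vocabulary here is made up), greedy longest-match segmentation against a fixed vocabulary looks like:

```python
def segment(word, vocab):
    """Greedy longest-match sub-word segmentation (illustration only;
    SentencePiece itself uses learned BPE or unigram LM segmentation).
    Unknown spans fall back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

vocab = {"trans", "lat", "ion", "un"}  # toy vocabulary for illustration
print(segment("translation", vocab))
```

Rare words split into frequent pieces this way, which is what lets the model handle open vocabularies with a fixed sub-word inventory.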
To run the Python scripts and calculate the MT evaluation metrics on your machine translation output, you need to have two files:
- ref.txt: the human (reference/target) translation file of your test dataset.
- hyp.txt: the machine-translated output (hypothesis) generated by the MT model for the source side of the same test dataset.
Corpus BLEU: Calculates the BLEU score for the whole corpus and prints the result.
python3 MT-Evaluation/BLEU/compute-bleu.py ref.txt hyp.txt
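compute-bleu.py's internals are not shown here; for intuition, a minimal standard-library version of corpus BLEU (clipped n-gram precisions up to 4-grams plus a brevity penalty, no smoothing) could look like:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(refs, hyps, max_n=4):
    """Minimal corpus BLEU: clipped n-gram precision + brevity penalty."""
    match, total = [0] * max_n, [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(refs, hyps):
        r, h = ref.split(), hyp.split()
        ref_len += len(r)
        hyp_len += len(h)
        for n in range(1, max_n + 1):
            rg, hg = ngrams(r, n), ngrams(h, n)
            match[n - 1] += sum(min(c, rg[g]) for g, c in hg.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:  # unsmoothed: any empty precision zeroes the score
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

Production implementations (e.g. sacreBLEU) add smoothing and standardized tokenization, which matter for comparable scores.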
Sentence BLEU: Calculates the BLEU score sentence by sentence and saves the results to a file.
python3 MT-Evaluation/BLEU/compute-bleu-sentence.py ref.txt hyp.txt
Corpus TER: Calculates the TER score for the whole corpus and prints the result.
python3 MT-Evaluation/TER/compute-ter.py ref.txt hyp.txt
Sentence TER: Calculates the TER score sentence by sentence and saves the results to a file.
python3 MT-Evaluation/TER/compute-ter-sentence.py ref.txt hyp.txt
Corpus CHRF: Calculates the CHRF score for the whole corpus and prints the result.
python3 MT-Evaluation/CHRF/compute-chrf.py ref.txt hyp.txt
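For intuition (the script's own implementation may differ), chrF is a character n-gram F-score, by default an F-beta with beta = 2 averaged over n-gram orders up to 6; a simplified standard-library sketch:

```python
from collections import Counter

def char_ngrams(text, n):
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(ref, hyp, max_n=6, beta=2.0):
    """Character n-gram F-beta score, averaged over n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        rg, hg = char_ngrams(ref, n), char_ngrams(hyp, n)
        if not rg and not hg:
            continue  # both strings too short for this order
        overlap = sum((rg & hg).values())
        prec = overlap / max(sum(hg.values()), 1)
        rec = overlap / max(sum(rg.values()), 1)
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Because it works on characters rather than words, chrF is more forgiving of morphological variation than BLEU, which is why it is popular for morphologically rich languages.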
Sentence CHRF: Calculates the CHRF score sentence by sentence and saves the results to a file.
python3 MT-Evaluation/CHRF/compute-chrf-sentence.py ref.txt hyp.txt
Sentence METEOR: Note that METEOR works at the sentence level only.
python3 MT-Evaluation/METEOR/sentence-meteor.py ref.txt hyp.txt
Corpus WER: Calculates the WER score for the whole corpus and prints the result.
python3 MT-Evaluation/WER/corpus-wer.py ref.txt hyp.txt
Sentence WER: Calculates the WER score sentence by sentence and saves the results to a file.
python3 MT-Evaluation/WER/sentence-wer.py ref.txt hyp.txt
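WER is word-level Levenshtein distance (substitutions, insertions, and deletions) normalized by the reference length; a minimal sketch of the computation the scripts perform:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # prev[j] holds the edit distance between r[:i-1] and h[:j]
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (rw != hw))  # substitution or match
        prev = cur
    return prev[-1] / max(len(r), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions relative to a short reference.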