The tmx-prepro module (now PortageTMXPrepro) automates extracting sentence pairs from TMX files and splitting them into the training ("train"), development ("dev") and test ("test") sets required to train and tune a PortageII system.
Directory template/tmx/ is where you drop your input TMX files.
Directory template/preparation/ contains a Makefile to extract text from TMX files and clean it up. The Makefile.params file is used to set up various parameters indicating what the TMX files are called, which ones should be used to sample the dev and test sets, which ones should be combined, etc.
Directory template/corpora/ is where the dev and test sampling happens. It reuses the same Makefile.params from template/preparation/.
1. Make a copy of the template directory to use for the corpus:

       cd $PORTAGE/tmx-prepro
       cp -pr template /path/to/training/area/template

   or

       git clone https://github.com/nrc-cnrc/PortageTMXPrepro.git /path/to/training/area

2. Drop your TMX files (with the .tmx extension) into the tmx subdirectory: /path/to/training/area/template/tmx.

3. Edit the Makefile.params file in the preparation subdirectory as needed: /path/to/training/area/template/preparation/Makefile.params

   If you plan to combine all the TMX files in the tmx directory from step 2 into a single training corpus, Makefile.params may be used as is. If you want multiple domain corpora, or a single corpus with a specific name prefix, you will need to edit Makefile.params to define your domain names and the makeup of the dev/test sets.

4. Extract text from the TMX files and clean it up:

       cd /path/to/training/area/template/preparation
       make all

5. Sample to create the dev, test and training corpus files:

       cd /path/to/training/area/template/corpora
       make all

6. Copy the train, dev, and test raw files to your framework/corpora directory:

       cd /path/to/training/area/template/corpora
       cp all.* train.* test*.* dev*.* /path/to/training/area/framework/corpora
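
The steps above can be chained in a small shell function. The following is only a sketch: the names run, run_in and run_tmx_prepro and the DRY_RUN variable are illustrative and are not part of PortageTMXPrepro. It assumes that $PORTAGE points at your PortageII installation, that the framework/corpora directory already exists, and that the default Makefile.params is suitable (the manual editing step is not automated).

```sh
# Hypothetical helpers, not shipped with PortageTMXPrepro.

# Print a command; execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Same as run, but execute the command from inside directory $1.
run_in() {
    dir=$1
    shift
    echo "+ (cd $dir && $*)"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        ( cd "$dir" && "$@" )
    fi
}

# usage: run_tmx_prepro AREA TMX_FILE...
#   AREA      training area (/path/to/training/area in the steps above)
#   TMX_FILE  one or more input files, each ending in .tmx
run_tmx_prepro() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_tmx_prepro AREA TMX_FILE..." >&2
        return 2
    fi
    if [ -z "${PORTAGE:-}" ]; then
        echo "error: PORTAGE is not set" >&2
        return 1
    fi
    area=$1
    shift

    # The tmx directory expects the .tmx extension; reject anything else.
    for f in "$@"; do
        case $f in
            *.tmx) ;;
            *) echo "error: $f does not end in .tmx" >&2; return 1 ;;
        esac
    done

    corpora=$area/template/corpora

    # Copy the template, then drop the TMX files into it.
    run cp -pr "$PORTAGE/tmx-prepro/template" "$area/template" || return 1
    run cp "$@" "$area/template/tmx/" || return 1
    # Editing Makefile.params is a manual step and is skipped here.
    # Extract and clean the text, then sample dev/test/train.
    run_in "$area/template/preparation" make all || return 1
    run_in "$corpora" make all || return 1
    # Copy the resulting files to the framework.
    run cp "$corpora"/all.* "$corpora"/train.* "$corpora"/test*.* \
        "$corpora"/dev*.* "$area/framework/corpora" || return 1
}
```

Setting DRY_RUN=1 before calling run_tmx_prepro prints the commands without executing them, which is a cheap way to check the paths before any files are copied.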
Additional information about processing TMX files can be found in the TMX Processing page of the user manual on your PortageII distro (doc/user-manual.html).
Running tmx2lfl.pl -h will give details on tmx2lfl.pl, the tool used by tmx-prepro to extract text out of your TMX files.
Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2004-2022, Sa Majesté la Reine du Chef du Canada / Her Majesty in Right of Canada
Published under the MIT License (see LICENSE)