tmx-prepro

The tmx-prepro module (now PortageTMXPrepro) automates extracting sentence pairs from TMX files and splitting them into the training ("train"), development ("dev"), and test ("test") sets required to train and tune a PortageII system.

Directory template/tmx/ is where you drop your input TMX files.

Directory template/preparation/ contains a Makefile to extract text from TMX files and clean it up. The Makefile.params file is used to set up various parameters: the names of the TMX files, which ones the dev and test sets should be sampled from, which ones should be combined, etc.

Directory template/corpora/ is where the dev and test sampling happens. It reuses the same Makefile.params from template/preparation/.
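As a quick sanity check of the layout described above, the following sketch reports any of the three subdirectories that are missing from a template directory (or a copy of it). The function name `check_template_layout` is hypothetical and not part of PortageTMXPrepro.

```shell
# Hypothetical helper (not shipped with PortageTMXPrepro):
# succeed only if tmx/, preparation/ and corpora/ all exist under $1.
check_template_layout() {
    template_dir=$1
    missing=0
    for sub in tmx preparation corpora; do
        if [ ! -d "$template_dir/$sub" ]; then
            echo "missing: $template_dir/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_template_layout /path/to/training/area/template` prints nothing and returns success when all three subdirectories are present.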

Getting Started

Make a copy of the template directory to use for the corpus.

 cd $PORTAGE/tmx-prepro
 cp -pr template /path/to/training/area/template

or

 git clone https://github.com/nrc-cnrc/PortageTMXPrepro.git /path/to/training/area

Drop your TMX files (with .tmx extension) into the tmx sub-directory:
```
 /path/to/training/area/template/tmx
```
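One way to do this from the command line is sketched below. It copies only the files ending in .tmx from a source directory into the tmx subdirectory and prints how many were copied. The function name `stage_tmx_files` and the source directory in the usage note are assumptions for illustration.

```shell
# Hypothetical helper (not shipped with PortageTMXPrepro):
# copy every *.tmx file from source directory $1 into tmx directory $2
# and print the number of files copied.
stage_tmx_files() {
    src_dir=$1
    tmx_dir=$2
    mkdir -p "$tmx_dir"
    count=0
    for f in "$src_dir"/*.tmx; do
        [ -e "$f" ] || continue   # the glob matched nothing
        cp -p "$f" "$tmx_dir/"
        count=$((count + 1))
    done
    echo "$count"
}
```

For example, `stage_tmx_files /path/to/my/tmx/files /path/to/training/area/template/tmx`, where `/path/to/my/tmx/files` stands in for wherever your TMX files currently live.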
Edit the Makefile.params file in the preparation subdirectory as needed:
```
 /path/to/training/area/template/preparation/Makefile.params
```
If you plan to combine all the TMX files in the tmx directory into a single training corpus, Makefile.params may be used as is. If you want multiple domain corpora, or a single corpus with a specific name prefix, you will need to edit Makefile.params to define your domain names and the makeup of the dev/test sets.

Extract text from the TMX files and clean it up:

 cd /path/to/training/area/template/preparation
 make all

Sample to create the dev, test, and training corpus files:

 cd /path/to/training/area/template/corpora
 make all

Copy the train, dev, and test raw files to your framework/corpora directory:

 cd /path/to/training/area/template/corpora
 cp all.* train.* test*.* dev*.* /path/to/training/area/framework/corpora
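Since these files hold sentence pairs, the source-language and target-language file of each set should have the same number of lines. The sketch below is one way to check that after copying. The function name `same_line_count` is hypothetical, and the file names in the usage note are placeholders, since the actual names depend on your Makefile.params settings.

```shell
# Hypothetical helper (not shipped with PortageTMXPrepro):
# succeed only if files $1 and $2 have the same number of lines.
same_line_count() {
    # tr strips the padding that some wc implementations add
    n1=$(wc -l < "$1" | tr -d ' ')
    n2=$(wc -l < "$2" | tr -d ' ')
    [ "$n1" -eq "$n2" ]
}
```

For example, `same_line_count train.SRC train.TGT || echo "line counts differ"`, where `train.SRC` and `train.TGT` stand in for the two sides of your training corpus.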

Copyright

Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2004-2022, Sa Majesté la Reine du Chef du Canada / Her Majesty in Right of Canada
Published under the MIT License (see LICENSE)
