Module to extract parallel data from TMX files for Portage SMT training — Module d’extraction de texte parallèle des fichiers TMX pour l’entraînement de modèles TAS Portage

nrc-cnrc/PortageTMXPrepro

Le module tmx-prepro

Le module tmx-prepro (maintenant PortageTMXPrepro) permet d'automatiser l'extraction des paires de phrases à partir de fichiers TMX et de les diviser en jeux d'entraînement ("train"), de développement ("dev") et de test ("test"), tel que requis pour l'entraînement et l'optimisation d'un système PortageII.

tmx-prepro

The tmx-prepro module (now PortageTMXPrepro) is intended to automate extracting sentence pairs from TMX files and splitting them into the train/dev/test files needed to train PortageII.

Directory template/tmx/ is where you drop your input TMX files.

Directory template/preparation/ contains a Makefile that extracts text from the TMX files and cleans it up. The Makefile.params file is used to set up the parameters that control this process: the names of the TMX files, which ones the dev and test sets should be sampled from, which ones should be combined, etc.

Directory template/corpora/ is where the dev and test sampling happens. It reuses the same Makefile.params from template/preparation/.
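As a quick sanity check on the layout described above, the following sketch reports which of the three sub-directories are missing from a copy of the template. The function name `missing_dirs` is a hypothetical helper, not part of tmx-prepro.

```shell
# Hypothetical helper (not part of tmx-prepro): print the name of each
# expected sub-directory that is missing under the given template root.
missing_dirs() {
    root=$1
    for sub in tmx preparation corpora; do
        # Print the sub-directory name only when it does not exist.
        [ -d "$root/$sub" ] || echo "$sub"
    done
}
```

For example, `missing_dirs /path/to/training/area/template` should print nothing when the copy is complete.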

Getting Started

  1. Make a copy of the template directory to use for the corpus.

     cd $PORTAGE/tmx-prepro
     cp -pr template /path/to/training/area/template
    

    or

     git clone https://github.com/nrc-cnrc/PortageTMXPrepro.git /path/to/training/area
    
  2. Drop your TMX files (with .tmx extension) into the tmx sub-directory:

     /path/to/training/area/template/tmx
    
  3. Edit the Makefile.params file in the preparation subdirectory as needed:

     /path/to/training/area/template/preparation/Makefile.params
    

    If you plan to combine all the TMX files in the tmx directory from step 2 into a single training corpus, then Makefile.params may be used as is. If you want to have multiple domain corpora or you want a single corpus with a specific name prefix, you will need to edit the Makefile.params to define your domain names, and the makeup of the dev/test sets.

  4. Extract text from the TMX files and clean it up:

     cd /path/to/training/area/template/preparation
     make all
    
  5. Sample the extracted text to create the dev, test and training corpus files:

     cd /path/to/training/area/template/corpora
     make all
    
  6. Copy the train, dev, and test raw files to your framework/corpora directory:

     cd /path/to/training/area/template/corpora
     cp all.* train.* test*.* dev*.* /path/to/training/area/framework/corpora
    
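Steps 4 and 5 above can be wrapped in a small script that refuses to run when step 2 was skipped. This is only a sketch: `count_tmx` and `prepare_corpus` are hypothetical names, and the `-maxdepth` option assumes GNU or BSD `find`.

```shell
# Hypothetical helper: count the files ending in .tmx directly inside a directory.
count_tmx() {
    find "$1" -maxdepth 1 -type f -name '*.tmx' 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical wrapper: run "make all" in preparation/ and then in corpora/,
# but only if the tmx/ sub-directory contains at least one .tmx file.
prepare_corpus() {
    area=$1
    if [ "$(count_tmx "$area/tmx")" -eq 0 ]; then
        echo "no .tmx files found in $area/tmx" >&2
        return 1
    fi
    # Each step runs in a subshell so the caller's working directory is unchanged.
    (cd "$area/preparation" && make all) &&
    (cd "$area/corpora" && make all)
}
```

With the functions loaded, `prepare_corpus /path/to/training/area/template` would run both make steps in order and stop at the first failure.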

Other Documentation

Additional information about processing TMX files can be found in the TMX Processing page of the user manual on your PortageII distro (doc/user-manual.html).

Running tmx2lfl.pl -h will give details on tmx2lfl.pl, the tool used by tmx-prepro to extract text out of your TMX files.
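Before asking for the help text, it can be useful to confirm that tmx2lfl.pl is actually on your PATH. The helper below is a hypothetical sketch, not something shipped with PortageII.

```shell
# Hypothetical helper: succeed only if the named command is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}
```

A typical use would be `have_tool tmx2lfl.pl && tmx2lfl.pl -h`, which prints the help only when the tool is installed.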

Copyright

Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2004-2022, Sa Majesté la Reine du Chef du Canada / Her Majesty in Right of Canada
Published under the MIT License (see LICENSE)
