How to train for other languages

Create a dataset for your language

If a Europarl dataset is available for your language, you can use the SEPP-NLG-2021 scripts to create a TSV file from it.

If there is no Europarl dataset for your language, we created a small tool that takes a text file as input and produces a file in the TSV format of the SEPP-NLG task, suitable for training. This opens the possibility of using any plain text corpus for training.
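
To illustrate the kind of conversion described above, here is a minimal sketch. It is not the tool mentioned in this section. It assumes the SEPP-NLG TSV layout is one lowercased token per line followed by a sentence-end flag (0 or 1) and a punctuation label, and that the label set is the one listed in the code. The function name text_to_tsv is hypothetical.

```python
# Assumption: punctuation labels used by the task; "0" means no punctuation.
PUNCTUATION = {":", "-", ",", "?", "."}
# Assumption: these marks also close a sentence (sentence-end flag = 1).
SENTENCE_END = {".", "?"}


def text_to_tsv(text):
    """Convert plain text into 'token<TAB>sentence_end<TAB>punctuation' lines."""
    rows = []
    for word in text.split():
        label = "0"
        # A trailing mark from the assumed label set becomes the punctuation label.
        if word[-1] in PUNCTUATION:
            label = word[-1]
            word = word[:-1]
        if not word:
            continue  # the piece was a standalone punctuation mark
        sentence_end = "1" if label in SENTENCE_END else "0"
        rows.append(f"{word.lower()}\t{sentence_end}\t{label}")
    return "\n".join(rows) + "\n"
```

For example, text_to_tsv("Bon dia, món.") yields three lines, one each for bon, dia and món, with the comma and the full stop moved into the label column.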

In both cases, the dataset has to be placed in the directory data with the name sepp_nlg_2021_train_dev_data_v5.zip. Inside the zip file, the directory structure is /sepp_nlg_2021_data/LANGUAGE_CODE/, which in turn contains a dev and a train directory.
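
A quick way to check this layout before training is sketched below. The helper missing_split_dirs is hypothetical and not part of the repository. It assumes archive entries may be stored with or without a leading slash and only checks that each split directory contains at least one file.

```python
import zipfile


def missing_split_dirs(zip_path, language_code):
    """Return the split directories (dev, train) that hold no files in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Normalise names so entries with or without a leading slash compare equal.
        names = [name.lstrip("/") for name in archive.namelist()]
    missing = []
    for split in ("dev", "train"):
        prefix = f"sepp_nlg_2021_data/{language_code}/{split}/"
        # A split counts as present only if at least one file lives under it.
        if not any(n.startswith(prefix) and not n.endswith("/") for n in names):
            missing.append(split)
    return missing
```

Calling missing_split_dirs("data/sepp_nlg_2021_train_dev_data_v5.zip", "ca") returns an empty list when both directories are populated for Catalan.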

Defining your language and model

In the file model_final_suite_task2.json you need to define your language and model. For Catalan, this is what it looks like:

{   
    "tests":[   
        {
            "id":22,
            "task":2,
            "model": "softcatala/julibert",
            "languages":["ca"],
            "augmentation":[""],
            "data_percentage":1,
            "use_token_type_ids":false,
            "tokenizer_config":{"strip_accent":false, "add_prefix_space":true },
            "opimizer_config":{"adafactor":true, "num_train_epochs":2}
        }
    ]
}

The most important fields are languages, where you define your language, and model, where you define the base model.
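
If you prefer to generate this file rather than edit it by hand, the sketch below builds the same structure for another language. All keys and default values are copied verbatim from the Catalan example above; only languages and model change. The function names and the default id are illustrative assumptions, and the model name you pass must be a base model you have chosen yourself.

```python
import json


def build_suite_config(language, model, test_id=22):
    """Build a test suite config mirroring the Catalan example, for one language."""
    return {
        "tests": [
            {
                "id": test_id,
                "task": 2,
                "model": model,           # base model to fine-tune
                "languages": [language],  # language code, e.g. "ca"
                "augmentation": [""],
                "data_percentage": 1,
                "use_token_type_ids": False,
                "tokenizer_config": {"strip_accent": False, "add_prefix_space": True},
                # Key spelled exactly as in the example above.
                "opimizer_config": {"adafactor": True, "num_train_epochs": 2},
            }
        ]
    }


def write_suite_config(path, language, model):
    """Write the config as JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_suite_config(language, model), handle, indent=4)
        handle.write("\n")
```

For instance, write_suite_config("model_final_suite_task2.json", "ca", "softcatala/julibert") reproduces the Catalan configuration.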

Training

Before starting the training, make sure that the model_final_suite_results_task2.json file is empty. This file contains the statistics and the results of the training process. An empty file looks like this:

{
    "tests": {}
}
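
Resetting the file can also be scripted. The sketch below overwrites it with the empty structure shown above; reset_results_file is a hypothetical helper, and the default path assumes you run it from the directory that holds the results file. Any previous results in the file are lost, so copy them elsewhere first if you need them.

```python
import json


def reset_results_file(path="model_final_suite_results_task2.json"):
    """Overwrite the results file with an empty 'tests' object."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"tests": {}}, handle, indent=4)
        handle.write("\n")
```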

You can start the training by running:

python model_test_suite.py 2