There is a small problem about the training data format. #29

Closed
genbei opened this issue Oct 26, 2021 · 1 comment


genbei commented Oct 26, 2021

Since I have never trained the pre-training model, I have a small question about what the parallel data input format for TRAIN_FILE=/path/to/train/file looks like. Do you need a separator between src and tgt? What is the format?
In addition, is it possible to fine-tune the xlm-roberta-large model?

Also, my xlm-roberta-base directory contains these files: config.json, gitattributes, pytorch_model.bin, sentencepiece.bpe.model, tokenizer.json

This error occurred while running:

10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/vocab.txt. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/added_tokens.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/special_tokens_map.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/tokenizer_config.json. We won't load it.

Now I really want to continue training the xlm-roberta-base model on bilingual data through the TLM task, and I'm asking for your advice.

@zdou0830 (Collaborator) commented

As described in the README, the inputs should be tokenized, and each line should contain a source-language sentence and its target-language translation, separated by (|||). You can see some examples in the examples folder.
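For illustration only (this sentence pair is made up, not taken from the repo), a single line of the training file could look like:

    Das Haus ist klein . ||| the house is small .

with both sides pre-tokenized (whitespace-separated tokens) and one sentence pair per line.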

I haven't fine-tuned xlm-roberta-large before, but I think you can use the code in the xlmr branch, tune some parameters (e.g. align_layer, learning_rate, max_steps), and see if you can get reasonable performance.
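As a rough sketch only, assuming the xlmr branch keeps the same awesome-train entry point and flags as the main README (the paths and hyperparameter values below are placeholders, not recommendations), a training run could look something like:

    TRAIN_FILE=/path/to/train/file
    OUTPUT_DIR=/path/to/output/directory

    CUDA_VISIBLE_DEVICES=0 awesome-train \
        --output_dir=$OUTPUT_DIR \
        --model_name_or_path=xlm-roberta-base \
        --extraction 'softmax' \
        --do_train \
        --train_tlm \
        --train_so \
        --train_data_file=$TRAIN_FILE \
        --align_layer 8 \
        --per_gpu_train_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --learning_rate 2e-5 \
        --max_steps 20000 \
        --save_steps 4000

You would point --model_name_or_path at your local xlm-roberta-base directory (or an xlm-roberta-large checkpoint) and adjust align_layer, learning_rate and max_steps as mentioned above.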

zdou0830 closed this as completed Nov 2, 2021