There is a small problem about the training data format. #29

Closed
genbei opened this issue Oct 26, 2021 · 1 comment


genbei commented Oct 26, 2021

Since I have never trained the pre-training model, I have a small question about what the parallel data input format for TRAIN_FILE=/path/to/train/file looks like. Do you need a separator between src and tgt? What is the format?
In addition, is it possible to fine-tune the xlm-roberta-large model?

Also, my xlm-roberta-base directory contains these files: config.json, gitattributes, pytorch_model.bin, sentencepiece.bpe.model, tokenizer.json

This error occurred while running:

10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/vocab.txt. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/added_tokens.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/special_tokens_map.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/tokenizer_config.json. We won't load it.

Now I really want to continue training the xlm-roberta-base model on bilingual data through the TLM task, and I'm asking for your advice.

@zdou0830 (Collaborator) commented

As described in the README, the inputs should be tokenized, and each line should contain a source-language sentence and its target-language translation, separated by (|||). You can see some examples in the examples folder.
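For illustration only (this sentence pair is made up, not taken from the repo), a single line of the training file could look like:

    Das Haus ist klein . ||| the house is small .

with both sides pre-tokenized (whitespace-separated tokens) and one sentence pair per line.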

I haven't fine-tuned xlm-roberta-large before, but I think you can use the code in the xlmr branch, tune some parameters (e.g. align_layer, learning_rate, max_steps), and see if you can get reasonable performance.
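As a rough sketch only, assuming the xlmr branch keeps the same awesome-train entry point and flags as the main README (the paths and hyperparameter values below are placeholders, not recommendations), a training run could look something like:

    TRAIN_FILE=/path/to/train/file
    OUTPUT_DIR=/path/to/output/directory

    CUDA_VISIBLE_DEVICES=0 awesome-train \
        --output_dir=$OUTPUT_DIR \
        --model_name_or_path=xlm-roberta-base \
        --extraction 'softmax' \
        --do_train \
        --train_tlm \
        --train_so \
        --train_data_file=$TRAIN_FILE \
        --align_layer 8 \
        --per_gpu_train_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --learning_rate 2e-5 \
        --max_steps 20000 \
        --save_steps 4000

You would point --model_name_or_path at your local xlm-roberta-base directory (or an xlm-roberta-large checkpoint) and adjust align_layer, learning_rate and max_steps as mentioned above.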

zdou0830 closed this as completed Nov 2, 2021