
Can't create XLMRobertaTokenizer from xlm-roberta dataset #65

Closed
mlodato517 opened this issue May 19, 2022 · 2 comments
@mlodato517

I'm not sure if this is the right place to report this but, at the highest level,
I'm using rust-bert and trying to instantiate an NERModel with a seemingly
default XLMRoberta configuration:

    let config = TokenClassificationConfig::new(
        ModelType::XLMRoberta,
        RemoteResource::from_pretrained(RobertaModelResources::XLM_ROBERTA_NER_EN),
        RemoteResource::from_pretrained(RobertaVocabResources::XLM_ROBERTA_NER_EN),
        RemoteResource::from_pretrained(RobertaConfigResources::XLM_ROBERTA_NER_EN),
        None,  //merges resource only relevant with ModelType::Roberta
        false, //lowercase
        None,  //strip_accents
        None,  //add_prefix_space
        LabelAggregationOption::Mode,
    );

    let ner_model = NERModel::new(config).unwrap();

and this is failing with:

    TokenizerError("Error when loading vocabulary file, the file
    may be corrupted or does not match the expected format: incorrect tag")

That resource seems to point to the sentencepiece.bpe.model file for this model,
which appears to be parsed by ModelProto in this repo.

I'm not sure if:

  1. I'm doing the wrong thing with rust-bert
  2. The file hosted in the xlm-roberta dataset is out of date
  3. This repo is out of date with a new update to the xlm-roberta dataset file

This issue is really only relevant in the last case (though I wouldn't know where
to report the second). Apologies if it turns out to be the first, but I'm wondering
if you can offer any insight here!

Note
I'm using the version of rust-bert on the master git branch, not
the latest published crate (because I need 0.7 tch support).

@guillaume-be
Owner

Hi @mlodato517 ,

Thank you for raising this. The TokenClassificationConfig expects the inputs in the following order:

  1. model resource
  2. config resource
  3. vocab resource
  4. merges resource

Could you please try swapping arguments 3 and 4, i.e.:

    let config = TokenClassificationConfig::new(
        ModelType::XLMRoberta,
        RemoteResource::from_pretrained(RobertaModelResources::XLM_ROBERTA_NER_EN),
        RemoteResource::from_pretrained(RobertaConfigResources::XLM_ROBERTA_NER_EN),
        RemoteResource::from_pretrained(RobertaVocabResources::XLM_ROBERTA_NER_EN),
        None,  //merges resource only relevant with ModelType::Roberta
        false, //lowercase
        None,  //strip_accents
        None,  //add_prefix_space
        LabelAggregationOption::Mode,
    );
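For what it's worth, the reason this mix-up compiles silently is that the config and vocab arguments are both plain `RemoteResource` values, so the compiler can't tell the roles apart. A minimal sketch (hypothetical types, not rust-bert's actual API) of how newtype wrappers would turn a swapped argument into a compile error:

```rust
// Shared underlying type, analogous to RemoteResource.
#[derive(Debug, Clone, PartialEq)]
struct Resource(String);

// Newtype wrappers distinguish the roles at compile time.
struct ConfigResource(Resource);
struct VocabResource(Resource);

// A constructor taking wrapped arguments cannot receive them in the wrong order.
fn build(config: ConfigResource, vocab: VocabResource) -> (Resource, Resource) {
    (config.0, vocab.0)
}

fn main() {
    let config = ConfigResource(Resource("config.json".to_string()));
    let vocab = VocabResource(Resource("sentencepiece.bpe.model".to_string()));
    // build(vocab, config) would be rejected by the type checker.
    let (c, v) = build(config, vocab);
    assert_eq!(c.0, "config.json");
    assert_eq!(v.0, "sentencepiece.bpe.model");
    println!("ok");
}
```

With positional `RemoteResource` arguments, only a runtime parse error (like the `incorrect tag` above) reveals the swap.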

@mlodato517
Author

Oh my gosh you're kidding me 😆 Well, that's very embarrassing 😅 Thank you for the quick response!
