
Can't create XLMRobertaTokenizer from xlm-roberta dataset #65

Closed
mlodato517 opened this issue May 19, 2022 · 2 comments
@mlodato517

I'm not sure if this is the right place to report this but, at the highest level,
I'm using rust-bert and trying to instantiate an NERModel with a seemingly
default XLMRoberta configuration:

    let config = TokenClassificationConfig::new(
        ModelType::XLMRoberta,
        RemoteResource::from_pretrained(RobertaModelResources::XLM_ROBERTA_NER_EN),
        RemoteResource::from_pretrained(RobertaVocabResources::XLM_ROBERTA_NER_EN),
        RemoteResource::from_pretrained(RobertaConfigResources::XLM_ROBERTA_NER_EN),
        None,  //merges resource only relevant with ModelType::Roberta
        false, //lowercase
        None,  //strip_accents
        None,  //add_prefix_space
        LabelAggregationOption::Mode,
    );

    let ner_model = NERModel::new(config).unwrap();

and this is failing with:

    TokenizerError("Error when loading vocabulary file, the file
    may be corrupted or does not match the expected format: incorrect tag")

That resource seems to point to the sentencepiece.bpe.model file for this model,
which appears to be parsed by ModelProto in this repo.

I'm not sure if:

  1. I'm doing the wrong thing with rust-bert
  2. The file hosted in the xlm-roberta dataset is out of date
  3. This repo is out of date with a new update to the xlm-roberta dataset file

This issue is really only relevant in the last case (though I wouldn't know where
to report the second). Apologies if it turns out to be the first, but I'm wondering
if you can offer any insight here!

Note
I'm using the version of rust-bert on the master git branch, not
the latest published crate (because I need 0.7 tch support).

@guillaume-be
Owner

Hi @mlodato517 ,

Thank you for raising this. The TokenClassificationConfig expects the inputs in the following order:

  1. model resource
  2. config resource
  3. vocab resource
  4. merges resource

Could you please try swapping arguments 3 and 4, i.e.:

    let config = TokenClassificationConfig::new(
        ModelType::XLMRoberta,
        RemoteResource::from_pretrained(RobertaModelResources::XLM_ROBERTA_NER_EN),
        RemoteResource::from_pretrained(RobertaConfigResources::XLM_ROBERTA_NER_EN),
        RemoteResource::from_pretrained(RobertaVocabResources::XLM_ROBERTA_NER_EN),
        None,  //merges resource only relevant with ModelType::Roberta
        false, //lowercase
        None,  //strip_accents
        None,  //add_prefix_space
        LabelAggregationOption::Mode,
    );
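For what it's worth, the reason this mix-up compiles silently is that the config and vocab arguments are both plain `RemoteResource` values, so the compiler can't tell the roles apart. A minimal sketch (hypothetical types, not rust-bert's actual API) of how newtype wrappers would turn a swapped argument into a compile error:

```rust
// Shared underlying type, analogous to RemoteResource.
#[derive(Debug, Clone, PartialEq)]
struct Resource(String);

// Newtype wrappers distinguish the roles at compile time.
struct ConfigResource(Resource);
struct VocabResource(Resource);

// A constructor taking wrapped arguments cannot receive them in the wrong order.
fn build(config: ConfigResource, vocab: VocabResource) -> (Resource, Resource) {
    (config.0, vocab.0)
}

fn main() {
    let config = ConfigResource(Resource("config.json".to_string()));
    let vocab = VocabResource(Resource("sentencepiece.bpe.model".to_string()));
    // build(vocab, config) would be rejected by the type checker.
    let (c, v) = build(config, vocab);
    assert_eq!(c.0, "config.json");
    assert_eq!(v.0, "sentencepiece.bpe.model");
    println!("ok");
}
```

With positional `RemoteResource` arguments, only a runtime parse error (like the `incorrect tag` above) reveals the swap.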

@mlodato517
Author

Oh my gosh you're kidding me 😆 Well, that's very embarrassing 😅 Thank you for the quick response!
