Embedding index getting out of range while running CamemBERT model #4153

ierezell · 2020-05-05T14:40:31Z

🐛 Bug

Information

Model I am using (Bert, XLNet ...):
Camembert

Language I am using the model on (English, Chinese ...):
French

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: (give details below)

The task I am working on is:

an official GLUE/SQUaD task: (give the name)
my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Take a file containing French text
Load the pretrained Camembert model and tokenizer as shown in the docs
Run inference

Initialisation:
from transformers import CamembertModel, CamembertTokenizer
bert = CamembertModel.from_pretrained("camembert-base")
bert_tok = CamembertTokenizer.from_pretrained("camembert-base")
Inference, following https://huggingface.co/transformers/usage.html#question-answering:

inputs = bert_tok.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = bert_tok.convert_ids_to_tokens(input_ids)
answer_start_scores, answer_end_scores = bert(**inputs)

It works if I remove the context argument (the text_pair argument), but I need it for question answering with other models, and it leads to the same error with pipelines.

Stack trace :

IndexError                                Traceback (most recent call last)
<ipython-input-9-73762e6cf69b> in <module>
      2     for utterances in file.readlines():
      3         input_tensor = bert_tok.batch_encode_plus([utterances], pad_to_max_length=True, return_tensors="pt")
----> 4         last_hidden, pool = bert(input_tensor["input_ids"], input_tensor["attention_mask"])
      5 
      6 

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548                     functools.update_wrapper(wrapper, hook)
    549                     grad_fn.register_hook(wrapper)
--> 550         return result
    551 
    552     def __setstate__(self, state):

~/.local/lib/python3.8/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask)
    780             head_mask = [None] * self.config.num_hidden_layers
    781 
--> 782         embedding_output = self.embeddings(
    783             input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
    784         )

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548                     functools.update_wrapper(wrapper, hook)
    549                     grad_fn.register_hook(wrapper)
--> 550         return result
    551 
    552     def __setstate__(self, state):

~/.local/lib/python3.8/site-packages/transformers/modeling_roberta.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
     62                 position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)
     63 
---> 64         return super().forward(
     65             input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds
     66         )

~/.local/lib/python3.8/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
    172         if inputs_embeds is None:
    173             inputs_embeds = self.word_embeddings(input_ids)
--> 174         position_embeddings = self.position_embeddings(position_ids)
    175         token_type_embeddings = self.token_type_embeddings(token_type_ids)
    176 

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548                     functools.update_wrapper(wrapper, hook)
    549                     grad_fn.register_hook(wrapper)
--> 550         return result
    551 
    552     def __setstate__(self, state):

~/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    110 
    111     def forward(self, input):
--> 112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
    114             self.norm_type, self.scale_grad_by_freq, self.sparse)

~/.local/lib/python3.8/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1722     if dim == 3:
   1723         div = pad(div, (0, 0, size // 2, (size - 1) // 2))
-> 1724         div = avg_pool2d(div, (size, 1), stride=1).squeeze(1)
   1725     else:
   1726         sizes = input.size()

IndexError: index out of range in self
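
When this error appears, it can help to check which id sequence overflows its embedding table before calling the model. The helper below is a minimal sketch and not part of transformers: `find_overflow` and its argument names are hypothetical. In practice the table sizes would be read from `model.config.vocab_size`, `model.config.type_vocab_size`, and `model.config.max_position_embeddings`.

```python
def find_overflow(ids_by_name, table_sizes):
    """Report every id sequence whose largest id does not fit its embedding table.

    ids_by_name: dict mapping a name (e.g. "input_ids") to a flat list of ints.
    table_sizes: dict mapping the same names to the number of rows in the
                 corresponding embedding table.
    Returns a dict {name: largest_id} for the sequences that would overflow.
    """
    report = {}
    for name, ids in ids_by_name.items():
        largest = max(ids, default=-1)  # an empty sequence can never overflow
        if largest >= table_sizes[name]:
            report[name] = largest
    return report
```

With made-up sizes, `find_overflow({"input_ids": [5, 6, 7], "token_type_ids": [0, 0, 1]}, {"input_ids": 100, "token_type_ids": 1})` returns `{"token_type_ids": 1}`, which points at the segment ids rather than the vocabulary.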

Expected behavior

Run inference without any error

Environment info

transformers version: 2.8.0

Platform: Linux-5.6.10-arch1-1-x86_64-with-glibc2.2.5
Python version: 3.8.2
PyTorch version (GPU?): 1.4.0 (True) (Same with 1.5)
Tensorflow version (GPU?): 2.2.0-rc4 (False)
Using GPU in script?: No
Using distributed or parallel set-up in script?: No


jwallat · 2020-05-10T11:25:16Z

I am running into the same error on my own script. Interestingly it only appears on CPU... Did you find a solution?

ierezell · 2020-05-10T20:25:59Z

No. I want to build a French Q&A pipeline. Surprisingly, everything works great with the Hugging Face pipeline: I can plug the code into a local server and make requests against it.

But when I run the same code in a Docker environment to ship it, it fails with this error (only in French with Camembert; classic BERT works fine).

I also get the error locally if I write my own inference code (as described above) instead of using the Hugging Face pipeline.

ierezell · 2020-05-11T16:33:03Z

I can confirm it works on a local GPU (and even in Docker), but I'm still stuck on CPU.

jwallat · 2020-05-11T16:40:20Z

I actually figured out my error. I was adding special tokens to the tokenizer (like begin-of-sequence) but did not resize the model's token embeddings via:
model.resize_token_embeddings(len(self.tokenizer))
Just in case someone else is not reading the documentation carefully enough 🙈
Considering that, the error message actually did make sense.
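
The fix described above can be sketched as a small helper. This is an illustration under assumptions, not code from the thread: `add_tokens_and_resize` is a hypothetical name, and it relies only on the transformers calls `tokenizer.add_special_tokens`, `len(tokenizer)`, `model.get_input_embeddings()`, and `model.resize_token_embeddings`.

```python
def add_tokens_and_resize(model, tokenizer, new_tokens):
    """Register extra special tokens and grow the embedding table to match.

    model and tokenizer are expected to be transformers objects (for example
    a Camembert model and its tokenizer); new_tokens is a list of strings.
    Returns the number of tokens the tokenizer actually added.
    """
    added = tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
    # New tokens receive ids at the end of the vocabulary, so the embedding
    # table needs one row per tokenizer entry or the lookup goes out of range.
    if len(tokenizer) > model.get_input_embeddings().num_embeddings:
        model.resize_token_embeddings(len(tokenizer))
    return added
```

The comparison before resizing keeps the call cheap when the table is already large enough, for example when the same helper runs twice with tokens the tokenizer already knows.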

LysandreJik · 2020-05-11T17:09:05Z

Hi @ierezell, there is indeed an issue which I'm patching in #4289. Please be aware that you're using CamembertModel which cannot be used for question answering. Please use CamembertForQuestionAnswering instead.
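
For installs that predate the patch, one possible workaround is to not pass segment ids to the model at all. This sketch assumes the overflow comes from `token_type_ids` containing 1 for the second sequence while CamemBERT's token-type embedding table has a single row, which is what the title of #4289 ("CamemBERT does not make use of Token Type IDs") suggests. `drop_token_type_ids` is a hypothetical helper, not a transformers function.

```python
def drop_token_type_ids(encoded):
    """Return a copy of the encoded inputs without the "token_type_ids" entry.

    encoded is the dict-like output of the tokenizer's encode_plus call.
    The original mapping is left untouched.
    """
    return {key: value for key, value in encoded.items() if key != "token_type_ids"}
```

Calling `bert(**drop_token_type_ids(inputs))` should then let the model fall back to its default segment ids, which, to my understanding, are all zeros in the BERT-style forward pass.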

LysandreJik · 2020-05-11T17:31:37Z

It's patched now, please install from source and there should be no error anymore!

ierezell · 2020-05-11T18:23:58Z

Hi @LysandreJik, I'm aware that I used a non-QA model, but the goal was to try the base model supported by Hugging Face.

I also tried illuin/camembert-base-fquad (and the large variant) and fmikaelian/camembert-base-fquad.

I will install the latest version and try it.

Thanks a lot for the fast support!

ierezell · 2020-05-11T20:31:50Z

I tried your fix, but it leads to a KeyError:

  File "/home/pedro/.local/lib/python3.8/site-packages/transformers/pipelines.py", line 1156, in __call__
    answers += [
  File "/home/pedro/.local/lib/python3.8/site-packages/transformers/pipelines.py", line 1159, in <listcomp>
    "start": np.where(char_to_word == feature.token_to_orig_map[s])[0][0].item(),
KeyError: 0

LysandreJik · 2020-05-12T20:20:41Z

Could you provide a reproducible script? I can't reproduce.

ierezell · 2020-06-03T15:05:03Z

My problem here was most likely linked to #4674. Everything seems to work now, thanks a lot.

LysandreJik mentioned this issue May 11, 2020

CamemBERT does not make use of Token Type IDs #4289

Merged

LysandreJik closed this as completed in #4289 May 11, 2020

alxhrzg mentioned this issue Aug 12, 2020

Issue: IndexError: index out of range in self mickeysjm/R-BERT#5

Open

joawar mentioned this issue Sep 21, 2020

"index out of range in self" when calling BertForTokenClassification #7287

Closed
