
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier #11065

Closed
nithinreddyy opened this issue Apr 5, 2021 · 11 comments

@nithinreddyy

I'm trying to get the sentiment for comments with the help of a Hugging Face sentiment-analysis pretrained model. It returns the error: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).

Below is the code; please take a look.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd

# Load the pretrained sentiment model and its tokenizer from a local folder
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')

classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

# Load the reviews to classify
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')

data.head()

Output is:

    Review
0   If you've ever been to Disneyland anywhere you...
1   Its been a while since d last time we visit HK...
2   Thanks God it wasn t too hot or too humid wh...
3   HK Disneyland is a great compact park. Unfortu...
4   the location is not in the city, took around 1...

Followed by

classifier("My name is mark")

Output is

[{'label': 'POSITIVE', 'score': 0.9953688383102417}]

Followed by code (assigning the result first so it can be reused; `value` was undefined in the original snippet):

value = classifier("My name is mark")
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment

Output is

['POSITIVE']

Appending all the rows to an empty list:

text = []

for index, row in data.iterrows():
    text.append(row['Review'])

I'm trying to get the sentiment for all the rows

sent = []

for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)

The error is:

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
      2 
      3 for i in range(len(data)):
----> 4     sentiment = classifier(data.iloc[i,0])
      5     sent.append(sentiment)

11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self
@LysandreJik (Member) commented Apr 5, 2021

Could you specify truncation=True when calling the pipeline with your data? I.e., replace classifier("My name is mark") with classifier("My name is mark", truncation=True).

@nithinreddyy (Author)

> Could you specify truncation=True when calling the pipeline with your data? I.e., replace classifier("My name is mark") with classifier("My name is mark", truncation=True).

Yes, of course I can do it for one comment, but I have a column with multiple comments. How about that?
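A minimal sketch of how the suggestion extends to a whole column (assuming `classifier` and `data` as defined in the original post): pipelines accept a list of strings, so `truncation=True` applies to every row in one call.

# Sketch: apply the classifier with truncation to every row of the Review column.
# Assumes `classifier` and `data` are defined as in the original post.
reviews = data['Review'].astype(str).tolist()

# Pipelines accept a list of inputs and return one result dict per input;
# truncation=True cuts each text to the model's 512-token maximum.
results = classifier(reviews, truncation=True)

sent = [r['label'] for r in results]
data['sentiment'] = sent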

@github-actions bot commented May 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@jmrjfs commented Jun 21, 2021

I tried running it with truncation=True but still receive the following error message: InvalidArgumentError: indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather]

Many thanks!

@nithinreddyy (Author)

> I tried running it with truncation=True but still receive the following error message: InvalidArgumentError: indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather]

Can you post the code? I'll look into it. Are you trying to train a custom model?

@jmrjfs commented Jun 21, 2021

Many thanks for coming back!

I am just applying the BERT model to classify Reddit posts into neutral, negative, and positive; the posts range from as few as 5 words to as many as 3,500 words. I know that there is a lot of ongoing research on extending such models to longer sequences...

I am using pipeline from Hugging Face. Under the base model the truncation actually works, but under the model I use (cardiffnlp/twitter-roberta-base-sentiment) it somehow doesn't…

classifier_2 = pipeline('sentiment-analysis', model = "cardiffnlp/twitter-roberta-base-sentiment")
sentiment = classifier_2(df_body.iloc[4]['Content'], truncation=True)
print(sentiment)

where df_body.iloc[4]['Content'] is a 3,500-word text.

The warning is: "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation."

My dumb solution would be to drop all the words after the 512th in the pre-cleaning process…
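A possible workaround (a sketch, not verified against that checkpoint): the warning suggests the tokenizer for cardiffnlp/twitter-roberta-base-sentiment ships without a usable model_max_length, so truncation=True has no length to truncate to. Setting it explicitly before building the pipeline should restore truncation. Incidentally, RoBERTa's 514 position embeddings leave room for 512 real tokens (positions are offset by the padding index), which matches the indices[0,512] = 514 error above.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The checkpoint's tokenizer config apparently lacks a maximum length,
# so give truncation an explicit target (512 tokens for RoBERTa-base).
tokenizer.model_max_length = 512

classifier_2 = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
sentiment = classifier_2(long_text, truncation=True)  # long_text: hypothetical input string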

@nithinreddyy (Author)

> I am using pipeline from Hugging Face. Under the base model the truncation actually works, but under the model I use (cardiffnlp/twitter-roberta-base-sentiment) it somehow doesn't…

Can you try this code? It's not the RoBERTa model, but the Huggingface-Sentiment-Pipeline model:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers

model = AutoModelForSequenceClassification.from_pretrained('Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('Huggingface-Sentiment-Pipeline')

classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

content = 'enter your content here'

# Check the length of the content (note: this counts characters, not tokens)
len(content)

# Now run the classifier pipeline with truncation enabled
classifier(content, truncation=True)

Meanwhile, I'll try to figure it out for the RoBERTa model.
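One caveat on the snippet above: len(content) counts characters, whereas the 512 limit applies to tokens. A quick sketch of how to check the actual token count with the same tokenizer (the variable `token` from the snippet):

# Count the tokens the model will actually see; the 512 limit applies here,
# not to the character count reported by len(content).
n_tokens = len(token(content)['input_ids'])
print(n_tokens)  # anything above 512 needs truncation for this model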

@jmrjfs commented Jun 22, 2021

Thanks so much for your help! I also dug in a bit further… It seems the RoBERTa model I was using can only handle about 286 words per input? (I used an example text and cut it down until it ran.) It might be easiest to pre-process the data first rather than relying on truncation within the classifier.

@nithinreddyy (Author)

> It seems the RoBERTa model I was using can only handle about 286 words per input? It might be easiest to pre-process the data first rather than relying on truncation within the classifier.

Actually, you can train your own model on top of pre-trained models if you have the content and its respective class. That makes the model much more accurate. I have BERT code for this; if you want, I can give it to you.
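For anyone curious, a minimal sketch of what fine-tuning on top of a pre-trained checkpoint can look like with the Trainer API. Everything below — the texts, labels, and base checkpoint — is a placeholder, not the code offered above.

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder data: your own texts and integer class labels.
texts = ["great ride", "too crowded"]
labels = [1, 0]

checkpoint = "bert-base-uncased"  # any classification-capable base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize once up front, truncating to the model's 512-token limit.
encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps the tokenized inputs and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='out', num_train_epochs=3),
    train_dataset=ReviewDataset(encodings, labels),
)
trainer.train()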

@jmrjfs commented Jun 23, 2021 via email

@Abe410 commented Jun 16, 2022

> Actually, you can train your own model on top of pre-trained models if you have the content and its respective class. That makes the model much more accurate. I have BERT code for this; if you want, I can give it to you.

Hey,

I am working on exactly the same problem. Does it really make the model more accurate?

Mind sharing the code with me as well? Thanks!
