
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier #11065

Closed
nithinreddyy opened this issue Apr 5, 2021 · 11 comments

@nithinreddyy

I'm trying to get the sentiment for comments with the help of a Hugging Face sentiment-analysis pretrained model. It returns the error: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).

Below is the code; please take a look.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd

# Load the pretrained sentiment model and its tokenizer from a local folder
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')

classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

# Load the reviews to classify
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')

data.head()

Output is:

    Review
0   If you've ever been to Disneyland anywhere you...
1   Its been a while since d last time we visit HK...
2   Thanks God it wasn t too hot or too humid wh...
3   HK Disneyland is a great compact park. Unfortu...
4   the location is not in the city, took around 1...

Followed by

classifier("My name is mark")

Output is

[{'label': 'POSITIVE', 'score': 0.9953688383102417}]

Followed by code (assigning the result first so it can be reused; `value` was undefined in the original snippet):

value = classifier("My name is mark")
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment

Output is

['POSITIVE']

Appending all the rows to an empty list:

text = []

for index, row in data.iterrows():
    text.append(row['Review'])

I'm trying to get the sentiment for all the rows

sent = []

for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)

The error is:

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
      2 
      3 for i in range(len(data)):
----> 4     sentiment = classifier(data.iloc[i,0])
      5     sent.append(sentiment)

11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self
@LysandreJik (Member) commented Apr 5, 2021

Could you specify truncation=True when calling the pipeline with your data? I.e., replace classifier("My name is mark") with classifier("My name is mark", truncation=True).

@nithinreddyy (Author)

> Could you specify truncation=True when calling the pipeline with your data? I.e., replace classifier("My name is mark") with classifier("My name is mark", truncation=True).

Yes, of course I can do it for one comment, but I have a column with multiple comments. How about that?
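A minimal sketch of how the suggestion extends to a whole column (assuming `classifier` and `data` as defined in the original post): pipelines accept a list of strings, so `truncation=True` applies to every row in one call.

# Sketch: apply the classifier with truncation to every row of the Review column.
# Assumes `classifier` and `data` are defined as in the original post.
reviews = data['Review'].astype(str).tolist()

# Pipelines accept a list of inputs and return one result dict per input;
# truncation=True cuts each text to the model's 512-token maximum.
results = classifier(reviews, truncation=True)

sent = [r['label'] for r in results]
data['sentiment'] = sent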

@github-actions bot commented May 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@jmrjfs commented Jun 21, 2021

I tried running it with truncation=True but still receive the following error message: InvalidArgumentError: indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather]

Many thanks!

@nithinreddyy (Author)

> I tried running it with truncation=True but still receive the following error message: InvalidArgumentError: indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather]

Can you post the code? I'll look into it. Are you trying to train a custom model?

@jmrjfs commented Jun 21, 2021

Many thanks for coming back!

I am just applying the BERT model to classify Reddit posts into neutral, negative, and positive; the posts range from as few as 5 words to as many as 3,500 words. I know that there is a lot of ongoing research on extending such models to longer sequences...

I am using pipeline from Hugging Face. Under the base model the truncation actually works, but under the model I use (cardiffnlp/twitter-roberta-base-sentiment) it somehow doesn't…

classifier_2 = pipeline('sentiment-analysis', model = "cardiffnlp/twitter-roberta-base-sentiment")
sentiment = classifier_2(df_body.iloc[4]['Content'], truncation=True)
print(sentiment)

where df_body.iloc[4]['Content'] is a 3,500-word text.

The warning is: "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation."

My dumb solution would be to drop all the words after the 512th in the pre-cleaning process…
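A possible workaround (a sketch, not verified against that checkpoint): the warning suggests the tokenizer for cardiffnlp/twitter-roberta-base-sentiment ships without a usable model_max_length, so truncation=True has no length to truncate to. Setting it explicitly before building the pipeline should restore truncation. Incidentally, RoBERTa's 514 position embeddings leave room for 512 real tokens (positions are offset by the padding index), which matches the indices[0,512] = 514 error above.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The checkpoint's tokenizer config apparently lacks a maximum length,
# so give truncation an explicit target (512 tokens for RoBERTa-base).
tokenizer.model_max_length = 512

classifier_2 = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
sentiment = classifier_2(long_text, truncation=True)  # long_text: hypothetical input string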

@nithinreddyy (Author)

> I am using pipeline from Hugging Face. Under the base model the truncation actually works, but under the model I use (cardiffnlp/twitter-roberta-base-sentiment) it somehow doesn't…

Can you try this code? It's not the RoBERTa model, but the Huggingface-Sentiment-Pipeline model:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers

model = AutoModelForSequenceClassification.from_pretrained('Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('Huggingface-Sentiment-Pipeline')

classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

content = 'enter your content here'

# Check the length of the content (note: this counts characters, not tokens)
len(content)

# Now run the classifier pipeline with truncation enabled
classifier(content, truncation=True)

Meanwhile, I'll try to figure it out for the RoBERTa model.
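One caveat on the snippet above: len(content) counts characters, whereas the 512 limit applies to tokens. A quick sketch of how to check the actual token count with the same tokenizer (the variable `token` from the snippet):

# Count the tokens the model will actually see; the 512 limit applies here,
# not to the character count reported by len(content).
n_tokens = len(token(content)['input_ids'])
print(n_tokens)  # anything above 512 needs truncation for this model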

@jmrjfs commented Jun 22, 2021

Thanks so much for your help! I also dug in a bit further… It seems the RoBERTa model I was using can only handle about 286 words per input? (I used an example text and cut it down until it ran.) It might be easiest to pre-process the data first rather than relying on truncation within the classifier.

@nithinreddyy (Author)

> It seems the RoBERTa model I was using can only handle about 286 words per input? It might be easiest to pre-process the data first rather than relying on truncation within the classifier.

Actually, you can train your own model on top of pre-trained models if you have the content and its respective class. That makes the model much more accurate. I have BERT code for this; if you want, I can give it to you.
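For anyone curious, a minimal sketch of what fine-tuning on top of a pre-trained checkpoint can look like with the Trainer API. Everything below — the texts, labels, and base checkpoint — is a placeholder, not the code offered above.

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder data: your own texts and integer class labels.
texts = ["great ride", "too crowded"]
labels = [1, 0]

checkpoint = "bert-base-uncased"  # any classification-capable base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize once up front, truncating to the model's 512-token limit.
encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps the tokenized inputs and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='out', num_train_epochs=3),
    train_dataset=ReviewDataset(encodings, labels),
)
trainer.train()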

@jmrjfs commented Jun 23, 2021 via email

@Abe410 commented Jun 16, 2022

> Actually, you can train your own model on top of pre-trained models if you have the content and its respective class. That makes the model much more accurate. I have BERT code for this; if you want, I can give it to you.

Hey,

I am working on exactly the same problem. Does it really make the model more accurate?

Mind sharing the code with me as well? Thanks!
