
Actually the extra_id are from 0-99 and not from 1-100 #5967

Merged: 1 commit into huggingface:master on Jul 30, 2020

Conversation

@orena1 (Contributor) commented Jul 22, 2020

from transformers import T5Tokenizer

# assuming the t5-base tokenizer (as used in the verification below)
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# <extra_id_99> is in the vocabulary, so it encodes to a single sentinel id
a = tokenizer.encode("we got a <extra_id_99>", return_tensors='pt', add_special_tokens=True)
print(a)
>tensor([[   62,   530,     3,     9, 32000]])
# <extra_id_100> is not in the vocabulary, so it gets split into subword pieces
a = tokenizer.encode("we got a <extra_id_100>", return_tensors='pt', add_special_tokens=True)
print(a)
>tensor([[   62,   530,     3,     9,     3,     2, 25666,   834,    23,    26,
           834,  2915,  3155]])

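For reference, the id mapping shows the same 0-99 boundary. A minimal sketch, assuming the same t5-base tokenizer; the ids in the comments are what t5-base's 32,100-token vocabulary gives:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# T5 numbers its 100 sentinel tokens downward from the top of the vocabulary,
# so <extra_id_0> has the highest id and <extra_id_99> the lowest.
print(tokenizer.convert_tokens_to_ids("<extra_id_0>"))   # 32099
print(tokenizer.convert_tokens_to_ids("<extra_id_99>"))  # 32000, matching the tensor above
print(tokenizer.convert_tokens_to_ids("<extra_id_100>")) # 2 (the unk id): the token does not exist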
@codecov (bot) commented Jul 22, 2020

Codecov Report

Merging #5967 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #5967   +/-   ##
=======================================
  Coverage   78.51%   78.51%           
=======================================
  Files         146      146           
  Lines       26214    26214           
=======================================
  Hits        20581    20581           
  Misses       5633     5633           


@orena1 (Contributor, Author) commented Jul 24, 2020

Hi @patrickvonplaten, can you have a look?

@LysandreJik (Member) left a comment

Indeed!

Verified with

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

for i in range(101):
    print(i, f"<extra_id_{i}>" in tokenizer.get_vocab())

which outputs True for [0, 99] and False for 100.
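The same check collapses to a count (a sketch under the same assumptions, continuing the snippet above):

# Count how many <extra_id_*> tokens actually exist in the vocabulary.
n_sentinels = sum(f"<extra_id_{i}>" in tokenizer.get_vocab() for i in range(200))
print(n_sentinels)  # 100, i.e. indices 0 through 99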

@LysandreJik (Member)

Thanks @orena1!

@LysandreJik LysandreJik merged commit d24ea70 into huggingface:master Jul 30, 2020
@LysandreJik (Member)

Pinging @patrickvonplaten for notification.

@orena1 orena1 deleted the patch-5 branch July 30, 2020 12:13