Support decoding single tokens with CodeGenTokenizer #28627

Closed · Fixed by #28628

cmathw (Contributor) commented Jan 22, 2024

System Info

  • transformers version: 4.34.1
  • Platform: macOS-14.2.1-arm64-arm-64bit
  • Python version: 3.11.6
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.4.0
  • Accelerate version: 0.23.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No.
  • Using distributed or parallel set-up in script?: No.

Who can help?

@ArthurZucker @rooa

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code to reproduce:

from transformers.models.auto.tokenization_auto import AutoTokenizer

phi_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", 
                                              add_bos_token=True, 
                                              use_fast=False, 
                                              trust_remote_code=True)

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", 
                                               add_bos_token=True, 
                                               use_fast=False, 
                                               trust_remote_code=True)

a = "The cat sat on the mat"
gpt2_tokens = gpt2_tokenizer(a, return_tensors="pt")["input_ids"][0] # torch.Size([7])
gpt2_str_tokens = gpt2_tokenizer.batch_decode(gpt2_tokens) # Essentially: [gpt2_tokenizer.decode(seq) for seq in gpt2_tokens]
print(gpt2_str_tokens) # <-- This is fine and will output: ['<|endoftext|>', 'The', ' cat', ' sat', ' on', ' the', ' mat']

gpt2_single_decode = [gpt2_tokenizer.decode(gpt2_tokens[0])]
print(gpt2_single_decode) # <-- Decoding a 0-D tensor, this is fine and will output: ['<|endoftext|>']

phi_tokens = phi_tokenizer(a, return_tensors="pt")["input_ids"][0] # torch.Size([7])
phi_str_tokens = phi_tokenizer.batch_decode(phi_tokens) # Essentially: [phi_tokenizer.decode(seq) for seq in phi_tokens]
print(phi_str_tokens) # <-- Fails, because each element passed to decode is a 0-d tensor (see below)

phi_single_decode = [phi_tokenizer.decode(phi_tokens[0])]
print(phi_single_decode) # <-- Fails: CodeGenTokenizer.decode cannot handle a 0-d tensor

Both failing calls raise:
TypeError: iteration over a 0-d tensor
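Until a fix lands, a possible workaround (my own sketch, assuming decode accepts plain Python ints, as it does for the GPT-2 tokenizer) is to convert the tensor elements to ints before decoding:

phi_single_decode = [phi_tokenizer.decode(phi_tokens[0].item())]  # .item(): 0-d tensor -> plain int (workaround sketch, not from the original report)
print(phi_single_decode)  # ['<|endoftext|>']

phi_str_tokens = [phi_tokenizer.decode(t) for t in phi_tokens.tolist()]  # decode one id at a time
print(phi_str_tokens)  # ['<|endoftext|>', 'The', ' cat', ' sat', ' on', ' the', ' mat']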

Expected behavior

In the above example,

phi_str_tokens = phi_tokenizer.batch_decode(phi_tokens)

should return ['<|endoftext|>', 'The', ' cat', ' sat', ' on', ' the', ' mat'], and

phi_single_decode = [phi_tokenizer.decode(phi_tokens[0])]

should return ['<|endoftext|>']. This matches what the GPT-2 tokenizer returns in both cases.
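The underlying issue is presumably that CodeGenTokenizer.decode iterates over token_ids without first converting tensors to plain Python objects. A minimal sketch of that normalization, using the to_py_obj helper from transformers (my illustration of the idea, not the actual diff from #28628):

import torch
from transformers.utils import to_py_obj

# to_py_obj converts framework tensors to plain Python objects:
# a 0-d tensor becomes an int, a 1-d tensor becomes a list of ints.
single = torch.tensor(50256)          # 0-d tensor, the case that currently raises
sequence = torch.tensor([464, 3797])  # 1-d tensor

print(to_py_obj(single))    # 50256 (a plain int, safe to pass to decode)
print(to_py_obj(sequence))  # [464, 3797] (a plain list, safe to iterate)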

@ArthurZucker (Collaborator) commented:

thanks for the detailed issue!
