Support decoding single tokens with CodeGenTokenizer #28627

Closed · Fixed by #28628

cmathw (Contributor) commented Jan 22, 2024

System Info

  • transformers version: 4.34.1
  • Platform: macOS-14.2.1-arm64-arm-64bit
  • Python version: 3.11.6
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.4.0
  • Accelerate version: 0.23.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No.
  • Using distributed or parallel set-up in script?: No.

Who can help?

@ArthurZucker @rooa

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code to reproduce:

from transformers.models.auto.tokenization_auto import AutoTokenizer

phi_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", 
                                              add_bos_token=True, 
                                              use_fast=False, 
                                              trust_remote_code=True)

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", 
                                               add_bos_token=True, 
                                               use_fast=False, 
                                               trust_remote_code=True)

a = "The cat sat on the mat"
gpt2_tokens = gpt2_tokenizer(a, return_tensors="pt")["input_ids"][0] # torch.Size([7])
gpt2_str_tokens = gpt2_tokenizer.batch_decode(gpt2_tokens) # Essentially: [gpt2_tokenizer.decode(seq) for seq in gpt2_tokens]
print(gpt2_str_tokens) # <-- This is fine and will output: ['<|endoftext|>', 'The', ' cat', ' sat', ' on', ' the', ' mat']

gpt2_single_decode = [gpt2_tokenizer.decode(gpt2_tokens[0])]
print(gpt2_single_decode) # <-- Decoding a 0-D tensor, this is fine and will output: ['<|endoftext|>']

phi_tokens = phi_tokenizer(a, return_tensors="pt")["input_ids"][0] # torch.Size([7])
phi_str_tokens = phi_tokenizer.batch_decode(phi_tokens) # Essentially: [phi_tokenizer.decode(seq) for seq in phi_tokens]
print(phi_str_tokens) # <-- Fails, because each element passed to decode is a 0-d tensor (see below)

phi_single_decode = [phi_tokenizer.decode(phi_tokens[0])]
print(phi_single_decode) # <-- Fails: CodeGenTokenizer.decode cannot handle a 0-d tensor

Both failing calls raise:
TypeError: iteration over a 0-d tensor
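Until a fix lands, a possible workaround (my own sketch, assuming decode accepts plain Python ints, as it does for the GPT-2 tokenizer) is to convert the tensor elements to ints before decoding:

phi_single_decode = [phi_tokenizer.decode(phi_tokens[0].item())]  # .item(): 0-d tensor -> plain int (workaround sketch, not from the original report)
print(phi_single_decode)  # ['<|endoftext|>']

phi_str_tokens = [phi_tokenizer.decode(t) for t in phi_tokens.tolist()]  # decode one id at a time
print(phi_str_tokens)  # ['<|endoftext|>', 'The', ' cat', ' sat', ' on', ' the', ' mat']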

Expected behavior

In the above example,

phi_str_tokens = phi_tokenizer.batch_decode(phi_tokens)

should return ['<|endoftext|>', 'The', ' cat', ' sat', ' on', ' the', ' mat'], and

phi_single_decode = [phi_tokenizer.decode(phi_tokens[0])]

should return ['<|endoftext|>']. This matches what the GPT-2 tokenizer returns in both cases.
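The underlying issue is presumably that CodeGenTokenizer.decode iterates over token_ids without first converting tensors to plain Python objects. A minimal sketch of that normalization, using the to_py_obj helper from transformers (my illustration of the idea, not the actual diff from #28628):

import torch
from transformers.utils import to_py_obj

# to_py_obj converts framework tensors to plain Python objects:
# a 0-d tensor becomes an int, a 1-d tensor becomes a list of ints.
single = torch.tensor(50256)          # 0-d tensor, the case that currently raises
sequence = torch.tensor([464, 3797])  # 1-d tensor

print(to_py_obj(single))    # 50256 (a plain int, safe to pass to decode)
print(to_py_obj(sequence))  # [464, 3797] (a plain list, safe to iterate)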

@ArthurZucker (Collaborator) commented:

thanks for the detailed issue!
