[BUG] GPT-2 tokenizer is NOT invertible #31884

Closed
jdeschena opened this issue Jul 10, 2024 · 19 comments · Fixed by #31938
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments

@jdeschena

jdeschena commented Jul 10, 2024

System Info

Hello,

It is my understanding that the gpt-2 tokenizer, obtained with AutoTokenizer.from_pretrained("gpt2"), should be invertible. That is, given a sentence text, we should have that

text == tokenizer.decode(tokenizer(text, add_special_tokens=False)["input_ids"])

However, this is not the case, unlike the tiktoken reference implementation, which is correctly invertible.

For example, given the sentence Is this restaurant family-friendly ? Yes No Unsure ? This is a follow-up sentence ., encoding and then decoding removes the space before punctuation, yielding a different sentence.

I have tried instantiating the tokenizer using GPT2Tokenizer.from_pretrained("openai-community/gpt2"), and using the options add_prefix_space=True or is_split_into_words=True, but the problem persists.

Hence, it looks like a bug to me, since BPE tokenizers should be invertible, as far as I understand.

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run this code, and you should see the bug. I am using transformers==4.38.2

# Imports needed to make this snippet runnable
from transformers import AutoTokenizer, GPT2Tokenizer
import tiktoken

# gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
oai_tokenizer = tiktoken.get_encoding("gpt2")

orig = "Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence ."

hf_enc = gpt2_tokenizer(orig)["input_ids"]
hf_dec = gpt2_tokenizer.decode(hf_enc)

oai_enc = oai_tokenizer.encode(orig)
oai_dec = oai_tokenizer.decode(oai_enc)

print(hf_dec)
print(oai_dec)

Expected behavior

The two decoded sentences should be equal, yet they are not.

@amyeroberts amyeroberts added the Core: Tokenization Internals of the library; Tokenization. label Jul 10, 2024
@ArthurZucker
Collaborator

Hey! Pretty sure this is due to the clean_up_tokenization_spaces argument. cc @itazap let's see if we can do a deprecation cycle for this one -> deactivate it by default but allow it to be set in the tokenizer's parameters (tokenizer.clean_up_tokenization_spaces), as it's something that has been coming up quite a lot!
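
For context, a minimal sketch (not the eventual fix merged in #31938) showing that disabling the cleanup at decode time already restores the round trip from the report; the only non-default setting used is the clean_up_tokenization_spaces argument of decode:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

text = "Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence ."
ids = tokenizer(text, add_special_tokens=False)["input_ids"]

# With cleanup disabled, the space before each punctuation mark is preserved,
# matching tiktoken's decode of the same ids.
decoded = tokenizer.decode(ids, clean_up_tokenization_spaces=False)
print(decoded == text)  # expected: True
```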

@itazap
Collaborator

itazap commented Jul 11, 2024

@ArthurZucker deprecate or set to False by default (currently it is set to True by default)? If we allow it to be set, then we do not deprecate?

@ArthurZucker
Collaborator

We should still deprecate (if None, default to True, but next release we default to False)
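
A rough sketch of what that default-flipping deprecation cycle could look like (hypothetical helper for illustration only, not the actual patch in #31938):

```python
import warnings

def resolve_clean_up_tokenization_spaces(value):
    """Hypothetical illustration of the proposed deprecation cycle."""
    if value is None:
        # User did not set the flag: keep today's default (True) but warn
        # that a future release will flip the default to False.
        warnings.warn(
            "`clean_up_tokenization_spaces` was not set; defaulting to True. "
            "A future release will default it to False.",
            FutureWarning,
        )
        return True
    return value
```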


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

joelburget added a commit to joelburget/sae_expected_l0 that referenced this issue Aug 11, 2024
```
root@ceb53da4d023:~/sae_expected_l0# python3 main.py
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Loaded pretrained model roneneldan/TinyStories-1M into HookedTransformer
Repo card metadata block was not found. Setting CardData to empty.
sigma:   0%|                                                                              | 0/5 [00:00<?, ?it/s]
wandb: Currently logged in as: joelb. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.6
wandb: Run data is saved locally in /root/sae_expected_l0/wandb/run-20240811_194131-exu7ftic
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run unique-snow-4
wandb: ⭐️ View project at https://wandb.ai/joelb/sae_expected_l0
wandb: 🚀 View run at https://wandb.ai/joelb/sae_expected_l0/runs/exu7ftic
sigma:   0%|                                                                                                                                                                        | 0/5 [2:00:45<?, ?it/s]
Traceback (most recent call last):
  File "/root/sae_expected_l0/main.py", line 113, in <module>
    train(model, sae, ds, learning_rate, l0_coefficient)
  File "/root/sae_expected_l0/main.py", line 61, in train
    _, cache = model.run_with_cache(torch.tensor(tokens), remove_batch_dim=True)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 631, in run_with_cache
    out, cache_dict = super().run_with_cache(
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/hook_points.py", line 566, in run_with_cache
    model_out = self(*model_args, **model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 522, in forward
    ) = self.input_to_embed(
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 330, in input_to_embed
    embed = self.hook_embed(self.embed(tokens))  # [batch, pos, d_model]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/components/embed.py", line 34, in forward
    return self.W_E[tokens, :]
IndexError: tensors used as indices must be long, int, byte or bool tensors
wandb: | 0.011 MB of 0.011 MB uploaded
wandb: Run history:
wandb:             l0_loss ▃▅▃▂▃▂▂▂█▂▁▁▂▃▂▃▁▂▂▂▁▂▂▂▃▂▁▃▁▂▁▂▁▃▁▂▁▂▂▁
wandb:                loss ▃▅▃▂▃▂▂▂█▂▁▁▂▃▂▃▁▂▂▂▁▂▂▂▃▂▁▃▁▂▁▂▁▃▁▂▁▂▂▁
wandb: reconstruction_loss █▃▂▂▂▂▂▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:        total_tokens ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb:             l0_loss 29664.35352
wandb:                loss 29.66452
wandb: reconstruction_loss 0.00016
wandb:        total_tokens 7541274
wandb:
wandb: 🚀 View run unique-snow-4 at: https://wandb.ai/joelb/sae_expected_l0/runs/exu7ftic
wandb: ⭐️ View project at: https://wandb.ai/joelb/sae_expected_l0
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240811_194131-exu7ftic/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
```
@Garrettlynch

I'm getting this warning with Flux in ComfyUI and it points to this issue. What is the solution to resolve it? Where do I set the clean_up_tokenization_spaces parameter to False?

Full terminal output below:

got prompt
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
/home/garrett/AI/ComfyUI/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
./launch.sh: line 7: 47378 Killed                  python3 main.py
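
For reference, the FutureWarning only fires when the flag is left unset, so it can be silenced by passing it explicitly wherever the tokenizer is created; a minimal sketch, assuming you control the loading code (downstream tools such as ComfyUI would need to do this internally):

```python
from transformers import AutoTokenizer

# Setting the flag explicitly avoids the unset-default code path,
# so the FutureWarning about clean_up_tokenization_spaces is not emitted.
tokenizer = AutoTokenizer.from_pretrained(
    "openai-community/gpt2",
    clean_up_tokenization_spaces=False,
)
```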

lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
add first version of classify_text.py

❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
use CUDA / GPU

❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]

It looks like your script is defaulting to using the CPU despite the availability of a GPU. To use the GPU, you'll need to specify the device argument in the pipeline configuration. Here's how you can modify your script to use the GPU:

Install PyTorch with GPU support: Make sure you have PyTorch installed with GPU support. You can install it using pip:

pip install torch torchvision torchaudio
Ensure you have the appropriate version based on your CUDA version. You can find the correct version here.

Modify Your Script: Update your script to specify the device. You can use the device argument in the pipeline function to set the device to GPU. For example:

from transformers import pipeline

classifier = pipeline("sentiment-analysis", device=0)

result = classifier("I love using GPUs for deep learning!")
print(result)
In this example, device=0 indicates that the first GPU should be used. If you want to use a different GPU or multiple GPUs, you can adjust the device index accordingly.

Verify GPU Usage: Run your script again and verify that the GPU is being utilized. You can use tools like nvidia-smi to check GPU usage in real time.

If you encounter any issues or need further assistance, let me know!
lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
can detect missing file — not found — or wrong file name

❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(

I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

The file cleaned_text.txt.1was not found.
lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(

I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (944055) must match the size of tensor b (512) at non-singleton dimension 1
ChatGPT said:
It looks like you’re running into a couple of issues with the classify_text.py script:

Model Warning: The script is defaulting to a model (distilbert-base-uncased-finetuned-sst-2-english) and not specifying a model name and revision in production, which is not recommended.

Tokenization Warning: There's a FutureWarning about clean_up_tokenization_spaces. This won’t affect your current run, but it’s good to be aware of changes in future versions.

Token Indices Sequence Length Error: Your input text is too long for the model’s maximum sequence length (512 tokens). This is why you’re seeing the error message about tensor size mismatch.

Solutions:
Specify a Model: If you want to use a specific model, make sure to specify it explicitly in your script or pipeline to avoid relying on defaults. For example:

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="your-chosen-model")
Handling Long Text: To address the sequence length issue, you need to truncate or split your text before passing it to the model. Here’s how you can handle it:

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

def classify_long_text(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    label = 'POSITIVE' if predictions[0][1] > 0.5 else 'NEGATIVE'
    return label, predictions[0].tolist()

text = "Your long text goes here"
label, scores = classify_long_text(text)
print(f"Label: {label}, Scores: {scores}")
This code will truncate the input text to fit within the 512-token limit and classify it accordingly.

Feel free to ask if you need further clarification or assistance!

You said:
rewrite

from transformers import pipeline

classifier = pipeline('text-classification', device=0)
test_text = "I love using GPUs for deep learning!"
result = classifier(test_text)
print ("\n")
print (test_text)
print(result)

test_text = "I love using Hugging Face models!"
results = classifier(test_text)
print ("\n")
print (test_text)
print(result)
print ("\n")

transcribed_text = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text, 'r') as file:
        transcribed_text = file.read()

    # Classify the text
    results = classifier(transcribed_text)

    # Print results
    print(results)

except FileNotFoundError:
    print("The file " + transcribed_text + "was not found.")

except Exception as e:
    print(f"An error occurred: {e}")

ChatGPT said:
Certainly! Here’s a revised version of your script with better error handling and a clearer structure:

from transformers import pipeline

classifier = pipeline('text-classification', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Check if the text is too long for the model
    if len(transcribed_text.split()) > 512:
        print("The text is too long and will be truncated.")
        # Truncate the text to the first 512 tokens (or any appropriate length)
        transcribed_text = ' '.join(transcribed_text.split()[:512])

    # Classify the text
    results = classifier(transcribed_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")

except Exception as e:
    print(f"An error occurred: {e}")
Changes Made:
Function for Classification: Added a classify_text function to avoid code repetition and make it easier to manage.
Iterate Over Test Texts: Used a loop to classify multiple test texts.
File Handling and Text Truncation: Added a check to truncate the text if it exceeds the model’s maximum token length.
Error Messages: Improved error handling messages for better clarity.
This structure should make your script more robust and easier to understand. Let me know if you need further adjustments!
lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(

I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]

The text is too long and will be truncated.
Token indices sequence length is longer than the specified maximum sequence length for this model (621 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (621) must match the size of tensor b (512) at non-singleton dimension 1

ChatGPT said:
It looks like the script is handling the long text, but the truncation logic needs some adjustment to prevent the error. The error suggests that the truncation logic might not be correctly applied, or there might be another issue causing the length mismatch.

Here's an updated version of the script to ensure proper truncation and better handling of the text length:

from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Tokenize and check if the text is too long for the model
    tokenizer = classifier.tokenizer
    inputs = tokenizer(transcribed_text, return_tensors='pt', truncation=False, padding=False)
    if len(inputs['input_ids'][0]) > 512:
        print("The text is too long and will be truncated.")
        # Truncate the text to the first 512 tokens
        transcribed_text = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][:512]))

    # Classify the text
    results = classifier(transcribed_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")

except Exception as e:
    print(f"An error occurred: {e}")
Key Updates:
Specify the Model: Explicitly set the model in the pipeline initialization to avoid the default warning.
Truncate Correctly: Properly handle truncation by using the tokenizer to manage token length.
Handle Token Length: Ensure the text is truncated correctly to fit within the model's token limit.
This should handle text longer than 512 tokens more effectively. Let me know if you encounter any further issues!
lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
❯ python classify_text.py
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 629/629 [00:00<00:00, 1.27MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████| 268M/268M [00:26<00:00, 10.3MB/s]
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 182kB/s]
vocab.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 3.39MB/s]
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(

I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]

Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
The text is too long and will be truncated.
An error occurred: The size of tensor a (514) must match the size of tensor b (512) at non-singleton dimension 1
ChatGPT said:
It seems like the truncation is still not being applied correctly, resulting in an error due to exceeding the token length limit. The input_ids handling may need to be corrected to ensure proper truncation and avoid mismatches.

Here’s an updated version of the script that handles long text by using explicit truncation in both tokenization and classification steps:

from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Tokenizer for handling long texts
    tokenizer = classifier.tokenizer
    inputs = tokenizer(transcribed_text, return_tensors='pt', truncation=True, padding=False, max_length=512)
    truncated_text = tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True)

    # Classify the truncated text
    results = classifier(truncated_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")

except Exception as e:
    print(f"An error occurred: {e}")
Key Changes:
Use Tokenizer with Truncation: Set truncation=True and max_length=512 in the tokenizer to handle the text length.
Decode Truncated Text: Convert token IDs back to text to ensure correct representation for classification.
This script should ensure that the text is properly truncated to fit the model's maximum sequence length and avoid errors. If you continue to experience issues, it might be helpful to verify that the classifier.tokenizer is the right tokenizer for your model.
lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(

I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]

[{'label': 'NEGATIVE', 'score': 0.9947984218597412}]
lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(

I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]

" xxxx xxxx xxxx xxxx xxxx xxxx xxxx ny xxxx xxxx xxxx date to macys xxxx inc xxxx xxxx xxxx xxxxxxxx xxxx xxxx ny xxxx attn xxxx xxxx executive office dear xxxx xxxx my name is xxxx xxxx and i am writing to address the ongoing issues with my macys credit account account number xxxx despite multiple attempts to resolve these matters over the phone i have not received the necessary assistance to correct the inaccuracies on my account xxxx disputed charge and lack of support i have called macys customer service many times to dispute a charge of 3100 and address other concerns each time the representatives were either unwilling or unable to assist me effectively macys should have records of these calls to verify my repeated efforts xxxx late fee i acknowledge the late fee of 3100 charged on xxxxyear and its subsequent credit on xxxxyear however this does not resolve the main issue of incorrect reporting xxxx credit reporting inaccuracy my account was reported as 30 days delinquent in xxxxyear i have consistently disputed this information stating that payments for xxxx and xxxxyear were made but not properly applied to my account additional impact due to this delinquency on my credit report my auto insurance increased to 33000 per month this has been a significant financial burden i request that macys conduct a comprehensive review including the phone call records and correct the inaccuracies on my account additionally i request the removal of the disputed by consumer note from my credit report once the issue is resolved additionally i do not want any macys advertisements sent to my apartment i hope to resolve this matter amicably and promptly thank you for your attention to this important issue sincerely xxxx xxxx " timely payments are always a priority for me and i am certain about this however i am unsure why this company is reporting me as late in certain months which should not be the case according to usc 1666b any billing error should be corrected or they will be liable to pay me 100000 for each account reporting inaccurately i never miss a payment i am certain of this i dont get why this company is marking me late for some months usc 1666b states that billing errors should be corrected and shown as paid on time i purchased a xxxx xxxx xxxx xxxx on xxxxxxxx using my td bank xxxx xxxx credit account phone was supposed to be delivered xxxxxxxx instead i get a delivery delayed notice i contact xx
[{'label': 'NEGATIVE', 'score': 0.9947984218597412}]
lloydchang added a commit to Complainers-R-Us/Complainers-R-Us that referenced this issue Aug 17, 2024
it's not working

❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(

I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]

I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]

Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 1

i think what you need to do is

split the text file into smaller-sized chunks, then iterate through them in multiple iterations, a little bit at a time, in a for loop or while loop or something similar
ChatGPT said:
Got it! To ensure the text is split into manageable chunks and processed in iterations, we can adjust the script to handle splitting and iterating over chunks more explicitly. Here’s how you can do it:

Split the text into smaller chunks.
Iterate through these chunks and classify each chunk.
Here’s the revised script:

from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

def split_text_into_chunks(text, max_length=512):
    """Split text into chunks of a specified maximum length."""
    tokenizer = classifier.tokenizer
    tokens = tokenizer.encode(text, truncation=False)
    # Split tokens into chunks
    for i in range(0, len(tokens), max_length):
        yield tokenizer.decode(tokens[i:i + max_length], skip_special_tokens=True)

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Split text and classify each chunk
    for chunk in split_text_into_chunks(transcribed_text):
        classify_text(chunk)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")

except Exception as e:
    print(f"An error occurred: {e}")
Key Points:
split_text_into_chunks Function: This function splits the input text into chunks of up to max_length tokens.
Iterate Over Chunks: The script iterates over each chunk and classifies it.
This approach should handle long texts by processing them in smaller pieces and avoid the token length issues you were encountering.

You said:
are you looping? i don't see you looping
ChatGPT said:
You're right; the loop was implied by the split_text_into_chunks generator function but wasn't explicitly shown. Let's make sure the iteration over chunks is clear. Here’s a version with an explicit loop to handle each chunk:

from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

def split_text_into_chunks(text, max_length=512):
    """Split text into chunks of a specified maximum length."""
    tokenizer = classifier.tokenizer
    tokens = tokenizer.encode(text, truncation=False)
    # Split tokens into chunks
    for i in range(0, len(tokens), max_length):
        yield tokenizer.decode(tokens[i:i + max_length], skip_special_tokens=True)

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Split text and classify each chunk
    for chunk in split_text_into_chunks(transcribed_text):
        classify_text(chunk)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")

except Exception as e:
    print(f"An error occurred: {e}")
Breakdown:
split_text_into_chunks Function: Uses a generator to yield chunks of text.
Explicit Loop: The for chunk in split_text_into_chunks(transcribed_text) loop processes each chunk individually.
This ensures each chunk of the text is classified separately. Let me know if this works for you or if you need further adjustments!
@Simontakeasit

Issue: warnings.warn(clip missing: ['text_projection.weight']

Issue resolved ->

Fix approach: in the workflow section UNET/MODEL, you can set weight_dtype (the UNET weight_dtype) from 'default' to one of the fp8 types in the dropdown list and then re-run.

Hopefully this can help you guys.

@Garrettlynch

Issue: warnings.warn(clip missing: ['text_projection.weight']

Issue resolved ->

Fix approach: in the workflow section UNET/MODEL, you can set weight_dtype (the UNET weight_dtype) from 'default' to one of the fp8 types in the dropdown list and then re-run.

Hopefully this can help you guys.

This resolved the issue for me - so simple.

@xie-yao-oos

from transformers import MBartForConditionalGeneration, MBart50Tokenizer
from langdetect import detect

# Specify the model folder path
model_path = "model"

# Load the tokenizer and model
tokenizer = MBart50Tokenizer.from_pretrained(model_path)
model = MBartForConditionalGeneration.from_pretrained(model_path)

# Define the texts to be translated
texts_to_translate = [
    "Guten Morgen! Wie kann ich Ihnen heute helfen?",
    "Das Wetter ist heute schön. Lass uns spazieren gehen.",
    "Ich liebe es, in meiner Freizeit Bücher zu lesen。",
    "I love to read books in my spare time.",
    "Bonjour! Comment ça va?",
    "¿Cómo estás hoy?",
]

# Iterate over the texts and perform translation
for source_text in texts_to_translate:
    # Use langdetect to determine the language of the source text
    detected_language = detect(source_text)
    print(f"Detected language: {detected_language}")

    # Set the source language based on the detected language
    if detected_language == 'de':
        src_lang = "de_DE"
    elif detected_language == 'en':
        src_lang = "en_XX"
    elif detected_language == 'fr':
        src_lang = "fr_XX"
    elif detected_language == 'es':
        src_lang = "es_XX"
    else:
        print(f"Unsupported language: {detected_language}. Skipping...")
        continue  # If the language is unsupported, skip to the next text

    # Set the source language in the tokenizer
    tokenizer.src_lang = src_lang

    # Encode the source text
    encoded_text = tokenizer(source_text, return_tensors="pt")

    # Generate translation, ensuring the target language is Chinese
    generated_tokens = model.generate(
        **encoded_text,
        forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]  # Target language code (Chinese)
    )

    # Decode the translation result
    translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

    # Check if the translated text is valid
    if not translated_text.strip():
        translated_text = "Translation error or empty output."

    print(f"Source Text: {source_text}")
    print("Translated Text (zh_CN):", translated_text)
    print()  # Output a blank line for better readability

out:

Detected language: de
Source Text: Guten Morgen! Wie kann ich Ihnen heute helfen?
Translated Text (zh_CN): Good morning! How can I help you today?

Detected language: de
Source Text: Das Wetter ist heute schön. Lass uns spazieren gehen.
Translated Text (zh_CN): The weather is fine today. Let’s go for a walk.

Detected language: de
Source Text: Ich liebe es, in meiner Freizeit Bücher zu lesen。
Translated Text (zh_CN): I love to read books in my spare time.

Detected language: en
Source Text: I love to read books in my spare time.
Translated Text (zh_CN): 我喜欢在空闲时间读书.
