[BUG] GPT-2 tokenizer is NOT invertible #31884
Comments
Hey! Pretty sure this is due to `clean_up_tokenization_spaces`, which is enabled by default.
@ArthurZucker should we deprecate it, or set it to False by default (currently it defaults to True)? If we keep allowing it to be set, do we not deprecate it?
We should still deprecate: if it is None, default to True for now, but in the next release default to False.
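A rough Python sketch of that transition, for illustration only (this is not the actual transformers source; the helper name and warning wording are assumptions based on the FutureWarning quoted below):

```python
import warnings

def resolve_clean_up_tokenization_spaces(clean_up_tokenization_spaces=None):
    # Hypothetical helper showing the proposed deprecation path:
    # if the user did not set the flag, warn and keep the old default (True)
    # during the deprecation window; a later release flips the unset default to False.
    if clean_up_tokenization_spaces is None:
        warnings.warn(
            "`clean_up_tokenization_spaces` was not set. It currently defaults to `True`; "
            "a future release will default it to `False`. "
            "See https://github.com/huggingface/transformers/issues/31884",
            FutureWarning,
        )
        return True  # old default, kept for now
    return clean_up_tokenization_spaces
```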
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
```
root@ceb53da4d023:~/sae_expected_l0# python3 main.py
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Loaded pretrained model roneneldan/TinyStories-1M into HookedTransformer
Repo card metadata block was not found. Setting CardData to empty.
sigma:   0%|          | 0/5 [00:00<?, ?it/s]
wandb: Currently logged in as: joelb. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.6
wandb: Run data is saved locally in /root/sae_expected_l0/wandb/run-20240811_194131-exu7ftic
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run unique-snow-4
wandb: ⭐️ View project at https://wandb.ai/joelb/sae_expected_l0
wandb: 🚀 View run at https://wandb.ai/joelb/sae_expected_l0/runs/exu7ftic
sigma:   0%|          | 0/5 [2:00:45<?, ?it/s]
Traceback (most recent call last):
  File "/root/sae_expected_l0/main.py", line 113, in <module>
    train(model, sae, ds, learning_rate, l0_coefficient)
  File "/root/sae_expected_l0/main.py", line 61, in train
    _, cache = model.run_with_cache(torch.tensor(tokens), remove_batch_dim=True)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 631, in run_with_cache
    out, cache_dict = super().run_with_cache(
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/hook_points.py", line 566, in run_with_cache
    model_out = self(*model_args, **model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 522, in forward
    ) = self.input_to_embed(
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 330, in input_to_embed
    embed = self.hook_embed(self.embed(tokens))  # [batch, pos, d_model]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/components/embed.py", line 34, in forward
    return self.W_E[tokens, :]
IndexError: tensors used as indices must be long, int, byte or bool tensors
wandb: | 0.011 MB of 0.011 MB uploaded
wandb: Run history:
wandb:             l0_loss ▃▅▃▂▃▂▂▂█▂▁▁▂▃▂▃▁▂▂▂▁▂▂▂▃▂▁▃▁▂▁▂▁▃▁▂▁▂▂▁
wandb:                loss ▃▅▃▂▃▂▂▂█▂▁▁▂▃▂▃▁▂▂▂▁▂▂▂▃▂▁▃▁▂▁▂▁▃▁▂▁▂▂▁
wandb: reconstruction_loss █▃▂▂▂▂▂▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:        total_tokens ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb:             l0_loss 29664.35352
wandb:                loss 29.66452
wandb: reconstruction_loss 0.00016
wandb:        total_tokens 7541274
wandb:
wandb: 🚀 View run unique-snow-4 at: https://wandb.ai/joelb/sae_expected_l0/runs/exu7ftic
wandb: ⭐️ View project at: https://wandb.ai/joelb/sae_expected_l0
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240811_194131-exu7ftic/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
```
I'm getting this warning with Flux in ComfyUI, and it points to this issue. What is the solution? Where do I set the `clean_up_tokenization_spaces` parameter to `False`? Full terminal output below:
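For anyone who only wants to silence the warning: the flag can be set when the tokenizer is loaded or when `decode` is called. A minimal sketch, assuming you construct the tokenizer yourself (in ComfyUI the tokenizer may be created inside the node code, so this may not be directly reachable; the `"gpt2"` checkpoint below is just a placeholder for whichever model your pipeline uses):

```python
from transformers import AutoTokenizer

# Setting the flag explicitly at load time avoids the FutureWarning,
# since the library no longer falls back to the unset default.
tokenizer = AutoTokenizer.from_pretrained("gpt2", clean_up_tokenization_spaces=False)

# It can also be passed per call when decoding.
ids = tokenizer("Is this restaurant family-friendly ?", add_special_tokens=False)["input_ids"]
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))
```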
add first version of classify_text.py

```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
```
use CUDA / GPU

```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
```

It looks like your script is defaulting to using the CPU despite the availability of a GPU. To use the GPU, you'll need to specify the `device` argument in the pipeline configuration. Here's how you can modify your script to use the GPU:

1. Install PyTorch with GPU support. Make sure you have PyTorch installed with GPU support. You can install it using pip:

```bash
pip install torch torchvision torchaudio
```

Ensure you have the appropriate version based on your CUDA version. You can find the correct version here.

2. Modify your script. Update your script to specify the device. You can use the `device` argument in the `pipeline` function to set the device to GPU. For example:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis", device=0)
result = classifier("I love using GPUs for deep learning!")
print(result)
```

In this example, `device=0` indicates that the first GPU should be used. If you want to use a different GPU or multiple GPUs, you can adjust the device index accordingly.

3. Verify GPU usage. Run your script again and verify that the GPU is being utilized. You can use tools like `nvidia-smi` to check GPU usage in real time.

If you encounter any issues or need further assistance, let me know!
can detect missing file — not found — or wrong file name

```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
The file cleaned_text.txt.1was not found.
```
```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (944055) must match the size of tensor b (512) at non-singleton dimension 1
```

ChatGPT said:

It looks like you're running into a couple of issues with the classify_text.py script:

1. Model warning: the script is defaulting to a model (distilbert-base-uncased-finetuned-sst-2-english) without specifying a model name and revision, which is not recommended in production.
2. Tokenization warning: there's a FutureWarning about clean_up_tokenization_spaces. This won't affect your current run, but it's good to be aware of changes in future versions.
3. Token indices sequence length error: your input text is too long for the model's maximum sequence length (512 tokens). This is why you're seeing the error message about the tensor size mismatch.

Solutions:

Specify a model: if you want to use a specific model, make sure to specify it explicitly in your script or pipeline to avoid relying on defaults. For example:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="your-chosen-model")
```

Handling long text: to address the sequence length issue, you need to truncate or split your text before passing it to the model. Here's how you can handle it:

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

def classify_long_text(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    label = 'POSITIVE' if predictions[0][1] > 0.5 else 'NEGATIVE'
    return label, predictions[0].tolist()

text = "Your long text goes here"
label, scores = classify_long_text(text)
print(f"Label: {label}, Scores: {scores}")
```

This code will truncate the input text to fit within the 512-token limit and classify it accordingly. Feel free to ask if you need further clarification or assistance!

You said: rewrite

```python
from transformers import pipeline

classifier = pipeline('text-classification', device=0)

test_text = "I love using GPUs for deep learning!"
result = classifier(test_text)
print("\n")
print(test_text)
print(result)

test_text = "I love using Hugging Face models!"
results = classifier(test_text)
print("\n")
print(test_text)
print(result)
print("\n")

transcribed_text = "cleaned_text.txt"
try:
    # Load transcribed text
    with open(transcribed_text, 'r') as file:
        transcribed_text = file.read()
    # Classify the text
    results = classifier(transcribed_text)
    # Print results
    print(results)
except FileNotFoundError:
    print("The file " + transcribed_text + "was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

ChatGPT said:

Certainly! Here's a revised version of your script with better error handling and a clearer structure:

```python
from transformers import pipeline

classifier = pipeline('text-classification', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Check if the text is too long for the model
    if len(transcribed_text.split()) > 512:
        print("The text is too long and will be truncated.")
        # Truncate the text to the first 512 tokens (or any appropriate length)
        transcribed_text = ' '.join(transcribed_text.split()[:512])

    # Classify the text
    results = classifier(transcribed_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Changes made:

- Function for classification: added a classify_text function to avoid code repetition and make it easier to manage.
- Iterate over test texts: used a loop to classify multiple test texts.
- File handling and text truncation: added a check to truncate the text if it exceeds the model's maximum token length.
- Error messages: improved error handling messages for better clarity.

This structure should make your script more robust and easier to understand. Let me know if you need further adjustments!
```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
The text is too long and will be truncated.
Token indices sequence length is longer than the specified maximum sequence length for this model (621 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (621) must match the size of tensor b (512) at non-singleton dimension 1
```

ChatGPT said:

It looks like the script is handling the long text, but the truncation logic needs some adjustment to prevent the error. The error suggests that the truncation logic might not be correctly applied, or there might be another issue causing the length mismatch. Here's an updated version of the script to ensure proper truncation and better handling of the text length:

```python
from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Tokenize and check if the text is too long for the model
    tokenizer = classifier.tokenizer
    inputs = tokenizer(transcribed_text, return_tensors='pt', truncation=False, padding=False)

    if len(inputs['input_ids'][0]) > 512:
        print("The text is too long and will be truncated.")
        # Truncate the text to the first 512 tokens
        transcribed_text = tokenizer.convert_tokens_to_string(
            tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][:512])
        )

    # Classify the text
    results = classifier(transcribed_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Key updates:

- Specify the model: explicitly set the model in the pipeline initialization to avoid the default warning.
- Truncate correctly: properly handle truncation by using the tokenizer to manage token length.
- Handle token length: ensure the text is truncated correctly to fit within the model's token limit.

This should handle text longer than 512 tokens more effectively. Let me know if you encounter any further issues!
```
❯ python classify_text.py
config.json: 100%|████████████████████| 629/629 [00:00<00:00, 1.27MB/s]
model.safetensors: 100%|████████████████████| 268M/268M [00:26<00:00, 10.3MB/s]
tokenizer_config.json: 100%|████████████████████| 48.0/48.0 [00:00<00:00, 182kB/s]
vocab.txt: 100%|████████████████████| 232k/232k [00:00<00:00, 3.39MB/s]
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
The text is too long and will be truncated.
An error occurred: The size of tensor a (514) must match the size of tensor b (512) at non-singleton dimension 1
```

ChatGPT said:

It seems like the truncation is still not being applied correctly, resulting in an error due to exceeding the token length limit. The input_ids handling may need to be corrected to ensure proper truncation and avoid mismatches. Here's an updated version of the script that handles long text by using explicit truncation in both the tokenization and classification steps:

```python
from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Tokenizer for handling long texts
    tokenizer = classifier.tokenizer
    inputs = tokenizer(transcribed_text, return_tensors='pt', truncation=True, padding=False, max_length=512)
    truncated_text = tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True)

    # Classify the truncated text
    results = classifier(truncated_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Key changes:

- Use the tokenizer with truncation: set truncation=True and max_length=512 in the tokenizer to handle the text length.
- Decode the truncated text: convert token IDs back to text to ensure correct representation for classification.

This script should ensure that the text is properly truncated to fit the model's maximum sequence length and avoid errors. If you continue to experience issues, it might be helpful to verify that classifier.tokenizer is the right tokenizer for your model.
```
❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
[{'label': 'NEGATIVE', 'score': 0.9947984218597412}]
```
```
❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
" xxxx xxxx xxxx xxxx xxxx xxxx xxxx ny xxxx xxxx xxxx date to macys xxxx inc xxxx xxxx xxxx xxxxxxxx xxxx xxxx ny xxxx attn xxxx xxxx executive office dear xxxx xxxx my name is xxxx xxxx and i am writing to address the ongoing issues with my macys credit account account number xxxx despite multiple attempts to resolve these matters over the phone i have not received the necessary assistance to correct the inaccuracies on my account xxxx disputed charge and lack of support i have called macys customer service many times to dispute a charge of 3100 and address other concerns each time the representatives were either unwilling or unable to assist me effectively macys should have records of these calls to verify my repeated efforts xxxx late fee i acknowledge the late fee of 3100 charged on xxxxyear and its subsequent credit on xxxxyear however this does not resolve the main issue of incorrect reporting xxxx credit reporting inaccuracy my account was reported as 30 days delinquent in xxxxyear i have consistently disputed this information stating that payments for xxxx and xxxxyear were made but not properly applied to my account additional impact due to this delinquency on my credit report my auto insurance increased to 33000 per month this has been a significant financial burden i request that macys conduct a comprehensive review including the phone call records and correct the inaccuracies on my account additionally i request the removal of the disputed by consumer note from my credit report once the issue is resolved additionally i do not want any macys advertisements sent to my apartment i hope to resolve this matter amicably and promptly thank you for your attention to this important issue sincerely xxxx xxxx " timely payments are always a priority for me and i am certain about this however i am unsure why this company is reporting me as late in certain months which should not be the case according to usc 1666b any billing error should be corrected or they will be liable to pay me 100000 for each account reporting inaccurately i never miss a payment i am certain of this i dont get why this company is marking me late for some months usc 1666b states that billing errors should be corrected and shown as paid on time i purchased a xxxx xxxx xxxx xxxx on xxxxxxxx using my td bank xxxx xxxx credit account phone was supposed to be delivered xxxxxxxx instead i get a delivery delayed notice i contact xx
[{'label': 'NEGATIVE', 'score': 0.9947984218597412}]
```
it's not working

```
❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 1
```

i think what you need to do is split the text file into smaller sized chunks, then iterate through them in multiple iterations, a little bit at a time, in a for loop or while loop or something similar

ChatGPT said:

Got it! To ensure the text is split into manageable chunks and processed in iterations, we can adjust the script to handle splitting and iterating over chunks more explicitly. Here's how you can do it:

1. Split the text into smaller chunks.
2. Iterate through these chunks and classify each chunk.

Here's the revised script:

```python
from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

def split_text_into_chunks(text, max_length=512):
    """Split text into chunks of a specified maximum length."""
    tokenizer = classifier.tokenizer
    tokens = tokenizer.encode(text, truncation=False)
    # Split tokens into chunks
    for i in range(0, len(tokens), max_length):
        yield tokenizer.decode(tokens[i:i + max_length], skip_special_tokens=True)

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Split text and classify each chunk
    for chunk in split_text_into_chunks(transcribed_text):
        classify_text(chunk)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Key points:

- split_text_into_chunks function: splits the input text into chunks of up to max_length tokens.
- Iterate over chunks: the script iterates over each chunk and classifies it.

This approach should handle long texts by processing them in smaller pieces and avoid the token length issues you were encountering.

You said: are you looping? i don't see you looping

ChatGPT said:

You're right; the loop was implied by the split_text_into_chunks generator function but wasn't explicitly shown. Let's make sure the iteration over chunks is clear. Here's a version with an explicit loop to handle each chunk:

```python
from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

def split_text_into_chunks(text, max_length=512):
    """Split text into chunks of a specified maximum length."""
    tokenizer = classifier.tokenizer
    tokens = tokenizer.encode(text, truncation=False)
    # Split tokens into chunks
    for i in range(0, len(tokens), max_length):
        yield tokenizer.decode(tokens[i:i + max_length], skip_special_tokens=True)

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Split text and classify each chunk
    for chunk in split_text_into_chunks(transcribed_text):
        classify_text(chunk)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Breakdown:

- split_text_into_chunks function: uses a generator to yield chunks of text.
- Explicit loop: the `for chunk in split_text_into_chunks(transcribed_text)` loop processes each chunk individually. This ensures each chunk of the text is classified separately.

Let me know if this works for you or if you need further adjustments!
System Info
Hello,
It is my understanding that the GPT-2 tokenizer, obtained with `AutoTokenizer.from_pretrained("gpt2")`, should be invertible. That is, given a sentence `text`, we should have that `text == tokenizer.decode(tokenizer(text, add_special_tokens=False)["input_ids"])`.
However, this is not the case, unlike the `tiktoken` reference implementation, which is correctly invertible. For example, given the sentence
`Is this restaurant family-friendly ? Yes No Unsure ? This is a follow-up sentence .`
encoding and then decoding removes the space before punctuation, yielding a different sentence. I have tried instantiating the tokenizer using `GPT2Tokenizer.from_pretrained("openai-community/gpt2")` and using the options `add_prefix_space=True` or `is_split_into_words=True`, but the problem persists. Hence, it looks like a bug to me, since BPE tokenizers should be invertible, as far as I understand.
Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Run this code, and you should see the bug. I am using `transformers==4.38.2`.
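The original snippet is not shown above; a minimal sketch that should reproduce the reported behavior (the tiktoken comparison is an assumption based on the description, not the author's exact code) might look like this:

```python
from transformers import AutoTokenizer
import tiktoken

text = "Is this restaurant family-friendly ? Yes No Unsure ? This is a follow-up sentence ."

# Hugging Face GPT-2 tokenizer: the round trip changes the text
hf_tok = AutoTokenizer.from_pretrained("gpt2")
hf_decoded = hf_tok.decode(hf_tok(text, add_special_tokens=False)["input_ids"])

# tiktoken reference GPT-2 encoding: the round trip preserves the text
tk_enc = tiktoken.get_encoding("gpt2")
tk_decoded = tk_enc.decode(tk_enc.encode(text))

print(hf_decoded == text)  # False: spaces before punctuation are removed
print(tk_decoded == text)  # True
```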
Expected behavior
The two decoded sentences should be equal, yet they are not.