[BUG] GPT-2 tokenizer is NOT invertible #31884
Comments
Hey! Pretty sure this is due to `clean_up_tokenization_spaces`, which is enabled by default.
@ArthurZucker should we deprecate it, or set it to False by default (currently it defaults to True)? If we keep allowing it to be set, do we not deprecate it?
We should still deprecate: if it is None, default to True for now, but in the next release default to False.
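A rough Python sketch of that transition, for illustration only (this is not the actual transformers source; the helper name and warning wording are assumptions based on the FutureWarning quoted below):

```python
import warnings

def resolve_clean_up_tokenization_spaces(clean_up_tokenization_spaces=None):
    # Hypothetical helper showing the proposed deprecation path:
    # if the user did not set the flag, warn and keep the old default (True)
    # during the deprecation window; a later release flips the unset default to False.
    if clean_up_tokenization_spaces is None:
        warnings.warn(
            "`clean_up_tokenization_spaces` was not set. It currently defaults to `True`; "
            "a future release will default it to `False`. "
            "See https://github.com/huggingface/transformers/issues/31884",
            FutureWarning,
        )
        return True  # old default, kept for now
    return clean_up_tokenization_spaces
```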
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
```
root@ceb53da4d023:~/sae_expected_l0# python3 main.py
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Loaded pretrained model roneneldan/TinyStories-1M into HookedTransformer
Repo card metadata block was not found. Setting CardData to empty.
sigma:   0%|          | 0/5 [00:00<?, ?it/s]
wandb: Currently logged in as: joelb. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.6
wandb: Run data is saved locally in /root/sae_expected_l0/wandb/run-20240811_194131-exu7ftic
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run unique-snow-4
wandb: ⭐️ View project at https://wandb.ai/joelb/sae_expected_l0
wandb: 🚀 View run at https://wandb.ai/joelb/sae_expected_l0/runs/exu7ftic
sigma:   0%|          | 0/5 [2:00:45<?, ?it/s]
Traceback (most recent call last):
  File "/root/sae_expected_l0/main.py", line 113, in <module>
    train(model, sae, ds, learning_rate, l0_coefficient)
  File "/root/sae_expected_l0/main.py", line 61, in train
    _, cache = model.run_with_cache(torch.tensor(tokens), remove_batch_dim=True)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 631, in run_with_cache
    out, cache_dict = super().run_with_cache(
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/hook_points.py", line 566, in run_with_cache
    model_out = self(*model_args, **model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 522, in forward
    ) = self.input_to_embed(
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/HookedTransformer.py", line 330, in input_to_embed
    embed = self.hook_embed(self.embed(tokens))  # [batch, pos, d_model]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_lens/components/embed.py", line 34, in forward
    return self.W_E[tokens, :]
IndexError: tensors used as indices must be long, int, byte or bool tensors
wandb: | 0.011 MB of 0.011 MB uploaded
wandb: Run history:
wandb:             l0_loss ▃▅▃▂▃▂▂▂█▂▁▁▂▃▂▃▁▂▂▂▁▂▂▂▃▂▁▃▁▂▁▂▁▃▁▂▁▂▂▁
wandb:                loss ▃▅▃▂▃▂▂▂█▂▁▁▂▃▂▃▁▂▂▂▁▂▂▂▃▂▁▃▁▂▁▂▁▃▁▂▁▂▂▁
wandb: reconstruction_loss █▃▂▂▂▂▂▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:        total_tokens ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb:             l0_loss 29664.35352
wandb:                loss 29.66452
wandb: reconstruction_loss 0.00016
wandb:        total_tokens 7541274
wandb:
wandb: 🚀 View run unique-snow-4 at: https://wandb.ai/joelb/sae_expected_l0/runs/exu7ftic
wandb: ⭐️ View project at: https://wandb.ai/joelb/sae_expected_l0
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240811_194131-exu7ftic/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
```
I'm getting this warning with Flux in ComfyUI, and it points to this issue. What is the solution? Where do I set the `clean_up_tokenization_spaces` parameter to `False`? Full terminal output below:
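For anyone who only wants to silence the warning: the flag can be set when the tokenizer is loaded or when `decode` is called. A minimal sketch, assuming you construct the tokenizer yourself (in ComfyUI the tokenizer may be created inside the node code, so this may not be directly reachable; the `"gpt2"` checkpoint below is just a placeholder for whichever model your pipeline uses):

```python
from transformers import AutoTokenizer

# Setting the flag explicitly at load time avoids the FutureWarning,
# since the library no longer falls back to the unset default.
tokenizer = AutoTokenizer.from_pretrained("gpt2", clean_up_tokenization_spaces=False)

# It can also be passed per call when decoding.
ids = tokenizer("Is this restaurant family-friendly ?", add_special_tokens=False)["input_ids"]
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))
```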
add first version of classify_text.py

```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
```
use CUDA / GPU

```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
```

It looks like your script is defaulting to using the CPU despite the availability of a GPU. To use the GPU, you'll need to specify the `device` argument in the pipeline configuration. Here's how you can modify your script to use the GPU:

1. Install PyTorch with GPU support. Make sure you have PyTorch installed with GPU support. You can install it using pip:

```bash
pip install torch torchvision torchaudio
```

Ensure you have the appropriate version based on your CUDA version. You can find the correct version here.

2. Modify your script. Update your script to specify the device. You can use the `device` argument in the `pipeline` function to set the device to GPU. For example:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis", device=0)
result = classifier("I love using GPUs for deep learning!")
print(result)
```

In this example, `device=0` indicates that the first GPU should be used. If you want to use a different GPU or multiple GPUs, you can adjust the device index accordingly.

3. Verify GPU usage. Run your script again and verify that the GPU is being utilized. You can use tools like `nvidia-smi` to check GPU usage in real time.

If you encounter any issues or need further assistance, let me know!
can detect missing file — not found — or wrong file name

```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
The file cleaned_text.txt.1was not found.
```
```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (944055) must match the size of tensor b (512) at non-singleton dimension 1
```

ChatGPT said:

It looks like you're running into a couple of issues with the classify_text.py script:

1. Model warning: the script is defaulting to a model (distilbert-base-uncased-finetuned-sst-2-english) without specifying a model name and revision, which is not recommended in production.
2. Tokenization warning: there's a FutureWarning about clean_up_tokenization_spaces. This won't affect your current run, but it's good to be aware of changes in future versions.
3. Token indices sequence length error: your input text is too long for the model's maximum sequence length (512 tokens). This is why you're seeing the error message about the tensor size mismatch.

Solutions:

Specify a model: if you want to use a specific model, make sure to specify it explicitly in your script or pipeline to avoid relying on defaults. For example:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="your-chosen-model")
```

Handling long text: to address the sequence length issue, you need to truncate or split your text before passing it to the model. Here's how you can handle it:

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

def classify_long_text(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    label = 'POSITIVE' if predictions[0][1] > 0.5 else 'NEGATIVE'
    return label, predictions[0].tolist()

text = "Your long text goes here"
label, scores = classify_long_text(text)
print(f"Label: {label}, Scores: {scores}")
```

This code will truncate the input text to fit within the 512-token limit and classify it accordingly. Feel free to ask if you need further clarification or assistance!

You said: rewrite

```python
from transformers import pipeline

classifier = pipeline('text-classification', device=0)

test_text = "I love using GPUs for deep learning!"
result = classifier(test_text)
print("\n")
print(test_text)
print(result)

test_text = "I love using Hugging Face models!"
results = classifier(test_text)
print("\n")
print(test_text)
print(result)
print("\n")

transcribed_text = "cleaned_text.txt"
try:
    # Load transcribed text
    with open(transcribed_text, 'r') as file:
        transcribed_text = file.read()
    # Classify the text
    results = classifier(transcribed_text)
    # Print results
    print(results)
except FileNotFoundError:
    print("The file " + transcribed_text + "was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

ChatGPT said:

Certainly! Here's a revised version of your script with better error handling and a clearer structure:

```python
from transformers import pipeline

classifier = pipeline('text-classification', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Check if the text is too long for the model
    if len(transcribed_text.split()) > 512:
        print("The text is too long and will be truncated.")
        # Truncate the text to the first 512 tokens (or any appropriate length)
        transcribed_text = ' '.join(transcribed_text.split()[:512])

    # Classify the text
    results = classifier(transcribed_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Changes made:

- Function for classification: added a classify_text function to avoid code repetition and make it easier to manage.
- Iterate over test texts: used a loop to classify multiple test texts.
- File handling and text truncation: added a check to truncate the text if it exceeds the model's maximum token length.
- Error messages: improved error handling messages for better clarity.

This structure should make your script more robust and easier to understand. Let me know if you need further adjustments!
```
❯ python classify_text.py
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
The text is too long and will be truncated.
Token indices sequence length is longer than the specified maximum sequence length for this model (621 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (621) must match the size of tensor b (512) at non-singleton dimension 1
```

ChatGPT said:

It looks like the script is handling the long text, but the truncation logic needs some adjustment to prevent the error. The error suggests that the truncation logic might not be correctly applied, or there might be another issue causing the length mismatch. Here's an updated version of the script to ensure proper truncation and better handling of the text length:

```python
from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Tokenize and check if the text is too long for the model
    tokenizer = classifier.tokenizer
    inputs = tokenizer(transcribed_text, return_tensors='pt', truncation=False, padding=False)

    if len(inputs['input_ids'][0]) > 512:
        print("The text is too long and will be truncated.")
        # Truncate the text to the first 512 tokens
        transcribed_text = tokenizer.convert_tokens_to_string(
            tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][:512])
        )

    # Classify the text
    results = classifier(transcribed_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Key updates:

- Specify the model: explicitly set the model in the pipeline initialization to avoid the default warning.
- Truncate correctly: properly handle truncation by using the tokenizer to manage token length.
- Handle token length: ensure the text is truncated correctly to fit within the model's token limit.

This should handle text longer than 512 tokens more effectively. Let me know if you encounter any further issues!
```
❯ python classify_text.py
config.json: 100%|████████████████████| 629/629 [00:00<00:00, 1.27MB/s]
model.safetensors: 100%|████████████████████| 268M/268M [00:26<00:00, 10.3MB/s]
tokenizer_config.json: 100%|████████████████████| 48.0/48.0 [00:00<00:00, 182kB/s]
vocab.txt: 100%|████████████████████| 232k/232k [00:00<00:00, 3.39MB/s]
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
The text is too long and will be truncated.
An error occurred: The size of tensor a (514) must match the size of tensor b (512) at non-singleton dimension 1
```

ChatGPT said:

It seems like the truncation is still not being applied correctly, resulting in an error due to exceeding the token length limit. The input_ids handling may need to be corrected to ensure proper truncation and avoid mismatches. Here's an updated version of the script that handles long text by using explicit truncation in both the tokenization and classification steps:

```python
from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Tokenizer for handling long texts
    tokenizer = classifier.tokenizer
    inputs = tokenizer(transcribed_text, return_tensors='pt', truncation=True, padding=False, max_length=512)
    truncated_text = tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True)

    # Classify the truncated text
    results = classifier(truncated_text)

    # Print results
    print(results)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Key changes:

- Use the tokenizer with truncation: set truncation=True and max_length=512 in the tokenizer to handle the text length.
- Decode the truncated text: convert token IDs back to text to ensure correct representation for classification.

This script should ensure that the text is properly truncated to fit the model's maximum sequence length and avoid errors. If you continue to experience issues, it might be helpful to verify that classifier.tokenizer is the right tokenizer for your model.
```
❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
[{'label': 'NEGATIVE', 'score': 0.9947984218597412}]
```
```
❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
" xxxx xxxx xxxx xxxx xxxx xxxx xxxx ny xxxx xxxx xxxx date to macys xxxx inc xxxx xxxx xxxx xxxxxxxx xxxx xxxx ny xxxx attn xxxx xxxx executive office dear xxxx xxxx my name is xxxx xxxx and i am writing to address the ongoing issues with my macys credit account account number xxxx despite multiple attempts to resolve these matters over the phone i have not received the necessary assistance to correct the inaccuracies on my account xxxx disputed charge and lack of support i have called macys customer service many times to dispute a charge of 3100 and address other concerns each time the representatives were either unwilling or unable to assist me effectively macys should have records of these calls to verify my repeated efforts xxxx late fee i acknowledge the late fee of 3100 charged on xxxxyear and its subsequent credit on xxxxyear however this does not resolve the main issue of incorrect reporting xxxx credit reporting inaccuracy my account was reported as 30 days delinquent in xxxxyear i have consistently disputed this information stating that payments for xxxx and xxxxyear were made but not properly applied to my account additional impact due to this delinquency on my credit report my auto insurance increased to 33000 per month this has been a significant financial burden i request that macys conduct a comprehensive review including the phone call records and correct the inaccuracies on my account additionally i request the removal of the disputed by consumer note from my credit report once the issue is resolved additionally i do not want any macys advertisements sent to my apartment i hope to resolve this matter amicably and promptly thank you for your attention to this important issue sincerely xxxx xxxx " timely payments are always a priority for me and i am certain about this however i am unsure why this company is reporting me as late in certain months which should not be the case according to usc 1666b any billing error should be corrected or they will be liable to pay me 100000 for each account reporting inaccurately i never miss a payment i am certain of this i dont get why this company is marking me late for some months usc 1666b states that billing errors should be corrected and shown as paid on time i purchased a xxxx xxxx xxxx xxxx on xxxxxxxx using my td bank xxxx xxxx credit account phone was supposed to be delivered xxxxxxxx instead i get a delivery delayed notice i contact xx
[{'label': 'NEGATIVE', 'score': 0.9947984218597412}]
```
it's not working

```
❯ python classify_text.py
/Users/lloyd/.pyenv/versions/3.12.4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
  warnings.warn(
I love using GPUs for deep learning!
[{'label': 'POSITIVE', 'score': 0.9990824460983276}]
I love using Hugging Face models!
[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
Token indices sequence length is longer than the specified maximum sequence length for this model (944055 > 512). Running this sequence through the model will result in indexing errors
An error occurred: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 1
```

i think what you need to do is split the text file into smaller sized chunks, then iterate through them in multiple iterations, a little bit at a time, in a for loop or while loop or something similar

ChatGPT said:

Got it! To ensure the text is split into manageable chunks and processed in iterations, we can adjust the script to handle splitting and iterating over chunks more explicitly. Here's how you can do it:

1. Split the text into smaller chunks.
2. Iterate through these chunks and classify each chunk.

Here's the revised script:

```python
from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

def split_text_into_chunks(text, max_length=512):
    """Split text into chunks of a specified maximum length."""
    tokenizer = classifier.tokenizer
    tokens = tokenizer.encode(text, truncation=False)
    # Split tokens into chunks
    for i in range(0, len(tokens), max_length):
        yield tokenizer.decode(tokens[i:i + max_length], skip_special_tokens=True)

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Split text and classify each chunk
    for chunk in split_text_into_chunks(transcribed_text):
        classify_text(chunk)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Key points:

- split_text_into_chunks function: splits the input text into chunks of up to max_length tokens.
- Iterate over chunks: the script iterates over each chunk and classifies it.

This approach should handle long texts by processing them in smaller pieces and avoid the token length issues you were encountering.

You said: are you looping? i don't see you looping

ChatGPT said:

You're right; the loop was implied by the split_text_into_chunks generator function but wasn't explicitly shown. Let's make sure the iteration over chunks is clear. Here's a version with an explicit loop to handle each chunk:

```python
from transformers import pipeline

classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=0)

def classify_text(text):
    """Classify the given text and print results."""
    result = classifier(text)
    print("\n")
    print(text)
    print(result)
    print("\n")

def split_text_into_chunks(text, max_length=512):
    """Split text into chunks of a specified maximum length."""
    tokenizer = classifier.tokenizer
    tokens = tokenizer.encode(text, truncation=False)
    # Split tokens into chunks
    for i in range(0, len(tokens), max_length):
        yield tokenizer.decode(tokens[i:i + max_length], skip_special_tokens=True)

test_texts = [
    "I love using GPUs for deep learning!",
    "I love using Hugging Face models!"
]

for text in test_texts:
    classify_text(text)

transcribed_text_file = "cleaned_text.txt"

try:
    # Load transcribed text
    with open(transcribed_text_file, 'r') as file:
        transcribed_text = file.read()

    # Split text and classify each chunk
    for chunk in split_text_into_chunks(transcribed_text):
        classify_text(chunk)

except FileNotFoundError:
    print(f"The file '{transcribed_text_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
```

Breakdown:

- split_text_into_chunks function: uses a generator to yield chunks of text.
- Explicit loop: the `for chunk in split_text_into_chunks(transcribed_text)` loop processes each chunk individually. This ensures each chunk of the text is classified separately.

Let me know if this works for you or if you need further adjustments!
System Info
Hello,
It is my understanding that the GPT-2 tokenizer, obtained with `AutoTokenizer.from_pretrained("gpt2")`, should be invertible. That is, given a sentence `text`, we should have that `text == tokenizer.decode(tokenizer(text, add_special_tokens=False)["input_ids"])`.
However, this is not the case, unlike the `tiktoken` reference implementation, which is correctly invertible. For example, given the sentence
`Is this restaurant family-friendly ? Yes No Unsure ? This is a follow-up sentence .`
encoding and then decoding removes the space before punctuation, yielding a different sentence. I have tried instantiating the tokenizer using `GPT2Tokenizer.from_pretrained("openai-community/gpt2")` and using the options `add_prefix_space=True` or `is_split_into_words=True`, but the problem persists. Hence, it looks like a bug to me, since BPE tokenizers should be invertible, as far as I understand.
Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Run this code, and you should see the bug. I am using `transformers==4.38.2`.
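The original snippet is not shown above; a minimal sketch that should reproduce the reported behavior (the tiktoken comparison is an assumption based on the description, not the author's exact code) might look like this:

```python
from transformers import AutoTokenizer
import tiktoken

text = "Is this restaurant family-friendly ? Yes No Unsure ? This is a follow-up sentence ."

# Hugging Face GPT-2 tokenizer: the round trip changes the text
hf_tok = AutoTokenizer.from_pretrained("gpt2")
hf_decoded = hf_tok.decode(hf_tok(text, add_special_tokens=False)["input_ids"])

# tiktoken reference GPT-2 encoding: the round trip preserves the text
tk_enc = tiktoken.get_encoding("gpt2")
tk_decoded = tk_enc.decode(tk_enc.encode(text))

print(hf_decoded == text)  # False: spaces before punctuation are removed
print(tk_decoded == text)  # True
```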
Expected behavior
The two decoded sentences should be equal, yet they are not.