ValueError: You are attempting to pad samples but the tokenizer you are using (GPT2Tokenizer) does not have one. #4122
Comments
I had the same error; I think the problem lies in "line_by_line", since my code runs fine if I remove this option.
@BramVanroy - sorry to just link you here. Did we decide to add a force-padding option here or not yet?
We had the discussion over here: #3388 (comment). @GCHQResearcher92457 mentions that they "tried to implement your suggestion". Perhaps it's better if you, @patrickvonplaten, could review the changes in the updated PR here: 5ff6eb7?
Any progress on this issue? I have the same problem if I use line_by_line with gpt2. Thanks
Same problem with v2.9.1
Personally, I think the fix is just that you can't use the line_by_line dataset with gpt2 (because it doesn't have a padding token). @patrickvonplaten @BramVanroy Should I just raise an error that tells the user to remove the `--line_by_line` flag?
@julien-c Perhaps you can have a look at PR 5ff6eb7: they suggest adding a force_padding_token option so that if a model does not have a padding token by default, one is added to the vocabulary manually. I have no preference: I like the implementation in the PR, but it might not be what you would want or expect. Raising an error is also fine with me.
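For illustration, a sketch of the two options being weighed here, raising an error versus forcing a padding token; the `force_padding_token` flag comes from the PR discussion, while the `<pad>` string and the surrounding names are assumptions, not the actual patch:

```python
# inside the example script, after tokenizer and model are loaded
if args.line_by_line and tokenizer.pad_token is None:
    if args.force_padding_token:
        # add a padding token to the vocabulary and grow the embeddings to match
        tokenizer.add_special_tokens({"pad_token": "<pad>"})
        model.resize_token_embeddings(len(tokenizer))
    else:
        raise ValueError(
            "This tokenizer has no padding token; remove --line_by_line "
            "or pass --force_padding_token."
        )
```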
@julien-c I am stuck with the same error. If not line by line, how else can I train the GPT2 model from scratch? Here is my GPT2 config and language model: `from transformers import GPT2LMHeadModel, GPT2Config`
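Only the import survived in this comment; with those two classes, a from-scratch model is typically built like this (a sketch, not the poster's exact config):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config()  # default GPT-2 small hyperparameters
model = GPT2LMHeadModel(config=config)  # randomly initialised, no pretrained weights
```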
The training logic:
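The training code is also missing from the comment. For context, a representative `Trainer` setup for a 2.x-era transformers, using the block-based `TextDataset` that avoids padding altogether; the file path and hyperparameters are borrowed from the command line in the report below, the rest is an assumption rather than the poster's actual code:

```python
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel(GPT2Config())  # from scratch, as above

# TextDataset concatenates the corpus into fixed-size blocks, so no padding
# (and therefore no pad token) is ever needed, unlike LineByLineTextDataset
train_dataset = TextDataset(
    tokenizer=tokenizer, file_path="temp_gpt2/gpt2.train", block_size=128
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/gpt2", num_train_epochs=20),
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```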
You need to use the default dataset (i.e. drop `--line_by_line`), which concatenates the text into fixed-size blocks, so no padding is needed.
@borisdayma Thanks for the quick reply! Where can I find more about how these models can be trained, with what kinds of datasets, tokenizers, and special tokens? Please help me so that I can create a simple and clear Colab notebook and share it here so that others can easily use it.
A good place to start would be the language_modeling section of the examples page in the docs.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi there,
You should manually set the PAD token equal to the EOS token. See #2630 as well.
Thanks @patrickvonplaten, it worked. All I had to do was add the following after tokenizer initialization:
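The snippet itself did not survive the copy; given the advice above, it was presumably the one-liner:

```python
tokenizer.pad_token = tokenizer.eos_token
```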
I'm not sure of the consequences of this. To be safe, you should probably also set the IDs. Something like this:
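The example is also missing here; judging from the follow-up, it assigned the IDs directly, along these lines:

```python
# this turned out to be wrong, see the correction below
tokenizer.pad_token_id = tokenizer.eos_token_id
```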
EDIT: this is wrong, see below
@BramVanroy - it's actually not possible to set the ids equal to each other; doing `tokenizer.pad_token_id = tokenizer.eos_token_id` fails, since the ID attributes are derived from the tokens and cannot be assigned directly.
@patrickvonplaten yes, I tried that too but could not set the ids.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I used this solution and the error went away. However, this introduced a new problem for me: the model couldn't generate <|endoftext|> during inference. The model didn't learn to generate eos_token because it was ignored when computing the loss, as it is the same as pad_token. I had to use some other token as pad_token. Other than this, I also had to add eos_token to each list in LineByLineDataset.examples. Note: I am using transformers 3.4.0.
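A minimal sketch of that workaround, assuming a dedicated `[PAD]` string (the exact token is an illustrative choice, not from the comment):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# use a pad token distinct from eos, so eos is no longer masked out of the loss
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

model = GPT2LMHeadModel.from_pretrained("gpt2")
# the vocabulary grew by one token, so the embedding matrix must be resized
model.resize_token_embeddings(len(tokenizer))
```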
🐛 Bug
Information
Model I am using (Bert, XLNet ...): GPT2
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
my own modified scripts: (give details below)
```
python examples/run_language_modeling.py \
    --train_data_file=temp_gpt2/gpt2.train \
    --output_dir=checkpoints/gpt2 \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --eval_data_file=temp_gpt2/test.txt \
    --line_by_line \
    --do_train \
    --do_eval \
    --evaluate_during_training \
    --per_gpu_train_batch_size=20 \
    --per_gpu_eval_batch_size=20 \
    --gradient_accumulation_steps=1 \
    --learning_rate=8e-5 \
    --weight_decay=0.075 \
    --adam_epsilon=1e-8 \
    --warmup_steps=500 \
    --max_grad_norm=5.0 \
    --num_train_epochs=20 \
    --logging_steps=500 \
    --save_steps=500
```
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
I get the following error: `ValueError: You are attempting to pad samples but the tokenizer you are using (GPT2Tokenizer) does not have one.`
Expected behavior
Environment info
`transformers` version: master