I am trying to preprocess any dataset in this package with the GPT-2 tokenizer, so I need to structure the datasets as long sequences of text without padding. I've been following a couple of your tutorials, and the script that fails right at the end is here: https://github.com/LuCeHe/GenericTools/blob/master/KerasTools/lm_preprocessing.py
In the last iteration of the last dset.map, it gives the error that I copied in the title. I also have a second concern: if I leave batch_size set to 1000 in the last .map, I'm afraid it's going to lose most of the text, so I'm considering setting both writer_batch_size and batch_size to 300 K, but I'm not sure that's the best way to go.
Can you help me?
Thanks!
Hi!
The error you have is due to the input_ids column not having the same number of examples as the other columns.
Indeed you're concatenating the input_ids at this line:
However, the other columns are kept unchanged, so you end up with an input_ids column with 599 elements while the other columns, like attention_mask, have 1500.
To fix that you can instead concatenate them all using
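The snippet that originally followed is not preserved here, so what follows is only a minimal sketch of the idea, not the exact code from the reply. It assumes a batched map function named group_texts (as in the language-modeling tutorials), a hypothetical block_size parameter, and a toy batch made up for illustration. The point is that every column is concatenated and re-chunked the same way, so all columns end up with the same number of rows.

```python
from itertools import chain
from typing import Dict, List


def group_texts(
    examples: Dict[str, List[List[int]]], block_size: int = 4
) -> Dict[str, List[List[int]]]:
    """Concatenate every column of a batch and split it into fixed-size blocks."""
    # Flatten each column (input_ids, attention_mask, ...) into one long list.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    # All columns have the same total length, so any key can be used here.
    total_length = len(concatenated[next(iter(concatenated))])
    # Drop the tail that does not fill a whole block (fewer than block_size tokens).
    total_length = (total_length // block_size) * block_size
    # Cut every column at the same offsets so the row counts stay aligned.
    return {
        k: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }


# Toy batch (illustrative values only): three tokenized examples, ten tokens in total.
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
grouped = group_texts(batch, block_size=4)
print(grouped["input_ids"])       # two blocks of four tokens each
print(grouped["attention_mask"])  # same number of rows as input_ids
```

In this sketch, the only tokens dropped are the leftover tail of each batch, which is always shorter than block_size. If your function works the same way, keeping batch_size at 1000 should lose only a small fraction of the text rather than most of it.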
Also you may need to drop the "text" column before applying group_texts since strings can't be concatenated with lists. You can drop it at the tokenization step:
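The original snippet for this step is also missing, so here is a hedged sketch of one way to do it. The helper tokenize_batch and the stand-in toy_tokenizer are hypothetical names introduced only for illustration; in the real script the tokenizer would be the GPT-2 tokenizer, and the column would be dropped by passing remove_columns=["text"] to Dataset.map, as shown in the trailing comment.

```python
from typing import Callable, Dict, List

TokenizerFn = Callable[[List[str]], Dict[str, List[List[int]]]]


def tokenize_batch(batch: Dict[str, List[str]], tokenizer: TokenizerFn) -> Dict[str, List[List[int]]]:
    """Tokenize the "text" column of a batch and return only the tokenizer outputs."""
    return tokenizer(batch["text"])


def toy_tokenizer(texts: List[str]) -> Dict[str, List[List[int]]]:
    """Stand-in for a real tokenizer: each whitespace-separated word becomes its length."""
    input_ids = [[len(word) for word in text.split()] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


out = tokenize_batch({"text": ["a bc", "def"]}, toy_tokenizer)
print(sorted(out.keys()))  # only list-valued columns, no "text"

# With the datasets library (assuming `dset` and `tokenizer` already exist),
# the string column itself is removed by the map call:
# dset = dset.map(
#     lambda b: tokenize_batch(b, tokenizer),
#     batched=True,
#     remove_columns=["text"],
# )
```

Once the string column is gone, group_texts only receives list-valued columns, which can all be concatenated the same way.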