
pyarrow.lib.ArrowInvalid: Column 1 named input_ids expected length 599 but got length 1500 #1817

Closed
LuCeHe opened this issue Feb 4, 2021 · 2 comments

Comments

LuCeHe commented Feb 4, 2021

I am trying to preprocess any dataset in this package with the GPT-2 tokenizer, so I need to structure the datasets as long sequences of text without padding. I've been following a couple of your tutorials, and here is the script that fails right at the end:

https://github.com/LuCeHe/GenericTools/blob/master/KerasTools/lm_preprocessing.py

The last iteration of the final dset.map raises the error I copied in the title. A second concern: if I leave batch_size at 1000 in that last .map, I'm afraid most of the text will be lost, so I'm considering setting both writer_batch_size and batch_size to 300K, but I'm not sure that's the best way to go.

Can you help me?
Thanks!

lhoestq (Member) commented Feb 5, 2021

Hi !
The error you have is due to the input_ids column not having the same number of examples as the other columns.
Indeed you're concatenating the input_ids at this line:

https://github.com/LuCeHe/GenericTools/blob/431835d8e13ec24dceb5ee4dc4ae58f0e873b091/KerasTools/lm_preprocessing.py#L134

However, the other columns are kept unchanged, so you end up with an input_ids column of 599 elements while the other columns, like attention_mask, still have 1500.

To fix that, you can instead concatenate them all using:

concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
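
For context, a minimal group_texts built around that line could look like the sketch below (the block_size of 1024 is an assumption based on GPT-2's context window, not a value from your script):

def group_texts(examples, block_size=1024):
    # Concatenate every column (input_ids, attention_mask, ...) the same way,
    # so all columns keep the same number of rows after the map call.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples["input_ids"])
    # Drop the final partial block so every example is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }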

Also you may need to drop the "text" column before applying group_texts since strings can't be concatenated with lists. You can drop it at the tokenization step:

dset = dset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)
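
Putting the two steps together, a rough end-to-end sketch could look like this (the wikitext dataset and the GPT2TokenizerFast setup are placeholders for illustration, not taken from your script, and group_texts is the function sketched above):

from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize_function(examples):
    # No padding here: group_texts re-chunks everything into fixed-length blocks.
    return tokenizer(examples["text"])

dset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
# Drop the raw "text" column so group_texts only sees list-valued columns.
dset = dset.map(tokenize_function, batched=True, remove_columns=["text"])
# Re-chunk the tokenized columns into blocks of block_size tokens.
dset = dset.map(group_texts, batched=True)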

LuCeHe (Author) commented Feb 7, 2021

You saved my life.
