
pyarrow.lib.ArrowInvalid: Column 1 named input_ids expected length 599 but got length 1500 #1817

Closed
LuCeHe opened this issue Feb 4, 2021 · 2 comments

Comments

LuCeHe commented Feb 4, 2021

I am trying to preprocess any dataset in this package with the GPT-2 tokenizer, so I need to structure the datasets as long sequences of text without padding. I've been following a couple of your tutorials, and here is the script that fails right at the end:

https://github.com/LuCeHe/GenericTools/blob/master/KerasTools/lm_preprocessing.py

The last iteration of the final dset.map raises the error I copied in the title. A second concern: if I leave batch_size at 1000 in that last .map, I'm afraid most of the text will be lost, so I'm considering setting both writer_batch_size and batch_size to 300K, but I'm not sure that's the best way to go.

Can you help me?
Thanks!

lhoestq (Member) commented Feb 5, 2021

Hi !
The error you have is due to the input_ids column not having the same number of examples as the other columns.
Indeed you're concatenating the input_ids at this line:

https://github.com/LuCeHe/GenericTools/blob/431835d8e13ec24dceb5ee4dc4ae58f0e873b091/KerasTools/lm_preprocessing.py#L134

However, the other columns are kept unchanged, so you end up with an input_ids column of 599 elements while the other columns, like attention_mask, still have 1500.

To fix that, you can instead concatenate them all using:

concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
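
For context, a minimal group_texts built around that line could look like the sketch below (the block_size of 1024 is an assumption based on GPT-2's context window, not a value from your script):

def group_texts(examples, block_size=1024):
    # Concatenate every column (input_ids, attention_mask, ...) the same way,
    # so all columns keep the same number of rows after the map call.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples["input_ids"])
    # Drop the final partial block so every example is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }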

Also you may need to drop the "text" column before applying group_texts since strings can't be concatenated with lists. You can drop it at the tokenization step:

dset = dset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)
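
Putting the two steps together, a rough end-to-end sketch could look like this (the wikitext dataset and the GPT2TokenizerFast setup are placeholders for illustration, not taken from your script, and group_texts is the function sketched above):

from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize_function(examples):
    # No padding here: group_texts re-chunks everything into fixed-length blocks.
    return tokenizer(examples["text"])

dset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
# Drop the raw "text" column so group_texts only sees list-valued columns.
dset = dset.map(tokenize_function, batched=True, remove_columns=["text"])
# Re-chunk the tokenized columns into blocks of block_size tokens.
dset = dset.map(group_texts, batched=True)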

LuCeHe (Author) commented Feb 7, 2021

You saved my life.
