
IndexError: Invalid key: 0 is out of bounds for size 0 #6775

Open
kk2491 opened this issue Apr 3, 2024 · 7 comments
kk2491 commented Apr 3, 2024

Describe the bug

I am trying to fine-tune the llama2-7b model on GCP. The notebook I am using for this can be found here.

When I use the dataset given in the example, the training completes successfully (the example dataset can be found here). However, when I use my own dataset, which is in the same format as the example dataset, I get the error below (my dataset can be found here).

[screenshot of the error]

I see the files are being read correctly from the logs:

[screenshot of the logs]

Steps to reproduce the bug

  1. Clone the vertex-ai-samples repository.
  2. Run the llama2-7b PEFT fine-tuning notebook.
  3. Change the dataset to kk2491/finetune_dataset_002.

Expected behavior

The training should complete successfully, and the model should be deployed to an endpoint.

Environment info

Python version : Python 3.10.12
Dataset : https://huggingface.co/datasets/kk2491/finetune_dataset_002

@Minami-su

Same problem.

@mariosasko (Collaborator)

Hi! You should be able to fix this by passing remove_unused_columns=False to the transformers TrainingArguments as explained in huggingface/peft#1299.

(I'm not familiar with Vertex AI, but I'd assume remove_unused_columns can be passed as a flag to the docker container)
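For background on why this flag matters: when `remove_unused_columns=True` (the default), the `Trainer` drops every dataset column whose name is not an argument of the model's `forward()`; if preprocessing never produced `input_ids` etc., every column is dropped and the dataset ends up with size 0, which is exactly the `IndexError` above. A simplified stdlib sketch of that behavior (not the actual `transformers` code):

```python
# Simplified sketch of why remove_unused_columns=True can empty a dataset:
# the Trainer drops every column whose name is not a parameter of the
# model's forward() signature, so if the dataset's columns don't match,
# nothing is left to index and dataset[0] raises IndexError.

def drop_unused_columns(dataset, forward_params, remove_unused_columns=True):
    """dataset: list of dict rows; forward_params: names forward() accepts."""
    if not remove_unused_columns:
        return dataset
    kept = [
        {k: v for k, v in row.items() if k in forward_params}
        for row in dataset
    ]
    # Rows whose every column was dropped become empty dicts -> filtered out.
    return [row for row in kept if row]

dataset = [{"text": "### Human: hi ### Assistant: hello"}]

# forward() expects tokenized inputs, not the raw "text" column:
cleaned = drop_unused_columns(dataset, forward_params={"input_ids", "labels"})
print(len(cleaned))  # 0 -> cleaned[0] would raise IndexError

kept = drop_unused_columns(dataset, {"input_ids", "labels"},
                           remove_unused_columns=False)
print(len(kept))  # 1 -> the raw column survives for later tokenization
```

With `remove_unused_columns=False`, the raw `text` column is kept so a downstream collator or tokenization step can still see it.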

cyberyu commented Apr 6, 2024

I had the same problem, but after spending a whole day trying different combinations of my own dataset and the example dataset, I found the reason: the example data is a multi-turn conversation between human and assistant, so ### Human and ### Assistant each appear at least twice. If your own custom data has only single-turn conversations, it may end up with the same error. What you can do is repeat your single-turn conversation twice in your training data (keeping the key 'text' the same), and it may work. My guess is that the data processing counts multi-turn conversations only (single-turn samples are discarded, so you end up with no training data), but since I am using Google Vertex AI, I don't have direct access to the underlying code, so this is just a guess.
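This hypothesis is easy to check locally before launching a job: count how many `### Human:` / `### Assistant:` turns each JSONL sample contains. A minimal stdlib sketch (the JSONL path, the `text` field, and the `### Human:` marker format are assumptions based on the example dataset):

```python
import json

def count_turns(text):
    """Return (human_turns, assistant_turns) for one sample's text field."""
    return text.count("### Human:"), text.count("### Assistant:")

def flag_single_turn(path):
    """Yield (line_number, humans, assistants) for samples with < 2 turns."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            sample = json.loads(line)
            humans, assistants = count_turns(sample["text"])
            if humans < 2 or assistants < 2:
                yield i, humans, assistants

# Example: a single-turn sample that would be flagged.
sample = '{"text": "### Human: hi ### Assistant: hello"}'
print(count_turns(json.loads(sample)["text"]))  # (1, 1) -> single turn
```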

kk2491 (Author) commented Apr 7, 2024

> Hi! You should be able to fix this by passing remove_unused_columns=False to the transformers TrainingArguments as explained in huggingface/peft#1299. […]

@mariosasko Thanks for the response and suggestion.
When I set remove_unused_columns to False, I end up getting a different error (I will post the error soon).
Either Vertex AI does not support remove_unused_columns or my dataset is completely wrong.

Thank you,
KK

kk2491 (Author) commented Apr 7, 2024

> I had the same problem, but I spent a whole day trying different combination with my own dataset […] (quoting @cyberyu's comment above)

@cyberyu Thanks for your suggestions.
I tried the approach you suggested: I copied the same conversation into each JSONL element, so every item has two HUMAN and ASSISTANT turns.
However, in my case the issue persists. I am going to give it a few more tries and post the results here.
You can find my dataset here

Thank you,
KK

cyberyu commented Apr 7, 2024

> @cyberyu Thanks for your suggestions. I have tried the approach you suggested […] (quoting @kk2491's reply above)

I think another reason is that your training samples are too short. I saw a relevant report (https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/16) stating that the processing code might have a bug that discards sequences shorter than max_seq_length, which is 512. I am not sure whether the Vertex AI backend code has fixed that bug. So I added some filler content to your data to extend a single turn beyond 512 tokens, and repeated it twice. You can copy the following line five times as your training-data JSONL file of five samples (no eval or test split needed; to speed things up, set the evaluation step to 5 and the training step to 10), and it will pass.

{"text":"### Human: You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You will handle customers queries and provide effective help message. Please provide response to 'Can Interplai software optimize routes for minimizing package handling and transfer times in distribution centers'? ### Assistant: Yes, Interplai software can optimize routes for distribution centers by streamlining package handling processes, minimizing transfer times between loading docks and storage areas, and optimizing warehouse layouts for efficient order fulfillment. ### Human: You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You will handle customers queries and provide effective help message. Please provide response to 'Can Interplai software optimize routes for minimizing package handling and transfer times in distribution centers'? ### Assistant: Yes, Interplai software can optimize routes for distribution centers by streamlining package handling processes, minimizing transfer times between loading docks and storage areas, and optimizing warehouse layouts for efficient order fulfillment."}
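Under the same assumptions, the length constraint can also be checked locally before uploading. The sketch below only approximates token count by whitespace splitting (the real pipeline presumably uses the Llama tokenizer, which typically produces more tokens than words, so 512 here is a rough lower bound); the path and `text` field name are assumptions based on the example dataset:

```python
import json

MAX_SEQ_LENGTH = 512  # threshold reported in the linked forum thread

def approx_tokens(text):
    """Rough proxy for token count: whitespace-separated words.
    A real tokenizer usually yields MORE tokens than this count."""
    return len(text.split())

def short_samples(path, limit=MAX_SEQ_LENGTH):
    """Return line numbers of JSONL samples likely under the limit."""
    flagged = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if approx_tokens(json.loads(line)["text"]) < limit:
                flagged.append(i)
    return flagged

# A short sample like the one that originally failed:
print(approx_tokens("### Human: hi ### Assistant: hello"))  # 6 words
```

Samples flagged by `short_samples` would be candidates for padding with extra content, as in the workaround above.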

kk2491 (Author) commented Apr 8, 2024

@cyberyu Thank you so much, you saved my day (plus so many days).
I tried the example you provided above, and the training completed successfully in Vertex AI (through the GUI).
I never thought there would be constraints on the length of the samples or on the number of turns.
I will update my complete dataset and post an update here once the training is completed.

Thank you,
KK
