
IndexError: Invalid key: 0 is out of bounds for size 0 #6775

Open
kk2491 opened this issue Apr 3, 2024 · 7 comments
kk2491 commented Apr 3, 2024

Describe the bug

I am trying to fine-tune the llama2-7b model on GCP. The notebook I am using for this can be found here.

When I use the dataset given in the example, the training completes successfully (the example dataset can be found here). However, when I use my own dataset, which is in the same format as the example dataset, I get the error below (my dataset can be found here).

[screenshot of the error]

I see the files are being read correctly from the logs:

[screenshot of the logs]

Steps to reproduce the bug

  1. Clone the vertex-ai-samples repository.
  2. Run the llama2-7b PEFT fine-tuning notebook.
  3. Change the dataset to kk2491/finetune_dataset_002.

Expected behavior

The training should complete successfully, and the model should be deployed to an endpoint.

Environment info

Python version : Python 3.10.12
Dataset : https://huggingface.co/datasets/kk2491/finetune_dataset_002

@Minami-su

Same problem.

@mariosasko (Collaborator)

Hi! You should be able to fix this by passing remove_unused_columns=False to the transformers TrainingArguments as explained in huggingface/peft#1299.

(I'm not familiar with Vertex AI, but I'd assume remove_unused_columns can be passed as a flag to the docker container)
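For background on why this flag matters: when `remove_unused_columns=True` (the default), the `Trainer` drops every dataset column whose name is not an argument of the model's `forward()`; if preprocessing never produced `input_ids` etc., every column is dropped and the dataset ends up with size 0, which is exactly the `IndexError` above. A simplified stdlib sketch of that behavior (not the actual `transformers` code):

```python
# Simplified sketch of why remove_unused_columns=True can empty a dataset:
# the Trainer drops every column whose name is not a parameter of the
# model's forward() signature, so if the dataset's columns don't match,
# nothing is left to index and dataset[0] raises IndexError.

def drop_unused_columns(dataset, forward_params, remove_unused_columns=True):
    """dataset: list of dict rows; forward_params: names forward() accepts."""
    if not remove_unused_columns:
        return dataset
    kept = [
        {k: v for k, v in row.items() if k in forward_params}
        for row in dataset
    ]
    # Rows whose every column was dropped become empty dicts -> filtered out.
    return [row for row in kept if row]

dataset = [{"text": "### Human: hi ### Assistant: hello"}]

# forward() expects tokenized inputs, not the raw "text" column:
cleaned = drop_unused_columns(dataset, forward_params={"input_ids", "labels"})
print(len(cleaned))  # 0 -> cleaned[0] would raise IndexError

kept = drop_unused_columns(dataset, {"input_ids", "labels"},
                           remove_unused_columns=False)
print(len(kept))  # 1 -> the raw column survives for later tokenization
```

With `remove_unused_columns=False`, the raw `text` column is kept so a downstream collator or tokenization step can still see it.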

cyberyu commented Apr 6, 2024

I had the same problem, but after spending a whole day trying different combinations of my own dataset and the example dataset, I found the reason: the example data is a multi-turn conversation between human and assistant, so ### Human and ### Assistant each appear at least twice. If your own custom data has only single-turn conversations, it may end up with the same error. What you can do is repeat your single-turn conversation twice in your training data (keeping the key 'text' the same), and it may work. My guess is that the data processing counts multi-turn conversations only (single-turn samples are discarded, so you end up with no training data), but since I am using Google Vertex AI, I don't have direct access to the underlying code, so this is just a guess.
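This hypothesis is easy to check locally before launching a job: count how many `### Human:` / `### Assistant:` turns each JSONL sample contains. A minimal stdlib sketch (the JSONL path, the `text` field, and the `### Human:` marker format are assumptions based on the example dataset):

```python
import json

def count_turns(text):
    """Return (human_turns, assistant_turns) for one sample's text field."""
    return text.count("### Human:"), text.count("### Assistant:")

def flag_single_turn(path):
    """Yield (line_number, humans, assistants) for samples with < 2 turns."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            sample = json.loads(line)
            humans, assistants = count_turns(sample["text"])
            if humans < 2 or assistants < 2:
                yield i, humans, assistants

# Example: a single-turn sample that would be flagged.
sample = '{"text": "### Human: hi ### Assistant: hello"}'
print(count_turns(json.loads(sample)["text"]))  # (1, 1) -> single turn
```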

kk2491 (Author) commented Apr 7, 2024

> Hi! You should be able to fix this by passing remove_unused_columns=False to the transformers TrainingArguments as explained in huggingface/peft#1299. […]

@mariosasko Thanks for the response and suggestion.
When I set remove_unused_columns to False, I end up getting a different error (I will post the error soon).
Either Vertex AI does not support remove_unused_columns or my dataset is completely wrong.

Thank you,
KK

kk2491 (Author) commented Apr 7, 2024

> I had the same problem, but I spent a whole day trying different combination with my own dataset […] (quoting @cyberyu's comment above)

@cyberyu Thanks for your suggestions.
I tried the approach you suggested: I copied the same conversation into each JSONL element, so every item has two HUMAN and ASSISTANT turns.
However, in my case the issue persists. I am going to give it a few more tries and post the results here.
You can find my dataset here

Thank you,
KK

cyberyu commented Apr 7, 2024

> @cyberyu Thanks for your suggestions. I have tried the approach you suggested […] (quoting @kk2491's reply above)

I think another reason is that your training samples are too short. I saw a relevant report (https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/16) stating that the processing code might have a bug that discards sequences shorter than max_seq_length, which is 512. I am not sure whether the Vertex AI backend code has fixed that bug. So I added some filler content to your data to extend a single turn beyond 512 tokens, and repeated it twice. You can copy the following line five times as your training-data JSONL file of five samples (no eval or test split needed; to speed things up, set the evaluation step to 5 and the training step to 10), and it will pass.

{"text":"### Human: You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You will handle customers queries and provide effective help message. Please provide response to 'Can Interplai software optimize routes for minimizing package handling and transfer times in distribution centers'? ### Assistant: Yes, Interplai software can optimize routes for distribution centers by streamlining package handling processes, minimizing transfer times between loading docks and storage areas, and optimizing warehouse layouts for efficient order fulfillment. ### Human: You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You will handle customers queries and provide effective help message. Please provide response to 'Can Interplai software optimize routes for minimizing package handling and transfer times in distribution centers'? ### Assistant: Yes, Interplai software can optimize routes for distribution centers by streamlining package handling processes, minimizing transfer times between loading docks and storage areas, and optimizing warehouse layouts for efficient order fulfillment."}
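Under the same assumptions, the length constraint can also be checked locally before uploading. The sketch below only approximates token count by whitespace splitting (the real pipeline presumably uses the Llama tokenizer, which typically produces more tokens than words, so 512 here is a rough lower bound); the path and `text` field name are assumptions based on the example dataset:

```python
import json

MAX_SEQ_LENGTH = 512  # threshold reported in the linked forum thread

def approx_tokens(text):
    """Rough proxy for token count: whitespace-separated words.
    A real tokenizer usually yields MORE tokens than this count."""
    return len(text.split())

def short_samples(path, limit=MAX_SEQ_LENGTH):
    """Return line numbers of JSONL samples likely under the limit."""
    flagged = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if approx_tokens(json.loads(line)["text"]) < limit:
                flagged.append(i)
    return flagged

# A short sample like the one that originally failed:
print(approx_tokens("### Human: hi ### Assistant: hello"))  # 6 words
```

Samples flagged by `short_samples` would be candidates for padding with extra content, as in the workaround above.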

kk2491 (Author) commented Apr 8, 2024

@cyberyu Thank you so much, you saved my day (plus so many days).
I tried the example you provided above, and the training completed successfully in Vertex AI (through the GUI).
I never thought there would be constraints on the length of the samples or on the number of turns.
I will update my complete dataset and post an update here once the training is completed.

Thank you,
KK
