[ORPO] Enable batched tokenization & multiprocessing to process large datasets #1624
Conversation
rejected_attention_mask = [f[r:] for f, r in zip(prompt_and_rejected_attention_mask, response_token_ids_start_idx)]

return dict(
    prompt_input_ids=prompt_input_ids,
Note there is one small difference with the new implementation: we do not distinguish between the prompt input IDs of the chosen / rejected pairs, favouring instead to have a single set of IDs per prompt.
In general, I think the processing logic could be simplified significantly, but for now I've focused on ensuring parity with the current version.
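The single-set-of-prompt-IDs approach discussed here boils down to tokenizing each full `prompt + response` sequence once and slicing it at the index where the response tokens begin, as in the list comprehension above. A minimal sketch with toy token IDs (the helper name `split_response_ids` is mine, not from the PR):

```python
def split_response_ids(prompt_and_response_ids, response_start_idxs):
    """Slice each full sequence at its response start index (illustrative)."""
    return [ids[start:] for ids, start in zip(prompt_and_response_ids, response_start_idxs)]

# Two sequences sharing the same 4-token prompt, differing in the answer token.
full_ids = [[101, 7592, 2088, 102, 2307], [101, 7592, 2088, 102, 2919]]
starts = [4, 4]  # response tokens begin after the shared prompt
print(split_response_ids(full_ids, starts))  # → [[2307], [2919]]
```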
train_dataset = train_dataset.map(
    _process_tokens,
    fn_kwargs=fn_kwargs,
    num_proc=args.dataset_num_proc,
So should `num_proc` be in the `_tokenize` map?
I decided against it since I think in general it is best to just do batched processing without multiprocessing - see e.g. this small experiment we did with the course: https://huggingface.co/learn/nlp-course/chapter5/3#the-map-methods-superpowers
""" | ||
batch = {} | ||
|
||
if not kwargs["is_encoder_decoder"]: |
I guess we can remove this too.
I realised I forgot to add the enc-dec support, so I've now included the extra preprocessing step. On the other hand, this is now giving a strange error in training, and I'm slightly tempted to drop the enc-dec support altogether for this trainer. This would technically be a breaking change, so I'm checking with @lvwerra and @younesbelkada first to see what they think.
prompt_attention_mask = prompt_tokenized["attention_mask"]

# Process prompt & chosen
prompt_and_chosen = [prompt + chosen for prompt, chosen in zip(batch["prompt"], batch["chosen"])]
Would it be `prompt + chosen` or just `chosen`? E.g., in ultrafeedback_binarized, `prompt` is `how can i develop a habit of drawing daily`, and `chosen` is:

[ { "content": "how can i develop a habit of drawing daily", "role": "user" }, { "content": "Developing a daily habit of drawing can be challenging but with consistent practice and a few tips, it can become an enjoyable and rewarding part of your daily routine. Here are some strategies to help you develop the habit of drawing daily:\n\n1. Set a specific time: Allocate a specific time of the day to draw. It could be in the morning, afternoon, or evening. Make drawing a part of your daily routine.\n2. Set a specific duration: Determine the amount of time you want to spend on drawing each day. It can be as little as 10 minutes or as long as an hour. Be consistent with the duration to help build the habit.\n3. Start small and simple: Don't try to create a masterpiece every day, start with simple and easy-to-do sketches. Focus on improving your skills gradually.\n4. Use a variety of tools and mediums: Experiment with different tools like pencils, pens, markers, and different mediums like paper, canvas, or digital apps to keep your drawing practice interesting and engaging.\n5. Take breaks and rest: Taking breaks and resting after some time of drawing can help you avoid burnout and stay motivated.\n6. Challenge yourself: Set challenges like drawing objects from memory or a specific subject to improve your skills and keep your drawing practice interesting.\n7. Track your progress: Keep a record of your daily drawing practice and track your progress. This can be a source of motivation and help you see how far you've come.\n\nRemember, developing a habit takes time and patience. Stay consistent with your drawing practice, be flexible and open to trying new things, and with time, you'll develop a habit of daily drawing that brings you joy and satisfaction.", "role": "assistant" } ]
This PR applies a similar refactoring logic as #1470 to massively speed up the preprocessing of the `ORPOTrainer`. Similar to that PR, the `ORPOTrainer` currently applies all dataset preprocessing at the per-example level via a method called `tokenize_row()`. This isn't ideal because it doesn't take advantage of the batched processing features of `datasets`.

This PR massively speeds up the dataset preprocessing of `ORPOTrainer` by doing the following:

- Tokenization is now batched in a `_tokenize` method and is done separately to the subsequent token ID processing
- Token ID processing is handled in a `_process_tokens()` method that supports multiprocessing
- A `dataset_num_proc` attribute is added to the `ORPOConfig` to allow users to configure the multiprocessing steps

I've also added some unit tests for the new logic and verified that the `orpo.py` example script produces near identical results with this refactor.