
[ORPO] Enable batched tokenization & multiprocessing to process large datasets #1624

Closed
wants to merge 11 commits

Conversation

lewtun
Member

@lewtun lewtun commented May 6, 2024

This PR applies a similar refactoring logic as #1470 to massively speed up the preprocessing of the ORPOTrainer. Similar to that PR, the ORPOTrainer currently applies all dataset preprocessing at the per-example-level via a method called tokenize_row(). This isn't ideal because:

  • Rust tokenizers support batching, which is typically 3-5x faster
  • the method can't be hashed and thus can't be parallelised via multiprocessing in datasets

This PR massively speeds up the dataset preprocessing of ORPOTrainer by doing the following:

  • Rewrites the tokenization step to use batching (best practice for Rust tokenizers). This is now called the _tokenize method and is done separately from the subsequent token ID processing
  • Introduces a new _process_tokens() method that supports multiprocessing
  • Migrates all data processing methods to live outside the trainer
  • Adds a new dataset_num_proc attribute to the ORPOConfig to allow users to configure the multiprocessing steps

I've also added some unit tests for the new logic and verified that the orpo.py example script produces near identical results with this refactor.

[Screenshots, 2024-05-06: preprocessing timing comparisons before and after the refactor]

rejected_attention_mask = [f[r:] for f, r in zip(prompt_and_rejected_attention_mask, response_token_ids_start_idx)]

return dict(
    prompt_input_ids=prompt_input_ids,
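The list comprehension above can be illustrated with toy values (these masks and indices are made up for illustration, not taken from the PR):

```python
# Attention masks for two concatenated prompt+rejected sequences.
prompt_and_rejected_attention_mask = [[1, 1, 1, 1, 1], [1, 1, 1, 1]]

# Index where the response starts in each sequence, i.e. the prompt length.
response_token_ids_start_idx = [2, 3]

# Slicing from that index onward keeps only the response portion of each mask.
rejected_attention_mask = [
    f[r:]
    for f, r in zip(prompt_and_rejected_attention_mask, response_token_ids_start_idx)
]
```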
Member Author

Note there is one small difference with the new implementation: we do not distinguish between the prompt input IDs of the chosen / rejected pairs, opting instead for a single set of IDs per prompt.

In general, I think the processing logic could be simplified significantly, but for now I've focused on ensuring parity with the current version.
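The "single set of prompt IDs" idea can be sketched as follows. This is a hypothetical helper for illustration only (build_pair is not a function in the PR):

```python
def build_pair(prompt_ids, chosen_ids, rejected_ids):
    # Tokenize the prompt once and reuse it for both completions, rather
    # than keeping separate chosen-prompt and rejected-prompt input IDs.
    return {
        "prompt_input_ids": prompt_ids,  # single shared copy
        "chosen_input_ids": prompt_ids + chosen_ids,
        "rejected_input_ids": prompt_ids + rejected_ids,
    }


pair = build_pair(prompt_ids=[1, 2], chosen_ids=[3], rejected_ids=[4, 5])
```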

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

train_dataset = train_dataset.map(
    _process_tokens,
    fn_kwargs=fn_kwargs,
    num_proc=args.dataset_num_proc,
Collaborator

@kashif kashif May 6, 2024

so should num_proc be in the _tokenize map?

Member Author

I decided against it since I think in general it is best to just do batched processing without multiprocessing - see e.g. this small experiment we did with the course: https://huggingface.co/learn/nlp-course/chapter5/3#the-map-methods-superpowers
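The trade-off can be illustrated with a toy cost model (the constants are made up; this simulates the fixed per-call overhead of a fast tokenizer rather than measuring it):

```python
# A fast (Rust) tokenizer pays a fixed overhead per call, so one batched call
# on 1000 texts is far cheaper than 1000 single-text calls - often enough
# that multiprocessing on top adds little. Units are arbitrary.
PER_CALL_OVERHEAD = 1.0   # fixed cost per tokenizer call
PER_TEXT_COST = 0.01      # marginal cost per text

def cost(num_calls, texts_per_call):
    return num_calls * (PER_CALL_OVERHEAD + texts_per_call * PER_TEXT_COST)

unbatched = cost(num_calls=1000, texts_per_call=1)  # map() one example at a time
batched = cost(num_calls=1, texts_per_call=1000)    # map(batched=True)
```

Under this model the batched call is ~90x cheaper, which matches the direction of the course experiment linked above (batching dominates; num_proc mainly helps slow Python tokenizers).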

"""
batch = {}

if not kwargs["is_encoder_decoder"]:
Collaborator

i guess we can remove this too

Member Author

I realised I forgot to add the enc-dec support, so I've now included the extra preprocessing step. However, this is now giving a strange error in training, and I'm slightly tempted to drop enc-dec support for this trainer altogether. Since that would technically be a breaking change, I'm checking with @lvwerra and @younesbelkada first to see what they think.

prompt_attention_mask = prompt_tokenized["attention_mask"]

# Process prompt & chosen
prompt_and_chosen = [prompt + chosen for prompt, chosen in zip(batch["prompt"], batch["chosen"])]
Contributor

Would it be prompt + chosen or just chosen? E.g., in ultrafeedback_binarized, prompt is how can i develop a habit of drawing daily, and chosen is [ { "content": "how can i develop a habit of drawing daily", "role": "user" }, { "content": "Developing a daily habit of drawing can be challenging but with consistent practice and a few tips, it can become an enjoyable and rewarding part of your daily routine. Here are some strategies to help you develop the habit of drawing daily:\n\n1. Set a specific time: Allocate a specific time of the day to draw. It could be in the morning, afternoon, or evening. Make drawing a part of your daily routine.\n2. Set a specific duration: Determine the amount of time you want to spend on drawing each day. It can be as little as 10 minutes or as long as an hour. Be consistent with the duration to help build the habit.\n3. Start small and simple: Don't try to create a masterpiece every day, start with simple and easy-to-do sketches. Focus on improving your skills gradually.\n4. Use a variety of tools and mediums: Experiment with different tools like pencils, pens, markers, and different mediums like paper, canvas, or digital apps to keep your drawing practice interesting and engaging.\n5. Take breaks and rest: Taking breaks and resting after some time of drawing can help you avoid burnout and stay motivated.\n6. Challenge yourself: Set challenges like drawing objects from memory or a specific subject to improve your skills and keep your drawing practice interesting.\n7. Track your progress: Keep a record of your daily drawing practice and track your progress. This can be a source of motivation and help you see how far you've come.\n\nRemember, developing a habit takes time and patience. Stay consistent with your drawing practice, be flexible and open to trying new things, and with time, you'll develop a habit of daily drawing that brings you joy and satisfaction.", "role": "assistant" } ]
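The concern can be checked with a toy version of such a row (the row below is a simplified stand-in for ultrafeedback_binarized, and the join is a stand-in for chat templating, not the trainer's actual code):

```python
# If "chosen" already begins with the user turn, prepending the raw prompt
# string duplicates the prompt in the concatenated text.
row = {
    "prompt": "how can i develop a habit of drawing daily",
    "chosen": [
        {"role": "user", "content": "how can i develop a habit of drawing daily"},
        {"role": "assistant", "content": "Set a specific time each day..."},
    ],
}

# Flatten the chosen messages to text (stand-in for applying a chat template).
chosen_text = " ".join(m["content"] for m in row["chosen"])
naive_concat = row["prompt"] + " " + chosen_text

# The prompt now appears twice: once prepended, once inside the user turn.
duplicated = naive_concat.count(row["prompt"]) == 2
```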



github-actions bot commented Jun 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this Jun 17, 2024