[ORPO] Enable batched tokenization & multiprocessing to process large datasets #1624
Conversation
rejected_attention_mask = [f[r:] for f, r in zip(prompt_and_rejected_attention_mask, response_token_ids_start_idx)]

return dict(
    prompt_input_ids=prompt_input_ids,
Note there is one small difference with the new implementation: we do not distinguish between the prompt input IDs of the chosen / rejected pairs, favouring instead to have a single set of IDs per prompt.
In general, I think the processing logic could be simplified significantly, but for now I've focused on ensuring parity with the current version.
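The single-set-of-prompt-IDs approach discussed here boils down to tokenizing each full `prompt + response` sequence once and slicing it at the index where the response tokens begin, as in the list comprehension above. A minimal sketch with toy token IDs (the helper name `split_response_ids` is mine, not from the PR):

```python
def split_response_ids(prompt_and_response_ids, response_start_idxs):
    """Slice each full sequence at its response start index (illustrative)."""
    return [ids[start:] for ids, start in zip(prompt_and_response_ids, response_start_idxs)]

# Two sequences sharing the same 4-token prompt, differing in the answer token.
full_ids = [[101, 7592, 2088, 102, 2307], [101, 7592, 2088, 102, 2919]]
starts = [4, 4]  # response tokens begin after the shared prompt
print(split_response_ids(full_ids, starts))  # → [[2307], [2919]]
```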
train_dataset = train_dataset.map(
    _process_tokens,
    fn_kwargs=fn_kwargs,
    num_proc=args.dataset_num_proc,
So should `num_proc` be in the `_tokenize` map?
I decided against it since I think in general it is best to just do batched processing without multiprocessing - see e.g. this small experiment we did with the course: https://huggingface.co/learn/nlp-course/chapter5/3#the-map-methods-superpowers
""" | ||
batch = {} | ||
|
||
if not kwargs["is_encoder_decoder"]: |
I guess we can remove this too.
I realised I forgot to add the enc-dec support, so I've now included the extra preprocessing step. On the other hand, this is now giving a strange error in training, and I'm slightly tempted to drop the enc-dec support altogether for this trainer. This would technically be a breaking change, so I'm checking with @lvwerra and @younesbelkada first to see what they think.
prompt_attention_mask = prompt_tokenized["attention_mask"]

# Process prompt & chosen
prompt_and_chosen = [prompt + chosen for prompt, chosen in zip(batch["prompt"], batch["chosen"])]
Would it be `prompt + chosen` or just `chosen`? E.g., in ultrafeedback_binarized, `prompt` is `how can i develop a habit of drawing daily`, and `chosen` is:

[ { "content": "how can i develop a habit of drawing daily", "role": "user" }, { "content": "Developing a daily habit of drawing can be challenging but with consistent practice and a few tips, it can become an enjoyable and rewarding part of your daily routine. Here are some strategies to help you develop the habit of drawing daily:\n\n1. Set a specific time: Allocate a specific time of the day to draw. It could be in the morning, afternoon, or evening. Make drawing a part of your daily routine.\n2. Set a specific duration: Determine the amount of time you want to spend on drawing each day. It can be as little as 10 minutes or as long as an hour. Be consistent with the duration to help build the habit.\n3. Start small and simple: Don't try to create a masterpiece every day, start with simple and easy-to-do sketches. Focus on improving your skills gradually.\n4. Use a variety of tools and mediums: Experiment with different tools like pencils, pens, markers, and different mediums like paper, canvas, or digital apps to keep your drawing practice interesting and engaging.\n5. Take breaks and rest: Taking breaks and resting after some time of drawing can help you avoid burnout and stay motivated.\n6. Challenge yourself: Set challenges like drawing objects from memory or a specific subject to improve your skills and keep your drawing practice interesting.\n7. Track your progress: Keep a record of your daily drawing practice and track your progress. This can be a source of motivation and help you see how far you've come.\n\nRemember, developing a habit takes time and patience. Stay consistent with your drawing practice, be flexible and open to trying new things, and with time, you'll develop a habit of daily drawing that brings you joy and satisfaction.", "role": "assistant" } ]
This PR applies a similar refactoring logic as #1470 to massively speed up the preprocessing of the `ORPOTrainer`. Similar to that PR, the `ORPOTrainer` currently applies all dataset preprocessing at the per-example level via a method called `tokenize_row()`. This isn't ideal because it doesn't take advantage of the batched processing features of `datasets`.

This PR massively speeds up the dataset preprocessing of `ORPOTrainer` by doing the following:

- Tokenization is now batched in a `_tokenize` method and is done separately to the subsequent token ID processing
- Token ID processing is handled in a `_process_tokens()` method that supports multiprocessing
- A `dataset_num_proc` attribute is added to the `ORPOConfig` to allow users to configure the multiprocessing steps

I've also added some unit tests for the new logic and verified that the `orpo.py` example script produces near identical results with this refactor.