
# 🚩 Benchmark setting used in Blog and Landing Page

As stated in the blog:

> Very Important Details: The numbers in both Table 1 and 2 of the blog are for Step 3 of the training and based on actual measured training throughput on DeepSpeed-RLHF curated dataset and training recipe which trains for one epoch on a total of 135M tokens. We have in total 67.5M query tokens (131.9k queries with sequence length 256) and 67.5M generated tokens (131.9k answers with sequence length 256), and a maximum global batch size per step of 0.5M tokens (1024 query-answer pairs). We urge readers to pay attention to these specifications before making any cost and e2e time comparisons with DeepSpeed-RLHF. See our benchmark settings page for more details.

An apples-to-apples comparison is critical for the machine learning community, particularly for benchmarking. For example, it is not fair to compare DeepSpeed-Chat's end-to-end training time with Alpaca or Vicuna (both of which focus only on instruction finetuning), since they do not include the full RLHF training pipeline. Therefore, we elaborate on the details here.

We randomly select 40% of the training data from six open-sourced training datasets, i.e., "Dahoas/rm-static", "Dahoas/full-hh-rlhf", "Dahoas/synthetic-instruct-gptj-pairwise", "yitingxie/rlhf-reward-datasets", "openai/webgpt_comparisons", and "stanfordnlp/SHP". This yields a total of 264,292 training samples. We fix the query (prompt) sequence length at 256 tokens and generate fixed-length answers of 256 tokens, so the total number of training tokens per epoch is 264,292 × 512 = 135,317,504. During benchmark testing, we set the number of training epochs to 1.
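For reference, the per-epoch token count follows directly from these settings. The short sketch below is purely illustrative and only reproduces that arithmetic; the constant names are ours, not identifiers from the DeepSpeed-Chat code.

```python
# Token accounting for the benchmark settings above (illustrative sketch;
# constants are taken from the text, names are not from DeepSpeed-Chat code).

NUM_SAMPLES = 264_292        # 40% subset of the six open-sourced datasets
PROMPT_SEQ_LEN = 256         # fixed query (prompt) length
ANSWER_SEQ_LEN = 256         # fixed generated answer length
TOKENS_PER_SAMPLE = PROMPT_SEQ_LEN + ANSWER_SEQ_LEN   # 512 tokens per query-answer pair

tokens_per_epoch = NUM_SAMPLES * TOKENS_PER_SAMPLE
print(f"{tokens_per_epoch:,}")            # 135,317,504 tokens per epoch
assert tokens_per_epoch == 135_317_504    # matches the figure quoted above
```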

As mentioned in the RLHF training tutorial's discussion of training instability, we found that updating the actor model multiple times with the generated data is not stable. Therefore, we set per_device_generation_batch_size=per_device_training_batch_size and ppo_epochs=generation_batches=1 for all of our benchmark results. During testing, we also cap the global training batch at a maximum of 524,288 tokens (a batch size of 1024 query-answer pairs with a sequence length of 512). This is the largest batch size we found during our exploration that provides a stable RLHF training experience; users and practitioners may find better training hyperparameters that allow it to be increased further. Additionally, during testing, whenever the global training token batch size does not exceed our 524,288-token limit, we always benchmark with the largest training batch size that does not cause an out-of-memory error.
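To make that batch-size rule concrete, the sketch below expresses it as a small helper. This is a hedged illustration only: the function name, its arguments, and the per-device reasoning are hypothetical and not part of DeepSpeed-Chat; only the 524,288-token cap and the 512-token sequence length come from the settings above.

```python
# Illustrative helper for the batch-size rule described above. The names here
# (choose_global_batch_size, max_per_device_without_oom) are hypothetical; only
# the 524,288-token cap and 512-token sequence length come from the benchmark text.

MAX_GLOBAL_TOKENS = 524_288   # 1024 query-answer pairs * 512 tokens per pair
SEQ_LEN = 512                 # 256 prompt tokens + 256 answer tokens

def choose_global_batch_size(num_gpus: int, max_per_device_without_oom: int) -> int:
    """Pick the largest global batch (in query-answer pairs) that
    (a) stays within the global token cap and
    (b) does not exceed what fits in memory on each device."""
    cap_in_pairs = MAX_GLOBAL_TOKENS // SEQ_LEN               # 1024 pairs
    per_device = min(max_per_device_without_oom, cap_in_pairs // num_gpus)
    return per_device * num_gpus

# Example: on 8 GPUs where 64 pairs per GPU fit in memory, the rule gives
# 512 pairs (262,144 tokens), which stays under the 524,288-token cap.
print(choose_global_batch_size(num_gpus=8, max_per_device_without_oom=64))
```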

We hope this clearly explains our benchmark settings; please do not hesitate to contact us if you need more information. If you would like to reproduce our performance results or compare against DeepSpeed-RLHF, we encourage you to use the same or similar settings so that the performance results are more comparable.