
[docs] trainer part 1 #44185

Merged

stevhliu merged 4 commits into huggingface:main from stevhliu:trainer
Feb 24, 2026

Conversation

@stevhliu (Member)

part 1 of refactoring the Trainer docs

  • restructure the toctree a bit to accommodate new sections and docs
  • slim down trainer.md into a clearer entry point (the ## Next steps section will expand as we continue, for better navigation). everything else here is either moved to its relevant section or removed because it's duplicate content
  • update the training.md tutorial to show training a language model instead of BERT
  • add a new trainer_customize.md guide showing how to subclass get_train_dataloader and compute_loss using real-world examples from TRL (we can add more examples here later)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stevhliu stevhliu requested a review from SunMarc February 20, 2026 19:38

The examples below show how to subclass some of these methods.

## get_train_dataloader
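As a minimal sketch of what such an override might look like (a hypothetical subclass, not the TRL example from the guide; it assumes the standard `Trainer` attributes `train_dataset`, `data_collator`, and `args` are set by `__init__`):

```python
from torch.utils.data import DataLoader
from transformers import Trainer


class NoShuffleTrainer(Trainer):
    """Hypothetical subclass that builds the training DataLoader directly."""

    def get_train_dataloader(self):
        # Skip the default sampler logic, e.g. for a dataset that is
        # already pre-shuffled, and return a plain DataLoader instead.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            collate_fn=self.data_collator,
            shuffle=False,
        )
```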
Member Author

maybe you can review these sections @qgallouedec since they're TRL examples? 🙏

@stevhliu stevhliu mentioned this pull request Feb 23, 2026
@@ -0,0 +1,30 @@
# Callbacks
Member Author

you can ignore this section as it's covered in #44239

@SunMarc (Member) left a comment

Thanks a lot! Left a couple of comments, but overall really good!

Comment on lines +297 to +308
## GaLore

[Gradient Low-Rank Projection (GaLore)](https://hf.co/papers/2403.03507) significantly reduces memory usage when training large language models (LLMs). Unlike low-rank adaptation methods such as [LoRA](https://hf.co/papers/2106.09685), GaLore supports *full-parameter* learning, which tends to produce better model performance.

Install the [GaLore](https://github.com/jiaweizzhao/GaLore) and [TRL](https://hf.co/docs/trl/index) libraries.

```bash
pip install galore-torch trl
```

Pick a GaLore optimizer (`"galore_adamw"`, `"galore_adafactor"`, `"galore_adamw_8bit"`) and pass it to the `optim` parameter in [`trl.SFTConfig`]. Use the `optim_target_modules` parameter to specify which modules to adapt (it accepts a list of strings, regex patterns, or full module paths).
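For example, a minimal configuration sketch (the values and target modules below are illustrative, matching module names containing `attn` or `mlp`):

```python
from trl import SFTConfig

# Illustrative sketch: pass a GaLore optimizer and target modules to SFTConfig.
config = SFTConfig(
    output_dir="my-model",
    optim="galore_adamw",
    # Apply GaLore to attention and MLP layers (matched by name).
    optim_target_modules=["attn", "mlp"],
)
```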

Member

Not sure if I would put that here. The doc here is quite nice as it is super light. So adding this GaLore section will encourage people in the future to put everything there, as we did in trainer.md.

Member Author

good point!

i'll move GaLore → optimizers.md since that's what it actually is
move Liger → performance > speed optimizations > kernels (will do in a future PR once I have those sections)
move NEFTune → trainer cookbook recipes (will do in a future PR once I have those sections)

this way we can keep this doc light and focused on general GPU training techniques?


Subclass [`Trainer`] methods to change training behavior without rewriting the entire loop. Subclassing modifies the *training loop*, for example the forward pass or loss computation.

Before subclassing, consider whether you need to change *what* [`Trainer`] computes or *when* and *whether* it acts. For timing and conditional logic, use a [Callback](./trainer_callbacks) instead. Callbacks control *when* things happen (logging, evaluation, early stopping), while subclassing changes *what* happens (loss computation, data loading, optimization).
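As a sketch of the subclassing side, here is a hypothetical subclass that overrides `compute_loss` (illustrative only, not the TRL implementation; it assumes a causal language model whose batches include `labels`):

```python
import torch
from transformers import Trainer


class ShiftedLossTrainer(Trainer):
    """Hypothetical subclass that computes a causal LM loss by hand."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Shift so each position predicts the next token.
        logits = outputs.logits[..., :-1, :].contiguous()
        targets = labels[..., 1:].contiguous()
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1),
            ignore_index=-100,  # skip padding positions
        )
        return (loss, outputs) if return_outputs else loss
```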
Member

maybe add some examples of trainer in trl or axolotl

Member Author

added link to examples in trl/axolotl at the bottom

@qgallouedec (Member) left a comment

nice, I shared some suggestions :)

Comment on lines +25 to +35
| method | description |
|---|---|
| [`~Trainer.get_train_dataloader`] | create a training DataLoader |
| [`~Trainer.get_eval_dataloader`] | create an evaluation DataLoader |
| [`~Trainer.get_test_dataloader`] | create a test DataLoader |
| [`~Trainer.log`] | log information about the training process |
| [`~Trainer.create_optimizer_and_scheduler`] | create an optimizer and learning rate scheduler (can also be separately customized with [`~Trainer.create_optimizer`] and [`~Trainer.create_scheduler`] if they weren't passed in `__init__`) |
| [`~Trainer.compute_loss`] | compute the loss of a batch of training inputs |
| [`~Trainer.training_step`] | perform the training step |
| [`~Trainer.prediction_step`] | perform the prediction and test step |
| [`~Trainer.evaluate`] | evaluate the model and return the evaluation metric |
Member

not convinced by the added value of this table. it seems redundant with the [[autodoc]] Trainer

Member Author

removed in favor of [[autodoc]] Trainer!


metric = evaluate.load("accuracy")
- Set `bf16=True` for fast mixed precision training if your hardware supports it (Ampere+ GPUs). Otherwise, fall back to `fp16=True` on older hardware.
- Enable `gradient_accumulation_steps` and `gradient_checkpointing` to simulate training on larger batches and reduce memory usage.
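Put together, these tips might look like the following sketch (the values are illustrative; tune them for your hardware and model):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="my-model",
    bf16=True,                      # Ampere+ GPUs; use fp16=True on older hardware
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size of 32 per device
    gradient_checkpointing=True,    # recompute activations to reduce memory usage
)
```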
Member

mixing these two seems a bit weird. Maybe you meant per_device_train_batch_size instead of gradient_checkpointing?

Member Author

split into two separate points to avoid conflating the two!

@stevhliu stevhliu merged commit f2ba019 into huggingface:main Feb 24, 2026
15 checks passed
@stevhliu stevhliu deleted the trainer branch February 24, 2026 21:18