
[docs] trainer part 1 #44185

Merged

stevhliu merged 4 commits into huggingface:main from stevhliu:trainer
Feb 24, 2026

Conversation

@stevhliu (Member)

part 1 of refactoring the Trainer docs

  • restructure the toctree a bit to accommodate new sections and docs
  • slim down trainer.md into a clearer entry point (the ## Next steps section will expand as we continue, for better navigation). everything else here is either moved to its relevant section or removed because it's duplicate content
  • update the training.md tutorial to show training a language model instead of BERT
  • add a new trainer_customize.md guide showing how to subclass get_train_dataloader and compute_loss using real-world examples from TRL (we can add more examples here later)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stevhliu stevhliu requested a review from SunMarc February 20, 2026 19:38

The examples below show how to subclass some of these methods.

## get_train_dataloader
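As a minimal sketch of what such an override might look like (a hypothetical subclass, not the TRL example from the guide; it assumes the standard `Trainer` attributes `train_dataset`, `data_collator`, and `args` are set by `__init__`):

```python
from torch.utils.data import DataLoader
from transformers import Trainer


class NoShuffleTrainer(Trainer):
    """Hypothetical subclass that builds the training DataLoader directly."""

    def get_train_dataloader(self):
        # Skip the default sampler logic, e.g. for a dataset that is
        # already pre-shuffled, and return a plain DataLoader instead.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            collate_fn=self.data_collator,
            shuffle=False,
        )
```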
Member Author

maybe you can review these sections @qgallouedec since they're TRL examples? 🙏

@stevhliu stevhliu mentioned this pull request Feb 23, 2026
@@ -0,0 +1,30 @@
# Callbacks
Member Author

you can ignore this section as it's covered in #44239

@SunMarc (Member) left a comment

Thanks a lot! Left a couple of comments, but overall really good!

Comment on lines +297 to +308
## GaLore

[Gradient Low-Rank Projection (GaLore)](https://hf.co/papers/2403.03507) significantly reduces memory usage when training large language models (LLMs). Unlike low-rank adaptation methods such as [LoRA](https://hf.co/papers/2106.09685), GaLore supports *full-parameter* learning, which tends to produce better model performance.

Install the [GaLore](https://github.com/jiaweizzhao/GaLore) and [TRL](https://hf.co/docs/trl/index) libraries.

```bash
pip install galore-torch trl
```

Pick a GaLore optimizer (`"galore_adamw"`, `"galore_adafactor"`, `"galore_adamw_8bit"`) and pass it to the `optim` parameter in [`trl.SFTConfig`]. Use the `optim_target_modules` parameter to specify which modules to adapt (it accepts a list of strings, regex patterns, or full module paths).
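For example, a minimal configuration sketch (the values and target modules below are illustrative, matching module names containing `attn` or `mlp`):

```python
from trl import SFTConfig

# Illustrative sketch: pass a GaLore optimizer and target modules to SFTConfig.
config = SFTConfig(
    output_dir="my-model",
    optim="galore_adamw",
    # Apply GaLore to attention and MLP layers (matched by name).
    optim_target_modules=["attn", "mlp"],
)
```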

Member

Not sure if I would put that here. The doc here is quite nice as it is super light. So adding this GaLore section will encourage people in the future to put everything there, as we did in trainer.md.

Member Author

good point!

i'll move GaLore → optimizers.md since that's what it actually is
move Liger → performance > speed optimizations > kernels (will do in a future PR once I have those sections)
move NEFTune → trainer cookbook recipes (will do in a future PR once I have those sections)

this way we can keep this doc light and focused on general GPU training techniques?


Subclass [`Trainer`] methods to change training behavior without rewriting the entire loop. Subclassing modifies the *training loop*, for example the forward pass or loss computation.

Before subclassing, consider whether you need to change *what* [`Trainer`] computes or *when* and *whether* it acts. For timing and conditional logic, use a [Callback](./trainer_callbacks) instead. Callbacks control *when* things happen (logging, evaluation, early stopping), while subclassing changes *what* happens (loss computation, data loading, optimization).
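As a sketch of the subclassing side, here is a hypothetical subclass that overrides `compute_loss` (illustrative only, not the TRL implementation; it assumes a causal language model whose batches include `labels`):

```python
import torch
from transformers import Trainer


class ShiftedLossTrainer(Trainer):
    """Hypothetical subclass that computes a causal LM loss by hand."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Shift so each position predicts the next token.
        logits = outputs.logits[..., :-1, :].contiguous()
        targets = labels[..., 1:].contiguous()
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1),
            ignore_index=-100,  # skip padding positions
        )
        return (loss, outputs) if return_outputs else loss
```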
Member

maybe add some examples of trainer in trl or axolotl

Member Author

added link to examples in trl/axolotl at the bottom

@qgallouedec (Member) left a comment

nice, I shared some suggestions :)

Comment on lines +25 to +35
| method | description |
|---|---|
| [`~Trainer.get_train_dataloader`] | create a training DataLoader |
| [`~Trainer.get_eval_dataloader`] | create an evaluation DataLoader |
| [`~Trainer.get_test_dataloader`] | create a test DataLoader |
| [`~Trainer.log`] | log information about the training process |
| [`~Trainer.create_optimizer_and_scheduler`] | create an optimizer and learning rate scheduler (can also be separately customized with [`~Trainer.create_optimizer`] and [`~Trainer.create_scheduler`] if they weren't passed in `__init__`) |
| [`~Trainer.compute_loss`] | compute the loss of a batch of training inputs |
| [`~Trainer.training_step`] | perform the training step |
| [`~Trainer.prediction_step`] | perform the prediction and test step |
| [`~Trainer.evaluate`] | evaluate the model and return the evaluation metric |
Member

not convinced by the added value of this table. it seems redundant with the [[autodoc]] Trainer

Member Author

removed in favor of [[autodoc]] Trainer!


metric = evaluate.load("accuracy")
- Set `bf16=True` for fast mixed precision training if your hardware supports it (Ampere+ GPUs). Otherwise, fall back to `fp16=True` on older hardware.
- Enable `gradient_accumulation_steps` and `gradient_checkpointing` to simulate training on larger batches and reduce memory usage.
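Put together, these tips might look like the following sketch (the values are illustrative; tune them for your hardware and model):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="my-model",
    bf16=True,                      # Ampere+ GPUs; use fp16=True on older hardware
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size of 32 per device
    gradient_checkpointing=True,    # recompute activations to reduce memory usage
)
```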
Member

mixing these two seems a bit weird. Maybe you meant per_device_train_batch_size instead of gradient_checkpointing?

Member Author

split into two separate points to avoid conflating the two!

@stevhliu stevhliu merged commit f2ba019 into huggingface:main Feb 24, 2026
15 checks passed
@stevhliu stevhliu deleted the trainer branch February 24, 2026 21:18