Skip to content

Comments

Refactor dataset creation to take DataParams as input instead of TrainingConfig#1568

Merged
wizeng23 merged 6 commits intomainfrom
wizeng/o1122-refactor-dataset-build
Mar 25, 2025
Merged

Refactor dataset creation to take DataParams as input instead of TrainingConfig#1568
wizeng23 merged 6 commits intomainfrom
wizeng/o1122-refactor-dataset-build

Conversation

@wizeng23
Copy link
Contributor

@wizeng23 wizeng23 commented Mar 25, 2025

Description

Currently, datasets are created with build_dataset_mixture(), which takes a TrainingConfig as input. This is because our datasets code is currently only integrated with training. However, it'd be useful to allow datasets to be passed via config for inference and evaluation as well. For the latter, this is useful to support custom evaluation.

This is a fairly simple refactor, as aside from the DataParams within TrainingConfig, the only other field we use is model.model_max_length (which is only used for packing). This PR also fixes a related example in docs that was broken, and simplifies affected tests.

Finally, this PR deletes the build_dataset_from_params() function, with its thin wrapping logic moved to build_dataset(). Note that build_dataset() isn't used in our code codebase, but is referenced occasionally in our docs (likely as a convenience method).

Related issues

Towards OPE-1152

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

@wizeng23 wizeng23 merged commit 248d7f1 into main Mar 25, 2025
3 checks passed
@wizeng23 wizeng23 deleted the wizeng/o1122-refactor-dataset-build branch March 25, 2025 18:47
penfever pushed a commit that referenced this pull request Aug 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants