Refactor dataset creation to take `DataParams` as input instead of `TrainingConfig` by wizeng23 · Pull Request #1568 · oumi-ai/oumi

wizeng23 · 2025-03-25T01:54:21Z

Description

Currently, datasets are created with `build_dataset_mixture()`, which takes a `TrainingConfig` as input. This is because our datasets code is currently only integrated with training. However, it'd be useful to allow datasets to be passed via config for inference and evaluation as well. For evaluation in particular, this would support custom evaluation.

This is a fairly simple refactor: aside from the `DataParams` within `TrainingConfig`, the only other field we use is `model.model_max_length` (which is only used for packing). This PR also fixes a related docs example that was broken, and simplifies the affected tests.
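To illustrate the shape of this change, here is a minimal, self-contained sketch. It is not the actual oumi code: `DatasetEntry`, `DataParamsSketch`, `build_mixture_sketch`, and the `seq_length` argument are hypothetical stand-ins, chosen only to show how a builder can depend on the data section plus an optional max length rather than on a full training config.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetEntry:
    """Hypothetical stand-in for one dataset entry in a mixture."""

    name: str
    pack: bool = False


@dataclass
class DataParamsSketch:
    """Hypothetical stand-in for DataParams: one dataset list per split."""

    train: list[DatasetEntry] = field(default_factory=list)
    validation: list[DatasetEntry] = field(default_factory=list)


def build_mixture_sketch(
    data_params: DataParamsSketch,
    split: str,
    seq_length: Optional[int] = None,
) -> list[dict]:
    """Describe the datasets for one split without needing a training config.

    seq_length plays the role of model.model_max_length (an assumption of
    this sketch). It is only needed for packing, so it is optional here.
    """
    entries = getattr(data_params, split)
    built = []
    for entry in entries:
        if entry.pack and seq_length is None:
            raise ValueError(f"Packing '{entry.name}' requires seq_length.")
        built.append(
            {"name": entry.name, "packed_to": seq_length if entry.pack else None}
        )
    return built


# An evaluation or inference caller only needs the data section.
params = DataParamsSketch(validation=[DatasetEntry("my_eval_set")])
result = build_mixture_sketch(params, split="validation")
```

With this shape, a training caller would pass its data section and its model max length explicitly, while evaluation or inference callers can omit the length unless they pack.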

Finally, this PR deletes the `build_dataset_from_params()` function, with its thin wrapping logic moved to `build_dataset()`. Note that `build_dataset()` isn't used in our codebase, but is referenced occasionally in our docs (likely as a convenience method).

Related issues

Towards OPE-1152

Before submitting

This PR only changes documentation. (You can ignore the following checks in that case)
Did you read the contributor guideline Pull Request guidelines?
Did you link the issue(s) related to this PR in the section above?
Did you add / update tests where needed?

wizeng23 added 3 commits March 24, 2025 18:11

Refactor build_dataset_mixture to take data_params as input

36e3259

Simplify tests

54c39e2

Fix doc

08e8c5d

wizeng23 requested review from kaisopos, nikg7, oelachqar and taenin March 25, 2025 01:54

wizeng23 added 3 commits March 24, 2025 19:27

Delete build_dataset_from_params

5e3d896

merge main

72e0a05

Fix test

97d935a

nikg7 approved these changes Mar 25, 2025

taenin approved these changes Mar 25, 2025

wizeng23 merged commit 248d7f1 into main Mar 25, 2025
3 checks passed

wizeng23 deleted the wizeng/o1122-refactor-dataset-build branch March 25, 2025 18:47

penfever pushed a commit that referenced this pull request Aug 27, 2025

Refactor dataset creation to take DataParams as input instead of `T…

27337a6

…rainingConfig` (#1568)

wizeng23 merged 6 commits into main from wizeng/o1122-refactor-dataset-build
