Refactor dataset creation to take DataParams as input instead of TrainingConfig #1568
Merged
nikg7
approved these changes
Mar 25, 2025
taenin
approved these changes
Mar 25, 2025
Description
Currently, datasets are created with build_dataset_mixture(), which takes a TrainingConfig as input. This is because our datasets code is currently only integrated with training. However, it'd be useful to allow datasets to be passed via config for inference and evaluation as well. For the latter, this is useful to support custom evaluation.
This is a fairly simple refactor: aside from the DataParams within TrainingConfig, the only other field we use is model.model_max_length (which is only used for packing). This PR also fixes a related example in docs that was broken, and simplifies affected tests.
Finally, this PR deletes the build_dataset_from_params() function, with its thin wrapping logic moved to build_dataset(). Note that build_dataset() isn't used in our codebase, but is referenced occasionally in our docs (likely as a convenience method).
Related issues
Towards OPE-1152
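For readers skimming the description, the shape of the refactor can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual implementation: the dataclasses are hypothetical stand-ins, and the field names (train, pack), the seq_length parameter, the returned dict, and the rule that packing requires a sequence length are all assumptions made for the example. Only DataParams, TrainingConfig, model.model_max_length, and build_dataset_mixture() come from the description.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for the real config classes. Field names are
# illustrative and only mirror what the PR description mentions.
@dataclass
class DataParams:
    train: list[str] = field(default_factory=list)  # assumed: dataset names
    pack: bool = False                               # assumed: packing toggle


@dataclass
class ModelParams:
    model_max_length: Optional[int] = None


@dataclass
class TrainingConfig:
    data: DataParams = field(default_factory=DataParams)
    model: ModelParams = field(default_factory=ModelParams)


def build_dataset_mixture(
    data_params: DataParams, seq_length: Optional[int] = None
) -> dict:
    """Sketch: depends only on DataParams plus an optional length for packing."""
    # Assumption for this example: packing cannot work without a length.
    if data_params.pack and seq_length is None:
        raise ValueError("Packing requires a sequence length.")
    return {
        "datasets": list(data_params.train),
        "seq_length": seq_length if data_params.pack else None,
    }


# A training caller passes the two pieces it previously supplied implicitly
# through the whole TrainingConfig.
config = TrainingConfig(
    data=DataParams(train=["dataset_a", "dataset_b"], pack=True),
    model=ModelParams(model_max_length=2048),
)
mixture = build_dataset_mixture(
    config.data, seq_length=config.model.model_max_length
)

# An evaluation or inference caller needs no TrainingConfig at all.
eval_mixture = build_dataset_mixture(DataParams(train=["eval_set"]))
```

The point of the sketch is the dependency direction: once the function accepts DataParams directly, inference and evaluation code can build datasets without constructing a training config.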
Before submitting