Remove name ambiguity in dataset configuration #1111

shaahji · 2024-04-25T07:53:45Z

Remove name ambiguity in dataset configuration

Dataset configurations, i.e. data_configs, are represented as objects in the config and so have two name entries:

- the key of the object itself
- the DataConfig::name field

Call sites across the code use one or the other and expect both names to be identical, but nothing enforces this during validation. This change removes the duplication by making data_configs a list of objects, so the key is no longer required. The code now uses DataConfig::name for all look-ups.
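A minimal sketch of what a name-based look-up over the list form might look like. The `DataConfig` dataclass here is a simplified, hypothetical stand-in and `find_data_config` is an illustrative helper; neither is taken from this PR's actual code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataConfig:
    # Hypothetical stand-in for the real DataConfig; only `name` matters here.
    name: str
    params: Dict[str, Any] = field(default_factory=dict)


def find_data_config(data_configs: List[DataConfig], name: str) -> DataConfig:
    """Return the first config whose name matches, or raise KeyError."""
    for config in data_configs:
        if config.name == name:
            return config
    raise KeyError(f"No data config named {name!r}")


# With the list form, the name lives only inside each object.
data_configs = [DataConfig(name="dataset_1"), DataConfig(name="dataset_2")]
selected = find_data_config(data_configs, "dataset_2")
```

The look-up is a linear scan rather than a direct dictionary index, which is the trade-off of dropping the key.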

Note: Pending document update.

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass.
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
Is this PR including examples changes? If yes, please remember to update example documentation in a follow-up PR.

(Optional) Issue link


guotuofeng · 2024-04-25T08:27:27Z

docs/source/tutorials/configure_data.rst

-    }
+    "data_configs": [
+        { "name": "dataset_1", ...},
+        { "name": "dataset_2", ...},


@trajepl, should we keep the key or keep the name?

I think it is better to directly remove the name field from data_configs and keep the original "data_configs": {'xx': {}...}

guotuofeng · 2024-04-25T08:31:58Z

docs/source/features/huggingface_model_optimization.md

-                    "pad_to_max_len": false
-                }
+"data_configs": [{
+    "name": "oasst1_train",


Using the name instead of the key makes it possible for multiple datasets to have the same name, so we would need to dedup them in our code.

For me, it seems we could leverage the key, since unique keys are guaranteed by JSON loading itself.

> dedup them in our code

Yes, but this happens once during validation. Not necessarily a hot code path.

Having DataConfig::name makes it easier to identify the object when debugging/printing. Having the key doesn't provide any such advantage.
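The one-time check mentioned above could be sketched roughly as follows. This assumes each entry is a plain dict with a "name" field; `find_duplicate_names` and `validate_unique_names` are hypothetical helper names, not functions from the codebase.

```python
from collections import Counter
from typing import Any, Dict, List


def find_duplicate_names(data_configs: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted names that appear more than once."""
    counts = Counter(config["name"] for config in data_configs)
    return sorted(name for name, count in counts.items() if count > 1)


def validate_unique_names(data_configs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Raise ValueError on duplicate names; otherwise return the list unchanged."""
    duplicates = find_duplicate_names(data_configs)
    if duplicates:
        raise ValueError(f"Duplicate data config names: {duplicates}")
    return data_configs


# Runs once when the config is loaded, not on every look-up.
checked = validate_unique_names([{"name": "dataset_1"}, {"name": "dataset_2"}])
```

Because the check is a single pass over the list at load time, its cost is proportional to the number of data configs and is paid only once.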

@jambayk, @trajepl, what's your opinion?

Hitesh and I previously discussed the two options. I was originally for removing the name field and keeping the dict key. But the name sounds useful for debugging purposes too.

I think the decision depends on whether the debugging usefulness outweighs the benefits of keeping the keys (1. consistent with other fields like systems, evaluators, etc., 2. guaranteed unique keys, 3. direct index into data_configs).

Thanks @jambayk, I will let you and @trajepl make the decision since I am OK with either approach.

guotuofeng reviewed Apr 25, 2024


shaahji mentioned this pull request Apr 25, 2024

Remove HfConfig::dataset references in examples and tests #1113

Merged

6 tasks

trajepl approved these changes Apr 26, 2024


shaahji merged commit b01f806 into main Apr 26, 2024
35 checks passed

shaahji deleted the shaahji/dc1 branch April 26, 2024 01:11
