Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More tests on dataset state #1094

Merged
merged 19 commits into from May 10, 2023
Merged

More tests on dataset state #1094

merged 19 commits into from May 10, 2023

Conversation

severo
Copy link
Collaborator

@severo severo commented Apr 27, 2023

  • changes in the Graph specification:
    • provides_dataset_config_names field indicates if the step is providing the list of config names for a dataset
    • provides_config_split_names field indicates if the step is providing the list of split names for a config
    • requires field has been renamed to triggered_by: indeed, the relation between steps is that, when step B is triggered_by step A, if A has been updated, a new job will be created for step B.
  • ProcessingGraph:
    • the concept of ProcessingStep is simplified: it's a data class, and provides the name, job runner version, input type (as well as the job type and cache kind, which are equal to the processing step name)
    • most of the methods are provided by ProcessingGraph, in particular those related to the edges, or to filtering nodes.
    • the data structure is a networkx.digraph: the nodes are the step names, and we use the attributes to store some properties
  • remove the /admin/jobs_duration endpoint, because the finished jobs are now deleted quickly, so: this statistic does not mean much now.
  • add tests for state.py. In particular: use small processing graphs with the special cases (children, grand-children, multiple parents, fan-in, fan-out, parallel steps, etc). The "real" processing graph now only has one test. This should reduce code to change when a step is added, modified or deleted.

@HuggingFaceDocBuilder
Copy link
Collaborator

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@severo severo force-pushed the more-tests-on-dataset-state branch 2 times, most recently from 27ffb7b to ebaff0e Compare May 4, 2023 08:11
@severo severo mentioned this pull request May 5, 2023
@severo severo force-pushed the more-tests-on-dataset-state branch from 4ec8b38 to f736899 Compare May 9, 2023 10:38
@severo severo marked this pull request as ready for review May 9, 2023 14:17
@severo severo requested a review from a team May 9, 2023 14:17
Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the huge contribution. In principle, all proposed changes seem sensible.

Copy link
Contributor

@AndreaFrancis AndreaFrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
There are some commented lines, will those be removed?
I dont quite understand the advantage of having processing graph in all job runners to make one call to get the children but maybe in the future, it will be done by the orchestrator.

libs/libcommon/tests/state/__init__.py Outdated Show resolved Hide resolved
libs/libcommon/tests/state/test_objects.py Outdated Show resolved Hide resolved
libs/libcommon/tests/state/test_plan.py Outdated Show resolved Hide resolved
@severo severo force-pushed the more-tests-on-dataset-state branch from 69c2f14 to c8bbd04 Compare May 10, 2023 07:44
severo and others added 2 commits May 10, 2023 07:47
Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
@severo
Copy link
Collaborator Author

severo commented May 10, 2023

Thanks for the huge contribution. In principle, all proposed changes seem sensible.

Yes, the PR is too big, I should have broken it into some smaller parts... Thanks for reviewing it all (same @AndreaFrancis)!

@severo
Copy link
Collaborator Author

severo commented May 10, 2023

There are some commented lines, will those be removed?

Yes! I removed them. And part of them was due to me having forgotten to push the last commit (finishing the tests in tests/state). I'll let you review the last changes, before merging.

I dont quite understand the advantage of having processing graph in all job runners to make one call to get the children but maybe in the future, it will be done by the orchestrator.

Yes, in the future (the PR is nearly ready #1157), we will not get the children, but instead do a full DatasetState analysis + backfill at the end of each job. It's possibly too much, but it will be better than the current incoherencies we have in the cache.

@severo severo merged commit 1163909 into main May 10, 2023
17 checks passed
@severo severo deleted the more-tests-on-dataset-state branch May 10, 2023 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants