More tests on dataset state #1094

severo · 2023-04-27T14:36:59Z

changes in the Graph specification:
- provides_dataset_config_names field indicates if the step is providing the list of config names for a dataset
- provides_config_split_names field indicates if the step is providing the list of split names for a config
- requires field has been renamed to triggered_by: indeed, the relation between steps is that, when step B is triggered_by step A, if A has been updated, a new job will be created for step B.
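To make the `triggered_by` relation concrete, here is a minimal sketch of what a specification could look like and how the steps to re-run could be derived from it. The step names, the `StepSpec` type, the `input_type` key and the `steps_to_trigger` helper are assumptions made for illustration; they are not taken from the PR's code.

```python
from typing import Dict, List, TypedDict


class StepSpec(TypedDict, total=False):
    # Hypothetical shape of one entry in the graph specification.
    input_type: str
    triggered_by: List[str]
    provides_dataset_config_names: bool
    provides_config_split_names: bool


# Illustrative three-step specification (the step names are made up).
SPEC: Dict[str, StepSpec] = {
    "dataset-step": {
        "input_type": "dataset",
        "provides_dataset_config_names": True,
    },
    "config-step": {
        "input_type": "config",
        "triggered_by": ["dataset-step"],
        "provides_config_split_names": True,
    },
    "split-step": {
        "input_type": "split",
        "triggered_by": ["config-step"],
    },
}


def steps_to_trigger(spec: Dict[str, StepSpec], updated_step: str) -> List[str]:
    """Return the steps that need a new job once `updated_step` has been updated."""
    return sorted(
        name
        for name, step in spec.items()
        if updated_step in step.get("triggered_by", [])
    )
```

With this sketch, `steps_to_trigger(SPEC, "dataset-step")` only returns the direct child; grand-children are reached when the child itself is updated.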
ProcessingGraph:
- the concept of ProcessingStep is simplified: it's a data class, and provides the name, job runner version, input type (as well as the job type and cache kind, which are equal to the processing step name)
- most of the methods are provided by ProcessingGraph, in particular those related to the edges, or to filtering nodes.
- the data structure is a networkx.DiGraph: the nodes are the step names, and we use the node attributes to store some properties
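As a rough illustration of that data structure (not the actual `ProcessingGraph` implementation), the sketch below stores step names as nodes of a `networkx.DiGraph`, keeps properties in node attributes, and answers edge-related and filtering queries from the graph. The node names and attribute values are assumptions.

```python
import networkx as nx

# Nodes are step names; step properties live in the node attributes.
graph = nx.DiGraph()
graph.add_node("a", input_type="dataset", provides_dataset_config_names=True)
graph.add_node("b", input_type="config")
graph.add_node("c", input_type="split")

# An edge a -> b means that b is triggered by a.
graph.add_edge("a", "b")
graph.add_edge("b", "c")

# Edge-related queries.
children_of_a = sorted(graph.successors("a"))
parents_of_c = sorted(graph.predecessors("c"))
ancestors_of_c = sorted(nx.ancestors(graph, "c"))

# Filtering nodes on an attribute.
config_name_providers = [
    name
    for name, attrs in graph.nodes(data=True)
    if attrs.get("provides_dataset_config_names", False)
]
```

Keeping everything in one graph object means a step only needs to carry its name and a few static fields, while relations such as children or ancestors are computed from the graph.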
Remove the /admin/jobs_duration endpoint: finished jobs are now deleted quickly, so this statistic no longer means much.
Add tests for state.py. In particular, use small processing graphs that cover the special cases (children, grand-children, multiple parents, fan-in, fan-out, parallel steps, etc.). The "real" processing graph now has only one test. This should reduce the amount of code to change when a step is added, modified, or deleted.
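The idea of testing against small use-case graphs rather than the real one could look roughly like the sketch below. It is not the PR's test code: the `Spec` alias, the fixture graphs and the two helpers are hypothetical, and plain asserts stand in for the project's test framework.

```python
from typing import Dict, List

# Hypothetical compact spec: step name -> names of the steps that trigger it.
Spec = Dict[str, List[str]]

FAN_OUT: Spec = {"a": [], "b": ["a"], "c": ["a"]}
FAN_IN: Spec = {"a": [], "b": [], "c": ["a", "b"]}
GRANDCHILDREN: Spec = {"a": [], "b": ["a"], "c": ["b"]}


def get_children(spec: Spec, step: str) -> List[str]:
    """Steps directly triggered by `step`."""
    return sorted(name for name, parents in spec.items() if step in parents)


def get_ancestors(spec: Spec, step: str) -> List[str]:
    """All steps that `step` depends on, directly or transitively."""
    seen = set()
    stack = list(spec[step])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(spec[parent])
    return sorted(seen)


def test_fan_out() -> None:
    assert get_children(FAN_OUT, "a") == ["b", "c"]


def test_fan_in() -> None:
    assert get_ancestors(FAN_IN, "c") == ["a", "b"]
    assert get_children(FAN_IN, "a") == ["c"]


def test_grandchildren() -> None:
    assert get_children(GRANDCHILDREN, "a") == ["b"]
    assert get_ancestors(GRANDCHILDREN, "c") == ["a", "b"]
```

Because each fixture has only three nodes, adding or renaming a step in the real graph does not touch these tests.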

HuggingFaceDocBuilder · 2023-04-27T14:40:14Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

albertvillanova

Thanks for the huge contribution. In principle, all proposed changes seem sensible.

AndreaFrancis

Thanks!
There are some commented lines; will those be removed?
I don't quite understand the advantage of having the processing graph in all job runners just to make one call to get the children, but maybe in the future it will be done by the orchestrator.


severo · 2023-05-10T07:50:01Z

> Thanks for the huge contribution. In principle, all proposed changes seem sensible.

Yes, the PR is too big, I should have broken it into some smaller parts... Thanks for reviewing it all (same @AndreaFrancis)!

severo · 2023-05-10T07:52:43Z

> There are some commented lines, will those be removed?

Yes! I removed them. Some of them were still there because I had forgotten to push the last commit (finishing the tests in tests/state). I'll let you review the last changes before merging.

> I don't quite understand the advantage of having processing graph in all job runners to make one call to get the children but maybe in the future, it will be done by the orchestrator.

Yes, in the future (the PR is ~~nearly ready~~ #1157), we will not get the children, but instead do a full DatasetState analysis + backfill at the end of each job. It's possibly too much, but it will be better than the current inconsistencies we have in the cache.

severo force-pushed the more-tests-on-dataset-state branch 2 times, most recently from 27ffb7b to ebaff0e Compare May 4, 2023 08:11

severo mentioned this pull request May 5, 2023

Plot processing graph #1132

Merged

severo force-pushed the more-tests-on-dataset-state branch from 4ec8b38 to f736899 Compare May 9, 2023 10:38

severo marked this pull request as ready for review May 9, 2023 14:17

severo requested a review from a team May 9, 2023 14:17

albertvillanova approved these changes May 9, 2023


AndreaFrancis approved these changes May 9, 2023


libs/libcommon/tests/state/__init__.py (outdated, resolved)

libs/libcommon/tests/state/test_objects.py (outdated, resolved)

libs/libcommon/tests/state/test_plan.py (outdated, resolved)

severo added 17 commits May 10, 2023 07:44

test: 💍 split test file in two

e91038e

test: 💍 more refactoring

db25ce7

refactor: 💡 declare name steps in config

6621f60

instead of hard coding them

test: 💍 simplify the tests on the state objects

4e97ae6

relying on a test processing graph (a->b->c). Also: remove .as_dict() methods that are not used.

refactor: 💡 remove .get_ancestors()

c3ade5c

it's just .ancestors

feat: 🎸 remove deprecated "endpoint" attribute of ProcessingSte

b8e7f1f

refactor: 💡 rename requires to triggered_by

284667a

refactor: 💡 don't specify attributes with default value

41db179

in ProcessingStep

docs: ✏️ add a docstring

a75cd48

refactor: 💡 only refer to other steps with strings

68425d9

in ProcessingStep.

fix: 🐛 fix glitches from rebase

476e58a

fix: 🐛 fix a test

3d32ac7

feat: 🎸 simplify ProcessingStep

2f74a8f

also: remove /admin/job_duration; also: renaming step -> processing_step nearly everywhere; also: small refactorings proposed by sourcery (inverting logic to avoid double negations...)

feat: 🎸 compute the fixed graph attributes upfront

df24ff2

also add docstrings

refactor: 💡 handle list of parquet steps in processing_graph

09d872d

it makes the code in services/api a lot simpler. Also: fix the CI.

test: 💍 add tests on state

9d41c98

test: 💍 move tests from "real" graph to use-case graphs

c8bbd04

severo force-pushed the more-tests-on-dataset-state branch from 69c2f14 to c8bbd04 Compare May 10, 2023 07:44

severo and others added 2 commits May 10, 2023 07:47

style: 💄 update year

7767f3d

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>

refactor: 💡 remove commented lines

83aecc1

severo requested review from AndreaFrancis and albertvillanova May 10, 2023 07:52

AndreaFrancis approved these changes May 10, 2023


severo merged commit 1163909 into main May 10, 2023
17 checks passed

severo deleted the more-tests-on-dataset-state branch May 10, 2023 12:54
