
Configs and splits #702

Merged: 17 commits into main from configs_and_splits on Jan 26, 2023

Conversation

@severo (Collaborator) commented Jan 25, 2023

See #701

This PR:

  • creates a new endpoint, /config-names, which returns the list of config names for a dataset
  • creates a new endpoint, /split-names, which returns the list of split names for a config (not for a dataset, which is the difference with /splits for now). Note that /split-names, like /splits, may depend on the streaming mode and might fail for that reason.
  • introduces a new "input_type": "config" in the processing steps. Before, a processing step was either "dataset" (the input is a dataset name) or "split" (the inputs are dataset, config, and split). Now "config" (the inputs are dataset and config) is also supported, and is used by the new endpoint /split-names. Example requests are sketched below.
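
To make the new API shape concrete, here is a hedged sketch of how a client might query the two endpoints. The base URL and the response field names ("config_names", "split_names") are illustrative assumptions, not a confirmed schema:

```python
# Illustrative client for the two new endpoints; field names are assumed.
import requests

API = "https://datasets-server.huggingface.co"  # assumed base URL

# 1. list the config names of a dataset
configs = requests.get(f"{API}/config-names", params={"dataset": "glue"}).json()

# 2. for each config, list its split names
for item in configs.get("config_names", []):
    splits = requests.get(
        f"{API}/split-names",
        params={"dataset": "glue", "config": item["config"]},
    ).json()
    print(item["config"], splits.get("split_names"))
```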

Once this PR is merged, the plan is to:

  • fill the cache for the two new endpoints for all the datasets on the Hub (this might take some hours to complete),
  • create a PR on moonlanding to use these two new endpoints instead of /splits,
  • remove the /splits processing step, delete the cache for this endpoint, make the endpoint an alias of /split-names, and give /split-names the ability to concatenate the responses of all the configs when the only parameter is dataset (no config passed); see the sketch below.
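
As a rough illustration of that last point, a minimal sketch of the planned fallback, assuming a hypothetical cache accessor and response shapes:

```python
# Sketch of the planned behavior: when /split-names receives only a dataset
# (no config), concatenate the cached responses of all its configs.
from typing import Any

def get_cached_response(endpoint: str, **params: str) -> dict[str, Any]:
    """Hypothetical accessor for the cached response of an endpoint."""
    raise NotImplementedError  # would read from the cache database

def split_names_for_dataset(dataset: str) -> list[dict]:
    configs = get_cached_response("/config-names", dataset=dataset)["config_names"]
    split_names: list[dict] = []
    for item in configs:
        response = get_cached_response("/split-names", dataset=dataset, config=item["config"])
        split_names.extend(response["split_names"])
    return split_names
```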

@codecov-commenter commented Jan 25, 2023

Codecov Report

Base: 90.42% // Head: 89.11% // Project coverage decreases by 1.31% ⚠️

Coverage data is based on head (4dc0383) compared to base (8d890ea).
Patch coverage: 76.19% of modified lines in pull request are covered.

❗ Current head 4dc0383 differs from the pull request's most recent head e325904. Consider uploading reports for the commit e325904 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #702      +/-   ##
==========================================
- Coverage   90.42%   89.11%   -1.31%     
==========================================
  Files          65       36      -29     
  Lines        3163     1213    -1950     
==========================================
- Hits         2860     1081    -1779     
+ Misses        303      132     -171     
Flag                     Coverage Δ
services_admin           88.20% <0.00%> (-0.70%) ⬇️
services_api             90.12% <100.00%> (+0.24%) ⬆️
workers_datasets_based   ?

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
services/admin/src/admin/routes/force_refresh.py        62.16% <0.00%> (-9.72%) ⬇️
services/api/src/api/routes/processing_step.py          60.00% <100.00%> (+4.44%) ⬆️
services/api/tests/conftest.py                          97.56% <100.00%> (+0.19%) ⬆️
services/api/tests/routes/test_valid.py                 100.00% <100.00%> (ø)
services/api/tests/test_app.py                          100.00% <100.00%> (ø)
...orkers/datasets_based/src/datasets_based/config.py
...orkers/datasets_based/src/datasets_based/worker.py
...atasets_based/src/datasets_based/worker_factory.py
...s_based/src/datasets_based/workers/dataset_info.py
...ets_based/src/datasets_based/workers/first_rows.py
... and 24 more


☔ View full report at Codecov.

@severo marked this pull request as draft January 25, 2023 18:09
@severo marked this pull request as ready for review January 26, 2023 10:48
@severo requested a review from lhoestq January 26, 2023 10:55
@lhoestq (Member) left a comment


I learnt a lot, thanks! I added some comments and questions, mostly for my personal knowledge; I don't have any comment on the code itself :)

  },
  "workers": {
-   "datasets_based": "huggingface/datasets-server-workers-datasets_based:sha-5364f81"
+   "datasets_based": "huggingface/datasets-server-workers-datasets_based:sha-690d1cd"
@lhoestq (Member) commented:

This is to update which docker image to deploy, right?
I understand you have to update admin, api, and datasets_based (you made changes to all of them), but why do you update mongodbMigration?

@severo (Collaborator, Author) replied Jan 26, 2023:

> This is to update which docker image to deploy, right?

Yes. It's used in the local docker compose, in the CI, and in the Kubernetes deploys (dev and prod).

> why do you update mongodbMigration?

Hmm, good question. As it depends on libcommon, and libcommon has been updated, a new docker image has been built (https://github.com/huggingface/datasets-server/actions/runs/4013659307/jobs/6893262606). It's true that nothing should have changed, but I make a point of always running the latest versions, in order to get quick feedback in case of bugs.

requests:
  cpu: 0.01
limits:
  cpu: 1
@lhoestq (Member) commented:

Is this in case we want to deploy a dev cluster on ephemeral?

@severo (Collaborator, Author) replied:

Yes, that's true

DATASETS_BASED_ENDPOINT: "/split-names" # hard-coded
depends_on:
  - mongodb
restart: always
@lhoestq (Member) commented:

We can use this for a local dev env, right?

@severo (Collaborator, Author) replied:

Yes, you can launch `make start` at the root, and it will use this docker compose file to deploy all the images.

Until now, I haven't really used a "local dev env":

  • if I want to try something, I generally create a test (be it a unit test if it only affects a service or a worker, or an e2e test if I want to test the relation between them)
  • alternatively, I just go to one of the services or workers, launch `poetry run python`, then interact with the prompt to test the methods
  • if I need to test the whole app manually, I deploy on the ephemeral k8s cluster

But now that several people are working on the project, the last point is not really a good idea anymore, so it would be better to launch locally when one wants to test the whole app manually.

Comment on lines +146 to +148
"/config-names": {"input_type": "dataset"},
"/split-names": {"input_type": "config", "requires": "/config-names"},
"/splits": {"input_type": "dataset", "required_by_dataset_viewer": True}, # to be deprecated
@lhoestq (Member) commented:

Is this where you define the dependencies between the processing steps?

@severo (Collaborator, Author) replied:

Yes! It's a bit hidden for now, and it's hard-coded even though it looks like a config option. I always thought of this as a temporary workaround, but I'm unsure how we want it to evolve.
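
For illustration only (this is not the actual libcommon code), the hard-coded specification shown above can be read as a small dependency graph between steps:

```python
# Illustrative sketch: derive the children of each processing step from the
# "requires" field of the hard-coded specification.
from collections import defaultdict

specification: dict[str, dict] = {
    "/config-names": {"input_type": "dataset"},
    "/split-names": {"input_type": "config", "requires": "/config-names"},
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
}

children: dict[str, list[str]] = defaultdict(list)
for endpoint, spec in specification.items():
    parent = spec.get("requires")
    if parent is not None:
        children[parent].append(endpoint)

print(dict(children))  # {'/config-names': ['/split-names']}
```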

@@ -0,0 +1,132 @@
# SPDX-License-Identifier: Apache-2.0
@lhoestq (Member) commented:

(just curious) why did you choose to add a new /split-names endpoint when you could have just extended /splits to support a config= query parameter?

@severo (Collaborator, Author) replied:

I find it a lot easier to create a new endpoint, populate its cache, then switch (blue/green method) than to manage two different logics at the same time.

dataset = hub_public_csv
worker = get_worker(dataset, app_config)
assert worker.process() is True
cached_response = get_response(kind=worker.processing_step.cache_kind, dataset=hub_public_csv)
@lhoestq (Member) commented:

Is it using an actual Mongo database for the cache when running tests?

@severo (Collaborator, Author) replied:

Yes: every service or worker has a different mongo container with a different exposed port, so that we can run multiple tests at the same time (just in case).

Also: in the tests, we drop the content of the database before each test.
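
A hedged sketch of that pattern with pytest and pymongo; the port and database name are assumptions, not the project's actual values:

```python
# The suite talks to a real Mongo container; the database is dropped before
# each test so every test starts from a clean cache.
import pytest
from pymongo import MongoClient

@pytest.fixture(autouse=True)
def clean_cache_database():
    client = MongoClient("mongodb://localhost:27030")  # per-service exposed port (assumed)
    client.drop_database("datasets_server_cache_test")  # assumed database name
    yield client
    client.close()
```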

("gated", False, "ConfigNamesError", "FileNotFoundError"),
("private", False, "ConfigNamesError", "FileNotFoundError"),
],
)
@lhoestq (Member) commented:

that's what I call testing <3

@severo (Collaborator, Author) replied:

Possibly it's too much and redundant. We could surely optimize the testing time.

@AndreaFrancis (Contributor) left a comment

I put a couple of minor comments/questions; please review whether they are valid. Also, I noticed there is no new e2e test for the new routes; are they needed?

`SplitNamesResponseContent`: An object with the list of split names for the dataset and config.
<Tip>
Raises the following errors:
- [`~workers.splits.EmptyDatasetError`]
@AndreaFrancis (Contributor) commented:

Should this be workers.split_names.EmptyDatasetError instead?

Raises the following errors:
- [`~workers.splits.EmptyDatasetError`]
The dataset is empty.
- [`~workers.splits.SplitsNamesError`]
@AndreaFrancis (Contributor) commented:

Should this be workers.split_names.SplitsNamesError instead?


@staticmethod
def get_version() -> str:
return "2.0.0"
@AndreaFrancis (Contributor) commented:

I am not sure, but since this is a new route, why is it not 1.0.0?

@severo (Collaborator, Author) replied:

It's a copy/paste from split.py, and I forgot to change it. Thanks!

@severo (Collaborator, Author) commented Jan 26, 2023

Thanks for your comments! I'm merging and deploying.

@severo merged commit 7d4e4d6 into main Jan 26, 2023
@severo deleted the configs_and_splits branch January 26, 2023 14:14
@severo (Collaborator, Author) commented Jan 26, 2023

In prod:

I'll add an admin endpoint to backfill the cache (see the sketch after this list):

  • get the list of all the public datasets
  • find the cache misses
  • launch a dataset update for them
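
In pseudocode, with hypothetical helper names (none of these are the actual admin-service functions), the backfill could look like:

```python
# Hedged sketch of the backfill described above.
def list_public_datasets() -> list[str]:
    raise NotImplementedError  # would query the Hub for all public datasets

def has_cache_entry(endpoint: str, dataset: str) -> bool:
    raise NotImplementedError  # would look up the cache database

def create_update_job(dataset: str) -> None:
    raise NotImplementedError  # would enqueue a dataset update job

def backfill(endpoints: list[str]) -> None:
    for dataset in list_public_datasets():
        if any(not has_cache_entry(e, dataset) for e in endpoints):
            create_update_job(dataset)  # recompute the missing responses

# backfill(["/config-names", "/split-names"])
```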

EDIT: See #705, where I introduce a priority level so that backfill jobs can run without interfering with the regular updates, and #708, where we add the /backfill endpoint.

@severo (Collaborator, Author) commented Jan 26, 2023

> Also I noticed there is no new test on e2e for the new routes, are they needed?

I don't know if all the current e2e tests are needed. I only added to test_11_auth.py, which already checks that all the responses have been created. The unit tests should already cover more details, and if we find integration bugs, we should create a new e2e test for that specific case. But as such, I'm afraid we already have some redundant tests (splits, first rows).
