
Configs and splits #702

Merged: 17 commits into main from configs_and_splits on Jan 26, 2023

Conversation

@severo (Collaborator) commented Jan 25, 2023

See #701

This PR:

  • creates a new endpoint, /config-names, which returns the list of config names for a dataset
  • creates a new endpoint, /split-names, which returns the list of split names for a config (not for a dataset, which is the difference with /splits for now). Note that /split-names, like /splits, may depend on the streaming mode and might fail for that reason.
  • introduces a new "input_type": "config" in the processing steps. Before, a processing step was either "dataset" (the input is a dataset name) or "split" (the inputs are dataset, config, and split). Now "config" (the inputs are dataset and config) is also supported, and is used by the new endpoint /split-names. Example requests are sketched below.
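
To make the new API shape concrete, here is a hedged sketch of how a client might query the two endpoints. The base URL and the response field names ("config_names", "split_names") are illustrative assumptions, not a confirmed schema:

```python
# Illustrative client for the two new endpoints; field names are assumed.
import requests

API = "https://datasets-server.huggingface.co"  # assumed base URL

# 1. list the config names of a dataset
configs = requests.get(f"{API}/config-names", params={"dataset": "glue"}).json()

# 2. for each config, list its split names
for item in configs.get("config_names", []):
    splits = requests.get(
        f"{API}/split-names",
        params={"dataset": "glue", "config": item["config"]},
    ).json()
    print(item["config"], splits.get("split_names"))
```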

Once this PR is merged, the plan is to:

  • fill the cache for the two new endpoints for all the datasets on the Hub (this might take some hours to complete),
  • create a PR on moonlanding to use these two new endpoints instead of /splits,
  • remove the /splits processing step, delete the cache for this endpoint, make the endpoint an alias of /split-names, and give /split-names the ability to concatenate the responses of all the configs when the only parameter is dataset (no config passed); see the sketch below.
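
As a rough illustration of that last point, a minimal sketch of the planned fallback, assuming a hypothetical cache accessor and response shapes:

```python
# Sketch of the planned behavior: when /split-names receives only a dataset
# (no config), concatenate the cached responses of all its configs.
from typing import Any

def get_cached_response(endpoint: str, **params: str) -> dict[str, Any]:
    """Hypothetical accessor for the cached response of an endpoint."""
    raise NotImplementedError  # would read from the cache database

def split_names_for_dataset(dataset: str) -> list[dict]:
    configs = get_cached_response("/config-names", dataset=dataset)["config_names"]
    split_names: list[dict] = []
    for item in configs:
        response = get_cached_response("/split-names", dataset=dataset, config=item["config"])
        split_names.extend(response["split_names"])
    return split_names
```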

@codecov-commenter commented Jan 25, 2023

Codecov Report

Base: 90.42% // Head: 89.11% // Project coverage decreases by 1.31% ⚠️

Coverage data is based on head (4dc0383) compared to base (8d890ea).
Patch coverage: 76.19% of modified lines in pull request are covered.

❗ Current head 4dc0383 differs from the pull request's most recent head e325904. Consider uploading reports for the commit e325904 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #702      +/-   ##
==========================================
- Coverage   90.42%   89.11%   -1.31%     
==========================================
  Files          65       36      -29     
  Lines        3163     1213    -1950     
==========================================
- Hits         2860     1081    -1779     
+ Misses        303      132     -171     
Flag                     Coverage Δ
services_admin           88.20% <0.00%> (-0.70%) ⬇️
services_api             90.12% <100.00%> (+0.24%) ⬆️
workers_datasets_based   ?

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
services/admin/src/admin/routes/force_refresh.py        62.16% <0.00%> (-9.72%) ⬇️
services/api/src/api/routes/processing_step.py          60.00% <100.00%> (+4.44%) ⬆️
services/api/tests/conftest.py                          97.56% <100.00%> (+0.19%) ⬆️
services/api/tests/routes/test_valid.py                 100.00% <100.00%> (ø)
services/api/tests/test_app.py                          100.00% <100.00%> (ø)
...orkers/datasets_based/src/datasets_based/config.py
...orkers/datasets_based/src/datasets_based/worker.py
...atasets_based/src/datasets_based/worker_factory.py
...s_based/src/datasets_based/workers/dataset_info.py
...ets_based/src/datasets_based/workers/first_rows.py
... and 24 more


☔ View full report at Codecov.

@severo marked this pull request as draft January 25, 2023 18:09
@severo marked this pull request as ready for review January 26, 2023 10:48
@severo requested a review from lhoestq January 26, 2023 10:55
@lhoestq (Member) left a comment


I learnt a lot, thanks! I added some comments and questions, mostly for my personal knowledge; I don't have any comment on the code itself :)

  },
  "workers": {
-   "datasets_based": "huggingface/datasets-server-workers-datasets_based:sha-5364f81"
+   "datasets_based": "huggingface/datasets-server-workers-datasets_based:sha-690d1cd"
@lhoestq (Member) commented:

This is to update which docker image to deploy, right?
I understand you have to update admin, api, and datasets_based (you made changes to all of them), but why do you update mongodbMigration?

@severo (Collaborator, Author) replied Jan 26, 2023:

> This is to update which docker image to deploy, right?

Yes. It's used in the local docker compose, in the CI, and in the Kubernetes deploys (dev and prod).

> why do you update mongodbMigration?

Hmm, good question. As it depends on libcommon, and libcommon has been updated, a new docker image has been built (https://github.com/huggingface/datasets-server/actions/runs/4013659307/jobs/6893262606). It's true that nothing should have changed, but I make a point of always running the latest versions, in order to get quick feedback in case of bugs.

requests:
  cpu: 0.01
limits:
  cpu: 1
@lhoestq (Member) commented:

Is this in case we want to deploy a dev cluster on ephemeral?

@severo (Collaborator, Author) replied:

Yes, that's true

DATASETS_BASED_ENDPOINT: "/split-names" # hard-coded
depends_on:
  - mongodb
restart: always
@lhoestq (Member) commented:

We can use this for a local dev env, right?

@severo (Collaborator, Author) replied:

Yes, you can launch `make start` at the root, and it will use this docker compose file to deploy all the images.

Until now, I haven't really used a "local dev env":

  • if I want to try something, I generally create a test (be it a unit test if it only affects a service or a worker, or an e2e test if I want to test the relation between them)
  • alternatively, I just go to one of the services or workers, launch `poetry run python`, then interact with the prompt to test the methods
  • if I need to test the whole app manually, I deploy on the ephemeral k8s cluster

But now that several people are working on the project, the last point is not really a good idea anymore, so it would be better to launch locally when one wants to test the whole app manually.

Comment on lines +146 to +148
"/config-names": {"input_type": "dataset"},
"/split-names": {"input_type": "config", "requires": "/config-names"},
"/splits": {"input_type": "dataset", "required_by_dataset_viewer": True}, # to be deprecated
@lhoestq (Member) commented:

Is this where you define the dependencies between the processing steps?

@severo (Collaborator, Author) replied:

Yes! It's a bit hidden for now, and it's hard-coded even though it looks like a config option. I always thought of this as a temporary workaround, but I'm unsure how we want it to evolve.
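
For illustration only (this is not the actual libcommon code), the hard-coded specification shown above can be read as a small dependency graph between steps:

```python
# Illustrative sketch: derive the children of each processing step from the
# "requires" field of the hard-coded specification.
from collections import defaultdict

specification: dict[str, dict] = {
    "/config-names": {"input_type": "dataset"},
    "/split-names": {"input_type": "config", "requires": "/config-names"},
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
}

children: dict[str, list[str]] = defaultdict(list)
for endpoint, spec in specification.items():
    parent = spec.get("requires")
    if parent is not None:
        children[parent].append(endpoint)

print(dict(children))  # {'/config-names': ['/split-names']}
```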

@@ -0,0 +1,132 @@
# SPDX-License-Identifier: Apache-2.0
@lhoestq (Member) commented:

(just curious) why did you choose to add a new /split-names endpoint when you could have just extended /splits to support a config= query parameter?

@severo (Collaborator, Author) replied:

I find it a lot easier to create a new endpoint, populate its cache, then switch (blue/green method) than to manage two different logics at the same time.

dataset = hub_public_csv
worker = get_worker(dataset, app_config)
assert worker.process() is True
cached_response = get_response(kind=worker.processing_step.cache_kind, dataset=hub_public_csv)
@lhoestq (Member) commented:

Is it using an actual Mongo database for the cache when running tests?

@severo (Collaborator, Author) replied:

Yes: every service or worker has a different mongo container with a different exposed port, so that we can run multiple tests at the same time (just in case).

Also: in the tests, we drop the content of the database before each test.
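
A hedged sketch of that pattern with pytest and pymongo; the port and database name are assumptions, not the project's actual values:

```python
# The suite talks to a real Mongo container; the database is dropped before
# each test so every test starts from a clean cache.
import pytest
from pymongo import MongoClient

@pytest.fixture(autouse=True)
def clean_cache_database():
    client = MongoClient("mongodb://localhost:27030")  # per-service exposed port (assumed)
    client.drop_database("datasets_server_cache_test")  # assumed database name
    yield client
    client.close()
```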

("gated", False, "ConfigNamesError", "FileNotFoundError"),
("private", False, "ConfigNamesError", "FileNotFoundError"),
],
)
@lhoestq (Member) commented:

that's what I call testing <3

@severo (Collaborator, Author) replied:

Possibly it's too much and redundant. We could surely optimize the testing time.

@AndreaFrancis (Contributor) left a comment

I put a couple of minor comments/questions; please review whether they are valid. Also, I noticed there is no new e2e test for the new routes; are they needed?

`SplitNamesResponseContent`: An object with the list of split names for the dataset and config.
<Tip>
Raises the following errors:
- [`~workers.splits.EmptyDatasetError`]
@AndreaFrancis (Contributor) commented:

Should this be workers.split_names.EmptyDatasetError instead?

Raises the following errors:
- [`~workers.splits.EmptyDatasetError`]
The dataset is empty.
- [`~workers.splits.SplitsNamesError`]
@AndreaFrancis (Contributor) commented:

Should this be workers.split_names.SplitsNamesError instead?


@staticmethod
def get_version() -> str:
return "2.0.0"
@AndreaFrancis (Contributor) commented:

I am not sure, but since this is a new route, why is it not 1.0.0?

@severo (Collaborator, Author) replied:

It's a copy/paste from split.py, and I forgot to change it. Thanks!

@severo (Collaborator, Author) commented Jan 26, 2023

Thanks for your comments! I'm merging and deploying.

@severo merged commit 7d4e4d6 into main Jan 26, 2023
@severo deleted the configs_and_splits branch January 26, 2023 14:14
@severo (Collaborator, Author) commented Jan 26, 2023

In prod:

I'll add an admin endpoint to backfill the cache (see the sketch after this list):

  • get the list of all the public datasets
  • find the cache misses
  • launch a dataset update for them
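
In pseudocode, with hypothetical helper names (none of these are the actual admin-service functions), the backfill could look like:

```python
# Hedged sketch of the backfill described above.
def list_public_datasets() -> list[str]:
    raise NotImplementedError  # would query the Hub for all public datasets

def has_cache_entry(endpoint: str, dataset: str) -> bool:
    raise NotImplementedError  # would look up the cache database

def create_update_job(dataset: str) -> None:
    raise NotImplementedError  # would enqueue a dataset update job

def backfill(endpoints: list[str]) -> None:
    for dataset in list_public_datasets():
        if any(not has_cache_entry(e, dataset) for e in endpoints):
            create_update_job(dataset)  # recompute the missing responses

# backfill(["/config-names", "/split-names"])
```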

EDIT: See #705, where I introduce a priority level so that backfill jobs can run without interfering with the regular updates, and #708, where we add the /backfill endpoint.

@severo (Collaborator, Author) commented Jan 26, 2023

> Also I noticed there is no new test on e2e for the new routes, are they needed?

I don't know if all the current e2e tests are needed. I only added to test_11_auth.py, which already checks that all the responses have been created. The unit tests should already cover more details, and if we find integration bugs, we should create a new e2e test for that specific case. But as such, I'm afraid we already have some redundant tests (splits, first rows).
