feat: 🎸 make the queue agnostic to the types of jobs #608

severo · 2022-10-13T12:56:35Z

Before, we had two collections: one for splits jobs and one for first-rows jobs. Now there is only one collection, named "jobs", with a "type" field. Note that the job arguments are still restricted to dataset (required) and, optionally, config and split.

BREAKING CHANGE: 🧨 two collections are removed and a new one is created. The function names have changed too.
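As a rough illustration of the single "jobs" collection, here is a minimal sketch of what a job record could look like. The class names, the field names and the two `JobType` values are assumptions made for the example, not the actual libqueue schema; only the "type" field, the dataset/config/split arguments and the "waiting"/"started" statuses come from this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobType(Enum):
    # Hypothetical values, chosen to mirror the two former collections.
    SPLITS = "splits"
    FIRST_ROWS = "first_rows"


class Status(Enum):
    WAITING = "waiting"
    STARTED = "started"


@dataclass
class Job:
    type: JobType
    dataset: str  # required
    config: Optional[str] = None  # optional
    split: Optional[str] = None  # optional
    status: Status = Status.WAITING

    def to_document(self) -> dict:
        """Plain dict, as it could be stored in the "jobs" collection."""
        return {
            "type": self.type.value,
            "dataset": self.dataset,
            "config": self.config,
            "split": self.split,
            "status": self.status.value,
        }


splits_job = Job(type=JobType.SPLITS, dataset="user/dataset")
first_rows_job = Job(
    type=JobType.FIRST_ROWS, dataset="user/dataset", config="default", split="train"
)
```

Using an enum for the type means a misspelled job type fails when the job is created instead of silently producing a job that no worker picks up.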

refresh... functions are now the "compute" abstract method

also: move types-requests dependency to dev dependencies.

Note: we removed apache-beam for now because of an installation issue. It must be added again later.

it only contains the splits/ worker

beware: the docker images don't exist, we will have to update

if the tests fail, it means that a side effect occurs somewhere

see huggingface/datasets#4875 (comment)

we explicitly pass it as an argument, so no need to store it on the disk

now the current package version is 1.0.31, no need to build it from source.

severo · 2022-10-17T15:47:00Z

Note that we have to manually migrate the jobs from the splits queue and the first_rows queue to the new jobs queue.

To do so, in the prod mongo database:

get the list of datasets with pending "splits" jobs

db.splits_jobs.distinct("dataset_name", {status: {"$in": ['started', 'waiting']}})

get the list of datasets with pending "first_rows" jobs

db.first_rows_jobs.distinct("dataset_name", {status: {"$in": ['started', 'waiting']}})

Then relaunch them by sending a webhook:

DATASETS=(datasetA datasetB datasetC ...)
for dataset in "${DATASETS[@]}"; do curl -X POST https://datasets-server.huggingface.co/webhook -H 'Content-Type: application/json' -d '{"event": "update", "repo": {"type": "dataset", "name": "'"$dataset"'"}}'; done

Then: delete both collections
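The same relaunch step can be sketched in Python, which avoids the shell quoting around the dataset name. This is only a sketch: `merge_pending` and `webhook_payload` are hypothetical helper names, the input lists are placeholders for the output of the two `distinct` queries, and it assumes that one webhook call per dataset is enough, which this thread does not confirm. The actual POST is left out on purpose.

```python
import json
from typing import Iterable, List


def merge_pending(*lists: Iterable[str]) -> List[str]:
    """Union of the dataset names, sorted, without duplicates."""
    return sorted({name for names in lists for name in names})


def webhook_payload(dataset: str) -> str:
    """JSON body with the same shape as the one sent by the curl loop."""
    return json.dumps(
        {"event": "update", "repo": {"type": "dataset", "name": dataset}}
    )


# Placeholders for the results of the two distinct() queries.
pending_splits = ["datasetA", "datasetB"]
pending_first_rows = ["datasetB", "datasetC"]

to_relaunch = merge_pending(pending_splits, pending_first_rows)
payloads = [webhook_payload(name) for name in to_relaunch]
```

Each string in `payloads` can then be sent to the webhook endpoint with `Content-Type: application/json`.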

severo · 2022-10-17T16:54:47Z

Done with

DATASETS=(Graphcore/gqa-lxmert YWjimmy/PeRFception-v1 YWjimmy/PeRFception-v1-1 YWjimmy/PeRFception-v1-2 YWjimmy/PeRFception-v1-3 anonymousdeepcc/DeepCC echarlaix/gqa-lxmert huggingface/stable-diffusion-prompts)

and

DATASETS=(CodedotAI/code_clippy_github YWjimmy/PeRFception-v1 YWjimmy/PeRFception-v1-3 abdusahmbzuai/masc_dev common_voice icelab/ntrs_meta imthanhlv/binhvq_news21_raw mlsum munggok/KoPI-CC munggok/Oscar_Indo_May_2022 nishita/ade20k-sample openclimatefix/dgmr-mrms openclimatefix/gfs-surface-pressure-0.5deg openclimatefix/gfs-surface-pressure-1.0deg openclimatefix/gfs-surface-pressure-2.0deg openclimatefix/gfs-surface-pressure-2deg openclimatefix/nimrod-uk-1km openclimatefix/nimrod-uk-1km-test openclimatefix/nimrod-uk-1km-validation polinaeterna/earn raghav66/whisper-gpt tensorcat/wikipedia-japanese vialibre/spanish3bwc_extended_vocab)

severo · 2022-10-17T17:02:49Z

And I deleted the old collections

severo added 10 commits October 14, 2022 08:47

feat: 🎸 publish new version

0d47c4f

feat: 🎸 upgrade to libqueue 0.3.0

45cea4c

feat: 🎸 remove created_at field in pending_jobs

42db817

style: 💄 fix style

8f6c441

feat: 🎸 upgrade libqueue to 0.3.0

6f27aee

refactor: 💡 use an enum to prevent typos

3c54b58

fix: 🐛 fix mypy

81f7bca

feat: 🎸 upgrade libqueue to 0.3.0 and datasets

3c0a2e4

test: 💍 fix test

d6664cf

severo force-pushed the agnostic_queue_and_split_workers branch from 49198cd to d6664cf Compare October 14, 2022 11:42

severo added 16 commits October 14, 2022 11:51

refactor: 💡 pack the queue functions into a Queue class

cab2941

refactor: 💡 use relative imports

2692096

feat: 🎸 upgrade to libqueue 0.3.1

4b40da1

refactor: 💡 use relative imports

2ec9085

feat: 🎸 upgrade to libqueue 0.3.1

205a358

refactor: 💡 use relative imports

45f0aee

feat: 🎸 upgrade to libqueue 0.3.1

cb39a41

refactor: 💡 use a common Worker class for the loop logic

d8f02d2

refactor: 💡 simplify the code

6a54405

refactor: 💡 factor process_job in Worker, and remove refresh

64bdb37

refresh... functions are now the "compute" abstract method
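To illustrate the idea of a common Worker class with a "compute" abstract method, here is a minimal sketch. The signatures, the dict-based job and the `SplitsWorker` subclass are assumptions made for the example; the real libqueue Worker may differ.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class Worker(ABC):
    """Hypothetical base class: shared job handling, type-specific compute."""

    @abstractmethod
    def compute(
        self, dataset: str, config: Optional[str] = None, split: Optional[str] = None
    ) -> None:
        """Do the work for one job; each worker type overrides this."""

    def process_job(self, job: dict) -> bool:
        """Run compute for one job and report whether it succeeded."""
        try:
            self.compute(job["dataset"], job.get("config"), job.get("split"))
            return True
        except Exception:
            # A real worker would log the error and mark the job as failed.
            return False


class SplitsWorker(Worker):
    """Toy subclass that only records the datasets it was asked to process."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def compute(
        self, dataset: str, config: Optional[str] = None, split: Optional[str] = None
    ) -> None:
        if not dataset:
            raise ValueError("dataset is required")
        self.seen.append(dataset)


worker = SplitsWorker()
ok = worker.process_job({"dataset": "user/dataset"})
failed = worker.process_job({"dataset": ""})
```

With this shape, the loop and error handling live in one place, and a new job type only needs a subclass that implements `compute`.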

test: 💍 temporarily disable unrelated failing tests

b8b8b95

test: 💍 fix tests

2564e2e

refactor: 💡 add Worker to libqueue

51ee0f3

chore: 🤖 install types

c54e020

feat: 🎸 upgrade to libqueue 0.3.2

0a85620

also: move types-requests dependency to dev dependencies.

refactor: 💡 new project isolating the /first-rows worker

0d84eb1

Note: we removed apache-beam for now because of an installation issue. It must be added again later.

severo force-pushed the agnostic_queue_and_split_workers branch from e9b41c4 to 0d84eb1 Compare October 14, 2022 20:10

severo added 2 commits October 14, 2022 20:33

feat: 🎸 create a new project: worker_splits

c72059b

it only contains the splits/ worker

chore: 🤖 add the commented dependency to think to reinstall it

881b814

severo added 9 commits October 14, 2022 21:06

feat: 🎸 replace services/workers with the workers/

5f03c49

beware: the docker images don't exist, we will have to update

ci: 🎡 fix argument name

d41b141

fix: 🐛 fix details and upgrade docker images for admin and api

1694b08

feat: 🎸 upgrade docker images for the two workers

fce3bc4

fix: 🐛 reinstall apache beam, pinned to 2.41.0

e64810a

fix: 🐛 use absolute imports, not relative imports

db1a233

fix: 🐛 upgrade httplib2 to remove safety alert

676c776

test: 💍 hack the tests order to fix the CI?

9d9f2fc

test: 💍 restore the tests order to show the problem

2c50fe8

if the tests fail, it means that a side effect occurs somewhere

severo mentioned this pull request Oct 17, 2022

_resolve_features ignores the token huggingface/datasets#4875

Open

severo added 6 commits October 17, 2022 13:02

test: 💍 force datasets to patch csv for streaming every time

328d445

see huggingface/datasets#4875 (comment)

style: 💄 fix style

33ddc92

test: 💍 don't store the HF token on the disk

4291351

we explicitly pass it as an argument, so no need to store it on the disk

chore: 🤖 install libsndfile1 from repos instead of building it

dadb8f2

now the current package version is 1.0.31, no need to build it from source.

chore: 🤖 add a missing package to have ICU work

bb202a6

feat: 🎸 update the docker images

e227558

severo marked this pull request as ready for review October 17, 2022 14:57

severo merged commit 7f69df0 into main Oct 17, 2022

severo deleted the agnostic_queue_and_split_workers branch October 17, 2022 15:04
