feat: 🎸 make the queue agnostic to the types of jobs #608

Merged
severo merged 43 commits into main from agnostic_queue_and_split_workers on Oct 17, 2022

@severo severo commented Oct 13, 2022

Before, we had two collections: one for splits jobs and one for first-rows jobs. Now there is a single collection named "jobs", with a "type" field. Note that the job arguments are still restricted to dataset (required) and, optionally, config and split.

BREAKING CHANGE: 🧨 two collections are removed and a new one is created. The function names have changed too.
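
To illustrate the idea of a single queue keyed by a "type" field, here is a minimal in-memory sketch. It is not the code from this PR: the `Job` and `JobQueue` names, the method names, and the use of "splits" and "first_rows" as type values are assumptions made for illustration. Only the argument set (dataset required, config and split optional) and the 'waiting' and 'started' statuses come from this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    # Hypothetical record: one entry of the single "jobs" collection.
    type: str                      # e.g. "splits" or "first_rows" (assumed values)
    dataset: str                   # required
    config: Optional[str] = None   # optional
    split: Optional[str] = None    # optional
    status: str = "waiting"


class JobQueue:
    """Hypothetical in-memory stand-in for a type-agnostic job queue."""

    def __init__(self) -> None:
        self.jobs: List[Job] = []

    def add_job(
        self,
        type: str,
        dataset: str,
        config: Optional[str] = None,
        split: Optional[str] = None,
    ) -> Job:
        # Every job goes to the same list, whatever its type.
        job = Job(type=type, dataset=dataset, config=config, split=split)
        self.jobs.append(job)
        return job

    def start_job(self, type: str) -> Optional[Job]:
        # Pick the oldest waiting job of the requested type, if any.
        for job in self.jobs:
            if job.type == type and job.status == "waiting":
                job.status = "started"
                return job
        return None
```

With this shape, a worker only needs to know which type it handles; adding a new kind of job does not require a new collection.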

@severo severo force-pushed the agnostic_queue_and_split_workers branch from 49198cd to d6664cf Compare October 14, 2022 11:42
@severo severo force-pushed the agnostic_queue_and_split_workers branch from e9b41c4 to 0d84eb1 Compare October 14, 2022 20:10
@severo severo marked this pull request as ready for review October 17, 2022 14:57
@severo severo merged commit 7f69df0 into main Oct 17, 2022
@severo severo deleted the agnostic_queue_and_split_workers branch October 17, 2022 15:04

severo commented Oct 17, 2022

Note that we have to manually migrate the jobs from the splits queue and the first_rows queue to the new jobs queue.

To do so, in the production MongoDB database:

  • get the list of datasets with pending "splits" jobs
db.splits_jobs.distinct("dataset_name", {status: {"$in": ['started', 'waiting']}})
  • get the list of datasets with pending "first_rows" jobs
db.first_rows_jobs.distinct("dataset_name", {status: {"$in": ['started', 'waiting']}})

Then relaunch them by sending a webhook:

DATASETS=(datasetA datasetB datasetC ...)
for dataset in "${DATASETS[@]}"; do curl -X POST https://datasets-server.huggingface.co/webhook -H 'Content-Type: application/json' -d '{"event": "update", "repo": {"type": "dataset", "name": "'"$dataset"'"}}'; done

Then delete both old collections (splits_jobs and first_rows_jobs).
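
The relaunch step could also be scripted in Python. The sketch below is an illustration, not part of the PR: the function names are hypothetical, and it assumes the webhook accepts exactly the JSON body used in the curl command above. It merges the two dataset lists without duplicates, so a dataset pending in both old queues is only relaunched once, and it builds the body with `json.dumps` so that dataset names are always quoted safely.

```python
import json
import urllib.request
from typing import Iterable, List

# URL taken from the curl command in this thread.
WEBHOOK_URL = "https://datasets-server.huggingface.co/webhook"


def merge_datasets(*lists: Iterable[str]) -> List[str]:
    """Merge several lists of dataset names, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged: List[str] = []
    for names in lists:
        for name in names:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


def build_payload(dataset: str) -> bytes:
    """Build the same JSON body as the curl command, encoded as UTF-8 bytes."""
    body = {"event": "update", "repo": {"type": "dataset", "name": dataset}}
    return json.dumps(body).encode("utf-8")


def send_update(dataset: str, url: str = WEBHOOK_URL) -> int:
    """POST the update event for one dataset and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(dataset),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A run would then look like `for name in merge_datasets(splits_datasets, first_rows_datasets): send_update(name)`, where the two lists hold the results of the `distinct` queries.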


severo commented Oct 17, 2022

Done with

DATASETS=(Graphcore/gqa-lxmert YWjimmy/PeRFception-v1 YWjimmy/PeRFception-v1-1 YWjimmy/PeRFception-v1-2 YWjimmy/PeRFception-v1-3 anonymousdeepcc/DeepCC echarlaix/gqa-lxmert huggingface/stable-diffusion-prompts)

and

DATASETS=(CodedotAI/code_clippy_github YWjimmy/PeRFception-v1 YWjimmy/PeRFception-v1-3 abdusahmbzuai/masc_dev common_voice icelab/ntrs_meta imthanhlv/binhvq_news21_raw mlsum munggok/KoPI-CC munggok/Oscar_Indo_May_2022 nishita/ade20k-sample openclimatefix/dgmr-mrms openclimatefix/gfs-surface-pressure-0.5deg openclimatefix/gfs-surface-pressure-1.0deg openclimatefix/gfs-surface-pressure-2.0deg openclimatefix/gfs-surface-pressure-2deg openclimatefix/nimrod-uk-1km openclimatefix/nimrod-uk-1km-test openclimatefix/nimrod-uk-1km-validation polinaeterna/earn raghav66/whisper-gpt tensorcat/wikipedia-japanese vialibre/spanish3bwc_extended_vocab)


severo commented Oct 17, 2022

And I deleted the old collections
