Set number of retries per error code #2758

Merged
merged 4 commits into from
Apr 30, 2024

Collaborator

severo commented Apr 30, 2024

See https://huggingface.slack.com/archives/C04L6P8KNQ5/p1714480880576699?thread_ts=1714033939.429809&cid=C04L6P8KNQ5 (internal)

Errors like CreateCommitError should always be retried, because they mean the Hub was down, which should only ever be a temporary situation. I raised the retry limit to 30 for that error code (instead of 3), and set the same limit of 30 for other error codes as well: HfHubError, LockedDatasetTimeoutError, PreviousStepStillProcessingError.
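
To make the idea concrete, here is a minimal sketch of a per-error-code retry limit. It is not the code from this PR: the constant names, the `max_failed_runs` and `should_retry` helpers, and the strict `<` comparison are all assumptions made for illustration. Only the error code names and the values 3 and 30 come from the description above.

```python
# Hypothetical sketch of a per-error-code retry limit.
# All names below are illustrative assumptions, not the project's actual API.

DEFAULT_MAX_FAILED_RUNS = 3    # limit mentioned above as the previous value
EXTENDED_MAX_FAILED_RUNS = 30  # limit mentioned above for "temporary" errors

# Error codes listed in the PR description, mapped to the higher limit.
MAX_FAILED_RUNS_BY_ERROR_CODE: dict[str, int] = {
    "CreateCommitError": EXTENDED_MAX_FAILED_RUNS,
    "HfHubError": EXTENDED_MAX_FAILED_RUNS,
    "LockedDatasetTimeoutError": EXTENDED_MAX_FAILED_RUNS,
    "PreviousStepStillProcessingError": EXTENDED_MAX_FAILED_RUNS,
}


def max_failed_runs(error_code: str) -> int:
    """Return the retry limit for an error code, falling back to the default."""
    return MAX_FAILED_RUNS_BY_ERROR_CODE.get(error_code, DEFAULT_MAX_FAILED_RUNS)


def should_retry(error_code: str, failed_runs: int) -> bool:
    """Assumption: retry as long as failed_runs is strictly below the limit."""
    return failed_runs < max_failed_runs(error_code)
```

In this sketch, an error code that is not in the mapping falls back to the default limit of 3; whether such codes are retried at all in the real project is not stated here.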

severo merged commit 94981e0 into main Apr 30, 2024
19 of 20 checks passed
severo deleted the set_number_of_retries_per_error_code branch April 30, 2024 13:39
Collaborator Author

severo commented Apr 30, 2024

The CI error is unrelated; see #2759.

Collaborator Author

severo commented Apr 30, 2024

It's working as expected: the first backfill on retryable errors created a lot of new jobs.

[Screenshot: 2024-04-30 at 16:08:09]

Collaborator Author

severo commented Apr 30, 2024

And the number of CreateCommitError entries is decreasing; it should soon reach zero.

[Screenshot: 2024-04-30 at 17:16:34]

Collaborator Author

severo commented May 2, 2024

Is there an issue with the metrics? We never get back to zero:

[Screenshots: 2024-05-02 at 13:37:52, 13:38:35, and 13:38:09]

Collaborator Author

severo commented May 2, 2024

For PreviousStepStillProcessingError, it seems to be caused by the previous step having failed with JobManagerCrashedError. When we detect that a job has crashed, maybe we're not creating jobs for its children?

I opened an issue: #2765

Collaborator Author

severo commented May 2, 2024

Also, the metrics count cache entries whether details.copied_from_artifact is false or true. That is, we currently have only 4 datasets with CreateCommitError, but 40 cache entries once all the following steps are included, and the metrics above show 40.
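
The difference between the two counts can be illustrated with a small sketch. The entries below are toy data invented for the example, and the entry shape (`dataset`, `kind`, `error_code`, and a boolean `details.copied_from_artifact`) and the helper names are assumptions, not the project's actual schema or metrics code.

```python
# Hypothetical sketch: counting cache entries vs. affected datasets.
# The data and field layout are invented for illustration only.

def make_entry(dataset: str, kind: str, copied: bool) -> dict:
    """Build a toy cache entry with a CreateCommitError."""
    return {
        "dataset": dataset,
        "kind": kind,
        "error_code": "CreateCommitError",
        "details": {"copied_from_artifact": copied},
    }

# Two toy datasets: one original error each, copied to two following steps.
entries = [
    make_entry("user/a", "step-1", copied=False),
    make_entry("user/a", "step-2", copied=True),
    make_entry("user/a", "step-3", copied=True),
    make_entry("user/b", "step-1", copied=False),
    make_entry("user/b", "step-2", copied=True),
    make_entry("user/b", "step-3", copied=True),
]


def count_entries(entries: list[dict], error_code: str, include_copied: bool = True) -> int:
    """Count cache entries with the error code, optionally skipping copied ones."""
    return sum(
        1
        for e in entries
        if e["error_code"] == error_code
        and (include_copied or not e["details"].get("copied_from_artifact", False))
    )


def count_datasets(entries: list[dict], error_code: str) -> int:
    """Count distinct datasets that have at least one entry with the error code."""
    return len({e["dataset"] for e in entries if e["error_code"] == error_code})
```

With this toy data, counting every entry gives a larger number than counting distinct datasets or only the non-copied entries, which is the same kind of gap as the 40 versus 4 described above.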

For the 4 datasets, we reached failed_runs: 30, so we must have a bug here. One of these is https://huggingface.co/datasets/venetis/VMMRdb_make_model_test. I'm not sure what is happening. Opening an issue: #2766
