
Conversation

@hsinfang
Collaborator

This adds retries for such errors.

count = 0
while count <= 20:
    try:
        md = json.loads(sidecar.read())
Member

The json package is slow in general, so if you want this to be as fast as possible, consider switching to pydantic_core.from_json.

Member

Premature optimization is premature.

Aside from Pydantic's usual pains, I assume this would introduce a dependency on the full sidecar schema and not just the presence of a GROUPID key?

Member

There is no schema check involved in this call. This is the Rust-based JSON parser used by pydantic, not pydantic validation itself. It was a suggestion, not a requirement. It's much faster, but if the Python json parsing is already fast enough here then being even faster isn't going to help much.
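
For reference, a minimal sketch of the suggested swap (pydantic_core.from_json accepts either str or bytes and does no model validation; the access to a GROUPID key is only an illustration based on the discussion above):

from pydantic_core import from_json

# sidecar is the ResourcePath from the code under review
md = from_json(sidecar.read())  # Rust-based JSON parsing only, no pydantic schema validation
group_id = md["GROUPID"]        # only the GROUPID key matters here, per the comment above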

@hsinfang hsinfang requested a review from kfindeisen November 18, 2025 19:31
Member

@kfindeisen kfindeisen left a comment

Looks good, though I don't understand the rewrite of the read loop in _get_group_id_from_sidecar.

Comment on lines 399 to 407
sidecar : `ResourcePath`
    A `ResourcePath` object of a sidecar JSON file.
Member

@kfindeisen kfindeisen Nov 18, 2025

Please use fully qualified names: `~lsst.resources.ResourcePath`. Numpydoc(?) object lookups don't use the module-level imports.
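
For example, the parameter entry might read (an illustration of the fully qualified form, not the author's exact wording):

sidecar : `~lsst.resources.ResourcePath`
    A `~lsst.resources.ResourcePath` object of a sidecar JSON file.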

The sidecar file normally show up before the image.
If not present, wait a bit but not too long for the file.
Sometimes, the object store gives other transient
ClientErrors, which are retried in a longer timescale.
Member

The wrapping seems a bit tight. Why not use 80 characters?

    return group_id


@retry(2, botocore.exceptions.ClientError, wait=5)
Member

If you're using attempted reads to check for file existence, that increases the chance of reading during a write. Should the number of retries be increased?

Member

ResourcePath has its own internal retries for S3 connections (using backoff).

Collaborator Author

IIRC, ResourcePath's internal retries currently don't retry the botocore.exceptions.ClientError kind of error that we have been encountering from the embargo S3 recently.

Member

I think the feeling was that ClientError was never retryable (e.g. because of auth problems). If you have examples of ClientErrors that are amenable to retries, then we can modify resources.

Member

@kfindeisen kfindeisen Nov 18, 2025

See https://rubinobs.atlassian.net/browse/DM-53057; we're getting ClientError: An error occurred (ConcurrentModification) when calling the GetObject operation: None.

Unfortunately, this is bad API design on Boto's part, and the only advice I see is to parse the exception message. 😱 So assuming ClientError is user error might be the lesser evil.

Member

Thanks. That definitely doesn't seem like a client error.

Member

@kfindeisen kfindeisen Nov 18, 2025

Ok, maybe not so much bad design as bad documentation (the existence of ClientError.response is buried here, and it's still example-based docs instead of any explicit guarantee that these keys are defined):

except botocore.exceptions.ClientError as error:
    if error.response['Error']['Code'] == 'LimitExceededException':
except botocore.exceptions.ClientError as err:
    if err.response['Error']['Code'] == 'InternalError': # Generic error

As for ClientError itself,

This is a general exception when an error response is provided by an AWS service to your Boto3 client’s request.
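
Based on those excerpts, a retry predicate could inspect the response code rather than the message text. A sketch (the function name is hypothetical, and treating ConcurrentModification as retryable is an assumption drawn from the DM-53057 error quoted above):

import botocore.exceptions


def is_retryable_client_error(exc: botocore.exceptions.ClientError) -> bool:
    # The 'Error'/'Code' keys follow the boto3 doc examples quoted above.
    code = exc.response.get("Error", {}).get("Code", "")
    return code == "ConcurrentModification"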

Collaborator Author

I'm keeping this retry here. If `ResourcePath`'s internal retries change in the future, we should revisit.

Collaborator Author

Also I'm increasing the number of retries to 4.

"""
return self._prep_pipeline(pipeline_file).to_graph()

@connect.retry(2, DATASTORE_EXCEPTIONS, wait=repo_retry)
Member

@kfindeisen kfindeisen Nov 18, 2025

DATASTORE_EXCEPTIONS and repo_retry both assume that it's a Butler operation being retried. In particular, repo_retry is as large as it is to avoid spamming the Butler registry and making congestion problems even worse.

For a pure S3 operation I'd recommend using a smaller delay (EDIT: or the built-in retry @timj mentioned above).

Collaborator Author

Thanks. I changed this to retry specifically botocore.exceptions.ClientError and lowered the wait to 5 seconds.
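
Presumably ending up along the lines of the following (a sketch based on the decorator shown earlier; the retry count here is simply carried over from the original line and may differ in the actual change):

@connect.retry(2, botocore.exceptions.ClientError, wait=5)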


import botocore.exceptions
import requests
from pydantic_core import from_json
Member

If you keep this, I suggest using a plain import so that it's more obvious where from_json comes from.
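
That is, something along these lines (a sketch of the plain-import style being suggested):

import pydantic_core

# sidecar is the ResourcePath from the code under review
md = pydantic_core.from_json(sidecar.read())  # the call site now makes the origin of from_json explicit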

@hsinfang hsinfang force-pushed the tickets/DM-53057 branch 3 times, most recently from 87318ae to b10b944 Compare November 18, 2025 23:17
Previously, we only retried when the sidecar did not exist, to handle the occasional slower file transfer from the summit.

But it is possible that the sidecar exists yet is being re-written, and hence cannot be read temporarily.

We have seen intermittent ConcurrentModification errors from the embargo Ceph S3. These could have been due to the summit re-sending an already-successful transfer, or to other transient storage issues. Hence, this retries on a longer timescale when an S3 ClientError is seen.

This also combines the existence check and the file read into one operation, to avoid a race.
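
A minimal sketch of the combined check-and-read approach described above (the function name, exception handling, sleep interval, and failure behaviour are assumptions for illustration, not the merged code):

import json
import time


# Transient S3 ClientErrors are retried separately, on a longer timescale,
# by the @retry decorator shown earlier in this review.
def _read_group_id(sidecar):
    """Return the GROUPID from a sidecar JSON file, waiting briefly if needed."""
    count = 0
    while count <= 20:
        try:
            # Read and parse in one step: there is no separate existence check,
            # so a file appearing between a check and the read cannot race us.
            md = json.loads(sidecar.read())
            return md["GROUPID"]
        except FileNotFoundError:
            # Sidecar has not arrived from the summit yet; wait a bit, but not too long.
            count += 1
            time.sleep(1)
    raise TimeoutError(f"Sidecar {sidecar} never appeared.")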
@hsinfang hsinfang merged commit 2b2055e into main Nov 19, 2025
11 checks passed
@hsinfang hsinfang deleted the tickets/DM-53057 branch November 19, 2025 00:16