DM-53057: Intermittent ConcurrentModification errors on S3 reads #358
Conversation
python/shared/raw.py
Outdated
count = 0
while count <= 20:
    try:
        md = json.loads(sidecar.read())
The json package is slow in general, so if you want this to be as fast as possible, consider switching to pydantic_core.from_json.
Premature optimization is premature.
Aside from Pydantic's usual pains, I assume this would introduce a dependency on the full sidecar schema and not just the presence of a GROUPID key?
There is no schema check involved in this call. This is the Rust-based JSON parser used by Pydantic, not Pydantic validation itself. It was a suggestion, not a requirement. It's much faster, but if the Python json parsing is already fast enough, being even faster isn't going to help much.
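For illustration, a minimal sketch of the suggested swap, assuming the sidecar only needs the GROUPID key mentioned above (the helper name `_read_group_id` is hypothetical; no Pydantic model or validation is involved):

```python
from pydantic_core import from_json


def _read_group_id(sidecar):
    """Parse the sidecar JSON and return its GROUPID, without validation.

    ``sidecar`` is assumed to be an lsst.resources.ResourcePath as in the
    diff above; from_json accepts the raw bytes returned by read().
    """
    md = from_json(sidecar.read())
    return md.get("GROUPID")
```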
kfindeisen left a comment
Looks good, though I don't understand the rewrite of the read loop in _get_group_id_from_sidecar.
python/shared/raw.py
Outdated
sidecar : `ResourcePath`
    A `ResourcePath` object of a sidecar JSON file.
Please use fully qualified names: ~lsst.resources.ResourcePath. Numpydoc(?) object lookups don't use the module-level imports.
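For example, the parameter description might read as follows (a sketch only; the function name comes from the review comment above, and the rest of the docstring is illustrative):

```python
def _get_group_id_from_sidecar(sidecar):
    """Return the group ID recorded in a sidecar file.

    Parameters
    ----------
    sidecar : `~lsst.resources.ResourcePath`
        A `~lsst.resources.ResourcePath` object of a sidecar JSON file.
    """
```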
python/shared/raw.py
Outdated
The sidecar file normally show up before the image.
If not present, wait a bit but not too long for the file.
Sometimes, the object store gives other transient
ClientErrors, which are retried in a longer timescale.
The wrapping seems a bit tight. Why not use 80 characters?
python/shared/raw.py
Outdated
    return group_id


@retry(2, botocore.exceptions.ClientError, wait=5)
If you're using attempted reads to check for file existence, that increases the chance of reading during a write. Should the number of retries be increased?
ResourcePath has its own internal retries for S3 connections (using backoff).
IIRC, ResourcePath's internal retries currently don't cover the botocore.exceptions.ClientError kind of error that we have been encountering from the embargo S3 recently.
I think the feeling was that ClientError was never retryable (e.g., because of auth problems). If you have examples of ClientErrors that are amenable to retries, then we can modify resources.
See https://rubinobs.atlassian.net/browse/DM-53057; we're getting ClientError: An error occurred (ConcurrentModification) when calling the GetObject operation: None.
Unfortunately, this is bad API design on Boto's part, and the only advice I see is to parse the exception message. 😱 So assuming ClientError is user error might be the lesser evil.
Thanks. That definitely doesn't seem like a client error.
Ok, maybe not so much bad design as bad documentation (the existence of ClientError.response is buried here, and it's still example-based docs instead of any explicit guarantee that these keys are defined):

    except botocore.exceptions.ClientError as error:
        if error.response['Error']['Code'] == 'LimitExceededException':
            ...

    except botocore.exceptions.ClientError as err:
        if err.response['Error']['Code'] == 'InternalError':  # Generic error
            ...

As for ClientError itself:

> This is a general exception when an error response is provided by an AWS service to your Boto3 client's request.
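A hedged sketch of what "parsing the exception" could look like here, keying off the error code rather than the message text; the set of codes treated as transient is an assumption (ConcurrentModification is the one seen in DM-53057, InternalError comes from the Boto3 example above):

```python
import botocore.exceptions


def is_transient_client_error(error: botocore.exceptions.ClientError) -> bool:
    """Guess whether a ClientError looks transient and worth retrying."""
    # error.response is the parsed service response; Error/Code holds the
    # service-specific error code string.
    code = error.response.get("Error", {}).get("Code", "")
    return code in {"ConcurrentModification", "InternalError"}  # illustrative set
```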
I'm keeping this retry here. If ResourcePath's internal retries change in the future, we should revisit.
Also I'm increasing the number of retries to 4.
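For reference, a minimal sketch of a retry decorator matching the call signature seen in the diff, `retry(max_retries, exceptions, wait=seconds)`; the project's actual implementation may differ:

```python
import functools
import logging
import time

_log = logging.getLogger(__name__)


def retry(max_retries, exceptions, wait=0):
    """Retry the wrapped callable when it raises one of ``exceptions``.

    ``max_retries`` extra attempts are made after the first failure,
    sleeping ``wait`` seconds between attempts.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_retries:
                        raise
                    _log.warning("%s raised %r; retrying in %s s (%d/%d)",
                                 func.__name__, e, wait, attempt + 1, max_retries)
                    time.sleep(wait)
        return wrapper
    return decorator
```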
| """ | ||
| return self._prep_pipeline(pipeline_file).to_graph() | ||
|
|
||
| @connect.retry(2, DATASTORE_EXCEPTIONS, wait=repo_retry) |
DATASTORE_EXCEPTIONS and repo_retry both assume that it's a Butler operation being retried. In particular, repo_retry is as large as it is to avoid spamming the Butler registry and making congestion problems even worse.
For a pure S3 operation I'd recommend using a smaller delay (EDIT: or the built-in retry @timj mentioned above).
Thanks. I changed this to retry specifically on botocore.exceptions.ClientError and lowered the wait to 5 seconds.
python/shared/raw.py
Outdated
import botocore.exceptions
import requests
from pydantic_core import from_json
If you keep this, I suggest keeping this as a plain import so that it's more obvious where the from_json comes from.
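That is, something along these lines (illustrative only; `sidecar` is the ResourcePath from the surrounding code):

```python
# Instead of:
#     from pydantic_core import from_json
#     md = from_json(sidecar.read())
# a plain module import makes the parser's origin explicit at the call site:
import pydantic_core

md = pydantic_core.from_json(sidecar.read())
```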
87318ae to b10b944
b10b944 to b6206c7
Previously, we only retried when the sidecar did not exist, to accommodate occasionally slower file transfers from the summit. But it is possible that the sidecar exists yet is being re-written and hence temporarily cannot be read. We have seen intermittent ConcurrentModification errors from the embargo Ceph S3; these could be due to the summit re-sending a successful transfer or to other transient storage issues. Hence, this retries on a longer timescale when an S3 ClientError is seen. It also combines the existence check and the file read into one operation to avoid a race.
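A hedged sketch of the combined approach described here (the loop bounds, the sleep interval, and the FileNotFoundError raised by ResourcePath.read() for a missing object are assumptions; the project-level @retry(..., botocore.exceptions.ClientError, ...) decorator from the diff would wrap this to handle the transient S3 errors on a longer timescale):

```python
import json
import time

from lsst.resources import ResourcePath


def _get_group_id_from_sidecar(sidecar: ResourcePath) -> str | None:
    """Read GROUPID from a sidecar JSON file, waiting briefly if it is absent.

    Reading directly, instead of checking existence first and then reading,
    leaves no window between the two operations for a race.
    """
    count = 0
    while count <= 20:
        try:
            md = json.loads(sidecar.read())
            return md.get("GROUPID")
        except FileNotFoundError:
            # Sidecar has not arrived yet; wait a bit, but not too long.
            count += 1
            time.sleep(1)
    return None
```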
b6206c7 to fd08e39
This adds retries for such errors.