DM-53057: Intermittent ConcurrentModification errors on S3 reads #358
Conversation
python/shared/raw.py
Outdated
count = 0
while count <= 20:
    try:
        md = json.loads(sidecar.read())
The json package is slow in general, so if you want this to be as fast as possible, consider switching to pydantic_core.from_json.
Premature optimization is premature.
Aside from Pydantic's usual pains, I assume this would introduce a dependency on the full sidecar schema and not just the presence of a GROUPID key?
There is no schema check involved in this call. This is the Rust-based JSON parser used by Pydantic, not Pydantic validation itself. It was a suggestion, not a requirement. It's much faster, but if the Python json parsing is already fast enough, being even faster isn't going to help much.
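For illustration, a minimal sketch of the suggested swap, assuming the sidecar only needs the GROUPID key mentioned above (the helper name `_read_group_id` is hypothetical; no Pydantic model or validation is involved):

```python
from pydantic_core import from_json


def _read_group_id(sidecar):
    """Parse the sidecar JSON and return its GROUPID, without validation.

    ``sidecar`` is assumed to be an lsst.resources.ResourcePath as in the
    diff above; from_json accepts the raw bytes returned by read().
    """
    md = from_json(sidecar.read())
    return md.get("GROUPID")
```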
kfindeisen left a comment
Looks good, though I don't understand the rewrite of the read loop in _get_group_id_from_sidecar.
python/shared/raw.py
Outdated
sidecar : `ResourcePath`
    A `ResourcePath` object of a sidecar JSON file.
Please use fully qualified names: ~lsst.resources.ResourcePath. Numpydoc(?) object lookups don't use the module-level imports.
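For example, the parameter description might read as follows (a sketch only; the function name comes from the review comment above, and the rest of the docstring is illustrative):

```python
def _get_group_id_from_sidecar(sidecar):
    """Return the group ID recorded in a sidecar file.

    Parameters
    ----------
    sidecar : `~lsst.resources.ResourcePath`
        A `~lsst.resources.ResourcePath` object of a sidecar JSON file.
    """
```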
python/shared/raw.py
Outdated
The sidecar file normally show up before the image.
If not present, wait a bit but not too long for the file.
Sometimes, the object store gives other transient
ClientErrors, which are retried in a longer timescale.
The wrapping seems a bit tight. Why not use 80 characters?
python/shared/raw.py
Outdated
    return group_id


@retry(2, botocore.exceptions.ClientError, wait=5)
If you're using attempted reads to check for file existence, that increases the chance of reading during a write. Should the number of retries be increased?
ResourcePath has its own internal retries for S3 connections (using backoff).
IIRC, ResourcePath's internal retries currently don't cover the botocore.exceptions.ClientError kind of error that we have been encountering from the embargo S3 recently.
I think the feeling was that ClientError was never retryable (e.g., because of auth problems). If you have examples of ClientErrors that are amenable to retries, then we can modify resources.
See https://rubinobs.atlassian.net/browse/DM-53057; we're getting ClientError: An error occurred (ConcurrentModification) when calling the GetObject operation: None.
Unfortunately, this is bad API design on Boto's part, and the only advice I see is to parse the exception message. 😱 So assuming ClientError is user error might be the lesser evil.
Thanks. That definitely doesn't seem like a client error.
Ok, maybe not so much bad design as bad documentation (the existence of ClientError.response is buried here, and it's still example-based docs instead of any explicit guarantee that these keys are defined):

    except botocore.exceptions.ClientError as error:
        if error.response['Error']['Code'] == 'LimitExceededException':
            ...

    except botocore.exceptions.ClientError as err:
        if err.response['Error']['Code'] == 'InternalError':  # Generic error
            ...

As for ClientError itself:

> This is a general exception when an error response is provided by an AWS service to your Boto3 client's request.
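A hedged sketch of what "parsing the exception" could look like here, keying off the error code rather than the message text; the set of codes treated as transient is an assumption (ConcurrentModification is the one seen in DM-53057, InternalError comes from the Boto3 example above):

```python
import botocore.exceptions


def is_transient_client_error(error: botocore.exceptions.ClientError) -> bool:
    """Guess whether a ClientError looks transient and worth retrying."""
    # error.response is the parsed service response; Error/Code holds the
    # service-specific error code string.
    code = error.response.get("Error", {}).get("Code", "")
    return code in {"ConcurrentModification", "InternalError"}  # illustrative set
```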
I'm keeping this retry here. If ResourcePath's internal retries change in the future, we should revisit.
Also I'm increasing the number of retries to 4.
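For reference, a minimal sketch of a retry decorator matching the call signature seen in the diff, `retry(max_retries, exceptions, wait=seconds)`; the project's actual implementation may differ:

```python
import functools
import logging
import time

_log = logging.getLogger(__name__)


def retry(max_retries, exceptions, wait=0):
    """Retry the wrapped callable when it raises one of ``exceptions``.

    ``max_retries`` extra attempts are made after the first failure,
    sleeping ``wait`` seconds between attempts.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_retries:
                        raise
                    _log.warning("%s raised %r; retrying in %s s (%d/%d)",
                                 func.__name__, e, wait, attempt + 1, max_retries)
                    time.sleep(wait)
        return wrapper
    return decorator
```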
| """ | ||
| return self._prep_pipeline(pipeline_file).to_graph() | ||
|
|
||
| @connect.retry(2, DATASTORE_EXCEPTIONS, wait=repo_retry) |
DATASTORE_EXCEPTIONS and repo_retry both assume that it's a Butler operation being retried. In particular, repo_retry is as large as it is to avoid spamming the Butler registry and making congestion problems even worse.
For a pure S3 operation I'd recommend using a smaller delay (EDIT: or the built-in retry @timj mentioned above).
Thanks. I changed this to retry specifically on botocore.exceptions.ClientError and lowered the wait to 5 seconds.
python/shared/raw.py
Outdated
import botocore.exceptions
import requests
from pydantic_core import from_json
If you keep this, I suggest keeping this as a plain import so that it's more obvious where the from_json comes from.
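That is, something along these lines (illustrative only; `sidecar` is the ResourcePath from the surrounding code):

```python
# Instead of:
#     from pydantic_core import from_json
#     md = from_json(sidecar.read())
# a plain module import makes the parser's origin explicit at the call site:
import pydantic_core

md = pydantic_core.from_json(sidecar.read())
```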
87318ae to b10b944
b10b944 to b6206c7
Previously, we only retried when the sidecar did not exist, to accommodate occasionally slower file transfers from the summit. But it is possible that the sidecar exists yet is being re-written and hence temporarily cannot be read. We have seen intermittent ConcurrentModification errors from the embargo Ceph S3; these could be due to the summit re-sending a successful transfer or to other transient storage issues. Hence, this retries on a longer timescale when an S3 ClientError is seen. It also combines the existence check and the file read into one operation to avoid a race.
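A hedged sketch of the combined approach described here (the loop bounds, the sleep interval, and the FileNotFoundError raised by ResourcePath.read() for a missing object are assumptions; the project-level @retry(..., botocore.exceptions.ClientError, ...) decorator from the diff would wrap this to handle the transient S3 errors on a longer timescale):

```python
import json
import time

from lsst.resources import ResourcePath


def _get_group_id_from_sidecar(sidecar: ResourcePath) -> str | None:
    """Read GROUPID from a sidecar JSON file, waiting briefly if it is absent.

    Reading directly, instead of checking existence first and then reading,
    leaves no window between the two operations for a race.
    """
    count = 0
    while count <= 20:
        try:
            md = json.loads(sidecar.read())
            return md.get("GROUPID")
        except FileNotFoundError:
            # Sidecar has not arrived yet; wait a bit, but not too long.
            count += 1
            time.sleep(1)
    return None
```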
b6206c7 to fd08e39
This adds retries for such errors.