Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-38589: Fix repeated reads with stream handle #49

Merged
merged 15 commits into from Apr 11, 2023
Merged

Conversation

timj
Copy link
Member

@timj timj commented Apr 5, 2023

This works around a bug found in wsgidav where reading byte ranges past the end of file return the entire file contents and not 416 status code (see mar10/wsgidav#281).

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

The current position should be reported as the number of
bytes that were successfully read.
This passes on file but fails on S3 and HTTP.
natelust and others added 2 commits April 10, 2023 10:59
Both s3 and http file handles use byte ranges to request subsequent
byte reads. Fix an issue where the requested range is past the end
of the file by catching response codes and requesting the rest of
the file where appropriate. The handles will return empty byte
string when there is nothing remaining in the file.
We have decided to change tack and use the content-range header
to determine EOF rather than forcing an additional read from
the server to trigger a 416 status code.
@codecov
Copy link

codecov bot commented Apr 10, 2023

Codecov Report

Patch coverage: 85.88% and project coverage change: +0.22 🎉

Comparison is base (c73dc92) 85.47% compared to head (45af22f) 85.69%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #49      +/-   ##
==========================================
+ Coverage   85.47%   85.69%   +0.22%     
==========================================
  Files          27       27              
  Lines        3732     3804      +72     
  Branches      767      781      +14     
==========================================
+ Hits         3190     3260      +70     
+ Misses        428      426       -2     
- Partials      114      118       +4     
Impacted Files Coverage Δ
.../resources/_resourceHandles/_httpResourceHandle.py 72.17% <68.75%> (+6.61%) ⬆️
...st/resources/_resourceHandles/_s3ResourceHandle.py 80.50% <90.47%> (+0.78%) ⬆️
python/lsst/resources/tests.py 97.72% <100.00%> (+0.13%) ⬆️
tests/test_http.py 94.69% <100.00%> (+0.02%) ⬆️

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@timj timj marked this pull request as ready for review April 10, 2023 19:13
@timj timj requested a review from rra April 10, 2023 19:38
# server.
if "Content-Range" in resp.headers:
content_range = resp.headers["Content-Range"]
_, range_string = content_range.split(" ")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did have a quick look to see if there was some pre-existing code I could use that would parse the Content-Range header for me but I didn't find anything standalone.

@rra
Copy link
Member

rra commented Apr 11, 2023

Not sure why GitHub added some of my comments more than once....

TextIOWrapper can call flush as part of seek, and so we have
to support the call in read-only mode.
If the server supports gzip encoding (which is the default for
urllib) the byte ranges refer to bytes in the compressed version
of the content. This can break the client because there is no
guarantee that those bytes can be uncompressed. Instead disable
the Accept-Encoding header so that the server is forced to
use the original byte range.
Rather than always contacting the server one more time and
using 416 status code to indicate EOF, instead look at the
Content-Range header to determine EOF or else use the knowledge
that the number of bytes received is less than the number
of bytes requested to indicate EOF.
Easier to understand than integers.
@timj
Copy link
Member Author

timj commented Apr 11, 2023

Not sure why GitHub added some of my comments more than once....

@rra I don't see doubles but there are two comments you made that I got in email but which are no longer visible to me on the PR so maybe you thought there were doubles and deleted some but they weren't really doubles?

This found a bug in S3 handle.

For now we can not pass byte range to wsgidav test server that
is wholly off the end of the file. Will add a test when upstream
is fixed.
Copy link
Member

@rra rra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I am surprised that Python doesn't allow .split(""); that's kind of annoying from a clarity standpoint, but this certainly works.

@timj timj merged commit 267b0d9 into main Apr 11, 2023
15 checks passed
@timj timj deleted the tickets/DM-38589 branch April 11, 2023 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants