New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-38589: Fix repeated reads with stream handle #49
Conversation
The current position should be reported as the number of bytes that were successfully read.
This passes on file but fails on S3 and HTTP.
Both s3 and http file handles use byte ranges to request subsequent byte reads. Fix an issue where the requested range is past the end of the file by catching response codes and requesting the rest of the file where appropriate. The handles will return empty byte string when there is nothing remaining in the file.
We have decided to change tack and use the content-range header to determine EOF rather than forcing an additional read from the server to trigger a 416 status code.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #49 +/- ##
==========================================
+ Coverage 85.47% 85.69% +0.22%
==========================================
Files 27 27
Lines 3732 3804 +72
Branches 767 781 +14
==========================================
+ Hits 3190 3260 +70
+ Misses 428 426 -2
- Partials 114 118 +4
... and 1 file with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
# server. | ||
if "Content-Range" in resp.headers: | ||
content_range = resp.headers["Content-Range"] | ||
_, range_string = content_range.split(" ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did have a quick look to see if there was some pre-existing code I could use that would parse the Content-Range header for me but I didn't find anything standalone.
Not sure why GitHub added some of my comments more than once.... |
TextIOWrapper can call flush as part of seek, and so we have to support the call in read-only mode.
If the server supports gzip encoding (which is the default for urllib) the byte ranges refer to bytes in the compressed version of the content. This can break the client because there is no guarantee that those bytes can be uncompressed. Instead disable the Accept-Encoding header so that the server is forced to use the original byte range.
Rather than always contacting the server one more time and using 416 status code to indicate EOF, instead look at the Content-Range header to determine EOF or else use the knowledge that the number of bytes received is less than the number of bytes requested to indicate EOF.
Easier to understand than integers.
@rra I don't see doubles but there are two comments you made that I got in email but which are no longer visible to me on the PR so maybe you thought there were doubles and deleted some but they weren't really doubles? |
Just in case the reported end position is off the end of the file.
This found a bug in S3 handle. For now we can not pass byte range to wsgidav test server that is wholly off the end of the file. Will add a test when upstream is fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I am surprised that Python doesn't allow .split("")
; that's kind of annoying from a clarity standpoint, but this certainly works.
This works around a bug found in wsgidav where reading byte ranges past the end of file return the entire file contents and not 416 status code (see mar10/wsgidav#281).
Checklist
doc/changes