DM-38589: Fix repeated reads with stream handle #49

timj · 2023-04-05T19:05:27Z

This works around a bug found in wsgidav where reading a byte range past the end of the file returns the entire file contents rather than a 416 status code (see mar10/wsgidav#281).

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes

The current position should be reported as the number of bytes that were successfully read.

This passes on file but fails on S3 and HTTP.

Both the S3 and HTTP file handles use byte ranges to request subsequent byte reads. Fix an issue where the requested range is past the end of the file by catching response codes and requesting the rest of the file where appropriate. The handles will return an empty byte string when there is nothing remaining in the file.
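To illustrate the intended read semantics, here is a minimal sketch that uses an in-memory payload instead of a real S3 or HTTP backend. The class `RangeReader` and its `_fetch` helper are hypothetical names invented for this example and are not part of lsst.resources. It shows the position advancing by the number of bytes actually received, and an empty byte string being returned once nothing remains.

```python
class RangeReader:
    """Hypothetical read handle that fetches inclusive byte ranges."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload
        self._pos = 0

    def _fetch(self, start: int, end: int) -> bytes:
        # Stand-in for a ranged request to a remote store. Slicing past the
        # end of the payload simply yields fewer (or zero) bytes.
        return self._payload[start : end + 1]

    def read(self, size: int) -> bytes:
        if size <= 0:
            return b""
        data = self._fetch(self._pos, self._pos + size - 1)
        # Advance by the bytes actually received, not the bytes requested.
        self._pos += len(data)
        return data

    def tell(self) -> int:
        return self._pos


reader = RangeReader(b"abcdefghij")
chunks = [reader.read(4) for _ in range(4)]
```

After the four reads, `chunks` holds three non-empty pieces followed by an empty byte string, and `tell()` reports the payload length rather than the total number of bytes requested.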

We have decided to change tack and use the content-range header to determine EOF rather than forcing an additional read from the server to trigger a 416 status code.

codecov · 2023-04-10T18:05:46Z

Codecov Report

Patch coverage: 85.88% and project coverage change: +0.22 🎉

Comparison is base (c73dc92) 85.47% compared to head (45af22f) 85.69%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #49      +/-   ##
==========================================
+ Coverage   85.47%   85.69%   +0.22%     
==========================================
  Files          27       27              
  Lines        3732     3804      +72     
  Branches      767      781      +14     
==========================================
+ Hits         3190     3260      +70     
+ Misses        428      426       -2     
- Partials      114      118       +4

Impacted Files	Coverage Δ
.../resources/_resourceHandles/_httpResourceHandle.py	`72.17% <68.75%> (+6.61%)`	⬆️
...st/resources/_resourceHandles/_s3ResourceHandle.py	`80.50% <90.47%> (+0.78%)`	⬆️
python/lsst/resources/tests.py	`97.72% <100.00%> (+0.13%)`	⬆️
tests/test_http.py	`94.69% <100.00%> (+0.02%)`	⬆️

... and 1 file with indirect coverage changes

timj · 2023-04-10T19:18:55Z

python/lsst/resources/_resourceHandles/_httpResourceHandle.py

+        # server.
+        if "Content-Range" in resp.headers:
+            content_range = resp.headers["Content-Range"]
+            _, range_string = content_range.split(" ")


I did have a quick look to see if there was some pre-existing code I could use that would parse the Content-Range header for me but I didn't find anything standalone.
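For reference, a small standalone parser could look something like the sketch below. It is not the code from this PR: `ContentRange` and `parse_content_range` are hypothetical names, and it only handles the `bytes` unit in the forms `bytes START-END/TOTAL`, `bytes START-END/*` and `bytes */TOTAL`.

```python
from typing import NamedTuple, Optional


class ContentRange(NamedTuple):
    start: Optional[int]  # None for an unsatisfied range ("*")
    end: Optional[int]    # Inclusive end position, None for "*"
    total: Optional[int]  # None when the server reports "*"


def parse_content_range(value: str) -> ContentRange:
    """Parse a Content-Range header value that uses units of bytes."""
    units, _, range_string = value.strip().partition(" ")
    if units != "bytes" or not range_string:
        raise ValueError(f"Unsupported Content-Range value: {value!r}")
    span, _, total_str = range_string.partition("/")
    # int() raises ValueError if the total is missing or malformed.
    total = None if total_str == "*" else int(total_str)
    if span == "*":
        return ContentRange(None, None, total)
    start_str, _, end_str = span.partition("-")
    return ContentRange(int(start_str), int(end_str), total)
```

Using `partition` rather than unpacking the result of `split(" ")` avoids an unpacking error when the header has an unexpected number of fields; malformed values surface as `ValueError` instead.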

rra · 2023-04-11T15:23:31Z

Not sure why GitHub added some of my comments more than once....

TextIOWrapper can call flush as part of seek, and so we have to support the call in read-only mode.

If the server supports gzip encoding (which is the default for urllib), the byte ranges refer to bytes in the compressed version of the content. This can break the client because there is no guarantee that those bytes can be uncompressed. Instead, disable the Accept-Encoding header so that the server is forced to use the original byte range.
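As a rough illustration, the headers for a ranged request might be built as in the sketch below. The function name `build_range_headers` is hypothetical, and sending `Accept-Encoding: identity` is one assumed way of asking for uncompressed content; the PR itself may disable the header differently.

```python
def build_range_headers(start: int, size: int) -> dict[str, str]:
    """Build headers asking for `size` uncompressed bytes from `start`."""
    if start < 0 or size <= 0:
        raise ValueError("start must be >= 0 and size must be > 0")
    end = start + size - 1  # HTTP byte ranges are inclusive at both ends.
    return {
        "Range": f"bytes={start}-{end}",
        # Assumption: "identity" requests no content coding, so the byte
        # range refers to the original, uncompressed bytes.
        "Accept-Encoding": "identity",
    }


headers = build_range_headers(0, 100)
```

The resulting dictionary can be passed as the `headers` argument of whichever HTTP client is in use.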

Rather than always contacting the server one more time and relying on a 416 status code to indicate EOF, look at the Content-Range header to determine EOF, or else treat receiving fewer bytes than were requested as an indication of EOF.
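The decision described here could be sketched roughly as follows. The function `reached_eof` is a hypothetical helper written for this example, not the implementation in this PR, and it takes a plain dictionary of headers (a real client's header mapping is usually case-insensitive).

```python
from typing import Mapping


def reached_eof(headers: Mapping[str, str], requested: int, received: int) -> bool:
    """Guess whether a ranged read reached the end of the resource."""
    content_range = headers.get("Content-Range")
    if content_range is not None:
        units, _, range_string = content_range.partition(" ")
        span, _, total = range_string.partition("/")
        end = span.partition("-")[2]
        # Only trust the header when it uses bytes and has numeric fields.
        if units == "bytes" and end.isdigit() and total.isdigit():
            # Use >= in case the reported end is off the end of the file.
            return int(end) >= int(total) - 1
    # Fallback: a short read means there was nothing more to send.
    return received < requested
```

When the total size is reported as `*`, or the header is absent, the sketch falls back to comparing the number of bytes received with the number requested.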

Easier to understand than integers.

timj · 2023-04-11T20:09:12Z

Not sure why GitHub added some of my comments more than once....

@rra I don't see doubles, but there are two comments you made that I got in email and which are no longer visible to me on the PR. Maybe you thought there were doubles and deleted some, but they weren't really doubles?

Just in case the reported end position is off the end of the file.

This found a bug in the S3 handle. For now we cannot pass a byte range to the wsgidav test server that is wholly off the end of the file. Will add a test when upstream is fixed.

rra

Looks good! I am surprised that Python doesn't allow .split(""); that's kind of annoying from a clarity standpoint, but this certainly works.

timj mentioned this pull request Apr 10, 2023

DM-38589: (Alternative) Fix EOF for HTTP file handle #52

Closed

2 tasks

timj added 3 commits April 10, 2023 10:52

Report the byte range when the http read fails

40679a3

Update current position with bytes read not bytes requested

f2766d1

The current position should be reported as the number of bytes that were successfully read.

Add test for repeated reads

bf4b4a8

This passes on file but fails on S3 and HTTP.

timj force-pushed the tickets/DM-38589 branch from 7aca79b to 127f89e Compare April 10, 2023 17:54

natelust and others added 2 commits April 10, 2023 10:59

Revert the http resource handle change

22b96b0

We have decided to change tack and use the content-range header to determine EOF rather than forcing an additional read from the server to trigger a 416 status code.

timj force-pushed the tickets/DM-38589 branch from 43ac467 to 5c80f0a Compare April 10, 2023 18:00

timj force-pushed the tickets/DM-38589 branch from b7b58a1 to 4588306 Compare April 10, 2023 19:11

timj marked this pull request as ready for review April 10, 2023 19:13

timj requested a review from rra April 10, 2023 19:38

timj commented Apr 10, 2023

rra reviewed Apr 11, 2023

timj force-pushed the tickets/DM-38589 branch from a47d4c5 to a8bf035 Compare April 11, 2023 17:10

timj added 6 commits April 11, 2023 10:12

Support flush method on http handle in read-only mode

8c7ff88

TextIOWrapper can call flush as part of seek, and so we have to support the call in read-only mode.

Add additional check that using seek and rereading will work

34b8662

Use named HTTP status codes

fe863d2

Easier to understand than integers.
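A short sketch of what named status codes look like in practice, using the standard library's `http.HTTPStatus`. The function `classify_range_response` and its return labels are hypothetical and only serve the example; the PR may use a different source of named constants.

```python
from http import HTTPStatus


def classify_range_response(status_code: int) -> str:
    """Label the outcome of a ranged GET using named status codes."""
    if status_code == HTTPStatus.PARTIAL_CONTENT:  # 206
        return "partial"
    if status_code == HTTPStatus.REQUESTED_RANGE_NOT_SATISFIABLE:  # 416
        return "eof"
    if status_code == HTTPStatus.OK:  # 200: the server ignored the range
        return "full"
    raise ValueError(f"Unexpected status code: {status_code}")
```

Because `HTTPStatus` members are integer enums, they compare equal to the plain integer status codes returned by HTTP clients.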

Add news fragment

38c3d19

timj force-pushed the tickets/DM-38589 branch from a8bf035 to 38c3d19 Compare April 11, 2023 17:12

timj added 2 commits April 11, 2023 13:30

Check that we have got units of bytes before parsing Content-Range

e13cbb6

Be a bit more defensive in determining EOF

b58e1ad

Just in case the reported end position is off the end of the file.

timj force-pushed the tickets/DM-38589 branch from b3d9395 to b58e1ad Compare April 11, 2023 20:31

timj added 2 commits April 11, 2023 14:29

Remember to increment the position when reading from S3 handle

ed3c4be

Significantly expand handle read tests

45af22f

This found a bug in the S3 handle. For now we cannot pass a byte range to the wsgidav test server that is wholly off the end of the file. Will add a test when upstream is fixed.

rra approved these changes Apr 11, 2023

timj merged commit 267b0d9 into main Apr 11, 2023
15 checks passed

timj deleted the tickets/DM-38589 branch April 11, 2023 22:10
