Fixes issue where forced retry would cause timeout. #241
Conversation
Side note: This also fixes TravisCI builds that have been failing on Python 3.3.

This fixes https://issuetracker.google.com/issues/111486730 -- if an expired access token is used from a cache, a forced retry will occur and attempt to re-read bytes from the compression stream. This fails because the bytes have already been consumed, and the server hangs waiting for us to send the bytes intended from the first attempt. In gsutil, this manifested itself as an SSLError with the message 'The read operation timed out'.

While I'd normally avoid reading an entire stream into memory at once, this edge case should be fine because:

- We start with the entire (uncompressed) byte sequence to begin with and only create the stream because that's the format needed for compression.
- This is only used for uploads using the SIMPLE strategy.

Note that this isn't an issue for resumable uploads, since the first request in that flow creates the resumable upload session (which isn't a media transfer request, so it doesn't try to consume any bytes from the stream), so the credential refresh (if needed) would happen on that request.
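(For illustration of the failure mode -- a minimal standard-library sketch, not the apitools code itself: once a stream has been consumed, a retry that re-reads it gets nothing back, which is what leaves the server waiting.)

import io

# Hypothetical stand-in for the compressed request body stream.
body = io.BytesIO(b"compressed request payload")

first_attempt = body.read()   # consumed by the first (failed) request
retry_attempt = body.read()   # forced retry re-reads the same stream

assert first_attempt == b"compressed request payload"
assert retry_attempt == b""   # nothing left to send, so the server just waits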
@thobrla Adding you since you oversaw the original implementation of the gzip stuff. PTAL and sanity check :)
# bytes container.
http_request.body = (
    compression.CompressStream(
        six.BytesIO(http_request.body))[0].read())
This reads the entire contents of the stream into memory, whereas the CompressStream/StreamingBuffer classes were designed specifically to avoid that. Should we instead fix this at the forced retry level (i.e., the retry should not call read() without rewinding the stream or buffering the bytes that it reads)?
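(A rough sketch of that alternative, assuming a seekable body stream; the helper and its names are hypothetical, not the actual apitools retry path:)

def send_with_rewind(stream, send_fn, max_attempts=2):
    # Hypothetical retry wrapper: remember where the body starts so a forced
    # retry re-reads the same bytes instead of an already-drained stream.
    start = stream.tell()
    for _ in range(max_attempts):
        stream.seek(start)
        response = send_fn(stream.read())
        if response.status_code < 500:   # hypothetical success/retry check
            return response
    return response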
The comment on line 736 mentions that "Both multipart and media request uploads will read the entire stream into memory" -- so this doesn't really change the behavior for the simple upload case; it just does the whole read now rather than later... right? Or am I reading this incorrectly?
If we were calling CompressStream with some arguments that limited the length (like we do elsewhere, but not for simple media and multipart requests here), I'd agree with your point... but calling it with no argument for the length parameter will just read in the whole stream -- it writes the entire contents of the input stream to the output "stream" (the underlying StreamingBuffer), so the whole thing is just sitting there in memory waiting for someone to read from the underlying StreamingBuffer. Those reads may be performed incrementally, but they'll just knock bytes off of the underlying buffer with each read.

The only benefit we lose by converting this directly to a string is not being able to do that incremental reading that slowly reduces the memory used as the bytes are read off the underlying StreamingBuffer... and I'd argue that for large files where this actually matters, users should be utilizing the RESUMABLE strategy rather than SIMPLE.
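(A rough standard-library stand-in for the trade-off described above -- the whole input gets compressed up front either way, and the only difference is whether the result is drained incrementally or read in one go. DrainingBuffer here is hypothetical, not the real StreamingBuffer:)

import gzip
import io

class DrainingBuffer(object):
    # Hypothetical stand-in for StreamingBuffer: each read removes bytes from
    # the front of the buffer, so memory use shrinks as data is consumed.
    def __init__(self, data):
        self._data = bytearray(data)

    def read(self, size=-1):
        if size < 0:
            size = len(self._data)
        chunk = bytes(self._data[:size])
        del self._data[:size]
        return chunk

source = io.BytesIO(b"x" * (1 << 20))

# With no length limit, the entire input is compressed immediately, so all of
# the compressed bytes are already sitting in memory before any read happens.
out = DrainingBuffer(gzip.compress(source.read()))

# Incremental reads only release that memory gradually; reading it all at once
# into a single bytes object (as this change does) gives up just that.
body = out.read()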
Thoughts?
Thanks - I agree; for SIMPLE this is fine.
LGTM