Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding occasionally fails when streaming #76

Closed
cpcloud opened this issue Oct 15, 2019 · 10 comments · Fixed by #83
Closed

Decoding occasionally fails when streaming #76

cpcloud opened this issue Oct 15, 2019 · 10 comments · Fixed by #83

Comments

@cpcloud
Copy link
Contributor

cpcloud commented Oct 15, 2019

Bug Report

Version

Running from 1987a40.

Platform

Linux nexus 5.3.1 #1-NixOS SMP Sat Sep 21 05:19:47 UTC 2019 x86_64 GNU/Linux

Crates

Affects tonic

Description

When running these lines I occasionally get an error message from prost informing me that a buffer underflow has occurred:

Error: Status { code: Internal, message: "failed to decode Protobuf message: buffer underflow" }

Sometimes this happens, other times it doesn't. It seems to be related to the number of bytes that I try to push through from the server to the client.

After digging around for quite some time, the failing requests all seem to be missing 3 additional bytes after the 5 byte header. When testing the blob-client with 10000 bytes a request ends up having a total length of 10005. After the buffer is advanced 5 bytes, the length is now 10000. The prost::encoding::bytes::merge function then reads off 3 bytes to get the length of the byte array from the proto, which is 10000 and then checks for underflow.

I'm stumped as to how these bytes are missing in some cases and are present in many others.

This prost issue (https://github.com/danburkert/prost/issues/98) suggests using {encode|decode}_length_delimited, but that didn't seem to have any effect when I swapped those calls in in prost.rs.

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 16, 2019

After making the following changes I'm now unable to reproduce the issue:

diff --git a/tonic/src/codec/prost.rs b/tonic/src/codec/prost.rs
index 495dc11..5c29a06 100644
--- a/tonic/src/codec/prost.rs
+++ b/tonic/src/codec/prost.rs
@@ -49,12 +49,14 @@ impl<T: Message> Encoder for ProstEncoder<T> {
 
     fn encode(&mut self, item: Self::Item, buf: &mut BytesMut) -> Result<(), Self::Error> {
         let len = item.encoded_len();
+        let additional = prost::encoding::encoded_len_varint(len as u64);
+        let required = len + additional;
 
-        if buf.remaining_mut() < len {
-            buf.reserve(len);
+        if buf.remaining_mut() < required {
+            buf.reserve(required);
         }
 
-        item.encode(buf)
+        item.encode_length_delimited(buf)
             .map_err(|_| unreachable!("Message only errors if not enough space"))
     }
 }
@@ -68,7 +70,7 @@ impl<U: Message + Default> Decoder for ProstDecoder<U> {
     type Error = Status;
 
     fn decode(&mut self, buf: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
-        Message::decode(buf.take())
+        Message::decode_length_delimited(buf.take())
             .map(Option::Some)
             .map_err(from_decode_error)
     }

However, I'm not exactly sure how to reliably test that this changes solves the problem.

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 16, 2019

but that didn't seem to have any effect when I swapped those calls in in prost.rs

I think I was running a client without the above changes against a server with the changes, which yields the same behavior.

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 16, 2019

I'm still able to reproduce this in my specific application, so the patch above isn't a fix.

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 16, 2019

I can reproduce with blobservice if I run the client in a loop.

@LucioFranco
Copy link
Member

@seanmonstar would hyper be cutting the body off by chance?

@seanmonstar
Copy link
Member

I wouldn't expect it to. Could try wireshark or something to sniff what DATA frames are sent/received.

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 17, 2019

Cool I'll look at it with Wireshark.

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 19, 2019

Looking at the logs coming out of h2 it looks like when 10008 bytes get sent from the server (10000 bytes from the bytes field in the message + 5 bytes for the tonic header + 2 bytes for the encoded length (10000) of the byte array + 1 for the proto tag) the first 10005 make it to the client, and the client tries to decode those 10005 bytes before waiting for the final 3 bytes it needs to fully decode the message.

Is there any reason the decoder would not wait for at minimum the number of bytes required to fully decode a message?

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 19, 2019

@seanmonstar After looking at frames using wireshark, I don't see anything anomalous. 10005 bytes are sent as DATA on stream 1, there are WINDOW_UPDATES on streams 0 and 1, then 3 bytes are sent as DATA on stream 1.

I noticed that between the 10005 bytes being sent the tonic decoding thread gets woken up and tries to decode the message before having all the bytes needed to decode it.

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 20, 2019

I'm not convinced this is related to threading or h2. I think that the issue most likely lies in tonic.

When receiving data by polling the body, it's possible to end up in a situation where the polling results in a buffer of size N where N >= length, but also that N < 5 + length.

Roughly the flow is like this:

  1. Tonic polls for data, gets N bytes, where length <= N < 5 + length
  2. Tonic sees that N bytes have been read, and if N >= length, advances the buffer by 5 bytes
  3. Advancing the buffer reduces the buffer length by 5, now we have that length - 5 <= N < length
  4. Since we need length bytes for the proto, prost fails with a buffer underflow

@cpcloud cpcloud changed the title Decoding occasionally fails when streaming large-ish buffers Decoding occasionally fails when streaming Oct 21, 2019
brentalanmiller pushed a commit to brentalanmiller/tonic that referenced this issue Oct 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants