Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python/Core: Parsing error on client since v1.44.0, UNAUTHENTICATED error converted to UNKNOWN, Stream closed #31657

Closed
eddyashton opened this issue Nov 15, 2022 · 8 comments

Comments

@eddyashton
Copy link

Summary

There may be a bunch of red herrings in here, but I want to give as much details as possible in case any of it turns out to be relevant:

  • We have our own network stack on the server, and we're using Python grpcio on the client
  • We have a StartTx RPC, that we expect to return 401 Unauthenticated errors for some calls, and we test that in our CI
  • We recently added an AMD build configuration, where previously we were running entirely on Intel hardware
  • Additionally, the AMD runs occur in-container, while the Intel runs occur on 'bare' VMs
  • Most of our tests pass, so we're reasonably confident that our network stack and production of gRPC messages is correct
  • One test consistently fails on the AMD platform, because the 401 Unauthenticated error becomes an UNKNOWN error, with details "Stream removed"
  • I confirmed that our server behaviour is identical on both platforms, returning byte-identical HTTP/2 frames
  • With GRPC_TRACE=all, I established that the first discrepancy on the client side appears to be whether a READ operation sees just the initial HEADER frame (in passing runs), or the concatenated HEADER and DATA frames (in failing runs)
  • I tried downgrading the client version, and discovered that this issue does not reproduce on grpcio==1.44.0, but does on all more recent releases

I believe that there is some issue with the network reading/parsing logic, either due to a platform switch or a race condition, potentially introduced/exacerbated since v1.44.0, that results in valid gRPC responses being lost and converted to UNKNOWN: "Stream removed" errors.

What version of gRPC and what language are you using?

Using Python grpcio, initially on v1.50.0. Was able to reproduce this issue on v1.46.0, v1.47.0, and v1.49.1. Was not able to reproduce the issue on v1.44.0.

What operating system (Linux, Windows,...) and version?

$ uname -sr
Linux 5.11.0-rc5+

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:        20.04
Codename:       focal

What runtime / compiler are you using (e.g. python version or version of gcc)

$ python --version
Python 3.8.10

$ pip show grpcio
Name: grpcio
Version: 1.44.0
Summary: HTTP/2-based RPC framework
Home-page: https://grpc.io
Author: The gRPC Authors
Author-email: grpc-io@googlegroups.com
License: Apache License 2.0
Location: /home/edashton-0/CCF/build/env/lib/python3.8/site-packages
Requires: six
Required-by: grpcio-tools

What did you do?

I don't have a minimal repro, but I'm happy to work with you to attempt to construct one. We have discussion of the issue here, including verbose grpc trace logging.

The failing client side code boils down to this:

import kv_pb2_grpc as Service

    anonymous_credentials = grpc.ssl_channel_credentials(
        open("service_cert.pem", "rb").read()
    )
    
    with grpc.secure_channel(
        target=target_public_address,
        credentials=anonymous_credentials,
    ) as channel:
        try:
            rd = Service.KVStub(channel).StartTx(Empty())
        except grpc.RpcError as e:
            assert e.code() == grpc.StatusCode.UNAUTHENTICATED, e

The server produces the following response frames:

00 00 16 01 04 00 00 00 01 48 82 68 01 0f 0d 82 0b e1 5f 8b 1d 75 d0 62 0d 26 3d 4c 74 41 ea

00 00 bf 00 01 00 00 00 01 7b 22 65 72 72 6f 72 22 3a 7b 22 63 6f 64 65 22 3a 22 49 6e 76 61 6c 69 64 41 75 74 68 65 6e 74 69 63 61 74 69 6f 6e 49 6e 66 6f 22 2c 22 64 65 74 61 69 6c 73 22 3a 5b 7b 22 61 75 74 68 5f 70 6f 6c 69 63 79 22 3a 22 45 78 65 63 75 74 6f 72 41 75 74 68 50 6f 6c 69 63 79 22 2c 22 63 6f 64 65 22 3a 22 49 6e 76 61 6c 69 64 41 75 74 68 65 6e 74 69 63 61 74 69 6f 6e 49 6e 66 6f 22 2c 22 6d 65 73 73 61 67 65 22 3a 22 4e 6f 20 45 78 65 63 75 74 6f 72 20 63 65 72 74 69 66 69 63 61 74 65 22 7d 5d 2c 22 6d 65 73 73 61 67 65 22 3a 22 49 6e 76 61 6c 69 64 20 69 6e 66 6f 22 7d 7d

The first is a HEADERS frame containing a 401 status, the second is a DATA frame containing a HTTP/2 response body.

Note that the response is actually content-type: application/json. We get the following error from that:

Error parsing metadata: error=invalid value key=content-type value=application/json

We believe that's harmless, since it's also printed on passing runs, but maybe that's what produces a racey failure state?

What did you expect to see?

(Note, the logs below are trimmed to hopefully highlight the interesting details. Full trace logs are available here as works.client.txt and fails.client.txt)

I expect the StartTx call to throw an RpcError with an UNAUTHENTICATED status code, so the assertion should pass. Here are trace logs around the parsing of this response on a passing run:

I1111 17:35:04.801326399   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:user-agent: grpc-python/1.50.0 grpc-c/28.0.0 (linux; chttp2)
I1111 17:35:04.801330169   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI::authority: 127.226.187.16:38089
I1111 17:35:04.801334145   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI::path: /externalexecutor.protobuf.KV/StartTx
I1111 17:35:04.801337952   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:WaitForReady: false
I1111 17:35:04.801341716   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:content-type: application/grpc
I1111 17:35:04.801345419   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI::scheme: https
I1111 17:35:04.801348628   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI::method: POST
I1111 17:35:04.801351948   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:grpc-accept-encoding: identity, deflate, gzip
I1111 17:35:04.801355687   75245 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:te: trailers
...
D1111 17:35:04.801538273   75245 tcp_posix.cc:1839]          DATA: 17 03 03 01 32 32 3d a0 f1 06 b8 6e 08 64 04 9d 6a 18 35 4c 90 71 3d b0 54 e6 75 53 87 92 3f 75 d4 39 38 8b 58 46 eb 0d 43 10 21 ae 23 94 b8 e3 5f e3 72 d5 c1 ed 5e 6f a2 c1 93 4b f6 45 af 57 ab 21 47 48 0d 21 ef 30 ea 8c ff eb 01 db 9d cf 4f f9 aa 13 9a 24 9c 3d 8f aa 84 cf 81 2e e4 e4 ce 3e 14 07 83 80 04 36 9f c4 d6 16 e3 1a 03 b1 6b 89 fb 24 8d bb 2d 70 97 50 bb 2b ed a2 49 25 f9 06 ff e4 ba 1b d8 92 af 4d 32 41 09 f7 ac fa 50 59 32 1b 7f b2 ad bb 35 35 05 5e 02 d1 7c 8a e5 55 c0 be e1 29 57 53 2d 20 4c 21 2c 62 77 6f 8d ba 5e 93 9b 05 e8 20 5c 13 51 0c 93 ae ca 95 2b f0 dc 7e c6 bd af 30 bb 31 c1 7f 8f 2e 76 b8 07 95 1a ee d3 27 fc d9 68 66 c2 91 da 23 8a c3 21 c3 0f ac 0c d8 9d 08 18 d5 6d ad ff 0c a5 27 9e ae 0d e0 29 c3 94 c7 1d 45 df 6e 05 a3 d3 5e 28 39 e0 95 d4 0e d5 ec ad 96 db 1a fd e1 6a 39 ce ca 0c 42 56 fa df 52 de cb bb ca c7 14 85 c6 86 76 26 cd 1c e0 b4 19 89 71 15 2c 5f 31 82 8c 72 53 b3 ad e4 44 93 '....22=....n.d..j.5L.q=.T.uS..?u.98.XF..C.!.#..._.r...^o...K.E.W.!GH.!.0........O....$.=.........>.....6........k..$..-p.P.+..I%.........M2A....PY2.....55.^..|..U...)WS- L!,bwo..^.... \.Q.....+..~...0.1....v......'..hf...#..!.........m....'....)....E.n...^(9............j9...BV..R.........v&......q.,_1..rS...D.'
...
D1111 17:35:04.802781326   75245 tcp_posix.cc:820]           DATA: 17 03 03 00 30 6f c7 f1 4f fa 24 ea bd 0a c3 81 0e d6 d0 d0 b9 d8 c0 9b c1 54 91 22 e9 84 1c ca 48 7f e2 7e a9 f9 96 35 ed f7 21 8e 45 72 f1 8d f7 ba f0 56 1d '....0o..O.$..............T."....H..~...5..!.Er.....V.'
I1111 17:35:04.802789176   75245 secure_endpoint.cc:244]     READ 0x210b300: 00 00 16 01 04 00 00 00 01 48 82 68 01 0f 0d 82 0b e1 5f 8b 1d 75 d0 62 0d 26 3d 4c 74 41 ea '.........H.h......_..u.b.&=LtA.'
I1111 17:35:04.802793808   75245 parsing.cc:601]             parsing initial_metadata
D1111 17:35:04.802802056   75245 hpack_parser.cc:1060]       HTTP:1:HDR:CLI: :status: 401
D1111 17:35:04.802808170   75245 hpack_parser.cc:1060]       HTTP:1:HDR:CLI: content-length: 191
E1111 17:35:04.802814558   75245 hpack_parser.cc:1228]       Error parsing metadata: error=invalid value key=content-type value=application/json
D1111 17:35:04.802819182   75245 hpack_parser.cc:1060]       HTTP:1:HDR:CLI: content-type: <discarded-invalid-value>
I1111 17:35:04.802824082   75245 timer_generic.cc:446]       TIMER 0x20f4000: CANCEL pending=true
I1111 17:35:04.802831494   75245 tcp_posix.cc:667]           TCP:0x20bed90 notify_on_read
I1111 17:35:04.802860988   75245 channel_stack.cc:253]       OP[message_decompress:0x2123230]:  CANCEL:UNKNOWN:early return from promise based filter {created_time:"2022-11-11T17:35:04.802837298+00:00", grpc_status:16, grpc_message:"Received http2 header with status: 401"}
...
I1111 17:35:04.803082553   75245 client_channel.cc:2245]     chand=0x210db80 calld=0x211ef40: got recv_trailing_metadata_ready: error=OK service_config_call_data=0x211f160
D1111 17:35:04.803087948   75245 call.cc:793]                set_final_status CLI
D1111 17:35:04.803096758   75245 call.cc:794]                UNKNOWN:Error received from peer ipv4:127.226.187.16:38089 {grpc_message:"Received http2 header with status: 401", grpc_status:16, created_time:"2022-11-11T17:35:04.803085472+00:00"}
...
17:35:04.803 | WARNING  | __main__:test_executor_registration:140 - StartTx call produced error: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAUTHENTICATED
	details = "Received http2 header with status: 401"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:127.226.187.16:38089 {grpc_message:"Received http2 header with status: 401", grpc_status:16, created_time:"2022-11-11T17:35:04.803085472+00:00"}"
>

What did you see instead?

The status code in the thrown error is instead UNKNOWN. Tracing shows that the expected response was received, and passed around, but at some point is lost and replaced by this UNKNOWN: Stream removed error. Trace logs around a failing run:

I1111 17:35:03.337613925   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:user-agent: grpc-python/1.50.0 grpc-c/28.0.0 (linux; chttp2)
I1111 17:35:03.337619135   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI::authority: 127.162.34.194:45385
I1111 17:35:03.337624174   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI::path: /externalexecutor.protobuf.KV/StartTx
I1111 17:35:03.337629504   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:WaitForReady: false
I1111 17:35:03.337634624   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:content-type: application/grpc
I1111 17:35:03.337639633   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI::scheme: https
I1111 17:35:03.337644622   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI::method: POST
I1111 17:35:03.337649712   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:grpc-accept-encoding: identity, deflate, gzip
I1111 17:35:03.337654962   25165 chttp2_transport.cc:1195]   HTTP:0:HDR:CLI:te: trailers
...
D1111 17:35:03.337906251   25165 tcp_posix.cc:1839]          DATA: 17 03 03 01 32 4c 49 68 f0 74 09 34 d9 b5 50 0f 7d 10 4d 14 d9 8f eb a1 1b f8 cf cc e5 60 bc 96 5f 3c a8 a9 ef c0 35 a5 f2 42 37 b1 f3 81 e1 49 85 21 3f a5 98 29 7f 46 ee c1 29 9e 7b 1b 1b c0 02 50 2d cb 40 27 ab 0a 78 c9 62 91 7d c9 5a 48 ef cf dd bd c3 dc fc da dd ea 46 31 ef 81 5d 60 f0 8d 54 16 6f ac ad e0 08 29 8b f4 44 36 22 20 8f 57 c1 44 36 cb 8d a4 05 ad ef 32 44 58 0a bd 28 a2 d1 26 e0 1c b6 ed f9 03 f7 fe 2f 48 fc 5f 3e 4c a6 c2 c6 09 6f 5b 0d 8e 80 84 f6 5c f8 87 c3 61 79 b8 02 25 3f d4 f6 b8 78 31 ab 28 b5 eb fa 74 0c 58 7c 92 ac e9 e1 59 36 c4 20 38 8c 22 8d e7 78 93 33 2b a8 f9 f7 0a fa ae 8f cf 20 3c 3c db 0a 86 42 17 13 19 bf 66 31 50 ff fd 9a 1d 84 ee cb 6d 7b 58 f1 4e 9f 18 6d 1d 82 c8 d5 87 1b ff 90 2f 0a 9b 97 69 bf d3 6f 0f 41 c7 f9 a2 33 6e 4c 5c 68 83 2b b9 a7 ef 4f c2 30 0a b8 ae 8a 7d 35 e8 0a 8b 41 11 9c 41 93 c4 28 9d 68 d7 72 c1 b9 09 38 e2 6b 5c 4d 18 26 a5 a0 8b ee ab 29 8a 83 64 4e fc b0 '....2LIh.t.4..P.}.M..........`.._<....5..B7....I.!?..).F..).{....P-.@'..x.b.}.ZH..........F1..]`..T.o....)..D6" .W.D6......2DX..(..&......../H._>L....o[.....\...ay..%?...x1.(...t.X|....Y6. 8."..x.3+........ <<...B....f1P.......m{X.N..m......../...i..o.A...3nL\h.+...O.0....}5...A..A..(.h.r...8.k\M.&.....)..dN..'
...
D1111 17:35:03.341192623   25165 tcp_posix.cc:820]           DATA: 17 03 03 00 30 66 c5 59 81 61 86 a2 cb a4 7c 2a cb d3 a8 19 68 0c b2 65 9b b1 cf 2f c3 82 a0 3b eb 53 46 5c 29 30 e6 6f 39 8c 7b 82 c8 ea 08 3c 7c 79 f1 24 ab 17 03 03 00 d9 b6 16 34 18 8e 08 a9 96 ae fa 92 e6 5b 49 0d e4 b8 e7 6a d8 df be c7 e2 30 0e 8a 4d 05 01 5d 8d d0 5a 08 dd 38 54 f2 17 04 68 fe 24 d2 a3 4b df c9 0d 56 7e 69 54 2e dc b5 2e 76 17 fd db ab 3e 0e ef 05 12 b1 7b db de 7a b1 ee 81 e3 da 69 de d2 b5 26 3e f9 aa f5 89 cd eb 37 43 57 f5 c6 78 57 6a d1 6f 7c 01 2f 86 39 10 22 25 da 95 15 9f 07 ce 1b f5 df e5 74 31 f4 c3 3f 9d 2e 39 00 d7 bc e3 01 98 9c c4 0a ca cf 72 bd 1f 5b c0 16 cc ce 84 ed 88 e5 58 80 86 53 3f 65 96 eb 6d c0 c6 04 96 1f 16 2c 59 2e 00 aa 63 63 6b 03 84 01 18 f3 f4 c7 99 ff c0 68 7d 73 cc 03 e9 6b af c2 d1 f1 30 6c 1e ff ec 46 ec 72 9a 36 b9 35 07 6b 4c 2d 21 52 3f aa 74 4f ce 6f '....0f.Y.a....|*....h..e.../...;.SF\)0.o9.{....<|y.$........4.........[I....j.....0..M..]..Z..8T...h.$..K...V~iT....v....>.....{..z.....i...&>......7CW..xWj.o|./.9."%..........t1..?..9...........r..[........X..S?e..m......,Y...cck..........h}s...k....0l...F.r.6.5.kL-!R?.tO.o'
I1111 17:35:03.341220795   25165 secure_endpoint.cc:244]     READ 0x30f5230: 00 00 16 01 04 00 00 00 01 48 82 68 01 0f 0d 82 0b e1 5f 8b 1d 75 d0 62 0d 26 3d 4c 74 41 ea 00 00 bf 00 01 00 00 00 01 7b 22 65 72 72 6f 72 22 3a 7b 22 63 6f 64 65 22 3a 22 49 6e 76 61 6c 69 64 41 75 74 68 65 6e 74 69 63 61 74 69 6f 6e 49 6e 66 6f 22 2c 22 64 65 74 61 69 6c 73 22 3a 5b 7b 22 61 75 74 68 5f 70 6f 6c 69 63 79 22 3a 22 45 78 65 63 75 74 6f 72 41 75 74 68 50 6f 6c 69 63 79 22 2c 22 63 6f 64 65 22 3a 22 49 6e 76 61 6c 69 64 41 75 74 68 65 6e 74 69 63 61 74 69 6f 6e 49 6e 66 6f 22 2c 22 6d 65 73 73 61 67 65 22 3a 22 4e 6f 20 45 78 65 63 75 74 6f 72 20 63 65 72 74 69 66 69 63 61 74 65 22 7d 5d 2c 22 6d 65 73 73 61 67 65 22 3a 22 49 6e 76 61 6c 69 64 20 69 6e 66 6f 22 7d 7d '.........H.h......_..u.b.&=LtA..........{"error":{"code":"InvalidAuthenticationInfo","details":[{"auth_policy":"ExecutorAuthPolicy","code":"InvalidAuthenticationInfo","message":"No Executor certificate"}],"message":"Invalid info"}}'
I1111 17:35:03.341245502   25165 parsing.cc:601]             parsing initial_metadata
D1111 17:35:03.341266751   25165 hpack_parser.cc:1060]       HTTP:1:HDR:CLI: :status: 401
D1111 17:35:03.341285827   25165 hpack_parser.cc:1060]       HTTP:1:HDR:CLI: content-length: 191
E1111 17:35:03.341304542   25165 hpack_parser.cc:1228]       Error parsing metadata: error=invalid value key=content-type value=application/json
D1111 17:35:03.341321834   25165 hpack_parser.cc:1060]       HTTP:1:HDR:CLI: content-type: <discarded-invalid-value>
I1111 17:35:03.341452037   25165 timer_generic.cc:446]       TIMER 0x30fc020: CANCEL pending=true
I1111 17:35:03.341470162   25165 tcp_posix.cc:667]           TCP:0x2ff9d40 notify_on_read
I1111 17:35:03.341508964   25165 channel_stack.cc:253]       OP[message_decompress:0x30d79c0]:  CANCEL:UNKNOWN:early return from promise based filter {created_time:"2022-11-11T17:35:03.341488775+00:00", grpc_status:16, grpc_message:"Received http2 header with status: 401"}
...
I1111 17:35:03.341831807   25165 client_channel.cc:2245]     chand=0x30cd690 calld=0x30dae00: got recv_trailing_metadata_ready: error=OK service_config_call_data=0x30db020
D1111 17:35:03.341853718   25165 call.cc:793]                set_final_status CLI
D1111 17:35:03.341876130   25165 call.cc:794]                UNKNOWN:Error received from peer ipv4:127.162.34.194:45385 {grpc_message:"Stream removed", grpc_status:2, created_time:"2022-11-11T17:35:03.341848548+00:00"}
I1111 17:35:03.341903782   25165 retry_filter.cc:1431]       chand=0x30b5d90 calld=0x30ec430 attempt=0x30ecae0 batch_data=0x30cd8c0: got recv_initial_metadata_ready, error=UNKNOWN:early return from promise based filter {created_time:"2022-11-11T17:35:03.341488775+00:00", grpc_status:16, grpc_message:"Received http2 header with status: 401"}
...
17:35:03.343 | WARNING  | __main__:test_executor_registration:140 - StartTx call produced error: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Stream removed"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:127.162.34.194:45385 {grpc_message:"Stream removed", grpc_status:2, created_time:"2022-11-11T17:35:03.341848548+00:00"}"
>

The first difference is that the READ line sees both frames, rather than just the HEADER frame. Then the set_final_status appears to drop the original 401, and replaces it with a "Stream removed" error. Despite continued logging around the 401 response, the final error that reaches Python is UNKNOWN.

Anything else we should know about your project / environment?

Hopefully covered everything important above. Happy to answer any further questions.

@yashykt
Copy link
Member

yashykt commented Nov 15, 2022

Thanks for raising this issue!

@yashykt
Copy link
Member

yashykt commented Nov 15, 2022

Tagging @ctiller for promises related changes

@yashykt
Copy link
Member

yashykt commented Nov 15, 2022

I've got a suspicion as to what's happening here, but I need to add a repro test case for this first, which is some non-trivial work that I need to time for. I'll try to do it in this week or the coming one

@gnossen
Copy link
Contributor

gnossen commented Nov 29, 2022

@yashykt Friendly ping.

@yashykt
Copy link
Member

yashykt commented Jan 7, 2023

So, I tried debugging this with https://github.com/yashykt/grpc/tree/31657Debugging
What that does is pretty similar to what OP suggests where the server is responding with a 401 Unauthenticated.

The bad news is that I wasn't able to reproduce the issue where the final status changes to Unknown, but I did see similar logs

CANCEL:UNKNOWN:early return from promise based filter {file:"src/core/lib/channel/promise_based_filter.cc", file_line:358, created_time:"2023-01-07T01:02:58.685412258+00:00", grpc_status:16, grpc_message:"Received http2 header with status: 401"}

These logs seem to start in 1.46.x and disappear in 1.51.x with #31474.

This makes me suspect that the issue is fixed with that PR, but I cannot say for sure till I have a repro. The repro would also help creating a backport.

@yashykt
Copy link
Member

yashykt commented Apr 21, 2023

I know I forgot about this bug, but given that this was fixed in later branches, I am slightly inclined to not try to isolate the fix, and instead recommend using the later branches.

@gnossen
Copy link
Contributor

gnossen commented May 3, 2023

Closing since this issue was fixed in 1.51.3.

@gnossen gnossen closed this as completed May 3, 2023
@jeevb
Copy link

jeevb commented Jul 29, 2023

@gnossen @yashykt: This issue seems to have resurfaced with ==v1.53.1 and >=v1.54.2. Would you prefer that we open another issue or discuss further here? I'm happy to produce a minimal reproducible example that can be used for testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants