istio 1.2.4: envoy segfaults occasionally on istio-proxy pods #16357
Istio Proxy binary in 1.2.4 was built from istio/proxy@568f2b6, istio/envoy@204283e (branched from envoyproxy/envoy@829b905 on 2019-05-16). That backtrace decoded using binary with debug symbols corresponds to:
when exiting this lambda:
Note that this lambda was removed as part of xDS simplification, so it doesn't exist in envoyproxy/envoy@master anymore.
@gotwarlost to help narrow this down:
The other stacktrace has the form:
Do you think that there could be one bad/unexpected config that could be causing this?
Since this is a playground, could you try using …? Since we didn't see those crashes during our testing done by multiple people, I imagine that a specific cluster configuration is triggering this. If you change the log level to "trace", then you'll see the contents of the CDS update immediately before the crash (note that it might contain sensitive information, so please sanitize it before pasting).
@PiotrSikora does this still happen at Envoy master HEAD? You mention that there were a bunch of changes since this release cut.
@gotwarlost please file bugs for the other backtraces you see (the ones present in earlier releases but still occurring with 1.2.4) and link them to this one.
@htuch we don't have a way to replicate this, so I have no idea if that's still happening at …
@gotwarlost I'm worried that we don't know how to make progress here. The 1.1.8 to 1.2.4 delta is large. +1 to @PiotrSikora's suggestion to try to upgrade from 1.1.8 to 1.1.13 first.
Just came back from an offsite + time off. Will sync up with the team and get you additional data. Sorry for the delay.
Deployed
Now trying to get a detailed TRACE level log for one of the crashes.
@gotwarlost thank you, that very much reduces the search space. Could you try 1.1.12? If 1.1.12 does not crash, then I think we could squarely point the finger at the HTTP/2 vulnerability fixes or the switch to libc++.
Running
Still no crashes.
I have a trace file for one of the ingress gateway pods, but there is a lot of disclosure-type information in the file that I'm not comfortable sharing in a public forum. Taking all this out might make the trace useless.
@gotwarlost could you test with 1.1.14 as well?
The context there is that the proxyv2 image in 1.1.13 was built by hand, because we had to build it under private embargo and couldn't leverage our CI/CD pipeline since that builds everything in the clear. Piotr wants to rule out that hand build (done by me) as a source of error. 1.1.14 has the same fixes but was built in the clear using our standard pipeline.
@PiotrSikora corrected me. There's a memory alignment fix in 1.1.14 as well as an upgraded version of libc++, so it's well worth a try.
We'll do this tomorrow morning and report results.
Still happening with
The number of crashes seems to have gone down, but it is still an infrequent occurrence: 20 crashes in the last 20 hours just for the ingress gateway.
17 crashes for the ingress gateway in the last 24 hours. It is much reduced from before but still happening. Any ideas how to move forward/debug this? As it stands we are caught between a rock and a hard place: we really need the upgrade but cannot proceed safely.
The changes between 1.1.12 and 1.1.13 in Envoy (istio/envoy@8e54972...6910569) consist only of HTTP/2 fixes and the switch to libc++ (along with some Lua fixes for it). Since you're the only one who has reported any issues with the new releases, it leads me to believe that it's something unique to your setup (e.g. a non-default option that you enabled, etc.). Would you be willing to share the trace log with me privately? Also, just to make sure, are the resources (mostly memory) on the ingress gateway sufficiently over-provisioned? Envoy's memory usage tends to spike during xDS message parsing, so it's possible that you'd consistently run out of memory at the exact same place in the code.
We are currently trying an ASAN-instrumented build in @gotwarlost's env. Our best guess is that this was introduced by the switch from libstdc++ to libc++.
I'm facing the same issue.
A core file would be most appreciated. This issue seems to be rather subtle.
New backtrace
Potential fix: istio/envoy#103
The crash in this core seems to be caused by calling ClusterManagerInitHelper::removeCluster with a smart pointer to a Cluster which is nullptr, which in turn is caused by cluster_manager_impl.cc:1258. The dereferenced nullptr (Cluster&) is not used if state_ == State::AllClustersInitialized, but the compiler has the option to load the nullptr when it converts the smart pointer to a reference via *, and that could be standard-library dependent. This issue seems to arise with libc++ but not libstdc++. It is not surprising that the behavior could differ, because it depends on the optimizations the compiler performs, which in turn depend on the specifics of the smart pointer implementation and the environment in which it is used.
Fix is upstream: envoyproxy/envoy@1339ed2#diff-982df999d2cabd23afd6c25088035bf5 |
@jplevyak as I commented in the PR, let's do a full cherry-pick of envoyproxy/envoy#8106?
Updated istio/envoy#105 to be a cherry-pick of the upstream fix.
Fixed in 1.1, 1.2, 1.3, and upstream.
Funny, the frame numbers (#N) in these backtraces from the logs are interpreted as references to old PRs in istio; one of my first ones just got referenced :).
This is not done, at least for
@jplevyak please make sure the fix is in master too, else 1.4 will regress.
1.1.15 and 1.2.6 are out. 1.3.1 should be out this week or early next at the latest (@rlenglet).
This issue still occurred after upgrading to version:
client version: 1.2.6
egressgateway version: 94746ccd404a8e056483dd02e4e478097b950da6-dirty
galley version: 1.2.6
ingressgateway version: 1.2.6
pilot version: 1.2.6
policy version: 1.2.6
sidecar-injector version: 1.2.6
telemetry version: 1.2.6
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
@jacexh I walked through all the SHAs in the 1.2.6 release and verified that istio/envoy@52a3903 is included in it. When you upgraded to 1.2.6, did you complete this step? https://istio.io/docs/setup/upgrade/steps/#sidecar-upgrade
1.3.1 is out. If there are any more segfaults with a similar backtrace, please reopen.
Here is what we see in
/reopen
@jplevyak PTAL, the backtrace is different.
We should open this as a different issue: since the backtrace is different, it likely has a different root cause, and the historical information here is going to be deceptive.
I opened a new issue: #17699.
Bug description
After upgrading to 1.2.4 (from 1.1.8) we see occasional segfaults on many istio-proxy pods, all of which have a form similar to this:
Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[X] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
Envoy does not crash.
Steps to reproduce the bug
Very intermittent so hard to reproduce. Creating issue to alert people of potential problem. We will continue to investigate.
Version (include the output of istioctl version --remote and kubectl version)
How was Istio installed?
Helm charts
Environment where bug was observed (cloud vendor, OS, etc)
AWS/ Kops