istio 1.2.4: envoy segfaults occasionally on istio-proxy pods #16357

Closed
gotwarlost opened this issue Aug 16, 2019 · 47 comments

@gotwarlost

gotwarlost commented Aug 16, 2019

Bug description

After upgrading to 1.2.4 (from 1.1.8), we see occasional segfaults on many istio-proxy pods, all of which have a backtrace similar to this:

Caught Segmentation fault, suspect faulting address 0x228
Backtrace (use tools/stack_decode.py to get line numbers):
0: __restore_rt [0x7f3fe53cb390]
1: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0xcf4a4d]
2: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0xcf41e1]
3: Envoy::Config::GrpcMuxSubscriptionImpl::onConfigUpdate() [0xe12841]
4: Envoy::Config::GrpcMuxImpl::onDiscoveryResponse() [0xe0ff19]
5: Envoy::Grpc::TypedAsyncStreamCallbacks<>::onReceiveMessageUntyped() [0xe10d6e]
6: Envoy::Grpc::AsyncStreamImpl::onData() [0xe3ae9d]
7: Envoy::Http::AsyncStreamImpl::encodeData() [0xe3e814]
8: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0xeae3cd]
9: nghttp2_session_on_data_received [0xebe97f]
10: nghttp2_session_mem_recv [0xec0785]
11: Envoy::Http::Http2::ConnectionImpl::dispatch() [0xeada89]
12: Envoy::Http::CodecClient::onData() [0xe32106]
13: Envoy::Http::CodecClient::CodecReadFilter::onData() [0xe32afd]
14: Envoy::Network::FilterManagerImpl::onRead() [0xcb7c29]
15: Envoy::Network::ConnectionImpl::onReadReady() [0xcb4c2c]
16: Envoy::Network::ConnectionImpl::onFileEvent() [0xcb46f1]
17: Envoy::Event::FileEventImpl::assignEvents()::$_0::__invoke() [0xcaf110]
18: event_process_active_single_queue [0x10536ed]
19: event_base_loop [0x1051ca0]
20: Envoy::Event::DispatcherImpl::run() [0xcae09f]
21: Envoy::Server::InstanceImpl::run() [0xc6a773]
22: Envoy::MainCommonBase::run() [0x8f405d]
23: main [0x8f3308]
24: [0x7f3fe45fc830]

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
Envoy does not crash.

Steps to reproduce the bug
Very intermittent, so hard to reproduce. Creating this issue to alert people to a potential problem; we will continue to investigate.

Version (include the output of istioctl version --remote and kubectl version)

client version: version.BuildInfo{Version:"1.1.2", GitRevision:"2b1331886076df103179e3da5dc9077fed59c989", User:"root", Host:"35adf5bb-5570-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Clean", GitTag:"1.1.1"}
citadel version: version.BuildInfo{Version:"1.2.4", GitRevision:"c562694ea6e554c2b60d12c9876d2641cfdd917d-dirty", User:"root", Host:"418fcd4e-b951-11e9-a7c2-029eefb01290", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.2.3"}
galley version: version.BuildInfo{Version:"1.2.4", GitRevision:"c562694ea6e554c2b60d12c9876d2641cfdd917d-dirty", User:"root", Host:"418fcd4e-b951-11e9-a7c2-029eefb01290", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.2.3"}
pilot version: version.BuildInfo{Version:"1.2.4", GitRevision:"c562694ea6e554c2b60d12c9876d2641cfdd917d-dirty", User:"root", Host:"418fcd4e-b951-11e9-a7c2-029eefb01290", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.2.3"}
policy version: version.BuildInfo{Version:"1.2.4", GitRevision:"c562694ea6e554c2b60d12c9876d2641cfdd917d-dirty", User:"root", Host:"418fcd4e-b951-11e9-a7c2-029eefb01290", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.2.3"}
sidecar-injector version: version.BuildInfo{Version:"1.2.4", GitRevision:"c562694ea6e554c2b60d12c9876d2641cfdd917d-dirty", User:"root", Host:"418fcd4e-b951-11e9-a7c2-029eefb01290", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.2.3"}
telemetry version: version.BuildInfo{Version:"1.2.4", GitRevision:"c562694ea6e554c2b60d12c9876d2641cfdd917d-dirty", User:"root", Host:"418fcd4e-b951-11e9-a7c2-029eefb01290", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.2.3"}

How was Istio installed?
Helm charts

Environment where bug was observed (cloud vendor, OS, etc)
AWS / Kops

@duderino
Contributor

@PiotrSikora @htuch

@duderino duderino added this to the 1.3 milestone Aug 17, 2019
@PiotrSikora
Contributor

PiotrSikora commented Aug 17, 2019

Istio Proxy binary in 1.2.4 was built from istio/proxy@568f2b6, istio/envoy@204283e (branched from envoyproxy/envoy@829b905 on 2019-05-16).

That backtrace, decoded using a binary with debug symbols, corresponds to:

Caught Segmentation fault, suspect faulting address 0x228
Backtrace (use tools/stack_decode.py to get line numbers):
#0: __restore_rt [0x7f3fe53cb390] ??:0
#1: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0xcf4a4d] /usr/lib/llvm-7/bin/../include/c++/v1/functional:1864
 (inlined by) external/envoy/source/common/upstream/cds_api_impl.cc:82
#2: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0xcf41e1] external/com_google_protobuf/src/google/protobuf/repeated_field.h:1465
 (inlined by) external/com_google_protobuf/src/google/protobuf/repeated_field.h:1923
 (inlined by) external/envoy/source/common/upstream/cds_api_impl.cc:63
#3: Envoy::Config::GrpcMuxSubscriptionImpl::onConfigUpdate() [0xe12841] /usr/lib/llvm-7/bin/../include/c++/v1/string:1406
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/string:1500
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/string:1228
 (inlined by) external/com_google_absl/absl/strings/string_view.h:171
 (inlined by) external/envoy/source/common/config/grpc_mux_subscription_impl.cc:60
#4: Envoy::Config::GrpcMuxImpl::onDiscoveryResponse() [0xe0ff19] /usr/lib/llvm-7/bin/../include/c++/v1/memory:2603
 (inlined by) external/envoy/source/common/config/grpc_mux_impl.cc:184
#5: Envoy::Grpc::TypedAsyncStreamCallbacks<>::onReceiveMessageUntyped() [0xe10d6e] bazel-out/k8-opt/bin/external/envoy/include/envoy/grpc/_virtual_includes/async_client_interface/envoy/grpc/async_client.h:?
#6: Envoy::Grpc::AsyncStreamImpl::onData() [0xe3ae9d] /usr/lib/llvm-7/bin/../include/c++/v1/iterator:1509
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/iterator:1561
 (inlined by) external/envoy/source/common/grpc/async_client_impl.cc:131
#7: Envoy::Http::AsyncStreamImpl::encodeData() [0xe3e814] external/envoy/source/common/http/async_client_impl.cc:106
#8: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0xeae3cd] external/com_google_absl/absl/types/optional.h:816
 (inlined by) external/envoy/source/common/http/http2/codec_impl.cc:488
#9: nghttp2_session_on_data_received [0xebe97f] /home/bootstrap/.cache/bazel/_bazel_bootstrap/1487d9a98de60e2921e308f5e5b22a2c/sandbox/linux-sandbox/345/execroot/__main__/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:1419
 (inlined by) /home/bootstrap/.cache/bazel/_bazel_bootstrap/1487d9a98de60e2921e308f5e5b22a2c/sandbox/linux-sandbox/345/execroot/__main__/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:4974
#10: nghttp2_session_mem_recv [0xec0785] /home/bootstrap/.cache/bazel/_bazel_bootstrap/1487d9a98de60e2921e308f5e5b22a2c/sandbox/linux-sandbox/345/execroot/__main__/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:?
#11: Envoy::Http::Http2::ConnectionImpl::dispatch() [0xeada89] external/envoy/source/common/http/http2/codec_impl.cc:364
#12: Envoy::Http::CodecClient::onData() [0xe32106] external/envoy/source/common/http/codec_client.cc:130
#13: Envoy::Http::CodecClient::CodecReadFilter::onData() [0xe32afd] ??:?
#14: Envoy::Network::FilterManagerImpl::onRead() [0xcb7c29] ??:?
#15: Envoy::Network::ConnectionImpl::onReadReady() [0xcb4c2c] external/envoy/source/common/network/connection_impl.cc:637
 (inlined by) external/envoy/source/common/network/connection_impl.cc:519
#16: Envoy::Network::ConnectionImpl::onFileEvent() [0xcb46f1] external/envoy/source/common/network/connection_impl.cc:458
#17: Envoy::Event::FileEventImpl::assignEvents()::$_0::__invoke() [0xcaf110] /usr/lib/llvm-7/bin/../include/c++/v1/functional:1406
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/functional:1912
 (inlined by) external/envoy/source/common/event/file_event_impl.cc:65
 (inlined by) external/envoy/source/common/event/file_event_impl.cc:49
#18: event_process_active_single_queue [0x10536ed] event.c:?
#19: event_base_loop [0x1051ca0] ??:?
#20: Envoy::Event::DispatcherImpl::run() [0xcae09f] external/envoy/source/common/event/dispatcher_impl.cc:181
#21: Envoy::Server::InstanceImpl::run() [0xc6a773] bazel-out/k8-opt/bin/external/envoy/source/common/common/_virtual_includes/minimal_logger_lib/common/common/logger.h:275
 (inlined by) external/envoy/source/server/server.cc:484
#22: Envoy::MainCommonBase::run() [0x8f405d] /usr/lib/llvm-7/bin/../include/c++/v1/memory:4173
 (inlined by) external/envoy/source/exe/main_common.cc:105
#23: main [0x8f3308] /usr/lib/llvm-7/bin/../include/c++/v1/memory:2321
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/memory:2634
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/memory:2588
 (inlined by) bazel-out/k8-opt/bin/external/envoy/source/exe/_virtual_includes/envoy_main_common_lib/exe/main_common.h:34
 (inlined by) bazel-out/k8-opt/bin/external/envoy/source/exe/_virtual_includes/envoy_main_common_lib/exe/main_common.h:87
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/memory:2321
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/memory:2634
 (inlined by) /usr/lib/llvm-7/bin/../include/c++/v1/memory:2588
 (inlined by) external/envoy/source/exe/main.cc:40
#24: [0x7f3fe45fc830]

include/c++/v1/functional:1864 points to destroy_deallocate() inside libc++:

template<class _Rp, class ..._ArgTypes>
function<_Rp(_ArgTypes...)>::~function()
{
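    // __f_ pointing into __buf_ means the target is stored inline in the
    // small-object buffer and is destroyed in place; otherwise the target
    // was heap-allocated and destroy_deallocate() both destroys it and
    // frees the allocation.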
    if ((void *)__f_ == &__buf_)
        __f_->destroy();
    else if (__f_)
        __f_->destroy_deallocate();
}

when exiting this lambda:

if (cm_.addOrUpdateCluster(
        cluster, resource.version(),
        [this](const std::string&, ClusterManager::ClusterWarmingState state) {
          // Following if/else block implements a control flow mechanism that can be used
          // by an ADS implementation to properly sequence CDS and RDS update. It is not
          // enforcing on ADS. ADS can use it to detect when a previously sent cluster becomes
          // warm before sending routes that depend on it. This can improve incidence of HTTP
          // 503 responses from Envoy when a route is used before it's supporting cluster is
          // ready.
          //
          // We achieve that by leaving CDS in the paused state as long as there is at least
          // one cluster in the warming state. This prevents CDS ACK from being sent to ADS.
          // Once cluster is warmed up, CDS is resumed, and ACK is sent to ADS, providing a
          // signal to ADS to proceed with RDS updates.
          //
          // Major concern with this approach is CDS being left in the paused state forever.
          // As long as ClusterManager::removeCluster() is not called on a warming cluster
          // this is not an issue. CdsApiImpl takes care of doing this properly, and there
          // is no other component removing clusters from the ClusterManagerImpl. If this
          // ever changes, we would need to correct the following logic.
          if (state == ClusterManager::ClusterWarmingState::Starting &&
              cm_.warmingClusterCount() == 1) {
            cm_.adsMux().pause(Config::TypeUrl::get().Cluster);
          } else if (state == ClusterManager::ClusterWarmingState::Finished &&
                     cm_.warmingClusterCount() == 0) {
            cm_.adsMux().resume(Config::TypeUrl::get().Cluster);
          }
        })) {
  ENVOY_LOG(debug, "cds: add/update cluster '{}'", cluster.name());
}

Note that this lambda was removed as part of xDS simplification, so it doesn't exist in envoyproxy/envoy@master anymore.

@htuch @lizan, any ideas?

@PiotrSikora
Contributor

@gotwarlost to help narrow this down:

  1. Could you confirm that all crashes are happening in Envoy::Upstream::CdsApiImpl::onConfigUpdate()? If not, could you include others?
  2. Could you put a number behind "occasionally"? Is it once a day to a single proxy instance out of thousands or every hour to a majority of proxy instances?
  3. Do the crashes happen to all proxy instances at the same time?
  4. Do the crashes happen "immediately" after changes to your cluster or on a stable cluster?

@gotwarlost
Author

gotwarlost commented Aug 18, 2019

  1. We ran the new version between 10:00 AM and 4:00 PM on Friday (we rolled back to provide some stability over the weekend). In that time, we saw 154 crashes with this stack trace. There are other stack traces, but those do not seem to be 1.2.4-specific (i.e., they were present even after the rollback to 1.1.8).

The other stack trace has the form:

Backtrace (use tools/stack_decode.py to get line numbers):
0: Envoy::TerminateHandler::logOnTerminate()::$_0::operator()() [0xda2d48]
0: Envoy::TerminateHandler::logOnTerminate()::$_0::operator()() [0xda2d48]
1: [0xda2c59]
1: [0xda2c59]
2: __cxxabiv1::__terminate() [0xf0cee6]
Caught Aborted, suspect faulting address 0x53900000013
Backtrace (use tools/stack_decode.py to get line numbers):
2: __cxxabiv1::__terminate() [0xf0cee6]

  2. Occasionally => 154 times in 6 hours. Most of them were for the ingress gateway, with some on individual services.

  3. There is definite clustering of crashes on all pods for a service. I'll attach a CSV file for the ingress gateway that accounts for 119 of the 154 crashes.

  4. This is on our playground cluster, which is the most "unstable" in that people are constantly deploying new services, adapter code, and so on.

2019-08-16T10:49:25.456-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T10:49:25.644-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T10:49:35.495-0700,istio-ingressgateway-587cb95cfb-lpm8l

2019-08-16T10:59:40.735-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T11:03:05.053-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T11:06:38.484-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T11:07:03.279-0700,istio-ingressgateway-587cb95cfb-lpm8l

2019-08-16T11:19:04.171-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T11:36:10.117-0700,istio-ingressgateway-587cb95cfb-c98ds

2019-08-16T11:41:37.363-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T11:41:37.494-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T11:41:40.300-0700,istio-ingressgateway-587cb95cfb-c98ds

2019-08-16T11:56:13.847-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T11:56:15.307-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T11:56:24.941-0700,istio-ingressgateway-587cb95cfb-lpm8l

2019-08-16T12:03:33.767-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:03:33.873-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:03:34.453-0700,istio-ingressgateway-587cb95cfb-c98ds

2019-08-16T12:06:29.632-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:06:45.603-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:06:46.460-0700,istio-ingressgateway-587cb95cfb-d98t5

2019-08-16T12:11:08.556-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:11:10.811-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:11:11.023-0700,istio-ingressgateway-587cb95cfb-lpm8l

2019-08-16T12:14:54.363-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:14:56.037-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:14:56.479-0700,istio-ingressgateway-587cb95cfb-lpm8l

2019-08-16T12:19:20.302-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:19:28.261-0700,istio-ingressgateway-587cb95cfb-c98ds

2019-08-16T12:24:18.117-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:24:47.302-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:24:47.461-0700,istio-ingressgateway-587cb95cfb-c98ds

2019-08-16T12:27:27.240-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:27:29.624-0700,istio-ingressgateway-587cb95cfb-d98t5

2019-08-16T12:29:10.501-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:29:10.678-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:29:10.932-0700,istio-ingressgateway-587cb95cfb-d98t5

2019-08-16T12:31:50.186-0700,istio-ingressgateway-587cb95cfb-d98t5

2019-08-16T12:32:00.269-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:32:01.793-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:32:13.342-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:32:36.619-0700,istio-ingressgateway-587cb95cfb-c98ds

2019-08-16T12:33:14.021-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:33:50.987-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:34:01.381-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:34:13.726-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:36:08.606-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:36:09.773-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:37:16.193-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:37:16.733-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:39:45.478-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:39:49.502-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:40:02.943-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:44:19.600-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:44:20.510-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:48:47.687-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:48:49.463-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:49:48.715-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:49:49.103-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:50:48.634-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T12:50:49.629-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:50:51.103-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T12:59:02.377-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T12:59:03.793-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T13:04:28.616-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T13:04:30.819-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T13:04:40.787-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T13:06:27.271-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T13:09:11.965-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T13:09:12.055-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T13:10:05.759-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T13:10:08.280-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T13:42:19.187-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T13:42:19.557-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T13:42:19.616-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T13:47:22.496-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T13:51:46.595-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T13:52:51.453-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T13:52:51.630-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T13:52:53.975-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T13:56:01.633-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T13:56:07.786-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:05:22.107-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T14:05:36.697-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T14:05:37.838-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:09:34.802-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T14:09:36.222-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:13:06.820-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T14:13:09.902-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:14:45.794-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:14:49.301-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T14:14:50.213-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T14:18:32.496-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:18:38.025-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T14:18:41.437-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T14:21:41.874-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T14:21:41.917-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T14:21:45.820-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:23:00.422-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T14:23:02.673-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T14:24:03.803-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:26:32.144-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:43:29.666-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T14:46:14.903-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T14:46:14.997-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T15:07:53.393-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T15:07:56.732-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T15:10:02.491-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T15:10:05.214-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T15:10:31.800-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T15:17:32.543-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T15:22:54.060-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T15:23:06.369-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T15:23:08.042-0700,istio-ingressgateway-587cb95cfb-lpm8l
2019-08-16T15:28:34.163-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T15:44:33.912-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T15:44:35.841-0700,istio-ingressgateway-587cb95cfb-c98ds
2019-08-16T15:45:01.708-0700,istio-ingressgateway-587cb95cfb-d98t5
2019-08-16T15:45:01.786-0700,istio-ingressgateway-587cb95cfb-c98ds

Do you think that a single bad or unexpected config could be causing this?

@PiotrSikora
Contributor

Since this is a playground, could you try using 1.1.13 to narrow down the issue between "upgrade from 1.1.x to 1.2.x" and "switch to libc++-7" (to which we moved in 1.1.13 and 1.2.4)? Would you be willing to test a custom proxy image?

Since we didn't see those crashes during our testing, which was done by multiple people, I imagine that a specific cluster configuration is triggering this. If you change the log level to "trace", you'll see the contents of the CDS update immediately before the crash (note that it might contain sensitive information, so please sanitize it before pasting).

@htuch

htuch commented Aug 19, 2019

@PiotrSikora does this still happen at Envoy master HEAD? You mention that there were a bunch of changes since this release cut.

@duderino
Contributor

@gotwarlost please file bugs for the other backtraces you see (the ones present in earlier releases but still occurring with 1.2.4) and link them to this one

@PiotrSikora
Contributor

@htuch we don't have a way to replicate this, so I have no idea if that's still happening at HEAD.

@duderino
Contributor

@gotwarlost I'm worried that we don't know how to make progress here. The 1.1.8 to 1.2.4 delta is large.

+1 to @PiotrSikora's suggestion to try to upgrade from 1.1.8 to 1.1.13 first

@gotwarlost
Author

Just came back from offsite + time off. Will sync up with the team and get you additional data. Sorry for the delay.

@gotwarlost
Author

Deployed 1.1.13 to the cluster 45 minutes ago. 35 crashes already. New stack:

Caught Segmentation fault, suspect faulting address 0x228
Backtrace (use tools/stack_decode.py to get line numbers):
#0: __restore_rt [0x7faecbaa1390]
#1: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0xc8fb0d]
#2: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0xc8f2a1]
#3: Envoy::Config::GrpcMuxSubscriptionImpl::onConfigUpdate() [0xc2a1b1]
#4: Envoy::Config::GrpcMuxImpl::onDiscoveryResponse() [0xd8df19]
#5: Envoy::Grpc::TypedAsyncStreamCallbacks<>::onReceiveMessageUntyped() [0xd8edce]
#6: Envoy::Grpc::AsyncStreamImpl::onData() [0xdb462d]
#7: Envoy::Http::AsyncStreamImpl::encodeData() [0xdb7db4]
#8: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0xe15fa3]
#9: nghttp2_session_on_data_received [0xe2691f]
#10: nghttp2_session_mem_recv [0xe28747]
#11: Envoy::Http::Http2::ConnectionImpl::dispatch() [0xe15679]
#12: Envoy::Http::CodecClient::onData() [0xdab3d6]
#13: Envoy::Http::CodecClient::CodecReadFilter::onData() [0xdabdcd]
#14: Envoy::Network::FilterManagerImpl::onRead() [0xc550aa]
#15: Envoy::Network::ConnectionImpl::onReadReady() [0xc52299]
#16: Envoy::Network::ConnectionImpl::onFileEvent() [0xc51d61]
#17: Envoy::Event::FileEventImpl::assignEvents()::$_0::__invoke() [0xc4c800]
#18: event_process_active_single_queue [0xfade7d]
#19: event_base_loop [0xfac430]
#20: Envoy::Event::DispatcherImpl::run() [0xc4bd4f]
#21: Envoy::Server::InstanceImpl::run() [0xbfe6c0]
#22: Envoy::MainCommonBase::run() [0x8a8cc9]
#23: main [0x8a7e72]

Now trying to get a detailed TRACE level log for one of the crashes.

@duderino
Contributor

duderino commented Aug 23, 2019

@gotwarlost thank you, that very much reduces the search space. Could you try 1.1.12? If 1.1.12 does not crash, then I think we could squarely point the finger at the HTTP/2 vulnerability fixes or the switch to libc++.

@gotwarlost
Author

We've been running 1.1.12 for the last hour with no crashes so far. OTOH, we deployed too late on a Friday and deployment activity is pretty low-key (there were no crashes even in the hour before we deployed). Will leave it running over the weekend and see what happens on Monday.

@gotwarlost
Author

Still no crashes. 1.1.12 seems stable w.r.t. the Envoy crash.

@gotwarlost
Author

I have a trace file for one of the ingress gateway pods, but there is a lot of disclosure-sensitive information in the file that I'm not comfortable sharing in a public forum. Taking all of it out might make the trace useless.

@PiotrSikora
Contributor

@gotwarlost could you test with 1.1.14 as well?

@duderino
Contributor

The context there is that the proxyv2 image in 1.1.13 was built by hand, because we had to build it under private embargo and couldn't leverage our CI/CD pipeline, which builds everything in the clear. Piotr wants to rule out that hand build (done by me) as a source of error. 1.1.14 has the same fixes but was built in the clear using our standard pipeline.

@duderino
Contributor

@PiotrSikora corrected me. There's a memory alignment fix in 1.1.14 as well as an upgraded version of libc++, so it's well worth a try.

@gotwarlost
Author

We'll do this tomorrow morning and report results.

@gotwarlost
Author

Still happening with 1.1.14: 7 crashes on the ingress gateway in the last hour.

@gotwarlost
Author

The number of crashes seems to have gone down, but they are still occurring: 20 crashes in the last 20 hours just for the ingress gateway.

@gotwarlost
Author

17 crashes for the ingress gateway in the last 24 hours. That is much reduced from before, but still happening. Any ideas on how to move forward or debug this? As it stands, we are caught between a rock and a hard place: we really need the upgrade but cannot proceed safely.

@PiotrSikora
Contributor

The changes between 1.1.12 and 1.1.13 in Envoy (istio/envoy@8e54972...6910569) consist only of HTTP/2 fixes and switch to libc++ (along with some Lua fixes for it).

Since you're the only one who has reported any issues with the new releases, it leads me to believe that it's something unique to your setup (e.g., a non-default option that you enabled).

Would you be willing to share the trace log with me privately?
And/or try the 1.3.0-rc.1 release that's coming out tomorrow?

Also, just to make sure, are the resources (mostly memory) on the ingress gateway sufficiently over-provisioned? Envoy's memory usage tends to spike during xDS message parsing, so it's possible that you'd consistently run out of memory at the exact same place in the code.

@duderino duderino assigned jplevyak and unassigned PiotrSikora Aug 30, 2019
@duderino
Contributor

duderino commented Sep 3, 2019

We are currently trying an ASan-instrumented build in @gotwarlost's env. Our best guess is that this was introduced by the switch from libstdc++ to libc++.

@jacexh

jacexh commented Sep 6, 2019

I'm facing the same issue on 1.2.4.

@jplevyak
Contributor

jplevyak commented Sep 6, 2019

A core file would be most appreciated. This issue seems to be rather subtle.

@jplevyak
Contributor

jplevyak commented Sep 12, 2019

New backtrace:

#0  0x00007fee8efaf269 in raise (sig=11) at ../sysdeps/unix/sysv/linux/pt-raise.c:35
#1  0x000000000113712b in Envoy::SignalAction::sigHandler (sig=11, info=0x7fee8f7cebf0, context=0x7fee8f7ceac0) at external/envoy/source/exe/signal_action.cc:23
#2  <signal handler called>
where
#3  std::__1::shared_ptr<Envoy::Upstream::Cluster>::operator* (this=0x228) at /usr/lib/llvm-7/bin/../include/c++/v1/memory:3914
#4  Envoy::Upstream::ClusterManagerImpl::addOrUpdateCluster(envoy::api::v2::Cluster const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Envoy::Upstream::ClusterManager::ClusterWarmingState)>) (this=0x392b200,     
    cluster=..., version_info=..., cluster_warming_cb=...) at external/envoy/source/common/upstream/cluster_manager_impl.cc:456
#5  0x0000000000c8fced in Envoy::Upstream::CdsApiImpl::onConfigUpdate (this=0x3aca080, added_resources=..., removed_resources=..., system_version_info=...) at external/envoy/source/common/upstream/cds_api_impl.cc:82                                                                                                                                                                       
#6  0x0000000000c8f481 in Envoy::Upstream::CdsApiImpl::onConfigUpdate (this=0x3aca080, resources=..., version_info=...) at external/envoy/source/common/upstream/cds_api_impl.cc:62
#7  0x0000000000c2a391 in Envoy::Config::GrpcMuxSubscriptionImpl::onConfigUpdate (this=0x3aca100, resources=..., version_info=...) at bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:61                                                                                                      
#8  0x0000000000d8e0f9 in Envoy::Config::GrpcMuxImpl::onDiscoveryResponse (this=0x3a1a000, message=...) at external/envoy/source/common/config/grpc_mux_impl.cc:166
#9  0x0000000000d8efae in Envoy::Grpc::TypedAsyncStreamCallbacks<envoy::api::v2::DiscoveryResponse>::onReceiveMessageUntyped (this=0x3a1a010, message=...) at bazel-out/k8-opt/bin/external/envoy/include/envoy/grpc/_virtual_includes/async_client_interface/envoy/grpc/async_client.h:172                                                                                                   
#10 0x0000000000db4b2d in Envoy::Grpc::AsyncStreamImpl::onData (this=0x3ca7d60, data=..., end_stream=false) at external/envoy/source/common/grpc/async_client_impl.cc:142
#11 0x0000000000db82b4 in Envoy::Http::AsyncStreamImpl::encodeData (this=0x3ace700, data=..., end_stream=false) at external/envoy/source/common/http/async_client_impl.cc:104
#12 0x0000000000e16ab3 in Envoy::Http::Http2::ConnectionImpl::onFrameReceived (this=0x3ca0790, frame=0x3cceb98) at external/envoy/source/common/http/http2/codec_impl.cc:540
#13 0x0000000000e2742f in session_call_on_frame_received (session=<optimized out>, frame=<optimized out>) at /home/jblatt/.cache/bazel/_bazel_root/b6258e8f3eabbcfebfab41d31f03811a/sandbox/processwrapper-sandbox/352/execroot/__main__/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:3295                                                                                       
#14 nghttp2_session_on_data_received (session=0x3ccea00, frame=0x3cceb98) at /home/jblatt/.cache/bazel/_bazel_root/b6258e8f3eabbcfebfab41d31f03811a/sandbox/processwrapper-sandbox/352/execroot/__main__/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:4964                                                                                                                       
#15 0x0000000000e29257 in session_process_data_frame (session=<optimized out>) at /home/jblatt/.cache/bazel/_bazel_root/b6258e8f3eabbcfebfab41d31f03811a/sandbox/processwrapper-sandbox/352/execroot/__main__/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:4984                                                                                                                  
#16 nghttp2_session_mem_recv (session=0x3ccea00, in=0x3d2e213 "cal\020\004\"\002\b\n0\004R\a\n\005*\003\b\200\b\312\001o\nm\n\005istio\022d\nb\n\006config\022X\032V/apis/networking/v1alpha3/namespaces/istio-system/destination-rule/default-enable-mtls\022\230\002\n(type.googleapis.com/envoy.api.v2.Cluster\022\353\001\nfoutbound_.16091_"..., inlen=<optimized out>)                  
    at /home/jblatt/.cache/bazel/_bazel_root/b6258e8f3eabbcfebfab41d31f03811a/sandbox/processwrapper-sandbox/352/execroot/__main__/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:6605                                                                                                                                                                                             
#17 0x0000000000e16189 in Envoy::Http::Http2::ConnectionImpl::dispatch (this=0x3ca0790, data=...) at external/envoy/source/common/http/http2/codec_impl.cc:369
#18 0x0000000000dab8d6 in Envoy::Http::CodecClient::onData (this=0x3a92990, data=...) at external/envoy/source/common/http/codec_client.cc:116
#19 0x0000000000dac2cd in Envoy::Http::CodecClient::CodecReadFilter::onData (this=<optimized out>, data=...) at bazel-out/k8-opt/bin/external/envoy/source/common/http/_virtual_includes/codec_client_lib/common/http/codec_client.h:172                                                                                                                                                      
#20 0x0000000000c5528a in Envoy::Network::FilterManagerImpl::onContinueReading (this=0x3cf2c20, filter=<optimized out>) at external/envoy/source/common/network/filter_manager_impl.cc:56
#21 Envoy::Network::FilterManagerImpl::onRead (this=0x3cf2c20) at external/envoy/source/common/network/filter_manager_impl.cc:66
#22 0x0000000000c52479 in Envoy::Network::ConnectionImpl::onRead (this=<optimized out>, read_buffer_size=<optimized out>) at external/envoy/source/common/network/connection_impl.cc:271
#23 Envoy::Network::ConnectionImpl::onReadReady (this=0x3cf2c00) at external/envoy/source/common/network/connection_impl.cc:505
#24 0x0000000000c51f41 in Envoy::Network::ConnectionImpl::onFileEvent (this=0x3cf2c00, events=<optimized out>) at external/envoy/source/common/network/connection_impl.cc:481
#25 0x0000000000c4c9e0 in std::__1::function<void (unsigned int)>::operator()(unsigned int) const (this=0x392b2d0, __arg=3) at /usr/lib/llvm-7/bin/../include/c++/v1/functional:1913
#26 Envoy::Event::FileEventImpl::assignEvents(unsigned int)::$_0::operator()(int, short, void*) const (what=<optimized out>, arg=0x392b250, this=<optimized out>) at external/envoy/source/common/event/file_event_impl.cc:64                                                                                                                                                                 
#27 Envoy::Event::FileEventImpl::assignEvents(unsigned int)::$_0::__invoke(int, short, void*) (what=<optimized out>, arg=0x392b250) at external/envoy/source/common/event/file_event_impl.cc:48                                                                                                                                                                                               
#28 0x0000000000fae97d in event_process_active_single_queue ()
#29 0x0000000000facf30 in event_base_loop ()
#30 0x0000000000c4bf2f in Envoy::Event::DispatcherImpl::run (this=<optimized out>, type=Envoy::Event::Dispatcher::RunType::Block) at external/envoy/source/common/event/dispatcher_impl.cc:167
#31 0x0000000000bfe8a0 in Envoy::Server::InstanceImpl::run (this=0x396c500) at external/envoy/source/server/server.cc:482
#32 0x00000000008a8ea9 in Envoy::MainCommonBase::run (this=0x39524d0) at external/envoy/source/exe/main_common.cc:117
#33 0x00000000008a8052 in Envoy::MainCommon::run (this=0x3952000) at bazel-out/k8-opt/bin/external/envoy/source/exe/_virtual_includes/envoy_main_common_lib/exe/main_common.h:88
#34 main (argc=18, argv=<optimized out>) at external/envoy/source/exe/main.cc:39
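
As an aside, the this=0x228 in frame #3 (matching the "suspect faulting address 0x228" in the original report) is the classic signature of a member access through a null object pointer: the shared_ptr member sits at a fixed offset inside its enclosing object, so when the enclosing pointer is null, operator*() runs with this equal to that offset. A minimal illustration (the names and the 0x228 offset are hypothetical, not the actual Envoy layout):

#include <memory>

struct Cluster {};

// Hypothetical enclosing type; 0x228 stands in for whatever offset the
// shared_ptr member happens to have in the real object.
struct ClusterEntry {
  char other_members_[0x228];
  std::shared_ptr<Cluster> cluster_;
};

int main() {
  ClusterEntry* entry = nullptr;
  // &entry->cluster_ evaluates to offset 0x228 from a null base, so
  // operator*() is entered with this == 0x228 and the read faults there.
  Cluster& c = *entry->cluster_;  // undefined behavior: segfaults
  (void)c;
}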

@jplevyak
Contributor

Potential fix: istio/envoy#103

@jplevyak
Contributor

The crash in this core seems to be caused by calling ClusterManagerInitHelper::removeCluster with a smart pointer to a Cluster which is nullptr, which in turn is caused by cluster_manager_impl.cc:1258, where we use an "or" on a pair of iterators and then dereference one of them.

The dereferenced nullptr (Cluster&) is not used if state_ == State::AllClustersInitialized, but the compiler has the option to load through the nullptr when it converts the smart pointer to a reference via *, and that could be standard-library dependent. This issue seems to arise with libc++ but not libstdc++. It is not surprising that the behavior differs, because it depends on the optimizations the compiler performs, which in turn depend on the specifics of the smart pointer implementation and the environment in which it is used.
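
To make the failure mode concrete, here is a minimal sketch of the pattern described above (all names are hypothetical; this is not the actual Envoy code). The callee reads the reference only conditionally, but binding a Cluster& through a null smart pointer is already undefined behavior, and whether the generated code performs an eager load through the null pointer can differ between standard libraries and optimization levels:

#include <map>
#include <memory>
#include <string>

struct Cluster { std::string name_; };

// Hypothetical per-cluster bookkeeping entry.
struct ClusterData {
  std::shared_ptr<Cluster> cluster_;
};
using ClusterMap = std::map<std::string, std::unique_ptr<ClusterData>>;

enum class State { WaitingForClusters, AllClustersInitialized };
State state_ = State::AllClustersInitialized;

void removeCluster(Cluster& cluster) {
  if (state_ == State::AllClustersInitialized) {
    return;  // the (possibly invalid) reference is never read on this path
  }
  // ... other paths would actually read `cluster` ...
}

void onClusterRemoval(const std::string& name, ClusterMap& active,
                      ClusterMap& warming) {
  auto existing_active = active.find(name);
  auto existing_warming = warming.find(name);
  // The "or" on a pair of iterators: either lookup may have missed, yet
  // one of them is dereferenced unconditionally below.
  if (existing_active != active.end() || existing_warming != warming.end()) {
    // If only `warming` matched, this chain walks through an invalid
    // iterator and a null smart pointer. Whether the compiled code emits
    // an eager load here (and faults) depends on inlining and the smart
    // pointer implementation, hence libc++ crashing where libstdc++
    // happened not to.
    removeCluster(*existing_active->second->cluster_);
  }
}

The fix discussed below (istio/envoy#105, a cherry-pick of envoyproxy/envoy#8106) removes this pattern.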

@gotwarlost
Author

@jplevyak @duderino - great findings. Thanks for working so closely with us to figure out the root cause.

@lizan
Contributor

lizan commented Sep 12, 2019

@jplevyak as I commented in the PR, let's do a full cherry-pick of envoyproxy/envoy#8106?

@jplevyak
Contributor

Updated istio/envoy#105 to be a cherry-pick of the upstream fix.

@jplevyak
Contributor

Fixed in 1.1, 1.2, 1.3 and upstream.

@duderino
Contributor

Fixes will be in 1.1.15, 1.2.6, 1.3.1

@fpesce and @rlenglet

@andraxylia
Contributor

andraxylia commented Sep 13, 2019

Funny, these # frame numbers in the backtraces from the logs are interpreted as references to old istio PRs; one of my first PRs just got referenced :).

@rlenglet
Contributor

rlenglet commented Sep 13, 2019

This is not done, at least for release-1.3. A couple more action items:

  • Please update istio/proxy:release-1.3 and istio/istio:release-1.3 to pick up the commit.
  • If possible, please add an entry into the 1.3.1 release note draft.

@rlenglet rlenglet reopened this Sep 13, 2019
@rlenglet rlenglet modified the milestones: 1.3, 1.4 Sep 13, 2019
@duderino
Contributor

@jplevyak please make sure the fix is in master too, else 1.4 will regress

@duderino
Contributor

1.1.15 and 1.2.6 are out. 1.3.1 should be out this week or early next at the latest (@rlenglet)

@jacexh

jacexh commented Sep 19, 2019

This issue still occurred after upgrading to version 1.2.6:

[2019-09-19 01:31:46.826][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:81] Caught Segmentation fault, suspect faulting address 0x228
[2019-09-19 01:31:46.826][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:69] Backtrace (use tools/stack_decode.py to get line numbers):
[2019-09-19 01:31:46.826][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #0: __restore_rt [0x7f3650349390]
[2019-09-19 01:31:46.829][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #1: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0xcf4a4d]
[2019-09-19 01:31:46.832][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #2: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0xcf41e1]
[2019-09-19 01:31:46.835][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #3: Envoy::Config::GrpcMuxSubscriptionImpl::onConfigUpdate() [0xe12841]
[2019-09-19 01:31:46.837][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #4: Envoy::Config::GrpcMuxImpl::onDiscoveryResponse() [0xe0ff19]
[2019-09-19 01:31:46.840][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #5: Envoy::Grpc::TypedAsyncStreamCallbacks<>::onReceiveMessageUntyped() [0xe10d6e]
[2019-09-19 01:31:46.843][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #6: Envoy::Grpc::AsyncStreamImpl::onData() [0xe3ae9d]
[2019-09-19 01:31:46.845][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #7: Envoy::Http::AsyncStreamImpl::encodeData() [0xe3e814]
[2019-09-19 01:31:46.848][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #8: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0xeae3cd]
[2019-09-19 01:31:46.851][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #9: nghttp2_session_on_data_received [0xebe97f]
[2019-09-19 01:31:46.853][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #10: nghttp2_session_mem_recv [0xec0785]
[2019-09-19 01:31:46.856][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #11: Envoy::Http::Http2::ConnectionImpl::dispatch() [0xeada89]
[2019-09-19 01:31:46.860][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #12: Envoy::Http::CodecClient::onData() [0xe32106]
[2019-09-19 01:31:46.862][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #13: Envoy::Http::CodecClient::CodecReadFilter::onData() [0xe32afd]
[2019-09-19 01:31:46.865][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #14: Envoy::Network::FilterManagerImpl::onRead() [0xcb7c29]
[2019-09-19 01:31:46.868][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #15: Envoy::Network::ConnectionImpl::onReadReady() [0xcb4c2c]
[2019-09-19 01:31:46.871][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #16: Envoy::Network::ConnectionImpl::onFileEvent() [0xcb46f1]
[2019-09-19 01:31:46.873][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #17: Envoy::Event::FileEventImpl::assignEvents()::$_0::__invoke() [0xcaf110]
[2019-09-19 01:31:46.876][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #18: event_process_active_single_queue [0x10536ed]
[2019-09-19 01:31:46.878][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #19: event_base_loop [0x1051ca0]
[2019-09-19 01:31:46.881][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #20: Envoy::Event::DispatcherImpl::run() [0xcae09f]
[2019-09-19 01:31:46.884][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #21: Envoy::Server::InstanceImpl::run() [0xc6a773]
[2019-09-19 01:31:46.886][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #22: Envoy::MainCommonBase::run() [0x8f405d]
[2019-09-19 01:31:46.889][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #23: main [0x8f3308]
[2019-09-19 01:31:46.889][127][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #24: [0x7f364f57a830]
2019-09-19T01:31:46.905202Z	warn	Epoch 2 terminated with an error: signal: segmentation fault
2019-09-19T01:31:46.905256Z	warn	Aborted all epochs
2019-09-19T01:31:46.905319Z	info	Epoch 2: set retry delay to 200ms, budget to 9
2019-09-19T01:31:47.105457Z	info	Reconciling retry (budget 9)
2019-09-19T01:31:47.105536Z	info	Epoch 0 starting
2019-09-19T01:31:47.106806Z	info	Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster toolbox.qa --service-node sidecar~10.244.22.132~toolbox-base-6cbfff58fb-nndtt.qa~qa.svc.cluster.local --max-obj-name-len 189 --local-address-ip-version v4 --allow-unknown-fields -l warning --component-log-level misc:error --concurrency 2]
[2019-09-19 01:31:47.131][140][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 14, no healthy upstream
[2019-09-19 01:31:47.131][140][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:49] Unable to establish new stream
2019-09-19T01:31:48.402401Z	info	Envoy proxy is NOT ready: 2 errors occurred:
	* failed checking application ports. listeners="0.0.0.0:15090"
	* envoy missing listener for inbound application port: 8080


2019-09-19T01:31:50.402725Z	info	Envoy proxy is NOT ready: 2 errors occurred:
	* failed checking application ports. listeners="0.0.0.0:15090"
	* envoy missing listener for inbound application port: 8080


2019-09-19T01:31:52.401686Z	info	Envoy proxy is NOT ready: 2 errors occurred:
	* failed checking application ports. listeners="0.0.0.0:15090"
	* envoy missing listener for inbound application port: 8080


2019-09-19T01:31:54.401764Z	info	Envoy proxy is NOT ready: 2 errors occurred:
	* failed checking application ports. listeners="0.0.0.0:15090"
	* envoy missing listener for inbound application port: 8080


2019-09-19T01:31:56.401867Z	info	Envoy proxy is NOT ready: 2 errors occurred:
	* failed checking application ports. listeners="0.0.0.0:15090"
	* envoy missing listener for inbound application port: 8080

client version: 1.2.6
egressgateway version: 94746ccd404a8e056483dd02e4e478097b950da6-dirty
galley version: 1.2.6
ingressgateway version: 1.2.6
pilot version: 1.2.6
policy version: 1.2.6
sidecar-injector version: 1.2.6
telemetry version: 1.2.6

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

@duderino
Contributor

@jacexh I walked through all the SHAs in the 1.2.6 release and verified that istio/envoy@52a3903 is included in it.

When you upgraded to 1.2.6, did you complete this step? https://istio.io/docs/setup/upgrade/steps/#sidecar-upgrade

@duderino
Contributor

1.3.1 is out. If there are any more segfaults with similar backtrace, please reopen

@JRBANCEL

JRBANCEL commented Oct 8, 2019

Here is what we see in istio-1.3.1-12c4d01552363362b49fce86f83b847b308b9b0d-Clean:

[2019-10-08 16:27:26.275][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:81] Caught Segmentation fault, suspect faulting address 0x0
[2019-10-08 16:27:26.275][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:69] Backtrace (use tools/stack_decode.py to get line numbers):
[2019-10-08 16:27:26.275][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #0: __restore_rt [0x7fde03bcc390]
[2019-10-08 16:27:26.282][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #1: std::__1::__function::__func<>::operator()() [0x1ac08a6]
[2019-10-08 16:27:26.289][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #2: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x1aab376]
[2019-10-08 16:27:26.295][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #3: event_process_active_single_queue [0x1d73871]
[2019-10-08 16:27:26.302][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #4: event_base_loop [0x1d72360]
[2019-10-08 16:27:26.308][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #5: Envoy::Server::WorkerImpl::threadRoutine() [0x1aa434c]
[2019-10-08 16:27:26.315][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #6: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::$_0::__invoke() [0x1fdf053]
[2019-10-08 16:27:26.315][36][critical][backtrace] [envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #7: start_thread [0x7fde03bc26ba]

@JRBANCEL

JRBANCEL commented Oct 8, 2019

/reopen

@duderino
Contributor

duderino commented Oct 8, 2019

@jplevyak PTAL, the backtrace is different

@jplevyak
Contributor

jplevyak commented Oct 8, 2019

We should open this as a different issue. Since the backtrace is different, it likely has a different root cause, and the historical information here would be deceptive.

@JRBANCEL

JRBANCEL commented Oct 8, 2019

I opened a new issue: #17699.
