
Unable to communicate with RabbitMQ outside the mesh #15896

Closed
crhuber opened this issue Jul 29, 2019 · 5 comments
Labels: area/networking, lifecycle/automatically-closed, lifecycle/stale

Comments


crhuber commented Jul 29, 2019

Bug description

We have a number of RabbitMQ consumer pods running with the Istio sidecar injected. These pods should consume messages from RabbitMQ running as a hosted service on CloudAMQP. As soon as a pod starts, the consumer container never receives any messages, even though the queue has messages waiting to be consumed. When we turn off sidecar injection, messages are consumed as expected. Connectivity to CloudAMQP appears to be broken when Istio is running.

Oddly, when we start a second instance of the RabbitMQ consumer process inside the same container that had connectivity problems while the sidecar is running, we are then able to connect and consume messages.

To troubleshoot, we created a ServiceEntry resource, but it did not seem to have any impact.

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cloudamqp-external-mesh
spec:
  hosts:
  - cloudamqp.fqdn.tld
  ports:
  - name: rabbitmq
    number: 5672
    protocol: TCP
  location: MESH_EXTERNAL
  resolution: NONE

I confirmed the outbound traffic policy is ALLOW_ANY:

kubectl get configmap istio -n istio-system -o yaml | grep -o "mode: ALLOW_ANY"
mode: ALLOW_ANY

Here are stats from the sidecar:

cluster.xds-grpc.assignment_stale: 0
cluster.xds-grpc.assignment_timeout_received: 0
cluster.xds-grpc.bind_errors: 0
cluster.xds-grpc.circuit_breakers.default.cx_open: 0
cluster.xds-grpc.circuit_breakers.default.cx_pool_open: 0
cluster.xds-grpc.circuit_breakers.default.rq_open: 0
cluster.xds-grpc.circuit_breakers.default.rq_pending_open: 0
cluster.xds-grpc.circuit_breakers.default.rq_retry_open: 0
cluster.xds-grpc.circuit_breakers.high.cx_open: 0
cluster.xds-grpc.circuit_breakers.high.cx_pool_open: 0
cluster.xds-grpc.circuit_breakers.high.rq_open: 0
cluster.xds-grpc.circuit_breakers.high.rq_pending_open: 0
cluster.xds-grpc.circuit_breakers.high.rq_retry_open: 0
cluster.xds-grpc.http2.header_overflow: 0
cluster.xds-grpc.http2.headers_cb_no_stream: 0
cluster.xds-grpc.http2.rx_messaging_error: 0
cluster.xds-grpc.http2.rx_reset: 0
cluster.xds-grpc.http2.too_many_header_frames: 0
cluster.xds-grpc.http2.trailers: 0
cluster.xds-grpc.http2.tx_reset: 0
cluster.xds-grpc.internal.upstream_rq_200: 5
cluster.xds-grpc.internal.upstream_rq_2xx: 5
cluster.xds-grpc.internal.upstream_rq_503: 1
cluster.xds-grpc.internal.upstream_rq_5xx: 1
cluster.xds-grpc.internal.upstream_rq_completed: 6
cluster.xds-grpc.lb_healthy_panic: 1
cluster.xds-grpc.lb_local_cluster_not_ok: 0
cluster.xds-grpc.lb_recalculate_zone_structures: 0
cluster.xds-grpc.lb_subsets_active: 0
cluster.xds-grpc.lb_subsets_created: 0
cluster.xds-grpc.lb_subsets_fallback: 0
cluster.xds-grpc.lb_subsets_fallback_panic: 0
cluster.xds-grpc.lb_subsets_removed: 0
cluster.xds-grpc.lb_subsets_selected: 0
cluster.xds-grpc.lb_zone_cluster_too_small: 0
cluster.xds-grpc.lb_zone_no_capacity_left: 0
cluster.xds-grpc.lb_zone_number_differs: 0
cluster.xds-grpc.lb_zone_routing_all_directly: 0
cluster.xds-grpc.lb_zone_routing_cross_zone: 0
cluster.xds-grpc.lb_zone_routing_sampled: 0
cluster.xds-grpc.max_host_weight: 1
cluster.xds-grpc.membership_change: 1
cluster.xds-grpc.membership_degraded: 0
cluster.xds-grpc.membership_excluded: 0
cluster.xds-grpc.membership_healthy: 1
cluster.xds-grpc.membership_total: 1
cluster.xds-grpc.original_dst_host_invalid: 0
cluster.xds-grpc.retry_or_shadow_abandoned: 0
cluster.xds-grpc.update_attempt: 15
cluster.xds-grpc.update_empty: 0
cluster.xds-grpc.update_failure: 0
cluster.xds-grpc.update_no_rebuild: 14
cluster.xds-grpc.update_success: 15
cluster.xds-grpc.upstream_cx_active: 1
cluster.xds-grpc.upstream_cx_close_notify: 2
cluster.xds-grpc.upstream_cx_connect_attempts_exceeded: 0
cluster.xds-grpc.upstream_cx_connect_fail: 0
cluster.xds-grpc.upstream_cx_connect_timeout: 0
cluster.xds-grpc.upstream_cx_destroy: 0
cluster.xds-grpc.upstream_cx_destroy_local: 0
cluster.xds-grpc.upstream_cx_destroy_local_with_active_rq: 0
cluster.xds-grpc.upstream_cx_destroy_remote: 0
cluster.xds-grpc.upstream_cx_destroy_remote_with_active_rq: 4
cluster.xds-grpc.upstream_cx_destroy_with_active_rq: 4
cluster.xds-grpc.upstream_cx_http1_total: 0
cluster.xds-grpc.upstream_cx_http2_total: 5
cluster.xds-grpc.upstream_cx_idle_timeout: 0
cluster.xds-grpc.upstream_cx_max_requests: 0
cluster.xds-grpc.upstream_cx_none_healthy: 1
cluster.xds-grpc.upstream_cx_overflow: 0
cluster.xds-grpc.upstream_cx_pool_overflow: 0
cluster.xds-grpc.upstream_cx_protocol_error: 0
cluster.xds-grpc.upstream_cx_rx_bytes_buffered: 69
cluster.xds-grpc.upstream_cx_rx_bytes_total: 52078227
cluster.xds-grpc.upstream_cx_total: 5
cluster.xds-grpc.upstream_cx_tx_bytes_buffered: 0
cluster.xds-grpc.upstream_cx_tx_bytes_total: 12457743
cluster.xds-grpc.upstream_flow_control_backed_up_total: 0
cluster.xds-grpc.upstream_flow_control_drained_total: 0
cluster.xds-grpc.upstream_flow_control_paused_reading_total: 0
cluster.xds-grpc.upstream_flow_control_resumed_reading_total: 0
cluster.xds-grpc.upstream_internal_redirect_failed_total: 0
cluster.xds-grpc.upstream_internal_redirect_succeeded_total: 0
cluster.xds-grpc.upstream_rq_200: 5
cluster.xds-grpc.upstream_rq_2xx: 5
cluster.xds-grpc.upstream_rq_503: 1
cluster.xds-grpc.upstream_rq_5xx: 1
cluster.xds-grpc.upstream_rq_active: 1
cluster.xds-grpc.upstream_rq_cancelled: 0
cluster.xds-grpc.upstream_rq_completed: 6
cluster.xds-grpc.upstream_rq_maintenance_mode: 0
cluster.xds-grpc.upstream_rq_pending_active: 0
cluster.xds-grpc.upstream_rq_pending_failure_eject: 4
cluster.xds-grpc.upstream_rq_pending_overflow: 0
cluster.xds-grpc.upstream_rq_pending_total: 5
cluster.xds-grpc.upstream_rq_per_try_timeout: 0
cluster.xds-grpc.upstream_rq_retry: 0
cluster.xds-grpc.upstream_rq_retry_overflow: 0
cluster.xds-grpc.upstream_rq_retry_success: 0
cluster.xds-grpc.upstream_rq_rx_reset: 0
cluster.xds-grpc.upstream_rq_timeout: 0
cluster.xds-grpc.upstream_rq_total: 5
cluster.xds-grpc.upstream_rq_tx_reset: 0
cluster.xds-grpc.version: 0
cluster_manager.active_clusters: 530
cluster_manager.cds.update_attempt: 47
cluster_manager.cds.update_failure: 4
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 42
cluster_manager.cds.version: 8476332205929295747
cluster_manager.cluster_added: 530
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 1472
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0
http_mixer_filter.total_check_cache_hit_accepts: 0
http_mixer_filter.total_check_cache_hit_denies: 0
http_mixer_filter.total_check_cache_hits: 0
http_mixer_filter.total_check_cache_misses: 0
http_mixer_filter.total_check_calls: 0
http_mixer_filter.total_quota_cache_hit_accepts: 0
http_mixer_filter.total_quota_cache_hit_denies: 0
http_mixer_filter.total_quota_cache_hits: 0
http_mixer_filter.total_quota_cache_misses: 0
http_mixer_filter.total_quota_calls: 0
http_mixer_filter.total_remote_call_cancellations: 0
http_mixer_filter.total_remote_call_other_errors: 0
http_mixer_filter.total_remote_call_retries: 0
http_mixer_filter.total_remote_call_send_errors: 0
http_mixer_filter.total_remote_call_successes: 0
http_mixer_filter.total_remote_call_timeouts: 0
http_mixer_filter.total_remote_calls: 0
http_mixer_filter.total_remote_check_accepts: 0
http_mixer_filter.total_remote_check_calls: 0
http_mixer_filter.total_remote_check_denies: 0
http_mixer_filter.total_remote_quota_accepts: 0
http_mixer_filter.total_remote_quota_calls: 0
http_mixer_filter.total_remote_quota_denies: 0
http_mixer_filter.total_remote_quota_prefetch_calls: 0
http_mixer_filter.total_remote_report_calls: 595
http_mixer_filter.total_remote_report_other_errors: 0
http_mixer_filter.total_remote_report_send_errors: 0
http_mixer_filter.total_remote_report_successes: 595
http_mixer_filter.total_remote_report_timeouts: 0
http_mixer_filter.total_report_calls: 1042
listener_manager.lds.update_attempt: 47
listener_manager.lds.update_failure: 4
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 42
listener_manager.lds.version: 8476332205929295747
listener_manager.listener_added: 96
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 192
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.total_listeners_active: 96
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0
server.concurrency: 2
server.days_until_first_cert_expiring: 86
server.debug_assertion_failures: 0
server.hot_restart_epoch: 0
server.live: 1
server.memory_allocated: 49567936
server.memory_heap_size: 79437824
server.parent_connections: 0
server.total_connections: 5
server.uptime: 4478
server.version: 7825363
server.watchdog_mega_miss: 0
server.watchdog_miss: 0
tcp_mixer_filter.total_check_cache_hit_accepts: 0
tcp_mixer_filter.total_check_cache_hit_denies: 0
tcp_mixer_filter.total_check_cache_hits: 0
tcp_mixer_filter.total_check_cache_misses: 0
tcp_mixer_filter.total_check_calls: 0
tcp_mixer_filter.total_quota_cache_hit_accepts: 0
tcp_mixer_filter.total_quota_cache_hit_denies: 0
tcp_mixer_filter.total_quota_cache_hits: 0
tcp_mixer_filter.total_quota_cache_misses: 0
tcp_mixer_filter.total_quota_calls: 0
tcp_mixer_filter.total_remote_call_cancellations: 0
tcp_mixer_filter.total_remote_call_other_errors: 0
tcp_mixer_filter.total_remote_call_retries: 0
tcp_mixer_filter.total_remote_call_send_errors: 0
tcp_mixer_filter.total_remote_call_successes: 0
tcp_mixer_filter.total_remote_call_timeouts: 0
tcp_mixer_filter.total_remote_calls: 0
tcp_mixer_filter.total_remote_check_accepts: 0
tcp_mixer_filter.total_remote_check_calls: 0
tcp_mixer_filter.total_remote_check_denies: 0
tcp_mixer_filter.total_remote_quota_accepts: 0
tcp_mixer_filter.total_remote_quota_calls: 0
tcp_mixer_filter.total_remote_quota_denies: 0
tcp_mixer_filter.total_remote_quota_prefetch_calls: 0
tcp_mixer_filter.total_remote_report_calls: 29
tcp_mixer_filter.total_remote_report_other_errors: 0
tcp_mixer_filter.total_remote_report_send_errors: 0
tcp_mixer_filter.total_remote_report_successes: 29
tcp_mixer_filter.total_remote_report_timeouts: 0
tcp_mixer_filter.total_report_calls: 29
cluster.xds-grpc.upstream_cx_connect_ms: P0(nan,0) P25(nan,0) P50(nan,0) P75(nan,0) P90(nan,0) P95(nan,0) P99(nan,0) P99.5(nan,0) P99.9(nan,0) P100(nan,0)
cluster.xds-grpc.upstream_cx_length_ms: P0(nan,290000) P25(nan,300000) P50(nan,540000) P75(nan,1.7e+06) P90(nan,1.76e+06) P95(nan,1.78e+06) P99(nan,1.796e+06) P99.5(nan,1.798e+06) P99.9(nan,1.7996e+06) P100(nan,1.8e+06)

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior

Pods that connect to RabbitMQ should have no connectivity issues when the Istio sidecar is running.

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version)

client version: 1.2.0
citadel version: 1.2.0
galley version: 1.2.0
ingressgateway version: 1.2.0
policy version: 1.2.0
sidecar-injector version: 1.2.0
telemetry version: 1.2.0

Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.7-eks-c57ff8", GitCommit:"c57ff8e35590932c652433fab07988da79265d5b", GitTreeState:"clean", BuildDate:"2019-06-07T20:43:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed?

Helm chart

Environment where bug was observed (cloud vendor, OS, etc)

Amazon AMI, Amazon EKS



crhuber commented Jul 29, 2019

I was able to work around this issue by setting this annotation on the deployment:

traffic.sidecar.istio.io/excludeOutboundPorts: "5672"

However, this is not a permanent solution.
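
For reference, a minimal sketch of where the annotation goes; it has to be set on the pod template (not only on the Deployment metadata) for the injector to pick it up. The deployment name and image below are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rabbitmq-consumer              # placeholder name
spec:
  selector:
    matchLabels:
      app: rabbitmq-consumer
  template:
    metadata:
      labels:
        app: rabbitmq-consumer
      annotations:
        # Tell the iptables redirection to skip the AMQP port, so outbound
        # traffic to 5672 bypasses the Envoy sidecar entirely.
        traffic.sidecar.istio.io/excludeOutboundPorts: "5672"
    spec:
      containers:
      - name: consumer
        image: example/rabbitmq-consumer:latest   # placeholder image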

rshriram (Member) commented

> Oddly, When we run a second process of the rabbitmq consumer within the original container that had connectivity problems while the sidecar is running, we are then able to connect and consume messages.

Does the consumer talk to itself by any chance? The above observation indicates that connectivity is working as intended. It could also be that the consumer is racing with Pilot (the consumer calls out before Pilot sends the config, or something strange of that sort).


crhuber commented Jul 31, 2019

@rshriram Yes, we found it to be a race condition where the consumer calls out before Pilot has sent the config. The consumer didn't have any retry logic and didn't handle the exception, so it was difficult to troubleshoot. Ultimately we found that making the application exit on the exception, so the container gets recreated, fixed the problem.

Is there a better approach to handling these race conditions?
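
One interim pattern is to delay the consumer's first connection until the sidecar reports ready. A minimal sketch of the container spec, assuming the pilot-agent status endpoint on port 15020 (the default in this Istio version) and an image that ships a shell and curl; the container name, image, and final entrypoint are placeholders:

spec:
  containers:
  - name: consumer                        # placeholder
    image: example/rabbitmq-consumer      # placeholder
    command: ["/bin/sh", "-c"]
    args:
    - |
      # Block until the Envoy sidecar reports ready, then start the consumer.
      until curl -fsS http://127.0.0.1:15020/healthz/ready > /dev/null; do
        echo "waiting for istio-proxy..."
        sleep 1
      done
      exec /app/consumer                  # placeholder for the real entrypoint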

howardjohn (Member) commented

We are working on improving the startup ordering problem -- see #11130

istio-policy-bot added the lifecycle/stale and lifecycle/needs-triage labels on Oct 30, 2019
istio-policy-bot commented

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2019-07-31. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

istio-policy-bot added the lifecycle/automatically-closed label on Feb 23, 2020