Intermittent routing failures with HTTPRoute #12610

Open
mgs255 opened this issue May 16, 2024 · 9 comments

mgs255 commented May 16, 2024

What is the issue?

We are currently running Linkerd edge-24.5.2 in our dev clusters and are using HTTPRoute objects to perform traffic splitting. We have been seeing intermittent request-routing failures when HTTPRoute objects are active: all requests to the parent service fail with 500 errors. We first noticed this behaviour when we upgraded from edge-24.3.2 to edge-24.3.4.

In this case we have an HTTPRoute set up as follows:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: cms-api
spec:
  parentRefs:
  - group: core
    kind: Service
    name: cms-api
    port: 80
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: cms-api
      port: 80
      weight: 100
    - group: ""
      kind: Service
      name: cms-api-canary
      port: 8080
      weight: 100
    matches:
    - path:
        type: PathPrefix
        value: /

When this occurs, all requests to the parent service fail with 500 errors. Deleting or editing the HTTPRoute object temporarily restores the service to a working state (a sketch of the workaround is below).
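
For reference, a rough sketch of the manual workaround, assuming the resources live in a namespace called my-namespace (substitute your own); either deleting the route or forcing an edit with a throwaway annotation so that it gets reprocessed has been enough to restore traffic until the problem recurs:

# Remove the HTTPRoute entirely; traffic falls back to the parent Service.
# The fully-qualified resource name avoids ambiguity with Linkerd's own policy.linkerd.io HTTPRoute CRD.
kubectl delete httproutes.gateway.networking.k8s.io cms-api -n my-namespace

# Or force a no-op edit (the annotation key is arbitrary) so the route is reprocessed.
kubectl annotate httproutes.gateway.networking.k8s.io cms-api -n my-namespace \
  workaround/touched="$(date +%s)" --overwrite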

How can it be reproduced?

As mentioned, this has only occurred intermittently, but we started noticing it when we upgraded from edge-24.3.2 to edge-24.3.4.

Logs, error output, etc

When this starts to fail, we see the following error messages logged by the linkerd-proxy of the calling service, in this case a cms-api-gateway pod:

cms-api-gateway  logical service 172.20.77.119:80: route HTTPRoute.<namespace>.cms-api: backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found cms-api-canary HTTP/1.1 request failed
cms-api-gateway  logical service 172.20.77.119:80: route HTTPRoute.<namespace>.cms-api: backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found cms-api HTTP/1.1 request failed

Output of linkerd check -o short:

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.5.2 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.5.2 but cli running edge-24.5.3
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-5696fcb958-45nzv (edge-24.5.2)
	* linkerd-destination-5696fcb958-4wvnp (edge-24.5.2)
	* linkerd-destination-5696fcb958-b4d8h (edge-24.5.2)
	* linkerd-destination-5696fcb958-t7855 (edge-24.5.2)
	* linkerd-identity-5d976755c8-ppvcs (edge-24.5.2)
	* linkerd-identity-5d976755c8-v58mb (edge-24.5.2)
	* linkerd-identity-5d976755c8-w27wr (edge-24.5.2)
	* linkerd-identity-5d976755c8-x8tzh (edge-24.5.2)
	* linkerd-proxy-injector-6d665bf87f-48fc7 (edge-24.5.2)
	* linkerd-proxy-injector-6d665bf87f-f49kv (edge-24.5.2)
	* linkerd-proxy-injector-6d665bf87f-h4585 (edge-24.5.2)
	* linkerd-proxy-injector-6d665bf87f-h5mph (edge-24.5.2)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-5696fcb958-45nzv running edge-24.5.2 but cli running edge-24.5.3
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-7984d467f9-qmh2f (edge-24.5.2)
	* prometheus-678b478c56-kkpqn (edge-24.5.1)
	* tap-54c784f7ff-rpqms (edge-24.5.2)
	* tap-injector-677cc4795d-wlxq7 (edge-24.5.2)
	* web-7b88487cf-hck7v (edge-24.5.2)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-7984d467f9-qmh2f running edge-24.5.2 but cli running edge-24.5.3
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints
× viz extension self-check
    Error calling Prometheus from the control plane: Query failed: "max(process_start_time_seconds{}) by (pod, namespace)": server_error: server error: 504
    see https://linkerd.io/2/checks/#l5d-viz-metrics-api for hints

Status check results are ×

We do have Prometheus metrics, but Prometheus runs outside the cluster, which is likely why the viz self-check above fails.

Environment

  • Kubernetes: 1.29.3
  • Environment: EKS v1.29.3-eks-adc7111
  • Host: macOS 14.4.1 (23E224)
  • Linkerd version: edge-24.5.2

Possible solution

No response

Additional context

I've enabled debug logging on the service that was previously failing and will attach additional context if/when I have it.

Would you like to work on fixing this bug?

None

adleong (Member) commented May 16, 2024

Hi @mgs255! I've been trying to reproduce this but I haven't seen the behavior you described. Of course, I don't have your full cms-api environment, but I've set up some cms-api and cms-api-canary services which point to emojivoto and bb in order to test your HTTPRoute.

More specifically, I installed emojivoto:

linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -

and then applied the following manifests in the emojivoto namespace:

---
apiVersion: v1
kind: Service
metadata:
  name: cms-api
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: web-svc
---
apiVersion: v1
kind: Service
metadata:
  name: cms-api-canary
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: bb
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bb
  template:
    metadata:
      labels:
        app: bb
      annotations:
        linkerd.io/inject: "enabled"
    spec:
      containers:
      - name: app
        image: buoyantio/bb:v0.0.6
        args:
        - terminus
        - "--h1-server-port=8080"
        - "--response-text=hello"
        ports:
        - containerPort: 8080
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: cms-api
spec:
  parentRefs:
  - group: core
    kind: Service
    name: cms-api
    port: 80
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: cms-api
      port: 80
      weight: 100
    - group: ""
      kind: Service
      name: cms-api-canary
      port: 8080
      weight: 100
    matches:
    - path:
        type: PathPrefix
        value: /

Then, by exec'ing into a shell in an injected pod in the emojivoto namespace, I can curl the parent service:

curl cms-api:80

As expected, this will return a response from emojivoto half the time and bb half the time, but the response is always a 200.
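
For anyone else trying this, a minimal sketch of that check, assuming the pod you exec into has sh and curl available (the pod and container names are placeholders):

# Send ten requests to the parent service and print only the status codes.
kubectl exec -n emojivoto <pod> -c <app-container> -- sh -c \
  'for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" http://cms-api:80/; done'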

I think we have 3 potential avenues to explore next:

  1. If you can follow my steps above and see the same behavior that I did, you may be able to identify where my test differs from your environment and help narrow down what the problem might be.
  2. If you can get full logs (ideally at debug level) from the Linkerd sidecar proxy of the calling service (cms-api-gateway) while it is experiencing the 500s, there may be a clue in the logs as to what is happening; see the sketch after this list.
  3. If you can give more information about your full environment, it may help me to reproduce the issue. What do the cms-api and cms-api-canary services look like? Are all of these resources (the services, the HTTPRoute, and the calling service) in the same namespace, or do they cross namespace boundaries?
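
A minimal sketch of how those logs could be captured, assuming the calling workload is a Deployment named cms-api-gateway (substitute your namespace):

# Capture recent logs from the Linkerd sidecar of the calling workload.
kubectl logs -n <namespace> deploy/cms-api-gateway -c linkerd-proxy --since=1h > proxy.log

# Or follow the logs live while reproducing the failure.
kubectl logs -n <namespace> deploy/cms-api-gateway -c linkerd-proxy -f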

mgs255 (Author) commented May 17, 2024

Hi @adleong

I've managed to reproduce the issue with debug logging enabled in our environment. As I mentioned previously, it is very intermittent; it ran for around 30 hours before the failure occurred. I've shared the gzipped log file with you via the Linkerd Slack as a DM attachment.

All 3 services and the HTTPRoute are located in the same namespace.

Hopefully this will help you get to the root cause. If there is anything else I can do to help, please let me know.

Thanks!

adleong (Member) commented May 21, 2024

Hi @mgs255! Thanks so much for those logs, they were super helpful. After some investigation, I think this may be due to a race condition related to how backend services are processed by the policy controller. See: #12635

Ideally, this fix will be released in this week's edge release for you to test.

mgs255 (Author) commented May 22, 2024

@adleong that is great news, speedy work! We will keep an eye out for the next edge release.

I notice that as part of that change you added some new debug logs in this area of the proxy. If we were to keep debug logging enabled while we evaluate the release, could you give me a steer as to the log levels we should target for this policy code? I'm not very familiar with Rust's log level specification, and globally setting debug produces very verbose output.

adleong (Member) commented May 29, 2024

I typically run with a log level of linkerd=debug,info when I'm debugging, which will cause linkerd modules to log at debug level but all other modules to log at info level. It's possible to refine and narrow this further, but that should be a good starting point.
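
If it helps, that level can be applied to a single workload with the config.linkerd.io/proxy-log-level annotation on the pod template. A rough sketch, assuming the calling workload is a Deployment named cms-api-gateway (the namespace is a placeholder); note that patching the pod template triggers a rollout:

# Set the proxy log level on the workload's pod template via a merge patch.
kubectl patch deploy cms-api-gateway -n <namespace> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd=debug,info"}}}}}'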

wmorgan (Member) commented Jun 4, 2024

@mgs255 did you get a chance to try the latest edge release with this fix? edge-24.5.4 or later should have this.

@andrewdinunzio

We have been seeing this issue as well; we will try the fix.

mgs255 (Author) commented Jun 5, 2024

@wmorgan @adleong Yes, indeed! We have been running this version in all of our production environments for at least a week now. We have alerts set up to monitor for the backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found error, and we have not seen any recurrences of that error so far. 🤞 (A quick manual spot-check for the same signature is sketched below.)
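
For anyone wanting to do a similar manual spot-check, a rough sketch, assuming the meshed workloads live in a single namespace (the namespace name is a placeholder):

# Check every linkerd-proxy container in the namespace for the failure signature.
# Pods without a linkerd-proxy container will error, hence 2>/dev/null.
for pod in $(kubectl get pods -n <namespace> -o name); do
  kubectl logs -n <namespace> "$pod" -c linkerd-proxy --since=24h 2>/dev/null \
    | grep -q "default.fail" && echo "$pod saw default.fail errors"
done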

Thank you both for the speedy turnaround on this. I'm happy to keep this open for a bit longer to let it soak, or to close this issue now; I'll let you decide.

wmorgan (Member) commented Jun 5, 2024

Awesome, thank you for the report @mgs255!
