Document how to avoid 502s #34

Open
bowei opened this Issue Oct 11, 2017 · 23 comments

bowei commented Oct 11, 2017

From @esseti on September 20, 2017 9:22

Hello,
I have a problem with the ingress: the 502 page pops up when there are "several" requests. With JMeter spinning 10 threads 20 times, I get more than 50 502s out of 2000 calls in total (about 2.5%).

Reading the README, it says this error is probably due to:

The loadbalancer is probably bootstrapping itself.

But the load balancer is already there, so does that mean all the pods serving that URL are busy? Is there a way to avoid the 502 by waiting for a pod to become free?

If not, is there a way to customize the 502 page? I expose APIs in JSON format, and I would like to return a JSON error rather than an HTML page.

Copied from original issue: kubernetes/ingress-nginx#1396

bowei commented Oct 11, 2017

From @nicksardo on September 26, 2017 16:12

https://serverfault.com/questions/849230/is-there-a-way-to-use-customized-502-page-for-load-balancer
This is a question for GCP, not the ingress controller. Though, I suggest you investigate why you're getting 502s.

bowei commented Oct 11, 2017

From @esseti on September 29, 2017 9:53

Regarding the repeated 502s, I found out it's due to how long the LB keeps the connection alive vs. the keepalive timeout the container provides. It's explained here (point 3): https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340

In my case I also added a timeout of 5s to the probe; I'm not sure, but that solved the 502s (I'm using uwsgi).
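For reference, the probe timeout mentioned above would look roughly like this in a pod spec (a minimal sketch; the path and port are hypothetical placeholders for your app):

```yaml
# Sketch of a readiness probe with a 5s timeout (names and ports are illustrative).
readinessProbe:
  httpGet:
    path: /            # must return HTTP 200 for the backend to be marked healthy
    port: 8080
  timeoutSeconds: 5    # allow slow responses (e.g. from uwsgi) before failing the probe
  periodSeconds: 10
  failureThreshold: 3
```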

montanaflynn commented Oct 27, 2017

I also get Google's 502 HTML error page and would like to understand why, and how to avoid it or customize the response. The backend pods have been running without restarting, but still maybe 1 in 1000 requests returns a 502. Using GKE with an ingress that sends traffic to an API pod running nginx.

esseti commented Oct 27, 2017

@montanaflynn have you tried this (point 3)? https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340 (I also had to increase the probe timeout; my comments are on that page.)
I've solved the 502 problem; I'm still not able to customize the page, but that's not part of Kubernetes.

mikesparr commented Oct 28, 2017

Using GKE 1.7.8 (google cloud)

I'm getting these too, and the backend services show 2/2 cluster health and green. Using Ingress with kube-lego and gce for TLS provisioning. One app served by the ingress (name-based virtual hosts from a single ingress) works fine, but the other app returns 502 on every other request, which is blocking QA. There were zero issues with either app during initial QA when behind a LoadBalancer service. I changed to NodePort services, added an Ingress with TLS in front of them, and am now plagued with 502 errors.

I updated the liveness and readiness probes, and confirmed 200 responses both at the probe URI and at /, but am still getting these errors.

Redacted Ingress Config

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: staging-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "kubernetes-ingress-stg"
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "gce"
spec:
  tls:
  - hosts:
    - eval.redacted-site1.com
    - eval.redacted-site2.com
    secretName: legacy-tls
  rules:
  - host: eval.redacted-site1.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: site1-app
          servicePort: 80
  - host: eval.redacted-site2.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: site2-app
          servicePort: 80
montanaflynn commented Oct 28, 2017

@esseti I tried increasing the timeout as suggested but still get 502s.

Also, like @mikesparr, we're using TLS with the ingress (not acme) and NodePort services.

mikesparr commented Oct 28, 2017

I increased the keepalive too, per the recommendation, and it didn't fix it.

fejta-bot commented Jan 26, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

lunemec commented Feb 23, 2018

We're also having these issues. Using NodePort on our services, TLS ingress with kube-lego.

We noticed the 502s right after this message showed up in kubectl get events:

2m          17d          2637     ingressID                                  Ingress                                                     Normal    Service                 loadbalancer-controller                            default backend set to serviceID

Any ideas how to figure out what is causing these 502's?

lunemec commented Feb 23, 2018

/remove-lifecycle stale

gylu commented Mar 30, 2018

I'm seeing these too. Using 1.8.6 with Kubefed trying to set up a federated Ingress (which I think I did get set up), but I now keep getting 502s, and there's nothing to log/debug except Stackdriver, which shows the 502s.

My backends occasionally show "unhealthy", though, for some reason...

Even though I've satisfied this requirement:
Services exposed through an Ingress must serve a response with HTTP 200 status to the GET requests on / path. This is used for health checking. If your application does not serve HTTP 200 on /, the backend will be marked unhealthy and will not get traffic.
https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer

mikesparr commented Mar 31, 2018

@nicksardo nicksardo changed the title [GCE] 502 page - how to avoid it and how to personalize it [GCE] Document how to avoid 502s May 4, 2018

@nicksardo nicksardo changed the title [GCE] Document how to avoid 502s Document how to avoid 502s May 4, 2018

petercgrant commented Jun 20, 2018

The simple way to avoid 502s: set up a cluster that hosts only your app and does not use preemptible nodes or cluster node-pool autoscaling. Schedule app downtime for node upgrades.

If you want to avoid 502s and also want cluster autoscaling, preemptible nodes, or zero downtime, you probably need to switch to the nginx ingress controller. Its L7 load balancer lives in the cluster and can respond faster and more proactively to events occurring in the cluster. The built-in retry logic also helps.

The GCE ingress controller creates an L7 load balancer that talks to a Kubernetes NodePort service. If you use the default settings for your service, externalTrafficPolicy will be set to Cluster, meaning every node will forward requests to the nodes that host the pods backing the service (which I'll just call your app). If you leave externalTrafficPolicy=Cluster, any node can cause 502s and timeouts, even if it is not running your app. Examples: you take an unrelated node down for an upgrade, even if you cordon/drain it properly; a random node in your cluster crashes (as @mikesparr noted, this could be a node out of memory); you use preemptible nodes; you have some nodes that are over-provisioned (traffic forwarding can be CPU starved?).

One final note about leaving externalTrafficPolicy=Cluster: the backend for your app will show as available with N instances, where N is the number of nodes in your cluster, and it will show as available in every zone where you have nodes. This is misleading because requests have to be served by the nodes hosting the pods, which can be an arbitrarily small subset of the cluster nodes. Maybe this situation will improve with network endpoint group support in the GCE ingress controller?

Setting externalTrafficPolicy=Local on the NodePort service prevents nodes from forwarding traffic to other nodes. Health checks will fail for any node not hosting your app, and your load balancer backend will show the proper number of nodes serving your app and the zones they're in. This removes the 502s caused by unrelated nodes, but you will still get them if a node hosting your app is unhealthy. We've also just introduced a new failure mode: if a pod dies on a node, your service is down on that node. You can fix this by ensuring at least two pods run on each node where your service lives. You may also want to set strategy.rollingUpdate.maxUnavailable=0 on your deployment so it creates new pods before deleting the old ones. The GCE ingress controller on GKE adds a minute to whatever health check interval I set, which is too slow to detect a dead node.
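The two settings above can be sketched as follows (a hedged sketch; the service/app names, ports, and image are placeholders, not from this thread):

```yaml
# Service: with externalTrafficPolicy=Local, only nodes actually running the
# pods pass the LB health check, so unrelated nodes can no longer cause 502s.
apiVersion: v1
kind: Service
metadata:
  name: my-app            # placeholder
spec:
  type: NodePort
  externalTrafficPolicy: Local   # don't forward traffic to other nodes
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
# Deployment: maxUnavailable=0 makes rollouts create new pods before
# deleting old ones, so a node never briefly serves zero pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # placeholder
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest   # placeholder
```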

fejta-bot commented Sep 18, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

metral commented Sep 18, 2018

/remove-lifecycle stale

wminshew commented Oct 11, 2018

This thread was helpful for (fingers crossed, for now) eliminating my 502s. Just in case they come back, though: is it possible to customize the response body? I've dug quite a bit without luck, so I'm guessing no, but asking here to be extra sure (I noticed the original title of this issue referenced personalizing 502s as well).

bowei commented Nov 6, 2018

/lifecycle frozen

acasademont commented Dec 3, 2018

I believe some of these issues will be solved by the new Network Endpoint Group load balancing (https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing), thanks to removing the kube-proxy network hop.
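For anyone who wants to try it, container-native load balancing is enabled per Service with an annotation (a sketch based on the linked GKE docs; the service name, selector, and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app   # placeholder
  annotations:
    cloud.google.com/neg: '{"ingress": true}'   # attach pods to the LB directly via NEGs
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
```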

Arconapalus commented Dec 12, 2018

Would you be able to deploy container-native load balancing alongside the nginx-ingress-controller?

stefanotto commented Feb 12, 2019

Hi, where exactly can I set the two NGINX settings described?

keepalive_timeout 650;
keepalive_requests 10000;

I have an Ingress based on the nginx-ingress-controller. How exactly can I pass these to the NGINX used in the image?
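In case it helps: with the nginx-ingress-controller these are usually set through the controller's ConfigMap rather than the Ingress resource (a sketch; I believe the keys are keep-alive and keep-alive-requests, and the ConfigMap name/namespace depend on how your controller was installed):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration   # must match the controller's --configmap flag
  namespace: ingress-nginx
data:
  keep-alive: "650"            # rendered as keepalive_timeout 650;
  keep-alive-requests: "10000" # rendered as keepalive_requests 10000;
```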

mikesparr commented Feb 12, 2019

mikesparr commented Feb 12, 2019

stefanotto commented Feb 12, 2019

Perfect. Thank you so much @mikesparr

@rramkumar1 rramkumar1 removed the backend/gce label Feb 20, 2019
