Ingress-GCE has a nil pointer exception #471

Closed
rramkumar1 opened this issue Sep 11, 2018 · 36 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@rramkumar1
Contributor

We are aware of a nil pointer issue in v1.3.2. This bug was fixed in #434, but the fix did not make it into the 1.3 release branch. Because the nil pointer dereference crashes the controller, the only symptom users see is that their Ingresses are not being synced.

The current workaround is to delete the Ingress that is not being synced and recreate it. A fix will be coming in the next 1.3 patch release (v1.3.3).
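A minimal sketch of the delete-and-recreate workaround, assuming the Ingress was originally created from a manifest file you still have. The function names, the `dry_run` flag, and the file name `ingress.yaml` are hypothetical; only `kubectl delete ingress` and `kubectl apply` are real commands.

```python
import subprocess


def recreate_commands(name, namespace, manifest_path):
    """Build the two kubectl invocations for the workaround.

    Assumption: manifest_path points at the original manifest used to
    create the Ingress, not at an export of the live object.
    """
    return [
        ["kubectl", "delete", "ingress", name, "--namespace", namespace],
        ["kubectl", "apply", "--filename", manifest_path],
    ]


def recreate_ingress(name, namespace, manifest_path, dry_run=True):
    """Delete the stuck Ingress, then recreate it from its manifest.

    With dry_run=True the commands are only printed, so they can be
    reviewed before anything is deleted.
    """
    for cmd in recreate_commands(name, namespace, manifest_path):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops before the apply if the delete fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder Ingress name, namespace, and manifest path.
    recreate_ingress("basic-ingress", "build", "ingress.yaml")
```

Defaulting to a dry run is a deliberate choice here, since deleting an Ingress tears down the load balancer it created.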

@rramkumar1
Contributor Author

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 11, 2018
@abstrctn

Is there a way to verify whether this is occurring in a GKE cluster? Since I don't think we have access to the logs, we can't check for the exception listed in #434, but we'd like to make sure this is the issue before recreating Ingresses.

@rramkumar1
Contributor Author

Unfortunately, no. If you are able to update your Ingress and see the changes reflected in GCP, then you should be fine. Otherwise, you are most likely hitting this issue. Also note that this is only happening in GKE clusters above version 1.10.6.

@poor-bob

I'm pulling my hair out trying to figure out why our ingresses suddenly stopped being fulfilled by the ingress controllers. Normally I've found a very reasonable explanation (Quotas, etc.), but this time I'm relatively sure we're running into this bug.

kubernetes master version: 1.10.6-gke.2

We've tried deleting every ingress and recreating them, to no avail. Is there a time period I should wait before recreating the ingresses? I waited roughly 5 minutes this first time.

@rramkumar1
Contributor Author

@poor-bob Email me your project name, cluster name and location of the cluster and I'll take a look. If you deleted and recreated every ingress I would think that you would not be running into this specific issue.

@addisonbair

@rramkumar1
I'm experiencing the same issue with 1.10.6-gke.2

I have disabled the default GKE loadbalancer-controller and installed this version JUST to see logs. Indeed I am experiencing this issue:

E0913 19:46:08.868155       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

If I delete and recreate each and every ingress from my project, can I expect to get past this nil pointer dereference issue?
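For anyone else who runs their own controller instance to get logs, here is a rough sketch of scanning those logs for this panic. The message text is copied from the log line above. The layout of the prefix (severity letter, MMDD date, time, process ID, `file:line`) is an assumption inferred from that single line, and the function name is hypothetical.

```python
import re

# Matches lines shaped like the one quoted above. The prefix layout is an
# assumption based on that one sample, not a documented format guarantee.
PANIC_RE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<source>[\w.]+:\d+)\] Observed a panic: "
    r'"invalid memory address or nil pointer dereference"'
)


def find_nil_pointer_panics(lines):
    """Return (date, time, source) for every line reporting the panic."""
    hits = []
    for line in lines:
        match = PANIC_RE.match(line)
        if match:
            hits.append(
                (match.group("date"), match.group("time"), match.group("source"))
            )
    return hits


if __name__ == "__main__":
    import sys

    # Usage: pipe the controller logs in on stdin.
    for date, time, source in find_nil_pointer_panics(sys.stdin):
        print(f"panic on {date} at {time} (logged from {source})")
```

If this finds nothing while Ingresses are still stuck, the cause is probably something other than this bug.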

@rramkumar1
Contributor Author

@addisonbair Theoretically yes. Since you installed another instance to get logs, you should be able to find out if that indeed works for you.

@addisonbair

Not related to the nil pointer issue, but my default backend disappears (both the service and deployment) without a trace:

ingress-gce/deploy/glbc on  master [!] at ☸️ gke_remesh-stage_us-east1-b_stage
➜ kubectl describe svc default-http-backend -n kube-system
Name:                     default-http-backend
Namespace:                kube-system
Labels:                   addonmanager.kubernetes.io/mode=Reconcile
                          k8s-app=glbc
                          kubernetes.io/cluster-service=true
                          kubernetes.io/name=GLBCDefaultBackend
Annotations:              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"Reconcile","k8s-app":"glbc","kubernetes.i...
Selector:                 k8s-app=glbc
Type:                     NodePort
IP:                       10.47.247.24
Port:                     http  80/TCP
TargetPort:               8080/TCP
NodePort:                 http  30668/TCP
Endpoints:                10.44.21.65:8080
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

ingress-gce/deploy/glbc on  master [!] at ☸️ gke_remesh-stage_us-east1-b_stage
➜ kubectl describe svc default-http-backend -n kube-system
Error from server (NotFound): services "default-http-backend" not found

Is there any way I can debug this?

@rramkumar1
Contributor Author

@addisonbair Can you file a separate issue for that and explain how exactly you are using the script in deploy/glbc?

@addisonbair

Will do. 👍

I believe I have a fix and in the process uncovered a possible bug with the yaml manifests.

Since I don't have access to the masters (GKE), I can't be completely sure, but there appears to be a conflict between the Addon-manager running on the master and the labels on the objects within deploy/glbc/yaml/default-http-backend.yaml. After changing the label from addonmanager.kubernetes.io/mode: Reconcile to addonmanager.kubernetes.io/mode: EnsureExists, the Addon-manager no longer deletes these objects.
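A small sketch of that change as a text substitution, assuming the manifest spells the setting exactly as `addonmanager.kubernetes.io/mode: Reconcile`. The function name is hypothetical, and this is plain string replacement rather than real YAML parsing, so quoted values or unusual spacing would be missed.

```python
OLD_MODE = "addonmanager.kubernetes.io/mode: Reconcile"
NEW_MODE = "addonmanager.kubernetes.io/mode: EnsureExists"


def switch_addon_mode(manifest_text):
    """Replace every Reconcile mode setting with EnsureExists.

    Returns the rewritten text and the number of replacements, so the
    caller can tell whether anything actually changed.
    """
    count = manifest_text.count(OLD_MODE)
    return manifest_text.replace(OLD_MODE, NEW_MODE), count


if __name__ == "__main__":
    # Path taken from the comment above; adjust it to your checkout.
    path = "deploy/glbc/yaml/default-http-backend.yaml"
    with open(path, encoding="utf-8") as handle:
        updated, replaced = switch_addon_mode(handle.read())
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(updated)
    print(f"updated {replaced} occurrence(s) in {path}")
```

The count is returned so that a result of zero makes it obvious the manifest did not contain the expected text.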

@addisonbair

@rramkumar1

After deleting all my Ingresses, I am unfortunately still seeing the NPE.

Is there a known working pre-release image that I can use?

Thank you!!

@rramkumar1
Contributor Author

rramkumar1 commented Sep 13, 2018

@addisonbair We are in the process of building a patch with the fix and pushing it out. This will enable you to start testing the fix. Keep in mind that this does not mean it is released in GKE. You will still have to wait for an official GKE rollout and upgrade your cluster to get the fix.

Will let you know when the patch is ready to pull down.

@addisonbair

Awesome. Thank you!

@addisonbair

Just a quick update:

I managed to build an image from master and have successfully deployed to GKE (1.10.6-gke.2) without seeing the dreaded NPE. All ingresses are back up and operational.

I'm happy to test a more formal image when it is ready. Thanks so much for the help!

@rramkumar1
Contributor Author

rramkumar1 commented Sep 13, 2018

@addisonbair Thanks, that's great to hear! We just pushed v1.3.3 so please let us know if that works as well.

This would be the version that would officially be rolled out as part of a new GKE version.

@addisonbair

@rramkumar1

Built, pushed and deployed v1.3.3 on my 1.10.6-gke.2 cluster and it works perfectly! No more NPE.

Thanks so much!

@rramkumar1
Contributor Author

Quick update for all tracking this issue. Hopefully the GKE rollout for the fix ends this week. I will ping this thread with the GKE version everyone should upgrade to once rollout is complete.

@cerealcable

I figured I would comment to clarify this for anyone else who is new to k8s, since I was and it wasn't clear: you need to delete the LB as well as the associated services. Once I did that and recreated the services and Ingress, I was able to work around this bug. It's definitely not a great long-term solution, but it works until the patch is ready.

Thanks @rramkumar1 for your assistance and confirmation of my issue!

@laupow

laupow commented Sep 24, 2018

The GKE team released a new version, 1.10.7-gke.2, which fixes the issue of stuck Ingress resources.

@rramkumar1
Contributor Author

@laupow Thanks for the update. The fix should be rolled out as part of 1.10.7-gke.2 and 1.11.2-gke.4.

/close
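To check a cluster version against the two fixed versions named above, something like the following sketch could work. It assumes GKE version strings always look like `1.10.7-gke.2` and that comparing the numeric parts in order is a valid ordering. The function names are hypothetical, and release lines other than 1.10 and 1.11 return `None` because this thread does not say which of their versions carry the fix.

```python
import re

# First fixed version on each release line, as stated in this thread.
FIXED_VERSIONS = {
    (1, 10): (1, 10, 7, 2),  # 1.10.7-gke.2
    (1, 11): (1, 11, 2, 4),  # 1.11.2-gke.4
}

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version):
    """Turn '1.10.7-gke.2' into the tuple (1, 10, 7, 2)."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"not a GKE version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def includes_fix(version):
    """Return True or False for 1.10/1.11 versions, None for other lines."""
    parsed = parse_gke_version(version)
    fixed = FIXED_VERSIONS.get(parsed[:2])
    if fixed is None:
        return None  # Unknown release line; the thread gives no data.
    return parsed >= fixed


if __name__ == "__main__":
    for candidate in ("1.10.6-gke.2", "1.10.7-gke.2", "1.11.5-gke.5"):
        print(candidate, includes_fix(candidate))
```

A version that passes this check but still shows stuck Ingresses is likely hitting a different problem.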

@k8s-ci-robot
Contributor

@rramkumar1: Closing this issue.

In response to this:

@laupow Thanks for the update. The fix should be rolled out as part of 1.10.7-gke.2 and 1.11.2-gke.4.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ericuldall

ericuldall commented Sep 24, 2018

@rramkumar1 Are you sure this fix is live with 1.10.7-gke.2? I got this response from GCP Support last week:

This is to provide you an update that I checked with the product team on whether GKE version (1.10.7-gke.2) have the ingress fix and I came to know that it won't.

@rramkumar1
Contributor Author

@ericuldall I'm not sure why GCP support told you that. They may have gotten confused about something else. Do you not see 1.10.7-gke.2 as a viable version?

@ericuldall

I see it available, just unclear if the fix is actually deployed to that version or not.

@rramkumar1
Contributor Author

Yes, the fix is available in that version.

@ericuldall

Yes, I deployed it and my ingress was updated :D thanks for confirming!

@bschwartz757

@rramkumar1 I have a cluster running on 1.10.6-gke.2 and I replaced one of the ingresses, then it got stuck in 'creating ingress'. I just found this thread this morning, and accordingly, deleted and then re-created the ingress but it's still showing up as 'creating ingress' in the GCP dashboard. Any ideas?

@rramkumar1
Contributor Author

@bschwartz757 You can upgrade to 1.10.7-gke.2. See the discussion above.

@bschwartz757

@rramkumar1 ok..... anything that doesn't involve upgrading?

@rramkumar1
Contributor Author

rramkumar1 commented Sep 26, 2018

@bschwartz757 Upgrading is the only supported way to get these kinds of fixes. If you don't want to upgrade, you can also run the script we have in deploy/. Note that this script is somewhat dangerous to run in production (and as a result, we don't officially support it), but it does allow you to change the version of your ingress-gce controller without depending on GKE for upgrades.

@Arconapalus

@rramkumar1 Just letting you know that I am also running into this issue on 1.11.5-gke.5. The Ingress is stuck in 'creating ingress' even though I have deleted and recreated it. Should I delete the nodes and cluster and recreate them at 1.10.7-gke.2?

@rramkumar1
Contributor Author

@Arconapalus This issue is already fixed for that version so you might be running into a separate issue.

Can you please file a separate issue for this?

@Arconapalus

Arconapalus commented Jan 11, 2019

@rramkumar1 Yes I can.
#605

@kdeng

kdeng commented Jan 16, 2019

@rramkumar1 I am also experiencing this issue on 1.11.6-gke.2. When I look at the Ingress's events, there are no messages at all.

My Ingress manifest is pretty simple:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: basic-ingress
  namespace: build
spec:
  rules:
  - http:
      paths:
      - path: /jenkins
        backend:
          serviceName: jenkins-ui
          servicePort: 8080
      - path: /nexus
        backend:
          serviceName: nexus-ui
          servicePort: 8081

@rramkumar1
Contributor Author

@kdeng Did you take a look at #605?

@surykatka

surykatka commented Jan 21, 2019

@rramkumar1 I'm having a similar issue to the one you have described in #605. I have sent you an e-mail with my setup but I'm also happy to continue the conversation online.
