
package-server pod keeps crashing #598

Closed · jsafrane opened this issue Dec 4, 2018 · 9 comments


jsafrane commented Dec 4, 2018

package-server panics regularly:

k -n olm logs -f package-server-68948c8996-rjhqt
time="2018-12-04T13:44:41Z" level=info msg="Using in-cluster kube client config"
time="2018-12-04T13:44:41Z" level=info msg="package-server configured to watch namespaces []"
time="2018-12-04T13:44:41Z" level=info msg="Using in-cluster kube client config"
time="2018-12-04T13:44:41Z" level=info msg="connection established. cluster-version: v1.14.0-alpha.0.654+6f0bd529b0c1b6-dirty"
time="2018-12-04T13:44:41Z" level=info msg="operator ready"
time="2018-12-04T13:44:41Z" level=info msg="starting informers..."
time="2018-12-04T13:44:41Z" level=info msg="waiting for caches to sync..."
I1204 13:44:41.306000       1 reflector.go:202] Starting reflector *v1alpha1.CatalogSource (5m0s) from github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:97
I1204 13:44:41.306032       1 reflector.go:240] Listing and watching *v1alpha1.CatalogSource from github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:97
time="2018-12-04T13:44:41Z" level=info msg="olm/rh-operators added"
I1204 13:44:41.406209       1 shared_informer.go:123] caches populated
time="2018-12-04T13:44:41Z" level=info msg="starting workers..."
time="2018-12-04T13:44:41Z" level=info msg="getting from queue" key=olm/rh-operators queue=catsrc
[restful] 2018/12/04 13:44:41 log.go:33: [restful/swagger] listing is available at https://:5443/swaggerapi
[restful] 2018/12/04 13:44:41 log.go:33: [restful/swagger] https://:5443/swaggerui/ is mapped to folder /swagger-ui/
I1204 13:44:41.468606       1 serve.go:96] Serving securely on [::]:5443
I1204 13:44:44.297924       1 wrap.go:42] GET /healthz: (2.97766ms) 200 [[kube-probe/1.14+] 172.17.0.1:34082]
I1204 13:44:48.481561       1 wrap.go:42] GET /healthz: (89.312µs) 200 [[kube-probe/1.14+] 172.17.0.1:34092]
I1204 13:44:49.515358       1 authorization.go:73] Forbidden: "/", Reason: ""
I1204 13:44:49.515550       1 wrap.go:42] GET /: (3.272649ms) 403 [[Go-http-client/2.0] 192.168.121.204:51824]
I1204 13:44:49.521427       1 authorization.go:73] Forbidden: "/", Reason: ""
I1204 13:44:49.521565       1 wrap.go:42] GET /: (262.421µs) 403 [[Go-http-client/2.0] 192.168.121.204:51824]
I1204 13:44:50.033601       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1: (1.813112ms) 200 [[Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0] 192.168.121.204:51856]
I1204 13:44:50.861271       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1?timeout=32s: (1.392206ms) 200 [[hyperkube/v1.14.0 (linux/amd64) kubernetes/6f0bd52/controller-discovery] 192.168.121.204:51856]
I1204 13:44:52.783994       1 wrap.go:42] GET /openapi/v2: (17.134802ms) 304 [[] 192.168.121.204:51856]
I1204 13:44:54.308413       1 wrap.go:42] GET /healthz: (3.333876ms) 200 [[kube-probe/1.14+] 172.17.0.1:34132]
I1204 13:44:57.919663       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1?timeout=32s: (4.722212ms) 200 [[hyperkube/v1.14.0 (linux/amd64) kubernetes/6f0bd52/system:serviceaccount:kube-system:resourcequota-controller] 192.168.121.204:51856]
I1204 13:44:58.180481       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1/packagemanifests?labelSelector=%21olm-visibility&limit=250: (13.400315ms) 200 [[Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0] 192.168.121.204:51856]
I1204 13:44:58.315275       1 get.go:245] Starting watch for /apis/packages.apps.redhat.com/v1alpha1/packagemanifests, rv= labels=!olm-visibility fields= timeout=58m12.916358481s
I1204 13:44:58.491744       1 wrap.go:42] GET /healthz: (231.533µs) 200 [[kube-probe/1.14+] 172.17.0.1:34162]
I1204 13:44:59.105104       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1/packagemanifests?labelSelector=%21olm-visibility&watch=true: (791.871669ms) hijacked [[Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0] 192.168.121.204:51888]
I1204 13:45:02.528263       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1/packagemanifests?labelSelector=%21olm-visibility&limit=250: (1.487231ms) 200 [[Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0] 192.168.121.204:51856]
I1204 13:45:02.590947       1 get.go:245] Starting watch for /apis/packages.apps.redhat.com/v1alpha1/packagemanifests, rv= labels=!olm-visibility fields= timeout=49m56.208095793s
I1204 13:45:04.307826       1 wrap.go:42] GET /healthz: (184.978µs) 200 [[kube-probe/1.14+] 172.17.0.1:34196]
I1204 13:45:05.797241       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1?timeout=32s: (4.075613ms) 200 [[hyperkube/v1.14.0 (linux/amd64) kubernetes/6f0bd52/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.121.204:51856]
panic: close of closed channel
goroutine 242 [running]:
github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/provider.(*InMemoryProvider).Subscribe.func1(0xc4203de2a0, 0xc4202d7400, 0x1, 0x1, 0x1)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/provider/inmem.go:302 +0xcb
created by github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/provider.(*InMemoryProvider).Subscribe
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/provider/inmem.go:296 +0x27c
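
For context on the panic: "close of closed channel" is the runtime error Go raises when a channel is closed twice, typically because two goroutines (or two teardown paths) race to close the same channel. The snippet below is only an illustrative sketch, not the actual inmem.go code; the subscription type and Close method are hypothetical, and sync.Once is one common guard against the double close:

package main

import "sync"

// subscription is a hypothetical stand-in for the provider's per-subscriber state.
type subscription struct {
    events    chan string
    closeOnce sync.Once // guards the channel against being closed twice
}

// Close may be called from multiple paths (e.g. a watch goroutine and a
// shutdown path); without the sync.Once, the second call would panic with
// "close of closed channel".
func (s *subscription) Close() {
    s.closeOnce.Do(func() {
        close(s.events)
    })
}

func main() {
    s := &subscription{events: make(chan string)}

    var wg sync.WaitGroup
    wg.Add(2)
    // Two goroutines racing to tear down the same subscription.
    for i := 0; i < 2; i++ {
        go func() {
            defer wg.Done()
            s.Close() // safe to call twice because of closeOnce
        }()
    }
    wg.Wait()
}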

I installed OLM this way:

  1. run Kubernetes using hack/local-up-cluster.sh
  2. install OLM using kubectl create -f deploy/upstream/manifests/latest/ (with today's OLM master)
  3. run the console using ./scripts/run_console_local.sh
  4. tune up permissions; see #597 (OLM GUI does not have permission to list its OLM CRDs)
  5. randomly click around in the console; creating a Subscription hits the panic pretty regularly.
@brian-avery

I'm seeing the same issue using the OKD manifests on oc cluster up.


ecordell commented Dec 5, 2018

Thanks for the bug report! We have been working on our CI and some things slipped through. This is fixed in master, but we haven't yet cut a release that includes the fix.


brian-avery commented Dec 5, 2018

Hi @ecordell, I'm still seeing this with the 0.8.0 manifests built from the latest master (7afcd1e, today at 3:22 EST). I believe it's the same issue, but I'm recording the stack trace here in case it's useful: https://pastebin.com/raw/bqyTF7Q6. Let me know if you'd like me to move it to a separate issue.


njhale commented Dec 7, 2018

@brian-avery The images in the 0.8.0 manifests haven't been updated yet to include the fix for the sporadic panic. I believe this will be resolved when we cut a new release.

@smarterclayton

Did this problem carry over to configmap-registry-server?

Seeing this in master (last night):

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/2718

fail [github.com/openshift/origin/test/extended/operators/cluster.go:109]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-operator-lifecycle-manager/rh-operators-2fg4m is not healthy: Back-off 5m0s restarting failed container=configmap-registry-server pod=rh-operators-2fg4m_openshift-operator-lifecycle-manager(e33f6b50-063e-11e9-ad9d-12b083f00f36)",
    ]
to be empty

@ecordell

@smarterclayton does it stay crashed? I haven't seen it fail permanently, but it's designed to crash quickly on start if it can't find what it needs. It should resolve itself; if it doesn't, that's a new bug.

One way it can fail is if the catalog data is incorrect, so that’s the most likely culprit.
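
For reference, the "crash quickly on start" behavior described above is the usual fail-fast pattern: validate the catalog data at startup and exit non-zero so the kubelet restarts the container (with backoff) until the data is correct. A minimal sketch under that assumption, not the actual configmap-registry-server code; loadCatalog and the path are hypothetical placeholders:

package main

import (
    "log"
    "os"
)

// loadCatalog is a hypothetical placeholder for reading and validating the
// catalog data the registry server needs (e.g. mounted from a ConfigMap).
func loadCatalog(path string) error {
    _, err := os.ReadFile(path)
    return err
}

func main() {
    // Fail fast: if the catalog data is missing or invalid, exit non-zero
    // immediately. Kubernetes restarts the container with backoff, so the pod
    // recovers on its own once the catalog data is fixed.
    if err := loadCatalog("/catalog/data.yaml"); err != nil {
        log.Fatalf("could not load catalog data, exiting: %v", err)
    }
    log.Println("catalog data loaded; starting registry server")
    // ... start serving ...
}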


smarterclayton commented Dec 27, 2018 via email


ecordell commented Dec 28, 2018

@smarterclayton We just merged an e2e test to verify that the rh-operators pod starts up and doesn't crashloop: #643

It passed OLM's e2e suite (with the new test) and the e2e-aws suite, which checks for crashlooping pods. But I wasn't able to replicate the failure at all; the new tests always passed, locally and in CI.

The error itself looks like OLM is generating a bad role/rolebinding for the rh-operators pod that it creates. I've only ever seen that specific error once, on a branch of OLM, and it was resolved before the branch was merged into master (it was a repeatable bug in our CI). I mention this because this isn't the first time it has looked like non-master-branch OLM code ended up in OpenShift's master, so I'm wondering if there may be some bugs in the way images get tagged into releases.

@ecordell

The latest release (0.8.1) contains all of the package-server fixes for the issues reported here.
