
package-server pod keeps crashing #598

Closed · jsafrane opened this issue Dec 4, 2018 · 9 comments


jsafrane commented Dec 4, 2018

package-server panics regularly:

k -n olm logs -f package-server-68948c8996-rjhqt
time="2018-12-04T13:44:41Z" level=info msg="Using in-cluster kube client config"
time="2018-12-04T13:44:41Z" level=info msg="package-server configured to watch namespaces []"
time="2018-12-04T13:44:41Z" level=info msg="Using in-cluster kube client config"
time="2018-12-04T13:44:41Z" level=info msg="connection established. cluster-version: v1.14.0-alpha.0.654+6f0bd529b0c1b6-dirty"
time="2018-12-04T13:44:41Z" level=info msg="operator ready"
time="2018-12-04T13:44:41Z" level=info msg="starting informers..."
time="2018-12-04T13:44:41Z" level=info msg="waiting for caches to sync..."
I1204 13:44:41.306000       1 reflector.go:202] Starting reflector *v1alpha1.CatalogSource (5m0s) from github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:97
I1204 13:44:41.306032       1 reflector.go:240] Listing and watching *v1alpha1.CatalogSource from github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:97
time="2018-12-04T13:44:41Z" level=info msg="olm/rh-operators added"
I1204 13:44:41.406209       1 shared_informer.go:123] caches populated
time="2018-12-04T13:44:41Z" level=info msg="starting workers..."
time="2018-12-04T13:44:41Z" level=info msg="getting from queue" key=olm/rh-operators queue=catsrc
[restful] 2018/12/04 13:44:41 log.go:33: [restful/swagger] listing is available at https://:5443/swaggerapi
[restful] 2018/12/04 13:44:41 log.go:33: [restful/swagger] https://:5443/swaggerui/ is mapped to folder /swagger-ui/
I1204 13:44:41.468606       1 serve.go:96] Serving securely on [::]:5443
I1204 13:44:44.297924       1 wrap.go:42] GET /healthz: (2.97766ms) 200 [[kube-probe/1.14+] 172.17.0.1:34082]
I1204 13:44:48.481561       1 wrap.go:42] GET /healthz: (89.312µs) 200 [[kube-probe/1.14+] 172.17.0.1:34092]
I1204 13:44:49.515358       1 authorization.go:73] Forbidden: "/", Reason: ""
I1204 13:44:49.515550       1 wrap.go:42] GET /: (3.272649ms) 403 [[Go-http-client/2.0] 192.168.121.204:51824]
I1204 13:44:49.521427       1 authorization.go:73] Forbidden: "/", Reason: ""
I1204 13:44:49.521565       1 wrap.go:42] GET /: (262.421µs) 403 [[Go-http-client/2.0] 192.168.121.204:51824]
I1204 13:44:50.033601       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1: (1.813112ms) 200 [[Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0] 192.168.121.204:51856]
I1204 13:44:50.861271       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1?timeout=32s: (1.392206ms) 200 [[hyperkube/v1.14.0 (linux/amd64) kubernetes/6f0bd52/controller-discovery] 192.168.121.204:51856]
I1204 13:44:52.783994       1 wrap.go:42] GET /openapi/v2: (17.134802ms) 304 [[] 192.168.121.204:51856]
I1204 13:44:54.308413       1 wrap.go:42] GET /healthz: (3.333876ms) 200 [[kube-probe/1.14+] 172.17.0.1:34132]
I1204 13:44:57.919663       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1?timeout=32s: (4.722212ms) 200 [[hyperkube/v1.14.0 (linux/amd64) kubernetes/6f0bd52/system:serviceaccount:kube-system:resourcequota-controller] 192.168.121.204:51856]
I1204 13:44:58.180481       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1/packagemanifests?labelSelector=%21olm-visibility&limit=250: (13.400315ms) 200 [[Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0] 192.168.121.204:51856]
I1204 13:44:58.315275       1 get.go:245] Starting watch for /apis/packages.apps.redhat.com/v1alpha1/packagemanifests, rv= labels=!olm-visibility fields= timeout=58m12.916358481s
I1204 13:44:58.491744       1 wrap.go:42] GET /healthz: (231.533µs) 200 [[kube-probe/1.14+] 172.17.0.1:34162]
I1204 13:44:59.105104       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1/packagemanifests?labelSelector=%21olm-visibility&watch=true: (791.871669ms) hijacked [[Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0] 192.168.121.204:51888]
I1204 13:45:02.528263       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1/packagemanifests?labelSelector=%21olm-visibility&limit=250: (1.487231ms) 200 [[Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0] 192.168.121.204:51856]
I1204 13:45:02.590947       1 get.go:245] Starting watch for /apis/packages.apps.redhat.com/v1alpha1/packagemanifests, rv= labels=!olm-visibility fields= timeout=49m56.208095793s
I1204 13:45:04.307826       1 wrap.go:42] GET /healthz: (184.978µs) 200 [[kube-probe/1.14+] 172.17.0.1:34196]
I1204 13:45:05.797241       1 wrap.go:42] GET /apis/packages.apps.redhat.com/v1alpha1?timeout=32s: (4.075613ms) 200 [[hyperkube/v1.14.0 (linux/amd64) kubernetes/6f0bd52/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.121.204:51856]
panic: close of closed channel
goroutine 242 [running]:
github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/provider.(*InMemoryProvider).Subscribe.func1(0xc4203de2a0, 0xc4202d7400, 0x1, 0x1, 0x1)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/provider/inmem.go:302 +0xcb
created by github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/provider.(*InMemoryProvider).Subscribe
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/provider/inmem.go:296 +0x27c
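
For context on the panic: "close of closed channel" is the runtime error Go raises when a channel is closed twice, typically because two goroutines (or two teardown paths) race to close the same channel. The snippet below is only an illustrative sketch, not the actual inmem.go code; the subscription type and Close method are hypothetical, and sync.Once is one common guard against the double close:

package main

import "sync"

// subscription is a hypothetical stand-in for the provider's per-subscriber state.
type subscription struct {
    events    chan string
    closeOnce sync.Once // guards the channel against being closed twice
}

// Close may be called from multiple paths (e.g. a watch goroutine and a
// shutdown path); without the sync.Once, the second call would panic with
// "close of closed channel".
func (s *subscription) Close() {
    s.closeOnce.Do(func() {
        close(s.events)
    })
}

func main() {
    s := &subscription{events: make(chan string)}

    var wg sync.WaitGroup
    wg.Add(2)
    // Two goroutines racing to tear down the same subscription.
    for i := 0; i < 2; i++ {
        go func() {
            defer wg.Done()
            s.Close() // safe to call twice because of closeOnce
        }()
    }
    wg.Wait()
}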

I installed OLM this way:

  1. run Kubernetes using hack/local-up-cluster.sh
  2. install OLM using kubectl create -f deploy/upstream/manifests/latest/ (with today's OLM master)
  3. run the console using ./scripts/run_console_local.sh
  4. tune up permissions; see #597 (OLM GUI does not have permission to list its OLM CRDs)
  5. randomly click around in the console; creating a Subscription hits the panic pretty regularly.
@brian-avery

I'm seeing the same issue using the OKD manifests on oc cluster up.


ecordell commented Dec 5, 2018

Thanks for the bug report! We have been working on our CI and some things slipped through. This is fixed in master, but we haven't yet cut a release that includes the fix.


brian-avery commented Dec 5, 2018

Hi @ecordell, I'm still seeing this with the 0.8.0 manifests built from the latest master (7afcd1e, today at 3:22 EST). I believe it's the same issue, but I'm recording the stack trace here in case it's useful: https://pastebin.com/raw/bqyTF7Q6. Let me know if you'd like me to move it to a separate issue.


njhale commented Dec 7, 2018

@brian-avery The images in the 0.8.0 manifests haven't been updated yet to include the fix for the sporadic panic. I believe this will be resolved when we cut a new release.

@smarterclayton

Did this problem carry over to configmap-registry-server?

Seeing this in master (last night):

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/2718

fail [github.com/openshift/origin/test/extended/operators/cluster.go:109]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-operator-lifecycle-manager/rh-operators-2fg4m is not healthy: Back-off 5m0s restarting failed container=configmap-registry-server pod=rh-operators-2fg4m_openshift-operator-lifecycle-manager(e33f6b50-063e-11e9-ad9d-12b083f00f36)",
    ]
to be empty

@ecordell

@smarterclayton does it stay crashed? I haven't seen it fail permanently, but it's designed to crash quickly on start if it can't find what it needs. It should resolve itself; if it doesn't, that's a new bug.

One way it can fail is if the catalog data is incorrect, so that’s the most likely culprit.
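
For reference, the "crash quickly on start" behavior described above is the usual fail-fast pattern: validate the catalog data at startup and exit non-zero so the kubelet restarts the container (with backoff) until the data is correct. A minimal sketch under that assumption, not the actual configmap-registry-server code; loadCatalog and the path are hypothetical placeholders:

package main

import (
    "log"
    "os"
)

// loadCatalog is a hypothetical placeholder for reading and validating the
// catalog data the registry server needs (e.g. mounted from a ConfigMap).
func loadCatalog(path string) error {
    _, err := os.ReadFile(path)
    return err
}

func main() {
    // Fail fast: if the catalog data is missing or invalid, exit non-zero
    // immediately. Kubernetes restarts the container with backoff, so the pod
    // recovers on its own once the catalog data is fixed.
    if err := loadCatalog("/catalog/data.yaml"); err != nil {
        log.Fatalf("could not load catalog data, exiting: %v", err)
    }
    log.Println("catalog data loaded; starting registry server")
    // ... start serving ...
}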


smarterclayton commented Dec 27, 2018 via email


ecordell commented Dec 28, 2018

@smarterclayton We just merged an e2e test to verify that the rh-operators pod starts up and doesn't crashloop: #643

It passed OLM's e2e suite (with the new test) and the e2e-aws suite, which checks for crashlooping pods. But I wasn't able to replicate the failure at all; the new tests always passed, locally and in CI.

The error itself looks like OLM is generating a bad role/rolebinding for the rh-operators pod that it creates. I've only ever seen that specific error once, on a branch of OLM, and it was resolved before the branch was merged into master (it was a repeatable bug in our CI). I mention this because this isn't the first time it has looked like non-master-branch OLM code ended up in OpenShift's master, so I'm wondering if there may be some bugs in the way images get tagged into releases.

@ecordell

The latest release (0.8.1) contains all of the package-server fixes for the issues reported here.
