Deploy before runlevel 30 so machine API can get creds. #31

Merged: 2 commits into openshift:master, Feb 20, 2019

Conversation

@dgoodwin (Contributor)

Per: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/operators.md#how-do-i-get-added-as-a-special-run-level

Also removes the credentials definitions that were moved to component repos, or are unused.

@smarterclayton @enxebre @derekwaynecarr does this look OK? The manifests dir should show what we'll be trying to include in the release now.

@openshift-ci-robot added the size/M and approved labels (Feb 13, 2019)
@dgoodwin (Contributor, Author)

/test e2e-aws

@dgoodwin (Contributor, Author)

/test e2e-aws

@dgoodwin (Contributor, Author)

OK, I have an e2e-aws failure here that I do not understand, and it appears to only affect this PR, so it may be something in my change.

This PR changes the ordering in the CVO to make sure the credentials operator is up before the machine-api needs it. Previously we were at the default 0000_70; now we're at 0000_30_00 (to get in front of machine-api). You can see the new manifest here: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/31/pull-ci-openshift-cloud-credential-operator-master-e2e-aws/160/artifacts/release-latest/release-payload/

Adding another 00 to the prefix appears to be OK, as it's also done at level 0000_50.
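
For anyone following along, the ordering hinges on the lexical sort of the manifest file names in the release payload. Roughly the idea is a layout like the following (file names invented here purely for illustration, not the actual names in the payload):

```
0000_30_00_cloud-credential-operator_00_namespace.yaml   # this PR (hypothetical names)
0000_30_00_cloud-credential-operator_01_crd.yaml
0000_30_machine-api-operator_00_namespace.yaml           # existing machine-api manifests
0000_50_...                                              # other operators
0000_70_...                                              # where we previously lived by default
```

Whether the CVO actually serializes entries that share the 0000_30 runlevel, or runs them in parallel, is exactly the question raised below.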

The failure surfaces in the install logs:

time="2019-02-13T19:01:47Z" level=debug
time="2019-02-13T19:01:47Z" level=debug msg="Destroy complete! Resources: 11 destroyed."
time="2019-02-13T19:01:47Z" level=info msg="Waiting up to 30m0s for the cluster to initialize..."
time="2019-02-13T19:01:47Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-13T19:01:58Z" level=debug msg="Still waiting for the cluster to initialize: Could not update service \"openshift-cloud-credential-operator/controller-manager-service\" (84 of 273): the server has forbidden updates to this resource"
time="2019-02-13T19:31:47Z" level=fatal msg="failed to initialize the cluster: Could not update service \"openshift-cloud-credential-operator/controller-manager-service\" (84 of 273): the server has forbidden updates to this resource"

Note that this is my component.

I0213 19:30:54.611858       1 sync_worker.go:457] Running sync for clusteroperator "kube-controller-manager" (79 of 273)
I0213 19:30:55.615205       1 sync_worker.go:470] Done syncing for clusteroperator "kube-controller-manager" (79 of 273)
I0213 19:30:55.615269       1 task_graph.go:497] Running 7 on 1
I0213 19:30:55.615287       1 sync_worker.go:457] Running sync for namespace "openshift-machine-config-operator" (104 of 273)
I0213 19:30:55.615410       1 task_graph.go:497] Running 5 on 0
I0213 19:30:55.615435       1 sync_worker.go:457] Running sync for customresourcedefinition "credentialsrequests.cloudcredential.openshift.io" (80 of 273)
I0213 19:30:55.615605       1 task_graph.go:497] Running 6 on 2
I0213 19:30:55.615626       1 sync_worker.go:457] Running sync for namespace "openshift-machine-api" (87 of 273)
I0213 19:30:55.618005       1 sync_worker.go:470] Done syncing for namespace "openshift-machine-config-operator" (104 of 273)
I0213 19:30:55.618034       1 sync_worker.go:457] Running sync for customresourcedefinition "mcoconfigs.machineconfiguration.openshift.io" (105 of 273)
I0213 19:30:55.618838       1 sync_worker.go:470] Done syncing for namespace "openshift-machine-api" (87 of 273)
I0213 19:30:55.618865       1 sync_worker.go:457] Running sync for namespace "openshift-cluster-api" (88 of 273)
I0213 19:30:55.619653       1 sync_worker.go:470] Done syncing for customresourcedefinition "credentialsrequests.cloudcredential.openshift.io" (80 of 273)
I0213 19:30:55.619684       1 sync_worker.go:457] Running sync for namespace "openshift-cloud-credential-operator" (81 of 273)
I0213 19:30:55.621208       1 sync_worker.go:470] Done syncing for customresourcedefinition "mcoconfigs.machineconfiguration.openshift.io" (105 of 273)
I0213 19:30:55.621242       1 sync_worker.go:457] Running sync for configmap "openshift-machine-config-operator/machine-config-operator-images" (106 of 273)
I0213 19:30:55.622354       1 sync_worker.go:470] Done syncing for namespace "openshift-cluster-api" (88 of 273)
I0213 19:30:55.622384       1 sync_worker.go:457] Running sync for configmap "openshift-machine-api/machine-api-operator-images" (89 of 273)
I0213 19:30:55.622699       1 sync_worker.go:470] Done syncing for namespace "openshift-cloud-credential-operator" (81 of 273)
I0213 19:30:55.622723       1 sync_worker.go:457] Running sync for clusterrole "cloud-credential-operator-role" (82 of 273)
I0213 19:30:55.625154       1 sync_worker.go:470] Done syncing for configmap "openshift-machine-config-operator/machine-config-operator-images" (106 of 273)
I0213 19:30:55.625231       1 sync_worker.go:457] Running sync for clusterrolebinding "default-account-openshift-machine-config-operator" (107 of 273)
I0213 19:30:55.626722       1 sync_worker.go:470] Done syncing for configmap "openshift-machine-api/machine-api-operator-images" (89 of 273)
I0213 19:30:55.626755       1 sync_worker.go:457] Running sync for customresourcedefinition "machines.cluster.k8s.io" (90 of 273)
I0213 19:30:55.627346       1 sync_worker.go:470] Done syncing for clusterrole "cloud-credential-operator-role" (82 of 273)
I0213 19:30:55.627406       1 sync_worker.go:457] Running sync for clusterrolebinding "cloud-credential-operator-rolebinding" (83 of 273)
I0213 19:30:55.628106       1 sync_worker.go:470] Done syncing for clusterrolebinding "default-account-openshift-machine-config-operator" (107 of 273)
I0213 19:30:55.628180       1 sync_worker.go:457] Running sync for deployment "openshift-machine-config-operator/machine-config-operator" (108 of 273)
I0213 19:30:55.630426       1 sync_worker.go:470] Done syncing for clusterrolebinding "cloud-credential-operator-rolebinding" (83 of 273)
I0213 19:30:55.630455       1 sync_worker.go:457] Running sync for service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273)
I0213 19:30:55.631196       1 sync_worker.go:470] Done syncing for customresourcedefinition "machines.cluster.k8s.io" (90 of 273)
I0213 19:30:55.631226       1 sync_worker.go:457] Running sync for customresourcedefinition "machinesets.cluster.k8s.io" (91 of 273)
I0213 19:30:55.636029       1 sync_worker.go:470] Done syncing for customresourcedefinition "machinesets.cluster.k8s.io" (91 of 273)
I0213 19:30:55.636059       1 sync_worker.go:457] Running sync for customresourcedefinition "machinedeployments.cluster.k8s.io" (92 of 273)
I0213 19:30:55.636332       1 sync_worker.go:470] Done syncing for deployment "openshift-machine-config-operator/machine-config-operator" (108 of 273)
I0213 19:30:55.636360       1 sync_worker.go:457] Running sync for configmap "openshift-machine-config-operator/machine-config-osimageurl" (109 of 273)
I0213 19:30:55.639809       1 sync_worker.go:470] Done syncing for customresourcedefinition "machinedeployments.cluster.k8s.io" (92 of 273)
I0213 19:30:55.639843       1 sync_worker.go:457] Running sync for customresourcedefinition "clusters.cluster.k8s.io" (93 of 273)
I0213 19:30:55.640162       1 sync_worker.go:470] Done syncing for configmap "openshift-machine-config-operator/machine-config-osimageurl" (109 of 273)
I0213 19:30:55.640195       1 sync_worker.go:457] Running sync for clusteroperator "machine-config" (110 of 273)
I0213 19:30:55.643586       1 sync_worker.go:470] Done syncing for customresourcedefinition "clusters.cluster.k8s.io" (93 of 273)
I0213 19:30:55.643613       1 sync_worker.go:457] Running sync for customresourcedefinition "machineclasses.cluster.k8s.io" (94 of 273)
I0213 19:30:55.645977       1 sync_worker.go:470] Done syncing for customresourcedefinition "machineclasses.cluster.k8s.io" (94 of 273)
I0213 19:30:55.646003       1 sync_worker.go:457] Running sync for customresourcedefinition "machinehealthchecks.healthchecking.openshift.io" (95 of 273)
I0213 19:30:55.648341       1 sync_worker.go:470] Done syncing for customresourcedefinition "machinehealthchecks.healthchecking.openshift.io" (95 of 273)
I0213 19:30:55.648369       1 sync_worker.go:457] Running sync for customresourcedefinition "machines.machine.openshift.io" (96 of 273)
I0213 19:30:55.651627       1 sync_worker.go:470] Done syncing for customresourcedefinition "machines.machine.openshift.io" (96 of 273)
I0213 19:30:55.651658       1 sync_worker.go:457] Running sync for customresourcedefinition "machinesets.machine.openshift.io" (97 of 273)
I0213 19:30:55.654611       1 sync_worker.go:470] Done syncing for customresourcedefinition "machinesets.machine.openshift.io" (97 of 273)
I0213 19:30:55.654637       1 sync_worker.go:457] Running sync for customresourcedefinition "machinedeployments.machine.openshift.io" (98 of 273)
I0213 19:30:55.657589       1 sync_worker.go:470] Done syncing for customresourcedefinition "machinedeployments.machine.openshift.io" (98 of 273)
I0213 19:30:55.657615       1 sync_worker.go:457] Running sync for customresourcedefinition "clusters.machine.openshift.io" (99 of 273)
I0213 19:30:55.660522       1 sync_worker.go:470] Done syncing for customresourcedefinition "clusters.machine.openshift.io" (99 of 273)
I0213 19:30:55.660548       1 sync_worker.go:457] Running sync for customresourcedefinition "machineclasses.machine.openshift.io" (100 of 273)
I0213 19:30:55.663049       1 sync_worker.go:470] Done syncing for customresourcedefinition "machineclasses.machine.openshift.io" (100 of 273)
I0213 19:30:55.663074       1 sync_worker.go:457] Running sync for clusterrolebinding "default-account-openshift-machine-api" (101 of 273)
I0213 19:30:55.665400       1 sync_worker.go:470] Done syncing for clusterrolebinding "default-account-openshift-machine-api" (101 of 273)
I0213 19:30:55.665425       1 sync_worker.go:457] Running sync for deployment "openshift-machine-api/machine-api-operator" (102 of 273)
I0213 19:30:55.675423       1 sync_worker.go:470] Done syncing for deployment "openshift-machine-api/machine-api-operator" (102 of 273)
I0213 19:30:55.675450       1 sync_worker.go:457] Running sync for clusteroperator "machine-api" (103 of 273)
I0213 19:30:58.678828       1 sync_worker.go:470] Done syncing for clusteroperator "machine-api" (103 of 273)
I0213 19:30:59.643622       1 sync_worker.go:470] Done syncing for clusteroperator "machine-config" (110 of 273)
E0213 19:31:05.652641       1 task.go:57] error running apply for service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): services "controller-manager-service" is forbidden: caches not synchronized
I0213 19:31:13.869279       1 leaderelection.go:209] successfully renewed lease openshift-cluster-version/version
I0213 19:31:15.933467       1 reflector.go:286] github.com/openshift/cluster-version-operator/vendor/github.com/openshift/client-go/config/informers/externalversions/factory.go:101: forcing resync
E0213 19:31:25.673291       1 task.go:57] error running apply for service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): services "controller-manager-service" is forbidden: caches not synchronized
I0213 19:31:43.878161       1 leaderelection.go:209] successfully renewed lease openshift-cluster-version/version
E0213 19:31:48.700363       1 task.go:57] error running apply for service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): services "controller-manager-service" is forbidden: caches not synchronized
I0213 19:31:48.700527       1 task_graph.go:438] No more reachable nodes in graph, continue
I0213 19:31:48.700545       1 task_graph.go:474] No more work
I0213 19:31:48.700563       1 task_graph.go:494] No more work for 3
I0213 19:31:48.700573       1 task_graph.go:494] No more work for 6
I0213 19:31:48.700578       1 task_graph.go:494] No more work for 7
I0213 19:31:48.700581       1 task_graph.go:494] No more work for 5
I0213 19:31:48.700589       1 task_graph.go:494] No more work for 4
I0213 19:31:48.700593       1 task_graph.go:494] No more work for 0
I0213 19:31:48.700598       1 task_graph.go:494] No more work for 1
I0213 19:31:48.700604       1 task_graph.go:494] No more work for 2
I0213 19:31:48.700614       1 task_graph.go:510] Workers finished
I0213 19:31:48.700625       1 task_graph.go:518] Result of work: [Could not update service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): the server has forbidden updates to this resource]
E0213 19:31:48.700663       1 sync_worker.go:263] unable to synchronize image (waiting 3m19.747206386s): Could not update service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): the server has forbidden updates to this resource
I0213 19:31:48.700716       1 cvo.go:298] Started syncing cluster version "openshift-cluster-version/version" (2019-02-13 19:31:48.700709155 +0000 UTC m=+2506.887267577)
I0213 19:31:48.700764       1 cvo.go:326] Desired version from operator is v1.Update{Version:"0.0.1-2019-02-13-183754", Image:"registry.svc.ci.openshift.org/ci-op-girsxxlp/release@sha256:f6f118048108d1d34c31dbd85f917c3165050b85f2e86c36bc3e34b05a65a0e1"}
I0213 19:31:48.700867       1 cvo.go:300] Finished syncing cluster version "openshift-cluster-version/version" (154.552µs)
I0213 19:32:13.892888       1 leaderelection.go:209] successfully renewed lease openshift-cluster-version/version
I0213 19:32:18.640558       1 reflector.go:286] github.com/openshift/cluster-version-operator/vendor/github.com/openshift/client-go/config/informers/externalversions/factory.go:101: forcing resync
I0213 19:32:18.640686       1 cvo.go:349] Started syncing available updates "openshift-cluster-version/version" (2019-02-13 19:32:18.640678331 +0000 UTC m=+2536.827236755)
I0213 19:32:18.640707       1 cvo.go:298] Started syncing cluster version "openshift-cluster-version/version" (2019-02-13 19:32:18.640698939 +0000 UTC m=+2536.827257395)
I0213 19:32:18.640755       1 cvo.go:326] Desired version from operator is v1.Update{Version:"0.0.1-2019-02-13-183754", Image:"registry.svc.ci.openshift.org/ci-op-girsxxlp/release@sha256:f6f118048108d1d34c31dbd85f917c3165050b85f2e86c36bc3e34b05a65a0e1"}
I0213 19:32:18.640902       1 cvo.go:300] Finished syncing cluster version "openshift-cluster-version/version" (197.489µs)
I0213 19:32:18.694484       1 availableupdates.go:132] Upstream server https://api.openshift.com/api/upgrades_info/v1/graph could not return available updates: unknown version 0.0.1-2019-02-13-183754
I0213 19:32:18.694520       1 cvo.go:351] Finished syncing available updates "openshift-cluster-version/version" (53.838074ms)
I0213 19:32:18.694537       1 cvo.go:298] Started syncing cluster version "openshift-cluster-version/version" (2019-02-13 19:32:18.69453419 +0000 UTC m=+2536.881092555)
I0213 19:32:18.694571       1 cvo.go:326] Desired version from operator is v1.Update{Version:"0.0.1-2019-02-13-183754", Image:"registry.svc.ci.openshift.org/ci-op-girsxxlp/release@sha256:f6f118048108d1d34c31dbd85f917c3165050b85f2e86c36bc3e34b05a65a0e1"}
I0213 19:32:18.694644       1 cvo.go:300] Finished syncing cluster version "openshift-cluster-version/version" (106.248µs)

In the logs it looks like the machine-api is deploying either in parallel with, or before, the cloud credential operator, despite my attempt to have our manifests loaded first via the ordering. Does everything within a runlevel like 0000_30 apply in parallel? If so, should I move to 0000_29?

@openshift-ci-robot added the size/XS label and removed the size/M label (Feb 15, 2019)
@spangenberg (Contributor)

I saw you copied the CredentialsRequest from the Machine API Operator here; is the plan to have everything centralised here?

@dgoodwin (Contributor, Author)

That was the original plan, but no longer; we now want component repos to have control of their own credentials, and we'll audit by looking at what's in the release manifest.

I'd delete them in this PR, but we've got a pending problem to solve first: there's a component that reads these from source and uses them to validate permissions preflight. I need to talk to @joelddiaz before we can delete them here.
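
For anyone unfamiliar, a CredentialsRequest is roughly the following shape. This is a minimal sketch for illustration only; the name, namespaces, permissions, and exact apiVersion below are assumptions, not the actual manifest from either repo:

```yaml
apiVersion: cloudcredential.openshift.io/v1beta1  # apiVersion assumed for illustration
kind: CredentialsRequest
metadata:
  name: openshift-machine-api                     # hypothetical request name
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: aws-cloud-credentials                   # hypothetical secret the component would consume
    namespace: openshift-machine-api
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1beta1
    kind: AWSProviderSpec
    statementEntries:
    - effect: Allow
      action:
      - ec2:DescribeInstances                     # illustrative permission only
      resource: "*"
```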

@dgoodwin (Contributor, Author)

/hold

@openshift-ci-robot added the do-not-merge/hold label (Feb 15, 2019)
@smarterclayton (Contributor)

Does the machine-api operator not wait until the cred shows up? You don't have to be before something for it to get a chance to load; things at the same runlevel are launched in parallel.

@dgoodwin (Contributor, Author)

They're not using a credential at all yet, so conceivably they could.

There appears to be a problem at runlevel 30, though, per #31 (comment). I haven't seen a result yet since changing it to 29, but at 30 we were getting:

E0213 19:31:48.700663 1 sync_worker.go:263] unable to synchronize image (waiting 3m19.747206386s): Could not update service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): the server has forbidden updates to this resource

@dgoodwin (Contributor, Author) commented Feb 15, 2019

Bear in mind that their cred CR won't apply cleanly until we create the CRD. 29 felt safer, but I can go either way.

@smarterclayton (Contributor)

The CVO already has to handle that. You definitely can't be after theirs, but we are trying to reduce the number of levels. You need to be after the kube-apiserver for sure.

The error doesn’t look that bad.

@dgoodwin (Contributor, Author)

OK, I will take it back to 30 and see if the error returns. It seems to have failed differently at 29 for some reason. The failures at 30 were consistent, and I've only seen them on this PR; I'm not really sure where to go next. Issue filed in the CVO repo a few days ago: openshift/cluster-version-operator#119

machine-api also runs at 30, but CVO will process all items in the same
runlevel in parallel, and machine-api will need to know to wait until
its creds exist.

Per: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/operators.md#how-do-i-get-added-as-a-special-run-level
@openshift-ci-robot added the size/S label and removed the size/XS label (Feb 15, 2019)
@dgoodwin (Contributor, Author)

@smarterclayton @crawford @derekwaynecarr I am majorly stuck here; I do not understand why the CVO is shutting us down with "the server has forbidden updates to this resource". It is consistent and only affects this PR, regardless of runlevel 29 or 30. I filed an issue last week: openshift/cluster-version-operator#119

This PR is blocked as a result, which means machine-api using minted creds is also blocked.

@abhinavdahiya (Contributor)

/retest

trying to get a new release image for local testing...

@abhinavdahiya (Contributor)

The actual reason you are seeing this error is:

error running apply for service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): services "controller-manager-service" is forbidden: caches not synchronized

The kube-apiserver rejects objects in all non-run-level 0/1 namespaces until the openshift-apiserver is ready (I think for admission purposes). Since the operator is moving from 70 -> 30 (before the openshift-apiserver), resources cannot be created in your namespace.

Add this label to the operator namespace:
https://github.com/openshift/cluster-version-operator/blob/93d4eff7a3f4aad8505702d875bb7169e294d221/install/0000_00_cluster-version-operator_00_namespace.yaml#L7
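
Concretely, that means something along these lines in the operator's namespace manifest. This is a sketch only; the exact label key and value should be copied from the linked cluster-version-operator file rather than trusted from here:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cloud-credential-operator
  labels:
    # Assumed to mirror the label on the openshift-cluster-version namespace
    # linked above; verify the exact key/value against that file.
    openshift.io/run-level: "1"
```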

kube-apiserver rejects objects into all non run-level 0,1 namespaces
until openshift-apiserver is ready; we need to be up before
openshift-apiserver and thus require this annotation.

Fixes the CI error: "the server has forbidden updates to this
resource".
@twiest commented Feb 19, 2019

/lgtm

@openshift-ci-robot added the lgtm label (Feb 19, 2019)
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, twiest

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dgoodwin (Contributor, Author)

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label (Feb 20, 2019)
@openshift-merge-robot merged commit 4726195 into openshift:master on Feb 20, 2019