pkg/server: save the bootstrap MC content #1376

Merged: 1 commit merged into openshift:master from runcom:save-bootstrap-mc on Jan 22, 2020

Conversation

@runcom runcom (Member) commented Jan 16, 2020

This patch injects the current machine config served by the MCS
into the Ignition config so that it ends up on disk for later comparison.
It makes no distinction between bootstrap and cluster, but the commit
is named "bootstrap" because it will be extremely helpful as a stopgap
for investigating installer & MCO drift. With the MC content on disk,
if a drift happens we can collect the MC with must-gather and compare
it with what the MCO in the cluster is generating.

Signed-off-by: Antonio Murdaca runcom@linux.com
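For context, the change amounts to having the MCS append one more file to the Ignition config it serves, containing the serialized MachineConfig itself, so the node keeps a copy on disk. Below is a minimal, self-contained sketch of that idea; the struct definitions, the file path, and the helper name are illustrative stand-ins, not the MCO's actual Ignition types or the code in this PR:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the Ignition "storage.files" schema; the real
// server uses the Ignition config types vendored by the MCO.
type FileContents struct {
	Source string `json:"source"`
}

type File struct {
	Filesystem string       `json:"filesystem"`
	Path       string       `json:"path"`
	Mode       int          `json:"mode"`
	Contents   FileContents `json:"contents"`
}

type Storage struct {
	Files []File `json:"files,omitempty"`
}

type IgnitionConfig struct {
	Storage Storage `json:"storage"`
}

// appendMachineConfig serializes the MachineConfig the server is about to
// serve and appends it to the Ignition config as a regular file, so the
// node keeps a copy on disk for later comparison (e.g. via must-gather).
// The target path below is hypothetical, not the one used by the PR.
func appendMachineConfig(ign *IgnitionConfig, mcJSON []byte) {
	src := "data:;base64," + base64.StdEncoding.EncodeToString(mcJSON)
	ign.Storage.Files = append(ign.Storage.Files, File{
		Filesystem: "root",
		Path:       "/etc/mcs-machine-config-content.json", // hypothetical path
		Mode:       0644,
		Contents:   FileContents{Source: src},
	})
}

func main() {
	// Pretend this is the rendered MachineConfig the MCS would serve.
	mcJSON := []byte(`{"kind":"MachineConfig","metadata":{"name":"rendered-worker-abc"}}`)

	var ign IgnitionConfig
	appendMachineConfig(&ign, mcJSON)

	out, _ := json.MarshalIndent(ign, "", "  ")
	fmt.Println(string(out))
}
```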

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 16, 2020
@runcom runcom (Member Author) commented Jan 16, 2020

@cgwalters I'd love it if you could take a super quick look at this

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2020
@cgwalters cgwalters (Member) left a comment


Seems OK to me short term. Conceptually, though, there are two things:

  • The Ignition
  • The rest of the MC, which we already have in -encapsulated.json

So if we came up with a mechanism to save the Ignition, we could reassemble the two. See this issue for some recent discussion of saving Ignition.
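To make the reassembly idea concrete, here is a small sketch, assuming a simplified MachineConfig shape in which spec.config holds the Ignition payload and everything else is what -encapsulated.json keeps; the field names and the helper are hypothetical, not the MCO's actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-in for a MachineConfig: everything except the Ignition
// payload lives in the "encapsulated" object, and spec.config carries the
// Ignition config itself. Field names here are illustrative.
type MachineConfig struct {
	Kind     string `json:"kind"`
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		OSImageURL string          `json:"osImageURL,omitempty"`
		Config     json.RawMessage `json:"config,omitempty"`
	} `json:"spec"`
}

// reassemble merges a separately saved Ignition config back into the
// encapsulated MachineConfig (the MC with its Ignition section stripped),
// yielding something comparable to what the in-cluster MCO renders.
func reassemble(encapsulated, ignition []byte) ([]byte, error) {
	var mc MachineConfig
	if err := json.Unmarshal(encapsulated, &mc); err != nil {
		return nil, err
	}
	mc.Spec.Config = json.RawMessage(ignition)
	return json.MarshalIndent(mc, "", "  ")
}

func main() {
	encapsulated := []byte(`{"kind":"MachineConfig","metadata":{"name":"rendered-worker-abc"},"spec":{"osImageURL":"example"}}`)
	ignition := []byte(`{"ignition":{"version":"2.2.0"}}`)

	full, err := reassemble(encapsulated, ignition)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(full))
}
```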

(Two review threads on pkg/server/server.go were marked outdated and resolved.)
@kikisdeliveryservice (Contributor)

/test e2e-gcp-op

@kikisdeliveryservice (Contributor)

A daemon got cut off during eviction during reboot, and the MCC shows:

E0116 14:48:34.125557       1 operator.go:312] error syncing progressing status: Get https://172.30.0.1:443/apis/config.openshift.io/v1/clusteroperators/machine-config: stream error: stream ID 2091; INTERNAL_ERROR
E0116 14:48:36.610895       1 operator.go:312] error syncing progressing status: Get https://172.30.0.1:443/apis/config.openshift.io/v1/clusteroperators/machine-config: stream error: stream ID 2123; INTERNAL_ERROR

retesting

@kikisdeliveryservice (Contributor)

I don't think it's specific to this PR (I'll check), but I'm seeing runs where this shutdown doesn't seem to occur and then the MCP test times out...

I0116 20:07:00.978975    2879 update.go:1050] initiating reboot: Node will reboot into config rendered-infra-a78095e38f9059fe14f5118f1e4031ff
I0116 20:07:01.065409    2879 daemon.go:553] Shutting down MachineConfigDaemon

Looking a little more

@kikisdeliveryservice (Contributor)

Seeing DNS issues:


Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: [INFO] SIGTERM: Shutting down servers then terminating
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: ,StartedAt:2020-01-16 18:15:36 +0000 UTC,FinishedAt:2020-01-16 20:07:26 +0000 UTC,ContainerID:cri-o://52292c3de55c417117e0591707ec9ce6f037ee7307dcc51d42f6f24c413bf90f,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:1 Image:registry.svc.ci.openshift.org/ci-op-7ib4gw3m/stable@sha256:9394e57c6886f79bde6ef6ea3eb2ea19bffb86cc761b4e2746820a8cc3112694 ImageID:registry.svc.ci.openshift.org/ci-op-7ib4gw3m/stable@sha256:9394e57c6886f79bde6ef6ea3eb2ea19bffb86cc761b4e2746820a8cc3112694 ContainerID:cri-o://52292c3de55c417117e0591707ec9ce6f037ee7307dcc51d42f6f24c413bf90f Started:0xc0007db8eb} {Name:dns-node-resolver State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:kill: sending signal to 904 failed: No such process
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: ,StartedAt:2020-01-16 18:15:36 +0000 UTC,FinishedAt:2020-01-16 20:07:26 +0000 UTC,ContainerID:cri-o://2740a569b9ac57b171eb305768301853a734ad98096c841ae401e78f9a707197,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:1 Image:registry.svc.ci.openshift.org/ci-op-7ib4gw3m/stable@sha256:1ce2e6380cc5f41d6e7a481a748b6d6083a431242d4bf444193abe00813d72fe ImageID:registry.svc.ci.openshift.org/ci-op-7ib4gw3m/stable@sha256:1ce2e6380cc5f41d6e7a481a748b6d6083a431242d4bf444193abe00813d72fe ContainerID:cri-o://2740a569b9ac57b171eb305768301853a734ad98096c841ae401e78f9a707197 Started:0xc0007db8ec}] QOSClass:Burstable EphemeralContainerStatuses:[]}
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.169972    2320 volume_manager.go:372] Waiting for volumes to attach and mount for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)"
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.170117    2320 volume_manager.go:403] All volumes are attached and mounted for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)"
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.170214    2320 kuberuntime_manager.go:442] No ready sandbox for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)" can be found. Need to start a new one
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.170305    2320 kuberuntime_manager.go:652] computePodActions got {KillPod:true CreateSandbox:true SandboxID:fba7253b05d240c2e2c0eb5948e0c04b9398b99162f88d8d5efe1211e7659886 Attempt:1 NextInitContainerToStart:nil ContainersToStart:[0 1] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)"
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.170677    2320 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-jq4qq", UID:"c5133135-faeb-48a5-870f-213a5cdb821f", APIVersion:"v1", ResourceVersion:"23015", FieldPath:""}): type: 'Normal' reason: 'SandboxChanged' Pod sandbox changed, it will be killed and re-created.
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: E0116 20:13:53.171862    2320 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: E0116 20:13:53.171914    2320 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)" failed: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: E0116 20:13:53.171927    2320 kuberuntime_manager.go:729] createPodSandbox for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)" failed: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: E0116 20:13:53.171968    2320 pod_workers.go:191] Error syncing pod c5133135-faeb-48a5-870f-213a5cdb821f ("dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)"), skipping: failed to "CreatePodSandbox" for "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)" with CreatePodSandboxError: "CreatePodSandbox for pod \"dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)\" failed: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved"
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.172103    2320 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-jq4qq", UID:"c5133135-faeb-48a5-870f-213a5cdb821f", APIVersion:"v1", ResourceVersion:"23015", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved

But also seeing a degraded node-exporter.


E0116 20:14:13.652090       1 task.go:77] error running apply for clusteroperator "monitoring" (304 of 517): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)
I0116 20:14:13.652460       1 task_graph.go:568] Canceled worker 0
I0116 20:14:13.652480       1 task_graph.go:588] Workers finished
I0116 20:14:13.652502       1 task_graph.go:509] Graph is complete
I0116 20:14:13.652514       1 task_graph.go:596] Result of work: [Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)]
I0116 20:14:13.652536       1 sync_worker.go:783] Summarizing 1 errors
I0116 20:14:13.652597       1 sync_worker.go:787] Update error 304 of 517: ClusterOperatorDegraded Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2) (*errors.errorString: cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2))
E0116 20:14:13.652638       1 sync_worker.go:329] unable to synchronize image (waiting 2m52.525702462s): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)

Will poke around BZ to see if something is open...

@kikisdeliveryservice (Contributor)

MCO isn't degraded, it's just timing out.

/test e2e-gcp-op

@kikisdeliveryservice (Contributor)

Opened BZ 1792033 for that node-exporter error.

@runcom runcom (Member Author) commented Jan 20, 2020

Can we move on with #1376 to at least have some data to ease debugging until we come up with a stronger solution to auto-reconcile, as David suggested?

@kikisdeliveryservice (Contributor)

LGTM

Deferring to Colin to ensure he has no further comments.

/assign @cgwalters

@runcom runcom (Member Author) commented Jan 22, 2020

/retest

1 similar comment
@runcom runcom (Member Author) commented Jan 22, 2020

/retest

@cgwalters cgwalters (Member) left a comment


/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom runcom (Member Author) commented Jan 22, 2020

/retest

@openshift-merge-robot openshift-merge-robot merged commit 1b9dc9e into openshift:master Jan 22, 2020
@runcom runcom deleted the save-bootstrap-mc branch January 23, 2020 06:35
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.