pkg/server: save the bootstrap MC content #1376

Merged: 1 commit merged into openshift:master from runcom:save-bootstrap-mc on Jan 22, 2020

Conversation

@runcom runcom (Member) commented Jan 16, 2020

This patch injects the current machine config served by the MCS
into the Ignition config so that it ends up on disk for later comparison.
It makes no distinction between bootstrap and cluster, but the commit
is named "bootstrap" because it will be extremely helpful as a stopgap
for investigating installer & MCO drift. With the MC content on disk,
if a drift happens we can collect the MC with must-gather and compare
it with what the MCO in the cluster is generating.

Signed-off-by: Antonio Murdaca runcom@linux.com
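For context, the change amounts to having the MCS append one more file to the Ignition config it serves, containing the serialized MachineConfig itself, so the node keeps a copy on disk. Below is a minimal, self-contained sketch of that idea; the struct definitions, the file path, and the helper name are illustrative stand-ins, not the MCO's actual Ignition types or the code in this PR:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the Ignition "storage.files" schema; the real
// server uses the Ignition config types vendored by the MCO.
type FileContents struct {
	Source string `json:"source"`
}

type File struct {
	Filesystem string       `json:"filesystem"`
	Path       string       `json:"path"`
	Mode       int          `json:"mode"`
	Contents   FileContents `json:"contents"`
}

type Storage struct {
	Files []File `json:"files,omitempty"`
}

type IgnitionConfig struct {
	Storage Storage `json:"storage"`
}

// appendMachineConfig serializes the MachineConfig the server is about to
// serve and appends it to the Ignition config as a regular file, so the
// node keeps a copy on disk for later comparison (e.g. via must-gather).
// The target path below is hypothetical, not the one used by the PR.
func appendMachineConfig(ign *IgnitionConfig, mcJSON []byte) {
	src := "data:;base64," + base64.StdEncoding.EncodeToString(mcJSON)
	ign.Storage.Files = append(ign.Storage.Files, File{
		Filesystem: "root",
		Path:       "/etc/mcs-machine-config-content.json", // hypothetical path
		Mode:       0644,
		Contents:   FileContents{Source: src},
	})
}

func main() {
	// Pretend this is the rendered MachineConfig the MCS would serve.
	mcJSON := []byte(`{"kind":"MachineConfig","metadata":{"name":"rendered-worker-abc"}}`)

	var ign IgnitionConfig
	appendMachineConfig(&ign, mcJSON)

	out, _ := json.MarshalIndent(ign, "", "  ")
	fmt.Println(string(out))
}
```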

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 16, 2020
@runcom runcom (Member Author) commented Jan 16, 2020

@cgwalters I'd love it if you could take a super quick look at this

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2020
@cgwalters cgwalters (Member) left a comment


Seems OK to me short term. Conceptually, though, there are two things:

  • The Ignition
  • The rest of the MC, which we already have in -encapsulated.json

So if we came up with a mechanism to save the Ignition, we could reassemble the two. See this issue for some recent discussion of saving Ignition.
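To make the reassembly idea concrete, here is a small sketch, assuming a simplified MachineConfig shape in which spec.config holds the Ignition payload and everything else is what -encapsulated.json keeps; the field names and the helper are hypothetical, not the MCO's actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-in for a MachineConfig: everything except the Ignition
// payload lives in the "encapsulated" object, and spec.config carries the
// Ignition config itself. Field names here are illustrative.
type MachineConfig struct {
	Kind     string `json:"kind"`
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		OSImageURL string          `json:"osImageURL,omitempty"`
		Config     json.RawMessage `json:"config,omitempty"`
	} `json:"spec"`
}

// reassemble merges a separately saved Ignition config back into the
// encapsulated MachineConfig (the MC with its Ignition section stripped),
// yielding something comparable to what the in-cluster MCO renders.
func reassemble(encapsulated, ignition []byte) ([]byte, error) {
	var mc MachineConfig
	if err := json.Unmarshal(encapsulated, &mc); err != nil {
		return nil, err
	}
	mc.Spec.Config = json.RawMessage(ignition)
	return json.MarshalIndent(mc, "", "  ")
}

func main() {
	encapsulated := []byte(`{"kind":"MachineConfig","metadata":{"name":"rendered-worker-abc"},"spec":{"osImageURL":"example"}}`)
	ignition := []byte(`{"ignition":{"version":"2.2.0"}}`)

	full, err := reassemble(encapsulated, ignition)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(full))
}
```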

(Two review threads on pkg/server/server.go were marked outdated and resolved.)
@kikisdeliveryservice (Contributor)

/test e2e-gcp-op

@kikisdeliveryservice (Contributor)

A daemon got cut off during eviction during reboot, and the MCC shows:

E0116 14:48:34.125557       1 operator.go:312] error syncing progressing status: Get https://172.30.0.1:443/apis/config.openshift.io/v1/clusteroperators/machine-config: stream error: stream ID 2091; INTERNAL_ERROR
E0116 14:48:36.610895       1 operator.go:312] error syncing progressing status: Get https://172.30.0.1:443/apis/config.openshift.io/v1/clusteroperators/machine-config: stream error: stream ID 2123; INTERNAL_ERROR

retesting

@kikisdeliveryservice (Contributor)

I don't think it's specific to this PR (I'll check), but I'm seeing runs where this shutdown doesn't seem to occur and then the MCP test times out...

I0116 20:07:00.978975    2879 update.go:1050] initiating reboot: Node will reboot into config rendered-infra-a78095e38f9059fe14f5118f1e4031ff
I0116 20:07:01.065409    2879 daemon.go:553] Shutting down MachineConfigDaemon

Looking a little more

@kikisdeliveryservice (Contributor)

Seeing DNS issues:


Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: [INFO] SIGTERM: Shutting down servers then terminating
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: ,StartedAt:2020-01-16 18:15:36 +0000 UTC,FinishedAt:2020-01-16 20:07:26 +0000 UTC,ContainerID:cri-o://52292c3de55c417117e0591707ec9ce6f037ee7307dcc51d42f6f24c413bf90f,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:1 Image:registry.svc.ci.openshift.org/ci-op-7ib4gw3m/stable@sha256:9394e57c6886f79bde6ef6ea3eb2ea19bffb86cc761b4e2746820a8cc3112694 ImageID:registry.svc.ci.openshift.org/ci-op-7ib4gw3m/stable@sha256:9394e57c6886f79bde6ef6ea3eb2ea19bffb86cc761b4e2746820a8cc3112694 ContainerID:cri-o://52292c3de55c417117e0591707ec9ce6f037ee7307dcc51d42f6f24c413bf90f Started:0xc0007db8eb} {Name:dns-node-resolver State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:kill: sending signal to 904 failed: No such process
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: ,StartedAt:2020-01-16 18:15:36 +0000 UTC,FinishedAt:2020-01-16 20:07:26 +0000 UTC,ContainerID:cri-o://2740a569b9ac57b171eb305768301853a734ad98096c841ae401e78f9a707197,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:1 Image:registry.svc.ci.openshift.org/ci-op-7ib4gw3m/stable@sha256:1ce2e6380cc5f41d6e7a481a748b6d6083a431242d4bf444193abe00813d72fe ImageID:registry.svc.ci.openshift.org/ci-op-7ib4gw3m/stable@sha256:1ce2e6380cc5f41d6e7a481a748b6d6083a431242d4bf444193abe00813d72fe ContainerID:cri-o://2740a569b9ac57b171eb305768301853a734ad98096c841ae401e78f9a707197 Started:0xc0007db8ec}] QOSClass:Burstable EphemeralContainerStatuses:[]}
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.169972    2320 volume_manager.go:372] Waiting for volumes to attach and mount for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)"
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.170117    2320 volume_manager.go:403] All volumes are attached and mounted for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)"
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.170214    2320 kuberuntime_manager.go:442] No ready sandbox for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)" can be found. Need to start a new one
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.170305    2320 kuberuntime_manager.go:652] computePodActions got {KillPod:true CreateSandbox:true SandboxID:fba7253b05d240c2e2c0eb5948e0c04b9398b99162f88d8d5efe1211e7659886 Attempt:1 NextInitContainerToStart:nil ContainersToStart:[0 1] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)"
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.170677    2320 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-jq4qq", UID:"c5133135-faeb-48a5-870f-213a5cdb821f", APIVersion:"v1", ResourceVersion:"23015", FieldPath:""}): type: 'Normal' reason: 'SandboxChanged' Pod sandbox changed, it will be killed and re-created.
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: E0116 20:13:53.171862    2320 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: E0116 20:13:53.171914    2320 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)" failed: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: E0116 20:13:53.171927    2320 kuberuntime_manager.go:729] createPodSandbox for pod "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)" failed: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: E0116 20:13:53.171968    2320 pod_workers.go:191] Error syncing pod c5133135-faeb-48a5-870f-213a5cdb821f ("dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)"), skipping: failed to "CreatePodSandbox" for "dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)" with CreatePodSandboxError: "CreatePodSandbox for pod \"dns-default-jq4qq_openshift-dns(c5133135-faeb-48a5-870f-213a5cdb821f)\" failed: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved"
Jan 16 20:13:53 ci-op-f8vkq-w-b-fdl7c.c.openshift-gce-devel-ci.internal hyperkube[2320]: I0116 20:13:53.172103    2320 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-jq4qq", UID:"c5133135-faeb-48a5-870f-213a5cdb821f", APIVersion:"v1", ResourceVersion:"23015", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_dns-default-jq4qq_openshift-dns_c5133135-faeb-48a5-870f-213a5cdb821f_1 for id e8dcaf6739532d79c752254ab0a622500d35aa8977a01e8fab587c966c688663: name is reserved

But also seeing a degraded node-exporter.


E0116 20:14:13.652090       1 task.go:77] error running apply for clusteroperator "monitoring" (304 of 517): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)
I0116 20:14:13.652460       1 task_graph.go:568] Canceled worker 0
I0116 20:14:13.652480       1 task_graph.go:588] Workers finished
I0116 20:14:13.652502       1 task_graph.go:509] Graph is complete
I0116 20:14:13.652514       1 task_graph.go:596] Result of work: [Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)]
I0116 20:14:13.652536       1 sync_worker.go:783] Summarizing 1 errors
I0116 20:14:13.652597       1 sync_worker.go:787] Update error 304 of 517: ClusterOperatorDegraded Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2) (*errors.errorString: cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2))
E0116 20:14:13.652638       1 sync_worker.go:329] unable to synchronize image (waiting 2m52.525702462s): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)

Will poke around BZ to see if something is open...

@kikisdeliveryservice (Contributor)

MCO isn't degraded, it's just timing out.

/test e2e-gcp-op

@kikisdeliveryservice (Contributor)

Opened BZ 1792033 for that node-exporter error.

@runcom runcom (Member Author) commented Jan 20, 2020

Can we move on with #1376 to at least have some data to ease debugging until we come up with a stronger solution to auto-reconcile, as David suggested?

@kikisdeliveryservice (Contributor)

LGTM

Deferring to Colin to ensure he has no further comments.

/assign @cgwalters

@runcom runcom (Member Author) commented Jan 22, 2020

/retest

1 similar comment
@runcom runcom (Member Author) commented Jan 22, 2020

/retest

@cgwalters cgwalters (Member) left a comment


/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom runcom (Member Author) commented Jan 22, 2020

/retest

@openshift-merge-robot openshift-merge-robot merged commit 1b9dc9e into openshift:master Jan 22, 2020
@runcom runcom deleted the save-bootstrap-mc branch January 23, 2020 06:35
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.