Inject OSImageURL from CVO into templated MachineConfigs #363

Open

cgwalters wants to merge 2 commits into master from cgwalters:osimage-render-but-disabled

9 participants
@cgwalters (Contributor) commented Feb 1, 2019

This injects the OSImageURL into the "base"
config (e.g. 00-worker, 00-master). This differs from
previous pull requests which made it a separate MC, but that
adds visual noise and will exacerbate renderer race conditions.

-	return MachineConfigFromIgnConfig(role, name, ignCfg), nil
+	mcfg := MachineConfigFromIgnConfig(role, name, ignCfg)
+	if osUpdatesEnabledForRole(config, role) {
+		mcfg.Spec.OSImageURL = config.OSImageURL

@jlebon (Member), Feb 1, 2019

Let's maybe also log something in the else branch here so it's clear why it's not picking it up for some pools?
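
For illustration only (not part of the PR), continuing the hunk quoted above, such an else-branch log might look something like this, using the glog import render.go already has:

```go
// Illustrative sketch only; assumes the pre-error-handling form of the hunk above.
if osUpdatesEnabledForRole(config, role) {
	mcfg.Spec.OSImageURL = config.OSImageURL
} else {
	// Hypothetical log line, not in the PR: record why injection was skipped.
	glog.V(2).Infof("OS updates not enabled for role %q; skipping OSImageURL injection", role)
}
```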

@cgwalters (Author, Contributor), Feb 1, 2019

Problem is we're going to see that message every time something in the cluster changes...I've learned my lesson there about adding debug prints.

@cgwalters (Author, Contributor), Feb 1, 2019

See #348

@ashcrow (Member) left a comment

Looks sane

@cgwalters (Contributor, Author) commented Feb 1, 2019

Moving discussion of bootstrap over here.

Now that I think about it more: whether before or after this lands, nothing will in fact be reacting to the osimageurl. So we should be able to land the installer PR now.

@cgwalters (Contributor, Author) commented Feb 1, 2019

OK, I started on using labels, but I'm currently hitting a weird image-corruption error trying to deploy my updated controller image, which is almost certainly unrelated. And I'm still learning the label API/semantics.

From a6faeec6dda4ea2c89eed9a3da945c355ef27769 Mon Sep 17 00:00:00 2001
From: Colin Walters <walters@verbum.org>
Date: Fri, 1 Feb 2019 16:17:07 +0000
Subject: [PATCH] Add code to inject OSImageURL, but disable it by default

First, this supports injecting the `OSImageURL` into the "base"
config (e.g. `00-worker`, `00-master`).  This differs from
previous pull requests which made it a separate MC, but that
adds visual noise and will exacerbate renderer race conditions.

However, in order to gain experience with this code, add a
`ControllerConfig` option which can disable injecting it for certain
roles.  This is set to `*` by default, so effectively we won't
do OS updates.

My idea here is that anyone who wants to test things out can
`oc edit controllerconfig` and empty that out.  Another useful
thing would be to change it to e.g. `{"master"}` and test OS updates
on workers without affecting the master.
---
 .../v1/types.go                               |  3 +++
 pkg/controller/template/render.go             | 25 ++++++++++++++++++-
 pkg/operator/operator.go                      |  6 +++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/pkg/apis/machineconfiguration.openshift.io/v1/types.go b/pkg/apis/machineconfiguration.openshift.io/v1/types.go
index cc38d0d..4f053f6 100644
--- a/pkg/apis/machineconfiguration.openshift.io/v1/types.go
+++ b/pkg/apis/machineconfiguration.openshift.io/v1/types.go
@@ -141,6 +141,9 @@ type ControllerConfigSpec struct {
 	// Images is map of images that are used by the controller.
 	Images map[string]string `json:"images"`
 
+	// Configure which pools receive OS updates from the CVO
+	OSUpdatesEnabledForPools *metav1.LabelSelector `json:"osUpdatesEnabledForPools,omitempty"`
+
 	// Sourced from configmap/machine-config-osimageurl
 	OSImageURL string `json:"osImageURL"`
 }
diff --git a/pkg/controller/template/render.go b/pkg/controller/template/render.go
index c63ddb0..46a7e31 100644
--- a/pkg/controller/template/render.go
+++ b/pkg/controller/template/render.go
@@ -15,6 +15,7 @@ import (
 	ctconfig "github.com/coreos/container-linux-config-transpiler/config"
 	cttypes "github.com/coreos/container-linux-config-transpiler/config/types"
 	ignv2_2types "github.com/coreos/ignition/config/v2_2/types"
+	"k8s.io/apimachinery/pkg/labels"
 	"github.com/ghodss/yaml"
 	"github.com/golang/glog"
 	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
@@ -131,6 +132,20 @@ func platformFromControllerConfigSpec(ic *mcfgv1.ControllerConfigSpec) (string,
 	}
 }
 
+// osUpdatesEnabledForRole parses the OSUpdatesEnabledForPools flag, which is a
+// way to control injection of the OSImageURL into rendered machine configs.
+// Primarily intended for development/testing.
+func osUpdatesEnabledForRole(config *RenderConfig, role string) (bool, error) {
+	selector, err := metav1.LabelSelectorAsSelector(config.OSUpdatesEnabledForPools)
+	if err != nil {
+		return false, fmt.Errorf("invalid label selector: %v", err)
+	}
+
+	roleLabels := make(map[string]string)
+	roleLabels[role] = ""
+	return selector.Empty() || selector.Matches(labels.Set(roleLabels)), nil
+}
+
 func generateMachineConfigForName(config *RenderConfig, role, name, path string) (*mcfgv1.MachineConfig, error) {
 	platform, err := platformFromControllerConfigSpec(config.ControllerConfigSpec)
 	if err != nil {
@@ -233,7 +248,15 @@ func generateMachineConfigForName(config *RenderConfig, role, name, path string)
 		return nil, fmt.Errorf("error transpiling ct config to Ignition config: %v", err)
 	}
 
-	return MachineConfigFromIgnConfig(role, name, ignCfg), nil
+	mcfg := MachineConfigFromIgnConfig(role, name, ignCfg)
+	osUpdatesEnabled, err := osUpdatesEnabledForRole(config, role)
+	if err != nil {
+		return nil, err
+	} else if osUpdatesEnabled {
+		mcfg.Spec.OSImageURL = config.OSImageURL
+	}
+
+	return mcfg, nil
 }
 
 const (
diff --git a/pkg/operator/operator.go b/pkg/operator/operator.go
index 446a6c3..2955d01 100644
--- a/pkg/operator/operator.go
+++ b/pkg/operator/operator.go
@@ -343,6 +343,11 @@ func icFromClusterConfig(cm *v1.ConfigMap) (installertypes.InstallConfig, error)
 }
 
 func getRenderConfig(mc *mcfgv1.MCOConfig, etcdCAData, rootCAData []byte, ps *v1.ObjectReference, imgs Images) renderConfig {
+	// For now we disable OS updates until we've done more testing
+	osUpdateSelector, err := metav1.ParseToLabelSelector("")
+	if err != nil {
+		panic(err)
+	}
 	controllerconfig := mcfgv1.ControllerConfigSpec{
 		ClusterDNSIP:        mc.Spec.ClusterDNSIP,
 		CloudProviderConfig: mc.Spec.CloudProviderConfig,
@@ -354,6 +359,7 @@ func getRenderConfig(mc *mcfgv1.MCOConfig, etcdCAData, rootCAData []byte, ps *v1
 		PullSecret:          ps,
 		SSHKey:              mc.Spec.SSHKey,
 		OSImageURL:          imgs.MachineOSContent,
+		OSUpdatesEnabledForPools: osUpdateSelector,
 		Images: map[string]string{
 			templatectrl.EtcdImageKey:    imgs.Etcd,
 			templatectrl.SetupEtcdEnvKey: imgs.SetupEtcdEnv,
-- 
2.20.1
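
As a sanity check on the selector semantics above, here is a small standalone sketch (not part of the patch; `enabledForRole` simply mirrors the logic of `osUpdatesEnabledForRole`) showing that the empty default selector enables injection for every pool, while a selector like `master` restricts it to that pool:

```go
// Standalone illustration of the OSUpdatesEnabledForPools semantics.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// enabledForRole mirrors osUpdatesEnabledForRole from the patch: an empty
// selector means "all pools"; otherwise the role must match the selector.
func enabledForRole(sel *metav1.LabelSelector, role string) (bool, error) {
	s, err := metav1.LabelSelectorAsSelector(sel)
	if err != nil {
		return false, fmt.Errorf("invalid label selector: %v", err)
	}
	return s.Empty() || s.Matches(labels.Set{role: ""}), nil
}

func main() {
	empty, _ := metav1.ParseToLabelSelector("") // the default set in getRenderConfig
	masterOnly, _ := metav1.ParseToLabelSelector("master")

	for _, role := range []string{"master", "worker"} {
		e, _ := enabledForRole(empty, role)
		m, _ := enabledForRole(masterOnly, role)
		fmt.Printf("role=%-6s empty-selector=%v master-only-selector=%v\n", role, e, m)
	}
	// The empty selector enables OS updates for both pools;
	// the "master" selector enables them only for the master pool.
}
```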

@cgwalters cgwalters force-pushed the cgwalters:osimage-render-but-disabled branch from 99836ca to 1277d52 Feb 7, 2019

	osUpdatesEnabled, err := osUpdatesEnabledForRole(config, role)
	if err != nil {
		return nil, err
	} else if osUpdatesEnabled {

@runcom (Member), Feb 7, 2019

No need for this else; you're returning in the branch above anyway.

@cgwalters (Author, Contributor), Feb 7, 2019

I'm confused, the other one is usual error handling? Are you saying it'd e.g. be clearer like this?

diff --git a/pkg/controller/template/render.go b/pkg/controller/template/render.go
index 40aa9a8..97cf0e4 100644
--- a/pkg/controller/template/render.go
+++ b/pkg/controller/template/render.go
@@ -258,7 +258,8 @@ func generateMachineConfigForName(config *RenderConfig, role, name, path string)
 	osUpdatesEnabled, err := osUpdatesEnabledForRole(config, role)
 	if err != nil {
 		return nil, err
-	} else if osUpdatesEnabled {
+	}
+	if osUpdatesEnabled {
 		mcfg.Spec.OSImageURL = config.OSImageURL
 	}
 

@runcom (Member), Feb 7, 2019

Yup, that's how it's usually done when you have an if branch which returns

@cgwalters cgwalters force-pushed the cgwalters:osimage-render-but-disabled branch from 1277d52 to bccb216 Feb 7, 2019

@runcom (Member) commented Feb 7, 2019

My idea here is that anyone who wants to test things out can
oc edit controllerconfig and empty that out. Another useful
thing would be to change it to e.g. {"master"} and test OS updates
on workers without affecting the master.

To clear up confusion: I believe this comment no longer stands; it's the other way around, right?

@cgwalters (Contributor, Author) commented Feb 7, 2019

To clear up confusion: I believe this comment no longer stands; it's the other way around, right?

Yeah, I updated the PR description to match the current commit message.

@cgwalters (Contributor, Author) commented Feb 7, 2019

/hold

For me to add some tests here that verify that the MCD has a target osimageurl.

@cgwalters cgwalters force-pushed the cgwalters:osimage-render-but-disabled branch from bccb216 to d0211fe Feb 8, 2019

@openshift-ci-robot openshift-ci-robot added size/L and removed size/M labels Feb 8, 2019

@cgwalters cgwalters force-pushed the cgwalters:osimage-render-but-disabled branch from d0211fe to 4094236 Feb 8, 2019

@cgwalters (Contributor, Author) commented Feb 8, 2019

/test e2e-aws-op

@cgwalters cgwalters changed the title Add code to inject OSImageURL, but disable it by default Add code to inject OSImageURL, but just for workers by default Feb 8, 2019

@cgwalters (Contributor, Author) commented Feb 8, 2019

I0208 18:18:26.188194    5356 daemon.go:516] Bootstrap pivot required
I0208 18:18:26.188204    5356 update.go:655] Updating OS to registry.svc.ci.openshift.org/ci-op-l08p2kyf/stable@sha256:21eba43a81fd1a6f9e114b2b957e398aeeb75cfd9a8f74c3b62fb714ba42e23c
I0208 18:18:26.188212    5356 run.go:13] Running: /bin/pivot registry.svc.ci.openshift.org/ci-op-l08p2kyf/stable@sha256:21eba43a81fd1a6f9e114b2b957e398aeeb75cfd9a8f74c3b62fb714ba42e23c
pivot version 0.0.2 (f55cf7b5d1b832ad3fecfff1aca09aa0f6969fc7)

...

I0208 18:20:58.639732    4944 start.go:52] Version: 3.11.0-586-g95a3071d-dirty
I0208 18:20:58.640449    4944 start.go:88] starting node writer
I0208 18:20:58.648801    4944 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0208 18:20:58.690876    4944 daemon.go:155] Booted osImageURL: registry.svc.ci.openshift.org/ci-op-l08p2kyf/stable@sha256:21eba43a81fd1a6f9e114b2b957e398aeeb75cfd9a8f74c3b62fb714ba42e23c (47.291)
I0208 18:20:58.692434    4944 daemon.go:227] Managing node: ip-10-0-174-241.ec2.internal
I0208 18:21:12.781634    4944 start.go:146] Calling chroot("/rootfs")
I0208 18:21:12.781706    4944 daemon.go:404] In bootstrap mode
I0208 18:21:13.551441    4944 daemon.go:432] Current+desired config: worker-095a512e0970e036a0f262d657a327da
I0208 18:21:13.551544    4944 daemon.go:520] No bootstrap pivot required; unlinking bootstrap node annotations
I0208 18:21:13.554818    4944 daemon.go:547] Validated on-disk state
I0208 18:21:13.554889    4944 daemon.go:579] In desired config worker-095a512e0970e036a0f262d657a327da
I0208 18:21:13.554993    4944 start.go:165] Starting MachineConfigDaemon
I0208 18:21:13.555048    4944 daemon.go:248] Enabling Kubelet Healthz Monitor
                    "worker": "3 out of 3 nodes have updated to latest configuration worker-095a512e0970e036a0f262d657a327da"

🎊

@cgwalters (Contributor, Author) commented Feb 8, 2019

Ah but here's the next problem, the machine-os-content needs to be updated:

We went from 47.308 to 47.291:

Upgraded:
  nss-altfiles 0-2.atomic.git20131217gite2a80593.el7 -> 2.18.1-11.el7
  ostree 2019.1-1.el7_6 -> 2019.1.5-6649032a375238255052a43adb8bc56faac989ca.8cbd7fc123ad6d6e4e8216211aee6f7dd6264886.el7
  ostree-grub2 2019.1-1.el7_6 -> 2019.1.5-6649032a375238255052a43adb8bc56faac989ca.8cbd7fc123ad6d6e4e8216211aee6f7dd6264886.el7
  pivot 0.0.2-0.1.el7 -> 0.0.2.11-f1ed664ed83e73268464e81019f213366d961bb2
  rpm-ostree 2019.1-3.atomic.el7 -> 2019.1.4-fa5be441b177a40b285ed1abc539c6f7770ab231.091833c72cefe9fcbb3af2b42dd07d8a8c9f63d2.el7
  rpm-ostree-libs 2019.1-3.atomic.el7 -> 2019.1.4-fa5be441b177a40b285ed1abc539c6f7770ab231.091833c72cefe9fcbb3af2b42dd07d8a8c9f63d2.el7
Downgraded:
  atomic-openshift-clients 4.0.0-0.164.0.git.0.88cca3f.el7 -> 4.0.0-0.150.0.git.0.f39ab66.el7
  atomic-openshift-hyperkube 4.0.0-0.164.0.git.0.88cca3f.el7 -> 4.0.0-0.150.0.git.0.f39ab66.el7
  atomic-openshift-node 4.0.0-0.164.0.git.0.88cca3f.el7 -> 4.0.0-0.150.0.git.0.f39ab66.el7
  cri-o 1.12.5-5.rhaos4.0.git9076a33.el7 -> 1.12.5-2.rhaos4.0.gitd4191df.el7
  glusterfs 3.12.2-40.el7 -> 3.12.2-32.el7
  glusterfs-client-xlators 3.12.2-40.el7 -> 3.12.2-32.el7
  glusterfs-fuse 3.12.2-40.el7 -> 3.12.2-32.el7
  glusterfs-libs 3.12.2-40.el7 -> 3.12.2-32.el7
  redhat-release-coreos 4.0-20180515.0.atomic.el7.0 -> 0-4749e1e9959e9dcb53804ed103dfde64c813ecd7.el7
  runc 1.0.0-58.dev.rhaos4.0.git2abd837.el7 -> 1.0.0-57.dev.git2abd837.el7
Removed:
  ostree-fuse-2019.1-1.el7_6.x86_64
Added:
  bubblewrap-0.3.1.6-94147e233fe200d1fe43a9a18c52475188b22798.el7.centos.x86_64
  ostree-libs-2019.1.5-6649032a375238255052a43adb8bc56faac989ca.8cbd7fc123ad6d6e4e8216211aee6f7dd6264886.el7.x86_64

@runcom (Member) commented Feb 15, 2019

/retest

1 similar comment
@runcom (Member) commented Feb 15, 2019

/retest

@ashcrow (Member) commented Feb 15, 2019

/test e2e-aws

@runcom (Member) commented Feb 15, 2019

/retest

@cgwalters (Contributor, Author) commented Feb 15, 2019

FWIW...and I could be wrong, I don't think the PR context matters here anymore. I think Tide runs its own tests and it automatically retests.

@jlebon (Member) commented Feb 15, 2019

FWIW, I was finally able to test this properly myself as well. All nodes pivoted on first boot as expected!
Nice work! 🎊

@cgwalters (Contributor, Author) commented Feb 15, 2019

One thing I'd thought was true but I wanted to dig through the logs to verify is:

I believe we're completing updates during the install, before the e2e tests start. From that last e2e-aws run, see e.g. machineconfigpools.json:

                        "lastTransitionTime": "2019-02-15T16:19:27Z",
                        "reason": "All nodes are updated with worker-6c6c3c661d5ab71c65bc64d4f03fc066",
                        "lastTransitionTime": "2019-02-15T16:08:25Z",
                        "reason": "All nodes are updated with master-2def176c128ac52721d107243aaed8a5",

And:

2019/02/15 15:47:34 Running pod e2e-aws
2019/02/15 16:26:13 Container setup in pod e2e-aws completed successfully

@runcom (Member) commented Feb 15, 2019

I believe we're completing updates during the install

Yes, I can see the lag at cluster bringup where the installer CVO waits for the MCO clusteroperator to be available, and as soon as the installation finishes, `oc get pods` shows the daemons restarted from the update.

@cgwalters (Contributor, Author) commented Feb 15, 2019

FWIW...and I could be wrong, I don't think the PR context matters here anymore. I think Tide runs its own tests and it automatically retests.

Wait, I think I was wrong:

** tide ** Pending — Not mergeable. Job ci/prow/e2e-aws has not succeeded.

/retest

@runcom (Member) commented Feb 15, 2019

/retest

@cgwalters (Contributor, Author) commented Feb 15, 2019

Hmm, another consistent failure of those two tests.

@cgwalters (Contributor, Author) commented Feb 15, 2019

OK, looking at the test grid, `operator Run template e2e-aws - e2e-aws container test` (30m22s) is way up there in flakes.

However, the openshift-tests test `[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]` (2m40s) doesn't seem to be very flaky.

@runcom (Member) commented Feb 15, 2019

OMG the CI doesn't really want to merge this patch

/retest

@cgwalters (Contributor, Author) commented Feb 15, 2019

OMG the CI doesn't really want to merge this patch

Other PRs are going in though. I am still worried that something we're doing in the updates is affecting the cluster or later tests.

I'm not seeing a consistent pattern yet though.

@ashcrow (Member) commented Feb 15, 2019

HAProxy and Prometheus flakes

/retest

@cgwalters (Contributor, Author) commented Feb 15, 2019

Hmm. Other PRs are merging...still worried there's something "residual" we're doing to the cluster. But I just logged into this current e2e-aws cluster, and it looks fine... `oc get clusteroperator` is all clean, same for `oc get pods --all-namespaces`.

@cgwalters (Contributor, Author) commented Feb 16, 2019

Well, that last run was a huge set of failures. Reading the logs, the pools and the operator are reporting ready/done before everything is updated, because they key off currentConfig, which we set quickly. I think we should change the MCD to only set currentConfig when it's done pivoting. Otherwise the pools are lying.

But even then, the systems seem to be updated. E.g. one of the master MCDs says:

I0215 22:24:35.343092 6783 daemon.go:647] In desired config master-14bca73029fbcab80e7b5d11d7f131b9

And the tests start much later:

2019/02/15 22:35:25 Container setup in pod e2e-aws completed successfully
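
To make the proposed ordering concrete, here is a rough, self-contained sketch (hypothetical helper names, not actual machine-config-daemon code) of setting currentConfig only after the pivot has completed:

```go
// Hypothetical sketch of the proposed MCD ordering; pivotToOSImage and
// setCurrentConfig are stand-ins, not real machine-config-daemon functions.
package main

import "fmt"

func pivotToOSImage(osImageURL string) error {
	// stand-in for invoking `pivot` with the target image and rebooting into it
	fmt.Println("pivoting to", osImageURL)
	return nil
}

func setCurrentConfig(configName string) {
	// stand-in for writing the node's currentConfig annotation
	fmt.Println("currentConfig =", configName)
}

func updateNode(desiredConfig, osImageURL string) error {
	if err := pivotToOSImage(osImageURL); err != nil {
		return err
	}
	// Only after the OS update has finished do we report the new config,
	// so pool status no longer claims "updated" before the pivot completes.
	setCurrentConfig(desiredConfig)
	return nil
}

func main() {
	_ = updateNode("worker-095a512e0970e036a0f262d657a327da", "registry.example/machine-os-content@sha256:deadbeef")
}
```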

@cgwalters (Contributor, Author) commented Feb 16, 2019

/retest

@kikisdeliveryservice (Member) commented Feb 16, 2019

@cgwalters there are some bigger CI fixes that are in progress (in other repos for e2e-aws) that should be going in in the next few days. We can always hold this until you are back on Tuesday and see if the CI resolves itself.

@ashcrow (Member) commented Feb 16, 2019

/retest

@cgwalters (Contributor, Author) commented Feb 16, 2019

Looks like that last run hit this.

@cgwalters (Contributor, Author) commented Feb 16, 2019

I0216 05:45:03.659689    6232 start.go:52] Version: 3.11.0-670-gb2bebde8-dirty
I0216 05:45:03.660529    6232 start.go:88] starting node writer
I0216 05:45:03.664584    6232 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0216 05:45:03.749243    6232 daemon.go:168] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:660061d6eae3ee6d93ca836cd52e6033f1d611c629c1ce47cf272c9e9bda2488 (47.318)                        
I0216 05:45:03.749539    6232 daemon.go:240] Managing node: ip-10-0-156-180.ec2.internal
F0216 05:45:03.770509    6232 start.go:142] binding pod mounts: exec: "mount": executable file not found in $PATH

wha 🤔

@cgwalters (Contributor, Author) commented Feb 16, 2019

/retest

@runcom (Member) commented Feb 16, 2019

Network failures in the last run (HAProxy, as always), plus one about security context, which is the first time I've seen it.

@cgwalters (Contributor, Author) commented Feb 16, 2019

/test images
/test e2e-aws

@runcom (Member) commented Feb 16, 2019

I wonder how #442 impacts this PR as well, tests seem to be quite stable there now (especially e2e-aws despite the usual flakes).

@smarterclayton (Member) commented Feb 16, 2019

On the mount thing: you somehow caught the new 4.0 base image (based on UBI, which has no util-linux).

#445 will fix your issue and also allow that to get pulled.

@openshift-bot commented Feb 17, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot commented Feb 17, 2019

@cgwalters: The following test failed, say /retest to rerun them all:

Test name: ci/prow/e2e-aws
Commit: 60d70a6
Rerun command: /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
