Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

baremetal: Default to emptyDir for the baremetal platform. #332

Merged
merged 1 commit into from Aug 28, 2019

Conversation

russellb
Copy link
Member

The "baremetal" platform includes automated provisioning of bare metal
hosts. emptyDir makes sense as the default, as we can't assume
anything else is available. It's possible that for a given
deployment, some storage system may be available (such as rook / ceph)
and the cluster may be configured to use something else.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 26, 2019
@bparees
Copy link
Contributor

bparees commented Jul 26, 2019

The "baremetal" platform includes automated provisioning of bare metal
hosts. emptyDir makes sense as the default, as we can't assume
anything else is available.

emptydir never makes sense as a default for any production deployment. If the customer installs this way and pushes a single image to the registry before they fix the storage to something "real" they will make a mess of their cluster.

We need to discuss alternatives.

@russellb
Copy link
Member Author

The "baremetal" platform includes automated provisioning of bare metal
hosts. emptyDir makes sense as the default, as we can't assume
anything else is available.

emptydir never makes sense as a default for any production deployment. If the customer installs this way and pushes a single image to the registry before they fix the storage to something "real" they will make a mess of their cluster.

We need to discuss alternatives.

any suggestions? :-)

what do the docs for bare metal UPI clusters say to do?

@russellb
Copy link
Member Author

The "baremetal" platform includes automated provisioning of bare metal
hosts. emptyDir makes sense as the default, as we can't assume
anything else is available.

emptydir never makes sense as a default for any production deployment. If the customer installs this way and pushes a single image to the registry before they fix the storage to something "real" they will make a mess of their cluster.
We need to discuss alternatives.

any suggestions? :-)

what do the docs for bare metal UPI clusters say to do?

Found the docs: https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#installation-registry-storage-config_installing-bare-metal

basically, set up a PVC and use that, right?

Can we put this default in, just so the install completes cleanly, but then the install script for the KNI clusters will immediately reconfigure this to use a PVC using rook? These KNI clusters with rook are the only place where the baremetal platform will be supported in 4.2.

@adambkaplan
Copy link
Contributor

The typical recommendation for baremetal is to use the PVC storage config, which implies separately provisioning a storage solution like ceph or NFS

@russellb
Copy link
Member Author

I've filed this issue to ensure that the KNI install scripts immediately reconfigure the registry to use ceph once OCS has been deployed: openshift-kni/install-scripts#5

so this PR would just be to ensure create cluster finishes cleanly, but we wouldn't actually use the registry in this config.

would a comment in the code about this help?

@bparees
Copy link
Contributor

bparees commented Jul 26, 2019

I've filed this issue to ensure that the KNI install scripts immediately reconfigure the registry to use ceph once OCS has been deployed: openshift-kni/install-scripts#5

Will that be in place in GA, or that's something that would be done in a future version?

so this PR would just be to ensure create cluster finishes cleanly, but we wouldn't actually use the registry in this config.

That's ok for RHHI installs, but i imagine other installs will also see the platform as "baremetal" and use emptydir and they won't have KNI install scripts to fix it up after the install completes, right?

would a comment in the code about this help?

definitely, but i still have the above concern about this approach impacting non-RHHI baremetal installs.

@bparees
Copy link
Contributor

bparees commented Jul 26, 2019

can we uniquely identify RHHI installs somehow?

@eparis
Copy link
Member

eparis commented Jul 29, 2019

Can we please just fix the registry to create a PVC and not block upgrade if the PVC is never satisfied? Fire an alert, but don't block upgrades.

@russellb
Copy link
Member Author

I've filed this issue to ensure that the KNI install scripts immediately reconfigure the registry to use ceph once OCS has been deployed: openshift-kni/install-scripts#5

Will that be in place in GA, or that's something that would be done in a future version?

We can make it a GA blocker. The baremetal platform will not be used in any GA environment until at least 4.3, and will have OCS available in all cases.

so this PR would just be to ensure create cluster finishes cleanly, but we wouldn't actually use the registry in this config.

That's ok for RHHI installs, but i imagine other installs will also see the platform as "baremetal" and use emptydir and they won't have KNI install scripts to fix it up after the install completes, right?

That's true. RHHI is the only environment where we support this baremetal managed platform right now, though. So for RHHI we can automate the switch away from emptydir post-install. For any potential future use of this platform where OCS might not be available, we'd document in big bold letters that this must be reconfigured post-install to whatever is appropriate for their environment.

would a comment in the code about this help?

definitely, but i still have the above concern about this approach impacting non-RHHI baremetal installs.

Fair concern - hopefully my comment above helps, namely that there's no non-RHHI baremetal platform installs. This issue would also block any future other usage of this platform.

The "baremetal" platform includes automated provisioning of bare metal
hosts.  emptyDir makes sense as the default, as we can't assume
anything else is available.  It's possible that for a given
deployment, some storage system may be available (such as rook / ceph)
and the cluster may be configured to use something else.
@bparees
Copy link
Contributor

bparees commented Jul 29, 2019

@eparis

Can we please just fix the registry to create a PVC and not block upgrade if the PVC is never satisfied? Fire an alert, but don't block upgrades.

  1. it is not "fixing" the registry. The current behavior is not a bug. It was extensively discussed and agreed upon. So please don't call it "fixing" it. If you'd like it to be changed, then ask for the behavior to be changed and we can (once again) discuss why that is problematic.

  2. With what you propose the registry will be unavailable (no pods running). We should never report "registry is available" when the registry is not available, or we will have started down a slippery slope of under what specific circumstances we lie to the administrator about the registry availability because we "think" it's ok that the registry is not available in this particular circumstance. When an admin screws up their manual PV config and we report "hey your registry is totally fine, nothing to see here", that is a bad user experience and defeats the entire point of what we are trying to do w/ operator statuses+reporting.

@russellb

We can make it a GA blocker. The baremetal platform will not be used in any GA environment until at least 4.3, and will have OCS available in all cases.

Having OCS available isn't sufficient for this behavior, the point is that we need to not be using emptydir. If you're saying that in all baremetal environments we will swap the storage to PVs post-install, then ok I guess. It's still a bit scary since we're dependent on some external bit doing what it's supposed to or else the admin ends up w/ an emptydir registry and doesn't realize it.

For any potential future use of this platform where OCS might not be available, we'd document in big bold letters that this must be reconfigured post-install to whatever is appropriate for their environment.

I guess we can cross that bridge when we come to it. I'd still prefer something better, but it's something.

Fair concern - hopefully my comment above helps, namely that there's no non-RHHI baremetal platform installs. This issue would also block any future other usage of this platform.

It does help, thanks. I'm ok w/ proceeding with this (using emptydir on "baremetal providers" for now given that).

Just to confirm, this "baremetal" provider is distinct from the install config the registry operator will see in a true baremetal (not managed) install, correct? We definitely not want this logic to cause the registry to run emptydir in an unmanaged baremetal installation today.

@russellb
Copy link
Member Author

Just to confirm, this "baremetal" provider is distinct from the install config the registry operator will see in a true baremetal (not managed) install, correct? We definitely not want this logic to cause the registry to run emptydir in an unmanaged baremetal installation today.

That's right. This will not affect UPI bare metal installs. This only applies to bare metal installs where we've enabled the extra automated provisioning and management of bare metal hosts, or the "baremetal" IPI platform for lack of a better way to refer to it. The only place that's used right now is RHHI.

I would definitely like to figure out a way to make this use a rook/ceph PV automatically, instead of relying on a post-install script. Maybe that's something we can figure out before GA. For 4.2 this will just be a limited beta thing, so the post-install script seems OK. Any suggestions for improvements post-4.2 definitely welcome!

@russellb
Copy link
Member Author

/test e2e-aws

@bparees
Copy link
Contributor

bparees commented Jul 31, 2019

@russellb i'm not sure who the right set of people are, but i mentioned to @derekwaynecarr and @adambkaplan that it would be good to rename the "baremetal" provider to something more specific to RHHI, so that if a future baremetal provider wants to do something else, the registry operator can distinguish providers and do the right thing. (alternatively if there is something else in the install config that the registry operator can look at to make the distinction, it would be good to key off that).

Do you know who we'd need to involve in such a rename discussion?

@russellb
Copy link
Member Author

@russellb i'm not sure who the right set of people are, but i mentioned to @derekwaynecarr and @adambkaplan that it would be good to rename the "baremetal" provider to something more specific to RHHI, so that if a future baremetal provider wants to do something else, the registry operator can distinguish providers and do the right thing. (alternatively if there is something else in the install config that the registry operator can look at to make the distinction, it would be good to key off that).

Do you know who we'd need to involve in such a rename discussion?

I'm probably the best person to coordinate a rename if necessary. Changing it in 4.3 is probably doable, but after that it's going to be less practical to change.

This isn't RHHI specific, though. Architecturally, it's IPI for bare metal hosts. RHHI is just one particular way it's productized, but there could be others in the future. Anything we can do to extend and improve the baremetal platform so that it could default to something better out of the gate would be great, of course.

@bparees
Copy link
Contributor

bparees commented Jul 31, 2019

This isn't RHHI specific, though. Architecturally, it's IPI for bare metal hosts.

The case that I made to @derekwaynecarr (not sure if he agreed and perhaps i'm missing something about the baremetal IPI intent) is that you could have another baremetal IPI provider that wants, for example, to use noobaa, or some other s3 gateway instead of OCS for registry storage.

If we are assuming/requiring that every baremetal IPI installer will

  1. install OCS
  2. include install script logic to switch the registry from emptydir to PVC (via PV provided by OCS)

then it's fine. But if we think there's any reason the storage behavior or install flow might be different for another IPI provider, we will need a way to distinguish.

@russellb
Copy link
Member Author

This isn't RHHI specific, though. Architecturally, it's IPI for bare metal hosts.

The case that I made to @derekwaynecarr (not sure if he agreed and perhaps i'm missing something about the baremetal IPI intent) is that you could have another baremetal IPI provider that wants, for example, to use noobaa, or some other s3 gateway instead of OCS for registry storage.

If we are assuming/requiring that every baremetal IPI installer will

1. install OCS

2. include install script logic to switch the registry from emptydir to PVC (via PV provided by OCS)

then it's fine. But if we think there's any reason the storage behavior or install flow might be different for another IPI provider, we will need a way to distinguish.

We're not actually assuming anything, that's why the patch defaults to emptyDir, and expects that this must be dealt with post-install. That leaves all of those options, depending on how the bare metal platform is used and combined with different storage options.

@russellb
Copy link
Member Author

CC @eparis -- you may be interested in this discussion

@bparees
Copy link
Contributor

bparees commented Jul 31, 2019

We're not actually assuming anything, that's why the patch defaults to emptyDir, and expects that this must be dealt with post-install.

that is an assumption though, right? you are assuming it will be dealt with post-install.

I guess i shouldn't have said "assumes OCS/PVC". But what i should have said is: this is ok if it is always safe to assume that for any baremetal IPI implementation, the post-install steps will always update the storage configuration.

@adambkaplan
Copy link
Contributor

@russellb would it make more sense for us to create a PVC, and require a baremetal "provider" have an automatic storage provisioner?

@bparees
Copy link
Contributor

bparees commented Jul 31, 2019

@russellb would it make more sense for us to create a PVC, and require a baremetal "provider" have an automatic storage provisioner?

the problem there is that the provisioner is OCS, but OCS won't be installed until after the basic install is complete, and the basic install can't complete until the registry comes up.

@adambkaplan
Copy link
Contributor

the problem there is that the provisioner is OCS, but OCS won't be installed until after the basic install is complete, and the basic install can't complete until the registry comes up.

Can the installer be enhanced so that for baremetal IPI it brings up the image registry operator (and perhaps the samples operator) in the Removed state? Then we shouldn't block install as we report Available=true in that state.

It would then be up to the post-install script to do the following:

  1. Install OCS/other auto-provisioner for ReadWriteMany PVs
  2. Set the image registry operator (and other potential operators) to Managed

@bparees
Copy link
Contributor

bparees commented Jul 31, 2019

Can the installer be enhanced so that for baremetal IPI it brings up the image registry operator (and perhaps the samples operator) in the Removed state?

i don't think that is a path that exists in the installer today and i think there would be significant pushback on adding it. The installer lays down the manifests as provided, it's not going to customize specific manifests based on provider.

@wking
Copy link
Member

wking commented Aug 1, 2019

The installer lays down the manifests as provided, it's not going to customize specific manifests based on provider.

Instead of handling PV config post-install, can we feed the installer some manifests to set up PV during the IPI-ish install? Users who went that route could configure the registry to use whatever storage they were configuring, and install would go smoothly and end up with a happy, functional, production registry.

Users who did not add additional manifests would have install hang and exit "the registry never got happy", and folks could investigate the registry, see that it was waiting on storage, and do whatever day-2 stuff they needed to fix it up then.

Or is that not feasible for some reason (maybe you need something more interactive to set up PV, and can't just drop in a bunch of manifests into the installer)?

@russellb
Copy link
Member Author

@bparees is this something that should already work today? or something that requires more work? Are there other relevant issues / PRs i should watch?

@bparees
Copy link
Contributor

bparees commented Aug 26, 2019

@bparees is this something that should already work today? or something that requires more work? Are there other relevant issues / PRs i should watch?

i'm not sure the current or intended state at this point. @dmage @adambkaplan

@russellb
Copy link
Member Author

@bparees is this something that should already work today? or something that requires more work? Are there other relevant issues / PRs i should watch?

i'm not sure the current or intended state at this point. @dmage @adambkaplan

OK - I need to unblock this somehow for 4.2, either by merging this PR (and following up in future releases with improvements) or by confirming that the discussed alternative behavior is ready (ability to install without a registry, and allowing it to be configured post-install instead).

@bparees
Copy link
Contributor

bparees commented Aug 26, 2019

I could live w/ merging this and following up w/ the change to keep the registry in a removed state initially instead. but it is @adambkaplan's call.

@dmage
Copy link
Member

dmage commented Aug 26, 2019

AFAIK we already declared baremetal support in 4.1. So even if this change is indented to fix the problem in 4.3 for RHHI, it can affect other baremetal (UPI) customers. How can we make sure that customers switch the registry to some kind of persistent storage before using it?

@russellb
Copy link
Member Author

Note that this patch has no effect on UPI bare metal deployments. It only affects bare metal IPI.

@adambkaplan
Copy link
Contributor

@russellb what sort of support/guarantee are we giving for baremetal IPI installs? I'd say no to this if this is intended for GA code. However, I'd be OK if we considered baremetal IPI developer preview.

@bparees
Copy link
Contributor

bparees commented Aug 27, 2019

@russellb what sort of support/guarantee are we giving for baremetal IPI installs? I'd say no to this if this is intended for GA code. However, I'd be OK if we considered baremetal IPI developer preview.

didn't we previously establish that for 4.2, RHHI is the only supported baremetal IPI install? That being the case, we can have confidence that the storage will be "made correct" by the RHHI install process.

@russellb
Copy link
Member Author

@russellb what sort of support/guarantee are we giving for baremetal IPI installs? I'd say no to this if this is intended for GA code. However, I'd be OK if we considered baremetal IPI developer preview.

didn't we previously establish that for 4.2, RHHI is the only supported baremetal IPI install? That being the case, we can have confidence that the storage will be "made correct" by the RHHI install process.

That is correct.

@sdodson
Copy link
Member

sdodson commented Aug 27, 2019

IPI vs UPI has no meaning here, what does matter is that when people perform a baremetal UPI install they're doing so by selecting platform=none

@dmage
Copy link
Member

dmage commented Aug 27, 2019

@sdodson yes, that confused me a bit, for RHHI.next we use BareMetal, and for baremetal installations we use None.

@russellb
Copy link
Member Author

@sdodson yes, that confused me a bit, for RHHI.next we use BareMetal, and for baremetal installations we use None.

Architecturally it’s the bare metal platform for infrastructure automation. RHHI is just a product delivery that uses it, among other things. It could also be generally supported in OpenShift some day in the future. Make sense?

@adambkaplan
Copy link
Contributor

IPI vs UPI has no meaning here, what does matter is that when people perform a baremetal UPI install they're doing so by selecting platform=none

/approve

Re-hashing this thread for clarity:

For RHHI the installation will be platform=baremetal, with this patch the registry will be online with emptyDir. RHHI installer will then change the storage to PVC - presumably after OCS has been added via OLM. I've come around and am OK with this for 4.2.

We are considering in 4.3 changing this process so that for platform=baremetal we bring the registry up as Removed. We would then default to PVC once an external installer (RHHI, RHV, RHHI.next) updates the operator management state to Managed.

@russellb can you please verify that a) my understanding of the process for 4.2 is correct, and b) if the proposed changes for 4.3 make sense.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2019
@russellb
Copy link
Member Author

@russellb can you please verify that a) my understanding of the process for 4.2 is correct,

Yes.

and b) if the proposed changes for 4.3 make sense.

It makes sense, though I haven't looked at what it will take. Is coming up as Removed supported already? If not, is something already tracking this idea? I just wasn't sure if this was something I should be able to just do right now vs. needing to wait for a new feature.

@adambkaplan
Copy link
Contributor

Is coming up as Removed supported already? If not, is something already tracking this idea?

The registry can be set to Removed, but we are not currently bootstrapping the operator in this state.

DevEx is tracking this in our 4.3 planning - enhancement proposal and JIRA story is forthcoming.

@russellb
Copy link
Member Author

I've filed #375 for the TODO item to change this once the registry operator can come up with the registry set to Removed.

@adambkaplan
Copy link
Contributor

@dmage want your take here - I'm OK merging this now and bringing the registry up as Removed in 4.3.

@dmage
Copy link
Member

dmage commented Aug 28, 2019

Hopefully we'll have capacity to improve it in 4.3.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 28, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, dmage, russellb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 2eb759a into openshift:master Aug 28, 2019
russellb added a commit to russellb/dev-scripts that referenced this pull request Sep 3, 2019
The cluster-image-registry-operator will now use emptyDir by default
for the "baremetal" infrastructure platform, so this manifest is no
longer needed.

openshift/cluster-image-registry-operator#332

Note this is still not a supported configuration, and is only used to
bring up the cluster.  It should be immediately reconfigured to use a
PVC once storage has been set up.  For more info on that post-install
task, see:

openshift-kni/install-scripts#5
russellb added a commit to russellb/dev-scripts that referenced this pull request Sep 3, 2019
The cluster-image-registry-operator will now use emptyDir by default
for the "baremetal" infrastructure platform, so this manifest is no
longer needed.

openshift/cluster-image-registry-operator#332

Note this is still not a supported configuration, and is only used to
bring up the cluster.  It should be immediately reconfigured to use a
PVC once storage has been set up.  For more info on that post-install
task, see:

openshift-kni/install-scripts#5

Closes openshift-metal3#704
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants