Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster-samples-operator and stuck upgrade from 4.6 to 4.7 with restricted networks #385

Closed
toastbrotch opened this issue Aug 18, 2021 · 9 comments · Fixed by #394
Closed

Comments

@toastbrotch
Copy link

toastbrotch commented Aug 18, 2021

Hi

we had set cluster-samples-operator to unmanaged and we have limited registries to our own (quay which ist not pull-thru-able) with allowedRegistriesForImport in the "images.config cluster" in our network restricted environment.

As we startet upgrade from 4.6.latest to 4.7.22. the whole upgrade got stuck with the creation of the is/hello-openshift. The only way we were able to finish upgrade was adding quay.io, registry.redhat.io and registry.access.redhat.com to the allowedRegistriesForImport and set cluster-samples-operator to Managed. i'm not sure if the steps described at https://github.com/openshift/cluster-samples-operator#troubleshooting helped us anything at all. nevertheless i had to remove the override in the clusterversion as otherwise openshift had always a warning hanging around (sorry forgot which).

to be honest this took us almost 1 day to figure out, which is way to much for just samples, we're not even interested. i hope the whole removal of the cluster-samples-operator would be possible in future... so its off and stays off without breaking my cluster.

regards,ivo

@gabemontero
Copy link
Contributor

Yeah across all the openshift operators, you should be using Removed instead of Unmanaged if you don't want that operator running.

Definitely something that has evolved since 4.1 when this stuff first arrived.

I can take a look at updating the README to better explain this.

@gabemontero
Copy link
Contributor

We also have a recent fix wrt delays in start up when in disconnected clusters. See #384

Also, overriding is for development purposes only. Having overrides violates your warranting for official OCP support. I'll make sure the README makes that clear.

@toastbrotch
Copy link
Author

toastbrotch commented Aug 19, 2021

Why are samples a clusteroperator? Why not just another operator that is completely removable?
For my understanding samples should exist, but are clearly not mandatory to the cluster. so it shouldn't be a clusteroperator.

@gabemontero
Copy link
Contributor

Yeah unfortunately it was the only reliable means of delivering samples back in the 4.0/4.x days a few years ago. Although not a given/mandatory as you noted, more of our customers at the time (and probably still the case) needed them, so this is how we landed.

There have been internal discussions in engineering as well as external requests from time to time to pull samples out of the payload.

I am told that among other things https://issues.redhat.com/browse/RFE-722 will be re-opened soon and decoupling samples from the OCP payload could be part of that. It is also possible it will be achieved via separate RFEs as well.

@sbose78 is currently working on some internal proposals for this.

But it is not far enough along @toastbrotch that I can give you a definitive target date.

@dperaza4dustbit FYI

Short term, as I noted originally in your issue, I'll be making some README updates here wrt some of the related details.

@toastbrotch
Copy link
Author

toastbrotch commented Aug 25, 2021

even with "managementState: Removed" the upgrade on another cluster stoped! again with the same problem as originaly reported. everthing else worked flawless...

So in my opinion this cluster operator is broken in network-restricted setups where you have only your own registry allowed with allowedRegistriesForImport (https://docs.openshift.com/container-platform/4.7/post_installation_configuration/preparing-for-users.html#images-configuration-parameters_post-install-preparing-for-users)!

if you happen to experience the same, try this:

# whitelist temporary the redhat registries 
#
# WARNING: this might not work as your firewall is not open or you're not allowed to get direct access to those registries, ...
#
oc edit images.config cluster
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
 
# let the management-operator be managed
oc edit configs.samples.operator.openshift.io -n openshift-cluster-samples-operator
  managementState: Managed
 
# restart operators
oc delete pods -n openshift-cluster-samples-operator --all
 
# upgrade should proceed after some minutes
 
# let it be removed again
oc edit configs.samples.operator.openshift.io -n openshift-cluster-samples-operator
  managementState: Removed
 
# remove those added registries
oc edit images.config cluster

# restart operator
oc delete pods -n openshift-cluster-samples-operator --all

additionaly here some information how to build air-gap friendly operators:
https://www.openshift.com/blog/is-your-operator-air-gap-friendly and
https://www.openshift.com/blog/building-an-air-gap-friendly-operator

@gabemontero
Copy link
Contributor

thanks for the detail @toastbrotch

as to where the problem may lie, from what you described, it very well could be in the samples operator's main dependencies, the imagestream controller/apiserver support and the internal images registry, and CVO itself, where even though we are removed, there are a few "in payload" imagestreams like cli and for a time hello openshift, which are actually created / maintained by the CVO, if though their specifications are in the sample operator manifest.

I'll have to reach out to our QE to redo this test with the precise setup you have described, but ideally, if you could provide me

  1. the yaml output from oc get configs.samples.operator.openshift.io cluster -o yaml ... save it to a file and attach it here
  2. the yaml output from oc get clusteroperator openshift-samples -o yaml ... save it to a file and attach it here
  3. the yaml output from oc get is -n openshift -o yaml ... save it to a file and attach it here

That would expedite progress.

Also, in case I do need to pull in other teams, ideally, if you are an OCP customer, open a support case or even bugzilla if you have direct access to that. A bugzilla would facilitate me bringing in other teams if need be.

thanks again

@toastbrotch
Copy link
Author

@gabemontero attached the output after upgrade with my described workaround
samples-imagestreams.txt
samples-clusteroperator.txt
samples-operator.txt

@gabemontero
Copy link
Contributor

Thanks for the data @toastbrotch

It was interesting in that everythings from a samples operator perspective is clean and what we would expect. It says it is a 4.7, everything removed, not degraded.

And yes, the hello openshift imagestream works when you specify quay.io in the allowed registries, facilitating your mirror. The image imported: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9da94ec207ee7ea43ba0749ad89df14eccc0676f98eef7d355585f68085d35cd

I have been in contact with the team that owns imagestream import (both development and test). I'll tag them here: @dmage for dev, @xiuwang for test.

@xiuwang is performing one more test to confirm, but from what @dmage tells me, having to specify your local mirror in allowedImageRegistries while disconnected is required. He says the OpenShift API server doesn't know about the local mirror registry.

I'll defer to him on how possible a change might be so that having the imagestream import logic in the OpenShift API server automatically

The further complication with all this is that even though the hello openshift imagestream was defined in the samples operator manifest at https://github.com/openshift/cluster-samples-operator/blob/release-4.7/manifests/08-openshift-imagestreams.yaml it is actually the CVO which manages the creation and status of those imagestreams. So the failure reporting around this is outside the code of this repository, unless it is determined that say annotations defined on the imagestreams in https://github.com/openshift/cluster-samples-operator/blob/release-4.7/manifests/08-openshift-imagestreams.yam somehow bypass the errors CVO is reporting.

From @dmage :

An image stream will be rejected if it points to a registry that is not
in allowedRegistriesForImport. OpenShift API server doesn't know about
the local mirror registry.

CVO can reconcile annotations and other properties if a forbidden image
reference is already in the image stream. **But it'll fail to create new
image streams with the same image reference. allowedRegistriesForImport
is applied before any ICSP rules, so the registry name from the image
stream object should be allowed.**

hello-openshift falls into that last sentence, in that it did not exist in 4.6.

So, next steps:

  1. as it turns out, for totally unrelated reasons, hello-openshift was removed from the product. So far, that only happened in 4.8 with https://bugzilla.redhat.com/show_bug.cgi?id=1975539 ..... I'm going to backport that fix to 4.7. Presumably then, upgrades to later versions of 4.7 from 4.6 won't have this delta to deal with.

  2. We'll continue to work with @dmage and team on whether the imagestream import logic can ever learn about the local mirror. In case we ever add new imagestreams to https://github.com/openshift/cluster-samples-operator/blob/master/manifests/08-openshift-imagestreams.yaml

  3. In addition the to prior readme updates I cited in cluster-samples-operator and stuck upgrade from 4.6 to 4.7 with restricted networks #385 (comment) I'll include this scenario in the list ... we'll also want to consider product docs wrt that

@dperaza4dustbit @jitendar-singh FYI

@gabemontero
Copy link
Contributor

OK everyone, my last status update wrt this:

  • the Bug 2002368: acccount for image api returning invalid on imagestream create based on allowed/blocked registry settings #394 PR and associated BZ will catch if the global image config blocks the registry used by the sample imagestreams,
  • if it happens on initial install, samples will bootstrap as removed
  • if it happens after the initial install, through a samples or global image config change, samples will citing it as a config error.... itstill be degraded, but the cluster admin will be alerted that they have a mismatch if imagestream vs. samples configuration that needs to be addressed via the existing alerts for invalid config
  • that addresses the imagestreams samples operator actually manages
  • also updated the README for some of the stale guidance
  • for the in payload imagestreams that the CVO creates (like 'hello-opeshift' noted here), I've opened https://bugzilla.redhat.com/show_bug.cgi?id=2003814 for the imagestream team to track .... most likely, in payload imagestreams need to be special cased somehow and not blocked ... most of them are needed for support
  • with Bug 2001977: delete hello-openshift in payload imagestream #393 I'm deleting 'hello-openshift' in 4.7 .... this is a backport of deletes of it in 4.8 and beyond ... basically, we've been told it should not have gone out the door

when #394 merges, I'll be closing this item out in favor of the bugzillas referenced (which are open to the public).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants