
Commit

Merge pull request #400 from gabemontero/no-degrade-boot-removed-allow-block-img-reg-49

Bug 2009722: account for image api returning invalid on imagestream create based on allowed/blocked registry settings
openshift-merge-robot committed Nov 12, 2021
2 parents 49a97d3 + 9952a2b commit eb7466b
Showing 8 changed files with 219 additions and 30 deletions.
31 changes: 28 additions & 3 deletions README.md
@@ -117,7 +117,32 @@ Samples: `oc get is -n openshift`, `oc get templates -n openshift` … use of -

Deletion of the CRD instance will reset the samples operator to the default configuration, but leave the current revision of the samples operator pod running.

If there is a bug in the samples operator deletion logic, you can reset the samples operator configuration by stopping the current pod and starting a new one:
You can also set the management state of the samples operator to `Removed` if you do not need samples running in your
cluster:

```yaml
apiVersion: samples.operator.openshift.io/v1
kind: Config
metadata:
name: cluster
spec:
architectures:
- x86_64
managementState: Removed
```

# "In-payload" imagestreams

The imagestreams defined in [manifests/08-openshift-imagestreams.yaml](https://github.com/openshift/cluster-samples-operator/blob/master/manifests/08-openshift-imagestreams.yaml)
are not managed by the samples operator. These are special imagestreams that point to images in the install payload, and their creation and updates are handled by the
[Cluster Version Operator](https://github.com/openshift/cluster-version-operator).

# Development

NOTE: this next step violates your cluster support contract if you are an OCP customer. But if you are using a cluster
for development, and you wish to experiment with and update the samples operator code itself, you can reset the image used
for the samples operator via the following steps:

- Run `oc edit clusterversion version` and add an override entry to the spec for the Deployment of the samples operator so it is unmanaged:

```yaml
spec:
  overrides:
  - group: apps
    kind: Deployment
    name: cluster-samples-operator
    namespace: openshift-cluster-samples-operator
    unmanaged: true
```

- Scale down the deployment via `oc scale deploy cluster-samples-operator --replicas=0`
- Edit the samples resource, `oc edit config.samples cluster`, remove the finalizer
- Delete the samples resource, `oc delete config.samples cluster`, config map/secrets in operator namespace, imagestreams and templates in openshift namespace
- Scale up the deployment via `oc scale deploy cluster-samples-operator --replicas=1`


5 changes: 4 additions & 1 deletion manifests/010-prometheus-rules.yaml
@@ -86,10 +86,13 @@ spec:
severity: info
annotations:
message: |
One of two situations has occurred. Either
samples operator could not access 'registry.redhat.io' during its initial installation and it bootstrapped as removed.
If this is expected, and stems from installing in a restricted network environment, please note that if you
plan on mirroring images associated with sample imagestreams into a registry available in your restricted
network environment, and subsequently moving samples operator back to 'Managed' state, a list of the images
associated with each image stream tag from the samples catalog is
provided in the 'imagestreamtag-to-image' config map in the 'openshift-cluster-samples-operator' namespace to
assist the mirroring process.
Or, the use of allowed registries or blocked registries with global imagestream configuration will not allow
samples operator to create imagestreams using the default image registry 'registry.redhat.io'.
15 changes: 15 additions & 0 deletions manifests/03-rbac-imageconfig-role-binding.yaml
@@ -0,0 +1,15 @@
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-samples-operator-imageconfig-reader
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
subjects:
- kind: ServiceAccount
  name: cluster-samples-operator
  namespace: openshift-cluster-samples-operator
roleRef:
  kind: ClusterRole
  name: cluster-samples-operator-imageconfig-reader
17 changes: 17 additions & 0 deletions manifests/03-rbac-imageconfig-role.yaml
@@ -0,0 +1,17 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-samples-operator-imageconfig-reader
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
rules:
- apiGroups:
  - config.openshift.io
  resources:
  - images
  verbs:
  - get
  - list
  - watch
15 changes: 13 additions & 2 deletions pkg/stub/config.go
@@ -75,6 +75,14 @@ func (h *Handler) SpecValidation(cfg *v1.Config) error {
			}
		}
	}

	// if the global imagestream config prevents the creation of imagestreams with the samples registry, we mark
	// invalid config
	if h.imageConfigBlocksImageStreamCreation(cfg.Spec.SamplesRegistry) {
		err := fmt.Errorf("global openshift image configuration prevents the creation of imagestreams using the registry %s", cfg.Spec.SamplesRegistry)
		return h.processError(cfg, v1.ConfigurationValid, corev1.ConditionFalse, err, "%#v")
	}

	h.GoodConditionUpdate(cfg, corev1.ConditionTrue, v1.ConfigurationValid)
	return nil
}
@@ -282,14 +290,17 @@ func (h *Handler) processError(opcfg *v1.Config, ctype v1.ConfigConditionType, c

	// most reasons for degraded status are handled in ClusterOperatorStatusDegradedCondition based on
	// state in the opcfg correlated with other factors; but when the error should directly drive the
	// reason text, we set the reason here; at this time, that means either API Server errors,
	// errors accessing the local file system, or global imagestream config blocking the use of our configured
	// registry
	reason := kerrors.ReasonForError(err)
	switch {
	case len(reason) > 0 && reason != kapis.StatusReasonUnknown:
		status.Reason = fmt.Sprintf("APIServer%sError", reason)
	case cstatus == corev1.ConditionUnknown && ctype == v1.SamplesExist:
		status.Reason = "FileSystemError"
	default:
		status.Reason = "GlobalImageStreamConfigRegistryMismatch"
	}

	util.ConditionUpdate(opcfg, status)
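The reason-selection precedence in `processError` can be sketched in isolation. This is a simplified, self-contained approximation; the function and parameter names here are illustrative, not the operator's actual API (the real code derives the API-server reason via apimachinery's `ReasonForError`):

```go
package main

import "fmt"

// statusReason picks a condition Reason following the precedence used by
// processError: a concrete API-server reason wins, then the file-system
// error case for SamplesExist, and otherwise the global image config
// registry mismatch introduced by this change.
func statusReason(apiReason string, conditionUnknown bool, samplesExist bool) string {
	switch {
	case apiReason != "" && apiReason != "Unknown":
		return fmt.Sprintf("APIServer%sError", apiReason)
	case conditionUnknown && samplesExist:
		return "FileSystemError"
	default:
		return "GlobalImageStreamConfigRegistryMismatch"
	}
}

func main() {
	fmt.Println(statusReason("Invalid", false, false)) // APIServerInvalidError
	fmt.Println(statusReason("", true, true))          // FileSystemError
	fmt.Println(statusReason("", false, true))         // GlobalImageStreamConfigRegistryMismatch
}
```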
140 changes: 124 additions & 16 deletions pkg/stub/handler.go
@@ -3,6 +3,7 @@ package stub
import (
	"context"
	"crypto/tls"
	"errors"
	"fmt"
	"io/ioutil"
	"net"
@@ -25,6 +26,7 @@ import (
	restclient "k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"

	configv1 "github.com/openshift/api/config/v1"
	imagev1 "github.com/openshift/api/image/v1"
	operatorsv1api "github.com/openshift/api/operator/v1"
	v1 "github.com/openshift/api/samples/v1"
@@ -306,6 +308,16 @@ func IsRetryableAPIError(err error) bool {
	return false
}

func ShouldNotGoDegraded(err error) bool {
	// retryable and conflict errors should be self-explanatory for not going degraded;
	// in particular with conflicts, they are quite possible with imagestreams during startup given
	// the multi tag, highly concurrent nature of creating / updating;
	// for invalid, we special case this because that is what the imagestream apiserver endpoint
	// returns if a registry is missing from the images allowed registry list, or is explicitly
	// listed in the blocked registry list
	return IsRetryableAPIError(err) || kerrors.IsConflict(err) || kerrors.IsInvalid(err)
}

// this method assumes it is only called on initial cfg create, or if the Architecture array len == 0
func (h *Handler) updateCfgArch(cfg *v1.Config) *v1.Config {
	switch {
@@ -327,6 +339,81 @@ func (h *Handler) updateCfgArch(cfg *v1.Config) *v1.Config {
	return cfg
}

func (h *Handler) imageConfigBlocksImageStreamCreation(name string) bool {
	//TODO openshift/client-go/config/clientset/versioned/fake and ConfigV1Interface has compile issues with
	// respect to ConfigV1Client (missing method implementations), so for now we cannot use it in our
	// unit tests. In the actual runtime if we cannot create the client the operator errors out on startup.
	if h.configclient == nil {
		return false
	}
	var imgCfg *configv1.Image
	err := wait.PollImmediate(5*time.Second, 20*time.Second, func() (done bool, err error) {
		// if the image config allowed registry or blocked registry list will prevent the creation of imagestreams,
		// we consider this inaccessible
		imgCfg, err = h.configclient.Images().Get(context.TODO(), "cluster", metav1.GetOptions{})
		if err != nil {
			logrus.Printf("unable to retrieve image configuration as part of testing %s connectivity: %s", name, err.Error())
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		return true
	}

	ok := false
	// check allowed domain list
	if len(imgCfg.Spec.AllowedRegistriesForImport) > 0 {
		for _, rl := range imgCfg.Spec.AllowedRegistriesForImport {
			logrus.Printf("considering allowed registries domain %s for %s", rl.DomainName, name)
			if strings.HasSuffix(name, rl.DomainName) {
				logrus.Printf("the allowed registries domain %s allows imagestream creation access for search param %s", rl.DomainName, name)
				ok = true
				break
			}
		}
		if !ok {
			logrus.Printf("no allowed registries items will permit the use of %s", name)
			return true
		}
	}

	ok = false
	// check allowed specific registry list
	if len(imgCfg.Spec.RegistrySources.AllowedRegistries) > 0 {
		for _, r := range imgCfg.Spec.RegistrySources.AllowedRegistries {
			logrus.Printf("considering allowed registry %s for %s", r, name)
			if strings.TrimSpace(r) == strings.TrimSpace(name) {
				logrus.Printf("the allowed registry %s allows imagestream creation for search param %s", r, name)
				ok = true
				break
			}
		}
		if !ok {
			logrus.Printf("no allowed registries items will permit the use of %s", name)
			return true
		}
	}

	ok = false
	// check blocked list
	if len(imgCfg.Spec.RegistrySources.BlockedRegistries) > 0 {
		for _, r := range imgCfg.Spec.RegistrySources.BlockedRegistries {
			logrus.Printf("considering blocked registry %s for %s", r, name)
			if strings.TrimSpace(r) == strings.TrimSpace(name) {
				logrus.Printf("the blocked registry %s prevents imagestream creation for search param %s", r, name)
				return true
			}
		}
		logrus.Printf("no blocked registries items will prevent the use of %s", name)
	}

	logrus.Printf("no global imagestream configuration will block imagestream creation using %s", name)

	return false
}

func (h *Handler) tbrInaccessible() bool {
	if h.configclient == nil {
		// unit test environment
@@ -338,17 +425,34 @@ func (h *Handler) tbrInaccessible() bool {
		logrus.Print("registry.redhat.io does not support ipv6, bootstrap to removed")
		return true
	}
	if h.imageConfigBlocksImageStreamCreation("registry.redhat.io") ||
		h.imageConfigBlocksImageStreamCreation("registry.access.redhat.io") ||
		h.imageConfigBlocksImageStreamCreation("quay.io") {
		return true
	}

	proxy := &configv1.Proxy{}
	var err error
	err = wait.PollImmediate(5*time.Second, 20*time.Second, func() (bool, error) {
		// if a proxy is in play, the registry.redhat.io connection attempt during startup is problematic at best;
		// assume tbr is accessible since a proxy implies external access, and not disconnected
		proxy, err = h.configclient.Proxies().Get(context.TODO(), "cluster", metav1.GetOptions{})
		if err != nil {
			logrus.Printf("unable to retrieve proxy configuration as part of testing registry.redhat.io connectivity: %s", err.Error())
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		// just logging for reference; we will move on to the registry connection tests
		logrus.Printf("was unable to get proxy instance within a reasonable retry window")
	}

	if len(proxy.Status.HTTPSProxy) > 0 || len(proxy.Status.HTTPProxy) > 0 {
		logrus.Printf("with global proxy configured assuming registry.redhat.io is accessible, bootstrap to Managed")
		return false
	}

	err = wait.PollImmediate(20*time.Second, 5*time.Minute, func() (bool, error) {
		// we have seen cases in the field with disconnected cluster where the default connection timeout can be
		// very long (15 minutes in one case); so we do an initial non-tls connection where we can specify a quicker
@@ -642,7 +746,7 @@ func (h *Handler) Handle(event util.Event) error {
		// here, we get started with the delete, which will ultimately reset the conditions
		// and samples, when it is false again, regardless of whether the imagestream imports are done
		if h.upsertInProgress {
			return errors.New("A delete attempt has come in while creating samples; initiating retry; creation loop should abort soon")
		}

		// nuke any registered upserts
@@ -911,12 +1015,16 @@ func (h *Handler) Handle(event util.Event) error {
	}

	if err != nil {
		if !ShouldNotGoDegraded(err) {
			h.processError(cfg, v1.SamplesExist, corev1.ConditionUnknown, err, "error creating samples: %v")
			dbg := "setting samples exists to unknown"
			logrus.Printf("CRDUPDATE %s", dbg)
			e := h.crdwrapper.UpdateStatus(cfg, dbg)
			if e != nil {
				return e
			}
		} else {
			logrus.Printf("CRDUPDATE error on createSamples but retryable, not marking degraded: %s", err.Error())
		}
		return err
	}
13 changes: 9 additions & 4 deletions pkg/stub/imagestreams.go
@@ -103,6 +103,10 @@ func (h *Handler) processImageStreamWatchEvent(is *imagev1.ImageStream, deleted
			// it the main loop created, we will let it set the image change reason field
			return nil
		}
		if ShouldNotGoDegraded(err) {
			logrus.Printf("CRDUPDATE: retryable error %s with imagestream update %s", err.Error(), imagestream.Name)
			return err
		}
		cfg = h.refetchCfgMinimizeConflicts(cfg)
		h.processError(cfg, v1.SamplesExist, corev1.ConditionUnknown, err, "%v error replacing imagestream %s", imagestream.Name)
		dbg := fmt.Sprintf("CRDUPDATE event img update err bad api obj update %s", imagestream.Name)
@@ -171,6 +175,10 @@ func (h *Handler) upsertImageStream(imagestreamInOperatorImage, imagestreamInClu
			// return the error so the caller can decide what to do
			return err
		}
		if ShouldNotGoDegraded(err) {
			logrus.Printf("CRDUPDATE: retryable error %s with imagestream update %s", err.Error(), imagestreamInOperatorImage.Name)
			return err
		}
		return h.processError(opcfg, v1.SamplesExist, corev1.ConditionUnknown, err, "imagestream create error: %v")
	}
	logrus.Printf("created imagestream %s", imagestreamInOperatorImage.Name)
@@ -201,10 +209,7 @@
	imagestreamInOperatorImage.ResourceVersion = imagestreamInCluster.ResourceVersion
	_, err = h.imageclientwrapper.Update(imagestreamInOperatorImage)
	if err != nil {
		if ShouldNotGoDegraded(err) {
			return err
		}
		return h.processError(opcfg, v1.SamplesExist, corev1.ConditionUnknown, err, "imagestream update error: %v")
13 changes: 9 additions & 4 deletions pkg/stub/templates.go
@@ -59,6 +59,10 @@ func (h *Handler) processTemplateWatchEvent(t *templatev1.Template, deleted bool
		if kerrors.IsAlreadyExists(err) {
			return nil
		}
		if ShouldNotGoDegraded(err) {
			logrus.Printf("CRDUPDATE: retryable error %s with template update %s", err.Error(), template.Name)
			return err
		}
		cfg = h.refetchCfgMinimizeConflicts(cfg)
		h.processError(cfg, v1.SamplesExist, corev1.ConditionUnknown, err, "%v error replacing template %s", template.Name)
		dbg := "event temp update err bad api obj update"
@@ -100,6 +104,10 @@ func (h *Handler) upsertTemplate(templateInOperatorImage, templateInCluster *tem
			// return the error so the caller can decide what to do
			return err
		}
		if ShouldNotGoDegraded(err) {
			logrus.Printf("CRDUPDATE: retryable error %s with template update %s", err.Error(), templateInOperatorImage.Name)
			return err
		}
		return h.processError(opcfg, v1.SamplesExist, corev1.ConditionUnknown, err, "template create error: %v")
	}
	logrus.Printf("created template %s", templateInOperatorImage.Name)
@@ -109,10 +117,7 @@ func (h *Handler) upsertTemplate(templateInOperatorImage, templateInCluster *tem
	templateInOperatorImage.ResourceVersion = templateInCluster.ResourceVersion
	_, err := h.templateclientwrapper.Update(templateInOperatorImage)
	if err != nil {
		if ShouldNotGoDegraded(err) {
			return err
		}
		return h.processError(opcfg, v1.SamplesExist, corev1.ConditionUnknown, err, "template update error: %v")