operator-sdk run bundle cannot pull private bundle image #4650

Closed
bitscuit opened this issue Mar 12, 2021 · 13 comments · Fixed by #4703
Labels
kind/feature Categorizes issue or PR as related to a new feature. olm-integration Issue relates to the OLM integration priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

@bitscuit

Bug Report

What did you do?

operator-sdk run bundle hyc-cloud-private-scratch-docker-local.artifactory.swg-devops.com/ibmcom/common-service-operator-bundle:3.7.1

What did you expect to see?

CatalogSource pod created by run bundle should be in Running state, and operator installed

What did you see instead? Under which circumstances?

CatalogSource pod is in CrashLoopBackOff state

Output of run bundle

INFO[0011] Successfully created registry pod: tory-swg-devops-com-ibmcom-common-service-operator-bundle-3-7-1 
INFO[0011] Created CatalogSource: ibm-common-service-operator-catalog 
INFO[0011] OperatorGroup "operator-sdk-og" created      
INFO[0011] Created Subscription: ibm-common-service-operator-v3-7-1-sub 
FATA[0120] Failed to run bundle: install plan is not available for the subscription ibm-common-service-operator-v3-7-1-sub: timed out waiting for the condition 
Makefile:200: recipe for target 'run-bundle' failed
make: *** [run-bundle] Error 1

Logs from CatalogSource pod

time="2021-03-12T21:37:05Z" level=info msg="adding to the registry" bundles="[hyc-cloud-private-scratch-docker-local.artifactory.swg-devops.com/ibmcom/common-service-operator-bundle:3.7.1]"
time="2021-03-12T21:37:06Z" level=error msg="permissive mode disabled" bundles="[hyc-cloud-private-scratch-docker-local.artifactory.swg-devops.com/ibmcom/common-service-operator-bundle:3.7.1]" error="[error resolving name : failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized, image \"hyc-cloud-private-scratch-docker-local.artifactory.swg-devops.com/ibmcom/common-service-operator-bundle:3.7.1\": not found]"
Error: [error resolving name : failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized, image "hyc-cloud-private-scratch-docker-local.artifactory.swg-devops.com/ibmcom/common-service-operator-bundle:3.7.1": not found]
Usage:
  opm registry add [flags]

Flags:
  -b, --bundle-images strings   comma separated list of links to bundle image
  -c, --container-tool string   tool to interact with container images (save, build, etc.). One of: [none, docker, podman] (default "none")
  -d, --database string         relative path to database file (default "bundles.db")
      --debug                   enable debug logging
  -h, --help                    help for add
      --mode string             graph update mode that defines how channel graphs are updated. One of: [replaces, semver, semver-skippatch] (default "replaces")
      --permissive              allow registry load errors

Global Flags:
      --skip-tls   skip TLS certificate verification for container image registries while pulling bundles or index

Environment

Operator type:

/language go

Kubernetes cluster type:

OpenShift 4.6.16

$ operator-sdk version

operator-sdk version: "v1.3.0", commit: "1abf57985b43bf6a59dcd18147b3c574fa57d3f6", kubernetes version: "1.19.4", go version: "go1.15.5", GOOS: "linux", GOARCH: "amd64"

Also tried with operator-sdk version 1.5.0

$ go version (if language is Go)

go version go1.15.5 linux/amd64

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0+e49167a", GitCommit:"e49167aad6a08046be6ab21ff13029110c76951d", GitTreeState:"clean", BuildDate:"2021-01-28T07:35:27Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

CatalogSource pod should additionally use global cluster pull secret

Additional context

A few things I can confirm:

  1. Image does exist as I can successfully pull it locally
  2. Cluster global pull secret is configured for that private registry
  3. Cluster is able to pull images from private registry since I successfully deployed Deployment using another image

I tried adding my private registry credentials to the imagePullSecrets the pod was using, but got the same result:

  1. added credentials to pod's imagePullSecrets secret
  2. ran operator-sdk cleanup ibm-common-service-operator
  3. re-ran run bundle command
  4. pod in CrashLoopBackOff (confirmed that pod has the same imagePullSecrets)

Found a related bugzilla report https://bugzilla.redhat.com/show_bug.cgi?id=1883198

Also, potentially another bug; the pod name was tory-swg-devops-com-ibmcom-common-service-operator-bundle-3-7-1. Seems like there is a character limit to the name that does not match Kubernetes' character limit for names.
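The truncated pod name above is exactly 63 characters long, which matches the DNS label limit rather than the 253-character DNS subdomain limit that Kubernetes allows for pod names. The following is a minimal sketch that reproduces the observed name from the image reference. It is not the operator-sdk implementation: `pod_name_for` is a hypothetical helper, and the rule "sanitize, then keep the last 63 characters" is only inferred from the output shown in this report.

```shell
#!/bin/sh
# Hypothetical helper, NOT operator-sdk code: an inferred reconstruction of
# how the registry pod name appears to be derived from the bundle image.
pod_name_for() {
  # 1. lower-case the image reference
  # 2. replace every character outside [a-z0-9] with '-'
  # 3. keep only the last 63 characters (DNS label limit)
  # 4. strip leading/trailing '-' so the name starts and ends alphanumeric
  printf '%s' "$1" \
    | tr 'A-Z' 'a-z' \
    | tr -c 'a-z0-9' '-' \
    | tail -c 63 \
    | sed -e 's/^-*//' -e 's/-*$//'
}

pod_name_for "hyc-cloud-private-scratch-docker-local.artifactory.swg-devops.com/ibmcom/common-service-operator-bundle:3.7.1"
# prints: tory-swg-devops-com-ibmcom-common-service-operator-bundle-3-7-1
```

Under these assumptions, the sanitized reference is 109 characters long, so the first 46 characters (up to and including "artifac") are dropped, which explains why the name starts with "tory-".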

@openshift-ci-robot openshift-ci-robot added the language/go Issue is related to a Go operator project label Mar 12, 2021
@estroz
Member

estroz commented Mar 12, 2021

/kind feature

@openshift-ci-robot openshift-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 12, 2021
@estroz estroz modified the milestone: v1.5.0 Mar 12, 2021
@jberkhahn jberkhahn added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Mar 15, 2021
@jberkhahn jberkhahn added this to the v1.7.0 milestone Mar 15, 2021
@jmrodri jmrodri self-assigned this Mar 15, 2021
@titou10titou10

titou10titou10 commented Mar 23, 2021

Same problem for us, i.e. "opm registry add" does not trust the CA set up in the cluster and on the underlying nodes, even though the steps are slightly different.

We are trying to go through the "quickstart for go" and it fails when executing this step (registry URL replaced by xxx):

export BUNDLE_IMG="xxxxxx.xxxx.xxx.xxxxx.xxxxx/xxxxx/memcached-operator-bundle:v0.0.1"
operator-sdk run bundle $BUNDLE_IMG

The image registry uses a certificate issued by the corporate CA. The cluster is configured with it, and all the images running in the cluster are pulled from the registry without any problem.

Versions:

minikube (also fails on microk8s)
operator-sdk version: "v1.4.2", commit: "4b083393be65589358b3e0416573df04f4ae8d9b", kubernetes version: "1.19.4", go version: "go1.15.5", GOOS: "linux", GOARCH: "amd64"
go version go1.15.10 linux/amd64
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1-5-g76a04fc", GitCommit:"c66c03f3012a10f16eb86fdce6330433adf6c9ee", GitTreeState:"clean", BuildDate:"2021-02-13T03:54:59Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

The errors are not in the "CatalogSource" pod as described in the initial post, but in a pod named xxx-xxxx-xxx-xxxx-xxxx-xxxx-memcached-operator-bundle-v0-0-1 created by "operator-sdk run bundle":

time="2021-03-23T00:55:50Z" level=info msg="adding to the registry" bundles="[xxxxxx.xxxx.xxx.xxxxx.xxxxx/xxxxx/memcached-operator-bundle:v0.0.1]"
time="2021-03-23T00:55:50Z" level=error msg="permissive mode disabled" bundles="[xxxxxx.xxxx.xxx.xxxxx.xxxxx/xxxxx/memcached-operator-bundle:v0.0.1]" error="[error resolving name : failed to do request: Head https://xxxxxx.xxxx.xxx.xxxxx.xxxx/v2/denis/memcached-operator-bundle/manifests/v0.0.1: x509: certificate signed by unknown authority, image \"xxxxxx.xxxx.xxx.xxxxx.xxxxx/xxxxx/memcached-operator-bundle:v0.0.1\": not found]"
Error: [error resolving name : failed to do request: Head https://xxxxxx.xxxx.xxx.xxxxx.xxxxx/v2/xxxxx/memcached-operator-bundle/manifests/v0.0.1: x509: certificate signed by unknown authority, image "xxxxxx.xxxx.xxx.xxxxx.xxxxx/xxxxx/memcached-operator-bundle:v0.0.1": not found]
Usage:
  opm registry add [flags]

Flags:
  -b, --bundle-images strings   comma separated list of links to bundle image
  -c, --container-tool string   tool to interact with container images (save, build, etc.). One of: [none, docker, podman] (default "none")
  -d, --database string         relative path to database file (default "bundles.db")
      --debug                   enable debug logging
  -h, --help                    help for add
      --mode string             graph update mode that defines how channel graphs are updated. One of: [replaces, semver, semver-skippatch] (default "replaces")
      --permissive              allow registry load errors

Global Flags:
      --skip-tls   skip TLS certificate verification for container image registries while pulling bundles or index

Q:

  • Is there a workaround for this problem?
  • Is there a way to configure the "opm registry add ..." command running in the generated pod to use the CA of the cluster, or to use some ConfigMap/Secret?

@jmrodri
Member

jmrodri commented Mar 23, 2021

@titou10titou10 unfortunately not at the moment. The only workaround is to deploy the operator manually: https://olm.operatorframework.io/docs/tasks/make-operator-part-of-catalog/

@titou10titou10

titou10titou10 commented Mar 23, 2021

@jmrodri OK.

IMHO the documentation should state that you can't use the "operator-sdk run bundle" command from the quickstart and the tutorial if your private image registry uses a private/corporate CA, and that only the manual way works in this case.

Also, the command reference for "operator-sdk run bundle" (ref here) should indicate that this command does not currently support an in-house CA, since the fix targeted for operator-sdk v1.7 is still quite far off.

Thx

@estroz
Member

estroz commented Mar 23, 2021

I'm going to take this on. Will update docs prior to the next release so at least readers are aware of this issue, if the feature doesn't land in v1.6.

@estroz
Member

estroz commented Mar 23, 2021

Also this is a scorecard issue too I believe. You can configure a service account with image pull secrets, then specify that service account with operator-sdk scorecard ./bundle --service-account <sa-name>.
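As a rough sketch of the scorecard workaround described above: create a service account, attach an existing pull secret to it, then pass it to scorecard. The names `scorecard-sa` and `reg`, the `run` wrapper, and the `DRY_RUN` variable are placeholders introduced here for illustration. The sketch assumes the docker-registry secret already exists in the current namespace.

```shell
#!/bin/sh
# Sketch only. "run" is a hypothetical wrapper: with DRY_RUN set it prints
# each command instead of executing it, so the steps can be reviewed first.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# setup_scorecard_sa <service-account-name> <pull-secret-name>
setup_scorecard_sa() {
  sa_name="$1"      # service account the scorecard pods will run as
  secret_name="$2"  # existing docker-registry secret in the same namespace

  run kubectl create serviceaccount "$sa_name"
  # Attach the pull secret so pods using this service account can pull
  # images from the private registry.
  run kubectl patch serviceaccount "$sa_name" \
    -p "{\"imagePullSecrets\":[{\"name\":\"$secret_name\"}]}"
  run operator-sdk scorecard ./bundle --service-account "$sa_name"
}
```

For example, `DRY_RUN=1; setup_scorecard_sa scorecard-sa reg` prints the three commands without touching the cluster.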

@estroz estroz added olm-integration Issue relates to the OLM integration scorecard Issue relates to the scorecard subcomponent and removed language/go Issue is related to a Go operator project labels Mar 23, 2021
@estroz estroz assigned estroz and unassigned jmrodri and rashmigottipati Mar 23, 2021
@estroz
Member

estroz commented Mar 24, 2021

@bitscuit @titou10titou10 #4694 implements this feature. Let me know if the options provided are sufficient to pull private bundles.

@bitscuit
Author

thanks @estroz, they look good to me.

@titou10titou10

titou10titou10 commented Mar 25, 2021

@estroz I don't think it solves the problem. AFAIK there is no way to set a CA inside an "imagePullSecret"; you can only set the credentials (user/password/email).

The pod has no problem pulling its own image to run; that pull uses the CA set at the cluster level. The problem is the command ("opm registry add") that runs inside the pod: it performs its own image pull, is not aware of the registry CA, and there is no way to configure it to use the correct CA.
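The two failures reported in this thread look alike from the outside (registry pod in CrashLoopBackOff) but have different causes, and the "opm registry add" log tells them apart. A small sketch, assuming only the two error strings quoted in this issue; `classify_opm_error` and its category labels are hypothetical names, not part of opm or operator-sdk:

```shell
#!/bin/sh
# Hypothetical helper: map an "opm registry add" error line to a likely cause,
# based solely on the two messages quoted in this thread.
classify_opm_error() {
  case "$1" in
    *"x509: certificate signed by unknown authority"*)
      # The registry's CA is not trusted inside the pod.
      echo "untrusted-ca" ;;
    *"401 Unauthorized"*)
      # opm attempted an anonymous pull from a registry that needs credentials.
      echo "missing-credentials" ;;
    *)
      echo "unknown" ;;
  esac
}
```

A line from the registry pod's logs can be passed to it directly, for example `classify_opm_error "$(kubectl logs <registry-pod> | grep '^Error:')"`, where `<registry-pod>` is the pod created by run bundle.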

To solve the problem, we should be able to pass a file with the CA cert to the "opm registry add" command via the "operator-sdk run bundle" command.

"opm registry add" already has a "--skip-tls" option, but there is no way to set it via the "operator-sdk run bundle" command, and AFAIK it makes opm access the registry over HTTP instead of HTTPS, so it would not solve the problem here.

The "operator-sdk run bundle" command should be updated to optionally accept a new option ("--opm-ca-cert=<ca cert file>"?) that specifies the CA cert file. The "opm registry" command must also be changed to use that file.

Another option would be to change the "opm registry" command to use the cluster CA directly.

As for the "--skip-tls" option of the "opm registry" command, it should also be added to "operator-sdk run bundle" and passed to the underlying command, but it will not solve my problem here.

@estroz estroz removed the scorecard Issue relates to the scorecard subcomponent label Mar 25, 2021
@estroz
Member

estroz commented Mar 25, 2021

@titou10titou10 opm registry add will use the cert pool present in the pod when pulling an image, which can be configured with the cluster CA.

Perhaps allowing a cert to be passed to opm registry add would be convenient. operator-framework/operator-registry#611 needs to be merged first.

@titou10titou10

titou10titou10 commented Mar 26, 2021

@estroz I've seen the code change you've made to the "opm registry add" command. Great, thanks!

Now, what will the parameters of the "operator-sdk run bundle" command be to:

  • tell the "opm registry add" command to use a specific remote registry CA?
  • run the "opm registry add" command with the "--skip-tls" parameter?

Thx

@estroz
Member

estroz commented Mar 26, 2021

@titou10titou10 if it's desirable to skip TLS, then I don't see a problem exposing --skip-tls from opm registry add in operator-sdk run bundle. To use a specific remote registry CA, you'd first create a secret containing the root certificate of that registry, then set that secret's name with --ca-secret-name:

kubectl create secret generic tls-ca --from-file=cert.pem=/path/to/registry/cert.pem
operator-sdk run bundle public-ca-reg.com/my-bundle:v0.1.0 --ca-secret-name=tls-ca

If a pull secret is also required, i.e. for a private registry, you'd run

kubectl create secret generic tls-ca --from-file=cert.pem=/path/to/registry/cert.pem
kubectl create secret docker-registry reg --docker-username=foo --docker-password=bar123 --docker-server=https://private-ca-reg.com
kubectl patch serviceaccount default -p '{"imagePullSecrets":[{"name":"reg"}]}'
operator-sdk run bundle private-ca-reg.com/my-bundle:v0.1.0 --secret-name=reg --ca-secret-name=tls-ca

Ideally the cluster, namespace, and service account used will have been provisioned with the appropriate secrets beforehand, so the above just becomes

operator-sdk run bundle private-ca-reg.com/my-bundle:v0.1.0 ...

Perhaps renaming --secret-name to --pull-secret-name would clarify flag meanings. Thoughts?

@titou10titou10

@estroz excellent
If you want my opinion on the parameter names: since those parameters relate not to the "operator-sdk run bundle" command itself but to the underlying "opm registry add" command, I would prefix them with "opm", i.e.

--opm-ca-secret-name
--opm-pull-secret-name
--opm-skip-tls

Just my 2 cents.
