AtlasDatabaseUser - message - unable to list: test because of unknown namespace for the cache #1515

Closed
qtranton opened this issue Apr 16, 2024 · 20 comments · Fixed by #1621

Comments

@qtranton

I had operator version 1.7.1 running in a cluster and decided to upgrade it to the latest version.
To reproduce, I created a local environment:

  1. Kubernetes 1.25.4 via Docker Desktop
  2. Operator v1.7.1
  3. Add an AtlasDeployment and an AtlasDatabaseUser
  4. Upgrade to v2.2.0 (helm upgrade of the CRDs, then of the operator)
  5. Fix the AtlasDeployment
  6. Check the operator logs, which show this error:

{"level":"INFO","time":"2024-04-16T12:12:14.543Z","msg":"Status update","atlasdatabaseuser":"test/operator-upgrade-test","lastCondition":{"type":"DatabaseUserReady","status":"False","lastTransitionTime":null,"reason":"DatabaseUserStaleConnectionSecrets","message":"unable to list: test because of unknown namespace for the cache"}}

What did you expect?
After all these steps, the operator should just work as expected.

What happened instead?
The AtlasDatabaseUser status is always False.

Operator Information

  • 1.7.1 -> 2.2.0

Kubernetes Cluster Information

  • Docker Desktop
  • 1.25.4

Additional context
I am trying to figure out why the AtlasDatabaseUser resource fails.
The operator creates the correct Secrets and creates the users in the Atlas UI, but the resource itself always stays in the Ready: False state.

status:
    conditions:
    - lastTransitionTime: "2024-04-16T12:03:17Z"
      status: "False"
      type: Ready
    - lastTransitionTime: "2024-04-16T11:44:08Z"
      status: "True"
      type: ResourceVersionIsValid
    - lastTransitionTime: "2024-04-16T11:44:08Z"
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2024-04-16T12:03:18Z"
      message: 'unable to list: test because of unknown namespace for the cache'
      reason: DatabaseUserStaleConnectionSecrets
      status: "False"
      type: DatabaseUserReady

Operator logs:

{"level":"DEBUG","time":"2024-04-16T12:17:12.709Z","msg":"Ensured connection Secret up-to-date","atlasdatabaseuser":"test/operator-upgrade-test","secretname":"HIDDEN"}
{"level":"INFO","time":"2024-04-16T12:17:12.709Z","msg":"Status update","atlasdatabaseuser":"test/operator-upgrade-test-","lastCondition":{"type":"DatabaseUserReady","status":"False","lastTransitionTime":null,"reason":"DatabaseUserStaleConnectionSecrets","message":"unable to list: test because of unknown namespace for the cache"}}

@josvazg
Collaborator

josvazg commented Apr 17, 2024

Thanks for reporting this issue @qtranton !

Could you give us a minimal YAML sample we could use to reproduce the issue?
It does not need to be your original complete setup, just the definitions that reproduce the same failure.

@qtranton
Author

Sure, here is my YAML, cleaned up:

apiVersion: v1
kind: Secret
metadata:
  labels:
    app: operator-upgrade
    atlas.mongodb.com/type: credentials
    env: dev
  name: operator-upgrade-test
  namespace: test
stringData:
  password: testpassword


---
# Source: app-resources/templates/mongodb_atlas.yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasBackupPolicy
metadata:
  name: operator-upgrade-test
  namespace: test
  annotations:
    mongodb.com/atlas-resource-policy: "keep"
spec:
  items: 
    - frequencyInterval: 12
      frequencyType: hourly
      retentionUnit: days
      retentionValue: 1
    - frequencyInterval: 1
      frequencyType: daily
      retentionUnit: days
      retentionValue: 7
    - frequencyInterval: 6
      frequencyType: weekly
      retentionUnit: weeks
      retentionValue: 1
    - frequencyInterval: 40
      frequencyType: monthly
      retentionUnit: months
      retentionValue: 1
---
# Source: app-resources/templates/mongodb_atlas.yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasBackupSchedule
metadata:
  name: operator-upgrade-test
  namespace: test
  annotations:
    mongodb.com/atlas-resource-policy: "keep"
spec:
  autoExportEnabled: false
  referenceHourOfDay: 21
  referenceMinuteOfHour: 2
  policy:
    name: operator-upgrade-test
    namespace: test
---
# Source: app-resources/templates/mongodb_atlas.yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: operator-upgrade-test
  labels:
    app: "operator-upgrade"
    env: dev
  #   mongodb.com/atlas-resource-policy: "keep"
spec:
  roles:
  - roleName: readWrite
    databaseName: Application
  scopes:
  - type: CLUSTER
    name: operator-upgrade-test
  projectRef:
    name: project-name
    namespace: mongodb-operator
  username: operator-upgrade-test
  databaseName: admin
  passwordSecretRef:
    name: "operator-upgrade-test"

---
# Source: app-resources/templates/mongodb_atlas.yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasDeployment
metadata:
  name: operator-upgrade-test
  namespace: test
  labels:
    app: "operator-upgrade"
    env: dev
  # annotations:
  #   mongodb.com/atlas-resource-policy: "keep"
spec:
  backupRef:
    name: operator-upgrade-test
    namespace: test
  projectRef:
    name: project-name
    namespace: mongodb-operator
  advancedDeploymentSpec:
    mongoDBMajorVersion: "6.0"
    clusterType: REPLICASET
    backupEnabled: true
    pitEnabled: false
    name: operator-upgrade-test
    replicationSpecs:
      - regionConfigs:
        - electableSpecs:
              instanceSize: M10
              nodeCount: 3
          providerName: GCP
          backingProviderName: GCP
          regionName: "EASTERN_US"
          # Priority description https://www.mongodb.com/docs/atlas/reference/atlas-operator/atlasdeployment-custom-resource/#mongodb-setting-spec.advancedDeploymentSpec.replicationSpecs.regionConfigs.priority
          priority: 7
          autoScaling:
            compute:
              enabled: false

@s-urbaniak
Collaborator

cc @roothorp

@s-urbaniak
Collaborator

s-urbaniak commented Apr 26, 2024

@qtranton Can you check whether you happen to have the WATCH_NAMESPACE environment variable set for your operator deployment? For example, could you post the output of kubectl -n <operator_namespace> get pod <operator_name> here?

@s-urbaniak
Collaborator

That is, it looks like the operator is not watching the test namespace, presumably because the WATCH_NAMESPACE env variable overrides the set of watched namespaces.
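The error text appears to match the one controller-runtime's multi-namespace cache returns when asked to list objects in a namespace it was not configured for. The sketch below is a hypothetical, heavily simplified model of that behavior (the type `namespacedCache` and its methods are invented for illustration, not the library's API): a cache that only knows the operator's own namespace rejects a list call for `test`.

```go
package main

import "fmt"

// namespacedCache is a hypothetical, heavily simplified model of a cache that
// is configured for an explicit set of namespaces. It is not the
// controller-runtime implementation; it only reproduces the shape of the
// error seen in this issue.
type namespacedCache struct {
	// objects maps a known namespace to the object names cached for it.
	objects map[string][]string
}

// newNamespacedCache returns a cache that knows only the given namespaces.
func newNamespacedCache(namespaces ...string) *namespacedCache {
	c := &namespacedCache{objects: map[string][]string{}}
	for _, ns := range namespaces {
		c.objects[ns] = []string{}
	}
	return c
}

// list returns the cached object names for a namespace, or an error when the
// cache was never configured for that namespace.
func (c *namespacedCache) list(namespace string) ([]string, error) {
	names, ok := c.objects[namespace]
	if !ok {
		return nil, fmt.Errorf("unable to list: %v because of unknown namespace for the cache", namespace)
	}
	return names, nil
}

func main() {
	// Only the operator's own namespace is known to this cache.
	cache := newNamespacedCache("mongodb-operator")

	// Listing Secrets for a user in the "test" namespace fails.
	if _, err := cache.list("test"); err != nil {
		fmt.Println(err)
	}
}
```

In this model the fix is simply to make the cache aware of every namespace that holds AtlasDatabaseUser resources (or to make it cluster-wide), which is what the `watchNamespaces` workaround later in the thread does from the outside.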

@qtranton
Author

In the Helm chart I see this:

{{- if .Values.watchNamespaces }}
          - name: WATCH_NAMESPACE
            value: "{{ join "," .Values.watchNamespaces }}"
          {{- end }}
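The template renders `watchNamespaces` as a single comma-separated string. A minimal sketch of how an operator might read it back, assuming (not confirmed by the chart) that an unset or empty value means "watch all namespaces"; `watchedNamespaces` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// watchedNamespaces is a hypothetical helper that turns a comma-separated
// WATCH_NAMESPACE value (as rendered by the chart's `join ","`) into a
// de-duplicated list of namespace names. Whitespace around each entry is
// trimmed and empty entries are skipped.
func watchedNamespaces(raw string) []string {
	seen := map[string]bool{}
	var out []string
	for _, part := range strings.Split(raw, ",") {
		ns := strings.TrimSpace(part)
		if ns == "" || seen[ns] {
			continue
		}
		seen[ns] = true
		out = append(out, ns)
	}
	return out
}

func main() {
	namespaces := watchedNamespaces(os.Getenv("WATCH_NAMESPACE"))
	if len(namespaces) == 0 {
		// Assumption: no explicit list means cluster-wide watching.
		fmt.Println("WATCH_NAMESPACE unset or empty: watching all namespaces")
		return
	}
	fmt.Println("watching:", strings.Join(namespaces, ", "))
}
```

Trimming each entry right where the variable is read keeps values like `"test, mongodb-operator"` from producing a namespace name with a leading space.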

So I checked the pod:

 Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      OPERATOR_POD_NAME:   mongodb-atlas-operator-5df9ff6978-tqznx (v1:metadata.name)
      OPERATOR_NAMESPACE:  mongodb-operator (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4kgq7 (ro)

@qtranton
Author

qtranton commented Apr 29, 2024

The roles also mention this variable, but since it is empty, no additional roles were created:

mongodb-operator   mongodb-atlas-operator                           
mongodb-operator   mongodb-atlas-operator-leader-election-role      

Plus, it works on the older version, so I guess the older version could read the Secrets.

@qtranton
Author

I validated the Secrets as well.
When I remove the label

atlas.mongodb.com/type: credentials

I get this error:

"msg":"Status update","atlasdatabaseuser":"tester/operator-upgrade-test","lastCondition":{"type":"DatabaseUserReady","status":"False","lastTransitionTime":null,"reason":"InternalError","message":"Secret \"operator-upgrade-test\" not found"}}

With the label back in place, I get:

"msg":"Status update","atlasdatabaseuser":"tester/operator-upgrade-test","lastCondition":{"type":"DatabaseUserReady","status":"False","lastTransitionTime":null,"reason":"DatabaseUserStaleConnectionSecrets","message":"unable to list: tester because of unknown namespace for the cache"}}
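One plausible reading of this experiment, assuming (unverified here) that the operator's Secret cache only admits Secrets labeled `atlas.mongodb.com/type: credentials`: without the label the password Secret is invisible and the lookup reports "not found"; with the label it is found, and the failure moves on to the later namespace-scoped list. The sketch models only the label filter; `labelFilteredCache` and the other names are hypothetical:

```go
package main

import "fmt"

// Label key and value carried by the credentials Secret in this issue.
const (
	credentialsLabelKey   = "atlas.mongodb.com/type"
	credentialsLabelValue = "credentials"
)

// secret is a minimal stand-in for a Kubernetes Secret.
type secret struct {
	namespace, name string
	labels          map[string]string
}

// labelFilteredCache is a hypothetical model of a cache that only admits
// Secrets carrying the credentials label, keyed by "namespace/name".
type labelFilteredCache map[string]secret

// newLabelFilteredCache keeps only the labeled Secrets from the cluster.
func newLabelFilteredCache(cluster []secret) labelFilteredCache {
	c := labelFilteredCache{}
	for _, s := range cluster {
		if s.labels[credentialsLabelKey] == credentialsLabelValue {
			c[s.namespace+"/"+s.name] = s
		}
	}
	return c
}

// get looks a Secret up in the cache only; unlabeled Secrets are "not found"
// even though they exist in the cluster.
func (c labelFilteredCache) get(namespace, name string) (secret, error) {
	s, ok := c[namespace+"/"+name]
	if !ok {
		return secret{}, fmt.Errorf("secret %q not found", name)
	}
	return s, nil
}

func main() {
	cluster := []secret{
		{namespace: "tester", name: "operator-upgrade-test"}, // label removed
	}
	cache := newLabelFilteredCache(cluster)
	if _, err := cache.get("tester", "operator-upgrade-test"); err != nil {
		fmt.Println(err) // secret "operator-upgrade-test" not found
	}
}
```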

@qtranton
Author

qtranton commented May 27, 2024

@josvazg @s-urbaniak Hey, I had some time to debug the issue. On my local cluster, for some reason, version 2.2.2 does not set the username in the status.
I just put a lot of Println calls in a local branch :D and got this output:

    #############################
    operator-upgrade-test
    cleanupStaleSecrets: Failed to list connection Secrets 
    ############################# 

printed by this code:

if user.Status.UserName != user.Spec.Username {
	// Note that we pass the username from the status, not from the spec
	fmt.Println("#############################")
	fmt.Println(user.Status.UserName, user.Spec.Username)
	fmt.Println("cleanupStaleSecrets: Failed to list connection Secrets")
	fmt.Println("#############################")
	return RemoveStaleSecretsByUserName(ctx.Context, k8sClient, projectID, user.Status.UserName, user, ctx.Log)
}

which I added here:
https://github.com/mongodb/mongodb-atlas-kubernetes/blob/main/pkg/controller/connectionsecret/connectionsecrets.go#L126
Now I am trying to figure out why I get a Secret-related error when the username is not set.
Meanwhile, the resource status looks like this:

status:
    conditions:
    - lastTransitionTime: "2024-05-27T11:30:38Z"
      status: "False"
      type: Ready
    - lastTransitionTime: "2024-05-27T11:30:38Z"
      status: "True"
      type: ResourceVersionIsValid
    - lastTransitionTime: "2024-05-27T11:30:38Z"
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2024-05-27T11:30:39Z"
      message: 'unable to list: tester because of unknown namespace for the cache'
      reason: DatabaseUserStaleConnectionSecrets
      status: "False"
      type: DatabaseUserReady
    observedGeneration: 1
    passwordVersion: "3017702"
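The branch in the debug snippet compares the status username with the spec username. A minimal sketch of that comparison in isolation (`userNames` and `staleCleanupUser` are hypothetical names) shows that an empty status username always takes the cleanup path, targeting the empty username. If the status username is only written after this step succeeds (an assumption, not confirmed in the thread), a failing list keeps the resource stuck in this state on every reconcile:

```go
package main

import "fmt"

// userNames holds the two usernames compared in the debug snippet.
type userNames struct {
	spec, status string
}

// staleCleanupUser mirrors the snippet's branch: when the status username
// differs from the spec username, stale-secret cleanup runs for the status
// username. It returns that target and whether the branch is taken.
func staleCleanupUser(u userNames) (string, bool) {
	if u.status != u.spec {
		return u.status, true
	}
	return "", false
}

func main() {
	// On 2.2.2 the status username was observed to be empty.
	target, runs := staleCleanupUser(userNames{spec: "operator-upgrade-test", status: ""})
	fmt.Printf("cleanup runs: %v, target username: %q\n", runs, target)
	// prints: cleanup runs: true, target username: ""
}
```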

@qtranton
Author

Update: I rechecked on v1.7, and the username does appear in the status.

@josvazg
Collaborator

josvazg commented May 27, 2024

Thanks for your reports. I managed to reproduce the same. I am debugging it now.

@josvazg
Collaborator

josvazg commented May 27, 2024

It seems we found the issue; we are working on a fix.

In the meantime, you could pass the list of namespaces you want the operator to watch, e.g.:

helm install ... --set watchNamespaces=test,...

@qtranton
Author

@josvazg On my local machine, yes, but on the main cluster we have too many namespaces :) I will wait; it is not that critical.

@josvazg
Collaborator

josvazg commented May 29, 2024

BTW, #1619 already fixes the issue, but it includes unrelated refactors. I am working on a specific test to cover this bug, which our test suite did not previously detect.

josvazg added a commit that referenced this issue May 29, 2024
josvazg added a commit that referenced this issue May 29, 2024
Signed-off-by: jose.vazquez <jose.vazquez@mongodb.com>
josvazg added a commit that referenced this issue May 29, 2024
Signed-off-by: jose.vazquez <jose.vazquez@mongodb.com>
@qtranton
Author

I will check a local build then :)

@qtranton
Author

qtranton commented May 30, 2024

@josvazg FYI:

{"level":"ERROR","time":"2024-05-30T14:00:05.322Z","msg":"LeaderElectionID must be configuredunable to start operator"}

I get this error now.
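The message appears to match the error controller-runtime's leader-election setup returns when leader election is enabled but no `LeaderElectionID` is set, which would suggest (unverified) that the locally built binary started with leader election on and an empty ID. A simplified stand-in for that check; `managerOptions` and `validateLeaderElection` are hypothetical names that only mirror the option fields:

```go
package main

import (
	"errors"
	"fmt"
)

// managerOptions is a hypothetical, trimmed-down stand-in for the manager
// options an operator passes at startup.
type managerOptions struct {
	LeaderElection   bool
	LeaderElectionID string
}

// validateLeaderElection models the check that rejects leader election
// without an ID. It is a simplified stand-in, not the library code.
func validateLeaderElection(o managerOptions) error {
	if o.LeaderElection && o.LeaderElectionID == "" {
		return errors.New("LeaderElectionID must be configured")
	}
	return nil
}

func main() {
	if err := validateLeaderElection(managerOptions{LeaderElection: true}); err != nil {
		fmt.Println("unable to start operator:", err)
	}
}
```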

@josvazg
Collaborator

josvazg commented May 31, 2024

I do not think this is related. BTW, PR #1621 should fix the original issue.

As for this new error, do you have a sample to reproduce it?

@qtranton
Author

@josvazg I just built the image and used it in Helm chart 2.2.2; nothing else changed in the deployment.

@qtranton
Author

qtranton commented Jun 3, 2024

@josvazg After a few additional CRD changes (not upstream yet :D), the user status becomes True. We will do some additional tests on our infrastructure.
Do you know when it will be released?

@josvazg
Collaborator

josvazg commented Jun 3, 2024

We are aiming for a release soon, maybe this week. I should be merging PR #1621 tomorrow.

josvazg added a commit that referenced this issue Jun 3, 2024
* Add reproducing test

Signed-off-by: jose.vazquez <jose.vazquez@mongodb.com>

* Fix cache and predicate setup

* test/e2e/cache_watch_test.go: improve e2e test

* Fix gets labels and ns names

Signed-off-by: jose.vazquez <jose.vazquez@mongodb.com>

* Rename controller predicates helper

* Move trim to env reading line

---------

Signed-off-by: jose.vazquez <jose.vazquez@mongodb.com>
Co-authored-by: Sergiusz Urbaniak <sergiusz.urbaniak@gmail.com>