
Selector on watches.yaml not honoured #31

Open
anupchandak opened this issue Feb 14, 2023 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/support Indicates an issue that is a support question.

Comments

@anupchandak

anupchandak commented Feb 14, 2023

To control the scope of the operator in a multi-development environment, I have defined a selector at the watches.yaml level, referring to the documentation here.

The selector is defined roughly as below (shown with equivalent dummy values):

- version: v1
  group: mytest.com
  kind: MyKind
  snakeCaseParameters: False
  playbook: playbooks/create.yml
  finalizer:
    name: myTest.com/finalizer
    playbook: playbooks/purge.yml
  selector:
    matchExpressions:
      - key: mytest.com/controller-namespace
        operator: In
        values: 
          - "my-test-na"

When I start my ansible runner, I see the following log at startup, as expected:

{"level":"info","ts":1676367873.856818,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}

I expect that my operator will still not watch (reconcile) CRs defined with the label mytest.com/controller-namespace=your-test-na. But it does, and reconciles them.

It is an Ansible-based operator; environment details are below:

% ansible --version
/usr/local/lib/python3.9/site-packages/paramiko/transport.py:236: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
ansible [core 2.13.5]
  config file = /Users/anupchandak/ansible-profiler.cfg
  configured module search path = ['/Users/anupchandak/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /Users/anupchandak/Library/Python/3.9/lib/python/site-packages/ansible
  ansible collection location = /Users/anupchandak/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.9.14 (main, Sep  6 2022, 23:29:09) [Clang 13.1.6 (clang-1316.0.21.2.5)]
  jinja version = 3.1.2
  libyaml = True
@jberkhahn jberkhahn added the triage/support Indicates an issue that is a support question. label Feb 20, 2023
@varshaprasad96
Member

@jberkhahn The thread regarding this issue: https://mail.google.com/mail/u/0/#search/ansible/FMfcgzGrcXtllNJlqSFwVfvsJVrwzQjw

@anupchandak Could you please share your controller pod logs or the project, so that we are able to run it locally and check the issue? The selectors should be working as expected via the predicates created from them; looking at the logs may help us dig into it more.
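For reference, a minimal standalone sketch (not the operator-sdk code itself, just the controller-runtime API it builds on) of how a selector like the one in watches.yaml becomes a predicate and how that predicate filters events; the key and values below are the dummy ones from the report:

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func main() {
	// Build a predicate from the same selector shape as in watches.yaml.
	pred, err := predicate.LabelSelectorPredicate(metav1.LabelSelector{
		MatchExpressions: []metav1.LabelSelectorRequirement{{
			Key:      "mytest.com/controller-namespace",
			Operator: metav1.LabelSelectorOpIn,
			Values:   []string{"my-test-na"},
		}},
	})
	if err != nil {
		panic(err)
	}

	// Simulate Create events for a matching and a non-matching CR.
	matching := &unstructured.Unstructured{}
	matching.SetLabels(map[string]string{"mytest.com/controller-namespace": "my-test-na"})

	other := &unstructured.Unstructured{}
	other.SetLabels(map[string]string{"mytest.com/controller-namespace": "your-test-na"})

	fmt.Println(pred.Create(event.CreateEvent{Object: matching})) // true: event passes the filter
	fmt.Println(pred.Create(event.CreateEvent{Object: other}))    // false: event should be dropped
}

If the same kind of check inside the operator drops the non-matching object, the selector itself is fine and the question becomes where the predicate is (or is not) attached.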

@anupchandak
Author

anupchandak commented Feb 23, 2023

@varshaprasad96 - I tried creating a sample project using the Memcached example but was not able to reproduce the above issue.

I cannot share my work project due to copyright restrictions.

Any pointers on how I can check what ends up on the operator's watch list when it starts, and which selector it is applying?

@anupchandak
Author

Is there any way to know which dependent resource change triggered the operator's reconciliation loop?

@varshaprasad96
Member

varshaprasad96 commented Feb 27, 2023

The other option is to add additional logging to the ansible-operator binary and try it out locally to see what is happening. Some pointers:

  1. I'd start by checking whether watches.yaml is being parsed as expected, i.e. whether the selectors are being parsed and loaded from the watches file, which happens here: https://github.com/operator-framework/operator-sdk/blob/5cbdad9209332043b7c730856b6302edc8996faf/internal/ansible/watches/watches.go#L313
  2. This is where predicates are set up based on labels: https://github.com/operator-framework/operator-sdk/blob/d828db26e4c0377e8423bfbdafa36449a971f05a/internal/ansible/controller/controller.go#L115. Checking there whether predicates are being created successfully would be helpful.
  3. The above two steps should help in digging into the issue. If not, I would go a step further and try to replicate this method (https://github.com/kubernetes-sigs/controller-runtime/blob/b9940edaaafe3f0292d6be43b362852aab079369/pkg/predicate/predicate.go#L375), which is where predicates are created from labels. That would help in checking whether the labels are in the right format and whether the predicate function behaves as expected.
  4. This is where the ansible controller's logic lives (https://github.com/operator-framework/operator-sdk/blob/d828db26e4c0377e8423bfbdafa36449a971f05a/internal/cmd/ansible-operator/run/cmd.go#L89); digging into the logs to check the events being received and the requests triggering the reconciler would be helpful (a sketch of a simple logging predicate that could help with this follows this list).
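As one concrete way to do the extra logging from points 1 and 4, here is a minimal sketch of a hypothetical logging predicate (not part of operator-sdk; the package name and function are placeholders) that never filters anything but records every incoming event, so you can see exactly which object triggered a reconcile and which labels it carried:

package debugutil

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// LoggingPredicate logs kind, namespace/name and labels for every event and
// always returns true, so it observes without changing filtering behaviour.
func LoggingPredicate() predicate.Funcs {
	logObj := func(verb string, obj client.Object) bool {
		log.Log.Info("event received",
			"verb", verb,
			"gvk", obj.GetObjectKind().GroupVersionKind().String(),
			"namespace", obj.GetNamespace(),
			"name", obj.GetName(),
			"labels", obj.GetLabels())
		return true
	}
	return predicate.Funcs{
		CreateFunc:  func(e event.CreateEvent) bool { return logObj("create", e.Object) },
		UpdateFunc:  func(e event.UpdateEvent) bool { return logObj("update", e.ObjectNew) },
		DeleteFunc:  func(e event.DeleteEvent) bool { return logObj("delete", e.Object) },
		GenericFunc: func(e event.GenericEvent) bool { return logObj("generic", e.Object) },
	}
}

Chaining something like this ahead of the label-selector predicate in a locally built binary would show whether the suspicious events come from the primary CR or from a dependent resource.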

You may have to build the binary locally to test it out. The steps are here: https://sdk.operatorframework.io/docs/contribution-guidelines/developer-guide/.

Before all this, I would suggest increasing the log verbosity and checking whether there is anything suspicious indicating that the labels haven't been set up as expected. Hope this helps!

@anupchandak
Author

@varshaprasad96 - Thank you so much for your detailed reply above.

Sorry for the late reply, but I think I am able to reproduce the issue. I believe it is caused by the dependent CronJob resource created by the CR.

Please use the attached project and follow the below steps to reproduce the issue.

  1. Copy the project locally.
  2. Install the CRDs with make install.
  3. Start the operator locally with ansible-operator run local --zap-devel=true.
  4. Create the first CR in the apple namespace. This will create a deployment object and a CronJob in the suspended state.
    kubectl create namespace apple
    kubectl config set-context --current --namespace=apple
    kubectl --namespace apple create -f config/samples/apple_sample.yaml
    
  5. Create the second CR in the banana namespace. This will create a deployment object and a CronJob in the non-suspended state.
    kubectl create namespace banana
    kubectl config set-context --current --namespace=banana
    kubectl --namespace banana create -f config/samples/banana_sample.yaml
    
  6. Now, stop the operator and modify the watches.yaml to only select resources from the apple namespace.
    selector:
      matchExpressions:
        - key: cache.example.com/controller-namespace
          operator: In
          values: [apple]
    
  7. Restart the operator with ansible-operator run local --zap-devel=true.
  8. You should see that the operator reconciles whenever the CronJob in the banana namespace is triggered, even though the operator's watch selector is configured to select only resources from the apple namespace (see the sketch after this list).
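For what it's worth, here is a minimal controller-runtime sketch of one way this behaviour can arise; this is not the ansible-operator's actual wiring, and the GVK and types are placeholders. A label-selector predicate attached to the watch for the primary CR does not automatically apply to owner-based watches on dependent resources such as the CronJob, so their events still enqueue reconciles:

package controllers

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	// Predicate built from the selector used in step 6.
	pred, err := predicate.LabelSelectorPredicate(metav1.LabelSelector{
		MatchExpressions: []metav1.LabelSelectorRequirement{{
			Key:      "cache.example.com/controller-namespace",
			Operator: metav1.LabelSelectorOpIn,
			Values:   []string{"apple"},
		}},
	})
	if err != nil {
		return err
	}

	// Placeholder GVK standing in for the sample CR.
	primary := &unstructured.Unstructured{}
	primary.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "cache.example.com", Version: "v1alpha1", Kind: "Memcached",
	})

	return ctrl.NewControllerManagedBy(mgr).
		// The selector predicate filters events for the primary CR only.
		For(primary, builder.WithPredicates(pred)).
		// Owner-based watch on the dependent CronJob: without its own
		// predicate here, every owned CronJob event is enqueued, regardless
		// of the labels on the owning CR.
		Owns(&batchv1.CronJob{}).
		Complete(r)
}

If the operator's dynamic watches for dependents behave like the unfiltered Owns() watch above, that would explain why the banana CronJob still triggers reconciles despite the selector.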

I have also attached logs from my local execution. Please note that, to restrict the logs to only the testing namespaces, I had set export WATCH_NAMESPACE=apple,banana.

memcached-operator.zip
reconcile_log.txt

Thank you!

@anupchandak
Author

Team - Any comment/update on this issue?

@anupchandak
Author

Hi Team - Have you had a chance to look at this issue?

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2023
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 16, 2023
@everettraven everettraven transferred this issue from operator-framework/operator-sdk Oct 5, 2023