Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: reconcile on namespace changes #5898

Merged

Conversation

simonpasquier
Copy link
Contributor

@simonpasquier simonpasquier commented Sep 7, 2023

Description

When the operator was configured to select only a limited number of
namespaces, it would not watch for namespace changes. It means that the
operator may not reconcile when a namespace label is added/removed
(affecting which objects should be selected or not).

This change enables the operator to use a privileged namespace
lister/watcher whenever the service account has the needed permissions.

IMPORTANT: it also requires Kubernetes >= 1.22 to be effective but the
operator will degrade to the less efficient implementation if this condition isn't met.

xref #3847 (one more PR required to fix it completely).

Type of change

What type of changes does your code introduce to the Prometheus operator? Put an x in the box that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry

Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.

Reconcile monitoring resources on namespace updates.

@pull-request-size pull-request-size bot added size/XL and removed size/L labels Sep 7, 2023
@simonpasquier simonpasquier force-pushed the improve-ns-listerwatcher branch 4 times, most recently from ae11f2a to fe98cee Compare September 11, 2023 07:27
simonpasquier added a commit to simonpasquier/prometheus-operator that referenced this pull request Sep 12, 2023
This change is a pre-requisite for prometheus-operator#5898.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
simonpasquier added a commit to simonpasquier/prometheus-operator that referenced this pull request Sep 13, 2023
This change is a pre-requisite for prometheus-operator#5898.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
simonpasquier added a commit to simonpasquier/prometheus-operator that referenced this pull request Sep 13, 2023
This change is a pre-requisite for prometheus-operator#5898.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
simonpasquier added a commit to simonpasquier/prometheus-operator that referenced this pull request Sep 13, 2023
This change is a pre-requisite for prometheus-operator#5898.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
@simonpasquier simonpasquier force-pushed the improve-ns-listerwatcher branch 3 times, most recently from 467307b to d7d5c85 Compare September 14, 2023 11:56
@simonpasquier simonpasquier marked this pull request as ready for review September 14, 2023 13:24
@simonpasquier simonpasquier requested a review from a team as a code owner September 14, 2023 13:24
@@ -302,11 +331,11 @@ func testPrometheusInstanceNamespacesAllowList(t *testing.T) {
// Remove the selecting label on the "allowed" namespace and check that
// the target is removed.
// See https://github.com/prometheus-operator/prometheus-operator/issues/3847
//if err := testFramework.RemoveLabelsFromNamespace(framework.KubeClient, allowedNs, "monitored"); err != nil {
//if err := framework.RemoveLabelsFromNamespace(context.Background(), allowedNs, "monitored"); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part is still commented because the service account isn't configured with list/watch permissions on namespaces. I have another change in flight that would fix it by implementing a poller based watcher but to keep this PR reviewable, I'll send as a follow-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -303,6 +303,15 @@ func run() int {
return 1
}

kubernetesVersion, err := kclient.Discovery().ServerVersion()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of repeating the same check in each controller, let's do it once on startup.

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking this long to review, the subject here is a bit out of my comfort zone 😬 I'm still going through the codebase and trying to understand everything.

A few doubts came up while reading the PR


return cache.NewFilteredListWatchFromClient(c, "namespaces", metav1.NamespaceAll, tweak)
level.Debug(l).Log("msg", "using privileged namespace lister/watcher")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this debug log seems out of place 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm not sure to understand but I can remove all the "using (un)privileged namespace lister/watcher" debug logs as they were more for me to check the code path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to keep it! I missed the return literally in the next line and thought you left duplicated logs 😅

level.Info(l).Log("msg", "namespace not found", "namespace", name)
continue
}
result, err := corev1Client.Namespaces().Get(ctx, name, metav1.GetOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put this inside an if?

if getAllowed {
  result, err := corev1Client.Namespaces().Get(ctx, name, metav1.GetOptions{})
  .
  .
  .
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the existing behavior to minimize the changes. The SA might have get permissions on a subset of the configured namespaces but getAllowed will be false.

err = fmt.Errorf("%w: %w", err, r)
}

level.Warn(l).Log("msg", "the operator lacks required permissions which may result in degraded functionalities", "err", err)
}

listFunc := func(options metav1.ListOptions) (runtime.Object, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
listFunc := func(options metav1.ListOptions) (runtime.Object, error) {
listFunc := func(_ metav1.ListOptions) (runtime.Object, error) {

Is it intended to not use the options provided here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it's intended and the PR didn't change anything in that regard.

Copy link
Contributor Author

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @ArthurSens. I understand that the change isn't easy to review, fortunately we have quite good e2e test coverage which helped me find bugs in the first iterations :)


return cache.NewFilteredListWatchFromClient(c, "namespaces", metav1.NamespaceAll, tweak)
level.Debug(l).Log("msg", "using privileged namespace lister/watcher")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm not sure to understand but I can remove all the "using (un)privileged namespace lister/watcher" debug logs as they were more for me to check the code path.

err = fmt.Errorf("%w: %w", err, r)
}

level.Warn(l).Log("msg", "the operator lacks required permissions which may result in degraded functionalities", "err", err)
}

listFunc := func(options metav1.ListOptions) (runtime.Object, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it's intended and the PR didn't change anything in that regard.

level.Info(l).Log("msg", "namespace not found", "namespace", name)
continue
}
result, err := corev1Client.Namespaces().Get(ctx, name, metav1.GetOptions{})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the existing behavior to minimize the changes. The SA might have get permissions on a subset of the configured namespaces but getAllowed will be false.

@simonpasquier
Copy link
Contributor Author

@ArthurSens in case you missed my earlier comment: 5ada40d is the next step but I didn't want to increase the scope of this PR even more.

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, LGTM!

Thanks for working on this, reviewing it also helped me understand better what kubebuilder usually abstracts away from other projects I work with :P


return cache.NewFilteredListWatchFromClient(c, "namespaces", metav1.NamespaceAll, tweak)
level.Debug(l).Log("msg", "using privileged namespace lister/watcher")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to keep it! I missed the return literally in the next line and thought you left duplicated logs 😅

Comment on lines 59 to 69
am = framework.MakeBasicAlertmanager(instanceNs, "instance", 3)
am = framework.MakeBasicAlertmanager(instanceNs, "instance", 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any special reason to reduce the number of replicas?

I assume it is just because the test is valid for both 3 or 2 replicas and 2 replicas is just faster, but I wanted to ask anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the test doesn't care if it's 3, 2 or 1 replicas. But I'll revert it to avoid unrelated changes. We can have a look at all e2e tests globally and decide whether they really need multi replicas: in most cases, I think that the answer is no and switching to single replicas might save a few resources.

@ArthurSens
Copy link
Member

ArthurSens commented Sep 20, 2023

@ArthurSens in case you missed my earlier comment: 5ada40d is the next step but I didn't want to increase the scope of this PR even more.

I've just seen it and looks 👌 !

When the operator was configured to select only a limited number of
namespaces, it would not watch for namespace changes. It means that the
operator may not reconcile when a namespace label is added/removed
(affecting which objects should be selected or not).

This change enables the operator to use a privileged namespace
lister/watcher whenever the service account has the needed permissions.

**IMPORTANT:** it also requires Kubernetes >= 1.22 to be effective but
the operator will degrade to the suboptimal implementation without watch
if this condition isn't met.

Closes prometheus-operator#3847

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
@simonpasquier simonpasquier merged commit 561eb52 into prometheus-operator:main Sep 21, 2023
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants