
NETOBSERV-1508 add selector, affinity and priority to components #569

Merged: 4 commits into netobserv:main on Mar 25, 2024

Conversation


@jpinsonneau (Contributor) commented Feb 16, 2024

Description

This PR exposes NodeSelector, Affinity and PriorityClassName on each NetObserv component, in their respective advanced sections.
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector

Example on a two-node cluster:

oc get nodes
NAME                         STATUS   ROLES    AGE   VERSION
ip-10-0-1-175.ec2.internal   Ready    worker   52m   v1.27.9+e36e183
ip-10-0-1-238.ec2.internal   Ready    worker   51m   v1.27.9+e36e183

oc get pods -n netobserv-privileged
NAME                         READY   STATUS    RESTARTS   AGE
netobserv-ebpf-agent-qs4f7   1/1     Running   0          4s
netobserv-ebpf-agent-t2b92   1/1     Running   0          4s

Set a custom label on a node:

kind: Node
apiVersion: v1
metadata:
  name: ip-10-0-1-238.ec2.internal
  labels:
    netobserv: 'true'
...
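
The same label can also be applied with a single command instead of editing the Node object:

oc label node ip-10-0-1-238.ec2.internal netobserv=true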

Set nodeSelector in the eBPF agent config:

apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
spec:
  agent:
    ebpf:
      advanced:
        nodeSelector:
          netobserv: 'true'
...
oc get pods -n netobserv-privileged
NAME                         READY   STATUS    RESTARTS   AGE
netobserv-ebpf-agent-hwp4c   1/1     Running   0          7m46s
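
For illustration (this snippet is not from the PR itself), the other new fields live in the same advanced sections. A minimal sketch for the processor, with placeholder values; the priority class name and the pod label selector are assumptions and must match what exists in your cluster:

apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
spec:
  processor:
    advanced:
      # only schedule flowlogs-pipeline on labelled nodes (illustrative label)
      nodeSelector:
        netobserv: 'true'
      # the referenced PriorityClass must already exist in the cluster
      priorityClassName: netobserv-high-priority
      # standard corev1.Affinity structure; here, prefer spreading pods across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: flowlogs-pipeline
                topologyKey: kubernetes.io/hostname
...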

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket, especially if they bring user-facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Makefile Outdated
@@ -361,7 +361,7 @@ deploy: kustomize ## Deploy controller to the K8s cluster specified in ~/.kube/c
$(SED) -i -r 's~ebpf-agent:.+~ebpf-agent:main~' ./config/manager/manager.yaml
$(SED) -i -r 's~flowlogs-pipeline:.+~flowlogs-pipeline:main~' ./config/manager/manager.yaml
$(SED) -i -r 's~console-plugin:.+~console-plugin:main~' ./config/manager/manager.yaml
- $(KUSTOMIZE) build config/openshift | sed -r "s/openshift-netobserv-operator\.svc/${NAMESPACE}.svc/" | kubectl apply -f -
+ $(KUSTOMIZE) build config/openshift | sed -r "s/openshift-netobserv-operator\.svc/${NAMESPACE}.svc/" | kubectl create -f -
Contributor:

The kubectl folks recommend using apply over create; IIRC they want to find ways to remove create.

Contributor Author:

We run into the following issue if we keep apply here:

metadata.annotations: Too long: must have at most 262144 bytes

Contributor:

This is weird; which metadata is it referring to here, the svc? Which version of kubectl do you have?

Contributor Author:

I was on an old version. Just updated to latest and still have the same issue:

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.9+e36e183
USER=jpinsonn VERSION=1508 make deploy
cd config/manager && /home/julien/dev/me/network-observability-operator/bin/kustomize edit set image controller=quay.io/jpinsonn/network-observability-operator:1508-3
sed -i -r 's~ebpf-agent:.+~ebpf-agent:main~' ./config/manager/manager.yaml
sed -i -r 's~flowlogs-pipeline:.+~flowlogs-pipeline:main~' ./config/manager/manager.yaml
sed -i -r 's~console-plugin:.+~console-plugin:main~' ./config/manager/manager.yaml
/home/julien/dev/me/network-observability-operator/bin/kustomize build config/openshift | sed -r "s/openshift-netobserv-operator\.svc/netobserv.svc/" | kubectl apply -f -
namespace/netobserv created
customresourcedefinition.apiextensions.k8s.io/flowmetrics.flows.netobserv.io created
serviceaccount/netobserv-controller-manager created
role.rbac.authorization.k8s.io/netobserv-leader-election-role created
role.rbac.authorization.k8s.io/netobserv-openshift-netobserv-operator-prometheus created
clusterrole.rbac.authorization.k8s.io/netobserv-manager-role created
clusterrole.rbac.authorization.k8s.io/netobserv-proxy-role created
rolebinding.rbac.authorization.k8s.io/netobserv-leader-election-rolebinding created
rolebinding.rbac.authorization.k8s.io/netobserv-openshift-netobserv-operator-prometheus created
clusterrolebinding.rbac.authorization.k8s.io/netobserv-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/netobserv-proxy-rolebinding created
configmap/netobserv-manager-config created
service/netobserv-metrics-service created
service/netobserv-webhook-service created
deployment.apps/netobserv-controller-manager created
servicemonitor.monitoring.coreos.com/netobserv-metrics-monitor created
validatingwebhookconfiguration.admissionregistration.k8s.io/netobserv-validating-webhook-configuration created
The CustomResourceDefinition "flowcollectors.flows.netobserv.io" is invalid: metadata.annotations: Too long: must have at most 262144 bytes

Member:

that must be the alm-example, I guess? Perhaps we should try to shrink that a bit

Contributor Author:

The corev1.Affinity object is actually too complex and adds 1071 lines to the manifest for each component.
We can remove this object to solve the issue here, but keep in mind that this may force customers to set custom labels on their nodes just for us.
Also, pod affinity / anti-affinity helps to co-locate or separate two pods on the same node; this is complementary to NodeSelector.

I feel we will fall into that situation anyway in the future. WDYT?

Member:

What if you rebase on #577?

Contributor Author:

Just tested and same 😞

Member:

FWIW I did some searching around this problem and it is something other people have brought up; apparently a common approach to mitigate it is to remove the description fields on embedded APIs (like Affinity or Autoscaler). There's a ticket for allowing this via kubebuilder, but it's still open: kubernetes-sigs/controller-tools#441 (since 2020).

I tried this simple workaround that seems to work well:

(in our Makefile under manifests)

##@ Code / files generation
manifests: controller-gen ## Generate WebhookConfiguration, ClusterRole and CustomResourceDefinition objects.
	$(CONTROLLER_GEN) $(CRD_OPTIONS) rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases
	$(YQ) -i 'del(.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.processor.properties.kafkaConsumerAutoscaler.properties.metrics.items | .. | select(has("description")) | .description)' config/crd/bases/flows.netobserv.io_flowcollectors.yaml
	$(YQ) -i 'del(.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.consolePlugin.properties.autoscaler.properties.metrics.items | .. | select(has("description")) | .description)' config/crd/bases/flows.netobserv.io_flowcollectors.yaml
	$(YQ) -i 'del(.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.agent.properties.ebpf.properties.advanced.properties.affinity.properties | .. | select(has("description")) | .description)' config/crd/bases/flows.netobserv.io_flowcollectors.yaml
	$(YQ) -i 'del(.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.processor.properties.advanced.properties.affinity.properties | .. | select(has("description")) | .description)' config/crd/bases/flows.netobserv.io_flowcollectors.yaml
	$(YQ) -i 'del(.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.consolePlugin.properties.advanced.properties.affinity.properties | .. | select(has("description")) | .description)' config/crd/bases/flows.netobserv.io_flowcollectors.yaml

It makes the CRD shrink from ~592 kB to ~321 kB, so not too bad... although this isn't a definitive solution.
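
To check the shrinking effect locally, the CRD can be regenerated and its size measured (paths taken from the Makefile snippet above):

make manifests
wc -c config/crd/bases/flows.netobserv.io_flowcollectors.yaml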

Member:

(and for the record, the issue comes from kubectl apply injecting the last-applied-configuration, which kubectl create does not; it's this annotation that is too big since it contains the whole resource)
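
For reference, the size of that annotation can be checked on a resource created with kubectl apply (a sketch; the annotation is absent when kubectl create was used):

kubectl get crd flowcollectors.flows.netobserv.io \
  -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' | wc -c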

@jpinsonneau added the ok-to-test label (set manually when a PR is safe to test; triggers an image build on the PR) on Feb 16, 2024
@github-actions bot removed the ok-to-test label on Feb 16, 2024
@netobserv deleted a comment from github-actions bot on Feb 16, 2024
@netobserv deleted a comment from openshift-ci bot on Feb 16, 2024
@jpinsonneau added the ok-to-test label on Feb 16, 2024

New images:

  • quay.io/netobserv/network-observability-operator:819d919
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-819d919
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-819d919

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:819d919 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-819d919

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-819d919
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m


codecov bot commented Feb 16, 2024

Codecov Report

Attention: Patch coverage is 66.07930%, with 77 lines in your changes missing coverage. Please review.

Project coverage is 66.98%. Comparing base (1cd8e76) to head (0d433bc).
Report is 3 commits behind head on main.

Files Patch % Lines
...pis/flowcollector/v1beta2/zz_generated.deepcopy.go 0.00% 27 Missing and 6 partials ⚠️
pkg/helper/comparators.go 55.31% 16 Missing and 5 partials ⚠️
pkg/helper/flowcollector.go 73.13% 12 Missing and 6 partials ⚠️
...pis/flowcollector/v1beta1/flowcollector_webhook.go 86.48% 3 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #569      +/-   ##
==========================================
- Coverage   67.38%   66.98%   -0.41%     
==========================================
  Files          65       65              
  Lines        7987     8108     +121     
==========================================
+ Hits         5382     5431      +49     
- Misses       2276     2331      +55     
- Partials      329      346      +17     
Flag Coverage Δ
unittests 66.98% <66.07%> (-0.41%) ⬇️

Flags with carried forward coverage won't be shown.


@@ -52,6 +52,32 @@ func (r *FlowCollector) ConvertTo(dstRaw conversion.Hub) error {
dst.Spec.Loki.Microservices = restored.Spec.Loki.Microservices
dst.Spec.Loki.Manual = restored.Spec.Loki.Manual

if restored.Spec.Agent.EBPF.Advanced != nil {
Member:

Rather than adding more complex conversion logic (sometimes error prone), I find it generally easier to just backport a feature to previous versions. As long as we're not modifying or deleting existing parts of the API, I think it's just fine to backport it to v1beta1?

Contributor Author:

Are you suggesting adding the whole advanced section to every component here?

Currently in v1beta1, only the eBPF and FLP components have debug sections, containing a single Env map.

Member:

No, that's not what I was suggesting... IIRC advanced brought some breaking changes, so it had to go via a new API and could not be backported. I was thinking more about adding these fields at the component root, like we used to do in v1beta1, but if you think it won't simplify things, feel free to keep it as you did.

@@ -57,6 +92,28 @@ func annotationsChanged(old, new *corev1.PodTemplateSpec, report *ChangeReport)
return false
}

func assignationChanged(old, new *corev1.PodTemplateSpec, report *ChangeReport) bool {
if !equality.Semantic.DeepDerivative(old.Spec.NodeSelector, new.Spec.NodeSelector) {
Member:

DeepDerivative normally takes the "new" param as the first argument and the "old" as the second; it checks for stuff in "new" that wasn't in "old", but not the other way around. That's typically useful for annotations/labels, when other parts of the kube infra might add their own labels/annotations on pods and we don't want to interfere with that.

So here I believe we either want to use DeepDerivative but change the argument order, or perhaps we actually want DeepEqual?
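
For illustration, the suggested ordering would look like this (a sketch of the semantics only, not the code merged in this PR; the package and function names are assumptions):

package helper

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

// nodeSelectorChanged passes the desired ("new") selector as the first argument,
// so labels added to the live ("old") pod template by other controllers do not
// count as a change; only entries missing or different in "old" do.
func nodeSelectorChanged(old, new *corev1.PodTemplateSpec) bool {
	return !equality.Semantic.DeepDerivative(new.Spec.NodeSelector, old.Spec.NodeSelector)
}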

Contributor Author:

I gave it a try with DeepEqual here and got an infinite reconcile loop. It seemed to work fine with DeepDerivative, but I'll give it another try inverting the two args to see if I get differences.

Contributor Author:

I have addressed your feedback and simplified the code in this commit: 4178913

Thanks !

@github-actions bot removed the ok-to-test label on Feb 20, 2024
@jpinsonneau (Contributor Author)

Rebased without changes


jotak commented Mar 20, 2024

@jpinsonneau @msherif1234 : on the apply vs create, I'm proposing this workaround: jpinsonneau#4
This removes external/embedded resource descriptions, which allows shrinking the CRD & annotation size and makes the apply pass again.

@jpinsonneau (Contributor Author)


I guess we lose autocompletion in the UI with this approach, right?
But if it makes the CRD lightweight, I'm fine with that. We just need to document these sections with examples so users are not lost.


jotak commented Mar 20, 2024


Yes... or perhaps what we could do, instead of removing the descriptions, is to replace them with a link to these objects' doc, such as https://docs.openshift.com/container-platform/4.15/rest_api/autoscale_apis/horizontalpodautoscaler-autoscaling-v2.html
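
A rough sketch of that variant, reusing the yq pattern from the Makefile snippet earlier in this thread (untested; the target path and URL would be adjusted per embedded object):

	$(YQ) -i '(.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.consolePlugin.properties.autoscaler.properties.metrics.items | .. | select(has("description")) | .description) = "See https://docs.openshift.com/container-platform/4.15/rest_api/autoscale_apis/horizontalpodautoscaler-autoscaling-v2.html"' config/crd/bases/flows.netobserv.io_flowcollectors.yaml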


jotak commented Mar 20, 2024

/lgtm

@openshift-ci bot added the lgtm label on Mar 20, 2024

Amoghrd commented Mar 21, 2024

/ok-to-test

@openshift-ci bot added the ok-to-test label on Mar 21, 2024

New images:

  • quay.io/netobserv/network-observability-operator:5550579
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-5550579
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-5550579

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:5550579 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-5550579

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-5550579
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@jpinsonneau (Contributor Author)

Rebased without changes

@github-actions bot removed the ok-to-test label on Mar 22, 2024
@jpinsonneau added the ok-to-test label on Mar 22, 2024

New images:

  • quay.io/netobserv/network-observability-operator:ca8234e
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-ca8234e
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-ca8234e

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:ca8234e make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-ca8234e

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-ca8234e
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@openshift-ci bot added the lgtm label on Mar 22, 2024

Amoghrd commented Mar 22, 2024

/label qe-approved

@openshift-ci bot added the qe-approved label on Mar 22, 2024

openshift-ci bot commented Mar 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:


@openshift-merge-bot merged commit b61b9d5 into netobserv:main on Mar 25, 2024
14 checks passed
Labels: approved, lgtm, ok-to-test, qe-approved