NETOBSERV-473 - Loki and strimzi operator installation #172

Closed
wants to merge 4 commits into from

Conversation

@jpinsonneau (Contributor) commented Sep 26, 2022

This PR adds an option for both loki & kafka to install dependent operators automatically using:

  operatorsAutoInstall: ["kafka", "loki"]
  kafka:
    autoInstallSpec:
      replicas: 3
      storage:
        type: persistent-claim
        size: 200Gi
        class: gp2
      zooKeeperReplicas: 3
      zooKeeperStorage:
        type: persistent-claim
        size: 20Gi
        class: gp2
      partitions: 24
      topicReplicas: 3
  loki:
    autoInstallSpec:
      secretName: loki-secret
      objectStorageType: s3
      size: 1x.extra-small
      storageClassName: gp2
      retentionDays: 1

It creates a Subscription based on the environment (currently forced to OpenShift for testing), then applies the related YAML from the controllers/operators/embed/ folder, overriding it with the optional configuration provided in autoInstallSpec.
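
For illustration, here is a rough sketch of the kind of OLM Subscription this would generate for Loki; the namespace, channel, and catalog source below are assumptions, not necessarily the exact values used by the controller:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator                    # assumed name
  namespace: openshift-operators         # assumed namespace, picked per environment
spec:
  channel: stable                        # assumed channel
  name: loki-operator
  source: redhat-operators               # or the custom catalog described below
  sourceNamespace: openshift-marketplace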

/!\ You still need to handle some manual tasks:

  • create the loki-secret for storage access (see the sketch after this list)
  • apply the Loki roles (this currently returns an error since the controller service account doesn't have these roles)
    Loki roles are now automatically applied: 552e3f5
  • copy the Kafka secrets into the netobserv-privileged namespace (make fix-ebpf-kafka-tls)
    Kafka secrets are now automatically copied: 55dbebf
  • use a custom catalog for loki-operator until release 5.6.x (Nov 15th)
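
For the first task, a minimal sketch of the loki-secret for S3 object storage, assuming the standard loki-operator S3 secret keys; all values are placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: loki-secret
  namespace: netobserv                   # assumption: namespace where the LokiStack is deployed
stringData:                              # keys assumed to follow the loki-operator S3 object storage convention
  bucketnames: my-loki-bucket            # placeholder
  endpoint: https://s3.us-east-1.amazonaws.com
  region: us-east-1
  access_key_id: REPLACE_ME
  access_key_secret: REPLACE_ME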

For the custom loki-operator catalog (last item above), create it using the following YAML:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: loki
  namespace: openshift-marketplace
spec:
  displayName: Loki Operator Catalog
  image: 'quay.io/jpinsonn/loki-operator-catalog:v0.0.1'
  publisher: jpinsonn
  sourceType: grpc

Then specify it in our CRD:

  loki:
    autoInstallSpec:
      source: loki

@openshift-ci bot commented Sep 26, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@KalmanMeth (Contributor) commented:

We should keep in mind that we will also probably need a similar prometheus operator installation to expose the flp metrics.

@jpinsonneau (Author) commented:

> We should keep in mind that we will also probably need a similar prometheus operator installation to expose the flp metrics.

Yes @KalmanMeth, these are follow-ups:
https://issues.redhat.com/browse/NETOBSERV-564 Dependent operators: prometheus (for NOO upstream only)
https://issues.redhat.com/browse/NETOBSERV-388 Enable prometheus endpoint in FLP without adding a prometheus stage

And grafana could also be a good candidate:
https://issues.redhat.com/browse/NETOBSERV-604 Dependent operators: grafana

@mariomac (Contributor) left a comment:

I tried it, but the following pods are stuck in ContainerCreating status:

  • netobserv-ebpf-agent
  • flowlogs-pipeline-transformer
  • netobserv-plugin

This is due to these errors in the pod description's events list:

  Normal   Scheduled    2m53s                default-scheduler  Successfully assigned netobserv-privileged/netobserv-ebpf-agent-99tzk to ip-10-0-144-117.ec2.internal by ip-10-0-169-224
  Warning  FailedMount  50s                  kubelet            Unable to attach or mount volumes: unmounted volumes=[kafka-certs-ca kafka-certs-user], unattached volumes=[kube-api-access-2zbnc kafka-certs-ca kafka-certs-user]: timed out waiting for the condition
  Warning  FailedMount  46s (x9 over 2m53s)  kubelet            MountVolume.SetUp failed for volume "kafka-certs-ca" : secret "kafka-cluster-cluster-ca-cert" not found
  Warning  FailedMount  46s (x9 over 2m53s)  kubelet            MountVolume.SetUp failed for volume "kafka-certs-user" : secret "flp-kafka" not found

  Warning  FailedMount  4m10s (x7 over 4m41s)  kubelet            MountVolume.SetUp failed for volume "kafka-certs-ca" : secret "kafka-cluster-cluster-ca-cert" not found
  Warning  FailedMount  3m38s (x8 over 4m41s)  kubelet            MountVolume.SetUp failed for volume "loki-certs-ca" : configmap "lokistack-ca-bundle" not found

To be addressed in another PR/task: there are some ugly stack traces in the manager logs, caused by trying to reconcile elements while the related CRD is not yet applied:

1.6649732503408134e+09	INFO	controller.flowcollector	checking for lokistack in ns netobserv ...	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "component": "ClientHelper", "function": "ApplyWithNamespaceOverride"}
1.6649732503408515e+09	ERROR	controller.flowcollector	Can't apply embed/loki_instance.yaml yaml	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "component": "OperatorsController", "function": "manageOperator", "error": "no matches for kind \"LokiStack\" in version \"loki.grafana.com/v1\""}
github.com/netobserv/network-observability-operator/controllers/operators.(*Reconciler).Reconcile
	/opt/app-root/controllers/operators/operators_reconciler.go:124
github.com/netobserv/network-observability-operator/controllers.(*FlowCollectorReconciler).Reconcile
	/opt/app-root/controllers/flowcollector_controller.go:165
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2

1.66497323313885e+09	ERROR	controller.flowcollector	Can't apply embed/loki_instance.yaml yaml	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "component": "OperatorsController", "function": "manageOperator", "error": "no matches for kind \"LokiStack\" in version \"loki.grafana.com/v1\""}
github.com/netobserv/network-observability-operator/controllers/operators.(*Reconciler).Reconcile
	/opt/app-root/controllers/operators/operators_reconciler.go:124
github.com/netobserv/network-observability-operator/controllers.(*FlowCollectorReconciler).Reconcile
	/opt/app-root/controllers/flowcollector_controller.go:165
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem

api/v1alpha1/flowcollector_types.go (several outdated review threads, resolved)
objectStorageType: s3
size: 1x.extra-small
storageClassName: gp2
retentionDays: 1
Contributor:

Must it be in days? Maybe if the storage usage is too high, some users might want to set the retention in hours.

@jpinsonneau (Author):

Yes, it is in days on the loki-operator side. We can sync with the logging team if we really need less.

config/manager/kustomization.yaml (outdated review thread, resolved)
@jpinsonneau (Author) commented:

Loki roles are now automatically set: 552e3f5

@openshift-ci bot commented Nov 10, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jpinsonneau by writing /assign @jpinsonneau in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment

@jpinsonneau (Author) commented:

PR has been rebased. No more changes on my side.

loki-operator 5.6 is still not available as far as I can see in my cluster bot; however @memodi had it in a CI cluster 🤔

@mariomac (Contributor) commented:

@jpinsonneau I tried to install it but I had some problems with the PersistentVolume claims:

  Warning  FailedScheduling  3m34s  default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.

That's OK; I just changed the size of the storage and reapplied the FlowCollector (also decreasing the number of instances for Kafka and Zookeeper), but the changes didn't take effect.

Then I tried deleting the FlowCollector, but the reconciliation of Kafka was still blocked:

NAME                                                    READY   STATUS    RESTARTS   AGE
amq-streams-cluster-operator-v2.2.0-2-bfdb5f487-7z9xb   1/1     Running   0          6m50s
kafka-cluster-zookeeper-0                               0/1     Pending   0          6m33s
kafka-cluster-zookeeper-1                               0/1     Pending   0          6m33s
kafka-cluster-zookeeper-2                               0/1     Pending   0          6m33s
netobserv-controller-manager-746d858486-rwbc8           2/2     Running   0          14m

I had to completely undeploy the operator to be able to remove the pods.

@mariomac (Contributor) commented Nov 10, 2022

@jpinsonneau I redeployed with smaller persistent volume sizes, and the status of the FlowCollector is ReconcileDependentOperatorsFailed. The manager continuously logs this message:

1.6680775737424061e+09	ERROR	controller.flowcollector	Failed to reconcile dependent operators:
LokiStack.loki.grafana.com "lokistack" is invalid: spec.tenants.mode: Unsupported value: "openshift-network":
supported values: "static", "dynamic", "openshift-logging"

Tested in:

$ oc version
Client Version: 4.11.13
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-11-07-181244
Kubernetes Version: v1.25.2+93b33ea

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: netobserv-dependend-operators
Member:

typo: dependend => dependent
Also, what is it / how does that work?


@jotak (Member) commented Nov 10, 2022

I've finished reviewing the code. I have some remarks, but I think they can be addressed in a follow-up; we should probably go ahead and iterate from there, in order to have it better tested before GA, at least for non-regressions. I just hope this feature will not bring too much maintenance cost for the benefit it provides, because it is quite complex and depends on external factors.

I haven't tested it, apart from some non-regression tests. I leave it to the people who test to put the lgtm mark :)

What I'd like to see in a follow-up:

  • Having kubebuilder validation on the AutoInstall enum (I think we can do that by declaring a new string type, like type AutoInstall string, with the appropriate kubebuilder annotation on it); see the sketch after this list
  • Use upper case for the new enum, consistently with the other enums
  • Nitpicking, but wouldn't it be nice to use an API similar to what we did on Spec (like spec.UseKafka()) for these auto-install things?
  • Could we have the secrets watcher run as a separate controller? Maybe also the CRD watcher? Along the same lines as what Olivier suggested with different goroutines, I guess...
  • If I read correctly, Kafka secrets, when present, are always copied to the privileged namespace. But there are cases where we don't have such a namespace, e.g. when the agent is ipfix, and I wonder if that's also possible with ebpf (I don't remember, I guess @mariomac can answer). In that case we should not try to copy secrets.
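
Regarding the first point, a rough sketch of what the generated CRD validation could look like once a kubebuilder enum annotation is in place; the field schema and allowed values below are assumptions for illustration only:

# hypothetical fragment of the generated FlowCollector CRD OpenAPI schema
operatorsAutoInstall:
  type: array
  items:
    type: string
    enum:
      - KAFKA    # upper case, as suggested above
      - LOKI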

@@ -630,7 +630,7 @@ func GetReadyCR(key types.NamespacedName) *flowsv1alpha1.FlowCollector {
 		return err
 	}
 	cond := meta.FindStatusCondition(cr.Status.Conditions, conditions.TypeReady)
-	if cond.Status == metav1.ConditionFalse {
+	if cond != nil && cond.Status == metav1.ConditionFalse {
Member:

@jpinsonneau maybe the problem is here: if cond is nil, then it should return an error?

@jpinsonneau (Author):

Thanks, it seems to work as expected 👍

@jpinsonneau (Author) commented:

/hold

@openshift-merge-robot (Collaborator) commented:

@jpinsonneau: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

KalmanMeth pushed a commit to KalmanMeth/network-observability-operator that referenced this pull request Feb 13, 2023
NETOBSERV-234 remove slashes in debug json flows