NETOBSERV-473 - Loki and strimzi operator installation #172

Closed
wants to merge 4 commits into from

Conversation

@jpinsonneau (Contributor) commented Sep 26, 2022

This PR adds an option for both loki & kafka to install dependent operators automatically using:

  operatorsAutoInstall: ["kafka", "loki"]
  kafka:
    autoInstallSpec:
      replicas: 3
      storage:
        type: persistent-claim
        size: 200Gi
        class: gp2
      zooKeeperReplicas: 3
      zooKeeperStorage:
        type: persistent-claim
        size: 20Gi
        class: gp2
      partitions: 24
      topicReplicas: 3
  loki:
    autoInstallSpec:
      secretName: loki-secret
      objectStorageType: s3
      size: 1x.extra-small
      storageClassName: gp2
      retentionDays: 1

It creates a Subscription based on the environment (currently forced to OpenShift for testing), then applies the related YAML from the controllers/operators/embed/ folder, overriding it with the optional configuration provided in autoInstallSpec.
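
For illustration, here is a rough sketch of the kind of OLM Subscription this would generate for Loki; the namespace, channel, and catalog source below are assumptions, not necessarily the exact values used by the controller:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator                    # assumed name
  namespace: openshift-operators         # assumed namespace, picked per environment
spec:
  channel: stable                        # assumed channel
  name: loki-operator
  source: redhat-operators               # or the custom catalog described below
  sourceNamespace: openshift-marketplace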

/!\ You still need to handle some manual tasks:

  • create the loki-secret for storage access (see the sketch after this list)
  • apply the Loki roles (this currently returns an error since the controller service account doesn't have these roles)
    Loki roles are now automatically applied: 552e3f5
  • copy the Kafka secrets into the netobserv-privileged namespace (make fix-ebpf-kafka-tls)
    Kafka secrets are now automatically copied: 55dbebf
  • use a custom catalog for loki-operator until release 5.6.x (Nov 15th)
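
For the first task, a minimal sketch of the loki-secret for S3 object storage, assuming the standard loki-operator S3 secret keys; all values are placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: loki-secret
  namespace: netobserv                   # assumption: namespace where the LokiStack is deployed
stringData:                              # keys assumed to follow the loki-operator S3 object storage convention
  bucketnames: my-loki-bucket            # placeholder
  endpoint: https://s3.us-east-1.amazonaws.com
  region: us-east-1
  access_key_id: REPLACE_ME
  access_key_secret: REPLACE_ME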

For the custom loki-operator catalog (last item above), create it using the following YAML:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: loki
  namespace: openshift-marketplace
spec:
  displayName: Loki Operator Catalog
  image: 'quay.io/jpinsonn/loki-operator-catalog:v0.0.1'
  publisher: jpinsonn
  sourceType: grpc

Then specify it in our CRD:

  loki:
    autoInstallSpec:
      source: loki

@openshift-ci bot commented Sep 26, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@KalmanMeth (Contributor) commented:

We should keep in mind that we will also probably need a similar prometheus operator installation to expose the flp metrics.

@jpinsonneau (Author) commented:

> We should keep in mind that we will also probably need a similar prometheus operator installation to expose the flp metrics.

Yes @KalmanMeth, these are follow-ups:
https://issues.redhat.com/browse/NETOBSERV-564 Dependent operators: prometheus (for NOO upstream only)
https://issues.redhat.com/browse/NETOBSERV-388 Enable prometheus endpoint in FLP without adding a prometheus stage

And grafana could also be a good candidate:
https://issues.redhat.com/browse/NETOBSERV-604 Dependent operators: grafana

@mariomac (Contributor) left a comment:

I tried it, but the following pods are stuck in ContainerCreating status:

  • netobserv-ebpf-agent
  • flowlogs-pipeline-transformer
  • netobserv-plugin

This is due to these errors in the pod description's events list:

  Normal   Scheduled    2m53s                default-scheduler  Successfully assigned netobserv-privileged/netobserv-ebpf-agent-99tzk to ip-10-0-144-117.ec2.internal by ip-10-0-169-224
  Warning  FailedMount  50s                  kubelet            Unable to attach or mount volumes: unmounted volumes=[kafka-certs-ca kafka-certs-user], unattached volumes=[kube-api-access-2zbnc kafka-certs-ca kafka-certs-user]: timed out waiting for the condition
  Warning  FailedMount  46s (x9 over 2m53s)  kubelet            MountVolume.SetUp failed for volume "kafka-certs-ca" : secret "kafka-cluster-cluster-ca-cert" not found
  Warning  FailedMount  46s (x9 over 2m53s)  kubelet            MountVolume.SetUp failed for volume "kafka-certs-user" : secret "flp-kafka" not found

  Warning  FailedMount  4m10s (x7 over 4m41s)  kubelet            MountVolume.SetUp failed for volume "kafka-certs-ca" : secret "kafka-cluster-cluster-ca-cert" not found
  Warning  FailedMount  3m38s (x8 over 4m41s)  kubelet            MountVolume.SetUp failed for volume "loki-certs-ca" : configmap "lokistack-ca-bundle" not found

To be addressed in another PR/task: there are some ugly stack traces in the manager logs, caused by trying to reconcile elements while the related CRD is not yet applied:

1.6649732503408134e+09	INFO	controller.flowcollector	checking for lokistack in ns netobserv ...	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "component": "ClientHelper", "function": "ApplyWithNamespaceOverride"}
1.6649732503408515e+09	ERROR	controller.flowcollector	Can't apply embed/loki_instance.yaml yaml	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "component": "OperatorsController", "function": "manageOperator", "error": "no matches for kind \"LokiStack\" in version \"loki.grafana.com/v1\""}
github.com/netobserv/network-observability-operator/controllers/operators.(*Reconciler).Reconcile
	/opt/app-root/controllers/operators/operators_reconciler.go:124
github.com/netobserv/network-observability-operator/controllers.(*FlowCollectorReconciler).Reconcile
	/opt/app-root/controllers/flowcollector_controller.go:165
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2

1.66497323313885e+09	ERROR	controller.flowcollector	Can't apply embed/loki_instance.yaml yaml	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "component": "OperatorsController", "function": "manageOperator", "error": "no matches for kind \"LokiStack\" in version \"loki.grafana.com/v1\""}
github.com/netobserv/network-observability-operator/controllers/operators.(*Reconciler).Reconcile
	/opt/app-root/controllers/operators/operators_reconciler.go:124
github.com/netobserv/network-observability-operator/controllers.(*FlowCollectorReconciler).Reconcile
	/opt/app-root/controllers/flowcollector_controller.go:165
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem

api/v1alpha1/flowcollector_types.go (several outdated review threads, resolved)
objectStorageType: s3
size: 1x.extra-small
storageClassName: gp2
retentionDays: 1
Contributor:

Must it be in days? Maybe if the storage usage is too high, some users might want to set the retention in hours.

@jpinsonneau (Author):

Yes, it is in days on the loki-operator side. We can sync with the logging team if we really need less.

config/manager/kustomization.yaml (outdated review thread, resolved)
@jpinsonneau (Author) commented:

Loki roles are now automatically set: 552e3f5

@openshift-ci bot commented Nov 10, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jpinsonneau by writing /assign @jpinsonneau in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment

@jpinsonneau (Author) commented:

PR has been rebased. No more changes on my side.

loki-operator 5.6 is still not available as far as I can see in my cluster bot; however @memodi had it in a CI cluster 🤔

@mariomac (Contributor) commented:

@jpinsonneau I tried to install it but I had some problems with the PersistentVolume claims:

  Warning  FailedScheduling  3m34s  default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.

That's OK; I just changed the size of the storage and reapplied the FlowCollector (also decreasing the number of instances for Kafka and Zookeeper), but the changes didn't take effect.

Then I tried deleting the FlowCollector, but the reconciliation of Kafka was still blocked:

NAME                                                    READY   STATUS    RESTARTS   AGE
amq-streams-cluster-operator-v2.2.0-2-bfdb5f487-7z9xb   1/1     Running   0          6m50s
kafka-cluster-zookeeper-0                               0/1     Pending   0          6m33s
kafka-cluster-zookeeper-1                               0/1     Pending   0          6m33s
kafka-cluster-zookeeper-2                               0/1     Pending   0          6m33s
netobserv-controller-manager-746d858486-rwbc8           2/2     Running   0          14m

I had to completely undeploy the operator to be able to remove the pods.

@mariomac (Contributor) commented Nov 10, 2022

@jpinsonneau I redeployed with smaller persistent volume sizes, and the status of the FlowCollector is ReconcileDependentOperatorsFailed. The manager continuously logs this message:

1.6680775737424061e+09	ERROR	controller.flowcollector	Failed to reconcile dependent operators:
LokiStack.loki.grafana.com "lokistack" is invalid: spec.tenants.mode: Unsupported value: "openshift-network":
supported values: "static", "dynamic", "openshift-logging"

Tested in:

$ oc version
Client Version: 4.11.13
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-11-07-181244
Kubernetes Version: v1.25.2+93b33ea

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: netobserv-dependend-operators
Member:

typo: dependend => dependent
Also, what is it / how does that work?


@jotak (Member) commented Nov 10, 2022

I've finished reviewing the code. I have some remarks, but I think they can be addressed in a follow-up; we should probably go ahead and iterate from there, in order to have it better tested before GA, at least for non-regressions. I just hope this feature will not bring too much maintenance cost for the benefit it provides, because it is quite complex and depends on external factors.

I haven't tested it, apart from some non-regression tests. I leave it to the people who test to put the lgtm mark :)

What I'd like to see in a follow-up:

  • Having kubebuilder validation on the AutoInstall enum (I think we can do that by declaring a new string type, like type AutoInstall string, with the appropriate kubebuilder annotation on it); see the sketch after this list
  • Use upper case for the new enum, consistently with the other enums
  • Nitpicking, but wouldn't it be nice to use an API similar to what we did on Spec (like spec.UseKafka()) for these auto-install things?
  • Could we have the secrets watcher run as a separate controller? Maybe also the CRD watcher? Along the same lines as what Olivier suggested with different goroutines, I guess...
  • If I read correctly, Kafka secrets, when present, are always copied to the privileged namespace. But there are cases where we don't have such a namespace, e.g. when the agent is ipfix, and I wonder if that's also possible with ebpf (I don't remember, I guess @mariomac can answer). In that case we should not try to copy secrets.
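
Regarding the first point, a rough sketch of what the generated CRD validation could look like once a kubebuilder enum annotation is in place; the field schema and allowed values below are assumptions for illustration only:

# hypothetical fragment of the generated FlowCollector CRD OpenAPI schema
operatorsAutoInstall:
  type: array
  items:
    type: string
    enum:
      - KAFKA    # upper case, as suggested above
      - LOKI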

@@ -630,7 +630,7 @@ func GetReadyCR(key types.NamespacedName) *flowsv1alpha1.FlowCollector {
 		return err
 	}
 	cond := meta.FindStatusCondition(cr.Status.Conditions, conditions.TypeReady)
-	if cond.Status == metav1.ConditionFalse {
+	if cond != nil && cond.Status == metav1.ConditionFalse {
Member:

@jpinsonneau maybe the problem is here: if cond is nil, then it should return an error?

@jpinsonneau (Author):

Thanks, it seems to work as expected 👍

@jpinsonneau (Author) commented:

/hold

@openshift-merge-robot (Collaborator) commented:

@jpinsonneau: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

KalmanMeth pushed a commit to KalmanMeth/network-observability-operator that referenced this pull request Feb 13, 2023
NETOBSERV-234 remove slashes in debug json flows