Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-219 PoC Operators #59

Closed
wants to merge 2 commits into from

Conversation

jpinsonneau
Copy link
Contributor

@jpinsonneau jpinsonneau commented Feb 16, 2022

JIRA: https://issues.redhat.com/browse/NETOBSERV-219

This PoC allow strimzi, loki and grafana operators install + instances creation simply specifying operator(s) name(s) in FlowCollectorSpec.Operators

It create a Subscription using operator-framework api that will deploy the operator and then create the kafka instance using strimzi-client-go, the strimzi api interface generated with crd-codegen. It also allow us to manipulate KafkaUser, KafkaTopic and so on.

Since loki operator is not yet available in a public CatalogSource, a custom one containing an updated version of the operator has been released:

  • namespace has been replaced by network-observability
  • gateway has been removed
  • auth has been disabled

The reconcilier will add this source to the namespace if it doesn't exists. We could use that to tests our operator releases before publishing

=> Loki Operator has been released in the redhat-operators catalog. Tenant-ID and TLS InsecureSkipVerify has been added to be able to connect until https://issues.redhat.com/browse/NETOBSERV-309 implementation.
Check netobserv/network-observability-console-plugin#147

/!\ you need to create the aws bucket before creating FlowCollector instance:

aws s3api create-bucket --bucket netobserv-loki --region us-east-1
oc -n network-observability create secret generic test --from-literal=endpoint="https://s3.us-east-1.amazonaws.com" --from-literal=region="us-east-1" --from-literal=bucketnames="netobserv-loki" --from-literal=access_key_id="XXXXXXXXXXXXXXXXXXXX" --from-literal=access_key_secret="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
oc apply -f config/samples/flows_v1alpha1_flowcollector_dev.yaml

You can check deploy-example-secret.sh for infos

image

@openshift-ci
Copy link

openshift-ci bot commented Feb 16, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link

openshift-ci bot commented Feb 16, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from jpinsonneau after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

@OlivierCazade OlivierCazade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong here but my understanding is that each operator can only be deployed per cluster once.

Because of this, I think that the dependency is too strong :

  • what happens if the operator is already deployed in the cluster
  • if we deploy the operator and someone else create a second resource instance for it, this resource will get destroyed if we remove the flowcollector object

I think managing the instances like you did is great but the operator part raise many concerns.

And just for my understanding, did we try using the operator dependency feature?

api/v1alpha1/flowcollector_types.go Outdated Show resolved Hide resolved
controllers/operators/operators_reconciler.go Show resolved Hide resolved
@jpinsonneau
Copy link
Contributor Author

Correct me if I am wrong here but my understanding is that each operator can only be deployed per cluster once.

You can actually install multiple times the same operator in different namespaces. The only situation I didn't tested is if the operator is globally installed. Then maybe the subscription in our namespace will fail.

Here is an example with grafana operator:
image

And just for my understanding, did we try using the operator dependency feature?

That's look great ! Thanks for sharing. I will test that ASAP.
This could avoid managing CatalogSource and Subscription in our code but we will still need to create the instances of each operator right ?

@OlivierCazade
Copy link
Contributor

Correct me if I am wrong here but my understanding is that each operator can only be deployed per cluster once.

You can actually install multiple times the same operator in different namespaces. The only situation I didn't tested is if the operator is globally installed. Then maybe the subscription in our namespace will fail.

Here is an example with grafana operator:

Ok, I did not know we could restrict an operator to some managed namespace. Thanks for this! This solve many of my concerns.

Copy link
Contributor

@mariomac mariomac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty cool! I just dropped few comments for the moment when this PoC goes from PoC to product.

api/v1alpha1/flowcollector_types.go Outdated Show resolved Hide resolved
controllers/flowcollector_controller.go Outdated Show resolved Hide resolved
controllers/operators/operators_objects.go Outdated Show resolved Hide resolved
Copy link
Contributor

@eranra eranra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpinsonneau a question ... if I change the CR adding or removing an operator from the list ... with that operator also will be uninstalled ???

@jpinsonneau
Copy link
Contributor Author

@jpinsonneau a question ... if I change the CR adding or removing an operator from the list ... with that operator also will be uninstalled ???

For now no since I wasn't sure about installing the operators in the same namespace.
If we all agree that we force operator instances to be in the same namespace, then yes we can remove them according to the list.

@jpinsonneau
Copy link
Contributor Author

jpinsonneau commented Mar 1, 2022

I tried to import all operators configs but there is some limitations on CRD size so we should add only needed configs

  • Added prometheus (only replicas option for now)
  • Added grafana datasource & allow dashboard from specified configmap
  • Automatically create grafana route if ingress: true
  • Loki storageClassName configurable (standard for kubernetes / gp2 for openshift) + secret name
  • Automatically set ConsolePlugin querier URL if operator is required

The operators + instances will be created as soon as corresponding instanceSpec is set; else it will skip it.
You can also specify extra objects like grafana dashboard.

Example:

grafana:
  instanceSpec:
    config:
      security:
        admin_user: admin
        admin_password: admin
    ingress: true
  dashboardSpec:
      configMapRef:
        name: grafana-dashboard
        key: json
        optional: true

@jotak
Copy link
Member

jotak commented Mar 22, 2022

I'm back on testing this, sorry for the delay!
I'm seeing this error in logs:

1.647941642174263e+09	ERROR	controller.flowcollector	Failed to reconcile required operators	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "error": "no matches for kind \"LokiStack\" in version \"loki.openshift.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

Maybe it's just the time that the custom loki catalog gets deployed ?

@jotak
Copy link
Member

jotak commented Mar 22, 2022

Similar for grafana:

1.6479417328308504e+09	ERROR	controller.flowcollector	Reconciler error	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "error": "no matches for kind \"Grafana\" in version \"integreatly.org/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

@jotak
Copy link
Member

jotak commented Mar 22, 2022

/hold
putting on hold, just to have a discussion with the rest of the team: is it ok or not to ship this feature that relies on a temporary / alpha, self-hosted version of the Loki operator? I don't think it's a blocker, but I'd prefer to talk with all prior

@jotak
Copy link
Member

jotak commented Mar 22, 2022

Seems like Loki fails to deploy for me.
What I did is just build / deployed the operator, then applied the provided CR with _dev suffix, changing storageClassName to standard (or does standard won't work on AWS, and creating a S3 bucket is mandatory?).
Anyway, the error I get doesn't look related to that, here it is, in the loki operator logs:

{"_ts":"2022-03-22T09:34:41.38305699Z","_level":"0","_component":"loki-operator_controller-runtime_manager_controller_lokistack","_message":"Reconciler error","_error":{"msg":"Operation cannot be fulfilled on lokistacks.loki.openshift.io \"loki\": the object has been modified; please apply your changes to the latest version and try again"},"name":"loki","namespace":"network-observability"}

My LokiStack:

$ oc get lokistack -o yaml
apiVersion: v1
items:
- apiVersion: loki.openshift.io/v1beta1
  kind: LokiStack
  metadata:
    creationTimestamp: "2022-03-22T09:34:41Z"
    generation: 1
    labels:
      part-of: netobserv
    name: loki
    namespace: network-observability
    ownerReferences:
    - apiVersion: flows.netobserv.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: FlowCollector
      name: cluster
      uid: 37950a15-ab0b-4407-8e5c-73819b0061bc
    resourceVersion: "39218"
    uid: 77b0b01a-2059-4b6d-876f-df1d4c3b5615
  spec:
    managementState: Managed
    replicationFactor: 1
    size: 1x.small
    storage:
      secret:
        name: test
    storageClassName: standard
  status:
    components: {}
    conditions:
    - lastTransitionTime: "2022-03-22T09:34:41Z"
      message: Missing object storage secret
      reason: MissingObjectStorageSecret
      status: "False"
      type: Degraded
    - lastTransitionTime: "2022-03-22T09:34:41Z"
      message: All components ready
      reason: ReadyComponents
      status: "True"
      type: Ready
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Maybe that's due to the early state of the loki operator?

@jotak
Copy link
Member

jotak commented Mar 22, 2022

I changed the storageClassName back to gp2 but it didn't make it to the LokiStack, it's still showing standard

@jotak
Copy link
Member

jotak commented Mar 22, 2022

Also, it seems like the created operators are not removed after removing the CR, or editing it to disable them

@jpinsonneau
Copy link
Contributor Author

jpinsonneau commented Mar 24, 2022

@jotak here are some answers:

What I did is just build / deployed the operator, then applied the provided CR with _dev suffix, changing storageClassName to standard (or does standard won't work on AWS, and creating a S3 bucket is mandatory?).

On AWS using openshift-install command you will not have standard StorageClasses
image

Anyway, the error I get doesn't look related to that, here it is, in the loki operator logs:
...
Maybe that's due to the early state of the loki operator?

I changed the storageClassName back to gp2 but it didn't make it to the LokiStack, it's still showing standard

Yes ! Error status are not revelent at all and it just crashs if your storage / bucket is not correctly set. You then need to manually uninstall Operator or LokiStack since most of changes / checks are not managed on their side.

Also, it seems like the created operators are not removed after removing the CR, or editing it to disable them

My initial plan was to use global operators. I changed this to our operator namespace:
#59 (comment)
I can implement deletes now I guess but it will be easy to loose data by mistake

@openshift-ci openshift-ci bot removed the lgtm label Apr 1, 2022
@openshift-ci
Copy link

openshift-ci bot commented Apr 1, 2022

New changes are detected. LGTM label has been removed.


// install operators if main instance is required
// dependent objects will be ignored if InstanceSpec is nil
if target.Spec.Loki.InstanceSpec != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment than here #78 (comment) :-)
We need an explicit flag, due to restrictions when the CR is created from a console form (not yaml). Checking nil pointers won't work (they'll never be nil).

By the way, I haven't checked in deep if the Kafka config is consistent with what @OlivierCazade is doing here: #78 , do you know? Things like TopicName will be defined in just one place, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not much difference between the configs. I'll adapt after #78 merge

@jpinsonneau
Copy link
Contributor Author

Added enable flags

@openshift-ci
Copy link

openshift-ci bot commented May 12, 2022

@jpinsonneau: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}
}

func grafanaSubscription(name string, namespace string) *ov1alpha1.Subscription {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in slack, I think it's necessary for productization to make these subscriptions optional, and configurable, in order to adapt upstream vs downstream.
I think a good option would be to externalize these as Subscription yaml resources, that will be easily adaptable between upstream and downstream.

As a consequence, it must be considered in code that subscriptions can potentially be missing (e.g.: no community loki operator), which should not be considered an error.

@jpinsonneau jpinsonneau closed this Jul 6, 2022
KalmanMeth added a commit to KalmanMeth/network-observability-operator that referenced this pull request Feb 13, 2023
catch sigterm and exit go routines cleanly
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants