
Bug 1904497: Add vsphere problem detector deployment #111

Conversation

@gnufied (Member) commented Dec 4, 2020

Use a deployment for managing the vsphere problem detector:

  • Create static assets for RBAC, the deployment, the service, and Prometheus monitoring
  • Sync the static assets that library-go can sync directly
  • Sync the ServiceMonitor object
  • Sync the Deployment object
  • Wire the deployment into the rest of the CSO

fixes https://bugzilla.redhat.com/show_bug.cgi?id=1904497
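For orientation, here is a minimal sketch of the library-go asset-sync pattern the list above refers to, assuming the ApplyDirectly helper from library-go's resourceapply package; the function name and asset paths are illustrative, not the PR's actual code:

    // Sketch: syncing static assets via library-go's resourceapply helpers.
    // Asset paths and the function name are illustrative.
    package vsphereproblemdetector

    import (
        "github.com/openshift/library-go/pkg/operator/events"
        "github.com/openshift/library-go/pkg/operator/resource/resourceapply"
        "k8s.io/client-go/kubernetes"
    )

    func syncStaticAssets(kubeClient kubernetes.Interface, recorder events.Recorder, readAsset resourceapply.AssetFunc) error {
        results := resourceapply.ApplyDirectly(
            resourceapply.NewKubeClientHolder(kubeClient),
            recorder,
            readAsset,
            // Illustrative asset file names.
            "vsphere_problem_detector/serviceaccount.yaml",
            "vsphere_problem_detector/role.yaml",
            "vsphere_problem_detector/rolebinding.yaml",
            "vsphere_problem_detector/service.yaml",
        )
        for _, result := range results {
            if result.Error != nil {
                return result.Error
            }
        }
        return nil
    }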

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 4, 2020
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 4, 2020
@gnufied gnufied force-pushed the add-vsphere-problem-detector-deployment branch from 034f8d2 to fab994e Compare December 8, 2020 20:37
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 8, 2020
@gnufied gnufied changed the title WIP: Add vsphere problem detector deployment Add vsphere problem detector deployment Dec 8, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2020
Add code to sync static assets
Add generated binary assets
Add code to create service monitor requests
@gnufied gnufied force-pushed the add-vsphere-problem-detector-deployment branch from fab994e to 0a30222 Compare December 8, 2020 20:40
@gnufied gnufied changed the title Add vsphere problem detector deployment Bug 1904497: Add vsphere problem detector deployment Dec 8, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Dec 8, 2020
@openshift-ci-robot (Contributor)

@gnufied: This pull request references Bugzilla bug 1904497, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1904497: Add vsphere problem detector deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot (Contributor)

@gnufied: This pull request references Bugzilla bug 1904497, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1904497: Add vsphere problem detector deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gnufied (Member, Author) commented Dec 9, 2020

This is what the conditions look like, btw:

    conditions:
    - lastTransitionTime: "2020-11-30T04:03:30Z"
      reason: AsExpected
      status: "False"
      type: CSIDriverStarterDegraded
    - lastTransitionTime: "2020-11-30T04:03:30Z"
      status: "False"
      type: ManagementStateDegraded
    - lastTransitionTime: "2020-11-30T04:03:30Z"
      status: "True"
      type: DefaultStorageClassControllerAvailable
    - lastTransitionTime: "2020-11-30T04:03:30Z"
      status: "False"
      type: DefaultStorageClassControllerProgressing
    - lastTransitionTime: "2020-11-30T04:03:30Z"
      reason: AsExpected
      status: "False"
      type: DefaultStorageClassControllerDegraded
    - lastTransitionTime: "2020-11-30T04:03:30Z"
      status: "True"
      type: SnapshotCRDControllerUpgradeable
    - lastTransitionTime: "2020-11-30T04:03:30Z"
      reason: AsExpected
      status: "False"
      type: SnapshotCRDControllerDegraded
    - lastTransitionTime: "2020-12-04T02:41:07Z"
      status: "True"
      type: VSphereProblemDetectorControllerAvailable
    - lastTransitionTime: "2020-12-04T02:41:07Z"
      reason: AsExpected
      status: "False"
      type: VSphereProblemDetectorControllerDegraded
    - lastTransitionTime: "2020-12-07T20:24:12Z"
      reason: AsExpected
      status: "False"
      type: VSphereProblemDetectorStarterDegraded
    - lastTransitionTime: "2020-12-07T20:24:18Z"
      reason: AsExpected
      status: "False"
      type: VSphereProblemDetectorStarterStaticControllerDegraded
    - lastTransitionTime: "2020-12-09T15:39:49Z"
      status: "True"
      type: VSphereProblemDetectorDeploymentControllerAvailable
    - lastTransitionTime: "2020-12-08T19:36:23Z"
      status: "False"
      type: VSphereProblemDetectorDeploymentControllerProgressing
    - lastTransitionTime: "2020-12-07T20:34:57Z"
      reason: AsExpected
      status: "False"
      type: VSphereProblemDetectorDeploymentControllerDegraded
    - lastTransitionTime: "2020-12-08T19:24:41Z"
      status: "True"
      type: VSphereProblemDetectorMonitoringControllerAvailable
    - lastTransitionTime: "2020-12-08T19:24:41Z"
      reason: AsExpected
      status: "False"
      type: VSphereProblemDetectorMonitoringControllerDegraded

@gnufied gnufied force-pushed the add-vsphere-problem-detector-deployment branch from 289b8ce to f236f05 Compare December 9, 2020 17:00
return nil
}

go c.controller.Start(ctx)
Contributor:

Please run the controller only once; this looks like it will be called repeatedly on each sync().
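One way to satisfy this (a minimal sketch with assumed type and field names; the PR's actual fix may differ) is to guard the start with sync.Once:

    // Sketch: ensure the wrapped controller starts only on the first sync(),
    // not on every resync. Type and field names are assumed.
    import (
        "context"
        "sync"
    )

    type problemDetectorStarter struct {
        controller interface{ Start(ctx context.Context) } // illustrative
        startOnce  sync.Once
    }

    func (c *problemDetectorStarter) sync(ctx context.Context) error {
        c.startOnce.Do(func() {
            go c.controller.Start(ctx)
        })
        return nil
    }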

Member Author:

fixed

pkg/operator/vsphereproblemdetector/monitoring.go (outdated; resolved)
    eventRecorder: eventRecorder,
}
return factory.New().
    WithInformers(c.operatorClient.Informer()).
@jsafrane (Contributor), Dec 9, 2020:

Should it have an informer for ServiceMonitors too?

Member Author:

We could, but it would require importing all of https://github.com/prometheus-operator/prometheus-operator/tree/master/pkg/apis/monitoring, and I was being thwarted by dependency conflicts, so I avoided doing this.

We are doing what we did for syncing credentials (before we moved them to the CVO), and that is why we resync every minute: we are not watching ServiceMonitor objects.

Member Author:

I did end up importing those informers and APIs, hence marking this as resolved.
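For reference, wiring that informer looks roughly like this, assuming prometheus-operator's generated clientset and informer packages; the function name, resync period, and namespace parameter are illustrative:

    // Sketch: build a ServiceMonitor informer from prometheus-operator's
    // generated clients, so the controller can watch instead of polling.
    import (
        "time"

        monitoringinformers "github.com/prometheus-operator/prometheus-operator/pkg/client/informers/externalversions"
        monitoringclient "github.com/prometheus-operator/prometheus-operator/pkg/client/versioned"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/cache"
    )

    func newServiceMonitorInformer(cfg *rest.Config, namespace string) (cache.SharedIndexInformer, error) {
        mclient, err := monitoringclient.NewForConfig(cfg)
        if err != nil {
            return nil, err
        }
        informers := monitoringinformers.NewSharedInformerFactoryWithOptions(
            mclient, 20*time.Minute, monitoringinformers.WithNamespace(namespace))
        return informers.Monitoring().V1().ServiceMonitors().Informer(), nil
    }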

return nil
}

func getLogLevel(logLevel operatorapi.LogLevel) int {
Contributor:

Please move it to a common package; it's used in deploymentcontroller.go too.

Member Author:

Fixed. I extracted most of the deployment-controller creation code into a util function and used it in both places.
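For reference, the helper being moved maps the operator API's log level to a klog verbosity; a sketch assuming the usual OpenShift operator convention (Normal=2, Debug=4, Trace=6, TraceAll=8):

    // Sketch of the shared helper; operatorapi is github.com/openshift/api/operator/v1.
    func getLogLevel(logLevel operatorapi.LogLevel) int {
        switch logLevel {
        case operatorapi.Normal, "":
            return 2
        case operatorapi.Debug:
            return 4
        case operatorapi.Trace:
            return 6
        case operatorapi.TraceAll:
            return 8
        default:
            return 2
        }
    }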

@gnufied gnufied force-pushed the add-vsphere-problem-detector-deployment branch from 423ae36 to d86d9f0 Compare December 11, 2020 03:54
@gnufied (Member, Author) commented Dec 11, 2020

/retest

opSpec, opStatus, _, err := c.operatorClient.GetOperatorState()
if err != nil {
    return err
}
Member:

We need to check for an IsNotFound error here.
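A sketch of the suggested guard, using apimachinery's standard helper; whether a missing operator CR should end the sync quietly is an assumption here:

    // Sketch: tolerate a missing operator CR instead of reporting an error.
    // apierrors is k8s.io/apimachinery/pkg/api/errors.
    opSpec, opStatus, _, err := c.operatorClient.GetOperatorState()
    if apierrors.IsNotFound(err) {
        return nil // operator CR not created yet (or deleted); nothing to sync
    }
    if err != nil {
        return err
    }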

resourcemerge.SetDeploymentGeneration(&opStatus.Generations, deployment)

updateGenerationFn := func(newStatus *operatorapi.OperatorStatus) error {
    if deployment != nil {
Member:

If deployment were nil, it would have crashed up above.

- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
Member:

Which functionality requires it to be privileged?

namespace: openshift-cluster-storage-operator
annotations:
  include.release.openshift.io/self-managed-high-availability: "true"
  include.release.openshift.io/single-node-developer: "true"
Member:

This is deployed by CSO, not CVO. Why do we need these annotations?

If we do need them, shouldn't the deployment have them as well?

Member:

Same comment for the other files in assets/vsphere_problem_detector

operatorClient: clients.OperatorClient,
kubeClient: clients.KubeClient,
dynamicClient: clients.DynamicClient,
eventRecorder: eventRecorder,
Member:

Please add a suffix to this recorder so that we know where events are coming from

Member Author:

Shouldn't eventRecorder.WithComponentSuffix("vsphere-monitoring-controller"), applied when creating the controller, be enough?

Member:

Not really, because it's a different control loop.

Member Author:

But we don't use it anywhere else.

Member Author:

Oops, I missed this. Fixed.
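The resolution amounts to something like this sketch (the struct name and suffix string are illustrative, not the PR's exact code):

    // Sketch: give this control loop its own suffixed recorder so its
    // events are attributable.
    c := &vSphereMonitoringController{
        operatorClient: clients.OperatorClient,
        kubeClient:     clients.KubeClient,
        dynamicClient:  clients.DynamicClient,
        eventRecorder:  eventRecorder.WithComponentSuffix("vsphere-problem-detector-monitoring"),
    }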

pkg/utils/deployment_controller.go (outdated; resolved)
@@ -82,35 +82,17 @@ spec:
value: quay.io/openshift/origin-csi-node-driver-registrar:latest
- name: LIVENESS_PROBE_IMAGE
value: quay.io/openshift/origin-csi-livenessprobe:latest
- name: VSPHERE_PROBLEM_DETECTOR_OPERATOR_IMAGE
value: quay.io/openshift/origin-vsphere-problem-detector:latest
Member:

nit: the PR description says it's adding a CredentialsRequest, but it's already there.

opSpec, _, _, err := c.operatorClient.GetOperatorState()
if err != nil {
    return err
}
Member:

Missing check for not found error...

Member Author:

added

@gnufied (Member, Author) left a comment:

placeholder

@openshift-ci-robot (Contributor)

@gnufied: This pull request references Bugzilla bug 1904497, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1904497: Add vsphere problem detector deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


_, err = resourceapply.ApplyServiceMonitor(c.dynamicClient, c.eventRecorder, serviceMonitorBytes)

if err != nil {
    return err
Member:

IMO we shouldn't make CSO degraded if we fail to create a service monitor

Member Author:

Why not? Currently ServiceMonitor objects are created by the CVO, and if it can't create those, I think the cluster will be marked degraded.

    deploymentAvailable.Status = operatorapi.ConditionTrue
} else {
    deploymentAvailable.Status = operatorapi.ConditionFalse
    deploymentAvailable.Reason = "WaitDeployment"
Member:

I know it was like this before, but we should have a single Reason value to represent the same thing: openshift/library-go#901

Member Author:

fixed


func (c *VSphereProblemDetectorStarter) sync(ctx context.Context, syncCtx factory.SyncContext) error {
klog.V(4).Infof("VSphereProblemDetectorStarter.Sync started")
defer klog.V(4).Infof("VSphereProblemDetectorStarter.Sync finished")
Member:

hm... this controller would still be running on all platforms, not just vSphere...

Member Author:

Yes, this controller will still run, but it won't do anything.
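A sketch of the kind of platform gate that makes the sync a no-op elsewhere; the lister and field names are assumptions (configv1 is github.com/openshift/api/config/v1):

    // Sketch: bail out early on non-vSphere platforms.
    infra, err := c.infraLister.Get("cluster")
    if err != nil {
        return err
    }
    if infra.Status.PlatformStatus == nil ||
        infra.Status.PlatformStatus.Type != configv1.VSpherePlatformType {
        return nil // not vSphere: nothing to start
    }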

eventRecorder events.Recorder
}

func NewVSphereProblemDetectorStarter(
Member:

I really don't like that we have two "starter" controllers whose main job is the same: start certain sub-controllers per platform.

I'd prefer to make the necessary adjustments to CSIDriverStarterController (to make it easy to start non-CSI controllers), rather than creating another starter controller.

Member Author:

Not that I disagree, but can we do this in a future PR (I am happy to open a BZ/GitHub issue to remind myself)? I think this PR is quite large already.

Contributor:

Idea: once we have the vSphere CSI driver, we can add the detector deployment as an extra controller started with the CSI driver, similarly to how Manila starts its certificate syncer:

ExtraControllers: []factory.Controller{
    newCertificateSyncerOrDie(clients, recorder),
},

It will be somewhat hidden in the CSI driver deployment, but still, it will run.

Member:

I like that. Can we add a TODO entry in this starter controller?

    v1helpers.UpdateConditionFn(deploymentProgressing),
    updateGenerationFn,
); err != nil {
    return nil, err
Member:

I know the idea of this function was to avoid code duplication, but this will make CSO degraded if we can't deploy the problem detector operator. Is that what we want?

Member Author:

Why wouldn't we want that? vsphere-problem-detector is not an optional operator.

Contributor:

I think it's better to go degraded; it's really not an optional operator.

@gnufied gnufied force-pushed the add-vsphere-problem-detector-deployment branch from a5efca5 to 6f568c2 Compare December 14, 2020 16:30
Comment on lines 15 to 58
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  - roles
  - rolebindings
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ''
  resources:
  - serviceaccounts
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - '*'
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - '*'
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - '*'
Contributor:

Does the problem detector need anything from this list?

Member Author:

I think this got copied from the CSI driver's cluster roles. We can drop it.

Member Author:

removed.

Comment on lines 83 to 90
- apiGroups:
  - ''
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
Contributor:

The operator should not need to read namespaces.

Member Author:

removed.

Comment on lines 99 to 110
- apiGroups:
  - '*'
  resources:
  - events
  verbs:
  - get
  - patch
  - create
  - list
  - watch
  - update
  - delete
Contributor:

The operator should not need to read/write events in other namespaces (use a Role for emitting events in the operator namespace).

Member Author:

removed.

Comment on lines 111 to 116
- apiGroups:
  - cloudcredential.openshift.io
  resources:
  - credentialsrequests
  verbs:
  - '*'
Contributor:

The operator should not need to manipulate CredentialsRequests.

Member Author:

Removed the extra verbs, but it still has CredentialsRequest permissions for get, list, and watch.

k8s.io/apimachinery v0.19.2
k8s.io/client-go v0.19.0
k8s.io/client-go v12.0.0+incompatible
Contributor:

This is suspicious; why downgrade to v12.0.0+incompatible?

Member Author:

This does not actually result in a downgrade of client-go (check go.sum); the client-go actually used is still v0.19.2. What is happening is that one of prometheus-operator's dependencies depends on v12.0.0 (https://github.com/prometheus-operator/prometheus-operator/blob/master/go.mod#L48), and because of the Go 1.13 semver check it is almost impossible to consume that client-go version without using a replace directive.

I did not manually add this line; go mod vendor and go mod tidy replace v0.19.2 with this version.
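For reference, the replace-directive escape hatch mentioned above looks like this in go.mod (a sketch of the mechanism, not necessarily what this repository does):

    // go.mod (sketch): force the real client-go version despite the
    // v12.0.0+incompatible requirement pulled in transitively.
    replace k8s.io/client-go => k8s.io/client-go v0.19.2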

}
}

depOpts.OpStatus.ReadyReplicas = deployment.Status.ReadyReplicas
Contributor:

The CSI driver controller will overwrite the number of ready replicas reported for the problem-detector deployment and vice versa. IMO, one of the deployments (or even both) should ignore ReadyReplicas completely. The field is not very usable anyway.

Member Author:

I removed setting that field.

@gnufied gnufied force-pushed the add-vsphere-problem-detector-deployment branch from 6f568c2 to 5d7df47 Compare December 14, 2020 19:42
@gnufied (Member, Author) commented Dec 15, 2020

/retest

1 similar comment
@gnufied (Member, Author) commented Dec 15, 2020

/retest

Comment on lines 15 to 52
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  - roles
  - rolebindings
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ''
  resources:
  - serviceaccounts
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - '*'
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - '*'
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - list
  - watch
Contributor:

All this seems to be useless.

Member Author:

removed these.

Comment on lines 61 to 63
- create
- patch
- update
Contributor:

The operator does not need to modify nodes.

Member Author:

removed these.

Comment on lines 84 to 91
- apiGroups:
  - cloudcredential.openshift.io
  resources:
  - credentialsrequests
  verbs:
  - get
  - list
  - watch
Contributor:

The operator should not need to read credentialsrequests

Member Author:

removed

Fix eventrecorder and reasoning
@gnufied gnufied force-pushed the add-vsphere-problem-detector-deployment branch from 5d7df47 to 60bc9ce Compare December 15, 2020 17:45
@jsafrane (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 0fdc1a3 into openshift:master Dec 16, 2020
@openshift-ci-robot (Contributor)

@gnufied: All pull requests linked via external trackers have merged:

Bugzilla bug 1904497 has been moved to the MODIFIED state.

In response to this:

Bug 1904497: Add vsphere problem detector deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
