New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1904497: Add vsphere problem detector deployment #111
Bug 1904497: Add vsphere problem detector deployment #111
Conversation
034f8d2
to
fab994e
Compare
Add code to sync static assets Add generated binary assets Add code to create service monitor requests
fab994e
to
0a30222
Compare
@gnufied: This pull request references Bugzilla bug 1904497, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@gnufied: This pull request references Bugzilla bug 1904497, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This is how conditions look like btw:
|
289b8ce
to
f236f05
Compare
return nil | ||
} | ||
|
||
go c.controller.Start(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please run the controller only once, this looks like it's going to be called repeatedly in each sync()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/operator/vsphereproblemdetector/vsphere_problem_detector_starter.go
Outdated
Show resolved
Hide resolved
eventRecorder: eventRecorder, | ||
} | ||
return factory.New(). | ||
WithInformers(c.operatorClient.Informer()). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it have informer of ServiceMonitors
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could but it will require importing all of - https://github.com/prometheus-operator/prometheus-operator/tree/master/pkg/apis/monitoring but I was being thwarted by dependency conflicts, so I have avoided doing this.
We are doing what we did for syncing credentials (before we moved them to CVO) and that is the reason we are rescyncing every 1 minute because we are not watching ServiceMonitor
objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did ended up importing those informer and APIs and hence marking this as resolved.
pkg/operator/vsphereproblemdetector/vsphere_problem_detector_deployment.go
Outdated
Show resolved
Hide resolved
return nil | ||
} | ||
|
||
func getLogLevel(logLevel operatorapi.LogLevel) int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move it to a common package, it's used in deploymentcontroller.go too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. I extracted most of deployment controller creation code in a util function and used in both places.
423ae36
to
d86d9f0
Compare
/retest |
opSpec, opStatus, _, err := c.operatorClient.GetOperatorState() | ||
if err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to check for IsNotFound
error here
resourcemerge.SetDeploymentGeneration(&opStatus.Generations, deployment) | ||
|
||
updateGenerationFn := func(newStatus *operatorapi.OperatorStatus) error { | ||
if deployment != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If deployment
were nil
it would have crashed up above
- apiGroups: | ||
- security.openshift.io | ||
resourceNames: | ||
- privileged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which functionality require it to be privileged?
namespace: openshift-cluster-storage-operator | ||
annotations: | ||
include.release.openshift.io/self-managed-high-availability: "true" | ||
include.release.openshift.io/single-node-developer: "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is deployed by CSO, not CVO. Why do we need these annotations?
If we do need them, shouldn't the deployment have them as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for the other files in assets/vsphere_problem_detector
operatorClient: clients.OperatorClient, | ||
kubeClient: clients.KubeClient, | ||
dynamicClient: clients.DynamicClient, | ||
eventRecorder: eventRecorder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a suffix to this recorder so that we know where events are coming from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't eventRecorder.WithComponentSuffix("vsphere-monitoring-controller")
applied when creating controller be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really because it's a different control loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we don't use it anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops I missed this. fixed.
pkg/operator/vsphereproblemdetector/vsphere_problem_detector_deployment.go
Outdated
Show resolved
Hide resolved
@@ -82,35 +82,17 @@ spec: | |||
value: quay.io/openshift/origin-csi-node-driver-registrar:latest | |||
- name: LIVENESS_PROBE_IMAGE | |||
value: quay.io/openshift/origin-csi-livenessprobe:latest | |||
- name: VSPHERE_PROBLEM_DETECTOR_OPERATOR_IMAGE | |||
value: quay.io/openshift/origin-vsphere-problem-detector:latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: PR descriptions says it's adding CredentialsRequest, but it's already there
opSpec, _, _, err := c.operatorClient.GetOperatorState() | ||
if err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing check for not found error...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
placeholder
@gnufied: This pull request references Bugzilla bug 1904497, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
operatorClient: clients.OperatorClient, | ||
kubeClient: clients.KubeClient, | ||
dynamicClient: clients.DynamicClient, | ||
eventRecorder: eventRecorder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really because it's a different control loop
_, err = resourceapply.ApplyServiceMonitor(c.dynamicClient, c.eventRecorder, serviceMonitorBytes) | ||
|
||
if err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we shouldn't make CSO degraded if we fail to create a service monitor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not? currently service monitor objects are created by CVO and if it can't create those, I think cluster will be degraded.
pkg/utils/deployment_controller.go
Outdated
deploymentAvailable.Status = operatorapi.ConditionTrue | ||
} else { | ||
deploymentAvailable.Status = operatorapi.ConditionFalse | ||
deploymentAvailable.Reason = "WaitDeployment" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it was like this before, but we should have a single Reason
value to represent the same thing: openshift/library-go#901
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
func (c *VSphereProblemDetectorStarter) sync(ctx context.Context, syncCtx factory.SyncContext) error { | ||
klog.V(4).Infof("VSphereProblemDetectorStarter.Sync started") | ||
defer klog.V(4).Infof("VSphereProblemDetectorStarter.Sync finished") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm... this controller would still be running in all platforms, not just vSphere...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this controller will still run but it won't do anything.
eventRecorder events.Recorder | ||
} | ||
|
||
func NewVSphereProblemDetectorStarter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like that we have 2 "starter" controllers whose main job is the same: start certain sub-controllers per platform.
I'd prefer to make the necessary adjustments to CSIDriverStarterController
(to make it easily to start non-CSI controllers), rather than creating another starter controller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not that I disagree to it but can we do this in a future PR(I am happy to open a BZ/github issue to remind myself), because I think that this PR is quite large already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idea: when we have vSphere CSI driver, we can add detector deployment at an extra controller started with CSI driver, similarly as Manila starts certificate syncer:
cluster-storage-operator/pkg/operator/csidriveroperator/csioperatorclient/manila.go
Lines 46 to 48 in e7919b1
ExtraControllers: []factory.Controller{ | |
newCertificateSyncerOrDie(clients, recorder), | |
}, |
It will be somewhat hidden in the CSI driver deployment, but still, it will run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that. Can we add a TODO entry in this starter controller?
v1helpers.UpdateConditionFn(deploymentProgressing), | ||
updateGenerationFn, | ||
); err != nil { | ||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the idea of this function was to avoid code duplication, but this will make CSO degraded if we can't deploy the problem detector operator. Is that what we want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't we want that? vsphere-problem-detector is not an optional operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to get degraded, it's really not an optional operator.
a5efca5
to
6f568c2
Compare
- apiGroups: | ||
- rbac.authorization.k8s.io | ||
resources: | ||
- clusterroles | ||
- clusterrolebindings | ||
- roles | ||
- rolebindings | ||
verbs: | ||
- watch | ||
- list | ||
- get | ||
- apiGroups: | ||
- '' | ||
resources: | ||
- serviceaccounts | ||
verbs: | ||
- get | ||
- list | ||
- watch | ||
- apiGroups: | ||
- authentication.k8s.io | ||
resources: | ||
- tokenreviews | ||
verbs: | ||
- '*' | ||
- apiGroups: | ||
- authorization.k8s.io | ||
resources: | ||
- subjectaccessreviews | ||
verbs: | ||
- '*' | ||
- apiGroups: | ||
- apiextensions.k8s.io | ||
resources: | ||
- customresourcedefinitions | ||
verbs: | ||
- list | ||
- watch | ||
- apiGroups: | ||
- coordination.k8s.io | ||
resources: | ||
- leases | ||
verbs: | ||
- '*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the problem detector need anything from this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this got copied from csi driver's clusterroles. We can drop it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
- apiGroups: | ||
- '' | ||
resources: | ||
- namespaces | ||
verbs: | ||
- get | ||
- list | ||
- watch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operator should not need read namespaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
- apiGroups: | ||
- '*' | ||
resources: | ||
- events | ||
verbs: | ||
- get | ||
- patch | ||
- create | ||
- list | ||
- watch | ||
- update | ||
- delete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operator should not need read/write events in another namespaces (use Role for emitting events in the operator namespace.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
- apiGroups: | ||
- cloudcredential.openshift.io | ||
resources: | ||
- credentialsrequests | ||
verbs: | ||
- '*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operator should not need manipulate CredentialRequests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed those extra verbs. but still have CR permissions to get, list, watch.
k8s.io/apimachinery v0.19.2 | ||
k8s.io/client-go v0.19.0 | ||
k8s.io/client-go v12.0.0+incompatible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is suspicious - why downgrade to 1.12?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not actually result in downgrade of client-go(check go.sum
), but used client-go is still v0.19.2
but what is happening is - one of the dependencies of prometheus-operator depends on v0.12.0
- https://github.com/prometheus-operator/prometheus-operator/blob/master/go.mod#L48 and because of golang-1.13 semver check it is almost impossible to use client-go v0.12.0
without using replace directive.
I have not manually added this line - but go mod vendor
and go mod tidy
will replace v.0.19.2
with this version.
pkg/utils/deployment_controller.go
Outdated
} | ||
} | ||
|
||
depOpts.OpStatus.ReadyReplicas = deployment.Status.ReadyReplicas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CSI driver will replace nr. of replica of problem-detector deployment and a vice versa. IMO, one of the deployments (or even both) should ignore ReadyReplicas
completely. The field is not very usable anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed setting that field.
6f568c2
to
5d7df47
Compare
/retest |
1 similar comment
/retest |
- apiGroups: | ||
- rbac.authorization.k8s.io | ||
resources: | ||
- clusterroles | ||
- clusterrolebindings | ||
- roles | ||
- rolebindings | ||
verbs: | ||
- watch | ||
- list | ||
- get | ||
- apiGroups: | ||
- '' | ||
resources: | ||
- serviceaccounts | ||
verbs: | ||
- get | ||
- list | ||
- watch | ||
- apiGroups: | ||
- authentication.k8s.io | ||
resources: | ||
- tokenreviews | ||
verbs: | ||
- '*' | ||
- apiGroups: | ||
- authorization.k8s.io | ||
resources: | ||
- subjectaccessreviews | ||
verbs: | ||
- '*' | ||
- apiGroups: | ||
- apiextensions.k8s.io | ||
resources: | ||
- customresourcedefinitions | ||
verbs: | ||
- list | ||
- watch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this seems to be useless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed these.
- create | ||
- patch | ||
- update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operator does not need to modify nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed these.
- apiGroups: | ||
- cloudcredential.openshift.io | ||
resources: | ||
- credentialsrequests | ||
verbs: | ||
- get | ||
- list | ||
- watch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operator should not need to read credentialsrequests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
Fix eventrecorder and reasoning
5d7df47
to
60bc9ce
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied, jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes. |
@gnufied: All pull requests linked via external trackers have merged: Bugzilla bug 1904497 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Use deployment for managing vspher problem detector:
ServiceMonitor
objectDeployment
object.fixes https://bugzilla.redhat.com/show_bug.cgi?id=1904497