
Enhance to record more accurate e2e provision/deletion latency metric #78061

Merged

merged 2 commits into kubernetes:master from yuxiangqian:metrics on Jun 1, 2019

Conversation

@yuxiangqian (Contributor) commented on May 18, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
Record accurate end-to-end latency for provisioning/deletion in the metric "volume_operation_total_time".
Which issue(s) this PR fixes:

Fixes #77030
Fixes #77608

Special notes for your reviewer:
unit tests created

Does this PR introduce a user-facing change?:
Metrics (end-to-end latency / error count) for provisioning/deletion have not been correctly recorded:

  1. The time taken by an external provisioner to create a volume has not been taken into consideration.
  2. For the in-tree workflow, the metric records the time spent on the first try of the operation instead of waiting for it to complete.

This PR enhances the PV controller to record more accurate latency/error count metrics for users to consume.

It introduces a new semantic for the metric "volume_operation_total_seconds": the end-to-end latency of volume provisioning/deletion. The existing metric "storage_operation_duration_seconds" remains untouched, but is exposed to the following potential issues:
1. For volumes provisioned/deleted via an external provisioner/deleter, "storage_operation_duration_seconds" will NOT wait for the external operation to finish before reporting the latency metric (the reported value is effectively close to 0). This is fixed by using "volume_operation_total_seconds" instead.
2. If a transient error happens during provisioning/deletion, e.g., a volume is still in use when deleteVolume is called, the original "storage_operation_duration_seconds" will NOT wait until the volume has finally been deleted and therefore reports an inaccurate latency metric. The newly implemented "volume_operation_total_seconds", however, waits until the provisioning/deletion operation has fully completed, as in the sketch below.
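
A minimal, self-contained sketch of that end-to-end approach, in Go: the start time is remembered on the first attempt and the duration is only reported once the operation fully completes, so retries do not reset the clock. The names here (startTimeCache, addIfNotExist, elapsed) are illustrative, not the exact identifiers used in this PR.

package main

import (
	"fmt"
	"sync"
	"time"
)

// startTimeCache remembers when an operation (e.g. "provision") first started
// for a given claim key, so that retries of the same operation do not reset
// the end-to-end clock.
type startTimeCache struct {
	mu    sync.Mutex
	times map[string]time.Time
}

func newStartTimeCache() *startTimeCache {
	return &startTimeCache{times: map[string]time.Time{}}
}

// addIfNotExist records the start time only on the first attempt.
func (c *startTimeCache) addIfNotExist(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.times[key]; !ok {
		c.times[key] = time.Now()
	}
}

// elapsed returns how long the whole operation has been running so far.
func (c *startTimeCache) elapsed(key string) (time.Duration, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	start, ok := c.times[key]
	if !ok {
		return 0, false
	}
	return time.Since(start), true
}

func main() {
	cache := newStartTimeCache()

	// First provisioning attempt: start the end-to-end clock.
	cache.addIfNotExist("default/my-claim")

	// A later retry is a no-op, so the clock keeps running from the first try.
	cache.addIfNotExist("default/my-claim")

	// Only when the operation finally completes is the full duration reported,
	// e.g. as an observation on the "volume_operation_total_seconds" histogram.
	if d, ok := cache.elapsed("default/my-claim"); ok {
		fmt.Printf("end-to-end provision latency: %v\n", d)
	}
}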

Potential impacts:
If an SLO/alert has been defined based on "volume_operation_total_seconds", it might be violated, because the more accurate values could be significantly larger than previously reported. The metric is defined as a histogram, and the new semantic could change the distribution.
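
To illustrate the impact, here is a small sketch (not code from this PR; bucket layout and label names are assumptions) of how such a histogram might be defined with the Prometheus Go client. An operation that previously reported near-zero latency can now land in a much higher bucket, shifting any quantile or SLO computed from the histogram.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical stand-in for the histogram; buckets and labels are illustrative.
	volumeOperationTotalSeconds := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "volume_operation_total_seconds",
			Help:    "Total end-to-end time in seconds of a volume operation.",
			Buckets: prometheus.ExponentialBuckets(0.1, 2, 12), // 0.1s .. ~200s
		},
		[]string{"operation_name", "plugin_name"},
	)

	// Old semantic: an externally provisioned volume reported ~0s.
	volumeOperationTotalSeconds.WithLabelValues("provision", "example-plugin").Observe(0.02)
	// New semantic: the same operation, measured end to end, may take much longer.
	volumeOperationTotalSeconds.WithLabelValues("provision", "example-plugin").Observe(42.0)

	fmt.Println("recorded sample observations")
}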
@k8s-ci-robot (Contributor) commented on May 18, 2019

Hi @yuxiangqian. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot requested review from jingxu97 and msau42 on May 18, 2019

@yuxiangqian (Contributor, Author) commented on May 18, 2019

/assign @msau42 @gnufied
/sig-storage

@yuxiangqian (Contributor, Author) commented on May 18, 2019

@msau42 (Member) commented on May 18, 2019

/ok-to-test

@yuxiangqian force-pushed the yuxiangqian:metrics branch from ee9d664 to bc134bf on May 20, 2019

// End of the provision + binding operation lifecycle
// [Unit test 12-1, 12-2, 12-4]
metrics.RecordMetric(claimKey, &ctrl.operationTimestamps, nil)
ctrl.operationTimestamps.Delete(claimKey)

@gnufied (Member) commented on May 20, 2019

Do you think we could move RecordMetric to be a function on operationTimestamps, so that the act of recording the metric also automatically deletes the key from the cache? It would be less error prone that way.

@yuxiangqian (Contributor, Author) commented on May 21, 2019

Agreed: leaving the life-cycle management of a cache item to the caller (in this case pv_controller/pv_controller_base) may cause cache entries to leak if the caller forgets to call Delete. I propose managing the deletion within RecordMetric, but not moving that function inside OperationStartTimeCache; it would stay in metrics.go instead. To me, the cache should not be dealing with cache entry lifecycle management. WDYT?

@yuxiangqian (Contributor, Author) commented on May 22, 2019

updated "RecordMetric" as described above. when an operation has been successfully executed, i.e., err == nil, the corresponding cache'd entry will be removed within the function.

@msau42 (Member) left a comment

just some minor comments. otherwise lgtm

/approve

@msau42 (Member) commented on May 22, 2019

Will let @gnufied do final review

@k8s-ci-robot (Contributor) commented on May 22, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, yuxiangqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@msau42 (Member) commented on May 22, 2019

/priority important-soon

@yuxiangqian (Contributor, Author) commented on May 30, 2019

/retest

1 similar comment
@yuxiangqian (Contributor, Author) commented on May 30, 2019

/retest

@yuxiangqian (Contributor, Author) commented on May 30, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

@gnufied (Member) commented on May 30, 2019

/hold

@k8s-ci-robot added the do-not-merge/hold label and removed the lgtm label on May 30, 2019

@yuxiangqian (Contributor, Author) commented on May 30, 2019

/retest

@yuxiangqian force-pushed the yuxiangqian:metrics branch from f73e6f3 to 38a884a on May 30, 2019

@gnufied (Member) commented on May 30, 2019

/hold cancel
/lgtm

@k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label on May 30, 2019

@gnufied (Member) commented on May 30, 2019

@saad-ali @childsb can we add this PR to 1.15 milestone please?

@yuxiangqian (Contributor, Author) commented on May 30, 2019

/retest

@childsb added this to the v1.15 milestone on May 30, 2019

@yuxiangqian (Contributor, Author) commented on May 30, 2019

/retest

@yuxiangqian (Contributor, Author) commented on May 30, 2019

/test pull-kubernetes-e2e-gce-device-plugin-gpu

@davidz627 (Contributor) commented on May 30, 2019

I'm concerned with the flakiness of these tests. @yuxiangqian, have you confirmed the flakiness is due to issues outside of the scope of this PR?

@yuxiangqian (Contributor, Author) commented on May 30, 2019

I'm concerned with the flakiness of these tests. @yuxiangqian, have you confirmed the flakiness is due to issues outside of the scope of this PR?

I think so. I checked most of the test failures; e.g., for the GPU one it reports errors like:
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
ERROR: (gcloud.compute.instance-groups.managed.wait-until-stable) Timeout while waiting for group to become stable.

for the integration one:
I0530 19:30:31.814] dangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:"mygroup.example.com/v1beta1", Kind:"foozpxh8a", Name:"ownervtkr7", UID:"b20b3559-cfeb-4a02-9c01-58bdc5736d21", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}}
I0530 19:30:31.814] FAIL k8s.io/kubernetes/test/integration/garbagecollector 558.064s

and see issues here:
#70180

At least I could not find anything hinting that this PR caused issues like those. I have also conducted my own e2e test using GCP, and none of those failures appears.

@yuxiangqian (Contributor, Author) commented on May 30, 2019

Briefly checked some other PRs, like #78541; it seems the bazel-test and integration tests fail from time to time.

@davidz627 (Contributor) commented on May 30, 2019

sgtm thanks for the detailed answer @yuxiangqian
/retest

@yuxiangqian (Contributor, Author) commented on May 30, 2019

/test pull-kubernetes-e2e-gce-storage-slow

@k8s-ci-robot merged commit b3c1bbb into kubernetes:master on Jun 1, 2019

21 checks passed

cla/linuxfoundation: yuxiangqian authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-csi-serial: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-storage-slow: Job succeeded.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.
func (ctrl *PersistentVolumeController) getProvisionerNameFromVolume(volume *v1.PersistentVolume) string {
	plugin, err := ctrl.findDeletablePlugin(volume)
	if err != nil {
		return "N/A"
	}

@msau42 (Member) commented on Jun 3, 2019

Local PV will end up with "N/A" as the plugin since it's not dynamically provisioned and the plugin itself doesn't implement delete.

Should we let it fall through instead of returning "N/A" here?
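
One possible shape of that fall-through, sketched with assumed helpers (the provisioned-by annotation lookup is an assumption, not code from this PR):

// Sketch only: only give up and return "N/A" after also checking the standard
// "pv.kubernetes.io/provisioned-by" annotation, so local PVs without a
// deletable plugin can still report a meaningful provisioner name.
func (ctrl *PersistentVolumeController) getProvisionerNameFromVolume(volume *v1.PersistentVolume) string {
	if plugin, err := ctrl.findDeletablePlugin(volume); err == nil && plugin != nil {
		return plugin.GetPluginName()
	}
	// Fall through for volumes (e.g. local PVs) whose plugin does not implement
	// delete: the provisioner annotation may still identify who created it.
	if name, ok := volume.Annotations["pv.kubernetes.io/provisioned-by"]; ok {
		return name
	}
	return "N/A"
}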
