Track driver deploy time in e2e test pipeline #815

Merged: 1 commit, Mar 24, 2021

AndyXiangLi
Contributor

@AndyXiangLi AndyXiangLi commented Mar 23, 2021

Is this a bug fix or adding new feature?
Fixes #804
What is this PR about? / Why do we need it?
Add driver start time tracking to the e2e test so we have a better understanding of the driver's behavior.
Set the threshold to 20s for now; we may adjust it as we get more data.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 23, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndyXiangLi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 23, 2021
@coveralls

coveralls commented Mar 23, 2021

Pull Request Test Coverage Report for Build 1765

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 81.789%

Totals Coverage Status
Change from base Build 1757: 0.0%
Covered Lines: 1756
Relevant Lines: 2147

💛 - Coveralls

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 24, 2021
@@ -116,6 +118,15 @@ if [[ -r "${EBS_SNAPSHOT_CRD}" ]]; then
kubectl apply -f "$EBS_SNAPSHOT_CRD"
# TODO deploy snapshot controller too instead of including in helm chart
fi
endSec=$(date +'%s')
Contributor

We have to wait for the pods to become ready, which is not so easy in bash, to be honest.

Contributor Author

Yeah, agreed. I did some research but had no luck finding a tool to track container startup time.
One thing, though: if we use helm's --wait flag, helm will wait for the containersReady condition before exiting. IMO that is good enough for now.
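
A minimal sketch of the timing idea discussed here, assuming the deployment is a single blocking command (such as a helm call with --wait). The helper name `time_command` and the variable `ELAPSED_SECONDS` are hypothetical and not part of the PR; the chart path and namespace in the commented example are placeholders.

```shell
# Hypothetical helper: run a command and record how many whole seconds it took
# in ELAPSED_SECONDS. The command's exit status is passed through unchanged.
time_command() {
  local start_sec end_sec status=0
  start_sec=$(date +'%s')
  "$@" || status=$?
  end_sec=$(date +'%s')
  ELAPSED_SECONDS=$(( end_sec - start_sec ))
  return "$status"
}

# Example (not executed here; chart path and namespace are placeholders):
#   time_command helm upgrade --install "$DRIVER_NAME" ./chart \
#     --namespace kube-system --wait
#   echo "Driver deployment took ${ELAPSED_SECONDS}s"
```

Because `date +'%s'` has one-second resolution, the measurement is only accurate to about a second, which should be fine for a threshold measured in tens of seconds.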

Contributor Author

And I'm open to any suggestions for tracking this info :-)

Contributor

Ooh yeah, let's use helm's built-in --wait flag; that is cool if they have it.

Otherwise I was thinking of how to call a python/go script from here and use the kube python/go client, which would be really ugly...

Contributor

Maybe we can use CloudWatch? The tests run in an AWS-internal account, and if running locally, I don't mind pushing metrics to my own account. We can fail open in case the metric push fails for whatever reason (a transient CloudWatch issue or similar).
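
A rough sketch of the fail-open behavior suggested here, not something the PR implements. The wrapper name `publish_fail_open` is hypothetical, and the namespace and metric name in the commented example are made up for illustration.

```shell
# Hypothetical wrapper: run a metric-publishing command, but never let its
# failure fail the caller (fail-open). A warning is printed to stderr instead.
publish_fail_open() {
  if "$@"; then
    return 0
  fi
  echo "WARNING: metric publish failed, continuing (fail-open)" >&2
  return 0
}

# Example (not executed here; namespace and metric name are invented):
#   publish_fail_open aws cloudwatch put-metric-data \
#     --namespace "EbsCsiDriver/E2E" --metric-name DriverStartSeconds \
#     --unit Seconds --value "$secondUsed"
```

Wrapping the publish step this way keeps the test result tied to the driver's behavior rather than to the availability of the metrics backend.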

Contributor

That's a good find. Does it wait for everything deployed by the chart to be ready?

Contributor

Ideally there is a way to expose the metric publicly, though; CloudWatch won't be visible the way https://testgrid.k8s.io/provider-aws-efs-csi-driver#e2e-test&width=20 is.

Contributor Author

Yes, it will wait for all the containers to be ready
https://helm.sh/docs/intro/using_helm/

@AndyXiangLi AndyXiangLi changed the title [WIP] test driver deploy time, do not merge!!! Track driver deploy time in e2e test pipeline Mar 24, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2021
hack/e2e/run.sh Outdated
@@ -25,6 +25,7 @@ source "${BASE_DIR}"/util.sh

DRIVER_NAME=${DRIVER_NAME:-aws-ebs-csi-driver}
CONTAINER_NAME=${CONTAINER_NAME:-ebs-plugin}
DRIVER_START_TIME_THRESHOLD=25
Contributor

How did we come up with this number? I feel like it should be higher? We can adjust later as we go.

Contributor Author

On my cluster it usually takes ~15s, which is how I picked this number. But it makes sense to increase it a little in the initial commit; we can adjust it later.

Contributor

Yeah looks like it failed on the CI.

Contributor Author

I think the delta is the image pull time. Cold start apparently takes longer than I expected lol

hack/e2e/run.sh Outdated
secondUsed=$(( (endSec-startSec)/1 ))
# Set timeout threshold as 20 seconds for now, usually it takes less than 10s to startup
if [ $secondUsed -gt $DRIVER_START_TIME_THRESHOLD ]; then
loudecho "Driver start timeout, test fail!"
Contributor

You should log here how long it took and what the threshold is, so we can see the gap immediately without reading the code.

Contributor Author

Updated

@wongma7
Contributor

wongma7 commented Mar 24, 2021

I think the main bottleneck will be image pulling, which we don't really control. I guess if we are trying to measure from a UX perspective what a 'cold start' would look like, then the details of what is happening under the hood don't matter, just the overall number. Anyway, I'm OK with merging a variation of this and seeing how it goes.

hack/e2e/run.sh Outdated
secondUsed=$(( (endSec-startSec)/1 ))
# Set timeout threshold as 20 seconds for now, usually it takes less than 10s to startup
if [ $secondUsed -gt $DRIVER_START_TIME_THRESHOLD ]; then
loudecho "Driver start timeout, Cost $secondUsed but the threshold is $DRIVER_START_TIME_THRESHOLD Fail the test."
Contributor

nit: s/Cost/Took

Contributor Author

Looks like it took ~30s to start up now. Do you think it is OK if I change the threshold to 45s? 60s looks like a bit too much here.

Contributor

@ayberk ayberk Mar 24, 2021

I don't see a problem with going to 60s honestly. For now we can use this to make sure we don't introduce a change that'd increase the startup time too much. So let's start with a high number and go from there.
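
A minimal sketch of such a regression guard, assuming the elapsed time has already been measured in whole seconds. The function name `check_start_time` is hypothetical, and making the threshold overridable through the environment is an assumption for illustration; the PR itself hard-codes the value in hack/e2e/run.sh.

```shell
# Hypothetical check: check_start_time ELAPSED [THRESHOLD]
# The threshold falls back to DRIVER_START_TIME_THRESHOLD_SECONDS, then to 60.
check_start_time() {
  local elapsed threshold
  elapsed=$1
  threshold=${2:-${DRIVER_START_TIME_THRESHOLD_SECONDS:-60}}
  if [ "$elapsed" -gt "$threshold" ]; then
    # Log both numbers so the gap is visible without reading the script.
    echo "Driver start timeout: took ${elapsed}s, threshold is ${threshold}s. Failing the test."
    return 1
  fi
  echo "Driver started in ${elapsed}s (threshold ${threshold}s)."
}
```

Starting with a generous threshold and logging the measured value on every run makes it possible to tighten the limit later based on what CI actually reports.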

Contributor Author

Sounds good, thank you!

hack/e2e/run.sh Outdated
@@ -25,6 +25,7 @@ source "${BASE_DIR}"/util.sh

DRIVER_NAME=${DRIVER_NAME:-aws-ebs-csi-driver}
CONTAINER_NAME=${CONTAINER_NAME:-ebs-plugin}
DRIVER_START_TIME_THRESHOLD=60
Contributor

One last thing: can you add a comment like # seconds here?

Contributor Author

Updated, added seconds in the name to make it clear.

@@ -25,6 +25,7 @@ source "${BASE_DIR}"/util.sh

DRIVER_NAME=${DRIVER_NAME:-aws-ebs-csi-driver}
CONTAINER_NAME=${CONTAINER_NAME:-ebs-plugin}
DRIVER_START_TIME_THRESHOLD_SECONDS=60
Contributor

nice :)

@ayberk
Contributor

ayberk commented Mar 24, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 24, 2021
@k8s-ci-robot k8s-ci-robot merged commit cdbec43 into kubernetes-sigs:master Mar 24, 2021
@AndyXiangLi AndyXiangLi mentioned this pull request Mar 25, 2021
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[Performance Test] Track driver start time pre-submit
5 participants