scale-test: Measure APIServer SLOs #15963

Merged
merged 4 commits into from
Oct 17, 2023

hakuna-matatah
Contributor

@hakuna-matatah hakuna-matatah commented Sep 25, 2023

- This PR makes CL2 install the Prometheus stack locally before kicking off the load test
- Updates Go deps from kubernetes-sigs/kubetest2#244

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 25, 2023
@k8s-ci-robot
Contributor

Hi @hakuna-matatah. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Member

dims commented Sep 25, 2023

/ok-to-test
/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 25, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 25, 2023
@hakman
Member

hakman commented Sep 25, 2023

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakman
Member

hakman commented Sep 26, 2023

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakman
Member

hakman commented Sep 26, 2023

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakman hakman requested review from hakman and removed request for olemarkus and johngmyers September 26, 2023 09:13
@hakman
Member

hakman commented Sep 26, 2023

@hakuna-matatah

kubetest2 kops -v=2 --cloud-provider=aws --cluster-name=e2e-ff02749ef8-a423a.test-cncf-aws.k8s.io --kops-binary-path=/home/prow/go/src/k8s.io/bin/kops --admin-access=0.0.0.0/0 --env=KOPS_FEATURE_FLAGS=ClusterAddons, --validation-wait=45m --test=clusterloader2 --kubernetes-version=v1.28.2 --enable-prometheus-server=true --prometheus-pvc-storage-class=gp2 --provider=aws --repo-root=/home/prow/go/src/k8s.io/perf-tests --test-configs=/home/prow/go/src/k8s.io/perf-tests/clusterloader2/testing/load/config.yaml --test-configs=/home/prow/go/src/k8s.io/perf-tests/clusterloader2/testing/huge-service/config.yaml --test-configs=/home/prow/go/src/k8s.io/perf-tests/clusterloader2/testing/access-tokens/config.yaml --kube-config=/tmp/kubeconfig.yL6RMrPS9
Error: unknown flag: --enable-prometheus-server

@dims
Member

dims commented Sep 26, 2023

@hakman kubernetes-sigs/kubetest2#244

@hakman
Member

hakman commented Sep 26, 2023

Thanks @dims! 😄
@hakuna-matatah Could you also check what limit was hit when both pre-submit and periodic were running at the same time this morning?

@upodroid
Member

One more thing: you'll run into this problem if you try to bump the kubetest2 deps:

kubernetes-sigs/boskos#173

@hakuna-matatah
Contributor Author

> Thanks @dims! 😄 @hakuna-matatah Could you also check what limit was hit when both pre-submit and periodic were running at the same time this morning?

What limits are you referring to? EC2 vCPUs?

@hakman
Member

hakman commented Sep 26, 2023

> What limits are you referring to? EC2 vCPUs?

I suspect the limit is on network interfaces (definitely not vCPUs).
Nodes are allocated, but some pods are still pending:

VALIDATION ERRORS
KIND	NAME						MESSAGE
Machine	i-004735166086eb1a8				machine "i-004735166086eb1a8" has not yet joined cluster
Machine	i-04d472f0ade8444a2				machine "i-04d472f0ade8444a2" has not yet joined cluster
Machine	i-080a7e812c07c8b5f				machine "i-080a7e812c07c8b5f" has not yet joined cluster
Machine	i-08765c46553ad7ec5				machine "i-08765c46553ad7ec5" has not yet joined cluster
Machine	i-0cbcdc04bc9c43d8e				machine "i-0cbcdc04bc9c43d8e" has not yet joined cluster
Machine	i-0d78f9a4fd2ac19f1				machine "i-0d78f9a4fd2ac19f1" has not yet joined cluster
Machine	i-0da80da3246b8bf33				machine "i-0da80da3246b8bf33" has not yet joined cluster
Machine	i-0dcf5f85fc27ec1ae				machine "i-0dcf5f85fc27ec1ae" has not yet joined cluster
Machine	i-0ec9351e879207a21				machine "i-0ec9351e879207a21" has not yet joined cluster
Machine	i-0f06977ad9681e2a0				machine "i-0f06977ad9681e2a0" has not yet joined cluster
Machine	i-0f7a95e2c34a855f4				machine "i-0f7a95e2c34a855f4" has not yet joined cluster
Node	i-040f427732d5daf27				node "i-040f427732d5daf27" of role "node" is not ready
Pod	kube-system/aws-node-r8dxf			system-node-critical pod "aws-node-r8dxf" is pending
Pod	kube-system/aws-node-tm696			system-node-critical pod "aws-node-tm696" is pending
Pod	kube-system/aws-node-trrdn			system-node-critical pod "aws-node-trrdn" is pending
Pod	kube-system/aws-node-wqrb6			system-node-critical pod "aws-node-wqrb6" is pending
Pod	kube-system/ebs-csi-node-cknr6			system-node-critical pod "ebs-csi-node-cknr6" is pending
Pod	kube-system/ebs-csi-node-kj8r5			system-node-critical pod "ebs-csi-node-kj8r5" is pending
Pod	kube-system/ebs-csi-node-qrgf5			system-node-critical pod "ebs-csi-node-qrgf5" is pending
Pod	kube-system/ebs-csi-node-snsfc			system-node-critical pod "ebs-csi-node-snsfc" is pending
Pod	kube-system/kube-proxy-i-004735166086eb1a8	system-node-critical pod "kube-proxy-i-004735166086eb1a8" is pending

This is the run:
https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/1706544867737341952/
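
For a quick read of a listing like the one above, the rows can be tallied by kind. This is only a minimal sketch: the helper name `tally_by_kind` is hypothetical, the sample holds just a few rows copied from the output above, and it assumes the `VALIDATION ERRORS` banner line has already been dropped.

```python
from collections import Counter


def tally_by_kind(rows):
    """Count validation rows by their first column (KIND), skipping the header row."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        if not fields or fields[0] == "KIND":
            continue  # skip blank lines and the column header
        counts[fields[0]] += 1
    return counts


# A few rows copied from the validation output above (tab-separated columns).
rows = [
    'KIND\tNAME\tMESSAGE',
    'Machine\ti-004735166086eb1a8\tmachine "i-004735166086eb1a8" has not yet joined cluster',
    'Node\ti-040f427732d5daf27\tnode "i-040f427732d5daf27" of role "node" is not ready',
    'Pod\tkube-system/aws-node-r8dxf\tsystem-node-critical pod "aws-node-r8dxf" is pending',
    'Pod\tkube-system/ebs-csi-node-cknr6\tsystem-node-critical pod "ebs-csi-node-cknr6" is pending',
]

print(dict(tally_by_kind(rows)))
```

Applied to the full listing above, the same tally gives 11 Machine rows, 1 Node row and 9 Pod rows.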

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 26, 2023
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 17, 2023
@hakuna-matatah
Contributor Author

The tests are now failing because the kube-proxy Service selects pods by the label key [component](https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/prometheus/manifests/default/kube-proxy-service.yaml#L2) by default, whereas kops creates kube-proxy with the label key k8s-app. As a result the kube-proxy Service does not match any of the kube-proxy pod endpoints, Prometheus cannot scrape kube-proxy, and the CL2 tests fail with the error: NetworkProgrammingLatency gathering error: got unexpected number of samples: 0.

I resolved this in the latest commit on this PR by passing an environment variable that sets the kube-proxy selector key.
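
The mismatch described above can be illustrated with a minimal sketch of equality-based label selection. Only the label keys component and k8s-app come from this discussion; the label value `kube-proxy`, the pod names, and the helper names are assumptions made for illustration.

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: dict[str, str], pods: dict[str, dict[str, str]]) -> list[str]:
    """Return the names of the pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]


# Hypothetical pods, labelled with the key that kops uses for kube-proxy.
pods = {
    "kube-proxy-node-a": {"k8s-app": "kube-proxy"},
    "kube-proxy-node-b": {"k8s-app": "kube-proxy"},
}

default_selector = {"component": "kube-proxy"}   # default key; value is assumed
overridden_selector = {"k8s-app": "kube-proxy"}  # key after the override; value is assumed

print(matching_pods(default_selector, pods))     # no pods selected, so no endpoints
print(matching_pods(overridden_selector, pods))  # both pods selected
```

With the default key nothing is selected, so the Service has no endpoints for Prometheus to scrape; switching the key is enough to make both pods match.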

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

Now that the kube-proxy endpoints are detected thanks to the fix described in the comment above, I ran into another blocker: Prometheus is not able to get data from the kube-proxy metrics endpoint.
Screenshot 2023-10-17 at 1 10 10 PM

@hakuna-matatah
Contributor Author

> Now that the kube-proxy endpoints are detected thanks to the fix described in the comment above, I ran into another blocker: Prometheus is not able to get data from the kube-proxy metrics endpoint. (Screenshot 2023-10-17 at 1 10 10 PM)

It appears that MetricsBindAddress on kube-proxy defaults to localhost, which is why I was able to get metrics locally while Prometheus could not scrape them on the node-level port; see the output below for more details. I'm planning to set the

HTTP/1.1 200 OK
Content-Type: text/plain; version=0.0.4; charset=utf-8
Process-Start-Time-Unix: 1697069093
Date: Tue, 17 Oct 2023 20:13:53 GMT
Transfer-Encoding: chunked

# HELP aggregator_discovery_aggregation_count_total [ALPHA] Counter of number of times discovery was aggregated
# TYPE aggregator_discovery_aggregation_count_total counter
aggregator_discovery_aggregation_count_total 0
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] 

I have set the kube-proxy metrics bind address with spec.kubeProxy.metricsBindAddress=0.0.0.0:10249, and it now looks like the metrics endpoint can be scraped.

Screenshot 2023-10-17 at 1 20 29 PM
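
The bind-address issue above comes down to whether the listener is loopback-only. A minimal sketch of that check follows; the helper name is hypothetical, `127.0.0.1:10249` is an assumed stand-in for the localhost default mentioned above, and `0.0.0.0:10249` is the value set in this comment.

```python
import ipaddress


def reachable_from_other_hosts(bind_address: str) -> bool:
    """Return False when the host part of a host:port bind address is loopback-only."""
    host, _, _port = bind_address.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::1]:10249
    if host == "localhost":
        return False
    return not ipaddress.ip_address(host).is_loopback


# Assumed localhost default: only processes on the same node can connect.
print(reachable_from_other_hosts("127.0.0.1:10249"))
# Value from this comment: listens on all interfaces, so a scraper can connect via the node IP.
print(reachable_from_other_hosts("0.0.0.0:10249"))
```

This only models the listening side; whether a scrape actually succeeds also depends on network policy and security groups, which the sketch does not cover.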

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 17, 2023
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah hakuna-matatah changed the title from "[WIP]Update the worker node architecture to amd64 for prometheus and grafana nodes to come up successfully." to "[WIP]Update the worker node architecture to amd64 for prometheus and grafana nodes to come up successfully" Oct 17, 2023
@hakuna-matatah hakuna-matatah changed the title from "[WIP]Update the worker node architecture to amd64 for prometheus and grafana nodes to come up successfully" to "[WIP] Fix all issues in description to measure APIServer SLOs on KOPS" Oct 17, 2023
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@k8s-ci-robot
Contributor

k8s-ci-robot commented Oct 17, 2023

@hakuna-matatah: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| presubmit-kops-aws-scale-amazonvpc-using-cl2 | 211d111 | link | true | /test presubmit-kops-aws-scale-amazonvpc-using-cl2 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@hakuna-matatah
Contributor Author

hakuna-matatah commented Oct 17, 2023

@hakuna-matatah hakuna-matatah changed the title from "[WIP] Fix all issues in description to measure APIServer SLOs on KOPS" to "Fix all issues in description to measure APIServer SLOs on KOPS" Oct 17, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 17, 2023
@hakuna-matatah
Contributor Author

/retest

@hakman hakman changed the title from "Fix all issues in description to measure APIServer SLOs on KOPS" to "scale-test: Measure APIServer SLOs" Oct 17, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 17, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, hakman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2023
@k8s-ci-robot k8s-ci-robot merged commit 1038071 into kubernetes:master Oct 17, 2023
22 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 17, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.