
ci-kubernetes-e2e-gci-gce-scalability-watch-list-off has unexpectedly high LIST latency #2287

Open
mborsz opened this issue Jun 9, 2023 · 7 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@mborsz (Member) commented Jun 9, 2023

I was reviewing the performance of kube-apiserver in the watchlist-off tests (https://k8s-testgrid.appspot.com/sig-scalability-experiments#watchlist-off) and found that the LIST latency is around 40s:

```
I0609 08:33:55.589834      11 trace.go:236] Trace[2051566814]: "SerializeObject" audit-id:8a686097-a6c2-4869-9ff1-27f5b7a9dce5,method:GET,url:/api/v1/namespaces/watch-list-1/secrets,protocol:HTTP/2.0,mediaType:application/vnd.kubernetes.protobuf,encoder:{"encodeGV":"v1","encoder":"protobuf","name":"versioning"} (09-Jun-2023 08:33:11.330) (total time: 44259ms):
Trace[2051566814]: ---"Write call succeeded" writer:*gzip.Writer,size:304090734,firstWrite:true 43754ms (08:33:55.589)
Trace[2051566814]: [44.259564981s] [44.259564981s] END
I0609 08:33:55.589912      11 trace.go:236] Trace[1886779945]: "List" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:8a686097-a6c2-4869-9ff1-27f5b7a9dce5,client:35.226.210.156,protocol:HTTP/2.0,resource:secrets,scope:namespace,url:/api/v1/namespaces/watch-list-1/secrets,user-agent:watch-list/v0.0.0 (linux/amd64) kubernetes/$Format,verb:LIST (09-Jun-2023 08:33:11.328) (total time: 44260ms):
Trace[1886779945]: ---"Writing http response done" count:400 44259ms (08:33:55.589)
Trace[1886779945]: [44.260928281s] [44.260928281s] END
I0609 08:33:55.590150      11 httplog.go:132] "HTTP" verb="LIST" URI="/api/v1/namespaces/watch-list-1/secrets?limit=500&resourceVersion=0" latency="44.263133267s" userAgent="watch-list/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="8a686097-a6c2-4869-9ff1-27f5b7a9dce5" srcIP="35.226.210.156:45202" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="44.262028085s" resp=200
```

This is quite high for ~290 MiB of data. From experience, we usually observe 20-30 MiB/s write throughput due to compression, which should translate to more like ~10-15s of latency.
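
As a rough sanity check of that estimate (a back-of-the-envelope sketch; the 304090734-byte size comes from the trace above, and 20-30 MiB/s is the throughput figure quoted here, not something measured in this run):

```go
package main

import "fmt"

func main() {
	const respBytes = 304090734.0        // "Write call succeeded" size from the trace above
	sizeMiB := respBytes / (1024 * 1024) // ≈ 290 MiB
	for _, mibPerSec := range []float64{20, 30} {
		fmt.Printf("at %.0f MiB/s: ~%.1fs\n", mibPerSec, sizeMiB/mibPerSec)
	}
	// Prints roughly 14.5s at 20 MiB/s and 9.7s at 30 MiB/s, versus the ~44s observed.
}
```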

I suspect we are short on CPU or network egress on either the master or the node side.
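
One way to narrow that down could be to time the same LIST from different vantage points (e.g. from a pod on the worker node and from a client close to the master). A minimal client-go sketch, assuming an in-cluster config and a service account allowed to list secrets in watch-list-1; the namespace and list options mirror the request in the trace above, everything else is illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster with permissions to list
	// secrets in the watch-list-1 namespace.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	start := time.Now()
	list, err := client.CoreV1().Secrets("watch-list-1").List(context.TODO(), metav1.ListOptions{
		Limit:           500, // same options as the request in the trace above
		ResourceVersion: "0",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("LIST returned %d secrets in %v\n", len(list.Items), time.Since(start))
}
```

If the latency measured close to the apiserver were much lower than from the node, that would point at node-side CPU or egress rather than the apiserver itself.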

I'm not sure how important it is, but I'm afraid it may affect some of the benchmarks we are running.

/assign @p0lyn0mial

/cc @serathius

@mborsz mborsz added the kind/bug Categorizes issue or PR as related to a bug. label Jun 9, 2023
@p0lyn0mial (Contributor) commented Jul 7, 2023

Hey, thanks for the info!
The first suspect is the test itself, which runs on a worker node.
The easiest approach would be to increase the number of CPUs available to the test, especially since nothing else is running on that machine.
The second approach would be to monitor CPU and RAM usage from within the test itself (see the sketch after this comment). That would be more time-consuming and would likely require correlating the in-test measurements with the machine-level usage.

I think I will start with the first approach.
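
For reference, a minimal sketch of what that second, in-test monitoring approach could look like, using standard Go runtime/syscall APIs; the sampling interval and log format are illustrative assumptions, not part of the existing test:

```go
package main

import (
	"log"
	"runtime"
	"syscall"
	"time"
)

// sampleUsage periodically logs the process's own CPU time and heap usage,
// so spikes can later be correlated with the LIST latency seen on the apiserver.
func sampleUsage(interval time.Duration) {
	for range time.Tick(interval) {
		var ru syscall.Rusage
		if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
			log.Printf("getrusage failed: %v", err)
			continue
		}
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("cpu user=%ds sys=%ds heap=%dMiB",
			ru.Utime.Sec, ru.Stime.Sec, ms.HeapAlloc/(1024*1024))
	}
}

func main() {
	go sampleUsage(10 * time.Second) // illustrative interval
	// ... the actual test workload would run here ...
	select {}
}
```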

@p0lyn0mial (Contributor) commented Jul 7, 2023

Interestingly, at some point we increased the number of test replicas to 2 (#2281).

This change was reflected in CPU and RAM usage, but not so much in the latency (see the screenshots below).

[Screenshots attached (2023-07-07, 13:52:29 / 13:52:52 / 13:53:23): CPU usage, RAM usage, and latency graphs]

@wojtek-t (Member)

/cc

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024
@p0lyn0mial (Contributor)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 24, 2024
@p0lyn0mial (Contributor)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2024