
ci-kubernetes-e2e-gci-gce-scalability-watch-list-off has unexpectedly high LIST latency #2287

Open
mborsz opened this issue Jun 9, 2023 · 7 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@mborsz (Member) commented Jun 9, 2023

I was reviewing the performance of kube-apiserver in the watchlist-off tests (https://k8s-testgrid.appspot.com/sig-scalability-experiments#watchlist-off) and found that the LIST latency is around 40s:

```
I0609 08:33:55.589834      11 trace.go:236] Trace[2051566814]: "SerializeObject" audit-id:8a686097-a6c2-4869-9ff1-27f5b7a9dce5,method:GET,url:/api/v1/namespaces/watch-list-1/secrets,protocol:HTTP/2.0,mediaType:application/vnd.kubernetes.protobuf,encoder:{"encodeGV":"v1","encoder":"protobuf","name":"versioning"} (09-Jun-2023 08:33:11.330) (total time: 44259ms):
Trace[2051566814]: ---"Write call succeeded" writer:*gzip.Writer,size:304090734,firstWrite:true 43754ms (08:33:55.589)
Trace[2051566814]: [44.259564981s] [44.259564981s] END
I0609 08:33:55.589912      11 trace.go:236] Trace[1886779945]: "List" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:8a686097-a6c2-4869-9ff1-27f5b7a9dce5,client:35.226.210.156,protocol:HTTP/2.0,resource:secrets,scope:namespace,url:/api/v1/namespaces/watch-list-1/secrets,user-agent:watch-list/v0.0.0 (linux/amd64) kubernetes/$Format,verb:LIST (09-Jun-2023 08:33:11.328) (total time: 44260ms):
Trace[1886779945]: ---"Writing http response done" count:400 44259ms (08:33:55.589)
Trace[1886779945]: [44.260928281s] [44.260928281s] END
I0609 08:33:55.590150      11 httplog.go:132] "HTTP" verb="LIST" URI="/api/v1/namespaces/watch-list-1/secrets?limit=500&resourceVersion=0" latency="44.263133267s" userAgent="watch-list/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="8a686097-a6c2-4869-9ff1-27f5b7a9dce5" srcIP="35.226.210.156:45202" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="44.262028085s" resp=200
```

This is quite high for ~290 MiB of data. From experience, we usually observe 20-30 MiB/s write throughput due to compression, which should translate to more like ~10-15s of latency.
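
As a rough sanity check of that estimate (a back-of-the-envelope sketch; the 304090734-byte size comes from the trace above, and 20-30 MiB/s is the throughput figure quoted here, not something measured in this run):

```go
package main

import "fmt"

func main() {
	const respBytes = 304090734.0        // "Write call succeeded" size from the trace above
	sizeMiB := respBytes / (1024 * 1024) // ≈ 290 MiB
	for _, mibPerSec := range []float64{20, 30} {
		fmt.Printf("at %.0f MiB/s: ~%.1fs\n", mibPerSec, sizeMiB/mibPerSec)
	}
	// Prints roughly 14.5s at 20 MiB/s and 9.7s at 30 MiB/s, versus the ~44s observed.
}
```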

I suspect we are short on CPU or network egress on either the master or the node side.
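
One way to narrow that down could be to time the same LIST from different vantage points (e.g. from a pod on the worker node and from a client close to the master). A minimal client-go sketch, assuming an in-cluster config and a service account allowed to list secrets in watch-list-1; the namespace and list options mirror the request in the trace above, everything else is illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster with permissions to list
	// secrets in the watch-list-1 namespace.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	start := time.Now()
	list, err := client.CoreV1().Secrets("watch-list-1").List(context.TODO(), metav1.ListOptions{
		Limit:           500, // same options as the request in the trace above
		ResourceVersion: "0",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("LIST returned %d secrets in %v\n", len(list.Items), time.Since(start))
}
```

If the latency measured close to the apiserver were much lower than from the node, that would point at node-side CPU or egress rather than the apiserver itself.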

I'm not sure how important it is, but I'm afraid it may affect some of the benchmarks we are running.

/assign @p0lyn0mial

/cc @serathius

@mborsz mborsz added the kind/bug Categorizes issue or PR as related to a bug. label Jun 9, 2023
@p0lyn0mial (Contributor) commented Jul 7, 2023

Hey, thanks for the info!
The first suspect is the test itself, which runs on a worker node.
The easiest approach would be to increase the number of CPUs available to the test, especially since nothing else is running on that machine.
The second approach would be to monitor CPU and RAM usage from within the test itself (see the sketch after this comment). That would be more time-consuming and would likely require correlating the in-test measurements with the machine-level usage.

I think I will start with the first approach.
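
For reference, a minimal sketch of what that second, in-test monitoring approach could look like, using standard Go runtime/syscall APIs; the sampling interval and log format are illustrative assumptions, not part of the existing test:

```go
package main

import (
	"log"
	"runtime"
	"syscall"
	"time"
)

// sampleUsage periodically logs the process's own CPU time and heap usage,
// so spikes can later be correlated with the LIST latency seen on the apiserver.
func sampleUsage(interval time.Duration) {
	for range time.Tick(interval) {
		var ru syscall.Rusage
		if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
			log.Printf("getrusage failed: %v", err)
			continue
		}
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("cpu user=%ds sys=%ds heap=%dMiB",
			ru.Utime.Sec, ru.Stime.Sec, ms.HeapAlloc/(1024*1024))
	}
}

func main() {
	go sampleUsage(10 * time.Second) // illustrative interval
	// ... the actual test workload would run here ...
	select {}
}
```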

@p0lyn0mial (Contributor) commented Jul 7, 2023

Interestingly, at some point we increased the number of test replicas to 2 (#2281).

This change was reflected in CPU and RAM usage, but not so much in the latency (see the screenshots below).

[Screenshots attached (2023-07-07, 13:52:29 / 13:52:52 / 13:53:23): CPU usage, RAM usage, and latency graphs]

@wojtek-t (Member)

/cc

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024
@p0lyn0mial (Contributor)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 24, 2024
@p0lyn0mial (Contributor)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2024