
Kubelet JSON logging performance test #102430

Closed
serathius opened this issue May 29, 2021 · 25 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node Categorizes an issue or PR as relevant to SIG Node.
sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
triage/accepted Indicates an issue or PR is ready to be actively worked on.
wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging.

Comments

@serathius
Contributor

serathius commented May 29, 2021

With Kubelet logs fully migrated to structured logging, we can measure macro performance improvements related to using the JSON logging format. We should use this data to motivate more cluster administrators to switch and save on resources.

Requirements:

  • Test different logging formats to compare: --logging-format=text and --logging-format=json
  • Test different logging verbosity levels: --v=0 (default) and --v=4
  • Test the standard Kubelet flow (run container, terminate, etc.)
  • Repeat the test in small clusters or run a larger cluster to collect enough data

My first idea would be to utilize the pull-kubernetes-e2e-gce-100-performance scenario to create a repeatable way to collect Kubelet metric data. Before starting tests we should consult with SIG Scalability/Node on how to best measure and compare Kubelet performance.

Goal: Motivate further investment in JSON by being able to say "Switching to JSON logging in kubelet saves an average of X% CPU and Y% of memory" in release announcements.

/wg structured-logging
/sig scalability
/sig node
@kubernetes/wg-structured-logging-members

/priority important-soon
Needed to graduate JSON to Beta.

@serathius serathius added the kind/feature Categorizes issue or PR as related to a new feature. label May 29, 2021
@k8s-ci-robot k8s-ci-robot added wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/node Categorizes an issue or PR as relevant to SIG Node. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 29, 2021
@MadhavJivrajani
Contributor

/cc @sladyn98
Since you were interested in the testing aspect of it

@MadhavJivrajani
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 30, 2021
@pacoxu
Member

pacoxu commented May 31, 2021

#99803 is a kubelet log benchmark example.

@krzysiekg
Contributor

I am working on this one.
/assign

@serathius
Contributor Author

@pacoxu Thanks for the links. Just to clarify: in that issue we were working on benchmarking log volume; in this issue we are looking at performance (CPU/memory usage).

@serathius
Contributor Author

cc @wojtek-t @mborsz from SIG scalability

@serathius
Contributor Author

/milestone v1.22
Part of the work to graduate the JSON log format.

@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Jun 20, 2021
@voigt

voigt commented Jul 20, 2021

Due to v1.22 code freeze, I'm moving this issue to the next release.

/milestone v1.23

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.22, v1.23 Jul 20, 2021
@krzysiekg
Contributor

@serathius here's how I ran performance tests locally using kind:

  1. Build the node image:
kind build node-image -v 2
  2. Create a kind cluster (example config for a single node, JSON format, and verbosity 4):
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:latest
kubeadmConfigPatches:
- |
  ---
  kind: InitConfiguration
  nodeRegistration:
    kubeletExtraArgs:
      v: "4"
      logging-format: "json"
  ---
  kind: JoinConfiguration
  nodeRegistration:
    kubeletExtraArgs:
      v: "4"
      logging-format: "json"

For a single-node cluster it's also necessary to set this environment variable, otherwise the tests won't compile:

export CL2_LOAD_TEST_THROUGHPUT=1
  3. Run the tests (from the perf-tests/clusterloader2 dir):
go run cmd/clusterloader.go --provider=kind --kubeconfig=/path/to/.kube/config --testconfig=./testing/load/config.yaml --report-dir=./reports -v 4

Kubelet CPU and memory usage will be in the ResourceUsageSummary report.

I tested several verbosities; below are the results for verbosity 1, 4, and 9, but the others were similar.

------------------------------------------    ------------------------------------------
format: json, verbosity: 1                    format: text, verbosity: 1                
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
Run |     |      p50 |      p90 |      p99    Run |     |      p50 |      p90 |      p99
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  1 | CPU |   0.1436 |   0.2044 |   0.2512      1 | CPU |   0.1491 |   0.2006 |   0.2063
    | Mem |   104.96 |   121.68 |   122.21        | Mem |   95.621 |   116.05 |   120.01
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  2 | CPU |   0.1398 |   0.2035 |   0.2386      2 | CPU |   0.1445 |   0.1841 |   0.1865
    | Mem |   102.86 |   126.75 |   129.87        | Mem |   102.96 |   118.86 |   119.06
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  3 | CPU |   0.1560 |   0.1843 |   0.2169      3 | CPU |   0.1675 |   0.2164 |   0.2317
    | Mem |   98.007 |   121.90 |   124.83        | Mem |   101.53 |   121.15 |   124.22
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  4 | CPU |   0.1712 |   0.2100 |   0.2264      4 | CPU |   0.1597 |   0.2391 |   0.2522
    | Mem |   103.75 |   124.05 |   125.59        | Mem |   99.679 |   131.42 |   137.63
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  5 | CPU |   0.1379 |   0.2075 |   0.2253      5 | CPU |   0.1507 |   0.2167 |   0.2471
    | Mem |   101.05 |   119.41 |   124.23        | Mem |   101.32 |   124.15 |   125.19
----|-----|----------|----------|---------    ----|-----|----------|----------|---------

------------------------------------------    ------------------------------------------
format: json, verbosity: 4                    format: text, verbosity: 4                
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
Run |     |      p50 |      p90 |      p99    Run |     |      p50 |      p90 |      p99
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  1 | CPU |   0.1388 |   0.2117 |   0.2575      1 | CPU |   0.1444 |   0.1945 |   0.2256
    | Mem |   108.10 |   125.71 |   129.37        | Mem |   98.355 |   119.72 |   121.40
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  2 | CPU |   0.1746 |   0.2488 |   0.2657      2 | CPU |   0.1541 |   0.2118 |   0.2359
    | Mem |   110.16 |   141.19 |   142.82        | Mem |    106.0 |   125.32 |   130.37
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  3 | CPU |   0.1758 |   0.2582 |   0.2848      3 | CPU |   0.1574 |   0.2033 |   0.2321
    | Mem |   107.92 |   128.60 |   129.18        | Mem |   97.277 |   119.97 |   120.73
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  4 | CPU |   0.1720 |   0.2222 |   0.2417      4 | CPU |   0.1629 |   0.2164 |   0.2311
    | Mem |   103.20 |   118.40 |   121.96        | Mem |   103.23 |   120.66 |   121.45
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  5 | CPU |   0.1460 |   0.2049 |   0.2271      5 | CPU |   0.1681 |   0.2452 |   0.2569
    | Mem |   97.703 |   115.61 |   120.73        | Mem |   107.48 |   129.45 |   132.33
----|-----|----------|----------|---------    ----|-----|----------|----------|---------

------------------------------------------    ------------------------------------------
format: json, verbosity: 9                    format: text, verbosity: 9                
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
Run |     |      p50 |      p90 |      p99    Run |     |      p50 |      p90 |      p99
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  1 | CPU |   0.1871 |   0.2325 |   0.2705      1 | CPU |   0.2032 |   0.2710 |   0.2875
    | Mem |   99.773 |   117.39 |   121.74        | Mem |   101.43 |   134.34 |   137.10
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  2 | CPU |   0.1729 |   0.2311 |   0.2528      2 | CPU |   0.1811 |   0.2526 |   0.2681
    | Mem |   109.95 |   128.02 |   129.32        | Mem |   106.52 |   127.25 |   131.73
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  3 | CPU |   0.1494 |   0.2310 |   0.2634      3 | CPU |   0.1914 |   0.2692 |   0.3050
    | Mem |   111.30 |   136.43 |   139.33        | Mem |   110.10 |   126.37 |   131.41
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  4 | CPU |   0.1523 |   0.2209 |   0.2841      4 | CPU |   0.1441 |   0.2236 |   0.2324
    | Mem |   106.84 |   136.47 |   143.10        | Mem |   104.33 |   122.18 |   124.48
----|-----|----------|----------|---------    ----|-----|----------|----------|---------
  5 | CPU |   0.1662 |   0.2328 |   0.2988      5 | CPU |   0.1588 |   0.1874 |   0.2300
    | Mem |   104.92 |   140.07 |   142.57        | Mem |   106.81 |   123.46 |   125.25
----|-----|----------|----------|---------    ----|-----|----------|----------|---------

There is no clear winner; JSON and text performance is similar and varies from run to run.

I also ran a few tests on bigger clusters (1 control plane and up to 9 workers); there was no significant difference there either.

Here's how we could test performance on a large scale:

@serathius
Contributor Author

serathius commented Aug 4, 2021

Hmm, it looks like there is pretty large variance in the tests; if there is a difference, it's smaller than the variance. In such cases we will need to use some statistical analysis to see whether there is an improvement. I have three ideas to improve the quality of the tests:

  • Eliminate as much noise during tests as possible: turn off any other programs and disable CPU bursting; ideally run in a VM with preallocated resources.
  • Focus on the most optimistic scenario first; here that is verbosity 9, which generates the most logs.
  • Increase the number of runs, assume a normal distribution of usage, and calculate the difference and its statistical significance p (I expect 10-20 runs should be enough).

I recommend reading https://about.sourcegraph.com/go/gophercon-2019-optimizing-go-code-without-a-blindfold/
It explains what statistical significance p is and how to calculate it.
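
To illustrate the statistical-significance idea, here is a minimal sketch of a Welch's t-test over the p90 CPU samples from the verbosity-1 tables above. It assumes gonum (gonum.org/v1/gonum) is available and is not part of any existing test; for Go micro benchmarks, benchstat from golang.org/x/perf automates this kind of comparison.

// Minimal sketch: Welch's t-test on two sets of per-run p90 CPU samples.
// Assumes gonum is available; the sample slices are the verbosity-1 p90
// values from the tables above.
package main

import (
	"fmt"
	"math"

	"gonum.org/v1/gonum/stat"
	"gonum.org/v1/gonum/stat/distuv"
)

// welchP returns the two-sided p-value for the hypothesis "both samples
// have the same mean", using Welch's t-test (unequal variances).
func welchP(a, b []float64) float64 {
	ma, sa := stat.MeanStdDev(a, nil)
	mb, sb := stat.MeanStdDev(b, nil)
	na, nb := float64(len(a)), float64(len(b))
	va, vb := sa*sa/na, sb*sb/nb
	t := (ma - mb) / math.Sqrt(va+vb)
	// Welch-Satterthwaite degrees of freedom.
	df := (va + vb) * (va + vb) / (va*va/(na-1) + vb*vb/(nb-1))
	dist := distuv.StudentsT{Mu: 0, Sigma: 1, Nu: df}
	return 2 * dist.CDF(-math.Abs(t))
}

func main() {
	textCPU := []float64{0.2006, 0.1841, 0.2164, 0.2391, 0.2167}
	jsonCPU := []float64{0.2044, 0.2035, 0.1843, 0.2100, 0.2075}
	p := welchP(textCPU, jsonCPU)
	fmt.Printf("p = %.3f (a difference is usually called significant if p < 0.05)\n", p)
}

With only 5 runs per format the p-value will be large, which matches the observation that the difference (if any) is smaller than the variance; more runs shrink the uncertainty.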

Please let me know if you need any help with this. Happy to discuss details on Slack.

@ritpanjw

Hi @krzysiekg, this is the bug triage shadow here 👋
There hasn't been any activity on this issue for a couple of months now. Just want to make sure this issue is still on track for 1.23?

@ritpanjw

Hi @krzysiekg, following up, this is the bug triage shadow here 👋
There hasn't been any activity on this issue for a couple of months now. Just want to make sure this issue is still on track for 1.23?

@ritpanjw

Hi 👋 I'm the bug triage shadow. There hasn't been activity for a couple of months, and we're in code freeze for the 1.23 milestone. I wanted to check whether we're targeting to get this issue merged in time for the 1.23 release (7/12)?

@ehashman
Member

@ritpanjw I don't believe we are. This is not a blocker.

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.23 milestone Nov 17, 2021
@pohly
Contributor

pohly commented Nov 17, 2021

Another relevant aspect is log ingestion.

We would need to be careful about apples-to-oranges comparisons (text import: just splitting into log messages vs. JSON: full semantics), but to get the full picture we need to consider it.

About variance: I suspect that logging is not causing enough overhead during normal kubelet operation to make measuring small changes in the overhead statistically relevant when measuring total load.

What if we take a different approach:

  • collect realistic kubelet logs (= under load) in JSON format for different log levels
  • write a micro benchmark which loads such a file and then measures CPU cycles and memory consumption for writing it in JSON and text format (a rough sketch follows after this list)
  • write a micro benchmark for parsing both
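
To make the second bullet concrete, here is a rough sketch of such a write benchmark. It assumes go.uber.org/zap (which backs the Kubernetes JSON logging format); the console encoder is only a stand-in for the klog text format, and the hard-coded messages are placeholders for entries replayed from a captured kubelet log.

// Rough sketch of the proposed write benchmark, using go.uber.org/zap.
// The console encoder is a stand-in for the klog text format, and the
// messages are placeholders for a replayed kubelet log capture.
package logbench

import (
	"io"
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

var messages = []string{
	"SyncLoop ADD",
	"Pod status updated",
	"Container started",
}

func newLogger(enc zapcore.Encoder) *zap.Logger {
	// Write to io.Discard so we measure encoding cost, not disk I/O.
	core := zapcore.NewCore(enc, zapcore.AddSync(io.Discard), zapcore.InfoLevel)
	return zap.New(core)
}

func benchmarkWrite(b *testing.B, enc zapcore.Encoder) {
	logger := newLogger(enc)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		logger.Info(messages[i%len(messages)], zap.String("pod", "kube-system/example"))
	}
}

func BenchmarkWriteJSON(b *testing.B) {
	benchmarkWrite(b, zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()))
}

func BenchmarkWriteText(b *testing.B) {
	benchmarkWrite(b, zapcore.NewConsoleEncoder(zap.NewDevelopmentEncoderConfig()))
}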

@pohly
Contributor

pohly commented Nov 18, 2021

"collect ... for different log levels" - this isn't really needed. We now have "v":1 to indicate the verbosity of a log message, so we can do one run and collect at, say, -v9, and then during benchmarking only consider log messages below the log level we are interested in.

@serathius
Contributor Author

/unassign @krzysiekg
/assign @pohly
/triage accepted

@pohly
Contributor

pohly commented Dec 9, 2021

#106594 has benchmarks for encoding without caller and timestamping.

For this issue we need variants of those benchmarks with the overhead for caller determination and timestamping.
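
As an illustration (not the actual benchmarks from #106594), such variants could toggle the caller and timestamp overhead with zap's encoder config and the zap.AddCaller option:

// Sketch: JSON write benchmark with caller determination and timestamping
// switched on or off via the encoder config. Illustrative only; not the
// benchmarks referenced in #106594.
package logbench

import (
	"io"
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func newJSONLogger(withCallerAndTime bool) *zap.Logger {
	cfg := zap.NewProductionEncoderConfig()
	if !withCallerAndTime {
		// Empty keys omit the fields entirely, so timestamp formatting is
		// skipped and no caller information is rendered.
		cfg.TimeKey = ""
		cfg.CallerKey = ""
	}
	core := zapcore.NewCore(zapcore.NewJSONEncoder(cfg), zapcore.AddSync(io.Discard), zapcore.InfoLevel)
	if withCallerAndTime {
		// runtime.Caller lookup is the expensive part being measured here.
		return zap.New(core, zap.AddCaller())
	}
	return zap.New(core)
}

func BenchmarkJSONBare(b *testing.B) {
	logger := newJSONLogger(false)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		logger.Info("pod status updated")
	}
}

func BenchmarkJSONCallerAndTime(b *testing.B) {
	logger := newJSONLogger(true)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		logger.Info("pod status updated")
	}
}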

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 9, 2022
@pohly
Contributor

pohly commented Mar 9, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 9, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 7, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2022
@serathius
Contributor Author

/close

@k8s-ci-robot
Contributor

@serathius: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
