
API call latencies regressed due to upgrade to golang 1.12 #75833

Closed
mborsz opened this issue Mar 28, 2019 · 46 comments · Fixed by #77528
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@mborsz
Member

mborsz commented Mar 28, 2019

Something has changed between runs 320 and 321: the request latency has increased significantly.

The graphs below show request latency (non-watch) over time for a single test run:

320: [request latency graph]

321: [request latency graph]

In my opinion they look significantly different: in 320 most of the spikes stay below 0.8s, while in 321 most of the spikes exceed 1.6s.

I checked a few runs before 320 and they have a latency pattern similar to 320; a few runs after 321 have a pattern similar to 321.

We are seeing similar latency growth in kubemark tests.

The most significant change between 320 and 321 was the go 1.11.5 -> 1.12 migration.

/cc @wojtek-t @krzysied

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 28, 2019
@mborsz
Member Author

mborsz commented Mar 28, 2019

/sig scalability

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 28, 2019
@mborsz
Member Author

mborsz commented Mar 28, 2019

With @krzysied we are trying to prepare small scale repro that we can use for further debugging.

@liggitt
Member

liggitt commented Mar 28, 2019

#74778 was also in that range (cc @sttts @roycaihw for verification that nothing in that PR increased server load in non-alpha clusters)

@wojtek-t
Member

@krzysied ran tests at head with a revert to go1.11 and the metrics seemed better.
He is working on confirming that and will paste the results then, but it seems pretty clear that it's a go regression.
We're also working on providing a smaller-scale repro case so we can work closely with the golang team (they used to be really helpful when debugging regressions in previous versions - IIRC 1.9 is where the previous one was).

@roycaihw
Member

I double checked #74778 and didn't find anything suspicious. The alpha function wasn't invoked in that window, and the PR is a no-op for non-alpha clusters.

@krzysied
Contributor

Regarding what @wojtek-t wrote:
[kubemark regression graph]

I was mostly investigating the regression on kubemark-5000. Some of its latencies increased ~3x.

I ran two tests: one with 7514c49 checked out, and the other with the same commit but without 36e934d (Update to use golang 1.12).

The 99th latency percentile of node listing was 3400.152ms in the first case and 952.576ms in the second.

@wojtek-t
Member

Look also into other metrics (I can't remember which exactly - delete pods? post pods? or something like that), which increased by almost 10x.

@wojtek-t
Member

This is the graph I had in mind - but yeah, it's more like 4x, not 10x. Still a quite huge regression:
[Screenshot from 2019-03-29 13-46-32: latency graph]

@wojtek-t wojtek-t added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 29, 2019
@wojtek-t wojtek-t added this to the v1.15 milestone Mar 29, 2019
@krzysied
Contributor

Reproducing the issue:

  1. Check out the kubernetes repo - https://github.com/kubernetes/kubernetes.git.
     • First broken version: 36e934d (go1.12)
     • Last working version: 8c95a07 (go1.11)
  2. Check out the perf-tests repo - https://github.com/kubernetes/perf-tests.git.
     This repo is home to clusterloader2 - the tool used for running the scalability tests.
  3. In the kubernetes directory, run “make quick-release”. This builds kubernetes from the checked-out version.
  4. Create a GCE cluster.

export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --check-leaked-resources \
    --gcp-master-size=n1-standard-8 \
    --gcp-node-image=gci \
    --gcp-node-size=n1-standard-8 \
    --gcp-nodes=16 \
    --provider=gce \
    --up \
    --test=false \
    --timeout=30m
  5. Run the kubemark test.
     This step creates a kubemark cluster (on top of the cluster created in step 4), runs the cluster loader test, and tears down the kubemark cluster.
export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export KUBEMARK_MASTER_COMPONENTS_QPS_LIMITS="--kube-api-qps=100 --kube-api-burst=100"

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --provider=gce \
    --kubemark \
    --kubemark-nodes=1000 \
    --test=false \
    --test-cmd=$GOPATH/src/k8s.io/perf-tests/run-e2e.sh \
    --test-cmd-args=cluster-loader2 \
    --test-cmd-args=--nodes=1000 \
    --test-cmd-args=--provider=kubemark \
    --test-cmd-args=--report-dir=/tmp/cluster_loader2/_artifacts \
    --test-cmd-args=--testconfig=testing/load/config.yaml \
    --test-cmd-args=--testoverrides=./testing/load/kubemark/500_nodes/override.yaml \
    --test-cmd-args="--enable-prometheus-server=false" \
    --test-cmd-name=ClusterLoaderV2 \
    --timeout=60m

The test run metrics will be exported to the directory provided by the --report-dir flag. Inside this directory there will be an APIResponsiveness_load_.txt file containing the 50th, 90th, and 99th percentiles of all apiserver call latencies. The latencies affected the most are:

  • Listing nodes
    {
    "Resource": "nodes",
    "Scope": "cluster",
    "Subresource": "",
    "Verb": "LIST"
    }

  • Pod deletion
    {
    "Resource": "pods",
    "Scope": "namespace",
    "Subresource": "",
    "Verb": "DELETE"
    }

  • Listing services
    {
    "Resource": "services",
    "Scope": "cluster",
    "Subresource": "",
    "Verb": "LIST"
    }

I would suggest using their 99th percentiles as the metric for this issue (see the parsing sketch at the end of this comment).

For comparison:
- Go1.11 run:

  • Listing nodes - 162.409ms
  • Pod deletion - 41.302ms
  • Listing services - 28.277ms

- Go1.12 run:

  • Listing nodes - 293.724ms
  • Pod deletion - 56.2ms
  • Listing services - 46.183ms
  6. Tear down the GCE cluster.
export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --check-leaked-resources \
    --gcp-master-size=n1-standard-8 \
    --gcp-node-image=gci \
    --gcp-node-size=n1-standard-8 \
    --gcp-nodes=16 \
    --provider=gce \
    --down \
    --test=false \
    --timeout=30m
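
For convenience, here is a minimal Go sketch of pulling the 99th percentiles of the affected calls out of that report. The file layout used here is an assumption - a JSON list of entries with the label fields shown above plus a latency map - so adjust the field names to whatever the real APIResponsiveness file actually contains:

// parselatency.go - hypothetical helper; the file layout and field names are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
)

type apiCall struct {
	Resource    string             `json:"Resource"`
	Scope       string             `json:"Scope"`
	Subresource string             `json:"Subresource"`
	Verb        string             `json:"Verb"`
	Latency     map[string]float64 `json:"Latency"` // assumed shape: {"Perc50": ..., "Perc90": ..., "Perc99": ...}, in ms
}

// affected reports whether this is one of the calls tracked in this issue.
func affected(c apiCall) bool {
	switch {
	case c.Verb == "LIST" && c.Resource == "nodes" && c.Scope == "cluster":
		return true
	case c.Verb == "DELETE" && c.Resource == "pods" && c.Scope == "namespace":
		return true
	case c.Verb == "LIST" && c.Resource == "services" && c.Scope == "cluster":
		return true
	}
	return false
}

func main() {
	// Path to the APIResponsiveness file from the --report-dir directory.
	data, err := ioutil.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	var calls []apiCall
	if err := json.Unmarshal(data, &calls); err != nil {
		panic(err)
	}
	for _, c := range calls {
		if affected(c) {
			fmt.Printf("%-6s %-8s p99=%.3fms\n", c.Verb, c.Resource, c.Latency["Perc99"])
		}
	}
}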

@wojtek-t wojtek-t changed the title request latency regressed in ci-kubernetes-e2e-gce-scale-performance API call latencies regressed due to upgrade to golang 1.12 Apr 1, 2019
@wojtek-t
Member

wojtek-t commented Apr 1, 2019

@kubernetes/sig-scalability-bugs

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 1, 2019
@wojtek-t
Member

wojtek-t commented Apr 1, 2019

We're trying to dig deeper to root-cause it to a smaller part of the system, but it is quite tricky.
Unfortunately, we don't have much more than guesses right now.

Some observations:

  • I looked into the DELETE pods calls in the kubemark-5000 runs, and it seems that in all of the calls for which we have traces (500ms+) - there are ~5000 of those in each of the recent runs, whereas previously the 99th percentile was <200ms - a significant part of the time is spent in "Object deleted from database".

  • That may suggest some etcd issue, but we don't see logs related to long-running calls on the etcd side. It may instead be an issue on the etcd client side, but we don't have anything concrete yet.

  • It's also possible that the golang upgrade changed something in serialization/deserialization - that's a second hypothesis (we've seen a bunch of traces around "Decoded delete options", and deserialization is also done in the etcd client), but I'm not convinced this is the case as of now.
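
For context on where those trace messages come from: the apiserver records a step-by-step trace for each request and only logs it when the total time exceeds a threshold, which is why we only have traces for the 500ms+ calls. A minimal self-contained sketch of that pattern in Go (not the real k8s.io tracing code; the step names are copied from the quotes above):

package main

import (
	"fmt"
	"time"
)

// step records when a named phase of the request finished.
type step struct {
	name string
	at   time.Time
}

// trace is a tiny stand-in for the apiserver's request-tracing utility.
type trace struct {
	name  string
	start time.Time
	steps []step
}

func newTrace(name string) *trace {
	return &trace{name: name, start: time.Now()}
}

func (t *trace) Step(name string) {
	t.steps = append(t.steps, step{name: name, at: time.Now()})
}

// LogIfLong prints the per-step breakdown only for slow requests, which is why
// only the 500ms+ DELETE calls show up with traces in the apiserver logs.
func (t *trace) LogIfLong(threshold time.Duration) {
	total := time.Since(t.start)
	if total < threshold {
		return
	}
	fmt.Printf("Trace %q (total %v):\n", t.name, total)
	prev := t.start
	for _, s := range t.steps {
		fmt.Printf("  %q took %v\n", s.name, s.at.Sub(prev))
		prev = s.at
	}
}

func main() {
	tr := newTrace("Delete /api/v1/namespaces/default/pods/example")
	tr.Step("Decoded delete options")       // deserialization (the second hypothesis above)
	time.Sleep(600 * time.Millisecond)      // pretend the etcd round trip is slow
	tr.Step("Object deleted from database") // the step dominating the slow traces
	tr.LogIfLong(500 * time.Millisecond)
}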

@wojtek-t
Member

wojtek-t commented Apr 5, 2019

We've spent significant time with golang folks and already have some findings.
To summarize:

  • the microbenchmark from [WIP] Golang 1.12 regressions benchmark #76080 enabled the Go team to find a regression on their side - the faulty golang commit is 07e738ec32025da458cdf968e4f991972471e6e9
  • to cite Go folks: "To summarize the change: I removed the "fast" path for small page-level allocations (which currently always need to acquire the heap lock) in order to simplify code for subsequent changes. In effect, span allocations now all go through a balanced tree rather than just a linked list for small spans (I think it was all spans under 128 pages?)."

We aren't fully convinced that this is the same thing that is causing the regression in our scalability tests, but analyzing traces, profiles, golang traces, etc. suggests that the problems are indeed around memory allocations or GC. So with high probability it is the same thing.

We are discussing how to prove that it is indeed the same issue and working with the Go team to see what we can do about it.
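
To make the quoted golang change concrete: with the page-level "fast path" gone, every span allocation has to take the heap lock and walk the free-span tree, so allocation-heavy parallel workloads contend on that lock. A minimal sketch of the kind of microbenchmark that can expose this (not the actual benchmark from #76080):

package allocbench

import "testing"

// BenchmarkParallelSpanAllocs allocates multi-page objects from many goroutines.
// Objects this large bypass the per-P caches, so each allocation needs a fresh
// span from the heap; comparing ns/op between go1.11 and go1.12 with
// "go test -bench=ParallelSpanAllocs -cpu=1,8,32" should show the contention.
func BenchmarkParallelSpanAllocs(b *testing.B) {
	const objSize = 64 << 10 // 64 KiB: several pages, allocated straight from the heap
	b.RunParallel(func(pb *testing.PB) {
		var sink []byte
		for pb.Next() {
			sink = make([]byte, objSize)
		}
		_ = sink // keep the last allocation referenced so it isn't optimized away
	})
}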

@wojtek-t
Member

It turned out that commit 07e738ec32025da458cdf968e4f991972471e6e9 isn't the main culprit for us - running scalability tests against that commit doesn't show a significant drop in metrics.

For lack of better ideas, we decided to bisect the changes in golang by running the full scalability suite on a 5k-node cluster against them. This will take a couple more days though...

@dims
Member

dims commented Apr 30, 2019

@wojtek-t yes i can, please let me know what needs to be done (so i can figure out if i have access before you head out)

@wojtek-t
Member

@dims - thanks a lot. I haven't done a Go upgrade for 2 years or so, so I'm not entirely sure how (or if) the process has changed in the meantime. @cblecker did the last one (#76576), so I'm sure he will be able to help with that.

@dims
Member

dims commented Apr 30, 2019

Ah, just update to the latest golang when it gets out - yes, I can do that. I was wondering if you wanted me to kick off some CI runs or something.

Have a good vacation @wojtek-t

@wojtek-t
Member

@dims - thanks a lot!
Yes - it wasn't clear, but I was testing that and it helps us, so there's no need to run any other tests (we will get final confirmation from the postsubmits).

@cblecker
Member

I can also help out with getting this upgrade in when it’s released. Thanks for all your work on this @wojtek-t!

@cblecker
Member

cblecker commented May 7, 2019

Looks like 1.12.5 went out this afternoon. The PR is open (#77528), but expect tests to fail:

  • Waiting for the docker hub image to publish. The PR is merged, so this should happen shortly.
  • Need a new version of the bazel rules_go to be published.

@wojtek-t
Member

wojtek-t commented May 7, 2019

Just to clarify the above (from @cblecker):

"oh, so I saw the auto update ran here: https://github.com/docker-library/golang/commits/master

but it looks like the update for the tags hasn't merged yet: docker-library/official-images#5867"

@dims
Member

dims commented May 7, 2019

Yea, it takes some time for the images to get published.

@cblecker
Member

cblecker commented May 7, 2019

Digest: sha256:e52f9c9ea80fa182a78bd83a9f9e107d173b8e7fd922dbb0c60c8020a88fb7f6
Status: Downloaded newer image for golang:1.12.5

Looks like new images are live.

@cblecker
Member

cblecker commented May 9, 2019

Reopening this so we can verify that everything is back to normal in CI

@cblecker cblecker reopened this May 9, 2019
@mm4tt
Contributor

mm4tt commented May 10, 2019

/assign

@cblecker
Member

Opened the cherry-pick: #77831

@tpepper
Member

tpepper commented May 13, 2019

@wojtek-t has SIG Scalability done any one-off custom scale/perf testing with the new golang patch release and the k8s 1.14 branch content?

Do you see this cherry pick as a 1.14.2 release blocker? Or would a week or two of focused testing with updated golang in the 1.14 k8s branch and a 1.14.3 release by end of May be a sufficiently safe yet expedient response?

@wojtek-t
Member

@wojtek-t has SIG Scalability done any one-off custom scale/perf testing with the new golang patch release and the k8s 1.14 branch content?

We did a bunch of tests to discover the real regression in golang 1.12 - the regression happened within the 1.14 release timeframe, and the bisection was done on that branch.
So yes - I'm 100% confident that it helps.

Do you see this cherry pick as a 1.14.2 release blocker? Or would a week or two of focused testing with updated golang in the 1.14 k8s branch and a 1.14.3 release by end of May be a sufficiently safe yet expedient response?

Tough question - I would love to see it in 1.14.2; on the other hand, given how long it took to debug and fix, I can see the release team pushing it to 1.14.3. But if we could have it in 1.14.2, it would be extremely helpful.

@hoegaarden
Contributor

#77831 just merged, and we will let 1.14 soak for one or two days before we cut v1.14.2.

Let's keep this issue open until we have some data on the latencies after that merge.

SIG Scalability, can you please provide some data / graphs about the latency in a (couple of) days for the release-1.14 branch?

@tpepper
Member

tpepper commented May 14, 2019

@wojtek-t is there any evidence we'll be able to observe in https://testgrid.k8s.io/sig-release-1.14-blocking for validation? Or is it possible for SIG Scalability to trigger any custom runs now that the cherry-pick has merged to the 1.14 branch?

@wojtek-t
Member

We aren't running large tests on release branches due to their cost...
We also don't have any free quota to run them now, due to the huge number of regressions at head that we're currently debugging.

But I'm 100% confident in this change. What I personally did in the past:

  • I ran tests against the commit where we bumped to go 1.12.0, but using go 1.12.5 instead (that commit is from the beginning of March, so still in the 1.14 timeline), and the effect was exactly the same as the effect we got at head after merging the upgrade.
  • I also understand the changes in Go between 1.12.4 and 1.12.5 and their effect on performance.
  • And we also understand this regression and why the fixes help - none of which has anything to do with changes happening in kubernetes.

So I'm entirely sure that, as long as you didn't observe any correctness issues, it is good to go from a performance perspective.

@hoegaarden
Contributor

@wojtek-t Thanks for your answer!

@wojtek-t
Member

With this being fixed in head and 1.14.2 released with the fix, I'm closing this one as fixed.
