
API call latencies regressed due to upgrade to golang 1.12 #75833

Closed
mborsz opened this issue Mar 28, 2019 · 46 comments


@mborsz (Member) commented Mar 28, 2019:

Something has changed between runs 320 and 321, and the request latency has increased significantly.

The graphs below show request latency (non-watch) over time for a single test run:

320: (request latency graph)

321: (request latency graph)

In my opinion they look significantly different: in 320 most of the spikes stay below 0.8s, while in 321 most of the spikes exceed 1.6s.

I checked a few runs before 320, and they have a latency pattern similar to 320; a few runs after 321 have a pattern similar to 321.

We are seeing similar latency growth in kubemark tests.

The most significant change between runs 320 and 321 was the go 1.11.5 -> 1.12 migration.

/cc @wojtek-t @krzysied

@mborsz (Member, Author) commented Mar 28, 2019:

/sig scalability

@mborsz (Member, Author) commented Mar 28, 2019:

With @krzysied we are trying to prepare a small-scale repro that we can use for further debugging.

@liggitt (Member) commented Mar 28, 2019:

#74778 was also in that range (cc @sttts @roycaihw for verification that nothing in that PR increased server load in non-alpha clusters)

@wojtek-t (Member) commented Mar 28, 2019:

@krzysied ran tests at head with Go reverted to 1.11, and the metrics seemed better.
He is working on confirming that and will paste results then, but it seems pretty clear that it's a Go regression.
We're also working on providing a smaller-scale repro case so we can work more closely with the Go team (they were really helpful when debugging regressions in previous versions; IIRC 1.9 was where the previous one occurred).

@roycaihw (Member) commented Mar 28, 2019:

I double-checked #74778 and didn't find anything suspicious. The alpha function wasn't invoked in that window, and the PR is a no-op for non-alpha clusters.

@krzysied (Contributor) commented Mar 28, 2019:

Regarding what @wojtek-t wrote:
(kubemark regression graph)

I was mostly investigating the regression on kubemark-5000. Some of its latencies increased ~3x.

I ran two tests: one with 7514c49 checked out, and the other with the same commit but with 36e934d (Update to use golang 1.12) reverted.

The 99th latency percentile of node listing was 3400.152ms in the first case and 952.576ms in the second.
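Those two numbers alone put the regression at roughly 3.6x for this metric; a trivial sketch of the comparison:

```go
package main

import "fmt"

func main() {
	// 99th percentile of node listing from the two kubemark runs above.
	withGo112 := 3400.152 // ms, with 36e934d (go1.12)
	withGo111 := 952.576  // ms, with 36e934d reverted (go1.11)
	fmt.Printf("regression: %.2fx\n", withGo112/withGo111)
	// prints: regression: 3.57x
}
```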

@wojtek-t (Member) commented Mar 28, 2019:

Look into other metrics too (I can't remember which exactly - delete pods? post pods? or something like that) - those increased by almost 10x.

@wojtek-t (Member) commented Mar 29, 2019:

This is the graph I had in mind - and yeah, it's more like 4x, not 10x, but still a huge regression:
(screenshot: latency graph, 2019-03-29)

@wojtek-t wojtek-t added this to the v1.15 milestone Mar 29, 2019

@krzysied (Contributor) commented Mar 29, 2019:

Reproducing the issue:

  1. Check out the kubernetes repo - https://github.com/kubernetes/kubernetes.git.
     • First broken version: 36e934d (go1.12)
     • Last working version: 8c95a07 (go1.11)

  2. Check out the perf-tests repo - https://github.com/kubernetes/perf-tests.git.
     This repo is home to clusterloader2 - the tool used for running the scalability tests.

  3. In the kubernetes directory, run "make quick-release". This builds kubernetes from the checked-out version.

  4. Create a GCE cluster.

export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --check-leaked-resources \
    --gcp-master-size=n1-standard-8 \
    --gcp-node-image=gci \
    --gcp-node-size=n1-standard-8 \
    --gcp-nodes=16 \
    --provider=gce \
    --up \
    --test=false \
    --timeout=30m
  5. Run the kubemark test.
     This step creates a kubemark cluster (on top of the cluster created in step 4), runs the cluster loader test, and tears down the kubemark cluster.
export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export KUBEMARK_MASTER_COMPONENTS_QPS_LIMITS="--kube-api-qps=100 --kube-api-burst=100"

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --provider=gce \
    --kubemark \
    --kubemark-nodes=1000 \
    --test=false \
    --test-cmd=$GOPATH/src/k8s.io/perf-tests/run-e2e.sh \
    --test-cmd-args=cluster-loader2 \
    --test-cmd-args=--nodes=1000 \
    --test-cmd-args=--provider=kubemark \
    --test-cmd-args=--report-dir=/tmp/cluster_loader2/_artifacts \
    --test-cmd-args=--testconfig=testing/load/config.yaml \
    --test-cmd-args=--testoverrides=./testing/load/kubemark/500_nodes/override.yaml \
    --test-cmd-args="--enable-prometheus-server=false" \
    --test-cmd-name=ClusterLoaderV2 \
    --timeout=60m

The test run metrics will be exported to the directory provided by the --report-dir flag. Inside this directory there will be an APIResponsiveness_load_.txt file containing the 50th, 90th, and 99th percentiles of all apiserver call latencies. The latencies affected the most are:

  • Listing nodes
    {
    "Resource": "nodes",
    "Scope": "cluster",
    "Subresource": "",
    "Verb": "LIST"
    }

  • Pod deletion
    {
    "Resource": "pods",
    "Scope": "namespace",
    "Subresource": "",
    "Verb": "DELETE"
    }

  • Listing services
    {
    "Resource": "services",
    "Scope": "cluster",
    "Subresource": "",
    "Verb": "LIST"
    }

I would suggest using their 99th percentiles as the metric for this issue.

For comparison:
- Go1.11 run:

  • Listing nodes - 162.409ms
  • Pod deletion - 41.302ms
  • Listing services - 28.277ms

- Go1.12 run:

  • Listing nodes - 293.724ms
  • Pod deletion - 56.2ms
  • Listing services - 46.183ms
  6. Tear down the GCE cluster.
export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --check-leaked-resources \
    --gcp-master-size=n1-standard-8 \
    --gcp-node-image=gci \
    --gcp-node-size=n1-standard-8 \
    --gcp-nodes=16 \
    --provider=gce \
    --down \
    --test=false \
    --timeout=30m

@wojtek-t wojtek-t changed the title request latency regressed in ci-kubernetes-e2e-gce-scale-performance API call latencies regressed due to upgrade to golang 1.12 Apr 1, 2019

@wojtek-t (Member) commented Apr 1, 2019:

@wojtek-t (Member) commented Apr 1, 2019:

We're trying to dig deeper to root-cause this to a smaller part of the system, but it is quite tricky.
Unfortunately, we don't have much more than guesses right now.

Some observations:

  • I looked into DELETE pods calls in kubemark-5000 runs, and for all of the calls for which we have traces (500ms+) - there are ~5000 of those in each of the recent runs, whereas previously the 99th percentile was <200ms - a significant chunk of time is spent around the "Object deleted from database" step.

  • That may suggest some etcd issues, but we don't see logs related to long-running calls on the etcd side. That may suggest some issues on the etcd client side, but we don't have anything concrete yet.

  • It's also possible that the Go upgrade changed something in serialization/deserialization - that's a second hypothesis (we've seen a bunch of traces with a long "Decoded delete options" step; deserialization is also done in the etcd client), but I'm not convinced this is the case as of now.
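For context, the traces mentioned above record timestamps between named steps of a request, so a slow phase shows up as a large gap before one step. A self-contained sketch of that idea (an illustrative stand-in, not Kubernetes' actual trace utility):

```go
package main

import (
	"fmt"
	"time"
)

// step records a named point in time within one request, mimicking the
// "Decoded delete options" / "Object deleted from database" style steps
// referenced above.
type step struct {
	name string
	at   time.Time
}

type trace struct {
	start time.Time
	steps []step
}

func newTrace() *trace { return &trace{start: time.Now()} }

func (t *trace) Step(name string) {
	t.steps = append(t.steps, step{name, time.Now()})
}

// Dump prints the time spent before each step; a slow phase (e.g. the
// database delete) shows up as a large "+duration" on its step.
func (t *trace) Dump() {
	prev := t.start
	for _, s := range t.steps {
		fmt.Printf("%-32s +%v\n", s.name, s.at.Sub(prev).Round(time.Millisecond))
		prev = s.at
	}
}

func main() {
	tr := newTrace()
	time.Sleep(5 * time.Millisecond) // stands in for request decoding
	tr.Step("Decoded delete options")
	time.Sleep(20 * time.Millisecond) // stands in for the etcd delete
	tr.Step("Object deleted from database")
	tr.Dump()
}
```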

@wojtek-t (Member) commented Apr 5, 2019:

We've spent significant time with the Go folks and already have some findings.
To summarize:

  • the microbenchmark from #76080 enabled the Go team to find a regression on their side - the faulty commit in Go is 07e738ec32025da458cdf968e4f991972471e6e9
  • to cite the Go folks: "To summarize the change: I removed the "fast" path for small page-level allocations (which currently always need to acquire the heap lock) in order to simplify code for subsequent changes. In effect, span allocations now all go through a balanced tree rather than just a linked list for small spans (I think it was all spans under 128 pages?)."

We aren't fully convinced that this is for sure the same thing that is causing the regression in our scalability tests, but analyzing traces, profiles, Go traces, etc. suggests that the problems are indeed around memory allocations or GC. So with high probability this is the same thing.

We are discussing ways to prove that it is, and working with the Go team to see what we can do about it.
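The allocator path described in the quote can be stressed with a trivial Go program - a hypothetical stand-in for the #76080 microbenchmark, not the actual one:

```go
// Repeated large allocations are served as multi-page spans by the
// runtime's page allocator - the code path the cited Go commit changed
// from a linked-list lookup (under the heap lock) to a balanced tree.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	const iters = 2000
	var sink []byte
	start := time.Now()
	for i := 0; i < iters; i++ {
		// 1 MiB: well above the 32 KiB small-object limit, so each
		// allocation goes to the heap's span allocator.
		sink = make([]byte, 1<<20)
	}
	elapsed := time.Since(start)
	runtime.KeepAlive(sink)
	fmt.Printf("%d large allocations in %v\n", iters, elapsed)
}
```

Timing this under go1.11 vs go1.12 is the shape of comparison the Go team used, though their benchmark was more involved.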

@wojtek-t (Member) commented Apr 30, 2019:

The commits that we were waiting for have already landed:
golang/go@fd36763
golang/go@438b1a5

Now we are waiting for the next 1.12 patch release of Go to update Kubernetes to; the ETA is May 1st.

@dims - given that I will be on vacation for the rest of this week, would you be able to help with that?

@dims (Member) commented Apr 30, 2019:

@wojtek-t yes I can - please let me know what needs to be done (so I can figure out whether I have access before you head out).

@wojtek-t (Member) commented Apr 30, 2019:

@dims - thanks a lot. I haven't done a Go upgrade in about 2 years, so I'm not entirely sure how (or if) the process has changed in the meantime. @cblecker did the last one (#76576), so I'm sure he will be able to help with that.

@dims (Member) commented Apr 30, 2019:

Ah, just update to the latest Go when it gets out - yes, I can do that. I was wondering if you wanted me to kick off some CI runs or something.

Have a good vacation @wojtek-t

@wojtek-t (Member) commented Apr 30, 2019:

@dims - thanks a lot!
Yes - it wasn't clear, but I was testing that and it helps us, so no need to run other tests (we will get final confirmation from postsubmits).

@cblecker (Member) commented Apr 30, 2019:

I can also help out to get this upgrade in when it’s released. Thanks for all your work on this @wojtek-t!

@cblecker (Member) commented May 7, 2019:

Looks like 1.12.5 went out this afternoon. PR open (#77528), but expect tests to fail:

  • Waiting for the docker hub image to publish. The PR is merged, so this should happen shortly.
  • Need a new version of the bazel rules_go to be published.
@wojtek-t (Member) commented May 7, 2019:

Just to clarify the above (from @cblecker):

"oh, so I saw the auto update ran here: https://github.com/docker-library/golang/commits/master

but it looks like the update for the tags hasn't merged yet: docker-library/official-images#5867"

@dims (Member) commented May 7, 2019:

Yeah, it takes some time for the images to get published.

@cblecker (Member) commented May 7, 2019:

Digest: sha256:e52f9c9ea80fa182a78bd83a9f9e107d173b8e7fd922dbb0c60c8020a88fb7f6
Status: Downloaded newer image for golang:1.12.5

Looks like new images are live.

@cblecker (Member) commented May 9, 2019:

Reopening this so we can verify that everything is back to normal in CI.

@cblecker cblecker reopened this May 9, 2019

@mm4tt (Contributor) commented May 10, 2019:

/assign

@cblecker (Member) commented May 13, 2019:

Opened cherry-pick: #77831

@tpepper (Contributor) commented May 13, 2019:

@wojtek-t has SIG Scalability done any one-off custom scale/perf testing with the new golang patch release and the k8s 1.14 branch content?

Do you see this cherry pick as a 1.14.2 release blocker? Or would a week or two of focused testing with updated golang in the 1.14 k8s branch and a 1.14.3 release by end of May be a sufficiently safe yet expedient response?

@wojtek-t (Member) commented May 14, 2019:

@wojtek-t has SIG Scalability done any one-off custom scale/perf testing with the new golang patch release and the k8s 1.14 branch content?

We were doing a bunch of tests to discover the real regression in golang 1.12 - the regression happened within the 1.14 release timeframe, and the bisection was done on that branch.
So yes - I'm 100% confident that it helps.

Do you see this cherry pick as a 1.14.2 release blocker? Or would a week or two of focused testing with updated golang in the 1.14 k8s branch and a 1.14.3 release by end of May be a sufficiently safe yet expedient response?

Tough question - I would love to see it in 1.14.2; otoh, given how long it took to debug and fix, I can see the release team pushing it to 1.14.3. But if we could have it in 1.14.2, it would be extremely helpful.

@hoegaarden (Member) commented May 14, 2019:

#77831 just merged, and we'll let 1.14 soak for one or two days before we cut v1.14.2.

Let's keep this issue open until we have some data on the latencies after that merge.

SIG Scalability, can you please provide some data / graphs about the latency in a (couple of) days for the release-1.14 branch?

@tpepper (Contributor) commented May 14, 2019:

@wojtek-t is there any evidence we'll be able to observe in https://testgrid.k8s.io/sig-release-1.14-blocking for validation? Or is it possible for SIG Scalability to trigger any custom runs now that the cherry-pick has merged to the 1.14 branch?

@wojtek-t (Member) commented May 15, 2019:

We aren't running large tests on release branches due to their cost...
We don't have any free quota to run one now, due to the huge number of regressions at head that we're currently debugging.

But I'm 100% confident in this change. What I personally did in the past:

  • I ran tests with Go 1.12.5 against the commit where we bumped to 1.12.0 (which was from the beginning of March, so still in the 1.14 timeline), and the effect was exactly the same as the effect we got at head after merging it.
  • I also understand the changes in Go between 1.12.4 and 1.12.5 and their effect on performance.
  • And we also understand this regression, and why the fixes help, which has nothing to do with any changes happening in Kubernetes.

So I'm entirely sure that, as long as you didn't observe any correctness issues, it is good-to-go from a performance perspective.

@hoegaarden (Member) commented May 15, 2019:

@wojtek-t Thanks for your answer!

@wojtek-t (Member) commented May 17, 2019:

With this fixed at head, and 1.14.2 released with the fix, I'm closing this issue as fixed.
