
API call latencies regressed due to upgrade to golang 1.12 #75833

Closed
mborsz opened this issue Mar 28, 2019 · 46 comments · Fixed by #77528
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@mborsz
Member

mborsz commented Mar 28, 2019

Something has changed between runs 320 and 321: the request latency has increased significantly.

The graphs below show request latency (non-watch) over time for a single test run:

320: [request latency graph]

321: [request latency graph]

In my opinion they look significantly different: in 320 most of the spikes stay below 0.8s, while in 321 most of the spikes exceed 1.6s.

I checked a few runs before 320 and they have a latency pattern similar to 320; a few runs after 321 have a pattern similar to 321.

We are seeing similar latency growth in kubemark tests.

The most significant change between 320 and 321 was the go 1.11.5 -> 1.12 migration.

/cc @wojtek-t @krzysied

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 28, 2019
@mborsz
Member Author

mborsz commented Mar 28, 2019

/sig scalability

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 28, 2019
@mborsz
Member Author

mborsz commented Mar 28, 2019

With @krzysied we are trying to prepare small scale repro that we can use for further debugging.

@liggitt
Member

liggitt commented Mar 28, 2019

#74778 was also in that range (cc @sttts @roycaihw for verification that nothing in that PR increased server load in non-alpha clusters)

@wojtek-t
Member

@krzysied ran tests at head with a revert to go1.11 and the metrics seemed better.
He is working on confirming that and will paste the results then, but it seems pretty clear that it's a go regression.
We're also working on providing a smaller-scale repro case so we can work closely with the golang team (they used to be really helpful when debugging regressions in previous versions - IIRC 1.9 is where the previous one was).

@roycaihw
Member

I double checked #74778 and didn't find anything suspicious. The alpha function wasn't invoked in that window, and the PR is a no-op for non-alpha clusters.

@krzysied
Contributor

Regarding what @wojtek-t wrote:
[kubemark regression graph]

I was mostly investigating the regression on kubemark-5000. Some of its latencies increased ~3x.

I ran two tests: one with 7514c49 checked out, and the other with the same commit but without 36e934d (Update to use golang 1.12).

The 99th latency percentile of node listing was 3400.152ms in the first case and 952.576ms in the second.

@wojtek-t
Member

Look also into other metrics (I can't remember which exactly - delete pods? post pods? or something like that), which increased by almost 10x.

@wojtek-t
Member

This is the graph I had in mind - but yeah, it's more like 4x, not 10x. Still a quite huge regression:
[Screenshot from 2019-03-29 13-46-32: latency graph]

@wojtek-t wojtek-t added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 29, 2019
@wojtek-t wojtek-t added this to the v1.15 milestone Mar 29, 2019
@krzysied
Contributor

Reproducing the issue:

  1. Check out the kubernetes repo - https://github.com/kubernetes/kubernetes.git.
     • First broken version: 36e934d (go1.12)
     • Last working version: 8c95a07 (go1.11)
  2. Check out the perf-tests repo - https://github.com/kubernetes/perf-tests.git.
     This repo is home to clusterloader2 - the tool used for running the scalability tests.
  3. In the kubernetes directory, run “make quick-release”. This builds kubernetes from the checked-out version.
  4. Create a GCE cluster.

export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --check-leaked-resources \
    --gcp-master-size=n1-standard-8 \
    --gcp-node-image=gci \
    --gcp-node-size=n1-standard-8 \
    --gcp-nodes=16 \
    --provider=gce \
    --up \
    --test=false \
    --timeout=30m
  5. Run the kubemark test.
     This step creates a kubemark cluster (on top of the cluster created in step 4), runs the cluster loader test, and tears down the kubemark cluster.
export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export KUBEMARK_MASTER_COMPONENTS_QPS_LIMITS="--kube-api-qps=100 --kube-api-burst=100"

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --provider=gce \
    --kubemark \
    --kubemark-nodes=1000 \
    --test=false \
    --test-cmd=$GOPATH/src/k8s.io/perf-tests/run-e2e.sh \
    --test-cmd-args=cluster-loader2 \
    --test-cmd-args=--nodes=1000 \
    --test-cmd-args=--provider=kubemark \
    --test-cmd-args=--report-dir=/tmp/cluster_loader2/_artifacts \
    --test-cmd-args=--testconfig=testing/load/config.yaml \
    --test-cmd-args=--testoverrides=./testing/load/kubemark/500_nodes/override.yaml \
    --test-cmd-args="--enable-prometheus-server=false" \
    --test-cmd-name=ClusterLoaderV2 \
    --timeout=60m

The test run metrics will be exported to the directory provided by the --report-dir flag. Inside this directory there will be an APIResponsiveness_load_.txt file containing the 50th, 90th, and 99th percentiles of all apiserver call latencies. The latencies affected the most are:

  • Listing nodes
    {
    "Resource": "nodes",
    "Scope": "cluster",
    "Subresource": "",
    "Verb": "LIST"
    }

  • Pod deletion
    {
    "Resource": "pods",
    "Scope": "namespace",
    "Subresource": "",
    "Verb": "DELETE"
    }

  • Listing services
    {
    "Resource": "services",
    "Scope": "cluster",
    "Subresource": "",
    "Verb": "LIST"
    }

I would suggest using their 99th percentiles as the metric for this issue (see the parsing sketch at the end of this comment).

For comparison:
- Go1.11 run:

  • Listing nodes - 162.409ms
  • Pod deletion - 41.302ms
  • Listing services - 28.277ms

- Go1.12 run:

  • Listing nodes - 293.724ms
  • Pod deletion - 56.2ms
  • Listing services - 46.183ms
  6. Tear down the GCE cluster.
export KUBE_GCE_NETWORK=e2e-$USER
export KUBE_GCE_INSTANCE_PREFIX=e2e-$USER
export INSTANCE_PREFIX=e2e-$USER

export PROJECT_ID=<project_name>
export ZONE=<zone>

go run hack/e2e.go -- \
    --gcp-project=$PROJECT_ID \
    --gcp-zone=$ZONE \
    --check-leaked-resources \
    --gcp-master-size=n1-standard-8 \
    --gcp-node-image=gci \
    --gcp-node-size=n1-standard-8 \
    --gcp-nodes=16 \
    --provider=gce \
    --down \
    --test=false \
    --timeout=30m
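
For convenience, here is a minimal Go sketch of pulling the 99th percentiles of the affected calls out of that report. The file layout used here is an assumption - a JSON list of entries with the label fields shown above plus a latency map - so adjust the field names to whatever the real APIResponsiveness file actually contains:

// parselatency.go - hypothetical helper; the file layout and field names are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
)

type apiCall struct {
	Resource    string             `json:"Resource"`
	Scope       string             `json:"Scope"`
	Subresource string             `json:"Subresource"`
	Verb        string             `json:"Verb"`
	Latency     map[string]float64 `json:"Latency"` // assumed shape: {"Perc50": ..., "Perc90": ..., "Perc99": ...}, in ms
}

// affected reports whether this is one of the calls tracked in this issue.
func affected(c apiCall) bool {
	switch {
	case c.Verb == "LIST" && c.Resource == "nodes" && c.Scope == "cluster":
		return true
	case c.Verb == "DELETE" && c.Resource == "pods" && c.Scope == "namespace":
		return true
	case c.Verb == "LIST" && c.Resource == "services" && c.Scope == "cluster":
		return true
	}
	return false
}

func main() {
	// Path to the APIResponsiveness file from the --report-dir directory.
	data, err := ioutil.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	var calls []apiCall
	if err := json.Unmarshal(data, &calls); err != nil {
		panic(err)
	}
	for _, c := range calls {
		if affected(c) {
			fmt.Printf("%-6s %-8s p99=%.3fms\n", c.Verb, c.Resource, c.Latency["Perc99"])
		}
	}
}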

@wojtek-t wojtek-t changed the title request latency regressed in ci-kubernetes-e2e-gce-scale-performance API call latencies regressed due to upgrade to golang 1.12 Apr 1, 2019
@wojtek-t
Member

wojtek-t commented Apr 1, 2019

@kubernetes/sig-scalability-bugs

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 1, 2019
@wojtek-t
Member

wojtek-t commented Apr 1, 2019

We're trying to dig deeper to root-cause it to a smaller part of the system, but it is quite tricky.
Unfortunately, we don't have much more than guesses right now.

Some observations:

  • I looked into the DELETE pods calls in the kubemark-5000 runs, and it seems that in all of the calls for which we have traces (500ms+) - there are ~5000 of those in each of the recent runs, whereas previously the 99th percentile was <200ms - a significant part of the time is spent in "Object deleted from database".

  • That may suggest some etcd issue, but we don't see logs related to long-running calls on the etcd side. It may instead be an issue on the etcd client side, but we don't have anything concrete yet.

  • It's also possible that the golang upgrade changed something in serialization/deserialization - that's a second hypothesis (we've seen a bunch of traces around "Decoded delete options", and deserialization is also done in the etcd client), but I'm not convinced this is the case as of now.
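
For context on where those trace messages come from: the apiserver records a step-by-step trace for each request and only logs it when the total time exceeds a threshold, which is why we only have traces for the 500ms+ calls. A minimal self-contained sketch of that pattern in Go (not the real k8s.io tracing code; the step names are copied from the quotes above):

package main

import (
	"fmt"
	"time"
)

// step records when a named phase of the request finished.
type step struct {
	name string
	at   time.Time
}

// trace is a tiny stand-in for the apiserver's request-tracing utility.
type trace struct {
	name  string
	start time.Time
	steps []step
}

func newTrace(name string) *trace {
	return &trace{name: name, start: time.Now()}
}

func (t *trace) Step(name string) {
	t.steps = append(t.steps, step{name: name, at: time.Now()})
}

// LogIfLong prints the per-step breakdown only for slow requests, which is why
// only the 500ms+ DELETE calls show up with traces in the apiserver logs.
func (t *trace) LogIfLong(threshold time.Duration) {
	total := time.Since(t.start)
	if total < threshold {
		return
	}
	fmt.Printf("Trace %q (total %v):\n", t.name, total)
	prev := t.start
	for _, s := range t.steps {
		fmt.Printf("  %q took %v\n", s.name, s.at.Sub(prev))
		prev = s.at
	}
}

func main() {
	tr := newTrace("Delete /api/v1/namespaces/default/pods/example")
	tr.Step("Decoded delete options")       // deserialization (the second hypothesis above)
	time.Sleep(600 * time.Millisecond)      // pretend the etcd round trip is slow
	tr.Step("Object deleted from database") // the step dominating the slow traces
	tr.LogIfLong(500 * time.Millisecond)
}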

@wojtek-t
Member

wojtek-t commented Apr 5, 2019

We've spent significant time with golang folks and already have some findings.
To summarize:

  • the microbenchmark from [WIP] Golang 1.12 regressions benchmark #76080 enabled the Go team to find a regression on their side - the faulty golang commit is 07e738ec32025da458cdf968e4f991972471e6e9
  • to cite Go folks: "To summarize the change: I removed the "fast" path for small page-level allocations (which currently always need to acquire the heap lock) in order to simplify code for subsequent changes. In effect, span allocations now all go through a balanced tree rather than just a linked list for small spans (I think it was all spans under 128 pages?)."

We aren't fully convinced that this is the same thing that is causing the regression in our scalability tests, but analyzing traces, profiles, golang traces, etc. suggests that the problems are indeed around memory allocations or GC. So with high probability it is the same thing.

We are discussing how to prove that it is indeed the same issue and working with the Go team to see what we can do about it.
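
To make the quoted golang change concrete: with the page-level "fast path" gone, every span allocation has to take the heap lock and walk the free-span tree, so allocation-heavy parallel workloads contend on that lock. A minimal sketch of the kind of microbenchmark that can expose this (not the actual benchmark from #76080):

package allocbench

import "testing"

// BenchmarkParallelSpanAllocs allocates multi-page objects from many goroutines.
// Objects this large bypass the per-P caches, so each allocation needs a fresh
// span from the heap; comparing ns/op between go1.11 and go1.12 with
// "go test -bench=ParallelSpanAllocs -cpu=1,8,32" should show the contention.
func BenchmarkParallelSpanAllocs(b *testing.B) {
	const objSize = 64 << 10 // 64 KiB: several pages, allocated straight from the heap
	b.RunParallel(func(pb *testing.PB) {
		var sink []byte
		for pb.Next() {
			sink = make([]byte, objSize)
		}
		_ = sink // keep the last allocation referenced so it isn't optimized away
	})
}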

@wojtek-t
Member

It turned out that commit 07e738ec32025da458cdf968e4f991972471e6e9 isn't the main culprit for us - running scalability tests against that commit doesn't show a significant drop in metrics.

For lack of better ideas, we decided to bisect the changes in golang by running the full scalability suite on a 5k-node cluster against them. This will take a couple more days though...

@dims
Member

dims commented Apr 30, 2019

@wojtek-t yes i can, please let me know what needs to be done (so i can figure out if i have access before you head out)

@wojtek-t
Member

@dims - thanks a lot. I haven't done a Go upgrade for 2 years or so, so I'm not entirely sure how (or if) the process has changed in the meantime. @cblecker did the last one (#76576), so I'm sure he will be able to help with that.

@dims
Member

dims commented Apr 30, 2019

Ah, just update to the latest golang when it gets out - yes, I can do that. I was wondering if you wanted me to kick off some CI runs or something.

Have a good vacation @wojtek-t

@wojtek-t
Member

@dims - thanks a lot!
Yes - it wasn't clear, but I was testing that and it helps us, so there's no need to run any other tests (we will get final confirmation from the postsubmits).

@cblecker
Member

I can also help out with getting this upgrade in when it’s released. Thanks for all your work on this @wojtek-t!

@cblecker
Member

cblecker commented May 7, 2019

Looks like 1.12.5 went out this afternoon. The PR is open (#77528), but expect tests to fail:

  • Waiting for the docker hub image to publish. The PR is merged, so this should happen shortly.
  • Need a new version of the bazel rules_go to be published.

@wojtek-t
Member

wojtek-t commented May 7, 2019

Just to clarify the above (from @cblecker):

"oh, so I saw the auto update ran here: https://github.com/docker-library/golang/commits/master

but it looks like the update for the tags hasn't merged yet: docker-library/official-images#5867"

@dims
Member

dims commented May 7, 2019

Yea, it takes some time for the images to get published.

@cblecker
Member

cblecker commented May 7, 2019

Digest: sha256:e52f9c9ea80fa182a78bd83a9f9e107d173b8e7fd922dbb0c60c8020a88fb7f6
Status: Downloaded newer image for golang:1.12.5

Looks like new images are live.

@cblecker
Member

cblecker commented May 9, 2019

Reopening this so we can verify that everything is back to normal in CI

@cblecker cblecker reopened this May 9, 2019
@mm4tt
Contributor

mm4tt commented May 10, 2019

/assign

@cblecker
Member

Opened the cherry-pick: #77831

@tpepper
Member

tpepper commented May 13, 2019

@wojtek-t has SIG Scalability done any one-off custom scale/perf testing with the new golang patch release and the k8s 1.14 branch content?

Do you see this cherry pick as a 1.14.2 release blocker? Or would a week or two of focused testing with updated golang in the 1.14 k8s branch and a 1.14.3 release by end of May be a sufficiently safe yet expedient response?

@wojtek-t
Member

@wojtek-t has SIG Scalability done any one-off custom scale/perf testing with the new golang patch release and the k8s 1.14 branch content?

We did a bunch of tests to discover the real regression in golang 1.12 - the regression happened within the 1.14 release timeframe, and the bisection was done on that branch.
So yes - I'm 100% confident that it helps.

Do you see this cherry pick as a 1.14.2 release blocker? Or would a week or two of focused testing with updated golang in the 1.14 k8s branch and a 1.14.3 release by end of May be a sufficiently safe yet expedient response?

Tough question - I would love to see it in 1.14.2; on the other hand, given how long it took to debug and fix, I can see the release team pushing it to 1.14.3. But if we could have it in 1.14.2, it would be extremely helpful.

@hoegaarden
Contributor

#77831 just merged, and we will let 1.14 soak for one or two days before we cut v1.14.2.

Let's keep this issue open until we have some data on the latencies after that merge.

SIG Scalability, can you please provide some data / graphs about the latency in a (couple of) days for the release-1.14 branch?

@tpepper
Member

tpepper commented May 14, 2019

@wojtek-t is there any evidence we'll be able to observe in https://testgrid.k8s.io/sig-release-1.14-blocking for validation? Or is it possible for SIG Scalability to trigger any custom runs now that the cherry-pick has merged to the 1.14 branch?

@wojtek-t
Member

We aren't running large tests on release branches due to their cost...
We also don't have any free quota to run them now, due to the huge number of regressions at head that we're currently debugging.

But I'm 100% confident in this change. What I personally did in the past:

  • I ran tests against the commit where we bumped to go 1.12.0, but using go 1.12.5 instead (that commit is from the beginning of March, so still in the 1.14 timeline), and the effect was exactly the same as the effect we got at head after merging the upgrade.
  • I also understand the changes in Go between 1.12.4 and 1.12.5 and their effect on performance.
  • And we also understand this regression and why the fixes help - none of which has anything to do with changes happening in kubernetes.

So I'm entirely sure that, as long as you didn't observe any correctness issues, it is good to go from a performance perspective.

@hoegaarden
Contributor

@wojtek-t Thanks for your answer!

@wojtek-t
Member

With this being fixed in head and 1.14.2 released with the fix, I'm closing this one as fixed.
