
Server-side apply protobuf serialization performance issue #76219

Closed
apelisse opened this issue Apr 5, 2019 · 20 comments

@apelisse (Member) commented Apr 5, 2019

Context

Kubernetes 1.14, ServerSideApply feature enabled on kube-apiserver.

Introduction

Server-Side Apply (SSA) seems to be triggering a performance issue on protobuf serialization of objects, specifically while serializing the new managedFields recursive map. We detected the problem because the scalability tests mostly indicated problems in LISTing objects (pods and nodes), even though SSA shouldn't impact this code path.

At this point, the problems are probably related to the size of the objects (increased significantly by the additional field), which would lead to:

  • more serialization/deserialization,
  • more data transfer between client and apiserver,
  • more data transfer between apiserver and etcd

There are probably many other effects as well; let's try to reproduce.

Steps to reproduce

I'm now comparing the following two setups:

# Running without the feature enabled
STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf \
FEATURE_GATES=ServerSideApply=false \
PATH=$PWD/third_party/etcd:$PATH \
hack/local-up-cluster.sh -O

# Running with the feature enabled
STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf \
FEATURE_GATES=ServerSideApply=true \
PATH=$PWD/third_party/etcd:$PATH \
hack/local-up-cluster.sh -O

I first create a random deployment with 1000 replicas; the pods don't have to actually start, but they should be successfully created.

My benchmark is to run the following Python 3 script:

import urllib.request
import time
import statistics

request = urllib.request.Request(
    "http://localhost:8080/api/v1/pods",
    headers={
        "Accept": "application/vnd.kubernetes.protobuf",
    }
)

size = []
duration = []
NUM_RUNS = 50
for _ in range(NUM_RUNS):
  b = time.perf_counter()
  size += [len(urllib.request.urlopen(request).read())]
  duration += [(time.perf_counter() - b) * 1000]

print("%d runs:" % NUM_RUNS)
print("size min/max/avg/stddev: %d/%d/%d/%d" % (
    min(size), max(size), statistics.mean(size), statistics.stdev(size)
))
print("duration min/max/avg/stddev: %d/%d/%d/%d" % (
    min(duration),
    max(duration),
    statistics.mean(duration),
    statistics.stdev(duration)
))

Without the feature-flag, the results look like this (YMMV):

> python3 test.py 
50 runs:
size min/max/avg/stddev: 980436/980470/980450/17
duration min/max/avg/stddev: 22/47/31/5

With the feature-flag, the results look like this:

50 runs:
size min/max/avg/stddev: 2465211/2465211/2465211/0
duration min/max/avg/stddev: 97/146/110/13

Investigation

Since there is a clear difference between the two, let's try to profile it.

Let's send queries to the apiserver in a loop so we can see what it's doing:

while true
do curl --silent -H "Accept: application/vnd.kubernetes.protobuf" \
  http://localhost:8080/api/v1/pods >/dev/null
done

In a different window, start profiling:

go tool pprof http://localhost:8080/debug/pprof/profile?seconds=30
... (wait)
> kcachegrind # If you don't know this, it's absolutely awesome.

[Screenshot from 2019-04-05 15-03-13: kcachegrind view of the apiserver CPU profile]

Let's look at that method:
https://github.com/kubernetes/kubernetes/blob/v1.14.0/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/generated.pb.go#L2107-L2120

It basically does a full tree traversal to compute how big the allocated buffer needs to be; that's OK on its own.

The problem is that this method is called again and again at every depth of the tree while serializing the map:
https://github.com/kubernetes/kubernetes/blob/v1.14.0/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/generated.pb.go#L737-L774

which means that the marshaling complexity is going to be proportional to the square of the depth of the map.
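
To make the cost concrete, here is a small self-contained Go sketch (mine, not the generated code; the Node type and its names are made up) that mimics the pattern: Size() walks the whole subtree, and MarshalTo calls Size() on its child at every level before writing it, so the number of Size() calls grows quadratically with the depth:

package main

import (
	"encoding/binary"
	"fmt"
)

// Node mimics the shape of the generated code: Size() walks the whole
// subtree, and MarshalTo writes each child's size (as a varint) before the
// child's bytes, calling Size() again at every level.
type Node struct {
	Child *Node
}

var sizeCalls int // counts how often Size() runs

func (n *Node) Size() int {
	sizeCalls++
	if n.Child == nil {
		return 1 // leaf payload byte
	}
	return 2 + n.Child.Size() // tag byte + 1-byte length + child payload
}

func (n *Node) MarshalTo(data []byte) int {
	if n.Child == nil {
		data[0] = 0x08 // leaf payload
		return 1
	}
	i := 0
	data[i] = 0xa // field 1, wire type 2 (length-delimited)
	i++
	i += binary.PutUvarint(data[i:], uint64(n.Child.Size())) // full re-traversal of the subtree
	return i + n.Child.MarshalTo(data[i:])                   // ...which repeats this one level down
}

func main() {
	for _, depth := range []int{5, 10, 20, 40} {
		var root *Node
		for i := 0; i < depth; i++ {
			root = &Node{Child: root}
		}
		sizeCalls = 0
		buf := make([]byte, root.Size())
		root.MarshalTo(buf)
		fmt.Printf("depth=%2d  Size() calls=%4d\n", depth, sizeCalls)
	}
}

Doubling the depth roughly quadruples the number of Size() calls.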

My suspicion, though, is that this problem is not limited to maps but applies to all objects we serialize in Kubernetes. Let's look at it, taking another type at random from the same generated file (you can look at any other protobuf-generated file; they all do the same thing):
https://github.com/kubernetes/kubernetes/blob/v1.14.0/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/generated.pb.go#L1672-L1711

We can see that it computes the size of the entire sub-object (by recursing as deeply as it can) before serializing each individual field, and each of those fields will recurse through its entire sub-object AGAIN, and again and again ...

Detecting these with profiling is much harder because they don't all call the same method (each field has a different type). Running on the recursive map made it obvious because all the calls are aggregated under the same function.

Solutions

Marshaling protobuf objects requires writing the size of each sub-object before that sub-object's actual bytes. This size is also marshaled as a "varint", which prevents you from writing the bytes first and then updating the size in place; you would have to shift the data very frequently to do that.
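
As a quick illustration of the varint point (using the standard library's varint encoder, which uses the same base-128 encoding as the generated code): the number of bytes a length occupies depends on its value, so the prefix can't simply be reserved and patched in after the payload has been written.

package main

import (
	"encoding/binary"
	"fmt"
)

// The encoded width of a varint depends on its value, so the length prefix of
// a sub-object can't be reserved up front and filled in afterwards without
// shifting the data that follows it.
func main() {
	buf := make([]byte, binary.MaxVarintLen64)
	for _, length := range []uint64{100, 1000, 1000000} {
		fmt.Printf("length %7d -> %d byte(s)\n", length, binary.PutUvarint(buf, length))
	}
}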

A good alternative is to memoize the size of each object so that it doesn't have to be recomputed over and over. Interestingly, https://github.com/golang/protobuf does add a size-cache field (XXX_sizecache) to the Go structs it generates from proto files. Ideally, the memoization would be "local" to an individual serialization pass, to avoid problems with invalidating the cache, running multiple serializations at the same time, or modifying the objects themselves.
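
As a sketch of what such memoization could look like (my illustration of the idea, reusing the toy Node type from the earlier sketch, not a proposal for the actual generated code): the sizes are computed in a single bottom-up pass into a cache that lives only for the duration of one marshal call, so the objects themselves are never mutated and concurrent serializations don't interfere.

package main

import (
	"encoding/binary"
	"fmt"
)

// Node is the same toy type as in the earlier sketch.
type Node struct {
	Child *Node
}

// precomputeSizes walks the tree exactly once, bottom-up, and records every
// node's encoded size in a cache scoped to this serialization.
func precomputeSizes(n *Node, cache map[*Node]int) int {
	size := 1 // leaf payload byte
	if n.Child != nil {
		size = 2 + precomputeSizes(n.Child, cache) // tag + 1-byte length + child payload
	}
	cache[n] = size
	return size
}

// marshalTo reads the child's size from the cache instead of re-traversing
// the subtree, so the whole marshal is two linear passes instead of O(depth^2).
func marshalTo(n *Node, data []byte, cache map[*Node]int) int {
	if n.Child == nil {
		data[0] = 0x08
		return 1
	}
	i := 0
	data[i] = 0xa
	i++
	i += binary.PutUvarint(data[i:], uint64(cache[n.Child])) // O(1) lookup
	return i + marshalTo(n.Child, data[i:], cache)
}

func main() {
	var root *Node
	for i := 0; i < 20; i++ {
		root = &Node{Child: root}
	}
	cache := map[*Node]int{}
	buf := make([]byte, precomputeSizes(root, cache)) // one pass to size
	n := marshalTo(root, buf, cache)                  // one pass to write
	fmt.Printf("marshaled %d bytes with a single size pass\n", n)
}

Keeping the cache keyed by pointer and scoped to a single call avoids the cache-invalidation and concurrency concerns, at the cost of a map allocation per serialization.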

The next step would probably be to come up with a relevant experiment to see how bad this actually is, both for the SSA map and for the rest of the objects.


@lavalamp (Member) commented Apr 5, 2019

Thanks for the exhaustive write up!!

@smarterclayton (Contributor) commented Apr 6, 2019

We should only ever be calling MarshalTo (which should never call size). Is there something in the middle flipping to Marshal()?

@smarterclayton (Contributor) commented Apr 6, 2019

Jordan says MarshalTo is calling Size on children - it shouldn’t be.

@smarterclayton (Contributor) commented Apr 6, 2019

I would have to look later but Size should never be called ever within MarshalTo. We have AST rewrites for embedded generators - maybe that accidentally started doing this at some point rather than what it was supposed to do.

@apelisse (Member, Author) commented Apr 6, 2019

I don’t see how it would not be called? The size of each structure has to be marshaled before the marshaled structure itself. And it can’t wait to see how big the marshaled data is because it uses varint so it doesn’t know where the data should be written.

@liggitt (Member) commented Apr 6, 2019

it's done that since the first version we used proto:

https://github.com/kubernetes/kubernetes/blob/release-1.3/pkg/api/v1/generated.pb.go#L5106-5136

func (m *Pod) MarshalTo(data []byte) (int, error) {
	var i int
	_ = i
	var l int
	_ = l
	data[i] = 0xa
	i++
	i = encodeVarintGenerated(data, i, uint64(m.ObjectMeta.Size()))
	n105, err := m.ObjectMeta.MarshalTo(data[i:])
	if err != nil {
		return 0, err
	}
	i += n105
	data[i] = 0x12
	i++
	i = encodeVarintGenerated(data, i, uint64(m.Spec.Size()))
	n106, err := m.Spec.MarshalTo(data[i:])
	if err != nil {
		return 0, err
	}
	i += n106
	data[i] = 0x1a
	i++
	i = encodeVarintGenerated(data, i, uint64(m.Status.Size()))
	n107, err := m.Status.MarshalTo(data[i:])
	if err != nil {
		return 0, err
	}
	i += n107
	return i, nil
}

and every child MarshalTo method does the same

@smarterclayton (Contributor) commented Apr 6, 2019

Ah, because we didn’t inline the embedded structs but instead modeled them as messages when we didn’t have to.

@smarterclayton (Contributor) commented Apr 6, 2019

It’s amazing we got this far.

We already have to modify objects during serialization (gvk). If we’re careful, I think we can get away with inline memoizing. I’m slightly worried about the cost implications in memory use for smaller structs though.

We should definitely add a much larger nested object to the serialization and conversion and copy benchmarks. We relied on fuzzing before but a crafted worst case and best case would be better.

@wojtek-t (Member) commented Apr 8, 2019

@apelisse - thanks a lot for filing this, it's really useful.
I definitely agree that we need to add better cases to our benchmarks as a first step.

> We already have to modify objects during serialization (gvk). If we’re careful, I think we can get away with inline memoizing. I’m slightly worried about the cost implications in memory use for smaller structs though.

@smarterclayton - can you clarify? I think I'm not really following - depending on gvk we may serialize to completely different objects, so I don't really understand it.



@apelisse (Member, Author) commented Apr 8, 2019

I have some more experiments that I just ran, confirming the suspicions:

$ go get github.com/gogo/protobuf/protoc-gen-gofast
$ cat <<EOF >test_test.go
package test

import (
	"testing"
)

var result []byte

func BenchmarkWideProtobuf(b *testing.B) {
	test := Test{
		Tests: map[string]*Test{
			"a": nil,
			"b": nil,
			"c": nil,
			"d": nil,
			"e": nil,
			"f": nil,
			"g": nil,
			"h": nil,
			"i": nil,
			"j": nil,
		},
	}
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf(b *testing.B) {
	test := Test{
		Tests: map[string]*Test{
			"a": &Test{
				Tests: map[string]*Test{
					"b": &Test{
						Tests: map[string]*Test{
							"c": &Test{
								Tests: map[string]*Test{
									"d": &Test{
										Tests: map[string]*Test{
											"e": &Test{
												Tests: map[string]*Test{
													"f": &Test{
														Tests: map[string]*Test{
															"g": &Test{
																Tests: map[string]*Test{
																	"h": &Test{
																		Tests: map[string]*Test{
																			"i": &Test{
																				Tests: map[string]*Test{
																					"j": nil,
																				},
																			},
																		},
																	},
																},
															},
														},
													},
												},
											},
										},
									},
								},
							},
						},
					},
				},
			},
		},
	}
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}
EOF
$ cat <<EOF >test.proto
syntax = "proto2";

message test {
  map<string, test> tests = 1;
}
EOF
$ protoc --gofast_out=. *.proto
$ go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf-12    	 2000000	       743 ns/op
BenchmarkDeepProtobuf-12    	  200000	      6188 ns/op
PASS
ok  	_/usr/local/google/home/apelisse/code/proto-test	3.597s

@apelisse (Member, Author) commented Apr 8, 2019

Adding two extra items to both deep and wide:

$ go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf-12    	 2000000	       894 ns/op
BenchmarkDeepProtobuf-12    	  200000	      8753 ns/op
PASS
ok  	_/usr/local/google/home/apelisse/code/proto-test	4.558s

@lavalamp (Member) commented Apr 8, 2019

That's a 20% increase vs 40% increase for wide vs deep, seems like a good problem reproduction.

@lavalamp (Member) commented Apr 8, 2019

I'm not sure it was avoidable to model things as embedded messages rather than inline, given our expectations around unset / nil behavior.

@lavalamp (Member) commented Apr 8, 2019

Put some more numbers from @apelisse in a chart:

[chart: benchmark numbers from @apelisse plotted against nesting depth N]

The fixed overhead of this codepath must be very low. This is clearly quadratic behavior already at values of N that we will actually reach.

@wojtek-t (Member) commented Apr 9, 2019

I don't think we reach N=11 (maybe, but it's fairly uncommon), but something like 5 is definitely what we have. That's really interesting to see...


@apelisse (Member, Author) commented Apr 9, 2019

# Measure the maximum nesting depth of the pod list returned by the apiserver (Python 2).
import json
import urllib2

def depth(d, level=1):
    if isinstance(d, dict) and d:
        return max(depth(d[k], level + 1) for k in d)
    if isinstance(d, list) and d:
        return max(depth(k, level + 1) for k in d)
    return level

print depth(json.load(urllib2.urlopen(urllib2.Request(
    "http://localhost:8001/api/v1/pods",
    headers={
        "Accept": "application/json",
    }
))))

Running this script against a cluster with 1 basic nginx pod gives me a depth of 14 (without server-side apply): items[0].spec.volumes[0].projected.sources[0].downwardAPI.items[0].fieldRef.fieldPath (probably off by one somehow but whatever).
