Stackdriver logging sink failed to create on large clusters #51700

shyamjvs · 2017-08-31T11:59:13Z

The "Cluster level logging implemented by Stackdriver should ingest system logs from all nodes" e2e test is failing on our 2k-node GCE clusters (https://k8s-testgrid.appspot.com/google-gce-scale#gce-large-correctness) with the following error:

failed to create Stackdriver Logging sink: googleapi: Error 400: Filter cannot be longer than 20000 characters., badRequest

Seems to be because of creating a filter of O(#nodes) size:

Using the following filter for log entries: resource.type="gce_instance" AND (resource.labels.instance_id=9022783563020170818 OR resource.labels.instance_id=623491093136995906 OR resource.labels.instance_id=4780682787538400833 OR resource.labels.instance_id=5936394307195696705 OR resource.labels.instance_id=6219227236601230913 OR resource.labels.instance_id=5239226076018570818 OR resource.labels.instance_id=4293549143014307394 OR resource.labels.instance_id=6628560181373236803 OR resource.labels.instance_id=1898100776565012035 OR .....
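To illustrate why a filter built this way grows linearly with the node count, here is a minimal Go sketch. It is not the code from the e2e test: `buildInstanceFilter` is a hypothetical helper, the instance IDs are synthetic 19-digit strings, and the 20000-character limit is taken from the error message above.

```go
package main

import (
	"fmt"
	"strings"
)

// maxFilterLen is the limit quoted in the Stackdriver error message above.
const maxFilterLen = 20000

// buildInstanceFilter is a hypothetical helper (not the e2e test's code) that
// ORs one instance_id clause per node, mirroring the shape of the logged filter.
func buildInstanceFilter(instanceIDs []string) string {
	clauses := make([]string, 0, len(instanceIDs))
	for _, id := range instanceIDs {
		clauses = append(clauses, "resource.labels.instance_id="+id)
	}
	return fmt.Sprintf(`resource.type="gce_instance" AND (%s)`, strings.Join(clauses, " OR "))
}

func main() {
	for _, n := range []int{100, 400, 2000} {
		ids := make([]string, n)
		for i := range ids {
			// Synthetic 19-digit IDs, used only to exercise the length growth.
			ids[i] = fmt.Sprintf("%019d", i)
		}
		f := buildInstanceFilter(ids)
		fmt.Printf("nodes=%d filterLen=%d overLimit=%t\n", n, len(f), len(f) > maxFilterLen)
	}
}
```

Every node adds one clause plus an ` OR ` separator, so the filter length is a fixed prefix plus a constant per-node cost.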

cc @crassirostris @kubernetes/sig-instrumentation-bugs @kubernetes/sig-scalability-misc

shyamjvs · 2017-08-31T13:13:41Z

This is part of the correctness suite and large cluster tests are release-blocking.

piosz · 2017-08-31T18:05:25Z

@x13n could you please take a look since @crassirostris is OOO?

cc @fgrzadkowski

x13n · 2017-09-01T08:50:25Z

I can find some time to check this today.

@shyamjvs Is it possible to reproduce locally?

shyamjvs · 2017-09-01T10:08:24Z

Yes, it should be reproducible on a large cluster. The issue is that the filter string grows linearly with #nodes and then hits the Stackdriver API limit of 20000 characters.
From the above, each node seems to add about 52 bytes to the filter, so any cluster with >=385 nodes should see the problem.
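The threshold estimate above can be checked with a few lines of Go. This is only a sketch of the arithmetic: `minNodesOverLimit` is a hypothetical helper, the 52-byte per-node cost and the 20000-character limit come from this thread, and the fixed-overhead parameter is an added assumption (set to 0 to match the estimate as stated).

```go
package main

import "fmt"

// minNodesOverLimit returns the smallest node count n for which
// overhead + perNode*n exceeds limit. All values are in characters.
// Hypothetical helper; assumes limit >= overhead and perNode > 0.
func minNodesOverLimit(limit, overhead, perNode int) int {
	return (limit-overhead)/perNode + 1
}

func main() {
	// 52 characters per node and a 20000-character limit, as quoted in the thread.
	n := minNodesOverLimit(20000, 0, 52)
	fmt.Printf("smallest failing cluster size: %d nodes\n", n)
	fmt.Printf("filter length at that size: %d\n", n*52)
}
```

With these inputs, 384 nodes give 19968 characters and 385 nodes give 20020, which matches the >=385 figure.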

shyamjvs · 2017-09-01T10:09:53Z

Alternatively, if you don't want to create a large cluster, artificially bloating the filter here https://github.com/kubernetes/kubernetes/blob/master/test/e2e/instrumentation/logging/stackdrvier/utils.go#L216 should also reproduce it.

x13n · 2017-09-01T10:24:40Z

I don't think we should filter by each node id individually. I will try to see what breaks if I just remove this.

x13n · 2017-09-01T14:18:40Z

After a couple of hours, the test is still running. I will get back to this on Monday.

k8s-github-robot · 2017-09-01T19:03:15Z

[MILESTONENOTIFIER] Milestone Labels Complete

@crassirostris @shyamjvs

Issue label settings:

sig/instrumentation sig/scalability: Issue will be escalated to these SIGs if needed.
priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
kind/bug: Fixes a bug discovered during the current release.

Additional instructions available here

crassirostris · 2017-09-03T11:54:14Z

Hah! We need to set up a filter to separate one test running in the same project from another, since there's no resource for the K8s node.

I'll look into it

crassirostris · 2017-09-03T11:56:12Z

/cc @igorpeshansky @summit

Would really appreciate your help here

igorpeshansky · 2017-09-03T13:15:09Z

@crassirostris Would filtering by cluster name work?

crassirostris · 2017-09-03T13:22:52Z

@igorpeshansky The problem is that docker and kubelet logs are written against the GCE VM resource. There are two options: remove that filter and let all the logs flow, or make a prefix filter on the VM name.

igorpeshansky · 2017-09-03T13:29:12Z

@crassirostris Ah, then can GCE labels or tags be used?

crassirostris · 2017-09-03T14:34:15Z

@igorpeshansky There's the compute.googleapis.com/resource_name label, but the more I think about it, the less I like the idea. I'll just remove the filter. That shouldn't be a problem in this case, since VM ids are used for checking the presence of logs anyway. Thanks for chiming in!

…gs-filter

Automatic merge from submit-queue

Fix Stackdriver Logging tests for large clusters

Fixes #51700

Due to the limit on the length of the filter, filtering out all nodes in the cluster is not possible. Removing the filter shouldn't affect the tests, since the checks are made based on the nodeIds in the cluster that are unique anyway
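The reasoning in the fix, that an unfiltered sink is fine because the checks key on unique node IDs, can be sketched as a client-side check. This is an illustrative assumption about the shape of such a check, not the test's actual code: `logEntry` and `missingInstances` are hypothetical names.

```go
package main

import "fmt"

// logEntry is a hypothetical, simplified stand-in for an ingested log entry.
type logEntry struct {
	InstanceID string
	Message    string
}

// missingInstances returns the expected instance IDs for which no entry was
// seen. Entries from unrelated instances (for example, another test running
// in the same project) are ignored, so no server-side filter is needed.
func missingInstances(expected []string, entries []logEntry) []string {
	seen := make(map[string]bool, len(expected))
	for _, id := range expected {
		seen[id] = false
	}
	for _, e := range entries {
		if _, ok := seen[e.InstanceID]; ok {
			seen[e.InstanceID] = true
		}
	}
	var missing []string
	for _, id := range expected {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	expected := []string{"111", "222", "333"}
	entries := []logEntry{
		{InstanceID: "111", Message: "kubelet started"},
		{InstanceID: "999", Message: "entry from an unrelated VM"},
		{InstanceID: "333", Message: "docker started"},
	}
	fmt.Println("instances without logs:", missingInstances(expected, entries))
}
```

The trade-off is that the sink exports more entries than the test needs, while correctness depends only on the per-ID lookup.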

shyamjvs · 2017-09-07T01:13:44Z

Ref #51718

shyamjvs added this to the v1.8 milestone Aug 31, 2017

shyamjvs assigned crassirostris Aug 31, 2017

k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/bug Categorizes issue or PR as related to a bug. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Aug 31, 2017

shyamjvs added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Aug 31, 2017

DirectXMan12 mentioned this issue Aug 31, 2017

Move autoscaling/v2 from alpha1 to beta1 #50708

Merged

shyamjvs mentioned this issue Aug 31, 2017

Failing e2e tests on scalability jobs #51718

Closed

k8s-github-robot added the milestone-labels-incomplete label Sep 1, 2017

shyamjvs added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed milestone-labels-incomplete labels Sep 1, 2017

k8s-github-robot added the milestone-labels-complete label Sep 1, 2017

crassirostris mentioned this issue Sep 4, 2017

Fix Stackdriver Logging tests for large clusters #51913

Merged

k8s-github-robot closed this as completed in #51913 Sep 4, 2017
