Stackdriver logging sink failed to create on large clusters #51700

Closed
shyamjvs opened this issue Aug 31, 2017 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@shyamjvs
Member

The "Cluster level logging implemented by Stackdriver should ingest system logs from all nodes" e2e test is failing on our 2k-node GCE clusters (https://k8s-testgrid.appspot.com/google-gce-scale#gce-large-correctness) with the following error:

failed to create Stackdriver Logging sink: googleapi: Error 400: Filter cannot be longer than 20000 characters., badRequest

This seems to be because the test creates a filter of O(#nodes) size:

Using the following filter for log entries: resource.type="gce_instance" AND (resource.labels.instance_id=9022783563020170818 OR resource.labels.instance_id=623491093136995906 OR resource.labels.instance_id=4780682787538400833 OR resource.labels.instance_id=5936394307195696705 OR resource.labels.instance_id=6219227236601230913 OR resource.labels.instance_id=5239226076018570818 OR resource.labels.instance_id=4293549143014307394 OR resource.labels.instance_id=6628560181373236803 OR resource.labels.instance_id=1898100776565012035 OR .....

cc @crassirostris @kubernetes/sig-instrumentation-bugs @kubernetes/sig-scalability-misc

@shyamjvs shyamjvs added this to the v1.8 milestone Aug 31, 2017
@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/bug Categorizes issue or PR as related to a bug. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Aug 31, 2017
@shyamjvs
Member Author

This is part of the correctness suite and large cluster tests are release-blocking.

@shyamjvs shyamjvs added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Aug 31, 2017
@piosz
Member

piosz commented Aug 31, 2017

@x13n could you please take a look since @crassirostris is OOO?

cc @fgrzadkowski

@x13n
Member

x13n commented Sep 1, 2017

I can find some time to check this today.

@shyamjvs Is it possible to reproduce locally?

@shyamjvs
Member Author

shyamjvs commented Sep 1, 2017

Yes, it should be reproducible on any large cluster. The filter string grows linearly with the number of nodes and eventually hits the Stackdriver API's limit of 20000 characters.
From the filter above, each node seems to add about 52 bytes, so any cluster of >= 385 nodes (20000 / 52 ≈ 385) should see the problem.
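
For anyone who wants to check the arithmetic without a large cluster, here is a minimal sketch that rebuilds a filter in the shape shown in the test log and finds the smallest node count that exceeds the limit. The function names and the placeholder 19-digit ID are hypothetical, and it assumes the 20000-character limit applies to the raw filter string; this is not the e2e test's actual code.

```python
FILTER_LIMIT = 20000  # limit quoted in the Stackdriver error message


def build_filter(instance_ids):
    """Build a per-instance filter in the shape shown in the test log."""
    clauses = " OR ".join(
        f"resource.labels.instance_id={i}" for i in instance_ids
    )
    return f'resource.type="gce_instance" AND ({clauses})'


def smallest_failing_cluster(id_digits=19, limit=FILTER_LIMIT):
    """Return the smallest node count whose filter is longer than the limit."""
    fake_id = "9" * id_digits  # placeholder ID, not a real instance
    n = 1
    while len(build_filter([fake_id] * n)) <= limit:
        n += 1
    return n


if __name__ == "__main__":
    print(smallest_failing_cluster())
```

Under these assumptions each node adds 51 characters and the sketch reports 392 nodes, which is in the same ballpark as the rough estimate above. The exact threshold shifts slightly with the length of the instance IDs.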

@shyamjvs
Member Author

shyamjvs commented Sep 1, 2017

Alternatively, if you don't want to create a large cluster, artificially bloating the filter here https://github.com/kubernetes/kubernetes/blob/master/test/e2e/instrumentation/logging/stackdrvier/utils.go#L216 should also reproduce it.

@x13n
Member

x13n commented Sep 1, 2017

I don't think we should filter by each node ID individually. I'll try to see what breaks if I just remove this filter.

@x13n
Member

x13n commented Sep 1, 2017

After a couple of hours, the test is still running. I will get back to this on Monday.

@shyamjvs shyamjvs added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed milestone-labels-incomplete labels Sep 1, 2017
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Labels Complete

@crassirostris @shyamjvs

Issue label settings:

  • sig/instrumentation sig/scalability: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Additional instructions available here

@crassirostris

Hah! We need to set up a filter to separate one test running in the same project from another, since there's no resource for the K8s node.

I'll look into it

@crassirostris

/cc @igorpeshansky @summit

Would really appreciate your help here

@igorpeshansky

@crassirostris Would filtering by cluster name work?

@crassirostris

crassirostris commented Sep 3, 2017

@igorpeshansky The problem is that Docker and kubelet logs are written against the GCE VM resource. There are two options: remove that filter and let all the logs flow, or make a prefix filter on the VM name.

@igorpeshansky

@crassirostris Ah, then can GCE labels or tags be used?

@crassirostris

@igorpeshansky There's the compute.googleapis.com/resource_name label, but the more I think about it, the less I like the idea. I'll just remove the filter. That shouldn't be a problem in this case, since VM IDs are used for checking the logs' presence anyway. Thanks for chiming in!
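
To illustrate why dropping the sink filter should be safe, here is a rough sketch of a presence check keyed on VM IDs. The entry shape (a dict with an `instance_id` key) and the function name are hypothetical and do not reflect the e2e test's actual code; the point is only that entries from unrelated VMs in the same project are ignored by the check itself.

```python
def missing_nodes(entries, expected_ids):
    """Return the expected instance IDs for which no log entry was seen."""
    expected = set(expected_ids)
    # Entries from VMs outside this cluster are simply skipped here,
    # so they do not need to be excluded by the sink filter.
    seen = {e["instance_id"] for e in entries if e["instance_id"] in expected}
    return expected - seen
```

With this kind of check, an unfiltered sink only costs extra entries to scan; it does not change which nodes are reported as missing.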

k8s-github-robot pushed a commit that referenced this issue Sep 4, 2017
…gs-filter

Automatic merge from submit-queue

Fix Stackdriver Logging tests for large clusters

Fixes #51700

Due to the limit on the length of the filter, filtering by every node in the cluster is not possible. Removing the filter shouldn't affect the tests, since the checks are made based on the nodeIds in the cluster, which are unique anyway.
@shyamjvs
Member Author

shyamjvs commented Sep 7, 2017

Ref #51718


7 participants