Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. #62076

Merged
merged 1 commit into kubernetes:master on Apr 6, 2018

Conversation

@qingling128
Contributor

qingling128 commented Apr 3, 2018

What this PR does / why we need it:

Which issue(s) this PR fixes
Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources.
Also, Fluentd 0.14 has some memory leak issues that caused the e2e tests to be flaky, so we are downgrading to v0.12.

Special notes for your reviewer:
We never released any version with Fluentd v0.14; we only upgraded to it very recently, so this downgrade is not visible to users.

Release note:

Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources.
@k8s-ci-robot

Contributor

k8s-ci-robot commented Apr 3, 2018

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple of minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@qingling128

Contributor

qingling128 commented Apr 3, 2018

Just signed the CLA.

@qingling128

Contributor

qingling128 commented Apr 3, 2018

CC: @piosz

@k8s-ci-robot k8s-ci-robot added size/M and removed size/XS labels Apr 3, 2018

@qingling128

Contributor

qingling128 commented Apr 3, 2018

Just for reference, the original PR where we updated to an image using Fluentd v0.14 and adjusted the config file's parser section is #59128.

As part of the downgrade to v0.12, we had to revert the parser section config change.

@qingling128

Contributor

qingling128 commented Apr 3, 2018

/assign @piosz

@qingling128 qingling128 changed the title from Update fluentd_gcp_version to 0.2-1.5.30-1-k8s to include downgrading Fluentd to v0.12. to Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12) and remove k8s_compatibility labels. Apr 3, 2018

@qingling128 qingling128 changed the title from Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12) and remove k8s_compatibility labels. to Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12), remove k8s_compatibility labels and let Logging Agent talk to Metadata Agent. Apr 4, 2018

# resource. The format is:
# 'k8s_container.<namespace_name>.<pod_name>.<container_name>'.
"logging.googleapis.com/local_resource_id" ${"k8s_container.#{tag_suffix[4].rpartition('.')[0].split('_')[1]}.#{tag_suffix[4].rpartition('.')[0].split('_')[0]}.#{tag_suffix[4].rpartition('.')[0].split('_')[2].rpartition('-')[0]}"}
# Rename the field 'log' to a more generic field 'message'. This way the
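To make the embedded Ruby above easier to follow, here is a hedged sketch in plain Ruby of the same decomposition, using a hypothetical log file name (real values come from the record_reformer plugin's tag_suffix array):

    # Sketch only: mirrors the one-liner above. Container log files are named
    # <pod_name>_<namespace_name>_<container_name>-<container_id>.log, and
    # tag_suffix[4] in the config holds that file name.
    tag = 'metadata-agent-x1z9q_kube-system_metadata-agent-0a1b2c3d.log'

    base = tag.rpartition('.')[0]                    # drop the '.log' extension
    pod, namespace, container_with_id = base.split('_')
    container = container_with_id.rpartition('-')[0] # drop the container id

    puts "k8s_container.#{namespace}.#{pod}.#{container}"
    # => k8s_container.kube-system.metadata-agent-x1z9q.metadata-agent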

@qingling128

qingling128 Apr 4, 2018

Contributor

For the gke_container resource, we send the log field as textPayload after time/severity/stream are extracted. For every other resource, we send the message field as textPayload if it is the only field.

To ensure the same behavior for k8s_container, we can either rename the log field to message in the configuration, or hardcode an exception in the fluent-plugin-google-cloud gem next to the gke_container one.

I'm slightly leaning towards handling it via configuration instead of hardcoding it in the gem. The only issue is that the behavior might differ when there are additional fields (other than time/severity/stream/log): for the gke_container resource we seem to drop all other fields (that feels like a bug to me), while for the k8s_container resource we will keep the jsonPayload structure and retain the additional fields.

WDYT? @bmoyles0117 @igorpeshansky

@igorpeshansky

igorpeshansky Apr 4, 2018

For gke_container, the implementation and the configuration on the Kubernetes side were carefully crafted to work together. This, as you observed, was very fragile.
The semantics you propose are simple and straightforward. We can document that the logging agent expects the text payload in a message field, and rename log to message in the configuration.
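For concreteness, a minimal sketch of the rename Igor describes, built on the stock record_transformer filter (the match pattern and surrounding config here are assumptions; the PR's actual block may differ):

    <filter reform.**>
      @type record_transformer
      enable_ruby true
      <record>
        # Copy the container runtime's 'log' field into the generic
        # 'message' field that the agent treats as the text payload.
        message ${record['log']}
      </record>
      remove_keys log
    </filter>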

@qingling128

qingling128 Apr 4, 2018

Contributor

Sounds good.

@bmoyles0117

bmoyles0117 Apr 5, 2018

Contributor

+1 to what Igor said.

@x13n

Member

x13n commented Apr 4, 2018

/ok-to-test

args:
- /bin/sh
- -c
- STACKDRIVER_METADATA_AGENT_URL="http://${NODE_NAME}:8000" /usr/sbin/google-fluentd

@x13n

x13n Apr 4, 2018

Member

Can't STACKDRIVER_METADATA_AGENT_URL be set via env?

@x13n

x13n Apr 4, 2018

Member

Also, double-checking - is this the right port?

@qingling128

qingling128 Apr 4, 2018

Contributor

Changed STACKDRIVER_METADATA_AGENT_URL to be configurable via an env var. PTAL.

Nice catch regarding the port. We use 8000 in our testing cluster; the actual metadata agent YAML uses 8799 instead. Fixed.

@@ -366,6 +366,11 @@ data:
</record>
</filter>
# Do not collect fluentd's own logs to avoid infinite loops.
<match fluent.**>

@x13n

x13n Apr 4, 2018

Member

Does that mean no logs from fluentd will be exported?

@qingling128

qingling128 Apr 4, 2018

Contributor

Yes. This is also the behavior of our default packaging.

It's on our roadmap to support exporting Fluentd's own logs, but it has to be handled carefully.

The issue right now is that when Fluentd runs into issues ingesting any logs, it emits a Fluentd log entry. When it has issues ingesting its own logs, that triggers an infinite loop and leads to agent crashes.

@bmoyles0117

bmoyles0117 Apr 5, 2018

Contributor

Ling, have you confirmed that this section does what it tries to do? As far as I'm aware, fluentd's logs are emitted to stdout/stderr, making this tag meaningless. If you've confirmed this behavior works as intended, great; but if it doesn't add any value, I would suggest removing it.

@qingling128

qingling128 Apr 5, 2018

Contributor

Yes, this indeed takes effect. I encountered infinite loops, and this change fixed them.

A sample ingested Fluentd log entry looks like this:

{
  "insertId": "9ppb4lg2hn1rtn",
  "logName": "projects/stackdriver-kubernetes-1337/logs/fluent.info",
  "receiveTimestamp": "2018-04-04T01:27:05.638928314Z",
  "resource": {
    "labels": {
      "cluster_name": "stackdriver-metadata-e2e",
      "location": "us-central1-a",
      "node_name": "gke-stackdriver-metadata-default-pool-c622577e-vlmk",
      "project_id": "stackdriver-kubernetes-1337"
    },
    "type": "k8s_node"
  },
  "textPayload": "following tail of /var/log/containers/metadata-agent-api-availability-test-p469t_metadata-agent-api-availability-test_metadata-agent-api-availability-test-ef878dbc383e0e59b558103f2acfa446c4ca8d8520695460f5764e5a5719b468.log",
  "timestamp": "2018-04-04T01:27:04Z"
}

As we can see, Fluentd logs are tagged fluent.info, fluent.error and fluent.warn.

This is because Fluentd emits its own logs directly as Fluentd events.
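For completeness, the drop rule under discussion amounts to routing fluent.** events to a null output; a minimal sketch (the null output plugin here is an assumption, though it is the stock way to discard events):

    # Do not collect fluentd's own logs, to avoid the infinite loop described
    # above: an ingestion error emits a fluent.* event, which would itself be
    # ingested and could fail again.
    <match fluent.**>
      @type null
    </match>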

@k8s-ci-robot k8s-ci-robot added size/XL and removed size/M labels Apr 4, 2018

@qingling128

All feedback addressed.

  • Grouped fluentd config setups into a setup-fluentd method.
  • Made metadata_agent_url configurable via env var.
  • Added an env var to control whether to ingest logs against old or new resources.

The contributor guide seems to suggest not squashing unless asked to by reviewers, so I'll leave these commits separate for now.

@@ -0,0 +1,425 @@
# This ConfigMap is used to ingest logs against old resources like

@qingling128

qingling128 Apr 4, 2018

Contributor

NOTE: This file is roughly a copy of fluentd-gcp-configmap.yaml. The only difference is whether local_resource_id is set up and whether talking to the Metadata Agent is enabled. With those, logs are ingested against k8s_container and k8s_node; without them, logs are ingested against gke_container and gce_instance.

@bmoyles0117

bmoyles0117 Apr 5, 2018

Contributor

+1 to what you've done here.

@@ -51,26 +57,18 @@ data:
pos_file /var/log/gcp-containers.log.pos
tag reform.*
read_from_head true
format none
format multi_format

@qingling128

qingling128 Apr 4, 2018

Contributor

NOTE: This entire format change reverts the changes we made in https://github.com/kubernetes/kubernetes/pull/59128/files#diff-dec77c261fefaa453d67b0c26b4b07c2 to adapt to Fluentd v0.14 syntax. As we downgrade back to v0.12 in this PR, the syntax needs to be reverted as well.
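As a rough sketch of the reverted v0.12 syntax, the source section pairs the multi_format parser with a JSON pattern (Docker json-file logs) and a regexp fallback; the exact patterns below are assumptions, not a verbatim copy of the ConfigMap:

    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/gcp-containers.log.pos
      tag reform.*
      read_from_head true
      # fluent-plugin-multi-format-parser tries each <pattern> in order.
      format multi_format
      <pattern>
        format json
        time_key time
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </pattern>
      <pattern>
        format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
        time_format %Y-%m-%dT%H:%M:%S.%N%:z
      </pattern>
    </source>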

@@ -83,7 +81,19 @@ data:
<match reform.**>
@type record_reformer

@qingling128

qingling128 Apr 4, 2018

Contributor

NOTE: This whole record_reformer section sets up the local_resource_id for the k8s_container resource. It's present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

</match>
# Attach local_resource_id for 'k8s_node' monitored resource.

@qingling128

qingling128 Apr 4, 2018

Contributor

NOTE: This entire record_reformer section sets up the local_resource_id for the k8s_node resource. It's present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).
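Since the node-level format is 'k8s_node.<node_name>', a hedged sketch of what such a section could look like (the filter plugin, match pattern, and NODE_NAME env var here are assumptions for illustration):

    # Attach the 'k8s_node' local_resource_id, assuming NODE_NAME is injected
    # into the container environment via the Kubernetes downward API.
    <filter **>
      @type record_transformer
      enable_ruby true
      <record>
        "logging.googleapis.com/local_resource_id" ${"k8s_node.#{ENV['NODE_NAME']}"}
      </record>
    </filter>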

# the necessary resource types when this label is set.
"logging.googleapis.com/k8s_compatibility": "true"
}
# Use Metadata Agent to get monitored resource.

@qingling128

qingling128 Apr 4, 2018

Contributor

NOTE: This enable_metadata_agent true configuration is present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

"logging.googleapis.com/k8s_compatibility": "true"
}
# Use Metadata Agent to get monitored resource.
enable_metadata_agent true

@qingling128

qingling128 Apr 4, 2018

Contributor

NOTE: This enable_metadata_agent true configuration is present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

@@ -37,6 +37,10 @@ spec:
readOnly: true
- name: config-volume
mountPath: /etc/google-fluentd/config.d
args:

@qingling128

qingling128 Apr 4, 2018

Contributor

NOTE: STACKDRIVER_METADATA_AGENT_URL is set as an environment variable regardless of whether we are ingesting logs against old resources (in which case it is not needed) or new resources, simply because there seems to be no need to distinguish the two cases.

Thoughts?

@bmoyles0117

bmoyles0117 Apr 5, 2018

Contributor

I'm OK with setting this unconditionally; however, why aren't we setting it as a normal environment variable? Injecting it at this point is actually problematic, as it harms the ability to shut down cleanly.

        - env:
          - name: STACKDRIVER_METADATA_AGENT_URL
            value: {{ stackdriver_metadata_agent_url }}

Is what I'm suggesting.

@qingling128

qingling128 Apr 5, 2018

Contributor

Good point. Changed.
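For reference, a minimal sketch of the env-based wiring in the DaemonSet container spec (standard Kubernetes fields; the 8799 port is the metadata agent default discussed above):

    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    # $(NODE_NAME) is expanded by Kubernetes because NODE_NAME is defined above.
    - name: STACKDRIVER_METADATA_AGENT_URL
      value: http://$(NODE_NAME):8799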

@qingling128 qingling128 changed the title from Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12), remove k8s_compatibility labels and let Logging Agent talk to Metadata Agent. to Add support to ingest log entries against new "k8s_container" and "k8s_node" resources. Apr 4, 2018

@qingling128

Contributor

qingling128 commented Apr 4, 2018

/retest

@qingling128 qingling128 changed the title from Add support to ingest log entries against new "k8s_container" and "k8s_node" resources. to Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. Apr 6, 2018

@qingling128

Contributor

qingling128 commented Apr 6, 2018

@MaciekPytel - Changed the PR title and release note. Should I squash everything into one commit now?

@MaciekPytel

Contributor

MaciekPytel commented Apr 6, 2018

Oh yes, I completely missed the number of commits. Please squash them.

/hold
So it doesn't merge before squashing.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Apr 6, 2018

@qingling128

Contributor

qingling128 commented Apr 6, 2018

@MaciekPytel - Squashed.

@kawych

Contributor

kawych commented Apr 6, 2018

/lgtm
@x13n is out of office, so I'm approving on his behalf; he gave an lgtm before the commits were squashed.

@k8s-ci-robot

Contributor

k8s-ci-robot commented Apr 6, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, MaciekPytel, qingling128, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@qingling128

Contributor

qingling128 commented Apr 6, 2018

Thanks @kawych!

@MaciekPytel - PTAL.

@MaciekPytel

Contributor

MaciekPytel commented Apr 6, 2018

/hold cancel

@k8s-ci-robot

Contributor

k8s-ci-robot commented Apr 6, 2018

@qingling128: The following tests failed, say /retest to rerun them all:

pull-kubernetes-e2e-gce-100-performance (commit 4ac1e9a): /test pull-kubernetes-e2e-gce-100-performance
pull-kubernetes-kubemark-e2e-gce-big (commit 4ac1e9a): /test pull-kubernetes-kubemark-e2e-gce-big

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@qingling128

Contributor

qingling128 commented Apr 6, 2018

/test pull-kubernetes-e2e-kops-aws

@k8s-merge-robot

Contributor

k8s-merge-robot commented Apr 6, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit 4009cb3 into kubernetes:master Apr 6, 2018

13 of 14 checks passed

Submit Queue: Required GitHub CI test is not green: pull-kubernetes-e2e-gce
cla/linuxfoundation: qingling128 authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Job succeeded.
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.

@qingling128

Contributor

qingling128 commented Apr 6, 2018

Finally made it into the queue (there was some test flakiness) and merged. A cherry-pick PR has been created as well.

@StevenACoffman

Contributor

StevenACoffman commented Apr 11, 2018

@qingling128 Thank you for your work on this. What is the issue tracking the fluentd memory leak so we can know when to try to upgrade again?

k8s-merge-robot added a commit that referenced this pull request Apr 11, 2018

Merge pull request #62201 from qingling128/automated-cherry-pick-of-#62076-origin-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #62076: Add support to ingest log entries to Stackdriver against new

Cherry pick of #62076 on release-1.10.

#62076: Add support to ingest log entries to Stackdriver against new
@StevenACoffman

Contributor

StevenACoffman commented Apr 12, 2018

@monotek are you aware of any issue tracking the fluentd memory leak referred to here?

@monotek

Contributor

monotek commented Apr 12, 2018

No.
Is there a bug report at fluentd about it?

@qingling128

Contributor

qingling128 commented Apr 12, 2018

@StevenACoffman @monotek - I've created an internal ticket to track it. I'll file a report with Fluentd as well and link it here. (Meant to do it earlier, but got distracted by the cherry-pick issues.)

@qingling128

Contributor

qingling128 commented Apr 12, 2018

Created fluent/fluentd#1941 to track the memory leak.

@StevenACoffman

Contributor

StevenACoffman commented Apr 12, 2018

Hmmm... I am not seeing that with the latest version (only running for a few days), and there have been several leak fixes in the versions in between.

@qingling128

Contributor

qingling128 commented Apr 13, 2018

@StevenACoffman - That sounds promising! We are also setting up soak tests for various Fluentd versions, including the latest (v1.1). Some additional testing work is needed on our side to bump a major Fluentd version. We might not have resources for it this quarter, but it's definitely on our roadmap.
