
[exporter/datadogexporter] Wrong ddtags on logs #17398

Closed
jmichalek132 opened this issue Jan 5, 2023 · 3 comments
Labels
data:logs, exporter/datadog, priority:p2, Stale

Comments

@jmichalek132
Contributor

Component(s)

exporter/datadog

What happened?

Description

When batching is enabled, a certain number of log lines end up with Datadog tags containing metadata from the wrong k8s pod.
The problem lies in this line:

tags := datadog.PtrString(payload[0].GetDdtags())

We copy the tags only from the first item in the payload and set them as the Datadog tags for all log lines in the payload. However, when batching is enabled, the payload can contain logs from multiple different pods. When this happens, they all share the same tags even though they come from different pods, making the tags incorrect.
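
For illustration, with batching enabled the batched payload can mix items from different pods (the tag values below are made up from the pods seen in this issue), yet only the first item's tags are forwarded for the entire request:

	// Hypothetical batched payload (illustrative values only):
	//   payload[0].GetDdtags() == "pod_name:kube-proxy-8rfkc,kube_namespace:kube-system,..."
	//   payload[1].GetDdtags() == "pod_name:cluster-autoscaler-bc567684b-stcgs,kube_namespace:kube-system,..."
	// Only payload[0]'s tags become the request-level ddtags, so payload[1] ends up
	// submitted with the kube-proxy pod's metadata.
	tags := datadog.PtrString(payload[0].GetDdtags())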

By setting a breakpoint inside this function, we can inspect the payload being sent and verify the issue.

_, r, err := s.api.SubmitLog(ctx, payload, opts)

Snippet containing the function:

func (a *LogsApi) SubmitLog(ctx _context.Context, body []HTTPLogItem, o ...SubmitLogOptionalParameters) (interface{}, *_nethttp.Response, error) {
	req, err := a.buildSubmitLogRequest(ctx, body, o...)
	if err != nil {
		var localVarReturnValue interface{}
		return localVarReturnValue, nil, err
	}

	return a.submitLogExecute(req)
}

This screenshot shows the issue when inspecting the payload being sent with a debugger.
[Screenshot: Screen Shot 2023-01-05 at 17 41 03]
The body contains logs from different pods with different ddtags, while the o parameter contains only one set of ddtags. The ddtags in o appear to take precedence.
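
One way to avoid the mismatch would be to group the batched items by their ddtags and issue one SubmitLog call per group. This is only a minimal sketch of that idea, not necessarily what #17399 implements; type names follow the snippet above, and it assumes SubmitLogOptionalParameters exposes a Ddtags field, which the exporter's use of payload[0].GetDdtags() suggests.

func submitGroupedByTags(ctx context.Context, api *LogsApi, payload []HTTPLogItem) error {
	// Group items so that every request only contains logs sharing the same ddtags.
	groups := make(map[string][]HTTPLogItem)
	for _, item := range payload {
		groups[item.GetDdtags()] = append(groups[item.GetDdtags()], item)
	}
	for tags, items := range groups {
		// Assumed field: Ddtags on SubmitLogOptionalParameters; PtrString is the same
		// helper used in the line quoted above.
		opts := SubmitLogOptionalParameters{Ddtags: datadog.PtrString(tags)}
		if _, _, err := api.SubmitLog(ctx, items, opts); err != nil {
			return err
		}
	}
	return nil
}

With this approach the request-level ddtags always match every log line in that request, at the cost of extra HTTP requests when a batch spans many pods.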

Steps to Reproduce

We run the OpenTelemetry Collector as a DaemonSet to collect logs, traces, and metrics, and use OTLP to send them to another OpenTelemetry Collector running as a Deployment for additional processing.

Expected Result

Properties such as the pod name in the tags match the pod name in the attributes.

Actual Result

Example log line exported from the Datadog UI after ingestion:
You can see, for example, that the pod name in the tags (pod_name:kube-proxy-8rfkc) doesn't match the pod name in the attributes (cluster-autoscaler-bc567684b-stcgs); the pod name in the attributes is the correct one.

{
  "id": "AgAAAYWBW3GXb6Tm_wAAAAAAAAAYAAAAAEFZV0JXNHhwQUFCdE9ZbW5xcGY0SVFBSQAAACQAAAAAMDE4NTgxNWItYWE5Ny00MGUxLTgyNTUtOGU2MWUwZWZiZGYx",
  "content": {
    "timestamp": "2023-01-05T09:55:31.863Z",
    "tags": [
      "container_id:31019c6707ad82a41d7ce4464ddaf81c532c7f9680cbc3d114dad0a97508ecfe",
      "source:undefined",
      "image_name:censored",
      "image_tag:v1.22.6-eksbuild.1",
      "kube_container_name:kube-proxy",
      "kube_namespace:kube-system",
      "otel_source:datadog_exporter",
      "container_name:kube-proxy",
      "pod_name:kube-proxy-8rfkc",
      "env:dev",
      "region:eu1"
    ],
    "host": "i-0c3c96c055c6ec25c",
    "service": "aws-cluster-autoscaler",
    "message": "I0105 09:55:31.863856       1 scale_down.go:918] No candidates for scale down",
    "attributes": {
      "hostname": "censored",
      "@timestamp": "2023-01-05T09:55:31Z",
      "container_name": "aws-cluster-autoscaler",
      "log": {
        "iostream": "stderr",
        "file": {
          "path": "/var/log/pods/kube-system_cluster-autoscaler-bc567684b-stcgs_1e69f29c-474d-4c2a-b0eb-df9c721f5973/aws-cluster-autoscaler/4.log"
        }
      },
      "service": "aws-cluster-autoscaler",
      "otel": {
        "timestamp": "1672912531863900731"
      },
      "time": "2023-01-05T09:55:31.863900731Z",
      "logtag": "F",
      "pod_name": "cluster-autoscaler-bc567684b-stcgs",
      "status": ""
    }
  }
}

Collector version

v0.67.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

k8s version: Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.13-eks-fb459a0", GitCommit:"55bd5d5cb7d32bc35e4e050f536181196fb8c6f7", GitTreeState:"clean", BuildDate:"2022-10-24T20:35:40Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
Helm chart used

OpenTelemetry Collector configuration

# Configuration of opentelemetry collector running as deployment
    exporters:
      datadog:
        api:
          key: ${DD_API_KEY}
          site: datadoghq.eu
        host_metadata:
          enabled: true
          hostname_source: first_resource
        metrics:
          resource_attributes_as_tags: true
        traces:
          span_name_as_resource_name: false
      logging:
        loglevel: info
    extensions:
      health_check: {}
      memory_ballast:
        size_in_percentage: 20
      zpages: {}
    processors:
      batch:
        send_batch_size: 5
        timeout: 10s
      batch/logs:
        send_batch_max_size: 1000
        send_batch_size: 100
        timeout: 10s
      memory_limiter:
        check_interval: 5s
        limit_percentage: 80
        spike_limit_percentage: 25
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    service:
      extensions:
      - health_check
      - memory_ballast
      - zpages
      pipelines:
        logs:
          exporters:
          - datadog
          processors:
          - batch
          receivers:
          - otlp
      telemetry:
        metrics:
          address: 0.0.0.0:8888
---
# Configuration of opentelemetry collector running as daemonset
    exporters:
      logging: {}
      otlp:
        endpoint: opentelemetry-cluster:4317
        tls:
          insecure: true
    extensions:
      health_check: {}
      k8s_observer:
        node: ${K8S_NODE_NAME}
        observe_nodes: true
        observe_pods: true
      memory_ballast:
        size_in_percentage: 20
        size_mib: "819"
      zpages: {}
    processors:
      batch:
        send_batch_max_size: 200
        send_batch_size: 10
        timeout: 10s
      memory_limiter:
        check_interval: 5s
        limit_mib: 1638
        spike_limit_mib: 512
      resourcedetection/eks:
        detectors:
        - eks
        override: false
        timeout: 2s
    receivers:
      filelog:
        exclude:
        - /var/log/pods/**/opentelemetry-collector/*.log
        - /var/log/pods/**/linkerd-proxy/*.log
        - /var/log/pods/datadog_*/**/*.log
        include:
        - /var/log/pods/**/**/*.log
        include_file_name: false
        include_file_path: true
        operators:
        - id: parser-containerd
          regex: ^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
          timestamp:
            layout: '%Y-%m-%dT%H:%M:%S.%LZ'
            parse_from: attributes.time
          type: regex_parser
        - default: extract_metadata_from_filepath
          id: get-format
          routes:
          - expr: attributes.log matches "^{\".*\":\".*\",\".*\":\".*\"}+"
            output: logback_json_parser
          type: router
        - id: logback_json_parser
          output: trace
          parse_from: attributes.log
          parse_to: body
          type: json_parser
        - id: trace
          output: severity_parser
          type: trace_parser
        - id: severity_parser
          parse_from: body.level
          type: severity_parser
        - id: extract_metadata_from_filepath
          parse_from: attributes["log.file.path"]
          regex: ^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$
          type: regex_parser
        - from: attributes.stream
          to: attributes["log.iostream"]
          type: move
        - from: attributes.container_name
          output: copycontainer
          to: resource["k8s.container.name"]
          type: move
        - from: resource["k8s.container.name"]
          id: copycontainer
          output: copycontainertosource
          to: resource["container.name"]
          type: copy
        - from: resource["k8s.container.name"]
          id: copycontainertosource
          to: resource["source"]
          type: copy
        - from: attributes.namespace
          to: resource["k8s.namespace.name"]
          type: move
        - from: attributes.pod_name
          to: resource["k8s.pod.name"]
          type: move
        - from: attributes.restart_count
          to: resource["k8s.container.restart_count"]
          type: move
        - from: attributes.uid
          to: resource["k8s.pod.uid"]
          type: move
        start_at: end
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    service:
      extensions:
      - k8s_observer
      - health_check
      - memory_ballast
      - zpages
      pipelines:
        logs:
          exporters:
          - otlp
          processors:
          - batch
          receivers:
          - filelog
      telemetry:
        metrics:
          address: 0.0.0.0:8888
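
For reference, the two regexes in the filelog operators above can be sanity-checked in isolation against a containerd log line and the pod log path seen in this issue (a standalone Go snippet for verification only, not part of the collector configuration; the sample log line is reconstructed from the exported log record above):

package main

import (
	"fmt"
	"regexp"
)

// printNamed prints every named capture group of re and what it matched in s.
func printNamed(re *regexp.Regexp, s string) {
	m := re.FindStringSubmatch(s)
	if m == nil {
		fmt.Println("no match")
		return
	}
	for i, name := range re.SubexpNames() {
		if i == 0 || name == "" {
			continue
		}
		fmt.Printf("%s=%q\n", name, m[i])
	}
}

func main() {
	// parser-containerd regex from the filelog config above.
	lineRe := regexp.MustCompile(`^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$`)
	// extract_metadata_from_filepath regex from the filelog config above.
	pathRe := regexp.MustCompile(`^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$`)

	line := `2023-01-05T09:55:31.863900731Z stderr F I0105 09:55:31.863856       1 scale_down.go:918] No candidates for scale down`
	path := `/var/log/pods/kube-system_cluster-autoscaler-bc567684b-stcgs_1e69f29c-474d-4c2a-b0eb-df9c721f5973/aws-cluster-autoscaler/4.log`

	printNamed(lineRe, line) // expect time, stream=stderr, logtag=F, log=<message>
	printNamed(pathRe, path) // expect namespace, pod_name, uid, container_name, restart_count
}

Running this shows, for example, pod_name="cluster-autoscaler-bc567684b-stcgs" extracted from the file path, which is the (correct) pod name seen in the attributes of the exported log record.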

Log output

No response

Additional context

I can provide more details if necessary.

jmichalek132 added the bug and needs triage labels on Jan 5, 2023
@github-actions
Contributor

github-actions bot commented Jan 5, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Mar 13, 2023
@mx-psi
Member

mx-psi commented Mar 13, 2023

This was fixed by #17399

mx-psi closed this as completed on Mar 13, 2023