
otel collector container keeps crashing due to nil pointer dereference when using fluentforwardreceiver #27469

Closed
karthik-kasarapu opened this issue Oct 6, 2023 · 7 comments · Fixed by #29274


karthik-kasarapu commented Oct 6, 2023

Component(s)

receiver/fluentforward

What happened?

Description

The OpenTelemetry Collector container, running as a sidecar in ECS Fargate to capture logs from an app container that uses the AWS FireLens log driver, is crashing with a nil pointer dereference error, as shown below.

Steps to Reproduce

  1. Deploy the collector as a sidecar in ECS Fargate and use the AWS FireLens log driver for the main container
  2. The issue seems to occur under high log volume; observed at around 1000 logs/sec

Expected Result

Expected the collector to work without panicking.

Actual Result

The container keeps crashing with the panic shown in the log output below.

Collector version

0.79.0, 0.85.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

extensions:
  # exposes an HTTP endpoint on port 13133 that can be used by k8s as a readiness probe
  health_check:
  
  # https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/ballastextension
  memory_ballast:
    size_in_percentage: 30
  
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  fluentforward:
    endpoint: unix:///var/run/fluent.sock

  awsecscontainermetrics:

  filelog/1:
    # Grafana Loki limit
    max_log_size: 256KiB
    include: "${ECS_LOG_FILES}"

processors:
  batch:

  # should be first processor in pipelines; may need to play with settings
  memory_limiter:
    check_interval: 1s
    limit_percentage: 10
    spike_limit_percentage: 5

  resourcedetection:
    # a list of resource detectors to run, valid options are: "env", "system", "gce", "gke", "ec2", "ecs", "elastic_beanstalk", "eks", "azure"
    # env resource detection uses environment variable: OTEL_RESOURCE_ATTRIBUTES=<key1>=<value1>,<key2>=<value2>,...
    detectors: [ env, ecs ]

  # rename some attributes to match semconv
  # ref: https://opentelemetry.io/docs/reference/specification/logs/semantic_conventions/media/
  transform/logs:
    log_statements:
    - context: resource
      statements:
      - delete_key(attributes, "aws.log.group.names")
      - delete_key(attributes, "aws.log.group.arns")
      - delete_key(attributes, "aws.log.stream.names")
      - delete_key(attributes, "aws.log.stream.arns")
    - context: log
      statements:
      - set(attributes["log.iostream"], attributes["source"])
      - set(resource.attributes["container.id"], attributes["container_id"])
      - set(resource.attributes["container.name"], attributes["container_name"])
      - delete_key(attributes, "source")
      - delete_key(attributes, "container_id")
      - delete_key(attributes, "container_name")
      - delete_key(attributes, "fluent.tag")


exporters:
  logging:
    loglevel: warn

  otlp/logs:
    endpoint: "${COLLECTOR_ENV}-collector-logs.com:4317"
    tls:
      insecure: true
  otlp/metrics:
    endpoint: "${COLLECTOR_ENV}-collector-metrics.com:4317"
    tls:
      insecure: true
  otlp/traces:
    endpoint: "${COLLECTOR_ENV}-collector-traces.com:4317"
    tls:
      insecure: true

service:
  # leaving off zpages and pprof as they are for debugging and can be added as necessary
  extensions: [health_check,memory_ballast]

  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter,resourcedetection,batch]
      exporters: [otlp/traces]
    
    metrics:
      receivers: [otlp,awsecscontainermetrics]
      processors: [memory_limiter,resourcedetection,batch]
      exporters: [otlp/metrics]
    
    logs:
      receivers: [otlp,filelog/1]
      processors: [memory_limiter,resourcedetection,batch]
      exporters: [otlp/logs]

    logs/ecs:
      receivers: [fluentforward]
      processors: [memory_limiter,resourcedetection,transform/logs,batch]
      exporters: [otlp/logs]

Log output

For 0.79.0
--
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xabe9d6]
goroutine 132 [running]:
go.opentelemetry.io/collector/pdata/plog.ResourceLogsSlice.At(...)
	go.opentelemetry.io/collector/pdata@v1.0.0-rcv0012/plog/generated_resourcelogsslice.go:53
go.opentelemetry.io/collector/pdata/plog.Logs.LogRecordCount({0xc000241c20?})
	go.opentelemetry.io/collector/pdata@v1.0.0-rcv0012/plog/logs.go:38 +0x16
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver.(*Collector).processEvents(0xc0008a5fb0, {0x452c8a0, 0xc00074f950})
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver@v0.79.0/collector.go:60 +0xe8
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver.(*Collector).Start in goroutine 1
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver@v0.79.0/collector.go:39 +0x8c

For 0.85.0
--
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xacdeb6]
goroutine 138 [running]:
go.opentelemetry.io/collector/pdata/plog.ResourceLogsSlice.At(...)
go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/plog/generated_resourcelogsslice.go:53
go.opentelemetry.io/collector/pdata/plog.Logs.LogRecordCount({0xc000bcaf00?})
go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/plog/logs.go:38 +0x16
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver.(*Collector).processEvents(0xc0006a5bc0, {0x481e6c0, 0xc0001a7e50})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver@v0.85.0/collector.go:60 +0xe8
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver.(*Collector).Start in goroutine 1
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver@v0.85.0/collector.go:39 +0x8c

Additional context

No response

@karthik-kasarapu added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Oct 6, 2023

github-actions bot commented Oct 6, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bryan-aguilar (Contributor) commented:

Setting this as a P1 for now, since components should be protected against panics.

@ms-jcorley commented:

Looking at the code here:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/fluentforwardreceiver/collector.go#L59

This seems to be the same root cause as this bug:
#29107

The receiver calls LogRecordCount on a plog.Logs that was previously passed to a ConsumeLogs call.
Downstream, a processor (such as batch) may be modifying that plog.Logs asynchronously while LogRecordCount is iterating over it.

Fix: call LogRecordCount before ConsumeLogs and cache the value for logging afterwards; it is unsafe to call LogRecordCount after ConsumeLogs. See the sketch below.
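A minimal sketch of the unsafe pattern and the reordering fix described above, using only the public pdata and consumer APIs; the helper names are illustrative, not the receiver's actual code:

package fluentforwardpattern

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// forwardUnsafe reads LogRecordCount after handing the data to the pipeline.
// A downstream processor (e.g. batch) may mutate `out` concurrently at that
// point, which is the race behind the panic in this issue.
func forwardUnsafe(ctx context.Context, next consumer.Logs, out plog.Logs) (int, error) {
	err := next.ConsumeLogs(ctx, out)
	return out.LogRecordCount(), err // unsafe: `out` is no longer owned by this goroutine
}

// forwardSafe caches the count before ConsumeLogs, so the value used for
// logging/metrics never touches `out` after ownership is transferred.
func forwardSafe(ctx context.Context, next consumer.Logs, out plog.Logs) (int, error) {
	count := out.LogRecordCount()
	err := next.ConsumeLogs(ctx, out)
	return count, err
}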

@bryan-aguilar (Contributor) commented:

Thanks for following up, @ms-jcorley! I added the comment and then thought, "I should look at the code first before I claim they are linked." I appreciate you doing the due diligence for me with a bit more investigation.

@ms-jcorley commented:

> Thanks for following up, @ms-jcorley! I added the comment and then thought, "I should look at the code first before I claim they are linked." I appreciate you doing the due diligence for me with a bit more investigation.

No problem! Also note that a quick (unreliable) search shows instances of this same pattern in several other places:
receiver/lokireceiver/loki.go
receiver/mongodbatlasreceiver/events.go
receiver/sqlqueryreceiver/logs_receiver.go
receiver/kafkareceiver/kafka_receiver.go
receiver/k8sobjectsreceiver/receiver.go
receiver/googlecloudpubsubreceiver/receiver.go
receiver/otlpjsonfilereceiver/file.go

@bryan-aguilar (Contributor) commented:

We should consider opening a larger meta issue to track all of these once we confirm that this pattern leads to a panic.

@djaglowski (Member) commented:

I agree that this appears to be due to a race condition. I've opened #29274 with the quick fix of calling LogRecordCount before calling ConsumeLogs. I suggest we assume this fixes it for now and reopen the issue if it still occurs.
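For context, this is roughly how the reordering could look inside a processEvents-style loop like the one in the stack traces above. This is a hedged sketch: the channel, logger, and function names are assumptions for illustration, not the code actually merged in #29274.

package fluentforwardpattern

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.uber.org/zap"
)

// processEventsSketch drains batches of already-converted logs from logsCh
// and forwards them downstream. All names here are illustrative.
func processEventsSketch(ctx context.Context, logsCh <-chan plog.Logs, next consumer.Logs, logger *zap.Logger) {
	for {
		select {
		case <-ctx.Done():
			return
		case out := <-logsCh:
			// Take the count while this goroutine still exclusively owns `out`.
			recordCount := out.LogRecordCount()
			if err := next.ConsumeLogs(ctx, out); err != nil {
				logger.Error("failed to consume logs", zap.Error(err))
			}
			// Safe: logs the cached count instead of re-reading `out`,
			// which the pipeline may now be mutating concurrently.
			logger.Debug("forwarded log records", zap.Int("count", recordCount))
		}
	}
}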
