Support external queue system for exporter via extensions #31682

Open · pepperkick opened this issue Mar 11, 2024 · 5 comments
Labels: enhancement (New feature or request), extension/storage, Sponsor Needed (New component seeking sponsor)

Comments

@pepperkick (Contributor) commented Mar 11, 2024

Component(s)

No response

Is your feature request related to a problem? Please describe.

No

Describe the solution you'd like

I want the ability to use Redis as the exporter queue so the collector can ride out longer upstream outages. The two existing exporter queue options either do not fulfill this requirement or have issues with longer outages.

This use case came up because I needed near 0% data loss on logs over long periods of time, while the upstream is known to be down for many minutes at a time. The in-memory queue was ruled out because it loses data on pod restarts. The persistent queue option was promising, but it starts refusing new data once the queue is full, and based on the code it is difficult to disable that refusal, so new data is lost once the queue fills up.

Since I had access to a Redis cluster, I modified the exporter to support Redis as the queue. I was able to test this and get stable data flow; however, I have not yet been able to do thorough, long-running testing.

Can this feature be implemented via extensions so that other queue systems could be added in the future as well? I believe the existing queue implementations would need to be moved to extensions in that case.

Describe alternatives you've considered

Alternatives considered

  • In-memory queue: As mentioned, data is lost if the pod restarts due to a crash.
  • Persistent queue: Starts refusing new data when the queue is full.
  • Export to Pulsar, receive from Pulsar: This does work, but the second collector still needs one of the queues above when exporting, ending up in the same situation.

Additional context

Config example

exporters:
  logging:
  otlp:
    endpoint: "localhost:4217"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 5s
      max_elapsed_time: 20s
    sending_queue:
      enabled: true
      requeue_enabled: true
      queue_backend: redis
      num_consumers: 10
      queue_size: 10000
      redis:
        address: "localhost:6379"
        backlog_check_interval: 30
        process_key_expiration: 120
        scan_key_size: 10
    tls:
      insecure: true

While queue_size is set to 10000 here, the queue is effectively unlimited because Redis can scale up based on usage.

@pepperkick added the enhancement (New feature or request) and needs triage labels on Mar 11, 2024
@atoulme added the extension/storage label and removed the needs triage label on Mar 12, 2024
Pinging code owners for extension/storage: @dmitryax @atoulme @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@hughesjj (Contributor)

So to be clear, the ask is support for "backends" for the sending queue/buffer, for all exporters?

Overall I agree with the motivation. I've seen use cases in the past where a network outage/partition has occurred and one or more exporters overflowed as a result. That data could be useful for post-incident analysis or backfill.

That said, we'd likely want to add the ability to bound the staleness (oldest watermark/epoch) of queued data. Some vendor backends, whether OTLP "compliant" or via bespoke exporters, don't allow ingestion of telemetry with an observed (metric) timestamp earlier than a (vendor-specific) relative epoch.

While we could theoretically grow an alternative "backend" for the exporter queue without bound, in practice that would require customers either to manually configure their distributed queue for autoscaling, or for us to hook into OpAMP. I'm also concerned about correlated failures -- if there's a network issue exporting to FooVendor, there's a chance that network issue also extends to the distributed queue backing the exporter (even if it's on the same node).

We could consider implementing a "dead letter queue" (DLQ) configuration instead. Given some configuration, we could even re-use existing exporters and route metrics matching the DLQ configuration to them. A rough, not fully thought-out example follows:

exporters:
   otlp/experiencing_networkoutage:
      # Happy path, write data to this exporter normally
      dlq:
         max_latency: 15m # anything older than 15m goes to the DLQ
         min_latency: -5m # you can even reject data from the future, etc.
         max_buffer_size: 100000 # start dropping newer data beyond this size
         FIFO: false # ordering is an important design consideration regardless of impl
         exporters: # if empty then just drop
          - otlp/dlq1
   otlp/dlq1:
      # try writing to redis or kafka or something else if the main sink is "out of order"
      dlq:
         max_latency: 1h # anything older than 1h goes to this DLQ
         FIFO: false # ordering is an important design consideration regardless of impl
         exporters:
           - fileexporter/dlq2
   fileexporter/dlq2:
      # If all else fails, try writing to a local file

The disadvantage of a DLQ is that you don't get the durability that your "sending_queue" example introduces ahead of the exporter. On the other hand, a non-local sending queue is (imo) conceptually similar to an exporter to begin with. Taking Kafka as a sending_queue backend as an example, you could break your pipeline up into two pipelines: the first exporting to Kafka, the second consuming from the same Kafka stream and exporting over OTLP. Then again, the question of backpressure becomes a bit murky in such a scenario, as (off the top of my head) I don't believe the collector durably dequeues data from a receiver if and only if an exporter has accepted it... Then again, this would also be an implementation concern for a backing "sending_queue". Some sort of n-phase commit would need to be spec'd out for this use case.
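
A minimal sketch of that split-pipeline idea, assuming the contrib Kafka receiver and exporter (the broker address, topic name, and OTLP endpoint are placeholders, and in practice the two pipelines would usually run in separate collector instances):

receivers:
  otlp:
    protocols:
      grpc:
  kafka:                          # second pipeline reads back from the buffer topic
    brokers: ["localhost:9092"]
    topic: otlp_logs
    protocol_version: 2.0.0

exporters:
  kafka:                          # first pipeline buffers into Kafka
    brokers: ["localhost:9092"]
    topic: otlp_logs
    protocol_version: 2.0.0
  otlp:
    endpoint: "localhost:4217"

service:
  pipelines:
    logs/ingest:                  # receive -> durable buffer
      receivers: [otlp]
      exporters: [kafka]
    logs/forward:                 # durable buffer -> upstream
      receivers: [kafka]
      exporters: [otlp]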

Regardless of the implementation, we should (continue to) come up with a list of design considerations. I'd love to have a working session or collaborate on a Google Doc, etc., with you to flesh this out a bit before bringing it to a collector SIG.

@djaglowski (Member)

I may be mistaken, but isn't the sending queue designed to work with any storage extension? If so, all that is needed is another storage extension. Then, rather than configuring Redis as part of the exporter, you configure it as an extension and reference it in the exporter. Adapting the example:

extensions:
  redisstorage:
    address: "localhost:6379"
    backlog_check_interval: 30
    process_key_expiration: 120
    scan_key_size: 10

exporters:
  logging:
  otlp:
    endpoint: "localhost:4217"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 5s
      max_elapsed_time: 20s
    sending_queue:
      enabled: true
      storage: redisstorage # refers to component name of extension configured above
      requeue_enabled: true
      queue_backend: redis
      num_consumers: 10
      queue_size: 10000
    tls:
      insecure: true
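
For comparison, a minimal sketch of how the existing file_storage extension is wired into a sending queue today (the directory path is a placeholder); a Redis-backed storage extension would be referenced the same way:

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage   # local disk backing the persistent queue

exporters:
  otlp:
    endpoint: "localhost:4217"
    sending_queue:
      enabled: true
      storage: file_storage   # a redisstorage extension would slot in here instead

service:
  extensions: [file_storage]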

@hughesjj (Contributor) commented Apr 1, 2024

@djaglowski I believe so, yes

So @pepperkick, could the ask "implement a redis storage extension for sending queues" satisfy your needs?

@pepperkick (Contributor, Author)

Yes, implementation via an extension is the approach I would like to take, as that will help with creating additional backends later down the line.

For Redis I have created the following PR, which is implemented via an extension:
#31731

Currently I am evaluating Pulsar for this due to the recent Redis license situation.
