Add ability for OTLP to indicate which records were rejected for retryable reasons #470

joshdover · 2023-05-08T15:11:44Z

What are you trying to achieve?

For logging use cases, it is desirable to deliver logs with as little data loss as possible, without duplicating data. This is more critical for logs than metrics or traces, where a single log message may contain critical information about the behavior of the system being observed. A dropped metric or trace is rarely a critical issue. Metrics will be populated on the next reporting interval and traces are typically highly sampled anyways.

What did you expect to see?

An ability for an OLTP client to know which records should be resent in order to meet delivery guarantees. This should be considered optional for clients / exporters to implement, but highly encouraged for a robust logging implementation.

Additional context.

The OTLP protocol was enhanced in open-telemetry/opentelemetry-specification#2454 to allow server implementations to indicate if all data had been accepted, but it does not indicate which records were rejected or why they were rejected. This issue is to restart the conversation about adding this capability to the protocol.

In Elasticsearch's architecture, some data in a single bulk ingestion request may be written, while other data may not be. Elasticsearch indicates this with a status code on each individual event: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html#bulk-api-response-body

Proposed path forward

We should strive to provide the minimal amount of information for this case in the protocol, in order to simplify server and client support for this feature:

We should only provide a list of records that may succeed if retried. The goal is not to provide comprehensive error information on each record in the request.
Because records do not have an associated ID, we can use the index ordinal from the original request to communicate back to the client which records should be retried.

A first attempt at this:

message ExportLogsPartialSuccess {
  // The number of rejected log records.
  //
  // A `rejected_<signal>` field holding a `0` value indicates that the
  // request was fully accepted.
  int64 rejected_log_records = 1;

  // A developer-facing human-readable message in English. It should be used
  // either to explain why the server rejected parts of the data during a partial
  // success or to convey warnings/suggestions during a full success. The message
  // should offer guidance on how users can address such issues.
  //
  // error_message is an optional field. An error_message with an empty value
  // is equivalent to it not being set.
  string error_message = 2;

  // New - list of resource logs from the request that contain retryable logs
  // retryable_resource_logs is an optional field.
  repeated RetryableResourceLogIndex retryable_resource_logs = 3;
}

message RetryableResourceLogIndex {
  // The ordinal index a `ResourceLog` from the original request that contains retryable logs
  uint32 index = 1;

  // List of ordinal indexes of scope logs for this resource that should be retried
  repeated uint32 retryable_scope_logs = 2;
}

joaopgrassi · 2023-05-09T09:13:30Z

If I got it right, in your example index refers to the ResourceLogs index in the OTLP message? And retryable_scope_logs refers to the actual logs? Aren't we missing the ScopeLog index, as a ResourceLog can have N ScopeLog as well?

Something like:

message RetryableResourceLogIndex {
  // The ordinal index a `ResourceLog` from the original request
  uint32 resource_logs = 1;
  
  // The ordinal index a `ScopeLogs` within a `ResourceLog` 
  uint32 scope_logs = 2;

  // List of ordinal indexes of logs within a `ScopeLogs` that should be retried
  repeated uint32 logs = 3;
}

Given a request such as, where X represents ingestion failures:

resource_log_A
  scope_log_classA
    log_1 X
    log_2 X
  scope_log_classB
    log_1 X
    log_2
resource_log_B
  scope_log_classC
    log_1 X
    log_2

Then we would receive a (pseudo) response as:

{
  "rejected_log_records": 3,
  "error_message": "some logs could not be ingested..",
  "retryable_resource_logs": [
    {
	  "resource_logs": 1,
	  "scope_logs": 1,
	  "logs": [1, 2]
	},
	{
	  "resource_logs": 1,
	  "scope_logs": 2,
	  "logs": [1]
	},
	{
	  "resource_logs": 2,
	  "scope_logs": 1,
	  "logs": [1]
	}  
  ]
}

For the record, some old discussion about this, while the initial Partial Success feature was introduced: #390 (comment)

We also need to think if we would like to make this generic, so other signals can also benefit from it.

joshdover · 2023-05-10T14:25:42Z

Thanks for taking a look @joaopgrassi.

And retryable_scope_logs refers to the actual logs?

I meant for this to be a list of indices, not the actual logs. So the pseudo response for my proposed structure would be:

{
  "rejected_log_records": 3,
  "error_message": "some logs could not be ingested..",
  "retryable_resource_logs": [
    {
	  "resource_logs": 1,
	  "retryable_scope_logs": [1, 2]
	},
	{
	  "resource_logs": 2,
	  "retryable_scope_logs": [1]
	},
	{
	  "resource_logs": 3,
	  "retryable_scope_logs": [3]
	}  
  ]
}

There would never be more than one retryable_resource_logs with the same index field. To be honest, I feel like there's more elegant way of representing this, so I'm open to suggestions.

To me, the more important aspect of this discussion is whether or not this is something that OTLP should support.

And I definitely agree that we'd want to duplicate the pattern to every other signal as well, similar to how partial success was added 👍

joaopgrassi · 2023-05-10T16:03:36Z

But in your example, it's missing the scope as well no? Like

{
	  "resource_logs": 1,
	  "retryable_scope_logs": [1, 2]
},

Log 1 and 2 refers to the resource with index 1, but they belong to which scope? There could be more inside resource 1.

I feel like there's more elegant way of representing this

I'm also having a hard time coming up with alternatives, not sure we have another possibility. If things had "ids" that would be one way. To me, the tricky part is the hierarchy of the structure. That makes it all way more complicated, for both parties (receivers and exporters). For logs we have log.record.id but it's optional.. for metrics we have the metric key but that is not unique across resources/scope inside the OTLP message. For spans that would be easier we can just use the span+trace ids.

Another point is, for exporters, it could be "easy" to add a configuration to enable such detailed retry behavior, but for a receiver/server I wonder how that would be. I would not imagine a server always doing such complicated handling for all requests, specially when it's under pressure, but then how do a server decides when to perform such checks and response?

tigrannajaryan · 2023-05-10T19:59:11Z

Moving the discussion to https://github.com/open-telemetry/opentelemetry-proto/ where the OTLP spec is being moved right now.

tigrannajaryan · 2023-05-11T13:37:16Z

This needs a more elaborate proposal that also needs to explain how the interoperability of old and new versions is ensured.

github-actions bot assigned yurishkuro May 8, 2023

tigrannajaryan transferred this issue from open-telemetry/opentelemetry-specification May 10, 2023

joshdover mentioned this issue Aug 3, 2023

[Libbeat][Filebeat][outputs] Add Codec and exporter to serialize and export batch of beat events in OTLP format elastic/beats#32549

Open

joaopgrassi mentioned this issue Jun 17, 2024

Feature request: semantic conventions for (non-rejection) ingestion errors leading to truncation/mutation open-telemetry/semantic-conventions#1098

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability for OTLP to indicate which records were rejected for retryable reasons #470

Add ability for OTLP to indicate which records were rejected for retryable reasons #470

joshdover commented May 8, 2023 •

edited

Loading

joaopgrassi commented May 9, 2023 •

edited

Loading

joshdover commented May 10, 2023

joaopgrassi commented May 10, 2023 •

edited

Loading

tigrannajaryan commented May 10, 2023

tigrannajaryan commented May 11, 2023

Add ability for OTLP to indicate which records were rejected for retryable reasons #470

Add ability for OTLP to indicate which records were rejected for retryable reasons #470

Comments

joshdover commented May 8, 2023 • edited Loading

joaopgrassi commented May 9, 2023 • edited Loading

joshdover commented May 10, 2023

joaopgrassi commented May 10, 2023 • edited Loading

tigrannajaryan commented May 10, 2023

tigrannajaryan commented May 11, 2023

joshdover commented May 8, 2023 •

edited

Loading

joaopgrassi commented May 9, 2023 •

edited

Loading

joaopgrassi commented May 10, 2023 •

edited

Loading