
Support an S3 DLQ in OpenSearch #2298

Closed
dlvenable opened this issue Feb 21, 2023 · 4 comments · Fixed by #2451
@dlvenable
Member

Background

The current DLQ in the OpenSearch sink only writes to local files. However, sometimes pipeline authors want these DLQ files on Amazon S3.

Additionally, the current DLQ format does not embed useful information about the pipeline, so a pipeline author must include the pipeline name in the DLQ file name to distinguish between multiple sinks and pipelines.

Solution

Create an S3 DLQ option in the OpenSearch sink.

Configurations

The DLQ should allow pipeline authors to configure:

  • The bucket name (required)
  • The key prefix (optional; defaults to no prefix and writes to the root of the bucket)

It should use the existing aws sts_role_arn (or aws_sts_role_arn) configuration to access the bucket.

Example:

sink:
- opensearch:
    hosts: [...]
    aws:
      sts_role_arn: arn:...
    s3_dlq:
      bucket_name: my-bucket
      key_prefix: path/to/my/dlq/

Compression

This should use compression for all files. Perhaps in the future we could add an option to disable compression if desired.

Format

This should use the same format as the current DLQ: newline-delimited JSON (JSON-ND), where each JSON object has the following properties:

  • Document field - the full document
  • failure field - the error from OpenSearch

Additionally it should add the following (these can be added to the current DLQ as well):

  • indexName - the target index name. With the new dynamic index name support, this may differ from event to event within a given sink.
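As a sketch of one line of such a JSON-ND file, built here in Python for illustration (the document body, error message, and index name are all hypothetical values, not output from Data Prepper):

```python
import json

# One DLQ record per line (JSON-ND). The "Document" and "failure"
# fields match the current DLQ format; "indexName" is the proposed
# addition. All values below are purely illustrative.
record = {
    "Document": {"traceId": "abc123", "spanId": "def456"},
    "failure": "mapper_parsing_exception: failed to parse field [spanId]",
    "indexName": "my-index-4xx",
}
line = json.dumps(record)
```
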

Additional Metadata

This should store additional metadata which is relevant for all events. This could be expressed in the S3 object key itself so that it doesn't have to be repeated.

  • Pipeline name
  • The DLQ version format. Start at "1"

The key can embed this information:

dlq-v${version}-${pipelineName}-${PLUGIN_ID}-${timestampIso8601}-${uniqueId}.jsonnd.gz

The ${PLUGIN_ID} is currently static, so it will always be opensearch. By including it now, the format will extend naturally when Data Prepper supports #1025.

A hypothetical full path might be:

path/to/my/dlq/dlq-v1-raw-trace-pipeline-opensearch-20230221T10:11:12Z-a258d8eb-b264-41c6-871a-b53793eaf743.jsonnd.gz
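The key construction above can be sketched as follows (a hypothetical helper, not the actual Data Prepper implementation; the function name and parameters are assumptions for illustration):

```python
from datetime import datetime, timezone
from uuid import uuid4

def build_dlq_key(key_prefix: str, pipeline_name: str,
                  plugin_id: str = "opensearch", version: int = 1) -> str:
    """Build an S3 object key of the proposed form:
    dlq-v${version}-${pipelineName}-${PLUGIN_ID}-${timestampIso8601}-${uniqueId}.jsonnd.gz
    """
    # Timestamp format follows the hypothetical full path in the proposal.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H:%M:%SZ")
    unique_id = uuid4()
    return f"{key_prefix}dlq-v{version}-{pipeline_name}-{plugin_id}-{timestamp}-{unique_id}.jsonnd.gz"

key = build_dlq_key("path/to/my/dlq/", "raw-trace-pipeline")
```
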

Alternative - Metadata in JSON

The DLQ can include the following metadata in each JSON object:

  • Pipeline name (e.g. pipelineName: "raw-trace-pipeline")
  • A DLQ version format ("version" : "1")

Batching

The DLQ should accumulate documents in a local file and send them after reaching a threshold. The primary threshold is time: after a period of time, the file will be written to S3 no matter what. Secondarily, it can have a size threshold in bytes. Once that threshold is reached, it will write to S3 even if the time threshold has not been met. This is similar behavior to that proposed in #1048.
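The dual-threshold behavior can be sketched like this (class and parameter names are hypothetical, and threshold defaults are placeholders; a real implementation would also flush on a background timer rather than only checking age on each add):

```python
import time

class DlqBatcher:
    """Accumulate DLQ records; flush when either the time limit or the
    byte-size limit is reached. A sketch of the proposed batching, not
    the actual Data Prepper implementation."""

    def __init__(self, flush_fn, max_age_seconds=60.0, max_bytes=5 * 1024 * 1024):
        self.flush_fn = flush_fn          # called with buffered bytes, e.g. an S3 upload
        self.max_age_seconds = max_age_seconds
        self.max_bytes = max_bytes
        self.buffer = bytearray()
        self.first_write_time = None

    def add(self, record_bytes: bytes) -> None:
        if self.first_write_time is None:
            self.first_write_time = time.monotonic()
        self.buffer += record_bytes + b"\n"  # JSON-ND: one record per line
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        aged_out = (time.monotonic() - self.first_write_time) >= self.max_age_seconds
        too_big = len(self.buffer) >= self.max_bytes
        if aged_out or too_big:
            self.flush_fn(bytes(self.buffer))
            self.buffer = bytearray()
            self.first_write_time = None

# Tiny size threshold so a single record triggers a flush:
flushed = []
batcher = DlqBatcher(flushed.append, max_age_seconds=60.0, max_bytes=10)
batcher.add(b"record-one")
```
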

Questions

  • Is there a standard extension for JSON-ND? I have .jsonnd above, but I'm not sure I've really seen this.
  • Should we rename the Document field to document? This is more consistent with other JSON. The downside is it would be different from the current DLQ format.

Alternatives

Generic DLQ

It could be useful to have a generic DLQ concept. However, the sink data may vary so it needs some discussion on the format and approach. Having a DLQ for the OpenSearch sink would cover a lot of ground and help users out quickly.

Related Issues

This DLQ is somewhat like #1048, except that it applies to dead-letter output rather than a regular sink.

@dlvenable added labels: enhancement (New feature or request), plugin - sink (A plugin to write data to a destination) on Feb 21, 2023
@kkondaka
Collaborator

Looks good to me. We should move the index_name under some higher-level key like attributes so that the format can be made generic for a future generic DLQ.

@sharraj

sharraj commented Feb 21, 2023

We should give the option to add <index_name> to the objects as well.

@sharraj

sharraj commented Feb 21, 2023

Also, we should add detailed error numbers to the metadata. This will give hints to the user about why this data landed in the DLQ.

@dlvenable
Member Author

We should move the index_name to some high level key like attributes so that the format can be made generic for future generic DLQ.

I'd like to clarify what you mean. I think you are suggesting that the index name go under a new attributes property at the top level? Perhaps like the following:

{"document" : "...failed document...", "failure" : "...message...", "attributes" : {"indexName" : "my-index-4xx"}}
