Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tagging Events in Data Prepper #629

Closed
graytaylor0 opened this issue Nov 19, 2021 · 9 comments · Fixed by #2629 or #2680
Closed

Tagging Events in Data Prepper #629

graytaylor0 opened this issue Nov 19, 2021 · 9 comments · Fixed by #2629 or #2680
Assignees
Labels
proposal Proposed major changes to Data Prepper
Milestone

Comments

@graytaylor0
Copy link
Member

graytaylor0 commented Nov 19, 2021

Is your feature request related to a problem? Please describe.
Currently, Data Prepper does not support a standard concept for adding descriptions dynamically for what happens in individual sources, processors, and sinks. For example, if the grok processor fails to match, the user should be able to look at an Event and tell that it was not able to match. Otherwise, it can feel like grok is simply not working at all. Another example would be a json processor failing to parse json. The user needs to know if the parsing failed in order to pinpoint the problems with their configuration. Additionally, many users would like to check for certain tags when routing, or drop Events with those tags to save space. Lastly, users of OpenSearch would like to query based on tags in Events. Data Prepper needs a concept for easily handling these types of situations for any sink, processor, or source.

Describe the solution you'd like
A key that is dedicated to adding information of this sort. For example, if grok fails to match, then the event will add a tags key with [grok_match_failure] as a value. The value of tags will be a Set, as it is not helpful to have duplicate tags.

{ "message": "a log", "tags": ["grok_match_failure"] }

Now a grok user is able to quickly tell that there was no match for this Event. If this Event then went through a json parser and failed to parse, you can add json_parse_failure to the tags key, like this.

{ "message": "a log", "tags": ["grok_match_failure", "json_parse_failure"] }

To make this functionality consistent between plugins, the Event class could have a new function

void addTag(String tagName);

which would add tagName to the set of tags.

In order to separate tags from the rest of the Event, checking for tags with conditional expressions would look like this:

drop:
     when: 'event.hasTag("grok_parse_failure")'

While I don't believe it to be required for the first iteration of tagging, a processor to control the adding and removing of custom tags could exist. It would look something like this:

processor:
     - tag_manager:
          add_tags: ["tag3", "tag4"]
          remove_tags: ["tag1", "tag2"]

This processor could also be split into two, with one called add_tags and one called remove_tags.

Describe alternatives you've considered (Optional)
This problem could be solved at the plugin level. So given the same scenario with grok match failure, the Event would become something like

{ "message": "a log", "grok_match_failure": true }

and then after the json parsing fails, the Event would become something like

{ "message": "a log", "grok_match_failure": true, "json_parse_failure": true }

As you can tell, this solution doesn't scale as well. You can imagine that with a large amount of sources, processors, and sinks adding their own booleans to an Event, the Event could quickly become cluttered, and the querying for tags in OpenSearch would also become more of a pain.

Add the tags to the EventMetadata

Additionally, the tags could be added to the EventMetadata rather than the actual Event. This would make the Events cleaner, and the overall tagging options more configurable and extracted from the event data itself. The EventMetadata would contain the following:

Set<String> getTags();

This approach would allow for conditional checks on tags, but it needs a little more implementation to make the tags a part of the sink output. For example, the OpenSearch sink could have a configuration option like the following:

opensearch:
  host: ["localhost:9200"]
  save_tags: false (default would be true)

This would give individual sinks the ability to configure tags however they please (they could change the name of the tags key or remove certain tags at the sink level)

The one concern with this is that it would result in some unnecessary duplicate code, but it is entirely likely that some sinks would like to have the tags in the Event itself, and some would not. To make some options like the save_tags logic reusable, a plugin could be created that would handle the logic for adding the tags from the EventMetadata to the Event itself before it is shipped to the sink.

Additional context
Please provide alternatives to solve this problem if there are other ideas that make more sense than the tagging concept described here.

@dlvenable dlvenable added the proposal Proposed major changes to Data Prepper label Nov 19, 2021
@dlvenable
Copy link
Member

Thanks for proposal. Yes, this would be helpful.

Regarding the idea of using an array, I believe this is also best for OpenSearch documents. It should be easier to query documents by querying matches on tag than doing a boolean match.

I think it would be appropriate to make this a concept directly on Event. This should promote consistency across plugins.

Perhaps the following method can be added to Event:

void addTag(String tagName);

@dlvenable dlvenable added this to Untriaged in (deprecated) Tracking Board via automation Dec 16, 2021
@dlvenable dlvenable added this to the v2.0 milestone Apr 19, 2022
@dlvenable dlvenable removed this from Untriaged in (deprecated) Tracking Board Apr 19, 2022
@dlvenable dlvenable added this to 2.1 - Additional Plugins (Aug 2022) in Data Prepper Project Roadmap Apr 19, 2022
@dlvenable dlvenable moved this from 2.1 - Additional Plugins (Aug 2022) to 2.0 - Breaking Changes (Jun 2022) in Data Prepper Project Roadmap Apr 19, 2022
@dlvenable dlvenable changed the title Tagging Concept in Data Prepper Tagging Events in Data Prepper May 9, 2022
@dlvenable dlvenable moved this from 2.0 - Breaking Changes (Jul/Aug 2022) to 2.1 - Additional Plugins (Sep 2022) in Data Prepper Project Roadmap May 18, 2022
@dlvenable dlvenable modified the milestones: v2.0, v2.1 Jun 25, 2022
@cmanning09
Copy link
Contributor

cmanning09 commented Jul 14, 2022

I have a few questions about putting the tags field directly in the event:

  • Should the tags field be a protected field in the event? Can rename_key, 'copy_key', 'delete_key', etc. alter the tags?
  • What happens if an event contains a tags that is not a set?

I think there is some potential for some poor experiences depending on the processors/events default behavior if we put the tags directly in the event.

@graytaylor0
Copy link
Member Author

graytaylor0 commented Jul 21, 2022

@cmanning09 After thinking about your comments for a while, I think I am more in favor of tags being a part of the EventMetadata, and then creating a plugin that can be used by all sinks to handle tagging. This way, all the functionality of choosing the tag name and whether they are sent to the sink as part of the Event is configurable, and it can be made very clear that mutate processors cannot be configured by the user to directly alter the tags. This approach also makes the conditional expressions based on tags less confusing, as it would be more apparent that /tags or hasKey("tags") would actually check the Event for a literal "tags" key, while getTags() == { "grok_match_failure", "json_parse_failure" } or hasTag("grok_match_failure") == false would only check the tags in the EventMetadata. The full plugin would look something like this by default

- opensearch:
     tagging:
        include_tags_in_event: false
        tags_key_name: "tags"

And this configuration would not include the tags as part of the Event sent to the OpenSearch sink.

and this same plugin could be utilized by, for example, an s3 sink as well (in this case it is sending the tags to the sink as part of the Event under the newTagsKey key.

- s3:
     tagging:
        include_tags_in_event: true
        tags_key_name: "newTagsKey"

This way, tags are completely decoupled from the Event until right before they are sent to the sink, and could be serialized into the Event with a separate method on Event such as event.toJsonStringWithTags("newTagsKey") for the sink plugins to use if include_tags_in_event is true and tag_key_name is newTagsKey. Even if the user had a tags Integer in the Event already and still used the default of tags for tag_key_name (which they should know not to do), nothing would collide and cause errors or data overwriting, and there would just be 2 different json keys for tags. This also provides some interesting query functionality in OpenSearch for the case of multiple sinks that use the same OpenSearch cluster with the same index, as one would still be able to figure out which path an Event took quickly by querying on the assigned tag key.

@dlvenable dlvenable moved this from 2.1 - Usability, OTel, OpenSearch Improvements (Dec 2022) to 2.2 - Plugins (early 2023) in Data Prepper Project Roadmap Oct 21, 2022
@dlvenable dlvenable removed this from the v2.1 milestone Oct 21, 2022
@dlvenable dlvenable moved this from 2.2 - Plugins (early 2023) to Backlog in Data Prepper Project Roadmap Jan 23, 2023
@dlvenable dlvenable added this to the v2.3 milestone May 2, 2023
@dlvenable dlvenable moved this from Backlog to 2.3 (2023) in Data Prepper Project Roadmap May 2, 2023
kkondaka added a commit to kkondaka/kk-data-prepper-f2 that referenced this issue May 3, 2023
Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
kkondaka added a commit that referenced this issue May 5, 2023
* Tagging Events in Data Prepper. Issue #629

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Addressed review comments

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Addressed review comments. Introduced JsonStringBuilder in JacksonEvent to return event with additinal info (like tags) as json string

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

---------

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
@dlvenable dlvenable reopened this May 10, 2023
@dlvenable
Copy link
Member

@kkondaka , I re-opened this issue to be sure we don't lose anything. Is there anything else necessary here?

@kkondaka
Copy link
Collaborator

@dlvenable it depends on what this issue includes. We have to add support to add tags at many places. Not sure if this issue is supposed to be for all those cases.

@kkondaka
Copy link
Collaborator

I think the processor also should have a support for condition optionally. So, the processor config should look something likes

processor:
   - tag_manager:
          add_tags: ["tag3", "tag4"]
          add_when: <condition1>
          remove_tags: ["tag1", "tag2"]
          remove_when: <condition2>

@dlvenable
Copy link
Member

@kkondaka , I think the only item left here is to include tags in the documents for OpenSearch.

If we don't get this in for 2.3, can you create a new GitHub issue for this addition?

@kkondaka
Copy link
Collaborator

kkondaka commented Jun 5, 2023

Completed by #2745 #2690 #2680 #2629

@kkondaka kkondaka closed this as completed Jun 5, 2023
@dlvenable
Copy link
Member

The feature for writing tags to OpenSearch documents was not included in this version and will be worked as part of #2827.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment