Extending data dictionaries? #95
Comments
Hi @nicolasreich. First of all, sorry for the late reply. So far we have developed data dictionaries as independent documents, as close as possible to the raw events produced by the sensor. The main goal is that you will always be able to drill down (i.e. from the data model) to the source of truth of an event and its fields. One of the tradeoffs is, as you suggest, duplicate information, which becomes apparent when you consume multiple events from the same sensor. We are, however, planning to improve Data Dictionaries to deal with situations where event fields can have different definitions depending on the event type, or where a field contains a nested structure (JSON, list, etc.) that we could use to extend the fieldset of the event. Regardless, I would be interested in further exploring your use case.
Hi @hxnoyd. No worries, it was the holidays for everyone. The rationale for this question was Suricata EVE JSON logs, where you have common fields, then nested fields for specific data. So for any alert, you get common fields, like source and destination IP addresses, as well as an ... So for an alert triggered by a DNS request, you would get something like:
While for an alert triggered by an HTTP request:
So the common fields are present in every event; the ... It's obviously possible to have a data dictionary for each alert type, each containing the common fields and the alert fields, but that means a lot of duplication, which invites mistakes and adds what seems like unnecessary verbiage. I think it would make sense to be able to extend a data dictionary, much like it's possible for entities. The rendered markdown version of the Data Dictionary would still be an independent document containing all the data.
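To make the proposal concrete, here is a minimal sketch of how an extended dictionary could be flattened back into an independent document before rendering. Everything in it is an assumption for illustration: the `extends` key, the dictionary names, and the field descriptions are hypothetical and are not taken from the project's actual schema.

```python
# Hypothetical data dictionaries. The "extends" key, the dictionary names and
# the field descriptions are illustrative assumptions, not the project's schema.
DICTIONARIES = {
    "suricata_common": {
        "fields": {
            "src_ip": "Source IP address",
            "dest_ip": "Destination IP address",
        },
    },
    "suricata_alert_dns": {
        "extends": "suricata_common",
        "fields": {"dns": "Nested data about the DNS request"},
    },
    "suricata_alert_http": {
        "extends": "suricata_common",
        "fields": {"http": "Nested data about the HTTP request"},
    },
}


def resolve(name, dictionaries, _seen=()):
    """Return a flat, self-contained copy of a dictionary and its parents."""
    if name in _seen:
        raise ValueError(f"circular 'extends' chain at {name!r}")
    entry = dictionaries[name]
    fields = {}
    parent = entry.get("extends")
    if parent is not None:
        # Inherited fields come first; the child may override them below.
        fields.update(resolve(parent, dictionaries, _seen + (name,))["fields"])
    fields.update(entry["fields"])
    return {"fields": fields}


resolved = resolve("suricata_alert_dns", DICTIONARIES)
print(list(resolved["fields"]))
```

The resolved result no longer carries an `extends` reference, so a rendered version of it would still stand alone, while the common fields are written only once in the source.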
Hi @nicolasreich. Thanks for the detailed explanation, it is now clearer what you mean by 'extending'. In a nutshell: deconstruct data dictionaries depending on field prevalence, to avoid duplicates and keep the data dictionary YAML as clean as possible. I see the benefit of such an approach for events in the same log source (keep it simple, reduce duplication), but it would mean an increase in the number of data dictionaries, since we would need to create the 'common fields' data dictionaries (e.g. src_ip, dest_ip, etc.). On the one hand we would have a schema with few duplicate fields and, on the other hand, we would have more YAML data dictionaries to maintain. Field name duplication has been raised multiple times in the past, but we always opted to keep the data dictionaries as close as possible to the original events, so that the community could customize them as needed. The main reason for this is to keep data dictionary atomicity: an absolutely independent object, or the source of truth in a single document if you like. By doing so we enable the community to model the data dictionaries as they like, according to their own needs (e.g. logstash pipelines). Regardless, I think your suggestion is aligned with our vision for the improvement of data dictionaries, possibly with the creation of a separate dictionary that would provide a first layer of abstraction for data dictionaries, where the community would be able to better map events to entities and/or the detection data model. This would allow us to keep the source of truth, at the expense of maintaining another dictionary with modeled/standardized events. Unfortunately the last few months have been insanely busy, and we haven't had the time to work on a PoC for this... but it is on the roadmap :)
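The "deconstruct by field prevalence" idea above can be sketched as a small helper that splits a set of per-event dictionaries into one shared part and the event-specific remainders. This is only an illustration under stated assumptions: the event names and descriptions are hypothetical, and the helper assumes a shared field has the same definition in every event, which the thread notes is not always the case.

```python
# Hypothetical per-event field definitions; names and descriptions are
# illustrative only.
EVENTS = {
    "dns_alert": {
        "src_ip": "Source IP address",
        "dest_ip": "Destination IP address",
        "dns": "Nested data about the DNS request",
    },
    "http_alert": {
        "src_ip": "Source IP address",
        "dest_ip": "Destination IP address",
        "http": "Nested data about the HTTP request",
    },
}


def factor_common(dictionaries):
    """Split dictionaries into (common fields, per-event specific fields)."""
    field_sets = [set(fields) for fields in dictionaries.values()]
    # A field is "common" only if it appears in every dictionary.
    shared = set.intersection(*field_sets) if field_sets else set()
    first = next(iter(dictionaries.values()), {})
    # Assumption: the first dictionary's description is valid for all events.
    common = {name: desc for name, desc in first.items() if name in shared}
    specifics = {
        event: {name: desc for name, desc in fields.items() if name not in shared}
        for event, fields in dictionaries.items()
    }
    return common, specifics


common, specifics = factor_common(EVENTS)
print(sorted(common))
print({event: sorted(fields) for event, fields in specifics.items()})
```

The tradeoff raised in the reply is visible here: the duplication disappears, but one extra "common" dictionary now has to be maintained alongside the per-event ones.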
There is an extension mechanism for entities, in order not to duplicate field definitions. It would be good to have such a mechanism for data dictionaries as well. For example, all Zeek network protocol events have fields for source and destination IP and port, which are duplicated across all the data dictionaries; instead, they all could extend a generic dictionary which defines these common fields. What do you think? Is that already part of your plans?