Promtail's windows_events scraper should produce parseable, structured data #10608

Open
RamblingCookieMonster opened this issue Sep 15, 2023 · 4 comments

Comments

@RamblingCookieMonster

RamblingCookieMonster commented Sep 15, 2023

Is your feature request related to a problem? Please describe.

The current implementation of the windows_events scraper for promtail does not fully parse Windows events into parseable structured data. There are many negative outcomes across various stakeholders; for example:

  • You / Grafana Labs - e.g. I would push a more batteries-included log management solution before recommending Grafana Cloud + Promtail
  • Customers - e.g. Wasting time either duct taping OSS configs themselves, or living with unstructured data
  • Community - e.g. No reasonable standard of data/schema to rely on. Instead of tools standardizing on some event_data -> prefix_attributename unrolling (e.g. Data_attributename for Telegraf), community efforts rely on data intended for human eyes that should be droppable. See this Sigma parser, which (1) relies on regex parsing of a templated field designed for human eyes (Message), (2) could break if event schemas change, and (3) means folks must decide whether to drop that field to reduce their resource consumption (the Message field is not critical; it is a template with the actual event data filled in, and should not be relied on) or keep the field to enable those Sigma queries

You can likely imagine other reasons. Parseable structured data is sort of critical in the world of logs, and systems that use their data.

Here's an example of what you produce today via Promtail's windows_events:

{
  "source": "Microsoft-Windows-TaskScheduler",
  "channel": "Microsoft-Windows-TaskScheduler/Operational",
  "computer": "REDACTED",
  "event_id": 201,
  "version": 2,
  "level": 4,
  "task": 201,
  "opCode": 2,
  "levelText": "Information",
  "taskText": "Action completed",
  "opCodeText": "Stop",
  "keywords": "0x8000000000000000",
  "timeCreated": "2023-09-14T17:10:00.579307200Z",
  "eventRecordID": 2295677,
  "correlation": {
    "activityID": "{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}"
  },
  "execution": {
    "processId": 1816,
    "threadId": 2892,
    "processName": "svchost.exe"
  },
  "security": {
    "userId": "S-1-5-18",
    "userName": "NT AUTHORITY\\SYSTEM"
  },
  "event_data": "<Data Name='TaskName'>\\REDACTED</Data><Data Name='TaskInstanceId'>{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}</Data><Data Name='ActionName'>C:\\Windows\\SYSTEM32\\cmd.exe</Data><Data Name='ResultCode'>0</Data><Data Name='EnginePID'>4628</Data>",
  "message": "Task Scheduler successfully completed task \"\\REDACTED\" , instance \"{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}\" , action \"C:\\Windows\\SYSTEM32\\cmd.exe\" with return code 0."
}

Notice the event_data. It is not parsed into named fields (in this case TaskName, TaskInstanceId, etc., or preferably with a prefix like Data_TaskName to avoid collisions, as used by Telegraf); instead, it is an XML-ish string bunched into a single field.

This results in folks relying on one-off (not scalable or generalizable) "solutions" using that XML-y field, or relying on the Message field (which, again, should not be relied on) with rather absurd queries like this one, from the previously referenced Sigma post:

{job=~"eventlog|winlog|windows|fluentbit.*"} 
| json | label_format Message=`{{ .message | replace "\\" "\\\\" | replace "\"" "\\\"" }}` 
| line_format `{{ regexReplaceAll "([^:]+): ?((?:[^\\r]*|$))(\r\n|$)" .Message "${1}=\"${2}\" "}}` 
| logfmt | event_id=1 and ...

Describe the solution you'd like

Parse EventData and UserData please. You likely should do this on the Windows/promtail side of the house. I cannot help you here, but I can at least point out that Telegraf, Winlogbeat, Splunk, and presumably other agents can do this (IMHO) bare-minimum windows event parsing.

For example, "event_data": "<Data Name='TaskName'>\\REDACTED</Data><Data Name='TaskInstanceId'>{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}</Data><Data Name='ActionName'>C:\\Windows\\SYSTEM32\\cmd.exe</Data><Data Name='ResultCode'>0</Data><Data Name='EnginePID'>4628</Data>" might expand to:

{
  "TaskName": "\\REDACTED",
  "TaskInstanceId": "{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}",
  "ActionName": "C:\\Windows\\SYSTEM32\\cmd.exe",
  "ResultCode": 0,
  "EnginePID": 4628
}

Escaping of " and \ within values would need to be considered; I wrote the above by hand, so it is not going to be perfect. You might also prefix the keys (e.g. Data_TaskName, Data_ResultCode) and move them to the root level, or make this an option. That would particularly help the community, who might be relying on Telegraf, which uses that convention (Data_ prefix).

This should cover UserData as well, on the subset of events with this.
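As a rough illustration of the expansion described above, here is a minimal Python sketch. It is not how promtail or any other agent does it; the function name `parse_event_data` and the `param1`, `param2` fallback keys are made up for this example. It assumes the fragment is well-formed XML once wrapped in a root element, and it leaves every value as a string rather than guessing types.

```python
import json
import xml.etree.ElementTree as ET


def parse_event_data(fragment, prefix=""):
    """Turn a run of <Data Name='...'>value</Data> elements into a dict.

    Hypothetical helper for illustration only.
    """
    # The fragment has no single root element, so wrap it before parsing.
    root = ET.fromstring(f"<EventData>{fragment}</EventData>")
    fields = {}
    for index, node in enumerate(root.iter("Data")):
        # Assumption: unnamed <Data> elements get a positional fallback key.
        name = node.get("Name") or f"param{index + 1}"
        # Values stay strings; no type conversion is attempted.
        fields[prefix + name] = node.text or ""
    return fields


# The event_data string from the example event, written as a Python literal.
raw = (
    "<Data Name='TaskName'>\\REDACTED</Data>"
    "<Data Name='TaskInstanceId'>{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}</Data>"
    "<Data Name='ActionName'>C:\\Windows\\SYSTEM32\\cmd.exe</Data>"
    "<Data Name='ResultCode'>0</Data>"
    "<Data Name='EnginePID'>4628</Data>"
)

parsed = parse_event_data(raw, prefix="Data_")
# json.dumps takes care of escaping " and \ in the output.
print(json.dumps(parsed, indent=2))
```

Passing `prefix=""` keeps the bare attribute names, while `prefix="Data_"` follows the Telegraf-style convention mentioned above.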

Describe alternatives you've considered

  • promtail with the current windows_events scraper is a non-starter given that it does not parse event data (e.g. EventData or UserData). This includes Grafana Agent as that embeds promtail.
  • fluent-bit makes a start. It parses event data, but it does so into an array, where you would need to reference schema (e.g. index 0 of the event data array for this specific event id and source might be TaskName). This is also a non-starter - if I want to query multiple events for SubjectUserSid, I want to reference that field, not look up what index it is, possibly across multiple event IDs, resulting in an unreadable query. Please do not go this route.
  • telegraf works! Albeit with some telegraf processors to ensure the data it sends results in a format that logfmt can process. Example processors here. With that said, I am hoping promtail will implement similar functionality (producing structured data with named fields parseable by json or logfmt on the Loki side)
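To make the fluent-bit concern concrete, here is a small Python sketch of what a consumer ends up maintaining when event data arrives as a positional array. The `POSITIONAL_SCHEMA` table and `name_positional_values` function are hypothetical; the single schema entry simply mirrors the field order of the TaskScheduler example earlier in this issue.

```python
# Hypothetical lookup table: field order per (source, event ID).
# Every event type someone wants to query by name needs its own entry.
POSITIONAL_SCHEMA = {
    ("Microsoft-Windows-TaskScheduler", 201): [
        "TaskName", "TaskInstanceId", "ActionName", "ResultCode", "EnginePID",
    ],
}


def name_positional_values(source, event_id, values):
    """Attach names to positional event data, if a schema entry exists."""
    names = POSITIONAL_SCHEMA.get((source, event_id))
    if names is None:
        # Without a schema entry, only the index is known.
        return {f"param{i + 1}": v for i, v in enumerate(values)}
    return dict(zip(names, values))


values = [
    "\\REDACTED",
    "{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}",
    "C:\\Windows\\SYSTEM32\\cmd.exe",
    "0",
    "4628",
]

known = name_positional_values("Microsoft-Windows-TaskScheduler", 201, values)
unknown = name_positional_values("Some-Other-Provider", 1, values)

print(known["ResultCode"])   # addressable by name
print(sorted(unknown))       # only positional keys are available
```

When the agent emits named fields in the first place, none of this lookup machinery is needed on the query side.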

Additional context

Not much. A few references:

If this is just me holding it wrong, please let me know, but after a few days of reading and testing, I'm pretty confident this is indeed not in place. I include it as a "feature", but to me, for a logging solution, this is more a "bug". Thanks!

@RamblingCookieMonster
Author

RamblingCookieMonster commented Sep 16, 2023

Oh dear, I see the example I linked actually turned into a parse-the-message-field implementation. While... that is something, and took time and effort, I want to emphasize that for Windows, that is absolutely not the approach to take, though I totally understand that perhaps not everyone using promtail/Loki has Windows experience.

There are other references, but take this Microsoft-provided spreadsheet that focuses solely on the Security log event IDs (and presumably only a subset, as things have changed). Note the Complete Event Messages sheet. It illustrates how Windows Events work (there are far better / deeper references, but this is a simple way to illustrate it). For example:

Event ID 4713:

Kerberos policy was changed.

Subject:
 Security ID:  %1
 Account Name:  %2
 Account Domain:  %3
 Logon ID:  %4

Changes Made:
('--' means no changes, otherwise each change is shown as:
(Parameter Name): (new value) (old value))
%5

%5 would not be captured in this case.

Event ID 4899:

A Certificate Services template was updated.

%1 v%2 (Schema V%3)
%4
%5

Template Change Information:
 Old Template Content: %8
 New Template Content:  %7

Additional Information:
 Domain Controller: %6

More data that would not be parsed

Event ID 4624:

An account was successfully logged on.

              Subject:
                  Security ID:        %1
                  Account Name:        %2
                  Account Domain:        %3
                  Logon ID:        %4

              Logon Type:            %9

              New Logon:
                  Security ID:        %5
                  Account Name:        %6
                  Account Domain:        %7
                  Logon ID:        %8
                  Logon GUID:        %13

              Process Information:
                  Process ID:        %17
                  Process Name:        %18

              Network Information:
                  Workstation Name:    %12
                  Source Network Address:    %19
                  Source Port:        %20

              Detailed Authentication Information:
                  Logon Process:        %10
                  Authentication Package:    %11
                  Transited Services:    %14
                  Package Name (NTLM only):    %15
                  Key Length:        %16

              This event is generated when a logon session is created. It is generated on the computer that was
              accessed.

              The subject fields indicate the account on the local system which requested the logon. This is most
              commonly a service such as the Server service, or a local process such as Winlogon.exe or Services.exe.

              The logon type field indicates the kind of logon that occurred. The most common types are 2
              (interactive) and 3 (network).

              The New Logon fields indicate the account for whom the new logon was created, i.e. the account that was
              logged on.

              The network fields indicate where a remote logon request originated. Workstation name is not always
              available and may be left blank in some cases.

              The impersonation level field indicates the extent to which a process in the logon session can
              impersonate.

              The authentication information fields provide detailed information about this specific logon request.
                  - Logon GUID is a unique identifier that can be used to correlate this event with a KDC event.
                  - Transited services indicate which intermediate services have participated in this logon request.
                  - Package name indicates which sub-protocol was used among the NTLM protocols.
                  - Key length indicates the length of the generated session key. This will be 0 if no session key was
              requested.

So... Maybe the parsing accounted for this, but how would this parse Security ID and differentiate the subject from the new login? Also, do you see how long that field is with all that text? So in addition to the real event_data, this massive string is sent for an event ID that is quite, quite common in busy environments.
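The relationship between the template and the event data can be sketched in a few lines of Python. This is a simplified model, not the actual Windows rendering logic: `render_message` is a made-up helper, the template is a shortened excerpt of the 4624 text above, and the sample values are invented. It shows that the message is only the positional values poured into a template, and that parsing it back by label gives two "Security ID" hits with nothing to tell them apart.

```python
import re

# Shortened excerpt of the Event ID 4624 template shown above.
TEMPLATE_4624_EXCERPT = (
    "Subject:\n"
    "    Security ID:  %1\n"
    "    Account Name:  %2\n"
    "New Logon:\n"
    "    Security ID:  %5\n"
    "    Account Name:  %6\n"
)


def render_message(template, values):
    """Replace %N placeholders with the Nth (1-based) event data value."""
    def substitute(match):
        index = int(match.group(1)) - 1
        # Leave the placeholder untouched if no value exists for it.
        if 0 <= index < len(values):
            return values[index]
        return match.group(0)

    return re.sub(r"%(\d+)", substitute, template)


# Invented sample values, in positional order %1 through %6.
sample_values = ["S-1-5-18", "HOST$", "EXAMPLE", "0x3e7", "S-1-5-21-EXAMPLE", "alice"]

message = render_message(TEMPLATE_4624_EXCERPT, sample_values)
print(message)

# Parsing the rendered text back by label finds the same label twice.
security_ids = re.findall(r"Security ID:\s+(\S+)", message)
print(security_ids)
```

A label-based parser has to track the surrounding section headings to tell the two values apart, whereas the underlying event data already carries each value separately.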

That data should be in a much more compact set of fields that windows provides, but which is not currently parsed. Here's an example from winlogbeat, which among other agents, parses this data without relying on the Message field:

    "event_data": {
      "ProcessName": "C:\\Windows\\System32\\lsass.exe",
      "LogonGuid": "{00000000-0000-0000-0000-000000000000}",
      "TargetOutboundDomainName": "-",
      "VirtualAccount": "%%1843",
      "IpPort": "52024",
      "TransmittedServices": "-",
      "LmPackageName": "-",
      "RestrictedAdminMode": "-",
      "ElevatedToken": "%%1842",
      "WorkstationName": "REDACTED",
      "SubjectDomainName": "REDACTED",
      "TargetDomainName": "REDACTED",
      "LogonProcessName": "Advapi  ",
      "LogonType": "3",
      "SubjectLogonId": "0x3e7",
      "KeyLength": "0",
      "TargetOutboundUserName": "-",
      "TargetLogonId": "0x1a2497c9f",
      "TargetLinkedLogonId": "0x0",
      "SubjectUserName": "REDACTED$",
      "IpAddress": "REDACTED",
      "ImpersonationLevel": "%%1833",
      "ProcessId": "0x530",
      "TargetUserName": "REDACTED",
      "SubjectUserSid": "S-1-5-18",
      "TargetUserSid": "S-1-5-21-REDACTED",
      "AuthenticationPackageName": "MICROSOFT_AUTHENTICATION_PACKAGE_V1_0"
    },

Cheers!

@cstyan
Contributor

cstyan commented Dec 22, 2023

Hello, thanks for reporting this.

We're currently reevaluating Promtail's position as a project within Grafana Labs. Internally we're actually using the Agent for both metrics and logs collection at this point. Additionally, the Agent team is more likely to have time to dedicate to this. It's likely a fix would only go into the Agent, but if there's an argument for adding a change here in Promtail as well, that can be discussed.

At the very least, the Agent team is actually going to have people who would have context about Windows in general.

@mennotech

@RamblingCookieMonster Would you consider opening this issue with the Grafana Agent team? I am running into the same issue using the Grafana Agent. You've spent the time creating a well crafted issue / feature request, and it would be great if the appropriate team was notified. I could try creating the request there, but it wouldn't be as thorough a post as you have here.

It seems that winlogbeat parses the event_data into separate fields (see https://www.elastic.co/guide/en/beats/winlogbeat/current/exported-fields-winlog.html#_event_data).

My workaround may be to have winlogbeat write the Windows security events to a text file and then have Grafana Agent read this file and push it to Loki. This should work, but greatly complicates the setup.

@RamblingCookieMonster
Author

RamblingCookieMonster commented Apr 17, 2024

@mennotech - feel free to borrow from this and/or copy it over! Yeah, we ended up avoiding the write-back approach for this and a few other spots where it would have been handy (it also puts a bit more pressure on IO/storage, but it does work, good find!). Ultimately, we're likely going to end up using Splunk for this sort of data, so while this is something I would encourage Grafana Labs to implement, it's not something I'll have time to push for. Cheers!
