Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Enhance ParseXML function with intuitive parse format and add MarshalXML function to convert parsed output back to XML #35210

Closed
wants to merge 5 commits into from

Conversation

shazlehu
Copy link
Contributor

Description:

Current implementation of ParseXML

The current implementation of the ParseXML produces output that is difficult for users to manipulate, particularly when the XML is complicated. The current design loses the sense of key:value pairs which are helpful for understanding the data, and, more importantly, always parses the XML into arrays, which are difficult to manipulate in OTTL, and don't provide a stable path that can be used to access the data.

This change proposes an additional parsing version for the ParseXML function that would parse XML into a nested map, where the keys are the paths to the data, and the values are the data itself, or in the case of nodes with both values and attributes, using the special keys xml_attributes and xml_value. It still maintains the order of nodes using the special key xml_ordering. Additionally, the proposed implementation can optionally "flatten" arrays, such as the EventData node in the following XML, into a single map. The arrays must have a single, common attribute, and no other children (in this case, EventData has only Data children). This would allow users to access the data more intuitively and would allow for easier manipulation of the data in OTTL.

This proposed change also includes a new function, MarshalXML, to reconstruct the XML from the parsed map. This would allow users to manipulate the data in OTTL, and then convert it back to XML, typically for ingestion into another system.

The proposed implementation would be backward compatible with the current implementation, allowing the user to specify the new implementation as an optional version parameter to the ParseXML function.

Consider the following XML from a Windows event log, a common telemetry source:

<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
    <System>
        <Provider Name='Microsoft-Windows-Security-Auditing'
            Guid='{54849625-5478-4994-a5ba-3e3b0328c30d}' />
        <EventID>4625</EventID>
        <Version>0</Version>
        <Level>0</Level>
        <Task>12544</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8010000000000000</Keywords>
        <TimeCreated SystemTime='2024-09-04T08:38:09.7477579Z' />
        <EventRecordID>1361885</EventRecordID>
        <Correlation ActivityID='{b67ee0c2-a671-0001-5f6b-82e8c1eeda01}' />
        <Execution ProcessID='656' ThreadID='2276' />
        <Channel>Security</Channel>
        <Computer>samuel-vahala</Computer>
        <Security />
    </System>
    <EventData>
        <Data Name='SubjectUserSid'>S-1-0-0</Data>
        <Data Name='SubjectUserName'>-</Data>
        <Data Name='SubjectDomainName'>-</Data>
        <Data Name='SubjectLogonId'>0x0</Data>
        <Data Name='TargetUserSid'>S-1-0-0</Data>
        <Data Name='TargetUserName'>mr_rogers</Data>
        <Data Name='TargetDomainName'>-</Data>
        <Data Name='Status'>0xc000006d</Data>
        <Data Name='FailureReason'>%%2313</Data>
        <Data Name='SubStatus'>0xc0000064</Data>
        <Data Name='LogonType'>3</Data>
        <Data Name='LogonProcessName'>NtLmSsp </Data>
        <Data Name='AuthenticationPackageName'>NTLM</Data>
        <Data Name='WorkstationName'>D-508</Data>
        <Data Name='TransmittedServices'>-</Data>
        <Data Name='LmPackageName'>-</Data>
        <Data Name='KeyLength'>0</Data>
        <Data Name='ProcessId'>0x0</Data>
        <Data Name='ProcessName'>-</Data>
        <Data Name='IpAddress'>194.169.175.45</Data>
        <Data Name='IpPort'>0</Data>
    </EventData>
</Event>

The current ParseXML function renders this output, in a flattened view, using dot-notation:

Path Value
children.0.children.0["attributes"]["Guid"] {54849625-5478-4994-a5ba-3e3b0328c30d}
children.0.children.0["attributes"]["Name"] Microsoft-Windows-Security-Auditing
children.0.children.0["tag"] Provider
children.0.children.1["content"] 4625
children.0.children.1["tag"] EventID
children.0.children.10["attributes"]["ProcessID"] 656
children.0.children.10["attributes"]["ThreadID"] 4652
children.0.children.10["tag"] Execution
children.0.children.11["content"] Security
children.0.children.11["tag"] Channel
children.0.children.12["content"] samuel-windows
children.0.children.12["tag"] Computer
children.0.children.13["tag"] Security
children.0.children.2["content"] 0
children.0.children.2["tag"] Version
children.0.children.3["content"] 0
children.0.children.3["tag"] Level
children.0.children.4["content"] 12544
children.0.children.4["tag"] Task
children.0.children.5["content"] 0
children.0.children.5["tag"] Opcode
children.0.children.6["content"] 0x8010000000000000
children.0.children.6["tag"] Keywords
children.0.children.7["attributes"]["SystemTime"] 2024-08-15T20:03:15.9454501Z
children.0.children.7["tag"] TimeCreated
children.0.children.8["content"] 57220
children.0.children.8["tag"] EventRecordID
children.0.children.9["attributes"]["ActivityID"] {b67ee0c2-a671-0001-5f6b-82e8c1eeda01}
children.0.children.9["tag"] Correlation
children.0.tag System
children.1.children.0["attributes"]["Name"] SubjectUserSid
children.1.children.0["content"] S-1-0-0
children.1.children.0["tag"] Data
children.1.children.1["attributes"]["Name"] SubjectUserName
children.1.children.1["content"] -
children.1.children.1["tag"] Data
children.1.children.10["attributes"]["Name"] LogonType
children.1.children.10["content"] 3
children.1.children.10["tag"] Data
children.1.children.11["attributes"]["Name"] LogonProcessName
children.1.children.11["content"] NtLmSsp
children.1.children.11["tag"] Data
children.1.children.12["attributes"]["Name"] AuthenticationPackageName
children.1.children.12["content"] NTLM
children.1.children.12["tag"] Data
children.1.children.13["attributes"]["Name"] WorkstationName
children.1.children.13["content"] -
children.1.children.13["tag"] Data
children.1.children.14["attributes"]["Name"] TransmittedServices
children.1.children.14["content"] -
children.1.children.14["tag"] Data
children.1.children.15["attributes"]["Name"] LmPackageName
children.1.children.15["content"] -
children.1.children.15["tag"] Data
children.1.children.16["attributes"]["Name"] KeyLength
children.1.children.16["content"] 0
children.1.children.16["tag"] Data
children.1.children.17["attributes"]["Name"] ProcessId
children.1.children.17["content"] 0x0
children.1.children.17["tag"] Data
children.1.children.18["attributes"]["Name"] ProcessName
children.1.children.18["content"] -
children.1.children.18["tag"] Data
children.1.children.19["attributes"]["Name"] IpAddress
children.1.children.19["content"] 103.151.140.135
children.1.children.19["tag"] Data
children.1.children.2["attributes"]["Name"] SubjectDomainName
children.1.children.2["content"] -
children.1.children.2["tag"] Data
children.1.children.20["attributes"]["Name"] IpPort
children.1.children.20["content"] 0
children.1.children.20["tag"] Data
children.1.children.3["attributes"]["Name"] SubjectLogonId
children.1.children.3["content"] 0x0
children.1.children.3["tag"] Data
children.1.children.4["attributes"]["Name"] TargetUserSid
children.1.children.4["content"] S-1-0-0
children.1.children.4["tag"] Data
children.1.children.5["attributes"]["Name"] TargetUserName
children.1.children.5["content"] Administrator
children.1.children.5["tag"] Data
children.1.children.6["attributes"]["Name"] TargetDomainName
children.1.children.6["content"] domain
children.1.children.6["tag"] Data
children.1.children.7["attributes"]["Name"] Status
children.1.children.7["content"] 0xc000006d
children.1.children.7["tag"] Data
children.1.children.8["attributes"]["Name"] FailureReason
children.1.children.8["content"] %%2313
children.1.children.8["tag"] Data
children.1.children.9["attributes"]["Name"] SubStatus
children.1.children.9["content"] 0xc000006a
children.1.children.9["tag"] Data
children.1.tag EventData
tag Event

This view is extremely opaque, and difficult to manipulate in OTTL. The current design also doesn't provide a stable path that can be used to access the data. For example, the path to the EventID is children.0.children.1["content"], which is not intuitive.

Proposed implementation:

The proposed implementation would parse the XML into nested maps. In the case where a node has multiple children with the same name, the children would be stored in an array. If a node has no attributes or children, its value would be stored as a string. Otherwise, it would be stored in the special key xml_value. The xml_attributes key would store the attributes of the node, and the special key xml_ordering would store the order of the children of the node, which is important for marshaling the nodes in the same order back into XML. For example, the schema definition for <EventData> in a Windows Event log specifies that the <Data> nodes are a sequence (<xs:sequence>), so the order of the nodes should be preserved:

<xs:complexType name="EventDataType">
    <xs:sequence>
        <xs:choice
            minOccurs="0"
            maxOccurs="unbounded"
        >
            <xs:element name="Data"
                type="DataType"
             />
            <xs:element name="ComplexData"
                type="ComplexDataType"
             />
        </xs:choice>
        <xs:element name="Binary"
            type="hexBinary"
            minOccurs="0"
         />
    </xs:sequence>
    <xs:attribute name="Name"
        type="string"
        use="optional"
     />
</xs:complexType>

We prefix the special keys with xml_ to avoid conflicts with the actual data. It is a limitation that if the original XML uses tags xml_value, xml_attributes, xml_ordering, or xml_flattened_array, it will not parse correctly. A future enhancement could be to add an optional parameter to the function to allow users to specify a different prefix.

Path Value
Event.EventData.Data.0["xml_attributes"]["Name"] SubjectUserSid
Event.EventData.Data.0["xml_value"] S-1-0-0
Event.EventData.Data.1["xml_attributes"]["Name"] SubjectUserName
Event.EventData.Data.10["xml_attributes"]["Name"] LogonType
Event.EventData.Data.10["xml_value"] 3
Event.EventData.Data.11["xml_attributes"]["Name"] LogonProcessName
Event.EventData.Data.11["xml_value"] NtLmSsp
Event.EventData.Data.12["xml_attributes"]["Name"] AuthenticationPackageName
Event.EventData.Data.12["xml_value"] NTLM
Event.EventData.Data.13["xml_attributes"]["Name"] WorkstationName
Event.EventData.Data.14["xml_attributes"]["Name"] TransmittedServices
Event.EventData.Data.15["xml_attributes"]["Name"] LmPackageName
Event.EventData.Data.16["xml_attributes"]["Name"] KeyLength
Event.EventData.Data.16["xml_value"] 0
Event.EventData.Data.17["xml_attributes"]["Name"] ProcessId
Event.EventData.Data.17["xml_value"] 0x0
Event.EventData.Data.18["xml_attributes"]["Name"] ProcessName
Event.EventData.Data.19["xml_attributes"]["Name"] IpAddress
Event.EventData.Data.19["xml_value"] 103.151.140.135
Event.EventData.Data.2["xml_attributes"]["Name"] SubjectDomainName
Event.EventData.Data.20["xml_attributes"]["Name"] IpPort
Event.EventData.Data.20["xml_value"] 0
Event.EventData.Data.3["xml_attributes"]["Name"] SubjectLogonId
Event.EventData.Data.3["xml_value"] 0x0
Event.EventData.Data.4["xml_attributes"]["Name"] TargetUserSid
Event.EventData.Data.4["xml_value"] S-1-0-0
Event.EventData.Data.5["xml_attributes"]["Name"] TargetUserName
Event.EventData.Data.5["xml_value"] Administrator
Event.EventData.Data.6["xml_attributes"]["Name"] TargetDomainName
Event.EventData.Data.6["xml_value"] domain
Event.EventData.Data.7["xml_attributes"]["Name"] Status
Event.EventData.Data.7["xml_value"] 0xc000006d
Event.EventData.Data.8["xml_attributes"]["Name"] FailureReason
Event.EventData.Data.8["xml_value"] %%2313
Event.EventData.Data.9["xml_attributes"]["Name"] SubStatus
Event.EventData.Data.9["xml_value"] 0xc000006a
Event.EventData.xml_ordering Data.0,Data.1,Data.2,Data.3,Data.4,Data.5,Data.6,Data.7,Data.8,Data.9,Data.10,Data.11,Data.12,Data.13,Data.14,Data.15,Data.16,Data.17,Data.18,Data.19,Data.20
Event.System.Channel Security
Event.System.Computer samuel-windows
Event.System.Correlation.xml_attributes.ActivityID {b67ee0c2-a671-0001-5f6b-82e8c1eeda01}
Event.System.EventID 4625
Event.System.EventRecordID 57220
Event.System.Execution.xml_attributes.ProcessID 656
Event.System.Execution.xml_attributes.ThreadID 4652
Event.System.Keywords 0x8010000000000000
Event.System.Level 0
Event.System.Opcode 0
Event.System.Provider.xml_attributes.Guid {54849625-5478-4994-a5ba-3e3b0328c30d}
Event.System.Provider.xml_attributes.Name Microsoft-Windows-Security-Auditing
Event.System.Task 12544
Event.System.TimeCreated.xml_attributes.SystemTime 2024-08-15T20:03:15.9454501Z
Event.System.Version 0
Event.System.xml_ordering Provider,EventID,Version,Level,Task,Opcode,Keywords,TimeCreated,EventRecordID,Correlation,Execution,Channel,Computer,Security
Event.xml_attributes.xmlns http://schemas.microsoft.com/win/2004/08/events/event
Event.xml_ordering System,EventData

Additionally, the proposed implementation can optionally "flatten" arrays, such as the EventData node in the XML, into a single map. The arrays must have a single, common attribute, and no other children (in this case, EventData has only Data children). This would allow users to access the data in a more intuitive way, and would allow for easier manipulation of the data in OTTL.

Flattened Parse

Path Value
Event.EventData.Data.AuthenticationPackageName NTLM
Event.EventData.Data.FailureReason %%2313
Event.EventData.Data.IpAddress 103.151.140.135
Event.EventData.Data.IpPort 0
Event.EventData.Data.KeyLength 0
Event.EventData.Data.LogonProcessName NtLmSsp
Event.EventData.Data.LogonType 3
Event.EventData.Data.ProcessId 0x0
Event.EventData.Data.Status 0xc000006d
Event.EventData.Data.SubStatus 0xc000006a
Event.EventData.Data.SubjectLogonId 0x0
Event.EventData.Data.SubjectUserSid S-1-0-0
Event.EventData.Data.TargetDomainName domain
Event.EventData.Data.TargetUserName Administrator
Event.EventData.Data.TargetUserSid S-1-0-0
Event.EventData.Data.xml_flattened_array Name
Event.EventData.Data.xml_ordering SubjectUserSid,SubjectUserName,SubjectDomainName,SubjectLogonId,TargetUserSid,TargetUserName,TargetDomainName,Status,FailureReason,SubStatus,LogonType,LogonProcessName,AuthenticationPackageName,WorkstationName,TransmittedServices,LmPackageName,KeyLength,ProcessId,ProcessName,IpAddress,IpPort
Event.System.Channel Security
Event.System.Computer samuel-windows
Event.System.Correlation.xml_attributes.ActivityID {b67ee0c2-a671-0001-5f6b-82e8c1eeda01}
Event.System.EventID 4625
Event.System.EventRecordID 57220
Event.System.Execution.xml_attributes.ProcessID 656
Event.System.Execution.xml_attributes.ThreadID 4652
Event.System.Keywords 0x8010000000000000
Event.System.Level 0
Event.System.Opcode 0
Event.System.Provider.xml_attributes.Guid {54849625-5478-4994-a5ba-3e3b0328c30d}
Event.System.Provider.xml_attributes.Name Microsoft-Windows-Security-Auditing
Event.System.Task 12544
Event.System.TimeCreated.xml_attributes.SystemTime 2024-08-15T20:03:15.9454501Z
Event.System.Version 0
Event.System.xml_ordering Provider,EventID,Version,Level,Task,Opcode,Keywords,TimeCreated,EventRecordID,Correlation,Execution,Channel,Computer,Security
Event.xml_attributes.xmlns http://schemas.microsoft.com/win/2004/08/events/event
Event.xml_ordering System,EventData

MarshalXML

The MarshalXML function is able to reliably reconstruct the XML, with the limitation imposed by the GO XML library, which doesn't support self closing tags (In this example, tags like <Security />.) The Unit tests for the MarshalXML function use the ParseXML function to parse the XML, and then compare the output of the MarshalXML function to the original XML. The test XML purposely omits self-closing tags.

Link to tracking Issue: #35076

Testing:

Added Unit tests for the new version, previous tests still pass for version 1.

Documentation:

Added documentation for the new parse version to pkg/ottl/ottlfuncs/README.md

@djaglowski
Copy link
Member

I think we should move discussion about the desired format back to the issue #35076.


Regarding this PR:

MarshalXML would be a new function. That should be proposed in a separate issue and implemented in a separately PR.

The notion of flatten seems quite specific to me. The given example demonstrates a case where it is useful, but at a glance it seems very fragile. I think any options like this need to be robustly explained. What happens if there is an element with multiple attributes? Does it not care if the children have the same tag? What happens if a child element is contains children of its own, or even just contains a string? Seems like there are a lot of very specific assumptions about this structure. I have to wonder if this actually belongs as part of parsing logic, rather than as a specific type of manipulation. (e.g. another function, which either operates directly on XML documents, or which could apply to data not originally parsed from xml)

The current implementation of the ParseXML produces output that is difficult for users to manipulate, particularly when the XML is complicated. The current design loses the sense of key:value pairs which are helpful for understanding the data, and, more importantly, always parses the XML into arrays, which are difficult to manipulate in OTTL, and don't provide a stable path that can be used to access the data.

I don't disagree with any of this, but it's not clear to me that the proposed format actually solves any of these problems. In your example, you generate paths like Event.EventData.Data.3. Isn't this an array which is difficult to manipulate in OTTL and also not a stable path? Moreover, is it necessary to use flatten, and therefore start with a structure which adheres to all the assumptions of flatten, in order to move beyond these drawbacks?

@shazlehu shazlehu closed this Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants