Flatten XML Parser? #309

Closed
chriscroome opened this issue Oct 26, 2022 · 13 comments
Labels: enhancement, ready-to-ship

Comments

@chriscroome
Contributor

XML that relies heavily on attributes rather than elements results in JSON that is hard to work with using Ansible / JMESPath. For example, nmap has XML output (but no JSON output):

nmap -oX - -p 443 galaxy.ansible.com | xmllint --pretty 1 -
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nmaprun>
<?xml-stylesheet href="file:///usr/bin/../share/nmap/nmap.xsl" type="text/xsl"?>
<!-- Nmap 7.92 scan initiated Wed Oct 26 11:51:38 2022 as: nmap -oX - -p 443 galaxy.ansible.com -->
<nmaprun scanner="nmap" args="nmap -oX - -p 443 galaxy.ansible.com" start="1666781498" startstr="Wed Oct 26 11:51:38 2022" version="7.92" xmloutputversion="1.05">
  <scaninfo type="connect" protocol="tcp" numservices="1" services="443"/>
  <verbose level="0"/>
  <debugging level="0"/>
  <hosthint>
    <status state="up" reason="unknown-response" reason_ttl="0"/>
    <address addr="172.67.68.251" addrtype="ipv4"/>
    <hostnames>
      <hostname name="galaxy.ansible.com" type="user"/>
    </hostnames>
  </hosthint>
  <host starttime="1666781498" endtime="1666781498">
    <status state="up" reason="syn-ack" reason_ttl="0"/>
    <address addr="172.67.68.251" addrtype="ipv4"/>
    <hostnames>
      <hostname name="galaxy.ansible.com" type="user"/>
      <hostname name="galaxy.ansible.com" type="PTR"/>
    </hostnames>
    <ports>
      <port protocol="tcp" portid="443">
        <state state="open" reason="syn-ack" reason_ttl="0"/>
        <service name="https" method="table" conf="3"/>
      </port>
    </ports>
    <times srtt="12260" rttvar="9678" to="100000"/>
  </host>
  <runstats>
    <finished time="1666781498" timestr="Wed Oct 26 11:51:38 2022" summary="Nmap done at Wed Oct 26 11:51:38 2022; 1 IP address (1 host up) scanned in 0.10 seconds" elapsed="0.10" exit="success"/>
    <hosts up="1" down="0" total="1"/>
  </runstats>
</nmaprun>

Convert this into JSON / YAML and the results are not great...

nmap -oX - -p 443 galaxy.ansible.com | xmllint --pretty 1 - | jc --xml -py
---
nmaprun:
  '@scanner': nmap
  '@args': nmap -oX - -p 443 galaxy.ansible.com
  '@start': '1666781628'
  '@startstr': Wed Oct 26 11:53:48 2022
  '@version': '7.92'
  '@xmloutputversion': '1.05'
  scaninfo:
    '@type': connect
    '@protocol': tcp
    '@numservices': '1'
    '@services': '443'
  verbose:
    '@level': '0'
  debugging:
    '@level': '0'
  hosthint:
    status:
      '@state': up
      '@reason': unknown-response
      '@reason_ttl': '0'
    address:
      '@addr': 172.67.68.251
      '@addrtype': ipv4
    hostnames:
      hostname:
        '@name': galaxy.ansible.com
        '@type': user
  host:
    '@starttime': '1666781628'
    '@endtime': '1666781628'
    status:
      '@state': up
      '@reason': syn-ack
      '@reason_ttl': '0'
    address:
      '@addr': 172.67.68.251
      '@addrtype': ipv4
    hostnames:
      hostname:
      - '@name': galaxy.ansible.com
        '@type': user
      - '@name': galaxy.ansible.com
        '@type': PTR
    ports:
      port:
        '@protocol': tcp
        '@portid': '443'
        state:
          '@state': open
          '@reason': syn-ack
          '@reason_ttl': '0'
        service:
          '@name': https
          '@method': table
          '@conf': '3'
    times:
      '@srtt': '13479'
      '@rttvar': '11398'
      '@to': '100000'
  runstats:
    finished:
      '@time': '1666781628'
      '@timestr': Wed Oct 26 11:53:48 2022
      '@summary': Nmap done at Wed Oct 26 11:53:48 2022; 1 IP address (1 host up)
        scanned in 0.10 seconds
      '@elapsed': '0.10'
      '@exit': success
    hosts:
      '@up': '1'
      '@down': '0'
      '@total': '1'

However, if the XML is flattened using XSLT first:

nmap -oX - -p 443 galaxy.ansible.com | xmllint --pretty 1 - > galaxy.ansible.com.xml
xsltproc attributes2elements.xslt galaxy.ansible.com.xml 
<?xml version="1.0"?>
<?xml-stylesheet href="file:///usr/bin/../share/nmap/nmap.xsl" type="text/xsl"?><!-- Nmap 7.92 scan initiated Wed Oct 26 12:01:56 2022 as: nmap -oX - -p 443 galaxy.ansible.com -->
<nmaprun><scanner>nmap</scanner><args>nmap -oX - -p 443 galaxy.ansible.com</args><start>1666782116</start><startstr>Wed Oct 26 12:01:56 2022</startstr><version>7.92</version><xmloutputversion>1.05</xmloutputversion>
  <scaninfo><type>connect</type><protocol>tcp</protocol><numservices>1</numservices><services>443</services></scaninfo>
  <verbose><level>0</level></verbose>
  <debugging><level>0</level></debugging>
  <hosthint>
    <status><state>up</state><reason>unknown-response</reason><reason_ttl>0</reason_ttl></status>
    <address><addr>172.67.68.251</addr><addrtype>ipv4</addrtype></address>
    <hostnames>
      <hostname><name>galaxy.ansible.com</name><type>user</type></hostname>
    </hostnames>
  </hosthint>
  <host><starttime>1666782116</starttime><endtime>1666782116</endtime>
    <status><state>up</state><reason>syn-ack</reason><reason_ttl>0</reason_ttl></status>
    <address><addr>172.67.68.251</addr><addrtype>ipv4</addrtype></address>
    <hostnames>
      <hostname><name>galaxy.ansible.com</name><type>user</type></hostname>
      <hostname><name>galaxy.ansible.com</name><type>PTR</type></hostname>
    </hostnames>
    <ports>
      <port><protocol>tcp</protocol><portid>443</portid>
        <state><state>open</state><reason>syn-ack</reason><reason_ttl>0</reason_ttl></state>
        <service><name>https</name><method>table</method><conf>3</conf></service>
      </port>
    </ports>
    <times><srtt>10773</srtt><rttvar>8291</rttvar><to>100000</to></times>
  </host>
  <runstats>
    <finished><time>1666782116</time><timestr>Wed Oct 26 12:01:56 2022</timestr><summary>Nmap done at Wed Oct 26 12:01:56 2022; 1 IP address (1 host up) scanned in 0.10 seconds</summary><elapsed>0.10</elapsed><exit>success</exit></finished>
    <hosts><up>1</up><down>0</down><total>1</total></hosts>
  </runstats>
</nmaprun>

You then have something that is nicer to work with:

xsltproc attributes2elements.xslt galaxy.ansible.com.xml | jc --xml -py
---
nmaprun:
  scanner: nmap
  args: nmap -oX - -p 443 galaxy.ansible.com
  start: '1666782116'
  startstr: Wed Oct 26 12:01:56 2022
  version: '7.92'
  xmloutputversion: '1.05'
  scaninfo:
    type: connect
    protocol: tcp
    numservices: '1'
    services: '443'
  verbose:
    level: '0'
  debugging:
    level: '0'
  hosthint:
    status:
      state: up
      reason: unknown-response
      reason_ttl: '0'
    address:
      addr: 172.67.68.251
      addrtype: ipv4
    hostnames:
      hostname:
        name: galaxy.ansible.com
        type: user
  host:
    starttime: '1666782116'
    endtime: '1666782116'
    status:
      state: up
      reason: syn-ack
      reason_ttl: '0'
    address:
      addr: 172.67.68.251
      addrtype: ipv4
    hostnames:
      hostname:
      - name: galaxy.ansible.com
        type: user
      - name: galaxy.ansible.com
        type: PTR
    ports:
      port:
        protocol: tcp
        portid: '443'
        state:
          state: open
          reason: syn-ack
          reason_ttl: '0'
        service:
          name: https
          method: table
          conf: '3'
    times:
      srtt: '10773'
      rttvar: '8291'
      to: '100000'
  runstats:
    finished:
      time: '1666782116'
      timestr: Wed Oct 26 12:01:56 2022
      summary: Nmap done at Wed Oct 26 12:01:56 2022; 1 IP address (1 host up) scanned
        in 0.10 seconds
      elapsed: '0.10'
      exit: success
    hosts:
      up: '1'
      down: '0'
      total: '1'

So I was wondering if a --xml-flatten parser that first used XSLT to flatten the XML might be something that could be considered?
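
To make the idea concrete, here is a minimal Python sketch of the same attribute-to-element flattening using only the standard library. It is not the attributes2elements.xslt stylesheet used above and not anything jc provides; the function name `flatten_attributes` and the trimmed sample are assumptions for illustration. Note that `xml.etree.ElementTree` drops the DOCTYPE, comments and processing instructions when parsing.

```python
import xml.etree.ElementTree as ET


def flatten_attributes(xml_text: str) -> str:
    """Return XML in which every attribute has become a leading child element.

    Hypothetical helper, sketched for illustration only.
    """
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so the children created below are not revisited.
    for elem in list(root.iter()):
        for index, (name, value) in enumerate(list(elem.attrib.items())):
            child = ET.Element(name)
            child.text = value
            # Keep attribute order and place the new children before existing ones.
            elem.insert(index, child)
        elem.attrib.clear()
    return ET.tostring(root, encoding="unicode")


# Trimmed sample modeled on the nmap output above.
sample = '<nmaprun scanner="nmap"><verbose level="0"/></nmaprun>'
print(flatten_attributes(sample))
```

The result could then be piped through `jc --xml` in the same way as the xsltproc output.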

@chriscroome
Contributor Author

FWIW, this is how I'm using nmap, XPath and JMESPath to check if a port is open with Ansible. This was the thing that prompted me to open this issue; it could be more elegant using jc ;-)

    - name: Check if port 5665 is open from the master node
      ansible.builtin.command: "nmap -oX - -p 5665 {{ inventory_hostname }}"
      check_mode: false
      changed_when: false
      delegate_to: "{{ icinga_master_node }}"
      register: icinga_nmap_xml

    - name: Extract the port 5665 state attributes from the nmap output
      community.general.xml:
        xmlstring: "{{ icinga_nmap_xml.stdout }}"
        xpath: "/nmaprun/host/ports/port[@portid='5665']/state"
        content: attribute
      register: icinga_port_state_results

    - name: Set a fact for the port 5665 state
      ansible.builtin.set_fact:
        icinga_port_state: "{{ icinga_port_state_results | community.general.json_query('matches[].state | [0].state') }}"

    - name: Port 5665 on the agent node needs to be open to accept connections from the master node
      ansible.builtin.assert:
        that:
          - icinga_port_state is defined
          - icinga_port_state == "open"
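
For comparison, the same lookup can be sketched outside Ansible with Python's standard library, which supports the `[@attr='value']` XPath predicate used above. The function name `port_state` and the trimmed sample document are assumptions for illustration, modeled on the nmap output earlier in this issue.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def port_state(nmap_xml: str, portid: str) -> Optional[str]:
    """Return the state attribute (e.g. "open") for a port, or None if the port is absent.

    Hypothetical helper, sketched for illustration only.
    """
    root = ET.fromstring(nmap_xml)
    # Same path as the Ansible task, relative to the <nmaprun> root element.
    state = root.find(f"./host/ports/port[@portid='{portid}']/state")
    return None if state is None else state.get("state")


# Trimmed, made-up sample in the shape of nmap's XML output.
sample = """<nmaprun>
  <host>
    <ports>
      <port protocol="tcp" portid="5665">
        <state state="open" reason="syn-ack" reason_ttl="0"/>
      </port>
    </ports>
  </host>
</nmaprun>"""

print(port_state(sample, "5665"))
```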

@kellyjonbrazil
Owner

Hey there!

Is XML flattening basically taking attributes like this:

<nmaprun scanner="nmap" args="nmap -oX - -p 443 galaxy.ansible.com" start="1666781498" startstr="Wed Oct 26 11:51:38 2022" version="7.92" xmloutputversion="1.05">...</nmaprun>

and turning them into elements, like this?

<nmaprun>
  <scanner>nmap</scanner>
  <args>nmap -oX - -p 443 galaxy.ansible.com</args>
  <start>1666782116</start>
  <startstr>Wed Oct 26 12:01:56 2022</startstr>
  <version>7.92</version>
  <xmloutputversion>1.05</xmloutputversion>
  ...
</nmaprun>

What is the actual difference in the JSON as output by jc other than removing the @ characters from the beginning of the key names?

@chriscroome
Contributor Author

chriscroome commented Oct 26, 2022

Yes, that's basically it. The @ character is special in JMESPath (it refers to the current node), so querying the resulting JSON is more complicated because keys need to be quoted and escaped.
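
As an illustration, a key such as @scanner has to be written as a quoted identifier in JMESPath (nmaprun."@scanner"), and those double quotes then need escaping inside an Ansible YAML string. One possible workaround, sketched here in Python under the assumption that the jc output has already been loaded into a dict, is to rename the keys after parsing. The helper name `replace_key_prefix` is hypothetical, and if a renamed key collides with an existing key the later one silently wins.

```python
from typing import Any


def replace_key_prefix(data: Any, old: str = "@", new: str = "") -> Any:
    """Recursively rename dict keys that start with `old` so they start with `new`.

    Hypothetical helper; on a name collision the later key overwrites the earlier one.
    """
    if isinstance(data, dict):
        return {
            (new + key[len(old):] if key.startswith(old) else key):
                replace_key_prefix(value, old, new)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [replace_key_prefix(item, old, new) for item in data]
    return data


# Trimmed sample in the shape of the jc --xml output above.
parsed = {"nmaprun": {"@scanner": "nmap", "scaninfo": {"@type": "connect"}}}
print(replace_key_prefix(parsed))
print(replace_key_prefix(parsed, new="_"))
```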

@kellyjonbrazil
Owner

Ah, gotcha. There are a few options:

  • Create a new XML parser that doesn't prefix attributes, or prefixes with another character (this is configurable in the library I'm using)
  • Right now the -r option in jc is not being used for the XML parser, so I could make it so that the above behavior takes place with jc --xml -r

We just need to be aware that by not having any prefix character at all there could be collisions between attribute and element names, so maybe we should just change it to something that doesn't conflict with JMESPath, et al. Something like _ maybe? I have used that as a prefix for other parsers.

There is also a # prefix used for item text when an item has attributes and text. Is that prefix ok?
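
To show what a configurable prefix means in practice, here is a small standard-library sketch of an XML-to-dict converter with an `attr_prefix` parameter and a `text_key` for element text. It is not the library jc uses and it ignores namespaces and comments; the names `element_to_dict`, `attr_prefix` and `text_key` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Any


def element_to_dict(elem: ET.Element, attr_prefix: str = "@", text_key: str = "#text") -> Any:
    """Convert an element to nested dicts, prefixing attribute names with attr_prefix.

    Hypothetical sketch: repeated names become lists, text next to attributes or
    children is stored under text_key.
    """
    node = {attr_prefix + name: value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child, attr_prefix, text_key)
        if child.tag in node:
            existing = node[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                node[child.tag] = [existing, value]
        else:
            node[child.tag] = value
    # Element text plus any text that follows child elements.
    text = "".join([elem.text or ""] + [child.tail or "" for child in elem]).strip()
    if not node:
        return text or None
    if text:
        node[text_key] = text
    return node


hostname = ET.fromstring('<hostname name="galaxy.ansible.com" type="user"/>')
print(element_to_dict(hostname, attr_prefix="_"))

# With an empty prefix, an attribute and a child element of the same name are merged.
host = ET.fromstring('<host state="up"><state>open</state>some text</host>')
print(element_to_dict(host, attr_prefix="_"))
print(element_to_dict(host, attr_prefix=""))
```

With `_` the attribute and the element stay under separate keys; with an empty prefix they end up in one list, which is the collision risk described above.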

@chriscroome
Contributor Author

chriscroome commented Oct 26, 2022

I don't know, to be honest. I think I'd expect the -r option to result in the output that is currently generated (with the @-prefixed attribute names) and the default to be to strip the prefix, since that gives the nicer output. However, that would seriously break backwards compatibility, so it's probably not a great suggestion.

I'm not sure about adding prefixes, _ might be best 🤷‍♂️

I wonder how the XSLT stylesheet I've been using deals with attribute and element name collisions… I guess I should test it…

@kellyjonbrazil
Owner

Let me know what happens with duplicate attribute/element names. Yeah, does make more sense the other way, but that would break existing scripts, so don't want to do that.

kellyjonbrazil added the enhancement label Oct 26, 2022

@chriscroome
Contributor Author

Having an element named the same as an attribute doesn't seem to be an issue for the XSLT stylesheet. For example, this XML:

<?xml version="1.0" encoding="UTF-8"?>
<nmaprun>
  <host host="galaxy.ansible.com">
    <status state="up" reason="syn-ack" reason_ttl="0"/>
    <address addr="104.26.0.234" addrtype="ipv4"/>
    <hostnames>
      <hostname name="galaxy.ansible.com" type="user"/>
    </hostnames>
  </host>
</nmaprun>

When transformed using this XSLT stylesheet like this:

xsltproc attributes2elements.xslt galaxy.ansible.com.xml | xmllint --pretty 1 -

Results in this XML:

<?xml version="1.0"?>
<nmaprun>
  <host>
    <host>galaxy.ansible.com</host>
    <status>
      <state>up</state>
      <reason>syn-ack</reason>
      <reason_ttl>0</reason_ttl>
    </status>
    <address>
      <addr>104.26.0.234</addr>
      <addrtype>ipv4</addrtype>
    </address>
    <hostnames>
      <hostname>
        <name>galaxy.ansible.com</name>
        <type>user</type>
      </hostname>
    </hostnames>
  </host>
</nmaprun>

And transforming this into YAML using:

xsltproc attributes2elements.xslt galaxy.ansible.com.xml | xmllint --pretty 1 - | jc --xml -py

Looks like this:

---
nmaprun:
  host:
    host: galaxy.ansible.com
    status:
      state: up
      reason: syn-ack
      reason_ttl: '0'
    address:
      addr: 104.26.0.234
      addrtype: ipv4
    hostnames:
      hostname:
        name: galaxy.ansible.com
        type: user

Compared with the original XML transformed into YAML using:

cat galaxy.ansible.com.xml | xmllint --pretty 1 - | jc --xml -py

Results in this:

---
nmaprun:
  host:
    '@host': galaxy.ansible.com
    status:
      '@state': up
      '@reason': syn-ack
      '@reason_ttl': '0'
    address:
      '@addr': 104.26.0.234
      '@addrtype': ipv4
    hostnames:
      hostname:
        '@name': galaxy.ansible.com
        '@type': user

So there doesn't appear to be a problem with elements and attributes having the same name, but perhaps I misunderstood the potential problem?

@kellyjonbrazil
Owner

How about something like this?

<?xml version="1.0" encoding="UTF-8"?>
<nmaprun>
  <host collision="attribute">
    <collision>element</collision>
    this is host text
  </host>
</nmaprun>

For example:

% echo '<?xml version="1.0" encoding="UTF-8"?>
<nmaprun>
  <host collision="attribute">
    <collision>element</collision>
    this is host text
  </host>
</nmaprun>' | jc --xml -p
{
  "nmaprun": {
    "host": {
      "@collision": "attribute",
      "collision": "element",
      "#text": "this is host text"
    }
  }
}

@chriscroome
Contributor Author

Ah, yes, I see, this:

cat collision.xml | jc --xml -py
---
nmaprun:
  host:
    '@collision': attribute
    collision: element
    '#text': this is host text

Compared with:

xsltproc attributes2elements.xslt collision.xml | xmllint --pretty 1 - | jc --xml -py
---
nmaprun:
  host:
    collision:
    - attribute
    - element
    '#text': this is host text

Dunno… 🤔
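
A quick way to tell whether a given document would be affected by this merging is to look for elements that carry an attribute with the same name as one of their child elements. The sketch below does that with Python's standard library; the function name `find_collisions` is an assumption for illustration, and the sample is the collision document from the previous comment written on one line.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def find_collisions(xml_text: str) -> List[Tuple[str, str]]:
    """Return (element tag, shared name) pairs where an attribute and a child element share a name.

    Hypothetical helper, sketched for illustration only.
    """
    collisions = []
    for elem in ET.fromstring(xml_text).iter():
        child_tags = {child.tag for child in elem}
        for name in elem.attrib:
            if name in child_tags:
                collisions.append((elem.tag, name))
    return collisions


sample = (
    '<nmaprun><host collision="attribute">'
    "<collision>element</collision>this is host text</host></nmaprun>"
)
print(find_collisions(sample))
```

An empty list means flattening without a prefix would not merge any attribute with an element in that document.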

@kellyjonbrazil
Owner

Yeah, I think I'll make the -r option use _ as a prefix for attributes. Do you see any issue with the # prefix for element text?

@chriscroome
Contributor Author

No but that doesn't mean much!

@kellyjonbrazil
Owner

kellyjonbrazil commented Nov 2, 2022

I have updated the xml parser in the dev branch with the -r behavior.

https://github.com/kellyjonbrazil/jc/blob/dev/jc/parsers/xml.py

@kellyjonbrazil
Owner

Released in jc v1.22.2.
