
# Purpose of this notebook

If you're new to XML, or XML in python, this gives a quick tour of what XML looks like and does, 
and how to manipulate it in python.

(or, more realistically, this is a file to copy-paste some extraction code from)

You should only need this if the given functions to extract metadata or text do not cover your needs,
and you must dive deep into the structured data.

XML is a specific style of structured data.

It has elements that look like `<this>`, and because you can see it as a [tree structure](https://en.wikipedia.org/wiki/Tree_structure) it is frequently shown with some indentation indicating how deep each element is in the tree, like:

```
<aanhef>
  <preambule>
    <al>De raad van de gemeente Oosterhout;</al>
    <al>gezien het voorstel van burgemeester en wethouder van 21 november 2014;</al>
    <al>gelet op<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>,
       a<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening
       ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en 
       onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen</extref>, 
       de bepalingen van de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Algemene%20wet%20bestuursrecht">Algemene wet bestuursrecht</extref>en de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Gemeentewet">Gemeentewet</extref>;</al>
    <al>overwegende dat het noodzakelijk is het opleggen van een tegenprestatie bij verordening te regelen;</al>
    <al>Besluit</al>
    <al>vast te stellen: de “Verordening tegenprestatie Sociale Zekerheid 2015, gemeente Oosterhout”.</al>
  </preambule>
</aanhef>
```

It is usually [serialized](https://en.wikipedia.org/wiki/Serialization) into a file that is _technically_ a binary bytestream but practically fairly human-readable, and somewhat human-editable.
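
To make the bytes-versus-text point concrete, here is a minimal sketch. It uses Python's standard-library `xml.etree.ElementTree` purely to stay self-contained (wetsuite itself builds on the lxml-based etree, which behaves similarly here but is not what this sketch exercises).

```python
import xml.etree.ElementTree as ET

# Parsing accepts a bytestring (as read from a file) as well as text
root = ET.fromstring(b'<al>Besluit</al>')

as_bytes = ET.tostring(root)                      # serializing gives bytes by default
as_text = ET.tostring(root, encoding='unicode')   # ...or a str if you ask for one

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
```

In practice you rarely need to care about this until non-ASCII characters and encodings get involved.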

### elementtree's view on XML

Take some example data: 
```
<root>
  some text 
  <bold>with more text</bold>
  end text
</root>
```

There are actually a few ways to think of XML data in object form (and more beyond that, see e.g. SAX).


The [DOM](https://en.wikipedia.org/wiki/Document_Object_Model), as used e.g. by scripting in browsers, 
would parse that into a structure like (note: we're pretending there are no spaces or newlines in there):
* `root` element
  * `text node` containing `'some text'`
  * `bold` element
    * `text node` containing `'with more text'`
  * `text node` containing `'end text'`

Thinking of that as five different objects is well defined, but also cumbersome.
Even browsers rarely show it quite this way.
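
If you want to see that five-object view for yourself, the following sketch walks the example with the standard library's `xml.dom.minidom`. That module is used here only for illustration; the `describe` helper is made up for this example, and the rest of this notebook does not use the DOM.

```python
from xml.dom import minidom

# Same example as above, written without the whitespace we are pretending is not there
doc = minidom.parseString('<root>some text<bold>with more text</bold>end text</root>')

def describe(node, depth=0):
    """Return one indented line per DOM node under (and including) node."""
    indent = '  ' * depth
    if node.nodeType == node.TEXT_NODE:
        return [indent + 'text node: %r' % node.data]
    lines = [indent + 'element: %s' % node.tagName]
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines

dom_lines = describe(doc.documentElement)
print('\n'.join(dom_lines))
```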


Python's [elementtree](https://docs.python.org/3/library/xml.etree.elementtree.html) is similar, but instead of making each element and text fragment a separate object, it tries to center around elements (hence the name) and ends up tacking text onto the closest element. 
Of note are text nodes that come _after_ an element at the same level: those are stored as that element's `tail`.
The above becomes:
* `root` element with `text='some text'` (and `tail=None`)
  * `bold` element with `text='with more text'` and `tail='end text'`

Whether this makes your life 
* easier, because there are fewer objects to inspect and less typing (and it may make no real difference when you're handling structured data like config files, where tail is typically empty), or
* harder, because it's _yet another_ way of looking at things and is sometimes no better, just another mental model

...depends a little on your data, and how fast you pick up the etree way of thinking.
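
The `text` / `tail` split described above can be checked directly. This is a small sketch using the standard-library `xml.etree.ElementTree` (an assumption made to keep it self-contained; the lxml-based etree that wetsuite uses exposes the same two attributes).

```python
import xml.etree.ElementTree as ET

# Same example, again without the extra whitespace
root = ET.fromstring('<root>some text<bold>with more text</bold>end text</root>')
bold = root.find('bold')

print('root.text = %r' % root.text)   # text before the first child
print('root.tail = %r' % root.tail)   # nothing follows the root element
print('bold.text = %r' % bold.text)   # text inside <bold>
print('bold.tail = %r' % bold.tail)   # text after </bold>, still inside <root>
```

So 'end text' belongs to `root` in the document, but you reach it through `bold.tail`.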

## Some examples of how to handle xml

In [12]:
import wetsuite.helpers.net
import wetsuite.helpers.etree

test_xml = '''<aanhef>
  <preambule>
    <al>De raad van de gemeente Oosterhout;</al>
    <al>gezien het voorstel van burgemeester en wethouder van 21 november 2014;</al>
    <al>gelet op<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>, a<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen</extref>, de bepalingen van de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Algemene%20wet%20bestuursrecht">Algemene wet bestuursrecht</extref>en de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Gemeentewet">Gemeentewet</extref>;</al>
    <al>overwegende dat het noodzakelijk is het opleggen van een tegenprestatie bij verordening te regelen;</al>
    <al>Besluit</al>
    <al>vast te stellen: de “Verordening tegenprestatie Sociale Zekerheid 2015, gemeente Oosterhout”.</al>
  </preambule>
</aanhef>'''


## Finding and navigating elements

In [13]:
aanhef = wetsuite.helpers.etree.fromstring( test_xml ) # parse that text into etree object tree. 
# We might call it tree, root, or e.g. its root node, in this case aanhef.
#   to name it by what you _expect_ it to contain is not best practice, but here it makes the examples a little more readable

In [14]:
# pick a specific child/descendant element out of the tree
print( aanhef.find('preambule') ) # find the first matching node, relative to this one

# in this case you are probably interested in whatever is within preambule
# You could do 
print( aanhef.find('preambule/al') )
# yet it would only find the first.



<Element preambule at 0x7fb19af3c980>
<Element al at 0x7fb19af3c780>


In [16]:
# To look for multiple things, consider

preambule = aanhef.find('preambule')
print( list(preambule) )   # which would print all direct child elements, _whatever_ those are
                           # (getchildren() does the same in lxml, but is deprecated there and removed from the standard library's etree)

# if we wanted to be more selective, we might try some filtering: 
print( preambule.findall( 'al' ) )
# (here the result happens to be the same, but if preambule contained non-al nodes, they would not show up)

[<Element al at 0x7fb19af2f480>, <Element al at 0x7fb174915300>, <Element al at 0x7fb174915700>, <Element al at 0x7fb174915640>, <Element al at 0x7fb174915600>, <Element al at 0x7fb174915380>]
[<Element al at 0x7fb174915f40>, <Element al at 0x7fb174915280>, <Element al at 0x7fb174915980>, <Element al at 0x7fb174915300>, <Element al at 0x7fb174915940>, <Element al at 0x7fb174915700>]
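
One practical detail when copy-pasting `find()` code: it returns `None` when nothing matches, so chaining `.text` onto it fails on documents that lack the element. A minimal sketch, using the standard-library etree for self-containedness, with `considerans` as a made-up element name that is simply absent here:

```python
import xml.etree.ElementTree as ET

preambule = ET.fromstring('<preambule><al>first</al><al>second</al></preambule>')

missing = preambule.find('considerans')         # no such child: None, not an exception
print(missing)
print(preambule.findall('considerans'))         # no matches: an empty list

# findtext() combines find() and .text, and takes a default for the no-match case
print(preambule.findtext('al'))
print(repr(preambule.findtext('considerans', default='')))

first_al = preambule.find('al')
if first_al is not None:                        # compare to None explicitly
    print(first_al.text)
```

The explicit `is not None` is deliberate: an element without children has historically tested as false, so `if first_al:` can skip elements that do exist.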


In [18]:
# XPath is a more succinct way of specifying more varied types of query.
aanhef.xpath('//preambule//extref')  # xpath() is specific to the lxml etree interface, not the generic one, but wetsuite specifically uses the lxml-based etree

# This is more powerful - here conveniently expressing 'extref elements within preambule elements, no matter the elements in between'
#    but arguably only if you already understand XPath, or end up preferring it over writing it all out in longer form


[<Element extref at 0x7fb19af0e9c0>,
 <Element extref at 0x7fb17490f580>,
 <Element extref at 0x7fb174915d40>,
 <Element extref at 0x7fb174915a80>,
 <Element extref at 0x7fb174915e00>]
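
If you ever end up with the generic (standard-library) etree instead, there is no `xpath()`, but `find()` / `findall()` accept a limited XPath-like subset that covers queries like the one above. A sketch under that assumption, on a cut-down fragment with a made-up `example.com` URL:

```python
import xml.etree.ElementTree as ET

aanhef = ET.fromstring(
    '<aanhef><preambule>'
    '<al>gelet op<extref doc="http://example.com/a">artikel 8a</extref>;</al>'
    '<al>en<extref>artikel 35</extref>.</al>'
    '</preambule></aanhef>'
)

# extref elements at any depth under the preambule child
all_extrefs = aanhef.findall('preambule//extref')

# extref elements anywhere below this node that have a doc attribute
with_doc = aanhef.findall('.//extref[@doc]')

print(len(all_extrefs), len(with_doc))
print([e.text for e in with_doc])
```

Anything fancier (functions, axes, selecting attribute values directly) is where lxml's full XPath support earns its keep.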

## Getting out attributes

In [20]:
# there are a few different ways to get out attributes. Less typing:

for extref in aanhef.xpath('preambule//extref'):
    # print the fragment we're extracting from as serialized XML, for reference
    print( wetsuite.helpers.etree.debug_pretty(extref) )
    #print( extref.attrib ) # _all_ attributes as dict, though in these cases there is only one
    print( 'DOC  = %r'%extref.get('doc')) # if there was no such attribute, it would return None
    print( 'TEXT = %r'%extref.text )
    print()

<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>, a
DOC  = 'http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a'
TEXT = 'artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet'

<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;
DOC  = 'http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35'
TEXT = 'rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers'

<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zel
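
For reference, the different ways of reading attributes side by side. This sketch uses the standard-library etree and a made-up `example.com` URL; `href` is just an attribute name that does not exist on this element.

```python
import xml.etree.ElementTree as ET

extref = ET.fromstring('<extref doc="http://example.com/a">artikel 8a</extref>')

doc_value = extref.get('doc')             # the value, as a string
no_value = extref.get('href')             # absent attribute: None
fallback = extref.get('href', 'n/a')      # absent attribute, with an explicit default
as_dict = dict(extref.attrib)             # all attributes at once

print(doc_value, no_value, fallback, as_dict)

# Indexing .attrib directly raises KeyError for absent attributes, like a dict would
try:
    extref.attrib['href']
except KeyError:
    print('no href attribute')
```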

## Getting out text

### Direct text content, or subtree text?

In [22]:
alineas = aanhef.xpath('//al')
alineas[2].text

'gelet op'

In [23]:
# That seems like elementtree being a smartass - 
#   that sure is the text within the element - up to the first child element, which isn't what we wanted.

# But what did we want?  If it was:
#   "gelet op artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet, artikel 35, lid 1, aanhef en onderdeel e, 
#    van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers; artikel 35, lid 1, 
#    aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen, 
#    de bepalingen van de Algemene wet bestuursrecht en de Gemeentewet;"
# ...then we instead wanted to ask for 'all text (and tail) values in the tree under a particular starting element'. 

# That's a little more work to do, and we have a function to help
wetsuite.helpers.etree.all_text_fragments( alineas[2] )
# It pieces together each .text and .tail, so gives a list like:

['gelet op',
 '',
 'artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet',
 ', a',
 'rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers',
 ';',
 'artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen',
 ', de bepalingen van de',
 'Algemene wet bestuursrecht',
 'en de',
 'Gemeentewet',
 ';']
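
To show what "piecing together each .text and .tail" amounts to, here is a rough sketch of such a walk. It is not wetsuite's implementation (which has more options, and whose exact output may differ); `collect_text_fragments` is a hypothetical name, and the standard-library etree is used to keep it self-contained.

```python
import xml.etree.ElementTree as ET

def collect_text_fragments(element):
    """Hypothetical helper: gather .text and .tail values under element, in document order."""
    fragments = []
    if element.text is not None:
        fragments.append(element.text)
    for child in element:
        fragments.extend(collect_text_fragments(child))
        if child.tail is not None:          # the tail sits inside `element`, after `child`
            fragments.append(child.tail)
    return fragments

al = ET.fromstring(
    '<al>gelet op<extref doc="x">artikel 8a</extref>, a<extref doc="y">rtikel 35</extref>;</al>'
)
fragments = collect_text_fragments(al)
print(fragments)

# The element's own itertext() method walks the same text in the same order
print(''.join(al.itertext()))
```

Note that the tail of the starting element itself is left out, since that text lies outside the element we asked about.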

In [17]:
# Why not return it as a single string?   We could do:
''.join( wetsuite.helpers.etree.all_text_fragments( alineas[2] ) )

'gelet opartikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet, artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers;artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen, de bepalingen van deAlgemene wet bestuursrechten deGemeentewet;'

In [18]:
# Note that 'opartikel'. Okay, maybe that should have been
' '.join( wetsuite.helpers.etree.all_text_fragments( alineas[2] ) )
# or, basically equivalent, and a little clearer in intent:
#wetsuite.helpers.etree.all_text_fragments( alineas[2], ignore_empty=True, join=' ' )

# ...yet now we have an "a rtikel". 

'gelet op  artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet , a rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers ; artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen , de bepalingen van de Algemene wet bestuursrecht en de Gemeentewet ;'

In [31]:
# That "a rtikel" is easily argued to be a markup mistake (extrefs are used without surrounding spaces, so they _should_ be treated as splitting words)

# ...so let's instead consider BWBR0044578
bwb_test = wetsuite.helpers.net.download('https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0044578/2021-01-01_0/xml/BWBR0044578_2021-01-01_0.xml')
tree = wetsuite.helpers.etree.fromstring( bwb_test )
# The intitule is 
#   Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de 
#   invoering van een CO2-heffing voor de industrie (Wet CO<inf>2</inf> -heffing industrie)
# so 
#   tree.find('wetgeving/intitule').text  
# would stop after "(Wet CO". 


# So do we want all_text_fragments again, with spaces as we just figured was a good idea?
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join=' ') )

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO  2 -heffing industrie) 2020 544 23-12-2020 16-12-2020 35575 2020 544 23-12-2020 16-12-2020 35575 01-01-2021


In [32]:
# Also, those numbers are there because there's no easy separation between tags that are part of the title and tags that are part of the metadata under the same element.
#   There's a stop_at argument to help with that part.

# Now it says "CO  2" where we probably wanted "CO2"
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join='') )
#   For just titles you could guess we could probably get away with not adding spaces
#   but in reality you should check whether that's valid to do


# This actually highlights a subtler issue:
# in such free-form documents it is uncertain which elements act as word separators,
# and which do not - it's largely down to semantics, and possibly also to usage
#  * not for subscript
#  * _probably not_ for bold or italics
#  * _probably_ for links
# In actual HTML the conventions are at least stronger, but in this document-pretending-to-be-data,
# you might want to be able to control that, and we might want to give you that control.

# (...and we have to ignore human mistakes like that second extref above starting a character late)

# TODO: create a specialized text extractor for CVDR, BWB that includes these semantics

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)202054423-12-202016-12-202035575202054423-12-202016-12-20203557501-01-2021
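
As a thought experiment on those per-element semantics, the sketch below joins text with a space around most elements but not around ones listed as inline. Everything in it is an assumption for illustration: the function names, the choice of `inf` and `sup` as non-separating tags, and the sample fragment (made up by combining the two cases seen above). It is not a wetsuite function.

```python
import re
import xml.etree.ElementTree as ET

def text_with_separators(element, inline_tags=('inf', 'sup')):
    """Hypothetical helper: join text, putting spaces around elements unless listed as inline."""
    parts = [element.text or '']
    for child in element:
        sep = '' if child.tag in inline_tags else ' '
        parts.append(sep + text_with_separators(child, inline_tags) + sep)
        parts.append(child.tail or '')
    return ''.join(parts)

def normalized_text(element, inline_tags=('inf', 'sup')):
    """Collapse the runs of whitespace that the extra separators may introduce."""
    return re.sub(r'\s+', ' ', text_with_separators(element, inline_tags)).strip()

sample = ET.fromstring(
    '<intitule>Wet CO<inf>2</inf>-heffing, gelet op'
    '<extref doc="x">artikel 8a</extref>en verder</intitule>'
)
result = normalized_text(sample)
print(result)
```

Which tags belong in `inline_tags` is exactly the judgment call discussed above, and would differ per document type.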


In [35]:
# Why all the numbers at the end? 
# 
#  Well, here the XML is less structured than it perhaps should be - the actual title within the intitule isn't well separated from the metadata tag
#  so in this case it seems almost impossible to ask for just that text

# Such issues are _generally_ rare, but in this particular case you have to get more creative.

# This should probably mean all_text_fragments's ignore_under should come to mean 'ignore subtree when you see an element called this',
#   but until then you need a workaround.

# One way is to use the stop_at argument, which stops walking the tree at the first mention of a node with a particular name
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join='', stop_at=['meta-data']) )


# A somewhat more selective hack would be to remove the `meta-data` element
#   (based on knowing the structure from the XML schema, and probably working on a copy of the tree)
tree = wetsuite.helpers.etree.fromstring( bwb_test ) # re-parse that text, so we work on a fresh etree object tree
intitule = tree.find('wetgeving/intitule')
intitule.remove( intitule.find('meta-data') )
wetsuite.helpers.etree.all_text_fragments( intitule, join='')

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)


'Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)'
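
The "work on a copy" advice can be followed without re-parsing, by deep-copying the subtree first. A minimal sketch with the standard-library etree; the contents of `meta-data` here are made up and much simpler than the real BWB structure.

```python
import copy
import xml.etree.ElementTree as ET

intitule = ET.fromstring(
    '<intitule>Wet CO<inf>2</inf>-heffing industrie'
    '<meta-data><jaar>2020</jaar></meta-data></intitule>'
)

work = copy.deepcopy(intitule)        # leave the original tree untouched
meta = work.find('meta-data')
if meta is not None:                  # not every document necessarily has one
    work.remove(meta)

title = ''.join(work.itertext())
untouched = ''.join(intitule.itertext())
print(title)
print(untouched)
```

One caveat: removing an element also discards its `tail`, so this only works cleanly when no wanted text directly follows the removed element.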

## On (ignoring) namespaces

Consider needing to combine data from different sources in one XML document.
- If done free-form, there is potential ambiguity about which standard each element (or even attribute) comes from,
- and if they happen to pick the same element (or attribute) name, they may clash and overwrite each other.

Namespaces can be thought of as a little tag on each element (and attribute) that assigns it to a well-defined source.


Consider the following XML (a simplified version of the document from [here](view-source:https://standaarden.overheid.nl/owms/terms/Leiden_(gemeente).rdf)):
```
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <overheid:CBSCode>0546</overheid:CBSCode>
    <overheid:overlapsWith rdf:resource="http://standaarden.overheid.nl/owms/terms/Hoogheemraadschap_van_Rijnland"/>
    <overheid:overlapsWith rdf:resource="http://standaarden.overheid.nl/owms/terms/Zuid-Holland"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Commissie_Regionaal_Overleg_Luchthaven_Schiphol"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Milieudienst_West-Holland"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Regiokorps_Politie_Hollands_Midden"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/SZHR"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Servicepunt71"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Veiligheidsregio_Hollands_Midden"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/bsgr"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/odwhol"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/rdoghm"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/sohlrl"/>
    <rdf:type rdf:resource="http://purl.org/dc/terms/Agent"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Agent"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Gemeente"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Organisatie"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Overheidsorganisatie"/>
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
    <rdfs:label xml:lang="nl">Leiden</rdfs:label>
    <skos:prefLabel xml:lang="nl">Leiden</skos:prefLabel>
  </rdf:Description>
</rdf:RDF>
```


**technicalities (feel free to skip)**

It turns out that (due to the way namespaces were bolted onto XML _after_ initial design)
a namespace declaration like `xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"`
to humans seems to mean 'can use `overheid:` as a prefix on nodes'.

...**but** 
- that prefix isn't technically part of the document model, 
- the XML specs do not even guarantee that the same prefix will be used if the document were written out again,
- and the only thing that actually identifies the namespace is `http://standaarden.overheid.nl/owms/terms/`

So when etree (and many other parsers) then sees 
	`<overheid:CBSCode>0546</overheid:CBSCode>`
it will actually think of that as an element with name 
`{http://standaarden.overheid.nl/owms/terms/}CBSCode`
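
You can see those expanded names by parsing a cut-down version of the document. The sketch uses the standard-library etree (the lxml-based one names elements the same way) and deliberately maps the namespace to a different prefix, `o`, to show that the prefix itself carries no meaning.

```python
import xml.etree.ElementTree as ET

rdf_xml = '''<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <overheid:CBSCode>0546</overheid:CBSCode>
  </rdf:Description>
</rdf:RDF>'''

root = ET.fromstring(rdf_xml)
print(root.tag)        # the prefix is gone, the namespace URI is part of the name

# find() takes a prefix-to-URI mapping of our own choosing, only the URIs have to match
ns = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'o':   'http://standaarden.overheid.nl/owms/terms/',
}
code = root.find('rdf:Description/o:CBSCode', ns)
print(code.tag, code.text)
```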



**Correct namespaces are important when producing XML**,
in that without them the document will not conform to a standard that says you must use them.

At the same time, **matching namespaces is annoying**, and 
- when you are only _consuming_ XML into something else,
- when you are doing so manually rather than with something like XSLT,
- when you know that the schemas being combined in your documents do not actually clash (which is 'usually'),

then you can consider removing them. 

Say, if you parsed the above example, pulled it through `wetsuite.helpers.etree.strip_namespace()`, and printed it out again, you would get:

```
<RDF>
  <Description about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <CBSCode>0546</CBSCode>
    <overlapsWith resource="http://standaarden.overheid.nl/owms/terms/Hoogheemraadschap_van_Rijnland"/>
    <overlapsWith resource="http://standaarden.overheid.nl/owms/terms/Zuid-Holland"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Commissie_Regionaal_Overleg_Luchthaven_Schiphol"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Milieudienst_West-Holland"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Regiokorps_Politie_Hollands_Midden"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/SZHR"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Servicepunt71"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Veiligheidsregio_Hollands_Midden"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/bsgr"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/odwhol"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/rdoghm"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/sohlrl"/>
    <type resource="http://purl.org/dc/terms/Agent"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Agent"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Gemeente"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Organisatie"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Overheidsorganisatie"/>
    <type resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
    <type resource="http://www.w3.org/2002/07/owl#Thing"/>
    <label lang="nl">Leiden</label>
    <prefLabel lang="nl">Leiden</prefLabel>
  </Description>
</RDF>
```

Is this cheating? 

Yes.

Does it make your life simpler, e.g. let you write `etree.find('Description/CBSCode')` 
instead of `etree.find('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description/{http://standaarden.overheid.nl/owms/terms/}CBSCode')`? 

Also yes.
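
For the curious, stripping namespaces comes down to cutting the `{...}` part off every element and attribute name. The sketch below is a rough illustration of that idea using the standard-library etree; it is not the code behind `wetsuite.helpers.etree.strip_namespace()`, and `strip_namespaces_sketch` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def strip_namespaces_sketch(root):
    """Hypothetical helper: remove the {uri} part from all tag and attribute names, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith('{'):
            element.tag = element.tag.split('}', 1)[1]
        for name in list(element.attrib):          # copy the keys, since we change the dict
            if name.startswith('{'):
                element.attrib[name.split('}', 1)[1]] = element.attrib.pop(name)
    return root

rdf_xml = '''<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <overheid:CBSCode>0546</overheid:CBSCode>
    <rdfs:label xml:lang="nl">Leiden</rdfs:label>
  </rdf:Description>
</rdf:RDF>'''

stripped = strip_namespaces_sketch(ET.fromstring(rdf_xml))
print(stripped.tag)
print(stripped.find('Description/CBSCode').text)
print(stripped.find('Description/label').get('lang'))
```

If two namespaces did use the same local name, this is the point where they would silently merge, which is why the "schemas do not actually clash" condition matters.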