<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/intermediate/technical_xml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Purpose of this notebook

Explain how XML contains data, and how to get that data out, for those who have not seen or seriously used it before.

<!-- -->

You probably do not need this, until you know that you do.

We provide some readymade datasets (TODO:link) that extracted useful information from XML.
If those cover your needs, then you don't really need to know about data in XML form, or even what XML means.

<!-- -->

...yet where those are _not_ enough,
or you can't convince us to make something better,
or you have a specific question that means you must dive deep into the structured data,
then it may help to read this tour of what XML in general looks like, and how to manipulate it in Python.

It then moves on to some example uses, e.g. those various data collection notebooks.

### XML as a data format
XML is a specific style of writing down structured data, one that focuses on text and nesting structures in other data structures - a [tree structure](https://en.wikipedia.org/wiki/Tree_structure).

It has elements that look like `<this>when it has content</this>`, and like `<this></this>` when empty (though there's a shorthand like `<this />`).

It also allows putting those elements inside other elements, which allows storing some more structured types of data.

It doesn't specialize in anything, so everything is similarly nice, or depending on your view, similarly awkward,
and this confusion of different purposes makes it impossible to do a quick and concise introduction to XML.

Let's still give it a shot:
XML is one format that lends itself to data that is more complex than just a list of something.

...both more data-like structures such as [value lists](https://standaarden.overheid.nl/owms/terms/Dienst.xml) with contents such as:

```
<cv name="overheid:Dienst">
   <value>
      <prefLabel>Netherlands Space Office</prefLabel>
      <resourceIdentifier>http://standaarden.overheid.nl/owms/terms/nso</resourceIdentifier>
      <startDate>2009-01-01</startDate>
   </value>
   <value>
      <prefLabel>Agentschap CBI</prefLabel>
      <resourceIdentifier>http://standaarden.overheid.nl/owms/terms/Agentschap_CBI</resourceIdentifier>
   </value>
   <value>
      <prefLabel>Agentschap NL</prefLabel>
      <resourceIdentifier>http://standaarden.overheid.nl/owms/terms/Agentschap_NL</resourceIdentifier>
      <startDate>2010-01-01</startDate>
      <endDate>2013-12-31</endDate>
   </value>
</cv>
```

In this particular case there is actually very little structure, and there are ways to push that into simpler formats.
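
As a rough illustration of pushing such a value list into a simpler form, here is a minimal sketch that turns each `<value>` into a plain dict. It uses the standard library's `xml.etree.ElementTree` rather than the wetsuite helper (an assumption made only to keep the example self-contained), and a shortened copy of the fragment above.

```python
import xml.etree.ElementTree as ET

# shortened copy of the value list shown above
value_list_xml = '''<cv name="overheid:Dienst">
   <value>
      <prefLabel>Netherlands Space Office</prefLabel>
      <resourceIdentifier>http://standaarden.overheid.nl/owms/terms/nso</resourceIdentifier>
      <startDate>2009-01-01</startDate>
   </value>
   <value>
      <prefLabel>Agentschap CBI</prefLabel>
      <resourceIdentifier>http://standaarden.overheid.nl/owms/terms/Agentschap_CBI</resourceIdentifier>
   </value>
</cv>'''

cv = ET.fromstring(value_list_xml)

rows = []
for value in cv.findall('value'):
    # each child element of <value> becomes one key in a plain dict
    rows.append({child.tag: child.text for child in value})

for row in rows:
    print(row)
```

Note that the second entry has no `startDate` key at all, which is something a flat table format would force you to make a decision about.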

But a lot of uses store more document-like things,
which e.g. lets documents attach non-textual information to just parts of a text.

For example, consider the following (slightly reformatted) fragment from [CVDR352889](https://lokaleregelgeving.overheid.nl/CVDR352889) in XML form:
```
<aanhef>
  <preambule>
    <al>De raad van de gemeente Oosterhout;</al>
    <al>gezien het voorstel van burgemeester en wethouder van 21 november 2014;</al>
    <al>gelet op<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>,
       a<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening
       ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en
       onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen</extref>,
       de bepalingen van de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Algemene%20wet%20bestuursrecht">Algemene wet bestuursrecht</extref>en de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Gemeentewet">Gemeentewet</extref>;</al>
    <al>overwegende dat het noodzakelijk is het opleggen van een tegenprestatie bij verordening te regelen;</al>
    <al>Besluit</al>
    <al>vast te stellen: de “Verordening tegenprestatie Sociale Zekerheid 2015, gemeente Oosterhout”.</al>
  </preambule>
</aanhef>
```

XML is often [stored into a file](https://en.wikipedia.org/wiki/Serialization),
which is _technically_ a binary format, but practically looks like a text file that 
- is often human-readable, at least in the "I get roughly what it's doing and storing" sense
  - depending on what document structure you follow
  - (may not be indented helpfully like the example above - and reindenting like that is technically an alteration of the document).
- is usually editable in a text editor
  - while you may not enjoy the process tremendously, and there are some edge cases that may bite you, _simpler_ cases are usually fine.

### For context / elementtree's view on XML

Say we want to parse something into a form we can manipulate / extract from directly.
Say we have some example data:
```
<root>
  some text
  <bold>with more text</bold>
  end text
</root>
```

There are actually a few ways to think of XML data in object form (and more than the two mentioned below).


The [DOM](https://en.wikipedia.org/wiki/Document_Object_Model), 
as used e.g. by scripting in browsers, would parse the above into a structure much like (note: for brevity we're pretending there are no spaces or newlines in there):
* `root` element
  * a `text node` containing `'some text'`
  * an element named `bold`, containing
    * a `text node` containing `'with more text'`
  * a `text node` containing `'end text'`

Thinking of that as five different objects is very well defined - yet also cumbersome to deal with.


When e.g. web browsers show you XML data, even they try to display it in a visually simpler form, 
and Python's [elementtree](https://docs.python.org/3/library/xml.etree.elementtree.html) takes a form much like them:

Instead of making each element and each text fragment a separate object, 
it tries to center around elements (hence the name) and ends up tacking text onto the closest element.

Of particular note are text nodes _after_ elements at the same level.
The above becomes:
* `root` element with `text='some text'` (and `tail=None`)
  * `bold` element with `text='with more text'` and `tail='end text'`

Whether this makes your life
* easier, because it's fewer objects to inspect, with less typing
* no different, e.g. because you're handling key-value lists like config files (where _both_ views are simplified)
* harder, because it's _yet another_ way of looking at things
  - and sometimes no better (in particular the position of this 'end text'/tail can be awkward)
  - and sometimes arguably worse - e.g. 'how do I just get the text out?' took a bit of thinking around the DOM, and takes even more thinking with etree just because it takes a different view

...depends a little on your data, and how fast you (want to) pick up the etree way of thinking.
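
To see where the text of the small `root` / `bold` example ends up, the following sketch parses it and prints `.text` and `.tail`. It uses the standard library's `xml.etree.ElementTree` (an assumption for self-containedness; lxml's etree handles text and tail the same way). Unlike the simplified list above, the real values include the surrounding spaces and newlines.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('''<root>
  some text
  <bold>with more text</bold>
  end text
</root>''')
bold = root.find('bold')

# repr() makes the whitespace around the text visible
print('root.text:', repr(root.text))   # text before the first child element
print('bold.text:', repr(bold.text))   # text inside <bold>
print('bold.tail:', repr(bold.tail))   # text after </bold>, stored on bold, not on root
print('root.tail:', repr(root.tail))   # nothing follows the root element
```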

This notebook is here to make this look a little less scary.

## Some examples of how to handle xml

In [2]:
import wetsuite.helpers.net
import wetsuite.helpers.etree  # actually 90% just the existing lxml.etree module, but we added some convenience

# and some test data to start
test_xml = '''<aanhef>
  <preambule>
    <al>De raad van de gemeente Oosterhout;</al>
    <al>gezien het voorstel van burgemeester en wethouder van 21 november 2014;</al>
    <al>gelet op<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>, a<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen</extref>, de bepalingen van de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Algemene%20wet%20bestuursrecht">Algemene wet bestuursrecht</extref>en de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Gemeentewet">Gemeentewet</extref>;</al>
    <al>overwegende dat het noodzakelijk is het opleggen van een tegenprestatie bij verordening te regelen;</al>
    <al>Besluit</al>
    <al>vast te stellen: de “Verordening tegenprestatie Sociale Zekerheid 2015, gemeente Oosterhout”.</al>
  </preambule>
</aanhef>'''


### Finding and navigating elements

Terms:
* child is a node _directly_ under another
* descendant is a node _somewhere_ under another

Functions:
* [`find()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.find) finds the first child that matches by name, or descendant by path (of names).
* [`findall()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findall) finds all matching elements
* or [`iter()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.iter), a [generator](https://wiki.python.org/moin/Generators) that can do a _little_ more than findall

In [3]:
aanhef = wetsuite.helpers.etree.fromstring( test_xml ) # parse that text into etree object.
# That object points at the root of a tree, so we might choose to call this variable 'tree', or 'root'.
# Or, when we know it is fixed, the name of its root element might make the following code a little more readable

In [4]:
# find() picks a specific child element out of the tree
print( aanhef.find('preambule') )  # finds the first node called preambule under the aanhef node

<Element preambule at 0x7f7581929d40>


In [4]:
# ...in this case you are probably interested in whatever is within preambule, so you can go find a descendant two levels deeper
print( aanhef.find('preambule/al') )

<Element al at 0x7f1a90112480>


In [5]:
# ...yet find() only finds the first (that is its purpose), and you are probably interested in everything.

# To look for multiple things, consider the following three fragments:


# All children under an element, whatever those are.
preambule = aanhef.find('preambule')
print( list( preambule ) )   # iterating an element yields its direct children (getchildren() is deprecated)
# That makes sense e.g. where you already _know_ you are interested in all children equally,
#   e.g. because the document _must_ only contain all the same thing at that level.
# Usually you can't assume that, so you make more specific requests...


# ...say, if you only wanted <al> elements, and you know there could be other things in there, you might do:
preambule = aanhef.find('preambule')
print( preambule.findall( 'al' ) )
# (in this case that happens to be the same, but if preambule contained non-al nodes, they would not show up)


# If you aren't going to reuse that preambule reference, this is a shorter equivalent:
print( aanhef.findall( 'preambule/al' ) )

[<Element al at 0x7f758198ed00>, <Element al at 0x7f758198ec00>, <Element al at 0x7f758198efc0>, <Element al at 0x7f758198e940>, <Element al at 0x7f758198e780>, <Element al at 0x7f758198e880>]
[<Element al at 0x7f758198ed00>, <Element al at 0x7f758198e7c0>, <Element al at 0x7f758198ec00>, <Element al at 0x7f758198e800>, <Element al at 0x7f758198efc0>, <Element al at 0x7f758198e980>]
[<Element al at 0x7f758198ec00>, <Element al at 0x7f758198ee40>, <Element al at 0x7f758198e800>, <Element al at 0x7f758198e780>, <Element al at 0x7f758198efc0>, <Element al at 0x7f758198eac0>]


## Side note: some questions are more structured than others.

If your question is very structured, you might often find yourself writing nested code that expresses how you think about the structure, like:

    preambule = aanhef.find('preambule')
    for alinea in preambule.findall('al'):
        handle( alinea )


In other cases, your wishes are less structured, e.g. 
"I want to find the preambule tags no matter where they are in the structure, then want extrefs at any depth under that".

The plain name paths we gave find() and findall() above don't let you say "I don't care what's in between" (they do accept a limited subset of XPath, which includes a `.//` step for 'at any depth'),
so you might end up using something like [`iter()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.iter) ('step through all nodes in a tree'), but this easily gets a little awkward. 


In [7]:
# This is one of a few ways of expressing that.   Most variants of this are awkward in some way, this is just one of them.
for node in aanhef.iter():
    if node.tag == 'preambule':
        for up in node.iter():
            if up.tag == 'extref':
                print( up )

<Element extref at 0x7f7581982d00>
<Element extref at 0x7f7581982ac0>
<Element extref at 0x7f7581982e80>
<Element extref at 0x7f75819826c0>
<Element extref at 0x7f75819825c0>


## Side note: extraction code may need to change

You can imagine that the first find+findall style encourages code that is entangled with a single purpose,
and is not necessarily even correct for similar cases.

...because did you catch that the above would give duplicate results if preambules could exist within other preambules?

"Correct only with a completely hidden assumption", which can be a category of "wrong, actually".
Or perfectly usable. 
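
To make that hidden assumption visible, here is a small sketch on a made-up document where one preambule sits inside another. The nested document is purely hypothetical (real CVDR data may never do this), and the standard library's `xml.etree.ElementTree` is used only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# hypothetical document: a preambule nested inside another preambule
nested = ET.fromstring(
    '<aanhef><preambule><preambule><al>'
    '<extref doc="x">ref</extref>'
    '</al></preambule></preambule></aanhef>'
)

# the same nested-iter() approach as in the cell above
found = []
for node in nested.iter():
    if node.tag == 'preambule':
        for sub in node.iter():
            if sub.tag == 'extref':
                found.append(sub)

# one way to drop duplicates: remember which element objects we already saw
unique = []
seen = set()
for element in found:
    if id(element) not in seen:
        seen.add(id(element))
        unique.append(element)

print(len(found), 'results,', len(unique), 'distinct element(s)')
```

The single `extref` is reported once for each preambule it sits under, so the deduplication step is what you would have to remember to add if that assumption ever stops holding.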



The original question happens to be easier to express in XPath.

And you can imagine that it would be less typing to change later. (...assuming your next need is also easy to express in XPath)

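The standard library's etree only supports a limited subset of XPath (within find() and findall()), but wetsuite happens to focus on a specific flavour of etree that supports more of it through an `xpath()` method (technically: [lxml's etree interface has it](https://lxml.de/xpathxslt.html#xpath))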

So consider:

In [8]:
aanhef.xpath('//preambule//extref')

[<Element extref at 0x7f758198bb80>,
 <Element extref at 0x7f75a00c9b00>,
 <Element extref at 0x7f7581982980>,
 <Element extref at 0x7f7581982880>,
 <Element extref at 0x7f7581982740>]

## Getting out attributes, and text

In [12]:
# Each node can have attributes.
#   often we are picking out detailed data from there,
#   sometimes we are filtering on values in there, which is a little more work

for extref in aanhef.xpath('preambule//extref'):
    # print the fragment we're extracting from as serialized XML, for reference
    print( 'LOOKING AT', wetsuite.helpers.etree.debug_pretty(extref) )   # debug_pretty is largely  ET.tostring but geared at humans skimming: namespaces removed, indenting added

    #print( extref.attrib ) # _all_ attributes as dict, though in these cases there is only one
    print( 'DOC:       %r'%extref.get('doc')) # if there was no such attribute, it would return None

    print( 'TEXT:      %r'%extref.text ) # this gets the initial text, this isn't always enough (we'll explain why next)
    print()

LOOKING AT <extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>, a
DOC:       'http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a'
TEXT:      'artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet'

LOOKING AT <extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;
DOC:       'http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35'
TEXT:      'rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers'

LOOKING AT <extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oude
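
The cell above only reads attributes. Filtering on their values is the "little more work" mentioned in its comments, and could look something like the sketch below. It uses the standard library's `xml.etree.ElementTree` and a shortened fragment to stay self-contained; the third `extref`, which has no `doc` attribute, is made up to show why the `None` check matters.

```python
import xml.etree.ElementTree as ET

al = ET.fromstring(
    '<al>gelet op'
    '<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a</extref>, '
    '<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">artikel 35</extref> en '
    '<extref>een verwijzing zonder doc</extref>'   # hypothetical: no doc attribute
    '</al>'
)

article_35 = []
for extref in al.iter('extref'):
    doc = extref.get('doc')   # None when the attribute is absent
    if doc is not None and doc.endswith('/article=35'):
        article_35.append(extref)

print([extref.text for extref in article_35])
```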

## Getting out text

XML, due to its generic nature, does not specialize in text, and that means it's often a little cumbersome to deal with text.

Looking at the documentation, you might have found
* `element.text` and `element.tail`, reminding yourself of where etree sticks text

* [`findtext()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findtext) roughly amounts to `find()` followed by `.text`, is fewer keystrokes but the same as before

* [`findall()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findall) could be used by then fishing out `.text` and `.tail` of each matching node, ignoring `None`s
  * yet that's a whole bunch of typing.

* [`itertext()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.itertext) can be used something like `list( aanhef.itertext('al') )`,  does something like that for you?
  * Better - probably.

But there's a more practical issue:

### When you say text, do you mean direct text content, subtree text, or something else?

In [10]:
alineas = aanhef.xpath('//al') # select all alineas anywhere
alineas[2].text  # what's the text in the third one? (cherry picked example)

'gelet op'

That seems like elementtree being a smartass -
that sure is the text within the element - up to the first child element, which isn't what we wanted.

We can start looking at functions, but they are just a means to an end,
and frankly we should first be more specific about what we wanted to do in the first place.

If we wanted the output:

    gelet op artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet, artikel 35, lid 1, aanhef en onderdeel e,
    van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers; artikel 35, lid 1,
    aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen,
    de bepalingen van de Algemene wet bestuursrecht en de Gemeentewet;

...then our question was actually "all text and tail values at any depth in the subtree under a particular starting element,
smushed together", and that's a little more work to do.
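
To make that question concrete, here is a minimal sketch of such a walk. `collect_text_fragments` is a hypothetical name for this illustration and is not the wetsuite helper (which has more options); the standard library's `xml.etree.ElementTree` and a shortened alinea are used to keep it self-contained.

```python
import xml.etree.ElementTree as ET

def collect_text_fragments(element):
    """Return every .text and .tail found inside element, in document order.
    The tail of the starting element itself is left out, because that text
    sits after the element rather than inside it."""
    fragments = []
    if element.text:
        fragments.append(element.text)
    for child in element:
        fragments.extend(collect_text_fragments(child))
        if child.tail:
            fragments.append(child.tail)
    return fragments

al = ET.fromstring(
    '<al>gelet op<extref doc="x">artikel 8a</extref>, en de'
    '<extref doc="y">Gemeentewet</extref>;</al>'
)

fragments = collect_text_fragments(al)
print(fragments)
```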

In [15]:
# We provide a function that should help.
#   side note: there is a sneaky default text strip(), via a parameter, which you may sometimes want to change
wetsuite.helpers.etree.all_text_fragments( alineas[2], strip='' )

# this mostly grabs each .text and .tail it finds under the given element, so gives a list like:

['gelet op',
 '\n    ',
 'artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet',
 ', a',
 'rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers',
 ';',
 'artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen',
 ', de bepalingen van de',
 'Algemene wet bestuursrecht',
 'en de',
 'Gemeentewet',
 ';']

In [16]:
# so, why not return it as a single string?   We could join that list into a single string:
''.join( wetsuite.helpers.etree.all_text_fragments( alineas[2] ) )

'gelet opartikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet, artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers;artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen, de bepalingen van deAlgemene wet bestuursrechten deGemeentewet;'

In [17]:
# Note that 'opartikel'. Okay, maybe that should have inserted a space between each part:
' '.join( wetsuite.helpers.etree.all_text_fragments( alineas[2] ) )
# or, basically equivalent, and a little clearer in intent:
#wetsuite.helpers.etree.all_text_fragments( alineas[2], ignore_empty=True, join=' ' )

# ...yet now we have an "a rtikel".

'gelet op  artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet , a rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers ; artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen , de bepalingen van de Algemene wet bestuursrecht en de Gemeentewet ;'

In [18]:
# That "a rtikel" is easily argued to be an error in that particular document (extrefs are used without spaces so _should_ split words),
# and join-with-a-space is probably preferable in this particular case.


# ...but let's instead consider BWBR0044578
bwb_test = wetsuite.helpers.net.download('https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0044578/2021-01-01_0/xml/BWBR0044578_2021-01-01_0.xml')
tree = wetsuite.helpers.etree.fromstring( bwb_test )
# The intitule is
#   Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer
#   voor de invoering van een CO2-heffing voor de industrie (Wet CO<inf>2</inf> -heffing industrie)
# so
#   tree.find('wetgeving/intitule').text
# would stop after "(Wet CO".

# So do we want all_text_fragments again, with spaces as we just figured was a good idea?
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join=' ') )

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO  2 -heffing industrie) 2020 544 23-12-2020 16-12-2020 35575 2020 544 23-12-2020 16-12-2020 35575 01-01-2021


In [19]:
# Now it says "CO  2" where we probably wanted "CO2". Okay, let's try
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join='') )

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)202054423-12-202016-12-202035575202054423-12-202016-12-20203557501-01-2021


In [None]:
# Right, so for just titles you could guess we could probably get away with not adding spaces because you won't get much markup in there
# but in reality, doing that for an entire body text will probably just be wrong.


# Yet this actually highlights a subtler issue,
# that in such free-form documents it is uncertain which elements act as word separators, and which do not.
# 
# This is largely down to semantics that are defined somewhere in the parts of a standard that fewer people read.
# Or, you know, not. But there are often still strong conventions, like:
# - not for subscript and superscript
# - _probably not_ for bold or italics
# - _probably_ for links

# In HTML this may not be much of an issue.
# Yet in XML we have documents pretending to be data  (or maybe the other way around)

# (...and we have to ignore human mistakes like, above, that second extref starting a character late)


# TODO: create a specialized text extractor for CVDR and BWB that includes these semantics for the node names used in there
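
As a rough idea of what such an extractor might do, here is a sketch that only inserts a space at an element boundary when the element is considered word-separating and there are word characters on both sides. The function name `join_text` and the set of non-separating tags are assumptions made for illustration (only `inf` and `extref` appear in the examples above; `sup` is a guess), and the standard library's `xml.etree.ElementTree` is used to keep it self-contained.

```python
import xml.etree.ElementTree as ET

# assumption: tags that should NOT act as word separators (subscript, superscript)
NON_SEPARATING = {'inf', 'sup'}

def join_text(element, non_separating=NON_SEPARATING):
    pieces = []   # text fragments, with None marking a word-separating boundary

    def walk(el):
        if el.text:
            pieces.append(el.text)
        for child in el:
            separates = child.tag not in non_separating
            if separates:
                pieces.append(None)   # boundary where the child starts
            walk(child)
            if separates:
                pieces.append(None)   # boundary where the child ends
            if child.tail:
                pieces.append(child.tail)

    walk(element)

    out = ''
    pending = False
    for piece in pieces:
        if piece is None:
            pending = True
            continue
        # only add a space when both sides of the boundary are word characters
        if pending and out and out[-1].isalnum() and piece[0].isalnum():
            out += ' '
        pending = False
        out += piece
    return out

al = ET.fromstring(
    '<al>gelet op<extref>artikel 8a</extref>, de<extref>Gemeentewet</extref>'
    'en (Wet CO<inf>2</inf>-heffing)</al>'
)
result = join_text(al)
print(result)
```

This still cannot repair mistakes in the source itself: the extref that starts a character late would come out as "a rtikel" here as well.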

In [None]:
# Also, look back to the BWBR0044578 example - why all the numbers at the end of the title?
#   Well, here the XML is less structured than it perhaps should be - the actual title within the intitule
#   isn't well separated from the metadata tag at the same level,
#   so in this case it seems almost impossible to ask for just that text
#   Such issues are _generally_ rare, but in this particular case you have to get more creative.

# This should probably mean all_text_fragments's ignore_under should come to mean 'ignore subtree when you see an element called this',
#   but until then you would need to get more creative.

# One way is to use the stop_at argument, which stops walking the tree at the first mention of a node with a particular name
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join='', stop_at=['meta-data']) )


# A somewhat more selective hack would be to selectively remove the `meta-data` tag
#   (based on knowing the structure from the XML schema, and probably work on a copy of the tree)
# Such cleanup might be better when you hand trees along -- but means you have to know even more about etree, so it's not for regular use.
tree = wetsuite.helpers.etree.fromstring( bwb_test ) # parse that text into etree object tree
intitule = tree.find('wetgeving/intitule')
intitule.remove( intitule.find('meta-data') )
wetsuite.helpers.etree.all_text_fragments( intitule, join='')

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)


'Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)'
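
The "work on a copy" advice from the cell above could be sketched as follows, using `copy.deepcopy` so that the original tree stays intact. The tiny document is made up to resemble the intitule case (the `jaar` element is hypothetical), and the standard library's `xml.etree.ElementTree` is used to keep it self-contained.

```python
import copy
import xml.etree.ElementTree as ET

# made-up miniature of the intitule-with-metadata situation
original = ET.fromstring(
    '<wetgeving><intitule>Wet CO<inf>2</inf>-heffing industrie'
    '<meta-data><jaar>2020</jaar></meta-data></intitule></wetgeving>'
)

working_copy = copy.deepcopy(original)    # edits below do not touch `original`
intitule = working_copy.find('intitule')
meta = intitule.find('meta-data')
if meta is not None:
    # note: removing an element also drops its .tail text, if it had any
    intitule.remove(meta)

title = ''.join(intitule.itertext())
print(title)
print('original still has meta-data:', original.find('intitule/meta-data') is not None)
```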

## On (ignoring) namespaces

Consider needing to combine data from different sources in one XML document.

If done free-form, you would combine element names that _might_ have different meanings to different sources
  - there is potential ambiguity about what standard each element (or even attribute) even comes from,
  - and if they happen to pick the same element (or attribute) name, they may clash and overwrite each other.

Namespaces can be thought of as a little tag on each element (and attribute) that assigns it to a well-defined source.


Consider the following XML: (a simplified version of document from [here](https://standaarden.overheid.nl/owms/terms/Leiden_(gemeente).rdf))
```
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <overheid:CBSCode>0546</overheid:CBSCode>
    <overheid:overlapsWith rdf:resource="http://standaarden.overheid.nl/owms/terms/Hoogheemraadschap_van_Rijnland"/>
    <overheid:overlapsWith rdf:resource="http://standaarden.overheid.nl/owms/terms/Zuid-Holland"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Commissie_Regionaal_Overleg_Luchthaven_Schiphol"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Milieudienst_West-Holland"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Regiokorps_Politie_Hollands_Midden"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/SZHR"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Servicepunt71"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Veiligheidsregio_Hollands_Midden"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/bsgr"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/odwhol"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/rdoghm"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/sohlrl"/>
    <rdf:type rdf:resource="http://purl.org/dc/terms/Agent"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Agent"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Gemeente"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Organisatie"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Overheidsorganisatie"/>
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
    <rdfs:label xml:lang="nl">Leiden</rdfs:label>
    <skos:prefLabel xml:lang="nl">Leiden</skos:prefLabel>
  </rdf:Description>
</rdf:RDF>
```


**technicalities (feel free to skip)**

It turns out that (due to the way namespaces were bolted onto XML _after_ initial design)
a namespace declaration like `xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"`
to humans seems to mean 'can use `overheid:` as a prefix on nodes'.

...**but**
- That prefix isn't technically part of the document model,
- the XML specs cannot even guarantee the same prefix will be used the next time we write this out,
- so the only thing that actually identifies the namespace is `http://standaarden.overheid.nl/owms/terms/`, *****not***** `overheid`

So when etree (and many other parsers) then sees
	`<overheid:CBSCode>0546</overheid:CBSCode>`
it will actually think of that as an element with name
`{http://standaarden.overheid.nl/owms/terms/}CBSCode`
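
You can check this yourself. The sketch below parses a cut-down version of the document above with the standard library's `xml.etree.ElementTree` (used here only for self-containedness) and prints the tag as the parser sees it. It then looks the element up with a prefix mapping of our own, deliberately using a different prefix (`o`) than the document does, to show that only the URL matters.

```python
import xml.etree.ElementTree as ET

rdf = ET.fromstring('''<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <overheid:CBSCode>0546</overheid:CBSCode>
  </rdf:Description>
</rdf:RDF>''')

description = rdf[0]       # the first (and only) child of the root
code = description[0]      # the first child of that

print(code.tag)            # the {namespace-url}name form, no 'overheid:' prefix in sight

# find() takes a prefix-to-URL mapping; the prefix is our own choice
ns = {'o': 'http://standaarden.overheid.nl/owms/terms/'}
found = description.find('o:CBSCode', ns)
print(found.text)
```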



**Correct namespaces are important when producing XML**,
in that without them the document will not conform to a standard that says you must use them.

At the same time, **matching namespaces is annoying**, and
- when you are only _consuming_ XML into something else,
- when you are doing so manually rather than with something like XSLT,
- when you know that the schemas being combined in your documents do not actually clash (which is 'usually'),
then you can consider removing them.

Say, if you parsed the above example, pulled it through `wetsuite.helpers.etree.strip_namespace()`, and printed it out again, you would get:

```
<RDF>
  <Description about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <CBSCode>0546</CBSCode>
    <overlapsWith resource="http://standaarden.overheid.nl/owms/terms/Hoogheemraadschap_van_Rijnland"/>
    <overlapsWith resource="http://standaarden.overheid.nl/owms/terms/Zuid-Holland"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Commissie_Regionaal_Overleg_Luchthaven_Schiphol"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Milieudienst_West-Holland"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Regiokorps_Politie_Hollands_Midden"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/SZHR"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Servicepunt71"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Veiligheidsregio_Hollands_Midden"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/bsgr"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/odwhol"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/rdoghm"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/sohlrl"/>
    <type resource="http://purl.org/dc/terms/Agent"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Agent"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Gemeente"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Organisatie"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Overheidsorganisatie"/>
    <type resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
    <type resource="http://www.w3.org/2002/07/owl#Thing"/>
    <label lang="nl">Leiden</label>
    <prefLabel lang="nl">Leiden</prefLabel>
  </Description>
</RDF>
```

Is this cheating?

Yes.

Does it make your life simpler, e.g. let you write `etree.find('Description/CBSCode')`
instead of `etree.find('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description/{http://standaarden.overheid.nl/owms/terms/}CBSCode')`?

Also yes.   

In any context where you are sure it won't introduce ambiguity, you can get away with this.
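
For a feel of what such stripping involves, here is a minimal sketch. `strip_namespaces_sketch` is a hypothetical function written for this illustration and is not how `wetsuite.helpers.etree.strip_namespace()` is necessarily implemented; it uses the standard library's `xml.etree.ElementTree` and alters the tree in place.

```python
import xml.etree.ElementTree as ET

def strip_namespaces_sketch(root):
    """Remove the {namespace-url} part from every element and attribute name, in place."""
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith('{'):
            el.tag = el.tag.split('}', 1)[1]
        for name in list(el.attrib):          # list() because we change attrib while looping
            if name.startswith('{'):
                el.attrib[name.split('}', 1)[1]] = el.attrib.pop(name)
    return root

rdf = ET.fromstring('''<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <overheid:CBSCode>0546</overheid:CBSCode>
  </rdf:Description>
</rdf:RDF>''')

strip_namespaces_sketch(rdf)
stripped_code = rdf.find('Description/CBSCode')   # the short path now works
print(rdf.tag, stripped_code.text, rdf[0].get('about'))
```

If two namespaces had used the same element or attribute name, this is exactly where they would silently collapse into one, which is the ambiguity to rule out before doing this.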

## On playing with selection queries

This can still feel fairly abstract, so let's play with examples some more.

There are a number of ways to do that:
- online XPath tester tools,
- your browser's development console (press F12), which in most browsers lets you search for nodes within the current page with such queries.

For the sake of testing this on our own data, there is a quick imitation of such visualisation.

(Note that it is _INCOMPLETE_ in a few senses, including that it's focused on elements, not text nodes or attributes - '//extref/@id' or //extref/text() will not do anything. In fact, at etree level the first doesn't select anything, though the second _does_)

In [13]:
import wetsuite.helpers.notebook

test_xml = '''<aanhef>
  <preambule>
    <al>De raad van de gemeente Oosterhout;</al>
    <al>gezien het voorstel van burgemeester en wethouder van 21 november 2014;</al>
    <al>gelet op<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>, a<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen</extref>, de bepalingen van de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Algemene%20wet%20bestuursrecht">Algemene wet bestuursrecht</extref>en de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Gemeentewet">Gemeentewet</extref>;</al>
    <al>overwegende dat het noodzakelijk is het opleggen van een tegenprestatie bij verordening te regelen;</al>
    <al>Besluit</al>
    <al>vast te stellen: de “Verordening tegenprestatie Sociale Zekerheid 2015, gemeente Oosterhout”.</al>
  </preambule>
</aanhef>'''

preambule = wetsuite.helpers.etree.fromstring( test_xml ) # our basic preambule / al / extref example from earlier

In [15]:
# The query 'preambule' means "select all <preambule> tags directly under the root", 
# the function parameters additionally ask it to highlight what's inside (more useful for sections on more complex data)
display( wetsuite.helpers.notebook.etree_visualize_selection(preambule, 'preambule', mark_subtree=True) )

In [16]:
# "mark all <al> tags you can find anywhere"   (note that it only highlights the direct text content, you can have it mark _everything_ under there with mark_subtree)
display( wetsuite.helpers.notebook.etree_visualize_selection(preambule, '//al') )

In [19]:
# "all <extrefs> nodes, that are direct children of <al> nodes placed anywhere"
display( wetsuite.helpers.notebook.etree_visualize_selection(preambule, '//al/extref') )
# ...notice the subtle difference to:

# "all <extrefs> nodes, anywhere under <al> nodes placed anywhere"
display( wetsuite.helpers.notebook.etree_visualize_selection(preambule, '//al//extref') )
# which in this case is exactly the same selection, but doesn't need to be
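
To see a case where the two selections do differ, consider the made-up fragment below, in which one `extref` is wrapped in another element (`nadruk` is a hypothetical tag name here). The sketch uses the standard library's `xml.etree.ElementTree`, whose `findall()` wants the relative `.//` form rather than a leading `//`.

```python
import xml.etree.ElementTree as ET

preambule_example = ET.fromstring(
    '<preambule><al>gelet op'
    '<extref doc="a">direct</extref> en '
    '<nadruk><extref doc="b">genest</extref></nadruk>'
    '</al></preambule>'
)

# extref elements that are direct children of an <al>
direct_children = preambule_example.findall('.//al/extref')

# extref elements at any depth under an <al>
any_depth = preambule_example.findall('.//al//extref')

print([e.text for e in direct_children])
print([e.text for e in any_depth])
```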

In [20]:
# "the first <al> tag in a preambule"
display( wetsuite.helpers.notebook.etree_visualize_selection(preambule, '//preambule/al[1]')  )
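
Position predicates like `[1]` count from 1, not from 0 as Python lists do. A small sketch of that, again with the standard library's `xml.etree.ElementTree` (whose limited XPath subset happens to include `[1]` and `[last()]`) and a made-up three-alinea document:

```python
import xml.etree.ElementTree as ET

numbered = ET.fromstring(
    '<aanhef><preambule>'
    '<al>eerste</al><al>tweede</al><al>derde</al>'
    '</preambule></aanhef>'
)

first = numbered.findall('.//preambule/al[1]')        # 1 means the first <al>
last = numbered.findall('.//preambule/al[last()]')    # last() means the final <al>

print([al.text for al in first], [al.text for al in last])
```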