
# Purpose of this notebook

If our provided datasets or data extraction functions (often demonstrated in a notebook) cover your needs,
then you don't really need to know about data in XML form, or even what XML means.

If our code _doesn't_ help yet, or you have a specific question that means you must dive deep into the structured data,
then it may help to read this tour of what XML looks like in general, and how to manipulate it in Python.
After that, move on to example uses, e.g. the various data collection notebooks.

### XML as a data format
XML is a specific style of structured data, one that focuses on nesting data structures in other data structures - a [tree structure](https://en.wikipedia.org/wiki/Tree_structure).

It has elements that look like `<this>when it has content</this>`, or `<this />` or `<this></this>` when empty.

While indentation is (largely) immaterial to XML in a data sense,
indentation is common when _presenting_ it for human review to help indicate how deep each element is in the tree, like:

```
<aanhef>
  <preambule>
    <al>De raad van de gemeente Oosterhout;</al>
    <al>gezien het voorstel van burgemeester en wethouder van 21 november 2014;</al>
    <al>gelet op<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>,
       a<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening
       ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en 
       onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen</extref>, 
       de bepalingen van de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Algemene%20wet%20bestuursrecht">Algemene wet bestuursrecht</extref>en de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Gemeentewet">Gemeentewet</extref>;</al>
    <al>overwegende dat het noodzakelijk is het opleggen van een tegenprestatie bij verordening te regelen;</al>
    <al>Besluit</al>
    <al>vast te stellen: de “Verordening tegenprestatie Sociale Zekerheid 2015, gemeente Oosterhout”.</al>
  </preambule>
</aanhef>
```

XML is usually [serialized](https://en.wikipedia.org/wiki/Serialization) into a file. That file is _technically_ just bytes, but in practice it is text that is fairly human-readable and fairly human-editable. You can edit it by hand, though you may not enjoy it tremendously.
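To make "serialized" a little more concrete, here is a minimal sketch that parses bytes into a tree and writes the tree back out as bytes. It assumes Python's standard-library `xml.etree.ElementTree` rather than the wetsuite helpers (only so that it runs without extra installs), and the tiny document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document, as bytes (the way it would sit in a file)
serialized = b'<aanhef><preambule><al>Besluit</al></preambule></aanhef>'

root = ET.fromstring(serialized)        # bytes -> tree of Element objects
print(root.tag)                         # name of the root element
print(root.find('preambule/al').text)   # text inside the nested al element

# tree -> bytes again; passing a byte encoding makes tostring() return bytes
round_tripped = ET.tostring(root, encoding='utf-8')
print(type(round_tripped))
```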

### elementtree's view on XML

Say we have some example data: 
```
<root>
  some text 
  <bold>with more text</bold>
  end text
</root>
```
Say we want to parse it into data we can extract or manipulate directly.


There are actually a few ways to think of XML data in object form (and a few more beyond that, see e.g. [SAX](https://en.wikipedia.org/wiki/Simple_API_for_XML)).


The [DOM](https://en.wikipedia.org/wiki/Document_Object_Model), as used e.g. by scripting in browsers,
would parse the above into a structure much like (note: for brevity we're pretending there are no spaces or newlines in there):
* `root` element
  * a `text node` containing `'some text'`
  * an element named `bold`, containing
    * a `text node` containing `'with more text'`
  * a `text node` containing `'end text'`

Thinking of that as five different objects is well defined, yet also a little cumbersome.
Even browsers rarely show it quite this verbosely.
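The DOM view described above can be inspected from Python as well. This is a small sketch using the standard library's `xml.dom.minidom` (an assumption for illustration; wetsuite itself does not use it), with the whitespace left out of the example data as in the list above.

```python
from xml.dom import minidom

doc = minidom.parseString('<root>some text<bold>with more text</bold>end text</root>')
root = doc.documentElement

# The root element has three child nodes: text, element, text
for node in root.childNodes:
    if node.nodeType == node.TEXT_NODE:
        print('text node:', repr(node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        # the bold element holds its own text in yet another text node
        print('element:  ', node.tagName, '->', repr(node.firstChild.data))
```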


Python's [elementtree](https://docs.python.org/3/library/xml.etree.elementtree.html) takes a different view: instead of making each element and each text fragment a separate object, it centers on elements (hence the name) and tacks text onto the closest element.
Of particular note are text nodes _after_ elements at the same level: these are stored on the preceding element, as its `tail`.
The above becomes:
* `root` element with `text='some text'` (and `tail=None`)
  * `bold` element with `text='with more text'` and `tail='end text'`

Whether this makes your life
* easier, because there are fewer objects to inspect, with less typing (and for structured data like config files, where tail is typically empty, it may make no difference at all), or
* harder, because it's _yet another_ way of looking at things, and sometimes no better (in particular the position of the 'end text' is sort of awkward)

...depends a little on your data, and how fast you pick up the etree way of thinking.
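The text/tail placement is easiest to see by poking at it. A minimal sketch, assuming the standard-library `xml.etree.ElementTree` (lxml's etree, which wetsuite uses, follows the same text/tail model) and the same whitespace-free example data:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root>some text<bold>with more text</bold>end text</root>')
bold = root.find('bold')

print(repr(root.text))   # text before the first child element
print(repr(root.tail))   # nothing follows the root element
print(repr(bold.text))   # text inside bold
print(repr(bold.tail))   # text that follows bold, stored on bold rather than on root
print(len(root))         # number of child elements: just bold
```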

## Some examples of how to handle xml

In [1]:
import wetsuite.helpers.net
import wetsuite.helpers.etree  

test_xml = '''<aanhef>
  <preambule>
    <al>De raad van de gemeente Oosterhout;</al>
    <al>gezien het voorstel van burgemeester en wethouder van 21 november 2014;</al>
    <al>gelet op<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>, a<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen</extref>, de bepalingen van de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Algemene%20wet%20bestuursrecht">Algemene wet bestuursrecht</extref>en de<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Gemeentewet">Gemeentewet</extref>;</al>
    <al>overwegende dat het noodzakelijk is het opleggen van een tegenprestatie bij verordening te regelen;</al>
    <al>Besluit</al>
    <al>vast te stellen: de “Verordening tegenprestatie Sociale Zekerheid 2015, gemeente Oosterhout”.</al>
  </preambule>
</aanhef>'''


### Finding and navigating elements

Terms:
* child is a node _directly_ under another
* descendant is a node _somewhere_ under another

Functions:
* [find()](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.find) finds the first child that matches by name, or the first descendant that matches by path (of names).
* [findall()](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findall) finds all matching elements
* or [iter()](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.iter), a [generator](https://wiki.python.org/moin/Generators) that can do a _little_ more than findall

In [3]:
aanhef = wetsuite.helpers.etree.fromstring( test_xml ) # parse that text into etree object tree. 
# We might call the variable tree, or root, or like in this case, the name of its root element
#   (when we know it, this can make our code a little more readable)

In [4]:
# pick a specific child element out of the tree
print( aanhef.find('preambule') )  # finds the first node called preambule under the aanhef node

<Element preambule at 0x7fd69d4821c0>


In [5]:
# in this case you are probably interested in whatever is within preambule, so you can go find a descendant two levels deeper
print( aanhef.find('preambule/al') )

<Element al at 0x7fd69d482800>


In [7]:
# ...yet find() only finds the first match. That is what it is for; it is just not what we want right now.

# To look for multiple things, consider the following three fragments:

# - all children under an element, whatever they are, which might make sense for some structured data
preambule = aanhef.find('preambule')
print( list( preambule ) )   # iterating over an element gives its children (getchildren() does the same, but is deprecated)


# - if preambule contained tags other than al, they would also be in that list. If we want just al, you would probably prefer:
preambule = aanhef.find('preambule')
print( preambule.findall( 'al' ) )
# (in this case the result happens to be the same, but if preambule contained non-al nodes, they would not show up)


# If you aren't going to reuse that preambule reference, this is a shorter equivalent:
print( aanhef.findall( 'preambule/al' ) )

[<Element al at 0x7fd69d482680>, <Element al at 0x7fd69d482a80>, <Element al at 0x7fd69d482a40>, <Element al at 0x7fd69d4829c0>, <Element al at 0x7fd69d482280>, <Element al at 0x7fd69d48b380>]
[<Element al at 0x7fd69d482a80>, <Element al at 0x7fd69d482a40>, <Element al at 0x7fd69d4829c0>, <Element al at 0x7fd69d482280>, <Element al at 0x7fd69d48b300>, <Element al at 0x7fd69d48b340>]
[<Element al at 0x7fd69d482a40>, <Element al at 0x7fd69d4829c0>, <Element al at 0x7fd69d482280>, <Element al at 0x7fd69d48b240>, <Element al at 0x7fd69d48b2c0>, <Element al at 0x7fd69d48b480>]
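The function list above also mentions `iter()`, which the cells do not show. A short sketch of the difference, assuming the standard-library `xml.etree.ElementTree` and a cut-down, made-up version of the example document:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<aanhef><preambule>'
    '<al>gelet op <extref doc="a">artikel 8a</extref> en <extref doc="b">artikel 35</extref>;</al>'
    '<al>Besluit</al>'
    '</preambule></aanhef>'
)

# findall() with a plain name only looks at direct children, so this finds nothing...
direct = doc.findall('extref')
print(len(direct))

# ...while iter() walks the whole subtree, at any depth
docs = [e.get('doc') for e in doc.iter('extref')]
print(docs)

# without a name, iter() yields every element, starting with the element itself
tags = [e.tag for e in doc.iter()]
print(tags)
```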


To power users / people who will use this a lot:

You will often find yourself writing code that expresses how you think about the structure.

For example, if a document contains a number of records, you might find yourself finding individual records (e.g. `findall()`),
then handling each individually (possibly fishing out a few specific things with `find()`).
This might be more clear to read or alter later -- or it might be more verbose.
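That "find the records, then handle each" style might look like the following sketch. The `records` document and its element names are made up for illustration, and the standard-library `xml.etree.ElementTree` is assumed.

```python
import xml.etree.ElementTree as ET

# made-up document with a few records
doc = ET.fromstring('''
<records>
  <record id="1"><title>First</title><year>2014</year></record>
  <record id="2"><title>Second</title></record>
</records>''')

rows = []
for record in doc.findall('record'):        # handle each record on its own
    rows.append({
        'id':    record.get('id'),
        'title': record.findtext('title'),
        'year':  record.findtext('year'),   # None when the element is missing
    })
print(rows)
```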

In other cases your wishes are less structured, e.g. "I want to find the preambule tags no matter where they are in the structure, then the extrefs at any depth under that". Those are a lot easier to express in a single, shorter bit of XPath. (The standard library's etree has no `xpath()` method and supports only a limited subset of XPath in `find()` and `findall()`. Wetsuite provides it because [lxml's etree interface has it](https://lxml.de/xpathxslt.html#xpath) and wetsuite specifically uses this lxml flavour of etree.)

Sometimes you can cheat: even if an expression _could_ match anywhere in the document, you may know that the only part of the document that _will_ match in practice is the part you are interested in.

Consider:

In [8]:
aanhef.xpath('//preambule//extref')  

[<Element extref at 0x7fd69d482480>,
 <Element extref at 0x7fd69d4827c0>,
 <Element extref at 0x7fd69d482700>,
 <Element extref at 0x7fd69d4821c0>,
 <Element extref at 0x7fd69d482780>]
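If you only have the standard library at hand, its `findall()` accepts a limited path syntax that covers this particular query. A sketch under that assumption, on a cut-down, made-up version of the example document:

```python
import xml.etree.ElementTree as ET

aanhef = ET.fromstring(
    '<aanhef><preambule>'
    '<al>gelet op <extref doc="a">artikel 8a</extref>, <extref doc="b">artikel 35</extref>;</al>'
    '</preambule></aanhef>'
)

# './/' means "at any depth below the current element"
extrefs = aanhef.findall('.//preambule//extref')
found = [e.get('doc') for e in extrefs]
print(found)
```

Anything beyond such simple paths (functions, axes, more complex predicates) needs a full XPath implementation such as lxml's.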

## Getting out attributes, and text

In [10]:
# Each node can have attributes.
#   Often we are picking out detailed data from there,
#   sometimes we are filtering on values in there, which is a little more work

for extref in aanhef.xpath('preambule//extref'):
    # print the fragment we're extracting from as serialized XML, for reference
    print( wetsuite.helpers.etree.debug_pretty(extref) )   # debug_pretty is largely ET.tostring but geared at humans skimming: namespaces removed, indenting added

    #print( extref.attrib ) # _all_ attributes as dict, though in these cases there is only one
    print( 'DOC  = %r'%extref.get('doc')) # if there was no such attribute, it would return None

    print( 'TEXT = %r'%extref.text ) # this gets the initial text, this isn't always enough (we'll explain why next)
    print()

<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a">artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet</extref>, a
DOC  = 'http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Participatiewet/article=8a'
TEXT = 'artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet'

<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35">rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers</extref>;
DOC  = 'http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAW/article=35'
TEXT = 'rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers'

<extref doc="http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=IOAZ/article=35">artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zel
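Filtering on attribute values, mentioned in the cell above as a little more work, could look like this sketch. It assumes the standard-library `xml.etree.ElementTree`; the `example.com` URLs and the fragment itself are made up.

```python
import xml.etree.ElementTree as ET

al = ET.fromstring(
    '<al>gelet op <extref doc="http://example.com/title=Participatiewet/article=8a">artikel 8a</extref>'
    ' en <extref doc="http://example.com/title=Gemeentewet">Gemeentewet</extref>'
    ' en <extref>iets zonder doc</extref>;</al>'
)

# Filter in Python: get() with a default avoids errors on missing attributes
with_article = [e for e in al.iter('extref') if 'article=' in e.get('doc', '')]
print([e.text for e in with_article])

# Exact matches, and "has this attribute at all", can go into the path itself
exact   = al.findall(".//extref[@doc='http://example.com/title=Gemeentewet']")
has_doc = al.findall('.//extref[@doc]')
print(len(exact), len(has_doc))
```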

## Getting out text

XML, due to its generic nature, does not specialize in text, and text is often a little cumbersome to deal with.

Looking at the documentation, you might have found
* `element.text` and `element.tail`, reminding yourself of where etree sticks text

* [findtext()](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findtext) roughly amounts to `find()` followed by `.text`; fewer keystrokes, but otherwise the same as before

* [findall()](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findall) could be used to then fish out `.text` and `.tail` of each matching node, ignoring `None`s
  * yet that's a whole bunch of typing.

* [itertext()](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.itertext), used something like `list( aanhef.itertext('al') )`, does something like that for you
  * Better - probably.
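A quick sketch of how `findtext()` and `itertext()` from the list above behave, assuming the standard-library `xml.etree.ElementTree` (whose `itertext()` takes no arguments, unlike the lxml variant used in the bullet) and a made-up fragment:

```python
import xml.etree.ElementTree as ET

aanhef = ET.fromstring(
    '<aanhef><preambule>'
    '<al>gelet op <extref doc="a">artikel 8a</extref> van de wet;</al>'
    '<al>Besluit</al>'
    '</preambule></aanhef>'
)

# findtext(): the .text of the first match, so it stops at the first child element
first_text = aanhef.findtext('preambule/al')
print(repr(first_text))

# itertext(): every text and tail under an element, in document order
first_al = aanhef.find('preambule/al')
pieces = list(first_al.itertext())
print(pieces)
print(''.join(pieces))
```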

But there's a more practical detail:

### When you say text, do you mean direct text content, subtree text, or something else?

In [11]:
alineas = aanhef.xpath('//al') # select all alineas anywhere
alineas[2].text  # what's the text in the third one? (cherry picked example)

'gelet op'

That seems like elementtree being a smartass -
that sure is the text within the element - up to the first child element, which isn't what we wanted.

We can start looking at functions, but they are just a means to an end,
and frankly we should first be more specific about what we wanted to do in the first place.

If we wanted the output:

    gelet op artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet, artikel 35, lid 1, aanhef en onderdeel e, 
    van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers; artikel 35, lid 1, 
    aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen, 
    de bepalingen van de Algemene wet bestuursrecht en de Gemeentewet;

...then our question amounts to "all text and tail values at any depth in the subtree under a particular starting element,
smushed together", and that's a little more work to do.
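To show what that work amounts to, here is a minimal recursive sketch. `text_fragments` is a hypothetical helper written for this illustration, not wetsuite's implementation, and it assumes the standard-library `xml.etree.ElementTree` and a shortened, made-up fragment.

```python
import xml.etree.ElementTree as ET

def text_fragments(element):
    """Collect .text and .tail values under element, in document order (hypothetical helper)."""
    fragments = []
    if element.text:
        fragments.append(element.text)
    for child in element:
        fragments.extend(text_fragments(child))
        if child.tail:           # text after the child still sits inside our subtree
            fragments.append(child.tail)
    return fragments             # note: element's own tail is outside the subtree, so it is skipped

al = ET.fromstring('<al>gelet op<extref doc="a">artikel 8a</extref>, a<extref doc="b">rtikel 35</extref>;</al>')
fragments = text_fragments(al)
print(fragments)
```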

In [12]:
# We provide a function that should help.
wetsuite.helpers.etree.all_text_fragments( alineas[2] )     # side note: there is a sneaky default text strip() on each fragment in there, which you may sometimes want to change

# and mostly it just grabs each .text and .tail it finds under the given element, so gives a list like:

['gelet op',
 '',
 'artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet',
 ', a',
 'rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers',
 ';',
 'artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen',
 ', de bepalingen van de',
 'Algemene wet bestuursrecht',
 'en de',
 'Gemeentewet',
 ';']

In [13]:
# Why not return it as a single string?   We could join that list into a single string:
''.join( wetsuite.helpers.etree.all_text_fragments( alineas[2] ) )

'gelet opartikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet, artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers;artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen, de bepalingen van deAlgemene wet bestuursrechten deGemeentewet;'

In [14]:
# Note that 'opartikel'. Okay, maybe that should have inserted a space between each part:
' '.join( wetsuite.helpers.etree.all_text_fragments( alineas[2] ) )
# or, basically equivalent, and a little clearer in intent:
#wetsuite.helpers.etree.all_text_fragments( alineas[2], ignore_empty=True, join=' ' )

# ...yet now we have an "a rtikel".

'gelet op  artikel 8a, lid 1, aanhef en onderdeel b, van de Participatiewet , a rtikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening ouderen en gedeeltelijk arbeidsongeschikte werkloze werknemers ; artikel 35, lid 1, aanhef en onderdeel e, van de Wet inkomensvoorziening oudere en gedeeltelijk arbeidsongeschikte gewezen zelfstandigen , de bepalingen van de Algemene wet bestuursrecht en de Gemeentewet ;'

In [15]:
# That "a rtikel" is easily argued to be an error in that particular document (its extrefs are written without surrounding spaces, so an extref boundary _should_ split words),
# and join-with-a-space is probably preferable in this particular case.


# ...but let's instead consider BWBR0044578
bwb_test = wetsuite.helpers.net.download('https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0044578/2021-01-01_0/xml/BWBR0044578_2021-01-01_0.xml')
tree = wetsuite.helpers.etree.fromstring( bwb_test )
# The intitule is 
#   Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer 
#   voor de invoering van een CO2-heffing voor de industrie (Wet CO<inf>2</inf> -heffing industrie)
# so 
#   tree.find('wetgeving/intitule').text  
# would stop after "(Wet CO". 

# So do we want all_text_fragments again, with spaces as we just figured was a good idea?
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join=' ') )

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO  2 -heffing industrie) 2020 544 23-12-2020 16-12-2020 35575 2020 544 23-12-2020 16-12-2020 35575 01-01-2021


In [16]:
# Now it says "CO  2" where we probably wanted "CO2"
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join='') )
# For just titles you could guess we could probably get away with not adding spaces because you won't get much markup in there
# but in reality, doing that for an entire body text will probably just be wrong.


# This actually highlights a subtler issue,
# that in such free-form documents it is uncertain which elements act as word separators, and which do not.
# This is largely down to semantics, say
#  * not for subscript and superscript
#  * _probably not_ for bold or italics
#  * _probably_ for links

# These semantics will be defined by the document standard - or at least strong convention. 
# Say, in HTML this isn't much of an issue.
# Yet in XML we have documents pretending to be data, or maybe the other way around?
#   Best thing you might get is a note in the documentation somewhere. If not, guess?

# (...and we have to ignore human mistakes like that second extref above starting a character late)


# TODO: create a specialized text extractor for CVDR and BWB that includes these semantics for the node names used in there

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)202054423-12-202016-12-202035575202054423-12-202016-12-20203557501-01-2021
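One way such per-element semantics could be encoded is sketched below. This is not the specialized extractor the TODO above refers to: `joined_text` and the `NO_SEPARATOR` set are hypothetical, the choice of tag names is an assumption, the sample fragment is made up, and the standard-library `xml.etree.ElementTree` is used.

```python
import re
import xml.etree.ElementTree as ET

# Assumption for illustration: these elements never separate words.
NO_SEPARATOR = {'inf', 'sup'}

def joined_text(element, no_separator=NO_SEPARATOR):
    """Join all text under element, putting a space around elements that separate words."""
    parts = [element.text or '']
    for child in element:
        sep = '' if child.tag in no_separator else ' '
        parts.append(sep + joined_text(child, no_separator) + sep)
        parts.append(child.tail or '')
    # collapse the runs of whitespace that this may introduce
    return re.sub(r'\s+', ' ', ''.join(parts)).strip()

intitule = ET.fromstring(
    '<intitule>(Wet CO<inf>2</inf>-heffing industrie), zie'
    '<extref doc="x">de Gemeentewet</extref>en verder</intitule>'
)
result = joined_text(intitule)
print(result)
```

The subscript is glued to its neighbours while the link gets spaces on both sides; which tags belong in which group is exactly the part that has to come from the document standard.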


In [17]:
# Also, why all the numbers at the end? 
#   Well, here the XML is less structured than it perhaps should be - the actual title within the intitule isn't well separated from the metadata tag
#   so in this case it seems almost impossible to ask for just that text
#   Such issues are _generally_ rare, but in this particular case you have to get more creative.

# This should probably mean all_text_fragments's ignore_under should come to mean 'ignore subtree when you see an element called this',
#   but until then you would need to get more creative.

# One way is to use the stop_at argument, which stops walking the tree at the first mention of a node with a particular name
print( wetsuite.helpers.etree.all_text_fragments(  tree.find('wetgeving/intitule'), join='', stop_at=['meta-data']) )


# A somewhat more selective hack would be to selectively remove the `meta-data` tag 
#   (based on knowing the structure from the XML schema, and probably work on a copy of the tree)
# Such cleanup might be better when you hand trees along -- but means you have to know even more about etree, so it's not for regular use.
tree = wetsuite.helpers.etree.fromstring( bwb_test ) # parse that text into etree object tree 
intitule = tree.find('wetgeving/intitule')
intitule.remove( intitule.find('meta-data') ) 
wetsuite.helpers.etree.all_text_fragments( intitule, join='')

Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)


'Wet van 16 december 2020 tot wijziging van de Wet belastingen op milieugrondslag en de Wet Milieubeheer voor de invoering van een CO2-heffing voor de industrie (Wet CO2-heffing industrie)'
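A caveat with `remove()`: an element's tail is stored on that element, so removing the element also drops any text that followed it, and removing it in place changes the tree for everyone else holding it. The sketch below works on a copy and keeps the tail. `without_child` is a hypothetical helper, the fragment is made up, and the standard-library `xml.etree.ElementTree` is assumed.

```python
import copy
import xml.etree.ElementTree as ET

def without_child(element, tag):
    """Return a deep copy of element with direct children named tag removed (hypothetical helper)."""
    clone = copy.deepcopy(element)           # leave the caller's tree untouched
    previous = None
    for child in list(clone):                # list(): we modify clone while looping
        if child.tag == tag:
            if child.tail:                   # move the tail somewhere safe before removing
                if previous is None:
                    clone.text = (clone.text or '') + child.tail
                else:
                    previous.tail = (previous.tail or '') + child.tail
            clone.remove(child)
        else:
            previous = child
    return clone

intitule = ET.fromstring('<intitule>Wet X<meta-data><jaar>2020</jaar></meta-data> (slot)</intitule>')
cleaned = without_child(intitule, 'meta-data')
print(''.join(cleaned.itertext()))    # without the metadata, tail text kept
print(''.join(intitule.itertext()))   # the original tree is unchanged
```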

## On (ignoring) namespaces

Consider needing to combine data from different sources in one XML document.
- If done free-form, there is potential ambiguity about which standard each element (or even attribute) comes from,
- and if they happen to pick the same element (or attribute) name, they may clash and overwrite each other.

Namespaces can be thought of as a little tag on each element (and attribute) that assigns it to a well-defined source.


Consider the following XML (a simplified version of the document from [here](https://standaarden.overheid.nl/owms/terms/Leiden_(gemeente).rdf)):
```
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <overheid:CBSCode>0546</overheid:CBSCode>
    <overheid:overlapsWith rdf:resource="http://standaarden.overheid.nl/owms/terms/Hoogheemraadschap_van_Rijnland"/>
    <overheid:overlapsWith rdf:resource="http://standaarden.overheid.nl/owms/terms/Zuid-Holland"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Commissie_Regionaal_Overleg_Luchthaven_Schiphol"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Milieudienst_West-Holland"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Regiokorps_Politie_Hollands_Midden"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/SZHR"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Servicepunt71"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/Veiligheidsregio_Hollands_Midden"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/bsgr"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/odwhol"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/rdoghm"/>
    <overheid:serviceAreaOf rdf:resource="http://standaarden.overheid.nl/owms/terms/sohlrl"/>
    <rdf:type rdf:resource="http://purl.org/dc/terms/Agent"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Agent"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Gemeente"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Organisatie"/>
    <rdf:type rdf:resource="http://standaarden.overheid.nl/owms/terms/Overheidsorganisatie"/>
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
    <rdfs:label xml:lang="nl">Leiden</rdfs:label>
    <skos:prefLabel xml:lang="nl">Leiden</skos:prefLabel>
  </rdf:Description>
</rdf:RDF>
```


**technicalities (feel free to skip)**

It turns out that (due to the way namespaces were bolted onto XML _after_ initial design)
a namespace declaration like `xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"`
to humans seems to mean 'can use `overheid:` as a prefix on nodes'.

...**but**
- that prefix isn't technically part of the document model,
- the XML specs do not even guarantee that the same prefix will be used if the document were written out again,
- and the only thing that actually identifies the namespace is `http://standaarden.overheid.nl/owms/terms/`, not `overheid`

So when etree (and many other parsers) then sees
	`<overheid:CBSCode>0546</overheid:CBSCode>`
it will actually think of that as an element with name
`{http://standaarden.overheid.nl/owms/terms/}CBSCode`
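You can see those `{namespace}name` names directly after parsing. A minimal sketch, assuming the standard-library `xml.etree.ElementTree` and a cut-down version of the document above; the short prefixes `r` and `o` in the mapping are arbitrary picks for the example.

```python
import xml.etree.ElementTree as ET

xml = '''<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <overheid:CBSCode>0546</overheid:CBSCode>
  </rdf:Description>
</rdf:RDF>'''
root = ET.fromstring(xml)

print(root.tag)          # the prefix is gone, the namespace URI is part of the name
print(root[0].attrib)    # attribute names get the same treatment

# Either spell out the full names...
code = root.find('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description'
                 '/{http://standaarden.overheid.nl/owms/terms/}CBSCode')

# ...or hand find() your own prefix-to-URI mapping; the prefixes are yours to pick
ns = {'r': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
      'o': 'http://standaarden.overheid.nl/owms/terms/'}
same = root.find('r:Description/o:CBSCode', ns)
print(code.text, same is code)
```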



**Correct namespaces are important when producing XML**,
in that without them the document will not conform to a standard that says you must use them.

At the same time, **matching namespaces is annoying**, and
- when you are only _consuming_ XML into something else,
- when you are doing so manually rather than with something like XSLT,
- when you know that the schemas being combined in your documents do not actually clash (which is 'usually'),

then you can consider removing them.

Say, if you parsed the above example, pulled it through `wetsuite.helpers.etree.strip_namespace()`, and printed it out again, you would get:

```
<RDF>
  <Description about="http://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)">
    <CBSCode>0546</CBSCode>
    <overlapsWith resource="http://standaarden.overheid.nl/owms/terms/Hoogheemraadschap_van_Rijnland"/>
    <overlapsWith resource="http://standaarden.overheid.nl/owms/terms/Zuid-Holland"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Commissie_Regionaal_Overleg_Luchthaven_Schiphol"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Milieudienst_West-Holland"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Regiokorps_Politie_Hollands_Midden"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/SZHR"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Servicepunt71"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/Veiligheidsregio_Hollands_Midden"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/bsgr"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/odwhol"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/rdoghm"/>
    <serviceAreaOf resource="http://standaarden.overheid.nl/owms/terms/sohlrl"/>
    <type resource="http://purl.org/dc/terms/Agent"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Agent"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Gemeente"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Organisatie"/>
    <type resource="http://standaarden.overheid.nl/owms/terms/Overheidsorganisatie"/>
    <type resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
    <type resource="http://www.w3.org/2002/07/owl#Thing"/>
    <label lang="nl">Leiden</label>
    <prefLabel lang="nl">Leiden</prefLabel>
  </Description>
</RDF>
```

Is this cheating? 

Yes.

Does it make your life simpler, e.g. let you write `etree.find('Description/CBSCode')`
instead of `etree.find('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description/{http://standaarden.overheid.nl/owms/terms/}CBSCode')`?

Also yes.   

As long as you're sure you do not introduce ambiguity.
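For a sense of what such stripping involves, here is a rough sketch. `strip_ns` is a hypothetical stand-in written for this illustration and is not how `wetsuite.helpers.etree.strip_namespace()` is implemented; it assumes the standard-library `xml.etree.ElementTree` and a cut-down version of the document above.

```python
import xml.etree.ElementTree as ET

def strip_ns(root):
    """Remove the {uri} part from element and attribute names, in place (hypothetical helper)."""
    for element in root.iter():
        if isinstance(element.tag, str) and '}' in element.tag:
            element.tag = element.tag.split('}', 1)[1]
        for name in list(element.attrib):
            if '}' in name:
                # this is where ambiguity bites: an existing attribute with
                # the same local name would be overwritten
                element.attrib[name.split('}', 1)[1]] = element.attrib.pop(name)
    return root

root = strip_ns(ET.fromstring(
    '<rdf:RDF xmlns:overheid="http://standaarden.overheid.nl/owms/terms/"'
    ' xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<rdf:Description rdf:about="x"><overheid:CBSCode>0546</overheid:CBSCode></rdf:Description>'
    '</rdf:RDF>'
))
print(root.findtext('Description/CBSCode'))
print(ET.tostring(root, encoding='unicode'))
```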