## Goal of this notebook

When you want more structure than plain text, you quickly have to dig deeper, in ways that are specific to each data source.

This is a continuation of what we started in the [dataset_docstructure_cvdr](dataset_docstructure_cvdr.ipynb) notebook,
applied to Basis WettenBestand (BWB).

## BWB

In [1]:
import collections
import wetsuite.datacollect.db   # used below for the URL list and for cached fetches
import wetsuite.helpers.net
import wetsuite.helpers.etree
import wetsuite.helpers.koop_parse

### What do the documents look like?


Again, let's start with a cherry-picked example, and let's skip ahead to showing only the body (using a path we don't technically know just yet).

In [2]:
url = 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0009262/1998-01-14_0/xml/BWBR0009262_1998-01-14_0.xml'
example_tree = wetsuite.helpers.etree.fromstring( wetsuite.helpers.net.download(url) )
example_tree = wetsuite.helpers.etree.strip_namespace( example_tree )
print( wetsuite.helpers.etree.tostring( wetsuite.helpers.etree.indent( example_tree.find('wetgeving/wet-besluit/wettekst') ) ).decode('u8') )

<wettekst>
  <artikel bwb-ng-variabel-deel="/Artikel1" code="b1-1" stam-id="676033" versie-id="983772" id="C983771" label-id="655064" inwerking="1998-01-14" label="Artikel 1" bron="Stb.1998-16" effect="nieuwe-regeling" ondertekening_bron="1997-12-24" publicatie_bron="1998-01-13" publicatie_iwt="1998-01-13" status="goed">
    <kop>
      <label>Artikel</label>
      <nr>1</nr>
    </kop>
    <lid bwb-ng-variabel-deel="/Artikel1/Lid1" label-id="655064L1">
      <lidnr>1</lidnr>
      <al>De verkiezing van de leden van de raden van de gemeenten Deventer, Diepenveen en Bathmen, waarvoor de kandidaatstelling op 20 januari 1998 zou plaatsvinden, blijft achterwege.</al>
      <meta-data>
        <jcis>
          <jci versie="1.3" verwijzing="jci1.3:c:BWBR0009262&amp;artikel=1&amp;lid=1&amp;z=1998-01-14&amp;g=1998-01-14" onderdeel="lid=1"/>
        </jcis>
      </meta-data>
    </lid>
    <lid bwb-ng-variabel-deel="/Artikel1/Lid2" label-id="655064L2">
      <lidnr>2</lidnr>
      <al>De leden

Again, there's [a schema](https://repository.officiele-overheidspublicaties.nl/Schema/BWB-WTI/2016-1/xsd/wti_2016-1.xsd),
which says that there is consistent wrapping:
- ***`toestand`*** node (useful attributes include `bwb-id`)
  - `bwb-inputbestand` (required but may be empty)
  - `bwb-wijzigingen` (required but may be empty)
  - `redactionele-correcties` (optional)
  - ***`wetgeving`***  node (useful attributes include `soort`)
    - `intitule`
    - `citeertitel`
    - ***(general content root)*** node
    - `meta-data`
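As a rough illustration of that wrapping, here is a minimal sketch that walks from `toestand` to the children of `wetgeving`. It uses the standard library's `xml.etree.ElementTree` instead of the wetsuite helpers so that it runs on its own. The sample document, its `bwb-id` value, and the function name `describe_wrapper` are made up for this example; real files carry namespaces and many more attributes.

```python
import xml.etree.ElementTree as ET

# Made-up document that only mimics the wrapper structure described above
# (namespaces already stripped; real files have far more content).
SAMPLE = """
<toestand bwb-id="BWBR0000000">
  <bwb-inputbestand/>
  <bwb-wijzigingen/>
  <wetgeving soort="wet">
    <intitule>Example title</intitule>
    <citeertitel>Example</citeertitel>
    <wet-besluit><wettekst/></wet-besluit>
    <meta-data/>
  </wetgeving>
</toestand>
"""

def describe_wrapper(toestand):
    """Return (bwb-id, soort, tag names of wetgeving's direct children)."""
    wetgeving = toestand.find('wetgeving')
    child_tags = [child.tag for child in wetgeving]
    return toestand.get('bwb-id'), wetgeving.get('soort'), child_tags

bwb_id, soort, child_tags = describe_wrapper(ET.fromstring(SAMPLE))
print(bwb_id, soort, child_tags)
```

In this sample the third child of `wetgeving` is the content root, here `wet-besluit`.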

It turns out that the name of that 'general content root' element varies with the kind of document.

That name is well correlated with `wetgeving`'s `soort` attribute, though with just enough exceptions that you can't simply assume it.

Let's get that conclusion from actual data:

#### What types of documents are there?

In [3]:
# there are currently roughly 37k active toestanden.   All of it in one go takes a while, and takes a lot of RAM, so let's make a selection.
# TODO: replace with fetches from dataset
conn = wetsuite.datacollect.db.connect()
curs = conn.cursor()
#curs.execute('SELECT toestand_url  FROM bwb')
curs.execute('SELECT toestand_url  FROM bwb  WHERE random() < 0.4  LIMIT 10000') # the random is a hackish selection to get a better-spread test sample
bwb_urls = curs.fetchall()
conn.rollback()

bwb_parsed = []
for url, in bwb_urls:
    bytestring, _, _ = wetsuite.datacollect.db.cached_fetch( url )
    tree = wetsuite.helpers.etree.fromstring( bytestring )
    tree = wetsuite.helpers.etree.strip_namespace( tree )
    bwb_parsed.append( (url, tree) )
print('DONE parsing %d items'%len(bwb_parsed))

DONE parsing 10000 items


In [4]:
content_root_count = collections.defaultdict(int)

for url, tree in bwb_parsed:
    wetgeving = tree.find('wetgeving')
    soort     = wetgeving.get('soort')
    # unpacking implicitly also tests that there are always exactly the four nodes the schema describes
    intitule, citeertitel, content_root, metadata = list(wetgeving)  # list() instead of lxml's deprecated getchildren()
    content_root_count[ (soort, content_root.tag) ] += 1

In [5]:
for path_string in sorted(content_root_count):
    soort,content_root = path_string
    count = content_root_count[path_string]
    print(' %50r  with   %-20r  appeared %s times'%(soort, content_root, count))

                                             'AMvB'  with   'wet-besluit'         appeared 1669 times
                                         'AMvB-BES'  with   'wet-besluit'         appeared 111 times
                                               'KB'  with   'regeling'            appeared 56 times
                                               'KB'  with   'wet-besluit'         appeared 330 times
                                     'beleidsregel'  with   'circulaire'          appeared 1020 times
                                     'beleidsregel'  with   'regeling'            appeared 148 times
                                 'beleidsregel-BES'  with   'circulaire'          appeared 1 times
                                       'circulaire'  with   'circulaire'          appeared 257 times
                                       'circulaire'  with   'regeling'            appeared 2 times
                                   'circulaire-BES'  with   'circulaire'          appeared 2 t

So that's messy (and if we had included all documents there would have been a few more combinations).
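One way to put a number on "messy" is to ask, per `soort`, which content root dominates and what share of that `soort`'s documents it covers. The sketch below does that on a handful of counts copied from the output above; the function name `dominant_root_per_soort` is made up for this example.

```python
import collections

# (soort, content root tag) -> count, copied from the first rows of the output above
content_root_count = {
    ('AMvB', 'wet-besluit'): 1669,
    ('KB', 'regeling'): 56,
    ('KB', 'wet-besluit'): 330,
    ('beleidsregel', 'circulaire'): 1020,
    ('beleidsregel', 'regeling'): 148,
    ('circulaire', 'circulaire'): 257,
    ('circulaire', 'regeling'): 2,
}

def dominant_root_per_soort(counts):
    """For each soort, return (most common content root, its share of that soort's documents)."""
    per_soort = collections.defaultdict(collections.Counter)
    for (soort, root_tag), count in counts.items():
        per_soort[soort][root_tag] += count
    result = {}
    for soort, roots in per_soort.items():
        root_tag, top_count = roots.most_common(1)[0]
        result[soort] = (root_tag, top_count / sum(roots.values()))
    return result

dominant = dominant_root_per_soort(content_root_count)
for soort, (root_tag, share) in sorted(dominant.items()):
    print('%-15s mostly %-12s (%.1f%% of that soort)' % (soort, root_tag, 100 * share))
```

A share well below 100% for a `soort` is a sign that the content root should be looked up in the document rather than assumed from `soort`.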

On the other hand, if you treat rare things as exceptions, it's not so bad:

In [6]:
for path_string in sorted(content_root_count):
    soort,content_root = path_string
    count = content_root_count[path_string]
    if count>40:
        print(' %50r  with   %-20r  appeared %s times'%(soort, content_root, count))

                                             'AMvB'  with   'wet-besluit'         appeared 1357 times
                                               'KB'  with   'wet-besluit'         appeared 313 times
                                     'beleidsregel'  with   'circulaire'          appeared 597 times
                                     'beleidsregel'  with   'regeling'            appeared 77 times
                                       'circulaire'  with   'circulaire'          appeared 153 times
                            'ministeriele-regeling'  with   'regeling'            appeared 5326 times
       'ministeriele-regeling-archiefselectielijst'  with   'regeling'            appeared 225 times
                                              'pbo'  with   'regeling'            appeared 632 times
                                         'rijkswet'  with   'wet-besluit'         appeared 54 times
                                              'wet'  with   'wet-besluit'         appeared 

#### What about the document body?

The schema says little about the structure within that main content node.
How would we e.g. get to all the structure, find out how to refer to parts, etc?

Again, let's look at the data, with the same path counter.

In [7]:
count_paths = collections.defaultdict( int )

for url, tree in bwb_parsed:
    wetgeving = tree.find('wetgeving')
    #soort = wetgeving.get('soort')
    _, _, content_root, _ = list(wetgeving)  # list() instead of lxml's deprecated getchildren()

    for path, count in wetsuite.helpers.etree.path_count( content_root ).items():
        count_paths[path] += count

for path, count in sorted( count_paths.items() ): # sort by path as an approximate 'group similar things'
    print( '%6d  %s'%(count,path))


   792  circulaire
   489  circulaire/bijlage
    20  circulaire/bijlage/adres
   129  circulaire/bijlage/adres/adresregel
     1  circulaire/bijlage/adreslijst
    21  circulaire/bijlage/adreslijst/adres
   116  circulaire/bijlage/adreslijst/adres/adresregel
  1072  circulaire/bijlage/al
    39  circulaire/bijlage/al/extref
     2  circulaire/bijlage/al/intref
    57  circulaire/bijlage/al/nadruk
     3  circulaire/bijlage/al/noot
     3  circulaire/bijlage/al/noot/noot.al
     2  circulaire/bijlage/al/noot/noot.lijst
    12  circulaire/bijlage/al/noot/noot.lijst/noot.li
    12  circulaire/bijlage/al/noot/noot.lijst/noot.li/li.nr
    12  circulaire/bijlage/al/noot/noot.lijst/noot.li/noot.al
     2  circulaire/bijlage/al/noot/noot.lijst/noot.li/noot.al/extref
     3  circulaire/bijlage/al/noot/noot.nr
   157  circulaire/bijlage/al/redactie
     3  circulaire/bijlage/al/sup
    10  circulaire/bijlage/artikel
     5  circulaire/bijlage/artikel/al
     3  circulaire/bijlage/artikel/al/ext

...that's a lot of output - but to be fair, that's just about every path that ever appears, so you would *expect* it to be a lot.


### Structure down to artikel?

If we assume for a moment that all interesting content is in `artikel` tags (ignoring some parts of the structure, like `bijlage`, for _relative_ brevity), we can look at which paths lie between the content root and each `artikel`, and then at the structure within an `artikel`.

In [8]:
# we could probably alter path_count() to stop at a given node name, but as a one-off this code is not so bad:
how_to_get_to_artikel = collections.defaultdict(int)

for path_string in list( count_paths.keys() ):
    if path_string.endswith('/artikel'):
        how_to_get_to_artikel[ path_string ] += count_paths[path_string]

for path, count in sorted( how_to_get_to_artikel.items() ):
    print( '%6d  %s'%(count,path))

    10  circulaire/bijlage/artikel
    58  circulaire/circulaire-tekst/artikel
    43  circulaire/circulaire-tekst/circulaire.divisie/artikel
    50  circulaire/circulaire-tekst/circulaire.divisie/circulaire.divisie/artikel
   711  regeling/bijlage/artikel
  1274  regeling/bijlage/divisie/artikel
   190  regeling/bijlage/divisie/divisie/artikel
   361  regeling/bijlage/divisie/divisie/divisie/artikel
 28714  regeling/regeling-tekst/artikel
   513  regeling/regeling-tekst/hoofdstuk/afdeling/artikel
  1065  regeling/regeling-tekst/hoofdstuk/afdeling/paragraaf/artikel
   107  regeling/regeling-tekst/hoofdstuk/afdeling/paragraaf/sub-paragraaf/artikel
  5888  regeling/regeling-tekst/hoofdstuk/artikel
  3723  regeling/regeling-tekst/hoofdstuk/paragraaf/artikel
   294  regeling/regeling-tekst/hoofdstuk/paragraaf/sub-paragraaf/artikel
    92  regeling/regeling-tekst/hoofdstuk/titeldeel/afdeling/artikel
   235  regeling/regeling-tekst/hoofdstuk/titeldeel/afdeling/paragraaf/artikel
    51  regel

That looks reasonably regular, actually.
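The alternative mentioned in the code comment, a counter that stops descending once it reaches a given node name, could look roughly like the sketch below. It is written against the standard library and is not part of wetsuite; the name `paths_until` and the sample document are made up.

```python
import collections
import xml.etree.ElementTree as ET

def paths_until(root, stop_tag):
    """Count the paths from root down to each stop_tag element, without looking inside it."""
    counts = collections.Counter()

    def walk(node, prefix):
        path = prefix + [node.tag]
        if node.tag == stop_tag:
            counts['/'.join(path)] += 1
            return  # stop here: whatever is nested inside is ignored
        for child in node:
            walk(child, path)

    walk(root, [])
    return counts

# Made-up miniature document with artikel at three different depths
sample = ET.fromstring(
    '<regeling><regeling-tekst>'
    '<artikel><al>x</al></artikel>'
    '<hoofdstuk>'
    '<artikel><lid><al>y</al></lid></artikel>'
    '<paragraaf><artikel/></paragraaf>'
    '</hoofdstuk>'
    '</regeling-tekst></regeling>'
)
artikel_paths = paths_until(sample, 'artikel')
for path, count in sorted(artikel_paths.items()):
    print('%6d  %s' % (count, path))
```

Because the walk returns as soon as it sees the stop tag, an `artikel` nested inside another `artikel` would not be counted separately.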


### Structure within artikel?

Additionally assuming that all real text is in `al` tags:

In [9]:
within_artikel = collections.defaultdict(int)

for path_string in list( count_paths.keys() ):
    arti = path_string.find('/artikel/')
    if arti!=-1 and path_string.endswith('/al'):
        if '/meta-data' in path_string: # focus more on content for a moment
            continue
        within_artikel[ path_string[arti+1:] ] += count_paths[path_string]

for path, count in sorted( within_artikel.items() ):
    #if count>40: # uncomment for 'just the more common stuff'
        print( '%6d  %s'%(count,path))

 96878  artikel/al
  1257  artikel/artikel.toelichting/al
   309  artikel/artikel.toelichting/lijst/li/al
    25  artikel/artikel.toelichting/lijst/li/lijst/li/al
    10  artikel/artikel.toelichting/lijst/li/lijst/li/lijst/li/al
  4813  artikel/definitielijst/definitie-item/definitie/al
   490  artikel/definitielijst/definitie-item/definitie/lijst/li/al
    54  artikel/definitielijst/definitie-item/definitie/lijst/li/lijst/li/al
     3  artikel/definitielijst/definitie-item/definitie/lijst/li/lijst/li/lijst/li/al
168916  artikel/lid/al
  1175  artikel/lid/definitielijst/definitie-item/definitie/al
   185  artikel/lid/definitielijst/definitie-item/definitie/lijst/li/al
 88395  artikel/lid/lijst/li/al
    36  artikel/lid/lijst/li/definitielijst/definitie-item/definitie/al
 10300  artikel/lid/lijst/li/lijst/li/al
   568  artikel/lid/lijst/li/lijst/li/lijst/li/al
    48  artikel/lid/lijst/li/lijst/li/lijst/li/lijst/li/al
     3  artikel/lid/lijst/li/lijst/li/lijst/li/lijst/li/lijst/li/al
 

Again, not so bad, and it resembles what we had already done for CVDR.

In fact, let's try to use the same 'text with context' function.

In [24]:
import random, pprint
import wetsuite.helpers.koop_parse

from importlib import reload
reload(wetsuite.helpers.koop_parse)

for url, tree in random.sample( bwb_parsed, 3 ):
    print(' === %s ==='%url )

    tree = wetsuite.helpers.etree.fromstring( wetsuite.datacollect.db.cached_fetch( url )[0] )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    alinea_dicts = wetsuite.helpers.koop_parse.alineas_with_selective_path( 
        tree, 
        start_at_path = wetsuite.helpers.etree.path_between(tree, tree.find('wetgeving')[2])  # [2] is the content root; indexing avoids lxml's deprecated getchildren()
    )
    if 0: # ungrouped
        for alinea_dict in alinea_dicts:
            print('-'*80)
            pprint.pprint( alinea_dict )
    else:  # grouped
        # this groups text fragments whose values for specific keys (e.g. artikel) are the same
        pprint.pprint( wetsuite.helpers.koop_parse.merge_alinea_data( alinea_dicts ) )


 === https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0007878/1996-02-04_0/xml/BWBR0007878_1996-02-04_0.xml ===
[([],
  ['Besluit: ',
   'Deze regeling zal met de toelichting in de Staatscourant worden '
   'geplaatst. ']),
 ([('artikel', '1')],
  ['De door de ondernemers voor het jaar 1996 vastgestelde tarieven worden '
   'goedgekeurd. ']),
 ([('artikel', 'II')],
  ['Deze regeling treedt in werking met ingang van de tweede dag na '
   'dagtekening van de Staatscourant waarin zij wordt geplaatst en werkt terug '
   'tot en met 1 januari 1996. '])]
 === https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0002976/2013-04-01_0/xml/BWBR0002976_2013-04-01_0.xml ===
[([],
  ['Zo is het, dat Wij, de Raad van State gehoord, en met gemeen overleg der '
   'Staten-Generaal, hebben goedgevonden en verstaan, gelijk Wij goedvinden en '
   'verstaan bij deze: ',
   'Lasten en bevelen, dat deze in het  Staatsblad zal worden geplaatst, en '
   'dat alle ministeriële departementen

It's missing various BWB-specific details you _could_ be extracting, but not a bad start at all.
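As an example of such BWB-specific detail, the `artikel` node in the example at the top of this notebook carries attributes like `inwerking`, `bron` and `status`. A minimal sketch of collecting those next to the text might look like this. It uses the standard library rather than the wetsuite helpers; the function name `artikel_details` and the choice of attributes are assumptions made for this example, and the sample is abbreviated from that earlier document.

```python
import xml.etree.ElementTree as ET

# Abbreviated from the example document shown at the top of this notebook
SAMPLE = """
<wettekst>
  <artikel bwb-ng-variabel-deel="/Artikel1" inwerking="1998-01-14" label="Artikel 1" bron="Stb.1998-16" status="goed">
    <kop><label>Artikel</label><nr>1</nr></kop>
    <lid bwb-ng-variabel-deel="/Artikel1/Lid1">
      <lidnr>1</lidnr>
      <al>De verkiezing (...) blijft achterwege.</al>
    </lid>
  </artikel>
</wettekst>
"""

# Attributes picked for illustration; the real documents have more
ARTIKEL_ATTRIBUTES = ('label', 'bwb-ng-variabel-deel', 'inwerking', 'bron', 'status')

def artikel_details(content_root):
    """Per artikel: the selected attributes, plus (lidnr, [al texts]) for each lid."""
    details = []
    for artikel in content_root.iter('artikel'):
        item = {name: artikel.get(name) for name in ARTIKEL_ATTRIBUTES}
        item['leden'] = [
            (lid.findtext('lidnr'), [''.join(al.itertext()) for al in lid.findall('al')])
            for lid in artikel.findall('lid')
        ]
        details.append(item)
    return details

details = artikel_details(ET.fromstring(SAMPLE))
print(details)
```

This only looks at `al` nodes directly under `lid`, so lists and definitions would need the same kind of path handling as shown earlier.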

#### Other examples you could try

In [None]:
# short example with lists
#https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0006881/1994-09-01_0/xml/BWBR0006881_1994-09-01_0.xml

#https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0034743/2014-02-05_0/xml/BWBR0034743_2014-02-05_0.xml

# definities as examples of valuable text _not_ in an al
# https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0010278/1999-12-01_0/xml/BWBR0010278_1999-12-01_0.xml 

# https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0001827/2022-08-01_0/xml/BWBR0001827_2022-08-01_0.xml
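
For the definitions case, a rough sketch of pairing each defined term with its definition text is given below. The `definitielijst/definitie-item/definitie` nesting comes from the path counts above, but the `term` element name is an assumption that is not shown in this notebook and should be checked against real documents. The function name `definitions` and the sample are made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the 'term' element is an assumption, check it against real data
SAMPLE = """
<artikel>
  <definitielijst>
    <definitie-item>
      <term>example term</term>
      <definitie><al>example definition text</al></definitie>
    </definitie-item>
  </definitielijst>
</artikel>
"""

def definitions(node):
    """Return (term text or None, definition text) for each definitie-item under node."""
    pairs = []
    for item in node.iter('definitie-item'):
        term = item.find('term')
        term_text = ''.join(term.itertext()).strip() if term is not None else None
        al_texts = []
        for definitie in item.findall('definitie'):
            for al in definitie.iter('al'):
                al_texts.append(''.join(al.itertext()).strip())
        pairs.append((term_text, ' '.join(al_texts)))
    return pairs

pairs = definitions(ET.fromstring(SAMPLE))
print(pairs)
```

The term itself is text that an `al`-only extraction would miss, which is the point of that example URL.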