Continuing from [extras_datacollect_rechtspraak](datacollect/extras_datacollect_rechtspraak.ipynb)...





## What does the data XML look like, and what can I easily do with it?

In [1]:
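import collections  # used further down for counting paths
import pprint       # used further down for printing parsed results
import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.helpers.etree
import wetsuite.datacollect.rechtspraaknl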


In [2]:
if 0:  # a cherry-picked example, downloaded directly
    bytestring = wetsuite.helpers.net.download('https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBZWB:2020:5807')
else:  # or a random example from the local store (many of them have no body text, that's normal)
    rechtspraak_fetched = wetsuite.helpers.localdata.LocalKV('rechtspraak_fetched.db', key_type=str, value_type=bytes)
    _, bytestring = rechtspraak_fetched.random_choice()
    # TODO: replace with dataset

example_tree = wetsuite.helpers.etree.fromstring( bytestring )
print( wetsuite.helpers.etree.debug_pretty( example_tree ) ) # print indented
#pprint.pprint( wetsuite.datacollect.rechtspraaknl.parse_content( example_tree ) )

<open-rechtspraak>
  <RDF>
    <Description>
      <identifier>ECLI:NL:HR:2010:BN3914</identifier>
      <format>text/xml</format>
      <accessRights>public</accessRights>
      <modified>2013-04-05T06:48:29</modified>
      <issued label="Publicatiedatum">2013-04-05</issued>
      <publisher resourceIdentifier="http://rechtspraak.nl/">Raad voor de Rechtspraak</publisher>
      <language>nl</language>
      <replaces label="Vervangt">BN3914</replaces>
      <creator resourceIdentifier="http://standaarden.overheid.nl/owms/terms/Hoge_Raad_der_Nederlanden" scheme="overheid.RechterlijkeMacht" label="Instantie">Hoge Raad</creator>
      <date label="Uitspraakdatum">2010-08-13</date>
      <zaaknummer label="Zaaknr">09/04963</zaaknummer>
      <type resourceIdentifier="http://psi.rechtspraak.nl/uitspraak" language="nl">Uitspraak</type>
      <procedure resourceIdentifier="http://psi.rechtspraak.nl/procedure#cassatie" language="nl" label="Procedure">Cassatie</procedure>
      <coverage>NL</c

It seems there are specific ideas about which content belongs in which structure (e.g. para in parablock in paragroup, and overview details in uitspraak.info).

But at the same time, that structure is missing from a lot of documents.

TODO: check whether the presence of that structure varies over time.
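
To make "structure is present or missing" concrete, here is a minimal sketch that tallies how often the element names mentioned above occur in a document. It uses only the standard library rather than the wetsuite helpers, and both sample documents are made up for illustration (no namespaces, heavily simplified); structure_report is a hypothetical helper name, not part of wetsuite.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified documents (not real rechtspraak.nl data)
STRUCTURED = """<open-rechtspraak>
  <uitspraak>
    <uitspraak.info><para>Header details</para></uitspraak.info>
    <paragroup><parablock><para>First consideration.</para></parablock></paragroup>
  </uitspraak>
</open-rechtspraak>"""

FLAT = """<open-rechtspraak>
  <uitspraak>
    <para>Header details</para>
    <para>First consideration.</para>
  </uitspraak>
</open-rechtspraak>"""

def structure_report(xml_text, names=('uitspraak.info', 'paragroup', 'parablock', 'para')):
    """Return {element name: number of occurrences anywhere in the document}."""
    root = ET.fromstring(xml_text)
    return {name: sum(1 for _ in root.iter(name)) for name in names}

structured_report = structure_report(STRUCTURED)
flat_report = structure_report(FLAT)
print(structured_report)
print(flat_report)
```

On real documents you would first strip namespaces (as done with wetsuite.helpers.etree.strip_namespace further down), otherwise the plain tag names would not match.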

In [None]:
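import pprint
from importlib import reload

import wetsuite.datacollect.rechtspraaknl
reload(wetsuite.datacollect.rechtspraaknl)  # pick up edits to the module without restarting the kernel

pprint.pprint( wetsuite.datacollect.rechtspraaknl.parse_content( example_tree ) )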

## Inspect the fetched documents, looking for their text

Like in the exploration of the BWB and CVDR data, let's point out that there are [schemas](https://www.rechtspraak.nl/SiteCollectionDocuments/Schema-Open-Data-voor-de-Rechtspraak.zip)
for the text's structure, but we should take a look at how closely they are actually followed.

Regardless of that, we also need to decide how to flatten that text when we want plain text,
which we do for this dataset.
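
As a rough idea of what flattening could mean, the sketch below pulls the text out of each para element and normalizes whitespace. This is an assumption-laden illustration using the standard library, not how rechtspraaknl.parse_content() does it; the sample XML, the emphasis element and the flatten_paras name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment (not real rechtspraak.nl data)
SAMPLE = """<uitspraak>
  <uitspraak.info><para>Header details</para></uitspraak.info>
  <paragroup>
    <parablock><para>First <emphasis>important</emphasis> consideration.</para></parablock>
    <parablock><para>Second consideration.</para></parablock>
  </paragroup>
</uitspraak>"""

def flatten_paras(node):
    """Return one whitespace-normalized string per non-empty <para> under node."""
    paragraphs = []
    for para in node.iter('para'):
        # itertext() also yields text inside inline children such as <emphasis>
        text = ' '.join(''.join(para.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs

flat = flatten_paras(ET.fromstring(SAMPLE))
print('\n\n'.join(flat))
```

Note that this simple approach would repeat text if para elements were ever nested, and it ignores any text that sits outside a para, which is exactly the kind of thing the path counts below help to spot.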

One of the things we do is count paths, like in the [cvdr_docstructure](extras_datacollect_koop_cvdr_docstructure.ipynb) and [bwb_docstructure](extras_datacollect_koop_bwb_docstructure.ipynb) notebooks.

As of this writing, those counts have guided how rechtspraaknl.parse_content() is implemented, though this needs more work.
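
For readers who have not seen those notebooks, counting paths amounts to roughly the following. This is a standard-library sketch of the idea, with a made-up sample fragment; count_paths_sketch is a hypothetical name, and the real wetsuite.helpers.etree.path_count may format its paths differently.

```python
import collections
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment (not real rechtspraak.nl data)
SAMPLE = """<uitspraak>
  <uitspraak.info><para>Header details</para></uitspraak.info>
  <paragroup>
    <parablock><para>First consideration.</para></parablock>
    <parablock><para>Second consideration.</para></parablock>
  </paragroup>
</uitspraak>"""

def count_paths_sketch(root):
    """Count how often each root-to-element tag path occurs in the tree."""
    counts = collections.Counter()

    def walk(node, prefix):
        path = prefix + '/' + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, '')
    return counts

path_counts = count_paths_sketch(ET.fromstring(SAMPLE))
for path, count in sorted(path_counts.items()):
    print('%7d   %s' % (count, path))
```

Summed over many documents, rare paths point at structure the flattening code may not handle yet.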

In [None]:
import collections

count_paths = collections.defaultdict(int)  # path -> total occurrences over all sampled documents

for key, xmldoc_bytes in rechtspraak_fetched.random_sample( 100 ): # a small sample, so the output stays reviewable by hand
    tree = wetsuite.helpers.etree.fromstring( xmldoc_bytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    print(  )
    print( '-----------------------------------' )
    print( key )

    if 0: # check there's any other nodes beyond RDF, inhoudsindicatie, uitspraak, conclusie -- looks like no.
        childnames = list( node.tag  for node in tree.findall('*') )
        childnames.remove('RDF')
        if 'inhoudsindicatie' in childnames:
            childnames.remove('inhoudsindicatie')
        if 'uitspraak' in childnames:
            childnames.remove('uitspraak')
        if 'conclusie' in childnames:
            childnames.remove('conclusie')
        if len(childnames) > 0:
            print( childnames )

    uitspraak = tree.find('uitspraak')
    conclusie = tree.find('conclusie')
    
    if uitspraak is not None:

        for path, count in wetsuite.helpers.etree.path_count( uitspraak ).items():
            count_paths[path] += count

        try:
            parsed = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
            print( parsed['uitspraak'] )
        except Exception as e:
            print( 'ERROR', e )
            print( wetsuite.helpers.etree.tostring(uitspraak).decode('u8') )
            #raise

    elif conclusie is not None:
        for path, count in wetsuite.helpers.etree.path_count( conclusie ).items():
            count_paths[path] += count

        try:
            parsed = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
            print(   parsed['conclusie'] )
        except Exception as e:
            print( 'ERROR', e )
            print( wetsuite.helpers.etree.tostring(conclusie).decode('u8') )
            #raise


In [None]:
# Show those counted paths
pci = list( count_paths.items() )
pci.sort( key=lambda x:x[0] )
for path, count in pci:
    print('%7d   %s'%(count, path))
