<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/extras/datacollect/extras_datacollect_rechtspraak.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook

Understanding what you can get out of [rechtspraak.nl](https://www.rechtspraak.nl/).

Note that this is mostly just showing our work. If the rechtspraak [dataset](../../intro/wetsuite_datasets.ipynb) we provide suits your needs,
then running this notebook would just be a slower and more cumbersome way to get basically the same data.

Somewhat related: [extras_datacollect_rechtspraak_codes](extras_datacollect_rechtspraak_codes.ipynb)

## Website, Open data API, and some other notes

You are probably familiar with the [rechtspraak.nl](https://rechtspraak.nl) website and its [search that has a number of filters](https://uitspraken.rechtspraak.nl/#!/) (and some [extra query logic for the text](https://www.rechtspraak.nl/Uitspraken/Paginas/Hulp-bij-zoeken.aspx#1ab85aa0-e737-4b56-8ad5-d7cb7954718d77a998be-3c73-40e3-90f7-541fceeb00fd3)), which gives webpage results, with text where present.


There is also an open data API that exposes much the same in data form.
As [its documentation](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf) ([this intro](https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx) may also be useful) mentions,
- you stick query parameters on the base URL of http://data.rechtspraak.nl/uitspraken/zoeken 
- the results mainly mention ECLIs, which you can fetch details for via e.g. https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608 (more notes below)


Worthy of note:
- the fields you can search mostly match those of 'Uitgebreid zoeken' at uitspraken.rechtspraak.nl, like
  - instantie / court code (basically that third element in the ECLI)
  - rechtsgebied
  - procedure 

- **You can't search in the body text**.  This largely limits the API to a 'keep updated with new cases' feed.  
  - Worthy of note: the website search (queried via `https://uitspraken.rechtspraak.nl/api/zoek`) does support this, and even seems like a better data API than the _actual_ one -- but it doesn't look like it's supposed to be used externally.

- there are plenty of cases where there is no text / document.  You can filter for this in the search.

- the **identifiers** used are **ECLI** (European Case Law Identifier)
  - which in this case will be mainly Dutch ECLIs  (`ECLI:NL:`...). 
    - used since 2013 or so, and which absorbed the previously used LJN (Landelijk Jurisprudentie Nummer) identifiers.
  - court code XX (`ECLI:NL:XX:`...) is used for things not (yet?) assigned to a court, and/or non-Dutch ECLIs,
    - rechtspraak.nl may later resolve such ECLIs to a different ECLI. That makes their site the most up-to-date source, and a mirror may not be aware of such a change (yet). 
    - Example: ECLI:NL:XX:2009:BJ4574 
      - in [XML metadata](https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:XX:2009:BJ4574) mentions it's now (isReplacedBy) ECLI:CE:ECHR:2009:0528JUD002671305 (note: )
      - in [webpage form](https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:XX:2009:BJ4574) also seems to link to [a place to find it](https://hudoc.echr.coe.int/eng#%7B%22ecli%22:%5B%22ECLI:CE:ECHR:2009:0528JUD002671305%22%5D%7D) (e.g. `hudoc.echr.coe.int` or `e-justice.europa.eu`).
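
As a small illustration of the ECLI structure described above, here is a minimal sketch that splits an ECLI into its colon-separated parts. The name `split_ecli` and the `EcliParts` tuple are made up for this example (they are not the wetsuite helpers), and the check is deliberately loose: it only assumes five parts starting with `ECLI`.

```python
from typing import NamedTuple


class EcliParts(NamedTuple):
    """Hypothetical container for the parts of an ECLI (illustration only)."""
    country: str   # e.g. 'NL', or 'CE' for the ECHR example
    court: str     # court code, e.g. 'PHR', 'HR', or the placeholder 'XX'
    year: str      # kept as a string, no validation
    number: str    # e.g. 'BP5608' (for older Dutch cases this is the former LJN)


def split_ecli(text: str) -> EcliParts:
    """Split 'ECLI:NL:PHR:2011:BP5608' into its parts; raise ValueError otherwise."""
    parts = text.strip().split(':')
    if len(parts) != 5 or parts[0].upper() != 'ECLI':
        raise ValueError('does not look like an ECLI: %r' % text)
    _, country, court, year, number = parts
    return EcliParts(country.upper(), court.upper(), year, number)


for example in ('ECLI:NL:PHR:2011:BP5608',
                'ECLI:NL:XX:2009:BJ4574',
                'ECLI:CE:ECHR:2009:0528JUD002671305'):
    print(split_ecli(example))
```

Checking `split_ecli(...).court == 'XX'` would be one simple way to flag the not-yet-assigned cases mentioned above.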

For each ECLI you might consider various URLs, including
  - XML, e.g. at https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608
    - most interesting when you want case details as data
    - (there is a different XML form at https://uitspraken.rechtspraak.nl/api/document/?id=ECLI:NL:PHR:2011:BP5608 but this is for the webpage view, and is less interesting as data)
  - the case on the website
    - is linked by the website as   https://uitspraken.rechtspraak.nl/InzienDocument?id=ECLI:NL:PHR:2011:BP5608
    - it seems the slightly shorter https://deeplink.rechtspraak.nl/uitspraak?id=ECLI:NL:PHR:2011:BP5608 is equivalent
    - both of the above redirect to a URL like https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:PHR:2011:BP5608
      - which is a general page with scripting that picks up that identifier and then does another request to https://uitspraken.rechtspraak.nl/api/document/?id=ECLI:NL:PHR:2011:BP5608
  - The website also often links to LiDo, e.g.  https://linkeddata.overheid.nl/document/ECLI:NL:PHR:2011:BP5608
    - If you want that as data, consider http://linkeddata.overheid.nl/service/get-links?ext-id=ECLI:NL:PHR:2011:BP5608&output=xml
    - though as https://linkeddata.overheid.nl/front/portal/services notes, this is not part of public LiDo, so you'll need to request an account first
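
The URL patterns listed above can be collected in one place. This is only a sketch: `urls_for_ecli` is a hypothetical helper name, the patterns are copied from the examples above, and whether each URL keeps working is up to the respective sites.

```python
def urls_for_ecli(ecli: str) -> dict:
    """Return the URL variants mentioned above for one ECLI (patterns taken from the examples)."""
    return {
        # XML with metadata and, where present, text: the most useful one as data
        'data_xml': 'https://data.rechtspraak.nl/uitspraken/content?id=%s' % ecli,
        # how the website itself links to a case
        'website':  'https://uitspraken.rechtspraak.nl/InzienDocument?id=%s' % ecli,
        # the slightly shorter form that seems equivalent
        'deeplink': 'https://deeplink.rechtspraak.nl/uitspraak?id=%s' % ecli,
        # where both of the above end up
        'details':  'https://uitspraken.rechtspraak.nl/#!/details?id=%s' % ecli,
        # LiDo page for the same case
        'lido':     'https://linkeddata.overheid.nl/document/%s' % ecli,
    }


for name, url in urls_for_ecli('ECLI:NL:PHR:2011:BP5608').items():
    print('%-9s %s' % (name, url))
```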


Because there are over three million Dutch ECLIs on record, getting a lot of data from here would take a while, and they ask you to be nice to their server.
There used to be a ZIP file (linked from [https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx](https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx)) you could download to bootstrap your own copy, but [this open data request](https://data.overheid.nl/community/datarequest/zip-bestand-alle-uitspraken) seems to confirm this was removed (and seems to imply that you should get it the hammery way).  As of this writing the URL to the ZIP file still works but they probably removed the link for a reason.


In [1]:
import pprint, collections, random

import wetsuite.helpers.etree
import wetsuite.datacollect.rechtspraaknl
import wetsuite.helpers.koop_parse
import wetsuite.helpers.meta
import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.helpers.notebook

### Example of search, its results, and fetching

#### Querying 

The base URL for data is http://data.rechtspraak.nl/uitspraken/ and just that URL will link you to some identifier/value lists

The **search** base is http://data.rechtspraak.nl/uitspraken/zoeken

Search parameters include: (again, see [the documentation](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf))
* from, max  (from is 0-based, max value of max is documented as 1000)
* sort - default is by modification date, ascending. `DESC` lets you do descending instead.
* type - `Uitspraak` or `Conclusie`
* date - date of this uitspraak / conclusie
* uitspraakdatum (date, or date range; optional)
* instantie - as mentioned in https://data.rechtspraak.nl/Waardelijst/Instanties 
* subject - rechtsgebied as mentioned in https://data.rechtspraak.nl/Waardelijst/Rechtsgebieden
* return - if you specify return=DOC you only get entries for which there is a document; if not you also get entries for which there is only metadata
* modified - last change of the metadata and/or text (with some subtleties, e.g. not necessarily of the uitspraak or conclusie document)
* replaces - previous ECLI, or LJN, for this case.  Meant for backwards compatibility (of searches?).


For example: http://data.rechtspraak.nl/uitspraken/zoeken?modified=2023-01-01&max=50
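
A minimal sketch of building such a search URL with only the standard library. `search_url` is a hypothetical name, not how `wetsuite.datacollect.rechtspraaknl.search` is implemented. Parameters are passed as a list of pairs rather than a dict so that a key can be repeated, which is how a date range on `modified` is expressed in the commented-out line of the search cell further down.

```python
from urllib.parse import urlencode

SEARCH_BASE = 'http://data.rechtspraak.nl/uitspraken/zoeken'


def search_url(params):
    """Build a search URL from a list of (name, value) pairs; repeated names are kept."""
    return SEARCH_BASE + '?' + urlencode(params)


# the example from the text
print(search_url([('modified', '2023-01-01'), ('max', '50')]))

# a date range: the same parameter given twice
print(search_url([('modified', '2023-10-01'), ('modified', '2023-11-01'), ('return', 'DOC')]))
```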

The response format is [Atom](https://en.wikipedia.org/wiki/Atom_(standard)) and very minimal: title, date, summary, and a URL pointing at the XML data document.
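
To give an idea of what reading such a feed by hand involves, here is a sketch using the standard library. The `SAMPLE` feed is hand-written for this example (values borrowed from one search result, element layout assumed to follow plain Atom), so the real response may differ in detail. `parse_atom_entries` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {'atom': 'http://www.w3.org/2005/Atom'}

# Hand-written stand-in for a search response; the real feed may contain more or different elements.
SAMPLE = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example feed</title>
  <entry>
    <id>ECLI:NL:GHDHA:2023:1970</id>
    <title>ECLI:NL:GHDHA:2023:1970, Gerechtshof Den Haag, 31-10-2023, 200.317.810/02</title>
    <summary>Vertegenwoordiging.</summary>
    <updated>2023-11-07T09:00:07Z</updated>
    <link rel="alternate" href="https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:GHDHA:2023:1970"/>
  </entry>
</feed>'''


def parse_atom_entries(xml_text):
    """Return one dict per Atom <entry>, with empty strings for missing elements."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall('atom:entry', ATOM_NS):
        link = entry.find('atom:link', ATOM_NS)
        entries.append({
            'ecli':    entry.findtext('atom:id', default='', namespaces=ATOM_NS),
            'title':   entry.findtext('atom:title', default='', namespaces=ATOM_NS),
            'summary': entry.findtext('atom:summary', default='', namespaces=ATOM_NS),
            'updated': entry.findtext('atom:updated', default='', namespaces=ATOM_NS),
            'link':    link.get('href', '') if link is not None else '',
        })
    return entries


for parsed_entry in parse_atom_entries(SAMPLE):
    print(parsed_entry)
```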

...but let's get our code to help:

In [2]:
search_results = wetsuite.datacollect.rechtspraaknl.search( [
    ('max',  '50000'), 
    ('return', 'DOC'),                                         # DOC asks for things with body text only
    #('modified', '2023-10-01'), ('modified', '2023-11-01')     # date range    (keep in mind that larger ranges easily means we hit the max)
    ('modified', '2023-11-01'),
] )
# search_results is a parsed etree object.

# We could show that relatively raw like    
#print( wetsuite.helpers.etree.debug_pretty(search_results) ) 
#  yet our parsed form (each entry as a dict) is a little simpler to read:
entries = wetsuite.datacollect.rechtspraaknl.parse_search_results( search_results )
print( "Entries in results: %d\n"%len(entries) )
for entry in random.sample( entries, 3 ): # show a few random examples
    print('--------------------')
    pprint.pprint(entry)

https://data.rechtspraak.nl/uitspraken/zoeken?max=50000&return=DOC&modified=2023-11-01


Entries in results: 3324

--------------------
{'ecli': 'ECLI:NL:GHDHA:2023:1970',
 'link': 'https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:GHDHA:2023:1970',
 'summary': 'Vertegenwoordiging. Nieuwe contractspartij? Vereenzelviving. '
            'Misbruik van identiteit. Aangaan van onbetaald gelaten schulden.',
 'title': 'ECLI:NL:GHDHA:2023:1970, Gerechtshof Den Haag, 31-10-2023, '
          '200.317.810/02',
 'updated': '2023-11-07T09:00:07Z',
 'xml': 'https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:GHDHA:2023:1970'}
--------------------
{'ecli': 'ECLI:NL:RBNHO:2021:9348',
 'link': 'https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:RBNHO:2021:9348',
 'summary': 'WAHV. Ondanks de begrijpelijke stresssituatie waarin betrokkene '
            'verkeerde, had betrokkene, net zoals iedere weggebruiker, zich '
            'aan de regels moeten houden. Dat betrokkene er al dan niet bewust '
            'voor heeft gekozen om het rode licht te negeren, dient dan ook '

#### Fetching the documents the search refers to

In [3]:
rechtspraak_fetched = wetsuite.helpers.localdata.LocalKV('rechtspraak_fetched.db', key_type=str, value_type=bytes)

In [4]:
paths = collections.defaultdict(int)

count_fetched, count_cached = 0,0

pbar = wetsuite.helpers.notebook.progress_bar( len(entries), description='fetching... ')

for entry in entries:
    entry_xml_url = 'https://data.rechtspraak.nl/uitspraken/content?id=%s'%entry['ecli']
    bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( rechtspraak_fetched, entry_xml_url)
    if came_from_cache:
        count_cached +=1
    else:
        count_fetched += 1
    
    tree  = wetsuite.helpers.etree.fromstring( bytestring )
    tree  = wetsuite.helpers.etree.strip_namespace( tree )
        
    pbar.value += 1
    pbar.description = f"Fetched {count_fetched}, cached {count_cached}  "

print(f"Fetched {count_fetched},  while {count_cached} were cached\n") # because the progress bar doesn't update after iterating



fetching... :   0%|          | 0/3324 [00:00<?, ?it/s]

Fetched 241,  while 3083 were cached
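
The idea behind the cached fetch used above can be sketched as follows. This is not the implementation of `wetsuite.helpers.localdata.cached_fetch`; it is a simplified stand-in that assumes a dict-like store and an injectable `fetch` function (so it can be tried without touching the network), plus a short pause after each real download to stay polite to the server.

```python
import time


def cached_fetch_sketch(store, url, fetch, delay_sec=1.0):
    """Return (data, came_from_cache). Only calls fetch(url) when url is not in store yet."""
    if url in store:
        return store[url], True
    data = fetch(url)          # in real use this could be e.g. urllib.request.urlopen(url).read()
    store[url] = data
    time.sleep(delay_sec)      # pause only after real downloads, never for cache hits
    return data, False


def fake_fetch(url):
    """Stand-in for a download, so the example runs offline."""
    return b'<open-rechtspraak/>'


store = {}
url = 'https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608'
print(cached_fetch_sketch(store, url, fake_fetch, delay_sec=0))
print(cached_fetch_sketch(store, url, fake_fetch, delay_sec=0))
```

Keying the store on the URL, as the notebook does, means a re-run only downloads what is new.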



## What does the data XML look like, and what can I easily do with it?

In [10]:
if 0: # let's have a cherry-picked example
    bytestring = wetsuite.helpers.net.download('https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBZWB:2020:5807') 
else: # or a random example   (note that a lot of them will be without text, that's normal)
    _, bytestring = rechtspraak_fetched.random_choice()

example_tree = wetsuite.helpers.etree.fromstring( bytestring )
print( wetsuite.helpers.etree.debug_pretty( example_tree ) ) # print indented
#pprint.pprint( wetsuite.datacollect.rechtspraaknl.parse_content( example_tree ) )

<open-rechtspraak>
  <RDF>
    <Description>
      <identifier>ECLI:NL:HR:2012:BW0936</identifier>
      <format>text/xml</format>
      <accessRights>public</accessRights>
      <modified>2021-07-14T15:47:40</modified>
      <issued label="Publicatiedatum">2013-04-05</issued>
      <publisher resourceIdentifier="http://rechtspraak.nl/">Raad voor de Rechtspraak</publisher>
      <language>nl</language>
      <replaces label="Vervangt">BW0936</replaces>
      <creator resourceIdentifier="http://standaarden.overheid.nl/owms/terms/Hoge_Raad_der_Nederlanden" scheme="overheid.RechterlijkeMacht" label="Instantie">Hoge Raad</creator>
      <date label="Uitspraakdatum">2012-04-06</date>
      <zaaknummer label="Zaaknr">11/02989</zaaknummer>
      <type resourceIdentifier="http://psi.rechtspraak.nl/uitspraak" language="nl">Uitspraak</type>
      <procedure resourceIdentifier="http://psi.rechtspraak.nl/procedure#cassatie" language="nl" label="Procedure">Cassatie</procedure>
      <coverage>NL</c

It seems there are specific ideas about which content should go in which structure (e.g. para in parablock in paragroup, overview information in uitspraak.info).

But at the same time, that structure is missing from a lot of documents.

TODO: see if that's a thing over time.

In [11]:
from importlib import reload
reload(wetsuite.datacollect.rechtspraaknl)

pprint.pprint(   wetsuite.datacollect.rechtspraaknl.parse_content( example_tree )   )

{'creator': 'Hoge Raad',
 'date': '2012-04-06',
 'identifier': 'ECLI:NL:HR:2012:BW0936',
 'inhoudsindicatie': 'Vennootschapsbelasting, artikel 13, '
                     'deelnemingsvrijstelling van toepassing op '
                     'beëindigingsvergoeding?',
 'issued': '2013-04-05',
 'modified': '2021-07-14T15:47:40',
 'publisher': 'Raad voor de Rechtspraak',
 'replaces': 'BW0936',
 'subject': 'Bestuursrecht; Belastingrecht',
 'type': 'Uitspraak',
 'uitspraak': '\n'
              '6 april 2012\n'
              'nr. 11/02989\n'
              '\n'
              'Arrest\n'
              '\n'
              'gewezen op het beroep in cassatie van X Holding B.V. te Z '
              '(hierna: belanghebbende) tegen de uitspraak van het Gerechtshof '
              "te 's-Gravenhage van 24 mei 2011, nr. BK-10/00429, betreffende "
              'een aanslag in de vennootschapsbelasting en de daarbij gegeven '
              'boetebeschikking.\n'
              '\n'
              '1. Het geding 

## Inspect the fetched documents, looking for their text

Like in the exploration of the BWB and CVDR data, note that there are [schemas](https://www.rechtspraak.nl/SiteCollectionDocuments/Schema-Open-Data-voor-de-Rechtspraak.zip)
for the text's structure, but we should check how closely they are followed in practice.

Regardless of that, we also need to decide how to flatten that text when we want plain text,
which we do for this dataset.
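
One simple way to flatten such a document, shown here as a sketch and not as what `rechtspraaknl.parse_content()` does: take each `para` element as one line, and fall back to all text when a document has no `para` elements at all. The sample XML is made up, namespaces are assumed to be stripped already, and nested `para` elements (for example inside tables) would be emitted twice by this naive version.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document; real documents are much larger and more varied.
SAMPLE = ('<uitspraak><uitspraak.info><para>6 april 2012</para><para>nr. 11/02989</para>'
          '</uitspraak.info><parablock><para>Arrest <emphasis>gewezen</emphasis> op het beroep.</para>'
          '</parablock></uitspraak>')


def squeeze(text):
    """Collapse runs of whitespace into single spaces."""
    return ' '.join(text.split())


def flatten_paras(node):
    """One line of text per <para>; if there are none, all text of the node on one line."""
    lines = []
    for para in node.iter('para'):
        text = squeeze(''.join(para.itertext()))
        if text:
            lines.append(text)
    if not lines:  # structure missing: fall back to whatever text there is
        return squeeze(''.join(node.itertext()))
    return '\n'.join(lines)


print(flatten_paras(ET.fromstring(SAMPLE)))
```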

One of the things we do is counting paths, like in the mentioned [cvdr_docstructure](extras_datacollect_koop_cvdr_docstructure.ipynb) and [bwb_docstructure](extras_datacollect_koop_bwb_docstructure.ipynb) notebooks.

As of this writing, that has guided how rechtspraaknl.parse_content() is implemented, though this needs more work.
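
For readers who want to see what counting paths amounts to, here is a standard-library sketch. It is an illustration of the idea, not the code of `wetsuite.helpers.etree.path_count`, and the sample XML is made up.

```python
import collections
import xml.etree.ElementTree as ET


def count_paths_sketch(root):
    """Count how often each tag path (root/child/grandchild...) occurs under root, root included."""
    counts = collections.Counter()

    def walk(node, prefix):
        path = node.tag if not prefix else prefix + '/' + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, '')
    return counts


SAMPLE = ('<conclusie><conclusie.info><para/><para><emphasis/></para></conclusie.info>'
          '<footnote/></conclusie>')

for path, count in sorted(count_paths_sketch(ET.fromstring(SAMPLE)).items()):
    print('%7d   %s' % (count, path))
```

Summing such counters over a sample of documents shows which structures are common enough to handle explicitly when flattening.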

In [12]:
count_paths = collections.defaultdict(int)

for key, xmldoc_bytes in rechtspraak_fetched.random_sample( 100 ): # we want a small selection to get only a reasonable amount of things to review
    tree = wetsuite.helpers.etree.fromstring( xmldoc_bytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    print(  )
    print( '-----------------------------------' )
    print( key )

    if 0: # check whether there are any other nodes beyond RDF, inhoudsindicatie, uitspraak, conclusie -- looks like no.
        childnames = list( node.tag  for node in tree.findall('*') )
        childnames.remove('RDF')
        if 'inhoudsindicatie' in childnames:
            childnames.remove('inhoudsindicatie')
        if 'uitspraak' in childnames:
            childnames.remove('uitspraak')
        if 'conclusie' in childnames:
            childnames.remove('conclusie')
        if len(childnames) > 0:
            print( childnames )

    uitspraak = tree.find('uitspraak')
    conclusie = tree.find('conclusie')
    
    if uitspraak is not None:

        for path, count in wetsuite.helpers.etree.path_count( uitspraak ).items():
            count_paths[path] += count

        try:
            parsed = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
            print( parsed['uitspraak'] )
        except Exception as e:
            print( 'ERROR', e )
            print( wetsuite.helpers.etree.tostring(uitspraak).decode('u8') )
            #raise

    elif conclusie is not None:
        for path, count in wetsuite.helpers.etree.path_count( conclusie ).items():
            count_paths[path] += count

        try:
            parsed = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
            print(   parsed['conclusie'] )
        except Exception as e:
            print( 'ERROR', e )
            print( wetsuite.helpers.etree.tostring(conclusie).decode('u8') )
            #raise



-----------------------------------
https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBROT:2018:1663

-----------------------------------
https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:HR:2009:BJ6967

29 september 2009
Strafkamer
nr. 08/03868 A

Hoge Raad der Nederlanden

Arrest

op het beroep in cassatie tegen een vonnis van het Gemeenschappelijk Hof van Justitie van de Nederlandse Antillen en Aruba van 27 mei 2008, nummer H-39/2008, in de strafzaak tegen:
[Verdachte], geboren op [geboorteplaats] op [geboortedatum] 1967, ten tijde van de betekening van de aanzegging gedetineerd in het Huis van Bewaring op Curaçao (Nederlandse Antillen). 

1. Geding in cassatie

1.1. Het beroep is ingesteld door de verdachte. Namens deze hebben mr. J. Goudswaard en mr. I. van Straalen, beiden advocaat te 's-Gravenhage, bij schriftuur middelen van cassatie voorgesteld. De schriftuur is aan dit arrest gehecht en maakt daarvan deel uit. 
De Advocaat-Generaal Vegter heeft geconcludeerd

In [22]:
# Show those counted paths
pci = list( count_paths.items() )
pci.sort( key=lambda x:x[0] )
for path, count in pci:
    print('%7d   %s'%(count, path))

    488   conclusie
     25   conclusie/bridgehead
     87   conclusie/conclusie.info
     11   conclusie/conclusie.info/bridgehead
     15   conclusie/conclusie.info/informaltable
     15   conclusie/conclusie.info/informaltable/tgroup
     30   conclusie/conclusie.info/informaltable/tgroup/colspec
     15   conclusie/conclusie.info/informaltable/tgroup/tbody
     69   conclusie/conclusie.info/informaltable/tgroup/tbody/row
    138   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry
    173   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry/para
     34   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry/para/emphasis
   1062   conclusie/conclusie.info/para
     20   conclusie/conclusie.info/para/emphasis
    485   conclusie/conclusie.info/parablock
    740   conclusie/conclusie.info/parablock/para
    172   conclusie/conclusie.info/parablock/para/emphasis
      6   conclusie/conclusie.info/parablock/para/footnote-ref
   5118   conclusie/footnote
 

## Start making a dataset

In [5]:
rechtspraaknl_extracted = wetsuite.helpers.localdata.MsgpackKV('rechtspraaknl_extracted.db', str, None)

In [14]:
# how much do we have at all?
len( rechtspraak_fetched )

3182815

In [19]:
# speed optimization: ignore cases we previously noticed have no text (based on URL, rather than unparsed or parsed document)
#   Since this reads gigabytes of data, expect this to take a few minutes
import tqdm
known_notext = set()

print("known-notext store: figuring out new cases")
for key, bytestring in tqdm.tqdm( rechtspraak_fetched.items() ):
    if key in known_notext: # (this condition only helps for re-runs in the same session that update from newer fetched state)
        continue
    if b'<conclusie' not in bytestring and b'<uitspraak' not in bytestring: # cheaper than parsing before deciding based on the parse?
        known_notext.add( key )

print("known-notext store: updating")
rechtspraaknl_knownnotext = wetsuite.helpers.localdata.LocalKV('rechtspraaknl_knownnotext.db', str, None)
print( "  items before update:", len(rechtspraaknl_knownnotext) )
current_keys = set( rechtspraaknl_knownnotext.keys() )
for url in known_notext:
    if url not in current_keys:
        rechtspraaknl_knownnotext.put(url, 'y', commit=False)
rechtspraaknl_knownnotext.commit()
print( "  items after update:", len(rechtspraaknl_knownnotext) )
#rechtspraaknl_extracted._put_meta('valtype','msgpack') # I do believe that's internal now; TODO: test

known-notext store: figuring out figuring out new cases


100%|██████████| 3182815/3182815 [01:22<00:00, 38777.31it/s] 


known-notext store: updating
2464757
2464757
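
The byte-level test used above is a shortcut for "does this document have an uitspraak or conclusie element". A sketch comparing it to a parse-based check, on two made-up documents (the namespace URI is a placeholder): both function names are hypothetical. The shortcut would also fire on any element whose name merely starts with `uitspraak` or `conclusie`, and would miss elements written with a namespace prefix, so it is worth spot-checking against the parsed answer.

```python
import xml.etree.ElementTree as ET

# Made-up documents; the namespace URI is a placeholder, not the real one.
WITH_TEXT = (b'<open-rechtspraak><RDF/><uitspraak xmlns="http://example.invalid/ns">'
             b'<para>x</para></uitspraak></open-rechtspraak>')
WITHOUT_TEXT = b'<open-rechtspraak><RDF/></open-rechtspraak>'


def probably_has_text(xml_bytes):
    """Cheap check on raw bytes, no parsing."""
    return b'<uitspraak' in xml_bytes or b'<conclusie' in xml_bytes


def has_text_by_parse(xml_bytes):
    """Slower check: parse, then look at the local names of the root's direct children."""
    root = ET.fromstring(xml_bytes)
    local_names = [child.tag.rsplit('}', 1)[-1] for child in root]
    return 'uitspraak' in local_names or 'conclusie' in local_names


for doc in (WITH_TEXT, WITHOUT_TEXT):
    print(probably_has_text(doc), has_text_by_parse(doc))
```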


### (incremental) parsing and storing

In [22]:
# these are used in the update section
known_notext_urls      = set( rechtspraaknl_knownnotext.keys() )     # actually a few hundred MB.   Also, set, not list, so that 'in' isn't slow as molasses.

already_extracted_urls = set( rechtspraaknl_extracted.keys() )

In [30]:
from importlib import reload
#reload(wetsuite.datacollect.rechtspraaknl)


count_uitspraken, count_conclusies, count_neither, count_skip  =  0, 0, 0, 0

selected_keys = list(rechtspraak_fetched.keys()) # all keys
random.shuffle(selected_keys)
#selected_keys = random.sample( selected_keys, 50000 ) # uncomment during debug, if you want to get just a subset, faster
pbar = wetsuite.helpers.notebook.progress_bar( len(selected_keys), description='parsing...')
import time
for key in selected_keys:
    pbar.value += 1
    #time.sleep(0.0000001)
    if pbar.value % 25 == 0:
        pbar.description = f"{count_conclusies} conclusies, {count_uitspraken} uitspraken, {count_neither} neither, {count_skip} skipped " 

    if key in known_notext_urls: # we previously figured it had no text, no sense trying
        count_neither += 1
        continue
    
    if key in already_extracted_urls:  # only when adding incrementally; won't update
        count_skip += 1
        continue

    # actually load
    bytestring = rechtspraak_fetched.get(key)

    if b'<conclusie' not in bytestring and b'<uitspraak' not in bytestring: # cheaper than parsing before deciding based on the parse?
        count_neither += 1
        continue

    # actually parse
    tree = wetsuite.helpers.etree.fromstring( bytestring )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    uitspraak = tree.find('uitspraak')
    conclusie = tree.find('conclusie')
    
    if uitspraak is not None:
        count_uitspraken += 1
        rechtspraaknl_extracted.put(key,  wetsuite.datacollect.rechtspraaknl.parse_content( tree )  )
        #collected[key] = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
        
    elif conclusie is not None:
        count_conclusies += 1
        rechtspraaknl_extracted.put(key,  wetsuite.datacollect.rechtspraaknl.parse_content( tree )  )
        #collected[key] = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
        
    else: # actually shouldn't happen
        count_neither += 1
    #    print( wetsuite.helpers.etree.debug_pretty( example_tree ) ) # print indented
    #    #raise ValueError()
    #    break

print(f"{count_conclusies} conclusies and {count_uitspraken} uitspraken   (and {count_neither} that have no text)")       

parsing...:   0%|          | 0/3182815 [00:00<?, ?it/s]

17 conclusies and 206743 uitspraken   (and 2464757 that have no text)
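
The incremental logic of the loop above, reduced to its decision order and sketched with plain dicts and sets in place of the on-disk stores. `incremental_extract` and the `parse` argument are hypothetical names for this illustration.

```python
def incremental_extract(fetched, extracted, known_notext, parse):
    """Fill extracted from fetched, skipping keys known to have no text or already done."""
    counts = {'parsed': 0, 'skipped': 0, 'notext': 0}
    for key, raw in fetched.items():
        if key in known_notext:            # decided earlier: no text, don't even look
            counts['notext'] += 1
        elif key in extracted:             # done in an earlier run; not updated
            counts['skipped'] += 1
        elif b'<uitspraak' not in raw and b'<conclusie' not in raw:
            known_notext.add(key)          # remember, so the next run is cheaper
            counts['notext'] += 1
        else:
            extracted[key] = parse(raw)
            counts['parsed'] += 1
    return counts


fetched = {
    'url-a': b'<open-rechtspraak><RDF/></open-rechtspraak>',
    'url-b': b'<open-rechtspraak><RDF/><uitspraak>tekst</uitspraak></open-rechtspraak>',
}
extracted, known_notext = {}, set()

print(incremental_extract(fetched, extracted, known_notext, parse=len))  # first run does the work
print(incremental_extract(fetched, extracted, known_notext, parse=len))  # second run only skips
```

Note that, like the loop above, this never re-parses a document that was fetched again after a change; that would need a comparison on something like the `modified` field.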
