
# Purpose of this notebook

This notebook explores what you can get out of rechtspraak.nl.

Note that this is mostly just showing our work: _if_ the rechtspraak dataset we provide suits your needs,
then using that will be faster and less cumbersome, and you won't need to run this notebook,
which would do little more than recreate that same dataset.

Somewhat related: [extras_datacollect_rechtspraak_codes](extras_datacollect_rechtspraak_codes.ipynb)

## Website, Open data API, and some other stuff

You are probably familiar with the [rechtspraak.nl](https://rechtspraak.nl) website and its [search that has a number of filters](https://uitspraken.rechtspraak.nl/#!/) (and some [extra query logic for the text](https://www.rechtspraak.nl/Uitspraken/Paginas/Hulp-bij-zoeken.aspx#1ab85aa0-e737-4b56-8ad5-d7cb7954718d77a998be-3c73-40e3-90f7-541fceeb00fd3)), which gives webpage results, with text where present.


There is also an open data API that exposes much the same in data form.
As [its documentation](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf) ([this intro](https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx) may also be useful) mentions,
- you stick query parameters on the base URL of http://data.rechtspraak.nl/uitspraken/zoeken 
- the results mainly mention ECLIs, which you can fetch details for via e.g. https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608 (more notes below)


Worthy of note:
- the fields you can search mostly match those of the 'Uitgebreid zoeken' at uitspraken.rechtspraak.nl, like
  - instantie / court code (basically that third element in the ECLI)
  - rechtsgebied
  - procedure 

- **You can't search in the body text**.  This largely limits the API to a 'keep updated with new cases' feed.  
  - Worthy of note: the website search (queried via `https://uitspraken.rechtspraak.nl/api/zoek`) does support this, and even seems like a better data API than the _actual_ one -- but it doesn't look like it's supposed to be used externally.

- there are plenty of cases where there is no text / document.  You can filter for this in the search.

- the **identifiers** used are **ECLI** (European Case Law Identifier)
  - which in this case will be mainly Dutch ECLIs  (`ECLI:NL:`...). 
    - used since 2013 or so, and which absorbed the previously used LJN (Landelijk Jurisprudentie Nummer) identifiers.
  - court code XX (`ECLI:NL:XX:`...) is used for things not (yet?) assigned to a court, and/or non-Dutch ECLIs,
    - rechtspraak.nl may later resolve such ECLIs to a different ECLI. That makes their site the most up-to-date source, with changes that a mirror may not be aware of (yet). 
    - Example: ECLI:NL:XX:2009:BJ4574 
      - in [XML metadata](https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:XX:2009:BJ4574) mentions it's now (isReplacedBy) ECLI:CE:ECHR:2009:0528JUD002671305
      - in [webpage form](https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:XX:2009:BJ4574) also seems to link to [a place to find it](https://hudoc.echr.coe.int/eng#%7B%22ecli%22:%5B%22ECLI:CE:ECHR:2009:0528JUD002671305%22%5D%7D) (e.g. `hudoc.echr.coe.int` or `e-justice.europa.eu`).
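
Since the court code is just the third element of the ECLI, splitting an identifier into its parts is easy to sketch. The following is a minimal illustration and not part of wetsuite; the names `Ecli` and `parse_ecli` are made up for this example, and it only checks the overall five-part shape rather than validating against the ECLI specification.

```python
from typing import NamedTuple


class Ecli(NamedTuple):
    """The four variable parts of an ECLI (hypothetical helper type)."""
    country: str  # e.g. 'NL'
    court: str    # court code, e.g. 'PHR'; 'XX' as discussed above
    year: str
    number: str   # e.g. 'BP5608'


def parse_ecli(text: str) -> Ecli:
    """Split 'ECLI:NL:PHR:2011:BP5608' into its parts.

    Only checks that there are five colon-separated parts and that
    the first is 'ECLI'; this is not a full validity check.
    """
    parts = text.strip().split(':')
    if len(parts) != 5 or parts[0].upper() != 'ECLI':
        raise ValueError('does not look like an ECLI: %r' % text)
    _, country, court, year, number = parts
    return Ecli(country.upper(), court.upper(), year, number)


print(parse_ecli('ECLI:NL:PHR:2011:BP5608').court)             # PHR
print(parse_ecli('ECLI:CE:ECHR:2009:0528JUD002671305').court)  # ECHR
```

This makes it simple to, for example, group fetched documents by court code, or to pick out the `XX` cases for a later re-check.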

For each ECLI you might consider various URLs, including
  - XML, e.g. at https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608
    - most interesting when you want case details as data
    - (there is a different XML form at https://uitspraken.rechtspraak.nl/api/document/?id=ECLI:NL:PHR:2011:BP5608 but this is for the webpage view, and is less interesting as data)
  - the case on the website
    - is linked by the website as   https://uitspraken.rechtspraak.nl/InzienDocument?id=ECLI:NL:PHR:2011:BP5608
    - it seems the slightly shorter https://deeplink.rechtspraak.nl/uitspraak?id=ECLI:NL:PHR:2011:BP5608 is equivalent
    - both of the above redirect to a URL like https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:PHR:2011:BP5608
      - which is a general page with scripting that picks up that identifier and then does another request to https://uitspraken.rechtspraak.nl/api/document/?id=ECLI:NL:PHR:2011:BP5608
  - The website also often links to LiDo, e.g.  https://linkeddata.overheid.nl/document/ECLI:NL:PHR:2011:BP5608
    - If you want that as data, consider http://linkeddata.overheid.nl/service/get-links?ext-id=ECLI:NL:PHR:2011:BP5608&output=xml
    - though as https://linkeddata.overheid.nl/front/portal/services notes, this is not part of public LiDo, so you'll need to request an account first
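
The URL patterns listed above are regular enough to generate from an ECLI. This is a small sketch using only the patterns mentioned in this notebook; the function name `urls_for_ecli` and the dictionary keys are made up for illustration.

```python
from urllib.parse import quote


def urls_for_ecli(ecli: str) -> dict:
    """Build the URLs discussed above for one ECLI (hypothetical helper)."""
    q = quote(ecli, safe=':')  # keep the colons readable, escape anything unusual
    return {
        'xml':      'https://data.rechtspraak.nl/uitspraken/content?id=' + q,
        'website':  'https://uitspraken.rechtspraak.nl/InzienDocument?id=' + q,
        'deeplink': 'https://deeplink.rechtspraak.nl/uitspraak?id=' + q,
        'lido':     'https://linkeddata.overheid.nl/document/' + q,
    }


for name, url in urls_for_ecli('ECLI:NL:PHR:2011:BP5608').items():
    print('%-10s %s' % (name, url))
```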


Because there are over three million Dutch ECLIs on record, getting a lot of data from here would take a while, and they ask you to be nice to their server.
There used to be a ZIP file (linked from [https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx](https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx)) you could download to bootstrap your own copy, but [this open data request](https://data.overheid.nl/community/datarequest/zip-bestand-alle-uitspraken) seems to confirm this was removed (and seems to imply that you should get it the hammery way, i.e. one request per document).  As of this writing the URL to the ZIP file still works, but they probably removed the link for a reason.


In [3]:
import pprint, collections, random

import wetsuite.helpers.etree
import wetsuite.datacollect.rechtspraaknl
import wetsuite.helpers.koop_parse
import wetsuite.helpers.meta
import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.helpers.notebook

### Example of search, and its results, and fetching

#### Querying 

Base URL is http://data.rechtspraak.nl/uitspraken/
: just that will give you some identifier/value lists

The **search** base is http://data.rechtspraak.nl/uitspraken/zoeken

Search parameters include:
* from, max
  * from is 0-based
  * max value of max is 1000
* sort - 
* type - `Uitspraak` or `Conclusie`
* date - date of this uitspraak / conclusie
* uitspraakdatum (date, or date range; optional)
* instantie - as mentioned in https://data.rechtspraak.nl/Waardelijst/Instanties 
* subject - rechtsgebied as mentioned in https://data.rechtspraak.nl/Waardelijst/Rechtsgebieden
* return - if you specify return=DOC you only get entries for which there is a document; if not you also get entries for which there is only metadata
* modified - 
* replaces - 

For example: http://data.rechtspraak.nl/uitspraken/zoeken?modified=2023-01-01&max=50
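
Such a URL can be put together with the standard library. This is only a sketch of the URL construction (the `search_url` name is made up here, and it does no fetching); passing the parameters as a list of pairs rather than a dict means a name can be repeated.

```python
from urllib.parse import urlencode

SEARCH_BASE = 'https://data.rechtspraak.nl/uitspraken/zoeken'


def search_url(params):
    """Build a search URL from a list of (name, value) pairs (hypothetical helper)."""
    return SEARCH_BASE + '?' + urlencode(params)


print(search_url([('modified', '2023-01-01'), ('max', '50')]))
# https://data.rechtspraak.nl/uitspraken/zoeken?modified=2023-01-01&max=50
```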

Responses are in [Atom](https://en.wikipedia.org/wiki/Atom_(standard)) format but are very minimal, amounting to title, date, summary, and a pointer to the XML data.
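
To give an idea of what handling such a feed involves, here is a stdlib-only sketch. The sample feed is hand-written for this example: its values are taken from a parsed entry in this notebook, but its exact element layout (for instance that `id` carries the ECLI) is an assumption, so check a real response before relying on it. `parse_atom_entries` is a made-up name, not the wetsuite function.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'  # the standard Atom namespace

# Hand-written sample; the element layout is an assumption, not a captured response.
SAMPLE_FEED = b'''<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>ECLI:NL:RBAMS:2023:6704</id>
    <title>ECLI:NL:RBAMS:2023:6704, Rechtbank Amsterdam, 25-10-2023, 13-040321-22</title>
    <summary>Vrijspraak na bewijsuitsluiting wegens onherstelbare vormverzuimen.</summary>
    <updated>2023-11-01T09:04:44Z</updated>
    <link rel="alternate" href="https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:RBAMS:2023:6704"/>
  </entry>
</feed>'''


def parse_atom_entries(feed_bytes):
    """Turn each Atom entry into a small dict (hypothetical helper)."""
    root = ET.fromstring(feed_bytes)
    entries = []
    for entry in root.iter(ATOM + 'entry'):
        link = entry.find(ATOM + 'link')
        entries.append({
            'ecli':    entry.findtext(ATOM + 'id'),
            'title':   entry.findtext(ATOM + 'title'),
            'summary': entry.findtext(ATOM + 'summary'),
            'updated': entry.findtext(ATOM + 'updated'),
            'link':    link.get('href') if link is not None else None,
        })
    return entries


for item in parse_atom_entries(SAMPLE_FEED):
    print(item['ecli'], item['updated'])
```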

...but let's get our code to help:

In [4]:
params = [
    ('max', '50000'), 
    ('return', 'DOC'),                                         # DOC asks for things with body text only
    #('modified', '2023-10-01'), ('modified', '2023-11-01')     # date range    (keep in mind that larger ranges easily mean we hit the max)
    ('modified', '2023-11-01'),
]
search_results = wetsuite.datacollect.rechtspraaknl.search( params )
# search_results is a parsed etree object.
# We could show that relatively raw, like    print( wetsuite.helpers.etree.debug_pretty(search_results) ) 
#   yet our parsed form (each entry as a dict) is a little simpler:
entries = wetsuite.datacollect.rechtspraaknl.parse_search_results( search_results )
print( "Entries in results: %d\n"%len(entries) )
for entry in random.sample( entries, 3 ): # show a few random examples
    print('--------------------')
    pprint.pprint(entry)

https://data.rechtspraak.nl/uitspraken/zoeken?max=50000&return=DOC&modified=2023-11-01
Entries in results: 2245

--------------------
{'ecli': 'ECLI:NL:RBAMS:2023:6704',
 'link': 'https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:RBAMS:2023:6704',
 'summary': 'Vrijspraak na bewijsuitsluiting wegens onherstelbare '
            'vormverzuimen.',
 'title': 'ECLI:NL:RBAMS:2023:6704, Rechtbank Amsterdam, 25-10-2023, '
          '13-040321-22',
 'updated': '2023-11-01T09:04:44Z',
 'xml': 'https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBAMS:2023:6704'}
--------------------
{'ecli': 'ECLI:NL:GHSHE:2023:2475',
 'link': 'https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:GHSHE:2023:2475',
 'summary': 'Belanghebbende reikt een factuur met omzetbelasting uit en '
            'voldoet die niet op aangifte. De inspecteur legt een '
            'naheffingsaanslag op met een vergrijpboete. De naheffingsaanslag '
            'blijft in stand op grond van artikel 37 Wet OB 1968, o

#### Fetching the documents the search refers to

In [5]:
rechtspraak_fetched = wetsuite.helpers.localdata.LocalKV('rechtspraak_fetched.db', key_type=str, value_type=bytes)

In [9]:
paths = collections.defaultdict(int)

count_fetched, count_cached = 0,0

pbar = wetsuite.helpers.notebook.progress_bar( max=len(entries), description='fetching... ')

for entry in entries:
    entry_xml_url = 'https://data.rechtspraak.nl/uitspraken/content?id=%s'%entry['ecli']
    bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( rechtspraak_fetched, entry_xml_url)
    if came_from_cache:
        count_cached +=1
    else:
        count_fetched += 1
    #bytestring = rechtspraak_fetched.get( entry_xml_url )
    
    tree  = wetsuite.helpers.etree.fromstring( bytestring )
    tree  = wetsuite.helpers.etree.strip_namespace( tree )
        
    pbar.value += 1
    pbar.description = f"Fetched {count_fetched}, cached {count_cached}  "

print(f"Fetched {count_fetched},  while {count_cached} were cached\n") # because the progress bar doesn't update after iterating

fetching... :   0%|          | 0/2245 [00:00<?, ?it/s]
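
The point of fetching through a store is that re-running the cell does not hit their server again for documents we already have. The idea can be sketched with a plain dict and an injected fetch function; this is not wetsuite's implementation, only an illustration that returns the same kind of `(data, came_from_cache)` pair as used above.

```python
def cached_fetch(store, url, fetch):
    """Return (data, came_from_cache), fetching only when url is not in store.

    store: anything dict-like;  fetch: a function taking a URL, returning bytes.
    """
    if url in store:
        return store[url], True
    data = fetch(url)
    store[url] = data
    return data, False


# A stand-in for a real download function, so this example needs no network.
calls = []
def fake_fetch(url):
    calls.append(url)
    return b'<open-rechtspraak/>'


store = {}
url = 'https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608'
print(cached_fetch(store, url, fake_fetch))  # (b'<open-rechtspraak/>', False)
print(cached_fetch(store, url, fake_fetch))  # (b'<open-rechtspraak/>', True)
print(len(calls))                            # 1
```

A real fetch function could also sleep briefly between requests, in line with the request to be nice to their server.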

## What does the data XML look like, and what can I easily do with it?

In [13]:
# let's have a cherry-picked example
#bytestring = wetsuite.helpers.net.download('https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBZWB:2020:5807') 
# or a random example   (note that a lot of them will be without text, that's normal)
key, bytestring = rechtspraak_fetched.random_choice()

example_tree = wetsuite.helpers.etree.fromstring( bytestring )
print( wetsuite.helpers.etree.debug_pretty( example_tree ) ) # print indented
#pprint.pprint( wetsuite.datacollect.rechtspraaknl.parse_content( example_tree ) )

<open-rechtspraak>
  <RDF>
    <Description>
      <identifier>ECLI:NL:CRVB:2012:BY5260</identifier>
      <format>text/xml</format>
      <accessRights>public</accessRights>
      <modified>2013-05-01T12:01:51</modified>
      <issued label="Publicatiedatum">2013-04-05</issued>
      <publisher resourceIdentifier="http://rechtspraak.nl/">Raad voor de Rechtspraak</publisher>
      <language>nl</language>
      <replaces label="Vervangt">BY5260</replaces>
      <creator resourceIdentifier="http://standaarden.overheid.nl/owms/terms/Centrale_Raad_van_Beroep" scheme="overheid.RechterlijkeMacht" label="Instantie">Centrale Raad van Beroep</creator>
      <date label="Uitspraakdatum">2012-12-05</date>
      <zaaknummer label="Zaaknr">11-6185 WW</zaaknummer>
      <type resourceIdentifier="http://psi.rechtspraak.nl/uitspraak" language="nl">Uitspraak</type>
      <procedure resourceIdentifier="http://psi.rechtspraak.nl/procedure#hogerBeroep" language="nl" label="Procedure">Hoger beroep</procedu

## Inspect the fetched documents, looking for their text

Right now we're looking just to figure out whether our parsing makes sense, e.g. whether it picks out all the parts of the text.

Counting and later showing the document structure (via its paths) probably helps there.
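
To make concrete what counting paths means, here is a stdlib-only sketch on a tiny hand-written document. It is not the wetsuite `path_count` implementation (whose handling of `max_depth` may differ), and the namespace in the sample is a placeholder.

```python
import collections
import xml.etree.ElementTree as ET


def strip_ns(tag):
    """'{namespace}para' -> 'para'"""
    return tag.rsplit('}', 1)[-1]


def path_count(root, max_depth=9):
    """Count how often each element path occurs under root (sketch)."""
    counts = collections.Counter()

    def walk(node, prefix, depth):
        path = prefix + strip_ns(node.tag)
        counts[path] += 1
        if depth < max_depth:
            for child in node:
                walk(child, path + '/', depth + 1)

    walk(root, '', 1)
    return counts


# Hand-written sample; the namespace URI is a placeholder, not the real one.
SAMPLE = '''<uitspraak xmlns="http://example.org/placeholder">
  <uitspraak.info><para>a</para></uitspraak.info>
  <parablock><para>b</para><para>c <emphasis>d</emphasis></para></parablock>
</uitspraak>'''

sample_counts = path_count(ET.fromstring(SAMPLE))
for path, count in sorted(sample_counts.items()):
    print('%7d   %s' % (count, path))
```

Summing such counts over many documents shows which structures are common and which are rare, which is what we need to judge whether text extraction covers everything.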

In [21]:
count_paths = collections.defaultdict(int)


#for key in rechtspraak_fetched.items():
for key, xmldoc_bytes in rechtspraak_fetched.random_sample( 500 ):
    tree = wetsuite.helpers.etree.fromstring( xmldoc_bytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    if 0: # check whether there are any other nodes beyond RDF, inhoudsindicatie, uitspraak, conclusie -- looks like no.
        childnames = list( node.tag  for node in tree.findall('*') )
        childnames.remove('RDF')
        if 'inhoudsindicatie' in childnames:
            childnames.remove('inhoudsindicatie')
        if 'uitspraak' in childnames:
            childnames.remove('uitspraak')
        if 'conclusie' in childnames:
            childnames.remove('conclusie')
        if len(childnames)>0:
            print( childnames )

    uitspraak = tree.find('uitspraak')
    conclusie = tree.find('conclusie')
    
    if uitspraak is not None:
        for path, count in wetsuite.helpers.etree.path_count( uitspraak, max_depth=9 ).items():
            count_paths[path] += count
        wetsuite.datacollect.rechtspraaknl.parse_content( tree )
    elif conclusie is not None:
        for path, count in wetsuite.helpers.etree.path_count( conclusie, max_depth=9 ).items():
            count_paths[path] += count
        wetsuite.datacollect.rechtspraaknl.parse_content( tree )

    #break   # enable only when debugging: stops after the first document, so the counts would cover just one


#import wetsuite.helpers.notebook

#wetsuite.helpers.notebook.etree_visualize_selection( example_tree, '', reindent=True, mark_subtree=True )


b'<?xml version="1.0" encoding="utf-8"?>\r\n<open-rechtspraak>\r\n  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ecli="https://e-justice.europa.eu/ecli" xmlns:tr="http://tuchtrecht.overheid.nl/" xmlns:eu="http://publications.europa.eu/celex/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:bwb="bwb-dl" xmlns:cvdr="http://decentrale.regelgeving.overheid.nl/cvdr/" xmlns:psi="http://psi.rechtspraak.nl/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">\r\n    <rdf:Description>\r\n      <dcterms:identifier>ECLI:NL:GHARL:2019:10782</dcterms:identifier>\r\n      <dcterms:format>text/xml</dcterms:format>\r\n      <dcterms:accessRights>public</dcterms:accessRights>\r\n      <dcterms:modified>2019-12-17T00:01:17</dcterms:modified>\r\n      <dcterms:issued rdfs:label="Publicatiedatum">2019-12-16</dcterms:issued>\r\n      <dcterms:publisher resourceIdentifier="http://rechtspraak.nl/">Raad voor de Rechtspraak</dcterms:publisher>\r\n      <dcterms:language>nl</dcterms:la

In [22]:
# Show those counted paths
pci = list( count_paths.items() )
pci.sort( key=lambda x:x[0] )
for path, count in pci:
    print('%7d   %s'%(count, path))

    488   conclusie
     25   conclusie/bridgehead
     87   conclusie/conclusie.info
     11   conclusie/conclusie.info/bridgehead
     15   conclusie/conclusie.info/informaltable
     15   conclusie/conclusie.info/informaltable/tgroup
     30   conclusie/conclusie.info/informaltable/tgroup/colspec
     15   conclusie/conclusie.info/informaltable/tgroup/tbody
     69   conclusie/conclusie.info/informaltable/tgroup/tbody/row
    138   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry
    173   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry/para
     34   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry/para/emphasis
   1062   conclusie/conclusie.info/para
     20   conclusie/conclusie.info/para/emphasis
    485   conclusie/conclusie.info/parablock
    740   conclusie/conclusie.info/parablock/para
    172   conclusie/conclusie.info/parablock/para/emphasis
      6   conclusie/conclusie.info/parablock/para/footnote-ref
   5118   conclusie/footnote
 

## Start making a dataset

In [73]:
rechtspraaknl_extracted = wetsuite.helpers.localdata.MsgpackKV('rechtspraaknl_extracted.db', str, None)
#rechtspraaknl_extracted._put_meta('valtype','msgpack') # I do believe that's internal now; TODO: test

In [14]:
# how much do we have at all?
len( rechtspraak_fetched )

3182393

Like in [datacollect_koop_docstructure_cvdr](datacollect_koop_docstructure_cvdr.ipynb) and [_bwb](datacollect_koop_docstructure_bwb.ipynb), 
let's point out there are [schemas](https://www.rechtspraak.nl/SiteCollectionDocuments/Schema-Open-Data-voor-de-Rechtspraak.zip) for the text's structure -- and look at whether they are actually followed.

In [15]:
count_uitspraken, count_conclusies, count_neither = 0,0,0
dataset = {}

sample = random.sample( rechtspraak_fetched.keys(), 50000 )
#sample = fetched_keys
pbar = wetsuite.helpers.notebook.progress_bar( max=len(sample), description='parsing...')

for key in sample:
#for key in fetched_keys:
    tree = wetsuite.helpers.etree.fromstring( rechtspraak_fetched.get(key) )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    uitspraak = tree.find('uitspraak')
    conclusie = tree.find('conclusie')
    
    if uitspraak is not None:
        count_uitspraken += 1
        dataset[key] = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
        
    elif conclusie is not None:
        count_conclusies += 1
        dataset[key] = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
        
    else:
        count_neither += 1
    #    print( wetsuite.helpers.etree.debug_pretty( example_tree ) ) # print indented
    #    #raise ValueError()
    #    break
    pbar.value += 1

print(f"{count_conclusies} conclusies and {count_uitspraken} uitspraken   (and {count_neither} that have no text)")       

parsing...:   0%|          | 0/50000 [00:00<?, ?it/s]



488 conclusies and 10840 uitspraken   (and 38672 that have no text)



In [76]:
#  ...then actually put the parsed items into a store we can call a dataset.
for key, d in dataset.items():
    rechtspraaknl_extracted.put( key, d, commit=False )
rechtspraaknl_extracted.commit()

In [19]:
# basic sanity check inspection
print( len(dataset) )
random.sample( list(dataset.items()), 5 )   # random.sample needs a sequence; a dict view raises TypeError in Python 3.11+

11328


[('https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBSGR:2011:BV3576',
  {'identifier': 'ECLI:NL:RBSGR:2011:BV3576',
   'issued': '2013-04-05',
   'publisher': 'Raad voor de Rechtspraak',
   'replaces': 'BV3576',
   'date': '2011-11-23',
   'type': 'Uitspraak',
   'modified': '2022-04-29T10:39:53',
   'zaaknummer': '1081633/11-5167',
   'creator': "Rechtbank 's-Gravenhage",
   'subject': 'Civiel recht',
   'inhoudsindicatie': [([],
     ['Einde huurovereenkomst bedrijfsruimte door opzegging van die overeenkomst door curator. ',
      'Vraag of schade van € 24.000,00, die bij einde huur wordt geconstateerd, boedelschuld is. ',
      'Kantonrechter : ',
      '- Uitgangspunt in deze procedure is dat de schade van € 24.000,00 is ontstaan in de periode dat de failliet huurder was. ',
      '- In dit geval geen schade die naar haar aard pas na het beëindigen van de huurovereenkomst vergoed moet worden. ',
      '- Schadevergoedingsverplichting vloeit dan ook niet voort uit de opze

There seem to be specific ideas about what belongs in which structure (e.g. para in parablock in paragroup, and overview information in uitspraak.info).
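
If documents do not always follow that nesting, one forgiving approach is to ignore the containers and collect the text of every `para`, wherever it sits. The following is a sketch of that idea on a hand-written sample (placeholder namespace, made-up text); it is not what `parse_content` does, and it would repeat text if a `para` were ever nested inside another `para`.

```python
import xml.etree.ElementTree as ET


def para_texts(root):
    """Return the whitespace-normalized text of each non-empty para element."""
    texts = []
    for node in root.iter():
        if node.tag.rsplit('}', 1)[-1] == 'para':  # compare the tag without its namespace
            text = ' '.join(''.join(node.itertext()).split())
            if text:
                texts.append(text)
    return texts


# Hand-written sample; namespace and contents are placeholders.
SAMPLE_DOC = '''<uitspraak xmlns="http://example.org/placeholder">
  <uitspraak.info><para>Rechtbank Voorbeeld</para></uitspraak.info>
  <parablock>
    <para>Eerste   alinea.</para>
    <para>Tweede <emphasis>alinea</emphasis>.</para>
    <para/>
  </parablock>
</uitspraak>'''

print(para_texts(ET.fromstring(SAMPLE_DOC)))
# ['Rechtbank Voorbeeld', 'Eerste alinea.', 'Tweede alinea.']
```

The trade-off is that this loses the grouping that parablock and similar containers express, so it suits plain-text extraction better than anything that needs the document's sections.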