<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/extras/datacollect/extras_datacollect_rechtspraak.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook

Understanding what you can get out of [rechtspraak.nl](https://www.rechtspraak.nl/).

Note that this is mostly just showing our work. If the rechtspraak [dataset](../../intro/wetsuite_datasets.ipynb) we provide suits your needs,
then running this notebook would just be a slower and more cumbersome way to get essentially the same data.

Somewhat related: [extras_datacollect_rechtspraak_codes](extras_datacollect_rechtspraak_codes.ipynb)

## Website, Open data API, and some other notes

You are probably familiar with the [rechtspraak.nl](https://rechtspraak.nl) website and its [search that has a number of filters](https://uitspraken.rechtspraak.nl/#!/) (and some [extra query logic for the text](https://www.rechtspraak.nl/Uitspraken/Paginas/Hulp-bij-zoeken.aspx#1ab85aa0-e737-4b56-8ad5-d7cb7954718d77a998be-3c73-40e3-90f7-541fceeb00fd3)), which gives webpage results, with text where present.


There is also an open data API that exposes much the same in data form.
As [its documentation](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf) ([this intro](https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx) may also be useful) mentions,
- you stick query parameters on the base URL of http://data.rechtspraak.nl/uitspraken/zoeken 
- the results mainly mention ECLIs, which you can fetch details for via e.g. https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608 (more notes below)
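
As a rough sketch of the first point, a search URL can be assembled with the standard library alone. The helper name `build_search_url` is made up for this illustration; the parameter names are the ones used elsewhere in this notebook.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://data.rechtspraak.nl/uitspraken/zoeken"

def build_search_url(params):
    """Return the search URL for a list of (name, value) pairs.

    A list of pairs (rather than a dict) lets a parameter appear twice,
    which is how the search loop in this notebook passes a date range.
    """
    return SEARCH_BASE + "?" + urlencode(params)

url = build_search_url([("modified", "2023-01-01"), ("max", "50")])
print(url)
```

The same function accepts two `modified` pairs in a row when you want a range rather than a single date.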


Worthy of note:
- the fields you can search mostly match those of the 'Uitgebreid zoeken' at uitspraken.rechtspraak.nl, like
  - instantie / court code (basically that third element in the ECLI)
  - rechtsgebied
  - procedure 

- **You can't search in the body text**.  This largely limits the API to a 'keep updated with new cases' feed.  
  - Worthy of note: the website search (queried via `https://uitspraken.rechtspraak.nl/api/zoek`) does support this, and even seems like a better data API than the _actual_ one -- but it doesn't look like it's supposed to be used externally.

- there are plenty of cases where there is no text / document.  You can filter for this in the search.

- the **identifiers** used are **ECLI** (European Case Law Identifier)
  - which in this case will be mainly Dutch ECLIs  (`ECLI:NL:`...). 
    - used since 2013 or so; they absorbed the previously used LJN (Landelijk Jurisprudentie Nummer) identifiers.
  - court code XX (`ECLI:NL:XX:`...) is used for things not (yet?) assigned to a court, and/or non-Dutch ECLIs,
    - rechtspraak.nl may later resolve such ECLIs to a different ECLI. That makes their site the most up-to-date source, and a mirror may not be aware of such changes (yet). 
    - Example: ECLI:NL:XX:2009:BJ4574 
      - in [XML metadata](https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:XX:2009:BJ4574) mentions it's now (isReplacedBy) ECLI:CE:ECHR:2009:0528JUD002671305 (note: )
      - in [webpage form](https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:XX:2009:BJ4574) also seems to link to [a place to find it](https://hudoc.echr.coe.int/eng#%7B%22ecli%22:%5B%22ECLI:CE:ECHR:2009:0528JUD002671305%22%5D%7D) (e.g. `hudoc.echr.coe.int` or `e-justice.europa.eu`).
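
Since an ECLI is five colon-separated parts (the literal `ECLI`, country, court code, year, and a per-court number), pulling out the court code is a simple split. The sketch below is a minimal illustration; the names `Ecli` and `parse_ecli` are invented here and it does no validation beyond counting the parts.

```python
from typing import NamedTuple

class Ecli(NamedTuple):
    country: str   # e.g. 'NL'
    court: str     # court code, e.g. 'PHR', or 'XX'
    year: str
    number: str    # the per-court case identifier

def parse_ecli(text):
    """Split an ECLI string into its parts; raise ValueError if it has the wrong shape."""
    parts = text.strip().split(":")
    if len(parts) != 5 or parts[0].upper() != "ECLI":
        raise ValueError(f"does not look like an ECLI: {text!r}")
    return Ecli(*parts[1:])

print(parse_ecli("ECLI:NL:PHR:2011:BP5608"))
print(parse_ecli("ECLI:CE:ECHR:2009:0528JUD002671305").court)
```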

For each ECLI you might consider various URLs, including
  - XML, e.g. at https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608
    - most interesting when you want case details as data
    - (there is a different XML form at https://uitspraken.rechtspraak.nl/api/document/?id=ECLI:NL:PHR:2011:BP5608 but this is for the webpage view, and is less interesting as data)
  - the case on the website
    - is linked by the website as   https://uitspraken.rechtspraak.nl/InzienDocument?id=ECLI:NL:PHR:2011:BP5608
    - it seems the slightly shorter https://deeplink.rechtspraak.nl/uitspraak?id=ECLI:NL:PHR:2011:BP5608 is equivalent
    - both of the above redirect to a URL like https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:PHR:2011:BP5608
      - which is a general page with scripting that picks up that identifier and then does another request to https://uitspraken.rechtspraak.nl/api/document/?id=ECLI:NL:PHR:2011:BP5608
  - The website also often links to LiDo, e.g.  https://linkeddata.overheid.nl/document/ECLI:NL:PHR:2011:BP5608
    - If you want that as data, consider http://linkeddata.overheid.nl/service/get-links?ext-id=ECLI:NL:PHR:2011:BP5608&output=xml
    - though as https://linkeddata.overheid.nl/front/portal/services notes, this is not part of public LiDo, so you'll need to request an account first
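
To make the list above concrete, here is a small sketch that derives those URLs from one ECLI. The function name and dictionary keys are invented for this illustration; the URL patterns are copied from the list.

```python
from urllib.parse import quote

def urls_for_ecli(ecli):
    """Return the URL variants listed above for one ECLI, keyed by a descriptive label."""
    e = quote(ecli, safe=":")  # keep the colons readable, escape anything unexpected
    return {
        "xml":      f"https://data.rechtspraak.nl/uitspraken/content?id={e}",
        "website":  f"https://uitspraken.rechtspraak.nl/InzienDocument?id={e}",
        "deeplink": f"https://deeplink.rechtspraak.nl/uitspraak?id={e}",
        "lido":     f"https://linkeddata.overheid.nl/document/{e}",
    }

for label, url in urls_for_ecli("ECLI:NL:PHR:2011:BP5608").items():
    print(f"{label:9s} {url}")
```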


Because there are over three million Dutch ECLIs on record, getting a lot of data from here would take a while, and they ask you to be nice to their server.
There used to be a ZIP file (linked from [https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx](https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx)) that you could download to bootstrap your own copy, but [this open data request](https://data.overheid.nl/community/datarequest/zip-bestand-alle-uitspraken) seems to confirm the link was removed (and seems to imply that you should fetch documents through the API instead).  As of this writing the URL to the ZIP file still works, but they probably removed the link for a reason.
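
One simple way to be nice to a server is to enforce a minimum pause between requests. The `Throttle` class below is a hypothetical helper written for this note, not part of wetsuite, and the interval is an arbitrary example value rather than a limit stated by rechtspraak.nl.

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds pass between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable, which makes the class easy to test
        self._sleep = sleep
        self._last = None        # time of the previous wait(), if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(min_interval=0.05)  # arbitrary example value
for ecli in ["ECLI:NL:PHR:2011:BP5608", "ECLI:NL:XX:2009:BJ4574"]:
    throttle.wait()
    # an actual fetch of the document for `ecli` would go here
```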


In [1]:
import collections, random, time

import wetsuite.helpers.date
import wetsuite.helpers.etree
import wetsuite.datacollect.rechtspraaknl
import wetsuite.helpers.koop_parse
import wetsuite.helpers.meta
import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.helpers.notebook

### Example of search, its results, and fetching

#### Querying 

The base URL for data is http://data.rechtspraak.nl/uitspraken/ and just that URL will link you to some identifier/value lists

The **search** base is http://data.rechtspraak.nl/uitspraken/zoeken

Search parameters include: (again, see [the documentation](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf))
* from, max  (from is 0-based, max value of max is documented as 1000)
* sort - default is by modification date, ascending. `DESC` lets you do descending instead.
* type - `Uitspraak` or `Conclusie`
* date - date of this uitspraak / conclusie
* uitspraakdatum (date, or date range; optional)
* instantie - as mentioned in https://data.rechtspraak.nl/Waardelijst/Instanties 
* subject - rechtsgebied as mentioned in https://data.rechtspraak.nl/Waardelijst/Rechtsgebieden
* return - if you specify return=DOC you only get entries for which there is a document; if not you also get entries for which there is only metadata
* modified - last change of the metadata and/or text (with some subtleties, e.g. not necessarily of the uitspraak or conclusie document)
* replaces - previous ECLI, or LJN, for this case.  Meant for backwards compatibility (of searches?).


For example: http://data.rechtspraak.nl/uitspraken/zoeken?modified=2023-01-01&max=50

The response format is [Atom](https://en.wikipedia.org/wiki/Atom_(standard)) and very minimal: title, date, summary, and a URL pointing at the XML data document.
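
Read by hand, such a feed needs little more than the Atom namespace. The sketch below parses a hand-written sample, so the exact elements and values are assumptions made for illustration (in particular that the ECLI sits in `id`); real responses may differ in detail.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"   # the standard Atom namespace

# Made-up sample in the general shape of an Atom feed, not a captured response.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example</title>
  <entry>
    <id>ECLI:NL:PHR:2011:BP5608</id>
    <title>example title</title>
    <summary>example summary</summary>
    <updated>2023-01-01T00:00:00Z</updated>
    <link rel="alternate" href="https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608"/>
  </entry>
</feed>"""

def parse_atom_entries(feed_bytes):
    """Return one dict per Atom entry, with None for anything that is missing."""
    root = ET.fromstring(feed_bytes)
    entries = []
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        entries.append({
            "id":      entry.findtext(ATOM + "id"),
            "title":   entry.findtext(ATOM + "title"),
            "summary": entry.findtext(ATOM + "summary"),
            "updated": entry.findtext(ATOM + "updated"),
            "link":    link.get("href") if link is not None else None,
        })
    return entries

for item in parse_atom_entries(SAMPLE_FEED):
    print(item)
```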

...but let's get our code to help:

In [8]:
## First get the search result metadata

overall_entries = {}

at_a_time = 1000
for range_from, range_to in wetsuite.helpers.date.date_ranges(from_date='2023-06-01', to_date ='2023-12-29', increment_days=1, strftime_format="%Y-%m-%d"):
    print()
    frm = 0
    while True:
        query = [
            ('from', str(frm)),  
            ('max',  str(at_a_time)),     # max seems capped at 1000, so we have to do more in multiple fetches
            ('return', 'DOC'),                                          # DOC asks for things with body text only
            #('modified', '2023-12-28'),
        ]
        query.extend( [('modified',range_from), ('modified',range_to)] )
        print(query)

        search_results = wetsuite.datacollect.rechtspraaknl.search( query )
        # search_results is a parsed etree object.
        # We could show that relatively raw like    
        #print( wetsuite.helpers.etree.debug_pretty(search_results) ) 
        #  yet our parsed form (each entry as a dict) is a little simpler to read:

        search_entries = wetsuite.datacollect.rechtspraaknl.parse_search_results( search_results )
        # we could make overall_entries a list and just extend() it, but there would probably be a lot of overlap,
        # so instead we use the ECLI to deduplicate
        #print('adding %d entries'%len(search_entries))
        for entry_dict in search_entries:
            overall_entries[ entry_dict.get('ecli') ] = entry_dict 

        if len(search_entries) < at_a_time:   # a short page means this date range is exhausted
            break
        frm += at_a_time

#print( "Entries in results: %d\n"%len(entries) )
#for entry in random.sample( entries, 3 ): # show a few random examples
#    print('--------------------')
#    pprint.pprint(entry)


[('from', '0'), ('max', '1000'), ('return', 'DOC'), ('modified', '2023-06-01'), ('modified', '2023-06-02')]
https://data.rechtspraak.nl/uitspraken/zoeken?from=0&max=1000&return=DOC&modified=2023-06-01&modified=2023-06-02
adding 874 entries

[('from', '0'), ('max', '1000'), ('return', 'DOC'), ('modified', '2023-06-02'), ('modified', '2023-06-03')]
https://data.rechtspraak.nl/uitspraken/zoeken?from=0&max=1000&return=DOC&modified=2023-06-02&modified=2023-06-03
adding 275 entries

[('from', '0'), ('max', '1000'), ('return', 'DOC'), ('modified', '2023-06-03'), ('modified', '2023-06-04')]
https://data.rechtspraak.nl/uitspraken/zoeken?from=0&max=1000&return=DOC&modified=2023-06-03&modified=2023-06-04
adding 1000 entries
[('from', '1000'), ('max', '1000'), ('return', 'DOC'), ('modified', '2023-06-03'), ('modified', '2023-06-04')]
https://data.rechtspraak.nl/uitspraken/zoeken?from=1000&max=1000&return=DOC&modified=2023-06-03&modified=2023-06-04
adding 1000 entries
[('from', '2000'), ('max', '1
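
The paging pattern in the loop above (keep requesting with a growing `from` until a page comes back short) can be isolated as follows. This is a sketch with a stand-in data source; `fetch_all_pages` and `fake_fetch` are names invented here, and no network access is involved.

```python
def fetch_all_pages(fetch_page, page_size=1000):
    """Collect results from a paged source.

    fetch_page(offset, limit) is a caller-supplied function returning a list
    of at most `limit` items; a short page signals that we reached the end.
    """
    results = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        results.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return results

# Stand-in for a real search call: 2500 fake items served in slices.
fake_items = list(range(2500))
requested_offsets = []

def fake_fetch(offset, limit):
    requested_offsets.append(offset)
    return fake_items[offset:offset + limit]

everything = fetch_all_pages(fake_fetch)
print(len(everything), requested_offsets)
```

Note that when the total happens to be an exact multiple of the page size, this pattern makes one extra request that returns an empty page, which is harmless.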

#### Fetching the documents the search refers to

In [3]:
rechtspraak_fetched = wetsuite.helpers.localdata.LocalKV('rechtspraak_fetched.db', key_type=str, value_type=bytes)

In [9]:
## Now get the text belonging to each
paths = collections.defaultdict(int)

count_fetched, count_cached = 0,0

pbar = wetsuite.helpers.notebook.progress_bar( len(overall_entries), description='fetching... ')

for entry in overall_entries.values():
    entry_xml_url = 'https://data.rechtspraak.nl/uitspraken/content?id=%s'%entry['ecli']
    # TODO: timeout catch
    bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( rechtspraak_fetched, entry_xml_url)
    if came_from_cache:
        count_cached +=1
    else:
        count_fetched += 1
    
    tree  = wetsuite.helpers.etree.fromstring( bytestring )
    tree  = wetsuite.helpers.etree.strip_namespace( tree )
        
    pbar.value += 1
    pbar.description = f"Fetched {count_fetched}, cached {count_cached}  "

print(f"Fetched {count_fetched},  while {count_cached} were cached\n") # because the progress bar doesn't update after iterating

fetching... :   0%|          | 0/179817 [00:00<?, ?it/s]

Fetched 3677,  while 176140 were cached
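
The fetch-or-reuse behaviour relied on above can be sketched with `sqlite3` from the standard library. This is not wetsuite's implementation; the table layout and the names `open_cache` and `cached_fetch_sketch` are assumptions made for this illustration, and the fetch function is a stand-in that never touches the network.

```python
import sqlite3

def open_cache(path=":memory:"):
    """Open (or create) a tiny key-value table; ':memory:' keeps it out of the filesystem."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")
    return conn

def cached_fetch_sketch(conn, url, fetch):
    """Return (data, came_from_cache). `fetch(url)` must return bytes."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (url,)).fetchone()
    if row is not None:
        return bytes(row[0]), True
    data = fetch(url)
    conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", (url, data))
    conn.commit()
    return data, False

cache = open_cache()
calls = []

def fake_fetch(url):
    # stand-in for a real HTTP request
    calls.append(url)
    return b"<xml/>"

print(cached_fetch_sketch(cache, "https://example.invalid/doc", fake_fetch))
print(cached_fetch_sketch(cache, "https://example.invalid/doc", fake_fetch))
```

Only the first call reaches `fake_fetch`; the second is answered from the table, which is what makes re-running a long fetch loop cheap.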



Note:
There is also [extras_diagnose_rechtspraak_docstructure](../extras_diagnose_rechtspraak_docstructure.ipynb), 
an exploration of the documents that informs some of the choices made below,
and in particular some of the helper code in `wetsuite.datacollect.rechtspraaknl`.

## Start making a dataset

In [5]:
rechtspraaknl_extracted = wetsuite.helpers.localdata.MsgpackKV('rechtspraaknl_extracted.db', str, None)

In [6]:
# how much do we have at all?
len( rechtspraak_fetched )

3193523

In [7]:
# speed optimization for a later step:  
#   Store cases (by URL) we detect contain no text
#   Since this reads gigabytes of data, expect this to take a few minutes
#   Note this may take a few hundred MB of RAM
import tqdm
known_notext = set()    # TODO: figure out why we're not putting this in a store

print("known-notext store: figuring out new cases")
for key, bytestring in tqdm.tqdm( rechtspraak_fetched.items() ):
    if key in known_notext: # (this condition only helps for re-runs in the same session that update from newer fetched state)
        continue
    if b'<conclusie' not in bytestring and b'<uitspraak' not in bytestring: # substring test on the raw bytes, cheaper than parsing just to find there is no text
        known_notext.add( key )

print("known-notext store: updating")
rechtspraaknl_knownnotext = wetsuite.helpers.localdata.LocalKV('rechtspraaknl_knownnotext.db', str, None)
print( "  items before update:", len(rechtspraaknl_knownnotext) )
current_keys = set( rechtspraaknl_knownnotext.keys() )
for url in known_notext:
    if url not in current_keys:
        rechtspraaknl_knownnotext.put(url, 'y', commit=False)
rechtspraaknl_knownnotext.commit()
print( "  items after update:", len(rechtspraaknl_knownnotext) )
#rechtspraaknl_extracted._put_meta('valtype','msgpack') # I do believe that's internal now; TODO: test

known-notext store: figuring out figuring out new cases




100%|██████████| 3193523/3193523 [01:19<00:00, 40310.00it/s]


known-notext store: updating
  items before update: 2464757
  items after update: 2464757
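
The pre-filter used in the cell above boils down to a substring test on the raw bytes. A minimal sketch, with made-up sample documents, and assuming (as the notebook does) that the body elements appear without a namespace prefix:

```python
def seems_to_have_text(document_bytes):
    """Cheap pre-filter: True if the raw XML mentions either body element."""
    return b"<uitspraak" in document_bytes or b"<conclusie" in document_bytes

# Made-up samples; real documents are much larger.
samples = {
    "with uitspraak": b"<doc><metadata/><uitspraak><para>tekst</para></uitspraak></doc>",
    "with conclusie": b"<doc><metadata/><conclusie><para>tekst</para></conclusie></doc>",
    "metadata only":  b"<doc><metadata/></doc>",
}
for label, raw in samples.items():
    print(label, seems_to_have_text(raw))
```

A prefixed tag such as `<x:uitspraak` would slip past this test, so it is only a shortcut for documents shaped like the ones seen here.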


### (incremental) parsing and storing

In [8]:
# these are used in the update section. 
known_notext_urls      = set( rechtspraaknl_knownnotext.keys() )  # Set, not list, so that 'in' isn't slow as molasses.
already_extracted_urls = set( rechtspraaknl_extracted.keys() )

In [10]:
count_uitspraken, count_conclusies, count_neither, count_skip  =  0, 0, 0, 0

selected_keys = list(rechtspraak_fetched.keys()) # all keys
random.shuffle(selected_keys)
#selected_keys = random.sample( selected_keys, 50000 ) # uncomment during debug, if you want to get just a subset, faster
pbar = wetsuite.helpers.notebook.progress_bar( len(selected_keys), description='parsing...')

for key in selected_keys:
    pbar.value += 1
    pbar.description = f"{count_conclusies} conclusies, {count_uitspraken} uitspraken, {count_neither} neither, {count_skip} skipped " 

    if key in known_notext_urls: # we previously figured it had no text, no sense trying
        count_neither += 1
        continue
    
    if key in already_extracted_urls:  # only when adding incrementally; won't update
        count_skip += 1
        continue

    # actually load
    bytestring = rechtspraak_fetched.get(key)

    if b'<conclusie' not in bytestring and b'<uitspraak' not in bytestring: # substring test on the raw bytes, cheaper than parsing just to find there is no text
        count_neither += 1
        continue

    # actually parse
    tree = wetsuite.helpers.etree.fromstring( bytestring )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    uitspraak = tree.find('uitspraak')
    conclusie = tree.find('conclusie')
    
    if uitspraak is not None:
        count_uitspraken += 1
        rechtspraaknl_extracted.put(key,  wetsuite.datacollect.rechtspraaknl.parse_content( tree )  )
        #collected[key] = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
        
    elif conclusie is not None:
        count_conclusies += 1
        rechtspraaknl_extracted.put(key,  wetsuite.datacollect.rechtspraaknl.parse_content( tree )  )
        #collected[key] = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
        
    else: # actually shouldn't happen
        count_neither += 1
    #    print( wetsuite.helpers.etree.debug_pretty( example_tree ) ) # print indented
    #    #raise ValueError()
    #    break

print(f"{count_conclusies} conclusies and {count_uitspraken} uitspraken   (and {count_neither} that have no text)")       

parsing...:   0%|          | 0/3193523 [00:00<?, ?it/s]

212 conclusies and 10496 uitspraken   (and 2464757 that have no text)
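
For readers without wetsuite at hand, the parse-then-classify step can be approximated with the standard library. This sketch only mirrors the decision made in the loop above; the helper names are invented, the sample documents are made up, and the namespace URI in them is a placeholder rather than the real schema.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Remove '{namespace}' prefixes from all element tags, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and "}" in element.tag:
            element.tag = element.tag.split("}", 1)[1]
    return root

def classify(document_bytes):
    """Return 'uitspraak', 'conclusie', or None for a document without body text."""
    root = strip_namespaces(ET.fromstring(document_bytes))
    if root.find("uitspraak") is not None:     # direct child of the root
        return "uitspraak"
    if root.find("conclusie") is not None:
        return "conclusie"
    return None

# Made-up documents; the root and metadata element names are illustrative only.
WITH_TEXT = (b'<open-rechtspraak>'
             b'<uitspraak xmlns="urn:example:placeholder"><para>tekst</para></uitspraak>'
             b'</open-rechtspraak>')
METADATA_ONLY = b'<open-rechtspraak><metadata/></open-rechtspraak>'

print(classify(WITH_TEXT), classify(METADATA_ONLY))
```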
