<a href="https://colab.research.google.com/github/scarfboy/wetsuite-dev/blob/main/examples/datacollect_koop_repos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# For local installs you can install the package once. In Colab you get a disposable environment, so you will have to start with this install each time.
!pip3 --quiet install -U https://github.com/scarfboy/wetsuite-dev/archive/refs/heads/main.zip

# Introduction to searching KOOP's repositories via their SRU API

In [None]:
# KOOP's repositories give access to the same data behind wetten.overheid.nl, lokaleregelgeving.overheid.nl, and more.
# These repositories can be queried live and without any configuration or storage,
#   though if you are going to do regular and/or bulky searches it is probably a good idea to cache the documents you fetch.
# 
# The below shows the use of a (currently very minimal) Python API to access some of these repositories.

# The query syntax is Common Query Language. 
# You can get fairly far copying examples, or guessing.  For the somewhat more technically minded:
# - most of the time you use    'indexname operator term'    e.g. 'dc.title = chicken'  or  'dc.modified > 2022-01-01'
#   use doublequotes when there's a space or <>=/() in the term   (or just always)
# - the operators supported vary per index and per server (some get fancy, many do not), so you will have to find the server's documentation OR stick to a few basics
#   - for dates and numbers you mostly have '<', '<=', '>', '>=', '='
#   - for text you usually have
#      'any'      roughly speaking:  body any "foo bar"   is short for   body any foo OR  body any bar
#      'all'      roughly speaking:  body all "foo bar"   is short for   body any foo AND body any bar
#      '=='       exact match
#      'exact'    exact match
#      '='        server choice, e.g. for text this may be '==' or 'adj' or such
# - the indexes available to you (which are usually the items' metadata fields) will vary per server, and will be listed in the response to its SRU explain operation
# - you can combine those, using AND and OR and brackets, see e.g. the CVDR example below

# Further notes:
# - in a few cases this SRU functionality is more limited than its website equivalent.
#   For example, the SRU interface to BWB does not seem to allow searching the body text, while the website does.
# - Details can vary per repository, e.g. 
#   - what fields are in the search result items
#   - how they point to the content documents they describe
#   - there may be shorthands for index names, e.g. BWB allows 'titel' meaning 'overheidbwb.titel'
# The more detailed your question, the more repository details you will have to figure out.
# We try to provide helper functions.
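
The 'indexname operator term' pattern and the quoting advice above can be sketched as a few small helpers. This is only an illustration: `cql_quote`, `cql_clause` and `cql_combine` are hypothetical names, not part of wetsuite, and the escaping assumes the usual CQL rule that a double quote inside a quoted term is escaped with a backslash.

```python
def cql_quote(term):
    """Double-quote a CQL term, escaping backslashes and embedded double quotes."""
    text = str(term)
    return '"%s"' % text.replace('\\', '\\\\').replace('"', '\\"')


def cql_clause(index, operator, term):
    """Build a single 'indexname operator term' clause, always quoting the term."""
    return '%s %s %s' % (index, operator, cql_quote(term))


def cql_combine(boolean, *clauses):
    """Join clauses with AND or OR, bracketing each so the grouping is explicit."""
    if boolean not in ('AND', 'OR'):
        raise ValueError('boolean should be AND or OR')
    return (' %s ' % boolean).join('(%s)' % clause for clause in clauses)


# Hypothetical example: documents by Amsterdam that mention either set of terms
query = cql_combine(
    'AND',
    cql_clause('creator', 'any', 'Amsterdam'),
    cql_combine(
        'OR',
        cql_clause('body', 'any', 'damoclesbeleid damocles'),
        cql_clause('body', 'all', 'opiumwet 13b'),
    ),
)
print(query)
```

Bracketing every clause is more verbose than needed, but it avoids having to remember how a given server groups a mix of AND and OR.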

In [2]:
import pprint, datetime
import wetsuite.datacollect.koop_repositories

As some indication of how to use the library to access SRU, here is the help output for one of its classes:

In [3]:
help( wetsuite.datacollect.koop_repositories.BWB() )

Help on BWB in module wetsuite.datacollect.koop_repositories object:

class BWB(wetsuite.datacollect.sru.SRUBase)
 |  BWB(verbose=False)
 |  
 |  SRU endpoint for the Basis Wetten Bestand repository
 |  
 |  See a description in https://www.overheid.nl/sites/default/files/wetten/Gebruikersdocumentatie%20BWB%20-%20Zoeken%20binnen%20het%20basiswettenbestand%20v1.3.1.pdf
 |  
 |  Method resolution order:
 |      BWB
 |      wetsuite.datacollect.sru.SRUBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, verbose=False)
 |      base_url should be everything up to the ?
 |      
 |      Notes:
 |      - x_connection is used to specify the collection within a server, and seems to be non-standard and required
 |      
 |      - extra_query is used to let us AND something into the query, and is intended to restrict to a subset of documents
 |        in these cases, x_connection seems to include in extra sets, and the combination is sometimes too much (?)
 |  
 |  

## Basiswettenbestand (BWB)

There is some technical documentation at https://www.overheid.nl/sites/default/files/wetten/Gebruikersdocumentatie%20BWB%20-%20Zoeken%20binnen%20het%20basiswettenbestand%20v1.3.1.pdf 

The indices in this one are limited, so this is mainly useful for [known-item searches](https://en.wikipedia.org/wiki/Known-item_search).

In [5]:
sru_bwb = wetsuite.datacollect.koop_repositories.BWB() # object that mostly just knows where to fetch from


In [None]:
pprint.pprint( sru_bwb.explain_parsed() )  # this is a self-description of the API for you to read, mainly useful to figure out the names of the indices that you can search in

{'database/numRecs': '128802',
 'description': 'Gemeenschappelijke zoekdienst van overheid.nl voor BWB Online',
 'explain_url': 'http://zoekservice.overheid.nl/sru/Search?&version=1.2&x-connection=BWB&operation=explain',
 'extent': 'Dutch national legislation',
 'host': 'zoekservice.overheid.nl',
 'indices': [('dcterms', 'identifier'),
             ('dcterms', 'modified'),
             ('dcterms', 'type'),
             ('overheid', 'authority'),
             ('overheidbwb', 'rechtsgebied'),
             ('overheidbwb', 'overheidsdomein'),
             ('overheidbwb', 'onderwerpVerdrag'),
             ('overheidbwb', 'titel'),
             ('overheidbwb', 'afkorting'),
             ('overheidbwb', 'wetsfamilie'),
             ('overheidbwb', 'geldigheidsdatum'),
             ('overheidbwb', 'zichtdatum'),
             ('overheidbwb', 'bekendmaking'),
             ('overheidbwb', 'dossiernummer')],
 'port': '80',
 'sets': [('dcterms',
           'http://purl.org/dc/terms/',
           'I
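
The 'indices' entry in the explain output is a list of (set, name) pairs, and queries refer to them as 'set.name'. A minimal sketch of turning those pairs into searchable names follows. `index_names` is a hypothetical helper, and it assumes that a pair without a set (as seen in some explain outputs) is searched by its bare name.

```python
def index_names(indices):
    """Turn (set, name) pairs into 'set.name' strings; pairs without a set keep the bare name."""
    names = []
    for context_set, name in indices:
        if context_set:
            names.append('%s.%s' % (context_set, name))
        else:
            names.append(name)
    return names


# A few pairs copied from the explain outputs shown in this notebook
sample_indices = [
    ('dcterms', 'identifier'),
    ('overheidbwb', 'titel'),
    (None, 'gemeente'),
]
print(index_names(sample_indices))
```

With a live connection you would presumably pass in `sru_bwb.explain_parsed()['indices']` instead of the sample list.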

In [12]:
# Some example queries
for example_query in (
        'overheidbwb.titel any textiel', 
        'dcterms.modified >= 2023-01-01', # changes since the start of 2023
        'dcterms.modified > %s'%(  (datetime.date.today() - datetime.timedelta(days=7)).strftime('%Y-%m-%d') ), # changes in the last week
        'dcterms.identifier = BWBR0045754',
    ):
  
    print('\n\n*************** %s ****************'%example_query)
    for i, record in enumerate( sru_bwb.search_retrieve_many( example_query, up_to=5 ) ): # just the first 5, to keep this example output brief
        print('***  Record %d of %d  ***'%(i+1, sru_bwb.numberOfRecords))
        meta = wetsuite.datacollect.koop_repositories.bwb_searchresult_meta(record) # this parses BWB search results XML into digestible data in a dict
        pprint.pprint(meta)



*************** overheidbwb.titel any textiel ****************
***  Record 1 of 2  ***
{'authority': 'Volksgezondheid, Welzijn en Sport',
 'created': '2015-07-02',
 'creator': 'Ministerie van Binnenlandse Zaken en Koninkrijksrelaties',
 'geldigheidsperiode_einddatum': '2022-04-13',
 'geldigheidsperiode_startdatum': '2001-04-13',
 'identifier': 'BWBR0012348',
 'language': 'nl',
 'locatie_manifest': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/manifest.xml',
 'locatie_toestand': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/2001-04-13_0/xml/BWBR0012348_2001-04-13_0.xml',
 'locatie_wti': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/BWBR0012348.WTI',
 'modified': '2022-04-15',
 'overheidsdomein': 'Economie en ondernemen',
 'rechtsgebied': 'Ondernemingspraktijk',
 'title': 'Warenwetbesluit formaldehyde in textiel',
 'toestand': 'http://wetten.overheid.nl/id/BWBR0012348/2001-04-13/0',
 'type': 'AMvB',
 'zichtperiode_
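
The search results point at the actual documents via URLs such as 'locatie_toestand'. If you fetch many of those, or fetch them regularly, caching them locally is kinder to the server. The following is a minimal sketch under stated assumptions, not something wetsuite provides in this form: `fetch_url` and `cached_fetch` are hypothetical names, and the 'koop_cache' directory is an arbitrary choice.

```python
import hashlib
import os
import urllib.request


def fetch_url(url):
    """Fetch a URL and return its content as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_fetch(url, cache_dir='koop_cache', fetch=fetch_url):
    """Return the bytes for url, reading from cache_dir if we fetched it before."""
    os.makedirs(cache_dir, exist_ok=True)
    # hash the URL so that it becomes a safe, fixed-length filename
    filename = hashlib.sha1(url.encode('utf-8')).hexdigest()
    path = os.path.join(cache_dir, filename)
    if os.path.exists(path):
        with open(path, 'rb') as cached_file:
            return cached_file.read()
    data = fetch(url)
    with open(path, 'wb') as cached_file:
        cached_file.write(data)
    return data
```

Usage would look something like `cached_fetch( meta['locatie_toestand'] )`. Note that this cache never expires anything, which suits documents at versioned URLs but not things that change in place.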

## CVDR
CVDR is roughly the data equivalent of https://lokaleregelgeving.overheid.nl

In [None]:
sru_cvdr = wetsuite.datacollect.koop_repositories.CVDR( verbose=False )

pprint.pprint( sru_cvdr.explain_parsed() ) # seeing which indexes are here. 
# This one has a more complex information model, so you can dig a little deeper to see what you can do with it.

{'database/numRecs': '262582',
 'description': 'Gemeenschappelijke zoekdienst van overheid.nl voor Centrale '
                'Voorziening Decentrale Regelgeving',
 'explain_url': 'http://zoekservice.overheid.nl/sru/Search?&version=1.2&x-connection=cvdr&operation=explain',
 'extent': 'Lokale regelingen of the Dutch government',
 'host': 'zoekservice.overheid.nl',
 'indices': [('dcterms', 'identifier'),
             ('dcterms', 'title'),
             ('dcterms', 'language'),
             ('dcterms', 'creator'),
             ('dcterms', 'modified'),
             ('dcterms', 'isFormatOf'),
             ('dcterms', 'alternative'),
             ('dcterms', 'source'),
             ('dcterms', 'isRatifiedBy'),
             ('dcterms', 'subject'),
             ('dcterms', 'issued'),
             (None, 'workid'),
             (None, 'bronformaat'),
             (None, 'organisatieType'),
             (None, 'sorteerTitel'),
             (None, 'gemeente'),
             (None, 'provincie'),
   

In [None]:
today = datetime.date.today()

for example_query in (
        'dcterms.modified > %s'%(  (today - datetime.timedelta(days=7)).strftime('%Y-%m-%d') ), # changes in the last week
        'dcterms.title = damocles',
        'body any damocles',
        '(creator any "Amsterdam") AND ( (body any "damoclesbeleid damocles") OR (body any "drugs softdrugs harddrugs handelshoeveelheid opiumwet 13b") AND (body any "sluiting herstelsanctie bestuursdwang"))',
    ):
    print('\n\n*************** %s ****************'%example_query)

    for i, record in enumerate( sru_cvdr.search_retrieve_many( example_query, up_to=5 ) ):
        print('***  Record %d of %d  ***'%(i+1, sru_cvdr.numberOfRecords))
        meta = wetsuite.datacollect.koop_repositories.cvdr_meta(record, flatten=True) # flatten smushes down possibly-repeated fields into a single value. Good enough (only) for presentation.
        pprint.pprint(meta)
        #print( wetsuite.helpers.etree.tostring(record).decode('u8') ) # the search record in XML form, to see whether we're getting out everything



*************** dcterms.modified > 2023-02-10 ****************
***  Record 1 of 667  ***
{'alternatieveIdentifier': '',
 'alternative': 'Gebiedsgericht geluidsbeleid',
 'betreft': 'bijlage 4',
 'creator': 'Tubbergen (overheid:Gemeente)',
 'identifier': 'CVDR135436_2',
 'inwerkingtredingDatum': '2023-02-21',
 'isFormatOf': 'gmb-2023-61650 '
               '(https://zoek.officielebekendmakingen.nl/gmb-2023-61650)',
 'isRatifiedBy': 'gemeenteraad (overheid:BestuursorgaanGemeente)',
 'issued': '2023-01-31',
 'kenmerk': '9 B',
 'language': 'nl',
 'modified': '2023-02-21',
 'onderwerp': '',
 'opvolgerVan': '',
 'organisatietype': 'Gemeente',
 'preferred_url': 'https://lokaleregelgeving.overheid.nl/CVDR135436/2',
 'publicatieurl_xhtml': 'https://repository.officiele-overheidspublicaties.nl/cvdr/CVDR135436/2/html/CVDR135436_2.html',
 'publicatieurl_xml': 'https://repository.officiele-overheidspublicaties.nl/cvdr/CVDR135436/2/xml/CVDR135436_2.xml',
 'redactioneleToevoeging': '',
 'source': 'W

## Some related code

In [None]:
# there are a bunch of helper functions to help you deal with search results (e.g. parsing metadata and identifiers) 
# ...and to some degree the documents.  One or two are used above.    
# TODO: document, explain, demonstrate more

# there are also some more specific tools, like:

# "given a CVDR work id (or specific expression ID implying the work), find all known expression IDs for that work ID"
wetsuite.datacollect.koop_repositories.cvdr_versions_for_work( 'CVDR165982_1' )


['CVDR165982_1', 'CVDR165982_2']
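
The expression IDs in that output look like a work ID, an underscore, and a version number. Assuming that pattern holds in general (it is only inferred from the examples in this notebook), you can split them yourself. `split_cvdr_id` is a hypothetical stand-alone helper, and the library may well have its own equivalent.

```python
import re


def split_cvdr_id(cvdr_id):
    """Split 'CVDR165982_1' into ('CVDR165982', 1); a bare work ID gives (work_id, None)."""
    match = re.fullmatch(r'(CVDR[0-9]+)(?:_([0-9]+))?', cvdr_id.strip())
    if match is None:
        raise ValueError('does not look like a CVDR identifier: %r' % cvdr_id)
    work_id, version = match.groups()
    return work_id, (int(version) if version is not None else None)


# Using the list from the output above: pick the expression with the highest version number
expression_ids = ['CVDR165982_1', 'CVDR165982_2']
latest = max(expression_ids, key=lambda expression_id: split_cvdr_id(expression_id)[1])
print(latest)
```

Comparing the version as a number rather than as text matters once a work has ten or more versions.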

## Officiele publicaties

There is some more technical detail in https://www.koopoverheid.nl/binaries/koop/documenten/instructies/2021/02/09/handleiding-voor-het-uitvragen-van-de-collectie-officiele-publicaties/Handleiding+SRU2.0+v1.2+28052021.pdf, which also touches on BWB details.