<a href="https://colab.research.google.com/github/scarfboy/wetsuite-dev/blob/main/examples/datacollect_koop_repos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# For a local install you only need to install the package once.  In Colab you get a disposable environment, so you have to run this install at the start of each session.
!pip3 --quiet install -U https://github.com/scarfboy/wetsuite-dev/archive/refs/heads/main.zip



# Introduction to searching KOOP's repositories via their SRU API

In [None]:
# KOOP's repositories give code-and-data level access to the same data that sits behind
#   wetten.overheid.nl, lokaleregelgeving.overheid.nl, and such.
# These repositories can be queried live and without any configuration or storage,
#   though if you are going to do repeated and/or bulky searches it is probably a good idea to cache the documents you fetch.
# 
# The below shows the use of a (currently very minimal) Python API to do just that.

# The query syntax is CQL (Contextual Query Language, originally Common Query Language).
# There's a standard you can read, but you can get fairly far by copying examples.
# For the somewhat more technically minded:
# - most of the time you use    indexname operator term    e.g. 'dc.title = chicken',   'dc.modified > 2022-01-01'
#   use doublequotes when there's a space or <>=/() in the term (or just always)
# - you can combine such clauses with booleans (AND, OR, NOT)
# - the indexes (usually meaning metadata fields) you can search in vary per server, and are listed in its explain response (see explain_parsed() below)
#   in some cases you can search less than the website lets you (e.g. the BWB repo does not seem to allow searching the body text)
# - the operators supported vary per field and per server (some get fancy, many do not), so you will have to find its documentation OR stick to a few basics
#   for text you usually have
#     'any'      roughly speaking:  body any "foo bar"   is short for   body any foo OR  body any bar
#     'all'      roughly speaking:  body all "foo bar"   is short for   body any foo AND body any bar
#     '=='       exact match
#     'exact'    exact match
#     '='        server choice, e.g. for text it might be == or adj or such
#   for dates and numbers you mostly have '<', '<=', '>', '>=', '='

# The details and format within each search result item will vary, as will the way they point to the content documents they describe,
#   so the more detailed your question, the more repository details you have to figure out.   We try to provide helper functions.
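The quoting and combining rules above are easy to get slightly wrong when queries are assembled by hand. Below is a minimal sketch of how such query strings could be built; the helper names (`cql_term`, `cql_clause`, `cql_join`) are hypothetical and not part of wetsuite, and the set of characters that trigger quoting is taken from the notes above rather than from a particular server's documentation.

```python
import datetime

# Characters that, per the notes above, call for double quotes around a term.
_NEEDS_QUOTES = set(' <>=/()"')

def cql_term(term):
    """Quote a search term if it is empty or contains whitespace or CQL-special characters."""
    term = str(term)
    if term == '' or any(ch in _NEEDS_QUOTES or ch.isspace() for ch in term):
        # Inside a quoted term, backslashes and double quotes are backslash-escaped.
        return '"%s"' % term.replace('\\', '\\\\').replace('"', '\\"')
    return term

def cql_clause(index, relation, term):
    """Build one 'indexname operator term' clause."""
    return '%s %s %s' % (index, relation, cql_term(term))

def cql_join(operator, *clauses):
    """Combine clauses with AND / OR, parenthesizing each so precedence is explicit."""
    return (' %s ' % operator).join('(%s)' % clause for clause in clauses)

since = datetime.date(2022, 1, 1)
query = cql_join(
    'AND',
    cql_clause('dcterms.modified', '>=', since.strftime('%Y-%m-%d')),
    cql_clause('body', 'any', 'damocles damoclesbeleid'),
)
print(query)
```

Parenthesizing every clause is a deliberate choice: it avoids relying on how a given server evaluates a mix of AND and OR.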

In [None]:
import pprint, datetime
import wetsuite.datacollect.koop_repositories
import wetsuite.helpers.etree  # used further down to print records as XML

## Basiswettenbestand (BWB)

There is some technical documentation at https://www.overheid.nl/sites/default/files/wetten/Gebruikersdocumentatie%20BWB%20-%20Zoeken%20binnen%20het%20basiswettenbestand%20v1.3.1.pdf 

The indices in this one are limited, so this is mainly useful for known-item searches.

In [None]:
sru_bwb = wetsuite.datacollect.koop_repositories.BWB() # object that mostly just knows where to fetch from

# pprint.pprint( sru_bwb.explain_parsed() )  # this is a self-description of the API for you to read, mostly to figure out the names of the indices that you can search in

In [None]:
# Some example queries
for example_query in (
        'overheidbwb.titel any textiel', # it also allows titel as short for overheidbwb.titel
        'dcterms.modified >= 2022-01-01', # changes since the start of 2022
        'dcterms.modified > %s'%(  (datetime.date.today() - datetime.timedelta(days=7)).strftime('%Y-%m-%d') ), # changes in the last week
        'dcterms.identifier = BWBR0045754',
    ):
    print('\n\n*************** %s ****************'%example_query)

    for i, record in enumerate( sru_bwb.search_retrieve_many( example_query, up_to=5 ) ):
        print('***  Record %d of %d  ***'%(i+1, sru_bwb.numberOfRecords))
        meta = wetsuite.datacollect.koop_repositories.bwb_searchresult_meta(record)
        pprint.pprint(meta)



*************** overheidbwb.titel any textiel ****************
***  Record 1 of 2  ***
{'authority': 'Volksgezondheid, Welzijn en Sport',
 'created': '2015-07-02',
 'creator': 'Ministerie van Binnenlandse Zaken en Koninkrijksrelaties',
 'geldigheidsperiode_einddatum': '2022-04-13',
 'geldigheidsperiode_startdatum': '2001-04-13',
 'identifier': 'BWBR0012348',
 'language': 'nl',
 'locatie_manifest': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/manifest.xml',
 'locatie_toestand': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/2001-04-13_0/xml/BWBR0012348_2001-04-13_0.xml',
 'locatie_wti': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/BWBR0012348.WTI',
 'modified': '2022-04-15',
 'overheidsdomein': 'Economie en ondernemen',
 'rechtsgebied': 'Ondernemingspraktijk',
 'title': 'Warenwetbesluit formaldehyde in textiel',
 'toestand': 'http://wetten.overheid.nl/id/BWBR0012348/2001-04-13/0',
 'type': 'AMvB',
 'zichtperiode_
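The metadata above points at documents by URL (for example `locatie_toestand`), and the introduction suggests caching what you fetch when doing repeated or bulky work. A minimal sketch of such a cache follows, using only the standard library; `cached_fetch`, its parameters, and the `koop_cache` directory name are hypothetical, and nothing here is claimed to be how wetsuite itself stores data.

```python
import hashlib
import os
import urllib.request

def cached_fetch(url, cache_dir='koop_cache', fetch=None):
    """Return the bytes behind url, reading from cache_dir if this URL was fetched before."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET using the standard library.
        def fetch(u):
            with urllib.request.urlopen(u) as response:
                return response.read()
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so that it becomes a safe, fixed-length filename.
    path = os.path.join(cache_dir, hashlib.sha256(url.encode('utf-8')).hexdigest())
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return f.read()
    data = fetch(url)
    with open(path, 'wb') as f:
        f.write(data)
    return data
```

With a `meta` dict like the one printed above, `cached_fetch(meta['locatie_toestand'])` would download the document once and read it from disk afterwards. This sketch never expires entries, which suits fixed document versions but not search results that change over time.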

## CVDR
...which seems to be the data equivalent of lokaleregelgeving.overheid.nl

In [None]:
sru_cvdr = wetsuite.datacollect.koop_repositories.CVDR( verbose=False )
pprint.pprint( sru_cvdr.explain_parsed() ) # seeing which indices are here. 
# This one has a somewhat more complex information model, though you'll have to dig a little deeper to see what you can do with it.

{'database/numRecs': '253317',
 'description': 'Gemeenschappelijke zoekdienst van overheid.nl voor Centrale '
                'Voorziening Decentrale Regelgeving',
 'explain_url': 'http://zoekservice.overheid.nl/sru/Search?&version=1.2&x-connection=cvdr&operation=explain',
 'extent': 'Lokale regelingen of the Dutch government',
 'host': 'zoekservice.overheid.nl',
 'indices': [('dcterms', 'identifier'),
             ('dcterms', 'title'),
             ('dcterms', 'language'),
             ('dcterms', 'creator'),
             ('dcterms', 'modified'),
             ('dcterms', 'isFormatOf'),
             ('dcterms', 'alternative'),
             ('dcterms', 'source'),
             ('dcterms', 'isRatifiedBy'),
             ('dcterms', 'subject'),
             ('dcterms', 'issued'),
             (None, 'workid'),
             (None, 'bronformaat'),
             (None, 'organisatieType'),
             (None, 'sorteerTitel'),
             (None, 'gemeente'),
             (None, 'provincie'),
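The `explain_url` in the output above shows the shape of the underlying SRU requests. As an illustration of what a search request might look like on the wire, here is a sketch that builds a `searchRetrieve` URL from standard SRU 1.2 parameter names. It assumes the search operation uses the same base URL and `x-connection` value as the explain URL; the function name `sru_search_url` is hypothetical, and whether wetsuite builds its requests exactly this way is not confirmed here.

```python
import urllib.parse

def sru_search_url(base_url, query, x_connection, start_record=1, maximum_records=10, version='1.2'):
    """Build an SRU searchRetrieve URL; the CQL query gets URL-encoded."""
    params = [
        ('version', version),
        ('operation', 'searchRetrieve'),
        ('x-connection', x_connection),        # assumption: same value as in the explain URL
        ('query', query),
        ('startRecord', str(start_record)),    # SRU record positions start at 1
        ('maximumRecords', str(maximum_records)),
    ]
    return base_url + '?' + urllib.parse.urlencode(params)

url = sru_search_url(
    'http://zoekservice.overheid.nl/sru/Search',
    'dcterms.title = damocles',
    'cvdr',
)
print(url)
```

Fetching more than one page of results then comes down to repeating the request with a higher `start_record`.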
   

In [None]:
today = datetime.date.today()

for example_query in (
        'dcterms.modified > %s'%(  (today - datetime.timedelta(days=7)).strftime('%Y-%m-%d') ), # changes in the last week
        'dcterms.title = damocles',
        'body any damocles',
        '(creator any "Amsterdam") AND ( (body any "damoclesbeleid damocles") OR (body any "drugs softdrugs harddrugs handelshoeveelheid opiumwet 13b") AND (body any "sluiting herstelsanctie bestuursdwang"))',
    ):
    print('\n\n*************** %s ****************'%example_query)

    for i, record in enumerate( sru_cvdr.search_retrieve_many( example_query, up_to=5 ) ):
        print('***  Record %d of %d  ***'%(i+1, sru_cvdr.numberOfRecords))
        print( wetsuite.helpers.etree.tostring(record).decode('u8') )

        meta = wetsuite.datacollect.koop_repositories.cvdr_meta(record, flatten=True) # flatten smushes down possibly-repeated fields into a single value. Good enough (only) for presentation.
        pprint.pprint(meta)




*************** dcterms.modified > 2022-11-05 ****************
***  Record 1 of 631  ***
<record xmlns="http://www.loc.gov/zing/srw/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
         <recordSchema>http://standaarden.overheid.nl/sru/</recordSchema>
         <recordPacking>xml</recordPacking>
         <recordData>
            <gzd xmlns="http://standaarden.overheid.nl/sru" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:overheid="http://standaarden.overheid.nl/owms/terms/" gzd="http://standaarden.overheid.nl/sru http://standaarden.overheid.nl/sru/gzd.xsd">
               <originalData>
                  <meta xmlns:overheidrg="http://standaarden.overheid.nl/cvdr/terms/">
                     <owmskern>
                        <identifier>CVDR165982_2</identifier>
                        <title>Bouwverordening Tiel 2012</title>
                        <language>nl</language>
                        <type scheme="overheid:Informatietype">regeling</type>
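Records like the one printed above are deeply nested and namespaced, which is why helper functions such as `cvdr_meta` exist. To illustrate the general idea with only the standard library, the sketch below collects the `owmskern` fields by local tag name, ignoring namespaces. `SAMPLE_RECORD` is a hand-closed, abbreviated imitation of the output above (not a real server response), and `owmskern_fields` is a hypothetical helper, not wetsuite's implementation.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-closed sample modeled on the record printed above.
SAMPLE_RECORD = '''<record xmlns="http://www.loc.gov/zing/srw/">
  <recordData>
    <gzd xmlns="http://standaarden.overheid.nl/sru">
      <originalData>
        <meta>
          <owmskern>
            <identifier>CVDR165982_2</identifier>
            <title>Bouwverordening Tiel 2012</title>
            <language>nl</language>
            <type scheme="overheid:Informatietype">regeling</type>
          </owmskern>
        </meta>
      </originalData>
    </gzd>
  </recordData>
</record>'''

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit('}', 1)[-1]

def owmskern_fields(record):
    """Collect the children of every owmskern element as {name: [values]}."""
    fields = {}
    for element in record.iter():
        if local_name(element.tag) == 'owmskern':
            for child in element:
                # Fields may repeat, so each name maps to a list of values.
                fields.setdefault(local_name(child.tag), []).append((child.text or '').strip())
    return fields

record = ET.fromstring(SAMPLE_RECORD)
fields = owmskern_fields(record)
print(fields['identifier'], fields['title'])
```

Keeping a list per field avoids silently dropping repeated values; flattening to a single value, as `flatten=True` does above, is a separate presentation step.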
                   

## Some related code

In [None]:
# there are a bunch of helper functions to help you deal with search results (e.g. parsing metadata and identifiers) 
# and to some degree the documents.  One or two are used above.    TODO: demonstrate more.

# there are also some more specific tools, like:
wetsuite.datacollect.koop_repositories.cvdr_versions_for_work( 'CVDR165982_1' )


amt results: 2


['CVDR165982_1', 'CVDR165982_2']
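The identifiers in this output appear to consist of a work id followed by an underscore and a version number. Under that assumption (inferred from the values shown, not from CVDR documentation), a small sketch for splitting identifiers and picking the highest version could look like this; both function names are hypothetical.

```python
def split_cvdr_id(identifier):
    """Split 'CVDR165982_2' into ('CVDR165982', 2); the version is None if there is no suffix."""
    work_id, sep, version = identifier.partition('_')
    return work_id, (int(version) if sep else None)

def latest_version(identifiers):
    """Return the identifier with the highest version number (assumes numeric suffixes)."""
    return max(identifiers, key=lambda ident: split_cvdr_id(ident)[1] or 0)

versions = ['CVDR165982_1', 'CVDR165982_2']
print(split_cvdr_id(versions[0]))
print(latest_version(versions))
```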

## Officiële publicaties

There is some more technical detail in https://www.koopoverheid.nl/binaries/koop/documenten/instructies/2021/02/09/handleiding-voor-het-uitvragen-van-de-collectie-officiele-publicaties/Handleiding+SRU2.0+v1.2+28052021.pdf, which also touches on BWB details.