# Purpose of this notebook

See what kind of things we can find out from the [Tweede Kamer open data portal](https://opendata.tweedekamer.nl/).

There are two interfaces for this: an [Atom-style API called SyncFeed](https://opendata.tweedekamer.nl/documentatie/syncfeed-api), and an [OData API](https://opendata.tweedekamer.nl/documentatie/odata-api).

The Atom API is a little easier to speak yourself. OData can be a little more thorough, but is more work to use.

The returned formats seem to be XML and JSON.

Either way, there is a [relational data model](https://opendata.tweedekamer.nl/documentatie/informatiemodel) that you should be thinking of,
though in this notebook we focus primarily on the dossiers and documents in them, which stays relatively simple.

Because there is a bunch of data referring to other data, much of the below is trying to explore/show
what kind of interesting things might be in there in the first place.

...because we probably don't want to just fetch everything,
we want to show how to figure out how to get and use the parts you need for a specific purpose.

## Atom/SyncFeed API


In [1]:
import collections, pprint

from wetsuite.datacollect     import tweedekamer_nl # contains some basic code dealing with this API
from wetsuite.helpers         import etree
from wetsuite.helpers         import strings
from wetsuite.helpers         import notebook

#### Any one resource type

Fetch all resources of the mentioned soort/category, save into a single .xml file of its name.

There is, for example, [another notebook](extras_datacollect_tweede_kamer_parties.ipynb) that fetches 'Persoon', 'Fractie', 'FractieZetel', and 'FractieZetelPersoon', to extract who is a member of which party.

If you wanted a record of what gets done in an everyday way, you might care about 'Vergadering', 'Verslag', 'Stemming',
or if you are more interested in documentation, then 'Zaak', 'Document', 'Kamerstukdossier'. 

For a list with explanation, see [its documentation](https://opendata.tweedekamer.nl/documentatie/).
For just a list, see `wetsuite.datacollect.tweedekamer_nl.resource_types`.
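
To make "fetch all resources of a category" a little more concrete, here is a minimal sketch of paging through an Atom feed. It assumes the feed marks its continuation with an Atom `link rel="next"` element; the sample XML, the URLs in it, and the helper names `parse_feed_page` and `fetch_all_pages` are made up for illustration, and this is not necessarily how `tweedekamer_nl.fetch_all` is implemented.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'   # the Atom namespace, in ElementTree's {uri}tag notation

# Made-up sample page, not a real response from the portal
SAMPLE_PAGE = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="next" href="https://example.invalid/Feed?category=Zaal&amp;skiptoken=2"/>
  <entry><id>urn:example:1</id><title>first</title></entry>
  <entry><id>urn:example:2</id><title>second</title></entry>
</feed>'''

def parse_feed_page(xml_text):
    """Return (list of entry elements, URL of the next page or None)."""
    root = ET.fromstring(xml_text)
    entries = root.findall(ATOM + 'entry')
    next_url = None
    for link in root.findall(ATOM + 'link'):
        if link.get('rel') == 'next':
            next_url = link.get('href')
    return entries, next_url

def fetch_all_pages(first_url, fetch):
    """Follow next-links until there are none.
    `fetch` is any function that takes a URL and returns the page's XML text."""
    url, all_entries = first_url, []
    while url is not None:
        entries, url = parse_feed_page(fetch(url))
        all_entries.extend(entries)
    return all_entries
```

Passing the fetch function in as a parameter keeps the paging logic testable without touching the network; in real use it could wrap whatever HTTP client you prefer.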

In [None]:
# For a quick example, though, let's stick to something with simple output

# this function actually does little more than stick that soort on a URL and fetch ...repeatedly if necessary
etrees = tweedekamer_nl.fetch_all( 'Zaal' )
# it returned a list of etree objects,
# so if our goal were to write that into a single XML file, we want to merge that
single_tree = tweedekamer_nl.merge_etrees( etrees )

# There is another function that helps us see each entry's details as a dict
pprint.pprint( tweedekamer_nl.entries( single_tree ) )

print( etree.debug_pretty( single_tree ) )

#xmlstring = wetsuite.helpers.etree.tostring( single_tree )                # then save that 
#with open('%s.xml'%soort, 'wb') as xf:   # technically injection-sensitive, except it's fine as long as we control the values
#    xf.write( xmlstring )       
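
About the "injection-sensitive" remark in the commented-out lines: if the soort ever comes from somewhere you do not control, one way to stay safe is to only build filenames from names on a known list. The sketch below assumes such a list is available (the source mentions `tweedekamer_nl.resource_types`; here a small hand-written list stands in for it), and the helper name `xml_filename_for` is hypothetical.

```python
import re

def xml_filename_for(soort, allowed_soorten):
    """Return '<soort>.xml', but only for names we explicitly know about."""
    if soort not in allowed_soorten:
        raise ValueError('unknown soort: %r' % soort)
    # belt and braces: refuse anything that is not plain letters
    if not re.fullmatch(r'[A-Za-z]+', soort):
        raise ValueError('unexpected characters in soort: %r' % soort)
    return '%s.xml' % soort

# stand-in for a real list of resource types
known = ['Zaal', 'Kamerstukdossier', 'Document']
print( xml_filename_for('Zaal', known) )
```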


In [None]:
from importlib import reload
reload(tweedekamer_nl)   # only useful while editing the module itself; it was imported under this name above
for detail_dict in tweedekamer_nl.entries( single_tree ):
    print(f"{detail_dict['naam']:40s}    {detail_dict['id']}")
#pprint.pprint( tweedekamer_nl.entries( single_tree ) )


#### Kamerstukdossiers

Now let's focus on the kamerstukdossiers.

And see what kind of topics we have in there (...so far purely by looking at the text in their title).

(note: we don't actually use this, it's just inspection)

In [2]:
# Fetch all kamerstukdossiers.    
#   May take half a minute or so to fetch all,
#   because that's ~30 fetches amounting to ~6MByte of XML.
#   and so also let's not print it
single_tree = tweedekamer_nl.merge_etrees( tweedekamer_nl.fetch_all( 'Kamerstukdossier' ) )

In [14]:
verbose = 0

ourtypes = collections.defaultdict(list)
for i, entry_node in enumerate( single_tree ):
    edict = tweedekamer_nl.entry_dict_from_node( entry_node )

    ksd = entry_node.find('content/kamerstukdossier')
    if verbose:
        print()
        print(etree.debug_pretty(ksd))
        pprint.pprint(edict)

    # exception case:   if that attribute is there, there are no contents
    if ksd.get('verwijderd', None) == 'true':   
        ourtypes['[verwijderd]'].append( ksd.get('id') )
        #print("SKIP verwijderd (%s)"%ksd.get('id'))
        continue 

    try:
        titel = ksd.find('titel').text
    except AttributeError:
        print( 'ERROR: no title? (%r)'%entry_node )
        display( notebook.etree_visualize_selection(ksd, '*', mark_subtree=True) )
        continue   # without this, titel would be undefined (or left over from the previous entry)

    if titel is None:
        print( 'ERROR: dossier without titel (%r)'%entry_node )
        display( notebook.etree_visualize_selection(ksd, '*', mark_subtree=True) )
        continue
    titel = titel.strip()

    if strings.contains_any_of(titel, ['wetsvoorstel', 'voorstel van wet'], case_sensitive=False):
        #print( 'LAW       %-7s %20s  %s'%(ksd.get('nummer'), ksd.get('updated'), titel) )
        ourtypes['wet'].append( titel )
        continue

    elif titel.startswith('Wet '):
        ourtypes['wet'].append( titel ) 
        continue

    elif strings.contains_any_of(titel, ['begrotingssta', 'slotwet', 'voorjaarsnota','najaarsnota', 'Financieel jaarverslag'], case_sensitive=False):
        #print( 'BEGROTING %-7s %20s  %s'%(edict.get('nummer'), edict.get('updated'), titel) )
        ourtypes['begroting'].append( titel )   # up here to not accidentally count wijziging in begrotingsstaat as a law
        continue
    elif strings.contains_any_of(titel, ['omzetbelasting'], case_sensitive=False):
        #print( 'BELASTING %-7s %20s  %s'%(edict.get('nummer'), edict.get('updated'), titel) )
        ourtypes['belasting'].append( titel )
        continue

    elif strings.contains_any_of(titel, ['wetswijziging', 'wijziging van wet ', 'wijziging van de wet ', 'aanpassing van de wet',
                                 'Wijziging van de', # followed by a specifically named law   this one is fuzzier than necessary, might be better to regexp-match
                                 ], case_sensitive=False):
        ourtypes['wet'].append( titel )
        continue
    elif strings.contains_all_of(titel, ['wijziging', 'wetboek'], case_sensitive=False):
        ourtypes['wet'].append( titel )
        continue
    elif strings.contains_all_of(titel, ['wijziging', 'wetten'], case_sensitive=False):
        #print( 'LAW       %-7s %20s  %s'%(ksd.get('nummer'), ksd.get('updated'), titel) )
        ourtypes['wet'].append( titel )
        continue
    elif strings.contains_all_of(titel, ['verbeter', 'wetten'], case_sensitive=False):
        ourtypes['wet'].append( titel )
        continue
    elif strings.contains_all_of(titel, ['aanpassing', ' Wet'], case_sensitive=False):
        ourtypes['wet'].append( titel )
        continue

    elif strings.contains_any_of(titel, ['Initiatiefnota','Interpellatie'], case_sensitive=False):
        ourtypes['discussions'].append( titel )
        continue
    elif strings.contains_any_of(titel, ['burgerinitiatief',], case_sensitive=False):
        ourtypes['discussions'].append( titel )
        continue
    elif strings.contains_any_of(titel, ['EU-voorstel', 'EU voorstel', 'EU-mededeling', 'EU-trendrapport']):
        ourtypes['eu'].append( titel )
        continue
    elif strings.contains_any_of(titel, ['Herindeling van de gemeenten',]):
        ourtypes['local'].append( titel )
        continue

    elif strings.contains_any_of(titel, ['mbudsman',]):
        ourtypes['mbudsman'].append( titel )
        continue

    elif strings.contains_any_of(titel, ['Evaluatie',]):
        ourtypes['evaluatie'].append( titel )
        continue

    elif strings.contains_any_of(titel, ['Verdrag',]):
        ourtypes['verdrag'].append( titel )
        continue

    else:
        ourtypes['unsorted'].append( titel )
        #print( 'DUNNO %-7s %20s  %s'%(edict.get('nummer'), edict.get('updated'), titel) )
        continue
        #if re.search('', titel):

    #sru_openpub.search_retrieve_many( 'w.dossiernummer=%s'%edict.get('nummer'), callback=op_callback )


for typ, title_list in ourtypes.items():
    #if typ=='unsorted': # cases for which the title isn't a strong indication -- fair enough, but print them to see if there's any patterns we're missing
    #    pprint.pprint( title_list )
    print(  '%-5d %s'%( len(title_list), typ )  )
#pprint.pprint( dtypes )

2056  wet
2005  unsorted
1710  begroting
437   verdrag
244   discussions
108   evaluatie
290   eu
24    mbudsman
6     local
16    belasting
17    [verwijderd]
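
The long `elif` chain above could also be written as an ordered table of rules, which is easier to extend and reorder. The following is a minimal sketch of that alternative, not the code that produced the counts above: it uses plain string methods rather than `wetsuite.helpers.strings`, includes only a few of the rules, and matches everything case-insensitively (the original matches some terms case-sensitively). The names `RULES` and `classify_title` are made up for this example.

```python
import collections

# (label, mode, needles); the first matching rule wins, so order matters
RULES = [
    ('wet',       'any', ['wetsvoorstel', 'voorstel van wet']),
    ('begroting', 'any', ['begrotingssta', 'slotwet', 'voorjaarsnota', 'najaarsnota']),
    ('wet',       'any', ['wetswijziging', 'wijziging van de']),
    ('wet',       'all', ['wijziging', 'wetboek']),
    ('verdrag',   'any', ['verdrag']),
]

def classify_title(titel, rules=RULES, default='unsorted'):
    """Return the label of the first rule whose needles match the title."""
    lowered = titel.lower()
    for label, mode, needles in rules:
        hits = [needle.lower() in lowered for needle in needles]
        if (mode == 'any' and any(hits)) or (mode == 'all' and all(hits)):
            return label
    return default

# made-up example titles
titles = [
    'Voorstel van wet van het lid X',
    'Wijziging van de begrotingsstaten van het Ministerie',
    'Wijziging van de Wet op het onderwijs',
    'Iets heel anders',
]
counts = collections.Counter( classify_title(t) for t in titles )
for label, count in counts.most_common():
    print( '%-5d %s'%(count, label) )
```

As in the original, the 'begroting' rule sits above the generic 'wijziging van de' rule so that a change to a begrotingsstaat is not counted as a law.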


In [None]:
# TODO: remove this, or decide what to do with it if not
#     print("\n== kamerstukdossier %s =="%entry.get('nummer'))
#     print("titel: %s"%titel) # check whether citeertitle is unused or not
#     print("kamer: %s"%entry.get('kamer')) 
#     print("updated: %s"%entry.get('updated')) 
#     print("afgesloten: %s"%entry.get('afgesloten')) 
#     #'hoogsteVolgnummer': '516', 
#     print( edict )


#     def op_callback(record):
#         recordData     = record.find('recordData')        # the actual record
#         recordPosition = record.find('recordPosition')    # e.g. <recordPosition>12</recordPosition>
#         print( '\n====== record %s ====='%recordPosition.text )
#         payload = recordData[0]
#         print( wetsuite.helpers.etree.tostring( wetsuite.helpers.etree.indent( payload ) ).decode('u8') )

#     sru_openpub.search_retrieve( 'w.dossiernummer=%s'%edict.get('nummer'), callback=op_callback )
#     print( "Number of kamerstukken: %d"%sru_openpub.num_records )



# https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Entiteiten/cab88793-32df-413a-9ca1-51a075c196db

# https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Resources/cab88793-32df-413a-9ca1-51a075c196db


#with open('fractie_membership.json','wb') as jsonfile:
#    jsonfile.write( json.dumps( persoon_combined ).encode('ascii') )

## OData interface

The SyncFeed API is perfectly functional, though it leaves you to do interpretation of the relations yourself, 
so let's see if the OData API is any more help.

There is a helpful library out there, [tkapi](https://github.com/openkamer/tkapi) (MIT license). 
This means we don't have to implement it ourselves.

Note that neither API is a very _fast_ interface, 
or efficient when your goal is sifting through the entire collection of data in ways it was not initially designed for. 
(It also seems it sometimes refuses to connect?)

That is part of why this notebook, and the dataset produced from it, may be useful to some: they just give you the actual list.
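
For a feel of what "speaking OData yourself" would involve, here is a minimal sketch that only builds a query URL. The base URL is the one that appears in the output further down; `$filter`, `$top` and `$select` are standard OData v4 query options. The property names `Nummer` and `Titel` are assumptions about the data model, and `odata_url` is a made-up helper, not part of tkapi.

```python
from urllib.parse import quote

ODATA_BASE = 'https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/'

def odata_url(entity, filter_expr=None, top=None, select=None):
    """Build an OData query URL for one entity set."""
    options = []
    if filter_expr is not None:
        # percent-encode spaces and the like, but keep quotes and brackets readable
        options.append('$filter=' + quote(filter_expr, safe="'()"))
    if top is not None:
        options.append('$top=%d' % top)
    if select:
        options.append('$select=' + quote(','.join(select), safe=','))
    url = ODATA_BASE + entity
    if options:
        url += '?' + '&'.join(options)
    return url

# 'Nummer' and 'Titel' are assumed property names
print( odata_url('Kamerstukdossier', filter_expr='Nummer eq 36200', top=5, select=['Nummer', 'Titel']) )
```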

In [None]:
# if you haven't already
!pip3 install tkapi

In [15]:
import tkapi, tkapi.document   
from tkapi.document import DocumentSoort
api = tkapi.TKApi()

In [16]:
# Get an idea of what this API even does
#   the document types are function calls, to wit:
list(name   for name in dir(api)   if name.startswith('get_'))

['get_activiteiten',
 'get_agendapunten',
 'get_all_items',
 'get_antwoorden',
 'get_besluiten',
 'get_commissies',
 'get_documenten',
 'get_dossiers',
 'get_fractie_zetels',
 'get_fracties',
 'get_geschenken',
 'get_item',
 'get_items',
 'get_kamervragen',
 'get_personen',
 'get_reizen',
 'get_related',
 'get_stemmingen',
 'get_vergaderingen',
 'get_verslagen',
 'get_verslagen_van_algemeen_overleg',
 'get_zaken']

In [20]:
# Okay, let's try that
dossiers = api.get_dossiers( )
len(dossiers)  # a few thousand, which is why the previous line will take a few seconds to fetch

6896

In [41]:
# Let's get a basic summary of dossiers. 
#    we find out apparently  dossier nummers  are not unique without the  toevoeging

# let's also sort by dossier nummer -- and toevoeging, which requires minor syntax-fu right now
sorted_dossiers = sorted(  dossiers,   key=lambda dossier:str(dossier.nummer)+(dossier.toevoeging or '')  )

i=0
for dossier in sorted_dossiers:   #[:200]: # show a few hundred, not all
    nummer_and_toevoeging = ('%s-%s'%(dossier.nummer, dossier.toevoeging or '')).rstrip('-')
    print( "== Dossier %s == %s =="%( nummer_and_toevoeging, dossier.titel) )
    print( '  ',dossier.url.replace(')','%29') ) # the replace is to make the notebook's url include the final bracket
    for doci, document in enumerate(dossier.documenten):
        print( f'    DOC: {document.onderwerp:100s}  {document.bestand_url:30s}'
              )
    #break

== Dossier 17050 == Misbruik en oneigenlijk gebruik op het gebied van belastingen, sociale zekerheid en subsidies ==
   https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Kamerstukdossier(58cf1611-70a0-436a-ba73-f5e0819fd2a9%29
    DOC: Motie Ulenbelt over beoordelingscriteria voor samenwonen                                              https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(29ba2e71-84b6-4b8d-bc0e-0014f7bb5766)/TK.DA.GGM.OData.Resource()
    DOC: Motie van het lid Berndsen-Jansen c.s. over een jaarlijkse nationale frauderisicoanalyse              https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(ca155dba-efc8-4a24-8850-002a349aa2c4)/TK.DA.GGM.OData.Resource()
    DOC: Reactie op berichtgeving over het koppelen van data in verband met fraudeonderzoeken                  https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(66490208-7212-4b91-83a9-01552e52a243)/TK.DA.GGM.OData.Resource()
    DOC: Tweede voortgangsbrief herijking handhavingsins
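
A side note on the sort key used above: concatenating nummer and toevoeging into one string sorts lexicographically, so a four-digit number would end up after every five-digit one. If that matters, a tuple key is one alternative. The sketch below uses made-up stand-in records rather than real tkapi objects, and assumes `nummer` is (or can be converted to) an integer.

```python
from collections import namedtuple

# stand-in for tkapi dossier objects; the values are made up
Dossier = namedtuple('Dossier', ['nummer', 'toevoeging'])

def dossier_sort_key(dossier):
    """Sort numerically by nummer, then alphabetically by toevoeging (missing counts as '')."""
    return (int(dossier.nummer), dossier.toevoeging or '')

example_dossiers = [Dossier(36200, 'XVI'), Dossier(9999, None), Dossier(36200, None), Dossier(17050, None)]

by_string = sorted(example_dossiers, key=lambda d: str(d.nummer) + (d.toevoeging or ''))
by_tuple  = sorted(example_dossiers, key=dossier_sort_key)

print( [d.nummer for d in by_string] )   # string order puts 9999 last
print( [d.nummer for d in by_tuple] )    # numeric order puts 9999 first
```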

Let's say that our interest is more specific:
finding what  Raad van State  has to say about  proposed laws (wetsvoorstellen).

...and, in the process, also learn what kinds of documents there are in each dossier.
 
There is also the [advice on the Raad van State site](https://www.raadvanstate.nl/adviezen/)
(for a more data-like form, see also our [extras_datacollect_raadvanstate](extras_datacollect_raadvanstate.ipynb)),
but there it is not placed in the context of the law it's referring to.
This interface should at least give us the law's name.

In [42]:
# We start by selecting dossiers where there already _is_ RvS advice.
#  - this is a decent filter for wetsvoorstellen
#  - and filters out wetsvoorstellen that don't need this advice (e.g. begroting)
# ...but we are about to find out
# - there are other things that RVS advises on, like finances (see e.g. 36200) 
# - there are law changes that RVS does not advise on (e.g. TODO)

sorted_dossiers = sorted(dossiers,  key=lambda d:d.nummer,  reverse=True )

count = 0
for i, dossier in enumerate( sorted_dossiers ):
    nummer_and_toevoeging = ('%s-%s'%(dossier.nummer, dossier.toevoeging or '')).rstrip('-')

    #if (dossier.nummer%100) == 0: # ignore a few specific special cases for now,   just because they're large to print
    #    continue

    ## In our stated interest:  first see if it has RvS advice
    sorted_docs      = sorted(dossier.documenten,  key=lambda d:d.volgnummer )
    has_raadvanstate = False
    for document in sorted_docs:
        try:
            # these come from an enum, try  list( tkapi.document.DocumentSoort )  to see a list
            if document.soort in (DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE, 
                                  #DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_NADER_RAPPORT, # seems to be begrotingstuff?  (TODO: check)
                                  DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_REACTIE_VAN_DE_INITIATIEFNEMERS,
                                ):
                has_raadvanstate = True
        except ValueError: # there's some invalid / non-covered soort values in the data
            pass # ignore
        # we can filter on more, but we may not need to?

    if not has_raadvanstate:
        continue
    # if execution gets here, it's probably interesting to us.
    
    count += 1
    #if len(sorted_docs)>500:
    #    print( "\n\n== Dossier %s == %s =="%( dossier.nummer, dossier.titel) )
    #    print(' LARGE: %d documents'%len(sorted_docs))
    #    print(' %s ({{kamerdossier|%d}}'%(dossier.titel, dossier.nummer))
    #    continue

    print( "\n== %r == Dossier %s == %d docs == %s =="%( dossier.id, nummer_and_toevoeging, len(dossier.documenten), dossier.titel) )
    for document in sorted_docs:
        try:
            if 0: # just to make the summaries a little easier to read
                if document.soort in (DocumentSoort.MOTIE, DocumentSoort.AMENDEMENT, DocumentSoort.BRIEF_REGERING, DocumentSoort.VERSLAG_VAN_EEN_ALGEMEEN_OVERLEG,
                                    DocumentSoort.MEMORIE_VAN_TOELICHTING_INITIATIEFVOORSTEL,
                                    ):
                    continue
        except ValueError:
            print( "soort not known by tkapi")
            continue

        try:
            docsoort = document.soort
        except ValueError: # this seems to be internal inconsistency
            continue

        show_all_docs = False
        if show_all_docs or docsoort in (DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE, 
                                DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_REACTIE_VAN_DE_INITIATIEFNEMERS):

            print( '#%s'%(document.volgnummer, ), document.soort.name)
            #print( 'soort', document.soort.name, '(%s)'%document.soort.value )
            print( '  onderwerp    ', document.onderwerp )     # for wetsvoorstel-dossiers, seems to often be the same as soort plus some detail (who a letter is from, who )
            print( '  citeertitel  ', document.titel_citeer ) # for wetsvoorstel-dossiers, this often seems to name the law. Or a related one, see e.g. 36195
            print( '  titel        ', document.titel )              # for wetsvoorstel-dossiers, this seems to often name the law, plus sometimes some reason
            #print( 'versies', document.versies )
            print( '  url          ', document.bestand_url )
            if 0:        # It may be interesting to know the document is part of multiple dossiers and/or multiple zaken
                print( '  zaken         ', document.zaken )
                #nums = document.dossier_nummers
                #nums.pop(dossier.nummer)
                #if len(nums)>0:
                #  print( "  also in dossiers: %s"%nums )

            print()

    #if i > 1000: # show only a bunch, not all
    #    print("break %d"%i)
    #    break
print( 'Interesting cases: %d'%count )


== '87a7ff74-bf81-4e71-a69e-2439ce536c2c' == Dossier 36346 == 5 docs == Voorstel van wet van het lid Van Houwelingen betreffende het houden van een raadplegend referendum over het Nederlandse lidmaatschap van de Europese Unie (Wet raadplegend referendum Nederlands EU-lidmaatschap) ==
#4 ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_REACTIE_VAN_DE_INITIATIEFNEMERS
  onderwerp     Advies Afdeling advisering Raad van State en Reactie van de initiatiefnemer
  citeertitel   Wet raadplegend referendum Nederlands EU-lidmaatschap
  titel         Voorstel van wet van het lid Van Houwelingen betreffende het houden van een raadplegend referendum over het Nederlandse lidmaatschap van de Europese Unie (Wet raadplegend referendum Nederlands EU-lidmaatschap)
  url           https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(bd0dbb7d-bbbe-4aa3-8e28-ba55df7756e3)/TK.DA.GGM.OData.Resource()


== '49f64470-6e41-4115-9547-420f5e3b6a3e' == Dossier 36200 == 188 docs == Nota over de toestand van ’

KeyboardInterrupt: 
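
The "does this dossier contain a document of one of these soorten" test in the cell above can be pulled into a small helper, which also gives the `ValueError` handling a single home. This is a minimal sketch using stand-in classes instead of real tkapi objects: `Soort` and `FakeDocument` are made up to mimic an enum-valued `soort` property that raises `ValueError` for values the enum does not cover, as the cell above suggests happens.

```python
from enum import Enum

class Soort(Enum):
    """Stand-in for tkapi.document.DocumentSoort, with just two members."""
    ADVIES_RVS = 'Advies Afdeling advisering Raad van State'
    MOTIE      = 'Motie'

class FakeDocument:
    """Stand-in for a tkapi document; soort raises ValueError for unknown values."""
    def __init__(self, raw_soort):
        self._raw_soort = raw_soort
    @property
    def soort(self):
        return Soort(self._raw_soort)

def safe_soort(document):
    """Return the document's soort, or None if the value is not covered by the enum."""
    try:
        return document.soort
    except ValueError:
        return None

def has_any_soort(documents, wanted):
    """True if at least one document has a soort in `wanted`."""
    return any( safe_soort(doc) in wanted  for doc in documents )

docs = [FakeDocument('Motie'), FakeDocument('Iets onbekends'), FakeDocument('Advies Afdeling advisering Raad van State')]
print( has_any_soort(docs, {Soort.ADVIES_RVS}) )
```

With a helper like this, the dossier loop only needs one `if not has_any_soort(...): continue` line, and `any()` stops at the first match rather than scanning every document.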

## Actually making data from that

TODO: decide what