## Purpose of this notebook



Fetch responses to [Woo](https://nl.wikipedia.org/wiki/Wet_openbaarheid_van_bestuur#Wet_open_overheid) requests, specifically
the subset available at [rijksoverheid.nl/documenten](https://www.rijksoverheid.nl/documenten?type=Woo-besluit) when you filter for `type=Woo-besluit`.

At a glance:
* Browse result pages look like https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&pagina=45

* A case's detail page looks like: https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/05/22/besluit-op-woo-verzoek-over-alle-eendenhouderijen-nederland

* A response document looks something like https://open.overheid.nl/documenten/76f1a787-3f8a-452c-bf58-5911e1a89bcd/file 

However, there is variation both in how detail pages work, and how the fetched documents are structured.

Consider:
* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/20/besluit-op-woo-verzoek-over-cites-2-b-soorten
  - is an overall rejection so only has a decision document

* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/22/besluit-op-woo-verzoek-over-de-vergaderingen-van-de-kerngroep-bloembollen
  - has a single document that is decision + inventory + contents

* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/14/besluit---woo-verzoek-ongebruikelijke-transacties-estland-letland-en-litouwen
  - has a separate decision document, and a document that holds the contents of the response, here a bunch of tables

* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/15/besluit-op-woo-verzoek-over-correspondentie-naar-aanleiding-van-fouten-in-de-lijst-met-top-100-ammoniakuitstoters
  - has a separate decision, and the bijlage/documents entry is actually a link to _another_ detail page first (which breaks our mostly-correct assumption that every page is a case)

* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/21/besluit-op-woo-verzoek-over-vertraging-van-de-bouw-van-stikstofinstallatie-zuidbroek-ii
  - has a separate decision. It also separates the inventaris and seven content documents, each via their own detail page.

* https://open.overheid.nl/documenten/a645fe8c-6c58-4e75-b355-aa3a97023eb8/file
  - has a document that is the decision + inventory, and points out that the real data consists of large files that should be requested via mail

* https://open.overheid.nl/documenten/107e45ab-6533-4bda-bfc2-816ce107906e/file
  - is images of a besluit document. The PDF contains no text layer / OCR.

* https://open.overheid.nl/documenten/ronl-439dfebe8cffecb9a385633cb757ced59de469ee/pdf
  - has one page of OCR-less image-of-text, then goes on to actual text in the middle of a sentence


As such, creating a dataset with more consistency than that will take some creativity.

Let's only care about the decision motivation for now, and not the attached document(s), so that most of the above considerations can be ignored.
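Even when only the decision document is kept, the missing-text-layer cases listed above still matter, because an image-only page yields (almost) no extracted text. A minimal sketch of flagging such pages, assuming you already have one extracted string per page. The function name and the `min_chars` threshold are hypothetical choices for illustration, not part of wetsuite:

```python
def pages_lacking_text(pages_text, min_chars=20):
    """Return the indices of pages whose extracted text is suspiciously short.

    pages_text: list of strings, one per page, as reported by the PDF itself.
    min_chars:  assumed threshold; pages with fewer non-whitespace characters
                than this are treated as probable images of text.
    """
    flagged = []
    for index, text in enumerate(pages_text):
        visible = ''.join(text.split())   # drop all whitespace before counting
        if len(visible) < min_chars:
            flagged.append(index)
    return flagged


# Example: the first page is an image (no text), the second has a real text layer
example_pages = ['', 'Geachte heer, mevrouw, hierbij ontvangt u het besluit op uw verzoek.']
print(pages_lacking_text(example_pages))
```

Pages flagged this way are candidates for OCR, or for leaving out of a text dataset; the right threshold probably needs tuning on real documents.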

In [1]:
import re
import urllib.parse
import datetime
import warnings
import pprint
import os
import time
import random

import bs4, dateutil.parser
import tqdm.autonotebook   # progress bars, used in the loops further down

import wetsuite.helpers.net
import wetsuite.helpers.strings
import wetsuite.helpers.notebook
import wetsuite.helpers.localdata
import wetsuite.helpers.patterns
import wetsuite.extras.pdf

### Fetch search/browse result pages

This section fetches only the search summaries and links, not yet the documents.

None of this part is cached, as we assume new entries are added regularly,
so each run fetches an up-to-date version of the search results.

In [2]:
result_pages_to_fetch = set()   # set of urls
# since we can pick up the other-page links, we can add the first page and it'll pick up the rest of the results. 
# It turns out browsing (search without parameters) like
#result_pages_to_fetch.add('https://www.rijksoverheid.nl/documenten?type=Woo-besluit&pagina=1')
# ...will never show more than 50 pages, *10 = 500 cases,
#   so we have to ensure we have few enough results in each search. 
#   The easiest way seems to do that in date increments.

interval_start = datetime.date(2020, 1, 1)
days_increment = 60

while interval_start < datetime.date.today():
    interval_end = interval_start + datetime.timedelta(days=days_increment)
    result_pages_to_fetch.add(
        #'https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&dateRange=specific&startdatum=%s&einddatum=%s'%(
        'https://www.rijksoverheid.nl/documenten?type=Woo%%2Dbesluit&startdatum=%s&einddatum=%s'%(
            interval_start.strftime('%d-%m-%Y'),  # e.g. 01-01-2022
            interval_end.strftime('%d-%m-%Y'),
        )
    )
    interval_start = interval_end

# We _could_ also fetch Wob (type=Wob-verzoek  instead of type=Woo-besluit) but are not currently interested

# see if those ranges make basic sense before starting fetches
#pprint.pprint( result_pages_to_fetch )
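
One way to check that the date ranges make sense before fetching is to generate them with a small helper and assert that consecutive windows line up. This is a sketch of the same idea as the loop above; the function names `date_windows` and `window_url` are hypothetical:

```python
import datetime

def date_windows(start, end, days_increment=60):
    """Yield (window_start, window_end) date pairs covering start..end.

    Each window starts where the previous one ended;
    the last window may extend past `end`.
    """
    window_start = start
    while window_start < end:
        window_end = window_start + datetime.timedelta(days=days_increment)
        yield window_start, window_end
        window_start = window_end


def window_url(window_start, window_end):
    # '%%2D' becomes a literal '%2D' (a URL-encoded '-') after %-formatting
    return 'https://www.rijksoverheid.nl/documenten?type=Woo%%2Dbesluit&startdatum=%s&einddatum=%s' % (
        window_start.strftime('%d-%m-%Y'),
        window_end.strftime('%d-%m-%Y'),
    )


windows = list(date_windows(datetime.date(2020, 1, 1), datetime.date(2020, 7, 1)))
# each window should start exactly where the previous one ended
for (_, previous_end), (next_start, _) in zip(windows, windows[1:]):
    assert previous_end == next_start
print(window_url(*windows[0]))
```

Consecutive windows share their boundary date. If the site treats both bounds as inclusive (an assumption, not verified here), a case published on a boundary date shows up in two searches, which is harmless as long as cases are later keyed by their URL.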

In [3]:
result_pages_fetched = {}       # url -> bs4 object       because we need to parse result pages as we go to find more

while len(result_pages_to_fetch) > 0:   # we add numbered pages as we go.  At some point (currently 50 pages) there will be no more to add or fetch
    result_page_url = result_pages_to_fetch.pop()                 # pick one
    print(result_page_url)
    page_bytes = wetsuite.helpers.net.download( result_page_url ) # fetch,
    soup = bs4.BeautifulSoup(page_bytes)                          # parse the HTML  (see bs4 documentation)
    result_pages_fetched[result_page_url] = soup                  # cache for later use
    for a in soup.select("ul.paging__numbers li a"):              # look for links to other pages
        other_page_url = a.get('href')
        if 'pagina' in other_page_url  and  other_page_url not in result_pages_fetched:
            result_pages_to_fetch.add(other_page_url)             # add them to the 'still to fetch' list as we are fetching
            if 'pagina=50' in other_page_url:
                warnings.warn('Arrived at page 50 for a search, chances are we are missing some data')
    #time.sleep(1)  # be slightly nice to the server  (actually makes up most of the time spent)
    # CONSIDER have progress bar -- with accordingly varying max   #len(result_pages_to_fetch)

https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=16-12-2022&einddatum=14-02-2023


https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=29-06-2020&einddatum=28-08-2020
https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=16%2D12%2D2022&einddatum=14%2D02%2D2023&pagina=5
https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=26-12-2020&einddatum=24-02-2021
https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=25-04-2021&einddatum=24-06-2021
https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=30-04-2020&einddatum=29-06-2020
https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=18-08-2022&einddatum=17-10-2022
https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=22-10-2021&einddatum=21-12-2021
https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=18%2D08%2D2022&einddatum=17%2D10%2D2022&pagina=6
https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=24-02-2021&einddatum=25-04-2021
https://www.rijksoverheid.nl/documenten?type=Woo
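
Note that the output shows the same date range spelled two ways: `16-12-2022` in the URLs we built, and `16%2D12%2D2022` in the paging links picked up from the pages. The "already fetched" check compares URL strings literally, so two spellings of one page would count as different. A minimal sketch of normalizing URLs before comparing them, using only the standard library (the function name is hypothetical, and whether this actually causes duplicate fetches here has not been verified):

```python
import urllib.parse

def normalize_result_url(url):
    """Return a canonical spelling of a URL, for comparison purposes only.

    Decodes and re-encodes the query string and sorts its parameters, so that
    '16-12-2022' and '16%2D12%2D2022' end up identical.
    """
    parts = urllib.parse.urlsplit(url)
    query_pairs = urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
    canonical_query = urllib.parse.urlencode(sorted(query_pairs))
    return urllib.parse.urlunsplit(
        (parts.scheme, parts.netloc, parts.path, canonical_query, parts.fragment)
    )


a = 'https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=16-12-2022&einddatum=14-02-2023'
b = 'https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&startdatum=16%2D12%2D2022&einddatum=14%2D02%2D2023'
print(normalize_result_url(a) == normalize_result_url(b))
```

The normalized form could serve as the key of a "seen" set, while the original URL is still the one that gets fetched.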

### Fetch case detail pages

Each result page lists a number of cases.
Each case links to a detail page,
which typically contains a short paragraph and links to one or more PDFs.

So based on the above, we can
- fetch and cache those detail pages
- fetch and cache the PDFs those detail pages link to

We cache the documents because they amount to gigabytes: currently ~2500 cases, each with one or more PDFs of several MByte, which comes to maybe 40GByte.
That amount of data should take an hour or two to fetch from scratch, longer if you're being nice to the servers.

Caching the _detail_ pages is based on the assumption that these are finished, not evolving cases; TODO: check that.
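
The code below uses `wetsuite.helpers.localdata.cached_fetch`, which, judging by how it is called here, returns a `(bytes, came_from_cache)` pair. The following is not that implementation, just a sketch of the fetch-once idea with a plain dict as the store and a hypothetical `fetch_function` argument:

```python
def cached_fetch_sketch(store, url, fetch_function):
    """Return (data, came_from_cache) for url, fetching only on a cache miss.

    store:          any dict-like object mapping url -> bytes (a plain dict here)
    fetch_function: hypothetical callable that takes a url and returns bytes
    """
    if url in store:
        return store[url], True
    data = fetch_function(url)
    store[url] = data
    return data, False


# Example with a fake fetcher, so nothing touches the network
calls = []
def fake_fetch(url):
    calls.append(url)
    return b'<html>detail page</html>'

store = {}
first = cached_fetch_sketch(store, 'https://example.org/case/1', fake_fetch)
second = cached_fetch_sketch(store, 'https://example.org/case/1', fake_fetch)
print(first[1], second[1], len(calls))
```

A cache like this never revalidates, which is exactly why the "finished cases" assumption above matters.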

In [5]:
# some of these stores are only used later than others
# fetched
woo_detail_pages     = wetsuite.helpers.localdata.LocalKV('woo-besluiten_detailpages.db', key_type=str,value_type=bytes)    # url -> page_bytes
woo_linked_docs      = wetsuite.helpers.localdata.LocalKV('woo-besluiten_docs.db',        key_type=str,value_type=bytes)    # url -> content_bytes
# collected
woo_metadata         = wetsuite.helpers.localdata.MsgpackKV('woo-besluiten_meta.db',      key_type=str,value_type=None)     # case -> metadata_dict
# generated
woo_linked_docs_txt  = wetsuite.helpers.localdata.LocalKV('woo-besluiten_docs_txt.db',    key_type=str,value_type=str)      # url -> text

In [6]:
# summarizing metadata into a dataset
#woo_metadata.truncate()


for result_page_url, result_page_soup in tqdm.autonotebook.tqdm( result_pages_fetched.items(), desc='pages'):
    # Take each parsed search/browse-result page, find the HTML links to the detail pages
    #   Note that this is scraping, so much of this is contingent on the generated HTML not doing a structural change.
    #print( 'DOING PAGE', result_page_url )

    for li in result_page_soup.select('main ol.results li.results__item'):
        # each result item on that page is mostly a short summary, 
        #   and a link to a detail page at another URL, which duplicates most information so we only focus on the detail page

        a = li.select('a.publication')[0] # assumes there is just one a in the item / li
        
        detail_page_absurl = urllib.parse.urljoin( result_page_url, a.get('href') ) # relative to the page, so resolve it relative to the page URL we're on
        print( detail_page_absurl )

        ## Fetch that detail HTML page and parse various metadata out of that detail-page HTML.   
        try:   
            # cached fetch; we assume it's answered once and won't change over time -- TODO: check that's true, we could in theory adhere to HTTP caching rules better
            #detail_page_bytes = wetsuite.helpers.net.download( detail_page_absurl )
            detail_page_bytes, came_from_cache = wetsuite.helpers.localdata.cached_fetch( woo_detail_pages, detail_page_absurl )
        except ValueError as ve:
            print('SKIP, error  %s  while fetching  %r'%(ve, detail_page_absurl))
            continue
        
        soup = bs4.BeautifulSoup( detail_page_bytes )
        entry_metadata = {
            'detail_page_url':       detail_page_absurl,
            'title':                 soup.find('h1').text.strip(),
            'response_document_url': None,                                    # the decision to the request, with reasoning
            'attachments':           [],   # - zero or more documents (each downloaded via its own HTML page). See notes above on the variation of what is in here
            'dates':                 [],   
            'onderwerpen':           [],   # subject       (div.linkBlock)
            'responsible':           None, # who responded (div.belongsTo).   not a list, only ever one
        }

        # fish out the dates,
        for mt in soup.select('p.article-meta, p.meta'):
            mtt = mt.text.strip()
            if 'pagina' in mtt:
                continue
            if '|' in mtt:
                mtd = mtt.split('|',1)[1].strip()
                try:
                    entry_metadata['dates'].append( dateutil.parser.parse(mtd).strftime('%Y-%m-%d') ) # reformat ISO style 
                except (ValueError, OverflowError): # what dateutil raises for unparseable input
                    print("WARN: didn't understand %r as date"%mtd)
        # subject and answerer(?),
        for lb in soup.select('div.linkBlock a'):
            entry_metadata['onderwerpen'].append( lb.text.strip() )
        for bt in soup.select('div.belongsTo a'):
            entry_metadata['responsible'] = bt.text.strip()

        ## Many pages seem to follow this format
        #  main document, often the decision 
        alist = soup.select('div.article.content div.intro a')
        if len(alist) == 0:
            pass
            #print( "no div.intro a" )
        else: # assume it's len 1, CONSIDER: check
            besluit_a = alist[0]
            besluit_absurl = urllib.parse.urljoin( detail_page_absurl, besluit_a.get('href') ) # urljoin in case they're relative (I think they're not)
            entry_metadata['response_document_url'] = besluit_absurl
        #  additional attachment documents (note: links to the respective download pages, not to the documents themselves)
        for attachment_li in soup.select('div.results ul.common li'):
            attachment_a = attachment_li.find('a')
            entry_metadata['attachments'].append( (urllib.parse.urljoin( detail_page_absurl, attachment_a.get('href') ),
                                                   attachment_a.find('h3').text.strip()) )

        ## ...unless the page is different.
        #   in that case both of the above blocks will have collected nothing (so no real need to make this conditional)
        #   and this block probably will instead
        for adlc in soup.select('div.download a.download-chunk'):
            adlc_absurl = urllib.parse.urljoin( detail_page_absurl, adlc.get('href') ) 
            name = adlc.find('h2').text.strip()

            if name.startswith("Download '"): # clean up link text like  "Download 'title'"  to  "title"
                name = name[10:].rstrip("'")

            #if name.lower().startswith('bijlage') or 'inventaris' in name.lower():
            #    pass

            # whitelist style to get complaints rather than silently ignoring
            if re.search('^(besluit|([12345]e )?deelbesluit|woo-besluit|aanvullend besluit|herstelbesluit|beslissing)', name.lower()) is not None:
                if entry_metadata.get('response_document_url', None) is None:
                    entry_metadata['response_document_url'] = adlc_absurl
                else: # assume it's an attachment?
                    entry_metadata['attachments'].append( (adlc_absurl, name) )

            else: # assume it's an attachment?
                entry_metadata['attachments'].append( (adlc_absurl, name) )

        ## complain  if we still didn't find any document links
        if entry_metadata['response_document_url'] is None and len(entry_metadata['attachments']) == 0:
            if '[ingetrokken]' in entry_metadata['title'].lower():
                print('NOTE: no document, seems fine because [ingetrokken], on %r'%detail_page_absurl)
                # actually there may still be a link if retracted, see https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/07/05/besluit-op-woo-verzoek-over-documentatie-tussen-ministerie-van-ezk-en-energiebedrijven-rwe-en-uniper
            else:
                print("WARN: no document links (%s, %s) found on %r"%(
                    entry_metadata['response_document_url'], 
                    len(entry_metadata['attachments']),
                    detail_page_absurl,
                    )) # there are a few cases like this, that's fine

        # store the entry's metadata into a store
        woo_metadata.put( detail_page_absurl, entry_metadata )

pages:   0%|          | 0/281 [00:00<?, ?it/s]

https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/06/17/besluit-op-woo-verzoek-betreffende-een-nota-over-de-schikking-met-farmers-defence-force
https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/06/17/besluit-op-wob-verzoek-over-leegstaande-gebouwen-in-eigendom-van-het-rijk
https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/06/17/besluit-wob-verzoek-plannen-koninklijk-conservatorium-locatie-den-haag
https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/06/17/besluit-op-wob--woo-verzoek-over-meldingen-hondenhandelaar
https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/06/16/besluit-op-wob-woo-verzoek-over-kopie-van-verzoek-tot-legalisatie-middels-het-programma-legalisatie-pas-meldingen-gedaan-door-lelystad-airport
https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/06/16/besluit-op-wob--woo-verzoek-over-konikpaarden-in-de-oostvaardersplassen-20-0847
https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/06/16/besluit-wob-verzoek-gestorte-g
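
One thing worth verifying in the date handling above: `dateutil.parser.parse` reads ambiguous numeric dates month-first unless told otherwise (it has a `dayfirst` parameter). If the meta lines carry dates in Dutch dd-mm-yyyy order, which is an assumption not checked here, a date like 05-06-2023 would come out as May 6 rather than June 5. A stricter sketch using only the standard library, with a hypothetical function name:

```python
import datetime

def parse_day_first(date_text):
    """Parse 'dd-mm-yyyy' (assumed format) and return an ISO 'yyyy-mm-dd' string.

    Raises ValueError for anything that does not match that format.
    """
    parsed = datetime.datetime.strptime(date_text.strip(), '%d-%m-%Y')
    return parsed.strftime('%Y-%m-%d')


print(parse_day_first('22-06-2022'))   # unambiguous: 22 cannot be a month
print(parse_day_first('05-06-2023'))   # ambiguous unless the order is known
```

Being strict about the format means unexpected date spellings raise a `ValueError` instead of being silently misread.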

In [7]:
print( len( woo_metadata ) )

2647


In [8]:
# Basic inspection
list( woo_metadata.items() )[:3]

[('https://www.rijksoverheid.nl/documenten/wob-verzoeken/2022/06/22/besluit-op-wob-verzoek-inzake-taskforce',
  {'detail_page_url': 'https://www.rijksoverheid.nl/documenten/wob-verzoeken/2022/06/22/besluit-op-wob-verzoek-inzake-taskforce',
   'title': 'Besluit op Wob-verzoek inzake Taskforce IND',
   'response_document_url': 'https://open.overheid.nl/repository/ronl-439dfebe8cffecb9a385633cb757ced59de469ee/1/pdf/Besluit%20op%20Wob-verzoek%20Taskforce.pdf',
   'attachments': [],
   'dates': ['2022-06-22'],
   'onderwerpen': [],
   'responsible': 'Ministerie van Justitie en Veiligheid'}),
 ('https://www.rijksoverheid.nl/documenten/wob-verzoeken/2022/09/01/besluit-op-woo-verzoek-over-de-aan-de-npo-verleende-vergunning-voor-gebruik-van-frequentieruimte-voor-digitale-omroep',
  {'detail_page_url': 'https://www.rijksoverheid.nl/documenten/wob-verzoeken/2022/09/01/besluit-op-woo-verzoek-over-de-aan-de-npo-verleende-vergunning-voor-gebruik-van-frequentieruimte-voor-digitale-omroep',
   'title'
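
Beyond eyeballing a few entries, a quick tally can show how the cases are distributed. A minimal sketch that counts cases per `responsible` value; the sample below is a hypothetical miniature stand-in for `woo_metadata.items()`:

```python
import collections

def count_by_responsible(metadata_items):
    """Count cases per 'responsible' value; missing values count as '(unknown)'."""
    counts = collections.Counter()
    for _case_url, entry in metadata_items:
        counts[entry.get('responsible') or '(unknown)'] += 1
    return counts


# Hypothetical stand-in for woo_metadata.items()
sample_items = [
    ('case-1', {'responsible': 'Ministerie van Justitie en Veiligheid'}),
    ('case-2', {'responsible': 'Ministerie van Justitie en Veiligheid'}),
    ('case-3', {'responsible': None}),
]
for name, count in count_by_responsible(sample_items).most_common():
    print(count, name)
```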

#### Data

The above stored the metadata; now for the documents.

The response PDFs can be relatively large, and contain a lot of images of text.
  Providing them as-is would lead to a dataset of many gigabytes that wouldn't compress well.

Assuming research interest is in the argument,
  we might as well create a dataset of extracted text (the images-of-text are usually the content documents).
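
Extracted text tends to include page furniture such as page numbers. A small sketch of the light cleanup applied in this notebook, wrapped in a function so it can be tried on a sample string (the function name is hypothetical; the pattern is the one used here):

```python
import re

# A line starting with 'Pagina' or 'Paginanummer', a number,
# and optionally 'van <total>'
PAGE_NUMBER_LINE = re.compile(r'\n[Pp]agina(?:nummer)?\s+[0-9]+(?:\s+van\s[0-9]+)?')

def strip_page_number_lines(page_text):
    """Replace page-number lines with a single space."""
    return PAGE_NUMBER_LINE.sub(' ', page_text)


example = 'Het verzoek wordt gedeeltelijk toegewezen.\nPagina 3 van 12'
print(repr(strip_page_number_lines(example)))
```

The pattern does not check where on the page the line occurs, so body text that happens to start a line with "Pagina 3" would also be removed.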

In [9]:
woo_linked_docs_txt.truncate()

#for detail_page_url, entry_metadata in random.sample( woo_metadata.items(), 20 ):
for detail_page_url, entry_metadata in tqdm.autonotebook.tqdm( woo_metadata.items(), desc='cases' ):
    
    ### Check that the detail page actually did link to a response document  (arguably can/should be done earlier)        
    response_document_url = entry_metadata['response_document_url']
    if response_document_url is None: # (no need here anymore?)
        print('SKIP CASE; detail page seems to have no response doc?   %s'%entry_metadata)
        continue


    # Fetch the document that we thought was the decision
    #   note this is sometimes dozens of megabytes - there are some 300-page documents in there
    try:
        doc_bytes, came_from_cache = wetsuite.helpers.localdata.cached_fetch( woo_linked_docs, response_document_url )
    except ValueError as ve:
        print( 'SKIP CASE; response doc failed to fetch: %s for %r   on detail page %r'%(ve, response_document_url, detail_page_url ) )
        continue

    ### Check that it's a PDF
    if not doc_bytes.startswith(b'%PDF-'):
        print( "SKIP CASE; not sure what kind of response document %r is, first bytes are %r"%(
            response_document_url, doc_bytes[:25]) )
        continue
    
    # Okay, we have a document and it's a PDF.  Extract page text as reported by the PDF itself   (no OCR necessary for most cases)
    pages_text = list( wetsuite.extras.pdf.page_text( doc_bytes ) ) # (explicit list() because generator)

    # The smallest of cleanup:  try to remove the page number lines on each page  (which should only appear in the footer - not that we're testing for that)
    pages_temp = []
    for page_text in pages_text:
        ptemp = re.sub( r'\n[Pp]agina(?:nummer)?\s+[0-9]+(?:\s+van\s[0-9]+)?', ' ', page_text )
        #ptemp = re.sub( r'\n\s*[1-9][0-9]?}[\s\n]*\Z', ' ', ptemp, flags=re.M ) # seems a little too fuzzy without also using location
        pages_temp.append( ptemp )
    pages_text = pages_temp

    all_text = '\n'.join( pages_text )

    woo_linked_docs_txt.put(response_document_url, all_text)

cases:   0%|          | 0/2647 [00:00<?, ?it/s]

SKIP CASE; detail page seems to have no response doc?   {'detail_page_url': 'https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/04/25/woo-verzoek-betreft-verzoek-om-stukken-met-betrekking-tot-forensisch-medisch-onderzoek', 'title': 'Woo verzoek betreft verzoek om stukken met betrekking tot Forensisch medisch onderzoek', 'response_document_url': None, 'attachments': [['https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/woo-besluiten/2022/04/25/woo-verzoek-betreft-verzoek-om-stukken-met-betrekking-tot-forensisch-medisch-onderzoek/Besluit+op+Woo-verzoek+mbt+betrekking+tot+Forensisch+medisch+onderzoek.pdf', 'Woo verzoek betreft verzoek om stukken met betrekking tot Forensisch medisch onderzoek']], 'dates': ['2022-04-25'], 'onderwerpen': [], 'responsible': 'Ministerie van Justitie en Veiligheid'}
SKIP CASE; detail page seems to have no response doc?   {'detail_page_url': 'https://www.rijksoverheid.nl/documenten/woo-besluiten/2022/06/08/besluit-op-wob-verzoek-over-meitell

In [17]:
# These are headed out to be datasets, so set some supporting store metadata

#### Meta
woo_metadata._put_meta('valtype','msgpack')
woo_metadata._put_meta('description','''

This dataset tries to focus on the reactions to Woo requests.

Its .data is a map 
- from a unique case identifier (not necessarily meaningful, currently happens to be the case's URL on the website, looks like 'https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/08/31/besluit-op-woo-verzoek-geen-documenten-over-gewasbeschermingsregistratie')
- to a dict like:
    {'detail_page_url':       'https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/08/31/besluit-op-woo-verzoek-geen-documenten-over-gewasbeschermingsregistratie',
     'title':                 'Besluit op Woo-verzoek geen documenten over gewasbeschermingsregistratie',
     'response_document_url': 'https://open.overheid.nl/documenten/ef04ac22-1eb9-4b5c-a439-39fb3b636fdf/file',
     'attachments':           [],
     'dates':                 ['2023-08-31'],
     'onderwerpen':           ['Bestrijdingsmiddelen'],
     'responsible':           'Ministerie van Landbouw, Natuur en Voedselkwaliteit'
    }

You are probably also interested in wetsuite.datasets.load()-ing the related dataset that has the text for the response documents (fetched by response_document_url)

Both could use some refinement.   TODO: clean up both the woo metadata and woo text datasets                       
''')


#### Text
woo_linked_docs_txt._put_meta('description','''
This dataset tries to focus on the reactions to Woo requests.
                                                     
You will probably want the Woo metadata dataset first.
It will have `response_document_url` keys linking to a PDF document.

This is a map from that URL to a plaintext string of the text that PDF contains.
                              
Notes:
- if we figure it was something other than a decision, it may not be present
- the extraction is currently 'the text the PDF itself reports'
    
Both could use some refinement.   TODO: clean up both the woo metadata and woo text datasets
''')

In [16]:
# glancing inspection: TODO: fix those names, are clearly wrong :)
print( ' number of cases:               ', len( woo_metadata        ) )
print( ' decision doc texts it links to:', len( woo_linked_docs_txt ) )

 number of cases:                2542
 decision doc texts it links to: 2339


In [9]:
(woo_metadata.bytesize(), woo_linked_docs_txt.bytesize())


(3170304, 128946176)

In [11]:
woo_metadata.close()
woo_linked_docs_txt.close()