<a href="https://colab.research.google.com/github/scarfboy/wetsuite-dev/blob/main/examples/dataset_kansspelautoriteit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Purpose of this notebook

Collect data from the contents of the PDFs published under https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/

We are not aware of any API, so we currently collect this information by scraping web pages.


NOTE: This is how a dataset was created. If you only care about how to _use_ that dataset, see [using_dataset_kansspelautoriteit.ipynb](../using_dataset_kansspelautoriteit.ipynb) instead.

In the process we give an example of extracting data from such a website, reorganizing that data and, as it turns out, applying OCR.

If you _do_ want to read this code, e.g. to get a start on doing your own OCR processing, read on...
(Note also that this code relies on poppler (for PDF parsing) and EasyOCR (for OCR), neither of which is quite trivial to install.)

### Preparation
Imports we will use, and some helper functions.



In [1]:
#imports
import hashlib, urllib.parse, json, pprint, textwrap, re

import bs4, numpy

import wetsuite.helpers.localdata
import wetsuite.helpers.meta
import wetsuite.helpers.net
import wetsuite.helpers.format
import wetsuite.helpers.strings
import wetsuite.helpers.date
import wetsuite.extras.pdf
import wetsuite.extras.ocr


# helpers
def hash(data: bytes):
    ' calculate SHA1 hash of some data. '
    s1h = hashlib.sha1()
    s1h.update( data )
    return s1h.hexdigest()


def find_eur_money(s:str, minimum=0):
    ''' Given a string, returns a list of substrings that look like money amounts, e.g.
            find_eur_money('   EUR 10,-  ')       == ['10']
            find_eur_money('   EUR 10.    ')      == ['10']
            find_eur_money('   EUR100.000,-')     == ['100000']
            find_eur_money('   EUR100000  ')      == ['100000']
            find_eur_money('   \u20ac100.000')    == ['100000']   # dots are stripped as thousands separators; commas are not, because Dutch uses the comma as the decimal separator.  That should probably be configurable
            find_eur_money('   \u20ac 100000')    == ['100000']
            find_eur_money('   \u20ac 100000  ')  == ['100000']
    '''
    ret = []
    # the with-context was partially for debug, but might actually be useful to return
    for before, match_str, match_object, after in wetsuite.helpers.strings.findall_with_context(r"(?:EUR|\u20ac)\s*([0-9.]+)\b", s, 20):
        cap = match_object.groups()[0].replace('.','')
        try:
            if int(cap) < minimum:
                continue
        except ValueError: # not parseable as a number (e.g. only dots)? Skip it
            continue
        #print( '[%s]%s[%s] -> %r'%(before, match_str, after, cap) )
        ret.append(cap)
    return ret
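
As a rough illustration of what this helper does, here is a minimal stand-alone sketch that uses only `re.finditer` instead of `wetsuite.helpers.strings.findall_with_context`. The name `find_eur_money_sketch` is made up for this example, and skipping unparseable matches is an assumption about the intended behaviour.

```python
import re

# Hypothetical stand-alone variant, for illustration only
EUR_AMOUNT = re.compile(r"(?:EUR|\u20ac)\s*([0-9.]+)\b")

def find_eur_money_sketch(s: str, minimum: int = 0) -> list:
    """Return digit strings of euro amounts in s that are at least `minimum`."""
    ret = []
    for match in EUR_AMOUNT.finditer(s):
        cap = match.group(1).replace(".", "")  # treat dots as thousands separators
        if not cap.isdigit():                  # e.g. the match consisted only of dots
            continue
        if int(cap) < minimum:
            continue
        ret.append(cap)
    return ret

print(find_eur_money_sketch("boete van EUR 100.000,- en \u20ac 500", minimum=5001))
```

The `minimum` threshold is what lets the notebook ignore small amounts mentioned in passing and keep only the amounts that are likely to be fines.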


### Fetching the data

In [3]:
# Use a local store so that we only need to fetch the PDFs once, only render PDF pages once 
import wetsuite.helpers.localdata
pdfstore   = wetsuite.helpers.localdata.open_store('kansspel_pdfstore.db', key_type=str, value_type=bytes )     # URL -> PDF file bytestring
ocrstore   = wetsuite.helpers.localdata.open_store('kansspel_ocrstore.db', key_type=str, value_type=bytes )     # page-specific key -> json as bytes
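
The idea behind these stores is fetch-once caching: look up a key, and only do the expensive work on a miss. The following is a minimal sketch of that idea using the standard library's `sqlite3`. It is not how `wetsuite.helpers.localdata` is implemented, and `TinyStore` and `cached` are hypothetical names.

```python
import sqlite3

class TinyStore:
    """Hypothetical str -> bytes store backed by SQLite (illustration only)."""
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

    def get(self, key):
        row = self.conn.execute("SELECT value FROM kv WHERE key=?", (key,)).fetchone()
        return None if row is None else row[0]

    def put(self, key, value):
        self.conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        self.conn.commit()

def cached(store, key, compute):
    """Return (value, came_from_cache); call compute() only on a cache miss."""
    value = store.get(key)
    if value is not None:
        return value, True
    value = compute()
    store.put(key, value)
    return value, False

store = TinyStore()
calls = []
def slow_fetch():
    calls.append(1)          # count how often the expensive step really runs
    return b"%PDF-1.5 ..."

first = cached(store, "https://example.org/a.pdf", slow_fetch)
second = cached(store, "https://example.org/a.pdf", slow_fetch)
print(first[1], second[1], len(calls))
```

Using the URL as the key means a re-run of the notebook skips downloads it has already done, at the cost of never noticing a PDF that changed under the same URL.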

#### Fetching case summaries
Let's first scrape the webpage to find all cases, and all documents that relate to each case.

This part isn't cached, because the details can change.
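
The pagination handling in the cell below follows a common pattern: the number of pages is not known up front, so the loop starts with a very large upper bound and lets each fetched page correct it. A minimal sketch of that loop, with a made-up `fake_fetch` standing in for the download and parsing steps:

```python
# Hypothetical stand-in for "download page N and parse it":
# returns (items on that page, index of the last page according to its pager)
FAKE_SITE = {0: ["case A", "case B"], 1: ["case C"], 2: ["case D", "case E"]}

def fake_fetch(page_index):
    return FAKE_SITE[page_index], max(FAKE_SITE)

def collect_all(fetch):
    items = []
    maxpage = 9999999      # unknown until the first page tells us
    cur_page = 0           # zero-based, like the pager_page URL parameter
    while cur_page <= maxpage:
        page_items, maxpage = fetch(cur_page)   # every page updates the bound
        items.extend(page_items)
        cur_page += 1
    return items

all_items = collect_all(fake_fetch)
print(all_items)
```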

In [7]:
# The purpose of this section is to fill the following variable:
extracted_cases = []  # list of (case_dict, [list of document dicts], [same-sized list of ocr results])

maxpage  = 9999999  # will be set to the actual number of pages by the first (well, every) page we fetch
cur_page = 0        # zero-based counting in the pagination

print( "FETCHING CASE SUMMARIES" )
while cur_page <= maxpage:
    page_url = 'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/?pager_page=%d'%cur_page
    print( page_url )
                
    page_data = wetsuite.helpers.net.download(page_url)
    soup = bs4.BeautifulSoup( page_data, 'lxml' )

    # get the amount of pages, from the pagination links
    pagelinks = soup.select('a[class~="pager_step"][class~="pagina"]')   # CSS selector looking for both of those classes set in a whitespace-token list
    maxpage = int( pagelinks[-1].get('data-page') )                      # -1: last of those we see on the page

    print( "page %d of %d"%( cur_page+1, maxpage+1 ) ) # numbering is zero-based,  print out one-based for humans

    # fetch all links to specific case detail pages
    for detail_page_a in soup.select('#results a[class~="siteLink"]'): # pick out the links (URLs) of the detail page of each case
        detail_page_url = detail_page_a.get('href') # these are already absolute  (otherwise we'd have to urljoin them)
        case_name = detail_page_a.text.replace('/','_')
        case_dict = {
            'name':case_name,
            'date_range':[],
            'money':[],
            'ecli':[],
            'case_detail_url':detail_page_url,
        }

        print()
        print( '  Case: %r'%(case_name) )

        # Note: the date shown here and on the detail page may be the start date?   It may still be useful to distinguish cases for repeat offenders


        ### fetch that case's detail page, and find all PDF links on it
        # This section used to be three lines long, until we decided that hey, maybe that status would be nice to have.
        # Then we discovered this is a free-form mess
        # and this is the hand-crafted combination of exception cases that will probably break in the future.

        #if detail_page_url in (
        #    #'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/toto-online/',
        #    #'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/1x-corp-exinvest/',
        #          status is for both of the first ones
        #    #'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/vriendenloterij/',
        #    #'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/redslots/',
        #):
        #    print("SKIP FOR NOW: %r"%detail_page_url)
        #    continue


        #print( detail_page_url )
        detail_page_data = wetsuite.helpers.net.download( detail_page_url )
        detail_soup = bs4.BeautifulSoup( detail_page_data, 'lxml' )
        
        ## Construct a list of dicts, one for each document, with keys 'title', 'status', and 'pdf_url'
        doc_dicts, cur = [], {}

        def check_add_clear():
            ''' if there's sensible content, we add it to our list of docs.
                clears cur for next document
            '''
            global doc_dicts, cur
            # if effectively empty, do nothing at all
            if len(cur.keys())==1  and  'title' in cur  and  cur['title']=='': # effectively empty. Also one of a handful of exception cases
                pass
            elif len(cur)>0: # has content
                # fail if it's incomplete, so that I fix it.
                if 'title' not in cur:
                    raise ValueError('scraping code is missing title, cur=%r'%cur)
                if 'pdf_url' not in cur:
                    raise ValueError('scraping code is missing pdf_url, cur=%r'%cur)
                #if 'status' not in cur:
                #    raise ValueError('scraping code is missing status, cur=%r'%cur) that happens, see e.g. https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/n1-interactive/ or https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/1x-corp-exinvest/
                #print("ADD %r"%cur)
                doc_dicts.append(cur)
            cur = {}


        pdflinks = list( detail_soup.select('a[class~="importLink"][class~="pdf"]') ) # used only to find the content block now, we need more structure from the page
        # (and things like select('div[class~="grid-blok"] div[class~="grid-element"] div[class~="grid-inside"] div[class~="iprox-content"]') are not specific enough, that's a general template)
        if len(pdflinks)==0:
            print("WARNING - no content?")
        else:
            # This whole part used to be short, and readable(ish), but the pages are a bit messy so it ended up with a lot of exception cases.

            # a sequence of one or more:  h2 (title) p (link) h2-or-h3 "status" p (status text)            
            # but I've seen an initial paragraph in front of it. and a header and a paragraph in front of it.
            child = pdflinks[0].parent.previous_element # try to position on the first document's title header. Parent would be the P
            while child.name is None and child.previous_sibling is not None: # find previous non-text node.  There must be a better way of doing this.
                child = child.previous_sibling
            #print( 'Chosen starting spot: ', child )

            while child is not None:
                if child.name: # filter out text nodes (iirc)
                    pdflinks = list( child.select('a[class~="importLink"][class~="pdf"]') )
                    has_pdflinks = len(pdflinks) > 0
                    alltext = (''.join(child.findAll(text=True))).strip().lower().strip(':')

                    # augmenting some things we find
                    case_dict['money'].extend( find_eur_money(alltext, minimum=5001) )
                    case_dict['ecli'].extend( wetsuite.helpers.meta.findall_ecli(alltext, rstrip_dot=True) )

                    #print( "LOOKING AT %r"% str(child).strip() )
                    if child.name in ('h2','h3'):
                        if has_pdflinks: # weird case in https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/vriendenloterij/
                            cur['pdf_url'] = urllib.parse.urljoin( detail_page_url, pdflinks[0].get('href') )   # make relative URLs absolute
                            if 'title' not in cur: # would be overwritten in almost all cases
                                cur['title'] = alltext # but helps deal with https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/1x-corp-exinvest/
                        else:
                            if alltext == 'status': # header that just says 'status'
                                #print('HEADER - "status"')
                                pass 
                            else: # probably a title
                                #print('HEADER - title?')
                                check_add_clear() # starts a new one, so:
                                cur['title'] = child.text
                    elif child.name == 'p':
                        if len(cur)==0: # dict empty?
                            pass # probably an initial paragraph
                        else:
                            if has_pdflinks: #elif 'pdf_url' not in cur:
                                #print('FIRST P - PDF URL')
                                cur['pdf_url'] = urllib.parse.urljoin( detail_page_url, pdflinks[0].get('href') )   # make relative URLs absolute
                            else: #elif 'pdf_url' in cur: # status text
                                #print('SECOND P - STATUS TEXT')
                                cur['status'] = child.text
                child = child.next_sibling
            check_add_clear()

            #import pprint
            #pprint.pprint( doc_dicts )
            print( '    # documents: %d'%len(doc_dicts) )


        extracted_cases.append( (case_dict, doc_dicts, []) )
        #break  # debug: stop after first case on page

    cur_page += 1
    #break  # debug: stop after first page

FETCHING CASE SUMMARIES
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/?pager_page=0
page 1 of 6

  Case: 'NetX Betting'
    # documents: 2

  Case: 'Bingoal Nederland B.V.'
    # documents: 2

  Case: 'Vriendenloterij N.V.'
    # documents: 2

  Case: 'Nationale Postcode Loterij N.V.'
    # documents: 2

  Case: 'GoldWin Ltd'
    # documents: 2

  Case: 'Merkur Casino Almere B.V.'
    # documents: 2

  Case: 'Betent'
    # documents: 1

  Case: 'Winning Poker Network'
    # documents: 2

  Case: 'Red Ridge Marketing'
    # documents: 2

  Case: 'Hillside New Media Malta Plc'
    # documents: 2

  Case: 'Probe Investments Limited'
    # documents: 5

  Case: 'Betpoint Group Limited'
    # documents: 2

  Case: 'N1 Interactive Limited'
    # documents: 5

  Case: 'Videoslots Limited'
    # documents: 2

  Case: 'Fairload Limited'
    # documents: 5

  Case: 'Equinox Dynamic N.V. & Domiseda and Partners s.r.o.'
    # documents: 5

  Case: 'Bingoal Nederland B.V.'
    # 

In [9]:
# that produces a structure like (example for one case):
extracted_cases[0]

({'name': 'NetX Betting',
  'date_range': [],
  'money': [],
  'ecli': [],
  'case_detail_url': 'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/netx-betting/'},
 [{'title': 'Besluit last onder dwangsom NetX Betting',
   'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/besluit_lod_1a_netx_betting_limited_ov_.pdf',
   'status': 'Tegen dit besluit kan bezwaar worden gemaakt.'},
  {'title': 'Besluit openbaarmaking NetX Betting',
   'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/besluit_openbaarmaking_ov_.pdf',
   'status': 'Tegen dit besluit kan bezwaar worden gemaakt.'}],
 [])
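
The scraping loop above is essentially a small state machine over the headers and paragraphs of the detail page. A simplified sketch of that idea follows, working on plain `(tag, text, href)` tuples rather than on BeautifulSoup nodes. The function name and tuple layout are invented for this example. Unlike the real code it silently drops incomplete entries instead of raising, and it ignores the exception cases (such as a PDF link inside a header).

```python
def group_documents(elements):
    """elements: list of (tag_name, text, pdf_href_or_None) in page order."""
    docs, cur = [], {}

    def flush():
        # keep cur only if it describes a complete document
        if "title" in cur and "pdf_url" in cur:
            docs.append(dict(cur))
        cur.clear()

    for tag, text, href in elements:
        label = text.strip().lower().strip(":")
        if tag in ("h2", "h3"):
            if label == "status":
                continue                 # a header that only announces the status text
            flush()                      # any other header starts a new document
            cur["title"] = text.strip()
        elif tag == "p" and cur:         # paragraphs before the first title are skipped
            if href is not None:
                cur["pdf_url"] = href
            else:
                cur["status"] = text.strip()
    flush()
    return docs

elements = [
    ("p",  "Intro paragraph", None),
    ("h2", "Sanctiebesluit", None),
    ("p",  "Download", "https://example.org/a.pdf"),
    ("h3", "Status:", None),
    ("p",  "Tegen dit besluit is bezwaar gemaakt.", None),
    ("h2", "Openbaarmakingsbesluit", None),
    ("p",  "Download", "https://example.org/b.pdf"),
]
docs = group_documents(elements)
print(docs)
```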

### Fetch the PDFs, OCR them

The above fetched metadata and links to the PDFs, but not yet the documents themselves.

We could have split 'fetch the PDFs' and 'render the pages and OCR them' into separate steps, and still might.
This way we don't have to remember quite as much, and it's single-purpose anyway.

This produces OCR results in their raw form (a bunch of text fragments with their positions) and does not use them yet.
It stores this data on the same data structure (the third tuple item, initialized as `[]`); the section below is what actually uses it.
That is perhaps somewhat confusing.

Doing the OCR for a hundred documents takes a few hours, which is why we cache the raw OCR results.
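
A minimal sketch of that per-page caching, assuming nothing more than a dict-like store of `str -> bytes`. The function `cached_page_ocr` and the `fake_ocr` stand-in are hypothetical; the key format mirrors the one used in the cell below.

```python
import json

def cached_page_ocr(store, pdf_url, page_i, run_ocr):
    """store: any dict-like mapping of str -> bytes; run_ocr: callable doing the slow work."""
    page_key = "ocr::%s::%s" % (page_i, pdf_url)
    if page_key in store:
        return json.loads(store[page_key])
    results = run_ocr()
    store[page_key] = json.dumps(results).encode("utf8")
    return results

store = {}   # a plain dict standing in for the on-disk store

def fake_ocr():
    # one fragment: (bounding box corners, text, certainty)
    return [([[0, 0], [10, 0], [10, 5], [0, 5]], "OPENBAAR", 0.99)]

fresh = cached_page_ocr(store, "https://example.org/a.pdf", 0, fake_ocr)
again = cached_page_ocr(store, "https://example.org/a.pdf", 0, fake_ocr)
print(type(fresh[0]).__name__, type(again[0]).__name__)
```

Note that the JSON round trip turns tuples into lists, so a cached fragment comes back as a list. It can still be unpacked as `bbox, text, cert`, which is all the later code relies on.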

In [15]:
print( "FETCH PDFs,  OCRing pages" )

for case_i, case_tuple in enumerate(extracted_cases):
    case_dict, case_doc_dicts, _ = case_tuple
    case_name = case_dict['name']

    print('------------------------')
    print( "ENUM: %s"%case_i)
    print( "NAME: %s"%case_name)
    print( "DICTS: %s"%pprint.pformat(case_doc_dicts))
    for case_doc_dict in case_doc_dicts:   # for each PDF document in the case
        pdf_url = case_doc_dict['pdf_url']

        print( "== %s =="%wetsuite.helpers.format.url_basename( pdf_url ) )
        pdfbytes, _ = wetsuite.helpers.localdata.cached_fetch( pdfstore, pdf_url )
        doc_page_fragments = [] # list of lists:   [   [page1fragment1,page1fragment2], [page2fragment1,page2fragment2], etc.  ]   

        #if 0:   
        #    doc_text_pages == list( wetsuite.datacollect.pdf.page_text( pdfbytes ) )
        #else:

        # Render PDF pages as images
        page_images = list( wetsuite.extras.pdf.pages_as_images(pdfbytes, dpi=200, antialiasing=True) )  # TODO: cache these too
        # high DPI and antialiasing does a _little_ better on things like periods and colons, but less than you'd think.

        # For each page image, get OCR. This is cached because this is sloooow, and PDFs are unlikely to change
        for page_i, page_image in enumerate(page_images):
            page_key = 'ocr::%s::%s'%(page_i, pdf_url)

            if page_key in ocrstore:
                print('     OCR - CACHED  for page %d of %d'%(page_i+1, len(page_images)))
                page_ocr_results = json.loads( ocrstore.get(page_key) )
            else: # generate and cache
                print('     OCR - DOING  for page %d of %d'%(page_i+1, len(page_images)))
                page_ocr_results = wetsuite.extras.ocr.easyocr( page_image, use_gpu=True ) # CONSIDER: prefer?

                ocrstore.put( page_key, json.dumps(page_ocr_results).encode('utf8') )

            #page_image.save( '%s__%s__page_%03d.png'%(case_name,  hash(pdfbytes), page_i+1) )
            # DEBUG: Draw OCR results on the page it came from, and save as PNG:
            #eval_image = wetsuite.extras.ocr.easyocr_draw_eval( page_image, page_ocr_results )
            #eval_image.save('%s__%s__page_%03d-boxes.png'%(case_name,  hash(pdfbytes), page_i+1))

            # I think this is here to potentially merge PDF text objects into the same sort of structure. As is it does nothing.
            page_fragments = []   # fragments of text in a page,  which in the case of EasyOCR will typically be lines at a time
            for bbox, text, cert in page_ocr_results:
                page_fragments.append( (bbox, text, cert) ) 
            doc_page_fragments.append( page_fragments ) # yes, this is currently just page_ocr_results, the idea was that the above might augment/simplify that

        extracted_cases[case_i][2].append( doc_page_fragments )



FETCH PDFs,  OCRing pages
------------------------
ENUM: 0
NAME: NetX Betting
DICTS: [{'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/besluit_lod_1a_netx_betting_limited_ov_.pdf',
  'status': 'Tegen dit besluit kan bezwaar worden gemaakt.',
  'title': 'Besluit last onder dwangsom NetX Betting'},
 {'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/besluit_openbaarmaking_ov_.pdf',
  'status': 'Tegen dit besluit kan bezwaar worden gemaakt.',
  'title': 'Besluit openbaarmaking NetX Betting'}]
== besluit_lod_1a_netx_betting_limited_ov_.pdf ==
A b'%PDF-1.5'
     OCR - DOING  for page 1 of 14


first use of ocr() - loading EasyOCR model (into GPU)


     OCR - DOING  for page 2 of 14
     OCR - DOING  for page 3 of 14
     OCR - DOING  for page 4 of 14
     OCR - DOING  for page 5 of 14
     OCR - DOING  for page 6 of 14
     OCR - DOING  for page 7 of 14
     OCR - DOING  for page 8 of 14
     OCR - DOING  for page 9 of 14
     OCR - DOING  for page 10 of 14
     OCR - DOING  for page 11 of 14
     OCR - DOING  for page 12 of 14
     OCR - DOING  for page 13 of 14
     OCR - DOING  for page 14 of 14
== besluit_openbaarmaking_ov_.pdf ==
A b'%PDF-1.5'
     OCR - DOING  for page 1 of 6
     OCR - DOING  for page 2 of 6
     OCR - DOING  for page 3 of 6
     OCR - DOING  for page 4 of 6
     OCR - DOING  for page 5 of 6
     OCR - DOING  for page 6 of 6
------------------------
ENUM: 1
NAME: Bingoal Nederland B.V.
DICTS: [{'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/01-288-556_15404_sanctiebesluit_bingoal.pdf',
  'status': 'Tegen dit besluit\xa0is bezwaar gemaakt.',
  'title': 'Sanctiebesluit'},
 {'pdf_url': 'https:/

In [6]:
# A case now looks like:
extracted_cases[0]

({'name': 'Bingoal Nederland B.V.',
  'date_range': [],
  'money': [],
  'ecli': [],
  'case_detail_url': 'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/bingoal-nederland-0/'},
 [{'title': 'Sanctiebesluit',
   'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/01-288-556_15404_sanctiebesluit_bingoal.pdf',
   'status': 'Tegen dit besluit\xa0is bezwaar gemaakt.'},
  {'title': 'Openbaarmakingsbesluit',
   'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/15404_bingoal_openbaarmakingsbesluit_002_.pdf',
   'status': 'Tegen dit besluit\xa0is\xa0bezwaar gemaakt.'}],
 [[[([[901, 195], [1147, 195], [1147, 235], [901, 235]],
     'Kansspelautoriteit',
     0.9999940713236027),
    ([[196, 372], [354, 372], [354, 404], [196, 404]],
     'OPENBAAR',
     0.9998506269477122),
    ([[195, 569], [1433, 569], [1433, 609], [195, 609]],
     'Besluit van de raad van bestuur van de Kansspelautoriteit als bedoeld in artikel 35a',
     0.9089226683073498),
    ([[195

### Analyse raw OCR output into structured text

Take those OCR fragments and separate off headers, group into paragraphs, and such.

This is separate from the above because this took a bunch of tweaking specific to this document layout.

It's also long because a bunch of specific augmentation goes on here.

It amounts to a bunch of math on numbers, so it takes less than a minute. It's separate so that we can tweak and re-run it easily.
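
To give a feel for that math before the full cell, here is a much-reduced sketch: split a page's fragments into header, body and footer using two y thresholds, then start a new paragraph whenever the vertical gap to the previous fragment is large compared to the typical box height. All names are invented for this example, and the 0.6 factor and the use of the body's median height are simplifying assumptions; the real cell has more rules and computes the median over the whole page.

```python
from statistics import median

def bbox_top(bbox):
    return min(y for _, y in bbox)

def bbox_bottom(bbox):
    return max(y for _, y in bbox)

def split_and_group(page, head_bot_y=None, foot_top_y=None):
    """page: list of (bbox, text, certainty), bbox being four [x, y] corners."""
    head, body, foot = [], [], []
    for bbox, text, _cert in page:
        if head_bot_y is not None and bbox_top(bbox) < head_bot_y:
            head.append(text)
        elif foot_top_y is not None and bbox_bottom(bbox) > foot_top_y:
            foot.append(text)
        else:
            body.append((bbox, text))

    paragraphs, current, prev_bottom = [], [], None
    if body:
        typical_height = median(bbox_bottom(b) - bbox_top(b) for b, _ in body)
    for bbox, text in body:
        gap = 0 if prev_bottom is None else bbox_top(bbox) - prev_bottom
        if current and gap > 0.6 * typical_height:   # big vertical gap: new paragraph
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        prev_bottom = bbox_bottom(bbox)
    if current:
        paragraphs.append(" ".join(current))
    return head, paragraphs, foot

def box(top, bottom):
    return [[0, top], [100, top], [100, bottom], [0, bottom]]

page = [
    (box(10, 30),   "Kansspelautoriteit", 0.99),
    (box(100, 120), "First line of a paragraph", 0.9),
    (box(124, 144), "that continues here.", 0.9),
    (box(180, 200), "A second paragraph.", 0.9),
    (box(900, 920), "Pagina 1 van 14", 0.9),
]
head, paragraphs, foot = split_and_group(page, head_bot_y=50, foot_top_y=890)
print(head, paragraphs, foot)
```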

In [16]:
# a bunch of helpers
from wetsuite.extras.ocr import doc_extent, page_fragment_filter, bbox_max_x, bbox_max_y, bbox_min_x, bbox_min_y, bbox_height

verbose = False

dataset_cases = [] # distinct name, has a different structure

for case_i, case_tuple in enumerate(extracted_cases):
    case_dict, case_doc_dicts, ocrdata = case_tuple
    case_name = case_dict['name']

    print( 'CASE %d: %s'%(case_i, case_name) )

    case_docs = [] # this part's main output: a list with one dict per document (with keys like 'url', 'pages')
    assert len(case_doc_dicts) == len(ocrdata)

    case_doc_dates = set() # or maybe count?

    ## for each document, analyze the OCR fragments and sort it into more directly useful data
    for doc_i in range(len(case_doc_dicts)): #ocrdata should have the same length
        pdf_dict = case_doc_dicts[doc_i]
        doc_ocr  = ocrdata[doc_i]


        pdf_url = pdf_dict['pdf_url']
        print( '  DOC %d:  %s'%(doc_i,   wetsuite.helpers.format.url_basename( pdf_url ) ) )


        doc_pages = [] # the main output for a document - a list of dicts that each detail a page
        doc_dates = [] # picks out dates from the headers

        doc_wide_extent = doc_extent( doc_ocr ) # the area outside of which there is no text at all, across all of the document's pages
        #print( 'doc_min_x, doc_max_x, doc_min_y, doc_max_y', extent )

        for page_i, page in enumerate( doc_ocr ): # page is now all text fragments on a page, a list of all (bbox, text, cert) 
            if verbose:
                print( '   PAGE %s ---------------------------------------------------------------------------'%(page_i) )

            ### Determine margins
            # so that we can have logic that extracts text that hopefully flows between pages
            head_y_ary, foot_y_ary = [], []
            # - Top margin defined by lowest box extent of 
            #    - "Kansspelautoriteit" (not really necessary)
            #    - "OPENBAAR"
            #    - "Ons kenmerk" plus one extra box's worth
            matches = page_fragment_filter( page, r'^Kansspelautoriteit$', q_max_y=0.25, extent=doc_wide_extent )
            for bbox, text, cert in matches:
                #print ('    [page %d] Kansspelautoriteit MATCH: %s %s %s'%(page_i, bbox, text, cert))
                head_y_ary.append( bbox_max_y(bbox) )
            matches = page_fragment_filter( page, r'^OPENBAAR$', q_max_y=0.3, q_max_x=0.35, extent=doc_wide_extent )
            for bbox, text, cert in matches:
                #print ('    [page %d] OPENBAAR MATCH: %s %s %s'%(page_i, bbox, text, cert))
                head_y_ary.append( bbox_max_y(bbox) )
            matches = page_fragment_filter( page, r'Ons kenmerk', q_max_y=0.25, extent=doc_wide_extent )
            for bbox, text, cert in matches:
                #print ('    [page %d] Ons kenmerk MATCH: %s %s %s'%(page_i, bbox, text, cert))
                head_y_ary.append( bbox_max_y(bbox) + 1.2* bbox_height(bbox) ) # we expect one more line of the same height below it (times fudge factor, for expected whitespace)
            
            # - Bottom margin defined by highest box y of "Pagina [0-9]+ van [0-9]+" in the bottom right
            matches = page_fragment_filter( page, r'Pagina', q_min_x=0.7, q_min_y=0.75, extent=doc_wide_extent ) # the rest, e.g. (\s*[0-9]+\s*van\s*[0-9]+)?, is optional because it's sometimes detected separately, or not at all
            for bbox, text, cert in matches: # 
                #print ('    [page %d] Pagina MATCH: %s %s %s'%(page_i, bbox, text, cert))
                foot_y_ary.append( bbox_min_y(bbox)  )
            if len(foot_y_ary)==0: # look harder for page - a lone number to the bottom right that matches roughly with the page number we think it is is probably also the page number
                pages_around = '|'.join( str(pag)  for pag in range(page_i-1, page_i+2) )
                matches = page_fragment_filter( page, r'^(%s)$'%pages_around, q_min_y=0.75, q_min_x=0.7, extent=doc_wide_extent )
                for bbox, text, cert in matches: # (\s*[0-9]+\s*van\s*[0-9]+)?
                    #print ('    [page %d] Bare pagina MATCH: %s %s %s'%(page_i, bbox, text, cert))
                    foot_y_ary.append( bbox_min_y(bbox)  )

            if len(head_y_ary)==0:
                head_bot_y = None
            else:
                head_bot_y = max(head_y_ary)

            if len(foot_y_ary)==0:
                foot_top_y = None
            else:
                foot_top_y = min(foot_y_ary)

            if verbose:
                print("    page head_y: ", head_bot_y) # TODO: call this head_bot_y (and probably rename h)
                print("    page foot_y: ", foot_top_y) # TODO: call this foot_top_y


            ### Figure out some things about the page
            ## list-item X position
            # Most of these documents have numbering on their headers and paragraphs
            # these list-item numbers at the start of lines are not detected consistently by OCR,
            # nor are they linguistic information, so attempt to remove them.
            # We like to be sure (to not remove such things from actual text), so we only trust the detected position when we find at least four such numbers, and otherwise fall back to a conservative guess near the left edge
            lnum_righty = []
            matches = page_fragment_filter( page, r'^[0-9.]+$', q_max_x=0.2, q_min_y=head_bot_y,q_max_y=foot_top_y, extent=doc_wide_extent ) 
            for bbox, text, cert in matches:
                if verbose:
                    print ('    [page %d] LI NUM MATCH: %s %s %s'%(page_i, bbox, text, cert))
                lnum_righty.append( bbox_max_x(bbox)  )            
            if len(lnum_righty) < 4: # not sure enough - be more conservative
                lnum_righty = doc_wide_extent[0] + 20 # TODO: avoid that constant
            else:
                lnum_righty = max(lnum_righty)

            ## typical (median) box height
            box_heights = []
            for bbox, text, cert in page:
                box_heights.append( bbox_height( bbox ) )
            median_boxheight = numpy.median(box_heights)
            #print(box_heights)
            #print("Median bbox height: %d"%median_boxheight)


            ### Group and process fragments   (in passes, to make logic like 'is this the last thing in the body' easier)
            # - separate into head, body, foot
            # - polish the body, e.g.
            #     replace '-' at end of paragraph with '.'
            #   (doing this after the split makes 'is this the last thing on the page AFTER we removed the footer' logic easier)

            page_contents = {
                'head_fragments':[],
                'body_fragments':[],
                'misc_fragments':[],
                'foot_fragments':[],
                'body_text':[],
            }


            ## Sort fragments into header, body, and footer
            prev_topy, prev_boty = 0,0 
            for frag_i, (bbox, text, cert) in enumerate(page):
                topleft, topright, botright, botleft = bbox
                topy, boty = topleft[1], botright[1]

                text = re.sub(r'[_-]\s*$','.', text) # mistaken for period sometimes. Could be more thorough, but this is a start

                if head_bot_y is not None and topy < head_bot_y :
                    page_contents['head_fragments'].append( (bbox, text, cert) )

                elif foot_top_y is not None and boty > foot_top_y: # CONSIDER: also remove numbers from left of boxes that start at the -- IF we think it's such a number.
                    page_contents['foot_fragments'].append( (bbox, text, cert) )

                elif wetsuite.helpers.strings.is_numeric( text) and topleft[0] < lnum_righty+5:
                    pass
                    #page_contents['misc_fragments'].append( (bbox, text, cert) )

                else: #probably useful body
                    #print( '      %6s  %12s %-12s  fs:%-5s  %s '%( keep_ignore, topleft, botright, boxheight, text ) )
                    page_contents['body_fragments'].append( (bbox, text, cert) )

                    case_dict['money'].extend( find_eur_money(text, minimum=5001) )

                prev_topy, prev_boty = topy, boty

            ## Header stuff. Little smartness yet.
            head_text = [] 
            for body_frag_i, (bbox, text, cert) in enumerate(page_contents['head_fragments']):
                head_text.append(text)

                _, dts = wetsuite.helpers.date.find_dates_in_text(text)
                for dt in dts:
                    if dt is not None:
                        case_doc_dates.add( dt )
                        doc_dates.append( wetsuite.helpers.date.format_date(dt) )
                # CONSIDER: getting out kenmerk


            foot_text = [] 
            for body_frag_i, (bbox, text, cert) in enumerate(page_contents['foot_fragments']):
                foot_text.append(text)

            ## Figure out body's paragraphs, separate where sensible
            body_text = [] 
            temp_par  = []
            prev_topy, prev_boty, prev_boxheight, prev_text = 0,0, 0, ''
            def flush_par():
                global temp_par, body_text
                if len(temp_par)>0:
                    body_text.append( ' '.join(temp_par) )
                    temp_par=[] 


            for body_frag_i, (bbox, text, cert) in enumerate(page_contents['body_fragments']):
                topleft, topright, botright, botleft = bbox
                topy, boty = topleft[1], botright[1]
                boxheight = bbox_height(bbox) # is a good indicator of font size

                same_line            = (boty - prev_boty) < 0.6*boxheight
                current_line_shorter = len(text) < 0.5 * len(prev_text) 
                prev_line_shorter    = len(prev_text) < 0.5 * len(text)

                #if topy-prev_boty > -5:
                #    print( "                                                   [ydist %d (%s->%s)]"%(topy-prev_boty, prev_boty, topy) )

                #if topy < prev_boty by roughly boxheight it's the same line

                if topy < prev_boty-200:
                    if verbose:
                        print( "LARGE DECREASE IN Y HUH?")
                        print( '      %12s %-12s  fs:%-5s  %s '%( topleft, botright, boxheight, text ) )
                    #continue
                    break
                    
                if prev_boty!=0  and  topy-prev_boty > median_boxheight: 
                    if verbose:
                        print( "                                                   [YSEP %d (%s->%s)]"%(topy-prev_boty, prev_boty, topy) )
                    flush_par()

                elif prev_boty!=0  and  topy-prev_boty > 0.6*median_boxheight: 
                    if verbose:
                        print( "                                                   [YSEP %d (%s->%s)]"%(topy-prev_boty, prev_boty, topy) )
                    flush_par()

                elif (boxheight > 1.25 * prev_boxheight)  and  current_line_shorter  and  not same_line:  # larger text, and shorter
                    # A large title of a new section is generally  caught by YSEP, actually
                    #print( 'size diff;   line diff?  botdiff is %d,  relative to 0.5*boxheight=%d'%( (boty-prev_boty),  0.5*boxheight ) )

                    if verbose:
                        print( "                                                   [LARGER_TEXT %s->%s]"%(prev_boxheight, boxheight) )
                    flush_par()

                elif boxheight < 0.8 * prev_boxheight  and  prev_line_shorter  and not same_line: # font smaller than the previous line, and the previous line was shorter
                    if verbose:
                        print( "                                                   [SMALLER_TEXT %s->%s]"%(prev_boxheight, boxheight) )
                    flush_par()

                temp_par.append( text )

                if verbose:
                    print( '      %12s %-12s  fs:%-5s  %s '%( topleft, botright, boxheight, text ) )

                prev_topy, prev_boty, prev_boxheight, prev_text = topy, boty,  boxheight, text

            flush_par()

            del page_contents['body_fragments']
            del page_contents['head_fragments']
            del page_contents['foot_fragments']
            del page_contents['misc_fragments']

            page_contents['head_text'] = head_text
            page_contents['body_text'] = body_text
            page_contents['foot_text'] = foot_text

            doc_pages.append( page_contents )

            #print( 'body_text' )
            #pprint.pprint( page_contents['body_text'] )
            #for par in page_contents['body_text']:
            #    for line in textwrap.wrap(par):
            #        print( '[%s]'%line )
            #    print()

        case_docs.append( 
            {
                'url':pdf_url, 
                'pages':doc_pages,
                'header_dates':doc_dates,
                'status':pdf_dict.get('status'),
            }
        )

        # summarize document
        #pprint.pprint( doc_contents )

        if 0:
            for page in doc_pages:
                for temp_par in page['body_text']:
                    for line in textwrap.wrap(temp_par):
                        print( '%s'%line )
                    print()



    date_range = ()
    if len(case_doc_dates)>0:
        date_range = (
            wetsuite.helpers.date.format_date( min(case_doc_dates) ), 
            wetsuite.helpers.date.format_date( max(case_doc_dates) )
        )

    dataset_cases.append( {
        'name': case_name,
        'docs': case_docs,
        'date_range': date_range,

        # CONSIDER: maybe just start with case_dict so we don't have to manually do:
        'money': case_dict['money'],
        'ecli': case_dict['ecli'],
        'case_detail_url': case_dict['case_detail_url'],
    } )

print( len(dataset_cases) )




CASE 0: NetX Betting
  DOC 0:  besluit_lod_1a_netx_betting_limited_ov_.pdf
  DOC 1:  besluit_openbaarmaking_ov_.pdf
CASE 1: Bingoal Nederland B.V.
  DOC 0:  01-288-556_15404_sanctiebesluit_bingoal.pdf
  DOC 1:  15404_bingoal_openbaarmakingsbesluit_002_.pdf
CASE 2: Vriendenloterij N.V.
  DOC 0:  20230523_01-291-294_openbaar_besluit_last_onder_dwangsom_vl_spellen.pdf
  DOC 1:  20230523_01-291-296_openbare_versie_openbaarmakingsbesluit_woo_vl_last.pdf
CASE 3: Nationale Postcode Loterij N.V.
  DOC 0:  20230523_01-291-288_openbaar_besluit_last_onder_dwangsom_npl_spellen.pdf
  DOC 1:  20230523_01-291-290_openbare_versie_openbaarmakingsbesluit_woo_npl_last.pdf
CASE 4: GoldWin Ltd
  DOC 0:  20230420_01-291-425_besluit_lod_goldwin_ltd_ov_.pdf
  DOC 1:  20230420_01-191-426_besluit_openbaarmaking_ov_.pdf
CASE 5: Merkur Casino Almere B.V.
  DOC 0:  01-289-330_15484_sanctiebesluit_ov.pdf
  DOC 1:  01-289-336_15484_openbaarmakingsbesluit_woo_ov.pdf
CASE 6: Betent
  DOC 0:  betent_15403_openbaarmakin
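
The paragraph grouping in the cell above mostly comes down to one rule: start a new paragraph when the vertical gap between the bottom of one OCR fragment and the top of the next is larger than some fraction of the median box height. Below is a minimal standalone sketch of just that rule. It assumes fragments are simplified to `(topy, boty, text)` tuples already sorted top to bottom, and that a paragraph is flushed by joining its fragments with spaces; `group_paragraphs` and `gap_factor` are names made up for this illustration, and the font-size checks from the real code are left out.

```python
import statistics

def group_paragraphs(fragments, gap_factor=0.6):
    """Group (topy, boty, text) fragments into paragraphs by vertical gap.

    Hypothetical helper for illustration; fragments are assumed to be
    sorted top to bottom already.
    """
    if not fragments:
        return []

    # typical text height on this page, used to scale the gap threshold
    median_height = statistics.median(boty - topy for topy, boty, _ in fragments)

    paragraphs, current = [], []
    prev_boty = None
    for topy, boty, text in fragments:
        # a gap clearly larger than usual line spacing ends the current paragraph
        if prev_boty is not None and topy - prev_boty > gap_factor * median_height:
            paragraphs.append(' '.join(current))
            current = []
        current.append(text)
        prev_boty = boty

    if current:  # flush whatever is left at the end of the page
        paragraphs.append(' '.join(current))
    return paragraphs


# made-up coordinates, only to show the grouping
sample = [
    (100, 120, 'Besluit'),
    (160, 180, 'De raad van bestuur'),
    (184, 204, 'van de Kansspelautoriteit'),
    (250, 270, 'Inleiding'),
]
for par in group_paragraphs(sample):
    print('[%s]' % par)
```

With a median box height of 20, anything more than 12 units of whitespace counts as a paragraph break here. In practice the right factor depends on the scan resolution and the line spacing of the documents, so treat 0.6 as a starting point to tune rather than a fixed value.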

# Write out

Now write that augmented structure out as something we can call a dataset.

In [None]:

with open('kansspelautoriteit_plain.json', 'wb') as f:
    dataset = {
        'description':'''This is a plaintext form of the set of documents you can find under https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/ as PDFs.

        Since almost half of those PDFs do not have a text stream, this data is entirely OCR'd,
        so expect some typical OCR errors.  The OCR quality seems fairly decent, and some effort was made to remove headers and footers,
        yet there are some leftovers  like _ instead of . and = instead of :


        The data is a fairly nested structure of python objects (or JSON, before it's parsed).
        - .data is a list of cases.

        - each case is a dict, with a 
            - 'name', 
            - 'docs' (a list) 
            - and some extracted information like mentioned money amounts, the apparent date span of the case

        - each document in that mentioned list is a dict, with keys like
            - 'url' - to the PDF it came from
            - 'status' - from the detail page (if we could find it - not 100%) 
            - extracted information like 'header_dates' (comes from PDF contents)
            - 'pages' (a list)

        - each page in that list is a dict, which has keys:
            - 'body_text' - a list, which contains text fragments that are _almost_ like paragraphs,
                except that text may continue across page breaks (currently still up to you to detect),
                and the post-OCR processing isn't perfect.
            - 'foot_text' - generally just [ "Pagina 1 van 27" ]
            - 'head_text' - fragments like ["Kansspelautoriteit", "OPENBAAR"] but also the date and kenmerk lines


        (TODO: update this example)
        For example (body text edited for brevity), one case's dict, with one document:
            { # dict for a case
                'name': 'Toto Online B.V.',  # case's name
                'docs': [                    # list of PDF documents in this case
                    { # dict detailing first document in case
                        'url': 'https://kansspelautoriteit.nl/publish/library/32/01_278_071_15091_sanctiebesluit_toto_ov.pdf',
                        'pages': [
                            {  # first page's dict   (currently contains only body's text fragments; idea was to split off header contents)
                                'body_text':[  # first page's text fragments
                                    'Besluit van de raad van [more sentence]',
                                    'Zaak: 15091 Kenmerk: 15091 [more kenmerk]',
                                    'Besluit',
                                    'Inleiding',
                                    'De raad van bestuur van de Kansspelautoriteit [more paragraph]'
                                ]
                            },
                            { # second page's dict
                                'body_text': [   # second page text fragment
                                'heeft heeft ontvangen sinds hij daar [more paragraph]',
                                'De toezichthouders zijn naar aanleiding [more paragraph]'
                                ]
                            } 
                            # ...more pages
                        ], 
                    }, # end of document dict
                    # ...more documents
                ]
            }
        ''',
        'data':dataset_cases,
    }

    f.write( json.dumps( dataset ).encode('utf-8') )
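
As the description says, text may continue from one page to the next, and detecting that is left to whoever uses the data. Below is a rough sketch of one way to walk the nested structure and stitch such paragraphs together. The heuristic is our own assumption, not something the dataset guarantees: if the previous page's last fragment does not end in sentence-ending punctuation and the next page's first fragment starts with a lowercase letter, treat them as one paragraph. `doc_paragraphs` is a hypothetical helper, and the sample below is a tiny made-up stand-in for the real file.

```python
import json

def doc_paragraphs(doc):
    """Return one flat list of paragraphs for a document dict,
    joining text that seems to continue across a page break.

    Hypothetical helper; the continuation test is a guess and will
    occasionally merge a heading into the text that follows it.
    """
    paragraphs = []
    for page in doc['pages']:
        for i, par in enumerate(page.get('body_text', [])):
            continues = (
                i == 0                      # first fragment on this page
                and bool(paragraphs)        # there is text from an earlier page
                and not paragraphs[-1].rstrip().endswith(('.', ':', '?', '!'))
                and par[:1].islower()       # looks like the middle of a sentence
            )
            if continues:
                paragraphs[-1] = paragraphs[-1].rstrip() + ' ' + par
            else:
                paragraphs.append(par)
    return paragraphs


# made-up sample with the same nesting: dataset -> cases -> docs -> pages
sample_dataset = {
    'description': 'tiny stand-in for kansspelautoriteit_plain.json',
    'data': [
        {
            'name': 'Example case',
            'docs': [
                {
                    'url': 'https://example.com/besluit.pdf',
                    'pages': [
                        {'body_text': ['Inleiding', 'De raad van bestuur heeft op grond van']},
                        {'body_text': ['artikel 1 besloten.', 'De toezichthouders zijn gestart.']},
                    ],
                },
            ],
        },
    ],
}

# round-trip through JSON, the same way the file above is written and later read
dataset = json.loads(json.dumps(sample_dataset))

for case in dataset['data']:
    for doc in case['docs']:
        for par in doc_paragraphs(doc):
            print('[%s] %s' % (case['name'], par))
```

To run this against the real file, replace the round-trip line with a `json.load` on `kansspelautoriteit_plain.json` opened in binary mode. Given the OCR leftovers mentioned in the description (such as `_` where a `.` was meant), the punctuation test will sometimes misfire, so it is worth spot-checking the merged output.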
