# Purpose of this notebook

Finding the varied names used to refer to laws can help,
among other things, resolve the varied references to them, and e.g. see how unambiguous those are.

Even when specific laws and regulations have official names, people will use shortened names, acronyms, and other variations,
and we would like to know these - also to support finding the many references that use these.

<!-- -->

This code takes the approach of "we don't care to catch every case; if we throw enough data at it, we will get clean results via consistent use".

That said, a lot of the (messy) code is there specifically to get something out of at least the majority of references.
It is messy because it plays whack-a-mole with alternate ways of formatting names and/or identifiers.

<!-- -->

This notebook may not be directly useful to you, but it contains some code you might consider taking, and which we might put into our library.

<!-- -->

The below spends time stripping things like "article x lid y" from the text because we care mostly about the main names,
but the non-stripped text is interesting in its own right, e.g. to train a classifier that finds these references in text.
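
To make that stripping concrete, here is a minimal standalone sketch. The helper name `strip_part_reference` and its patterns are assumptions made for illustration; they are simpler than the cleanup functions defined in the cells below and will miss many real-world variations.

```python
import re

# Hypothetical helper for illustration; the patterns are simplified assumptions,
# not the cleanup functions defined further down in this notebook.
_PART_PATTERNS = (
    r'^[Aa]rt(?:[.]|ikel)?\s+[0-9:.]+[a-z]*\s*',   # "artikel 3", "art. 8:54a" at the start
    r'^(?:,\s*)?lid\s+[0-9]+\s*',                   # "lid 2" directly after that
    r'^(?:,\s*)?van\s+(?:de|het)\s+',               # the connecting "van de" / "van het"
)

def strip_part_reference(text: str) -> str:
    """Remove a leading 'artikel x lid y van de/het' so that only the name remains."""
    for pattern in _PART_PATTERNS:
        text = re.sub(pattern, '', text)
    return text.strip()

print(strip_part_reference('artikel 3 lid 2 van de Woo'))
print(strip_part_reference('art. 8:54a van de Algemene wet bestuursrecht'))
print(strip_part_reference('Successiewet 1956'))
```

Keeping both the stripped and the unstripped string around would let the same data serve name counting and, later, classifier training.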

In [1]:
import re, collections, pprint, random, json, urllib.parse

import bs4, requests

import wetsuite.datasets
import wetsuite.helpers.etree
import wetsuite.helpers.localdata
import wetsuite.helpers.meta
import wetsuite.helpers.strings
import wetsuite.helpers.koop_parse
import wetsuite.helpers.spacy
import wetsuite.extras.word_cloud

In [2]:
_parse_cache = {} # url -> etree object; see cached_parse for why
# in a separate cell so we can tweak the helpers without clearing this cache

In [3]:
# some helper functions we will use later

def cached_parse(store, url, disabled=1):
    ''' This isn't necessary, but helps speed in notebooks while working on this:
        when repeatedly running extractions on the same HTML/XML, 
        it helps to keep those parses in memory, rather than interpret them each time.

        This caches the result of fetch-and-parse.
        It does however take considerable RAM when, say, reading _all_ versions within BWB, so `disabled` is a quick hack for this to _just_ parse each time after all
    '''
    if not disabled  and  url in _parse_cache:
        return _parse_cache[url] # return cached
    else: # parse, and store in cache
        data = store.get( url )
        tree = wetsuite.helpers.etree.fromstring( data )
        tree = wetsuite.helpers.etree.strip_namespace( tree )
        #wetsuite.helpers.etree.indent_inplace( tree )
        if not disabled: # store
            _parse_cache[url] = tree
        return tree
    

# The distinction between the two cleanup functions became vague;  maybe merge
# and there is an argument for using  wetsuite.helpers.patterns.find_nonidentifier_references() to strip the 
#  TODO: consider both
def cleanup_basics(name: str):
    ' tries to turn a more detailed reference like "artikel 3 van de Woo"  into  "Woo"   but nothing more creative than that ' 
    if 'van de ' in name:
        name = name[name.index('van de ')+7:].strip()
    if 'van het ' in name:
        name = name[name.index('van het ')+8:].strip()
    if ', art' in name: # "Woo, artikel" -> Woo
        name = name[:name.index(', art')].strip()

    for re_remove in ('^[Aa]rt(?:[.]|ikel)? [0-9:.]+[a-z]*',   # at start
                        '[Aa]rt(?:[.]|ikel)? [0-9:.]+[a-z]*$',  # at end
                        ):
        if re.search( re_remove, name ) is not None:
            name = re.sub(re_remove, ' ', name).strip(', ')

    for re_remove in ('^lid [0-9:.]+[\u00baa-z]*',   # at start
                        'lid [0-9:.]+[\u00baa-z]*$',  # at end
                        ):
        if re.search( re_remove, name ) is not None:
            name = re.sub(re_remove, ' ', name).strip(', ')

    return name


def cleanup_wet_title(name: str):
    ''' Takes what is probably a full name of a thing, but may in some of the below uses be the entire aanhaal alinea,
        tries to take off anything not the title - quotes, spaces, rest of a sentence.
    '''
    name = name.replace('\u00A0',' ') # replace non-breaking space with a regular one

    # take out matching quotations -- and assume they are being used to demark the entire name  (the rest and the below still executes but is unlikely to match)
    for qleft, qright in (
        ('\u2018', '\u2019'),
        ('\u00ab', '\u00bb'),
        ('\u201c', '\u201d'),
        ('\u201e', '\u201d'),
        ('\u201d', '\u201d'),
    ):
        left_index  = name.find(qleft)
        right_index = name.rfind(qright)
        if left_index != -1 and right_index != -1   and left_index < 10  and  right_index > left_index+5:
            name = name[ left_index+1 : right_index ]

    if name.startswith('"') and name.count('"')>=2: #assume these are outside quotations
        try:
            name = name[ 1: name.index('"', 4) ].strip()
        except ValueError: # no closing quote found at or after index 4
            print( repr(name) )
            raise

    # aimed at the aanheffing sentence specifically
    for re_cutafter in(
        r'en zal',                              # probably more often than not the note where it will be publicised
        r'en treedt',                           # probably more often than not the note when it will be publicised
        r'[;,.] ([zZ]ij |Het |Dit besluit )?(treedt|treden) in werking',
        r'[.] Deze ',
        r'zal worden gepubliceerd', 
        r'en wordt gepubliceerd',
        r'wordt .{0,20}in de Staatscourant',
        r' (en|zij|zei|zal) (wordt|worden) bekend\s?gemaakt',
        r'[.] De beleidsregels', 
        r' en werkt terug ',
        r'Zij werkt terug ',
        r'De regeling zal met ',
        r' met vermelding van ',
        r' met bijvoeging van ',

        # TODO: extract abbreviations too
        r'(, )?(of )? afgekort (als|tot)',
        r'[\(]afgekort\b',
    ):
        match = re.search( re_cutafter.replace(' ',r'[\s\n]+'), name, re.M )
        if match is not None:
            name = name[:match.start()]

    if name.startswith('de '):  # wet
        name = name[3:].strip()
    if name.startswith('het '): # besluit
        name = name[4:].strip()
    # note: taking off 'nieuwe' at the start is dangerous because that is frequently part of the actual name

    name = name.strip('\u2018\u2019') #quote
    name = re.sub(r'\s+',' ', name).strip()
    name = name.strip('"\'.;:,\u00ab\u00bb \u2018\u2019\u201c\u201d\u201e')
    return name


def title_is_too_generic( title ):
    """ Various documents will refer to shortened names they defined earlier.    Until we perhaps parse and can use that contextual information, we can ignore these. 
        Sometimes used after abovementioned cleanup.
    """
    title_lower = title.strip().lower()
    title_lower = title_lower.replace('genoemde ','').strip()
    if title_lower in ('die wet', 'de wet', 'deze wet', 'regeling', 'wet',  'besluit', 'het besluit', 'wetboek', 'uitvoeringswet', 'subsidieregeling',  'fonds', 'het', 'model', 'mens',  
               ''): # some of these are singular edge cases, shouldn't really be here
        return True
    return False


def name_from_extref_tag(extref_tag, allow_fuzzier=False):
    ''' Takes an etree object, specifically an extref.
        Returns the part of the extref's text that is probably the name of a thing being referenced,
          stripped of some obvious stuff
        ...or None, if we decide it's probably not a useful name, or we're not really sure what's in there.
          with allow_fuzzier==False we're pretty strict about only matching some known patterns,
          with allow_fuzzier==True we just hope for the best

    '''
    match1 = re.search(' van (?:de|het) (.*)', extref_tag.text)   # I've seen a case where 'van de' is in the reference twice, like 'van de wet, in artikel 7 van de Elektriciteitswet 1998', but meh.
    match2 = re.search('lid, (Wet .*)', extref_tag.text)

    if match1 is not None:
        name = match1.groups()[0].strip()
        if name == 'wet': #  'van de wet' without a name
            return None
    elif match2 is not None:
        name = match2.groups()[0].strip()
    elif extref_tag.text.startswith('Wet ') or extref_tag.text.startswith('Successiewet ') or extref_tag.text.startswith('Besluit ') or extref_tag.text.startswith('Verordening ') or extref_tag.text.startswith('Warenwetbesluit '):
        name = extref_tag.text.strip()
    elif extref_tag.text.rstrip('0123456789 ').endswith('wet'): # assume it's a short name
        name = extref_tag.text.strip()
    elif extref_tag.text.rstrip('0123456789 ').endswith('besluit'): # assume it's a short name
        name = extref_tag.text.strip()
    elif extref_tag.text.rstrip('0123456789 ').endswith('regeling'): # assume it's a short name
        name = extref_tag.text.strip()
    elif re.match( 'W[A-Za-z]+$', extref_tag.text)   and len(extref_tag.text) in (2,3,4)  and  extref_tag.text.lower()!='wet': # assume it's a short initialism, for a law
        name = extref_tag.text.strip()
    else: # maybe look at what these are
        if allow_fuzzier:
            name = extref_tag.text.strip() # will still be cleaned below
            name = cleanup_basics( name ) # this is a little awkward of a combination, really
        else:
            #print( 'DUNNO %r'%(wetsuite.helpers.etree.tostring(extref_tag).decode('u8')) )
            return None

    # right now we're interested in the main name, not reference to specific part
    #  ...but we could later try to deal with those as well - there are ~15K cases that the below skipped
    name = cleanup_wet_title(name)
    #name = cleanup_basics(name)
    
    # ignore less-formal references
    if title_is_too_generic(name):
        #print( "SKIP non-useful name (generic 1): %r"%name)
        return None

    if wetsuite.helpers.strings.contains_any_of(name, [', van (deze|die) wet$'], regexp=True ):
        #print( "SKIP non-useful name (generic 2): %r"%name)
        return None
    
    #if wetsuite.helpers.strings.contains_any_of( name.lower(), (
    #    'artikel ','artikelen ', 'art.', 'lid,', 'paragraaf ', 'van die ', ' die wet', 'van de wet', 'van deze wet', 'met vermelding ',
    #    'eerste lid', 'tweede lid', 'boek van ', 'weede lid', 'met de bijbehorende', 'voornoemde wet', 'Bijlage ', 'bijlage bij', 'bijlage ') ):
    #    #print( "SKIP FOR NOW (specific reference): %r"%name) # there
    #    return None
    return name


resolved = wetsuite.helpers.localdata.open_store( 'redirect_urls.db', str, str )
def resolve_deeplink(url):
    ' Given BWB deeplink URLs (mainly the title ones), try to fetch the page and return the BWB-id it resolves to (or None) '
    if url in resolved:
        #print("CACHED for %r"%url)
        retval = resolved.get( url ) # which might be None
        if retval == 'None': # special casing, since it's either that or a BWBR. We _could_ store 'could not resolve' and its reason, but there is little point.
            return None
        return retval
    else:
        print("FETCHING %r"%url)
        resp = requests.get(url, allow_redirects=True)
        soup = bs4.BeautifulSoup(resp.content, 'html.parser') # name the parser explicitly, to avoid bs4's guessed-parser warning
        identifier_tags = soup.select("meta[name='dcterms:identifier']")
        if len(identifier_tags)>0:
            bwbid = identifier_tags[0].get('content')
            resolved.put( url, bwbid )
            return bwbid
        else:

            if b'De opgevraagde pagina is niet gevonden' in resp.content:
                #print("DID NOT RESOLVE %r"%resp.history[-1].url)
                resolved.put( url, 'None' )
            elif b'Gebruikt formaat in de URL wordt niet ondersteund' in resp.content:
                #print("DID NOT RESOLVE %r"%resp.history[-1].url)
                resolved.put( url, 'None' )
            else:
                print("DID NOT HANDLE %r"%resp.url) # resp.url is the final URL, and also works when there were no redirects
            return None



# Collect references from BWB

## Names from BWB's intitule, citeertitel, and "aangehaald als" text

__Citeertitel__ - is often a fairly succinct name
: this basically comes from the last paragraph, which says something like "Deze wet wordt aangehaald als: Wet op de rechterlijke organisatie.". The code below does not assume the two are identical, in part for ease of code: mostly only laws have that paragraph, but _everything_ has a citeertitel.

__Intitule__ - is often a more detailed description, and will rarely be the text people use to cite (though sometimes it is the _same_ as citeertitel)
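
As a minimal illustration of the "aangehaald als" idea, the sketch below pulls the title out of a single sentence. The function name and the pattern are assumptions made for this example; the cell that follows uses a somewhat broader pattern and more cleanup.

```python
import re

# Simplified sketch: the pattern and function name are illustrative assumptions.
_AANHAAL_RE = re.compile(r'\baangehaald als:?\s*(.+)')

def title_from_aanhaal_sentence(sentence: str):
    """Return the title named after 'aangehaald als', or None if the phrase is absent."""
    match = _AANHAAL_RE.search(sentence)
    if match is None:
        return None
    return match.group(1).strip().rstrip('.').strip()

example = 'Deze wet wordt aangehaald als: Wet op de rechterlijke organisatie.'
print(title_from_aanhaal_sentence(example))
print(title_from_aanhaal_sentence('Deze regeling treedt in werking op 1 januari.'))
```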

In [4]:
bwb_latestonly_xml = wetsuite.datasets.load('bwb-mostrecent-xml')
# there are a few laws that changed name, but due to that choice of data, we focus on the recent version

                                                                     # lists mostly for code below to be simpler, they'll contain 0 or 1 items
bwb_names_citeertitel = collections.defaultdict(list)   # BWB-id -> list of name strings     from something's own metadata, what it calls itself
bwb_names_aanhaling   = collections.defaultdict(list)   # BWB-id -> list of name strings     from something's own data, what it calls itself
bwb_soort             = {}                              # BWB-id -> soort    


for xml_url in bwb_latestonly_xml.data.keys(): # note: going through ~36K documents will take a minute or two
    tree = cached_parse( bwb_latestonly_xml.data, xml_url )

    meta = wetsuite.helpers.koop_parse.bwb_toestand_usefuls(tree)
    bwbid = meta['bwb-id']
    bwb_soort[bwbid] = meta['soort']

    ## record the citeertitel 
    # which is just in the metadata
    bwb_names_citeertitel[ bwbid ].append( cleanup_wet_title(meta['citeertitel']) )  


    ## find the self-reference
    # find all alineas that mention 'aangehaald' (if any), find the thing it then mentions
    aanhaling = []
    for al_tag in tree.iter(tag='al'): # iter() replaces the deprecated getiterator()
        tagtext = wetsuite.helpers.etree.all_text_fragments( al_tag, strip='\n', join='' )
        #tagtext = all_text_fragments( tag, strip='\n', join='' )
        #if 'aangehaald' in tagtext:
            #print( tagtext )
        match = re.search(r' (?:aangehaald als|aangehaald onder de titel van):?\s*(.*)', tagtext) # this probably filters out too much
        if match is not None:
            aanhaling.append( match.groups()[0].rstrip('.') )

    if len(aanhaling)==0: 
        #TODO: consider subelements, e.g. the CO<sub>2</sub> example

        # While debugging, it is useful to check that indeed they're without such a reference
        if wetsuite.helpers.strings.contains_any_of( meta['intitule'].lower(), ['regeling ', 'toepassing ', 'aanwijzing ', 'besluit ', 'beschikking ']): # maybe look at soort instead?
            pass
            #print( "SELFAANHAALFAIL (OKAY; seeming non-law) in %r"%xml_url)
        else:
            pass
            #print( "SELFAANHAALFAIL in %r"%xml_url)
        # there was previously also an 'is this french' check
    else:
        text = aanhaling[-1] # pretty sure it's always the last

        cleaner_title = cleanup_wet_title( text )
        if cleaner_title not in bwb_names_aanhaling[ bwbid ]:
            bwb_names_aanhaling[ bwbid ].append( cleaner_title )
        if '(' in cleaner_title: # some titles have parentheses at the end. Probably add them both with and without, because that may or may not be part of the name (TODO: clean more)
            cleaner_title = cleaner_title[:cleaner_title.index('(')].strip()
            if cleaner_title not in bwb_names_aanhaling[ bwbid ]:
                bwb_names_aanhaling[ bwbid ].append( cleaner_title )


In [6]:
len(bwb_latestonly_xml.data)

36906

### Quick summary

In [10]:
# how much did we get?
print("From  %d BWB documents  we got  %d citeertitels  and  %d aanhalingen."%(
    len(bwb_latestonly_xml.data),
    len(bwb_names_citeertitel),
    len(bwb_names_aanhaling),
))

From  36906 BWB documents  we got  36906 citeertitels  and  21196 aanhalingen.


In [12]:
# Quick inspection of
#    what kind of things are we apparently not cleaning from the aanhaling? 
# and/or 
#   what kind of differences are there between aanhaling and citeertitel?

for bwbid in bwb_names_aanhaling:
    aanhaling   = bwb_names_aanhaling[bwbid]
    citeertitel = bwb_names_citeertitel[bwbid]
    if set( aanhaling ).symmetric_difference( citeertitel ): 
        print()
        print( bwbid )
        print( aanhaling )
        print( citeertitel )


BWBR0001919
['Deegwarenbesluit (Warenwet)', 'Deegwarenbesluit']
['Deegwarenbesluit (Warenwet)']

BWBR0002028
['Meststoffenwet']
['Meststoffenwet 1947']

BWBR0002030
['Wet voorzieningen onder den vijand aangetroffen goederen']
['Wet van 24 april 1947, houdende voorzieningen onder den vijand aangetroffen goederen']

BWBR0002055
['Wet Souvereiniteitsoverdracht Indonesië']
['Wet souvereiniteitsoverdracht Indonesië']

BWBR0002099
['WET OORLOGSSTRAFRECHT']
['Wet oorlogsstrafrecht']

BWBR0002121
['Beschikking commissie gewetensbezwaren immunisatie militairen']
['Beschikking Commissie Gewetensbezwaren Immunisatie Militairen']

BWBR0002130
['Vestigingswet Bedrijven']
['Vestigingswet Bedrijven 1954']

BWBR0002226
['Successiewet']
['Successiewet 1956']

BWBR0002269
['Pachtwet 1937']
['Pachtwet']

BWBR0002346
['Lijst MKR']
['Militair keuringsreglement']

BWBR0002365
['']
['Rijkssubsidieregeling voor het algemeen maatschappelijk werk']

BWBR0002398
['Landbouwuitvoerbesluit 1963 (Algemene Voorwaard

### Extrefs in the BWB

In the BWB in XML form, `<extref>` tags are references to other laws, regulations, and more. 
In a wider context, extref tags are fairly free-form in what link text they contain,
yet within the set of laws they are used fairly consistently and provide fairly clean data.

We are currently interested in those that point to laws, and those will contain a BWB-id and look something like:
`<extref verwijzing-id="2189982" doc="jci1.3:c:BWBR0015163&amp;artikel=5" bwb-id="BWBR0015163" label-id="4716344">artikel 5 van het Instellingsbesluit Productschap Vis</extref>`
We should perhaps filter by it pointing to BWB documents with soort=wet, but can do that later.

As we specifically look for repeated, clearer cases, we are not bothered to extract every reference: we are just looking for consistent use, which works as long as the majority of what we extract is useful.
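
To show what such a tag gives us, here is a small standalone sketch that parses the example extref above with the standard library. The function name `describe_extref` is made up for this illustration, and the way the `doc` attribute is split into key=value parts is an assumption based on this one example, not a description of the jci format.

```python
import xml.etree.ElementTree as ET

# The extref example from the text above, as a standalone XML fragment.
fragment = (
    '<extref verwijzing-id="2189982" doc="jci1.3:c:BWBR0015163&amp;artikel=5" '
    'bwb-id="BWBR0015163" label-id="4716344">'
    'artikel 5 van het Instellingsbesluit Productschap Vis</extref>'
)

def describe_extref(xml_text: str) -> dict:
    """Pull the target BWB-id, the link text, and any key=value parts out of one extref."""
    tag = ET.fromstring(xml_text)
    doc = tag.get('doc') or ''
    # Assumption: whatever follows the first '&' is a list of key=value pairs
    parts = dict(pair.split('=', 1) for pair in doc.split('&')[1:] if '=' in pair)
    return {'bwb_id': tag.get('bwb-id'), 'text': tag.text, 'parts': parts}

info = describe_extref(fragment)
print(info['bwb_id'])
print(info['text'])
print(info['parts'])
```

The `bwb_id` says which document is referenced, and the text is what we mine for the name people use for it.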

In [13]:
bwb_names_extref = collections.defaultdict(list)   # BWB-id -> list of name strings    (those names being what references from elsewhere call this BWB)

for xml_url in bwb_latestonly_xml.data.keys(): # the store is in .data; the Dataset object itself has no keys()
    tree = cached_parse( bwb_latestonly_xml.data, xml_url )
    #meta = wetsuite.helpers.koop_parse.bwb_toestand_usefuls(tree)

    for extref_tag in tree.iter(tag='extref'): # find all external references, regardless of context in the document

        # is the thing we point to marked with a BWB identifier   (this may miss some things, but the else: below should tell us)
        ref_bwb = extref_tag.get('bwb-id') # normalize?
        if ref_bwb is not None:
            # We might not care about things the above didn't already have names for
            #if ref_bwb not in bwb_name_self:
            #    continue

            # (parsing an extref tag might become a function eventually, but right now it's specific content in a specific dataset)

            if extref_tag.text is not None:

                name = name_from_extref_tag( extref_tag )
                if name is None:
                    #print( "SKIP, apparently boring name: %r"%wetsuite.helpers.etree.tostring(extref_tag).decode('utf8') )
                    continue

                if name.endswith(')') and '(' in name: 
                    i = name.rfind('(') # find last open bracket, if there are multiple
                    
                    one = cleanup_wet_title( name[:i].strip() )
                    bwb_names_extref[ref_bwb].append( one )
                    
                    # IF this seems to be in the form 'bla bla (wet bla)' or 'bla bla (blawet)', then add the bracket contents
                    two = cleanup_wet_title( name[i+1:-1].strip() ) # -1 is valid only as long as that endswith up there stays
                    if ' ' in two and 'wet' in two.lower():
                        bwb_names_extref[ref_bwb].append( two )
                else: # enter as-is
                    bwb_names_extref[ref_bwb].append( name )

                # Note that we're specifically recording every reference (there will be many duplicates), 
                #  so that we can count them later

        else: # not references to bwb entries 
            # some checks to ignore what we know of and might handle later,
            #  and to print out everything else to figure out what we don't handle yet
            if extref_tag.get('reeks') == 'Celex': # EU identifiers
                pass
            elif extref_tag.get('doc') == 'onbekend': # ?
                pass
            else: # currently a few dozen cases left, that's fine to ignore
                pass
                #print( 'UNKNOWN extref %r'%(wetsuite.helpers.etree.tostring(extref_tag).decode('u8')) )

In [None]:
# everything we just collected -- warning: about 300K lines of output
#pprint.pprint( bwb_names_extref )

### Check: check whether what we collect looks reasonable 

The above extracted
* `bwb_names_citeertitel` from metadata
* `bwb_names_aanhaling`   from text
* `bwb_names_extref`      from links
   * _every_ occurrence was added to a list, to be able to count which one is common. We can use `wetsuite.extras.word_cloud.count_normalized()`, which also lets us unify capitalization-only variation (and reports the most common capitalisation)

We could do various things with this, e.g.
* count extref names (or _all_) -- to look at unusual cases
* look at ambiguity, e.g. duplicates
* names used for self, names used by others -- to search in
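
The counting idea can be sketched with only the standard library. This is a stand-in written for illustration, not the implementation of `count_normalized()`; the function name `count_case_insensitive` is an assumption of this example.

```python
import collections

def count_case_insensitive(names):
    """Count names ignoring capitalisation; report each group under its most common spelling."""
    per_key = collections.defaultdict(collections.Counter)
    for name in names:
        per_key[name.strip().lower()][name.strip()] += 1
    result = {}
    for spellings in per_key.values():
        most_common_spelling, _ = spellings.most_common(1)[0]
        result[most_common_spelling] = sum(spellings.values())
    return result

names = ['Algemene wet bestuursrecht', 'Algemene Wet Bestuursrecht',
         'Algemene wet bestuursrecht', 'Awb', 'AWB', 'Awb']
print(count_case_insensitive(names))
```

Rare variants end up with low counts, which is what lets a frequency threshold separate consistent use from noise.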

In [9]:
# Look at some hand picked examples

bwbrs = sorted(  set(bwb_names_citeertitel) | set( bwb_names_aanhaling) | set(bwb_names_extref)  )  # their keys,  so joins all bwb-ids that any part found
print( 'Distinct BWB-ids we found some names for: ',len(bwbrs) )

for example_bwbid in ('BWBR0031986', # here to point out it's mostly laws that have citeertitels and aanhaling paragraphs - references to this may be more descriptive
                      'BWBR0005537', # here to point out we can find acronyms too  (though we may need some cleverer non-linear behaviour in count_normalized)
                      'BWBR0015703', # here to point out this one changed names
):
    print()
    print( ' --> ', example_bwbid )
    print( 'Citeer:              ', bwb_names_citeertitel.get(example_bwbid) )
    print( 'Aanhaal:             ', bwb_names_aanhaling.get(example_bwbid)   )
    print( 'Extref refs to this: ', len(bwb_names_extref.get(example_bwbid))   )
    print( '  distinct & count:  ', wetsuite.extras.word_cloud.count_normalized( bwb_names_extref.get(example_bwbid), min_count=1,   min_word_length=1, normalize_func=lambda x:x.lower().strip() ) )
    print( '  filtered:          ', wetsuite.extras.word_cloud.count_normalized( bwb_names_extref.get(example_bwbid), min_count=0.05, min_word_length=1, normalize_func=lambda x:x.lower().strip() ) )

Distinct BWB-ids we found some names for:  37336

 -->  BWBR0031986
Citeer:               None
Aanhaal:              None
Extref refs to this:  3
  distinct & count:   {'Verordening fonds sociale aangelegenheden vleeswarenindustrie (PVV) 2012': 2, 'Verordening fonds voor onderzoek en ontwikkeling vleeswarenindustrie (PVV) 2012': 1}
  filtered:           {'Verordening fonds sociale aangelegenheden vleeswarenindustrie (PVV) 2012': 2, 'Verordening fonds voor onderzoek en ontwikkeling vleeswarenindustrie (PVV) 2012': 1}

 -->  BWBR0005537
Citeer:               ['Algemene wet bestuursrecht']
Aanhaal:              ['Algemene wet bestuursrecht']
Extref refs to this:  8392
  distinct & count:   {'Algemene wet bestuursrecht': 7718, 'artikel 8:86 van die wet': 1, 'artikel 8:54a van die wet': 1, 'tweede lid': 4, 'artikel 3:12 van die wet': 10, '3:12 van die wet': 5, 'artikel 3:18 van die wet': 7, 'artikel 3:13 van die wet': 1, 'artikel 3:44 van die wet': 2, 'Algemeen wet bestuursrecht': 4, '3:16 

## Collect references from CVDR

The CVDR XML has `<dcterms:source>` elements in its header, which are references made in the text.

Some of these entries signify something like "we know this is a reference but aren't certain what to"; we ignore those.
To see an example that contains both, see e.g. [CVDR100088/1](https://repository.officiele-overheidspublicaties.nl/CVDR/100088/1/xml/100088_1.xml).


<!-- -->

The XML documents in the CVDR repository have extref tags,
but they are a little more varied than those in BWB due to what these documents practically do in a legal sense:
these extrefs are more often references to _specific parts_ of laws.

This requires more cleanup of more human wording,
but the amount of data should give cleaner preferences for the main name,
and the type of its use should also give more idea of what laws are more commonly referenced.

In [10]:
# NOTE: going through 160K documents will take a few minutes - also why this is split from the next cells

## Go through CVDR XMLs and collect... 

cvdr_lastonly_xml = wetsuite.datasets.load('cvdr-mostrecent-xml')


cvdr_sref_data   = {} # xml_url it came from   ->   [(type, orig, specref, None, source_text), ...]    (see cvdr_sourcerefs's documentation)
cvdr_extref_data = {} # xml_url it came from   ->   [extrefnode, ...]    (see cvdr_sourcerefs's documentation)

for xml_url in list(cvdr_lastonly_xml.data):
    tree = cached_parse( cvdr_lastonly_xml.data, xml_url )
    meta = wetsuite.helpers.koop_parse.cvdr_meta(tree, flatten=True)

    ## ...collect the <dcterms:source> references,
    # use an existing function. Which returns a list of tuples, we'll deal with their content below
    srefs = wetsuite.helpers.koop_parse.cvdr_sourcerefs(tree, ignore_without_id=True)   
    if len(srefs)>0:
        cvdr_sref_data[ xml_url ] = srefs

    ## ...collect extref tags from the document overall, since these are often references, mostly to laws and other CVDR articles.
    extrefs = list( tree.iter(tag='extref') )
    if len(extrefs)>0:
        cvdr_extref_data[ xml_url ] = extrefs

In [11]:
## process the sourceref part of what we just collected
cvdr_sourceref_names = collections.defaultdict(list) # BWB-id -> list of names used
count_ignoring = collections.defaultdict(int)

for xml_url, sourceref_list in cvdr_sref_data.items():
    for type, orig, bwbr, parts, spectext in sourceref_list:
        if type=='BWB':
            #print()
            #print([spectext, bwbr, parts])

            # The text may often be something like "artikel 5 van de wet Bla" or "Artikel 5, wet Bla" and we want just Bla
            #   try to find the name by taking off specific bits of reference
            #   the references themselves are fairly clean to start with, so this works moderately well
            name = cleanup_basics( spectext )

            cvdr_sourceref_names[bwbr].append( name )

            if 0:
                # this is unrelated -- was trying to see which keys are used in the jci details,  because jci doesn't actually seem to define that
                # this next line is most of them
                for known in ('boek', 'hoofdstuk', 'titeldeel', 'afdeling','afd', 'artikel', 'lid', 'paragraaf', 'bijlage'):
                    if known in parts:
                        parts.pop( known )
                if 'g' in parts:
                    g = parts.pop('g')
                if len(parts)>0:
                    print('PARTSLEFT in %s: %s'%(xml_url, parts), [spectext, bwbr, parts])
        else:
            count_ignoring[type] += 1

count_ignoring

defaultdict(int, {'CVDR': 3878})

In [None]:
# warning: ~50k lines
#cvdr_sourceref_names

In [37]:
## and process the extref part
# The contents are a little more varied than in the above BWB case, so we need some more content cleanup code

_jcilike1 = re.compile(r'1\.[0-9]+:[cv]:(BW[BRWVB0-9]+)') # something like '1.0:v:BWBR0005537' which seems like an abuse of jci
_jcilike2 = re.compile(r'1\.[0-9]+:(CVDR[0-9_]+)')        # something like '1.1:CVDR215805_1' which seems like a bastardization of jci

cvdr_extrefs = []

# try to keep track of how many cases we're using, or ignoring for cleanliness's sake
count_ignored  = 0
count_matched  = 0
count_division = collections.defaultdict(int)

for xml_url, extrefs in cvdr_extref_data.items():
#for xml_url, extrefs in list(cvdr_extref_data.items())[:5000]:
    for extref in extrefs:
        if extref.text is None:
            continue

        matching = False
        ignoring = 0  # meaning not, 1 meaning tentatively, 2 meaning thoroughly

        value = extref.attrib.get('doc', None)
        if value is not None: # strip only when present; the None case is counted below
            value = value.strip()
        # that value is often an ID or URL
        #   in theory the 'struct' attribute tells us how to interpret the value, but it's not there much of the time, so we try not to rely on it
        
        #print( 'VALUE   %r '%value)
        if re.match(r'[\[\]0-9\s]', extref.text): # numbered but nameless references like '[1]'
            ignoring = 2
            count_division['justnumber'] += 1

        if value is None:
            ignoring = 2
            count_division['none'] += 1

        else: # value is not None:
            name         = name_from_extref_tag( extref )                        # should give a cleaner answer, but answers None more easily
            name_fuzzier = name_from_extref_tag( extref, allow_fuzzier=True )    # used as a fallback if the previous is None
                                                                                 #   the code for that is in specific sections, because it can vary per extref type
                                                                                 #   though note that means each section can forget, so... don't.
            _jcilike1m = _jcilike1.match( value )
            _jcilike2m = _jcilike2.match( value )

            if value == '':
                ignoring = 2
                count_division['empty'] += 1
            elif value.startswith('about:blank'):   # riiiight.
                ignoring = 2
                count_division['empty'] += 1
            elif value.startswith('bookmark://'):   
                ignoring = 2  
                count_division['pointless'] += 1
            elif value.startswith('mk:@MSITStore'): # that's a local file   (IE, CHM stuff)
                ignoring = 2
                count_division['local'] += 1
            elif value.startswith('file://'):       # more local files
                ignoring = 2
                count_division['local'] += 1
            elif value.startswith('Xopus.asp'):     # that's an internal reference, apparently https://www.koopoverheid.nl/documenten/instructies/2017/10/24/gebruikershandleiding-xopus-xml-editor
                ignoring = 2
                count_division['local'] += 1
            elif value.startswith('mailto:'):
                ignoring = 2  
                count_division['pointless'] += 1

            elif value.startswith('/cvdr'): # probably to a served image or PDF or such
                ignoring = 2
                count_division['relative'] += 1

            elif value.startswith('#'): # probably a page anchor - maybe interesting actually?
                ignoring = 2  
                count_division['samepage'] += 1

            ## cases we can probably use:
            elif value.startswith('http://wetten.overheid.nl/cgi-bin/deeplink/'):
                # will look like
                #   http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Burgerlijk%20Wetboek%20Boek%201
                #   http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005537/article=1:2
                # Those are not query parameters, and I've not found the standard this might be following, so some estimation is involved here
                rest = value[value.index('law1/')+5:]

                if rest.startswith('title%3D'): # probably inconsistent escaping?
                    ignoring = 1
                    count_division['badesc'] += 1
                elif rest.startswith('bwbid%3D'):
                    ignoring = 1
                    count_division['badesc'] += 1

                elif rest.startswith('title='): # quite a few of these

                    bwbid = resolve_deeplink( value ) # does fetches, and caches them. First run will be slower
                    if bwbid is None:
                        ignoring = 1 # 2?
                        count_division['could not resolve deeplink'] += 1
                    else:

                        if name is None and name_fuzzier is None:  # this is copied deeper into the logic (a few times) because it's a less common reason to give up / fall back
                            name = name_fuzzier

                            count_division['uninteresting extref text?'] += 1
                            ignoring = 2 # maybe 1 to report, unless we're printing:
                            #print('SKIP uninteresting extref text (None or %r from %r)'%( name_fuzzier, extref.text) )
                        else:
                            name = name or name_fuzzier
                            
                            tt = cleanup_basics( name )

                            if '/' in rest:
                                count_division['detailed title deeplink'] += 1
                                matching = True
                                cvdr_extrefs.append( ['BWB-deeplink-title-detailed', bwbid, tt])
                                #print(['TEST1.1', bwbid, tt, name])
                                #raise ValueError( 'SKIP FOR NOW, specific reference   in  %r'%rest )
                            else:
                                count_division['basic title deeplink'] += 1
                                matching = True
                                #title = urllib.parse.unquote(rest[6:])
                                #print(['TEST1.2', bwbid, tt, name])
                                cvdr_extrefs.append( ['BWB-deeplink-title', bwbid, tt]) # fall back to the below style
                                # TODO: actually fetch these to see what identifier they end up on to

                elif rest.startswith('bwbid='):
                    if name is None and name_fuzzier is None:  # this is copied deeper into the logic (a few times) because it's a less common reason to give up / fall back
                        name = name_fuzzier

                        count_division['uninteresting extref text?'] += 1
                        ignoring = 2 # maybe 1 to report, unless we're printing:
                        #print('SKIP uninteresting extref text (None or %r from %r)'%( name_fuzzier, extref.text) )
                    else:
                        name = name or name_fuzzier

                        if '/' in rest:
                            tt = cleanup_basics( name )
                            bwbid = urllib.parse.unquote( rest[rest.index('=')+1:rest.index('/')] )
                            #print(['TEST2.1', bwbid, name])
                            
                            ignoring = 1 # TEMPORARILY 2
                            count_division['deeplink too detailed; TODO'] += 1
                            #raise ValueError( 'SKIP FOR NOW, specific reference   in  %r'%rest )
                        else:
                            matching = True
                            bwbid = urllib.parse.unquote(rest[6:])
                            #print(['TEST2.2', bwbid, name])
                            cvdr_extrefs.append( ['BWB-deeplink-bwbid', bwbid, name]) # fall back to the below style

                else:
                    raise ValueError( 'TODO: deal with  %r'%rest )


            elif value.startswith('http://wetten.overheid.nl/') or value.startswith('https://wetten.overheid.nl/'):
                # non-deeplinks; assume these will look something like
                #   http://wetten.overheid.nl/BWBR0012059/geldigheidsdatum_24-09-2008#Hoofdstuk4_Artikel37
                # or
                #   http://wetten.overheid.nl/jci1.3:c:BWBR0001941&amp;artikel=2&amp;z=2017-05-25&amp;g=2017-05-25
                count_division['wetten.overheid.nl'] += 1

                won_match1 = re.search(r'wetten\.overheid\.nl/(BW[^/]+)(?:[/]|$)', value)
                won_match2 = re.search(r'wetten\.overheid\.nl/jci[0-9.]+:[cv]:(BW[0-9A-Z]+)(?:[&]|$)', value) # seems we could use
                
                if won_match1 is not None:
                    if name is None and name_fuzzier is None:
                        ignoring = 1
                        name = name_fuzzier
                    else:
                        matching = True 
                        name = name or name_fuzzier
                        #print(['WON1', won_match1.group(1), name, value] )
                        cvdr_extrefs.append( ['won1', won_match1.group(1), name])

                elif won_match2 is not None:
                    if name is None and name_fuzzier is None:
                        ignoring = 1
                        name = name_fuzzier
                    else:
                        matching = True 
                        name = name or name_fuzzier
                        #print(['WON2', won_match2.group(1), name, value] )
                        cvdr_extrefs.append( ['won2', won_match2.group(1), name])

                else:
                    ignoring = 1
                    #print(['NOWON', value ])


            # other http[s]:// is probably less meaningful.   A few may still be useful for other reasons, but we are currently not looking for them
            elif value.lower().startswith('http://') or value.lower().startswith('https://'):
                ignoring = 1  # was TEMPORARILY 2, while figuring out the deeplink stuff
                count_division['otherlink; CONSIDER'] += 1
                # eg. http://decentrale.regelgeving.overheid.nl/ might still be interesting, but most of these are less interesting
                #print( value)

            elif value.lower().startswith('www.'):
                ignoring = 1
                count_division['otherlink; MAYBE'] += 1
                # these all seem uninteresting
                #print( value)

            else: # This maybe should just be part of the same if-elif chain above, because there is no longer a clean split of cases

                name = name_from_extref_tag( extref )
                if name is None:
                    ignoring = 2  # (don't print identifier part if we skip it for text reference reasons)
                    count_division['boringname?'] += 1
                    #print('SKIP, apparently boring name: %r'%extref.text)

                elif value.startswith('CVDR://'): # 'CVDR://97153_2'
                    matching = True
                    ref = wetsuite.helpers.koop_parse.cvdr_parse_identifier( value[7:] )
                    cvdr_extrefs.append( ['CVDRurl', ref, name])

                elif value.startswith('BWB://'): # e.g. 'BWB://1.0:v:BWBR0011468&artikel=76'
                    try:
                        ref = wetsuite.helpers.meta.parse_jci( value[6:] )
                        cvdr_extrefs.append( ['BWBurl', ref, name])
                        matching = True 
                    except Exception as e:
                        print([value, e])
                        ignoring = 1 

                elif _jcilike1m is not None:
                    matching = True 
                    cvdr_extrefs.append( ['jcilike-bwb', _jcilike1m.group(1), name])

                elif _jcilike2m is not None:
                    matching = True 
                    cvdr_extrefs.append( ['jcilike-cvdr', _jcilike2m.group(1), name])

        if matching:
            count_matched += 1
        elif ignoring:
            count_ignored += 1

        if 0:  # report things we didn't handle,  and/or  couldn't decide on
            say = ''
            if not matching  and  ignoring==0:
                say = 'DID NOT RECOGNIZE:  '
            elif ignoring == 1:
                say = 'IGNORING, MAYBE LATER: '

            if say:
                extref.tail = None            
                print( say, wetsuite.helpers.etree.tostring( extref ).decode('utf8') )
                #print( 'ATTRIB ', extref.attrib )
                #print( 'DOC    ', doc)

print("Ignored %d and took information from %d items"%( count_ignored, count_matched) )
pprint.pprint( count_division )

['BWB://1.0:v:271829_1', ValueError("'1.0:v:271829_1' does not look like a valid jci")]
Ignored 75910 and took information from 51455 items
defaultdict(<class 'int'>,
            {'badesc': 39,
             'basic title deeplink': 10539,
             'boringname?': 9497,
             'could not resolve deeplink': 5881,
             'deeplink too detailed; TODO': 300,
             'detailed title deeplink': 18164,
             'empty': 443,
             'justnumber': 18709,
             'local': 2092,
             'otherlink; CONSIDER': 23802,
             'otherlink; MAYBE': 464,
             'pointless': 722,
             'relative': 21459,
             'samepage': 5121,
             'uninteresting extref text?': 4452,
             'wetten.overheid.nl': 9521})
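
The two non-deeplink `wetten.overheid.nl` URL styles handled above can be pulled apart in isolation. This is a minimal sketch, not what the cell above literally does: `bwbid_from_won_url` is a hypothetical helper (not part of wetsuite), and the patterns only cover the two URL shapes mentioned in the comments.

```python
import re

# Hypothetical helper, not part of wetsuite.
# Style 1:  http://wetten.overheid.nl/BWBR0012059/geldigheidsdatum_24-09-2008#...
# Style 2:  http://wetten.overheid.nl/jci1.3:c:BWBR0001941&artikel=2&...
_WON_PATH = re.compile(r'wetten\.overheid\.nl/(BW[^/]+)(?:/|$)')
_WON_JCI  = re.compile(r'wetten\.overheid\.nl/jci[0-9.]+:[cv]:(BW[0-9A-Z]+)(?:&|$)')

def bwbid_from_won_url(url):
    ''' Returns the BWB-id as a string, or None if neither URL style matches. '''
    for pattern in (_WON_PATH, _WON_JCI):
        match = pattern.search(url)
        if match is not None:
            return match.group(1)
    return None

print( bwbid_from_won_url('http://wetten.overheid.nl/BWBR0012059/geldigheidsdatum_24-09-2008#Hoofdstuk4_Artikel37') )
print( bwbid_from_won_url('http://wetten.overheid.nl/jci1.3:c:BWBR0001941&artikel=2&z=2017-05-25&g=2017-05-25') )
```

Escaping the dots in the hostname means the pattern cannot accidentally match other characters at those positions, which matters little on this data but costs nothing.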


In [None]:
# peek at deeplinks that got resolved, and how many did not
list( resolved.items() )[:200]
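
The deeplink URLs carry their parameters in the path rather than in a query string, so `urllib.parse.parse_qs` does not apply. Below is a rough sketch of splitting that path into key/value pairs. The `parse_deeplink` name and the interpretation of the format (slash-separated `key=value` parts after `law1/`) are assumptions based only on the example URLs seen in the data, not on any specification. Unlike the extraction cell, this sketch also accepts the over-escaped `title%3D` variant.

```python
import urllib.parse

def parse_deeplink(url):
    ''' Hypothetical helper: split the part after 'law1/' into a list of (key, value) pairs.
        Returns None if the URL has no 'law1/' part.
    '''
    marker = 'law1/'
    pos = url.find(marker)
    if pos == -1:
        return None
    rest = url[pos + len(marker):]

    pairs = []
    for part in rest.split('/'):       # split first, so an escaped slash stays inside its value
        if not part:
            continue
        part = urllib.parse.unquote(part)
        key, _, value = part.partition('=')
        pairs.append( (key, value) )
    return pairs

print( parse_deeplink('http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Burgerlijk%20Wetboek%20Boek%201') )
print( parse_deeplink('http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005537/article=1:2') )
```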

### Summarize what we got from CVDR

In [None]:
# source refs
print( '%d source references, to %d BWB-ids'%( 
    sum(  list( len(names)   for _, names in cvdr_sourceref_names.items() )  ),  
    len(  cvdr_sourceref_names  ) 
) )

print('\nThe most common:')
#   we take from a dict and put into a list so we can sort by count
sortable = []
for bwbid, names in cvdr_sourceref_names.items():
    #print()
    name_count = wetsuite.extras.word_cloud.count_normalized( names, min_word_length=1, stopwords=[], min_count=0.3 )
    sortable.append( (sum(name_count.values()), bwbid, name_count) )
sortable.sort(key=lambda x:x[0], reverse=True)
pprint.pprint(sortable[:10])
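
If you want the same kind of "most referenced first" summary without the wetsuite helper, plain `collections.Counter` gets you most of the way. This is a sketch on made-up toy data, and it does exact-string counting only, so it is not a drop-in replacement for `count_normalized`. The `most_referenced` name and the ids and names below are invented for illustration.

```python
import collections

def most_referenced(names_by_id, top=10):
    ''' names_by_id: dict from id -> list of name strings.
        Returns up to `top` tuples of (total references, id, [(name, count), ...]), largest total first.
    '''
    rows = []
    for bwbid, names in names_by_id.items():
        counts = collections.Counter(names)
        rows.append( (sum(counts.values()), bwbid, counts.most_common()) )
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]

# made-up toy data, not real identifiers
toy = {
    'BWBR0000001': ['Voorbeeldwet', 'Vw', 'Vw'],
    'BWBR0000002': ['Andere wet'],
}
summary = most_referenced(toy)
print( summary )
```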

In [44]:
print(  'Extracted %d interesting extrefs  (~%d different names)'%(
    len(cvdr_extrefs),
    len( set(nm   for _,_,nm  in cvdr_extrefs))
)  )

Extracted 51455 interesting extrefs  (~6081 different names)


In [46]:
# Go through what the above collected, put it into a dict like the previous sections did
cvdr_extrefs_filtered = collections.defaultdict(list)   # BWB-id -> list of name strings


print( '%d extrefs'%( len( cvdr_extrefs ) ) )

# note: the slice means only the first 2000 extrefs are processed; drop it to process all of them
for typ, ref, text in cvdr_extrefs[:2000]:
    if typ in ('CVDRurl','jcilike-cvdr'): # not currently interested in these references
        continue
    
    if text is None:
        print( 'SKIP ', typ, ref, text )
        continue

    elif typ == 'BWB-deeplink-title':           # ref is a bwbid
        cvdr_extrefs_filtered[ref].append( text )
#        print( ref, text )
        #print( typ, ref, text)
        #if bwbid is not None:
        #    #print( ref)
        #    #print( typ, ref, text )
        #    cvdr_extrefs_filtered[bwbid].append( text )
        #else:
        #    print("DID NOT RESOLVE %r"%ref)

    elif typ == 'BWB-deeplink-title-detailed':  # ref is a bwbid (hopefully)
        # this may need a little more inspection
        cvdr_extrefs_filtered[ref].append( text )

    elif typ == 'BWBurl':               # ref is a parsed dict
        bwbid = ref.get('bwb')
        cvdr_extrefs_filtered[bwbid].append( text )

    elif typ == 'jcilike-bwb':               # ref is a bwb id
        cvdr_extrefs_filtered[ref].append( text )

    elif typ in ('won1','won2'):
        cvdr_extrefs_filtered[ref].append( text )

    else:
        print( 'UNHANDLED TYPE %r (%r, %r)' %( typ, ref, text ) )

51455 extrefs


In [None]:
cvdr_extrefs_filtered

In [None]:
sortable = []
for bwbid, names in cvdr_extrefs_filtered.items():
    #print(bwbid, names)
    name_count = wetsuite.extras.word_cloud.count_normalized( names, min_word_length=1, stopwords=[], min_count=0.3 )
    sortable.append( (sum(name_count.values()), bwbid, name_count) )
sortable.sort(key=lambda x:x[0], reverse=True)
pprint.pprint(sortable)

## Finally do something with all that data

To review, we made
* `bwb_names_citeertitel`,
* `bwb_names_aanhaling`, 
* `bwb_names_extref`, 
* `cvdr_sourceref_names`,
* `cvdr_extrefs_filtered`

...each a dict from BWB-id to a list of names used to refer to it

### Report ambiguous names

In [None]:

conflict_data = collections.defaultdict(set) # name -> set of bwbids

for bwbid in bwb_names_citeertitel:
    for name in set( bwb_names_citeertitel[bwbid] ):
        #if len(name) > 5: # quick way to focus on abbreviations
        #    continue
        conflict_data[name].add(bwbid)

for bwbid in bwb_names_aanhaling:
    for name in set( bwb_names_aanhaling[bwbid] ):
        #if len(name) > 5:
        #    continue
        conflict_data[name].add(bwbid)

for bwbid in bwb_names_extref:
    for name in set( bwb_names_extref[bwbid] ):
        #if len(name) > 5:
        #    continue
        conflict_data[name].add(bwbid)

for bwbid in cvdr_extrefs_filtered:
    for name in set( cvdr_extrefs_filtered[bwbid] ):
        #if len(name) > 5:
        #    continue
        conflict_data[name].add(bwbid)

for name, bwbids in conflict_data.items():
    if len(bwbids)>1:
        print(" %r can refer to any of %r"%(name, sorted(bwbids)))

# A bunch of those seem to be mistakes. 
#   Probably one of the sections above lets through too much of a mess.
#   TODO: figure out where they are from (maybe have the structs carry through the origin XML?)
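
One possible way to approach that TODO, short of carrying the origin XML through, is to remember which source dict each name/id pair came from. The sketch below does the same inversion as the cell above, but in one loop over labelled sources. `find_ambiguous`, the labels, and the toy data are all made up for illustration.

```python
import collections

def find_ambiguous(sources):
    ''' sources: dict from source label -> (dict from id -> list of names)
        Returns: dict from name -> (dict from id -> set of source labels),
                 only for names that point at more than one id.
    '''
    seen = collections.defaultdict(lambda: collections.defaultdict(set))
    for label, names_by_id in sources.items():
        for bwbid, names in names_by_id.items():
            for name in set(names):
                seen[name][bwbid].add(label)
    return { name: dict(by_id)   for name, by_id in seen.items()   if len(by_id) > 1 }

# made-up toy data, not real identifiers
toy_sources = {
    'citeertitel': {'BWBR0000001': ['Voorbeeldwet'],  'BWBR0000002': ['Andere wet']},
    'cvdr_extref': {'BWBR0000002': ['Voorbeeldwet', 'Andere wet']},
}
ambiguous = find_ambiguous(toy_sources)
for name, by_id in ambiguous.items():
    print( '%r can refer to any of %r'%(name, sorted(by_id)) )
```

With the labels attached, a name that is only ambiguous because of one messy source stands out immediately.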

### Merge all useful bits

In [52]:
merged_data = [] # data for a name-searching webpage: 
# list of   [BWBR, [self names], [other names] ]

merged_ids = set()
merged_ids.update( bwb_names_citeertitel )
merged_ids.update( bwb_names_aanhaling )
merged_ids.update( bwb_names_extref )

for bwbid in sorted(merged_ids, reverse=True):
    self_names  = []
    other_names = []
    
    if bwbid in bwb_names_citeertitel:
        for name in bwb_names_citeertitel[bwbid]:
            if name not in self_names:
                self_names.append( name )

    if bwbid in bwb_names_aanhaling:
        for name in bwb_names_aanhaling[bwbid]:
            if name not in self_names:
                self_names.append( name )

    if bwbid in cvdr_sourceref_names:
        for name in cvdr_sourceref_names[bwbid]:
            if name not in other_names:
                other_names.append( name )


    # the next two are messier sources, so we try to be stricter about what we take from them

    if bwbid in bwb_names_extref:
        name_count = wetsuite.extras.word_cloud.count_normalized( bwb_names_extref[bwbid], normalize_func=lambda s:s.lower(), min_word_length=1, stopwords=[], min_count=0.001 )
        for name, count in name_count.items():
            #if count >= 3:
            #print('ADD   %-12s  %4s   %r' %( bwbid, count, name ))
            if name not in self_names and name not in other_names:
                other_names.append(name)

    if bwbid in cvdr_extrefs_filtered:
        name_count = wetsuite.extras.word_cloud.count_normalized( cvdr_extrefs_filtered[bwbid], normalize_func=lambda s:s.lower(), min_word_length=1, stopwords=[], min_count=0.001 )
        for name, count in name_count.items():
            #if count >= 3:
            #print('ADD   %-12s  %4s   %r' %( bwbid, count, name ))
            if name not in self_names and name not in other_names:
                other_names.append(name)


    merged_data.append([ bwbid, self_names, other_names ] )
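
The repeated "append if not already in the list" pattern above can be written more compactly with `dict.fromkeys`, which keeps first-seen order and drops duplicates. This is only a sketch of that one idea on made-up names: `unique_in_order` is a hypothetical helper, and it leaves out the `count_normalized` step that the merge above applies to the messier sources.

```python
def unique_in_order(*name_lists, exclude=()):
    ''' Concatenate the given lists, keep the first occurrence of each name,
        and leave out anything listed in `exclude`.
    '''
    excluded = set(exclude)
    merged = dict.fromkeys( name   for names in name_lists   for name in names )
    return [ name   for name in merged   if name not in excluded ]

# made-up toy names
self_names  = unique_in_order( ['Voorbeeldwet', 'Vw'], ['Vw', 'Voorbeeldwet 2009'] )
other_names = unique_in_order( ['vw', 'Vw', 'de voorbeeldwet'], exclude=self_names )
print( self_names )
print( other_names )
```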

In [57]:
# some examples as sanity check
pprint.pprint( random.sample( merged_data, 7 ) )
#pprint.pprint( merged_data )

[['BWBR0002194',
  ['Besluit instelling Staatsdiploma leraar stenografie'],
  ['examen ter verkrijging van dat diploma']],
 ['BWBR0018384',
  ['Verordening HBAG bevoegdheden en werkwijze HBAG 2005'],
  ['Verordening HBAG bevoegdheden en werkwijze organen en secretaris 2005']],
 ['BWBR0025800',
  ['Beleidsregel storing door het gewenste signaal van radiozendapparaten'],
  []],
 ['BWBR0012645',
  ['Besluit tegemoetkoming onderwijsbijdrage en schoolkosten'],
  ['BTOS']],
 ['BWBR0025156',
  ['Deelregeling Nederlands Popmuziek Plan 2009–2010 van het Nederlands Fonds '
   'voor Podiumkunsten+'],
  []],
 ['BWBR0007044',
  ['Besluit overige niet-meldingplichtige gevallen bodemsanering'],
  ['Besluit overige niet-meldingsplichtige gevallen bodemsanering']],
 ['BWBR0047184',
  ['Regeling screenings- en testinstrumenten lwoo en pro schooljaar 2023–2024'],
  []]]


In [58]:

with open('/var/www/default/wetnames.json','w') as wd:
    wd.write( '%s'%json.dumps(merged_data) )

with open('/var/www/default/wetnames.js','w') as wd:
    wd.write( 'namedata = %s;'%json.dumps(merged_data) )    
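
A quick sanity check on the written file is to read it back and compare. The sketch below does that round trip in a temporary directory on made-up data, so it does not touch the web directory used above. The `ensure_ascii=False` and explicit `encoding` are a choice made here to keep characters such as the en dash in year ranges readable in the file, not something the cell above does.

```python
import json, os, tempfile

# made-up toy data in the same [id, [self names], [other names]] shape
merged_example = [ ['BWBR0000001', ['Voorbeeldwet 2009–2010'], ['Vw']] ]

with tempfile.TemporaryDirectory() as tmpdir:
    json_path = os.path.join(tmpdir, 'wetnames.json')

    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(merged_example, f, ensure_ascii=False)

    with open(json_path, encoding='utf-8') as f:
        loaded = json.load(f)

round_trip_ok = (loaded == merged_example)
print( round_trip_ok )
```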