<a href="https://colab.research.google.com/github/scarfboy/wetsuite-dev/blob/main/examples/datacollect_gemeentes_damocles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip --quiet install https://github.com/scarfboy/wetsuite-dev/archive/refs/heads/main.zip

# Goal

This is an exercise in finding specific policy per municipality, to be able to compare their content.

To start, we need a list of municipalities. 

In [17]:
import wetsuite.datasets
gem = wetsuite.datasets.load('gemeentes')
print( gem.description )

# you to get a random example, but in this case we're only interested in the names
# pprint.pprint( random.choice( gem.data ) )


    This is largely the more interesting fields from https://organisaties.overheid.nl/export/Gemeenten.csv
    augmented with RDF-like data like that under https://standaarden.overheid.nl/owms/terms/Leiden_(gemeente)

    
    .data is a list of dicts, one per gemeente (currently 344 of them). Keys in that dict include:
    
    'Namen' - a list of name variants. 
       Usually just the short name, and a longer one with "Gemeente " in front
       Sometimes with alternative names, e.g. ["Den Bosch", "Gemeente 's-Hertogenbosch", "'s-Hertogenbosch"]
       We have used these as "Match one of these" to search for gemeentebeleid per gemeente
   
    Descriptions like 'Aantal inwoners', 'Oppervlakte'
    
    Organisational relations like 
      - 'Bevat plaatsen'
      - 'Overlaps with', mentioning Provinces, Waterschappen
      - 'Service area of' - things like GGD, Police, Social services  (each item is a list because we tend to have a full name and an abbreviation)
      - 'Predecesso

## Search per municpality

The only thing of the above that we use is the names. 

The below does searches into the CVDR repository (see also the datacollect_koop_repos example for more introduction to the repositories).


We are considering making a friendlier way of doing this, that lies somewhere between "doing 344 separate queries on a website" and "let's teach you python (and leave you to weird data errors)".

If we want to use SRU, let's remind ourselves of the indices we can search in:

In [20]:
import pprint
cvdr = wetsuite.datacollect.koop_repositories.CVDR()
pprint.pprint( cvdr.explain_parsed() )

{'database/numRecs': '262919',
 'description': 'Gemeenschappelijke zoekdienst van overheid.nl voor Centrale '
                'Voorziening Decentrale Regelgeving',
 'explain_url': 'http://zoekservice.overheid.nl/sru/Search?&version=1.2&x-connection=cvdr&operation=explain',
 'extent': 'Lokale regelingen of the Dutch government',
 'host': 'zoekservice.overheid.nl',
 'indices': [('dcterms', 'identifier'),
             ('dcterms', 'title'),
             ('dcterms', 'language'),
             ('dcterms', 'creator'),
             ('dcterms', 'modified'),
             ('dcterms', 'isFormatOf'),
             ('dcterms', 'alternative'),
             ('dcterms', 'source'),
             ('dcterms', 'isRatifiedBy'),
             ('dcterms', 'subject'),
             ('dcterms', 'issued'),
             (None, 'workid'),
             (None, 'bronformaat'),
             (None, 'organisatieType'),
             (None, 'sorteerTitel'),
             (None, 'gemeente'),
             (None, 'provincie'),
   

As [the manual](https://data.overheid.nl/sites/default/files/dataset/d0cca537-44ea-48cf-9880-fa21e1a7058f/resources/Handleiding%2BSRU%2B2.0.pdf) mentions, `dt.spatial` refers to where it applies, `dt.creator` refers to who is responsible for creating the document, though for this case we assume they are the same.


Say that our interest is the policy around a municipality removing people from their home based on the presence of drugs. This is nicknamed the damocles policy. However, that name will also include policies against coffee shops, and may be mentioned in a municipality's general local regulation, we should expect a few results too many.

Also, not all the policies may mention this name, so and we picked some words that are very likely to appear in the body of texts dealing with this.

Those considerations, and the query below, are the result of a few iterations of doing searches and seeing what comes out.

In [15]:
import datetime
import dateutil.parser
import wetsuite.datacollect.koop_repositories
import wetsuite.helpers.etree as etree
import wetsuite.helpers.net


# the weird offset is trying to find Den Haag with its other name, to check that the search is not tripping over that
#  ...and to illustrate there isn't actually a good hit for some - Den Haag does actually have a policy, but not in CVDR
for gemeente_dict in gem.data[65:67]: # -35:-30  
    ## Construct a complex-looking query to mean:
    #   (match gemeente by one of its names)  AND (  mentions 'damocles'   OR   (mentions drugs or the opiumwet  AND  mentions words you likely see around putting people out of their house)
    # This is a practical consideration: we _will_ get too many results, but at least what we want is probably in there,  and filtering out can be easier than searching again 
    query_gemeente_names = ' OR '.join( '(creator = "%s")'%naam   for naam in gemeente_dict['Namen'] )
    query = '(%s) AND ( (body any "damoclesbeleid damocles")  OR  (body any "drugs softdrugs harddrugs handelshoeveelheid opiumwet 13b") AND (body any "sluiting herstelsanctie bestuursdwang"))'%( 
        query_gemeente_names
    )

    ## search and fetch only first page, just so that num_records is filled in to report
    cvdr = wetsuite.datacollect.koop_repositories.CVDR()
    cvdr.search_retrieve( query ) 
    print( "\n == %3d  hits for   %s == "%(cvdr.num_records(), ' / '.join(gemeente_dict['Namen'])) )

    ## search and fetch all, summarizing each record as we go
    def show_brief( record ): 
        meta = wetsuite.datacollect.koop_repositories.cvdr_meta( record, flatten=True )
        uit = meta.get('uitwerkingtredingDatum', None)  # ignore things that are expired, because they were probably replaced by something else also in the results  (note: the repo's expiry data doesn't look 100% correct)

        # policies that are altered get a new document in here.
        #  So if the last day is before today, then there should be another in the search results that _does_ apply.
        if uit not in (None,'')  and  (dateutil.parser.parse(uit.split('+')[0]).date() < datetime.date.today()): # so don't show it. Yes, this can also be done in the query (TODO: remove the need for that +)
            pass
        else:
            print( "  %15s  %10s..%-10s  %s"%( meta.get('identifier'), meta.get('inwerkingtredingDatum'),  meta.get('uitwerkingtredingDatum',''),  meta.get('title')) )
            #print('    URL: %s'%meta.get('publicatieurl_xml') )     # 'publicatieurl_xml' points to text in structured XML.  There is also 'publicatieurl_xhtml' (more browser-presentable),  and 'preferred_url' (a link to the page that lokaleregelgeving.overheid.nl would also send you to)
            
            if False: # If you wanted to extract the text, this would be a (very crude) start:
              xml_data = wetsuite.helpers.net.download( meta.get('publicatieurl_xml') )
              tree = etree.strip_namespace( etree.fromstring( xml_data ) )
              for al in tree.find('body/regeling/regeling-tekst').getiterator('al'):
                  print(  ''.join( etree.all_text_fragments(al) )  )


    cvdr.search_retrieve_many( query, callback=show_brief ) # all results, and show brief summary, mainly just titles



 ==  56  hits for   Den Haag / Gemeente Den Haag / 's-Gravenhage == 
     CVDR645629_1  2020-11-10..            Beleidsregel toezicht bedrijfsmatige activiteiten 2020
     CVDR674619_1  2022-03-24..            Beleidsregel bestuurlijke boete, sluiting en beheerovername op grond van de Woningwet Den Haag 2022
     CVDR690428_1  2023-01-01..            Beleidsregel beoordeling levensgedrag Den Haag 2023
     CVDR11313_53  2022-12-01..            Algemene plaatselijke verordening voor de gemeente Den Haag

 ==  24  hits for   Den Helder / Gemeente Den Helder == 
     CVDR657606_1  2021-05-15..            Beleidsregel van de burgemeester van de gemeente Den Helder, houdende regels over sluiting van lokalen en woningen op grond van artikel 13b Opiumwet (Damoclesbeleid Den Helder 2021)
     CVDR674768_1  2022-03-26..            Beleidsregels van de burgemeester van de gemeente Den Helder, houdende regels omtrent coffeeshops (Beleid coffeeshops Den Helder 2022)
     CVDR627607_1  2019-09-20.