<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/extras/extras_find_abbrevs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
!pip3 install -U --no-cache-dir --quiet https://github.com/knobs-dials/wetsuite-dev/archive/refs/heads/main.zip

# Purpose of this notebook

Try to extract acronyms, and the words they stand for, from text.
Not so much as an exercise in itself, but because they can be a good source of entity names for later training.

Names of laws in particular might be interesting, though we should point out up front that this is not a complete, or even a particularly good, way to get those.

There is a `wetsuite.helpers.patterns.abbrev_find()` (the function used in the cells below) that mainly just looks for text like "Word Combination (WC)" and a few variants.
That won't catch everything, but given enough data, you'll get fairly consistent results.
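
To give an idea of what "look for text like Word Combination (WC)" means, here is a minimal sketch. It is not wetsuite's implementation: `find_initialisms` is a hypothetical name, and it only handles the strictest case, where each letter of the bracketed abbreviation is the first letter of one of the words directly before the bracket.

```python
import re

def find_initialisms(text):
    """Return a list of (abbrev, words) for bracketed abbreviations whose
    letters equal the initials of the words directly before the bracket.
    Hypothetical helper for illustration, not part of wetsuite."""
    results = []
    for match in re.finditer(r'\(([A-Za-z]{2,})\)', text):
        abbrev = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(abbrev):]   # as many words as the abbreviation has letters
        if len(candidate) == len(abbrev) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, abbrev)
        ):
            results.append((abbrev, candidate))
    return results

print(find_initialisms('Autoriteit Financiële Markten (AFM)'))
# [('AFM', ['Autoriteit', 'Financiële', 'Markten'])]
print(find_initialisms('Autoriteit Consument en Markt (ACM)'))
# []   (the word 'en' has no letter in the abbreviation)
```

The second call shows why a strict initials match will miss names that contain small function words.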

### Basic tests on unreasonably clean wordlist data
...namely existing lists of acronyms, which at the same time are a good way to find out what kind of cases we would miss.

In [1]:
import bs4

import wetsuite.helpers.net
import wetsuite.helpers.koop_parse
import wetsuite.helpers.patterns
import wetsuite.helpers.etree
import wetsuite.datasets

In [2]:
# Unreasonably clean data,  that also contains some less usual cases so we can report what we might eventually want to deal with
html = wetsuite.helpers.net.download('https://organisaties.overheid.nl/Zelfstandige_bestuursorganen/') 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for link in soup.select('.content .list--linked li a'):   # some scraping magic we might explain elsewhere 
    # we are interested in the link text:
    found = False
    for ab, words in wetsuite.helpers.patterns.abbrev_find( link.text ):
        print( 'FOUND  %s = %s'%( ab, words ) )
        found = True

    # Things we didn't find - more creative things that we _might_ want to consider
    if '(' in link.text and not found:    # (assuming bracket indicates there is an explained abbreviation in that link text)
        print( "MISS  ", link.text )

MISS   Aangewezen/aangemelde instanties (dezelfde) ex art. 1a.5.1 Vuurwerkbesluit
MISS   Airport Coordination Netherlands (ACNL)
MISS   Autoriteit Consument en Markt (ACM)
FOUND  AFM = ['Autoriteit', 'Financiële', 'Markten']
MISS   Autoriteit Nucleaire Veiligheid en Stralingsbescherming (ANVS)
FOUND  AP = ['Autoriteit', 'Persoonsgegevens']
FOUND  BA = ['Bureau', 'Architectenregister']
FOUND  BBL = ['Bureau', 'Beheer', 'Landbouwgronden']
FOUND  BFT = ['Bureau', 'Financieel', 'Toezicht']
FOUND  CBR = ['Centraal', 'Bureau', 'Rijvaardigheidsbewijzen']
MISS   Centraal Orgaan opvang asielzoekers (COA)
FOUND  CCD = ['Centrale', 'Commissie', 'Dierproeven']
FOUND  CCMO = ['Centrale', 'Commissie', 'Mensgebonden', 'Onderzoek']
FOUND  CIZ = ['Centrum', 'Indicatiestelling', 'Zorg']
MISS   College gerechtelijk deskundigen (NRGD)
FOUND  CSZ = ['College', 'sanering', 'zorginstellingen']
MISS   College ter Beoordeling van Geneesmiddelen (CBG)
MISS   College van toezicht collectieve beheersorganisaties 
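
Many of the misses above contain short function words ('en', 'van', 'ter') that have no letter in the abbreviation. One thing we might consider is allowing such words to be skipped. The following is only a sketch of that idea: the function name and the stop list are assumptions made for illustration, and this is not what `abbrev_find` does.

```python
import re

# Hypothetical stop list: short Dutch function words that abbreviations often leave out.
SKIP_WORDS = {'en', 'van', 'de', 'het', 'voor', 'ter', 'der'}

def find_initialisms_skipping(text, skip_words=SKIP_WORDS):
    """Like a strict initials match, but words in skip_words may sit
    between the matched words. Illustration only, not part of wetsuite."""
    results = []
    for match in re.finditer(r'\(([A-Za-z]{2,})\)', text):
        abbrev = match.group(1)
        words = text[:match.start()].split()
        letters = list(abbrev.lower())
        picked = []
        # Walk backwards from the bracket, consuming abbreviation letters from the end.
        for word in reversed(words):
            if not letters:
                break
            if word[0].lower() == letters[-1]:
                picked.append(word)
                letters.pop()
            elif word.lower() in skip_words:
                picked.append(word)
            else:
                break
        if not letters:   # every letter was accounted for
            results.append((abbrev, picked[::-1]))
    return results

print(find_initialisms_skipping('Autoriteit Consument en Markt (ACM)'))
# [('ACM', ['Autoriteit', 'Consument', 'en', 'Markt'])]
print(find_initialisms_skipping('Centraal Orgaan opvang asielzoekers (COA)'))
# []   (still missed: the abbreviation skips a content word)
```

A longer stop list catches more names but also increases the chance of matching unrelated words, so this is a trade-off rather than a fix.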

In [3]:
# Similar idea, different site
html = wetsuite.helpers.net.download('https://publications.europa.eu/code/nl/nl-5000400.htm') 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for tr in soup.select('table.definitionsTable tr'):  
    tds = tr.findAll('td')

    # TODO: deal with the way it mentions multiple definitions
    text = '%s (%s)'%(tds[0].text.strip(), tds[1].text.strip())  # pretend we don't know this is good data and just put it next to each other

    found = False
    for ab, words in wetsuite.helpers.patterns.abbrev_find( text ):
        print( 'FOUND  %s = %s'%( ab, words ) )
        found = True

    # Things we didn't find - more creative things that we _might_ want to consider
    if '(' in text and not found:    # (assuming bracket indicates there is an explained abbreviation in that text)
        print( "MISS  ", text )



MISS   ABH (Agentschap voor Buitenlandse Handel (voorheen BDBH (Belgische Dienst voor Buitenlandse Handel)))
MISS   ABVV (Algemeen Belgisch Vakverbond)
MISS   ACS (staten in Afrika, het Caribisch gebied en de Stille Oceaan)
FOUND  ACV = ['Algemeen', 'Christelijk', 'Vakverbond']
MISS   ADB (1.
Afrikaanse Ontwikkelingsbank
(African Development Bank)
2.
Arabische Ontwikkelingsbank
(Arab Development Bank)
3.
Aziatische Ontwikkelingsbank
(Asian Development Bank))
MISS   ADN (Europese Overeenkomst betreffende het internationale vervoer van gevaarlijke goederen over de binnenwateren)
MISS   ADR (Europese Overeenkomst betreffende het internationale vervoer van gevaarlijke goederen over de weg)
MISS   Afnor (Frans Normalisatie-instituut
(Association française de normalisation))
MISS   ALO (algemene leningsovereenkomst)
MISS   Altener II (meerjarenprogramma ter bevordering van hernieuwbare energiebronnen in de Gemeenschap)
MISS   AKE (Agentschap voor Kernenergie (OESO))
FOUND  ANP = ['Algemeen',

In [4]:
# Similar idea, different site
html = wetsuite.helpers.net.download( 'https://www.rijksfinancien.nl/memorie-van-toelichting/2021/OWB/XIII/onderdeel/644956' ) 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for tr in soup.select('.kio2 tr'):  

    tds = tr.findAll('td')

    text = '%s (%s)'%(tds[0].text.strip(), tds[1].text.strip())  # pretend we don't know this is good data and just put it next to each other

    found = False
    for ab, words in wetsuite.helpers.patterns.abbrev_find( text ):
        print( 'FOUND  %s = %s'%( ab, words ) )
        found = True

    # Things we didn't find - more creative things that we _might_ want to consider
    if '(' in text and not found:    # (assuming bracket indicates there is an explained abbreviation in that text)
        print( "MISS  ", text )



MISS    ()
MISS   ACM (Autoriteit Consument en Markt)
FOUND  ACT = ['Accelerating', 'CCS', 'Technologies']
MISS   ACVG (Adviescollege Veiligheid Groningen)
FOUND  ANBI = ['Algemeen', 'nut', 'beogende', 'instellingen']
FOUND  AT = ['Agentschap', 'Telecom']
FOUND  ATR = ['Adviescollege', 'toetsing', 'regeldruk']
MISS   AWTI (Adviesraad voor Wetenschap, Technologie en Innovatie)
MISS   BBE (Biobased Economy)
FOUND  BBP = ['Bruto', 'Binnenlands', 'Product']
MISS   BES (Bonaire, Sint Eustatius, Saba)
MISS   BIS (Basisinfrastructuur voor cultuur)
MISS   BIPM (Bureau International des Poids en Mesures)
MISS   BMKB (Borgstellingsregeling Midden en Kleinbedrijf)
FOUND  BNP = ['Bruto', 'Nationaal', 'Product']
FOUND  BOM = ['Brabantse', 'Ontwikkelings', 'Maatschappij']
MISS   BPM (Belasting van personenauto's en motorrijwielen)
MISS   BTW (Belasting over de toegevoegde waarde)
MISS   BZ (Ministerie van Buitenlandse Zaken)
MISS   BZK (Ministerie van Binnenlandse Zaken en Koninkrijksrelaties)
MISS 

In [5]:
# Similar idea, different site
html = wetsuite.helpers.net.download('https://juridisch-woordenboek.nl/afkortingen') 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for tr in soup.select('table#afkortingen tbody tr'):
    #print(tr)
    tds = tr.findAll('td')

    text = '%s (%s)'%(tds[0].text.strip(), tds[1].text.strip())  # pretend we don't know this is good data and just put it next to each other

    found = False
    for ab, words in wetsuite.helpers.patterns.abbrev_find( text ):
        print( 'FOUND  %s = %s'%( ab, words ) )
        found = True

    # Things we didn't find - more creative things that we _might_ want to consider
    if '(' in text and not found:    # (assuming bracket indicates there is an explained abbreviation in that text)
        print( "MISS  ", text )


FOUND  AA = ['Ars', 'Aequi']
FOUND  AA = ['Accountant', 'Administratieconsulent']
FOUND  AA = ['Advertising', 'Association']
MISS   a.a. (ad acta, bij de akten (wegleggen))
FOUND  AAA = ['American', 'Arbitration', 'Association']
MISS   AAC (Advies- en Arbitragecommissie)
MISS   AAf (Algemeen Arbeidsongeschiktheidsfonds)
MISS   AAR (Algemeen ambtenarenreglement)
MISS   AAR (Algemene Aanwijzingen voor de Rijksdienst)
FOUND  AAV = ['Algemene', 'administratieve', 'voorschriften']
MISS   AAW (Algemene Arbeidsongeschiktheidswet)
FOUND  AB = ['Administratiefrechterlijke', 'Beslissingen']
MISS   AB (Administratieve en Rechterlijke Beslissingen)
MISS   AB (Nederlandse Jurisprudentie Administratiefrechtelijke Beslissingen (sinds 1971))
MISS   AB (Wet Algemene Bepalingen)
FOUND  ABA = ['American', 'Bar', 'Association']
MISS   ABAR (Algemene bepalingen van administratief recht)
MISS   abbb (algemene beginselen van behoorlijk bestuur)
FOUND  ABP = ['Algemeen', 'Burgerlijk', 'Pensioenfonds']
MISS   

In [7]:
# Similar idea, different site
html = wetsuite.helpers.net.download('https://www.eur.nl/esl/campus/sanders-law-library/juridische-afkortingen') 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for tr in soup.select('div.accordion table tr'):
    tds = tr.findAll('td')
    if len(tds)!=2:
        print("SKIP %s"%tr)
    else:
        text = '%s (%s)'%(tds[1].text.strip(), tds[0].text.strip())  # pretend we don't know this is good data and just put it next to each other

        found = False
        for ab, words in wetsuite.helpers.patterns.abbrev_find( text ):
            print( 'FOUND  %s = %s'%( ab, words ) )
            found = True

        # Things we didn't find - more creative things that we _might_ want to consider
        if '(' in text and not found:    # (assuming bracket indicates there is an explained abbreviation in that text)
            print( "MISS  ", text )


SKIP <tr><th>Afkorting</th><th>Betekenis</th></tr>
MISS   anno, in het jaar (a°)
MISS   Algemene bepalingen (A)
MISS   Antwoord der regering naar aanleiding van het verslag (A)
MISS   Arbeid; afzonderlijk verschenen van 1946-1953 (A)
MISS   Atlantic Reporter second series (A.2d.)
MISS   Accountancy en Bedrijfskunde (A&B)
MISS   Aansprakelijkheid en Verzekering (A&V)
MISS   Ars Aequi. Juridisch studentenblad (AA of A.A. of AAe)
MISS   Accountant-Administratieconsulent (AAC)
FOUND  AA = ['Advertising', 'Association']
MISS   ad acta, bij de akten (wegleggen) (a.a)
FOUND  AAA = ['American', 'Arbitration', 'Association']
MISS   Algemeen aanduidingenbesluit (AAB)
MISS   Algemene aannemingsvoorwaarden voor bedrijfsgebouwen in de landbouw (AABL)
MISS   Advies- en Arbitragecommissie (AAC)
MISS   Ars Aequi. Juridisch studentenblad (A Ae)
MISS   Algemeen arbeidsongeschiktheidsfonds (AAF of Aaf)
MISS   Adem-alcoholgehalte (AAG)
MISS   Ars Aequi jurisprudentiebundel (AA-Jur)
MISS   Algemene aannemi

### Run on a bunch of free-form document text

And let's try to make the results cleaner
by only reporting explanations that appear in multiple documents,
and counting how often each appears.
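
As a sketch of what "counting how often each appears" could look like: the function below counts, for each abbreviation, in how many documents each explanation occurs. This is an assumption about the general idea, not a copy of the library's `count_results`; the name `count_per_document` is made up for illustration.

```python
from collections import defaultdict

def count_per_document(per_doc_results):
    """per_doc_results: a list with, per document, a list of (abbrev, words).
    Returns {abbrev: {tuple_of_words: number_of_documents}}.
    Hypothetical helper for illustration."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_results in per_doc_results:
        # A set, so an explanation repeated within one document counts once.
        seen = {(abbrev, tuple(words)) for abbrev, words in doc_results}
        for abbrev, words in seen:
            counts[abbrev][words] += 1
    return {abbrev: dict(explanations) for abbrev, explanations in counts.items()}

docs = [
    [('Wabo', ['Wet', 'algemene', 'bepalingen', 'omgevingsrecht'])],
    [('Wro', ['Wet', 'ruimtelijke', 'ordening']),
     ('Wabo', ['Wet', 'algemene', 'bepalingen', 'omgevingsrecht'])],
    [('APV', ['Algemene', 'plaatselijke', 'verordening']),
     ('APV', ['Algemene', 'plaatselijke', 'verordening'])],
]
counts = count_per_document(docs)
print(counts['Wabo'])   # {('Wet', 'algemene', 'bepalingen', 'omgevingsrecht'): 2}
print(counts['APV'])    # {('Algemene', 'plaatselijke', 'verordening'): 1}
```

Counting documents rather than raw occurrences means that one document repeating an unusual explanation many times cannot push it over a minimum-count threshold on its own.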

In [19]:
import random
cvdr_text = wetsuite.datasets.load('cvdr-mostrecent-text')

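cvdr_urls = list( cvdr_text.data.keys() )  # a list, because random.sample needs a sequence (dict keys raise a TypeError in Python 3.11+)
cvdr_urls_subset = random.sample(cvdr_urls, min(20000, len(cvdr_urls))) # subset during debug

per_doc_results = [] # a list with, per document, what abbrev_find returned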

for cvdr_url in cvdr_urls_subset: 
    text = cvdr_text.data.get( cvdr_url )
    
    results = wetsuite.helpers.patterns.abbrev_find(text)
    if len(results) > 0:
        per_doc_results.append( results )    

#per_doc_results

In [20]:
per_doc_results

[[('KBR', ['Kenniscentrum', 'Bestuursdienst', 'Rotterdam'])],
 [('Wabo', ['Wet', 'algemene', 'bepalingen', 'omgevingsrecht'])],
 [('Wro', ['Wet', 'ruimtelijke', 'ordening']),
  ('Wabo', ['Wet', 'algemene', 'bepalingen', 'omgevingsrecht'])],
 [('wo', ['wetenschappelijk', 'onderwijs'])],
 [('LRK', ['Landelijk', 'Register', 'Kinderopvang'])],
 [('AVOI', ['Algemene', 'verordening', 'ondergrondse', 'infrastructuren']),
  ('APV', ['Algemene', 'plaatselijke', 'verordening']),
  ('WAM', ['Wet', 'aansprakelijkheidsverzekering', 'motorvoertuigen']),
  ('APV', ['Algemene', 'plaatselijke', 'verordening']),
  ('WABO', ['Wet', 'algemene', 'bepalingen', 'omgevingsrecht']),
  ('NPR', ['Nederlandse', 'Praktijk', 'Richtlijnen'])],
 [('Awb', ['Algemene', 'Wet', 'bestuursrecht']),
  ('GR', ['gemeenschappelijke', 'regelingen'])],
 [('NTC', ['Nationaal', 'Topsport', 'Centrum']),
  ('RTC', ['Regionaal', 'Trainings', 'Centrum']),
  ('EK', ['Europese', 'kampioenschappen']),
  ('OS', ['Olympische', 'Spelen'])],

In [21]:
### cleaning - report only things that were explained the same way in two or more documents
min_doc_occur = 2

report = []
abbrev_count = wetsuite.helpers.patterns.count_results( per_doc_results )
for abbrev, words_count in abbrev_count.items():
    for words, count in words_count.items():
        if count >= min_doc_occur:   # the point of that structure:  being able to ignore rarer explanations
            report.append( (abbrev, count, ' '.join(words) ) )

In [22]:
#report.sort(key=lambda tup: -tup[1]) # sort by count descending
report.sort(key=lambda tup: (tup[0], -tup[1])) # sort/group by abbreviation alphabetically, then by count descending
for abbrev, count, expl in report:
    print( '%10s   %3d:   %s'%( abbrev, count, expl ) )

       AAB     2:   Adviescommissie Agrarische Bouwaanvragen
        AB     9:   Algemeen Bestuur
        AB     2:   Activerende Begeleiding
        AB     2:   algemeen bestuur
       ABP     6:   Algemene Burgerlijke Pensioenwet
       ABU     2:   Algemene Bond Uitzendondernemingen
       ABZ     2:   Algemeen Bestuurlijke Zaken
      ABdK     2:   Actief Bodembeheer de Kempen
       ADL    12:   algemene dagelijkse levensverrichtingen
       ADL     8:   Algemene dagelijkse levensverrichtingen
       ADL     7:   algemeen dagelijkse levensverrichtingen
       ADL     5:   Algemene Dagelijkse Levensverrichtingen
       ADL     3:   Algemeen Dagelijkse Levensverrichtingen
       ADL     3:   activiteiten dagelijks leven
       ADL     2:   Algemeen dagelijkse Levensbehoeften
       ADL     2:   Algemeen Dagelijkse Levensbehoeften
       AED     4:   Automatische Externe Defibrillator
       AFM    17:   Autoriteit Financiële Markten
       AFM     2:   Autoriteit financiële Markten
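

The ADL and AFM rows above differ only in capitalisation. If we wanted to merge such variants, one option is to lowercase the explanation before adding up the counts. A minimal sketch, where `merge_case_variants` is a hypothetical helper and the rows are copied from the output above:

```python
from collections import Counter

def merge_case_variants(report):
    """report: list of (abbrev, count, explanation).
    Returns a Counter keyed by (abbrev, lowercased explanation).
    Hypothetical helper for illustration."""
    merged = Counter()
    for abbrev, count, explanation in report:
        merged[(abbrev, explanation.lower())] += count
    return merged

rows = [
    ('ADL', 12, 'algemene dagelijkse levensverrichtingen'),
    ('ADL',  8, 'Algemene dagelijkse levensverrichtingen'),
    ('ADL',  5, 'Algemene Dagelijkse Levensverrichtingen'),
    ('AFM', 17, 'Autoriteit Financiële Markten'),
    ('AFM',  2, 'Autoriteit financiële Markten'),
]
for (abbrev, explanation), count in merge_case_variants(rows).most_common():
    print('%10s   %3d:   %s' % (abbrev, count, explanation))
```

Note that the summed numbers are no longer strict document counts: a document that contains two capitalisation variants would be counted twice. Merging before counting would avoid that.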


In [23]:
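# Similar idea, now adding BWB
#   note: this appends to the same per_doc_results, so everything counted below covers CVDR and BWB together

bwb_text = wetsuite.datasets.load('bwb-mostrecent-text')
bwb_urls = list( bwb_text.data.keys() )  # a list, because random.sample needs a sequence
bwb_urls_subset = random.sample(bwb_urls, min(10000, len(bwb_urls)))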

for bwb_url in bwb_urls_subset:
    text = bwb_text.data.get( bwb_url )
    results = wetsuite.helpers.patterns.abbrev_find(text)
    if len(results) > 0:
        per_doc_results.append( results )    

In [24]:
# show one more intermediate step this time
per_doc_results

[[('KBR', ['Kenniscentrum', 'Bestuursdienst', 'Rotterdam'])],
 [('Wabo', ['Wet', 'algemene', 'bepalingen', 'omgevingsrecht'])],
 [('Wro', ['Wet', 'ruimtelijke', 'ordening']),
  ('Wabo', ['Wet', 'algemene', 'bepalingen', 'omgevingsrecht'])],
 [('wo', ['wetenschappelijk', 'onderwijs'])],
 [('LRK', ['Landelijk', 'Register', 'Kinderopvang'])],
 [('AVOI', ['Algemene', 'verordening', 'ondergrondse', 'infrastructuren']),
  ('APV', ['Algemene', 'plaatselijke', 'verordening']),
  ('WAM', ['Wet', 'aansprakelijkheidsverzekering', 'motorvoertuigen']),
  ('APV', ['Algemene', 'plaatselijke', 'verordening']),
  ('WABO', ['Wet', 'algemene', 'bepalingen', 'omgevingsrecht']),
  ('NPR', ['Nederlandse', 'Praktijk', 'Richtlijnen'])],
 [('Awb', ['Algemene', 'Wet', 'bestuursrecht']),
  ('GR', ['gemeenschappelijke', 'regelingen'])],
 [('NTC', ['Nationaal', 'Topsport', 'Centrum']),
  ('RTC', ['Regionaal', 'Trainings', 'Centrum']),
  ('EK', ['Europese', 'kampioenschappen']),
  ('OS', ['Olympische', 'Spelen'])],

In [33]:
### test the cleaning - report only things that were explained the same way in two or more documents
min_doc_occur = 2

abbrev_count = wetsuite.helpers.patterns.count_results( per_doc_results )

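def print_sorted_abbrev_counts(abbrev_count, min_doc_occur=min_doc_occur):
    ''' Takes the nested count structure that count_results returns (abbrev -> words -> count),
        prints one line per explanation that occurs at least min_doc_occur times, sorted by count descending.
    '''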
    report = []
    for abbrev, words_count in abbrev_count.items():
        for words, count in words_count.items():
            if count >= min_doc_occur:   # the point of that structure: being able to ignore rarer explanations
                report.append( (abbrev, count, ' '.join(words) ) )

    report.sort(key=lambda tup: -tup[1]) # sort by count descending
    #report.sort(key=lambda tup: (tup[0], -tup[1])) # sort/group by abbreviation alphabetically, then by count descending
    for abbrev, count, expl in report:
        print( '%80s   %3d:   %s'%( abbrev, count, expl ) )

print_sorted_abbrev_counts( abbrev_count )

                                                                             Awb   710:   Algemene wet bestuursrecht
                                                                            Wabo   178:   Wet algemene bepalingen omgevingsrecht
                                                                             Wlz   134:   Wet langdurige zorg
                                                                             Wmo   117:   Wet maatschappelijke ondersteuning
                                                                             APV   104:   Algemene Plaatselijke Verordening
                                                                              OM    79:   Openbaar Ministerie
                                                                             CAK    77:   Centraal Administratie Kantoor
                                                                             Wmg    66:   Wet marktordening gezondheidszorg
                                        

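### Experiment: look for 'hierna'
...as another clean source: 'hierna' ("hereafter") is how legal text tends to introduce a short name for something it has just named in full.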
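
A small sketch of the kind of match this is after. The sample sentence is made up, and the pattern is a simplified variation for illustration (it assumes the long name follows ' de ' or ' het ' and contains no full stop or comma).

```python
import re

# Hypothetical, simplified pattern: article, long name, then "(hierna: short name)".
HIERNA_PATTERN = re.compile(r'(?: de | het )([^.,(]+)\(hierna:? ([^)]+)\)')

def find_hierna_pairs(text):
    """Return a list of (short_name, long_name) pairs. Illustration only."""
    return [(short.strip(), long.strip())
            for long, short in HIERNA_PATTERN.findall(text)]

sample = 'Gelet op de Algemene wet bestuursrecht (hierna: Awb) besluit het college als volgt.'
print(find_hierna_pairs(sample))
# [('Awb', 'Algemene wet bestuursrecht')]
```

Anchoring on an article is crude: the long name starts at the last ' de ' or ' het ' that is followed by a comma-free, period-free run of text up to the bracket, which can pull in far more than the name itself.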

In [27]:
import re, pprint


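def text_nearby_all( needle, haystack, chars_before=40, chars_after=40 ):
    ' for each regex match of needle in haystack, returns (text before, uppercased match, text after) '
    ret = []
    for mob in re.finditer(needle, haystack):
        st, en = mob.start(0), mob.end(0)
        before_start = max(0, st-chars_before)  # avoid a negative index, which would slice from the end of the string
        ret.append( (haystack[before_start:st],  haystack[st:en].upper(),  haystack[en:en+chars_after]  ) )
    return ret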



results = []
def find_hierna(text):
    text_res = []
    if 'hierna' in text:
        #for before, match, after in text_nearby_all('hierna', text):
        #    print( 'MATCH ...%s[%s]%s...'%(before, match.upper(), after) )
        #continue
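        # excluding . and , from the long form is a quick and dirty way to stay within one sentence/phrase
        for match in re.findall( r'(?: de | het )([^.,\(]+)[\(](hierna[.]*?:? [^\)]+)[\)]', text ):
            long, short = match
            long = long.strip()
            text_res.append( [long, [short]] )  # note the order: the long form is the key here, the 'hierna' part is what gets counted under it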
    if len(text_res) > 0:
        results.append( text_res )

    

for cvdr_url in cvdr_urls: # cvdr_urls_subset 
    text = cvdr_text.data.get( cvdr_url )
    find_hierna(text)


for bwb_url in bwb_urls: # bwb_urls_subset
    text = bwb_text.data.get( bwb_url )
    find_hierna(text)



In [34]:
print( len(results) )
counts = wetsuite.helpers.patterns.count_results( results )
#pprint.pprint( counts )
print_sorted_abbrev_counts( counts )


4946
                                                      Algemene wet bestuursrecht   253:   hierna: Awb
                     inwerkingtreding van de Tijdelijke wet maatregelen covid-19   104:   hierna: TWM
            TWM is een nieuw hoofdstuk toegevoegd aan de Wet publieke gezondheid   104:   hierna: Wpg
                                          Leidraad Invordering 2008 van het Rijk   102:   hierna: Rijksleidraad
                                        Tijdelijke regeling maatregelen covid-19    75:   hierna: TRM
                                         Wet maatschappelijke ondersteuning 2015    73:   hierna: Wmo 2015
technisch onderhoud van het toepassingssysteem waarmee de gemeentelijke voorziening voor de basisregistratie personen wordt gevoerd    66:   hierna toepassingssysteem
bestuurlijke en – met toepassing van een budgetkorting – financiële decentralisatie naar gemeenten van een aantal taken uit de Algemene Wet Bijzondere Ziektekosten    64:   hierna: AWBZ
    salaris en 