# Purpose of this notebook

Provide _some_ answer to "so how do you detect interesting terms / phrases?"

Annoyingly, one has to respond with "that depends on what you mean by interesting phrases".


There are varied methods, some simple enough that you could implement them yourself in half an hour.
They will return some interesting fragments, so they seem to work...
...yet they often rest on assumptions that turn out not to match what you had in mind when you heard 'interesting'.

While each of these methods visibly does something useful,
each will also miss things you may have wanted it to find.
Those misses are invisible in the output, and it is often not clear why they happened.


Consider whether your goal was
- "what multi-word phrases appear in this document" 
- "what multi-word phrases make this document interesting" 
- "what multi-word phrases make this document different from others in a set" 
- "can we make lists of words" 
- match known phrases
- match phrases of a specific topic

These may seem like subtle variations, but they lead to different methods and different output.

Also, if you have not yet thought about what kind of phrases are more interesting, 
or why, then you can't expect a method to prefer those. 


So for the most part, the below is a start on just the first item in that list, 
to introduce some methods. More refined output will need more refined requirements on your part (and some more refined code).


For example, 
- **tf-idf** is more of an ingredient for larger analyses, search, and other things than a method by itself
  - combined with n-grams, it might tell you which combinations of words are more common than others, but still little about how they compare. 
  - so _by itself_ it's not useful for much more than assistance in making stoplists.
  - there's a [separate notebook that goes into its basics](methods_text_terms_tfidf.ipynb)

- with any kind of language parsing, you could look for patterns (such as adjectives followed by a noun). 
  - the output may be clean, but it's unclear what you might miss
  - a basic example follows below

- **Collocation analysis** often refers to a probability-based "does this combination of words appear more often together than its parts would suggest?", which is still simple and works a little better.
  - it might pick up "eigen gebruik", "echtgenoot of geregistreerde partner", "werk en inkomen naar arbeidsvermogen"
  - ...but also just fragments that happen, well, because sentences have structure ("heeft gedaan", "verplichtingen uit"), or have been ripped from their context ("tijdstip zal", "KONING DER").
  - so 'more common together' turns out to not be quite enough for clean output 
  - there's a [separate notebook that goes into its basics](methods_text_terms_collocations.ipynb)

- topic modelling goes further, asking "what sets of words or phrases seem to join and distinguish documents in a set"
  - this adds a goal that sits further down the list above (what makes documents differ from each other).
  - there's a [separate notebook that goes into its basics](methods_text_topic_modeling.ipynb)
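
The collocation question above ("does this combination appear more often together than its parts would suggest?") can be sketched as a pointwise mutual information (PMI) score over adjacent word pairs. This is a minimal illustration on a made-up toy text, not necessarily what the linked notebook does; the function name `pmi_bigrams` and the `min_count` threshold are assumptions made for this example.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by PMI: log2( p(w1,w2) / (p(w1) * p(w2)) ).
    Pairs seen fewer than min_count times are skipped, because PMI
    overrates pairs made of rare words."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # highest-scoring pairs first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up toy text, only to show the mechanics
# (sentence boundaries are ignored here, which real code should not do).
text = ("de minister en het bevoegd gezag . "
        "de raad en het bevoegd gezag . "
        "de minister en de raad . "
        "de wet en de regeling")
ranked = pmi_bigrams(text.split())
for (w1, w2), score in ranked:
    print(f"{score:6.2f}  {w1} {w2}")
```

In this toy text "bevoegd gezag" gets the highest score, but the fragment "het bevoegd" ties with it, while "en de" ends up at the bottom. That is the "more common together is not quite enough" problem in miniature.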


## spaCy pattern matcher

Notes: 
- you can express more complex types of patterns, see https://spacy.io/api/matcher#patterns
- you could extend this to more complex tasks, such as rule-based phrase and named entity extraction.
  - ...though you might base that on more specific existing code like PhraseMatcher and EntityRuler,
    which may work faster and/or annotate automatically.
  - and in the case of NER it would probably still be less effective than existing trained NER model components


In [1]:
import collections
import spacy
import spacy.matcher

import wetsuite.datasets
import wetsuite.helpers.spacy



In [3]:
dutch  = spacy.load('nl_core_news_lg')

# Look for one or more adjectives before a noun or proper noun.
# A little too simple, yet it already does something useful.
an_pattern = [
    [ {"POS": "ADJ", "OP": "+"},   
      {"POS": {"IN":["NOUN","PROPN"]}} ],
    # you can have more rules in a matcher
]
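matcher = spacy.matcher.Matcher(dutch.vocab)
# Note: with "OP": "+" the matcher reports overlapping matches, so 'A B noun' also counts 'B noun'.
#   Passing greedy="LONGEST" to add() would keep only the longest match.
matcher.add("adjective-noun", an_pattern)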


count_ans = collections.defaultdict(int)  # matched text -> number of times seen

rvs = wetsuite.datasets.load('raadvanstate-adviezen-struc')
for key, item in rvs.data.random_sample(2500):
    body = '\n'.join( item['body'] )
    doc = dutch( body )
    matches = matcher( doc )
    for match_id, start_i, end_i in matches:
        # we could mark and display them in an existing parse, but for now just count them
        #print( doc[ start_i : end_i ].text )
        count_ans[ doc[ start_i : end_i ].text ] += 1 

# most frequent first  (avoid 'str' as a variable name, it shadows the builtin type)
for phrase, count in sorted( count_ans.items(), key=lambda x:x[1], reverse=True):
    print( f'{count:5d}  {phrase}')

 5277  eerste lid
 4366  tweede lid
 2628  derde lid
 2267  vierde lid
 1519  vijfde lid
 1340  Nader rapport
 1128  redactionele kanttekeningen
 1084  redactionele aard
 1079  algemene maatregel
 1056  uitsluitend opmerkingen
  853  ministeriële regeling
  814  zesde lid
  650  Algemene wet
  562  Gehele tekst
  491  bestuurlijke boete
  476  bevoegd gezag
  462  Nederlandse Antillen
  433  algemeen deel
  408  redactionele kanttekening
  361  hoger onderwijs
  348  inhoudelijke opmerkingen
  336  onderhavige wetsvoorstel
  327  nadere regels
  314  eerste plaats
  314  algemeen belang
  311  Burgerlijk Wetboek
  311  andere wetten
  295  nader rapport
  290  zevende lid
  286  wettelijke regeling
  282  fysieke leefomgeving
  281  strafbare feiten
  275  openbare orde
  273  Economische Zaken
  263  Ruimtelijke Ordening
  259  financieel toezicht
  249  redactionele bijlage
  244  openbare lichamen
  243  decentrale overheden
  242  overeenkomstige toepassing
  239  hoger beroep
  23
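
The output above counts "Nader rapport" and "nader rapport" as separate phrases, because the counts are keyed on the literal matched text. If that distinction does not matter to you, the counts can be merged afterwards. The following is a minimal sketch of one way to do that; the helper name `merge_case_variants` is made up for this example, and `sample_counts` holds just three counts copied from the output above.

```python
from collections import Counter

def merge_case_variants(counts):
    """Sum the counts of phrases that differ only in capitalisation.
    The most frequent spelling is kept as the display form."""
    totals = Counter()
    best_form = {}  # lowercased phrase -> (count, spelling) of its most common variant
    for phrase, count in counts.items():
        key = phrase.lower()
        totals[key] += count
        if key not in best_form or count > best_form[key][0]:
            best_form[key] = (count, phrase)
    return {best_form[key][1]: total for key, total in totals.items()}

# A few counts copied from the output above
sample_counts = {"Nader rapport": 1340, "nader rapport": 295, "hoger beroep": 239}
merged = merge_case_variants(sample_counts)
for phrase, count in sorted(merged.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{count:5d}  {phrase}")
```

Whether merging is the right call depends on the goal: capitalisation can also be a meaningful signal, for example for titles and names.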

# Slightly less basic

## Extracting patterns with rule-based matching