## Goal of this notebook

Explain what topic modeling can do for you



## What topic modelling does

Topic modelling means looking for groups of terms (into sets called called topics),
that happen to distinguish the documents in the given set.

There is no canonical method, so each given implementation may do more than just find and group terms.
They might go further to, 
- try to describe documents,
- segment / classify the set based on topics (some more interactively than others).

<!-- -->

You could argue that even a classical 'expert system', a relatively dumb 
"look for some hardcoded terms to assign, then feed things into a basic classifier", can be used for basic forms of topic modelling.

<!-- -->

Yet 'topic modelling' these days refers to to minimally-supervised machine learning that tries to find topics for you **without any pre-set knowledge**.

A topic is then often a set of co-occurring words/terms that seem to have distinguishing power.

It is nicer if a topic (=set of words) has a very coherent theme, and can be intuitively understood.

This can be nice while exploring data, or trying to describe it using this analysis.

When only trying to separate dissimilar documents from one larger set, 
such intuition is less important than its distinguishing power.


It takes more care to 
- answer which method would works better for a particular wish
- describe how it is best applied to the document set you have,
- additionally discover the balance of topics ''throughout'' a document, e.g. its sections
- explain how different methods deal with overlapping topics

Yet even applied somewhat bluntly it can help explore a set of documents.


Assumptions: 
- the document set you train from is at least moderately representatives of the topics you care about


Useful for:
- labeling text (from paragraphs to documents) with its broad topics

- finding documents with similar topics even when they don't use the same words

- summarizing a dataset's common subject matter





Limitations:
- whether it's a way to index text depends on how you want to query it
  - yes, it will react to a topic though any of its words, not just the one you use, yet this can be way too fuzzy unless combined with other filters.
  - but there will also be plenty of words that will not resolve to the best topic. As topic modeling often focuses only on the stronger word relations, so there will be many that do not.

- methods may find only broad topics, not the specific destinctions you care about, and the same 'unsupervised' that makes this easy to use may also make it hard to guide.
  - if you've just decided you want to train a classifier, 

- also implicitly means ignore rare words

- can't really steer an existing model in a new direction, so if you want to categorize this, you may need to '(re)discover once, then hammer that down into a classifier'
- won't work on small texts - single documents would rarely give coherent enough wording for a result
- slowish
- quantify how to do use it for things like indexing and find things within a document set
  - e.g. whether/how apply topics to all documents, rather the ones that is is clear on (one way it differs from classification)




In [1]:
import os, random, pprint

import wetsuite.datasets
import wetsuite.helpers.etree
import wetsuite.helpers.koop_parse

## Getting some example text

In [2]:
cvdr_mostrecent = wetsuite.datasets.load('cvdr-mostrecent-xml')

cvdr_plaintext = {}   # url -> plain text string

# NOTE: reading 160K documents will take a few minutes
for cvdr_url, cvdr_xml in cvdr_mostrecent.data.iteritems():
    cvdr_etree = wetsuite.helpers.etree.fromstring( cvdr_xml )
    cvdr_plaintext[cvdr_url] = wetsuite.helpers.koop_parse.cvdr_text( cvdr_etree )

  if artikel.find('lid'):


In [3]:
if 0: # small random subset
    random.seed(1)
    per_doc_text_selection = random.sample( list(cvdr_plaintext.values()), 4000) # a few minutes' of work if using spacy (we're experimenting right now)
    print('  and a random selection of %d'%len(per_doc_text_selection))
else: # all
    per_doc_text_selection = list( cvdr_plaintext.values() )

In [6]:
import gensim.utils, tqdm, re
stop = 'de van het een en in of is op voor te aan die met niet bij zijn als dat tot door dan deze kan wordt worden bedoeld er dit kan om  artikel lid'.split()

per_paragraph_text_selection = []
for doc_text in tqdm.tqdm( per_doc_text_selection ):
    for paragraph in re.split(r'\n{2,}', doc_text):
        #print( paragraph )
        #print('-------------------------------------------------')
        if 1: # faster but...
            phrase_list = gensim.utils.simple_preprocess(paragraph, deacc=False, min_len=2, max_len=20) 
        else: # ...we're doing an experiment feeding it cleaned phrases
            phrase_list = wetsuite.helpers.spacy.nl_noun_chunks( text )

        phrase_list = list( phrase   for phrase in phrase_list   if phrase.lower() not in stop )
        per_paragraph_text_selection.append( phrase_list )

100%|██████████| 149432/149432 [06:03<00:00, 411.16it/s]


## LDA

LDA is a statistical model that tries to fit co-occurrences. 

LDA often lets lets you control 
- the target of how many topics to find per document
- the target of how many terms per topic.

It's not aware of words at all - so most of the co-occurrences will be common words, because that's... true.
LDA tutorials often explain that every real stopword on a stopword list avoids messiness later.

This will certainly makes the trained model look cleaner, though there is an argument that [stopword removal barely affects distinguising power](https://www.researchgate.net/publication/318741781_Pulling_Out_the_Stops_Rethinking_Stopword_Removal_for_Topic_Models), so removing them afterwards (if you can) works almost as well, and can be more flexible in that it lets you deal with domain-specific stopwords, and won't overdo it in preprocessing with no measure of the effect of what you removed.

That said, since we're already doing NLP things, we could consider seeing, say, what happens if you just include nouns and noun phrases, and maybe verbs.

### gensim LDA

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

In [3]:
import tqdm
import gensim.models, gensim.utils, gensim.corpora
import pyLDAvis, pyLDAvis.gensim
import wetsuite.helpers.spacy

In [7]:
# gensim seems to make a point of constructing a dictionary of words (with which it seems to mean an enumeration,  a mapping between words and numbers indicating that word)

#corpus_tokens = [] # list of documents, each is a list of words   - mostly to illustrate what we're doing in a more readable way
corpus_gensim = [] # list of documents, each is a list of (idterm, frequency)
dic = gensim.corpora.Dictionary()

for phrase_list in per_paragraph_text_selection:
    doc_id_counts = dic.doc2bow(phrase_list, allow_update=True)  # we're currently creating, not looking up
    corpus_gensim.append( doc_id_counts ) 

In [8]:
# the analysis's data churning:
lda = gensim.models.LdaModel(corpus=corpus_gensim, id2word=dic, num_topics=60)

In [None]:
lda.show_topics(num_words=100)
#lda.print_topics()

In [40]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus_gensim, dic)
vis

  default_term_info = default_term_info.sort_values(


In [9]:
import re

stop = 'de van het een en in of is op voor te aan die met niet bij zijn als dat tot door dan deze kan wordt worden bedoeld er dit kan om  artikel lid'.split()

corpus_gensim = [] # list of documents, each is a list of (idterm, frequency)
dic = gensim.corpora.Dictionary()

for doc_text in tqdm.tqdm( per_doc_text_selection ):
    if len(doc_text) > 500000: # spacy whines for memory reasons (Errors.E088), we can choose not to care right now
        continue

    for paragraph in re.split(r'\n{2,}', doc_text):
        #print( paragraph )
        #print('-------------------------------------------------')
        if 1: # faster but...
            phrase_list = gensim.utils.simple_preprocess(paragraph) # split, lowercase, remove short strings
        else: # ...we're doing an experiment feeding it cleaned phrases
            phrase_list = wetsuite.helpers.spacy.nl_noun_chunks( paragraph )

        phrase_list = list( phrase   for phrase in phrase_list   if phrase.lower() not in stop )

        #corpus_tokens.append( pp )
        doc_id_counts = dic.doc2bow(phrase_list, allow_update=True)  # we're currently creating, not looking up
        corpus_gensim.append( doc_id_counts )

        #if random.randint(0,200) == 1:
        #    break

  0%|          | 0/4000 [00:00<?, ?it/s]

100%|██████████| 4000/4000 [00:03<00:00, 1238.24it/s]


In [12]:
# the analysis's data churning:
lda = gensim.models.LdaModel(corpus=corpus_gensim, id2word=dic, num_topics=40)

In [17]:
import pyLDAvis.utils
import IPython
from importlib import reload
reload(IPython)
reload(pyLDAvis.utils)
pyLDAvis.utils.write_ipynb_local_js()

ImportError: cannot import name 'get_ipython_dir' from 'IPython.utils.path' (/home/scarfboy/.local/lib/python3.8/site-packages/IPython/utils/path.py)

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus_gensim, dic)
vis



### Top2Vec

In [36]:
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'   # avoids some specific versioning breakage, though also makes things slower

from top2vec import Top2Vec

In [9]:
# quick and dirty "if we've run this before and save that model, load it.  If not, generate and save it so the next run can just load it"
model_filename = "model-doc-%d"%len(per_doc_text)
if os.path.exists( model_filename ):
    print( "Loading %s"%model_filename )
    doc_model = Top2Vec.load( model_filename )
else:
    print( "Generating and saving %s"%model_filename )
    doc_model = Top2Vec( per_doc_text, embedding_model='universal-sentence-encoder' )
    doc_model.save( model_filename )

Loading model-doc-2500


In [10]:
# same for the per-artikel test model
model_filename = "model-artikel-%d"%len(per_artikel_text)
if os.path.exists( model_filename ):
    artikel_model = Top2Vec.load( model_filename )
else:
    artikel_model = Top2Vec( per_artikel_text )
    artikel_model.save( model_filename )

2023-04-19 18:55:41,806 - top2vec - INFO - Pre-processing documents for training
2023-04-19 18:56:16,239 - top2vec - INFO - Creating joint document/word embedding
2023-04-19 19:37:21,925 - top2vec - INFO - Creating lower dimension embedding of documents
2023-04-19 19:39:16,097 - top2vec - INFO - Finding dense areas of documents
2023-04-19 19:39:23,952 - top2vec - INFO - Finding topics


## Okay, what does model represent?  What's in there?

In [11]:
help( artikel_model )

Help on Top2Vec in module top2vec.Top2Vec object:

class Top2Vec(builtins.object)
 |  Top2Vec(documents, min_count=50, topic_merge_delta=0.1, ngram_vocab=False, ngram_vocab_args=None, embedding_model='doc2vec', embedding_model_path=None, embedding_batch_size=32, split_documents=False, document_chunker='sequential', chunk_length=100, max_num_chunks=None, chunk_overlap_ratio=0.5, chunk_len_coverage_ratio=1.0, sentencizer=None, speed='learn', use_corpus_file=False, document_ids=None, keep_documents=True, workers=None, tokenizer=None, use_embedding_model_tokenizer=False, umap_args=None, hdbscan_args=None, verbose=True)
 |  
 |  Top2Vec
 |  
 |  Creates jointly embedded topic, document and word vectors.
 |  
 |  
 |  Parameters
 |  ----------
 |  documents: List of str
 |      Input corpus, should be a list of strings.
 |  
 |  min_count: int (Optional, default 50)
 |      Ignores all words with total frequency lower than this. For smaller
 |      corpora a smaller min_count will be necessa

In [21]:
topic_words, word_scores, topic_scores, topic_nums = artikel_model.search_topics(keywords=["educatie"], num_topics=10)

topic_words
#for topic in topic_nums:
#    artikel_model.generate_topic_wordcloud( topic, )

[array(['peuters', 'peuter', 'peuteropvang', 've', 'vve', 'ouderbijdrage',
        'voorschoolse', 'uurtarief', 'educatie', 'verzorgers', 'dagdelen',
        'belastingdienst', 'subsidiebedrag', 'lrk', 'kindercentrum',
        'aanbod', 'kinderopvang', 'reguliere', 'opgevangen',
        'geindiceerde', 'indicatie', 'hbo', 'subsidie', 'utrechtse',
        'heuvelrug', 'fiscaal', 'gesubsidieerde', 'gesubsidieerd',
        'ouders', 'jaarlijks', 'landelijk', 'basisschool', 'subsidiering',
        'peuterspeelzaal', 'kwaliteitseisen', 'inzet', 'aantallen',
        'subsidieren', 'kwaliteit', 'jarigen', 'rijk', 'aanvullend',
        'bekostigt', 'basisscholen', 'verdeeld', 'gehanteerde', 'start',
        'toetsen', 'aanbieder', 'afhankelijk'], dtype='<U15'),
 array(['leegstand', 'medegebruik', 'tekort', 'school', 'overgaan',
        'scholen', 'klokuren', 'gevorderd', 'iii', 'gezag', 'educatie',
        'vordering', 'gebouw', 'basisonderwijs', 'schoolbesturen',
        'onderwijs', 'bevoegd

In [23]:
topic_words, word_scores, topic_scores, topic_nums = doc_model.search_topics(keywords=["educatie"], num_topics=2)

topic_words
#for topic in topic_nums:
#    artikel_model.generate_topic_wordcloud( topic, )

[array(['on', 'an', 'aansprakelijk', 'er', 'discussie', 'overheidsdienst',
        'bedrijfsafval', 'not', 'has', 'wao', 'of', 'wedstrijd', 'persoon',
        'in', 'grootte', 'verschil', 'je', 'toekennen', 'is', 'pas',
        'vaker', 'betrekt', 'integraal', 'ne', 'zitgelegenheid', 'no',
        'leefbaarheid', 'alsof', 'stap', 'wbb', 'naleven', 'en',
        'veroorzaken', 'dienst', 'the', 'kampeermiddelen', 'na', 'bewonen',
        'aanverwante', 'voordat', 'uitgangspunt', 'asv', 'aspecten', 'by',
        'relatief', 'automatisch', 'eijsden', 'dit', 'oisterwijk',
        'lossen'], dtype='<U15'),
 array(['geen', 'voor', 'heeft', 'eens', 'nederlandse', 'naar',
        'nederland', 'gewoon', 'zijn', 'welke', 'een', 'waar', 'waarom',
        'bij', 'gaat', 'natuurlijk', 'niet', 'hij', 'nieuwe', 'komt',
        'tegen', 'maar', 'het', 'deze', 'groningen', 'zoals', 'nooit',
        'worden', 'wordt', 'daar', 'veel', 'iemand', 'goed', 'zeker',
        'nijmegen', 'toch', 'mensen', 'doet'


The size of the context relates to how wide you cast the net for related words. 

For example, if you train on a few paragraphs at a time, when you might get things often mentioned in the same list, e.g. looking for 'educatie' might get you 'cultuur, kunst, initiatieven, stimulatie, subsidue' and perhaps some local assocations and perhaps some specific pronouns from typical wording

Whereas when you train on documents, you might get, well, a similar idea - some local associations, but less typical wording, and each topic looks more complete (peuters, pedagogisch, voorschoolse, vroegschoolse, ourderbijdrage)
 




In [None]:
#model = Top2Vec.load("model-1000")

### SVD, LSA and LSI

There are many methods that ''somehow'' mathematically model the intuition that things that occur together are probably relatedmean similar things.

Latent Semantic Analysis (LSA) is one of them. It's a matrix method focused on using SVD,
which roughly speaking generalizes by squeazing a dense term-count matrix via a simplified expression

that tries to summarize seeing what amount of expression remains when you squeeze an dense correlation count through

...on counts that are usually fairly bag-of-words.

One of its limitations seems to be that it easily gets distracted, so it needs either very clean data or just a lot of it.




### LDA