<a href="https://colab.research.google.com/github/larajakl/Machine-Learning/blob/main/04_LM_LDA_Topic_modeling_2024_Lara_JAKL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering Documents

# Setup

We start by importing [pandas](https://pandas.pydata.org/) - an essential tool for data scientists!

We load a .CSV (Comma Separated Values) file of German news articles from https://github.com/tblock/10kGNAD


In [1]:
from IPython.display import YouTubeVideo

In [2]:
import pandas as pd

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

Pandas is a useful package for loading and parsing CSV files. It can also parse files with other delimiters, such as tabs (TSV) or semicolons; the file used below is comma-separated.  
Pandas is often the first step for data scientists to load and analyze data.

In [3]:
'''
Since my mother tongue is German and I wanted to try out another language,
I decided to use Dutch which is another language that I speak.
I used the following dataset of news articles in Dutch:
https://www.kaggle.com/datasets/maxscheijen/dutch-news-articles?resource=download
'''


In [None]:
# in the following lines I download the dataset

In [4]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("maxscheijen/dutch-news-articles")

print("Path to dataset files:", path)


Downloading from https://www.kaggle.com/api/v1/datasets/download/maxscheijen/dutch-news-articles?dataset_version_number=157...


100%|██████████| 161M/161M [00:09<00:00, 18.3MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/maxscheijen/dutch-news-articles/versions/157


In [5]:
import os

directory_path = '/root/.cache/kagglehub/datasets/maxscheijen/dutch-news-articles/versions/157'
files = os.listdir(directory_path)
print(files)

['dutch-news-articles.csv']


In [6]:
csv_filename = 'dutch-news-articles.csv'
csv_file_path = os.path.join(directory_path, csv_filename)

In [7]:
df_articles = pd.read_csv(csv_file_path,
                 sep=',',       # this file is separated by ","
                 on_bad_lines='skip',   # skip malformed rows instead of raising an error
                 header=0,   # the first line of this CSV holds the column names
                 usecols=['datetime', 'title', 'content', 'category', 'url'],  # read only the specified columns
                 # By specifying the column as a categorical type,
                 # we can save computer memory! Yay!
                 dtype={'category': 'category'})

print(df_articles.head())

              datetime                                            title  \
0  2010-01-01 00:49:00                Enige Litouwse kerncentrale dicht   
1  2010-01-01 02:08:00  Spanje eerste EU-voorzitter onder nieuw verdrag   
2  2010-01-01 02:09:00                 Fout justitie in Blackwater-zaak   
3  2010-01-01 05:14:00        Museumplein vol, minder druk in Rotterdam   
4  2010-01-01 05:30:00              Obama krijgt rapporten over aanslag   

                                             content    category  \
0  De enige kerncentrale van Litouwen is oudjaars...  Buitenland   
1  Spanje is met ingang van vandaag voorzitter va...  Buitenland   
2  Vijf werknemers van het omstreden Amerikaanse ...  Buitenland   
3  Het Oud en Nieuwfeest op het Museumplein in Am...  Binnenland   
4  President Obama heeft de eerste rapporten gekr...  Buitenland   

                                                 url  
0  https://nos.nl/artikel/126231-enige-litouwse-k...  
1  https://nos.nl/artikel/1262

**Note:** Specifying a column with repeated strings as a category is a good pandas trick to be aware of. Datasets often don't fit into memory, and specifying such columns as categorical when loading the data (`pd.read_csv`) saves memory and helps the dataset fit into working memory.
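
To see why this saves memory, it can help to picture what a categorical column stores: each distinct label once, plus one small integer code per row. The following is a pure-Python sketch of that idea only, not pandas' actual implementation; the helper name `encode_categorical` is made up for this illustration.

```python
def encode_categorical(values):
    """Dictionary-encode a list of labels into (categories, codes)."""
    categories = sorted(set(values))                 # each distinct label stored once
    code_of = {label: code for code, label in enumerate(categories)}
    codes = [code_of[label] for label in values]     # one small integer per row
    return categories, codes


labels = ["Buitenland", "Buitenland", "Buitenland", "Binnenland", "Buitenland"]
categories, codes = encode_categorical(labels)

print(categories)  # the distinct labels
print(codes)       # integer codes pointing into `categories`

# Decoding the codes gives back the original column.
decoded = [categories[code] for code in codes]
```

In pandas itself, `df_articles['category'].cat.categories` and `df_articles['category'].cat.codes` expose these two parts.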

In [8]:
df_articles

| | datetime | title | content | category | url |
|---|---|---|---|---|---|
| 0 | 2010-01-01 00:49:00 | Enige Litouwse kerncentrale dicht | De enige kerncentrale van Litouwen is oudjaars... | Buitenland | https://nos.nl/artikel/126231-enige-litouwse-k... |
| 1 | 2010-01-01 02:08:00 | Spanje eerste EU-voorzitter onder nieuw verdrag | Spanje is met ingang van vandaag voorzitter va... | Buitenland | https://nos.nl/artikel/126230-spanje-eerste-eu... |
| 2 | 2010-01-01 02:09:00 | Fout justitie in Blackwater-zaak | Vijf werknemers van het omstreden Amerikaanse ... | Buitenland | https://nos.nl/artikel/126233-fout-justitie-in... |
| 3 | 2010-01-01 05:14:00 | Museumplein vol, minder druk in Rotterdam | Het Oud en Nieuwfeest op het Museumplein in Am... | Binnenland | https://nos.nl/artikel/126232-museumplein-vol-... |
| 4 | 2010-01-01 05:30:00 | Obama krijgt rapporten over aanslag | President Obama heeft de eerste rapporten gekr... | Buitenland | https://nos.nl/artikel/126236-obama-krijgt-rap... |
| ... | ... | ... | ... | ... | ... |
| 255519 | 2023-08-09 09:51:42 | Amazone-landen willen ontbossing tegengaan, ma... | Acht Zuid-Amerikaanse landen zijn het op de Am... | Buitenland | https://nos.nl//artikel/2485984-amazone-landen... |
| 255520 | 2023-08-09 10:06:31 | Topman moederbedrijf Albert Heijn: 'Prijsverla... | De topman van Ahold Delhaize, het moederbedrij... | Economie | https://nos.nl//artikel/2485988-topman-moederb... |
| 255521 | 2023-08-09 10:09:40 | Bijzondere mijlpaal voor Duncan Laurence: een ... | Duncan Laurence heeft met zijn nummer Arcade e... | Binnenland | https://nos.nl//artikel/2485989-bijzondere-mij... |
| 255522 | 2023-08-09 10:17:16 | Brand in Frans vakantiehuis voor gehandicapten... | In een Frans vakantiehuis voor gehandicapten i... | Buitenland | https://nos.nl//artikel/2485990-brand-in-frans... |


In [9]:
df_articles['category'].cat.categories

Index(['Binnenland', 'Buitenland', 'Cultuur & Media', 'Economie',
       'Koningshuis', 'Opmerkelijk', 'Politiek', 'Regionaal nieuws', 'Tech',
       '1 Jaar Oorlog', '4 En 5 Mei ', 'Aardbevingen', 'Crisis Asielbeleid',
       'Cultuur-En-Media', 'Einde Rutte Iv', 'Grensoverschrijdend ',
       'Gronings Gas', 'Jaarwisseling', 'Keti Koti', 'Klimaat',
       'Kroning Charles', 'L1Mburg', 'Midterm-Verkiezingen', 'Nh Nieuws',
       'Oekraïens Offensief', 'Omroep Brabant', 'Omroep Flevoland',
       'Omroep Gelderland', 'Omroep West', 'Omroep Zeeland', 'Omrop Fryslân',
       'Op Weg Naar Tk2023', 'Opstand Wagner', 'Pelé Overleden',
       'Pentagon-Lek', 'Proces-Taghi', 'Regio', 'Regionaal Nieuws', 'Rijnmond',
       'Rtv Drenthe', 'Rtv Noord', 'Rtv Oost', 'Rtv Utrecht',
       'Schipholonderzoek', 'Slavernijverleden', 'Songfestival ',
       'Stikstofcrisis', 'Strijd In Sudan', 'Treinongeluk', 'Turkije Kiest',
       'Verkiezingen', 'Wangedrag Supporters', 'Watersnoodramp', 'Wk Voetbal'

# Clustering with Latent Dirichlet Allocation (LDA)

In [10]:
%pip install -U gensim --quiet

In [11]:
from pprint import pprint # for printing objects nicely

from gensim import corpora, models
from gensim.utils import simple_preprocess

## Instead of the gensim English stopwords...
# from gensim.parsing.preprocessing import STOPWORDS
## ...I will use nltk's Dutch stopwords:
from nltk.corpus import stopwords

from nltk.stem.snowball import SnowballStemmer

import numpy as np

from random import choice
import random # I import this to get random number of articles from the file that I will work with

np.random.seed(1234)

In [12]:
# Initialize the stemmer and the stopword list
stemmer = SnowballStemmer('dutch')
dutch_stop_words = set(stopwords.words('dutch'))


def lemmatize_stemming(text):
  """Stem a single token (Snowball stemming only; no lemmatization is performed)."""
  return stemmer.stem(text)


def preprocess(text):
  """Tokenize, drop stopwords and tokens of 3 characters or fewer, then stem."""
  result = [lemmatize_stemming(token)
            for token in simple_preprocess(text)
            if token not in dutch_stop_words and len(token) > 3]
  return result


In our DataFrame, we have a table that contains the articles and their topics.

We only need the articles for this task - we will create our own topics. So, let's start by converting the articles column into a list of all the articles:

In [13]:
all_articles = df_articles['content'].to_list()
all_articles[:5]

['De enige kerncentrale van Litouwen is oudjaarsavond om 23.00 uur buiten gebruik gesteld. Dat verliep zonder problemen, aldus de directeur. Litouwen beloofde al in 2004 om de centrale te sluiten in ruil voor toetreding tot de Europese Unie. De EU wilde sluiting omdat de kerncentrale bij de stad Visiginas mogelijk niet veilig was. Nucleaire ramp De centrale is een grotere versie van die bij Tsjernobyl. Die ontplofte in 1986 en veroorzaakte een nucleaire wolk die over een groot deel van Europa trok. Dat was de grootste nucleaire ramp in de geschiedenis. Voor Litouwen betekent de sluiting dat het land een goedkope bron van energie kwijt is. Het wordt nu veel afhankelijker van bijvoorbeeld gas uit Rusland. De kerncentrale leverde bijna driekwart van de Litouwse energiebehoefte.',
 'Spanje is met ingang van vandaag voorzitter van de EU. De Zweedse premier Fredrik Reinfeldt heeft het stokje, formeel om middernacht, overgedragen aan zijn Spaanse collega José Luis Rodriguez Zapatero. Spanje 

In [14]:
'''
I will work with randomly selected 200 articles from the dataset:
'''
random.seed(42)

random_articles = random.sample(all_articles, 200)
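
Seeding matters here because it makes the random selection repeatable. A small sketch with a stand-in list (the `titles` list is made up for illustration) shows that two generators seeded with the same value draw the same sample:

```python
import random

titles = [f"article_{i}" for i in range(1000)]  # stand-in for all_articles

# Two independent generators with the same seed produce the same sample.
first = random.Random(42).sample(titles, 5)
second = random.Random(42).sample(titles, 5)

# A different seed will almost certainly select different articles.
other = random.Random(7).sample(titles, 5)

print(first == second)
```

Using a dedicated `random.Random` instance instead of the module-level functions keeps the sample independent of any other code that also draws random numbers.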

## Preprocessing

Let's see an example, what happens when we pre-process a document.

Look at the output of this cell, and compare the tokenized original document, to the lemmatized document:

My observations: The preprocessed document has fewer unique tokens because of the stemming (the code applies a Snowball stemmer rather than true lemmatization), the removal of tokens of length 3 or shorter, and the removal of stopwords. Every token is now lowercase.

In [15]:
print('original document: ')
article = choice(random_articles)
print(article, "\n")

# This time, we don't care about punctuations as tokens (Can you think why?):
print('original document, broken into words: ')
words = article.split(' ')
print(words, "\n")
print("Vocabulary size of the original article:", len(set(words)))

# now let's see what happens when we pass the article into our preprocessing
# method:
print('\n\n tokenized and lemmatized document: ')
preprocessed_article = preprocess(article)
print(preprocessed_article, '\n')
print("Vocabulary size after preprocessing:", len(set(preprocessed_article)))


original document: 
Het fregat Hr. Ms.Tromp heeft gisteren twee Somalische piraten doodgeschoten. Zestien Somaliërs zijn opgepakt. De Somaliërs schoten vanaf een gekaapt Iraans vissersschip, dat door de Nederlanders werd bevrijd. Aan Nederlandse kant is volgens het ministerie van Defensie niemand gewond geraakt. De Somaliërs werden gedood toen vanaf de Tromp werd teruggeschoten. De piraten sloegen op de vlucht maar ze staakten de vluchtpoging toen er waarschuwingsschoten werden gelost. Later werd de Tromp nog genaderd door een ander gekaapt schip, maar dat maakte rechtsomkeert na schoten voor de boeg. 

original document, broken into words: 
['Het', 'fregat', 'Hr.', 'Ms.Tromp', 'heeft', 'gisteren', 'twee', 'Somalische', 'piraten', 'doodgeschoten.', 'Zestien', 'Somaliërs', 'zijn', 'opgepakt.', 'De', 'Somaliërs', 'schoten', 'vanaf', 'een', 'gekaapt', 'Iraans', 'vissersschip,', 'dat', 'door', 'de', 'Nederlanders', 'werd', 'bevrijd.', 'Aan', 'Nederlandse', 'kant', 'is', 'volgens', 'he

Now let's pre-process all the documents.  
This is a heavy procedure, and may take a bit ;)

In [16]:
processed_docs = list(map(preprocess, random_articles))
processed_docs[:10]

[['kliniek',
  'bangkok',
  'operaties',
  'maand',
  'uitvoert',
  'waarbij',
  'peniss',
  'gebleekt',
  'opsprak',
  'gekom',
  'social',
  'media',
  'medewerker',
  'deeld',
  'foto',
  'facebok',
  'geslachtsdel',
  'witter',
  'liet',
  'mak',
  'post',
  'ker',
  'gedeeld',
  'riep',
  'vrag',
  'grenz',
  'thais',
  'maand',
  'geled',
  'begon',
  'lelux',
  'ziekenhuis',
  'laser',
  'peniss',
  'behandel',
  'melanin',
  'afgebrok',
  'waardor',
  'penis',
  'witter',
  'kliniek',
  'stat',
  'bekend',
  'operaties',
  'lichaamsdel',
  'witter',
  'mak',
  'vrag',
  'soort',
  'operaties',
  'vertelt',
  'bunthita',
  'wattanasiri',
  'ziekenhuis',
  'werkt',
  'persbureau',
  'krijg',
  'klant',
  'maand',
  'drie',
  'vier',
  'operatie',
  'kost',
  'ongever',
  'euro',
  'vijf',
  'sessies',
  'doelgroep',
  'operaties',
  'mann',
  'leeftijd',
  'vertelt',
  'wattanasiri',
  'daarvan',
  'mak',
  'del',
  'thais',
  'lgbt',
  'gemeenschap',
  'gemeenschap',
  'thailand

## Setting Up The Dictionary

Our preprocessing is complete.

We now need to calculate the occurrence frequencies of each of our stemmed words. But first, we will create a vocabulary dictionary where every word appears once. Every article will then be represented as a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model), an unordered collection of the words that the article contains, together with their counts.

---

Q: Why is it called bag-of-words?

Hint: Think about your probability lessons - where you had randomly picked out white or black balls out of a bag...

My guess: The order is lost in the bag-of-words approach, so we can imagine one bag per article into which we throw all the words of that article, each as often as it occurs (if the word "dairy" appears 3 times in the article, it is also in our bag 3 times). If we then randomly pick a word out of the article's bag, the probability of picking that word is its frequency divided by the total number of words in the bag (again, including duplicates).
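
The bag picture can be sketched in a few lines of plain Python. The token list below is made up for illustration; `collections.Counter` plays the role of the bag:

```python
from collections import Counter

# A made-up, already preprocessed article.
tokens = ["dairy", "farm", "dairy", "price", "dairy", "farm"]

bag = Counter(tokens)            # word order is lost, counts are kept
total = sum(bag.values())        # number of tokens, duplicates included

# Probability of drawing each word when picking one token at random.
probabilities = {word: count / total for word, count in bag.items()}

print(bag)
print(probabilities["dairy"])    # 3 of the 6 tokens are "dairy"
```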

In [18]:
dictionary = corpora.Dictionary(processed_docs)


Let's take a look:

In [19]:
for idx, (k, v) in enumerate(dictionary.iteritems()):
    print(k, v)
    if idx >= 10:
        break


### BTW: `enumerate` is a great python function!
### It automatically creates an index, an auto-incremented counter variable,
### that represents the position of every object in the collection.

### Read more about it here: https://realpython.com/python-enumerate/

0 achterkamertjes
1 actric
2 afgebrok
3 afrikan
4 allen
5 alom
6 armoed
7 azie
8 bangkok
9 bedrijf
10 begon


Second, we filter out tokens that appear too often or too rarely.

We have full control over the process.

### Model Hyperparameter tuning

### Your Turn:
#### Exercise 1 - Hyperparameter effect on the model output:
**Q:** How would changing these parameters influence the result?  
After running this example, please return here to change them and try them out.

ANSWER: I will answer this question at the bottom of this Colab file!


In [20]:
## Model hyper parameters:

## These are the dictionary preparation parameters:
filter_tokens_if_container_documents_are_less_than = 5
filter_tokens_if_appeared_percentage_more_than = 0.8
keep_the_first_n_tokens=100000

## and the LDA Parameters:
num_of_topics = 20

In [21]:
dictionary.filter_extremes(
    no_below=filter_tokens_if_container_documents_are_less_than,
    no_above=filter_tokens_if_appeared_percentage_more_than,
    keep_n=keep_the_first_n_tokens)
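
To make the effect of these thresholds concrete, here is a rough pure-Python sketch of this kind of document-frequency filtering. It is an approximation written for illustration, not gensim's code; the function name `filter_by_document_frequency` and the toy documents are assumptions.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering.

    A token is kept if it occurs in at least `no_below` documents and in
    at most a fraction `no_above` of all documents; of the survivors, only
    the `keep_n` tokens with the highest document frequency are kept.
    """
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    survivors = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    survivors.sort(key=lambda tok: (-doc_freq[tok], tok))
    return survivors[:keep_n]


toy_docs = [
    ["zegt", "politie", "brand"],
    ["zegt", "politie", "euro"],
    ["zegt", "euro", "bank"],
    ["zegt", "regering"],
]

# "zegt" is in every document (too common); "brand", "bank" and "regering"
# are in only one document each (too rare).
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.8, keep_n=100)
print(kept)
```

With a small corpus, a high `no_below` can remove almost the whole vocabulary, so it is worth checking the dictionary size after filtering.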


We now create a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) (BOW) representation of each document, using [gensim's dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) tool.

Each document becomes a list of pairs in the format:

```[(word_id, count), ...]```


In [22]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
print(len(bow_corpus))

200


Let's take a look at the result.

Our corpus now contains only word_ids, not the words themselves, so we have to peek into the dictionary to see which word each id represents:

In [23]:
# randomly choose an article from the corpus:
sample_bow_doc = choice(bow_corpus)

print('The processed bag-of-words document is just pairs of (word_id, # of occurrences) and looks like this:')
print(sample_bow_doc, '\n\n')

print('We peek in the dictionary: for each word_id, we get its assigned word:')
for word_id, word_freq in sample_bow_doc:
  real_word = dictionary[word_id]
  print(f'Word #{word_id} ("{real_word}") appears {word_freq} time.')


The processed bag-of-words document is just pairs of (word_id, # of occurrences) and looks like this:
[(12, 1), (50, 1), (84, 1), (88, 1), (199, 1), (349, 1), (385, 1), (408, 1), (605, 1), (642, 1), (648, 1), (654, 1), (749, 1), (813, 1)] 


We peek in the dictionary: for each word_id, we get its assigned word:
Word #12 ("del") appears 1 time.
Word #50 ("mens") appears 1 time.
Word #84 ("vooral") appears 1 time.
Word #88 ("waarbij") appears 1 time.
Word #199 ("twitter") appears 1 time.
Word #349 ("raakt") appears 1 time.
Word #385 ("gewond") appears 1 time.
Word #408 ("tientall") appears 1 time.
Word #605 ("dod") appears 1 time.
Word #642 ("aanslag") appears 1 time.
Word #648 ("gisteravond") appears 1 time.
Word #654 ("duit") appears 1 time.
Word #749 ("viel") appears 1 time.
Word #813 ("twaalf") appears 1 time.
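
The id lookup above can be mimicked with plain Python dictionaries. This is only a sketch of the idea behind `doc2bow`, with a made-up vocabulary; the helper name `to_bow` is illustrative and not part of gensim.

```python
from collections import Counter

# Made-up vocabulary: every known token gets an integer id.
token2id = {"aanslag": 0, "gewond": 1, "twitter": 2}
id2token = {idx: tok for tok, idx in token2id.items()}


def to_bow(tokens):
    """Turn a token list into sorted (token_id, count) pairs.

    Tokens missing from the vocabulary are silently dropped.
    """
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


bow = to_bow(["gewond", "aanslag", "gewond", "onbekend"])
print(bow)
for token_id, count in bow:
    print(token_id, id2token[token_id], count)
```

The dropped token is worth keeping in mind for new documents: words that were filtered out of the dictionary, or never seen in it, contribute nothing to the bag-of-words vector.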


## LDA model using Bag-of-words

Let's start by applying the LDA model using the bag-of-words (Warning: this could take a while):

In [24]:
lda_model = models.LdaMulticore(bow_corpus,
                                num_topics=num_of_topics,
                                id2word=dictionary,
                                passes=5,
                                workers=2)



It is done!

Now we can observe which topics the model has extracted from the documents.

- *Topics* are sets of words with a distribution over them; each word's weight represents its importance in that topic.
- Every document may be composed of multiple topics, with different weights representing how strongly the document relates to each topic.
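
These two points can be combined into one small calculation: the probability of a word in a document is the topic-weighted average of the word's probability in each topic. The numbers below are invented for illustration and do not come from the trained model.

```python
# Word distributions of two made-up topics (each sums to 1).
topic_words = {
    "politics": {"minister": 0.6, "partij": 0.3, "bank": 0.1},
    "economy": {"minister": 0.1, "partij": 0.1, "bank": 0.8},
}

# Topic mixture of one made-up document (sums to 1).
doc_topics = {"politics": 0.75, "economy": 0.25}

# p(word | document) = sum over topics of p(topic | document) * p(word | topic)
vocabulary = ["minister", "partij", "bank"]
doc_words = {
    word: sum(doc_topics[t] * topic_words[t][word] for t in doc_topics)
    for word in vocabulary
}
print(doc_words)
```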

We will loop over the extracted topics and examine the words that construct them.

In [25]:
for idx, topic in lda_model.print_topics(num_of_topics):
    print(f'Topic: {idx} \t Words: {topic}')


Topic: 0 	 Words: 0.023*"jar" + 0.019*"volgen" + 0.017*"land" + 0.014*"twee" + 0.010*"zegt" + 0.009*"werd" + 0.009*"gister" + 0.009*"grot" + 0.008*"nederland" + 0.008*"vandag"
Topic: 1 	 Words: 0.031*"gemeent" + 0.021*"koning" + 0.019*"besluit" + 0.017*"auto" + 0.013*"jar" + 0.013*"regel" + 0.012*"nederland" + 0.011*"rechtbank" + 0.010*"volgen" + 0.010*"drie"
Topic: 2 	 Words: 0.018*"twee" + 0.017*"jar" + 0.017*"stat" + 0.014*"vrouw" + 0.012*"kwam" + 0.012*"geled" + 0.012*"mens" + 0.012*"agent" + 0.012*"procent" + 0.011*"jarig"
Topic: 3 	 Words: 0.032*"kamer" + 0.030*"pakistan" + 0.019*"twed" + 0.019*"geheim" + 0.017*"veroordeeld" + 0.013*"minister" + 0.013*"belgisch" + 0.013*"voorzitter" + 0.013*"tijden" + 0.013*"krant"
Topic: 4 	 Words: 0.022*"mens" + 0.019*"ler" + 0.017*"video" + 0.016*"volgen" + 0.014*"zegt" + 0.013*"eerst" + 0.013*"had" + 0.012*"media" + 0.010*"rond" + 0.010*"bekend"
Topic: 5 	 Words: 0.031*"ziekenhuis" + 0.020*"wegen" + 0.020*"komt" + 0.018*"provincie" + 0.014*"w

## TF / IDF

Let's take it one step further. We will cluster our documents by running the LDA using [TF/IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

We start with the TF/IDF calculation on our bag-of-words corpus.
TF/IDF takes the bag-of-words corpus (word counts per document) as input and calculates the term frequency and the inverse document frequency accordingly.

Its output is a re-weighted version of the documents' term frequencies:

In [26]:
# initialize a tfidf from our corpus
tfidf = models.TfidfModel(bow_corpus)

# apply it on our corpus
tfidf_corpus = tfidf[bow_corpus]

pprint(tfidf_corpus[0][:10])

[(0, 0.03420235371515244),
 (1, 0.058511070037302916),
 (2, 0.07920899336444842),
 (3, 0.4166371078978617),
 (4, 0.07050637163642252),
 (5, 0.04789424423823499),
 (6, 0.0997194565565183),
 (7, 0.11702214007460583),
 (8, 0.08332742157957235),
 (9, 0.055683782751587606)]


In [27]:
# the new tfidf corpus is just our corpus - but transformed. It has the same number of documents:
assert len(bow_corpus) == len(tfidf_corpus)
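
For intuition, the re-weighting can be reproduced by hand on a toy corpus. The sketch below assumes the common scheme of raw term frequency times `log2(N / df)`, followed by scaling each document to unit length, which is my understanding of gensim's defaults; treat the exact formula as an assumption. The corpus and the function name `tfidf_weights` are made up.

```python
import math

# Toy bag-of-words corpus: each document is a list of (word_id, count) pairs.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1), (2, 1)],
]


def tfidf_weights(corpus):
    """Weight counts by log2(N / document frequency), then L2-normalize."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word_id, _ in doc:
            doc_freq[word_id] = doc_freq.get(word_id, 0) + 1

    weighted_corpus = []
    for doc in corpus:
        weighted = [(wid, count * math.log2(n_docs / doc_freq[wid])) for wid, count in doc]
        weighted = [(wid, w) for wid, w in weighted if w > 0]  # drop zero weights
        norm = math.sqrt(sum(w * w for _, w in weighted))
        weighted_corpus.append([(wid, w / norm) for wid, w in weighted] if norm else [])
    return weighted_corpus


toy_tfidf = tfidf_weights(toy_corpus)
for doc in toy_tfidf:
    print(doc)
```

Word 0 occurs in every toy document, so its weight is zero and it disappears from the output. This is how TF/IDF plays down words that do not help to tell documents apart.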

Now let's apply LDA on the tfidf corpus, with the same number of topics.

You can play with the number of passes if the model doesn't converge properly.

In [28]:
lda_model_tfidf = models.LdaMulticore(tfidf_corpus,
                                      num_topics=num_of_topics,
                                      id2word=dictionary,
                                      passes=5,
                                      workers=4)



In [29]:
for idx, topic in lda_model_tfidf.print_topics(num_of_topics):
    print(f'Topic: {idx} \t Word: {topic}')

Topic: 0 	 Word: 0.010*"bank" + 0.010*"waard" + 0.009*"central" + 0.009*"provincie" + 0.007*"consument" + 0.006*"voorzitter" + 0.006*"familie" + 0.006*"verkop" + 0.006*"del" + 0.006*"vind"
Topic: 1 	 Word: 0.012*"vrouw" + 0.010*"werk" + 0.009*"social" + 0.008*"noord" + 0.008*"moord" + 0.008*"brand" + 0.008*"agent" + 0.007*"terrein" + 0.007*"media" + 0.006*"problem"
Topic: 2 	 Word: 0.013*"israe" + 0.012*"groenlink" + 0.011*"utrecht" + 0.011*"wegen" + 0.011*"liep" + 0.010*"verdwijn" + 0.010*"ziekenhuis" + 0.009*"rond" + 0.008*"augustus" + 0.007*"automobilist"
Topic: 3 	 Word: 0.010*"lop" + 0.009*"journalist" + 0.008*"plaats" + 0.008*"pol" + 0.008*"brandwer" + 0.008*"kamer" + 0.008*"gemeent" + 0.007*"verdacht" + 0.006*"vrijgelat" + 0.006*"geplaatst"
Topic: 4 	 Word: 0.011*"stichting" + 0.008*"dorp" + 0.007*"slachtoffer" + 0.007*"besmet" + 0.007*"mens" + 0.007*"vak" + 0.006*"reger" + 0.006*"radio" + 0.006*"geslot" + 0.006*"voormal"
Topic: 5 	 Word: 0.013*"gemeent" + 0.011*"prijs" + 0.010*

## Inference

Now that we have a topic-modeler, let's use it on one of the articles.

In [30]:
# randomly pick an article:
test_doc = choice(range(len(processed_docs)))
processed_docs[test_doc][:50]

['christelijk',
 'begraafplat',
 'fran',
 'dorpj',
 'tracy',
 'tientall',
 'grav',
 'beschadigd',
 'eerder',
 'wek',
 'soortgelijk',
 'incident',
 'jod',
 'begraafplat',
 'oost',
 'frankrijk',
 'fran',
 'minister',
 'binnenland',
 'zak',
 'bernard',
 'cazeneuv',
 'bekend',
 'dader',
 'politie',
 'spor',
 'premier',
 'vall',
 'twitterd',
 'walgt',
 'gebeurteniss',
 'tracy',
 'noordwestkust',
 'frankrijk',
 'ligt']

Using the original BOW model:

In [31]:
for index, score in sorted(lda_model[bow_corpus[test_doc]], key=lambda tup: -1*tup[1]):
    print(f"Topic match score: {score} \nTopic: {lda_model.print_topic(index, num_of_topics)}")


Topic match score: 0.9472202062606812 
Topic: 0.020*"partij" + 0.019*"gat" + 0.018*"volgen" + 0.016*"land" + 0.016*"zegt" + 0.014*"nieuw" + 0.014*"euro" + 0.014*"reger" + 0.014*"fran" + 0.014*"minister" + 0.013*"maand" + 0.013*"frankrijk" + 0.011*"oekrai" + 0.010*"kort" + 0.009*"bekend" + 0.009*"jar" + 0.009*"miljoen" + 0.009*"nederland" + 0.009*"dinsdag" + 0.009*"krijg"


And with the TF/IDF model:

In [32]:
for index, score in sorted(lda_model_tfidf[bow_corpus[test_doc]], key=lambda tup: -1*tup[1]):
    print("Topic match score: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, num_of_topics)))

Topic match score: 0.9472188353538513	 
Topic: 0.009*"mens" + 0.009*"nieuw" + 0.008*"zegt" + 0.007*"dod" + 0.007*"land" + 0.007*"minister" + 0.007*"jarig" + 0.007*"aantal" + 0.007*"war" + 0.006*"vorig" + 0.006*"wek" + 0.006*"volgen" + 0.006*"zak" + 0.006*"had" + 0.006*"bekend" + 0.006*"tijd" + 0.006*"europes" + 0.006*"gat" + 0.006*"vandag" + 0.006*"eerder"


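Calculating the [perplexity score](https://towardsdatascience.com/perplexity-in-language-models-87a196019a94) (lower is better). Note that gensim's `log_perplexity` returns the per-word likelihood bound, not the perplexity itself; the perplexity is `2 ** (-bound)`, so a higher (less negative) value printed here is better: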

In [33]:
print('Perplexity: ', lda_model.log_perplexity(bow_corpus))
print('Perplexity TFIDF: ', lda_model_tfidf.log_perplexity(bow_corpus))

Perplexity:  -7.020839772642903
Perplexity TFIDF:  -7.93722257790332
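
gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; according to the gensim documentation, the perplexity is `2 ** (-bound)`. Converting the two values printed above:

```python
# Values copied from the output above.
bound_bow = -7.020839772642903
bound_tfidf = -7.93722257790332

# Perplexity = 2 ** (-bound); a lower perplexity is better.
perplexity_bow = 2 ** (-bound_bow)
perplexity_tfidf = 2 ** (-bound_tfidf)

print(f"Perplexity (BOW model):    {perplexity_bow:.1f}")
print(f"Perplexity (TF/IDF model): {perplexity_tfidf:.1f}")
```

On these numbers the bag-of-words model has the lower perplexity on `bow_corpus`. Both models were scored on the raw counts, although the TF/IDF model was trained on re-weighted values, so the comparison is not entirely like for like.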


### Exercise - inference

Now please try it on a new document!

Go to a news website, such as [orf.at](https://orf.at/) and copy an article of your choice here:

In [34]:
unseen_document = """FC Twente heeft woensdag in de KNVB-beker ternauwernood gewonnen van VV Katwijk. De tukkers hadden grote moeite met de amateurclub, maar wonnen met 2-3.
FC Twente had in het winderige Katwijk nog een vliegende start en vond in de derde minuut al het net. De bal belandde vanuit een hoekschop voor de voeten van Alec Van Hoorenbeeck, die met een gelukje kon binnenwerken.

Het leek een makkelijke avond te worden voor de tukkers, maar de thuisploeg besloot anders. Na een kwartier stond het plots 1-1 door een fraaie treffer van Robin Schulte. Katwijk, de nummer vier van de Tweede Divisie, kreeg vleugels en was voor rust dicht bij een voorsprong. Twente ontsnapte, omdat het bij Katwijk ontbrak aan nauwkeurigheid in de afronding.

Twente-trainer Joseph Oosting moet in de rust een geweldige speech hebben gegeven, want de bezoekers kwamen na rust binnen vijf minuten met 3-1 voor. De net ingevallen Ricky van Wolfswinkel zorgde op aangeven van Sem Steijn voor de 2-1. Een paar minuten later maakte Bas Kuipers de 3-1.

Katwijk, dat met een vuurwerkshow voor de aftrap enorm had uitgepakt voor de bekeravond, liet het hoofd niet hangen en maakte het Twente lastig. Mohammed Tahiri liet de thuisploeg geloven in een stunt toen hij in de 85e minuut de aansluitingstreffer binnenschoot. Het slotoffensief was niet genoeg om een verlenging af te dwingen.

De volle bekeravond is nog lang niet voorbij. Volg alle duels in ons liveblog."""

bow_vector = dictionary.doc2bow(preprocess(unseen_document))

print("Simply printing the lda_model output would look like this:")
pprint(lda_model[bow_vector])

print("\n\nSo let's make it nicer, by printing the topic contents:")
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))


Simply printing the lda_model output would look like this:
[(0, 0.30323246),
 (2, 0.20538136),
 (11, 0.25353816),
 (13, 0.11433935),
 (14, 0.10427464)]


So let's make it nicer, by printing the topic contents:
Score: 0.3029906749725342	 Topic: 0.023*"jar" + 0.019*"volgen" + 0.017*"land" + 0.014*"twee" + 0.010*"zegt"
Score: 0.25352591276168823	 Topic: 0.016*"israe" + 0.015*"geld" + 0.015*"volgen" + 0.015*"kwam" + 0.015*"euro"
Score: 0.205551415681839	 Topic: 0.018*"twee" + 0.017*"jar" + 0.017*"stat" + 0.014*"vrouw" + 0.012*"kwam"
Score: 0.11434706300497055	 Topic: 0.017*"zegt" + 0.017*"elkar" + 0.016*"aanslag" + 0.015*"mens" + 0.011*"politie"
Score: 0.10435087978839874	 Topic: 0.019*"twee" + 0.015*"mens" + 0.014*"had" + 0.012*"tijd" + 0.012*"wer"


## Visualization

Finally, there are packages that can visualize the results, such as [pyLDAvis](https://pypi.org/project/pyLDAvis/) and [tmplot](https://pypi.org/project/tmplot/).

Let's take a look at pyLDAvis visualization result.

**Please note:** this is an old and unmaintained package. It is easier to run it in Google Colab than on your laptop. If you still want to run it locally, try **lowering your Python version** (3.6 / 3.7 / 3.8) when you create the poetry environment for this exercise.

In [35]:
%pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 43.5 MB/s eta 0:00:00
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


In [36]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

bow_lda_data = gensimvis.prepare(lda_model, bow_corpus, dictionary)

pyLDAvis.display(bow_lda_data)

In [39]:
# and here I wanted to try the visualisation for the TF/IDF model:

tfidf_lda_data = gensimvis.prepare(lda_model_tfidf, bow_corpus, dictionary)

pyLDAvis.display(tfidf_lda_data)



Try changing the parameters to get a *satisfying level of clustering*.  
Which parameters worked best for the language you chose?

In [37]:
'''
My observations about the hyperparameters:

I had to change the hyperparameters because a lot of tokens were filtered out with the initial ones
because I only used 50 articles when I tried this exercise out for the first time - and filtering out tokens if
they appear in less than 15 of those 50 articles reduces the number of tokens drastically...
I was left with only 2!
So I lowered this hyperparameter and also I went back and chose 200 articles randomly instead,
and I continued working with those 200 articles.

The hyperparameters affect the model in the following way:
- no_below=filter_tokens_if_container_documents_are_less_than: excludes (filters) tokens that occur in fewer
than x documents. If x is 5, then a word must appear in 5 or more articles to be kept. This is helpful to remove
rare words that might not be informative and might create noise in the model.

- no_above=filter_tokens_if_appeared_percentage_more_than: filters tokens that occur in more than a fraction x
of the documents. If x is 0.8, a word must occur in 80% or less of the articles to be kept. This is helpful to
remove very common words, which occur in most documents and are therefore less useful for differentiating
between topics.

- keep_n=keep_the_first_n_tokens: keeps only the x most frequent tokens (after filtering).

My implementation:
lower no_below: Since the corpus is still rather small (200 articles which are not super long), I set the no_below
hyperparameter a little lower because a high value might exclude rare but important terms. In 200 short articles of
(hopefully) differing topics, how big are the chances that a word appears in a high number of them? They are low!

higher no_above: The Dutch language contains a lot of commonly used compound words. I read online that
since that is the case, it is wise to set the no_above value a little higher so that you still include common compounds
without letting overly frequent terms (that might not be informative) influence the LDA too much.

I left the keep_n hyperparameter as is because I did not get extremely many tokens anyways due to the smaller
dataset and rather short articles.

About the visualisation: I read online that clusters that do not overlap extensively might demonstrate successful
clustering.
This explains why in my first trial with only 50 articles and not-so-ideal hyperparameters, the clusters were
almost all at the right top corner and overlapping a lot. Now with 200 articles and better tuned hyperparameters,
the clusters look more independent. Then I changed the number of topics from 10 to 20 (I know that there are maaany
more topics in the dataset - I know, I am biased...), and now the clusters are even better!
As a last step, I was curious how the clusters look for the TF/IDF based LDA model. And they surprised me! They
vary in size a lot more and the smaller clusters are very close to each other and overlapping. I did some research
and learned the following:
- Size of clusters: While BOW only weighs words considering their frequency in an article, TF/IDF weighs words
considering their frequency in an article AND in the whole corpus (inversed). For BOW, this might result in
a more similar distribution of word weights across topics. For TF/IDF, it means that words that are unique to specific
articles or topics have higher weights and as a result, topics with more unique words will be bigger in size.
- Distance between clusters/overlapping: The small and partly overlapping clusters in the TF/IDF based LDA might
show topics that are only prevalent in a small proportion of the corpus (hence their small size), and that
share unique words with one another (hence their proximity). In easy words: these are rare/niche topics that do
not occur as often in the corpus but that are quite similar in terms of their words.

As a next step, one could definitely increase the number of articles, and play around a little more with
the number of topics, and maybe also refine the preprocessing pipeline for the Dutch language.
'''
