<a href="https://colab.research.google.com/github/larajakl/Machine-Learning/blob/main/04_LM_LDA_Topic_modeling_2024_Lara_JAKL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering Documents

# Setup

We start by importing [pandas](https://pandas.pydata.org/) - an essential tool for data scientists!

The original exercise loads a .CSV (Comma Separated Values) file of German news articles from https://github.com/tblock/10kGNAD; this notebook uses a Dutch news dataset instead (see below).


In [1]:
from IPython.display import YouTubeVideo

In [2]:
import pandas as pd

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

Pandas is a useful package for loading and parsing CSV files. It can also parse TSV files (separated by tabs) or files with another delimiter such as `;`. The dataset used here is separated by `,`.  
Pandas is often the first step for data scientists to load and analyze data.

In [7]:
'''
Since my mother tongue is German and I wanted to try out another language,
I decided to use Dutch which is another language that I speak.
I used the following dataset of news articles in Dutch:
https://www.kaggle.com/datasets/maxscheijen/dutch-news-articles?resource=download
/Users/Lara/Desktop/MLT/Machine Learning/Assignments/dutch-news-articles.csv
'''


In [4]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("maxscheijen/dutch-news-articles")

print("Path to dataset files:", path)


Downloading from https://www.kaggle.com/api/v1/datasets/download/maxscheijen/dutch-news-articles?dataset_version_number=157...


100%|██████████| 161M/161M [00:01<00:00, 89.0MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/maxscheijen/dutch-news-articles/versions/157


In [5]:
import os

directory_path = '/root/.cache/kagglehub/datasets/maxscheijen/dutch-news-articles/versions/157'
files = os.listdir(directory_path)
print(files)

['dutch-news-articles.csv']


In [6]:
csv_filename = 'dutch-news-articles.csv'
csv_file_path = os.path.join(directory_path, csv_filename)

In [7]:
df_articles = pd.read_csv(csv_file_path,
                 sep=',',       # this file is actually separated by ","
                 on_bad_lines='skip',
                 header=0,   # first line is header line for this CSV
                 # .. so we define the column names here:
                 usecols=['datetime', 'title', 'content', 'category', 'url'],  # Reads only the specified columns
                 # And by specifying the column as a Categorical type,
                 # we can save computer memory! Yay!
                 dtype={'category': 'category'})

print(df_articles.head())

              datetime                                            title  \
0  2010-01-01 00:49:00                Enige Litouwse kerncentrale dicht   
1  2010-01-01 02:08:00  Spanje eerste EU-voorzitter onder nieuw verdrag   
2  2010-01-01 02:09:00                 Fout justitie in Blackwater-zaak   
3  2010-01-01 05:14:00        Museumplein vol, minder druk in Rotterdam   
4  2010-01-01 05:30:00              Obama krijgt rapporten over aanslag   

                                             content    category  \
0  De enige kerncentrale van Litouwen is oudjaars...  Buitenland   
1  Spanje is met ingang van vandaag voorzitter va...  Buitenland   
2  Vijf werknemers van het omstreden Amerikaanse ...  Buitenland   
3  Het Oud en Nieuwfeest op het Museumplein in Am...  Binnenland   
4  President Obama heeft de eerste rapporten gekr...  Buitenland   

                                                 url  
0  https://nos.nl/artikel/126231-enige-litouwse-k...  
1  https://nos.nl/artikel/1262

**Note:** Declaring a column of repeated strings as a category is a good pandas trick to be aware of. Datasets often do not fit into memory; by marking such columns as categorical when loading the data (`pd.read_csv`), we save memory and make it more likely that the dataset fits into working memory.
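
To see why a categorical column saves memory, here is a minimal pure-Python sketch of the underlying idea: each distinct string is stored once, and every row keeps only a small integer code. The helpers `encode_categorical` and `decode_categorical` are hypothetical names used for illustration; pandas implements the same principle far more efficiently.

```python
def encode_categorical(values):
    """Store each distinct label once and represent rows by integer codes."""
    categories = sorted(set(values))
    code_of = {label: code for code, label in enumerate(categories)}
    codes = [code_of[label] for label in values]
    return categories, codes


def decode_categorical(categories, codes):
    """Rebuild the original column from categories and codes."""
    return [categories[code] for code in codes]


# Sample labels taken from the 'category' column shown above.
column = ["Buitenland", "Buitenland", "Buitenland", "Binnenland", "Buitenland"]
categories, codes = encode_categorical(column)

print(categories)  # ['Binnenland', 'Buitenland']
print(codes)       # [1, 1, 1, 0, 1]
```

In pandas, the equivalent codes of a categorical column are available through `df_articles['category'].cat.codes`, and `df_articles.memory_usage(deep=True)` can be used to compare the memory footprint with and without the `dtype` argument.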

In [8]:
df_articles

Unnamed: 0,datetime,title,content,category,url
0,2010-01-01 00:49:00,Enige Litouwse kerncentrale dicht,De enige kerncentrale van Litouwen is oudjaars...,Buitenland,https://nos.nl/artikel/126231-enige-litouwse-k...
1,2010-01-01 02:08:00,Spanje eerste EU-voorzitter onder nieuw verdrag,Spanje is met ingang van vandaag voorzitter va...,Buitenland,https://nos.nl/artikel/126230-spanje-eerste-eu...
2,2010-01-01 02:09:00,Fout justitie in Blackwater-zaak,Vijf werknemers van het omstreden Amerikaanse ...,Buitenland,https://nos.nl/artikel/126233-fout-justitie-in...
3,2010-01-01 05:14:00,"Museumplein vol, minder druk in Rotterdam",Het Oud en Nieuwfeest op het Museumplein in Am...,Binnenland,https://nos.nl/artikel/126232-museumplein-vol-...
4,2010-01-01 05:30:00,Obama krijgt rapporten over aanslag,President Obama heeft de eerste rapporten gekr...,Buitenland,https://nos.nl/artikel/126236-obama-krijgt-rap...
...,...,...,...,...,...
255519,2023-08-09 09:51:42,"Amazone-landen willen ontbossing tegengaan, ma...",Acht Zuid-Amerikaanse landen zijn het op de Am...,Buitenland,https://nos.nl//artikel/2485984-amazone-landen...
255520,2023-08-09 10:06:31,Topman moederbedrijf Albert Heijn: 'Prijsverla...,"De topman van Ahold Delhaize, het moederbedrij...",Economie,https://nos.nl//artikel/2485988-topman-moederb...
255521,2023-08-09 10:09:40,Bijzondere mijlpaal voor Duncan Laurence: een ...,Duncan Laurence heeft met zijn nummer Arcade e...,Binnenland,https://nos.nl//artikel/2485989-bijzondere-mij...
255522,2023-08-09 10:17:16,Brand in Frans vakantiehuis voor gehandicapten...,In een Frans vakantiehuis voor gehandicapten i...,Buitenland,https://nos.nl//artikel/2485990-brand-in-frans...


In [10]:
df_articles['category'].cat.categories

Index(['Binnenland', 'Buitenland', 'Cultuur & Media', 'Economie',
       'Koningshuis', 'Opmerkelijk', 'Politiek', 'Regionaal nieuws', 'Tech',
       '1 Jaar Oorlog', '4 En 5 Mei ', 'Aardbevingen', 'Crisis Asielbeleid',
       'Cultuur-En-Media', 'Einde Rutte Iv', 'Grensoverschrijdend ',
       'Gronings Gas', 'Jaarwisseling', 'Keti Koti', 'Klimaat',
       'Kroning Charles', 'L1Mburg', 'Midterm-Verkiezingen', 'Nh Nieuws',
       'Oekraïens Offensief', 'Omroep Brabant', 'Omroep Flevoland',
       'Omroep Gelderland', 'Omroep West', 'Omroep Zeeland', 'Omrop Fryslân',
       'Op Weg Naar Tk2023', 'Opstand Wagner', 'Pelé Overleden',
       'Pentagon-Lek', 'Proces-Taghi', 'Regio', 'Regionaal Nieuws', 'Rijnmond',
       'Rtv Drenthe', 'Rtv Noord', 'Rtv Oost', 'Rtv Utrecht',
       'Schipholonderzoek', 'Slavernijverleden', 'Songfestival ',
       'Stikstofcrisis', 'Strijd In Sudan', 'Treinongeluk', 'Turkije Kiest',
       'Verkiezingen', 'Wangedrag Supporters', 'Watersnoodramp', 'Wk Voetbal'

# Clustering with Latent Dirichlet Allocation (LDA)

In previous exercises, you got to know NLTK.

### Stemming
Here we will also use NLTK's methods for **stemming** the words. By reducing each word to its stem, we reduce the dimensionality: the number of words in the vocabulary decreases. For example, instead of keeping separate entries for the singular and plural forms ('word' <--> 'words') or for 'Kanzler', 'Kanzlers', 'Kanzlei', etc., we trim those words to a shared stem such as 'Kanzl'. This can shrink the vocabulary considerably.
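
As a rough illustration of how stemming merges word forms, the sketch below uses a deliberately naive suffix stripper. The function `naive_stem` and its suffix list are hypothetical and are not the Snowball algorithm used later in this notebook; real stemmers apply many more language-specific rules.

```python
# Hypothetical suffix list for illustration only; real stemmers such as
# Snowball apply language-specific rules and conditions.
SUFFIXES = ("ers", "er", "ei", "s")


def naive_stem(word):
    """Strip the longest matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["Kanzler", "Kanzlers", "Kanzlei", "word", "words"]
stems = [naive_stem(w) for w in words]

print(stems)  # ['kanzl', 'kanzl', 'kanzl', 'word', 'word']
# Vocabulary size before and after stemming:
print(len(set(w.lower() for w in words)), "->", len(set(stems)))  # 5 -> 2
```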

### Stop Words
We will also remove `stopwords` from our text. In English, words such as: `a`, `an`, and `the` will be removed, as they don't add much to the meaning of the sentence. For each language, there is a different curated list of such words, and NLTK is a great source for those.
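
The filtering itself is a simple membership test. The sketch below assumes a tiny hand-picked stop-word set (`TOY_DUTCH_STOP_WORDS`, a hypothetical name) instead of NLTK's full list, and also applies a minimum token length similar to the `len(token) > 3` check used in the `preprocess` function of this notebook.

```python
# A tiny, hand-picked subset of Dutch function words, for illustration only.
# NLTK's stopwords.words('dutch') provides a much longer curated list.
TOY_DUTCH_STOP_WORDS = {"de", "het", "een", "van", "is", "om", "dat", "heeft"}


def remove_stop_words(tokens, stop_words, min_length=4):
    """Keep tokens that are not stop words and have at least min_length characters."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_length]


sentence = "de enige kerncentrale van litouwen is buiten gebruik gesteld"
tokens = sentence.split()
kept = remove_stop_words(tokens, TOY_DUTCH_STOP_WORDS)

print(kept)
# ['enige', 'kerncentrale', 'litouwen', 'buiten', 'gebruik', 'gesteld']
```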

### GenSim
In this exercise, you'll be introduced to another package, specialized in topic modeling, called `gensim`:
https://radimrehurek.com/gensim/



In [11]:
%pip install -U gensim --quiet

In [17]:
from pprint import pprint # for printing objects nicely

from gensim import corpora, models
from gensim.utils import simple_preprocess

## Instead of the gensim English stopwords...
# from gensim.parsing.preprocessing import STOPWORDS
## ...I will use nltk's Dutch stopwords:
from nltk.corpus import stopwords

from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import *

import numpy as np

from random import choice
import random # I import this to get random 50 articles from the file that I will work with

np.random.seed(1234)

In [13]:
# Initialize the Stemmers
stemmer = SnowballStemmer('dutch')
dutch_stop_words = set(stopwords.words('dutch'))


def lemmatize_stemming(text):
  """Stem a single token with the Dutch Snowball stemmer (no lemmatization is applied)."""
  return stemmer.stem(text)


def preprocess(text):
  """Tokenize, drop stopwords and tokens of 3 characters or fewer, then stem."""
  result = [lemmatize_stemming(token)
            for token in simple_preprocess(text)
            if token not in dutch_stop_words and len(token) > 3]
  return result


In our DataFrame, we have a table that contains the articles and their topics.

We only need the articles for this task, since we will create our own topics. So, let's start by converting the articles column into a list of all the articles:

In [14]:
all_articles = df_articles['content'].to_list()
all_articles[:5]

['De enige kerncentrale van Litouwen is oudjaarsavond om 23.00 uur buiten gebruik gesteld. Dat verliep zonder problemen, aldus de directeur. Litouwen beloofde al in 2004 om de centrale te sluiten in ruil voor toetreding tot de Europese Unie. De EU wilde sluiting omdat de kerncentrale bij de stad Visiginas mogelijk niet veilig was. Nucleaire ramp De centrale is een grotere versie van die bij Tsjernobyl. Die ontplofte in 1986 en veroorzaakte een nucleaire wolk die over een groot deel van Europa trok. Dat was de grootste nucleaire ramp in de geschiedenis. Voor Litouwen betekent de sluiting dat het land een goedkope bron van energie kwijt is. Het wordt nu veel afhankelijker van bijvoorbeeld gas uit Rusland. De kerncentrale leverde bijna driekwart van de Litouwse energiebehoefte.',
 'Spanje is met ingang van vandaag voorzitter van de EU. De Zweedse premier Fredrik Reinfeldt heeft het stokje, formeel om middernacht, overgedragen aan zijn Spaanse collega José Luis Rodriguez Zapatero. Spanje 

In [73]:
'''
I will work with 200 randomly selected articles from the dataset:
'''
random.seed(42)

random_articles = random.sample(all_articles, 200)
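
Seeding is what makes this sample reproducible. One detail worth knowing: `random.seed` and `np.random.seed` control separate generators, so the earlier `np.random.seed(1234)` call does not influence `random.sample` or `choice`. The sketch below uses a hypothetical stand-in list (`toy_articles`) to show that re-seeding reproduces the same draw.

```python
import random

toy_articles = [f"article {i}" for i in range(1000)]  # stand-in for all_articles

random.seed(42)
first_draw = random.sample(toy_articles, 5)

random.seed(42)  # re-seeding restores the generator state
second_draw = random.sample(toy_articles, 5)

print(first_draw == second_draw)  # True

# A dedicated generator avoids touching the global state used by choice().
rng = random.Random(42)
third_draw = rng.sample(toy_articles, 5)
print(third_draw == first_draw)  # True
```

Using a dedicated `random.Random` instance is optional; it simply keeps the sampling independent of later `choice(...)` calls that advance the global generator.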

## Preprocessing

Let's see an example, what happens when we pre-process a document.

Look at the output of this cell, and compare the tokenized original document to the stemmed document:

My observations: The stemmed (preprocessed) document has fewer unique tokens/words because of the stemming, the removal of words of length 3 or shorter, and the removal of stopwords. Every token is in lowercase now.

In [74]:
print('original document: ')
article = choice(random_articles)
print(article, "\n")

# This time, we don't care about punctuations as tokens (Can you think why?):
print('original document, broken into words: ')
words = [word for word in article.split(' ')]
print(words, "\n")
print("Vocabulary size of the original article:", len(set(words)))

# now let's see what happens when we pass the article into our preprocessing
# method:
print('\n\n tokenized and lemmatized document: ')
preprocessed_article = preprocess(article)
print(preprocessed_article, '\n')
print("Vocabulary size after preprocessing:", len(set(preprocessed_article)))


original document: 
Het fregat Hr. Ms.Tromp heeft gisteren twee Somalische piraten doodgeschoten. Zestien Somaliërs zijn opgepakt. De Somaliërs schoten vanaf een gekaapt Iraans vissersschip, dat door de Nederlanders werd bevrijd. Aan Nederlandse kant is volgens het ministerie van Defensie niemand gewond geraakt. De Somaliërs werden gedood toen vanaf de Tromp werd teruggeschoten. De piraten sloegen op de vlucht maar ze staakten de vluchtpoging toen er waarschuwingsschoten werden gelost. Later werd de Tromp nog genaderd door een ander gekaapt schip, maar dat maakte rechtsomkeert na schoten voor de boeg. 

original document, broken into words: 
['Het', 'fregat', 'Hr.', 'Ms.Tromp', 'heeft', 'gisteren', 'twee', 'Somalische', 'piraten', 'doodgeschoten.', 'Zestien', 'Somaliërs', 'zijn', 'opgepakt.', 'De', 'Somaliërs', 'schoten', 'vanaf', 'een', 'gekaapt', 'Iraans', 'vissersschip,', 'dat', 'door', 'de', 'Nederlanders', 'werd', 'bevrijd.', 'Aan', 'Nederlandse', 'kant', 'is', 'volgens', 'he

Now let's pre-process all the documents.  
This is a heavy procedure, and may take a bit ;)

In [75]:
processed_docs = list(map(preprocess, random_articles))
processed_docs[:10]

[['kliniek',
  'bangkok',
  'operaties',
  'maand',
  'uitvoert',
  'waarbij',
  'peniss',
  'gebleekt',
  'opsprak',
  'gekom',
  'social',
  'media',
  'medewerker',
  'deeld',
  'foto',
  'facebok',
  'geslachtsdel',
  'witter',
  'liet',
  'mak',
  'post',
  'ker',
  'gedeeld',
  'riep',
  'vrag',
  'grenz',
  'thais',
  'maand',
  'geled',
  'begon',
  'lelux',
  'ziekenhuis',
  'laser',
  'peniss',
  'behandel',
  'melanin',
  'afgebrok',
  'waardor',
  'penis',
  'witter',
  'kliniek',
  'stat',
  'bekend',
  'operaties',
  'lichaamsdel',
  'witter',
  'mak',
  'vrag',
  'soort',
  'operaties',
  'vertelt',
  'bunthita',
  'wattanasiri',
  'ziekenhuis',
  'werkt',
  'persbureau',
  'krijg',
  'klant',
  'maand',
  'drie',
  'vier',
  'operatie',
  'kost',
  'ongever',
  'euro',
  'vijf',
  'sessies',
  'doelgroep',
  'operaties',
  'mann',
  'leeftijd',
  'vertelt',
  'wattanasiri',
  'daarvan',
  'mak',
  'del',
  'thais',
  'lgbt',
  'gemeenschap',
  'gemeenschap',
  'thailand

In [76]:
for i, doc in enumerate(processed_docs[:5]):
    print(f'Processed Document {i+1}: {doc}')

Processed Document 1: ['kliniek', 'bangkok', 'operaties', 'maand', 'uitvoert', 'waarbij', 'peniss', 'gebleekt', 'opsprak', 'gekom', 'social', 'media', 'medewerker', 'deeld', 'foto', 'facebok', 'geslachtsdel', 'witter', 'liet', 'mak', 'post', 'ker', 'gedeeld', 'riep', 'vrag', 'grenz', 'thais', 'maand', 'geled', 'begon', 'lelux', 'ziekenhuis', 'laser', 'peniss', 'behandel', 'melanin', 'afgebrok', 'waardor', 'penis', 'witter', 'kliniek', 'stat', 'bekend', 'operaties', 'lichaamsdel', 'witter', 'mak', 'vrag', 'soort', 'operaties', 'vertelt', 'bunthita', 'wattanasiri', 'ziekenhuis', 'werkt', 'persbureau', 'krijg', 'klant', 'maand', 'drie', 'vier', 'operatie', 'kost', 'ongever', 'euro', 'vijf', 'sessies', 'doelgroep', 'operaties', 'mann', 'leeftijd', 'vertelt', 'wattanasiri', 'daarvan', 'mak', 'del', 'thais', 'lgbt', 'gemeenschap', 'gemeenschap', 'thailand', 'alom', 'behandel', 'populair', 'homoseksuel', 'mann', 'travestiet', 'zorg', 'edel', 'del', 'zegt', 'medewerker', 'ziekenhuis', 'will', 

## Setting Up The Dictionary

Our preprocessing is complete.

We now need to calculate the occurrence frequencies of each of our stemmed words. But first, we will create a vocabulary dictionary in which every word appears once. Every article is then represented as a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model), an unordered collection of the words that the article contains, together with their counts.

---

Q: Why is it called bag-of-words?

Hint: Think about your probability lessons - where you had randomly picked out white or black balls out of a bag...

My guess: Well, the order is lost in the bag-of-words approach, so we can imagine it as one bag per article into which we throw all the words of that article, with their respective frequency (meaning that if the word "dairy" appears 3 times in the article, it is also 3 times in our bag). We could then randomly pick a word out of our article-bag, and the probability of picking that word would be its frequency divided by the total number of words in the bag (again, including duplicates).
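
The bag analogy can be written down directly with `collections.Counter`. This is a minimal sketch on a hypothetical, already pre-processed token list; it is not how gensim stores documents internally.

```python
from collections import Counter

# Hypothetical pre-processed article (already stemmed and filtered).
tokens = ["kliniek", "operaties", "maand", "operaties", "witter", "operaties"]

bag = Counter(tokens)
total = sum(bag.values())

# Probability of drawing each word when picking one token at random.
draw_probability = {word: count / total for word, count in bag.items()}

print(bag["operaties"])               # 3
print(draw_probability["operaties"])  # 0.5
```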

In [77]:
dictionary = corpora.Dictionary(processed_docs)


Let's take a look:

In [78]:
for idx, (k, v) in enumerate(dictionary.iteritems()):
    print(k, v)
    if idx >= 10:
        break


### BTW: `enumerate` is a great python function!
### It automatically creates an index, an auto-incremented counter variable,
### that represents the position of every object in the collection.

### Read more about it here: https://realpython.com/python-enumerate/

0 achterkamertjes
1 actric
2 afgebrok
3 afrikan
4 allen
5 alom
6 armoed
7 azie
8 bangkok
9 bedrijf
10 begon


Second, we filter out the tokens that appear too rarely or too often.

We have full control over the process.

### Model Hyperparameter tuning

### Your Turn:
#### Exercise 1 - Hyperparameter effect on the model output:
**Q:** How would changing these parameters influence the result?  
After running this example, please return here to change them and try them out.

ANSWER: I will answer this question at the bottom of this Colab file!


In [100]:
## Model hyper parameters:

## These are the dictionary preparation parameters:
filter_tokens_if_container_documents_are_less_than = 5
filter_tokens_if_appeared_percentage_more_than = 0.8
keep_the_first_n_tokens=100000

## and the LDA Parameters:
num_of_topics = 10



In [101]:
dictionary.filter_extremes(
    no_below=filter_tokens_if_container_documents_are_less_than,
    no_above=filter_tokens_if_appeared_percentage_more_than,
    keep_n=keep_the_first_n_tokens)
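
To make the effect of the three parameters concrete, here is a pure-Python sketch of document-frequency filtering on a toy corpus. The function `filter_extremes_sketch` is a hypothetical re-implementation that follows the documented meaning of `no_below`, `no_above` and `keep_n`; gensim's own implementation may differ in details such as rounding.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering.

    Mirrors the documented intent of the three parameters; this is not
    gensim's implementation.
    """
    num_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    survivors = [
        token
        for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Keep only the keep_n tokens with the highest document frequency.
    survivors.sort(key=lambda token: (-doc_freq[token], token))
    return survivors[:keep_n]


docs = [
    ["zegt", "politie", "agent"],
    ["zegt", "euro", "bank"],
    ["zegt", "politie", "euro"],
    ["zegt", "koning"],
]
print(filter_extremes_sketch(docs, no_below=2, no_above=0.8, keep_n=100000))
# ['euro', 'politie']
```

In this toy corpus, "zegt" is dropped because it occurs in every document, while "agent", "bank" and "koning" are dropped because they occur in only one.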




We now create a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) (BOW) representation of each document, using [gensim's dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) tool.

Each document becomes a list of pairs in the format of:

```[(word_id, count), ...]```
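
The conversion from tokens to `(word_id, count)` pairs can be sketched in a few lines. The helpers `build_token2id` and `doc2bow_sketch` are hypothetical names that only mimic the behaviour; gensim may assign the ids in a different order.

```python
from collections import Counter


def build_token2id(docs):
    """Assign a consecutive integer id to every distinct token."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [["maand", "kliniek", "maand"], ["kliniek", "euro"]]
token2id = build_token2id(docs)

print(token2id)  # {'maand': 0, 'kliniek': 1, 'euro': 2}
print(doc2bow_sketch(["maand", "maand", "euro", "onbekend"], token2id))
# [(0, 2), (2, 1)]
```

Note that the token "onbekend" disappears because it is not in the vocabulary. The same happens to any word of a new document that was filtered out of, or never seen by, the dictionary.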


In [102]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
print(len(bow_corpus))

200




Let's take a look at the result.

Our corpus now contains only word_ids, not the words themselves, so we have to peek into the dictionary to know which word each id represents:

In [103]:
# randomly choose an article from the corpus:
sample_bow_doc = choice(bow_corpus)

print('The processed bag-of-words document is just pairs of (word_id, # of occurrences) and looks like this:')
print(sample_bow_doc, '\n\n')

print('We peek in the dictionary: for each word_id, we get its assigned word:')
for word_id, word_freq in sample_bow_doc:
  real_word = dictionary[word_id]
  print(f'Word #{word_id} ("{real_word}") appears {word_freq} time(s).')


The processed bag-of-words document is just pairs of (word_id, # of occurrences) and looks like this:
[(22, 1), (26, 1), (29, 1), (36, 2), (45, 1), (50, 3), (51, 2), (54, 2), (58, 1), (64, 1), (71, 1), (75, 2), (79, 1), (82, 1), (84, 1), (93, 1), (97, 4), (114, 1), (132, 1), (140, 2), (141, 1), (146, 1), (153, 1), (156, 1), (161, 1), (169, 1), (174, 2), (175, 1), (176, 1), (194, 2), (201, 1), (208, 1), (218, 1), (220, 1), (222, 1), (223, 1), (224, 4), (247, 1), (271, 1), (276, 4), (277, 1), (278, 1), (283, 1), (289, 1), (306, 1), (307, 3), (308, 1), (309, 1), (310, 1), (313, 1), (319, 3), (337, 1), (353, 1), (365, 1), (371, 1), (374, 1), (396, 1), (431, 1), (445, 6), (447, 1), (457, 1), (475, 1), (486, 1), (501, 1), (505, 1), (516, 2), (530, 1), (532, 3), (575, 1), (576, 1), (577, 1), (578, 1), (579, 1), (580, 1), (581, 1), (582, 1), (583, 1), (584, 1), (585, 1), (586, 1), (587, 1), (588, 1), (589, 1), (590, 2), (591, 2), (592, 1), (593, 1), (594, 1), (595, 1), (596, 1), (597, 1), (598, 1



## LDA model using Bag-of-words

Let's start by applying the LDA model using the bag-of-words (Warning: this could take a while):

In [104]:
lda_model = models.LdaMulticore(bow_corpus,
                                num_topics=num_of_topics,
                                id2word=dictionary,
                                passes=5,
                                workers=2)



It is done!

Now we can observe which topics the model has extracted from the documents.

- *Topics* are made of sets of words and their distribution for that topic, representing their weight in that topic.
- Every document may be composed of multiple topics, with different weights representing its relation to each topic.
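
These two points can be combined into one formula: the probability of a word in a document is the topic-weighted sum of the word's probability in each topic. The sketch below uses hypothetical toy numbers (two topics, three words), not values from the trained model.

```python
# Hypothetical topic-word distributions p(word | topic) over a 3-word vocabulary.
topics = {
    0: {"politie": 0.7, "euro": 0.1, "koning": 0.2},
    1: {"politie": 0.1, "euro": 0.8, "koning": 0.1},
}

# Hypothetical topic weights p(topic | document) for one document.
doc_topics = {0: 0.25, 1: 0.75}


def word_probability(word, doc_topics, topics):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(weight * topics[k][word] for k, weight in doc_topics.items())


doc_word_dist = {w: word_probability(w, doc_topics, topics) for w in topics[0]}
print(doc_word_dist)
```

Because the document leans towards topic 1, "euro" ends up as its most probable word even though "politie" dominates topic 0.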

We will loop over the extracted topics and examine the words that construct them.

In [105]:
for idx, topic in lda_model.print_topics(num_of_topics):
    print(f'Topic: {idx} \t Words: {topic}')


Topic: 0 	 Words: 0.023*"gemeent" + 0.020*"jar" + 0.016*"mens" + 0.010*"euro" + 0.009*"zegt" + 0.008*"premier" + 0.007*"huis" + 0.007*"lang" + 0.007*"amsterdam" + 0.007*"nederland"
Topic: 1 	 Words: 0.028*"jar" + 0.018*"procent" + 0.013*"mens" + 0.012*"zegt" + 0.010*"war" + 0.010*"supermarkt" + 0.009*"stat" + 0.009*"dorp" + 0.008*"nederland" + 0.008*"komt"
Topic: 2 	 Words: 0.019*"jar" + 0.018*"procent" + 0.015*"zegt" + 0.015*"nederland" + 0.012*"grot" + 0.012*"auto" + 0.011*"boer" + 0.009*"winkel" + 0.009*"nieuw" + 0.008*"voer"
Topic: 3 	 Words: 0.016*"mens" + 0.015*"aanslag" + 0.011*"aangift" + 0.010*"bank" + 0.010*"onderzoek" + 0.010*"partij" + 0.009*"zegt" + 0.009*"twee" + 0.008*"komt" + 0.008*"europes"
Topic: 4 	 Words: 0.026*"nederland" + 0.017*"europes" + 0.017*"koning" + 0.017*"land" + 0.017*"moet" + 0.015*"gat" + 0.012*"euro" + 0.010*"minister" + 0.009*"maatregel" + 0.009*"hel"
Topic: 5 	 Words: 0.019*"euro" + 0.017*"gat" + 0.014*"miljoen" + 0.010*"maatregel" + 0.010*"zegt" + 



## TF / IDF

Let's take it one step further. We will cluster our documents by running the LDA using [TF/IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

We start with the TF/IDF calculation on our bag-of-words.
TF/IDF accepts the bag-of-words corpus (word counts per document) as input, and it calculates the term frequency and the inverse document frequency accordingly.

Its output is a re-weighted version of the documents' term frequencies:

In [106]:
# initialize a tfidf from our corpus
tfidf = models.TfidfModel(bow_corpus)

# apply it on our corpus
tfidf_corpus = tfidf[bow_corpus]

pprint(tfidf_corpus[0][:10])

[(0, 0.03420235371515244),
 (1, 0.058511070037302916),
 (2, 0.07920899336444842),
 (3, 0.4166371078978617),
 (4, 0.07050637163642252),
 (5, 0.04789424423823499),
 (6, 0.09971945655651832),
 (7, 0.11702214007460583),
 (8, 0.08332742157957235),
 (9, 0.055683782751587606)]




In [107]:
# the new tfidf corpus is just our corpus, but transformed. It has the same number of documents:
assert len(bow_corpus) == len(tfidf_corpus)
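
To make the re-weighting concrete, here is a hand-rolled sketch on a toy corpus. It assumes the weighting `count * log2(N / df)` followed by L2 normalization, which to my understanding matches gensim's default settings; the function name `tfidf_sketch` and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_sketch(bow_docs):
    """Re-weight (token_id, count) documents with tf * log2(N / df), then L2-normalize."""
    num_docs = len(bow_docs)
    doc_freq = Counter(token_id for doc in bow_docs for token_id, _ in doc)
    weighted_docs = []
    for doc in bow_docs:
        weights = [
            (token_id, count * math.log2(num_docs / doc_freq[token_id]))
            for token_id, count in doc
        ]
        # Drop zero weights (tokens present in every document carry no signal).
        weights = [(token_id, w) for token_id, w in weights if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_docs.append([(token_id, w / norm) for token_id, w in weights] if norm else [])
    return weighted_docs


# Toy corpus: token 0 occurs in every document, token 3 in only one.
toy_bow = [[(0, 2), (1, 1)], [(0, 1), (2, 3)], [(0, 1), (1, 1), (3, 1)]]
for doc in tfidf_sketch(toy_bow):
    print(doc)
```

Token 0 vanishes from every document because it appears everywhere, and in the last document the rare token 3 receives a larger weight than the more common token 1.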



Now let's apply LDA on the tfidf corpus, with the same number of topics.

You can play with the number of passes if the model doesn't converge properly.

In [108]:
lda_model_tfidf = models.LdaMulticore(tfidf_corpus,
                                      num_topics=num_of_topics,
                                      id2word=dictionary,
                                      passes=5,
                                      workers=4)



In [109]:
for idx, topic in lda_model_tfidf.print_topics(num_of_topics):
    print(f'Topic: {idx} \t Word: {topic}')

Topic: 0 	 Word: 0.010*"agent" + 0.009*"politie" + 0.007*"besmet" + 0.005*"huis" + 0.005*"verdacht" + 0.005*"ongeluk" + 0.005*"israe" + 0.005*"voer" + 0.004*"kabinet" + 0.004*"normal"
Topic: 1 	 Word: 0.008*"supermarkt" + 0.006*"brandwer" + 0.006*"prijs" + 0.005*"fran" + 0.005*"anti" + 0.005*"vijf" + 0.005*"unie" + 0.005*"brand" + 0.005*"twitter" + 0.005*"energie"
Topic: 2 	 Word: 0.008*"gemeent" + 0.006*"gat" + 0.006*"miljoen" + 0.005*"land" + 0.005*"europes" + 0.005*"elkar" + 0.005*"auto" + 0.005*"zegt" + 0.005*"mens" + 0.005*"zegg"
Topic: 3 	 Word: 0.006*"dod" + 0.006*"kan" + 0.006*"kreg" + 0.005*"werd" + 0.005*"gesprek" + 0.005*"rechter" + 0.005*"noord" + 0.005*"schot" + 0.005*"islamitisch" + 0.005*"gewond"
Topic: 4 	 Word: 0.010*"koning" + 0.008*"prijs" + 0.006*"europa" + 0.005*"ler" + 0.005*"wer" + 0.005*"vrijdag" + 0.005*"websit" + 0.004*"advocat" + 0.004*"april" + 0.004*"noord"
Topic: 5 	 Word: 0.008*"agent" + 0.008*"hag" + 0.007*"verdacht" + 0.006*"jarig" + 0.005*"groep" + 0.0



## Inference

Now that we have a topic-modeler, let's use it on one of the articles.

In [110]:
# randomly pick an article:
test_doc = choice(range(len(processed_docs)))
processed_docs[test_doc][:50]



['rechtbanktolk',
 'voer',
 'actie',
 'sind',
 'vandag',
 'nem',
 'nieuw',
 'opdracht',
 'daardor',
 'zull',
 'sommig',
 'zitting',
 'politieverhor',
 'doorgan',
 'voorspelt',
 'ord',
 'registertolk',
 'vertaler',
 'belangengroep',
 'led',
 'voert',
 'langer',
 'actie',
 'beter',
 'belon',
 'vorig',
 'wek',
 'werd',
 'demonstraties',
 'leeuward',
 'arnhem',
 'amsterdam',
 'hag',
 'georganiseerd',
 'klein',
 'beroepsgroep',
 'vuist',
 'mak',
 'daarom',
 'gan',
 'nieuw',
 'acties',
 'aldus',
 'fed',
 'dijkstra',
 'tolk',
 'vertaler',
 'teven',
 'voorzitter',
 'ord']

Using the original BOW model:

In [111]:
for index, score in sorted(lda_model[bow_corpus[test_doc]], key=lambda tup: -1*tup[1]):
    print(f"Topic match score: {score} \nTopic: {lda_model.print_topic(index, num_of_topics)}")


Topic match score: 0.9921028017997742 
Topic: 0.024*"politie" + 0.013*"verdacht" + 0.012*"twee" + 0.009*"vorig" + 0.009*"agent" + 0.009*"wek" + 0.009*"jar" + 0.009*"onderzoek" + 0.009*"jarig" + 0.008*"zak"




And with the TF/IDF model:

In [112]:
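# Note: this model was trained on TF-IDF weights, but here it receives raw counts;
# passing tfidf[bow_corpus[test_doc]] instead would match the training representation.
for index, score in sorted(lda_model_tfidf[bow_corpus[test_doc]], key=lambda tup: -1*tup[1]):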
    print("Topic match score: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, num_of_topics)))

Topic match score: 0.868702232837677	 
Topic: 0.006*"ziekenhuis" + 0.006*"euro" + 0.005*"nederland" + 0.005*"kamer" + 0.005*"werk" + 0.005*"mens" + 0.005*"minister" + 0.005*"mak" + 0.005*"over" + 0.004*"waard"
Topic match score: 0.12427598237991333	 
Topic: 0.008*"gemeent" + 0.006*"gat" + 0.006*"miljoen" + 0.005*"land" + 0.005*"europes" + 0.005*"elkar" + 0.005*"auto" + 0.005*"zegt" + 0.005*"mens" + 0.005*"zegg"




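Calculating the [perplexity score](https://towardsdatascience.com/perplexity-in-language-models-87a196019a94). Note that gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself: values closer to zero are better, and the perplexity (where lower is better) is `2 ** (-bound)`. The TF/IDF model is evaluated here on the raw bag-of-words counts although it was trained on TF/IDF weights, so the two numbers are not strictly comparable: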

In [113]:
print('Perplexity: ', lda_model.log_perplexity(bow_corpus))
print('Perplexity TFIDF: ', lda_model_tfidf.log_perplexity(bow_corpus))



Perplexity:  -6.815109635912172
Perplexity TFIDF:  -7.807528069409568
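
The values printed above are per-word likelihood bounds on a log2 scale. A small sketch of the conversion into perplexity estimates, assuming the relation `perplexity = 2 ** (-bound)`; the helper name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)


# Values copied from the output above.
bow_bound = -6.815109635912172
tfidf_bound = -7.807528069409568

print(f"BOW model perplexity estimate:    {bound_to_perplexity(bow_bound):.1f}")
print(f"TF-IDF model perplexity estimate: {bound_to_perplexity(tfidf_bound):.1f}")
```

On this scale the bag-of-words model comes out with the lower perplexity estimate. Both numbers are computed on the training documents, so they say little about how well the models generalize to unseen text.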


### Exercise - inference

Now please try it on a new document!

Go to a news website, such as [orf.at](https://orf.at/) and copy an article of your choice here:

In [97]:
unseen_document = """FC Twente heeft woensdag in de KNVB-beker ternauwernood gewonnen van VV Katwijk. De tukkers hadden grote moeite met de amateurclub, maar wonnen met 2-3.
FC Twente had in het winderige Katwijk nog een vliegende start en vond in de derde minuut al het net. De bal belandde vanuit een hoekschop voor de voeten van Alec Van Hoorenbeeck, die met een gelukje kon binnenwerken.

Het leek een makkelijke avond te worden voor de tukkers, maar de thuisploeg besloot anders. Na een kwartier stond het plots 1-1 door een fraaie treffer van Robin Schulte. Katwijk, de nummer vier van de Tweede Divisie, kreeg vleugels en was voor rust dicht bij een voorsprong. Twente ontsnapte, omdat het bij Katwijk ontbrak aan nauwkeurigheid in de afronding.

Twente-trainer Joseph Oosting moet in de rust een geweldige speech hebben gegeven, want de bezoekers kwamen na rust binnen vijf minuten met 3-1 voor. De net ingevallen Ricky van Wolfswinkel zorgde op aangeven van Sem Steijn voor de 2-1. Een paar minuten later maakte Bas Kuipers de 3-1.

Katwijk, dat met een vuurwerkshow voor de aftrap enorm had uitgepakt voor de bekeravond, liet het hoofd niet hangen en maakte het Twente lastig. Mohammed Tahiri liet de thuisploeg geloven in een stunt toen hij in de 85e minuut de aansluitingstreffer binnenschoot. Het slotoffensief was niet genoeg om een verlenging af te dwingen.

De volle bekeravond is nog lang niet voorbij. Volg alle duels in ons liveblog."""

bow_vector = dictionary.doc2bow(preprocess(unseen_document))

print("Simply printing the lda_model output would look like this:")
pprint(lda_model[bow_vector])

print("\n\nSo let's make it nicer, by printing the topic contents:")
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))


Simply printing the lda_model output would look like this:
[(0, 0.1868625),
 (1, 0.11276727),
 (2, 0.19892505),
 (3, 0.3630255),
 (7, 0.12559237)]


So let's make it nicer, by printing the topic contents:
Score: 0.36362117528915405	 Topic: 0.021*"miljoen" + 0.016*"twee" + 0.014*"politie" + 0.014*"kwam" + 0.014*"euro"
Score: 0.1987922042608261	 Topic: 0.015*"mens" + 0.013*"nederland" + 0.011*"veilig" + 0.011*"zegt" + 0.011*"ziekenhuis"
Score: 0.17830641567707062	 Topic: 0.011*"mens" + 0.010*"nederland" + 0.010*"grot" + 0.009*"aantal" + 0.009*"land"
Score: 0.1289258599281311	 Topic: 0.019*"mens" + 0.019*"jar" + 0.015*"land" + 0.014*"gemeent" + 0.011*"nieuw"
Score: 0.11748918890953064	 Topic: 0.020*"politie" + 0.014*"gat" + 0.012*"maatregel" + 0.011*"agent" + 0.011*"nieuw"




## Visualization

Finally, there are packages that can visualize the results, such as [pyLDAvis](https://pypi.org/project/pyLDAvis/) and [tmplot](https://pypi.org/project/tmplot/).

Let's take a look at the pyLDAvis visualization result.

**Please note:** this is an old and unmaintained package. It is easier to run it in Google Colab than on your laptop. If you still want to run it locally, please try **lowering your Python version** (3.6 / 3.7 / 3.8) when you create the poetry environment for this exercise.

In [114]:
%pip install pyLDAvis





In [115]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

bow_lda_data = gensimvis.prepare(lda_model, bow_corpus, dictionary)

pyLDAvis.display(bow_lda_data)


Try changing the parameters to get a *satisfying level of clustering*.  
Which parameters worked best for the language you chose?

In [None]:
'''
My observations about the hyperparameters:

I had to change the hyperparameters because a lot of tokens were filtered out with the initial ones:
I only used 50 articles when I tried this LDA out for the first time, and filtering out tokens that
appear in fewer than 15 of those 50 articles reduces the number of tokens drastically
(I was left with only 2!).
So I lowered this hyperparameter,
went back and chose 200 articles randomly instead, and will continue working with those 200 articles.

The hyperparameters affect the model in the following way:
- no_below=filter_tokens_if_container_documents_are_less_than: excludes (filters) tokens that occur in fewer
than x documents. If x is 5, then a word must appear in 5 or more articles. This is helpful to remove
rare words that might not be informative and might create noise in the model.

- no_above=filter_tokens_if_appeared_percentage_more_than: filters tokens that occur in more than the
fraction x of the documents. Hence, this will remove words that occur in most documents and might not be
decisive for topic modeling. If x is 0.8, a word must occur in 80% or less of the articles. This is helpful to
remove very common words which might be less useful for differentiating between topics.

- keep_n=keep_the_first_n_tokens: keeps the most frequent x tokens (after filtering).

My implementation:
lower no_below: Since the corpus is still rather small (200 articles which are not super long), I set the no_below
hyperparameter a little lower because a high value might exclude rare but important terms. In 200 short articles on
(hopefully) differing topics, what are the chances that a word appears in a high number of them? They are low!

higher no_above: The Dutch language contains a lot of commonly used/frequent compound words. I read online that
since that is the case, it is wise to set the no_above a little higher so that you still include common compounds
without letting overly frequent terms (that might not be informative) influence the LDA too much.

I left the keep_n hyperparameter as is because I did not get that many tokens anyway, due to the smaller dataset and
rather short articles.

About the visualisation: I read online that clusters that do not overlap extensively demonstrate successful clustering.
This explains why in my first trial with only 50 articles and not-so-ideal hyperparameters, the clusters were
almost all at the right top corner overlapping a lot. Now with 200 articles and better tuned hyperparameters,
the clusters look more independent.

As a next step, one could also play around with the number of topics, and refine the preprocessing pipeline.
'''