<div class="alert alert-success" style="font-size: 16px;">
    <div style="font-size: 22px; font-weight:bold; margin: 10px 0px;">Topic Modeling</div>
After preprocessing, we vectorize the cleaned text and train the topic model on it.
</div>

## Imports

In [1]:
# Run once to install the spaCy models:
#!py -m spacy download nl_core_news_sm
#!py -m spacy download en_core_web_sm

from tqdm.notebook import tqdm
import spacy

from conferences import get_conference_data

# Load the Dutch pipeline only once; reuse it if this cell is re-run
try:
    nlp
except NameError:
    nlp = spacy.load("nl_core_news_sm")

## Get Data

In [2]:
conferences_data = get_conference_data()

In [3]:
text_by_Rutte = [conference['text'] for conference in conferences_data[0]]
text_by_de_jonge = [conference['text'] for conference in conferences_data[1]]

# Combine both speakers' texts into one flat list of documents
merged_texts = text_by_Rutte + text_by_de_jonge

## Tokenize Texts

In [4]:
%%time
# Run the spaCy pipeline over all texts; NER and the parser are not used later, so they are disabled
processed_texts = [doc for doc in tqdm(nlp.pipe(merged_texts, n_process=1, disable=["ner", "parser"]),
                                       total=len(merged_texts))]

  0%|          | 0/58 [00:00<?, ?it/s]

Wall time: 1min 24s


In [1]:
# processed_texts[:10]

In [2]:
# Keep the lemma of every token that is neither a stop word nor punctuation
tokenized_texts_lem = [
    [word.lemma_ for word in processed_text if not word.is_stop and not word.is_punct]
    for processed_text in processed_texts
]
tokenized_texts_lem[:10]

NameError: name 'processed_texts' is not defined

## Vectorization

In [11]:
from gensim.corpora import Dictionary

MIN_DF = 1    # minimum document frequency (absolute number of documents)
MAX_DF = 0.6  # maximum document frequency (fraction of all documents)

dictionary = Dictionary(tokenized_texts_lem) # get the vocabulary
dictionary.filter_extremes(no_below=MIN_DF, 
                           no_above=MAX_DF)
corpus = [dictionary.doc2bow(text) for text in tokenized_texts_lem]
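
To make the two thresholds concrete, here is a minimal pure-Python sketch of the idea behind the document-frequency filter and the bag-of-words conversion. It is an illustration only, not gensim's implementation: the helpers `build_vocabulary` and `to_bow` and the toy documents are hypothetical, and the token ids will not match the ones `Dictionary` assigns.

```python
from collections import Counter


def build_vocabulary(docs, no_below=1, no_above=0.6):
    """Map each kept token to an integer id.

    A token is kept if it occurs in at least `no_below` documents (absolute
    count) and in at most `no_above` of all documents (fraction).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(
        token for token, df in doc_freq.items()
        if no_below <= df <= no_above * n_docs
    )
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, vocabulary):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token for token in doc if token in vocabulary)
    return sorted((vocabulary[token], count) for token, count in counts.items())


# Hypothetical toy corpus: "corona" occurs in every document
toy_docs = [
    ["corona", "maatregel", "school"],
    ["corona", "vaccin", "vaccin"],
    ["corona", "school", "avondklok"],
]

vocabulary = build_vocabulary(toy_docs, no_below=1, no_above=0.7)
bows = [to_bow(doc, vocabulary) for doc in toy_docs]

print(vocabulary)
print(bows)
```

In this toy corpus "corona" is dropped because it appears in all three documents, which is above the 70% limit. Words that occur in nearly every document tell the model little about how topics differ, which is the motivation for an upper bound such as `MAX_DF`.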

## Train Model

In [12]:
# Note: gensim.models.wrappers was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.models.wrappers import LdaMallet

PATH_TO_MALLET = r'C:/mallet/bin/mallet.bat'

N_TOPICS = 2
N_ITERATIONS = 1000 # usually 1000 will do

lda = LdaMallet(PATH_TO_MALLET,
                corpus=corpus,
                id2word=dictionary,
                num_topics=N_TOPICS,
                optimize_interval=10,
                iterations=N_ITERATIONS)

## Show Topics

In [15]:
for topic in range(N_TOPICS):
    words = lda.show_topic(topic, 20)  # the 20 highest-weighted (word, weight) pairs
    topic_n_words = ' '.join(word for word, _ in words)
    print(f'Topic {topic}: {topic_n_words}')

Topic 0: prikken daadwerkelijk maat gezondheidsraad toegangstest levering regio vaccinatiegraad bescherming besmettingsgraad 60 kortom astrazeneca verantwoorden naam fase huisarts risiconiveau leveren organiseren
Topic 1: uur hugo jonge avondklok pakket lockdown school winkel   maximaal corona helaas kind variant ervoor afstand burgemeester maart feit lastig
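
Topic 1 appears to contain a whitespace-only entry (between "winkel" and "maximaal"), and Topic 0 contains the bare number "60". One possible way to keep such entries out of the vocabulary is to clean the lemma lists before building the dictionary. The following is a minimal sketch, assuming plain lists of strings like `tokenized_texts_lem`; the helper name `clean_lemmas` is hypothetical and not part of this notebook.

```python
def clean_lemmas(lemmas):
    """Drop whitespace-only and purely numeric entries from a list of lemmas."""
    return [
        lemma for lemma in lemmas
        if lemma.strip() and not lemma.strip().isdigit()
    ]


# Hypothetical sample that mimics the noise visible in the topics above
sample = ["winkel", " ", "maximaal", "60", "astrazeneca", "\n"]
cleaned = clean_lemmas(sample)
print(cleaned)
```

It could be applied as `tokenized_texts_lem = [clean_lemmas(doc) for doc in tokenized_texts_lem]` before the vectorization step. Alternatively, the same effect can probably be achieved during tokenization with spaCy's `token.is_space` and `token.like_num` attributes. Either change alters the vocabulary, so the topics would need to be retrained.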


## Visualisation with pyLDAvis

In [14]:
import gensim

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()

lda_conv = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda)

gensimvis.prepare(lda_conv, corpus, dictionary)



In [None]:
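# TEST: inspect part-of-speech tags and dependency labels on a few sample sentences

# Assumption: nlp2 is an English pipeline; it was never defined in this notebook
nlp2 = spacy.load("en_core_web_sm")

texts = ['Imagine this is not a list of two texts', 'But like millions!']
dutch_texts = ['Dus dat weet u eigenlijk nog niet.',
               'En u weet eigenlijk ook niet hoeveel beschermingsmiddelen u dan nodig heeft?']

# Keep the parser enabled: token.dep_ is empty when the parser is disabled
processed_texts2 = list(nlp2.pipe(texts, n_process=1, disable=["ner"]))

for text in processed_texts2:
    for token in text:
        print(token.text, token.pos_, token.dep_)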