# Topic Modelling with LDA

1. For topic modelling with LDA we use `gensim.models.LdaModel`.
2. The data preparation functions are defined in `src/prepare_dataset.py`:
    - The raw data is loaded from scikit-learn with `load_tokenize_texts()` and preprocessed with `preprocess_texts()`.
    - The vocabulary is built from the training set in `split_and_create_voca_from_trainset()`.
    - The bag-of-words representations are then created with `create_bow_and_savebow_for_each_set()`.
3. The evaluation functions are defined in `src/evaluierung.py`:
    - Topic coherence
    - Topic diversity
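Topic diversity is commonly defined as the share of unique words among the top words of all topics. The sketch below follows that common definition; it is an assumption about what `topicDiversity` computes, since the implementation in `src/evaluierung.py` is not shown here, and the name `topic_diversity` and the `top_k` parameter are hypothetical. The diversity values in the results table are all multiples of 1/250, which is at least consistent with this definition for 10 topics of 25 words each.

```python
def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of all topics.

    topics: list of topics, each a list of words ordered by weight.
    Returns a value in (0, 1]; 1.0 means no word is shared between topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy example: "space" appears in both topics, so 5 of 6 words are unique.
toy_topics = [
    ["space", "nasa", "launch"],
    ["god", "church", "space"],
]
print(topic_diversity(toy_topics, top_k=3))
```

A value near 1 indicates that the topics share few top words; a value near 0 indicates heavily redundant topics.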

In [1]:
import pandas as pd
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation, strip_numeric

In [2]:
from src.prepare_dataset import TextDataLoader
from src.evaluierung import topicCoherence2, topicDiversity
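The implementation of `topicCoherence2` is not part of this notebook. A widely used coherence measure is the average normalized pointwise mutual information (NPMI) of all word pairs in a topic, estimated from document co-occurrence counts. The following is a minimal sketch under that assumption; the function name `npmi_coherence` and its signature are hypothetical and may differ from what `src/evaluierung.py` does.

```python
import math
from itertools import combinations


def npmi_coherence(topics, docs):
    """Average NPMI over all word pairs per topic, then averaged over topics.

    topics: list of word lists; docs: list of tokenized documents.
    """
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w_i, w_j in combinations(topic, 2):
            p_ij = doc_freq(w_i, w_j) / n_docs
            if p_ij == 0:
                pair_scores.append(-1.0)  # words never co-occur
            elif p_ij == 1.0:
                pair_scores.append(1.0)  # convention: both words in every document
            else:
                p_i = doc_freq(w_i) / n_docs
                p_j = doc_freq(w_j) / n_docs
                pmi = math.log(p_ij / (p_i * p_j))
                pair_scores.append(pmi / -math.log(p_ij))
        if pair_scores:
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores) if topic_scores else 0.0


toy_docs = [["a", "b"], ["a", "b"], ["c"], ["c"]]
print(npmi_coherence([["a", "b"]], toy_docs))  # words that always co-occur
print(npmi_coherence([["a", "c"]], toy_docs))  # words that never co-occur
```

NPMI ranges from -1 (the words never co-occur) to 1 (they only occur together). This version scans every document for every pair, which is adequate for a sketch but slow on a corpus of the size used here.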

In [3]:
def get_data(min_df):
    """Load 20 Newsgroups, preprocess it, and build the vocabulary and BoW sets."""
    stopwords_filter = True
    textsloader = TextDataLoader(source="20newsgroups",
                                 train_size=None, test_size=None)
    textsloader.load_tokenize_texts("20newsgroups")
    textsloader.preprocess_texts(length_one_remove=True,
                                 punctuation_lower=True,
                                 stopwords_filter=stopwords_filter)
    # The vocabulary is built from the training set; min_df is the parameter under study.
    textsloader.split_and_create_voca_from_trainset(
        max_df=0.7, min_df=min_df, stopwords_remove_from_voca=stopwords_filter)

    # Only id2word and the training BoW are needed for LDA; the other return values are ignored.
    _, id2word, train_set, _, _ = textsloader.create_bow_and_savebow_for_each_set(
        for_lda_model=True, normalize=True)
    docs_tr, _, _ = textsloader.get_docs_in_words_for_each_set()
    textsloader.write_info_vocab_to_text()
    del textsloader
    return docs_tr, train_set, id2word

def get_lda_topics(train_set, id2word, num_topics):
    """Train an LDA model and return the top 25 words of each topic as token lists."""
    ldamodel = LdaModel(train_set, num_topics=num_topics, id2word=id2word, random_state=42)
    # show_topics returns (topic_id, formatted_string) pairs.
    lda_topics = ldamodel.show_topics(num_topics=num_topics, num_words=25)
    # Strip weights and punctuation from each formatted string so that only the words remain.
    filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]
    topics = [preprocess_string(topic_string, filters) for _, topic_string in lda_topics]
    return topics
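`get_lda_topics` recovers the topic words by stripping punctuation and digits from the formatted strings that `show_topics` returns, which look like `0.012*"max" + 0.010*"university"`. One side effect is that vocabulary entries containing digits or punctuation are altered too (a token such as `3d` would likely come out as `d`). As a possible alternative, the sketch below extracts only the quoted words with a regular expression; the helper name `words_from_topic_string` is hypothetical and not part of the original code.

```python
import re

# Matches the text between double quotes in a formatted topic string.
QUOTED_WORD = re.compile(r'"([^"]+)"')


def words_from_topic_string(topic_string):
    """Return the words of a formatted topic string, in order of weight."""
    return QUOTED_WORD.findall(topic_string)


example = '0.012*"max" + 0.010*"university" + 0.008*"3d"'
print(words_from_topic_string(example))  # ['max', 'university', '3d']
```

Alternatively, `LdaModel.show_topic(topic_id, topn)` returns `(word, probability)` pairs directly, which avoids string parsing altogether.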

In [4]:
num_topics = 10
min_dfs = [2,5,10,30,100]

df = pd.DataFrame()
df['min_df'] = min_dfs
vocab_sizes = []
coherences = []
diversities = []

for min_df in min_dfs:
    print(100*"-")
    docs_tr, train_set, id2word = get_data(min_df)
    vocab_sizes.append(len(id2word.keys()))
    topics = get_lda_topics(train_set, id2word, num_topics)
    for i, topic in enumerate(topics):
        ws = " ".join(topic)
        print(f'topic: {i} {ws}')
    tc = topicCoherence2(topics,len(topics),docs_tr,len(docs_tr))
    td = topicDiversity(topics)
    coherences.append(tc)
    diversities.append(td)
    print(100*"-")
    
df['vocab_size'] = vocab_sizes
df['coherence'] = coherences
df['diversity'] = diversities

loading texts: ...
train-size after loading: 11314
test-size after loading: 7532
finished load!
start: preprocessing: ...
finised: preprocessing!
vocab-size in df: 57326
start creating vocabulary ...
length of the vocabulary: 54318
length word2id list: 54318
length id2word list: 54318
finished: creating vocabulary
train-size-after-all: 11214
test-size-after-all: 7532
validation-size-after-all: 100
test-size-after-all: 11214
test-indices-length: 11214
test-size-after-all: 100
test-indices-length: 100
test-size-after-all: 7532
test-indices-length: 7532
length train-documents-indices : 1420825
length of the vocabulary: 54318


start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

compact representation for LDA
to

train-size-after-all: 11214
test-size-after-all: 7532
validation-size-after-all: 100
test-size-after-all: 11214
test-indices-length: 11214
test-size-after-all: 100
test-indices-length: 100
test-size-after-all: 7532
test-indices-length: 7532
length train-documents-indices : 1150368
length of the vocabulary: 8496


start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

compact representation for LDA
topic: 0 max university posting nntp writes host article pl colorado washington wm uiuc ca distribution sl cso hp edm db de time se alaska cars apr
topic: 1 mr writes people article president time kuwait university org key government cs back state posting made nntp host law good question fbi years today day
topic: 2 w

In [5]:
df

| | min_df | vocab_size | coherence | diversity |
|---|---|---|---|---|
| 0 | 2 | 54318 | 0.140694 | 0.608 |
| 1 | 5 | 29874 | 0.143413 | 0.64 |
| 2 | 10 | 18677 | 0.162169 | 0.664 |
| 3 | 30 | 8496 | 0.125755 | 0.632 |
| 4 | 100 | 3102 | 0.13094 | 0.636 |
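The vocabulary shrinks from 54318 words at `min_df=2` to 3102 words at `min_df=100` because rare words are dropped. The sketch below illustrates this kind of document-frequency filtering. It assumes that `min_df` is an absolute document count and `max_df` a fraction of documents, following the convention of scikit-learn's `CountVectorizer`; whether `split_and_create_voca_from_trainset` interprets them the same way is not shown here, and `build_vocabulary` is a hypothetical name.

```python
from collections import Counter


def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep words that occur in at least min_df documents
    and in at most a fraction max_df of all documents."""
    n_docs = len(docs)
    # Count each word once per document (document frequency, not term frequency).
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )


toy_docs = [
    ["space", "nasa", "the"],
    ["space", "launch", "the"],
    ["church", "the"],
    ["space", "the"],
]
# "the" is in every document (dropped by max_df); "nasa", "launch", "church"
# are in one document each (dropped by min_df=2).
print(build_vocabulary(toy_docs, min_df=2, max_df=0.8))  # ['space']
```

Raising `min_df` removes the long tail of rare words, while `max_df` removes words that are too common to discriminate between topics.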
