# Topic Modelling with LDA

1. For topic modelling with LDA we use `gensim.models.LdaModel`.
2. The data-preparation functions are located in `src/prepare_dataset.py`:
    - The raw data is loaded from sklearn with `load_tokenize_texts()` and preprocessed with `preprocess_texts()`.
    - The vocabulary is built from the training set.
    - The bag-of-words representations are created in `split_and_create_voca_from_trainset`.
3. The evaluation functions are located in `src/evaluierung.py`:
    - Topic Coherence
    - Topic Diversity
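
As a rough illustration of the vocabulary step in item 2, the sketch below builds a vocabulary from training documents only and converts each document to sparse bag-of-words pairs. The helper names `build_vocab` and `to_bow` are hypothetical, and the thresholds are assumed to follow the scikit-learn convention (integer `min_df` as an absolute document count, float `max_df` as a fraction of documents). The actual `TextDataLoader` implementation may differ.

```python
from collections import Counter


def build_vocab(train_docs, min_df=2, max_df=0.7):
    """Keep words with document frequency >= min_df (absolute count)
    and <= max_df (fraction of the training documents)."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc))  # count each word once per document
    kept = sorted(w for w, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df)
    word2id = {w: i for i, w in enumerate(kept)}
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


def to_bow(doc, word2id):
    """Return sorted (word_id, count) pairs; out-of-vocabulary words are dropped."""
    counts = Counter(word2id[w] for w in doc if w in word2id)
    return sorted(counts.items())


# Toy training set (assumption: documents are already tokenized).
train_docs = [
    ["car", "engine", "car", "door"],
    ["car", "bumper", "engine"],
    ["car", "god", "bible"],
    ["car", "engine", "bible"],
]
word2id, id2word = build_vocab(train_docs, min_df=2, max_df=0.8)
bows = [to_bow(d, word2id) for d in train_docs]
```

Here `car` is dropped because it appears in every document, and `door`, `bumper`, and `god` are dropped because they appear in only one. Raising `min_df` shrinks the vocabulary in the same way, which is the effect the experiment in this notebook varies.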

In [1]:
from IPython.display import clear_output
import pandas as pd
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation, strip_numeric
from pathlib import Path
from tqdm import tqdm
import gensim
from gensim.models import LdaModel

In [2]:
from src.prepare_dataset import TextDataLoader
from src.evaluierung import topicCoherence2, topicDiversity
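
The two imported metrics are not shown in this notebook. The sketch below gives one common definition of each, as an assumption about what `topicDiversity` and `topicCoherence2` compute rather than a copy of them: diversity as the fraction of unique words among the top words of all topics, and coherence as the average normalized pointwise mutual information (NPMI) of word pairs, estimated from document co-occurrence. The function names `topic_diversity` and `npmi_coherence` are hypothetical.

```python
import math
from itertools import combinations


def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of all topics."""
    top_words = [w for topic in topics for w in topic[:top_k]]
    return len(set(top_words)) / len(top_words)


def npmi_coherence(topics, docs, top_k=10):
    """Average NPMI over all word pairs within each topic's top_k words."""
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for topic in topics:
        for w1, w2 in combinations(topic[:top_k], 2):
            p1 = doc_count(w1) / n_docs
            p2 = doc_count(w2) / n_docs
            p12 = doc_count(w1, w2) / n_docs
            if p12 == 0:
                scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
            elif p12 == 1:
                scores.append(0.0)   # NPMI is undefined (0/0); treated as 0 here
            else:
                pmi = math.log(p12 / (p1 * p2))
                scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy data (assumption: tokenized documents and topics as word lists).
docs = [["god", "bible"], ["god", "bible"], ["car", "engine"], ["car", "engine"]]
coherent = npmi_coherence([["god", "bible"], ["car", "engine"]], docs)
incoherent = npmi_coherence([["god", "car"]], docs)
diversity = topic_diversity([["a", "b"], ["b", "c"]])  # 3 unique of 4 words: 0.75
```

With 10 topics of 25 words each, this diversity definition yields multiples of 1/250. The diversity values reported at the end of this notebook are such multiples (for example 0.628 = 157/250), which is consistent with this definition but does not confirm it.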

In [3]:
def get_data(min_df):
    """Load 20 Newsgroups, build the vocabulary for the given min_df, and return the LDA inputs."""
    stopwords_filter = True
    use_bert_embedding = False
    textsloader = TextDataLoader(source="20newsgroups",
                                 train_size=None, test_size=None)
    textsloader.load_tokenize_texts("20newsgroups")
    textsloader.preprocess_texts(length_one_remove=True,
                                 punctuation_lower=True,
                                 stopwords_filter=stopwords_filter,
                                 use_bert_embedding=use_bert_embedding)
    # The vocabulary is built from the training set only.
    textsloader.split_and_create_voca_from_trainset(max_df=0.7, min_df=min_df,
                                                    stopwords_remove_from_voca=stopwords_filter)

    # Only id2word and the training set are needed here; the other splits are discarded.
    _, id2word, train_set, *_ = textsloader.create_bow_and_savebow_for_each_set(
        for_lda_model=True, normalize=True)
    docs_tr, _, _ = textsloader.get_docs_in_words_for_each_set()
    textsloader.write_info_vocab_to_text()
    del textsloader
    return docs_tr, train_set, id2word

def get_lda_topics(train_set, id2word, num_topics):
    """Train LDA and return the top 25 words of each topic as a list of word lists."""
    ldamodel = LdaModel(train_set, num_topics=num_topics, id2word=id2word, random_state=42)
    lda_topics = ldamodel.show_topics(num_topics=num_topics, num_words=25)
    # Each entry is (topic_id, formatted string); the filters strip the weights and punctuation.
    filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]
    topics = [preprocess_string(topic[1], filters) for topic in lda_topics]
    return topics
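
By default, `show_topics` returns each topic as a formatted string of weighted, quoted words, which is why `get_lda_topics` has to strip numbers and punctuation. As an alternative sketch, the words can be pulled out of the quotes directly with a regular expression. The helper name `words_from_topic_string` and the weights in the example string are made up for illustration.

```python
import re


def words_from_topic_string(topic_str):
    """Extract the quoted words from a formatted gensim topic string."""
    return re.findall(r'"([^"]+)"', topic_str)


# Hypothetical weights; real values come from the trained model.
example = '0.015*"god" + 0.010*"people" + 0.008*"jesus"'
words = words_from_topic_string(example)
print(words)  # ['god', 'people', 'jesus']
```

Unlike the `strip_punctuation` and `strip_numeric` filters, this leaves vocabulary words that contain digits or punctuation unchanged. gensim also offers `LdaModel.show_topic(topicid, topn)`, which returns `(word, probability)` pairs and avoids string parsing altogether.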

In [None]:
num_topics = 10
min_dfs = [2,5,10,30,100]

In [4]:
df = pd.DataFrame()
df['min_df'] = min_dfs
vocab_sizes = []
coherences = []
diversities = []

for min_df in min_dfs:
    print(100*"-")
    docs_tr, train_set, id2word = get_data(min_df)
    vocab_sizes.append(len(id2word.keys()))
    topics = get_lda_topics(train_set, id2word, num_topics)
    for i, topic in enumerate(topics):
        ws = " ".join(topic)
        print(f'topic: {i} {ws}')
    tc = topicCoherence2(topics,len(topics),docs_tr,len(docs_tr))
    td = topicDiversity(topics)
    coherences.append(tc)
    diversities.append(td)
    print(100*"-")
    
df['vocab_size'] = vocab_sizes
df['coherence'] = coherences
df['diversity'] = diversities

----------------------------------------------------------------------------------------------------
loading texts: ...
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





train-size after loading: 11314
test-size after loading: 7532
finished load!
start: preprocessing: ...
preprocessing step: remove stopwords
finised: preprocessing!
vocab-size in df: 18677
preprocessing remove stopwords from vocabulary
start creating vocabulary ...
length of the vocabulary: 18677
length word2id list: 18677
length id2word list: 18677
finished: creating vocabulary
save docs in txt...
save docs finished
train-size-after-all: 11214
test-size-after-all: 7532
validation-size-after-all: 100
test-size-after-all: 11214
test-indices-length: 11214
test-size-after-all: 100
test-indices-length: 100
test-size-after-all: 7532
test-indices-length: 7532
length train-documents-indices : 1299485
length of the vocabulary: 18677


start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

compact representation for LDA
save docs in txt...
save docs finished
topic: 0 god max people jesus mr bible christ christian church time christians writes life good things article man children made day point love religion koresh christianity
topic: 1 writes article posting nntp host ca space nasa university gov news cs hp distribution berkeley cc chi game la good nj gatech technology time org
topic: 2 article writes university posting nntp host world distribution henry colorado sun cleveland reply usa toronto time cwru cs turkish ed bike armenians umd freenet washington
topic: 3 writes article game team cs year university posting ca games host nntp good season win hockey time play league players gm david nhl player sgi
topic: 4 writes university posting host 

In [5]:
df

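|   | min_df | vocab_size | coherence | diversity |
|---|--------|------------|-----------|-----------|
| 0 | 2      | 54318      | 0.115729  | 0.628     |
| 1 | 5      | 29874      | 0.142487  | 0.584     |
| 2 | 10     | 18677      | 0.134361  | 0.604     |
| 3 | 30     | 8496       | 0.110175  | 0.676     |
| 4 | 100    | 3102       | 0.129724  | 0.644     |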
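
To read the results programmatically, the sketch below copies the five rows above into plain Python tuples and picks the setting with the highest value for each metric. The variable names are illustrative only.

```python
# (min_df, vocab_size, coherence, diversity), copied from the table above
results = [
    (2, 54318, 0.115729, 0.628),
    (5, 29874, 0.142487, 0.584),
    (10, 18677, 0.134361, 0.604),
    (30, 8496, 0.110175, 0.676),
    (100, 3102, 0.129724, 0.644),
]

best_coherence = max(results, key=lambda r: r[2])
best_diversity = max(results, key=lambda r: r[3])

print(f"highest coherence: min_df={best_coherence[0]} ({best_coherence[2]})")
print(f"highest diversity: min_df={best_diversity[0]} ({best_diversity[3]})")
```

Coherence is highest at `min_df=5` and diversity at `min_df=30`, and neither metric changes monotonically with `min_df` in this run. Inside the notebook, the same lookup can be done on the DataFrame with `df.loc[df['coherence'].idxmax()]`.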
