# Topic Modeling Visualisierung der Amtspresse mit Python

Das folgende Notebook ist entstanden in Anlehnung an:

Blogpost:

https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

Notebooks:

https://github.com/kapadias/mediumposts/blob/master/nlp/published_notebooks/Introduction%20to%20Topic%20Modeling.ipynb

https://github.com/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb


### Topic Modeling with LDA Implementation

1. Loading data
2. Data cleaning
3. Convert to Document Term Matrix
4. Train LDA Model
5. Create Visualization

## Loading data

In [1]:
# Importing modules
import pandas as pd
import re

import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import warnings
warnings.simplefilter("ignore", DeprecationWarning)


In [3]:
# Read data 
path_to_csv_file = ### path to csv file
df = pd.read_csv(path_to_csv_file)
df.head()

Unnamed: 0,filename,full_date,year,year_month,article_count,article_counts_per_issue,article_length_chars,article_length_words,newspaper,headline,article_text
0,1863-07-01_01.txt,1863-07-01,1863,1863-07,1,9,7101,955,PC,Ist das Abgeordnetenhaus eine Obrigkeit?,Ist das Abgeordnetenhaus eine Obrigkeit?\nEin ...
1,1863-07-01_02.txt,1863-07-01,1863,1863-07,2,9,1083,129,PC,Wochenschau.,Wochenschau.\nDie Nachrichten aus Karlsbad übe...
2,1863-07-01_03.txt,1863-07-01,1863,1863-07,3,9,546,66,PC,Wochenschau.,Ihre Majestät die Königin Augusta verweilt noc...
3,1863-07-01_04.txt,1863-07-01,1863,1863-07,4,9,2145,310,PC,Wochenschau.,Se. Königl. Hoheit der Kronprinz hat von Litth...
4,1863-07-01_05.txt,1863-07-01,1863,1863-07,5,9,1904,260,PC,Wochenschau.,Es muß der Regierung zu großer Befriedigung ge...


## Data Cleaning


In [4]:
# Remove the columns
columns_to_drop = ['filename', 'article_count', 'article_counts_per_issue', 'article_length_chars', 'article_length_words']
df = df.drop(columns=columns_to_drop, axis=1)

df.head()

Unnamed: 0,full_date,year,year_month,newspaper,headline,article_text
0,1863-07-01,1863,1863-07,PC,Ist das Abgeordnetenhaus eine Obrigkeit?,Ist das Abgeordnetenhaus eine Obrigkeit?\nEin ...
1,1863-07-01,1863,1863-07,PC,Wochenschau.,Wochenschau.\nDie Nachrichten aus Karlsbad übe...
2,1863-07-01,1863,1863-07,PC,Wochenschau.,Ihre Majestät die Königin Augusta verweilt noc...
3,1863-07-01,1863,1863-07,PC,Wochenschau.,Se. Königl. Hoheit der Kronprinz hat von Litth...
4,1863-07-01,1863,1863-07,PC,Wochenschau.,Es muß der Regierung zu großer Befriedigung ge...


## Remove punctuation/lower casing

In [5]:
with open('german_stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = f.read()


In [None]:
nlp = spacy.load('de_core_news_sm')

def lemmatizer(text, nlp):
    '''Lemmatize words.
    Input: Text
    Return: lemmatized text'''
        
    doc = nlp(str(text))
    return ' '.join([token.lemma_ for token in doc])

In [None]:
%%time
# Lemmatization
df.loc[:,'article_text_processed'] = df.loc[:,'article_text'].apply(lambda x: lemmatizer(x, nlp))

# Print out the first rows of papers
df.loc[:,'article_text_processed'].head(15)                                                                     

In [None]:
df.to_pickle('article_text_lemmatized.p')

In [6]:
# Remove punctuation
df.loc[:,'article_text_processed'] = df.loc[:,'article_text'].map(lambda x: re.sub('[,\.!?:;-]', '', str(x)))

# Replace \n with empty space
df.loc[:,'article_text_processed'] = df.loc[:,'article_text_processed'].map(lambda x: x.replace('\n', ' '))

# Convert the titles to lowercase
df.loc[:,'article_text_processed'] = df.loc[:,'article_text_processed'].map(lambda x: x.lower())

# Remove Stopwords
df.loc[:,'article_text_processed'] = df.loc[:,'article_text_processed'].apply(lambda x: ' '.join([item for item in x.split() if item not in stopwords]))

# w for w in word_tokens if not w in stop_words
# apply(lambda x: [item for item in x if item not in stopwords])

# Print out the first rows of papers
df.loc[:,'article_text_processed'].head()

  df.loc[:,'article_text_processed'] = df.loc[:,'article_text'].map(lambda x: re.sub('[,\.!?:;-]', '', str(x)))


0    abgeordnetenhaus obrigkeit weithin bekannter g...
1    wochenschau nachrichten karlsbad befinden unse...
2    majestät königin augusta verweilt england besu...
3    königl hoheit kronprinz litthauen reise fortge...
4    regierung befriedigung gereichen preßverordnun...
Name: article_text_processed, dtype: object

## Convert to Document Term Matrix

In [7]:
docs_raw = df.loc[:,'article_text_processed']

print(len(docs_raw))
print(type(docs_raw))

29142
<class 'pandas.core.series.Series'>


In [8]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(29142, 34762)


## Train LDA-Model

In [9]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tf.fit(dtm_tf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=20, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [10]:
# Help function for printing Topics
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

print("Topics found via LDA:")
print_topics(lda_tf, tf_vectorizer, 10)

Topics found via LDA:

Topic #0:
industrie lage mark wirthschaftlichen ausfuhr jahres preise handel einfuhr letzten

Topic #1:
herr freisinnigen herren partei richter sagen herrn lassen politik weise

Topic #2:
deutschen deutschland deutsche staaten landwirthschaft frankreich zoll waaren getreide england

Topic #3:
kaiser majestat berlin bismarck kaisers konig deutschen kronprinzen schreiben gesellschaft

Topic #4:
arbeiter zahl personen kinder fallen namentlich einzelnen bestimmungen kreise arbeiten

Topic #5:
arbeiter socialdemokratie gesellschaft arbeit sozialdemokratie arbeitern socialdemokratischen socialdemokraten partei mark

Topic #6:
mark zahl landwirthschaft landwirthschaftlichen bevolkerung personen betrug theil landlichen namentlich

Topic #7:
kaiser majestat berlin konig prinzen kaiserin wilhelm prinz kaisers koniglichen

Topic #8:
regierung kirche partei katholischen parteien nationalliberalen wahlen liberalen staat fortschrittspartei

Topic #9:
berlin hoheren personen ve

## Create Visualiztion with pyLDAvis

In [12]:
vis = pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)
vis

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


In [13]:
pyLDAvis.save_html(vis, 'pylda_amtspresse.html')

## Convert to Document Term Matrix with Tf-Idf

In [None]:
df = pd.read_pickle('article_text_processed.p')

In [None]:
docs_raw = df.loc[:,'article_text_processed']

print(len(docs_raw))
print(type(docs_raw))

In [None]:
tfidf_vectorizer = TfidfVectorizer(strip_accents = 'unicode',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

In [None]:
with open('dtm_tfidf_vectorizer.p', 'wb') as f:
    pickle.dump(dtm_tfidf, f)

In [None]:
%%time
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)

In [None]:
with open('lda_tfidf_model.p', 'wb') as f:
    pickle.dump(lda_tfidf, f)

In [None]:
print("Topics found via LDA:")
print_topics(lda_tfidf, tfidf_vectorizer, 10)