# Topics – Easy Topic Modeling in Python

The text mining technique **Topic Modeling** has become a popular statistical method for clustering documents. This [Jupyter notebook](http://jupyter.org/) introduces a step-by-step workflow that covers data preprocessing, the actual topic modeling using **latent Dirichlet allocation** (LDA), which learns the relationships between words, topics, and documents, and some interactive visualizations to explore the model.

LDA, introduced in the context of text analysis in [2003](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf), is an instance of a more general class of models called **mixed-membership models**. The model involves a number of distributions and parameters, is typically fitted using [Gibbs sampling](https://en.wikipedia.org/wiki/Gibbs_sampling) with conjugate priors, and is based purely on word frequencies.

Numerous introductions to topic modeling have been written for humanists (e.g. [this one](http://scottbot.net/topic-modeling-for-humanists-a-guided-tour/)), which provide more detail on its technical and epistemic properties.

For this workflow, you will need a corpus (a set of texts) as plain text (`.txt`) or [TEI XML](http://www.tei-c.org/index.xml) (`.xml`). Using the `dariah_topics` package, you also have the ability to process the output of [DARIAH-DKPro-Wrapper](https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper), a command-line tool for *natural language processing*.

Topic modeling works best with very large corpora. The [TextGrid Repository](https://textgridrep.org/) is a great place to start searching for text data. To demonstrate the technique, we provide a small text collection in the folder `grenzboten_sample` containing 15 diary excerpts and 15 war diary excerpts, which appeared in *Die Grenzboten*, a German periodical of the 19th and early 20th century.

**Of course, you can work with your own corpus in this notebook.**

We're relying on the LDA implementation by [Allen B. Riddell](https://www.ariddell.org/), called [lda](http://pythonhosted.org/lda/index.html), which is very lightweight. Aside from that, we provide two more Jupyter notebooks:

* [IntroducingMallet](IntroducingMallet.ipynb), using LDA by [MALLET](http://mallet.cs.umass.edu/topics.php), which is known to be very robust. 
* [IntroducingGensim](IntroducingGensim.ipynb), using LDA by [Gensim](https://radimrehurek.com/project/gensim/), which is attractive because of its multi-core support.

For more information in general, have a look at the [documentation](http://dev.digital-humanities.de/ci/job/DARIAH-Topics/doclinks/1/).

## First step: Installing dependencies

To work within this Jupyter notebook, you will have to import the `dariah_topics` library. As you do, `dariah_topics` also imports a couple of external libraries, which have to be installed first. `pip` is the preferred installer program in Python. Starting with Python 3.4, it is included by default with the Python binary installers. If you are interested in `pip`, have a look at [this website](https://docs.python.org/3/installing/index.html).

You have the ability to install dependencies via `pip` from within this notebook. To get a feeling for working with Jupyter, copy and paste (or best: type) the following code snippet in the empty cell below and press the **Run**-button.

```
import subprocess
import sys

# Run pip as a module of the current interpreter; pip.main() was removed in pip 10
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'])
```

If you get any errors or are not able to install *all* dependencies properly, try [Stack Overflow](https://stackoverflow.com/questions/tagged/pip) for troubleshooting or create a new issue on our [GitHub page](https://github.com/DARIAH-DE/Topics).

### What have I done?

With `import subprocess` and `import sys` you have imported two modules from the Python standard library. In the next line, you called the function `subprocess.check_call` and passed a [list](https://en.wikipedia.org/wiki/List_(abstract_data_type)) that runs `pip` with the Python interpreter of this notebook (`sys.executable`). The last three elements of that list are:

1. `install` is the command to install packages.
2. `-r` (or `--requirement`) installs from the given requirements file.
3. `requirements.txt` is a simple text file containing a list of all required libraries.

### Some final words
As you already know, code has to be written in the grey cells. You execute a cell by clicking the **Run** button. To run all cells of the notebook at once, click **Cell > Run All**, or **Kernel > Restart & Run All** if you want to restart the Python kernel first. To the left of an unexecuted cell you see `In [ ]:`. The empty brackets mean that the cell hasn't been executed yet. After you click **Run**, a star appears in the brackets (`In [*]:`), which means the process is running. In most cases, you won't see that star, because your computer is faster than your eyes. Only one cell can be executed at a time; any further executions are queued. Once a cell has finished, a number appears in the brackets (`In [1]:`).

## Starting with topic modeling!

Execute the following cell to import modules from the `dariah_topics` library.

In [1]:
from dariah_topics import preprocessing
from dariah_topics import doclist
from dariah_topics import meta
from dariah_topics import mallet
from dariah_topics import visualization

Furthermore, we will need some additional functions from external libraries.

In [2]:
import os
from bokeh.io import show
import lda

Let's not pay heed to any warnings right now and execute the following cell.

In [3]:
import warnings
warnings.filterwarnings('ignore')

## 1. Preprocessing

### 1.2. Reading a corpus of documents

#### Defining the path to the corpus folder

In the present example code, we are using a folder of `.txt` documents provided with the package. To use your own corpus, change the path accordingly.

In [4]:
path = "grenzboten_sample"

#### List all documents in the folder
We begin by creating a list of all the documents in the folder specified above. That list will tell the function `preprocessing.read_from_txt()` (see below) which text documents to read.

In [5]:
pathdoclist = doclist.PathDocList(path)
document_list = pathdoclist.full_paths(as_str=True)

The current list of documents looks like this:

In [6]:
document_list

['grenzboten_sample/Grenzboten_1844_Tagebuch_56.txt',
 'grenzboten_sample/Grenzboten_1846_Tagebuch_82.txt',
 'grenzboten_sample/Grenzboten_1916_Kriegstagebuch_69.txt',
 'grenzboten_sample/Grenzboten_1915_Kriegstagebuch_73.txt',
 'grenzboten_sample/Grenzboten_1914_Kriegstagebuch_95.txt',
 'grenzboten_sample/Grenzboten_1915_Kriegstagebuch_33.txt',
 'grenzboten_sample/Grenzboten_1914_Kriegstagebuch_68.txt',
 'grenzboten_sample/Grenzboten_1846_Tagebuch_51.txt',
 'grenzboten_sample/Grenzboten_1845_Tagebuch_81.txt',
 'grenzboten_sample/Grenzboten_1844_Tagebuch_82.txt',
 'grenzboten_sample/Grenzboten_1916_Kriegstagebuch_48.txt',
 'grenzboten_sample/Grenzboten_1915_Kriegstagebuch_94.txt',
 'grenzboten_sample/Grenzboten_1915_Kriegstagebuch_39.txt',
 'grenzboten_sample/Grenzboten_1845_Tagebuch_85.txt',
 'grenzboten_sample/Grenzboten_1846_Tagebuch_96.txt',
 'grenzboten_sample/Grenzboten_1845_Tagebuch_93.txt',
 'grenzboten_sample/Grenzboten_1916_Kriegstagebuch_81.txt',
 'grenzboten_sample/Grenzbot

**Alternatively**, if we want to use other documents, or just a selection of those in the specified folder, we can define our own `document_list` by creating a list of strings containing paths to text files. For example, to use only the texts from 1916, we would define the list as

```
document_list = ['grenzboten_sample/Grenzboten_1916_Kriegstagebuch_41.txt',
                 'grenzboten_sample/Grenzboten_1916_Kriegstagebuch_48.txt',
                 'grenzboten_sample/Grenzboten_1916_Kriegstagebuch_49.txt',
                 'grenzboten_sample/Grenzboten_1916_Kriegstagebuch_69.txt',
                 'grenzboten_sample/Grenzboten_1916_Kriegstagebuch_81.txt']
```
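A selection like this can also be built programmatically instead of being typed by hand. The following is only a sketch under assumptions: it uses `pathlib` from the standard library rather than `dariah_topics`, the helper name `select_by_year` is hypothetical, and it relies on the file naming pattern `Grenzboten_<year>_<type>_<number>.txt` seen in the list above.

```python
from pathlib import Path


def select_by_year(folder, year):
    """Return sorted paths (as strings) of .txt files whose name contains _<year>_."""
    # Hypothetical helper, not part of dariah_topics.
    pattern = "*_{}_*.txt".format(year)
    return sorted(str(p) for p in Path(folder).glob(pattern))


# Usage with the sample corpus folder from above:
# document_list = select_by_year("grenzboten_sample", 1916)
```

The resulting list of strings has the same shape as the hand-written one, so it could be passed on to the reading step in the same way.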

#### Generate document labels

In [7]:
document_labels = pathdoclist.labels()
document_labels

['Grenzboten_1844_Tagebuch_56',
 'Grenzboten_1846_Tagebuch_82',
 'Grenzboten_1916_Kriegstagebuch_69',
 'Grenzboten_1915_Kriegstagebuch_73',
 'Grenzboten_1914_Kriegstagebuch_95',
 'Grenzboten_1915_Kriegstagebuch_33',
 'Grenzboten_1914_Kriegstagebuch_68',
 'Grenzboten_1846_Tagebuch_51',
 'Grenzboten_1845_Tagebuch_81',
 'Grenzboten_1844_Tagebuch_82',
 'Grenzboten_1916_Kriegstagebuch_48',
 'Grenzboten_1915_Kriegstagebuch_94',
 'Grenzboten_1915_Kriegstagebuch_39',
 'Grenzboten_1845_Tagebuch_85',
 'Grenzboten_1846_Tagebuch_96',
 'Grenzboten_1845_Tagebuch_93',
 'Grenzboten_1916_Kriegstagebuch_81',
 'Grenzboten_1845_Tagebuch_62',
 'Grenzboten_1844_Tagebuch_77',
 'Grenzboten_1914_Kriegstagebuch_97',
 'Grenzboten_1916_Kriegstagebuch_41',
 'Grenzboten_1916_Kriegstagebuch_49',
 'Grenzboten_1844_Tagebuch_70',
 'Grenzboten_1914_Kriegstagebuch_37',
 'Grenzboten_1844_Tagebuch_88',
 'Grenzboten_1845_Tagebuch_52',
 'Grenzboten_1915_Kriegstagebuch_99',
 'Grenzboten_1914_Kriegstagebuch_94',
 'Grenzboten_1

#### Optional: Accessing metadata

In case you want a more structured overview of your corpus, execute the following cell:

In [8]:
metadata = meta.fn2metadata(os.path.join(path, '*.txt'))
metadata

Unnamed: 0,author,basename,filename,title
0,Grenzboten,Grenzboten_1844_Tagebuch_56,grenzboten_sample/Grenzboten_1844_Tagebuch_56.txt,1844_Tagebuch_56
1,Grenzboten,Grenzboten_1846_Tagebuch_82,grenzboten_sample/Grenzboten_1846_Tagebuch_82.txt,1846_Tagebuch_82
2,Grenzboten,Grenzboten_1916_Kriegstagebuch_69,grenzboten_sample/Grenzboten_1916_Kriegstagebu...,1916_Kriegstagebuch_69
3,Grenzboten,Grenzboten_1915_Kriegstagebuch_73,grenzboten_sample/Grenzboten_1915_Kriegstagebu...,1915_Kriegstagebuch_73
4,Grenzboten,Grenzboten_1914_Kriegstagebuch_95,grenzboten_sample/Grenzboten_1914_Kriegstagebu...,1914_Kriegstagebuch_95
5,Grenzboten,Grenzboten_1915_Kriegstagebuch_33,grenzboten_sample/Grenzboten_1915_Kriegstagebu...,1915_Kriegstagebuch_33
6,Grenzboten,Grenzboten_1914_Kriegstagebuch_68,grenzboten_sample/Grenzboten_1914_Kriegstagebu...,1914_Kriegstagebuch_68
7,Grenzboten,Grenzboten_1846_Tagebuch_51,grenzboten_sample/Grenzboten_1846_Tagebuch_51.txt,1846_Tagebuch_51
8,Grenzboten,Grenzboten_1845_Tagebuch_81,grenzboten_sample/Grenzboten_1845_Tagebuch_81.txt,1845_Tagebuch_81
9,Grenzboten,Grenzboten_1844_Tagebuch_82,grenzboten_sample/Grenzboten_1844_Tagebuch_82.txt,1844_Tagebuch_82
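To illustrate where the columns of this table come from, here is a minimal sketch that derives the same fields from a file name. It is not the code of `meta.fn2metadata`; the helper name `filename_to_metadata` is hypothetical, and it assumes that everything before the first underscore is the author and the rest is the title.

```python
import os


def filename_to_metadata(filename):
    """Split 'folder/Author_Title.txt' into the fields shown in the table above."""
    basename = os.path.splitext(os.path.basename(filename))[0]
    # Assumption: the part before the first underscore is the author.
    author, _, title = basename.partition("_")
    return {"author": author, "basename": basename,
            "filename": filename, "title": title}


record = filename_to_metadata("grenzboten_sample/Grenzboten_1844_Tagebuch_56.txt")
print(record["author"], record["title"])  # Grenzboten 1844_Tagebuch_56
```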


#### Read listed documents from folder

In [9]:
corpus = preprocessing.read_from_txt(document_list)

At this point, the corpus is a generator object.

### 1.3. Tokenize corpus
Next, your text files are tokenized. Tokenization is the task of cutting a stream of characters into linguistic units: simply put, words or, more precisely, tokens. The tokenize function the library provides is a simple Unicode tokenizer. Depending on the corpus, it might be useful to use an external tokenizer function, or even to develop your own, since a tokenizer's effectiveness varies with language, epoch and text type.

In [10]:
tokens = [list(preprocessing.tokenize(document)) for document in list(corpus)]

At this point, each text is represented by a list of separate token strings. To look into the first text (which has the index `0`, as Python starts counting at 0) and show its first nine tokens (the slice `0:9` covers the indices 0 to 8), type:

In [11]:
tokens[0][0:9]

['es',
 'berlin',
 'und',
 'paris',
 'sprcchscligkeir',
 'credit',
 'und',
 'religion',
 'priester']
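If you want to experiment with a tokenizer of your own, the following sketch shows one possible starting point. It is not the tokenizer shipped with `dariah_topics`; the name `simple_tokenize` and the token definition (runs of letters, optionally joined by hyphens, lowercased) are assumptions made for this illustration.

```python
import re

# Assumption: a token is a run of letters, optionally joined by hyphens.
# [^\W\d_] matches any Unicode letter (word characters minus digits and "_").
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def simple_tokenize(text):
    """Yield lowercased tokens from a string."""
    for match in TOKEN_PATTERN.finditer(text.lower()):
        yield match.group()


print(list(simple_tokenize("Berlin und Paris: Münch-Bellinghausen, 1844!")))
# ['berlin', 'und', 'paris', 'münch-bellinghausen']
```

Numbers and punctuation are dropped by this pattern; whether that is desirable depends on the corpus and the research question.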

### 1.4. Create a document-term matrix

The LDA topic model is based on a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) of the corpus. The matrix describes the frequency of the terms that occur in the collection: rows correspond to documents, columns correspond to terms, and each cell holds the number of times a term occurs in a document.

In [12]:
doc_terms = preprocessing.create_doc_term_matrix(tokens, document_labels)
doc_terms

Unnamed: 0,die,der,und,in,den,von,zu,das,des,nicht,...,mördern,mühevolle,münch-bellinghausen,mühling,mühsame,mühsamen,müht,mül,müllers,a!s
Grenzboten_1844_Tagebuch_56,90.0,92.0,88.0,70.0,30.0,25.0,25.0,16.0,25.0,23.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1846_Tagebuch_82,319.0,346.0,275.0,164.0,106.0,87.0,110.0,94.0,75.0,96.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
Grenzboten_1916_Kriegstagebuch_69,39.0,64.0,51.0,24.0,14.0,28.0,1.0,7.0,10.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Grenzboten_1915_Kriegstagebuch_73,41.0,51.0,43.0,31.0,27.0,7.0,1.0,7.0,9.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1914_Kriegstagebuch_95,80.0,85.0,62.0,65.0,42.0,35.0,11.0,13.0,14.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1915_Kriegstagebuch_33,93.0,95.0,87.0,78.0,50.0,48.0,1.0,8.0,21.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1914_Kriegstagebuch_68,32.0,31.0,26.0,37.0,15.0,13.0,2.0,10.0,16.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1846_Tagebuch_51,226.0,177.0,188.0,111.0,73.0,62.0,93.0,60.0,35.0,78.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1845_Tagebuch_81,344.0,351.0,311.0,178.0,107.0,118.0,156.0,116.0,91.0,112.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1844_Tagebuch_82,213.0,207.0,169.0,128.0,85.0,86.0,79.0,80.0,66.0,67.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
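The structure of such a matrix can be sketched with the standard library alone. This is an illustration, not the implementation of `preprocessing.create_doc_term_matrix`; the function names and the toy documents are hypothetical. The vocabulary is ordered by corpus-wide frequency, as the column order in the table above suggests.

```python
from collections import Counter


def build_doc_term_counts(tokenized_docs, labels):
    """Map each document label to a Counter of its term frequencies."""
    return {label: Counter(tokens) for label, tokens in zip(labels, tokenized_docs)}


def vocabulary_by_frequency(doc_term_counts):
    """Return all terms, most frequent in the whole collection first."""
    totals = Counter()
    for counts in doc_term_counts.values():
        totals.update(counts)
    return [term for term, _ in totals.most_common()]


toy_docs = [["die", "stadt", "die", "presse"], ["die", "truppen", "stadt"]]
toy_counts = build_doc_term_counts(toy_docs, ["doc_a", "doc_b"])
toy_vocab = vocabulary_by_frequency(toy_counts)
# One row per document, one column per term; absent terms count as 0.
toy_matrix = [[toy_counts[label][term] for term in toy_vocab] for label in toy_counts]
```

A `Counter` returns 0 for missing keys, which is why terms that do not occur in a document need no special handling.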


### 1.5. Feature removal

In topic modeling, it is often useful (if not vital) to remove some word types before modeling. In this example, the 100 most frequent words and the *hapax legomena* (words that occur only once in the corpus) will be removed. This step is easy to handle thanks to the indexing of the document-term matrix.

#### List the 100 most frequent words

In [13]:
mfw100 = preprocessing.find_stopwords(doc_terms, 100)

These are the five most frequent words:

In [14]:
mfw100[:5]

['die', 'der', 'und', 'in', 'den']

#### List hapax legomena

In [15]:
hapax_list = preprocessing.find_hapax(doc_terms)
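Both lists can be thought of as simple queries on the corpus-wide term frequencies. The sketch below shows the idea on toy data; the function names are hypothetical, and it is not the code behind `find_stopwords` or `find_hapax`.

```python
from collections import Counter


def most_frequent_words(term_totals, n):
    """Return the n terms with the highest corpus-wide frequency."""
    return [term for term, _ in term_totals.most_common(n)]


def hapax_legomena(term_totals):
    """Return the terms that occur exactly once in the whole corpus."""
    return [term for term, freq in term_totals.items() if freq == 1]


toy_totals = Counter({"die": 9, "der": 7, "stadt": 3, "krakau": 1, "maas": 1})
print(most_frequent_words(toy_totals, 2))  # ['die', 'der']
print(hapax_legomena(toy_totals))          # ['krakau', 'maas']
```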

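#### Optional: Use external stopword list

In [16]:
path_to_stopwordlist = "tutorial_supplementals/stopwords/de.txt"

# Assumes the stopword list is UTF-8 encoded; the file is closed automatically
with open(path_to_stopwordlist, 'r', encoding='utf-8') as stopword_file:
    extern_stopwords = [line.strip() for line in stopword_file]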

#### Combine lists and remove features from the document-term matrix

In [17]:
features = set(mfw100 + hapax_list + extern_stopwords)
doc_terms = preprocessing.remove_features_from_df(doc_terms, features)

Finally, this is what your cleaned corpus looks like now.

In [18]:
doc_terms

Unnamed: 0,franzosen,genommen,abgewiesen,südlich,berlin,lassen,geschütze,englische,januar,deutschland,...,tilemans,tausendmal,taten,geldinstitute,tatkraft,gemeinem,tausenden,teilangriffe,tendenzstück,gemeingefährlichkeitsmaßstab
Grenzboten_1844_Tagebuch_56,0.0,1.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1846_Tagebuch_82,4.0,2.0,0.0,0.0,1.0,7.0,0.0,1.0,1.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1916_Kriegstagebuch_69,12.0,6.0,9.0,10.0,2.0,0.0,5.0,9.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1915_Kriegstagebuch_73,8.0,11.0,6.0,7.0,2.0,0.0,6.0,3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1914_Kriegstagebuch_95,2.0,9.0,12.0,6.0,2.0,1.0,8.0,10.0,1.0,1.0,...,0.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1915_Kriegstagebuch_33,17.0,13.0,24.0,8.0,4.0,1.0,16.0,11.0,86.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1914_Kriegstagebuch_68,8.0,3.0,5.0,3.0,2.0,0.0,3.0,3.0,1.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1846_Tagebuch_51,1.0,1.0,1.0,0.0,1.0,7.0,0.0,1.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grenzboten_1845_Tagebuch_81,1.0,3.0,0.0,0.0,7.0,10.0,0.0,0.0,0.0,10.0,...,0.0,0.0,0.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0
Grenzboten_1844_Tagebuch_82,0.0,0.0,0.0,0.0,5.0,10.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2. Model creation

The actual topic modeling is done with external state-of-the-art LDA implementations. In this example, we are relying on the open-source toolkit **lda**.

### 2.1. Translate document-term matrix into array

In this step, all values of your document-term matrix will be translated into an [array](https://en.wikipedia.org/wiki/Array_data_structure).

In [19]:
doc_term_matrix = doc_terms.values.astype(int)  # as_matrix() was removed in pandas 1.0
doc_term_matrix

array([[ 0,  1,  0, ...,  0,  0,  0],
       [ 4,  2,  0, ...,  0,  0,  0],
       [12,  6,  9, ...,  0,  0,  0],
       ..., 
       [ 4,  3,  2, ...,  0,  0,  0],
       [ 1,  2,  0, ...,  0,  0,  0],
       [ 4,  0,  0, ...,  0,  0,  0]])

### 2.2. Creating list of vocabulary

To translate numbers back into words after the model creation, you have to set up a list of all unique tokens in the corpus.

In [20]:
vocab = doc_terms.columns
vocab

Index(['franzosen', 'genommen', 'abgewiesen', 'südlich', 'berlin', 'lassen',
       'geschütze', 'englische', 'januar', 'deutschland',
       ...
       'tilemans', 'tausendmal', 'taten', 'geldinstitute', 'tatkraft',
       'gemeinem', 'tausenden', 'teilangriffe', 'tendenzstück',
       'gemeingefährlichkeitsmaßstab'],
      dtype='object', length=4258)

### 2.3. Generate LDA model

We can define the number of topics we want to calculate as an argument (`n_topics`) of `lda.LDA`. The number of iterations can be set as well (`n_iter`). A higher number of iterations will probably yield a better model, but also increases processing time.

**Warning: this step can take quite a while!** Depending on corpus size and the number of iterations, it may take anything from a few seconds to several hours. Our example corpus should be done within a minute or two at `n_iter=5000`.

In [21]:
%%time

model = lda.LDA(n_topics=10, n_iter=5000)
model.fit(doc_term_matrix)

CPU times: user 47.4 s, sys: 1.96 ms, total: 47.4 s
Wall time: 47.6 s
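To give an impression of what happens during those iterations, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is a sketch for illustration only and not the code of the `lda` package, which is far more optimized; the function name `gibbs_lda`, the toy documents and the hyperparameter values are assumptions. Each token is repeatedly reassigned to a topic with a probability proportional to how common that topic is in its document and how common the word is in that topic.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, eta=0.01, seed=1):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    ndz = [[0] * n_topics for _ in docs]               # topic counts per document
    nzw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    nz = [0] * n_topics                                # total tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(n_topics)
            doc_assignments.append(z)
            ndz[d][z] += 1
            nzw[z][w] += 1
            nz[z] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                ndz[d][z] -= 1
                nzw[z][w] -= 1
                nz[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (ndz[d][k] + alpha) * (nzw[k][w] + eta) / (nz[k] + vocab_size * eta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                ndz[d][z] += 1
                nzw[z][w] += 1
                nz[z] += 1
    return ndz, nzw


# Three tiny documents over a vocabulary of four word ids.
toy_docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 2]]
doc_topic_counts, topic_word_counts = gibbs_lda(toy_docs, n_topics=2, vocab_size=4)
```

Normalizing the rows of the two count tables (after adding `alpha` and `eta` respectively) gives estimates of the document-topic and topic-word distributions.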


### 2.4. Create document-topic matrix

The generated model object can now be translated into a human-readable matrix (actually a pandas DataFrame) that lists the key words of each topic and constitutes our principal exchange format for topic modeling results.

In [22]:
topics = preprocessing.lda2dataframe(model, vocab)
topics

Unnamed: 0,Key 1,Key 2,Key 3,Key 4,Key 5,Key 6,Key 7,Key 8,Key 9,Key 10
Topic 1,lassen,weise,finden,welt,art,nämlich,sagen,berlin,leben,bringen
Topic 2,kaiser,berlin,friedrich,tausend,wien,verfasser,akademie,gedichte,leben,karl
Topic 3,hiesigen,scheint,steht,gewiß,gesellschaft,stadt,alten,seite,zeitung,indeß
Topic 4,presse,volk,eigentlich,frage,regierung,namen,junge,briefe,männer,stande
Topic 5,geschütze,abgewiesen,genommen,franzosen,januar,abgeschlagen,östlich,kriegstagebuch,angriff,verlusten
Topic 6,september,dezember,engländer,österreicher,truppen,englische,türken,august,geschlagen,england
Topic 7,frankreich,französischen,lamennais,sprache,anfangs,einst,seele,krakau,belgien,buch
Topic 8,juli,märz,april,südlich,stellungen,heftige,gestürmt,italiener,englische,maas
Topic 9,deutschland,oesterreich,glauben,wissen,wiener,deutscher,berliner,politischen,publicum,letzten
Topic 10,leipzig,fremden,ward,sieht,stände,theater,leipziger,zeitung,deutsch,stadt
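A table like this boils down to ranking the vocabulary by weight within each topic. The following sketch shows that ranking on toy data. How `lda2dataframe` works internally is not shown here, so the function name `top_keys` and the toy weights are assumptions; with a fitted `lda` model, the topic-word weights are available as `model.topic_word_`.

```python
def top_keys(topic_word, vocabulary, n_keys=10):
    """Return, for each topic, the n_keys words with the highest weight."""
    result = []
    for weights in topic_word:
        ranked = sorted(range(len(vocabulary)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n_keys]])
    return result


toy_vocab = ["kaiser", "berlin", "truppen", "angriff"]
toy_topic_word = [[0.5, 0.3, 0.1, 0.1],    # a topic about the court
                  [0.05, 0.05, 0.4, 0.5]]  # a topic about the war
print(top_keys(toy_topic_word, toy_vocab, n_keys=2))
# [['kaiser', 'berlin'], ['angriff', 'truppen']]
```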


## 3. Model visualization and evaluation

The following matrix contains the probability per topic for each document, which we need for the visualization.

In [23]:
doc_topics = preprocessing.lda_doc_topic(model, topics, document_labels)
doc_topics

Unnamed: 0,Grenzboten_1844_Tagebuch_56,Grenzboten_1844_Tagebuch_70,Grenzboten_1844_Tagebuch_77,Grenzboten_1844_Tagebuch_82,Grenzboten_1844_Tagebuch_88,Grenzboten_1845_Tagebuch_52,Grenzboten_1845_Tagebuch_62,Grenzboten_1845_Tagebuch_81,Grenzboten_1845_Tagebuch_85,Grenzboten_1845_Tagebuch_93,...,Grenzboten_1915_Kriegstagebuch_33,Grenzboten_1915_Kriegstagebuch_39,Grenzboten_1915_Kriegstagebuch_73,Grenzboten_1915_Kriegstagebuch_94,Grenzboten_1915_Kriegstagebuch_99,Grenzboten_1916_Kriegstagebuch_41,Grenzboten_1916_Kriegstagebuch_48,Grenzboten_1916_Kriegstagebuch_49,Grenzboten_1916_Kriegstagebuch_69,Grenzboten_1916_Kriegstagebuch_81
lassen weise finden,0.258479,0.218852,0.234322,0.27,0.137258,0.20084,0.275677,0.189388,0.26571,0.588489,...,0.009867,0.000148,0.014761,0.00033,0.000176,0.004952,0.025039,0.000122,0.002804,0.000134
kaiser berlin friedrich,0.196672,0.266772,0.258043,0.070654,0.081019,0.05115,0.040945,0.0196,0.097651,0.050868,...,8.9e-05,0.000148,0.000208,0.00033,0.023023,0.00016,0.001711,0.000122,0.001469,0.000134
hiesigen scheint steht,0.147544,0.151387,0.122387,0.169346,0.070475,0.167251,0.135475,0.459051,0.342432,0.067588,...,8.9e-05,0.003116,0.000208,0.00033,0.000176,0.00016,0.000156,0.001345,0.000134,0.001478
presse volk eigentlich,0.036609,0.053657,0.074203,0.126863,0.068717,0.115407,0.056877,0.116146,0.06738,0.052154,...,0.004533,0.007567,0.002287,0.00033,0.000176,0.00016,0.000156,0.000122,0.000134,0.004167
geschütze abgewiesen genommen,0.000158,6.3e-05,7.4e-05,6.5e-05,0.000176,3.7e-05,5.3e-05,0.004203,0.000574,0.001994,...,0.740533,0.862166,0.698753,0.772607,0.638137,0.273323,0.364075,0.372983,0.181709,0.234005
september dezember engländer,0.017591,0.010782,0.003781,0.007255,0.000176,0.006243,5.3e-05,0.004203,0.009447,0.012926,...,0.178756,0.069881,0.068815,0.079538,0.084534,0.054473,0.032815,0.03313,0.086916,0.05793
frankreich französischen lamennais,0.014422,0.009521,0.030467,0.020327,0.000176,0.027054,0.030855,0.026259,0.027192,0.033505,...,8.9e-05,0.000148,0.004366,0.00033,0.000176,0.00016,0.000156,0.001345,0.000134,0.000134
juli märz april,0.000158,6.3e-05,7.4e-05,6.5e-05,0.000176,3.7e-05,5.3e-05,4.2e-05,0.001618,0.001994,...,0.0632,0.053561,0.208108,0.138944,0.253251,0.655112,0.575583,0.590587,0.726435,0.697715
deutschland oesterreich glauben,0.21252,0.249117,0.157969,0.118366,0.567838,0.235159,0.101487,0.138202,0.130532,0.1409,...,0.002756,0.000148,0.000208,0.00033,0.000176,0.00016,0.000156,0.000122,0.000134,0.002823
leipzig fremden ward,0.115848,0.039786,0.118681,0.217059,0.073989,0.196824,0.358524,0.042905,0.057463,0.049582,...,8.9e-05,0.003116,0.002287,0.006931,0.000176,0.011342,0.000156,0.000122,0.000134,0.001478
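One simple way to read such a matrix is to look up the dominant topic of each document. The sketch below does this on a small excerpt of the values above; the function name `dominant_topics` is hypothetical, and the dictionary layout is chosen only to keep the example free of dependencies.

```python
def dominant_topics(doc_topic):
    """Map each document label to the (topic, probability) pair with the largest share."""
    return {doc: max(shares.items(), key=lambda item: item[1])
            for doc, shares in doc_topic.items()}


# Three values per document, copied from the table above.
toy_doc_topic = {
    "Grenzboten_1844_Tagebuch_56": {
        "lassen weise finden": 0.258479,
        "kaiser berlin friedrich": 0.196672,
        "juli märz april": 0.000158,
    },
    "Grenzboten_1916_Kriegstagebuch_81": {
        "lassen weise finden": 0.000134,
        "kaiser berlin friedrich": 0.000134,
        "juli märz april": 0.697715,
    },
}
toy_dominant = dominant_topics(toy_doc_topic)
print(toy_dominant["Grenzboten_1916_Kriegstagebuch_81"])
# ('juli märz april', 0.697715)
```

On the full DataFrame, where topics are rows and documents are columns, `doc_topics.idxmax()` should give the same kind of answer for every document at once.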


### 3.1. Distribution of topics

#### Distribution of topics over all documents

The distribution of topics over all documents can now be visualized in a heat map.

In [24]:
plot = visualization.doc_topic_heatmap_interactive(doc_topics, title="Grenzboten")
show(plot, notebook_handle=True)

#### Distribution of topics in a single document

To take a closer look at the topics in a single text, we can use the following function, which shows all the topics in a text and their respective proportions. To select the document, we have to pass its index to the function.

In [25]:
visualization.plot_doc_topics(doc_topics, 0)

<module 'matplotlib.pyplot' from '/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py'>