# Topics – Easy Topic Modeling in Python


This notebook was adapted from <a href="https://github.com/DARIAH-DE/Topics"><i>Topics</i> (DARIAH-DE)</a> notebook IntroducingMallet.ipynb for the project DiSpecs. <br>
The original markdown text is kept and highlighted in italic. The section Visualization is separated into a new notebook (6_Topic-visualization).  

<i>The text mining technique **Topic Modeling** has become a popular statistical method for clustering documents. This [Jupyter notebook](http://jupyter.org/) introduces a step-by-step workflow, basically containing data preprocessing, the actual topic modeling using **latent Dirichlet allocation** (LDA), which learns the relationships between words, topics, and documents, as well as some interactive visualizations to explore the model.

LDA, introduced in the context of text analysis in [2003](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf), is an instance of a more general class of models called **mixed-membership models**. Involving a number of distributions and parameters, the topic model is typically performed using [Gibbs sampling](https://en.wikipedia.org/wiki/Gibbs_sampling) with conjugate priors and is purely based on word frequencies.

Numerous introductions to topic modeling for humanists have been written (e.g. [this one](http://scottbot.net/topic-modeling-for-humanists-a-guided-tour/)); they provide further detail on its technical and epistemic properties.

For this workflow, you will need a corpus (a set of texts) as plain text (`.txt`) or [TEI XML](http://www.tei-c.org/index.xml) (`.xml`). Using the `dariah_topics` package, you also have the ability to process the output of [DARIAH-DKPro-Wrapper](https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper), a command-line tool for *natural language processing*.

Topic modeling works best with very large corpora. The [TextGrid Repository](https://textgridrep.org/) is a great place to start searching for text data. Anyway, to demonstrate the technique, we provide one small text collection in the folder `grenzboten_sample` containing 15 diary excerpts, as well as 15 war diary excerpts, which appeared in *Die Grenzboten*, a German newspaper of the late 19th and early 20th century.

**Of course, you can work with your own corpus in this notebook.**

We're relying on the LDA implementation by [Andrew McCallum](https://people.cs.umass.edu/~mccallum/), called [MALLET](http://mallet.cs.umass.edu/topics.php), which is known to be very robust. Aside from that, we provide two more Jupyter notebooks:

* [IntroducingGensim](IntroducingGensim.ipynb), using LDA by [Gensim](https://radimrehurek.com/project/gensim/), which is attractive because of its multi-core support.
* [IntroducingLda](IntroducingLda.ipynb), using LDA by [lda](http://pythonhosted.org/lda/index.html), which is very lightweight.

For more information in general, have a look at the [documentation](http://dev.digital-humanities.de/ci/job/DARIAH-Topics/doclinks/1/).</i>

## First step: Installing dependencies

<i>To work within this Jupyter notebook, you will have to import the `dariah_topics` library. As you do, `dariah_topics` also imports a couple of external libraries, which have to be installed first. `pip` is the preferred installer program in Python. Starting with Python 3.4, it is included by default with the Python binary installers. If you are interested in `pip`, have a look at [this website](https://docs.python.org/3/installing/index.html).

To install the `dariah_topics` library with all dependencies, open your command line, change with `cd` to the folder `Topics` and run:

```
pip install -r requirements.txt
```

Alternatively, you can do:

```
python setup.py install
```

If you get any errors or are not able to install *all* dependencies properly, try [Stack Overflow](https://stackoverflow.com/questions/tagged/pip) for troubleshooting or create a new issue on our [GitHub page](https://github.com/DARIAH-DE/Topics).

**Important**: If you are on macOS or Linux, you will have to use `pip3` and `python3`.</i>

### Some final words
<i>As you probably already know, code has to be written in the grey cells. You execute a cell by clicking the **Run** button (or pressing **Ctrl + Enter**). To run all cells of the notebook at once, click **Cell > Run All**, or **Kernel > Restart & Run All** if you want to restart the Python kernel first. To the left of an unexecuted cell you see `In [ ]:`. The empty brackets mean that the cell hasn't been executed yet. When you click **Run**, a star appears in the brackets (`In [*]:`), which means the process is running. In most cases you won't see that star, because your computer is faster than your eyes. Only one cell can be executed at a time; all following executions are queued. Once a cell has finished, a number appears in the brackets (`In [1]:`).</i>

## Starting with topic modeling!

<i>Execute the following cell to import modules from the `dariah_topics` library.</i>

In [1]:
from cophi_toolbox import preprocessing
from dariah_topics import utils
from dariah_topics import postprocessing
from dariah_topics import visualization

<i>Furthermore, we will need some additional functions from external libraries.</i>

In [2]:
import metadata_toolbox.utils as metadata
import pandas as pd
from pathlib import Path

<i>Let's not pay heed to any warnings right now and execute the following cell.</i>

In [3]:
import warnings
warnings.filterwarnings('ignore')

We will import a few more packages for DiSpecs:

In [4]:
import os
import numpy as np
from collections import Counter
import pickle
from datetime import datetime

In [26]:
# a datetime object containing current date and time
now = datetime.now()

dt_string = now.strftime("%Y%m%d-%H%M")
print("date and time =", dt_string)

date and time = 20210414-1213


## 1. Preprocessing

### 1.1. Reading a corpus of documents

#### Defining the path to the corpus folder

<i>In the present example code, we are using the 30 diary excerpts from the folder `grenzboten`. To use your own corpus, change the path accordingly.</i>

In [5]:
#data = 'Y:/data/projekte/dispecs/TopicModeling' 
#path_to_corpus = Path(data, 'dispecs_es')

# Path variables
data = 'Y:/data/projekte/dispecs/TopicModeling' 
language = 'it'
path_to_corpus = Path(data, 'dispecs_'+language+'_paragr')

#### Specifying the pattern of filenames for metadata extraction

<i>You have the ability to extract metadata from the filenames. For instance, if your textfiles look like:

```
goethe_1816_stella.txt
```

the pattern would look like this:

```
{author}_{year}_{title}
```

So, let's try this for the example corpus.</i>
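To make the idea of a filename pattern more concrete, here is a minimal sketch of how a pattern with `{field}` placeholders could be turned into a regular expression with named groups. The helpers `pattern_to_regex` and `parse_filename` are hypothetical names used for illustration only; this is not necessarily how `metadata_toolbox` works internally.

```python
import re
from pathlib import Path


def pattern_to_regex(pattern):
    """Turn '{author}_{year}_{title}' into a regex with named groups (illustrative helper)."""
    # re.split with a capturing group alternates: literal, field, literal, field, ...
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)           # literal separator, e.g. '_'
        else:
            regex += '(?P<{}>.+?)'.format(part)  # placeholder becomes a named group
    return re.compile('^' + regex + '$')


def parse_filename(path, pattern):
    """Return a dict of metadata fields, or None if the filename does not fit the pattern."""
    match = pattern_to_regex(pattern).match(Path(path).stem)
    return match.groupdict() if match else None


example = parse_filename('goethe_1816_stella.txt', '{author}_{year}_{title}')
print(example)
```

Because the separator is matched literally, a field value must not contain the separator itself; this is presumably why the DiSpecs filenames use hyphens inside fields and underscores between them.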


Change the pattern according to your file names.

In [6]:
pattern = '{year}_{periodical}_{author}_{volume}_{issue}_{id}_{chunk}' #_{chunk}

#### Accessing file paths and metadata
<i>We begin by creating a list of all the documents in the folder specified above. That list will tell the function `preprocessing.read_files` (see below) which text documents to read. Furthermore, based on the filenames we can create some metadata, e.g. author and title.</i>

In [7]:
meta = pd.concat([metadata.fname2metadata(str(path), pattern=pattern) for path in path_to_corpus.glob('*.txt')])
meta[:5] # by adding '[:5]' to the variable, only the first 5 elements will be printed

Unnamed: 0,year,periodical,author,volume,issue,id,chunk
Y:\data\projekte\dispecs\TopicModeling\dispecs_it_paragr\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-000_09A-399_0000.txt,1727,Il-Filosofo-alla-Moda,Cesare-Frasponi,Vol-1,Nr-000,09A-399,0
Y:\data\projekte\dispecs\TopicModeling\dispecs_it_paragr\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-0651_09A-398_0000.txt,1727,Il-Filosofo-alla-Moda,Cesare-Frasponi,Vol-2,Nr-0651,09A-398,0
Y:\data\projekte\dispecs\TopicModeling\dispecs_it_paragr\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0000.txt,1727,Il-Filosofo-alla-Moda,Cesare-Frasponi,Vol-2,Nr-101,096-282,0
Y:\data\projekte\dispecs\TopicModeling\dispecs_it_paragr\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0001.txt,1727,Il-Filosofo-alla-Moda,Cesare-Frasponi,Vol-2,Nr-101,096-282,1
Y:\data\projekte\dispecs\TopicModeling\dispecs_it_paragr\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0002.txt,1727,Il-Filosofo-alla-Moda,Cesare-Frasponi,Vol-2,Nr-101,096-282,2


Depending on which language you are analyzing, you might notice different language variants of the word "anonymous" in the column "author". We can correct this in the data frame by replacing the respective string with "Anonymous".  

In [8]:
str_to_replace = [r'\bAnonymus\b', r'\bAnonym\b', r'\bAnonyme\b', r'\bAnónimo\b']
for string in str_to_replace:
    meta['author'] = meta['author'].str.replace(string, "Anonymous", regex=True)
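The word boundaries (`\b`) in the patterns above matter: without them, a short variant such as "Anonym" would also match inside an already correct "Anonymous". The following sketch illustrates this with plain strings and the standard `re` module; `normalize_author` is a hypothetical helper, not part of the toolbox.

```python
import re

# Same patterns as in the cell above
variants = [r'\bAnonymus\b', r'\bAnonym\b', r'\bAnonyme\b', r'\bAnónimo\b']


def normalize_author(name):
    """Replace language variants of 'anonymous' in a single author string."""
    for variant in variants:
        name = re.sub(variant, 'Anonymous', name)
    return name


print(normalize_author('Anonym'))
print(normalize_author('Anónimo'))
# Already normalized names stay untouched because '\b' prevents partial matches
print(normalize_author('Anonymous-(Eliza-Haywood)'))
```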

You can now take a look at what authors and periodicals occur in your corpus and with how many issues they are represented. 

In [9]:
meta.groupby('author')['id'].nunique()

author
Anonymous-(Eliza-Haywood)          7
Antonio-Piazza                   245
Cesare-Frasponi                  393
Francesco-Anselmi                 30
Francesco-Grassi                  34
Gasparo-Gozzi                    248
Gioseffa-Cornoldi-Caminer         36
Giovanni-Ferri-di-S.-Costante    261
Giuseppe-Baretti                  33
Luca-Magnanima                    39
Pietro-und-Alessandro-Verri       18
Name: id, dtype: int64

In [10]:
meta.groupby('periodical')['id'].nunique()

periodical
Gazzetta-urbana-veneta                         245
Gazzetta-veneta                                103
Gli-Osservatori-veneti                          41
Il-Caffè                                        18
Il-Filosofo-alla-Moda                          393
Il-Socrate-veneto                               30
La-Frusta-Letteraria-di-Aristarco-Scannabue     33
La-Spettatrice                                   7
La-donna-galante-ed-erudita                     36
Lo-Spettatore-italiano                         261
Lo-Spettatore-italiano-piemontese               34
L’Osservatore-veneto                           104
Osservatore-toscano                             39
Name: id, dtype: int64

#### Read listed documents from folder

In [11]:
corpus = list(preprocessing.read_files(meta.index))
corpus[1][:4000] # printing the first 4000 characters of the second document (index 1)

'Eccovi Leggitore benigno un fasciare d’ altro Fogli volare legare anch’ esso come il primo in formare di Libro se volere divertirvi figuratevi di entrare in una selva composto da una varietà singolare di Alberi e ritrovare che formare tra di loro una confusione gradito . Si essere piantato o inserito di giorno in giorno a misurare che capitare alla mano . Non vi sorprendere per tangere se alcun di quello che produrre la medesimo specie di frutto vi essere posto frammischiare con altro in qualche distanza . Vi accaderà mi lusingare con qualche piacere il passare da una Pianta di Peri ad una di Noci da questo ad un Giregio indo ad Giugiolo ad una Pigna ad un Pomo ad un Nespolo ad un Cotogno e poscia d’ incontrarvi qualche volto in un’ altro pianto di Peri di Po mi o di Ceregj però sempre con differire innestare con distinto figurare con diverso colorire e con variare saporire . Passegiate con attenzione osservare a vostro bell’ aggio e divertitevi . Vi dare la permissione di raccogliere

<i>Your `corpus` contains as many elements (`documents`) as there are texts in your corpus. Each element of `corpus` is a list containing exactly one element, the text itself as one single string including all whitespace and punctuation:

```
[['This is the content of your first document.'],
 ['This is the content of your second document.'],
 ...
 ['This is the content of your last document.']]
```
</i>

Check the length of your corpus:

In [12]:
len(corpus)

5574

### 1.3. Tokenize corpus
<i>Now, your `documents` in `corpus` will be tokenized. Tokenization is the task of cutting a stream of characters into linguistic units, simply words or, more precisely, tokens. The tokenize function `dariah_topics` provides is a simple Unicode tokenizer. Depending on the corpus, it might be useful to use an external tokenizer function, or even develop your own, since its efficiency varies with language, epoch and text type.</i>

In [13]:
tokenized_corpus = [list(preprocessing.tokenize(document)) for document in corpus]
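If you want to experiment with your own tokenizer, as suggested above, a minimal sketch could look like the following. It assumes that a token is simply a run of Unicode letters and lowercases everything; the regular expression and the name `simple_tokenize` are illustrative assumptions and do not reproduce the exact behaviour of `preprocessing.tokenize`.

```python
import re

# One or more Unicode letters (word characters without digits and underscore)
TOKEN = re.compile(r'[^\W\d_]+')


def simple_tokenize(text):
    """Yield lowercased letter-only tokens from a string."""
    for match in TOKEN.finditer(text.lower()):
        yield match.group()


sample_tokens = list(simple_tokenize('Eccovi Leggitore benigno d’ altro Fogli, una varietà'))
print(sample_tokens)
```

Like the function used in the cell above, the sketch is a generator, so it is wrapped in `list()` to obtain the tokens.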

<i>At this point, each `document` is represented by a list of separate token strings. As above, have a look at the first document (which has the index `0`, as Python starts counting at 0). To show only its first 14 tokens, slice the list with `[0:14]`.</i>

In [14]:
tokenized_corpus[0]

['non',
 'vi',
 'essere',
 'forse',
 'mai',
 'stare',
 'verun',
 'opra',
 'nè',
 'antico',
 'nè',
 'moderno',
 'che',
 'avere',
 'fare',
 'tangere',
 'strepitare',
 'nel',
 'paese',
 'dov',
 'essere',
 'nato',
 'di',
 'cui',
 'siansi',
 'esitare',
 'tanto',
 'esemplare',
 'come',
 'questo',
 'del',
 'spettatore',
 'da',
 'cui',
 'si',
 'essere',
 'ricavare',
 'il',
 'filosofo',
 'alla',
 'moda',
 'tutti',
 'li',
 'discorso',
 'che',
 'la',
 'comporre',
 'orare',
 'intitolare',
 'lezioni',
 'comparire',
 'da',
 'principiare',
 'ad',
 'un',
 'ad',
 'un',
 'in',
 'qualità',
 'di',
 'fogli',
 'volare',
 'in',
 'figurare',
 'di',
 'gazette',
 'se',
 'ne',
 'essere',
 'venduto',
 'fino',
 'ventimila',
 'al',
 'giorno',
 'se',
 'ne',
 'essere',
 'di',
 'molto',
 'fare',
 'in',
 'poco',
 'anno',
 'quattro',
 'edizioni',
 'francesi',
 'tali',
 'discorso',
 'non',
 'essere',
 'fatturare',
 'un',
 'solere',
 'come',
 'con',
 'evidenza',
 'vi',
 'fare',
 'palese',
 'la',
 'varietà',
 'della',
 'fr

### 1.4 Create a document-term matrix

<i>The LDA topic model is based on a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) of the corpus. To improve performance in large corpora, the matrix describes the frequency of terms that occur in the collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.</i>
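As a rough illustration of this layout (rows are documents, columns are terms, cells are frequencies), the following sketch builds a tiny dense document-term matrix with the standard library only. The two toy documents and their labels are made up for the example; the actual matrix in this notebook is created by `preprocessing.create_document_term_matrix`.

```python
from collections import Counter

# Toy corpus: document label -> list of tokens (invented for illustration)
docs = {
    'doc_a': ['casa', 'mano', 'casa'],
    'doc_b': ['mano', 'città'],
}

# Columns: the sorted vocabulary of the whole collection
vocabulary = sorted({token for tokens in docs.values() for token in tokens})

# Rows: one frequency vector per document (a Counter returns 0 for missing terms)
matrix = {}
for label, tokens in docs.items():
    counts = Counter(tokens)
    matrix[label] = [counts[term] for term in vocabulary]

print(vocabulary)
for label, row in matrix.items():
    print(label, row)
```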

__Use only one of the following two versions for creating the matrix.__<br>
Change the meta argument according to the pieces of information you want to keep in your matrix. Important: Each filename still needs to stay distinctive, so you should include a distinctive feature, like for example the ID. 

#### 1.4.1 Large corpus matrix

<i>If you have a very large corpus, create a document-term matrix designed for large corpora.</i>
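A dense matrix wastes memory on a large corpus, because most cells are zero. The large-corpus variant therefore appears to store only the non-zero cells, indexed by a document ID and a type ID. The sketch below shows this idea under simple assumptions: IDs start at 1 and are assigned in order of first appearance, which may differ from what the library actually does.

```python
from collections import Counter

tokenized = [['casa', 'mano', 'casa'], ['mano', 'città']]

# Assign an integer ID to every type (distinct token); assumption: IDs start at 1
type_ids = {}
for tokens in tokenized:
    for token in tokens:
        type_ids.setdefault(token, len(type_ids) + 1)

# Sparse matrix: (document_id, type_id) -> frequency, zero cells are not stored
sparse = {}
for document_id, tokens in enumerate(tokenized, start=1):
    for token, count in Counter(tokens).items():
        sparse[(document_id, type_ids[token])] = count

print(type_ids)
print(sparse)
```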



In [15]:
document_term_matrix, document_ids, type_ids = preprocessing.create_document_term_matrix(tokenized_corpus,
                                                                                         meta['author']+'_'+meta['periodical']+'_' + meta['volume']+'_' + meta['issue']+'_'+meta['id']+'_'+meta['chunk'],
                                                                                         large_corpus=True) #+'_'+meta['chunk']
(document_term_matrix, document_ids, type_ids)[:5]

(                      0
 document_id type_id    
 1           44059     1
             91427    16
             90710     1
             3788      1
             11021     1
             11434    17
             76567    16
             90295     1
             13799    16
             16025     1
             40788     1
             72233     1
             31425     3
             45975     2
             35564     1
             35714     6
             80682     4
             39605     1
             93101     2
             855       1
             76097     1
             21706     1
             49804    11
             39541     1
             41064     3
             35521     1
             58979     1
             67292     2
             58547     5
             74840     5
 ...                  ..
 5574        3007      1
             34164     1
             60518     2
             5052      1
             8776      1
             46046     1
             39900     1


#### 1.4.2 Small corpus matrix

<i>Otherwise, use the document-term matrix designed for small corpora.</i>

In [18]:
#document_term_matrix = preprocessing.create_document_term_matrix(tokenized_corpus, meta['author']+'_'+meta['periodical']+'_'+meta['title']+'_'+meta['id'])
#document_term_matrix[:5]

### 1.5. Feature removal

<i>*Stopwords* (also known as *most frequent tokens*) and *hapax legomena* are harmful for LDA and have to be removed from the corpus or the document-term matrix respectively. In this example, the 100 most frequent tokens will be categorized as stopwords.

**Hint**: Be careful when removing the most frequent tokens; you might remove tokens that are quite important for LDA. To gain better results, it is highly recommended to use an external stopword list.

In this notebook, we combine the 100 most frequent tokens, hapax legomena and an external stopword list.</i>
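The following sketch shows the principle behind this step on a toy corpus, using only `collections.Counter`: count all tokens, take the top entries as stopwords, take tokens that occur exactly once as hapax legomena, and filter both out. The toy data and the cut-off of one most frequent token are assumptions for illustration; the notebook itself relies on `preprocessing.list_mfw`, `preprocessing.find_hapax_legomena` and `preprocessing.remove_features`.

```python
from collections import Counter

tokenized = [['di', 'casa', 'di', 'mano'], ['di', 'mano', 'onda']]

# Corpus-wide token frequencies
frequencies = Counter(token for document in tokenized for token in document)

# Most frequent tokens (here only the top 1) and tokens occurring exactly once
most_frequent = [token for token, _ in frequencies.most_common(1)]
hapax = [token for token, count in frequencies.items() if count == 1]

# Remove both groups from every document; a set makes the lookup fast
features = set(most_frequent + hapax)
clean = [[token for token in document if token not in features] for document in tokenized]

print(most_frequent, hapax, clean)
```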

#### List the 100 most frequent words

<i>If you have chosen the large corpus model, you will have to add `type_ids` to the function `preprocessing.list_mfw()`.</i><br>
So, if you created a matrix for a large corpus, write this as a third argument: `type_ids=type_ids`.

In [16]:
stopwords = preprocessing.list_mfw(document_term_matrix, most_frequent_tokens=100, type_ids=type_ids) #, type_ids=type_ids

<i>These are the ten most frequent words:</i>

In [17]:
stopwords[:10]

['di', 'il', 'che', 'la', 'essere', 'non', 'in', 'un', 'per', 'avere']

#### List hapax legomena

Again, if you created a large matrix, you have to add/change a few arguments.<br>
For `preprocessing.find_hapax_legomena` just add `type_ids`. To find out the total number of types use `len(type_ids)` instead of `document_term_matrix.shape[1]`.<br>
    
__For a large corpus:__    


In [18]:
hapax_legomena = preprocessing.find_hapax_legomena(document_term_matrix, type_ids) 
print("Total number of types in corpus:", len(type_ids))
print("Total number of hapax legomena:", len(hapax_legomena))

Total number of types in corpus: 95346
Total number of hapax legomena: 49891


__For a small corpus:__

In [22]:
#hapax_legomena = preprocessing.find_hapax_legomena(document_term_matrix) 
#print("Total number of types in corpus:", document_term_matrix.shape[1]) 
#print("Total number of hapax legomena:", len(hapax_legomena))

#### Optional: Use external stopwordlist

In [19]:
path_to_stopwordlist = Path(data, 'stopwords', language+'.txt')
external_stopwords = [line.strip() for line in path_to_stopwordlist.open('r', encoding='utf-8')]
external_stopwords[:20]

['a',
 'abbastanza',
 'abbia',
 'abbiamo',
 'abbiano',
 'abbiate',
 'accidenti',
 'ad',
 'adesso',
 'affinche',
 'agl',
 'agli',
 'ahime',
 'ahimã¨',
 'ahimè',
 'ai',
 'al',
 'alcuna',
 'alcuni',
 'alcuno']

#### Combine lists and remove content from `tokenized_corpus` 
__Add `type_ids=type_ids` if you have a large corpus.__

In [20]:
features = stopwords + hapax_legomena + external_stopwords
clean_tokenized_corpus = list(preprocessing.remove_features(features, tokenized_corpus=tokenized_corpus, type_ids=type_ids)) #, type_ids=type_ids

Save the features list to remove them as stop words for the word clouds in the visualization workflow.  

In [21]:
total_stopwords = ' '.join(features)
with open(data+'/stopwords/'+ language+'_features.txt', 'w+', encoding='utf-8') as f:
    f.write(total_stopwords)

## 2. Model creation

#### Path to MALLET folder 

<i>Now we must tell the library where to find the local instance of MALLET. If you managed to install MALLET, it is sufficient to set `path_to_mallet = 'mallet'`. If you store MALLET in a local folder, you have to specify the path to the binary explicitly (e.g. `path_to_mallet = 'C:/mallet-2.0.8/bin/mallet'`).

**Whitespaces are not allowed in the path!**</i>

MALLET has to be installed directly under C:. __If you are using Linux__, you don't need the '.bat' extension; __otherwise you have to specify it__.
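Before creating the `Mallet` object, it can help to check the path for the two problems mentioned above. The sketch below is a hypothetical helper (`check_mallet_path` is not part of `dariah_topics`) that uses `shutil.which` from the standard library to test whether the executable can be found, and additionally looks for whitespace in the path.

```python
import shutil


def check_mallet_path(path_to_mallet):
    """Return a list of problems found with the given MALLET path (empty if none)."""
    problems = []
    if any(character.isspace() for character in path_to_mallet):
        problems.append('path contains whitespace')
    # shutil.which returns None if no matching executable is found
    if shutil.which(path_to_mallet) is None:
        problems.append('executable not found')
    return problems


print(check_mallet_path('no such dir/mallet'))
```

An empty list means that neither problem was detected; it does not guarantee that MALLET itself runs correctly.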

In [22]:
# Without the '.bat' extension (e.g. on Linux):
# path_to_mallet = 'C:/mallet-2.0.8/bin/mallet'

# With the '.bat' extension (Windows):
path_to_mallet = 'C:/mallet-2.0.8/bin/mallet.bat'

### 2.1. Create `Mallet` object

<i>Finally, we can instantiate the `Mallet` object.</i>

In [23]:
Mallet = utils.Mallet(path_to_mallet)

In [28]:
help(Mallet)

Help on Mallet in module dariah_topics.utils object:

class Mallet(builtins.object)
 |  Mallet(executable='mallet', corpus_output=None, logfile=False)
 |  
 |  Python wrapper for MALLET.
 |  
 |  With this class you can call the command-line tool `MALLET <http://mallet.cs.umass.edu/topics.php>`_     from within Python.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, executable='mallet', corpus_output=None, logfile=False)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  call_mallet(self, command, **kwargs)
 |      Calls the command-line tool MALLET.
 |      
 |      With this function you can call `MALLET <http://mallet.cs.umass.edu/topics.php>`_         using a specific ``command`` (e.g. ``train-topics``) and its parameters.
 |      **Whitespaces (especially for Windows users) are not allowed in paths.**
 |      
 |      Args:
 |          command (str): A MALLET command, this could be ``import-dir`` (load
 |              the contents of a directory

<i>The object `Mallet` has a method `import_tokenized_corpus()` to create a specific corpus file for MALLET.</i>

Adjust the `meta` argument with the metadata of your choice. Be careful to include a distinctive property; otherwise documents with identical labels will overwrite each other.

In [24]:
mallet_corpus = Mallet.import_tokenized_corpus(clean_tokenized_corpus, meta['year']+'_'+meta['periodical']+'_'+meta['author']+'_'+meta['volume']+'_'+meta['issue']+'_'+meta['id']+'_'+meta['chunk']) 

In [30]:
help(Mallet.import_tokenized_corpus)

Help on method import_tokenized_corpus in module dariah_topics.utils:

import_tokenized_corpus(tokenized_corpus, document_labels, **kwargs) method of dariah_topics.utils.Mallet instance
    Creates MALLET corpus model.
    
    With this function you can import a ``tokenized_corpus`` to create the         MALLET corpus model. The MALLET command for this step is ``import-dir``         with ``--keep-sequence`` (which is already defined in the function, so         you don't have to), but you have the ability to specify all available         parameters. The output will be saved in ``output_corpus``.
    
    Args:
        tokenized_corpus (list): Tokenized corpus containing one or more
            iterables containing tokens.
        document_labels (list): Name of each `tokenized_document` in `tokenized_corpus`.
        encoding (str): Character encoding for input file. Defaults to UTF-8.
        token_regex (str): Divides documents into tokens using a regular
            expression (supp

<i>Furthermore, `Mallet` has the method `train_topics()` to create and train the LDA model. To create an LDA model, a couple of parameters have to be specified.

But first, if you are curious about any library, module, class or function, try `help()`. This can be very useful, because (at least in a well documented library) explanations of use and parameters will be printed. We're interested in the function `Mallet.train_topics()` in the module `dariah_topics.mallet`, so let's try:

```
help(Mallet.train_topics)
```

This will print something like this (in fact even more):

```
Help on method train_topics in module dariah_topics.mallet:

train_topics(mallet_binary, **kwargs) method of dariah_topics.mallet.Mallet instance
    Args:
        input_model (str): Absolute path to the binary topic model created by `output_model`.
        output_model (str): Write a serialized MALLET topic trainer object.
            This type of output is appropriate for pausing and restarting training,
            but does not produce data that can easily be analyzed.
        output_topic_keys (str): Write the top words for each topic and any
            Dirichlet parameters to file.
        topic_word_weights_file (str): Write unnormalized weights for every
            topic and word type.
        word_topic_counts_file (str): Write a sparse representation of topic-word
            assignments. By default this is null, indicating that no file will
            be written.
        output_doc_topics (str): Write the topic proportions per document, at
            the end of the iterations.
        num_topics (int): Number of topics. Defaults to 10.
        num_top_words (int): Number of keywords for each topic. Defaults to 10.
        num_iterations (int): Number of iterations. Defaults to 1000.
        num_threads (int): Number of threads for parallel training.  Defaults to 1.
        num_icm_iterations (int): Number of iterations of iterated conditional
            modes (topic maximization).  Defaults to 0.
        no_inference (bool): Load a saved model and create a report. Equivalent
            to `num_iterations = 0`. Defaults to False.
        random_seed (int): Random seed for the Gibbs sampler. Defaults to 0.
        optimize_interval (int): Number of iterations between reestimating
            dirichlet hyperparameters. Defaults to 0.
        optimize_burn_in (int): Number of iterations to run before first
            estimating dirichlet hyperparameters. Defaults to 200.
        use_symmetric_alpha (bool): Only optimize the concentration parameter of
            the prior over document-topic distributions. This may reduce the
            number of very small, poorly estimated topics, but may disperse common
            words over several topics. Defaults to False.
        alpha (float): Sum over topics of smoothing over doc-topic distributions.
            alpha_k = [this value] / [num topics]. Defaults to 5.0.
        beta (float): Smoothing parameter for each topic-word. Defaults to 0.01.
```

So, now you know how to define the number of topics and the number of sampling iterations as well. A higher number of iterations will probably yield a better model, but also increases processing time. `alpha` and `beta` are so-called *hyperparameters*. They influence the model's performance, so feel free to play around with them. In the present example, we will leave the default values. Furthermore, there exist various methods for hyperparameter optimization, e.g. grid search or Bayesian (Gaussian-process-based) optimization.

**Warning: This step can take quite a while!** Meaning something between some seconds and some hours depending on corpus size and the number of iterations. Our example corpus should be done within a minute or two at `num_iterations=1000`.</i>
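The help text above states that `alpha` is the sum over all topics and that each topic receives `alpha_k = alpha / num_topics`. As a small worked example, assuming the default `alpha` of 5.0 and the 22 topics used further down in this notebook:

```python
alpha = 5.0       # default value according to the help text
num_topics = 22   # number of topics used for the DiSpecs model below

# Per-topic smoothing value: the sum alpha is spread evenly over all topics
alpha_k = alpha / num_topics
print(round(alpha_k, 4))
```

Note that this is only the starting value: according to the help text, setting `optimize_interval` to a value greater than 0 makes MALLET re-estimate the Dirichlet hyperparameters during training.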

<i>First, create an output folder</i>. Set the variables `num_topics`, `num_iterations` and `optimize_interval` you will use for modelling to also generate the file and directory names.  

In [27]:
num_topics=22
num_iterations=2000
optimize_interval=20

output = data + '/output/Dariah_IntroducingMallet/'+language+'/'+dt_string.split('-')[0]+'_n'+str(num_topics)+'_i'+str(num_iterations)+'_opt'+str(optimize_interval)+'_paragr'
if not os.path.exists(output):
    os.makedirs(output)
output

'Y:/data/projekte/dispecs/TopicModeling/output/Dariah_IntroducingMallet/it/20210414_n22_i2000_opt20_paragr'

Now we use the variable `dt_string` to include the information in our file and directory names.

If you use the version we suggest, you might want to leave out the `%%time` command, since it causes some problems in jupyter notebook.

In [28]:
# Original call from the Topics notebook, kept for reference:
#
# %%time
#
# Mallet.train_topics(mallet_corpus,
#                     output_topic_keys=str(Path(output, 'topic_keys.txt')),
#                     output_doc_topics=str(Path(output, 'doc_topics.txt')),
#                     num_topics=10,
#                     num_iterations=1000)
#
# Additional output options, quoted from the help text:
#   topic_word_weights_file (str): Write unnormalized weights for every
#       topic and word type.
#   word_topic_counts_file (str): Write a sparse representation of topic-word
#       assignments. By default this is null, indicating that no file will
#       be written.

#%%time
#num_topics=25
#num_iterations=2000
#optimize_interval=20

# The file names encode the timestamp and the model parameters
topic_keys_output = dt_string + '_' + 'topic_keys' + '_n' + str(num_topics) + '_i' + str(num_iterations) + '_opt' + str(optimize_interval) + '.txt'
doc_topics_output = dt_string + '_' + 'doc_topics' + '_n' + str(num_topics) + '_i' + str(num_iterations) + '_opt' + str(optimize_interval) + '.txt'
topic_word_weights_file = dt_string + '_' + 'topic_word_weights' + '_n' + str(num_topics) + '_i' + str(num_iterations) + '_opt' + str(optimize_interval) + '.txt'
#word_topic_counts_file = dt_string + '_' + 'word_topic_counts' + '_n' + str(num_topics) + '_i' + str(num_iterations) + '_opt' + str(optimize_interval) + '.txt'

Mallet.train_topics(mallet_corpus,
                    output_topic_keys=str(Path(output, topic_keys_output)),
                    output_doc_topics=str(Path(output, doc_topics_output)),
                    num_topics=num_topics,
                    num_iterations=num_iterations,
                    optimize_interval=optimize_interval,
                    topic_word_weights_file=str(Path(output, topic_word_weights_file)),
                    )


In [29]:
str(Path(output, topic_word_weights_file))

'Y:\\data\\projekte\\dispecs\\TopicModeling\\output\\Dariah_IntroducingMallet\\it\\20210414_n22_i2000_opt20_paragr\\20210414-1213_topic_word_weights_n22_i2000_opt20.txt'
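Since all output file names follow the same scheme (timestamp, kind of output, model parameters), the string concatenation could also be wrapped in a small function. This is only a sketch of an alternative; `build_output_name` is a hypothetical helper and not part of the notebook or of `dariah_topics`.

```python
def build_output_name(timestamp, kind, num_topics, num_iterations, optimize_interval, suffix='.txt'):
    """Build a file name such as '<timestamp>_<kind>_n<topics>_i<iterations>_opt<interval>.txt'."""
    return f'{timestamp}_{kind}_n{num_topics}_i{num_iterations}_opt{optimize_interval}{suffix}'


# Values taken from the run shown in this notebook
name = build_output_name('20210414-1213', 'topic_word_weights', 22, 2000, 20)
print(name)
```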

<i>If you are curious about MALLET's logging, have a look at the file `mallet.log`, which should have been created in the same directory as your notebook.</i>

__Save__ the final (clean) document term list for future analysis and visualization.

In [30]:
matrix_path = str(output).replace('\\', '/') + '/' + dt_string + '_dtl.pkl'

# Sum the token counts of all cleaned documents into one corpus-wide Counter
res = Counter()
for tokens in clean_tokenized_corpus:
    res.update(tokens)

# Store the frequencies as a plain dict; the file is closed automatically
with open(matrix_path, 'wb') as f:
    pickle.dump(dict(res), f)

print('Document term list saved as pickle in ' + matrix_path)


Document term list saved as pickle in Y:/data/projekte/dispecs/TopicModeling/output/Dariah_IntroducingMallet/it/20210414_n22_i2000_opt20_paragr/20210414-1213_dtl.pkl


__Save__ a reduced version of the topic_word_weights_file.

In [31]:
weights_path = Path(output, topic_word_weights_file)
reduced_path = Path(output,
                    dt_string + '_' + 'topic_word_weights' + '_n' + str(num_topics)
                    + '_i' + str(num_iterations) + '_opt' + str(optimize_interval)
                    + '_reduced.txt')

# Copy only the lines that do not contain a tab followed by '0.0',
# i.e. drop entries whose weight is written with a leading 0.0
with weights_path.open('r', encoding='utf8') as a_file, \
        reduced_path.open('w', encoding='utf8') as new_file:
    for line in a_file:
        if '\t0.0' not in line:
            new_file.write(line)
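The cell above reduces the file with a simple substring test. If you prefer an explicit numeric cut-off, the lines can be parsed instead. The sketch below assumes that each line of the weights file has three tab-separated fields (topic number, word, weight); the sample lines and the threshold of 0.1 are invented for illustration, and `filter_weights` is a hypothetical helper.

```python
import io

# Invented sample in the assumed format: topic <TAB> word <TAB> weight
sample = io.StringIO('0\tcasa\t12.01\n0\tonda\t0.01\n1\tcasa\t0.01\n1\tteatro\t7.01\n')


def filter_weights(lines, threshold=0.1):
    """Yield (topic, word, weight) for all entries with weight >= threshold."""
    for line in lines:
        topic, word, weight = line.rstrip('\n').split('\t')
        if float(weight) >= threshold:
            yield int(topic), word, float(weight)


kept = list(filter_weights(sample))
print(kept)
```

Parsing the weight as a number avoids accidental matches on the word column, at the cost of failing loudly if a line does not have exactly three fields.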

### 2.4. Create document-topic matrix

<i>The generated model can now be translated into a human-readable document-topic matrix (which is actually a pandas data frame) that constitutes our principal exchange format for topic modeling results. To generate the matrix from the MALLET output files, we can use the following functions:</i>

In [32]:
topics = postprocessing.show_topics(topic_keys_file=str(Path(output, topic_keys_output)))
topics

Unnamed: 0,Key 0,Key 1,Key 2,Key 3,Key 4,Key 5,Key 6,Key 7,Key 8,Key 9,Key 10,Key 11,Key 12,Key 13,Key 14,Key 15,Key 16,Key 17,Key 18,Key 19
Topic 0,casa,mano,città,notte,passare,cavallo,fossa,povero,correre,onda,mettere,fuoco,servire,collare,morire,mattina,persona,abitare,uscire,portare
Topic 1,duc,casa,calle,pago,parlare,venezia,librajo,affittare,colombani,bottega,caffè,chiave,ponte,applicare,paolo,contrada,vol,signor,genova,milano
Topic 2,patron,banco,bar,ducati,pieligo,balle,oglio,cai,fag,ducato,soldi,manco,bottega,capitan,pelle,oro,zara,nominato,cassa,scudi
Topic 3,moda,bianco,donna,colore,roso,nero,nastro,grosso,colorire,portare,vestire,tavola,cappello,verde,abitare,guarnire,collare,piccolo,testare,formare
Topic 4,scrivere,signore,lingua,lettera,autore,italia,librare,bue,parlare,chiamare,quell,don,frusta,libro,leggere,aristarco,leggitori,parola,secolo,poetare
Topic 5,mondare,mano,occhio,credere,uscire,parola,fossa,volto,entrare,cuore,animare,pensiero,puntare,donna,ancorare,lasciare,capere,parlare,cominciare,scrivere
Topic 6,teatro,signor,pubblico,sera,opera,città,foglio,cantare,meritare,concorrere,musico,musica,celebre,lettera,ballo,scrivere,applauso,spettacolo,teatri,signora
Topic 7,acqua,vino,medico,aria,animale,usare,malattia,caffè,corpo,terra,medicina,volto,piccolo,parto,odore,osservare,febbre,sangue,mangiare,produrre
Topic 8,credere,piacere,persona,spirito,volto,ragione,virtù,meritare,animare,rendere,cuore,pensare,parlare,solere,ancorare,amico,lasciare,amore,menare,provare
Topic 9,natura,antico,nazione,bello,terra,scienza,arte,pensare,secolo,utile,solere,numerare,leggere,necessario,società,nascere,popolare,storia,ragione,esempio
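
To make the table above less opaque, the following sketch shows one way a topic-keys file could be read by hand. It is not the implementation of `postprocessing.show_topics`. It assumes that each line of the file has the tab-separated layout `topic id<TAB>alpha<TAB>space-separated keys`. The function name `read_topic_keys`, the parameter `num_label_keys`, and the alpha values in the sample are hypothetical. The keys are the first five of Topic 0 and Topic 1 from the output above.

```python
import io


def read_topic_keys(handle, num_label_keys=3):
    """Return a dict of topic keys and a list of short labels, one per topic."""
    topics = {}
    labels = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # Split into at most three fields: id, alpha, and the key string.
        topic_id, _alpha, keys = line.split("\t", 2)
        words = keys.split()
        topics[f"Topic {topic_id}"] = words
        # A short label built from the first few keys of the topic.
        labels.append(" ".join(words[:num_label_keys]))
    return topics, labels


# Sample lines; the alpha value 0.5 is a placeholder, not a model result.
sample = io.StringIO(
    "0\t0.5\tcasa mano città notte passare\n"
    "1\t0.5\tduc casa calle pago parlare\n"
)
topics, labels = read_topic_keys(sample)

print(labels)
```

Labels of this form, built from the first three keys of each topic, are what appear as row names in the document-topic matrix.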


In [33]:
document_topics = postprocessing.show_document_topics(topics=topics,
                                                      doc_topics_file=str(Path(output, doc_topics_output)))
document_topics[:30]

Unnamed: 0,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-000_09A-399_0000,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-0651_09A-398_0000,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0000,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0001,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0002,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-102_096-283_0000,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-102_096-283_0001,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-102_096-283_0002,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-103_096-284_0000,1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-103_096-284_0001,...,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-70_117-1148_0001,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-71_117-1149_0000,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-72_117-1150_0000,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-72_117-1150_0001,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-73_117-1151_0000,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-74_117-1152_0000,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-75_117-1153_0000,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-75_117-1153_0001,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-76_117-1154_0000,1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-77_117-1155_0000
casa mano città,0.0352,0.003,0.0396,0.1535,0.0018,0.0011,0.0061,0.0013,0.0315,0.0485,...,0.0115,0.1598,0.081,0.1207,0.2446,0.1226,0.0643,0.1279,0.0449,0.0765
duc casa calle,0.0005,0.0006,0.0002,0.0002,0.0003,0.0002,0.0002,0.0003,0.0002,0.0002,...,0.0004,0.0002,0.0002,0.0004,0.0002,0.0003,0.0002,0.0003,0.0273,0.0002
patron banco bar,0.0001,0.0001,0.0001,0.0,0.0001,0.0001,0.0001,0.0001,0.0,0.0001,...,0.0001,0.0,0.0,0.0001,0.0001,0.0001,0.0001,0.0001,0.0,0.0
moda bianco donna,0.0004,0.0676,0.0484,0.0917,0.0003,0.0002,0.0002,0.0002,0.0002,0.0002,...,0.0003,0.0001,0.0001,0.0003,0.0002,0.0003,0.0002,0.0003,0.0137,0.0001
scrivere signore lingua,0.0007,0.0277,0.0003,0.0874,0.1301,0.0003,0.0003,0.0004,0.0003,0.0004,...,0.0006,0.0003,0.0185,0.0097,0.0003,0.0005,0.0295,0.0005,0.024,0.0002
mondare mano occhio,0.0021,0.0026,0.0636,0.0009,0.0583,0.0009,0.001,0.0012,0.0009,0.0012,...,0.199,0.0741,0.0663,0.084,0.0243,0.0441,0.181,0.247,0.0448,0.0007
teatro signor pubblico,0.0015,0.0152,0.0151,0.079,0.0335,0.0152,0.0007,0.0008,0.0006,0.0008,...,0.0013,0.0087,0.0223,0.0012,0.0006,0.001,0.0007,0.0009,0.0005,0.0005
acqua vino medico,0.0008,0.1621,0.0004,0.0483,0.0006,0.0004,0.0004,0.0005,0.0003,0.0005,...,0.0007,0.0003,0.0003,0.0007,0.0097,0.0005,0.0004,0.0005,0.0003,0.0147
credere piacere persona,0.1144,0.0062,0.0504,0.002,0.0766,0.2977,0.1769,0.3835,0.3201,0.3921,...,0.0138,0.1404,0.1037,0.0134,0.1052,0.0745,0.1142,0.0033,0.0491,0.0629
natura antico nazione,0.2541,0.0302,0.0012,0.0011,0.002,0.0012,0.0062,0.0015,0.0011,0.0015,...,0.0211,0.001,0.0847,0.1027,0.0105,0.0587,0.0061,0.0017,0.0484,0.0045
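
A common first step with such a matrix is to look up the strongest topic for each document. The sketch below does this with `DataFrame.idxmax` on a small excerpt. The values are copied from the first two document columns of the output above, restricted to five of the ten topics. The variable names `sample` and `summary` are hypothetical. With the full matrix, the same two calls can be applied to `document_topics` directly, assuming it is a pandas data frame with topics as rows and documents as columns, as shown above.

```python
import pandas as pd

# First two document columns from the output above.
doc_a = "1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-000_09A-399_0000"
doc_b = "1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-0651_09A-398_0000"

# Excerpt of the document-topic matrix: topics as rows, documents as columns.
sample = pd.DataFrame(
    {
        doc_a: [0.0352, 0.0004, 0.0008, 0.1144, 0.2541],
        doc_b: [0.0030, 0.0676, 0.1621, 0.0062, 0.0302],
    },
    index=[
        "casa mano città",
        "moda bianco donna",
        "acqua vino medico",
        "credere piacere persona",
        "natura antico nazione",
    ],
)

# For each document (column), find the topic (row label) with the highest
# weight, together with that weight.
summary = pd.DataFrame(
    {"dominant_topic": sample.idxmax(), "weight": sample.max()}
)

print(summary)
```

In this excerpt the first document is dominated by "natura antico nazione" and the second by "acqua vino medico".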
