# Topic modeling for EHRI testimonies

Michal Frankl

This notebook performs basic topic modeling on testimonies published in the [EHRI Edition of Early Holocaust Testimony](https://early-testimony.ehri-project.eu/). It is meant only as a first, indicative step: a proof of concept for a potential wider study of Holocaust testimonies.

Most of the code below was adapted from the Jupyter notebook by Mike Bryant and Maria Dermentzi, [Exploratory Topic Modelling Using Python](https://github.com/EHRI/ehri-data-analysis-tools/blob/master/topic-modelling-python/USHMM_Oral_Testimonies_Topic_Modelling.ipynb), without much further fine-tuning or exploration. Please see that notebook for further explanations and considerations. At its current stage, this script uses only the English versions of the testimonies, many of which are translations.

## Install and import the necessary Python modules

In [32]:
!pip install pandas
!pip install tei-reader
!pip install gensim
!pip install pyLDAvis
!pip install requests
!pip install spacy
!pip install nltk
!pip install pyarrow





In [33]:
import os
from tei_reader import TeiReader
import pandas as pd
import re
import requests
import numpy as np
import spacy
!python3 -m spacy download en_core_web_sm
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from warnings import filterwarnings
filterwarnings('ignore')
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


[nltk_data] Downloading package stopwords to /home/michal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Lemmatisation and corpus cleaning

Copied from Dermentzi/Bryant; for some languages, spaCy would have to be replaced by another solution (likely nltk).

In [35]:
# Load the English pipeline without the components we do not need
nlp = spacy.load("en_core_web_sm", exclude=["ner", "parser"])
nlp.max_length = 3000000  # Allow long testimonies (spaCy's default limit is 1,000,000 characters)

# This function is inspired by Mattingly's (2021, 2022) tutorials on topic modelling

def lemmatise(text, allowed_postags=("NOUN",)):
    """
    Parse a document with the spaCy pipeline and return a list of lemmas.
    A token is kept only if it is a noun (or another allowed part of speech),
    is not in the stopword list, is not punctuation and consists only of
    alphabetic characters.
    """
    doc = nlp(text)  # "ner" and "parser" are already excluded when the pipeline is loaded
    return [
        token.lemma_
        for token in doc
        if token.pos_ in allowed_postags
        and token.lower_ not in stopwords
        and not token.is_punct
        and token.is_alpha
    ]

In [13]:
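# Rebind the name `stopwords` to the English stopword list.
# Going through nltk.corpus keeps this cell safe to re-run.
stopwords = set(nltk.corpus.stopwords.words("english"))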

In [28]:
testimony_dir = "early-testimony_ENG"

testimonies = []

# Read every TEI file in the directory, extract its plain text and lemmatise it
for filename in os.listdir(testimony_dir):
    if filename.endswith(".xml"):
        reader = TeiReader()
        corpora = reader.read_file(os.path.join(testimony_dir, filename))
        text = corpora.text.replace("\n", " ")
        testimonies.append({
            "filename": filename,
            "edition": filename.split('-')[1],  # e.g. "ET" in EHRI-ET-DEGOB0701_EN.xml
            "ltext": lemmatise(text)
        })

_For now, we use the tei_reader module to extract text from the TEI source files. A more robust solution, or perhaps preprocessing of the TEI files, might be applied later._
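One possible direction for such a more robust extraction is sketched below, using only the standard library. It is an illustration, not the method used in this notebook: it assumes that the files use the standard TEI namespace and that the testimony text sits in `<text><body>`, and the helper name `tei_body_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the files declare the standard TEI namespace
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_body_text(xml_string):
    """Return the whitespace-normalised text of <text><body>, skipping the teiHeader."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return ""
    # Joining the text pieces with spaces keeps adjacent paragraphs apart,
    # at the cost of occasionally splitting a word that contains inline markup.
    return " ".join(" ".join(body.itertext()).split())

sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><fileDesc><titleStmt><title>Header title</title></titleStmt></fileDesc></teiHeader>"
    "<text><body><p>First\n  paragraph</p><p>Second <hi>one</hi></p></body></text>"
    "</TEI>"
)
extracted = tei_body_text(sample)
print(extracted)
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. Restricting the extraction to the body keeps editorial metadata from the header out of the topic model.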

In [38]:
# testimonies

## Load into a Pandas dataframe

In [16]:
testimoniesdf = pd.DataFrame.from_dict(testimonies)

In [18]:
testimoniesdf.head()

|   | filename | edition | ltext |
|---|---|---|---|
| 0 | EHRI-ET-DEGOB0701_EN.xml | ET | [day, country, pack, husband, service, comrade... |
| 1 | EHRI-ET-ZIH3010025_EN.xml | ET | [née, voivodeship, education, year, school, pa... |
| 2 | EHRI-ET-DEGOB0748_EN.xml | ET | [father, brother, part, day, parent, week, bor... |
| 3 | EHRI-ET-WL05320171_EN.xml | ET | [city, city, delegate, detachment, day, decree... |
| 4 | EHRI-ET-ZIH3010317_EN.xml | ET | [daughter, mother, née, education, school, fam... |


## Build the vector for topic modeling

In [19]:
dictionary = Dictionary(documents=testimoniesdf['ltext'].to_list(), prune_at=None)
dictionary.filter_extremes(no_above=0.6, keep_n=None)  # Drop words found in more than 60% of the documents; the default no_below=5 also drops words found in fewer than 5 documents
dictionary.compactify() # Assign new ids to words
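
To make the effect of the pruning step more tangible, here is a plain-Python sketch of filtering by document frequency. It only illustrates the idea and is not gensim's implementation; the function name `prune_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter

def prune_by_document_frequency(docs, no_below=5, no_above=0.6):
    """Keep only tokens that occur in at least `no_below` documents
    and in at most `no_above` (a fraction) of all documents."""
    max_docs = int(no_above * len(docs))
    # Count in how many documents each token occurs (not how often overall)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keep = {token for token, n in doc_freq.items() if no_below <= n <= max_docs}
    return [[token for token in doc if token in keep] for doc in docs]

toy_docs = [
    ["day", "ghetto"],
    ["day", "train"],
    ["day", "ghetto"],
    ["day"],
    ["day", "border"],
]
# "day" occurs in all five documents and is too frequent;
# "train" and "border" occur in only one document each and are too rare.
pruned = prune_by_document_frequency(toy_docs, no_below=2, no_above=0.6)
print(pruned)
```

With only 125 documents in the corpus, the lower threshold matters as much as the upper one: any noun that appears in just a handful of testimonies is removed before training.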

In [20]:
temp = dictionary[0]  # This is only to "load" the dictionary
dictionary.id2token

{0: 'People',
 1: 'aid',
 2: 'arrival',
 3: 'beginning',
 4: 'block',
 5: 'body',
 6: 'building',
 7: 'call',
 8: 'car',
 9: 'chamber',
 10: 'chimney',
 11: 'clothe',
 12: 'cold',
 13: 'commando',
 14: 'comrade',
 15: 'country',
 16: 'death',
 17: 'doctor',
 18: 'document',
 19: 'effect',
 20: 'end',
 21: 'enemy',
 22: 'evening',
 23: 'fire',
 24: 'freight',
 25: 'gas',
 26: 'head',
 27: 'hell',
 28: 'home',
 29: 'hospital',
 30: 'husband',
 31: 'information',
 32: 'kitchen',
 33: 'leg',
 34: 'life',
 35: 'line',
 36: 'luggage',
 37: 'mattress',
 38: 'memory',
 39: 'mill',
 40: 'month',
 41: 'neighbourhood',
 42: 'noon',
 43: 'open',
 44: 'other',
 45: 'part',
 46: 'person',
 47: 'photo',
 48: 'pit',
 49: 'plant',
 50: 'pocket',
 51: 'policeman',
 52: 'prisoner',
 53: 'provision',
 54: 'punishment',
 55: 'quarantine',
 56: 'question',
 57: 'rail',
 58: 'railway',
 59: 'rain',
 60: 'reason',
 61: 'report',
 62: 'road',
 63: 'roll',
 64: 'room',
 65: 'selection',
 66: 'service',
 67: 'sh

In [21]:
corpus = [dictionary.doc2bow(doc) for doc in testimoniesdf['ltext'].to_list()]  # convert list of tokens to bag of word representation
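
The bag-of-words representation can be illustrated with a small standard-library sketch: each document becomes a list of `(token_id, count)` pairs, and tokens missing from the vocabulary are ignored. The helper names `build_vocabulary` and `to_bow` are hypothetical, and the ids are assigned in order of first appearance, which need not match the ids gensim assigns.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocabulary = {}
    for doc in docs:
        for token in doc:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_bow(doc, vocabulary):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocabulary[token] for token in doc if token in vocabulary)
    return sorted(counts.items())

toy_docs = [["ghetto", "house", "ghetto"], ["train", "house"]]
vocabulary = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, vocabulary) for doc in toy_docs]
print(toy_corpus)
```

Word order is discarded in this representation; the model only sees how often each noun occurs in each testimony.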

In [22]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 1036
Number of documents: 125


## Train the model

For now, I have opted for six topics and have not otherwise tinkered with the parameters of the model; this remains to be done.

In [23]:
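# Set training parameters. (Řehůřek, 2022b)
num_topics = 6  # The number of topics
passes = 20  # The number of times the algorithm will go through the entire corpus
iterations = 400  # The maximum number of inference iterations per document
chunksize = 50  # The number of documents processed in each training chunk
eval_every = None  # Do not estimate perplexity during training (this speeds training up)
random_state = 0  # This is used to make this process reproducible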

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model_6_topics = LdaModel(
    corpus=corpus,
    id2word=id2word,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    chunksize=chunksize,
    eval_every=eval_every,
    random_state=random_state
)

In [24]:
model_6_topics.print_topics(num_words=30)

[(0,
  '0.078*"city" + 0.059*"shtetl" + 0.032*"street" + 0.021*"store" + 0.018*"population" + 0.018*"house" + 0.017*"fire" + 0.016*"community" + 0.016*"order" + 0.015*"synagogue" + 0.015*"refugee" + 0.014*"victim" + 0.012*"bomb" + 0.011*"foot" + 0.011*"morning" + 0.011*"route" + 0.011*"basement" + 0.010*"crowd" + 0.010*"train" + 0.010*"money" + 0.009*"apartment" + 0.009*"home" + 0.009*"commander" + 0.008*"individual" + 0.008*"decree" + 0.008*"building" + 0.007*"merchandise" + 0.007*"group" + 0.007*"resident" + 0.007*"part"'),
 (1,
  '0.032*"car" + 0.025*"border" + 0.023*"bread" + 0.022*"factory" + 0.021*"money" + 0.020*"train" + 0.019*"water" + 0.016*"thing" + 0.015*"worker" + 0.014*"brother" + 0.014*"month" + 0.014*"wagon" + 0.013*"clothe" + 0.012*"hour" + 0.012*"journey" + 0.011*"station" + 0.011*"boy" + 0.010*"piece" + 0.010*"lot" + 0.010*"potato" + 0.010*"evening" + 0.010*"pit" + 0.010*"sister" + 0.010*"soup" + 0.010*"morning" + 0.010*"foot" + 0.009*"guard" + 0.009*"village" + 0.00

In [25]:
top_topics = model_6_topics.top_topics(corpus)

# Average topic coherence is the sum of the coherences of all topics, divided by the number of topics.
# top_topics() uses the u_mass coherence measure by default.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.5243.
[([(0.016075734, 'transport'),
   (0.015271248, 'prisoner'),
   (0.010251147, 'order'),
   (0.009739055, 'home'),
   (0.008732267, 'number'),
   (0.00795466, 'office'),
   (0.007712003, 'ghetto'),
   (0.0072530783, 'house'),
   (0.0070971674, 'case'),
   (0.0070124683, 'war'),
   (0.0068565123, 'hour'),
   (0.006783877, 'death'),
   (0.0066802246, 'barrack'),
   (0.0065473905, 'life'),
   (0.0065277517, 'month'),
   (0.006146308, 'group'),
   (0.0061378665, 'mother'),
   (0.0059158076, 'thing'),
   (0.0058753435, 'course'),
   (0.005796325, 'block')],
  -0.72891670623474),
 ([(0.031666033, 'ghetto'),
   (0.02281146, 'daughter'),
   (0.020376636, 'husband'),
   (0.016730422, 'son'),
   (0.015529929, 'room'),
   (0.014975363, 'apartment'),
   (0.012832854, 'door'),
   (0.012555443, 'house'),
   (0.011523316, 'father'),
   (0.011039054, 'mother'),
   (0.010834783, 'parent'),
   (0.010650832, 'hand'),
   (0.010192903, 'brother'),
   (0.009804148, 'wife'),
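
To give a sense of what these coherence values measure, the following is a simplified plain-Python sketch of a UMass-style score: for each pair of a topic's top words, it takes the logarithm of how often the two words occur in the same document relative to how often the higher-ranked word occurs at all. The function name, the smoothing constant of 1 and the toy data are assumptions made for illustration; gensim's own implementation may differ in smoothing and other details.

```python
import math

def umass_coherence(top_words, documents, smoothing=1.0):
    """Simplified UMass-style coherence of one topic.
    Assumes every top word occurs in at least one document."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    pair_scores = []
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            pair_scores.append(math.log((joint + smoothing) / doc_freq(top_words[l])))
    return sum(pair_scores) / len(pair_scores)

toy_docs = [
    ["ghetto", "house"],
    ["ghetto", "house", "train"],
    ["train", "border"],
]
toy_topics = [["ghetto", "house"], ["train", "border"]]
toy_scores = [umass_coherence(topic, toy_docs) for topic in toy_topics]
# Average over topics, as in the cell above
toy_average = sum(toy_scores) / len(toy_topics)
print(toy_scores, toy_average)
```

Words that tend to appear in the same testimonies raise the score, so values closer to zero (or above it, with this smoothing) indicate topics whose top words co-occur more often.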


## Visualise the six topics using pyLDAvis

In [42]:
# pyLDAvis.enable_notebook()
vis = gensimvis.prepare(model_6_topics, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(vis, 'model_6_topics.html')
pyLDAvis.display(vis)

## Conclusion

This script is only a first step towards a broader exploration of (early and other) Holocaust testimonies using topic modeling. Even without any fine-tuning, the results seem encouraging, and the extracted topics appear relatively coherent and significant. Further analysis is needed.