# ETM example - Reddit (*r/depression*) dataset

This example shows how to run the ETM package on a dataset. The preprocessing steps that make the dataset usable are not covered in this document.

The preprocessed version was produced by tokenizing and lemmatizing the dataset, then removing stopwords and the least frequent terms.
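
The preprocessing itself is not reproduced here. As a rough illustration only, the steps listed above might look like the following standard-library sketch. The stopword list, the `min_count` threshold, and the function name `preprocess` are assumptions made for this example, and lemmatization is omitted because it needs an external library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real run would use a full list)
STOPWORDS = {"i", "a", "the", "and", "to", "of", "is", "am", "my", "so"}

def preprocess(raw_documents, min_count=2):
    """Tokenizes, removes stopwords, and drops terms seen fewer than min_count times."""
    # Lowercase the text and keep alphabetic tokens only
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in raw_documents]
    # Remove stopwords
    tokenized = [[t for t in doc if t not in STOPWORDS] for doc in tokenized]
    # Count term frequency over the whole corpus and drop the rarest terms
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]

sample = ["I am so tired of the exams.", "Exams make me tired and anxious."]
cleaned = preprocess(sample)
print(cleaned)
```

Each document comes out as a list of tokens, which matches the shape the preprocessed corpus is loaded in below (token lists joined with spaces).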

Dataset: https://www.kaggle.com/datasets/luizfmatos/reddit-english-depression-related-submissions - created from English-language submissions to [*r/depression*](https://www.reddit.com/r/depression/)

Embeddings: https://wikipedia2vec.github.io/wikipedia2vec/ - 300-dimensional enwiki skipgram embeddings, further preprocessed to remove entity tokens

## Imports and original/preprocessed corpus loading

In [4]:
import numpy, torch
numpy.set_printoptions(threshold=10000)
from embedded_topic_model.utils import preprocessing
from embedded_topic_model.models.etm import ETM
import json
import time

# Loads the original dataset from a JSON file. Each document must be a string made up of sentences.
# The original documents are used later to inspect the topics assigned to the test set.
original_corpus_file = 'data/_reddit-posts-gatherer-en.submissions_subset.json'
with open(original_corpus_file, 'r') as corpus:
    original_documents_raw = json.load(corpus)
original_documents = [document['body'] for document in original_documents_raw]

# Defines the test set of original documents as the last 100 Reddit posts
test_original = original_documents[-100:]

# Loads the preprocessed dataset. Each document body is a list of tokens, joined here into one string.
corpus_file = 'data/_reddit-posts-gatherer-en.submissions_subset.preprocessed.json'
with open(corpus_file, 'r') as corpus:
    documents_raw = json.load(corpus)
documents = [" ".join(document['body']) for document in documents_raw]
print(f'documents len: {len(documents)}')
# The last 100 documents are held out as the test set
train = documents[:-100]
print(f'train len: {len(train)}')
test = documents[-100:]
print(f'test len: {len(test)}')

documents len: 32113
train len: 32013
test len: 100


## Creating BOW and training

In [5]:
# Creates BOW representations of the train and test datasets. Note that the train vocabulary is used to produce 
# the BOW of the test dataset
vocabulary, train_dataset, _ = preprocessing.create_etm_datasets(train, min_df=0.005)
print(f'vocabulary len: {len(vocabulary)}')
preprocessed_test = preprocessing.create_bow_dataset(test, vocabulary)

vocabulary len: 1516
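
To make the role of `min_df` and of the shared vocabulary concrete, here is a standard-library sketch of the idea. It is not the package's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, and treating `min_df` as a minimum fraction of documents is an assumption based on the value `0.005` used above.

```python
from collections import Counter

def build_vocabulary(docs, min_df):
    """Keeps words that appear in at least a min_df fraction of the documents."""
    # Document frequency: count each word once per document
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    threshold = min_df * len(docs)
    return sorted(word for word, df in doc_freq.items() if df >= threshold)

def to_bow(doc, vocabulary):
    """Counts vocabulary words in a document; out-of-vocabulary words are ignored."""
    word_to_idx = {word: idx for idx, word in enumerate(vocabulary)}
    counts = Counter(word_to_idx[w] for w in doc.split() if w in word_to_idx)
    return dict(sorted(counts.items()))

train_docs = ["sleep night sleep", "night work", "work school night"]
vocab = build_vocabulary(train_docs, min_df=0.5)
# The test document is encoded with the train vocabulary only
test_bow = to_bow("school night night sleep", vocab)
print(vocab, test_bow)
```

Words that occur only in the test document, or that fall below the threshold in the training set, do not appear in the test representation. This is why the test BOW is built from the train vocabulary.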


In [6]:
# Produces an array of topics including word probabilities per topic
def get_topics_with_word_probabilities(idx_to_word, topic_word_dist):
    topics = []
    
    for i in range(len(topic_word_dist)):
        words_distribution = topic_word_dist[i].cpu().numpy()
        top_words_indexes = words_distribution.argsort()[-20:]
        descending_top_words_indexes = top_words_indexes[::-1]
        topic_words = [(words_distribution[idx], idx_to_word[idx]) for idx in descending_top_words_indexes]
        topics.append(topic_words)

    return topics

# Identifies the top-3 topics assigned to each document and pairs them with the document for further evaluation
def get_document_with_assigned_topic(docs, topic_doc_dist):
    docs_with_topics = []
    for idx, doc in enumerate(docs):
        # Computes the top-3 topics once and reuses both indices and probabilities
        top_topics = torch.topk(topic_doc_dist[idx], 3)
        assigned_topics = [int(i) for i in top_topics.indices]
        docs_with_topics.append((assigned_topics, top_topics.values, doc))
    return docs_with_topics

# Sets k (the number of topics) to 8. Like the preprocessing steps, this value was chosen after a series of experiments
k = 8
print(f'Starting training for k={k}...')

# Creates an ETM instance and fits it to the training dataset. The W2V embeddings are a preprocessed version of the
# Wikipedia2Vec skipgram embeddings with 300 dimensions.
# You can find the original embeddings here: https://wikipedia2vec.github.io/wikipedia2vec/pretrained/
# The preprocessing of the W2V embeddings consists of removing entity tokens, which were not
# necessary for this experiment
etm_instance = ETM(
    vocabulary,
    embeddings="./data/enwiki_20180420_300d_optimized_v1.w2v",
    num_topics=k,
    epochs=100,
    debug_mode=True,
)

etm_instance.fit(train_dataset)

Starting training for k=8...
Reading embeddings from word2vec file...
Topics before training: [['www', 'rd', 'crappy', 'load', 'med', 'feed', 'cold', 'cool', 'average', 'minimum'], ['medium', 'graduation', 'professional', 'information', 'grandparent', 'tl', 'land', 'talent', 'meaningful', 'switch'], ['pity', 'www', 'exam', 'vent', 'reddit', 'test', 'cheat', 'excuse', 'pass', 'impact'], ['bottom', 'top', 'deep', 'texte', 'wreck', 'scar', 'math', 'th', 'sweet', 'super'], ['roof', 'male', 'worker', 'texte', 'total', 'mix', 'female', 'wage', 'teacher', 'capable'], ['married', 'gf', 'term', 'refer', 'street', 'http', 'year', 'grow', 'road', 'mark'], ['reddit', 'sophomore', 'math', 'freshman', 'student', 'store', 'crack', 'weigh', 'graduation', 'unemployed'], ['pick', 'tonight', 'update', 'public', 'weekend', 'spring', 'campus', 'beg', 'overall', 'tl']]
Epoch 1 - Learning Rate: 0.005 - KL theta: 0.05 - Rec loss: 513.12 - NELBO: 513.17
Epoch 2 - Learning Rate: 0.005 - KL theta: 0.63 - Rec los

Epoch 51 - Learning Rate: 0.005 - KL theta: 3.78 - Rec loss: 453.93 - NELBO: 457.71
Epoch 52 - Learning Rate: 0.005 - KL theta: 3.81 - Rec loss: 453.89 - NELBO: 457.7
Epoch 53 - Learning Rate: 0.005 - KL theta: 3.8 - Rec loss: 453.85 - NELBO: 457.65
Epoch 54 - Learning Rate: 0.005 - KL theta: 3.77 - Rec loss: 453.85 - NELBO: 457.62
Epoch 55 - Learning Rate: 0.005 - KL theta: 3.78 - Rec loss: 453.84 - NELBO: 457.62
Epoch 56 - Learning Rate: 0.005 - KL theta: 3.79 - Rec loss: 453.77 - NELBO: 457.56
Epoch 57 - Learning Rate: 0.005 - KL theta: 3.79 - Rec loss: 453.75 - NELBO: 457.54
Epoch 58 - Learning Rate: 0.005 - KL theta: 3.8 - Rec loss: 453.72 - NELBO: 457.52
Epoch 59 - Learning Rate: 0.005 - KL theta: 3.8 - Rec loss: 453.72 - NELBO: 457.52
Epoch 60 - Learning Rate: 0.005 - KL theta: 3.79 - Rec loss: 453.71 - NELBO: 457.5
Topics: [['depression', 'help', 'anxiety', 'pain', 'medication', 'able', 'depress', 'suicidal', 'depressed', 'mental'], ['people', 'think', 'thing', 'know', 'see', '

<embedded_topic_model.models.etm.ETM at 0x7f068c177790>
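
In the log above, the reported NELBO equals the sum of the reconstruction loss and the KL term, up to rounding. The sketch below parses two of the logged lines and checks that relation. The regular expression and the `parse_epoch_line` helper are written for this example and are not part of the package.

```python
import re

# Two lines copied from the training log above
LOG_LINES = [
    "Epoch 1 - Learning Rate: 0.005 - KL theta: 0.05 - Rec loss: 513.12 - NELBO: 513.17",
    "Epoch 60 - Learning Rate: 0.005 - KL theta: 3.79 - Rec loss: 453.71 - NELBO: 457.5",
]

PATTERN = re.compile(
    r"Epoch (\d+) .*KL theta: ([\d.]+) - Rec loss: ([\d.]+) - NELBO: ([\d.]+)"
)

def parse_epoch_line(line):
    """Returns (epoch, kl_theta, rec_loss, nelbo) parsed from one log line."""
    epoch, kl, rec, nelbo = PATTERN.match(line).groups()
    return int(epoch), float(kl), float(rec), float(nelbo)

parsed = [parse_epoch_line(line) for line in LOG_LINES]
for epoch, kl, rec, nelbo in parsed:
    # Logged values are rounded to two decimals, so allow a small tolerance
    assert abs((kl + rec) - nelbo) < 0.02
    print(f"epoch {epoch}: KL + Rec = {kl + rec:.2f}, logged NELBO = {nelbo}")
```

Between the first and the last logged epoch, the reconstruction loss falls from 513.12 to 453.71 while the KL term rises from 0.05 to 3.79, so the drop in NELBO comes from the reconstruction term.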

## Training results and transformation/prediction

In [7]:
# Obtains the top-20 words for each topic
topic_words = etm_instance.get_topics(20)
print(f'topic_words: {topic_words}')

# Obtains the normalized topic word distribution
t_w_dist = etm_instance.get_topic_word_dist()
print(f't_w_dist: {t_w_dist}')

# Obtains the topics with assigned word probabilities
topics_with_word_probs = get_topics_with_word_probabilities(vocabulary, t_w_dist)
for idx, topic in enumerate(topics_with_word_probs):
    print(f'topic {idx}: {topic}\n')

# Obtains the normalized document topic distribution
d_t_dist = etm_instance.get_document_topic_dist()
print(f'd_t_dist: {d_t_dist}')

# train_d_t_dist = etm_instance.transform(train_dataset)
# print(f'train_d_t_dist: {train_d_t_dist}')

# Transforms the test dataset with the learned parameters. The output is the normalized document topic distribution 
# of the test dataset
test_d_t_dist = etm_instance.transform(preprocessed_test)
# print(f'test_d_t_dist: {test_d_t_dist}')

topic_words: [['depression', 'help', 'anxiety', 'need', 'problem', 'depress', 'depressed', 'pain', 'medication', 'mental', 'suicidal', 'cause', 'doctor', 'therapy', 'able', 'self', 'therapist', 'issue', 'well', 'experience'], ['people', 'thing', 'think', 'know', 'make', 'say', 'find', 'talk', 'seem', 'way', 'see', 'person', 'use', 'try', 'read', 'many', 'thought', 'look', 'lot', 'one'], ['day', 'go', 'get', 'time', 'take', 'leave', 'sleep', 'week', 'come', 'night', 'start', 'month', 'hour', 'bed', 'call', 'thing', 'eat', 'keep', 'stop', 'house'], ['life', 'good', 'time', 'well', 'think', 'work', 'problem', 'social', 'lot', 'hard', 'part', 'great', 'deal', 'lose', 'little', 'normal', 'happy', 'self', 'world', 'bad'], ['friend', 'tell', 'know', 'talk', 'want', 'parent', 'family', 'love', 'say', 'care', 'feel', 'live', 'girl', 'relationship', 'help', 'mom', 'life', 'make', 'one', 'see'], ['year', 'get', 'go', 'school', 'time', 'work', 'job', 'start', 'college', 'last', 'month', 'high', 'f
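
The helper functions defined earlier rank the entries of the topic-word and document-topic distributions. As a dependency-free illustration of that ranking step, the sketch below does the same on plain Python lists. The toy vocabulary and probabilities are made up for this example and do not come from the trained model, and `top_entries` is a hypothetical helper.

```python
def top_entries(distribution, n):
    """Returns (index, probability) pairs for the n largest entries, in descending order."""
    ranked = sorted(enumerate(distribution), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Toy values (assumptions for illustration only)
toy_vocabulary = ["sleep", "friend", "school", "doctor"]
toy_topic_word_dist = [0.1, 0.2, 0.6, 0.1]      # one topic over four words
toy_doc_topic_dist = [0.05, 0.50, 0.15, 0.30]   # one document over four topics

# Top-2 words of the topic, paired with their probabilities
top_words = [(toy_vocabulary[i], p) for i, p in top_entries(toy_topic_word_dist, 2)]
# Top-3 topic indices assigned to the document
top_topics = [i for i, _ in top_entries(toy_doc_topic_dist, 3)]
print(top_words)
print(top_topics)
```

The notebook does the same ranking with `argsort` on each topic-word row and with `torch.topk` on each document-topic row.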

## Test evaluation

In [10]:
# Shows the topic assignments, including probabilities, for up to 15 test documents
assignments = get_document_with_assigned_topic(test_original, test_d_t_dist)
for assignment in assignments[:15]:
    print(f'topics: {assignment[0]}, probs: {assignment[1]}, doc:\n{assignment[2]}\n')

topics: [3, 1, 5], probs: tensor([0.1864, 0.1828, 0.1529], device='cuda:0'), doc:
I am taking a biology course online as an elective credit at a California CC. This past week was the mid term, and to my surprise I flunked. All because I forgot to cite my sources and the teacher thought I plagiarized from a message board/article.

I have never been so aggravated, stressed, and utterly in awe of the hurdles and problems. I'm not a rocket scientist I am pretty smart though and am stuck between wanting to drop and just amazed it can be like this.

She sends me freaking copy and pasted website information not in APA format it's a freaking joke clearly plagiarized. What does the prof. say when I voice my opinion? "Well, the classes are structured around partners. Continue to focus on the material and work around your schedule so this doesn't happen. You need to spend more than 10-15 hours a week for material."


Huh!! I have a husband, 3 kids, and work full time and am supposed to wait until