## Hierarchical Topic Models

Two hierarchical topic models are available: **HTM-WS** and **HTM-DS**. Both keep the user in the loop. Instead of automatically creating subtopics for every topic, the user inspects the first-level model and chooses which topics to expand, which makes it easier to incorporate expert knowledge into the model.

Both methods start with a first-level (L1) topic model and then build the second level by focusing on one of the L1 topics. Here's how they work:

- **HTM-WS**: Creates new documents by keeping only the words related to the chosen topic.
- **HTM-DS**: Keeps only the documents in which the chosen topic's proportion reaches a user-supplied threshold.

The main difference is:

- **HTM-WS** assigns each word to just one subtopic, giving a clear, detailed breakdown of topics.
- **HTM-DS** allows different full documents to be included in different submodels, which can be useful for understanding how entire documents fit into subtopics.

In short, **HTM-WS** provides a detailed and precise breakdown of topics, while **HTM-DS** offers a way to explore how entire documents relate to subtopics, even if it's less precise.
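
The two corpus-construction strategies can be illustrated on a toy corpus. This is a minimal sketch, not the `HierarchicalTM` implementation: the function names `build_ws_corpus` and `build_ds_corpus` are hypothetical, each token is assumed to carry a single first-level topic assignment, and the topic proportion of a document is approximated by its share of tokens assigned to the chosen topic.

```python
from typing import Dict, List, Tuple

# A document is a list of (word, assigned_topic) pairs. This mimics
# per-token topic assignments from a first-level model (an assumption).
Doc = List[Tuple[str, int]]


def build_ws_corpus(docs: Dict[str, Doc], topic: int) -> Dict[str, List[str]]:
    """HTM-WS style: keep only the words assigned to `topic`."""
    corpus = {}
    for doc_id, tokens in docs.items():
        kept = [word for word, tpc in tokens if tpc == topic]
        if kept:  # documents with no word in the topic are dropped
            corpus[doc_id] = kept
    return corpus


def build_ds_corpus(docs: Dict[str, Doc], topic: int, thr: float) -> Dict[str, List[str]]:
    """HTM-DS style: keep whole documents whose share of `topic` is >= thr."""
    corpus = {}
    for doc_id, tokens in docs.items():
        if not tokens:
            continue
        share = sum(1 for _, tpc in tokens if tpc == topic) / len(tokens)
        if share >= thr:
            corpus[doc_id] = [word for word, _ in tokens]
    return corpus


docs = {
    "d1": [("heart", 1), ("attack", 1), ("clinic", 2), ("stroke", 1)],
    "d2": [("clinic", 2), ("patient", 2), ("heart", 1)],
    "d3": [("eye", 4), ("vision", 4)],
}

ws = build_ws_corpus(docs, topic=1)
ds = build_ds_corpus(docs, topic=1, thr=0.5)
print(ws)
print(ds)
```

In this toy case HTM-WS keeps fragments of both `d1` and `d2`, while HTM-DS keeps all of `d1` (including the off-topic word `clinic`) and discards `d2`, whose share of the topic falls below the threshold.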

In [1]:
import gzip
import os
import pathlib
import sys
import time

import pandas as pd
from termcolor import colored

# Repository root: three levels above the notebook's working directory
repo_root = pathlib.Path(os.path.abspath(os.path.join(os.getcwd(), '../../..')))
sys.path.append(str(repo_root))
print(repo_root)

from src.topic_modeling.hierarchical.hierarchical_tm import HierarchicalTM
from src.topic_modeling.polylingual_tm import PolylingualTM

# Paths to the MALLET binary and to the additional stopword lists
mallet_path = repo_root.joinpath("src/topic_modeling/Mallet-202108/bin/mallet").as_posix()
path_stops = repo_root.joinpath("src/topic_modeling/stops").as_posix()

/export/usuarios_ml4ds/lbartolome/Repos/umd/LinQAForge


In [2]:
lang_colors = {
    "EN": "red",
    "ES": "blue",
    # Add more languages and their colors as needed
}

## Father model 

A first-level multilingual topic model must be trained before the second level can be built. Although a single script could train both levels at once, the second level is better treated as an exploratory step that follows an inspection of the first-level topic model.

In [3]:
father_model = pathlib.Path("/export/usuarios_ml4ds/lbartolome/Repos/umd/LinQAForge/data/models/POLI_FILTERED_AL/rosie_1_20")

### Father model topics

We first present the topics in the parent model. These topics are common to both languages, with corresponding keys for each language. The topics are aligned, although the words in each topic are not literal translations between the languages.
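
Because the same key-loading logic recurs for the father model and for each submodel, it can be factored into a small helper. The sketch below assumes only the `mallet_output/keys_{lang}.txt` layout used in this notebook; the names `load_topic_keys` and `aligned_topics` are hypothetical, and the demo writes two tiny key files to a temporary folder rather than reading a real model.

```python
import pathlib
import tempfile
from typing import Dict, List, Tuple


def load_topic_keys(model_path: pathlib.Path, langs: List[str]) -> Dict[str, List[str]]:
    """Read one line of topic keys per topic for each language."""
    all_keys = {}
    for lang in langs:
        keys_file = model_path / "mallet_output" / f"keys_{lang}.txt"
        with keys_file.open("r", encoding="utf8") as fin:
            # Unlike the notebook cells, blank lines are skipped here
            all_keys[lang] = [line.strip() for line in fin if line.strip()]
    return all_keys


def aligned_topics(all_keys: Dict[str, List[str]], langs: List[str]) -> List[Tuple[int, Dict[str, str]]]:
    """Pair up the keys of each topic across languages by topic index."""
    n_topics = min(len(all_keys[lang]) for lang in langs)
    return [(i, {lang: all_keys[lang][i] for lang in langs}) for i in range(n_topics)]


# Demo on a throwaway folder with made-up keys (not a real model)
with tempfile.TemporaryDirectory() as tmp:
    model_dir = pathlib.Path(tmp)
    out_dir = model_dir / "mallet_output"
    out_dir.mkdir()
    (out_dir / "keys_EN.txt").write_text(
        "heart blood pressure\nclinic care hospital\n", encoding="utf8")
    (out_dir / "keys_ES.txt").write_text(
        "corazón sangre presión\nclínica atención hospital\n", encoding="utf8")

    demo_keys = load_topic_keys(model_dir, ["EN", "ES"])
    demo_pairs = aligned_topics(demo_keys, ["EN", "ES"])

for topic_idx, keys_by_lang in demo_pairs:
    print(topic_idx, keys_by_lang)
```

Pairing by index relies on the topics being aligned across languages, as they are in the polylingual model described above.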

In [4]:
# father model
father_model = pathlib.Path(father_model)
langs = ["EN", "ES"]

hmg = HierarchicalTM()

# Load topic-keys in each lang
all_keys = {}
for lang in langs:
    # Default to white if color is not defined
    color = lang_colors.get(lang, "white")
    print(colored("#" * 50, color))
    print(colored(f"-- -- Topic keys in {lang.upper()}: ", color))
    print(colored("-" * 50, color))
    keys = []
    with (father_model / f"mallet_output/keys_{lang}.txt").open('r', encoding='utf8') as fin:
        keys = [el.strip() for el in fin.readlines()]
    all_keys[lang] = keys
    for id, tpc in enumerate(keys):
        print(colored(f"- Topic {id}: {tpc}", color))
    print("\n")

##################################################
-- -- Topic keys in EN:
--------------------------------------------------
- Topic 0: blood cell body level protein acid gene technology normal produce function immune hormone result insulin sugar glucose lead urine
- Topic 1: heart study blood health pressure disease sleep participate healthy obesity cholesterol overweight attack stroke disorder failure carry organization lung
- Topic 2: clinic care hospital patient mayo center health team service medical offer pediatric treatment program family children specialist unit include
- Topic 3: provider care health doctor medical child treatment healthcare question visit professional talk diagnose check appointment condition follow symptom procedure
- Topic 4: eye injury bone muscle activity technology pain nerve exercise surgery joint physical spinal vision leg foot brain head arm
- Topic 5: surgery blood heart t

### Get input from the user on the construction of the 2nd level topic model

We need the following information from the user:

- The topic to be "expanded" to construct the 2nd level topic model
- The algorithm to be used for constructing the 2nd level topic model
- If using HTM-DS, the threshold value
- The number of training topics for the 2nd level model
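
The same four inputs can also be validated without interactive prompts, which is convenient when the notebook is run as a script. This is a minimal sketch: `SubmodelRequest` and `parse_submodel_request` are hypothetical names, and the check that the threshold lies in [0, 1] is an assumption based on it being compared against a topic proportion.

```python
from dataclasses import dataclass

VALID_METHODS = ("htm_ws", "htm_ds")


@dataclass
class SubmodelRequest:
    topic_id: int
    htm_version: str
    thr: float
    tr_tpcs: int


def parse_submodel_request(topic_id: str, htm_version: str, tr_tpcs: str,
                           n_topics: int, thr: str = "0.0") -> SubmodelRequest:
    """Convert raw string inputs and raise ValueError on anything invalid."""
    tid = int(topic_id)
    if not 0 <= tid < n_topics:
        raise ValueError(f"Topic id {tid} out of range [0, {n_topics - 1}].")
    if htm_version not in VALID_METHODS:
        raise ValueError(f"Invalid method: {htm_version!r}")
    # The threshold is only meaningful for HTM-DS
    threshold = float(thr) if htm_version == "htm_ds" else 0.0
    if not 0.0 <= threshold <= 1.0:  # assumed range for a topic proportion
        raise ValueError("Threshold must be between 0 and 1.")
    n_sub = int(tr_tpcs)
    if n_sub <= 0:
        raise ValueError("The number of training topics must be positive.")
    return SubmodelRequest(tid, htm_version, threshold, n_sub)


# Example with a hypothetical first-level model of 20 topics
request = parse_submodel_request("1", "htm_ws", "10", n_topics=20)
print(request)
```

Raising `ValueError` in every failure case lets the caller decide whether to exit, re-prompt, or log the problem.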

In [5]:
# Ask the user which first-level topic to expand
topic_id = input("Please select the topic you want to expand: ")
try:
    topic_id = int(topic_id)
    if topic_id < 0 or topic_id >= len(all_keys[langs[0]]):
        raise ValueError("Topic id out of range.")
    print(f"Selected Topic {topic_id}: {all_keys[langs[0]][topic_id]}")
    for lang in langs:
        color = lang_colors.get(lang, "white")
        print(
            colored(f"Keys in {lang}: {all_keys[lang][topic_id]}", color))
except ValueError as e:
    print(f"Invalid input: {e}")
    sys.exit()

# HTM version
htm_version = input("Please select the method you want to use (htm_ws/htm_ds): ")
if htm_version not in ["htm_ds", "htm_ws"]:
    raise ValueError("Invalid method")

# Threshold, only needed for HTM-DS
thr = 0.0
if htm_version == "htm_ds":
    thr = input("Please insert the threshold: ")
    try:
        thr = float(thr)
    except ValueError as e:
        print(f"Invalid input: {e}")
        sys.exit()

# Ask the user for the number of training topics of the submodel
tr_tpcs = input("Please select the number of training topics for the submodel: ")
try:
    tr_tpcs = int(tr_tpcs)
except ValueError as e:
    print(f"Invalid input: {e}")
    sys.exit()

Please select the topic you want to expand:  1


Selected Topic 1: heart study blood health pressure disease sleep participate healthy obesity cholesterol overweight attack stroke disorder failure carry organization lung
Keys in EN: heart study blood health pressure disease sleep participate healthy obesity cholesterol overweight attack stroke disorder failure carry organization lung
Keys in ES: cardíaco corazón presión arterial enfermedad participar año sueño sangre obesidad cabo ataque vida saludable colesterol alto sobrepeso insuficiencia riesgo


Please select the method you want to use (htm_ws/htm_ds):  htm_ws
Please select the number of training topics for the submodel:  10


### Train the model

In [None]:
submodel_path = hmg.create_submodel_tr_corpus(
    father_model_path=father_model,
    langs=langs,
    exp_tpc=topic_id,
    tr_topics=tr_tpcs,
    htm_version=htm_version,
    thr=thr)

# train model
start_time = time.time()
model = PolylingualTM(
    mallet_path=mallet_path,
    lang1=langs[0],
    lang2=langs[1],
    model_folder=submodel_path,
    num_topics=tr_tpcs,
    is_second_level=True,
    add_stops_path=path_stops
)
model.train()

end_time = time.time()
print(f"-- Model trained in {end_time - start_time} seconds")

### Topics at the 2nd-level for the expanded topic (HTM-DS)

Below are the topics obtained from training the HTM-DS model.

In [38]:
# Display topics of the submodel 
# Load topic-keys in each lang
submodel_ds_path = submodel_path

all_keys = {}
for lang in langs:
    # Default to white if color is not defined
    color = lang_colors.get(lang, "white")
    print(colored("#" * 50, color))
    print(colored(f"-- -- Topic keys in {lang.upper()}: ", color))
    print(colored("-" * 50, color))
    keys = []
    with (submodel_ds_path / f"mallet_output/keys_{lang}.txt").open('r', encoding='utf8') as fin:
        keys = [el.strip() for el in fin.readlines()]
    all_keys[lang] = keys
    for id, tpc in enumerate(keys):
        print(colored(f"- Topic {id}: {tpc}", color))
    print("\n")

##################################################
-- -- Topic keys in EN:
--------------------------------------------------
- Topic 0: food technology body eat weight activity increase level diet exercise diabetes avoid drink water reduce healthy day prevent sugar
- Topic 1: surgery technology skin eye pain brain bone tissue injury remove procedure muscle body infection occur damage tube joint image
- Topic 2: cancer cell disease technology risk people liver breast woman health affect factor syndrome chronic increase download diagnose hormone symptom
- Topic 3: study doctor participate health sleep blood organization carry surname disorder trial conduct tomography search people clinical magnetic transplant resonance
- Topic 4: care clinic health mayo medical service center doctor hospital professional patient team offer insurance appointment receive treat company book
- Topic 5: symptom medicine doctor trea

### Topics at the 2nd-level for the expanded topic (HTM-WS)

For comparison, we show the topics obtained by training an HTM-WS model for the same expansion topic.
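
One simple way to put numbers on such a comparison is the Jaccard overlap between the key sets of the two submodels. This is an illustrative sketch only, not part of the notebook's pipeline: `jaccard` and `best_matches` are hypothetical helpers, and the key strings below are shortened, made-up examples rather than real model output.

```python
from typing import List, Tuple


def jaccard(keys_a: str, keys_b: str) -> float:
    """Jaccard overlap between two whitespace-separated key strings."""
    set_a, set_b = set(keys_a.split()), set(keys_b.split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_matches(keys_a: List[str], keys_b: List[str]) -> List[Tuple[int, float]]:
    """For each topic in keys_a, find the most similar topic in keys_b."""
    matches = []
    for topic_keys in keys_a:
        scores = [jaccard(topic_keys, other) for other in keys_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Shortened, made-up key lists standing in for the two submodels
ws_keys = ["sleep study obesity apnea", "heart stroke attack cardiac"]
ds_keys = ["heart attack stroke risk", "food diet weight exercise"]

matches = best_matches(ws_keys, ds_keys)
for idx, (best, score) in enumerate(matches):
    print(f"WS topic {idx} -> DS topic {best} (overlap {score:.2f})")
```

A low best-match score suggests that a subtopic found by one method has no close counterpart in the other.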

In [7]:
# Display topics of the submodel 
# Load topic-keys in each lang
submodel_ws_path = submodel_path
all_keys = {}
for lang in langs:
    # Default to white if color is not defined
    color = lang_colors.get(lang, "white")
    print(colored("#" * 50, color))
    print(colored(f"-- -- Topic keys in {lang.upper()}: ", color))
    print(colored("-" * 50, color))
    keys = []
    with (submodel_ws_path / f"mallet_output/keys_{lang}.txt").open('r', encoding='utf8') as fin:
        keys = [el.strip() for el in fin.readlines()]
    all_keys[lang] = keys
    for id, tpc in enumerate(keys):
        print(colored(f"- Topic {id}: {tpc}", color))
    print("\n")

##################################################
-- -- Topic keys in EN:
--------------------------------------------------
- Topic 0: sleep study overweight obesity disorder phase apnea york obese rhythm participate health circadian conduct examine organization complication sufficient adult
- Topic 1: study participate carry maryland african american activity bethesda disease physical risk lung woman improve age aim develop black health
- Topic 2: study health atrial fibrillation participate blood week exercise trial cardiomyopathy cardiac participant national include evaluate carry tachycardia cardiovascular hypertrophic
- Topic 3: study blood sleep heart disease participate health organization lung disorder eligible loved condition sponsor healthy vessel nhlbi hour direct
- Topic 4: heart stroke attack disease cardiac coronary failure health rhythm artery risk arrhythmia healthy blood angina sudden ischemic beat