## Hierarchical Topic Models

Two types of hierarchical topic models are available: **HTM-WS** and **HTM-DS**. Both keep the user in the loop when deciding which topics to split further. Instead of automatically creating subtopics for every topic, the user inspects the initial model and chooses which topics to expand. This makes it easier to incorporate expert knowledge into the model.

Both methods start from a first-level (L1) topic model and build the second level by expanding one of the L1 topics. They differ in how the training corpus for the second level is built:

- **HTM-WS**: Creates new documents by keeping only the words assigned to the chosen topic.
- **HTM-DS**: Keeps only the documents in which the chosen topic has a large enough share (controlled by a threshold).

The main differences are:

- **HTM-WS** assigns each word to just one subtopic, giving a clear, detailed breakdown of topics.
- **HTM-DS** works with full documents, which can be included in different submodels. This is useful for understanding how entire documents fit into subtopics.

In short, **HTM-WS** gives a more precise breakdown of a topic, while **HTM-DS** shows how entire documents relate to subtopics at the cost of some precision.
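
The two corpus-construction strategies can be illustrated on toy data. The sketch below is not the `HierarchicalTM` implementation: the function names (`build_ws_corpus`, `build_ds_corpus`), the input structures (per-token topic assignments and per-document topic proportions from the L1 model), the `>=` comparison against the threshold, and the dropping of emptied documents are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical L1 output: each document is a list of (word, assigned_topic) pairs.
TokenDoc = List[Tuple[str, int]]


def build_ws_corpus(docs: List[TokenDoc], topic: int) -> List[List[str]]:
    """HTM-WS-style: keep only the words assigned to `topic`; drop emptied documents."""
    pruned = [[word for word, tpc in doc if tpc == topic] for doc in docs]
    return [doc for doc in pruned if doc]


def build_ds_corpus(docs: List[TokenDoc], thetas: List[List[float]],
                    topic: int, thr: float) -> List[List[str]]:
    """HTM-DS-style: keep whole documents whose share of `topic` is at least `thr`."""
    return [
        [word for word, _ in doc]
        for doc, theta in zip(docs, thetas)
        if theta[topic] >= thr
    ]


# Toy corpus with five L1 topics (0..4); values are made up for the example.
docs = [
    [("baby", 4), ("vaccine", 0), ("birth", 4)],
    [("surgery", 3), ("pregnancy", 4), ("skin", 3)],
    [("dose", 0), ("virus", 0)],
]
thetas = [
    [0.33, 0.0, 0.0, 0.0, 0.67],
    [0.0, 0.0, 0.0, 0.67, 0.33],
    [1.0, 0.0, 0.0, 0.0, 0.0],
]

ws_corpus = build_ws_corpus(docs, topic=4)
ds_corpus = build_ds_corpus(docs, thetas, topic=4, thr=0.5)
print(ws_corpus)  # [['baby', 'birth'], ['pregnancy']]
print(ds_corpus)  # [['baby', 'vaccine', 'birth']]
```

With word selection, the second document still contributes its single topic-4 word; with document selection, it is discarded entirely because topic 4 covers less than half of it, while the first document is kept whole, including the off-topic word.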

In [39]:
import pathlib
import gzip
from termcolor import colored
import pandas as pd
import sys
import os
import time

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../../..')))
print(os.path.abspath(os.path.join(os.getcwd(), '../../..')))

from src.topic_modeling.hierarchical.hierarchical_tm import HierarchicalTM
from src.topic_modeling.polylingual_tm import PolylingualTM

mallet_path = pathlib.Path(os.path.abspath(os.path.join(os.getcwd(), '../../..'))).joinpath("src/topic_modeling/Mallet-202108/bin/mallet").as_posix()

/export/usuarios_ml4ds/lbartolome/Repos/umd/LinQAForge


In [30]:
lang_colors = {
    "EN": "red",
    "ES": "blue",
    # Add more languages and their colors as needed
}

## Father model

A first-level multilingual topic model must be trained first. Although a single script could train both levels at once, the second level is better treated as an exploratory step that builds on the first-level topic model.

In [26]:
father_model = pathlib.Path("/export/usuarios_ml4ds/lbartolome/Repos/umd/LinQAForge/data/models/POLI/rosie_1_20")

### Father model topics

We first present the topics in the father model. The topics are shared by both languages, and each topic has its own list of keys in each language. The topics are aligned across languages, although the words in each topic are not literal translations of one another.
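
Because the topics are aligned by index, the per-language key files can be paired topic by topic. The following is a minimal sketch that assumes the `mallet_output/keys_<lang>.txt` layout used in this notebook, with one line per topic; `load_topic_keys` and `pair_topics` are hypothetical helpers, and the demo files contain toy keys rather than model output.

```python
import pathlib
import tempfile
from typing import Dict, List


def load_topic_keys(model_dir: pathlib.Path, langs: List[str]) -> Dict[str, List[str]]:
    """Read one line of topic keys per topic from mallet_output/keys_<lang>.txt."""
    keys = {}
    for lang in langs:
        path = model_dir / "mallet_output" / f"keys_{lang}.txt"
        with path.open("r", encoding="utf8") as fin:
            keys[lang] = [line.strip() for line in fin if line.strip()]
    return keys


def pair_topics(keys: Dict[str, List[str]], langs: List[str]) -> List[Dict[str, str]]:
    """Group the keys of each topic across languages, relying on shared topic indices."""
    n_topics = min(len(keys[lang]) for lang in langs)
    return [{lang: keys[lang][k] for lang in langs} for k in range(n_topics)]


# Self-contained demo: a throwaway directory stands in for a trained model folder.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = pathlib.Path(tmp) / "mallet_output"
    out_dir.mkdir()
    (out_dir / "keys_EN.txt").write_text(
        "baby woman pregnancy\nvaccine child dose\n", encoding="utf8")
    (out_dir / "keys_ES.txt").write_text(
        "bebé mujer embarazo\nvacuna niño dosis\n", encoding="utf8")
    langs_demo = ["EN", "ES"]
    paired = pair_topics(load_topic_keys(pathlib.Path(tmp), langs_demo), langs_demo)

for k, topic in enumerate(paired):
    print(k, topic)
```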

In [31]:
# father model
father_model = pathlib.Path(father_model)

hmg = HierarchicalTM()

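# Languages of the father model (assumed to be EN and ES, matching lang_colors)
langs = ["EN", "ES"]

# Load topic-keys in each lang
all_keys = {}
for lang in langs: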
    # Default to white if color is not defined
    color = lang_colors.get(lang, "white")
    print(colored("#" * 50, color))
    print(colored(f"-- -- Topic keys in {lang.upper()}: ", color))
    print(colored("-" * 50, color))
    keys = []
    with (father_model / f"mallet_output/keys_{lang}.txt").open('r', encoding='utf8') as fin:
        keys = [el.strip() for el in fin.readlines()]
    all_keys[lang] = keys
    for id, tpc in enumerate(keys):
        print(colored(f"- Topic {id}: {tpc}", color))
    print("\n")

##################################################
-- -- Topic keys in EN:
--------------------------------------------------
- Topic 0: vaccine child vaccination age influenza organization dose health month virus receive person risk recommend people adult hepatitis disease report
- Topic 1: treatment medicine medication doctor technology drug treat dose day injection prescribe prescription time therapy prevent follow reduce symptom talk
- Topic 2: child technology family time parent feel people organization life disorder health talk learn stress depression behavior mental experience school
- Topic 3: technology surgery skin eye bone remove procedure tissue injury body tooth surgeon tube joint hand pain arm muscle foot
- Topic 4: baby woman pregnancy birth risk technology health pregnant organization infant sexual sex week bear mother increase defect newborn period
- Topic 5: care health provider patient medi

### Get input from the user on the construction of the 2nd level topic model

We need the following information from the user:

- The topic to be "expanded" to construct the 2nd level topic model
- The algorithm to be used for constructing the 2nd level topic model
- If using HTM-DS, the threshold value
- The number of training topics for the 2nd level model
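
The checks on these four inputs can be gathered in one place, which also makes them usable without interactive prompts. The sketch below is an illustration only: `SubmodelRequest` and `parse_request` are hypothetical names, and the rules that the threshold lies in [0, 1] and that the submodel has at least one topic are assumptions rather than documented requirements of `HierarchicalTM`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmodelRequest:
    topic_id: int      # father-model topic to expand
    htm_version: str   # "htm_ws" or "htm_ds"
    tr_topics: int     # number of topics in the submodel
    thr: float = 0.0   # only meaningful for "htm_ds"


def parse_request(raw_topic: str, raw_version: str, raw_tr_topics: str,
                  n_father_topics: int, raw_thr: str = "0.0") -> SubmodelRequest:
    """Convert raw user strings into a validated request; raise ValueError otherwise."""
    topic_id = int(raw_topic)  # int() itself raises ValueError on non-numeric input
    if not 0 <= topic_id < n_father_topics:
        raise ValueError(f"Topic id {topic_id} out of range [0, {n_father_topics}).")

    version = raw_version.strip().lower()
    if version not in ("htm_ws", "htm_ds"):
        raise ValueError(f"Unknown method '{raw_version}'.")

    tr_topics = int(raw_tr_topics)
    if tr_topics < 1:
        raise ValueError("The submodel needs at least one topic.")

    thr = 0.0
    if version == "htm_ds":
        thr = float(raw_thr)
        if not 0.0 <= thr <= 1.0:  # assumed range for a topic-proportion threshold
            raise ValueError("Threshold must be between 0 and 1.")

    return SubmodelRequest(topic_id, version, tr_topics, thr)


# n_father_topics=20 is an illustrative value, not read from a model.
request = parse_request("4", "htm_ds", "10", n_father_topics=20, raw_thr="0.5")
print(request)
```

Raising `ValueError` for every invalid input lets the caller decide whether to re-prompt or abort.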

In [33]:
# Ask the user which father-model topic to expand
topic_id = input("Please select the topic you want to expand: ")
try:
    topic_id = int(topic_id)
    if topic_id < 0 or topic_id >= len(all_keys[langs[0]]):
        raise ValueError("Topic id out of range.")
    print(f"Selected Topic {topic_id}: {all_keys[langs[0]][topic_id]}")
    for lang in langs:
        color = lang_colors.get(lang, "white")
        print(
            colored(f"Keys in {lang}: {all_keys[lang][topic_id]}", color))
except ValueError as e:
    print(f"Invalid input: {e}")
    sys.exit()

# htm version
htm_version = input("Please select the method you want to use (htm_ws/htm_ds): ")
if htm_version not in ["htm_ds", "htm_ws"]:
    raise ValueError("Invalid method")

# The threshold is only needed for HTM-DS
thr = 0.0
if htm_version == "htm_ds":
    thr = input("Please insert the threshold: ")
    try:
        thr = float(thr)
    except ValueError as e:
        print(f"Invalid input: {e}")
        sys.exit()

# Ask the user for the number of training topics of the submodel
tr_tpcs = input("Please select the number of training topics for the submodel: ")
try:
    tr_tpcs = int(tr_tpcs)
except ValueError as e:
    print(f"Invalid input: {e}")
    sys.exit()

Please select the topic you want to expand:  4


Selected Topic 4: baby woman pregnancy birth risk technology health pregnant organization infant sexual sex week bear mother increase defect newborn period
Keys in EN: baby woman pregnancy birth risk technology health pregnant organization infant sexual sex week bear mother increase defect newborn period
Keys in ES: bebé mujer embarazo riesgo sexual embarazado nacimiento parto problema nacido semana nacer madre materno prueba defecto aborto pareja mes


Please select the method you want to use (htm_ws/htm_ds):  htm_ds
Please insert the threshold:  0.5
Please select the number of training topics for the submodel:  10


### Train the model

In [41]:
submodel_path = hmg.create_submodel_tr_corpus(
    father_model_path=father_model,
    langs=langs,
    exp_tpc=topic_id,
    tr_topics=tr_tpcs,
    htm_version=htm_version,
    thr=thr)

# train model
start_time = time.time()
model = PolylingualTM(
    mallet_path=mallet_path,
    lang1=langs[0],
    lang2=langs[1],
    model_folder=submodel_path,
    num_topics=tr_tpcs,
    is_second_level=True
)
model.train()

end_time = time.time()
print(f"-- Model trained in {end_time - start_time} seconds")

INFO:src.topic_modeling.hierarchical.hierarchical_tm:-- -- Creating training corpus according to HTM-DS.
2024-06-05 04:02:09,440 - PolylingualTM - INFO - -- -- Importing data to Mallet...
2024-06-05 04:02:09,440 - PolylingualTM - INFO - -- -- Importing data to Mallet...
INFO:PolylingualTM:-- -- Importing data to Mallet...
2024-06-05 04:02:09,445 - PolylingualTM - INFO - -- -- Running command /export/usuarios_ml4ds/lbartolome/Repos/umd/LinQAForge/src/topic_modeling/Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\p{L}+" --print-output --input /export/usuarios_ml4ds/lbartolome/Repos/umd/LinQAForge/data/models/POLI/rosie_1_20/submodels/htm_ds_from_tpc_4_train_with_10/train_data/corpus_EN.txt --output /export/usuarios_ml4ds/lbartolome/Repos/umd/LinQAForge/data/models/POLI/rosie_1_20/submodels/htm_ds_from_tpc_4_train_with_10/mallet_input/corpus_EN.mallet --extra-stopwords src/topic_modeling/stops/en.txt
2024-06-05 04:02:09,445 - Polylin

-- Model trained in 225.32194828987122 seconds


### Topics at the 2nd-level for the expanded topic (HTM-DS)

Below are the topics obtained from training the HTM-DS model.

In [44]:
# Display topics of the submodel 
# Load topic-keys in each lang
all_keys = {}
for lang in langs:
    # Default to white if color is not defined
    color = lang_colors.get(lang, "white")
    print(colored("#" * 50, color))
    print(colored(f"-- -- Topic keys in {lang.upper()}: ", color))
    print(colored("-" * 50, color))
    keys = []
    with (submodel_path / f"mallet_output/keys_{lang}.txt").open('r', encoding='utf8') as fin:
        keys = [el.strip() for el in fin.readlines()]
    all_keys[lang] = keys
    for id, tpc in enumerate(keys):
        print(colored(f"- Topic {id}: {tpc}", color))
    print("\n")

##################################################
-- -- Topic keys in EN:
--------------------------------------------------
- Topic 0: health organization service include public form technology safety insurance plan program cost receive require law document question request worker
- Topic 1: child health family program support learn community technology start school development care parent practice head activity organization resource improve
- Topic 2: cancer surgery technology cell treatment tumor brain bone tissue body therapy procedure skin image eye breast muscle remove radiation
- Topic 3: food technology water gene protein eat acid body skin product air exposure hand include lead day fat clean reduce
- Topic 4: technology symptom child baby pain medicine doctor time medication feel day treatment people talk week change disorder provider health
- Topic 5: age datum health report study rate increase ris

### Topics at the 2nd-level for the expanded topic (HTM-WS)

For comparison, we show the topics obtained by training an HTM-WS model for the same expansion topic.
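
One simple way to compare two sets of subtopics is the Jaccard overlap between their key lists. This is a sketch of that idea, not something the notebook's library provides: `jaccard` and `best_matches` are hypothetical helpers, and the key strings are toy examples rather than model output.

```python
from typing import List, Tuple


def jaccard(keys_a: str, keys_b: str) -> float:
    """Jaccard similarity between two whitespace-separated lists of topic keys."""
    set_a, set_b = set(keys_a.split()), set(keys_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def best_matches(topics_a: List[str], topics_b: List[str]) -> List[Tuple[int, float]]:
    """For each topic in topics_a, return (index of closest topic in topics_b, similarity)."""
    matches = []
    for keys_a in topics_a:
        scores = [jaccard(keys_a, keys_b) for keys_b in topics_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Toy key lists standing in for the keys of two submodels.
ds_keys = ["cancer surgery cell tumor", "food water protein eat"]
ws_keys = ["food eat fat water", "baby birth mother newborn"]

matches = best_matches(ds_keys, ws_keys)
print(matches)  # [(0, 0.0), (0, 0.6)]
```

A similarity of 1.0 for every topic would suggest that both listings were read from the same submodel folder rather than from two different models.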

In [45]:
# Display topics of the submodel 
# Load topic-keys in each lang
submodel_ws_path = father_model / "submodels/htm_ws_from_tpc_4_train_with_10"
all_keys = {}
for lang in langs:
    # Default to white if color is not defined
    color = lang_colors.get(lang, "white")
    print(colored("#" * 50, color))
    print(colored(f"-- -- Topic keys in {lang.upper()}: ", color))
    print(colored("-" * 50, color))
    keys = []
    with (submodel_ws_path / f"mallet_output/keys_{lang}.txt").open('r', encoding='utf8') as fin:
        keys = [el.strip() for el in fin.readlines()]
    all_keys[lang] = keys
    for id, tpc in enumerate(keys):
        print(colored(f"- Topic {id}: {tpc}", color))
    print("\n")
