# In this notebook, we are qualitatively building categories

We look at the file: \
`/data/06-clustered-df/FINAL_alpha0.3_k10_res0.002_iterations400.pkl`

And the corresponding excel file: \
`output/cluster-qualifications_2025/cluster-explorer/FinalSelect/FINAL_SingleSolExplorer_alpha0.3_k10_res0.002_iterations400.xlsx`


## Tree structure:

**Pharmacology**

- **Pharmacodynamics**
  - Mechanism of action
  - Animal Models of Disorders
  - Other Pharmacodynamics
- **Pharmacokinetics**
- **Pharmacogenetics**
- **Analytical Methods**
- **Other Pharmacology**

**Indications**

- **Disorders**
  - Depression
  - Anxiety
  - OCD
  - Weight and Eating Disorders
  - Substance Abuse/Addictions
  - Other Disorders
- **Special Populations**
  - Pregnant
  - Pediatrics
  - Geriatrics
  - Other Populations
- **Other Indications**

**Safety**

- Prenatal Exposure
- Pregnancy
- Suicide
- Sexual Dysfunction
- Toxicity
- Cardiovascular
- Other Safety

**Other**

- Ecotoxicology
- Alternative Treatments
- Drug Utilization Patterns
- Cost-Effectiveness
- Other

### Papers will be further categorized into the categories above and can receive one of the following tags:

- 1. Clinical Evidence
- 2. Non-Clinical Evidence
- 9. Garbage
- 0. Other


# Load and prep data


In [10]:
import pandas as pd
import json

from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Access environment variables
python_path = os.getenv("PYTHONPATH")
data_dir = os.getenv("DATA_DIR")
src_dir = os.getenv("SRC_DIR")
output_dir = os.getenv("OUTPUT_DIR")

SAVE_PATH = output_dir + "/cluster-qualifications_2025/cluster-label-tree/"
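If `OUTPUT_DIR` is missing from the `.env` file, `os.getenv` returns `None` and the string concatenation above fails with a `TypeError` that does not name the variable. A minimal sketch of an earlier, clearer failure, assuming a hypothetical helper `require_env_dir` (not part of this notebook) and `pathlib` for joining paths:

```python
import os
from pathlib import Path


def require_env_dir(name: str) -> Path:
    """Return the directory stored in environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file.")
    return Path(value)


# Placeholder value for demonstration only; in the notebook, load_dotenv() sets it.
os.environ.setdefault("OUTPUT_DIR", "output")

save_path = require_env_dir("OUTPUT_DIR") / "cluster-qualifications_2025" / "cluster-label-tree"
print(save_path)
```

Joining with `/` on `Path` objects also avoids having to track whether a prefix already ends with a slash.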

In [11]:
p = data_dir + "/06-clustered-df/FINAL_alpha0.3_k10_res0.002_iterations400.pkl"

df = pd.read_pickle(p)

# Preview the first three rows
df.head(3)

Unnamed: 0,eid,title,date,first_author,abstract,doi,year,auth_year,unique_auth_year,pubmed_id,...,reference_eids,nr_references,filtered_reference_eids,nr_filtered_references,title_abstract,clean_title,clean_abstract,specter2_embeddings,cluster_alpha0.3_k10_res0.002,centrality_alpha0.3_k10_res0.002
0,2-s2.0-0020425640,Kinetics of citalopram in man; plasma levels i...,1982-01-01,Overø K.,Abstract1.Citalopram is rapidly absorbed and s...,10.1016/S0278-5846(82)80181-4,1982,Overo_1982,Overo_1982_2,6959195.0,...,"[2-s2.0-0019179180, 2-s2.0-0020047901, 2-s2.0-...",18,"[2-s2.0-0020431499, 2-s2.0-0020431887, 2-s2.0-...",5,Kinetics of citalopram in man; plasma levels i...,Kinetics of citalopram in man; plasma levels i...,1.Citalopram is rapidly absorbed and slowly el...,"[0.01807364, 0.45536643, -0.6515877, 0.2952828...",19,0.151958
1,2-s2.0-0019951467,Nonpurinergic nature and efficacy of nonadrene...,1982-01-01,Irvin C.G.,The nonadrenergic inhibition of airway smooth ...,,1982,Irvin_1982,Irvin_1982,6121785.0,...,,0,[],0,Nonpurinergic nature and efficacy of nonadrene...,Nonpurinergic nature and efficacy of nonadrene...,The nonadrenergic inhibition of airway smooth ...,"[0.2421042, 0.35010815, -0.4101752, 0.56234515...",68,0.021471
2,2-s2.0-0020059465,ANDROGEN‐INDUCED SEXUAL DIMORPHISM IN HIGH AFF...,1982-01-01,JALILIAN‐TEHRANI M.H.,High affinity binding of [3H]‐dopamine and [3H...,10.1111/j.1476-5381.1982.tb08755.x,1982,JalilianTehrani_1982,JalilianTehrani_1982,7074286.0,...,"[2-s2.0-0017191716, 2-s2.0-0018835903, 2-s2.0-...",54,[],0,ANDROGENINDUCED SEXUAL DIMORPHISM IN HIGH AFFI...,ANDROGENINDUCED SEXUAL DIMORPHISM IN HIGH AFFI...,High affinity binding of dopamine and 5hydroxy...,"[0.20964071, 1.1354221, -0.08355975, -0.338157...",10,0.012279


In [12]:
# Load the Excel file
file_path = (
    output_dir
    + "/cluster-qualifications_2025/cluster-explorer/FinalSelect/FINAL_SingleSolExplorer_alpha0.3_k10_res0.002_iterations400.xlsx"
)
# Read all sheets into a dictionary
summary_sheets = pd.read_excel(file_path, sheet_name=None)
print(summary_sheets.keys())
summary_sheets["summary"].head(3)

dict_keys(['summary', 'cluster_0', 'cluster_1', 'cluster_2', 'cluster_3', 'cluster_4', 'cluster_5', 'cluster_6', 'cluster_7', 'cluster_8', 'cluster_9', 'cluster_10', 'cluster_11', 'cluster_12', 'cluster_13', 'cluster_14', 'cluster_15', 'cluster_16', 'cluster_17', 'cluster_18', 'cluster_19', 'cluster_20', 'cluster_21', 'cluster_22', 'cluster_23', 'cluster_24', 'cluster_25', 'cluster_26', 'cluster_27', 'cluster_28', 'cluster_29', 'cluster_30', 'cluster_31', 'cluster_32', 'cluster_33', 'cluster_34', 'cluster_35', 'cluster_36', 'cluster_37', 'cluster_38', 'cluster_39', 'cluster_40', 'cluster_41', 'cluster_42', 'cluster_43', 'cluster_44', 'cluster_45', 'cluster_46', 'cluster_47', 'cluster_48', 'cluster_49', 'cluster_50', 'cluster_51', 'cluster_52', 'cluster_53', 'cluster_54', 'cluster_55', 'cluster_56', 'cluster_57', 'cluster_58', 'cluster_59', 'cluster_60', 'cluster_61', 'cluster_62', 'cluster_63', 'cluster_64', 'cluster_65', 'cluster_66', 'cluster_67', 'cluster_68', 'cluster_69', 'cluster

Unnamed: 0,Cluster,Nr of Pubs,25th Percentile Year,Median Year,75th Percentile Year,Given Label,Word_0,Word_1,Word_2,Word_3,...,Word_10,Word_11,Word_12,Word_13,Word_14,Word_15,Word_16,Word_17,Word_18,Word_19
0,0,1219,2014.0,2019.0,2022.0,,fluoxetine,pharmaceuticals,effects,fish,...,zebrafish,treatment,aquatic,serotonin,sertraline,study,antidepressants,drugs,exposed,removal
1,1,1133,1995.0,2000.0,2008.0,,ht,fluoxetine,receptor,serotonin,...,citalopram,effect,effects,rats,extracellular,antidepressant,neurons,administration,brain,1a
2,2,1067,2006.0,2012.0,2019.0,,method,fluoxetine,plasma,determination,...,citalopram,human,samples,paroxetine,drugs,analysis,ngml,drug,range,chromatography


# Cluster labels are arrived at by assessing the file:

`output/cluster-qualifications_2025/cluster-explorer/FinalSelect/FINAL_SingleSolExplorer_alpha0.3_k10_res0.002_iterations400.xlsx`

The file contains the 20 most central papers in each cluster as well as 20 random papers from each cluster. Each cluster is assessed with the following questions:

0. What is the defining characteristic of the cluster? (to use as label)
1. Does it focus on pharmacology?
   - Which area of pharmacology?
2. Does it focus on in vivo, in vitro, or human studies?
3. Does it focus on a specific disorder?
4. Does it focus on a special population?
5. Does it focus on a specific drug?
6. Does it focus on safety?

The cluster then receives one of the following tags:

- 1. Clinical Evidence
- 2. Non-Clinical Evidence
- 9. Garbage
- 0. Other

The file is manually assessed and the clusters are labeled as follows:

A cleaner version in JSON format is saved as `output/cluster-qualifications/ClusterLabels.json`


# Cluster Labels


In [13]:
# 1 clinical
# 2 non clinical
# 9 garbage
# 0 Other.

cluster_labels = {
    0: ["Aquatic Ecotoxicology", 0],
    1: ["Serotonin Receptor Modulation in SSRI Treatment", 2],
    2: ["Analytical Methods for SSRI Detection and Quantification", 2],
    3: ["SSRIs for OCD", 1],
    4: ["Risk of Prenatal Exposure", 1],
    5: ["SSRIs for Pediatric Depression", 1],
    6: ["SSRIs and the Cytochrome P450 System", 2],
    7: ["Impact of SSRIs on Neurogenesis", 2],
    8: ["The Chronic Unpredictable Mild Stress Model of Depression", 1],
    9: ["SSRIs Effect on Neural Processing of Emotional Cues", 1],
    10: ["Risk of Prenatal Exposure (Rodents)", 1],
    11: ["SSRIs for PTSD", 2],
    12: ["SSRIs in Forced Swimming Test", 2],
    13: ["SSRI Utilization Patterns", 1],
    14: ["Fluvoxamine for Depression", 1],
    15: ["Inflammatory Biomarkers", 1],
    16: ["Sexual Dysfunction", 1],
    17: ["SSRIs Effect on Fear", 2],
    18: ["Post-Stroke SSRI Use", 1],
    19: ["Escitalopram for Depression", 1],
    20: [
        "Serotonin Transporter Gene and Antidepressant Response",
        1,
    ],  # Is this clinical? It deals with humans but is pharmacogenetics. Should it still get a 1 (clinical) or a 0?
    21: ["Treatment Strategies and Outcomes in Depression", 1],
    22: ["SSRIs for Pain", 1],
    23: ["SSRIs for Depression in Patients with Cardiovascular Disease", 1],
    24: ["SSRIs in Dementias", 1],
    25: ["Serotonin Syndrome", 1],
    26: ["SSRIs for Weight Management", 1],
    27: ["Paroxetine Binding to Serotonin Transporter", 2],
    28: ["Bleeding Risk", 1],
    29: ["Antidepressant-like Effects of Herbal Extracts in Animal Models", 2],
    30: [
        "Oncopreventive and Oncolytic Properties of SSRIs",
        1,
    ],
    31: ["Suicide Risk", 1],
    32: ["Fracture Risk", 1],
    33: ["SSRIs for Generalized Anxiety Disorder", 1],
    34: ["SSRIs for Panic Disorder", 1],
    35: ["Pharmacological Reviews", 2],
    36: ["SSRIs for Bipolar Depression", 1],
    37: ["Hepatotoxic Risk", 2],
    38: ["Fluvoxamine for COVID-19", 1],
    39: ["Pharmacoeconomics of SSRIs for Depression", 0],
    40: ["SSRIs Effect on Sleep", 1],
    41: ["Venlafaxine vs SSRIs for Depression", 1],
    42: ["SSRI Effects on Lactation", 1],
    43: ["SSRIs for Depression in Diabetes Patients", 1],
    44: ["SSRIs for Menopausal Vasomotor Symptoms (Hot flashes)", 1],
    45: ["Pharmacovigilance of SSRIs", 1],
    46: ["Premature Ejaculation", 1],
    47: ["SSRIs for Depression with Cognitive Impairment", 1],
    48: ["QTc Prolongation", 1],
    49: ["Extrapyramidal Reactions Risk", 1],  # movement disorders
    50: ["SSRIs for Irritable Bowel Syndrome", 1],
    51: ["SSRIs for Alcohol Dependence", 1],
    52: ["SSRIs for Premenstrual Dysphoric Disorder (PMDD)", 1],
    53: ["Neuroimaging of Serotonin Transporters", 2],
    54: ["SSRIs for (Psycho)dermatological Diseases", 1],
    55: ["Synthetic Chemistry of Paroxetine", 2],
    56: ["SSRIs for Autism Spectrum Disorders", 1],
    57: ["SSRIs for Eating Disorders", 1],
    58: ["SSRIs for Depression in Parkinson's Patients", 1],
    59: ["SSRIs for Depression in Epileptic Patients", 1],
    60: ["SSRIs Effects on Ion Channels", 2],
    61: ["SSRIs and Augmentation for Treatment Resistant Depression", 1],
    62: ["Environmental Degradation of Antidepressants", 0],
    63: ["Methylenedioxymethamphetamine (MDMA) Induced Neurotoxicity", 2],
    64: ["Laboratory Techniques for drug analysis", 2],
    65: ["SSRIs for Social Anxiety Disorder", 1],
    66: ["Forensic Toxicology of Fatal Overdoses", 0],
    67: ["SSRIs (Fluvoxamine) for Schizophrenia", 1],
    68: ["SSRIs Effect on Pulmonary Hypertension", 2],
    69: ["SSRIs for Body Dysmorphic Disorder (BDD)", 1],
    70: ["Antidepressant Drug Design", 2],
    71: ["Serotonergic Modulation of Cocaine Effects", 2],
    72: ["Pharmacogenetics in SSRI Treatment", 2],
    73: ["Non-SERT Transporters in Antidepressant Action", 2],
    74: ["SSRIs for Depression, Comparative Effectiveness", 1],
    75: ["SSRIs for Geriatric Depression", 1],
    76: ["SERT Binding Mechanisms", 2],
    77: ["Astrocyte Receptors as a Target for SSRIs", 2],
    78: [
        "SSRIs for Borderline Personality Disorder",
        1,
    ],
    79: ["Agomelatine for Depression", 1],
    80: ["Depression Treatment Guidelines", 1],
    81: ["Sertraline for Depression", 1],
    82: ["Withdrawal Symptoms", 1],
    83: ["Hyponatremia Risk", 1],
    84: ["SSRIs Effect on Dopamine Receptors", 2],
    85: [
        "SSRIs Effects on Antimicrobials and Gut Microbiome",
        2,
    ],
    86: ["SSRIs for Depression after Traumatic Brain Injury", 1],
    87: ["Antifungal Activity of Sertraline", 2],
    88: ["Neurometabolite Changes with Antidepressants", 1],
    89: ["Galanin System in Antidepressant Action", 2],
    90: ["SSRIs for Fibromyalgia", 1],
    91: ["SSRI Crystal Structure and Complexation", 2],
    92: [
        "SSRIs Effect on Brain-derived Neurotrophic Factor (BDNF) Levels in Depressive Patients",
        1,
    ],
    93: ["SSRIs in Model Organisms (C. elegans and Drosophila)", 2],
    94: ["SSRIs for Depression in Cancer Patients", 1],
    95: ["SSRIs for Postpartum Depression", 1],
    96: ["SSRIs Long-term Efficacy and Recurrent Depression", 1],
    97: ["Canine Behavioral Disorders", 0],  # aggression in dogs mostly
    98: ["Gateways to Clinical Trials", 9],
    99: ["SSRIs and Ocular Safety", 1],  # mostly glaucoma
    ######################################
    # now below 100 papers per cluster #
    ######################################
    100: ["Folate for Depression", 1],
    101: ["SSRIs for Depression in Patients with Kidney Disease", 1],
    102: ["Psilocybin for Depression", 1],
    103: ["SSRIs Effect on Reinforcement Learning in Rodents", 2],
    104: [
        "SSRIs for Behavioral Addictions",
        1,
    ],
    105: ["St. John's Wort for Depression", 1],
    106: ["SSRIs for Migraine", 1],
    107: ["Vilazodone for Depression", 1],
    108: ["SSRIs for Dysthymia", 1],
    109: [
        "SSRIs for Enuresis",
        1,
    ],
    110: [
        "Paroxetine as a G Protein-coupled Receptor Kinase 2 (GRK2) Inhibitor",
        2,
    ],
    111: ["SSRIs for Trichotillomania", 1],
    112: ["Transcranial Direct Current Stimulation (tDCS) for Depression", 1],
    113: ["SSRIs for Burning Mouth Syndrome", 1],
    114: [
        "SSRIs for Depression in Patients with Chronic Obstructive Pulmonary Disease (COPD)",
        1,
    ],
    115: ["SSRIs for the Management of Multiple Sclerosis", 1],
    116: ["SSRIs for Interferon-Induced Depression", 1],
    117: ["SSRIs for Depression in HIV-positive Patients", 1],
    118: ["Acupuncture for Depression", 1],
    119: ["SSRIs for Syncope", 1],
    120: ["SSRIs for Smoking Cessation", 1],
    121: [
        "Loudness Dependence of Auditory Evoked Potentials (LDAEP) as Biomarker for Depression",
        2,
    ],
    122: ["SSRIs for Tinnitus", 1],
    123: ["Exercise for Depression", 1],
    124: ["SSRI-Methylphenidate Interactions", 2],
    #####################################
    ### from here on under 50 papers ###
    #####################################
    125: ["SSRIs for Narcolepsy", 1],
    126: ["SSRIs for OCD-like behaviors", 2],
    127: ["SSRIs in the Treatment of Huntington Disease", 1],
    128: ["SSRIs for Congenital Myasthenic Syndrome (SCCMS)", 1],
    129: ["Tryptophan Depletion in Depression", 1],
    130: ["Risk of Microscopic Colitis", 1],
    131: ["Herbal Remedies for SSRI-Induced Sexual Dysfunction", 2],
    132: ["Adulterants of SSRIs in Weight Loss Supplements", 2],
    133: ["SSRIs for Mutism", 1],
    134: ["SSRIs for Tourette's Syndrome", 1],
    135: ["Childhood Trauma and Antidepressant Response", 1],
    136: ["SSRIs Risk of Apathy syndrome", 1],
    137: ["Serotonin Transporter in Primate Models", 2],
    138: ["Blood-Brain Barrier Drug Transport", 2],
    139: ["Polypharmacy Effects in Aging Models", 2],
    140: ["Gateways to Clinical Trials", 9],
    141: ["SSRIs for Infertile Women", 1],
    142: ["Fewer than 10", 9],
    143: ["Fewer than 10", 9],
}

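# SAVE_PATH already ends with "/", so the file name is appended directly
path = SAVE_PATH + "cluster_labels_tags_full_raw.json"

# Labels and tags; json.dump writes the integer cluster IDs as string keys
with open(path, "w") as f:
    json.dump(cluster_labels, f)

# Labels only, without the tags
cluster_labels_full_raw = {k: v[0] for k, v in cluster_labels.items()}
path = SAVE_PATH + "cluster_labels_full_raw.json"

with open(path, "w") as f:
    json.dump(cluster_labels_full_raw, f)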
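JSON object keys are always strings, so the integer cluster IDs used as dictionary keys come back as `"0"`, `"1"`, ... when the saved files are read again. The sketch below illustrates the round trip on a two-entry stand-in for `cluster_labels` (the name `labels_example` is only for illustration) and one possible way to restore the integer keys:

```python
import json

# Tiny stand-in for the full cluster_labels dictionary.
labels_example = {0: ["Aquatic Ecotoxicology", 0], 3: ["SSRIs for OCD", 1]}

# Serializing turns the integer keys into strings ...
serialized = json.dumps(labels_example)
reloaded = json.loads(serialized)
print(list(reloaded.keys()))

# ... so convert them back to int after loading.
restored = {int(k): v for k, v in reloaded.items()}
print(restored == labels_example)
```

Without the conversion, a lookup such as `reloaded[3]` raises a `KeyError`, which matters when the labels are later joined to the integer cluster column of the DataFrame.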

In [14]:
tags_dict = {k: v[1:] for k, v in cluster_labels.items()}
tags_list = [v for v in tags_dict.values()]
tags_list = [item for sublist in tags_list for item in sublist]

# - 1. Clinical Evidence
# - 2. Non-Clinical Evidence
# - 9. Garbage
# - 0. Other
pd.Series(tags_list).value_counts()

1    96
2    39
0     5
9     4
Name: count, dtype: int64
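The numeric codes in the counts above are easier to read once mapped to their tag names. A small sketch, assuming the code-to-name mapping from the tagging scheme (the dictionary name `TAG_NAMES` and the short `example_tags` list are illustrative stand-ins, not part of the notebook):

```python
import pandas as pd

# Tag codes as used in the cluster_labels cell.
TAG_NAMES = {
    1: "Clinical Evidence",
    2: "Non-Clinical Evidence",
    9: "Garbage",
    0: "Other",
}

# Short stand-in for tags_list.
example_tags = [1, 1, 2, 0, 9, 1]

# Replace each code with its name before counting.
named_counts = pd.Series(example_tags).map(TAG_NAMES).value_counts()
print(named_counts)
```

Applied to the real `tags_list`, the same two calls would give the table above with names instead of codes.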

# Cluster Tree


In [15]:
cluster_tree = {
    "Pharmacology": {
        "Pharmacodynamics": {
            "Mechanism of action": [
                1,
                7,
                9,
                53,
                60,
                63,
                68,
                73,
                76,
                77,
                84,
                85,
                88,
                89,
                92,
                103,
                110,
                137,
            ],
            "Animal Models of Disorders": [8, 12, 17, 71, 121, 126],
            "Other": [10, 93, 124],
        },
        "Pharmacokinetics": [6, 138],
        "Pharmacogenetics": [20, 72],
        "Analytical Methods": [2, 27, 55, 64, 91],
        "Other Pharmacology": [15, 35, 70],
    },
    "Indications": {
        "Depressive Disorders": [
            5,
            9,
            12,
            14,
            19,
            21,
            23,
            36,
            41,
            43,
            47,
            58,
            59,
            61,
            74,
            75,
            79,
            80,
            81,
            86,
            92,
            94,
            95,
            96,
            100,
            101,
            102,
            105,
            107,
            108,
            112,
            114,
            116,
            117,
            123,
            129,
        ],
        "Anxiety": [17, 33, 34, 65],
        "OCD": [3, 69, 126],
        "Weight and Eating Disorders": [26, 57],
        "Substance Abuse/Addictions": [51, 71, 104, 120],
        "Other": [
            11,
            22,
            38,
            40,
            44,
            46,
            50,
            52,
            54,
            56,
            67,
            68,
            78,
            87,
            90,
            106,
            109,
            111,
            113,
            115,
            119,
            122,
            125,
            127,
            128,
            133,
            134,
            135,
        ],
    },
    "Safety": {
        "Special Populations": {
            "Pediatrics": [5, 31],
            "Geriatrics": [18, 23, 24, 47, 58, 75],
            "Other": [43, 59, 86, 94, 101, 114, 117, 141],
        },
        "Perinatal Exposure": [4, 10, 42, 95],
        "Other": [
            16,
            25,
            28,
            30,
            31,
            32,
            37,
            45,
            48,
            49,
            66,
            82,
            83,
            99,
            130,
            131,
            136,
            139,
        ],
    },
    "Other": {
        "Ecotoxicology": [0, 62],
        "Alternative Treatments": [29, 105, 118],
        "Other": [13, 39, 97, 98, 132],
    },
}


path = SAVE_PATH + "cluster_tree_full_raw.json"

with open(path, "w") as f:
    json.dump(cluster_tree, f)

In [16]:
def extract_all_numbers(cluster_tree):
    all_numbers = []

    def traverse_dict(d):
        for key, value in d.items():
            if isinstance(value, dict):
                traverse_dict(value)
            elif isinstance(value, list):
                # Check if the list contains nested lists or dictionaries
                if all(isinstance(item, int) for item in value):
                    all_numbers.extend(value)
                else:
                    # If list contains other dictionaries or lists
                    for item in value:
                        if isinstance(item, dict):
                            traverse_dict(item)
                        elif isinstance(item, list):
                            all_numbers.extend([x for x in item if isinstance(x, int)])
                        elif isinstance(item, int):
                            all_numbers.append(item)

    traverse_dict(cluster_tree)
    return sorted(all_numbers)


# Extract all numbers from the cluster_tree
all_numbers = extract_all_numbers(cluster_tree)

# value counts of all numbers
print("Clusters assigned to single and multiple categories:")
print(
    pd.Series(pd.Series(all_numbers).value_counts().to_dict().values()).value_counts()
)

Clusters assigned to single and multiple categories:
1    118
2     23
Name: count, dtype: int64
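The counts show how many clusters sit in more than one category, but not which ones or where. One possible way to list them is to flatten the tree into a mapping from cluster ID to category paths. The helper `cluster_paths` below is a hypothetical addition, shown on a small excerpt of the tree:

```python
from collections import defaultdict


def cluster_paths(tree, prefix=()):
    """Map each cluster ID to every category path that lists it."""
    paths = defaultdict(list)
    for name, value in tree.items():
        path = prefix + (name,)
        if isinstance(value, dict):
            # Recurse into sub-categories and merge their results.
            for cluster_id, sub_paths in cluster_paths(value, path).items():
                paths[cluster_id].extend(sub_paths)
        else:
            # Leaf: a list of cluster IDs.
            for cluster_id in value:
                paths[cluster_id].append(" > ".join(path))
    return dict(paths)


# Excerpt of cluster_tree, for demonstration only.
example_tree = {
    "Pharmacology": {"Pharmacodynamics": {"Animal Models of Disorders": [8, 12, 17]}},
    "Indications": {"Anxiety": [17, 33]},
}

paths = cluster_paths(example_tree)
multi = {cid: p for cid, p in paths.items() if len(p) > 1}
print(multi)
```

In this excerpt only cluster 17 is reported, since it appears under both "Animal Models of Disorders" and "Anxiety".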


In [17]:
print("Clusters with no tree entry:")

for key in list(cluster_labels_full_raw.keys()):
    if key not in all_numbers:
        print(key, ":", cluster_labels_full_raw[key])

Clusters with no tree entry:
140 : Gateways to Clinical Trials
142 : Fewer than 10
143 : Fewer than 10
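The check above finds labeled clusters that are missing from the tree. The reverse error, an ID in the tree that has no label (for example a typo), would go unnoticed. A sketch of a two-way comparison using set differences, with the hypothetical helper `compare_ids` and made-up example IDs:

```python
def compare_ids(label_ids, tree_ids):
    """Report IDs missing from the tree and tree IDs that have no label."""
    label_ids, tree_ids = set(label_ids), set(tree_ids)
    return {
        "unassigned": sorted(label_ids - tree_ids),  # labeled, but not in the tree
        "unknown": sorted(tree_ids - label_ids),     # in the tree, but not labeled
    }


# Illustrative inputs; 999 stands in for a mistyped cluster ID.
example_label_ids = [0, 1, 2, 140]
example_tree_ids = [0, 1, 1, 2, 999]

report = compare_ids(example_label_ids, example_tree_ids)
print(report)
```

In the notebook, the inputs would presumably be `cluster_labels.keys()` and `all_numbers`.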
