# In this notebook, we qualitatively build categories

We look at the file: \
`../data/06-clustered-df/alpha0.3_k10_res0.002.pkl`

And the corresponding excel file: \
`../output/cluster-qualifications/cluster-explorer/FinalSelect/SingleSolExplorer_alpha0.3_k10_res0.002.xlsx`
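
Both file names are built from the same parameter string, so they can be derived from it in one place. The following is a minimal sketch, assuming both paths are relative to the notebook's folder and follow the naming shown above; `solution_paths` is a hypothetical helper, not part of the project code.

```python
from pathlib import PurePosixPath


def solution_paths(params: str) -> dict:
    """Return the pickle and workbook paths for one clustering solution.

    Hypothetical helper: directory names are copied from the text above.
    """
    data_dir = PurePosixPath("../data/06-clustered-df")
    explorer_dir = PurePosixPath(
        "../output/cluster-qualifications/cluster-explorer/FinalSelect"
    )
    return {
        "pickle": data_dir / f"{params}.pkl",
        "workbook": explorer_dir / f"SingleSolExplorer_{params}.xlsx",
    }


paths = solution_paths("alpha0.3_k10_res0.002")
print(paths["pickle"])
print(paths["workbook"])
```

Keeping the parameter string in a single variable makes it harder for the pickle and the workbook to drift apart when another clustering solution is selected.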


## Tree structure:

**Pharmacology**

- **Pharmacodynamics**
  - Mechanism of action
  - Animal Models of Disorders
  - Other Pharmacodynamics
- **Pharmacokinetics**
- **Pharmacogenetics**
- **Analytical Methods**
- **Biomarker Search**
- **Other Pharmacology**

**Indications**

- **Disorders**
  - Depression
  - Anxiety
  - OCD
  - Weight and Eating Disorders
  - Substance Abuse/Addictions
  - Other Disorders
- **Special Populations**
  - Pregnant
  - Pediatrics
  - Geriatrics
  - Other Populations
- **Other Indications**

**Safety**

- Prenatal Exposure
- Pregnancy
- Suicide
- Sexual Dysfunction
- Toxicity
- Cardiovascular
- Other Safety

**Other**

- **Drugs**
  - Fluvoxamine
  - Paroxetine
  - Fluoxetine
  - Escitalopram
  - Citalopram
  - Sertraline
  - Agomelatine
  - Vilazodone
  - Psilocybin
  - Generic SSRIs
  - SSRIs
- Ecotoxicology
- Alternative Treatments
- Drug Utilization Patterns
- Cost-Effectiveness
- Other

### Papers will be categorized into the categories above and can receive the tags:

1. Clinical Evidence
2. Non-Clinical Evidence
3. Mix
4. Review
5. Irrelevant
6. Other (coded as `0` in the cluster labels below)
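
The tags can be written as a small lookup table. This is a sketch only, assuming the integer codes used in the labelling cell further down, where `0` stands for Other; `EVIDENCE_TAGS` and `describe_tags` are illustrative names that do not appear in the notebook.

```python
# Integer codes as used in the cluster_labels dictionary (0 = Other).
EVIDENCE_TAGS = {
    1: "Clinical Evidence",
    2: "Non-Clinical Evidence",
    3: "Mix",
    4: "Review",
    5: "Irrelevant",
    0: "Other",
}


def describe_tags(codes):
    """Translate a list of integer codes into their tag names."""
    return [EVIDENCE_TAGS[code] for code in codes]


# A cluster may carry more than one tag, e.g. clinical evidence plus reviews.
print(describe_tags([1, 4]))
```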


# Load and Prep Data


In [13]:
import pandas as pd
import json

In [14]:
params = "alpha0.3_k10_res0.002"
p = f"../data/06-clustered-df/{params}.pkl"
df = pd.read_pickle(p)
df.head(3)

Unnamed: 0,eid,title,date,first_author,abstract,doi,year,auth_year,unique_auth_year,pubmed_id,...,reference_eids,nr_references,filtered_reference_eids,nr_filtered_references,title_abstract,clean_title,clean_abstract,specter2_embeddings,cluster_alpha0.3_k10_res0.002,centrality_alpha0.3_k10_res0.002
0,2-s2.0-0020316326,Serotonergic mechanism in the control of β-end...,1982-04-12,Bruni J.F.,The role of the serotonergic mechanism in the ...,10.1016/0024-3205(82)90686-5,1982,Bruni_1982,Bruni_1982,6283286.0,...,"[2-s2.0-0016795422, 2-s2.0-0000011578, 2-s2.0-...",46,[],0,Serotonergic mechanism in the control of β-end...,Serotonergic mechanism in the control of β-end...,The role of the serotonergic mechanism in the ...,"[-0.38758993, 0.8743463, -0.52714413, 0.029653...",70,0.045627
1,2-s2.0-0019936013,EFFECTS OF PAROXETINE ON SYNAPTOSOMAL NEUROTRA...,1982-01-01,Magnussen I.,,10.1111/j.1600-0404.1982.tb03382.x,1982,Magnussen_1982,Magnussen_1982_4,,...,[],0,[],0,EFFECTS OF PAROXETINE ON SYNAPTOSOMAL NEUROTRA...,EFFECTS OF PAROXETINE ON SYNAPTOSOMAL NEUROTRA...,,"[0.329068, 0.23448052, -0.6597941, 0.13635367,...",65,0.010706
2,2-s2.0-0020058010,Treatment of intention myoclonus with paroxeti...,1982-01-01,Magnussen I.,,,1982,Magnussen_1982,Magnussen_1982_3,,...,[],0,[],0,Treatment of intention myoclonus with paroxeti...,Treatment of intention myoclonus with paroxeti...,,"[0.406605, 1.0992043, -0.60125256, 0.73224956,...",41,0.039863


# Cluster labels are derived by assessing the file:

`Study-1-Bibliometrics/output/cluster-qualifications/cluster-explorer/FinalSelect/SingleSolExplorer_alpha0.3_k10_res0.002.xlsx`

The file contains the 20 most central papers in each cluster as well as 20 random papers from each cluster. For each cluster, we ask the following questions:

0. What is the defining characteristic of the cluster? (to use as label)
1. Does it focus on pharmacology?
   - Which area of pharmacology?
2. Does it focus on in vivo, in vitro, or human studies?
3. Does it focus on a specific disorder?
4. Does it focus on a special population?
5. Does it focus on a specific drug?
6. Does it focus on safety?

Each cluster then receives one or more of the following tags:

- `1`: Clinical Evidence
- `2`: Non-Clinical Evidence
- `3`: Mix
- `4`: Review
- `5`: Irrelevant
- `0`: Other

The file is manually assessed and the clusters are labeled as follows:

A cleaner version in JSON format is saved as `output/cluster-qualifications/ClusterLabels.json`
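
Because the labels are typed by hand, a quick consistency check can catch entries with a missing label, no tag, or an unknown code. The sketch below assumes each entry has the form `[label, code, ...]` with codes between 0 and 5; `check_entry` is a hypothetical helper.

```python
VALID_CODES = {0, 1, 2, 3, 4, 5}


def check_entry(cluster_id, entry):
    """Return a list of problems found in one [label, code, ...] entry."""
    problems = []
    if not entry or not isinstance(entry[0], str) or not entry[0].strip():
        problems.append(f"cluster {cluster_id}: missing label")
    codes = entry[1:]
    if not codes:
        problems.append(f"cluster {cluster_id}: no evidence tag")
    for code in codes:
        if code not in VALID_CODES:
            problems.append(f"cluster {cluster_id}: unknown code {code!r}")
    return problems


print(check_entry(9, ["Fluvoxamine for Depression", 1, 4]))  # no problems
print(check_entry(1, ["Aquatic Ecotoxicology"]))             # tag missing
```

Running such a check over the whole dictionary before saving would flag typing slips early, at the cost of one extra loop.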


In [15]:
# 1 clinical
# 2 in vivo/vitro
# 3 Mixed
# 4 review
# 5 garbage
# 0 None.

cluster_labels = {
    0: ["Serotonin Receptor Studies", 2],
    1: ["Aquatic Ecotoxicology", 0],
    2: ["Risks of Prenatal Exposure", 1],
    3: ["Quantification of SSRIs in Biological Samples", 2],
    4: ["SSRIs for Obsessive-Compulsive Disorder (OCD)", 1],
    5: ["SSRIs and the Cytochrome P450 System", 2],
    6: ["SSRI Neuroscience", 2],
    7: ["Pediatric Depression", 1],
    8: ["The Chronic Unpredictable Mild Stress Model of Depression", 1],
    9: ["Fluvoxamine for Depression", 1, 4],
    10: ["Paroxetine Binding", 2],
    11: ["SSRIs Effect on Neural Processing of Emotional Cues", 1],
    12: ["Risks of Prenatal Exposure (Rodents)", 1],
    13: ["SSRIs for PTSD", 2],
    14: ["SSRIs in Forced Swimming Test", 2],
    15: ["Serotonin Syndrome", 1],
    16: ["Sexual Dysfunction", 1],
    17: ["Sequenced Depression Treatment", 1],
    18: ["Post-Stroke SSRI Use", 1],
    19: ["SSRIs Effect on Fear", 2],
    20: ["Bleeding Risk", 1],
    21: [
        "Serotonin Transporter Gene and Antidepressant Response",
        1,
    ],  # is this clinical? It deals with humans but is pharmacogenetics. Do we still give it a 1 (for clinical) or a 0?
    22: [
        "SSRIs and Inflammation",
        3,
    ],  # this seems like ssri have anti-inflammatory effects but also inflammation can cause depression
    23: ["Escitalopram for Depression", 1],
    24: ["SSRIs for Pain", 1],
    25: ["SSRI Utilization Patterns", 1],
    26: ["SSRIs in Dementias", 1],
    27: ["SSRIs for Weight Management", 1],
    28: ["Suicide Risk", 1],
    29: [
        "SSRIs and Cardiovascular Pathology",
        1,
    ],  # quick check please: it is rather depression in this population
    30: ["Serotonin Binding and Receptor Studies", 2],  # compare to 0
    31: ["SSRIs for Panic Disorder", 1],
    32: ["Paroxetine", 4, 5],
    33: ["Cost-Effectiveness", 0],
    34: ["SSRIs for Dermatological Diseases", 1],  # psychodermatology?
    35: ["Herbal and Natural Compounds for Depression", 2],
    36: ["SSRIs for Generalized Anxiety Disorder (GAD)", 1],
    37: ["Fracture Risk", 1],
    38: ["SSRIs for Bipolar Depression", 1],
    39: ["Fluvoxamine for COVID-19", 1],
    40: ["QTc Prolongation", 1],
    41: ["Fluoxetine", 4, 5],
    42: ["Extrapyramidal Reactions Risk", 1],  # movement disorders
    43: ["Premature Ejaculation", 1],
    44: ["SSRIs for Depression in Parkinson's Patients", 1],
    45: [
        "Oncopreventive and Oncolytic Properties of SSRIs",
        1,
    ],  # mostly risk estimation
    46: ["SSRIs Effect on Sleep in Depressed Patients", 1],
    47: ["SSRIs for Depression in Diabetes Patients", 1],
    48: ["SSRIs for Irritable Bowel Syndrome", 1],
    49: ["SSRIs for Premenstrual Dysphoric Disorder (PMDD)", 1],
    50: ["SSRIs for Depression with Cognitive Impairment", 1],
    51: ["SSRIs for Eating Disorders", 1],
    52: ["SSRIs for Alcohol Dependence", 1],
    53: ["Hot Flushes (Menopause)", 1],
    54: ["(Sertraline) Drug Delivery", 2],
    55: ["SSRIs for Depression in Epileptic Patients", 1],
    56: ["SSRIs for Autism Spectrum Disorders", 1],
    57: ["SSRIs for Substance Abuse (Cocaine)", 2],
    58: ["Hepatotoxic Risk", 2],  # also check 85
    59: ["SSRIs (Fluvoxamine) for Schizophrenia", 1],
    60: ["SSRI Synthesis", 2],
    61: ["Sertraline for Depression", 1],
    62: ["SSRIs for Social Anxiety Disorder", 1, 4],
    63: ["Drug Drug Interactions", 1],
    64: ["SSRIs Effects on Ion Channels", 2],
    65: ["Methylenedioxymethamphetamine (MDMA) Effects on Serotonin", 2],
    66: ["Genotype Mediated Response to SSRIs", 2],
    67: ["Pharmacological Perspectives on Antidepressants", 2, 4],
    68: ["Hyponatremia Risk", 1],
    69: ["SSRIs for Body Dysmorphic Disorder (BDD)", 1],
    70: ["SSRIs Effects on Neuroendocrine System", 3],
    71: ["SSRIs for Pediatric Anxiety", 1],
    72: ["SSRI Safety Reviews", 1, 4],
    73: [
        "SSRIs Effects on Antimicrobials and Gut Microbiome",
        2,
    ],  # no idea where to put this (mechanism of action, safety, treatment of gut things) now at other diseases
    74: ["Depression Treatment Review", 1, 4],
    75: ["Agomelatine for Depression", 4],
    76: ["Geriatric Depression", 1, 4],
    77: ["Citalopram for Depression", 1, 4, 5],
    78: [
        "SSRIs for Personality Disorders and Related Traits",
        1,
        4,
    ],  # mostly borderline
    79: ["Forensic Drug Concentration", 1],
    80: ["Pulmonary Hypertension", 2],
    81: ["Astrocyte Receptors", 1],
    82: ["Withdrawal Symptoms", 1],
    83: ["Wastewater Management", 0],
    84: ["SSRI Use", 5, 1],
    85: ["Depression in Hepatitis Patients", 1],
    86: ["SSRIs and Augmentation for Treatment Resistant Depression", 1, 4],
    87: ["SSRIs for Depression after Traumatic Brain Injury", 1],
    88: ["Repeated SSRIs Exposures Effects on Dopamine Receptors", 2],
    89: [
        "Hyperprolactinemia-related Symptoms",
        2,
    ],  # galactorrhoea; amenorrhea; gynecomastia
    90: ["Neurochemical and Electrophysiological Correlates of SSRIs", 3],
    91: ["SSRIs for Fibromyalgia", 2],  # also chronic fatigue syndrome
    92: ["Sex differences in SSRI Effects in Depression", 3],
    93: ["SSRIs and Ocular Safety", 1],  # mostly glaucoma
    94: ["SSRIs for OCD-like behaviors", 2],
    95: ["Transcranial Direct Current Stimulation (tDCS) for Depression", 1],
    96: ["SSRIs for Depression in Cancer", 1],
    97: ["SSRI-Induced Mania", 1, 5],
    98: ["Gateways to Clinical Trials", 5, 4],
    99: ["Sertraline", 5],
    100: ["Canine Behavioral Disorders", 0],  # aggression in dogs mostly
    #####################################
    ### from here on under 100 papers ###
    #####################################
    101: [
        "Risk for Orofacial Pain Disorders",
        2,
    ],
    102: [
        "SSRIs Effect on Brain-derived Neurotrophic Factor (BDNF) Levels in Depressive Patients",
        1,
    ],  # may this be a biomarker?
    103: [
        "SSRIs for Enuresis",
        1,
    ],  # or enuresis or Stress Urinary Incontinence Or Urinary Function? Whats the best title?
    104: ["SSRIs for Migraine", 1],
    105: ["SSRIs for Postpartum Depression", 1],  # is this special pop?
    106: ["St. John's Wort for Depression", 1],
    107: ["SSRIs for Depression in Kidney Disease Patients", 1],  #
    108: [
        "Behavioral Addictions",
        1,
    ],  #  pathological gambling mostly, and problematic internet use
    109: ["SSRIs Effect on Response Inhibition", 2],
    110: ["SSRIs for Dysthymia", 1],
    111: ["Generic SSRIs", 1],  # which category? now under other/drug
    112: ["SSRIs in Model Organisms (C. elegans and Drosophila)", 2],
    113: ["Vilazodone for Depression", 1, 4],
    114: ["SSRIs for Trichotillomania", 1],
    115: ["Selective serotonin reuptake inhibitors", 5, 0],
    116: ["SSRIs for Depression in and Treatment of Multiple Sclerosis", 1],
    117: [
        "SSRIs for Depression in Chronic Obstructive Pulmonary Disease (COPD) Patients",
        1,
    ],
    118: ["Folate for Depression", 1],
    119: [
        "Paroxetine's Therapeutic Potential as G Protein-coupled Receptor Kinase 2 (GRK2) Inhibitor for Cardiovascular Diseases",
        3,
    ],
    120: ["Tryptophan Depletion in Depression", 1],
    121: ["Serotonin and Dopamine Release and Uptake Dynamics", 2],
    122: ["SSRIs for Congenital Myasthenic Syndrome (SCCMS)", 1],
    123: ["Olfactory Bulbectomized (OB) Rat Model for Depression", 1],
    124: ["SSRIs for Depression in HIV-positive Patients", 1],
    125: ["SSRIs for Smoking Cessation", 1],
    126: [
        "Loudness Dependence of Auditory Evoked Potentials (LDAEP) as Biomarker for Depression",
        3,
    ],
    127: ["SSRIs for Tinnitus", 1],
    128: ["SSRIs for Syncope", 1],
    129: ["Novel Compounds Synthesis and Antidepressant Potential", 2],
    130: ["Risk of Hallucinations", 1],
    131: ["Exercise for Depression", 1],
    #####################################
    ### from here on under 50 papers ###
    #####################################
    132: ["SSRIs for Narcolepsy", 1],
    133: ["SSRIs and Galanin in Depression", 2],
    134: ["SSRIs for Tourette's Syndrome", 1],
    135: ["Adulterants in Weight Loss Supplements", 0],
    136: ["Risk of Microscopic Colitis", 1],
    137: ["SSRIs for Pediatric OCD", 1],
    138: ["SSRIs Cellular Distribution", 2],  # dont understand the topic so well
    139: ["Psychotic Depression (STOP-PD Study)", 0, 1],
    #####################################
    ### from here on under 25 papers ###
    #####################################
    140: ["Treatment of Paroxetine Induced Sexual Dysfunction", 2],
    141: ["Childhood Abuse and Depression Treatment Response", 1],
    142: ["Risk of Priapism", 1],  # painful erection
    143: ["Psilocybin for Depression", 1],
    144: ["SSRIs for Depersonalization Disorder", 1],
    145: ["Risk of Reversible Cerebral Vasoconstriction Syndrome (RCVS)", 1],
    146: ["Gateways to Clinical Trials", 5],
    147: ["Depression in Infertile Women", 1, 5],
    148: ["SSRIs for Olfactory reference syndrome (ORS)", 1, 5],
}
path = "../output/cluster-qualifications/raw_cluster_labels.json"

with open(path, "w") as f:
    json.dump(cluster_labels, f)
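
One detail worth keeping in mind when this file is read back: JSON object keys are always strings, so the integer cluster IDs come back as `"0"`, `"1"`, and so on. A short sketch of the round trip, using two entries copied from the dictionary above:

```python
import json

labels = {
    0: ["Serotonin Receptor Studies", 2],
    9: ["Fluvoxamine for Depression", 1, 4],
}

# Serialize and parse again, as writing and re-reading the file would do.
restored = json.loads(json.dumps(labels))
print(list(restored))  # the keys are now strings

# Convert the keys back to integers before using them as cluster IDs.
restored_int = {int(key): value for key, value in restored.items()}
print(restored_int == labels)
```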

### Convert to a nested JSON format and save
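
For a single entry, the target structure looks as follows. This is a sketch that mirrors the cell below on one entry; `format_entry` is an illustrative name and not a function of the notebook.

```python
import json


def format_entry(entry):
    """Turn [label, code, ...] into the nested {"children": [...]} form."""
    label, *codes = entry
    return {
        "children": [
            {"label": label},
            # Codes are joined into one string such as "1, 4".
            {"evidence_type": ", ".join(map(str, codes))},
        ]
    }


formatted = format_entry(["Fluvoxamine for Depression", 1, 4])
print(json.dumps({"cluster_9": formatted}, indent=2))
```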


In [16]:
formatted_cluster_labels = {}

for k, v in cluster_labels.items():
    if isinstance(v, list) and len(v) >= 1:
        label = v[0]
        evidence = ", ".join(map(str, v[1:]))
        formatted_cluster_labels[f"cluster_{k}"] = {
            "children": [{"label": label}, {"evidence_type": evidence}],
        }
        if not evidence:
            print(f"Evidence type missing: cluster_{k}: {label}")
    else:
        print(f"Warning: Unexpected value format for cluster_{k}: {v}")

json_string = json.dumps(formatted_cluster_labels, indent=2)

# save to file
path = "../output/cluster-qualifications/formatted_cluster_labels.json"
with open(path, "w") as f:
    f.write(json_string)

# Create Excel Sheet with Cluster Labels, Years, and Evidence Types
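
Each evidence string such as `"1, 4"` is expanded into one 0/1 column per tag. The sketch below shows that step in plain Python for a single string; it splits the string into whole codes instead of testing for substrings, and `one_hot_evidence` is a hypothetical helper name.

```python
EVIDENCE_COLUMNS = {
    "et_clinical": 1,
    "et_non_clinical": 2,
    "et_mixed": 3,
    "et_review": 4,
    "et_garbage": 5,
    "et_other": 0,
}


def one_hot_evidence(evidence: str) -> dict:
    """Map an evidence string like "1, 4" to 0/1 flags per column."""
    codes = {int(token) for token in evidence.split(",") if token.strip()}
    return {col: int(code in codes) for col, code in EVIDENCE_COLUMNS.items()}


print(one_hot_evidence("1, 4"))
```

Splitting on commas keeps the encoding correct even if two-digit codes were added later, which a plain substring test would not guarantee.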


In [17]:
def process_cluster_data(formatted_cluster_labels, df_summary):
    """
    Process the cluster data to create a DataFrame with cluster labels and evidence types
    """
    # Create DataFrame from cluster_labels dictionary
    df_clusters = pd.DataFrame(formatted_cluster_labels).T.reset_index()
    df_clusters.columns = ["Cluster", "Label"]
    df_clusters["Cluster"] = (
        df_clusters["Cluster"].str.replace("cluster_", "").astype(int)
    )
    df_clusters["ClusterLabel"] = df_clusters["Label"].apply(lambda x: x[0]["label"])
    df_clusters["EvidenceType"] = df_clusters["Label"].apply(
        lambda x: x[1]["evidence_type"]
    )

    # Add columns for each evidence type
    evidence_types = {
        "et_clinical": 1,
        "et_non_clinical": 2,
        "et_mixed": 3,
        "et_review": 4,
        "et_garbage": 5,
        "et_other": 0,
    }
    for col_name, evid_type in evidence_types.items():
        # Match whole codes rather than substrings of the "1, 4"-style string
        df_clusters[col_name] = df_clusters["EvidenceType"].apply(
            lambda x: int(str(evid_type) in x.split(", "))
        )

    # Remove the intermediate Label and EvidenceType columns
    df_clusters.drop(columns=["Label", "EvidenceType"], inplace=True)

    # Merge with df_summary on the Cluster column
    df_clusters = df_clusters.merge(
        df_summary[
            [
                "Cluster",
                "Nr of Pubs",
                "25th Percentile Year",
                "Median Year",
                "75th Percentile Year",
            ]
        ],
        left_on="Cluster",
        right_on="Cluster",
        how="left",
    )

    # Make all column names lowercase and replace spaces with underscores
    df_clusters.columns = [col.lower().replace(" ", "_") for col in df_clusters.columns]

    return df_clusters

In [18]:
# read the "summary" sheet from the cluster-explorer workbook (FinalRun_i200 version)

df_summary = pd.read_excel(
    "../output/cluster-qualifications/cluster-explorer/FinalSelect/FinalRun_i200_SingleSolExplorer_alpha0.3_k10_res0.002.xlsx",
    sheet_name="summary",
)
df_summary["Cluster"] = df_summary["Cluster"].astype(int)
# Use the function
df_cluster_info = process_cluster_data(formatted_cluster_labels, df_summary)
df_cluster_info.head(5)

Unnamed: 0,cluster,clusterlabel,et_clinical,et_non_clinical,et_mixed,et_review,et_garbage,et_other,nr_of_pubs,25th_percentile_year,median_year,75th_percentile_year
0,0,Serotonin Receptor Studies,0,1,0,0,0,0,1218,1995.0,2000.0,2007.0
1,1,Aquatic Ecotoxicology,0,0,0,0,0,1,1093,2014.0,2018.0,2021.0
2,2,Risks of Prenatal Exposure,1,0,0,0,0,0,1073,2007.0,2013.0,2018.0
3,3,Quantification of SSRIs in Biological Samples,0,1,0,0,0,0,1070,2005.0,2012.0,2017.0
4,4,SSRIs for Obsessive-Compulsive Disorder (OCD),1,0,0,0,0,0,1006,1999.0,2007.0,2016.0


# Add Each Branch As Column to DataFrame


#### Load cluster hierarchy
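
The hierarchy file is treated as a nested dictionary whose leaves are lists of cluster IDs, and every root-to-leaf path is turned into one-hot flags. The sketch below shows the idea on a toy hierarchy in plain Python; the placement of cluster 5 is made up for illustration, and `flatten` is a simplified stand-in for the function defined below.

```python
def flatten(hierarchy, path=()):
    """Return one row per (leaf, cluster) pair with a flag for each path level."""
    rows = []
    for key, value in hierarchy.items():
        new_path = path + (key,)
        if isinstance(value, dict):
            rows.extend(flatten(value, new_path))
        else:
            rows.extend(
                {"cluster": cluster, **{level: 1 for level in new_path}}
                for cluster in value
            )
    return rows


# Toy hierarchy: cluster 5 is deliberately listed under two branches.
toy = {
    "Pharmacology": {
        "Pharmacodynamics": {"Mechanism of action": [0, 10]},
        "Pharmacokinetics": [5],
    },
    "Safety": {"Toxicity": [5]},
}

rows = flatten(toy)
for row in rows:
    print(row)
```

Levels that a row does not pass through are simply absent from its dictionary; in the DataFrame version these gaps become `NaN` and are filled with 0.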


In [19]:
p = "../output/cluster-qualifications/ClusterHierachy_noComments.json"
with open(p) as f:  # close the file handle after reading
    tree_hierarchy = json.load(f)

In [20]:
def flatten_hierarchy(hierarchy, path=None):
    """Recursively flattens a nested dictionary into a list of dictionaries."""
    path = path or []  # Use empty list if path is not provided

    items = []
    for key, value in hierarchy.items():
        new_path = path + [key]
        if isinstance(value, dict):
            items.extend(flatten_hierarchy(value, new_path))
        else:
            items.extend(
                {
                    **{level: 1 for level in new_path},  # One-hot encode path
                    "cluster": cluster,
                }
                for cluster in value  # Handle lists of clusters
            )
    return items


def create_hierarchy_dataframe(tree_hierarchy):
    """
    Creates a structured DataFrame from a hierarchical JSON dictionary.
    Handles missing hierarchy levels and provides summary statistics.
    """

    flattened_data = flatten_hierarchy(tree_hierarchy)
    df = pd.DataFrame(flattened_data)

    # Fill NaN with 0 (levels absent from a row's path) and cast the cluster id to int
    df = df.fillna(0).astype({"cluster": int})

    # One row per cluster: max() keeps a flag set if any of the cluster's paths set it
    df_grouped = df.groupby("cluster").max().reset_index()

    # Convert column names to lowercase with underscores
    df_grouped.columns = df_grouped.columns.str.lower().str.replace(" ", "_")

    print("Number of unique clusters:", df_grouped["cluster"].nunique())
    print("Entries before grouping:", len(df))
    print("Entries after grouping:", len(df_grouped))
    print("Total Number of branches (Root to Leaf):", df_grouped.shape[1] - 1)

    return df_grouped


df_hierarchy = create_hierarchy_dataframe(tree_hierarchy)
df_hierarchy.head(5)

Number of unique clusters: 149
Entries before grouping: 207
Entries after grouping: 149
Total Number of branches (Root to Leaf): 47
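
The drop from 207 entries to 149 after grouping suggests that some clusters are filed under more than one leaf of the hierarchy. A sketch of how such clusters could be listed, using a toy hierarchy with made-up placements; `leaf_clusters` is a hypothetical helper.

```python
from collections import Counter


def leaf_clusters(hierarchy):
    """Yield every cluster id found in a leaf list, once per occurrence."""
    for value in hierarchy.values():
        if isinstance(value, dict):
            yield from leaf_clusters(value)
        else:
            yield from value


# Toy hierarchy: cluster 5 appears under two different leaves.
toy = {
    "Pharmacology": {
        "Pharmacodynamics": {"Mechanism of action": [0, 10]},
        "Pharmacokinetics": [5],
    },
    "Safety": {"Toxicity": [5]},
}

counts = Counter(leaf_clusters(toy))
multi = {cluster: n for cluster, n in counts.items() if n > 1}
print("entries:", sum(counts.values()), "unique clusters:", len(counts))
print("clusters in several leaves:", multi)
```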


Unnamed: 0,cluster,pharmacology,pharmacodynamics,mechanism_of_action,animal_models_of_disorders,otherpharmacodynamics,pharmacokinetics,pharmacogenetics,analytical_methods,biomarker_search,...,agomelatine,vilazodone,psilocybin,generic_ssris,ssris,ecotoxicology,alternative_treatments,drug_utilization_patterns,cost-effectineness,otherother
0,0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
# merge on cluster with df_cluster_info
df_cluster_info = df_cluster_info.merge(df_hierarchy, on="cluster", how="left")
print(df_cluster_info.shape)
df_cluster_info.head()

(149, 59)


Unnamed: 0,cluster,clusterlabel,et_clinical,et_non_clinical,et_mixed,et_review,et_garbage,et_other,nr_of_pubs,25th_percentile_year,...,agomelatine,vilazodone,psilocybin,generic_ssris,ssris,ecotoxicology,alternative_treatments,drug_utilization_patterns,cost-effectineness,otherother
0,0,Serotonin Receptor Studies,0,1,0,0,0,0,1218,1995.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Aquatic Ecotoxicology,0,0,0,0,0,1,1093,2014.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,2,Risks of Prenatal Exposure,1,0,0,0,0,0,1073,2007.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Quantification of SSRIs in Biological Samples,0,1,0,0,0,0,1070,2005.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,SSRIs for Obsessive-Compulsive Disorder (OCD),1,0,0,0,0,0,1006,1999.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
p = "../output/cluster-qualifications/"
df_cluster_info.to_excel(p + "ClusterLabelsAndTopicHierarchy.xlsx", index=False)

In [23]:
df_cluster_info.columns

Index(['cluster', 'clusterlabel', 'et_clinical', 'et_non_clinical', 'et_mixed',
       'et_review', 'et_garbage', 'et_other', 'nr_of_pubs',
       '25th_percentile_year', 'median_year', '75th_percentile_year',
       'pharmacology', 'pharmacodynamics', 'mechanism_of_action',
       'animal_models_of_disorders', 'otherpharmacodynamics',
       'pharmacokinetics', 'pharmacogenetics', 'analytical_methods',
       'biomarker_search', 'otherpharmacology', 'indications', 'disorders',
       'depression', 'anxiety', 'ocd', 'weight_and_eating_disorders',
       'substance_abuse/addictions', 'otherdisorders', 'safety',
       'special_populations', 'pediatrics', 'geriatrics', 'otherpopulations',
       'prenatal_exposure', 'suicide', 'sexual_dysfunction', 'toxicity',
       'cardiovascular', 'othersafety', 'other', 'drugs', 'fluvoxamine',
       'paroxetine', 'fluoxetine', 'escitalopram', 'citalopram', 'sertraline',
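       'agomelatine', 'vilazodone', 'psilocybin', 'generic_ssris', 'ssris',
       'ecotoxicology', 'alternative_treatments', 'drug_utilization_patterns',
       'cost-effectineness', 'otherother'],
      dtype='object')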

# Build Full Tree Including Labels
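
The tree is built in two steps: cluster IDs in the leaves are replaced by their labels, and the nested dictionary is then rewritten into `name`/`children` nodes. The sketch below runs both steps on a toy branch; `with_labels` and `to_nodes` are simplified stand-ins for the functions in the next cell, and the two labels are taken from the dictionary above.

```python
def with_labels(node, mapping):
    """Replace cluster ids in leaf lists by labels; unknown ids stay as they are."""
    if isinstance(node, dict):
        return {key: with_labels(value, mapping) for key, value in node.items()}
    return [mapping.get(item, item) for item in node]


def to_nodes(node):
    """Rewrite a nested dict or leaf list as a list of name/children nodes."""
    if isinstance(node, dict):
        return [{"name": key, "children": to_nodes(value)} for key, value in node.items()]
    return [{"name": item} for item in node]


labels = {0: "Serotonin Receptor Studies", 19: "SSRIs Effect on Fear"}
toy = {"Pharmacology": {"Pharmacodynamics": {"Mechanism of action": [0, 19]}}}

labelled = with_labels(toy, labels)
# The outer key is lowercased so that it differs from the root's "name".
tree = {
    root: {"name": root_name, "children": to_nodes(subtree)}
    for root, root_name, subtree in (
        (name.lower(), name, sub) for name, sub in labelled.items()
    )
}
print(tree)
```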


In [24]:
import json


# Define the function to replace numbers with labels
def replace_numbers_with_labels(node, mapping):
    if isinstance(node, list):
        return [mapping.get(item, item) for item in node]
    elif isinstance(node, dict):
        return {
            key: replace_numbers_with_labels(value, mapping)
            for key, value in node.items()
        }
    else:
        return node


# Transform the tree into the desired format
def transform_tree(node):
    """
    Recursively transforms the tree into a nested structure with "name" and "children" keys.

    Args:
      node (dict or list): The current subtree or list of items.

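    Returns:
      list or dict: A list of {"name": ..., "children": ...} nodes for a dict,
      a list of {"name": ...} leaves for a list, or a single {"name": ...} leaf otherwise.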
    """
    if isinstance(node, dict):
        return [
            {"name": key, "children": transform_tree(value)}
            for key, value in node.items()
        ]
    elif isinstance(node, list):
        return [{"name": item} for item in node]
    else:
        return {"name": node}


cluster_label_dict = dict(
    zip(df_cluster_info["cluster"], df_cluster_info["clusterlabel"])
)
# Replace numbers with labels in the hierarchy
tree_with_labels = replace_numbers_with_labels(tree_hierarchy, cluster_label_dict)

p = "../output/cluster-qualifications/FullClusterLabelTree.json"

with open(p, "w") as f:
    json.dump(tree_with_labels, f)

updated_tree_with_labels = {}
for root in tree_with_labels.keys():
    root_dict = {
        # d3js does not work if root and name are the same
        root.lower(): {
            "name": root,
            "children": transform_tree(tree_with_labels[root]),
        }
    }
    updated_tree_with_labels.update(root_dict)

# Print the result
print(json.dumps(updated_tree_with_labels, indent=2))

p = "../output/cluster-qualifications/FullClusterLabelTreeFormated.json"

with open(p, "w") as f:
    json.dump(updated_tree_with_labels, f)

{
  "pharmacology": {
    "name": "Pharmacology",
    "children": [
      {
        "name": "Pharmacodynamics",
        "children": [
          {
            "name": "Mechanism of action",
            "children": [
              {
                "name": "Serotonin Receptor Studies"
              },
              {
                "name": "Paroxetine Binding"
              },
              {
                "name": "SSRIs Effect on Fear"
              },
              {
                "name": "SSRIs Effect on Neural Processing of Emotional Cues"
              },
              {
                "name": "Serotonin Binding and Receptor Studies"
              },
              {
                "name": "Astrocyte Receptors"
              },
              {
                "name": "Repeated SSRIs Exposures Effects on Dopamine Receptors"
              },
              {
                "name": "Neurochemical and Electrophysiological Correlates of SSRIs"
              },
              {
    