[Leiden algorithm documentation](https://leidenalg.readthedocs.io/en/latest/install.html)

# General Approach

In this analysis, we explore the clustering of scientific publications using the Leiden algorithm, with a focus on adjusting the algorithm's parameters to improve the coherence and relevance of the resulting clusters. The Leiden algorithm, which improves on the earlier Louvain method by guaranteeing well-connected communities, is employed for its effectiveness in identifying communities within networks. We experiment with a range of resolution parameter values to tune the granularity of the clustering results. The objective is to find a balance where clusters are neither too broad, encompassing multiple unrelated topics, nor too narrow, breaking coherent subjects into excessive subgroups.

The analysis starts with the application of the Leiden algorithm using default settings and then progresses to more sophisticated implementations, including adjusting the resolution parameter to influence the size and coherence of clusters. This parameter tuning is crucial for addressing the resolution limit problem inherent in modularity-based community detection methods. By carefully selecting resolution parameters, we aim to achieve a set of clusters that are meaningful and manageable for qualitative analysis.

This process is supported by the creation of summary sheets and detailed exploratory sheets for each parameter set, facilitating a thorough review of the clustering outcomes. Parameters leading to the most coherent and distinct clusters, based on predefined decision criteria, are identified and selected for further analysis. The final selection of parameters is based on qualitative assessments of cluster coherence, ensuring that the clusters formed are both scientifically relevant and insightful for further research explorations.

Notes about the Leiden algorithm:

### The simplest implementation is this:

Here we just use all default parameters.
The metrics are good, but the qualitative assessment is poor.
The `ModularityVertexPartition` is the default partitioning algorithm.

- It is based on the modularity measure (link density within communities vs. link density between communities)
- Higher scores = stronger division into communities
- It tries to maximize the modularity score
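
To make the modularity measure concrete, here is a minimal pure-Python sketch for an unweighted, undirected toy graph. The function name `modularity` and the toy graph are illustrative assumptions; the notebook itself obtains the score from `PartitionCreator`, whose internals are not shown here.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs; membership: dict mapping node -> community id.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    # Fraction of internal edges minus the fraction expected at random
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph (hypothetical): two triangles joined by a single bridge edge (2-3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(f"two communities: {q_two:.3f}, one community: {q_one:.3f}")
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the behaviour the optimizer exploits.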

The problem here is that the partitions are really big (if run without `max_comm_size`, around 5,000 papers in the largest community).
Examining the most central articles (based on eigenvector centrality), they do not seem coherent.

**PROBLEM:** Modularity, though robust for many practical applications, suffers from the resolution limit problem, in which optimization may fail to identify clusters smaller than a certain scale that is dependent on properties of the network. (https://arxiv.org/pdf/2308.09644.pdf)
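
The resolution limit can be illustrated with the classic ring-of-cliques construction: small cliques arranged in a ring, each joined to the next by one edge. The sketch below is not taken from this notebook; the function name and the choice of 30 cliques of 5 nodes are assumptions for illustration. It computes modularity in closed form for partitions that group adjacent cliques.

```python
def ring_of_cliques_modularity(n_cliques, clique_size, cliques_per_community):
    """Modularity of a ring of cliques when adjacent cliques are grouped together.

    Every clique is fully connected and linked to the next clique by one edge.
    """
    assert n_cliques % cliques_per_community == 0
    assert cliques_per_community < n_cliques  # formula assumes more than one community
    edges_in_clique = clique_size * (clique_size - 1) // 2
    m = n_cliques * edges_in_clique + n_cliques  # clique edges plus ring links
    n_communities = n_cliques // cliques_per_community
    # Internal edges: the cliques plus the ring links between grouped cliques
    internal = cliques_per_community * edges_in_clique + (cliques_per_community - 1)
    # Degree sum: two per clique edge, plus two ring-link endpoints per clique
    degree_sum = cliques_per_community * (2 * edges_in_clique + 2)
    return n_communities * (internal / m - (degree_sum / (2 * m)) ** 2)


q_single = ring_of_cliques_modularity(30, 5, cliques_per_community=1)
q_pairs = ring_of_cliques_modularity(30, 5, cliques_per_community=2)
print(f"one clique per community: {q_single:.4f}")
print(f"two cliques per community: {q_pairs:.4f}")
```

Merging pairs of cliques yields the higher modularity even though each clique is the obvious community, so a modularity optimizer would prefer the coarser solution.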

### The second implementation is this:

Here we tune the resolution parameter. To access it, we use `CPMVertexPartition` (CPM = Constant Potts Model).
This means that modularity is no longer the quantity being optimized.

- The resolution parameter is the only parameter of this algorithm
- It compares the actual link density to the resolution parameter
- If the link density is higher than the resolution parameter, the nodes are put into the same community
- It has a quality attribute, which is similar to the modularity score
- We will not rely on this but use a qualitative analysis to assess the results
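
The comparison between link density and the resolution parameter can be sketched in a few lines. This assumes the unweighted form of the CPM quality function, in which each community contributes its internal edge count minus γ times its number of node pairs; the function name and toy graph are hypothetical, and the notebook's graphs may carry edge weights that this sketch ignores.

```python
from collections import Counter


def cpm_quality(edges, membership, resolution):
    """Unweighted CPM quality: sum over communities of e_c - resolution * C(n_c, 2)."""
    internal = sum(1 for u, v in edges if membership[u] == membership[v])
    sizes = Counter(membership.values())
    pairs = sum(n * (n - 1) // 2 for n in sizes.values())
    return internal - resolution * pairs


# Toy graph (hypothetical): two triangles joined by a single bridge edge (2-3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

for gamma in (0.05, 0.5):
    q_split = cpm_quality(edges, split, gamma)
    q_merged = cpm_quality(edges, merged, gamma)
    better = "merged" if q_merged > q_split else "split"
    print(f"gamma={gamma}: split={q_split:.2f}, merged={q_merged:.2f} -> {better}")
```

At a low resolution the single merged community scores higher, while at a higher resolution the two triangles are kept apart, which is why larger γ values produce more and smaller clusters.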

In previous papers (Waltman 2020a, Ahlgren 2020), the authors used different levels of granularity. Waltman (2020a) used 10 values of the resolution parameter γ and Ahlgren (2020) used 11; together they cover 13 values: 0.000001, 0.000002, 0.000005, 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01
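
The values from these papers follow a 1-2-5 pattern per decade. A small helper can generate such a grid instead of typing it by hand; the name `resolution_grid` and its arguments are assumptions for illustration and are not part of the notebook.

```python
def resolution_grid(min_exp=-6, max_exp=-2):
    """1-2-5 series per decade, from 10**min_exp up to and including 10**max_exp."""
    # Parsing "5e-4" style strings avoids floating-point noise from multiplication
    grid = [float(f"{m}e{e}") for e in range(min_exp, max_exp) for m in (1, 2, 5)]
    grid.append(float(f"1e{max_exp}"))
    return grid


grid = resolution_grid()
print(len(grid), grid)
```

Extra values between the standard steps (such as 0.006) can simply be appended to the returned list.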

We decided to use the following resolution parameters to narrow down our search:
0.00001, 0.00002, 0.0001, 0.0002, 0.001, 0.002, 0.01, 0.02

We then evaluated the solutions.


# Packages


In [2]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import networkx as nx
import leidenalg as la
import igraph as ig
import sys
import glob
from tqdm import tqdm

sys.path.append("/Users/jlq293/Projects/Study-1-Bibliometrics/src/network/")
from PartitionCreator import PartitionCreator
from NetworkAnalyzer import TextAnalyzer, CommunityExplorer, ClusterPlotter

In [14]:
def load_graph_files():
    """
    Load graph files and return a dictionary of graphs and a DataFrame.

    Returns:
        params_graph_dict (dict): A dictionary containing graph objects, where the keys are the graph names.
        df (DataFrame): A DataFrame containing embeddings.

    """
    graph_files = glob.glob("../data/05-graphs/weighted-knn-citation-graph/*.graphml")
    graph_files.sort()

    params_graph_dict = {
        f.split("/")[-1]
        .replace("weighted_", "")
        .replace("_knn_citation.graphml", ""): ig.Graph.Read_GraphML(f)
        for f in graph_files
    }

    df = pd.read_pickle("../data/04-embeddings/df_with_specter2_embeddings.pkl")

    return params_graph_dict, df


params_graph_dict, df = load_graph_files()
print(params_graph_dict.keys())

dict_keys(['alpha0.3_k10', 'alpha0.3_k15', 'alpha0.3_k20', 'alpha0.3_k5', 'alpha0.5_k10', 'alpha0.5_k15', 'alpha0.5_k20', 'alpha0.5_k5'])
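
The keys encode the graph settings as text, so plain string sorting places `k5` after `k20`. A small parser makes the settings available as numbers. This is a sketch that assumes keys of the form `alpha<value>_k<value>` with an optional `_res<value>` suffix, as seen in this notebook; `parse_params` is a hypothetical helper.

```python
import re

KEY_PATTERN = re.compile(
    r"alpha(?P<alpha>[\d.]+)_k(?P<k>\d+)(?:_res(?P<res>[\d.e+-]+))?$"
)


def parse_params(key):
    """Split a key such as 'alpha0.3_k10_res0.002' into (alpha, k, resolution)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unexpected key format: {key!r}")
    res = match.group("res")
    resolution = float(res) if res is not None else None
    return float(match.group("alpha")), int(match.group("k")), resolution


keys = ["alpha0.3_k10", "alpha0.3_k15", "alpha0.3_k20", "alpha0.3_k5"]
ordered = sorted(keys, key=parse_params)  # numeric order: k5, k10, k15, k20
print(ordered)
print(parse_params("alpha0.5_k5_res1e-05"))
```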


# ModularityVertexPartition

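Not favorable: the modularity scores are good, but the largest communities are very big and do not look coherent on inspection.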


In [15]:
modularity = []
nr_clusters = []
# Run Leiden with modularity optimization on every graph variant
for graph_name, G in params_graph_dict.items():
    pc = PartitionCreator(G, df)
    pc.create_partition_from_modularityvertexpartition(
        n_iterations=20,
        # max_comm_size=1500
    )
    modularity.append(pc.modularity)
    nr_clusters.append(len(pc.cluster_sizes))

# create df
df_params = pd.DataFrame(
    {
        "params": list(params_graph_dict.keys()),
        "modularity": modularity,
        "nr_clusters": nr_clusters,
    }
).sort_values(by="modularity", ascending=False)
df_params

| | params | modularity | nr_clusters |
|---|---|---|---|
| 5 | alpha0.5_k15 | 0.683545 | 29 |
| 6 | alpha0.5_k20 | 0.682633 | 29 |
| 4 | alpha0.5_k10 | 0.679827 | 29 |
| 1 | alpha0.3_k15 | 0.679479 | 34 |
| 2 | alpha0.3_k20 | 0.679333 | 32 |
| 0 | alpha0.3_k10 | 0.676278 | 32 |
| 7 | alpha0.5_k5 | 0.668973 | 30 |
| 3 | alpha0.3_k5 | 0.665366 | 36 |


# CPM RESOLUTION DETERMINATION


# Waltman 2020a:

Different levels of granularity were considered. For each relatedness measure, we obtained 10 clustering solutions, each of them for a different value of the resolution parameter γ. The following values of γ were used: 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, and 0.01.

# Ahlgren 2020:

Using different values of the resolution parameter γ (0.000001, 0.000002, 0.000005, 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002), we obtain 11 clustering solutions for each relatedness measure. Compared to our earlier study (Ahlgren et al., 2019), we exclude the clustering solutions for the two largest resolution values used in that study (0.005 and 0.01). These clustering solutions have around 300,000 and 500,000 clusters, respectively, and most of the clusters consist of fewer than 10 publications.


In [16]:
def process_params_graph_dict(params_graph_dict):
    """
    Create a CPM partition for every graph and resolution value and summarize the communities.

    Relies on the module-level DataFrame `df` loaded above.

    Args:
        params_graph_dict (dict): A dictionary containing graph objects, where the keys are the graph names.
    Returns:
        partitions_explorer_dict (dict): Summary DataFrames keyed by "<graph name>_res<resolution>".
        df_graph_cluster_params (DataFrame): Graph name, resolution, and number of clusters
            for every run, sorted by number of clusters (descending).

    """
    # Define the resolution values to iterate over
    resolution_values = [
        0.000001,
        0.000002,
        0.000005,
        0.00001,
        0.00002,
        0.00005,
        0.0001,
        0.0002,
        0.0005,
        0.001,
        0.002,
        0.006,
        0.005,
        0.01,
        0.02,
        0.05,
    ]
    iterations = 20
    nr_clusters = []
    graph_params = []
    resolution_params = []

    partitions_explorer_dict = {}

    # Iterate over each graph and resolution value
    for params, G in params_graph_dict.items():
        for resolution in resolution_values:
            # Create partition using CPMVertexPartition algorithm
            pc = PartitionCreator(G, df)
            pc.create_partition_from_cmpvertexpartition(
                n_iterations=iterations,
                resolution_parameter=resolution,
                verbose=False,
                cluster_colunm_name=f"cluster_{params}_res{resolution}",
                centrality_column_name=None,
            )

            # Explore communities using CommunityExplorer
            ce = CommunityExplorer(
                pc.df, cluster_column=f"cluster_{params}_res{resolution}"
            )
            ce.create_cluster_sheets(sort=False, n=15)
            ce.create_summary_sheet()
            df_summary, _ = ce.full_return()
            partitions_explorer_dict[f"{params}_res{resolution}"] = df_summary
            nr_clusters.append(len(pc.partition.sizes()))
            graph_params.append(params)
            resolution_params.append(resolution)

    # Create a DataFrame with the graph parameters and the number of clusters
    df_graph_cluster_params = pd.DataFrame(
        {
            "params": graph_params,
            "resolution": resolution_params,
            "nr_clusters": nr_clusters,
        }
    ).sort_values(by="nr_clusters", ascending=False)

    return partitions_explorer_dict, df_graph_cluster_params


partitions_explorer_dict, df_graph_cluster_params = process_params_graph_dict(
    params_graph_dict
)

In [17]:
# keep only solutions with more than 50 and fewer than 500 clusters
df_graph_cluster_params_lim = df_graph_cluster_params[
    (df_graph_cluster_params["nr_clusters"] > 50)
    & (df_graph_cluster_params["nr_clusters"] < 500)
]
df_graph_cluster_params_lim

| | params | resolution | nr_clusters |
|---|---|---|---|
| 93 | alpha0.5_k15 | 0.01 | 477 |
| 75 | alpha0.5_k10 | 0.006 | 442 |
| 122 | alpha0.5_k5 | 0.002 | 433 |
| 29 | alpha0.3_k15 | 0.01 | 379 |
| 109 | alpha0.5_k20 | 0.01 | 375 |
| 76 | alpha0.5_k10 | 0.005 | 373 |
| 11 | alpha0.3_k10 | 0.006 | 362 |
| 12 | alpha0.3_k10 | 0.005 | 320 |
| 45 | alpha0.3_k20 | 0.01 | 299 |
| 91 | alpha0.5_k15 | 0.006 | 298 |

In [18]:
def save_graph_summary_to_excel(partitions_explorer_dict, filename):
    """
    Save partitions_explorer_dict to an Excel file, with each value as a separate sheet.

    Args:
        partitions_explorer_dict (dict): A dictionary containing graph summary dataframes.
        filename (str): Name of the Excel file, written to ../output/tables/cluster-explorer/.

    Returns:
        None
    """
    with pd.ExcelWriter(f"../output/tables/cluster-explorer/{filename}") as writer:
        for k, v in list(partitions_explorer_dict.items()):
            # if "_k5_" in k:
            #    continue
            v.to_excel(writer, sheet_name=k)

In [19]:
save_graph_summary_to_excel(
    partitions_explorer_dict, filename="FullParameterFinder_ClusterExplorer.xlsx"
)

# Limit Parameter Space


### Decision Criteria:

1. Premature ejaculation and sexual dysfunction should be distinct clusters.
2. Are anxiety and panic disorder mixed?
3. Diabetic mice vs. body weight and diabetes vs. depression treatment in diabetics?
4. Use alpha = 0.3 (look at alpha0.5_k10_res0.002): topics 50 and 133 have very mixed words and little coherence. This is not the case for alpha0.3_k10_res0.002 (there, however, sexual dysfunction and premature ejaculation form a single topic).
5. Zimeldine (one or two topics?)
6. Pregnancy-related topics.


In [20]:
keepornot = {
    # all k=5 have many many clusters of single papers
    "alpha0.5_k5_res1e-05": 0,
    "alpha0.5_k5_res2e-05": 0,
    "alpha0.5_k5_res0.0001": 0,
    "alpha0.5_k5_res0.0002": 0,
    "alpha0.5_k5_res0.001": 0,
    "alpha0.5_k5_res0.002": 0,
    "alpha0.5_k5_res0.006": 0,
    "alpha0.5_k5_res0.01": 0,
    "alpha0.5_k5_res0.02": 0,
    # K10
    "alpha0.5_k10_res1e-05": 0,
    "alpha0.5_k10_res2e-05": 0,
    "alpha0.5_k10_res0.0001": 0,
    "alpha0.5_k10_res0.0002": 0,
    "alpha0.5_k10_res0.001": 0,  # premature ejaculation and sexual dysfunction are one topic
    "alpha0.5_k10_res0.002": 0,  # weight and sex are one topic, same problem as above
    "alpha0.5_k10_res0.006": 0,
    "alpha0.5_k10_res0.01": 0,
    "alpha0.5_k10_res0.02": 0,
    # K15
    "alpha0.5_k15_res1e-05": 0,
    "alpha0.5_k15_res2e-05": 0,
    "alpha0.5_k15_res0.0001": 0,
    "alpha0.5_k15_res0.0002": 0,
    "alpha0.5_k15_res0.001": 0,  # not bad, very broad and inclusive topics
    "alpha0.5_k15_res0.002": 0,  # good; sexual dysfunction and premature ejaculation are one topic
    "alpha0.5_k15_res0.006": 1,  # pretty good, but a bit too many clusters (too many pregnancy clusters, for example)
    "alpha0.5_k15_res0.01": 0,
    "alpha0.5_k15_res0.02": 0,
    # K20
    "alpha0.5_k20_res1e-05": 0,
    "alpha0.5_k20_res2e-05": 0,
    "alpha0.5_k20_res0.0001": 0,
    "alpha0.5_k20_res0.0002": 0,
    "alpha0.5_k20_res0.001": 0,
    "alpha0.5_k20_res0.002": 0,
    "alpha0.5_k20_res0.006": 1,
    "alpha0.5_k20_res0.01": 1,  # many, many HT topics (something between 0.002 and 0.01 needed)
    "alpha0.5_k20_res0.02": 0,
    # NEW ALPHA 0.3
    # K5
    "alpha0.3_k5_res1e-05": 0,
    "alpha0.3_k5_res2e-05": 0,
    "alpha0.3_k5_res0.0001": 0,
    "alpha0.3_k5_res0.0002": 0,
    "alpha0.3_k5_res0.001": 0,
    "alpha0.3_k5_res0.002": 0,
    "alpha0.3_k5_res0.006": 0,
    "alpha0.3_k5_res0.01": 0,
    "alpha0.3_k5_res0.02": 0,
    # K10
    "alpha0.3_k10_res1e-05": 0,
    "alpha0.3_k10_res2e-05": 0,
    "alpha0.3_k10_res0.0001": 0,
    "alpha0.3_k10_res0.0002": 0,
    "alpha0.3_k10_res0.001": 0,  # premature ejaculation and sexual dysfunction are one topic; anxiety + panic too
    "alpha0.3_k10_res0.002": 1,  # good
    "alpha0.3_k10_res0.006": 1,
    "alpha0.3_k10_res0.01": 0,  # good illustration for granularity using 'suicide' (500 topics)
    "alpha0.3_k10_res0.02": 0,
    # K15
    "alpha0.3_k15_res1e-05": 0,
    "alpha0.3_k15_res2e-05": 0,
    "alpha0.3_k15_res0.0001": 0,
    "alpha0.3_k15_res0.0002": 0,
    "alpha0.3_k15_res0.001": 0,  # not bad, very broad and inclusive topics
    "alpha0.3_k15_res0.002": 0,  # good; sexual dysfunction and premature ejaculation are one topic
    "alpha0.3_k15_res0.006": 1,  # pretty good, but a bit too many clusters (too many pregnancy clusters, for example)
    "alpha0.3_k15_res0.01": 1,  # good illustration for granularity using 'suicide' (500 topics)
    "alpha0.3_k15_res0.02": 0,
    # K 20
    "alpha0.3_k20_res1e-05": 0,
    "alpha0.3_k20_res2e-05": 0,
    "alpha0.3_k20_res0.0001": 0,
    "alpha0.3_k20_res0.0002": 0,
    "alpha0.3_k20_res0.001": 0,
    "alpha0.3_k20_res0.002": 0,  # too few?
    "alpha0.3_k20_res0.006": 1,  # best?
    "alpha0.3_k20_res0.01": 1,  # too many? but good.
    "alpha0.3_k20_res0.02": 1,
}
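
The keys in this table have to match the keys generated by the f-string `f"{params}_res{resolution}"`, which is why small resolutions appear as `res1e-05` rather than `res0.00001`: Python prints floats below 0.0001 in scientific notation. A quick consistency check can catch hand-typed keys that never match. The sketch below uses a small hypothetical excerpt, not the full table; `make_key` and `keep_table` are illustrative names.

```python
def make_key(graph_name, resolution):
    """Build a key the same way the notebook's f-strings do."""
    return f"{graph_name}_res{resolution}"


resolution_values = [0.00001, 0.0001, 0.002, 0.006]
generated = {make_key("alpha0.3_k10", r) for r in resolution_values}

# Hypothetical excerpt of a keep/drop table, including one mistyped key
keep_table = {
    "alpha0.3_k10_res1e-05": 0,
    "alpha0.3_k10_res0.0001": 0,
    "alpha0.3_k10_res0.002": 1,
    "alpha0.3_k10_res0.00001": 0,  # never generated: the float prints as 1e-05
}

unmatched = sorted(k for k in keep_table if k not in generated)
unscored = sorted(k for k in generated if k not in keep_table)
print("keys in table but never generated:", unmatched)
print("generated keys without a score:", unscored)
```

Generated keys without an entry are treated as dropped when the table is read with `keepornot.get(k, 0)`.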

### Save limited parameter space to Excel


In [21]:
limited_parameter_data_dict = {
    k: v for k, v in partitions_explorer_dict.items() if keepornot.get(k, 0) == 1
}

save_graph_summary_to_excel(
    limited_parameter_data_dict, filename="LimitedParameterFinder_ClusterExplorer.xlsx"
)

# show summary
df_lim_param = pd.DataFrame(
    {"cl_nr": {k: len(v) for k, v in limited_parameter_data_dict.items()}}
)

# Reset the index to turn the keys into a column and then sort by 'cl_nr'
df_lim_param = (
    df_lim_param.reset_index()
    .rename(columns={"index": "params"})
    .sort_values(by="cl_nr")
)
df_lim_param

| | params | cl_nr |
|---|---|---|
| 0 | alpha0.3_k10_res0.002 | 155 |
| 4 | alpha0.3_k20_res0.006 | 190 |
| 8 | alpha0.5_k20_res0.006 | 231 |
| 2 | alpha0.3_k15_res0.006 | 241 |
| 7 | alpha0.5_k15_res0.006 | 298 |
| 5 | alpha0.3_k20_res0.01 | 299 |
| 1 | alpha0.3_k10_res0.006 | 362 |
| 9 | alpha0.5_k20_res0.01 | 375 |
| 3 | alpha0.3_k15_res0.01 | 379 |
| 6 | alpha0.3_k20_res0.02 | 563 |


In [22]:
def save_to_excel_with_hyperlinks(filepath, df_summary, comb_sheets):
    """
    Saves summary DataFrame and combined sheets to an Excel file,
    with hyperlinks from the combined sheets back to the summary sheet,
    and from the summary sheet to each combined sheet.

    Parameters:
    - filepath: The path to the Excel file to save.
    - df_summary: The summary DataFrame to save to the 'summary' sheet.
    - comb_sheets: A list of DataFrames to save to individual sheets, with hyperlinks back to the summary.
    """
    with pd.ExcelWriter(filepath, engine="openpyxl") as writer:
        # Write the summary sheet
        df_summary.to_excel(writer, sheet_name="summary", index=False)

        # Access the workbook through the writer.book attribute after writing
        workbook = writer.book

        # Iterate through comb_sheets and write each to a new sheet
        for i, s in enumerate(comb_sheets):
            sheet_name = f"cluster_{i}"
            s.to_excel(writer, sheet_name=sheet_name, index=False)

            # Access the current sheet
            current_sheet = workbook[sheet_name]

            # Find the last used row; the notes below are written a few rows beneath it
            lastrow = current_sheet.max_row

            # Add a hyperlink seven rows below the data to go back to the summary sheet
            current_sheet.cell(row=lastrow + 7, column=1).hyperlink = "#summary!A1"
            current_sheet.cell(row=lastrow + 7, column=1).value = "Back to Summary"
            current_sheet.cell(row=lastrow + 7, column=1).style = "Hyperlink"
            # Add nr pubs info
            current_sheet.cell(
                row=lastrow + 5, column=1
            ).value = f"NrPubs {df_summary.iloc[i, 1]}"
            # add time info
            current_sheet.cell(
                row=lastrow + 3, column=1
            ).value = f"Year(x,sd) m={df_summary.iloc[i, 2]} sd={df_summary.iloc[i, 3]}"

        # Update the summary sheet with hyperlinks to each cluster sheet
        summary_sheet = workbook["summary"]
        for row in range(2, summary_sheet.max_row + 1):
            cluster_num = summary_sheet.cell(row=row, column=1).value
            summary_sheet.cell(
                row=row, column=1
            ).hyperlink = f"#'cluster_{cluster_num}'!A1"
            summary_sheet.cell(row=row, column=1).style = "Hyperlink"

In [23]:
iterations = 20
quality_values = []
nr_clusters = []
partitions = []
graph_summary_df_dict = {}

for params in limited_parameter_data_dict.keys():
    G = params_graph_dict[params.split("_res")[0]]
    resolution = pd.to_numeric(params.split("res")[1])
    column_name = f"cluster_{params}"
    pc = PartitionCreator(G, df)
    pc.create_partition_from_cmpvertexpartition(
        n_iterations=iterations,
        resolution_parameter=resolution,
        verbose=False,
        cluster_colunm_name=column_name,
        centrality_column_name=f"centrality_{params}",
    )
    # Build two views of every cluster: the n most central papers ("_sorted")
    # and an unsorted selection ("_random"), then place them side by side.
    ce = CommunityExplorer(pc.df, column_name)
    ce.create_cluster_sheets(sort=True, n=10, sort_column=f"centrality_{params}")
    sorted_sheets = [s.reset_index(drop=True).add_suffix("_sorted") for s in ce.sheets]
    ce.create_cluster_sheets(sort=False, n=10, sort_column=f"centrality_{params}")
    unsorted_sheets = [
        s.reset_index(drop=True).add_suffix("_random") for s in ce.sheets
    ]
    comb_sheets = [
        pd.concat([s, u], axis=1) for s, u in zip(sorted_sheets, unsorted_sheets)
    ]
    ce.create_summary_sheet()
    df_summary = ce.df_summary
    graph_summary_df_dict[f"{params}"] = [df_summary, comb_sheets]
    # save to excel
    filepath = f"../output/tables/cluster-explorer/SingleSolExplorer_{params}.xlsx"
    save_to_excel_with_hyperlinks(filepath, df_summary, comb_sheets)
    print(f"Params: {params}; Nr of clusters: {len(pc.partition.sizes())}")
    print("#" * 50)

Params: alpha0.3_k10_res0.002; Nr of clusters: 155
##################################################
Params: alpha0.3_k10_res0.006; Nr of clusters: 362
##################################################
Params: alpha0.3_k15_res0.006; Nr of clusters: 241
##################################################
Params: alpha0.3_k15_res0.01; Nr of clusters: 379
##################################################
Params: alpha0.3_k20_res0.006; Nr of clusters: 190
##################################################
Params: alpha0.3_k20_res0.01; Nr of clusters: 299
##################################################
Params: alpha0.3_k20_res0.02; Nr of clusters: 563
##################################################
Params: alpha0.5_k15_res0.006; Nr of clusters: 298
##################################################
Params: alpha0.5_k20_res0.006; Nr of clusters: 231
##################################################
Params: alpha0.5_k20_res0.01; Nr of clusters: 375
###################################

# Explore based on Decision Criteria


In [24]:
def cluster_coherence_analyzer(
    analysis_data_dict,
    print_n_random_titles,
    word1,
    word2,
    print_word_string=True,
    save_to_file=True,
    file_name="output/tables/cluster-explorer/ParamsTest/cluster_analysis_output.txt",
    include_all_params_status=True,  # New argument to control the inclusion of all params status
):
    """
    Analyzes clusters for coherence based on the presence of specific words, prints details about matching clusters,
    highlights if multiple hits are found within the same parameters, and optionally saves the output to a text file.

    Args:
    - analysis_data_dict (dict): Dictionary containing analysis data with keys as parameters (alpha, k, res) and values
                                 as a list containing the summary DataFrame and a list of DataFrames for each cluster.
    - print_n_random_titles (int): Number of random titles to print from matching clusters.
    - word1 (str), word2 (str): Substrings to search for within the cluster's topic words; both must be present.
                                An empty string always matches, so word2="" searches for word1 only.
    - print_word_string (bool): Whether to print the concatenated string of topic words.
    - save_to_file (bool): Whether to save the printed output to a text file (an existing file is overwritten).
    - file_name (str): Name of the file to save the output to, if save_to_file is True.
    - include_all_params_status (bool): Whether to list the parameter sets that produced no hits.
    """
    no_hit_params = []
    output = []  # List to capture the output
    output.append(f"Words: {word1.upper()} {word2.upper()}\n")

    for params, (summary, titles_df_list) in analysis_data_dict.items():
        hits_per_param = 0  # Counter for hits within the same parameters
        nr_of_clusters = len(titles_df_list)

        for i, row in summary.iterrows():
            Nr_of_Pubs = row["Nr of Pubs"]
            # Topic words start at the fifth column; access them by position explicitly
            row_words_str = " ".join([str(value) for value in row.iloc[4:]])
            if word1 in row_words_str and word2 in row_words_str:
                if hits_per_param == 0:
                    output.append("=" * 130)
                hits_per_param += 1  # Increment the hit counter
                output.append(
                    f"Found in {params};    Total Clusters: {nr_of_clusters};     Cluster: {row.iloc[0]};      Nr of Pubs in Cluster: {Nr_of_Pubs}"
                )
                if print_word_string:
                    output.append(row_words_str)
                if print_n_random_titles != 0:
                    output.append("RANDOM TITLES")
                    titles = titles_df_list[i]["title_random"].head(
                        print_n_random_titles
                    )
                    output.extend(titles.tolist())
                # Separator line after every hit
                output.append("- " * 65)

        if hits_per_param > 1:  # Check if there were multiple hits for the parameters
            output.append(f"** Multiple hits found in parameters: {params} **\n")
        elif hits_per_param == 0 and include_all_params_status:
            no_hit_params.append(params)

    output.append(f"No hits found for Parameters: \n{set(no_hit_params)}\n")

    # Convert list to string
    output_str = "\n".join(output)

    # Print the output to console
    print(output_str)

    # Optionally save to file
    if save_to_file:
        with open(file_name, "w") as file:
            file.write(output_str)
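
The matching rule used by the analyzer is plain substring search on the joined topic words. A stripped-down sketch of just that rule (the helper `matches` is hypothetical and not part of the notebook) shows why short stems such as "ejac" work and why an empty second word turns the check into a single-word search:

```python
def matches(topic_words, word1, word2=""):
    """True if both search strings occur as substrings of the joined topic words."""
    joined = " ".join(str(w) for w in topic_words)
    return word1 in joined and word2 in joined


words = ["sexual dysfunction", "ejaculatory", "paroxetine induced"]

print(matches(words, "ejac", "dysf"))  # both stems are present
print(matches(words, "covid", ""))     # word1 missing, so no hit
print(matches(words, "parox", ""))     # empty word2 always matches
```

Because the search is case-sensitive and works on substrings, a stem such as "ht" would also hit unrelated words that merely contain those letters.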

### Decision Criteria:

1. Premature ejaculation and sexual dysfunction should be distinct clusters.
2. Are anxiety and panic disorder mixed?
3. Diabetic mice vs. body weight and diabetes vs. depression treatment in diabetics?
4. Use alpha = 0.3 (look at alpha0.5_k10_res0.002): topics 50 and 133 have very mixed words and little coherence. This is not the case for alpha0.3_k10_res0.002 (there, however, sexual dysfunction and premature ejaculation form a single topic).
5. Zimeldine (one or two topics?)
6. Pregnancy-related topics.


In [25]:
analysis_data_dict = graph_summary_df_dict
print_n_random_titles = 5
word1 = "ejac"
word2 = "dysf"
print_word_string = True
save_to_file = True
file_name = "../output/tables/cluster-explorer/ParamsTest/cluster_analysis_output.txt"

cluster_coherence_analyzer(
    analysis_data_dict=analysis_data_dict,
    print_n_random_titles=print_n_random_titles,
    word1=word1,
    word2=word2,
    print_word_string=print_word_string,
    save_to_file=save_to_file,
    file_name=file_name,
)

Words: EJAC DYSF

Found in alpha0.3_k15_res0.01;    Total Clusters: 379;     Cluster: 328;      Nr of Pubs in Cluster: 24
 sexual paroxetine male rats male rats irisin dysfunction sexual dysfunction induced ethyl acetate fraction ethyl acetate acetate fraction acetate ethyl fraction levels frequency latency ejaculatory paroxetine induced
RANDOM TITLES
Management of drug induced sexual dysfunction in male rats by ethyl acetate fraction of onion
Irisin ameliorates male sexual dysfunction in paroxetine-treated male rats
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
No hits found for Parameters: 
{'alpha0.3_k15_res0.006', 'alpha0.5_k15_res0.006', 'alpha0.3_k10_res0.002', 'alpha0.5_k20_res0.01', 'alpha0.3_k20_res0.006', 'alpha0.3_k10_res0.006', 'alpha0.3_k20_res0.01', 'alpha0.5_k20_res0.006', 'alpha0.3_k20_res0.02'}



In [26]:
analysis_data_dict = graph_summary_df_dict
print_n_random_titles = 10
word1 = "covid"
word2 = ""
print_word_string = True
save_to_file = True
file_name = "../output/tables/cluster-explorer/ParamsTest/cluster_analysis_output.txt"

cluster_coherence_analyzer(
    analysis_data_dict=analysis_data_dict,
    print_n_random_titles=print_n_random_titles,
    word1=word1,
    word2=word2,
    print_word_string=print_word_string,
    save_to_file=save_to_file,
    file_name=file_name,
)

Words: COVID 

Found in alpha0.3_k10_res0.002;    Total Clusters: 155;     Cluster: 121;      Nr of Pubs in Cluster: 65
 covid patients serotonin fluvoxamine serotonin reuptake inhibitors inhibitors reuptake inhibitors fam lpla2 serotonin reuptake reuptake treatment grapefruit juice fluvoxamine juice fluvoxamine interaction risky juice fluvoxamine fluvoxamine interaction risky juice fluvoxamine interaction grapefruit
RANDOM TITLES
Re: ‘efficacy and safety of selective serotonin reuptake inhibitors in COVID-19 management: a systematic review and meta-analysis’ by Deng et al.
Grapefruit juice-fluvoxamine interaction: Is it risky or not? [7]
Can selective serotonin reuptake inhibitors have a neuroprotective effect during COVID-19?
Neuropsychiatric Drugs Against COVID-19: What is the Clinical Evidence?
A randomized multicenter clinical trial to evaluate the efficacy of melatonin in the prophylaxis of SARS-CoV-2 infection in high-risk contacts (MeCOVID Trial): A structured summary of a stud

In [27]:
# SUICIDE
params_scoring_suicide = {
    "alpha0.3_k10_res0.002": "1",
    "alpha0.3_k10_res0.006": "0 (too many suicide clusters, very inconsistent)",
    "alpha0.3_k15_res0.006": "1 (very clear single suicide cluster)",
    "alpha0.3_k15_res0.01": "0 (conflates overdoses and suicide; too many clusters on it; 1 clear)",
    "alpha0.3_k20_res0.006": "1 (one good suicide, one unrelated)",
    "alpha0.3_k20_res0.01": "1 (two clear suicide clusters; should be same)",
    "alpha0.3_k20_res0.02": "0 (akathisia cluster; borderline cluster; no clear bigger suicide cluster)",
    "alpha0.5_k15_res0.006": "0 no good suicide cluster",
    "alpha0.5_k20_res0.006": "1 (one clear suicide cluster, one overdose/drug)",
    "alpha0.5_k20_res0.01": "0 first biggest cluster is very bad",
}
# pregnancy
params_scoring_pregnancy = {
    "alpha0.3_k10_res0.002": "1 (single cluster, coherent)",
    "alpha0.3_k10_res0.006": "0 (pharma pregnancy, narcolepsy, lactation, neonatal, TOO MANY)",
    "alpha0.3_k15_res0.006": "0 (4 clusters, not one clear pregnancy cluster)",
    "alpha0.3_k15_res0.01": "0 (1 big trash topic; pretty bad)",
    "alpha0.3_k20_res0.006": "0 (one small pregnancy cluster)",
    "alpha0.3_k20_res0.01": "1 (bit too split up, but sensical, [pretty good])",
    "alpha0.3_k20_res0.02": "0 (creation seems weird.)",
    "alpha0.5_k15_res0.006": "0 (biggest cluster sucks. creation went weird)",
    "alpha0.5_k20_res0.006": "0 (first biggest cluster is weird)",
    "alpha0.5_k20_res0.01": "0 (small cluster, seem sensical tho)",
}
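
Since every note starts with a 0/1 flag, the two scoring tables can be tallied mechanically. The sketch below is only a bookkeeping aid under that assumption: it uses a shortened excerpt of the tables with abbreviated notes, the helper `passes` is hypothetical, and it does not replace the qualitative judgement used for the final selection.

```python
def passes(score_note):
    """A note counts as a pass if it starts with the flag '1'."""
    return score_note.strip().startswith("1")


# Shortened excerpts of the two scoring tables (notes abbreviated)
suicide = {
    "alpha0.3_k10_res0.002": "1",
    "alpha0.3_k15_res0.006": "1 (very clear single suicide cluster)",
    "alpha0.3_k20_res0.01": "1 (two clear suicide clusters)",
    "alpha0.5_k20_res0.01": "0 first biggest cluster is very bad",
}
pregnancy = {
    "alpha0.3_k10_res0.002": "1 (single cluster, coherent)",
    "alpha0.3_k15_res0.006": "0 (4 clusters)",
    "alpha0.3_k20_res0.01": "1 (bit too split up, but sensical)",
    "alpha0.5_k20_res0.01": "0 (small cluster)",
}

# Keep the parameter sets that pass on both criteria
selected = sorted(
    params
    for params in suicide
    if passes(suicide[params]) and passes(pregnancy.get(params, "0"))
)
print(selected)
```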

In [28]:
iterations = 50
quality_values = []
nr_clusters = []
partitions = []
graph_summary_df_dict = {}

last_selection_params = [
    "alpha0.3_k10_res0.002",
    "alpha0.3_k20_res0.01",
    "alpha0.3_k20_res0.006",
]


for params in last_selection_params:
    G = params_graph_dict[params.split("_res")[0]]
    resolution = pd.to_numeric(params.split("res")[1])
    column_name = f"cluster_{params}"
    pc = PartitionCreator(G, df)
    pc.create_partition_from_cmpvertexpartition(
        n_iterations=iterations,
        resolution_parameter=resolution,
        verbose=False,
        cluster_colunm_name=column_name,
        centrality_column_name=f"centrality_{params}",
    )
    # Build two views of every cluster: the n most central papers ("_sorted")
    # and an unsorted selection ("_random"), then place them side by side.
    ce = CommunityExplorer(pc.df, column_name)
    ce.create_cluster_sheets(sort=True, n=25, sort_column=f"centrality_{params}")
    sorted_sheets = [s.reset_index(drop=True).add_suffix("_sorted") for s in ce.sheets]
    ce.create_cluster_sheets(sort=False, n=25, sort_column=f"centrality_{params}")
    unsorted_sheets = [
        s.reset_index(drop=True).add_suffix("_random") for s in ce.sheets
    ]
    comb_sheets = [
        pd.concat([s, u], axis=1) for s, u in zip(sorted_sheets, unsorted_sheets)
    ]
    ce.create_summary_sheet()
    df_summary = ce.df_summary
    graph_summary_df_dict[f"{params}"] = [df_summary, comb_sheets]
    # save to excel
    filepath = (
        f"../output/tables/cluster-explorer/FinalSelect/SingleSolExplorer_{params}.xlsx"
    )
    save_to_excel_with_hyperlinks(filepath, df_summary, comb_sheets)
    print(f"Params: {params}; Nr of clusters: {len(pc.partition.sizes())}")
    print("#" * 50)

Params: alpha0.3_k10_res0.002; Nr of clusters: 154
##################################################
Params: alpha0.3_k20_res0.01; Nr of clusters: 299
##################################################
Params: alpha0.3_k20_res0.006; Nr of clusters: 190
##################################################


# FINAL SELECTION


## alpha0.3_k10_res0.002; Nr of clusters: 154

1. "alpha0.3_k10_res0.002": "1 - suicide",
2. "alpha0.3_k10_res0.002": "1 (single cluster, coherent) - pregnancy",

##### EVAL TOP 6 CLUSTERS

0. Very clear memory cluster; pharma and not; 1199 pubs
1. Receptor binding; brain; guess it's good
2. QTc prolongation; cardiovascular; good
3. Astrocytes; pharma; good
4. Binding; platelets; check difference to 2; (kinda good)
5. Pulmonary hypertension; high blood pressure; (kinda good)

## alpha0.3_k20_res0.01; Nr of clusters: 299

1. "alpha0.3_k20_res0.01": "1 (two clear suicide clusters; should be same) - suicide",
2. "alpha0.3_k20_res0.01": "1 (bit too split up, but sensical, [pretty good]) - pregnancy",
3. Good elderly cluster
4. Good Fall cluster (in elderly)
5. Good single seasonal cluster
6. good withdrawal/discontinuation cluster

##### EVAL TOP 6 CLUSTERS

0. Learning; memory; hippocampus; fear (avoidance learning); OK, good
1. Prolactin; pharmacology; OK, good
2. Cardiac; arrhythmia; QT prolongation (very good)
3. Pregnancy; placenta; a pregnancy pharma cluster
4. SSRI mechanisms; pharmacology; depression treatment
5. Pharmacology; 5-HT; receptor; SERT (serotonin transporter)

## alpha0.3_k20_res0.006; Nr of clusters: 190

1. "alpha0.3_k20_res0.01": "1 (two clear suicide clusters; should be same) - suicide",
2. "alpha0.3_k20_res0.01": "1 (bit too split up, but sensical, [pretty good]) - pregnancy",

##### EVAL TOP 5 CLUSTERS

0. Memory; cognition; learning
1. Receptor; animal studies; in vivo; autoreceptor
2. Cardiac; QT prolongation
3. Astrocytes; pharmacology
4. Pharmacological SSRI properties; 5-HT; receptors
5. Vascular; rat; unclear


In [29]:
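def process_given_labels(df_dict, params, words_to_remove):
    """Build a "Given Label" word list for every cluster in the summary sheet.

    df_dict[params][0] is the summary DataFrame; its "Word_*" columns hold the
    words describing each cluster. The frame is modified in place and df_dict
    is returned.
    """
    summary = df_dict[params][0]
    cols = [col for col in summary.columns if col.startswith("Word_")]
    # Deduplicate the words of each row (note: set order is arbitrary)
    summary["Given Label"] = summary[cols].apply(
        lambda row: list(set(row.values)), axis=1
    )
    givenlabel = []
    for label in summary["Given Label"]:
        # Take at most 10 words, then drop the uninformative ones
        label = [word for word in label[:10] if word not in words_to_remove]
        givenlabel.append(label)
        print(label)
    summary["Given Label"] = givenlabel

    return df_dict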
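
Because `list(set(...))` returns the words in arbitrary order, the `label[:10]` cut in the function above can keep different words from run to run. As a hedged sketch of one alternative, the hypothetical helper `build_label` below deduplicates in first-seen order with `dict.fromkeys`. Unlike the function above, it removes the unwanted words first and truncates afterwards, so its output is not identical to the printed labels.

```python
def build_label(words, words_to_remove, max_words=10):
    """Deduplicate words in first-seen order, drop unwanted words, truncate."""
    blocked = set(words_to_remove)
    unique = dict.fromkeys(words)  # dicts preserve insertion order
    kept = [word for word in unique if word not in blocked]
    return kept[:max_words]


# Toy row for illustration; in the notebook this would be one row of Word_* values
row_words = ["qt prolongation", "cardiac", "depression", "cardiac", "interval"]
label = build_label(row_words, ["depression", "effects"])
print(label)
```

This keeps the column order of the `Word_*` columns, which is useful if those columns are already ranked.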

In [30]:
params = "alpha0.3_k10_res0.002"
words_to_remove = [
    "depression",
    "fluoxetine",
    "effects",
    "effect",
    "escitalopram",
    "paroxetine",
    "fluvoxamine",
    "citalopram",
]

df_dict = process_given_labels(graph_summary_df_dict, params, words_to_remove)

['mg', 'amitriptyline', 'performance', 'healthy', 'learning', 'patients', 'sleep', 'ht']
['opioid', 'antidepressant', '5ht', 'brain', 'ne', 'pindolol', 'serotonin', 'ht 1a', 'ht']
['qt prolongation', 'mg', 'prolongation', 'cardiac', 'interval', 'qtc interval', 'antidepressants']
['glial', 'glioma', 'protein', 'antidepressant', '5ht', 'brain', 'serotonin', 'ht', 'activity']
['patients', 'dat', 'brain', 'serotonin', 'controls', 'ht', 'transporter']
['tachycardia', 'pulmonary', 'induced', 'ph', 'pulmonary arterial', 'epididymis', 'sperm', 'rats', 'serotonin']
['amitriptyline', 'blind', 'depressive', 'patients', 'major depressive', 'major', 'study']
['release', 'fenfluramine', 'behavioral', 'amphetamine', 'raphe', 'rats', '5ht', 'brain', 'serotonin']
['depressed', 'patients', 'efficacy', 'reuptake inhibitors', 'norepinephrine', 'serotonin', 'ht', 'inhibitor', 'pharmacological']
['magnesium', 'eeg', 'model', 'patients', 'mgkg', 'antidepressant', 'rats', 'ht']
['prolactin', 'fenfluramine', '

In [39]:
"""
This code chunk performs community detection on a graph using different parameter configurations. It iterates over a list of parameter strings, extracts the necessary values from each string, and creates a partition based on the graph and the extracted parameters. It then drops unnecessary columns from the partition's dataframe, adds cluster labels to the graph, and saves the resulting graph as a GraphML file. Finally, it saves the partition's dataframe as a pickle file.
"""

final_params_list = [
    "alpha0.3_k10_res0.002",
    "alpha0.3_k20_res0.01",
    "alpha0.3_k20_res0.006",
]
iterations = 50

for params in final_params_list:
    G = params_graph_dict[params.split("_res")[0]]
    resolution = pd.to_numeric(params.split("res")[1])
    column_name = f"cluster_{params}"
    pc = PartitionCreator(G, df)
    pc.create_partition_from_cmpvertexpartition(
        n_iterations=iterations,
        resolution_parameter=resolution,
        verbose=False,
        cluster_colunm_name=column_name,
        centrality_column_name=f"centrality_{params}",
    )
    path = f"../data/06-clustered-df/{params}.pkl"
    # drop all cluster/centrality columns that don't contain params
    cols_to_drop = [
        col for col in pc.df.columns if "centrality" in col or "cluster" in col
    ]
    # the original check (params not in final_params_list) was always False,
    # so nothing was ever dropped
    cols_to_drop = [col for col in cols_to_drop if params not in col]
    pc.df.drop(cols_to_drop, axis=1, inplace=True)
    # add cluster label to graph
    eid_cluster_dict = dict(zip(pc.df["eid"], pc.df[f"cluster_{params}"]))
    eid_centrality_dict = dict(zip(pc.df["eid"], pc.df[f"centrality_{params}"]))
    G.vs["cluster"] = [eid_cluster_dict[eid] for eid in G.vs["eid"]]
    G.vs["cluster_ev_centrality"] = [eid_centrality_dict[eid] for eid in G.vs["eid"]]
    # save graphml
    G.write_graphml(f"../data/07-clustered-graphs/{params}.graphml")

    # save the dataframe of this parameter set (path is built per iteration)
    pc.df.to_pickle(path)
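
The loop above recovers the graph key and the resolution from the parameter string with two separate `split` calls. A minimal sketch of a stricter parser is shown below, assuming every parameter string follows the `alpha<float>_k<int>_res<float>` pattern seen in `final_params_list`. The names `PARAM_PATTERN` and `parse_params` are hypothetical and not part of the notebook.

```python
import re

# Assumed naming scheme, taken from the strings in final_params_list
PARAM_PATTERN = re.compile(
    r"^alpha(?P<alpha>[\d.]+)_k(?P<k>\d+)_res(?P<res>[\d.]+)$"
)


def parse_params(params):
    """Split a parameter string into its graph key and numeric components."""
    match = PARAM_PATTERN.match(params)
    if match is None:
        raise ValueError(f"Unexpected parameter string: {params!r}")
    return {
        # Same key as params.split("_res")[0] in the loop above
        "graph_key": params.rsplit("_res", 1)[0],
        "alpha": float(match["alpha"]),
        "k": int(match["k"]),
        "resolution": float(match["res"]),
    }


parsed = parse_params("alpha0.3_k20_res0.006")
print(parsed)
```

Failing loudly on a malformed string avoids silently running the partition with a wrong resolution value.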

In [None]:
import networkx as nx

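# G is an igraph graph at this point (it is used via G.vs above), so convert
# it to a networkx graph before using the networkx node API below
Gpajek = G.to_networkx()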

# Remove all attributes except 'eid' from each node
for node, data in Gpajek.nodes(data=True):
    keys_to_remove = [k for k in data if k != "eid"]  # Collect keys to remove
    for k in keys_to_remove:
        data.pop(k, None)  # Remove the key

# Now, convert all remaining node attributes to strings
for node, data in Gpajek.nodes(data=True):
    for k, v in data.items():
        data[k] = str(v)  # Convert attribute to string

# Gpajek is now modified with only 'eid' attributes retained and converted to string
# You can now save Gpajek to the Pajek format
nx.write_pajek(Gpajek, "../data/07-clustered-graphs/alpha0.3_k20_res0.006.net")

# NOW WE FOLLOW WITH MAIN PATH
