**Computational Health Laboratory Project, A.Y. 2021/2022**

**Authors:** Niko Dalla Noce, Alessandro Ristori, Andrea Zuppolini

**Project:** Starting from one or more genes, extract from interaction databases the genes they interact with. Using the expanded gene set, perform pathway analysis and obtain all disease pathways in which the genes appear. Merge the pathways to obtain a larger graph. Perform further network analysis to extract central biomarkers and communities beyond pathways. Compute a distance between the initial gene set and the various pathways (diseases).

# **CHL Project, Community Analysis**

## **Colab setup**
Takes care of the project setup on Colab.

In [1]:
if 'google.colab' in str(get_ipython()):
    import subprocess
    out_clone = subprocess.run(["git", "clone", "https://github.com/nikodallanoce/ComputationalHealthLaboratory"], text=True, capture_output=True)
    print("{0}{1}".format(out_clone.stdout, out_clone.stderr))
    %cd ComputationalHealthLaboratory

## **Community analysis**
Find the communities inside the graph and examine how the community that contains the starting gene is connected to the various diseases.

In [2]:
import pandas as pd
import networkx as nx

def intersection(lst1, lst2):
    inters = list()
    if not (len(lst1) == 0 or len(lst2) == 0):
        set1 = set(lst1)
        inters = [elem for elem in lst2 if elem in set1]
    return inters

We assume that you have already run pathway enrichment and network analysis in the notebooks **0_Pathway_Enrichment** and **1_Network_Analysis**, so all the datasets needed here are available. If so, load everything.

In [3]:
nodes = pd.read_csv("datasets/genes.csv", sep=",", index_col=0)["0"]
df_diseases = pd.read_csv("datasets/diseases_pathways.csv", sep=",", index_col=0)
interactions = pd.read_csv("datasets/interactions.csv", sep=",", index_col=0)
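# nx.read_gpickle was removed in NetworkX 3.0; a .gpickle file is a plain pickle, so the standard library can load it
import pickle
with open("datasets/protein_graph.gpickle", "rb") as graph_file:
    protein_graph = pickle.load(graph_file)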
diseases = dict()
for i, disease in df_diseases.iterrows():
    disease_genes = disease['Genes'].split(";")
    term = disease['Term']
    diseases[i] = {"name": term, "genes": disease_genes}

Find all the communities with the Louvain method (beware that it is not deterministic, so the result can vary between runs) and discard those with only one node.

In [4]:
louvain_communities = list(nx.algorithms.community.louvain_communities(protein_graph))
communities = [community for community in louvain_communities if len(community) > 1]  # Discard those communities with one node

In [5]:
print("Number of communities: {0}".format(len(communities)))

Number of communities: 9
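
Because Louvain is not deterministic, the number of communities printed above can change from run to run. The following is a minimal sketch of one way to make the partition reproducible, assuming NetworkX 2.7 or later, where `louvain_communities` accepts a `seed` argument. The barbell graph, the helper `seeded_louvain` and the seed value 42 are illustrative stand-ins, not the project's data or settings.

```python
import networkx as nx

# Toy graph (illustrative stand-in for protein_graph): two 5-node cliques joined by one edge
toy_graph = nx.barbell_graph(5, 0)

def seeded_louvain(graph: nx.Graph, seed: int) -> list:
    """Run Louvain with a fixed seed and return the communities in a canonical order."""
    found = nx.algorithms.community.louvain_communities(graph, seed=seed)
    return sorted(sorted(community) for community in found)

first_run = seeded_louvain(toy_graph, seed=42)
second_run = seeded_louvain(toy_graph, seed=42)
print(first_run == second_run)  # same seed, same partition
```

Fixing the seed only makes a single run repeatable; it does not say whether the partition is stable, so comparing a few different seeds may still be worthwhile.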


Compute the mean number of nodes in the communities that were kept after pruning by size.

In [6]:
def mean_size_communities(communities: list) -> float:
    mean_size = 0
    for community in communities:
        mean_size += len(community)

    mean_size /= len(communities)
    return mean_size

In [7]:
print("Mean size of communities: {0}".format(str(mean_size_communities(communities))))

Mean size of communities: 681.1111111111111


Compute the number of diseases in a community.

In [8]:
def diseases_in_community(protein_graph: nx.Graph, community: set) -> set:
    diseases_community = set()
    for protein in list(community):
        diseases_protein = protein_graph.nodes[protein]["diseases"]
        diseases_community.update(diseases_protein)

    return diseases_community

In [9]:
diseases_comm = diseases_in_community(protein_graph, communities[5])
print("Number of diseases in community {0}: {1}".format(5, len(diseases_comm)))

Number of diseases in community 5: 586


Let's also see how many diseases, on average, were linked to the communities with only one node.

In [10]:
def mean_diseases_communities_size_n(communities: list, protein_graph: nx.Graph, n: int = 1) -> float:
    mean_diseases = 0
    n_size_commmunities = 0
    for community in communities:
        community = list(community)
        if len(community) == n:
            protein = community[0]
            protein_diseases = protein_graph.nodes[protein]['diseases']
            n_size_commmunities += 1
            mean_diseases += len(protein_diseases)

    if n_size_commmunities == 0:
        print("There are no communities with {0} nodes".format(n))
    else:
        mean_diseases /= n_size_commmunities

    return mean_diseases

In [11]:
mean_diseases_one_node = mean_diseases_communities_size_n(louvain_communities, protein_graph)
print("Mean diseases for those communities with one node: {0}".format(str(mean_diseases_one_node)))

Mean diseases for those communities with one node: 2.2324127906976745


Finally, compute the mean number of genes per disease.

In [12]:
def mean_genes_diseases(diseases: dict) -> float:
    mean_size = 0
    for _, disease in diseases.items():
        mean_size += len(disease['genes'])

    mean_size /= len(diseases.keys())
    return mean_size

In [13]:
print("Mean number of genes for disease: {0}".format(mean_genes_diseases(diseases)))

Mean number of genes for disease: 246.55857385398983


Let's check that the communities are pairwise disjoint, i.e. that no node belongs to more than one community.

In [14]:
def are_communities_distinct(communities: list) -> bool:
    for i, first_community in enumerate(communities):
        for j in range(i+1, len(communities)):
            second_community = communities[j]
            if len(intersection(first_community, second_community))>0:
                return False

    return True

In [15]:
are_communities_distinct(communities)

True
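
The same pairwise check can be written with the standard library alone. This is a minimal sketch, assuming each community is a set (as returned by `louvain_communities`); `are_disjoint` and the toy communities are hypothetical names and data used only for illustration.

```python
from itertools import combinations

def are_disjoint(communities: list) -> bool:
    """Return True when no node belongs to more than one community."""
    return all(set(first).isdisjoint(second)
               for first, second in combinations(communities, 2))

toy_communities = [{"A", "B"}, {"C"}, {"D", "E"}]
print(are_disjoint(toy_communities))                 # True
print(are_disjoint(toy_communities + [{"E", "F"}]))  # False, "E" appears twice
```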

Compute how many proteins are shared between each community and disease.

In [16]:
def communities_metrics(communities: list, diseases: dict) -> pd.DataFrame:
    df_ranks = list()
    for i, community in enumerate(communities):
        tot_genes = dict()
        shared_genes = dict()
        for k, disease in diseases.items():
            genes = disease['genes']
            shared_genes_community = intersection(genes, community)
            tot_genes[k] = len(genes)
            shared_genes[k] = len(shared_genes_community)

        # Iterate over the disease keys directly instead of assuming they are 0..n-1
        for j, n_genes in tot_genes.items():
            n_shared_genes = shared_genes[j]
            if n_shared_genes > 1:
                df_ranks.append({"Community": i, "Disease": diseases[j]['name'], "Shared genes": n_shared_genes,
                                 "Disease genes": n_genes, "Community size": len(community)})

    df_ranks = pd.DataFrame(df_ranks)

    # Ratio of the shared genes (between community and disease pathway) and the number of genes in the disease
    df_ranks["Ratio disease"] = df_ranks['Shared genes'] / df_ranks['Disease genes']

    # Ratio of the shared genes (between community and disease pathway) and the size of the community
    df_ranks['Ratio community'] = df_ranks["Shared genes"] / df_ranks["Community size"]

    # Relevance of each community-disease pair: the product of the two ratios above
    df_ranks["Relevance"] = df_ranks["Ratio disease"] * df_ranks["Ratio community"]
    return df_ranks

In [17]:
communities_rank = communities_metrics(communities, diseases)
communities_rank[communities_rank["Disease"]=="Tooth size discrepancy"]

Unnamed: 0,Community,Disease,Shared genes,Disease genes,Community size,Ratio disease,Ratio community,Relevance
518,0,Tooth size discrepancy,2,47,495,0.042553,0.00404,0.000172
1079,1,Tooth size discrepancy,4,47,458,0.085106,0.008734,0.000743
1628,2,Tooth size discrepancy,3,47,900,0.06383,0.003333,0.000213
2182,3,Tooth size discrepancy,5,47,675,0.106383,0.007407,0.000788
2736,4,Tooth size discrepancy,9,47,411,0.191489,0.021898,0.004193
3312,5,Tooth size discrepancy,8,47,1077,0.170213,0.007428,0.001264
3868,6,Tooth size discrepancy,4,47,836,0.085106,0.004785,0.000407
4438,7,Tooth size discrepancy,5,47,1112,0.106383,0.004496,0.000478
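
As a hand check of the metrics, the row for community 4 above can be recomputed from its three counts. This is a minimal sketch; `relevance` is a hypothetical helper that mirrors the column definitions in `communities_metrics`.

```python
def relevance(shared_genes: int, disease_genes: int, community_size: int) -> tuple:
    """Return (ratio disease, ratio community, relevance) for one community-disease pair."""
    ratio_disease = shared_genes / disease_genes
    ratio_community = shared_genes / community_size
    return ratio_disease, ratio_community, ratio_disease * ratio_community

# Counts from the community 4 row: 9 shared genes, 47 disease genes, 411 nodes
ratio_disease, ratio_community, score = relevance(9, 47, 411)
print(round(ratio_disease, 6), round(ratio_community, 6), round(score, 6))
# 0.191489 0.021898 0.004193
```

Since the product equals the squared number of shared genes divided by (disease genes × community size), the score is high only when the overlap is large relative to both the disease and the community.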


In [18]:
def look_for_gene_community(protein: str, communities: list) -> int:
    for i, community in enumerate(communities):
        if protein in community:
            return i

    return -1

In [19]:
protein_community = "SON"
gene_community = look_for_gene_community(protein_community, communities)
if gene_community == -1:
    raise Exception("The gene {0} is not in any of the communities".format(protein_community))
else:
    print("The gene {0} is in community {1}".format(protein_community, gene_community))

The gene SON is in community 5
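
When many genes need to be located, scanning every community for each lookup becomes repetitive. The following is a minimal sketch of an alternative, assuming the communities are disjoint as checked earlier; `build_gene_index` and the toy communities are hypothetical, and the gene names in them are placeholders rather than real memberships.

```python
def build_gene_index(communities: list) -> dict:
    """Map each gene to the index of the community that contains it."""
    return {gene: i for i, community in enumerate(communities) for gene in community}

toy_communities = [{"GENE_A", "GENE_B"}, {"GENE_C"}]
gene_index = build_gene_index(toy_communities)
print(gene_index.get("GENE_C", -1))  # 1
print(gene_index.get("GENE_X", -1))  # -1, same "not found" convention as look_for_gene_community
```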


Knowing which community our starting gene belongs to, we can retrieve all the diseases linked to that community and rank them by their relevance.

In [20]:
# Retrieve the diseases linked to the community that contains the gene, keeping those that share more than ten genes with it
disease_rank = (
    communities_rank[(communities_rank["Community"] == gene_community) & (communities_rank["Shared genes"] > 10)]
    .sort_values(by="Relevance", ascending=False)
    .drop(["Community", "Shared genes", "Disease genes", "Community size"], axis=1)
)

In [21]:
disease_rank.tail()

Unnamed: 0,Disease,Ratio disease,Ratio community,Relevance
3224,Renal fibrosis,0.08,0.011142,0.000891
2913,Disseminated Malignant Neoplasm,0.077922,0.011142,0.000868
3035,Prostatic Intraepithelial Neoplasias,0.081481,0.010214,0.000832
2889,Noninfiltrating Intraductal Carcinoma,0.063348,0.012999,0.000823
3096,oligodendroglioma,0.078571,0.010214,0.000802


In [22]:
disease_rank.to_csv("datasets/community_diseases_rank.csv")