**Computational Health Laboratory Project, A.Y. 2021/2022**

**Authors:** Niko Dalla Noce, Alessandro Ristori, Andrea Zuppolini

# **CHL Project, Pathway Analysis**
Starting from one or more genes, extract from interaction databases the genes they interact with. Using the expanded gene set, perform pathway analysis and obtain all disease pathways in which the genes appear. Merge the pathways to obtain a larger graph. Perform further network analysis to extract central biomarkers and communities beyond pathways. Compute a distance between the initial gene set and the various pathways (diseases).

## **Colab setup**
Takes care of the project setup on Colab.

In [1]:
if 'google.colab' in str(get_ipython()):
    import subprocess
    from google.colab import drive
    out_clone = subprocess.run(["git", "clone", "https://github.com/nikodallanoce/ComputationalHealthLaboratory"], text=True, capture_output=True)
    print("{0}{1}".format(out_clone.stdout, out_clone.stderr))
    %pip install -U PyYAML
    %pip install gseapy
    drive.mount("/content/drive/")
    %cp "/content/drive/Shareddrives/CHL/config.yml" "/content/ComputationalHealthLaboratory"
    %cd ComputationalHealthLaboratory

## **Obtain all the genes that interact with the starting one**
Starting from a gene, obtain its neighbours and the interactions between them.


In [2]:
import requests
import pandas as pd
import numpy as np
from config import ACCESS_KEY, BASE_URL

In [3]:
gene_interactions = pd.read_csv("datasets/geneset.csv", sep=";")
gene_interactions["InteractorA"] = gene_interactions["InteractorA"].str.upper()
gene_interactions.drop_duplicates(inplace=True)
proteins_list = list(gene_interactions["InteractorA"])  # all the proteins that interact with our starting gene

In [4]:
gene_interactions.tail()

Unnamed: 0,InteractorA,InteractorB
149,NSP8,SON
150,NSP9,SON
151,ORF6,SON
152,ORF8,SON
153,CCNF,SON


## **Expand the interactions dataset**
Expand the dataset using the proteins obtained from the previous step.

In [5]:
request_url = BASE_URL + "/interactions"
data = {}

step = 5
for i in range(0, len(proteins_list), step):
    end = i+step
    if end >= len(proteins_list):
        end = len(proteins_list)
    
    # List of genes to search for
    gene_list = proteins_list[i:end]

    params = {
        "accesskey": ACCESS_KEY,
        "format": "json",  # Return results in JSON format
        "geneList": "|".join(gene_list),  # Must be | separated
        "searchNames": "true",  # Search against official names
        "includeInteractors": "true",  # Set to true to get any interaction involving EITHER gene, set to false to get interactions between genes
        "includeInteractorInteractions": "false",  # Set to true to get interactions between the geneList’s first order interactors
        "includeEvidence": "false",  # If false "evidenceList" is evidence to exclude, if true "evidenceList" is evidence to show
        "selfInteractionsExcluded": "true", # If true no self-interactions will be included
    }

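    r = requests.get(request_url, params=params)
    r.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    interactions = r.json()

    # A full page of 10000 results suggests the response was capped, so reduce `step`
    if len(interactions) >= 10000:
        raise RuntimeError("Too many interactions for genes {0}: lower the batch size".format(gene_list))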

    # Create a hash of results by interaction identifier
    for interaction_id, interaction in interactions.items():
        data[interaction_id] = interaction
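
The batching used in the loop above can be sketched in isolation. This is a minimal illustration rather than the notebook's own code: `batched` is a hypothetical helper, and the protein names are just sample values taken from the tables in this notebook. It also shows that list slicing already stops at the end of the list, so no explicit clamp on the last index is needed.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        # Slicing past the end of a list is safe, so the last batch is simply shorter
        yield items[start:start + size]


# Sample protein names, used only for illustration
example_proteins = ["NSP8", "NSP9", "ORF6", "ORF8", "CCNF", "SON", "MYC"]
batches = list(batched(example_proteins, 5))

# Each batch becomes one "|"-separated value for the geneList parameter
gene_list_params = ["|".join(batch) for batch in batches]
print(gene_list_params)
```

Each string in `gene_list_params` would then be sent as the `geneList` parameter of one request.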

In [6]:
# Load the data into a pandas dataframe
dataset = pd.DataFrame.from_dict(data, orient="index")

# Re-order the columns and select only the columns we want to see
columns = ["OFFICIAL_SYMBOL_A", "OFFICIAL_SYMBOL_B"]
dataset = dataset[columns]

# Rename the columns and make all the values uppercase
dataset = dataset.rename(columns={"OFFICIAL_SYMBOL_A": "InteractorA", "OFFICIAL_SYMBOL_B": "InteractorB"})
dataset["InteractorA"] = dataset["InteractorA"].str.upper()
dataset["InteractorB"] = dataset["InteractorB"].str.upper()

# Print the dataframe
dataset.tail()

Unnamed: 0,InteractorA,InteractorB
3305885,CCNF,ZBTB1
3305886,CCNF,ZGPAT
3305887,CCNF,ZNF638
3305888,CCNF,ZNF687
3305889,CCNF,ZWINT


Drop duplicated interactions, since they are not interesting from our point of view.

In [7]:
# Look for duplicated interactions
duplicated_interactions = pd.DataFrame(np.sort(dataset[["InteractorA", "InteractorB"]].values, 1)).duplicated()
print("Duplicated interactions:\n{0}".format(duplicated_interactions.value_counts()))

# Delete such interactions from the dataset
dataset = dataset[~duplicated_interactions.values]

Duplicated interactions:
False    79296
True     25281
dtype: int64
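
The idea behind the duplicate check is that an interaction is undirected, so (A, B) and (B, A) count as the same pair. A dependency-free sketch of the same logic follows; `drop_undirected_duplicates` and the toy pairs are assumptions made for illustration, and the first occurrence of each pair is kept, as `duplicated()` does by default.

```python
from typing import List, Tuple


def drop_undirected_duplicates(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each unordered pair (hypothetical helper)."""
    seen = set()
    unique = []
    for a, b in pairs:
        # Sorting the two names makes (A, B) and (B, A) map to the same key
        key = tuple(sorted((a, b)))
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique


# Toy data: the second and fourth pairs repeat the first one
toy_pairs = [("CCNF", "SON"), ("SON", "CCNF"), ("MYC", "SON"), ("CCNF", "SON")]
unique_pairs = drop_undirected_duplicates(toy_pairs)
print(unique_pairs)
```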


Drop self-loops since they're useless for our analysis.

In [8]:
# Look for interactions where both proteins are the same
same_proteins_interactions = pd.Series(dataset[["InteractorA", "InteractorB"]].nunique(axis=1) == 1)
print("Useless interactions:\n{0}".format(same_proteins_interactions.value_counts()))

# Delete such interactions from the dataset
dataset = dataset[~same_proteins_interactions.values]

Useless interactions:
False    79283
True        13
dtype: int64


Merge the interactions of the starting gene with those obtained from the BioGRID requests.

In [9]:
dataset = pd.concat([dataset, gene_interactions])

In [10]:
nodes = pd.concat([dataset["InteractorA"], dataset["InteractorB"]]).unique()
print("Number of nodes: {0}".format(len(nodes)))

Number of nodes: 13010


Finally, save the interactions and nodes to CSV files for pathway enrichment.

In [11]:
# Save interactions and nodes dataset to csv
dataset.to_csv("datasets/interactions.csv")
pd.DataFrame(nodes).to_csv("datasets/genes.csv")

## **Pathway enrichment**
Find all the diseases linked to the nodes retrieved by the previous step.

In [12]:
import gseapy as gp

List all the datasets from which we can retrieve pathways by using the gseapy package.

In [13]:
gp.get_library_name()

['ARCHS4_Cell-lines',
 'ARCHS4_IDG_Coexp',
 'ARCHS4_Kinases_Coexp',
 'ARCHS4_TFs_Coexp',
 'ARCHS4_Tissues',
 'Achilles_fitness_decrease',
 'Achilles_fitness_increase',
 'Aging_Perturbations_from_GEO_down',
 'Aging_Perturbations_from_GEO_up',
 'Allen_Brain_Atlas_10x_scRNA_2021',
 'Allen_Brain_Atlas_down',
 'Allen_Brain_Atlas_up',
 'Azimuth_Cell_Types_2021',
 'BioCarta_2013',
 'BioCarta_2015',
 'BioCarta_2016',
 'BioPlanet_2019',
 'BioPlex_2017',
 'CCLE_Proteomics_2020',
 'CORUM',
 'COVID-19_Related_Gene_Sets',
 'COVID-19_Related_Gene_Sets_2021',
 'Cancer_Cell_Line_Encyclopedia',
 'CellMarker_Augmented_2021',
 'ChEA_2013',
 'ChEA_2015',
 'ChEA_2016',
 'Chromosome_Location',
 'Chromosome_Location_hg19',
 'ClinVar_2019',
 'DSigDB',
 'Data_Acquisition_Method_Most_Popular_Genes',
 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019',
 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019',
 'Descartes_Cell_Types_and_Tissue_2021',
 'DisGeNET',
 'Disease_Perturbations_from_GEO_down',
 'Disease_Perturbati

Obtain all the pathways connected to our nodes. In our case we use the DisGeNET dataset.

In [14]:
import os
if os.path.exists("datasets/diseases_pathways.csv"):
    df_diseases = pd.read_csv("datasets/diseases_pathways.csv", sep=",", index_col=0)
else:
    enr = gp.enrichr(gene_list=pd.DataFrame(nodes),
                      gene_sets=['DisGeNET'],  # Datasets from the gp.get_library_name() method
                      organism='Human',
                      description='DEGs_up_1d',
                      outdir='test'
                  )

    # Keep those pathways with an adjusted p-value < 0.1
    df_diseases = enr.results[enr.results["Adjusted P-value"] < 0.1][["Term", "Overlap", "P-value", "Adjusted P-value", "Genes"]]

In [15]:
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...


Save the pathways in a csv file just like for the interactions and nodes.

In [16]:
df_diseases.to_csv("datasets/diseases_pathways.csv")

Build a dict with all the diseases. It will be useful later, when we work on the graph.

In [17]:
diseases = dict()

for i, disease in df_diseases.iterrows():
    disease_genes = disease['Genes'].split(";")
    term = disease['Term']
    diseases[i] = {"name": term, "genes": disease_genes}

## **Protein-Protein network**
Build the protein-to-protein network and link each node to its diseases.


In [18]:
import networkx as nx

Build the graph and fill it with its nodes (the proteins coming from the dataset).

In [19]:
# Build the graph
protein_graph = nx.Graph(name='Protein Interactions Graph')

# Build the nodes
for node in nodes:
    protein_graph.add_node(node, diseases=[])  # Each node will have a list with the disease pathways it belongs to

Insert into the nodes their respective diseases.

In [20]:
for i, disease in diseases.items():
    disease_genes = disease['genes']
    for gene in disease_genes:
        protein_graph.nodes[gene]["diseases"].append(i)
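
The loop above is effectively building an inverted index from each gene to the diseases it appears in. The sketch below shows the same idea on plain dictionaries. The toy diseases and the `known_nodes` set are assumptions for illustration. The guard on unknown genes is an addition that the notebook does not have, since there every enriched gene is expected to be a graph node.

```python
from collections import defaultdict

# Toy input in the same shape as the `diseases` dict (illustrative values)
toy_diseases = {
    0: {"name": "Disease A", "genes": ["SON", "MYC"]},
    1: {"name": "Disease B", "genes": ["MYC", "ESR1"]},
}
known_nodes = {"SON", "MYC"}  # ESR1 is deliberately missing from the graph

gene_to_diseases = defaultdict(list)
skipped = []
for disease_id, disease in toy_diseases.items():
    for gene in disease["genes"]:
        if gene in known_nodes:
            gene_to_diseases[gene].append(disease_id)
        else:
            # Record genes that are not graph nodes instead of raising KeyError
            skipped.append((disease_id, gene))

gene_to_diseases = dict(gene_to_diseases)
print(gene_to_diseases, skipped)
```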

There could be nodes without any diseases, but they still need to be kept in the network.

In [21]:
nodes_no_disease = list()
for node in protein_graph.nodes:
    if len(protein_graph.nodes[node]["diseases"])==0:
        nodes_no_disease.append(str(node))

In [22]:
print("Nodes without diseases: {0}".format(len(nodes_no_disease)))

Nodes without diseases: 5101


Then build the edges, which is straightforward since the nodes are known. The weight of each edge is the number of diseases shared by the two nodes it connects.

In [23]:
def intersection(lst1, lst2):
    inters = list()
    if not (len(lst1) == 0 or len(lst2) == 0):
        set1 = set(lst1)
        inters = [elem for elem in lst2 if elem in set1]
    return inters

In [24]:
for _, interaction in dataset.iterrows():
    # Proteins involved in the interaction, accessed by column name rather than position
    first_protein, second_protein = interaction["InteractorA"], interaction["InteractorB"]

    # Retrieve the proteins' diseases
    prot1_dis = protein_graph.nodes[first_protein]['diseases']
    prot2_dis = protein_graph.nodes[second_protein]['diseases']

    # Build the edge, weighted by the number of shared diseases
    protein_graph.add_edge(first_protein, second_protein, weight=len(intersection(prot1_dis, prot2_dis)))
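
As a small worked example of the edge weight, the number of shared diseases can also be computed with a set intersection. `shared_disease_count` is a hypothetical helper, and it assumes that each disease list contains no repeated identifiers, which holds when every disease is appended at most once per node.

```python
from typing import Iterable


def shared_disease_count(diseases_a: Iterable[int], diseases_b: Iterable[int]) -> int:
    """Number of disease identifiers that two proteins have in common."""
    return len(set(diseases_a) & set(diseases_b))


# Toy disease identifiers: 3 and 7 are shared
weight = shared_disease_count([0, 3, 7], [7, 3, 9])
print(weight)
```

An edge whose endpoints share no disease gets weight 0, which is the case counted in the next cells.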

There could be edges without any shared disease. As with the nodes, they still need to be kept.

In [25]:
edges_no_disease = list()
for edge in protein_graph.edges:
    if protein_graph.edges[edge]["weight"]==0:
        edges_no_disease.append(str(edge))

In [26]:
print("Edges without diseases: {0}".format(len(edges_no_disease)))

Edges without diseases: 45404


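Finally, save the graph. Note that `nx.write_gpickle` and `nx.read_gpickle` exist in NetworkX 2.x and were removed in NetworkX 3.0.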

In [27]:
nx.write_gpickle(protein_graph, 'datasets/protein_graph.gpickle')

## **Metrics**
Metrics needed to compare the various diseases and proteins.

Load the graph if already built previously.

In [28]:
import os
from tqdm.notebook import tqdm

if os.path.exists("datasets/protein_graph.gpickle"):
    protein_graph = nx.read_gpickle("datasets/protein_graph.gpickle")
elif not "protein_graph" in locals():
    raise ValueError("It was not possible to find the graph, build it from the previous steps")

**Size of largest pathway component:** Fraction of disease proteins that lie in the disease's largest pathway component (i.e., the relative size of the largest connected component (LCC) of the disease).

In [29]:
def largest_conn_comp(diseases_dict: dict) -> list:
    lcc_score = list()
    for _, disease_dict in tqdm(diseases_dict.items()):
        sub_graph = protein_graph.subgraph(disease_dict['genes'])  # Subgraph of the current disease
        largest_cc = max(nx.connected_components(sub_graph), key=len)
        lcc_score.append(len(largest_cc) / len(sub_graph.nodes()))
    
    return lcc_score

In [30]:
if "LCC Score" in df_diseases.columns:
    df_diseases["LCC Score"] = largest_conn_comp(diseases)
else:
    df_diseases.insert(len(df_diseases.columns), "LCC Score", largest_conn_comp(diseases), True)

  0%|          | 0/589 [00:00<?, ?it/s]

In [31]:
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes,LCC Score
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...,0.018182
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043
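
To make the LCC score concrete, here is a small sketch that computes it on a toy graph without NetworkX. The helper names and the toy nodes are assumptions for illustration. The connected components are found with a breadth-first search, which should give the same partition as `nx.connected_components` on an undirected graph.

```python
from collections import deque
from typing import List, Set, Tuple


def connected_components(nodes: List[str], edges: List[Tuple[str, str]]) -> List[Set[str]]:
    """Connected components of the undirected graph induced on `nodes`."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        if a in adjacency and b in adjacency:  # ignore edges leaving the node set
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def lcc_score(nodes: List[str], edges: List[Tuple[str, str]]) -> float:
    """Fraction of nodes that lie in the largest connected component."""
    components = connected_components(nodes, edges)
    return len(max(components, key=len)) / len(nodes)


# Toy disease: A-B-C form one component, D-E another
toy_nodes = ["A", "B", "C", "D", "E"]
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
toy_lcc = lcc_score(toy_nodes, toy_edges)
print(toy_lcc)
```

Read this way, a score of 0.018182 for a disease with 55 genes in the graph equals 1/55, which suggests that none of its genes is directly connected to another one.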


**Distance of pathway components:** For each pair of pathway components, we compute the average shortest-path length between the proteins of the first component and those of the second. The score is the mean of these averages over all pairs of components.

In [32]:
from numpy.ma.core import mean

def distance_pathway_comps(diseases_dict: dict) -> list:
    dpc_score = list()
    for _, disease_dict in tqdm(diseases_dict.items()):
        sub_graph = protein_graph.subgraph(disease_dict['genes'])
        conn_comps = list(nx.connected_components(sub_graph))
        distances = list()
        for i, comp in enumerate(conn_comps):
            for j in range(i+1, len(conn_comps)):
                dist = 0
                for first_comp_protein in comp:
                    for second_comp_protein in conn_comps[j]:
                        dist += nx.shortest_path_length(protein_graph, source=first_comp_protein, target=second_comp_protein)
                
                distances.append(dist / (len(comp) * len(conn_comps[j])))

        dpc_score.append(mean(distances))
    
    return dpc_score

In [33]:
import os

if os.path.exists("datasets/mean_distances.csv"):
    df_mean_distances = pd.read_csv("datasets/mean_distances.csv", sep=",", index_col=0)
elif not "df_mean_distances" in locals():
    df_mean_distances = pd.DataFrame(distance_pathway_comps(diseases))
    df_mean_distances.to_csv('datasets/mean_distances.csv')

In [34]:
df_diseases["DPC Score"] = df_mean_distances
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes,LCC Score,DPC Score
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...,0.018182,2.678114
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694
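
The DPC score can be checked by hand on a toy graph. The sketch below is an illustration under stated assumptions and not the notebook's implementation: the helper names are hypothetical, the whole graph is assumed to be connected, and one breadth-first search is run per source protein instead of one shortest-path query per pair.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Set


def bfs_distances(adjacency: Dict[str, Set[str]], source: str) -> Dict[str, int]:
    """Shortest-path lengths (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist


def mean_component_distance(adjacency: Dict[str, Set[str]], components: List[Set[str]]) -> float:
    """Mean over component pairs of the average distance between their nodes."""
    pair_means = []
    for comp_a, comp_b in combinations(components, 2):
        total = 0
        for a in comp_a:
            dist = bfs_distances(adjacency, a)  # distances measured in the whole graph
            total += sum(dist[b] for b in comp_b)
        pair_means.append(total / (len(comp_a) * len(comp_b)))
    # With a single component there is no pair to average over
    return sum(pair_means) / len(pair_means) if pair_means else float("nan")


# Whole graph: A - B - X - C, where X is not a disease gene
toy_adjacency = {"A": {"B"}, "B": {"A", "X"}, "X": {"B", "C"}, "C": {"X"}}
# Disease genes A, B, C split into two components once X is left out
toy_components = [{"A", "B"}, {"C"}]
toy_dpc = mean_component_distance(toy_adjacency, toy_components)
print(toy_dpc)
```

Here the distances A to C and B to C are 3 and 2, so the only pair of components has an average distance of 2.5.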


In [35]:
df_diseases.to_csv("datasets/diseases_scores.csv")

**Network modularity:** Fraction of edges that fall within/outside the pathway minus the expected fraction if edges were randomly distributed:
\begin{equation}
Q_d = \frac{1}{2m} \sum_{ij} \left( I((i, j) \in E) - \frac{k_i k_j}{2m} \right) \delta(p_i, p_j)
\end{equation}
where $I(\cdot)$ is the indicator function, $E$ is the set of edges, $m$ is the number of edges, $k_i$ is the degree of $i$, $p_i$ is the pathway of $i$, and $\delta(p_i, p_j)$ is 1 if $p_i$ and $p_j$ are equal and 0 otherwise.


In [36]:
def network_modularity(protein_graph: nx.Graph, diseases_dict: dict) -> list:
    m = protein_graph.number_of_edges()
    one_m = 1/(2*m)
    q = list()
    for _, disease_dict in tqdm(diseases_dict.items()):
        sub_graph = protein_graph.subgraph(disease_dict['genes'])
        disease_nodes = list(sub_graph.nodes())
        q_dis = 0
        for i, node_i in enumerate(disease_nodes):
            for j in range(i+1, len(disease_nodes)):
                node_j = disease_nodes[j]
                a = protein_graph.number_of_edges(node_i, node_j)
                k_i=protein_graph.degree[node_i]
                k_j=protein_graph.degree[node_j]
                q_dis += a - (k_i*k_j)/(2*m)
        
        q.append(one_m * q_dis)
    
    return q
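
A worked example may help to read the function above. The sketch below reproduces the same loop on a toy graph with plain Python. `pairwise_modularity` is a hypothetical name and the graph is assumed to be simple (no repeated edges). Like the function above, it visits each unordered pair of disease nodes once, so the value it returns differs from a sum over all ordered pairs.

```python
from itertools import combinations
from typing import List, Tuple


def pairwise_modularity(edges: List[Tuple[str, str]], group: List[str]) -> float:
    """Sum of (A_ij - k_i * k_j / 2m) over unordered pairs in `group`, scaled by 1/(2m)."""
    m = len(edges)
    edge_set = {frozenset(edge) for edge in edges}
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    total = 0.0
    for i, j in combinations(group, 2):
        a_ij = 1 if frozenset((i, j)) in edge_set else 0  # 1 when i and j are linked
        total += a_ij - degree.get(i, 0) * degree.get(j, 0) / (2 * m)
    return total / (2 * m)


# Toy path graph A - B - C - D: m = 3, degrees are 1, 2, 2, 1
toy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
q_linked = pairwise_modularity(toy_edges, ["A", "B"])    # (1 - 2/6) / 6 = 1/9
q_unlinked = pairwise_modularity(toy_edges, ["A", "D"])  # (0 - 1/6) / 6 = -1/36
print(q_linked, q_unlinked)
```

A linked pair contributes positively and an unlinked pair negatively, which is why diseases whose genes rarely interact end up with values close to zero or slightly negative.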

In [37]:
import os

if os.path.exists("datasets/modularities.csv"):
    df_modularities = pd.read_csv("datasets/modularities.csv", sep=",", index_col=0)
elif not "df_modularities" in locals():
    df_modularities = pd.DataFrame(network_modularity(protein_graph, diseases))
    df_modularities.to_csv('datasets/modularities.csv')

In [38]:
df_diseases["Modularity"] = df_modularities
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes,LCC Score,DPC Score,Modularity
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...,0.018182,2.678114,-8e-06
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694,4e-06
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694,4e-06
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694,4e-06
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694,4e-06


In [39]:
df_diseases.to_csv("datasets/diseases_scores.csv")

## **Network biomarkers**
Compute the central nodes in the graph by taking into account their normalized degree.

In [40]:
import networkx.algorithms.centrality as nx_c

Compute the degree centrality of each node and keep the 29 highest-ranked nodes. The starting gene must be included too.

In [41]:
nodes_degree = pd.DataFrame.from_dict(nx_c.degree_centrality(protein_graph), orient="index", columns=["centrality"])
nodes_degree = nodes_degree.sort_values(by='centrality', ascending=False)
biomarkers = pd.concat([nodes_degree.iloc[:29], nodes_degree[nodes_degree.index=="SON"]]) # Insert the starting gene
biomarkers

Unnamed: 0,centrality
KIAA1429,0.223922
ESR2,0.175801
ESR1,0.174879
FANCD2,0.162733
MYC,0.156815
KIF14,0.138596
HIST1H4A,0.12722
BRD4,0.117534
EED,0.112537
CIT,0.111308
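
Degree centrality is the degree of a node divided by the number of other nodes, n - 1. The sketch below computes it on a toy graph and then selects the top-ranked nodes plus a seed gene. The helper names and the toy edges are assumptions for illustration. Unlike the `pd.concat` call above, this version does not add the seed twice when it is already among the top nodes.

```python
from typing import Dict, List, Tuple


def degree_centrality(edges: List[Tuple[str, str]]) -> Dict[str, float]:
    """Degree of each node divided by (n - 1), for an undirected simple graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n <= 1:
        return {node: 0.0 for node in neighbours}
    return {node: len(nbs) / (n - 1) for node, nbs in neighbours.items()}


def top_k_with_seed(centrality: Dict[str, float], k: int, seed: str) -> List[str]:
    """The k most central nodes, plus the seed if it is not already selected."""
    ranked = sorted(centrality, key=lambda node: (-centrality[node], node))
    selected = ranked[:k]
    if seed in centrality and seed not in selected:
        selected.append(seed)
    return selected


# Toy graph: HUB is linked to everyone, and A and B are also linked to each other
toy_edges = [("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("A", "B")]
toy_centrality = degree_centrality(toy_edges)
toy_biomarkers = top_k_with_seed(toy_centrality, k=2, seed="C")
print(toy_centrality, toy_biomarkers)
```

With the 13010 nodes reported earlier, a centrality of 0.223922 would correspond to roughly 2913 neighbours.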


## **Community analysis**
Find the communities inside the graph and look at how the one containing the starting gene is connected to the various diseases.

Find all the communities with the Louvain method (beware that it is not deterministic unless a `seed` is passed) and discard those with only one node.

In [42]:
louvain_communities = list(nx.algorithms.community.louvain_communities(protein_graph))
communities = [community for community in louvain_communities if len(community)>1]  # Discard those communities with one node

In [43]:
print("Number of communities: {0}".format(len(communities)))

Number of communities: 10


Compute the mean number of nodes of the communities kept after pruning on their size.

In [44]:
mean_size_communities = 0
for community in communities:
    mean_size_communities += len(community)

mean_size_communities /= len(communities)
print("Mean size of communities: {0}".format(str(mean_size_communities)))

Mean size of communities: 613.0


Let's also see how many diseases, on average, were linked to the communities with only one node.

In [45]:
mean_diseases_one_node_communities = 0
n_one_node_commmunities = 0
for community in louvain_communities:
    community = list(community)
    if len(community) == 1:
        protein = community[0]
        protein_diseases = protein_graph.nodes[protein]['diseases']
        n_one_node_commmunities += 1
        mean_diseases_one_node_communities += len(protein_diseases)

if n_one_node_commmunities == 0:
    print("There are no communities with only one node")
else:
    mean_diseases_one_node_communities /= n_one_node_commmunities
    print("Mean diseases for those communities with one node: {0}".format(str(mean_diseases_one_node_communities)))

Mean diseases for those communities with one node: 2.2324127906976745


Finally, compute the mean number of genes per disease.

In [46]:
mean_size_diseases = 0
for _, disease in diseases.items():
    mean_size_diseases += len(disease['genes'])

mean_size_diseases /= len(diseases.keys())
print("Mean number of genes for disease: {0}".format(mean_size_diseases))

Mean number of genes for disease: 246.55857385398983


Let's see if the communities do not share any nodes.

In [47]:
def are_communities_distinct(communities: list) -> bool:
    for i, first_community in enumerate(communities):
        for j in range(i+1, len(communities)):
            second_community = communities[j]
            if len(intersection(first_community, second_community))>0:
                return False

    return True

In [48]:
are_communities_distinct(communities)

True
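
The same check can be expressed without comparing every pair: if the groups are pairwise disjoint, the size of their union equals the sum of their sizes. A minimal sketch, where `are_disjoint` is a hypothetical helper and the toy groups are made up:

```python
from typing import Iterable, List


def are_disjoint(groups: List[Iterable[str]]) -> bool:
    """True when no element appears in more than one group."""
    total = sum(len(set(group)) for group in groups)
    # The union is smaller than the total as soon as one element is shared
    return total == len(set().union(*groups))


disjoint = are_disjoint([{"SON", "MYC"}, {"ESR1"}])
overlapping = are_disjoint([{"SON", "MYC"}, {"MYC", "ESR1"}])
print(disjoint, overlapping)
```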

Compute how many proteins are shared between each community and disease.

In [49]:
def communities_ranking(communities: list, diseases: dict) -> list:
    communities_rank = list()  # pd.DataFrame(columns=["com", "disease", "rel_val", "common_genes"])
    for i, community in enumerate(communities):
        tot_genes = dict()
        shared_genes = dict()
        for k, disease in diseases.items():
            genes = disease['genes']
            shared_genes_community = intersection(genes, community)
            tot_genes[k]= len(genes)
            shared_genes[k] = len(shared_genes_community)

        for j in range(len(tot_genes)):
            n_genes, n_shared_genes = tot_genes[j], shared_genes[j]
            if n_shared_genes > 1:
                communities_rank.append({"Community": i, "Disease": diseases[j]['name'], "Shared genes": n_shared_genes, "# genes": n_genes})

    communities_rank = pd.DataFrame.from_dict(communities_rank)
    communities_rank["Ratio disease"] = communities_rank['Shared genes'] / communities_rank['# genes']
    return communities_rank

In [50]:
communities_rank = communities_ranking(communities, diseases)
communities_rank[communities_rank["Disease"]=="Tooth size discrepancy"]

Unnamed: 0,Community,Disease,Shared genes,# genes,Ratio disease
564,0,Tooth size discrepancy,3,47,0.06383
1129,1,Tooth size discrepancy,9,47,0.191489
1680,2,Tooth size discrepancy,3,47,0.06383
2215,3,Tooth size discrepancy,2,47,0.042553
2714,4,Tooth size discrepancy,3,47,0.06383
3285,5,Tooth size discrepancy,5,47,0.106383
3823,6,Tooth size discrepancy,3,47,0.06383
4363,7,Tooth size discrepancy,3,47,0.06383
5194,9,Tooth size discrepancy,9,47,0.191489


Compute the ratio of the number of genes shared by each community and disease and the size of such community.

In [51]:
ratios_genes_community = list()
for community in range(len(communities)):
    ratios_genes_community += (list(communities_rank[communities_rank['Community']==community]['Shared genes'] / len(communities[community])))

communities_rank['Ratio community'] = pd.Series(ratios_genes_community)

In [52]:
communities_rank[communities_rank["Disease"]=="Tooth size discrepancy"]

Unnamed: 0,Community,Disease,Shared genes,# genes,Ratio disease,Ratio community
564,0,Tooth size discrepancy,3,47,0.06383,0.006522
1129,1,Tooth size discrepancy,9,47,0.191489,0.010936
1680,2,Tooth size discrepancy,3,47,0.06383,0.003367
2215,3,Tooth size discrepancy,2,47,0.042553,0.003344
2714,4,Tooth size discrepancy,3,47,0.06383,0.007042
3285,5,Tooth size discrepancy,5,47,0.106383,0.004634
3823,6,Tooth size discrepancy,3,47,0.06383,0.004323
4363,7,Tooth size discrepancy,3,47,0.06383,0.004823
5194,9,Tooth size discrepancy,9,47,0.191489,0.020316


We need to know which community our starting gene is in.

In [53]:
def look_for_gene_community(protein: str, communities: list) -> int:
    for i, community in enumerate(communities):
        if protein in community:
            return i

    return -1

In [54]:
protein_community = "SON"
gene_community = look_for_gene_community(protein_community, communities)
if gene_community == -1:
    raise Exception("The gene {0} is not in one of the communities".format(protein_community))
else:
    print("The gene {0} is in community {1}".format(protein_community, gene_community))

The gene SON is in community 1


Knowing which community our starting gene is in, we can retrieve all the diseases inside that community and rank them by the product of the two previously computed metrics.

In [55]:
# Compute the new metric
communities_rank['Relevance'] = communities_rank['Ratio community'] * communities_rank['Ratio disease']

# Retrieve the diseases inside the community that contains the gene and keep those that share more than ten genes with the community
gene_rank = communities_rank[(communities_rank["Community"]==gene_community) &
                             (communities_rank["Shared genes"] > 10)].sort_values(by="Relevance", ascending=False).drop("Community", axis=1)

In [56]:
gene_rank.head()

Unnamed: 0,Disease,Shared genes,# genes,Ratio disease,Ratio community,Relevance
570,Intellectual Disability,430,1808,0.237832,0.522479,0.124262
565,Mental and motor retardation,246,802,0.306733,0.298906,0.091685
568,Global developmental delay,248,854,0.290398,0.301337,0.087508
567,Cognitive delay,233,758,0.307388,0.283111,0.087025
575,Poor school performance,226,741,0.304993,0.274605,0.083753
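
The relevance score is the product of two ratios: the shared genes over the disease size and the shared genes over the community size. The sketch below recomputes it for the first row of the table. The shared-gene count and the disease size come from that row, while the community size of 823 is an assumption inferred from the printed "Ratio community" value, since the notebook does not print it.

```python
def relevance(shared: int, disease_size: int, community_size: int) -> float:
    """Product of the disease ratio and the community ratio."""
    ratio_disease = shared / disease_size      # share of the disease covered by the community
    ratio_community = shared / community_size  # share of the community covered by the disease
    return ratio_disease * ratio_community


# 430 and 1808 are taken from the "Intellectual Disability" row;
# 823 is an inferred community size, not a value reported by the notebook
example_relevance = relevance(shared=430, disease_size=1808, community_size=823)
print(round(example_relevance, 6))
```

Because both ratios grow with the number of shared genes, the score favours diseases that overlap heavily with the community from both sides.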


In [57]:
gene_rank.to_csv("datasets/community_diseases_rank.csv")

## **Draw the protein, disease and community graphs**
Plot the graphs to show the results of our work.

In [58]:
from pyvis.network import Network

Plot the protein graph. You can also highlight a disease's genes in the same graph: if a node is red, it belongs to the disease.

In [59]:
def plot_protein_network(protein_graph: nx.Graph, disease_genes: list=None, biomarkers: list=None) -> None:
    if disease_genes is None:
        disease_genes = []  # No disease to highlight
    if biomarkers is not None:
        plot_graph = protein_graph.subgraph(biomarkers)
    else:
        plot_graph = protein_graph.copy()

    net = Network(width=1080, height=720)
    node_index = dict()
    for i, node in enumerate(plot_graph.nodes()):
        node_index[node] = i
        if not node in disease_genes:
            net.add_node(i, label=node, size=8)
        else:
            net.add_node(i, label=node, size=16, color="red")

    for edge_from, edge_to in plot_graph.edges():
        if edge_from in disease_genes or edge_to in disease_genes:
            if not biomarkers is None:
                net.add_edge(node_index[edge_from], node_index[edge_to], color="red")
            else:
                net.add_edge(node_index[edge_from], node_index[edge_to], color="red", value=1)
        else:
            net.add_edge(node_index[edge_from], node_index[edge_to])

    net.toggle_drag_nodes(False)
    net.show_buttons(['physics'])
    net.force_atlas_2based(spring_strength=0.02)
    net.show("protein_graph.html")

In [60]:
plot_protein_network(protein_graph, diseases[588]["genes"], biomarkers.index)

Plot the disease graph. The chosen protein and the edges linked to it are colored in red.

In [61]:
def plot_disease(protein_graph: nx.Graph, disease_genes: list, protein:str) -> None:
    sub_graph = protein_graph.subgraph(disease_genes)
    net = Network(width=1080, height=720)
    node_index = dict()
    for i, node in enumerate(sub_graph.nodes()):
        node_index[node] = i
        if node != protein:
            net.add_node(i, label=node, size=8)
        else:
            net.add_node(i, label=node, size=16, color="red")

    for edge_from, edge_to in sub_graph.edges():
        if edge_from==protein or edge_to==protein:
            net.add_edge(node_index[edge_from], node_index[edge_to], color="red")
        else:
            net.add_edge(node_index[edge_from], node_index[edge_to])

    net.toggle_drag_nodes(False)
    net.show_buttons(['physics'])
    net.force_atlas_2based(spring_strength=0.02)
    net.show("disease_graph.html")

In [62]:
plot_disease(protein_graph, diseases[588]["genes"], "SON")

Plot a community graph and color the chosen protein, if any is passed, and the edges linked to it.

In [63]:
def plot_community_protein(protein_graph: nx.Graph, communities: list, protein: str=None) -> None:
    if protein is not None:
        community = communities[look_for_gene_community(protein, communities)]
    else:
        # No protein given: draw one of the communities at random
        community = communities[np.random.randint(0, len(communities))]

    sub_graph = protein_graph.subgraph(community)
    net = Network(width=1080, height=720)
    node_index = dict()
    for i, node in enumerate(sub_graph.nodes()):
        node_index[node] = i
        if node != protein:
            net.add_node(i, label=node, size=8)
        else:
            net.add_node(i, label=node, size=16, color="red")

    for edge_from, edge_to in sub_graph.edges():
        if edge_from==protein or edge_to==protein:
            net.add_edge(node_index[edge_from], node_index[edge_to], color="red", value=1)
        else:
            net.add_edge(node_index[edge_from], node_index[edge_to])

    net.toggle_drag_nodes(False)
    net.show_buttons(['physics'])
    net.force_atlas_2based(spring_strength=0.02)
    net.show("community_protein_graph.html")

In [64]:
plot_community_protein(protein_graph, communities, "SON")

Plot a community and color the disease's genes passed to the method.

In [65]:
def plot_community_disease(protein_graph: nx.Graph, disease_genes: list, community: set) -> None:
    sub_graph = protein_graph.subgraph(community)
    net = Network(width=1080, height=720)
    node_index = dict()
    for i, node in enumerate(sub_graph.nodes()):
        node_index[node] = i
        if not node in disease_genes:
            net.add_node(i, label=node, size=8)
        else:
            net.add_node(i, label=node, size=16, color="red")

    for edge_from, edge_to in sub_graph.edges():
        if edge_from in disease_genes or edge_to in disease_genes:
            net.add_edge(node_index[edge_from], node_index[edge_to], color="red", value=1)
        else:
            net.add_edge(node_index[edge_from], node_index[edge_to])

    net.toggle_drag_nodes(False)
    net.show_buttons(['physics'])
    net.force_atlas_2based(spring_strength=0.02)
    net.show("community_disease_graph.html")

In [66]:
plot_community_disease(protein_graph, diseases[588]["genes"], communities[3])

In [68]:
communities_rank[communities_rank["Disease"]=="Tooth size discrepancy"]

Unnamed: 0,Community,Disease,Shared genes,# genes,Ratio disease,Ratio community,Relevance
564,0,Tooth size discrepancy,3,47,0.06383,0.006522,0.000416
1129,1,Tooth size discrepancy,9,47,0.191489,0.010936,0.002094
1680,2,Tooth size discrepancy,3,47,0.06383,0.003367,0.000215
2215,3,Tooth size discrepancy,2,47,0.042553,0.003344,0.000142
2714,4,Tooth size discrepancy,3,47,0.06383,0.007042,0.00045
3285,5,Tooth size discrepancy,5,47,0.106383,0.004634,0.000493
3823,6,Tooth size discrepancy,3,47,0.06383,0.004323,0.000276
4363,7,Tooth size discrepancy,3,47,0.06383,0.004823,0.000308
5194,9,Tooth size discrepancy,9,47,0.191489,0.020316,0.00389


We need to know in wich community our starting gene is.

In [69]:
def look_for_gene_community(protein: str, communities: list) -> int:
    for i, community in enumerate(communities):
        if protein in community:
            return i

    return -1

In [70]:
protein_community = "SON"
gene_community = look_for_gene_community(protein_community, communities)
if gene_community == -1:
    raise Exception("The gene {0} is not in one of the communities".format(protein_community))
else:
    print("The gene SON is in community {0}".format(gene_community))

The gene SON is in community 1


Knowing in which community our starting gene is, we can retrieve all the diseases inisde such community and rank them by the product of the two previous computed metrics.

In [71]:
# Compute the new metric
communities_rank['Relevance'] = communities_rank['Ratio community'] * communities_rank['Ratio disease']

# Retrieve the diseases inside the community that contains the gene and keep those that share more than ten genes with the community
gene_rank = communities_rank[(communities_rank["Community"]==gene_community) &
                             (communities_rank["Shared genes"] > 10)].sort_values(by="Relevance", ascending=False).drop("Community", axis=1)
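
As a quick sanity check, the metric can be recomputed by hand for "Tooth size discrepancy" in community 1, using the values shown in the table above. The community size of 823 is an assumption: it is not printed in the notebook and is inferred here from the ratio 9 / 0.010936.

```python
# Values for "Tooth size discrepancy" in community 1
shared_genes = 9
disease_size = 47      # "# genes" column
community_size = 823   # assumption: inferred from the "Ratio community" column

ratio_disease = shared_genes / disease_size
ratio_community = shared_genes / community_size

# Relevance is the product of the two ratios
relevance = ratio_disease * ratio_community
print(round(ratio_disease, 6), round(ratio_community, 6), round(relevance, 6))
```

Under that assumption, the rounded values agree with the "Ratio disease", "Ratio community" and "Relevance" columns of the table.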

In [72]:
gene_rank.tail()

Unnamed: 0,Disease,Shared genes,# genes,Ratio disease,Ratio community,Relevance
784,Ductal Carcinoma,11,177,0.062147,0.013366,0.000831
720,Adenomatous Polyposis Coli,14,295,0.047458,0.017011,0.000807
675,Hereditary Diffuse Gastric Cancer,12,233,0.051502,0.014581,0.000751
990,Mesothelioma,12,276,0.043478,0.014581,0.000634
968,Malignant mesothelioma,11,244,0.045082,0.013366,0.000603


In [73]:
gene_rank.to_csv("datasets/community_diseases_rank.csv")

## **Draw the protein, disease and community graphs**
Plot the graphs to show the results of our work.

In [74]:
from pyvis.network import Network

Plot the protein graph. A disease's genes can also be plotted on the same graph: if a node is red, then it belongs to the disease.

In [75]:
def plot_protein_network(protein_graph: nx.Graph, disease_genes: list=None, biomarkers: list=None) -> None:
    # Restrict the plot to the biomarkers, if any are passed
    if biomarkers is not None:
        plot_graph = protein_graph.subgraph(biomarkers)
    else:
        plot_graph = protein_graph.copy()

    # Without a disease no node is highlighted
    disease_genes = set(disease_genes) if disease_genes is not None else set()

    net = Network(width=1080, height=720)
    node_index = dict()
    for i, node in enumerate(plot_graph.nodes()):
        node_index[node] = i
        if node not in disease_genes:
            net.add_node(i, label=node, size=8)
        else:
            net.add_node(i, label=node, size=16, color="red")

    for edge_from, edge_to in plot_graph.edges():
        if edge_from in disease_genes or edge_to in disease_genes:
            if biomarkers is not None:
                net.add_edge(node_index[edge_from], node_index[edge_to], color="red")
            else:
                net.add_edge(node_index[edge_from], node_index[edge_to], color="red", value=1)
        else:
            net.add_edge(node_index[edge_from], node_index[edge_to])

    net.toggle_drag_nodes(False)
    net.show_buttons(['physics'])
    net.force_atlas_2based(spring_strength=0.02)
    net.show("protein_graph.html")

In [76]:
plot_protein_network(protein_graph, diseases[588]["genes"], biomarkers.index)

Plot the graph of a disease's genes. The protein passed to the function and the edges linked to it are colored in red.

In [77]:
def plot_disease(protein_graph: nx.Graph, disease_genes: list, protein:str) -> None:
    sub_graph = protein_graph.subgraph(disease_genes)
    net = Network(width=1080, height=720)
    node_index = dict()
    for i, node in enumerate(sub_graph.nodes()):
        node_index[node] = i
        if node != protein:
            net.add_node(i, label=node, size=8)
        else:
            net.add_node(i, label=node, size=16, color="red")

    for edge_from, edge_to in sub_graph.edges():
        if edge_from==protein or edge_to==protein:
            net.add_edge(node_index[edge_from], node_index[edge_to], color="red")
        else:
            net.add_edge(node_index[edge_from], node_index[edge_to])

    net.toggle_drag_nodes(False)
    net.show_buttons(['physics'])
    net.force_atlas_2based(spring_strength=0.02)
    net.show("disease_graph.html")

In [78]:
plot_disease(protein_graph, diseases[588]["genes"], "SON")

Plot a community graph and color the chosen protein, if any is passed, and the edges linked to it.

In [79]:
def plot_community_protein(protein_graph: nx.Graph, communities: list, protein: str=None) -> None:
    if protein is not None:
        community = communities[look_for_gene_community(protein, communities)]
    else:
        # No protein passed: pick a random community
        community = communities[np.random.randint(0, len(communities))]

    sub_graph = protein_graph.subgraph(community)
    net = Network(width=1080, height=720)
    node_index = dict()
    for i, node in enumerate(sub_graph.nodes()):
        node_index[node] = i
        if node != protein:
            net.add_node(i, label=node, size=8)
        else:
            net.add_node(i, label=node, size=16, color="red")

    for edge_from, edge_to in sub_graph.edges():
        if edge_from==protein or edge_to==protein:
            net.add_edge(node_index[edge_from], node_index[edge_to], color="red", value=1)
        else:
            net.add_edge(node_index[edge_from], node_index[edge_to])

    net.toggle_drag_nodes(False)
    net.show_buttons(['physics'])
    net.force_atlas_2based(spring_strength=0.02)
    net.show("community_protein_graph.html")

In [80]:
plot_community_protein(protein_graph, communities, "SON")

Plot a community and color the disease's genes passed to the method.

In [81]:
def plot_community_disease(protein_graph: nx.Graph, disease_genes: list, community: set) -> None:
    sub_graph = protein_graph.subgraph(community)
    net = Network(width=1080, height=720)
    node_index = dict()
    for i, node in enumerate(sub_graph.nodes()):
        node_index[node] = i
        if node not in disease_genes:
            net.add_node(i, label=node, size=8)
        else:
            net.add_node(i, label=node, size=16, color="red")

    for edge_from, edge_to in sub_graph.edges():
        if edge_from in disease_genes or edge_to in disease_genes:
            net.add_edge(node_index[edge_from], node_index[edge_to], color="red", value=1)
        else:
            net.add_edge(node_index[edge_from], node_index[edge_to])

    net.toggle_drag_nodes(False)
    net.show_buttons(['physics'])
    net.force_atlas_2based(spring_strength=0.02)
    net.show("community_disease_graph.html")

In [82]:
plot_community_disease(protein_graph, diseases[588]["genes"], communities[3])

In [99]:
import json

# Collect the nodes of every community into a single list
nodes_comm = [node for community in communities for node in community]

# Export the subgraph induced by the community nodes in Cytoscape JSON format
graph_community = protein_graph.subgraph(nodes_comm)
js = nx.cytoscape_data(graph_community)

with open("graph.cyjs", "w") as fi:
    json.dump(js, fi)

In [100]:
# One color per community (the list covers at most eleven communities)
colors = ["red", "blue", "green", "beige", "black", "purple", "orange", "pink", "yellow", "brown", "grey"]
net = Network(width=1080, height=720)
node_index = dict()
index = 0
for j, community in enumerate(communities):
    color = colors[j]
    sub_graph = protein_graph.subgraph(community)
    # Nodes and intra-community edges take the color of their community
    for node in sub_graph.nodes():
        node_index[node] = index
        net.add_node(index, label=node, size=16, color=color)
        index += 1

    for edge_from, edge_to in sub_graph.edges():
        net.add_edge(node_index[edge_from], node_index[edge_to], color=color)

net.show_buttons(['physics'])
net.force_atlas_2based(spring_strength=0.02)
net.show("prova.html")

In [152]:
gene_rank.head()

Unnamed: 0,Disease,Shared genes,# genes,Ratio disease,Ratio community,Relevance
570,Intellectual Disability,430,1808,0.237832,0.522479,0.124262
565,Mental and motor retardation,246,802,0.306733,0.298906,0.091685
568,Global developmental delay,248,854,0.290398,0.301337,0.087508
567,Cognitive delay,233,758,0.307388,0.283111,0.087025
575,Poor school performance,226,741,0.304993,0.274605,0.083753


In [136]:
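# Keep only the edges that connect different communities
graph_vis = protein_graph.copy()

for set_nodes in louvain_communities:
    nodes = list(set_nodes)
    if len(nodes) == 1:
        # Drop the communities made of a single node
        graph_vis.remove_nodes_from(nodes)
    else:
        # Remove the edges between nodes of the same community; the edges are copied
        # into a list first because the subgraph view changes while they are removed
        sub_graph = graph_vis.subgraph(nodes)
        graph_vis.remove_edges_from(list(sub_graph.edges()))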

In [150]:
palette = ['blue', 'red', 'yellow', 'pink', 'purple', 'black', 'brown', 'orange', 'skyblue', 'aqua']
net = Network(width=1080, height=720)
node_index = dict()
index = 0
son_comm = look_for_gene_community("SON", communities)
for j in range(len(communities)):
    community = communities[j]
    color = palette[j]
    if son_comm == j:
        color = 'green'
    sub_graph = protein_graph.subgraph(community)
    for i, node in enumerate(sub_graph.nodes()):
        node_index[node] = index
        net.add_node(index, label=node, size=8, color=color)
        index+=1

    for edge_from, edge_to in sub_graph.edges():
        net.add_edge(node_index[edge_from], node_index[edge_to], color=color)

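# Draw one black edge for each pair of connected communities
com_edges = set()
l_c = len(communities)
for edge_from, edge_to in tqdm(graph_vis.edges()):
    # The unordered pair of community indices identifies the link between two communities
    pair = tuple(sorted((look_for_gene_community(edge_from, communities),
                         look_for_gene_community(edge_to, communities))))
    if pair[0] != pair[1] and pair not in com_edges:
        com_edges.add(pair)
        net.add_edge(node_index[edge_from], node_index[edge_to], color='black', value=1)

    # Stop once every pair of distinct communities has been linked
    if len(com_edges) == (l_c * (l_c - 1)) // 2:
        break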

net.show_buttons(['physics'])
net.force_atlas_2based(spring_strength=0.02)
net.show("communities.html")

In [180]:
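def communities_distance(communities_rank: pd.DataFrame, first_index: int, second_index: int) -> float:
    """Squared Euclidean distance between the disease relevance profiles of two communities.

    A disease linked to only one of the two communities counts with relevance 0 in the other.
    """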
    # Retrieve all the diseases linked to the communities
    tmp_first = communities_rank[communities_rank["Community"]==first_index][["Disease", "Relevance"]]
    tmp_second = communities_rank[communities_rank["Community"]==second_index][["Disease", "Relevance"]]

    # Shared diseases between the two communities
    shared_diseases = tmp_first.merge(tmp_second, left_on="Disease", right_on="Disease")

    # Diseases that are not shared between the two communities
    tmp_first = tmp_first[~tmp_first['Disease'].isin(shared_diseases["Disease"])]
    tmp_second = tmp_second[~tmp_second['Disease'].isin(shared_diseases["Disease"])]

    metric = 0
    for i, disease in shared_diseases.iterrows():
        metric += np.power(disease["Relevance_x"]-disease["Relevance_y"], 2)

    for i, first_disease in tmp_first.iterrows():
        metric += np.power(first_disease["Relevance"], 2)

    for i, second_disease in tmp_second.iterrows():
        metric += np.power(second_disease["Relevance"], 2)

    return metric

son_comm = look_for_gene_community("SON", communities)
for i in range(len(communities)):
    print(communities_distance(communities_rank, son_comm, i))

0.10159605063258893
0.0
0.15884148212549778
0.12340290253934633
0.11729253986846629
0.15648397799270103
0.14105632143702454
0.18272083226840385
0.14193775385092258
0.0983267964503446
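
The function above sums squared differences of relevance, treating a disease that is missing from one community as having relevance 0 there. The sketch below shows one possible vectorized formulation of the same idea on toy data. The name `communities_distance_vectorized` and the toy table are hypothetical, and the sketch assumes that each disease appears at most once per community.

```python
import pandas as pd


def communities_distance_vectorized(rank: pd.DataFrame, first: int, second: int) -> float:
    """Squared Euclidean distance between the relevance profiles of two communities."""
    first_rel = rank[rank["Community"] == first].set_index("Disease")["Relevance"]
    second_rel = rank[rank["Community"] == second].set_index("Disease")["Relevance"]
    # Series.sub aligns on the disease name; fill_value=0 covers the diseases
    # that are linked to only one of the two communities
    diff = first_rel.sub(second_rel, fill_value=0)
    return float((diff ** 2).sum())


# Hypothetical toy table: d1 is shared, d2 and d3 belong to one community only
toy_rank = pd.DataFrame({
    "Community": [0, 0, 1, 1],
    "Disease": ["d1", "d2", "d1", "d3"],
    "Relevance": [0.5, 0.25, 0.25, 0.5],
})
print(communities_distance_vectorized(toy_rank, 0, 1))
```

As in the output above, the distance between a community and itself is 0 with this formulation too.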


In [193]:
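# Diseases of community 9, the closest one to the community of SON according to the distances above
communities_rank[(communities_rank["Community"]==9) &
                 (communities_rank["Shared genes"] > 10)].sort_values(by="Relevance", ascending=False).drop("Community", axis=1)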

Unnamed: 0,Disease,Shared genes,# genes,Ratio disease,Ratio community,Relevance
2452,Breast Carcinoma,196,3337,0.058735,0.460094,0.027024
2332,Malignant neoplasm of breast,198,3424,0.057827,0.464789,0.026877
2274,Neurodegenerative Disorders,70,552,0.126812,0.164319,0.020838
2221,Intellectual Disability,119,1808,0.065819,0.279343,0.018386
2381,Amyotrophic Lateral Sclerosis,57,480,0.118750,0.133803,0.015889
...,...,...,...,...,...,...
2597,Mesothelioma,11,276,0.039855,0.025822,0.001029
2690,Uterine Corpus Cancer,13,406,0.032020,0.030516,0.000977
2627,Malignant neoplasm of endometrium,13,407,0.031941,0.030516,0.000975
2301,Retinoblastoma,12,405,0.029630,0.028169,0.000835
