**Computational Health Laboratory Project, A.Y. 2021/2022**

**Authors:** Niko Dalla Noce, Alessandro Ristori, Andrea Zuppolini

**Project:** Starting from one or more genes, extract from interaction databases the genes they interact with. Using the expanded gene set, perform pathway analysis and obtain all disease pathways in which the genes appear. Merge the pathways to obtain a larger graph. Perform further network analysis to extract central biomarkers and communities beyond pathways. Compute a distance between the initial gene set and the various pathways (diseases).

# **CHL Project, Pathway Enrichment**

## **Colab setup**
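When running on Colab, clone the project repository, install the required packages, mount Google Drive and copy the configuration file into the project folder.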

In [1]:
if 'google.colab' in str(get_ipython()):
    import subprocess
    from google.colab import drive
    out_clone = subprocess.run(["git", "clone", "https://github.com/nikodallanoce/ComputationalHealthLaboratory"], text=True, capture_output=True)
    print("{0}{1}".format(out_clone.stdout, out_clone.stderr))
    %pip install -U PyYAML
    %pip install gseapy
    drive.mount("/content/drive/")
    %cp "/content/drive/Shareddrives/CHL/config.yml" "/content/ComputationalHealthLaboratory"
    %cd ComputationalHealthLaboratory

## **Obtain all the genes that interact with the starting one**
Starting from a gene, retrieve its neighbours and the interactions between them.


In [2]:
import requests
import pandas as pd
import numpy as np
from config import ACCESS_KEY, BASE_URL

In [3]:
gene_interactions = pd.read_csv("datasets/geneset.csv", sep=";")
gene_interactions["InteractorA"] = gene_interactions["InteractorA"].str.upper()
gene_interactions.drop_duplicates(inplace=True)
proteins_list = list(gene_interactions["InteractorA"])  # All the proteins that interact with our starting gene

In [4]:
gene_interactions.tail()

Unnamed: 0,InteractorA,InteractorB
149,NSP8,SON
150,NSP9,SON
151,ORF6,SON
152,ORF8,SON
153,CCNF,SON


## **Expand the interactions dataset**
Expand the dataset using the proteins obtained from the previous step.

In [5]:
request_url = BASE_URL + "/interactions"
data = {}

step = 5  # Number of genes sent with each request
for i in range(0, len(proteins_list), step):
    # List of genes to search for (slicing stops at the end of the list on its own)
    gene_list = proteins_list[i:i + step]

    params = {
        "accesskey": ACCESS_KEY,
        "format": "json",  # Return results in JSON format
        "geneList": "|".join(gene_list),  # Must be | separated
        "searchNames": "true",  # Search against official names
        "includeInteractors": "true",  # Set to true to get any interaction involving EITHER gene, set to false to get interactions between genes
        "includeInteractorInteractions": "false",  # Set to true to get interactions between the geneList’s first order interactors
        "includeEvidence": "false",  # If false "evidenceList" is evidence to exclude, if true "evidenceList" is evidence to show
        "selfInteractionsExcluded": "true", # If true no self-interactions will be included
    }

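    r = requests.get(request_url, params=params)
    r.raise_for_status()  # Stop on HTTP errors instead of parsing an error response
    interactions = r.json()

    # Stop if the response reached the allowed number of interactions (10000), since it may be truncated
    if len(interactions) == 10000:
        raise RuntimeError("Interaction limit reached for genes {0}".format(gene_list))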

    # Create a hash of results by interaction identifier
    for interaction_id, interaction in interactions.items():
        data[interaction_id] = interaction

In [6]:
# Load the data into a pandas dataframe
dataset = pd.DataFrame.from_dict(data, orient="index")

# Re-order the columns and select only the columns we want to see
columns = ["OFFICIAL_SYMBOL_A", "OFFICIAL_SYMBOL_B"]
dataset = dataset[columns]

# Rename the columns and make all the values uppercase
dataset = dataset.rename(columns={"OFFICIAL_SYMBOL_A": "InteractorA", "OFFICIAL_SYMBOL_B": "InteractorB"})
dataset["InteractorA"] = dataset["InteractorA"].str.upper()
dataset["InteractorB"] = dataset["InteractorB"].str.upper()

# Print the dataframe
dataset.tail()

Unnamed: 0,InteractorA,InteractorB
3305885,CCNF,ZBTB1
3305886,CCNF,ZGPAT
3305887,CCNF,ZNF638
3305888,CCNF,ZNF687
3305889,CCNF,ZWINT


Drop duplicated interactions, since they are not interesting from our point of view. An interaction A-B and its reverse B-A count as the same interaction.

In [7]:
# Look for duplicated interactions
duplicated_interactions = pd.DataFrame(np.sort(dataset[["InteractorA", "InteractorB"]].values, 1)).duplicated()
print("Duplicated interactions:\n{0}".format(duplicated_interactions.value_counts()))

# Delete such interactions from the dataset
dataset = dataset[~duplicated_interactions.values]

Duplicated interactions:
False    79296
True     25281
dtype: int64
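
As a small illustration of why the pairs are sorted before calling `duplicated`, the sketch below applies the same idea to a toy table. The toy rows are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy interactions: rows 0, 1 and 3 describe the same undirected pair
toy_pairs = pd.DataFrame({
    "InteractorA": ["CCNF", "SON", "CCNF", "SON"],
    "InteractorB": ["SON", "CCNF", "ZWINT", "CCNF"],
})

# Sorting each row alphabetically maps A-B and B-A to the same ordered pair
sorted_pairs = pd.DataFrame(np.sort(toy_pairs[["InteractorA", "InteractorB"]].values, axis=1))

# duplicated() flags every repetition after the first occurrence
is_duplicate = sorted_pairs.duplicated()
deduplicated = toy_pairs[~is_duplicate.values]

print(is_duplicate.tolist())  # [False, True, False, True]
print(deduplicated)
```

Without the row-wise sort, only exact repetitions in the same column order (such as rows 1 and 3 here) would be detected.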


Drop self-loops since they're useless for our analysis.

In [8]:
# Look for interactions where both proteins are the same
same_proteins_interactions = pd.DataFrame(dataset[["InteractorA", "InteractorB"]].nunique(axis=1) == 1)
print("Useless interactions:\n{0}".format(same_proteins_interactions.value_counts()))

# Delete such interactions from the dataset
dataset = dataset[~same_proteins_interactions.values]

Useless interactions:
False    79283
True        13
dtype: int64
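
The self-loop check can also be written as a direct column comparison. The sketch below, on made-up rows, shows that both forms select the same interactions when no values are missing.

```python
import pandas as pd

# Toy interactions: rows 0 and 2 are self-loops
toy_loops = pd.DataFrame({
    "InteractorA": ["CCNF", "SON", "ORF6"],
    "InteractorB": ["CCNF", "CCNF", "ORF6"],
})

# Form used in the notebook: a row with a single distinct value is a self-loop
via_nunique = toy_loops[["InteractorA", "InteractorB"]].nunique(axis=1) == 1

# Equivalent direct comparison (assuming neither column contains missing values)
is_self_loop = toy_loops["InteractorA"] == toy_loops["InteractorB"]

without_loops = toy_loops[~is_self_loop]
print(is_self_loop.tolist())  # [True, False, True]
```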


Merge the interactions of the starting gene with those obtained from the requests to the BioGRID database.

In [9]:
dataset = pd.concat([dataset, gene_interactions])

In [10]:
nodes = pd.concat([dataset["InteractorA"], dataset["InteractorB"]]).unique()
print("Number of nodes: {0}".format(len(nodes)))

Number of nodes: 13010


Finally, save the interactions and nodes to CSV files for pathway enrichment.

In [11]:
# Save interactions and nodes dataset to csv
dataset.to_csv("datasets/interactions.csv")
pd.DataFrame(nodes).to_csv("datasets/genes.csv")
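
Note that `to_csv` writes the DataFrame index as the first column by default. The sketch below uses an in-memory buffer in place of a file and a made-up row to show what happens when such a file is read back with and without `index_col=0`.

```python
import io

import pandas as pd

toy_edges = pd.DataFrame({"InteractorA": ["CCNF"], "InteractorB": ["ZWINT"]})

# Write to an in-memory buffer; the index becomes a first column with an empty header
buffer = io.StringIO()
toy_edges.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading back without index_col exposes the saved index as "Unnamed: 0"
reloaded_default = pd.read_csv(io.StringIO(csv_text))

# Reading back with index_col=0 restores it as the index
reloaded_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(reloaded_default.columns))  # ['Unnamed: 0', 'InteractorA', 'InteractorB']
print(list(reloaded_indexed.columns))  # ['InteractorA', 'InteractorB']
```

Passing `index=False` to `to_csv` would be an alternative when the index carries no information.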

## **Pathway enrichment**
Find all the diseases linked to the nodes retrieved by the previous step.

In [12]:
import gseapy as gp

List all the libraries from which we can retrieve pathways by using the gseapy package.

In [13]:
gp.get_library_name()

['ARCHS4_Cell-lines',
 'ARCHS4_IDG_Coexp',
 'ARCHS4_Kinases_Coexp',
 'ARCHS4_TFs_Coexp',
 'ARCHS4_Tissues',
 'Achilles_fitness_decrease',
 'Achilles_fitness_increase',
 'Aging_Perturbations_from_GEO_down',
 'Aging_Perturbations_from_GEO_up',
 'Allen_Brain_Atlas_10x_scRNA_2021',
 'Allen_Brain_Atlas_down',
 'Allen_Brain_Atlas_up',
 'Azimuth_Cell_Types_2021',
 'BioCarta_2013',
 'BioCarta_2015',
 'BioCarta_2016',
 'BioPlanet_2019',
 'BioPlex_2017',
 'CCLE_Proteomics_2020',
 'CORUM',
 'COVID-19_Related_Gene_Sets',
 'COVID-19_Related_Gene_Sets_2021',
 'Cancer_Cell_Line_Encyclopedia',
 'CellMarker_Augmented_2021',
 'ChEA_2013',
 'ChEA_2015',
 'ChEA_2016',
 'Chromosome_Location',
 'Chromosome_Location_hg19',
 'ClinVar_2019',
 'DSigDB',
 'Data_Acquisition_Method_Most_Popular_Genes',
 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019',
 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019',
 'Descartes_Cell_Types_and_Tissue_2021',
 'DisGeNET',
 'Disease_Perturbations_from_GEO_down',
 'Disease_Perturbati

Obtain all the pathways connected to our nodes. In our case we use the DisGeNET library.

In [14]:
import os

if os.path.exists("datasets/diseases_pathways.csv"):
    df_diseases = pd.read_csv("datasets/diseases_pathways.csv", sep=",", index_col=0)
else:
    enr = gp.enrichr(
        gene_list=pd.DataFrame(nodes),
        gene_sets=['DisGeNET'],  # Datasets from the gp.get_library_name() method
        organism='Human',
        description='DEGs_up_1d',
        outdir='test'
    )

    # Keep those pathways with an adjusted p-value < 0.1
    df_diseases = enr.results[enr.results["Adjusted P-value"] < 0.1][["Term", "Overlap", "P-value", "Adjusted P-value", "Genes"]]

Preview the retained pathways, i.e. those with an adjusted p-value < 0.1.

In [15]:
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
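
The `Overlap` and `Genes` columns are stored as strings (`hits/size` and `;`-separated symbols). A minimal sketch of how they could be turned into numeric and list columns is given below. The toy rows, the gene names and the new column names (`Hits`, `PathwaySize`, `OverlapRatio`, `GeneList`) are assumptions made for illustration, not results of the enrichment.

```python
import pandas as pd

# Hypothetical rows mimicking the layout of the enrichment results
toy_results = pd.DataFrame({
    "Term": ["Disease A", "Disease B"],
    "Overlap": ["3/4", "2/10"],
    "Genes": ["GENE1;GENE2;GENE3", "GENE1;GENE4"],
})

# Split "hits/size" into two integer columns
overlap_parts = toy_results["Overlap"].str.split("/", expand=True).astype(int)
toy_results["Hits"] = overlap_parts[0]
toy_results["PathwaySize"] = overlap_parts[1]

# Fraction of the pathway's genes that appear in our node set
toy_results["OverlapRatio"] = toy_results["Hits"] / toy_results["PathwaySize"]

# Turn the ";"-separated gene string into a Python list per pathway
toy_results["GeneList"] = toy_results["Genes"].str.split(";")

print(toy_results[["Term", "Hits", "PathwaySize", "OverlapRatio"]])
```

Columns in this form may be easier to use in the later steps that merge pathways into a graph.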


Save the pathways to a CSV file, as done for the interactions and nodes.

In [16]:
df_diseases.to_csv("datasets/diseases_pathways.csv")