**Computational Health Laboratory Project, A.Y. 2021/2022**

**Authors:** Niko Dalla Noce, Alessandro Ristori, Andrea Zuppolini

**Project:** Starting from one or more genes, extract from interaction databases the genes they interact with. Using the expanded gene set, perform pathway analysis and obtain all disease pathways in which the genes appear. Merge the pathways to obtain a larger graph. Perform further network analysis to extract central biomarkers and communities beyond pathways. Compute a distance between the initial gene set and the various pathways (diseases).

# **CHL Project, Pathway Enrichment**

## **Colab setup**
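When running on Colab, clone the repository, install the dependencies, mount Google Drive and copy the configuration file into the project folder.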

In [1]:
if 'google.colab' in str(get_ipython()):
    import subprocess
    from google.colab import drive
    out_clone = subprocess.run(["git", "clone", "https://github.com/nikodallanoce/ComputationalHealthLaboratory"], text=True, capture_output=True)
    print("{0}{1}".format(out_clone.stdout, out_clone.stderr))
    %pip install -U PyYAML
    %pip install gseapy
    drive.mount("/content/drive/")
    %cp "/content/drive/Shareddrives/CHL/config.yml" "/content/ComputationalHealthLaboratory"
    %cd ComputationalHealthLaboratory

## **Obtain all the genes that interact with the starting one**
Starting from a gene, retrieve its neighbours and the interactions between them.


In [2]:
import requests
import pandas as pd
import numpy as np
from config import ACCESS_KEY, BASE_URL

Some utility functions needed to create and clean the interactions dataset.

In [3]:
def build_starting_gene_interactions(interactions_dataset: str) -> pd.DataFrame:
    # Build the interactions dataframe, choose and rename the needed columns
    gene_starting_interactions = pd.read_table(interactions_dataset)
    gene_starting_interactions.rename(columns={"Official Symbol Interactor A": "InteractorA",
                                      "Official Symbol Interactor B": "InteractorB"}, inplace=True)
    # Work on a copy so that the assignments below do not touch a view of the raw table
    gene_starting_interactions = gene_starting_interactions[["InteractorA", "InteractorB"]].copy()

    # Uppercase the gene symbols in each interaction
    gene_starting_interactions["InteractorA"] = gene_starting_interactions["InteractorA"].str.upper()
    gene_starting_interactions["InteractorB"] = gene_starting_interactions["InteractorB"].str.upper()
    return gene_starting_interactions


def remove_duplicated_interactions(interactions_dataset: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    """
    Remove all the duplicated interactions from the interactions dataframe
    :param interactions_dataset: dataframe of interactions retrieved from BioGRID
    :param verbose: print or not the amount of removed and kept interactions
    :return: cleaned interactions dataframe
    """
    # Look for duplicated interactions
    duplicated_interactions = pd.DataFrame(np.sort(
        interactions_dataset[["InteractorA", "InteractorB"]].values, 1)).duplicated()
    if verbose:
        print("Duplicated interactions:\n{0}".format(duplicated_interactions.value_counts()))

    # Delete such interactions from the dataset
    cleaned_interactions_dataset = interactions_dataset[~duplicated_interactions.values]
    return cleaned_interactions_dataset


def remove_self_loop_interactions(interactions_dataset: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    """
    Removes self-loop interactions from the interactions dataframe
    :param interactions_dataset: dataframe of interactions retrieved from BioGRID
    :param verbose: print or not the amount of removed and kept interactions
    :return: cleaned interactions dataframe
    """
    # Look for interactions where both proteins are the same
    same_proteins_interactions = pd.Series(interactions_dataset[["InteractorA", "InteractorB"]].nunique(axis=1) == 1)
    if verbose:
        print("Useless interactions:\n{0}".format(same_proteins_interactions.value_counts()))

    # Delete such interactions from the dataset
    cleaned_interactions_dataset = interactions_dataset[~same_proteins_interactions.values]
    return cleaned_interactions_dataset
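
The two cleaning helpers above can be checked on a small example. The following is a minimal sketch that repeats their core logic on a toy dataframe (the rows are made up for illustration and are not BioGRID data): sorting the two symbols within each row makes `(A, B)` and `(B, A)` identical, so `duplicated()` also flags reversed pairs, while a self-loop is a row with a single distinct symbol.

```python
import numpy as np
import pandas as pd

# Toy interactions (illustrative rows, not taken from BioGRID)
toy = pd.DataFrame({
    "InteractorA": ["SON", "CLK2", "SON", "SON"],
    "InteractorB": ["CLK2", "SON", "SON", "CCNF"],
})

# Sort the two symbols within each row so that (A, B) and (B, A) become identical
sorted_pairs = pd.DataFrame(np.sort(toy[["InteractorA", "InteractorB"]].values, axis=1))
is_duplicate = sorted_pairs.duplicated()  # True for every repeat after the first occurrence

# A self-loop has exactly one distinct symbol in its row
is_self_loop = toy[["InteractorA", "InteractorB"]].nunique(axis=1) == 1

# Keep only rows that are neither a repeat nor a self-loop
cleaned = toy[~is_duplicate.values & ~is_self_loop.values]
print(cleaned)
```

Here the reversed pair `CLK2`/`SON` and the `SON`/`SON` self-loop are dropped, leaving the first and last rows.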

This step assumes that the starting gene's interactions have already been downloaded from BioGRID in the tab3 format. Build the pandas dataframe from that file.

In [4]:
starting_interactions = build_starting_gene_interactions("datasets/BIOGRID.tab3.txt")
starting_interactions.tail()

Unnamed: 0,InteractorA,InteractorB
149,NSP8,SON
150,NSP9,SON
151,ORF6,SON
152,ORF8,SON
153,CCNF,SON


We need to remove duplicated and self-loop interactions from the dataframe.

In [5]:
starting_interactions = remove_duplicated_interactions(starting_interactions)

Duplicated interactions:
False    146
True       8
dtype: int64


In [6]:
starting_interactions = remove_self_loop_interactions(starting_interactions)

Useless interactions:
False    146
dtype: int64


Save the cleaned dataframe into a csv file.

In [21]:
starting_interactions.to_csv("datasets/geneset.csv")

Obtain all the proteins that interact with our starting gene.

In [8]:
proteins_list = set(starting_interactions["InteractorA"])
proteins_list.update(starting_interactions["InteractorB"])
proteins_list.remove("SON")  # Do not consider the starting gene
proteins_list = list(proteins_list)

## **Expand the interactions dataset**
Expand the dataset by requesting, in small batches, the interactions of the proteins obtained in the previous step.

In [9]:
request_url = BASE_URL + "/interactions"
data = {}

step = 5
for i in range(0, len(proteins_list), step):
    end = i+step
    if end >= len(proteins_list):
        end = len(proteins_list)
    
    # Genes to search for in this batch
    gene_list = proteins_list[i:end]

    params = {
        "accesskey": ACCESS_KEY,
        "format": "json",  # Return results in JSON format
        "geneList": "|".join(gene_list),  # Must be | separated
        "searchNames": "true",  # Search against official names
        "includeInteractors": "true",  # Set to true to get any interaction involving EITHER gene, set to false to get interactions between genes
        "includeInteractorInteractions": "false",  # Set to true to get interactions between the geneList’s first order interactors
        "includeEvidence": "false",  # If false "evidenceList" is evidence to exclude, if true "evidenceList" is evidence to show
        "selfInteractionsExcluded": "true", # If true no self-interactions will be included
    }

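    r = requests.get(request_url, params=params)
    r.raise_for_status()  # Fail early on HTTP errors
    interactions = r.json()

    # Stop if the response reached the allowed maximum (10000), since results may be truncated
    if len(interactions) == 10000:
        raise RuntimeError("Reached 10000 interactions in a single request, reduce `step`")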

    # Create a hash of results by interaction identifier
    for interaction_id, interaction in interactions.items():
        data[interaction_id] = interaction
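
The batching used in the loop above can be isolated into a small helper. This is a minimal sketch, assuming a hypothetical `chunked` function (not part of the project code) and a handful of gene symbols picked for illustration; it shows only how the `geneList` parameter is built for each batch and performs no request.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        # Slicing clamps at the end of the list, so no explicit bound check is needed
        yield items[start:start + size]


# Illustrative gene symbols; the real list comes from `proteins_list`
genes = ["CLK2", "SRPK2", "CCNF", "UBE2I", "RBM6", "AP2M1", "YTHDF1"]

# One "|"-separated string per request
gene_list_params = ["|".join(batch) for batch in chunked(genes, 5)]
print(gene_list_params)  # ['CLK2|SRPK2|CCNF|UBE2I|RBM6', 'AP2M1|YTHDF1']
```

Small batches keep each response below the per-request limit checked in the loop, at the cost of more requests.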

In [10]:
# Load the data into a pandas dataframe
dataset = pd.DataFrame.from_dict(data, orient="index")

# Re-order the columns and select only the columns we want to see
columns = ["OFFICIAL_SYMBOL_A", "OFFICIAL_SYMBOL_B"]
dataset = dataset[columns]

# Rename the columns and make all the values uppercase
dataset = dataset.rename(columns={"OFFICIAL_SYMBOL_A": "InteractorA", "OFFICIAL_SYMBOL_B": "InteractorB"})
dataset["InteractorA"] = dataset["InteractorA"].str.upper()
dataset["InteractorB"] = dataset["InteractorB"].str.upper()

# Print the dataframe
dataset.tail()

Unnamed: 0,InteractorA,InteractorB
3204002,YTHDF1,CLK2
3204091,RBM6,CLK2
3204238,AP2M1,CLK2
3204416,CLK2,UBE2I
3205088,MRPS21,CLK2


Drop duplicated interactions, since they are not interesting from our point of view.

In [11]:
dataset = remove_duplicated_interactions(dataset)

Duplicated interactions:
False    79296
True     25281
dtype: int64


Drop self-loops since they're useless for our analysis.

In [12]:
dataset = remove_self_loop_interactions(dataset)

Useless interactions:
False    79283
True        13
dtype: int64


Merge the starting gene's interactions with the ones obtained from the requests to BioGRID.

In [13]:
dataset = pd.concat([dataset, starting_interactions])

In [14]:
nodes = pd.concat([dataset["InteractorA"], dataset["InteractorB"]]).unique()
print("Number of nodes: {0}".format(len(nodes)))

Number of nodes: 13010


Finally, save the interactions and nodes to CSV files for pathway enrichment.

In [15]:
# Save interactions and nodes dataset to csv
dataset.to_csv("datasets/interactions.csv")
pd.DataFrame(nodes).to_csv("datasets/genes.csv")
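
Because `to_csv` writes the dataframe index as an unnamed first column, the files saved above need `index_col=0` when they are read back, which is what the pathway cell later does for its own file. A minimal sketch of the round trip, using an in-memory buffer instead of a file and two rows copied from the output of the starting interactions:

```python
import io

import pandas as pd

# Two rows from the starting interactions, with their original index labels
toy = pd.DataFrame(
    {"InteractorA": ["NSP8", "CCNF"], "InteractorB": ["SON", "SON"]},
    index=[149, 153],
)

# Write to an in-memory buffer; the index becomes an unnamed first column
csv_text = toy.to_csv()

# Reading without index_col keeps that column as data, named "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores the original index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(raw.columns.tolist())
print(restored)
```

Passing `index=False` to `to_csv` is an alternative when the index labels are not needed downstream.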

## **Pathway enrichment**
Find all the diseases linked to the nodes retrieved by the previous step.

In [16]:
import gseapy as gp

Obtain all the pathways connected to our nodes. In our case we use the DisGeNET gene-set library. If the results were already saved by a previous run, load them instead of querying again.

In [17]:
import os

if os.path.exists("datasets/diseases_pathways.csv"):
    df_diseases = pd.read_csv("datasets/diseases_pathways.csv", sep=",", index_col=0)
else:
    enr = gp.enrichr(gene_list=pd.DataFrame(nodes),
                      gene_sets=['DisGeNET'],  # Datasets from the gp.get_library_name() method
                      organism='Human',
                      description='DEGs_up_1d',
                      outdir='test'
                  )

    # Keep those pathways with an adjusted p-value < 0.1
    df_diseases = enr.results[enr.results["Adjusted P-value"] < 0.1][["Term", "Overlap", "P-value", "Adjusted P-value", "Genes"]]

Inspect the last rows of the retained pathways (adjusted p-value < 0.1).

In [18]:
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
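
The `Overlap` and `Genes` columns are stored as strings (`hits/size` and `;`-separated symbols), so they usually need parsing before further analysis. The following is a minimal sketch on a toy table whose terms and values are made up for illustration; the column names `OverlapRatio` and `GeneList` are assumptions and are not part of the enrichment output.

```python
import pandas as pd

# Toy enrichment rows (terms and numbers are made up for illustration)
toy = pd.DataFrame({
    "Term": ["Disease A", "Disease B", "Disease C"],
    "Overlap": ["55/69", "47/58", "3/40"],
    "Adjusted P-value": [0.02, 0.099, 0.35],
    "Genes": ["CLK2;CCNF;UBE2I", "RBM6;AP2M1", "YTHDF1"],
})

# Same filter as in the notebook: keep terms with adjusted p-value < 0.1
significant = toy[toy["Adjusted P-value"] < 0.1].copy()

# Split "hits/size" into two integer columns and compute the covered fraction
hits_total = significant["Overlap"].str.split("/", expand=True).astype(int)
significant["OverlapRatio"] = hits_total[0] / hits_total[1]

# Turn the ";"-separated symbols into Python lists
significant["GeneList"] = significant["Genes"].str.split(";")
print(significant[["Term", "OverlapRatio", "GeneList"]])
```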


Save the pathways to a CSV file, as done for the interactions and nodes.

In [19]:
df_diseases.to_csv("datasets/diseases_pathways.csv")