**Computational Health Laboratory Project, A.Y. 2021/2022**

**Authors:** Niko Dalla Noce, Alessandro Ristori, Andrea Zuppolini

# **CHL Project, Pathway Analysis**

## **Colab setup**
Takes care of the project setup on Colab.

In [1]:
if 'google.colab' in str(get_ipython()):
    import subprocess
    from google.colab import drive
    out_clone = subprocess.run(["git", "clone", "https://github.com/nikodallanoce/ComputationalHealthLaboratory"], text=True, capture_output=True)
    print("{0}{1}".format(out_clone.stdout, out_clone.stderr))
    %pip install -U PyYAML
    drive.mount("/content/drive/")
    %cp "/content/drive/Shareddrives/CHL/config.yml" "/content/ComputationalHealthLaboratory"
    %cd ComputationalHealthLaboratory

Cloning into 'ComputationalHealthLaboratory'...

Collecting PyYAML
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: PyYAML
  Attempting uninstall: PyYAML
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed PyYAML-6.0
Mounted at /content/drive/
/content/ComputationalHealthLaboratory


## **Obtain all the genes that interact with the starting one**
Starting from a gene, obtain its neighbours and the interactions between them.


In [2]:
import requests
import json
import pandas as pd
import numpy as np
import re
from config import ACCESS_KEY, BASE_URL

In [3]:
gene_interactions = pd.read_csv("datasets/geneset.csv", sep=";")
gene_interactions["InteractorA"] = gene_interactions["InteractorA"].str.upper()
gene_interactions.drop_duplicates(inplace=True)
proteins_list = list(gene_interactions["InteractorA"])  # all the proteins that interact with our starting gene
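
The order of the two clean-up steps matters: upper-casing first lets case variants of the same symbol collapse into a single row. A minimal sketch on an in-memory table, assuming the same two columns as the CSV; the toy rows are illustrative and are not read from `datasets/geneset.csv`.

```python
import pandas as pd

# Toy interaction table (illustrative rows, not the real gene set)
toy = pd.DataFrame({
    "InteractorA": ["ywhag", "YWHAG", "Sirt7"],
    "InteractorB": ["SON", "SON", "SON"],
})

# Upper-case the symbols first so that case variants become identical rows
toy["InteractorA"] = toy["InteractorA"].str.upper()

# Only now do the two YWHAG rows count as duplicates
toy = toy.drop_duplicates()

toy_proteins = toy["InteractorA"].tolist()
print(toy_proteins)
```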

In [4]:
gene_interactions

Unnamed: 0,InteractorA,InteractorB
0,YWHAG,SON
1,YWHAB,SON
3,SIRT7,SON
4,TCF3,SON
5,SF3B1,SON
...,...,...
149,NSP8,SON
150,NSP9,SON
151,ORF6,SON
152,ORF8,SON


## **Expand the network**
Build the protein-to-protein network using the interactions obtained from the previous step.

In [6]:
request_url = BASE_URL + "/interactions"
data = {}

step = 146
for i in range(0, len(proteins_list), step):
    end = i+step
    if end >= len(proteins_list):
        end = len(proteins_list)
    
    # List of genes to search for
    gene_list = proteins_list[i:end] # ["SRPK2"]

    params = {
        "accesskey": ACCESS_KEY,
        "format": "json",  # Return results in JSON format
        "geneList": "|".join(gene_list),  # Must be | separated
        "searchNames": "true",  # Search against official names
        "includeInteractors": "false",  # Set to true to get any interaction involving EITHER gene, set to false to get interactions between genes
        "includeInteractorInteractions": "false",  # Set to true to get interactions between the geneList’s first order interactors
        "includeEvidence": "false",  # If false "evidenceList" is evidence to exclude, if true "evidenceList" is evidence to show
        "selfInteractionsExcluded": "true", # If true no self-interactions will be included
    }

    r = requests.get(request_url, params=params)
    interactions = r.json()
    
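    # Check if the interactions reached the allowed number per request;
    # exactly 10000 results suggests the response was truncated
    if len(interactions) == 10000:
        raise RuntimeError("Interaction limit reached: lower `step` and retry")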

    # Create a hash of results by interaction identifier
    for interaction_id, interaction in interactions.items():
        data[interaction_id] = interaction
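
The batching logic in the loop above can be isolated into a small helper. This is only a sketch: `gene_batches` is a hypothetical name, not part of the project or of any BioGRID client, and it performs no network request.

```python
from typing import Iterator, List


def gene_batches(genes: List[str], size: int) -> Iterator[str]:
    """Yield '|'-joined batches of at most `size` gene symbols."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(genes), size):
        # Slicing past the end of a list is safe, so no explicit clamp is needed
        yield "|".join(genes[start:start + size])


batches = list(gene_batches(["SON", "MYC", "EZH2", "BRD4", "VHL"], size=2))
print(batches)
```

Each yielded string can be passed as the `geneList` parameter of one request, which keeps the request loop free of index arithmetic.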

In [7]:
# Load the data into a pandas dataframe
dataset = pd.DataFrame.from_dict(data, orient="index")

# Re-order the columns and select only the columns we want to see
columns = ["OFFICIAL_SYMBOL_A", "OFFICIAL_SYMBOL_B"]
dataset = dataset[columns]

# Rename the columns and make all the values uppercase
dataset = dataset.rename(columns={"OFFICIAL_SYMBOL_A": "InteractorA", "OFFICIAL_SYMBOL_B": "InteractorB"})
dataset["InteractorA"] = dataset["InteractorA"].str.upper()
dataset["InteractorB"] = dataset["InteractorB"].str.upper()

# Print the dataframe
dataset

Unnamed: 0,InteractorA,InteractorB
17282,SFPQ,NONO
22627,EZH2,EED
119679,SRPK2,U2AF2
120105,SRSF6,RNPS1
120300,U2AF2,PUF60
...,...,...
3324902,BRD4,HIST1H4A
3324964,BRD3,NFIA
3324983,NSP10,NSP16
3325359,SFPQ,NONO


Drop duplicated interactions; they are not interesting from our point of view.

In [8]:
# Look for duplicated interactions
duplicated_interactions = pd.DataFrame(np.sort(dataset[["InteractorA", "InteractorB"]].values, 1)).duplicated()
print("Duplicated interactions:\n{0}".format(duplicated_interactions.value_counts()))

# Delete such interactions from the dataset
dataset = dataset[~duplicated_interactions.values]

Duplicated interactions:
False    2636
True     1661
dtype: int64
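
The duplicate check treats an interaction as an unordered pair: sorting each row makes (A, B) and (B, A) identical before `duplicated()` is applied. A small sketch on toy rows, reusing a few symbols and identifiers from the table above purely for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"InteractorA": ["SFPQ", "NONO", "EZH2"],
     "InteractorB": ["NONO", "SFPQ", "EED"]},
    index=[17282, 3325359, 22627],
)

# Sort the two symbols within each row so the pair order no longer matters
sorted_pairs = pd.DataFrame(np.sort(toy[["InteractorA", "InteractorB"]].values, axis=1))

# duplicated() flags every repeat after the first occurrence
is_duplicate = sorted_pairs.duplicated()

# sorted_pairs has a fresh 0..n-1 index, so index toy with the raw boolean array
deduplicated = toy[~is_duplicate.values]
print(deduplicated)
```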


Drop self-loops since they're useless for our analysis.

In [9]:
# Look for interactions where both proteins are the same
same_proteins_interactions = pd.DataFrame(dataset[["InteractorA", "InteractorB"]].nunique(axis=1) == 1)
print("Useless interactions:\n{0}".format(same_proteins_interactions.value_counts()))

# Delete such interactions from the dataset
dataset = dataset[~same_proteins_interactions.values]

Useless interactions:
False    2623
True       13
dtype: int64
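
For two columns, a self-loop can also be detected by comparing the columns directly. The sketch below, on illustrative rows, checks that this gives the same mask as the `nunique(axis=1) == 1` test used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "InteractorA": ["MYC", "BRD4", "SON"],
    "InteractorB": ["MYC", "NFIA", "SON"],
})

# A self-loop is a row whose two interactors are the same symbol
is_self_loop = toy["InteractorA"] == toy["InteractorB"]

# Same mask computed the way the notebook does it
via_nunique = toy[["InteractorA", "InteractorB"]].nunique(axis=1) == 1

without_loops = toy[~is_self_loop]
print(without_loops)
```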


Merge the interactions of the starting gene with those returned by the BioGRID requests.

In [10]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
dataset = pd.concat([dataset, gene_interactions])

In [11]:
nodes = pd.concat([dataset["InteractorA"], dataset["InteractorB"]]).unique()
# If we considered only the initial nodes, appending to genes would be enough
print("Number of nodes: {0}".format(len(nodes)))

Number of nodes: 147
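
A compact way to collect the node set is to flatten both interactor columns and keep the unique symbols. This is a sketch on toy frames (`expanded` and `seed` are hypothetical stand-ins for the expanded network and the starting gene set); the resulting order differs from the column-by-column order used above, but the set of nodes is the same.

```python
import pandas as pd

expanded = pd.DataFrame({"InteractorA": ["SFPQ", "EZH2"],
                         "InteractorB": ["NONO", "EED"]})
seed = pd.DataFrame({"InteractorA": ["SFPQ", "YWHAG"],
                     "InteractorB": ["SON", "SON"]})

# Stack the two edge lists into one table
combined = pd.concat([expanded, seed], ignore_index=True)

# Flatten both columns and keep each symbol once (order of first appearance)
toy_nodes = pd.unique(combined[["InteractorA", "InteractorB"]].values.ravel())
print(len(toy_nodes), list(toy_nodes))
```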


In [12]:
# Save interactions and nodes dataset to csv
dataset.to_csv("interactions.csv")
pd.DataFrame(nodes).to_csv("genes.csv")

## **Draw the network**
Visualize the interactions between the proteins.

In [43]:
import networkx as nx
import matplotlib.pyplot as plt
from matplotlib import cm

In [44]:
protein_graph=nx.Graph(name='Protein Interaction Graph')
interactions = np.array(dataset)
for interaction in interactions:
    a = interaction[0] # protein a node
    b = interaction[1] # protein b node
    protein_graph.add_edges_from([(a,b)]) # add unweighted edge to graph

In [45]:
# function to rescale list of values to range [newmin,newmax]
def rescale(l,newmin,newmax):
    arr = list(l)
    return [(x-min(arr))/(max(arr)-min(arr))*(newmax-newmin)+newmin for x in arr]

# use the matplotlib coolwarm colormap, resampled to 12 discrete colours
graph_colormap = plt.get_cmap('coolwarm', 12)

# node color varies with Degree
c = rescale([protein_graph.degree(v) for v in protein_graph], 0.0, 0.9) 
c = [graph_colormap(i) for i in c]

# node size varies with betweenness centrality - map to range [400,500]
bc = nx.betweenness_centrality(protein_graph) # betweenness centrality
s =  rescale([v for v in bc.values()], 400, 500)
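
The `rescale` helper divides by `max - min`, which is zero when all values are equal (for example, a graph where every node has the same degree). A hedged variant that guards this case is sketched below; `rescale_safe` is a hypothetical name, and mapping constant input to the midpoint of the target range is one possible choice, not the notebook's behaviour.

```python
def rescale_safe(values, new_min, new_max):
    """Linearly map values to [new_min, new_max], tolerating constant input."""
    arr = list(values)
    lo, hi = min(arr), max(arr)
    if hi == lo:
        # All values are equal: avoid dividing by zero and use the midpoint
        midpoint = (new_min + new_max) / 2
        return [midpoint for _ in arr]
    scale = (new_max - new_min) / (hi - lo)
    return [(x - lo) * scale + new_min for x in arr]


print(rescale_safe([1, 2, 3], 400, 500))
print(rescale_safe([5, 5], 0.0, 0.9))
```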

In [None]:
pos = nx.spring_layout(protein_graph)
plt.figure(figsize=(22, 22), facecolor=[0.7, 0.7, 0.7, 0.4])
nx.draw_networkx(protein_graph, pos=pos, with_labels=True, node_color=c, edgelist=np.array(dataset), node_size=s, font_color='white',font_weight='bold', font_size='9')
plt.axis('off')
plt.show()

## **Pathway analysis**

In [13]:
!pip install gseapy

Collecting gseapy
  Downloading gseapy-0.10.8-py3-none-any.whl (526 kB)
Collecting bioservices
  Downloading bioservices-1.8.4.tar.gz (1.0 MB)
Collecting grequests
  Downloading grequests-0.6.0-py3-none-any.whl (5.2 kB)
Collecting requests_cache
  Downloading requests_cache-0.9.3-py3-none-any.whl (47 kB)
Collecting easydev>=0.9.36
  Downloading easydev-0.12.0.tar.gz (47 kB)
Collecting xmltodict
  Downloading xmltodict-0.12.0-py2.py3-none-any.whl (9.2 kB)
Collecting suds-community
  Downloading suds_community-1.1.0-py3-none-any.whl (144 kB)
Collecting colorlog
  Downloading colorlog-6.6.0-py2.py3-none-any.whl (11 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl

In [None]:
import gseapy as gp
from gseapy.plot import barplot

In [None]:
gp.get_library_name()

In [16]:
enr = gp.enrichr(gene_list=pd.DataFrame(nodes),
                  gene_sets=['Reactome_2016', 'KEGG_2021_Human'],
                  organism='Human',
                  description='DEGs_up_1d',
                  outdir='test',
                  cutoff=0.05
              )

In [19]:
enr.results

Unnamed: 0,Gene_set,Term,Overlap,P-value,Adjusted P-value,Old P-value,Old Adjusted P-value,Odds Ratio,Combined Score,Genes
0,DisGeNET,Primary malignant neoplasm,27/1032,7.714301e-09,0.000022,0,0,4.219701,78.824826,CCNF;U2AF1;PHB;TFPI;USP39;BMI1;TRIM28;MYC;UBR5...
1,DisGeNET,Childhood Medulloblastoma,8/93,4.227732e-07,0.000385,0,0,13.385019,196.444293,BRD3;MYC;KIF14;BMI1;ATOH1;ESR2;BRD4;EZH2
2,DisGeNET,Adult Medulloblastoma,7/64,4.393217e-07,0.000385,0,0,17.364912,254.188173,BRD3;MYC;KIF14;ATOH1;ESR2;BRD4;EZH2
3,DisGeNET,Malignant Neoplasms,29/1438,5.518867e-07,0.000385,0,0,3.217067,46.357689,SMARCB1;CCNF;U2AF1;PHB;TFPI;USP39;BMI1;TRIM28;...
4,DisGeNET,Mammary Neoplasms,39/2387,9.697201e-07,0.000523,0,0,2.692185,37.276687,VCP;SRSF1;KIF14;PSIP1;PHB;TFPI;PRPF19;BMI1;NR2...
...,...,...,...,...,...,...,...,...,...,...
2783,DisGeNET,Retinal Diseases,1/355,9.288102e-01,0.930145,0,0,0.377273,0.027862,VHL
2784,DisGeNET,Tuberculosis,2/602,9.384065e-01,0.939417,0,0,0.442598,0.028137,NXF1;BRD4
2785,DisGeNET,Drug-Induced Liver Disease,1/377,9.396582e-01,0.940333,0,0,0.354798,0.022082,S100A9
2786,DisGeNET,Lung diseases,1/393,9.465005e-01,0.946840,0,0,0.340037,0.018697,NXF1
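
The table above still lists terms with adjusted p-values close to 1, so significant terms have to be selected explicitly. A minimal sketch on three rows copied from that table; the 0.05 threshold mirrors the `cutoff` value passed earlier, and the `Hits` and `Term size` column names are assumptions introduced here for illustration.

```python
import pandas as pd

toy_results = pd.DataFrame({
    "Term": ["Primary malignant neoplasm", "Childhood Medulloblastoma", "Retinal Diseases"],
    "Overlap": ["27/1032", "8/93", "1/355"],
    "Adjusted P-value": [0.000022, 0.000385, 0.930145],
})

# Keep only the terms below the significance threshold
significant = toy_results[toy_results["Adjusted P-value"] < 0.05].copy()

# "Overlap" is stored as "hits/term size"; split it into two integer columns
hits_and_size = significant["Overlap"].str.split("/", expand=True).astype(int)
significant["Hits"] = hits_and_size[0]
significant["Term size"] = hits_and_size[1]
print(significant)
```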


In [37]:
enr.results.to_csv("pathways.csv")

In [None]:
for node in nodes:
    print(node)

In [None]:
tmp = pd.DataFrame(enr.results)
tmp = tmp.head(tmp.shape[0]-111) # Remove KEGG entries

In [None]:
enr.results["Term"].str.extractall(r"(R-HSA-.*)")

,match,0
0,0,R-HSA-74160
1,0,R-HSA-1640170
2,0,R-HSA-69278
3,0,R-HSA-5663205
4,0,R-HSA-68886
...,...,...
1516,0,R-HSA-975298
1517,0,R-HSA-425428
1518,0,R-HSA-192456
1519,0,R-HSA-1187000
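
The pattern `R-HSA-.*` captures everything from the identifier to the end of the term. A stricter pattern that stops after the digits can be sketched with the standard `re` module; the term strings below are illustrative placeholders built around an identifier from the output above, and `reactome_id` is a hypothetical helper name.

```python
import re

# Reactome human stable identifiers: the "R-HSA-" prefix followed by digits
REACTOME_ID = re.compile(r"R-HSA-\d+")


def reactome_id(term):
    """Return the Reactome identifier found in `term`, or None if absent."""
    match = REACTOME_ID.search(term)
    return match.group(0) if match else None


print(reactome_id("Example pathway Homo sapiens R-HSA-74160"))
print(reactome_id("Example term without identifier"))
```

Returning `None` for terms without an identifier makes it easy to spot rows from gene sets that do not use Reactome identifiers.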


In [None]:
# Keep only the Reactome stable identifier of each term (e.g. R-HSA-74160);
# extract() returns one value per row, so the result stays aligned with tmp
tmp["Term"] = tmp["Term"].str.extract(r"(R-HSA-\d+)", expand=False)
tmp.to_csv("modified_pathways.csv")

## **DisGeNET test**

In [35]:
list_genes_disgenet = ",".join(nodes)

In [36]:
list_genes_disgenet

'SFPQ,EZH2,SRPK2,SRSF6,U2AF2,SRSF4,HNRNPM,SRSF1,U2AF1,YWHAG,YWHAB,RNPS1,SEPT2,PUF60,VHL,MYC,RBM39,SAFB,PPP1CA,EP300,SMARCB1,TCF3,SNIP1,SRPK1,SF3B6,ESR1,ILF3,SRSF11,HSPA8,ESR2,MAGOH,RBX1,NXF1,PRPF6,RAN,CDC5L,EED,BMI1,USP39,SIRT7,SAP18,SRRM2,GAG,DHX9,PRC1,NONO,PHB,RNF2,TRIM28,BRD4,FBXW11,PRPF3,MKI67,NHP2L1,RBBP6,EIF4A3,OBSL1,MAGEA6,PPP1CC,JMJD6,HBP1,VPR,CCNF,SF3B1,SNRNP70,SRSF7,SRSF5,ILF2,PRPF19,EFTUD2,ACIN1,BUD31,DDX21,PRPF40A,S100A9,SEPT7,NFIA,PHGDH,PSIP1,DIDO1,HP1BP3,ECT2,VCP,DHX8,HDAC11,FBXO7,CLK2,SRPK3,HIST1H4A,UBR5,FAM96B,CUL7,SUZ12,RBM4B,UBE2A,ZC3H18,IFI16,KIF20A,NR2C2,KIFAP3,TRIM26,CHMP4B,CIT,MKRN1,CHCHD1,ATOH1,C11ORF30,FANCD2,KIF14,FGF11,DCPS,KIAA1429,RC3H2,ACTC1,PRDM16,MECOM,PINK1,NSP7,RPL13,NSP10,NSP8,NSP16,NSP5,NSP9,ORF6,ORF8,KIF23,BRD3,CIC,NSP11,NSP15,MKRN3,NUP50,NAA40,ZBTB2,ZCCHC10,HSD17B14,PEA15,SLFN11,CHCHD4,USP25,TFPI,NDUFAF2,SAMD7,RBM17,CLIP4,SON'

In [27]:
'''
Script example to use the DisGeNET REST API with the new authentication system
'''

#For this example we are going to use the requests HTTP library
import os
import requests

#Build a dict with your DisGeNET account credentials; if you don't have an account you can create one here https://www.disgenet.org/signup/
#The credentials are read from environment variables instead of being hard-coded (DISGENET_EMAIL and DISGENET_PASSWORD are assumed names)
auth_params = {"email": os.environ["DISGENET_EMAIL"], "password": os.environ["DISGENET_PASSWORD"]}

api_host = "https://www.disgenet.org/api"

api_key = None
s = requests.Session()
try:
    r = s.post(api_host+'/auth/', data=auth_params)
    if(r.status_code == 200):
        #Lets store the api key in a new variable and use it again in new requests
        json_response = r.json()
        api_key = json_response.get("token")
        print(api_key + "This is your user API key.") #Comment this line if you don't want your API key to show up in the terminal
    else:
        print(r.status_code)
        print(r.text)
except requests.exceptions.RequestException as req_ex:
    print(req_ex)
    print("Something went wrong with the request.")

if api_key:
    #Add the api key to the requests headers of the requests Session object in order to use the restricted endpoints.
    s.headers.update({"Authorization": "Bearer %s" % api_key}) 
    #Get all the diseases associated with the comma-separated list of genes built above.
    gda_response = s.get(api_host+'/gda/gene/'+list_genes_disgenet)#, params={'source':'UNIPROT'})
    print(gda_response.json())

if s:
    s.close()

3b45b91be16591ffac2a1ff7b07c238253ae1c86This is your user API key.


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [28]:
import pandas as pd

data = pd.json_normalize(gda_response.json())
data

Unnamed: 0,geneid,gene_symbol,uniprotid,gene_dsi,gene_dpi,gene_pli,protein_class,protein_class_name,diseaseid,disease_name,disease_class,disease_class_name,disease_type,disease_semantic_type,score,ei,el,year_initial,year_final,source
0,2099,ESR1,P03372,0.324,0.962,9.992000e-01,DTO_00102000,Nuclear receptor,C0006142,Malignant neoplasm of breast,C04;C17,Neoplasms; Skin and Connective Tissue Di...,disease,Neoplastic Process,1.00,0.967,,1983.0,2020.0,ALL
1,2177,FANCD2,Q9BXW9,0.479,0.885,1.097400e-30,,,C3160738,"FANCONI ANEMIA, COMPLEMENTATION GROUP D2",C16;C18;C15,"Congenital, Hereditary, and Neonatal Diseas...",disease,Disease or Syndrome,1.00,1.000,,2001.0,2019.0,ALL
2,4609,MYC,P01106,0.344,0.923,9.980100e-01,DTO_05007542,Transcription factor,C0006413,Burkitt Lymphoma,C04;C01;C20;C15,Neoplasms; Infections; Immune System ...,disease,Neoplastic Process,1.00,0.977,,1982.0,2020.0,ALL
3,7428,VHL,P40337,0.443,0.846,7.951500e-02,DTO_05007624,Enzyme,C0019562,Von Hippel-Lindau Syndrome,C16;C10;C14,"Congenital, Hereditary, and Neonatal Diseas...",disease,Disease or Syndrome,1.00,0.974,,1976.0,2020.0,ALL
4,9129,PRPF3,O43395,0.695,0.385,1.000000e+00,DTO_05007557,Nucleic acid binding,C1832378,Retinitis Pigmentosa 18,C16;C11,"Congenital, Hereditary, and Neonatal Diseas...",disease,Disease or Syndrome,0.93,1.000,,2002.0,2011.0,ALL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13867,131474,CHCHD4,Q8N4Q1,1.000,0.115,1.873700e-01,,,C0027651,Neoplasms,C04,Neoplasms,group,Neoplastic Process,0.01,1.000,,2012.0,2012.0,ALL
13868,131474,CHCHD4,Q8N4Q1,1.000,0.115,1.873700e-01,,,C0178874,Tumor Progression,C23,"Pathological Conditions, Signs and Symptoms",phenotype,Neoplastic Process,0.01,1.000,,2012.0,2012.0,ALL
13869,131474,CHCHD4,Q8N4Q1,1.000,0.115,1.873700e-01,,,C1856689,FRIEDREICH ATAXIA 1,C16;C18;C10,"Congenital, Hereditary, and Neonatal Diseas...",disease,Disease or Syndrome,0.01,1.000,,2018.0,2018.0,ALL
13870,344658,SAMD7,Q7Z3H4,0.931,0.077,1.141500e-06,,,C0035334,Retinitis Pigmentosa,C16;C11,"Congenital, Hereditary, and Neonatal Diseas...",disease,Disease or Syndrome,0.01,1.000,,2016.0,2016.0,ALL


In [29]:
import pandas as pd

data = pd.json_normalize(gda_response.json())

data = data[data['disease_type']=="disease"]

data = data[["gene_symbol", "disease_name", "disease_class_name"]]


In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9752 entries, 0 to 13871
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   gene_symbol         9752 non-null   object
 1   disease_name        9752 non-null   object
 2   disease_class_name  8828 non-null   object
dtypes: object(3)
memory usage: 304.8+ KB


In [31]:
len(data["disease_name"].unique())

3066
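
The same filtering and counting can be checked on a handful of rows. This sketch reuses five gene-disease pairs from the table above and adds a per-gene count, which the notebook does not compute; it is included only as one possible next step.

```python
import pandas as pd

toy_gda = pd.DataFrame({
    "gene_symbol": ["ESR1", "MYC", "CHCHD4", "CHCHD4", "CHCHD4"],
    "disease_name": ["Malignant neoplasm of breast", "Burkitt Lymphoma",
                     "Neoplasms", "Tumor Progression", "FRIEDREICH ATAXIA 1"],
    "disease_type": ["disease", "disease", "group", "phenotype", "disease"],
})

# Discard disease groups and phenotypes, keeping proper diseases only
diseases_only = toy_gda[toy_gda["disease_type"] == "disease"]

# Number of distinct diseases overall and per gene
n_unique = diseases_only["disease_name"].nunique()
per_gene = diseases_only.groupby("gene_symbol")["disease_name"].nunique()
print(n_unique)
print(per_gene)
```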

In [None]:
gda_response.content