# 1. PPI and GDA data gathering and interactome reconstruction

## 1.1.
Download PPIs from the latest BioGRID release to build the human interactome:

* use the “all organisms” tab3 file, unzip it and keep the “Homo sapiens” file only
* filter out all non-human interactions, i.e., both the “organism A” and “organism B” fields must equal 9606 (Homo sapiens)
* keep only “physical” interactions (“Experimental System Type” = physical)
* remove redundant interactions and self-loops
* isolate the largest connected component (LCC)

In [64]:
import os
import zipfile
from pathlib import Path
from dotenv import load_dotenv

import requests
import networkx as nx
import pandas as pd

Load the environment variables from the `.env` file. If you don't have a `.env` file, create one and put your DisGeNET API token in it:

```
disgenet_api_token="<api token>"
```

In [67]:
load_dotenv()

True
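`load_dotenv()` returns `True` as soon as a `.env` file is found, which does not guarantee that the token itself is defined. A minimal sketch of an explicit check follows; the helper name `get_required_env` is hypothetical and not part of `python-dotenv`.

```python
import os


def get_required_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value
```

Calling `get_required_env("disgenet_api_token")` right after `load_dotenv()` surfaces a missing token here rather than as an authentication error from the API later on.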

If you haven't already, download the BioGRID data, unzip it, and place the Homo sapiens `.txt` file inside the `data/` directory.

In [47]:
!wget -O biogrid.zip https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/BIOGRID-ORGANISM-LATEST.tab3.zip

--2023-12-27 12:54:05--  https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/BIOGRID-ORGANISM-LATEST.tab3.zip
Resolving downloads.thebiogrid.org (downloads.thebiogrid.org)... 173.255.198.187
Connecting to downloads.thebiogrid.org (downloads.thebiogrid.org)|173.255.198.187|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/download]
Saving to: ‘biogrid.zip’

biogrid.zip             [                 <=>] 159.74M  1.61MB/s    in 1m 52s  

2023-12-27 12:56:00 (1.43 MB/s) - ‘biogrid.zip’ saved [167496760]



Then, let's extract the Homo sapiens file into the `data/` directory.

In [58]:
data_folder = "data"
Path(data_folder).mkdir(parents=True, exist_ok=True)

with zipfile.ZipFile("biogrid.zip", "r") as zip_ref:
    # Iterate through the files in the zip archive
    for file_info in zip_ref.infolist():
        # Check if the file name matches the target filename
        if "Homo_sapiens" in file_info.filename:
            # Extract the file to the target folder
            zip_ref.extract(file_info, data_folder)

            # Rename the extracted file to a shorter, stable name
            os.rename(os.path.join(data_folder, file_info.filename), os.path.join(data_folder, "biogrid.txt"))
            break
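On Windows, `os.rename` raises `FileExistsError` when the destination already exists, so re-running the cell above can fail there. One possible alternative is to stream the matching archive member straight to its final path, which skips the rename step. This is a sketch only; the helper `extract_first_match` is a hypothetical name introduced for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def extract_first_match(zip_path, pattern, target_path):
    """Copy the first archive member whose name contains `pattern` to `target_path`.

    Returns the name of the member that was copied.
    """
    target = Path(target_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        for member in archive.infolist():
            if pattern in member.filename and not member.is_dir():
                # Stream the member to disk; an existing target is overwritten.
                with archive.open(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return member.filename
    raise FileNotFoundError(f"No member containing {pattern!r} in {zip_path}")
```

With the names used in this notebook, the call would be `extract_first_match("biogrid.zip", "Homo_sapiens", "data/biogrid.txt")`.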

Let's load the file. It is **tab-separated**, so we need to pass the tab separator `\t`. Some of the columns also contain mixed data types, so we set `low_memory=False` to let pandas infer each column's type from the whole file rather than chunk by chunk.

In [5]:
biogrid = pd.read_csv("data/biogrid.txt", sep="\t", low_memory=False)
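As an alternative to relying on type inference, the types of selected columns can be set explicitly. The sketch below runs on a tiny in-memory sample rather than the real file. It assumes that `-` marks a missing value, as BioGRID tab3 files appear to do, and it reads the Entrez ID column as strings; both choices are assumptions to adapt to your needs.

```python
import io

import pandas as pd

# Three-column sample imitating the layout of the BioGRID file (illustrative only).
sample = (
    "Entrez Gene Interactor A\tOfficial Symbol Interactor A\tSystematic Name Interactor A\n"
    "6416\tMAP2K4\t-\n"
    "6118\tRPA2\tRP4-547C9.3\n"
)

sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    na_values=["-"],                           # treat the "-" placeholder as missing
    dtype={"Entrez Gene Interactor A": str},   # keep identifiers as strings
)
print(sample_df.dtypes)
print(sample_df["Systematic Name Interactor A"].isna().sum())
```

Reading identifiers as strings avoids mixed integer/string columns when some IDs are not numeric.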

For an explanation of the different columns, see the [BioGRID wiki](https://wiki.thebiogrid.org/doku.php/biogrid_tab_version_3.0).

In [4]:
biogrid

Unnamed: 0,#BioGRID Interaction ID,Entrez Gene Interactor A,Entrez Gene Interactor B,BioGRID ID Interactor A,BioGRID ID Interactor B,Systematic Name Interactor A,Systematic Name Interactor B,Official Symbol Interactor A,Official Symbol Interactor B,Synonyms Interactor A,...,TREMBL Accessions Interactor B,REFSEQ Accessions Interactor B,Ontology Term IDs,Ontology Term Names,Ontology Term Categories,Ontology Term Qualifier IDs,Ontology Term Qualifier Names,Ontology Term Types,Organism Name Interactor A,Organism Name Interactor B
0,103,6416,2318,112315,108607,-,-,MAP2K4,FLNC,JNKK|JNKK1|MAPKK4|MEK4|MKK4|PRKMK4|SAPKK-1|SAP...,...,Q59H94,NP_001120959|NP_001449,-,-,-,-,-,-,Homo sapiens,Homo sapiens
1,117,84665,88,124185,106603,-,-,MYPN,ACTN2,CMD1DD|CMH22|MYOP|RCM4,...,Q59FD9|F6THM6,NP_001094|NP_001265272|NP_001265273,-,-,-,-,-,-,Homo sapiens,Homo sapiens
2,183,90,2339,106605,108625,-,-,ACVR1,FNTA,ACTRI|ACVR1A|ACVRLK2|ALK2|FOP|SKR1|TSRI,...,-,NP_002018,-,-,-,-,-,-,Homo sapiens,Homo sapiens
3,278,2624,5371,108894,111384,-,-,GATA2,PML,DCML|IMD21|MONOMAC|NFE1B,...,-,NP_150250|NP_150253|NP_150252|NP_150247|NP_150...,-,-,-,-,-,-,Homo sapiens,Homo sapiens
4,418,6118,6774,112038,112651,RP4-547C9.3,-,RPA2,STAT3,REPA2|RP-A p32|RP-A p34|RPA32,...,-,NP_644805|NP_003141|NP_001356447|NP_001356443|...,-,-,-,-,-,-,Homo sapiens,Homo sapiens
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1188614,3590145,253260,7408,128962,113251,-,-,RICTOR,VASP,AVO3|PIA|hAVO3,...,A0A024R0V4,NP_003361,-,-,-,-,-,-,Homo sapiens,Homo sapiens
1188615,3590146,253260,1072,128962,107499,-,-,RICTOR,CFL1,AVO3|PIA|hAVO3,...,V9HWI5,NP_005498,-,-,-,-,-,-,Homo sapiens,Homo sapiens
1188616,3590147,7189,4217,113041,110381,-,RP3-325F22.4,TRAF6,MAP3K5,MGC:3310|RNF85,...,-,NP_005914,-,-,-,-,-,-,Homo sapiens,Homo sapiens
1188617,3621586,8237,1956,113866,108276,RP4-659F15.2,-,USP11,EGFR,UHX1,...,-,NP_001333829|NP_001333828|NP_958440|NP_005219|...,-,-,-,-,-,-,Homo sapiens,Homo sapiens


### Filter out all non-human interactions, i.e., both “organism A” and “B” fields must be = 9606 (Homo sapiens)
First, we keep only human PPIs. The organism of each interactor is stored in the `Organism ID Interactor A` and `Organism ID Interactor B` columns.

In [8]:
biogrid_human = biogrid[(biogrid["Organism ID Interactor A"] == 9606) & (biogrid["Organism ID Interactor B"] == 9606)]

### Keep only “physical” interactions (“Experimental System Type” = physical)
Then, let's remove all non-physical interactions.

In [9]:
biogrid_human_physical = biogrid_human[biogrid_human["Experimental System Type"] == "physical"]
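The two filters can be checked on a small toy frame before trusting them on the full data. The sketch below uses made-up rows with the same column names; 10090 is the NCBI taxonomy ID for mouse. It builds each condition as a named boolean mask and uses `value_counts()` to see which interaction types are present.

```python
import pandas as pd

# Made-up rows for illustration; only row 0 is human-human and physical.
toy = pd.DataFrame({
    "Organism ID Interactor A": [9606, 9606, 10090, 9606],
    "Organism ID Interactor B": [9606, 10090, 9606, 9606],
    "Experimental System Type": ["physical", "physical", "physical", "genetic"],
})

is_human = (toy["Organism ID Interactor A"] == 9606) & (toy["Organism ID Interactor B"] == 9606)
is_physical = toy["Experimental System Type"] == "physical"

# .copy() gives an independent frame, so later column assignments do not
# trigger pandas' SettingWithCopyWarning.
toy_filtered = toy[is_human & is_physical].copy()

print(toy["Experimental System Type"].value_counts())
print(len(toy_filtered))
```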

### Purge out redundant and self loops
Some PPIs may be between a protein and itself. If we think of the interactions as a graph, any edge that starts and ends at the same node (a self-loop) can be removed. We can also remove duplicate edges between the same pair of nodes.

First, let's remove the self-loops by dropping the rows where the two **Official Symbol Interactor** columns are equal.

In [19]:
biogrid_no_self_loops = biogrid_human_physical[biogrid_human_physical["Official Symbol Interactor A"] != biogrid_human_physical["Official Symbol Interactor B"]]

print(f"Total number of self loops {len(biogrid_human_physical) - len(biogrid_no_self_loops)}")

Total number of self loops 7375


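As a next step, let's remove all rows that repeat the same **Official Symbol Interactor A/B** pair, since we want to build a **simple** graph out of these interactors and not a **multigraph**. Note that `duplicated()` only catches pairs listed in the same order; a reversed pair (B, A) is kept here and is collapsed into a single edge later, when the undirected graph is built.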

In [34]:
biogrid_no_duplicates = biogrid_no_self_loops[~biogrid_no_self_loops[["Official Symbol Interactor A", "Official Symbol Interactor B"]].duplicated()]

print(f"There are {len(biogrid_no_self_loops) - len(biogrid_no_duplicates)} duplicate rows")

There are 217822 duplicate rows
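If the table itself should contain each undirected pair only once, the pairs can be compared regardless of order. A minimal sketch on made-up rows follows, assuming a sorted tuple of the two symbols as the comparison key; the column names `A` and `B` are shortened for illustration.

```python
import pandas as pd

# Made-up edge list: row 1 is the reverse of row 0, row 2 repeats row 0,
# and row 3 is a self-loop.
edges = pd.DataFrame({
    "A": ["TP53", "MDM2", "TP53", "EGFR"],
    "B": ["MDM2", "TP53", "MDM2", "EGFR"],
})

no_loops = edges[edges["A"] != edges["B"]]

# Order-insensitive key: (A, B) and (B, A) map to the same sorted tuple.
pair_key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(no_loops["A"], no_loops["B"])],
    index=no_loops.index,
)
unique_edges = no_loops[~pair_key.duplicated()]
print(unique_edges)
```

Only the first TP53/MDM2 row remains. For the graph built in the next step this makes no difference to the result, because an undirected `nx.Graph` stores one edge per node pair anyway.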


### Isolate LCC
Finally, let's find the largest connected component (LCC) of the graph. A connected component is a maximal set of nodes in which every node can be reached from every other node; the LCC is the component with the most nodes.

In [39]:
graph = nx.Graph()
graph.add_edges_from(zip(biogrid_no_duplicates["Official Symbol Interactor A"], biogrid_no_duplicates["Official Symbol Interactor B"]))

The `connected_components` function from networkx yields the node set of each connected component, so taking the largest one by size gives us the nodes of the LCC. Then, we create the subgraph induced by these nodes.

In [43]:
lcc_nodes = max(nx.connected_components(graph), key=len)
lcc = graph.subgraph(lcc_nodes).copy()
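To see what these two lines do, here is a small sketch on a toy graph with two components; the node names are arbitrary.

```python
import networkx as nx

# Toy graph with two components: {A, B, C} and {D, E}.
toy_graph = nx.Graph()
toy_graph.add_edges_from([("A", "B"), ("B", "C"), ("D", "E")])

# Sort the components by size, largest first.
components = sorted(nx.connected_components(toy_graph), key=len, reverse=True)
toy_lcc = toy_graph.subgraph(components[0]).copy()

print([len(c) for c in components])                           # [3, 2]
print(toy_lcc.number_of_nodes(), toy_lcc.number_of_edges())   # 3 2
```

The `.copy()` matters because `subgraph` returns a read-only view of the original graph; copying yields an independent graph that can be modified. Printing `lcc.number_of_nodes()` and `lcc.number_of_edges()` in the same way is a quick sanity check on the real interactome.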

## 1.2 Gather gene-disease associations
We want to gather gene-disease associations to explore the links between genes associated with specific diseases and the proteins they interact with.

We will use the DisGeNET REST API to fetch the curated information associated with **polydactyly** (C0152427). Since the API requires an authentication token, we load the one obtained when creating our DisGeNET account from an environment variable.

In [73]:
polydacytyl_concept_id = "C0152427"

response = requests.get(f"https://www.disgenet.org/api/gda/disease/{polydacytyl_concept_id}", 
             params={"source":"CURATED"},
             headers={'Authorization': f'Bearer {os.getenv("disgenet_api_token")}'})

Let's take the response and turn it into a pandas dataframe to be able to work with it.

In [78]:
disease_df = pd.DataFrame(response.json())
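If the token is missing or the request is rejected, the JSON body will not be a list of records and the conversion above either fails or produces an unhelpful frame. A minimal defensive sketch follows; the helper name `json_response_to_frame` is hypothetical, and the assumption that a successful response is a JSON list of records is inferred from the conversion above, not taken from the DisGeNET documentation.

```python
import pandas as pd


def json_response_to_frame(response):
    """Convert an HTTP response holding a JSON list of records to a DataFrame."""
    # requests.Response.raise_for_status() raises on 4xx/5xx status codes.
    response.raise_for_status()
    payload = response.json()
    if not isinstance(payload, list):
        # Assumption: anything other than a list is an error or status message.
        raise ValueError(
            f"Expected a list of records, got {type(payload).__name__}: {payload!r}"
        )
    return pd.DataFrame(payload)
```

With the request above, `disease_df = json_response_to_frame(response)` would replace the direct conversion.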