# Download and process PPI network

This notebook serves roughly the same function as the two notebooks that follow it: it formats the PPI network for feature computation and analysis. Specifically, it performs the following tasks:

1. Download and save raw data into `../data/1.raw/`
2. Convert the network relationships from two sources (high-throughput and low-throughput experimental results) into a single network, including mapping relationships to a common identifier.
3. Save the edges that appear in one of three networks (`train`, `test_recon`, or `test_new`) into the file at `../data/2.edges/ppi.tsv.xz`.
4. Expand the edge list to all possible node pairs among the nodes that have an edge in the training network, not just the pairs that appear as an edge in some network. This full table is what edge prediction is run on. Save it to `../data/3.all_nodes/ppi.tsv.xz`.

In [1]:
import pathlib
import re

import numpy as np
import pandas as pd
import requests

import analysis

# 0. Set up folders

In [2]:
data_path = pathlib.Path('../../data/')
data_path.joinpath('1.raw/').mkdir(exist_ok=True, parents=True)
data_path.joinpath('2.edges/').mkdir(exist_ok=True, parents=True)
data_path.joinpath('3.all_nodes/').mkdir(exist_ok=True, parents=True)
data_path.joinpath('4.data/').mkdir(exist_ok=True, parents=True)

# 1. Download raw files

### Protein-protein interaction networks

* STRING https://string-db.org/
* High-throughput, systematic PPIs
    * We use two networks from the same group, both created through high-throughput screening. Data is available for download at http://interactome.baderlab.org/download.
    * Rual et al. (2005) *Nature* https://www.ncbi.nlm.nih.gov/pubmed/16189514
    * Rolland et al. (2014) *Cell* https://www.ncbi.nlm.nih.gov/pubmed/25416956

In [3]:
file_to_url = {
    'ppi_string.txt.gz': ('https://stringdb-static.org/download/protein.links.v11.0/'
                          '9606.protein.links.v11.0.txt.gz'),
    
    'ppi_string_mapping.tsv.gz': ('https://string-db.org/mapping_files/uniprot/'
                                  'human.uniprot_2_string.2018.tsv.gz'),
    
    'ppi_ht_1.psi': 'http://interactome.baderlab.org/data/Raul-Vidal(Nature_2005).psi',
    'ppi_ht_2.psi': 'http://interactome.baderlab.org/data/Rolland-Vidal(Cell_2014).psi',
}

# for file, url in file_to_url.items():
#     with open(data_path.joinpath(f'1.raw/{file}'), 'wb') as f:
#         res = requests.get(url)
#         f.write(res.content)

# 2. Process files to edges

Processing is generally as follows: 

(Note this example is for an undirected network with self-loops)

1. Convert raw relationship data (in whatever form) to the following (excluding all duplicates):

| source 	| target 	| network A 	| network B 	|
|--------	|--------	|-----------	|-----------	|
| A      	| B      	| 1         	| 0         	|
| A      	| C      	| 1         	| 1         	|
| B      	| C      	| 0         	| 1         	|

2. Assign roughly 70% of the network A edges to the training network. Map the nodes in the training network to the integers 0, ..., num(nodes)-1. If the network is undirected, ensure that `source_id` $\leq$ `target_id`. If the network is directed, index the source nodes first (0, ..., num(source)-1), then the target nodes (num(source), ...). This mapping is done for the convenience of XSwap later. The result is `[network_name]_edges_df`, which has the following schema:

| source 	| target 	| source_id 	| target_id 	| train 	| network A 	| network B 	|
|--------	|--------	|-----------	|-----------	|-------	|-----------	|-----------	|
| A      	| B      	| 0         	| 1         	| 0     	| 1         	| 0         	|
| A      	| C      	| 0         	| 2         	| 1     	| 1         	| 1         	|
| B      	| C      	| 1         	| 2         	| 0     	| 0         	| 1         	|

3. Take the subset of nodes that have an edge in the training network. The Cartesian product of this node set with itself (restricted to `source_id` $\leq$ `target_id` for an undirected network) forms `[network_name]_df`, which has the following schema:

| source 	| target 	| source_id 	| target_id 	| train 	| network A 	| network B 	|
|--------	|--------	|-----------	|-----------	|-------	|-----------	|-----------	|
| A      	| A      	| 0         	| 0         	| 0     	| 0         	| 0         	|
| A      	| B      	| 0         	| 1         	| 0     	| 1         	| 0         	|
| A      	| C      	| 0         	| 2         	| 1     	| 1         	| 1         	|
| B      	| B      	| 1         	| 1         	| 0     	| 0         	| 0         	|
| B      	| C      	| 1         	| 2         	| 0     	| 0         	| 1         	|
| C      	| C      	| 2         	| 2         	| 0     	| 0         	| 0         	|
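
The three steps above can be sketched on the toy network from the tables. This is a minimal illustration, not the code used later in this notebook: the function names, the `network_a`/`network_b` column names, and the `train_fraction` parameter are all assumptions made for the example.

```python
import itertools

import numpy as np
import pandas as pd

# Toy undirected edge sets for two hypothetical networks
network_a = {("A", "B"), ("A", "C")}
network_b = {("C", "A"), ("B", "C")}


def to_indicator_table(edges_a, edges_b):
    """Step 1: one row per unique undirected edge, with 0/1 membership columns."""
    canon_a = {tuple(sorted(e)) for e in edges_a}
    canon_b = {tuple(sorted(e)) for e in edges_b}
    rows = [
        {"source": s, "target": t,
         "network_a": int((s, t) in canon_a),
         "network_b": int((s, t) in canon_b)}
        for s, t in sorted(canon_a | canon_b)
    ]
    return pd.DataFrame(rows)


def assign_train_and_ids(edges_df, rng, train_fraction=0.7):
    """Step 2: sample training edges from network A and index the training nodes."""
    df = edges_df.copy()
    draws = rng.random(len(df))
    df["train"] = ((df["network_a"] == 1) & (draws < train_fraction)).astype(int)
    train_rows = df[df["train"] == 1]
    train_nodes = sorted(set(train_rows["source"]) | set(train_rows["target"]))
    node_to_id = {name: i for i, name in enumerate(train_nodes)}
    # Keep only edges whose endpoints both occur in the training network
    df = df[df["source"].isin(train_nodes) & df["target"].isin(train_nodes)].copy()
    df["source_id"] = df["source"].map(node_to_id)
    df["target_id"] = df["target"].map(node_to_id)
    return df.reset_index(drop=True), node_to_id


def all_node_pairs(edges_df, node_to_id):
    """Step 3: every unordered pair of training nodes, self-loops included."""
    nodes = sorted(node_to_id, key=node_to_id.get)
    pairs = pd.DataFrame(
        list(itertools.combinations_with_replacement(nodes, 2)),
        columns=["source", "target"],
    )
    pairs["source_id"] = pairs["source"].map(node_to_id)
    pairs["target_id"] = pairs["target"].map(node_to_id)
    indicators = ["train", "network_a", "network_b"]
    full = pairs.merge(edges_df[["source", "target"] + indicators],
                       how="left", on=["source", "target"])
    # Pairs that are not an edge anywhere get 0 in every indicator column
    full[indicators] = full[indicators].fillna(0).astype(int)
    return full


rng = np.random.default_rng(0)
edges_df, node_to_id = assign_train_and_ids(
    to_indicator_table(network_a, network_b), rng
)
full_df = all_node_pairs(edges_df, node_to_id)
print(full_df)
```

For n training nodes, the full table of an undirected network with self-loops has n(n+1)/2 rows, so it grows quadratically with the size of the training network.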


## 2.1 PPI

### 2.1.1 STRING

The two PPI networks use different identifier schemes. We convert the STRING identifiers to UniProt identifiers so that both networks share a common identifier.
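
As a rough illustration of this conversion, the sketch below maps a tiny edge table to UniProt-style accessions and drops edges with an unmapped node. All identifiers and entry strings here are made up for the example, and the `accession|entry_name` layout of `uniprot_entry` is an assumption rather than something taken from the real mapping file.

```python
import re

import pandas as pd

# Hypothetical mapping rows (not taken from the real file)
mapping = pd.DataFrame({
    "uniprot_entry": ["P00001|AAA_HUMAN", "Q00002|BBB_HUMAN"],
    "string": ["9606.ENSP0000000001", "9606.ENSP0000000002"],
})

# The first run of capital letters and digits is taken as the accession
string_to_uniprot = {
    string_id: re.search(r"[A-Z0-9]+", entry).group()
    for string_id, entry in zip(mapping["string"], mapping["uniprot_entry"])
}

# Hypothetical STRING links; the last protein2 ID has no mapping
links = pd.DataFrame({
    "protein1": ["9606.ENSP0000000001", "9606.ENSP0000000001"],
    "protein2": ["9606.ENSP0000000002", "9606.ENSP0000000003"],
})

mapped = (
    links
    .assign(
        uniprot_a=links["protein1"].map(string_to_uniprot),
        uniprot_b=links["protein2"].map(string_to_uniprot),
    )
    .dropna()  # unmapped IDs become NaN, so their edges are dropped
)
print(mapped[["uniprot_a", "uniprot_b"]])
```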

In [4]:
# Ensembl to UniProtKB identifier mappings
mapping_df = pd.read_csv(data_path.joinpath('1.raw/ppi_string_mapping.tsv.gz'), sep='\t',
                         compression='gzip', names=['species', 'uniprot_entry', 'string', 
                                                      'unknown_a', 'unknown_b'])

# Create dictionary with mappings
string_to_uniprot = (
    mapping_df
    .assign(uniprot=lambda df: df['uniprot_entry'].apply(lambda x: re.search('[A-Z0-9]+', x).group()))
    .set_index('string')
    .loc[:, 'uniprot']
    .to_dict()
)

# Load PPI network from STRING
string_edges = set(map(tuple, map(sorted,
    pd.read_csv(data_path.joinpath('1.raw/ppi_string.txt.gz'), compression='gzip', 
                sep=' ', dtype=str)
    .assign(
        uniprot_a=lambda df: df['protein1'].map(string_to_uniprot),
        uniprot_b=lambda df: df['protein2'].map(string_to_uniprot),
    )
    .dropna()
    .loc[:, ['uniprot_a', 'uniprot_b']]
    .values
)))

string_edges_df = (
    pd.DataFrame(sorted(string_edges), columns=['uniprot_a', 'uniprot_b'])
    .assign(
        test_recon=1,
    )
)

string_edges_df.head(2)

|   | uniprot_a  | uniprot_b  | test_recon |
|---|------------|------------|------------|
| 0 | A0A024R161 | A0A075B734 | 1          |
| 1 | A0A024R161 | A2A3L6     | 1          |


### 2.1.2 High-throughput PPI network

In [5]:
# Combine the two networks
ht_df = pd.concat([
    pd.read_csv(data_path.joinpath('1.raw/ppi_ht_1.psi'), sep='\t'), 
    pd.read_csv(data_path.joinpath('1.raw/ppi_ht_2.psi'), sep='\t')
], ignore_index=True)

ht_edges = set(map(tuple, map(sorted, 
    ht_df
    .rename(columns={
        'Unique identifier for interactor A': 'ida', 
        'Unique identifier for interactor B': 'idb'})
    .filter(items=['ida', 'idb',])
    .query('ida != "-" and idb != "-"')
    .assign(
        uniprot_a=lambda df: df['ida'].apply(lambda x: re.search('(?<=uniprotkb:)[0-9A-Z]+', x).group()),
        uniprot_b=lambda df: df['idb'].apply(lambda x: re.search('(?<=uniprotkb:)[0-9A-Z]+', x).group()),
    )
    .loc[:, ['uniprot_a', 'uniprot_b']]
    .values
)))

ht_edges_df = (
    pd.DataFrame(sorted(ht_edges), columns=['uniprot_a', 'uniprot_b'])
    .assign(
        test_new=1,
    )
)

ht_edges_df.head(2)

|   | uniprot_a  | uniprot_b  | test_new |
|---|------------|------------|----------|
| 0 | A0A024R0Y4 | A0A0R4J2E4 | 1        |
| 1 | A0A024R0Y4 | O14964     | 1        |


### 2.1.3 Combined PPI network

With both PPI networks mapped to UniProt identifiers, we keep only node pairs whose nodes are present in both networks. We then map the nodes that have a training edge to unique integer IDs, from 0 to the number of such nodes minus one, which makes XSwap more efficient later on. Finally, because the edges are undirected, each edge is ordered so that the first ID is always <= the second ID. This guarantees that the same edge is never stored twice in opposite orientations, so duplicates cannot be missed.
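
The reason for ordering undirected edges can be seen in a small example. This is a sketch with made-up node names (`P1`, `P2`, ...) and hypothetical helper functions, not the notebook's own code: sorting each pair collapses opposite orientations into one edge, and the shared-node filter then keeps only edges whose endpoints occur in both networks.

```python
# Toy edge lists; ("P2", "P1") and ("P1", "P2") describe the same undirected edge
string_like = [("P2", "P1"), ("P1", "P2"), ("P2", "P3"), ("P3", "P9")]
ht_like = [("P3", "P2"), ("P1", "P3")]


def canonical_edges(edges):
    """Order each pair so that duplicates in opposite orientations collapse."""
    return {tuple(sorted(edge)) for edge in edges}


def nodes_of(edges):
    """Return the set of all nodes that appear in any edge."""
    return {node for edge in edges for node in edge}


edges_a = canonical_edges(string_like)
edges_b = canonical_edges(ht_like)

# Nodes present in both networks
shared = nodes_of(edges_a) & nodes_of(edges_b)

# Keep edges from either network whose endpoints are both shared nodes
kept = {edge for edge in edges_a | edges_b if set(edge) <= shared}
print(sorted(kept))
```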

In [6]:
# Only use nodes that are present in both networks
string_nodes = set(string_edges_df.loc[:, 'uniprot_a':'uniprot_b'].values.flatten())
ht_nodes = set(ht_edges_df.loc[:, 'uniprot_a':'uniprot_b'].values.flatten())
shared_nodes = set(string_nodes.intersection(ht_nodes))

print(f'STRING: {len(string_nodes)} nodes\nHT: {len(ht_nodes)} nodes\n'
      f'SHARED: {len(shared_nodes)} nodes')

# Join DataFrames and subset to node pairs consisting only of nodes shared between both networks
np.random.seed(0)
ppi_edges_df = (
    string_edges_df  # ERROR COULD BE IN OUTER JOIN
    .merge(ht_edges_df, how='outer', on=['uniprot_a', 'uniprot_b'])
    .loc[lambda df: df['uniprot_a'].isin(shared_nodes) & df['uniprot_b'].isin(shared_nodes)]
    .fillna(0)
    .assign(
        train=lambda df: df['test_recon'].apply(lambda x: x == 1 and np.random.rand() < 0.7).astype(int),
        test_recon=lambda df: df['test_recon'].astype(int),
        test_new=lambda df: df['test_new'].astype(int),
    )
)

# Map nodes onto unique integers (for XSwap)
ppi_nodes = sorted(set(
    ppi_edges_df
    .query('train == 1')
    .loc[:, 'uniprot_a':'uniprot_b']
    .values.flatten()
))
ppi_mapping = {name: i for i, name in enumerate(ppi_nodes)}
ppi_reversed_mapping = {v: k for k, v in ppi_mapping.items()}

# Create a DF of all edges (from any network) whose nodes both appear in the train network
ppi_edges_df = (
    ppi_edges_df
    .assign(
        mapped_a=lambda df: df['uniprot_a'].map(ppi_mapping),
        mapped_b=lambda df: df['uniprot_b'].map(ppi_mapping),
    )
    # Drop node pairs with nodes not in the train network
    .dropna()
    .assign(
        # Edges are bi-directional, so make id_a <= id_b
        id_a=lambda df: df.apply(lambda row: min(row['mapped_a'], row['mapped_b']), axis=1).astype(int),
        id_b=lambda df: df.apply(lambda row: max(row['mapped_a'], row['mapped_b']), axis=1).astype(int),
        
        # Re-ordering means that UniProt IDs may now be reversed. 
        # Apply reverse mapping to ensure correctness.
        name_a=lambda df: df['id_a'].map(ppi_reversed_mapping),
        name_b=lambda df: df['id_b'].map(ppi_reversed_mapping),
    )
    .filter(items=['name_a', 'name_b', 'id_a', 'id_b', 'train', 'test_recon', 'test_new'])
    .reset_index(drop=True)
)

ppi_edges_df.to_csv(data_path.joinpath('2.edges/ppi.tsv.xz'), compression='xz', index=False, sep='\t')

ppi_edges_df.head(2)

STRING: 19080 nodes
HT: 4517 nodes
SHARED: 4083 nodes


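|   | name_a     | name_b | id_a | id_b | train | test_recon | test_new |
|---|------------|--------|------|------|-------|------------|----------|
| 0 | A0A087WT00 | O00154 | 0    | 48   | 1     | 1          | 0        |
| 1 | A0A087WT00 | O43736 | 0    | 237  | 0     | 1          | 0        |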


In [7]:
%%time

ppi_df = analysis.process_edges_to_full_network(ppi_edges_df, ppi_mapping, allow_loop=True, directed=False)
ppi_df.to_csv(data_path.joinpath('3.all_nodes/ppi.tsv.xz'), compression='xz', index=False, sep='\t')

ppi_df.head(2)

CPU times: user 3min 53s, sys: 3.13 s, total: 3min 56s
Wall time: 3min 56s
