# Data processing and standardization

Generally, an interaction is defined as a functional relationship between two biological molecules. Our main focus is to build a protein-protein interaction network covering all available human proteins.

The existing interactions are represented using different gene identifiers. For example, GeneMANIA and FunCoup use Ensembl, HumanNET utilizes Entrez, and HIPPIE employs UniProt. Therefore, there is a need to consolidate these interactions using a single naming scheme to facilitate data integration and analysis. 

## Build UniProt accession mapping for naming standardization

We used the UniProt accession as the main identifier for all proteins. As our knowledge grows, the identifiers of existing proteins may be changed or removed. Therefore, we mapped the individual identifiers to the UniProt primary accession for unification.

In [1]:
import os
import sys

def hline(number=78):
    """A basic function to draw a line"""
    print(number * '-')
    print()

The UniProt database contains two primary collections, SwissProt and TrEMBL. The SwissProt collection is manually curated, while the TrEMBL collection is annotated electronically. Because the TrEMBL collection is very large, we provide a fast disk-based key-value mapping using the `rocksdb` library.

The `uniprot` module contains a function, `build_mapping_rdb`, which builds the database used for identifier conversion. The main input to this function is a UniProt collection in the flat-file format. In this study, we only used the SwissProt collection. Users can also build the TrEMBL collection if needed.

In [2]:
from src import uniprot, misc

project_directory = '/projects/ooihs/ReNet/'
sprot_file = os.path.join(project_directory,
                          'data/uniprot/uniprot_sprot.dat.gz')
sprot_db = os.path.join(project_directory, 'results/sprot')

data_directory = os.path.join(project_directory, 'data/')
map_directory = os.path.join(project_directory, 'results/map/')
final_directory = os.path.join(project_directory, 'results/final/')

# create the output directories if they do not exist yet
os.makedirs(map_directory, exist_ok=True)
os.makedirs(final_directory, exist_ok=True)

uniprot.build_mapping_rdb(sprot_file, sprot_db, {'9606'})
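
The idea behind the accession mapping can be sketched as follows. This is a minimal in-memory illustration, not the `build_mapping_rdb` implementation: the function name `build_accession_map` is hypothetical, a plain `dict` stands in for the `rocksdb` store, the two entries are made up, and only the `AC` and `OX` lines of the UniProt flat-file format are read. The first accession listed for an entry is treated as its primary accession and the remaining ones as secondary.

```python
import re

def build_accession_map(lines, taxa=frozenset({'9606'})):
    """Map every accession (primary or secondary) to its primary accession.

    `lines` is any iterable of UniProt flat-file lines; only entries whose
    OX line names a taxon in `taxa` are kept.
    """
    mapping = {}
    accessions, taxon = [], None
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('AC   '):
            # AC lines may be repeated; each accession is terminated by ';'
            accessions.extend(a.strip() for a in line[5:].split(';') if a.strip())
        elif line.startswith('OX   '):
            match = re.search(r'NCBI_TaxID=(\d+)', line)
            if match:
                taxon = match.group(1)
        elif line.startswith('//'):
            # end of entry: register it, then reset the per-entry state
            if accessions and taxon in taxa:
                primary = accessions[0]
                for accession in accessions:
                    # assumption: the first entry seen wins if a secondary
                    # accession is shared by several entries
                    mapping.setdefault(accession, primary)
            accessions, taxon = [], None
    return mapping

# two made-up entries, one human and one mouse
example = """ID   EXAMPLE_HUMAN   Reviewed;   10 AA.
AC   P00001; Q00001;
OX   NCBI_TaxID=9606;
//
ID   EXAMPLE_MOUSE   Reviewed;   10 AA.
AC   P00002;
OX   NCBI_TaxID=10090;
//
""".splitlines()

accession_map = build_accession_map(example)
print(accession_map)
```

A disk-based store becomes attractive once the mapping no longer fits comfortably in memory, which is the case the text describes for TrEMBL.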

## Uncompressing the downloaded files

*Note:*
* HumanCyc: copy `human.tar.gz` to `data/humancyc` directory
* HPRD: copy `HPRD_FLAT_FILES_041310.tar.gz` to `data/hprd` directory
* DIP: copy `dip20170205.txt.gz` to `data/dip` directory
* InBioMap: copy `InBio_Map_core_2016_09_12.tar.gz` to `data/inbiomap` directory

In [3]:
%%capture
%cd {data_directory} 
%cd bindtranslation
!tar -zxvf BINDTranslation_v1_mitab_AllSpecies.tar.gz
%cd ../humancyc
!tar -zxvf human.tar.gz
%cd ../hprd
!tar -zxvf HPRD_FLAT_FILES_041310.tar.gz
%cd ../intact
!unzip -o intact.zip
%cd ../biogrid
!unzip -o BIOGRID-ALL-3.4.154.mitab.zip
%cd ../mppi
!gunzip mppi.gz
%cd ../hupa
!unzip -o HUPA_RELEASE_2.0.zip
%cd ../spike
!unzip -o LatestSpikeDB.xml.zip
%cd ../corum
!unzip -o allComplexes.txt.zip
%cd ../inbiomap
!tar -zxvf InBio_Map_core_2016_09_12.tar.gz
%cd ../irefindex
!unzip -o 9606.mitab.07042015.txt.zip
%cd ../..

# Converting the interaction data into an internal format

Besides the naming scheme, each interaction resource also uses a different format to represent molecular interactions. Therefore, the external data formats need to be converted into a common format for integration and comparison.

Our internal representation is a simple tab-separated format for both the interactions and the participants. The interaction file has three columns: the first and second columns hold the interacting genes or proteins under their primary names, and the third column is a [JSON](https://en.wikipedia.org/wiki/JSON) object containing additional information about the interaction. The participant file stores all unique genes or proteins in the interaction file, together with their secondary names.
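
As an illustration of this layout, the sketch below writes and reads one interaction line. The helper names and the JSON keys (`source`, `pmid`) are assumptions made for the example, and the publication identifier is a placeholder; the keys actually stored by the modules are not specified here.

```python
import json

def format_interaction(protein_a, protein_b, info):
    """Serialize one interaction as a tab-separated line with a JSON payload."""
    return '\t'.join([protein_a, protein_b, json.dumps(info, sort_keys=True)])

def parse_interaction(line):
    """Split a line back into the two participants and the decoded JSON object."""
    protein_a, protein_b, payload = line.rstrip('\n').split('\t', 2)
    return protein_a, protein_b, json.loads(payload)

# hypothetical payload; tabs inside JSON strings are escaped by json.dumps,
# so the line always has exactly three tab-separated columns
info = {'source': ['example'], 'pmid': ['0000000']}
line = format_interaction('P04637', 'Q00987', info)
record = parse_interaction(line)
print(line)
```

Keeping the free-form details in a single JSON column means that resources with different annotation fields can share the same three-column file.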

## Process the collected interactions

### Molecular interactions

#### Process the molecular interactions given in PSI-MI TAB format

The Proteomics Standards Initiative (PSI) provides community standards for representing molecular interaction data. Two representations are available for molecular interactions, PSI-MI XML and PSI-MI TAB. The `mitab` module processes molecular interactions defined in the PSI-MI tabular format. In our application, we mainly focused on extracting and processing information such as identifiers, taxonomy, interaction types, publications, and experimental methods.

The following databases provide molecular interactions in the PSI-MI TAB format:

* [IntAct](http://www.ebi.ac.uk/intact/)
* [DIP](http://dip.doe-mbi.ucla.edu/dip/Main.cgi)
* [MINT](http://mint.bio.uniroma2.it/)
* [BioGRID](https://thebiogrid.org/)
* [InnateDB](http://innatedb.com/)
* [MatrixDB](http://matrixdb.univ-lyon1.fr/)
* [Reactome](http://reactome.org/)
* [BINDTranslation](http://download.baderlab.org/BINDTranslation/)
* [CCSB HuRI](http://interactome.baderlab.org/)


*Note:*

* Many of the databases contain interactions derived from various organisms. In this example, only human interactions will be processed. 
* Not all identifiers can be converted to the corresponding UniProt accessions. For example, RNAs, miRNAs, or protein complexes cannot be converted.
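
To make the column layout concrete, here is a small sketch that reads one PSI-MI TAB 2.5 line and keeps it only when both participants are human. It is not the `mitab` module: the helper names are hypothetical, the example line is fabricated for illustration, and only the first matching value in each multi-valued field is used.

```python
def first_value(field, prefix):
    """Return the first `prefix:value` entry of a '|'-separated MITAB field."""
    for item in field.split('|'):
        if item.startswith(prefix + ':'):
            value = item[len(prefix) + 1:]
            # drop an optional trailing "(description)"
            return value.split('(', 1)[0].strip('"')
    return None

def parse_mitab_line(line, taxon='9606'):
    """Extract a protein pair from one PSI-MI TAB 2.5 line, or return None."""
    columns = line.rstrip('\n').split('\t')
    if len(columns) < 15:
        return None
    id_a = first_value(columns[0], 'uniprotkb')
    id_b = first_value(columns[1], 'uniprotkb')
    tax_a = first_value(columns[9], 'taxid')
    tax_b = first_value(columns[10], 'taxid')
    # skip non-protein participants and interactions from other organisms
    if None in (id_a, id_b) or tax_a != taxon or tax_b != taxon:
        return None
    return {
        'a': id_a,
        'b': id_b,
        'method': columns[6],
        'publication': columns[8],
        'type': columns[11],
    }

# fabricated example line with the 15 columns of MITAB 2.5
example = '\t'.join([
    'uniprotkb:P04637', 'uniprotkb:Q00987', '-', '-', '-', '-',
    'psi-mi:"MI:0018"(two hybrid)', '-', 'pubmed:0000000',
    'taxid:9606(human)', 'taxid:9606(human)',
    'psi-mi:"MI:0915"(physical association)', '-', '-', '-',
])
parsed = parse_mitab_line(example)
print(parsed)
```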

In [4]:
from src import mitab

def process_mitab(data_directory, output_directory):
    """process the list of interaction files in mitab format"""
    
    # each database has a list of input files
    data_files = {
        'intact': [os.path.join(data_directory,
                                'intact/intact.txt')],
        'dip': [os.path.join(data_directory,
                             'dip/dip20170205.txt.gz')],
        'mint': [os.path.join(data_directory,
                              'mint/MINT_MiTab.txt')],
        'biogrid': [os.path.join(data_directory,
                                 'biogrid/BIOGRID-ALL-3.4.154.mitab.txt')],
        'innatedb': [os.path.join(data_directory,
                                  'innatedb/innatedb_all.mitab.gz')],
        'matrixdb': [os.path.join(data_directory,
                                  'matrixdb/matrixdb_FULL.tab.gz')],
        'reactome': [os.path.join(data_directory,
                                  'reactome/reactome.homo_sapiens.interactions.psi-mitab.txt')]
    }

    # add in BINDTranslation
    file_list = []
    file_directory = os.path.join(data_directory, 'bindtranslation')
    # only process human interactions
    file_list.append(os.path.join(file_directory, 'taxid9606_PSIMI25.tsv'))
    data_files['bindtranslation'] = file_list
    
    # add in CCSB HuRI collections, now in PSI-MI Tab format
    file_list = []
    file_directory = os.path.join(data_directory, 'huri')
    for filename in os.listdir(file_directory):
        if filename.endswith('.psi'):
            file_list.append(os.path.join(file_directory, filename))
    data_files['huri'] = file_list

    # start the extraction process
    for dbname in data_files:
        # extract the interactions
        mitab.process(data_files[dbname], output_directory, dbname)
        print()
        
        # convert the interactor to uniprot mapping
        uniprot.convert_participant(output_directory, 'interaction', dbname, sprot_db)
        print()
        
        # convert the interaction to uniprot mapping
        uniprot.convert_interaction(output_directory, dbname, 'interaction')
        
        hline()
        
process_mitab(data_directory, map_directory)

Processing intact
Total number of interaction processed: 790620
Total number of unique interactions: 502251
Total number of unique interactors: 101718

Converting participants to uniprot accessions
Total interactors: 101718
Total interactors with valid conversion: 20322
Conversion rate: 19.98%

Converting interactions to uniprot accession
Total interactions: 502251
Interactions with valid uniprot accession: 152967
Conversion rate: 30.46%
------------------------------------------------------------------------------

Processing dip
Total number of interaction processed: 76881
Total number of unique interactions: 76877
Total number of unique interactors: 28255

Converting participants to uniprot accessions
Total interactors: 28255
Total interactors with valid conversion: 3636
Conversion rate: 12.87%

Converting interactions to uniprot accession
Total interactions: 76877
Interactions with valid uniprot accession: 5542
Conversion rate: 7.21%
------------------------------------------------

#### HTRIdb

In [5]:
from src import htridb

"""process the interactions from HTRIdb"""
htridb_file = os.path.join(data_directory, 'htridb/htridb_with_ppi.csv')

htridb.process_interaction(htridb_file, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'htridb', sprot_db) 
print()

uniprot.convert_interaction(map_directory, 'htridb') 

Processing HTRIdb
Total number of interaction processed: 251356
Total number of unique interactions: 149879
Total number of unique interactors: 19406

Converting participants to uniprot accessions
Total interactors: 19406
Total interactors with valid conversion: 17685
Conversion rate: 91.13%

Converting interactions to uniprot accession
Total interactions: 149879
Interactions with valid uniprot accession: 147429
Conversion rate: 98.37%


#### INstruct

In [6]:
from src import instruct

"""process the interactions from INstruct"""
instruct_directory = os.path.join(data_directory, 'instruct')

instruct.process(instruct_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'instruct',  sprot_db)
print()

uniprot.convert_interaction(map_directory, 'instruct')


Processing INstruct
Total number of interaction processed: 11470
Total number of unique interactions: 6585
Total number of unique interactors: 3627

Converting participants to uniprot accessions
Total interactors: 3627
Total interactors with valid conversion: 3627
Conversion rate: 100.00%

Converting interactions to uniprot accession
Total interactions: 6585
Interactions with valid uniprot accession: 6585
Conversion rate: 100.00%


#### MPPI

In [7]:
from src import mppi

"""process the interactions from MPPI"""
mppi_file = os.path.join(data_directory, 'mppi/mppi')

mppi.process(mppi_file, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'mppi', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'mppi')

Processing MPPI
Total number of interaction processed: 1814
Total number of unique interactions: 913
Total number of unique interactors: 953

Converting participants to uniprot accessions
Total interactors: 953
Total interactors with valid conversion: 493
Conversion rate: 51.73%

Converting interactions to uniprot accession
Total interactions: 913
Interactions with valid uniprot accession: 383
Conversion rate: 41.95%


#### SynSysNet

In [8]:
from src import synsysnet

"""process the interactions from SynSysNet"""
synsysnet_directory = os.path.join(data_directory, 'synsysnet')

synsysnet.process(synsysnet_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'synsysnet', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'synsysnet')

Processing SynSysNet
Total number of interaction processed: 4638
Total number of unique interactions: 4638
Total number of unique interactors: 882

Converting participants to uniprot accessions
Total interactors: 882
Total interactors with valid conversion: 882
Conversion rate: 100.00%

Converting interactions to uniprot accession
Total interactions: 4638
Interactions with valid uniprot accession: 4638
Conversion rate: 100.00%


#### TcoF (Dragon database of transcription co-factors and transcription factor interacting proteins)

In [9]:
from src import tcof

"""process the interactions from Dragon TcoF"""
tcof_file = os.path.join(data_directory, 'tcof/tcof_ppi_20100927.txt')

tcof.process(tcof_file, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'tcof', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'tcof')


Processing TcoF
Total number of interaction processed: 15948
Total number of unique interactions: 7045
Total number of unique interactors: 3012

Converting participants to uniprot accessions
Total interactors: 3012
Total interactors with valid conversion: 3012
Conversion rate: 100.00%

Converting interactions to uniprot accession
Total interactions: 7045
Interactions with valid uniprot accession: 7045
Conversion rate: 100.00%


#### HPRD

In [10]:
from src import hprd

"""process the interactions from HPRD"""
hprd_directory = os.path.join(data_directory, 'hprd/FLAT_FILES_041210')

hprd.process_interaction(hprd_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'hprd', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'hprd')

hline()
hprd.process_complex(hprd_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'complex', 'hprd', sprot_db)
print()

uniprot.convert_profile(map_directory, 'complex', 'hprd')


Processing HPRD interactions
Total number of interaction processed: 39175
Total number of unique interactions: 39142
Total number of unique interactors: 9673

Converting participants to uniprot accessions
Total interactors: 9673
Total interactors with valid conversion: 9650
Conversion rate: 99.76%

Converting interactions to uniprot accession
Total interactions: 39142
Interactions with valid uniprot accession: 39084
Conversion rate: 99.85%
------------------------------------------------------------------------------

Processing HPRD complexes
Total number of records processed: 7325
Total number of unique complexes: 1521
Total number of unique molecules: 2742

Converting participants to uniprot accessions
Total interactors: 2742
Total interactors with valid conversion: 2739
Conversion rate: 99.89%

Converting profiles to uniprot accession
Total profiles: 1521
Profiles with valid uniprot accession: 1521
Conversion rate: 100.00%


#### HUPA

In [11]:
from src import hupa

"""process the interactions from HUPA"""
# hupa needs to access some information from HPRD
hupa_directory = os.path.join(data_directory, 'hupa')

hupa.process_interaction(hprd_directory, hupa_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'hupa', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'hupa')

hline()
hupa.process_complex(hprd_directory, hupa_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'complex', 'hupa', sprot_db)
print()

uniprot.convert_profile(map_directory, 'complex', 'hupa')


Processing HUPA interactions
Total number of interaction processed: 4606
Total number of unique interactions: 4569
Total number of unique interactors: 1965

Converting participants to uniprot accessions
Total interactors: 1965
Total interactors with valid conversion: 1957
Conversion rate: 99.59%

Converting interactions to uniprot accession
Total interactions: 4569
Interactions with valid uniprot accession: 4561
Conversion rate: 99.82%
------------------------------------------------------------------------------

Processing HUPA complexes
Total number of records processed: 11
Total number of unique complexes: 1
Total number of unique molecules: 11

Converting participants to uniprot accessions
Total interactors: 11
Total interactors with valid conversion: 11
Conversion rate: 100.00%

Converting profiles to uniprot accession
Total profiles: 1
Profiles with valid uniprot accession: 1
Conversion rate: 100.00%


#### CIDeR

In [12]:
from src import cider

"""process the interactions from CIDeR"""
cider_file = os.path.join(data_directory, 'cider/cider.csv')
cider_update_file = cider_file + '.updated'

# the downloaded file contains some corrupted lines;
# the clean-up function fixes them.
cider.clean_up_cider_file(cider_file, cider_update_file)

cider.process_interaction(cider_update_file, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'cider', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'cider')

Total number of lines: 54467
After clean up: 52217
Processing CIDeR interactions
Total number of interaction processed: 7542
Total number of unique interactions: 5123
Total number of unique interactors: 2490
Total number of negative interactions: 590

Converting participants to uniprot accessions
Total interactors: 2490
Total interactors with valid conversion: 2426
Conversion rate: 97.43%

Converting interactions to uniprot accession
Total interactions: 5123
Interactions with valid uniprot accession: 5052
Conversion rate: 98.61%


#### SPIKE

In [13]:
from src import spike

"""process molecular interactions from SPIKE"""
spike_directory = os.path.join(data_directory, 'spike')
spike.process_interaction(spike_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'spike', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'spike')

Processing SPIKE interactions
Total number of interaction processed: 44752
Total number of unique interactions: 43393
Total number of unique interactors: 10621

Converting participants to uniprot accessions
Total interactors: 10621
Total interactors with valid conversion: 9486
Conversion rate: 89.31%

Converting interactions to uniprot accession
Total interactions: 43393
Interactions with valid uniprot accession: 39930
Conversion rate: 92.02%


#### SIGNOR 2.0

In [14]:
from src import signor

"""process molecular interactions and protein complexes"""
signor_directory = os.path.join(data_directory, 'signor')
interaction_file = os.path.join(signor_directory, 'signor.xlsx')
signor.process_interaction(interaction_file, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'signor', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'signor')

complex_file = os.path.join(signor_directory, 'signor_complexes.csv')
family_file = os.path.join(signor_directory, 'signor_family.csv')
hgnc_file = os.path.join(data_directory, 'hgnc/hgnc_complete_set.json')
hgnc = misc.load_hgnc_json(hgnc_file)

hline()
signor.process_complex(complex_file, family_file, hgnc, map_directory)
print()

uniprot.convert_participant(map_directory, 'complex', 'signor', sprot_db)
print()

uniprot.convert_profile(map_directory, 'complex', 'signor')

Processing SIGNOR interactions
Total number of interaction processed: 19167
Total number of unique interactions: 9733
Total number of unique interactors: 3783

Converting participants to uniprot accessions
Total interactors: 3783
Total interactors with valid conversion: 3775
Conversion rate: 99.79%

Converting interactions to uniprot accession
Total interactions: 9733
Interactions with valid uniprot accession: 9709
Conversion rate: 99.75%
------------------------------------------------------------------------------

Loading SIGNOR protein families
Processing SIGNOR complexes
Total number of unique complexes: 179
Total number of unique molecules: 334

Converting participants to uniprot accessions
Total interactors: 334
Total interactors with valid conversion: 334
Conversion rate: 100.00%

Converting profiles to uniprot accession
Total profiles: 179
Profiles with valid uniprot accession: 179
Conversion rate: 100.00%


#### PDZBase

In [15]:
from src import pdzbase

"""process the interactions from PDZBase"""
pdzbase.process(map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'pdzbase', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'pdzbase')

Processing PDZBase
Downloading the interactions from PDZBase
Total number of interaction processed: 339
Total number of unique interactions: 276
Total number of unique interactors: 311

Converting participants to uniprot accessions
Total interactors: 311
Total interactors with valid conversion: 124
Conversion rate: 39.87%

Converting interactions to uniprot accession
Total interactions: 276
Interactions with valid uniprot accession: 103
Conversion rate: 37.32%


#### TRIP

In [16]:
from src import trip

"""process the interactions from TRIP"""
# issue with unicode during processing the trip file, need to change
# the system encoding to utf8
trip_file = os.path.join(data_directory, 'trip/20150806.csv')

trip.process(trip_file, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'trip', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'trip')

Processing TRIP
Total number of interaction processed: 2813
Total number of unique interactions: 718
Total number of unique interactors: 472

Converting participants to uniprot accessions
Total interactors: 472
Total interactors with valid conversion: 397
Conversion rate: 84.11%

Converting interactions to uniprot accession
Total interactions: 718
Interactions with valid uniprot accession: 587
Conversion rate: 81.75%


#### DeathDomain

In [17]:
from src import deathdomain

deathdomain.process(map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'deathdomain', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'deathdomain')
    

Processing DeathDomain
Processing family: CARD
Processing family: DD
Processing family: DED
Processing family: PYD
Total number of interaction processed: 1964
Total number of unique interactions: 181
Total number of unique interactors: 75

Converting participants to uniprot accessions
Total interactors: 75
Total interactors with valid conversion: 75
Conversion rate: 100.00%

Converting interactions to uniprot accession
Total interactions: 181
Interactions with valid uniprot accession: 181
Conversion rate: 100.00%


#### PepCyber: P~Pep

In [18]:
from src import ppep

"""process the interactions from PPep"""
ppep_data = os.path.join(data_directory, 'ppep')

ppep.process(ppep_data, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'ppep', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'ppep')


Processing PepCyber ~ PPep interactions
Total number of interaction processed: 9529
Total number of unique interactions: 3563
Total number of unique interactors: 1383

Converting participants to uniprot accessions
Total interactors: 1383
Total interactors with valid conversion: 1016
Conversion rate: 73.46%

Converting interactions to uniprot accession
Total interactions: 3563
Interactions with valid uniprot accession: 2805
Conversion rate: 78.73%


#### KEGG (Kyoto Encyclopedia of Genes and Genomes)


In [19]:
from src import kegg

"""process various collections of KEGG database"""
kegg_directory = os.path.join(data_directory, 'kegg')

kegg.process_interaction(kegg_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'kegg', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'kegg')
print()

# process the kegg pathway so that we can use it for enrichment analysis
kegg.process_pathway(kegg_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'pathway', 'kegg', sprot_db)
print()

uniprot.convert_profile(map_directory, 'pathway', 'kegg')
print()

Processing KEGG interactions
Number of pathways processed: 322
Total number of unique interactions: 43658
Total number of unique interactors: 4991

Converting participants to uniprot accessions
Total interactors: 4991
Total interactors with valid conversion: 4862
Conversion rate: 97.42%

Converting interactions to uniprot accession
Total interactions: 43658
Interactions with valid uniprot accession: 42990
Conversion rate: 98.47%

Processing KEGG Pathway Collection
Total number of records processed: 322
Total number of unique diseases: 313
Total number of unique molecules: 7281

Converting participants to uniprot accessions
Total interactors: 7281
Total interactors with valid conversion: 7071
Conversion rate: 97.12%

Converting profiles to uniprot accession
Total profiles: 313
Profiles with valid uniprot accession: 313
Conversion rate: 100.00%



#### SignaLink

In [20]:
from src import signalink
import importlib
importlib.reload(signalink)

"""process pathway information from Signalink"""
signalink_directory = os.path.join(data_directory, 'signalink')
signalink.sql_to_tsv(signalink_directory)
print()

signalink.process_interaction(signalink_directory, map_directory)
print()

uniprot.convert_participant(map_directory, 'interaction', 'signalink', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'signalink')

Converting Signalink sqldump

Processing Signalink interactions
Total number of interaction processed: 363998
Total number of unique interactions: 305201
Total number of unique interactors: 9202

Converting participants to uniprot accessions
Total interactors: 9202
Total interactors with valid conversion: 4400
Conversion rate: 47.82%

Converting interactions to uniprot accession
Total interactions: 305201
Interactions with valid uniprot accession: 98306
Conversion rate: 32.21%


### Protein Complexes

#### CORUM

In [21]:
from src import corum

"""process the protein complexes from CORUM"""
corum_file = os.path.join(data_directory, 'corum/allComplexes.txt')
corum.process_complex(corum_file, map_directory)
print()

uniprot.convert_participant(map_directory, 'complex', 'corum', sprot_db)
print()

uniprot.convert_profile(map_directory, 'complex', 'corum')


Processing CORUM complexes
Total number of unique complexes: 3633
Total number of unique molecules: 5349

Converting participants to uniprot accessions
Total interactors: 5349
Total interactors with valid conversion: 3312
Conversion rate: 61.92%

Converting profiles to uniprot accession
Total profiles: 3633
Profiles with valid uniprot accession: 2538
Conversion rate: 69.86%


#### INTACT

In [22]:
from src import intact

"""process the protein complexes from INTACT"""
complex_file = os.path.join(data_directory, 'intact/homo_sapiens.tsv')
intact.process_complex(complex_file, map_directory)
print()

uniprot.convert_participant(map_directory, 'complex', 'intact', sprot_db)
print()

uniprot.convert_profile(map_directory, 'complex', 'intact')


Processing INTACT complexes
Total number of unique complexes: 614
Total number of unique molecules: 951

Converting participants to uniprot accessions
Total interactors: 951
Total interactors with valid conversion: 946
Conversion rate: 99.47%

Converting profiles to uniprot accession
Total profiles: 614
Profiles with valid uniprot accession: 614
Conversion rate: 100.00%


#### HumanCyc

In [23]:
from src import humancyc

"""process various information from HumanCyc"""
humancyc_directory = os.path.join(data_directory, 'humancyc/20.0')

humancyc.process_profile(humancyc_directory, map_directory, 'complex')
print()

uniprot.convert_participant(map_directory, 'complex', 'humancyc', sprot_db)
print()

uniprot.convert_profile(map_directory, 'complex', 'humancyc')

Processing HumanCyc complex
Total number of records processed: 28
Total number of unique diseases: 26
Total number of unique molecules: 50

Converting participants to uniprot accessions
Total interactors: 50
Total interactors with valid conversion: 50
Conversion rate: 100.00%

Converting profiles to uniprot accession
Total profiles: 26
Profiles with valid uniprot accession: 26
Conversion rate: 100.00%


### Manually curated interactions

These interactions can be used to build a gold-standard positive set.

#### HuRI collection

The file `LitBM-17` contains manually curated interactions.

In [24]:
input_files = [os.path.join(data_directory, 'huri/LitBM-17.psi')]
mitab.process(input_files, map_directory, 'litbm')
print()

uniprot.convert_participant(map_directory, 'interaction', 'litbm', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'litbm')


Processing litbm
Total number of interaction processed: 53504
Total number of unique interactions: 8703
Total number of unique interactors: 4971

Converting participants to uniprot accessions
Total interactors: 4971
Total interactors with valid conversion: 4962
Conversion rate: 99.82%

Converting interactions to uniprot accession
Total interactions: 8703
Interactions with valid uniprot accession: 8692
Conversion rate: 99.87%


## Negative interactions

Negative interactions are interactions reported not to occur under a particular condition. At present, no single experimental method can detect negative interactions.

### Negatome

In [25]:
from src import negatome

"""process negative interactions from Negatome"""
negatome_file = os.path.join(data_directory, 'negatome/combined.txt')
negatome.process(negatome_file, map_directory)
print()

uniprot.convert_participant(map_directory, 'notinteraction', 'negatome', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'negatome', 'notinteraction')


Processing Negatome
Total number of interaction processed: 6532
Total number of unique interactions: 5867
Total number of unique interactors: 3375

Converting participants to uniprot accessions
Total interactors: 3375
Total interactors with valid conversion: 1226
Conversion rate: 36.33%

Converting interactions to uniprot accession
Total interactions: 5867
Interactions with valid uniprot accession: 1560
Conversion rate: 26.59%


### CIDeR

During the processing of the CIDeR collection (see the CIDeR step above), we also reported the negative interactions, which have "NOT" as their quantifier. Here, we only standardize the reported negative interactions.

In [26]:
# negative interaction part
uniprot.convert_participant(map_directory, 'notinteraction', 'cider', sprot_db)

uniprot.convert_interaction(map_directory, 'cider', 'notinteraction')


Converting participants to uniprot accessions
Total interactors: 2490
Total interactors with valid conversion: 2426
Conversion rate: 97.43%
Converting interactions to uniprot accession
Total interactions: 590
Interactions with valid uniprot accession: 589
Conversion rate: 99.83%


### IntAct

IntAct also provides negative interactions. Similarly, we can process the file using the `mitab` module.

In [27]:
"""process negative interactions from IntAct"""
input_file = [os.path.join(data_directory, 'intact/intact_negative.txt')]
mitab.process(input_file, map_directory, 'intact', 'notinteraction', 'NOT')
print()

uniprot.convert_participant(map_directory, 'notinteraction', 'intact', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'intact', 'notinteraction')

Processing intact
Total number of interaction processed: 959
Total number of unique interactions: 914
Total number of unique interactors: 712

Converting participants to uniprot accessions
Total interactors: 712
Total interactors with valid conversion: 505
Conversion rate: 70.93%

Converting interactions to uniprot accession
Total interactions: 914
Interactions with valid uniprot accession: 372
Conversion rate: 40.70%


### Extract Human Interactions

In [28]:
datasets = ['bindtranslation', 'biogrid', 'cider', 'deathdomain', 
           'dip', 'hprd', 'htridb', 'hupa', 'huri',
           'innatedb', 'instruct', 'intact', 'kegg',
           'matrixdb', 'mint', 'mppi', 'pdzbase', 'ppep',
           'reactome', 'signalink', 'signor', 'spike',
           'synsysnet', 'tcof', 'trip'
]

input_files = []

for ds in datasets:
    input_files.append(os.path.join(map_directory, 'interaction.' + ds + '.uniprot'))

hgnc_file = os.path.join(data_directory, 'hgnc/hgnc_complete_set.json')
hgnc = misc.load_hgnc_json(hgnc_file)

output_file = os.path.join(final_directory, 'interaction.human.uniprot')

# combine and filter the interactions based on HGNC
misc.combine_interaction_datasets(input_files, output_file, hgnc['uniprot'])

Combining interactions in datasets...
...Processing bindtranslation
...Processing biogrid
...Processing cider
...Processing deathdomain
...Processing dip
...Processing hprd
...Processing htridb
...Processing hupa
...Processing huri
...Processing innatedb
...Processing instruct
...Processing intact
...Processing kegg
...Processing matrixdb
...Processing mint
...Processing mppi
...Processing pdzbase
...Processing ppep
...Processing reactome
...Processing signalink
...Processing signor
...Processing spike
...Processing synsysnet
...Processing tcof
...Processing trip
Total number of interaction processed: 876229
Total number of unique interactions: 504044
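
The combining step can be pictured with the minimal sketch below. It assumes that an interaction is an unordered pair of accessions, and that a pair is kept only when both accessions appear in the approved set (here standing in for `hgnc['uniprot']`). The function `combine_pairs` and the toy accessions are hypothetical. The real `misc.combine_interaction_datasets` reads files and may count or filter differently.

```python
def combine_pairs(datasets, approved):
    """Merge interaction lists, keeping each unordered pair once.

    datasets: dict of dataset name -> iterable of (accession_a, accession_b)
    approved: set of accessions to keep
    Returns (number of pairs read, dict of pair -> sorted source names).
    """
    processed = 0
    combined = {}
    for name, pairs in datasets.items():
        for a, b in pairs:
            processed += 1
            # Drop the pair unless both interactors are approved.
            if a not in approved or b not in approved:
                continue
            # Sort so that (a, b) and (b, a) share one key.
            key = tuple(sorted((a, b)))
            combined.setdefault(key, set()).add(name)
    return processed, {k: sorted(v) for k, v in combined.items()}


# Toy input with placeholder accessions.
toy = {
    'biogrid': [('P1', 'P2'), ('P2', 'P3')],
    'intact': [('P2', 'P1'), ('P1', 'X9')],
}
processed, unique = combine_pairs(toy, approved={'P1', 'P2', 'P3'})
print(processed, len(unique))  # 4 2
```

Keeping the source names per pair is a design choice of this sketch: it makes it easy to see later how many datasets support each interaction.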


## Interactions from other integrated resources

### ComPPI

The conversion rate for ComPPI is low because many of its proteins come from the unreviewed UniProt/TrEMBL collection. Our mapping is built from the SwissProt collection only, so these unreviewed proteins are excluded.

In [29]:
from src import comppi

input_file = os.path.join(data_directory, 'comppi/comppi--interactions--tax_hsapiens_loc_all.txt.gz')
comppi.process(input_file, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'comppi', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'comppi', 'interaction')

Processing ComPPI
Total number of interaction processed: 385481
Total number of unique interactions: 385481
Total number of unique interactors: 23265

Converting participants to uniprot accessions
Total interactors: 23265
Total interactors with valid conversion: 15929
Conversion rate: 68.47%

Converting interactions to uniprot accession
Total interactions: 385481
Interactions with valid uniprot accession: 181793
Conversion rate: 47.16%
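
The interaction-level rate (47.16%) is lower than the participant-level rate (68.47%). One plausible reading, assumed in the sketch below, is that an interaction converts only when both of its interactors are found in the reviewed mapping. The function `conversion_rates` and the toy data are hypothetical, and the inputs are assumed to be non-empty.

```python
def conversion_rates(interactions, mapping):
    """Return (participant rate, interaction rate) in percent.

    interactions: list of (id_a, id_b) pairs
    mapping: dict of source identifier -> reviewed accession
    """
    interactors = {p for pair in interactions for p in pair}
    converted = {p for p in interactors if p in mapping}
    # An interaction counts only if both ends can be converted.
    valid = [(a, b) for a, b in interactions if a in mapping and b in mapping]
    participant_rate = 100 * len(converted) / len(interactors)
    interaction_rate = 100 * len(valid) / len(interactions)
    return participant_rate, interaction_rate


# 'D' plays the role of an unreviewed protein missing from the mapping.
toy_interactions = [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]
toy_mapping = {'A': 'P1', 'B': 'P2', 'C': 'P3'}
print(conversion_rates(toy_interactions, toy_mapping))  # (75.0, 50.0)
```

A single unmapped protein removes every interaction it takes part in, so the interaction rate can fall well below the participant rate.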


### ConsensusPathDB (CPDB)

In [30]:
from src import cpdb

input_file = os.path.join(data_directory, 'cpdb/ConsensusPathDB_human_PPI.gz')
cpdb.process(input_file, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'cpdb', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'cpdb', 'interaction')

Processing ConsensusPathDB
Total number of interaction processed: 291415
Total number of unique interactions: 272998
Total number of unique interactors: 17941

Converting participants to uniprot accessions
Total interactors: 17941
Total interactors with valid conversion: 16296
Conversion rate: 90.83%

Converting interactions to uniprot accession
Total interactions: 272998
Interactions with valid uniprot accession: 261435
Conversion rate: 95.76%


### GeneMANIA

In [31]:
from src import genemania

print('Protein interactions from selected publications')
genemania_directory = os.path.join(data_directory, 'genemania/Homo_sapiens/')
genemania.process(genemania_directory, map_directory, '9606', 'PPI', 'interaction', ['Physical_Interactions'])
print()

uniprot.convert_participant(map_directory, 'interaction', 'genemania', sprot_db)
print()

uniprot.convert_interaction(map_directory, 'genemania', 'interaction')

Protein interactions from selected publications
Processing GeneMANIA
Total number of interaction processed: 596213
Total number of unique interactions: 265427
Total number of unique interactors: 16341

Converting participants to uniprot accessions
Total interactors: 16341
Total interactors with valid conversion: 16070
Conversion rate: 98.34%

Converting interactions to uniprot accession
Total interactions: 265427
Interactions with valid uniprot accession: 259919
Conversion rate: 97.92%


### FunCoup

In [32]:
from src import funcoup

input_file = os.path.join(data_directory, 'funcoup/FC4.0_H.sapiens_full.gz')
funcoup.process(input_file, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'funcoup', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'funcoup', 'interaction')

Processing FunCoup
Total number of interaction processed: 6403719
Total number of unique interactions: 6403719
Total number of unique interactors: 18355

Converting participants to uniprot accessions
Total interactors: 18355
Total interactors with valid conversion: 18064
Conversion rate: 98.41%

Converting interactions to uniprot accession
Total interactions: 6403719
Interactions with valid uniprot accession: 6351280
Conversion rate: 99.18%


### HumanNET

In [33]:
from src import humannet

humannet_directory = os.path.join(data_directory, 'humannet/')
humannet.process(humannet_directory, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'humannet', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'humannet', 'interaction')


Processing HumanNet
Total number of interaction processed: 476399
Total number of unique interactions: 476399
Total number of unique interactors: 16243

Converting participants to uniprot accessions
Total interactors: 16243
Total interactors with valid conversion: 15935
Conversion rate: 98.10%

Converting interactions to uniprot accession
Total interactions: 476399
Interactions with valid uniprot accession: 468789
Conversion rate: 98.40%


### InBioMap

In [34]:
from src import inbiomap

inbiomap_directory = os.path.join(data_directory, 'inbiomap/InBio_Map_core_2016_09_12/')
inbiomap.process(inbiomap_directory, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'inbiomap', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'inbiomap', 'interaction')

Processing InBioMap
Total number of interaction processed: 625641
Total number of unique interactions: 625641
Total number of unique interactors: 17653

Converting participants to uniprot accessions
Total interactors: 17653
Total interactors with valid conversion: 17650
Conversion rate: 99.98%

Converting interactions to uniprot accession
Total interactions: 625641
Interactions with valid uniprot accession: 625458
Conversion rate: 99.97%


### IRefIndex

In [35]:
from src import irefindex

input_file = os.path.join(data_directory, 'irefindex/9606.mitab.04072015.txt')
irefindex.process(input_file, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'irefindex', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'irefindex', 'interaction')


Processing IRefIndex
Total number of interaction processed: 673100
Total number of unique interactions: 321740
Total number of unique interactors: 45629

Converting participants to uniprot accessions
Total interactors: 45629
Total interactors with valid conversion: 23962
Conversion rate: 52.51%

Converting interactions to uniprot accession
Total interactions: 321740
Interactions with valid uniprot accession: 223487
Conversion rate: 69.46%


### HIPPIE

In [36]:
from src import hippie

input_file = os.path.join(data_directory, 'hippie/hippie_current.txt')
hippie.process(input_file, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'hippie', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'hippie', 'interaction')


Processing HIPPIE
Total number of interaction processed: 340629
Total number of unique interactions: 324778
Total number of unique interactors: 17365

Converting participants to uniprot accessions
Total interactors: 17365
Total interactors with valid conversion: 16790
Conversion rate: 96.69%

Converting interactions to uniprot accession
Total interactions: 324778
Interactions with valid uniprot accession: 322865
Conversion rate: 99.41%


### STRING

In [37]:
from src import string

input_file = os.path.join(data_directory, 'string/9606.protein.links.full.v10.5.txt.gz')
string.process(input_file, map_directory)

print()
uniprot.convert_participant(map_directory, 'interaction', 'string', sprot_db)
print()
uniprot.convert_interaction(map_directory, 'string', 'interaction')


Processing STRING
Total number of interaction processed: 11353056
Total number of unique interactions: 5676528
Total number of unique interactors: 19576

Converting participants to uniprot accessions
Total interactors: 19576
Total interactors with valid conversion: 17566
Conversion rate: 89.73%

Converting interactions to uniprot accession
Total interactions: 5676528
Interactions with valid uniprot accession: 4782791
Conversion rate: 84.26%
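
The number of processed STRING interactions (11353056) is exactly twice the number of unique ones (5676528), which is consistent with every pair being listed once in each direction. A minimal sketch of collapsing such symmetric rows, using the hypothetical helper `unique_unordered` and placeholder identifiers:

```python
def unique_unordered(pairs):
    """Collapse (a, b) and (b, a) into a single unordered pair."""
    return {tuple(sorted(pair)) for pair in pairs}


# Each link appears in both directions, as assumed for the input file.
links = [('A', 'B'), ('B', 'A'), ('A', 'C'), ('C', 'A')]
print(len(links), len(unique_unordered(links)))  # 4 2
```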


## Standardize the interactions using HGNC

In [38]:
databases = ['inbiomap', 'hippie', 'genemania', 'irefindex', 'cpdb', 'comppi',
             'string', 'humannet', 'funcoup']

hgnc_file = os.path.join(data_directory, 'hgnc/hgnc_complete_set.json')
hgnc = misc.load_hgnc_json(hgnc_file)

for dbname in databases:
    print('Filtering', dbname)
    filename = 'interaction.' + dbname + '.uniprot'
    input_file = os.path.join(map_directory, filename) 
    output_file = os.path.join(final_directory, filename)
    misc.filter_interactions(input_file, output_file, hgnc['uniprot'])
    print()
    

Filtering inbiomap
Filtering interactions ...
Total number of interaction processed: 615524
Total number of final interactions: 615154

Filtering hippie
Filtering interactions ...
Total number of interaction processed: 324393
Total number of final interactions: 324264

Filtering genemania
Filtering interactions ...
Total number of interaction processed: 259319
Total number of final interactions: 259319

Filtering irefindex
Filtering interactions ...
Total number of interaction processed: 185325
Total number of final interactions: 185273

Filtering cpdb
Filtering interactions ...
Total number of interaction processed: 261435
Total number of final interactions: 261353

Filtering comppi
Filtering interactions ...
Total number of interaction processed: 174939
Total number of final interactions: 174858

Filtering string
Filtering interactions ...
Total number of interaction processed: 4746189
Total number of final interactions: 4746181

Filtering humannet
Filtering interactions ...
Total nu

## Gene Ontology

In [39]:
from src import goa

input_file = os.path.join(data_directory, 'go/goa_human.gaf.gz')
goa.process(input_file, map_directory)

print()
uniprot.convert_participant(map_directory, 'geneontology', 'goa', sprot_db)
print()
uniprot.convert_profile(map_directory, 'geneontology', 'goa')

Processing Gene Ontology Association
Total number of records processed: 439134
Total number of unique diseases: 12488
Total number of unique molecules: 15934

Converting participants to uniprot accessions
Total interactors: 15934
Total interactors with valid conversion: 15909
Conversion rate: 99.84%

Converting profiles to uniprot accession
Total profiles: 12488
Profiles with valid uniprot accession: 12488
Conversion rate: 100.00%
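
The log above labels one count as "unique diseases". Because it matches the number of profiles (12488), it presumably refers to GO terms. The sketch below shows one way such profiles could be built from a GAF file, relying only on the published layout: tab-separated rows, comment lines starting with `!`, the DB Object ID in column 2 and the GO ID in column 5. The function `go_profiles` and the rows are hypothetical, and `goa.process` may work differently.

```python
from collections import defaultdict


def go_profiles(lines):
    """Group DB Object IDs by GO term from GAF-formatted lines."""
    profiles = defaultdict(set)
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith('!') or not line.strip():
            continue
        columns = line.rstrip('\n').split('\t')
        object_id, go_id = columns[1], columns[4]
        # Note: this sketch ignores the qualifier column (e.g. NOT).
        profiles[go_id].add(object_id)
    return dict(profiles)


# Made-up rows with placeholder identifiers, truncated to five columns.
toy_gaf = [
    '!gaf-version: 2.1',
    'UniProtKB\tP00001\tGENE1\t\tGO:0000001',
    'UniProtKB\tP00002\tGENE2\t\tGO:0000001',
    'UniProtKB\tP00001\tGENE1\t\tGO:0000002',
]
profiles = go_profiles(toy_gaf)
print(len(profiles))  # 2
```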
