# Downloading the interaction data

Molecular interaction data is available from many online resources. The first step in constructing a molecular interaction network is to retrieve the interactions from these resources. At present, more than 650 interaction-related resources are listed in [Pathguide](http://pathguide.org), and the list is likely to grow in the near future. This module provides a practical guide for downloading interaction data from some of the well-known human-related resources.

## Prerequisite

The following code block defines two helper functions: one downloads a single file from a given URL, and the other downloads a list of URLs into a directory.

In [1]:
import os
import urllib.request
import requests

project_directory = '/projects/ooihs/ReNet/'

def retrieve_file(url, output):
    """Save the content at `url` to `output`.

    `output` can be a directory or a file path. If it is a directory,
    the file is saved there under the last path component of the url.

    Return the full path of the local file, or None on failure.
    """
    try:
        localfile = output
        if os.path.isdir(output):
            localfile = os.path.join(output, url.split('/')[-1])

        download_file, headers = urllib.request.urlretrieve(url, localfile)
        return download_file
    except Exception:
        # report any download problem to the caller as None
        return None


def download_files(url_list, output_directory):
    """Download every url in `url_list` into `output_directory`."""
    if not os.path.isdir(output_directory):
        os.makedirs(output_directory)

    # iterate over the argument, not the global `urls`
    for url in url_list:
        filename = retrieve_file(url, output_directory)
        if filename:
            print('Downloaded', filename)
        else:
            print('Failed', url)
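
`urlretrieve` has no timeout argument, so a stalled server can block a download indefinitely. The following is a minimal sketch of an alternative helper built on `urllib.request.urlopen`; the name `retrieve_file_with_timeout` and the 60-second default are assumptions made for illustration and are not part of ReNet.

```python
import shutil
import urllib.request


def retrieve_file_with_timeout(url, localfile, timeout=60):
    """Copy the resource at `url` to `localfile`.

    `timeout` is passed to urlopen and limits how long a blocking
    network operation may wait (hypothetical default of 60 seconds).
    Exceptions are left to the caller.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(localfile, 'wb') as ofh:
        # stream the response to disk instead of loading it into memory
        shutil.copyfileobj(response, ofh)
    return localfile
```

Unlike `retrieve_file`, this sketch lets exceptions propagate, so the caller can see why a download failed.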

## Primary data resources

### [IntAct](https://www.ebi.ac.uk/intact/)

IntAct is developed at EMBL-EBI to collect molecular interactions reported in the literature. The data is available in multiple formats. ReNet mainly uses the PSI-MI tab-separated format.

The zip file `intact.zip` contains `intact.txt` and `intact_negative.txt`, which hold the experimentally derived interactions and the negative interactions, respectively. The file `homo_sapiens.tsv` contains the human protein complexes.

In [2]:
data_directory = os.path.join(project_directory, 'data/intact')

urls = ['ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psimitab/intact.zip',
       'ftp://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/homo_sapiens.tsv']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/intact/intact.zip
Downloaded /projects/ooihs/ReNet/data/intact/homo_sapiens.tsv
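
Before processing a downloaded archive, it can help to confirm that it contains the expected members. The sketch below uses the standard `zipfile` module; the helper names are hypothetical and only illustrate the idea.

```python
import zipfile
from itertools import islice


def list_archive_members(zip_path):
    """Return the names of the files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


def read_first_lines(zip_path, member, n=5):
    """Return the first `n` text lines of `member` without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            # members are opened in binary mode, so decode each line
            return [raw.decode('utf-8').rstrip('\r\n')
                    for raw in islice(handle, n)]
```

For `intact.zip`, `list_archive_members` would be expected to report `intact.txt` and `intact_negative.txt`, as described above.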


### [BioGRID](https://thebiogrid.org/)

BioGRID is another effort to collect molecular interactions from the published literature. The collection is not limited to protein-protein interactions; it also includes genetic interactions, chemical associations and post-translational modifications.

In [3]:
data_directory = os.path.join(project_directory, 'data/biogrid')

urls = ['https://thebiogrid.org/downloads/archives/Release%20Archive/BIOGRID-3.4.154/BIOGRID-ALL-3.4.154.mitab.zip']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/biogrid/BIOGRID-ALL-3.4.154.mitab.zip


### [Database of Interacting Proteins](http://dip.doe-mbi.ucla.edu/dip/Main.cgi)

DIP is also one of the earliest efforts to collect experimentally derived interactions between proteins. Users are requested to register for an account in order to download the data.

ReNet uses the MITAB2.5 format. The version used is `dip20170205.txt`.
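
A MITAB 2.5 file is tab-separated, with 15 columns describing each binary interaction. The following sketch reads such a file into dictionaries. The short column labels are my own naming, chosen for illustration, and the assumption that header lines start with `#` may not hold for every resource.

```python
# illustrative labels for the 15 MITAB 2.5 columns (not official names)
MITAB25_COLUMNS = [
    'id_a', 'id_b', 'alt_id_a', 'alt_id_b', 'alias_a', 'alias_b',
    'detection_method', 'first_author', 'publication', 'taxid_a',
    'taxid_b', 'interaction_type', 'source_db', 'interaction_id',
    'confidence',
]


def read_mitab(handle):
    """Yield one dict per interaction from an open MITAB 2.5 text handle."""
    for line in handle:
        line = line.rstrip('\r\n')
        # assumption: blank lines and lines starting with '#' carry no data
        if not line or line.startswith('#'):
            continue
        fields = line.split('\t')
        # zip() drops any extra, resource-specific columns beyond the 15th
        yield dict(zip(MITAB25_COLUMNS, fields))
```

A typical use would be `for record in read_mitab(open(path)): ...`, reading `record['id_a']` and `record['id_b']` as the two interactors.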

### [Molecular Interaction Database](http://mint.bio.uniroma2.it/)

MINT is another early effort to collect experimentally verified protein-protein interactions. Its development has since been merged with IntAct.

ReNet uses the final available version.

In [4]:
data_directory = os.path.join(project_directory, 'data/mint')

urls = ['http://mint.bio.uniroma2.it/mitab/MINT_MiTab.txt']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/mint/MINT_MiTab.txt


### [InnateDB](http://innatedb.com/)

InnateDB is a specialized database of genes, proteins, interactions and pathways involved in the innate immune response. At present, ReNet uses the following collection from the [download page](http://innatedb.com/redirect.do?go=downloadCurated):

InnateDB Curated Interactions (DNA, RNA, Protein) (updated weekly) 

In [5]:
data_directory = os.path.join(project_directory, 'data/innatedb')

urls = ['http://innatedb.com/download/interactions/innatedb_all.mitab.gz']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/innatedb/innatedb_all.mitab.gz
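
Several resources in this module are distributed as gzip-compressed text. Such files can be read directly without unpacking them first. The helpers below are a small sketch using the standard `gzip` module; their names are hypothetical.

```python
import gzip


def open_text(path, encoding='utf-8'):
    """Open `path` for reading text, decompressing on the fly for .gz files."""
    if str(path).endswith('.gz'):
        return gzip.open(path, 'rt', encoding=encoding)
    return open(path, 'r', encoding=encoding)


def peek(path, n=3):
    """Return the first `n` lines of a plain or gzip-compressed text file."""
    with open_text(path) as handle:
        return [line.rstrip('\n') for _, line in zip(range(n), handle)]
```

Calling `peek` on a downloaded `.gz` file is a quick way to check the header line before standardization.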


### [MatrixDB](http://matrixdb.univ-lyon1.fr/)

MatrixDB is a specialized database focused on interactions established by extracellular matrix proteins, proteoglycans and polysaccharides.

The latest version was released on July 20, 2017.

In [6]:
data_directory = os.path.join(project_directory, 'data/matrixdb')

urls = ['http://matrixdb.univ-lyon1.fr/download/matrixdb_FULL.tab.gz']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/matrixdb/matrixdb_FULL.tab.gz


### [HuRI](http://interactome.baderlab.org/)

The Human Reference Protein Interactome Mapping Project (HuRI) is an effort to generate a reference map of the human interactome. Only the released data is used. The file `LitBM-17.psi` is a literature-curated dataset from the same group; ReNet also uses it as part of the golden positive set.

In [7]:
data_directory = os.path.join(project_directory, 'data/huri')

urls = [
    'http://interactome.baderlab.org/data/Rolland-Vidal(Cell_2014).psi',
    'http://interactome.baderlab.org/data/Yu-Vidal(Nature_Methods_2011).psi',
    'http://interactome.baderlab.org/data/Raul-Vidal(Nature_2005).psi',
    'http://interactome.baderlab.org/data/Yang-Vidal(Cell_2016).psi',
    'http://interactome.baderlab.org/data/Venkatesan-Vidal(Nature_Methods_2009).psi',
    'http://interactome.baderlab.org/data/LitBM-17.psi'
]

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/huri/Rolland-Vidal(Cell_2014).psi
Downloaded /projects/ooihs/ReNet/data/huri/Yu-Vidal(Nature_Methods_2011).psi
Downloaded /projects/ooihs/ReNet/data/huri/Raul-Vidal(Nature_2005).psi
Downloaded /projects/ooihs/ReNet/data/huri/Yang-Vidal(Cell_2016).psi
Downloaded /projects/ooihs/ReNet/data/huri/Venkatesan-Vidal(Nature_Methods_2009).psi
Downloaded /projects/ooihs/ReNet/data/huri/LitBM-17.psi


### [Human Protein Reference Database](http://hprd.org/)

HPRD is an early effort devoted to collecting information pertaining to human proteins. The last version was released on April 13, 2010, and has not been updated since. Users are requested to register in order to download the file.

The following file was downloaded: `HPRD_FLAT_FILES_041310.tar.gz`

### [Human Proteinpedia](http://www.humanproteinpedia.org/)

Human Proteinpedia, a sister project of HPRD, is a community portal for sharing and integrating human protein data. Only the public data, HuPA_Release_2.0, is used.

In [8]:
data_directory = os.path.join(project_directory, 'data/hupa')

urls = ['http://humanproteinpedia.org/HuPA_Download/FULL/HUPA_RELEASE_2.0.zip']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/hupa/HUPA_RELEASE_2.0.zip


### [The MIPS Mammalian Protein-Protein Interaction Database](http://mips.helmholtz-muenchen.de/proj/ppi/)

MPPI is a collection of manually curated protein-protein interactions. The database has not been updated.

In [9]:
data_directory = os.path.join(project_directory, 'data/mppi')

urls = ['http://mips.helmholtz-muenchen.de/proj/ppi/data/mppi.gz']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/mppi/mppi.gz


### [Human Transcriptional Regulation Interaction Database](http://www.lbbc.ibb.unesp.br/htri/)

HTRIdb is a specialized database to collect information pertaining to transcriptional regulation interactions. The data can be downloaded [here](http://www.lbbc.ibb.unesp.br/htri/pagdown.jsp). The `TF-TG Interactions + PPIs` in csv format was used.

In [10]:
# The following example demonstrates how to download HTRIdb data with a "get"
# request using the requests module. See the SIGNOR example for "post" handling.
# The method might fail if HTRIdb changes its web interface.
url = 'http://www.lbbc.ibb.unesp.br/htri/consulta'

# parameters for getting HTRIdb data
# type: 1 TF-TG Interactions
#       2 TF-TG Interactions + PPIs
# down: 1 Text format
#       2 Excel format
#       3 CSV format
parameters = {
    "type": 2,
    "all": 'true',
    "down": 3
}

data_directory = os.path.join(project_directory, 'data/htridb')

if not os.path.isdir(data_directory):
    os.makedirs(data_directory)
    
output_file = os.path.join(data_directory, 'htridb_with_ppi.csv')
try:
    req = requests.get(url, params=parameters)
    req.raise_for_status()
    with open(output_file, 'w') as ofh:
        ofh.write(req.text)
    print('Downloaded', output_file)
    req.close()
except Exception as err:
    # report the reason instead of silently swallowing it
    print("Error in downloading HTRIdb data:", err)

Downloaded /projects/ooihs/ReNet/data/htridb/htridb_with_ppi.csv
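
In the "get" request above, the `params` dictionary is encoded into the query string of the URL. The sketch below reproduces that encoding with the standard library so the resulting URL can be inspected, or pasted into a browser, without sending a request. The helper name is hypothetical, and for simple values like these the result should match what `requests` builds.

```python
from urllib.parse import urlencode


def build_query_url(url, params):
    """Append `params` to `url` as an encoded query string."""
    return url + '?' + urlencode(params)


htridb_url = build_query_url(
    'http://www.lbbc.ibb.unesp.br/htri/consulta',
    {"type": 2, "all": 'true', "down": 3},
)
print(htridb_url)
# http://www.lbbc.ibb.unesp.br/htri/consulta?type=2&all=true&down=3
```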


### [SynSysNet](http://bioinformatics.charite.de/synsys/index.php)

SynSysNet is a manually curated database of synaptic proteins. The database provides a list of protein-protein interactions involved in synaptic activities. The following files are required.

In [11]:
data_directory = os.path.join(project_directory, 'data/synsysnet')

urls = [
    'http://bioinformatics.charite.de/synsys/download/genes.csv',
    'http://bioinformatics.charite.de/synsys/download/proteins.csv',
    'http://bioinformatics.charite.de/synsys/download/ppis.csv'
]

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/synsysnet/genes.csv
Downloaded /projects/ooihs/ReNet/data/synsysnet/proteins.csv
Downloaded /projects/ooihs/ReNet/data/synsysnet/ppis.csv


### [TcoF](http://www.cbrc.kaust.edu.sa/tcof/)

The Dragon database of transcription co-factors and transcription factor interacting proteins (TcoF) collects interactions involved in human transcriptional regulation.

In [12]:
data_directory = os.path.join(project_directory, 'data/tcof')

urls = [
    'http://www.cbrc.kaust.edu.sa/tcof/download_helper.php?path=download/tcof_tfs_20100927.txt&name=tcof_tfs_20100927.txt',
    'http://www.cbrc.kaust.edu.sa/tcof/download_helper.php?path=download/tcof_tcofs_20100927.txt&name=tcof_tcofs_20100927.txt',
    'http://www.cbrc.kaust.edu.sa/tcof/download_helper.php?path=download/tcof_ppi_20100927.txt&name=tcof_ppi_20100927.txt'
]

if not os.path.isdir(data_directory):
    os.makedirs(data_directory)

for url in urls:
    # extract the name from url
    download_name = url.split('=')[2]
    filename = retrieve_file(url, os.path.join(data_directory, download_name))
    if filename:
        print('Downloaded', filename)
    else:
        print('Failed', url)

Downloaded /projects/ooihs/ReNet/data/tcof/tcof_tfs_20100927.txt
Downloaded /projects/ooihs/ReNet/data/tcof/tcof_tcofs_20100927.txt
Downloaded /projects/ooihs/ReNet/data/tcof/tcof_ppi_20100927.txt
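
The cell above extracts the file name with `url.split('=')[2]`, which depends on the position of the `name` parameter in the URL. A slightly more robust sketch parses the query string instead. The function name `download_name` is hypothetical, and the fallback to the last path component is an assumption about what a sensible default would be.

```python
from urllib.parse import parse_qs, urlsplit


def download_name(url):
    """Return the value of the `name` query parameter of `url`.

    Fall back to the last path component when the parameter is absent.
    """
    parts = urlsplit(url)
    names = parse_qs(parts.query).get('name')
    if names:
        return names[0]
    return parts.path.rsplit('/', 1)[-1]
```

With the first TcoF URL, `download_name` returns `tcof_tfs_20100927.txt`, the same name that the positional split produces.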


### [INstruct](http://instruct.yulab.org/)

INstruct is a database of 3D protein interactome networks based on structural information.

In [13]:
data_directory = os.path.join(project_directory, 'data/instruct')

urls = ['http://instruct.yulab.org/download/sapiens.sin']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/instruct/sapiens.sin


### [BINDtranslation](http://baderlab.org/BINDTranslation)

This project translates the old BIND database into the PSI-MI format.

In [14]:
data_directory = os.path.join(project_directory, 'data/bindtranslation')

urls = ['http://download.baderlab.org/BINDTranslation/release1_0/BINDTranslation_v1_mitab_AllSpecies.tar.gz']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/bindtranslation/BINDTranslation_v1_mitab_AllSpecies.tar.gz


### [CIDeR](http://mips.helmholtz-muenchen.de/cider/)

CIDeR is a manually curated database of interactions involved in various diseases. The latest version was released on 22.09.2016.

In [15]:
data_directory = os.path.join(project_directory, 'data/cider')

urls = ['http://mips.helmholtz-muenchen.de/cider/download/start']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/cider/start


### [SPIKE](http://www.cs.tau.ac.il/~spike/)

SPIKE is a database of highly curated human signaling pathways. The database in XML format is used. This file format allows us to extract protein-protein interactions from the pathways.

In [16]:
data_directory = os.path.join(project_directory, 'data/spike')

urls = ['http://www.cs.tau.ac.il/~spike/download/LatestSpikeDB.xml.zip']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/spike/LatestSpikeDB.xml.zip


### [SIGNOR](http://signor.uniroma2.it/)

The Signaling Network Open Resource is a database of causal relationships between biological entities.

The data can be downloaded [here](http://signor.uniroma2.it/downloads.php). Users need to download the human interaction data in xls format, because the csv format cannot be handled properly. The SIGNOR complexes and protein families are needed as well.

In [17]:
# The following example demonstrates how to download SIGNOR data by submitting
# a form with the requests module.
# The method might fail if SIGNOR changes its web interface.

# formdata for query signor interaction
formdata = {
    "organism": 'human',
    "format": 'Excel5',
    "submit": 'Download'
}

data_directory = os.path.join(project_directory, 'data/signor')

if not os.path.isdir(data_directory):
    os.makedirs(data_directory)
    
url = 'http://signor.uniroma2.it/download_entity.php'

output_file = os.path.join(data_directory, 'signor.xlsx')
try:
    req = requests.post(url, data=formdata)
    req.raise_for_status()
    with open(output_file, 'wb') as ofh:
        ofh.write(req.content)
    print('Downloaded', output_file)
    req.close()
except Exception as err:
    # report the reason instead of silently swallowing it
    print("Error in downloading SIGNOR data:", err)
    
# for downloading signor complexes and protein family
url = 'http://signor.uniroma2.it/download_complexes.php'

values = [
    ('Download complex data', 'signor_complexes.csv'),
    ('Download protein family data', 'signor_family.csv')
]

for item in values:
    mode, filename = item

    try:
        req = requests.post(url, data={'submit': mode})
        # fail early instead of saving an HTTP error page as csv
        req.raise_for_status()
        output_file = os.path.join(data_directory, filename)
        with open(output_file, 'w') as ofh:
            ofh.write(req.text)
        print('Downloaded', output_file)
        req.close()
    except Exception as err:
        print("Error in:", mode, err)

Downloaded /projects/ooihs/ReNet/data/signor/signor.xlsx
Downloaded /projects/ooihs/ReNet/data/signor/signor_complexes.csv
Downloaded /projects/ooihs/ReNet/data/signor/signor_family.csv
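
The text above refers to the xls format, while the interaction file is saved with an `.xlsx` extension. Since spreadsheet readers often choose a parser from the extension, it may be worth checking which container the server actually returned. The sketch below inspects the leading bytes: legacy `.xls` files are OLE2 compound documents, and `.xlsx` files are zip archives. The helper name is hypothetical.

```python
OLE2_MAGIC = bytes.fromhex('d0cf11e0a1b11ae1')  # OLE2 container (legacy .xls)
ZIP_MAGIC = b'PK\x03\x04'                        # zip container (.xlsx)


def sniff_spreadsheet(first_bytes):
    """Guess the spreadsheet container from the first bytes of a file."""
    if first_bytes.startswith(OLE2_MAGIC):
        return 'xls'
    if first_bytes.startswith(ZIP_MAGIC):
        return 'xlsx'
    return 'unknown'
```

For example, `sniff_spreadsheet(open(output_file, 'rb').read(8))` can be called on the downloaded file; a result of `'unknown'` would suggest that something other than a spreadsheet, such as an HTML error page, was saved.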


### [Reactome](https://reactome.org/)

Reactome is a pathway database of human reactions. Besides pathway maps, the database also provides human protein-protein interaction data.

In [18]:
data_directory = os.path.join(project_directory, 'data/reactome')

urls = ['https://reactome.org/download/current/interactors/reactome.homo_sapiens.interactions.psi-mitab.txt'
]

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/reactome/reactome.homo_sapiens.interactions.psi-mitab.txt


### [SignaLink](http://www.signalink.org/)

SignaLink 2.0 is an integrated resource for signaling pathways. The data can be downloaded [here](http://www.signalink.org/download). Please select the complete database.

In [19]:
data_directory = os.path.join(project_directory, 'data/signalink')

urls = ['http://signalink.org/export/signalink.sql.gz']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/signalink/signalink.sql.gz


### [TRIP](http://trpchannel.org/)

TRIP is a manually curated database of protein-protein interactions for mammalian TRP channels. Transient receptor potential (TRP) channels are a family of $Ca^{2+}$-permeable cation channels. 

In [20]:
data_directory = os.path.join(project_directory, 'data/trip')

urls = ['http://trpchannel.org/20150806.csv']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/trip/20150806.csv


### [KEGG](http://www.kegg.jp/)

Kyoto Encyclopedia of Genes and Genomes (KEGG) is a resource characterizing various components of biological systems such as pathways, diseases, genes and proteins. In this study, we mainly used the pathway collections.

KEGG provides an API for downloading its data. Researchers with funding are advised to support KEGG development through the KEGG [FTP Subscription](https://www.bioinformatics.jp/en/keggftp.html).

The `kegg` module is capable of downloading the human collection, including genes, pathways and other modules. The following code block downloads the pathways (including pathway maps), human genes and diseases.

In [21]:
from src import kegg

modules = ['disease']
data_directory = os.path.join(project_directory, 'data/kegg')
kegg.download_kegg(data_directory, 'hsa', modules=modules)

/projects/ooihs/ReNet/data/kegg/ not exists. I will create one instead.
Downloading pathways for hsa
Number of pathways: 322
Total files to download: 966
Total number of items successfully downloaded: 966
Complete!
Downloading modules for hsa
Total files: 188
Total number of items successfully downloaded: 188
Complete!
Downloading hsa
Total files: 39519
Total number of items successfully downloaded: 39519
Complete!
Downloading disease
Total files: 1949
Total number of items successfully downloaded: 1949
Complete!
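
The internals of the `kegg` module are not shown here. As an illustration of the kind of request the KEGG API accepts, the sketch below builds a `list` URL for the KEGG REST service and parses the tab-separated text it returns. The helper names are hypothetical, and this is not necessarily how `kegg.download_kegg` works.

```python
KEGG_REST = 'https://rest.kegg.jp'


def kegg_list_url(database, organism=None):
    """Build a KEGG REST `list` URL, e.g. for the pathways of one organism."""
    parts = [KEGG_REST, 'list', database]
    if organism:
        parts.append(organism)
    return '/'.join(parts)


def parse_kegg_list(text):
    """Parse 'identifier<TAB>description' lines into a dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # split on the first tab only; descriptions may contain spaces
        identifier, _, description = line.partition('\t')
        entries[identifier] = description
    return entries
```

Fetching `kegg_list_url('pathway', 'hsa')` with `requests.get` and passing the response text to `parse_kegg_list` would give a mapping from pathway identifiers to pathway names.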


### [P~PeP](http://www.pepcyber.org/PPEP/)

P~PeP is the largest public database of human protein-protein interactions mediated by phosphoprotein binding domains (PPBDs).

In [23]:
from src import ppep

# download the P~PeP proteins and interactions
ppep_data = os.path.join(project_directory, 'data/ppep/')
if not os.path.isdir(ppep_data):
    os.makedirs(ppep_data)
    
ppep.download_interactions(ppep_data)

Downloading the PPep proteins...
Number of proteins downloaded: 209

Downloading the PPep interactions...
Total number of interactions to download: 9529


... Downloading 50 interactions
... Downloading 100 interactions
... Downloading 150 interactions
... Downloading 200 interactions
... Downloading 250 interactions
... Downloading 300 interactions
... Downloading 350 interactions
... Downloading 400 interactions
... Downloading 450 interactions
... Downloading 500 interactions
... Downloading 550 interactions
... Downloading 600 interactions
... Downloading 650 interactions
... Downloading 700 interactions
... Downloading 750 interactions
... Downloading 800 interactions
... Downloading 850 interactions
... Downloading 900 interactions
... Downloading 950 interactions
... Downloading 1000 interactions
... Downloading 1050 interactions
... Downloading 1100 interactions
... Downloading 1150 interactions
... Downloading 1200 interactions
... Downloading 1250 interactions
... Downloading

### Databases without downloadable data

The following databases do not provide interactions in a downloadable form. ReNet will query the websites for the interactions during the standardization process.

* [PDZBase](https://abc.med.cornell.edu/pdzbase)
is a manually curated collection of protein-protein interactions involving PDZ domains.
* [DeathDomain](http://deathdomain.org/)
is a manually curated database of protein-protein interactions for the Death Domain superfamily.

## Databases for constructing golden reference sets

### [CORUM](http://mips.helmholtz-muenchen.de/corum/)

CORUM is a comprehensive collection of mammalian protein complexes. The current CORUM release is dated 02.07.2017.

In [24]:
data_directory = os.path.join(project_directory, 'data/corum')

urls = ['http://mips.helmholtz-muenchen.de/corum/download/allComplexes.txt.zip']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/corum/allComplexes.txt.zip


### [HumanCyc](https://humancyc.org/)

HumanCyc is an encyclopedia of human genes and metabolism. The database contains manually curated human protein complexes. Registration is required in order to download the data. 

[Registration link](https://humancyc.org/download-flatfiles.shtml).

After registration, please download the human individual database (human.tar.gz, version 19.5).

### [Negatome](http://mips.helmholtz-muenchen.de/proj/ppi/negatome/)

Negatome collects protein interactions and domain pairs that are unlikely to form direct physical interactions. The dataset is generally very small and unsuitable to be used as a golden negative set. ReNet uses the information to ensure the golden positive set does not contain any negative interactions. The `Combined` collection is used.

In [25]:
data_directory = os.path.join(project_directory, 'data/negatome')

urls = ['http://mips.helmholtz-muenchen.de/proj/ppi/negatome/combined.txt']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/negatome/combined.txt
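
The use of Negatome described above, removing known non-interacting pairs from the golden positive set, can be sketched as a set difference on unordered pairs. The function name and the representation of interactions as 2-tuples of identifiers are assumptions made for illustration; the actual ReNet implementation is not shown here.

```python
def remove_negatives(positive_pairs, negative_pairs):
    """Return the positive pairs that do not appear in the negative set.

    Pairs are compared without regard to order, so (A, B) matches (B, A).
    """
    negatives = {frozenset(pair) for pair in negative_pairs}
    return [pair for pair in positive_pairs
            if frozenset(pair) not in negatives]


positives = [('P1', 'P2'), ('P3', 'P4'), ('P5', 'P6')]
negatives = [('P4', 'P3')]
print(remove_negatives(positives, negatives))
# [('P1', 'P2'), ('P5', 'P6')]
```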


## Integrated resources

### [ComPPI](http://comppi.linkgroup.hu/)

ComPPI stands for Compartmentalized Protein-Protein Interaction Database. It provides qualitative information on protein interactions and protein localization. The interactions were integrated from multiple databases.

In [26]:
url = 'http://comppi.linkgroup.hu/downloads'

# parameters for getting ComPPI data
# fDlSet: comp, int, protnloc
# fDlSpec: 0 (human)
# fDlMLoc: all
# fDlSubmit: Download
parameters = {
    "fDlSet": 'int',
    "fDlSpec": 0,
    "fDlMLoc": "all",
    "fDlSubmit": "Download"
}

data_directory = os.path.join(project_directory, 'data/comppi')

if not os.path.isdir(data_directory):
    os.makedirs(data_directory)
    
try:
    req = requests.post(url, data=parameters, stream=True)
    req.raise_for_status()
    
    # default name
    output_file = os.path.join(data_directory, 'comppi_interaction.txt.gz')
    # use the attachment file name if present; a missing header
    # keeps the default name instead of raising a KeyError
    disposition = req.headers.get('content-disposition', '')
    for term in disposition.split(';'):
        term = term.strip()
        if term.startswith('filename='):
            output_file = os.path.join(data_directory, 
                                       term.split('=')[1].strip('"'))
        
    with open(output_file, 'wb') as ofh:
        for chunk in req.iter_content(chunk_size=1024):
            ofh.write(chunk)
    print('Downloaded', output_file)
    req.close()
except Exception as err:
    print("Error in downloading ComPPI interactions:", err)

Downloaded /projects/ooihs/ReNet/data/comppi/comppi--interactions--tax_hsapiens_loc_all.txt.gz
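
The cell above extracts the attachment name by splitting the `Content-Disposition` header by hand. As an alternative sketch, the standard `email.message.Message` class can parse the header and handle quoting. The helper name `attachment_filename` is hypothetical.

```python
from email.message import Message


def attachment_filename(content_disposition, default):
    """Return the filename from a Content-Disposition header value.

    `default` is returned when the header is missing or has no filename.
    """
    if not content_disposition:
        return default
    message = Message()
    message['Content-Disposition'] = content_disposition
    return message.get_filename(default)
```

In the ComPPI cell, this could be called as `attachment_filename(req.headers.get('content-disposition'), 'comppi_interaction.txt.gz')`.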


### [ConsensusPathDB](http://cpdb.molgen.mpg.de/)

ConsensusPathDB is a resource that integrates interaction data from binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, and pathways. 

In [27]:
data_directory = os.path.join(project_directory, 'data/cpdb')

urls = ['http://cpdb.molgen.mpg.de/download/ConsensusPathDB_human_PPI.gz']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/cpdb/ConsensusPathDB_human_PPI.gz


### [GeneMANIA](http://genemania.org/)

GeneMANIA is one of the initial efforts to construct a comprehensive human molecular interaction network.

The latest interaction data is available for download from [http://genemania.org/data/current/](http://genemania.org/data/current/). For each organism, two sets of interaction data are available: individual networks obtained from existing publications or resources, and combined interaction data. Each individual data set contains the gene pairs reported in a particular study or resource. These gene pairs are weighted using a method presented in [PMID:11112](). Various types of interactions are available, such as co-expression, co-localization, genetic interactions and physical interactions. The folder also contains various attributes given in [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). The file `networks.txt` contains information about each network, and the file `identifier_mappings.txt` provides conversions between various identifiers.

The combined data set integrates all the individual networks and is evaluated against GO Biological Processes. The file `COMBINATIONS_WEIGHTS.DEFAULT_NETWORKS.BP_COMBINING.txt` gives the weight of each individual network.
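
In the GMT format used for the attribute files, each line describes one gene set: the set name, a description, and then the member genes, all separated by tabs. A minimal reader might look like the sketch below; the function name is hypothetical.

```python
def read_gmt(handle):
    """Read an open GMT text handle into a dict of set name -> gene list."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip('\r\n').split('\t')
        # a valid line has a name, a description and at least one gene
        if len(fields) < 3:
            continue
        name, genes = fields[0], fields[2:]
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets
```

For example, `read_gmt(open(path))` on one of the `Attributes.*.gmt` files would return a dictionary keyed by attribute name.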

The following code block shows how to retrieve all human interactions from both data collections.

In [29]:
from bs4 import BeautifulSoup
import os

def retrieve_genemania_files(genemania_url, output_directory):
    """retrieve the list of interactions files available
    in the given genemania_url. Save the file information in
    filelist.txt in the output_directory, and return the
    file list."""
    
    # contain the list of files in the genemania_url
    files = []
    try:
        r = requests.get(genemania_url)
        r.raise_for_status()
       
        # extract the files from a html table using BeautifulSoup
        html = BeautifulSoup(r.text, 'html.parser')
        table = html.find('table')
        with open(os.path.join(output_directory, 'filelist.txt'), 'w') as ofhandle:
            ofhandle.write(html.title.string + '\n\n')
            for row in table.find_all('tr'):
                # get all children in a table row, either th or td
                rec = [child.get_text() for child in row.children]
                if len(rec) > 1:
                    # skip the line with Parent Directory
                    if rec[1] == 'Parent Directory':
                        continue
                    ofhandle.write('\t'.join(rec[1:]) + '\n')
                    # if the entry is with valid file size and
                    # is not a header
                    if (rec[3].strip() != '-' and 
                        rec[1].strip() != 'Name'):
                        files.append(rec[1].strip())
        r.close()
    except Exception:
        # report the url passed to this function, not the global `url`
        print('Error in {url}'.format(url=genemania_url))    
        
    return files


base_url = 'http://genemania.org/data/current/'
sources = ['Homo_sapiens'] 
# to download the combined version
#sources.append('Homo_sapiens.COMBINED')
data_directory = os.path.join(project_directory, 'data/genemania/')

for item in sources:
    path = os.path.join(data_directory, item)
    
    if not os.path.isdir(path):
        os.makedirs(path)
        
    # get the list of files in a given url
    url = base_url + item + '/'
    files = retrieve_genemania_files(url, path)
    
    if len(files) == 0:
        print("No files to download")
        continue
        
    for file in files:
        file_url = url + file
        filename = retrieve_file(file_url, path)
        if filename:
            print('Downloaded {file}'.format(file=filename))
        else:
            print('Failed {file}'.format(file=file_url))
            
    print()

Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Attributes.Consolidated-Pathways-2013.gmt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Attributes.Drug-interactions-2013.gmt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Attributes.InterPro.gmt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Attributes.Transcriptional-factor-targets-2013.gmt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Attributes.miRNA-target-predictions-2013.gmt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Agnelli-Neri-2007.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Agnelli-Neri-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Alizadeh-Staudt-2000.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Alter-Stephan-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Ariazi-Jordan-2011.txt
Downl

Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Gibault-Aurias-2012.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Gobble-Singer-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Gomez-Abad-Piris-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Griffin-Glass-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Grigoryev-Salomon-2010.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Gysin-McMahon-2012.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Haferlach-Falini-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Hagg-Bjorkegren-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Hanamura-Shaughnessy-2006.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Hannenhalli-Cappola-200

Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Maertzdorf-Kaufmann-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Mallon-McKay-2013.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Marisa-Boige-2013.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Maser-DePinho-2007.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Mateescu-Mechta-Grigoriou-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Matsuyama-Sugihara-2010.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Meier-Seiler-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Melis-Farci-2014.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Meyer-Debatin-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Micke-Botling-2011.txt
Do

Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Smirnov-Cheung-2012.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Smith-Beauchamp-2010.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Sorich-Evans-2008.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Sotiriou-Delorenzi-2006.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Spira-Brody-2007.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Srlie-Brresen-Dale-2001.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Steidl-Gascoyne-2010.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Stinson-Dornan-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Stirewalt-Radich-2008.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Stratford-Yeh-2010.txt
Down

Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-localization.Zhang-Shang-2006.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Genetic_Interactions.BIOGRID-SMALL-SCALE-STUDIES.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Genetic_Interactions.Bailey-Hieter-2015.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Genetic_Interactions.Blomen-Brummelkamp-2015.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Genetic_Interactions.IREF-SMALL-SCALE-STUDIES.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Genetic_Interactions.Lin-Smith-2010.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Genetic_Interactions.Luo-Elledge-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Genetic_Interactions.Toyoshima-Grandori-2012.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Genetic_Interactions.Vizeacoumar-Moffat-2013.txt
Downloaded /projects/ooihs/ReNet/data/

Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Gao-Reinberg-2012.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Gautier-Hall-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Giannone-Liu-2010.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Glatter-Gstaiger-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Gloeckner-Ueffing-2007.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Goehler-Wanker-2004.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Golebiowski-Hay-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Goudreault-Gingras-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Grant-2010.txt
Downloaded /projects/ooihs/ReNet/data/gene

Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.McFarland-Nussbaum-2008.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Meek-Piwnica-Worms-2004.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Milev-Mouland-2012.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Miyamoto-Sato-Yanagawa-2010.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Murakawa-Landthaler-2015.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Nakayama-Ohara-2002.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Nakayasu-Adkins-2013.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Napolitano-Meroni-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Narayan-Bennett-2012.txt
Downloaded /

Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Wang-Yang-2011.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Weimann-Stelzl-2013_A.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Weimann-Stelzl-2013_B.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Weinmann-Meister-2009.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Wen-Wu-2014.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Whisenant-Salomon-2015.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Wilker-Yaffe-2007.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Witt-Labeit-2008.txt
Downloaded /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Physical_Interactions.Wong-O'Bryan-2012.txt
Downloaded /projects/ooihs/ReNet/data/genemania/

The gmt file is a [tab-separated file](https://en.wikipedia.org/wiki/Tab-separated_values). The first and second columns give the gene and its description. The remaining columns are the associated attributes.

The following code block executes a shell command (`head`) to display the first line of `Attributes.Consolidated-Pathways-2013.gmt`. The first line shows that the gene `ENSG00000000419` is involved in 10 pathways, including `HUMANCYC:MANNOSYL-CHITO-DOLICHOL-BIOSYNTHESIS` and `KEGG:hsa00510`. Each pathway is given in the format `database:pathway_id`. The gene is provided as an [Ensembl](http://www.ensembl.org/index.html) ID, and no description is given.

In [30]:
!head -n 1 /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Attributes.Consolidated-Pathways-2013.gmt
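The column layout described above can also be read directly in Python. The following is a minimal sketch, not part of the original workflow: the helper `parse_gmt_line` is hypothetical, and the example line is made up for illustration (only the gene ID and the two pathway IDs come from the text, and the empty description field is an assumption about how a missing description is stored).

```python
def parse_gmt_line(line):
    """Split one gmt line into (gene, description, attributes)."""
    fields = line.rstrip('\n').split('\t')
    gene = fields[0]
    description = fields[1] if len(fields) > 1 else ''
    # Ignore empty fields that may follow the last attribute
    attributes = [field for field in fields[2:] if field]
    return gene, description, attributes


# Hypothetical line: the description column is assumed to be empty
example = ('ENSG00000000419\t\t'
           'HUMANCYC:MANNOSYL-CHITO-DOLICHOL-BIOSYNTHESIS\tKEGG:hsa00510\n')

gene, description, pathways = parse_gmt_line(example)
for pathway in pathways:
    # Each attribute is written as database:pathway_id
    database, pathway_id = pathway.split(':', 1)
    print(gene, database, pathway_id)
```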


The network files are also tab-separated. The first and second columns give a gene pair, and the third column gives the weight of the interaction.

The following code block displays the first two lines of `Co-expression.Agnelli-Neri-2007.txt`. The first line is a header, and the interactions start from the second line.

The other files follow the same format.

In [31]:
!head -n 2 /projects/ooihs/ReNet/data/genemania/Homo_sapiens/Co-expression.Agnelli-Neri-2007.txt
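Given the three-column layout described above, a network file can be loaded as a list of weighted edges. This is a minimal sketch under stated assumptions: `read_network` is a hypothetical helper, and the header names and gene identifiers in the sample are placeholders rather than values taken from a real GeneMANIA file.

```python
import csv
import io


def read_network(handle):
    """Read (gene_a, gene_b, weight) tuples from a tab-separated network file."""
    reader = csv.reader(handle, delimiter='\t')
    next(reader, None)  # skip the header line
    edges = []
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        edges.append((row[0], row[1], float(row[2])))
    return edges


# Placeholder content standing in for a downloaded network file
sample = io.StringIO('Gene_A\tGene_B\tWeight\nENSG_X\tENSG_Y\t0.0021\n')
edges = read_network(sample)
print(edges)
```

For a real file, the same function can be called with `open(path, newline='')` instead of the in-memory sample.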


### [FunCoup](http://funcoup.sbc.su.se/search/)

FunCoup provides genome-wide functional couplings (associations) for several model organisms, including human. The functional couplings are derived from 10 different evidence types and integrated in a naive Bayesian fashion. The networks can be downloaded from the following url, [http://funcoup.sbc.su.se/downloads/](http://funcoup.sbc.su.se/downloads/).

The file format is a tab-separated format with the following columns:

Columns	|Label	|Description
---|:---|:---
0	|PFC	|The confidence score
1	|FBS_max	|The final Bayesian score (FBS) of the strongest coupling class
2-3	|Gene*	|The Ensembl gene identifier of the gene pair
4-7	|FBS_*	|The FBS scores of the four different coupling classes (Some species don't have all classes)
8-16	|LLR_*	|The log-likelihood ratios (LLRs) of the different evidence types
17-27	|LLR_*	|The LLRs of the evidence from the different species
28	|Max_category	|The strongest coupling class for this pair. This is also the class for which the LLRs are given.
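The column indices in the table can be used to pull the gene pair and its scores out of each row. The sketch below assumes that every data row carries all 29 tab-separated columns; `parse_funcoup_row` is a hypothetical helper, and the example row (including the class label `PPI`) is made up for illustration.

```python
def parse_funcoup_row(line):
    """Extract the main fields of one FunCoup row using the column indices above."""
    fields = line.rstrip('\n').split('\t')
    return {
        'pfc': float(fields[0]),       # column 0: confidence score
        'fbs_max': float(fields[1]),   # column 1: FBS of the strongest class
        'gene_a': fields[2],           # columns 2-3: the gene pair
        'gene_b': fields[3],
        'max_category': fields[28],    # column 28: strongest coupling class
    }


# Hypothetical row: 4 leading fields, 24 placeholder scores, 1 class label
values = ['0.95', '7.2', 'ENSG_A', 'ENSG_B'] + ['0'] * 24 + ['PPI']
row = parse_funcoup_row('\t'.join(values))

# Keep only couplings above a chosen confidence cutoff (0.8 is arbitrary)
if row['pfc'] >= 0.8:
    print(row['gene_a'], row['gene_b'], row['pfc'], row['max_category'])
```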

The following code block downloads the human functional coupling network.

In [32]:
data_directory = os.path.join(project_directory, 'data/funcoup')

urls = ['http://funcoup.sbc.su.se/downloads/download.action?type=network&fileName=FC4.0_H.sapiens_full.gz']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/funcoup/download.action?type=network&fileName=FC4.0_H.sapiens_full.gz
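The output above shows that the local file is named after the last path segment of the URL, query string included. One possible way to obtain a cleaner name is to read it from the `fileName` query parameter. The helper `filename_from_url` below is a hypothetical sketch and is not used by `retrieve_file`.

```python
import os
from urllib.parse import urlparse, parse_qs


def filename_from_url(url, parameter='fileName'):
    """Return the file name given in a query parameter, or the last path segment."""
    parsed = urlparse(url)
    values = parse_qs(parsed.query).get(parameter)
    if values:
        return values[0]
    return os.path.basename(parsed.path)


url = ('http://funcoup.sbc.su.se/downloads/download.action'
       '?type=network&fileName=FC4.0_H.sapiens_full.gz')
print(filename_from_url(url))
```

The result can be joined with the data directory and passed to `retrieve_file` as a file path instead of a directory.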


### [Human Integrated Protein-Protein Interaction rEference](http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/index.php)

HIPPIE is a collection of human protein-protein interactions. It scores each interaction based on the amount and reliability of the evidence supporting it. One novel feature of HIPPIE is the assignment of a quality score to each experimental method, which reflects the reliability and error rate of the technique. This assessment may be subjective, as no single score can represent the quality of data generated by different studies using the same experimental method.



In [33]:
data_directory = os.path.join(project_directory, 'data/hippie')

urls = ['http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/hippie_current.txt',
       'http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/RS/experimental_scores.tsv']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/hippie/hippie_current.txt
Downloaded /projects/ooihs/ReNet/data/hippie/experimental_scores.tsv


### [HumanNet](http://www.functionalnet.org/humannet/)

HumanNet is a functional gene network for human that was developed to help interpret genome-wide association data.

In [34]:
data_directory = os.path.join(project_directory, 'data/humannet')

urls = ['http://www.functionalnet.org/humannet/HumanNet.v1.benchmark.txt',
        'http://www.functionalnet.org/humannet/HumanNet.v1.join.txt',
        'http://www.functionalnet.org/humannet/HumanNet.v1.evidence_code.txt'
       ]

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/humannet/HumanNet.v1.benchmark.txt
Downloaded /projects/ooihs/ReNet/data/humannet/HumanNet.v1.join.txt
Downloaded /projects/ooihs/ReNet/data/humannet/HumanNet.v1.evidence_code.txt


### [IRefIndex](http://irefindex.org)

IRefIndex is one of the early efforts to consolidate protein-protein interactions from multiple resources, including BIND, BioGRID, CORUM, DIP, IntAct, HPRD, MATRIXDB, MINT, MPACT, MPPI and MPIDB. No scoring scheme was used to evaluate the interactions.

The last version was released on April 7, 2015, and can be downloaded from the following URL: [http://irefindex.org/download/irefindex/data/archive/release_14.0/psi_mitab/MITAB2.6/](http://irefindex.org/download/irefindex/data/archive/release_14.0/psi_mitab/MITAB2.6/).

In [35]:
data_directory = os.path.join(project_directory, 'data/irefindex')

urls = ['http://irefindex.org/download/irefindex/data/archive/release_14.0/psi_mitab/MITAB2.6/9606.mitab.07042015.txt.zip']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/irefindex/9606.mitab.07042015.txt.zip


### [STRING](https://string-db.org/)

STRING is probably the largest collection of functional interactions, covering more than 2,000 organisms. The functional interactions are derived from computational predictions, co-expression, experimental data, literature and text mining. Information from the different evidence types is combined in a naive Bayesian fashion.

The complete data set is very large. Therefore, it is advisable to download the interaction data for a specific organism. A number of different interaction formats are available for download. In this study, we used the protein network data with subscores per channel. This allows us to select a subset of interactions based on evidence types.

In [36]:
# The STRING server returns an HTTP 403 error when using the retrieve_file function.
# The following solution is based on requests, with streaming.
def requests_file(url, output):
    """Save the url to an output. The output can be a directory or
    file. If the output is a directory, then save the filename pointed
    by the url into the directory.
    
    Return the local file with complete path, or None on failure.
    """
    try:
        r = requests.get(url, stream=True)
        r.raise_for_status()
        localfile = output
        if os.path.isdir(output):
            localfile = os.path.join(output, url.split('/')[-1])
        # Write the response in chunks to avoid loading it into memory
        with open(localfile, 'wb') as ofhandle:
            for chunk in r.iter_content(chunk_size=1024):
                ofhandle.write(chunk)
        return localfile
    except (requests.RequestException, OSError):
        # Network or file-system failure
        return None

url = 'https://stringdb-static.org/download/protein.links.full.v10.5/9606.protein.links.full.v10.5.txt.gz'

data_directory = os.path.join(project_directory, 'data/string')

os.makedirs(data_directory, exist_ok=True)

filename = requests_file(url, data_directory)
if filename:
    print('Downloaded {file}'.format(file=filename))
else:
    print('Failed {file}'.format(file=url))

Downloaded /projects/ooihs/ReNet/data/string/9606.protein.links.full.v10.5.txt.gz
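The per-channel subscores make it possible to keep only the interactions supported by a chosen evidence type. The sketch below illustrates the idea on a small in-memory sample. It assumes a space-separated file with a header line and integer scores; the column names used here (`protein1`, `protein2`, `experiments`, `textmining`, `combined_score`) and the protein identifiers are assumptions for illustration and should be checked against the header of the downloaded file. `filter_by_channel` is a hypothetical helper.

```python
import csv
import io


def filter_by_channel(handle, channel, minimum):
    """Return (protein1, protein2, score) for rows whose channel score >= minimum."""
    reader = csv.DictReader(handle, delimiter=' ')
    selected = []
    for row in reader:
        score = int(row[channel])
        if score >= minimum:
            selected.append((row['protein1'], row['protein2'], score))
    return selected


# Made-up sample; the real file has more channels and real identifiers
sample = io.StringIO(
    'protein1 protein2 experiments textmining combined_score\n'
    '9606.ENSP_A 9606.ENSP_B 0 450 450\n'
    '9606.ENSP_A 9606.ENSP_C 720 0 720\n'
)

experimental = filter_by_channel(sample, 'experiments', 400)
print(experimental)
```

For the downloaded file, the handle could be opened with `gzip.open(filename, 'rt')` so that the archive does not need to be decompressed first.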


### [InBioMap](https://www.intomics.com/inbio/map/#home)

InBioMap is also a collection of publicly available human protein-protein interactions. The database is an extension of InWeb, which is used by [DAPPLE](http://archive.broadinstitute.org/mpg/dapple/dapple.php). The interactions are scored based on the studies that report them.

The data can be downloaded from the following URL: [https://www.intomics.com/inbio/map/#downloads](https://www.intomics.com/inbio/map/#downloads). The download must be done using a web browser, as it requires authorization.

## Other useful resources

### [UniProt](http://uniprot.org)

UniProt is the main resource for protein sequences. There are two major data collections in the UniProt database, Swiss-Prot and TrEMBL. The Swiss-Prot collection is manually annotated and reviewed. Therefore, it contains up-to-date functional annotations for each protein. The TrEMBL collection, on the other hand, is annotated automatically through computational analysis and is not reviewed. While TrEMBL is the best place for exploring uncharacterized protein sequences, it might not be useful for functional studies, because the inclusion of unreviewed information might add noise to the data analysis and could lead to misinterpretation of the results.

For our study, we mainly focus on using protein information from the Swiss-Prot collection. The latest data is available at the following URL: [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/).


In [37]:
data_directory = os.path.join(project_directory, 'data/uniprot')

urls = ['ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/uniprot/uniprot_sprot.dat.gz


### [HUGO Gene Nomenclature Committee](https://www.genenames.org/)

HGNC is the committee responsible for approving human gene symbols and names. This information will be used to standardize the names of human genes and proteins for data integration and comparison.

In [38]:
data_directory = os.path.join(project_directory, 'data/hgnc')

urls = ['ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/json/hgnc_complete_set.json']

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/hgnc/hgnc_complete_set.json


### [Gene Ontology](http://geneontology.org/)

Gene Ontology is an initiative to provide consistent descriptions of genes and gene products across organisms. The Gene Ontology Consortium provides structured ontologies that describe gene products in terms of their associated biological processes, cellular components and molecular functions. These ontologies are species-independent. Therefore, the annotations for a particular organism have to be downloaded separately. Here, we used the annotations for human provided by [EBI](http://geneontology.org/page/download-annotations).

In [39]:
data_directory = os.path.join(project_directory, 'data/go')

urls = [
    'http://purl.obolibrary.org/obo/go.obo',
    'http://geneontology.org/gene-associations/goa_human.gaf.gz'
]

download_files(urls, data_directory)

Downloaded /projects/ooihs/ReNet/data/go/go.obo
Downloaded /projects/ooihs/ReNet/data/go/goa_human.gaf.gz
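The annotation file links gene products to GO terms. As a minimal sketch, the function below collects the GO identifiers per gene symbol. It assumes the GAF layout in which comment lines start with `!`, the third tab-separated column holds the gene symbol and the fifth holds the GO ID. `read_gaf` is a hypothetical helper, and the sample lines use made-up identifiers.

```python
from collections import defaultdict


def read_gaf(lines):
    """Map each gene symbol to the set of GO IDs annotated to it."""
    annotations = defaultdict(set)
    for line in lines:
        if line.startswith('!'):
            continue  # header and comment lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        symbol, go_id = fields[2], fields[4]
        annotations[symbol].add(go_id)
    return annotations


# Made-up sample lines, truncated after the GO ID column
sample = [
    '!gaf-version: 2.1\n',
    'UniProtKB\tP00001\tGENE1\t\tGO:0000001\n',
    'UniProtKB\tP00001\tGENE1\t\tGO:0000002\n',
    'UniProtKB\tP00002\tGENE2\t\tGO:0000001\n',
]

annotations = read_gaf(sample)
print(sorted(annotations['GENE1']))
```

For the downloaded file, the lines can be read with `gzip.open(path, 'rt')` and passed to the same function.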
