# GO enrichment analysis of Ebola virus - human protein-protein interaction network and bat orthologues.

Pieter Moris 2017
ADReM - UAntwerpen

Continuation of Ben Verhees' work in 2015

## Data collection and pre-processing
The entire pre-processing pipeline was bundled in a bash script, `data_preprocessing/data_setup.sh`.

The Ebola virus - human protein-protein interaction network was retrieved from the Host - Pathogen Interaction Database (HPIDB) 2.0 (http://www.agbase.msstate.edu/hpi/main.html, last updated June 28, 2016) by downloading the full dataset and extracting all interactions involving Ebola virus. The following NCBI taxon IDs were used:

```
128951  Ebola virus - Zaire (1995) count: 3
128952  Ebola virus - Mayinga, Zaire, 1976 count: 254
186538  Zaire ebolavirus count: 1
386032  Reston ebolavirus - Reston (1989) count: 3
9606    Homo sapiens count: 261
```

Note that the Reston virus was also included in these analyses.
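The actual extraction is done by the bash script mentioned above. Purely as an illustration of the selection rule, the sketch below keeps an interaction when one partner is human and the other carries one of the listed Ebola taxon IDs. The `(taxid_a, taxid_b)` tuple representation and the function name `is_ebola_human` are assumptions made for this example and do not reflect the HPIDB file layout.

```python
# NCBI taxon IDs listed above (Ebola virus strains, including Reston).
EBOLA_TAXA = {"128951", "128952", "186538", "386032"}
HUMAN_TAXON = "9606"


def is_ebola_human(taxid_a, taxid_b):
    """Return True when one partner is human and the other an Ebola taxon."""
    pair = {taxid_a, taxid_b}
    return HUMAN_TAXON in pair and bool(pair & EBOLA_TAXA)


# Hypothetical records, invented for illustration only.
records = [
    ("9606", "128952"),   # human - Ebola (kept)
    ("386032", "9606"),   # Reston - human (kept, order does not matter)
    ("9606", "9606"),     # human - human (dropped)
    ("9606", "11103"),    # human - some other taxon (dropped)
]
kept = [r for r in records if is_ebola_human(*r)]
print(len(kept), "of", len(records), "records kept")
```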

The protein-protein interaction network contained 261 interactions in total (155 after omitting identical interactions with different evidence levels) and involved 147 unique human proteins.
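The three counts reported here (all records, unique interactions, unique human proteins) can be derived with plain set operations. The following is a minimal sketch under the assumption that each record is a `(virus protein, human protein, evidence)` tuple; the identifiers are placeholders, not entries from the dataset.

```python
# Placeholder records: (virus protein, human protein, evidence label).
interactions = [
    ("VIRUS_1", "HUMAN_A", "evidence_x"),
    ("VIRUS_1", "HUMAN_A", "evidence_y"),  # same pair, different evidence
    ("VIRUS_2", "HUMAN_A", "evidence_x"),
    ("VIRUS_2", "HUMAN_B", "evidence_x"),
]

# Ignoring the evidence column collapses duplicated interactions.
unique_pairs = {(virus, human) for virus, human, _ in interactions}

# Unique human proteins, regardless of which virus protein they bind.
unique_human = {human for _, human, _ in interactions}

print(len(interactions), len(unique_pairs), len(unique_human))
```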

Next, the one-on-one orthology mapping between the human and *Myotis lucifugus* genomes was retrieved from the genome pair view in the OMA browser (http://omabrowser.org/oma/genomePW/). In total, 20,121 pairs were found in the database, although a number of proteins were involved in multiple pair mappings. The total number of unique human proteins for which at least one orthologue was found was 15,033.

Finally, the human gene ontology mapping (in .obo format) and gene associations (in .gaf format) were obtained from http://purl.obolibrary.org/obo/go/releases/2016-11-26/ and ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/ respectively.

It was found that 27 of the Ebola-interacting human proteins lacked an orthologue in *Myotis lucifugus*.
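Identifying the interacting proteins without an orthologue amounts to a set difference between the interacting human proteins and the human side of the orthologue pairs. The sketch below illustrates this with placeholder identifiers; the function name `lacking_orthologue` and the `(human, bat)` pair layout are assumptions for this example, not the format of the OMA download.

```python
def lacking_orthologue(interacting, orthologue_pairs):
    """Return interacting human proteins that appear in no orthologue pair."""
    # A human protein may occur in several pairs; the set removes repeats.
    with_orthologue = {human for human, _bat in orthologue_pairs}
    return set(interacting) - with_orthologue


# Placeholder identifiers, invented for illustration only.
interacting = {"HUMAN_A", "HUMAN_B", "HUMAN_C"}
orthologue_pairs = [
    ("HUMAN_A", "BAT_1"),
    ("HUMAN_A", "BAT_2"),  # one human protein, multiple pair mappings
    ("HUMAN_D", "BAT_3"),  # has an orthologue but is not in the network
]

missing = lacking_orthologue(interacting, orthologue_pairs)
print(sorted(missing))
```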

## Gene ontology enrichment analysis

### Are Ebola-interacting human genes that lack a bat orthologue enriched?

An enrichment analysis was carried out to answer the following research question: "Of all the human proteins that interact with Ebola virus proteins, is the subset that lacks a corresponding bat orthologue enriched for a particular biological process or function?"

In other words, we compare the entire set of human proteins in the human-Ebola protein-protein interaction network with the subset of those proteins that lack a bat orthologue.

The GO tool was configured to test from the bottom up, i.e. the most specific GO terms were tested first. Only when these terms were not found to be significant (default threshold 0.05) did the test propagate up the tree to the parent GO terms. In addition, any GO term with fewer than three representatives (default setting) was skipped. Finally, the Benjamini-Hochberg multiple testing correction (which controls the false discovery rate) was applied with a separate significance threshold (default 0.1).
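To make the statistics concrete, here is a minimal sketch of a per-term enrichment test followed by the Benjamini-Hochberg adjustment. A one-sided hypergeometric test is a common choice for this kind of analysis, but whether the GO tool uses exactly this test is an assumption. The function names are hypothetical, the example counts are invented, and the bottom-up propagation through the GO tree is not reproduced here.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): n proteins drawn from N, of which K carry the GO term."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented counts for a single GO term (not results from this analysis):
# 4 of 25 subset proteins and 10 of 145 background proteins carry the term.
p = hypergeom_pvalue(k=4, n=25, K=10, N=145)

# Invented raw p-values for four terms, adjusted and compared to 0.1.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.1 for q in adjusted]
print(p, adjusted, significant)
```

Computing the tail probability with exact integer binomials avoids a dependency on external libraries at the cost of speed, which is acceptable for sets of this size.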

In [1]:
import os
backgroundPath = os.path.abspath('../data/background.txt')
with open(backgroundPath, 'r') as inGenes:
    backgroundSet = set([line.rstrip() for line in inGenes][1:])
print('Retrieved', len(backgroundSet),
      'background uniprot AC\'s from', backgroundPath)
print(backgroundSet)

Retrieved 146 background uniprot AC's from /media/pieter/DATA/ebola-go/data/background.txt
{'P04844', 'O15131', 'P21796', 'P68104', 'P35232', 'P22102', 'Q8N257', 'Q99832', 'P52597', 'Q8IXQ5', 'Q96CS3', 'P51532', 'Q9Y5M8', 'P35613', 'Q14164', 'Q00839', 'P16615', 'O95816', 'P09651', 'P60866', 'Q09028', 'Q9UNL2', 'P06899', 'P62879', 'O43791', 'Q5QNW6', 'Q9BTM1', 'Q6FI13', 'P11142', 'Q16778', 'Q14974', 'P46934', 'Q8TAQ2', 'P55795', 'Q15758', 'P10606', 'Q13263', 'Q8WWM7', 'Q99879', 'P12273', 'P08238', 'Q71UM5', 'Q6P2E9', 'P49790', 'Q96B45', 'Q92688', 'O60814', 'P31943', 'P05023', 'P20700', 'Q93079', 'P61978', 'P0DMV9', 'P0C0S8', 'P31689', 'Q99729', 'P14866', 'P52294', 'P49411', 'Q9BQE3', 'O00264', 'Q7L7L0', 'Q96TA2', 'P48643', 'Q14257', 'Q32P51', 'Q99878', 'Q9P258', 'Q08211', 'F8VVM2', 'P53992', 'Q00325', 'O95831', 'P57053', 'Q96KK5', 'Q99880', 'O15269', 'P22626', 'Q96A08', 'P45880', 'Q12906', 'P27708', 'O43175', 'Q96AX9', 'P68431', 'P78537', 'P51571', 'Q969G3', 'P62988', 'P04908', 'P20671'

In [2]:
%run ../go-enrichment-tool/go_enrichment_script.py -b ../data/background.txt -s ../data/interest.txt -o ../data/go_data/go.obo -g ../data/go_data/goa_human.gaf -O ../output -m 3 -t 0.1 -T 0.05

Retrieved 27 subset uniprot AC's from /media/pieter/DATA/ebola-go/data/interest.txt
Retrieved 147 background uniprot AC's from /media/pieter/DATA/ebola-go/data/background.txt
Not every uniprot AC that was provided in the background set was found in the GAF file:
['F8VVM2', 'P62988']

Retrieved 145 annotated (background filtered) uniprot AC's from /media/pieter/DATA/ebola-go/data/go_data/goa_human.gaf

Not every uniprot AC that was provided in the gene subset was found in the GAF file:
['P62988', 'F8VVM2']

Retrieved 45914 GO terms from /media/pieter/DATA/ebola-go/data/go_data/go.obo
background
 {'P04844', 'O15131', 'P21796', 'P68104', 'P35232', 'P22102', 'Q8N257', 'Q99832', 'P52597', 'Q8IXQ5', 'Q96CS3', 'P51532', 'Q9Y5M8', 'P35613', 'Q14164', 'Q00839', 'P16615', 'O95816', 'P09651', 'P60866', 'Q09028', 'Q9UNL2', 'P06899', 'P62879', 'O43791', 'Q5QNW6', 'Q9BTM1', 'Q6FI13', 'P11142', 'Q16778', 'Q14974', 'A8CG34', 'P46934', 'Q8TAQ2', 'P55795', 'Q15758', 'P10606', 'Q13263', 'Q8WWM7', 'Q99879

### Are human genes that lack a bat orthologue enriched compared to the entire human annotation set?

In [None]:
%run ../go-enrichment-tool/go_enrichment_script.py -b ../data/background.txt -s ../data/interest.txt -o ../data/go_data/go.obo -g ../data/go_data/goa_human.gaf -O ../output -m 0 -t 0.1 -T 0 

Retrieved 27 subset uniprot AC's from /media/pieter/DATA/ebola-go/data/interest.txt
Retrieved 147 background uniprot AC's from /media/pieter/DATA/ebola-go/data/background.txt
Not every uniprot AC that was provided in the background set was found in the GAF file:
['F8VVM2', 'P62988']

Retrieved 145 annotated (background filtered) uniprot AC's from /media/pieter/DATA/ebola-go/data/go_data/goa_human.gaf

Not every uniprot AC that was provided in the gene subset was found in the GAF file:
['P62988', 'F8VVM2']

Retrieved 45914 GO terms from /media/pieter/DATA/ebola-go/data/go_data/go.obo
background
 {'P04844', 'O15131', 'P21796', 'P68104', 'P35232', 'P22102', 'Q8N257', 'Q99832', 'P52597', 'Q8IXQ5', 'Q96CS3', 'P51532', 'Q9Y5M8', 'P35613', 'Q14164', 'Q00839', 'P16615', 'O95816', 'P09651', 'P60866', 'Q09028', 'Q9UNL2', 'P06899', 'P62879', 'O43791', 'Q5QNW6', 'Q9BTM1', 'Q6FI13', 'P11142', 'Q16778', 'Q14974', 'A8CG34', 'P46934', 'Q8TAQ2', 'P55795', 'Q15758', 'P10606', 'Q13263', 'Q8WWM7', 'Q99879

In [None]:
%run ../go-enrichment-tool/go_enrichment_script.py -s ../data/interest.txt -o ../data/go_data/go.obo -g ../data/go_data/goa_human.gaf -O ../output -m 0 -t 0.1 -T 0 