# GO enrichment analysis of the Ebola virus - human protein-protein interaction network and bat orthologues

Pieter Moris 2017
ADReM - UAntwerpen

Continuation of Ben Verhees' work in 2015

## Data collection and pre-processing
The entire pre-processing pipeline is bundled in the bash script `data_preprocessing/data_setup.sh`.

The Ebola virus - human protein-protein interaction network was retrieved from the Host - Pathogen Interaction Database (HPIDB) 2.0 (http://www.agbase.msstate.edu/hpi/main.html, last updated June 28, 2016) by downloading the full dataset and extracting all interactions involving Ebola virus. The following NCBI taxon IDs were used (with the number of interactions found for each):

```
128951  Ebola virus - Zaire (1995) count: 3
128952  Ebola virus - Mayinga, Zaire, 1976 count: 254
186538  Zaire ebolavirus count: 1
386032  Reston ebolavirus - Reston (1989) count: 3
9606    Homo sapiens count: 261
```

Note that interactions involving Reston ebolavirus were also included in these analyses.

The protein-protein interaction network contained 261 interactions in total (155 after omitting identical interactions that differ only in their evidence level) and involved 147 unique human proteins.
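The filtering and de-duplication described above can be sketched as follows. This is only an illustration under stated assumptions: the tuple layout `(protein_a, taxon_a, protein_b, taxon_b, evidence)`, the function name `unique_interactions` and the toy rows are hypothetical and do not reflect the actual HPIDB file format or the contents of `data_setup.sh`.

```python
# Taxon IDs taken from the list above.
EBOLA_TAXA = {"128951", "128952", "186538", "386032"}
HUMAN_TAXON = "9606"


def unique_interactions(rows):
    """Collapse human-Ebola interactions that differ only in their evidence.

    `rows` is an iterable of (protein_a, taxon_a, protein_b, taxon_b, evidence)
    tuples. This layout is an assumption made for the example.
    """
    pairs = set()
    for prot_a, tax_a, prot_b, tax_b, _evidence in rows:
        taxa = {tax_a, tax_b}
        # Keep only interactions between a human and an Ebola virus protein.
        if HUMAN_TAXON in taxa and taxa & EBOLA_TAXA:
            # Store each pair as (human, viral) so that direction is ignored.
            if tax_a == HUMAN_TAXON:
                pairs.add((prot_a, prot_b))
            else:
                pairs.add((prot_b, prot_a))
    return pairs


# Made-up rows, for illustration only.
rows = [
    ("P1", "9606", "V1", "128952", "yeast two-hybrid"),
    ("P1", "9606", "V1", "128952", "co-immunoprecipitation"),  # same pair, other evidence
    ("V2", "386032", "P2", "9606", "yeast two-hybrid"),
    ("P3", "9606", "P4", "9606", "yeast two-hybrid"),  # human-human, dropped
]
pairs = unique_interactions(rows)
human_proteins = {human for human, _viral in pairs}
print(len(pairs), len(human_proteins))
```

Using a set of `(human, viral)` tuples means that repeated reports of the same pair collapse automatically, which mirrors the reduction from 261 to 155 interactions reported above.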

Next, the one-to-one orthology mapping between the human and *Myotis lucifugus* genomes was retrieved from the genome pair view in the OMA browser (http://omabrowser.org/oma/genomePW/). In total, 20,121 pairs were found in the database, although a number of proteins were involved in multiple pair mappings. The total number of unique human proteins for which at least one orthologue was found was 15,033.

Finally, the human gene ontology mapping (in .obo format) and gene associations (in .gaf format) were obtained from http://purl.obolibrary.org/obo/go/releases/2016-11-26/ and ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/ respectively.
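For readers unfamiliar with the gene association format, a minimal sketch of reading a GAF file is given here. It relies only on the standard GAF layout (tab-separated, comment lines starting with `!`, accession in column 2, GO ID in column 5). The function name `read_gaf` and the toy accessions are hypothetical, and this is not necessarily how the enrichment tool parses the file.

```python
import io
from collections import defaultdict


def read_gaf(handle):
    """Map each accession (GAF column 2) to the set of GO IDs (GAF column 5)."""
    annotations = defaultdict(set)
    for line in handle:
        if line.startswith("!"):  # header and comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:  # skip malformed or empty lines
            continue
        accession, go_id = fields[1], fields[4]
        annotations[accession].add(go_id)
    return dict(annotations)


# Toy GAF content with made-up accessions (illustration only,
# truncated to the first five columns).
toy_gaf = io.StringIO(
    "!gaf-version: 2.1\n"
    "UniProtKB\tP00001\tGENE1\t\tGO:0008150\n"
    "UniProtKB\tP00001\tGENE1\t\tGO:0009987\n"
    "UniProtKB\tP00002\tGENE2\t\tGO:0008150\n"
)
annotations = read_gaf(toy_gaf)
print(annotations)
```

For a real file, the same function could be applied to an opened `goa_human.gaf` handle instead of the in-memory example.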

Of the 147 Ebola-interacting human proteins, 27 lacked an orthologue in *Myotis lucifugus*.
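Identifying the proteins without an orthologue amounts to a set difference between the interacting human proteins and the human side of the orthology pairs. The sketch below uses made-up identifiers, and the function name `lacking_orthologue` is hypothetical. Note that a human protein may occur in several pairs, which is why the pairs are reduced to a set first.

```python
def lacking_orthologue(interacting, orthologue_pairs):
    """Return the interacting human proteins that occur in no orthologue pair.

    `interacting` is a set of human protein identifiers and
    `orthologue_pairs` an iterable of (human, bat) identifier tuples.
    """
    # A human protein can appear in several pairs; the set removes repeats.
    with_orthologue = {human for human, _bat in orthologue_pairs}
    return interacting - with_orthologue


# Made-up identifiers, for illustration only.
interacting = {"P1", "P2", "P3"}
orthologue_pairs = [("P1", "B1"), ("P1", "B2"), ("P4", "B3")]
interest = lacking_orthologue(interacting, orthologue_pairs)
print(sorted(interest))
```

In the analysis below, the full set of interacting proteins serves as the background and the proteins without an orthologue form the subset of interest.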

## Gene ontology enrichment analysis

### Are ebola-interacting human genes that lack a bat orthologue enriched?

An enrichment analysis was carried out to answer the following research question: "Of all the human proteins that interact with Ebola virus proteins, is the subset that lacks a corresponding bat orthologue enriched for a particular biological process or function?"

In other words, the subset of human proteins that lack a bat orthologue (the set of interest) is compared against the entire set of human proteins in the human-Ebola protein-protein interaction network (the background).

The GO tool was configured to test from the bottom up, i.e. the most specific GO terms are tested first. Only when these terms are not significant (default threshold 0.05) does the test propagate upwards through the tree to the parent GO terms. In addition, any GO term with fewer than three representatives (default setting) is skipped regardless. Finally, the Benjamini-Hochberg multiple testing correction (i.e. control of the false discovery rate) is applied with a separate significance threshold (default 0.1).
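To make the testing and correction steps concrete, here is a minimal sketch of a per-term test followed by the Benjamini-Hochberg adjustment. It assumes a one-sided hypergeometric test, which is a common choice for GO enrichment but is not confirmed here as the test used by the tool. The function names and the example numbers are hypothetical, and the bottom-up propagation through the ontology is not shown.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n proteins from N, of which K carry the GO term."""
    total = comb(N, n)
    upper = min(K, n)
    # comb() returns 0 when asked to pick more items than are available,
    # so impossible outcomes contribute nothing to the sum.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical term: 10 of 145 background proteins are annotated,
# and 5 of those fall within a subset of 27 proteins.
p = hypergeom_sf(5, 145, 10, 27)

# Hypothetical raw p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
significant = [q <= 0.1 for q in adj]  # FDR threshold of 0.1, as above
print(p, adj, significant)
```

A term would then be reported only if its adjusted p-value does not exceed the false discovery rate threshold.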

In [1]:
%run ../go-enrichment-tool/go_enrichment_script.py -b ../data/background.txt -s ../data/interest.txt -o ../data/go_data/go.obo -g ../data/go_data/goa_human.gaf -O ../output -m 3 -t 0.1 -T 0.05

Retrieved 27 subset uniprot AC's from /media/pieter/DATA/ebola-go/data/interest.txt
Retrieved 147 background uniprot AC's from /media/pieter/DATA/ebola-go/data/background.txt
Retrieved 145 annotated (background filtered) uniprot AC's from /media/pieter/DATA/ebola-go/data/go_data/goa_human.gaf

Removing these genes from the background set...
Removing these genes from the interest set...
Retrieved 45914 GO terms from /media/pieter/DATA/ebola-go/data/go_data/go.obo
Propagating through ontology to find all children and parents for each term...
Tested 508 GO categories.
4 were significant at alpha = 0.05

Tested GO-terms and their p-values:
{'GO:0007077': 0.53501258333706492, 'GO:0016043': 0.5079728819467868, 'GO:0071840': 0.5079728819467868, 'GO:0008150': 0.38254283924055632, 'GO:0009987': 0.11473072350902329, 'GO:0022402': 0.59349594822832286, 'GO:0044802': 0.50982074032120683, 'GO:0044699': 0.96819060481261232, 'GO:0061024': 0.6316985466627616, 'GO:0044763': 0.93088962882998993, 'GO:0030

### Are human genes that lack a bat orthologue enriched compared to the entire human annotation set?

In [2]:
%run ../go-enrichment-tool/go_enrichment_script.py -b ../data/background.txt -s ../data/interest.txt -o ../data/go_data/go.obo -g ../data/go_data/goa_human.gaf -O ../output -m 0 -t 0.1 -T 0 

Retrieved 27 subset uniprot AC's from /media/pieter/DATA/ebola-go/data/interest.txt
Retrieved 147 background uniprot AC's from /media/pieter/DATA/ebola-go/data/background.txt
Retrieved 145 annotated (background filtered) uniprot AC's from /media/pieter/DATA/ebola-go/data/go_data/goa_human.gaf

Removing these genes from the background set...
Removing these genes from the interest set...
Retrieved 45914 GO terms from /media/pieter/DATA/ebola-go/data/go_data/go.obo
Propagating through ontology to find all children and parents for each term...
Tested 710 GO categories.
0 were significant at alpha = 0.0

Tested GO-terms and their p-values:
{'GO:0007077': 0.53501258333706492, 'GO:0016043': 0.5079728819467868, 'GO:0071840': 0.5079728819467868, 'GO:0008150': 0.38254283924055632, 'GO:0009987': 0.11473072350902329, 'GO:0022402': 0.59349594822832286, 'GO:0044802': 0.50982074032120683, 'GO:0044699': 0.96819060481261232, 'GO:0061024': 0.6316985466627616, 'GO:0044763': 0.93088962882998993, 'GO:00303

In [3]:
%run ../go-enrichment-tool/go_enrichment_script.py -s ../data/interest.txt -o ../data/go_data/go.obo -g ../data/go_data/goa_human.gaf -O ../output -m 0 -t 0.1 -T 0 

Retrieved 27 subset uniprot AC's from /media/pieter/DATA/ebola-go/data/interest.txt
No background gene set provided, retrieving all genes from the gene annotation file...
Retrieved 19363 annotated uniprot AC's from /media/pieter/DATA/ebola-go/data/go_data/goa_human.gaf after filtering on the background set.

Removing these genes from the interest set...
Retrieved 45914 GO terms from /media/pieter/DATA/ebola-go/data/go_data/go.obo
Propagating through ontology to find all children and parents for each term...
Tested 710 GO categories.
0 were significant at alpha = 0.0

Tested GO-terms and their p-values:
{'GO:0007077': 0.055320289623286112, 'GO:0016043': 6.5641025113313736e-05, 'GO:0071840': 7.1947830641113937e-05, 'GO:0008150': 0.059021186887728817, 'GO:0009987': 0.0010030095945770298, 'GO:0022402': 0.15664963900824319, 'GO:0044802': 0.012513528330479234, 'GO:0044699': 0.73222319498086497, 'GO:0061024': 0.041033170442238719, 'GO:0044763': 0.55317705464125944, 'GO:0030397': 0.06020108393

In [4]:
type(output[:,0])

numpy.ndarray

In [5]:
GOterms[output[0,0]].name

'macromolecular complex'

In [6]:
[GOterms[x].name for x in output[:,0]]

['macromolecular complex',
 'chromosomal part',
 'heterocyclic compound binding',
 'organic cyclic compound binding',
 'intracellular organelle part',
 'organelle part',
 'nucleic acid binding',
 'chromatin DNA binding',
 'nuclear part',
 'nucleosomal DNA binding',
 'structure-specific DNA binding',
 'nuclear chromosome part',
 'nucleosome binding',
 'small molecule binding',
 'cellular component organization',
 'cellular component organization or biogenesis',
 'RNA polymerase II distal enhancer sequence-specific DNA binding',
 'nucleotide binding',
 'nucleoside phosphate binding',
 'npBAF complex',
 'nucleoplasm',
 'macromolecular complex subunit organization',
 'SWI/SNF superfamily-type complex',
 'nBAF complex',
 'enhancer sequence-specific DNA binding',
 'SWI/SNF complex',
 'protein heterodimerization activity',
 'enhancer binding',
 'translation elongation factor activity',
 'cellular macromolecular complex assembly',
 'intracellular part',
 'BAF-type complex',
 'nuclear localizat