In [2]:
import os

In [1]:
import re

---

# Preparing genetic interactions from BIOGRID

**2015 July 4-6, 8; August 3**

Previously, in Lee et al. "Predicting genetic modifier loci using functional gene networks" (2010), yeast genetic interactions were predicted using the functional network approach. The sources for the seed data were:

- Tong AH et al. 2004. Global mapping of the yeast genetic interaction network. Science 303:808. PMID 14764870.
- Davierwala AP et al. 2005. The synthetic genetic interaction spectrum of essential genes. Nat Genet 37:1147. PMID 16155567.

Lee et al. used YeastNet v2, and while it contains genetic interactions, it doesn't contain the most recent datasets, such as the Costanzo et al. 2010 Science paper. The objective now is to evaluate the functional network approach on these newer interaction data. Note that there is now a newer version of YeastNet (v3).

The sources listed above will need to be excluded from this analysis. The BIOGRID database has entries for "Author" and "Pubmed ID". Checking whether the Tong AH (2004) and Davierwala AP (2005) papers are listed in version 3.3.124:

In [1]:
%%bash
cd /work/jyoung/DataDownload/BIOGRID/BIOGRID-ORGANISM-3.3.124.tab2/
grep 14764870 BIOGRID-ORGANISM-Saccharomyces_cerevisiae_S288c-3.3.124.tab2.txt | wc -l

4369


In [2]:
%%bash
cd /work/jyoung/DataDownload/BIOGRID/BIOGRID-ORGANISM-3.3.124.tab2/
grep 14764870 BIOGRID-ORGANISM-Saccharomyces_cerevisiae_S288c-3.3.124.tab2.txt | head -1

108850	855450	851373	35569	31885	YNL271C	YDL225W	BNI1	SHS1	PPF3|SHE5|formin BNI1|L000000190	SEP7|septin SHS1	Synthetic Lethality	genetic	Tong AH (2004)	14764870	559292	559292	High Throughput	-	-	inviable	-	-	BIOGRID


In [3]:
%%bash
cd /work/jyoung/DataDownload/BIOGRID/BIOGRID-ORGANISM-3.3.124.tab2/
grep 16155567 BIOGRID-ORGANISM-Saccharomyces_cerevisiae_S288c-3.3.124.tab2.txt | wc -l

567


In [4]:
%%bash
cd /work/jyoung/DataDownload/BIOGRID/BIOGRID-ORGANISM-3.3.124.tab2/
grep 16155567 BIOGRID-ORGANISM-Saccharomyces_cerevisiae_S288c-3.3.124.tab2.txt | head -1

484166	851532	852538	32027	32931	YDL029W	YBR236C	ARP2	ABD1	ACT2|actin-related protein 2|L000000026	L000000011	Synthetic Growth Defect	genetic	Davierwala AP (2005)	16155567	559292	559292	High Throughput	-	-	vegetative growth	SGA screen	-	BIOGRID


Also checking the Costanzo 2010 Science paper:

In [5]:
%%bash
cd /work/jyoung/DataDownload/BIOGRID/BIOGRID-ORGANISM-3.3.124.tab2/
grep 20093466 BIOGRID-ORGANISM-Saccharomyces_cerevisiae_S288c-3.3.124.tab2.txt | wc -l

68087


In [6]:
%%bash
cd /work/jyoung/DataDownload/BIOGRID/BIOGRID-ORGANISM-3.3.124.tab2/
grep 20093466 BIOGRID-ORGANISM-Saccharomyces_cerevisiae_S288c-3.3.124.tab2.txt | head -1

354378	851236	850395	31767	31012	YAL063C	YCR028C-A	FLO9	RIM1	flocculin FLO9|L000003331	L000001640	Negative Genetic	genetic	Costanzo M (2010)	20093466	559292	559292	High Throughput	-0.1969	-	colony size	A Synthetic Genetic Array (SGA) analysis was carried out to quantitatively score genetic interactions based on fitness defects that were estimated from the colony size of double versus single mutants. Genetic interactions were considered significant if they had an SGA score of epsilon > 0.08 for positive interactions and epsilon < -0.08 for negative interactions, and a p-value < 0.05.	-	BIOGRID
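
The `grep` checks above match the PubMed ID anywhere on the line, so in principle a hit could come from another numeric column. A stricter check compares against the "Pubmed ID" column only. The following is a minimal sketch of that idea; the function name `count_rows_by_pmid` is hypothetical, and it assumes the file has a tab-separated header row containing "Pubmed ID".

```python
import csv
from collections import Counter


def count_rows_by_pmid(path, pmids):
    """Count rows whose 'Pubmed ID' column exactly equals one of `pmids`.

    Hypothetical helper: assumes a tab-separated file whose first line
    is a header row that contains a 'Pubmed ID' column.
    """
    wanted = set(pmids)
    counts = Counter()
    with open(path, newline='') as handle:
        # QUOTE_NONE: treat quote characters in free-text columns literally
        reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
        header = next(reader)
        pmid_col = header.index('Pubmed ID')
        for row in reader:
            if len(row) > pmid_col and row[pmid_col] in wanted:
                counts[row[pmid_col]] += 1
    return counts
```

Called with the *S. cerevisiae* tab2 file and the list `['14764870', '16155567', '20093466']`, this would give per-paper counts that can be compared with the `wc -l` results.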


<s>So now process the BIOGRID *S. cerevisiae* file to remove Tong AH (2004) and Davierwala AP (2005), and keep only those lines for genetic interactions. But first download the latest version of BIOGRID (v3.4.126).</s>

The Tong and Davierwala datasets were used as seed sets in the 2010 Genome Research paper. While these definitely need to be removed, note that YeastNet v2 [1] itself contains other genetic interactions. In particular, "genetic interactions (GI) were collected from...BioGRID (downloaded on June 2006)..." [1]. 

Therefore, remove every genetic interaction published before 2007 from BIOGRID. This also removes everything from Tong and Davierwala.

[1] Lee I, Li Z, Marcotte EM (2007) An Improved, Bias-Reduced Probabilistic Functional Gene Network of Baker's Yeast, Saccharomyces cerevisiae. PLoS ONE 2(10): e988.

In [3]:
os.chdir('/work/jyoung/DataDownload/BIOGRID/BIOGRID-ORGANISM-3.4.127.tab2')

In [4]:
inPath = 'BIOGRID-ORGANISM-Saccharomyces_cerevisiae_S288c-3.4.127.tab2.txt'
outPath = '/work/jyoung/genetic_interact/data/BIOGRID-3.4.127-for-yeastnetv2.txt'
with open(inPath) as readFile, open(outPath, 'w') as writeFile:
    # the header row gives the column positions; it is not written to the output
    header = readFile.readline().rstrip().split('\t')
    PMIDcolNum = header.index('Pubmed ID')
    expSysTypeCol = header.index('Experimental System Type')
    authorCol = header.index('Author')
    for line in readFile:
        tokens = line.rstrip().split('\t')
        ##if tokens[PMIDcolNum] != '14764870' and tokens[PMIDcolNum] != '16155567':
        ##    if tokens[expSysTypeCol] == 'genetic':
        ##        writeFile.write(line)
        if tokens[expSysTypeCol] == 'genetic':
            # the publication year is in parentheses, e.g. 'Tong AH (2004)'
            yearMatch = re.search(r'\((\d+)\)', tokens[authorCol])
            # keep only interactions published after 2006;
            # rows without a parseable year are skipped
            if yearMatch and int(yearMatch.group(1)) > 2006:
                writeFile.write(line)
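
As a quick check of the year filter, the regular expression can be tried on the "Author" values seen in the `grep` output above. This is a minimal sketch: the helper name `publication_year` is hypothetical, and restricting the match to four digits and returning `None` for a missing year are assumptions added here.

```python
import re

# Assumption: the year appears as four digits in parentheses
YEAR_PATTERN = re.compile(r'\((\d{4})\)')


def publication_year(author_field):
    """Return the year in an 'Author' value such as 'Tong AH (2004)', or None."""
    match = YEAR_PATTERN.search(author_field)
    return int(match.group(1)) if match else None


for author in ['Tong AH (2004)', 'Davierwala AP (2005)', 'Costanzo M (2010)']:
    year = publication_year(author)
    status = 'kept' if year is not None and year > 2006 else 'removed'
    print(author, year, status)
```

With the `year > 2006` cutoff, the Tong and Davierwala rows fall on the removed side and the Costanzo rows are retained.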

Download YeastNet v2:

    wget http://www.functionalnet.org/yeastnet/data/yeastnet2.benchmarkset.gene.txt
    wget http://www.functionalnet.org/yeastnet/data/yeastnet2.benchmarkset.orf.txt
    wget http://www.functionalnet.org/yeastnet/data/yeastnet2.gene.txt
    wget http://www.functionalnet.org/yeastnet/data/yeastnet2.orf.txt
    wget http://www.functionalnet.org/yeastnet/data/yeastnet2-evidence.gene.txt
    wget http://www.functionalnet.org/yeastnet/data/yeastnet2-evidence.orf.txt

The network genes are given either as gene names (e.g. RPC19) or as ORFs (e.g. YNL113W). The function *read_biogrid( )* in the script *func_net_pred.py* will need to be modified to accept arguments for compiling genetic interactions by either systematic name or official symbol.

**2015 August 2**

Go back and download the newest version (3.4.127) of BIOGRID:

    wget http://thebiogrid.org/downloads/archives/Release%20Archive/BIOGRID-3.4.127/BIOGRID-ORGANISM-3.4.127.tab2.zip
    mkdir BIOGRID-ORGANISM-3.4.127.tab2
    unzip BIOGRID-ORGANISM-3.4.127.tab2.zip -d BIOGRID-ORGANISM-3.4.127.tab2/

Re-do the removal of old genetic interaction datasets as above. 

Modify the *read_biogrid( )* function in *func_net_pred.py* to read in the columns for systematic name interactors. Note that the columns are now a hard-coded parameter. Counting from 0, systematic name interactors are columns 5 & 6, while official symbol interactors are columns 7 & 8.
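
One way to make the column choice explicit is to look it up by name type. The sketch below is not the actual *read_biogrid( )* from *func_net_pred.py*, whose signature and return value are not shown here; `read_interactions_sketch`, `INTERACTOR_COLS`, and the returned structure are hypothetical. It assumes 0-based columns 5 & 6 and 7 & 8 for the interactors and column 11 for the experimental system.

```python
from collections import defaultdict

# 0-based column positions of interactors A and B (assumed tab2 layout)
INTERACTOR_COLS = {'systematic': (5, 6), 'symbol': (7, 8)}
EXP_SYSTEM_COL = 11  # experimental system, e.g. 'Synthetic Lethality'


def read_interactions_sketch(lines, name_type='systematic'):
    """Group unordered gene pairs by experimental system.

    Hypothetical illustration only; `lines` is any iterable of
    tab-separated BIOGRID tab2 rows without a header.
    """
    col_a, col_b = INTERACTOR_COLS[name_type]
    pairs_by_type = defaultdict(set)
    for line in lines:
        tokens = line.rstrip('\n').split('\t')
        # sort so that (A, B) and (B, A) count as the same pair
        pair = tuple(sorted((tokens[col_a], tokens[col_b])))
        pairs_by_type[tokens[EXP_SYSTEM_COL]].add(pair)
    return pairs_by_type
```

Passing an open file handle as `lines` works as well, since the function only iterates over rows.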

# Breakdown of experimental evidence codes

**2015 August 3**

In [5]:
os.chdir('/work/jyoung/genetic_interact/data')

In [6]:
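def experimental_evidence_breakdown(filename, colNum):
    """
    Print the number of distinct genes and the number of interactions
    per experimental system in a BIOGRID tab2 file without a header row.
    INPUT:
    1.) <string> name of BIOGRID file to read
    2.) <integer> 0-based index of the first interactor column;
        the second interactor is read from the next column
    """
    typeCounts = dict()
    genes = set()
    with open(filename) as biogridFile:
        for line in biogridFile:
            tokens = line.split('\t')
            genes.update(tokens[colNum:colNum+2])
            # column 11 (0-based) is 'Experimental System'
            typeCounts[tokens[11]] = typeCounts.get(tokens[11], 0) + 1
    print('Number of genes:', len(genes))
    print('Experimental evidence code breakdown:')
    print(typeCounts)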

In [7]:
biogridFilename = 'BIOGRID-3.4.127-for-yeastnetv2.txt'

In [8]:
sysNameCol = 5  # systematic name

In [9]:
experimental_evidence_breakdown(biogridFilename, sysNameCol)

Number of genes: 5558
Experimental evidence code breakdown:
{'Dosage Rescue': 1897, 'Phenotypic Suppression': 5335, 'Negative Genetic': 110089, 'Phenotypic Enhancement': 5343, 'Synthetic Lethality': 6160, 'Synthetic Growth Defect': 13287, 'Positive Genetic': 23237, 'Synthetic Rescue': 4114, 'Dosage Lethality': 1120, 'Dosage Growth Defect': 1859, 'Synthetic Haploinsufficiency': 282}


In [10]:
symbolCol = 7  # official symbol

In [11]:
experimental_evidence_breakdown(biogridFilename, symbolCol)

Number of genes: 5593
Experimental evidence code breakdown:
{'Dosage Rescue': 1897, 'Phenotypic Suppression': 5335, 'Negative Genetic': 110089, 'Phenotypic Enhancement': 5343, 'Synthetic Lethality': 6160, 'Synthetic Growth Defect': 13287, 'Positive Genetic': 23237, 'Synthetic Rescue': 4114, 'Dosage Lethality': 1120, 'Dosage Growth Defect': 1859, 'Synthetic Haploinsufficiency': 282}
