# Online Transcription Factor Databases Comparison

----------------------

Author: Mikayla Webster (m1webste@ucsd.edu)

Date: 18th October, 2017

----------------------

<a id='toc'></a>
## Table of Contents
1. [Import packages](#import)
2. [Slowkow](#slowkow)
    3. [TRED](#tred)
    4. [ITFP](#itfp)
    5. [ENCODE](#encode)
    6. [Neph2012](#neph2012)
    7. [TRRUST](#trrust)
    8. [Marbach2016](#marbach2016)
9. [Jaspar Genereg](#jaspar)

## Import packages
<a id='import'></a>

In [2]:
import numpy as np
import pandas as pd
import networkx as nx
import scipy
from scipy import stats
import mygene
import math

import matplotlib as mpl
mpl.rc('text', usetex = False)
mpl.rc('font', family = 'serif')
%matplotlib inline

## Slowkow
<a id='slowkow'></a>

Slowkow is a compilation of six online databases of transcription factors (TFs) and their gene associations: TRED, ITFP, ENCODE, Neph2012, TRRUST, and Marbach2016. The compilation, along with links to each individual database's source page, is available for download from this [GitHub repository](https://github.com/slowkow/tftargets). The raw data file is an R data file named "tftargets.rda". To extract the databases from this file for use in Jupyter Notebooks, I used [RStudio](https://www.rstudio.com/) to write the TF gene symbols to text files. I did not write each TF's gene associations to file.

First, I navigated to the directory containing "tftargets.rda":
    1. setwd("C:/Users/m1webste/Documents/CCBB_2/URA")

Next, I loaded the tftargets data structures into my workspace:
    1. tftargets = load("slowkow_databases/tftargets.rda")

I wrote each database's TF names to its own text file, named after the database. All databases except Neph2012 are structured as a list of lists: the name of each list is the TF, and the contents of each list are its gene associations. I printed the names to file in a single column so that the TFs could later be parsed using newlines as separators.
    1. write(names(TRED), file = "TRED_TF.txt", ncolumns = 1)
    2. write(names(ITFP), file = "ITFP_TF.txt", ncolumns = 1)
    3. write(names(ENCODE), file = "ENCODE_TF.txt", ncolumns = 1)
    4. write(names(TRRUST), file = "TRRUST_TF.txt", ncolumns = 1)
    5. write(names(Marbach2016), file = "Marbach2016_TF.txt", ncolumns = 1)
    
Neph2012 is structured as several nested lists, and therefore required more than printing the names of each list to file:

    for (i in (1:41)){
      for (name in names(Neph2012[[i]])){
        write(name, file = "Neph2012_TF.txt", ncolumns = 1, append = TRUE, sep = "\t")
      }
    }



In [3]:
def load_subdb(file_name):

    # read files formatted as \n separated items
    with open(file_name) as f:
        lines = f.read().splitlines()

    # convert everything to ALL CAPS (keep the result of the comprehension)
    lines = [x.upper() for x in lines]

    # remove duplicates
    return set(lines)
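
As a minimal sketch of what `load_subdb` is meant to do (read one name per line, convert to upper case, remove duplicates), the example below runs the same steps on a temporary file. The helper name `load_tf_names` and the sample names are assumptions made for illustration; they are not part of the Slowkow data.

```python
import os
import tempfile


def load_tf_names(file_name):
    """Hypothetical helper: one name per line -> set of upper-case names."""
    with open(file_name) as f:
        lines = f.read().splitlines()
    return set(name.upper() for name in lines)


# Made-up sample: 'Tp53'/'TP53' and 'Myc'/'MYC' differ only by case.
sample = "TP53\nMyc\nTp53\nMYC\nGATA1\n"

# Write the sample to a temporary file so the helper can read it from disk.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

names = load_tf_names(path)
os.remove(path)

print(sorted(names))
```

Because the names are upper-cased before the set is built, entries that differ only by case collapse into a single name.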

In [8]:
def load_all_others(exclude):

    # the six Slowkow sub-databases; each has a '<name>_TF.txt' file
    db_names = ['TRED', 'ITFP', 'ENCODE', 'Neph2012', 'TRRUST', 'Marbach2016']

    # pass the string "None" to load all six databases
    if exclude != "None" and exclude not in db_names:
        print('Bad input')
        return set()

    file_list = ['./slowkow_databases/' + name + '_TF.txt'
                 for name in db_names if name != exclude]

    # read files formatted as \n separated items
    return_list = []
    for file_name in file_list:
        with open(file_name) as f:
            return_list.extend(f.read().splitlines())

    # convert everything to ALL CAPS and remove duplicates
    return set(x.upper() for x in return_list)

### TRED
<a id='tred'></a>

In [16]:
tred_list = load_subdb('./slowkow_databases/TRED_TF.txt')
not_tred = load_all_others("TRED")
print('Size of TRED: ' + str(len(tred_list)))
print('Size of Slowkow without TRED: ' + str(len(not_tred)))

# TFs found in TRED but in none of the other five databases
contribution1 = len(tred_list) - len(tred_list & not_tred)
print("TRED's contribution to slowkow: " + str(contribution1))

Size of TRED: 133
Size of Slowkow without TRED: 2697
TRED's contribution to slowkow: 8
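
The "contribution" reported above is the number of TF names that appear in one database and in none of the others. The toy example below, with made-up names rather than real TRED entries, sketches that calculation and shows that it matches a plain set difference.

```python
# Made-up TF names, for illustration only.
database = {"TP53", "MYC", "GATA1", "SOX2"}
all_others = {"MYC", "GATA1", "FOXA1", "STAT3"}

# Size of the database minus the names it shares with the other databases.
shared = database & all_others
contribution = len(database) - len(shared)

# The same names can be listed directly with a set difference.
unique_names = database - all_others

print(contribution)
print(sorted(unique_names))
```

Listing the names (rather than only counting them) makes it easy to inspect which TFs a database adds.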


### ITFP
<a id='itfp'></a>

In [17]:
itfp_list = load_subdb('./slowkow_databases/ITFP_TF.txt')
not_itfp = load_all_others("ITFP")
print('Size of ITFP: ' + str(len(itfp_list)))
print('Size of Slowkow without ITFP: ' + str(len(not_itfp)))

# TFs found in ITFP but in none of the other five databases
contribution2 = len(itfp_list) - len(itfp_list & not_itfp)
print("ITFP's contribution to slowkow: " + str(contribution2))

Size of ITFP: 1974
Size of Slowkow without ITFP: 1130
ITFP's contribution to slowkow: 1575


### ENCODE
<a id='encode'></a>

In [18]:
encode_list = load_subdb('./slowkow_databases/ENCODE_TF.txt')
not_encode = load_all_others("ENCODE")
print('Size of ENCODE: ' + str(len(encode_list)))
print('Size of Slowkow without ENCODE: ' + str(len(not_encode)))

# TFs found in ENCODE but in none of the other five databases
contribution3 = len(encode_list) - len(encode_list & not_encode)
print("ENCODE's contribution to slowkow: " + str(contribution3))

Size of ENCODE: 157
Size of Slowkow without ENCODE: 2676
ENCODE's contribution to slowkow: 29


### Neph2012
<a id='neph2012'></a>

In [19]:
neph2012_list = load_subdb('./slowkow_databases/Neph2012_TF.txt')
not_neph2012 = load_all_others("Neph2012")
print('Size of Neph2012: ' + str(len(neph2012_list)))
print('Size of Slowkow without Neph2012: ' + str(len(not_neph2012)))

# TFs found in Neph2012 but in none of the other five databases
contribution4 = len(neph2012_list) - len(neph2012_list & not_neph2012)
print("Neph2012's contribution to slowkow: " + str(contribution4))

Size of Neph2012: 536
Size of Slowkow without Neph2012: 2682
Neph2012's contribution to slowkow: 23


### TRRUST
<a id='trrust'></a>

In [21]:
trrust_list = load_subdb('./slowkow_databases/TRRUST_TF.txt')
not_trrust = load_all_others("TRRUST")
print('Size of TRRUST: ' + str(len(trrust_list)))
print('Size of Slowkow without TRRUST: ' + str(len(not_trrust)))

# TFs found in TRRUST but in none of the other five databases
contribution5 = len(trrust_list) - len(trrust_list & not_trrust)
print("TRRUST's contribution to slowkow: " + str(contribution5))

Size of TRRUST: 748
Size of Slowkow without TRRUST: 2521
TRRUST's contribution to slowkow: 184


### Marbach2016
<a id='marbach2016'></a>

In [22]:
marbach2016_list = load_subdb('./slowkow_databases/Marbach2016_TF.txt')
not_marbach2016 = load_all_others("Marbach2016")
print('Size of Marbach2016: ' + str(len(marbach2016_list)))
print('Size of Slowkow without Marbach2016: ' + str(len(not_marbach2016)))

# TFs found in Marbach2016 but in none of the other five databases
contribution6 = len(marbach2016_list) - len(marbach2016_list & not_marbach2016)
print("Marbach2016's contribution to slowkow: " + str(contribution6))

Size of Marbach2016: 643
Size of Slowkow without Marbach2016: 2550
Marbach2016's contribution to slowkow: 155
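
As a quick consistency check on the six results above: for each database, the size of Slowkow without it plus its unique contribution should equal the size of the full compilation. The sketch below uses only the numbers printed in the cells above. The expected total of 2705 is the Slowkow size reported in the Jaspar Genereg section.

```python
# (size of Slowkow without the database, contribution of the database),
# copied from the printed outputs above.
reported = {
    "TRED": (2697, 8),
    "ITFP": (1130, 1575),
    "ENCODE": (2676, 29),
    "Neph2012": (2682, 23),
    "TRRUST": (2521, 184),
    "Marbach2016": (2550, 155),
}

# Each sum should reproduce the size of the full compilation.
totals = {name: without + contrib
          for name, (without, contrib) in reported.items()}

for name in sorted(totals):
    print(name + ": " + str(totals[name]))
```

All six sums agree, which suggests the per-database counts were computed against the same combined set.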


## Jaspar Genereg
<a id='jaspar'></a>

The 2016 version of Jaspar's TF database can be downloaded as a .txt file [here](http://jaspar2016.genereg.net/). At the bottom of the page, click the "Download" button, open the "database" file, and download "MATRIX.txt".

In [23]:
def load_jaspar(filename):
    
    # parse jaspar file
    jasp_df = pd.read_csv(filename, sep = "\t", header= None, names = ['col1', 'col2', 'col3', 'col4', 'tf_genes'])
    
    # return transcription factors with ALL CAPS names
    return list(jasp_df['tf_genes'].str.upper())
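
The sketch below illustrates the parsing step on a tiny in-memory sample. It assumes the same layout as `load_jaspar` (five tab-separated columns, TF name in the last one). The rows are made up, and the standard-library `csv` module stands in for pandas so the example runs without extra dependencies.

```python
import csv
import io

# Made-up rows: five tab-separated columns, TF name in the last column.
sample = (
    "1\tCOLLECTION\tID0001\t1\tTfA\n"
    "2\tCOLLECTION\tID0001\t2\tTfA\n"
    "3\tCOLLECTION\tID0002\t1\tTfB\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# Keep the last column, upper-cased, as a list (repeats are preserved).
names = [row[4].upper() for row in rows]

print(len(names))       # number of rows
print(len(set(names)))  # number of distinct names
```

Returning a list keeps one entry per row, so a name that appears on several rows is counted several times unless the list is converted to a set.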

In [28]:
jasp = load_jaspar("jaspar_genereg_matrix.txt")
sk = load_all_others("None")
print("Size of Jaspar Database: " + str(len(jasp)))
print("Size of Slowkow Database: " + str(len(sk)))

# names present in both databases
shared = set(jasp) & set(sk)

# jasp is a list, so any repeated names are counted in len(jasp)
contributionj = len(jasp) - len(shared)
contributionsk = len(sk) - len(shared)
print("Number of nodes unique to Jaspar: " + str(contributionj))
print("Number of nodes unique to Slowkow: " + str(contributionsk))

Size of Jaspar Database: 2049
Size of Slowkow Database: 2705
Number of nodes unique to Jaspar: 1530
Number of nodes unique to Slowkow: 2186
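
The size of the overlap between the two databases is not printed, but it can be recovered from the four numbers above. This short sketch derives it from each side, which also checks that the two "unique" counts are mutually consistent.

```python
# Values copied from the printed output above.
jaspar_size, jaspar_unique = 2049, 1530
slowkow_size, slowkow_unique = 2705, 2186

# Both "unique" counts subtract the same shared set,
# so the two differences should match.
shared_from_jaspar = jaspar_size - jaspar_unique
shared_from_slowkow = slowkow_size - slowkow_unique

print(shared_from_jaspar)
print(shared_from_slowkow)
```

Both differences come to 519 shared names. Since the Jaspar size counts list entries rather than distinct names, the Jaspar-side figures are best read as counts of entries.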
