Negative set: whole genome negatives vs shuffled reference 

Single tasking vs multi-tasking for SPI1, CTCF, ZNF143, SIX5 

Wish list: Multi-input model w/ DNase + sequence as input (Abhi; outperforms FactorNet within the same cell type) 

-reverse complements in same minibatch 

-guaranteed proportion of positives in minibatch (or not) 
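
As a rough illustration of the reverse-complement item above, here is a minimal sketch that appends the reverse complement of every sequence to the same minibatch. It assumes one-hot inputs of shape `(batch, length, 4)` with channels in `ACGT` order; the helper names `one_hot` and `add_reverse_complements` are hypothetical and are not part of seqdataloader.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order for the one-hot encoding


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases (e.g. N) stay all-zero."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            arr[i, j] = 1.0
    return arr


def add_reverse_complements(x, y):
    """Return the batch followed by its reverse complements, with labels duplicated.

    x: (batch, length, 4) one-hot array in ACGT order
    y: (batch, n_tasks) label array
    With ACGT ordering, flipping the position axis reverses the sequence and
    flipping the channel axis swaps A<->T and C<->G.
    """
    x_rc = x[:, ::-1, ::-1]
    return np.concatenate([x, x_rc], axis=0), np.concatenate([y, y], axis=0)
```

Keeping both strands in one batch means each gradient step sees a sequence and its reverse complement with identical labels.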

## Input data 

We will learn to predict transcription factor binding for four transcription factors in the GM12878 cell line (one of the Tier 1 cell lines for the ENCODE project). First, we download the narrowPeak bed files for each of these transcription factors. You can skip the following code block if you already have the data downloaded. 

In [None]:
## CTCF, optimal IDR thresholded peaks, Stam Lab, hg19
# https://www.encodeproject.org/experiments/ENCSR000DRZ/
!wget https://www.encodeproject.org/files/ENCFF473RXY/@@download/ENCFF473RXY.bed.gz 

## SPI1, optimal IDR thresholded peaks, Myers lab, hg19
# https://www.encodeproject.org/experiments/ENCSR000BGQ/
!wget https://www.encodeproject.org/files/ENCFF002CHQ/@@download/ENCFF002CHQ.bed.gz
    
## ZNF143, optimal IDR thresholded peaks, Snyder lab, hg19
# https://www.encodeproject.org/experiments/ENCSR936XTK/
!wget https://www.encodeproject.org/files/ENCFF544NXC/@@download/ENCFF544NXC.bed.gz

## SIX5, optimal IDR thresholded peaks, Myers Lab, hg19
# https://www.encodeproject.org/experiments/ENCSR000BJE/
!wget https://www.encodeproject.org/files/ENCFF606WUV/@@download/ENCFF606WUV.bed.gz


We will use the [seqdataloader](https://github.com/kundajelab/seqdataloader) package to generate positive and negative labels for the TF ChIP-seq peaks across the genome. We will treat each transcription factor as a separate task for the model and compare performance on the SPI1 task in the single-tasked and multi-tasked settings.

In [1]:
## seqdataloader accepts an input file, which we call tasks.tsv, with task names in column 1 and the corresponding
## peak files in column 2 
!cat tasks.tsv 

SPI1	ENCFF002CHQ.bed.gz	
CTCF	ENCFF473RXY.bed.gz	
ZNF143	ENCFF544NXC.bed.gz	
SIX5	ENCFF606WUV.bed.gz	


In [2]:
from genomewide_labels import * 
# We will include all chromosomes except 1, 2, and 19 in our training set
train_set_params={
    'task_list':"tasks.tsv",
    'outf':"TF.train.tsv.gz",
    'output_type':'gzip',
    'chrom_sizes':'hg38.chrom.sizes',
    'chroms_to_exclude':['chr1','chr2','chr19'],
    'bin_stride':50,
    'left_flank':400,
    'right_flank':400,
    'bin_size':200,
    'threads':4,
    'subthreads':4,
    'allow_ambiguous':True,
    'labeling_approach':'peak_summit_in_bin_classification'
    }

genomewide_labels(train_set_params)
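
To make the windowing parameters concrete, the sketch below shows one plausible reading of `bin_size`, `bin_stride`, `left_flank`, and `right_flank`: each example is a 1000 bp sequence (400 + 200 + 400) whose central 200 bp bin is labeled, and consecutive windows start 50 bp apart. The functions `enumerate_bins` and `label_bin` are hypothetical illustrations based on the parameter names, not seqdataloader's actual implementation, which may treat chromosome edges and ambiguous bins differently.

```python
def enumerate_bins(chrom_length, bin_size=200, bin_stride=50,
                   left_flank=400, right_flank=400):
    """Yield (seq_start, seq_end, bin_start, bin_end) for windows that fit in the chromosome."""
    seq_len = left_flank + bin_size + right_flank
    for seq_start in range(0, chrom_length - seq_len + 1, bin_stride):
        bin_start = seq_start + left_flank
        yield seq_start, seq_start + seq_len, bin_start, bin_start + bin_size


def label_bin(bin_start, bin_end, summits):
    """Return 1 if any peak summit falls inside the half-open bin [bin_start, bin_end)."""
    return int(any(bin_start <= s < bin_end for s in summits))


# Toy 1200 bp "chromosome" with a single peak summit at position 450
for seq_start, seq_end, bin_start, bin_end in enumerate_bins(1200):
    print(seq_start, seq_end, bin_start, bin_end, label_bin(bin_start, bin_end, [450]))
```

Because the stride is smaller than the bin size, neighboring windows overlap, so a single summit can make several adjacent bins positive.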

In [None]:
# We will use chromosome 1 as our validation set
valid_set_params={'task_list':"tasks.tsv",
    'outf':"TF.valid.tsv.gz",
    'output_type':'gzip',
    'chrom_sizes':'hg38.chrom.sizes',
    'chroms_to_keep':['chr1'],
    'bin_stride':50,
    'left_flank':400,
    'right_flank':400,
    'bin_size':200,
    'threads':1,
    'subthreads':4,
    'allow_ambiguous':True,
    'labeling_approach':'peak_summit_in_bin_classification'
    }
genomewide_labels(valid_set_params)

In [None]:
#We will include chromosomes 2 and 19 in our testing set 
test_set_params={
    'task_list':"tasks.tsv",
    'outf':"TF.test.tsv.gz",
    'output_type':'gzip',
    'chrom_sizes':'hg38.chrom.sizes',
    'chroms_to_keep':['chr2','chr19'],
    'bin_stride':50,
    'left_flank':400,
    'right_flank':400,
    'bin_size':200,
    'threads':2,
    'subthreads':4,
    'allow_ambiguous':True,
    'labeling_approach':'peak_summit_in_bin_classification'
    }
genomewide_labels(test_set_params)
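
Since the three splits are defined by chromosome, it can be worth confirming that they do not overlap before training. The following is a small sanity-check sketch; the chromosome list (autosomes plus chrX and chrY) and the helper `check_splits` are assumptions for illustration, and the real set of chromosomes depends on the chrom sizes file.

```python
# Assumed chromosome universe; the actual list comes from the chrom sizes file
ALL_CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

VALID_CHROMS = {"chr1"}
TEST_CHROMS = {"chr2", "chr19"}
# Training uses everything that is not held out
TRAIN_CHROMS = set(ALL_CHROMS) - VALID_CHROMS - TEST_CHROMS


def check_splits(train, valid, test):
    """Raise ValueError if any chromosome appears in more than one split."""
    for name_a, a, name_b, b in [("train", train, "valid", valid),
                                 ("train", train, "test", test),
                                 ("valid", valid, "test", test)]:
        shared = a & b
        if shared:
            raise ValueError(f"{name_a} and {name_b} share {sorted(shared)}")
    return True


check_splits(TRAIN_CHROMS, VALID_CHROMS, TEST_CHROMS)
```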

## Case 1: Negatives consist of shuffled references, single-tasked models
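
As a point of reference for what a shuffled negative could look like, here is a deliberately simple sketch that permutes the bases of a positive sequence, preserving its mononucleotide composition while destroying motif structure. This is a swapped-in simplification: dinucleotide-preserving shuffles are a common alternative, and the helper `shuffled_negative` is hypothetical rather than the method this tutorial necessarily uses.

```python
import random


def shuffled_negative(seq, seed=None):
    """Return a random permutation of seq with the same base composition."""
    rng = random.Random(seed)  # local generator so global random state is untouched
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


positive = "ACGTACGTAAAAGGGC"
negative = shuffled_negative(positive, seed=0)
print(positive)
print(negative)
```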

## Case 2: Whole-genome negatives, single-tasked models 

## Case 3: Negatives consist of shuffled references, multi-tasked models 

## Case 4: Whole-genome negatives, multi-tasked models 

## Case 5: What happens if we don't balance our batches? 
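
For context, "balancing" here can be read as guaranteeing a fixed share of positives in every minibatch. The sketch below shows one way to do that for a single task by sampling positive and negative indices separately. The function `balanced_batches` and its defaults are hypothetical, and sampling with replacement is an assumption made so that rare positives can be reused across batches.

```python
import numpy as np


def balanced_batches(labels, batch_size=8, pos_fraction=0.5, n_batches=3, seed=0):
    """Yield index arrays in which a fixed fraction of entries are positives.

    labels: 1-D array of 0/1 labels for a single task
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    n_pos = int(round(batch_size * pos_fraction))
    n_neg = batch_size - n_pos
    for _ in range(n_batches):
        # Sample with replacement so a small positive set can fill every batch
        batch = np.concatenate([
            rng.choice(pos_idx, size=n_pos, replace=True),
            rng.choice(neg_idx, size=n_neg, replace=True),
        ])
        rng.shuffle(batch)  # mix positives and negatives within the batch
        yield batch


# Toy labels with 1% positives, similar in spirit to genome-wide labels
labels = np.array([1] + [0] * 99)
for batch in balanced_batches(labels):
    print(batch, labels[batch].mean())
```

Without this step, a uniformly sampled batch from the toy labels above would usually contain no positives at all.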