## I want to be able to

- Specify my own bigwigs together with control experiments
- Use the same control experiment for multiple tracks
- Train only in peak regions or genome-wide
- Train on AWS and in Colab notebook
- Fully utilize my GPU without waiting for data-loading
- Easily try out multiple hyper-parameters
- See all the logs and evaluation metrics in the file
- See all the logs on comet.ml or similar website
- Visualize the loss curves and see some example predictions (observed vs predicted) afterwards
- Use my own evaluation notebook
- Have a good set of default hyper-parameters for each assay (ChIP-nexus and ChIP-seq)
- Train on multiple assays simultaneously
- Train on any functional genomics assay where events result in coverage peaks
  - ChIP-nexus
  - ChIP-seq
  - DNase
  - eClip
  - CutNRun
- Specify my own architecture, loss function, training function if needed

## Implementation

- Load the data into memory if possible
- Use gin-config files for specifying architectures, loss functions etc
  - Note: You have to educate users about it. It may be too complicated.
- Use SeqModel
- Use the default model without specifying any hyper-parameters for different kinds of data
- Don't save the pkl file
- Infer the number of tasks from dataspec


## Args

- `<dataspec>`: dataspec.yaml
- `--config=model.gin` (gin config files specifying the model architecture and the loss etc)
- `--premade=bpnet9`: pre-made config file to use (e.g. use the default architecture). The user could override the setting using bindings.
- `--override`: model/loss/training parameters to override
- `--evaluate` If true, the model will also be evaluated on the 
- `--report`: path to the ipynb report file. Use the default one.

## Output

- train.log
- model.h5
- evaluate.html
- evaluate.ipynb
- model.gin -> copied from the input
- dataspec.yaml -> copied from the input


## TODO

- specify only a single track
  - do we really need to specify pos_track and neg_track explicitly or can we just list them? 

# Training BPNet on your own data

<img src="figs/bpnet-arch.png" alt="BPNet" style="width: 450px;"/>

## 1. Specify data -> write `dataspec.yml`

The first step is to specify the data on which to train the model. BPNet takes a nucleotide sequence as input and outputs the read coverage profile for multiple tracks at base resolution. The coverage tracks can come from any genome-wide functional genomics assay with sufficient spatial resolution, including ChIP-nexus, ChIP-exo, ChIP-seq, DNase-seq, and ATAC-seq. Additionally, different experiments may have different biases that need to be accounted for. Both the signal and the bias/control tracks have to be stored in [BigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) files.

In this tutorial, we will use the data from the BPNet paper (TODO - link) measuring TF binding of 4 TFs (Oct4, Sox2, Nanog and Klf4) with ChIP-nexus in mouse embryonic stem cells (mESCs). To make things faster, we will focus only on a small subset of the regions.

- [ ] describe `dataspec.yml`
  - give one example
  - explain each entry in it

First, download the required data

In [4]:
#!wget ....
#!

A `dataspec.yml` file has the following general structure:

```yaml
fasta_file: ./reference-genome.fa
task_specs:
  task1:
    pos_counts: ./task1.pos.bigWig
    neg_counts: ./task1.neg.bigWig
    peaks: ./task1.peaks.bed.gz  # optional. Peaks associated with task1
    
    # optional
    ignore_strand: False  # if True, use the sum of pos_ and neg_counts
    bias_bigwig: ./InputDNA.counts.bigWig  # measured bias. IP in ChIP-seq
    bias_model: null   # sequence-based bias model
  task2:
    ...  # similarly as for task1
  task3:
    ...
```

In [10]:
!cat dataspec.yml

fasta_file: data/mm10_no_alt_analysis_set_ENCODE.fasta
task_specs:
  Oct4:
    pos_counts: data/Oct4/counts.pos.bw
    neg_counts: data/Oct4/counts.neg.bw
    peaks: data/Oct4/idr-optimal-set.summit.bed.gz
  Sox2:
    pos_counts: data/Sox2/counts.pos.bw
    neg_counts: data/Sox2/counts.neg.bw
    peaks: data/Sox2/idr-optimal-set.summit.bed.gz
  Nanog:
    pos_counts: data/Nanog/counts.pos.bw
    neg_counts: data/Nanog/counts.neg.bw
    peaks: data/Nanog/idr-optimal-set.summit.bed.gz
  Klf4:
    pos_counts: data/Klf4/counts.pos.bw
    neg_counts: data/Klf4/counts.neg.bw
    peaks: data/Klf4/idr-optimal-set.summit.bed.gz

bias_specs:
  input:
    pos_counts: data/patchcap/counts.pos.bw
    neg_counts: data/patchcap/counts.neg.bw
    tasks:
      - Oct4
      - Sox2
      - Nanog
      - Klf4

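The `dataspec.yml` file contains three parts:
- `task_specs`: the signal tracks (`pos_counts`, `neg_counts`) and, optionally, the `peaks` for each task
- `bias_specs`: the bias/control tracks together with the list of `tasks` each of them applies to
- `fasta_file`: the reference genome sequence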
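
As a rough illustration of how these three parts relate, the sketch below mirrors a trimmed-down version of the spec above (two tasks only) as a plain Python dictionary and checks it for consistency. The helper `check_dataspec` is hypothetical and not part of the `bpnet` package. It assumes only the key names shown in the file above, and it treats `fasta_file` and `task_specs` as required and `bias_specs` as optional, which is an assumption rather than documented behavior.

```python
# Hypothetical helper (not part of bpnet): sanity-check a parsed dataspec.
# The dictionary mirrors the structure of the dataspec.yml shown above.
dataspec = {
    "fasta_file": "data/mm10_no_alt_analysis_set_ENCODE.fasta",
    "task_specs": {
        "Oct4": {
            "pos_counts": "data/Oct4/counts.pos.bw",
            "neg_counts": "data/Oct4/counts.neg.bw",
            "peaks": "data/Oct4/idr-optimal-set.summit.bed.gz",
        },
        "Sox2": {
            "pos_counts": "data/Sox2/counts.pos.bw",
            "neg_counts": "data/Sox2/counts.neg.bw",
            "peaks": "data/Sox2/idr-optimal-set.summit.bed.gz",
        },
    },
    "bias_specs": {
        "input": {
            "pos_counts": "data/patchcap/counts.pos.bw",
            "neg_counts": "data/patchcap/counts.neg.bw",
            "tasks": ["Oct4", "Sox2"],
        },
    },
}


def check_dataspec(spec):
    """Return the list of task names; raise ValueError if the spec is inconsistent."""
    # Assumption: these two entries are mandatory.
    for key in ("fasta_file", "task_specs"):
        if key not in spec:
            raise ValueError(f"missing required entry: {key}")
    tasks = list(spec["task_specs"])
    # Every bias track may only refer to tasks defined in task_specs.
    for name, bias in spec.get("bias_specs", {}).items():
        unknown = [t for t in bias.get("tasks", []) if t not in tasks]
        if unknown:
            raise ValueError(f"bias track {name!r} refers to unknown tasks: {unknown}")
    return tasks


tasks = check_dataspec(dataspec)
print(len(tasks), "tasks:", tasks)
```

The number of tasks falls out of `task_specs` directly, so it does not have to be specified separately.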

**NOTE:** Before you jump ahead and start training the model, we recommend eyeballing the coverage tracks (BigWig) and peak regions (bed) in a genome browser such as [WashU](https://epigenomegateway.wustl.edu/) or [IGV](https://software.broadinstitute.org/software/igv/). If you cannot identify peaks by eye, the model will not be able to do it either.

### Using `DataSpec` to visualize the raw data

Having specified your data in `dataspec.yml`, you can also use `bpnet.specs.DataSpec` to parse the file and visualize the tracks for a specific genomic interval.

In [5]:
## TODO - show how to do that. Show the Oct4 enhancer from the paper

### How do I get my data into a BigWig file?

Functional genomics experiments based on sequencing yield many short reads, which are then aligned to the reference genome. The alignment locations of the reads are typically stored in a [BAM](http://samtools.github.io/hts-specs/SAMv1.pdf) file. There are different ways of computing the coverage from aligned reads. To avoid losing any spatial information in the profiles, we would like to generate non-smoothed tracks (as raw as possible). For ChIP-exo/nexus/seq experiments, this means counting the 5' locations of the reads. Note that the aligned reads also carry strand information, hence we will generate two coverage tracks: one for the positive/forward strand and one for the negative/reverse strand. If multiple technical or biological replicate experiments were performed for a specific transcription factor, we will simply add up the coverage tracks.
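
To make the 5'-end counting concrete, here is a minimal pure-Python sketch on toy data. It is not how `bpnet align2bigwig` is implemented; the function `five_prime_counts` and the read tuples are hypothetical, and the reads are assumed to use 0-based, half-open coordinates as in BED.

```python
from collections import Counter


def five_prime_counts(reads):
    """Count read 5' ends separately for each strand.

    `reads` is an iterable of (chrom, start, end, strand) tuples in
    0-based, half-open coordinates. Returns two Counters keyed by
    (chrom, position): one for the positive and one for the negative strand.
    """
    pos, neg = Counter(), Counter()
    for chrom, start, end, strand in reads:
        if strand == "+":
            pos[(chrom, start)] += 1    # 5' end is the leftmost base
        elif strand == "-":
            neg[(chrom, end - 1)] += 1  # 5' end is the rightmost base
        else:
            raise ValueError(f"unknown strand: {strand!r}")
    return pos, neg


# Two toy replicates of the same experiment.
rep1 = [("chr1", 100, 150, "+"), ("chr1", 100, 136, "+"), ("chr1", 120, 170, "-")]
rep2 = [("chr1", 100, 150, "+"), ("chr1", 130, 170, "-")]

pos1, neg1 = five_prime_counts(rep1)
pos2, neg2 = five_prime_counts(rep2)

# Replicates are combined by simply adding up the per-position counts.
pos_total = pos1 + pos2
neg_total = neg1 + neg2

print(pos_total[("chr1", 100)])  # 3
print(neg_total[("chr1", 169)])  # 2
```

Because only a single position per read is counted, reads of different lengths that start at the same base contribute to the same position, which is what keeps the track unsmoothed.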

The `bpnet` CLI offers a convenience command to convert the read alignments stored in BAM files into coverage tracks.

```bash
bpnet align2bigwig <rep1.bam> <rep2.bam> --fragment-point=5prime --strand-spec <output prefix>
```

For more information run `bpnet align2bigwig --help` or read the source code here (TODO - link). Note that `align2bigwig` also accepts the `TagAlign` file generated by the ENCODE ChIP-seq pipeline.

In [8]:
## TODO - show the read coverage profile for chip-seq and the 5' ends of the reads

### How do I get `regions.bed`?

For large genomes such as human or mouse, training genome-wide can be computationally expensive. Most regions in the genome contain very few counts, hence the model will not receive much information from them. We can significantly speed up the training process by training the model only in regions with a higher number of counts. These regions are determined using traditional peak callers such as MACS2. Since we just want to discard regions with few or no counts, we don't care about the exact peak locations or even about a high false positive rate. Hence almost any peak caller should be fine.

You can also train the model genome-wide by tiling the genome into bins (using, say, a stride of 400). You can generate the intervals of the tiled genome using - TODO - refer to dataloader tools?. BPNet also allows you to sample regions overlapping a known peak with higher probability. To do so, you have to provide the `TODO dataloader command` with the peak locations. In that case, the produced bed file will be a tab-separated file that looks as follows:

```tsv
CHROM    START    END      task1    task2    ...
chr1     10000    11000    1        0
```

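The columns following the interval coordinates specify, for each task, whether the center of the interval overlaps a peak from that task (1 = yes, 0 = no). In the example row, the interval center overlaps a peak from `task1` but not from `task2`.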
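
The tiling and labeling described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the `bpnet` dataloader: the functions `tile` and `center_overlaps`, the toy peak coordinates, and the interval width of 1000 (taken from the example row) are all hypothetical, while the stride of 400 comes from the text.

```python
def tile(chrom, chrom_len, width=1000, stride=400):
    """Yield (chrom, start, end) intervals of fixed width along a chromosome."""
    for start in range(0, chrom_len - width + 1, stride):
        yield chrom, start, start + width


def center_overlaps(interval, peaks):
    """Return 1 if the interval center lies inside any peak on the same chromosome, else 0."""
    chrom, start, end = interval
    center = (start + end) // 2
    return int(any(c == chrom and s <= center < e for c, s, e in peaks))


# Toy peaks per task as (chrom, start, end), 0-based half-open.
peaks = {
    "task1": [("chr1", 10450, 10550)],
    "task2": [("chr1", 50000, 50200)],
}

rows = []
for interval in tile("chr1", 12000):
    labels = [center_overlaps(interval, peaks[task]) for task in peaks]
    rows.append((*interval, *labels))

# Print the header and the row matching the example above.
print("CHROM", "START", "END", *peaks, sep="\t")
for row in rows:
    if row[1] == 10000:
        print(*row, sep="\t")
```

With these toy peaks, the interval `chr1:10000-11000` has its center at 10500, which falls inside the `task1` peak only, reproducing the example row.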

### Can I train the model without the bias track?

Technically, yes. It will work well for assays with a low amount of bias such as ChIP-exo or ChIP-nexus. However, we generally recommend using the bias track.

## 2. Specify model architecture -> use pre-made models or write `model.gin`

## FAQ

### I want to train a model on ChIP-seq. How can I do this?

### I want to train a model on DNase-seq or ATAC-seq. How can I do this?

The key difference between DNase-seq and ChIP-seq/exo is that DNase-seq coverage is not strand-specific. Hence only a single BigWig file is required. By contrast, ChIP-seq requires two BigWigs: one for the positive and one for the negative strand. Beware that controlling for DNase biases is still an open question and you should think carefully about it.

Otherwise, you can just specify a similar `dataspec.yml` as before:

```yaml
TODO: write
```