# ROCCO Demonstration

This notebook consists of two sections.
1. BAM Preprocessing
1. Running ROCCO

The first section walks through the BAM-to-WIG pipeline, which generates ROCCO-conformable input from a collection of samples' BAM files.

The second section provides usage examples for running ROCCO on multiple chromosomes and analyzing results.

**Download Input Alignments:** To acquire the ATAC-seq alignments (human lymphoblast) used for this demo, run
```
xargs -L 1 curl -O -J -L < demo_files/bam_links.txt
```
at the command line.

These files are obtained from the [ENCODE](https://www.encodeproject.org/search/?type=Experiment&control_type%21=%2A&status=released&perturbed=false&assay_title=ATAC-seq&biosample_ontology.cell_slims=lymphoblast&audit.ERROR.category%21=extremely+low+read+depth&audit.NOT_COMPLIANT.category%21=low+FRiP+score&audit.NOT_COMPLIANT.category%21=poor+library+complexity&audit.NOT_COMPLIANT.category%21=severe+bottlenecking&audit.WARNING.category%21=moderate+library+complexity&audit.WARNING.category%21=mild+to+moderate+bottlenecking&audit.WARNING.category%21=moderate+number+of+reproducible+peaks) project.

## BAM Preprocessing
The downloaded alignment files have already been QC-processed with the [ENCODE ATAC-seq pipeline](https://www.encodeproject.org/atac-seq/). This particular preprocessing protocol is not required, but we assume the BAM files used as input to ROCCO have already been preprocessed according to some QC standard (duplicate removal, adapter trimming, etc.).

### `prep_bams.py`
This script generates a smooth signal track for each sample's BAM file and then splits each track into chromosome-specific directories `tracks_<chromosome name>`, thereby providing ROCCO-conformable input.

The script performs the following steps:
1. Create local links to the BAM files (and associated `.bai` files if they exist) in the directory specified by the command-line argument `-i`/`--bamdir`. If the BAM files are present in the current directory, this parameter doesn't need to be specified, and no links are created. *Optional*: if the BAM files have not yet been indexed, pass the `--index` command-line argument to index them with [Pysam](https://github.com/pysam-developers/pysam).
1. For each BAM file, check whether it is sorted by coordinates. If not, [Pysam](https://github.com/pysam-developers/pysam) is called to sort it.
1. Call [PEPATAC's](https://github.com/databio/pepatac) `bamSitesToWig.py` script to generate a smooth fixed-interval signal track for each sample/replicate. Note: if this script is not already present in the current working directory, it will be downloaded via `wget`.
1. Divide the replicate's signal track by chromosome and place the resulting subtracks into the chromosome-specific directories `tracks_<chromosome name>`.

See the [flowchart](https://github.com/nolan-h-hamilton/ROCCO/blob/main/docs/bamsig_flowchart.png) for a visualization.
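
To make the final step concrete, here is a minimal sketch of splitting one signal track into `tracks_<chromosome name>` directories. It is not the code in `prep_bams.py`: the function name `split_wig_by_chromosome` is hypothetical, and the sketch assumes a WIG file in which every data section begins with a `fixedStep` or `variableStep` declaration line that carries a `chrom=<name>` field. The output naming `<chrom>_<input file name>` mirrors the file names visible in the ROCCO log later in this notebook, but is otherwise an assumption.

```python
import os


def split_wig_by_chromosome(wig_path, out_root="."):
    """Split a WIG file into per-chromosome files under tracks_<chrom>/.

    Assumption: each data section starts with a `fixedStep` or
    `variableStep` declaration line containing a `chrom=<name>` field.
    Returns the list of output paths in the order they were first created.
    """
    base = os.path.basename(wig_path)
    written = []  # output paths created so far
    out = None    # handle of the file currently being written
    with open(wig_path) as wig:
        for line in wig:
            # Skip track/browser/comment lines; they are not chromosome data.
            if line.startswith(("track", "browser", "#")):
                continue
            if line.startswith(("fixedStep", "variableStep")):
                # Parse `key=value` fields from the declaration line.
                fields = dict(f.split("=", 1) for f in line.split()[1:])
                chrom = fields["chrom"]
                out_dir = os.path.join(out_root, f"tracks_{chrom}")
                os.makedirs(out_dir, exist_ok=True)
                out_path = os.path.join(out_dir, f"{chrom}_{base}")
                if out is not None:
                    out.close()
                # Truncate on first use; append if this chromosome reappears.
                mode = "a" if out_path in written else "w"
                out = open(out_path, mode)
                if mode == "w":
                    written.append(out_path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Calling `split_wig_by_chromosome("sample.wig")` on a track covering `chr1` and `chr2` would create `tracks_chr1/chr1_sample.wig` and `tracks_chr2/chr2_sample.wig`, each containing only that chromosome's declaration lines and values.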

Since the downloaded alignments have not yet been indexed, we use the script's `--index` parameter. Note that these BAM files from ENCODE have already been sorted by coordinates.
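
One way to confirm that a BAM file declares coordinate sorting is to inspect the `SO` tag on the `@HD` line of its header. The sketch below works on header text (for example, the output of `samtools view -H`); the function names are hypothetical, and this is not necessarily how `prep_bams.py` performs its check. Note that the tag only reports what the producing tool declared and does not verify the record order itself.

```python
def sort_order_from_header(header_text):
    """Return the SO (sort order) value from the @HD header line, or None.

    Per the SAM specification, SO is one of: unknown, unsorted,
    queryname, coordinate. Header fields are tab-separated.
    """
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            return None  # @HD line present but no SO tag
    return None  # no @HD line at all


def is_coordinate_sorted(header_text):
    """True only if the header explicitly declares coordinate sorting."""
    return sort_order_from_header(header_text) == "coordinate"


# Example header text, written inline for illustration only.
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr20\tLN:64444167"
print(is_coordinate_sorted(example_header))
```

A missing `@HD` line or `SO` tag is treated as "not known to be sorted", which errs on the side of re-sorting.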

Users with the necessary computational resources are encouraged to make use of the `--cores` option.

In [1]:
!python prep_bams.py -i . --index --cores 4

{'bamdir': '.', 'sizes': 'hg38.sizes', 'interval_length': 50, 'cores': 4, 'index': True, 'retain': False}
creating directory `tracks_chr1`
creating directory `tracks_chr2`
creating directory `tracks_chr3`
creating directory `tracks_chr4`
creating directory `tracks_chr5`
creating directory `tracks_chr6`
creating directory `tracks_chr7`
creating directory `tracks_chr8`
creating directory `tracks_chr9`
creating directory `tracks_chr10`
creating directory `tracks_chr11`
creating directory `tracks_chr12`
creating directory `tracks_chr13`
creating directory `tracks_chr14`
creating directory `tracks_chr15`
creating directory `tracks_chr16`
creating directory `tracks_chr17`
creating directory `tracks_chr18`
creating directory `tracks_chr19`
creating directory `tracks_chr20`
creating directory `tracks_chr21`
creating directory `tracks_chr22`
creating directory `tracks_chrX`
creating directory `tracks_chrY`
running with 4 cores


processing: ENCFF009NCL.bam
reading alignment file with pysam...
[

## Running ROCCO
In this section we run ROCCO and analyze results.
### Run on Multiple Chromosomes
`ROCCO.py` looks for chromosome-specific parameters in the CSV file specified with the `-p` argument, in our case, `demo_files/demo_params.csv`. Because every cell in this file contains `NULL`, the genome-wide defaults are used. To run ROCCO with chromosome-specific parameters, modify this file by replacing `NULL` with the desired parameter value for the corresponding chromosome in each row.
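
The `NULL` fallback described above can be sketched as follows. This is an illustration rather than the logic in `ROCCO.py`: the function name `resolve_params`, the column names (`chromosome`, `budget`, `gamma`, `tau`, `c1`, `c2`, `c3`), and the CSV layout are assumptions, and the default values are copied from the `ROCCO_chrom.py` command logged in this demo's output.

```python
import csv
import io

# Assumed genome-wide defaults, taken from the logged command in this demo.
DEFAULTS = {"budget": 0.035, "gamma": 1.0, "tau": 0.0,
            "c1": 1.0, "c2": 1.0, "c3": 1.0}


def _is_null(value):
    """Treat a missing cell, an empty cell, or the string NULL as unset."""
    return value is None or value.strip() in ("", "NULL")


def resolve_params(csv_text, defaults=DEFAULTS):
    """Map each chromosome to its parameters, filling NULL cells with defaults.

    Assumption: the CSV has a header row with a `chromosome` column plus
    one column per parameter name in `defaults`.
    """
    resolved = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        chrom = row["chromosome"]
        resolved[chrom] = {
            name: default if _is_null(row.get(name)) else float(row[name])
            for name, default in defaults.items()
        }
    return resolved


# Hypothetical file contents: chr22 overrides only the budget.
example_csv = (
    "chromosome,budget,gamma,tau,c1,c2,c3\n"
    "chr20,NULL,NULL,NULL,NULL,NULL,NULL\n"
    "chr22,0.05,NULL,NULL,NULL,NULL,NULL\n"
)
params = resolve_params(example_csv)
print(params["chr22"]["budget"], params["chr22"]["gamma"])
```

In this hypothetical example, `chr20` receives every default, while `chr22` uses a budget of `0.05` and defaults for the remaining parameters.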

In [2]:
!python ROCCO.py -p demo_files/demo_params.csv --combine ROCCO_out_combined.bed

running each ROCCO_chrom.py job sequentially
job 0: python3 /work/users/n/h/nolanh/ROCCO/ROCCO_chrom.py --chrom chr20 --wig_path tracks_chr20 --budget 0.035 --gamma 1.0 --tau 0.0 --c1 1.0 --c2 1.0 --c3 1.0 --solver ECOS --bed_format 3 --outdir . --rr_iter 50
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF009NCL.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF110EWQ.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF231YYD.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF395ZMS.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF495DQP.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF621AYF.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF767FGV.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF797EAL.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF801THG.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF948HNW.bam.bw.wig
ROCCO_chrom: writing 

### ROCCO predicted peak regions over `chr22` using default parameters
IDR-thresholded peaks and fold-change signals from ENCODE are included for comparison.
![ROCCO predicted peak regions over chr22](demo_files/demo1.png)


### ROCCO predicted peak regions over a random 5 Mb region in `chr22`
![ROCCO predicted peak regions over a random 5 Mb region in chr22](demo_files/demo2.png)