# ROCCO Quick Start Demo

This notebook consists of three sections.
1. BAM Preprocessing
1. Running ROCCO
1. Analyzing Results

The first section walks through the BAM-to-WIG pipeline, which generates ROCCO-conformable input from a collection of samples' BAM files.

The second section runs ROCCO in three scenarios, and the third section carries out a cursory analysis of the results.

## BAM Preprocessing

**Download Input Alignments:** To acquire the ATAC-seq alignments (human lymphoblast) used for this demo, run
```
xargs -L 1 curl -O -J -L < demo_files/bam_links.txt
```
in the main `ROCCO` directory.
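
For environments without `xargs` or `curl`, the download step can be approximated in Python. This is a minimal sketch, not part of ROCCO: the helper names `read_links`, `local_name`, and `download_all` are hypothetical, and the local file name is taken from the last component of each URL rather than from the server-supplied name that `curl -J` would use, so the resulting names may differ.

```python
import os
import urllib.request
from urllib.parse import urlparse


def read_links(link_file):
    """Return the non-empty lines of a link file, one URL per line."""
    with open(link_file) as fh:
        return [line.strip() for line in fh if line.strip()]


def local_name(url):
    """Guess a local file name from the last component of the URL path.

    Assumption: the URL ends in the desired file name. `curl -J` instead
    uses the name supplied by the server, which may not match.
    """
    return os.path.basename(urlparse(url).path)


def download_all(link_file, dest="."):
    """Download every URL in link_file into dest, skipping existing files."""
    for url in read_links(link_file):
        target = os.path.join(dest, local_name(url))
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
```

Calling `download_all("demo_files/bam_links.txt")` from the main `ROCCO` directory would then fetch each alignment that is not already present.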

These files are obtained from the ENCODE project with the following [query](https://www.encodeproject.org/search/?type=Experiment&control_type%21=%2A&status=released&perturbed=false&assay_title=ATAC-seq&biosample_ontology.cell_slims=lymphoblast&audit.ERROR.category%21=extremely+low+read+depth&audit.NOT_COMPLIANT.category%21=low+FRiP+score&audit.NOT_COMPLIANT.category%21=poor+library+complexity&audit.NOT_COMPLIANT.category%21=severe+bottlenecking&audit.WARNING.category%21=moderate+library+complexity&audit.WARNING.category%21=mild+to+moderate+bottlenecking&audit.WARNING.category%21=moderate+number+of+reproducible+peaks).

The downloaded alignment files have been QC-processed with the [ENCODE ATAC-seq pipeline](https://www.encodeproject.org/atac-seq/). In general, we assume that BAM files used as input to ROCCO have been prepared according to some QC standard (duplicate removal, adapter trimming, etc.).

#### [`prep_bams.py`](https://nolan-h-hamilton.github.io/ROCCO/prep_bams.html)
This script generates a smooth signal track for each sample's BAM file and then splits each track by chromosome into directories named `tracks_<chromosome name>`, thereby providing ROCCO-conformable input.
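
Once the script has finished, the resulting layout can be inspected with a few lines of Python. This is a minimal sketch rather than part of ROCCO: the helper `list_tracks` is hypothetical, and it assumes only the `tracks_<chromosome name>` directory naming described above and a `.wig` extension on the per-sample files, as seen in the log output later in this notebook.

```python
import glob
import os


def list_tracks(root="."):
    """Map each chromosome name to the sorted wig file names in its tracks_ directory."""
    tracks = {}
    for directory in sorted(glob.glob(os.path.join(root, "tracks_*"))):
        if not os.path.isdir(directory):
            continue
        # Strip the "tracks_" prefix to recover the chromosome name.
        chrom = os.path.basename(directory)[len("tracks_"):]
        wigs = glob.glob(os.path.join(directory, "*.wig"))
        tracks[chrom] = sorted(os.path.basename(path) for path in wigs)
    return tracks


# Report how many sample tracks were found for each chromosome.
for chrom, wigs in list_tracks().items():
    print(f"{chrom}: {len(wigs)} wig file(s)")
```

A chromosome whose count is lower than the number of input BAM files would suggest that preprocessing did not complete for some sample.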

Full documentation for this script is available [here](https://nolan-h-hamilton.github.io/ROCCO/prep_bams.html), and this [flowchart](https://github.com/nolan-h-hamilton/ROCCO/blob/main/docs/bamsig_flowchart.png) offers a visualization of the workflow.

Users with sufficient computational resources are encouraged to make use of the `--cores` option.

In [1]:
# usage: prep_bams.py [-h] [-i BAMDIR] [-o OUTDIR] [-s SIZES] [-L INTERVAL_LENGTH] [-c CORES]
# defaults used except for --cores:
!python prep_bams.py --cores 4

[E::idx_find_and_load] Could not retrieve index file for '/work/users/n/h/nolanh/ROCCO/ENCFF009NCL.bam'
no index file available for /work/users/n/h/nolanh/ROCCO/ENCFF009NCL.bam, calling pysam.index()
/work/users/n/h/nolanh/ROCCO/ENCFF009NCL.bam: running bamSitesToWig.py
[E::idx_find_and_load] Could not retrieve index file for '/work/users/n/h/nolanh/ROCCO/ENCFF110EWQ.bam'
no index file available for /work/users/n/h/nolanh/ROCCO/ENCFF110EWQ.bam, calling pysam.index()
/work/users/n/h/nolanh/ROCCO/ENCFF110EWQ.bam: running bamSitesToWig.py
[E::idx_find_and_load] Could not retrieve index file for '/work/users/n/h/nolanh/ROCCO/ENCFF231YYD.bam'
no index file available for /work/users/n/h/nolanh/ROCCO/ENCFF231YYD.bam, calling pysam.index()
/work/users/n/h/nolanh/ROCCO/ENCFF231YYD.bam: running bamSitesToWig.py
[E::idx_find_and_load] Could not retrieve index file for '/work/users/n/h/nolanh/ROCCO/ENCFF395ZMS.bam'
no index file available for /work/users/n/h/nolanh/ROCCO/ENCFF395ZMS.bam, calling p

## Running ROCCO
Note: the commands in this section use the MOSEK solver. Remove `--solver MOSEK` from each command to use the ECOS solver instead.

ECOS is the default, open-source option, but it produces lengthy output and was not used here for brevity.

#### 1) Run on a Single Chromosome (`chr22`) with Default Parameters
[`ROCCO_chrom.py`](https://nolan-h-hamilton.github.io/ROCCO/ROCCO_chrom.html) builds $\mathbf{S}_{chr}$ from the wig files in `--wig_path` before solving the optimization problem underlying ROCCO. We use the `--verbose` flag for demonstration.

In [2]:
!python ROCCO_chrom.py --chrom chr22 --wig_path tracks_chr22 --verbose --solver MOSEK

{'chrom': 'chr22', 'start': -1, 'end': -1, 'locus_size': -1, 'wig_path': 'tracks_chr22', 'rr_iter': 50, 'verbose': True, 'integral': False, 'budget': 0.035, 'gamma': 1.0, 'tau': 0.0, 'c1': 1.0, 'c2': 1.0, 'c3': 1.0, 'solver': 'MOSEK', 'bed_format': 6, 'identifiers': None, 'outdir': '.'}
ROCCO_chrom: inferred args
{'chrom': 'chr22', 'start': 10516250, 'end': 50818450, 'locus_size': 50, 'wig_path': 'tracks_chr22', 'rr_iter': 50, 'verbose': True, 'integral': False, 'budget': 0.035, 'gamma': 1.0, 'tau': 0.0, 'c1': 1.0, 'c2': 1.0, 'c3': 1.0, 'solver': 'MOSEK', 'bed_format': 6, 'identifiers': None, 'outdir': '.'}
ROCCO_chrom: reading wig file tracks_chr22/chr22_ENCFF009NCL.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr22/chr22_ENCFF110EWQ.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr22/chr22_ENCFF231YYD.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr22/chr22_ENCFF395ZMS.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr22/chr22_ENCFF495DQP.bam.bw.wig
ROCCO_chrom: reading wig 

#### 2) Run on Multiple Chromosomes with Default Parameters
[`ROCCO.py`](https://nolan-h-hamilton.github.io/ROCCO/ROCCO.html) looks for chromosome-specific parameters in the CSV file specified with the `--param_file` argument, in our case `demo_files/demo_params.csv`. Since every parameter cell in this file is `NULL`, the genome-wide defaults are used.

**`demo_files/demo_params.csv`**
```
chromosome,input_path,budget,gamma,tau,c1,c2,c3
chr20,tracks_chr20,NULL,NULL,NULL,NULL,NULL,NULL
chr21,tracks_chr21,NULL,NULL,NULL,NULL,NULL,NULL
chr22,tracks_chr22,NULL,NULL,NULL,NULL,NULL,NULL
```
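
The `NULL` fallback can be illustrated with a short Python sketch. This is not the parsing code used by `ROCCO.py`: the function `resolve_params` is hypothetical, and the default values are assumed from the argument dictionary printed by the single-chromosome run above (`budget` 0.035, `gamma` 1.0, `tau` 0.0, `c1`, `c2`, `c3` all 1.0).

```python
import csv
import io

# Assumed genome-wide defaults, copied from the printed arguments of the chr22 run.
DEFAULTS = {"budget": 0.035, "gamma": 1.0, "tau": 0.0, "c1": 1.0, "c2": 1.0, "c3": 1.0}


def resolve_params(csv_text, defaults=DEFAULTS):
    """Parse parameter-file text, replacing each NULL cell with its default."""
    resolved_rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        resolved = {"chromosome": row["chromosome"], "input_path": row["input_path"]}
        for key, default in defaults.items():
            resolved[key] = default if row[key] == "NULL" else float(row[key])
        resolved_rows.append(resolved)
    return resolved_rows


demo_text = """chromosome,input_path,budget,gamma,tau,c1,c2,c3
chr20,tracks_chr20,NULL,NULL,NULL,NULL,NULL,NULL
chr21,tracks_chr21,NULL,NULL,NULL,NULL,NULL,NULL
chr22,tracks_chr22,NULL,NULL,NULL,NULL,NULL,NULL
"""

for params in resolve_params(demo_text):
    print(params)
```

Every row resolves to the same parameter values here, which is consistent with the `--budget 0.035 --gamma 1.0 --tau 0.0 ...` arguments shown in the job commands logged by `ROCCO.py`.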

In [3]:
!python ROCCO.py --param_file demo_files/demo_params.csv --combine ROCCO_out_combined.bed --outdir demo_outdir --solver MOSEK

running each ROCCO_chrom.py job sequentially
job 0: python3 /work/users/n/h/nolanh/ROCCO/ROCCO_chrom.py --chrom chr20 --wig_path tracks_chr20 --budget 0.035 --gamma 1.0 --tau 0.0 --c1 1.0 --c2 1.0 --c3 1.0 --solver MOSEK --bed_format 3 --outdir demo_outdir --rr_iter 50
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF009NCL.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF110EWQ.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF231YYD.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF395ZMS.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF495DQP.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF621AYF.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF767FGV.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF797EAL.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF801THG.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr20/chr20_ENCFF948HNW.bam.bw.wig
ROCCO_chro

#### 3) Run on Multiple Chromosomes with Variable Budgets

We run ROCCO over chromosomes 17-19 with specific budgets for each. In this example, budgets are computed loosely based on gene density of the respective chromosome.

**spec_params.csv**
```
chromosome,input_path,budget,gamma,tau,c1,c2,c3
chr17,tracks_chr17,0.04,NULL,NULL,NULL,NULL,NULL
chr18,tracks_chr18,0.03,NULL,NULL,NULL,NULL,NULL
chr19,tracks_chr19,0.05,NULL,NULL,NULL,NULL,NULL
```
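
When budgets are computed rather than typed by hand, a parameter file in this layout can be generated programmatically. The sketch below is one way to do it, assuming the column order shown above and the `tracks_<chromosome name>` directory convention; `write_param_file` is a hypothetical helper, not a ROCCO function.

```python
import csv

FIELDS = ["chromosome", "input_path", "budget", "gamma", "tau", "c1", "c2", "c3"]


def write_param_file(path, budgets):
    """Write one row per chromosome with its budget; other parameters stay NULL."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, lineterminator="\n")
        writer.writerow(FIELDS)
        for chrom, budget in budgets.items():
            # gamma, tau, c1, c2, c3 are left as NULL so the defaults apply.
            writer.writerow([chrom, f"tracks_{chrom}", budget] + ["NULL"] * 5)
```

For example, `write_param_file("spec_params.csv", {"chr17": 0.04, "chr18": 0.03, "chr19": 0.05})` reproduces the file shown above.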



In [4]:
!python ROCCO.py --param_file spec_params.csv --combine spec_combined.bed --outdir spec_outdir --solver MOSEK

running each ROCCO_chrom.py job sequentially
job 0: python3 /work/users/n/h/nolanh/ROCCO/ROCCO_chrom.py --chrom chr17 --wig_path tracks_chr17 --budget 0.04 --gamma 1.0 --tau 0.0 --c1 1.0 --c2 1.0 --c3 1.0 --solver MOSEK --bed_format 3 --outdir spec_outdir --rr_iter 50
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF009NCL.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF110EWQ.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF231YYD.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF395ZMS.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF495DQP.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF621AYF.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF767FGV.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF797EAL.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF801THG.bam.bw.wig
ROCCO_chrom: reading wig file tracks_chr17/chr17_ENCFF948HNW.bam.bw.wig
ROCCO_chrom

## Analyzing Results 


#### ROCCO predicted peak regions over `chr22` using default parameters
IDR-thresholded peaks and fold-change signals from ENCODE are included.
![ROCCO predicted peak regions over chr22](demo_files/demo1.png)

#### Peak Summary for Variable Budgets, Human Chromosomes 17-19

We first create a sizes file covering chromosomes 17-19 so that we can call `bedtools summary` on the peak results.

**chroms.size**
```
chr17	83257441
chr18	80373285
chr19	58617616
```

In [5]:
!bedtools summary -i spec_combined.bed -g chroms.size | column -t

chrom  num_records  total_bp  chrom_frac_genome  frac_all_ivls  frac_all_bp  min    max     mean
chr17  4707         3326300   0.37461            0.385          0.384        50     13100   706.671
chr18  4082         2410800   0.36164            0.334          0.278        50     7050    590.593
chr19  3438         2927350   0.26375            0.281          0.338        50     14650   851.469
all    12227        8664450   1.0                1.0            50           14650  708.63  
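
As a rough sanity check, the `total_bp` column can be compared against the budgets in `spec_params.csv`. The sketch below assumes that the budget acts as an upper bound on the fraction of each chromosome selected as peak regions. It also divides by the full chromosome lengths from `chroms.size`, whereas ROCCO appears to work on the interval covered by the wig data (compare the inferred `start` and `end` in the `chr22` run), so the comparison is only approximate.

```python
# Values copied from chroms.size, the bedtools summary table, and spec_params.csv.
chrom_sizes = {"chr17": 83257441, "chr18": 80373285, "chr19": 58617616}
peak_bp = {"chr17": 3326300, "chr18": 2410800, "chr19": 2927350}
budgets = {"chr17": 0.04, "chr18": 0.03, "chr19": 0.05}

# Fraction of each chromosome's length covered by predicted peaks.
fractions = {chrom: peak_bp[chrom] / chrom_sizes[chrom] for chrom in chrom_sizes}

for chrom, frac in fractions.items():
    status = "within" if frac <= budgets[chrom] else "over"
    print(f"{chrom}: selected fraction {frac:.5f}, budget {budgets[chrom]} ({status} budget)")
```

Under these assumptions, each selected fraction falls slightly below its chromosome's budget.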
