<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Cooler-command-line-interface" data-toc-modified-id="Cooler-command-line-interface-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Cooler command line interface</a></span><ul class="toc-item"><li><span><a href="#Example" data-toc-modified-id="Example-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Example</a></span></li><li><span><a href="#Aggregate-a-list-of-read-pairs-into-a-cool-file" data-toc-modified-id="Aggregate-a-list-of-read-pairs-into-a-cool-file-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Aggregate a list of read pairs into a <code>cool</code> file</a></span></li><li><span><a href="#Text-export" data-toc-modified-id="Text-export-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Text export</a></span></li><li><span><a href="#Balancing" data-toc-modified-id="Balancing-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Balancing</a></span></li><li><span><a href="#Display-the-contact-matrix" data-toc-modified-id="Display-the-contact-matrix-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Display the contact matrix</a></span></li><li><span><a href="#Optional:-Load-a-pairs-file-indexed-with-Pairix" data-toc-modified-id="Optional:-Load-a-pairs-file-indexed-with-Pairix-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Optional: Load a pairs file indexed with <a href="https://github.com/4dn-dcic/pairix" target="_blank">Pairix</a></a></span></li></ul></li></ul></div>

Notes:

To simulate shell scripts and terminal interaction, we preface shell code cells with the "cell magic" `%%bash`, which sends the code to bash instead of the Python interpreter. Another way to send code from IPython to the shell is to prefix a line with the shell escape `!`.

# Cooler command line interface

If you type `cooler` at the command line with no arguments, or with `-h` or `--help`, you'll get the following quick reference of available subcommands.

In [None]:
%%bash

cooler -h

For more information about a specific subcommand, type `cooler <subcommand> -h` to display the help text.

In [None]:
%%bash

cooler info -h

## Example

Let's try it.

In [None]:
%%bash

cooler info data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

In [None]:
%%bash

cooler info -f bin-size data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

In [None]:
%%bash

cooler info -m data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

For more in-depth introspection into the HDF5 file structure, you can use `cooler tree` to display the group and array hierarchy and `cooler attrs` to display all attributes in the hierarchy. As a bonus, these commands work on any HDF5 file!

In [None]:
%%bash

cooler tree data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

In [None]:
%%bash

cooler attrs data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

## Aggregate a list of read pairs into a `cool` file

To make a contact matrix, we need:

1. A list of read pairs representing captured contacts.
2. A segmentation of the genome into bins by which we aggregate (bin) the read pair counts.

For (1), we will start with a very small subsample of 100,000 read pairs from GSM1551552 (Rao et al., GM12878). The fields of the file are readID, strand1, chrom1, pos1, frag1, strand2, chrom2, pos2, frag2, mapq1, mapq2.

In [None]:
%%bash

zcat data/GSM1551552_HIC003_merged_nodups.txt.subset.gz | head

This data was mapped to the Broad's `b37` assembly and uses ENSEMBL-style chromosome names (`1..22`, `X`, `Y`, `MT`) instead of the UCSC-style (`chr1..chr22`, `chrX`, `chrY`, `chrM`).

We provide a chromosome sizes file with the chromosomes we want to use in a desired order. Make sure the name style of the chromsizes file matches the name style of the pairs file! The following is the `b37` chromosome sizes file with chromosomes in a "natural" semantic order, leaving out the unlocalized and unplaced scaffolds.
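
A quick sanity check for mismatched name styles can be sketched in a few lines of Python. This is a minimal illustration, not part of cooler: the helper `unmatched_chromosomes` and the toy inputs are hypothetical.

```python
def unmatched_chromosomes(pair_chroms, chromsizes_names):
    """Return the chromosome names seen in the pairs that are absent from chromsizes."""
    known = set(chromsizes_names)
    return sorted({chrom for chrom in pair_chroms if chrom not in known})


# Hypothetical toy inputs: an Ensembl-style chromsizes file ...
chromsizes_names = ["1", "2", "X", "MT"]

# ... checked against UCSC-style and Ensembl-style pairs.
print(unmatched_chromosomes(["chr1", "chr2", "chrX"], chromsizes_names))
print(unmatched_chromosomes(["1", "X", "MT"], chromsizes_names))
```

An empty result means every chromosome name in the pairs also appears in the chromsizes file.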

In [None]:
%%bash

cat data/b37.chrom.sizes

We also need to decide how we want to bin the contacts. Usually, we choose a fixed bin size or "resolution". Another option for Hi-C data is to use restriction fragment-delimited genomic bins based on the restriction enzyme used in the experiment. `cooler` allows for any binning scheme you like, as long as you provide it as a **bin table**. We can store a bin table in a simple BED file using the `makebins` command.
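
To make the idea of a fixed-resolution bin table concrete, here is a rough Python sketch that tiles each chromosome with equal-width bins and clips the last bin at the chromosome end. The function `make_fixed_bins` and the toy chromosome sizes are made up for illustration; this is not cooler's own code.

```python
def make_fixed_bins(chromsizes, binsize):
    """Yield (chrom, start, end) tuples covering each chromosome in the given order."""
    for chrom, length in chromsizes:
        for start in range(0, length, binsize):
            # The last bin of a chromosome is clipped to the chromosome length.
            yield chrom, start, min(start + binsize, length)


# Hypothetical toy chromosome sizes (name, length in bp).
toy_chromsizes = [("1", 2_500_000), ("2", 1_000_000)]

for chrom, start, end in make_fixed_bins(toy_chromsizes, 1_000_000):
    print(chrom, start, end, sep="\t")
```

Each printed row has the three columns of a BED record: chromosome, start and end.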

In [None]:
%%bash

cooler makebins -h

If you have the FASTA sequence of the reference genome, you can also "digest" it to create a bin table of fragments.

In [None]:
%%bash

cooler digest -h

In [None]:
%%bash

CHROMSIZES_FILE='data/b37.chrom.sizes'

cooler makebins --out "data/bins.1000kb.bed" $CHROMSIZES_FILE 1000000

# what's in the file?
head "data/bins.1000kb.bed"

### Note

There is a convenient syntax to specify a fixed-resolution bin table, so you rarely need to generate one manually: 

```
<chromsizes_path>:<binsize-in-bp>
```

For example, the bin table above can be specified as `data/b37.chrom.sizes:1000000`.
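
For illustration, a specification of this form can be split into its two parts as follows. The helper `parse_bins_spec` is hypothetical and only mirrors the syntax described above; how cooler itself parses the argument may differ.

```python
def parse_bins_spec(spec):
    """Split '<chromsizes_path>:<binsize-in-bp>' into (path, binsize)."""
    # Split on the last colon so that colons inside the path are preserved.
    path, sep, binsize = spec.rpartition(":")
    if not sep or not path or not binsize.isdigit():
        raise ValueError(f"expected <chromsizes_path>:<binsize-in-bp>, got {spec!r}")
    return path, int(binsize)


print(parse_bins_spec("data/b37.chrom.sizes:1000000"))
```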

In [None]:
%%bash

cooler cload pairs -h

In [None]:
%%bash
# Note that the input pairs file happens to be space-delimited, so we convert to tab-delimited with `tr`.
CHROMSIZES_FILE='data/b37.chrom.sizes'
BINSIZE=1000000
PAIRS_FILE='data/GSM1551552_HIC003_merged_nodups.txt.subset.gz'
OUTPUT_FILE='data/test.cool'

zcat $PAIRS_FILE \
    | tr ' ' '\t' \
    | cooler cload pairs -c1 3 -p1 4 -c2 7 -p2 8 $CHROMSIZES_FILE:$BINSIZE - $OUTPUT_FILE 

There are benefits to sorting and indexing pairs. See below.
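
Conceptually, the aggregation step assigns each mate to a bin and counts how many pairs fall into each pair of bins. The following Python sketch shows that idea under simplifying assumptions: fixed-size bins, tab-delimited input, 1-based column numbers like the `-c1`/`-p1`/`-c2`/`-p2` options used above, positions binned by integer division, and pairs on unlisted chromosomes skipped. The function names and toy records are hypothetical, and this is not how cooler is implemented internally.

```python
from collections import Counter


def bin_offsets(chromsizes, binsize):
    """Map each chromosome to the ID of its first bin."""
    offsets, next_id = {}, 0
    for chrom, length in chromsizes:
        offsets[chrom] = next_id
        next_id += -(-length // binsize)  # ceiling division: number of bins
    return offsets


def aggregate_pairs(lines, chromsizes, binsize, c1=3, p1=4, c2=7, p2=8):
    """Count pairs per (bin1_id, bin2_id), keeping bin1_id <= bin2_id."""
    offsets = bin_offsets(chromsizes, binsize)
    pixels = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom1, pos1 = fields[c1 - 1], int(fields[p1 - 1])
        chrom2, pos2 = fields[c2 - 1], int(fields[p2 - 1])
        if chrom1 not in offsets or chrom2 not in offsets:
            continue  # assumption of this sketch: skip unlisted chromosomes
        bin1 = offsets[chrom1] + pos1 // binsize
        bin2 = offsets[chrom2] + pos2 // binsize
        # Store each contact in the upper triangle of the matrix.
        pixels[(min(bin1, bin2), max(bin1, bin2))] += 1
    return pixels


# Hypothetical records in the 11-field layout described earlier.
toy_lines = [
    "r1\t0\t1\t150\t0\t16\t1\t1200\t3\t60\t60",
    "r2\t0\t2\t10\t0\t0\t1\t900\t2\t60\t60",
    "r3\t16\t1\t1999\t1\t0\t1\t20\t0\t60\t60",
]
toy_pixels = aggregate_pairs(toy_lines, [("1", 2500), ("2", 1000)], binsize=1000)
for (bin1, bin2), count in sorted(toy_pixels.items()):
    print(bin1, bin2, count, sep="\t")
```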

## Text export

The `cooler dump` command lets us print the data back out as text with several formatting and annotation options. It also accepts range queries, both intra- and inter-chromosomal.
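
As an illustration of what an annotation option does, the `--join` flag used in some of the cells below prints genomic bin coordinates in place of bin IDs. The idea can be sketched in Python as a lookup into the bin table; the function `join_pixels` and the toy tables are hypothetical and do not reflect cooler's internals.

```python
def join_pixels(bins, pixels):
    """Replace the bin IDs of each pixel with (chrom, start, end) coordinates."""
    for bin1_id, bin2_id, count in pixels:
        yield (*bins[bin1_id], *bins[bin2_id], count)


# Hypothetical toy tables: the bin ID is the row index in the bin table.
toy_bins = [("1", 0, 1000), ("1", 1000, 2000), ("2", 0, 1000)]
toy_pixel_rows = [(0, 1, 2), (0, 2, 1)]

for row in join_pixels(toy_bins, toy_pixel_rows):
    print(*row, sep="\t")
```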

In [None]:
%%bash

cooler dump -h

In [None]:
%%bash

cooler dump -t chroms data/test.cool

In [None]:
%%bash

cooler dump -t bins data/test.cool | head

In [None]:
%%bash

cooler dump -t pixels --header data/test.cool | head

In [None]:
%%bash

cooler dump -t pixels --header --join data/test.cool | head

In [None]:
%%bash

cooler dump -t pixels -r 10:10,000,000-20,000,000 -r2 10:30,000,000-80,000,000 --header --join data/test.cool | head

In [None]:
%%bash

cooler dump -t pixels --header --balanced data/test.cool | head

Oops! Our contact matrix isn't balanced yet. Let's do that next.

## Balancing

We usually normalize or "correct" Hi-C data using a technique called matrix balancing, also known as iterative correction. This involves finding a set of weights or biases $b_i$, one for each bin $i$, such that the normalized matrix

$$ \mathrm{Normalized}[i,j] = \mathrm{Observed}[i,j] \cdot b_i \cdot b_j $$

has flat and equal marginals (i.e., row/column sums) over the global contact matrix.

`cooler balance` will store the pre-computed balancing weights in the bin table as an extra column called `weight`.

Note that whole-genome matrix balancing on a high resolution matrix requires iterative computations on a matrix that may not fit in computer memory, even in sparse form. Our "out-of-core" method performs the calculations by splitting and loading the data into smaller chunks and combining the partial results afterwards.

In [None]:
%%bash

cooler balance -h

`cooler balance` iterates until the balanced marginals (i.e. row sums of the balanced matrix) are sufficiently flat (the variance falls below the limit `tol`).
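
The iteration can be sketched on a small dense symmetric matrix in plain Python. This is a simplified illustration only, not cooler's out-of-core implementation: it omits bin filtering and any final rescaling, and the function `balance`, its defaults and the toy matrix are assumptions made for the example.

```python
def balance(matrix, tol=1e-5, max_iters=200):
    """Return (weights, variance) so that matrix[i][j] * w[i] * w[j] has flat row sums."""
    n = len(matrix)
    weights = [1.0] * n
    var = float("inf")
    for _ in range(max_iters):
        # Row sums (marginals) of the matrix under the current weights.
        marg = [
            weights[i] * sum(matrix[i][j] * weights[j] for j in range(n))
            for i in range(n)
        ]
        nonzero = [m for m in marg if m > 0]
        if not nonzero:
            break  # nothing to balance in an all-zero matrix
        mean = sum(nonzero) / len(nonzero)
        # Variance of the marginals relative to their mean.
        var = sum((m / mean - 1.0) ** 2 for m in nonzero) / len(nonzero)
        if var < tol:
            break
        # Divide each weight by its relative marginal; empty rows are left alone.
        weights = [w * mean / m if m > 0 else w for w, m in zip(weights, marg)]
    return weights, var


# Hypothetical toy symmetric contact matrix.
observed = [
    [10, 4, 2],
    [4, 8, 3],
    [2, 3, 6],
]
weights, var = balance(observed)
balanced_row_sums = [
    sum(observed[i][j] * weights[i] * weights[j] for j in range(3)) for i in range(3)
]
print(weights)
print(balanced_row_sums)
```

After convergence, the printed row sums of the balanced matrix should be nearly equal to one another.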

In [None]:
%%bash

cooler balance -p 10 -c 10000 data/test.cool

In [None]:
%%bash

cooler dump --header --balanced data/test.cool | head

## Display the contact matrix

You can also use the `cooler show` command to produce images of the contact matrix. This requires the `matplotlib` Python package.

In [None]:
%%bash

cooler show -h

Here's the undersampled dataset.

In [None]:
%%bash

cooler show --out data/test.png --dpi 200 data/test.cool 3:0-80,000,000

In [None]:
from IPython.display import Image
Image('data/test.png')

Here's what the full one looks like.

In [None]:
%%bash

cooler show --out data/test2.png --dpi 200 data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool chr3:0-80,000,000

In [None]:
from IPython.display import Image
Image('data/test2.png')

## Optional: Load a pairs file indexed with [Pairix](https://github.com/4dn-dcic/pairix)

Alternatively, you can sort, format and index your pairs file before ingesting. Having an indexed pairs file can also be useful for other kinds of read-level analyses.

We use the `cooler csort` command to do this. What does it do? 

Given a chromosome order, it creates a new pairs file with the following properties:

1. _Consistently ordered mates_: mates of every interchromosomal pair are properly "flipped" in order to respect the requested order of the chromosomes. For intrachromosomal pairs, mates are flipped such that `pos1` is always less than or equal to `pos2`. As a result, the data will have an **upper triangular** orientation with respect to the chromosome order (interpreting sides 1 and 2 as `i` and `j` axes in a matrix coordinate system).
2. _Sorted_: once the mates are oriented, the pair records are lexically sorted by chrom1, chrom2, pos1, pos2. With (1) and (2), contacts are said to be sorted by chromosome-chromosome block, a.k.a. "block" sorted.
3. _Indexed_: we use [bgzip](http://www.htslib.org/doc/tabix.html) to compress the file and [Pairix](https://github.com/4dn-dcic/pairix) to index it. This creates a small `.px2` index file which facilitates 2-dimensional queries on the reads. 

**Notes**: 

- If (1) is already satisfied, you can also prepare a Pairix-indexed file manually, without `cooler csort`. See this [example](https://github.com/4dn-dcic/pairix#usage-examples-for-pairix).
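
The flipping and sorting steps, (1) and (2), can be sketched in Python on a handful of records. This is an illustration under assumptions, not the `cooler csort` implementation: pairs are reduced to `(chrom1, pos1, chrom2, pos2)` tuples, chromosomes are compared by their rank in the requested order, and the function `flip_and_sort` and the toy data are hypothetical.

```python
def flip_and_sort(pairs, chrom_order):
    """Orient pairs as upper triangular, then sort by chrom1, chrom2, pos1, pos2."""
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    oriented = []
    for chrom1, pos1, chrom2, pos2 in pairs:
        # Flip mates so that side 1 never comes after side 2.
        if (rank[chrom1], pos1) > (rank[chrom2], pos2):
            chrom1, pos1, chrom2, pos2 = chrom2, pos2, chrom1, pos1
        oriented.append((chrom1, pos1, chrom2, pos2))
    # "Block" sort: chromosome-chromosome block first, then positions.
    return sorted(oriented, key=lambda p: (rank[p[0]], rank[p[2]], p[1], p[3]))


# Hypothetical toy pairs and chromosome order.
toy_pairs = [
    ("2", 500, "1", 100),
    ("1", 900, "1", 300),
    ("X", 5, "1", 7),
    ("1", 50, "2", 10),
]
for record in flip_and_sort(toy_pairs, ["1", "2", "X"]):
    print(*record, sep="\t")
```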

In [None]:
!cooler csort -h

In [None]:
%%bash
# Note that the input pairs file happens to be space-delimited, which we specify 
# with the --sep argument  (tab is assumed by default).
# The output pairs file will always be tab-delimited!

CHROMSIZES_FILE='data/b37.chrom.sizes'
PAIRS_FILE='data/GSM1551552_HIC003_merged_nodups.txt.subset.gz'

cooler csort -c1 3 -p1 4 -c2 7 -p2 8 --sep ' ' --out data/pairs.sorted.txt.gz $PAIRS_FILE $CHROMSIZES_FILE 

In [None]:
%%bash

# What's in the output?
zcat data/pairs.sorted.txt.gz | head

Finally, using `cooler cload pairix`, we aggregate (bin) the contacts in `pairs.sorted.txt.gz` against the bins file, `bins.1000kb.bed`, and write the contents to the binary `test.cool` file.

A Pairix-indexed file has the advantage of 2D querying. However, it uses a slightly different sorting convention from the Tabix-indexed pairs files that cooler can also ingest:

1. As in the Tabix scheme, interchromosomal pairs in Pairix files should consistently respect some order of the chromosomes (i.e. be "upper triangular"). Unlike in the Tabix scheme, the chromosome order used to create the pairs file can be arbitrary, and does not need to match the order you wish to use in the cooler file.
2. Whereas Tabix-indexed files are sorted by `chrom1`, `pos1`, `chrom2`, `pos2`, Pairix files are sorted by `chrom1`, `chrom2`, `pos1`, `pos2`.


In [None]:
%%bash

cooler cload pairix -h

In [None]:
%%bash

# alternatively, we could pass $CHROMSIZES_FILE:1000000 below instead of creating $BINS_FILE
BINS_FILE='data/bins.1000kb.bed'
INDEXED_PAIRS_FILE='data/pairs.sorted.txt.gz'
OUTPUT_FILE='data/test.cool'

cooler cload pairix $BINS_FILE $INDEXED_PAIRS_FILE $OUTPUT_FILE