# Table of Contents
 <p><div class="lev1"><a href="#Cooler-command-line-interface"><span class="toc-item-num">1&nbsp;&nbsp;</span>Cooler command line interface</a></div><div class="lev2"><a href="#Example"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Example</a></div><div class="lev2"><a href="#Aggregate-a-list-of-read-pairs-into-a-cool-file"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Aggregate a list of read pairs into a <code>cool</code> file</a></div><div class="lev2"><a href="#Balancing"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Balancing</a></div><div class="lev2"><a href="#Display-the-contact-matrix"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Display the contact matrix</a></div>

Notes:

To simulate shell scripts and terminal interaction, we preface every shell code cell with the "cell magic" `%%bash`, which sends the code to bash instead of the Python interpreter. Another way to send code from IPython to the shell is to prefix a line with the shell escape `!`.

In [None]:
# before we start, set the system locale
import os
os.environ['LC_ALL'] = 'C.UTF-8'
os.environ['LANG'] = 'C.UTF-8'

# Cooler command line interface

If you type `cooler` at the command line with no arguments or with `-h` or `--help` you'll get the following quick reference of available subcommands.

In [None]:
%%bash

cooler -h

For more information about a specific subcommand, type `cooler <subcommand> -h` to display the help text.

In [None]:
%%bash

cooler info -h

## Example

Let's try it.

In [None]:
%%bash

cooler info data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

In [None]:
%%bash

cooler info -f bin-size data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

In [None]:
%%bash

cooler info -m data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

## Aggregate a list of read pairs into a `cool` file

To make a contact matrix, we need

1. A list of read pairs representing captured contacts.
2. A segmentation of the genome into bins by which we aggregate (bin) the read pair counts.

For (1), we will start with a very small subsample of 100,000 read pairs from GSM1551552 (Rao et al., GM12878). The fields of the file are readID, strand1, chrom1, pos1, frag1, strand2, chrom2, pos2, frag2, mapq1, mapq2.
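
To make the column layout concrete, here is a minimal Python sketch that splits one record into the eleven fields listed above. The sample line is made up for illustration (it is not taken from the GSM1551552 file), the helper name `parse_pair` is hypothetical, and the sketch assumes whitespace-separated columns.

```python
from collections import namedtuple

# Field order as listed in the text above.
FIELDS = ["readID", "strand1", "chrom1", "pos1", "frag1",
          "strand2", "chrom2", "pos2", "frag2", "mapq1", "mapq2"]
Pair = namedtuple("Pair", FIELDS)


def parse_pair(line):
    """Split one whitespace-separated record into named fields (hypothetical helper)."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    record = Pair(*values)
    # Only the two positions are converted to integers here.
    return record._replace(pos1=int(record.pos1), pos2=int(record.pos2))


# Made-up record, for illustration only.
example = "read001 0 1 1000500 12 16 2 2500000 34 60 60"
pair = parse_pair(example)
print(pair.chrom1, pair.pos1, pair.chrom2, pair.pos2)
```

Counting from 1, chrom1 and pos1 sit in columns 3 and 4, and chrom2 and pos2 in columns 7 and 8.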

In [None]:
%%bash

zcat data/GSM1551552_HIC003_merged_nodups.txt.subset.gz | head

This data was mapped to the Broad's `b37` assembly and uses ENSEMBL-style chromosome names (`1..22`, `X`, `Y`, `MT`) instead of the UCSC-style (`chr1..chr22`, `chrX`, `chrY`, `chrM`).
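
If your own files use the UCSC style, the names have to be converted before they can be matched against this dataset. The following is a small sketch of such a conversion; the function `ucsc_to_ensembl` is a hypothetical helper and only handles the canonical names listed above, not unplaced scaffolds or alternate contigs.

```python
def ucsc_to_ensembl(name):
    """Convert a UCSC-style chromosome name to the ENSEMBL style (hypothetical helper)."""
    if name == "chrM":
        return "MT"  # the mitochondrial genome is named differently
    if name.startswith("chr"):
        return name[3:]
    return name  # no prefix to strip


for name in ["chr1", "chr22", "chrX", "chrY", "chrM"]:
    print(name, "->", ucsc_to_ensembl(name))
```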

Since we will be storing only half of the contact matrix, we would like the chromosomes to appear in a "natural" order. So we provide a chromosome sizes file listing the chromosomes we want to use in the desired order. Make sure the names match the names in the pairs file! The following is the `b37` chromosome sizes file with the unplaced scaffolds left out.
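
One way to check that the names match is sketched below in Python. It assumes a two-column `name<TAB>length` layout for the chromosome sizes file; the helper `read_chromsizes`, the toy lengths, and the set of names "seen" in a pairs file are all made up for illustration.

```python
import io


def read_chromsizes(handle):
    """Read 'name<TAB>length' lines, preserving file order (hypothetical helper)."""
    sizes = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, length = line.split()[:2]
        sizes[name] = int(length)
    return sizes  # dicts keep insertion order in Python 3.7+


# Toy chromosome sizes, NOT the real b37 lengths.
toy = io.StringIO("1\t5000000\n2\t4000000\nX\t3000000\n")
chromsizes = read_chromsizes(toy)

# Names seen in a made-up pairs file; 'chr2' has no match in the sizes file.
seen = {"1", "2", "chr2"}
missing = sorted(seen - set(chromsizes))
print(list(chromsizes), missing)
```

Any name reported in `missing` would need to be renamed or dropped before sorting and binning.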

In [None]:
%%bash

cat data/b37-chromsizes.select.txt

We also need to decide how we want to bin the contacts. Usually, we choose a fixed bin size or "resolution". Another option for Hi-C data is to use restriction fragment-delimited genomic bins based on the restriction enzyme used in the experiment. `cooler` allows for any binning scheme you like, as long as you provide it as a **bin table**. We can store a bin table in a simple BED file using the `makebins` command.
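
As a rough illustration of what a fixed-size bin table contains, the Python sketch below tiles each chromosome with consecutive intervals and truncates the last one at the chromosome end. The function `make_bins` and the toy lengths are assumptions for illustration; this is not the `cooler` implementation.

```python
def make_bins(chromsizes, binsize):
    """Return (chrom, start, end) tuples covering each chromosome in order."""
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            # The last bin of a chromosome is truncated at its end.
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


# Toy lengths, not real b37 values.
toy_sizes = {"1": 2_500_000, "2": 1_000_000}
bins = make_bins(toy_sizes, 1_000_000)
for chrom, start, end in bins:
    print(chrom, start, end, sep="\t")
```

Each row is a BED-like interval, and the row number serves as the bin's genome-wide ID.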

In [None]:
%%bash

cooler makebins -h

If you have the FASTA sequence of the reference genome, you can also "digest" it to create a bin table of fragments.
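
Conceptually, digestion means finding every occurrence of the enzyme's recognition site and splitting the sequence there. The Python sketch below shows the idea on a made-up 24 bp sequence, using `GATC` (the MboI recognition site) as the example. The function `digest` is hypothetical, and placing the cut at the start of the site is a simplifying assumption; the coordinates that `cooler digest` reports may differ.

```python
def digest(sequence, site):
    """Split a sequence into (start, end) fragments at each occurrence of `site`.

    Simplification: the cut is placed at the first base of the recognition site.
    """
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        if pos > 0:  # a site at position 0 does not create a new fragment
            cuts.append(pos)
        pos = sequence.find(site, pos + 1)
    edges = [0] + cuts + [len(sequence)]
    return list(zip(edges[:-1], edges[1:]))


# Made-up sequence containing two GATC sites.
toy_seq = "AAAAGATCTTTTTTGATCCCCCCC"
fragments = digest(toy_seq, "GATC")
print(fragments)
```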

In [None]:
%%bash

cooler digest -h

In [None]:
%%bash

CHROMSIZES_FILE='data/b37-chromsizes.select.txt'

cooler makebins --out bins.1000kb.bed $CHROMSIZES_FILE 1000000

# what's in the file?
head bins.1000kb.bed

Next, we need to make sure our pairs file is properly oriented, sorted and indexed. We use the `cooler csort` command to do this. What does it do?

1. _Oriented_: given the ordering of the chromosomes, each read pair should lie in the upper triangle of the contact map.
2. _Sorted_: once oriented, the reads are lexically sorted by chrom1, pos1, chrom2, pos2.
3. _Indexed_: we use bgzip to compress the file and [Tabix](http://www.htslib.org/doc/tabix.html) to index it. This creates a small `.tbi` index file which provides random access to any range of contacts along the "row" axis of the contact map. This not only simplifies the process of binning but is also useful for other kinds of read-level analyses.
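
The first two steps can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions, not what `cooler csort` does internally: contacts are plain `(chrom1, pos1, chrom2, pos2)` tuples, strands are ignored, and chromosomes are ordered by their rank in the chromosome sizes file. The function `orient_and_sort` and the toy contacts are made up.

```python
def orient_and_sort(pairs, chrom_order):
    """Flip each pair into the upper triangle, then sort by (chrom1, pos1, chrom2, pos2)."""
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    oriented = []
    for c1, p1, c2, p2 in pairs:
        # Swap the two ends when side 1 comes after side 2 in genome order.
        if (rank[c1], p1) > (rank[c2], p2):
            c1, p1, c2, p2 = c2, p2, c1, p1
        oriented.append((c1, p1, c2, p2))
    oriented.sort(key=lambda r: (rank[r[0]], r[1], rank[r[2]], r[3]))
    return oriented


# Made-up contacts as (chrom1, pos1, chrom2, pos2).
toy_pairs = [("2", 500, "1", 900), ("1", 800, "1", 100), ("X", 10, "2", 20)]
sorted_pairs = orient_and_sort(toy_pairs, ["1", "2", "X"])
print(sorted_pairs)
```

After orientation, the first end of every contact is never downstream of the second, so only the upper triangle of the map is populated.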

For now, we only extract the chrom1, pos1, chrom2, pos2, strand1, strand2 fields from the pairs file. You can specify which column of the source contacts file holds each of these fields.

In [None]:
!cooler csort -h

In [None]:
%%bash

CHROMSIZES_FILE='data/b37-chromsizes.select.txt'
PAIRS_FILE='data/GSM1551552_HIC003_merged_nodups.txt.subset.gz'

cooler csort -c1 3 -p1 4 -c2 7 -p2 8 -s1 2 -s2 6 --out pairs.sorted.txt.gz $CHROMSIZES_FILE $PAIRS_FILE

In [None]:
%%bash

# What's in the output?
zcat pairs.sorted.txt.gz | head

Finally, using `cooler cload`, we aggregate (bin) the contacts in `pairs.sorted.txt.gz` against the bins file, `bins.1000kb.bed`, and write the contents to the binary `test.cool` file.
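
The aggregation step itself is simple to picture: each end of a contact is mapped to a genome-wide bin ID, and contacts falling into the same pair of bins are counted. Below is a minimal Python sketch for fixed-size bins. The function names, toy lengths, and toy contacts are assumptions for illustration, and positions are treated as 0-based; it is not how `cooler cload` is implemented.

```python
from collections import Counter


def bin_offsets(chromsizes, binsize):
    """Map each chromosome to the genome-wide index of its first bin."""
    offsets, total = {}, 0
    for chrom, length in chromsizes.items():
        offsets[chrom] = total
        total += -(-length // binsize)  # ceiling division: bins per chromosome
    return offsets


def aggregate(pairs, chromsizes, binsize):
    """Count contacts per (bin1_id, bin2_id) pixel, keeping bin1_id <= bin2_id."""
    offsets = bin_offsets(chromsizes, binsize)
    pixels = Counter()
    for c1, p1, c2, p2 in pairs:
        b1 = offsets[c1] + p1 // binsize
        b2 = offsets[c2] + p2 // binsize
        pixels[(min(b1, b2), max(b1, b2))] += 1
    return pixels


# Toy inputs, not real data.
toy_sizes = {"1": 2_500_000, "2": 1_000_000}
toy_pairs = [("1", 100, "1", 1_200_000), ("1", 900_000, "1", 1_999_999),
             ("1", 2_400_000, "2", 5)]
pixels = aggregate(toy_pairs, toy_sizes, 1_000_000)
for (b1, b2), count in sorted(pixels.items()):
    print(b1, b2, count, sep="\t")
```

Only nonzero pixels are kept, which is what makes a sparse representation of the upper triangle compact.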

In [None]:
%%bash

BINS_FILE='bins.1000kb.bed'
INDEXED_PAIRS_FILE='pairs.sorted.txt.gz'
OUTPUT_FILE='test.cool'

cooler cload $BINS_FILE $INDEXED_PAIRS_FILE $OUTPUT_FILE

The `cooler dump` command lets us print the data back out as text with several formatting and annotation options. It also accepts range queries, both intra- and inter-chromosomal.
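
To illustrate what a two-sided range query selects, here is a small Python sketch that parses region strings of the form `chrom:start-end` (with optional thousands separators) and keeps only the rows whose two bins overlap the two regions. The helpers `parse_region` and `in_region` and the rows are hypothetical; the parsing rules of `cooler dump` itself may differ in details.

```python
def parse_region(region):
    """Parse 'chrom:start-end' (commas allowed in numbers) into (chrom, start, end).

    A bare chromosome name returns (chrom, None, None). Hypothetical helper.
    """
    chrom, sep, span = region.partition(":")
    if not sep:
        return chrom, None, None
    start, _, end = span.replace(",", "").partition("-")
    return chrom, int(start), int(end)


def in_region(chrom, start, end, region):
    """True if the half-open interval [start, end) on `chrom` overlaps `region`."""
    rchrom, rstart, rend = region
    if chrom != rchrom:
        return False
    if rstart is None:
        return True
    return start < rend and end > rstart


# Made-up joined pixel rows: chrom1, start1, end1, chrom2, start2, end2, count.
rows = [("10", 10_000_000, 11_000_000, "10", 30_000_000, 31_000_000, 4),
        ("10", 5_000_000, 6_000_000, "10", 30_000_000, 31_000_000, 1)]
r1 = parse_region("10:10,000,000-20,000,000")
r2 = parse_region("10:30,000,000-80,000,000")
selected = [row for row in rows
            if in_region(*row[0:3], r1) and in_region(*row[3:6], r2)]
print(selected)
```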

In [None]:
%%bash

cooler dump -h

In [None]:
%%bash

cooler dump -t chroms test.cool

In [None]:
%%bash

cooler dump -t bins test.cool | head

In [None]:
%%bash

cooler dump -t pixels --header test.cool | head

In [None]:
%%bash

cooler dump -t pixels --header --join test.cool | head

In [None]:
%%bash

cooler dump -t pixels -r 10:10,000,000-20,000,000 -r2 10:30,000,000-80,000,000 --header --join test.cool | head

In [None]:
%%bash

cooler dump -t pixels --header --balanced test.cool | head

Oops! Our contact matrix isn't balanced yet. Let's do that next.

## Balancing

We usually normalize or "correct" Hi-C data using a technique called matrix balancing, also known as iterative correction. This involves finding a weight or bias $b_i$ for each bin $i$ such that, after rescaling

$$ \mathrm{Normalized}[i,j] = \mathrm{Observed}[i,j] \cdot b_i \cdot b_j, $$

the marginals (i.e., row/column sums) of the global contact matrix are flat and equal.

`cooler balance` will store the pre-computed balancing weights in the bin table as an extra column called `weight`.

Note that whole-genome matrix balancing on a high resolution matrix requires iterative computations on a matrix that may not fit in computer memory, even in sparse form. Our "out-of-core" method performs the calculations by splitting and loading the data into smaller chunks and combining the partial results afterwards.
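
The chunking idea can be illustrated with the marginal sums that each balancing iteration needs. In the Python sketch below, the pixel list is processed in fixed-size chunks, each chunk produces a partial result, and the partial results are added up. The function names and toy pixels are made up, and for simplicity the whole list lives in memory here, whereas an out-of-core method would read each chunk from disk.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def marginals(pixels, n_bins, chunksize):
    """Accumulate row/column sums of a symmetric matrix stored as upper-triangle pixels."""
    totals = [0.0] * n_bins
    for chunk in chunked(pixels, chunksize):
        partial = [0.0] * n_bins  # partial result for this chunk only
        for b1, b2, count in chunk:
            partial[b1] += count
            if b1 != b2:  # mirror off-diagonal pixels into the lower triangle
                partial[b2] += count
        totals = [t + p for t, p in zip(totals, partial)]
    return totals


# Toy upper-triangle pixels: (bin1_id, bin2_id, count).
toy_pixels = [(0, 0, 2), (0, 1, 3), (0, 2, 1), (1, 1, 4), (1, 2, 5), (2, 2, 6)]
print(marginals(toy_pixels, n_bins=3, chunksize=2))
```

The result does not depend on the chunk size, which is why the chunks can also be handed to separate worker processes.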

In [None]:
%%bash

cooler balance -h

`cooler balance` iterates until the balanced marginals (i.e. row sums of the balanced matrix) are sufficiently flat (the variance falls below the limit `tol`).
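
The loop can be sketched on a small dense matrix in pure Python. This is a simplified illustration, not the `cooler balance` implementation: it skips bin filtering and chunking, the function `balance` and the toy matrix are made up, and measuring the variance on marginals divided by their mean is an assumption of this sketch.

```python
def balance(matrix, tol=1e-5, max_iters=200):
    """Iterative correction on a small dense symmetric matrix.

    Returns weights b such that matrix[i][j] * b[i] * b[j] has near-equal row sums.
    """
    n = len(matrix)
    b = [1.0] * n
    for _ in range(max_iters):
        # Row sums (marginals) of the currently balanced matrix.
        m = [sum(matrix[i][j] * b[i] * b[j] for j in range(n)) for i in range(n)]
        mean = sum(m) / n
        s = [x / mean for x in m]  # marginals relative to their mean
        var = sum((x - 1.0) ** 2 for x in s) / n
        if var < tol:
            break
        # Divide each weight by its relative marginal.
        b = [b[i] / s[i] for i in range(n)]
    return b


# Toy symmetric count matrix, not real data.
toy = [[2, 3, 1],
       [3, 4, 5],
       [1, 5, 6]]
weights = balance(toy)
balanced_marginals = [sum(toy[i][j] * weights[i] * weights[j] for j in range(3))
                      for i in range(3)]
print([round(x, 3) for x in balanced_marginals])
```

A smaller `tol` gives flatter marginals at the cost of more iterations.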

In [None]:
%%bash

cooler balance -p 10 -c 10000 test.cool

In [None]:
%%bash

cooler dump --header --balanced test.cool | head

## Display the contact matrix

You can also use the `cooler show` command to produce images of the contact matrix. This requires the `matplotlib` Python package.

In [None]:
%%bash

cooler show -h

Here's the small subsampled dataset we just built.

In [None]:
%%bash

cooler show -b --out test.png --dpi 200 test.cool 3:0-80,000,000

In [None]:
from IPython.display import Image
Image('test.png')

Here's what the full one looks like.

In [None]:
%%bash

cooler show -b --out test2.png --dpi 200 data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool chr3:0-80,000,000

In [None]:
from IPython.display import Image
Image('test2.png')