# Table of Contents
- [Cooler quickstart](#Cooler-quickstart)
  - [Direct access with `h5py`](#Direct-access-with-h5py)
  - [The `Cooler` class](#The-Cooler-class)
    - [The info dictionary](#The-info-dictionary)
    - [Table Views](#Table-Views)
    - [Enter The Matrix](#Enter-The-Matrix)
    - [Balancing your selection](#Balancing-your-selection)
    - [Genomic coordinate range selection](#Genomic-coordinate-range-selection)
  - [Functional API](#Functional-API)

# Cooler quickstart

In [None]:
from __future__ import division, print_function
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas
import h5py

import cooler

In [None]:
!wget ftp://cooler.csail.mit.edu/coolers/hg19/Rao2014-GM12878-MboI-allreps-filtered.5kb.cool

In [None]:
filepath = 'Rao2014-GM12878-MboI-allreps-filtered.5kb.cool'

## Direct access with `h5py`

The `h5py` library (HDF5 for Python) provides an excellent Pythonic interface between HDF5 and native [NumPy](http://www.numpy.org/) arrays and dtypes. It allows you to treat an HDF5 file like a dictionary with complete access to the file's contents as well as the ability to manipulate groups and read or write datasets and attributes. There is additionally a low-level API that wraps the `libhdf5` C functions directly. See the [h5py docs](http://docs.h5py.org/en/latest/index.html).
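As a minimal sketch of this dictionary-style access, the snippet below builds a throwaway in-memory HDF5 file and reads it back. The file name `toy-sketch.h5`, the group and dataset names, and all values are made up for illustration; they do not come from the cooler file used in this notebook.

```python
import h5py
import numpy as np

# driver="core" with backing_store=False keeps the file in memory only,
# so nothing is written to disk. All names and values here are hypothetical.
with h5py.File("toy-sketch.h5", "w", driver="core", backing_store=False) as f:
    pixels = f.create_group("pixels")
    pixels.create_dataset("bin1_id", data=np.array([0, 0, 1]))
    pixels.create_dataset("count", data=np.array([10, 4, 7]))
    f.attrs["bin-size"] = 5000  # attributes behave like a small dict too

    group_names = list(f.keys())             # top-level keys, like a dict
    column_names = list(f["pixels"].keys())  # groups nest like dicts
    dset = f["pixels"]["count"]              # a dataset object, not yet in memory
    first_two = dset[:2]                     # slicing reads into a NumPy array
    bin_size = int(f.attrs["bin-size"])

print(group_names, column_names, first_two, bin_size)
```

The cells that follow do the same kind of exploration on the real cooler file.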

In [None]:
h5 = h5py.File(filepath, 'r')

In [None]:
h5

In [None]:
h5.keys()

Files and Groups are `dict`-like.

In [None]:
h5['pixels']

In [None]:
list(h5['pixels'].keys())

`h5py` dataset objects are **views** onto the data on disk.

In [None]:
h5['pixels']['bin2_id']

Slicing or indexing returns a numpy array in memory.

In [None]:
h5['pixels']['bin2_id'][:10]

In [None]:
h5['pixels']['count'][:10]

In [None]:
h5.close()

The Python `cooler` package is just a thin wrapper over `h5py`.

- It lets you access the data tables as [Pandas](http://pandas.pydata.org/) [data frames and series](http://pandas.pydata.org/pandas-docs/stable/10min.html). 
- It also provides a _matrix abstraction_ that lets you query the upper triangle pixel table as if it were a full rectangular [sparse matrix](http://www.scipy-lectures.org/advanced/scipy_sparse/storage_schemes.html) via [SciPy](http://www.scipy-lectures.org/index.html).

See below.

## The `Cooler` class

The `Cooler` constructor accepts a file path or an open HDF5 file object.

NOTE: Using a filepath allows the `Cooler` object to be serialized/pickled since the file is only opened when needed.


In [None]:
c = cooler.Cooler(filepath)

### The info dictionary

In [None]:
c.info

### Table Views
Tables are accessed via methods.

In [None]:
c.chroms()

The return value is a selector or "view" on a table that accepts column and range queries ("slices").

- Column selections return a new view.
- Range selections return pandas [DataFrames or Series](http://pandas.pydata.org/pandas-docs/stable/dsintro.html).
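The selector pattern described above can be sketched with a small class. `TableViewSketch` is a hypothetical name invented for this illustration, and the toy columns are made up; this shows the idea only and is not how cooler implements its views.

```python
import numpy as np
import pandas as pd


class TableViewSketch:
    """Hypothetical lazy selector over a dict of equal-length 1D arrays."""

    def __init__(self, columns, fields=None):
        self._columns = columns
        # A list of names yields DataFrames; a single name yields Series.
        self._fields = list(columns) if fields is None else fields

    def __getitem__(self, key):
        if isinstance(key, (str, list)):
            # Column selection: nothing is read, we just return a narrower view.
            return TableViewSketch(self._columns, key)
        if isinstance(key, slice):
            # Range selection: materialize only the requested rows.
            n_rows = len(next(iter(self._columns.values())))
            index = np.arange(n_rows)[key]  # keep the original row numbers
            if isinstance(self._fields, str):
                values = self._columns[self._fields][key]
                return pd.Series(values, index=index, name=self._fields)
            data = {name: self._columns[name][key] for name in self._fields}
            return pd.DataFrame(data, index=index)
        raise TypeError("expected a column name, a list of names, or a slice")


# Made-up toy table standing in for an on-disk bin table.
table = TableViewSketch({
    "chrom": np.array(["chr1", "chr1", "chr2", "chr2"]),
    "start": np.array([0, 5000, 0, 5000]),
    "end": np.array([5000, 10000, 5000, 10000]),
})

coords = table[["chrom", "start"]]  # still a view
rows = coords[1:3]                  # a DataFrame with rows 1 and 2
starts = table["start"][:2]         # a Series
print(rows)
```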

In [None]:
c.chroms()[1:5]

In [None]:
# get the whole table
c.chroms()[:]

In the bin table, the **weight** column contains the _matrix balancing weights_ computed for each genomic bin.

In [None]:
c.bins()[:10]

Selecting a list of columns returns a new DataFrame view on that subset of columns.

In [None]:
bins = c.bins()[['chrom', 'start', 'end']]
bins

In [None]:
bins[:10]

Selecting a single column returns a Series view.

In [None]:
weights = c.bins()['weight']
weights

In [None]:
weights[500:510]

The pixel table contains the non-zero upper triangle entries of the contact map.

In [None]:
c.pixels()[:10]

Use the `join=True` option if you would like to expand the bin IDs into genomic bin coordinates by joining the output with the bin table.

In [None]:
c.pixels(join=True)[:10]

Pandas lets you readily dump any table selection to a tabular text file.

In [None]:
df = c.pixels(join=True)[:100]

# tab-delimited file, don't write the index column or header row
df.to_csv('myselection.txt', sep='\t', index=False, header=False)

In [None]:
!head myselection.txt

Another way to annotate the bins in a data frame of pixels is to use `cooler.annotate`. It does a [left outer join](http://chris.friedline.net/2015-12-15-rutgers/lessons/python2/04-merging-data.html) from the `bin1_id` and `bin2_id` columns onto a data frame indexed by bin ID that describes the bins.
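To make the join concrete, here is a rough pandas-only sketch of the same idea on made-up toy tables. `annotate_sketch` is a hypothetical helper written for this illustration, and the `1`/`2` column suffixes are an assumption about the output layout; it is not the implementation of `cooler.annotate`.

```python
import pandas as pd

# Hypothetical toy bin table; the default integer index plays the role of bin ID.
bins = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [0, 5000, 10000, 0],
    "end": [5000, 10000, 15000, 5000],
})

# Hypothetical toy pixel table with bin IDs only.
pix = pd.DataFrame({
    "bin1_id": [0, 0, 1, 2],
    "bin2_id": [0, 2, 3, 3],
    "count": [10, 4, 1, 7],
})


def annotate_sketch(pixels, bins):
    """Left-join bin columns onto pixels, once per bin ID column."""
    out = pixels.merge(
        bins.add_suffix("1"), how="left", left_on="bin1_id", right_index=True
    )
    out = out.merge(
        bins.add_suffix("2"), how="left", left_on="bin2_id", right_index=True
    )
    return out


annotated = annotate_sketch(pix, bins)
print(annotated)
```

Because both joins are left joins, every pixel row is kept, and a pixel whose bin ID is missing from the bin table would get missing values in the joined columns.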

In [None]:
bins = c.bins()[:]  # fetch all the bins

pix = c.pixels()[100:110]  # select some pixels with unannotated bins
pix

In [None]:
cooler.annotate(pix, bins)

In [None]:
cooler.annotate(pix, bins[['weight']], replace=False)

### Enter The Matrix

Finally, the `matrix` method provides a 2D-sliceable view on the data. It allows you to query the data on file as a full rectangular contact matrix.

In [None]:
c.matrix()

The result of a query is a `scipy.sparse.coo_matrix` object.

In [None]:
mat = c.matrix()[1000:1200, 1000:1200]
mat

It is straightforward to convert to a dense 2D numpy array.

In [None]:
arr = mat.toarray()
arr

Notice that the lower triangle has been automatically filled in.
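The fill-in can be pictured as mirroring every off-diagonal pixel across the diagonal. The sketch below does this for a made-up set of upper triangle pixels; the variable names and values are hypothetical, and it illustrates the idea rather than cooler's internal code.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Made-up upper triangle pixels (bin1_id <= bin2_id) for a 4 x 4 matrix.
bin1 = np.array([0, 0, 1, 2])
bin2 = np.array([0, 2, 1, 3])
count = np.array([5, 2, 7, 1])
n_bins = 4

# Mirror only the off-diagonal entries so the diagonal is not counted twice.
off_diag = bin1 != bin2
rows = np.concatenate([bin1, bin2[off_diag]])
cols = np.concatenate([bin2, bin1[off_diag]])
vals = np.concatenate([count, count[off_diag]])

full = coo_matrix((vals, (rows, cols)), shape=(n_bins, n_bins))
dense = full.toarray()
print(dense)
```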

In [None]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
im = ax.matshow(np.log10(arr), cmap='YlOrRd')
fig.colorbar(im)

Notice the light and dark "banded" appearance? That's because you are looking at the unnormalized counts.

### Balancing your selection

Hi-C data are usually normalized, or "corrected", using a technique called matrix balancing. This involves finding a weight (or bias) $b_i$ for each bin $i$ such that, with

$$ \text{Normalized}[i,j] = \text{Observed}[i,j] \cdot b_i \cdot b_j, $$

the marginals (i.e., row and column sums) of the normalized global contact matrix are flat and equal.
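As a toy illustration of what "flat marginals" means, the sketch below finds such weights for a small, dense, strictly positive symmetric matrix with a simple fixed-point iteration. The matrix is random made-up data, and the iteration is one straightforward way to balance a matrix; it is an assumption for illustration, not a description of how the stored cooler weights were computed.

```python
import numpy as np

# Made-up symmetric, strictly positive "contact matrix".
rng = np.random.default_rng(0)
upper = np.triu(rng.integers(10, 50, size=(6, 6))).astype(float)
observed = upper + np.triu(upper, k=1).T

bias = np.ones(len(observed))
for _ in range(500):
    # Row sums of the currently balanced matrix Observed[i,j] * b_i * b_j.
    marginals = (observed * np.outer(bias, bias)).sum(axis=1)
    scale = marginals / marginals.mean()
    if np.allclose(scale, 1.0, rtol=0, atol=1e-10):
        break  # all row sums are (numerically) equal
    # Shrink the weights of bins whose marginal is too high, and vice versa.
    bias /= scale

normalized = observed * np.outer(bias, bias)
row_sums = normalized.sum(axis=1)
print(row_sums)
```

Because the normalized matrix stays symmetric, flat row sums imply flat column sums as well.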

Cooler can store the pre-computed balancing weights in the bin table. You can manually apply them to balance your selection.

In [None]:
# get the balancing weights as a numpy array
weights = c.bins()['weight']  # view
bias = weights[1000:1200]     # series
bias = bias.values            # array

# fetch a sparse matrix
mat = c.matrix()[1000:1200, 1000:1200]

# apply the balancing weights
mat.data = bias[mat.row] * bias[mat.col] * mat.data

# convert to dense numpy array
arr = mat.toarray()

As a shortcut, we get the same result by passing `balance=True` to the matrix view constructor.

In [None]:
arr2 = c.matrix(balance=True)[1000:1200, 1000:1200].toarray()
np.allclose(arr, arr2, equal_nan=True)

In [None]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
im = ax.matshow(np.log10(arr), cmap='YlOrRd')
fig.colorbar(im)

### Genomic coordinate range selection

The bin table, pixel table and matrix views also accept UCSC-style genomic range strings or (chrom, start, end) triples.
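To make the string format concrete, here is a sketch of how such a range string could be split into a (chrom, start, end) triple. `parse_region_sketch` is a hypothetical helper written for this illustration; it is not cooler's own parser and handles only the two forms shown in this section.

```python
import re


def parse_region_sketch(region):
    """Split 'chrom' or 'chrom:start-end' into a (chrom, start, end) triple.

    Thousands separators (commas) in the coordinates are allowed.
    A bare chromosome name yields (chrom, None, None).
    """
    match = re.fullmatch(r"([^:]+)(?::([\d,]+)-([\d,]+))?", region.strip())
    if match is None:
        raise ValueError("could not parse region: {!r}".format(region))
    chrom, start, end = match.groups()
    if start is None:
        return chrom, None, None
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    if end < start:
        raise ValueError("end is smaller than start in {!r}".format(region))
    return chrom, start, end


print(parse_region_sketch("chr2:10,000,000-20,000,000"))
print(parse_region_sketch("chr21"))
```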

In [None]:
c.bins().fetch('chr2:10,000,000-20,000,000')

In [None]:
cis = c.matrix().fetch('chr21')
cis.shape

In [None]:
trans = c.matrix().fetch('chr21', 'chr22')
trans.shape

## Functional API

Instead of the methods of the `Cooler` class, you can use the similarly named functions in the `cooler` module directly. However, they will only accept an open HDF5 file handle, not a file path string, and they execute their queries eagerly.

Open the HDF5 file with `h5py`:

In [None]:
h5 = h5py.File(filepath, 'r')

In [None]:
cooler.info(h5)

In [None]:
cooler.bins(h5, 0, 10)

... etc.

Note that `cooler.get()` is a very generic utility that lets you interpret an HDF5 group containing 1D datasets as a table.
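The idea of reading a group of 1D datasets as a table can be sketched in a few lines. `group_to_frame_sketch`, its parameters, and the toy in-memory file are all hypothetical and made up for this illustration; they are not the signature or implementation of `cooler.get()`, whose actual documentation is printed in the next cell.

```python
import h5py
import numpy as np
import pandas as pd


def group_to_frame_sketch(group, lo=0, hi=None, fields=None):
    """Hypothetical helper: read rows lo..hi of a group's 1D datasets as a table."""
    if fields is None:
        fields = list(group.keys())
    # Slicing each dataset pulls only the requested rows into memory.
    data = {name: group[name][lo:hi] for name in fields}
    n_rows = len(data[fields[0]]) if fields else 0
    # Keep the on-disk row numbers as the index.
    return pd.DataFrame(data, index=np.arange(lo, lo + n_rows))


# Throwaway in-memory file with made-up values (nothing is written to disk).
with h5py.File("toy-table-sketch.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("pixels")
    grp.create_dataset("bin1_id", data=np.array([0, 0, 1, 2]))
    grp.create_dataset("bin2_id", data=np.array([0, 2, 1, 3]))
    grp.create_dataset("count", data=np.array([5, 2, 7, 1]))

    table = group_to_frame_sketch(grp, lo=1, hi=3)

print(table)
```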

In [None]:
print(cooler.get.__doc__)