*Please read it in [nbviewer](https://nbviewer.jupyter.org/github/kevingroup/techniXnippets/blob/master/Hi-C_pipeline_v2.ipynb?flush_cache=true)*
# Hi-C pipeline
*Author: Qin CAO*

## Pipeline overview
The pipeline takes raw .fastq files from Hi-C experiments and produces:
* .hic files for each cell sample (containing Hi-C contact matrices at multiple resolutions)
* .hic files for merged biological replicates (if any)
* A/B compartments
* Topologically associating domains (TADs)
* Significant intra-chromosomal interactions

## Raw .fastq files to Hi-C contact matrices
### *[Juicer](https://github.com/aidenlab/juicer/wiki)*

Juicer is a powerful tool for processing Hi-C data. It maps the reads in .fastq files and produces .hic files, a binary format that efficiently stores contact matrices at multiple resolutions.
#### [Installation of Juicer](https://github.com/aidenlab/juicer/wiki/Installation)
#### Configuration of juicer.sh file
We need to adapt the configuration in the juicer.sh file to our own data. Some settings can also be overridden by passing parameters when running juicer.sh. The key settings are listed below.


In [None]:
read1str="_R1" #filename suffix of the first read in each pair
read2str="_R2" #filename suffix of the second read in each pair
site="MboI" #enzyme site
genomeID="hg38" #genome ID
ligation="GATCGATC" #enzyme site ligation junction sequence

#### An example of processing raw .fastq files to .hic files

In [None]:
./juicer.sh -g hg38 -s MboI 
#In this example, the reference genome is hg38 and the enzyme is MboI. 

#### QCs and basic statistics
Juicer reports quality-control (QC) metrics and basic statistics. The key terms are related as follows:
* Sequenced Read Pairs = Normal Paired + Chimeric Paired + Chimeric Ambiguous + Unmapped
* Alignable = Normal Paired + Chimeric Paired
* Alignable = Unique Reads + PCR Duplicates + Optical Duplicates
* Unique Reads = Intra-fragment Reads + Below MAPQ Threshold + Hi-C Contacts
* Hi-C Contacts = Inter-chromosomal + Intra-chromosomal
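
The identities above can be used as a quick sanity check on a statistics report. The sketch below is only an illustration: the function name `check_juicer_stats` and the counts in `example_stats` are made up, and parsing the actual Juicer statistics file is left out.

```python
def check_juicer_stats(stats):
    """Check the four identities listed above; return one boolean per identity."""
    alignable = stats["Normal Paired"] + stats["Chimeric Paired"]
    return {
        "sequenced": stats["Sequenced Read Pairs"]
        == alignable + stats["Chimeric Ambiguous"] + stats["Unmapped"],
        "alignable": alignable
        == stats["Unique Reads"] + stats["PCR Duplicates"] + stats["Optical Duplicates"],
        "unique": stats["Unique Reads"]
        == stats["Intra-fragment Reads"] + stats["Below MAPQ Threshold"] + stats["Hi-C Contacts"],
        "contacts": stats["Hi-C Contacts"]
        == stats["Inter-chromosomal"] + stats["Intra-chromosomal"],
    }


# Toy counts invented for this illustration (not from a real experiment).
example_stats = {
    "Sequenced Read Pairs": 1000,
    "Normal Paired": 700,
    "Chimeric Paired": 200,
    "Chimeric Ambiguous": 60,
    "Unmapped": 40,
    "Unique Reads": 800,
    "PCR Duplicates": 90,
    "Optical Duplicates": 10,
    "Intra-fragment Reads": 50,
    "Below MAPQ Threshold": 150,
    "Hi-C Contacts": 600,
    "Inter-chromosomal": 150,
    "Intra-chromosomal": 450,
}

qc_result = check_juicer_stats(example_stats)
print(qc_result)
```

A `False` entry points to the category whose counts do not add up, which usually means a number was copied incorrectly or a category was omitted.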

#### Read depth and resolution


#### Merge biological replicates
Juicer provides mega.sh, which combines the results of several Juicer runs into a single merged map and is therefore suitable for merging biological replicates.

One final .hic file is produced from all the biological replicates of the same sample.



## Extract interactions from .hic files
### *[Juicer dump](https://github.com/aidenlab/juicer/wiki/Data-Extraction)*
The parameters of juicer dump include the kind of matrix, normalization method, chromosome coordinates and bin resolution.
#### A usage example of dump

In [None]:
java -jar $juicer_tools dump observed KR inter.hic 4 4 BP 50000 4.txt
#This dumps the observed intra-chromosomal matrix of chromosome 4, normalized with Knight-Ruiz (KR) matrix balancing, at 50 kb resolution to the file 4.txt.

#### The example output of the example above

In [None]:
#bin1_start bin2_start KR_normalized_reads
3050000 3050000 689.73926
3050000 3100000 39.28981
3100000 3100000 558.02704
3050000 3150000 20.03267
3100000 3150000 71.13022
3150000 3150000 470.02182
3050000 3200000 10.919163
3100000 3200000 27.693422
3150000 3200000 47.443287
3200000 3200000 457.82416
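
The dump output is a sparse, upper-triangular list of bin pairs. As a minimal sketch (the function name `sparse_to_dense` is ours, not part of Juicer), the lines shown above can be expanded into a dense symmetric matrix using only the Python standard library:

```python
def sparse_to_dense(lines, resolution):
    """Turn 'bin1_start bin2_start value' lines into a dense symmetric matrix.

    Returns (offset, matrix), where offset is the smallest bin start seen and
    matrix[i][j] is the value for bins offset + i*resolution and offset + j*resolution.
    Bin pairs absent from the input are left at 0.0.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        start1, start2, value = line.split()
        records.append((int(start1), int(start2), float(value)))

    offset = min(min(r[0], r[1]) for r in records)
    last = max(max(r[0], r[1]) for r in records)
    size = (last - offset) // resolution + 1

    matrix = [[0.0] * size for _ in range(size)]
    for start1, start2, value in records:
        i = (start1 - offset) // resolution
        j = (start2 - offset) // resolution
        matrix[i][j] = value
        matrix[j][i] = value  # mirror into the lower triangle
    return offset, matrix


dump_lines = """#bin1_start bin2_start KR_normalized_reads
3050000 3050000 689.73926
3050000 3100000 39.28981
3100000 3100000 558.02704
3050000 3150000 20.03267
3100000 3150000 71.13022
3150000 3150000 470.02182
3050000 3200000 10.919163
3100000 3200000 27.693422
3150000 3200000 47.443287
3200000 3200000 457.82416
""".splitlines()

offset, matrix = sparse_to_dense(dump_lines, resolution=50000)
for row in matrix:
    print(row)
```

For whole chromosomes at fine resolutions a dense list of lists becomes large, so a sparse representation may be preferable there.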

## Normalization
We recommend using KR normalization for Hi-C contact matrices, as it is widely used in the literature. KR-normalized contact matrices can be generated directly by Juicer dump, as shown in the example above.


## Call A/B compartment
### *[Juicer eigenvector](https://github.com/aidenlab/juicer/wiki/Eigenvector)*
This command computes the first principal component (PC1) of the Hi-C contact matrix. The sign of PC1 indicates the compartment (e.g. + indicates compartment A and - indicates compartment B, or vice versa). Note that eigenvectors are hard to compute for a very sparse matrix, so the command typically works only at relatively low resolutions (e.g. 500 kb), which is also consistent with the definition of A/B compartments.
#### A usage example of eigenvector

In [None]:
java -jar juicer_tools.jar eigenvector KR HIC001.hic X BP 5000 eigen.txt
#This calculates the eigenvector of chromosome X with KR normalization at 5 kb resolution and writes it to eigen.txt. For compartment calling, use a coarser resolution such as 500000 (500 kb).

#### An example output
The sign indicates the A or B compartment.

In [None]:
#Each line is a bin with PC1 value
#chr    bin_start bin_end PC1 
chr10   3500000 4000000 -0.00357968
chr10   4000000 4500000 -0.0702789
chr10   4500000 5000000 -0.00730329
chr10   5000000 5500000 -0.00533751
chr10   5500000 6000000 -0.00196586
chr10   6000000 6500000 0.0402201
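
Assigning compartments from a table like the one above is a matter of reading the sign of PC1. The following is a minimal sketch; the function name `label_compartments` and the `positive_is_a` flag are our own. Because the sign of an eigenvector is arbitrary, which sign corresponds to compartment A has to be decided separately for each chromosome, and the default used here is only an assumption.

```python
def label_compartments(lines, positive_is_a=True):
    """Label each bin 'A' or 'B' from the sign of its PC1 value.

    positive_is_a is an assumption supplied by the caller: the sign of PC1 is
    arbitrary, so the orientation must be determined independently.
    Bins with PC1 equal to zero or NaN are labelled 'NA'.
    """
    labels = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        chrom, start, end, pc1 = line.split()
        value = float(pc1)
        if value == 0 or value != value:  # zero or NaN
            label = "NA"
        else:
            label = "A" if (value > 0) == positive_is_a else "B"
        labels.append((chrom, int(start), int(end), label))
    return labels


pc1_lines = """#chr    bin_start bin_end PC1
chr10   3500000 4000000 -0.00357968
chr10   4000000 4500000 -0.0702789
chr10   4500000 5000000 -0.00730329
chr10   5000000 5500000 -0.00533751
chr10   5500000 6000000 -0.00196586
chr10   6000000 6500000 0.0402201
""".splitlines()

compartments = label_compartments(pc1_lines)
for row in compartments:
    print(*row, sep="\t")
```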

## Call TAD
### *[Hi-C Domain Caller](http://chromosome.sdsc.edu/mouse/hi-c/download.html)*
Hi-C Domain Caller takes Hi-C contact matrices as input (we recommend KR-normalized matrices) and generates TADs.

#### An example output

In [None]:
#Each line is a TAD
#chr    TAD_start       TAD_end
chr10   11350000        13000000 
chr10   13000000        14400000 
chr10   14750000        17750000 
chr10   18900000        20350000 
chr10   20350000        20950000
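
A quick way to inspect such a TAD list is to compute the size of each domain and the gap to the next one. This is a small sketch with a function name of our own (`tad_sizes_and_gaps`), assuming the TADs are on one chromosome and sorted by start position as in the output above.

```python
def tad_sizes_and_gaps(lines):
    """Return (sizes, gaps) for a sorted, single-chromosome TAD list.

    sizes[i] is TAD_end - TAD_start of the i-th TAD; gaps[i] is the distance
    between the end of the i-th TAD and the start of the next one.
    """
    tads = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        _chrom, start, end = line.split()
        tads.append((int(start), int(end)))

    sizes = [end - start for start, end in tads]
    gaps = [tads[i + 1][0] - tads[i][1] for i in range(len(tads) - 1)]
    return sizes, gaps


tad_lines = """#chr    TAD_start       TAD_end
chr10   11350000        13000000
chr10   13000000        14400000
chr10   14750000        17750000
chr10   18900000        20350000
chr10   20350000        20950000
""".splitlines()

sizes, gaps = tad_sizes_and_gaps(tad_lines)
print("sizes:", sizes)
print("gaps:", gaps)
```

A gap of 0 means two TADs share a boundary; a positive gap marks a region between domains that was not assigned to any TAD.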

## Call significant Hi-C interactions
### *[Fit-Hi-C](https://bioconductor.org/packages/release/bioc/vignettes/FitHiC/inst/doc/fithic.html)*

Fit-Hi-C takes Hi-C interactions at a given resolution as input and produces a p-value and a q-value for each interaction to indicate its significance.

Results can be unreliable at very high input resolutions (small bins), so we recommend a resolution of 25 kb or 50 kb.

We recommend q-value$\le$0.1 as the default threshold for calling significant interactions.

We recommend the [R version](https://bioconductor.org/packages/release/bioc/vignettes/FitHiC/inst/doc/fithic.html) of Fit-Hi-C, as it is actively updated.

#### An example output

In [None]:
#chr_1  bin1_mid        chr_2   bin2_mid        contact         pvalue  qvalue
chr1    fragmentMid1    chr2    fragmentMid2    contactCount    p_value q_value
chr2    25000   chr2    75000   4245    0.0140615961205994      0.132152059562782
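
Applying the recommended threshold to output in this format can be sketched as follows. The function name `significant_interactions` is ours; the code assumes seven whitespace-separated columns with the q-value in the last one, as in the example above, and skips the header row because its last field is not a number.

```python
def significant_interactions(lines, q_threshold=0.1):
    """Keep interactions whose q-value is at most q_threshold."""
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) != 7 or line.startswith("#"):
            continue  # skip comments, blank lines and malformed rows
        try:
            contact = int(fields[4])
            p_value = float(fields[5])
            q_value = float(fields[6])
        except ValueError:
            continue  # header row: fields are column names, not numbers
        if q_value <= q_threshold:
            hits.append(
                (fields[0], int(fields[1]), fields[2], int(fields[3]), contact, p_value, q_value)
            )
    return hits


fithic_lines = """#chr_1  bin1_mid        chr_2   bin2_mid        contact         pvalue  qvalue
chr1    fragmentMid1    chr2    fragmentMid2    contactCount    p_value q_value
chr2    25000   chr2    75000   4245    0.0140615961205994      0.132152059562782
""".splitlines()

default_hits = significant_interactions(fithic_lines)
relaxed_hits = significant_interactions(fithic_lines, q_threshold=0.2)
print(len(default_hits), len(relaxed_hits))
```

With the default threshold the single example interaction is not reported, because its q-value (about 0.132) is above 0.1.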

## Data visualization for roughly checking Hi-C patterns
### *[Juicebox](https://github.com/aidenlab/Juicebox)*
Juicebox can load .hic files and display Hi-C contact matrices. The web version of Juicebox is [here](https://www.aidenlab.org/juicebox/). 

As a rough check, a strong signal along the diagonal and block structures around it should be visible.
#### An example image
<img src="https://github.com/theaidenlab/juicebox/wiki/images/juicebox3.png" width="800" />

