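# Complete HiC Enhancement Workflow, Using GM12878 as an Example

### 1. HiC File Preprocessing: Reading Data

#### Data files

To locate the files for analysis: the sample data is stored in **{Datasets}/HiC/raw**. First, find every _MAPQGE30_ directory there that sits under a _10kb_ directory: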

~~~bash
$ find . -name "10kb*" -type d -exec find {} -name "MAPQGE30" -type d \;
~~~
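
The same lookup can be sketched in Python with `pathlib`. This only illustrates what the `find` command above does; the function name `find_mapq_dirs` and its parameters are hypothetical and not part of the pipeline scripts.

```python
from pathlib import Path


def find_mapq_dirs(root, res_prefix="10kb", mapq="MAPQGE30"):
    """Return directories named `mapq` found under any directory whose name starts with `res_prefix`."""
    root = Path(root)
    hits = set()
    # Outer search: every directory whose name starts with the resolution prefix.
    for res_dir in root.rglob(res_prefix + "*"):
        if not res_dir.is_dir():
            continue
        # Inner search: every directory named exactly like the mapping-quality folder.
        for mapq_dir in res_dir.rglob(mapq):
            if mapq_dir.is_dir():
                hits.add(mapq_dir)
    # A set avoids duplicates when matching directories are nested inside each other.
    return sorted(hits)
```

Calling `find_mapq_dirs(".")` from the raw-data directory should list the same directories as the shell command.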

#### Handling abnormal files

In an affected file, replace every line that consists solely of `0.0` with `NaN`:

~~~bash
$ sed -i 's/^0\.0$/NaN/g' file_need.change
~~~
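
For readers who prefer to do this step in Python, here is a minimal sketch of the same substitution: any line that consists solely of `0.0` becomes `NaN`, and all other lines pass through unchanged. The helper name `zeros_to_nan` is hypothetical.

```python
def zeros_to_nan(lines):
    """Replace every line that is exactly '0.0' with 'NaN'; keep all other lines."""
    fixed = []
    for line in lines:
        text = line.rstrip("\n")
        # Only a whole-line match counts, so values such as '10.0' or '0.05' are kept.
        fixed.append("NaN" if text == "0.0" else text)
    return fixed
```

The function returns the lines without trailing newlines, so join them with `"\n"` before writing the file back.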

#### Known abnormal files

* K562_chr22: use SQRTVCnorm instead
* IMR90_chr9: use SQRTVCnorm instead

In [1]:
%run data_aread.py -h

usage: data_aread.py -c CELL_LINE
                     [-hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}]
                     [-q {MAPQGE30,MAPQG0}] [-n {KRnorm,SQRTVCnorm,VCnorm}]
                     [--help]

Read raw data from Rao's Hi-C.

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          REQUIRED: Cell line for analysis[example:GM12878]

Miscellaneous Arguments:
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        High resolution specified[default:10kb]
  -q {MAPQGE30,MAPQG0}  Mapping quality of raw data[default:MAPQGE30]
  -n {KRnorm,SQRTVCnorm,VCnorm}
                        The normalization file for raw data[default:KRnorm]


#### >>> Run

~~~bash
$ python data_aread.py -c GM12878
~~~

Takes less than <mark>2.5 min</mark>.
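
To make the role of the normalization file (`-n`) concrete, the sketch below applies a norm vector to sparse contacts in the way the README distributed with Rao et al.'s data describes: each raw count at `(i, j)` is divided by the product of the norm-vector entries of the two bins. Whether `data_aread.py` does exactly this is an assumption; the function name and the in-memory data layout are hypothetical.

```python
import math


def normalize_contacts(triplets, norm, resolution):
    """Divide each raw count by the product of the norm factors of its two bins.

    triplets   : iterable of (start_i, start_j, raw_count), starts in base pairs
    norm       : list of per-bin norm factors (may contain NaN)
    resolution : bin size in base pairs, e.g. 10000 for 10kb
    """
    normalized = []
    for start_i, start_j, count in triplets:
        bin_i = start_i // resolution
        bin_j = start_j // resolution
        factor = norm[bin_i] * norm[bin_j]
        # A NaN factor marks a bin without a usable norm value, so the result is NaN too.
        value = math.nan if math.isnan(factor) else count / factor
        normalized.append((bin_i, bin_j, value))
    return normalized
```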

### 2. Data Downsampling

In [2]:
%run data_downsample.py -h

usage: data_downsample.py -c CELL_LINE -hr
                          {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -lr
                          LOW_RES -r RATIO [--help]

Downsample data from high resolution data

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          REQUIRED: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        REQUIRED: High resolution specified[example:10kb]
  -lr LOW_RES           REQUIRED: Low resolution specified[example:40kb]
  -r RATIO              REQUIRED: The ratio of downsampling[example:16]


#### >>> Run

~~~bash
$ python data_downsample.py -c GM12878 -hr 10kb -lr 40kb -r 16
~~~

Takes less than <mark>2 min</mark>.
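
One common way to simulate low-coverage data from a high-coverage matrix is to keep each read independently with probability `1/RATIO`. The sketch below shows that idea on a plain list of contact counts. It is an assumption that `data_downsample.py` works along these lines; the function `thin_counts` and its `seed` parameter are hypothetical.

```python
import random


def thin_counts(counts, ratio, seed=0):
    """Keep each individual read with probability 1/ratio (binomial thinning)."""
    rng = random.Random(seed)
    keep_prob = 1.0 / ratio
    thinned = []
    for count in counts:
        # Draw one Bernoulli trial per read and count the reads that survive.
        kept = sum(1 for _ in range(int(count)) if rng.random() < keep_prob)
        thinned.append(kept)
    return thinned
```

With `ratio=16`, as in the command above, the expected total is one sixteenth of the original read count.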

### 3. Data Splitting

In [3]:
%run data_generate.py -h

usage: data_generate.py -c CELL_LINE -hr
                        {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -lr LOW_RES
                        [-s {all,train,valid}] -chunk CHUNK -stride STRIDE
                        -bound BOUND -scale SCALE [-type {max,avg}] [--help]

Divide data for train and predict

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          REQUIRED: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        REQUIRED: High resolution specified[example:10kb]
  -lr LOW_RES           REQUIRED: Low resolution specified[example:40kb]
  -s {all,train,valid}  REQUIRED: Dataset for train/valid/predict(all)

SRGAN Arguments:
  -chunk CHUNK          REQUIRED: chunk size for dividing[example:40]
  -stride STRIDE        REQUIRED: stride for dividing[example:40]
  -bound BOUND          REQUIRED: distance boundary interested[example:201]
  -scale SCALE         

#### >>> Run

* Generate training and validation data
~~~bash
$ python data_generate.py -c GM12878 -hr 10kb -lr 40kb -s train -chunk 40 -stride 40 -bound 201 -scale 1
$ python data_generate.py -c GM12878 -hr 10kb -lr 40kb -s valid -chunk 40 -stride 40 -bound 201 -scale 1
~~~


* Generate prediction data
~~~bash
$ python data_generate.py -c GM12878 -hr 10kb -lr 40kb -s all -chunk 40 -stride 40 -bound 201 -scale 1
~~~


Takes less than <mark>3 min</mark>.
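
The `-chunk`, `-stride`, and `-bound` options suggest a sliding-window split of each contact matrix. The sketch below lists the top-left corners of `chunk` x `chunk` windows taken every `stride` bins, keeping only windows whose row and column offsets differ by at most `bound` bins. This reading of the options is an assumption, and `chunk_origins` is a hypothetical helper, not a function from `data_generate.py`.

```python
def chunk_origins(n_bins, chunk, stride, bound):
    """List (row, col) origins of chunk x chunk windows that lie within `bound` bins of the diagonal."""
    origins = []
    # Last start position at which a full window still fits inside the matrix.
    last_start = n_bins - chunk
    for i in range(0, last_start + 1, stride):
        for j in range(0, last_start + 1, stride):
            # Skip windows that are too far from the diagonal.
            if abs(i - j) <= bound:
                origins.append((i, j))
    return origins
```

With `chunk` equal to `stride` (both 40 in the commands above), the windows tile the matrix without overlap.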

### 4. Data Prediction

In [4]:
%run data_predict.py -h

usage: data_predict.py -c CELL_LINE -lr LOW_RES -ckpt CHECKPOINT [--cuda CUDA]
                       [--help]

Predict data with HiCplus and SRGAN model

optional arguments:
  --help, -h        Print this help message and exit

Required Arguments:
  -c CELL_LINE      REQUIRED: Cell line for analysis[example:GM12878]
  -lr LOW_RES       REQUIRED: Low resolution specified[example:40kb]
  -ckpt CHECKPOINT  REQUIRED: Checkpoint file of SRGAN model

Miscellaneous Arguments:
  --cuda CUDA       Whether or not using CUDA[default:1]


#### >>> Run

~~~bash
$ python data_predict.py -c GM12878 -lr 40kb -ckpt save/generator_nonpool_noupsample_b201.pytorch
~~~


Takes less than <mark>4 min</mark>.

## Post-processing

### 1. Generate FitHiC Input Files

In [5]:
%run input_pfithic.py -h

usage: input_pfithic.py -c CELL_LINE -lr LOW_RES [-hr HIGH_RES]
                        [-L LOWERBOUND] [-U UPPERBOUND] [--help]

Generate pFitHiC inputs for Loops calling

optional arguments:
  --help, -h     Print this help message and exit

Required Arguments:
  -c CELL_LINE   REQUIRED: Cell line for analysis[example:GM12878]
  -lr LOW_RES    REQUIRED: The low resolution predicted from[example:40kb]

Miscellaneous Arguments:
  -hr HIGH_RES   OPTIONAL: The high resolution which predicted[default:10kb]
  -L LOWERBOUND  OPTIONAL: lower bound on the intra-chromosomal distance
                 range[default:1]
  -U UPPERBOUND  OPTIONAL: upper bound on the intra-chromosomal distance
                 range[default:110]


#### >>> Run

~~~bash
$ python input_pfithic.py -c GM12878 -lr 40kb -L 1 -U 120
~~~


Takes about <mark>4 min</mark>.
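
For orientation, FitHiC reads two tables: a fragments file (chromosome, an extra field, fragment midpoint, marginalized contact count, mappability flag) and an interactions file (chromosome 1, midpoint 1, chromosome 2, midpoint 2, contact count). The sketch below builds both from a small dense intra-chromosomal matrix. It assumes that `-L` and `-U` are limits on the distance in bins and that pFitHiC uses the same column layout as the original FitHiC; `fithic_tables` is a hypothetical helper, not the code in `input_pfithic.py`.

```python
def fithic_tables(matrix, chrom, resolution, lower=1, upper=110):
    """Build FitHiC-style fragment and interaction rows from a dense square matrix."""
    n = len(matrix)
    # Midpoint of each bin in base pairs.
    mids = [i * resolution + resolution // 2 for i in range(n)]

    fragments = []
    for i in range(n):
        marginal = sum(matrix[i])
        # Treat bins without any contact as unmappable (an assumption).
        mappable = 1 if marginal > 0 else 0
        fragments.append((chrom, 0, mids[i], marginal, mappable))

    interactions = []
    for i in range(n):
        # Upper triangle only, restricted to bin distances in [lower, upper].
        for j in range(i + lower, min(n, i + upper + 1)):
            if matrix[i][j] > 0:
                interactions.append((chrom, mids[i], chrom, mids[j], matrix[i][j]))
    return fragments, interactions
```

Each tuple corresponds to one tab-separated line of the respective input file.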

### 2. Run FitHiC

In [6]:
%run -m pfithic.runner -h

usage: runner.py -i INTERSFILE -f FRAGSFILE -o OUTDIR -r RESOLUTION
                 [-L DISTLOWTHRES] [-U DISTUPTHRES] [-b NOOFBINS]
                 [-p NOOFPASSES] [-m MAPPABILITYTHRESHOLD] [-l LIBNAME] [-log]
                 [-x {intraOnly,interOnly,All}] [-t BIASFILE]
                 [-tL BIASLOWERBOUND] [-tU BIASUPPERBOUND] [--help]
                 [--version]

A pandas based FitHiC runner

optional arguments:
  --help, -h            Print this help message and exit
  --version             show program's version number and exit

Required Arguments:
  -i INTERSFILE, --interactions INTERSFILE
                        REQUIRED: interactions between fragment pairs are read
                        from INTERSFILE
  -f FRAGSFILE, --fragments FRAGSFILE
                        REQUIRED: midpoints (or start indices) of the
                        fragments are read from FRAGSFILE
  -o OUTDIR, --outdir OUTDIR
                        REQUIRED: where the output files will be written
  -r R

#### >>> Run (wrapper script)

~~~bash
$ ./run_pfithic.sh GM12878 40kb
~~~

The script contains additional parameters that must be adjusted manually.

Takes about <mark>10 min</mark>.
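
Since `run_pfithic.sh` itself is not shown here, the following is only a guess at the kind of call such a wrapper might make. It assembles a `python -m pfithic.runner` command from the required options listed in the help text above (`-i`, `-f`, `-o`, `-r`) plus the optional `-L` and `-U`. The helper name `build_pfithic_command`, all file paths, and the resolution value are placeholders.

```python
import sys


def build_pfithic_command(inters_file, frags_file, out_dir, resolution, lower=None, upper=None):
    """Assemble the argument list for one pfithic.runner call (nothing is executed here)."""
    cmd = [
        sys.executable, "-m", "pfithic.runner",
        "-i", inters_file,
        "-f", frags_file,
        "-o", out_dir,
        "-r", str(resolution),
    ]
    # Distance thresholds are optional; add them only when given.
    if lower is not None:
        cmd += ["-L", str(lower)]
    if upper is not None:
        cmd += ["-U", str(upper)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the real input paths are known.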