# Analysis pipeline for simple and twisted interactions

## Data preparation

### Datasets

We analyzed datasets from the following publications:

1. [Nora et al. 2017](https://www.ncbi.nlm.nih.gov/pubmed/28525758), mouse embryonic stem cells, Hi-C, HindIII
2. [Mifsud et al. 2015](https://www.ncbi.nlm.nih.gov/pubmed/25938943), human GM12878 and CD34+ blood cells, promoter capture Hi-C, HindIII
3. [Schoenfelder et al. 2015](https://www.ncbi.nlm.nih.gov/pubmed/25752748), mouse embryonic stem cells, promoter capture Hi-C, HindIII
4. [Chesi et al. 2019](https://www.ncbi.nlm.nih.gov/pubmed/30890710), human BMP2-induced osteoblasts and liver carcinoma HepG2 cells, Capture-C, DpnII

Paired-end reads were downloaded from the European Nucleotide Archive ([ENA](https://www.ebi.ac.uk/ena)) using the command-line utility ```wget```. Paired-end FASTQ files from ENA are in sync and can be processed directly.

### Truncation, mapping and counting of read pairs with Diachromatic

We used our Java application ```Diachromatic``` to derive read pair counts for interacting pairs of restriction digests. Source code and binaries are available on [GitHub](https://github.com/TheJacksonLaboratory/diachromatic) and documentation on [Read the Docs](https://diachromatic.readthedocs.io/en/latest/). Furthermore, a [recently published review article](https://www.mdpi.com/2073-4425/10/7/548) gives some application examples of ```Diachromatic```. For each dataset, we chose the appropriate genome build and restriction enzyme and otherwise used default parameters, with one exception: for the data of Chesi et al., we used the ```--sticky-end``` option for truncation because no fill-in of sticky ends is performed for Capture-C.

```Diachromatic``` expects a *digest map* for the corresponding genome build and restriction enzyme as input. A *digest map* is a text file in which each line corresponds to one restriction digest in the genome. The following line gives an example.
```
chr1    15172   15749   14      DpnII   DpnII   578     0.272   0.272   0.157   0.000   F       0       0
```
The first three columns contain the coordinates of a DpnII restriction digest on chromosome 1. We used our Java application [GOPHER](https://www.ncbi.nlm.nih.gov/pubmed/30642251/) to prepare *digest maps* for mouse and human with DpnII or HindIII.

GOPHER flags digests that were selected for target enrichment as *selected* or *active*. Column 12 contains a T (TRUE) for such digests and an F (FALSE) for all others. ```Diachromatic``` uses this information to derive a quality metric that reflects the efficiency of target enrichment. Furthermore, the states (inactive/active) of digests are passed through to the reported interactions. For the promoter capture Hi-C and Capture-C datasets, we used the preset option *x* to flag all digests that overlap a transcription start site (TSS) of a protein-coding gene as *active*. For the CTCF depletion data, we manually prepared a digest map in which all digests that overlap a suspected TAD boundary that disappeared upon CTCF depletion are selected.
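
The following minimal sketch shows how a digest map line could be read in Python. It only interprets the columns described above (coordinates in columns 1 to 3 and the T/F flag in column 12); the names `Digest` and `parse_digest_line` are hypothetical and not part of GOPHER or Diachromatic, and the remaining columns are deliberately left uninterpreted.

```python
from typing import NamedTuple


class Digest(NamedTuple):
    """Subset of a digest map line (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    active: bool


def parse_digest_line(line: str) -> Digest:
    """Parse the coordinates and the selection flag of one digest map line."""
    fields = line.split()  # tolerate tabs as well as runs of spaces
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    flag = fields[11]  # column 12: T (selected/active) or F
    if flag not in ("T", "F"):
        raise ValueError(f"unexpected flag in column 12: {flag!r}")
    return Digest(fields[0], int(fields[1]), int(fields[2]), flag == "T")


# Example line taken from the text above.
example = ("chr1    15172   15749   14      DpnII   DpnII   578     "
           "0.272   0.272   0.157   0.000   F       0       0")
digest = parse_digest_line(example)
print(digest)
```

In this example, `end - start + 1` equals the value in column 7 (578). Reading column 7 as the digest length is an assumption based on this single line.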

### Diachromatic interaction files

All analyses presented here are based on Diachromatic interaction files in which each line represents one interaction. For instance, the line
```
chr5	156958482	156963927	A	chr5	157097590	157104795	I	0:2
```
represents an interaction between two restriction digests on chromosome 5. The first digest is flagged as active (A) and the second digest as inactive (I). The last column contains the number of simple and twisted read pairs for the given interaction separated by a colon. For the example above, there were no simple and two twisted read pairs.
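
As an illustration of this format, the sketch below splits an interaction line into its nine fields and separates the simple and twisted counts. The names `Interaction` and `parse_interaction_line` are hypothetical helpers introduced for this example and are not taken from ```diachrscripts```.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One line of a Diachromatic interaction file (hypothetical helper type)."""
    chrom_a: str
    start_a: int
    end_a: int
    state_a: str  # "A" (active) or "I" (inactive)
    chrom_b: str
    start_b: int
    end_b: int
    state_b: str
    simple: int   # number of simple read pairs
    twisted: int  # number of twisted read pairs


def parse_interaction_line(line: str) -> Interaction:
    """Parse one interaction line with nine whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    simple, twisted = fields[8].split(":")  # last column is simple:twisted
    return Interaction(
        fields[0], int(fields[1]), int(fields[2]), fields[3],
        fields[4], int(fields[5]), int(fields[6]), fields[7],
        int(simple), int(twisted),
    )


# Example line taken from the text above.
interaction = parse_interaction_line(
    "chr5\t156958482\t156963927\tA\tchr5\t157097590\t157104795\tI\t0:2"
)
print(interaction.simple, interaction.twisted)
```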

Interactions with only one read pair are not informative when analyzing shifts between simple and twisted read pairs. Therefore, all analyses presented here were performed only on interactions with more than one read pair (```gt1``` in the file names stands for "greater than 1").
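
The filter described here can be sketched as follows, assuming that "more than one read pair" refers to the sum of simple and twisted read pairs. The function names are hypothetical, and the third example line is made up for illustration; the first two are taken from the Mifsud et al. data.

```python
def read_pair_total(line: str) -> int:
    """Return simple + twisted read pairs from the last column (simple:twisted)."""
    counts = line.split()[-1]
    simple, twisted = (int(value) for value in counts.split(":"))
    return simple + twisted


def keep_gt1(lines):
    """Keep only interactions supported by more than one read pair."""
    return [line for line in lines if read_pair_total(line) > 1]


lines = [
    "chr5\t169261149\t169264959\tI\tchr5\t169981022\t169985681\tI\t0:2",
    "chr3\t123863281\t123867648\tI\tchr3\t123990381\t123993517\tA\t1:4",
    # Hypothetical interaction with a single read pair (not from the data).
    "chr1\t1000\t2000\tI\tchr1\t5000\t6000\tA\t1:0",
]
kept = keep_gt1(lines)
print(len(kept))
```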

Precalculated files can be downloaded from [Owncloud](https://owncloud-ext.charite.de/owncloud/index.php/s/PpjXgvfj9f6vuLL). In order to perform the analyses below, download the data from Owncloud using a web browser and save them to the directory ```diachrscripts/data/```.
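
Once the files are in place, the first lines of a gzip-compressed interaction file can be inspected directly from Python. This is a minimal sketch; `head_gz` is a hypothetical helper name, and only the standard library is used.

```python
import gzip
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `head_gz("data/mifsud_2015_hg38/gt1_interactions/MIFSUD_R10.interaction.counts.table.gt1.tsv.gz")` would return the first ten interactions of that file, provided it has been downloaded to this location.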

### UCSC refGene.txt.gz file

The RefSeq annotations of TSS for ```mm10``` and ```hg38``` were taken from UCSC's ```refGene.txt.gz``` file. This file is used for various analysis steps including:

1. Preparation of a *digest map* as input for Diachromatic using GOPHER
2. TSS strand analysis
3. Expression analysis

It is important to use the same version of the ```refGene.txt.gz``` file for all analysis steps in order to avoid inconsistencies due to changes between versions.
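
One way to check that two analysis steps use the same file version is to compare checksums. The sketch below is a suggestion rather than part of the original pipeline; `file_sha256` is a hypothetical helper name.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the value of `file_sha256("refGene.txt.gz")` next to the results of each step makes it easy to detect later that a different download of the file was used.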


### GTF input file for Tophat/Cufflinks

The GTF files that were used as input for Tophat and Cuffdiff were derived from the same ```refGene.txt.gz``` file that was used for the preparation of the *digest map*, as follows:

```shell
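# Decompress the UCSC table
gzip -d refGene.txt.gz
# Drop the first column (bin) so that the remaining columns are in genePred format
cut -f 2- refGene.txt > refGene.input
# Convert genePred to GTF
genePredToGtf file refGene.input hg38refGene.gtf
# Sort by chromosome, then numerically by start coordinate
sort -k1,1 -k4,4n hg38refGene.gtf > hg38refGene.gtf.sorted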
```

In [11]:
!gunzip -c data/mifsud_2015_hg38/gt1_interactions/MIFSUD_R10.interaction.counts.table.gt1.tsv.gz | head -n 10

chr5	169261149	169264959	I	chr5	169981022	169985681	I	0:2
chr15	27414821	27423378	I	chr10	75079891	75084888	I	2:0
chr12	69208776	69210256	I	chr12	69224169	69241528	A	0:4
chr5	156958482	156963927	A	chr5	157097590	157104795	I	0:2
chr2	26344841	26357639	A	chr2	26492107	26495850	I	2:0
chr3	123863281	123867648	I	chr3	123990381	123993517	A	1:4
chr15	51337430	51339515	A	chr13	98945157	98947376	I	0:2
chr9	77597547	77602192	I	chr9	77632643	77638540	I	0:2
chr22	31517931	31518878	I	chr22	31749092	31754887	A	2:0
chr5	37292844	37306395	I	chr5	37325948	37332645	I	0:2


## k-interaction analysis

Use the script ```diachrscripts/analyze_k_interactions_script.py```.

## TSS strand analysis

Use the script ```diachrscripts/analyze_tss_strand_distribution_script.py```.

## Expression analysis

Use the script ```diachrscripts/analyze_expression_levels_script.py```.
