In [1]:
import pyfaidx

# Preparation of input files

The analysis of Hi-C and capture Hi-C (CHi-C) data is complex. A tutorial for the preparatory analysis with `Diachromatic` can be found in the Read the Docs documentation for this repository. Here we provide intermediate files for the different steps of the analysis with `diachrscripts`.

## Diachromatic output files

`Diachromatic` generates tab-delimited interaction files in which the read pair counts are reported separately according to their relative paired-end orientation. `diachrscripts` is a collection of Python modules, scripts and Jupyter notebooks that can be used to examine interactions for imbalances in the four read pair counts. We have prepared `Diachromatic` output files for publicly available data from three biological replicates ([Mifsud et al. (2015)](https://pubmed.ncbi.nlm.nih.gov/25938943/)) and made them available for download.

In [None]:
# Robin -> Link

## Pooled Diachromatic interaction files

The script `pooler.py` can be used to pool the read pair counts of `Diachromatic` interactions occurring in different replicates. How this script works is documented here: [jupyter_notebooks/usage/usage_of_pooler.ipynb](https://github.com/TheJacksonLaboratory/diachrscripts/blob/master/jupyter_notebooks/usage/usage_of_pooler.ipynb).

We pool interactions by discarding those occurring in only one biological replicate and separately summing up the four counts for the remaining, overlapping interactions. Because `Diachromatic` reports all interactions with at least one supporting read pair, this step can be very memory-intensive. We have prepared a `Diachromatic` interaction file with pooled counts from the files for the three biological replicates and made it available for download.
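The pooling rule described above can be sketched in a few lines of Python. This is only an illustration and not the implementation in `pooler.py`: the function name `pool_interactions`, the `min_replicates` parameter, the interaction keys and the toy counts are all hypothetical, and the sketch assumes that "occurring in more than one replicate" means an interaction is kept if it is seen in at least two replicates.

```python
from collections import defaultdict


def pool_interactions(replicates, min_replicates=2):
    """Pool the four read pair counts of interactions across replicates.

    replicates: list of dicts, one per replicate, each mapping a
        (hypothetical) interaction key to a tuple of four counts.
    min_replicates: an interaction is kept only if it occurs in at
        least this many replicates (assumption: 2).
    """
    summed = defaultdict(lambda: [0, 0, 0, 0])  # element-wise count sums
    n_seen = defaultdict(int)                   # number of replicates per key
    for rep in replicates:
        for key, counts in rep.items():
            n_seen[key] += 1
            for i, c in enumerate(counts):
                summed[key][i] += c
    # Discard interactions seen in fewer than min_replicates replicates
    return {k: tuple(v) for k, v in summed.items() if n_seen[k] >= min_replicates}


# Toy data: keys are (chrom, start, end, chrom, start, end), values are made up
a = ('chr1', 100, 200, 'chr1', 900, 1000)
b = ('chr2', 100, 200, 'chr2', 900, 1000)
c = ('chr3', 100, 200, 'chr3', 900, 1000)
rep1 = {a: (3, 0, 1, 2), b: (1, 0, 0, 0)}
rep2 = {a: (2, 1, 0, 0)}
rep3 = {a: (1, 1, 1, 1), c: (0, 2, 0, 0)}

pooled = pool_interactions([rep1, rep2, rep3])
print(pooled)
```

In this toy example only interaction `a` is retained, because `b` and `c` each occur in a single replicate. The sketch also shows why the step is memory-hungry: every interaction from every replicate has to be held in memory until all files have been read.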

In [None]:
# Robin -> Link

## Diachromatic11 files with classified and scored interactions

The script `UICer.py` (Unbalanced Interaction Caller) can be used to score interactions from `Diachromatic` interaction files and to classify them as balanced or unbalanced with respect to the four read pair counts.

In [None]:
!mkdir -p ../../UICer_interactions/CHC
!wget -O ../../UICer_interactions/CHC/MIF_GM12878_CHC_REPC_evaluated_and_categorized_interactions.tsv.gz https://www.genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REPC_evaluated_and_categorized_interactions.tsv.gz

## Reference sequence for GC and repeat content analysis

In [31]:
GENOME = 'hg38'  # Genome build: 'hg38' or 'mm10'
REF_DIR = '../additional_files/reference_sequence/'
G_LINK = 'http://hgdownload.soe.ucsc.edu/goldenPath/' + GENOME + '/bigZips/' + GENOME + '.fa.gz'

In [30]:
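# Download the reference FASTA from UCSC and decompress it
!mkdir -p $REF_DIR
!wget -O $REF_DIR$GENOME'_genome.fa.gz' $G_LINK
!gunzip $REF_DIR$GENOME'_genome.fa.gz'
# Build the FASTA index (.fai) needed for random access to the sequence
pyfaidx.Faidx(REF_DIR + GENOME + '_genome.fa')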

--2023-01-31 07:32:46--  http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 870703359 (830M) [application/x-gzip]
Saving to: ‘../additional_files/reference_sequence/mm10_genome.fa.gz’


2023-01-31 07:35:18 (5.55 MB/s) - ‘../additional_files/reference_sequence/mm10_genome.fa.gz’ saved [870703359/870703359]



Faidx("../additional_files/reference_sequence/mm10_genome.fa")
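
Once the reference is indexed, GC and repeat content can be computed for any sequence fetched from it. The following is a minimal sketch and not the code used in `diachrscripts`: the function name `gc_and_repeat_content` is hypothetical, it assumes that repeats are soft-masked (lowercase bases), as in the UCSC `bigZips` FASTA files, and it counts `N` bases in the denominator, which is one of several possible conventions.

```python
def gc_and_repeat_content(seq):
    """Return (GC fraction, soft-masked fraction) of a DNA sequence string.

    Assumptions: lowercase letters mark repeat-masked bases, and all
    characters (including N) count towards the sequence length.
    """
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    gc = sum(1 for base in seq if base in 'GCgc')    # G/C in either case
    masked = sum(1 for base in seq if base.islower())  # soft-masked bases
    return gc / n, masked / n


# Toy sequence: 4 of 10 bases are G/C, 4 of 10 are lowercase
gc_frac, repeat_frac = gc_and_repeat_content('ACGTacgtNN')
print(gc_frac, repeat_frac)
```

To apply this to the downloaded genome, a sequence string can be obtained with, for example, `pyfaidx.Fasta(REF_DIR + GENOME + '_genome.fa')['chr1'][0:1000].seq` and passed to the function; the chromosome name and coordinates here are placeholders.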