In [1]:
import os
import pyfaidx

# Get started with `diachrscripts`

Analyzing Hi-C and capture Hi-C (CHi-C) data involves many steps. A tutorial for the preparatory analysis with `Diachromatic` can be found in the Read the Docs documentation for this repository. Here we provide intermediate files for the different steps of the analysis with `diachrscripts`.

## Diachromatic output files

`Diachromatic` generates tab-delimited interaction files in which the read pair counts are reported separately according to their relative paired-end orientation. `diachrscripts` is a collection of Python modules, scripts and Jupyter notebooks that can be used to examine interactions for imbalances in the four read pair counts. We have prepared `Diachromatic` output files for publicly available data from three biological replicates in GM12878 cells ([Mifsud et al. (2015)](https://pubmed.ncbi.nlm.nih.gov/25938943/)) and made them available for download.

In [2]:
os.makedirs('../Diachromatic_interactions/gzdir', exist_ok=True)
!wget -O ../Diachromatic_interactions/gzdir/MIF_GM12878_CHC_REP1.interaction.counts.table.clr_200000.tsv.gz https://genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REP1.interaction.counts.table.clr_200000.tsv.gz
!wget -O ../Diachromatic_interactions/gzdir/MIF_GM12878_CHC_REP2.interaction.counts.table.clr_200000.tsv.gz https://genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REP2.interaction.counts.table.clr_200000.tsv.gz
!wget -O ../Diachromatic_interactions/gzdir/MIF_GM12878_CHC_REP3.interaction.counts.table.clr_200000.tsv.gz https://genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REP3.interaction.counts.table.clr_200000.tsv.gz

--2023-03-22 10:09:24--  https://genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REP1.interaction.counts.table.clr_200000.tsv.gz
Resolving genecascade.org (genecascade.org)... 193.175.174.14
Connecting to genecascade.org (genecascade.org)|193.175.174.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 217508602 (207M) [application/x-gzip]
Saving to: ‘../Diachromatic_interactions/gzdir/MIF_GM12878_CHC_REP1.interaction.counts.table.clr_200000.tsv.gz’


2023-03-22 10:09:56 (6.93 MB/s) - ‘../Diachromatic_interactions/gzdir/MIF_GM12878_CHC_REP1.interaction.counts.table.clr_200000.tsv.gz’ saved [217508602/217508602]

--2023-03-22 10:09:57--  https://genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REP2.interaction.counts.table.clr_200000.tsv.gz
Resolving genecascade.org (genecascade.org)... 193.175.174.14
Connecting to genecascade.org (genecascade.org)|193.175.174.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1660904402 (

## Pooled Diachromatic interaction file

The script `pooler.py` can be used to pool the read pair counts of `Diachromatic` interactions occurring in different replicates. How this script works is documented here: [jupyter_notebooks/usage/usage_of_pooler.ipynb](usage/usage_of_pooler.ipynb). We pool interactions by discarding those occurring in only one biological replicate and separately summing up the four counts for the remaining, overlapping interactions.
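
The pooling rule described above can be illustrated with a minimal sketch. This is not the implementation used in `pooler.py`; the function name `pool_interactions`, the `min_replicates` parameter, and the toy interaction keys are assumptions made for illustration only. Each replicate is represented as a dictionary that maps an interaction (the coordinates of its two fragments) to its four read pair counts.

```python
from collections import defaultdict


def pool_interactions(replicates, min_replicates=2):
    """Sum the four read pair counts of interactions across replicates.

    replicates: list of dicts mapping an interaction key to a tuple of
    four read pair counts. Interactions seen in fewer than
    `min_replicates` replicates are discarded (hypothetical parameter).
    """
    summed = defaultdict(lambda: [0, 0, 0, 0])
    seen_in = defaultdict(int)
    for rep in replicates:
        for key, counts in rep.items():
            seen_in[key] += 1
            for i, count in enumerate(counts):
                summed[key][i] += count
    # Keep only interactions that occur in enough replicates
    return {
        key: tuple(summed[key])
        for key in summed
        if seen_in[key] >= min_replicates
    }


# Toy data: keys are (chrA, staA, endA, chrB, staB, endB)
shared = ("chr1", 100, 200, "chr1", 900, 1000)
rep1 = {shared: (3, 0, 1, 2), ("chr1", 100, 200, "chr1", 5000, 5100): (1, 0, 0, 0)}
rep2 = {shared: (2, 1, 0, 4)}
rep3 = {shared: (0, 0, 1, 1), ("chr2", 10, 20, "chr2", 300, 400): (5, 5, 5, 5)}

pooled = pool_interactions([rep1, rep2, rep3])
print(pooled)
```

In this toy example only the interaction shared by all three replicates survives, and its four counts are summed position by position. Holding every interaction of every replicate in memory at once, as this sketch does, is also the reason why pooling large files needs a lot of memory.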

Because `Diachromatic` reports all interactions with at least one supporting read pair, pooling interaction files can be extremely memory intensive. For instance, pooling two interaction files with `39,000,000` and `48,000,000` interactions required `44 GB` of memory. Therefore, we provide a pooled interaction file for download that was created with `pooler.py` using the three files downloaded in the cell above as input.

In [3]:
os.makedirs('../Diachromatic_interactions', exist_ok=True)
!wget -O ../Diachromatic_interactions/MIF_GM12878_CHC_REPC_at_least_2_combined_interactions.tsv.gz https://genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REPC_at_least_2_combined_interactions.tsv.gz

--2023-03-22 10:16:32--  https://genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REPC_at_least_2_combined_interactions.tsv.gz
Resolving genecascade.org (genecascade.org)... 193.175.174.14
Connecting to genecascade.org (genecascade.org)|193.175.174.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 263110784 (251M) [application/x-gzip]
Saving to: ‘../Diachromatic_interactions/MIF_GM12878_CHC_REPC_at_least_2_combined_interactions.tsv.gz’


2023-03-22 10:17:11 (6.93 MB/s) - ‘../Diachromatic_interactions/MIF_GM12878_CHC_REPC_at_least_2_combined_interactions.tsv.gz’ saved [263110784/263110784]



## Diachromatic11 file with scored and classified interactions

The script `UICer.py` (Unbalanced Interaction Caller) can be used to classify interactions from `Diachromatic` interaction files as balanced or unbalanced. The result is a Diachromatic11 file with two additional columns for interaction categories and imbalance scores. In addition, the script selects two equally sized reference sets of balanced and unbalanced interactions that have nearly identical distributions of total read pair counts per interaction. The interaction categories are:

* `UX`: Unbalanced interaction for which no balanced reference interaction with identical total read pair count could be selected
* `UR`: Unbalanced interaction for which a balanced reference interaction with identical total read pair count could be selected
* `BR`: Balanced interaction selected as reference interaction
* `BX`: Balanced interaction not selected as reference interaction
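
To make the four categories more concrete, here is a minimal sketch of one way such reference sets could be selected. It is not the procedure implemented in `UICer.py`; the function `assign_categories`, its input format, and the random choice among candidates are assumptions made for illustration. For every total read pair count, the sketch selects the same number of balanced and unbalanced interactions, so that the two reference sets have the same size and matching distributions of total counts.

```python
import random
from collections import Counter, defaultdict


def assign_categories(interactions, seed=0):
    """Assign UR, UX, BR or BX to each interaction.

    interactions: list of (interaction_id, total_read_pairs, is_unbalanced).
    Returns a dict mapping interaction_id to its category.
    """
    rng = random.Random(seed)
    # Group interaction IDs by total read pair count and balance state
    by_total = defaultdict(lambda: {"U": [], "B": []})
    for iid, total, is_unbalanced in interactions:
        by_total[total]["U" if is_unbalanced else "B"].append(iid)

    categories = {}
    for groups in by_total.values():
        # As many reference interactions as the smaller group allows
        n_ref = min(len(groups["U"]), len(groups["B"]))
        for label in ("U", "B"):
            ids = list(groups[label])
            rng.shuffle(ids)  # assumption: references are chosen at random
            for k, iid in enumerate(ids):
                categories[iid] = label + ("R" if k < n_ref else "X")
    return categories


# Toy data: (ID, total read pair count, unbalanced?)
toy = [
    ("i1", 10, True), ("i2", 10, True), ("i3", 10, False),
    ("i4", 12, False), ("i5", 12, False), ("i6", 12, True),
    ("i7", 20, True),
]
cats = assign_categories(toy)
print(Counter(cats.values()))
```

In the toy data, `i7` has no balanced counterpart with the same total count and therefore ends up in `UX`, while the numbers of `UR` and `BR` interactions are equal by construction.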

How to use the `UICer.py` script is documented here:
[jupyter_notebooks/usage/usage_of_UICer.ipynb](usage/usage_of_UICer.ipynb).

If `UICer.py` is called without specifying a classification threshold, then a randomization procedure is invoked to determine a classification threshold that will keep the FDR below 5%. Depending on the size of the input and the number of iterations, this can be quite computationally intensive. For example, for a dataset with `2,327,311` interactions, it took about two and a half hours to perform `1000` iterations in `10` parallel threads. Therefore, we provide a Diachromatic11 interaction file for download that was created with `UICer.py` using the file downloaded in the cell above as input.

In [4]:
os.makedirs('../UICer_interactions/CHC', exist_ok=True)
!wget -O ../UICer_interactions/CHC/MIF_GM12878_CHC_REPC_evaluated_and_categorized_interactions.tsv.gz https://www.genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REPC_evaluated_and_categorized_interactions.tsv.gz

--2023-03-22 10:17:11--  https://www.genecascade.org/downloads/diachrscripts/MIF_GM12878_CHC_REPC_evaluated_and_categorized_interactions.tsv.gz
Resolving www.genecascade.org (www.genecascade.org)... 193.175.174.14
Connecting to www.genecascade.org (www.genecascade.org)|193.175.174.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49761063 (47M) [application/x-gzip]
Saving to: ‘../UICer_interactions/CHC/MIF_GM12878_CHC_REPC_evaluated_and_categorized_interactions.tsv.gz’


2023-03-22 10:17:18 (6.89 MB/s) - ‘../UICer_interactions/CHC/MIF_GM12878_CHC_REPC_evaluated_and_categorized_interactions.tsv.gz’ saved [49761063/49761063]



## Jupyter notebooks for various analyses

All further analyses in `diachrscripts` require a Diachromatic11 interaction file as input and are performed in the Jupyter notebooks listed below.

### Frequencies of interaction configurations

In this analysis, the overall frequencies of the ten configurations of interactions are determined. This analysis can be performed here:
[analysis/frequencies_of_interaction_configurations.ipynb](analysis/frequencies_of_interaction_configurations.ipynb).

### Visualization of configurations at baited fragments

We visualize interactions at baited restriction fragments in a way similar to the triangular heatmaps typically used for Hi-C data: as rectangles along the genomic axis whose edge lengths correspond to the lengths of the two associated fragments and whose colors correspond to the configurations of the interactions. Such visualizations can be created here:
[analysis/visualization_of_configurations.ipynb](analysis/visualization_of_configurations.ipynb).

### Classification of baited fragments

Depending on whether a baited fragment has predominantly paired-end orientations associated with the 5' end, the 3' end, or both ends of the fragment, we classify it as BFC0, BFC1, or BFC2. The subdivision of all baited fragments into BFC0, BFC1 and BFC2 can be done here:
[analysis/baited_fragment_classification.ipynb](analysis/baited_fragment_classification.ipynb).

### Bait analysis

In this analysis, the correlation between the subdivision of the baited fragments into BFC0, BFC1 and BFC2 and the baits actually used for the underlying CHi-C experiment is investigated. Furthermore, the baits are examined in terms of distance to their restriction sites and GC as well as repeat content. These analyses can be performed here:
[analysis/bait_analysis.ipynb](analysis/bait_analysis.ipynb).

Note that this analysis requires that the baited fragments have already been classified into BFC0, BFC1, and BFC2. In addition, a corresponding reference sequence is required, which can be downloaded at the bottom of this notebook.


### Unbaited fragment analysis

In this notebook, we analyze unbaited fragments in terms of length, GC and repeat content. This analysis can be performed here:
[analysis/unbaited_fragment_analysis.ipynb](analysis/unbaited_fragment_analysis.ipynb).
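
As a rough illustration of what GC and repeat content mean for a fragment sequence, consider the following minimal sketch. It is not the code used in the analysis notebooks; the function `gc_and_repeat_content` is hypothetical. It assumes a soft-masked reference sequence, as distributed by UCSC, in which repeats are written in lowercase letters, and it counts every character (including `N`) in the denominator.

```python
def gc_and_repeat_content(sequence):
    """Return (GC fraction, repeat fraction) of a soft-masked sequence.

    Assumption: lowercase letters mark repeat-masked bases.
    """
    n = len(sequence)
    if n == 0:
        return 0.0, 0.0
    gc_count = sum(1 for base in sequence if base in "GCgc")
    repeat_count = sum(1 for base in sequence if base.islower())
    return gc_count / n, repeat_count / n


# Toy sequence: 12 bases, 8 of them G or C, 4 of them lowercase
gc, rep = gc_and_repeat_content("ACGTacgtGGCC")
print(f"GC content: {gc:.2f}, repeat content: {rep:.2f}")
```

With an indexed reference genome, the sequence of a fragment could be retrieved with `pyfaidx`, for example via `pyfaidx.Fasta(path)['chr1'][start:end].seq` (0-based, end-exclusive coordinates), and passed to this function.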

### Distance-dependent contact frequencies

In this analysis, we use distance-dependent contact frequencies to investigate how imbalances in the four read pair counts of interactions affect their total read pair counts. This analysis can be performed here:
[analysis/distance_dependent_contact_frequencies.ipynb](analysis/distance_dependent_contact_frequencies.ipynb).

## Reference sequence for GC and repeat content analysis

The analysis of the GC and repeat content of baits requires the sequence of a reference genome and a corresponding FASTA index. Reference genomes can be downloaded and indexed as follows.

In [5]:
GENOME = 'hg38' # hg38, hg19, mm10, ...
REF_DIR = '../additional_files/reference_sequence/'
GENOME_URL = 'http://hgdownload.soe.ucsc.edu/goldenPath/' + GENOME + '/bigZips/' + GENOME + '.fa.gz'

In [6]:
os.makedirs(REF_DIR, exist_ok=True)
!wget -O $REF_DIR$GENOME'_genome.fa.gz' $GENOME_URL
!gunzip -f $REF_DIR$GENOME'_genome.fa.gz'  
pyfaidx.Faidx(REF_DIR + GENOME + '_genome.fa')

--2023-03-22 10:17:18--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 983659424 (938M) [application/x-gzip]
Saving to: ‘../additional_files/reference_sequence/hg38_genome.fa.gz’


2023-03-22 10:20:13 (5.39 MB/s) - ‘../additional_files/reference_sequence/hg38_genome.fa.gz’ saved [983659424/983659424]



Faidx("../additional_files/reference_sequence/hg38_genome.fa")