
## Methods

In this project, we formulate two objectives:

**A**: Reproduce the Hi-C interaction maps and eigendecomposition from ([Wang et al. 2019](#ref-wang_reprogramming_2019)), with some modifications. We briefly use *HiCExplorer*, but switch the analyses to the *Open2C Ecosystem* ([Open Chromosome Collective 2024](#ref-open2c)), which has a Python API as well as command-line functions and pairs well with Jupyter Notebooks. The majority of the data analysis was run with a *gwf* workflow, and the steps whose output was visually inspected were run in Jupyter Notebooks.

**B**: Compare with regions of selection that are found in *Papio anubis*, and possibly in human too. Investigate the biological meaning of the results.

All computations were performed on GenomeDK (GDK) \[ref\], an HPC cluster located at Aarhus University, and most of the data processing was organized in a *gwf* workflow \[ref\], a workflow manager developed at GDK. I would like to thank GDK and Aarhus University for providing computational resources and support that contributed to these research results.

The whole project is carried out with reproducibility in mind, so considerable effort and time have been put into documenting code and organizing the project for readability and transparency through a Quarto project \[ref\]. All code, virtual environments and text are therefore made available as a Quarto book, rendered directly from the GitHub repository with GitHub Pages \[\]. To make this possible, the Quarto documentation has been extensively studied and discussed with *KMT* \[ref, acknowledge\].

### Downloading Data and Project Structure

To reproduce the results from ([Wang et al. 2019](#ref-wang_reprogramming_2019)), I chose to use their raw data directly from the SRA portal \[ref\]. The data set also contains RNA-seq data and covers the same tissues for both macaque and mouse, so I filtered it to keep all paired-end Hi-C reads from macaque samples only. The metadata for the data set was extracted into a run table, `SRA-runtable.tsv`. To give an overview of the data used in this analysis, the run table with accession numbers and some metadata for each sample is summarized in ([**tbl-runtable?**](#ref-tbl-runtable)). The data add up to ~1 TB of compressed `fastq` files holding ~9.5 billion reads, roughly evenly spread across the 5 tissue types.

``` python
display(df1)
```


       source_name              BioSample      Run          GB          Bases             Reads
  ---- ------------------------ -------------- ------------ ----------- ----------------- -------------
  16   fibroblast               SAMN08375237   SRR6502335   29.771059   73,201,141,800    244,003,806
  17   fibroblast               SAMN08375237   SRR6502336   22.755361   65,119,970,100    217,066,567
  18   fibroblast               SAMN08375236   SRR6502337   21.434722   52,769,196,300    175,897,321
  19   fibroblast               SAMN08375236   SRR6502338   21.420030   52,378,949,100    174,596,497
  20   fibroblast               SAMN08375236   SRR6502339   10.207410   28,885,941,600    96,286,472
  9    fibroblast               SAMN08375237   SRR7349189   52.729173   139,604,854,200   465,349,514
  10   fibroblast               SAMN08375236   SRR7349190   53.085520   142,008,353,400   473,361,178
  21   pachytene spermatocyte   SAMN08375234   SRR6502342   60.258880   150,370,993,500   501,236,645
  22   pachytene spermatocyte   SAMN08375234   SRR6502344   27.146048   65,697,684,300    218,992,281
  23   pachytene spermatocyte   SAMN08375234   SRR6502345   26.202707   63,490,538,700    211,635,129
  0    pachytene spermatocyte   SAMN09427370   SRR7345458   55.970557   153,281,577,900   510,938,593
  1    pachytene spermatocyte   SAMN09427370   SRR7345459   53.982492   144,993,841,200   483,312,804
  11   pachytene spermatocyte   SAMN08375235   SRR7349191   51.274476   137,821,979,100   459,406,597
  24   round spermatid          SAMN08375232   SRR6502351   20.924497   55,095,075,300    183,650,251
  25   round spermatid          SAMN08375232   SRR6502352   41.133960   115,578,475,800   385,261,586
  26   round spermatid          SAMN08375232   SRR6502353   36.444117   96,195,161,400    320,650,538
  2    round spermatid          SAMN09427369   SRR7345460   38.244654   104,105,827,200   347,019,424
  3    round spermatid          SAMN09427369   SRR7345461   53.996261   144,532,309,500   481,774,365
  12   round spermatid          SAMN08375232   SRR7349192   52.384556   140,431,608,000   468,105,360
  29   sperm                    SAMN08375229   SRR6502360   26.653940   64,752,370,800    215,841,236
  30   sperm                    SAMN08375228   SRR6502362   23.973440   58,369,232,700    194,564,109
  13   sperm                    SAMN08375229   SRR7349193   52.806276   141,148,572,300   470,495,241
  14   sperm                    SAMN08375229   SRR7349195   22.444378   60,523,788,600    201,745,962
  15   sperm                    SAMN08375229   SRR7349196   38.253606   104,119,671,000   347,065,570
  27   spermatogonia            SAMN08375231   SRR6502356   22.845286   58,909,579,800    196,365,266
  28   spermatogonia            SAMN08375231   SRR6502357   17.947471   46,888,332,900    156,294,443
  4    spermatogonia            SAMN09427379   SRR7345462   18.686342   52,032,780,000    173,442,600
  5    spermatogonia            SAMN09427379   SRR7345463   29.956561   82,384,836,000    274,616,120
  6    spermatogonia            SAMN09427379   SRR7345464   39.145759   105,153,716,100   350,512,387
  7    spermatogonia            SAMN09427378   SRR7345465   35.816184   96,048,594,600    320,161,982
  8    spermatogonia            SAMN09427378   SRR7345467   28.396816   77,248,140,900    257,493,803
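The per-tissue totals quoted above can be checked with a short summary of the run table. The sketch below is a minimal illustration using only the standard library; it assumes that `SRA-runtable.tsv` is tab-separated and has columns named `source_name`, `Run` and `Reads` as in the displayed table, and it embeds three rows from that table instead of reading the real file.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the table above. The real `SRA-runtable.tsv` has more
# rows and columns; the exact column names used here are an assumption.
RUNTABLE_EXCERPT = (
    "source_name\tBioSample\tRun\tReads\n"
    "sperm\tSAMN08375229\tSRR6502360\t215841236\n"
    "sperm\tSAMN08375228\tSRR6502362\t194564109\n"
    "spermatogonia\tSAMN08375231\tSRR6502356\t196365266\n"
)


def summarize_runtable(handle):
    """Return {tissue: {"runs": [...], "reads": total}} from a tab-separated run table."""
    summary = defaultdict(lambda: {"runs": [], "reads": 0})
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = summary[row["source_name"]]
        entry["runs"].append(row["Run"])
        entry["reads"] += int(row["Reads"])
    return dict(summary)


summary = summarize_runtable(io.StringIO(RUNTABLE_EXCERPT))
for tissue, entry in sorted(summary.items()):
    print(f"{tissue}: {len(entry['runs'])} runs, {entry['reads']:,} reads")
```

To run it on the full table, the `io.StringIO(...)` handle could be replaced by an opened file, provided the column names match.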

### Handling coolers (Or: preparing coolers)

***\[A flowchart showing the pipeline from `.fastq` to `.mcool`. The first 6 steps were done with a Probably BioRender or Inkscape.\]***

#### The *gwf* workflow targets

A *gwf* workflow was created to handle the first part of the data processing. Each accession number (one read pair per accession) from the Hi-C sequencing was processed in parallel, so the execution for one sample was independent of the other samples.
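The independence between accessions can be illustrated with a small planning function. This is a sketch in plain Python and does not use the *gwf* API; the step names follow the headings of this section, while the folder layout, file suffixes and the function `plan_targets` are hypothetical. In a workflow manager, each dictionary would correspond to one target with its own inputs and outputs.

```python
# Step names follow the headings of this section; the file suffixes are hypothetical.
STEPS = [
    ("map", ".bam"),
    ("pair_sort", ".pairs.gz"),
    ("dedup", ".nodups.pairs.gz"),
    ("cool", ".cool"),
]


def plan_targets(accessions_by_tissue, root="steps"):
    """Return one chain of steps per accession.

    Every step reads only files that belong to the same accession, so the
    chains of different accessions can be executed in parallel.
    """
    plan = {}
    for tissue, accessions in accessions_by_tissue.items():
        for acc in accessions:
            base = f"{root}/{tissue}/{acc}"
            # The chain starts from the two mate files of the paired-end run.
            inputs = [f"{base}_1.fastq.gz", f"{base}_2.fastq.gz"]
            chain = []
            for step, suffix in STEPS:
                outputs = [base + suffix]
                chain.append({"name": f"{step}_{acc}", "inputs": inputs, "outputs": outputs})
                inputs = outputs  # the next step consumes this step's output
            plan[acc] = chain
    return plan


plan = plan_targets({"sperm": ["SRR6502360", "SRR6502362"]})
for step in plan["SRR6502360"]:
    print(step["name"], step["inputs"], "->", step["outputs"])
```

Because no path of one accession appears in the chain of another, a scheduler is free to run all chains at the same time.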

##### Downloading the reads

The reads were downloaded from the NCBI SRA portal \[ref\] directly to GDK as `.fastq.gz` files, using `sra-downloader` \[ref\] through Docker \[ref\].

##### Handling the reference

The latest reference genome for rhesus macaque (*Macaca mulatta*), *rheMac10* (or *Mmul_10*; UCSC and NCBI naming conventions, respectively), was downloaded to GDK from the UCSC web servers with `wget` \[ref\]. To use `bwa` (Burrows-Wheeler Aligner) \[ref\] for mapping, rheMac10 needs to be indexed with both `bwa index` (with the `-a bwtsw` option) and `samtools faidx`, which results in six index files for `bwa mem` to use.
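The two indexing calls and the six resulting files can be written out explicitly. The sketch below only builds the command lines and the expected file names and does not run the tools; the FASTA file name `rheMac10.fa` and both helper functions are hypothetical. `bwa index` selects its algorithm with `-a bwtsw` and writes five files next to the FASTA, and `samtools faidx` adds the `.fai` file.

```python
def index_commands(fasta):
    """Return the two indexing command lines as argument lists."""
    return [
        ["bwa", "index", "-a", "bwtsw", fasta],
        ["samtools", "faidx", fasta],
    ]


def expected_index_files(fasta):
    """Return the index files the two commands are expected to write."""
    bwa_suffixes = [".amb", ".ann", ".bwt", ".pac", ".sa"]  # written by `bwa index`
    return [fasta + suffix for suffix in bwa_suffixes] + [fasta + ".fai"]


FASTA = "rheMac10.fa"  # hypothetical local file name for the unpacked reference
for command in index_commands(FASTA):
    print(" ".join(command))
print(expected_index_files(FASTA))
```

The argument lists can be passed to `subprocess.run` as they are, and the list of expected files is the kind of information a workflow target would declare as its outputs.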

##### Mapping paired-end reads

##### Pair and sort the reads

##### Filter (deduplicate) pairs

##### Create interaction matrices (coolers)

##### Pooling samples (Merging coolers)

To get the strongest signal, the interaction matrices were pooled: `cooler merge` was used to merge all samples in each sub-folder (cell type) into a single interaction matrix per cell type. We rely on the report by Wang et al. ([2019](#ref-wang_reprogramming_2019)) that compartments are highly reproducible between replicates, so merging all replicates should give a more robust signal.
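The pooling per cell type can be sketched as a grouping step followed by one `cooler merge` call per group. The example below only builds the command lines; it assumes a hypothetical layout `coolers/<cell type>/<run>.cool`, and the output names and the function `merge_commands` are illustrative. `cooler merge` takes the output path first, followed by the input coolers.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def merge_commands(cool_paths, out_dir="merged"):
    """Group coolers by their parent folder (cell type) and build one merge command each."""
    by_cell_type = defaultdict(list)
    for path in cool_paths:
        by_cell_type[PurePosixPath(path).parent.name].append(path)
    return {
        cell_type: ["cooler", "merge", f"{out_dir}/{cell_type}.merged.cool", *sorted(paths)]
        for cell_type, paths in by_cell_type.items()
    }


commands = merge_commands(
    [
        "coolers/sperm/SRR6502362.cool",
        "coolers/sperm/SRR6502360.cool",
        "coolers/spermatogonia/SRR6502356.cool",
    ]
)
for cell_type, command in sorted(commands.items()):
    print(cell_type, "->", " ".join(command))
```

Merging adds up the contact counts of the input coolers, so all inputs for one cell type need to share the same bins (same reference and resolution).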

##### Create multi-resolution coolers (zoomify)

##### Matrix balancing (Iterative correction)

##### Eigendecomposition

Open Chromosome Collective. 2024. “Open Chromosome Collective (Open2C).” Resource. *Open2C*. <https://open2c.github.io/>.

Wang, Yao, Hanben Wang, Yu Zhang, Zhenhai Du, Wei Si, Suixing Fan, Dongdong Qin, et al. 2019. “Reprogramming of Meiotic Chromatin Architecture During Spermatogenesis.” *Molecular Cell* 73 (3): 547–561.e6. <https://doi.org/10.1016/j.molcel.2018.11.019>.