# 1. Download ChIPseq data

## 1.1 Chen et al. (2015) Hypothalamus ChIPseq

> [GSE Accession: GSE66868](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66868)

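```python
# Execute in python
import os

# Create the dataset directory first so that writing gsms.txt cannot fail
os.makedirs("data/GSE66868", exist_ok=True)

list_of_gsms = [f"GSM16335{n}" for n in range(77, 79)]
with open("data/GSE66868/gsms.txt", "w") as wH:
    for gsm in list_of_gsms:
        print(gsm, file=wH)
```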

### 1.1.1 Download RAW fastq

> Execute in command line

```bash
$ cd data/GSE66868

$ cat gsms.txt | parallel -j 1 -u "ffq --ftp {} | jq -r '.[] | .url'" > ftp_links.txt

$ cat ftp_links.txt | parallel -j 1 -u 'curl -O {}'
```
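The `jq` filter above pulls the `url` field out of every record that `ffq --ftp` prints. For readers without `jq`, the same extraction can be sketched in Python. This is a minimal sketch, assuming the `ffq` output is a JSON array of objects that each carry a `url` key (which is what the filter `.[] | .url` relies on); the function name `extract_urls` and the example record, including its URL, are made up for illustration.

```python
import json


def extract_urls(ffq_json_text):
    """Return the `url` field of every record in an `ffq --ftp` style JSON array."""
    records = json.loads(ffq_json_text)
    return [record["url"] for record in records]


# Hypothetical record: shaped like the assumed ffq output, with a placeholder URL.
example = json.dumps(
    [{"accession": "SRR0000000", "filetype": "fastq",
      "url": "ftp://example.org/SRR0000000.fastq.gz"}]
)
for url in extract_urls(example):
    print(url)
```

In practice the JSON text would come from the standard output of `ffq --ftp <GSM>`, and the printed URLs would be appended to `ftp_links.txt`.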

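```python
# Execute in python
import glob
import os

# The pipeline is called with --reads '_1' '_2', so single-end files need the _1 suffix
chen_chip_fastqs = glob.glob("data/GSE66868/*.fastq.gz")
for fastq in chen_chip_fastqs:
    # Skip files that are already renamed so the cell can be re-run safely
    if not fastq.endswith("_1.fastq.gz"):
        os.rename(fastq, fastq.replace(".fastq.gz", "_1.fastq.gz"))
```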

### 1.1.2 Create `data/GSE66868.ChIP.dict.yaml`

```yaml
chip_dict:
  SRR1914881:
    control: SRR1914878
  SRR1914882:
    control: SRR1914879
  SRR1914883:
    control: SRR1914880
```
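Before launching the pipeline it can help to confirm that every ChIP and control accession in the sample sheet has a FASTQ file on disk. The following is a minimal sketch, assuming the downloaded files are named `<accession>_1.fastq.gz` after the renaming step above; the helper `missing_fastqs` is hypothetical and not part of snakePipes.

```python
import os


def missing_fastqs(pairs, filenames, suffix="_1.fastq.gz"):
    """Return accessions (ChIP or control) that have no matching FASTQ file name."""
    present = set(filenames)
    accessions = sorted(set(pairs) | set(pairs.values()))
    return [acc for acc in accessions if acc + suffix not in present]


# ChIP sample -> control pairs, copied from the sample sheet above
pairs = {
    "SRR1914881": "SRR1914878",
    "SRR1914882": "SRR1914879",
    "SRR1914883": "SRR1914880",
}

fastq_dir = "data/GSE66868"
filenames = os.listdir(fastq_dir) if os.path.isdir(fastq_dir) else []
print("Missing FASTQ files for:", missing_fastqs(pairs, filenames))
```

An empty list means every accession in the sample sheet has a file with the expected name.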

### 1.1.3 Run [snakePipes](https://snakepipes.readthedocs.io/en/latest/) DNA-mapping and ChIP-seq pipeline

- Configure snakePipes according to the documentation provided on its website.
- Download the [mm9/GRCm37_gencode_release1](https://zenodo.org/record/4478284) premade index and [set it up accordingly](https://snakepipes.readthedocs.io/en/latest/content/setting_up.html#download-premade-indices).
- Create a Python script, `blacklist_coordinates.py`, that converts the UCSC chromosome names in the blacklist (e.g. `chr1`) to the Ensembl names used by the index (e.g. `1`):

```python
import pandas as pd

# Blacklist BED file with UCSC-style chromosome names in the first column
df = pd.read_csv("mm9-blacklist.bed", sep="\t", header=None)

# Mapping table: column 0 holds Ensembl names, column 1 holds UCSC names
mm9_convert = pd.read_csv(
    "https://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCm37_ensembl2UCSC.txt",
    sep="\t", header=None, index_col=1,
)[0].to_dict()

# Rename chromosomes; names missing from the mapping become NaN and are dropped
df[0] = df[0].map(mm9_convert)
df.dropna(subset=[0]).to_csv("mm9-blacklist-converted.bed", sep="\t", header=False, index=False)
```

- Run the following shell commands to download the blacklist regions and convert their chromosome names:

```bash
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/Blacklist_v1/mm9-blacklist.bed.gz
gzip -d mm9-blacklist.bed.gz
python blacklist_coordinates.py
```

- Set the following values in `GRCm37_gencode_release1.yaml`:

```yaml
blacklist_bed: 'mm9-blacklist-converted.bed' # use the absolute path; the file should sit in the directory of the extracted index archive
ignoreForNormalization: 'NT_166325 MT NT_166464 NT_166452 NT_166480 NT_166448 NT_166458 NT_166443 NT_166466 NT_166476 NT_166479 NT_166478 NT_166474 NT_166471 NT_166445 NT_166465 NT_166457 NT_166470 NT_166454 NT_166472 NT_166449 NT_166481 NT_166337 NT_166459 NT_166456 NT_166473 NT_166461 NT_166475 NT_166462 NT_166444 NT_166453 NT_166446 NT_166469 NT_072868 NT_166335 NT_166467 NT_166283 NT_166338 NT_166340 NT_166442 NT_166334 NT_166286 NT_166451 NT_166336 NT_166339 NT_166290 NT_053651 NT_166450 NT_166447 NT_166468 NT_166460 NT_166477 NT_166455 NT_166291 NT_166463 NT_166433 NT_166402 NT_166327 NT_166308 NT_166309 NT_109319 NT_166282 NT_166314 NT_166303 NT_112000 NT_110857 NT_166280 NT_166375 NT_166311 NT_166307 NT_166310 NT_166323 NT_166437 NT_166374 NT_166364 NT_166439 NT_166328 NT_166438 NT_166389 NT_162750 NT_166436 NT_166372 NT_166440 NT_166326 NT_166342 NT_166333 NT_166435 NT_166434 NT_166341 NT_166376 NT_166387 NT_166281 NT_166313 NT_166380 NT_166360 NT_166441 NT_166359 NT_166386 NT_166356 NT_166357 NT_166423 NT_166384 NT_161879 NT_161928 NT_166388 NT_161919 NT_166381 NT_166367 NT_166392 NT_166406 NT_166365 NT_166379 NT_166358 NT_161913 NT_166378 NT_166382 NT_161926 NT_166345 NT_166385 NT_165789 NT_166368 NT_166405 NT_166390 NT_166373 NT_166361 NT_166348 NT_166369 NT_161898 NT_166417 NT_166410 NT_166383 NT_166362 NT_165754 NT_166366 NT_166363 NT_161868 NT_166407 NT_165793 NT_166352 NT_161925 NT_166412 NT_165792 NT_161924 NT_166422 NT_165795 NT_166354 NT_166350 NT_165796 NT_161904 NT_166370 NT_165798 NT_165791 NT_161885 NT_166424 NT_166346 NT_165794 NT_166377 NT_166418 NT_161877 NT_166351 NT_166408 NT_166349 NT_161906 NT_166391 NT_161892 NT_166415 NT_165790 NT_166420 NT_166353 NT_166344 NT_166371 NT_161895 NT_166404 NT_166413 NT_166419 NT_161916 NT_166347 NT_161875 NT_161911 NT_161897 NT_161866 NT_166409 NT_161872 NT_166403 NT_161902 NT_166414 NT_166416 NT_166421 NT_161923 NT_161937 Y X'
```
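The blacklist has to use the same chromosome names as the index, which is why the UCSC names are converted (note the Ensembl-style `MT`, `X` and `Y` in `ignoreForNormalization`). The effect of the conversion can be illustrated on toy data without pandas. This is only a sketch: `convert_bed_rows`, the two-entry mapping and the three rows are invented for illustration and stand in for the real mapping table and blacklist.

```python
def convert_bed_rows(rows, ucsc_to_ensembl):
    """Rename the chromosome of each BED row; drop rows whose chromosome is unmapped."""
    converted = []
    for chrom, start, end in rows:
        new_name = ucsc_to_ensembl.get(chrom)
        if new_name is not None:
            converted.append((new_name, start, end))
    return converted


# Hypothetical subset of the UCSC -> Ensembl mapping table
toy_mapping = {"chr1": "1", "chrM": "MT"}

# Hypothetical BED rows; the last chromosome has no entry in the mapping
toy_rows = [("chr1", 100, 200), ("chrM", 0, 50), ("chrUn_random", 5, 10)]

for row in convert_bed_rows(toy_rows, toy_mapping):
    print(*row, sep="\t")
```

Rows on chromosomes that have no counterpart in the mapping are discarded, mirroring the `dropna` step of the conversion script.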

> Execute in command line

```bash
$ cd data

$ DNA-mapping -i GSE66868 -o GSE66868_DNA_mapping -m mapping --local -j 24 --reads '_1' '_2' GRCm37_gencode_release1

$ cd GSE66868_DNA_mapping

$ ChIP-seq -d . --singleEnd --windowSize 1000 --bwBinSize 1000 --fragmentLength 150 --local --snakemakeOptions "-p -j 64" --peakCaller MACS2 --peakCallerOptions "-g mm -m 1 30 -p 0.2 --call-summits --keep-dup all" --bigWigType log2ratio GRCm37_gencode_release1 ../GSE66868.ChIP.dict.yaml
```

## 1.2 Gabel et al. (2015) Visual cortex ChIPseq

> [GSE Accession: GSE67293](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67293)

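```python
# Execute in python
import os

# Create the dataset directory first so that writing gsms.txt cannot fail
os.makedirs("data/GSE67293", exist_ok=True)

list_of_gsms = [f"GSM16439{n}" for n in range(34, 38)]
with open("data/GSE67293/gsms.txt", "w") as wH:
    for gsm in list_of_gsms:
        print(gsm, file=wH)
```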

### 1.2.1 Download RAW fastq

> Execute in command line

```bash
$ cd data/GSE67293

$ cat gsms.txt | parallel -j 1 -u "ffq --ftp {} | jq -r '.[] | .url'" > ftp_links.txt

$ cat ftp_links.txt | parallel -j 1 -u 'curl -O {}'
```

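```python
# Execute in python
import glob
import os

# The pipeline is called with --reads '_1' '_2', so single-end files need the _1 suffix
gabel_chip_fastqs = glob.glob("data/GSE67293/*.fastq.gz")
for fastq in gabel_chip_fastqs:
    # Skip files that are already renamed so the cell can be re-run safely
    if not fastq.endswith("_1.fastq.gz"):
        os.rename(fastq, fastq.replace(".fastq.gz", "_1.fastq.gz"))
```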

### 1.2.2 Create `data/GSE67293.ChIP.dict.yaml`

```yaml
chip_dict:
    SRR1930028:
        control: SRR1930029
    SRR1930030:
        control: SRR1930031
```
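With more than a handful of samples, typing the sample sheet by hand becomes error-prone. The sketch below writes the same layout from a Python dictionary of ChIP-to-control pairs. It is a minimal illustration that uses plain string formatting rather than a YAML library; the function name `render_chip_dict` is hypothetical.

```python
def render_chip_dict(pairs, indent="    "):
    """Render ChIP -> control pairs in the sample-sheet layout shown above."""
    lines = ["chip_dict:"]
    for chip, control in pairs.items():
        lines.append(f"{indent}{chip}:")
        lines.append(f"{indent * 2}control: {control}")
    return "\n".join(lines) + "\n"


# ChIP sample -> control pairs, copied from the sample sheet above
pairs = {"SRR1930028": "SRR1930029", "SRR1930030": "SRR1930031"}
print(render_chip_dict(pairs), end="")
```

The returned string can be written to `data/GSE67293.ChIP.dict.yaml` with an ordinary `open(..., "w")` call.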

### 1.2.3 Run [snakePipes](https://snakepipes.readthedocs.io/en/latest/) DNA-mapping and ChIP-seq pipeline

- Configure snakePipes according to the documentation provided on its website.
- Download [mm9/GRCm37_gencode_release1](https://zenodo.org/record/4478284) premade index and [set up accordingly](https://snakepipes.readthedocs.io/en/latest/content/setting_up.html#download-premade-indices)

> Execute in command line

```bash
$ cd data

$ DNA-mapping -i GSE67293 -o GSE67293_DNA_mapping -m mapping --local --snakemakeOptions "-p -j 64" --reads '_1' '_2' GRCm37_gencode_release1

$ cd GSE67293_DNA_mapping 

$ ChIP-seq -d . --singleEnd --windowSize 1000 --bwBinSize 1000 --fragmentLength 150 --local -j 24 --peakCaller MACS2 --peakCallerOptions "-g mm -m 1 30 -p 0.2 --call-summits --keep-dup all" --bigWigType log2ratio GRCm37_gencode_release1 ../GSE67293.ChIP.dict.yaml
```

## 1.3 Boxer et al. (2020) Forebrain ChIPseq

> [GSE Accession: GSE139509](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139509)

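```python
# Execute in python
import os

# Create the dataset directory first so that writing gsms.txt cannot fail
os.makedirs("data/GSE139509", exist_ok=True)

list_of_gsms = [f"GSM41426{n}" for n in range(62, 70)]
with open("data/GSE139509/gsms.txt", "w") as wH:
    for gsm in list_of_gsms:
        print(gsm, file=wH)
```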

### 1.3.1 Download RAW fastq

> Execute in command line

```bash
$ cd data/GSE139509

$ cat gsms.txt | parallel -j 1 -u "ffq --ftp {} | jq -r '.[] | .url'" > ftp_links.txt

$ cat ftp_links.txt | parallel -j 1 -u 'curl -O {}'
```

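```python
# Execute in python
import glob
import os

# The pipeline is called with --reads '_1' '_2', so single-end files need the _1 suffix
boxer_chip_fastqs = glob.glob("data/GSE139509/*.fastq.gz")
for fastq in boxer_chip_fastqs:
    # Skip files that are already renamed so the cell can be re-run safely
    if not fastq.endswith("_1.fastq.gz"):
        os.rename(fastq, fastq.replace(".fastq.gz", "_1.fastq.gz"))
```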

### 1.3.2 Create `data/GSE139509.ChIP.dict.yaml`

```yaml
chip_dict:
  SRR10356997:
    control: SRR10356999
  SRR10356998:
    control: SRR10357000
  SRR10357001:
    control: SRR10357003
  SRR10357002:
    control: SRR10357004
```

### 1.3.3 Run [snakePipes](https://snakepipes.readthedocs.io/en/latest/) DNA-mapping and ChIP-seq pipeline

- Configure snakePipes according to the documentation provided on its website.
- Because Boxer et al. (2020) used the mm10 assembly for their bisulfite sequencing, we also use mm10 for ChIP-seq processing so that assemblies stay consistent across the multi-omic data sets.
- Download the [mm10/GRCm38_gencode_release19](https://zenodo.org/record/4468065) premade index and [set it up accordingly](https://snakepipes.readthedocs.io/en/latest/content/setting_up.html#download-premade-indices).
- Run the following shell commands to download the blacklist regions:

```bash
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/mm10-blacklist.v2.bed.gz
gzip -d mm10-blacklist.v2.bed.gz
```

- Set the following values in `GRCm38_gencode_release19.yaml`:

```yaml
blacklist_bed: 'mm10-blacklist.v2.bed'
ignoreForNormalization: 'chrX chrY chrM GL456210.1 GL456211.1 GL456212.1 GL456213.1 GL456216.1 GL456219.1 GL456221.1 GL456233.1 GL456239.1 GL456350.1 GL456354.1 GL456359.1 GL456360.1 GL456366.1 GL456367.1 GL456368.1 GL456370.1 GL456372.1 GL456378.1 GL456379.1 GL456381.1 GL456382.1 GL456383.1 GL456385.1 GL456387.1 GL456389.1 GL456390.1 GL456392.1 GL456393.1 GL456394.1 GL456396.1 JH584292.1 JH584293.1 JH584294.1 JH584295.1 JH584296.1 JH584297.1 JH584298.1 JH584299.1 JH584300.1 JH584301.1 JH584302.1 JH584303.1 JH584304.1'
```

> Execute in command line

```bash
$ cd data

$ DNA-mapping -i GSE139509 -o GSE139509_DNA_mapping -m mapping --local --snakemakeOptions "-p -j 64" --reads '_1' '_2' GRCm38_gencode_release19

$ cd GSE139509_DNA_mapping 

$ ChIP-seq -d . --singleEnd --windowSize 1000 --bwBinSize 1000 --fragmentLength 150 --local -j 24 --peakCaller MACS2 --peakCallerOptions "-g mm -m 1 30 -p 0.2 --call-summits --keep-dup all" --bigWigType log2ratio GRCm38_gencode_release19 ../GSE139509.ChIP.dict.yaml
```
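The `ChIP-seq` calls for the three data sets differ only in the genome name and the sample sheet. To keep the shared options in one place, the command line can be assembled programmatically. This is a hedged sketch: the helper `chipseq_command` is hypothetical, the options are copied verbatim from the commands in this section, and parallelism is passed as `-j 24` as in sections 1.2 and 1.3 (section 1.1 passes it through `--snakemakeOptions` instead).

```python
import shlex


def chipseq_command(genome, sample_sheet, jobs=24):
    """Build the ChIP-seq command line used in this section as a shell-safe string."""
    macs2_options = "-g mm -m 1 30 -p 0.2 --call-summits --keep-dup all"
    args = [
        "ChIP-seq", "-d", ".", "--singleEnd",
        "--windowSize", "1000", "--bwBinSize", "1000",
        "--fragmentLength", "150", "--local", "-j", str(jobs),
        "--peakCaller", "MACS2", "--peakCallerOptions", macs2_options,
        "--bigWigType", "log2ratio", genome, sample_sheet,
    ]
    # shlex.join quotes arguments that contain spaces, such as the MACS2 options
    return shlex.join(args)


print(chipseq_command("GRCm38_gencode_release19", "../GSE139509.ChIP.dict.yaml"))
```

The printed string is meant to be run from inside the corresponding `*_DNA_mapping` directory, as in the shell blocks above.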