# Set up the reference genome

In [3]:
import gzip
import os
import random
import requests

random.seed(6969)

In [5]:
os.makedirs("../reference_assembly", exist_ok=True)
assembly_original = "GRCh38.p14.fasta.gz"
assembly_trunc = "GRCh38.p14.5A.fa"
assembly_trunc_noN = "GRCh38.p14.5A.noN.fa"

## Download the human genome `GRCh38.p14`
We need the same human genome that was used for the pedigree genotypes (matching contig names, lengths, etc.). We'll download the entire assembly, then subset it to keep only the first 5 autosomes, which are the ones we are interested in.

In [None]:
GRCh38 = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz"
r = requests.get(GRCh38)
open(assembly_original, 'wb').write(r.content)

972898531
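
Reading the whole archive into memory through `r.content` works, but the output above shows it is roughly 973 MB, so a streamed download keeps memory use flat. Below is a minimal sketch that swaps in the standard library's `urllib` in place of `requests`. `download_to_file` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import os
import shutil
import urllib.request


def download_to_file(url, dest, chunk_size=1 << 20):
    """Stream `url` to `dest` in chunks and return the size of `dest` in bytes."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj moves at most `chunk_size` bytes at a time,
        # so the full archive is never held in memory
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(dest)


# Usage (assumes the `GRCh38` URL and `assembly_original` name defined above):
# download_to_file(GRCh38, assembly_original)
```

The `with` block also closes the output file explicitly, which the one-line `open(...).write(...)` form leaves to the garbage collector.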

Here we subset `GRCh38` to the 5 autosomes we are interested in.
However, one more step is needed. Chromosomes with long stretches of N nucleotides would leave us unable to simulate reads from those genomic regions. An actual human reference genome is incomplete, but it contains natural features that we can't otherwise create with a random ATGC simulation (or with the other simulators I've found), including:
- centromeric regions
- repeat regions scattered all over
- non-random association between adjacent nucleotides

Unlike the Mimick situation, we aren't building a new simulator here 😅. Instead, we are just going to **replace N nucleotides with random ATCG bases** to create the input haploid genome for the pedigree/phase linked-read simulations. For the assembly benchmark, we'll later add some mutations to create a diploid.
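
The replacement idea can be tried on a toy sequence before running it on whole chromosomes. This is a small sketch only. `replace_n` is a hypothetical helper, and the seed `0` is an arbitrary choice for a repeatable demo, not the seed used in this notebook.

```python
import random


def replace_n(seq, rng):
    """Return `seq` with every uppercase N swapped for a random base."""
    return "".join(rng.choice("ATCG") if nuc == "N" else nuc for nuc in seq)


rng = random.Random(0)  # arbitrary seed, only for a repeatable demo
toy = "ACGTNNNNACGTNN"
filled = replace_n(toy, rng)

print(toy)
print(filled)
```

Only uppercase `N` is replaced. Every other character, including newlines and any lowercase bases, passes through unchanged, so sequence length and line wrapping are preserved.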

In [7]:
# RefSeq accessions for chromosomes 1-5
target_chrom = {"NC_000001.11", "NC_000002.12", "NC_000003.12", "NC_000004.12", "NC_000005.10"}
with (
    gzip.open(assembly_original, 'r') as fain,
    open(assembly_trunc, 'w') as faout,
    open(assembly_trunc_noN, 'w') as faoutN,
    ):
    # True while the current contig is one of the targets
    _write = False
    _name = ""
    for i in fain:
        if i.startswith(b">"):
            # header line: keep only the accession, drop the description
            _name = i.decode("utf-8").split()[0].lstrip(">")
            if _name in target_chrom:
                faout.write(f">{_name}\n")
                faoutN.write(f">{_name}\n")
                _write = True
            else:
                _write = False
        elif _write:
            line = i.decode("utf-8")
            # swap each N for a random base, leave everything else untouched
            faoutN.write(
                ''.join(random.choice("ATCG") if nuc == 'N' else nuc for nuc in line)
            )
            faout.write(line)

Below is a small sanity check to ensure that the N bases are preserved in `assembly_trunc` and replaced in `assembly_trunc_noN`.

In [None]:
with (
    open(assembly_trunc, 'r') as fa,
    open(assembly_trunc_noN, 'r') as faN,
    ):
    for k,v in {"truncated": fa, "truncated_noN": faN}.items():
        print(k)
        print(v.readline() , end = "")
        print(v.readline())

truncated
>NC_000001.11
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

truncated_noN
>NC_000001.11
AATTGATCTCCCGAGAGCCCTGACCTGCTAGCGCGTCCCAGGTCGGTAATAGTAGGAACGCGCGATATTTGAAGCACAAG
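
The check above only looks at the first sequence line of each file. A fuller check could count the N bases in every contig, as in the sketch below. `count_n_per_contig` is a hypothetical helper, and it assumes it is run while the uncompressed FASTA files still exist.

```python
def count_n_per_contig(fasta_path):
    """Return {contig name: number of uppercase N bases} for a plain-text FASTA."""
    counts = {}
    name = None
    with open(fasta_path, "r") as fasta:
        for line in fasta:
            if line.startswith(">"):
                # header line: the contig name is the first whitespace-delimited token
                name = line[1:].split()[0]
                counts[name] = 0
            elif name is not None:
                counts[name] += line.count("N")
    return counts


# Usage (assumes the file names defined at the top of this notebook):
# count_n_per_contig(assembly_trunc)      # nonzero counts expected
# count_n_per_contig(assembly_trunc_noN)  # 0 expected for every contig
```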



## Housekeeping
Finally, compress the genomes with `bgzip` and index them with `samtools faidx` for random access later. This also conveniently stores the contig lengths in the `.fai` files.

In [13]:
os.system(f"bgzip {assembly_trunc}")
os.system(f"samtools faidx {assembly_trunc}.gz")

os.system(f"bgzip {assembly_trunc_noN}")
os.system(f"samtools faidx {assembly_trunc_noN}.gz")

0
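
The contig lengths can then be read back from the index without touching the FASTA itself. A `.fai` file is tab-separated, with the contig name in the first column and its length in the second. The sketch below relies only on that layout, and `read_fai_lengths` is a hypothetical helper name.

```python
def read_fai_lengths(fai_path):
    """Return {contig name: length} from a samtools .fai index."""
    lengths = {}
    with open(fai_path, "r") as fai:
        for line in fai:
            if not line.strip():
                continue  # skip blank lines
            # columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
            name, length = line.rstrip("\n").split("\t")[:2]
            lengths[name] = int(length)
    return lengths


# Usage (assumes samtools wrote the index next to the compressed FASTA):
# read_fai_lengths(f"{assembly_trunc_noN}.gz.fai")
```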