# Create a VCF from the genotypes
We simulated the parent-offspring genotypes (hold the applause), and now need to somehow make a VCF file out of them as input for `Mimick`. The genotypes are given as `0` and `1`, so we need to parse the human reference genome to look up the reference allele at each site (`0`, aka 'Ref') and make up an alternate allele (`1`, aka 'Alt'). After that, we need to encode it all into a VCF file, something simple enough that it satisfies what Mimick needs.

In [41]:
import gzip
import os
import pysam
import requests

## Download the human genome `GRCh38.p14`
We need the human genome that was used for the pedigree genotypes (contig names, lengths, etc.). We'll download the entire thing, then subset it to keep only the first 5 autosomes that we are interested in.

In [None]:
GRCh38 = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz"
os.makedirs("../reference_assembly", exist_ok=True)
assembly_original = os.path.join("../reference_assembly", "GRCh38.p14.fasta.gz")
assembly_trunc = os.path.join("../reference_assembly", "GRCh38.p14.5A.fa")
r = requests.get(GRCh38)
open(assembly_original, 'wb').write(r.content)

972898531
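
The download above holds the whole archive (roughly 970 MB, going by the byte count) in memory before writing it to disk. If that is a problem, a streaming copy is one alternative. This is only a sketch: it swaps `requests` for the standard library's `urllib`, and `download_streaming` is a made-up helper name, not something from this notebook.

```python
import shutil
import urllib.request


def download_streaming(url: str, dest: str, chunk_size: int = 1 << 20) -> int:
    """Copy `url` to `dest` in chunks and return the number of bytes written."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads and writes `chunk_size` bytes at a time,
        # so the full file never has to fit in memory
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()
```

Here it would be called as `download_streaming(GRCh38, assembly_original)`.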

Here we are subsetting `GRCh38` for the 5 autosomes we are interested in. We'll also use this loop to calculate the length of each chromosome.

In [50]:
target_chrom = {"NC_000001.11" : 0, "NC_000002.12" : 0, "NC_000003.12" : 0, "NC_000004.12" : 0, "NC_000005.10" : 0}
with (
    gzip.open(assembly_original, 'r') as fain,
    open(assembly_trunc, 'w') as faout,
    ):
    _write = False
    _name = ""
    for i in fain:
        if i.startswith(b">"):
            _name = i.decode("utf8").split()[0].lstrip(">")
            if _name in target_chrom:
                faout.write(f">{_name}\n")
                _write = True
            else:
                _write = False
        else:
            if _write:
                # count bases only, not the newline at the end of each line
                target_chrom[_name] += len(i.strip())
                faout.write(i.decode("utf-8"))

os.system(f"bgzip {assembly_trunc}")
assembly_trunc += ".gz"

In [38]:
target_chrom

{'NC_000001.11': 248956422,
 'NC_000002.12': 242193529,
 'NC_000003.12': 198295559,
 'NC_000004.12': 190214555,
 'NC_000005.10': 181538259}
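
A quick way to sanity-check the length counting is to run the same logic on a toy FASTA where the answer is obvious. The sketch below is not part of the pipeline; `fasta_lengths` and the `chrA`/`chrB`/`chrC` records are made up for illustration. The point it makes is that line lengths must exclude the trailing newline, otherwise every sequence line inflates the total by one.

```python
import gzip
import io


def fasta_lengths(handle, targets):
    """Return {name: number of bases} for the contigs listed in `targets`."""
    lengths = {name: 0 for name in targets}
    current = None
    for line in handle:
        if line.startswith(b">"):
            # the contig name is the first whitespace-delimited token of the header
            name = line[1:].split()[0].decode("utf-8")
            current = name if name in lengths else None
        elif current is not None:
            # strip() drops the newline so only bases are counted
            lengths[current] += len(line.strip())
    return lengths


# toy gzipped FASTA held in memory (hypothetical contigs)
toy = b">chrA some description\nACGT\nAC\n>chrB\nGGGG\n>chrC\nTT\n"
with gzip.open(io.BytesIO(gzip.compress(toy)), "rb") as fh:
    toy_lengths = fasta_lengths(fh, ["chrA", "chrC"])
print(toy_lengths)
```

`chrA` spans two lines (4 + 2 bases) and `chrB` is skipped because it is not a target.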

## Start with a template
How do we know what will work for Mimick? Great question. Even though I wrote the thing, I don't remember offhand what fields the VCF requires. I could check, but where's the fun in that? Instead, let's just pull the VCF file Mimick uses for testing and explore that.

In [None]:
url = 'https://github.com/pdimens/mimick/raw/refs/heads/main/test/test.vcf.gz'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
with open("test.vcf.gz", 'wb') as f:
    f.write(r.content)
# read the template as text and keep only the header lines
with gzip.open("test.vcf.gz", 'rt') as f:
    vcf_template = [i.strip() for i in f]
vcf_header = [i for i in vcf_template if i.startswith("#")]
print("\n".join(vcf_header))
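
Beyond eyeballing the printed header, it can help to list which `INFO`, `FORMAT`, and `FILTER` IDs a header declares. The following is a rough sketch using a regular expression rather than a real VCF parser; `declared_fields` is a hypothetical helper, and `example_header` is a small stand-in with a made-up `sample1` column, not the actual template.

```python
import re


def declared_fields(header_lines):
    """Map each structured header key (INFO, FORMAT, ...) to the IDs it declares."""
    fields = {}
    for line in header_lines:
        # structured lines look like ##KEY=<ID=NAME,...>
        match = re.match(r"##(\w+)=<ID=([^,>]+)", line)
        if match:
            fields.setdefault(match.group(1), []).append(match.group(2))
    return fields


example_header = [
    "##fileformat=VCFv4.2",
    '##FILTER=<ID=PASS,Description="All filters passed">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
]
print(declared_fields(example_header))
```

In the notebook this would be called as `declared_fields(vcf_header)`.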

## Update the header
Let's update the header to:
1. remove the entries that include `bcftools`
2. add our parent and offspring names to the column names
3. replace the `source`, for the sake of being pedantic, so it's clear where this file is from (us, here, doing this)
4. replace the `contig` entry with the 5 contigs we are actually working with and their lengths

In [40]:
vcf_new_header = [i for i in vcf_header if not i.startswith("##bcftools")]

# replace sample names
sample_names = ["father","mother"] + [f"offspring_{i}" for i in range(1,11)]
vcf_new_header[-1] = '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t' + "\t".join(sample_names)
vcf_new_header[2] = "##source=PhasingAssemblySims - Pavel Dimens"

# update contig names/lengths
idx = 3
del vcf_new_header[idx]
for k,v in target_chrom.items():
    vcf_new_header.insert(idx, f"##contig=<ID={k},length={v}>")
    idx += 1

print("\n".join(vcf_new_header))

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##source=PhasingAssemblySims - Pavel Dimens
##contig=<ID=NC_000001.11,length=248956422>
##contig=<ID=NC_000002.12,length=242193529>
##contig=<ID=NC_000003.12,length=198295559>
##contig=<ID=NC_000004.12,length=190214555>
##contig=<ID=NC_000005.10,length=181538259>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	father	mother	offspring_1	offspring_2	offspring_3	offspring_4	offspring_5	offspring_6	offspring_7	offspring_8	offspring_9	offspring_10
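
The header edit above relies on fixed list positions (`[2]` for `source`, `3` for the contig line), which works for this template but would silently break if the template gained or lost a line. A position-independent variant could look like the sketch below; `replace_contigs` is a hypothetical helper, and the template and contig names in the example are toy values.

```python
def replace_contigs(header_lines, contigs):
    """Return a copy of the header with its ##contig lines replaced by `contigs`."""
    new_contigs = [
        f"##contig=<ID={name},length={length}>" for name, length in contigs.items()
    ]
    out, inserted = [], False
    for line in header_lines:
        if line.startswith("##contig="):
            # swap in the new block where the first old contig line was
            if not inserted:
                out.extend(new_contigs)
                inserted = True
            continue
        out.append(line)
    if not inserted:
        # template had no contig lines: put them just before the #CHROM line
        out[-1:-1] = new_contigs
    return out


toy_template = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=old_contig,length=1000>",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
]
toy_header = replace_contigs(toy_template, {"ctgA": 100, "ctgB": 50})
print("\n".join(toy_header))
```

In the notebook, the equivalent call would be `replace_contigs(vcf_new_header, target_chrom)`.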


In [42]:
with open("pedigree.vcf", "w") as _vcf:
    _vcf.write("\n".join(vcf_new_header))

## `CHROM`, `POS`, `REF` and `ALT`
We need to populate the CHROM, POS, REF, and ALT columns, but since this is just a text file (and not a table), we should probably do this line-by-line. To achieve that, we'll need to read in the pedigree genotype matrix we created. That way, we can go row-by-row in the pedigree and:
1. get the chromosome and position with which to index the genome
    - we'll use `pysam` for random access
2. find the nucleotide at that position and choose "any other base"
    - REF A/T -> ALT G/C
    - REF G/C -> ALT A/T
    - **will have to manually inspect for `N`s**
3. collapse the sample genotypes into the `GT` field as-is (since they are already in `0`|`1` format)
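
The three steps above can be sketched as a pair of small functions. Several things here are assumptions rather than the notebook's actual choices: the specific REF-to-ALT mapping (A→G, T→C, G→A, C→T), the `.` placeholders for `ID`, `QUAL`, and `INFO`, the `PASS` filter, and the helper names `pick_alt` and `vcf_record`. The reference base is passed in directly so the sketch runs without a genome; in the notebook it would presumably come from something like `pysam.FastaFile(assembly_trunc).fetch(chrom, pos - 1, pos)`, since `fetch` takes 0-based, half-open coordinates while VCF positions are 1-based.

```python
# assumed mapping: A/T become G/C, and G/C become A/T
ALT_FOR_REF = {"A": "G", "T": "C", "G": "A", "C": "T"}


def pick_alt(ref):
    """Return an alternate base for `ref`, refusing anything that is not A/C/G/T."""
    ref = ref.upper()  # reference FASTA files may use lowercase for masked sequence
    if ref not in ALT_FOR_REF:
        # e.g. N: flag it so the site can be inspected by hand
        raise ValueError(f"no ALT defined for reference base {ref!r}")
    return ALT_FOR_REF[ref]


def vcf_record(chrom, pos, ref, genotypes):
    """Build one tab-delimited VCF line; `pos` is 1-based, genotypes like '0|1'."""
    ref = ref.upper()
    fields = [chrom, str(pos), ".", ref, pick_alt(ref), ".", "PASS", ".", "GT", *genotypes]
    return "\t".join(fields)


# hypothetical site with three samples
record = vcf_record("NC_000001.11", 12345, "a", ["0|1", "0|0", "1|0"])
print(record)
```

Raising on `N` rather than picking a base keeps the "manually inspect" step explicit: the caller decides whether to skip the site or stop.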

In [51]:
# create a FASTA index for random access
os.system(f"samtools faidx {assembly_trunc}")

0