# Howto: Create the HDF5 File as Input for ancIBD

This vignette notebook walks you through the steps for producing the input that `ancIBD` needs, a so-called HDF5 file. We chose this file format because it stores metadata in a folder-like internal structure and supports reading only part of the data (for details of this file format, see https://en.wikipedia.org/wiki/Hierarchical_Data_Format).

# General Description

### Starting Point: An imputed and phased VCF
Generally, an `ancIBD` pipeline starts from an externally imputed and phased VCF. `ancIBD` is optimized to work with the output of `GLIMPSE` (see https://odelaneau.github.io/GLIMPSE/ for documentation and examples) when using the publicly available 1000G reference haplotype panel.

### Transforming VCF to HDF5 Files
This notebook showcases how to produce an HDF5 file from the GLIMPSE output VCF.

### Requirements of input VCF:
Importantly, the input for `ancIBD` should have two fields:
- GT: Diploid Genotype (the most likely imputed phased diploid genotype)
- GP: Genotype probabilities

Both fields are part of the standard VCF output of GLIMPSE. They are then transferred into the HDF5 file, and these data are key for a successful run of `ancIBD`.
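
Before starting the conversion, it can help to confirm that a VCF declares both fields in its header. The following is a minimal sketch using only the Python standard library; the helper name `missing_format_fields` is hypothetical and not part of `ancIBD`. It only inspects the `##FORMAT` header lines and does not validate the genotype records themselves.

```python
import gzip

def missing_format_fields(vcf_path, required=("GT", "GP")):
    """Return the required FORMAT IDs that the VCF header does not declare.

    Hypothetical helper for illustration, not part of ancIBD.
    """
    declared = set()
    # Read gzip-compressed (.gz) or plain-text VCFs
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as f:
        for line in f:
            if not line.startswith("##"):
                break  # meta-information lines end at the #CHROM header line
            if line.startswith("##FORMAT=<ID="):
                # e.g. ##FORMAT=<ID=GP,Number=3,Type=Float,...>
                declared.add(line[len("##FORMAT=<ID="):].split(",", 1)[0])
    return [field for field in required if field not in declared]
```

Calling `missing_format_fields("./data/vcf.raw/example_hazelton_chr22.vcf.gz")` should return an empty list if the header declares both `GT` and `GP`.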

# Example Pipeline

In [1]:
### First do Python Imports and set working directory
import sys
import os
import matplotlib.cm as cm
import pandas as pd

###
# Edit the following path to your vignette folder:
path = "/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette/"
os.chdir(path)  # Set the right Path (in line with Atom default)
print(os.getcwd())
###

### The following line puts a custom ancIBD code folder first in the import path.
### Comment it out to use your installed ancIBD package, or edit the path to point to your own copy.
sys.path.insert(0,"/n/groups/reich/hringbauer/git/hapBLOCK/package/")  # Put the custom ancIBD package first in the path

from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf  # The ancIBD helper function that converts VCF to HDF5

/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette


## Function to Transform VCF to HDF5
The function `vcf_to_1240K_hdf` runs the transformation from VCF and outputs an HDF5 file for the 1240k SNPs.

### Input and Output:
Generally, input data for `ancIBD` is organized per chromosome, with one file for each chromosome.

### Parameters
One needs to set paths for the input, intermediate, and output files:
- in_vcf_path: Path of the input VCF file (the imputed and phased VCF)
- path_vcf: Path of an intermediate VCF file, which is internally filtered to 1240k data
- path_h5: Path of the output HDF5 file
- marker_path: Path of the 1240k SNPs to use (a simple table provided with the example data)
- map_path: Path of the map file to use (an eigenstrat .snp file provided with the example data; it includes the map data)
- af_path (optional): Path of allele frequencies to merge into the HDF5 file

The data for the 1240k SNPs (`marker_path`), the linkage map (`map_path`), and the allele frequencies (`af_path`) are provided with the vignette and only need to be changed for custom SNP sets.
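
Because most of these paths differ only by chromosome number, they can be collected in one place. The sketch below assumes the folder layout and file names of the vignette's example data; the helper `build_paths` and its `data_dir` and `prefix` arguments are hypothetical and not part of `ancIBD`.

```python
def build_paths(ch, data_dir="./data", prefix="example_hazelton"):
    """Collect the per-chromosome path arguments for one conversion run.

    Hypothetical helper; file names follow the vignette's example data.
    """
    return {
        "in_vcf_path": f"{data_dir}/vcf.raw/{prefix}_chr{ch}.vcf.gz",
        "path_vcf": f"{data_dir}/vcf.1240k/{prefix}_chr{ch}.vcf.gz",
        "path_h5": f"{data_dir}/hdf5/{prefix}_chr{ch}.h5",
        "marker_path": f"{data_dir}/filters/snps_bcftools_ch{ch}.csv",
        # The map file is shared across chromosomes
        "map_path": f"{data_dir}/v51.1_1240k.snp",
        "af_path": f"{data_dir}/afs/v51.1_1240k_AF_ch{ch}.tsv",
    }
```

The resulting dictionary could then be unpacked into the conversion call, for example `vcf_to_1240K_hdf(**build_paths(22), col_sample_af="", ch=22)`.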

### Function to create H5 from VCF
Below is the function call for a single chromosome. You can write a loop (see below) or use an array job to run this function on a cluster. The latter saves a lot of runtime on big input data because the chromosomes run in parallel.

In [2]:
%%time
ch = 22

base_path = "/n/groups/reich/hringbauer/git/hapBLOCK"  # Kept for reference; the relative paths below do not use it
vcf_to_1240K_hdf(in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
                 path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
                 path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                 marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                 map_path = f"./data/v51.1_1240k.snp", 
                 af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                 col_sample_af = "", 
                 buffer_size=20000, chunk_width=8, chunk_length=20000,
                 ch=ch)

Print downsampling to 1240K...
Running bash command: 
bcftools view -Oz -o ./data/vcf.1240k/example_hazelton_chr22.vcf.gz -T ./data/filters/snps_bcftools_ch22.csv -M2 -v snps ./data/vcf.raw/example_hazelton_chr22.vcf.gz
Finished BCF tools filtering to target markers.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr22.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 15483 variants.
Loaded 6 individuals.
Loaded 16420 Chr.22 1240K SNPs.
Intersection 15408 out of 15483 HDF5 SNPs
Interpolating 75 variants.
Finished Chromosome 22.
Adding map to HDF5...
Intersection 15408 out of 15483 target HDF5 SNPs. 75 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr22.h5

CPU times: user 9.84 s, sys: 605 ms, total: 10.4 s
Wall time: 13 s
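
The array-job option mentioned above can be sketched as follows. This assumes a SLURM cluster, where each array task receives its index in the environment variable `SLURM_ARRAY_TASK_ID`; other schedulers use different variable names. The helpers `chromosome_from_task_id` and `run_task` are hypothetical and not part of `ancIBD`.

```python
import os

def chromosome_from_task_id(env=os.environ, var="SLURM_ARRAY_TASK_ID"):
    """Read the chromosome number for this array task from the environment."""
    ch = int(env[var])
    if not 1 <= ch <= 22:
        raise ValueError(f"Expected an autosome 1-22, got {ch}")
    return ch

def run_task(convert, env=os.environ):
    """Convert the single chromosome assigned to this array task.

    `convert` is any callable that takes a chromosome number, e.g. a thin
    wrapper around vcf_to_1240K_hdf with the paths filled in.
    """
    ch = chromosome_from_task_id(env)
    convert(ch)
    return ch
```

Submitting such a script with array indices 1 to 22 would run one chromosome per task, so the conversions proceed in parallel.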


### Loop GLIMPSE VCF -> HDF5 over all chromosomes
This is the same as the run for a single chromosome above, but now looped over multiple chromosomes.

The runtime can reach minutes or even hours for larger datasets. Parallelizing this transformation (e.g. via an array job on a cluster that runs each chromosome in parallel) can reduce the wall time if needed.

In [None]:
%%time 

chs = range(1,23)

for ch in chs:
    base_path = "/n/groups/reich/hringbauer/git/hapBLOCK"  # Kept for reference; the relative paths below do not use it
    vcf_to_1240K_hdf(in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
                     path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
                     path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                     marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                     map_path = f"./data/v51.1_1240k.snp", 
                     af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                     col_sample_af = "", 
                     buffer_size=20000, chunk_width=8, chunk_length=20000,
                     ch=ch)
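
After the loop finishes, a quick check that every chromosome produced an output file can catch failed conversions early. This is a minimal sketch using only the standard library; the helper `missing_h5_files` is hypothetical, and the default path template assumes the example file names used in this vignette. It only tests that the files exist, not that their contents are complete.

```python
from pathlib import Path

def missing_h5_files(chs=range(1, 23),
                     template="./data/hdf5/example_hazelton_chr{ch}.h5"):
    """Return the chromosomes whose HDF5 output file does not exist."""
    return [ch for ch in chs if not Path(template.format(ch=ch)).is_file()]
```

An empty list from `missing_h5_files()` indicates that all 22 per-chromosome files are in place.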

## Next Steps
Congratulations, that's it! You have just created the input data needed for `ancIBD`. You can now continue with the actual `ancIBD` run in the vignette notebook `./run_ancIBD.ipynb`.