# How to create the HDF5 File needed as Input for ancIBD

This vignette notebook walks you through the steps for producing the input that `ancIBD` needs: a so-called HDF5 file.

### Starting Point: An imputed and phased VCF
The IBD software starts from an externally imputed and phased VCF. ancIBD is optimized to work with data that is output from `GLIMPSE` (see https://odelaneau.github.io/GLIMPSE/ for documentation and examples) and was imputed using the publicly available 1000G reference haplotype panel.

### Transforming VCF to HDF5
The notebook here showcases how to produce an HDF5 file from the Glimpse output VCF.

### Requirements for the input VCF:
Importantly, the input VCF should have two fields:
- GT: Diploid Genotype (the most likely imputed diploid genotype)
- GP: Genotype probabilities

These two fields are part of the standard output of Glimpse. Both are also transferred to the HDF5 file, and these data are key for a successful run of `ancIBD`.
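
Before converting, it can help to confirm that the VCF header actually declares both fields. The following is a minimal sketch using only the Python standard library; the helper names `format_fields_in_vcf` and `missing_required_fields` are hypothetical and not part of `ancIBD`. It only inspects the `##FORMAT` header lines, so it does not prove that every record carries the fields.

```python
import gzip
import re


def format_fields_in_vcf(path):
    """Return the set of FORMAT IDs declared in the header of a VCF file.

    Hypothetical helper for illustration; works on plain or gzip/bgzip files.
    """
    opener = gzip.open if path.endswith(".gz") else open
    fields = set()
    with opener(path, "rt") as f:
        for line in f:
            if not line.startswith("##"):
                break  # meta-information lines are over
            m = re.match(r"##FORMAT=<ID=([^,>]+)", line)
            if m:
                fields.add(m.group(1))
    return fields


def missing_required_fields(path, required=("GT", "GP")):
    """List the required FORMAT fields that the header does not declare."""
    declared = format_fields_in_vcf(path)
    return [name for name in required if name not in declared]
```

Calling `missing_required_fields` on one of your input VCFs should return an empty list if both `GT` and `GP` are declared.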

In [1]:
### First do Imports
import sys
import os
import matplotlib.cm as cm
import pandas as pd

###
# Edit the following path to your vignette folder:
path = "/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette/"
os.chdir(path)  # Set the right Path (in line with Atom default)
print(os.getcwd())
###

### The following Code sets the working directory to your ancIBD code 
### Please comment out and set path if you do not use the standard installation
#sys.path.insert(0,"/n/groups/reich/hringbauer/git/hapBLOCK/package/")  # hack to get development package first in path

/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette


In [2]:
from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf

## Code to Transform VCF to HDF5
The function `vcf_to_1240K_hdf` runs the transformation from VCF and outputs an HDF5 file for 1240k SNPs.

### Input and Output:
Generally, data is organized per chromosome, with one file for each chromosome.

The example call of `vcf_to_1240K_hdf` below transforms the example VCF data in `./data/vcf.raw/`
into HDF5 files suitable for `ancIBD` in `./data/hdf5/`.

### Parameters
One needs to set the following paths:
- in_vcf_path: Path of the input VCF file (the imputed and phased VCF)
- path_vcf: Path of an intermediate VCF file, which is internally filtered to 1240k SNPs
- path_h5: Path of the output HDF5 file
- marker_path: Path of the 1240k SNPs to use (a simple table provided with the example data)
- map_path: Path of the map file to use (eigenstrat .snp provided with the example data, it has the map data included)
- af_path (optional): Path of allele frequencies to merge into the HDF5 file

The data for the 1240k SNPs (`marker_path`), for the linkage map (`map_path`) and for the allele frequencies (`af_path`) are provided with the vignette, and only need to be changed for custom SNP sets.

Below you find the function call for a single chromosome. You can also write a loop (see below) or use an array job to run this function on a cluster. The latter saves a lot of runtime on big input data because the chromosomes are processed in parallel.
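
The array-job variant could be sketched as follows. This is a minimal sketch under stated assumptions: it assumes a SLURM cluster where the task index is exposed as `SLURM_ARRAY_TASK_ID` and equals the chromosome number, and it assumes the vignette's file layout with gzipped input VCFs. The helpers `conversion_kwargs`, `chromosome_from_env` and `main` are hypothetical names and not part of `ancIBD`.

```python
import os


def conversion_kwargs(ch, data_dir="./data", prefix="example_hazelton"):
    """Build the keyword arguments for one chromosome.

    Hypothetical helper: mirrors the paths used in this vignette;
    adjust data_dir, prefix and the file extensions to your own data.
    """
    return {
        "in_vcf_path": f"{data_dir}/vcf.raw/{prefix}_chr{ch}.vcf.gz",
        "path_vcf": f"{data_dir}/vcf.1240k/{prefix}_chr{ch}.vcf",
        "path_h5": f"{data_dir}/hdf5/{prefix}_chr{ch}.h5",
        "marker_path": f"{data_dir}/filters/snps_bcftools_ch{ch}.csv",
        "map_path": f"{data_dir}/v51.1_1240k.snp",
        "af_path": f"{data_dir}/afs/v51.1_1240k_AF_ch{ch}.tsv",
        "col_sample_af": "",
        "buffer_size": 20000,
        "chunk_width": 8,
        "chunk_length": 20000,
        "ch": ch,
    }


def chromosome_from_env(env=None, var="SLURM_ARRAY_TASK_ID"):
    """Read the chromosome number from the array task index (assumed SLURM)."""
    env = os.environ if env is None else env
    ch = int(env[var])
    if not 1 <= ch <= 22:
        raise ValueError(f"Array task index {ch} is not an autosome (1-22)")
    return ch


def main():
    """Convert the single chromosome assigned to this array task."""
    # Imported here so the helpers above can be used without ancIBD installed
    from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf

    ch = chromosome_from_env()
    vcf_to_1240K_hdf(**conversion_kwargs(ch))
```

A script that ends with a call to `main()` could then be submitted once per chromosome, for example with `sbatch --array=1-22` on SLURM, so that each task converts exactly one chromosome.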

In [3]:
%%time
ch = 22

base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
vcf_to_1240K_hdf(in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf",
                 path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf",
                 path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                 marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                 map_path = f"./data/v51.1_1240k.snp", 
                 af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                 col_sample_af = "", 
                 buffer_size=20000, chunk_width=8, chunk_length=20000,
                 ch=ch)

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr22.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 15483 variants.
Loaded 6 individuals.
Loaded 16420 Chr.22 1240K SNPs.
Intersection 15408 out of 15483 HDF5 SNPs
Interpolating 75 variants.
Finished Chromosome 22.
Adding map to HDF5...
Intersection 15408 out of 15483 target HDF5 SNPs. 75 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr22.h5

CPU times: user 7.62 s, sys: 527 ms, total: 8.14 s
Wall time: 8.67 s
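
The log above reports how many SNPs in the new HDF5 file were matched to the 1240k map; the remaining SNPs are the ones that get interpolated and set to AF=0.5. As a small sanity check, these numbers can be pulled out of the printed text. This is only an illustrative sketch: the helper `summarize_intersection` is hypothetical, and it assumes the log wording shown in this notebook.

```python
import re


def summarize_intersection(log_text):
    """Extract matched/total SNP counts from the first 'Intersection' log line."""
    m = re.search(r"Intersection (\d+) out of (\d+) (?:target )?HDF5 SNPs", log_text)
    if m is None:
        raise ValueError("No intersection line found in log text")
    matched, total = int(m.group(1)), int(m.group(2))
    return {
        "matched": matched,
        "total": total,
        "unmatched": total - matched,
        "fraction_matched": matched / total,
    }


# Lines copied from the chromosome 22 output above
log = """Loaded 15483 variants.
Intersection 15408 out of 15483 HDF5 SNPs
Interpolating 75 variants."""

summary = summarize_intersection(log)
print(summary["unmatched"])  # 75, the number of interpolated variants
print(f"{summary['fraction_matched']:.1%} of SNPs matched the map")
```

For chromosome 22, about 99.5% of the SNPs are matched, and the 75 unmatched SNPs agree with the "Interpolating 75 variants" line.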


### Loop Glimpse vcf -> hdf5 over all chromosomes
This is the same as the run for a single chromosome above, but now looped over multiple chromosomes.
Note that the runtime can reach minutes or even hours for larger datasets. Parallelizing this transformation (e.g. via a parallel array job on a cluster) can reduce that time if needed.
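
Because a full run can take a long time, it may be worth checking up front that all per-chromosome input files exist, so the loop does not fail halfway through. The following is a minimal sketch; the helper `missing_input_files` is hypothetical, and the paths assume the file layout and gzipped input VCFs used in this vignette.

```python
import os


def missing_input_files(chs, data_dir="./data", prefix="example_hazelton"):
    """Return the input paths that do not exist for the given chromosomes.

    Hypothetical pre-flight check; adjust the path patterns to your data.
    """
    missing = []
    for ch in chs:
        candidates = [
            f"{data_dir}/vcf.raw/{prefix}_chr{ch}.vcf.gz",
            f"{data_dir}/filters/snps_bcftools_ch{ch}.csv",
            f"{data_dir}/afs/v51.1_1240k_AF_ch{ch}.tsv",
        ]
        missing.extend(p for p in candidates if not os.path.exists(p))
    # The map file is shared across all chromosomes
    map_path = f"{data_dir}/v51.1_1240k.snp"
    if not os.path.exists(map_path):
        missing.append(map_path)
    return missing
```

An empty list from `missing_input_files(range(1, 23))` suggests that the loop over all autosomes can find its inputs.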

In [4]:
%%time 

chs = range(1,23)

for ch in chs:
    base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
    vcf_to_1240K_hdf(in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
                     path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf",
                     path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                     marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                     map_path = f"./data/v51.1_1240k.snp", 
                     af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                     col_sample_af = "", 
                     buffer_size=20000, chunk_width=8, chunk_length=20000,
                     ch=ch)

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr1.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 88408 variants.
Loaded 6 individuals.
Loaded 93166 Chr.1 1240K SNPs.
Intersection 88115 out of 88408 HDF5 SNPs
Interpolating 293 variants.
Finished Chromosome 1.
Adding map to HDF5...
Intersection 88115 out of 88408 target HDF5 SNPs. 293 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr1.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr2.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 93875 variants.
Loaded 6 individuals.
Loaded 98657 Chr.2 1240K SNPs.
Intersection 93471 out of 93875 HDF5 SNPs
Interpolating 404 varian

### Next Steps
That's it, congratulations! You now have the HDF5 data needed as input for `ancIBD`. You can continue with the vignette notebook in `./run_ancIBD.ipynb`.