Convert bioinformatics file formats to Zarr
Initially supports converting VCF to the sgkit vcf-zarr specification
This is early alpha-status code: everything is subject to change, a and it has not been thoroughly tested
Convert a VCF to zarr format:
python3 -m bio2zarr vcf2zarr convert <VCF> <zarr>
Converts the VCF to zarr format.
Do not use this for anything but the smallest files
The recommended approach is to use a multi-stage conversion
First, convert the VCF into an intermediate columnar format:
python3 -m bio2zarr vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
Then, (optionally) inspect this representation to get a feel for your dataset
python3 -m bio2zarr vcf2zarr inspec tmp/sample.exploded
Then, (optionally) generate a conversion schema to describe the corresponding Zarr arrays:
python3 -m bio2zarr vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
View and edit the schema, deleting any columns you don't want.
Finally, convert to Zarr
python3 -m bio2zarr vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
Use the -p, --worker-processes
argument to control the number of workers used
to do zarr encoding.