# Storage details for Genomics England chr2 dataset

This is a summary of the chr2-stats notebook, using the data from the ICF and Zarr "inspect" commands stored in the corresponding CSVs.


## VCFs

The VCF data for chromosome 2 for the aggV2 dataset is split across 106 chunks, using a total of 12.81 TiB of storage. 


## Intermediate columnar format 

We converted the full VCF to ICF using vcf2zarr's distributed explode commands.
We split the job into 14605 partitions, with an average run time of 20.7 min.

The storage used by each field is shown below. Total size is 9.94 TiB.

In [1]:
import pandas as pd
import humanfriendly
import numpy as np

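def parse_size(values):
    """Convert human-readable sizes (e.g. "3.57 TiB") to integer byte counts."""
    # Use int64 explicitly: these byte counts exceed the 32-bit integer range.
    size = np.zeros(values.shape, dtype=np.int64)
    for j, val in enumerate(values):
        size[j] = humanfriendly.parse_size(val)
    return size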

df_icf = pd.read_csv("chr2_if_inspect.csv")
df_icf["compressed_bytes"] = parse_size(df_icf.compressed.values)
df_icf.sort_values("compressed_bytes", ascending=False)

Unnamed: 0,name,type,chunks,size,compressed,max_n,min_val,max_val,compressed_bytes
10,FORMAT/AD,Integer,561755,34.08 TiB,3.57 TiB,2,0.0,10000.0,3925256511160
13,FORMAT/GQ,Integer,285891,17.04 TiB,2.62 TiB,1,0.0,3100.0,2880720464773
8,FORMAT/DP,Integer,285891,17.04 TiB,2.23 TiB,1,0.0,10000.0,2451910929940
9,FORMAT/DPF,Integer,285891,17.04 TiB,1.05 TiB,1,0.0,8300.0,1154487209164
15,FORMAT/PL,Integer,838874,51.11 TiB,238.94 GiB,3,0.0,3000.0,256559871426
7,FORMAT/GT,Integer,426102,25.56 TiB,149.29 GiB,3,-2.0,1.0,160298916904
17,FORMAT/GQX,Integer,285891,17.04 TiB,59.52 GiB,1,0.0,3100.0,63909113364
14,FORMAT/FT,String,561755,34.08 TiB,33.2 GiB,1,,,35648228556
2,QUAL,Float,14763,6.25 GiB,116.39 MiB,1,0.0,6100000.0,122043760
18,INFO/OLD_MULTIALLELIC,String,14763,7.1 GiB,86.08 MiB,1,,,90261422


In [2]:
humanfriendly.format_size(df_icf.compressed_bytes.sum(), binary=True)

'9.94 TiB'

How many chunk files?

In [8]:
df_icf.chunks.sum()

3960177

## VCF Zarr

We converted the ICF data to Zarr using vcf2zarr's distributed encode commands.
We split the job into 5989 partitions, with an average run time of 18.6 min.
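
As a rough back-of-the-envelope check, the aggregate compute time for the two conversion steps can be estimated by multiplying each partition count by its average run time. This sketch assumes the reported averages are per-partition means; the function name `total_compute_hours` is illustrative and not part of vcf2zarr.

```python
def total_compute_hours(partitions: int, mean_minutes: float) -> float:
    """Aggregate run time in hours, assuming mean_minutes is a per-partition mean."""
    return partitions * mean_minutes / 60


# Partition counts and average run times quoted in this notebook.
explode_hours = total_compute_hours(14605, 20.7)  # VCF -> ICF
encode_hours = total_compute_hours(5989, 18.6)    # ICF -> Zarr

print(f"explode: {explode_hours:,.0f} h, encode: {encode_hours:,.0f} h")
```

Under that assumption, the explode step accounts for roughly 5,000 compute hours and the encode step for roughly 1,900.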

The Zarr store uses a total of 2.54 TiB of storage across 8294982 files and directories.

This is roughly a 5x reduction in size relative to the VCF.

In [3]:
12.81 / 2.54

5.043307086614173
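
The same ratio can be computed for each stage at once. The following is a minimal sketch using the rounded totals quoted in this notebook (12.81, 9.94 and 2.54 TiB), so the results are only as precise as those rounded inputs; the `sizes_tib` dictionary and the `reduction_vs` helper are illustrative names.

```python
# Rounded totals quoted in this notebook, in TiB.
sizes_tib = {"VCF": 12.81, "ICF": 9.94, "Zarr": 2.54}


def reduction_vs(baseline: str, sizes: dict[str, float]) -> dict[str, float]:
    """Return how many times smaller each format is than the baseline."""
    return {name: sizes[baseline] / size for name, size in sizes.items()}


for name, factor in reduction_vs("VCF", sizes_tib).items():
    print(f"{name}: {factor:.2f}x")
```

On these figures the ICF is about 1.3 times smaller than the VCF, and the Zarr store is about 3.9 times smaller than the ICF.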

In [2]:

df_zarr = pd.read_csv("chr2_zarr_inspect.csv")
df_zarr["stored_bytes"] = parse_size(df_zarr.stored.values)
df_zarr.sort_values("stored_bytes", ascending=False, inplace=True)
df_zarr

Unnamed: 0,name,dtype,stored,size,ratio,nchunks,chunk_size,avg_chunk_stored,shape,chunk_shape,compressor,filters,stored_bytes
0,/call_AD,int16,658.44 GiB,17.03 TiB,26.0,473131,37.75 MiB,1.43 MiB,"(59880903, 78195, 2)","(10000, 1000, 2)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,706994566594
1,/call_GQ,int16,654.45 GiB,8.52 TiB,13.0,473131,18.88 MiB,1.42 MiB,"(59880903, 78195)","(10000, 1000)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,702710336716
2,/call_DP,int16,570.03 GiB,8.52 TiB,15.0,473131,18.88 MiB,1.23 MiB,"(59880903, 78195)","(10000, 1000)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,612065051934
3,/call_DPF,int16,447.09 GiB,8.52 TiB,20.0,473131,18.88 MiB,990.86 KiB,"(59880903, 78195)","(10000, 1000)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,480059232092
4,/call_PL,int16,162.56 GiB,25.55 TiB,160.0,473131,56.63 MiB,360.27 KiB,"(59880903, 78195, 3)","(10000, 1000, 3)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,174547470909
5,/call_GQX,int16,40.99 GiB,8.52 TiB,210.0,473131,18.88 MiB,90.84 KiB,"(59880903, 78195)","(10000, 1000)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,44012677365
6,/call_FT,object,25.04 GiB,34.07 TiB,1400.0,473131,75.51 MiB,55.5 KiB,"(59880903, 78195)","(10000, 1000)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",[VLenUTF8()],26886495272
7,/call_genotype,int8,21.46 GiB,8.52 TiB,410.0,473131,18.88 MiB,47.57 KiB,"(59880903, 78195, 2)","(10000, 1000, 2)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,23042499543
8,/call_genotype_mask,bool,12.78 GiB,8.52 TiB,680.0,473131,18.88 MiB,28.32 KiB,"(59880903, 78195, 2)","(10000, 1000, 2)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,13722420510
9,/call_genotype_phased,bool,2.35 GiB,4.26 TiB,1900.0,473131,9.44 MiB,5.2 KiB,"(59880903, 78195)","(10000, 1000)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,2523293286


In [3]:
total = df_zarr.stored_bytes.sum()

humanfriendly.format_size(total, binary=True)

'2.54 TiB'

In [9]:
df_zarr.nchunks.sum()

6312488
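
The per-array `nchunks` value of 473131 in the table above is consistent with the chunk grid implied by `shape` and `chunk_shape`: the number of chunks along each dimension is the ceiling of the dimension length divided by the chunk length. The sketch below checks this arithmetic; `chunk_grid` is an illustrative helper, not a Zarr API.

```python
import math


def chunk_grid(shape, chunk_shape):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple(math.ceil(n / c) for n, c in zip(shape, chunk_shape))


# shape and chunk_shape of /call_AD, taken from the table above.
grid = chunk_grid((59880903, 78195, 2), (10000, 1000, 2))
nchunks = math.prod(grid)

print(grid, nchunks)
```

The 5989 chunks along the variants dimension equal the number of encode partitions reported above, which suggests, though the notebook does not state it, that each partition covered one variant chunk.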

How much of the overall storage is consumed by the top 4 fields?

In [15]:
df_zarr.head(4).stored_bytes.sum() / total

0.8972142320632752
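
To generalise this question, one can ask how many of the largest fields are needed to reach a given share of storage. The sketch below uses `itertools.accumulate` on the `stored_bytes` values of the ten largest arrays listed above; because smaller arrays are omitted, the shares are relative to these ten rows only, not to the full 2.54 TiB total. The helper name `fields_to_reach` is illustrative.

```python
from itertools import accumulate

# stored_bytes of the ten largest arrays, in descending order (from the table above).
stored_bytes = [
    706994566594,  # /call_AD
    702710336716,  # /call_GQ
    612065051934,  # /call_DP
    480059232092,  # /call_DPF
    174547470909,  # /call_PL
    44012677365,   # /call_GQX
    26886495272,   # /call_FT
    23042499543,   # /call_genotype
    13722420510,   # /call_genotype_mask
    2523293286,    # /call_genotype_phased
]


def fields_to_reach(sizes, share):
    """Smallest number of leading entries whose sum is at least `share` of the total."""
    total = sum(sizes)
    for count, running in enumerate(accumulate(sizes), start=1):
        if running >= share * total:
            return count
    return len(sizes)


print(fields_to_reach(stored_bytes, 0.85))
```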

In [8]:
df_display_table = pd.DataFrame({
    "Field": df_zarr.name,
    "type": df_zarr.dtype,
    "storage": df_zarr.stored,
    "compress": df_zarr.ratio,
    "percentage": df_zarr.stored_bytes / total})
threshold = 0.01 / 100  # 0.01%
df_display_table = df_display_table[df_display_table.percentage >= threshold].copy()
df_display_table.sort_values("percentage", ascending=False, inplace=True)
df_display_table["percentage"] = df_display_table["percentage"].map('{:.2%}'.format)
df_display_table["compress"] = df_display_table["compress"].map('{:.1f}'.format)
df_display_table


Unnamed: 0,Field,type,storage,compress,percentage
0,/call_AD,int16,658.44 GiB,26.0,25.35%
1,/call_GQ,int16,654.45 GiB,13.0,25.20%
2,/call_DP,int16,570.03 GiB,15.0,21.95%
3,/call_DPF,int16,447.09 GiB,20.0,17.22%
4,/call_PL,int16,162.56 GiB,160.0,6.26%
5,/call_GQX,int16,40.99 GiB,210.0,1.58%
6,/call_FT,object,25.04 GiB,1400.0,0.96%
7,/call_genotype,int8,21.46 GiB,410.0,0.83%
8,/call_genotype_mask,bool,12.78 GiB,680.0,0.49%
9,/call_genotype_phased,bool,2.35 GiB,1900.0,0.09%


Output the (rough) table data for the manuscript:

In [9]:
print(df_display_table.to_latex(index=False, escape=True).replace("object", "str"))

\begin{tabular}{lllll}
\toprule
Field & type & storage & compress & percentage \\
\midrule
/call\_AD & int16 & 658.44 GiB & 26.0 & 25.35\% \\
/call\_GQ & int16 & 654.45 GiB & 13.0 & 25.20\% \\
/call\_DP & int16 & 570.03 GiB & 15.0 & 21.95\% \\
/call\_DPF & int16 & 447.09 GiB & 20.0 & 17.22\% \\
/call\_PL & int16 & 162.56 GiB & 160.0 & 6.26\% \\
/call\_GQX & int16 & 40.99 GiB & 210.0 & 1.58\% \\
/call\_FT & str & 25.04 GiB & 1400.0 & 0.96\% \\
/call\_genotype & int8 & 21.46 GiB & 410.0 & 0.83\% \\
/call\_genotype\_mask & bool & 12.78 GiB & 680.0 & 0.49\% \\
/call\_genotype\_phased & bool & 2.35 GiB & 1900.0 & 0.09\% \\
/call\_PS & int8 & 383.38 MiB & 12000.0 & 0.01\% \\
/call\_ADF & int8 & 383.38 MiB & 12000.0 & 0.01\% \\
/call\_ADR & int8 & 383.38 MiB & 12000.0 & 0.01\% \\
\bottomrule
\end{tabular}

