# French Canadian

This notebook measures the on-disk size of the French Canadian dataset and the time it takes to load the tree sequence with tszip. It also estimates how large the same data would be as a VCF file, both uncompressed and gzip-compressed, and compares those sizes with the tszip file.

The chromosome 9 data from the French Canadian dataset (`simulated_genomes_chr9.tsz`) can be downloaded from https://zenodo.org/record/6839683.

Place the file in the `data` folder before running the code.

In [1]:
import gzip
import humanize
import numpy as np
import os
import shutil
import time
import tszip

In [2]:
%%time
ts = tszip.decompress("data/simulated_genomes_chr9.tsz")

CPU times: user 36.2 s, sys: 6.34 s, total: 42.6 s
Wall time: 39 s
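
The `%%time` magic only works inside IPython or Jupyter. As a minimal sketch of the same measurement in a plain script, the hypothetical helper `timed` below (not part of tszip) wraps any call with `time.perf_counter`:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical usage outside a notebook:
# ts, seconds = timed(tszip.decompress, "data/simulated_genomes_chr9.tsz")
```

Unlike `%%time`, this reports wall-clock time only, not the user and system CPU times.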


In [3]:
ts.num_individuals

2723339

In [5]:
# Assuming diploid samples (two sample nodes each), halve to count sampled individuals
ts.num_samples // 2

1426749

In [6]:
tszip_size = os.path.getsize("data/simulated_genomes_chr9.tsz")
print("tszip size is ", humanize.naturalsize(tszip_size, format='%.2f'))

tszip size is  1.36 GB


In [8]:
# Keep only the first 1000 sites; VCF sizes are extrapolated from this subset
ts_sub = ts.delete_sites(np.arange(1000, ts.num_sites))


In [14]:
tmp_vcf = "data/tmp.vcf"
with open(tmp_vcf, "w") as f:
    ts_sub.write_vcf(f)

In [15]:
sub_size = os.path.getsize(tmp_vcf)
total_size = ts.num_sites * sub_size / ts_sub.num_sites

print("sub is ", humanize.naturalsize(sub_size, format='%.3f'))
print("extrapolated size is ", humanize.naturalsize(total_size, format='%.3g'))

sub is  5.723 GB
extrapolated size is  280 TB
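
As a rough sanity check on the 5.723 GB figure, the sketch below estimates the size of the genotype columns directly. It assumes each diploid genotype is written as three characters (for example `0|1`) plus a one-byte separator, and it ignores the header and the fixed per-site columns. Both are assumptions about the VCF layout, not something measured in this notebook.

```python
# Assumption: each diploid genotype occupies 4 bytes ("0|1" plus a tab or newline).
BYTES_PER_GENOTYPE = 4

num_individuals = 1_426_749  # ts.num_samples // 2, as printed earlier
num_sites = 1000             # number of sites kept in ts_sub

# Genotype columns only; the header and per-site metadata columns are not counted.
genotype_bytes = num_individuals * num_sites * BYTES_PER_GENOTYPE
print(f"{genotype_bytes / 1e9:.3f} GB")
```

Under these assumptions the genotype columns come to about 5.707 GB, so nearly all of the measured file size is genotype text, and the size per site grows in proportion to the number of samples.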


In [10]:
%%bash
gzip -k data/tmp.vcf
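
The same compression step can be sketched in pure Python with the `gzip` and `shutil` modules imported at the top, which avoids relying on a `gzip` binary being installed. The helper name `gzip_keep` is hypothetical and is not part of this notebook:

```python
import gzip
import shutil


def gzip_keep(src_path, dst_path=None):
    """Gzip src_path to dst_path (default: src_path + '.gz'), keeping the original."""
    if dst_path is None:
        dst_path = src_path + ".gz"
    # Stream in chunks so the whole file never has to fit in memory.
    with open(src_path, "rb") as f_in, gzip.open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst_path


# Hypothetical usage:
# gzip_keep("data/tmp.vcf")
```

Note that `gzip.open` defaults to compression level 9 while the `gzip` command defaults to level 6, so the resulting file size may differ slightly from the one reported in this notebook.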

In [16]:
tmp_vcfgz = "data/tmp.vcf.gz"
sub_size = os.path.getsize(tmp_vcfgz)
total_size = ts.num_sites * sub_size / ts_sub.num_sites

print("sub is ", humanize.naturalsize(sub_size, format='%.3f'))
print("extrapolated size is ", humanize.naturalsize(total_size, format='%.3g'))

sub is  11.314 MB
extrapolated size is  553 GB


In [17]:
total_size / tszip_size

407.7327720746952
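
The extrapolation and the final ratio can be collected into two small helpers, sketched below. The function names are hypothetical, and the example call uses the rounded sizes printed in this notebook, not the exact byte counts, so its result only approximates the figure above.

```python
def extrapolate_size(sub_size, sub_sites, total_sites):
    """Scale a size measured on sub_sites sites linearly to total_sites sites."""
    return total_sites * sub_size / sub_sites


def size_ratio(vcf_size, tszip_size):
    """Return how many times larger the VCF is than the tszip file."""
    return vcf_size / tszip_size


# Rounded values as printed in this notebook, in bytes.
print(round(size_ratio(553e9, 1.36e9)))
```

The linear scaling assumes that the first 1000 sites are representative of the whole chromosome. It also scales the VCF header along with the site records, so the extrapolated sizes are slight overestimates.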