Profile the code for writing to the file.

In [1]:
import kipoi
import kipoi_veff.snv_predict as sp
from pathlib import Path
import os
from kipoi.writers import HDF5BatchWriter, TsvBatchWriter, MultipleBatchWriter
from kipoi_veff.utils.io import SyncBatchWriter

model_name = "DeepSEA/variantEffects"

In [2]:
cd notebooks/

[Errno 2] No such file or directory: 'notebooks/'
/data/nasif12/home_if12/avsec/workspace/kipoi/kipoi-veff/notebooks


In [3]:
output_dir = Path("/tmp/kipoi")
output_dir.mkdir(exist_ok=True)

In [4]:
# Install line_profiler: https://github.com/rkern/line_profiler
# pip install line_profiler

In [5]:
%load_ext line_profiler

Then we need to know where the query VCF is located and where we want to store the results.

In [6]:
# The input vcf path
vcf_path = "example_data/clinvar_donor_acceptor_chr22.vcf"

Finally, we set the keyword arguments required to run the dataloader. We omit the dataloader's `intervals_file` argument because it is tagged as BED file input in `dataloader.yaml`: `score_variants` automatically populates it with a temporary BED file generated from the VCF, so that every variant in the input VCF is queried (the "variant-centered approach").

In [7]:
# The dataloader keyword arguments
dataloader_arguments = {"fasta_file": "example_data/hg19_chr22.fa"}
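
The variant-centered approach can be sketched as follows. This is a minimal illustration, not kipoi-veff's implementation: the helper names `variant_centered_intervals` and `to_bed_lines`, the `seq_length` parameter, and the exact centering convention are assumptions made for the example.

```python
def variant_centered_intervals(variants, seq_length):
    """Yield (chrom, start, end) intervals of width seq_length around each variant.

    `variants` is an iterable of (chrom, pos) tuples with 1-based VCF positions.
    The returned intervals follow the BED convention (0-based, half-open).
    Chromosome boundaries are not handled in this sketch.
    """
    for chrom, pos in variants:
        pos0 = pos - 1                      # convert to a 0-based position
        start = pos0 - seq_length // 2      # put the variant in the middle
        end = start + seq_length
        yield chrom, start, end


def to_bed_lines(intervals):
    """Format intervals as tab-separated BED lines."""
    return ["{}\t{}\t{}".format(chrom, start, end)
            for chrom, start, end in intervals]


# Hypothetical variants, used only to show the shape of the output
variants = [("chr22", 501), ("chr22", 2000)]
bed_lines = to_bed_lines(variant_centered_intervals(variants, seq_length=1000))
print("\n".join(bed_lines))
```

Each variant yields exactly one interval, so the number of rows scored equals the number of variants in the VCF.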

### Writing to the vcf file

In [8]:
sp.score_variants(model = model_name,
                  dl_args = dataloader_arguments,
                  input_vcf = vcf_path,
                  output_vcf = str(output_dir / "output.vcf"))

Using downloaded and verified file: /data/nasif12/home_if12/avsec/.kipoi/models/DeepSEA/variantEffects/downloaded/model_files/weights/35956ab9c28960b5a3693f470fe980c1


100%|██████████| 14/14 [00:03<00:00,  4.11it/s]


### Writing to a tsv file

**This is very slow (about 3.8 s / it in the run below)**

In [13]:
tsv_writer =  SyncBatchWriter(TsvBatchWriter(output_dir / "preds.tsv"))
sp.score_variants(model = model_name,
                  dl_args = dataloader_arguments,
                  input_vcf = vcf_path,
                  output_writers=tsv_writer,
                  output_vcf = None)

Using downloaded and verified file: /data/nasif12/home_if12/avsec/.kipoi/models/DeepSEA/variantEffects/downloaded/model_files/weights/35956ab9c28960b5a3693f470fe980c1
INFO [kipoi_veff.snv_predict] Using variant-centered sequence generation.


100%|██████████| 14/14 [00:54<00:00,  3.78s/it]


### Writing to an hdf5 file

**This is slow as well**

In [14]:
h5_writer = SyncBatchWriter(HDF5BatchWriter(str(output_dir / 'preds.h5')))

In [15]:
sp.score_variants(model = model_name,
                  dl_args = dataloader_arguments,
                  input_vcf = vcf_path,
                  output_writers=h5_writer,
                  output_vcf = None)

Using downloaded and verified file: /data/nasif12/home_if12/avsec/.kipoi/models/DeepSEA/variantEffects/downloaded/model_files/weights/35956ab9c28960b5a3693f470fe980c1
INFO [kipoi_veff.snv_predict] Using variant-centered sequence generation.


100%|██████████| 14/14 [00:52<00:00,  3.57s/it]


In [None]:
# remove the file after running
!rm {h5_writer.batch_writer.file_path}

## Code profiling

Let's use `line_profiler` to find the bottlenecks in the code. The function to profile line by line is specified with `-f function`.
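
Outside a notebook, where the `%lprun` magic is not available, a similar question can be answered at function level with the standard library. The sketch below uses `cProfile` instead of `line_profiler`, so it reports time per function rather than per line. The functions `slow_write`, `fast_predict`, and `run` are toy stand-ins invented for the example, not kipoi-veff code.

```python
import cProfile
import io
import pstats
import time


def slow_write(batch):
    """Toy stand-in for an expensive writer call."""
    time.sleep(0.01)


def fast_predict(batch):
    """Toy stand-in for a cheap prediction step."""
    return [x * 2 for x in batch]


def run(n_batches=5):
    for _ in range(n_batches):
        preds = fast_predict(list(range(100)))
        slow_write(preds)


# Profile the whole run and collect the report as text
profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats()
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that account for most of the wall time near the top of the report.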

In [9]:
tsv_writer =  SyncBatchWriter(TsvBatchWriter(output_dir / "preds.tsv"))
%lprun -f sp.predict_snvs sp.score_variants(model = model_name, dl_args=dataloader_arguments, input_vcf=vcf_path, output_writers=tsv_writer, output_vcf=None)

Using downloaded and verified file: /data/nasif12/home_if12/avsec/.kipoi/models/DeepSEA/variantEffects/downloaded/model_files/weights/35956ab9c28960b5a3693f470fe980c1


100%|██████████| 14/14 [00:08<00:00,  1.76it/s]


Timer unit: 1e-06 s

Total time: 9.0017 s
File: /data/nasif12/home_if12/avsec/workspace/kipoi/kipoi-veff/kipoi_veff/snv_predict.py
Function: predict_snvs at line 468

Line #      Hits         Time  Per Hit   % Time  Line Contents
   468                                           def predict_snvs(model,
   469                                                            dataloader,
   470                                                            vcf_fpath,
   471                                                            batch_size,
   472                                                            num_workers=0,
   473                                                            dataloader_args=None,
   474                                                            vcf_to_region=None,
   475                                                            vcf_id_generator_fn=default_vcf_id_gen,
   476                                                            evaluation_function=analyse_model_pre

In [26]:
!rm {output_dir}/preds.h5

In [28]:
# The wrapped writer is an HDF5 writer, so name the variable accordingly
h5_writer = SyncBatchWriter(AsyncBatchWriter(HDF5BatchWriter(output_dir / "preds.h5"), max_queue_size=2))
%lprun -f h5_writer.__call__ sp.score_variants(model = model_name, dl_args=dataloader_arguments, input_vcf=vcf_path, output_writers=h5_writer, output_vcf=None)

Using downloaded and verified file: /data/nasif12/home_if12/avsec/.kipoi/models/DeepSEA/variantEffects/downloaded/model_files/weights/35956ab9c28960b5a3693f470fe980c1


100%|██████████| 14/14 [00:01<00:00, 10.09it/s]


Timer unit: 1e-06 s

Total time: 0.010126 s
File: <ipython-input-13-eab3274b67ce>
Function: __call__ at line 10

Line #      Hits         Time  Per Hit   % Time  Line Contents
    10                                               def __call__(self, predictions, records, line_ids=None):
    11        14        374.0     26.7      3.7          validate_input(predictions, records, line_ids)
    12                                           
    13        14         20.0      1.4      0.2          if line_ids is None:
    14                                                       line_ids = {}
    15                                           
    16        14       6524.0    466.0     64.4          batch = numpy_collate([variant_to_dict(v) for v in records])
    17        14         93.0      6.6      0.9          batch['line_idx'] = np.array(line_ids)
    18        14       2005.0    143.2     19.8          batch['preds'] = {k: df.values for k, df in six.iteritems(predictions)}
    19        

`df_to_np_dict` is the bottleneck

## Buffer writing

Let's use buffer_size=5
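
The idea behind buffered, asynchronous writing can be sketched with a bounded queue and a background thread. This is a hypothetical illustration and not the implementation of `kipoi.writers.AsyncBatchWriter`: the class names `ListWriter` and `BufferedAsyncWriter` and their method names are assumptions made for the example.

```python
import queue
import threading


class ListWriter:
    """Toy stand-in for a batch writer: collects batches in memory."""

    def __init__(self):
        self.batches = []

    def batch_write(self, batch):
        self.batches.append(batch)

    def close(self):
        pass


class BufferedAsyncWriter:
    """Hypothetical wrapper that writes batches from a background thread."""

    _STOP = object()  # sentinel telling the worker thread to finish

    def __init__(self, writer, max_queue_size=2):
        self.writer = writer
        self.queue = queue.Queue(maxsize=max_queue_size)
        self.thread = threading.Thread(target=self._drain, daemon=True)
        self.thread.start()

    def _drain(self):
        # Runs in the background thread: write batches in arrival order
        while True:
            batch = self.queue.get()
            if batch is self._STOP:
                break
            self.writer.batch_write(batch)

    def batch_write(self, batch):
        # Returns immediately unless the queue is full
        self.queue.put(batch)

    def close(self):
        # Flush everything still queued, then close the wrapped writer
        self.queue.put(self._STOP)
        self.thread.join()
        self.writer.close()


sink = ListWriter()
writer = BufferedAsyncWriter(sink, max_queue_size=2)
for i in range(5):
    writer.batch_write({"batch": i})
writer.close()
print([b["batch"] for b in sink.batches])
```

The bounded queue caps memory use: when the writer falls behind by more than `max_queue_size` batches, the producer blocks until a slot frees up.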

In [9]:
from kipoi.writers import AsyncBatchWriter

In [10]:
%load_ext autoreload

In [11]:
%autoreload 2

In [12]:
from kipoi_veff.utils.io import *

In [23]:
tsv_writer =  SyncBatchWriter(TsvBatchWriter(output_dir / "preds.tsv"))
sp.score_variants(model = model_name, dl_args=dataloader_arguments, input_vcf=vcf_path, output_writers=tsv_writer, output_vcf=None)

Using downloaded and verified file: /data/nasif12/home_if12/avsec/.kipoi/models/DeepSEA/variantEffects/downloaded/model_files/weights/35956ab9c28960b5a3693f470fe980c1



100%|██████████| 14/14 [00:05<00:00,  2.97it/s]

This took about 6 seconds in total instead of more than 50. Is writing still the bottleneck?

In [29]:
# The wrapped writer is an HDF5 writer, so name the variable accordingly
h5_writer = SyncBatchWriter(AsyncBatchWriter(HDF5BatchWriter(output_dir / "preds.h5"), max_queue_size=2))
%lprun -f sp.predict_snvs sp.score_variants(model = model_name, dl_args=dataloader_arguments, input_vcf=vcf_path, output_writers=h5_writer, output_vcf=None)

Using downloaded and verified file: /data/nasif12/home_if12/avsec/.kipoi/models/DeepSEA/variantEffects/downloaded/model_files/weights/35956ab9c28960b5a3693f470fe980c1


100%|██████████| 14/14 [00:01<00:00, 12.45it/s]


Timer unit: 1e-06 s

Total time: 11.2923 s
File: /data/nasif12/home_if12/avsec/workspace/kipoi/kipoi-veff/kipoi_veff/snv_predict.py
Function: predict_snvs at line 468

Line #      Hits         Time  Per Hit   % Time  Line Contents
   468                                           def predict_snvs(model,
   469                                                            dataloader,
   470                                                            vcf_fpath,
   471                                                            batch_size,
   472                                                            num_workers=0,
   473                                                            dataloader_args=None,
   474                                                            vcf_to_region=None,
   475                                                            vcf_id_generator_fn=default_vcf_id_gen,
   476                                                            evaluation_function=analyse_model_pr