Processing of life sciences data with large reference files #4271

rjg2186 · 2023-09-06T09:00:10Z

rjg2186
Sep 6, 2023

Hi Team,

I am trying Nextflow with GLS (not yet updated to GCB) that uses a BLAST database files of around 335 GB (nt BLAST database from NCBI) and the input sequence query is around 100 sequences.

The task is to perform BLASTN for the input sequence against the 335 GB nt BLAST database. In the .config file, the GS bucket path that contains the database files are added.

The process was submitted using two different process as shown below

n2-highmem-48 with 48 cpus and 350GB memory, 50 sequences as input, complete database
n2-highmem-32 with 32 cpus and 250GB memory, 50 sequences as input, complete database
BLAST docker image from the official release page
In both the above cases, the stageIn option is set as 'symlink' for the input files

From the first case, the complete process takes 1hr 22m, with BLAST analysis taking 8m 4s
For the second case, the same process takes 4h 18m, with BLAST analysis taking 3h 6m

Could the team please let me know why it is taking 1hr 22m in the first case and around 4h in second case, since the same process in one of the high end local linux machine takes only 10 minutes to complete. The BLAST database/index files are usually loaded into memory during the alignment process, as seen in the local linux machines.

Does the 335GB BLAST database is being copied to the instance that is spinned off or to the BLAST container that is native for the execution ?

Could the team also suggest recommended/best practice to handle this much large data using Nextflow

Any inputs/suggestions is highly appreciated. Thanks you !

bentsherman · 2023-09-06T16:44:31Z

bentsherman
Sep 6, 2023
Maintainer

The Nextflow trace metrics can help you here. You can take complete - start, which is the total runtime of a VM, and subtract realtime, which is the runtime of just your task script, and that will give you a rough estimate of how much time was spent staging the input files (and unstaging the output files back to the GCS bucket).

So I would compare both realtime and complete - start - realtime for those two tasks to see how much the extra runtime is due to copying files vs the actual computation.

As for solutions, migrating to Google Batch would be a good first step.

2 replies

rjg2186 Sep 7, 2023
Author

@bentsherman
Thank you for the reply.
Does the Google Batch have option to mount the GS bucket with the BLAST database to running container ? Based on my understanding, may be the current GLS might not support this option and it tries to copy the BLAST database to the instance.

bentsherman Sep 7, 2023
Maintainer

Indeed, Google Batch automatically mounts the GCS bucket into the container for you. It still has to copy the file under the hood, but it might be a little faster.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Processing of life sciences data with large reference files #4271

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

Processing of life sciences data with large reference files #4271

rjg2186 Sep 6, 2023

Replies: 1 comment · 2 replies

bentsherman Sep 6, 2023 Maintainer

rjg2186 Sep 7, 2023 Author

bentsherman Sep 7, 2023 Maintainer

rjg2186
Sep 6, 2023

Replies: 1 comment 2 replies

bentsherman
Sep 6, 2023
Maintainer

rjg2186 Sep 7, 2023
Author

bentsherman Sep 7, 2023
Maintainer