Add capability for localcolabfold to run vs a locally hosted database (FSx Lustre or local web-server) #19

Closed
nfg-interline opened this issue May 18, 2022 · 0 comments · Fixed by #23
Labels
enhancement Improvement for existing functionality

nfg-interline commented May 18, 2022

Description of feature

Currently, the localcolabfold.nf script runs colabfold_batch against the public web server hosted by the ColabFold developers (https://github.com/sokrypton/ColabFold). This may not be obvious to first-time ColabFold users, because the workflow still checks for the native AF2 database files in the database path, even though it does not use those files when the mode is set to colabfold.

The issue with using the public web server is that it is a limited resource. After anywhere from 10 to 100 jobs have been submitted, the web server kills additional jobs (see the attached file "command-2.err.txt") and the submitting IP address can remain blocked for some time (1-2 days). This rate limiting is necessary to keep the web server operational and serving the community.

A scalable solution for running colabfold would be highly beneficial to all nf-core/proteinfold users. Two options exist:

Solution 1: Query against a locally installed ColabFold database. The ColabFold sequence databases are distinct from the native AF2 sequence databases, because ColabFold uses mmseqs2 rather than jackhmmer for MSA generation. They could be installed locally on an FSx Lustre filesystem. AWS developers have provided such a solution using FSx. NOTE: this would also require the issue "Split colabfold_batch command in localcolabfold.nf into two separate, colabfold_search + colabfold_batch, commands" to be resolved, since colabfold_batch will not run against a file system, only a web server.
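
As a rough illustration of the two-step split that Solution 1 depends on, the sketch below builds the two command lines in Python. It assumes the positional argument order shown in the ColabFold README (`colabfold_search <fasta> <db_dir> <msa_dir>`, then `colabfold_batch <msa_dir> <out_dir>`); the function name and all paths are hypothetical, and this is not the pipeline's actual implementation.

```python
from typing import List


def search_then_batch_cmds(
    fasta: str, db_dir: str, msa_dir: str, out_dir: str
) -> List[List[str]]:
    """Build the two commands for an offline ColabFold run.

    Step 1 generates MSAs with mmseqs2 against a database on the local
    file system (for example an FSx Lustre mount). Step 2 predicts
    structures from those precomputed MSAs, so no web server is contacted.
    The argument order is an assumption based on the ColabFold README.
    """
    search = ["colabfold_search", fasta, db_dir, msa_dir]
    batch = ["colabfold_batch", msa_dir, out_dir]
    return [search, batch]


# Hypothetical paths, for illustration only.
for cmd in search_then_batch_cmds(
    "input.fasta", "/fsx/colabfold_db", "msas", "predictions"
):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` or used as the script body of a separate Nextflow process.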

Solution 2: Provide an option for users to pass the URL of their own locally hosted web server to the workflow. The local web server can be set up following the instructions here.
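
A minimal sketch of how Solution 2 could be wired up, assuming the workflow simply forwards a user-supplied URL to the `--host-url` option of colabfold_batch. The function name, the example URL, and the validation step are hypothetical.

```python
from typing import List
from urllib.parse import urlparse


def batch_with_host_cmd(fasta: str, out_dir: str, host_url: str) -> List[str]:
    """Build a colabfold_batch command that targets a user-supplied MSA server.

    Rejects anything that is not an http(s) URL with a host, so a typo fails
    before a job is submitted. Forwarding via --host-url is an assumption
    about how the workflow option would be implemented.
    """
    parsed = urlparse(host_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a valid web server URL: {host_url!r}")
    return ["colabfold_batch", "--host-url", host_url, fasta, out_dir]


# Hypothetical local server address, for illustration only.
print(" ".join(batch_with_host_cmd("input.fasta", "predictions", "http://localhost:8080")))
```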

Of the two solutions, Solution 1 is preferable for most users, because it only requires that a database be maintained on a file system. Solution 2 would require a web server either to be running constantly or to be configured and launched as part of the pipeline execution, which is overly complex. The web server requires at least 2 TB of RAM, which is a considerable overhead if it runs constantly.
