Description of feature
Currently, the localcolabfold.nf script relies on a colabfold_batch command that queries the public web server hosted by the ColabFold developers (https://github.com/sokrypton/ColabFold). This may not be obvious to first-time users of ColabFold, because the workflow checks for AF2 native database files in the database path; however, it does not use those database files when the mode is set to colabfold.
The problem with using the public web server is that it is a limited resource. After anywhere from 10 to 100 jobs are submitted, additional jobs are killed by the web server (see the attached file command-2.err.txt), and the submitting IP address can remain blocked for some time (1-2 days). This throttling is necessary to keep the web server operational and serving the community.
A scalable solution for running ColabFold would be highly beneficial to all nf-core/proteinfold users. Two options exist:
Solution 1: query against a locally installed ColabFold database. The ColabFold sequence databases, which are distinct from the AF2 native sequence databases because ColabFold uses MMseqs2 rather than jackhmmer for MSA generation, could be installed locally on an FSx for Lustre filesystem. AWS developers have provided such a solution using FSx. NOTE: this would also require the issue "Split colabfold_batch command in localcolabfold.nf into two separate, colabfold_search + colabfold_batch, commands" to be resolved, since colabfold_batch will not run against a filesystem, only a web server.
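Under Solution 1, the split invocation might look roughly like the following sketch. The database mount point and file names are assumptions for illustration, not paths used by the pipeline:

```shell
# Step 1: build MSAs locally with MMseqs2 against the ColabFold databases
# (here assumed to be mounted on an FSx for Lustre filesystem at /fsx/colabfold_db)
colabfold_search input.fasta /fsx/colabfold_db msas/

# Step 2: predict structures from the precomputed MSAs,
# without ever contacting the public MSA server
colabfold_batch msas/ results/
```

This keeps the heavy MSA search on the database-attached nodes and lets the GPU prediction step run independently.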
Solution 2: provide an option for users to pass their own locally hosted web server URL to the workflow. The local web server can be set up following the instructions here.
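Under Solution 2, pointing colabfold_batch at a self-hosted MSA server might look like this sketch; the URL is a placeholder for the user's own deployment:

```shell
# Direct colabfold_batch at a user-hosted MMseqs2 API server
# instead of the public ColabFold server (URL is a placeholder)
colabfold_batch --host-url http://my-mmseqs-server:8888 input.fasta results/
```

The pipeline could expose the server URL as a workflow parameter and forward it to this flag.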
Of the two solutions, Solution 1 is preferable for most users, because it only requires that a database be maintained on a filesystem. Solution 2 would require a web server to be running constantly, or to be configured and launched as part of the pipeline execution (overly complex). The web server requires at least 2 TB of RAM, which amounts to considerable overhead if kept running constantly.