4. Database
LAGOON-MCL uses two databases to obtain functional and structural information for user-supplied sequences.
- AlphaFold clusters database: Based on the AlphaFold Protein Structure Database.
- Pfam database: Based on the Pfam database.
To prepare these databases, two scripts are provided in the tool-kit folder:
- build_pfam_db.sh: Downloads and constructs the Pfam database.
- build_alphafold_db.sh: Downloads and builds the AlphaFold clusters database and retrieves Pfam annotations via InterPro.
These scripts rely on the MMseqs2, SeqKit, and LAGOON-MCL containers.
For more information on downloading containers, see 1. Usage.
Construction of the MMseqs2 sequence database.
The first step is to download the 210 million sequences (in FASTA format) from the AlphaFold Protein Structure Database, along with the sequence identifiers from the AlphaFold clusters database.
- FASTA file: The file containing the 210 million sequences is available here.
- TSV file: The file containing the sequence IDs of the AlphaFold clusters database is available here.
The second step consists of extracting the FASTA sequences corresponding to the identifiers listed in the TSV file.
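The extraction step can be illustrated with a minimal sketch. It is not the code used by build_alphafold_db.sh: the provided scripts rely on SeqKit, whereas this version uses plain awk so that it runs without any container. The function name `filter_fasta_by_ids` and the file names in the usage comments are hypothetical, and the sketch assumes that each identifier is the first whitespace-delimited token of a FASTA header.

```shell
# Hypothetical helper: keep only the FASTA records whose ID appears in an ID list.
# Assumption: the ID is the first token of the header line, without the leading ">".
# Usage: filter_fasta_by_ids <ids file, one ID per line> <FASTA file>
filter_fasta_by_ids() {
    ids_file=$1
    fasta_file=$2
    awk '
        # First file: load the identifiers to keep.
        FILENAME == ARGV[1] { keep[$1] = 1; next }
        # Header line: decide whether this record is kept.
        /^>/ { id = substr($1, 2); emit = (id in keep) }
        # Print the header and sequence lines of kept records.
        emit { print }
    ' "$ids_file" "$fasta_file"
}

# Example usage (file names are placeholders):
#   filter_fasta_by_ids cluster_ids.txt sequences.fasta > clusters.fasta
#   mmseqs createdb clusters.fasta alphafoldDB
```

With SeqKit available, `seqkit grep -f cluster_ids.txt sequences.fasta` performs a comparable ID-based filter. The filtered FASTA file can then be converted into an MMseqs2 sequence database with `mmseqs createdb`.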
The AlphaFold database is located in the database/alphafoldDB directory and consists of the following files:
- alphafoldDB
- alphafoldDB.dbtype
- alphafoldDB_h
- alphafoldDB_h.dbtype
- alphafoldDB_h.index
- alphafoldDB.index
- alphafoldDB.lookup
- alphafoldDB.source
This step retrieves Pfam annotations linked to AlphaFold IDs from the InterPro database.
Note that sequence IDs in the AlphaFold clusters database correspond to UniProtKB IDs.
The resulting file is located at: database/uniprot_function.json.
To use the alphafoldDB database and the uniprot_function.json file, you can use the parameters below.
The default settings in LAGOON-MCL already use these paths and parameters.
- --alphafoldDB /path/to/database/alphafoldDB: Path to the folder containing the alphafoldDB database.
- --alphafoldDB_name alphafoldDB: Name of the alphafoldDB database files.
- --uniprot /path/to/database/uniprot_function.json: JSON file containing Pfam annotations linked to UniProtKB sequences.
The build_pfam_db.sh script builds the Pfam profile database using MMseqs2.
The database is constructed with the following MMseqs2 command (from the MMseqs2 documentation):
    mmseqs databases Pfam-A.full pfamDB tmp

The Pfam database is located in the database/pfamDB directory and consists of the following files:
- pfamDB
- pfamDB.dbtype
- pfamDB_h
- pfamDB_h.dbtype
- pfamDB_h.index
- pfamDB.index
- pfamDB.version
To use the pfamDB database in LAGOON-MCL, you can use the parameters below.
The default settings in LAGOON-MCL already use these paths and parameters.
- --pfamDB /path/to/database/pfamDB: Path to the folder containing the pfamDB database.
- --pfamDB_name pfamDB: Name of the pfamDB database files.
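Before pointing LAGOON-MCL at a custom database location, it can help to confirm that the database files are actually present. The sketch below is not part of LAGOON-MCL: the function name `check_mmseqs_db` is hypothetical, and it only checks the six files that appear in both file lists of this page (the database, its header database, and their .dbtype and .index files). It does not check alphafoldDB.lookup, alphafoldDB.source, or pfamDB.version.

```shell
# Hypothetical helper: report whether the core files of an MMseqs2 database exist.
# Usage: check_mmseqs_db <directory> <database name>
# Returns 0 if all six core files are present, 1 otherwise.
check_mmseqs_db() {
    db_dir=$1
    db_name=$2
    missing=0
    # Suffixes shared by the alphafoldDB and pfamDB file lists.
    for suffix in "" .dbtype .index _h _h.dbtype _h.index; do
        if [ ! -e "$db_dir/$db_name$suffix" ]; then
            echo "missing: $db_dir/$db_name$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (paths are placeholders matching the parameters above):
#   check_mmseqs_db /path/to/database/alphafoldDB alphafoldDB
#   check_mmseqs_db /path/to/database/pfamDB pfamDB
```

The two arguments mirror the parameter pairs above: the directory corresponds to --alphafoldDB or --pfamDB, and the name corresponds to --alphafoldDB_name or --pfamDB_name.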