4. Database
LAGOON-MCL uses two databases to obtain information (function and structure) for sequences supplied by the user.
The first database is derived from the AlphaFold clusters database, which is itself built on the AlphaFold Protein Structure Database.
The second database used by LAGOON-MCL is derived from the Pfam database.
To prepare the two databases, two scripts are provided in the tool-kit folder.
- build_pfam_db.sh: downloads and builds the Pfam database.
- build_alphafold_db.sh: downloads and builds the AlphaFold clusters database and retrieves Pfam annotations from InterPro.
The scripts rely on the MMseqs2, SeqKit and LAGOON-MCL containers. For more information on downloading the containers, please see 1. Usage.
Construction of the MMseqs2 sequence database.
The first step is to download the 210 million sequences (in FASTA format) from the AlphaFold Protein Structure Database, together with the sequence identifiers present in the AlphaFold clusters database.
- [FASTA file] The file with the 210 million sequences is available here.
- [TSV file] The file containing the sequence IDs of the AlphaFold clusters database is available here.
The second step extracts from the FASTA file the sequences whose identifiers are listed in the TSV file.
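The extraction step can be illustrated with a small, portable sketch. It is not the code used by build_alphafold_db.sh, which runs SeqKit from a container; it only shows the idea with awk. It assumes that the identifiers sit in the first tab-separated column of the TSV file and that each FASTA header begins with `>` followed directly by the same identifier. Both assumptions should be checked against the downloaded files. The function name `extract_by_id` is illustrative.

```shell
# Sketch: keep only the FASTA records whose ID appears in column 1 of a TSV.
# Assumptions (not taken from the LAGOON-MCL scripts): IDs are in the first
# tab-separated column, and each header is ">" + ID + optional description.
extract_by_id() {
    ids_tsv=$1
    fasta_in=$2
    awk -F '\t' '
        FILENAME == ARGV[1] { wanted[$1] = 1; next }  # first file: remember IDs
        /^>/ {
            split(substr($0, 2), head, /[ \t]/)       # ID = first token of header
            keep = (head[1] in wanted)
        }
        keep { print }                                # header and sequence lines
    ' "$ids_tsv" "$fasta_in"
}
```

With hypothetical file names, `extract_by_id clusters.tsv sequences.fasta > clusters.fasta` would write the selected records. SeqKit offers a comparable operation through `seqkit grep -f`, and a FASTA file of this kind can be turned into an MMseqs2 sequence database with `mmseqs createdb`.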
The database is located in the database/alphafoldDB directory.
It consists of the following files:
- alphafoldDB
- alphafoldDB.dbtype
- alphafoldDB_h
- alphafoldDB_h.dbtype
- alphafoldDB_h.index
- alphafoldDB.index
- alphafoldDB.lookup
- alphafoldDB.source
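After running the build script, it can be useful to confirm that every file in the listing above is present. The following is a minimal sketch of such a check; the function name `check_seq_db` is illustrative and not part of LAGOON-MCL, and the suffix list simply mirrors the listing.

```shell
# Sketch: report which files of an MMseqs2 sequence database are missing.
# The suffix list mirrors the file listing above; check_seq_db is an
# illustrative helper, not a LAGOON-MCL command.
check_seq_db() {
    db_dir=$1
    db_name=$2
    missing=0
    for suffix in "" .dbtype .index .lookup .source _h _h.dbtype _h.index; do
        if [ ! -e "$db_dir/$db_name$suffix" ]; then
            echo "missing: $db_dir/$db_name$suffix" >&2
            missing=1
        fi
    done
    # Exit status 0 when all files exist, 1 otherwise.
    return "$missing"
}
```

For example, `check_seq_db database/alphafoldDB alphafoldDB` returns a non-zero status and names each absent file on standard error.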
The next step retrieves, from the InterPro database, the Pfam annotations linked to AlphaFold IDs. Sequence IDs in the AlphaFold clusters database correspond to UniProtKB IDs.
The resulting file is located at database/uniprot_function.json.
To use the alphafoldDB database and the uniprot_function.json file, you can use the parameters below.
The default settings for LAGOON-MCL already use these paths and parameters.
- --alphafoldDB /path/to/database/alphafoldDB: path to folder with alphafoldDB database
- --alphafoldDB_name alphafoldDB: alphafoldDB database file names
- --uniprot /path/to/database/uniprot_function.json: JSON file with Pfam annotations linked to UniProtKB sequences
The build_pfam_db.sh script builds the Pfam profile database with MMseqs2. This database is built using the command provided in the MMseqs2 documentation: mmseqs databases Pfam-A.full pfamDB tmp, where pfamDB is both the name and the folder of the database.
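To make the "name and folder" point concrete, here is a hedged sketch of a wrapper around that command. It is not a copy of build_pfam_db.sh, which runs MMseqs2 from a container and may differ in detail. The function name `build_pfam_profiles` and its optional argument are illustrative, and the sketch assumes `mmseqs` is available on the PATH.

```shell
# Sketch of a wrapper around the MMseqs2 command quoted above.
# build_pfam_profiles and its argument are illustrative names; the real
# build_pfam_db.sh runs MMseqs2 from a container and may differ.
build_pfam_profiles() {
    out_dir=${1:-database/pfamDB}
    command -v mmseqs >/dev/null 2>&1 || {
        echo "mmseqs not found" >&2
        return 1
    }
    mkdir -p "$out_dir" || return 1
    # Run inside the target folder so the pfamDB files are created there.
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$out_dir" && mmseqs databases Pfam-A.full pfamDB tmp )
}
```

Called without arguments, the function creates database/pfamDB relative to the current directory and downloads the Pfam profiles into it, so a network connection is required.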
The database is located in the database/pfamDB directory.
It consists of the following files:
- pfamDB
- pfamDB.dbtype
- pfamDB_h
- pfamDB_h.dbtype
- pfamDB_h.index
- pfamDB.index
- pfamDB.version
To use the pfamDB database, you can use the parameters below.
The default settings for LAGOON-MCL already use these paths and parameters.
- --pfamDB /path/to/database/pfamDB: path to folder with pfamDB database
- --pfamDB_name pfamDB: pfamDB database file names
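When the databases live outside the default location, all five database parameters have to point at the same root folder. The sketch below prints them for a given root so they can be appended to the usual LAGOON-MCL launch command (see 1. Usage). The helper name `lagoon_db_options` is illustrative, and the layout under the root is assumed to match the folders described in this section.

```shell
# Sketch: print the database options for a LAGOON-MCL run, one per line.
# lagoon_db_options is an illustrative helper; the folder layout under
# db_root is assumed to match the one described in this section.
lagoon_db_options() {
    db_root=$1
    printf '%s\n' \
        "--alphafoldDB $db_root/alphafoldDB" \
        "--alphafoldDB_name alphafoldDB" \
        "--uniprot $db_root/uniprot_function.json" \
        "--pfamDB $db_root/pfamDB" \
        "--pfamDB_name pfamDB"
}
```

For example, `lagoon_db_options /path/to/database` prints the five options with that root substituted. If the root path contains no spaces, the output can be passed directly through command substitution.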