Table of contents
* [Introduction](#title-1)
* [AlphaFold clusters database [mandatory]](#title-2)
* [Step 1: Download sequences](#subtitle-2-1)
* [Step 2: Sequence extraction](#subtitle-2-2)
* [Step 3: Building the sequences database with MMseqs2](#subtitle-2-3)
* [Step 4: Extracting Pfam annotations from UniProt IDs](#subtitle-2-4)
* [Use in LAGOON-MCL](#subtitle-2-5)
* [Pfam database [optional]](#title-3)
* [Downloading and building the database](#subtitle-3-1)
* [Use in LAGOON-MCL](#subtitle-3-2)
Introduction
LAGOON-MCL uses two databases to obtain functional and structural information for user-supplied sequences.
1. **AlphaFold clusters database:** Based on the [AlphaFold Protein Structure Database](https://www.alphafold.ebi.ac.uk/).
2. **Pfam database:** Based on the [Pfam database](https://www.ebi.ac.uk/interpro/download/Pfam/).
To prepare these databases, two scripts are provided in the `tool-kit` folder:
- `build_pfam_db.sh`: Downloads and constructs the Pfam database.
- `build_alphafold_db.sh`: Downloads and builds the AlphaFold clusters database and retrieves Pfam annotations via InterPro.
These scripts rely on the [MMseqs2](), [SeqKit](), and [LAGOON-MCL]() containers.
For more information on downloading containers, see [1. Usage]().
AlphaFold clusters database [mandatory]
This section describes how the **MMseqs2 sequence database** is built from the AlphaFold clusters database.
Step 1: Download sequences
The first step is to download the **210 million sequences** (in FASTA format) from the AlphaFold Protein Structure Database, along with the sequence identifiers from the AlphaFold clusters database.
- **FASTA file:** The file containing the 210 million sequences is available [here]().
- **TSV file:** The file containing the sequence IDs of the AlphaFold clusters database is available [here]().
Step 2: Sequence extraction
The second step consists of **extracting the FASTA sequences** corresponding to the identifiers listed in the TSV file.
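The extraction can be illustrated with a minimal, dependency-free sketch on toy data. The helper name `extract_by_ids`, the file names, and the toy records are hypothetical. The sketch assumes that the identifiers sit in the first column of the TSV file and that each FASTA header starts with the identifier; real AlphaFold headers may need the accession parsed out first. It is not the code used by `build_alphafold_db.sh`.

```bash
# Hypothetical helper: print the FASTA records whose ID is listed in a TSV file.
# Assumption: IDs are in column 1 of the TSV and are the first token of each header.
extract_by_ids() {
    # $1 = TSV file with identifiers in column 1, $2 = FASTA file
    awk -F'\t' '
        NR == FNR { wanted[$1] = 1; next }   # first file: remember the IDs
        /^>/      { split(substr($0, 2), h, /[ \t]/); keep = (h[1] in wanted) }
        keep      { print }                  # header and sequence lines of kept records
    ' "$1" "$2"
}

# Toy data (placeholder identifiers, not real database content).
tmp=$(mktemp -d)
printf 'A0A001\tclusterA\nQ9XYZ1\tclusterA\n' > "$tmp/clusters.tsv"
printf '>A0A001 toy protein\nMKT\nAAL\n>P00000 not listed\nMSS\n>Q9XYZ1\nMVV\n' > "$tmp/all.fasta"

extract_by_ids "$tmp/clusters.tsv" "$tmp/all.fasta" > "$tmp/selected.fasta"
grep -c '^>' "$tmp/selected.fasta"   # number of records kept
```

For the full 210 million sequences, a dedicated tool is a better fit: SeqKit offers a comparable ID-based selection with `seqkit grep -f ids.txt all.fasta`.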
Step 3: Building the sequences database with MMseqs2
The extracted sequences are then converted into an MMseqs2 database. The resulting AlphaFold clusters database is located in the `database/alphafoldDB` directory and consists of the following files:
- `alphafoldDB`
- `alphafoldDB.dbtype`
- `alphafoldDB_h`
- `alphafoldDB_h.dbtype`
- `alphafoldDB_h.index`
- `alphafoldDB.index`
- `alphafoldDB.lookup`
- `alphafoldDB.source`
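A file set like this is what `mmseqs createdb` writes, so the build step can be sketched as below. This is an assumption about the step, not an excerpt from `build_alphafold_db.sh`: the helper name `build_alphafold_db`, the `MMSEQS` override variable, and the input file name are hypothetical.

```bash
# Hypothetical helper; not part of the LAGOON-MCL tool-kit.
# MMSEQS can be overridden, e.g. to call MMseqs2 through a container wrapper.
MMSEQS="${MMSEQS:-mmseqs}"

build_alphafold_db() {
    # $1 = FASTA file from the extraction step
    # $2 = output folder, e.g. database/alphafoldDB
    # $3 = database name, e.g. alphafoldDB
    mkdir -p "$2" || return 1
    # `mmseqs createdb <input.fasta> <outputDB>` writes <outputDB> and its
    # companion files next to it. $MMSEQS is left unquoted on purpose so that
    # a multi-word wrapper command is split into separate arguments.
    $MMSEQS createdb "$1" "$2/$3"
}

# Example call (commented out so that the sketch has no side effects):
# build_alphafold_db alphafold_clusters.fasta database/alphafoldDB alphafoldDB
```

Keeping the MMseqs2 command behind a variable makes it easy to swap the native binary for a containerized one without touching the function.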
Step 4: Extracting Pfam annotations from UniProt IDs
This step retrieves **Pfam annotations** linked to AlphaFold IDs from the InterPro database. \
Note that sequence IDs in the AlphaFold clusters database correspond to **UniProtKB IDs**.
The resulting file is located at: `database/uniprot_function.json`.
Use in LAGOON-MCL
To use the **`alphafoldDB`** database and the **`uniprot_function.json`** file, you can use the parameters below. \
The default settings in LAGOON-MCL already use these paths and parameters.
- `--alphafoldDB /path/to/database/alphafoldDB` : Path to the folder containing the **alphafoldDB** database.
- `--alphafoldDB_name alphafoldDB` : Name of the **alphafoldDB** database files.
- `--uniprot /path/to/database/uniprot_function.json` : JSON file containing Pfam annotations linked to **UniProtKB** sequences.
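Judging from the file layout in Step 3, the folder and name options appear to combine into the MMseqs2 database prefix `<folder>/<name>`. Under that assumption, a short pre-flight check can confirm that the files exist before a run is launched. The helper name `check_mmseqs_db` is hypothetical, and the suffix list is copied from the file list in Step 3.

```bash
# Hypothetical pre-flight check; not part of LAGOON-MCL.
check_mmseqs_db() {
    # $1 = folder given to --alphafoldDB, $2 = name given to --alphafoldDB_name
    prefix="$1/$2"
    rc=0
    # The empty suffix stands for the main database file itself.
    for suffix in "" .dbtype .index .lookup .source _h _h.dbtype _h.index; do
        if [ ! -e "${prefix}${suffix}" ]; then
            echo "missing: ${prefix}${suffix}" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example call (commented out; adjust the path to your installation):
# check_mmseqs_db /path/to/database/alphafoldDB alphafoldDB && echo "alphafoldDB looks complete"
```

The function returns a non-zero status and names each missing file on standard error, so it can guard a launch command with `&&`.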
Pfam database [optional]
Downloading and building the database
The `build_pfam_db.sh` script builds the **Pfam profile database** using MMseqs2.
The database is constructed with the following MMseqs2 command (from the MMseqs2 documentation):
```bash
mmseqs databases Pfam-A.full pfamDB tmp
```
The Pfam database is located in the `database/pfamDB` directory and consists of the following files:
* `pfamDB`
* `pfamDB.dbtype`
* `pfamDB_h`
* `pfamDB_h.dbtype`
* `pfamDB_h.index`
* `pfamDB.index`
* `pfamDB.version`
Use in LAGOON-MCL
To use the **`pfamDB`** database in LAGOON-MCL, you can use the parameters below. \
The default settings in LAGOON-MCL already use these paths and parameters.
- `--pfamDB /path/to/database/pfamDB` : Path to the folder containing the **pfamDB** database.
- `--pfamDB_name pfamDB` : Name of the **pfamDB** database files.