Table of contents

* [Introduction](#title-1)
* [AlphaFold clusters database [mandatory]](#title-2)
  * [Step 1: Download sequences](#subtitle-2-1)
  * [Step 2: Sequence extraction](#subtitle-2-2)
  * [Step 3: Building the sequences database with MMseqs2](#subtitle-2-3)
  * [Step 4: Extracting Pfam annotations from UniProt IDs](#subtitle-2-4)
  * [Use in LAGOON-MCL](#subtitle-2-5)
* [Pfam database [optional]](#title-3)
  * [Downloading and building the database](#subtitle-3-1)
  * [Use in LAGOON-MCL](#subtitle-3-2)

Introduction

LAGOON-MCL uses two databases to obtain functional and structural information for user-supplied sequences.

1. **AlphaFold clusters database:** Based on the [AlphaFold Protein Structure Database](https://www.alphafold.ebi.ac.uk/).
2. **Pfam database:** Based on the [Pfam database](https://www.ebi.ac.uk/interpro/download/Pfam/).

To prepare these databases, two scripts are provided in the `tool-kit` folder:

- `build_pfam_db.sh`: Downloads and constructs the Pfam database.
- `build_alphafold_db.sh`: Downloads and builds the AlphaFold clusters database and retrieves Pfam annotations via InterPro.

These scripts rely on the [MMseqs2](), [SeqKit](), and [LAGOON-MCL]() containers. For more information on downloading containers, see [1. Usage]().

AlphaFold clusters database [mandatory]

This section describes the construction of the **MMseqs2 sequence database**.

Step 1: Download sequences

The first step is to download the **210 million sequences** (in FASTA format) from the AlphaFold Protein Structure Database, along with the sequence identifiers from the AlphaFold clusters database.

- **FASTA file:** The file containing the 210 million sequences is available [here]().
- **TSV file:** The file containing the sequence IDs of the AlphaFold clusters database is available [here]().
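The download can be scripted roughly as follows. This is a sketch, not the content of `build_alphafold_db.sh`: the `fetch` helper, the `FASTA_URL` and `TSV_URL` variables, and the output file names are all hypothetical, and the URLs must be taken from the links above. Because the FASTA file is large, the sketch writes to a `.part` file so that an interrupted transfer can be resumed.

```bash
# Hypothetical helper: download URL $1 to file $2 unless $2 already exists.
fetch() {
  local url=$1 dest=$2
  if [ -s "$dest" ]; then
    echo "skip: $dest already present"
    return 0
  fi
  # -f: fail on HTTP errors, -L: follow redirects, -C -: resume a partial file.
  # The data lands in "$dest.part" and is renamed only after a complete download.
  curl -fL -C - -o "$dest.part" "$url" && mv "$dest.part" "$dest"
}

# Example calls (placeholders; set FASTA_URL and TSV_URL yourself):
# fetch "$FASTA_URL" sequences.fasta
# fetch "$TSV_URL"   cluster_ids.tsv
```

Re-running the calls after an interruption resumes the `.part` file instead of starting over, and files that are already complete are skipped.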

Step 2: Sequence extraction

The second step consists of **extracting the FASTA sequences** corresponding to the identifiers listed in the TSV file.
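The filtering logic can be illustrated with plain `awk`. The provided script relies on the SeqKit container, so the function below is only a stand-in to show the idea, under two assumptions that may not hold for the real files: the sequence ID is the first tab-separated column of the TSV, and the first whitespace-delimited token of each FASTA header (after `>`) is exactly that ID. The function name and file names are hypothetical.

```bash
# Hypothetical helper: print the FASTA records of $2 whose ID appears in
# column 1 of the TSV file $1.
filter_fasta_by_ids() {
  awk '
    # First file (TSV): remember every ID found in column 1.
    NR == FNR { split($0, f, "\t"); keep[f[1]] = 1; next }
    # Second file (FASTA): on a header line, decide whether to keep the record.
    /^>/      { id = substr($1, 2); show = (id in keep) }
    # Print header and sequence lines of kept records.
    show      { print }
  ' "$1" "$2"
}

# Example call (placeholder file names):
# filter_fasta_by_ids cluster_ids.tsv sequences.fasta > clusters.fasta
```

With SeqKit, an equivalent filter is `seqkit grep -f ids.txt sequences.fasta`, where `ids.txt` holds one ID per line.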

Step 3: Building the sequences database with MMseqs2

The extracted sequences are converted into an MMseqs2 sequence database. The resulting AlphaFold database is located in the `database/alphafoldDB` directory and consists of the following files:

- `alphafoldDB`
- `alphafoldDB.dbtype`
- `alphafoldDB_h`
- `alphafoldDB_h.dbtype`
- `alphafoldDB_h.index`
- `alphafoldDB.index`
- `alphafoldDB.lookup`
- `alphafoldDB.source`
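A quick way to confirm that the build completed is to check that every file listed above exists. The sketch below assumes the layout described in this section; `check_mmseqs_db` and `clusters.fasta` are hypothetical names. The commented `mmseqs createdb` line shows the standard MMseqs2 command for turning a FASTA file into a sequence database; any extra options used by `build_alphafold_db.sh` are not shown here.

```bash
# Standard MMseqs2 call to build a sequence database from a FASTA file
# (input file name is a placeholder):
# mmseqs createdb clusters.fasta database/alphafoldDB/alphafoldDB

# Hypothetical helper: verify that directory $1 contains all files of the
# MMseqs2 sequence database named $2. Returns 1 if any file is missing.
check_mmseqs_db() {
  local dir=$1 name=$2 missing=0 suffix
  for suffix in "" .dbtype .index .lookup .source _h _h.dbtype _h.index; do
    if [ ! -e "$dir/$name$suffix" ]; then
      echo "missing: $dir/$name$suffix" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call:
# check_mmseqs_db database/alphafoldDB alphafoldDB && echo "database complete"
```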

Step 4: Extracting Pfam annotations from UniProt IDs

This step retrieves **Pfam annotations** linked to AlphaFold IDs from the InterPro database. \
Note that sequence IDs in the AlphaFold clusters database correspond to **UniProtKB IDs**.

The resulting file is located at: `database/uniprot_function.json`.

Use in LAGOON-MCL

To use the **`alphafoldDB`** database and the **`uniprot_function.json`** file, you can use the parameters below. \
The default settings in LAGOON-MCL already use these paths and parameters.

- `--alphafoldDB /path/to/database/alphafoldDB` : Path to the folder containing the **alphafoldDB** database.
- `--alphafoldDB_name alphafoldDB` : Name of the **alphafoldDB** database files.
- `--uniprot /path/to/database/uniprot_function.json` : JSON file containing Pfam annotations linked to **UniProtKB** sequences.

Pfam database [optional]

Downloading and building the database

The `build_pfam_db.sh` script builds the **Pfam profile database** using MMseqs2. The database is created with the following MMseqs2 command (from the MMseqs2 documentation):

```bash
mmseqs databases Pfam-A.full pfamDB tmp
```

The Pfam database is located in the `database/pfamDB` directory and consists of the following files:

* pfamDB
* pfamDB.dbtype
* pfamDB_h
* pfamDB_h.dbtype
* pfamDB_h.index
* pfamDB.index
* pfamDB.version

Use in LAGOON-MCL

To use the **`pfamDB`** database in LAGOON-MCL, you can use the following parameters. \
The default settings in LAGOON-MCL already use these paths and parameters.

- `--pfamDB /path/to/database/pfamDB` : Path to the folder containing the **pfamDB** database.
- `--pfamDB_name pfamDB` : Name of the **pfamDB** database files.
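If the databases live somewhere other than the default location, the five database options described in this document can be assembled from a single root directory. The sketch below assumes the layout used throughout this page (`alphafoldDB/`, `pfamDB/` and `uniprot_function.json` directly under the root); `lagoon_db_args` is a hypothetical helper, not part of LAGOON-MCL.

```bash
# Hypothetical helper: print the database options for root directory $1,
# one token per line.
lagoon_db_args() {
  local root=$1
  printf '%s\n' \
    --alphafoldDB      "$root/alphafoldDB" \
    --alphafoldDB_name alphafoldDB \
    --uniprot          "$root/uniprot_function.json" \
    --pfamDB           "$root/pfamDB" \
    --pfamDB_name      pfamDB
}

# Example: collect the options in a bash array, then append "${db_args[@]}"
# to your usual LAGOON-MCL launch command.
# mapfile -t db_args < <(lagoon_db_args /path/to/database)
```

Reading the options into an array keeps each path as a single argument, provided the path contains no newline characters.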