4. Database

Jérémy Rousseau edited this page Jan 14, 2026 · 2 revisions

Introduction

LAGOON-MCL uses two databases to obtain functional and structural information for user-supplied sequences.

  1. AlphaFold clusters database (mandatory): built from the AlphaFold Protein Structure Database.
  2. Pfam database (optional): built from Pfam.

To prepare these databases, two scripts are provided in the tool-kit folder:

  • build_pfam_db.sh: Downloads and constructs the Pfam database.
  • build_alphafold_db.sh: Downloads and builds the AlphaFold clusters database and retrieves Pfam annotations via InterPro.

These scripts rely on the MMseqs2, SeqKit, and LAGOON-MCL containers.
For more information on downloading containers, see 1. Usage.

AlphaFold clusters database [mandatory]

Construction of the MMseqs2 sequence database.

Step 1: Download sequences

The first step is to download the 210 million sequences (in FASTA format) from the AlphaFold Protein Structure Database, along with the sequence identifiers from the AlphaFold clusters database.

  • FASTA file: The file containing the 210 million sequences is available here.
  • TSV file: The file containing the sequence IDs of the AlphaFold clusters database is available here.

Step 2: Sequence extraction

The second step consists of extracting the FASTA sequences corresponding to the identifiers listed in the TSV file.
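
The extraction step can be sketched as follows. This is a minimal illustration, not the code of build_alphafold_db.sh: the function name extract_sequences is hypothetical, and it assumes that the identifiers sit in the first column of the TSV file and match the first word of each FASTA header exactly. If the headers carry a prefix or suffix around the UniProtKB accession, the IDs would need to be normalized first. The sketch uses seqkit grep when SeqKit is on the PATH and falls back to plain awk otherwise.

```shell
# Keep only the FASTA records whose ID is listed in column 1 of a TSV file.
# Usage: extract_sequences <input.fasta> <ids.tsv> <output.fasta>
extract_sequences() {
    fasta="$1"
    tsv="$2"
    out="$3"

    # Assumption: column 1 of the TSV holds the sequence identifiers.
    ids="$(mktemp)"
    cut -f1 "$tsv" | sort -u > "$ids"

    if command -v seqkit >/dev/null 2>&1; then
        # seqkit grep -f matches record IDs against a file of patterns.
        seqkit grep -f "$ids" "$fasta" -o "$out"
    else
        # Fallback: load the IDs, then print every record whose header ID is listed.
        awk 'FILENAME == ARGV[1] { keep[$1] = 1; next }
             /^>/ { printing = (substr($1, 2) in keep) }
             printing' "$ids" "$fasta" > "$out"
    fi

    rm -f "$ids"
}
```

On the full 210 million sequence file, SeqKit is likely the better choice; the awk branch is only there so the sketch runs without extra tools.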

Step 3: Building the sequence database with MMseqs2

The AlphaFold database is located in the database/alphafoldDB directory and consists of the following files:

  • alphafoldDB
  • alphafoldDB.dbtype
  • alphafoldDB_h
  • alphafoldDB_h.dbtype
  • alphafoldDB_h.index
  • alphafoldDB.index
  • alphafoldDB.lookup
  • alphafoldDB.source
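
The file list above matches what mmseqs createdb writes for a sequence database, so the build and a quick sanity check can be sketched as below. The function names build_alphafold_db and check_alphafold_db are hypothetical and the exact invocation inside build_alphafold_db.sh may differ. The build function is only defined here, not called, because it needs the mmseqs binary (or the MMseqs2 container).

```shell
# Build the MMseqs2 sequence database from the extracted FASTA file.
# Usage: build_alphafold_db <extracted.fasta> <db_dir> [db_name]
build_alphafold_db() {
    fasta="$1"
    db_dir="$2"
    db_name="${3:-alphafoldDB}"
    mkdir -p "$db_dir"
    mmseqs createdb "$fasta" "$db_dir/$db_name"
}

# Check that the eight files listed above are present.
# Prints each missing file to stderr and returns 1 if any is absent.
check_alphafold_db() {
    db_dir="$1"
    db_name="${2:-alphafoldDB}"
    missing=0
    for suffix in "" .dbtype .index .lookup .source _h _h.dbtype _h.index; do
        if [ ! -e "${db_dir}/${db_name}${suffix}" ]; then
            echo "missing: ${db_dir}/${db_name}${suffix}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_alphafold_db database/alphafoldDB before launching LAGOON-MCL catches an interrupted or partial build early.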

Step 4: Extracting Pfam annotations from UniProt IDs

This step retrieves Pfam annotations linked to AlphaFold IDs from the InterPro database.
Note that sequence IDs in the AlphaFold clusters database correspond to UniProtKB IDs.

The resulting file is located at: database/uniprot_function.json.

Use in LAGOON-MCL

To use the alphafoldDB database and the uniprot_function.json file, you can use the parameters below.
The default settings in LAGOON-MCL already use these paths and parameters.

  • --alphafoldDB /path/to/database/alphafoldDB : Path to the folder containing the alphafoldDB database.
  • --alphafoldDB_name alphafoldDB : Name of the alphafoldDB database files.
  • --uniprot /path/to/database/uniprot_function.json : JSON file containing Pfam annotations linked to UniProtKB sequences.
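
As an illustration, the three options can be assembled from a single database root. The helper name lagoon_alphafold_args and the example root /data/lagoon/database are hypothetical; only the option names and file names come from the list above. How the options are appended to the LAGOON-MCL launch command is described in 1. Usage and is not shown here.

```shell
# Print the AlphaFold-related options for a given database root directory.
# Note: the root path must not contain spaces, since the output is a flat string.
lagoon_alphafold_args() {
    db_root="$1"
    printf '%s\n' "--alphafoldDB $db_root/alphafoldDB --alphafoldDB_name alphafoldDB --uniprot $db_root/uniprot_function.json"
}

# Example with a hypothetical install location.
lagoon_alphafold_args /data/lagoon/database
```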

Pfam database [optional]

Downloading and building the database

The build_pfam_db.sh script builds the Pfam profile database using MMseqs2.
The database is constructed with the following MMseqs2 command (from the MMseqs2 documentation):

mmseqs databases Pfam-A.full pfamDB tmp
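
In this command, Pfam-A.full is the database selection, pfamDB is the output database name and tmp is a working directory. A small wrapper, sketched below, makes the three arguments explicit. The function name build_pfam_db and the DRY_RUN switch are hypothetical and are not taken from build_pfam_db.sh; with DRY_RUN=1 the command is printed instead of executed, which helps to check paths before starting the download.

```shell
# Download and build the Pfam profile database in the given directory.
# Usage: build_pfam_db <db_dir>
# Set DRY_RUN=1 to print the mmseqs command without running it.
build_pfam_db() {
    db_dir="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "mmseqs databases Pfam-A.full $db_dir/pfamDB $db_dir/tmp"
        return 0
    fi
    mkdir -p "$db_dir/tmp"
    # Arguments: <selection> <output database> <temporary directory>
    mmseqs databases Pfam-A.full "$db_dir/pfamDB" "$db_dir/tmp"
}
```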

The Pfam database is located in the database/pfamDB directory and consists of the following files:

  • pfamDB
  • pfamDB.dbtype
  • pfamDB_h
  • pfamDB_h.dbtype
  • pfamDB_h.index
  • pfamDB.index
  • pfamDB.version

Use in LAGOON-MCL

To use the pfamDB database in LAGOON-MCL, use the parameters below.
The default settings in LAGOON-MCL already use these paths and parameters.

  • --pfamDB /path/to/database/pfamDB : Path to the folder containing the pfamDB database.
  • --pfamDB_name pfamDB : Name of the pfamDB database files.
