4. Database

jroussea edited this page Apr 16, 2025 · 2 revisions

Introduction

LAGOON-MCL uses two databases to obtain functional and structural information for the sequences supplied by the user.
The first is built from the AlphaFold clusters database, which is itself derived from the AlphaFold Protein Structure Database.
The second is built from the Pfam database.

To prepare the two databases, two scripts are provided in the tool-kit folder.

  • build_pfam_db.sh: downloads and builds the Pfam database.
  • build_alphafold_db.sh: downloads and builds the AlphaFold clusters database, and retrieves the Pfam annotations from InterPro.

The scripts require the MMseqs2, SeqKit and LAGOON-MCL containers. For more information on downloading the containers, please see 1. Usage.

AlphaFold clusters database [mandatory]

Construction of the MMseqs2 sequence database.

Step 1: Download sequences

The first step is to download the 210 million sequences (in FASTA format) from the AlphaFold Protein Structure Database, as well as the sequence identifiers present in the AlphaFold clusters database.

  • [FASTA file] The file with the 210 million sequences is available here.
  • [TSV file] The file containing the sequence IDs of the AlphaFold clusters database is available here.

Step 2: Sequence extraction

The second step extracts from the FASTA file the sequences whose identifiers are listed in the TSV file.
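The extraction can be pictured with a short Python sketch. It is only an illustration and not the logic of build_alphafold_db.sh, which relies on the SeqKit container. It assumes that the identifier sits in the first tab-separated column of the TSV file and that the identifier can be read from the start of each FASTA header; the real headers may need a different `header_to_id` function. All function names and the toy identifiers are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Set, Tuple


def read_ids(tsv_lines: Iterable[str], column: int = 0) -> Set[str]:
    """Collect identifiers from one column of a tab-separated file."""
    ids = set()
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        ids.add(line.split("\t")[column])
    return ids


def first_token(header: str) -> str:
    """Default header parser: text between '>' and the first whitespace."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def filter_fasta(
    fasta_lines: Iterable[str],
    wanted: Set[str],
    header_to_id: Callable[[str], str] = first_token,
) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for records whose identifier is in `wanted`."""
    header, chunks = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if it was requested.
            if header is not None and header_to_id(header) in wanted:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the last record of the file.
    if header is not None and header_to_id(header) in wanted:
        yield header, "".join(chunks)


# Toy data with made-up identifiers, standing in for the real files.
toy_tsv = ["A1\tx\n", "C3\ty\n"]
toy_fasta = [">A1 desc\n", "MKV\n", "LLA\n", ">B2\n", "GGG\n", ">C3\n", "PPP\n"]
for header, sequence in filter_fasta(toy_fasta, read_ids(toy_tsv)):
    print(header, sequence)
```

Because both functions accept any iterable of lines, they can be fed open file handles directly, which avoids loading the full FASTA file into memory.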

Step 3: Building the sequences database with MMseqs2

The database is located in the database/alphafoldDB directory.

It consists of the following files:

  • alphafoldDB
  • alphafoldDB.dbtype
  • alphafoldDB_h
  • alphafoldDB_h.dbtype
  • alphafoldDB_h.index
  • alphafoldDB.index
  • alphafoldDB.lookup
  • alphafoldDB.source
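Before launching LAGOON-MCL, it can be useful to confirm that the build produced every file listed above. The following Python sketch does only that; the function name is hypothetical, and the suffix list is simply copied from the file list above rather than taken from MMseqs2 itself.

```python
from pathlib import Path
from typing import List, Sequence

# Suffixes copied from the file list above; "" stands for the main data file.
ALPHAFOLD_SUFFIXES = [
    "", ".dbtype", "_h", "_h.dbtype", "_h.index", ".index", ".lookup", ".source",
]


def missing_db_files(
    db_dir: str,
    name: str = "alphafoldDB",
    suffixes: Sequence[str] = ALPHAFOLD_SUFFIXES,
) -> List[str]:
    """Return the expected database file names that are absent from db_dir."""
    folder = Path(db_dir)
    return [name + s for s in suffixes if not (folder / (name + s)).is_file()]


if __name__ == "__main__":
    missing = missing_db_files("database/alphafoldDB")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected alphafoldDB files are present.")
```

The same function can be reused for another database by passing a different `name` and suffix list.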

Step 4: Extracting Pfam annotations from UniProt IDs

This step retrieves from the InterPro database the Pfam annotations linked to the AlphaFold IDs. This is possible because the sequence IDs in the AlphaFold clusters database correspond to UniProtKB IDs.

The resulting JSON file is located at database/uniprot_function.json.

Use in LAGOON-MCL

To use the alphafoldDB database and the uniprot_function.json file, you can use the parameters below.
The default settings for LAGOON-MCL already use these paths and parameters.

  • --alphafoldDB /path/to/database/alphafoldDB: path to the folder containing the alphafoldDB database
  • --alphafoldDB_name alphafoldDB: base name of the alphafoldDB database files
  • --uniprot /path/to/database/uniprot_function.json: JSON file with the Pfam annotations linked to UniProtKB sequences

Pfam database [optional]

Downloading and building the database

The build_pfam_db.sh script builds the Pfam profile database with MMseqs2, using the command given in the MMseqs2 documentation: mmseqs databases Pfam-A.full pfamDB tmp, where pfamDB is both the name and the folder of the database.
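As a rough illustration, the MMseqs2 command quoted above can be wrapped in a few lines of Python. This is a sketch, not the content of build_pfam_db.sh: the function names are hypothetical, and placing the output at `<db_dir>/<name>` is an assumption based on the statement that pfamDB is both the name and the folder of the database. The real script also runs inside the MMseqs2 container, which is not reproduced here.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pfam_build_command(
    db_dir: str = "database/pfamDB", name: str = "pfamDB", tmp_dir: str = "tmp"
) -> List[str]:
    """Build the `mmseqs databases` command line for the Pfam-A.full profiles."""
    # Assumption: the database files are written as <db_dir>/<name>*.
    return ["mmseqs", "databases", "Pfam-A.full", str(Path(db_dir) / name), tmp_dir]


def build_pfam_db(
    db_dir: str = "database/pfamDB",
    name: str = "pfamDB",
    tmp_dir: str = "tmp",
    dry_run: bool = True,
) -> List[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = pfam_build_command(db_dir, name, tmp_dir)
    if dry_run:
        return cmd
    if shutil.which("mmseqs") is None:
        raise RuntimeError("mmseqs was not found on PATH")
    Path(db_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Dry run by default: only print the command that would be executed.
    print(" ".join(build_pfam_db()))
```

Keeping the dry run as the default makes it possible to inspect the command before starting a download.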

The database is located in the database/pfamDB directory.

It consists of the following files:

  • pfamDB
  • pfamDB.dbtype
  • pfamDB_h
  • pfamDB_h.dbtype
  • pfamDB_h.index
  • pfamDB.index
  • pfamDB.version

Use in LAGOON-MCL

To use the pfamDB database, you can use the parameters below.
The default settings for LAGOON-MCL already use these paths and parameters.

  • --pfamDB /path/to/database/pfamDB: path to the folder containing the pfamDB database
  • --pfamDB_name pfamDB: base name of the pfamDB database files
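When the databases are stored outside the default location, the database parameters documented on this page can be derived from a single root folder. The Python sketch below only assembles the argument list; the function name is hypothetical, and how the arguments are handed to the LAGOON-MCL launcher is not covered on this page, so that part is left out.

```python
from pathlib import Path
from typing import List


def database_args(root: str, with_pfam: bool = True) -> List[str]:
    """Assemble the LAGOON-MCL database parameters for a given database folder."""
    base = Path(root)
    args = [
        "--alphafoldDB", str(base / "alphafoldDB"),
        "--alphafoldDB_name", "alphafoldDB",
        "--uniprot", str(base / "uniprot_function.json"),
    ]
    if with_pfam:
        # The Pfam database is optional, so its parameters can be omitted.
        args += ["--pfamDB", str(base / "pfamDB"), "--pfamDB_name", "pfamDB"]
    return args


if __name__ == "__main__":
    print(" ".join(database_args("/path/to/database")))
```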
