
3. Tutorial

Jérémy Rousseau edited this page Jan 14, 2026 · 4 revisions


Introduction

This tutorial uses the test dataset located in the data-test/ directory.
The files consist of MALV transcriptomes obtained from the EukProt database and from the study by J. F. H. Strassert et al. (2018).
Gene3D, FunFam, and TMHMM annotations were generated using InterProScan 5.

Directory layout

lagoon-mcl/
├── assets/                     # Contains the different assets
├── bin/                        # Contains all python scripts
├── conf/                       # Contains all configuration files (resources, parameters, etc.)
├── containers/                 
│   ├── diamond/                # Contains Diamond container 
│   ├── lagoon-mcl/             # Contains the LAGOON-MCL container and the Dockerfile used to build the Docker image
│   ├── mcl/                    # Contains MCL container 
│   ├── mmseqs2/                # Contains Mmseqs2 container 
│   └── seqkit/                 # Contains SeqKit container 
├── database/                   
│   ├── alphafoldDB/            # Contains the MMseqs2 database built from AlphaFold sequences
│   ├── pfamDB/                 # Contains the MMseqs2 database from the Pfam database
│   └── uniprot_function.json   # json file of UniProt identifiers and related Pfam identifiers
├── data-test/                  # Contains a test data set
│   └── malv/                   
│       ├── labels/             # Contains 3 TSV files (Gene3D, FunFam and TMHMM)
│       ├── fasta/              # Contains FASTA files
│       └── annotationsheet.csv # CSV table of correspondence between annotation name and corresponding file (annotation files are present in labels/)
├── html_templates/             # Contains the HTML template used by jinja2 to generate the report.
├── modules/                    # Contains all nextflow modules used by the workflow
├── subworkflow/                # Contains all nextflow sub-workflows used by the main workflow
├── tool-kit/                   
│   ├── bin/                    # Contains python scripts used by the build_alphafold_db.sh script
│   ├── scripts/                # Contains several scripts that can be used to prepare input files
│   ├── slurm/                  # Contains a sample script to run LAGOON-MCL on a compute cluster using SLURM
│   ├── build_alpahfold_db.sh   # Bash script for building the MMseqs2 AlphaFold database
│   ├── build_pfam_db.sh        # Bash script for building the MMseqs2 Pfam database
│   └── README.md               # Additional information about the tool-kit/ directory
├── results/                    # Folder created by lagoon-mcl, contains output files
├── workdir/                    # Folder created by lagoon-mcl, contains intermediate files and data
├── main.nf                     # Main nextflow script, contains the workflow
├── nextflow.config             # Main configuration file
├── README.md                   
├── CITATION.md                 # List of all workflow-related references
└── LICENSE             

Tutorial

Step 1: Workflow preparation

The first step is to:

  • Download LAGOON-MCL from the GitHub repository.
  • Download the required containers.
  • Download and build the MMseqs2 databases from the Pfam and AlphaFold datasets.

For more information about building the databases, please see Databases.

Downloading LAGOON-MCL

Retrieve LAGOON-MCL from the GitHub repository.

git clone https://github.com/jroussea/lagoon-mcl.git
cd lagoon-mcl

Downloading containers

Containers for Diamond, MCL, SeqKit, and MMseqs2 are available from BioContainers.
The LAGOON-MCL container is built from a Docker image available on Docker Hub, using the provided Dockerfile.

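# Create the target directories if they do not already exist
mkdir -p containers/seqkit/2.9.0 containers/diamond/2.1.10 containers/mcl/22.282 containers/mmseqs2/15.6f452 containers/lagoon-mcl/1.0.0

# SeqKit v2.9.0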
wget -O containers/seqkit/2.9.0/seqkit.sif https://depot.galaxyproject.org/singularity/seqkit:2.9.0--h9ee0642_0

# Diamond v2.1.10
wget -O containers/diamond/2.1.10/diamond.sif https://depot.galaxyproject.org/singularity/diamond:2.1.10--h43eeafb_2

# MCL v22.282
wget -O containers/mcl/22.282/mcl.sif https://depot.galaxyproject.org/singularity/mcl:22.282--pl5321h031d066_2

# MMseqs2 v15.6f452
wget -O containers/mmseqs2/15.6f452/mmseqs.sif https://depot.galaxyproject.org/singularity/mmseqs2:15.6f452--pl5321h6a68c12_3

# LAGOON-MCL v1.0.0
apptainer build --fakeroot containers/lagoon-mcl/1.0.0/lagoon-mcl.sif docker://jroussea/lagoon-mcl:latest

Database

Two Bash scripts are available in the tool-kit/ directory to download and build MMseqs2 databases from the Pfam and AlphaFold datasets.

For more information, please see Databases.

cd tool-kit/

# Download and build the AlphaFold Protein Structure Database [mandatory]
./build_alpahfold_db.sh

# Download and build the Pfam database [optional]
./build_pfam_db.sh

Step 2: Preparing input files

LAGOON-MCL uses several types of input files:

  • One or more FASTA files.
  • One or more tabulated (TSV) files containing annotations or labels associated with sequences.
  • A CSV file specifying the annotation or label names and their corresponding TSV files.

For more information about input files, see 5. Input Files.

FASTA files

Test FASTA files are available in the data-test/malv/fasta/ directory.

Correspondence table

This file must be in CSV format and contain two columns.
The first column specifies the annotation name, and the second column specifies the path to the corresponding annotation file.

Example file: annotationsheet.csv.

| annotation | file |
|------------|------|
| funfam | /path/to/lagoon-mcl/data-test/malv/labels/funfam.tsv |
| gene3d | /path/to/lagoon-mcl/data-test/malv/labels/gene3d.tsv |
| tmhmm | /path/to/lagoon-mcl/data-test/malv/labels/tmhmm.tsv |
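
Because the second column expects a path to each annotation file, it can be convenient to generate the table rather than type it by hand. The following is a minimal sketch, not part of LAGOON-MCL: the function names are hypothetical, the annotation name is assumed to be the TSV file name without its extension, and the `annotation,file` header row is an assumption taken from the example above.

```python
import csv
import io
from pathlib import Path


def annotation_sheet_rows(tsv_paths):
    """Return (annotation name, absolute path) pairs, one per TSV file."""
    # Hypothetical convention: the annotation name is the file name without ".tsv".
    return [(Path(p).stem, str(Path(p).resolve())) for p in sorted(tsv_paths)]


def write_annotation_sheet(rows, handle):
    """Write the rows as CSV, preceded by an assumed "annotation,file" header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["annotation", "file"])
    writer.writerows(rows)


# Example with the three test annotation files (the paths need not exist here).
rows = annotation_sheet_rows([
    "data-test/malv/labels/gene3d.tsv",
    "data-test/malv/labels/funfam.tsv",
    "data-test/malv/labels/tmhmm.tsv",
])
buffer = io.StringIO()
write_annotation_sheet(rows, buffer)
print(buffer.getvalue(), end="")
```

To create the actual file, pass an open file handle (for example `open("annotationsheet.csv", "w", newline="")`) instead of the in-memory buffer.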

Annotation files

There are three files containing sequence-specific annotations:

  • funfam.tsv: contains FunFam annotations.
  • gene3d.tsv: contains Gene3D annotations.
  • tmhmm.tsv: contains TMHMM annotations.

Step 3: Running the workflow

Parameters can be provided to the workflow in two different ways:

  1. Via the command line.
  2. By editing the params.yaml file.

We recommend editing the params.yaml file, as this allows you to keep a record of the settings you have used.

For this tutorial, you can use the params_test.yaml file, which contains all the required parameters.

Run with test parameters:

nextflow run main.nf -profile singularity -params-file params_test.yaml

Run with custom parameters (after editing the params.yaml file):

nextflow run main.nf -profile singularity -params-file params.yaml

Step 4: Step-by-step pipeline description

This section describes the pipeline step by step.


File checks

Workflow involved:

  • main.nf

Processes involved:

  • CHECKS_FASTA (located in modules/check_file_format.nf)
  • CHECK_CSV (located in modules/check_file_format.nf)
  • CHECKS_TSV (located in modules/check_file_format.nf)

These processes verify that the input files comply with the formats required by LAGOON-MCL.
If any of these checks fail, an error message will be displayed in the terminal.


File processing

Workflow involved:

  • main.nf

Processes involved:

  • FASTA_PROCESSING (located in modules/data_processing.nf)
  • ANNOTATIONS_PROCESSING (located in modules/data_processing.nf)

The FASTA_PROCESSING process removes descriptions from FASTA sequence names, keeping only the identifiers.
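
The effect of this step can be illustrated with a short sketch. It is not the code used by FASTA_PROCESSING; it assumes that the identifier is everything between `>` and the first whitespace character, and the function name is hypothetical.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with each header reduced to its identifier."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Assumption: the identifier ends at the first whitespace character.
            parts = line[1:].split(None, 1)
            yield ">" + (parts[0] if parts else "")
        else:
            # Sequence lines are passed through unchanged.
            yield line


example = [">seq1 hypothetical protein [MALV]\n", "MKVLA\n", ">seq2\n", "MTTQ\n"]
for cleaned in strip_fasta_descriptions(example):
    print(cleaned)
```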

The ANNOTATIONS_PROCESSING process uses the annotationsheet.csv file to retrieve annotation names and the paths to their corresponding files.
It adds sequences without labels to each file, assigning them the value NA, and also adds a column containing the label name (e.g., FunFam or Species).
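
As an illustration of this completion step, the sketch below fills in NA for unlabeled sequences and appends the annotation name as a third column. It is a simplified stand-in rather than the pipeline's implementation: the function name is hypothetical, and the input is assumed to be a list of (sequence identifier, label) pairs.

```python
def complete_annotations(all_sequence_ids, labelled_rows, annotation_name):
    """Return (sequence id, label, annotation name) rows covering every sequence."""
    labelled = set()
    completed = []
    for seq_id, label in labelled_rows:
        labelled.add(seq_id)
        completed.append((seq_id, label, annotation_name))
    # Sequences that never appear in the annotation file receive the value NA.
    for seq_id in all_sequence_ids:
        if seq_id not in labelled:
            completed.append((seq_id, "NA", annotation_name))
    return completed


rows = complete_annotations(
    ["seq1", "seq2", "seq3"],
    [("seq1", "label_a"), ("seq1", "label_b")],
    "funfam",
)
for row in rows:
    print("\t".join(row))
```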


Function searches

Workflow involved:

  • subworkflow/function_searches.nf

Processes involved:

  • FUNCTION_SEARCHES:MMSEQS_CREATE_DB (located in modules/mmseqs2.nf)
  • FUNCTION_SEARCHES:MMSEQS_SEARCH (located in modules/mmseqs2.nf)
  • FUNCTION_SEARCHES:PFAM_PROCESSING (located in modules/data_processing.nf)

This workflow creates an MMseqs2-indexed database from user-supplied FASTA sequences and aligns it against the database built from Pfam data (MMSEQS_CREATE_DB and MMSEQS_SEARCH).

The PFAM_PROCESSING process generates a Pfam annotation file for sequences based on the output of the ANNOTATIONS_PROCESSING process.

For more information, please see Databases.


Structure searches

Workflow involved:

  • subworkflow/structure_searches.nf

Processes involved:

  • STRUCTURE_SEARCHES:MMSEQS_CREATE_DB (located in modules/mmseqs2.nf)
  • STRUCTURE_SEARCHES:MMSEQS_SEARCH (located in modules/mmseqs2.nf)
  • STRUCTURE_SEARCHES:ALPHAFOLD_ALIGNMENTS (located in modules/structure_searches.nf)
  • STRUCTURE_SEARCHES:ALPHAFOLD_INFORMATIONS (located in modules/structure_searches.nf)
  • STRUCTURE_SEARCHES:ANNOTATIONS_PROCESSING (located in modules/data_processing.nf)

This workflow creates an MMseqs2-indexed database from user-supplied FASTA sequences and aligns it against the database built from AlphaFold data.
It generates a tabulated file containing the alignments and associated metrics in .m8 format.

For more information, please see Databases.


The workflow can be divided into three parts. The first part (ALPHAFOLD_ALIGNMENTS) retains only one alignment per sequence against the AlphaFold database. The alignment to keep is chosen using the following criteria:

  1. Select the best alignment using two indices: one for mutual sequence coverage and one for alignment disparity. Learn more about calculating these indices here.
  2. Highest percentage identity.
  3. Longest AlphaFold database sequence length.

Using the ALPHAFOLD_INFORMATIONS and ANNOTATIONS_PROCESSING processes, the pipeline extracts sequence identifiers from the AlphaFold database (identical to those in UniProtKB).
The result is three annotation files based on the output of PFAM_PROCESSING:

  • File 1: Three columns

    1. User-supplied sequence identifier
    2. AlphaFold sequence identifier
    3. Annotation name: alphafold_sequences
  • File 2: Three columns

    1. User-supplied sequence identifier
    2. Identifier of the AlphaFold cluster the sequence belongs to
    3. Annotation name: alphafold_clusters
  • File 3: Three columns

    1. User-supplied sequence identifier
    2. Pfam identifier linked to the AlphaFold / UniProtKB sequence identifier
    3. Annotation name: alphafold_pfam

SSN and graph clustering

Workflow involved:

  • subworkflow/ssn_and_graph_clustering.nf (referred to as SSN in main.nf)

Processes involved:

  • SSN_AND_GRAPH_CLUSTERING:DIAMOND_DB (located in modules/diamond.nf)
  • SSN_AND_GRAPH_CLUSTERING:DIAMOND_BLASTP (located in modules/diamond.nf)
  • SSN_AND_GRAPH_CLUSTERING:FILTER_ALIGNMENTS (located in modules/data_processing.nf)
  • SSN_AND_GRAPH_CLUSTERING:GRAPH_CLUSTERING (located in modules/graph_clustering.nf)
  • SSN_AND_GRAPH_CLUSTERING:MCL_OUTPUT_TO_TSV (located in modules/graph_clustering.nf)
  • SSN_AND_GRAPH_CLUSTERING:NETWORK_EDGES (located in modules/graph_clustering.nf)

This workflow first creates a sequence similarity network (SSN).
All user-supplied FASTA sequences are aligned pairwise using Diamond (DIAMOND_DB and DIAMOND_BLASTP).

Next, LAGOON-MCL selects a single alignment for each pair of sequences based on the lowest e-value, then applies the Markov Clustering (MCL) algorithm (FILTER_ALIGNMENTS and GRAPH_CLUSTERING).
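
The selection of one alignment per pair can be sketched as follows. This is an illustration under stated assumptions, not the FILTER_ALIGNMENTS code: the function name is hypothetical, the pairs A-B and B-A are assumed to count as the same pair, and self-alignments are assumed to be discarded.

```python
def best_alignment_per_pair(alignments):
    """Keep, for each unordered pair of sequences, the alignment with the lowest e-value.

    alignments: iterable of (query id, subject id, e-value) tuples.
    """
    best = {}
    for query, subject, evalue in alignments:
        if query == subject:
            # Assumption: self-alignments carry no network information.
            continue
        # Assumption: the direction of the alignment does not matter.
        key = tuple(sorted((query, subject)))
        if key not in best or evalue < best[key][2]:
            best[key] = (query, subject, evalue)
    return list(best.values())


hits = [
    ("seqA", "seqB", 1e-10),
    ("seqB", "seqA", 1e-30),
    ("seqA", "seqA", 0.0),
    ("seqA", "seqC", 1e-5),
]
for hit in best_alignment_per_pair(hits):
    print(hit)
```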


MCL Input File

To perform clustering, MCL requires a three-column tabulated file (TSV) derived from the Diamond BLASTp alignments:

  • Column 1: Sequence identifier 1
  • Column 2: Sequence identifier 2
  • Column 3: Edge weight

Edge weights:

  • LAGOON-MCL uses the e-value as the metric.
  • Each e-value is replaced by its negative base-10 logarithm.
  • Maximum weight is capped at 200 for e-values ≤ 1e-200.

Example:

For an e-value of $10^{-300}$:

$$ -\log_{10}(10^{-300}) = 300 $$

Since $10^{-300} < 10^{-200}$, the final weight is capped at 200.
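
The weighting rule above can be written as a small function. This is a sketch of the rule as described, not the pipeline's code: the names `edge_weight` and `MAX_WEIGHT` are hypothetical, and treating an e-value reported as 0 like any other e-value below the threshold is an assumption.

```python
import math

MAX_WEIGHT = 200.0  # cap applied to e-values <= 1e-200


def edge_weight(evalue):
    """Convert an e-value into an edge weight: -log10(e-value), capped at 200."""
    # Assumption: an e-value of 0 falls under the cap as well
    # (log10(0) is undefined, so it must be handled before the logarithm).
    if evalue <= 1e-200:
        return MAX_WEIGHT
    # E-values above 1 would give negative weights; how the pipeline
    # treats them is not covered in this sketch.
    return -math.log10(evalue)


for evalue in (1e-5, 1e-50, 1e-300, 0.0):
    print(evalue, edge_weight(evalue))
```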


MCL Output

  • MCL_OUTPUT_TO_TSV generates a TSV file with:
    1. Column 1: Cluster identifier
    2. Column 2: Sequence identifiers present in the cluster

This file is used in subsequent workflow steps.

  • NETWORK_EDGES retrieves only the alignments necessary to reconstruct MCL clusters from the filtered alignment file (FILTER_ALIGNMENTS).
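
The two steps above can be sketched together. The example assumes that MCL was run in label mode (`--abc`), where the output contains one cluster per line with tab-separated members. The function names, the 0-based cluster numbering, and the rule "keep an edge only when both sequences belong to the same cluster" are assumptions made for illustration, not a description of the actual processes.

```python
def clusters_to_rows(mcl_lines):
    """Turn MCL label output (one cluster per line) into (cluster id, sequence id) rows."""
    rows = []
    cluster_id = 0  # hypothetical numbering, starting from 0
    for line in mcl_lines:
        members = line.split()
        if not members:
            continue  # ignore blank lines
        rows.extend((cluster_id, seq_id) for seq_id in members)
        cluster_id += 1
    return rows


def edges_within_clusters(edges, rows):
    """Keep only the edges whose two sequences were assigned to the same cluster."""
    cluster_of = {seq_id: cluster_id for cluster_id, seq_id in rows}
    kept = []
    for edge in edges:
        first, second = edge[0], edge[1]
        if first in cluster_of and cluster_of[first] == cluster_of.get(second):
            kept.append(edge)
    return kept


rows = clusters_to_rows(["seq1\tseq2\tseq3\n", "seq4\n"])
edges = [("seq1", "seq2", 50.0), ("seq3", "seq4", 10.0)]
print(rows)
print(edges_within_clusters(edges, rows))
```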

Data analysis and visualization

Workflow involved:

  • subworkflow/data_analysis.nf

Processes involved:

  • DATA_ANALYSIS:SEQUENCES_PROCESSING (located in modules/data_analysis.nf)
  • DATA_ANALYSIS:HOMOGENEITY_SCORE (located in modules/data_analysis.nf)
  • DATA_ANALYSIS:SEQUENCES_FILES (located in modules/data_analysis.nf)
  • DATA_ANALYSIS:CLUSTERS_FILES (located in modules/data_analysis.nf)
  • DATA_ANALYSIS:HTML_REPORT (located in modules/data_analysis.nf)

This workflow processes the results by calculating a homogeneity score for each label and generates the final pipeline output files (TSV tables and HTML report).

Output generated by SEQUENCES_PROCESSING and SEQUENCES_FILES

These processes produce the following files:

  • edges_igraph_network_I[inflation].tsv: TSV file containing the alignments used to build clusters in a format compatible with igraph. Sequence identifiers are replaced with numbers starting from 0.
  • nodes_igraph_network_I[inflation].tsv: TSV file containing the list of sequences and labels in a format compatible with igraph. Column 1: numeric sequence identifiers starting from 0, Column 2: sequence names.
  • nodes_metrics_network_I[inflation].tsv: TSV file containing sequence metrics (sequence length, centrality, number of labels per annotation type, etc.).
  • nodes_labels_network_I[inflation].tsv: TSV file containing all labels for all annotation types associated with each sequence.
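
The replacement of sequence identifiers by numbers starting from 0 can be illustrated as follows. This sketch only shows the renumbering idea: the function name is hypothetical, and numbering the sequences in order of first appearance in the edge list is an assumption.

```python
def index_nodes(edges):
    """Replace sequence names with 0-based integers.

    edges: iterable of (sequence 1, sequence 2, weight) tuples.
    Returns (nodes, indexed_edges), where nodes is a list of (number, name) pairs.
    """
    node_index = {}
    indexed_edges = []
    for first, second, weight in edges:
        for name in (first, second):
            # Assumption: numbers are assigned in order of first appearance.
            node_index.setdefault(name, len(node_index))
        indexed_edges.append((node_index[first], node_index[second], weight))
    nodes = [(number, name) for name, number in node_index.items()]
    return nodes, indexed_edges


nodes, indexed_edges = index_nodes([("seqA", "seqB", 120.0), ("seqB", "seqC", 35.5)])
print(nodes)
print(indexed_edges)
```

Tables in this form (integer node identifiers plus a separate node table holding the names) can then be loaded by graph libraries such as igraph.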

Homogeneity score (HOMOGENEITY_SCORE)

A homogeneity score is calculated for each cluster and each label to determine the label composition of each cluster.

The HOMOGENEITY_SCORE process generates:

  • clusters_labels_network_I[inflation].tsv: TSV file listing all labels present in each cluster.
  • abundance_matrix_[label]_network_I[inflation].json: Abundance matrix for each label in each cluster.

For more information on the homogeneity score, see Indices.
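
To give an idea of what a per-cluster label abundance looks like, the sketch below counts how many sequences carry each label in each cluster. It does not compute the homogeneity score (see Indices for its definition), and the layout of the resulting dictionary is an assumption that may differ from the JSON files produced by the pipeline. The function name is hypothetical.

```python
import json
from collections import Counter, defaultdict


def abundance_matrix(cluster_of, labels_of):
    """Count label occurrences per cluster.

    cluster_of: dict mapping sequence id -> cluster id.
    labels_of: dict mapping sequence id -> list of labels for one annotation type.
    """
    matrix = defaultdict(Counter)
    for seq_id, cluster_id in cluster_of.items():
        # Sequences without any label are counted under NA.
        for label in labels_of.get(seq_id, ["NA"]):
            matrix[cluster_id][label] += 1
    return {cluster_id: dict(counts) for cluster_id, counts in matrix.items()}


clusters = {"seq1": "0", "seq2": "0", "seq3": "1"}
labels = {"seq1": ["label_a"], "seq2": ["label_a", "label_b"]}
print(json.dumps(abundance_matrix(clusters, labels), indent=2))
```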

Cluster metrics (CLUSTERS_FILES)

The CLUSTERS_FILES process generates:

  • clusters_metrics_network_I[inflation].tsv: TSV file containing cluster metrics, including cluster size, diameter, and three additional columns for each annotation type (homogeneity score, number of different labels, and number of sequences annotated with the label).

Final report (HTML_REPORT)

The HTML_REPORT process generates an HTML report with figures to visualize the results.

For more information on output files, see 6. Output Files.

References

D. J. Richter et al., "EukProt: A database of genome-scale predicted proteins across the diversity of eukaryotes," Peer Community Journal, vol. 2, p. e56, Sep. 2022, doi: 10.24072/pcjournal.173.

J. F. H. Strassert et al., "Single cell genomics of uncultured marine alveolates shows paraphyly of basal dinoflagellates," ISME J, vol. 12, no. 1, Jan. 2018, doi: 10.1038/ismej.2017.167.

For more information on references related to LAGOON-MCL, please see CITATION.md.
