Skip to content

3. Tutorial

jroussea edited this page Apr 16, 2025 · 4 revisions

Table of content

Introduction

This tutorial uses the test dataset in the data-test/ directory. The files are MALV transcriptomes available in the EukProt database and from the study by J. F. H. Strassert et al. 2017. Gene3D, FunFam and TMHMM annotations were obtained using InterProScan5.

Directory layout

lagoon-mcl/
├── assets/                     # Contains the different assets
├── bin/                        # Contains all python scripts
├── conf/                       # Contains all configuration files (resources, parameters, etc.)
├── containers/                 
│   ├── diamond/                # Contains Diamond container 
│   ├── lagoon-mcl/             # Contains the LAGOON-MCL container and the Dockerfile used to build the Docker image
│   ├── mcl/                    # Contains MCL container 
│   ├── mmseqs2/                # Contains Mmseqs2 container 
│   └── seqkit                  # Contains SeqKit container 
├── database/                   
│   ├── alphafoldDB/            # Contains the MMseqs2 database built from AlphaFold sequences
│   ├── pfamDB/                 # Contains the MMseqs2 database from the Pfam database
│   └── uniprot_function.json   # json file of UniProt identifiers and related Pfam identifiers
├── data-test/                  # Contains a test data set
│   └── malv/                   
│       ├── labels/             # Contains 3 TSV files (Gene3D, FunFam and TMHMM)
│       ├── fasta/              # Contains FASTA files
│       └── annotationsheet.csv # CSV table of correspondence between annotation name and corresponding file (annotation files are present in labels/)
├── html_templates/             # Contains the HTML template used by jinja2 to generate the report.
├── modules/                    # Contains all nextflow modules used by the workflow
├── subworkflow/                # Contains all nextflow sub-workflows used by the main workflow
├── tool-kit/                   
│   ├── bin/                    # Contains python scripts used by the build_alphafold_db.sh script
│   ├── scripts/                # Contains several scripts that can be used to prepare input files
│   ├── slurm/                  # Contains a sample script to run LAGOON-MCL on a compute cluster using (SLURM)
│   ├── build_alpahfold_db.sh   # Bash script for building the MMseqs2 AlphaFold database
│   ├── build_pfam_db.sh        # Bash script for building the MMseqs2 Pfam database
│   └── README.md               # Additional information about the tool-kit/ directory
├── results/                    # Folder created by lagoon-mcl, contains output files
├── workdir/                    # Folder created by lagoon-mcl, contains intermediate files and data
├── main.nf                     # Main nextflow script, contains the workflow
├── nextflow.config             # Main configuration file
├── README.md                   
├── CITATION.md                 # List of all workflow-related references
└── LICENSE             

Tutorial

Step 1: Workflow preparation

The first step is to

  • download LAGOON-MCL from the GitHub repository
  • download the containers.
  • download and build the MMseqs2 database for data from the Pfam and AlphaFold databases.

For the third point, please see database for more information.

Downloading LAGOON-MCL

Retrieve LAGOON-MCL from the GitHub repository.

git clone https://github.com/jroussea/lagoon-mcl.git
cd lagoon-mcl

Downloading containers

Containers containing Diamond, MCL, SeqKit and MMseqs2 are available in BioContainer, while the LAGOON-MCL container is built from a docker image available on Docker Hub (dockerfile).

# SeqKit2 v2.9.0
wget -O containers/seqkit/2.9.0/seqkit.sif https://depot.galaxyproject.org/singularity/seqkit:2.9.0--h9ee0642_0

# Diamond v2.1.10
wget -O containers/diamond/2.1.10/diamond.sif https://depot.galaxyproject.org/singularity/diamond:2.1.10--h43eeafb_2

# MCL v22.282
wget -O containers/mcl/22.282/mcl.sif https://depot.galaxyproject.org/singularity/mcl:22.282--pl5321h031d066_2

# MMseqs2 v15.6f452
wget -O containers/mmseqs2/15.6f452/mmseqs.sif https://depot.galaxyproject.org/singularity/mmseqs2:15.6f452--pl5321h6a68c12_3

# LAGOON-MCL v1.0.0
singularity build --fakeroot containers/lagoon-mcl/1.0.0/lagoon-mcl.sif docker://jroussea/lagoon-mcl:latest

Database

Two bash scripts are available in tool-kit/ to download and build MMseqs2 databases from Pfam and AlphaFold databases.

For more information, please see database.

cd tool-kit/

# Download Alphafold Protein Database [mandatory]
./build_alpahfold_db.sh

# Dowload Pfam [optional]
./build_pfam_db.sh

Step 2: Preparing input files

LAGOON-MCL uses several input files.

For more information on input files, see 5. Input files.

Fasta file

Test fasta files are available in data-test/malv/fasta.

Correspondent table

This file must be in CSV format and have 2 columns. The first column must contain the annotation name, and the second column the path to the annotation file.

Exemple de fichier (annotationsheet.csv)

annotation file
funfam /path/to/lagoon-mcl/data-test/malv/labels/funfam.tsv
gene3d /path/to/lagoon-mcl/data-test/malv/labels/gene3d.tsv
tmhmm /path/to/lagoon-mcl/data-test/malv/labels/tmhmm.tsv

Annotations files

There are three files containing sequence-specific annotations

  • funfam.tsv: contains funfam annotations
  • gene3d.tsv: contains gene3d annotations
  • tmhmm.tsv: contains tmhmm annotations

Step 3: Running the workflow

Parameters can be submitted to the workflow in two different ways.

  1. By providing them on the command line
  2. By modifying the conf/custom.config file.

We recommend modifying the conf/custom.config file, as this allows you to keep the settings you've used.

For this tutorial, you can use the conf/test.config file. It contains all the necessary parameters.

Run with test parameters:

nextflow run main.nf -profile singularity,test

Execution with custom parameters:

nextflow run main.nf -profile singularity,test

Step 4: step-by-step pipeline description

This section describes the pipeline step by step.


File checks

Workflow concerned:

  • main.nf

Processes concerned:

  • CHECKS_FASTA (available in: modules/check_file_format.nf)
  • CHECK_CSV (available in: modules/check_file_format.nf)
  • CHECKS_TSV (available in: modules/check_file_format.nf)

Processes check that input files respect the formats used by lagoon-mcl. If these verification steps are not satisfied, an error message will be returned to the terminal.


File processing

Workflow concerned:

  • main.nf

Processes concerned:

  • FASTA_PROCESSING (available in: modules/data_processing.nf)
  • ANNOTATIONS_PROCESSING (available in: modules/data_processing.nf)

The FASTA_PROCESSING process removes descriptions from fasta sequence names, retaining only the identifier.

The ANNOTATIONS_PROCESSING process uses the annotationsheet.csv file to retrieve annotation names and the path to the corresponding file. This process adds sequences without labels to each file, assigning the value “NA”. It also adds a column containing the label name (e.g. Funfam, Species, etc.).


Function searches

Workflow concerned:

  • subworkflow/function_searches.nf

Processes concerned:

  • FUNCTION_SEARCHES:MMSEQS_CREATE_DB (available in: modules/mmseqs2.nf)
  • FUNCTION_SEARCHES:MMSEQS_SEARCH (available in: modules/mmseqs2.nf)
  • FUNCTION_SEARCHES:PFAM_PROCESSING (available in: modules/data_processing.nf)

This workflow creates a database indexed by MMseqs2 from user-supplied fasta sequences and aligns it against the database built from Pfam data (MMSEQS_CREATE_DB and MMSEQS_SEARCH). PFAM_PROCESSING creates a Pfam annotation file of sequences based on the file generated by the ANNOTATIONS_PROCESSING process.

For more information, please see databases.


Structure searches

Workflow concerned:

  • subworkflow/structure_searches.nf

Processes concerned:

  • STRUCTURE_SEARCHES:MMSEQS_CREATE_DB (available in: modules/mmseqs2.nf)
  • STRUCTURE_SEARCHES:MMSEQS_SEARCH (available in: modules/mmseqs2.nf)
  • STRUCTURE_SEARCHES:ALPHAFOLD_ALIGNMENTS (available in: modules/structure_searches.nf)
  • STRUCTURE_SEARCHES:ALPHAFOLD_INFORMATIONS (available in: modules/structure_searches.nf)
  • STRUCTURE_SEARCHES:ANNOTATIONS_PROCESSING (available in: modules/data_processing.nf)

This workflow creates an MMseqs2-indexed database from user-supplied fasta sequences and aligns it against the database built with AlphaFold data. This workflow generates a tabulated file containing the alignments and associated metrics in .m8 format.

For more information, please see databases.

The workflow then breaks down into 3 parts. The first consists in keeping only one alignment for each sequence aligned against the database built from AlphaFold database data (ALPHAFOLD_ALIGNMENTS). The pipeline retains the sequence with:

  1. Select the best alignment using two indices, one for mutual sequence coverage and one for alignment disparity. Find out more about calculating indices here.
  2. The highest percentage identity
  3. Longest AlphaFold database sequence length

Using the ALPHAFOLD_INFORMATIONS and ANNOTATIONS_PROCESSING processes, we extract the sequence identifier from the AlphaFold database (identical to those present in UniProtKB). The result is three annotation files based on the file generated by PFAM_PROCESSING.

  • The first file contains three columns. Column 1: sequence identifier supplied by the user. Column 2: AlphaFold sequence identifier. Column 3: annotation name (alphafold_sequences).
  • The second file contains three columns. Column 1: sequence identifier supplied by the user. Column 2: identifier of the AlphaFold cluster to which the AlphaFold sequence belongs. Column 3: contains the annotation name (alphafold_clusters).
  • The third file contains three columns. Column 1: sequence identifier supplied by the user. Column 2: Pfam identifier linked to the AlphaFold / UniProtKB sequence identifier. Column 3: contains the annotation name (alphafold_pfam).

SSN and graph clustering

Workflow concerned:

  • subworkflow/ssn_and_graph_clustering.nf (named SSN in the main.nf file)

Processes concerned:

  • SSN_AND_GRAPH_CLUSTERING:DIAMOND_DB (available in: modules/diamond.nf)
  • SSN_AND_GRAPH_CLUSTERING:DIAMOND_BLASTP (available in: modules/diamond.nf)
  • SSN_AND_GRAPH_CLUSTERING:FILTER_ALIGNMENTS (available in: modules/data_processing.nf)
  • SSN_AND_GRAPH_CLUSTERING:GRAPH_CLUSTERING (available in: modules/graph_clustering.nf)
  • SSN_AND_GRAPH_CLUSTERING:MCL_OUTPUT_TO_TSV (available in: modules/graph_clustering.nf)
  • SSN_AND_GRAPH_CLUSTERING:NETWORK_EDGES (available in: modules/graph_clustering.nf)

This workflow first creates a network of sequence similarities. To do this, it performs a pairwise alignment of all user-supplied fasta sequences using Diamond (DIAMOND_DB, DIAMOND_BLASTP). In a second step, LAGOON-MCL selects a single alignment for each pair of sequences, then applies a Markov CLustering algorithm to the result. The alignment selected is the one with the lowest e-value (FILTER_ALIGNMENTS, GRAPH_CLUSTERING).

In order to apply clustering to a network, MCL requires a three-column tabulated file (TSV). This file is derived from the alignment file obtained with Diamond BLASTp. The first two columns contain the sequence identifiers and the third column contains the edge weights. LAGOON-MCL uses the e-value. Each e-value is replaced by the negative value of the logarithm in base 10. If the e-value is less than or equal to 1e-200 then the maximum weight will be limited to 200.

For example:
for an e-value of $1e^{-300}$, the negative value of the logarithm in base 10 is:

$$ -log10(1e^{300})=300 $$

However, $1e^{300} < 1e^{200}$ so the final value for this e-value is $200$.

From the MCL results, MCL_OUTPUT_TO_TSV generates a tabulated file (TSV) with two columns (column 1: cluster identifier, column 2: sequence identifier present in the cluster). This file is used in the rest of the workflow.

NETWORK_EDGES is used to retrieve only those alignments that can be used to reconstruct MCL clusters from the alignment file obtained with FILTER_ALIGNMENTS.


Data analysis and visualization

Workflow concerné :

  • subworkflow/data_analysis.nf

Process concernés :

  • DATA_ANALYSIS:SEQUENCES_PROCESSING (available in: modules/data_analysis.nf)
  • DATA_ANALYSIS:HOMOGENEITY_SCORE (available in: modules/data_analysis.nf)
  • DATA_ANALYSIS:SEQUENCES_FILES (available in: modules/data_analysis.nf)
  • DATA_ANALYSIS:CLUSTERS_FILES (available in: modules/data_analysis.nf)
  • DATA_ANALYSIS:HTML_REPORT (available in: modules/data_analysis.nf)

This workflow processes the results by calculating a homogeneity score for each label, and generates the final pipeline output files (TSV table and HTML report).

The combination of the SEQUENCES_PROCESSING and SEQUENCES_FILES processes generates part of the output files:

  • edges_igraph_network_I[inflation].tsv: TSV file containing the alignments used to build clusters in a format usable with igraph. Sequence identifiers are replaced by numbers how to 0.
  • nodes_igraph_network_I[inflation].tsv: TSV file containing the list of sequences and labels in a format usable by igraph. The first column contains the numeric identifiers of the sequences starting at 0 and the second column contains the sequence name.
  • nodes_metrics_network_I[inflation].tsv: file containing sequence metrics (sequence size, centrality, number of labels for each annotation type, etc.).
  • nodes_labels_network_I[inflation].tsv: file containing all labels for all annotation types linked to each sequence.

A homogeneity score for each cluster and each label is calculated (HOMOGENEITY_SCORE). This homogeneity score is used to determine the label composition for each cluster. For more information on the homogeneity score, please see indices. The HOMOGENEITY_SCORE process generates two files:

  • clusters_labels_network_I[inflatio].tsv: TSV files containing all the labels that can be found in each cluster.
  • abundance_matrix_[label]_network_I[inflation].json: abundance matrix for each label in each cluster.

The CLUSTERS_FILES process is used to generate the last file.

  • clusters_metrics_network_I[inflation].tsv: TSV file containing cluster metrics, including cluster size and diameter and three additional columns for each annotation type (homogeneity score, number of different labels and number of sequences annotated with the label).

Finally, HTML_REPORT generates a report and figures to visualize the results.

For more information on output files, please refer to the 6. Output files page.

References

D. J. Richter et al., « EukProt: A database of genome-scale predicted proteins across the diversity of eukaryotes », Peer Community Journal, vol. 2, p. e56, sept. 2022, doi: 10.24072/pcjournal.173.

J. F. H. Strassert et al., « Single cell genomics of uncultured marine alveolates shows paraphyly of basal dinoflagellates », ISME J, vol. 12, no 1, Art. no 1, janv. 2018, doi: 10.1038/ismej.2017.167.

For more information on references related to LAGOON-MCL, see CITATION.md

Clone this wiki locally