3. Tutorial
- Introduction
- Directory layout
- Tutorial
- References
This tutorial uses the test dataset located in the data-test/ directory.
The files consist of MALV transcriptomes obtained from the EukProt database and from the study by J. F. H. Strassert et al. (2017).
Gene3D, FunFam, and TMHMM annotations were generated using InterProScan 5.
lagoon-mcl/
├── assets/ # Contains the different assets
├── bin/ # Contains all python scripts
├── conf/ # Contains all configuration files (resources, parameters, etc.)
├── containers/
│ ├── diamond/ # Contains Diamond container
│ ├── lagoon-mcl/ # Contains the LAGOON-MCL container and the Dockerfile used to build the Docker image
│ ├── mcl/ # Contains MCL container
│ ├── mmseqs2/ # Contains MMseqs2 container
│ └── seqkit # Contains SeqKit container
├── database/
│ ├── alphafoldDB/ # Contains the MMseqs2 database built from AlphaFold sequences
│ ├── pfamDB/ # Contains the MMseqs2 database from the Pfam database
│ └── uniprot_function.json # json file of UniProt identifiers and related Pfam identifiers
├── data-test/ # Contains a test data set
│ └── malv/
│ ├── labels/ # Contains 3 TSV files (Gene3D, FunFam and TMHMM)
│ ├── fasta/ # Contains FASTA files
│ └── annotationsheet.csv # CSV table of correspondence between annotation name and corresponding file (annotation files are present in labels/)
├── html_templates/ # Contains the HTML template used by jinja2 to generate the report.
├── modules/ # Contains all nextflow modules used by the workflow
├── subworkflow/ # Contains all nextflow sub-workflows used by the main workflow
├── tool-kit/
│ ├── bin/ # Contains python scripts used by the build_alphafold_db.sh script
│ ├── scripts/ # Contains several scripts that can be used to prepare input files
│ ├── slurm/ # Contains a sample script to run LAGOON-MCL on a compute cluster using SLURM
│ ├── build_alpahfold_db.sh # Bash script for building the MMseqs2 AlphaFold database
│ ├── build_pfam_db.sh # Bash script for building the MMseqs2 Pfam database
│ └── README.md # Additional information about the tool-kit/ directory
├── results/ # Folder created by lagoon-mcl, contains output files
├── workdir/ # Folder created by lagoon-mcl, contains intermediate files and data
├── main.nf # Main nextflow script, contains the workflow
├── nextflow.config # Main configuration file
├── README.md
├── CITATION.md # List of all workflow-related references
└── LICENSE
The first step is to:
- Download LAGOON-MCL from the GitHub repository.
- Download the required containers.
- Download and build the MMseqs2 databases from the Pfam and AlphaFold datasets.
For more information about the third step, please see Databases.
Retrieve LAGOON-MCL from the GitHub repository.
git clone https://github.com/jroussea/lagoon-mcl.git
cd lagoon-mcl
Containers for Diamond, MCL, SeqKit, and MMseqs2 are available from BioContainers.
The LAGOON-MCL container is built from a Docker image available on Docker Hub, using the provided Dockerfile.
# SeqKit2 v2.9.0
wget -O containers/seqkit/2.9.0/seqkit.sif https://depot.galaxyproject.org/singularity/seqkit:2.9.0--h9ee0642_0
# Diamond v2.1.10
wget -O containers/diamond/2.1.10/diamond.sif https://depot.galaxyproject.org/singularity/diamond:2.1.10--h43eeafb_2
# MCL v22.282
wget -O containers/mcl/22.282/mcl.sif https://depot.galaxyproject.org/singularity/mcl:22.282--pl5321h031d066_2
# MMseqs2 v15.6f452
wget -O containers/mmseqs2/15.6f452/mmseqs.sif https://depot.galaxyproject.org/singularity/mmseqs2:15.6f452--pl5321h6a68c12_3
# LAGOON-MCL v1.0.0
apptainer build --fakeroot containers/lagoon-mcl/1.0.0/lagoon-mcl.sif docker://jroussea/lagoon-mcl:latest
Two Bash scripts are available in the tool-kit/ directory to download and build MMseqs2 databases from the Pfam and AlphaFold datasets.
For more information, please see Databases.
cd tool-kit/
# Download AlphaFold Protein Database [mandatory]
./build_alpahfold_db.sh
# Download Pfam [optional]
./build_pfam_db.sh
LAGOON-MCL uses several types of input files:
- One or more FASTA files.
- One or more tabulated (TSV) files containing annotations or labels associated with sequences.
- A CSV file specifying the annotation or label names and their corresponding TSV files.
For more information about input files, see 5. Input Files.
Test FASTA files are available in the data-test/malv/fasta/ directory.
The annotation sheet must be in CSV format and contain two columns.
The first column specifies the annotation name, and the second column specifies the path to the corresponding annotation file.
Example file: annotationsheet.csv.
| annotation | file |
|---|---|
| funfam | /path/to/lagoon-mcl/data-test/malv/labels/funfam.tsv |
| gene3d | /path/to/lagoon-mcl/data-test/malv/labels/gene3d.tsv |
| tmhmm | /path/to/lagoon-mcl/data-test/malv/labels/tmhmm.tsv |
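The annotation sheet can be written by hand, but a short script avoids path typos. The following is a minimal sketch, not part of LAGOON-MCL: the function names are hypothetical, and whether the pipeline expects a header row is not stated here, so the header is optional in the sketch.

```python
import csv
import io
from pathlib import Path


def annotation_rows(labels_dir, names):
    """Return (annotation name, absolute TSV path) pairs for the sheet."""
    base = Path(labels_dir).resolve()
    return [(name, str(base / f"{name}.tsv")) for name in names]


def write_annotation_sheet(rows, handle, header=False):
    """Write the two-column CSV; the header row is optional (assumption)."""
    writer = csv.writer(handle, lineterminator="\n")
    if header:
        writer.writerow(["annotation", "file"])
    writer.writerows(rows)


# Build the sheet for the three test annotation files and print it.
rows = annotation_rows("data-test/malv/labels", ["funfam", "gene3d", "tmhmm"])
buffer = io.StringIO()
write_annotation_sheet(rows, buffer)
print(buffer.getvalue())
```

Replacing the in-memory buffer with an open file handle writes the sheet to disk.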
There are three files containing sequence-specific annotations:
- funfam.tsv: contains FunFam annotations.
- gene3d.tsv: contains Gene3D annotations.
- tmhmm.tsv: contains TMHMM annotations.
Parameters can be provided to the workflow in two different ways:
- Via the command line.
- By editing the params.yaml file.
We recommend editing the params.yaml file, as this allows you to keep a record of the settings you have used.
For this tutorial, you can use the params_test.yaml file, which contains all the required parameters.
Run with test parameters:
nextflow run main.nf -profile singularity -params-file params_test.yaml
Execution with custom parameters:
nextflow run main.nf -profile singularity -params-file params_test.yaml
This section describes the pipeline step by step.
Workflow involved:
main.nf
Processes involved:
- CHECKS_FASTA (located in modules/check_file_format.nf)
- CHECK_CSV (located in modules/check_file_format.nf)
- CHECKS_TSV (located in modules/check_file_format.nf)
These processes verify that the input files comply with the formats required by LAGOON-MCL.
If any of these checks fail, an error message will be displayed in the terminal.
Workflow involved:
main.nf
Processes involved:
- FASTA_PROCESSING (located in modules/data_processing.nf)
- ANNOTATIONS_PROCESSING (located in modules/data_processing.nf)
The FASTA_PROCESSING process removes descriptions from FASTA sequence names, keeping only the identifiers.
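As an illustration of this header clean-up, here is a minimal sketch. It is not the pipeline's own script: the function name is hypothetical, and it assumes the identifier is the text between `>` and the first whitespace character.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with headers reduced to the sequence identifier."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Assumption: the identifier ends at the first whitespace.
            fields = line[1:].split(None, 1)
            yield ">" + (fields[0] if fields else "")
        else:
            yield line


example = [">seq1 hypothetical protein", "MKT", ">seq2", "MAA"]
cleaned = list(strip_fasta_descriptions(example))
print(cleaned)
```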
The ANNOTATIONS_PROCESSING process uses the annotationsheet.csv file to retrieve annotation names and the paths to their corresponding files.
It adds sequences without labels to each file, assigning them the value NA, and also adds a column containing the label name (e.g., FunFam, Species, etc.).
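The completion step can be sketched as follows. This is an illustration under stated assumptions rather than the actual ANNOTATIONS_PROCESSING code: the function name, the input shapes, and the output column order (identifier, label, label name) are all hypothetical.

```python
def complete_annotations(sequence_ids, annotations, label_name):
    """Return (sequence_id, label, label_name) rows, adding NA where needed.

    sequence_ids: identifiers of all input sequences.
    annotations: (sequence_id, label) pairs read from one annotation file.
    """
    rows = []
    labelled = set()
    for seq_id, label in annotations:
        rows.append((seq_id, label, label_name))
        labelled.add(seq_id)
    # Sequences that never appeared in the annotation file receive NA.
    for seq_id in sequence_ids:
        if seq_id not in labelled:
            rows.append((seq_id, "NA", label_name))
    return rows


completed = complete_annotations(
    ["s1", "s2", "s3"], [("s1", "x"), ("s1", "y")], "funfam"
)
for row in completed:
    print("\t".join(row))
```

A sequence may carry several labels of the same type, which is why the sketch keeps one row per (sequence, label) pair.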
Workflow involved:
subworkflow/function_searches.nf
Processes involved:
- FUNCTION_SEARCHES:MMSEQS_CREATE_DB (located in modules/mmseqs2.nf)
- FUNCTION_SEARCHES:MMSEQS_SEARCH (located in modules/mmseqs2.nf)
- FUNCTION_SEARCHES:PFAM_PROCESSING (located in modules/data_processing.nf)
This workflow creates an MMseqs2-indexed database from user-supplied FASTA sequences and aligns it against the database built from Pfam data (MMSEQS_CREATE_DB and MMSEQS_SEARCH).
The PFAM_PROCESSING process generates a Pfam annotation file for sequences based on the output of the ANNOTATIONS_PROCESSING process.
For more information, please see Databases.
Workflow involved:
subworkflow/structure_searches.nf
Processes involved:
- STRUCTURE_SEARCHES:MMSEQS_CREATE_DB (located in modules/mmseqs2.nf)
- STRUCTURE_SEARCHES:MMSEQS_SEARCH (located in modules/mmseqs2.nf)
- STRUCTURE_SEARCHES:ALPHAFOLD_ALIGNMENTS (located in modules/structure_searches.nf)
- STRUCTURE_SEARCHES:ALPHAFOLD_INFORMATIONS (located in modules/structure_searches.nf)
- STRUCTURE_SEARCHES:ANNOTATIONS_PROCESSING (located in modules/data_processing.nf)
This workflow creates an MMseqs2-indexed database from user-supplied FASTA sequences and aligns it against the database built from AlphaFold data.
It generates a tabulated file containing the alignments and associated metrics in .m8 format.
For more information, please see Databases.
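For readers who want to inspect the alignment file themselves, the sketch below parses one line of BLAST-style tabular output. It assumes the twelve default columns (query, target, identity, alignment length, mismatches, gap openings, query start/end, target start/end, e-value, bit score); the file produced by LAGOON-MCL may contain additional columns, which this sketch ignores. The `Hit` type and the example identifiers are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    query_start: int
    query_end: int
    target_start: int
    target_end: int
    evalue: float
    bit_score: float


def parse_m8_line(line):
    """Parse the first twelve tab-separated fields of one .m8 line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    return Hit(
        fields[0],
        fields[1],
        float(fields[2]),
        *(int(value) for value in fields[3:10]),
        float(fields[10]),
        float(fields[11]),
    )


hit = parse_m8_line("userA\ttargetB\t0.85\t120\t18\t0\t1\t120\t5\t124\t1.5e-40\t150")
print(hit.query, hit.target, hit.evalue)
```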
The workflow can be divided into three parts. The first part retains only one alignment per sequence against the AlphaFold database (ALPHAFOLD_ALIGNMENTS). The pipeline selects this alignment based on the following criteria:
- Select the best alignment using two indices: one for mutual sequence coverage and one for alignment disparity. For details on how these indices are calculated, see Indices.
- Highest percentage identity.
- Longest AlphaFold database sequence length.
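The selection above can be pictured as ranking the hits of each query and keeping the top one. The sketch below is only an illustration and rests on several assumptions that this page does not confirm: both indices are already computed for each hit, a higher coverage index and a lower disparity index are better, and the criteria are applied in the order listed as successive tie-breakers. All field and function names are hypothetical.

```python
def best_hit_per_query(hits):
    """Keep one hit per query sequence.

    hits: iterable of dicts with keys query, target, coverage_index,
    disparity_index, identity and target_length.
    """

    def rank(hit):
        # Assumed ordering: coverage (high), disparity (low),
        # identity (high), target length (long).
        return (
            hit["coverage_index"],
            -hit["disparity_index"],
            hit["identity"],
            hit["target_length"],
        )

    best = {}
    for hit in hits:
        current = best.get(hit["query"])
        if current is None or rank(hit) > rank(current):
            best[hit["query"]] = hit
    return best


hits = [
    {"query": "q1", "target": "t1", "coverage_index": 0.9,
     "disparity_index": 0.1, "identity": 80.0, "target_length": 300},
    {"query": "q1", "target": "t2", "coverage_index": 0.9,
     "disparity_index": 0.1, "identity": 85.0, "target_length": 250},
]
print(best_hit_per_query(hits)["q1"]["target"])
```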
Using the ALPHAFOLD_INFORMATIONS and ANNOTATIONS_PROCESSING processes, the pipeline extracts sequence identifiers from the AlphaFold database (identical to those in UniProtKB).
The result is three annotation files based on the output of PFAM_PROCESSING:
- File 1: Three columns
  - User-supplied sequence identifier
  - AlphaFold sequence identifier
  - Annotation name: alphafold_sequences
- File 2: Three columns
  - User-supplied sequence identifier
  - Identifier of the AlphaFold cluster the sequence belongs to
  - Annotation name: alphafold_clusters
- File 3: Three columns
  - User-supplied sequence identifier
  - Pfam identifier linked to the AlphaFold / UniProtKB sequence identifier
  - Annotation name: alphafold_pfam
Workflow involved:
- subworkflow/ssn_and_graph_clustering.nf (referred to as SSN in main.nf)
Processes involved:
- SSN_AND_GRAPH_CLUSTERING:DIAMOND_DB (located in modules/diamond.nf)
- SSN_AND_GRAPH_CLUSTERING:DIAMOND_BLASTP (located in modules/diamond.nf)
- SSN_AND_GRAPH_CLUSTERING:FILTER_ALIGNMENTS (located in modules/data_processing.nf)
- SSN_AND_GRAPH_CLUSTERING:GRAPH_CLUSTERING (located in modules/graph_clustering.nf)
- SSN_AND_GRAPH_CLUSTERING:MCL_OUTPUT_TO_TSV (located in modules/graph_clustering.nf)
- SSN_AND_GRAPH_CLUSTERING:NETWORK_EDGES (located in modules/graph_clustering.nf)
This workflow first creates a sequence similarity network (SSN).
All user-supplied FASTA sequences are aligned pairwise using Diamond (DIAMOND_DB and DIAMOND_BLASTP).
Next, LAGOON-MCL selects a single alignment for each pair of sequences based on the lowest e-value, then applies the Markov Clustering (MCL) algorithm (FILTER_ALIGNMENTS and GRAPH_CLUSTERING).
To perform clustering, MCL requires a three-column tabulated file (TSV) derived from the Diamond BLASTp alignments:
- Column 1: Sequence identifier 1
- Column 2: Sequence identifier 2
- Column 3: Edge weight
Edge weights:
- LAGOON-MCL uses the e-value as the metric.
- Each e-value is replaced by the negative base-10 logarithm.
- Maximum weight is capped at 200 for e-values ≤ 1e-200.
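Example:
For an e-value of 1e-50, the edge weight is -log10(1e-50) = 50.
Since e-values ≤ 1e-200 are capped, an e-value of 1e-250 receives a weight of 200 rather than 250.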
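The filtering and weighting rules can be sketched in a few lines. This is a hedged illustration, not the FILTER_ALIGNMENTS code: the function names are hypothetical, and it assumes that A-B and B-A alignments count as the same pair. E-values above 1 would give negative weights and are not handled here.

```python
import math

MAX_WEIGHT = 200.0


def edge_weight(evalue):
    """Return -log10(evalue), capped at 200 for e-values <= 1e-200."""
    if evalue <= 1e-200:  # also covers an e-value reported as 0
        return MAX_WEIGHT
    return min(MAX_WEIGHT, -math.log10(evalue))


def lowest_evalue_per_pair(alignments):
    """Keep the lowest e-value for each unordered pair of sequences."""
    best = {}
    for seq_a, seq_b, evalue in alignments:
        pair = tuple(sorted((seq_a, seq_b)))  # assumption: direction ignored
        if pair not in best or evalue < best[pair]:
            best[pair] = evalue
    return best


def mcl_edges(alignments):
    """Return (id1, id2, weight) rows, the three-column input MCL needs."""
    best = lowest_evalue_per_pair(alignments)
    return [(a, b, edge_weight(e)) for (a, b), e in sorted(best.items())]


alignments = [("s1", "s2", 1e-30), ("s2", "s1", 1e-50), ("s1", "s3", 0.0)]
for id1, id2, weight in mcl_edges(alignments):
    print(f"{id1}\t{id2}\t{weight:g}")
```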
- MCL_OUTPUT_TO_TSV generates a TSV file with:
  - Column 1: Cluster identifier
  - Column 2: Sequence identifiers present in the cluster

  This file is used in subsequent workflow steps.
- NETWORK_EDGES retrieves only the alignments necessary to reconstruct MCL clusters from the filtered alignment file (FILTER_ALIGNMENTS).
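When MCL is run on labelled input, its output lists one cluster per line with tab-separated members. The conversion to a two-column table can be sketched as below. This is not the MCL_OUTPUT_TO_TSV code: the function name is hypothetical, and both the one-row-per-sequence layout and the numbering of clusters from 1 are assumptions.

```python
def mcl_clusters_to_rows(mcl_lines, first_id=1):
    """Turn MCL cluster lines into (cluster_id, sequence_id) rows."""
    rows = []
    cluster_id = first_id
    for line in mcl_lines:
        members = [m for m in line.rstrip("\n").split("\t") if m]
        if not members:
            continue  # skip blank lines
        for sequence_id in members:
            rows.append((cluster_id, sequence_id))
        cluster_id += 1
    return rows


cluster_rows = mcl_clusters_to_rows(["s1\ts2\ts3\n", "s4\n"])
for cluster_id, sequence_id in cluster_rows:
    print(f"{cluster_id}\t{sequence_id}")
```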
Workflow involved:
subworkflow/data_analysis.nf
Processes involved:
-
DATA_ANALYSIS:SEQUENCES_PROCESSING(located inmodules/data_analysis.nf) -
DATA_ANALYSIS:HOMOGENEITY_SCORE(located inmodules/data_analysis.nf) -
DATA_ANALYSIS:SEQUENCES_FILES(located inmodules/data_analysis.nf) -
DATA_ANALYSIS:CLUSTERS_FILES(located inmodules/data_analysis.nf) -
DATA_ANALYSIS:HTML_REPORT(located inmodules/data_analysis.nf)
This workflow processes the results by calculating a homogeneity score for each label and generates the final pipeline output files (TSV tables and HTML report).
Output generated by SEQUENCES_PROCESSING and SEQUENCES_FILES
These processes produce the following files:
- edges_igraph_network_I[inflation].tsv: TSV file containing the alignments used to build clusters in a format compatible with igraph. Sequence identifiers are replaced with numbers starting from 0.
- nodes_igraph_network_I[inflation].tsv: TSV file containing the list of sequences and labels in a format compatible with igraph. Column 1: numeric sequence identifiers starting from 0, Column 2: sequence names.
- nodes_metrics_network_I[inflation].tsv: TSV file containing sequence metrics (sequence length, centrality, number of labels per annotation type, etc.).
- nodes_labels_network_I[inflation].tsv: TSV file containing all labels for all annotation types associated with each sequence.
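The relationship between the edges and nodes files can be illustrated with a small sketch that replaces sequence names with integers starting from 0. It is not the pipeline's implementation: the function name is hypothetical, and numbering sequences in order of first appearance is an assumption.

```python
def index_network(edges):
    """Map sequence names to integers starting from 0.

    edges: iterable of (seq_a, seq_b, weight).
    Returns (nodes, indexed_edges), where nodes is a list of
    (numeric_id, sequence_name) pairs.
    """
    index = {}
    indexed_edges = []
    for seq_a, seq_b, weight in edges:
        for name in (seq_a, seq_b):
            if name not in index:
                index[name] = len(index)  # next free integer
        indexed_edges.append((index[seq_a], index[seq_b], weight))
    nodes = sorted((number, name) for name, number in index.items())
    return nodes, indexed_edges


nodes, numbered_edges = index_network([("s1", "s2", 50.0), ("s2", "s3", 12.0)])
print(nodes)
print(numbered_edges)
```

Keeping the nodes table alongside the numbered edges is what allows the original sequence names to be restored after the graph has been loaded.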
Homogeneity score (HOMOGENEITY_SCORE)
A homogeneity score is calculated for each cluster and each label to determine the label composition of each cluster.
The HOMOGENEITY_SCORE process generates:
- clusters_labels_network_I[inflation].tsv: TSV file listing all labels present in each cluster.
- abundance_matrix_[label]_network_I[inflation].json: Abundance matrix for each label in each cluster.
For more information on the homogeneity score, see Indices.
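To give an idea of what an abundance matrix holds, the sketch below counts how often each label occurs in each cluster. It deliberately does not compute the homogeneity score, whose formula is documented in Indices. The function name and the JSON layout (cluster identifier, then label, then count) are assumptions, not the pipeline's actual format.

```python
import json
from collections import Counter, defaultdict


def label_abundance(cluster_of, label_rows):
    """Count label occurrences per cluster.

    cluster_of: dict mapping sequence identifier to cluster identifier.
    label_rows: iterable of (sequence_id, label) pairs for one annotation type.
    """
    matrix = defaultdict(Counter)
    for sequence_id, label in label_rows:
        if sequence_id in cluster_of:  # ignore sequences outside the network
            matrix[cluster_of[sequence_id]][label] += 1
    return {cluster: dict(counts) for cluster, counts in matrix.items()}


cluster_of = {"s1": 1, "s2": 1, "s3": 1, "s4": 2}
label_rows = [("s1", "x"), ("s2", "x"), ("s3", "NA"), ("s4", "y")]
abundance = label_abundance(cluster_of, label_rows)
print(json.dumps(abundance, indent=2))
```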
Cluster metrics (CLUSTERS_FILES)
The CLUSTERS_FILES process generates:
- clusters_metrics_network_I[inflation].tsv: TSV file containing cluster metrics, including cluster size, diameter, and three additional columns for each annotation type (homogeneity score, number of different labels, and number of sequences annotated with the label).
Final report (HTML_REPORT)
The HTML_REPORT process generates an HTML report with figures to visualize the results.
For more information on output files, see 6. Output Files.
D. J. Richter et al., "EukProt: A database of genome-scale predicted proteins across the diversity of eukaryotes", Peer Community Journal, vol. 2, p. e56, Sept. 2022, doi: 10.24072/pcjournal.173.
J. F. H. Strassert et al., "Single cell genomics of uncultured marine alveolates shows paraphyly of basal dinoflagellates", ISME J, vol. 12, no. 1, Jan. 2018, doi: 10.1038/ismej.2017.167.
For more information on references related to LAGOON-MCL, please see CITATION.md.