2. Command line option

Input parameters

`--fasta <file>`

default: null

Enter the path to the FASTA files that LAGOON-MCL should process.
If you want to include multiple FASTA files, enclose the path in quotes (").
For example: "/path/to/lagoon-mcl/data-test/fasta/*.fasta"
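
The quotes keep the shell from expanding the wildcard, so the pattern reaches the pipeline as a single string. The sketch below shows which files a pattern such as `*.fasta` selects. It uses Python's `glob` module as a stand-in, since how LAGOON-MCL resolves the pattern internally is not described here, and the file names are hypothetical.

```python
import glob
import os
import tempfile


def expand_fasta_pattern(pattern: str) -> list[str]:
    """Return the sorted list of files matching a glob pattern."""
    return sorted(glob.glob(pattern))


# Build a throwaway directory with two FASTA files and one unrelated file.
# The file names are hypothetical and only serve the illustration.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_a.fasta", "sample_b.fasta", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("placeholder\n")
    matched = expand_fasta_pattern(os.path.join(tmp, "*.fasta"))
    matched_names = [os.path.basename(path) for path in matched]
    print(matched_names)
```

Only the two `.fasta` files are selected; `notes.txt` is ignored.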

`--annotation <file>`

default: null

If you have sequence-specific labels, annotations, or additional information (e.g., taxonomy, function, etc.), you can provide them to LAGOON-MCL by creating a CSV (comma-separated values) file with two columns:

annotation: the name of the label, annotation, or information (e.g., species, Pfam, etc.)
file: the path to the file containing the corresponding label, annotation, or information

For more details about this file, please see 5. Input files.
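
As a minimal sketch, the two-column file could be generated as follows. The annotation names and paths are hypothetical, and writing a header row with the column names `annotation` and `file` is an assumption; 5. Input files remains the reference for the exact layout.

```python
import csv
import io

# Hypothetical annotation names and paths; replace them with your own.
rows = [
    ("species", "/path/to/annotations/species.tsv"),
    ("Pfam", "/path/to/annotations/pfam.tsv"),
]


def build_annotation_csv(entries) -> str:
    """Serialize (annotation, file) pairs as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Assumption: the file starts with a header naming the two columns.
    writer.writerow(["annotation", "file"])
    writer.writerows(entries)
    return buffer.getvalue()


csv_text = build_annotation_csv(rows)
print(csv_text)
```

The resulting text can be saved to a file and passed with `--annotation`.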

Output parameters

`--outdir <path>`

default: ${projectDir}/results

Directory where LAGOON-MCL will save the output files. This directory can be located outside the folder containing the input files used to run the pipeline.

Pfam database scan

With LAGOON-MCL, you can align your own sequences against the Pfam database using MMseqs2.
If the --annotation parameter is not provided, LAGOON-MCL will automatically scan the Pfam database.

For more information about the Pfam database, see 4. Databases, and for details on the alignment step, see 3. Tutorial.

`--scan_pfam <bool>`

default: true

Boolean parameters only accept true or false (must be lowercase).

true : LAGOON-MCL aligns sequences against the Pfam database using MMseqs2.
false : LAGOON-MCL does not align sequences against the Pfam database.

If the --annotation parameter is null, LAGOON-MCL will align sequences against the Pfam database even if --scan_pfam is set to false.
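
The interaction between `--annotation` and `--scan_pfam` described above can be summarized as a small decision function. This is an illustrative sketch of the documented rule, not the pipeline's actual code; the function name is hypothetical.

```python
from typing import Optional


def pfam_scan_will_run(annotation: Optional[str], scan_pfam: bool) -> bool:
    """Return True when the Pfam scan is expected to run.

    The scan runs when it is requested explicitly, and also when no
    annotation file is provided (annotation is None, i.e. null).
    """
    return annotation is None or scan_pfam


# No annotation file: the scan runs even though scan_pfam is false.
print(pfam_scan_will_run(None, False))
# Annotation file provided and scan disabled: the scan is skipped.
print(pfam_scan_will_run("annotations.csv", False))
```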

`--pfam_path <path>`

default: ${projectDir}/database/pfamDB

If --scan_pfam is set to true and the MMseqs2 database derived from Pfam has not been downloaded and built using the build_pfam_db.sh script, you must specify the path to the directory containing the MMseqs2 database built from Pfam.

If you used the ./build_pfam_db.sh script, you can keep the default value.

For more information, please see Databases and Tutorial.

`--pfam_name <string>`

default: pfamDB

If --scan_pfam is set to true and the MMseqs2 database derived from Pfam has not been downloaded and built using the build_pfam_db.sh script, you must specify the name of the MMseqs2 database.
The database name should match the common prefix of all files in the directory provided via the --pfam_path parameter.
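
To check which value to pass to `--pfam_name`, you can look at the prefix shared by the files in the database directory. The sketch below does this with the Python standard library. The three file names are illustrative and do not claim to be the full set of files produced by MMseqs2.

```python
import os
import tempfile


def database_files(directory: str, name: str) -> list[str]:
    """List the files in `directory` whose names start with `name`."""
    return sorted(entry for entry in os.listdir(directory) if entry.startswith(name))


# Throwaway directory standing in for the --pfam_path directory.
with tempfile.TemporaryDirectory() as pfam_path:
    for filename in ("pfamDB", "pfamDB.index", "pfamDB.dbtype"):
        open(os.path.join(pfam_path, filename), "w").close()
    found = database_files(pfam_path, "pfamDB")
    # The common prefix of all file names is the database name.
    shared_prefix = os.path.commonprefix(os.listdir(pfam_path))
    print(found, shared_prefix)
```

The same check applies to `--alphafold_name` and the directory given with `--alphafold_path`.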

AlphaFold Protein Structure Database scan

LAGOON-MCL aligns user-supplied sequences against an MMseqs2 database built from sequences in the AlphaFold Protein Structure Database and, more specifically, from sequences that can be found in the AlphaFold Clusters database. As the sequences in the AlphaFold Protein Structure Database have a UniProt identifier, LAGOON-MCL uses this link to provide Pfam annotations.

For more information on selecting AlphaFold sequences, building the MMseqs2 database, and retrieving information from the UniProtKB database, please see 4. Databases.

`--alphafold_path <path>`

default: ${projectDir}/database/alphafoldDB

If the MMseqs2 database derived from AlphaFold data has not been built using the build_alphafold_db.sh script, you must specify the path to the directory containing the MMseqs2 database built from AlphaFold.

If you used the ./build_alphafold_db.sh script, you can keep the default value.

`--alphafold_name <string>`

default: alphafoldDB

If the MMseqs2 database derived from AlphaFold data has not been downloaded and built using the build_alphafold_db.sh script, you must specify the name of the MMseqs2 database.
The database name should match the common prefix of all files in the directory provided via the --alphafold_path parameter.

`--uniprot <file>`

default: ${projectDir}/database/uniprot_function.json

The UniProt file is a JSON file with the following structure:

Key: UniProt identifier
Value: list of Pfam identifiers associated with the UniProt identifier

This file is generated using the build_alphafold_db.sh script.
If you obtained the file without using the script, use this parameter to specify its path.
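
The sketch below shows what this key/value structure looks like and how it can be read. The identifiers are placeholders invented for the illustration, and the helper name is hypothetical.

```python
import json

# Placeholder content: each key is a UniProt identifier and each value is
# the list of Pfam identifiers associated with it.
example_json = '{"P00001": ["PF00001", "PF00002"], "P00002": []}'

uniprot_to_pfam = json.loads(example_json)


def pfam_for(mapping: dict, uniprot_id: str) -> list:
    """Return the Pfam identifiers for a UniProt identifier (empty if unknown)."""
    return mapping.get(uniprot_id, [])


print(pfam_for(uniprot_to_pfam, "P00001"))
print(pfam_for(uniprot_to_pfam, "P99999"))
```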

For more information, please see 4. Databases and 3. Tutorial.

Sequence Similarity Network

The sequence similarity network (SSN) is constructed from pairwise alignments generated using Diamond BLASTp.

`--alignment_file <file>`

default: null

If you have a file containing pairwise alignments of your sequences, you can provide it to LAGOON-MCL using this parameter.
If --alignment_file is not null, LAGOON-MCL will skip running Diamond for pairwise sequence alignment.

The alignment file must follow a specific format. Please consult Input Files for details.

`--sensitivity <string>`

default: very-sensitive

This parameter specifies the sensitivity of Diamond BLASTp when aligning sequences.
You can choose from the following options: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, or ultra-sensitive.

For more information, please consult the Diamond documentation.

`--matrix <string>`

default: BLOSUM62

This parameter specifies the scoring matrix used by Diamond BLASTp for sequence alignment.
You can choose from: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM250, PAM70, or PAM30.

For more information, please consult the Diamond documentation.

`--diamond_evalue <float>`

default: 0.001

Specifies the maximum E-value for reporting an alignment in Diamond BLASTp.

For more information, please consult the Diamond documentation.

Clustering

After the sequence similarity network (SSN) is built, it is clustered using the Markov Clustering (MCL) algorithm.

Definitions of the terms used in this section:

Nodes: In an SSN, a node represents a sequence.
Edges: In an SSN, an edge represents the relationship between two sequences based on an alignment metric (e.g., identity, E-value, etc.).
Weight: In an SSN, the weight represents the value of a specific metric used for the edge (e.g., identity).

`--I <list>`

default: 1.4,2

I corresponds to the inflation parameter used by MCL. This parameter controls the granularity of clustering:

A low inflation value produces larger clusters.
A high inflation value produces smaller clusters.

The inflation parameter typically ranges from 1.4 to 5.
With LAGOON-MCL, you can provide multiple inflation values to compare the resulting clusters.
Separate each value with a comma, for example: "1.4,2,4,5".
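
A comma-separated list such as this one maps to a list of inflation values, each giving its own clustering run. The following sketch shows one way such a string could be split; it illustrates the expected format and is not the pipeline's parsing code.

```python
def parse_inflation(values: str) -> list[float]:
    """Split a comma-separated string into a list of inflation values."""
    return [float(item) for item in values.split(",") if item.strip()]


inflations = parse_inflation("1.4,2,4,5")
print(inflations)
```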

The size of a cluster corresponds to the number of nodes it contains.

For more information, please consult the MCL documentation.

`--max_weight <float>`

default: 200

To apply clustering to a network, MCL requires a three-column tab-delimited file (TSV).
This file is derived from the alignment file obtained with Diamond BLASTp.

The first two columns contain the sequence identifiers.
The third column contains the edge weights.

LAGOON-MCL uses the E-value as the weight. Each E-value is transformed into its negative base-10 logarithm.
The resulting weight is capped at the value of --max_weight: with the default of 200, any E-value smaller than or equal to 1e-200 receives a weight of 200.

For example, for an E-value of $10^{-300}$, the negative base-10 logarithm is:
$$ -\log_{10}(10^{-300}) = 300 $$
However, since $10^{-300} < 10^{-200}$, the final weight is capped at 200.
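
The transformation can be sketched in a few lines of Python. The function names are hypothetical, and assigning the cap to an E-value of exactly 0 is an assumption made here because the logarithm of 0 is undefined.

```python
import math


def evalue_to_weight(evalue: float, max_weight: float = 200.0) -> float:
    """Convert an E-value to an edge weight: -log10(E-value), capped."""
    if evalue <= 0.0:
        # Assumption: an E-value of 0 is given the maximum weight.
        return max_weight
    return min(-math.log10(evalue), max_weight)


def to_network_line(seq_a: str, seq_b: str, evalue: float) -> str:
    """Format one edge as a three-column, tab-delimited line."""
    return f"{seq_a}\t{seq_b}\t{evalue_to_weight(evalue)}"


print(evalue_to_weight(1e-300))  # capped at the default maximum of 200
print(evalue_to_weight(1e-50))   # below the cap, so about 50
print(to_network_line("seqA", "seqB", 1e-300))
```

Lowering `max_weight` narrows the gap between strong and very strong hits, since every alignment beyond the cap gets the same weight.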

For more information on how to use this parameter, please see Tutorial.

For further details, please consult the MCL documentation.

`--cluster_size <int>`

default: 2

This parameter sets the minimum number of nodes required for a cluster to be retained.
By default, clusters must contain at least two nodes.
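
As an illustration of this filter, the sketch below keeps only the clusters with at least `min_size` nodes. The cluster contents are hypothetical and the function is not taken from the pipeline.

```python
def filter_clusters(clusters: list[list[str]], min_size: int = 2) -> list[list[str]]:
    """Keep the clusters that contain at least `min_size` nodes."""
    return [cluster for cluster in clusters if len(cluster) >= min_size]


# Hypothetical clusters: one singleton, one pair, one triplet.
clusters = [["seq1"], ["seq2", "seq3"], ["seq4", "seq5", "seq6"]]
print(filter_clusters(clusters))
print(filter_clusters(clusters, min_size=3))
```

With the default value of 2, singletons (sequences with no retained neighbor) are dropped.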

Resources

`--max_cpus <int>`

default: 200

Specifies the maximum number of CPUs that can be used by a process.
If this limit is exceeded, the pipeline execution will be terminated.

`--max_memory <float.GB>`

default: 750.GB

Specifies the maximum amount of RAM that can be used by a process.
If this limit is exceeded, the pipeline execution will be terminated.

`--max_time <int.h>`

default: 350.h

Specifies the maximum execution time for a process.
If this limit is exceeded, the pipeline execution will be terminated.

Other command line parameters

`--help`

Show help.

`-w / --workdir <path>`

default: /path/to/lagoon-mcl/workdir

Specifies the path to the pipeline's working directory.

`--projectName <string>`

default: lagoon-mcl

Project name. Used to name the working directory.

`-resume`

Use this parameter when restarting the pipeline after an interruption or error.
Nextflow will use cached results for any pipeline steps where the inputs are unchanged, allowing the pipeline to continue from where it stopped.
