Table of contents
* [Input parameters](#title-1)
    * [`--fasta`](#--fasta-file)
    * [`--annotation`](#--annotation-file)
* [Output parameters](#title-2)
    * [`--outdir`](#--outdir-path)
* [Pfam database scan](#title-3)
    * [`--scan_pfam`](#--scan_pfam-bool)
    * [`--pfam_path`](#--pfam_path-path)
    * [`--pfam_name`](#--pfam_name-string)
* [AlphaFold Protein Structure Database scan](#title-4)
    * [`--alphafold_path`](#--alphafold_path-path)
    * [`--alphafold_name`](#--alphafold_name-string)
    * [`--uniprot`](#--uniprot-file)
* [Sequence Similarity Network](#title-5)
    * [`--alignment_file`](#--alignment_file-file)
    * [`--sensitivity`](#--sensitivity-string)
    * [`--matrix`](#--matrix-string)
    * [`--diamond_evalue`](#--diamond_evalue-float)
* [Clustering](#title-6)
    * [`--I`](#--i-list)
    * [`--max_weight`](#--max_weight-float)
    * [`--cluster_size`](#--cluster_size-int)
* [Resources](#title-7)
    * [`--max_cpus`](#--max_cpus-int)
    * [`--max_memory`](#--max_memory-flaotgb)
    * [`--max_time`](#--max_time-inth)
* [Other command line parameters](#title-8)
    * [`--help`](#--help)
    * [`-w / --workdir`](#-w----workdir-path)
    * [`--projectName`](#--projectname-string)
    * [`-resume`](#-resume)
Input parameters
### __`--fasta `__
> *default: null*
Path to the FASTA file or files that LAGOON-MCL should process. \
To include multiple FASTA files, use a wildcard pattern and enclose the path in quotes (`"`) so that the shell passes the pattern to the pipeline unexpanded. \
For example: `"/path/to/lagoon-mcl/data-test/fasta/*.fasta"`
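
Before launching the pipeline, it can help to check which files a wildcard pattern actually matches. The following is a minimal sketch using Python's standard `glob` module; the helper name `preview_fasta_matches` and the file names are hypothetical, and the pipeline's own pattern matching may differ in edge cases.

```python
import glob
import os
import tempfile


def preview_fasta_matches(pattern):
    """Return the sorted list of paths matched by a wildcard pattern."""
    return sorted(glob.glob(pattern))


# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_a.fasta", "sample_b.fasta", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    pattern = os.path.join(tmp, "*.fasta")
    matched = [os.path.basename(p) for p in preview_fasta_matches(pattern)]

print(matched)
```

Only the two `.fasta` files are matched; `notes.txt` is ignored by the pattern.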
### __`--annotation `__
> *default: null*
If you have sequence-specific labels, annotations, or additional information (e.g., taxonomy, function, etc.), you can provide them to LAGOON-MCL by creating a CSV (comma-separated values) file with two columns:
* `annotation`: the name of the label, annotation, or information (e.g., species, Pfam, etc.)
* `file`: the path to the file containing the corresponding label, annotation, or information
> For more details about this file, please see [5. Input files]().
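
As an illustration of the two-column layout described above, the sketch below builds such a CSV in memory with Python's `csv` module. It assumes that the file starts with a header row containing the column names `annotation` and `file`; the label names and paths are placeholders, not files shipped with LAGOON-MCL.

```python
import csv
import io

# Hypothetical annotation labels and file paths, for illustration only.
rows = [
    {"annotation": "species", "file": "/path/to/species.tsv"},
    {"annotation": "Pfam", "file": "/path/to/pfam.tsv"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["annotation", "file"], lineterminator="\n")
writer.writeheader()      # assumed header row: annotation,file
writer.writerows(rows)    # one row per annotation source

csv_text = buffer.getvalue()
print(csv_text)
```

To write the file to disk instead, replace the `io.StringIO()` buffer with `open("annotation.csv", "w", newline="")` and pass the resulting path to `--annotation`.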
Output parameters
### __`--outdir `__
> *default: ${projectDir}/results*
Directory where LAGOON-MCL will save the output files. This directory can be located outside the folder containing the input files used to run the pipeline.
Pfam database scan
With LAGOON-MCL, you can align your own sequences against the [Pfam database](https://www.ebi.ac.uk/interpro/download/Pfam/) using [MMseqs2](). \
If the `--annotation` parameter is not provided, LAGOON-MCL will automatically scan the Pfam database.
> For more information about the Pfam database, see [4. Databases](), and for details on the alignment step, see [3. Tutorial]().
### __`--scan_pfam `__
> *default: true*
Boolean parameters only accept `true` or `false` (must be lowercase).
- `true` : LAGOON-MCL aligns sequences against the Pfam database using MMseqs2.
- `false` : LAGOON-MCL does not align sequences against the Pfam database.
If the `--annotation` parameter is `null`, LAGOON-MCL will align sequences against the Pfam database even if `--scan_pfam` is set to `false`.
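
The interaction between `--scan_pfam` and `--annotation` can be summarised as a small decision rule. The sketch below only illustrates the behaviour described in this section; it is not LAGOON-MCL's actual code, and the function name `pfam_scan_runs` is hypothetical.

```python
def pfam_scan_runs(scan_pfam, annotation):
    """Return True when the Pfam alignment step is expected to run.

    scan_pfam  -- value of --scan_pfam (True or False)
    annotation -- value of --annotation (a path, or None for null)
    """
    # Without user annotations, the Pfam scan runs regardless of --scan_pfam.
    return scan_pfam or annotation is None


for scan_pfam in (True, False):
    for annotation in (None, "annotation.csv"):
        print(scan_pfam, annotation, pfam_scan_runs(scan_pfam, annotation))
```

The only combination that skips the Pfam scan is `--scan_pfam false` together with a provided `--annotation` file.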
### __`--pfam_path `__
> *default: ${projectDir}/database/pfamDB*
If `--scan_pfam` is set to `true` and the MMseqs2 database derived from Pfam has not been downloaded and built using the `build_pfam_db.sh` script, you must specify the path to the directory containing the MMseqs2 database built from Pfam.
If you used the `./build_pfam_db.sh` script, you can keep the default value.
> For more information, please see [Databases]() and [Tutorial]().
### __`--pfam_name `__
> *default: pfamDB*
If `--scan_pfam` is set to `true` and the MMseqs2 database derived from Pfam has not been downloaded and built using the `build_pfam_db.sh` script, you must specify the name of the MMseqs2 database. \
The database name should match the common prefix of all files in the directory provided via the `--pfam_path` parameter.
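
If you are unsure which name to pass, the common prefix of the files in the database directory can be computed directly. The following is a minimal sketch using `os.path.commonprefix`; the file names listed are illustrative assumptions about what a database directory might contain, not a guaranteed listing.

```python
import os

# Hypothetical contents of the directory given to --pfam_path.
file_names = [
    "pfamDB",
    "pfamDB.dbtype",
    "pfamDB.index",
    "pfamDB_h",
    "pfamDB_h.dbtype",
    "pfamDB_h.index",
]

# commonprefix compares the names character by character.
db_name = os.path.commonprefix(file_names)
print(db_name)
```

With a real directory, `file_names` could be obtained with `os.listdir(path)`. The same check applies to `--alphafold_name` and `--alphafold_path`.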
AlphaFold Protein Structure Database scan
LAGOON-MCL aligns user-supplied sequences against an MMseqs2 database built from the AlphaFold Protein Structure Database, specifically from the sequences found in the AlphaFold Clusters database. Because each sequence in the AlphaFold Protein Structure Database has a UniProt identifier, LAGOON-MCL uses this link to provide Pfam annotations.
> For more information on selecting AlphaFold sequences, building the MMseqs2 database, and retrieving information from the UniProtKB database, please see [Databases]().
### __`--alphafold_path `__
> *default: ${projectDir}/database/alphafoldDB*
If the MMseqs2 database derived from AlphaFold data has not been built using the `build_alphafold_db.sh` script, you must specify the path to the directory containing the MMseqs2 database built from AlphaFold.
If you used the `./build_alphafold_db.sh` script, you can keep the default value.
### __`--alphafold_name `__
> *default: alphafoldDB*
If the MMseqs2 database derived from AlphaFold data has not been downloaded and built using the `build_alphafold_db.sh` script, you must specify the name of the MMseqs2 database. \
The database name should match the common prefix of all files in the directory provided via the `--alphafold_path` parameter.
### __`--uniprot `__
> *default: ${projectDir}/database/uniprot_function.json*
The UniProt file is a JSON file with the following structure:
- `Key`: UniProt identifier
- `Value`: list of Pfam identifiers associated with the UniProt identifier
This file is generated using the `build_alphafold_db.sh` script.
If you obtained the file without using the script, use this parameter to specify its path.
> For more information, please see [Databases]() and [Tutorial]().
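
To make the key/value structure described above concrete, the sketch below builds a small mapping and serialises it with Python's `json` module. The UniProt and Pfam identifiers are placeholders chosen for illustration, and the indentation of the real file may differ.

```python
import json

# Placeholder identifiers: UniProt accession -> list of Pfam identifiers.
uniprot_function = {
    "P12345": ["PF00001", "PF00002"],
    "Q67890": ["PF00003"],
}

json_text = json.dumps(uniprot_function, indent=2)
print(json_text)

# Reading the structure back and looking up one sequence.
loaded = json.loads(json_text)
pfam_for_p12345 = loaded["P12345"]
print(pfam_for_p12345)
```

A file with this shape can be checked before a run by loading it with `json.load` and confirming that every value is a list.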
Sequence Similarity Network
The sequence similarity network (SSN) is constructed from pairwise alignments generated using Diamond BLASTp.
### __`--alignment_file `__
> *default: null*
If you have a file containing pairwise alignments of your sequences, you can provide it to LAGOON-MCL using this parameter. \
If `--alignment_file` is not `null`, LAGOON-MCL will skip running Diamond for pairwise sequence alignment.
The alignment file must follow a specific format. Please consult [Input Files]() for details.
### __`--sensitivity `__
> *default: very-sensitive*
This parameter specifies the sensitivity of Diamond BLASTp when aligning sequences.
You can choose from the following options: `fast`, `mid-sensitive`, `sensitive`, `more-sensitive`, `very-sensitive`, or `ultra-sensitive`.
> For more information, please consult the [Diamond documentation]().
### __`--matrix `__
> *default: BLOSUM62*
This parameter specifies the scoring matrix used by Diamond BLASTp for sequence alignment. \
You can choose from: `BLOSUM45`, `BLOSUM50`, `BLOSUM62`, `BLOSUM80`, `BLOSUM90`, `PAM250`, `PAM70`, or `PAM30`.
> For more information, please consult the [Diamond documentation]().
### __`--diamond_evalue `__
> *default: 0.001*
Specifies the maximum E-value for reporting an alignment in Diamond BLASTp.
> For more information, please consult the [Diamond documentation]().
Clustering
After the sequence similarity network (SSN) is built, graph clustering is performed on it using the Markov Clustering (MCL) algorithm.
Definitions of the terms used in this section:
- **Nodes**: In an SSN, a node represents a sequence.
- **Edges**: In an SSN, an edge represents the relationship between two sequences based on an alignment metric (e.g., identity, E-value, etc.).
- **Weight**: In an SSN, the weight represents the value of a specific metric used for the edge (e.g., identity).
### __`--I `__
> *default: 1.4,2*
`I` corresponds to the inflation parameter used by MCL. This parameter controls the granularity of clustering:
- A low inflation value produces larger clusters.
- A high inflation value produces smaller clusters.
The inflation parameter typically ranges from 1.4 to 5.
With LAGOON-MCL, you can provide multiple inflation values to compare the resulting clusters.
Separate each value with a comma, for example: `"1.4,2,4,5"`.
> The size of a cluster corresponds to the number of nodes it contains.
> For more information, please consult the [MCL documentation]().
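
The comma-separated format can be checked before a run. The following is a minimal sketch of how such a string splits into individual inflation values; the helper `parse_inflation_values` is hypothetical and does not reflect how LAGOON-MCL parses the parameter internally.

```python
def parse_inflation_values(raw):
    """Split a comma-separated string such as "1.4,2,4,5" into floats."""
    values = [float(item) for item in raw.split(",") if item.strip()]
    if not values:
        raise ValueError("at least one inflation value is required")
    return values


inflations = parse_inflation_values("1.4,2,4,5")
print(inflations)
```

Each value in the resulting list corresponds to one clustering run, so four values yield four sets of clusters to compare.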
### __`--max_weight `__
> *default: 200*
To apply clustering to a network, MCL requires a three-column tab-delimited file (TSV).
This file is derived from the alignment file obtained with Diamond BLASTp.
- The first two columns contain the sequence identifiers.
- The third column contains the edge weights.
LAGOON-MCL uses the E-value as the edge weight. Each E-value is transformed into its negative base-10 logarithm.
The resulting weight is capped at the value of `--max_weight`: with the default of 200, any E-value smaller than or equal to $10^{-200}$ receives a weight of 200.
For example, for an E-value of $10^{-300}$, the negative base-10 logarithm is:
$$
-\log_{10}(10^{-300}) = 300
$$
However, since $10^{-300} < 10^{-200}$, the final weight is capped at 200.
> For more information on how to use this parameter, please see [Tutorial]().
> For further details, please consult the [MCL documentation]().
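
The transformation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the pipeline's actual implementation: the function name `evalue_to_weight` is hypothetical, and treating an E-value reported as 0 as the maximum weight is an assumption made here to avoid taking the logarithm of zero.

```python
import math


def evalue_to_weight(evalue, max_weight=200.0):
    """Convert an E-value to an edge weight: -log10(E), capped at max_weight."""
    if evalue <= 0.0:
        # Assumption: an E-value of 0 gets the maximum weight.
        return max_weight
    return min(-math.log10(evalue), max_weight)


for evalue in (1e-3, 1e-50, 1e-300, 0.0):
    print(evalue, evalue_to_weight(evalue))
```

Lowering the cap compresses the weights of the strongest alignments into a single value, while raising it lets very small E-values keep distinct weights.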
### __`--cluster_size `__
> *default: 2*
This parameter sets the minimum number of nodes required for a cluster to be retained.
By default, clusters must contain at least two nodes.
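
The effect of this threshold can be illustrated with a short filter. The sketch below assumes that clusters are available as lists of sequence identifiers; the identifiers and the helper name `filter_clusters` are hypothetical.

```python
def filter_clusters(clusters, min_size=2):
    """Keep only clusters containing at least min_size nodes."""
    return [cluster for cluster in clusters if len(cluster) >= min_size]


# Hypothetical clusters of sequence identifiers.
clusters = [
    ["seq1", "seq2", "seq3"],
    ["seq4", "seq5"],
    ["seq6"],  # singleton, dropped with the default threshold
]

retained = filter_clusters(clusters)
print(retained)
```

With the default of 2, singleton clusters are discarded; a higher value such as 3 would also drop the two-node cluster.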
Resources
### __`--max_cpus `__
> *default: 200*
Specifies the maximum number of CPUs that can be used by a process. \
If this limit is exceeded, the pipeline execution will be terminated.
### __`--max_memory `__
> *default: 750.GB*
Specifies the maximum amount of RAM that can be used by a process. \
If this limit is exceeded, the pipeline execution will be terminated.
### __`--max_time `__
> *default: 350.h*
Specifies the maximum execution time for a process. \
If this limit is exceeded, the pipeline execution will be terminated.
Other command line parameters
### __`--help`__
Show help.
### __`-w / --workdir `__
> *default: /path/to/lagoon-mcl/workdir*
Specifies the path to the pipeline's working directory.
### __`--projectName `__
> *default: lagoon-mcl*
Project name. Used to name the working directory.
### __`-resume`__
Use this parameter when restarting the pipeline after an interruption or error. \
Nextflow will use cached results for any pipeline steps where the inputs are unchanged, allowing the pipeline to continue from where it stopped.