2. Command line option
- Input parameters
- Output parameters
- Pfam database scan
- AlphaFold Protein Structure Database scan
- Sequence Similarity Network
- Clustering
- Resources
- Other command line parameters
default: null
Enter the path to the FASTA files that LAGOON-MCL should process.
If you want to include multiple FASTA files, enclose the path in quotes (").
For example: "/path/to/lagoon-mcl/data-test/fasta/*.fasta"
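To check which files a quoted pattern would pick up before launching the pipeline, the pattern can be expanded locally. The sketch below is a minimal illustration in Python, assuming standard shell-style globbing (the matching performed by the pipeline itself may differ in edge cases). `preview_fasta_matches` is a hypothetical helper, not part of LAGOON-MCL.

```python
import glob


def preview_fasta_matches(pattern):
    """Return the sorted list of paths matching a shell-style glob pattern."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Example pattern taken from the text above; adjust it to your own data.
    pattern = "/path/to/lagoon-mcl/data-test/fasta/*.fasta"
    for path in preview_fasta_matches(pattern):
        print(path)
```

An empty result usually means the pattern or the directory is wrong, which is cheaper to find out here than after the pipeline has started.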
default: null
If you have sequence-specific labels, annotations, or additional information (e.g., taxonomy, function, etc.), you can provide them to LAGOON-MCL by creating a CSV (comma-separated values) file with two columns:
- annotation: the name of the label, annotation, or information (e.g., species, Pfam, etc.)
- file: the path to the file containing the corresponding label, annotation, or information
For more details about this file, please see 5. Input files.
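The two-column layout described above can be generated programmatically. The following Python sketch is illustrative only: it assumes the CSV starts with an `annotation,file` header row, and the file paths are hypothetical placeholders. `build_annotation_csv` is not part of LAGOON-MCL; check 5. Input files for the exact format expected.

```python
import csv
import io


def build_annotation_csv(entries):
    """Return CSV text with one (annotation, file) row per entry.

    Assumption: the file begins with an 'annotation,file' header row.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["annotation", "file"])
    writer.writerows(entries)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical annotation names and paths, for illustration only.
    rows = [
        ("species", "/path/to/species.tsv"),
        ("function", "/path/to/function.tsv"),
    ]
    print(build_annotation_csv(rows), end="")
```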
default: ${projectDir}/results
Directory where LAGOON-MCL will save the output files. This directory can be located outside the folder containing the input files used to run the pipeline.
With LAGOON-MCL, you can align your own sequences against the Pfam database using MMseqs2.
If the --annotation parameter is not provided, LAGOON-MCL will automatically scan the Pfam database.
For more information about the Pfam database, see 4. Databases, and for details on the alignment step, see 3. Tutorial.
default: true
Boolean parameters only accept true or false (must be lowercase).
- true: LAGOON-MCL aligns sequences against the Pfam database using MMseqs2.
- false: LAGOON-MCL does not align sequences against the Pfam database.
If the --annotation parameter is null, LAGOON-MCL aligns sequences against the Pfam database even if --scan_pfam is set to false.
default: ${projectDir}/database/pfamDB
If --scan_pfam is set to true and the MMseqs2 database derived from Pfam has not been downloaded and built using the build_pfam_db.sh script, you must specify the path to the directory containing the MMseqs2 database built from Pfam.
If you used the ./build_pfam_db.sh script, you can keep the default value.
default: pfamDB
If --scan_pfam is set to true and the MMseqs2 database derived from Pfam has not been downloaded and built using the build_pfam_db.sh script, you must specify the name of the MMseqs2 database.
The database name should match the common prefix of all files in the directory provided via the --pfam_path parameter.
LAGOON-MCL aligns user-supplied sequences against an MMseqs2 database built from sequences in the AlphaFold Protein Structure Database and, more specifically, from sequences that can be found in the AlphaFold Clusters database. As the sequences in the AlphaFold Protein Structure Database have a UniProt identifier, LAGOON-MCL uses this link to provide Pfam annotations.
For more information on selecting AlphaFold sequences, building the MMseqs2 database, and retrieving information from the UniProtKB database, please see 4. Databases.
default: ${projectDir}/database/alphafoldDB
If the MMseqs2 database derived from AlphaFold data has not been built using the build_alphafold_db.sh script, you must specify the path to the directory containing the MMseqs2 database built from AlphaFold.
If you used the ./build_alphafold_db.sh script, you can keep the default value.
default: alphafoldDB
If the MMseqs2 database derived from AlphaFold data has not been downloaded and built using the build_alphafold_db.sh script, you must specify the name of the MMseqs2 database.
The database name should match the common prefix of all files in the directory provided via the --alphafold_path parameter.
default: ${projectDir}/database/uniprot_function.json
The UniProt file is a JSON file with the following structure:
- Key: UniProt identifier
- Value: list of Pfam identifiers associated with the UniProt identifier
This file is generated using the build_alphafold_db.sh script.
If you obtained the file without using the script, use this parameter to specify its path.
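The key/value structure described above can be illustrated with a short Python sketch. The identifiers below are placeholders rather than real UniProt or Pfam accessions, and `pfam_ids_for` is a hypothetical helper, not part of LAGOON-MCL.

```python
import json

# Placeholder content mirroring the described structure:
# UniProt identifier -> list of associated Pfam identifiers.
EXAMPLE_MAPPING = {
    "UNIPROT_ID_1": ["PFAM_ID_A", "PFAM_ID_B"],
    "UNIPROT_ID_2": ["PFAM_ID_C"],
}


def pfam_ids_for(uniprot_json_text, uniprot_id):
    """Return the Pfam identifiers listed for a UniProt identifier, or []."""
    mapping = json.loads(uniprot_json_text)
    return mapping.get(uniprot_id, [])


if __name__ == "__main__":
    text = json.dumps(EXAMPLE_MAPPING, indent=2)
    print(text)
    print(pfam_ids_for(text, "UNIPROT_ID_1"))
```

A file obtained without the script can be checked the same way: it should load as a single JSON object whose values are all lists.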
The sequence similarity network (SSN) is constructed from pairwise alignments generated using Diamond BLASTp.
default: null
If you have a file containing pairwise alignments of your sequences, you can provide it to LAGOON-MCL using this parameter.
If --alignment_file is not null, LAGOON-MCL will skip running Diamond for pairwise sequence alignment.
The alignment file must follow a specific format. Please see 5. Input files for details.
default: very-sensitive
This parameter specifies the sensitivity of Diamond BLASTp when aligning sequences.
You can choose from the following options: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, or ultra-sensitive.
For more information, please consult the Diamond documentation.
default: BLOSUM62
This parameter specifies the scoring matrix used by Diamond BLASTp for sequence alignment.
In addition to the default BLOSUM62, you can choose from: BLOSUM45, BLOSUM50, BLOSUM80, BLOSUM90, PAM250, PAM70, or PAM30.
For more information, please consult the Diamond documentation.
default: 0.001
Specifies the maximum E-value for reporting an alignment in Diamond BLASTp.
For more information, please consult the Diamond documentation.
Once the sequence similarity network (SSN) has been built, it is clustered using the Markov Clustering (MCL) algorithm.
Definitions of the terms used in this section:
- Nodes: In an SSN, a node represents a sequence.
- Edges: In an SSN, an edge represents the relationship between two sequences based on an alignment metric (e.g., identity, E-value, etc.).
- Weight: In an SSN, the weight represents the value of a specific metric used for the edge (e.g., identity).
default: 1.4,2
I corresponds to the inflation parameter used by MCL. This parameter controls the granularity of clustering:
- A low inflation value produces larger clusters.
- A high inflation value produces smaller clusters.
The inflation parameter typically ranges from 1.4 to 5.
With LAGOON-MCL, you can provide multiple inflation values to compare the resulting clusters.
Separate each value with a comma, for example: "1.4,2,4,5".
The size of a cluster corresponds to the number of nodes it contains.
For more information, please consult the MCL documentation.
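As an illustration of the comma-separated format, the sketch below splits such a string into individual inflation values in Python. It only demonstrates the syntax; `parse_inflation_values` is a hypothetical helper and does not reflect how LAGOON-MCL parses the parameter internally.

```python
def parse_inflation_values(raw):
    """Split a comma-separated string such as '1.4,2,4,5' into floats."""
    values = [float(token) for token in raw.split(",") if token.strip()]
    if not values:
        raise ValueError("at least one inflation value is required")
    return values


if __name__ == "__main__":
    # Each value corresponds to one clustering run to compare.
    print(parse_inflation_values("1.4,2,4,5"))
```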
default: 200
To apply clustering to a network, MCL requires a three-column tab-delimited file (TSV).
This file is derived from the alignment file obtained with Diamond BLASTp.
- The first two columns contain the sequence identifiers.
- The third column contains the edge weights.
LAGOON-MCL uses the E-value as the weight. Each E-value is transformed into its negative base-10 logarithm.
With the default value, the weight is capped at 200: any E-value smaller than or equal to 1e-200 receives a weight of 200.
For example, for an E-value of 1e-300:
$$
-\log_{10}(10^{-300}) = 300
$$
However, since 300 exceeds the maximum weight, the weight is set to 200.
For more information on how to use this parameter, please see 3. Tutorial.
For further details, please consult the MCL documentation.
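The E-value transformation and capping described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and treating an E-value reported as 0 as the maximum weight is an assumption.

```python
import math


def evalue_to_weight(evalue, max_weight=200.0):
    """Convert an E-value into an edge weight: -log10(E), capped at max_weight."""
    if evalue <= 0.0:
        # Assumption: an E-value of 0 has no finite logarithm,
        # so it is mapped to the maximum weight.
        return max_weight
    return min(-math.log10(evalue), max_weight)


def to_network_row(query_id, subject_id, evalue, max_weight=200.0):
    """Build one tab-delimited row: two sequence identifiers and the weight."""
    weight = evalue_to_weight(evalue, max_weight)
    return f"{query_id}\t{subject_id}\t{weight:g}"


if __name__ == "__main__":
    # seqA and seqB are placeholder sequence identifiers.
    print(to_network_row("seqA", "seqB", 1e-300))
    print(to_network_row("seqA", "seqC", 1e-50))
```

Capping keeps a handful of extremely small E-values from dominating the edge weights passed to MCL.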
default: 2
This parameter sets the minimum number of nodes required for a cluster to be retained.
By default, clusters must contain at least two nodes.
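The effect of this threshold can be sketched in Python. The example assumes a cluster file with one cluster per line and whitespace-separated sequence identifiers. `filter_clusters` is a hypothetical helper, not the pipeline's implementation.

```python
def filter_clusters(lines, min_size=2):
    """Keep only clusters that contain at least min_size nodes.

    Assumption: each line holds one cluster as whitespace-separated node names.
    """
    clusters = []
    for line in lines:
        members = line.split()
        if len(members) >= min_size:
            clusters.append(members)
    return clusters


if __name__ == "__main__":
    # Placeholder identifiers; the singleton cluster "seqD" is discarded.
    example = ["seqA\tseqB\tseqC", "seqD", "seqE\tseqF"]
    for cluster in filter_clusters(example, min_size=2):
        print(len(cluster), cluster)
```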
default: 200
Specifies the maximum number of CPUs that can be used by a process.
If this limit is exceeded, the pipeline execution will be terminated.
default: 750.GB
Specifies the maximum amount of RAM that can be used by a process.
If this limit is exceeded, the pipeline execution will be terminated.
default: 350.h
Specifies the maximum execution time for a process.
If this limit is exceeded, the pipeline execution will be terminated.
Show help.
default: /path/to/lagoon-mcl/workdir
Specifies the path to the pipeline's working directory.
default: lagoon-mcl
Project name. Used to name the working directory.
Use this parameter when restarting the pipeline after an interruption or error.
Nextflow will use cached results for any pipeline steps where the inputs are unchanged, allowing the pipeline to continue from where it stopped.