seqator

masikol edited this page Mar 1, 2023 · 5 revisions

Description

This script performs binning. It bins sequences (in fasta format) by their SPAdes-like headers, and sequence-containing files by their SPAdes-like file names.

“SPAdes-like” means the following format:

NODE_1_length_61704_cov_114.517

Seqator can bin by sequence length or by coverage (the length and cov fields of a SPAdes-like header or file name).
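To make the format concrete, here is a minimal sketch of how the length and coverage values could be pulled out of a SPAdes-like name. The regular expression and the parse_spades_like function are illustrative assumptions, not code taken from seqator itself.

```python
import re

# Hypothetical pattern for illustration; seqator's own parsing may differ.
SPADES_LIKE = re.compile(r"NODE_\d+_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_spades_like(name):
    """Return (length, coverage) from a SPAdes-like name, or None if it does not match."""
    match = SPADES_LIKE.search(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


print(parse_spades_like("NODE_1_length_61704_cov_114.517"))  # (61704, 114.517)
print(parse_spades_like("contig_1"))                         # None
```

Because the sketch uses search rather than a full match, it also accepts a fasta header with a leading '>' or a file name with an extension.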

Seqator modes

Seqator works in one of two modes, dir or fasta_file:

  • If seqator mode is dir, then the script will move files which pass the filter from the input directory to the output directory.

  • If seqator mode is fasta_file, the script will copy sequences which pass the filter from the input file to the output file.
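The following sketch illustrates what dir mode amounts to, together with the 'auto' mode selection described in the Options section. The function names resolve_mode and move_matching_files, and the passes_filter callback, are hypothetical and only approximate the behaviour described here; they are not seqator's actual implementation.

```python
import os
import shutil


def resolve_mode(mode, input_path):
    """Pick 'dir' or 'fasta_file' when the requested mode is 'auto'."""
    if mode != "auto":
        return mode
    return "dir" if os.path.isdir(input_path) else "fasta_file"


def move_matching_files(indir, outdir, extension, passes_filter):
    """Move files with the given extension whose names pass the filter.

    passes_filter is a hypothetical callback that takes a file name
    and returns True if the file should be moved.
    """
    os.makedirs(outdir, exist_ok=True)
    moved = []
    for fname in sorted(os.listdir(indir)):
        src = os.path.join(indir, fname)
        # Skip subdirectories and files with a different extension.
        if not os.path.isfile(src) or not fname.endswith("." + extension):
            continue
        if passes_filter(fname):
            shutil.move(src, os.path.join(outdir, fname))
            moved.append(fname)
    return moved
```

Note that files are moved, not copied, so after a dir-mode run the matching files are no longer in the input directory.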

Filter

The seqator filter is customizable. You can choose:

  1. Filter parameter: seqator can filter sequences by one of two parameters, coverage or length.

  2. Filter mode: defines how a sequence's parameter is compared to the threshold. E.g. if the filter mode is lt (Less Than), then the script will match and copy sequences whose parameter (say, coverage) is less than the threshold (say, 25.0). See the details in the Options section (-f option).

  3. Threshold: the threshold to filter by.
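The three filter settings combine into a single comparison. A minimal sketch of that comparison, assuming the six filter modes listed in the Options section map onto Python's standard operator functions (the COMPARATORS table and passes_filter are illustrative names, not seqator's own):

```python
import operator

# Filter mode name -> comparison function (illustrative mapping).
COMPARATORS = {
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
    "eq": operator.eq, "ne": operator.ne,
}


def passes_filter(value, filter_mode, threshold):
    """Return True if value compares to threshold as filter_mode requires."""
    return COMPARATORS[filter_mode](value, threshold)


print(passes_filter(12.0, "lt", 25.0))  # True: 12.0 < 25.0
print(passes_filter(25.0, "lt", 25.0))  # False: 25.0 is not less than 25.0
print(passes_filter(25.0, "le", 25.0))  # True: 25.0 <= 25.0
```

Coverage values are floating-point numbers, so the eq and ne modes are probably more useful with the length parameter than with coverage.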

Dependencies

The script is written in Python, so you need a Python interpreter (version 3.X) to run it. Python can be downloaded from python.org.

Usage

Options

# Input

-i / --input
  Input directory or fasta file, depending on seqator_mode (-m).
  An input fasta file may be gzipped.
  Mandatory.

-x / --target-file-extention
  This option is applicable only if seqator_mode (-m) is 'dir'.
  '-x' is the extension of files to be checked, without the preceding dot.
  E.g. if you want to bin .fasta files, then specify '-x fasta'.
  Optional. Default: 'dna'.

# Output

-o / --output
  Output directory or fasta file, depending on '-m'.
  If the file name ends with '.gz', the output file will be gzipped.
  Optional. By default, the script will create an output directory in the working directory.

# Seqator mode

-m / --seqator-mode
  There are two modes: 'dir' and 'fasta_file'.
  If the mode is 'dir', the script will move files which pass the filter
    from the input directory to the output directory.
  If the mode is 'fasta_file', the script will copy sequences which pass the filter
    from the input file to the output file.
  Also, '-m' may be 'auto'. In 'auto' mode, the mode will be
    'dir' if '-i' is a directory and 'fasta_file' if '-i' is a regular file.
  Optional. Default: 'auto'.

# Filter

-p / --filter-parameter
  There are two sequence parameters to filter by: 'len' and 'cov':
    length and coverage, respectively.
  Optional. Default: 'cov'.

-f / --filter-mode
  There are basically six ways to compare numbers:
    'lt' (Less Than),    'le' (Less or Equal),
    'gt' (Greater Than), 'ge' (Greater or Equal),
    'eq' (EQual to),     'ne' (Not Equal to).
  E.g. if you specify '-p cov -f lt -t 12.5 -m fasta_file', then
    the script will copy all sequences having coverage less than 12.5 to the output file.
  Optional. Default: 'lt'.

-t / --threshold
  Threshold to use for filtering by '-p' parameter.
  See the example for '-f' option -- there you'll see how '-t' option works.
  Mandatory.

# Help and version

-h / --help
  Print help message and exit.

-v / --version
  Print version and exit.
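As an illustration of fasta_file mode with gzipped input or output, here is a minimal sketch. It assumes that compression is detected from the '.gz' file name suffix on both sides; the options above confirm this only for the output file. All function names are hypothetical, and keep stands for any predicate on the sequence header.

```python
import gzip


def open_maybe_gzipped(path, mode):
    """Open a text file, going through gzip if the name ends with '.gz'."""
    # Assumption: compression is decided by the file name alone.
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open fasta file."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def copy_passing_sequences(in_path, out_path, keep):
    """Copy records for which keep(header) is true; return how many were copied."""
    copied = 0
    with open_maybe_gzipped(in_path, "r") as fin, \
            open_maybe_gzipped(out_path, "w") as fout:
        for header, seq in read_fasta(fin):
            if keep(header):
                fout.write(">" + header + "\n" + seq + "\n")
                copied += 1
    return copied
```

Unlike dir mode, this leaves the input file untouched: passing sequences are copied, not moved.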

Examples

Example 1

dir seqator mode.

Move .dna files having contig coverage greater than 10 from directory indir/ to outdir/.

python3 seqator.py \
  -i indir \
  -x dna \
  -p cov \
  -f gt \
  -t 10 \
  -o outdir

Example 2

dir seqator mode.

Move .fna files having sequence length equal to 1000 bp from directory indir/ to outdir/.

python3 seqator.py \
  -i indir \
  -x fna \
  -p len \
  -f eq \
  -t 1000 \
  -o outdir

Example 3

fasta_file seqator mode.

Copy sequences from file input.fasta which have sequence length less than 1000 bp to file output.fasta.gz.

python3 seqator.py \
  -i input.fasta \
  -p len \
  -f lt \
  -t 1000 \
  -o output.fasta.gz

Example 4

fasta_file seqator mode.

Copy sequences from file input.fasta.gz which have contig coverage greater than 10 to file output.fasta.

python3 seqator.py \
  -i input.fasta.gz \
  -p cov \
  -f gt \
  -t 10 \
  -o output.fasta