barapost binning

masikol edited this page May 26, 2023 · 8 revisions

Description

barapost-binning.py -- this script is designed for binning (dividing into separate files) FASTQ and FASTA files processed by "barapost-local.py".

It can also bin FAST5 files according to the taxonomic annotation of the FASTQ files obtained by basecalling those FAST5 files. See FAST5 binning and FAST5 untwisting sections below and Examples #4,5 in Workflow examples section.

Input files

  • Input FASTA and FASTQ files should be specified as positional arguments (see examples below).
  • Input files must have different names. I.e. files .../dir1/reads.fastq and .../dir2/reads.fastq are not allowed.
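A quick pre-flight check for the unique-name rule can be sketched in Python. This is an illustrative helper, not part of barapost; the function name `find_name_collisions` is hypothetical.

```python
import os
from collections import defaultdict


def find_name_collisions(paths):
    """Group paths by file name and return only the names shared by several paths."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[os.path.basename(path)].append(path)
    return {name: group for name, group in by_name.items() if len(group) > 1}


# Example input: two files share the name "reads.fastq".
collisions = find_name_collisions([
    "dir1/reads.fastq",
    "dir2/reads.fastq",
    "dir2/other.fasta",
])
for name, group in collisions.items():
    print(f"{name}: {', '.join(group)}")
```

Renaming one of the reported files before running the script avoids the conflict.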

Default parameters

  • if no input files are specified, all FASTQ, FASTA and FAST5 files in current directory will be processed;
  • binning sensitivity (see -s option): 5 (genus);
  • output directory (-o option): directory named binning_result_<date_and_time_of_run> nested in working directory;
  • "FAST5 untwisting" is disabled;
  • number of CPU threads to use (-t option): 1;
  • minimum mean quality of a read to keep (-q option): 10;
  • filtering by length (-m option) is disabled;
  • filtering by alignment identity (-i option) is disabled;
  • filtering by alignment coverage (-c option) is disabled;
  • barapost-binning generates trash file(s) (see -n flag).

Options

    -h (--help) --- show help message.
        '-h' -- brief, '--help' -- full;

    -v (--version) --- show version;

    -r (--annot-resdir) --- result directory generated
        by script 'barapost-binning.py'.
        This is the directory specified to 'barapost-binning.py' with the '-o' option
        and to 'barapost-local.py' with the '-r' option.
        Default value is "barapost_result";

    -d (--indir) --- directory which contains FASTQ, FASTA or FAST5 files
        meant to be processed.
        I.e. all FASTQ, FASTA and FAST5 files in this directory will be processed;

    -o (--outdir) --- output directory;

    -s (--binning-sensitivity) --- binning sensitivity,
        i.e. the lowest taxonomy rank that barapost-binning regards;
        Available values:
        0 for domain, 1 for phylum, 2 for class, 3 for order,
        4 for family, 5 for genus, 6 for species.
        Default is 5 (genus);

    -u (--untwist-fast5) --- flag option. If specified, FAST5 files will be
        binned considering that corresponding FASTQ files
        may contain reads from other FAST5 files
        and reads from a particular FAST5 file may be
        distributed among multiple FASTQ files.
        For details, see "FAST5 untwisting" section below.
        Disabled by default;

    -t (--threads) --- number of CPU threads to use.
        Affects only FASTA and FASTQ binning
        (for reasons, see section "Binning details", item 2).
        barapost-binning processes FAST5 files in 1 thread anyway
        (but it can perform "FAST5 untwisting" in parallel);

Filters:

  -n (--no-trash) --- flag option. If specified:
      1) trash files will not be written;
      2) sequences that do not pass the filters won't be written anywhere.

  Quality and length filters:

    -q (--min-qual) --- threshold for quality filter;
        Reads of lower quality will be written to separate "trash" file;
        Default value: 10;

    -m (--min-seq-len) --- threshold for query length filter.
        Shorter sequences will be written to separate "trash" file (see Example #2).
        This filter is disabled by default;

  Alignment significance filters:

    -i (--min-pident) --- threshold (in percent) for alignment identity filter.
        Sequences that align to their best hit with lower identity will be
          written to separate "align_trash" file (see Example #2).
        This filter is disabled by default;

    -c (--min-coverage) --- threshold (in percent) for alignment coverage filter.
        Sequences that align to their best hit with lower coverage will be
          written to separate "align_trash" file (see Example #2).
        This filter is disabled by default;
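To illustrate the -s scale from the options above, here is a minimal Python sketch. It is not barapost-binning's actual code: it assumes a lineage is available as a list ordered from domain to species, and the names `RANKS` and `bin_name_for` are hypothetical. How the script handles lineages with missing ranks is not covered here.

```python
# Hypothetical illustration of the -s (binning sensitivity) scale.
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def bin_name_for(lineage, sensitivity=5):
    """Return a bin name taken from the rank numbered `sensitivity` (0..6).

    `lineage` is assumed to be a list of taxon names ordered as in RANKS.
    """
    if not 0 <= sensitivity < len(RANKS):
        raise ValueError(f"sensitivity must be in 0..{len(RANKS) - 1}")
    if len(lineage) <= sensitivity:
        raise ValueError(f"lineage has no {RANKS[sensitivity]} rank")
    # Spaces become underscores, as in "Pectobacterium_carotovorum.fast5".
    return lineage[sensitivity].replace(" ", "_")


lineage = [
    "Bacteria", "Pseudomonadota", "Gammaproteobacteria", "Enterobacterales",
    "Pectobacteriaceae", "Pectobacterium", "Pectobacterium carotovorum",
]
print(bin_name_for(lineage))     # genus level (default): Pectobacterium
print(bin_name_for(lineage, 6))  # species level: Pectobacterium_carotovorum
```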

Explanation of output files

  • binned files named according to taxonomic classification of input sequences (e.g. Pseudomonas.fastq.gz, Pectobacterium_carotovorum.fast5 and so on);

  • trash file. Too short (see -m option) and/or low-quality (see -q option) sequences will be placed in this file;

  • align_trash file. Sequences that align to their best hit with low identity (see -i option) or coverage (see -c option) will be placed in this file;

    Note that barapost-binning applies identity and coverage filters after quality and length ones. It means that only sequences that pass the quality and length filters can end up in the align_trash file.

  • an unknown file for sequences for which no similarity was found during classification;

  • a classification_not_found file for sequences that are present in the input files but have not been classified at all.
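The order of the filters described above can be sketched as a small Python function. This is a simplified model under stated assumptions, not the script's implementation: the function name and parameters are hypothetical, a threshold of `None` means the filter is disabled, sequences without a hit are assumed to skip the alignment filters, and the classification_not_found case is left out.

```python
def choose_output_file(taxon, length, mean_qual=None, pident=None, coverage=None,
                       min_qual=10, min_len=None, min_pident=None, min_coverage=None):
    """Return the name of the bin a sequence would go to in this simplified model."""
    # Stage 1: quality and length filters are applied first.
    # FASTA records have no quality, so mean_qual may be None.
    if mean_qual is not None and mean_qual < min_qual:
        return "trash"
    if min_len is not None and length < min_len:
        return "trash"
    # Assumption: a sequence with no hit has no alignment to evaluate.
    if taxon is None:
        return "unknown"
    # Stage 2: identity and coverage filters see only the survivors of stage 1.
    if min_pident is not None and pident is not None and pident < min_pident:
        return "align_trash"
    if min_coverage is not None and coverage is not None and coverage < min_coverage:
        return "align_trash"
    return taxon


# A low-quality read goes to "trash" even though its identity is also low.
print(choose_output_file("Pseudomonas", 5000, mean_qual=8, pident=50, min_pident=90))
# A good-quality read with low identity goes to "align_trash".
print(choose_output_file("Pseudomonas", 5000, mean_qual=12, pident=50, min_pident=90))
```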

Binning details

  1. If you include SPAdes or a5 assembly FASTA files (files that contain contigs) in your database, sequences that hit them will be binned in a specific way. There are two possible situations:

    a) there is one FASTA file with an assembly (or there are two files, one generated by SPAdes and the other by a5). In this case, if you bin your sequences by any rank from domain to genus, you'll get files named "SPAdes_assembly_NODE.fastq.gz" or "a5_assembly_scaffold.fastq.gz", depending on the assembler. It means that all sequences that hit contigs in your assembly will be placed in one file. If binning sensitivity is "species", you'll get a separate file for each NODE (or scaffold, if you've used a5). For example: "SPAdes_assembly_NODE_6.fastq.gz" or "a5_assembly_scaffold_8.fastq.gz".

    b) there is more than one FASTA file with assemblies generated by the same assembler. For example, you have two SPAdes outputs: "outdir_1/contigs.fasta" and "outdir_2/contigs.fasta". Paths to these files will be used to name binned files. In this case, if you bin your sequences by genus, you'll get files named "SPAdes_assembly__outdir_1_contigs.fasta_NODE.fastq.gz" and "SPAdes_assembly__outdir_2_contigs.fasta_NODE.fastq.gz". The situation with a5 is the same: "a5" is written instead of "SPAdes" (and "scaffold" instead of "NODE"). As you can see, path separators are replaced by underscores to avoid creating a mess of nested directories in the file system. If binning sensitivity is "species", you'll get a separate file for each NODE (or scaffold, if you've used a5), just as in 'a)' above, but paths will be included (e.g. "SPAdes_assembly__outdir_1_contigs.fasta_NODE_3.fastq.gz").

    "barapost-binning.py" detects SPAdes and a5 assembly files by looking at sequence IDs. If an ID looks like "NODE_1_length_245432_cov_23.5412", the file was probably produced by SPAdes. If an ID looks like "scaffold_1", it looks like a5 output. "barapost-binning.py" checks only the first sequence ID in a file.

  2. Parallel FAST5 binning is not implemented and probably won't be, because it gives no performance gain. The point is that writing to FAST5 files takes much more time than 'calculating'. Thus threads would mostly wait in a queue for writing rather than do useful work.
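The assembly detection and naming rules from item 1 can be sketched as follows. This is only an illustration: the regular expressions are assumptions derived from the two example IDs above, and the function names are hypothetical rather than taken from the script.

```python
import re

# Assumed ID patterns, modelled on "NODE_1_length_245432_cov_23.5412" and "scaffold_1".
SPADES_ID = re.compile(r"^NODE_\d+_length_\d+_cov_\d+(\.\d+)?")
A5_ID = re.compile(r"^scaffold_\d+")


def detect_assembler(first_seq_id):
    """Guess the assembler from the first sequence ID of a FASTA file."""
    if SPADES_ID.match(first_seq_id):
        return "SPAdes"
    if A5_ID.match(first_seq_id):
        return "a5"
    return None


def assembly_bin_name(assembler, fasta_path=None, unit_number=None):
    """Build a bin name (without extension) for reads hitting an assembly.

    fasta_path is given only when several assemblies come from one assembler;
    unit_number is given only when binning sensitivity is "species".
    """
    unit = "NODE" if assembler == "SPAdes" else "scaffold"
    name = f"{assembler}_assembly"
    if fasta_path is not None:
        # Path separators are replaced by underscores.
        name += "__" + fasta_path.replace("/", "_").replace("\\", "_")
    name += f"_{unit}"
    if unit_number is not None:
        name += f"_{unit_number}"
    return name


print(detect_assembler("NODE_1_length_245432_cov_23.5412"))
print(assembly_bin_name("SPAdes", "outdir_1/contigs.fasta", 3) + ".fastq.gz")
```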

FAST5 binning

FAST5 files can be binned by barapost-binning.py.

Note: the Barapost toolkit does not perform basecalling of nanopore data.

Since it is recommended to keep your FAST5 files in order to re-basecall them later with more accurate basecalling algorithms (e.g. ones more sensitive to base modifications), it is worth following the pipeline below:

  1. Basecall FAST5 files and get FASTQ.
  2. Perform taxonomic annotation of obtained FASTQ files.
  3. Sort source FAST5 files according to this taxonomic annotation.
  4. Keep binned FAST5 files in order to re-basecall them later.

See Examples #3,4 in Workflow examples for details.

FAST5 untwisting

The problem is the following: basecallers (the popular Guppy, in particular) often misassign names of input FAST5 and output FASTQ files. As a result, source FAST5 and basecalled FASTQ files contain different reads although their names match one another. Therefore, straightforward binning of FAST5 files that relies on the names of "corresponding" FASTQ files (which have undergone taxonomic annotation) is often impossible.

In barapost-binning.py, this issue is solved by a "FAST5 untwisting" procedure (it can be enabled by specifying the -u flag).

"Untwisting" is performed by creating a DBM index file that maps each read in the FAST5 files to the TSV file containing the taxonomic annotation of that read. Subsequent binning proceeds according to this index file.

See Example #4 in Workflow examples for details.
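The idea of such an index can be sketched with Python's standard `dbm` module. This is a minimal illustration, not the script's actual index format: it assumes read IDs are used as keys and TSV paths as values, and the function names, read IDs and file names are hypothetical.

```python
import dbm
import os
import tempfile


def build_index(index_path, annotations):
    """Write a DBM index mapping each read ID to the TSV file that annotates it.

    `annotations` maps a TSV path to the read IDs it contains.
    """
    with dbm.open(index_path, "n") as index:
        for tsv_path, read_ids in annotations.items():
            for read_id in read_ids:
                index[read_id] = tsv_path  # str keys and values are stored as bytes


def lookup(index_path, fast5_read_ids):
    """Return ({read ID: TSV path}, [read IDs absent from the index])."""
    found, missing = {}, []
    with dbm.open(index_path, "r") as index:
        for read_id in fast5_read_ids:
            key = read_id.encode("utf-8")
            if key in index:
                found[read_id] = index[key].decode("utf-8")
            else:
                missing.append(read_id)
    return found, missing


with tempfile.TemporaryDirectory() as tmpdir:
    index_path = os.path.join(tmpdir, "fast5_to_tsv_idx")
    build_index(index_path, {
        "run_0.tsv": ["read_a", "read_b"],
        "run_1.tsv": ["read_c"],
    })
    # Reads of one FAST5 file may be annotated in different TSV files.
    found, missing = lookup(index_path, ["read_a", "read_c", "read_x"])
    print(found)
    print(missing)
```

Looking up every read of a FAST5 file in such an index is what lets binning ignore file names entirely.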

The "untwisting" procedure also checks whether all reads in the input FAST5 files have undergone taxonomic annotation and reports the IDs of any missing ones.

One obvious disadvantage: you may need to perform taxonomic annotation of all your FASTQ files to bin some (maybe not all) FAST5 files from the same data set.

Here another problem arises: how do you find out which FASTQ file(s) contain the reads from a given FAST5 file? You can find this information in the "sequencing_summary" file, which is often generated by the basecaller (at least, Guppy does so). But these files are rather bulky and not very convenient to use (and often lack essential information, like names of FASTQ files).

The auxiliary script "BCsummarizer.py" addresses this problem.

Examples

Note for Windows users: run py -3 barapost-binning.py in the Windows console. Calling barapost-binning.py directly won't work.

  1. Process all FASTA, FASTQ and FAST5 files in working directory with default settings:

barapost-binning.py

  2. Process all files starting with "some_my_fastq" in the working directory. Move reads with mean quality < Q15 to the "trash" file. Move sequences shorter than 3000 bp to the "trash" file. Move sequences that align to their best hit with identity or coverage lower than 90% to the "align_trash" file:

barapost-binning.py some_my_fastq* -q 15 -m 3000 -i 90 -c 90

  3. Process one FASTQ file with default settings. File reads.fastq.gz has already been classified and results of classification are in directory prober_outdir:

barapost-binning.py reads.fastq.gz -r prober_outdir/

  4. Process a FASTQ file and a FASTA file, place results in outdir directory. Files reads.fastq.gz and another_sequences.fasta have already been classified and results of classification are in directory prober_outdir:

barapost-binning.py reads.fastq.gz another_sequences.fasta -o outdir -r prober_outdir/

  5. Process all FASTQ, FASTA and FAST5 files in directory named dir_with_seqs. Sort by species (-s 6). All these files have already been classified and results of classification are in directory prober_outdir. Perform "FAST5 untwisting":

barapost-binning.py -d dir_with_seqs -o outdir -r prober_outdir/ -s 6 -u