August 2nd, 2024. Release 4.3.0

@coissac coissac released this 02 Aug 12:32
· 379 commits to master since this release

Change of git repository

  • The OBITools4 git repository has been moved to GitHub.
    The new address is: https://github.com/metabarcoding/obitools4.
    Be sure to use the new install script to retrieve the new version.

    curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh \
      | bash

    or with options:

    curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh \
      | bash -s -- --install-dir test_install --obitools-prefix k

CPU limitation

  • By default, OBITools4 tries to use all the computing power available on
    your computer. In some circumstances this can be problematic (e.g. if you
    are running on a computer cluster managed by your university). You can limit
    the number of CPU cores used by OBITools4 either by using the --max-cpu
    option or by setting the OBIMAXCPU environment variable. Some strange
    behaviour of OBITools4 has been observed when users try to limit the
    maximum number of usable CPU cores to one. This seems to be caused by the Go
    language, and it is not straightforward to get OBITools4 to run correctly on
    a single core in all circumstances. Therefore, if you ask to use a single
    core, OBITools4 will print a warning message and actually set this
    parameter to two cores. If you really want a single core, you can use the
    --force-one-core option, but be aware that this can lead to incorrect
    calculations.
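
The two ways of limiting the core count described above can be sketched as follows. This is only an illustration: demo.fasta is a made-up file, run_obitool is a hypothetical helper that skips the call when OBITools4 is not installed, and obicount merely stands in for any OBITools4 command.

```shell
# Tiny made-up FASTA file so the sketch is self-contained.
cat > demo.fasta <<'EOF'
>seq1
acgtacgtacgt
>seq2
ttagctaattag
EOF

# Run an OBITools4 command only when it is installed, so that the sketch
# degrades to a no-op elsewhere (run_obitool is a hypothetical helper).
run_obitool() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 is not installed" >&2
    fi
}

# Option 1: limit a single command to four cores with --max-cpu.
run_obitool obicount --max-cpu 4 demo.fasta

# Option 2: limit every OBITools4 command started from this shell.
export OBIMAXCPU=4
run_obitool obicount demo.fasta
```

Setting the environment variable is probably the more convenient choice in a cluster job script, since it applies to every command of a pipeline without editing each call.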

New features

  • The output of the obitools will evolve to produce results only in standard
    formats such as fasta and fastq. For non-sequence data, the output will be
    in CSV format, with a comma (,) as the field separator, a period (.) as the
    decimal separator, and a header line with the column names. This makes the
    output more convenient to use in other programs. For example, you can use
    the csvtomd command to reformat the CSV output into a markdown table. The
    first command to initiate this change is obicount, which now produces a
    3-line CSV output.

    obicount data.csv | csvtomd

  • Adds the new experimental obicleandb utility to clean up reference
    database files created with obipcr. An easy way to create a reference
    database for obitag is to use obipcr on a local copy of Genbank or EMBL.
    However, these sequence databases are known to contain many taxonomic
    errors, such as bacterial sequences annotated with the taxid of their host
    species. obicleandb tries to detect these errors. To do this, it first keeps
    only sequences annotated with a taxid for which a species, genus, and
    family taxid can be assigned. Then, for each sequence, it compares the
    distances between that sequence and the other sequences belonging to the
    same genus with the same number of distances between that sequence and a
    randomly selected set of sequences belonging to another family, using a
    Mann-Whitney U test. The alternative hypothesis is that out-of-family
    distances are greater than intrageneric distances. Sequences are annotated
    with the p-value of the Mann-Whitney U test in the obicleandb_trusted
    slot. Later, the distribution of this p-value can be analyzed to determine a
    threshold. Empirically, a threshold of 0.05 is a good compromise and
    filters out less than 1‰ of the sequences. These sequences can then be
    removed using obigrep.

  • Adds a new obijoin utility to join information contained in a sequence
    file with that contained in another sequence or CSV file. The command allows
    you to specify the names of the keys in the main sequence file and in the
    secondary data file that will be used to perform the join operation.

  • Adds a new tool obidemerge to demerge a merge_xxx slot by recreating the
    multiple identical sequences, each with the slot xxx restored to its
    initial value and the sequence count set to the number of occurrences
    recorded in the merge_xxx slot. During the operation, the merge_xxx slot is
    removed.

  • Adds CSV as one of the input formats for every obitools command. To encode
    sequences, the CSV file must include a column named sequence and another
    column named id. An extra column named qualities can be added to specify
    the quality scores of the sequence, following the same ASCII encoding as the
    fastq format. All the other columns will be considered as annotations and
    will be interpreted as JSON objects, potentially encoding atomic values. If
    a column value cannot be decoded as JSON, it will be considered as a string.

  • A new option --version has been added to every obitools command. It will
    print the version of the command.

  • In obiscript a qualities method has been added to retrieve or set the
    quality scores from a BioSequence object.

  • In obimultiplex, the ngsfilter file describing the samples can now be
    provided not only in the classical ngsfilter format but also in CSV format.
    When using CSV, the first line must contain the column names. Five columns
    are expected:

    • experiment: the name of the experiment
    • sample: the name of the sample
    • sample_tag: the tag used to identify the sample
    • forward_primer: the forward primer sequence
    • reverse_primer: the reverse primer sequence

    The order of the columns is not important, as long as they are present and
    named correctly. The obimultiplex command will print an error message if
    a column is missing. It now includes a --template option that can
    be used to create an example CSV file.

    Supplementary columns are allowed. Their names and content will be used to
    annotate the sequence corresponding to the sample, as the key=value; pairs
    did in the ngsfilter format.

    The CSV format used allows for comment lines starting with the # character.
    Special data lines starting with @param in the first column allow you to
    configure the algorithm. The --template option provides a heavily
    commented example of the CSV format, including all the possible options.

Enhancement

  • In every OBITools command, the progress bars are automatically deactivated
    when the standard error output is redirected.
  • As Genbank and ENA:EMBL contain very large sequences, while
    OBITools4 is optimised for short sequences, obipcr faces some problems
    with excessive consumption of computer resources, especially memory. Several
    improvements in the tuning of the default obipcr parameters and some new
    features, currently only available for FASTA and FASTQ file readers, have
    been implemented to limit the memory impact of obipcr without changing the
    computational efficiency too much.
  • The logging system, and therefore the log format, has been homogenized.

Bug

  • In obitag, correct the wrong assignment of the obitag_bestmatch
    attribute.
  • In obiclean, the --no-progress-bar option now disables all progress bars,
    not just the data one.
  • Several fixes in reading FASTA and FASTQ files, including some code
    simplification and factorization.
  • Fixed a bug in all obitools that caused the same file to be processed
    multiple times when specifying a directory name as input.