Release August 2nd, 2024. Release 4.3.0 · metabarcoding/obitools4

Change of git repository

The OBITools4 git repository has been moved to GitHub.
The new address is: https://github.com/metabarcoding/obitools4.
Be sure to use the new install script to retrieve the new version.

curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh \
  | bash

or with options:

curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh \
  | bash -s -- --install-dir test_install --obitools-prefix k

CPU limitation

By default, OBITools4 tries to use all the computing power available on
your computer. In some circumstances this can be problematic (e.g. if you
are running on a computer cluster managed by your university). You can limit
the number of CPU cores used by OBITools4 either by using the --max-cpu
option or by setting the OBIMAXCPU environment variable. Some strange
behaviour of OBITools4 has been observed when users try to limit the
maximum number of usable CPU cores to one. This seems to be caused by the Go
language, and it is not straightforward to get OBITools4 to run correctly on
a single core in all circumstances. Therefore, if you ask to use a single
core, OBITools4 will print a warning message and actually set this
parameter to two cores. If you really want a single core, you can use the
--force-one-core option, but be aware that this can lead to incorrect
calculations.
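
The single-core rule described above can be summarised in a few lines. The
following Python sketch only illustrates the documented behaviour; the
function name effective_max_cpu and its parameters are hypothetical and are
not part of OBITools4.

```python
import warnings


def effective_max_cpu(requested: int, force_one_core: bool = False) -> int:
    """Return the number of cores that would actually be used.

    Hypothetical helper mirroring the rule stated in the release notes:
    a request for one core is raised to two unless it is explicitly forced.
    """
    if requested < 1:
        raise ValueError("the number of CPU cores must be at least 1")
    if requested == 1 and not force_one_core:
        warnings.warn("a single core was requested; using two cores instead")
        return 2
    return requested


if __name__ == "__main__":
    print(effective_max_cpu(4))
    print(effective_max_cpu(1, force_one_core=True))
```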

New features

The output of the obitools will evolve to produce results only in standard
formats such as fasta and fastq. For non-sequential data, the output will be
in CSV format, with `,` as the field separator, `.` as the decimal separator,
and a header line with the column names. This makes the output more
convenient to use in other programs. For example, you can use the csvtomd
command to reformat the CSV output into a markdown table. The first command
to initiate this change is obicount, which now produces a 3-line CSV output.
```
obicount data.csv | csvtomd 
```
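
To illustrate why a comma-separated output with a header line is easy to
reuse, the sketch below converts such CSV text into a Markdown table, which
is roughly what a tool like csvtomd does. The function name and the sample
rows are made up for illustration; they are not the actual obicount output.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Convert CSV text with a header line into a Markdown table."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


# Hypothetical CSV content, not real obicount output.
sample = "name,value\nalpha,12\nbeta,3.5\n"
print(csv_to_markdown(sample))
```
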
Adds the new experimental obicleandb utility to clean up reference
database files created with obipcr. An easy way to create a reference
database for obitag is to use obipcr on a local copy of Genbank or EMBL.
However, these sequence databases are known to contain many taxonomic
errors, such as bacterial sequences annotated with the taxid of their host
species. obicleandb tries to detect these errors. To do this, it first keeps
only the sequences annotated with a taxid to which a species, genus, and
family taxid can be assigned. Then, for each sequence, it compares the
distances between that sequence and the other sequences of the same genus
with the same number of distances between that sequence and a randomly
selected set of sequences belonging to another family, using a
Mann-Whitney U test. The alternative hypothesis is that out-of-family
distances are greater than intrageneric distances. Sequences are annotated
with the p-value of the Mann-Whitney U test in the obicleandb_trusted
slot. Later, the distribution of this p-value can be analyzed to determine a
threshold. Empirically, a threshold of 0.05 is a good compromise and filters
out less than 1‰ of the sequences. These sequences can then be removed using
obigrep.
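
The statistical comparison described above can be sketched as follows. This
is a minimal one-sided Mann-Whitney U test using the normal approximation,
without tie or continuity correction; it is an illustration under those
assumptions, not the obicleandb implementation. The function name and the
distance values are hypothetical.

```python
import math
from typing import Sequence


def mann_whitney_greater(outside: Sequence[float], inside: Sequence[float]) -> float:
    """One-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns the p-value for the alternative hypothesis that values in
    `outside` tend to be greater than values in `inside`.
    """
    n_out, n_in = len(outside), len(inside)
    if n_out == 0 or n_in == 0:
        raise ValueError("both samples must be non-empty")
    # U counts the pairs where the outside value wins; ties count for half.
    u = 0.0
    for o in outside:
        for i in inside:
            if o > i:
                u += 1.0
            elif o == i:
                u += 0.5
    mean = n_out * n_in / 2.0
    sd = math.sqrt(n_out * n_in * (n_out + n_in + 1) / 12.0)
    z = (u - mean) / sd
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical distances for one sequence.
intrageneric = [0.02, 0.03, 0.01, 0.04, 0.05]
out_of_family = [0.20, 0.25, 0.18, 0.30, 0.22]
print(mann_whitney_greater(out_of_family, intrageneric))
```

With so few values an exact test would be preferable; the approximation is
used here only to keep the sketch short.
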
Adds a new obijoin utility to join information contained in a sequence
file with that contained in another sequence or CSV file. The command allows
you to specify the names of the keys in the main sequence file and in the
secondary data file that will be used to perform the join operation.
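
Conceptually, the join works like a keyed lookup between two sets of
records. The Python sketch below illustrates the idea only: the function
name, the left-join behaviour, the "first match wins" rule and the choice not
to overwrite existing annotations are all assumptions made for the example,
not documented obijoin behaviour.

```python
from typing import Dict, Iterable, List


def join_annotations(
    sequences: Iterable[Dict[str, object]],
    partners: Iterable[Dict[str, object]],
    main_key: str,
    partner_key: str,
) -> List[Dict[str, object]]:
    """Left-join partner records onto sequence records on the given keys."""
    index: Dict[object, Dict[str, object]] = {}
    for record in partners:
        if partner_key in record:
            # Assumption: keep the first partner record seen for each key value.
            index.setdefault(record[partner_key], record)

    joined = []
    for seq in sequences:
        merged = dict(seq)
        match = index.get(seq.get(main_key))
        if match is not None:
            for name, value in match.items():
                # Assumption: existing sequence annotations are not overwritten.
                merged.setdefault(name, value)
        joined.append(merged)
    return joined


# Hypothetical records: sequences keyed by "sample", metadata keyed by "name".
seqs = [{"id": "s1", "sample": "A"}, {"id": "s2", "sample": "B"}]
meta = [{"name": "A", "site": "lake"}]
print(join_annotations(seqs, meta, "sample", "name"))
```
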
Adds a new tool obidemerge to demerge a merge_xxx slot by recreating the
multiple identical sequences, each with the slot xxx restored to its initial
value and the sequence count set to the number of occurrences recorded in the
merge_xxx slot. During the operation, the merge_xxx slot is removed.
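
The demerge operation can be pictured with a small Python sketch. The record
layout used here (a dictionary with a count key and a merge_xxx dictionary
mapping each value to its number of occurrences) is an assumption made for
illustration, as is the function name.

```python
from typing import Dict, List


def demerge(record: Dict[str, object], slot: str) -> List[Dict[str, object]]:
    """Split one record into one record per value stored in merge_<slot>."""
    merged_key = f"merge_{slot}"
    counts = record.get(merged_key)
    if not isinstance(counts, dict):
        # Nothing to demerge: return an unchanged copy.
        return [dict(record)]
    # The merge_<slot> entry is dropped from every recreated record.
    base = {k: v for k, v in record.items() if k != merged_key}
    result = []
    for value, count in counts.items():
        copy = dict(base)
        copy[slot] = value      # slot restored to its initial value
        copy["count"] = count   # count taken from the merge_<slot> entry
        result.append(copy)
    return result


# Hypothetical merged record.
rec = {"id": "s1", "sequence": "acgt", "count": 5, "merge_sample": {"A": 3, "B": 2}}
for r in demerge(rec, "sample"):
    print(r)
```
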
Adds CSV as one of the input formats for every obitools command. To encode
sequences, the CSV file must include a column named sequence and another
column named id. An extra column named qualities can be added to specify
the quality scores of the sequence, following the same ASCII encoding as the
fastq format. All the other columns will be considered as annotations and will
be interpreted as JSON values, which may be simple atomic values. If a
column value cannot be decoded as JSON, it will be considered as a string.
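
The decoding rules above can be sketched in Python as follows. This is an
illustration of the described rules, not the OBITools4 reader; the function
name, the output record layout and the sample rows are hypothetical.

```python
import csv
import io
import json
from typing import Dict, List


def read_sequences(text: str) -> List[Dict[str, object]]:
    """Parse CSV text into sequence records following the rules above."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    for required in ("id", "sequence"):
        if required not in fields:
            raise ValueError(f"missing required column: {required}")
    records = []
    for row in reader:
        record = {"id": row["id"], "sequence": row["sequence"], "annotations": {}}
        for name, raw in row.items():
            if name is None or name in ("id", "sequence"):
                continue
            if name == "qualities":
                record["qualities"] = raw
                continue
            try:
                record["annotations"][name] = json.loads(raw)
            except (json.JSONDecodeError, TypeError):
                # Not valid JSON: keep the raw text as a string.
                record["annotations"][name] = raw
        records.append(record)
    return records


# Hypothetical input: "count" decodes as a number, "taxon" stays a string.
sample = "id,sequence,count,taxon\nseq1,acgt,3,Homo sapiens\n"
print(read_sequences(sample))
```
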
A new option --version has been added to every obitools command. It will
print the version of the command.
In obiscript a qualities method has been added to retrieve or set the
quality scores from a BioSequence object.
In obimultiplex the ngsfilter file describing the samples can now be provided
not only in the classical ngsfilter format but also in CSV format.
When using CSV, the first line must contain the column names. Five columns
are expected:
- experiment: the name of the experiment
- sample: the name of the sample
- sample_tag: the tag used to identify the sample
- forward_primer: the forward primer sequence
- reverse_primer: the reverse primer sequence
The order of the columns is not important, as long as they are present and
named correctly. The obimultiplex command will print an error message if
a column is missing. It now includes a **--template** option that can
be used to create an example CSV file.

Supplementary columns are allowed. Their names and content will be used to
annotate the sequences corresponding to the sample, as the key=value; pairs
did in the ngsfilter format.

The CSV format used allows for comment lines starting with the # character.
Special data lines starting with @param in the first column allow you to
configure the algorithm. The --template option provides a heavily
commented example of the CSV format, including all the possible options.
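
The structure of such a CSV sample description can be sketched with a small
Python reader. This is only an illustration of the rules stated above: the
function name and sample values are hypothetical, and because the layout of
the cells following @param is not described here, those cells are kept
as-is rather than interpreted.

```python
import csv
import io
from typing import Dict, List, Tuple

REQUIRED = ("experiment", "sample", "sample_tag", "forward_primer", "reverse_primer")


def read_sample_sheet(text: str) -> Tuple[List[Dict[str, str]], List[List[str]]]:
    """Return (sample rows, raw @param rows) from a CSV sample description."""
    header = None
    samples: List[Dict[str, str]] = []
    params: List[List[str]] = []
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].lstrip().startswith("#"):
            continue  # blank line or comment line
        if row[0].startswith("@param"):
            # Layout of the remaining cells is not specified here: keep them raw.
            params.append(row[1:])
        elif header is None:
            header = [name.strip() for name in row]
            missing = [name for name in REQUIRED if name not in header]
            if missing:
                raise ValueError("missing column(s): " + ", ".join(missing))
        else:
            # Extra columns simply become additional key/value annotations.
            samples.append(dict(zip(header, row)))
    if header is None:
        raise ValueError("no header line found")
    return samples, params


# Hypothetical sample sheet with one supplementary column ("site").
sheet = (
    "# hypothetical sample sheet\n"
    "experiment,sample,sample_tag,forward_primer,reverse_primer,site\n"
    "exp1,s1,aacc,ACGTACGT,TTGGCCAA,lake\n"
)
print(read_sample_sheet(sheet))
```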

Enhancement

In every OBITools command, the progress bars are automatically deactivated
when the standard error output is redirected.
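
One common way to detect such a redirection is to check whether the stream
is attached to a terminal. The Python sketch below shows that idea; it is an
assumption about the general mechanism, not a description of the OBITools4
code, and the function name is hypothetical.

```python
import sys
from typing import Optional, TextIO


def progress_bar_enabled(stream: Optional[TextIO] = None,
                         disabled_by_user: bool = False) -> bool:
    """Show a progress bar only when the stream is an interactive terminal."""
    if stream is None:
        stream = sys.stderr
    # A stream redirected to a file or a pipe is not a terminal.
    return not disabled_by_user and stream.isatty()


if __name__ == "__main__":
    print(progress_bar_enabled())
```
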
As Genbank and ENA:EMBL contain very large sequences, while OBITools4 is
optimised for short sequences, obipcr faces some problems with excessive
consumption of computer resources, especially memory. Several improvements
in the tuning of the default obipcr parameters and some new features,
currently only available for FASTA and FASTQ file readers, have been
implemented to limit the memory impact of obipcr without changing the
computational efficiency too much.
The logging system, and therefore the log format, has been homogenized.

Bug

In obitag, correct the wrong assignment of the obitag_bestmatch
attribute.
In obiclean, the --no-progress-bar option now disables all progress bars,
not just the data progress bar.
Several fixes in reading FASTA and FASTQ files, including some code
simplification and factorization.
Fixed a bug in all obitools that caused the same file to be processed
multiple times when a directory name was specified as input.
