August 2nd, 2024. Release 4.3.0
Change of git repository
- The OBITools4 git repository has been moved to GitHub.
  The new address is: https://github.com/metabarcoding/obitools4.
  Take care to use the new install script to retrieve the new version:

      curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh \
        | bash

  or with options:

      curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh \
        | bash -s -- --install-dir test_install --obitools-prefix k

CPU limitation
- By default, OBITools4 tries to use all the computing power available on
your computer. In some circumstances this can be problematic (e.g. if you
are running on a computer cluster managed by your university). You can limit
the number of CPU cores used by OBITools4 either by using the --max-cpu
option or by setting the OBIMAXCPU environment variable. Some strange
behaviour of OBITools4 has been observed when users try to limit the
maximum number of usable CPU cores to one. This seems to be caused by the Go
language, and it is not easy to get OBITools4 to run correctly on a
single core in all circumstances. Therefore, if you ask to use a single
core, OBITools4 will print a warning message and actually set this
parameter to two cores. If you really want a single core, you can use the
--force-one-core option. But be aware that this can lead to incorrect
calculations.
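
As a rough illustration of the rule described above, the following Python sketch reproduces the decision logic only. It is not taken from the OBITools4 source (which is written in Go): the function name `effective_max_cpu`, its parameters, and the choice to let the command-line value take precedence over `OBIMAXCPU` are assumptions made for the example.

```python
import os
import sys


def effective_max_cpu(requested=None, force_one_core=False, env=None):
    """Return the number of cores to use, following the rule in the release note.

    Hypothetical helper: names and option/environment precedence are assumptions.
    """
    env = os.environ if env is None else env

    # Assumption: an explicit --max-cpu value wins over the OBIMAXCPU variable.
    if requested is None and "OBIMAXCPU" in env:
        requested = int(env["OBIMAXCPU"])

    # No limit requested: use every core the machine reports.
    if requested is None:
        return os.cpu_count() or 1

    # A single core is only honoured when it is explicitly forced.
    if requested == 1 and not force_one_core:
        print("warning: single-core run requested, using 2 cores instead",
              file=sys.stderr)
        return 2

    return requested


print(effective_max_cpu(1, env={}))                       # 2
print(effective_max_cpu(1, force_one_core=True, env={}))  # 1
print(effective_max_cpu(env={"OBIMAXCPU": "4"}))          # 4
```

The point of the sketch is that a request for one core is silently upgraded to two unless the user opts in, so scripts that need strict single-core execution have to pass the forcing option explicitly.
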
New features
- The output of the obitools will evolve to produce results only in standard
  formats such as fasta and fastq. For non-sequence data, the output will be
  in CSV format, with the separator `,`, the decimal separator `.`, and a
  header line with the column names. This makes the output more convenient to
  use in other programs. For example, you can use the `csvtomd` command to
  reformat the csv output into a markdown table. The first command to initiate
  this change is `obicount`, which now produces a 3-line CSV output.

      obicount data.csv | csvtomd

- Adds the new experimental `obicleandb` utility to clean up reference
  database files created with `obipcr`. An easy way to create a reference
  database for `obitag` is to use `obipcr` on a local copy of Genbank or EMBL.
  However, these sequence databases are known to contain many taxonomic
  errors, such as bacterial sequences annotated with the taxid of their host
  species. `obicleandb` tries to detect these errors. To do this, it first keeps
  only sequences annotated with a taxid to which a species, genus, and
  family taxid can be assigned. Then, for each sequence, it compares the
  distances between the sequence and the other sequences belonging to the same
  genus with the same number of distances between the sequence and a
  randomly selected set of sequences belonging to another family, using a
  Mann-Whitney U test. The alternative hypothesis is that out-of-family
  distances are greater than intrageneric distances. Sequences are annotated
  with the p-value of the Mann-Whitney U test in the obicleandb_trusted
  slot. Later, the distribution of this p-value can be analyzed to determine a
  threshold. Empirically, a threshold of 0.05 is a good compromise and
  filters out less than 1‰ of the sequences. These sequences can then be
  removed using `obigrep`.
- Adds a new `obijoin` utility to join information contained in a sequence
  file with that contained in another sequence or CSV file. The command allows
  you to specify the names of the keys in the main sequence file and in the
  secondary data file that will be used to perform the join operation.
- Adds a new tool `obidemerge` to demerge a `merge_xxx` slot by recreating the
  multiple identical sequences, each having the slot `xxx` restored to its
  initial value and the sequence count set to the number of occurrences
  recorded in the `merge_xxx` slot. During the operation, the `merge_xxx` slot
  is removed.
- Adds CSV as one of the input formats for every obitools command. To encode
  a sequence, the CSV file must include a column named `sequence` and another
  column named `id`. An extra column named `qualities` can be added to specify
  the quality scores of the sequence, following the same ASCII encoding as the
  fastq format. All the other columns will be considered as annotations and
  will be interpreted as JSON objects, potentially encoding atomic values. If a
  column value cannot be decoded as JSON, it will be considered as a string.
- A new option --version has been added to every obitools command. It will
  print the version of the command.
- In `obiscript`, a `qualities` method has been added to retrieve or set the
  quality scores from a BioSequence object.
- In `obimultiplex`, the ngsfilter file describing the samples can now be
  provided not only using the classical ngsfilter format but also using the
  csv format. When using csv, the first line must contain the column names.
  5 columns are expected:

  - `experiment`: the name of the experiment
  - `sample`: the name of the sample
  - `sample_tag`: the tag used to identify the sample
  - `forward_primer`: the forward primer sequence
  - `reverse_primer`: the reverse primer sequence

  The order of the columns is not important, as long as they are present and
  named correctly. The `obimultiplex` command will print an error message if
  some column is missing. It now includes a **--template** option that can
  be used to create an example CSV file.

  Supplementary columns are allowed. Their names and content will be used to
  annotate the sequence corresponding to the sample, as the `key=value;` did
  in the ngsfilter format.

  The CSV format used allows for comment lines starting with the `#`
  character. Special data lines starting with `@param` in the first column
  allow the algorithm to be configured. The --template option provides a
  heavily commented example of the csv format, including all the possible
  options.
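
The CSV input rules listed above (mandatory `id` and `sequence` columns, optional `qualities`, every other column decoded as JSON with a fallback to a plain string) can be sketched in Python as follows. This is only an illustration of the described behaviour, not the OBITools4 reader: the function names, the record layout, the sample data, and the Phred+33 quality offset are assumptions.

```python
import csv
import io
import json

RESERVED_COLUMNS = {"id", "sequence", "qualities"}


def decode_annotation(raw):
    """Decode a cell as JSON, falling back to the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def read_csv_sequences(handle, phred_offset=33):
    """Yield one dict per CSV row (illustrative reader, not the OBITools4 one)."""
    reader = csv.DictReader(handle)
    missing = {"id", "sequence"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing mandatory column(s): {sorted(missing)}")

    for row in reader:
        record = {
            "id": row["id"],
            "sequence": row["sequence"],
            "annotations": {
                name: decode_annotation(value)
                for name, value in row.items()
                if name not in RESERVED_COLUMNS
            },
        }
        # Assumption: fastq-style ASCII qualities with a Phred+33 offset.
        if row.get("qualities"):
            record["qualities"] = [ord(c) - phred_offset for c in row["qualities"]]
        yield record


# Made-up example data: two sequences with three annotation columns.
EXAMPLE = '''id,sequence,qualities,count,merge_sample,habitat
seq1,acgt,IIII,3,"{""lake"": 2, ""river"": 1}",freshwater
seq2,ttga,,1,"{""lake"": 1}",unknown
'''

for record in read_csv_sequences(io.StringIO(EXAMPLE)):
    print(record)
```

In this example `count` decodes to an integer and `merge_sample` to a map, while `freshwater` is not valid JSON and is therefore kept as a string.
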
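
The demerge operation described above for `obidemerge` can be pictured with a small Python sketch. It only mirrors the wording of the release note: the dictionary-based record layout, the `count` slot name, and the choice to keep the same `id` on every recreated sequence are assumptions, and the real tool may differ.

```python
def demerge(record, slot):
    """Split one merged record into one record per value of `slot`."""
    merge_key = f"merge_{slot}"
    occurrences = record["annotations"].get(merge_key)
    if not occurrences:
        return [record]  # nothing to demerge

    demerged = []
    for value, n in occurrences.items():
        annotations = {
            key: val
            for key, val in record["annotations"].items()
            if key != merge_key  # the merge_xxx slot is removed
        }
        annotations[slot] = value  # slot xxx gets its initial value back
        annotations["count"] = n   # assumption: the count lives in a `count` slot
        demerged.append({"id": record["id"],
                         "sequence": record["sequence"],
                         "annotations": annotations})
    return demerged


# Made-up merged record: 5 reads, 3 from "lake" and 2 from "river".
merged = {
    "id": "seq1",
    "sequence": "acgt",
    "annotations": {"count": 5, "merge_sample": {"lake": 3, "river": 2}},
}
for rec in demerge(merged, "sample"):
    print(rec["annotations"])
# {'count': 3, 'sample': 'lake'}
# {'count': 2, 'sample': 'river'}
```
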
Enhancement
- In every OBITools command, the progress bars are automatically deactivated
  when the standard error output is redirected.
- As Genbank and ENA:EMBL contain very large sequences, while OBITools4 is
  optimised for short sequences, `obipcr` faces some problems with excessive
  consumption of computer resources, especially memory. Several improvements
  in the tuning of the default `obipcr` parameters and some new features,
  currently only available for FASTA and FASTQ file readers, have been
  implemented to limit the memory impact of `obipcr` without changing the
  computational efficiency too much.
- The logging system, and therefore the log format, has been homogenized.
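
For the progress bar behaviour mentioned in the first item above, a common way to detect a redirected stream is to check whether it is attached to a terminal. The Python sketch below shows that general technique; whether OBITools4 performs the check this way is an assumption, and `progress_bar_enabled` is a hypothetical helper.

```python
import io
import sys


def progress_bar_enabled(stream=None, no_progress_bar=False):
    """Decide whether a progress bar should be drawn on `stream`."""
    stream = sys.stderr if stream is None else stream
    if no_progress_bar:
        return False  # explicitly disabled by the user
    # A stream redirected to a file or a pipe is not a terminal.
    return stream.isatty()


# An in-memory buffer stands in for a redirected standard error.
print(progress_bar_enabled(io.StringIO()))  # False
```
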
Bug
- In `obitag`, correct the wrong assignment of the obitag_bestmatch
  attribute.
- In `obiclean`, the --no-progress-bar option now disables all progress bars,
  not just the data one.
- Several fixes in reading FASTA and FASTQ files, including some code
  simplification and factorization.
- Fixed a bug in all obitools that caused the same file to be processed
  multiple times when specifying a directory name as input.