
Annif 1.4


@juhoinkinen released this 03 Sep 10:49
v1.4.0
1782fbf

This release introduces three new corpus formats: a JSON-based full text corpus format (one file per document) and two short-text formats, one based on JSON Lines and another based on CSV. All the new formats support document IDs as well as metadata, so structured information such as titles and abstracts can now be included with documents. This is intended to improve the handling of documents that need context beyond the text itself; projects can be configured to operate only on specific metadata fields using the new select transform. The new corpus formats can be used alongside the existing formats.
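
As a rough picture of what a short-text record carrying an ID and metadata might look like, the sketch below serializes a couple of records as JSON Lines and picks out selected metadata fields. The field names (`document_id`, `text`, `metadata`, `subjects`) and the helper functions are assumptions made for this example, not the documented Annif schema or its select transform; check the corpus format documentation for the actual layout.

```python
import json

# Hypothetical short-text records. The field names below are assumptions
# for illustration only, not the documented Annif schema.
records = [
    {
        "document_id": "doc-001",
        "text": "Full abstract text of the first document.",
        "metadata": {"title": "Medieval trade routes", "abstract": "A short summary."},
        "subjects": [{"uri": "http://example.org/subject/1"}],
    },
    {
        "document_id": "doc-002",
        "text": "Full abstract text of the second document.",
        "metadata": {"title": "Soil chemistry", "abstract": "Another summary."},
        "subjects": [{"uri": "http://example.org/subject/2"}],
    },
]


def to_jsonl(docs):
    """Serialize documents as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(doc, ensure_ascii=False) + "\n" for doc in docs)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def select_fields(doc, fields):
    """Join the chosen metadata fields into one text.

    A stand-in for the idea of operating only on selected metadata fields;
    this is not Annif's select transform.
    """
    return " ".join(doc["metadata"][name] for name in fields if name in doc["metadata"])
```

With these helpers, `select_fields(records[0], ["title", "abstract"])` yields the title and abstract joined into one string, which is the kind of input a project restricted to those fields would see.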

It is now possible to exclude and include subjects from a vocabulary. Excluding individual concepts can be useful in cases where algorithms frequently produce incorrect subject suggestions. Using exclude and include rules, it is also possible to define more specialized projects that operate on only one type or class of concepts.
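
The effect of such rules can be pictured as a filter over the vocabulary. The sketch below is not Annif's implementation or configuration syntax: the `Concept` type, the `apply_rules` function and the rule semantics (restrict by type first, then drop excluded URIs) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """Hypothetical vocabulary entry; not an Annif class."""
    uri: str
    label: str
    kind: str  # e.g. "topic" or "place"


def apply_rules(concepts, include_kinds=None, exclude_uris=()):
    """Keep concepts of the included kinds, then drop individually excluded URIs."""
    excluded = set(exclude_uris)
    kept = []
    for concept in concepts:
        if include_kinds is not None and concept.kind not in include_kinds:
            continue  # outside the class of concepts this project handles
        if concept.uri in excluded:
            continue  # individually denylisted concept
        kept.append(concept)
    return kept


vocab = [
    Concept("http://example.org/c/1", "Cats", "topic"),
    Concept("http://example.org/c/2", "Helsinki", "place"),
    Concept("http://example.org/c/3", "Dogs", "topic"),
]

# A specialized project that only suggests topics, minus one noisy concept.
topics_only = apply_rules(
    vocab, include_kinds={"topic"}, exclude_uris=["http://example.org/c/3"]
)
```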

Several improvements have been made to the REST API, including exposing vocabulary information via the vocabs method and disabling the learn method by default (controlled by the allow_learn setting in the NN ensemble backend).

The annif index command can now be used on short-text corpus formats (TSV, CSV or JSON Lines) in addition to full text formats (TXT+TSV or JSON). In the case of short-text formats, output including the suggested subjects and their scores is produced in JSON Lines format.

The hyperopt command has been enhanced to better support parallel processing on multiple CPU cores, which can significantly reduce overall processing time.
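
The general idea behind the speedup is that independent trials can be scored at the same time in separate worker processes. The following is a minimal sketch of that pattern using only the standard library; it is not how the hyperopt command is implemented, and the objective function, the parameter names (`limit`, `threshold`) and the random sampling are all made up for the example.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def objective(params):
    """Toy score to maximize; peaks at limit=10, threshold=0.2 (hypothetical)."""
    limit, threshold = params
    return -((limit - 10) ** 2) - 100 * (threshold - 0.2) ** 2


def sample_trials(n, seed=0):
    """Draw n random (limit, threshold) combinations to evaluate."""
    rng = random.Random(seed)
    return [(rng.randint(1, 20), rng.uniform(0.0, 1.0)) for _ in range(n)]


def parallel_search(trials, jobs=2, executor_cls=ProcessPoolExecutor):
    """Score all trials concurrently and return (best_params, best_score)."""
    with executor_cls(max_workers=jobs) as pool:
        scores = list(pool.map(objective, trials))
    best = max(range(len(trials)), key=scores.__getitem__)
    return trials[best], scores[best]
```

In a script, `parallel_search(sample_trials(50), jobs=4)` would be called from under an `if __name__ == "__main__":` guard, which process pools require on platforms that spawn workers. Each trial here is independent of the others; optimizers that use earlier results to choose later trials need additional coordination between workers.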

This release adds support for Python 3.13 and drops support for Python 3.9. The tfidf backend has been refactored to remove the dependency on gensim, which addresses compatibility issues and simplifies the codebase. Various maintenance updates and bug fixes are also included, such as resolving warnings related to Click and upgrading many libraries to more recent versions.

Special thanks to the German National Library (DNB) EMa team (@c-poley, @RietdorfC, @san-uh) for their work on proposing, specifying and testing the new features in this release!

Supported Python versions:

  • 3.10, 3.11, 3.12 and 3.13

Backward compatibility:

  • ⚠️ tfidf projects trained with Annif 1.3 or older need to be retrained.
  • For other projects, the warnings emitted by scikit-learn are harmless.
  • ⚠️ This is very likely the last Annif minor release to support the current fasttext backend, because the original fastText library is no longer maintained and there are compatibility issues with other libraries. We are looking for alternative implementations of fasttext.

Enhancements:
#875/#876 Add JSONL short text corpus format
#872/#868 Support metadata in fulltext corpus format / JSON fulltext corpus format
#886/#885 Support document_id in JSON(L) and CSV corpus formats & JSONL output
#889/#639/#877 Support for all corpus formats in annif index CLI command
#863/#140 Flexible fusion part 1: CSV short-text document corpus format
#864 Flexible fusion part 2: core functionality
#866 Flexible fusion part 3: CLI suggest option for additional metadata
#867 Flexible fusion part 4: REST API document metadata support
#844/#846 Support exclude/include rules for vocabulary concepts
#735/#840 Support subject exclusion / Dealing with overrepresented concepts / denylisting
#839/#837 Expose vocabulary information via REST API
#843 Disable /learn REST API method by default
#688/#873 Parallel hyperparameter optimization using multiple CPU cores

Maintenance:
#878 Remove gensim dependency in tfidf backend
#871 Update dependencies for 1.4 release
#890 Use NumPy 2 compatible fastText fork
#849 Drop Python 3.9 support
#850/#869 Support Python 3.13
#884 Upgrade to Poetry 2.0 / Resolve Poetry deprecation warnings
#848 Resolve DeprecationWarning: avoid use of datetime.utcfromtimestamp
#852/#891 Bump GitHub Actions versions
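
For context on #848: `datetime.utcfromtimestamp` is deprecated as of Python 3.12, and the usual replacement is a timezone-aware call. The snippet below shows that general pattern, not the specific change made in Annif; the helper name is hypothetical.

```python
from datetime import datetime, timezone


def utc_from_timestamp(ts: float) -> datetime:
    """Convert a POSIX timestamp to a timezone-aware UTC datetime.

    Unlike datetime.utcfromtimestamp(), the result carries tzinfo,
    so it cannot be mistaken for local time.
    """
    return datetime.fromtimestamp(ts, tz=timezone.utc)
```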

Fixes:
#882 Resolve UserWarning: The parameter --verbosity... for annif list-* CLI commands
#847 Add superclass constructor call to LMDBSequence, to prevent TensorFlow warning
#874 JSON corpus bugfix: avoid parsing subjects in annif index
#887 Fix slow annif train JSONL test & avoid slow jsonschema import