Releases: pkiraly/qa-catalogue

Version 0.7.0

18 Jul 12:18

The major features of this release

Improved PICA handling

PICA is an alternative bibliographic metadata schema used in Germany, the Netherlands and France. The PICA-related features were developed in cooperation with K10plus, the largest union catalogue in Germany. The analyses of PICA records now cover completeness, validation, subject headings and authority names, plus searching and displaying individual records.
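
To illustrate the idea behind the completeness analysis, the sketch below counts field tags in a PICA Plain file. The sample records are invented for the example and are not K10plus data; the real analysis is performed by the Java tool, not by this one-liner:

```shell
# Create a tiny PICA Plain sample (hypothetical records, one field per line).
cat > sample.pica <<'EOF'
003@ $01234
021A $aExample title
045E $a610
003@ $05678
021A $aAnother title
EOF

# Completeness sketch: how often does each field tag occur?
awk '{ print $1 }' sample.pica | sort | uniq -c | sort -rn
```

The frequency table produced this way is the simplest form of the completeness report: which fields are present, and how often.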

Handling union catalogues

Union catalogues cover the collections of multiple libraries. QA catalogue can now display the results of completeness, validation, searching and term lists both for the whole catalogue and for any individual library.
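
The per-library grouping builds on an id-groupid.csv mapping (see #253 below). A minimal sketch of how such a record-to-library mapping can be aggregated; the record and group identifiers here are made up for illustration:

```shell
# Hypothetical id-groupid.csv: one row per (record, holding library) pair.
cat > id-groupid.csv <<'EOF'
id,groupId
rec1,77
rec1,2035
rec2,77
EOF

# Records per library group (skip the header row).
tail -n +2 id-groupid.csv | cut -d, -f2 | sort | uniq -c
```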

SHACL4bib

The Shapes Constraint Language (SHACL) has been adapted to MARC and PICA records. It provides customized analyses for a library: a library can write a configuration file to check records against its own conventions and rules that are not part of the core standard. This feature was partly developed by Jean Michel Nzi Mba as part of his Bachelor thesis.
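
A rough sketch of the idea behind such rule-based checking. The rule file format and the sample data below are invented for illustration; the actual SHACL4bib configuration format differs:

```shell
# Invented rule table: field tag, required content pattern (ERE).
cat > rules.csv <<'EOF'
021A,\$a.+
EOF

# Invented sample: the second 021A line violates the rule (no $a content).
cat > records.pica <<'EOF'
021A $aGood title
021A $x
EOF

# Report lines of each constrained field that do not match the pattern.
while IFS=, read -r tag pattern; do
  grep "^$tag " records.pica | grep -vE "$pattern" || true
done < rules.csv
```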

Other features

Improved command line interface and documentation. The code base has become more robust thanks to hints from the code quality assessment framework Sonar.

Contributors

In the creation of this release Jakob Voß (VZG) and Jean Michel Nzi Mba (University of Göttingen) provided important contributions. Special thanks to Verbundzentrale des GBV (VZG), GWDG and JetBrains for supporting the development.

Details

Group values by library

  • #199: Group results in completeness
  • #200: Group results in issues
  • #246: Filter results in data tab
  • #254: Fixing performance issue for grouping validation
  • #253: Creation of id-groupid.csv required for validation

PICA changes

  • #163: PICA: general changes
  • #190: Extend PICA subject fields
  • #215: Completeness: check occurrence numbers
  • #232: Adding XML serialization for PICA
  • #234: Making occurrence a first class citizen of PICA data fields
  • #247: Uniqueness of PICA field ranges reported wrongly
  • #251: PICA: fixing reading of gzipped files
  • #250: Copy Avram schema to output directory
  • Adjust K10plus Avram schema

Shacl4bib

  • #209: adding Shacl4bib
  • #217: create a stub class

Command line interface

  • common-script: die if input files don't exist
  • common-script: disable colors if not run via terminal
  • common-script: emit DONE only for processing steps
  • common-script: show UPDATE on config
  • Add default settings to setdir.sh
  • Add configuration variable UPDATE and summarize configuration
  • Add configuration variable ANALYSES for all-analyses
  • Refactor common-script
  • Allow globs in MASK
  • Fixing parameter removal from catalogue specific params
  • Ignore default input/output also when they are symlinks
  • Improve downloaders
  • Improve KB downloader
  • Update ONB downloader
  • Improve output of common-script
  • Add input directory to ONB downloader
  • #223: Create a configuration file for Zentralbibliothek Zürich
  • masking ZB
  • #265: 'all' command should run only the selected tasks if schema is PICA
  • Update catalogue scripts
  • Update catalogues
  • Make common-script more robust
  • Make setdir.sh optional
  • Make sqlite more robust
  • Remove unnecessary ; chars
  • Simplify bash scripts
  • Simplify catalogues/k10plus_*.sh
  • Remove duplicated DONE in catalog scripts
  • Remove unused parts
  • Support setting MASK in setdir.sh (k10plus_pica only)

Documentation

  • README.md: Adjust path to run helper script
  • Create CONTRIBUTING
  • Better definition of the tool in the README
  • Adding sponsors section
  • Adding Binghamton University Libraries to the list of users
  • Add SonarCloud badge
  • #196: update README
  • #244: Document dependencies (close #244)
  • Rename CONTRIBUTING to CONTRIBUTING.md
  • Update test schema README file

CSV generation

  • #216: Completeness: use proper CSV library to generate .csv
  • #242: Validation: use proper CSV library to generate .csv
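
The switch to a proper CSV library matters because hand-rolled CSV output breaks on embedded delimiters. A self-contained shell illustration of the quoting rule (double internal quotes, then wrap the field); the sample title is made up:

```shell
title='Subtitle, with "quotes"'

# Naive writing produces a malformed row (the comma splits the field):
printf '%s\n' "id1,$title"

# RFC 4180-style quoting: double any internal quotes, wrap the field.
escaped=${title//\"/\"\"}
printf '%s\n' "id1,\"$escaped\""
```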

other

  • #227: Data fields (without subfields) are categorized as "unknown origin" in marc-elements.csv

Dependency updates

  • upgrade com.fasterxml.jackson.core from 2.13.4 to 2.15.0
  • upgrade org.apache.logging.log4j from 2.19.0 to 2.20.0
  • upgrade org.apache.solr from 9.1.0 to 9.2.0
  • upgrade org.apache.spark from 3.3.1 to 3.3.2
  • upgrade org.mongodb:bson from 4.7.2 to 4.9.1
  • upgrade org.mongodb:mongo-java-driver from 3.12.11 to 3.12.13
  • upgrade org.xerial:sqlite-jdbc from 3.39.3.0 to 3.41.2.1

Debugging, refactoring, performance improvements

  • Implement Sonar suggestions.
  • #269: Build failure: testing
  • Add coveralls report integration
  • Improve performance of classification analysis
  • Improve test coverage
  • Improving performance
  • Fix a missing character from the Docker description.

Files

  • qa-catalogue-0.7.0-release.zip: all the files needed to run the software. Download, unzip and go!
  • qa-catalogue-0.7.0-jar-with-dependencies.jar: the Java library file with all the dependencies
  • qa-catalogue-0.7.0.jar: the Java library file without dependencies

Version 0.7.0 release candidate (1)

22 May 08:26
Pre-release

The current release has the following major features:

  • for union catalogues, the main analyses (validation, completeness) and data are displayed both for the whole catalogue and for individual libraries
  • improved support for PICA records
  • improved command line interface
  • a new beta feature: validation against SHACL-like problem patterns

Group values by library

  • #199: Group results in completeness
  • #200: Group results in issues
  • #246: Filter results in data tab
  • #254: Fixing performance issue for grouping validation
  • #253: Creation of id-groupid.csv required for validation

PICA changes

  • #163: PICA: general changes
  • #190: Extend PICA subject fields
  • #215: Completeness: check occurrence numbers
  • #232: Adding XML serialization for PICA
  • #234: Making occurrence a first class citizen of PICA data fields
  • #247: Uniqueness of PICA field ranges reported wrongly
  • #251: PICA: fixing reading of gzipped files
  • #250: Copy Avram schema to output directory
  • Adjust K10plus Avram schema

Shacl4bib

  • #209: adding Shacl4bib
  • #217: create a stub class

Command line interface

  • common-script: die if input files don't exist
  • common-script: disable colors if not run via terminal
  • common-script: emit DONE only for processing steps
  • common-script: show UPDATE on config
  • Add default settings to setdir.sh
  • Add configuration variable UPDATE and summarize configuration
  • Add configuration variable ANALYSES for all-analyses
  • Refactor common-script
  • Allow globs in MASK
  • Fixing parameter removal from catalogue specific params
  • Ignore default input/output also when they are symlinks
  • Improve downloaders
  • Improve KB downloader
  • Update ONB downloader
  • Improve output of common-script
  • Add input directory to ONB downloader
  • #223: Create a configuration file for Zentralbibliothek Zürich
  • masking ZB
  • #265: 'all' command should run only the selected tasks if schema is PICA
  • Update catalogue scripts
  • Update catalogues
  • Make common-script more robust
  • Make setdir.sh optional
  • Make sqlite more robust
  • Remove unnecessary ; chars
  • Simplify bash scripts
  • Simplify catalogues/k10plus_*.sh
  • Remove duplicated DONE in catalog scripts
  • Remove unused parts
  • Support setting MASK in setdir.sh (k10plus_pica only)

Documentation

  • README.md: Adjust path to run helper script
  • Create CONTRIBUTING
  • Better definition of the tool in the README
  • Adding sponsors section
  • Adding Binghamton University Libraries to the list of users
  • Add SonarCloud badge
  • #196: update README
  • #244: Document dependencies (close #244)
  • Rename CONTRIBUTING to CONTRIBUTING.md
  • Update test schema README file

CSV generation

  • #216: Completeness: use proper CSV library to generate .csv
  • #242: Validation: use proper CSV library to generate .csv

other

  • #227: Data fields (without subfields) are categorized as "unknown origin" in marc-elements.csv

Dependency updates

  • upgrade com.fasterxml.jackson.core from 2.13.4 to 2.15.0
  • upgrade org.apache.logging.log4j from 2.19.0 to 2.20.0
  • upgrade org.apache.solr from 9.1.0 to 9.2.0
  • upgrade org.apache.spark from 3.3.1 to 3.3.2
  • upgrade org.mongodb:bson from 4.7.2 to 4.9.1
  • upgrade org.mongodb:mongo-java-driver from 3.12.11 to 3.12.13
  • upgrade org.xerial:sqlite-jdbc from 3.39.3.0 to 3.41.2.1

Debugging, refactoring, performance improvements

  • Implement Sonar suggestions.
  • #269: Build failure: testing
  • Add coveralls report integration
  • Improve performance of classification analysis
  • Improve test coverage
  • Improving performance
  • Fix a missing character from the Docker description.

Release v0.6.0

22 Nov 14:41

The main focus of the current release is to support the basic analyses of PICA records, i.e.

  • validation
  • completeness
  • indexing
  • subject indexing
  • authority names
  • cataloguing history

PICA field definitions are not hardcoded as in the case of MARC, but come from an external Avram schema, so customization to a library's needs is flexible. If no other schema is provided, QA catalogue uses the metadata schema of the North German union catalogue K10plus, which can be downloaded from https://format.k10plus.de/avram.pl?profile=k10plus-title. The work on PICA is sponsored by the Verbundzentrale (VZG) des Gemeinsamen Bibliotheksverbundes (GBV).
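
A quick way to see which fields an Avram schema defines, sketched against a toy schema (the real K10plus schema at the URL above is far larger, and a JSON parser should be used in practice; the field entries below are invented):

```shell
# Toy Avram-style schema: field tags are the keys of the "fields" object.
cat > avram.json <<'EOF'
{"fields": {"003@": {"label": "PPN"}, "021A": {"label": "Title"}}}
EOF

# List the defined field tags (crude text extraction for illustration).
grep -oE '"[0-9]{3}[A-Z@]"' avram.json | tr -d '"'
```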

The release also contains other bug fixes and improvements.

The artefacts of the release are available in Maven Central as well: https://central.sonatype.dev/artifact/de.gwdg.metadataqa/metadata-qa-marc/0.6.0.

PICA related changes:

  • #137 filter out records
  • #138 parsing PICA Plain file
  • #140 parsing PICA records
  • #142 completeness of PICA records
  • #144 filter out internal fields
  • #145 Implement PICA Path
  • #151 validate PICA records
  • #152 ignorableFields parameter should support masking
  • #153 indexing PICA records
  • #154 subject indexing analysis for PICA
  • #155 name authority analysis
  • #161 cataloguing history
  • #164 Parsing PICA Plain with $ in field values
  • #174 FRBR functions
  • #187 add parameter to exclude issue types
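
PICA Path expressions (#145) address fields and subfields, e.g. 021A$a. A crude shell approximation of evaluating such a path over PICA Plain data, with sample data invented for the example; the actual implementation lives in the Java code base:

```shell
# Invented PICA Plain sample.
cat > records.pica <<'EOF'
003@ $01234
021A $aExample title$hsubtitle
EOF

# Evaluate the path "021A$a": select the field, then cut out subfield $a.
grep '^021A ' records.pica | sed -n 's/.*\$a\([^$]*\).*/\1/p'
```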

Other changes:

  • #188 Move validators to distinct classes
  • #128 Implement incremental timeline
  • #127 Include version specific subfields to the JSON schema representation and completeness

Many thanks to @nichtich for being an excellent committer and product owner of this release!

Release v0.5.0

29 Mar 19:24

The highlights of this release:

  • the British Library and KBR, the national library of Belgium, started to use this tool; both of them, as well as Ghent University Library, sent important feedback, bug reports and feature requests
  • the underlying Java version has been changed to Java 11, and several other technical changes have been implemented
  • improved documentation

In the future we will issue releases more frequently.

The list of important changes:

Release v0.4

02 Feb 12:15

The main features of the current release:

  • Full Solr index
  • Completeness calculation
  • MARC validation
  • Support of FRBR functions
  • Subject analysis
  • Authority names analysis
  • Serials analysis
  • Thompson–Traill completeness (ebook analysis)
  • Shelf-Ready completeness
  • Field frequency distribution
  • History of cataloging

Other features:

v0.3

15 Jan 16:35

Installation

  1. wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.3/metadata-qa-marc-0.3-release.zip
  2. unzip metadata-qa-marc-0.3-release.zip
  3. cd metadata-qa-marc-0.3/

Configuration

  1. cp setdir.sh.template setdir.sh
  2. nano setdir.sh

Set your path to the root MARC directories:

# the input directory, where your MARC dump files exist
BASE_INPUT_DIR=
# the output directory, where the output CSV files will land
BASE_OUTPUT_DIR=
  3. Create a configuration based on an existing config file:
  • cp scripts/loc.sh scripts/[abbreviation-of-your-library].sh
  • edit scripts/[abbreviation-of-your-library].sh according to the configuration guide
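
For reference, a filled-in setdir.sh might look like the fragment below; the paths are purely illustrative and should point at your own directories:

```shell
# Illustrative values only; adjust to your environment.
BASE_INPUT_DIR=/data/marc/input
BASE_OUTPUT_DIR=/data/marc/output
```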

Use

scripts/[abbreviation-of-your-library].sh all-analyses
scripts/[abbreviation-of-your-library].sh all-solr

For a catalogue with around 1 million records the first command takes 5-10 minutes, the latter 1-2 hours.

Release for the SWIB19 conference

28 Nov 11:10

Installation

  1. wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.2.1/metadata-qa-marc-0.2-SNAPSHOT-release.zip
  2. unzip metadata-qa-marc-0.2-SNAPSHOT-release.zip
  3. cd metadata-qa-marc-0.2-SNAPSHOT/

Configuration

  1. cp setdir.sh.template setdir.sh
  2. nano setdir.sh

Set your path to the root MARC directories:

# the input directory, where your MARC dump files exist
BASE_INPUT_DIR=
# the output directory, where the output CSV files will land
BASE_OUTPUT_DIR=
  3. Create a configuration based on an existing config file:
  • cp scripts/loc.sh scripts/[abbreviation-of-your-library].sh
  • edit scripts/[abbreviation-of-your-library].sh according to the configuration guide

Use

scripts/[abbreviation-of-your-library].sh all-analyses
scripts/[abbreviation-of-your-library].sh all-solr

For a catalogue with around 1 million records the first command takes 5-10 minutes, the latter 1-2 hours.

v0.2

02 Jun 21:38
Pre-release

Release prepared for the Data Quality workshop at ELAG 2018