Releases: pkiraly/qa-catalogue

Version 0.7.0

18 Jul 12:18

The major features of this release

Improved PICA handling

PICA is an alternative bibliographic metadata schema used in Germany, the Netherlands and France. The PICA-related features were developed in cooperation with K10plus, the largest union catalogue in Germany. The analyses of PICA records now cover completeness, validation, subject headings and authority names, plus searching and displaying individual records.
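
To illustrate the idea behind the completeness analysis, the sketch below counts field tags in a PICA Plain file. The sample records are invented for the example and are not K10plus data; the real analysis is performed by the Java tool, not by this one-liner:

```shell
# Create a tiny PICA Plain sample (hypothetical records, one field per line).
cat > sample.pica <<'EOF'
003@ $01234
021A $aExample title
045E $a610
003@ $05678
021A $aAnother title
EOF

# Completeness sketch: how often does each field tag occur?
awk '{ print $1 }' sample.pica | sort | uniq -c | sort -rn
```

The frequency table produced this way is the simplest form of the completeness report: which fields are present, and how often.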

Handling union catalogues

Union catalogues cover the collections of multiple libraries. QA catalogue can now display the results of completeness, validation, searching and term lists both for the whole catalogue and for any individual library.
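
The per-library grouping builds on an id-groupid.csv mapping (see #253 below). A minimal sketch of how such a record-to-library mapping can be aggregated; the record and group identifiers here are made up for illustration:

```shell
# Hypothetical id-groupid.csv: one row per (record, holding library) pair.
cat > id-groupid.csv <<'EOF'
id,groupId
rec1,77
rec1,2035
rec2,77
EOF

# Records per library group (skip the header row).
tail -n +2 id-groupid.csv | cut -d, -f2 | sort | uniq -c
```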

SHACL4bib

The Shapes Constraint Language (SHACL) has been adapted to MARC and PICA records. It provides customized analyses for a library: a library can write a configuration file to check records against its own conventions and rules that are not part of the core standard. This feature was partly developed by Jean Michel Nzi Mba as part of his Bachelor thesis.
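
A rough sketch of the idea behind such rule-based checking. The rule file format and the sample data below are invented for illustration; the actual SHACL4bib configuration format differs:

```shell
# Invented rule table: field tag, required content pattern (ERE).
cat > rules.csv <<'EOF'
021A,\$a.+
EOF

# Invented sample: the second 021A line violates the rule (no $a content).
cat > records.pica <<'EOF'
021A $aGood title
021A $x
EOF

# Report lines of each constrained field that do not match the pattern.
while IFS=, read -r tag pattern; do
  grep "^$tag " records.pica | grep -vE "$pattern" || true
done < rules.csv
```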

Other features

Improved command line interface and documentation. The code base has become more robust thanks to hints from the code quality assessment framework Sonar.

Contributors

In the creation of this release Jakob Voß (VZG) and Jean Michel Nzi Mba (University of Göttingen) provided important contributions. Special thanks to Verbundzentrale des GBV (VZG), GWDG and JetBrains for supporting the development.

Details

Group values by library

  • #199: Group results in completeness
  • #200: Group results in issues
  • #246: Filter results in data tab
  • #254: Fixing performance issue for grouping validation
  • #253: Creation of id-groupid.csv required for validation

PICA changes

  • #163: PICA: general changes
  • #190: Extend PICA subject fields
  • #215: Completeness: check occurrence numbers
  • #232: Adding XML serialization for PICA
  • #234: Making occurrence a first class citizen of PICA data fields
  • #247: Uniqueness of PICA field ranges reported wrongly
  • #251: PICA: fixing reading of gzipped files
  • #250: Copy Avram schema to output directory
  • Adjust K10plus Avram schema

Shacl4bib

  • #209: adding Shacl4bib
  • #217: create a stub class

Command line interface

  • common-script: die if input files don't exist
  • common-script: disable colors if not run via terminal
  • common-script: emit DONE only for processing steps
  • common-script: show UPDATE on config
  • Add default settings to setdir.sh
  • Add configuration variable UPDATE and summarize configuration
  • Add configuration variable ANALYSES for all-analyses
  • Refactor common-script
  • Allow globs in MASK
  • Fixing parameter removal from catalogue specific params
  • Ignore default input/output also when they are symlinks
  • Improve downloaders
  • Improve KB downloader
  • Update ONB downloader
  • Improve output of common-script
  • Add input directory to ONB downloader
  • #223: Create a configuration file for Zentralbibliothek Zürich
  • masking ZB
  • #265: 'all' command should run only the selected tasks if schema is PICA
  • Update catalogue scripts
  • Update catalogues
  • Make common-script more robust
  • Make setdir.sh optional
  • Make sqlite more robust
  • Remove unnecessary ; chars
  • Simplify bash scripts
  • Simplify catalogues/k10plus_*.sh
  • Remove duplicated DONE in catalog scripts
  • Remove unused parts
  • Support setting MASK in setdir.sh (k10plus_pica only)

Documentation

  • README.md: Adjust path to run helper script
  • Create CONTRIBUTING
  • Better definition of the tool in the README
  • Adding sponsors section
  • Adding Binghamton University Libraries to the list of users
  • Add SonarCloud badge
  • #196: update README
  • #244: Document dependencies (close #244)
  • Rename CONTRIBUTING to CONTRIBUTING.md
  • Update test schema README file

CSV generation

  • #216: Completeness: use proper CSV library to generate .csv
  • #242: Validation: use proper CSV library to generate .csv
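
The switch to a proper CSV library matters because hand-rolled CSV output breaks on embedded delimiters. A self-contained shell illustration of the quoting rule (double internal quotes, then wrap the field); the sample title is made up:

```shell
title='Subtitle, with "quotes"'

# Naive writing produces a malformed row (the comma splits the field):
printf '%s\n' "id1,$title"

# RFC 4180-style quoting: double any internal quotes, wrap the field.
escaped=${title//\"/\"\"}
printf '%s\n' "id1,\"$escaped\""
```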

other

  • #227: Data fields (without subfields) are categorized as "unknown origin" in marc-elements.csv

Dependency updates

  • upgrade com.fasterxml.jackson.core from 2.13.4 to 2.15.0
  • upgrade org.apache.logging.log4j from 2.19.0 to 2.20.0
  • upgrade org.apache.solr from 9.1.0 to 9.2.0
  • upgrade org.apache.spark from 3.3.1 to 3.3.2
  • upgrade org.mongodb:bson from 4.7.2 to 4.9.1
  • upgrade org.mongodb:mongo-java-driver from 3.12.11 to 3.12.13
  • upgrade org.xerial:sqlite-jdbc from 3.39.3.0 to 3.41.2.1

Debugging, refactoring, performance improvements

  • Implement Sonar suggestions.
  • #269: Build failure: testing
  • Add coveralls report integration
  • Improve performance of classification analysis
  • Improve test coverage
  • Improving performance
  • Fix a missing character from the Docker description.

Files

  • qa-catalogue-0.7.0-release.zip: all the files needed to run the software. Download, unzip and go!
  • qa-catalogue-0.7.0-jar-with-dependencies.jar: the Java library file with all the dependencies
  • qa-catalogue-0.7.0.jar: the Java library file without dependencies

Version 0.7.0 release candidate (1)

22 May 08:26
Pre-release

The current release has the following major features:

  • for union catalogues, the main analyses (validation, completeness) and data are displayed both for the whole catalogue and for individual libraries
  • improved support for PICA records
  • improved command line interface
  • a new beta feature: validation against SHACL-like problem patterns

Group values by library

  • #199: Group results in completeness
  • #200: Group results in issues
  • #246: Filter results in data tab
  • #254: Fixing performance issue for grouping validation
  • #253: Creation of id-groupid.csv required for validation

PICA changes

  • #163: PICA: general changes
  • #190: Extend PICA subject fields
  • #215: Completeness: check occurrence numbers
  • #232: Adding XML serialization for PICA
  • #234: Making occurrence a first class citizen of PICA data fields
  • #247: Uniqueness of PICA field ranges reported wrongly
  • #251: PICA: fixing reading of gzipped files
  • #250: Copy Avram schema to output directory
  • Adjust K10plus Avram schema

Shacl4bib

  • #209: adding Shacl4bib
  • #217: create a stub class

Command line interface

  • common-script: die if input files don't exist
  • common-script: disable colors if not run via terminal
  • common-script: emit DONE only for processing steps
  • common-script: show UPDATE on config
  • Add default settings to setdir.sh
  • Add configuration variable UPDATE and summarize configuration
  • Add configuration variable ANALYSES for all-analyses
  • Refactor common-script
  • Allow globs in MASK
  • Fixing parameter removal from catalogue specific params
  • Ignore default input/output also when they are symlinks
  • Improve downloaders
  • Improve KB downloader
  • Update ONB downloader
  • Improve output of common-script
  • Add input directory to ONB downloader
  • #223: Create a configuration file for Zentralbibliothek Zürich
  • masking ZB
  • #265: 'all' command should run only the selected tasks if schema is PICA
  • Update catalogue scripts
  • Update catalogues
  • Make common-script more robust
  • Make setdir.sh optional
  • Make sqlite more robust
  • Remove unnecessary ; chars
  • Simplify bash scripts
  • Simplify catalogues/k10plus_*.sh
  • Remove duplicated DONE in catalog scripts
  • Remove unused parts
  • Support setting MASK in setdir.sh (k10plus_pica only)

Documentation

  • README.md: Adjust path to run helper script
  • Create CONTRIBUTING
  • Better definition of the tool in the README
  • Adding sponsors section
  • Adding Binghamton University Libraries to the list of users
  • Add SonarCloud badge
  • #196: update README
  • #244: Document dependencies (close #244)
  • Rename CONTRIBUTING to CONTRIBUTING.md
  • Update test schema README file

CSV generation

  • #216: Completeness: use proper CSV library to generate .csv
  • #242: Validation: use proper CSV library to generate .csv

other

  • #227: Data fields (without subfields) are categorized as "unknown origin" in marc-elements.csv

Dependency updates

  • upgrade com.fasterxml.jackson.core from 2.13.4 to 2.15.0
  • upgrade org.apache.logging.log4j from 2.19.0 to 2.20.0
  • upgrade org.apache.solr from 9.1.0 to 9.2.0
  • upgrade org.apache.spark from 3.3.1 to 3.3.2
  • upgrade org.mongodb:bson from 4.7.2 to 4.9.1
  • upgrade org.mongodb:mongo-java-driver from 3.12.11 to 3.12.13
  • upgrade org.xerial:sqlite-jdbc from 3.39.3.0 to 3.41.2.1

Debugging, refactoring, performance improvements

  • Implement Sonar suggestions.
  • #269: Build failure: testing
  • Add coveralls report integration
  • Improve performance of classification analysis
  • Improve test coverage
  • Improving performance
  • Fix a missing character from the Docker description.

Release v0.6.0

22 Nov 14:41

The main focus of the current release is to support the basic analyses of PICA records, i.e.

  • validation
  • completeness
  • indexing
  • subject indexing
  • authority names
  • cataloguing history

PICA field definitions are not hardcoded as in the case of MARC, but come from an external Avram schema, so customization to a library's needs is flexible. If no other schema is provided, QA catalogue uses the metadata schema of the North German union catalogue K10plus, which can be downloaded from https://format.k10plus.de/avram.pl?profile=k10plus-title. The work on PICA is sponsored by the Verbundzentrale (VZG) des Gemeinsamen Bibliotheksverbundes (GBV).
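
A quick way to see which fields an Avram schema defines, sketched against a toy schema (the real K10plus schema at the URL above is far larger, and a JSON parser should be used in practice; the field entries below are invented):

```shell
# Toy Avram-style schema: field tags are the keys of the "fields" object.
cat > avram.json <<'EOF'
{"fields": {"003@": {"label": "PPN"}, "021A": {"label": "Title"}}}
EOF

# List the defined field tags (crude text extraction for illustration).
grep -oE '"[0-9]{3}[A-Z@]"' avram.json | tr -d '"'
```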

The release also contains other bug fixes and improvements.

The artefacts of the release are available in Maven Central as well: https://central.sonatype.dev/artifact/de.gwdg.metadataqa/metadata-qa-marc/0.6.0.

PICA related changes:

  • #137 filter out records
  • #138 parsing PICA Plain file
  • #140 parsing PICA records
  • #142 completeness of PICA records
  • #144 filter out internal fields
  • #145 Implement PICA Path
  • #151 validate PICA records
  • #152 ignorableFields parameter should support masking
  • #153 indexing PICA records
  • #154 subject indexing analysis for PICA
  • #155 name authority analysis
  • #161 cataloguing history
  • #164 Parsing PICA Plain with $ in field values
  • #174 FRBR functions
  • #187 add parameter to exclude issue types
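
PICA Path expressions (#145) address fields and subfields, e.g. 021A$a. A crude shell approximation of evaluating such a path over PICA Plain data, with sample data invented for the example; the actual implementation lives in the Java code base:

```shell
# Invented PICA Plain sample.
cat > records.pica <<'EOF'
003@ $01234
021A $aExample title$hsubtitle
EOF

# Evaluate the path "021A$a": select the field, then cut out subfield $a.
grep '^021A ' records.pica | sed -n 's/.*\$a\([^$]*\).*/\1/p'
```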

Other changes:

  • #188 Move validators to distinct classes
  • #128 Implement incremental timeline
  • #127 Include version specific subfields to the JSON schema representation and completeness

Many thanks to @nichtich for being an excellent committer and product owner of this release!

Release v0.5.0

29 Mar 19:24

The highlights of this release:

  • the British Library and KBR, the national library of Belgium, started to use this tool; both of them, as well as Ghent University Library, sent important feedback, bug reports and feature requests
  • the underlying Java version has been changed to Java 11, and several other technical changes have been implemented
  • improved documentation

In the future we will issue releases more frequently.

The list of important changes:

Release v0.4

02 Feb 12:15

The main features of the current release:

  • Full Solr index
  • Completeness calculation
  • MARC validation
  • Support of FRBR functions
  • Subject analysis
  • Authority names analysis
  • Serials analysis
  • Thompson–Traill completeness (ebook analysis)
  • Shelf-Ready completeness
  • Field frequency distribution
  • History of cataloging

Other features:

v0.3

15 Jan 16:35

Installation

  1. wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.3/metadata-qa-marc-0.3-release.zip
  2. unzip metadata-qa-marc-0.3-release.zip
  3. cd metadata-qa-marc-0.3/

Configuration

  1. cp setdir.sh.template setdir.sh
  2. nano setdir.sh

Set your path to the root MARC directories:

# the input directory, where your MARC dump files exist
BASE_INPUT_DIR=
# the output directory, where the output CSV files will land
BASE_OUTPUT_DIR=
  3. Create a configuration based on an existing config file:
  • cp scripts/loc.sh scripts/[abbreviation-of-your-library].sh
  • edit scripts/[abbreviation-of-your-library].sh according to the configuration guide
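
For reference, a filled-in setdir.sh might look like the fragment below; the paths are purely illustrative and should point at your own directories:

```shell
# Illustrative values only; adjust to your environment.
BASE_INPUT_DIR=/data/marc/input
BASE_OUTPUT_DIR=/data/marc/output
```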

Use

scripts/[abbreviation-of-your-library].sh all-analyses
scripts/[abbreviation-of-your-library].sh all-solr

For a catalogue with around 1 million records the first command takes 5-10 minutes, the latter 1-2 hours.

Release for the SWIB19 conference

28 Nov 11:10

Installation

  1. wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.2.1/metadata-qa-marc-0.2-SNAPSHOT-release.zip
  2. unzip metadata-qa-marc-0.2-SNAPSHOT-release.zip
  3. cd metadata-qa-marc-0.2-SNAPSHOT/

Configuration

  1. cp setdir.sh.template setdir.sh
  2. nano setdir.sh

Set your path to the root MARC directories:

# the input directory, where your MARC dump files exist
BASE_INPUT_DIR=
# the output directory, where the output CSV files will land
BASE_OUTPUT_DIR=
  3. Create a configuration based on an existing config file:
  • cp scripts/loc.sh scripts/[abbreviation-of-your-library].sh
  • edit scripts/[abbreviation-of-your-library].sh according to the configuration guide

Use

scripts/[abbreviation-of-your-library].sh all-analyses
scripts/[abbreviation-of-your-library].sh all-solr

For a catalogue with around 1 million records the first command takes 5-10 minutes, the latter 1-2 hours.

v0.2

02 Jun 21:38
Pre-release

Release prepared for the Data Quality workshop at ELAG 2018