pkiraly/marc-pipeline

marc-pipeline

MARC pipeline for quality assessment preparation. The purpose of this project is to provide an automatic way to convert binary MARC or MARCXML files to JSON files that are ready to be processed by Apache Spark. It

  1. transforms binary MARC files to MARCXML (with yaz-marcdump),
  2. normalizes the UTF-8 encoding (with uconv),
  3. transforms MARCXML to JSON (with Catmandu),
  4. reformats the JSON files.

The final JSON contains one record per line -- this is the way Apache Spark ingests files. Other differences between the JSON produced by Catmandu and the JSON this project produces:

  • the order of the components is the same in every record (in Librecat output the order of the components varies)
  • the datafield's subfield component is always an array of objects (in Librecat output it is a single object if there is only one subfield)
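
The two differences above can be illustrated with a short Python sketch. This is not the project's formatCatmanduOutput.php script. The input layout (the keys `leader`, `controlfield`, `datafield` and `subfield`), the key order, and the function names are assumptions made only for illustration.

```python
import json

# Hypothetical key order; the real order is defined by the project's formatter.
KEY_ORDER = ["leader", "controlfield", "datafield"]


def normalize_record(record):
    """Return a copy with a fixed key order and list-valued subfields."""
    ordered = {}
    for key in KEY_ORDER:
        if key in record:
            ordered[key] = record[key]
    # Keep keys not named in KEY_ORDER, sorted so the order stays stable.
    for key in sorted(record):
        if key not in ordered:
            ordered[key] = record[key]

    if "datafield" in ordered:
        datafields = []
        for field in ordered["datafield"]:
            field = dict(field)
            subfield = field.get("subfield", [])
            # A lone subfield arrives as a single object; wrap it in a list.
            if isinstance(subfield, dict):
                subfield = [subfield]
            field["subfield"] = subfield
            datafields.append(field)
        ordered["datafield"] = datafields
    return ordered


def to_json_lines(records):
    """Serialize the records as one compact JSON document per line."""
    return "\n".join(
        json.dumps(normalize_record(record), ensure_ascii=False)
        for record in records
    )
```

Writing one compact JSON document per line keeps each record self-contained, so a line-oriented reader can process the records independently.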

prerequisite software

Catmandu requires a special installation. The other two tools (yaz-marcdump and uconv) are available as standard *nix tools.

processing single files

  1. one-file-to-json.sh - converts XML to JSON with Catmandu
  2. one-json-to-formatted.sh - changes the JSON format generated by Catmandu, using the formatCatmanduOutput.php script

processing multiple files

  1. marc-to-xml.sh - converts the binary MARC files in the marc directory to XML with yaz-marcdump, then splits the files with split-xml.php. Each new file contains at most 10,000 records.
  2. to-utf8.sh - converts each XML file in a directory to a normalized UTF-8 file with the uconv tool. The MARC to XML converters do not deal with decomposed characters. This step is needed if the accented characters in the XML remain decomposed (such as a + ¨ instead of ä). See Unicode normalization and Combining and precomposed characters.
  3. split-xml.sh - splits the MARCXML files in the marc directory and places the new files into splitted. The script makes use of split-xml.php. Each new file contains at most 10,000 records. If you start with binary MARC you don't have to apply this step, because marc-to-xml.sh already includes it.
  4. xml-to-json.sh - converts the XML files in the splitted directory to JSON with Catmandu. It moves the converted XML files to converted and the resulting .json files to json/raw.
  5. format-json.sh - converts the .json files in json/raw into a more convenient JSON format. It saves the new files into the json/formatted directory and moves the source files into json/processed.
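
The normalization in step 2 can be illustrated in a few lines of Python. The pipeline itself uses uconv; this sketch uses the standard library's `unicodedata` module instead, and assumes that the target form is NFC (precomposed characters). The function name `to_nfc` is hypothetical.

```python
import unicodedata


def to_nfc(text):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


# 'a' followed by U+0308 COMBINING DIAERESIS: two code points, rendered as one letter.
decomposed = "a\u0308"
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))  # prints: 2 1
print(composed == "\u00e4")            # prints: True (U+00E4 is the precomposed ä)
```

The decomposed and the composed form look the same on screen but compare as different strings, which is why records that mix the two forms are hard to match or deduplicate.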

directories

  • marc - put here the original binary MARC or MARCXML files
  • splitted - the script puts the split XML files here temporarily
  • converted - after the JSON conversion the scripts move the split XML files here
  • json/raw - the place of the Catmandu generated JSON files before format
  • json/processed - the final place of the Catmandu generated JSON files
  • json/formatted - the formatted JSON files. This is the end result of the process. If everything went correctly, you can delete the content of the other directories.
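
A minimal Python sketch for preparing this layout in a working directory, assuming the scripts expect the directories to exist before they run (the project does not state this). The helper name `create_pipeline_dirs` is hypothetical.

```python
from pathlib import Path

# Directory names from the list above. "json/processed" follows the
# description of the format-json.sh step.
PIPELINE_DIRS = [
    "marc",
    "splitted",
    "converted",
    "json/raw",
    "json/processed",
    "json/formatted",
]


def create_pipeline_dirs(base):
    """Create the pipeline directories under `base` and return their paths."""
    base = Path(base)
    created = []
    for name in PIPELINE_DIRS:
        path = base / name
        # exist_ok=True makes the call safe to repeat on an existing layout.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```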

running the XML to JSON process with cron scheduler

Edit crontab with the

crontab -e

command and add the following line:

*/1 * * * * cd /to/working/directory && php toJsonLauncher.php >> launch-report.log

This script runs the one-file-to-json.sh script on each file listed in the to-json-setlist.txt file.
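
The launcher idea can be sketched in Python as follows. The real launcher is the PHP script toJsonLauncher.php, and its behavior may differ. This sketch assumes that the setlist holds one file name per line and that the shell script takes the file name as its only argument. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def read_setlist(path):
    """Return the non-empty lines of a setlist file, one file name per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def launch(setlist_path, script, run=subprocess.run):
    """Run `script` once for each file named in the setlist."""
    results = []
    for file_name in read_setlist(setlist_path):
        # Assumption: the script accepts the file name as its single argument.
        completed = run([script, file_name], check=False)
        results.append((file_name, completed))
    return results
```

A call such as `launch("to-json-setlist.txt", "./one-file-to-json.sh")` would process the listed files in order. The `run` parameter exists so that the function can be tried with a stand-in before any real conversion is started.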

running the JSON formatting process with cron scheduler

Edit crontab with the

crontab -e

command and add the following line:

*/1 * * * * cd /to/working/directory && php toFormattedLauncher.php >> launch-report.log

This script runs the one-json-to-formatted.sh script on each file listed in the to-formatted-setlist.txt file.