This repository has been archived by the owner on Nov 25, 2021. It is now read-only.

Staging Server Processing

Michael Joyce edited this page Jun 1, 2015 · 40 revisions

Overview

OJS journals participating in the PKP PLN register new content (typically on the publication of a new issue) by initiating a SWORD deposit request against the PLN's staging server. After the new content is registered with the staging server, a series of processes is applied to the content to prepare it for preservation in the LOCKSS network. Each discrete process, or set of related processes, is carried out by a microservice, in the form of a Python script, as follows.

Harvesting microservice

Retrieves deposits with a state of "depositedByJournal" and downloads the payload file, which is a zipped Bag. If successful, updates the deposit's state to "harvested".

Verify payload microservice

Retrieves deposits with a state of "harvested" and verifies that the size and SHA-1 checksum value declared in the SWORD deposit match those of the downloaded file. If successful, updates the deposit's state to "payloadVerified".
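The size and checksum comparison described above can be illustrated with a minimal sketch using only the Python standard library. The function name `verify_payload` and its parameters are hypothetical and are not taken from the PKP PLN code; the expected values are assumed to have been read from the SWORD deposit beforehand.

```python
import hashlib
import os


def verify_payload(path, expected_size, expected_sha1, chunk_size=1 << 20):
    """Return True if the file at `path` matches the declared size and SHA-1.

    Hypothetical helper for illustration; not the actual PKP PLN function.
    """
    # Compare the size first: it is cheap and catches truncated downloads.
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large payloads are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # Hex digests are compared case-insensitively.
    return digest.hexdigest() == expected_sha1.lower()
```

Only when this check returns True would the deposit's state advance to "payloadVerified".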

Bag validation microservice

Retrieves deposits with a state of "payloadVerified", unzips the payload file, and validates the unzipped Bag. If the Bag validates, updates the deposit's state to "bagValidated".

Virus check microservice

Retrieves deposits with a state of "bagValidated". Extracts all files from the OJS export XML (they are encoded within the XML using base64), writes them to a temporary directory, and checks them for viruses. The results of the check are written to a log file (one line per checked file) named "virus_report.txt". The version of ClamAV that was used and the date of the check are recorded in the file as well. If successful, updates the deposit's state to "virusChecked".
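The extraction step might look roughly like the following sketch. It assumes that embedded files appear in `embed` elements carrying `encoding="base64"` and a `filename` attribute, and that the export does not use XML namespaces; both are assumptions about the export format, not details confirmed by this page. The function name `extract_embedded_files` is hypothetical, and the virus scan itself is not shown.

```python
import base64
import os
import xml.etree.ElementTree as ET


def extract_embedded_files(xml_path, out_dir):
    """Decode every base64 <embed> element and write it into out_dir.

    Assumes <embed encoding="base64" filename="...">; adjust the element and
    attribute names to match the actual export schema.
    """
    written = []
    tree = ET.parse(xml_path)
    for index, node in enumerate(tree.iter("embed")):
        if node.get("encoding") != "base64":
            continue
        # Keep only the base name so a crafted filename cannot escape out_dir.
        name = os.path.basename(node.get("filename") or "") or "embedded"
        # Prefix with the element index so duplicate names do not collide.
        target = os.path.join(out_dir, "%d_%s" % (index, name))
        with open(target, "wb") as handle:
            handle.write(base64.b64decode(node.text or ""))
        written.append(target)
    return written
```

The returned list of paths is what a scanner would then be pointed at, with one line per file written to the report.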

OJS content validation microservice

Retrieves deposits with a state of "virusChecked" and validates the OJS export XML using Python's lxml. Also generates an XML file containing the journal's UUID, title, ISSN, URL, contact email address, and the date the deposit was created; this file is added to the Bag created by the next microservice. If successful, updates the deposit's state to "contentVerified".

Re-Bagging microservice

Retrieves deposits with a state of "contentVerified". Copies the unzipped payload files in the original Bag's data directory to a new location and verifies the MD5 checksums of those files against the values in the original Bag's manifest. Creates a new Bag from the unzipped payload files and the virus check log, and sets the status to "reserialized".

As part of the process of creating a new Bag, adds the following tags to bag-info.txt:

Bagging-Date
External-Description [value of the journal title, ISSN, volume, and issue number where applicable]
External-Identifier [value of the deposit URL]
PKP-PLN-Journal-Contact [contact email from the journal]
PKP-PLN-Journal-UUID
PKP-PLN-Deposit-UUID

The serialized (zipped) Bag is given a filename using the pattern journaluuid.issueuuid.zip.
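The MD5 verification performed by the re-Bagging microservice can be sketched with the standard library alone. This assumes the original Bag carries a `manifest-md5.txt` whose lines have the form `<checksum> <path relative to the Bag root>`, as in the BagIt layout; the function name `verify_md5_manifest` is hypothetical, and a production service would more likely rely on a BagIt library.

```python
import hashlib
import os


def verify_md5_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Return the payload paths whose MD5 differs from the manifest entry.

    A payload file that is listed but missing raises FileNotFoundError.
    """
    mismatches = []
    manifest_path = os.path.join(bag_dir, manifest_name)
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            if not line.strip():
                continue
            # Split on the first run of whitespace; the path may contain spaces.
            expected, rel_path = line.strip().split(None, 1)
            with open(os.path.join(bag_dir, rel_path), "rb") as payload:
                actual = hashlib.md5(payload.read()).hexdigest()
            if actual != expected.lower():
                mismatches.append(rel_path)
    return mismatches
```

An empty list means every copied file still matches the original manifest, so the new Bag can be created.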

Content staging microservice

Retrieves deposits with a state of "reserialized", copies the Bag created in the previous microservice to a location accessible to the PKP Private LOCKSS Network, and sets status to "staged".

LOCKSS-O-Matic deposit microservice

Retrieves deposits with a state of "staged" and issues a SWORD deposit against LOCKSS-O-Matic so it can initiate harvesting of the content into the PKP Private LOCKSS Network. Sets the status to "depositedToPln".
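The microservices described so far move a deposit through a fixed sequence of states. A small sketch of that sequence, using the state names given on this page, is shown here; the names `PIPELINE_STATES` and `next_state` are hypothetical and chosen for illustration only.

```python
# Deposit states in the order the pipeline advances them, as named on this page.
PIPELINE_STATES = [
    "depositedByJournal",
    "harvested",
    "payloadVerified",
    "bagValidated",
    "virusChecked",
    "contentVerified",
    "reserialized",
    "staged",
    "depositedToPln",
]


def next_state(current):
    """Return the state that follows `current`, or None at the end."""
    # list.index raises ValueError for a state that is not in the pipeline.
    position = PIPELINE_STATES.index(current)
    if position + 1 < len(PIPELINE_STATES):
        return PIPELINE_STATES[position + 1]
    return None
```

Each microservice selects deposits in one state and, on success, moves them to the state returned by `next_state`.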

Querying the PLN to determine replication

The processes described above are scheduled so that a deposit created by an OJS instance should have its status set to "depositedToPln" within a day. One final periodic job runs to determine when the PLN has replicated the content across all of its nodes and they have reached "agreement" (which in LOCKSS means that all copies of the content are completely identical).

The query to determine whether all nodes in the PLN are in "agreement" is not performed against the PLN directly; it is performed against LOCKSS-O-Matic. The job requests the SWORD Statement for the content deposited into LOCKSS-O-Matic. When the SWORD Statement indicates that all of the nodes in the network are in agreement, the PKP PLN SWORD client issues an update request to LOCKSS-O-Matic to indicate that the content should not be reharvested by the PLN. The content is then deleted from the staging server.

Implementation of the microservices

The microservices are not run directly but are executed via a controller script. The controller is invoked by cron, and each cron entry names the microservice to run, using a configuration such as this:

0  0 * * * /usr/bin/python /path/to/pkppln/pln-service.py harvest
0  6 * * * /usr/bin/python /path/to/pkppln/pln-service.py validate_payload
0  7 * * * /usr/bin/python /path/to/pkppln/pln-service.py validate_bag
0  9 * * * /usr/bin/python /path/to/pkppln/pln-service.py virus_check
0 10 * * * /usr/bin/python /path/to/pkppln/pln-service.py validate_export
0 12 * * * /usr/bin/python /path/to/pkppln/pln-service.py reserialize_bag
0 17 * * * /usr/bin/python /path/to/pkppln/pln-service.py stage_bag
0 20 * * * /usr/bin/python /path/to/pkppln/pln-service.py deposit_to_pln
0 22 * * * /usr/bin/python /path/to/pkppln/pln-service.py check_status
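A controller of this kind could dispatch on its first command-line argument, as in the sketch below. The command names are the ones used in the cron entries above; everything else (the `SERVICES` table, the placeholder functions, and `main`) is hypothetical and does not reproduce the actual `pln-service.py`.

```python
import sys

# Command names exactly as they appear in the cron configuration.
COMMANDS = [
    "harvest", "validate_payload", "validate_bag", "virus_check",
    "validate_export", "reserialize_bag", "stage_bag", "deposit_to_pln",
    "check_status",
]


def make_placeholder(name):
    """Build a stand-in for a microservice; a real one would do the work."""
    def run():
        return "%s: nothing to do" % name
    return run


# Map each command name to the callable that implements it.
SERVICES = {name: make_placeholder(name) for name in COMMANDS}


def main(argv):
    """Run the microservice named in argv[1]; return a process exit code."""
    if len(argv) != 2 or argv[1] not in SERVICES:
        sys.stderr.write("usage: pln-service.py {%s}\n" % "|".join(COMMANDS))
        return 2
    SERVICES[argv[1]]()
    return 0

# A real script would end by calling: sys.exit(main(sys.argv))
```

Keeping the command-to-function mapping in one table means that adding a microservice requires only a new entry and a new cron line.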

The three microservices that change or generate files (the harvesting, Bag validation, and re-Bagging microservices) create directories for their output:

# Input directory for the verify payload and Bag validation microservices
/var/pkppln/havested
# Input for the virus check, validate OJS content, and re-Bagging microservices
/var/pkppln/bagValidated
# Input for the content staging microservice
/var/pkppln/reserialized

Each microservice is configured to use the content in one of these three directories as its input, and will only run on a deposit if the previous microservice reports a status of "success" for that deposit. A fourth directory, /var/www/pkppln, is where the serialized Bag containing the content is stored until it is harvested by the LOCKSS boxes in the PKP PLN. Within all four of these directories, each deposit is identified by its UUID.

The results of each microservice, and any errors that occurred, are logged in the database (which also uses deposits' UUIDs to identify them). If the database fails to update, the database error is logged to a file and emailed to the PKP PLN administrators.
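The fallback behaviour described here can be sketched as follows. The example uses `sqlite3` purely as a stand-in database, and the table name `microservice_log`, its columns, the logger name, and the `notify` callback are all hypothetical; the actual schema and mail mechanism are not documented on this page.

```python
import logging
import sqlite3

# Hypothetical logger; a deployment would attach a file handler to it.
fallback_log = logging.getLogger("pkppln.fallback")


def record_result(conn, deposit_uuid, service, outcome, notify):
    """Store a microservice result; on database failure, log and notify."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO microservice_log (deposit_uuid, service, outcome) "
                "VALUES (?, ?, ?)",
                (deposit_uuid, service, outcome),
            )
        return True
    except sqlite3.Error as error:
        message = "database update failed for deposit %s: %s" % (
            deposit_uuid, error)
        fallback_log.error(message)
        # `notify` stands in for whatever emails the administrators.
        notify(message)
        return False
```

Passing the notifier in as a callable keeps the database logic testable without a mail server.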
