deposit: fixes for JSON-based records #3364

Merged (1 commit) on Jul 15, 2015

crepererum
Contributor

  • Adds invenio-records to the devel requirements.
  • INCOMPATIBLE Reworks the default (i.e. example/simple) workflow for
    deposit. SIPs are now sealed and dumped after we have collected all form
    data and before we make ANY further modification to the deposit. This
    also holds for adding the default collection information to the
    deposit. Further analysis and work is required to make this process
    standard compliant (e.g. splitting SIP, DIP, ...). Dumped SIPs are
    now in JSON (was XML).

Signed-off-by: Marco Neumann <marco@crepererum.net>

NOTE: This is currently only working when using the devel (i.e. unreleased) version of invenio-records. Keep this in mind when building Docker images for interactive testing. (reminder: there is an ENV-flag to control the dependency installation similar to our Travis setup).

* Adds invenio-records to the devel requirements.

* INCOMPATIBLE Reworks the default (i.e. example/simple) workflow for
  deposit. SIPs are now sealed and dumped after we collected all form
  data and before ANY further modifications to the deposit are made.
  This also holds for adding the default collection information to the
  deposit. Further analysis and work is required to make this process
  standard compliant (e.g. splitting SIP, DIP, ...). Dumped SIPs are
  now in JSON (was XML).

Signed-off-by: Marco Neumann <marco@crepererum.net>
Reviewed-by: Jiri Kuncar <jiri.kuncar@cern.ch>
@jirikuncar
Member

:shipit:

@jirikuncar jirikuncar merged commit daf2f1f into inveniosoftware:master Jul 15, 2015
@crepererum
Contributor Author

  1. Why do we dump the JSON into the filesystem? The XML was being dumped before so that the bibupload task could upload the record, not sure if it is needed now.

Because we should dump everything we get from the user without modifying it. That covers the case where someone wants to re-run the deposit on a new version, on a different platform, or after something went terribly wrong. Bibupload is dead. It is a bad thing which was designed before we were using a proper ACID database.
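To make the "dump first, modify later" ordering concrete, here is a minimal sketch. It is not the Invenio implementation: `dump_raw_input`, `process_deposit`, the `sip.json` file name and the shape of the `collections` field are all hypothetical names chosen for illustration. The only point is that the JSON file is written before any enrichment touches the data.

```python
import copy
import json
import os
import tempfile


def dump_raw_input(form_data, directory):
    """Write the untouched form data as JSON and return the file path.

    Hypothetical helper, not part of Invenio.
    """
    path = os.path.join(directory, "sip.json")  # hypothetical file name
    with open(path, "w") as handle:
        json.dump(form_data, handle, sort_keys=True, indent=2)
    return path


def process_deposit(form_data, directory):
    """Archive the user input first, then enrich a separate copy."""
    # 1. Dump exactly what the user sent, before anything else happens.
    sip_path = dump_raw_input(form_data, directory)
    # 2. Only now start internal processing, on a copy of the input.
    record = copy.deepcopy(form_data)
    record["collections"] = [{"primary": "DEFAULT"}]  # hypothetical enrichment
    return sip_path, record


form_data = {"title": "My thesis", "authors": ["Doe, J."]}
with tempfile.TemporaryDirectory() as workdir:
    sip_path, record = process_deposit(form_data, workdir)
    with open(sip_path) as handle:
        archived = json.load(handle)
```

Because the dump happens before step 2, the archived file can be replayed later against a new version or a different platform without carrying along any of our own modifications.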

  1. Why seal the SIP at the beginning and have to change all the calls to d.get_latest_sip(sealed=True) instead of d.get_latest_sip(sealed=False) ? That breaks all our tasks that were expecting seal==False

It is not a SIP anyway and we never had a proper one. It is a best effort to show that we already dumped the user input and that everything you do now to the SIP won't be reproduced by other software. We now have

  • unsealed = as long as we are collecting data from the forms, uploads or (later, maybe) harvesters. You are free to ask for additional user input or clarifications. You are not allowed to add interpretations or extracted data to the record or to normalize the input.
  • sealed = everything is collected. We start our own internal processing. Whatever you do now won't be transmitted to SIP-backup systems (or other libraries). That gives you freedom but also the restriction to never ask for additional user input. And by never we mean never. The deposit is completed and you are on your own at this stage.
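The two states can be sketched as a toy model. The classes below are hypothetical stand-ins and not the Invenio deposit classes. Only the call shape `get_latest_sip(sealed=...)` is taken from the question above, and its behaviour here (returning `None` when nothing matches) is an assumption.

```python
class SealedError(Exception):
    """Raised when user input is added after the SIP was sealed."""


class SIP:
    """Toy submission package: user input plus a sealed flag."""

    def __init__(self):
        self.user_input = {}
        self.sealed = False

    def add_user_input(self, key, value):
        # Unsealed: we may still collect data from forms or uploads.
        if self.sealed:
            raise SealedError("deposit is completed, no more user input")
        self.user_input[key] = value

    def seal(self):
        # Sealed: everything is collected, internal processing may start.
        self.sealed = True


class Deposit:
    """Toy deposit holding a list of SIPs, newest last."""

    def __init__(self):
        self.sips = []

    def new_sip(self):
        sip = SIP()
        self.sips.append(sip)
        return sip

    def get_latest_sip(self, sealed=None):
        # Return the newest SIP whose state matches, or None (assumed).
        for sip in reversed(self.sips):
            if sealed is None or sip.sealed == sealed:
                return sip
        return None


d = Deposit()
sip = d.new_sip()
sip.add_user_input("title", "My thesis")
before_seal = d.get_latest_sip(sealed=True)   # nothing sealed yet
sip.seal()
after_seal = d.get_latest_sip(sealed=True)    # now the same SIP
```

Under this convention, a task that runs after form collection would ask for `sealed=True`, and a task that still needs user input would ask for `sealed=False`.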

What might be important (and is written down in the commit message): We need to rework the SIP/DIP/... thing in the future to make it (finally, it never was) standard compliant. If you (or anyone else) want to volunteer, feel free to sit down with some library people, discuss the best solution, implement it, submit a PR, wait for the OK from the Invenio team and the librarians and then finally make Invenio a proper library system.

Last little note: as far as we know we do NOT need to use MARCXML to conform with the standard. We were even wrong when storing the transformed user input as a SIP (without any other packages), because the conversion from form data (=JSON) to MARC is a heavy modification of the data and might destroy reproducibility.

@jmartinm
Member

Thanks a lot for the thorough explanation, @crepererum. It makes sense. If we use that convention, I will adapt our tasks to make use of it.

@tiborsimko
Member

Some comments:

  • OAIS SIP should archive what we got on the input side, unenriched. Which in this case is basically "Deposit JSON". Hence removal of any MARCXML and other enrichment steps, hence moving the SIP archive creation higher up in the workflow.
  • OAIS practices (SIP, AIP, DIP, audit) will be globally revisited in the coming months, this is already planned.
  • BibUpload is deprecated and not to be used. It was doing more than just ACID record upload, and we'll address any OAI matching and other goodies by new inputting workflows on top of the new record API.
  • Invenio is a proper library system :) but with improper OAIS SIP generation due to those extra MARCXML enrichment steps. (LS <> OAIS)
