Publishing Workflow

A publisher (an organization or an individual) provides the following data points to obtain authentication credentials (an API key, etc.):

  1. A valid e-mail address (required) - for failure notifications
  2. A callback URL (optional) - for success and failure notifications
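For illustration only, a registration request might look like the sketch below. The `/publishers` endpoint, the request fields, and the response shape are invented for this example; the actual registration mechanism isn't specified on this page.

```js
// Hypothetical sketch (Node 18+): registering a publisher to obtain an API key.
// The endpoint path and response fields are assumptions, not a documented API.
async function registerPublisher() {
  const res = await fetch('https://api.example.org/publishers', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      email: 'publisher@example.org',                           // required
      callbackUrl: 'https://publisher.example.org/pmp-callback' // optional
    })
  });
  const { apiKey } = await res.json(); // assumed response shape
  return apiKey;
}
```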

Publishing a Schema

  1. client POSTs to the GUID endpoint to obtain one or more GUIDs
  2. client PUTs a JSON-Schema object S to the schema endpoint using a GUID obtained from the GUID endpoint (see the client sketch after this list)
  3. server N validates that the API key used has write permission
  4. server N caches S in the storage format (together with the owning org)
  5. server N sends a queue message to publish S
  6. server N returns a 202 Accepted along with a prospective link to the published schema ("a pointer to a status monitor")
  7. server M (possibly, but not necessarily, the same server) pulls a schema publish message off the queue
  8. server M pulls S off the cache
  9. server M pulls the JSON-Schema object T that validates JSON-Schemas (from cache or storage) and validates S against it
    • OPEN: if S refers to some other validating schema, should we import that and check against it?
  10. server M saves S to the store
    • if S's GUID exists already, ensure that the key-owning org owns it (since it might not have existed when we checked earlier); this implies that we're storing metadata with the schema, and I don't know if this is currently being handled
    • if S's GUID doesn't exist, use a query parameter to ensure it is only inserted if it still doesn't exist during the insert
  11. server M saves S to the cache
  12. server M pushes notifications (HTTP, e-mail, maybe others) of publish or error
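To make steps 1, 2, and 6 concrete, here is a minimal client-side sketch. The endpoint paths (`/guids`, `/schemas/{guid}`), the Bearer auth scheme, and the response shapes are assumptions for this example.

```js
// Hypothetical sketch (Node 18+): obtain a GUID, then PUT a JSON-Schema S.
// Paths, headers, and response shapes are assumed, not documented.
const API = 'https://api.example.org';

async function publishSchema(apiKey, schemaS) {
  const headers = {
    Authorization: `Bearer ${apiKey}`,
    'Content-Type': 'application/json'
  };

  // Step 1: POST to the GUID endpoint for one or more GUIDs.
  const guidRes = await fetch(`${API}/guids`, { method: 'POST', headers });
  const [guid] = (await guidRes.json()).guids; // assumed response shape

  // Step 2: PUT the JSON-Schema object S to the schema endpoint.
  const putRes = await fetch(`${API}/schemas/${guid}`, {
    method: 'PUT', headers, body: JSON.stringify(schemaS)
  });

  // Step 6: expect 202 Accepted plus a prospective link to the schema.
  if (putRes.status !== 202) throw new Error(`unexpected ${putRes.status}`);
  return { guid, statusUrl: putRes.headers.get('Location') }; // assumed header
}
```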

Notes

  • We're not validating that the publishing organization has the right to publish to this GUID until after we've sent a "202 Accepted" response. This is by design; permissions must be checked immediately before save in any case, so we avoid checking repeatedly.
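Because a publish can still fail after the 202 Accepted (the permission check happens at save time), a client that did not register a callback URL would poll the status monitor linked in the response. A minimal sketch, assuming the monitor returns a JSON body with a `state` field and Bearer auth:

```js
// Hypothetical sketch: polling the status-monitor link returned with the 202.
// The response shape ('pending' | 'published' | 'error') is an assumption.
async function waitForPublish(statusUrl, apiKey, { intervalMs = 1000, tries = 30 } = {}) {
  const headers = { Authorization: `Bearer ${apiKey}` };
  for (let i = 0; i < tries; i++) {
    const res = await fetch(statusUrl, { headers });
    const { state } = await res.json();
    if (state !== 'pending') return state;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error('publish still pending after polling timeout');
}
```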

Publishing a Document

  1. client POSTs to the GUID endpoint to obtain one or more GUIDs
  2. client PUTs a document D to the document endpoint using a GUID obtained from the GUID endpoint
    • batched documents are accepted at a different endpoint, similar to the RSS publishing endpoint, which is inherently batched due to the nature of RSS
  3. server N validates that the API key used has write permission
  4. server N caches D in the storage format
  5. server N sends a queue message to publish D
  6. server N returns a 202 Accepted along with a prospective link to the published document ("a pointer to a status monitor")
  7. server M (possibly, but not necessarily, the same server) pulls a document publish message off the queue
  8. server M pulls D off the cache
  9. if the document exists already, server M validates that the API key used is for an org with an editor-rel link in the existing document
    • we do this early (and twice) to avoid expensive validation operations when they would be pointless
  10. server M validates D (see the worker sketch after this list):
    1. find a JSON-Schema:
      • if the links section contains a link with rel=schema, pull the schema S from cache or store; fail if there is such a link but we can't find S
      • if no schema link exists, use the base document schema for S
    2. validate D against S
      • if S refers to a parent schema T, repeat this process to validate against all such schemas
  11. server M saves D to the store
    • if D's GUID exists already, ensure that the API key used is still able to write to this document (we need to introduce metadata about orgs in the editor-rel links in order to use a query parameter ensuring the write only succeeds if this holds)
    • if D's GUID doesn't exist, use a query parameter to ensure it is only inserted if it still doesn't exist during the insert
  12. server M saves D to the cache
  13. server M pushes notifications (HTTP, e-mail, maybe others) of publish or error
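A hedged server-side sketch of steps 8 and 10-12 follows. Everything here is illustrative: the `cache`/`store` interfaces, the queue-message shape, the `parent` convention for schema extension, and the conditional-save option are assumptions, and Ajv merely stands in for whichever JSON-Schema validator is actually used.

```js
// Hypothetical worker sketch for steps 8 and 10-12 of document publishing.
const Ajv = require('ajv');
const ajv = new Ajv();

async function publishDocument(msg, { cache, store }) {
  const doc = await cache.get(msg.guid); // step 8: pull D off the cache

  // Step 10.1: resolve the validating schema S.
  const schemaLink = (doc.links || []).find(link => link.rel === 'schema');
  let schema;
  if (schemaLink) {
    schema = (await cache.get(schemaLink.href)) ?? (await store.get(schemaLink.href));
    if (!schema) throw new Error(`schema not found: ${schemaLink.href}`);
  } else {
    schema = await store.get('base-document-schema'); // assumed identifier
  }

  // Step 10.2: validate D against S, then against each parent schema in turn.
  // 'parent' is an assumed convention for how a schema names the schema it extends.
  for (let s = schema; s; s = s.parent ? await store.get(s.parent) : null) {
    if (!ajv.validate(s, doc)) throw new Error(ajv.errorsText());
  }

  // Step 11: conditional save: insert only if the GUID is still absent, or
  // overwrite only if the key's org still has an editor-rel link on the document.
  // The option name is invented; the real mechanism is the query parameter above.
  await store.save(msg.guid, doc, { ifAbsentOrEditableBy: msg.org });

  await cache.set(msg.guid, doc); // step 12: save D to the cache
}
```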

Notes

  • As with schema publishing, we don't validate that the publishing organization has the right to publish to this GUID until after the "202 Accepted" response has been sent (see the note under Publishing a Schema).
  • Schemas and profiles are related: a schema is the machine-readable version of the constraints on a given document, and a profile is the human-readable version. Both should be present on the document to which they apply; a document should have a single profile and a single schema. A schema, in turn, can reference more than one other schema for extension.
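For illustration, a document's links section might then look like the sketch below; the rels come from this page, but all hrefs are invented.

```js
// Hypothetical document D with one profile link, one schema link, and an
// editor-rel link identifying the owning org.
const doc = {
  links: [
    { rel: 'profile', href: 'https://api.example.org/profiles/story' }, // human-readable
    { rel: 'schema',  href: 'https://api.example.org/schemas/1a2b3c' }, // machine-readable
    { rel: 'editor',  href: 'https://api.example.org/orgs/some-org' }
  ]
  // ...document content...
};
```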

Publishing a Blob

Note: by "blob" we mean binary data such as images, audio, and video.

  1. client POSTs a multipart/form-data payload G to the asset endpoint, using field names meaningful to the client (see the sketch after this list)
    • no field name may be used more than once
  2. server N validates the write permissions of the API key used for G
  3. for each file F in G:
    1. server N creates a GUID
    2. server N checks to ensure that there is no metadata for this GUID
    3. server N writes metadata to a store (a DynamoDB or MySQL table, for example), containing:
      • the API key (to check permissions for delete)
      • the GUID for F
      • the temporary filename of F, for investigation if something goes wrong (?)
    4. server N saves F to S3
  4. server N returns a JSON document mapping original field names to GUIDs
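A minimal client-side sketch of the upload, assuming an `/assets` path, Bearer auth, and Node 19.8+ (for `fs.openAsBlob` and the global `fetch`/`FormData`); all names are illustrative.

```js
// Hypothetical sketch: uploading two blobs in one multipart/form-data POST.
// The /assets path, auth scheme, and response shape are assumptions.
const { openAsBlob } = require('node:fs');

async function uploadBlobs(apiKey) {
  const form = new FormData();
  // Field names are chosen by the client and must be unique within G.
  form.append('hero-image', await openAsBlob('hero.jpg'), 'hero.jpg');
  form.append('teaser-audio', await openAsBlob('teaser.mp3'), 'teaser.mp3');

  const res = await fetch('https://api.example.org/assets', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form
  });
  // Expected result: a JSON document mapping field names to GUIDs, e.g.
  // { "hero-image": "c0ffee...", "teaser-audio": "1a2b3c..." }
  return res.json();
}
```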

Notes

  • This process knows nothing about asset documents, though an asset document's links may point to the blob endpoint. Since GUIDs are created on the fly, asset blobs must be uploaded before the links of an asset document are finalized (if that document uses any blobs hosted by PMP).
  • Currently, this API is CRD only; there is no update of a blob. Deletes are checked against the API key stored in the metadata store.
  • Ingesting stories: to ingest stories from the NPR API, run `NODE_CONFIG_DIR=../../../config node story.js` from `lib/users/ingest`. This will ingest and create story docs, their associated image docs, and, optionally, collection aggregations at random.