CDXJ

Sawood Alam edited this page Feb 15, 2017 · 4 revisions

Introduction

CDXJ is derived from a more generic format called ORS. It adds some syntactic restrictions to ORS and encodes some semantics in it to make it useful for Web archiving tools. However, the use cases of CDXJ are not limited to archiving tools.

Background

CDXJ (or CDX-JSON) was born as a fusion of the CDX and JSON file formats. It keeps the fast lookup and the arbitrary file split and merge qualities of the CDX format while bringing in the flexibility and expressiveness of JSON.

Lexical Grammar

[Figure: railroad diagram of the CDXJ grammar]

The railroad diagram above illustrates the grammar of the CDXJ format. CDXJ is a subset of ORS: it introduces a few extra syntactic restrictions that are not present in the ORS grammar. The definition of the key string is strict, allowing neither leading spaces before the key nor an empty string as the key. If the key string contains spaces, it is treated as a compound key in which every space-separated segment has an independent meaning. Apart from the @-prefixed special keys, every key must have the same number of space-separated fields, and empty fields use the placeholder "-". Only a single SPACE character may be used as the delimiter between the parts of a compound key, and a SPACE character must also separate the key from the JSON value block. Unlike ORS, CDXJ does not allow the TAB character as a delimiter. Since keys cannot be empty strings, every value must have a non-empty key associated with it. CDXJ also prohibits empty lines.

These restrictions encourage the use of CDXJ as sorted files, which facilitates binary search on disk. When sorting CDXJ files, byte-wise sorting is encouraged for greater interoperability.
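The restrictions described above can be checked mechanically. The following Python sketch is one possible reading of them, not a reference implementation. The function name `parse_cdxj_line` is hypothetical, and the sketch assumes that the JSON value block is an object or an array and that no key field begins with "{" or "[".

```python
import json
import re

# Assumption: the JSON block is an object or array, so it starts at the
# first SPACE that is immediately followed by "{" or "[".
_VALUE_START = re.compile(r" (?=[{\[])")


def parse_cdxj_line(line):
    """Split one CDXJ line into (key_fields, json_value).

    Raises ValueError when the line breaks one of the syntactic
    restrictions listed in this section.
    """
    if not line:
        raise ValueError("empty lines are not allowed")
    match = _VALUE_START.search(line)
    if match is None:
        raise ValueError("no SPACE-separated JSON block found")
    key, value = line[:match.start()], line[match.end():]
    if "\t" in key:
        raise ValueError("TAB is not allowed as a delimiter")
    fields = key.split(" ")
    # An empty segment means an empty key, a leading space, or two
    # consecutive spaces inside a compound key.
    if "" in fields:
        raise ValueError("key fields must be non-empty and single-spaced")
    # json.JSONDecodeError is a subclass of ValueError.
    return fields, json.loads(value)
```

A compound key such as `uk,co,bbc)/images 2013` comes back as two fields, while a line with a leading space, a doubled space, a TAB delimiter, or no key at all is rejected.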

Semantics

CDXJ introduces optional @-prefixed special keys to specify metadata, the @keys key to specify the field names of the data entries, and the @id and the @context keys to provision linked-data semantics inspired by JSON-LD.
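As a rough illustration of how a reader might use the special keys, the Python sketch below separates special-key lines from data entries and applies the declared field names to each compound key. Everything named here (`split_line`, `read_cdxj`, `SPECIAL_PREFIXES`) is hypothetical. The sketch assumes well-formed input, assumes the JSON block starts at the first SPACE followed by "{" or "[", and accepts both the "@" marker used in this text and the "!" marker used in the example on this page.

```python
import json

# Assumption: either marker may introduce a special key.
SPECIAL_PREFIXES = ("@", "!")


def split_line(line):
    """Split a well-formed line into (key_string, json_value)."""
    # Assumption: the JSON block starts at the first " {" or " [".
    starts = [i for i in (line.find(" {"), line.find(" [")) if i != -1]
    cut = min(starts)
    return line[:cut], json.loads(line[cut + 1:])


def read_cdxj(lines):
    """Return (special, records) for an iterable of CDXJ lines."""
    special, rows = {}, []
    for line in lines:
        key, value = split_line(line)
        if key.startswith(SPECIAL_PREFIXES):
            # A special key may repeat, so collect its values in a list.
            special.setdefault(key[1:], []).append(value)
        else:
            rows.append((key.split(" "), value))
    # Field names come from the "keys" special key when it is present.
    names = special.get("keys", [None])[0]
    records = []
    for fields, value in rows:
        if names is not None and len(fields) != len(names):
            raise ValueError("expected %d key fields: %r" % (len(names), fields))
        # "-" is the placeholder for an empty field.
        fields = [None if f == "-" else f for f in fields]
        records.append((dict(zip(names, fields)) if names else fields, value))
    return special, records
```

With a keys declaration of `["surt_uri", "year"]`, the entry `uk,ac,rpms)/ -` would map to `{"surt_uri": "uk,ac,rpms)/", "year": None}` in this sketch.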

Example

!context ["http://oduwsdl.github.io/contexts/arhiveprofiles"]
!id {"uri": "http://archive.org/"}
!keys ["surt_uri", "year"]
!meta {"name": "Internet Archive", "year": 1996}
!meta {"updated_at": "2015-09-03T13:27:52Z"}
com,cnn)/world - {"urim": {"min": 2, "max": 9, "total": 98}, "urir": 46}
uk,ac,rpms)/ - {"frequency": 241, "spread": 3}
uk,co,bbc)/images 2013 {"frequency": 725, "spread": 1}
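The example above is already in byte-wise order, which is what makes binary search possible. The Python sketch below demonstrates the idea on an in-memory list of lines rather than on a file on disk. The `lookup` helper is hypothetical, and it relies on the fact that Python compares `bytes` objects byte by byte.

```python
import bisect

# The lines of the example above, kept as bytes so comparisons are byte-wise.
LINES = [
    b'!context ["http://oduwsdl.github.io/contexts/arhiveprofiles"]',
    b'!id {"uri": "http://archive.org/"}',
    b'!keys ["surt_uri", "year"]',
    b'!meta {"name": "Internet Archive", "year": 1996}',
    b'!meta {"updated_at": "2015-09-03T13:27:52Z"}',
    b'com,cnn)/world - {"urim": {"min": 2, "max": 9, "total": 98}, "urir": 46}',
    b'uk,ac,rpms)/ - {"frequency": 241, "spread": 3}',
    b'uk,co,bbc)/images 2013 {"frequency": 725, "spread": 1}',
]


def lookup(sorted_lines, key_prefix):
    """Return every line that starts with key_prefix.

    sorted_lines must be sorted byte-wise. Binary search finds the first
    candidate, and matching lines are contiguous from there.
    """
    start = bisect.bisect_left(sorted_lines, key_prefix)
    hits = []
    for line in sorted_lines[start:]:
        if not line.startswith(key_prefix):
            break
        hits.append(line)
    return hits


# The example is already in byte-wise order.
assert LINES == sorted(LINES)
print(lookup(LINES, b"uk,co,bbc)/images "))
```

Because the "!" byte sorts before letters and digits, the special-key lines in this example stay at the top of the file after sorting. Including the trailing SPACE in the prefix keeps a lookup for one key from matching longer keys that merely share its beginning.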

Media Type and File Extension

We propose application/cdxj+ors as the media type and .cdxj as the file extension for the format.