Normalize dates in extracted metadata from binaries #71

Closed
ryangrimm opened this Issue · 3 comments

3 participants

@ryangrimm

Many binary formats include something along the lines of a creation date or a modification date. These dates can be under various names for various file formats. In order to support various queries and range indexes on this metadata, normalizing these dates into xs:dateTime values would be required.

To do so, the current plan is to attempt to parse the value of any piece of metadata that has "date" or "time" in its name. The parsing can be done with the date parser that's already in use, and new formats can easily be added to it if need be.
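
For readers following along, here is a rough Python sketch of the plan described above: pick out metadata fields whose names contain "date" or "time" and try to parse their values into an xs:dateTime-style string. It is not the project's code. The function names and the list of input formats are assumptions made up for illustration, since the formats the existing date parser supports are not listed in this thread.

```python
from datetime import datetime

# Hypothetical format list; the formats the real date parser accepts
# are not stated in this issue.
CANDIDATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",   # already close to xs:dateTime
    "%Y-%m-%d %H:%M:%S",   # space-separated date and time
    "D:%Y%m%d%H%M%S",      # PDF-style value with a "D:" prefix
]

def looks_like_date_field(name):
    """True if the metadata name contains "date" or "time" (case-insensitive)."""
    lowered = name.lower()
    return "date" in lowered or "time" in lowered

def to_xs_datetime(value):
    """Return the value as YYYY-MM-DDThh:mm:ss, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.strftime("%Y-%m-%dT%H:%M:%S")
    return None

def normalize_dates(metadata):
    """Map each date-like field name to its normalized value, skipping failures."""
    result = {}
    for name, value in metadata.items():
        if looks_like_date_field(name):
            normalized = to_xs_datetime(value)
            if normalized is not None:
                result[name] = normalized
    return result
```

Adding a new input format in this sketch is just another entry in `CANDIDATE_FORMATS`. Values that match none of the formats are left out rather than guessed at.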

@ScottConroy

Will want to normalize the element names as well. Content extracted from PDF ends up with corona:modDate while Word ends up with corona:lastSavedDate (which I believe are conceptually the same thing). I did a quick inventory of a half dozen other formats and that's the main one I saw.

@ryangrimm

Now normalizing last-modification metadata to a corona:modDate element. Also running any piece of metadata that has "date" in its name through the date parser. If a date can be extracted, it is stored in a normalized-date attribute on that element.
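
As an illustration only, the behavior described in this comment might look roughly like the Python sketch below. It renames corona:lastSavedDate to corona:modDate and adds a normalized-date attribute wherever a date-named element parses. The namespace URI is a placeholder (the real one is not given in this thread), and `parse_date`, `normalize_metadata`, and the two input formats are hypothetical stand-ins for the existing date parser.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Placeholder URI for illustration; NOT the project's real namespace.
CORONA_NS = "http://example.com/corona"
ET.register_namespace("corona", CORONA_NS)

# Element names treated as "last modified"; lastSavedDate comes from the thread.
MOD_DATE_ALIASES = {"lastSavedDate"}

def parse_date(text):
    """Hypothetical stand-in for the existing date parser."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "D:%Y%m%d%H%M%S"):
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%dT%H:%M:%S")
        except ValueError:
            pass  # try the next format
    return None

def normalize_metadata(root):
    """Rename modification-date aliases and annotate parseable date elements."""
    prefix = "{" + CORONA_NS + "}"
    for elem in root.iter():
        if not elem.tag.startswith(prefix):
            continue
        local = elem.tag[len(prefix):]
        if local in MOD_DATE_ALIASES:
            local = "modDate"
            elem.tag = prefix + local
        if "date" in local.lower() and elem.text:
            normalized = parse_date(elem.text)
            if normalized is not None:
                # The original text is kept; the normalized value goes in an attribute.
                elem.set("normalized-date", normalized)
    return root
```

Keeping the original text and putting the parsed value in an attribute means nothing extracted from the binary is lost if the parser misreads a value.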

@ryangrimm ryangrimm closed this