hsci-r/octavo

Octavo - Open Interface for Text and Metadata

Octavo is a JSON API-based server for querying and analyzing textual collections together with their metadata. It has been used to store and access, for example:

  • Eighteenth Century Collections Online
  • British Library Newspapers / Burney Collection Newspapers
  • News articles from YLE and Helsingin Sanomat
  • The Finnish Digitized Newspaper Collection

Octavo's primary functionality is robust querying and constraining, based on an expanded form of the Lucene query syntax. For example, it can return all books containing the word consciousness, but only in the preface section, only if the book is longer than 50 pages, and only if the book also contains the word philosophy or religion. It can likewise return all sentences where the lemma "Juha Sipilä" appears within 3 words of an adjective.
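To make the shape of such a query concrete, here is a minimal sketch that composes the book example in standard Lucene classic query syntax. It does not show Octavo's expanded syntax, which this README does not document. The field names `preface`, `content` and `pages` are hypothetical, the helper functions are illustrative only, and whether a range clause compares numerically depends on how the index and query parser are configured.

```python
def field_term(field: str, term: str) -> str:
    """Return a field-scoped Lucene term, quoted as a phrase if it contains whitespace.

    Escaping of special characters inside the term is not handled here.
    """
    if any(ch.isspace() for ch in term):
        term = f'"{term}"'
    return f"{field}:{term}"


def any_of(*clauses: str) -> str:
    """Combine clauses with OR, grouped in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses: str) -> str:
    """Combine clauses with AND."""
    return " AND ".join(clauses)


def at_least(field: str, minimum: int) -> str:
    """Return an open-ended inclusive range clause: field:[minimum TO *]."""
    return f"{field}:[{minimum} TO *]"


# Hypothetical field names; a real index will define its own.
query = all_of(
    field_term("preface", "consciousness"),
    any_of(field_term("content", "philosophy"), field_term("content", "religion")),
    at_least("pages", 51),  # "longer than 50 pages"
)
print(query)
```

Building the string from small helpers keeps each constraint separate, so a constraint can be added or dropped without hand-editing a long query.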

On top of this search functionality, Octavo provides performant collocation analysis based on the queries, for example to discover which words are associated with the query energy policy in articles written at YLE by someone other than the YLE news desk, grouped by year.
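For readers new to the term, the following is a toy sketch of what a windowed collocation count grouped by year computes. It is not Octavo's implementation, which works against its index. The function name `collocates_by_year`, the single-word query, the window size and the sample sentences are all illustrative assumptions.

```python
import re
from collections import Counter, defaultdict


def collocates_by_year(docs, query, window=1):
    """Count words occurring within `window` tokens of `query`, per year.

    `docs` is an iterable of (year, text) pairs. Tokenization is a naive
    lowercase word split, which is an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for year, text in docs:
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token != query:
                continue
            start = max(0, i - window)
            # Neighbours on both sides of the match, excluding the match itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts[year].update(neighbours)
    return dict(counts)


# Invented sample data, for illustration only.
docs = [
    (2015, "The new energy policy was debated"),
    (2015, "Energy prices rose"),
    (2016, "Energy policy reform stalled"),
]
result = collocates_by_year(docs, "energy", window=1)
print(result)
```

A real analysis would typically also weight the raw counts against overall word frequencies, which this sketch leaves out.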

All of the above queries can be run at various levels, e.g. by sentence, paragraph, section or whole work, depending on what answering the end user's question requires.

Levels of operation

Octavo is designed to support many different levels of operation as flexibly as possible, depending on the needs of the end user's workflow.

For example, a custom data science workflow can simply retrieve all complete documents (or e.g. their paragraphs) matching particular criteria, for complete control over local processing.

On the other hand, for e.g. collocation-type queries, it makes sense to use the efficient Octavo index. Several ready-made options are provided for customizing this workflow, and Groovy scripts can also be injected into the processing of such queries to open up further options.

Finally, there are also ready-made user interfaces built on top of Octavo for the most common use cases.