
JH IIIF Search Service

jabrah edited this page May 25, 2017 · 11 revisions

Rough draft of a new IIIF Search specification

Introduction

JHU needs to be able to search more flexibly than allowed for by http://iiif.io/api/search/0.9/. In particular we need to be able to return canvases within a given manifest or collection which have certain characteristics. Additional requirements are nested Boolean queries and faceted searching.

Integration with IIIF Presentation API

Just as in the IIIF Search API, Presentation API resources indicate that the JH IIIF Search service is available by including a service description:

    {
      "service": {
        "@context": "http://manuscriptlib.org/jhiiif/search/context.json",
        "@id": "http://example.org/manifest/name/jhsearch",
        "profile": "http://manuscriptlib.org/jhiiif/search/profile"
      }
    }
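A client could detect the service roughly as follows. This is a minimal sketch, not part of the draft: the function name `find_jhsearch_service` is hypothetical, and it assumes that `service` may hold either a single object or a list of objects.

```python
JH_SEARCH_PROFILE = "http://manuscriptlib.org/jhiiif/search/profile"


def find_jhsearch_service(resource):
    """Return the @id of the JH IIIF Search service on a resource, or None."""
    services = resource.get("service", [])
    # Assumption: a resource may carry one service object or a list of them.
    if isinstance(services, dict):
        services = [services]
    for service in services:
        if isinstance(service, dict) and service.get("profile") == JH_SEARCH_PROFILE:
            return service.get("@id")
    return None


manifest = {
    "@id": "http://example.org/manifest/name",
    "service": {
        "@context": "http://manuscriptlib.org/jhiiif/search/context.json",
        "@id": "http://example.org/manifest/name/jhsearch",
        "profile": "http://manuscriptlib.org/jhiiif/search/profile",
    },
}
print(find_jhsearch_service(manifest))
```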

Search model

The JH IIIF Search Service allows a set of IIIF objects to be efficiently searched. Each object is indexed as a set of fields. Each field has a name and a value. The standard fields must be present as shown below. Depending on the content, implementations will add other fields to suit their needs.

A query returns objects that match unions and intersections of these fields, along with general information about the result. Each match includes context and information about the matched object. Search matches are ordered and paged: a single query returns a sublist of the total list of results. The search service may support several different options for ordering matches.

Faceted search

Objects can be organized into categories, searched by those categories, and browsed by those categories. Objects are assigned one or more values in one or more categories. Searches can be restricted to objects in a category or assigned certain values in a category. In addition, any search can return what categories and values in those categories the matching objects have, together with a count of each value. The category counts are for the entire list of results, not for the particular sublist requested.

In a search, category constraints are represented by a list of terms. The list of categories has two effects on the search: restricting matched objects and returning category counts.

The name of the term is the name of a category and the value of the term is the desired value in the category. If several terms specify the same category, objects that have any of the specified values in that category will be matched. If terms specify different categories, then matched objects must satisfy the terms for all of those categories. If the specified value of a category is empty, then objects that have any value assigned in that category will match. If the categories list is empty, there will be no category restriction on matching, but category counts will still be returned.
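The category rules above can be sketched in a few lines. This is only an illustration under assumptions: objects are modelled as a dict from category name to a set of assigned values, terms as `(category, value)` pairs, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def matches_categories(object_categories, terms):
    """Apply category terms: OR within one category, AND across categories."""
    wanted = defaultdict(set)
    for category, value in terms:
        wanted[category].add(value)
    for category, values in wanted.items():
        assigned = object_categories.get(category, set())
        if not assigned:
            return False  # object has no value at all in this category
        if "" in values:
            continue  # empty value: any assigned value matches
        if not (assigned & values):
            return False
    return True  # an empty term list restricts nothing


def count_categories(objects):
    """Count each value per category over the whole result list."""
    counts = defaultdict(Counter)
    for obj in objects:
        for category, values in obj.items():
            counts[category].update(values)
    return counts


cow = {"mammal": {"cow"}, "color": {"brown"}}
print(matches_categories(cow, [("mammal", "cow"), ("mammal", "pig")]))
print(matches_categories(cow, [("mammal", "cow"), ("color", "black")]))
```

The first call matches because both terms name the same category; the second does not because the object fails the term for the second category.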

Standard fields

| Field | Type | Cardinality | Description |
|---|---|---|---|
| object_id | URI | 1 | URI of matched presentation API object |
| object_type | URI | 1 | Type of matched object |
| collection_id | URI | 0-* | URI of collection containing object |
| manifest_id | URI | 0-1 | URI of manifest containing object |

AoR fields for canvas objects

| Field | Type | Cardinality | Description |
|---|---|---|---|
| marginalia | Text | 0-* | Transcription of marginalia, translation of transcription, referenced text, book titles, people, locations |
| symbol | Id | 0-* | symbol id, referenced text |
| drawing | Id | 0-* | drawing id, referenced text |
| numeral | Text | 0-* | numeral, referenced text |
| mark | Id and Text | 0-* | mark id, referenced text |
| errata | Text | 0-* | correction, referenced text |
| underline | Text | 0-* | underlined text |
| emphasis | Text | 0-* | underlined text in marginalia |
| cross_reference | Text | 0-* | Book titles, people |

Syntax

Grammar:

  • Query -> Term or ( Query (Operation Query)+ ), where every Operation inside one pair of parentheses must be the same
  • Operation -> & or |
  • Term -> Field:Value
  • Field -> [\w_-]+ (letters, _, -)
  • Value -> '.*' (backslash is the escape character)

Explanation of the grammar:

  • Query - a query can be simple or complex. A simple query is a term. A complex query is a boolean operation on at least two sub-queries and must be surrounded by parentheses.
    • Term
    • (QUERY OPERATION QUERY)
  • Operation - Mechanism to combine multiple queries.
    • & - Intersection operation - Match objects which match all sub-queries.
    • | - Union operation - Match objects which match any sub-query.
  • Term - Matches objects which contain a field which matches the value. The exact mechanics of that matching depend on the implementation.
    • FIELD:VALUE
  • Field - a search field recognized by the search service, used to narrow search results. Valid characters include any letter, lower case or upper case, underscore, and hyphen
    • [\w_-]+
  • Value - a word or phrase to search for. All characters are valid. The Value should be surrounded by single quotes. Any other single quotes (or apostrophes) inside the Value should be escaped with a backslash.
    • '.*'
    • TODO: reserve wildcard character
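
The grammar is small enough to check with a short recursive-descent parser. The following is a sketch rather than the JHU implementation: the function names and the tuple-based tree are hypothetical, and it assumes that whitespace between tokens is ignored.

```python
import re

# One token: a parenthesis, an operation, or a complete Field:'Value' term.
TOKEN = re.compile(r"""
    \s*(?:
        (?P<paren>[()])
      | (?P<op>[&|])
      | (?P<term>(?P<field>[\w_-]+):'(?P<value>(?:\\.|[^'\\])*)')
    )""", re.VERBOSE)


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse query at position {pos}")
        if m.group("paren"):
            tokens.append((m.group("paren"), None))
        elif m.group("op"):
            tokens.append(("op", m.group("op")))
        else:
            # Drop the escaping backslashes from the value.
            value = re.sub(r"\\(.)", r"\1", m.group("value"))
            tokens.append(("term", (m.group("field"), value)))
        pos = m.end()
    return tokens


def parse(text):
    """Return (field, value) for a term or (operation, [children]) for a group."""
    tokens = tokenize(text)
    tree, end = _parse_query(tokens, 0)
    if end != len(tokens):
        raise ValueError("unexpected trailing input")
    return tree


def _parse_query(tokens, i):
    if i >= len(tokens):
        raise ValueError("unexpected end of query")
    kind, payload = tokens[i]
    if kind == "term":
        return payload, i + 1
    if kind != "(":
        raise ValueError("expected a term or '('")
    first, i = _parse_query(tokens, i + 1)
    children, operation = [first], None
    while i < len(tokens) and tokens[i][0] == "op":
        if operation is None:
            operation = tokens[i][1]
        elif tokens[i][1] != operation:
            raise ValueError("operations in one group must be the same")
        child, i = _parse_query(tokens, i + 1)
        children.append(child)
    if operation is None or i >= len(tokens) or tokens[i][0] != ")":
        raise ValueError("malformed group")
    return (operation, children), i + 1


print(parse("(marginalia:'earth' & symbol:'moon')"))
```

A service built along these lines would map any `ValueError` to the 400 response described for the `q` parameter.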

Examples
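
A few example queries can be produced with small helper functions. The helpers `term`, `all_of` and `any_of` are hypothetical and not part of the draft; the field names come from the AoR table, while the search values are made up for illustration.

```python
def term(field, value):
    """Build Field:'Value', escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"{field}:'{escaped}'"


def _group(operation, queries):
    if len(queries) < 2:
        raise ValueError("a complex query needs at least two sub-queries")
    return "(" + f" {operation} ".join(queries) + ")"


def all_of(*queries):
    """Intersection: match objects which match all sub-queries."""
    return _group("&", queries)


def any_of(*queries):
    """Union: match objects which match any sub-query."""
    return _group("|", queries)


print(term("marginalia", "earth"))
print(all_of(term("marginalia", "earth"), term("symbol", "moon")))
print(any_of(term("marginalia", "Caesar's wars"),
             all_of(term("symbol", "moon"), term("underline", "bellum"))))
```

The three calls print a simple query, an intersection, and a union that nests an intersection:

    marginalia:'earth'
    (marginalia:'earth' & symbol:'moon')
    (marginalia:'Caesar\'s wars' | (symbol:'moon' & underline:'bellum'))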

Lucene implementation notes (JHU implementation)

Text fields support Lucene simple query syntax: https://lucene.apache.org/core/5_2_1/queryparser/org/apache/lucene/queryparser/simple/SimpleQueryParser.html

Note the special behavior of some characters. These may have to be escaped in the Lucene syntax, depending on the behavior the client wants.
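
A client that wants user input treated literally could escape the operator characters before building a term. The sketch below assumes the operator set documented for SimpleQueryParser (`+`, `|`, `-`, `"`, `*`, `(`, `)`, `~`) with backslash as the escape character. The function name is hypothetical, and how this escaping interacts with the quoting of the Value itself depends on the implementation.

```python
import re


def escape_lucene_simple(text):
    """Prefix each SimpleQueryParser operator character with a backslash."""
    return re.sub(r'([+|\-"*()~\\])', r"\\\1", text)


# Without escaping, the trailing * would ask for a prefix match.
print(escape_lucene_simple("earth-moon*"))
```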

Making a query

PRESENTATION_OBJECT_URI/jhsearch?q=(marginalia:'earth' & symbol:'moon')&o=40&m=20

Example: http://example.org/iiif/FarmAnimals/manifest/jhsearch?q=subject:%27cow%27&o=40&m=20

Parameters:

  • q : (REQUIRED) query in the syntax above. If it cannot be parsed, the service returns a 400 error.
  • c : (OPTIONAL) list of query terms specifying categories. If it cannot be parsed, the service returns a 400 error.
  • so : (OPTIONAL) order in which results are sorted, either relevance or index.
  • o : (OPTIONAL) offset of the first match which the search service should return, counted within the list of total results.
  • m : (OPTIONAL) number of matches which the search service will return per results page.
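
A request URL can be assembled from these parameters with the Python standard library. This is a sketch: the function name and keyword names are hypothetical, and leaving the colon unencoded is a choice made only to reproduce the example URL above.

```python
from urllib.parse import quote, urlencode


def build_search_url(object_uri, query, categories=None, sort_order=None,
                     offset=None, max_matches=None):
    """Build PRESENTATION_OBJECT_URI/jhsearch?... from the parameters above."""
    params = {"q": query}
    if categories is not None:
        params["c"] = categories
    if sort_order is not None:
        params["so"] = sort_order
    if offset is not None:
        params["o"] = offset
    if max_matches is not None:
        params["m"] = max_matches
    # quote() percent-encodes quotes, spaces, & and parentheses in the values.
    return object_uri.rstrip("/") + "/jhsearch?" + urlencode(params, safe=":", quote_via=quote)


print(build_search_url("http://example.org/iiif/FarmAnimals/manifest",
                       "subject:'cow'", offset=40, max_matches=20))
```

The call prints `http://example.org/iiif/FarmAnimals/manifest/jhsearch?q=subject:%27cow%27&o=40&m=20`.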

Note about the category parameter c: category terms behave differently from search queries. This parameter is not a full search query, but a list of terms. A category term, as defined elsewhere in this documentation, MUST contain a category id and a value surrounded by single quotes, separated by a colon: category_id:'value'. The value can be empty. Multiple terms may be separated from each other by a space or by no character at all. No other characters can appear between terms.

    c=                                        // Good
    c=category_id                             // Bad
    c=category_id:''                          // Good
    c=category_id:'val1'                      // Good
    c=category_id:'val1'category_id:'val2'    // Good
    c=category_id:'val1' category_id:'val2'   // Good
    c=category_id:'val1',category_id:'val2'   // Bad

The category parameter can also be empty. This will force the search service to return all values and counts for all categories.
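
The rules for `c` can be checked with a regular expression. This sketch assumes that category ids use the same characters as search fields and that values follow the same backslash escaping as query values. Neither is stated in the draft, and the names are hypothetical.

```python
import re

# Assumption: a category term looks like a query term, category_id:'value'.
TERM = r"[\w_-]+:'(?:\\.|[^'\\])*'"
# Zero or more terms, each separated by at most one space.
CATEGORY_LIST = re.compile(rf"(?:{TERM}(?: ?{TERM})*)?")
TERM_PARTS = re.compile(r"([\w_-]+):'((?:\\.|[^'\\])*)'")


def is_valid_category_param(c):
    return CATEGORY_LIST.fullmatch(c) is not None


def parse_category_param(c):
    """Return the (category_id, value) pairs of a valid c parameter."""
    if not is_valid_category_param(c):
        raise ValueError("malformed category parameter")
    return [(m.group(1), re.sub(r"\\(.)", r"\1", m.group(2)))
            for m in TERM_PARTS.finditer(c)]


print(is_valid_category_param("category_id:'val1' category_id:'val2'"))
print(is_valid_category_param("category_id:'val1',category_id:'val2'"))
```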

Search Result

TODO: Should this return JSON-LD?

    {
      "@context": "http://manuscriptlib.org/jhiiif/search/context.json",
      "@id": "http://example.org/service/manifest/jhsearch?q=cow",
      "@type": "jh:SearchResult",
      "total": 300,
      "offset": 40,
      "max_matches": 20,
      "query": "text:cow",
      "matches": [
        {
          "context": "<B>Cows</B> are noble and beautiful creatures.",
          "object": {
            "@id": "http://example.org/service/canvas/3",
            "@type": "sc:Canvas",
            "label": "Page 3"
          },
          "manifest": {
            "@id": "http://example.org/service/manifest",
            "label": "Big Book of Cows"
          }
        }
      ],
      "categories": [
        {
          "name": "mammal",
          "values": [
            { "label": "cow", "count": 10 }
          ]
        }
      ]
    }

  • total : total number of results in the search. The number of matches returned may be smaller than this when the matches are paged.
  • offset : starting index of the current search page within the total result list.
  • max_matches : results returned per page.
  • query : the query (as a string) that was searched to generate these results
  • matches : array of search matches
    • context : contains HTML context that may include highlighting of query
    • object : a reference to the object on which this match was found
    • manifest : (OPTIONAL) a reference to the manifest that contains the matched object
  • categories: array of categories for entire list of search results
    • name: category ID, NOT the human readable label
    • values: array of values in that category with counts

A IIIF Reference (the value of object and manifest above) contains enough information about a IIIF Presentation object to display it in a useful way. It contains the following data:

  • @id : URI uniquely identifying the IIIF Presentation object that can be used to retrieve the full object
  • @type : object type (sc:Canvas|sc:Manifest|sc:Collection|sc:Annotation)
  • label : human readable label to display to the user (simple string or HTML?)
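
A client might consume a result along these lines. The helper names are hypothetical and the result is abbreviated from the example above. The paging logic assumes that `offset` and `max_matches` are numbers, while `total` is converted explicitly because it may be serialized as a string.

```python
import json


def next_offset(result):
    """Return the offset of the following page, or None on the last page."""
    total = int(result["total"])  # tolerate a number or a numeric string
    following = result["offset"] + result["max_matches"]
    return following if following < total else None


def facet_counts(result):
    """Flatten the categories array into {category name: {value label: count}}."""
    return {
        category["name"]: {v["label"]: v["count"] for v in category["values"]}
        for category in result.get("categories", [])
    }


result = json.loads("""
{
  "total": 300, "offset": 40, "max_matches": 20,
  "matches": [
    {"context": "<B>Cows</B> are noble and beautiful creatures.",
     "object": {"@id": "http://example.org/service/canvas/3",
                "@type": "sc:Canvas", "label": "Page 3"}}
  ],
  "categories": [{"name": "mammal", "values": [{"label": "cow", "count": 10}]}]
}
""")

for match in result["matches"]:
    reference = match["object"]  # a IIIF Reference: @id, @type, label
    print(reference["label"], reference["@id"])
print(next_offset(result))
print(facet_counts(result))
```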

Search Service info request

A search info request returns information needed by an application to use the service.

PRESENTATION_OBJECT_URI/jhsearch/info.json

Example: http://example.org/iiif/FarmAnimals/manifest/jhsearch/info.json

TODO: Should this return JSON-LD?

    {
      "fields": [
        {
          "name": "name",
          "label": "Title",
          "description": "Prefix or suffix of a name",
          "values": [
            {
              "value": "field_val",
              "label": "Human readable label"
            }
          ]
        }
      ],
      "categories": [
        {
          "name": "name",
          "label": "Title"
        }
      ],
      "default-fields": ["title", "description"]
    }

Example: AOR example search info request

  • fields - array of search fields.
    • name - field ID, this is used as the search field when generating a search query.
    • label - human readable label for display in a search widget
    • description - human readable text describing this field. Example use is help text in a tool tip.
    • values - (OPTIONAL) enumeration (array) of possible values for this field
      • value - search value to use when generating a search query. Search query: field:'enum.value'
      • label - human readable label
  • categories - array of search categories
    • name - category ID, this is used to generate a category list for searching
    • label - human readable label
  • default-fields - array of search field IDs. A "basic search" will search across all of these fields. For example, assume value: 'default-fields': ['one', 'two', 'three'], and the user does a basic search for the string: maxime omnium. The query that will be sent to the search service will look like: (one:'maxime omnium' | two:'maxime omnium' | three:'maxime omnium')
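
The expansion of a basic search over `default-fields` can be sketched as follows. The function name is hypothetical, and escaping the user's text with backslashes follows the Value rule in the Syntax section.

```python
def basic_search_query(default_fields, text):
    """Expand a basic search into a union over the default fields."""
    if not default_fields:
        raise ValueError("info.json lists no default fields")
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    terms = [f"{field}:'{escaped}'" for field in default_fields]
    if len(terms) == 1:
        return terms[0]  # a single term needs no parentheses
    return "(" + " | ".join(terms) + ")"


print(basic_search_query(["one", "two", "three"], "maxime omnium"))
```

With the example values above this prints `(one:'maxime omnium' | two:'maxime omnium' | three:'maxime omnium')`.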

// TODO Abstract out sort order and report it in info.json

// TODO Instead of the sort of ugly default-fields consider expanding the search syntax to support a default field?
