# Indexing custom documents

Solr’s basic unit of information is a `document`: a set of data that describes something. A document about a book could contain the title, author, year of publication, number of pages, and so on. Documents are composed of `fields`, which are more specific pieces of information. Fields can hold different kinds of data: a title field is text, while a publication year could be a date or an integer. **If fields are defined correctly, Solr will be able to interpret field values correctly**. `Field analysis` tells Solr what to do with incoming data when building an index.
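
As a rough illustration of the idea above, a book document can be pictured as a Python dictionary whose keys are field names. The field names and values below are made up for this sketch and are not part of any schema yet.

```python
# A hypothetical book document: each key is a field, each value has its own kind of data
book = {
    "title": "An Example Book",             # text
    "author": ["A. Writer", "B. Editor"],   # a field may hold several values
    "publication_year": 2020,               # integer (could also be a date)
    "pages": 312,                           # integer
}

# Map each field name to the Python type of its value
field_kinds = {name: type(value).__name__ for name, value in book.items()}
print(field_kinds)
```

Solr needs the same kind of information, which is why each field in the schema is given a type.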

> In this lab, we will borrow the field type handling concepts from the `03-schema-api.ipynb` notebook. If you are not familiar with that notebook, I suggest you take time to understand what it covers before proceeding with this one.



In [1]:
from simplejson import loads
from requests import request

# define Solr instance resources
base_url = 'http://localhost:8983'
core_name = 'localDocs'
# define important paths
api_endpoint = f'{base_url}/api/cores/{core_name}' # note that we are using API V2
schema_endpoint = f'{api_endpoint}/schema'
# set http header content
headers = {
    'Content-type':'application/json'
}

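def handle_request(method="POST", body=None, endpoint=schema_endpoint, headers=headers):
    # body=None avoids a shared mutable default and sends no payload on GET requests
    r = request(method, endpoint, headers=headers, json=body)
    # parse the JSON response text into a Python dict
    return loads(r.text)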

In [10]:
# view fields defined in our schema

handle_request('GET', endpoint=f"{schema_endpoint}/fields")

{'responseHeader': {'status': 0, 'QTime': 0},
 'fields': [{'name': '_nest_path_', 'type': '_nest_path_'},
  {'name': '_root_',
   'type': 'string',
   'docValues': False,
   'indexed': True,
   'stored': False},
  {'name': '_text_',
   'type': 'text_general',
   'multiValued': True,
   'indexed': True,
   'stored': False},
  {'name': '_version_', 'type': 'plong', 'indexed': False, 'stored': False},
  {'name': 'id',
   'type': 'string',
   'multiValued': False,
   'indexed': True,
   'required': True,
   'stored': True}]}
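
The response is a plain dictionary, so it is easy to check whether a given field already exists before trying to add it. The sketch below assumes the response shape shown in the output above; `schema_response` is a trimmed copy of that output, and `missing_fields` is an illustrative helper name, not part of Solr or of this notebook.

```python
# Trimmed copy of the response shown above (assumed shape: a 'fields' list of dicts)
schema_response = {
    "responseHeader": {"status": 0, "QTime": 0},
    "fields": [
        {"name": "_nest_path_", "type": "_nest_path_"},
        {"name": "_root_", "type": "string"},
        {"name": "_text_", "type": "text_general"},
        {"name": "_version_", "type": "plong"},
        {"name": "id", "type": "string"},
    ],
}

def missing_fields(response, wanted):
    """Return the names in `wanted` that the schema does not define yet."""
    existing = {field["name"] for field in response.get("fields", [])}
    return [name for name in wanted if name not in existing]

print(missing_fields(schema_response, ["id", "title", "author"]))
# ['title', 'author']
```

In the notebook itself, the live result of `handle_request('GET', endpoint=f"{schema_endpoint}/fields")` could be passed in place of `schema_response`.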

At a minimum, a document search engine should have the following fields:  
- Title field (document title)
- Author(s) field
- Publisher field
- Publication date field
- Language field (if targeting multi-lingual audience)
- ISBN field 
- Price field (free books could be tagged to 0 price)
- Document access mode field (ebook or hardcopy?)
- Store(s) field (if hardcopy, where can it be found)
- Authorized dealer(s) field (who is allowed to distribute the document)
- InStock field (is the document available)
- Preface field (brief info about the document)
- Any other relevant information (dynamic fields)

Let's create these fields. First, check that the `pfloat` field type we will use for prices is defined in the schema.

In [8]:
handle_request('GET', endpoint=f"{schema_endpoint}/fieldtypes/pfloat")

{'responseHeader': {'status': 0, 'QTime': 0},
 'fieldType': {'name': 'pfloat',
  'class': 'solr.FloatPointField',
  'docValues': True}}

In [15]:
fields = {
    "add-field":[
        {
            'name':'title',
            'type':'text_en',
            'required':True # ensure all documents provided for indexing have a title
        },
        {
            'name':'author',
            'type':'string',
            'multiValued':True, # a document may have more than one author
            'required':True
        },
        {
            'name':'publisher',
            'type':'string',
            'required':True
        },
        {
            'name':'publication_date',
            'type':'pdate'
        },
        {
            'name':'language',
            'type':'string',
            'default':'english' # fall back to english if language not specified
        },
        {
            'name':'isbn',
            'type':'string'
        },
        {
            'name':'price',
            'type':'pfloat'
        },
        {
            'name':'access_mode',
            'type':'string' # 'ebook' or 'hardcopy'; Solr's binary type holds raw bytes, not a two-valued flag
        },
        {
            'name':'store',
            'type':'string',
            'multiValued': True # a document can be stocked in several stores
        },
        {
            'name':'dealer',
            'type':'string',
            'multiValued':True
        },
        {
            'name':'inStock',
            'type':'boolean'
        },
        {
            'name':'preface',
            'type':'text_en'
        },
        {
            'name':'_text', # catch-all field that collects the values of all other fields
            'type':'text_en',
            'multiValued': True # it receives values from many source fields, so it must hold several values
        }
    ],
    'add-copy-field':{ # add a catch-all copy rule
        'source':"*",
        "dest":"_text" # copy every field's value into _text; source fields keep their names
    }
}


# trigger fields creation
handle_request(body=fields)

{'responseHeader': {'status': 0, 'QTime': 225}}
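
With the fields in place, it can help to check a document against the `required` flags before sending it for indexing. The following is a minimal sketch under stated assumptions: `REQUIRED_FIELDS` repeats the fields marked `'required': True` in the schema output and in the cell above, while `missing_required` and the sample document are hypothetical names and data introduced only for illustration. It does not contact Solr, and treating empty values as missing is this helper's own choice.

```python
# Fields flagged 'required': True: `id` in the existing schema, the rest in the cell above
REQUIRED_FIELDS = ("id", "title", "author", "publisher")

def missing_required(doc, required=REQUIRED_FIELDS):
    """Return required field names that are absent or empty in `doc`."""
    return [name for name in required if doc.get(name) in (None, "", [])]

# Hypothetical sample document, made up for this sketch
sample = {
    "id": "book-001",
    "title": "An Example Book",
    "author": ["A. Writer"],   # multiValued field, so a list
    "price": 0.0,              # free books are tagged with a price of 0
    "inStock": True,
}

print(missing_required(sample))
# ['publisher']
```

Solr performs the authoritative check at index time; a local check like this only surfaces obvious gaps earlier.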