# Indexing custom documents

Solr’s basic unit of information is a `document`, which is a set of data that describes something. A document about a book could contain the title, author, year of publication, number of pages, and so on. Documents are composed of `fields`, which are more specific pieces of information. Fields can contain different kinds of data: a title field, for example, is text, while a publication year could be a date or an integer. **If fields are defined correctly, Solr will be able to interpret field values correctly**. `Field analysis` tells Solr what to do with incoming data when building an index.
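As a rough illustration, a single book document can be pictured as a mapping from field names to values. The field names below are assumptions chosen for illustration (the values come from the sample catalog used later in this notebook), not a required layout.

```python
import json

# A hypothetical book document: each key is a field, each value is that field's data.
book = {
    "id": "3",
    "title": "African Folktales",
    "author": ["Roger D. Abrahams"],             # a list, since a book may have several authors
    "publication_date": "1983-08-01T00:00:00Z",  # dates as ISO 8601 UTC timestamps
    "price": 18.95,
}

print(json.dumps(book, indent=2))
```

Each value has a natural type (text, list of strings, date, number), which is what the field definitions later in this notebook describe to Solr.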

> In this lab, we will borrow field type handling concepts from the `03-schema-api.ipynb` notebook. If you are not familiar with that notebook, I suggest you take time to understand its contents before proceeding with this one.



In [1]:
from simplejson import loads
from requests import request

# define Solr instance resources
base_url = 'http://localhost:8983'
core_name = 'localDocs'
# define important paths
api_endpoint = f'{base_url}/api/cores/{core_name}' # note that we are using API V2
schema_endpoint = f'{api_endpoint}/schema'
# set http header content
headers = {
    'Content-type':'application/json'
}

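def handle_request(method="POST", body=None, endpoint=schema_endpoint, headers=headers):
    # send the request (the body, if given, is serialised as JSON) and return the decoded JSON response
    # body defaults to None so that no mutable default is shared and GET requests carry no payload
    r = request(method, endpoint, headers=headers, json=body)
    return loads(r.text)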

In [10]:
# view fields defined in our schema

handle_request('GET', endpoint=f"{schema_endpoint}/fields")

{'responseHeader': {'status': 0, 'QTime': 0},
 'fields': [{'name': '_nest_path_', 'type': '_nest_path_'},
  {'name': '_root_',
   'type': 'string',
   'docValues': False,
   'indexed': True,
   'stored': False},
  {'name': '_text_',
   'type': 'text_general',
   'multiValued': True,
   'indexed': True,
   'stored': False},
  {'name': '_version_', 'type': 'plong', 'indexed': False, 'stored': False},
  {'name': 'id',
   'type': 'string',
   'multiValued': False,
   'indexed': True,
   'required': True,
   'stored': True}]}

## Defining Document Fields  

At a minimum, a document search engine should have the following fields:  
- Title field (document title)
- Author(s) field
- Publisher field
- Publication date field
- Language field (if targeting multi-lingual audience)
- ISBN field 
- Pages field (number of pages making up the document)
- Price field (free books could be tagged to 0 price)
- Document access mode field (ebook or hardcopy?)
- Store(s) field (if hardcopy, where can it be found)
- Authorized dealer(s) field (who is allowed to distribute the document)
- InStock field (is the document available?)
- Preface field (brief info about the document)
- Any other relevant information (dynamic fields)

Let's create these fields.

In [8]:
handle_request('GET', endpoint=f"{schema_endpoint}/fieldtypes/pfloat")

{'responseHeader': {'status': 0, 'QTime': 0},
 'fieldType': {'name': 'pfloat',
  'class': 'solr.FloatPointField',
  'docValues': True}}

In [15]:
fields = {
    "add-field":[
        {
            'name':'title',
            'type':'text_en',
            'required':True # ensure all documents provided for indexing have a title
        },
        {
            'name':'author',
            'type':'string',
            'multiValued':True, # a document may have more than one author
            'required':True
        },
        {
            'name':'publisher',
            'type':'string',
            'required':True
        },
        {
            'name':'publication_date',
            'type':'pdate'
        },
        {
            'name':'language',
            'type':'string',
            'default':'english' # fall back to english if language not specified
        },
        {
            'name':'isbn',
            'type':'string'
        },
        {
            'name':'pages',
            'type':'pint'
        },
        {
            'name':'price',
            'type':'pfloat'
        },
        {
            'name':'access_mode',
            'type':'string' # 'ebook' or 'hardcopy'; the binary type is meant for base64-encoded binary data
        },
        {
            'name':'store',
            'type':'string',
            'multiValued': True # a document can be stocked in several stores
        },
        {
            'name':'dealer',
            'type':'string',
            'multiValued':True
        },
        {
            'name':'inStock',
            'type':'boolean'
        },
        {
            'name':'preface',
            'type':'text_en'
        },
        {
            'name':'_text', # catch-all field that collects any other detail of the document
            'type':'text_en',
            'multiValued': True # must accept several values, since many fields are copied into it
        }
    ],
    'add-copy-field':{ # copy the value of every field into the catch-all field
        'source':"*",
        "dest":"_text" # destination field; source field names are not changed
    }
}


# trigger fields creation
handle_request(body=fields)

{'responseHeader': {'status': 0, 'QTime': 225}}
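A status of 0 only says the request was accepted. One way to double-check the result is to list the schema fields again and compare the names against what we expected to create. The sketch below works on the response dictionary only; `missing_fields` is a hypothetical helper, and the sample response is a shortened stand-in for what the `/fields` endpoint returns.

```python
def missing_fields(fields_response, expected_names):
    """Return the expected field names that are absent from a schema fields response."""
    defined = {field["name"] for field in fields_response.get("fields", [])}
    return sorted(set(expected_names) - defined)


# shortened, hand-written stand-in for a real response
sample_response = {
    "responseHeader": {"status": 0, "QTime": 0},
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "title", "type": "text_en"},
    ],
}

print(missing_fields(sample_response, ["title", "author", "price"]))
```

In this notebook, the first argument would be the result of `handle_request('GET', endpoint=f"{schema_endpoint}/fields")`; an empty list means every expected field exists.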

## Documents Indexing  

Now that we have our fields defined, it's time to index some documents.

In [117]:
# we will use books library catalog data downloaded from https://www.usabledatabases.com/database/books-isbn-covers/sample/#table_author

import pandas as pd

books = pd.read_csv("data/book.csv", header=0, index_col="id", keep_default_na=False)
books.head()


id,title,author,author_id,author_bio,authors,title_slug,author_slug,isbn13,isbn10,price,format,publisher,pubdate,edition,subjects,lexile,pages,dimensions,overview,excerpt,synopsis,toc,editorial_reviews
1,Opening Spaces: An Anthology of Contemporary A...,Yvonne Vera,0,<p><P>EDITOR<p>Yvonne Vera was born and raised...,"Yvonne Vera (Editor), Yvonne Vera",opening-spaces,yvonne-vera,9780435910105,435910108,$14.52,Paperback,Heinemann,September 1999,1st Edition,General & Miscellaneous Literature Anthologies...,,186.0,5.07 (w) x 7.78 (h) x 0.42 (d),In this anthology the award-winning author Yvo...,,<p><p>African women are seldom given the space...,<P>Preface<p>The Girl Who Can - Ama Ata Aidoo ...,
2,The Caine Prize for African Writing 2010: 11th...,The Caine Prize for African Writing,0,,The Caine Prize for African Writing,the-caine-prize-for-african-writing-2010,the-caine-prize-for-african-writing,9781906523374,1906523371,$13.46,Paperback,New Internationalist,August 2010,,"Short Story Anthologies, African Fiction, Afri...",,208.0,5.00 (w) x 7.70 (h) x 0.70 (d),<p>The Caine Prize for African Writing is Afri...,,<p><p>The best in new short story fiction from...,<P>Introduction 6<P>Caine Prize 2010 Shortlist...,
3,African Folktales,Roger D. Abrahams,0,,"Roger D. Abrahams, Dan Frank",african-folktales,roger-d-abrahams,9780394721170,394721179,$18.95,Paperback,Knopf Doubleday Publishing Group,August 1983,,"Travel, Africa",,,,,,<p><P>Nearly 100 stories from over 40 tribe-re...,,
4,Unchained Voices: An Anthology of Black Author...,Vincent Carretta,0,,Vincent Carretta,unchained-voices,vincent-carretta,9780813190761,813190762,$30.00,Paperback,University Press of Kentucky,December 2003,Expanded,United States History - African American Histo...,,416.0,6.10 (w) x 9.40 (h) x 1.10 (d),Vincent Carretta has assembled the most compre...,,<p><P>Vincent Carretta has assembled the most ...,"<TABLE><TR><TD WIDTH=""20%""></TD><TD WIDTH=""70%...",<article>\n <h4>African American Review</h4...
5,Women Writing Africa: West Africa and the Sahel,Esi Sutherland-Addy,0,"<p><P>Esi Sutherland-Addy (Ph.D. Hon, Hon FCP)...","Esi Sutherland-Addy (Editor), Abena P. A. Busi...",women-writing-africa,esi-sutherland-addy,9781558615007,1558615008,$29.95,Paperback,"Feminist Press at CUNY, The",August 2005,,"Literary Criticism - General & Miscellaneous, ...",,560.0,6.00 (w) x 9.00 (h) x 1.30 (d),<p>The acclaimed Women Writing Africa project ...,,<p><P>A major literary and scholarly work that...,,<article>\n <h4>Library Journal</h4>This se...


In [95]:
# This data is not as clean as we would like: several columns contain raw HTML.
# We will use https://docs.python.org/3/library/html.parser.html to create a custom HTML-to-text parser.
# Alternatively, we could have used the html2text package.

from html.parser import HTMLParser

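class Html2Text(HTMLParser):
    # closing any of these tags ends a line in the extracted text
    NEW_LINE_TAGS = {'p', 'div', 'br', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__()
        self.text = ""  # accumulated plain text, one buffer per parser instance

    def handle_endtag(self, tag):
        # block-level closing tags become newlines, all other closing tags become spaces
        if tag in self.NEW_LINE_TAGS:
            self.text += "\n"
        else:
            self.text += ' '

    def handle_data(self, data):
        # append the text found between tags, followed by a separating space
        self.text += f"{data} "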

# See how the parser algorithm works
parser = Html2Text()
parser.feed(books['author_bio'].iloc[0])
parser.text


'EDITOR Yvonne Vera was born and raised in Bulawayo, Zimbabwe, gained her Ph.D. from York University in Canada, and was the Director of the National Gallery of Zimbabwe in Bulawayo. Yvonne Vera died at age 40 in 2005 Yvonne Vera’s Without a Name and Under the Tongue both won first prize in the Zimbabwe Publishers Literary Awards of 1995 and 1997 respectively. Under the Tongue won the 1997 Commonwealth Writers Prize (Africa Region). Yvonne Vera won the Swedish literary award The Voice of Africa 1999. \n'

In [118]:
# create a function that will be applied to columns of interest

def html2text(line):
    parser = Html2Text()
    parser.feed(line)
    return parser.text

html_columns = ['author_bio','overview','excerpt','synopsis','toc','editorial_reviews']

# apply parser function to every html column
books = books.apply(lambda col: col.apply(html2text) if col.name in html_columns else col)
books[50:60]

id,title,author,author_id,author_bio,authors,title_slug,author_slug,isbn13,isbn10,price,format,publisher,pubdate,edition,subjects,lexile,pages,dimensions,overview,excerpt,synopsis,toc,editorial_reviews
51,The Best American Essays of the Century,Joyce Carol Oates,3,In a prolific and varied oeuvre that ranges ov...,"Joyce Carol Oates, Robert Atwan",the-best-american-essays-of-the-century,joyce-carol-oates,9780618155873,618155872,$14.84,Paperback,Houghton Mifflin Harcourt,October 2001,,"American Essays, American Literature Anthologies",,624.0,6.00 (w) x 9.00 (h) x 1.50 (d),This singular collection is nothing less than ...,Foreword \nThe Essay in the Twentieth Century ...,This singular collection is nothing less than ...,Foreword x Introduction xvii 1901: Corn...,\n From Barnes & Noble \nBookseller Review...
52,The Best Loved Poems of the American People,Hazel Felleman,0,,"Hazel Felleman (Selected by), Edward Frank All...",the-best-loved-poems-of-the-american-people,hazel-felleman,9780385000192,385000197,$17.92,Hardcover,Knopf Doubleday Publishing Group,October 1936,Reissue,"Poetry Anthologies, American Poetry, Poetry - ...",,670.0,5.99 (w) x 8.56 (h) x 2.06 (d),"More than 1,500,000 copies in print! Over 575 ...",,"More than 1,500,000 copies in print! Over 575 ...",,
53,The Norton Anthology of American Literature: V...,Wayne Franklin,0,"Nina Baym (General Editor), Ph.D. Harvard, i...","Wayne Franklin (Editor), Jerome Klinkowitz (Ed...",the-norton-anthology-of-american-literature,wayne-franklin,9780393927399,393927393,$37.77,Paperback,"Norton, W. W. & Company, Inc.",April 2007,7th Edition,American Literature Anthologies,,972.0,6.00 (w) x 9.20 (h) x 1.10 (d),Firmly grounded in the core strengths that hav...,,Firmly grounded in the core strengths that hav...,,
54,The Norton Anthology of Poetry,Margaret Ferguson,0,Margaret Ferguson (Ph.D. Yale University) is...,"Margaret Ferguson, Jon Stallworthy, Mary Jo Sa...",the-norton-anthology-of-poetry,margaret-ferguson,9780393979206,393979202,$66.30,Paperback,"Norton, W. W. & Company, Inc.",December 2004,5th Edition,"Poetry Anthologies, American Poetry, English P...",,2256.0,6.00 (w) x 9.20 (h) x 2.00 (d),Offering over one thousand years of verse from...,,Offering over one thousand years of verse from...,,
55,The Norton Anthology of African American Liter...,Henry Louis Gates Jr.,0,Henry Louis Gates Jr. (Ph.D. Cambridge) is A...,"Henry Louis Gates Jr. (Editor), Nellie Y. McKay",the-norton-anthology-of-african-american-liter...,henry-louis-gates-jr,9780393977783,393977781,$72.82,Paperback,"Norton, W. W. & Company, Inc.",December 2003,2nd Edition,Peoples & Cultures - American Anthologies,,2832.0,6.00 (w) x 9.30 (h) x 2.30 (d),"Welcomed on publication as ""brilliant, definit...",,"Welcomed on publication as ""brilliant, definit...",,\n Publishers Weekly\n ...
56,"Poems, Poets, Poetry: An Introduction and Anth...",Helen Vendler,0,"HELEN VENDLER , critic and scholar of English...",Helen Vendler,poems-poets-poetry,helen-vendler,9780312463199,312463197,$1.99,Paperback,Bedford/St. Martin's,October 2009,3rd Edition,"Poetry Anthologies, American Poetry, English P...",,752.0,5.90 (w) x 9.00 (h) x 1.00 (d),\nMany students today are puzzled by the meani...,,Written by a preeminent critic and legendary t...,Preface: About This Book Brief Contents Cont...,
57,The Poets Laureate Anthology,Elizabeth Hun Schmidt,0,"Elizabeth Hun Schmidt , a former poetry edito...","Elizabeth Hun Schmidt, Library of Congress Sta...",the-poets-laureate-anthology,elizabeth-hun-schmidt,9780393061819,393061817,$38.52,Hardcover,"Norton, W. W. & Company, Inc.",October 2010,New Edition,"Poetry, American Literature Anthologies, Antho...",,816.0,6.50 (w) x 9.30 (h) x 1.70 (d),The first anthology to gather poems by the for...,,The first anthology to gather poems by the for...,,\n Publishers Weekly \nThe United States h...
58,The Portable Beat Reader,Various,0,Ann Charters is the editor of The Portable Si...,"Various, Ann Charters",the-portable-beat-reader,various,9780142437537,142437530,$18.00,Paperback,Penguin Group (USA),July 2003,Reissue,"Literary Collections, American",,,,,,"Through poetry, fiction, essays, song lyrics, ...",,
59,The Best American Short Plays 2008-2009,Barbara Parisi,0,,Barbara Parisi,the-best-american-short-plays-2008-2009,barbara-parisi,9781557837608,1557837600,$14.85,Paperback,Applause Theatre Book Publishers,October 2010,,"Drama Anthologies, American Drama, American Li...",,356.0,5.50 (w) x 8.40 (h) x 1.10 (d),This edition of the highly esteemed and long-e...,,Applause is proud to continue the series that ...,"Foreword: A Simple, Brilliant Idea David Ives ...",
60,The Gift of Love,Lori Foster,0,Lori Foster is the New York Times and US...,"Lori Foster, Gia Dawn, Ann Christopher, Lisa C...",the-gift-of-love,lori-foster,9780425234280,425234282,$14.43,Paperback,Penguin Group (USA),June 2010,,"Short Story Anthologies, Family & Friendship -...",,368.0,5.10 (w) x 7.90 (h) x 1.00 (d),Edited by New York Times bestselling author...,,Edited by New York Times bestselling author...,,
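The parsed text still contains runs of spaces (see the `author_bio` value of row 53 above). If that matters for display, one optional extra step, which is not part of the original notebook, is to collapse whitespace after parsing. `normalize_whitespace` is a hypothetical helper name.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and trim spaces around newlines."""
    text = re.sub(r"[ \t]+", " ", text)   # many spaces or tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)  # drop the space on either side of a newline
    return text.strip()


print(repr(normalize_whitespace("Foreword   x \n  Intro ")))
```

It could be applied to the same columns as the parser, for example with `books[col].apply(normalize_whitespace)` for each name in `html_columns`.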


In [97]:
# Our dataframe looks better now!
# one last check, column data types
books.dtypes

title                object
author               object
author_id             int64
author_bio           object
authors              object
title_slug           object
author_slug          object
isbn13                int64
isbn10               object
price                object
format               object
publisher            object
pubdate              object
edition              object
subjects             object
lexile               object
pages                object
dimensions           object
overview             object
excerpt              object
synopsis             object
toc                  object
editorial_reviews    object
dtype: object

In [119]:
# we need price as a float, pages as an integer, pubdate as a date, and author_id dropped

def price_parser(price):
    # strip the currency symbol; empty strings (missing prices) become None
    try:
        return float(price.replace("$", ""))
    except ValueError:
        return None

def pages_parser(pages):
    # values may look like '186.0', so go through float first; empty strings become None
    try:
        return int(float(pages))
    except (TypeError, ValueError):
        return None

books['price'] = books['price'].apply(price_parser)
books['pages'] = books['pages'].apply(pages_parser)
books['pubdate'] = books['pubdate'].apply(lambda pubdate: pd.to_datetime(pubdate, format="%B %Y").date())
books = books.drop(columns=['author_id']) # this column doesn't make sense
books.head()

id,title,author,author_bio,authors,title_slug,author_slug,isbn13,isbn10,price,format,publisher,pubdate,edition,subjects,lexile,pages,dimensions,overview,excerpt,synopsis,toc,editorial_reviews
1,Opening Spaces: An Anthology of Contemporary A...,Yvonne Vera,EDITOR Yvonne Vera was born and raised in Bula...,"Yvonne Vera (Editor), Yvonne Vera",opening-spaces,yvonne-vera,9780435910105,435910108,14.52,Paperback,Heinemann,1999-09-01,1st Edition,General & Miscellaneous Literature Anthologies...,,186.0,5.07 (w) x 7.78 (h) x 0.42 (d),In this anthology the award-winning author Yvo...,,African women are seldom given the space to ex...,Preface The Girl Who Can - Ama Ata Aidoo (Ghan...,
2,The Caine Prize for African Writing 2010: 11th...,The Caine Prize for African Writing,,The Caine Prize for African Writing,the-caine-prize-for-african-writing-2010,the-caine-prize-for-african-writing,9781906523374,1906523371,13.46,Paperback,New Internationalist,2010-08-01,,"Short Story Anthologies, African Fiction, Afri...",,208.0,5.00 (w) x 7.70 (h) x 0.70 (d),The Caine Prize for African Writing is Africa'...,,The best in new short story fiction from Afric...,Introduction 6 Caine Prize 2010 Shortlisted St...,
3,African Folktales,Roger D. Abrahams,,"Roger D. Abrahams, Dan Frank",african-folktales,roger-d-abrahams,9780394721170,394721179,18.95,Paperback,Knopf Doubleday Publishing Group,1983-08-01,,"Travel, Africa",,,,,,Nearly 100 stories from over 40 tribe-related ...,,
4,Unchained Voices: An Anthology of Black Author...,Vincent Carretta,,Vincent Carretta,unchained-voices,vincent-carretta,9780813190761,813190762,30.0,Paperback,University Press of Kentucky,2003-12-01,Expanded,United States History - African American Histo...,,416.0,6.10 (w) x 9.40 (h) x 1.10 (d),Vincent Carretta has assembled the most compre...,,Vincent Carretta has assembled the most compre...,Acknowledgments Introduction 1 A Note o...,\n African American Review \nThis excellen...
5,Women Writing Africa: West Africa and the Sahel,Esi Sutherland-Addy,"Esi Sutherland-Addy (Ph.D. Hon, Hon FCP) is se...","Esi Sutherland-Addy (Editor), Abena P. A. Busi...",women-writing-africa,esi-sutherland-addy,9781558615007,1558615008,29.95,Paperback,"Feminist Press at CUNY, The",2005-08-01,,"Literary Criticism - General & Miscellaneous, ...",,560.0,6.00 (w) x 9.00 (h) x 1.30 (d),The acclaimed Women Writing Africa project “op...,,A major literary and scholarly work that trans...,,\n Library Journal \nThis second of four v...
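Solr date fields expect ISO 8601 timestamps in UTC, such as `1999-09-01T00:00:00Z`, while the `pubdate` column now holds plain Python dates. Below is a minimal sketch of a conversion, assuming midnight UTC is acceptable for a publication date that is only known to the month. `to_solr_date` is a hypothetical helper, not something the notebook defines.

```python
from datetime import date


def to_solr_date(value):
    """Format a date as an ISO 8601 UTC timestamp, or return None when no date is given."""
    if value is None:
        return None
    # assumption: the time of day is unknown, so midnight UTC is used
    return value.strftime("%Y-%m-%dT00:00:00Z")


print(to_solr_date(date(1999, 9, 1)))  # 1999-09-01T00:00:00Z
```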


In [120]:
# cross-check data types
books.dtypes
# pages is float64 rather than int because missing values are stored as NaN, but that is fine

title                 object
author                object
author_bio            object
authors               object
title_slug            object
author_slug           object
isbn13                 int64
isbn10                object
price                float64
format                object
publisher             object
pubdate               object
edition               object
subjects              object
lexile                object
pages                float64
dimensions            object
overview              object
excerpt               object
synopsis              object
toc                   object
editorial_reviews     object
dtype: object
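With the columns cleaned, each row still has to be reshaped into a document whose keys match the schema fields. The sketch below shows one possible way to do that for a single record. The column-to-field mapping is an assumption (it is partial, and nothing in the notebook fixes it), and `is_missing` and `record_to_doc` are hypothetical helpers.

```python
import math

# Assumed mapping from catalog columns to the schema fields defined earlier.
COLUMN_TO_FIELD = {
    "title": "title",
    "author": "author",
    "publisher": "publisher",
    "isbn13": "isbn",
    "pages": "pages",
    "price": "price",
    "overview": "preface",
}


def is_missing(value):
    """True for None, empty strings and NaN."""
    if value is None:
        return True
    if isinstance(value, str):
        return value == ""
    return isinstance(value, float) and math.isnan(value)


def record_to_doc(record):
    """Build one Solr document (a plain dict) from one catalog record."""
    doc = {"id": str(record["id"])}  # the schema's id field is a string
    for column, field in COLUMN_TO_FIELD.items():
        value = record.get(column)
        if is_missing(value):
            continue  # leave the field out rather than sending an empty value
        if field == "pages":
            value = int(value)   # pages is stored as float64 in the DataFrame
        elif field == "isbn":
            value = str(value)   # the isbn field is a string
        elif field == "author":
            value = [value]      # the author field is multiValued
        doc[field] = value
    return doc


record = {
    "id": 3,
    "title": "African Folktales",
    "author": "Roger D. Abrahams",
    "publisher": "Knopf Doubleday Publishing Group",
    "isbn13": 9780394721170,
    "pages": float("nan"),
    "price": 18.95,
    "overview": "",
}
doc = record_to_doc(record)
print(doc)
```

Records in this shape can be obtained from the DataFrame with `books.reset_index().to_dict(orient="records")`. The resulting list of documents could then be sent as JSON to the core's update handler; that request is not shown here.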