# Sophisticated Indexing and Search  

In this lab, we will look at various ways we can fine-tune our search results using Solr `analyzers`, `tokenizers` and `filters`.  
We will use book archive data downloaded from [Open Library](https://openlibrary.org/developers/dumps).  

Download the [editions dump](https://openlibrary.org/data/ol_dump_editions_latest.txt.gz). This is a big file, so the download requires some patience! When the download is done, decompress the file into the data folder. I have included a sample text file with the first 10 lines of the `ol_dump_editions_2021-03-19.txt` dump file. Let's take a look at the first line in the file:  

In [37]:
# the context manager closes the file for us
with open('data/editions.txt') as fopen:
    line = fopen.readline()
line

'/type/edition\t/books/OL10000135M\t4\t2010-04-24T17:54:01.503315\t{"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}\n'

Each line is a book edition entry with attributes separated by tabs. Let's split the line for a clearer view!

In [38]:
line.strip().split('\t')

['/type/edition',
 '/books/OL10000135M',
 '4',
 '2010-04-24T17:54:01.503315',
 '{"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}']

We have five components: the first four are metadata that are less useful to us than the fifth, which is a JSON object. Let's focus on the fifth component!

In [39]:
import simplejson as json

line = line.strip().split('\t')

print(json.dumps(json.loads(line[4]), indent=2))

{
  "publishers": [
    "Bernan Press"
  ],
  "physical_format": "Hardcover",
  "subtitle": "9th November - 3rd December, 1992",
  "key": "/books/OL10000135M",
  "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93",
  "identifiers": {
    "goodreads": [
      "6850240"
    ]
  },
  "isbn_13": [
    "9780107805401"
  ],
  "languages": [
    {
      "key": "/languages/eng"
    }
  ],
  "number_of_pages": 64,
  "isbn_10": [
    "0107805405"
  ],
  "publish_date": "December 1993",
  "last_modified": {
    "type": "/type/datetime",
    "value": "2010-04-24T17:54:01.503315"
  },
  "authors": [
    {
      "key": "/authors/OL2645777A"
    }
  ],
  "latest_revision": 4,
  "works": [
    {
      "key": "/works/OL7925046W"
    }
  ],
  "type": {
    "key": "/type/edition"
  },
  "subjects": [
    "Government - Comparative",
    "Politics / Current Events"
  ],
  "revision": 4
}
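
The same steps can be wrapped in a small helper so that the dump is read lazily, one line at a time. The following is a minimal sketch, not part of the original lab: `EditionLine`, `parse_edition_line` and `iter_editions` are hypothetical names, the column labels are our reading of the sample line above rather than official names, and the standard-library `json` module stands in for `simplejson`.

```python
import json
from collections import namedtuple

# Column labels are assumptions inferred from the sample line, not official names.
EditionLine = namedtuple(
    'EditionLine', ['type', 'key', 'revision', 'last_modified', 'record']
)

def parse_edition_line(line):
    '''Split one tab-separated dump line and decode its JSON column.'''
    # maxsplit=4 keeps the JSON column in one piece
    parts = line.rstrip('\n').split('\t', 4)
    if len(parts) != 5:
        raise ValueError(f'expected 5 tab-separated columns, got {len(parts)}')
    type_, key, revision, last_modified, raw = parts
    return EditionLine(type_, key, int(revision), last_modified, json.loads(raw))

def iter_editions(path):
    '''Yield parsed lines lazily so the whole dump never sits in memory.'''
    with open(path, encoding='utf-8') as dump:
        for line in dump:
            if line.strip():  # skip blank lines
                yield parse_edition_line(line)
```

With the sample file in place, `next(iter_editions('data/editions.txt')).record` would hold the dictionary for the first edition, and a `for` loop over `iter_editions(...)` would process the rest one at a time.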


The progress looks promising. According to the Open Library [API docs](https://openlibrary.org/dev/docs/restful_api), the keys in the edition data let us look up author and book information. Take some time to check the printout above against the [API guide](https://openlibrary.org/dev/docs/restful_api).  


In [26]:
# The data we are dealing with is very big, so we would rather use non-blocking libraries and perform I/O tasks asynchronously!
import asyncio
import aiohttp
import simplejson as json

# set http header content
headers = {
    'Content-type':'application/json'
}
ol_url = 'https://openlibrary.org' 

async def get_author_info(author_key):
    '''
    Read author details using openlibrary API
    ----
    Parameters: 
        author_key: url mapping of author in the API graph
    ---
    Returns: JSON response with author details or error message
    '''
    async with aiohttp.ClientSession() as session:
        async with session.get(f'{ol_url}{author_key}.json', headers=headers) as resp:
            return await resp.text()

async def get_book_info(book_key):
    '''
    Read edition book details using openlibrary API
    ----
    Parameters: 
        book_key: url mapping of edition book in the API graph
    ----
    Returns: JSON response with edition book details or error message
    '''
    async with aiohttp.ClientSession() as session:
        async with session.get(f'{ol_url}{book_key}.json', headers=headers) as resp:
            return await resp.text()

async def load_editions_data():
    '''
    Process each edition line as demonstrated above to extract the JSON component,
    then enrich it with author and book details from the API.
    ----
    Parameters: None
    ----
    Returns: list of enriched edition dictionaries
    '''
    editions = []

    # Read the file line by line and process each line fully.
    # The 10-line sample fits in memory, so we collect the results in a list;
    # the full dump is too big for that, so index each edition on Solr as soon
    # as it is processed instead.
    with open('data/editions.txt', 'r') as fopen:
        for line in fopen:
            if not line.strip():
                continue  # skip blank lines
            edition = json.loads(line.strip().split('\t')[4])
            # requests are awaited one at a time; default to an empty list
            # if an edition has no 'authors' key
            authors = [await get_author_info(author.get('key')) for author in edition.get('authors', [])]
            book = await get_book_info(edition.get('key'))
            edition['authors'] = [json.loads(author) for author in authors]
            edition['book'] = json.loads(book)
            # drop the keys we no longer need; pop() tolerates missing keys
            for key in ('key', 'type', 'works'):
                edition.pop(key, None)
            editions.append(edition)

    return editions
    
editions = await load_editions_data()
print(json.dumps(editions[0], indent=2))

{
  "publishers": [
    "Bernan Press"
  ],
  "physical_format": "Hardcover",
  "subtitle": "9th November - 3rd December, 1992",
  "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93",
  "identifiers": {
    "goodreads": [
      "6850240"
    ]
  },
  "isbn_13": [
    "9780107805401"
  ],
  "languages": [
    {
      "key": "/languages/eng"
    }
  ],
  "number_of_pages": 64,
  "isbn_10": [
    "0107805405"
  ],
  "publish_date": "December 1993",
  "last_modified": {
    "type": "/type/datetime",
    "value": "2010-04-24T17:54:01.503315"
  },
  "authors": [
    {
      "name": "HMSO Books",
      "last_modified": {
        "type": "/type/datetime",
        "value": "2008-04-29 13:35:46.87638"
      },
      "key": "/authors/OL2645777A",
      "type": {
        "key": "/type/author"
      },
      "id": 9978471,
      "revision": 1
    }
  ],
  "latest_revision": 4,
  "subjects": [
    "Government - Comparative",
    "Politics / Current Events"
  ],
  "revision": 4,


This data now looks great!  
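
One caveat about the loader above: awaiting each request inside the loop means the calls run one after another. A common way to overlap them is `asyncio.gather` combined with an `asyncio.Semaphore` that caps the number of requests in flight. The sketch below is an illustration under assumptions: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely simulates a network call so that the example runs without `aiohttp` or internet access.

```python
import asyncio

async def fetch_all(keys, fetch, limit=5):
    '''Run fetch(key) for every key, with at most `limit` calls in flight.'''
    semaphore = asyncio.Semaphore(limit)

    async def bounded(key):
        async with semaphore:
            return await fetch(key)

    # gather returns results in the same order as the input keys
    return await asyncio.gather(*(bounded(key) for key in keys))

async def fake_fetch(key):
    '''Hypothetical stand-in for a real request: sleeps briefly, then echoes the key.'''
    await asyncio.sleep(0.01)
    return {'key': key}

if __name__ == '__main__':
    keys = ['/authors/OL2645777A', '/books/OL10000135M']
    print(asyncio.run(fetch_all(keys, fake_fetch, limit=2)))
```

Inside a notebook, where an event loop is already running, the call would be `await fetch_all(keys, fake_fetch, limit=2)` rather than `asyncio.run(...)`. Passing `get_author_info` in place of `fake_fetch` should work the same way, since it is also a coroutine that takes a single key.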

We can now proceed with the rest of the indexing and searching work!  

## Back to Solr

In [49]:
# define Solr instance resources
base_url = 'http://localhost:8983'
core_name = 'openLib'
# define important paths
api_endpoint = f'{base_url}/api/cores/{core_name}'
schema_endpoint = f'{api_endpoint}/schema'
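
As a preview of how `schema_endpoint` is used, Solr's Schema API accepts JSON commands such as `add-field` in the body of a POST request. The sketch below only builds and prints such a command, without contacting Solr. The helper name `add_field_command` is hypothetical, and the field name `title` and the type `text_general` are assumptions that depend on the configset your core was created with.

```python
import json

def add_field_command(name, field_type, stored=True, indexed=True, multi_valued=False):
    '''Build the JSON body of a Schema API add-field command.'''
    return {
        'add-field': {
            'name': name,
            'type': field_type,
            'stored': stored,
            'indexed': indexed,
            'multiValued': multi_valued,
        }
    }

# Assumed example: a single-valued, searchable title field.
payload = add_field_command('title', 'text_general')
print(json.dumps(payload, indent=2))
```

Sending the command would then amount to a POST of `payload` as JSON to `schema_endpoint`, for example with `session.post(schema_endpoint, json=payload)` on an `aiohttp` session.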

## Create Solr index schema