# Loading data into Elasticsearch

Welcome! Hopefully you are reading this with Elasticsearch running successfully via the docker-compose.yml. If you are running Elasticsearch on a different service or host instead, you will need to change the following code, which reads the hostname of the Elasticsearch server from the environment variable set in the docker-compose.yml. Change ES_HOSTNAME to point to your own Elasticsearch instance if you need to.
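
If the environment variable is not set, `os.getenv` returns `None`, which is easy to miss. The following is a minimal sketch of a fallback; the helper name `resolve_es_host` and the default address `http://localhost:9200` are assumptions made for illustration and are not defined by the docker-compose.yml.

```python
import os

def resolve_es_host(env=None, default="http://localhost:9200"):
    """Return the Elasticsearch address from ES_HOSTNAME, or a default.

    Both the function name and the default address are hypothetical.
    """
    env = os.environ if env is None else env
    host = env.get("ES_HOSTNAME")
    # Treat an empty or whitespace-only value the same as a missing one.
    if host is None or not host.strip():
        return default
    return host.strip()

ES_HOSTNAME = resolve_es_host()
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.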

_Start at the top of this file and step through it by clicking on the ">|  Run" button above. Make sure that each "cell" completes and doesn't throw any errors._

In [1]:
import os
ES_HOSTNAME = os.getenv("ES_HOSTNAME")

# Now, we need to go get some data to index. Let's start with a metadata dataset that includes details
# of all the books scanned in as part of the Microsoft Digitisation project at the British Library
# (From https://github.com/BL-Labs/imagedirectory/blob/master/book_metadata.json)

# This is a big file, so we will save it as it is downloaded:

import requests

url = "https://raw.githubusercontent.com/BL-Labs/imagedirectory/master/book_metadata.json"

size = 0
# Check whether the file has already been downloaded and skip the download if so.
# Remove the "book_data.json" file if you wish this to download it again.
if not os.path.isfile("book_data.json"):
    # Setting 'stream' to True is crucial: it avoids holding the whole file in memory.
    r = requests.get(url, stream=True)
    r.raise_for_status()  # stop here rather than saving an error page as data
    with open("book_data.json", "wb") as book_fp:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                book_fp.write(chunk)
                size += len(chunk)

    print("Downloaded {0}MB".format(size / (1024 * 1024)))
else:
    print("Already got it!")

Downloaded 23.81989097595215MB


Great! We (hopefully) have metadata in this JSON file and an Elasticsearch instance into which we can index it! Let's get that set up now!

In [2]:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(ES_HOSTNAME)
# Do we have a connection?
es.ping()

True
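
If Elasticsearch is still starting up, the ping may return `False` the first time. A minimal sketch of a retry loop follows; the helper name `wait_until_ready` and its `attempts` and `delay` parameters are hypothetical, and it takes any zero-argument callable so it can be tried without a running server.

```python
import time

def wait_until_ready(ping, attempts=10, delay=1.0, sleep=time.sleep):
    """Call ping() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if ping():
                return True
        except Exception:
            # Treat a raised connection error like a failed ping.
            pass
        if attempt < attempts:
            sleep(delay)
    return False

# Usage with the client above (not run here): wait_until_ready(es.ping)
```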

Time to load the book metadata and have a look at what a typical record looks like:

In [3]:
import json

with open("book_data.json", "r") as bkfile:
    bookmd = json.load(bkfile)

print(json.dumps(bookmd[0], indent=4))

{
    "datefield": "[1888]",
    "shelfmarks": [
        "British Library HMNTS 10347.cc.13.(4.)"
    ],
    "flickr_url_to_book_images": "http://www.flickr.com/photos/britishlibrary/tags/sysnum000000037",
    "publisher": "A. Heywood & Son",
    "edition": "",
    "place": "Manchester",
    "issuance": "monographic",
    "authors": {},
    "date": "1888",
    "title": [
        "A Gossip about Old Manchester. With illustrations. [Signed: A.]"
    ],
    "identifier": "000000037",
    "corporate": {}
}
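
Note that "datefield" holds the date as catalogued, here with square brackets, while "date" holds a plain year. As a small sketch of how a year could be pulled out of a value like this, assuming the year is the first run of four digits (the helper name `extract_year` is hypothetical, and other records may use formats this does not handle):

```python
import re

YEAR_PATTERN = re.compile(r"\d{4}")

def extract_year(datefield):
    """Return the first four-digit run in a datefield as an int, else None."""
    match = YEAR_PATTERN.search(datefield or "")
    return int(match.group()) if match else None

# Values copied from the sample record above.
sample = {"datefield": "[1888]", "date": "1888"}
print(extract_year(sample["datefield"]))
```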


Elasticsearch expects records to carry a few fields such as "\_type" and "\_index" so that it can collate the data we send it. The following creates an index in Elasticsearch and provides a generator that takes the book data and produces suitable records for it to index.

In [6]:
es.indices.create(index="bookmd")



{'acknowledged': True, 'shards_acknowledged': True, 'index': 'bookmd'}

In [7]:
md_gen = ({"_type": "book", "_index": "bookmd", "_source": record} for record in bookmd)

# test it out?
next(md_gen)

{'_type': 'book',
 '_index': 'bookmd',
 '_source': {'datefield': '[1888]',
  'shelfmarks': ['British Library HMNTS 10347.cc.13.(4.)'],
  'flickr_url_to_book_images': 'http://www.flickr.com/photos/britishlibrary/tags/sysnum000000037',
  'publisher': 'A. Heywood & Son',
  'edition': '',
  'place': 'Manchester',
  'issuance': 'monographic',
  'authors': {},
  'date': '1888',
  'title': ['A Gossip about Old Manchester. With illustrations. [Signed: A.]'],
  'identifier': '000000037',
  'corporate': {}}}
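
The records above carry no "\_id", so Elasticsearch generates one for each document, and loading the same data twice would store every book twice. A minimal sketch of one way around that, assuming each record's "identifier" is unique (the source data does not state this); the function name `book_actions` is hypothetical:

```python
def book_actions(records, index="bookmd", doc_type="book"):
    """Yield bulk-helper actions, using each record's identifier as the _id."""
    for record in records:
        action = {"_index": index, "_type": doc_type, "_source": record}
        identifier = record.get("identifier")
        if identifier:
            # With an explicit _id, re-indexing overwrites instead of duplicating.
            action["_id"] = identifier
        yield action

# A cut-down stand-in for one entry of the book metadata.
sample_records = [{"identifier": "000000037", "place": "Manchester"}]
first_action = next(book_actions(sample_records))
print(first_action["_id"])
```

Because this is a generator function rather than a generator expression, calling it again gives a fresh generator without repeating the expression.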

Great, that seems to work. Now to put this into Elasticsearch. Warning: the following could take some time if this is running on a slow system!

In [8]:
# recreate the generator so we get all the records in:
md_gen = ({"_type": "book", "_index": "bookmd", "_source": record} for record in bookmd)

# Using the Elasticsearch-py bulk helper here:
helpers.bulk(es, md_gen)

(49509, [])
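
The result is a pair: the number of actions that succeeded and a list of errors, which is empty here. The helper does not send one request per record; it groups the actions into batches (controlled, as far as I know, by a `chunk_size` argument that defaults to 500). The sketch below only illustrates the batching idea with a hypothetical `batches` function and is not the library's implementation.

```python
from itertools import islice

def batches(iterable, size=500):
    """Yield lists of up to `size` items taken in order from the iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# 1200 items split into groups of at most 500.
print([len(b) for b in batches(range(1200), size=500)])
```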

Ingested? Let's see what we get:

In [9]:
es.count(index="bookmd")

{'count': 49509,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}}
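
Recently indexed documents only show up in counts and searches after the index has been refreshed, so a count taken straight after loading can lag behind. A minimal sketch of a check that refreshes first and then compares the count with what was sent; the helper name `indexed_count_matches` is hypothetical:

```python
def indexed_count_matches(client, index, expected):
    """Refresh the index, then compare its document count with `expected`."""
    # A refresh makes recently indexed documents visible to count and search.
    client.indices.refresh(index=index)
    return client.count(index=index)["count"] == expected

# Usage with the objects above (not run here):
# indexed_count_matches(es, "bookmd", len(bookmd))
```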

49k records loaded. Now to query them. I wonder whether any of these books were published in London in 1885?

In [10]:
# Use the Lucene query string syntax (more complex queries are possible; see Elasticsearch's Query DSL)
# Return just 2 hits:

results = es.search(index="bookmd", 
                    q="place:London AND datefield:1885",
                    size=2)

print(json.dumps(results, indent=4))

{
    "took": 67,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 428,
        "max_score": 4.9648094,
        "hits": [
            {
                "_index": "bookmd",
                "_type": "book",
                "_id": "jhjOA2QBJxh6vfXFFiRS",
                "_score": 4.9648094,
                "_source": {
                    "datefield": "1885",
                    "shelfmarks": [
                        "British Library HMNTS 11649.gg.1."
                    ],
                    "flickr_url_to_book_images": "http://www.flickr.com/photos/britishlibrary/tags/sysnum000119643",
                    "publisher": "Macmillan & Co.",
                    "edition": "[Another edition.]",
                    "place": "London",
                    "issuance": "monographic",
                    "authors": {
                        "creator": [
                           

Congratulations! You now have a search engine full of book metadata to query!
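
For comparison, the London/1885 search can also be written in the Query DSL. The sketch below builds a request body that should be roughly equivalent to the Lucene-style string; the function name `place_and_date_query` is hypothetical, and the `body` argument in the usage comment is the form accepted by older elasticsearch-py releases (newer releases prefer separate keyword arguments, so check the client version you have).

```python
import json

def place_and_date_query(place, datefield, size=2):
    """Build a Query DSL body requiring matches on both place and datefield."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"match": {"place": place}},
                    {"match": {"datefield": datefield}},
                ]
            }
        },
    }

body = place_and_date_query("London", "1885")
print(json.dumps(body, indent=4))

# Usage with the client above (not run here):
# results = es.search(index="bookmd", body=body)
```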

In [13]:
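# Quote the phrase so that both words are searched for in the title field;
# unquoted, only "Atlantis" would be tied to the title.
results = es.search(index="bookmd",
                    q='title:"Atlantis Arisen"',
                    size=2)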

first_hit = results['hits']['hits'][0]['_source']

print(first_hit['title'])
print(first_hit['datefield'])
print(first_hit['identifier'])
print(first_hit['flickr_url_to_book_images'])

['Atlantis Arisen; or, talks of a tourist about Oregon and Washington ... Illustrated']
1891
003786719
http://www.flickr.com/photos/britishlibrary/tags/sysnum003786719
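
Indexing `[0]` directly raises an `IndexError` when a search finds nothing. A minimal sketch of two small helpers for reading a response more defensively; the names `total_hits` and `first_source` are hypothetical. The total is handled in both shapes because older servers return a plain number, as in the output above, while newer ones return an object with a "value" key.

```python
def total_hits(results):
    """Return the hit count whether 'total' is an int or a dict."""
    total = results["hits"]["total"]
    # Older servers return an int; newer ones return {"value": ..., "relation": ...}.
    return total["value"] if isinstance(total, dict) else total

def first_source(results):
    """Return the _source of the first hit, or None if there are no hits."""
    hits = results["hits"]["hits"]
    return hits[0]["_source"] if hits else None

# A hand-written stand-in for an empty search response.
empty = {"hits": {"total": 0, "hits": []}}
print(first_source(empty))
```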
