This notebook shows how to use Elastic Search to index and search through data. We will use a dataset called CMU Book summaries [dataset](http://www.cs.cmu.edu/~dbamman/booksummaries.html). Alternateively, the dataset's link can be found in the `BookSummaries_Link.md` file under the Data folder in Ch7. 

For this code to work, elastic search instance has to be running in the background. 
For this you need to follow these steps :

Docker :
   1. Install Docker
   2. Create Network : docker network create elastic
   3. Pull Image : docker pull docker.elastic.co/elasticsearch/elasticsearch:8.9.1
   4. Run Image : docker run --name elasticsearch --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -t docker.elastic.co/elasticsearch/elasticsearch:8.9.1
   5. After Running the last command the container will give you the username and password to login, the user most likely be named as "elastic" and password will be given just beside it.

Linux :

   1. Go to the elasticsearch-X.Y.Z/bin folder on your machine
   2. Run ./elasticsearch.  
    
Windows :

   1.  Download the latest [release](https://www.elastic.co/guide/en/elasticsearch/reference/current/windows.html)
   2.  Run .\bin\elasticsearch.bat
   
[ElasticSearch Documentation](https://www.elastic.co/guide/index.html)
    
You should now be able to access this instance on localhost:9200



In [1]:
!pip install elasticsearch

Collecting elasticsearch
  Obtaining dependency information for elasticsearch from https://files.pythonhosted.org/packages/bb/06/81b1d71ba0567ff39d0f98f3637e810846df92f6733aee46004a194b51ea/elasticsearch-8.9.0-py3-none-any.whl.metadata
  Downloading elasticsearch-8.9.0-py3-none-any.whl.metadata (5.2 kB)
Collecting elastic-transport<9,>=8 (from elasticsearch)
  Downloading elastic_transport-8.4.0-py3-none-any.whl (59 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.5/59.5 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
Downloading elasticsearch-8.9.0-py3-none-any.whl (395 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m395.5/395.5 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: elastic-transport, elasticsearch
Successfully installed elastic-transport-8.4.0 elasticsearch-8.9.0


In [18]:
from elasticsearch import Elasticsearch 
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

In [24]:
#elastic search instance has to be running on the machine. Default port is 9200. 

#Call the Elastic Search instance, and delete any pre-existing index
es=Elasticsearch([{'host':'localhost','port':9200,'scheme':"https"}], http_auth=('elastic', 'yQIpShQvIXJ5gqcImcM9'), verify_certs=False)
if es.indices.exists(index="myindex"):
    es.indices.delete(index='myindex', ignore=[400, 404]) #Deleting existing index for now 

In [25]:
#Build an index from booksummaries dataset. I am using only 500 documents for now.
path = "./Data/booksummaries/booksummaries.txt" #Add your path.
count = 1
for line in open(path):
    fields = line.split("\t")
    doc = {'id' : fields[0],
            'title': fields[2],
            'author': fields[3],
            'summary': fields[6]
          }

    res = es.index(index="myindex", id=fields[0], body=doc)
    count = count+1
    if count%100 == 0:
        print("indexed 100 documents")
    if count == 501:
        break

indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents


In [26]:
#Check to see how big is the index
res = es.search(index="myindex", body={"query": {"match_all": {}}})
print("Your index has %d entries" % res['hits']['total']['value'])

Your index has 414 entries


In [27]:
#Try a test query. The query searches "summary" field which contains the text
#and does a full text query on that field.
res = es.search(index="myindex", body={"query": {"match": {"summary": "animal"}}})
print("Your search returned %d results." % res['hits']['total']['value'])

Your search returned 16 results.


In [28]:
#Printing the title field and summary field's first 100 characters for 2nd result
print(res["hits"]["hits"][2]["_source"]["title"])
print(res["hits"]["hits"][2]["_source"]["summary"][:100])


Dead Air
 The first person narrative begins on 11 September 2001, and Banks uses the protagonist's conversati


In [30]:
#match query considers both exact matches, and fuzzy matches and works as a OR query. 
#match_phrase looks for exact matches.
while True:
    query = input("Enter your search query: ")
    if query == "STOP":
        break
    res = es.search(index="myindex", body={"query": {"match_phrase": {"summary": query}}})
    print("Your search returned %d results:" % res['hits']['total']['value'])
    for hit in res["hits"]["hits"]:
        print(hit["_source"]["title"])
        #to get a snippet 100 characters before and after the match
        loc = hit["_source"]["summary"].lower().index(query)
        print(hit["_source"]["summary"][:100])
        print(hit["_source"]["summary"][loc-100:loc+100])

    

Your search returned 7 results:
All's Well That Ends Well
 Helena, the orphan daughter of a famous physician, is the ward of the Countess of Rousillon, and ho

The Last Man
 Mary Shelley states in the introduction that in 1818 she discovered, in the Sibyl's cave near Naple
ng leaves the throne, the monarchy come to an end and a republic is created. When the king dies the Countess attempts to raise their son, Adrian, to reclaim the throne, but Adrian opposes his mother a
The Luck of Barry Lyndon
 Redmond Barry of Bally Barry, born to a genteel but ruined Irish family, fancies himself a gentlema
chy, where they win considerable sums of money and Redmond cleverly sets up a plan to marry a young countess of some means. Again, fortune turns against him, and a series of circumstances undermines h
Carmilla
 The story is presented by Le Fanu as part of the casebook of Dr Hesselius, whose departures from me
ily heirloom restored portraits arrives at the castle, Laura finds one of her ancestors,