# Exercise #1: Indexing using Elasticsearch

Index the sample document collection using Elasticsearch.
The parsing code is already provided.

## Initialize Elasticsearch

In [9]:
from elasticsearch import Elasticsearch

In [10]:
es = Elasticsearch()

Check that the Elasticsearch service is running.

In [11]:
es.info()

{'cluster_name': 'elasticsearch',
 'cluster_uuid': 'LMlf8WX9RPC0aJ0eB5R69Q',
 'name': 'Krisztians-MacBook-Pro.local',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2019-09-27T08:36:48.569419Z',
  'build_flavor': 'default',
  'build_hash': '22e1767283e61a198cb4db791ea66e3f11ab9910',
  'build_snapshot': False,
  'build_type': 'tar',
  'lucene_version': '8.2.0',
  'minimum_index_compatibility_version': '6.0.0-beta1',
  'minimum_wire_compatibility_version': '6.8.0',
  'number': '7.4.0'}}

### Create index

In [12]:
INDEX_NAME = "reuters"
DOC_TYPE = "doc"

# Create index
if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME)

## Processing the input document collection

  - The collection is given as a single XML file. 
  - Each document is inside `<REUTERS ...> </REUTERS>`.
  - We extract the contents of the `<DATE>`, `<TITLE>`, and `<BODY>` tags.
  - For each extracted document, the provided callback function is called with all document data passed as a single dict argument.
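
The extraction-and-callback pattern described above can be tried on a small inline sample before running it on the real file. The following is a minimal sketch, not the exercise's parsing code: the sample XML, the `<COLLECTION>` wrapper element, and the helper names `first_text` and `parse_sample` are all hypothetical and chosen only for illustration.

```python
from xml.dom import minidom

# Hypothetical three-document sample mimicking the <REUTERS> structure.
# The <COLLECTION> root element is an assumption made for this sketch.
SAMPLE_XML = """<COLLECTION>
<REUTERS>
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TITLE>FIRST TITLE</TITLE>
<BODY>First body text.</BODY>
</REUTERS>
<REUTERS>
<DATE>26-FEB-1987 15:02:20.00</DATE>
</REUTERS>
<REUTERS>
<DATE>26-FEB-1987 15:03:27.51</DATE>
<TITLE>THIRD TITLE</TITLE>
<BODY>Third body text.</BODY>
</REUTERS>
</COLLECTION>"""


def first_text(element, tag):
    """Return the text of the first <tag> descendant, or None if absent or empty."""
    nodes = element.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.nodeValue


def parse_sample(xml_string, callback):
    """Call `callback` with one dict per document that has a title and a body."""
    xmldoc = minidom.parseString(xml_string)
    # doc_id counts every <REUTERS> element, including the skipped ones
    for doc_id, doc in enumerate(xmldoc.getElementsByTagName("REUTERS"), start=1):
        title = first_text(doc, "TITLE")
        body = first_text(doc, "BODY")
        if title is None or body is None:
            continue  # skip documents without a title or body
        callback({
            "doc_id": doc_id,
            "date": first_text(doc, "DATE"),
            "title": title,
            "body": body,
        })


# Any callable taking a single dict works as the callback; here we just collect.
collected = []
parse_sample(SAMPLE_XML, collected.append)
```

Passing `collected.append` as the callback makes it easy to inspect what each call receives. The second sample document has no title or body, so it is skipped, while the document ID still advances past it.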

In [13]:
from xml.dom import minidom

In [14]:
def index_collection(input_file, callback):
    xmldoc = minidom.parse(input_file)
    # Iterate documents in the XML file
    itemlist = xmldoc.getElementsByTagName("REUTERS")
    doc_id = 0
    for doc in itemlist:
        doc_id += 1
        date = doc.getElementsByTagName("DATE")[0].firstChild.nodeValue
        # Skip documents without a title or body
        if not (doc.getElementsByTagName("TITLE") and doc.getElementsByTagName("BODY")):
            continue
        title = doc.getElementsByTagName("TITLE")[0].firstChild.nodeValue
        body = doc.getElementsByTagName("BODY")[0].firstChild.nodeValue
        callback({
            "doc_id": doc_id,
            "date": date,
            "title": title,
            "body": body
            })

The following function is passed to `index_collection` as the callback and is called once for each document to be indexed.

In [15]:
def index_doc(doc):
    es.index(index=INDEX_NAME, doc_type=DOC_TYPE, id=doc["doc_id"], body={
        "date": doc["date"],
        "title": doc["title"],
        "body": doc["body"]
    })

In [16]:
index_collection("data/reuters21578-000.xml", index_doc)
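
Indexing one document per request works for a small collection. As a possible alternative, the same dicts could be turned into bulk actions and sent in batches. The sketch below only builds the action dicts, so it runs without a cluster; the function name `bulk_actions` and the sample document are hypothetical, and the commented-out call assumes the `elasticsearch.helpers.bulk` helper of the Python client.

```python
def bulk_actions(docs, index_name):
    """Yield one bulk action per document dict produced by the parser."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc["doc_id"],
            "_source": {
                "date": doc["date"],
                "title": doc["title"],
                "body": doc["body"],
            },
        }


# Hypothetical document in the shape passed to the callback
sample_docs = [
    {"doc_id": 1, "date": "26-FEB-1987 15:01:01.79",
     "title": "SAMPLE TITLE", "body": "Sample body."},
]
actions = list(bulk_actions(sample_docs, "reuters"))

# With a running cluster, the actions could then be sent in batches
# (assumption: the client's bulk helper is available):
# from elasticsearch import helpers
# helpers.bulk(es, actions)
```

To use this with `index_collection`, the callback could append each document to a list, which is then passed to `bulk_actions` once parsing has finished.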

## Test

In [19]:
from pprint import pprint

Retrieve the contents of a given document from the index.

In [20]:
doc = es.get(index=INDEX_NAME, doc_type=DOC_TYPE, id=3)
pprint(doc)

{'_id': '3',
 '_index': 'reuters',
 '_primary_term': 1,
 '_seq_no': 2,
 '_source': {'body': "Texas Commerce Bancshares Inc's Texas\n"
                     'Commerce Bank-Houston said it filed an application with '
                     'the\n'
                     'Comptroller of the Currency in an effort to create the '
                     'largest\n'
                     'banking network in Harris County.\n'
                     '    The bank said the network would link 31 banks '
                     'having\n'
                     '13.5 billion dlrs in assets and 7.5 billion dlrs in '
                     'deposits.\n'
                     '       \n'
                     ' Reuter\n',
             'date': '26-FEB-1987 15:03:27.51',
             'title': 'TEXAS COMMERCE BANCSHARES <TCB> FILES PLAN'},
 '_type': 'doc',
 '_version': 1,
 'found': True}


## Retrieval

Run a query against the index.

In [27]:
query = "nuclear weapons"
hits = es.search(index=INDEX_NAME, q=query, _source=False, size=10)['hits']['hits']

# Print document id, title, and retrieval score
for h in hits:
    doc = es.get(index=INDEX_NAME, doc_type=DOC_TYPE, id=h['_id'])
    print("{} {} ({})".format(h['_id'], doc['_source']['title'], h['_score']))

454 NUCLEAR DATA <NDI> GETS EXTENSIONS ON LOANS (8.902655)
368 PHILADELPHIA PORT CLOSED BY TANKER CRASH (7.5782523)
155 VARIAN <VAR>, SIEMENS FORM JOINT VENTURE (6.4732018)
195 AGENCY VOTES TO END LOCAL NUCLEAR PLANT VETO (6.2643485)
563 SEC STAFF ADVISES FRAUD CHARGES AGAINST WPPSS (5.473523)
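
The loop above issues one extra `es.get` call per hit to look up the title. A possible alternative is to request only the `title` field as part of the search itself and format the hits directly from the response. The sketch below shows just the formatting step on a hand-written stand-in for a search response, so it runs without a cluster; `format_hits` is a hypothetical helper, and the commented-out search call assumes that `_source` accepts a list of field names.

```python
def format_hits(response):
    """Return 'id title (score)' lines from a search response dict."""
    lines = []
    for h in response["hits"]["hits"]:
        lines.append("{} {} ({})".format(
            h["_id"], h["_source"]["title"], h["_score"]))
    return lines


# Hand-written stand-in for a response, using the top hit printed above
fake_response = {
    "hits": {
        "hits": [
            {"_id": "454",
             "_score": 8.902655,
             "_source": {"title": "NUCLEAR DATA <NDI> GETS EXTENSIONS ON LOANS"}},
        ]
    }
}
lines = format_hits(fake_response)

# Against a running cluster, the response could come from (assumption):
# response = es.search(index=INDEX_NAME, q=query, _source=["title"], size=10)
```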


## Feedback

Please give (anonymous) feedback on this exercise by filling out [this form](https://forms.gle/22o3ursi5YsR1Ztb8).