# Elasticsearch Setup

For this project, we will be building an open-domain question answering system. There are three major components to such a system:

* Database

* Retriever

* Reader

In this notebook we will set up the first part, the *database*, for which we will be using Elasticsearch.

Before creating our Elasticsearch index, we need to load our data. We will be using *Meditations* by Marcus Aurelius; a clean version of the text can be found at:

https://raw.githubusercontent.com/jamescalam/transformers/main/data/text/meditations/clean.txt

We will download it with the `requests` library and split it into one paragraph per line.

In [1]:
import requests

In [2]:
data = requests.get('https://raw.githubusercontent.com/jamescalam/transformers/main/data/text/meditations/clean.txt')
text = data.text.split('\n')

In [3]:
text[:3]

['From my grandfather Verus I learned good morals and the government of my temper.',
 'From the reputation and remembrance of my father, modesty and a manly character.',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.']
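Splitting on `'\n'` keeps any empty strings produced by blank or trailing lines in the file. As a minimal sketch, assuming such empty entries are unwanted in the index, the hypothetical helper `split_paragraphs` below (not part of the original notebook) drops them:

```python
def split_paragraphs(raw_text):
    """Split raw text into one paragraph per line, dropping blank lines."""
    # strip() removes surrounding whitespace; lines that end up empty are skipped
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


sample = "First paragraph.\n\nSecond paragraph.\n"
print(split_paragraphs(sample))  # ['First paragraph.', 'Second paragraph.']
```

If the source file contains blank lines, filtering them changes the number of items, so the counts shown later in this notebook may differ. Calling `data.raise_for_status()` before reading `data.text` would also surface a failed download early.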

Now we can move on to setting up an index in Elasticsearch. First, let's confirm that Elasticsearch is up and running.

In [4]:
# requests.get('http://localhost:9200/_cluster/health').json()
import ast

from elasticsearch import Elasticsearch, RequestsHttpConnection

def reading(file_name='../11_reader-retriever_qa_with_haystack/credentials.txt'):
    # literal_eval parses Python literals only, so unlike eval it cannot execute code
    with open(file_name, 'r') as f:
        return ast.literal_eval(f.read())

credential_dict = reading()

# verify_certs=False skips TLS certificate checks; only suitable for local testing
es = Elasticsearch(
    host='localhost',
    connection_class=RequestsHttpConnection,
    http_auth=(credential_dict['username'], credential_dict['pwd']),
    use_ssl=True,
    verify_certs=False,
)
print(es.cluster.health())

{'cluster_name': 'elasticsearch', 'status': 'yellow', 'timed_out': False, 'number_of_nodes': 1, 'number_of_data_nodes': 1, 'active_primary_shards': 5, 'active_shards': 5, 'relocating_shards': 0, 'initializing_shards': 0, 'unassigned_shards': 3, 'delayed_unassigned_shards': 0, 'number_of_pending_tasks': 0, 'number_of_in_flight_fetch': 0, 'task_max_waiting_in_queue_millis': 0, 'active_shards_percent_as_number': 62.5}
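A `yellow` status means every primary shard is assigned but some replica shards are not. On a single-node cluster this is expected, because a replica cannot be placed on the same node as its primary. The hypothetical helper `summarize_health` below is a small sketch of how the response could be condensed into one readable line; it is not part of the Elasticsearch client.

```python
def summarize_health(health):
    """Return a one-line, human-readable summary of a cluster health response."""
    meanings = {
        'green': 'all primary and replica shards are assigned',
        'yellow': 'all primary shards are assigned, but some replicas are not',
        'red': 'at least one primary shard is unassigned',
    }
    status = health['status']
    return (
        f"{status}: {meanings.get(status, 'unknown status')} "
        f"({health['unassigned_shards']} unassigned, "
        f"{health['number_of_nodes']} node(s))"
    )


# Values copied from the response printed above
health = {'status': 'yellow', 'unassigned_shards': 3, 'number_of_nodes': 1}
print(summarize_health(health))
```

In the notebook itself, the same function could be called as `summarize_health(es.cluster.health())`.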




Next, we check which indices currently exist.

In [5]:
# print(requests.get('http://localhost:9200/_cat/indices').text)
es.indices.get(index="*")



{'aurelius': {'aliases': {},
  'mappings': {'dynamic_templates': [{'strings': {'path_match': '*',
      'match_mapping_type': 'string',
      'mapping': {'type': 'keyword'}}}],
   'properties': {'content': {'type': 'text'},
    'content_type': {'type': 'keyword'},
    'embedding': {'type': 'dense_vector', 'dims': 768},
    'name': {'type': 'keyword'},
    'source': {'type': 'keyword'}}},
  'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}},
    'number_of_shards': '1',
    'provided_name': 'aurelius',
    'creation_date': '1668568092396',
    'analysis': {'analyzer': {'default': {'type': 'standard'}}},
    'number_of_replicas': '1',
    'uuid': '-BBVZF5AQGCFFwH-BlO_lQ',
    'version': {'created': '8050099'}}}},
 'label': {'aliases': {},
  'mappings': {'properties': {'answer': {'type': 'flattened'},
    'created_at': {'type': 'date',
     'format': 'yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis'},
    'document': {'type': 'flattened'},
  

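Now let's initialize the index *aurelius*, which we will use to store our *Meditations* dataset. If the index already exists from an earlier run, as in the listing above, the document store connects to it rather than creating a new one.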

In [6]:
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    scheme='https', 
    username=credential_dict['username'], 
    password=credential_dict['pwd'], 
    ca_certs=credential_dict['ca_certs'],
    index='aurelius'
)


In [7]:
# print(requests.get('http://localhost:9200/_cat/indices').text)
es.indices.get(index="*")['aurelius']



{'aliases': {},
 'mappings': {'dynamic_templates': [{'strings': {'path_match': '*',
     'match_mapping_type': 'string',
     'mapping': {'type': 'keyword'}}}],
  'properties': {'content': {'type': 'text'},
   'content_type': {'type': 'keyword'},
   'embedding': {'type': 'dense_vector', 'dims': 768},
   'name': {'type': 'keyword'},
   'source': {'type': 'keyword'}}},
 'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}},
   'number_of_shards': '1',
   'provided_name': 'aurelius',
   'creation_date': '1668568092396',
   'analysis': {'analyzer': {'default': {'type': 'standard'}}},
   'number_of_replicas': '1',
   'uuid': '-BBVZF5AQGCFFwH-BlO_lQ',
   'version': {'created': '8050099'}}}}

Now we need to format our data as a list of dictionaries before passing it along to Elasticsearch. Each dictionary will have the following format:

```json
{
    "content": "<paragraph>",
    "meta": {
        "source": "meditations"
    }
}
```

In [8]:
data_json = [
    {
        'content': paragraph,
        'meta': {
            'source': 'meditations'
        }
    } for paragraph in text
]

In [9]:
data_json[:3]

[{'content': 'From my grandfather Verus I learned good morals and the government of my temper.',
  'meta': {'source': 'meditations'}},
 {'content': 'From the reputation and remembrance of my father, modesty and a manly character.',
  'meta': {'source': 'meditations'}},
 {'content': 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.',
  'meta': {'source': 'meditations'}}]

In [10]:
len(data_json)

507
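Before writing, it can be worth checking that every dictionary really follows the format described above. This is a minimal sketch under that assumption; `validate_documents` is a hypothetical helper for illustration, not a Haystack function.

```python
def validate_documents(docs):
    """Raise ValueError if a document lacks a string 'content' or has a non-dict 'meta'."""
    for i, doc in enumerate(docs):
        if not isinstance(doc.get('content'), str):
            raise ValueError(f"document {i}: 'content' must be a string")
        # 'meta' is treated as optional here, but must be a dictionary when present
        if not isinstance(doc.get('meta', {}), dict):
            raise ValueError(f"document {i}: 'meta' must be a dictionary")
    return len(docs)


docs = [{'content': 'A sample paragraph.', 'meta': {'source': 'meditations'}}]
print(validate_documents(docs))  # 1
```

In the notebook, `validate_documents(data_json)` would return the number of documents if all of them pass the check.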

Now we simply write our data to Elasticsearch.

In [11]:
doc_store.write_documents(data_json)

Finally, we confirm that all *507* items were uploaded.

In [12]:
# requests.get('http://localhost:9200/aurelius/_count').json()
result=es.search(index='aurelius')
print(result['hits']['total'])

{'value': 507, 'relation': 'eq'}




Perfect!
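One caveat on the check above: by default, the total reported by a search is only tracked exactly up to 10,000 hits, after which the `relation` becomes `gte`. For larger indices, the count API returns an exact figure. The sketch below wraps `count` in a hypothetical helper, `count_documents`, and uses a stand-in client so that it runs without a cluster; both names are assumptions for illustration.

```python
def count_documents(client, index):
    """Return the exact number of documents in `index` via the count API."""
    return client.count(index=index)['count']


class FakeClient:
    """Hypothetical stand-in for an Elasticsearch client, for illustration only."""

    def count(self, index):
        # Mimics the shape of a count API response
        return {'count': 507}


print(count_documents(FakeClient(), 'aurelius'))  # 507
```

With the real client from this notebook, the call would be `count_documents(es, 'aurelius')`.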