# Elasticsearch With Haystack

We will be communicating with our Elasticsearch document store via Haystack. First, we need to install Haystack using pip:

On Windows:

```
pip install farm-haystack -f https://download.pytorch.org/whl/torch_stable.html
```

Anything else:

```
pip install farm-haystack
```

We will start by indexing the SQuAD dev data. So let's load that into our notebook first.



In [1]:
import json

with open('../../data/squad/dev.json', 'r') as f:
    squad = json.load(f)

Next, we load our Elasticsearch credentials from a local file and then initialize a connection between Haystack and our local Elasticsearch instance like so:

In [2]:
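import ast

def reading(file_name='credentials.txt'):
    # Assumes the file holds a Python dict literal with 'username', 'pwd' and 'ca_certs' keys
    with open(file_name, 'r') as f:
        # literal_eval only parses literals, so unlike eval it cannot execute arbitrary code
        return ast.literal_eval(f.read())

credential_dict = reading()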

In [4]:
from haystack.document_stores import ElasticsearchDocumentStore

# Note: scheme='https' had to be added for the connection to work
document_store = ElasticsearchDocumentStore(host='localhost', scheme='https', username=credential_dict['username'], password=credential_dict['pwd'], ca_certs=credential_dict['ca_certs'], index='squad_docs')

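Great, we've established our connection. Now let's try querying our Elasticsearch instance directly. This was originally done through the `requests` library, but because plain `requests` calls ran into certificate/HTTPS issues, the cells below go through the `elasticsearch` Python client instead (the original `requests` calls are kept as comments).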

In [5]:
import requests

Let's check our cluster *health* (i.e. the general status of our Elasticsearch instance). We do this by sending a **GET** request to the `_cluster/health` endpoint, which is what `es.cluster.health()` does for us.

In [6]:
from elasticsearch import Elasticsearch, RequestsHttpConnection
es = Elasticsearch(host='localhost', connection_class=RequestsHttpConnection, http_auth=(credential_dict['username'], credential_dict['pwd']),use_ssl=True, verify_certs=False)
print(es.cluster.health())

# The requests call below fails with certificate/HTTPS errors on recent versions of Elasticsearch

# res = requests.get('https://localhost:9200/_cluster/health')

# res.json()

{'cluster_name': 'elasticsearch', 'status': 'yellow', 'timed_out': False, 'number_of_nodes': 1, 'number_of_data_nodes': 1, 'active_primary_shards': 4, 'active_shards': 4, 'relocating_shards': 0, 'initializing_shards': 0, 'unassigned_shards': 2, 'delayed_unassigned_shards': 0, 'number_of_pending_tasks': 0, 'number_of_in_flight_fetch': 0, 'task_max_waiting_in_queue_millis': 0, 'active_shards_percent_as_number': 66.66666666666666}




Okay, we can see that the cluster is running. The cluster status is *yellow*. Ideally we want *green*, but we see yellow here because not all replica shards have been allocated to nodes. The details don't matter much for our purposes: it means that we don't have a full set of backup (*replica*) shards, which is only a problem if our *primary* shards get corrupted or lost. That is beyond the scope of what we are doing here.
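
To make the status reading concrete, here is a minimal sketch that turns a health response into a one-line summary. The `summarize_health` helper is hypothetical (it is not part of Haystack or the Elasticsearch client), and the dictionary is just a subset of the response shown above.

```python
def summarize_health(health):
    """Turn a `_cluster/health` response dict into a one-line summary."""
    status = health["status"]
    unassigned = health["unassigned_shards"]
    active_pct = health["active_shards_percent_as_number"]

    if status == "green":
        note = "all primary and replica shards are allocated"
    elif status == "yellow":
        # Yellow means every primary is allocated, so the unassigned shards are replicas.
        note = f"all primaries are allocated, but {unassigned} replica shard(s) are unassigned"
    else:  # "red"
        note = "at least one primary shard is unallocated"

    return f"{status}: {note} ({active_pct:.1f}% of shards active)"


# Subset of the response shown above.
health = {
    "status": "yellow",
    "active_primary_shards": 4,
    "active_shards": 4,
    "unassigned_shards": 2,
    "active_shards_percent_as_number": 66.66666666666666,
}

summary = summarize_health(health)
print(summary)
```

On a single-node setup like this one, unassigned replicas are expected, since a replica is not placed on the same node as its primary.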

## Adding Data

Right now our Elasticsearch instance contains a single, empty index called *'squad_docs'*. We need to populate this with our `squad` data. We populate our index through the `document_store.write_documents(<input_data>)` method, where our *\<input_data\>* must be a list of dictionaries in the format:

```python
{
    'content': '<document text here>',
    'meta': {
        'other': '<other info here>'
    }
}
```

#### Note: As of Haystack version 1.0, the 'text' key has been replaced with 'content'

We **must** include the `'content'` key, which holds the text of each sample. In our case that is a *context* string. The `'meta'` data is optional, and is usually used to hold anything else that might be relevant. For example, we might want to include the *group* that the context came from (e.g. 'Beyonce' or 'Matter').
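
As a rough illustration of this format, the sketch below builds documents that carry a `'meta'` entry. The sample records and the `'group'` key are assumptions made for the example (the real `squad` records may be shaped differently), and `to_haystack_doc` is a hypothetical helper.

```python
# Hypothetical samples: only the 'context' key is known to exist in the real data.
samples = [
    {"context": "Placeholder text about Beyonce.", "group": "Beyonce"},
    {"context": "Placeholder text about matter."},
]


def to_haystack_doc(sample):
    """Convert one sample into the dictionary format expected by write_documents."""
    doc = {"content": sample["context"]}
    # 'meta' is optional, so only add it when we have something to store.
    if "group" in sample:
        doc["meta"] = {"group": sample["group"]}
    return doc


docs = [to_haystack_doc(sample) for sample in samples]
print(docs)
```

A list built this way could then be passed to `document_store.write_documents` in the same way as `squad_docs`.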

In [7]:
# Updated 'text' to 'content'
squad_docs = []

for sample in squad:
    squad_docs.append({
        # 'text': sample['context']
        'content': sample['context']
    })

Then we add our data to the index like this:

In [8]:
document_store.write_documents(squad_docs)

## Retrieving Data

When retrieving documents from Elasticsearch, we will use either the TF-IDF or the BM25 algorithm.

**TF-IDF** is a common *relevance* scoring algorithm. The score is built from two components:

* **TF** (term frequency), how often the words in the query (question) appear in the document.

* **IDF** (inverse document frequency), the inverse of the fraction of documents that contain the same word (e.g. common words like *'the'* don't score well, whereas *'Beyonce'* would).
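
The two components above can be sketched in a few lines of plain Python. This is a simplified illustration using raw term counts and `log(N / df)`, not the implementation behind Haystack's `TfidfRetriever`; the whitespace tokenizer and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase/whitespace tokenizer, good enough for a sketch.
    return text.lower().split()


def tfidf_score(query, doc, corpus):
    """Sum TF * IDF over the distinct query terms for one document."""
    doc_counts = Counter(tokenize(doc))
    n_docs = len(corpus)
    score = 0.0
    for term in set(tokenize(query)):
        tf = doc_counts[term]                               # occurrences in this document
        df = sum(term in tokenize(d) for d in corpus)       # documents containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(n_docs / df)                         # rare terms get a larger weight
        score += tf * idf
    return score


corpus = [
    "the normans settled in normandy",
    "the matter of the shards",
    "the singer released the album",
]
scores = [tfidf_score("the normans", doc, corpus) for doc in corpus]
print(scores)
```

Because *'the'* appears in every document its IDF is `log(3 / 3) = 0`, so only the rarer word *'normans'* contributes to the score.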

We set up a TF-IDF retriever using:

In [9]:
# from haystack.retriever.sparse import TfidfRetriever
from haystack.nodes import TfidfRetriever

retriever = TfidfRetriever(document_store)

We can see here that when building our retriever, it identified a total of *16209* 'candidate paragraphs'. These are all of the contexts from our `squad` data:

In [10]:
len(squad)

16209

We can now retrieve documents from Elasticsearch, ranked by the **TF-IDF** algorithm, with the `retrieve` method.

In [11]:
query = "What century did the Normans first gain their separate identity?"

retriever.retrieve(query)

[<Document: {'content': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.', 'content_type': 'text', 'score': None, 'meta': {}, 'embedding': None, 'id': '220619be4e664f1551c59977faf57fdc'}>,
 <Document: {'content': "A few years after the First Crusade, in 1107, the Normans under the command of Bohemond, Ro

## Manipulating the Index

This query returns a huge number of duplicates. They appear because our data contains the same context several times, once for each question tied to it. So we need to start over: first we delete everything inside our *squad_docs* index, then we re-index our deduplicated data.
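
To see how such duplicates arise, here is a small sketch that counts repeated contexts. The records are made up for illustration, and the `'question'` key is an assumption about the shape of the data; only `'context'` is used in this notebook.

```python
from collections import Counter

# Hypothetical question/context pairs: two questions share the same context.
squad_sample = [
    {"question": "Q1", "context": "Context A"},
    {"question": "Q2", "context": "Context A"},
    {"question": "Q3", "context": "Context B"},
]

# Count how many times each context string occurs.
context_counts = Counter(sample["context"] for sample in squad_sample)

n_unique = len(context_counts)
n_duplicates = sum(count - 1 for count in context_counts.values())
print(n_unique, n_duplicates)
```

Running the same count over the full `squad` list would show how many contexts are repeated.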

We can delete every document in our index by sending a **POST** request to the `<index_name>/_delete_by_query` endpoint:

In [12]:
# from elasticsearch import Elasticsearch, RequestsHttpConnection
# es = Elasticsearch(host='localhost', connection_class=RequestsHttpConnection, http_auth=(credential_dict['username'], credential_dict['pwd']),use_ssl=True, verify_certs=False)

es.delete_by_query(index="squad_docs", body={"query": {'match_all': {}}})

# Does not work in current version of elasticsearch
# res = requests.post('http://localhost:9200/squad_docs/_delete_by_query',
#                     json={
#                         'query': {
#                             'match_all': {}
#                         }
#                     })

# res.json()



{'took': 57,
 'timed_out': False,
 'total': 1204,
 'deleted': 1204,
 'batches': 2,
 'version_conflicts': 0,
 'noops': 0,
 'retries': {'bulk': 0, 'search': 0},
 'throttled_millis': 0,
 'requests_per_second': -1.0,
 'throttled_until_millis': 0,
 'failures': []}

The `'deleted'` field in the response tells us how many documents were removed from our *squad_docs* index (*1204* in the output shown above). We can confirm that the index is now empty by requesting its document count too:

In [13]:
es.cat.count(index="squad_docs", params={"format": "json"})

# Does not work in current version of elasticsearch
# res = requests.get('http://localhost:9200/squad_docs/_count')

# res.json()



[{'epoch': '1668402264', 'timestamp': '05:04:24', 'count': '0'}]

Now that we've cleared the index, it's time to remove duplicates from our SQuAD contexts and re-index them.

In [14]:
# create list of contexts (we cannot do this using current dictionary format)
contexts = [sample['context'] for sample in squad]

# convert to set to remove duplicates, then back to list
contexts = list(set(contexts))

# convert back to dictionary format we need
squad_docs = [{'content': sample} for sample in contexts]
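
One thing to be aware of is that converting to a `set` does not preserve the original order of the contexts. If order matters (for example, to keep runs reproducible), a minimal alternative sketch is to use `dict.fromkeys`, which keeps the first occurrence of each item in order. The context strings below are placeholders.

```python
# Placeholder contexts with one repeated entry.
contexts = ["Context A", "Context B", "Context A", "Context C"]

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance.
unique_contexts = list(dict.fromkeys(contexts))

# Convert back to the dictionary format we need.
docs = [{"content": context} for context in unique_contexts]
print(docs)
```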

Finally, we can re-index our data in Elasticsearch as we did before.

In [18]:
len(squad_docs)

1204

In [19]:
document_store.write_documents(squad_docs)

Because we have changed the contents of our index, we initialize our retriever once more.

In [20]:
retriever = TfidfRetriever(document_store)

And this time we see that our retriever found *1204* documents (far fewer than the *16209* we found before). Now it's time to query our data again!

In [21]:
retriever.retrieve(query)

[<Document: {'content': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.', 'content_type': 'text', 'score': None, 'meta': {}, 'embedding': None, 'id': '220619be4e664f1551c59977faf57fdc'}>,
 <Document: {'content': "A few years after the First Crusade, in 1107, the Normans under the command of Bohemond, Ro

## TF-IDF vs BM25

#### TF: Term Frequency

#### IDF: Inverse Document Frequency = log(total number of documents in the collection / document frequency of the term)

* BM25 is a variation of TF-IDF.
* BM25 still calculates TF and IDF, but the TF score is dampened as the number of matches between query and context grows, so each additional match contributes less.
* BM25 also takes document length into account to normalize the score, i.e. a shorter document scores higher than a longer document with the same number of matches.
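
The dampening and the length normalization can be sketched with the textbook Okapi BM25 formula for a single term. This is an illustration only: the exact formula used inside Elasticsearch may differ in detail, `bm25_term_score` is a hypothetical helper, and the numbers passed to it are made up. The values `k1=1.2` and `b=0.75` are the commonly cited defaults.

```python
import math


def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one query term against one document with textbook Okapi BM25."""
    # IDF: rarer terms (small df) get a larger weight.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: > 1 for long documents, < 1 for short ones.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # TF saturation: the score grows with tf but flattens out.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Made-up corpus statistics for illustration.
stats = dict(df=10, n_docs=1204, avg_doc_len=100)

# Dampening: going from 1 to 2 matches helps more than going from 10 to 11.
gain_low = bm25_term_score(2, doc_len=100, **stats) - bm25_term_score(1, doc_len=100, **stats)
gain_high = bm25_term_score(11, doc_len=100, **stats) - bm25_term_score(10, doc_len=100, **stats)

# Length normalization: same number of matches, different document lengths.
short_doc = bm25_term_score(3, doc_len=50, **stats)
long_doc = bm25_term_score(3, doc_len=200, **stats)

print(gain_low > gain_high, short_doc > long_doc)
```

With plain TF-IDF, by contrast, every extra match adds the same amount to the score and document length plays no role.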

With the duplicates removed, our TF-IDF query now returns a set of relevant, distinct documents.

Finally, let's return to the other *sparse retriever* that we can use with Elasticsearch. We have already used **TF-IDF**. By swapping `TfidfRetriever` for `BM25Retriever` (called `ElasticsearchRetriever` in older Haystack releases) we can switch to the **BM25** algorithm, which is an *improved* version of **TF-IDF** and is recommended by Haystack.

So, let's initialize that and make another query with it.

In [23]:
# import BM25 retriever
from haystack.nodes import BM25Retriever

# initialize
retriever = BM25Retriever(document_store)

# and query
retriever.retrieve(query)

[<Document: {'content': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.', 'content_type': 'text', 'score': 0.9038276026059531, 'meta': {}, 'embedding': None, 'id': '220619be4e664f1551c59977faf57fdc'}>,
 <Document: {'content': 'In the visual arts, the Normans did not have the rich and distinctive traditi

Okay, great. This is a pretty big notebook, but it covers everything we need to know to get started with Haystack + Elastic (and a little more).