# Sparse Retrieval Using Elasticsearch

## Goals

- Understand the Python Elasticsearch client
- Map BM25 to Elasticsearch
- Compute evaluation metrics
- Other uses of Elasticsearch

## Imports

In [1]:
import pandas as pd
import tqdm.auto
import numpy as np
import glob
import concurrent.futures
import multiprocessing
import requests
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk
import pprint

In [2]:
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Data

For this workshop, we have two files.

`posts.parquet`: contains a subset of Stack Overflow posts

`related_posts.parquet`: contains question pairs that were marked as duplicates

In [3]:
path_posts = "gs://np-training-tmp/stackoverflow/final_subset/posts.parquet"
path_posts_related = "gs://np-training-tmp/stackoverflow/final_subset/related_posts.parquet"

In [4]:
#ELASTIC_HOST="np-database.c.np-training.internal"
ELASTIC_HOST="localhost"
ELASTIC_INDEX="stackoverflow"
ELASTIC_PORT=9200

In [18]:
def create_index(client, index: str, num_shards=3):
    """Create an index in Elasticsearch, deleting any old index with the same name."""

    # ignore_unavailable avoids an error when the index does not exist yet
    client.indices.delete(index=index, ignore_unavailable=True)

    # no explicit mappings: Elasticsearch infers the field types from the documents
    client.indices.create(
        index=index,
        settings={"number_of_shards": num_shards},
    )


def generate_docs(df: pd.DataFrame):
    """
    Given a dataframe containing posts data, yield one dictionary per post.
    """

    # iterate over the dataframe of posts with metadata
    for index, row in df.iterrows():
        doc = {**row}

        # use the post Id as the document id
        doc['_id'] = doc["Id"]

        for k in list(doc.keys()):
            # don't insert None/NaN fields (lists such as Tags are kept as-is)
            if not isinstance(doc[k], list) and (doc[k] is None or pd.isna(doc[k])):
                del doc[k]

        del doc['Id']
        yield doc
        


def fetch_results(client: Elasticsearch, query: str, num_hits=5, fields=("Title", "QuestionBody")):
    """
    With the passed Elasticsearch client, return documents that match the passed `query` in the fields specified by `fields`.

    If `fields` is empty, the search is not restricted to specific fields.
    """

    request_body = {
        "query": {
            "multi_match": {
                "query": query,
            }
        }
    }

    # only restrict the search when fields were given
    if fields:
        request_body["query"]["multi_match"]["fields"] = list(fields)

    resp = client.search(
        index=ELASTIC_INDEX,  # search only our index, not every index on the cluster
        body=request_body,
        size=num_hits,
    )

    return resp
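
To make the two cases in `fetch_results` concrete, here is a small sketch that builds the `multi_match` request body on its own, without contacting a cluster. `build_multi_match_body` is a hypothetical helper introduced only for illustration; it is not part of the Elasticsearch client.

```python
import json
from typing import Optional, Sequence


def build_multi_match_body(query: str, fields: Optional[Sequence[str]] = None) -> dict:
    """Build a multi_match request body; leave out "fields" when none are given."""
    multi_match = {"query": query}
    if fields:
        multi_match["fields"] = list(fields)
    return {"query": {"multi_match": multi_match}}


# Restricted to two fields, as in the default arguments of fetch_results.
restricted = build_multi_match_body("pandas memory issue", ["Title", "QuestionBody"])

# No field restriction: the "fields" key is simply absent from the body.
unrestricted = build_multi_match_body("pandas memory issue")

print(json.dumps(restricted, indent=2))
print(json.dumps(unrestricted, indent=2))
```

Printing the body before sending it is a cheap way to check what a query will actually ask for.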
    

        

## Index Documents

In [7]:
df = pd.read_parquet(path_posts)
df['Tags']  = df['Tags'].apply(lambda x: x.tolist())

In [8]:
df.head()

Unnamed: 0,Id,AcceptedAnswerId,Title,QuestionBody,Tags,ViewCount,AnswerCount,CommentCount,Score,CreationDate,AnswerId,AcceptedAnswerBody
1,15020895,,Python int-byte efficient data structure,i am currently storing key-values of type int-...,"[python, data-structures]",155,0,3,1,2013-02-22 09:33:26.360,,
9,68487902,,Why does the Variance of Laplace very differen...,TL;DR: How can I use skimage.filters.laplace(i...,"[python, opencv, image-processing, computer-vi...",391,0,5,1,2021-07-22 15:50:34.220,,
15,61391327,,Why input never ends,I have python 3.7 installed and I have this co...,"[python, python-3.x, input]",104,1,6,3,2020-04-23 15:43:03.497,,
27,28852710,,Crashes with piecewise linear objective for gu...,We have a complex optimization problem which i...,"[python, crash, gurobi, piecewise]",403,1,1,3,2015-03-04 10:58:16.370,,
29,24043029,,Python TypeError: plotdatehist() got an unexpe...,apologies beforehand if this is a stupid quest...,"[python, typeerror]",419,0,7,0,2014-06-04 16:42:32.257,,


A sample document from our input file.

The main fields we will be searching against are `Title` and `QuestionBody`.

In [9]:
df.iloc[0].to_dict()

{'Id': 15020895,
 'AcceptedAnswerId': nan,
 'Title': 'Python int-byte efficient data structure',
 'QuestionBody': 'i am currently storing key-values of type int-int. For fast access, I am currently using an BTrees.IIBTree structure stored in memory. It is not stored on disk at all since we need the most recent data.\n\nHowever, the current solution barely fits into memory, so I am looking for a more efficient database or data structure in terms of access time. In case it would be stored in memory it also needs to be efficient in terms of memory space. \n\nOne idea would be to replace the BTrees.IIBTree structure with a int-byte hash written in C as an extension for Python, but the data would still be lost in case the machine fails (not a terrible thing in our case).\n\nWhat are your suggestions?\n',
 'Tags': ['python', 'data-structures'],
 'ViewCount': 155,
 'AnswerCount': 0,
 'CommentCount': 3,
 'Score': 1,
 'CreationDate': Timestamp('2013-02-22 09:33:26.360000'),
 'AnswerId': nan,
 '

In [10]:
?Elasticsearch

Init signature:
Elasticsearch(
    hosts: Union[str, List[Union[str, Mapping[str, Union[str, int]], elastic_transport.NodeConfig]], NoneType] = None,
    *,
    cloud_id: Union[str, NoneType] = None,
    api_key: Union[str, Tuple[str, str], NoneType] = 

Create a client object connected to our Elasticsearch cluster.

In [15]:
client = Elasticsearch(
    [f'http://{ELASTIC_HOST}:{ELASTIC_PORT}']
)

Tell Elasticsearch to create an index.
An ES index is a collection of documents.

ES supports inferring the field types from the documents themselves, so we do not have to specify a schema beforehand.
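
When no schema is given, Elasticsearch's dynamic mapping typically indexes string values as `text` with an extra `keyword` sub-field. If you would rather declare the types yourself, a mapping along the following lines could be supplied when the index is created. This is only a sketch: the field names come from the posts dataframe, but every type chosen here is an assumption, not something the original notebook configured.

```python
# Hypothetical explicit mapping for the columns of the posts dataframe.
# All field types below are assumptions made for illustration.
explicit_mappings = {
    "properties": {
        "Title": {"type": "text"},
        "QuestionBody": {"type": "text"},
        "AcceptedAnswerBody": {"type": "text"},
        "Tags": {"type": "keyword"},
        "ViewCount": {"type": "integer"},
        "AnswerCount": {"type": "integer"},
        "CommentCount": {"type": "integer"},
        "Score": {"type": "integer"},
        "CreationDate": {"type": "date"},
    }
}

# With the 8.x Python client, this could be passed at creation time, e.g.
# client.indices.create(index="stackoverflow", mappings=explicit_mappings)

# List the fields that would be analysed as full text (and scored by BM25).
text_fields = sorted(
    name
    for name, spec in explicit_mappings["properties"].items()
    if spec["type"] == "text"
)
print(text_fields)
```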

In [19]:
create_index(client, index= ELASTIC_INDEX)



In [20]:
requests.get(f"http://{ELASTIC_HOST}:{ELASTIC_PORT}/_all/_settings").json()

{'test-index': {'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}},
    'number_of_shards': '5',
    'provided_name': 'test-index',
    'creation_date': '1666564416488',
    'number_of_replicas': '1',
    'uuid': 'x7OXazwcTMWVVb7EeDW9pQ',
    'version': {'created': '8040399'}}}},
 'stackoverflow': {'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}},
    'number_of_shards': '3',
    'provided_name': 'stackoverflow',
    'creation_date': '1667073999152',
    'number_of_replicas': '1',
    'uuid': 'RaAXbjfMTVuY1s02_EjCTA',
    'version': {'created': '8040399'}}}}}

The index we created is composed of `3` shards and `1` replica.

When searching, ES queries each shard independently and then merges the per-shard results into a single ranked list.
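
The merge step can be pictured with a toy simulation. This is a conceptual sketch only, not how Elasticsearch is implemented internally: `merge_shard_hits` is a hypothetical function and the scores are invented.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def merge_shard_hits(shard_hits: Dict[int, List[Hit]], size: int) -> List[Hit]:
    """Merge each shard's local top hits into one global top-`size` list by score."""
    all_hits = (hit for hits in shard_hits.values() for hit in hits)
    return heapq.nlargest(size, all_hits, key=lambda hit: hit[0])


# Each shard returns its own best hits (toy scores, invented for illustration).
shard_hits = {
    0: [(9.1, "a"), (4.2, "b")],
    1: [(7.5, "c"), (7.0, "d")],
    2: [(8.3, "e"), (1.1, "f")],
}

top3 = merge_shard_hits(shard_hits, size=3)
print(top3)  # [(9.1, 'a'), (8.3, 'e'), (7.5, 'c')]
```

Note that the global top 3 draws from all three shards, which is why every shard has to return its own top hits before the final ranking is known.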

In [30]:
df.head(10)

Unnamed: 0,Id,AcceptedAnswerId,Title,QuestionBody,Tags,ViewCount,AnswerCount,CommentCount,Score,CreationDate,AnswerId,AcceptedAnswerBody
1,15020895,,Python int-byte efficient data structure,i am currently storing key-values of type int-...,"[python, data-structures]",155,0,3,1,2013-02-22 09:33:26.360,,
9,68487902,,Why does the Variance of Laplace very differen...,TL;DR: How can I use skimage.filters.laplace(i...,"[python, opencv, image-processing, computer-vi...",391,0,5,1,2021-07-22 15:50:34.220,,
15,61391327,,Why input never ends,I have python 3.7 installed and I have this co...,"[python, python-3.x, input]",104,1,6,3,2020-04-23 15:43:03.497,,
27,28852710,,Crashes with piecewise linear objective for gu...,We have a complex optimization problem which i...,"[python, crash, gurobi, piecewise]",403,1,1,3,2015-03-04 10:58:16.370,,
29,24043029,,Python TypeError: plotdatehist() got an unexpe...,apologies beforehand if this is a stupid quest...,"[python, typeerror]",419,0,7,0,2014-06-04 16:42:32.257,,
33,42740305,,heat map for adjusted and unadjusted correlati...,"For the variable set, I want to plot the unadj...","[r, heatmap, correlation]",49,0,4,0,2017-03-11 20:33:31.470,,
52,24345637,,Why doesn't numpy.random and multiprocessing p...,"I have a random walk function, that uses numpy...","[python, arrays, numpy, random, multiprocessing]",5134,1,7,8,2014-06-21 20:32:30.203,,
58,46726368,,Tensor' object has no attribute 'shape',"when I using tensorflow with jupyter,I meet wi...","[python-2.7, tensorflow]",1033,1,0,0,2017-10-13 09:11:34.487,,
78,71711836,,When I delete last comment has just added I go...,I'm trying to add a comment section to my flas...,"[python, html, flask]",26,1,0,0,2022-04-01 19:19:35.823,,
79,48123342,,opencv read image assertion failed,I am a newbie to python and opencv.\ntrying to...,"[python, image, opencv, assertion]",1888,2,1,-1,2018-01-06 01:08:40.720,,


In [22]:
df_subset = df.head(5_000_000)
number_of_docs = len(df_subset)

Bulk insert all of our documents

In [23]:
with tqdm.auto.tqdm(total=number_of_docs , unit="docs" ) as pbar:
    successes = 0


    for ok, action in streaming_bulk(
            client=client, index=ELASTIC_INDEX, actions=generate_docs(df_subset) ,
        ):
        pbar.update(1)
        successes += ok


  0%|          | 0/219841 [00:00<?, ?docs/s]

Inserting `N` documents at `x` docs/second on a single node is pretty good.
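
Each item consumed by `streaming_bulk` is a plain dictionary: the `_id` key becomes the document ID and the remaining keys become the document body. The sketch below mimics what `generate_docs` does to a single row, using a plain dict instead of a dataframe so that it runs without any dependencies. `clean_row` is a hypothetical stand-in, not the function used above.

```python
import math


def clean_row(row: dict) -> dict:
    """Mimic generate_docs on one row: move Id to _id and drop missing values."""
    doc = dict(row)
    doc["_id"] = doc.pop("Id")
    return {
        key: value
        for key, value in doc.items()
        if not (value is None or (isinstance(value, float) and math.isnan(value)))
    }


# A row shaped like the sample document, trimmed for brevity.
row = {
    "Id": 15020895,
    "AcceptedAnswerId": float("nan"),
    "Title": "Python int-byte efficient data structure",
    "Tags": ["python", "data-structures"],
    "AnswerId": None,
}

cleaned = clean_row(row)
print(cleaned)
```

The NaN and `None` fields disappear, so Elasticsearch never sees them, and `Id` is replaced by `_id`.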

## Evaluate

In [26]:
?client.get

Signature:
client.get(
    *,
    index: str,
    id: str,
    error_trace: Union[bool, NoneType] = None,
    filter_path: Union[str, List[str], Tuple[str, ...], NoneType] = None,
    human: Union[bool, NoneType] = None,
    preference:

Get a specific document by its ID.

In [31]:
client.get(index=ELASTIC_INDEX, id=15020895)

ObjectApiResponse({'_index': 'stackoverflow', '_id': '15020895', '_version': 1, '_seq_no': 0, '_primary_term': 1, '_ignored': ['QuestionBody.keyword'], 'found': True, '_source': {'Title': 'Python int-byte efficient data structure', 'QuestionBody': 'i am currently storing key-values of type int-int. For fast access, I am currently using an BTrees.IIBTree structure stored in memory. It is not stored on disk at all since we need the most recent data.\n\nHowever, the current solution barely fits into memory, so I am looking for a more efficient database or data structure in terms of access time. In case it would be stored in memory it also needs to be efficient in terms of memory space. \n\nOne idea would be to replace the BTrees.IIBTree structure with a int-byte hash written in C as an extension for Python, but the data would still be lost in case the machine fails (not a terrible thing in our case).\n\nWhat are your suggestions?\n', 'Tags': ['python', 'data-structures'], 'ViewCount': 155

Retrieve documents with a query.

In [32]:
# https://stackoverflow.com/questions/34147471/elasticsearch-how-to-search-for-a-value-in-any-field-across-all-types-in-one

resp = client.search(
    body={
        "query": {
            "multi_match": {
                "query": "pandas memmory issue",
                # "fields": ["Title", "QuestionBody"],
            }
        }
    },
    size=5,
    explain=True
)




In [34]:
resp

ObjectApiResponse({'took': 6108, 'timed_out': False, '_shards': {'total': 8, 'successful': 8, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 10000, 'relation': 'gte'}, 'max_score': 13.040086, 'hits': [{'_shard': '[stackoverflow][2]', '_node': 'JW5Vy75nR5SxTJKsjne9JA', '_index': 'stackoverflow', '_id': '50681803', '_score': 13.040086, '_ignored': ['QuestionBody.keyword'], '_source': {'Title': 'convert sql code to python pandas dataframe operation', 'QuestionBody': 'So I am having trouble with processing and manipulating large ammount of data.\nMy table 1 consist of 2 milions records for example: \n\n[CODE]\n\nand another table with data: \n\n[CODE]\n\nI am creating a join:\n\n[CODE]\n\nThe result is a really big table (SELECT should return around 40milions rows\nThen I use groupby and filter method to further filter my records. I have a problem because I get MemmoryError when running my code. I was thinking about changing csv to better accomodate pandas dataframe (to avoid usin
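
The `_score` values in the response above come from Elasticsearch's default similarity, BM25 (with defaults `k1 = 1.2` and `b = 0.75`). The sketch below implements a textbook form of the formula for a single field. It is an illustration under stated assumptions, not Lucene's exact code: Lucene differs in details such as how document lengths are stored, so these numbers will not reproduce `_score` exactly. All corpus statistics in the example are invented.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document field with BM25."""
    # Rarer terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Documents longer than average are penalised; b controls how strongly.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Term frequency saturates: repeated occurrences add less and less.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


def bm25_score(query_terms, doc_terms, avg_doc_len, n_docs, doc_freqs):
    """Sum the per-term scores over the query terms that occur in the document."""
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf:
            score += bm25_term_score(
                tf, len(doc_terms), avg_doc_len, n_docs, doc_freqs[term]
            )
    return score


# Toy document and corpus statistics, invented for illustration.
doc = "pandas memory error when reading large csv".split()
doc_freqs = {"pandas": 50, "memory": 20, "issue": 200}

score = bm25_score(
    ["pandas", "memory", "issue"], doc, avg_doc_len=7.0, n_docs=1000, doc_freqs=doc_freqs
)
print(round(score, 3))
```

Only "pandas" and "memory" contribute here, because "issue" does not occur in the toy document. This is what makes the retriever sparse: a document is scored only on the query terms it actually contains.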

In [35]:
pdf_related = pd.read_parquet(path_posts_related)

In [36]:
pdf_related.head()

Unnamed: 0,PostId,PostTitle,RelatedPostIds,RelatedPostTitles,num_candidates
1,3494593,Shading a kernel density plot between two points.,"[3494593, 14863744, 14094644, 16504452, 488531...",[Shading a kernel density plot between two poi...,16
2,37949409,Dictionary in a numpy array?,"[37949409, 47689224, 61517741]","[Dictionary in a numpy array?, How to access t...",3
8,19876079,Cannot find module cv2 when using OpenCV,"[19876079, 62443365, 64580641, 45606137, 60294...","[Cannot find module cv2 when using OpenCV, How...",7
12,35082143,Error: package or namespace load failed for ‘car’,"[35082143, 65941744, 68515009, 56409535]",[Error: package or namespace load failed for ‘...,4
14,2673651,inheritance from str or int,"[2673651, 48465797, 3120562, 15085917, 3238350...","[inheritance from str or int, Inherited class ...",15


In [37]:
len (pdf_related)

6114

In [38]:
resp['hits']['hits'][0];

In [41]:
def format_resp(resp, row):
    """Flatten a search response into one record per hit, flagged with its relevance."""
    payload = []
    query = row['PostTitle']
    for hit in resp['hits']['hits']:
        doc_id = int(hit['_id'])

        r = {
            'query': query,
            'query_id': row['PostId'],
            'doc_id': doc_id,
            # a hit is relevant if it was marked as related to the query post
            'is_relevant': doc_id in row['RelatedPostIds'],
            'score': hit['_score'],
            'doc_title': hit['_source']['Title'],
        }
        payload.append(r)
    return payload

def fetch_as_relevancy_eval(row, num_hits=10):
    # create the client inside the function so each parallel worker gets its own
    client = Elasticsearch(
        [f'http://{ELASTIC_HOST}:{ELASTIC_PORT}']
    )

    query = row['PostTitle']
    resp = fetch_results(client, query, num_hits=num_hits)
    payload = format_resp(resp, row)

    return pd.DataFrame(payload)


def evaluate_relevancy_hits(df, num_hits=10):

    payload = []
    for index, row in df.iterrows():

        # pass num_hits through so the argument actually takes effect
        payload_query = fetch_as_relevancy_eval(row, num_hits=num_hits)

        payload.extend(payload_query.to_dict(orient='records'))

    return pd.DataFrame(payload)
    



def evaluate_relevancy_hits2(df,num_hits=20):
    
    
    res = df.parallel_apply(fetch_as_relevancy_eval,num_hits=num_hits, axis = 1)

    return res
    

In [42]:
fetch_as_relevancy_eval(pdf_related.iloc[0].to_dict() )



Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,14094644,True,49.934696,draw the following shaded area in R
1,Shading a kernel density plot between two points.,3494593,7787114,True,45.700478,polygon in density plot?
2,Shading a kernel density plot between two points.,3494593,3494593,True,45.222992,Shading a kernel density plot between two points.
3,Shading a kernel density plot between two points.,3494593,27294822,False,39.258953,Shading a kernel density estimate between two ...
4,Shading a kernel density plot between two points.,3494593,27189453,True,34.868088,Shade (fill or color) area under density curve...
5,Shading a kernel density plot between two points.,3494593,30900745,False,24.448137,Plot 2D-kernel density from a dataframe: set n...
6,Shading a kernel density plot between two points.,3494593,35381762,False,24.032297,Are SciPy Kernel Density values dependent upon...
7,Shading a kernel density plot between two points.,3494593,29244352,False,23.956926,Python: Overlap between two functions (PDF of ...
8,Shading a kernel density plot between two points.,3494593,55131398,False,23.93893,matplotlib scatter: the more overlapping point...
9,Shading a kernel density plot between two points.,3494593,47644966,False,23.525637,Change color of seaborn distribution line


In [43]:
evaluate_relevancy_hits(pdf_related.iloc[0:2])



Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,14094644,True,49.934696,draw the following shaded area in R
1,Shading a kernel density plot between two points.,3494593,7787114,True,45.700478,polygon in density plot?
2,Shading a kernel density plot between two points.,3494593,3494593,True,45.222992,Shading a kernel density plot between two points.
3,Shading a kernel density plot between two points.,3494593,27294822,False,39.258953,Shading a kernel density estimate between two ...
4,Shading a kernel density plot between two points.,3494593,27189453,True,34.868088,Shade (fill or color) area under density curve...
5,Shading a kernel density plot between two points.,3494593,30900745,False,24.448137,Plot 2D-kernel density from a dataframe: set n...
6,Shading a kernel density plot between two points.,3494593,35381762,False,24.032297,Are SciPy Kernel Density values dependent upon...
7,Shading a kernel density plot between two points.,3494593,29244352,False,23.956926,Python: Overlap between two functions (PDF of ...
8,Shading a kernel density plot between two points.,3494593,55131398,False,23.93893,matplotlib scatter: the more overlapping point...
9,Shading a kernel density plot between two points.,3494593,47644966,False,23.525637,Change color of seaborn distribution line


In [None]:
r = evaluate_relevancy_hits2(pdf_related.head(1000000) )


In [50]:
df_res = pd.concat(list(r), ignore_index=True)

In [51]:
df_res

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,14094644,True,49.934696,draw the following shaded area in R
1,Shading a kernel density plot between two points.,3494593,7787114,True,45.700478,polygon in density plot?
2,Shading a kernel density plot between two points.,3494593,3494593,True,45.222992,Shading a kernel density plot between two points.
3,Shading a kernel density plot between two points.,3494593,27294822,False,39.258953,Shading a kernel density estimate between two ...
4,Shading a kernel density plot between two points.,3494593,27189453,True,34.868088,Shade (fill or color) area under density curve...
...,...,...,...,...,...,...
19995,How to add element in Python to the end of lis...,30212447,63341821,False,16.860931,How to check list for double elements and then...
19996,How to add element in Python to the end of lis...,30212447,38002790,False,16.690474,Create a list from a list without nesting
19997,How to add element in Python to the end of lis...,30212447,71056190,False,16.680052,Python Why when I remove an item from one list...
19998,How to add element in Python to the end of lis...,30212447,56865551,False,16.595814,How do you sample from a list of probabilities...


In [52]:
query_id = 30212447

In [53]:
pdf_related [ pdf_related['PostId'] == query_id ].iloc[0].to_dict()

{'PostId': 30212447,
 'PostTitle': 'How to add element in Python to the end of list using list.insert?',
 'RelatedPostIds': array([30212447, 70342396, 64223356, 54052453, 53932704]),
 'RelatedPostTitles': array(['How to add element in Python to the end of list using list.insert?',
        'Some confusion about swapping two elements in a list using a function',
        'while using "-1" as index number,the element is inserting at last 2nd position. how its happening?',
        'Insert an item to the last but one position in list',
        'Array: Insert with negative index'], dtype=object),
 'num_candidates': 5}

In [54]:
df_res[ df_res.query_id==query_id]

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
19980,How to add element in Python to the end of lis...,30212447,30212447,True,37.783924,How to add element in Python to the end of lis...
19981,How to add element in Python to the end of lis...,30212447,21939652,False,23.490698,Insert at first position of a list in Python
19982,How to add element in Python to the end of lis...,30212447,47621511,False,19.65197,Do Python list comprehensions append at each i...
19983,How to add element in Python to the end of lis...,30212447,70946087,False,19.495499,Python how to add quote to one of the element ...
19984,How to add element in Python to the end of lis...,30212447,23143011,False,18.778458,"Python - regex, blank element at the end of th..."
19985,How to add element in Python to the end of lis...,30212447,24612665,False,18.51469,Insert function use with nested list
19986,How to add element in Python to the end of lis...,30212447,25495944,False,18.439268,How to add two element into a list using list ...
19987,How to add element in Python to the end of lis...,30212447,47440037,False,18.421326,Extending list by adding element to special po...
19988,How to add element in Python to the end of lis...,30212447,63873586,False,17.91314,Returning smallest positive int that does not ...
19989,How to add element in Python to the end of lis...,30212447,52957447,False,17.82909,Python print out integers with suffixes in the...


In [55]:
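def metrics(result):
    """Compute precision@k and the reciprocal rank from one query's ranked relevance flags."""

    result = list(result)

    # reciprocal rank of the first relevant hit; 0 if no hit is relevant
    mrr = 0

    if True in result:
        first_index = result.index(True)
        mrr = 1 / (first_index + 1)

    res = {
        "p@1": sum(result[:1]),
        "p@5": sum(result[:5]) / 5,
        "p@10": sum(result[:10]) / 10,
        "mrr": mrr,
    }
    return pd.Series(res)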
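
As a worked example, the same quantities can be computed by hand for the first query shown earlier (`query_id` 3494593). The helpers below are a dependency-free sketch with hypothetical names; they follow the same definitions as `metrics` but are not the notebook's own code.

```python
from typing import List


def precision_at_k(relevance: List[bool], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance: List[bool]) -> float:
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1 / rank
    return 0.0


# Relevance flags of the ten hits listed for query 3494593, in ranked order.
relevance = [True, True, True, False, True, False, False, False, False, False]

print(precision_at_k(relevance, 1))   # 1.0
print(precision_at_k(relevance, 5))   # 0.8
print(precision_at_k(relevance, 10))  # 0.4
print(reciprocal_rank(relevance))     # 1.0
```

Averaging the reciprocal rank over all queries gives the mean reciprocal rank (MRR), which is what the `mrr` column becomes after aggregation.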

In [56]:
#?df_res.groupby

In [57]:
df_agg_res  = df_res.groupby(['query_id'], as_index=False).apply (lambda x: metrics(x['is_relevant']))



In [58]:
df_agg_res.drop(columns='query_id').agg(np.mean)

p@1     0.979000
p@5     0.321400
p@10    0.194400
mrr     0.988167
dtype: float64
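
One caveat when reading these numbers: in the tables above, `RelatedPostIds` contains the query's own `PostId`, and the query post itself is in the index, so a query can retrieve itself as a relevant hit (for `query_id` 30212447 it is the only relevant hit shown). That likely inflates `p@1` and `mrr`. A minimal sketch of filtering out such self-matches before computing metrics, assuming that is the behaviour you want; `drop_self_matches` is a hypothetical helper:

```python
def drop_self_matches(rows):
    """Remove hits whose doc_id equals the query's own post id."""
    return [row for row in rows if row["doc_id"] != row["query_id"]]


# First three hits for query 30212447, copied from the table above.
rows = [
    {"query_id": 30212447, "doc_id": 30212447, "is_relevant": True},
    {"query_id": 30212447, "doc_id": 21939652, "is_relevant": False},
    {"query_id": 30212447, "doc_id": 47621511, "is_relevant": False},
]

filtered = drop_self_matches(rows)
print([row["is_relevant"] for row in filtered])  # [False, False]
```

On the results dataframe, the equivalent filter would be `df_res[df_res.doc_id != df_res.query_id]`, applied before the `groupby`.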

## Conclusion

Hopefully this notebook showed how simple it is to implement a sparse retriever using Elasticsearch.