# Sparse Retrieval using Elastic Search

## Goals

- Understand the Python Elastic Search Client
- Map BM25 to Elastic Search 
- Compute Evaluation metrics 
- Other uses of Elastic Search

## Imports

In [3]:
import pandas as pd
import tqdm.auto
import numpy as np
import glob
import concurrent.futures
import multiprocessing
import requests
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk
import pprint
import rich
import json

import IPython.display
from IPython.display import JSON
import metrics_utils

In [4]:
pd.options.display.max_colwidth = 500 # increase column width

In [5]:
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Data

For this workshop, we have two files:

`posts.parquet`: contains a subset of Stack Overflow posts

`related_posts.parquet`: contains question pairs that were marked as duplicates

In [6]:
path_posts = "gs://np-training-tmp/stackoverflow/final_subset/posts.parquet"
path_posts_related = "gs://np-training-tmp/stackoverflow/final_subset/related_posts.parquet"

In [7]:
#ELASTIC_HOST="np-database.c.np-training.internal"
ELASTIC_HOST="localhost"
ELASTIC_INDEX="stackoverflow"
ELASTIC_PORT=9200

ELASTIC_FULL_URL =f"http://{ELASTIC_HOST}:{ELASTIC_PORT}"

In [8]:
ELASTIC_FULL_URL+"/_analyze"

'http://localhost:9200/_analyze'

## Elastic Search Default Analyzers and Tokenizers

### Elastic Search Analyzer

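Elasticsearch ships with many built-in analyzers.

An analyzer is composed of a `tokenizer` plus character filters and token filters, which normalize the text.

Tokenization: breaking a text down into smaller chunks (tokens).

Normalization: formatting each token, for example lowercasing it or removing stop words.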

[ElasticDoc](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-overview.html)

[Documentation for analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html)
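
To make the tokenize-then-normalize idea concrete, here is a minimal pure-Python sketch of an analyzer pipeline. It only illustrates the concept and is not how Elasticsearch implements analysis; the function names and the tiny stop word list are assumptions made for this example.

```python
import re

# Hypothetical, tiny stop word list used only for this illustration.
STOP_WORDS = {"is", "on", "of", "the"}


def letter_tokenize(text):
    """Split text at any non-letter character (ASCII letters only, for simplicity)."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase_filter(tokens):
    """Normalize tokens by lowercasing them."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def simple_stop_analyzer(text):
    """Tokenizer -> lowercase filter -> stop word filter."""
    return stop_filter(lowercase_filter(letter_tokenize(text)))


print(simple_stop_analyzer("ELASTICSEARCH is built on top of the open-source Apache Lucene."))
```

Swapping `letter_tokenize` for a plain `text.split()` would give a whitespace-style pipeline that keeps punctuation attached to the tokens.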

### Built in tokenizers

In [9]:
def elastic_tokenize(tokenizer, text, url=ELASTIC_FULL_URL + "/_analyze"):
    """Run `text` through a built-in tokenizer via the _analyze API and print the tokens."""
    r = requests.post(url, json={"tokenizer": tokenizer, "text": text})

    rich.print(r.json())


In [10]:
sentence = "<p> ELASTICSEARCH is built on top of the open-source <b>Apache Lucene</b>. </p>"

whitespace tokenizer

In [11]:
elastic_tokenize (tokenizer= "whitespace",  text= sentence)

standard tokenizer

In [12]:
elastic_tokenize (tokenizer= "standard",  text= sentence)

ngram tokenizer

In [13]:
elastic_tokenize (tokenizer= "ngram",  text= "Quick")
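
As a rough illustration of what a character n-gram tokenizer produces, the sketch below generates n-grams in pure Python. The helper name `char_ngrams` is hypothetical, and the defaults of `min_gram=1` and `max_gram=2` are an assumption about the tokenizer's default settings; compare against the cell output above.

```python
def char_ngrams(text, min_gram=1, max_gram=2):
    """Return character n-grams, ordered by start position and then by length."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            # skip n-grams that would run past the end of the text
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(char_ngrams("Quick"))
```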

### Analyzers

In [14]:
def elastic_analyze(analyzer, text, url=ELASTIC_FULL_URL + "/_analyze"):
    """Run `text` through a built-in analyzer via the _analyze API and print the tokens."""
    r = requests.post(url, json={"analyzer": analyzer, "text": text})

    rich.print(r.json())


**whitespace analyzer**

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.



In [15]:
elastic_analyze(analyzer = "whitespace", text = sentence )

**stop analyzer**

Breaks text into tokens at any non-letter character,  
changes uppercase to lowercase,  
and removes _english_ stop words.

In [16]:
elastic_analyze(analyzer = "stop", text = sentence )

**standard analyzer**

The default analyzer.  
Grammar-based tokenization.  
Stop words are disabled by default.



In [17]:
elastic_analyze(analyzer = "standard", text = sentence )

## Elastic Search Indexing

### Helper Code

In [84]:
def create_index(client, index: str, num_shards=3):
    """Create an index in Elasticsearch, deleting any old index with the same name."""

    # ignore_unavailable avoids an error when the index does not exist yet (e.g. on the first run)
    client.indices.delete(index=index, ignore_unavailable=True)

    client.indices.create(
        index=index,
        settings={"number_of_shards": num_shards},
        # example of an explicit mapping (unused here, the schema is inferred instead)
        # mappings={
        #     "properties": {
        #         "name": {"type": "text"},
        #         "borough": {"type": "keyword"},
        #         "cuisine": {"type": "keyword"},
        #         "grade": {"type": "keyword"},
        #         "location": {"type": "geo_point"},
        #     }
        # },
        # ignore=400
    )


def generate_docs(df: pd.DataFrame):
    """
    Given a dataframe containing posts data, yield one dictionary per post for bulk indexing.
    """

    # iterate over the dataframe of posts and their metadata
    for index, row in df.iterrows():
        doc = {**row}

        # use the post Id as the document id
        doc['_id'] = doc.pop("Id")

        for k in list(doc.keys()):
            # don't insert missing (None / NaN) fields; lists such as Tags are kept as-is
            if not isinstance(doc[k], list) and pd.isna(doc[k]):
                del doc[k]

        yield doc
        


def fetch_results(client: Elasticsearch, query: str, num_hits=5, fields=["Title", "QuestionBody"]):
    """
    Using the passed Elasticsearch client, return documents that match `query` in the fields given by `fields`.

    If `fields` is empty, all text fields are searched.

    We are using multi_match, which by default combines the query terms with `or`
    https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
    """

    resp = client.search(
        query={
            "multi_match": {
                "query": query,
                "fields": fields,
                # "operator": "and"
            }
        },
        size=num_hits,
    )

    return resp
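
The difference between the default `or` behaviour and the commented-out `"operator": "and"` can be seen in the query body itself. The sketch below only builds the dictionaries; `build_multi_match` is a hypothetical helper introduced for illustration and is not part of the notebook or of the Elasticsearch client.

```python
def build_multi_match(query, fields=None, operator="or"):
    """Build a multi_match query body; leave `fields` empty to omit the key entirely."""
    body = {"query": query, "operator": operator}
    if fields:
        body["fields"] = list(fields)
    return {"multi_match": body}


# any query term may match (the default behaviour)
any_term = build_multi_match("pandas shuffle rows", ["Title", "QuestionBody"])

# every query term must match in a field
all_terms = build_multi_match("pandas shuffle rows", ["Title"], operator="and")

print(any_term)
print(all_terms)
```

Either dictionary could then be passed as the `query` argument of `client.search`, in the same way `fetch_results` does.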
    

        

### Index Documents

In [19]:
df = pd.read_parquet(path_posts)
df['Tags']  = df['Tags'].apply(lambda x: x.tolist())

In [20]:
df.head()

Unnamed: 0,Id,AcceptedAnswerId,Title,QuestionBody,Tags,ViewCount,AnswerCount,CommentCount,Score,CreationDate,AnswerId,AcceptedAnswerBody
1,15020895,,Python int-byte efficient data structure,"i am currently storing key-values of type int-int. For fast access, I am currently using an BTrees.IIBTree structure stored in memory. It is not stored on disk at all since we need the most recent data.\n\nHowever, the current solution barely fits into memory, so I am looking for a more efficient database or data structure in terms of access time. In case it would be stored in memory it also needs to be efficient in terms of memory space. \n\nOne idea would be to replace the BTrees.IIBTree s...","[python, data-structures]",155,0,3,1,2013-02-22 09:33:26.360,,
9,68487902,,Why does the Variance of Laplace very different for OpenCV and scikit-image?,"TL;DR: How can I use skimage.filters.laplace(image).var() in a way to get the same value as cv2.Laplacian(image, CV_64F).var() and skimage.filters.sobel(image) to get same value as cv2.Sobel(image) ?\nI have the following code to find the Laplace Variance for blur detection\n[CODE]\nSo when I try to find the Laplace variance from OpenCV and scikit-image, it gives me two different values:\n[CODE]\nWhich one should I use or how can I get same number from both the functions?\nAlso, How can I us...","[python, opencv, image-processing, computer-vision, scikit-image]",391,0,5,1,2021-07-22 15:50:34.220,,
15,61391327,,Why input never ends,"I have python 3.7 installed and I have this code:\n\n[CODE]\n\nI was writing the name and press enter but the input is not over, it is still running and waiting for more inputs\n\nEdit: the problem is that input is never ending, doesn't matter how many enters I press\n","[python, python-3.x, input]",104,1,6,3,2020-04-23 15:43:03.497,,
27,28852710,,Crashes with piecewise linear objective for gurobi 6.0.2 / setPWLObj,"We have a complex optimization problem which includes several quadratic terms with integer and continous variables (using Anaconda Python / Pycharm with Gurobi 6.0.2). We applied the setPWLObj function to apprixmate the quadratic objective components. The code for this is as follows:\n\n[CODE]\n\nWith l1 and l2 being continous variables.\n\nThe problem behaves inconsistently. Running it on a Mac mostly delivers the exit codes 138 and 139 (correspondent to Bus Error 10), sometimes the same pr...","[python, crash, gurobi, piecewise]",403,1,1,3,2015-03-04 10:58:16.370,,
29,24043029,,Python TypeError: plotdatehist() got an unexpected keyword argument,"apologies beforehand if this is a stupid question...\n\nI've been using some Manchester University code to record, analyse, and graphically display bird box activity using IR emitters/receivers using a Raspberry Pi.\nAnyway, I've run into a problem in the graphical display part. \n\nThe part of the code causing the error is: \n\n[CODE]\n\nand the error which keeps coming up reads\n\n[CODE]\n\nI've heard that similar problems can be fixed by updating software, but as far as I can tell everyth...","[python, typeerror]",419,0,7,0,2014-06-04 16:42:32.257,,


A sample document from our input file.  

The main fields we will be searching against are `Title` and `QuestionBody`.

In [21]:
df.iloc[0].to_dict()

{'Id': 15020895,
 'AcceptedAnswerId': nan,
 'Title': 'Python int-byte efficient data structure',
 'QuestionBody': 'i am currently storing key-values of type int-int. For fast access, I am currently using an BTrees.IIBTree structure stored in memory. It is not stored on disk at all since we need the most recent data.\n\nHowever, the current solution barely fits into memory, so I am looking for a more efficient database or data structure in terms of access time. In case it would be stored in memory it also needs to be efficient in terms of memory space. \n\nOne idea would be to replace the BTrees.IIBTree structure with a int-byte hash written in C as an extension for Python, but the data would still be lost in case the machine fails (not a terrible thing in our case).\n\nWhat are your suggestions?\n',
 'Tags': ['python', 'data-structures'],
 'ViewCount': 155,
 'AnswerCount': 0,
 'CommentCount': 3,
 'Score': 1,
 'CreationDate': Timestamp('2013-02-22 09:33:26.360000'),
 'AnswerId': nan,
 '

In [22]:
?Elasticsearch

Init signature:
Elasticsearch(
    hosts: Union[str, List[Union[str, Mapping[str, Union[str, int]], elastic_transport.NodeConfig]], NoneType] = None,
    *,
    cloud_id: Union[str, NoneType] = None,
    api_key: Union[str, Tuple[str, str], NoneType] =

Create a client object for our Elasticsearch cluster

In [23]:
client = Elasticsearch(
    [ELASTIC_FULL_URL]
)

Tell Elasticsearch to create an index.  
An ES index is a collection of documents.

ES supports inferring the document schema from the data, without specifying it beforehand.

In [24]:
create_index(client, index= ELASTIC_INDEX)

In [25]:
?client.indices.create

Signature:
client.indices.create(
    *,
    index: str,
    aliases: Union[Mapping[str, Mapping[str, Any]], NoneType] = None,
    error_trace: Union[bool, NoneType] = None,
    filter_path: Union[str, List[str], Tuple[str, ...], NoneType] =

In [26]:
requests.get(f"{ELASTIC_FULL_URL}/_all/_settings").json()

{'test-index': {'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}},
    'number_of_shards': '5',
    'provided_name': 'test-index',
    'creation_date': '1666564416488',
    'number_of_replicas': '1',
    'uuid': 'x7OXazwcTMWVVb7EeDW9pQ',
    'version': {'created': '8040399'}}}},
 'stackoverflow': {'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}},
    'number_of_shards': '3',
    'provided_name': 'stackoverflow',
    'creation_date': '1667339325912',
    'number_of_replicas': '1',
    'uuid': 'b3YXQYMrSxm-s5EEg8apTA',
    'version': {'created': '8040399'}}}}}

The index we created is composed of `3` shards and `1` replica.

When searching, ES queries each shard independently and combines the per-shard results.
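
Combining per-shard results can be pictured as a merge of already sorted lists. The sketch below is a simplified illustration, not the actual Elasticsearch implementation; the hits, scores, and the helper name `merge_shard_hits` are made up for this example.

```python
import heapq
from itertools import islice


def merge_shard_hits(shard_hits, size):
    """Merge per-shard hit lists (each sorted by descending score) into a global top `size`."""
    merged = heapq.merge(*shard_hits, key=lambda hit: hit[1], reverse=True)
    return list(islice(merged, size))


# hypothetical (doc_id, score) pairs returned by three shards
shard_0 = [("a", 9.1), ("b", 4.0)]
shard_1 = [("c", 7.5), ("d", 1.2)]
shard_2 = [("e", 8.3)]

top_hits = merge_shard_hits([shard_0, shard_1, shard_2], size=3)
print(top_hits)
```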

In [27]:
len(df)

219841

In [28]:
df_subset = df.head(5_000_000)
number_of_docs = len(df_subset)

Bulk insert all of our documents

In [29]:
with tqdm.auto.tqdm(total=number_of_docs , unit="docs" ) as pbar:
    successes = 0


    for ok, action in streaming_bulk(
            client=client, index=ELASTIC_INDEX, actions=generate_docs(df_subset) ,
        ):
        pbar.update(1)
        successes += ok


  0%|          | 0/219841 [00:00<?, ?docs/s]

Inserting `200k` documents at `4000` docs/sec on a single node is pretty good
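
`streaming_bulk` consumes the generator lazily and sends the documents in batches rather than one request per document. The sketch below illustrates the batching idea only and is not the helper's actual implementation; the function name `chunked` is hypothetical and the chunk size of 500 is an assumption.

```python
from itertools import islice


def chunked(iterable, chunk_size=500):
    """Yield lists of up to `chunk_size` items without materialising the whole iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


sizes = [len(chunk) for chunk in chunked(range(1200), chunk_size=500)]
print(sizes)
```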

In [30]:
f"{ELASTIC_FULL_URL}/_cat/shards"

'http://localhost:9200/_cat/shards'


In [32]:
rich.print (
    requests.get(f"{ELASTIC_FULL_URL}/_cat/shards/{ELASTIC_INDEX}?v=true").content.decode()
    
)

In [33]:
rich.print (
    requests.get(f"{ELASTIC_FULL_URL}/_cat/nodes?v=true").content.decode()
    
)



In [34]:
rich.print (
    requests.get(f"{ELASTIC_FULL_URL}/{ELASTIC_INDEX}/_mapping").json()
    
)

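Note that by default, the inferred schema stores string content both as full text (`text`) and as a `keyword` sub-field.  
The `keyword` sub-field is not indexed when the string is longer than 256 characters, which is why `QuestionBody.keyword` can show up under `_ignored`.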

[ignore_above reference](https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html)
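
If the `keyword` sub-field on long text is not needed, an explicit mapping can be supplied when the index is created. The sketch below places the shape of an inferred string mapping next to one possible explicit alternative. The field type choices are assumptions for illustration, not the mapping used in this workshop.

```python
# shape of the mapping that dynamic mapping infers for a string field
inferred_string_field = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}

# one possible explicit mapping (an assumption, not what the notebook uses)
explicit_mappings = {
    "properties": {
        "Title": {"type": "text"},
        "QuestionBody": {"type": "text"},  # full text only, no keyword sub-field
        "Tags": {"type": "keyword"},       # exact-match values
    }
}

# It could be passed at index creation time, for example:
# client.indices.create(index=ELASTIC_INDEX, mappings=explicit_mappings)
print(explicit_mappings["properties"]["QuestionBody"])
```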

## Evaluate

In [35]:
?client.get

Signature:
client.get(
    *,
    index: str,
    id: str,
    error_trace: Union[bool, NoneType] = None,
    filter_path: Union[str, List[str], Tuple[str, ...], NoneType] = None,
    human: Union[bool, NoneType] = None,
    preference:

Get a specific document by its id

In [36]:
resp = client.get(index=ELASTIC_INDEX, id=15020895)
resp.body

{'_index': 'stackoverflow',
 '_id': '15020895',
 '_version': 1,
 '_seq_no': 0,
 '_primary_term': 1,
 '_ignored': ['QuestionBody.keyword'],
 'found': True,
 '_source': {'Title': 'Python int-byte efficient data structure',
  'QuestionBody': 'i am currently storing key-values of type int-int. For fast access, I am currently using an BTrees.IIBTree structure stored in memory. It is not stored on disk at all since we need the most recent data.\n\nHowever, the current solution barely fits into memory, so I am looking for a more efficient database or data structure in terms of access time. In case it would be stored in memory it also needs to be efficient in terms of memory space. \n\nOne idea would be to replace the BTrees.IIBTree structure with a int-byte hash written in C as an extension for Python, but the data would still be lost in case the machine fails (not a terrible thing in our case).\n\nWhat are your suggestions?\n',
  'Tags': ['python', 'data-structures'],
  'ViewCount': 155,
  '

retrieve a document with a query

In [37]:
query = "pandas Shuffle DataFrame rows"

In [38]:
# https://stackoverflow.com/questions/34147471/elasticsearch-how-to-search-for-a-value-in-any-field-across-all-types-in-one


resp = client.search(
    query = {
            "multi_match": {
                "query": query,
                # "fields": ["Title", "QuestionBody"],
                            }
            }
    , size=5
    , explain=False
)

In [39]:
JSON(resp.body, expanded = True)

<IPython.core.display.JSON object>

### Explain the score

In [40]:
resp = client.search(
    query = {
            "multi_match": {
                "query": query,
                 "fields": ["Title"],
                            }
            }
    , size=2
    , explain=True
    , source = ["Title"]
)

In [41]:
JSON (resp.body , expanded=True)

#print ( json.dumps(resp.body, indent=2) )

<IPython.core.display.JSON object>

In `hits.hits[idx]['_explanation']`, we see the individual scores computed for each of the components that make up BM25:
```
weight(Title:pandas in 35543) [PerFieldSimilarity], result of:
```
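
The explanation typically breaks each term weight into a boost, an `idf` part, and a `tf` part. The sketch below reproduces that decomposition in plain Python using the usual BM25 formulas with `k1=1.2` and `b=0.75`. The function names and all input numbers are assumptions for illustration and are not taken from the response above.

```python
import math


def bm25_idf(num_docs, doc_freq):
    """idf = ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))


def bm25_term_score(freq, doc_len, avg_doc_len, num_docs, doc_freq, boost=1.0):
    """Score contribution of a single query term for a single document field."""
    return boost * bm25_idf(num_docs, doc_freq) * bm25_tf(freq, doc_len, avg_doc_len)


# made-up statistics: a rare term and a common term in a short title
rare = bm25_term_score(freq=1, doc_len=5, avg_doc_len=9, num_docs=70000, doc_freq=50)
common = bm25_term_score(freq=1, doc_len=5, avg_doc_len=9, num_docs=70000, doc_freq=20000)
print(rare, common)
```

The score of a multi-term query is the sum of these per-term contributions, which is why rare terms dominate the ranking.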

In [42]:
# resp = client.search(
#     query = {
#         "bool" : {
#           "must" : {
#             "multi_match" : { "query" : query, "fields": ["Title"] }
#           },
#           "filter": {
#             "term" : { "_id" : "55047745" }
#           }
#         }
#       }

#     , size=2
#     , explain=True
#     , source = ["Title"]
# )

### Distributed tf-idf

We are running an Elasticsearch index with three shards.

ES has two ways to compute the distributed term frequencies:

`query_then_fetch`  
(Default) Distributed term frequencies are calculated locally for each shard running the search.  
We recommend this option for faster searches with potentially less accurate scoring.

`dfs_query_then_fetch`  
Distributed term frequencies are calculated globally, using information gathered from all shards running the search.  
While this option increases the accuracy of scoring, it adds a round-trip to each shard, which can result in slower searches.

Taken from the ES [docs](https://www.elastic.co/guide/en/elasticsearch/reference/8.4/search-search.html)
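
To see why the two modes can score the same document differently, the sketch below compares an `idf` computed from each shard's own statistics with one computed from the pooled statistics. The per-shard numbers are invented for illustration, and `idf` uses the usual BM25 form.

```python
import math


def idf(num_docs, doc_freq):
    """BM25-style idf: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# hypothetical statistics for one term: (docs in shard, docs containing the term)
shard_stats = [(1000, 10), (1000, 40), (1000, 25)]

# local statistics: each shard only knows its own documents
local_idfs = [idf(n_docs, doc_freq) for n_docs, doc_freq in shard_stats]

# global statistics: counts are pooled across shards before scoring
total_docs = sum(n_docs for n_docs, _ in shard_stats)
total_doc_freq = sum(doc_freq for _, doc_freq in shard_stats)
global_idf = idf(total_docs, total_doc_freq)

print(local_idfs, global_idf)
```

With local statistics, the same term is weighted differently depending on which shard a document happens to live in; pooling the counts removes that effect at the cost of an extra round-trip.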

searching with the default mode

In [43]:
resp = client.search(
    query = {
            "multi_match": {
                "query": query,
                 "fields": ["Title"],
                            }
            }
    , size=2
    #, explain=True
    , source = ["Title"]
    , search_type = "query_then_fetch"
)

In [44]:
JSON (resp.body , expanded=True)


<IPython.core.display.JSON object>

searching with the global dfs mode

In [45]:
resp = client.search(
    query = {
            "multi_match": {
                "query": query,
                 "fields": ["Title"],
                            }
            }
    , size=2
    #, explain=True
    , source = ["Title"]
    , search_type = "dfs_query_then_fetch"
)

In [46]:
JSON (resp.body , expanded=True)


<IPython.core.display.JSON object>

Score difference between the two search types for the top two hits:

`query_then_fetch` (before): 19.409405, 16.83802

`dfs_query_then_fetch` (after): 19.50764, 16.823097

## Evaluate on Golden data

In [47]:
pdf_related = pd.read_parquet(path_posts_related)

In [48]:
pdf_related.head()

Unnamed: 0,PostId,PostTitle,RelatedPostIds,RelatedPostTitles,num_candidates
1,3494593,Shading a kernel density plot between two points.,"[3494593, 14863744, 14094644, 16504452, 48853178, 36948624, 47308146, 34029811, 31215748, 29499914, 41484896, 7787114, 27189453, 23680729, 36224394, 18742693]","[Shading a kernel density plot between two points., adding percentile lines to a density plot, draw the following shaded area in R, color a portion of the normal distribution, How can I shade the area under a curve?, Shade area under a curve, Shading a region under a PDF, Fill different colors for each quantile in geom_density() of ggplot, How to shade part of a density curve in ggplot (with no y axis data), r density plot - fill area under curve, Fill negative value area below geom_line, po...",16
2,37949409,Dictionary in a numpy array?,"[37949409, 47689224, 61517741]","[Dictionary in a numpy array?, How to access the elements in numpy array of sets?, opening npy array. can view but not index?]",3
8,19876079,Cannot find module cv2 when using OpenCV,"[19876079, 62443365, 64580641, 45606137, 60294113, 65227902, 63039959]","[Cannot find module cv2 when using OpenCV, How to use opencv module in python(I'm using pycharm), build opencv from source: ModuleNotFoundError: No module named 'cv2', ImportError: No module named cv2 when executing Python script, 'opencv-python' installed but still shows 'ModuleNotFoundError: No module named cv2 ', Installed OpenCV successfully, but cannot import it within modules, On raspberry pi terminal cv2 works but on my project didnt work how can i fix this]",7
12,35082143,Error: package or namespace load failed for ‘car’,"[35082143, 65941744, 68515009, 56409535]","[Error: package or namespace load failed for ‘car’, Error: package or namespace load failed for ‘tidyverse’ there is no package called ‘reprex’, Truble loading 'Hmisc', > library(ez) Error: package or namespace load failed for ‘ez’ in loadNamespace]",4
14,2673651,inheritance from str or int,"[2673651, 48465797, 3120562, 15085917, 3238350, 4827303, 29751474, 50051365, 5693942, 59567148, 30045106, 37764447, 65568299, 24736813, 38873373]","[inheritance from str or int, Inherited class of int doesn't take additional arguments, Python, subclassing immutable types, Inheriting from immutable types, Subclassing int in Python, problem subclassing builtin type, Customizing immutable types in Python, Class inheritance not working while creating a Dimension custom class with int parent class in Python 3.6, Subclassing int and overriding the __init__ method - Python, How to inherit class complex in python?, Python how to extend `str` an...",15


In [49]:
len (pdf_related)

6114

For each `PostId`, we have the questions that were marked as related.  
Note that the `PostId` itself is included in `RelatedPostIds`.

### helper code

In [50]:
def format_resp(resp, row):
    """Turn a search response into one record per hit, labelled against the related posts of `row`."""
    payload = []
    query = row['PostTitle']
    for hit in resp['hits']['hits']:
        doc_id = int(hit['_id'])

        r = {
            'query': query,
            'query_id': row['PostId'],
            'doc_id': doc_id,
            'is_relevant': doc_id in row['RelatedPostIds'],
            'score': hit['_score'],
            'doc_title': hit['_source']['Title'],
        }
        payload.append(r)
    return payload


def fetch_as_relevancy_eval(row, num_hits=10, fields=["Title", "QuestionBody"]):
    """Search with the post title of `row` and return the labelled hits as a dataframe."""
    # the client is created per call, which keeps the function usable from parallel workers
    client = Elasticsearch([ELASTIC_FULL_URL])

    query = row['PostTitle']
    resp = fetch_results(client, query, num_hits=num_hits, fields=fields)
    payload = format_resp(resp, row)

    return pd.DataFrame(payload)


def evaluate_relevancy_hits(df, fields, num_hits=10):
    """Evaluate every row of `df` sequentially and return all labelled hits in one dataframe."""
    payload = []
    for index, row in df.iterrows():
        # pass num_hits through so that the argument is not silently ignored
        payload_query = fetch_as_relevancy_eval(row, num_hits=num_hits, fields=fields)

        payload.extend(payload_query.to_dict(orient='records'))

    return pd.DataFrame(payload)


def evaluate_relevancy_hits_parallel(df, fields, num_hits=20):
    """Evaluate the rows of `df` in parallel with pandarallel; returns one dataframe per row."""
    res = df.parallel_apply(fetch_as_relevancy_eval, num_hits=num_hits, fields=fields, axis=1)

    return res


Things to consider:

- the candidate query is known to be in the dataset
- so an ideal system would return that exact item first
- there are multiple candidates, so returning any of the candidates is a hit
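
Because the query post is itself in the index, its own document counts as a hit. One optional variant, which this notebook does not apply, is to drop the self-match before labelling. The sketch below shows both behaviours; `label_hits` is a hypothetical helper, and the ids are taken from the first example row.

```python
def label_hits(hit_ids, related_ids, query_id=None):
    """Return one relevance flag per hit; skip the query's own post when `query_id` is given."""
    related = set(related_ids)
    return [doc_id in related for doc_id in hit_ids if doc_id != query_id]


hit_ids = [14094644, 7787114, 3494593, 27294822]
related_ids = [3494593, 14094644, 7787114]

with_self = label_hits(hit_ids, related_ids)
without_self = label_hits(hit_ids, related_ids, query_id=3494593)
print(with_self, without_self)
```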

### results on single example

In [51]:
fetch_as_relevancy_eval(pdf_related.iloc[0].to_dict(), fields= ["Title"] )

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,3494593,True,45.222992,Shading a kernel density plot between two points.
1,Shading a kernel density plot between two points.,3494593,27294822,False,37.431713,Shading a kernel density estimate between two points - with transparency
2,Shading a kernel density plot between two points.,3494593,8808751,False,18.390955,Difference between two density plots
3,Shading a kernel density plot between two points.,3494593,5468280,False,17.628784,Scale a series between two points
4,Shading a kernel density plot between two points.,3494593,50526344,False,16.815237,Points with density gradient
5,Shading a kernel density plot between two points.,3494593,66490428,False,16.577562,Can one use ggMarginal on a plot combining points and density lines?
6,Shading a kernel density plot between two points.,3494593,60270301,False,16.513664,Kernel Density Plots and Histogram overlay
7,Shading a kernel density plot between two points.,3494593,24044475,False,16.47087,increasing distance between points in plot
8,Shading a kernel density plot between two points.,3494593,64546583,False,16.316025,plot multiple arrows between scatter points
9,Shading a kernel density plot between two points.,3494593,69029161,False,16.137531,shading the timeseries plot in python


In [52]:
fetch_as_relevancy_eval(pdf_related.iloc[0].to_dict(), fields= ["Title","QuestionBody"] )

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,14094644,True,49.934696,draw the following shaded area in R
1,Shading a kernel density plot between two points.,3494593,7787114,True,45.700478,polygon in density plot?
2,Shading a kernel density plot between two points.,3494593,3494593,True,45.222992,Shading a kernel density plot between two points.
3,Shading a kernel density plot between two points.,3494593,27294822,False,39.258953,Shading a kernel density estimate between two points - with transparency
4,Shading a kernel density plot between two points.,3494593,27189453,True,34.868088,Shade (fill or color) area under density curve by quantile
5,Shading a kernel density plot between two points.,3494593,30900745,False,24.448137,"Plot 2D-kernel density from a dataframe: set number of grid positions, bandwith and lims"
6,Shading a kernel density plot between two points.,3494593,35381762,False,24.032297,Are SciPy Kernel Density values dependent upon the density per unit area or volume when using 2D or 3D poiint data?
7,Shading a kernel density plot between two points.,3494593,29244352,False,23.956926,Python: Overlap between two functions (PDF of kde and normal)
8,Shading a kernel density plot between two points.,3494593,55131398,False,23.93893,matplotlib scatter: the more overlapping points the bigger the marker
9,Shading a kernel density plot between two points.,3494593,47644966,False,23.525637,Change color of seaborn distribution line


In [53]:
evaluate_relevancy_hits(pdf_related.iloc[0:2], fields= ["Title","QuestionBody"] )

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,14094644,True,49.934696,draw the following shaded area in R
1,Shading a kernel density plot between two points.,3494593,7787114,True,45.700478,polygon in density plot?
2,Shading a kernel density plot between two points.,3494593,3494593,True,45.222992,Shading a kernel density plot between two points.
3,Shading a kernel density plot between two points.,3494593,27294822,False,39.258953,Shading a kernel density estimate between two points - with transparency
4,Shading a kernel density plot between two points.,3494593,27189453,True,34.868088,Shade (fill or color) area under density curve by quantile
5,Shading a kernel density plot between two points.,3494593,30900745,False,24.448137,"Plot 2D-kernel density from a dataframe: set number of grid positions, bandwith and lims"
6,Shading a kernel density plot between two points.,3494593,35381762,False,24.032297,Are SciPy Kernel Density values dependent upon the density per unit area or volume when using 2D or 3D poiint data?
7,Shading a kernel density plot between two points.,3494593,29244352,False,23.956926,Python: Overlap between two functions (PDF of kde and normal)
8,Shading a kernel density plot between two points.,3494593,55131398,False,23.93893,matplotlib scatter: the more overlapping points the bigger the marker
9,Shading a kernel density plot between two points.,3494593,47644966,False,23.525637,Change color of seaborn distribution line


### results on entire dataset

In [77]:
resp = evaluate_relevancy_hits_parallel(pdf_related , fields= ["Title","QuestionBody"] )

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=765), Label(value='0 / 765'))), HB…

In [78]:
df_res  = pd.concat(list(resp) ,ignore_index = True)

In [79]:
df_res

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,14094644,True,49.934696,draw the following shaded area in R
1,Shading a kernel density plot between two points.,3494593,7787114,True,45.700478,polygon in density plot?
2,Shading a kernel density plot between two points.,3494593,3494593,True,45.222992,Shading a kernel density plot between two points.
3,Shading a kernel density plot between two points.,3494593,27294822,False,39.258953,Shading a kernel density estimate between two points - with transparency
4,Shading a kernel density plot between two points.,3494593,27189453,True,34.868088,Shade (fill or color) area under density curve by quantile
...,...,...,...,...,...,...
15606,Source code for str.split?,40332743,40332743,True,28.942324,Source code for str.split?
15607,Cannot perform a backup or restore operation within a transaction,27443414,53216877,True,55.897240,Can't perform a backup or restore operation within a transaction
15608,Cannot perform a backup or restore operation within a transaction,27443414,27443414,True,52.947180,Cannot perform a backup or restore operation within a transaction
15609,What is the exact meaning of stride's list in tf.nn.conv2d?,48536681,48536681,True,49.084730,What is the exact meaning of stride's list in tf.nn.conv2d?


In [80]:
query_id = 30212447

In [81]:
pdf_related [ pdf_related['PostId'] == query_id ].iloc[0].to_dict()

{'PostId': 30212447,
 'PostTitle': 'How to add element in Python to the end of list using list.insert?',
 'RelatedPostIds': array([30212447, 70342396, 64223356, 54052453, 53932704]),
 'RelatedPostTitles': array(['How to add element in Python to the end of list using list.insert?',
        'Some confusion about swapping two elements in a list using a function',
        'while using "-1" as index number,the element is inserting at last 2nd position. how its happening?',
        'Insert an item to the last but one position in list',
        'Array: Insert with negative index'], dtype=object),
 'num_candidates': 5}

In [82]:
df_res[ df_res.query_id==query_id]

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
2966,How to add element in Python to the end of list using list.insert?,30212447,30212447,True,35.865273,How to add element in Python to the end of list using list.insert?


### Summarize metrics

In [60]:
df_agg_res  = df_res.groupby(['query_id'], as_index=False).apply (lambda x: pd.Series(metrics_utils.all_metrics(x['is_relevant'])))


In [61]:
df_agg_res

Unnamed: 0,query_id,p@1,p@5,p@10,mrr,map
0,972,1.0,0.2,0.3,1.0,0.449060
1,8948,1.0,0.4,0.3,1.0,0.666667
2,20794,1.0,0.2,0.1,1.0,1.000000
3,32404,1.0,0.4,0.3,1.0,0.591667
4,32899,1.0,0.8,0.4,1.0,1.000000
...,...,...,...,...,...,...
6109,71792480,1.0,0.2,0.1,1.0,1.000000
6110,71992622,1.0,0.4,0.2,1.0,1.000000
6111,72050038,1.0,0.2,0.1,1.0,1.000000
6112,72369460,1.0,0.2,0.1,1.0,0.558824


In [62]:
df_agg_res.drop(columns='query_id').agg(np.mean)

p@1     0.983808
p@5     0.266143
p@10    0.146402
mrr     0.990946
map     0.902330
dtype: float64
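
The notebook does not show the implementation of `metrics_utils.all_metrics`, so the following is only a minimal sketch of how these five metrics are commonly computed from a ranked list of relevance flags. The function names (`precision_at_k`, `reciprocal_rank`, `average_precision`, `all_metrics_sketch`) are hypothetical, and the average precision here is normalized by the number of relevant hits that were retrieved, which is an assumption rather than the confirmed behavior of `metrics_utils`.

```python
import numpy as np


def precision_at_k(is_relevant, k):
    """Fraction of the top-k hits that are relevant (the denominator is always k)."""
    rel = np.asarray(is_relevant, dtype=bool)
    return float(rel[:k].sum()) / k


def reciprocal_rank(is_relevant):
    """1 / rank of the first relevant hit, or 0.0 if nothing relevant was retrieved."""
    hits = np.flatnonzero(np.asarray(is_relevant, dtype=bool))
    return float(1.0 / (hits[0] + 1)) if hits.size else 0.0


def average_precision(is_relevant):
    """Mean of precision@rank taken at each relevant hit (assumed normalization)."""
    hits = np.flatnonzero(np.asarray(is_relevant, dtype=bool))
    if hits.size == 0:
        return 0.0
    # The i-th relevant hit (1-based) sits at rank hits[i-1] + 1.
    precision_at_hits = np.arange(1, hits.size + 1) / (hits + 1)
    return float(precision_at_hits.mean())


def all_metrics_sketch(is_relevant):
    """Hypothetical stand-in for metrics_utils.all_metrics."""
    return {
        "p@1": precision_at_k(is_relevant, 1),
        "p@5": precision_at_k(is_relevant, 5),
        "p@10": precision_at_k(is_relevant, 10),
        "mrr": reciprocal_rank(is_relevant),
        "map": average_precision(is_relevant),
    }


# Ten ranked hits, with the relevant ones at ranks 2 and 5.
example = [False, True, False, False, True, False, False, False, False, False]
metrics = all_metrics_sketch(example)
print(metrics)
```

Averaging the per-query `mrr` and `map` values over all queries, as the cell above does, gives the mean reciprocal rank and mean average precision.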

With the title only: the same evaluation, restricted to the `Title` field.

In [75]:
resp_title = evaluate_relevancy_hits_parallel(pdf_related , fields= ["Title"] )
df_res_title  = pd.concat(list(resp_title) ,ignore_index = True)
df_agg_res_title  = df_res_title.groupby(['query_id'], as_index=False).apply (lambda x: pd.Series(metrics_utils.all_metrics(x['is_relevant'])))


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=765), Label(value='0 / 765'))), HB…

In [64]:
df_agg_res_title.drop(columns='query_id').agg(np.mean)

p@1     0.996565
p@5     0.251652
p@10    0.137439
mrr     0.998146
map     0.920565
dtype: float64
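
To make the two runs easier to compare, the mean metrics can be placed side by side. This is a small sketch: the numbers are copied from the two outputs above so that the block runs on its own, and the column labels and the `delta` column are illustrative choices. In the notebook, the two Series would come directly from `df_agg_res` and `df_agg_res_title`.

```python
import pandas as pd

# Mean metrics copied from the outputs above.
title_and_body = pd.Series(
    {"p@1": 0.983808, "p@5": 0.266143, "p@10": 0.146402, "mrr": 0.990946, "map": 0.902330}
)
title_only = pd.Series(
    {"p@1": 0.996565, "p@5": 0.251652, "p@10": 0.137439, "mrr": 0.998146, "map": 0.920565}
)

# One column per run, one row per metric.
comparison = pd.concat(
    {"Title+QuestionBody": title_and_body, "Title": title_only}, axis=1
)
# Positive delta: the title-only run scored higher on that metric.
comparison["delta"] = comparison["Title"] - comparison["Title+QuestionBody"]
print(comparison.round(4))
```

On these numbers, searching the title only scores higher on `p@1`, `mrr`, and `map`, and lower on `p@5` and `p@10`.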

### Queries where we didn't do well

Queries where the top-ranked hit was not relevant (`p@1 < 1`) even though at least two of the top five hits were (`p@5 >= 0.4`).

In [65]:
df_agg_res [  (df_agg_res['p@1'] < 1) & (df_agg_res['p@5'] >= 0.4) ]

Unnamed: 0,query_id,p@1,p@5,p@10,mrr,map
413,2778840,0.0,0.4,0.2,0.5,0.583333
433,2941995,0.0,0.4,0.3,0.5,0.409659
597,4106178,0.0,0.4,0.2,0.5,0.583333
855,6123378,0.0,0.4,0.2,0.5,0.383333
1402,10767010,0.0,0.4,0.2,0.5,0.5
2032,16249466,0.0,0.4,0.4,0.5,0.5
2083,16819956,0.0,0.6,0.6,0.5,0.571256
2309,18595695,0.0,0.4,0.2,0.333333,0.416667
3540,31087111,0.0,0.6,0.3,0.333333,0.477778
3889,34837707,0.0,0.6,0.3,0.5,0.533333


In [66]:
# Candidate queries to inspect; only the last assignment takes effect.
query_id = 38744285
query_id = 47068709
query_id = 40809503
query_id = 38118598



In [67]:
df_res [ df_res.query_id==query_id ].iloc[0]['query']

'3D animation using matplotlib'

In [68]:
df_res [ df_res.query_id==query_id ][['doc_title','is_relevant'] ]

Unnamed: 0,doc_title,is_relevant
61400,How to animate 3d spheres in matplotlib so that they revolve around a central point?,False
61401,3D animation using matplotlib,True
61402,3D waves animation using Matplotlib 3D,False
61403,Error while drawing animation of seaborn heatmap for 3D volume,False
61404,how to clear pervious data in live 3d plot in while loop(Python3),True
61405,"""""TypeError: 'Axes3D' object is not subscriptable"""" for 3d animation from data",False
61406,Saving scatterplot animations with matplotlib,False
61407,Simple Python/Matplotlib animation shows: empty graph - why?,False
61408,Want to use matpoltlib animation in subplots from gridspec,False
61409,Animate 3D surface over an initial 3D plot with matplotlib,False


In [69]:
df_res[ df_res.query_id==query_id]

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
61400,3D animation using matplotlib,38118598,56407653,False,26.944769,How to animate 3d spheres in matplotlib so that they revolve around a central point?
61401,3D animation using matplotlib,38118598,38118598,True,25.949009,3D animation using matplotlib
61402,3D animation using matplotlib,38118598,23480833,False,25.418022,3D waves animation using Matplotlib 3D
61403,3D animation using matplotlib,38118598,62396274,False,22.496458,Error while drawing animation of seaborn heatmap for 3D volume
61404,3D animation using matplotlib,38118598,53311187,True,21.967499,how to clear pervious data in live 3d plot in while loop(Python3)
61405,3D animation using matplotlib,38118598,69251079,False,20.547255,"""""TypeError: 'Axes3D' object is not subscriptable"""" for 3d animation from data"
61406,3D animation using matplotlib,38118598,14739969,False,18.986065,Saving scatterplot animations with matplotlib
61407,3D animation using matplotlib,38118598,48259542,False,18.504963,Simple Python/Matplotlib animation shows: empty graph - why?
61408,3D animation using matplotlib,38118598,47125849,False,18.470299,Want to use matpoltlib animation in subplots from gridspec
61409,3D animation using matplotlib,38118598,72005485,False,17.819656,Animate 3D surface over an initial 3D plot with matplotlib
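
When `p@1` is 0, it helps to look at the hits that outranked the first relevant document. The sketch below uses a hypothetical helper, `hits_above_first_relevant`, and a toy frame built from the first three rows of the table above. In the notebook, it would be applied to `df_res[df_res.query_id == query_id]`.

```python
import numpy as np
import pandas as pd


def hits_above_first_relevant(df_query: pd.DataFrame) -> pd.DataFrame:
    """Return the hits ranked above the first relevant one (hypothetical helper)."""
    ranked = df_query.sort_values("score", ascending=False).reset_index(drop=True)
    relevant_positions = np.flatnonzero(ranked["is_relevant"].to_numpy())
    if relevant_positions.size == 0:
        # Nothing relevant was retrieved: every hit is a distractor.
        return ranked
    return ranked.iloc[: relevant_positions[0]]


# Toy rows copied from the results for query 38118598.
toy = pd.DataFrame(
    {
        "doc_id": [56407653, 38118598, 23480833],
        "is_relevant": [False, True, False],
        "score": [26.944769, 25.949009, 25.418022],
        "doc_title": [
            "How to animate 3d spheres in matplotlib so that they revolve around a central point?",
            "3D animation using matplotlib",
            "3D waves animation using Matplotlib 3D",
        ],
    }
)

distractors = hits_above_first_relevant(toy)
print(distractors[["doc_id", "score", "doc_title"]])
```

Here a single document with a longer title outscores the exact-title match (26.94 versus 25.95), which is why this query has `p@1 = 0` and `mrr = 0.5`.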


In [70]:
df.to_parquet("../tmp/posts.parquet", index=False)
df.head()

Unnamed: 0,Id,AcceptedAnswerId,Title,QuestionBody,Tags,ViewCount,AnswerCount,CommentCount,Score,CreationDate,AnswerId,AcceptedAnswerBody
1,15020895,,Python int-byte efficient data structure,"i am currently storing key-values of type int-int. For fast access, I am currently using an BTrees.IIBTree structure stored in memory. It is not stored on disk at all since we need the most recent data.\n\nHowever, the current solution barely fits into memory, so I am looking for a more efficient database or data structure in terms of access time. In case it would be stored in memory it also needs to be efficient in terms of memory space. \n\nOne idea would be to replace the BTrees.IIBTree s...","[python, data-structures]",155,0,3,1,2013-02-22 09:33:26.360,,
9,68487902,,Why does the Variance of Laplace very different for OpenCV and scikit-image?,"TL;DR: How can I use skimage.filters.laplace(image).var() in a way to get the same value as cv2.Laplacian(image, CV_64F).var() and skimage.filters.sobel(image) to get same value as cv2.Sobel(image) ?\nI have the following code to find the Laplace Variance for blur detection\n[CODE]\nSo when I try to find the Laplace variance from OpenCV and scikit-image, it gives me two different values:\n[CODE]\nWhich one should I use or how can I get same number from both the functions?\nAlso, How can I us...","[python, opencv, image-processing, computer-vision, scikit-image]",391,0,5,1,2021-07-22 15:50:34.220,,
15,61391327,,Why input never ends,"I have python 3.7 installed and I have this code:\n\n[CODE]\n\nI was writing the name and press enter but the input is not over, it is still running and waiting for more inputs\n\nEdit: the problem is that input is never ending, doesn't matter how many enters I press\n","[python, python-3.x, input]",104,1,6,3,2020-04-23 15:43:03.497,,
27,28852710,,Crashes with piecewise linear objective for gurobi 6.0.2 / setPWLObj,"We have a complex optimization problem which includes several quadratic terms with integer and continous variables (using Anaconda Python / Pycharm with Gurobi 6.0.2). We applied the setPWLObj function to apprixmate the quadratic objective components. The code for this is as follows:\n\n[CODE]\n\nWith l1 and l2 being continous variables.\n\nThe problem behaves inconsistently. Running it on a Mac mostly delivers the exit codes 138 and 139 (correspondent to Bus Error 10), sometimes the same pr...","[python, crash, gurobi, piecewise]",403,1,1,3,2015-03-04 10:58:16.370,,
29,24043029,,Python TypeError: plotdatehist() got an unexpected keyword argument,"apologies beforehand if this is a stupid question...\n\nI've been using some Manchester University code to record, analyse, and graphically display bird box activity using IR emitters/receivers using a Raspberry Pi.\nAnyway, I've run into a problem in the graphical display part. \n\nThe part of the code causing the error is: \n\n[CODE]\n\nand the error which keeps coming up reads\n\n[CODE]\n\nI've heard that similar problems can be fixed by updating software, but as far as I can tell everyth...","[python, typeerror]",419,0,7,0,2014-06-04 16:42:32.257,,


In [71]:
pdf_related.to_parquet("../tmp/related.parquet", index=False)
pdf_related.head()

Unnamed: 0,PostId,PostTitle,RelatedPostIds,RelatedPostTitles,num_candidates
1,3494593,Shading a kernel density plot between two points.,"[3494593, 14863744, 14094644, 16504452, 48853178, 36948624, 47308146, 34029811, 31215748, 29499914, 41484896, 7787114, 27189453, 23680729, 36224394, 18742693]","[Shading a kernel density plot between two points., adding percentile lines to a density plot, draw the following shaded area in R, color a portion of the normal distribution, How can I shade the area under a curve?, Shade area under a curve, Shading a region under a PDF, Fill different colors for each quantile in geom_density() of ggplot, How to shade part of a density curve in ggplot (with no y axis data), r density plot - fill area under curve, Fill negative value area below geom_line, po...",16
2,37949409,Dictionary in a numpy array?,"[37949409, 47689224, 61517741]","[Dictionary in a numpy array?, How to access the elements in numpy array of sets?, opening npy array. can view but not index?]",3
8,19876079,Cannot find module cv2 when using OpenCV,"[19876079, 62443365, 64580641, 45606137, 60294113, 65227902, 63039959]","[Cannot find module cv2 when using OpenCV, How to use opencv module in python(I'm using pycharm), build opencv from source: ModuleNotFoundError: No module named 'cv2', ImportError: No module named cv2 when executing Python script, 'opencv-python' installed but still shows 'ModuleNotFoundError: No module named cv2 ', Installed OpenCV successfully, but cannot import it within modules, On raspberry pi terminal cv2 works but on my project didnt work how can i fix this]",7
12,35082143,Error: package or namespace load failed for ‘car’,"[35082143, 65941744, 68515009, 56409535]","[Error: package or namespace load failed for ‘car’, Error: package or namespace load failed for ‘tidyverse’ there is no package called ‘reprex’, Truble loading 'Hmisc', > library(ez) Error: package or namespace load failed for ‘ez’ in loadNamespace]",4
14,2673651,inheritance from str or int,"[2673651, 48465797, 3120562, 15085917, 3238350, 4827303, 29751474, 50051365, 5693942, 59567148, 30045106, 37764447, 65568299, 24736813, 38873373]","[inheritance from str or int, Inherited class of int doesn't take additional arguments, Python, subclassing immutable types, Inheriting from immutable types, Subclassing int in Python, problem subclassing builtin type, Customizing immutable types in Python, Class inheritance not working while creating a Dimension custom class with int parent class in Python 3.6, Subclassing int and overriding the __init__ method - Python, How to inherit class complex in python?, Python how to extend `str` an...",15


In [72]:
df_agg_res.to_parquet("../tmp/df_agg_res__elasticsearch.parquet", index=False)
df_agg_res.head()

Unnamed: 0,query_id,p@1,p@5,p@10,mrr,map
0,972,1.0,0.2,0.3,1.0,0.44906
1,8948,1.0,0.4,0.3,1.0,0.666667
2,20794,1.0,0.2,0.1,1.0,1.0
3,32404,1.0,0.4,0.3,1.0,0.591667
4,32899,1.0,0.8,0.4,1.0,1.0


In [73]:
df_res.to_parquet("../tmp/df_res__elasticsearch.parquet", index=False)
df_res.head()

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,14094644,True,49.934696,draw the following shaded area in R
1,Shading a kernel density plot between two points.,3494593,7787114,True,45.700478,polygon in density plot?
2,Shading a kernel density plot between two points.,3494593,3494593,True,45.222992,Shading a kernel density plot between two points.
3,Shading a kernel density plot between two points.,3494593,27294822,False,39.258953,Shading a kernel density estimate between two points - with transparency
4,Shading a kernel density plot between two points.,3494593,27189453,True,34.868088,Shade (fill or color) area under density curve by quantile


## Conclusion

I hope this notebook showed how simple it is to implement a sparse retriever using Elastic Search.