# Token Retrieval using BM25


# About

In this notebook we will do the following:
- Create a text corpus from the descriptions of images
- Explore the built-in `analyzers` in Elasticsearch
- Explore the Elasticsearch (ES) APIs
- Explore how BM25 results differ across analyzers

# Setup

In [120]:
import pandas as pd
from pathlib import Path
import datasets

from IPython.display import Image, JSON
from IPython.core.display import HTML
import rich
import re

import requests
import tqdm.auto
import ipywidgets as widgets
from ipywidgets import interact

import ipyplot

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk
import os

##### We load the dataset

In [84]:
dset = datasets.load_from_disk("../data/processed")

##### The dataset consists of 24992 rows, and each row has columns that describe a photo, such as its description and size.

In [85]:
dset

Dataset({
    features: ['photo_id', 'photo_url', 'photo_image_url', 'photo_submitted_at', 'photo_featured', 'photo_width', 'photo_height', 'photo_aspect_ratio', 'photo_description', 'photographer_username', 'photographer_first_name', 'photographer_last_name', 'exif_camera_make', 'exif_camera_model', 'exif_iso', 'exif_aperture_value', 'exif_focal_length', 'exif_exposure_time', 'photo_location_name', 'photo_location_latitude', 'photo_location_longitude', 'photo_location_country', 'photo_location_city', 'stats_views', 'stats_downloads', 'ai_description', 'ai_primary_landmark_name', 'ai_primary_landmark_latitude', 'ai_primary_landmark_longitude', 'ai_primary_landmark_confidence', 'blur_hash', 'description_final', 'image', 'image_path_local'],
    num_rows: 24992
})

##### We use the `description_final` field of each photo to create a text corpus

In [86]:
dset['description_final'][:5]

['Woman exploring a forest',
 'Succulents in a terrarium',
 'Rural winter mountainside',
 'Poppy seeds and flowers',
 'Silhouette near dark trees']

In [87]:
dset[0]

{'photo_id': 'XMyPniM9LF0',
 'photo_url': 'https://unsplash.com/photos/XMyPniM9LF0',
 'photo_image_url': 'https://images.unsplash.com/uploads/14119492946973137ce46/f1f2ebf3',
 'photo_submitted_at': '2014-09-29 00:08:38.594364',
 'photo_featured': 't',
 'photo_width': 4272,
 'photo_height': 2848,
 'photo_aspect_ratio': 1.5,
 'photo_description': 'Woman exploring a forest',
 'photographer_username': 'michellespencer77',
 'photographer_first_name': 'Michelle',
 'photographer_last_name': 'Spencer',
 'exif_camera_make': 'Canon',
 'exif_camera_model': 'Canon EOS REBEL T3',
 'exif_iso': 400.0,
 'exif_aperture_value': '1.8',
 'exif_focal_length': '50.0',
 'exif_exposure_time': '1/100',
 'photo_location_name': None,
 'photo_location_latitude': None,
 'photo_location_longitude': None,
 'photo_location_country': None,
 'photo_location_city': None,
 'stats_views': 2375421,
 'stats_downloads': 6967,
 'ai_description': 'woman walking in the middle of forest',
 'ai_primary_landmark_name': None,
 'ai_

# Elastic Search

In [88]:
ELASTIC_HOST="localhost"
ELASTIC_INDEX="unsplash"
ELASTIC_PORT=9200

ELASTIC_FULL_URL =f"http://{ELASTIC_HOST}:{ELASTIC_PORT}"

## Elastic Search Default Analyzers 

### Elastic Search Analyzer

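Elasticsearch ships with many built-in analyzers.

An analyzer is composed of a `tokenizer` and `normalizers` (in Elasticsearch terms, character filters and token filters).

tokenization: breaking a text down into smaller chunks (tokens)

normalization: reformatting each token, for example lowercasing it or removing stop words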

[ElasticDoc](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-overview.html)

[Documentation for analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html)
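
To make the tokenizer plus normalizer split concrete, here is a minimal pure-Python sketch of an analyzer pipeline. It is only a rough approximation of what the built-in `stop` analyzer does; the function name `toy_analyze` and the short stop-word list are assumptions made for illustration, not Elasticsearch's own implementation or word list.

```python
import re

# Illustrative stop-word list; Elasticsearch's _english_ list is longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def toy_analyze(text, stop_words=STOP_WORDS):
    """Tokenize on non-letter characters, lowercase, then drop stop words."""
    # Tokenizer: runs of Unicode letters (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text)
    # Normalization step 1: lowercase every token.
    tokens = [t.lower() for t in tokens]
    # Normalization step 2: remove stop words.
    return [t for t in tokens if t not in stop_words]


print(toy_analyze("Two dogs walking in the snow."))
# ['two', 'dogs', 'walking', 'snow']
```

The real analyzers are configured per field in the index mapping and run inside Elasticsearch; the `_analyze` endpoint used below shows their actual output.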

### Analyzers

In [89]:
def elastic_analyze(analyzer,  text, url = ELASTIC_FULL_URL+"/_analyze"):
    r =requests.post(url, 
              json =
                    {
                      "analyzer": analyzer ,
                      "text": text
                    }
        )

    rich.print (r.json() )
    

In [90]:
sentence = "Two dogs walking in the snow. ❄️ 雪"

**whitespace analyzer**

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.



In [91]:
elastic_analyze(analyzer = "whitespace", text = sentence )

**stop analyzer**

Breaks text into tokens at any non-letter character.  
Changes uppercase to lowercase.  
Removes _english_ stop words.  


In [92]:
elastic_analyze(analyzer = "stop", text = sentence )

**standard analyzer**

The default analyzer.  
Uses grammar-based tokenization.  
Stop word removal is disabled by default.  



In [93]:
elastic_analyze(analyzer = "standard", text = sentence )

In [94]:
elastic_analyze(analyzer = "simple", text = sentence )

In [95]:
elastic_analyze(analyzer = "english", text = sentence )
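
When comparing analyzers side by side, it can help to keep only the token strings from each `_analyze` response. The sketch below assumes the usual response shape, a `tokens` list whose entries carry a `token` key. The helper name `extract_tokens` and the sample response are hypothetical: the sample is hand-written, not output captured from this notebook.

```python
def extract_tokens(analyze_response):
    """Return only the token strings from an `_analyze` response body."""
    return [entry["token"] for entry in analyze_response.get("tokens", [])]


# Hand-written stand-in for a response, not real output from this notebook.
sample_response = {
    "tokens": [
        {"token": "two", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "dogs", "start_offset": 4, "end_offset": 8,
         "type": "<ALPHANUM>", "position": 1},
    ]
}

print(extract_tokens(sample_response))
# ['two', 'dogs']
```

With a running cluster, the same helper could be applied to the JSON body that `elastic_analyze` currently prints.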

## Index Documents


In [96]:
client = Elasticsearch(
    [ELASTIC_FULL_URL]
)

In [98]:
def create_index(client,index:str, num_shards=1):
    """Creates an index in Elasticsearch. Delete old index."""
    
    if client.indices.exists(index=index):
        client.indices.delete(index=index)
    
    client.indices.create(
        index=index
        ,settings = {"number_of_shards": num_shards}
        ,mappings= {
            
            "properties": {
                        "description_final": {"type": "text"}
                        ,"photo_image_url": {"type": "text" ,"index":False}
                        ,"photo_id": {"type": "keyword" ,"index":False}
                
                   }
            }
       
        #,ignore=400
    )


def generate_docs(df:pd.DataFrame):
    """
    Given a dataframe containing photo metadata, yield one dictionary per row, ready for bulk indexing
    """
    
    df = df[['photo_id','description_final','photo_image_url']]
    
    # iterate over the rows of the dataframe
    for index, row in df.iterrows():
        doc = {**row} 
        
        # use photo_id as the document id
        doc['_id'] = doc["photo_id"]
        
        for k in list(doc.keys()):
            # don't insert None/NaN fields
            if not isinstance(doc[k], list) and (doc[k] is None or pd.isna(doc[k])):
                del doc[k]
        
        yield doc
        


def fetch_results(client:Elasticsearch, query:str,  num_hits=5, fields = ["description_final"], analyzer ="stop"):
    """
    Using the passed Elasticsearch client, return documents that match the passed `query` in the fields specified by `fields`

    If `fields` is empty, all text fields are searched
    
    We are using `multi_match`, which by default combines the query terms with `or`
    https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
    """



    
    resp = client.search(
        query= {
                "multi_match": {
                    "query": query,
                    "fields": fields,
                     "analyzer": analyzer
                   # "operator": "and" 
                },
               
            }
        ,size = num_hits
    )
    
    return resp
    

        

Tell Elasticsearch to create an index.     
An ES index is a collection of documents. 

ES supports inferring the schema from the documents, without specifying it beforehand.

In [99]:
create_index(client, index= ELASTIC_INDEX, num_shards=1)

In [100]:
?client.indices.create

Signature:
client.indices.create(
    *,
    index: str,
    aliases: Union[Mapping[str, Mapping[str, Any]], NoneType] = None,
    error_trace: Union[bool, NoneType] = None,
    filter_path: Union[str, List[str], Tuple[str, ...], NoneType] =

In [101]:
requests.get(f"{ELASTIC_FULL_URL}/_all/_settings").json()

{'unsplash': {'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}},
    'number_of_shards': '1',
    'provided_name': 'unsplash',
    'creation_date': '1682450664263',
    'number_of_replicas': '1',
    'uuid': '7gKPz7FiTE2FXojjGpABgA',
    'version': {'created': '8070099'}}}}}

The index we created is composed of `1` shard and `1` replica.   

When searching, ES queries each shard independently and combines the results.
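
With a single shard there is nothing to combine, but with several shards the coordinating step amounts to merging per-shard top hits into one global ranking. The following is a simplified sketch of that idea, not Elasticsearch's actual implementation; the function name `merge_shard_hits` and the scores and document ids are made up for illustration.

```python
import heapq


def merge_shard_hits(shard_hits, k):
    """Combine per-shard (score, doc_id) lists into one global top-k list."""
    all_hits = (hit for hits in shard_hits for hit in hits)
    # nlargest orders by score first because score is the first tuple element.
    return heapq.nlargest(k, all_hits)


# Hypothetical per-shard results: each shard returns its own local top hits.
shard_a = [(7.1, "doc3"), (2.4, "doc9")]
shard_b = [(9.8, "doc5"), (6.0, "doc1")]

print(merge_shard_hits([shard_a, shard_b], k=3))
# [(9.8, 'doc5'), (7.1, 'doc3'), (6.0, 'doc1')]
```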

In [102]:
len(dset)

24992

In [103]:
df_subset = dset.to_pandas()
number_of_docs = len(df_subset)

Bulk insert all of our documents

In [104]:
df_subset.iloc[0].to_dict()

{'photo_id': 'XMyPniM9LF0',
 'photo_url': 'https://unsplash.com/photos/XMyPniM9LF0',
 'photo_image_url': 'https://images.unsplash.com/uploads/14119492946973137ce46/f1f2ebf3',
 'photo_submitted_at': '2014-09-29 00:08:38.594364',
 'photo_featured': 't',
 'photo_width': 4272,
 'photo_height': 2848,
 'photo_aspect_ratio': 1.5,
 'photo_description': 'Woman exploring a forest',
 'photographer_username': 'michellespencer77',
 'photographer_first_name': 'Michelle',
 'photographer_last_name': 'Spencer',
 'exif_camera_make': 'Canon',
 'exif_camera_model': 'Canon EOS REBEL T3',
 'exif_iso': 400.0,
 'exif_aperture_value': '1.8',
 'exif_focal_length': '50.0',
 'exif_exposure_time': '1/100',
 'photo_location_name': None,
 'photo_location_latitude': nan,
 'photo_location_longitude': nan,
 'photo_location_country': None,
 'photo_location_city': None,
 'stats_views': 2375421,
 'stats_downloads': 6967,
 'ai_description': 'woman walking in the middle of forest',
 'ai_primary_landmark_name': None,
 'ai_pr
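
The row above contains both `None` and `nan` values, which `generate_docs` drops before indexing. Below is a minimal standard-library sketch of that cleaning step on a plain dictionary. The helper name `clean_doc` is hypothetical, and the sample row is a shortened version of the record shown above.

```python
import math


def clean_doc(row, id_field="photo_id"):
    """Drop None/NaN values and copy the id field into `_id`."""
    doc = {}
    for key, value in row.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        doc[key] = value
    # Elasticsearch bulk helpers read the document id from `_id`.
    doc["_id"] = doc[id_field]
    return doc


row = {
    "photo_id": "XMyPniM9LF0",
    "description_final": "Woman exploring a forest",
    "photo_location_name": None,
    "photo_location_latitude": float("nan"),
}

print(clean_doc(row))
# {'photo_id': 'XMyPniM9LF0', 'description_final': 'Woman exploring a forest', '_id': 'XMyPniM9LF0'}
```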

In [105]:
with tqdm.auto.tqdm(total=number_of_docs , unit="docs" ) as pbar:
    successes = 0


    for ok, action in streaming_bulk(
            client=client, index=ELASTIC_INDEX, actions=generate_docs(df_subset) ,
        ):
        pbar.update(1)
        successes += ok


  0%|          | 0/24992 [00:00<?, ?docs/s]

Inserting `25k` documents at `3000` docs/sec on a single node is pretty good
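
Part of the reason bulk indexing is fast is that `streaming_bulk` sends documents in batches rather than one request per document; its `chunk_size` parameter defaults to 500. The sketch below only illustrates the batching idea with the standard library. It is not the helper's actual implementation, the name `chunked` is hypothetical, and the real helper also caps batches by size in bytes.

```python
from itertools import islice


def chunked(iterable, chunk_size=500):
    """Yield successive lists of at most `chunk_size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 24,992 placeholder documents, matching the size of the dataset above.
docs = ({"_id": i} for i in range(24992))
sizes = [len(chunk) for chunk in chunked(docs)]

print(len(sizes), sizes[-1])
# 50 492
```

Under these assumptions the whole corpus fits in 50 bulk requests instead of 24,992 single-document requests.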

In [106]:
rich.print (
    requests.get(f"{ELASTIC_FULL_URL}/_cat/nodes?v=true").content.decode()
    
)

In [107]:
rich.print (
    requests.get(f"{ELASTIC_FULL_URL}/{ELASTIC_INDEX}/_mapping").json()
    
)

In [108]:
rich.print (
    requests.get(f"{ELASTIC_FULL_URL}/_cat/shards/{ELASTIC_INDEX}?v=true").content.decode()
    
)

Get a specific document by its id

In [109]:
resp = client.get(index=ELASTIC_INDEX, id="XMyPniM9LF0")
resp.body

{'_index': 'unsplash',
 '_id': 'XMyPniM9LF0',
 '_version': 1,
 '_seq_no': 0,
 '_primary_term': 1,
 'found': True,
 '_source': {'photo_id': 'XMyPniM9LF0',
  'description_final': 'Woman exploring a forest',
  'photo_image_url': 'https://images.unsplash.com/uploads/14119492946973137ce46/f1f2ebf3'}}

## Evaluate

Retrieve documents with a query

In [110]:
query = "Two dogs playing in the snow"

In [111]:
# https://stackoverflow.com/questions/34147471/elasticsearch-how-to-search-for-a-value-in-any-field-across-all-types-in-one


resp = client.search(
    query = {
            "multi_match": {
                "query": query,
                # "fields": ["Title", "QuestionBody"],
                            }
            }
    , size=5
    , explain=False
)

In [112]:
JSON(resp.body, expanded = True)


## Explain the score

In [113]:
query

'Two dogs playing in the snow'

In [114]:
resp = client.search(
    query = {
            "multi_match": {
                "query": query,
                 "fields": ["description_final"],
                            }
            }
    , size=2
    , explain=True
    , source = ["description_final"]
)

In [115]:
JSON (resp.body , expanded=True)

#print ( json.dumps(resp.body, indent=2) )


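In `hits.hits[idx]['_explanation']`, we see the individual scores computed for each of the components that make up BM25. An explanation entry looks like the line below (this sample comes from a different index with a `Title` field; for this index the field would be `description_final`):
```
weight(Title:pandas in 35543) [PerFieldSimilarity], result of:
```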
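
The components in the explanation tree can be reproduced by hand. The sketch below follows the `idf` and `tf` formulas as they are printed in Elasticsearch explain output, with the default parameters `k1 = 1.2` and `b = 0.75`. The default `boost` of 2.2 is an assumption based on what explain output commonly reports for default settings, the function names are hypothetical, and the statistics passed in at the end are made up.

```python
import math


def bm25_idf(num_docs, doc_freq):
    """idf as printed in explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """tf as printed in explain output: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))


def bm25_term_score(num_docs, doc_freq, freq, doc_len, avg_doc_len, boost=2.2):
    """Per-term score: boost * idf * tf. The 2.2 default is an assumption."""
    idf = bm25_idf(num_docs, doc_freq)
    tf = bm25_tf(freq, doc_len, avg_doc_len)
    return boost * idf * tf


# Made-up statistics, chosen only to exercise the formulas.
print(bm25_term_score(num_docs=1000, doc_freq=10, freq=1, doc_len=4, avg_doc_len=4))
```

Two properties are worth checking against a real explanation: rarer terms (smaller `n`) get a larger `idf`, and shorter-than-average documents get a larger `tf` for the same term frequency.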

# More Examples

##### Let's go over the method below. It takes the search query and a value k, which limits how many results are returned.
- Stop, stem and tokenize the query
- Get bm25 scores of the documents
- Sort the documents by bm25 scores and get top k
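
Before looking at the Elasticsearch-backed version, the three steps above can be sketched in plain Python on the five sample descriptions shown earlier. This is a toy illustration, not what Elasticsearch runs internally: the stop-word list is illustrative, stemming is left out, and the function names `tokenize` and `bm25_top_k` are hypothetical.

```python
import heapq
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the"}  # illustrative list


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[^\W\d_]+", text.lower()) if t not in STOP_WORDS]


def bm25_top_k(query, docs, k=5, k1=1.2, b=0.75):
    """Score every document against the query and return the top k."""
    tokenized = [tokenize(d) for d in docs]
    num_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / num_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)
    scored = []
    for doc, tokens in zip(docs, tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in counts:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
            tf = counts[term] / (counts[term] + k1 * (1 - b + b * len(tokens) / avg_len))
            score += idf * tf
        scored.append((score, doc))
    # Sort by score and keep the k best.
    return heapq.nlargest(k, scored)


corpus = [
    "Woman exploring a forest",
    "Succulents in a terrarium",
    "Rural winter mountainside",
    "Poppy seeds and flowers",
    "Silhouette near dark trees",
]

for score, doc in bm25_top_k("a woman in the forest", corpus, k=2):
    print(f"{score:.2f}  {doc}")
```

Only the first description shares terms with the query, so it is the only one with a non-zero score.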

In [116]:
def fetch_results(client:Elasticsearch, query:str,  num_hits=5, fields_search = ["description_final"],  analyzer ="stop", explain=False
                  , fields_metadata=["photographer_username","photographer_first_name","photographer_last_name","photo_image_url"]):
    """
    Using the passed Elasticsearch client, return documents that match the passed `query` in the fields specified by `fields_search`

    If `fields_search` is empty, all text fields are searched
    
    We are using `multi_match`, which by default combines the query terms with `or`
    https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
    """


    
    resp = client.search(
        query= {
                "multi_match": {
                    "query": query,
                    "fields": fields_search,
                     "analyzer": analyzer,
                     
                    #  "operator": "and" 
                },
               
            }
        ,size = num_hits
        ,explain = explain
        , source = fields_search + fields_metadata
    )
    
    return resp
    

In [134]:
def find_results(query:str , k =5, analyzer="english", explain=False,image_path_local_dir = "../data/raw/images"):
    
    
    top_items = fetch_results(client,query=query,num_hits=k, analyzer=analyzer, explain=explain)
    
    

    display(HTML(f"<h3>Query: {query} </h3>"))
    
    elastic_analyze(  analyzer = analyzer, text =query)

    images = []
    labels = []
    
    # Iterate over the top k results
    for hit in top_items['hits']['hits']:
        doc_id = hit['_id']
        
        photo_data = hit["_source"]
        
        # Display the photo
        
        image_path_local = f"{image_path_local_dir}/{doc_id}.jpg"
        if os.path.exists(image_path_local):
            image_path = image_path_local
        else:
            image_path = photo_data["photo_image_url"]
        images.append(image_path)
        score = "{:.2f}".format(hit['_score'])
        
        labels.append (f"""
                     Photo title: {photo_data["description_final"]}   <br/>
                     Score: {score}
            
                     """)
        
        #display(Image(url=photo_data["photo_image_url"] + "?w=200"))

        # # Display the attribution text
        # display(HTML(f"""
        #              Photo title: {photo_data["description_final"]}   <br/>
        #              Photo by <a href="https://unsplash.com/@{photo_data["photographer_username"]}?utm_source=SearchWorkshop&utm_medium=referral">{photo_data["photographer_first_name"]} {photo_data.get("photographer_last_name","")}</a> on <a href="https://unsplash.com/?utm_source=SearchWorkshop&utm_medium=referral">Unsplash</a> <br/>
        #              Distance: {hit['_score']}
        #              """
        #                                 ))
        
        
    ipyplot.plot_images(images=images, labels=labels, img_width=200)
        
        
    if explain:
        return JSON (top_items.body , expanded=False)


In [135]:
query = "Two dogs playing in the snow"
analyzer = "english"

In [136]:
find_results( query, analyzer=analyzer, explain=True)


In [147]:
find_results( "the boy and girl on a beach", analyzer="stop", explain=True)


In [138]:
find_results( "image of a man in a desert", analyzer="english", explain=True)


In [139]:
find_results( "light at the end of the tunnel", analyzer="standard")



In [168]:
sample_queries = [

"person on top of mountain"
, "picture of a man in a desert"
, "person in a desert"    

, "the boy and girl on a beach"
, "children in beach"    

, "Two dogs playing in the snow"

, "light at the end of the tunnel"
, "seven wonders of the world"
    
    
, "water droplets on a leaf"
    
, "ripley's aquarium of canada, toronto, canada"
, "the butterfly atrium at hershey gardens"
    
, "salar de uyuni uyuni bolivia"
, "沙漠青蛙 沙漠青蛙" #(desert frog)
, "por do sol no mar"
, "conhece te a ti mesmo" # (Portuguese for "know thyself")


, "there is no planet b"


, "nova scotia duck tolling retriever"
    
]

Questions to Ask:

How does the model do for clear 

In [169]:
"""

"ripley's aquarium of canada, toronto, canada" (not there)
salar de uyuni uyuni bolivia (not there)

沙漠青蛙 沙漠青蛙 desert frog (not tokenizing correctly) 
conhece te a ti mesmo (not understanding Portuguese)

planet B (because not tokenized)

seven wonders of the world: not understanding intent
"""

In [173]:
@interact
def interact_find_results(query=sample_queries, analyzer=["english","stop","standard"], explain=[True,False], k =10):
    find_results( query, analyzer=analyzer,explain=explain, k =k)



1) Do the results of `person on top of mountain` differ between the analyzers?

2) How do the results of `picture of a man in a desert` and `person in a desert` compare?

3) How do `the boy and girl on a beach` and `children in beach` compare?

4) Are the results of `seven wonders of the world` relevant?

# Performance

In [172]:
%%timeit
search_query = "Two dogs playing in the snow"
k = 5
top_items = fetch_results(client, query=search_query, num_hits=k)



1.97 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


We are able to search 25k docs in ~2 ms
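
Outside a notebook, where the `%%timeit` magic is not available, a similar measurement can be taken with the standard library. The helper name `time_calls` is hypothetical, and the workload here is a stand-in so the sketch runs without a cluster.

```python
import statistics
import time


def time_calls(fn, repeats=100):
    """Call `fn` repeatedly and return (mean, stdev) latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; with a live cluster this could instead be
# lambda: fetch_results(client, query="Two dogs playing in the snow", num_hits=5)
mean_ms, stdev_ms = time_calls(lambda: sum(range(1000)))
print(f"{mean_ms:.3f} ms ± {stdev_ms:.3f} ms")
```

Unlike `%%timeit`, which averages over batches of loops, this times every call individually, so the spread it reports includes per-call noise such as network jitter.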

# Other

Other topics to be covered if there is time

### Built in tokenizers

In [None]:
def elastic_tokenize(tokenizer,  text, url = ELASTIC_FULL_URL+"/_analyze"):
    r =requests.post(url, 
              json =
                    {
                      "tokenizer": tokenizer ,
                      "text": text
                    }
    
    
        )

    rich.print (r.json() )
    
    
    

In [None]:
sentence = "<p> ELASTICSEARCH is built on top of the open-source <b>Apache Lucene</b>. </p>"

whitespace tokenizer

In [None]:
elastic_tokenize (tokenizer= "whitespace",  text= sentence)

standard tokenizer

In [None]:
elastic_tokenize (tokenizer= "standard",  text= sentence)

ngram tokenizer

In [None]:
elastic_tokenize (tokenizer= "ngram",  text= "Quick")
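
The `ngram` tokenizer slides over the text and emits every substring between `min_gram` and `max_gram` characters long; its documented defaults are 1 and 2. The following is a small pure-Python sketch of that behaviour, where the function name `ngram_tokens` is an assumption made for illustration.

```python
def ngram_tokens(text, min_gram=1, max_gram=2):
    """Emit every substring of length min_gram..max_gram, by start position."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


print(ngram_tokens("Quick"))
# ['Q', 'Qu', 'u', 'ui', 'i', 'ic', 'c', 'ck', 'k']
```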

Note that with the default (dynamic) mapping, string content is stored both as full text and as a keyword.      
The keyword version is ignored if the string is longer than 256 characters.

[ignore_above reference](https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html)
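
For reference, this is the shape of the mapping that dynamic mapping typically generates for a new string field, together with a small check that mirrors the `ignore_above` rule. The index in this notebook declares its fields explicitly, so this only applies to fields left to dynamic mapping. The helper name `kept_as_keyword` is hypothetical.

```python
# Mapping that dynamic mapping typically produces for a new string field.
dynamic_string_mapping = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}


def kept_as_keyword(value, mapping=dynamic_string_mapping):
    """True if `value` is short enough to be indexed in the keyword sub-field."""
    limit = mapping["fields"]["keyword"]["ignore_above"]
    return len(value) <= limit


print(kept_as_keyword("Woman exploring a forest"))  # True: short description
print(kept_as_keyword("x" * 257))                   # False: one character too long
```

A string that exceeds the limit is still searchable as full text; it is only skipped for the keyword sub-field.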

### Distributed tf-idf

This section assumes an Elasticsearch index with three shards (the index created earlier in this notebook uses one).

ES has two ways to compute the distributed term frequencies:



`query_then_fetch`     
(Default) Distributed term frequencies are calculated locally for each shard running the search.    

We recommend this option for faster searches with potentially less accurate scoring.

`dfs_query_then_fetch`    
Distributed term frequencies are calculated globally, using information gathered from all shards running the search.   
While this option increases the accuracy of scoring, it adds a round-trip to each shard, which can result in slower searches.

taken from ES [docs](https://www.elastic.co/guide/en/elasticsearch/reference/8.4/search-search.html)
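
To see why the two modes can score the same document differently, compare the BM25 `idf` computed from each shard's own statistics with the `idf` computed from statistics summed over all shards. The per-shard numbers below are hypothetical and chosen only to make the effect visible.

```python
import math


def idf(num_docs, doc_freq):
    """BM25 idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for one term: (docs in shard, docs containing the term).
shards = {"shard_0": (100, 10), "shard_1": (100, 1), "shard_2": (100, 4)}

# query_then_fetch: each shard uses only its own statistics.
local_idf = {name: idf(n_docs, df) for name, (n_docs, df) in shards.items()}

# dfs_query_then_fetch: statistics are summed across shards first.
total_docs = sum(n_docs for n_docs, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_df)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"global idf = {global_idf:.3f}")
```

When a term is unevenly distributed, the local values diverge from each other and from the global one, which is the source of the small score differences between the two search types.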

searching with the default mode

In [None]:
resp = client.search(
    query = {
            "multi_match": {
                "query": query,
                 "fields": ["description_final"],
                            }
            }
    , size=2
    #, explain=True
    , source = ["description_final"]
    , search_type = "query_then_fetch"
)

In [None]:
JSON (resp.body , expanded=True)


searching with the global dfs mode

In [None]:
resp = client.search(
    query = {
            "multi_match": {
                "query": query,
                 "fields": ["description_final"],
                            }
            }
    , size=2
    #, explain=True
    , source = ["description_final"]
    , search_type = "dfs_query_then_fetch"
)

In [None]:
JSON (resp.body , expanded=True)


Score difference between the two search types (top two hits):

before (`query_then_fetch`): 14.775831, 13.5637
    
after (`dfs_query_then_fetch`):  14.706409, 13.708656