# Elastic Rank Eval Demo

## Install Prerequisites

In [10]:
! pip install -q -U -r requirements.txt

## Create Environment

In [11]:
%%bash
terraform -chdir=terraform init -upgrade
terraform -chdir=terraform apply -auto-approve

[0m[1mInitializing the backend...[0m
[0m[1mInitializing provider plugins...[0m
- Finding elastic/elasticstack versions matching "~> 0.12"...
- Finding elastic/ec versions matching "~> 0.12"...
- Using previously-installed elastic/ec v0.12.2
- Using previously-installed elastic/elasticstack v0.12.2

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.[0m

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  [32m+[0m create[0m

Terraform will perform the following actions:

[1m  #

## Create Environment File

In [12]:
%%bash
cat > .env << EOF
ELASTIC_USERNAME=$(terraform -chdir=terraform output elastic_username)
ELASTIC_PASSWORD=$(terraform -chdir=terraform output elastic_password)
ELASTIC_CLOUD_ID=$(terraform -chdir=terraform output elastic_cloud_id)
JINA_API_KEY=$(terraform -chdir=terraform output jina_api_key)
GEMINI_API_KEY=$(terraform -chdir=terraform output gemini_api_key)
EOF

## Create Document and Judgment Sets

In [13]:
import os
from dotenv import load_dotenv
from google import genai
from pydantic import BaseModel, Field, ConfigDict
import random
import json

MODEL = "gemini-3-pro-preview"
TERM_SETS = [
    ["algorithm", "data", "system", "network", "software", "hardware", "security", "optimization", "automation", "scalability", "performance", "integration"],
    ["patient", "treatment", "diagnosis", "clinical", "therapeutic", "medical", "healthcare", "wellness", "prevention", "symptoms", "recovery", "medicine"],
    ["investment", "portfolio", "returns", "risk", "capital", "market", "assets", "revenue", "profit", "liquidity", "valuation", "dividend"],
    ["learning", "students", "curriculum", "teaching", "assessment", "pedagogy", "education", "training", "skills", "knowledge", "academic", "instruction"],
    ["environmental", "sustainable", "renewable", "emissions", "conservation", "ecosystem", "climate", "green", "carbon", "pollution", "biodiversity"]
]

load_dotenv(override=True)

def generate_documents(term_set):
    prompt = f"""
    - You are an expert JSON generator. Your task is to strictly adhere to the user's prompt and the provided JSON schema to generate a valid array of JSON objects.
    - Generate 100 unique JSON documents
    - The first 50 documents should utilize the following terms in a grammatically correct manner in the title and content fields: {term_set}
    - The remaining 50 documents should be on random topics.  Those topics could use some of the provided terms.
    - Each document should have a unique integer id starting from 1.
    - The title should be concise, between 5 to 10 words.
    - The content should be a detailed paragraph of at least 30 words.
    - Don't put any bolding, asterisks or quotation marks in the output.
    - Generate the documents such that there is a mix of relevance values when performing lexical or
    semantic search on them with the given terms.
    """

    class Document(BaseModel):
        id: int = Field(alias="_id", description="The unique identifier of the document.")
        title: str = Field(description="The title of the document.")
        content: str = Field(description="The content of the document.")
        model_config = ConfigDict(
            populate_by_name=True,
        )

    class DocumentList(BaseModel):
        documents: list[Document] = Field(description="A list of generated documents.")

    client = genai.Client()
    response = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config={
            "tools":[],
            "response_mime_type":"application/json",
            "response_schema": DocumentList.model_json_schema(),
        }
    )

    result_list = DocumentList.model_validate_json(response.text)
    return [doc.model_dump(by_alias=True) for doc in result_list.documents]


def generate_judgments(term_set, docs):
    prompt = f"""
    You are an expert search relevance rater. Your task is to generate 1 search query
    and 20 relevance judgments from the provided documents based on that query.
    - The search query should be concise, between 3 to 7 words, and utilize some of the following terms: {term_set}.
    - The search query should be selected such that it has varying degrees of relevance to the
    provided documents in both lexical and semantic terms.
    - The relevance judgments should be based on how relevant each document is to the query.
    - Relevance is rated on a scale from 1 (least relevant/match) to 5 (perfect match).
    - Consider both lexical and semantic relevance when rating the documents.
    - The index field should be the value 'test-index' for all judgments.
    - The id field should correspond to the document id being rated.

    **Documents to Rate Against:**
    {docs[0:50]}

    Generate the judgment list in the required JSON schema.
    """

    class Judgment(BaseModel):
        query_text: str = Field(description="The search query text.")
        index: str = Field(alias="_index", description="The name of the index being queried.")
        id: str = Field(alias="_id", description="The unique identifier of the document.")
        rating: int = Field(description="The relevance score of the document.")

    class JudgmentList(BaseModel):
        judgments: list[Judgment] = Field(description="A list of generated relevance judgments.")

    client = genai.Client()
    response = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config={
            "tools":[],
            "response_mime_type":"application/json",
            "response_schema": JudgmentList.model_json_schema(),
        }
    )

    result_list = JudgmentList.model_validate_json(response.text)
    return [judgment.model_dump(by_alias=True) for judgment in result_list.judgments]

if not os.path.exists("documents.jsonl") or not os.path.exists("judgments.jsonl"):
    term_set = random.choice(TERM_SETS)

    documents = generate_documents(term_set)
    with open("documents.jsonl", "w") as f:
        for doc in documents:
            f.write(json.dumps(doc) + "\n")

    judgments = generate_judgments(term_set,documents)
    with open("judgments.jsonl", "w") as f:
        for judgment in judgments:
            f.write(json.dumps(judgment) + "\n")

with open("documents.jsonl", "r") as f:
    line = f.readline()
    print("*** Sample Generated Document ***")
    print(json.dumps(json.loads(line), indent=2))

with open("judgments.jsonl", "r") as f:
    line = f.readline()
    print("\n*** Sample Generated Judgment ***")
    print(json.dumps(json.loads(line), indent=2))

*** Sample Generated Document ***
{
  "_id": 1,
  "title": "Modern Pedagogy and Effective Student Learning Strategies",
  "content": "In the realm of modern education, effective pedagogy requires a deep understanding of how students process information. Teachers must design a curriculum that not only delivers academic knowledge but also fosters critical thinking skills through rigorous assessment and tailored instruction."
}

*** Sample Generated Judgment ***
{
  "query_text": "assessment strategies for student learning",
  "_index": "test-index",
  "_id": "3",
  "rating": 5
}
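
Because the judgments are produced by a language model, it may be worth checking them against the generated documents before indexing. The sketch below is not part of the original notebook; `validate_judgments` and the sample data are hypothetical names introduced for illustration. It assumes judgment `_id` values are strings and document `_id` values are integers, as in the samples above, and that ratings should fall in the 1 to 5 range requested in the prompt.

```python
def validate_judgments(documents, judgments, min_rating=1, max_rating=5):
    """Return a list of human-readable problems found in the judgment list."""
    # Judgment ids are strings while document ids are integers, so compare as strings.
    known_ids = {str(doc["_id"]) for doc in documents}
    problems = []
    seen = set()
    for judgment in judgments:
        doc_id = str(judgment["_id"])
        rating = judgment["rating"]
        if doc_id not in known_ids:
            problems.append(f"judgment references unknown document id {doc_id}")
        if not min_rating <= rating <= max_rating:
            problems.append(
                f"rating {rating} for document {doc_id} is outside {min_rating}..{max_rating}"
            )
        if doc_id in seen:
            problems.append(f"document {doc_id} is rated more than once")
        seen.add(doc_id)
    return problems


# Hypothetical in-memory data shaped like documents.jsonl and judgments.jsonl.
sample_docs = [
    {"_id": 1, "title": "t1", "content": "c1"},
    {"_id": 3, "title": "t3", "content": "c3"},
]
sample_judgments = [
    {"query_text": "q", "_index": "test-index", "_id": "3", "rating": 5},
    {"query_text": "q", "_index": "test-index", "_id": "7", "rating": 9},
]

for problem in validate_judgments(sample_docs, sample_judgments):
    print(problem)
```

An empty list means every judgment points at a known document and carries an in-range rating.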


## Create Jina Reranker Inference Endpoint

In [14]:
import os
from elasticsearch import Elasticsearch

es = Elasticsearch(
    cloud_id=os.getenv("ELASTIC_CLOUD_ID"),
    request_timeout=160,
    basic_auth=(os.getenv("ELASTIC_USERNAME"), os.getenv("ELASTIC_PASSWORD")),
)

es.options(ignore_status=[404]).inference.delete(inference_id="jina-reranker-v3")
response = es.inference.put(
    task_type="rerank",
    inference_id="jina-reranker-v3",
    body={
        "service": "jinaai",
        "service_settings": {
            "api_key": os.getenv("JINA_API_KEY"),
            "model_id": "jina-reranker-v3"
        }, 
        "task_settings": {
            "top_n": 10,
            "return_documents": True
        } 
    }
)
print(response)

{'inference_id': 'jina-reranker-v3', 'task_type': 'rerank', 'service': 'jinaai', 'service_settings': {'model_id': 'jina-reranker-v3', 'rate_limit': {'requests_per_minute': 2000}}, 'task_settings': {'top_n': 10, 'return_documents': True}}


## Indexing

In [15]:
from elasticsearch.helpers import bulk
import json

INDEX_NAME = "test-index"
mappings = {
    "properties": {
        "title": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "content": {
            "type": "text",
            "fields": {
                "embedding": {
                    "type": "semantic_text",
                    "inference_id": ".elser-2-elastic"
                }
            }
        }
    }
}            

es.options(ignore_status=[404]).indices.delete(index=INDEX_NAME)
es.indices.create(index=INDEX_NAME, body={"mappings": mappings})

def gen_data():
    with open("documents.jsonl", "r") as f:
        for line in f:    
            yield json.loads(line.strip())
            
result = bulk(client=es, index=INDEX_NAME, actions=gen_data())
print(result[0], "documents indexed")

100 documents indexed
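
The query templates in the next section combine the lexical and semantic retrievers with weighted reciprocal rank fusion (RRF). As a rough illustration of what that fusion does, the sketch below scores each document as the sum of `weight / (rank_constant + rank)` over the retrievers that return it. The function name `weighted_rrf`, the toy rankings, and the `rank_constant` value of 60 are assumptions made for this example; the exact behaviour of the Elasticsearch `rrf` retriever may differ in details such as tie handling.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, rank_constant=60):
    """Fuse ranked lists of document ids into one list of (doc_id, score) pairs."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Ranks are 1-based: the top hit of each retriever contributes the most.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy rankings standing in for the lexical and semantic result lists.
lexical_hits = ["a", "b", "c"]
semantic_hits = ["c", "a", "d"]

fused = weighted_rrf([lexical_hits, semantic_hits], weights=[0.25, 0.75])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

With the semantic list weighted at 0.75, its top hit ends up first in the fused list even though the lexical retriever ranked it last.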


## Create Query Templates

In [16]:
ratings = []
query_string = ""
with open("judgments.jsonl", "r") as f:
    for line in f:
        obj = json.loads(line.strip())
        query_string = obj["query_text"]
        del obj["query_text"]
        ratings.append(obj)

# Lexical Query - BM25
lexical_retriever = {
    "standard": {
        "query": {
            "multi_match": {
                "query": "{{query_string}}",
                "fields": ["title", "content"]
            }
        }
    }
}

lexical_query_template = {
    "id": "lexical_query_template",
    "template": {
        "source": {
            "size": 10,
            "retriever": lexical_retriever
        }
    }
}

lexical_request =  {
    "id": "lexical_query",
    "ratings": ratings,
    "template_id": "lexical_query_template",
    "params": {
        "query_string": query_string
    }
}

# Semantic Query - ELSER on GPU
semantic_retriever = {
    "standard": {
        "query": {
            "match": {
                "content.embedding": {
                    "query": "{{query_string}}"
                }
            }
        }
    }
}

semantic_query_template = {
    "id": "semantic_query_template",
    "template": {
        "source": {
            "size": 10,
            "retriever": semantic_retriever
        }
    }
}

semantic_request = {
    "id": "semantic_query",
    "ratings": ratings,
    "template_id": "semantic_query_template",
    "params": {
        "query_string": query_string
    }
}

# Rescore Query - Linear + Rescorer
linear_retriever = {
    "linear": {
        "retrievers": [
            {
                "retriever": lexical_retriever,
                "weight": 0.3
            }, 
            {
                "retriever": semantic_retriever,   
                "weight": 0.7   
            }
        ],
        "normalizer": "l2_norm"
    }
}

rescore_retriever = {
    "rescorer": {
        "rescore": {
            "window_size": 10,
            "query": {
                "rescore_query": {
                    "dis_max": {
                        "queries": [
                            {
                                "match_phrase": {
                                    "title": {
                                        "query": "{{query_string}}",
                                        "boost": 10.0,
                                        "slop": 15
                                    }
                                }
                            },
                            {
                                "multi_match": {
                                    "query": "{{query_string}}",
                                    "fields": ["title", "content^2"]
                                }
                            },
                            {
                                "match_phrase": {
                                    "content": {
                                        "query": "{{query_string}}",
                                        "boost": 3.0,
                                        "slop": 15
                                    }
                                }
                            }
                        ],
                        "tie_breaker": 0.3
                    }
                },
                "query_weight": 1,
                "rescore_query_weight": 1.0
            }
        },
        "retriever": linear_retriever
    }
}

rescore_query_template = {
    "id": "rescore_query_template",
    "template": {
        "source": {
            "size": 10,
            "retriever": rescore_retriever
        }
    }
}
  
rescore_request = {
    "id": "rescore_query",
    "ratings": ratings,
    "template_id": "rescore_query_template",
    "params": {
        "query_string": query_string
    }
}

# Hybrid Query - Weighted RRF
rrf_retriever = {
    "rrf": {
        "rank_window_size": 10,
        "retrievers": [
            {
                "retriever": lexical_retriever,
                "weight": .25
            }, 
            {
                "retriever": semantic_retriever,
                "weight": .75
            }
        ]
    }
}

rrf_query_template = {
    "id": "rrf_query_template",
    "template": {
        "source": {
            "size": 10,
            "retriever": rrf_retriever
        }
    }
}

rrf_request = {
    "id": "rrf_query",
    "ratings": ratings,
    "template_id": "rrf_query_template",
    "params": {
        "query_string": query_string
    }
}

# Rerank Query - Weighted RRF + Jina Reranker
rerank_retriever = {
    "text_similarity_reranker": {
        "retriever": rrf_retriever,
        "field": "content",
        "inference_id": "jina-reranker-v3",
        "inference_text": "{{query_string}}"
    }
}
    
rerank_query_template = {
    "id": "rerank_query_template",
    "template": {
        "source": {
            "size": 10,
            "retriever": rerank_retriever
        }
    }
}

rerank_request = {
    "id": "rerank_query",
    "ratings": ratings,
    "template_id": "rerank_query_template",
    "params": {
        "query_string": query_string
    }
}

print("*** Lexical Eval Template ***")
print(json.dumps(lexical_query_template, indent=2))

print("\n*** Semantic Eval Template ***")
print(json.dumps(semantic_query_template, indent=2))

print("\n*** Rescore Eval Template ***")
print(json.dumps(rescore_query_template, indent=2))

print("\n*** RRF Eval Template ***")
print(json.dumps(rrf_query_template, indent=2))

print("\n*** Rerank Eval Template ***")
print(json.dumps(rerank_query_template, indent=2))

*** Lexical Eval Template ***
{
  "id": "lexical_query_template",
  "template": {
    "source": {
      "size": 10,
      "retriever": {
        "standard": {
          "query": {
            "multi_match": {
              "query": "{{query_string}}",
              "fields": [
                "title",
                "content"
              ]
            }
          }
        }
      }
    }
  }
}

*** Semantic Eval Template ***
{
  "id": "semantic_query_template",
  "template": {
    "source": {
      "size": 10,
      "retriever": {
        "standard": {
          "query": {
            "match": {
              "content.embedding": {
                "query": "{{query_string}}"
              }
            }
          }
        }
      }
    }
  }
}

*** Rescore Eval Template ***
{
  "id": "rescore_query_template",
  "template": {
    "source": {
      "size": 10,
      "retriever": {
        "rescorer": {
          "rescore": {
            "window_size": 10,
            "query": {
   

## Execute Evaluations

In [17]:
import pandas as pd

templates = [lexical_query_template, semantic_query_template, rrf_query_template, rerank_query_template, rescore_query_template]
requests = [lexical_request, semantic_request, rrf_request, rerank_request, rescore_request]
metrics = [
    {"dcg": {"k": 10, "normalize": True}},
    {"expected_reciprocal_rank": {"k": 10, "maximum_relevance": 5}},
]
results = {}

for metric in metrics:
    metric_name = next(iter(metric))
    # Named eval_body to avoid shadowing the built-in eval().
    eval_body = {
        "templates": templates,
        "requests": requests,
        "metric": metric
    }
    result = es.rank_eval(body=eval_body, index=INDEX_NAME)
    # Collect the per-query score for this metric.
    for query_id, detail in result["details"].items():
        results.setdefault(metric_name, {})[query_id] = detail["metric_score"]

df = pd.DataFrame(results).round(3)
df.rename(columns={'dcg': 'NDCG', 'expected_reciprocal_rank': 'ERR'}, inplace=True)
df.rename(index={'lexical_query': 'Lexical', 'semantic_query': 'Semantic', 'rrf_query': 'RRF', 'rerank_query': 'Rerank', 'rescore_query': 'Rescore'}, inplace=True)
df = df.reindex(['Lexical', 'Semantic', 'Rescore', 'RRF', 'Rerank'])
display(df)

| | NDCG | ERR |
|---|---|---|
| Lexical | 0.443 | 0.471 |
| Semantic | 0.91 | 0.984 |
| Rescore | 0.742 | 0.605 |
| RRF | 0.833 | 0.98 |
| Rerank | 0.822 | 0.979 |
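
To make the two metrics more concrete, here is a minimal sketch of how NDCG and ERR can be computed from the ratings of a ranked result list. It uses the common textbook definitions (exponential gain `2^rating - 1` with a `log2(rank + 1)` discount for DCG, and a cascade model with stop probability `(2^rating - 1) / 2^max_relevance` for ERR). The function names are illustrative, and the ideal ordering here is built only from the ratings passed in, which is a simplification; Elasticsearch's handling of unrated documents and of the ideal list may differ.

```python
import math


def dcg(ratings, k=10):
    """Discounted cumulative gain over the top k ratings (exponential gain)."""
    return sum(
        (2 ** rating - 1) / math.log2(rank + 1)
        for rank, rating in enumerate(ratings[:k], start=1)
    )


def ndcg(ratings, k=10):
    """DCG divided by the DCG of the same ratings in ideal (descending) order."""
    ideal = dcg(sorted(ratings, reverse=True), k)
    return dcg(ratings, k) / ideal if ideal > 0 else 0.0


def err(ratings, max_relevance=5, k=10):
    """Expected reciprocal rank under the cascade user model."""
    p_continue = 1.0  # probability the user has not stopped before this rank
    score = 0.0
    for rank, rating in enumerate(ratings[:k], start=1):
        p_stop = (2 ** rating - 1) / 2 ** max_relevance
        score += p_continue * p_stop / rank
        p_continue *= 1 - p_stop
    return score


# Hypothetical ratings of the hits returned by two retrievers, in rank order.
well_ordered = [5, 4, 3, 1]
badly_ordered = [1, 3, 4, 5]

print("NDCG:", round(ndcg(well_ordered), 3), round(ndcg(badly_ordered), 3))
print("ERR: ", round(err(well_ordered), 3), round(err(badly_ordered), 3))
```

Both metrics reward placing highly rated documents early. ERR is dominated by the first highly relevant hit, which is one reason a retriever can score well on ERR while trailing on NDCG.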


## Destroy Environment

In [18]:
%%bash
terraform -chdir=terraform destroy -auto-approve
rm -f .env

[0m[1mec_elasticsearch_project.demo_project: Refreshing state... [id=cb5446332b244e8ba93e1bb274f764a8][0m

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  [31m-[0m destroy[0m

Terraform will perform the following actions:

[1m  # ec_elasticsearch_project.demo_project[0m will be [1m[31mdestroyed[0m
[0m  [31m-[0m[0m resource "ec_elasticsearch_project" "demo_project" {
      [31m-[0m[0m alias         = "demoproject" [90m-> null[0m[0m
      [31m-[0m[0m cloud_id      = "demo_project:dXMtY2VudHJhbDEuZ2NwLmVsYXN0aWMuY2xvdWQkY2I1NDQ2MzMyYjI0NGU4YmE5M2UxYmIyNzRmNzY0YTguZXMkY2I1NDQ2MzMyYjI0NGU4YmE5M2UxYmIyNzRmNzY0YTgua2I=" [90m-> null[0m[0m
      [31m-[0m[0m credentials   = {
          [31m-[0m[0m password = "55SMgEkc34c92qSFd6s89a8o" [90m-> null[0m[0m
          [31m-[0m[0m username = "admin" [90m-> null[0m[0m
        } [90m-> null[0m[0m
      [31m-[0m[0m e