### About
### Information retrieval with Elasticsearch vector search. Hit rate and MRR are calculated from the search results.
### The code is copied from [LLM_ZOOMCAMP](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/evaluate-vector.ipynb)

### Import necessary libraries and packages

In [1]:
pip install elasticsearch -qq 

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install sentence_transformers==2.7.0 -qq

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import io
import os
import json
import requests
from tqdm.notebook import tqdm
import subprocess
import time
import elasticsearch
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer
from IPython.display import clear_output



### Elasticsearch setup

In [3]:
es_client = Elasticsearch('http://localhost:9200/', request_timeout=60) 

In [4]:
!curl localhost:9200

{
  "name" : "49805aac582e",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "qJTNmNE-Seyi_lJm1cfY6g",
  "version" : {
    "number" : "8.4.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
    "build_date" : "2022-10-04T07:17:24.662462378Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}


### Get data

In [None]:
!wget https://raw.githubusercontent.com/hariprasath-v/Nnet101_Assistant/refs/heads/main/data/nnet_101_qna_with_id.json

### Load the documents

In [6]:
with open('nnet_101_qna_with_id.json', 'rt') as f_in:
    documents = json.load(f_in)

In [7]:
documents[0]

{'question': 'How to choose the number of hidden layers and nodes in a feedforward neural network?',
 'tags': 'model-selection|neural-networks',
 'answer': "**Network Configuration in Neural Networks**\n\n**Standardization**\nThere is no single standardized method for configuring networks. However, guidelines exist for setting the number and type of network layers, as well as the number of neurons in each layer.\n\n**Initial Architecture Setup**\nBy following specific rules, one can establish a competent network architecture. This involves determining the number and type of neuronal layers and the number of neurons within each layer. This approach provides a foundational architecture but may not be optimal.\n\n**Iterative Tuning**\nOnce the network is initialized, its configuration can be iteratively tuned during training. Ancillary algorithms, such as pruning, can be used to eliminate unnecessary nodes, optimizing the network's size and performance.\n\n**Network Layer Types and Sizing

### Get sentence transformer model

In [8]:
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)



### Sample Encode

In [9]:
v = model.encode('what is dropout?')

### Embedding dimension

In [10]:
v.shape

(384,)

### Create index for elasticsearch

In [11]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "answer": {"type": "text"},
            "question": {"type": "text"},
            "tags": {"type": "keyword"},
            "id": {"type": "keyword"},
            "question_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "answer_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "question_answer_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
        }
    }
}

index_name = "nnet101"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'nnet101'})
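
The mapping above sets `"similarity": "cosine"` on each `dense_vector` field. As a rough illustration of what that means, the sketch below computes cosine similarity in plain Python and applies the `(1 + cosine) / 2` transform that, as far as the Elasticsearch `dense_vector` documentation describes it, turns the cosine into the kNN `_score`. The helper names `cosine_similarity` and `knn_score_from_cosine` are hypothetical and not part of the notebook or of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_score_from_cosine(cos):
    """Map a cosine in [-1, 1] to a score in [0, 1] (assumed kNN scoring rule)."""
    return (1.0 + cos) / 2.0

# Toy 2-dimensional vectors; the real index uses 384 dimensions.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # unrelated direction

print(knn_score_from_cosine(same), knn_score_from_cosine(orthogonal))
```

Under this assumption, a `_score` of about 0.77, as seen in the sample search later on, would correspond to a cosine of roughly 0.54.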

### Encode the question, the answer, and the combined question+answer, and add the vectors to each document

In [12]:
for doc in tqdm(documents):
    question = doc['question']
    answer = doc['answer']
    qa = question + ' ' + answer

    doc['question_vector'] = model.encode(question, show_progress_bar=False)
    doc['answer_vector'] = model.encode(answer,  show_progress_bar=False)
    doc['question_answer_vector'] = model.encode(qa,  show_progress_bar=False)

  0%|          | 0/500 [00:00<?, ?it/s]

### Document after adding vectors

In [13]:
for k in documents[0].keys():
    if k in ['question_vector', 'answer_vector', 'question_answer_vector']:
        print(k,':',len(documents[0][k]))
    else:
        print(k,':',documents[0][k])

question : How to choose the number of hidden layers and nodes in a feedforward neural network?
tags : model-selection|neural-networks
answer : **Network Configuration in Neural Networks**

**Standardization**
There is no single standardized method for configuring networks. However, guidelines exist for setting the number and type of network layers, as well as the number of neurons in each layer.

**Initial Architecture Setup**
By following specific rules, one can establish a competent network architecture. This involves determining the number and type of neuronal layers and the number of neurons within each layer. This approach provides a foundational architecture but may not be optimal.

**Iterative Tuning**
Once the network is initialized, its configuration can be iteratively tuned during training. Ancillary algorithms, such as pruning, can be used to eliminate unnecessary nodes, optimizing the network's size and performance.

**Network Layer Types and Sizing**
Every neural network 

### Index the documents into Elasticsearch

In [14]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/500 [00:00<?, ?it/s]
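
Indexing one document per request works for 500 documents, but it can become slow for larger collections. A possible alternative, sketched here under the assumption that the `elasticsearch.helpers.bulk` helper is available in the installed client, is to send the documents in batches. The function names `build_bulk_actions` and `bulk_index` are hypothetical, and the vectors are converted to plain lists so that the actions do not depend on NumPy serialization.

```python
def build_bulk_actions(docs, index_name):
    """Yield one bulk action per document, converting arrays to plain lists."""
    for doc in docs:
        source = {
            key: (value.tolist() if hasattr(value, "tolist") else value)
            for key, value in doc.items()
        }
        yield {"_index": index_name, "_source": source}

def bulk_index(es_client, docs, index_name):
    """Send all documents through the bulk helper (needs a running cluster)."""
    # Imported here so the action builder above can be used without the client.
    from elasticsearch import helpers
    return helpers.bulk(es_client, build_bulk_actions(docs, index_name))

# Toy document standing in for an entry of `documents`.
sample_docs = [{"id": "a", "question_vector": [0.1, 0.2]}]
print(list(build_bulk_actions(sample_docs, "nnet101")))
```

With a live cluster, `bulk_index(es_client, documents, index_name)` would replace the loop above; this call is not executed here.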

### Sample search

In [15]:
query = 'what is pooling layer?'

In [16]:
v_q = model.encode(query,  show_progress_bar=False)

In [17]:
q={
    "field": "answer_vector",
    "query_vector": v_q,
    "k": 5,
    "num_candidates": 10000, 
}
es_client.search(index=index_name, knn=q, 
                 source=["answer", "tags", "question"])["hits"]["hits"]

[{'_index': 'nnet101',
  '_id': 'JaWrapIBD6eHS5kjCZfY',
  '_score': 0.76976824,
  '_source': {'question': 'How to calculate output shape in 3D convolution',
   'answer': '**Convolution Layer Summary:**\n\nThe convolution formula determines the output size of a convolution layer. It considers the input size, receptive field (kernel) size, stride, and zero padding. In the example, with $W=40$, $F=3$, $S=1$, and $P=0$, the output size is $(38, 62, 62, 8)$.\n\n**Pooling Layer Summary:**\n\nPooling layers reduce spatial dimensions. By default, they halve each dimension with a receptive field of $(2, 2, 2)$ and a stride of $(2, 2, 2)$. However, if the stride is set to $(1, 1, 1)$, each dimension is reduced by 1 instead. For instance, the tensor $(38, 62, 62, 8)$ would become $(19, 31, 31, 8)$ with a stride of $(2, 2, 2)$ and $(37, 61, 61, 8)$ with a stride of $(1, 1, 1)$.',
   'tags': 'machine-learning|neural-networks|convolutional-neural-network'}},
 {'_index': 'nnet101',
  '_id': '-KWrapIB

### KNN vector search

In [18]:
def elastic_search_knn(field, vector):
    knn = {
        "field": field,
        "query_vector": vector,
        "k": 5,
        "num_candidates": 10000,
        
    }

    search_query = {
        "knn": knn,
        "_source": ["answer", "tags", "question", "id"]
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

### Get ground truth data (LLM-generated questions) from the project repo

In [19]:
df_ground_truth_url = 'https://raw.githubusercontent.com/hariprasath-v/Nnet101_Assistant/refs/heads/main/data/ground-truth-data.csv'
df_ground_truth=pd.read_csv(df_ground_truth_url)
df_ground_truth.head()

| | question | tags | document |
|---|---|---|---|
| 0 | How do I choose the number of hidden layers in... | model-selection\|neural-networks | f55240b8 |
| 1 | How many nodes should I use in each hidden layer? | model-selection\|neural-networks | f55240b8 |
| 2 | When should I use pruning to optimize network ... | model-selection\|neural-networks | f55240b8 |
| 3 | What is the relationship between the number of... | model-selection\|neural-networks | f55240b8 |
| 4 | How can I determine the optimal network size f... | model-selection\|neural-networks | f55240b8 |


### Convert each row to dictionary

In [20]:
ground_truth = df_ground_truth.to_dict(orient='records')

### Hit-rate metric

In [21]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)
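
As a small worked example of the hit-rate metric, the sketch below uses a hand-made relevance list for four queries with five results each. The data is invented for illustration, and the function is a compact restatement of the logic above rather than the notebook's exact code.

```python
def hit_rate(relevance_total):
    """Fraction of queries whose result list contains the expected document."""
    hits = sum(1 for line in relevance_total if True in line)
    return hits / len(relevance_total)

# One row per query, one boolean per returned result (rank 1 first).
relevance_total = [
    [False, True, False, False, False],   # hit at rank 2
    [False, False, False, False, False],  # miss
    [True, False, False, False, False],   # hit at rank 1
    [False, False, False, False, True],   # hit at rank 5
]

print(hit_rate(relevance_total))  # 3 of the 4 queries have a hit
```

Hit rate ignores the position of the hit, so the first, third, and fourth queries all count equally.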

### MRR metric

In [22]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                # Reciprocal rank of the first relevant result only
                total_score = total_score + 1 / (rank + 1)
                break

    return total_score / len(relevance_total)
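
To make the MRR metric concrete, here is a worked example on an invented relevance list. The function is a sketch of mean reciprocal rank in its usual form, where each query contributes the reciprocal of the rank of its first relevant result, or 0 if there is none.

```python
def mrr(relevance_total):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(relevance_total)

# One row per query, one boolean per returned result (rank 1 first).
relevance_total = [
    [False, True, False, False, False],   # reciprocal rank 1/2
    [False, False, False, False, False],  # no hit, contributes 0
    [True, False, False, False, False],   # reciprocal rank 1
    [False, False, False, False, True],   # reciprocal rank 1/5
]

print(mrr(relevance_total))  # (0.5 + 0 + 1 + 0.2) / 4, about 0.425
```

Unlike hit rate, MRR rewards placing the expected document near the top, which is why it is always less than or equal to the hit rate for the same results.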

In [23]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
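
The `evaluate` function expects `search_function` to take one ground-truth record and return a list of documents that each carry an `id`. The sketch below shows that contract with a hypothetical `fake_search` function and invented ids, so the relevance matrix can be inspected without a running Elasticsearch instance.

```python
def fake_search(q):
    """Hypothetical stand-in for a search function: returns docs with an 'id' key."""
    canned = {
        "what is dropout?": [{"id": "d2"}, {"id": "d1"}],
        "what is pooling?": [{"id": "d3"}, {"id": "d4"}],
    }
    return canned[q["question"]]

# Each record names the question and the id of the document it was generated from.
toy_ground_truth = [
    {"question": "what is dropout?", "document": "d1"},
    {"question": "what is pooling?", "document": "d9"},
]

# Same comparison that evaluate() performs for every query.
relevance_total = [
    [d["id"] == q["document"] for d in fake_search(q)]
    for q in toy_ground_truth
]

print(relevance_total)  # [[False, True], [False, False]]
```

The first query finds its document at rank 2 and the second never finds it, so this toy run would give a hit rate of 0.5.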

### Function to embed a question and return the result based on the question vector

In [24]:
def question_vector_knn(q):
    question = q['question']


    v_q = model.encode(question,  show_progress_bar=False)

    return elastic_search_knn('question_vector', v_q)

### Evaluation result for question vector

In [25]:
evaluate(ground_truth, question_vector_knn)

  0%|          | 0/2500 [00:00<?, ?it/s]

{'hit_rate': 0.6256, 'mrr': 0.5150333333333336}

### Function to embed a question and return the results based on the answer vector

In [26]:
def text_vector_knn(q):
    question = q['question']
    

    v_q = model.encode(question, show_progress_bar=False)

    return elastic_search_knn('answer_vector', v_q)

### Evaluation result for answer vector

In [27]:
evaluate(ground_truth, text_vector_knn)

  0%|          | 0/2500 [00:00<?, ?it/s]

{'hit_rate': 0.8308, 'mrr': 0.7089066666666664}

### Function to embed a question and return the results based on the combined question and answer vector

In [28]:
def question_text_vector_knn(q):
    question = q['question']
   

    v_q = model.encode(question, show_progress_bar=False)

    return elastic_search_knn('question_answer_vector', v_q)



### Evaluation result for combined answer and question

In [29]:
evaluate(ground_truth, question_text_vector_knn)

  0%|          | 0/2500 [00:00<?, ?it/s]

{'hit_rate': 0.8548, 'mrr': 0.7323599999999993}

### Custom scoring based on the sum of cosine similarities between the query vector and the question, answer, and question+answer vectors

In [30]:
def elastic_search_knn_combined(vector):
    search_query = {
        "size": 5,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": """
                        cosineSimilarity(params.query_vector, 'question_vector') + 
                        cosineSimilarity(params.query_vector, 'answer_vector') + 
                        cosineSimilarity(params.query_vector, 'question_answer_vector') + 
                        1
                    """,
                    "params": {
                        "query_vector": vector
                    }
                }
            }
        },
        "_source": ["answer", "question", "tags", "id"]
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs
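
The `script_score` query above scores every document as the sum of three cosine similarities plus 1. The sketch below reproduces that arithmetic in plain Python on invented 2-dimensional vectors; the names `cosine` and `combined_score` are hypothetical. The `+ 1` offset is presumably there because Elasticsearch rejects negative scores from `script_score`, although a sum of three cosines could in principle fall below -1, so this is an assumption about intent rather than a guarantee.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def combined_score(query_vec, doc):
    """Sum of the three per-field cosine similarities, offset by 1."""
    return (
        cosine(query_vec, doc["question_vector"])
        + cosine(query_vec, doc["answer_vector"])
        + cosine(query_vec, doc["question_answer_vector"])
        + 1.0
    )

# Toy document: two fields aligned with the query, one orthogonal to it.
doc = {
    "question_vector": [1.0, 0.0],
    "answer_vector": [0.0, 1.0],
    "question_answer_vector": [1.0, 0.0],
}
score = combined_score([1.0, 0.0], doc)
print(score)  # cosines of 1, 0 and 1, plus the offset of 1
```

Because `match_all` is used as the inner query, this score is computed for every document in the index, which is an exact scan rather than an approximate kNN search.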


### Search based on combined vector similarity

In [31]:
def vector_combined_knn(q):
    question = q['question']
    

    v_q = model.encode(question, show_progress_bar=False)

    return elastic_search_knn_combined(v_q)



### Evaluation result for combined vector similarity

In [32]:
evaluate(ground_truth, vector_combined_knn)

  0%|          | 0/2500 [00:00<?, ?it/s]

{'hit_rate': 0.832, 'mrr': 0.7055066666666656}

In [33]:


results = pd.DataFrame([{'type':'question_vector_elasticsearch', 'hit_rate': 0.6256, 'mrr': 0.5150333333333336},
{'type':'answer_vector_elasticsearch', 'hit_rate': 0.8308, 'mrr': 0.7089066666666664},
{'type':'question-answer_vector_elasticsearch', 'hit_rate': 0.8548, 'mrr': 0.7323599999999993},
{'type':'custom-combined_vector_scoring_elasticsearch', 'hit_rate': 0.832, 'mrr': 0.7055066666666656}
])

In [34]:
results

| | type | hit_rate | mrr |
|---|---|---|---|
| 0 | question_vector_elasticsearch | 0.6256 | 0.515033 |
| 1 | answer_vector_elasticsearch | 0.8308 | 0.708907 |
| 2 | question-answer_vector_elasticsearch | 0.8548 | 0.732360 |
| 3 | custom-combined_vector_scoring_elasticsearch | 0.8320 | 0.705507 |


In [35]:
results.to_csv("/workspaces/Nnet101_Assistant/data/multiple_retrieval_evaluation_elasticsearch_scores.csv", index=False)