### About 
### Information retrieval using Elasticsearch and a custom search method. Hit rate (recall) and MRR are calculated from the search results
### The code is copied from [LLM_ZOOMCAMP](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/evaluate-text.ipynb)

### Import necessary libraries and packages

In [1]:
%pip install elasticsearch -qq 

Note: you may need to restart the kernel to use updated packages.


In [4]:
%pip install --upgrade ipywidgets -qq

Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import io
import os
import json
import requests
from tqdm.notebook import tqdm
import subprocess
import time
import elasticsearch
from elasticsearch import Elasticsearch

### Elasticsearch setup

In [2]:
es_client = Elasticsearch('http://localhost:9200/', request_timeout=60) 

In [3]:
!curl localhost:9200

{
  "name" : "39fa54c0395d",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "F9eEOskoTla861ciE6Pi5w",
  "version" : {
    "number" : "8.4.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
    "build_date" : "2022-10-04T07:17:24.662462378Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}


### Get document

In [4]:
!wget https://raw.githubusercontent.com/hariprasath-v/Nnet101_Assistant/refs/heads/main/data/nnet_101_qna_with_id.json

--2024-10-07 17:05:47--  https://raw.githubusercontent.com/hariprasath-v/Nnet101_Assistant/refs/heads/main/data/nnet_101_qna_with_id.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 534194 (522K) [text/plain]
Saving to: ‘nnet_101_qna_with_id.json’


2024-10-07 17:05:48 (2.69 MB/s) - ‘nnet_101_qna_with_id.json’ saved [534194/534194]



### Load the document

In [5]:
with open('nnet_101_qna_with_id.json', 'rt') as f_in:
    documents = json.load(f_in)

In [6]:
documents[0]

{'question': 'How to choose the number of hidden layers and nodes in a feedforward neural network?',
 'tags': 'model-selection|neural-networks',
 'answer': "**Network Configuration in Neural Networks**\n\n**Standardization**\nThere is no single standardized method for configuring networks. However, guidelines exist for setting the number and type of network layers, as well as the number of neurons in each layer.\n\n**Initial Architecture Setup**\nBy following specific rules, one can establish a competent network architecture. This involves determining the number and type of neuronal layers and the number of neurons within each layer. This approach provides a foundational architecture but may not be optimal.\n\n**Iterative Tuning**\nOnce the network is initialized, its configuration can be iteratively tuned during training. Ancillary algorithms, such as pruning, can be used to eliminate unnecessary nodes, optimizing the network's size and performance.\n\n**Network Layer Types and Sizing

### Create the index in Elasticsearch and add the documents

In [7]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "question": {"type": "text"},
            "answer": {"type": "text"},
            "tags": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

index_name = "nnet101"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'nnet101'})

In [8]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/500 [00:00<?, ?it/s]
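
Indexing one document per request works for 500 documents. For larger collections, the client's `elasticsearch.helpers.bulk` helper can send documents in batches. The following is a minimal sketch under assumptions: `build_bulk_actions` is a hypothetical helper name, the sample documents are made up, and the actual `bulk` call is left commented out because it needs a running cluster.

```python
def build_bulk_actions(documents, index_name):
    """Yield one bulk action per document (hypothetical helper, not part of the notebook)."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


# Toy documents for illustration only; the real ones come from nnet_101_qna_with_id.json.
sample_docs = [
    {"id": "a1", "question": "q1", "answer": "a1", "tags": "t1"},
    {"id": "b2", "question": "q2", "answer": "a2", "tags": "t2"},
]

actions = list(build_bulk_actions(sample_docs, "nnet101"))
print(actions[0])

# With a running cluster, the actions could be sent in one call:
# from elasticsearch.helpers import bulk
# bulk(es_client, build_bulk_actions(documents, index_name))
```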

### Elasticsearch query

In [10]:
def elastic_search(query):
    # Match the query against all three fields; the "question" field score is boosted 3x
    search_query = {
        "size": 5,  # return the top 5 hits
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "answer", "tags"],
                        "type": "best_fields"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    # Keep only the stored documents and drop Elasticsearch metadata such as _score
    return [hit['_source'] for hit in response['hits']['hits']]
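
To try other result sizes or boosts without editing the function, the request body can be built separately. This is a sketch only: `build_search_query` and its parameters `size` and `question_boost` are hypothetical names, and the body mirrors the `multi_match` query used in this notebook.

```python
def build_search_query(query, size=5, question_boost=3):
    """Return a multi_match request body with the size and question boost exposed."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": [f"question^{question_boost}", "answer", "tags"],
                        "type": "best_fields",
                    }
                }
            }
        },
    }


body = build_search_query("what is hidden layer?", size=10, question_boost=2)
print(body["query"]["bool"]["must"]["multi_match"]["fields"])
```

The returned dictionary could then be passed to `es_client.search` in place of the hard-coded one.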

### Sample search

In [13]:
elastic_search(
    query="what is hidden layer?")

[{'question': 'What does the hidden layer in a neural network compute?',
  'tags': 'machine-learning|neural-networks|nonlinear-regression',
  'answer': "**Three-sentence summary:**\n\nNeural networks apply functions such as linear transformations and nonlinearities to data, with each layer building upon the previous one. Hidden layers transform the data for easier processing by the output layer, which produces the final result.\n\n**Like you're 5:**\n\nImagine you want a computer to recognize buses. You can create detectors for wheels, boxes, and size. These detectors work together in hidden layers to form a toolset for bus recognition. If all detectors are active, there's a good chance a bus is present. Neural networks provide easy ways to combine many detectors.\n\n**Like you're an adult:**\n\nNeural networks apply functions (e.g., linear transformations and nonlinearities) to data, with subsequent layers building upon each other. The hidden layer's activation (transformation of inpu

### Get ground truth data

In [14]:
df_ground_truth_url = 'https://raw.githubusercontent.com/hariprasath-v/Nnet101_Assistant/refs/heads/main/data/ground-truth-data.csv'
df_ground_truth=pd.read_csv(df_ground_truth_url)
df_ground_truth.head()

Unnamed: 0,question,tags,document
0,How do I choose the number of hidden layers in...,model-selection|neural-networks,f55240b8
1,How many nodes should I use in each hidden layer?,model-selection|neural-networks,f55240b8
2,When should I use pruning to optimize network ...,model-selection|neural-networks,f55240b8
3,What is the relationship between the number of...,model-selection|neural-networks,f55240b8
4,How can I determine the optimal network size f...,model-selection|neural-networks,f55240b8


### Convert each row to dictionary

In [15]:
ground_truth = df_ground_truth.to_dict(orient='records')

In [16]:
ground_truth[0]

{'question': 'How do I choose the number of hidden layers in a neural network?',
 'tags': 'model-selection|neural-networks',
 'document': 'f55240b8'}

### Search each question and mark a result as relevant when its id matches the ground-truth document id

In [20]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = elastic_search(query=q['question'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)
    

  0%|          | 0/2500 [00:00<?, ?it/s]
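
Each entry of `relevance_total` is a list of booleans, one per returned document, in ranked order. The sketch below shows this for a single question. `fake_search` is a hypothetical stand-in for the real search function, and all ids except `f55240b8` are invented for illustration.

```python
def fake_search(query):
    """Stand-in for elastic_search: returns hard-coded documents (illustrative ids)."""
    return [{"id": "aaa11111"}, {"id": "f55240b8"}, {"id": "bbb22222"}]


q = {
    "question": "How do I choose the number of hidden layers in a neural network?",
    "document": "f55240b8",
}

results = fake_search(q["question"])
# True where the returned document is the one the question was generated from
relevance = [d["id"] == q["document"] for d in results]
print(relevance)  # [False, True, False]
```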

### Sample relevance result

In [21]:
example = [
    [True, False, False, False, False], # 1, 
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0 
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0 
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1 
    [False, False, True, False, False],  # 1/3
    [False, False, False, False, False], # 0
]

# 1 => 1
# 2 => 1 / 2 = 0.5
# 3 => 1 / 3 = 0.3333
# 4 => 0.25
# 5 => 0.2
# rank => 1 / rank
# none => 0
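
The rank-to-score mapping above can be written as a small function for a single result list. This is a sketch; `reciprocal_rank` is a hypothetical helper name that is not used elsewhere in the notebook.

```python
def reciprocal_rank(relevance):
    """Return 1 / rank of the first True (1-based), or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1 / rank
    return 0.0


print(reciprocal_rank([True, False, False, False, False]))   # 1.0
print(reciprocal_rank([False, True, False, False, False]))   # 0.5
print(reciprocal_rank([False, False, False, False, False]))  # 0.0
```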

### Hit-rate (True positive rate, Recall, Sensitivity) calculation

```
[True, False, False, False, False], # 1, 
[False, False, False, False, False], # 0
[False, False, False, False, False], # 0 
[False, False, False, False, False], # 0
[False, False, False, False, False], # 0 
[True, False, False, False, False], # 1
[True, False, False, False, False], # 1
[True, False, False, False, False], # 1
[True, False, False, False, False], # 1
[True, False, False, False, False], # 1 
[False, False, True, False, False],  # 1
[False, False, False, False, False] # 0

Total_1 = 7
Total_sample = 12
Hit-rate (True positive rate, Recall, Sensitivity) = TP / (TP + FN) = Total_1 / Total_sample = 7/12 = 0.5833333333333334

```

In [22]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)
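
An equivalent, more compact formulation uses `any` to test whether a result list contains a hit. This is a sketch rather than a replacement: `hit_rate_any` is a hypothetical name, and the example list is rebuilt here from the same 12 rows so the block runs on its own.

```python
def hit_rate_any(relevance_total):
    """Fraction of queries whose result list contains at least one True."""
    return sum(any(line) for line in relevance_total) / len(relevance_total)


first = [True, False, False, False, False]   # relevant document at rank 1
third = [False, False, True, False, False]   # relevant document at rank 3
none = [False, False, False, False, False]   # relevant document not returned

# Same 12 rows as the example above: 7 hits, 5 misses
example = [first] + [none] * 4 + [first] * 5 + [third] + [none]

print(hit_rate_any(example))
```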

### Mean Reciprocal Rank (MRR) calculation
```
reciprocal rank = 1 / position (rank) of the True value, or 0 if there is no True

[True, False, False, False, False], # 1/1
[False, False, False, False, False], # 0
[False, False, False, False, False], # 0 
[False, False, False, False, False], # 0
[False, False, False, False, False], # 0 
[True, False, False, False, False], # 1/1
[True, False, False, False, False], # 1/1
[True, False, False, False, False], # 1/1
[True, False, False, False, False], # 1/1
[True, False, False, False, False], # 1/1
[False, False, True, False, False],  # 1/3
[False, False, False, False, False] # 0

Total_1_rank = 1/1 + 1/1 + 1/1 + 1/1 + 1/1 + 1/1 + 1/3 = 6.333333333333333
Total_sample = 12
Mean Reciprocal Rank (MRR) = Total_1_rank / Total_sample = 6.333333333333333/12 = 0.5277777777777778

```

In [23]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        # Only the first relevant result counts towards the reciprocal rank
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
                break

    return total_score / len(relevance_total)

In [24]:
hit_rate(example)

0.5833333333333334

In [25]:
mrr(example)

0.5277777777777778

### Hit-rate and MRR for the entire ground-truth set

In [26]:
hit_rate(relevance_total), mrr(relevance_total)

(0.5828, 0.43964666666666774)

### Get the minsearch script from the course repo

In [27]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-10-07 17:12:07--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-10-07 17:12:07 (37.3 MB/s) - ‘minsearch.py’ saved [3832/3832]



### Create the TF-IDF index

In [28]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "answer"],
    keyword_fields=["tags", "id"]
)

index.fit(documents)

<minsearch.Index at 0x7bfd79fcc710>

### Cosine similarity search based on the TF-IDF

In [46]:
def minsearch_search(query):
    boost = {'question': 3.0}

    results = index.search(
        query=query,
        boost_dict=boost,
        num_results=5
    )

    return results
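
To make the idea behind this search concrete, here is an illustrative sketch of TF-IDF vectors compared by cosine similarity in plain Python. It is not minsearch's actual implementation: the whitespace tokenizer, the smoothed IDF formula, and the three toy documents are assumptions, and field boosting and keyword filtering are left out.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every term in docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Map text to a sparse {term: tf * idf} vector; unseen terms are dropped."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus for illustration only
docs = [
    "what does the hidden layer compute",
    "how to choose a learning rate",
    "why use dropout for regularization",
]
idf = fit_idf(docs)
doc_vectors = [tfidf_vector(d, idf) for d in docs]

query_vector = tfidf_vector("hidden layer", idf)
scores = [cosine(query_vector, dv) for dv in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # what does the hidden layer compute
```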

### Search for the relevant document using the custom search

In [47]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)
    

  0%|          | 0/2500 [00:00<?, ?it/s]

### Result

In [48]:
hit_rate(relevance_total), mrr(relevance_total)

(0.5572, 0.4396800000000009)

### Function to return hit-rate and MRR

In [35]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
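
A dry run with a toy search function can confirm that the evaluation logic behaves as expected before running it on all 2500 questions. The sketch below redefines compact versions of the metrics so that it runs on its own. `toy_index`, `toy_ground_truth`, and `toy_search` are hypothetical names with made-up data, and the progress bar is omitted.

```python
def hit_rate(relevance_total):
    return sum(any(line) for line in relevance_total) / len(relevance_total)


def mrr(relevance_total):
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


def evaluate(ground_truth, search_function):
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d["id"] == q["document"] for d in results])
    return {"hit_rate": hit_rate(relevance_total), "mrr": mrr(relevance_total)}


# Made-up ranked results per question
toy_index = {"q1": ["d1", "d2"], "q2": ["d9", "d2"], "q3": ["d7", "d8"]}
toy_ground_truth = [
    {"question": "q1", "document": "d1"},  # hit at rank 1
    {"question": "q2", "document": "d2"},  # hit at rank 2
    {"question": "q3", "document": "d3"},  # miss
]


def toy_search(q):
    return [{"id": doc_id} for doc_id in toy_index[q["question"]]]


scores = evaluate(toy_ground_truth, toy_search)
print(scores)
```

With two hits out of three questions, at ranks 1 and 2, the hit rate is 2/3 and the MRR is (1 + 0.5 + 0) / 3 = 0.5.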

### Elasticsearch relevance results 

In [36]:
evaluate(ground_truth, lambda q: elastic_search(q['question']))

  0%|          | 0/2500 [00:00<?, ?it/s]

{'hit_rate': 0.5828, 'mrr': 0.43964666666666774}

### Custom search relevance results 

In [37]:
evaluate(ground_truth, lambda q: minsearch_search(q['question']))

  0%|          | 0/2500 [00:00<?, ?it/s]

{'hit_rate': 0.5572, 'mrr': 0.4396800000000009}

In [44]:
results = pd.DataFrame([{'type':'text_elasticsearch', 'hit_rate': 0.5828, 'mrr': 0.43964666666666774},
{'type':'text_customsearch', 'hit_rate': 0.5572, 'mrr': 0.4396800000000009}])

In [45]:
results.to_csv("/workspaces/Nnet101_Assistant/data/text_custom_and_elasticsearch_scores.csv", index=False)
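
The scores above are typed in by hand. One way to avoid copy errors is to build the rows from the dictionaries that `evaluate` returns. The sketch below uses the standard-library `csv` module instead of pandas and writes to an in-memory buffer. `scores_by_type` is a hypothetical name, and its values are copied from the outputs above rather than recomputed.

```python
import csv
import io

# In the notebook these dictionaries would be the return values of evaluate(...)
scores_by_type = {
    "text_elasticsearch": {"hit_rate": 0.5828, "mrr": 0.43964666666666774},
    "text_customsearch": {"hit_rate": 0.5572, "mrr": 0.4396800000000009},
}

# One row per search type, with the type name placed alongside its scores
rows = [{"type": name, **scores} for name, scores in scores_by_type.items()]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["type", "hit_rate", "mrr"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The same `rows` list could also be passed to `pd.DataFrame(rows)` and saved with `to_csv(..., index=False)` as in the cell above.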