<a href="https://colab.research.google.com/github/maor63/PassageRetrievalTechniques/blob/main/Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [23]:
%%capture
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB:
  !pip install langchain
  !pip install faiss-cpu
  !pip install huggingface_hub
  !pip install sentence_transformers
  !pip install gradio

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
from tqdm import tqdm
from transformers import AutoTokenizer
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import faiss
from scipy import stats
from sklearn.metrics import ndcg_score
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import gradio as gr

# Preprocess data

In [3]:
embedding_model = 'sentence-transformers/multi-qa-MiniLM-L6-cos-v1'

In [1]:
# data_path = Path("/content/drive/MyDrive/HomeProjects/Sleek/data")
data_path = Path("")
with open(data_path / "J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt", "r") as f:
    book = f.read()



In [5]:
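def split_book_into_chapters(book):
  """Split the book text into chapters.

  A line in all caps with no double quote is treated as a chapter heading;
  every other non-empty line is stored as one entry under the current heading.
  """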

  chapters = {}
  current_chapter = None
  current_paragraphs = []

  for line in book.split('\n'):
    line = line.strip()
    if not line:  # Skip empty lines
        continue

    if line.isupper() and '"' not in line:
      if current_chapter:
          chapters[current_chapter] = current_paragraphs
      current_chapter = line
      current_paragraphs = []
    elif current_chapter:
      current_paragraphs.append(line)

  if current_chapter:  # Add the last chapter
      chapters[current_chapter] = current_paragraphs

  return chapters


# Example usage with your book variable
chapters = split_book_into_chapters(book)

In [6]:
for chapter, paragraphs in chapters.items():
  print(chapter)
  if len(paragraphs) > 0:
    sizes = np.array([len(paragraph.split()) for paragraph in paragraphs])
    print('Avg paragraph size:', sizes.mean(), sizes.std())
    print('Max paragraph size:', sizes.max())
    print('Min paragraph size:', sizes.min())

CHAPTER ONE
THE BOY WHO LIVED
Avg paragraph size: 11.205314009661835 3.361313116563452
Max paragraph size: 17
Min paragraph size: 1
CHAPTER TWO
THE VANISHING GLASS
Avg paragraph size: 10.974683544303797 3.774413135854229
Max paragraph size: 17
Min paragraph size: 1
CHAPTER THREE
THE LETTERS FROM NO ONE
Avg paragraph size: 10.531506849315068 3.9589979606830834
Max paragraph size: 16
Min paragraph size: 1
BOOM.
Avg paragraph size: 10.0 2.0
Max paragraph size: 12
Min paragraph size: 8
CHAPTER FOUR
THE KEEPER OF THE KEYS
Avg paragraph size: 9.714285714285714 4.233009254294015
Max paragraph size: 16
Min paragraph size: 2
SMASH!
Avg paragraph size: 10.481481481481481 4.243859190071655
Max paragraph size: 17
Min paragraph size: 1
CHAPTER FIVE
DIAGON ALLEY
Avg paragraph size: 9.5390625 4.300985250043733
Max paragraph size: 16
Min paragraph size: 1
UNIFORM
Avg paragraph size: 8.0 2.0816659994661326
Max paragraph size: 10
Min paragraph size: 4
COURSE BOOKS
Avg paragraph 

In [7]:
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    AutoTokenizer.from_pretrained(embedding_model_name),
    chunk_size=200,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""],
)

print("Splitting documents...")
source_docs = []
for chapter, paragraphs in chapters.items():
  source_docs.append(Document(page_content='\n'.join(paragraphs), metadata={"source": chapter}))

docs_processed = text_splitter.split_documents(source_docs)
for i, doc in enumerate(docs_processed):
  doc.metadata['index'] = i

Splitting documents...
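
To make the effect of `chunk_size=200` and `chunk_overlap=100` concrete, here is a minimal sketch of fixed-size windows with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to break at the listed separators and measures length in tokenizer tokens, whereas the hypothetical `sliding_chunks` helper below just counts items in a list of words.

```python
def sliding_chunks(words, chunk_size, chunk_overlap):
    """Split a list of words into windows of `chunk_size` that share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input: ten placeholder "words" instead of real book text
words = [f"w{i}" for i in range(10)]
chunks = sliding_chunks(words, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With an overlap of half the chunk size, most of the text lands in two consecutive chunks, which is why neighbouring search results can repeat the same sentences.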


# Lexical Retrieval

## Embed paragraphs with TF-IDF

In [8]:
corpus = []
for doc in docs_processed:
  corpus.append(doc.page_content)
corpus = pd.Series(corpus)

tf_idf_vectorizer = TfidfVectorizer()
tf_idf_embeddings = tf_idf_vectorizer.fit_transform(corpus)
tf_idf_embeddings.shape

(1411, 5755)

In [9]:
dimension = tf_idf_embeddings.shape[1]
# Faiss works on dense float32 arrays
vectors = tf_idf_embeddings.toarray().astype(np.float32)

# # L2 distance
# index = faiss.IndexFlatL2(dimension)
# index.add(vectors)

# Cosine similarity: TfidfVectorizer L2-normalizes each row by default (norm='l2'),
# so the inner product of two rows already equals their cosine similarity
index = faiss.IndexFlatIP(dimension)
index.add(vectors)

faiss.write_index(index, str(data_path / "harry_potter_book1_tf_idf_index.faiss"))
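
An inner-product index can stand in for cosine similarity whenever the stored vectors have unit length, which `TfidfVectorizer` provides by default through `norm='l2'`. The sketch below checks that identity in plain Python on two made-up vectors; the helper names are illustrative only.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cosine(u, v):
    """Cosine similarity computed directly from the definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

# Inner product after normalization vs. cosine on the raw vectors
inner_of_normalized = dot(l2_normalize(u), l2_normalize(v))
print(inner_of_normalized, cosine(u, v))
```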

In [10]:
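def lexical_search(query, top_k=5):
    """Return the top_k (document, similarity) pairs for the query under TF-IDF."""
    # Faiss works on dense float32 arrays
    query_embedding = tf_idf_vectorizer.transform([query]).toarray().astype(np.float32)
    similarities, indices = index.search(query_embedding, top_k)
    results = [(docs_processed[i], similarity) for i, similarity in zip(indices[0], similarities[0])]
    return results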

## Example 1: "Where does Uncle Vernon work?"

In [11]:
query = "Where does Uncle Vernon work?"
top_k = 5
results = lexical_search(query, top_k=top_k)
for i, (doc, similarity) in enumerate(results, 1):
  print('#### Result', i, 'similarity:', similarity, '####')
  print(doc.page_content)
  print()

#### Result 1 similarity: 0.35822332 ####
I want --" he began, but Uncle Vernon was tearing the letters into
pieces before his eyes. Uncle Vernon didnt go to work that day. He
stayed at home and nailed up the mail slot.
"See," he explained to Aunt Petunia through a mouthful of nails, "if
they can't deliver them they'll just give up."
"I'm not sure that'll work, Vernon."
"Oh, these people's minds work in strange ways, Petunia, they're not
like you and me," said Uncle Vernon, trying to knock in a nail with the
piece of fruitcake Aunt Petunia had just brought him.
On Friday, no less than twelve letters arrived for Harry. As they

#### Result 2 similarity: 0.33413026 ####
"I'm not having one in the house, Petunia! Didn't we swear when we took
him in we'd stamp out that dangerous nonsense?"
That evening when he got back from work, Uncle Vernon did something he'd
never done before; he visited Harry in his cupboard.
"Where's my letter?" said Harry, the moment Uncle Vernon had squeezed
through

## Example 2: "What does Professor Snape teach?"

In [12]:
query = "What does Professor Snape teach?"
top_k = 5
results = lexical_search(query, top_k=top_k)
for i, (doc, similarity) in enumerate(results, 1):
  print('#### Result', i, 'similarity:', similarity, '####')
  print(doc.page_content)
  print()

#### Result 1 similarity: 0.2432616 ####
was twitching.
"Professor Quirrell!" said Hagrid. "Harry, Professor Quirrell will be
one of your teachers at Hogwarts."
"P-P-Potter," stammered Professor Quirrell, grasping Harry's hand,
"c-can't t-tell you how p- pleased I am to meet you."
"What sort of magic do you teach, Professor Quirrell?"
"D-Defense Against the D-D-Dark Arts," muttered Professor Quirrell, as
though he'd rather not think about it. "N-not that you n-need it, eh,
P-P-Potter?" He laughed nervously. "You'll be g-getting all your

#### Result 2 similarity: 0.24086753 ####
they caught every word -- like Professor McGonagall, Snape had y caught
every word -- like Professor McGonagall, Snape had the gift of keeping a
class silent without effort. "As there is little foolish wand-waving
here, many of you will hardly believe this is magic. I don't expect you
will really understand the beauty of the softly simmering cauldron with
its shimmering fumes, the delicate power of liquids that

## Example 3: "How did Harry get to Hogwarts?"

In [13]:
query = "How did Harry get to Hogwarts?"
top_k = 5
results = lexical_search(query, top_k=top_k)
for i, (doc, similarity) in enumerate(results, 1):
  print('#### Result', i, 'similarity:', similarity, '####')
  print(doc.page_content)
  print()

#### Result 1 similarity: 0.21687797 ####
He saw the three of them look stunned and raised his eyebrows.
"It's not that unusual, yeh get a lot o' funny folk in the Hog's Head --
that's the pub down in the village. Mighta bin a dragon dealer, mightn'
he? I never saw his face, he kept his hood up."
Harry sank down next to the bowl of peas. "What did you talk to him
about, Hagrid? Did you mention Hogwarts at all?"
"Mighta come up," said Hagrid, frowning as he tried to remember.
"Yeah... he asked what I did, an' I told him I was gamekeeper here....

#### Result 2 similarity: 0.20068368 ####
hand.
"Call me Hagrid," he said, "everyone does. An' like I told yeh, I'm
Keeper of Keys at Hogwarts -- yeh'll know all about Hogwarts, o' course.
"Er -- no," said Harry.
Hagrid looked shocked.
"Sorry," Harry said quickly.
"Sony?" barked Hagrid, turning to stare at the Dursleys, who shrank back
into the shadows. "It' s them as should be sorry! I knew yeh weren't
gettin' yer letters but I never thought y

## Issues with TF-IDF
* Exact word matches are required; synonyms or related terms won't work.
* No understanding of semantic relationships.
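
A toy illustration of the first point, using plain term-count vectors rather than the fitted `TfidfVectorizer` above. The two short passages are written for this example, and `bag_of_words` and `cosine` are hypothetical helpers.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return num / den if den else 0.0


query = bag_of_words("Where does Uncle Vernon work?")
exact = bag_of_words("Uncle Vernon did not go to work that day.")
synonym = bag_of_words("Mr. Dursley was the director of a firm that made drills.")

print("shares query words:", round(cosine(query, exact), 3))
print("answers via other words:", round(cosine(query, synonym), 3))
```

The second passage is the one that actually answers the question, yet it shares no word with the query and so gets a similarity of zero.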

# Semantic Search

I chose the [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) model because it is the best-performing MiniLM model for semantic search. <br>
[Sentence transformer performance leaderboard](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html)

In [14]:
embedding_model = SentenceTransformer(embedding_model_name)

chunks = []
for doc in docs_processed:
  chunks.append(doc.page_content)

chunk_embeddings = embedding_model.encode(chunks, show_progress_bar=True)
# Create Faiss index
dimension = chunk_embeddings.shape[1]
semantic_faiss_index = faiss.IndexFlatIP(dimension)
norm_chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
semantic_faiss_index.add(np.array(norm_chunk_embeddings))  # Add chunk vectors to the index
# faiss.write_index(semantic_faiss_index, str(data_path / "harry_potter_book1_MiniLM-L6_index.faiss"))


In [15]:
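def semantic_search(query, top_k=5):
    """Return the top_k (document, similarity) pairs for the query under the sentence embeddings."""
    # Normalize the query so its inner product with the unit-length chunk vectors is a cosine similarity
    query_embedding = embedding_model.encode([query], normalize_embeddings=True)
    similarities, indices = semantic_faiss_index.search(np.array(query_embedding), top_k)
    results = [(docs_processed[i], similarity) for i, similarity in zip(indices[0], similarities[0])]
    return results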

## Example 1: "Where does Uncle Vernon work?"

In [16]:
query = "Where does Uncle Vernon work?"
top_k = 5
results = semantic_search(query, top_k=top_k)
for i, (doc, similarity) in enumerate(results, 1):
  print('#### Result', i, 'similarity:', similarity, '####')
  print(doc.page_content)
  print()

#### Result 1 similarity: 0.52995336 ####
"Er -- Uncle Vernon?"
Uncle Vernon grunted to show he was listening.
"Er -- I need to be at King's Cross tomorrow to -- to go to Hogwarts."
Uncle Vernon grunted again.
"Would it be all right if you gave me a lift?"
Grunt. Harry supposed that meant yes.
"Thank you."
He was about to go back upstairs when Uncle Vernon actually spoke.
"Funny way to get to a wizards' school, the train. Magic carpets all got
punctures, have they?"
Harry didn't say anything.
"Where is this school, anyway?"
"I don't know," said Harry, realizing this for the first time. He pulled

#### Result 2 similarity: 0.51595914 ####
Uncle Vernon ripped open the bill, snorted in disgust, and flipped over
the postcard.
"Marge's ill," he informed Aunt Petunia. "Ate a funny whelk. --."
"Dad!" said Dudley suddenly. "Dad, Harry's got something!"
Harry was on the point of unfolding his letter, which was written on the
same heavy parchment as the envelope, when it was jerked sharply out o

## Example 2: "What does Professor Snape teach?"

In [17]:
query = "What does Professor Snape teach?"
top_k = 5
results = semantic_search(query, top_k=top_k)
for i, (doc, similarity) in enumerate(results, 1):
  print('#### Result', i, 'similarity:', similarity, '####')
  print(doc.page_content)
  print()

#### Result 1 similarity: 0.57575035 ####
apart from you."
Hagrid's chest swelled at these last words. Harry and Ron beamed at
Hermione.
"Well, I don' s'pose it could hurt ter tell yeh that... let's see... he
borrowed Fluffy from me... then some o' the teachers did enchantments...
Professor Sprout -- Professor Flitwick -- Professor McGonagall --" he
ticked them off on his fingers, "Professor Quirrell -- an' Dumbledore
himself did somethin', o' course. Hang on, I've forgotten someone. Oh
yeah, Professor Snape."
"Snape?"

#### Result 2 similarity: 0.56885344 ####
Professor Sprout -- Professor Flitwick -- Professor McGonagall --" he
ticked them off on his fingers, "Professor Quirrell -- an' Dumbledore
himself did somethin', o' course. Hang on, I've forgotten someone. Oh
yeah, Professor Snape."
"Snape?"
"Yeah -- yer not still on abou' that, are yeh? Look, Snape helped
protect the Stone, he's not about ter steal it."
Harry knew Ron and Hermione were thinking the same as he was. If Snape
had

## Example 3: "How did Harry get to Hogwarts?"

In [18]:
query = "How did Harry get to Hogwarts?"
top_k = 5
results = semantic_search(query, top_k=top_k)
for i, (doc, similarity) in enumerate(results, 1):
  print('#### Result', i, 'similarity:', similarity, '####')
  print(doc.page_content)
  print()

#### Result 1 similarity: 0.66185766 ####
after him to hurry up, and he must have done so, because a second later,
he had gone -- but how had he done it?
Now the third brother was walking briskly toward the barrier he was
almost there -- and then, quite suddenly, he wasn't anywhere.
There was nothing else for it.
"Excuse me," Harry said to the plump woman.
"Hello, dear," she said. "First time at Hogwarts? Ron's new, too."
She pointed at the last and youngest of her sons. He was tall, thin, and
gangling, with freckles, big hands and feet, and a long nose.

#### Result 2 similarity: 0.6113595 ####
friendly.
"Taking Dudley to the hospital," growled Uncle Vernon. "Got to have that
ruddy tail removed before he goes to Smeltings."
Harry woke at five o'clock the next morning and was too excited and
nervous to go back to sleep. He got up and pulled on his jeans because
he didn't want to walk into the station in his wizard's robes -- he'd
change on the train. He checked his Hogwarts list yet ag

# UI

In [19]:
def retrieve(query, model_choice, top_k):
    top_k = int(top_k)  # gr.Number may pass a float, but Faiss needs an integer k
    if model_choice == "TF-IDF":
        results = lexical_search(query, top_k=top_k)
    else:
        results = semantic_search(query, top_k=top_k)

    # Format the results for display
    output = []
    for i, (doc, similarity) in enumerate(results, 1):
        output.append(f'#### Result {i} Similarity: {similarity} ####\n{doc.page_content}')
    return "\n\n".join(output)


interface = gr.Interface(
    fn=retrieve,
    inputs=[
        gr.Textbox(label="Enter Your Query"),
        gr.Radio(["TF-IDF", "MiniLM"], label="Choose Embedding Model", value="TF-IDF"),
        gr.Number(value=5, label="Number of Results", minimum=1)
    ],
    outputs="text",
    title="Text Retrieval System",
    description="Retrieve relevant passages from the Harry Potter book using either TF-IDF or MiniLM embeddings.",
)

# Launch the interface
interface.launch()





# Evaluation
We assume that the semantic search results are the ground truth.

## P@K
Measures the precision of the top K results, i.e. how many of TF-IDF's top K chunks also appear in the semantic top K. High precision means TF-IDF retrieves mostly relevant results.
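
In this setup, P@k reduces to the size of the overlap between the two top-k sets divided by k. A minimal sketch with made-up chunk IDs (the `precision_at_k` name is hypothetical):

```python
def precision_at_k(reference_ids, candidate_ids, k):
    """Fraction of the candidate's top-k IDs that also appear in the reference top-k."""
    reference_top = set(reference_ids[:k])
    candidate_top = set(candidate_ids[:k])
    return len(reference_top & candidate_top) / k


semantic_ids = [12, 7, 33, 5, 90]  # treated as ground truth
lexical_ids = [7, 41, 12, 8, 64]   # system under evaluation

print(precision_at_k(semantic_ids, lexical_ids, k=5))
```

Because both sets contain exactly k items, the value ignores the order within the top k and only counts shared chunks.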

In [20]:
queries = [
    "Where does Uncle Vernon work?",
    "What does Professor Snape teach?",
    "How did Harry get to Hogwarts?",
    "What is the last name of Hermione?",
    "In what house Malfoy is?"
]
ks = [5, 10, 50, 100]
scores = []

for query in queries:
  print(f'Query: {query}')
  score = []
  for k in ks:
    semantic_results = semantic_search(query, top_k=k)
    lexical_results = lexical_search(query, top_k=k)

    semantic_results_ids = {doc.metadata['index'] for doc, _ in semantic_results}
    lexical_results_ids = {doc.metadata['index'] for doc, _ in lexical_results}
    # Fraction of the top-k chunks shared by both retrievers
    precision = len(semantic_results_ids & lexical_results_ids) / k
    print(f'  P@{k}: {precision}')
    score.append(precision)
  scores.append(score)

print()
summary = np.array(scores).mean(axis=0)
for i, k in enumerate(ks):
  print(f'Mean P@{k}: {summary[i]:.3f}')

Query: Where does Uncle Vernon work?
  P@5: 0.2
  P@10: 0.5
  P@50: 0.64
  P@100: 0.7
Query: What does Professor Snape teach?
  P@5: 0.6
  P@10: 0.4
  P@50: 0.44
  P@100: 0.53
Query: How did Harry get to Hogwarts?
  P@5: 0.0
  P@10: 0.0
  P@50: 0.12
  P@100: 0.2
Query: What is the last name of Hermione?
  P@5: 0.0
  P@10: 0.0
  P@50: 0.18
  P@100: 0.27
Query: In what house Malfoy is?
  P@5: 0.0
  P@10: 0.0
  P@50: 0.42
  P@100: 0.5

Mean P@5: 0.160
Mean P@10: 0.180
Mean P@50: 0.360
Mean P@100: 0.440


## Spearman’s Rank Correlation
Measures how similar the TF-IDF ranking is to the MiniLM ranking over all chunks. A high value means that the two scores order the chunks similarly.
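
As a rough illustration of what the coefficient captures, the sketch below computes Spearman's rho from rank differences on five made-up score pairs. It assumes there are no ties; `scipy.stats.spearmanr`, used in the next cell, handles ties as well, which matters here because many TF-IDF similarities are exactly zero.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman's rho via the rank-difference formula (valid when there are no ties)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Made-up similarity scores for five chunks
semantic_scores = [0.66, 0.61, 0.40, 0.35, 0.10]
lexical_scores = [0.05, 0.21, 0.20, 0.02, 0.01]

print(round(spearman(semantic_scores, lexical_scores), 3))
```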

In [21]:
scores = []
for query in queries:
  semantic_results = semantic_search(query, top_k=len(docs_processed))
  lexical_results = lexical_search(query, top_k=len(docs_processed))

  semantic_similarity_dict = {doc.metadata['index']: similarity for doc, similarity in semantic_results}
  lexical_similarity_dict = {doc.metadata['index']: similarity for doc, similarity in lexical_results}
  df = pd.DataFrame()
  df['doc_id'] = range(len(docs_processed))
  df['semantic_similarity'] = [semantic_similarity_dict[i] for i in range(len(docs_processed))]
  df['lexical_similarity'] = [lexical_similarity_dict[i] for i in range(len(docs_processed))]

  res = stats.spearmanr(df['semantic_similarity'].tolist(), df['lexical_similarity'].tolist())
  print(f'Query: {query}, Correlation: {res.statistic:.3f}, p-val {res.pvalue:.3f}')
  scores.append(res.statistic)
print(f'Mean correlation: {np.mean(scores):.3f}')

Query: Where does Uncle Vernon work?, Correlation: 0.383, p-val 0.000
Query: What does Professor Snape teach?, Correlation: 0.540, p-val 0.000
Query: How did Harry get to Hogwarts?, Correlation: 0.260, p-val 0.000
Query: What is the last name of Hermione?, Correlation: 0.403, p-val 0.000
Query: In what house Malfoy is?, Correlation: 0.296, p-val 0.000
Mean correlation: 0.376


## Normalized Discounted Cumulative Gain (NDCG@k)
This measures how well TF-IDF ranks its top K results compared to MiniLM, in both relevance and order.
A high NDCG indicates that TF-IDF places the passages MiniLM considers relevant near the top.
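
A minimal sketch of the metric on four made-up items, assuming linear gain and a log2 position discount (the convention `sklearn.metrics.ndcg_score` documents); the `dcg` and `ndcg` helpers are illustrative, not the library's implementation.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the first k items: linear gain, log2 discount."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances[:k], start=1))


def ndcg(true_relevance, predicted_scores, k):
    """Order items by predicted score, then divide DCG by the best achievable DCG."""
    pairs = sorted(zip(predicted_scores, true_relevance), key=lambda p: p[0], reverse=True)
    by_prediction = [rel for _, rel in pairs]
    ideal_dcg = dcg(sorted(true_relevance, reverse=True), k)
    return dcg(by_prediction, k) / ideal_dcg if ideal_dcg else 0.0


true_relevance = [3, 2, 1, 0]            # e.g. derived from the reference ranking
predicted_scores = [0.1, 0.9, 0.5, 0.2]  # scores from the system under evaluation

print(round(ndcg(true_relevance, predicted_scores, k=2), 3))
```

One caveat when reading the numbers in this section: the next cell uses ranks over all chunks as relevance, so the average gain is roughly half the maximum and even a random ordering should score somewhere near 0.5.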

In [22]:
scores = []
for query in queries:
  semantic_results = semantic_search(query, top_k=len(docs_processed))
  lexical_results = lexical_search(query, top_k=len(docs_processed))

  semantic_similarity_dict = {doc.metadata['index']: similarity for doc, similarity in semantic_results}
  lexical_similarity_dict = {doc.metadata['index']: similarity for doc, similarity in lexical_results}
  df = pd.DataFrame()
  df['doc_id'] = range(len(docs_processed))
  df['semantic_similarity'] = [semantic_similarity_dict[i] for i in range(len(docs_processed))]
  df['lexical_similarity'] = [lexical_similarity_dict[i] for i in range(len(docs_processed))]
  df['semantic_rank'] = df['semantic_similarity'].rank(ascending=True)
  df['lexical_rank'] = df['lexical_similarity'].rank(ascending=True)
  df = df.sort_values(by='lexical_rank', ascending=False)
  print(f'Query: {query}')
  # Compute each NDCG@k once and reuse it for printing and averaging
  ndcg_at_k = [ndcg_score([df["semantic_rank"].tolist()], [df["lexical_rank"].tolist()], k=k) for k in ks]
  for k, value in zip(ks, ndcg_at_k):
    print(f'ndcg@{k}: {value}')
  scores.append(ndcg_at_k)
summary = np.array(scores).mean(axis=0)
for i, k in enumerate(ks):
  print(f'Mean ndcg@{k}: {summary[i]:.3f}')

Query: Where does Uncle Vernon work?
ndcg@5: 0.9908895205978725
ndcg@10: 0.928477362749368
ndcg@50: 0.9515478376299134
ndcg@100: 0.9335702873082129
Query: What does Professor Snape teach?
ndcg@5: 0.917868814756603
ndcg@10: 0.9049401785843039
ndcg@50: 0.9181074688972838
ndcg@100: 0.9152405176678413
Query: How did Harry get to Hogwarts?
ndcg@5: 0.7742626599369046
ndcg@10: 0.7900328719736166
ndcg@50: 0.7264645394389662
ndcg@100: 0.6955092760146937
Query: What is the last name of Hermione?
ndcg@5: 0.8675342453058863
ndcg@10: 0.7982305699685781
ndcg@50: 0.7736129590443079
ndcg@100: 0.778641421110567
Query: In what house Malfoy is?
ndcg@5: 0.9842168129131151
ndcg@10: 0.9797593029161215
ndcg@50: 0.8808577998062742
ndcg@100: 0.8396320958927885
Mean ndcg@5: 0.907
Mean ndcg@10: 0.880
Mean ndcg@50: 0.850
Mean ndcg@100: 0.833


**Conclusion:**<br>
Our analysis shows that TF-IDF orders the chunks broadly in line with MiniLM (mean NDCG@k between 0.83 and 0.91), but its top results overlap little with MiniLM's: on average only 16% of the top 5 and 18% of the top 10 chunks are shared (low P@k). Over the whole corpus, the two rankings are only weakly similar, as reflected by the low positive Spearman correlation (mean 0.376).

# Open Qs

Q1: Discuss potential improvements for retrieval using an LLM. Clarify whether this approach applies to lexical, semantic, or both searches.

Potential improvements for retrieval using an LLM include:

* For Both Searches:

  1. Use an LLM to expand queries by generating synonyms, related terms, or paraphrased versions. Calculate the average similarity scores of the reformulated queries against the results to identify the most relevant passages.
  2. Ask an LLM to generate relevance scores directly for each result, enhancing the ranking precision by incorporating semantic and contextual understanding.
  3. Apply a natural language inference (NLI) model to filter irrelevant results. An NLI model evaluates the relationship between a **premise** and a **hypothesis**:
      * Use the NLI model to determine whether the query (hypothesis) is entailed by the result (premise).


* For Semantic Search:
  4.  Fine-tune the embedding model on the Harry Potter domain or a similar dataset. This would replace the general-purpose MiniLM embeddings with ones more attuned to the narrative style and vocabulary of the target text.
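
The query-expansion idea in point 1 can be sketched as follows. The LLM call is replaced by a hard-coded list of variants, and `search_fn` stands in for either `lexical_search` or `semantic_search`; it is assumed to return `(doc_id, score)` pairs. All names and scores are hypothetical.

```python
from collections import defaultdict


def expanded_search(queries, search_fn, top_k=5):
    """Average each document's score over all query variants and return the top_k."""
    totals = defaultdict(float)
    for q in queries:
        for doc_id, score in search_fn(q):
            totals[doc_id] += score
    # A document missing from one variant's results contributes 0 for that variant
    averaged = {doc_id: total / len(queries) for doc_id, total in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Stand-ins for LLM paraphrases and for a real search function
variants = ["Where does Uncle Vernon work?", "What is Vernon Dursley's job?"]
toy_results = {
    variants[0]: [("chunk_a", 0.36), ("chunk_b", 0.33)],
    variants[1]: [("chunk_c", 0.50), ("chunk_b", 0.41)],
}

ranked = expanded_search(variants, toy_results.__getitem__, top_k=2)
print(ranked)
```

Averaging rewards chunks that several variants agree on; taking the maximum instead would favour a chunk that matches a single variant very well.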





Q2: Highlight the drawbacks or limitations of your suggestion.

* LLMs are computationally expensive.
* Post-processing steps such as query expansion, relevance scoring, and NLI filtering increase the system's response time.
* LLMs and NLI models can make mistakes, so it is important to evaluate them.
* Fine-tuning the embedding model on a specific domain will probably reduce its ability to generalize to other domains.

Q3: Explain how you would create a reliable ground truth for evaluation (instead of simply using the MiniLM results). How would you tag each chunk?

I would use the following ways to create a reliable ground truth:
1. Generate pairs of questions and answers for each chapter in the book using an LLM (e.g. ChatGPT) and also ask it to provide quotes from the book that support the answer.
2. Manually review a sample of the generated question-answer pairs to ensure that the answers are correct and fully supported by the quoted text. This step helps validate the quality of the LLM's output.
3. Use both semantic and lexical search methods to generate a set of results (retrieved paragraphs) for each question.
4. For each question-result pair, use an LLM to evaluate the relevance of the retrieved paragraph. Specifically, ask **How relevant is the paragraph to the question?** and **Does the paragraph explicitly contain the answer to the question?**
5. Assign labels based on the LLM’s evaluation:
  * Label 2: The paragraph explicitly contains the answer.
  * Label 1: The paragraph is relevant (contains information that helps infer the answer) but does not explicitly contain the answer.
  * Label 0: The paragraph is irrelevant.
  * Since an answer can sometimes be inferred only by combining multiple paragraphs, it is important to label each such paragraph as relevant (label 1).
6. Manually verify a sample of the LLM-labelled pairs to ensure accuracy and consistency in the labelling process.
7. Use this new dataset to evaluate the retrieval system.
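
The labelling rule in step 5 can be written down as a small function. This is a sketch under the assumption that the LLM's two answers have already been parsed into booleans; `assign_label` and the chunk IDs are hypothetical.

```python
def assign_label(is_relevant, contains_answer):
    """Map the two LLM judgments to the graded labels 2, 1, and 0."""
    if contains_answer:
        return 2
    if is_relevant:
        return 1
    return 0


# Hypothetical LLM judgments for one question: chunk_id -> (is_relevant, contains_answer)
judgments = {
    101: (True, True),
    102: (True, False),
    103: (False, False),
}

labels = {chunk_id: assign_label(*flags) for chunk_id, flags in judgments.items()}
print(labels)
```

The graded labels can then serve as the true relevance for NDCG, or be binarized (label of at least 1) for P@k.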