# Vector Search Evaluation

First, we need to create a **ground truth** dataset.

This can be done in several ways:
1. Manually, by annotators or domain experts
2. By collecting real user queries
3. By generating it with an LLM


In general, one query can have multiple relevant documents. In this use case, however, each query (user question) has exactly 1 relevant document (answer).

The dataset will be generated automatically as follows:
1. For every user query (FAQ question), an LLM is prompted to generate 5 similar questions
2. Vector search is run with the LLM-generated questions as queries to find the relevant document in the knowledge base
3. During the test phase, we evaluate whether the vector search retrieves the relevant document for these similar (generated) queries
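
The evaluation in step 3 is usually summarised with ranking metrics such as hit rate and mean reciprocal rank (MRR). Below is a minimal sketch, assuming a hypothetical `search(query)` function that returns a ranked list of document ids. The names `hit_rate`, `mrr`, `evaluate` and the toy data are illustrative and not part of this notebook.

```python
def hit_rate(relevance):
    """Fraction of queries whose relevant document appears in the results."""
    return sum(any(row) for row in relevance) / len(relevance)


def mrr(relevance):
    """Mean reciprocal rank of the relevant document (0 when it is missing)."""
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance)


def evaluate(ground_truth, search):
    """ground_truth: list of (question, doc_id) pairs.
    search: hypothetical function returning a ranked list of doc ids."""
    relevance = [
        [found_id == doc_id for found_id in search(question)]
        for question, doc_id in ground_truth
    ]
    return {"hit_rate": hit_rate(relevance), "mrr": mrr(relevance)}


# Toy stand-in for a real vector search (assumption, for illustration only)
toy_rankings = {
    "q1": ["a", "b", "c"],  # relevant doc at rank 1
    "q2": ["b", "a", "c"],  # relevant doc at rank 2
    "q3": ["c", "b", "d"],  # relevant doc not retrieved
}
toy_ground_truth = [("q1", "a"), ("q2", "a"), ("q3", "a")]

metrics = evaluate(toy_ground_truth, lambda q: toy_rankings[q])
print(metrics)
```

With one relevant document per query, each row of `relevance` contains at most one `True`, which keeps both metrics easy to interpret.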

In [1]:
import json

In [2]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [3]:
documents = []

for course in docs_raw:
    for doc in course['documents']:
        doc['course'] = course['course']
        documents.append(doc)


In [4]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

## 1. Preprocessing (Indexing)

To link the FAQ records to the newly generated questions (queries), we need to adopt an indexing scheme that gives every record an id.

**A better option would be to fetch the data through the Google API.**


### 1.1 Order-based index

Not the best way: when the FAQ is updated, the ids of the whole knowledge base can break.

In [5]:
#for i in range(len(documents)):
#    documents[i]['id'] = i
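
To make the risk concrete, here is a small sketch with toy records (not taken from the FAQ) showing how position-based ids shift as soon as one record is inserted at the top of the list.

```python
# Toy FAQ records (illustrative only)
faq = [{"question": "A"}, {"question": "B"}, {"question": "C"}]

# Position-based ids, as in the commented-out cell above
ids_before = {doc["question"]: i for i, doc in enumerate(faq)}

# The FAQ gets a new record at the top
updated_faq = [{"question": "NEW"}] + faq
ids_after = {doc["question"]: i for i, doc in enumerate(updated_faq)}

# Every pre-existing record now has a different id
shifted = [q for q in ids_before if ids_before[q] != ids_after[q]]
print(shifted)
```

Any ground truth row that stored the old positions would now point at the wrong record.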

### 1.2 Content-based index

A hashing function uses the content of a record to generate a unique id.

The id doesn't depend on the order, only on the content. Of course, if the content is changed, the id changes too, and the link is broken again.

In this particular case we also include the first 10 characters of the `text` field; otherwise we would have lots of identical ids.

In [6]:
# indexing based on the content
import hashlib

def generate_doc_id(doc):
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    document_id = hash_hex[:8]
    return document_id

In [7]:
for doc in documents:
    doc['id'] = generate_doc_id(doc)

In [8]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}
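
The properties described above can be checked on toy records. This is a sketch only: `content_id` is a stand-alone copy of the hashing logic under a different name, and the records are made up for illustration.

```python
import hashlib


def content_id(doc):
    """Same recipe as generate_doc_id: course, question and first 10 chars of text."""
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


# Toy records (illustrative only)
records = [
    {"course": "c1", "question": "How do I install X?",
     "text": "Run the installer and follow the steps."},
    {"course": "c1", "question": "Where is the schedule?",
     "text": "See the course calendar."},
]

# Reordering the records does not change the set of ids
ids_original = {content_id(d) for d in records}
ids_reversed = {content_id(d) for d in reversed(records)}

# Editing the question changes the id
edited = dict(records[0], question="How can I install X?")
id_changed = content_id(edited) != content_id(records[0])

# Editing the text beyond its first 10 characters does not change the id
tail_edit = dict(records[0], text=records[0]["text"] + " Updated.")
id_same_after_tail_edit = content_id(tail_edit) == content_id(records[0])

print(ids_original == ids_reversed, id_changed, id_same_after_tail_edit)
```

The last check is worth keeping in mind: because only the first 10 characters of `text` are hashed, two records that differ only later in the text get the same id.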

In [9]:
from collections import defaultdict

In [10]:
# checking if our ids are unique
hashes = defaultdict(list)

for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)

len(hashes), len(documents)

(947, 948)

The lengths differ, which means that 2 records share the same id.

In [11]:
# list the ids that are shared by more than one record
for k, v in hashes.items():
    if len(v) > 1:
        print(k, len(v))

593f7569 2


In [12]:
# same id for two records: course, question and the first 10 characters of text match
# clearly a duplicate
hashes['593f7569']

[{'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'},
 {'text': "They both do the same, it's just less typing from the script.",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'}]
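
The notebook keeps both records. If you would rather drop such duplicates, one option is to keep the first record seen for each id. This is a sketch only; `drop_duplicate_ids` is a hypothetical helper and the sample records are shortened for illustration.

```python
def drop_duplicate_ids(docs):
    """Keep the first record for every id and skip later ones with the same id."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["id"] in seen:
            continue
        seen.add(doc["id"])
        unique.append(doc)
    return unique


# Shortened sample records (illustrative only)
sample = [
    {"id": "593f7569", "text": "first"},
    {"id": "593f7569", "text": "second"},
    {"id": "c02e79ef", "text": "other"},
]

deduped = drop_duplicate_ids(sample)
print(len(sample), len(deduped))
```

Note that a lookup dict built as `{d['id']: d for d in documents}` silently keeps the last record for a repeated id, so the choice of which duplicate survives is made either way.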

In [13]:
# saving preprocessed records
with open('documents_with_ids.json', 'wt') as f_out:
    json.dump(documents, f_out, indent=2)

In [14]:
! head documents_with_ids.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",


## 2. Data generation

In [15]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [16]:
doc = documents[2]
prompt = prompt_template.format(**doc)
print(prompt)

You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record.

The record:

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]


In [17]:
from openai import OpenAI

client = OpenAI()

def generate_questions(doc):
    """Sends generated prompt with context to OpenAI API"""
    prompt = prompt_template.format(**doc)

    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )

    json_response = response.choices[0].message.content
    return json_response

In [18]:
from tqdm.auto import tqdm

Generate the ground truth dataset

In [19]:
results = {}

In [None]:
for doc in tqdm(documents):
    doc_id = doc['id']
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions
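
The cells that follow load `results.bin`, but the cell that writes it is not shown. A minimal sketch of how such a file could be written and read back, assuming it is a plain pickle of the `results` dict; the file path and the `results_example` dict here are placeholders.

```python
import os
import pickle
import tempfile

# Placeholder for the real results dict: doc id -> raw JSON string from the LLM
results_example = {"c02e79ef": '["When does the course begin?"]'}

# Hypothetical path in a temp directory, to avoid overwriting a real results.bin
path = os.path.join(tempfile.gettempdir(), "results_example.bin")

# Save the dict so the API calls do not have to be repeated
with open(path, "wb") as f_out:
    pickle.dump(results_example, f_out)

# Load it back the same way the notebook loads results.bin
with open(path, "rb") as f_in:
    restored = pickle.load(f_in)

print(restored == results_example)
```

Saving the raw responses before parsing them is convenient because parsing problems can then be fixed without calling the API again.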

In [20]:
import pickle

In [21]:
with open('results.bin', 'rb') as f_in:
    results = pickle.load(f_in)

The results are raw JSON strings, so we need to parse them.

In [22]:
# this element has inconsistently escaped backslashes (see the last question)
print(results['58c9f99f'])

[
"How can I resolve the Docker error 'invalid mode: \\Program Files\\Git\\var\\lib\\postgresql\\data'?",
"What should I do if I encounter an invalid mode error in Docker on Windows?",
"What is the correct mounting path to use in Docker for PostgreSQL data on Windows?",
"Can you provide an example of a correct Docker mounting path for PostgreSQL data?",
"How do I correct the mounting path error in Docker for \\\Program Files\\Git\\var\\lib\\postgresql\\data'?"
]


In [23]:
json_questions = [
r"How can I resolve the Docker error 'invalid mode: \Program Files\Git\var\lib\postgresql\data'?",
"What should I do if I encounter an invalid mode error in Docker on Windows?",
"What is the correct mounting path to use in Docker for PostgreSQL data on Windows?",
"Can you provide an example of a correct Docker mounting path for PostgreSQL data?",
r"How do I correct the mounting path error in Docker for \Program Files\Git\var\lib\postgresql\data'?"
]

In [24]:
json.dumps(json_questions)

'["How can I resolve the Docker error \'invalid mode: \\\\Program Files\\\\Git\\\\var\\\\lib\\\\postgresql\\\\data\'?", "What should I do if I encounter an invalid mode error in Docker on Windows?", "What is the correct mounting path to use in Docker for PostgreSQL data on Windows?", "Can you provide an example of a correct Docker mounting path for PostgreSQL data?", "How do I correct the mounting path error in Docker for \\\\Program Files\\\\Git\\\\var\\\\lib\\\\postgresql\\\\data\'?"]'

In [25]:
results['58c9f99f'] = json.dumps(json_questions)

In [26]:
import re

parsed_res = {}

for doc_id, json_questions in results.items():
    try:
        # Try to parse the JSON string
        parsed_res[doc_id] = json.loads(json_questions)
    except json.JSONDecodeError as e:
        # Print the doc_id and the problematic JSON string
        print(f"Error decoding JSON for doc_id {doc_id}: {e}")
        print(f"Problematic JSON string: {json_questions}")

        # Try to fix the JSON string: keep already escaped backslashes (\\) as they are
        # and double any backslash that does not start a valid JSON escape
        json_questions_fixed = re.sub(
            r'(\\\\)|\\(?!["\\/bfnrtu])',
            lambda m: m.group(1) or '\\\\',
            json_questions,
        )

        try:
            parsed_res[doc_id] = json.loads(json_questions_fixed)
            print("Problematic JSON string is fixed")
        except json.JSONDecodeError as e:
            print(f"Failed to fix JSON for doc_id {doc_id}: {e}")

In [27]:
parsed_res['58c9f99f']

["How can I resolve the Docker error 'invalid mode: \\Program Files\\Git\\var\\lib\\postgresql\\data'?",
 'What should I do if I encounter an invalid mode error in Docker on Windows?',
 'What is the correct mounting path to use in Docker for PostgreSQL data on Windows?',
 'Can you provide an example of a correct Docker mounting path for PostgreSQL data?',
 "How do I correct the mounting path error in Docker for \\Program Files\\Git\\var\\lib\\postgresql\\data'?"]

In [28]:
# lookup dict
doc_idx = {d['id'] : d for d in documents}

In [29]:
final_results = []

for doc_id, questions in parsed_res.items():
    course = doc_idx[doc_id]['course']
    for q in questions:
        final_results.append((q, course, doc_id))

In [30]:
final_results

[('When does the course begin?', 'data-engineering-zoomcamp', 'c02e79ef'),
 ('How can I get the course schedule?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('What is the link for course registration?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('How can I receive course announcements?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('Where do I join the Slack channel?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('Where can I find the prerequisites for this course?',
  'data-engineering-zoomcamp',
  '1f6520ca'),
 ('How do I check the prerequisites for this course?',
  'data-engineering-zoomcamp',
  '1f6520ca'),
 ('Where are the course prerequisites listed?',
  'data-engineering-zoomcamp',
  '1f6520ca'),
 ('What are the requirements for joining this course?',
  'data-engineering-zoomcamp',
  '1f6520ca'),
 ('Where is the list of prerequisites for the course?',
  'data-engineering-zoomcamp',
  '1f6520ca'),
 ('Can I enroll in the course after it starts?',
  'data-engineerin

In [31]:
import pandas as pd

In [32]:
# build a DataFrame that will be saved to csv
df = pd.DataFrame(final_results, columns=['question', 'course', 'document'])
df.head(10)

|   | question | course | document |
|---|----------|--------|----------|
| 0 | When does the course begin? | data-engineering-zoomcamp | c02e79ef |
| 1 | How can I get the course schedule? | data-engineering-zoomcamp | c02e79ef |
| 2 | What is the link for course registration? | data-engineering-zoomcamp | c02e79ef |
| 3 | How can I receive course announcements? | data-engineering-zoomcamp | c02e79ef |
| 4 | Where do I join the Slack channel? | data-engineering-zoomcamp | c02e79ef |
| 5 | Where can I find the prerequisites for this co... | data-engineering-zoomcamp | 1f6520ca |
| 6 | How do I check the prerequisites for this course? | data-engineering-zoomcamp | 1f6520ca |
| 7 | Where are the course prerequisites listed? | data-engineering-zoomcamp | 1f6520ca |
| 8 | What are the requirements for joining this cou... | data-engineering-zoomcamp | 1f6520ca |
| 9 | Where is the list of prerequisites for the cou... | data-engineering-zoomcamp | 1f6520ca |


In [33]:
df.to_csv('ground-truth-data.csv', index=False)
