Homework 3: Vector Search

Q1. Getting the embeddings model - multi-qa-distilbert-cos-v1
What's the first value of the resulting vector?
7.82226548e-02 = 0.078222655, reported as 0.07 (truncated to two decimals)

In [3]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("multi-qa-distilbert-cos-v1")

In [4]:
user_question = "I just discovered the course. Can I still join it?"
v = embedding_model.encode(user_question)


In [5]:
v[0]


0.078222655

Prepare the documents and keep only those for "machine-learning-zoomcamp"
-key = course
-value = machine-learning-zoomcamp
-375 documents remain after filtering (out of 948)


In [6]:
import requests 

base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '03-vector-search/eval/documents-with-ids.json'
docs_url = f'{base_url}/{relative_url}?raw=1'
docs_response = requests.get(docs_url)
documents = docs_response.json()
len(documents)

948

In [7]:
# keep only the documents whose course is machine-learning-zoomcamp
ml_documents = [doc for doc in documents
                if doc["course"] == 'machine-learning-zoomcamp']
len(ml_documents)




375

In [8]:
ml_documents[1]

{'text': 'The course videos are pre-recorded, you can start watching the course right now.\nWe will also occasionally have office hours - live sessions where we will answer your questions. The office hours sessions are recorded too.\nYou can see the office hours as well as the pre-recorded course videos in the course playlist on YouTube.',
 'section': 'General course-related questions',
 'question': 'Is it going to be live? When?',
 'course': 'machine-learning-zoomcamp',
 'id': '39fda9f0'}

Q2. Creating the embeddings
Now, for each document, we create a single embedding from the question and answer (text) fields combined.

We want to put all of them into a single matrix X:

-Create a list embeddings
-Iterate over each document
-qa_text = f'{question} {text}'
-compute the embedding for qa_text
-append to embeddings
-let X = np.array(embeddings) 

What's the shape of X? (X.shape).
(375,768)
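
The steps above encode one document at a time. A minimal sketch of a batched variant follows, assuming the encoder accepts a list of strings and returns one row per string (SentenceTransformer.encode does this). The names build_embedding_matrix, fake_encode and toy_docs are hypothetical, and fake_encode is only a stand-in so the sketch runs without downloading a model.

```python
import numpy as np

def build_embedding_matrix(docs, encode):
    """Embed '<question> <text>' for every document in a single batched call."""
    qa_texts = [f"{d['question']} {d['text']}" for d in docs]
    return np.asarray(encode(qa_texts))

# Hypothetical stand-in encoder: maps each text to a constant vector of its length.
# In the notebook this would be embedding_model.encode instead.
def fake_encode(texts, dim=4):
    return np.array([[float(len(t))] * dim for t in texts])

toy_docs = [
    {"question": "Q1?", "text": "A1"},
    {"question": "Q2?", "text": "A2 longer"},
]

X_toy = build_embedding_matrix(toy_docs, fake_encode)
print(X_toy.shape)  # (2, 4)
```

With the real model, build_embedding_matrix(ml_documents, embedding_model.encode) should give the same (375, 768) shape as the loop-based version.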




In [9]:
from tqdm.auto import tqdm
import numpy as np

In [10]:
embeddings = []

for doc in tqdm(ml_documents):
    question = doc['question']
    text = doc['text']
    qa_text = f"{question} {text}"

    doc['question_vector'] = embedding_model.encode(question)
    doc['text_vector'] = embedding_model.encode(text)
    doc['question_text_vector'] = embedding_model.encode(qa_text)

    embeddings.append(doc['question_text_vector'])




  0%|          | 0/375 [00:00<?, ?it/s]

In [11]:
embeddings[1]
len(embeddings)


375

In [12]:
print(qa_text)

Any advice for adding the Machine Learning Zoomcamp experience to your LinkedIn profile? I’ve seen LinkedIn users list DataTalksClub as Experience with titles as:
Machine Learning Fellow
Machine Learning Student
Machine Learning Participant
Machine Learning Trainee
Please note it is best advised that you do not list the experience as an official “job” or “internship” experience since DataTalksClub did not hire you, nor financially compensate you.
Other ways you can incorporate the experience in the following sections:
Organizations
Projects
Skills
Featured
Original posts
Certifications
Courses
By Annaliese Bronz
Interesting question, I put the link of my project into my CV as showcase and make posts to show my progress.
By Ani Mkrtumyan


In [13]:
X = np.array(embeddings)
X.shape

(375, 768)

Q3. Search
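Compute the cosine similarity between the vector from Q1 (v) and the matrix from Q2 (X). The dot product used below equals the cosine similarity only when the embeddings are unit-normalized, which is assumed here for this model.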
What's the highest score in the results?
0.65

In [14]:
scores = X.dot(v)

In [15]:
print(scores.max())

0.65065724
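
Using X.dot(v) as a cosine similarity relies on the vectors having unit length. The sketch below checks that equivalence on random data; X_demo and v_demo are made-up arrays, not the notebook's embeddings. For the real vectors, np.linalg.norm(v) can be printed to confirm it is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 8))
v_demo = rng.normal(size=8)

# Scale every row of X_demo and the query vector to unit length
X_demo /= np.linalg.norm(X_demo, axis=1, keepdims=True)
v_demo /= np.linalg.norm(v_demo)

# Plain dot product, as in the notebook
dot_scores = X_demo.dot(v_demo)

# Full cosine similarity formula: dot product divided by the product of norms
cosine_scores = X_demo.dot(v_demo) / (
    np.linalg.norm(X_demo, axis=1) * np.linalg.norm(v_demo)
)

print(np.allclose(dot_scores, cosine_scores))  # True
```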


Vector search
-implement our own in-memory vector search on top of X

In [16]:
class VectorSearchEngine():
    def __init__(self, documents, embeddings):
        self.documents = documents
        self.embeddings = embeddings

    def search(self, v_query, num_results=10):
        scores = self.embeddings.dot(v_query)
        idx = np.argsort(-scores)[:num_results]
        return [self.documents[i] for i in idx]

search_engine = VectorSearchEngine(documents=ml_documents, embeddings=X)
search_engine.search(v, num_results=5)

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp',
  'id': 'ee58a693',
  'question_vector': array([ 0.07717121, -0.04749376,  0.02866319, -0.01141074,  0.08245005,
         -0.04042277, -0.02613376,  0.04122073, -0.04840768,  0.01509397,
         -0.00149667, -0.01334825,  0.04618281,  0.02318396,  0.04547327,
         -0.0080999 ,  0.07718319, -0.03334849, -0.0418002 , -0.02304634,
         -0.01866886,  0.00298916, -0.00631757,  0.03931605, -0.0228994 ,
          0.07724463,  0.06296352,  0.037800
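
The ranking step inside search (np.argsort(-scores)[:num_results]) can be illustrated on a toy score array. The values below are made up. The np.argpartition variant is only a possible alternative for large arrays, since it avoids sorting all scores; it is not what the notebook uses.

```python
import numpy as np

toy_scores = np.array([0.10, 0.65, 0.30, 0.50])
k = 2

# Same approach as the search method: sort by descending score, keep the first k
top_k = np.argsort(-toy_scores)[:k]
print(top_k.tolist())  # [1, 3]

# Alternative sketch: select the k best without a full sort, then order only those k
part = np.argpartition(-toy_scores, kth=k - 1)[:k]
top_k_fast = part[np.argsort(-toy_scores[part])]
print(top_k_fast.tolist())  # [1, 3]
```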

Q4. Hit-rate for our search engine
-use the code from the module to calculate the hit rate of VectorSearchEngine with num_results=5.

What did you get?

0.93

In [17]:
import pandas as pd

base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '03-vector-search/eval/ground-truth-data.csv'
ground_truth_url = f'{base_url}/{relative_url}?raw=1'

df_ground_truth = pd.read_csv(ground_truth_url)
df_ground_truth = df_ground_truth[df_ground_truth.course == 'machine-learning-zoomcamp']
ground_truth = df_ground_truth.to_dict(orient='records')

In [18]:
ground_truth[0]

{'question': 'Where can I sign up for the course?',
 'course': 'machine-learning-zoomcamp',
 'document': '0227b872'}

In [19]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = search_engine.search(v_query=embedding_model.encode(q['question']), num_results=5)
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1830 [00:00<?, ?it/s]

In [20]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

In [21]:
hit_rate(relevance_total)

0.9398907103825137
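
To make the metric concrete: the hit rate is the fraction of queries for which the expected document appears anywhere in the returned results. A small sketch on made-up relevance lists (toy_hit_rate and toy_relevance are hypothetical names, equivalent in logic to hit_rate above):

```python
def toy_hit_rate(relevance_total):
    """Fraction of queries with at least one relevant result."""
    hits = sum(1 for line in relevance_total if any(line))
    return hits / len(relevance_total)

toy_relevance = [
    [False, True, False],   # expected document found at rank 2: hit
    [False, False, False],  # expected document not returned: miss
    [True, False, False],   # found at rank 1: hit
    [False, False, True],   # found at rank 3: hit
]

print(toy_hit_rate(toy_relevance))  # 0.75
```

Note that the rank of the hit does not matter for this metric, only whether the document is in the top results at all.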

Q5. Indexing with Elasticsearch
-Create the index with the same settings as in the module, but change the dimensions to 768
-Index the embeddings of qa_text (the question_text_vector field)
-Perform the search with the same query as in Q1.

What's the ID of the document with the highest score?
ee58a693

In [23]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 

es_client.info()

ObjectApiResponse({'name': '86915c30372b', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'H-26fk2IRWCNb27tCwRsJw', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [24]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} ,
            "id": {"type": "keyword"},
            "question_text_vector": {"type": "dense_vector", "dims": 768, "index": True, "similarity": "cosine"},
        }
    }
}

In [25]:
index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [26]:
from tqdm.auto import tqdm

In [27]:
for doc in tqdm(ml_documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/375 [00:00<?, ?it/s]
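
The loop above sends one request per document. As a possible alternative, the documents could be prepared as bulk actions. The sketch below only builds the actions and needs no running cluster; to_bulk_actions and toy_index_docs are hypothetical names, and converting NumPy vectors to plain lists is a precaution rather than something the notebook requires.

```python
import numpy as np

def to_bulk_actions(docs, index_name):
    """Yield one bulk action per document, converting NumPy arrays to lists."""
    for doc in docs:
        source = {
            key: (value.tolist() if isinstance(value, np.ndarray) else value)
            for key, value in doc.items()
        }
        yield {"_index": index_name, "_source": source}

# Made-up document with a tiny vector, just to show the resulting shape
toy_index_docs = [
    {"id": "a1", "question": "Q?", "question_text_vector": np.array([0.1, 0.2])},
]

actions = list(to_bulk_actions(toy_index_docs, "course-questions"))
print(actions[0]["_source"]["question_text_vector"])  # [0.1, 0.2]

# With a running cluster, the actions could then be sent in one call:
# from elasticsearch import helpers
# helpers.bulk(es_client, to_bulk_actions(ml_documents, index_name))
```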

In [28]:
def elastic_search_knn(field, vector, course):
    knn = {
        "field": field,
        "query_vector": vector,
        "k": 5,
        "num_candidates": 10000,
        "filter": {
            "term": {
                "course": course
            }
        }
    }

    search_query = {
        "knn": knn,
        "_source": ["text", "section", "question", "course", "id"]
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [30]:
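# set the course explicitly instead of relying on `q` left over from the loop in In [19]
course = 'machine-learning-zoomcamp'
results = elastic_search_knn('question_text_vector', v, course)
results[0]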

{'question': 'The course has already started. Can I still join it?',
 'course': 'machine-learning-zoomcamp',
 'section': 'General course-related questions',
 'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
 'id': 'ee58a693'}

Q6. Hit-rate for Elasticsearch
The search engine we used in Q4 computed the similarity between the query and ALL the vectors in our database. Usually this is not practical, as we may have a lot of data.

-Elasticsearch uses approximate techniques to make it faster.

-evaluate how much worse the results are when we switch from exact search (as in Q4) to approximate search with Elasticsearch.

-What's the hit rate for our dataset with Elasticsearch?

0.93

In [31]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total)
    }

In [32]:
def question_text_vector_knn(q):
    
    question = q['question']
    course = q['course']

    v_q = embedding_model.encode(question)

    return elastic_search_knn('question_text_vector', v_q, course)

evaluate(ground_truth, question_text_vector_knn)

  0%|          | 0/1830 [00:00<?, ?it/s]

{'hit_rate': 0.9398907103825137}