## Q1. Embedding the query  
- **Answer:** `-0.11`  

## Q2. Cosine similarity with another vector  
- **Answer:** `0.90`  

## Q3. Ranking by cosine  
- **Answer:** `1`  

## Q4. Ranking by cosine, version two  
- **Answer:** `0`  

In Q4, the top-ranked document changes from index 1 to index 0.
Embedding the question together with the answer text gives each document more context, and the question of document 0 ("Course - Can I still join the course after the start date?") is close to a paraphrase of the query.
As a result, its score rises from about 0.76 to 0.85, which is enough to overtake document 1 (about 0.82 to 0.84).
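
The change in ranking can be checked directly from the two score arrays printed by the Q3 and Q4 cells further down. This is a small sketch that hard-codes those printed scores instead of recomputing the embeddings; the variable names are illustrative only.

```python
import numpy as np

# Scores copied from the Q3 output (text only) and the Q4 output (question + text)
scores_text = np.array([0.76296845, 0.81823782, 0.80853974, 0.71330788, 0.73044992])
scores_full = np.array([0.85145432, 0.8436594, 0.84082872, 0.77551577, 0.80860079])

# Index of the best document under each representation
best_text = int(np.argmax(scores_text))
best_full = int(np.argmax(scores_full))

# How much each document's score changed once the question was prepended
gain = scores_full - scores_text

print(f"Best index (text only): {best_text}")
print(f"Best index (question + text): {best_full}")
print(f"Document with the largest gain: {int(np.argmax(gain))}")
```

Every document scores higher with the question prepended, but document 0 gains the most, which is what moves it past document 1.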

## Q5. Selecting the embedding model  
- **Answer:** `384`  

## Q6. Indexing with qdrant (2 points)  
- **Answer:** `0.87`  

## Q1. Embedding the query


In [3]:
from fastembed import TextEmbedding
import numpy as np

# Create the embedder
embedder = TextEmbedding(model_name="jinaai/jina-embeddings-v2-small-en")

# Your query
q = "I just discovered the course. Can I join now?"

# Compute the embedding (returns a generator)
embedding = next(embedder.embed([q]))

# Confirm shape
print(f"Shape of embedding: {embedding.shape}")

# Find minimal value
min_value = np.min(embedding)
print(f"Minimal value in the embedding: {min_value}")

Shape of embedding: (512,)
Minimal value in the embedding: -0.11726373885183883


### Cosine similarity


In [4]:
np.linalg.norm(embedding)

np.float64(1.0)

In [5]:
import numpy as np

# Sum of all elements
sum_of_elements = np.sum(embedding)
print(f"Sum of all elements: {sum_of_elements}")

# Norm (length of vector)
norm = np.linalg.norm(embedding)
print(f"Norm of embedding: {norm}")

Sum of all elements: -0.3649978092789794
Norm of embedding: 1.0
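
Since the norm is 1.0, the embedding is already unit-length, so its cosine similarity with another unit-length vector reduces to a plain dot product. The sketch below illustrates this with two made-up 3-dimensional vectors (not real embeddings); `cosine` is a hypothetical helper written for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity from the definition: u.v / (|u| * |v|)."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two arbitrary toy vectors, chosen only for illustration
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

# Normalize both to unit length, like the embedding above (norm 1.0)
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

full = cosine(a, b)            # general formula on the raw vectors
shortcut = a_unit.dot(b_unit)  # dot product of the normalized vectors

print(f"cosine(a, b):    {full:.6f}")
print(f"a_unit . b_unit: {shortcut:.6f}")
```

Both values agree, which is why the later cells can use `.dot()` on the embeddings without dividing by the norms.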


## Q2. Cosine similarity with another vector

In [6]:

# Your query and doc
query = "I just discovered the course. Can I join now?"
doc = "Can I still join the course after the start date?"

# Compute embeddings
query_vec = next(embedder.embed([query]))
doc_vec = next(embedder.embed([doc]))

# Cosine similarity (dot product since both are normalized)
cosine_sim = query_vec.dot(doc_vec)

print(f"Cosine similarity: {cosine_sim}")

Cosine similarity: 0.9008528895674548


## Q3. Ranking by cosine

In [7]:
from fastembed import TextEmbedding
import numpy as np

# Instantiate embedder
embedder = TextEmbedding(model_name="jinaai/jina-embeddings-v2-small-en")

# Your query
query = "I just discovered the course. Can I join now?"

# Compute query embedding
query_vec = next(embedder.embed([query]))

# List of documents
documents = [
    {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
     'section': 'General course-related questions',
     'question': 'Course - Can I still join the course after the start date?',
     'course': 'data-engineering-zoomcamp'},
    
    {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
     'section': 'General course-related questions',
     'question': 'Course - Can I follow the course after it finishes?',
     'course': 'data-engineering-zoomcamp'},
    
    {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
     'section': 'General course-related questions',
     'question': 'Course - When will the course start?',
     'course': 'data-engineering-zoomcamp'},
    
    {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
     'section': 'General course-related questions',
     'question': 'Course - What can I do before the course starts?',
     'course': 'data-engineering-zoomcamp'},
    
    {'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.',
     'section': 'General course-related questions',
     'question': 'How can we contribute to the course?',
     'course': 'data-engineering-zoomcamp'}
]

# Compute embeddings for the text field
text_list = [doc['text'] for doc in documents]
text_embeddings = list(embedder.embed(text_list))

# Convert to matrix
V = np.vstack(text_embeddings)

# Compute cosine similarity
scores = V.dot(query_vec)

# Find the index with highest similarity
best_idx = np.argmax(scores)

print(f"Cosine similarity scores: {scores}")
print(f"Best document index: {best_idx}")

Cosine similarity scores: [0.76296845 0.81823782 0.80853974 0.71330788 0.73044992]
Best document index: 1


## Q4. Ranking by cosine, version two


In [16]:
# Build the new full_text field: question + text
full_text_list = [
    doc['question'] + ' ' + doc['text'] 
    for doc in documents
]

# Embed the combined question + text for each document
full_text_embeddings = list(embedder.embed(full_text_list))

# Convert to matrix
V_full = np.vstack(full_text_embeddings)

# Compute cosine similarities with the query vector
scores_full = V_full.dot(query_vec)

# Find the index with the highest similarity
best_idx_full = np.argmax(scores_full)

print(f"Cosine similarity scores (question + text): {scores_full}")
print(f"Best document index (question + text): {best_idx_full}")

Cosine similarity scores (question + text): [0.85145432 0.8436594  0.84082872 0.77551577 0.80860079]
Best document index (question + text): 0


## Q5. Selecting the embedding model

In [17]:
from fastembed import TextEmbedding

# List supported models
for model in TextEmbedding.list_supported_models():
    print(f"Model: {model['model']}, dim: {model['dim']}")

Model: BAAI/bge-base-en, dim: 768
Model: BAAI/bge-base-en-v1.5, dim: 768
Model: BAAI/bge-large-en-v1.5, dim: 1024
Model: BAAI/bge-small-en, dim: 384
Model: BAAI/bge-small-en-v1.5, dim: 384
Model: BAAI/bge-small-zh-v1.5, dim: 512
Model: mixedbread-ai/mxbai-embed-large-v1, dim: 1024
Model: snowflake/snowflake-arctic-embed-xs, dim: 384
Model: snowflake/snowflake-arctic-embed-s, dim: 384
Model: snowflake/snowflake-arctic-embed-m, dim: 768
Model: snowflake/snowflake-arctic-embed-m-long, dim: 768
Model: snowflake/snowflake-arctic-embed-l, dim: 1024
Model: jinaai/jina-clip-v1, dim: 768
Model: Qdrant/clip-ViT-B-32-text, dim: 512
Model: sentence-transformers/all-MiniLM-L6-v2, dim: 384
Model: jinaai/jina-embeddings-v2-base-en, dim: 768
Model: jinaai/jina-embeddings-v2-small-en, dim: 512
Model: jinaai/jina-embeddings-v2-base-de, dim: 768
Model: jinaai/jina-embeddings-v2-base-code, dim: 768
Model: jinaai/jina-embeddings-v2-base-zh, dim: 768
Model: jinaai/jina-embeddings-v2-base-es, dim: 768
Model:
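
Scanning the list by eye works, but the smallest dimensionality can also be picked out in code. The sketch below uses a hand-copied subset of the entries printed above so that it runs without downloading anything. With fastembed installed, the same logic should apply to the list returned by `TextEmbedding.list_supported_models()`, assuming each entry keeps the `model` and `dim` keys used in the cell above.

```python
# A few entries copied from the printed list (a subset, for illustration only)
supported = [
    {"model": "BAAI/bge-base-en", "dim": 768},
    {"model": "BAAI/bge-small-en", "dim": 384},
    {"model": "BAAI/bge-small-zh-v1.5", "dim": 512},
    {"model": "sentence-transformers/all-MiniLM-L6-v2", "dim": 384},
    {"model": "jinaai/jina-embeddings-v2-small-en", "dim": 512},
]

# Smallest embedding size among the listed models
min_dim = min(m["dim"] for m in supported)

# All models that share that size
smallest = [m["model"] for m in supported if m["dim"] == min_dim]

print(f"Smallest dimensionality: {min_dim}")
print(f"Models with that size: {smallest}")
```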

## Q6. Indexing with qdrant (2 points)

In [18]:
import requests 

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()


documents = []

for course in documents_raw:
    course_name = course['course']
    if course_name != 'machine-learning-zoomcamp':
        continue

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [26]:
from qdrant_client import QdrantClient, models
from fastembed import TextEmbedding
import uuid

# Connect to Qdrant
client = QdrantClient("http://localhost:6333")

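# Create the collection only if it does not exist yet
# (create_collection raises an error when the collection is already there)
if not client.collection_exists("ml-zoomcamp-faq"):
    client.create_collection(
        collection_name="ml-zoomcamp-faq",
        vectors_config=models.VectorParams(
            size=384,  # embedding size for BAAI/bge-small-en
            distance=models.Distance.COSINE
        )
    )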

# Initialize embedder
embedder = TextEmbedding(model_name="BAAI/bge-small-en")

# Prepare points
points = []
for doc in documents:
    # Concatenate question + text
    text = doc["question"] + " " + doc["text"]
    
    # Embed the text
    vector = next(embedder.embed([text]))
    
    # Create point with unique ID, vector, and payload
    points.append(
        models.PointStruct(
            id=uuid.uuid4().hex,
            vector=vector,
            payload=doc
        )
    )

# Upsert points into Qdrant collection
client.upsert(
    collection_name="ml-zoomcamp-faq",
    points=points
)

print(f"Upserted {len(points)} points successfully!")

Upserted 375 points successfully!
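
One thing to keep in mind with `uuid.uuid4()` is that every run generates fresh random IDs, so repeating the upsert would add a second copy of each document instead of overwriting the first. A possible alternative, sketched below, derives the ID from the document content with `uuid.uuid5`, so the same document always maps to the same point ID. The `stable_id` helper and the choice of fields are assumptions made for illustration, not part of the homework.

```python
import uuid

def stable_id(doc):
    """Derive a deterministic UUID string from a document's course, question and text."""
    key = "|".join([doc["course"], doc["question"], doc["text"]])
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))

# Toy document with the same fields as the FAQ entries (content made up)
example = {
    "course": "machine-learning-zoomcamp",
    "question": "Can I still join?",
    "text": "Yes, you can.",
}

first = stable_id(example)
second = stable_id(dict(example))  # a copy with identical content

print(first == second)  # identical content gives the identical ID
```

Passing `id=stable_id(doc)` to `models.PointStruct` in place of `uuid.uuid4().hex` would then make the upsert safe to repeat.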


In [30]:
from qdrant_client import models

query = "I just discovered the course. Can I join now?"

results = client.query_points(
    collection_name="ml-zoomcamp-faq",
    query=models.Document(
        text=query,
        model="BAAI/bge-small-en"
    ),
    limit=1,
    with_payload=True
)

print(f"Highest score: {results.points[0].score}")
print(f"Top match payload: {results.points[0].payload}")

Highest score: 0.87031734
Top match payload: {'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.', 'section': 'General course-related questions', 'question': 'The course has already started. Can I still join it?', 'course': 'machine-learning-zoomcamp'}
