# 🧠 Homework: From REST to Reasoning with `dlt` and Qdrant

This notebook walks through the LLM Zoomcamp homework on loading FAQ data using `dlt` and storing it in the Qdrant vector database.


## ✅ Question 1 – What version of `dlt` is installed?

To begin, we install the required libraries with Qdrant and embedding support:

```bash
pip install -q "dlt[qdrant]" "qdrant-client[fastembed]"
```

Next, we check the installed version of `dlt`:

```python
import dlt
print(dlt.__version__)
```

📦 **Answer**: `1.13.0`



In [1]:
import dlt
print(dlt.__version__)

1.13.0
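
The version can also be read from package metadata without importing the library, which can help when the import itself fails. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is illustrative and not part of `dlt`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed dlt version, or None when dlt is not installed.
print(installed_version("dlt"))
```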


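In [2]:
import dlt
import requests

@dlt.resource
def zoomcamp_data():
    """Yield each FAQ document, tagged with the name of its course."""
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    docs_response.raise_for_status()  # fail early on a bad HTTP response
    documents_raw = docs_response.json()

    # Flatten the nested structure: one record per document
    for course in documents_raw:
        course_name = course['course']
        for doc in course['documents']:
            doc['course'] = course_name
            yield doc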
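
To see what the resource yields without hitting the network, the same flattening can be run on a small in-memory sample. This is only a sketch: `flatten_courses` and the sample records are made up for illustration, and the `question` and `text` fields are assumptions about the JSON layout (only `course` and `documents` appear in the cell above).

```python
from typing import Any, Dict, Iterable, Iterator


def flatten_courses(courses: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per document, tagged with its course name."""
    for course in courses:
        for doc in course["documents"]:
            # Copy instead of mutating the input, then attach the course name.
            yield {**doc, "course": course["course"]}


# Hypothetical sample mirroring the nested layout of documents.json.
sample = [
    {
        "course": "course-a",
        "documents": [
            {"question": "Q1", "text": "A1"},
            {"question": "Q2", "text": "A2"},
        ],
    },
    {"course": "course-b", "documents": [{"question": "Q3", "text": "A3"}]},
]

records = list(flatten_courses(sample))
print(len(records))
print(records[0])
```

Each nested document becomes one flat record, so the number of records equals the total number of documents across all courses.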

## ✅ Question 2 – Create a `dlt` pipeline and load data into Qdrant

In this step, we:

1. Set up a **Qdrant** destination
2. Create a `dlt.pipeline` to manage the data workflow
3. Run the pipeline with the `zoomcamp_data()` resource
4. Check how many rows were inserted into Qdrant

📊 **Result**  
The normalize step in the pipeline trace reports `zoomcamp_data: 948 row(s)`.

✅ **Answer**: `948 rows` from the `zoomcamp_data` resource were inserted into Qdrant.

In [4]:
from dlt.destinations import qdrant

# 1. Define Qdrant destination
qdrant_destination = qdrant(
    qd_path="db.qdrant"  # this creates a local folder for the vector DB
)

# 2. Create the pipeline
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data"
)

# 3. Run the pipeline
load_info = pipeline.run(zoomcamp_data())

# 4. Print summary info
print(pipeline.last_trace)

Run started at 2025-07-10 09:16:33.281782+00:00 and COMPLETED in 5.02 seconds with 4 steps.
Step extract COMPLETED in 0.33 seconds.

Load package 1752138993.847789 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.08 seconds.
Normalized data for the following tables:
- zoomcamp_data: 948 row(s)

Load package 1752138993.847789 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 4.06 seconds.
Pipeline zoomcamp_pipeline load step completed in 4.05 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used /workspaces/llm-zoomcamp/03-graph-based-retrieval-rag/db.qdrant location to store data
Load package 1752138993.847789 is LOADED and contains no failed jobs

Step run COMPLETED in 5.02 seconds.
Pipeline zoomcamp_pipeline load step completed in 4.05 seconds
1 load package(s) were loaded to destination qdrant and i
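
The row count can also be pulled out of the printed trace programmatically instead of being read by eye. The sketch below parses the text form of the trace with a regular expression; `row_counts_from_trace` is a hypothetical helper, and the pattern assumes the `- <table>: <n> row(s)` line format seen in the output above.

```python
import re
from typing import Dict

# Matches lines such as "- zoomcamp_data: 948 row(s)" in the printed trace.
ROW_COUNT_LINE = re.compile(
    r"^- (?P<table>\w+): (?P<rows>\d+) row\(s\)$", re.MULTILINE
)


def row_counts_from_trace(trace_text: str) -> Dict[str, int]:
    """Map each table name found in the trace text to its reported row count."""
    return {m["table"]: int(m["rows"]) for m in ROW_COUNT_LINE.finditer(trace_text)}


# Excerpt copied from the trace output above.
trace_excerpt = """Step normalize COMPLETED in 0.08 seconds.
Normalized data for the following tables:
- zoomcamp_data: 948 row(s)
"""

print(row_counts_from_trace(trace_excerpt))  # {'zoomcamp_data': 948}
```

In the notebook, the input would be `str(pipeline.last_trace)`, which is the same text that `print` displays.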

## ✅ Question 3 – What embedding model was used?

When we ran the `dlt` pipeline, it automatically embedded the text before storing it in Qdrant.

To find out **which embedding model** was used, we can inspect the `meta.json` file that was created by `dlt` inside the local Qdrant folder:

📁 Path: `db.qdrant/meta.json`

🔍 We looked under the `"vectors"` section and found:

```json
"vectors": {
  "fast-bge-small-en": {
    "distance": "Cosine",
    "size": 384
  }
}
```

✅ **Answer**: `fast-bge-small-en`

In [5]:
import json
import pprint

# Open the metadata file created by dlt inside the qdrant directory
with open("db.qdrant/meta.json") as f:
    meta = json.load(f)

# Print the entire content to see which model was used
pprint.pprint(meta)

{'aliases': {},
 'collections': {'zoomcamp_tagged_data': {'hnsw_config': None,
                                          'init_from': None,
                                          'on_disk_payload': None,
                                          'optimizers_config': None,
                                          'quantization_config': None,
                                          'replication_factor': None,
                                          'shard_number': None,
                                          'sharding_method': None,
                                          'sparse_vectors': None,
                                          'strict_mode_config': None,
                                          'vectors': {'fast-bge-small-en': {'datatype': None,
                                                                            'distance': 'Cosine',
                                                                            'hnsw_config': None,
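
Rather than scanning the whole dump, the vector name, size, and distance can be extracted directly from the loaded dictionary. This is a sketch under two assumptions: the metadata follows the `collections` → `vectors` layout shown above, and each collection uses named vectors. `vector_configs` is a hypothetical helper, and `meta_sample` is a hand-written stand-in that keeps only the fields shown in this section.

```python
from typing import Any, Dict, Tuple


def vector_configs(meta: Dict[str, Any]) -> Dict[str, Dict[str, Tuple[int, str]]]:
    """Map collection name -> vector name -> (size, distance)."""
    result: Dict[str, Dict[str, Tuple[int, str]]] = {}
    for collection, config in meta.get("collections", {}).items():
        # Treat a missing or null "vectors" entry as "no named vectors".
        vectors = config.get("vectors") or {}
        result[collection] = {
            name: (params["size"], params["distance"])
            for name, params in vectors.items()
        }
    return result


# Hand-written stand-in for meta.json, reduced to the fields shown above.
meta_sample = {
    "aliases": {},
    "collections": {
        "zoomcamp_tagged_data": {
            "vectors": {"fast-bge-small-en": {"size": 384, "distance": "Cosine"}}
        }
    },
}

print(vector_configs(meta_sample))
```

With the real file, passing the `meta` dictionary loaded in the cell above would list every collection alongside its embedding vector name.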
                           