# Homework - DLT with Qdrant

In this homework, we will use dlt to load the data from our FAQ into Qdrant.

## Question 1: dlt Version

In [1]:
# Install dlt with Qdrant support and Qdrant client
!pip install -q "dlt[qdrant]" "qdrant-client[fastembed]"

In [2]:
# Check dlt version
import dlt
print(f"dlt version: {dlt.__version__}")

dlt version: 1.12.3


## dlt Resource

For reading the FAQ data, we have this helper function that we'll annotate with `@dlt.resource`:

In [3]:
import requests
import dlt

@dlt.resource
def zoomcamp_data():
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    documents_raw = docs_response.json()

    for course in documents_raw:
        course_name = course['course']

        for doc in course['documents']:
            doc['course'] = course_name
            yield doc
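
The flattening step inside the resource can be checked offline, without the network call. The sketch below is a minimal stand-in, assuming the same `course` / `documents` nesting that the resource iterates over; `flatten_documents` and the sample records (including the `question` and `text` fields) are hypothetical names made up for illustration, not part of the homework.

```python
def flatten_documents(documents_raw):
    """Yield one flat dict per FAQ entry, tagged with its course name."""
    for course in documents_raw:
        course_name = course["course"]
        for doc in course["documents"]:
            # Build a new dict so the input is left untouched
            # (a choice made for this sketch; the resource above mutates doc in place).
            yield {**doc, "course": course_name}


# Hypothetical sample shaped like the nested structure the resource expects
sample = [
    {
        "course": "course-a",
        "documents": [
            {"question": "q1", "text": "t1"},
            {"question": "q2", "text": "t2"},
        ],
    },
    {"course": "course-b", "documents": [{"question": "q3", "text": "t3"}]},
]

flat = list(flatten_documents(sample))
print(len(flat))   # number of flattened documents
print(flat[0])     # first document, now carrying its course name
```

Each yielded dict becomes one row in the destination, which is why the row count reported by the pipeline should equal the total number of nested documents.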

## Question 2: dlt Pipeline

Now let's create a pipeline and configure the Qdrant destination:

In [4]:
from dlt.destinations import qdrant

# Configure Qdrant destination
qdrant_destination = qdrant(
    qd_path="db.qdrant", 
)

In [5]:
# Create and run the pipeline
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data"
)

load_info = pipeline.run(zoomcamp_data())
print(pipeline.last_trace)

Run started at 2025-07-06 17:29:27.331327+00:00 and COMPLETED in 4.38 seconds with 4 steps.
Step extract COMPLETED in 0.66 seconds.

Load package 1751822968.7473862 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.07 seconds.
Normalized data for the following tables:
- zoomcamp_data: 948 row(s)

Load package 1751822968.7473862 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 2.25 seconds.
Pipeline zoomcamp_pipeline load step completed in 2.24 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used /Users/kevmo/llmz/hw/db.qdrant location to store data
Load package 1751822968.7473862 is LOADED and contains no failed jobs

Step run COMPLETED in 4.38 seconds.
Pipeline zoomcamp_pipeline load step completed in 2.24 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_da

### Analyzing the Results

Let's examine the trace output to find how many rows were inserted:

In [6]:
# Let's also check the load info for more details
print(f"\nLoad info summary:")
print(f"Pipeline name: {load_info.pipeline.pipeline_name}")
print(f"Dataset name: {load_info.pipeline.dataset_name}")

# Check normalized data info
if hasattr(load_info, 'metrics'):
    print(f"\nMetrics:")
    for key, value in load_info.metrics.items():
        print(f"  {key}: {value}")

# List the load packages (load IDs and, if exposed, completed job counts)
print(f"\nLoad packages:")
for load_package in load_info.load_packages:
    print(f"  Load ID: {load_package.load_id}")
    if hasattr(load_package, 'jobs_completed_count'):
        print(f"  Jobs completed: {load_package.jobs_completed_count}")


Load info summary:
Pipeline name: zoomcamp_pipeline
Dataset name: zoomcamp_tagged_data

Metrics:
  1751822968.7473862: [{'started_at': DateTime(2025, 7, 6, 17, 29, 29, 473056, tzinfo=Timezone('UTC')), 'finished_at': DateTime(2025, 7, 6, 17, 29, 31, 711475, tzinfo=Timezone('UTC')), 'job_metrics': {'zoomcamp_data.b7d45ff09d.jsonl': LoadJobMetrics(job_id='zoomcamp_data.b7d45ff09d.jsonl', file_path='/Users/kevmo/.dlt/pipelines/zoomcamp_pipeline/load/normalized/1751822968.7473862/started_jobs/zoomcamp_data.b7d45ff09d.0.jsonl', table_name='zoomcamp_data', started_at=DateTime(2025, 7, 6, 17, 29, 30, 358224, tzinfo=Timezone('UTC')), finished_at=DateTime(2025, 7, 6, 17, 29, 31, 282130, tzinfo=Timezone('UTC')), state='completed', remote_url=None)}}]

Load packages:
  Load ID: 1751822968.7473862


In [7]:
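# Let's check the normalized folder to count the actual rows.
# NOTE: the load ID and file name below are hardcoded from an earlier run and do not
# match this run's load ID (1751822968.7473862), so the file will not be found.
import json

normalized_path = "/Users/kevmo/.dlt/pipelines/zoomcamp_pipeline/load/normalized/1751742539.893121/started_jobs/zoomcamp_data.d7bb15b5be.0.jsonl"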

try:
    with open(normalized_path, 'r') as f:
        lines = f.readlines()
        print(f"Number of rows in zoomcamp_data: {len(lines)}")
        
        # Let's also peek at the first row to see the structure
        if lines:
            first_row = json.loads(lines[0])
            print(f"\nFirst row keys: {list(first_row.keys())}")
except FileNotFoundError:
    print("File not found. The load ID might be different in your run.")

File not found. The load ID might be different in your run.
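
The lookup fails because the path embeds a load ID and file hash from a different run. One way around this is to search for the job file by pattern instead of hardcoding it. The sketch below is a rough idea only: `count_jsonl_rows` is a hypothetical helper, and the assumption that job files are named `<table>.<hash>.<n>.jsonl` is inferred from the path shown in the metrics output. dlt may move or clean up these files after a load completes, so an empty result does not mean that no rows were loaded.

```python
from pathlib import Path


def count_jsonl_rows(root, table_name):
    """Count non-empty lines across all '<table_name>.*.jsonl' files under root."""
    total = 0
    for path in Path(root).rglob(f"{table_name}.*.jsonl"):
        with path.open("r", encoding="utf-8") as f:
            total += sum(1 for line in f if line.strip())
    return total


# Example call (the pipeline directory is an assumption based on the paths printed above):
# count_jsonl_rows(Path.home() / ".dlt" / "pipelines" / "zoomcamp_pipeline", "zoomcamp_data")
```

Note that this sums over every matching file under the directory, so leftovers from several runs would be counted together.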


In [8]:
# Let's parse the trace more carefully
if pipeline.last_trace:
    # Look for extract info
    for step in pipeline.last_trace.steps:
        if step.step == "extract":
            print(f"Extract step info:")
            print(f"  Started at: {step.started_at}")
            print(f"  Finished at: {step.finished_at}")
            if hasattr(step, 'metrics') and step.metrics:
                print(f"  Metrics: {step.metrics}")
        
        # Look for normalize info
        if step.step == "normalize":
            print(f"\nNormalize step info:")
            print(f"  Started at: {step.started_at}")
            print(f"  Finished at: {step.finished_at}")
            if hasattr(step, 'metrics') and step.metrics:
                for table, info in step.metrics.items():
                    if isinstance(info, dict) and 'row_count' in info:
                        print(f"  Table '{table}': {info['row_count']} rows")
                    
# Also check if the trace has any table summaries
if hasattr(pipeline.last_trace, 'normalized_metrics'):
    print("\nNormalized metrics:")
    for table, metrics in pipeline.last_trace.normalized_metrics.items():
        print(f"  {table}: {metrics}")

Extract step info:
  Started at: 2025-07-06 17:29:28.735005+00:00
  Finished at: 2025-07-06 17:29:29.391095+00:00

Normalize step info:
  Started at: 2025-07-06 17:29:29.392182+00:00
  Finished at: 2025-07-06 17:29:29.462964+00:00
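
The step metrics above did not surface a row count, but the printed trace does, in the lines under "Normalized data for the following tables:". As a fallback, those lines can be parsed as text. This is a minimal sketch that assumes the `- <table>: <n> row(s)` format seen in the trace output; `parse_row_counts` is a hypothetical helper, and the format may differ in other dlt versions.

```python
import re

# Matches lines such as "- zoomcamp_data: 948 row(s)"
ROW_LINE = re.compile(r"^- (?P<table>\S+): (?P<rows>\d+) row\(s\)$")


def parse_row_counts(trace_text):
    """Return {table_name: row_count} for every row-count line in the trace text."""
    counts = {}
    for line in trace_text.splitlines():
        match = ROW_LINE.match(line.strip())
        if match:
            counts[match.group("table")] = int(match.group("rows"))
    return counts


# Excerpt copied from the trace printed earlier in this notebook
trace_text = """Step normalize COMPLETED in 0.07 seconds.
Normalized data for the following tables:
- zoomcamp_data: 948 row(s)
"""

print(parse_row_counts(trace_text))  # {'zoomcamp_data': 948}
```

In the notebook, the same function could be applied to `str(pipeline.last_trace)`. The dlt documentation also describes `pipeline.last_trace.last_normalize_info.row_counts`; if that attribute is available in your version, it avoids parsing text altogether.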


In [9]:
# Let's count the documents directly from the source
test_data = list(zoomcamp_data())
print(f"Total documents from zoomcamp_data(): {len(test_data)}")
print(f"\nCourses found:")
courses = {}
for doc in test_data:
    course = doc.get('course', 'Unknown')
    courses[course] = courses.get(course, 0) + 1

for course, count in courses.items():
    print(f"  {course}: {count} documents")

Total documents from zoomcamp_data(): 948

Courses found:
  data-engineering-zoomcamp: 435 documents
  machine-learning-zoomcamp: 375 documents
  mlops-zoomcamp: 138 documents


## Question 3: Embeddings

Let's check which embedding model was used by inspecting the meta.json file:

In [10]:
import json
import os

# Find and read the meta.json file in the db.qdrant folder
meta_file_path = None
for root, dirs, files in os.walk("db.qdrant"):
    if "meta.json" in files:
        meta_file_path = os.path.join(root, "meta.json")
        break

if meta_file_path:
    print(f"Found meta.json at: {meta_file_path}")
    with open(meta_file_path, 'r') as f:
        meta_data = json.load(f)
    
    print("\nMeta.json content:")
    print(json.dumps(meta_data, indent=2))
    
    # Look for embedding model information
    if 'config' in meta_data:
        print("\nConfig section:")
        print(json.dumps(meta_data['config'], indent=2))
else:
    print("meta.json file not found")

Found meta.json at: db.qdrant/meta.json

Meta.json content:
{
  "collections": {
    "zoomcamp_tagged_data": {
      "vectors": {
        "fast-bge-small-en": {
          "size": 384,
          "distance": "Cosine",
          "hnsw_config": null,
          "quantization_config": null,
          "on_disk": null,
          "datatype": null,
          "multivector_config": null
        }
      },
      "shard_number": null,
      "sharding_method": null,
      "replication_factor": null,
      "write_consistency_factor": null,
      "on_disk_payload": null,
      "hnsw_config": null,
      "wal_config": null,
      "optimizers_config": null,
      "init_from": null,
      "quantization_config": null,
      "sparse_vectors": null,
      "strict_mode_config": null
    },
    "zoomcamp_tagged_data__dlt_loads": {
      "vectors": {
        "fast-bge-small-en": {
          "size": 384,
          "distance": "Cosine",
          "hnsw_config": null,
          "quantization_config": null,
       

## Summary

**Question 1 Answer:** The installed dlt version is 1.12.3.

**Question 2 Answer:** 948 rows were inserted. The trace reports `zoomcamp_data: 948 row(s)` under "Normalized data for the following tables:", and counting the documents directly from `zoomcamp_data()` gives the same number.

**Question 3 Answer:** The embedding model recorded in `meta.json` is `fast-bge-small-en` (384-dimensional vectors, cosine distance).
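
For Question 3, the vector names can also be read programmatically rather than by eye. The sketch below assumes a `meta.json`-style dict with named vectors under `collections -> <name> -> vectors`, as in the output above; `vector_names` is a hypothetical helper, and the inline `meta` dict is a trimmed copy of that output rather than the full file.

```python
def vector_names(meta):
    """Map each collection to {vector_name: (size, distance)}.

    Assumes named vectors, i.e. 'vectors' is a dict keyed by vector name.
    """
    result = {}
    for name, config in meta.get("collections", {}).items():
        vectors = config.get("vectors") or {}
        result[name] = {
            vec: (params.get("size"), params.get("distance"))
            for vec, params in vectors.items()
        }
    return result


# Trimmed copy of the meta.json content printed earlier
meta = {
    "collections": {
        "zoomcamp_tagged_data": {
            "vectors": {
                "fast-bge-small-en": {"size": 384, "distance": "Cosine"}
            }
        }
    }
}

print(vector_names(meta))
# {'zoomcamp_tagged_data': {'fast-bge-small-en': (384, 'Cosine')}}
```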