### From REST to reasoning: ingest, index, and query with dlt and Cognee Homework

* Video: https://www.youtube.com/watch?v=MNt_KK32gys

#### Resources

* [Slides](https://docs.google.com/presentation/d/1oHQilxEVqGGW4S2ctNEE0wHY2LgcjYLaRUziAoinsis/edit?usp=sharing)
* [Colab Notebook](https://colab.research.google.com/drive/1vBA9OIGChcKjjg8r5hHduR0v3A5D6rmH?usp=sharing) 

### Question 1. dlt Version

In this homework, we will load the data from our FAQ into Qdrant.

Let's install dlt with Qdrant support and the Qdrant client. We will also need networkx:

In [1]:
%pip install -q "networkx" "dlt[qdrant]" "qdrant-client[fastembed]"

Note: you may need to restart the kernel to use updated packages.


What's the version of dlt that you installed?

In [2]:
import dlt 
import requests 
import os
from dlt.destinations import qdrant

In [3]:
!pip show dlt

Name: dlt
Version: 1.12.3
Summary: dlt is an open-source python-first scalable data loading library that does not require any backend to run.
Home-page: https://github.com/dlt-hub
Author: 
Author-email: "dltHub Inc." <services@dlthub.com>
License-Expression: Apache-2.0
Location: /home/codespace/.python/current/lib/python3.12/site-packages
Requires: click, fsspec, gitpython, giturlparse, hexbytes, humanize, jsonpath-ng, orjson, packaging, pathvalidate, pendulum, pluggy, pytz, pyyaml, requests, requirements-parser, rich-argparse, semver, setuptools, simplejson, sqlglot, tenacity, tomlkit, typing-extensions, tzdata
Required-by: cognee


### A1. dlt version 1.12.3 is installed

### dlt Resource

For reading the FAQ data, we have this helper function:

(Note: yield turns the Python function into a generator function. Calling it returns a generator object that produces values one at a time instead of computing and storing them all in memory at once. Between values, the function is paused and its local state is preserved.)
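To make the note on yield concrete, here is a minimal sketch that runs without dlt or network access. The function name read_docs and the sample data are hypothetical; the sample only imitates the course/documents nesting that the helper below reads from documents.json.

```python
def read_docs(courses):
    """Yield one flattened document at a time (same idea as the FAQ helper)."""
    for course in courses:
        for doc in course["documents"]:
            # Build a new dict so the caller's input is not mutated
            yield {**doc, "course": course["course"]}


# Hypothetical sample data, shaped like the nested FAQ file
sample = [
    {"course": "course-a", "documents": [{"question": "q1"}, {"question": "q2"}]},
    {"course": "course-b", "documents": [{"question": "q3"}]},
]

gen = read_docs(sample)  # nothing has run yet; this is only a generator object
first = next(gen)        # runs until the first yield, then pauses
rest = list(gen)         # resumes after the pause and drains the remaining items
```

Because the loop position is kept between calls, `rest` starts at the second document rather than at the beginning.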

In [4]:
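@dlt.resource
def zoomcamp_data():
    # Download the FAQ documents, which are grouped by course
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    docs_response.raise_for_status()  # fail early on an HTTP error
    documents_raw = docs_response.json()

    for course in documents_raw:
        course_name = course['course']

        # Tag each document with its course name and yield it one at a time
        for doc in course['documents']:
            doc['course'] = course_name
            yield doc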

The function is annotated with @dlt.resource so that dlt can use it as a resource when we create a pipeline.

### Question 2. dlt pipeline

Now let's create a pipeline.

We need to define a destination for that.  Let's use the qdrant one:

In [5]:
qdrant_destination = qdrant(
    qd_path="db.qdrant",
)

Here we tell dlt (and Qdrant) to store our data in a local folder named db.qdrant.

Let's run it:

In [6]:
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data"
)
load_info = pipeline.run(zoomcamp_data())
print(pipeline.last_trace)

Run started at 2025-07-05 18:10:09.107155+00:00 and COMPLETED in 6.27 seconds with 4 steps.
Step extract COMPLETED in 0.30 seconds.

Load package 1751739011.2645357 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.09 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- zoomcamp_data: 948 row(s)

Load package 1751739011.2645357 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 3.74 seconds.
Pipeline zoomcamp_pipeline load step completed in 3.72 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used /workspaces/llm-zoomcamp2025/workshops/dlt/db.qdrant location to store data
Load package 1751739011.2645357 is LOADED and contains no failed jobs

Step run COMPLETED in 6.27 seconds.
Pipeline zoomcamp_pipeline load step completed in 3.72 seconds
1 load package(s) were loaded to 

How many rows were inserted into the zoomcamp_data collection?

Look for "Normalized data for the following tables:" in the trace output. 
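If you would rather not read the trace by eye, the row counts can be pulled out of the printed text. This is a minimal sketch that assumes the "- table: N row(s)" line format shown in the output above. The helper name parse_row_counts is hypothetical, and the text below is copied from that output; in the notebook you could pass str(pipeline.last_trace) instead.

```python
import re


def parse_row_counts(trace_text: str) -> dict[str, int]:
    """Map table name -> row count from lines like '- zoomcamp_data: 948 row(s)'."""
    pattern = re.compile(r"^- (\w+): (\d+) row\(s\)$", flags=re.MULTILINE)
    return {name: int(count) for name, count in pattern.findall(trace_text)}


# Excerpt copied from the trace printed above
trace_text = """\
Step normalize COMPLETED in 0.09 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- zoomcamp_data: 948 row(s)
"""

row_counts = parse_row_counts(trace_text)
print(row_counts["zoomcamp_data"])  # prints 948
```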

### A2. 948 rows were inserted into the zoomcamp_data collection

### Question 3. Embeddings

When inserting the data, an embedding model was used.  Which one?

You can find this out by inspecting the meta.json file created in the target folder.
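One way to inspect the file is sketched below. It assumes that meta.json stores named vectors under a "vectors" key somewhere in its JSON tree; that layout is an assumption, not something confirmed here, so the helper simply searches the whole tree. The helper name vector_names, the collection name, and the sample dictionary are hypothetical and only illustrate the assumed shape.

```python
import json
from pathlib import Path


def vector_names(node):
    """Collect the keys of every "vectors" mapping found anywhere in the JSON tree."""
    names = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "vectors" and isinstance(value, dict):
                names.update(value.keys())
            names.update(vector_names(value))
    elif isinstance(node, list):
        for item in node:
            names.update(vector_names(item))
    return names


# Hypothetical sample illustrating the assumed layout (not the real file contents)
sample_meta = {
    "collections": {
        "example_collection": {
            "vectors": {"fast-bge-small-en": {"size": 384, "distance": "Cosine"}}
        }
    }
}
print(vector_names(sample_meta))

# On the real data: db.qdrant is the folder passed as qd_path earlier
meta_path = Path("db.qdrant") / "meta.json"
if meta_path.exists():
    print(vector_names(json.loads(meta_path.read_text())))
```

If the real file stores a single unnamed vector configuration instead, this helper would return its setting names (such as the size), so it is worth opening the file directly to confirm the structure.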

### A3. "fast-bge-small-en" was the embedding model used for this exercise