# Intro dlt -> LanceDB loading example

https://lu.ma/cnpdoc5n

If you want to experiment with this notebook and edit it later, we highly recommend making a copy, since the shared link is view-only. Also make sure you're signed in with your Google account so that you can add secrets.

Before going into a more complex example, we will go through a simple example of how to load the course Q&A data into LanceDB.


## Install requirements

To create a JSON -> LanceDB pipeline, we need to install:

- dlt with the lancedb extras
- sentence-transformers: we need an embedding model to vectorize the data stored in LanceDB. For this we choose the open-source model "sentence-transformers/all-MiniLM-L6-v2".

In [1]:
%%capture
!pip install "dlt[lancedb]==0.5.1a0"
!pip install sentence-transformers
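
Because `%%capture` hides the pip output, it can help to confirm which versions actually ended up in the environment, especially since the dlt pin above is a pre-release. The following is a minimal sketch using only the standard library; the helper name `installed_version` is our own and is not part of dlt or LanceDB.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this notebook relies on.
for name in ("dlt", "lancedb", "sentence-transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any line prints `not installed`, re-run the install cell without `%%capture` to see the pip error.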

## Load the data

We'll first load the data into LanceDB as-is, without embedding it. LanceDB stores both the data and the embeddings, and can also embed data and queries on the fly.

Some definitions:

- A dlt source is a grouping of resources (e.g. all your data from Hubspot)
- A dlt resource is a function that yields data (e.g. a function yielding all your Hubspot companies)
- A dlt pipeline is how you ingest your data

Loading the data consists of a few steps:

1. Use the requests library to get the data
2. Define a dlt resource that yields the individual documents
3. Create a dlt pipeline and run it
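
To make the "resource yields the individual documents" step concrete without touching the network or dlt, here is a plain-Python sketch of the same flattening. The sample data is hypothetical and only mimics the shape this notebook relies on (a list of courses, each with a `documents` list); the `course` key and the function name `iter_documents` are assumptions for illustration.

```python
from typing import Any, Dict, Iterator, List

# Hypothetical sample shaped like the Q&A file: a list of courses,
# each holding a list of documents with text, section and question.
sample_dataset: List[Dict[str, Any]] = [
    {
        "course": "course-a",  # assumed field, unused below
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1"},
            {"text": "Answer 2", "section": "General", "question": "Question 2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Module 1", "question": "Question 3"},
        ],
    },
]


def iter_documents(dataset: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one document at a time, flattening the per-course lists."""
    for course in dataset:
        yield from course["documents"]


rows = list(iter_documents(sample_dataset))
print(len(rows))         # number of documents across all courses
print(sorted(rows[0]))   # the keys that would become table columns
```

A dlt resource is the same kind of generator with the `@dlt.resource` decorator on top. The notebook's resource yields each course's whole list rather than single documents; to our understanding dlt treats each item of a yielded list as a row, so both forms should produce the same table.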

In [2]:
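import requests
import dlt

# Download the course Q&A data: a list of courses, each holding a list of documents
response = requests.get("https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1")
response.raise_for_status()  # fail early if the download did not succeed
qa_dataset = response.json()

# The resource yields each course's list of documents; every document becomes one row
@dlt.resource
def qa_documents():
  for course in qa_dataset:
    yield course["documents"]

# Pipeline that loads into the "qanda" dataset of the LanceDB destination
pipeline = dlt.pipeline(pipeline_name="from_json", destination="lancedb", dataset_name="qanda")

# Run the pipeline and write the yielded documents to the "documents" table
load_info = pipeline.run(qa_documents, table_name="documents")

print(load_info)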



_dlt_pipeline_state
[{'name': 'version', 'data_type': 'bigint', 'nullable': False}, {'name': 'engine_version', 'data_type': 'bigint', 'nullable': False}, {'name': 'pipeline_name', 'data_type': 'text', 'nullable': False}, {'name': 'state', 'data_type': 'text', 'nullable': False}, {'name': 'created_at', 'data_type': 'timestamp', 'nullable': False}, {'name': 'version_hash', 'data_type': 'text', 'nullable': True}, {'name': '_dlt_load_id', 'data_type': 'text', 'nullable': False}, {'name': '_dlt_id', 'data_type': 'text', 'nullable': False, 'unique': True}]
documents
[{'name': 'text', 'data_type': 'text', 'nullable': True}, {'name': 'section', 'data_type': 'text', 'nullable': True}, {'name': 'question', 'data_type': 'text', 'nullable': True}, {'name': '_dlt_load_id', 'data_type': 'text', 'nullable': False}, {'name': '_dlt_id', 'data_type': 'text', 'nullable': False, 'unique': True}]
_dlt_loads
[{'name': 'load_id', 'data_type': 'text', 'nullable': False}, {'name': 'schema_name', 'data_type': '

In [3]:
import lancedb

db = lancedb.connect("./.lancedb")
print(db.table_names())

['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents']
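
The table names listed above appear to follow a simple pattern: the dataset name, a `___` separator, then the table name (so dlt's internal `_dlt_loads` table shows up with four underscores). The sketch below reproduces that pattern; it is inferred from this output only, and `lancedb_table_name` is a hypothetical helper, not a dlt or LanceDB function.

```python
def lancedb_table_name(dataset_name: str, table_name: str, separator: str = "___") -> str:
    """Build the qualified table name as observed in the output of db.table_names()."""
    return f"{dataset_name}{separator}{table_name}"


# Names as they appear in the listing above.
print(lancedb_table_name("qanda", "documents"))
print(lancedb_table_name("qanda", "_dlt_loads"))
```

This explains why the table is opened as `qanda___documents` rather than `documents`.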


In [4]:
db_table = db.open_table("qanda___documents")

db_table.to_pandas()

| | id__ | text | section | question | _dlt_load_id | _dlt_id |
|---|---|---|---|---|---|---|
| 0 | 6faff58b-fdd4-5edd-97cf-fcc9c4367621 | The purpose of this document is to capture fre... | General course-related questions | Course - When will the course start? | 1721269342.5232055 | maZaihIsaJYCCQ |
| 1 | 4e67bd26-7103-58cf-8707-b58c6ce536a0 | GitHub - DataTalksClub data-engineering-zoomca... | General course-related questions | Course - What are the prerequisites for this c... | 1721269342.5232055 | xYWV3FxNDWxzkA |
| 2 | 0333379e-89f6-5efe-bfdb-23f6566a5256 | Yes, even if you don't register, you're still ... | General course-related questions | Course - Can I still join the course after the... | 1721269342.5232055 | spR+2eQAyFJsbQ |
| 3 | fe5852db-0597-5a3c-9850-1611e3c8e5ff | You don't need it. You're accepted. You can al... | General course-related questions | Course - I have registered for the Data Engine... | 1721269342.5232055 | Wj0KiQo1p1xoSw |
| 4 | b27814dd-0c6a-50a7-a3b9-cdd872b38ccb | You can start by installing and setting up all... | General course-related questions | Course - What can I do before the course starts? | 1721269342.5232055 | C+M5yqHlAqnb2g |
| ... | ... | ... | ... | ... | ... | ... |
| 943 | 449d931e-dd03-578f-b62f-b5854311455e | Problem description\nThis is the step in the c... | Module 6: Best practices | Github actions: Permission denied error when e... | 1721269342.5232055 | kvSYBCh4/5r/iQ |
| 944 | 81686f67-95ba-550b-9b3c-47ac6d394aa3 | Problem description\nWhen a docker-compose fil... | Module 6: Best practices | Managing Multiple Docker Containers with docke... | 1721269342.5232055 | TSdx0tmrRte+hQ |
| 945 | 7716c58d-9b45-5754-94d3-92be2e83dad6 | Problem description\nIf you are having problem... | Module 6: Best practices | AWS regions need to match docker-compose | 1721269342.5232055 | xkPvc4zWaUip3g |
| 946 | 0981b1c6-44fb-5ca3-933b-e35b19cf0bc7 | Problem description\nPre-commit command was fa... | Module 6: Best practices | Isort Pre-commit | 1721269342.5232055 | qefkL8u/k06lyA |
