<a href="https://colab.research.google.com/github/hamidb201214-svg/Lectures/blob/main/M3_3_LanceDB_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://lancedb.github.io/lancedb/assets/ecosystem-illustration.png)

LanceDB is an open-source vector database with persistent storage. It simplifies storing, retrieving, filtering, and managing embeddings.

In [None]:
!pip install -U langchain_huggingface

In [None]:
!pip install -q lancedb
!pip install -q pypdf
# !pip install -qqq chromadb==0.4.10 --progress-bar off
!pip install -qqq langchain==0.0.299 --progress-bar off
!pip install -qqq sentence_transformers==2.2.2 --progress-bar off

In [None]:
import lancedb

uri = "/content/data/sample-lancedb"
db = lancedb.connect(uri)

In [None]:
table = db.create_table("my_table",
                        data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
                              {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])

In [None]:
import lancedb

# IMPORTANT: use the same URI/path you used when creating the table
db = lancedb.connect("/content/data/sample-lancedb")   # e.g. "./lancedb" or "/path/to/lancedb"

table = db.open_table("my_table")   # load existing table by name


In [None]:
result = table.search([100, 100]).limit(2).to_list()

In [None]:
result
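
To make the query above less of a black box, here is a minimal pure-Python sketch of a brute-force nearest-neighbour search over the same two rows. It assumes the default metric is Euclidean and that the `_distance` value LanceDB attaches to each hit is the squared Euclidean distance. The helper names `squared_l2` and `brute_force_search` are hypothetical and are not part of the LanceDB API.

```python
# Toy rows copied from the table created above.
rows = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(rows, query, limit=2):
    """Score every row against the query and return the closest `limit` rows."""
    scored = [dict(row, _distance=squared_l2(row["vector"], query)) for row in rows]
    return sorted(scored, key=lambda r: r["_distance"])[:limit]


hits = brute_force_search(rows, [100, 100], limit=2)
for hit in hits:
    print(hit["item"], round(hit["_distance"], 2))
```

With only two rows, both come back and the closer one is listed first. A real vector database can avoid scanning every row by using an index, but the ranking idea is the same.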

# Implementing a Vector Database for Documents

In [None]:
# Only the PDF loader is needed for this section
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/attention.pdf")

docs = loader.load()
len(docs)

The PDF we're loading is the original Transformer paper, "Attention Is All You Need". `PyPDFLoader` returns one document per page. Let's see how we can use the RecursiveCharacterTextSplitter to split those pages into smaller chunks:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(docs)
len(texts)
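
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-window chunking. It is not how RecursiveCharacterTextSplitter works internally, since that class also tries to break on separators such as paragraphs and sentences. The function `chunk_text` is a hypothetical helper that only shows the sliding window with overlap.

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=64):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 3000-character sample so the overlap is easy to check.
sample = "".join(chr(97 + i % 26) for i in range(3000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
print(chunks[0][-64:] == chunks[1][:64])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the neighbouring chunks, as long as it is no longer than the overlap.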

Splitting the document into chunks is required because an LLM can only attend to a limited number of tokens at once (4096 for Llama 2). Next, we'll use the HuggingFaceEmbeddings class to create embeddings for the chunks:

In [None]:
import lancedb
from langchain.vectorstores import LanceDB
from langchain.embeddings import HuggingFaceEmbeddings


uri = "/content/data/paper-lancedb-"
db = lancedb.connect(uri)

In [None]:
!pip install -U langchain_huggingface --q

In [None]:
!pip install -U sentence-transformers --q

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

In [None]:
table = db.create_table(
    "paper_table",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
            "source": "", # Add source field
            "page": 0,    # Add page field
        }
    ],
    mode="overwrite",
)

docsearch = LanceDB.from_documents(texts[5:20], embeddings, connection=table)

In [None]:
retriever = docsearch.as_retriever(search_kwargs={'k': 2})

In [None]:
retriever

In [None]:
texts[0].page_content

In [None]:
result = table.search(embeddings.embed_query(texts[0].page_content)).limit(2).to_list()

In [None]:
result[0].keys()
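
The raw rows returned by the search are dictionaries that include the full embedding vector, which makes them hard to read. Below is a small sketch of a formatter for such rows. It assumes each row carries the `text` and `page` fields defined in the table above plus a `_distance` field added by the search. The helper `summarize_hits` and the example rows are hypothetical, not real search output.

```python
def summarize_hits(rows, preview_chars=80):
    """Return one readable line per search hit, omitting the embedding vector."""
    lines = []
    for rank, row in enumerate(rows, start=1):
        # Collapse newlines and repeated spaces so the preview fits on one line.
        text = " ".join(str(row.get("text", "")).split())
        distance = row.get("_distance")
        dist_str = f"{distance:.4f}" if distance is not None else "n/a"
        lines.append(
            f"{rank}. page={row.get('page')} distance={dist_str} "
            f"text={text[:preview_chars]!r}"
        )
    return lines


# Hypothetical rows shaped like the dictionaries in `result`.
example_rows = [
    {"text": "Attention  mechanisms\nhave become integral", "page": 1,
     "_distance": 0.4213, "vector": [0.0, 0.1]},
    {"text": "Hello World", "page": 0, "vector": [0.2, 0.3]},
]

for line in summarize_hits(example_rows):
    print(line)
```

In the notebook, the same helper could be called as `summarize_hits(result)` after the search cell has run.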

## Exercise 1: Create a LanceDB for Two Papers and Load Each into a Table

# ChromaDB

![](https://images.datacamp.com/image/upload/v1693482377/image4_7b6910cd7c.png)

In the spirit of using free tools, we're also using free embedding models hosted on the Hugging Face Hub. We'll use the Chroma database to store and cache the embeddings and make them easy to search.

The resulting store can later be combined with an LLM through a RetrievalQA chain. Here we first install Chroma, then build and query the store:

In [None]:
!pip install -qqq chromadb==0.4.10 --progress-bar off

In [None]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(texts, embeddings, persist_directory="db")
results = db.similarity_search("Transformer models", k=2)
print(results[0].page_content)

In [None]:
results[0]

## Exercise 2: Create a ChromaDB for Two Papers and Load Each into a Collection