# EXAMPLES (RAG)

## RAG (LangChain Integration)

### Use Deep Lake as a Vector Store in LangChain
*Deep Lake can be used as a VectorStore in LangChain to build apps that require filtering and vector search. This tutorial shows how to create a Deep Lake Vector Store in LangChain and use it to build a Q&A app about the Twitter OSS recommendation algorithm.*

In [1]:
# !pip install langchain deeplake openai tiktoken

In [2]:
# !pip install -U langchain-deeplake

#### Downloading and Preprocessing the Data

In [3]:
# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
# from langchain.vectorstores import DeepLake
from langchain_community.vectorstores import DeepLake
# from langchain_deeplake.vectorstores import DeeplakeVectorStore
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# from langchain.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os
from dotenv import load_dotenv

load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
activeloop_token = os.getenv('ACTIVELOOP_TOKEN')



In [4]:
MODEL_GPT = 'gpt-4o-mini'

In [5]:
# Clone the Twitter OSS recommendation algorithm

# !git clone https://github.com/twitter/the-algorithm

In [6]:
# Load all the files from the repo into a list

# repo_path = './the-algorithm'  # the full repository
repo_path = './the-algorithm/twml/twml/layers'  # only the twml layers subdirectory

docs = []
for dirpath, dirnames, filenames in os.walk(repo_path):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
            print(file)  # log each file that loaded successfully
        except Exception as e:
            # Skip files that cannot be read as UTF-8 text
            print(e)

batch_prediction_tensor_writer.py
batch_prediction_writer.py
data_record_tensor_writer.py
full_dense.py
full_sparse.py
isotonic.py
layer.py
mdl.py
partition.py
percentile_discretizer.py
sequential.py
sparse_max_norm.py
stitch.py
__init__.py


In [7]:
print(type(docs))
print(len(docs))

<class 'list'>
26


In [8]:
# [Note on chunking text files]
# - Text files are typically split into chunks before creating embeddings.
# - In general, more (smaller) chunks increase the relevance of the data fed into the language model,
#   since granular data can be selected with higher precision.
# - However, since an embedding is created for each chunk, more chunks also increase the computational cost.

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

Created a chunk of size 1684, which is longer than the specified 1000
Created a chunk of size 1760, which is longer than the specified 1000
Created a chunk of size 1157, which is longer than the specified 1000
Created a chunk of size 2504, which is longer than the specified 1000
Created a chunk of size 1427, which is longer than the specified 1000
Created a chunk of size 1438, which is longer than the specified 1000


In [9]:
print(type(texts))
print(len(texts))

<class 'list'>
86


**Chunks in the above context should not be confused with Deep Lake chunks!**
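
The "longer than the specified 1000" warnings above appear because a character-based splitter only cuts at its separator, so a single piece with no separator inside it cannot be shortened. The following is a minimal pure-Python sketch of that greedy split-and-merge idea. It is a simplification for illustration, not LangChain's implementation; the function name `split_on_separator` and the `"\n\n"` separator are assumptions.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is emitted unchanged, which is
    why oversized chunks can appear in the output.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# Two short paragraphs followed by one paragraph longer than the limit
sample = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 120
chunks = split_on_separator(sample, chunk_size=100)
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)
```

The two short paragraphs are merged into one chunk, while the long paragraph becomes a chunk that exceeds the limit.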

#### Creating the Deep Lake Vector Store

In [10]:
# dataset_path = 'hub://<org-id>/twitter_algorithm'
dataset_path = 'hub://pavelkloscz/twitter_algorithm_twml'  # [twml] subdirectory of this github repo

In [11]:
# Specify an OpenAI embedding model for creating the embeddings, and create the VectorStore.
# This process creates an embedding for each element in the texts list and stores it in Deep Lake format at the specified path

embeddings = OpenAIEmbeddings()

In [12]:
db = DeepLake.from_documents(texts, embeddings, dataset_path=dataset_path)

Your Deep Lake dataset has been successfully created!


Creating 86 embeddings in 1 batches of size 86:: 100%|███████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.23s/it]

Dataset(path='hub://pavelkloscz/twitter_algorithm_twml', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (86, 1)      str     None   
 metadata     json      (86, 1)      str     None   
 embedding  embedding  (86, 1536)  float32   None   
    id        text      (86, 1)      str     None   





**The Deep Lake Vector Store has four tensors: text, metadata, embedding, and id.**

#### Use the Vector Store in a Q&A App

In [13]:
# Use the VectorStore in Q&A app, where the embeddings will be used to filter relevant documents (texts)
#   that are fed into an LLM in order to answer a question.
# If we were on another machine, we would load the existing Vector Store without recalculating the embeddings.

db = DeepLake(dataset_path=dataset_path, read_only=True, embedding=embeddings)


Deep Lake Dataset in hub://pavelkloscz/twitter_algorithm_twml already exists, loading from the storage


In [14]:
# Create a retriever object and specify the search parameters

retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['k'] = 20
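
To make the two search parameters concrete, here is a small sketch of what a cosine-distance top-k lookup computes. It uses plain Python lists rather than Deep Lake, and the helper names `cosine_similarity` and `top_k` as well as the toy vectors are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
nearest = top_k([1.0, 0.1], toy_vectors, k=2)
print(nearest)
```

In the retriever, `k` plays the same role as here: it sets how many of the best-scoring chunks are passed on to the LLM.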

In [15]:
# Create a RetrievalQA chain in LangChain and run it

# model = ChatOpenAI(model='gpt-4')  # alternative: 'gpt-3.5-turbo'
model = ChatOpenAI(model=MODEL_GPT)
qa = RetrievalQA.from_llm(model, retriever=retriever)

In [16]:
# qa.run('What programming language is most of the SimClusters written in?')
# qa.invoke('What programming language is most of the SimClusters written in?')
qa.invoke('What programming language is most of the Batch Prediction written in?')  ## batch_prediction

{'query': 'What programming language is most of the Batch Prediction written in?',
 'result': 'Most of the Batch Prediction code is written in Python.'}

In [17]:
qa.invoke('What programming language is most of the TWML written in?')  ## twml

{'query': 'What programming language is most of the TWML written in?',
 'result': 'Most of the TWML is written in Python, as indicated by the use of Python syntax and libraries such as TensorFlow in the provided contexts.'}

**Tune k in the retriever so that the prompt stays within the model's token limit.**<br>
**A higher k can improve accuracy by including more data in the prompt, at the cost of a longer prompt.**
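
One way to reason about this trade-off is to pick the largest k whose chunks still fit a token budget. The sketch below assumes a rough heuristic of about four characters per token and uses hypothetical helper names (`estimate_tokens`, `max_k_within_budget`); a real application would count tokens with the model's tokenizer instead.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def max_k_within_budget(ranked_chunks, token_budget):
    """Largest k such that the first k ranked chunks fit within token_budget."""
    used = 0
    k = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            break
        used += cost
        k += 1
    return k


# Chunks ordered from most to least relevant; estimated costs are 100, 200, 300, 100 tokens
ranked = ["x" * 400, "y" * 800, "z" * 1200, "w" * 400]
k = max_k_within_budget(ranked, token_budget=350)
print(k)
```

The resulting value could then be assigned to `retriever.search_kwargs['k']`.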

#### Adding data to an existing Vector Store

In [18]:
# Data can be added to an existing Vector Store by loading it using its path and adding documents or texts

db = DeepLake(dataset_path=dataset_path, embedding=embeddings)

# Don't run this here in order to avoid data duplication
# db.add_documents(texts)

Deep Lake Dataset in hub://pavelkloscz/twitter_algorithm_twml already exists, loading from the storage


#### Adding Hybrid Search to the Vector Store

In [19]:
# Since embedding search can be computationally expensive, you can narrow the search by filtering the data with an
#   explicit condition on top of the embedding search. Suppose we want to answer a question related to the trust and safety models.
# We can filter on the filenames (source) in the metadata using a custom function that is added to the retriever.
# Note: the filter only matches if the Vector Store contains files from that directory.
def source_filter(deeplake_sample):
    return 'trust_and_safety_models' in deeplake_sample['metadata'].data()['value']['source']

retriever.search_kwargs['filter'] = source_filter

In [20]:
qa = RetrievalQA.from_llm(model, retriever=retriever)

# qa.run("What do the trust and safety models do?")
qa.invoke("What do the trust and safety models do?")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 86/86 [00:00<00:00, 145.51it/s]


{'query': 'What do the trust and safety models do?',
 'result': 'Trust and safety models are designed to protect users and maintain a safe environment on platforms, typically in the context of online services and communities. They help identify, prevent, and address harmful behaviors such as harassment, abuse, misinformation, and fraud. These models often include the use of algorithms and human moderation to monitor user interactions, enforce community guidelines, and ensure compliance with legal and ethical standards. Ultimately, their goal is to create a secure and positive experience for all users.'}

In [21]:
# Filters can also be specified as a dictionary.
# For example, if the metadata tensor had a key year, we can filter based on that key using

# retriever.search_kwargs['filter'] = {"metadata": {"year": 2020}}
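
The two filter styles can be compared on plain dictionaries. This sketch does not call Deep Lake: the samples are ordinary dicts rather than Deep Lake samples (so there is no `.data()['value']` access), the sample contents are invented, and it assumes a dictionary filter matches when every listed key has an equal value.

```python
# Hypothetical samples standing in for rows of a vector store
samples = [
    {"text": "toxicity model", "metadata": {"source": "repo/trust_and_safety_models/toxicity.py", "year": 2020}},
    {"text": "dense layer", "metadata": {"source": "repo/twml/layers/full_dense.py", "year": 2020}},
    {"text": "nsfw model", "metadata": {"source": "repo/trust_and_safety_models/nsfw.py", "year": 2021}},
]


def source_filter(sample):
    """Function-style filter: keep samples whose source path contains a substring."""
    return "trust_and_safety_models" in sample["metadata"]["source"]


def matches_dict_filter(sample, dict_filter):
    """Dictionary-style filter: every listed key must be present with an equal value."""
    for field_name, expected in dict_filter.items():
        actual = sample.get(field_name, {})
        if any(actual.get(key) != value for key, value in expected.items()):
            return False
    return True


by_function = [s["text"] for s in samples if source_filter(s)]
by_dict = [s["text"] for s in samples if matches_dict_filter(s, {"metadata": {"year": 2020}})]
print(by_function)
print(by_dict)
```

A function filter can express arbitrary conditions such as substring matches, while a dictionary filter is limited to equality on the listed keys.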

#### Using Deep Lake in Applications that Require Concurrency

In [22]:
# For applications that require writing of data concurrently, users should set up a lock system to queue
#   the write operations and prevent multiple clients from writing to the Deep Lake Vector Store at the same time.
# This can be done with a few lines of code in the example below

# [Concurrency Using Zookeeper Locks]
# - https://docs.activeloop.ai/technical-details/best-practices/concurrent-writes/concurrency-using-zookeeper-locks
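
The idea of queueing writes behind a lock can be illustrated within a single process using Python's standard `threading.Lock`. This is only a sketch of the principle: `store` is a hypothetical stand-in for a vector store, and a lock like this does not coordinate separate machines or processes, which is the case the linked ZooKeeper guide addresses.

```python
import threading

write_lock = threading.Lock()
store = []  # hypothetical stand-in for a vector store


def add_documents_safely(docs):
    """Serialize writes: only one thread may modify the store at a time."""
    with write_lock:
        for doc in docs:
            store.append(doc)


# Four writers, each adding a batch of three documents
threads = [
    threading.Thread(target=add_documents_safely, args=([f"doc-{i}-{j}" for j in range(3)],))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(store))
```

Because each batch is written while holding the lock, the documents of one writer are never interleaved with those of another.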

#### Accessing the Low Level Deep Lake API (Advanced)

In [None]:
# When using a Deep Lake Vector Store in LangChain, the underlying Vector Store and its low-level Deep Lake dataset can be accessed via

# LangChain Vector Store
db = DeepLake(dataset_path=dataset_path)

# Deep Lake Vector Store object
vector_store = db.vectorstore

# Deep Lake Dataset object
ds = db.vectorstore.dataset

#### SelfQueryRetriever with Deep Lake

In [None]:
# Deep Lake supports the SelfQueryRetriever implementation in LangChain, which translates a user prompt into metadata filters.

In [None]:
# This section of the tutorial requires installation of additional packages

# !pip install "deeplake[enterprise]" lark

In [None]:
# First, let's create a Deep Lake Vector Store with relevant data using the documents below

from langchain_core.documents import Document

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "rating": 9.9,
            "director": "Andrei Tarkovsky",
            "genre": "science fiction",
        },
    ),
]

In [None]:
# Since this feature uses Deep Lake's Tensor Query Language under the hood, the Vector Store must be stored in or connected to Deep Lake,
#   which requires registration with Activeloop

org_id = "<YOUR_ORG_ID>"  # By default, your username is your org_id
dataset_path = f"hub://{org_id}/self_query"

vectorstore = DeepLake.from_documents(
    docs, embeddings, dataset_path = dataset_path, overwrite = True,
)

In [None]:
# Instantiate our retriever by providing information about the metadata fields that
#   our documents support and a short description of the document contents

from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

In [None]:
# Use our retriever

# This example only specifies a relevant query
retriever.get_relevant_documents("What are some movies about dinosaurs")

# [OUTPUT]
# [Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction'}),
#  Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),
#  Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'}),
#  Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]

In [None]:
# Run a query to find movies that are above a certain rating

# This example only specifies a filter
retriever.get_relevant_documents("I want to watch a movie rated higher than 8.5")

# [OUTPUT]
# [Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6}),
#  Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'})]
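
Conceptually, the self-query step turns the phrase "rated higher than 8.5" into a structured comparison on the `rating` field, which is then applied to the metadata. The sketch below shows only that second half on the movie metadata from above, using plain dictionaries. The comparator names and the `apply_filter` helper are assumptions for illustration; this is not the LangChain or Deep Lake implementation, and no LLM is involved.

```python
import operator

# Hypothetical comparator names mapped to Python comparison functions
COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

# Metadata of the six movie documents defined earlier
movies = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "science fiction", "rating": 9.9},
]


def apply_filter(records, comparator, attribute, value):
    """Keep records that have the attribute and satisfy the comparison."""
    compare = COMPARATORS[comparator]
    return [r for r in records if attribute in r and compare(r[attribute], value)]


# Structured form of "rated higher than 8.5"
highly_rated = apply_filter(movies, "gt", "rating", 8.5)
years = [m["year"] for m in highly_rated]
print(years)
```

Records without a `rating` field, such as the 1995 entry, are skipped rather than treated as a match.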

In [1]:
# !pip install datasets
# !pip install ipywidgets

In [2]:
from deeplake import VectorStore
# from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore
# from deeplake.core.vectorstore import VectorStore
import os
import getpass
import datasets
import openai
from pathlib import Path
from dotenv import load_dotenv

load_dotenv(override = True)
openai_api_key = os.getenv('OPENAI_API_KEY')
activeloop_token = os.getenv('ACTIVELOOP_TOKEN')



In [5]:
# Download the dataset locally

# corpus = datasets.load_dataset("scifact", "corpus")
corpus = datasets.load_dataset("scifact", "corpus", trust_remote_code=True)

Downloading data:   0%|          | 0.00/3.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5183 [00:00<?, ? examples/s]