<a href="https://colab.research.google.com/github/jcaw07/ArXivChatGuru/blob/main/Agentic_RAG_Redis_Cohere.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer Support RAG Agent
*powered by Redis*

In this guide, you build an **agent** that performs **RAG** to answer questions about a car brochure PDF.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙 and the integration packages below.

In [8]:
# @title
%pip install -U llama-index llama-parse llama-hub
%pip install llama-index-vector-stores-redis
%pip install llama-index-storage-docstore-redis
%pip install llama-index-storage-chat-store-redis
%pip install llama-index-llms-cohere
%pip install llama-index-embeddings-cohere
%pip install llama-index-embeddings-huggingface


In [None]:
%load_ext autoreload
%autoreload 2

## Setup and Download Data

In this section, we'll set up a simple Redis db, configure the environment, and ingest the PDF document.

### Setup Redis

In [None]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


In [None]:
REDIS_HOST="localhost"
REDIS_PORT=6379
REDIS_PASSWORD=""
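
Before moving on, it can help to confirm that something is actually listening on the configured host and port. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper, and it only checks that a TCP connection succeeds, not that the server on the other end is Redis.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Same values as the configuration cell above.
print(port_is_open("localhost", 6379))
```

If this prints `False`, the `redis-stack-server` process most likely did not start, and the later cells will fail when they try to connect.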

### Environment Configuration
You will need both a LlamaCloud API Key and a Cohere API Key.

In [None]:
import os

os.environ["LLAMA_CLOUD_API_KEY"] = "YOUR API KEY"
os.environ["CO_API_KEY"] = "YOUR API KEY"
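
A forgotten placeholder key usually surfaces much later as an authentication error from LlamaCloud or Cohere. As a minimal sketch, the hypothetical helper `missing_keys` below reports which of the two variables are unset, empty, or still set to the placeholder string from the cell above.

```python
import os

PLACEHOLDER = "YOUR API KEY"  # the placeholder string used in the cell above


def missing_keys(names, env=os.environ):
    """Return the names that are unset, empty, or still the placeholder."""
    return [name for name in names if env.get(name, "") in ("", PLACEHOLDER)]


# An empty list means both keys look usable.
print(missing_keys(["LLAMA_CLOUD_API_KEY", "CO_API_KEY"]))
```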

In [None]:
# need this for running llama-index code in Jupyter Notebooks
import nest_asyncio
nest_asyncio.apply()

### Download, Parse and Ingest Document
First, download the PDF for this example. A simple shell command pulls the file from a related GitHub project.

In [None]:
!mkdir -p 'data/'
!wget 'https://raw.githubusercontent.com/redis-developer/LLM-Document-Chat/main/docs/2022-chevrolet-colorado-ebrochure.pdf' -O 'data/2022-chevrolet-colorado-ebrochure.pdf'

--2024-04-15 17:35:46--  https://raw.githubusercontent.com/redis-developer/LLM-Document-Chat/main/docs/2022-chevrolet-colorado-ebrochure.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3566101 (3.4M) [application/octet-stream]
Saving to: ‘data/2022-chevrolet-colorado-ebrochure.pdf’


2024-04-15 17:35:47 (75.6 MB/s) - ‘data/2022-chevrolet-colorado-ebrochure.pdf’ saved [3566101/3566101]



Next, parse the PDF with LlamaParse on LlamaCloud. With `result_type="markdown"`, the parser returns the document as markdown text.

In [None]:
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    result_type="markdown"  # "markdown" and "text" are available
)

file_extractor = {".pdf": parser}
reader = SimpleDirectoryReader("./data", file_extractor=file_extractor)
documents = reader.load_data()

Started parsing the file under job_id 0fca28d8-a7b3-489b-8711-0be46f35db9e


Below we build a custom index schema for the `RedisVectorStore`. It sets the index name and key prefix, declares the fields LlamaIndex requires, and adds a vector field sized for the Cohere embedding model.

In [None]:
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.core.ingestion import (
    DocstoreStrategy,
    IngestionPipeline,
    IngestionCache,
)
from llama_index.storage.kvstore.redis import RedisKVStore as RedisCache
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.redis import RedisVectorStore

from redisvl.schema import IndexSchema


embed_model = CohereEmbedding(input_type="search_document")

custom_schema = IndexSchema.from_dict(
    {
        "index": {
            "name": "chevy-colorado",
            "prefix": "pdf:chunk",
            "key_separator": ":"
          },
        # customize fields that are indexed
        "fields": [
            # required fields for llamaindex
            {"type": "tag", "name": "id"},
            {"type": "tag", "name": "doc_id"},
            {"type": "text", "name": "text"},
            # custom vector field for cohere embeddings
            {
                "type": "vector",
                "name": "vector",
                "attrs": {
                    "dims": 1024,
                    "algorithm": "hnsw",
                    "distance_metric": "cosine",
                },
            },
        ],
    }
)
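
The `dims` value in the vector field has to match the length of the vectors the embedding model produces, otherwise the stored vectors will not fit the index. The sketch below shows one way to check this against a plain dict; `vector_field_dims` and `check_embedding` are hypothetical helpers, and `schema_dict` is a trimmed copy of the dict passed to `IndexSchema.from_dict` above.

```python
def vector_field_dims(schema: dict, field_name: str = "vector") -> int:
    """Look up the declared dimensionality of a vector field in a schema dict."""
    for field in schema["fields"]:
        if field["type"] == "vector" and field["name"] == field_name:
            return field["attrs"]["dims"]
    raise KeyError(f"no vector field named {field_name!r}")


def check_embedding(schema: dict, embedding: list) -> None:
    """Raise ValueError if the embedding length differs from the schema."""
    expected = vector_field_dims(schema)
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dims, schema expects {expected}"
        )


# Trimmed copy of the schema dict above, used here as plain data.
schema_dict = {
    "fields": [
        {"type": "text", "name": "text"},
        {
            "type": "vector",
            "name": "vector",
            "attrs": {"dims": 1024, "algorithm": "hnsw", "distance_metric": "cosine"},
        },
    ]
}

check_embedding(schema_dict, [0.0] * 1024)  # passes silently
```

In the notebook, the zero vector could be replaced with a real embedding of a short test string to confirm that the model and schema agree.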

Now we can build an end-to-end ingestion pipeline as a sequence of transformations backed by a cache, a document store, and a sink (the vector store). **Notice that Redis is used at all stages of the ingest pipeline to process documents at scale, minimizing redundant compute (and thus long-running costs).**

In [None]:
vector_index_pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        REDIS_HOST, REDIS_PORT, namespace="doc-store"
    ),
    vector_store=RedisVectorStore(
        schema=custom_schema,
        redis_url=f"redis://{REDIS_HOST}:{REDIS_PORT}",
    ),
    cache=IngestionCache(
        cache=RedisCache.from_host_and_port(REDIS_HOST, REDIS_PORT),
        collection="doc-cache",
    ),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

In [None]:
vector_index_pipeline.run(documents=documents, show_progress=True)

### Test pipeline consistency and optimizations
Because the pipeline uses the document store and cache, we can run the exact same documents through it again and see that nothing new is ingested: the second run returns an empty list. **This helps prevent redundant computation on ETL, improving costs and throughput at scale.**


In [None]:
vector_index_pipeline.run(documents=documents)

[]
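
The empty list reflects a simple idea: remember a fingerprint of each document and skip anything whose fingerprint has not changed. The sketch below illustrates that idea with an in-memory dict and SHA-256 hashes. `ToyDocstore` is a hypothetical stand-in for illustration, not how LlamaIndex or Redis implement it.

```python
import hashlib


class ToyDocstore:
    """Maps document IDs to content hashes, like a very small docstore."""

    def __init__(self):
        self._hashes = {}

    def filter_new_or_changed(self, docs: dict) -> dict:
        """Return only docs whose content is new or changed, and record them."""
        to_process = {}
        for doc_id, text in docs.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self._hashes.get(doc_id) != digest:
                self._hashes[doc_id] = digest
                to_process[doc_id] = text
        return to_process


store = ToyDocstore()
docs = {"brochure": "The Colorado seats up to five people."}
first = store.filter_new_or_changed(docs)   # new document: processed
second = store.filter_new_or_changed(docs)  # unchanged: skipped
print(len(first), len(second))
```

Editing the text under an existing ID changes its hash, so the document is processed again. That is the behavior an upsert strategy aims for.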

## Building the Agent

In this section we define a ReAct agent that will perform RAG over a PDF document using the Cohere `command-r-plus` language model.

We build both a vector index (for semantic search) and a summary index (for summarization) over the document. The vector query engine is wrapped as a tool and passed to the agent; the summary query engine is used once, to extract a short summary of the document.

At each step, the agent decides whether to call the search tool or to answer directly from the conversation so far.
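
To make the ReAct pattern concrete, here is a toy loop that alternates between a decision step and a tool call until an answer is produced. Everything in it is a stand-in: `react_loop`, `scripted_decide`, and `fake_vector_tool` are hypothetical names, and the scripted decision function replaces the LLM that makes these choices in the real agent.

```python
from typing import Callable, Dict, List, Tuple

# A step is either ("action", tool_name, tool_input) or ("answer", text, "").
Step = Tuple[str, str, str]


def react_loop(
    question: str,
    decide: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate between deciding and calling tools until an answer appears."""
    observations: List[str] = []
    for _ in range(max_steps):
        kind, name_or_text, tool_input = decide(question, observations)
        if kind == "answer":
            return name_or_text
        # Run the chosen tool and feed its result back as an observation.
        observations.append(tools[name_or_text](tool_input))
    return "No answer within the step limit."


def scripted_decide(question: str, observations: List[str]) -> Step:
    """Stand-in for the LLM: search once, then answer from the observation."""
    if not observations:
        return ("action", "vector_tool", question)
    return ("answer", observations[-1], "")


def fake_vector_tool(query: str) -> str:
    """Stand-in for the vector query engine."""
    return "The 2022 Chevrolet Colorado can seat up to five people."


print(
    react_loop(
        "What is the seating capacity?",
        scripted_decide,
        {"vector_tool": fake_vector_tool},
    )
)
```

The step limit matters in practice: without it, a model that keeps choosing tools would never return.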

In [None]:
# Setup Cohere as the base embedding model and LLM
from llama_index.llms.cohere import Cohere
from llama_index.core import Settings

llm = Cohere(model="command-r-plus")
Settings.llm = llm
Settings.embed_model = CohereEmbedding(input_type="search_query")

In [None]:
# Set up memory for the Agent
from llama_index.storage.chat_store.redis import RedisChatStore
from llama_index.core.memory import ChatMemoryBuffer

# build memory
chat_store = RedisChatStore(redis_url=f"redis://{REDIS_HOST}:{REDIS_PORT}", ttl=300)

chat_memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000,
    chat_store=chat_store,
    chat_store_key="user_1"
)
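
The `token_limit` caps how much history is sent back to the model. As a rough sketch of the idea, the hypothetical function below keeps the most recent messages that fit a budget, counting whitespace-separated words as tokens. The real buffer uses a proper tokenizer and its exact trimming rules may differ.

```python
def trim_to_budget(messages, token_limit, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined token count fits the limit."""
    kept = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > token_limit:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "What is the seating capacity?",
    "Up to five people.",
    "What is the towing capacity?",
]
print(trim_to_budget(history, token_limit=9))
```

With a budget of 9 word-tokens, the oldest question is dropped and the two most recent messages are kept.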

In [None]:
from llama_index.core.agent import ReActAgent
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.node_parser import SentenceSplitter

import pickle


async def build_doc_agent(doc):
    # run ingestion
    vector_index_pipeline.run(documents=[doc], show_progress=True)

    # grab the nodes
    node_parser = SentenceSplitter()
    nodes = node_parser.get_nodes_from_documents([doc])

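    # build a tool-safe ID from the file name: drop the extension, replace hyphens
    file_name = doc.metadata["file_name"]
    # removesuffix (Python 3.9+) drops only the extension; strip(".pdf") would
    # remove any leading or trailing '.', 'p', 'd', or 'f' characters
    file_id = file_name.removesuffix(".pdf").replace("-", "_")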

    print(file_id)

    file_path = f"./data/{file_name}"
    summary_out_path = f"./data/{file_name}_summary.pkl"
    vector_index = VectorStoreIndex.from_vector_store(
        vector_index_pipeline.vector_store
    )

    # build summary index
    summary_index = SummaryIndex(nodes)

    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize"
    )

    # extract a summary
    summary = str(
        await summary_query_engine.aquery(
            "Extract a concise 1-2 line summary of this document"
        )
    )
    # persist the summary; the context manager closes the file handle
    with open(summary_out_path, "wb") as f:
        pickle.dump(summary, f)


    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name=f"vector_tool_{file_id}",
                description="Useful for questions related to specific facts about the chevy colorado",
            ),
        )
    ]

    # build ReAct agent
    agent = ReActAgent.from_tools(
        query_engine_tools,
        llm=llm,
        verbose=True,
        memory=chat_memory,
        context=f"""\
You are a specialized, trustworthy, helpful, and technical customer support agent designed to answer queries about the Chevy Colorado 2022 vehicle.
Use the available tools provided when answering a question. Do NOT just blindly make things up about the car unless it is grounded by the retrieved sources.\
""")

    return agent, summary


In [None]:
agent, doc_summary = await build_doc_agent(documents[0])

2022_chevrolet_colorado_ebrochure
17:37:36 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"


In [None]:
doc_summary

'The 2022 Chevrolet Colorado is a midsize pickup truck with four models, three engine options, and various special editions, offering comfort, style, and off-road capabilities.'

## Using the Agent

In [None]:
response = agent.chat("What is the seating capacity of the vehicle?")
print(str(response))

17:38:22 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: vector_tool_2022_chevrolet_colorado_ebrochure
Action Input: {'input': 'How many people can fit in the 2022 Chevy Colorado?'}
17:38:22 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/embed "HTTP/1.1 200 OK"
17:38:22 llama_index.vector_stores.redis.base INFO   Querying index chevy-colorado with filters *
17:38:22 llama_index.vector_stores.redis.base INFO   Found 2 results for query with id ['pdf:chunk:114a33c8-392a-4513-8b85-657e096b1280', 'pdf:chunk:eab44d43-9244-4eb0-9bda-5f68322eb866']
17:38:23 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Observation: The 2022 Chevrolet Colorado can seat up to five people.
17:38:24 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"

In [None]:
response = agent.chat("What is the towing capacity?")
print(str(response))

17:38:39 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: vector_tool_2022_chevrolet_colorado_ebrochure
Action Input: {'input': 'towing capacity'}
17:38:39 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/embed "HTTP/1.1 200 OK"
17:38:39 llama_index.vector_stores.redis.base INFO   Querying index chevy-colorado with filters *
17:38:39 llama_index.vector_stores.redis.base INFO   Found 2 results for query with id ['pdf:chunk:114a33c8-392a-4513-8b85-657e096b1280', 'pdf:chunk:62a70373-57d1-43ba-9961-18c3e44bc956']
17:38:44 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Observation: The 2022 Chevrolet Colorado has a maximum towing capacity of 7,700 lbs when equipped with the available Duramax 2.8L Turbo-Diesel engine. This is based on the Crew Cab Short Box LT 2WD model with the 

In [None]:
response = agent.chat("Is there a trailer hitch on the back of the truck?")
print(str(response))

17:38:59 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Thought: I can answer without using any more tools. I'll use the user's language to answer.
Answer: Yes, the 2022 Chevrolet Colorado is available with a trailer hitch receiver, which is located at the back of the truck. This allows for easy towing and hauling of trailers or other equipment.
Yes, the 2022 Chevrolet Colorado is available with a trailer hitch receiver, which is located at the back of the truck. This allows for easy towing and hauling of trailers or other equipment.


In [None]:
response = agent.chat("Tell me about the pros and cons of this truck.")
print(str(response))

17:39:14 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Thought: I need to use a tool to help me answer the question.
Action: vector_tool_2022_chevrolet_colorado_ebrochure
Action Input: {'input': 'Pros and cons of the 2022 Chevrolet Colorado.'}
17:39:15 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/embed "HTTP/1.1 200 OK"
17:39:15 llama_index.vector_stores.redis.base INFO   Querying index chevy-colorado with filters *
17:39:15 llama_index.vector_stores.redis.base INFO   Found 2 results for query with id ['pdf:chunk:114a33c8-392a-4513-8b85-657e096b1280', 'pdf:chunk:eab44d43-9244-4eb0-9bda-5f68322eb866']
17:39:20 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Observation: Pros: 
- A wide range of models, cab styles, and engines to choose from, ensuring customers can find a configuration that suits their needs.
- Impressive fuel efficiency, with a diesel engine option offering up to 30

In [None]:
agent.memory.chat_store.get_messages("user_1")

[ChatMessage(role=<MessageRole.USER: 'user'>, content='What is the seating capacity of the vehicle?', additional_kwargs={}),
 ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content='The 2022 Chevrolet Colorado can seat up to five people.', additional_kwargs={}),
 ChatMessage(role=<MessageRole.USER: 'user'>, content='What is the towing capacity?', additional_kwargs={}),
 ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content="The towing capacity of the 2022 Chevrolet Colorado varies depending on the specific model and its features. When equipped with the available Duramax 2.8L Turbo-Diesel engine, the Colorado can tow up to an impressive 7,700 lbs. This is based on the Crew Cab Short Box LT 2WD model with specific packages. \n\nHowever, other factors like the cab style and engine type can influence the towing capacity. For example, the ZR2 model with the same Duramax 2.8L engine has a maximum towing capacity of 5,000 lbs. \n\nIt's important to carefully review the specif

### Incorporating Semantic Caching
We can also take advantage of frequently asked questions (live or prefetched) in order to improve response times.

In [None]:
from redisvl.extensions.llmcache import SemanticCache
from redisvl.utils.vectorize import HFTextVectorizer

emb = HFTextVectorizer(model="BAAI/bge-small-en-v1.5")

cache = SemanticCache(
    name="chevy_cache",
    prefix="cache",
    distance_threshold=0.1,
    ttl=60,
    vectorizer=emb
)

17:57:25 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
17:57:26 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: cpu


In [None]:
def invoke_agent(prompt: str) -> str:
    if cached_result := cache.check(prompt=prompt):
        response = cached_result[0]['response']
        return response
    response = agent.chat(prompt)
    cache.store(prompt=prompt, response=response.response)
    return response.response

Now we can perform a simple test with our agent and semantic caching enabled.

In [None]:
invoke_agent("How many doors does the truck have?")

17:57:34 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: vector_tool_2022_chevrolet_colorado_ebrochure
Action Input: {'input': 'How many doors does the truck have?'}
17:57:34 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/embed "HTTP/1.1 200 OK"
17:57:34 llama_index.vector_stores.redis.base INFO   Querying index chevy-colorado with filters *
17:57:34 llama_index.vector_stores.redis.base INFO   Found 2 results for query with id ['pdf:chunk:8224d4aa-fb23-4689-9a1d-f79d192ca2e5', 'pdf:chunk:114a33c8-392a-4513-8b85-657e096b1280']
17:57:35 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Observation: The truck has four doors.
17:57:36 httpx INFO   HTTP Request: POST https://api.cohere.ai/v1/chat "HTTP/1.1 200 OK"
Thought: I can answer without using any more t

'The Chevy Colorado has four doors.'

In [None]:
invoke_agent("How many doors does the chevy have?")

'The Chevy Colorado has four doors.'
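
The second call returned the stored answer without any agent trace because its prompt was close enough to the first one. The toy below sketches that lookup: embed the prompt, find the nearest stored prompt, and return its response only if the distance is within the threshold. `ToySemanticCache` and the word-count `embed` function are hypothetical stand-ins. Word counts are far cruder than a real embedding model, so the toy uses a looser threshold (0.2) than the 0.1 configured above.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().replace("?", "").split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0


class ToySemanticCache:
    def __init__(self, distance_threshold: float):
        self.distance_threshold = distance_threshold
        self._entries = []  # list of (embedding, response) pairs

    def store(self, prompt: str, response: str) -> None:
        self._entries.append((embed(prompt), response))

    def check(self, prompt: str):
        """Return the closest stored response within the threshold, else None."""
        query = embed(prompt)
        best = min(
            self._entries,
            key=lambda entry: cosine_distance(query, entry[0]),
            default=None,
        )
        if best is not None and cosine_distance(query, best[0]) <= self.distance_threshold:
            return best[1]
        return None


toy_cache = ToySemanticCache(distance_threshold=0.2)
toy_cache.store("How many doors does the truck have?", "The Chevy Colorado has four doors.")
print(toy_cache.check("How many doors does the chevy have?"))  # near-duplicate: hit
print(toy_cache.check("What is the towing capacity?"))         # unrelated: miss
```

The threshold is the main tuning knob. Too loose, and different questions share an answer; too tight, and paraphrases miss the cache.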

### Extending Semantic Caching


There are a few options for working with semantic caching in a production setting:
1. Extract FAQs from your knowledge base (PDFs and similar documents), with help from an LLM or from human experts, and prefetch them into the cache.
2. Carefully extract FAQs from conversation history, and prefetch them into the cache in daily or weekly batches.
3. Cache in real time, which is tricky and requires some tuning. This should typically be done only at the user level, not across the entire domain or application.
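
As a minimal sketch of the prefetch options, the hypothetical helper below writes a list of question and answer pairs into any cache object that exposes the `store(prompt=..., response=...)` call used in `invoke_agent`. `RecordingCache` is a stand-in so the example runs on its own; in the notebook, the `cache` object created earlier would take its place.

```python
def prefetch_faqs(cache, faqs):
    """Store each (question, answer) pair in the cache; return how many were written."""
    count = 0
    for question, answer in faqs:
        cache.store(prompt=question, response=answer)
        count += 1
    return count


class RecordingCache:
    """Stand-in that records writes instead of talking to Redis."""

    def __init__(self):
        self.entries = []

    def store(self, prompt, response):
        self.entries.append((prompt, response))


# Answers taken from the agent responses earlier in this guide.
faqs = [
    (
        "What is the seating capacity of the vehicle?",
        "The 2022 Chevrolet Colorado can seat up to five people.",
    ),
    ("How many doors does the truck have?", "The Chevy Colorado has four doors."),
]
demo_cache = RecordingCache()
print(prefetch_faqs(demo_cache, faqs))
```

Note that the cache above was created with `ttl=60`, so prefetched entries would expire unless a longer TTL is chosen for FAQ content.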

**Stay tuned for more dedicated guides on semantic caching for RAG Agents and Redis.**


