# RAG and Agentic Patterns at phData

## What is RAG?

**RAG (Retrieval-Augmented Generation)** is an AI technique that combines retrieval of relevant information with a generative model to produce more context-aware and accurate responses. Unlike generative models that rely solely on the knowledge encoded during training, RAG draws on external knowledge sources, such as databases, document stores, or knowledge bases, to supply up-to-date information at query time. Structured or semi-structured **knowledge bases** can make a RAG solution more targeted. The demo in this document uses Wikipedia as its knowledge source.

In a RAG workflow, a retriever fetches documents or data relevant to the user's query, and the generative model receives them as context when producing the final response. The answer is therefore both informed by current knowledge and tailored to the user's question.
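
The retrieve-then-generate flow can be sketched in a few lines of plain Python. This is a minimal illustration, not the demo's implementation: the `DOCUMENTS` list, the token-overlap `retrieve` function, and the `build_prompt` helper are all hypothetical stand-ins, and the final call to a generative model is left out.

```python
import re

# Toy knowledge base (hypothetical); in the demo this role is played by Wikipedia.
DOCUMENTS = [
    "The Los Angeles Dodgers won the 2024 World Series in five games.",
    "Paris is the capital of France.",
    "Redis is an in-memory data store that supports vector search.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(user_question: str, context_docs: list[str]) -> str:
    """Combine the retrieved context and the question into a single prompt."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {user_question}\nAnswer:"


demo_question = "Which team won the World Series in 2024?"
context_docs = retrieve(demo_question, DOCUMENTS)
prompt = build_prompt(demo_question, context_docs)
print(prompt)  # this prompt would then be sent to the generative model
```

Token overlap is only a placeholder for the retriever; the later sections replace it with embedding-based search.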

## Why RAG?

RAG is useful for a wide range of scenarios because it combines the benefits of generative language models with the ability to reference up-to-date, external information. Here are some key reasons why RAG is beneficial:

1. **Enhanced Accuracy**: RAG improves the factual accuracy of generated responses, which suits domains where the knowledge base changes constantly. In legal or medical fields, for instance, where information is updated regularly, RAG helps generated responses reflect the most current data available.

2. **Personalization**: Retrieval can be tailored to individual users, allowing the AI to generate more personalized content by retrieving documents relevant to a user's history or preferences. In customer service, for example, RAG can be used to retrieve past interactions and provide customized responses that improve user experience.

3. **Domain Specificity**: RAG can leverage domain-specific knowledge bases to provide detailed, accurate answers in specialized fields like medicine, law, and finance. By accessing curated knowledge bases or internal documentation, the generative model can give answers that a general-purpose model alone could not.

4. **Reduced Hallucination**: Generative models sometimes "hallucinate" and produce incorrect facts. RAG mitigates this by grounding generation in real, retrieved content, reducing the likelihood of incorrect or misleading information. This grounding mechanism is critical in high-stakes scenarios where misinformation can lead to significant consequences.


## When to Use RAG?

RAG is particularly useful in the following situations:

1. **Dynamic Knowledge Requirements**: When the knowledge required to answer a question is constantly changing or being updated, such as in news, legal, or medical information.

2. **Complex Queries**: When answering complex or multi-faceted queries that require synthesizing information from multiple sources.

3. **Highly Specialized Domains**: When the domain of the question is highly specialized, such as industry-specific customer support, technical documentation, or research, where relying solely on a pre-trained model may not suffice.

4. **Customer Service**: When personalization and context are required to provide accurate, user-specific responses in customer support or chatbots.

## When Not to Use RAG?

While RAG is a powerful approach, it isn't the best fit for every problem. Situations where RAG might not be useful include:

1. **Purely Creative Content**: If the task is focused purely on creative storytelling or generating novel content without strict factual requirements, a traditional generative model without retrieval might be more suitable. For instance, writing a fictional story or creating poetry can benefit from the model's inherent creativity without the need for external references.

2. **Simple Queries**: For straightforward queries that require factual answers readily available in the language model's pre-trained dataset, adding retrieval can add unnecessary complexity. For example, questions like "What is the capital of France?" can be directly answered by a pre-trained model without the need for a retrieval step.

3. **Real-Time Interaction**: RAG can add latency due to the retrieval process. If the application requires near-instantaneous responses, such as in chatbots where response time is crucial to user satisfaction, RAG may introduce delays that impact the user experience.

---
## Demo: Using DSPy and LangChain to answer queries augmented by Wikipedia

Before we begin, we need to set up our environment.  
- In our example we use [Ollama](https://ollama.com/) to serve the LLM.  
- The "Advanced RAG Techniques" section uses Redis, so be sure to set up a local [Redis](https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/) server, for example with the following Docker Compose service.

``` yaml
services:
  redis:
    image: docker.io/redis/redis-stack-server:latest
    container_name: redis
    restart: always
    ports:
      - "6379:6379"
```

**Question Of Interest**

In [1]:
question = "Which team won the MLB World Series in 2024?"

Let's dive into a practical example that uses DSPy and LangChain to set up a RAG pattern for retrieving information from Wikipedia.

The code below defines a DSPy signature: the model is given a question and a context, and it returns an answer based on that context. DSPy manages the inputs and outputs, while the LLM served by Ollama generates the answer.

In [3]:
import dspy

lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)


class QuestionAndAnswer(dspy.Signature):
    """
    You are a helpful assistant that answers general questions
    You will be provided a question and context 

    **Fields**:
    - `question`: The question  from the user.
    - `context`: The context provided to help answer the question.
    - `answer`: Response to the question
    """

    question = dspy.InputField(desc="The question being asked.")
    context = dspy.InputField(desc="The context provided to generate an accurate answer.")
    answer = dspy.OutputField(desc="A concise answer to the question, based on the context.")

Let's see how our LLM responds without any extra context, relying only on its training data.


**Inference With No Context**

In [4]:
# No Context
answer = dspy.Predict(QuestionAndAnswer)

with dspy.context(lm=lm):

    inference = answer(
        question=question,
        context = ""
    )

# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Context Length: {len('')}")
print(f"=== LLM Inference === \n{inference.answer}")

Context Length: 0
=== LLM Inference === 
I don't have information about future events, including the outcome of the 2024 MLB World Series. The answer will be provided once the event has occurred.


The output shows that the LLM cannot answer this question, which makes sense: the event happened after the model's training data was collected. Training an LLM takes a long time, so it's not uncommon for a model's training data to be as much as two years out of date.

The next step is to query the information we need and prepare to pass it as input to our LLM.  

**Query From Wikipedia**

In this code snippet we use `langchain_community`'s `WikipediaLoader` to load a limited number of documents from Wikipedia based on a query.
The print statement lets us confirm that we collected the right information.

In [5]:
from langchain_community.document_loaders import WikipediaLoader
simple_context = WikipediaLoader(query=question, load_max_docs=2).load()
print(simple_context)

[Document(metadata={'title': '2024 World Series', 'summary': "The 2024 World Series was the championship series of Major League Baseball's (MLB) 2024 season. The 120th edition of the World Series, it was a best-of-seven playoff between the National League (NL) champion Los Angeles Dodgers and the American League (AL) champion New York Yankees. It was the Dodgers' first World Series appearance and win since 2020, and the Yankees' first World Series appearance since 2009. The series began on October 25 and ended on October 30 with the Dodgers winning in five games. Freddie Freeman was named the MVP of the series, tying a World Series record with 12 runs batted in (RBIs) while hitting home runs in the first four games of the series, including the first walk-off grand slam in World Series history in Game 1.\nThe Dodgers and Yankees entered the 2024 MLB postseason as the top seeds in their respective leagues. The Dodgers had home-field advantage in the series due to their better regular sea

Now we run inference using the retrieved context.

In [6]:
with dspy.context(lm=lm):

    inference = answer(
        question=question,
        context=simple_context[0].page_content
    )
print(f"Context Length: {len(simple_context[0].page_content)}")
print(f"=== LLM Inference === \n{inference.answer}")

Context Length: 4000
=== LLM Inference === 
The Los Angeles Dodgers won the 2024 MLB World Series.


Congratulations! We just used a simple RAG technique that enabled our LLM to use current information to answer a question.

## Advanced RAG Techniques

To make your RAG setup even more powerful, several advanced techniques can be applied:

### 1. Embeddings

**Embeddings** are vector representations of text that allow for semantic comparison. By using embeddings for both the query and documents, you can measure their similarity in a high-dimensional space, leading to more accurate retrieval. Embeddings capture the meaning of words and phrases, enabling a more nuanced search that takes into account context and relationships between terms.

**AWS Service**: You can use **Amazon SageMaker** to host models such as BERT, or custom-trained models, that generate embeddings. **Amazon Kendra** offers managed semantic search if you would rather not build and store embeddings yourself.
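
To make the idea of measuring similarity between embeddings concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-written, hypothetical values chosen for illustration; real embedding models produce hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, hand-written for illustration.
query_vec = [0.9, 0.1, 0.0]
baseball_doc_vec = [0.8, 0.2, 0.1]
cooking_doc_vec = [0.0, 0.2, 0.9]

sim_baseball = cosine_similarity(query_vec, baseball_doc_vec)
sim_cooking = cosine_similarity(query_vec, cooking_doc_vec)
print(f"baseball: {sim_baseball:.3f}, cooking: {sim_cooking:.3f}")
```

The document whose vector points in nearly the same direction as the query scores close to 1, while the unrelated one scores close to 0.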


Query more context from Wikipedia

In [7]:
queries = [
    "What was the final score of the last game of the 2024 World Series?",
    "Who was named MVP of the 2024 World Series?",
    "Which stadium hosted the opening game of the 2024 World Series?",
    "Which teams participated in the 2024 World Series?",
    "What dates was the 2024 World Series played on?",
    "How many games did the 2024 World Series last?"
]
wikipedia_docs = []
for eq in queries:
    docs = WikipediaLoader(query=eq, load_max_docs=2).load()
    wikipedia_docs.extend(docs)

# Combine the extra Wikipedia docs with the original simple_context
adv_context = wikipedia_docs + simple_context

Connect to Redis and Create an Index

In [8]:
import uuid
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
import array
from sentence_transformers import SentenceTransformer

# Initialize the model (will download it if not cached)
embedder = SentenceTransformer('msmarco-distilbert-base-v4')

# Connect to Redis
r = redis.Redis(host="localhost", port=6379, db=0)

# Generate a new, unique index name for each run.
INDEX_NAME = f"docs_idx_{uuid.uuid4().hex}"

# Now create a new index with that unique name
DIMENSIONS = 768 
r.ft(INDEX_NAME).create_index(
    fields=[
        VectorField(
            "embedding",
            "FLAT",
            {
                "TYPE": "FLOAT32",
                "DIM": DIMENSIONS,
                "DISTANCE_METRIC": "COSINE"
            }
        ),
        TextField("page_content")
    ],
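    # Scope the index to this run's keys; a bare "doc:" prefix would also
    # index documents left in Redis by earlier runs under other index names.
    definition=IndexDefinition(prefix=[f"doc:{INDEX_NAME}:"], index_type=IndexType.HASH)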
)

print(f"Created a new index named: {INDEX_NAME}")


Created a new index named: docs_idx_006c8ead15374dd78724cff89499fadd
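
As a rough mental model, a `FLAT` index with the `COSINE` metric behaves like an exhaustive search: every stored vector is compared with the query and the closest ones are kept. The sketch below imitates that in plain Python under this assumption; `flat_knn` and the two-dimensional vectors are hypothetical and are not part of the Redis API.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, so 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def flat_knn(
    query: list[float], vectors: dict[str, list[float]], top_k: int
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the closest top_k."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = closer match
    return scored[:top_k]


# Hypothetical 2-dimensional vectors standing in for 768-dimensional embeddings.
stored = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.7, 0.7],
    "doc:c": [0.0, 1.0],
}
neighbours = flat_knn([0.9, 0.1], stored, top_k=2)
print(neighbours)
```

Because the score is a distance rather than a similarity, lower values indicate better matches.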


Create Embedding function and Insert Documents (Vectors + Text)

In [9]:
# Embedding function
def get_embedding(text: str) -> list[float]:
    """
    Use msmarco-distilbert-base-v4 via SentenceTransformers
    to generate an embedding for the given text.
    """
    # embedder.encode([...]) returns a NumPy array with one row per input text
    embedding_vector = embedder.encode([text])[0]
    return embedding_vector.tolist()

# Store each document’s text + embedding in Redis
for d in adv_context:
    embedding = get_embedding(d.page_content)
    
    # Convert the list of floats to bytes for Redis
    embedding_bytes = array.array('f', embedding).tobytes()
    
    # Key pattern "doc:{INDEX_NAME}:{metadata}": the document's metadata stands in for an id,
    # so a document returned by several queries is stored only once per index
    doc_key = f"doc:{INDEX_NAME}:{d.metadata}"
    
    # Store both the text and the vector field
    r.hset(
        doc_key,
        mapping={
            "page_content": d.page_content,
            "embedding": embedding_bytes
        }
    )
    
    print(f"Stored '{doc_key}' in Redis")

Stored 'doc:docs_idx_006c8ead15374dd78724cff89499fadd:{'title': '2024 World Series', 'summary': "The 2024 World Series was the championship series of Major League Baseball's (MLB) 2024 season. The 120th edition of the World Series, it was a best-of-seven playoff between the National League (NL) champion Los Angeles Dodgers and the American League (AL) champion New York Yankees. It was the Dodgers' first World Series appearance and win since 2020, and the Yankees' first World Series appearance since 2009. The series began on October 25 and ended on October 30 with the Dodgers winning in five games. Freddie Freeman was named the MVP of the series, tying a World Series record with 12 runs batted in (RBIs) while hitting home runs in the first four games of the series, including the first walk-off grand slam in World Series history in Game 1.\nThe Dodgers and Yankees entered the 2024 MLB postseason as the top seeds in their respective leagues. The Dodgers had home-field advantage in the ser
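
The conversion from a list of floats to bytes can be checked in isolation. The sketch below uses a hypothetical four-dimensional vector and assumes the platform's C `float` is 4 bytes, which holds on common systems; it encodes the vector the same way as the loop above and decodes it again.

```python
import array

# Hypothetical 4-dimensional embedding; the demo's vectors have 768 dimensions.
embedding = [0.25, -1.5, 3.0, 0.0]

# 'f' is the C float type code (assumed to be 4 bytes here),
# so the blob is 4 * len(embedding) bytes long.
blob = array.array("f", embedding).tobytes()
print(len(blob))

# Decode the blob back into floats to confirm the round trip.
decoded = array.array("f")
decoded.frombytes(blob)
print(decoded.tolist())
```

The blob length must match the `DIM` declared on the index multiplied by the size of the declared `TYPE`, which is why the embedding model's output dimension and the index definition have to agree.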

Search the Redis index and pass the retrieved context to the LLM for inference.

In [10]:
def redis_vector_search(r, index_name: str, query_text: str, top_k: int = 10) -> list:
    """
    1. Generate an embedding for the query text.
    2. Use Redis vector similarity search (KNN) to find the top_k matches in the given index.
    3. Do NOT rank (sort) results by vector_score; simply return them in the default order.
    """

    # Generate a query embedding
    query_embedding = get_embedding(query_text)  # Must return list[float]

    # Convert embedding to float32 bytes
    query_embedding_bytes = array.array('f', query_embedding).tobytes()

    # Build the vector search query without sorting:
    #   [KNN <top_k> @embedding $BLOB AS vector_score]
    redis_query = f"*=>[KNN {top_k} @embedding $BLOB AS vector_score]"

    # Execute the search against our index named index_name
    # - We return only "page_content"
    search_result = r.ft(index_name).search(
        Query(redis_query)
        .return_fields("page_content")
        .dialect(2),
        query_params={"BLOB": query_embedding_bytes}
    )

    # Format the results, excluding ranking and score
    results = []
    for doc in search_result.docs:
        results.append({
            # We only keep the raw content, no "score" field
            "page_content": doc.page_content
        })

    return results

# Search the index
search_results = redis_vector_search(r, INDEX_NAME, question)

# Aggregate the results
search_context = "\n\n".join([res["page_content"] for res in search_results])

# Pass the concatenated context to our QA pipeline
with dspy.context(lm=lm):
    inference = answer(
        question=question,
        context=search_context
    )

# Print the final answer
print(f"Context Length: {len(search_context)}")
print(f"=== LLM Inference === \n{inference.answer}")


Context Length: 40018
=== LLM Inference === 
The World Series is the annual championship series of Major League Baseball (MLB) and concludes the MLB postseason.
First played in 1903, the World Series championship is a best-of-seven playoff and is a contest between the champions of baseball's National League (NL) and American League (AL).
Often referred to as the "Fall Classic", the modern World Series has been played every year since 1903 with two exceptions: in 1904, when the NL champion New York Giants declined to play the AL champion Boston Americans; and in 1994, when the series was canceled due to the players' strike.
The best-of-seven style has been the format of all World Series except in 1903, 1919, 1920, 1921, when the winner was determined through a best-of-nine playoff.
Although the large majority of contests have been played entirely during the month of October, a small number of Series have also had games played during September and November.
The Series-winning team is awa

With ten unranked documents and roughly 40,000 characters of context, the model repeats the retrieved text instead of answering the question. Returning fewer, better-ordered results, as in the ranking step that follows, keeps the context focused.

### 2. Ranking Results

Ranking involves scoring retrieved documents based on their relevance to the user's query. 

Techniques like **BM25** or neural network-based ranking can help ensure the best content is selected for augmentation. The ranking process is crucial for ensuring that the generative model has the most relevant and informative content to work with.



**AWS Service**: **Amazon OpenSearch Service** (the successor to Amazon Elasticsearch Service) can be used for indexing and ranking documents using techniques like BM25. You can also use **Amazon SageMaker** for building custom ranking models.
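
For readers who have not met BM25, the following is a minimal sketch of the scoring formula in plain Python. It is illustrative only and is not used by the demo, which ranks by vector distance: the `bm25_scores` function, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions, and the IDF term uses a common variant that adds 1 inside the logarithm so it stays non-negative.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(
    query: str, documents: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates via k1 and is normalised by document length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "The Dodgers won the 2024 World Series in five games.",
    "The World Series is the annual championship series of Major League Baseball.",
    "Redis supports vector similarity search.",
]
scores = bm25_scores("Who won the 2024 World Series?", docs)
ranked = sorted(zip(scores, docs), reverse=True)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

Rare query terms such as "won" and "2024" carry more weight than terms that appear in most documents, which is what pushes the first document to the top.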

In [11]:
def redis_vector_search_with_ranking(r, index_name: str, query_text: str, top_k: int = 10) -> list:
    """
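    1. Generate an embedding for the query text.
    2. Use Redis vector similarity search (KNN) to find the top_k matches in the given index.
    3. Sort the matches by vector_score (ascending, closest match first) and return them with their scores.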
    """
    # 1. Generate a query embedding
    query_embedding = get_embedding(query_text)  # Must return list[float]
    
    # Convert embedding to float32 bytes
    query_embedding_bytes = array.array('f', query_embedding).tobytes()
    
    # 2. Build the vector search query
    #    - [KNN <top_k> @embedding $BLOB AS vector_score]
    redis_query = f'*=>[KNN {top_k} @embedding $BLOB AS vector_score]'
    
    # 3. Execute the search against our index named index_name
    #    Sort by vector_score ascending (closest match first),
    #    and return "page_content" plus "vector_score".
    search_result = r.ft(index_name).search(
        Query(redis_query)
        .sort_by("vector_score")
        .return_fields("page_content", "vector_score")
        .dialect(2),
        query_params={"BLOB": query_embedding_bytes}
    )
    
    # 4. Format the results
    results = []
    for doc in search_result.docs:
        results.append({
            "score": doc.vector_score,
            "page_content": doc.page_content
        })
    return results
# Search the index, keeping only the three closest matches
search_results = redis_vector_search_with_ranking(r, INDEX_NAME, question, top_k=3)

print("\n=== Search Results ===")
for idx, res in enumerate(search_results):
    print(f"Result {idx+1}: (score={res['score']})")

# Aggregate the results
search_context = "\n\n".join([res["page_content"] for res in search_results])

# Pass the concatenated context to our QA pipeline
with dspy.context(lm=lm):
    inference = answer(
        question=question,
        context=search_context
    )

# Print the final answer
print(f"Context Length: {len(search_context)}")
print(f"=== LLM Inference === \n{inference.answer}")


=== Search Results ===
Result 1: (score=0.37796831131)
Result 2: (score=0.37796831131)
Result 3: (score=0.37796831131)
Context Length: 12004
=== LLM Inference === 
The 2024 World Series was the championship series of Major League Baseball's (MLB) 2024 season. The 120th edition of the World Series, it was a best-of-seven playoff between the National League (NL) champion Los Angeles Dodgers and the American League (AL) champion New York Yankees. It was the Dodgers' first World Series appearance and win since 2020, and the Yankees' first World Series appearance since 2009.

The series began on October 25 and ended on October 30 with the Dodgers winning in five games. Freddie Freeman was named the MVP of the series, tying a World Series record with 12 runs batted in (RBIs) while hitting home runs in the first four games of the series, including the first walk-off grand slam in World Series history in Game 1.

The Dodgers and Yankees entered the 2024 MLB postseason as the top seeds in thei
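
The three identical scores in the output above suggest that the same document was returned more than once. One possible way to keep the context compact is to skip duplicates and cap the total length before joining the results. The sketch below is an assumption-laden illustration: `build_context`, the `max_chars` budget, and the sample results are hypothetical and not part of the original notebook.

```python
def build_context(results: list[dict], max_chars: int = 8000) -> str:
    """Join ranked results into one context string, skipping duplicates
    and stopping before the character budget is exceeded."""
    seen = set()
    parts = []
    used = 0  # characters of document text used so far (separators not counted)
    for res in results:  # assumed to be ordered best match first
        text = res["page_content"]
        if text in seen:
            continue  # identical document already included
        if used + len(text) > max_chars:
            break  # adding this document would exceed the budget
        seen.add(text)
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)


# Hypothetical ranked results, shaped like the dictionaries returned by the search function.
ranked_results = [
    {"score": "0.1", "page_content": "Dodgers win the 2024 World Series."},
    {"score": "0.1", "page_content": "Dodgers win the 2024 World Series."},
    {"score": "0.4", "page_content": "History of the World Series."},
]
context = build_context(ranked_results, max_chars=100)
print(context)
```

Because the results are assumed to arrive best match first, the budget cuts off the least relevant documents.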

### 3. Cache/Semantic Cache

A **local cache** can significantly improve response time for frequent queries. Storing previously retrieved results in a cache reduces the need for repeated retrievals, which speeds up response generation. A **semantic cache** goes a step further: it compares the embedding of a new query with the embeddings of cached queries, so a rephrased question can reuse an earlier answer when the two are close enough.

**AWS Service**: **Amazon ElastiCache** can be used to set up an in-memory data store (using Redis or Memcached) to cache frequently accessed data.
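
The mechanism behind a semantic cache can be sketched without any external service. In the toy version below, `ToySemanticCache` is a hypothetical class, and `embed` is a bag-of-words stand-in that only captures shared vocabulary; a real semantic cache would use an embedding model so that paraphrases with different wording also match.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (NOT a real semantic model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


class ToySemanticCache:
    """Return a cached response when a new prompt is close enough to a stored one."""

    def __init__(self, distance_threshold: float = 0.1):
        self.distance_threshold = distance_threshold
        self.entries: list[tuple[Counter, str]] = []

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def check(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        # Find the stored prompt closest to the new one, if any.
        best = min(self.entries, key=lambda e: cosine_distance(query, e[0]), default=None)
        if best is not None and cosine_distance(query, best[0]) <= self.distance_threshold:
            return best[1]
        return None


cache = ToySemanticCache(distance_threshold=0.1)
cache.store("Which team won the 2024 World Series?", "The Los Angeles Dodgers")
hit = cache.check("Which team won the World Series in 2024?")
miss = cache.check("Who won the Super Bowl in 2024?")
print(hit, miss)
```

The distance threshold controls the trade-off: a looser threshold yields more cache hits but raises the risk of returning an answer to a different question.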


In [12]:
# https://redis.io/docs/latest/integrate/redisvl/user-guide/semantic-caching/
from redisvl.extensions.llmcache import SemanticCache

llmcache = SemanticCache(
    name=f"idx_{uuid.uuid4().hex}",      # underlying search index name
    prefix="llmcache",                   # redis key prefix for hash entries
    redis_url="redis://localhost:6379",  # redis connection url string
    distance_threshold=0.1               # semantic cache distance threshold
)



In [13]:
with dspy.context(lm=lm):

    inference = answer(
        question=question,
        context=simple_context[0].page_content
    )

print(inference.answer)

# Cache the question, answer, and arbitrary metadata
llmcache.store(
    prompt=question,
    response=inference.answer,
    metadata={"context": simple_context[0].page_content}
)

The Los Angeles Dodgers won the 2024 MLB World Series.


'llmcache:446550d10775e88856ac5ff9bf067af5a44aae998b191140b57598fe164859d1'

In [14]:
questions_list = [
    "Which team won the 2024 MLB World Series?",
    "Who emerged victorious in the 2024 MLB World Series?",
    "What team took home the 2024 World Series championship?",
    "Who won the Superbowl in 2024?",
]

for q in questions_list:
    cached_response = llmcache.check(prompt=q)
    if cached_response:
        print(f"[Cache Hit] Q: '{q}'\nAnswer: Yes\n")
    else:
        print(f"[Cache Miss] No match in cache for Q: '{q}'\n")

[Cache Hit] Q: 'Which team won the 2024 MLB World Series?'
Answer: Yes

[Cache Hit] Q: 'Who emerged victorious in the 2024 MLB World Series?'
Answer: Yes

[Cache Hit] Q: 'What team took home the 2024 World Series championship?'
Answer: Yes

[Cache Miss] No match in cache for Q: 'Who won the Superbowl in 2024?'

