# Testing Metadata Filtering in a RAG Workflow

This notebook demonstrates how to implement and test metadata filtering in a Retrieval-Augmented Generation (RAG) workflow based on the Multi-Meta-RAG approach. The approach asks an LLM to extract metadata (such as article source and date) from the query and uses it to filter the database before retrieval, which its authors report improves results on multi-hop queries.

## 1. Setup and Installation

First, let's install the necessary dependencies:

In [None]:
# Install required packages
!pip install langchain langchain-community langchain-anthropic sentence-transformers retry
!pip install openai neo4j python-dotenv chromadb

In [None]:
import os
import json
import ast
from typing import List, Dict, Any, Optional
from dotenv import load_dotenv
from datetime import datetime

# LangChain imports
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_anthropic import ChatAnthropic
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from openai import OpenAI

# Load environment variables
load_dotenv()

# Set up OpenAI API (for metadata extraction)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## 2. Sample Document Corpus

Let's create a small sample corpus with metadata to test our approach. In a real application, you might load documents from files or a database.

In [None]:
# Create a small sample corpus with metadata
sample_corpus = [
    {
        "title": "MacBook Air Discount",
        "content": "Apple's 13.6-inch MacBook Air with the M2 chip is currently discounted by $150, bringing the price down to $1,049. This is one of the best deals we've seen for this model in recent months.",
        "source": "Engadget",
        "published_at": "March 10, 2025"
    },
    {
        "title": "Galaxy Buds Sale",
        "content": "Samsung's Galaxy Buds 2 are now on sale for just $99, a $50 discount from their regular price. These wireless earbuds offer excellent noise cancellation and sound quality.",
        "source": "The Verge",
        "published_at": "March 15, 2025"
    },
    {
        "title": "Tech Discounts Roundup",
        "content": "This week's best tech deals include discounts on Apple MacBooks, Samsung earbuds, and various smart home devices. Retailers are offering these discounts ahead of upcoming product refreshes.",
        "source": "TechCrunch",
        "published_at": "March 12, 2025"
    },
    {
        "title": "New Pixel Phone Launch",
        "content": "Google has announced the launch date for its next Pixel phone. The device is expected to feature significant camera improvements and a new custom processor.",
        "source": "Wired",
        "published_at": "March 5, 2025"
    },
    {
        "title": "Streaming Service Price Increase",
        "content": "Netflix has announced a price increase for all subscription tiers starting next month. The basic plan will increase by $2, while premium subscriptions will see a $3 increase.",
        "source": "The New York Times",
        "published_at": "March 8, 2025"
    }
]

# Convert the sample corpus to LangChain documents
documents = []
for item in sample_corpus:
    doc = Document(
        page_content=f"{item['title']}\n\n{item['content']}",
        metadata={
            "title": item["title"],
            "source": item["source"],
            "published_at": item["published_at"],
        }
    )
    documents.append(doc)

# Display the documents
for i, doc in enumerate(documents):
    print(f"Document {i+1}:")
    print(f"Content: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
    print()
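
The `published_at` values above are free-form strings, which makes ordering questions ("before", "after") and month-level filters awkward. Below is a minimal sketch, assuming every date follows the `Month D, YYYY` pattern used in this corpus, of deriving sortable fields that could be stored next to the original metadata. The field names `published_date` and `published_month` are hypothetical and not part of the Multi-Meta-RAG setup.

```python
from datetime import datetime


def normalize_published_at(raw: str) -> dict:
    """Derive sortable date fields from a 'Month D, YYYY' string."""
    # Assumes English month names (the default C locale for %B).
    parsed = datetime.strptime(raw, "%B %d, %Y")
    return {
        "published_at": raw,                              # original value, unchanged
        "published_date": parsed.strftime("%Y-%m-%d"),    # ISO date, sorts lexicographically
        "published_month": parsed.strftime("%B %Y"),      # e.g. a month-level label
    }


engadget = normalize_published_at("March 10, 2025")
verge = normalize_published_at("March 15, 2025")

# ISO-formatted strings can be compared directly to decide which article came first.
print(engadget["published_date"] < verge["published_date"])
```

Storing these extra fields is optional; the rest of the notebook keeps using the original `published_at` string.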

## 3. Chunk Documents and Create Vector Store

Now we'll split the documents into chunks and create a vector store with the documents and their metadata.

In [None]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=32,
    length_function=len,
)

# Split the documents and preserve metadata
chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} documents.")
print(f"\nSample chunk: {chunks[0].page_content}")
print(f"Sample chunk metadata: {chunks[0].metadata}")
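
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the sketch only illustrates the overlap arithmetic and the fact that each chunk carries a copy of its parent's metadata. The function name `chunk_with_metadata` is an assumption made for illustration.

```python
from typing import Any, Dict, List


def chunk_with_metadata(
    text: str,
    metadata: Dict[str, Any],
    chunk_size: int = 256,
    chunk_overlap: int = 32,
) -> List[Dict[str, Any]]:
    """Split text into overlapping fixed-size windows, copying metadata to each."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "page_content": text[start:start + chunk_size],
            "metadata": dict(metadata),  # copy, so chunks do not share one dict
        })
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


demo_text = "".join(chr(97 + i % 26) for i in range(600))
demo_chunks = chunk_with_metadata(demo_text, {"source": "Engadget"})
print(len(demo_chunks), [len(c["page_content"]) for c in demo_chunks])
```

Because every chunk keeps the `source` and `published_at` of its parent document, a metadata filter applied at query time works at chunk granularity.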

In [None]:
# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",  # Using a smaller model for demo purposes
    model_kwargs={"device": "cpu"}
)

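# Create a vector store with the chunks
# Note: re-running this cell adds the chunks again to the collection persisted in
# ./chroma_db; delete that directory first if you want a clean run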
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_metadata={"hnsw:space": "cosine"}
)

print("Vector store created successfully!")
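
The `"hnsw:space": "cosine"` setting tells Chroma to rank chunks by cosine distance, that is, one minus the cosine similarity of the two embedding vectors. A small pure-Python sketch of that quantity, using toy 2-D vectors rather than real embeddings:

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Cosine distance ignores vector length, so only the direction of the embedding matters for ranking.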

## 4. Metadata Extraction Using LLM

Now we'll implement the metadata extraction function using an LLM (in this case, OpenAI's GPT model). This function will extract relevant metadata from queries to use for filtering.

In [None]:
# Template for metadata extraction
EXTRACT_FILTER_TEMPLATE = """Given the question, extract the metadata to filter the database about article sources and dates.
The sources can only be from the list: ['Engadget', 'The Verge', 'TechCrunch', 'Wired', 'The New York Times']

Examples to follow:

Question: Who is the individual associated with the cryptocurrency industry facing a criminal trial on fraud and conspiracy charges, as reported by both The Verge and TechCrunch?
Answer: {'source': {'$in': ['The Verge', 'TechCrunch']}}

Question: Did Engadget report a discount on the MacBook Air before TechCrunch published an article about tech discounts?
Answer: {'source': {'$in': ['Engadget', 'TechCrunch']}}

Question: What did The New York Times report about subscription price increases in March 2025?
Answer: {'source': {'$in': ['The New York Times']}, 'published_at': {'$in': ['March 2025']}}

Now it is your turn:

Question: {query}
Answer:
"""

def extract_metadata_filter(query: str) -> Dict[str, Any]:
    """Extract metadata filter from a query using OpenAI's GPT model."""
    try:
        # Call the OpenAI API
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
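            # str.format would fail on the literal braces in the prompt examples,
            # so substitute the {query} placeholder directly
            messages=[{"role": "user", "content": EXTRACT_FILTER_TEMPLATE.replace("{query}", query)}],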
            temperature=0.1,
        )
        
        # Extract the filter from the response
        filter_str = (completion.choices[0].message.content or "").strip()
        
        # Convert string to dictionary
        filter_dict = ast.literal_eval(filter_str)
        
        print(f"Extracted filter: {filter_dict}")
        return filter_dict
    except Exception as e:
        print(f"Error extracting metadata filter: {e}")
        # Return empty filter if extraction fails
        return {}

# Test the metadata extraction function with a sample query
sample_query = "Did Engadget report a discount on the MacBook Air before The Verge published an article about Samsung Galaxy Buds?"
metadata_filter = extract_metadata_filter(sample_query)
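
LLM output is not guaranteed to be a bare dictionary literal: it may repeat the `Answer:` prefix, wrap the result in a code fence, or name a source outside the allowed list. The following is a minimal sketch of a more defensive parser, assuming the same filter shape as in the prompt examples. The helper name `parse_filter` and the decision to silently drop unknown sources are illustrative choices, not part of the original approach.

```python
import ast
from typing import Any, Dict

ALLOWED_SOURCES = {"Engadget", "The Verge", "TechCrunch", "Wired", "The New York Times"}


def parse_filter(raw: str) -> Dict[str, Any]:
    """Extract and validate a {'field': {'$in': [...]}} filter from raw LLM text."""
    # Keep only the outermost {...} span, ignoring any surrounding text.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = ast.literal_eval(raw[start:end + 1])
    except (ValueError, SyntaxError, TypeError):
        return {}
    if not isinstance(parsed, dict):
        return {}

    cleaned: Dict[str, Any] = {}
    for field in ("source", "published_at"):
        condition = parsed.get(field)
        if not (isinstance(condition, dict) and isinstance(condition.get("$in"), list)):
            continue
        values = [v for v in condition["$in"] if isinstance(v, str)]
        if field == "source":
            # Drop sources the corpus does not contain.
            values = [v for v in values if v in ALLOWED_SOURCES]
        if values:
            cleaned[field] = {"$in": values}
    return cleaned


print(parse_filter("Answer: {'source': {'$in': ['Engadget', 'Gizmodo']}}"))
print(parse_filter("I could not find a filter."))
```

An empty result signals that no usable filter was found, which matches the fallback behaviour of `extract_metadata_filter` above.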

## 5. Implement Retrieval with Metadata Filtering

Now let's implement the retrieval function that uses both semantic search and metadata filtering.

In [None]:
def retrieve_with_metadata_filter(query: str, metadata_filter: Dict[str, Any], k: int = 3) -> List[Document]:
    """Retrieve documents using both semantic search and metadata filtering."""
    try:
        # Convert the filter to the format expected by Chroma:
        # collect one condition per supported metadata field
        conditions = []
        for field in ("source", "published_at"):
            values = metadata_filter.get(field, {}).get("$in")
            if values:
                conditions.append({field: {"$in": values}})

        # Chroma accepts a single top-level operator, so several conditions go under $and
        if len(conditions) > 1:
            chroma_filter = {"$and": conditions}
        elif conditions:
            chroma_filter = conditions[0]
        else:
            chroma_filter = {}

        # Note: $in compares stored values exactly, so 'March 2025' does not match 'March 8, 2025'
        print(f"Chroma filter: {chroma_filter}")
        
        # Retrieve documents with filter
        docs = vectordb.similarity_search(
            query=query,
            k=k,
            filter=chroma_filter if chroma_filter else None
        )
        
        return docs
    except Exception as e:
        print(f"Error retrieving documents: {e}")
        return []

# Test retrieval with metadata filtering
filtered_docs = retrieve_with_metadata_filter(sample_query, metadata_filter)

print(f"\nRetrieved {len(filtered_docs)} documents with metadata filtering:")
for i, doc in enumerate(filtered_docs):
    print(f"\nDocument {i+1}:")
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
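
One practical wrinkle with date filters: an `$in` condition compares values exactly, so a month-level value such as `'March 2025'` extracted from the query will not match a stored value such as `'March 8, 2025'`. A possible workaround, sketched below under the assumption that the set of stored `published_at` strings is known up front, is to expand a partial date into the stored values that contain all of its tokens. The helper `expand_date_values` is hypothetical and not part of the original approach.

```python
from typing import Iterable, List


def _tokens(value: str) -> List[str]:
    """Split a date string into tokens, treating commas as separators."""
    return value.replace(",", " ").split()


def expand_date_values(requested: Iterable[str], stored: Iterable[str]) -> List[str]:
    """Return the stored dates that contain every token of some requested date."""
    wanted = [set(_tokens(r)) for r in requested if r.strip()]
    expanded = []
    for candidate in stored:
        candidate_tokens = set(_tokens(candidate))
        # Whole-token comparison: 'March 5' must not match 'March 15, 2025'.
        if any(w <= candidate_tokens for w in wanted):
            expanded.append(candidate)
    return expanded


stored_dates = ["March 10, 2025", "March 15, 2025", "April 1, 2025", "March 5, 2025"]
print(expand_date_values(["March 2025"], stored_dates))
print(expand_date_values(["March 5"], stored_dates))
```

The expanded list can then replace the original values in the `published_at` condition before the filter is passed to `retrieve_with_metadata_filter`.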

## 6. Compare with Regular Retrieval (Without Metadata Filtering)

Let's compare the results with and without metadata filtering to see the difference.

In [None]:
# Retrieve documents without metadata filtering
regular_docs = vectordb.similarity_search(query=sample_query, k=3)

print(f"Retrieved {len(regular_docs)} documents without metadata filtering:")
for i, doc in enumerate(regular_docs):
    print(f"\nDocument {i+1}:")
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")

# Compare the results
print("\n--- Comparison ---")
print("Documents retrieved with metadata filtering:")
filtered_sources = [doc.metadata["source"] for doc in filtered_docs]
print(filtered_sources)

print("\nDocuments retrieved without metadata filtering:")
regular_sources = [doc.metadata["source"] for doc in regular_docs]
print(regular_sources)
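
Printing the two source lists gives a qualitative comparison. For a rough number, one option is to measure how many retrieved chunks come from the sources the query names and how many of those sources are represented at all. The sketch below uses made-up source lists as input; the function name `source_coverage` and the two metrics are illustrative choices, not an evaluation protocol from the Multi-Meta-RAG paper.

```python
from typing import Dict, List


def source_coverage(retrieved_sources: List[str], expected_sources: List[str]) -> Dict[str, float]:
    """Score a retrieval result against the sources named in the query."""
    expected = set(expected_sources)
    on_target = sum(1 for source in retrieved_sources if source in expected)
    covered = expected & set(retrieved_sources)
    return {
        # Share of retrieved chunks that come from an expected source.
        "precision": on_target / len(retrieved_sources) if retrieved_sources else 0.0,
        # Share of expected sources that appear at least once.
        "recall": len(covered) / len(expected) if expected else 0.0,
    }


# Hypothetical inputs for illustration only, not actual notebook output.
example_expected = ["Engadget", "The Verge"]
print(source_coverage(["Engadget", "The Verge", "Engadget"], example_expected))
print(source_coverage(["Engadget", "TechCrunch", "The Verge"], example_expected))
```

In the notebook, `filtered_sources` and `regular_sources` can be passed as the first argument, with the expected sources taken from `metadata_filter["source"]["$in"]` when that key is present.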

## 7. End-to-End RAG with Metadata Filtering

Now let's implement a complete RAG pipeline that uses metadata filtering and compare it with a standard RAG pipeline.

In [None]:
# Initialize the LLM
llm = ChatAnthropic(model="claude-3-sonnet-20240229", temperature=0)

# Define the RAG prompt
rag_prompt_template = """Answer the following question based on the provided context. If the context doesn't contain the necessary information, state that you don't know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt = PromptTemplate(
    template=rag_prompt_template,
    input_variables=["context", "question"]
)

In [None]:
def run_rag(query: str, use_metadata_filtering: bool = False) -> str:
    """Run the RAG pipeline with or without metadata filtering."""
    # Step 1: Retrieve relevant documents
    if use_metadata_filtering:
        # Extract metadata filter
        metadata_filter = extract_metadata_filter(query)
        # Retrieve with metadata filtering
        retrieved_docs = retrieve_with_metadata_filter(query, metadata_filter)
    else:
        # Regular retrieval without filtering
        retrieved_docs = vectordb.similarity_search(query=query, k=3)
    
    # Step 2: Prepare the context
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])
    
    # Step 3: Generate the answer using the LLM
    chain = LLMChain(llm=llm, prompt=rag_prompt)
    response = chain.run(context=context, question=query)
    
    return response
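
Note that the context built in `run_rag` contains only `page_content`, so the LLM never sees the source or publication date of a chunk, even though questions such as "did X report ... before Y?" depend on them. A minimal sketch of a context formatter that prefixes each chunk with its metadata follows. The `Chunk` dataclass is a hypothetical stand-in for LangChain's `Document` so that the example runs on its own; the function only relies on the `page_content` and `metadata` attributes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_context(docs: List[Any]) -> str:
    """Join chunks into one context string, labelling each with source and date."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        published = doc.metadata.get("published_at", "unknown date")
        blocks.append(f"[{source}, {published}]\n{doc.page_content}")
    return "\n\n".join(blocks)


demo_docs = [
    Chunk("MacBook Air Discount ...", {"source": "Engadget", "published_at": "March 10, 2025"}),
    Chunk("Galaxy Buds Sale ...", {"source": "The Verge", "published_at": "March 15, 2025"}),
]
print(format_context(demo_docs))
```

Inside `run_rag`, `format_context(retrieved_docs)` could replace the plain join in Step 2.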

In [None]:
# Test queries for comparison
test_query = "Did Engadget report a discount on the MacBook Air before The Verge published an article about Samsung Galaxy Buds?"

# Run the regular RAG pipeline
print("Running regular RAG (without metadata filtering)...")
regular_rag_response = run_rag(test_query, use_metadata_filtering=False)

# Run the RAG pipeline with metadata filtering
print("\nRunning RAG with metadata filtering...")
filtered_rag_response = run_rag(test_query, use_metadata_filtering=True)

# Compare the responses
print("\n--- Regular RAG Response ---")
print(regular_rag_response)

print("\n--- Metadata-Filtered RAG Response ---")
print(filtered_rag_response)

## 8. Experiment with Different Queries

Let's test a few more queries to see how metadata filtering helps with different types of questions.

In [None]:
test_queries = [
    "What did Engadget and The Verge report about tech discounts recently?",
    "Did The New York Times discuss subscription price increases in March 2025?",
    "What products were on sale according to TechCrunch's discount roundup?"
]

for i, query in enumerate(test_queries):
    print(f"\n--- Query {i+1}: {query} ---")
    
    # With metadata filtering
    print("\nMetadata-Filtered RAG Response:")
    filtered_response = run_rag(query, use_metadata_filtering=True)
    print(filtered_response)
    
    # Without metadata filtering
    print("\nRegular RAG Response:")
    regular_response = run_rag(query, use_metadata_filtering=False)
    print(regular_response)
    
    print("\n" + "-"*80)
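
When running a wider range of queries, some extracted filters will match nothing, for example because the model names a date that is not stored in that exact form. One way to keep the pipeline useful in that case is to fall back to unfiltered retrieval. The sketch below takes the search function as a parameter so that it runs without a vector store; the name `retrieve_with_fallback` and the returned flag are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# A search function takes (query, k, filter) and returns a list of documents.
SearchFn = Callable[[str, int, Optional[Dict[str, Any]]], List[Any]]


def retrieve_with_fallback(
    query: str,
    metadata_filter: Dict[str, Any],
    search: SearchFn,
    k: int = 3,
) -> Tuple[List[Any], bool]:
    """Try filtered retrieval first; fall back to unfiltered if it finds nothing.

    Returns the documents and a flag saying whether the filter was applied.
    """
    if metadata_filter:
        hits = search(query, k, metadata_filter)
        if hits:
            return hits, True
    return search(query, k, None), False


def demo_search(query: str, k: int, flt: Optional[Dict[str, Any]]) -> List[str]:
    """Toy search used only for this demonstration."""
    if flt == {"source": {"$in": ["Nowhere"]}}:
        return []
    return [f"doc-{i}" for i in range(k)]


print(retrieve_with_fallback("q", {"source": {"$in": ["Nowhere"]}}, demo_search, k=2))
```

With the notebook's vector store, the search function could be `lambda q, k, f: vectordb.similarity_search(query=q, k=k, filter=f)`.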

## 9. Conclusion

In this notebook, we've demonstrated how to implement and test metadata filtering in a RAG workflow. The key takeaways are:

1. Metadata filtering restricts retrieval to the sources a multi-hop query names, so chunks from unrelated sources cannot crowd out the relevant ones.
2. LLM-based metadata extraction can automatically identify the relevant sources and dates from the query.
3. Combining semantic search with metadata filtering can return more relevant chunks than semantic search alone when the query names its sources.
4. This approach is particularly useful for questions that reference specific sources or time periods.

This implementation is a simplified version of the Multi-Meta-RAG approach described in the paper. For production use, consider a more full-featured database such as Neo4j (as used in the original implementation), and tune the metadata extraction prompt for your specific use case.