# The Fact-Checked Analyst: Agentic RAG Workflow with Semantic Caching

## Project Overview
I built an agentic workflow with LangGraph that combines RAG techniques with semantic caching and embedding reuse.  
The system orchestrates multiple specialized agents responsible for retrieval, fact-checking, and generating responses, ensuring efficient and reliable analysis.

---

## Core Innovation: Semantic Cache Layer
The semantic cache detects when similar questions are asked, even with different phrasing.  
This reduces redundant processing and speeds up response times while keeping answers consistent.
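
To make the matching step concrete, here is a minimal sketch of a threshold-based lookup using cosine similarity. The helper names (`cosine_sim`, `best_cached_match`) and the three-dimensional toy vectors are illustrative assumptions. Real embeddings have far more dimensions, and the 0.92 threshold simply mirrors the value used when the cache is created later in this notebook.

```python
import math
from typing import List, Optional, Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_cached_match(query_vec: Sequence[float],
                      cached_vecs: List[Sequence[float]],
                      threshold: float = 0.92) -> Optional[int]:
    """Return the index of the most similar cached vector, or None on a miss."""
    best_idx, best_score = None, -1.0
    for idx, vec in enumerate(cached_vecs):
        score = cosine_sim(query_vec, vec)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score >= threshold else None


# Toy vectors standing in for the embeddings of two cached questions
cached = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(best_cached_match([0.99, 0.05, 0.0], cached))  # → 0 (close to the first entry)
print(best_cached_match([0.6, 0.6, 0.5], cached))    # → None (not close to either)
```

A higher threshold reduces the chance of returning an answer to a different question, at the cost of fewer cache hits.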

---

## Technical Architecture Highlights
- Multi-agent workflow in LangGraph managing retrieval, writing, and verification  
- Recursive, overlapping chunking (1,000-character chunks with a 200-character overlap) to maintain context across document sections  
- Embedding-based caching with similarity thresholds for prompt reuse  
- Token-aware processing for cost and performance optimization  
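
One way to read "token-aware processing" is as a budget on how much retrieved text is passed to the model. The sketch below is a simplified illustration, assuming each chunk arrives with a precomputed token count (for example from `tiktoken`) and that chunks are already ranked by relevance. The function name `select_chunks_within_budget` and the budget value are hypothetical, not the project's actual implementation.

```python
from typing import List, Tuple


def select_chunks_within_budget(ranked_chunks: List[Tuple[str, int]],
                                max_context_tokens: int) -> List[str]:
    """Greedily keep chunks in rank order, skipping any that would exceed the budget."""
    selected: List[str] = []
    used_tokens = 0
    for text, token_count in ranked_chunks:
        if used_tokens + token_count > max_context_tokens:
            continue  # this chunk does not fit; a smaller one further down might
        selected.append(text)
        used_tokens += token_count
    return selected


# (text, token_count) pairs, most relevant first; values are placeholders
ranked = [("chunk A", 300), ("chunk B", 500), ("chunk C", 250)]
print(select_chunks_within_budget(ranked, max_context_tokens=600))
# → ['chunk A', 'chunk C']
```

Skipping an oversized chunk instead of stopping at it favors filling the budget over strict rank order; stopping at the first chunk that does not fit would be an equally reasonable policy.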

---

## Real-World Application
Applied to Apple's 10-K report, the system extracts key insights on risks, market trends, and strategy.  
It scales across document collections while maintaining fast response times for cached queries.


In [None]:
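import os
import getpass
import warnings
warnings.filterwarnings('ignore')

# API Configuration: use the key from the environment, or prompt for it, instead of hardcoding it
if not os.environ.get('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')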

print("✅ Environment configured successfully")

✅ Environment configured successfully


In [None]:
# Install Required Dependencies
!pip install langgraph==0.2.34 langchain==0.3.7 langchain-openai==0.2.9 langchain-community==0.3.7 pypdf==5.1.0 pandas==2.2.3 numpy==1.26.4 scikit-learn==1.5.2 tiktoken==0.8.0



# Environment Setup & Data Acquisition

## Configuring API Access and Document Retrieval
The system uses OpenAI's embedding and language models for vector search and natural language generation.  
Apple’s 10-K report is read from a local PDF file, so the first step is to confirm that the file is present before processing.


In [None]:
import os
import warnings
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, TypedDict
import json
import tiktoken

warnings.filterwarnings('ignore')

# API Configuration (optional, can be removed if not needed for subsequent steps)
# os.environ['OPENAI_API_KEY'] = 'sk-...'
# Note: It's best practice not to hardcode API keys.

print("✅ Environment configured successfully")

# Define the local PDF file path
file_path = "apple_2024_k-10_report.pdf"
print(f"\n📄 Checking for local PDF file: {file_path}")

# Verify the file exists in the environment
if os.path.exists(file_path):
    # Calculate and print file size for confirmation
    file_size = os.path.getsize(file_path) / (1024 * 1024)
    print(f"✅ {file_path} found ({file_size:.1f} MB). Ready to be loaded.")
else:
    print(f"⚠️  {file_path} not found.")
    print("Please make sure the file is in the same directory as your script or provide the full path.")

✅ Environment configured successfully

📄 Checking for local PDF file: apple_2024_k-10_report.pdf
⚠️  apple_2024_k-10_report.pdf not found.
Please make sure the file is in the same directory as your script or provide the full path.


# Part 1: Document Processing & Vector Database Creation

## Building the Knowledge Base
The foundation of the RAG system is a vector database built by processing the PDF, splitting it into overlapping chunks along natural boundaries (paragraphs, lines, sentences), generating an embedding for each chunk, and storing the results in a CSV file.  
This setup supports retrieval over the stored embeddings and keeps every chunk, embedding, and token count easy to inspect.
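
Because CSV cells hold only text, each embedding has to be written as a string and parsed back into a list of floats when it is read. The following sketch shows that round trip with the standard library only. The column names match the ones used in this notebook, while the chunk text and the three-element vector are placeholders.

```python
import ast
import csv
import io

# One placeholder row; a real embedding would have many more dimensions
rows = [{
    "chunk_text": "example chunk text",
    "chunk_embedding": str([0.12, -0.5, 0.33]),  # list serialized as a string
    "token_count": 3,
}]

# Write to an in-memory buffer instead of a file to keep the example self-contained
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["chunk_text", "chunk_embedding", "token_count"])
writer.writeheader()
writer.writerows(rows)

# Read the rows back and restore the original types
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    restored.append({
        "chunk_text": row["chunk_text"],
        "chunk_embedding": ast.literal_eval(row["chunk_embedding"]),  # string back to list
        "token_count": int(row["token_count"]),
    })

print(restored[0]["chunk_embedding"])  # → [0.12, -0.5, 0.33]
```

Parsing every embedding on each read becomes slow as the table grows, so a binary format or a dedicated vector store may be worth considering for larger collections.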


In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import pandas as pd
import numpy as np
import tiktoken
import ast

def setup_database(pdf_path: str = "Apple_K10.pdf",
                  chunk_size: int = 1000,
                  chunk_overlap: int = 200) -> pd.DataFrame:
    """
    Processes a PDF document into a searchable vector database.

    This function performs recursive, overlapping chunking, embedding generation, and token counting
    to create a knowledge base stored in CSV format for transparency.

    Args:
        pdf_path: Path to the PDF document
        chunk_size: Target size for text chunks in characters
        chunk_overlap: Overlap between consecutive chunks to preserve context

    Returns:
        DataFrame containing processed chunks with embeddings and metadata
    """

    print(f"\n{'='*60}")
    print("🔧 INITIALIZING DOCUMENT PROCESSING PIPELINE")
    print(f"{'='*60}\n")

    # Load the PDF document
    print("📄 Loading PDF document...")
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    print(f"✅ Loaded {len(documents)} pages")

    # Initialize the text splitter; separators are tried in order, from paragraph breaks down to single characters
    print("\n✂️ Splitting document into semantic chunks...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ".", " ", ""]  # Hierarchical splitting
    )

    # Split documents into chunks
    chunks = text_splitter.split_documents(documents)
    print(f"✅ Created {len(chunks)} text chunks")

    # Initialize embedding model
    print("\n🧠 Initializing embedding model...")
    embeddings_model = OpenAIEmbeddings(
        model="text-embedding-3-small",
        dimensions=1536
    )

    # Initialize tokenizer for accurate token counting
    encoding = tiktoken.encoding_for_model("gpt-4o-mini")

    # Process each chunk
    print("\n⚡ Processing chunks (embedding + token counting)...")
    chunk_data = []

    for i, chunk in enumerate(chunks):
        if i % 50 == 0:
            print(f"   Processing chunk {i}/{len(chunks)}...")

        # Extract text
        chunk_text = chunk.page_content.strip()

        # Generate embedding
        chunk_embedding = embeddings_model.embed_query(chunk_text)

        # Count tokens
        token_count = len(encoding.encode(chunk_text))

        # Store as dictionary
        chunk_data.append({
            'chunk_text': chunk_text,
            'chunk_embedding': str(chunk_embedding),  # Store as string for CSV
            'token_count': token_count
        })

    # Create DataFrame
    df = pd.DataFrame(chunk_data)

    # Save to CSV
    csv_path = 'document_chunks.csv'
    df.to_csv(csv_path, index=False)

    print(f"\n✅ Database created with {len(df)} chunks")
    print(f"📊 Statistics:")
    print(f"   - Average chunk size: {df['token_count'].mean():.1f} tokens")
    print(f"   - Total tokens: {df['token_count'].sum():,}")
    print(f"   - Database saved to: {csv_path}")

    # Display sample entries
    print(f"\n📋 Sample Database Entries:")
    print(f"{'='*60}")
    for i in range(min(3, len(df))):
        print(f"\nChunk {i+1}:")
        print(f"  Text preview: {df.iloc[i]['chunk_text'][:100]}...")
        print(f"  Token count: {df.iloc[i]['token_count']}")
        print(f"  Embedding dims: {len(ast.literal_eval(df.iloc[i]['chunk_embedding']))}")

    return df

# Execute database setup (uses the default pdf_path, "Apple_K10.pdf", not the file_path checked earlier)
document_db = setup_database()


🔧 INITIALIZING DOCUMENT PROCESSING PIPELINE

📄 Loading PDF document...
✅ Loaded 121 pages

✂️ Splitting document into semantic chunks...
✅ Created 543 text chunks

🧠 Initializing embedding model...

⚡ Processing chunks (embedding + token counting)...
   Processing chunk 0/543...
   Processing chunk 50/543...
   Processing chunk 100/543...
   Processing chunk 150/543...
   Processing chunk 200/543...
   Processing chunk 250/543...
   Processing chunk 300/543...
   Processing chunk 350/543...
   Processing chunk 400/543...
   Processing chunk 450/543...
   Processing chunk 500/543...

✅ Database created with 543 chunks
📊 Statistics:
   - Average chunk size: 189.5 tokens
   - Total tokens: 102,889
   - Database saved to: document_chunks.csv

📋 Sample Database Entries:

Chunk 1:
  Text preview: UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☒    AN...
  Token count: 267
  Embedding dims: 1536

Chunk 2:
  Text preview: Title of each class
Tradi
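
The processing loop above sends one embedding request per chunk. A possible refinement, sketched here under stated assumptions, is to embed chunks in batches. The helpers `batched` and `embed_in_batches` are hypothetical names, and `fake_embed_batch` is a stand-in so the example runs without an API key. With the real model, the batch function could be `embeddings_model.embed_documents`, which accepts a list of texts.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts: Sequence[str],
                     embed_batch: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 100) -> List[List[float]]:
    """Embed texts batch by batch, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(list(batch)))
    return vectors


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder: maps each text to a one-element vector of its length."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2))
# → [[1.0], [2.0], [3.0]]
```

Fewer, larger requests usually reduce wall-clock time for a document of this size, although the batch size that works best depends on rate limits and chunk length.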

# Part 2: Intelligent Semantic Caching System

## Building the Cache Layer
The semantic cache is key to the system’s speed and consistency.  
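It uses embedding similarity to detect when a similar question has been asked before, so a cached answer can be returned without repeating retrieval and generation, even if the query is phrased differently. A cache lookup still requires one embedding call for the new query.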


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import ast
import os

class SemanticCache:
    """
    Implements an intelligent caching system based on semantic similarity.

    This cache recognizes when similar questions have been asked before,
    even if phrased differently, and returns cached results instantly.
    """

    def __init__(self, cache_file: str = "prompt_cache.csv",
                 similarity_threshold: float = 0.95):
        """
        Initialize the semantic cache with configurable similarity threshold.

        Args:
            cache_file: Path to the CSV file storing cached prompts and answers
            similarity_threshold: Minimum cosine similarity to trigger cache hit
        """
        self.cache_file = cache_file
        self.similarity_threshold = similarity_threshold
        self.embeddings_model = OpenAIEmbeddings(
            model="text-embedding-3-small",
            dimensions=1536
        )

        # Initialize cache file if it doesn't exist
        if not os.path.exists(self.cache_file):
            empty_cache = pd.DataFrame(columns=['prompt_text', 'prompt_embedding', 'cached_answer'])
            empty_cache.to_csv(self.cache_file, index=False)
            print(f"📦 Initialized new cache at {self.cache_file}")

    def check_cache(self, user_query: str) -> Optional[str]:
        """
        Checks if a similar query exists in the cache.

        Uses cosine similarity between embeddings to find semantically similar queries.

        Args:
            user_query: The new query to check against the cache

        Returns:
            Cached answer if similarity meets or exceeds the threshold, None otherwise
        """

        # Load existing cache
        if not os.path.exists(self.cache_file):
            return None

        cache_df = pd.read_csv(self.cache_file)

        if cache_df.empty:
            return None

        print(f"\n🔍 Checking semantic cache ({len(cache_df)} entries)...")

        # Embed the new query
        query_embedding = self.embeddings_model.embed_query(user_query)
        query_embedding_np = np.array(query_embedding).reshape(1, -1)

        # Calculate similarities with all cached prompts
        max_similarity = 0
        best_match_idx = -1

        for idx, row in cache_df.iterrows():
            cached_embedding = np.array(ast.literal_eval(row['prompt_embedding'])).reshape(1, -1)
            similarity = cosine_similarity(query_embedding_np, cached_embedding)[0][0]

            if similarity > max_similarity:
                max_similarity = similarity
                best_match_idx = idx

        print(f"   Maximum similarity found: {max_similarity:.3f}")

        # Check if we have a cache hit
        if max_similarity >= self.similarity_threshold:
            cached_prompt = cache_df.iloc[best_match_idx]['prompt_text']
            cached_answer = cache_df.iloc[best_match_idx]['cached_answer']

            print(f"\n🎯 CACHE HIT! (Similarity: {max_similarity:.3f})")
            print(f"   Original query: '{cached_prompt[:80]}...'")
            print(f"   Your query:     '{user_query[:80]}...'")

            return cached_answer

        print(f"   No cache hit (threshold: {self.similarity_threshold})")
        return None

    def update_cache(self, user_query: str, answer: str) -> None:
        """
        Adds a new query-answer pair to the cache.

        Args:
            user_query: The original query
            answer: The generated answer to cache
        """

        # Generate embedding for the query
        query_embedding = self.embeddings_model.embed_query(user_query)

        # Load existing cache
        if os.path.exists(self.cache_file):
            cache_df = pd.read_csv(self.cache_file)
        else:
            cache_df = pd.DataFrame(columns=['prompt_text', 'prompt_embedding', 'cached_answer'])

        # Add new entry
        new_entry = pd.DataFrame([{
            'prompt_text': user_query,
            'prompt_embedding': str(query_embedding),
            'cached_answer': answer
        }])

        cache_df = pd.concat([cache_df, new_entry], ignore_index=True)

        # Save updated cache
        cache_df.to_csv(self.cache_file, index=False)

        print(f"\n💾 Cache updated (now {len(cache_df)} entries)")

# Initialize the semantic cache (overrides the default 0.95 threshold with 0.92)
semantic_cache = SemanticCache(similarity_threshold=0.92)
print("✅ Semantic cache system initialized")

📦 Initialized new cache at prompt_cache.csv
✅ Semantic cache system initialized
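
A cache like this is typically consulted before the expensive part of the pipeline and updated after it. The sketch below illustrates that pattern against the `check_cache` and `update_cache` methods defined above. The `answer_with_cache` helper, the `AnswerCache` protocol, and the `ExactMatchCache` stand-in (which matches on exact strings only, so the example runs without an API key) are hypothetical and not part of the original workflow.

```python
from typing import Callable, Dict, Optional, Protocol


class AnswerCache(Protocol):
    """Interface shared by the semantic cache and the stand-in below."""
    def check_cache(self, user_query: str) -> Optional[str]: ...
    def update_cache(self, user_query: str, answer: str) -> None: ...


def answer_with_cache(user_query: str,
                      cache: AnswerCache,
                      generate_answer: Callable[[str], str]) -> str:
    """Return a cached answer when available; otherwise generate and store one."""
    cached = cache.check_cache(user_query)
    if cached is not None:
        return cached
    answer = generate_answer(user_query)
    cache.update_cache(user_query, answer)
    return answer


class ExactMatchCache:
    """Stand-in cache keyed on the exact query string (no embeddings involved)."""
    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def check_cache(self, user_query: str) -> Optional[str]:
        return self._store.get(user_query)

    def update_cache(self, user_query: str, answer: str) -> None:
        self._store[user_query] = answer


demo_cache = ExactMatchCache()
generator_calls = []

def demo_generate(query: str) -> str:
    """Placeholder for the retrieval and generation steps."""
    generator_calls.append(query)
    return f"generated answer for: {query}"

first = answer_with_cache("example question", demo_cache, demo_generate)
second = answer_with_cache("example question", demo_cache, demo_generate)
print(first == second, len(generator_calls))  # → True 1
```

Passing `semantic_cache` in place of `demo_cache` would apply the same flow with similarity-based matching, assuming a generation function is available.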
