# InvestigatorAI: Comprehensive RAGAS Evaluation Framework

## 🎯 Objective
This notebook implements comprehensive evaluation of our InvestigatorAI fraud investigation system using RAGAS with both RAG and Agent evaluation metrics:

### 📊 RAG Evaluation Metrics:
- **Faithfulness**: Whether the claims in the response are supported by the retrieved contexts
- **Answer Relevancy**: How directly the response addresses the question
- **Context Precision**: Whether relevant contexts are ranked above irrelevant ones
- **Context Recall**: How much of the reference answer is covered by the retrieved contexts
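
The arithmetic behind three of these metrics can be sketched in plain Python. In RAGAS the per-claim and per-context judgments come from an LLM; in this sketch they are supplied by hand as booleans, so the function names and inputs are illustrative assumptions and not the RAGAS API. Answer relevancy is left out because it relies on embedding similarity.

```python
from typing import List


def supported_fraction(judgments: List[bool]) -> float:
    """Share of claims judged as supported; 0.0 for an empty list."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def context_precision(relevant_at_rank: List[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant context."""
    hits = 0
    precision_sum = 0.0
    for k, relevant in enumerate(relevant_at_rank, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0


# Hand-labelled judgments for one hypothetical question (illustrative only)
response_claims_supported = [True, True, False, True]  # faithfulness input
reference_claims_retrieved = [True, False]             # context recall input
contexts_relevant_by_rank = [True, False, True]        # context precision input

faithfulness = supported_fraction(response_claims_supported)
context_recall = supported_fraction(reference_claims_retrieved)
precision = context_precision(contexts_relevant_by_rank)

print(f"faithfulness={faithfulness:.2f}")
print(f"context_recall={context_recall:.2f}")
print(f"context_precision={precision:.2f}")
```

Faithfulness and context recall share the same ratio; they differ only in which claims are judged (claims in the response versus claims in the reference answer).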

### 🤖 Agent Evaluation Metrics:
- **Tool Call Accuracy**: Correct tool usage and parameters
- **Agent Goal Accuracy**: Achievement of user's stated goals
- **Topic Adherence**: Staying on-topic for fraud investigation
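
As a rough illustration of tool call accuracy, the sketch below compares a predicted sequence of tool calls with a reference sequence, position by position. It is a simplified stand-in, not the RAGAS implementation, and the tool names and arguments are hypothetical.

```python
from typing import Any, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def tool_call_accuracy(predicted: List[ToolCall], reference: List[ToolCall]) -> float:
    """Fraction of positions where name and arguments both match the reference.

    Dividing by the longer sequence penalizes missing and extra calls alike.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0  # nothing expected, nothing called
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / longest


# Hypothetical tool names and arguments, for illustration only
reference_calls: List[ToolCall] = [
    ("search_regulatory_documents", {"query": "SAR filing threshold"}),
    ("calculate_transaction_risk", {"amount": 75000, "country": "AE"}),
]
predicted_calls: List[ToolCall] = [
    ("search_regulatory_documents", {"query": "SAR filing threshold"}),
    ("calculate_transaction_risk", {"amount": 75000, "country": "US"}),
]

score = tool_call_accuracy(predicted_calls, reference_calls)
print(f"tool_call_accuracy={score:.2f}")
```

Strict positional matching is one possible design choice; an order-insensitive comparison would be more lenient toward agents that call the right tools in a different sequence.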

### 📈 Integration:
- **LangSmith**: Capturing evaluation results and conversation traces
- **Real Data**: Using official FinCEN/FFIEC/FDIC regulatory documents
- **Multi-Agent System**: Evaluating our complete fraud investigation workflow

---

*Following AI Makerspace evaluation patterns with Task 5 certification requirements*


## 📦 Dependencies and Setup


In [None]:
# Core dependencies for RAGAS evaluation
import os
import sys
import asyncio
from getpass import getpass
from datetime import datetime
from typing import List, Dict, Any
import pandas as pd
import json

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

## 🔑 API Keys Configuration


In [3]:
# Configure API keys for evaluation
print("🔐 Setting up API keys for evaluation...")

# OpenAI API Key (required for LLM and embeddings)
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
    
# LangSmith API Key (for evaluation tracking)
if not os.getenv("LANGSMITH_API_KEY"):
    os.environ["LANGSMITH_API_KEY"] = getpass("Enter your LangSmith API key: ")

# External API keys (if not already set)
external_apis = [
    "TAVILY_SEARCH_API_KEY",
    "ALPHA_VANTAGE_API_KEY"
]

for api_key in external_apis:
    if not os.getenv(api_key):
        response = input(f"Enter {api_key} (or press Enter to skip): ")
        if response.strip():
            os.environ[api_key] = response.strip()

print("✅ API keys configured for evaluation!")


🔐 Setting up API keys for evaluation...
✅ API keys configured for evaluation!


## 🏗️ Load InvestigatorAI Components


In [None]:
# Import existing InvestigatorAI components
print("🔄 Loading InvestigatorAI components for evaluation...")

try:
    # Load core components
    from api.core.config import Settings, get_settings, initialize_llm_components
    from api.services.vector_store import VectorStoreService  
    from api.services.external_apis import ExternalAPIService
    from api.agents.multi_agent_system import FraudInvestigationSystem
    from api.models.schemas import InvestigationRequest
    
    print("✅ Core InvestigatorAI components loaded!")
    
    # Initialize settings and LLM components
    settings = get_settings()
    llm, embeddings = initialize_llm_components(settings)
    
    print("✅ Settings and LLM components initialized!")
    
    # Initialize services with required arguments
    vector_service = VectorStoreService(embeddings=embeddings, settings=settings)
    external_api_service = ExternalAPIService(settings=settings)
    
    # Initialize multi-agent system
    fraud_system = FraudInvestigationSystem(
        llm=llm,
        external_api_service=external_api_service
    )
    
    print("✅ InvestigatorAI system initialized for evaluation!")
    
except ImportError as e:
    print(f"⚠️  Error loading InvestigatorAI components: {e}")
    print("💡 Make sure you're running from the project root directory")
except ValueError as e:
    print(f"⚠️  Configuration error: {e}")
    print("💡 Make sure your API keys are set in environment variables")
except Exception as e:
    print(f"⚠️  Unexpected error: {e}")

# Fall back to a basic LLM setup whenever initialization did not get far enough
# to define `llm` and `embeddings` (this also covers the ImportError case,
# which previously left both names undefined)
if "llm" not in globals() or "embeddings" not in globals():
    print("🔄 Falling back to basic LLM setup for RAGAS evaluation...")
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    embeddings = OpenAIEmbeddings()

# Wrap with RAGAS wrappers for evaluation
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

print("✅ RAGAS LLM and embeddings configured!")


🔄 Loading InvestigatorAI components for evaluation...
✅ Core InvestigatorAI components loaded!


TypeError: VectorStoreService.__init__() missing 2 required positional arguments: 'embeddings' and 'settings'

## 📄 Load Regulatory Documents and Generate Synthetic Dataset


In [None]:
# Load regulatory PDFs and generate synthetic test dataset
print("📄 Loading regulatory documents for evaluation...")

# Load PDF documents from data directory
pdf_path = "data/pdf_downloads/"
loader = DirectoryLoader(pdf_path, glob="*.pdf", loader_cls=PyMuPDFLoader)
regulatory_docs = loader.load()

# PyMuPDFLoader returns one Document per PDF page, so this counts pages
print(f"✅ Loaded {len(regulatory_docs)} regulatory document pages")

# Hand-written fraud investigation questions covering topics in the regulatory documents
fraud_investigation_questions = [
    "What is the SAR filing threshold for wire transfers to high-risk countries?",
    "What documentation is required for CTR reporting on cash transactions?",
    "Are there specific red flags for this transaction pattern involving shell companies?",
    "What are the Enhanced Due Diligence requirements for politically exposed persons?",
    "Which transactions require immediate law enforcement notification under FinCEN guidance?",
    "What are the money laundering indicators for wire transfers to the UAE?",
    "Are there structuring indicators in this series of cash deposits?",
    "How should suspicious account takeover activities be documented and reported?",
    "What are the trade-based money laundering red flags in import/export transactions?",
    "Are there compliance violations in this cryptocurrency exchange pattern?",
    "What KYC documentation is required for high-risk customer onboarding?",
    "How should politically exposed person transactions be monitored and reported?",
    "What are the investigation steps for suspected terrorist financing activities?",
    "Are there Bank Secrecy Act violations in these transaction patterns?",
    "What evidence collection procedures should be followed for fraud investigations?",
    "How should cross-border wire transfer compliance be verified?",
    "What are the regulatory reporting requirements for this suspicious activity?",
    "Are there OFAC sanctions screening violations in these transactions?",
    "What investigation timeline should be followed for this fraud case?",
    "How should this complex money laundering scheme be analyzed and reported?"
]

# Report how many hand-written questions are available for evaluation
print(f"🎯 Prepared {len(fraud_investigation_questions)} fraud investigation questions")

# Display sample questions
print("\n📋 Sample fraud investigation questions:")
for i, question in enumerate(fraud_investigation_questions[:5]):
    print(f"  {i+1}. {question}")

print(f"\n✅ Dataset ready with {len(fraud_investigation_questions)} questions for evaluation!")


In [None]:
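# Re-create the RAGAS wrappers for test set generation.
# Note: this overrides the gpt-4o / project-embeddings wrappers defined earlier.
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Build a synthetic test set of 10 samples from the first 20 loaded documents
generator = TestsetGenerator(
    llm=generator_llm, embedding_model=generator_embeddings
)
dataset = generator.generate_with_langchain_docs(regulatory_docs[:20], testset_size=10)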

## 🤖 Generate Responses with InvestigatorAI Multi-Agent System

Next sections to implement:
- Generate responses using the fraud investigation system
- Create RAGAS evaluation samples
- Run RAG evaluation (faithfulness, answer relevancy, context precision, context recall)
- Run Agent evaluation (tool call accuracy, agent goal accuracy, topic adherence) 
- Integrate with LangSmith for results tracking
- Provide comprehensive performance analysis and recommendations
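
The first two steps above could look roughly like the following sketch. The `investigate` callable and its return shape (a dict with `answer` and `contexts`) are assumptions standing in for the real multi-agent system, whose interface is not shown in this notebook. The record keys mirror the single-turn sample fields used by RAGAS (`user_input`, `response`, `retrieved_contexts`).

```python
from typing import Callable, Dict, List, TypedDict


class InvestigationResult(TypedDict):
    """Assumed return shape of the system under evaluation (hypothetical)."""
    answer: str
    contexts: List[str]


def build_eval_records(
    questions: List[str],
    investigate: Callable[[str], InvestigationResult],
) -> List[Dict[str, object]]:
    """Run each question through the system and collect one record per question."""
    records = []
    for question in questions:
        result = investigate(question)
        records.append(
            {
                "user_input": question,
                "response": result["answer"],
                "retrieved_contexts": result["contexts"],
            }
        )
    return records


def fake_investigate(question: str) -> InvestigationResult:
    """Stub standing in for the real multi-agent system (hypothetical)."""
    return {
        "answer": f"Placeholder answer for: {question}",
        "contexts": ["Placeholder regulatory excerpt."],
    }


records = build_eval_records(
    ["What documentation is required for CTR reporting on cash transactions?"],
    fake_investigate,
)
print(records[0]["user_input"])
```

Records in this shape could then be handed to RAGAS, for example through `EvaluationDataset.from_list`, once a `reference` answer is added for the metrics that need one. The exact field names should be checked against the installed RAGAS version.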

*Implementation following AI Makerspace patterns with complete LangSmith integration*
