# Ingestion Service

Basic usage examples for the RagDoll2 ingestion service.

## Basic Usage

In [13]:
import os
import glob
from pathlib import Path
from ragdoll.ingestion.ingestion_service import IngestionService

# Get absolute path to the test_data directory
current_file = Path(os.path.abspath(""))  # Current notebook directory
test_data_dir = (current_file.parent / "tests" / "test_data").resolve()

# Find all files using glob
file_paths = glob.glob(str(test_data_dir / "*"))
print(f"Found {len(file_paths)} files")

# Create ingestion service with default settings
service = IngestionService()

# Process all documents
documents = service.ingest_documents(file_paths)

# Show how many documents were extracted
print(f"Processed {len(documents)} documents")


2025-04-17 15:47:30,527 - INFO - Loaded 18 file extension loaders
2025-04-17 15:47:30,527 - INFO - Cache initialized at C:\Users\PG518JW\.ragdoll\cache with TTL=86400s
2025-04-17 15:47:30,536 - INFO - Initialized with 18 loaders, max_threads=10, use_cache=True, collect_metrics=False
2025-04-17 15:47:30,537 - INFO - Starting ingestion of 5 inputs
2025-04-17 15:47:30,537 - INFO - Processing batch 1 with 5 sources


Found 5 files


2025-04-17 15:47:36,258 - INFO - Finished ingestion, loaded 777 documents


Processed 777 documents


In [14]:
import json

# Access the first document
if documents:
    doc = documents[0]
    print(f"First document content (preview): {doc.page_content[:100]}...")
    print(f"Metadata:\n {json.dumps(doc.metadata, indent=2)}")

First document content (preview): Content from the zip file `test_docx.docx`:

## File: [Content_Types].xml

<?xml version="1.0" encod...
Metadata:
 {
  "source": "C:\\dev\\RAGdoll\\tests\\test_data\\test_docx.docx",
  "file_name": "test_docx.docx",
  "file_size": 327119,
  "conversion_success": true,
  "metadata_extraction_error": "No module named 'exceptions'",
  "content_type": "document_full"
}


## Working with different file types

In [15]:
from ragdoll.ingestion.ingestion_service import IngestionService

# Initialize service
service = IngestionService()

# Process files of different types
pdf_docs = service.ingest_documents(["../tests/test_data/test_pdf.pdf"])
text_docs = service.ingest_documents(["../tests/test_data/test_txt.txt", "../tests/test_data/test_txt.txt"])
docx_docs = service.ingest_documents(["../tests/test_data/test_docx.docx"])

# Process HTML from URLs
web_docs = service.ingest_documents(["https://github.com/nsasto/langchain-markitdown"])

# Combine all documents
all_docs = pdf_docs + text_docs + docx_docs + web_docs

print(f"Total documents: {len(all_docs)}")
print("Documents by type:")
print(f"  - PDF: {len(pdf_docs)}")
print(f"  - Text: {len(text_docs)}")
print(f"  - DOCX: {len(docx_docs)}")
print(f"  - Web: {len(web_docs)}")

2025-04-17 15:47:36,329 - INFO - Loaded 18 file extension loaders
2025-04-17 15:47:36,332 - INFO - Cache initialized at C:\Users\PG518JW\.ragdoll\cache with TTL=86400s
2025-04-17 15:47:36,333 - INFO - Initialized with 18 loaders, max_threads=10, use_cache=True, collect_metrics=False
2025-04-17 15:47:36,336 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:47:36,339 - INFO - Processing batch 1 with 1 sources
2025-04-17 15:47:39,394 - INFO - Finished ingestion, loaded 773 documents
2025-04-17 15:47:39,397 - INFO - Starting ingestion of 2 inputs
2025-04-17 15:47:39,400 - INFO - Processing batch 1 with 2 sources
2025-04-17 15:47:39,410 - INFO - Finished ingestion, loaded 2 documents
2025-04-17 15:47:39,412 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:47:39,414 - INFO - Processing batch 1 with 1 sources
2025-04-17 15:47:42,166 - INFO - Finished ingestion, loaded 1 documents
2025-04-17 15:47:42,169 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:47:42,173 - INFO - Processi

Total documents: 777
Documents by type:
  - PDF: 773
  - Text: 2
  - DOCX: 1
  - Web: 1
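
A per-type breakdown can also be derived from each document's metadata instead of from separate ingestion calls. The following is a minimal sketch, assuming each document exposes a `metadata` dict with a `source` key as in the earlier output. The `count_by_extension` helper and the stand-in documents are hypothetical and not part of RAGdoll.

```python
from collections import Counter
from pathlib import PurePath
from types import SimpleNamespace
from urllib.parse import urlparse


def count_by_extension(documents):
    """Count documents per source type, keyed by file extension or 'url'."""
    counts = Counter()
    for doc in documents:
        source = doc.metadata.get("source", "")
        if urlparse(source).scheme in ("http", "https"):
            counts["url"] += 1
        else:
            # Lowercased suffix such as ".pdf"; an empty suffix becomes "unknown"
            counts[PurePath(source).suffix.lower() or "unknown"] += 1
    return counts


# Stand-in documents for illustration only (not real ingestion output)
sample_docs = [
    SimpleNamespace(metadata={"source": "../tests/test_data/test_pdf.pdf"}),
    SimpleNamespace(metadata={"source": "../tests/test_data/test_pdf.pdf"}),
    SimpleNamespace(metadata={"source": "../tests/test_data/test_txt.txt"}),
    SimpleNamespace(metadata={"source": "https://github.com/nsasto/langchain-markitdown"}),
]
extension_counts = count_by_extension(sample_docs)
print(dict(extension_counts))
```

With real data, passing `all_docs` to the helper should give a similar breakdown, provided every loader sets `source` in the metadata.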


## Customizing Ingestion Settings

In [16]:
# Modified initialization with supported parameters
from ragdoll.ingestion.ingestion_service import IngestionService

# Initialize with only the supported parameters
service = IngestionService(
    max_threads=4,                # Limit concurrency
    batch_size=10,                # Process files in batches of 10
    use_cache=True,               # Enable caching
    collect_metrics=True          # Enable metrics collection
)

# Process documents - pass file_paths directly, not [file_paths]
documents = service.ingest_documents(file_paths)

print(f"Processed {len(documents)} document chunks")

# Document properties can be accessed differently depending on type
if documents:
    doc = documents[0]
    if hasattr(doc, 'page_content'):
        content_length = len(doc.page_content)
    elif isinstance(doc, dict) and 'page_content' in doc:
        content_length = len(doc['page_content'])
    else:
        content_length = 0
    print(f"First document size: {content_length} characters")

2025-04-17 15:47:44,389 - INFO - Loaded 18 file extension loaders
2025-04-17 15:47:44,392 - INFO - Cache initialized at C:\Users\PG518JW\.ragdoll\cache with TTL=86400s
2025-04-17 15:47:44,396 - INFO - Metrics system initialized with storage at C:\Users\PG518JW\.ragdoll\metrics
2025-04-17 15:47:44,399 - INFO - Initialized with 18 loaders, max_threads=4, use_cache=True, collect_metrics=True
2025-04-17 15:47:44,404 - INFO - Starting ingestion of 5 inputs
2025-04-17 15:47:44,406 - INFO - Started metrics session 09fd571f-36f3-44c4-b4ee-460b57ca054b with 5 inputs
2025-04-17 15:47:44,412 - INFO - Processing batch 1 with 5 sources
2025-04-17 15:47:51,026 - INFO - Metrics session completed and saved to C:\Users\PG518JW\.ragdoll\metrics\session_09fd571f-36f3-44c4-b4ee-460b57ca054b.json
2025-04-17 15:47:51,027 - INFO - Processed 777 documents with 100.0% success rate
2025-04-17 15:47:51,028 - INFO - Finished ingestion, loaded 777 documents


Processed 777 document chunks
First document size: 111038 characters


## Working with Caching

In [24]:
# Complete caching performance test
from ragdoll.ingestion.ingestion_service import IngestionService
import time
import statistics

def measure_processing_time(use_cache, file_path, runs=3):
    """Measure document processing time with or without cache."""
    times = []
    
    service = IngestionService(use_cache=use_cache, cache_ttl=3600)
    
    # Run multiple times to get average performance
    for i in range(runs):
        # perf_counter is a monotonic, high-resolution clock suited to timing
        start = time.perf_counter()
        docs = service.ingest_documents([file_path])
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        
        # Don't sleep on the last run
        if i < runs-1:
            time.sleep(0.5)  # Short pause between runs
    
    return {
        "avg_time": statistics.mean(times),
        "min_time": min(times),
        "max_time": max(times),
        "doc_count": len(docs),
        "runs": runs
    }

# Clear any existing cache first
service_clear = IngestionService(use_cache=True)
service_clear.clear_cache()
print("Cache cleared")

# Test with no cache
no_cache_results = measure_processing_time(False, "../tests/test_data/test_pdf.pdf", runs=3)
print("\nWithout cache:")
print(f"  Processed {no_cache_results['doc_count']} documents")
print(f"  Average time: {no_cache_results['avg_time']:.3f} seconds")
print(f"  Min time: {no_cache_results['min_time']:.3f}, Max time: {no_cache_results['max_time']:.3f}")

# Test with cache (first run populates, subsequent runs use cache)
cache_results = measure_processing_time(True, "../tests/test_data/test_pdf.pdf", runs=3)
print("\nWith cache:")
print(f"  Processed {cache_results['doc_count']} documents")
print(f"  Average time: {cache_results['avg_time']:.3f} seconds")
print(f"  Min time: {cache_results['min_time']:.3f}, Max time: {cache_results['max_time']:.3f}")

# Speed improvement calculation
if no_cache_results['avg_time'] > 0:
    improvement = (no_cache_results['avg_time'] - cache_results['avg_time']) / no_cache_results['avg_time'] * 100
    print(f"\nCache performance improvement: {improvement:.1f}%")

2025-04-17 15:52:18,193 - INFO - Loaded 18 file extension loaders
2025-04-17 15:52:18,196 - INFO - Cache initialized at C:\Users\PG518JW\.ragdoll\cache with TTL=86400s
2025-04-17 15:52:18,198 - INFO - Initialized with 18 loaders, max_threads=10, use_cache=True, collect_metrics=False
2025-04-17 15:52:18,207 - INFO - Loaded 18 file extension loaders
2025-04-17 15:52:18,209 - INFO - Initialized with 18 loaders, max_threads=10, use_cache=False, collect_metrics=False
2025-04-17 15:52:18,210 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:52:18,213 - INFO - Processing batch 1 with 1 sources


Cache cleared


2025-04-17 15:52:20,844 - INFO - Finished ingestion, loaded 773 documents
2025-04-17 15:52:21,347 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:52:21,350 - INFO - Processing batch 1 with 1 sources
2025-04-17 15:52:24,381 - INFO - Finished ingestion, loaded 773 documents
2025-04-17 15:52:24,886 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:52:24,891 - INFO - Processing batch 1 with 1 sources
2025-04-17 15:52:27,302 - INFO - Finished ingestion, loaded 773 documents
2025-04-17 15:52:27,325 - INFO - Loaded 18 file extension loaders
2025-04-17 15:52:27,326 - INFO - Cache initialized at C:\Users\PG518JW\.ragdoll\cache with TTL=3600s
2025-04-17 15:52:27,326 - INFO - Initialized with 18 loaders, max_threads=10, use_cache=True, collect_metrics=False
2025-04-17 15:52:27,332 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:52:27,332 - INFO - Processing batch 1 with 1 sources



Without cache:
  Processed 773 documents
  Average time: 2.699 seconds
  Min time: 2.422, Max time: 3.039


2025-04-17 15:52:29,825 - INFO - Finished ingestion, loaded 773 documents
2025-04-17 15:52:30,328 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:52:30,332 - INFO - Processing batch 1 with 1 sources
2025-04-17 15:52:32,565 - INFO - Finished ingestion, loaded 773 documents
2025-04-17 15:52:33,070 - INFO - Starting ingestion of 1 inputs
2025-04-17 15:52:33,073 - INFO - Processing batch 1 with 1 sources
2025-04-17 15:52:35,431 - INFO - Finished ingestion, loaded 773 documents



With cache:
  Processed 773 documents
  Average time: 2.365 seconds
  Min time: 2.237, Max time: 2.493

Cache performance improvement: 12.4%
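
The "with cache" average above includes the first run, which populates the cache, so it mixes cold and warm timings. A minimal sketch of timing the two separately follows. The `time_cold_and_warm` helper is hypothetical, and the demo uses a `functools.lru_cache` stand-in instead of `IngestionService` so that it runs on its own.

```python
import statistics
import time
from functools import lru_cache


def time_cold_and_warm(func, warm_runs=3):
    """Time the first (cold) call separately from the following (warm) calls."""
    start = time.perf_counter()
    func()
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        func()
        warm.append(time.perf_counter() - start)

    return {
        "cold_time": cold,
        "warm_avg_time": statistics.mean(warm),
        "warm_runs": warm_runs,
    }


@lru_cache(maxsize=None)
def slow_stand_in():
    """Stand-in for a cached ingestion call; sleeps only on the first call."""
    time.sleep(0.05)
    return "done"


timings = time_cold_and_warm(slow_stand_in)
print(f"Cold: {timings['cold_time']:.3f}s, warm average: {timings['warm_avg_time']:.6f}s")
```

To apply this to the service, pass something like `lambda: service.ingest_documents([file_path])` for a service created with `use_cache=True`, after clearing the cache.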


## Handling Errors

In [18]:
from ragdoll.ingestion.ingestion_service import IngestionService
import logging

# Configure logging to see warnings and errors
logging.basicConfig(level=logging.INFO)

# Create service
service = IngestionService()

# Mix of valid and invalid files
files = [
    "documents/valid.pdf",
    "documents/corrupted.pdf",
    "documents/nonexistent.txt",
    "documents/valid.txt"
]

try:
    # Service will skip files it can't process
    documents = service.ingest_documents(files)
    print(f"Successfully processed {len(documents)} documents")
    
    # Check how many files were actually processed
    # Documents expose metadata as an attribute, as in the earlier examples
    sources = {doc.metadata["source"] for doc in documents if "source" in doc.metadata}
    print(f"Documents came from {len(sources)} source files")
    print(f"Source files: {sources}")
    
except Exception as e:
    print(f"Error during ingestion: {e}")

2025-04-17 15:47:56,599 - INFO - Loaded 18 file extension loaders
2025-04-17 15:47:56,602 - INFO - Cache initialized at C:\Users\PG518JW\.ragdoll\cache with TTL=86400s
2025-04-17 15:47:56,605 - INFO - Initialized with 18 loaders, max_threads=10, use_cache=True, collect_metrics=False
2025-04-17 15:47:56,608 - INFO - Starting ingestion of 4 inputs


Error during ingestion: No valid sources found
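
In the run above the call raised `No valid sources found`, presumably because none of the placeholder paths exist on disk. One way to see which inputs are unusable before calling the service is to check the paths first. This is a minimal sketch, not part of RAGdoll: `split_existing_sources` is a hypothetical helper, and it assumes that anything with an `http` or `https` scheme is a URL to be passed through unchecked.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_existing_sources(sources):
    """Split sources into (usable, missing); URLs are passed through unchecked."""
    usable, missing = [], []
    for source in sources:
        is_url = urlparse(source).scheme in ("http", "https")
        if is_url or Path(source).is_file():
            usable.append(source)
        else:
            missing.append(source)
    return usable, missing


files = [
    "documents/valid.pdf",
    "documents/nonexistent.txt",
    "https://github.com/nsasto/langchain-markitdown",
]
usable, missing = split_existing_sources(files)
print(f"Usable sources: {usable}")
print(f"Missing sources: {missing}")
```

The `usable` list can then be passed to `service.ingest_documents`, and `missing` can be logged or reported separately.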


## Logging metrics

In [19]:
# Locate the test data files to ingest
import os
import glob
from pathlib import Path
from ragdoll.ingestion.ingestion_service import IngestionService

# Get absolute path to the test_data directory
current_file = Path(os.path.abspath(""))  # Current notebook directory
test_data_dir = (current_file.parent / "tests" / "test_data").resolve()

# Instead of using Path.glob(), use the glob module which handles absolute paths
file_paths = glob.glob(str(test_data_dir / "*"))
print(f"Found {len(file_paths)} files")

Found 5 files


### Basic Usage

In [20]:
# Create service
service = IngestionService(collect_metrics=True)

# Pass the actual file paths, not the glob pattern
service.ingest_documents(file_paths)


# Get metrics after running
metrics = service.get_metrics(days=30)  # Get metrics from the last 30 days

# Use the metrics data
print(f"Total documents processed: {metrics['aggregate']['total_documents']}")
print(f"Average success rate: {metrics['aggregate']['avg_success_rate']:.2%}")

# Print metrics for each source type
for source_type, type_metrics in metrics['aggregate']['by_source_type'].items():
    print(f"\nMetrics for {source_type} sources:")
    print(f"  Count: {type_metrics['count']}")
    print(f"  Success rate: {type_metrics['success_rate']:.2%}")
    print(f"  Average documents: {type_metrics['avg_documents']:.1f}")
    print(f"  Average processing time: {type_metrics['avg_processing_time_ms']:.1f}ms")

# Get the most recent session details
if metrics['recent_sessions']:
    latest = metrics['recent_sessions'][0]
    print(f"\nLatest session ({latest['session_id']}):")
    print(f"  Time: {latest['timestamp_start']}")
    print(f"  Documents: {latest['document_count']}")
    print(f"  Duration: {latest['duration_seconds']:.2f} seconds")

2025-04-17 15:47:56,719 - INFO - Loaded 18 file extension loaders
2025-04-17 15:47:56,723 - INFO - Cache initialized at C:\Users\PG518JW\.ragdoll\cache with TTL=86400s
2025-04-17 15:47:56,727 - INFO - Metrics system initialized with storage at C:\Users\PG518JW\.ragdoll\metrics
2025-04-17 15:47:56,729 - INFO - Initialized with 18 loaders, max_threads=10, use_cache=True, collect_metrics=True
2025-04-17 15:47:56,731 - INFO - Starting ingestion of 5 inputs
2025-04-17 15:47:56,732 - INFO - Started metrics session 5ef8829c-4aad-4166-99ef-5bb072baac14 with 5 inputs
2025-04-17 15:47:56,739 - INFO - Processing batch 1 with 5 sources
2025-04-17 15:48:04,096 - INFO - Metrics session completed and saved to C:\Users\PG518JW\.ragdoll\metrics\session_5ef8829c-4aad-4166-99ef-5bb072baac14.json
2025-04-17 15:48:04,099 - INFO - Processed 777 documents with 100.0% success rate
2025-04-17 15:48:04,101 - INFO - Finished ingestion, loaded 777 documents


Total documents processed: 9316
Average success rate: 89.66%

Metrics for file sources:
  Count: 57
  Success rate: 89.47%
  Average documents: 163.4
  Average processing time: 4126.6ms

Latest session (5ef8829c-4aad-4166-99ef-5bb072baac14):
  Time: 2025-04-17T15:47:56.732026
  Documents: 777
  Duration: 7.36 seconds


### Direct Access

You can also read metrics directly from the session JSON files in the metrics directory.

In [21]:
import json
from pathlib import Path
import os

# Default metrics location
metrics_dir = Path.home() / ".ragdoll" / "metrics"
# Or custom location if you specified one
# metrics_dir = Path("/path/to/your/metrics")

# List all session files
session_files = list(metrics_dir.glob("session_*.json"))
# Sort by modification time (most recent first)
session_files.sort(key=os.path.getmtime, reverse=True)

# Read the most recent session
if session_files:
    with open(session_files[0], "r", encoding="utf-8") as f:
        latest_session = json.load(f)
        
    print(f"Session ID: {latest_session['session_id']}")
    print(f"Date: {latest_session['timestamp_start']}")
    print(f"Documents processed: {latest_session['document_count']}")
    print(f"Success rate: {latest_session['success_rate']:.2%}")
    
    # Print details about each source
    for source_id, source_data in latest_session["sources"].items():
        print(f"\nSource: {source_id}")
        print(f"  Type: {source_data['source_type']}")
        print(f"  Success: {source_data['success']}")
        print(f"  Documents: {source_data['document_count']}")
        print(f"  Processing time: {source_data['processing_time_ms']}ms")

Session ID: 5ef8829c-4aad-4166-99ef-5bb072baac14
Date: 2025-04-17T15:47:56.732026
Documents processed: 777
Success rate: 100.00%

Source: C:\dev\RAGdoll\tests\test_data\test_txt.txt
  Type: file
  Success: True
  Documents: 1
  Processing time: 198ms

Source: C:\dev\RAGdoll\tests\test_data\test_pdf.pdf
  Type: file
  Success: True
  Documents: 773
  Processing time: 4030ms

Source: C:\dev\RAGdoll\tests\test_data\test_xlsx.xlsx
  Type: file
  Success: True
  Documents: 1
  Processing time: 4022ms

Source: C:\dev\RAGdoll\tests\test_data\test_pptx.pptx
  Type: file
  Success: True
  Documents: 1
  Processing time: 4422ms

Source: C:\dev\RAGdoll\tests\test_data\test_docx.docx
  Type: file
  Success: True
  Documents: 1
  Processing time: 7350ms
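
The per-source entries can be reduced to a short summary, for example the total document count and the slowest source. A minimal sketch follows, assuming the `sources` mapping has the shape printed above. The `summarize_sources` helper is hypothetical, and the sample data is copied from that output with the keys shortened to file names.

```python
def summarize_sources(sources):
    """Return the total document count and the slowest source of a session."""
    total_documents = sum(data["document_count"] for data in sources.values())
    slowest_id = max(sources, key=lambda sid: sources[sid]["processing_time_ms"])
    return {
        "total_documents": total_documents,
        "slowest_source": slowest_id,
        "slowest_time_ms": sources[slowest_id]["processing_time_ms"],
    }


# Values copied from the session output above (keys shortened to file names)
sample_sources = {
    "test_txt.txt": {"document_count": 1, "processing_time_ms": 198},
    "test_pdf.pdf": {"document_count": 773, "processing_time_ms": 4030},
    "test_xlsx.xlsx": {"document_count": 1, "processing_time_ms": 4022},
    "test_pptx.pptx": {"document_count": 1, "processing_time_ms": 4422},
    "test_docx.docx": {"document_count": 1, "processing_time_ms": 7350},
}
summary = summarize_sources(sample_sources)
print(summary)
```

With a loaded session file, the same helper could be called as `summarize_sources(latest_session["sources"])`.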


### Displaying Outputs

A simple metrics dashboard that prints aggregate metrics, recent sessions, and a detailed report for the latest session.

In [22]:
# Notebook-friendly version of the dashboard script
import json
import os
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Dict, Any

from ragdoll.metrics.metrics_manager import MetricsManager

def print_section(title: str):
    """Print a section title."""
    print(f"\n{'=' * 80}")
    print(f"  {title}")
    print(f"{'=' * 80}")

def print_session_summary(session: Dict[str, Any]):
    """Print a summary of a session."""
    start_time = datetime.fromisoformat(session["timestamp_start"])
    
    print(f"Session: {session['session_id']}")
    print(f"  Date: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"  Duration: {session.get('duration_seconds', 0):.2f} seconds")
    print(f"  Documents: {session['document_count']}")
    print(f"  Sources: {session['success_count'] + session['failure_count']} "
          f"({session['success_count']} successful, {session['failure_count']} failed)")
    print(f"  Success rate: {session.get('success_rate', 0):.2%}")

def format_bytes(bytes_count: int) -> str:
    """Format bytes as human-readable size."""
    if bytes_count < 1024:
        return f"{bytes_count} B"
    elif bytes_count < 1024**2:
        return f"{bytes_count / 1024:.1f} KB"
    elif bytes_count < 1024**3:
        return f"{bytes_count / (1024**2):.1f} MB"
    else:
        return f"{bytes_count / (1024**3):.2f} GB"

# Initialize metrics manager with the path to your metrics directory
metrics_dir = Path.home() / ".ragdoll" / "metrics"
metrics_manager = MetricsManager(metrics_dir=metrics_dir)

# Show aggregate metrics and recent sessions
print_section("RAGdoll Metrics Dashboard")

# Get aggregate metrics for the last 30 days
days = 30
try:
    # Work around the date handling issue by computing the cutoff with timedelta
    # (datetime and timedelta are already imported at the top of this cell)
    
    # Monkey patch the get_aggregate_metrics method to avoid date issues
    def fixed_get_aggregate_metrics(self, days=30):
        cutoff_date = datetime.now() - timedelta(days=days)
        
        aggregate = {
            "total_sessions": 0,
            "total_documents": 0,
            "total_sources": 0,
            "successful_sources": 0,
            "failed_sources": 0,
            "avg_success_rate": 0,
            "avg_documents_per_source": 0,
            "avg_processing_time_ms": 0,
            "by_source_type": {}
        }
        
        try:
            json_files = list(self.metrics_dir.glob("session_*.json"))
            
            # Process each session file
            for file_path in json_files:
                with open(file_path, "r", encoding="utf-8") as f:
                    session = json.load(f)
                
                # Skip sessions with no start timestamp or older than the cutoff
                timestamp = session.get("timestamp_start")
                if not timestamp:
                    continue
                session_date = datetime.fromisoformat(timestamp)
                if session_date < cutoff_date:
                    continue
                
                # Update aggregate metrics
                aggregate["total_sessions"] += 1
                aggregate["total_documents"] += session.get("document_count", 0)
                aggregate["total_sources"] += session.get("success_count", 0) + session.get("failure_count", 0)
                aggregate["successful_sources"] += session.get("success_count", 0)
                aggregate["failed_sources"] += session.get("failure_count", 0)
                
                # Process by source type
                for source_key, source_metrics in session.get("sources", {}).items():
                    source_type = source_metrics.get("source_type", "unknown")
                    
                    if source_type not in aggregate["by_source_type"]:
                        aggregate["by_source_type"][source_type] = {
                            "count": 0,
                            "success_count": 0,
                            "document_count": 0,
                            "total_processing_time_ms": 0
                        }
                    
                    type_metrics = aggregate["by_source_type"][source_type]
                    type_metrics["count"] += 1
                    
                    if source_metrics.get("success", False):
                        type_metrics["success_count"] += 1
                    
                    type_metrics["document_count"] += source_metrics.get("document_count", 0)
                    type_metrics["total_processing_time_ms"] += source_metrics.get("processing_time_ms", 0)
            
            # Calculate averages
            if aggregate["total_sources"] > 0:
                aggregate["avg_success_rate"] = aggregate["successful_sources"] / aggregate["total_sources"]
                aggregate["avg_documents_per_source"] = aggregate["total_documents"] / aggregate["total_sources"]
                
                total_time = 0
                total_items = 0
                for source_type, metrics in aggregate["by_source_type"].items():
                    total_time += metrics["total_processing_time_ms"]
                    total_items += metrics["count"]
                    
                    # Calculate source type specific metrics
                    if metrics["count"] > 0:
                        metrics["avg_processing_time_ms"] = metrics["total_processing_time_ms"] / metrics["count"]
                        metrics["avg_documents"] = metrics["document_count"] / metrics["count"]
                        metrics["success_rate"] = metrics["success_count"] / metrics["count"]
                
                if total_items > 0:
                    aggregate["avg_processing_time_ms"] = total_time / total_items
                    
        except Exception as e:
            print(f"Error calculating aggregate metrics: {e}")
            
        return aggregate
    
    # Apply the monkey patch
    from types import MethodType
    metrics_manager.get_aggregate_metrics = MethodType(fixed_get_aggregate_metrics, metrics_manager)
    
    aggregate = metrics_manager.get_aggregate_metrics(days=days)
    
    print(f"Showing metrics for the past {days} days:")
    print(f"  Total sessions: {aggregate['total_sessions']}")
    print(f"  Total documents: {aggregate['total_documents']}")
    print(f"  Total sources: {aggregate['total_sources']}")
    print(f"  Success rate: {aggregate['avg_success_rate']:.2%}")
    print(f"  Avg processing time: {aggregate['avg_processing_time_ms']:.1f}ms per source")
    
    # Show metrics by source type
    print_section("Metrics by Source Type")
    for source_type, metrics in aggregate["by_source_type"].items():
        print(f"\n{source_type.upper()} Sources:")
        print(f"  Count: {metrics['count']}")
        print(f"  Success rate: {metrics.get('success_rate', 0):.2%}")
        print(f"  Avg documents: {metrics.get('avg_documents', 0):.1f} per source")
        print(f"  Avg processing time: {metrics.get('avg_processing_time_ms', 0):.1f}ms")
    
    # Show recent sessions
    recent_sessions = metrics_manager.get_recent_sessions(limit=5)
    
    print_section("Recent Sessions")
    for session in recent_sessions:
        print("")
        print_session_summary(session)
        
    # Pick a specific session to view in detail
    if recent_sessions:
        session_id = recent_sessions[0]["session_id"]
        print_section(f"Detailed Session Report: {session_id}")
        
        # Get the session data
        session_path = Path(metrics_manager.metrics_dir) / f"session_{session_id}.json"
        with open(session_path, "r", encoding="utf-8") as f:
            session = json.load(f)
        
        print_session_summary(session)
        
        print("\nSource Details:")
        for source_id, source_data in session["sources"].items():
            success = "✅" if source_data["success"] else "❌"
            # Use .get so sources without an "error" field do not raise KeyError
            error = f" - Error: {source_data['error']}" if source_data.get("error") else ""
            
            print(f"\n{success} {source_id} ({source_data['source_type']}){error}")
            print(f"  Documents: {source_data['document_count']}")
            print(f"  Size: {format_bytes(source_data['bytes'])}")
            print(f"  Processing time: {source_data['processing_time_ms']}ms")
            
except Exception as e:
    print(f"Error running dashboard: {e}")

2025-04-17 15:48:05,346 - INFO - Metrics system initialized with storage at C:\Users\PG518JW\.ragdoll\metrics



  RAGdoll Metrics Dashboard
Showing metrics for the past 30 days:
  Total sessions: 25
  Total documents: 9316
  Total sources: 58
  Success rate: 89.66%
  Avg processing time: 4126.6ms per source

  Metrics by Source Type

FILE Sources:
  Count: 57
  Success rate: 89.47%
  Avg documents: 163.4 per source
  Avg processing time: 4126.6ms

  Recent Sessions

Session: 5ef8829c-4aad-4166-99ef-5bb072baac14
  Date: 2025-04-17 15:47:56
  Duration: 7.36 seconds
  Documents: 777
  Sources: 5 (5 successful, 0 failed)
  Success rate: 100.00%

Session: 09fd571f-36f3-44c4-b4ee-460b57ca054b
  Date: 2025-04-17 15:47:44
  Duration: 6.62 seconds
  Documents: 777
  Sources: 5 (5 successful, 0 failed)
  Success rate: 100.00%

Session: 0d466654-4981-46fe-9a74-d99ac6a0398b
  Date: 2025-04-17 15:45:10
  Duration: 5.50 seconds
  Documents: 777
  Sources: 5 (5 successful, 0 failed)
  Success rate: 100.00%

Session: d384b383-b747-4093-9571-529ef10e900b
  Date: 2025-04-17 15:44:03
  Duration: 0.00 seconds
  Do

### Export to file

Export recent session summaries to a CSV file for use in other tools.

In [23]:
import csv

# Export to CSV
def export_sessions_to_csv(metrics_manager, output_path):
    sessions = metrics_manager.get_recent_sessions(limit=100)
    
    # newline='' is required by the csv module; utf-8 keeps output portable
    with open(output_path, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['session_id', 'timestamp', 'documents', 'sources', 
                      'success_rate', 'duration_seconds']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        
        writer.writeheader()
        for session in sessions:
            writer.writerow({
                'session_id': session['session_id'],
                'timestamp': session['timestamp_start'],
                'documents': session['document_count'],
                'sources': session['success_count'] + session['failure_count'],
                # Mirror the dashboard's defaults for fields that may be missing
                'success_rate': session.get('success_rate', 0),
                'duration_seconds': session.get('duration_seconds', 0)
            })