# Fed Minutes Knowledge Base Demonstration

This notebook demonstrates the Phase 2 knowledge base for semantic search and analysis of Federal Reserve meeting minutes (1965-1973).

## Overview
The knowledge base turns more than 1,100 sets of Fed meeting minutes into a semantic search system built on:
- **Vector embeddings** for semantic understanding
- **ChromaDB** for fast similarity search
- **Rich metadata** preservation for context
- **Temporal analysis** capabilities

## Contents
1. [Setup & Data Loading](#1-setup--data-loading)
2. [Vector Embeddings](#2-vector-embeddings)
3. [Database Creation](#3-database-creation)
4. [Semantic Search Examples](#4-semantic-search-examples)
5. [Summary](#summary)

## 1. Setup & Data Loading

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np
import json
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
plt.style.use('default')

# Add project root to path
project_root = Path(os.getcwd()).parent
sys.path.append(str(project_root))

print("✓ Libraries imported successfully")

In [None]:
# Import knowledge base modules
from src.utils.config import load_config
from src.phase2_knowledge_base import (
    create_embeddings_pipeline,
    create_vector_db,
    create_search_interface,
    DocumentChunk,
    QueryBuilder
)

# Load configuration
config = load_config()
print("✓ Knowledge base modules loaded")
print("✓ Configuration loaded")

In [None]:
# Load parsed meetings data
processed_dir = Path(config['paths']['processed_dir'])
meetings_file = processed_dir / 'meetings_full.json'

if not meetings_file.exists():
    print("❌ No parsed meetings found.")
    print("   Please run Phase 1 parsing first:")
    print("   python -m src.phase1_parsing.fed_parser")
    raise FileNotFoundError(f"Missing: {meetings_file}")

# Load meetings data
df_meetings = pd.read_json(meetings_file)
print(f"✓ Loaded {len(df_meetings):,} meetings")
print(f"  Date range: {df_meetings['date'].min()} to {df_meetings['date'].max()}")

# Verify raw_text column exists
if 'raw_text' not in df_meetings.columns:
    print("❌ Missing raw_text column")
    print("   Please run: python3 fix_json.py")
    raise ValueError("raw_text column required for embeddings")

avg_text_length = df_meetings['raw_text'].str.len().mean()
print(f"✓ Raw text available (avg: {avg_text_length:,.0f} chars per meeting)")

In [None]:
# Display data overview
print("📊 Fed Minutes Dataset Overview\n")

# Basic statistics
stats = {
    'Total Meetings': f"{len(df_meetings):,}",
    'Date Range': f"{df_meetings['date'].min().strftime('%Y-%m-%d')} to {df_meetings['date'].max().strftime('%Y-%m-%d')}",
    'Avg Attendees': f"{df_meetings['num_attendees'].mean():.1f}",
    'Avg Decisions': f"{df_meetings['num_decisions'].mean():.1f}",
    'Avg Topics': f"{df_meetings['num_topics'].mean():.1f}",
    'Total Text': f"{df_meetings['raw_text'].str.len().sum():,} characters"
}

for key, value in stats.items():
    print(f"{key:15}: {value}")

# Meeting types distribution
print(f"\nMeeting Types:")
meeting_types = df_meetings['meeting_type'].value_counts()
for meeting_type, count in meeting_types.head().items():
    print(f"  {meeting_type:12}: {count:,} meetings")

## 2. Vector Embeddings

Transform meeting text into vector embeddings for semantic search.
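
Before embedding, each meeting's text is split into smaller chunks. The project's actual chunking logic lives in `create_embeddings_pipeline` and is not shown in this notebook, so the following is only a minimal sketch of one common approach, overlapping character windows. The helper name `chunk_text` and the `chunk_size` and `overlap` values are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input standing in for one meeting's raw_text
sample = "The Committee discussed credit conditions. " * 60
pieces = chunk_text(sample, chunk_size=500, overlap=100)
print(f"{len(pieces)} chunks, sizes: {[len(p) for p in pieces]}")
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.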

In [None]:
# Check for existing embeddings
embeddings_dir = processed_dir / 'embeddings'
embeddings_file = embeddings_dir / 'embeddings.npy'
chunks_file = embeddings_dir / 'document_chunks.json'

# Initialize chunks_data
chunks_data = None

if embeddings_file.exists() and chunks_file.exists():
    print("📁 Loading existing embeddings...")
    
    # Load existing embeddings and chunks
    embeddings = np.load(embeddings_file)
    with open(chunks_file, 'r') as f:
        chunks_data = json.load(f)
    
    print(f"✓ Loaded {len(chunks_data):,} chunks")
    print(f"✓ Embedding dimension: {embeddings.shape[1]}")
    print(f"✓ Model: {config['embedding']['model']}")
    
else:
    print("🔨 Building embeddings from meeting text...")
    print("   This process may take 5-10 minutes")
    
    # Create embedding pipeline
    pipeline = create_embeddings_pipeline(config)
    
    # Process meetings into chunks
    print("   Step 1: Chunking meeting text...")
    chunks = pipeline.process_meetings_dataframe(df_meetings)
    print(f"   ✓ Created {len(chunks):,} document chunks")
    
    # Generate embeddings
    print("   Step 2: Generating vector embeddings...")
    chunks, embeddings = pipeline.generate_embeddings_for_chunks(chunks)
    print(f"   ✓ Generated embeddings: {embeddings.shape}")
    
    # Save results
    print("   Step 3: Saving to disk...")
    pipeline.save_processed_data(chunks, embeddings, str(embeddings_dir))
    
    # Load saved data for consistency
    with open(chunks_file, 'r') as f:
        chunks_data = json.load(f)
    
    print(f"✓ Embeddings saved to {embeddings_dir}")
    print(f"✓ Ready for semantic search with {len(chunks_data):,} chunks")

In [None]:
# Examine sample chunk structure
if chunks_data and len(chunks_data) > 0:
    sample_chunk = chunks_data[0]
    
    print("📝 Sample Document Chunk Structure:\n")
    print(f"Chunk ID: {sample_chunk['chunk_id']}")
    print(f"Meeting:  {sample_chunk['filename']} ({sample_chunk['date'][:10]})")
    print(f"Type:     {sample_chunk['meeting_type']}")
    print(f"Topics:   {', '.join(sample_chunk['topics'][:3]) if sample_chunk['topics'] else 'None'}")
    print(f"\nText Preview ({len(sample_chunk['chunk_text'])} chars):")
    print("─" * 50)
    print(sample_chunk['chunk_text'][:300] + "...")
    print("─" * 50)
else:
    print("❌ No chunk data available")

## 3. Database Creation

Load chunks into ChromaDB vector database for fast semantic search.
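
Conceptually, a similarity search scores the query vector against every stored chunk vector and returns the closest matches. The sketch below does this by brute force with cosine similarity on toy 3-dimensional vectors. It is not how the project's database is configured: the distance metric used here is an assumption, the chunk IDs are made up, and a vector database would normally use an index rather than an exhaustive scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunk IDs with toy embeddings
toy_index = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}
ranking = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for chunk_id, score in ranking:
    print(f"{chunk_id}: {score:.4f}")
```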

In [None]:
# Initialize vector database
if not chunks_data:
    print("❌ Cannot create database without chunk data")
    raise RuntimeError("Chunks data not available")

print("🗄️  Initializing ChromaDB vector database...")
vector_db = create_vector_db(config, reset=False)

# Check database status
stats = vector_db.get_collection_stats()
print(f"\n📊 Database Statistics:")
print(f"  Total chunks: {stats['total_chunks']:,}")
print(f"  Date range: {stats['date_range']['earliest']} to {stats['date_range']['latest']}")
print(f"  Meeting types: {len(stats['meeting_types'])}")
print(f"  Collection: {stats['collection_name']}")
print(f"  Model: {stats['embedding_model']}")

In [None]:
# Load chunks into database if empty
if stats['total_chunks'] == 0:
    print("üì• Loading chunks into database...")
    print("   This may take 2-3 minutes")
    
    # Convert to DocumentChunk objects
    from src.phase2_knowledge_base.vector_embeddings import DocumentChunk
    
    chunks_objects = []
    for chunk_data in chunks_data:
        chunk = DocumentChunk(
            chunk_id=chunk_data['chunk_id'],
            meeting_id=chunk_data['meeting_id'],
            filename=chunk_data['filename'],
            date=datetime.fromisoformat(chunk_data['date']) if chunk_data['date'] else None,
            chunk_text=chunk_data['chunk_text'],
            chunk_index=chunk_data['chunk_index'],
            total_chunks=chunk_data['total_chunks'],
            meeting_type=chunk_data['meeting_type'],
            attendees=chunk_data['attendees'],
            topics=chunk_data['topics'],
            decisions_summary=chunk_data['decisions_summary'],
            page_references=chunk_data['page_references']
        )
        chunks_objects.append(chunk)
    
    # Add to database in batches
    vector_db.add_document_chunks(chunks_objects, batch_size=100)
    
    # Verify loading
    stats = vector_db.get_collection_stats()
    print(f"✓ Database populated with {stats['total_chunks']:,} chunks")
    
else:
    print("✓ Database already contains data")

print(f"\n🚀 Vector database ready for semantic search!")

## 4. Semantic Search Examples

Demonstrate various search capabilities of the knowledge base.

In [None]:
# Initialize search interface
if stats['total_chunks'] > 0:
    search = create_search_interface(config)
    print("🔍 Semantic search interface initialized")
    print(f"   Ready to search {stats['total_chunks']:,} document chunks")
else:
    print("❌ Cannot initialize search - database is empty")
    raise RuntimeError("Database not populated")

### Basic Semantic Search

In [None]:
# Example 1: Monetary Policy Search
print("🎯 Example 1: Monetary Policy Discussions\n")

query = "interest rates and monetary policy decisions"
results = search.search(query, max_results=3)

print(f"Query: '{query}'")
print(f"Found: {results['total_results']} results\n")

for i, result in enumerate(results['results'], 1):
    print(f"📄 Result {i}:")
    print(f"   Meeting: {result['filename']} ({result['date'][:10]})")
    print(f"   Similarity: {result['similarity_score']:.4f}")
    print(f"   Topics: {', '.join(result['topics'][:3]) if result['topics'] else 'N/A'}")
    print(f"   Preview: {result['chunk_text'][:150]}...")
    print()

In [None]:
# Example 2: International Finance
print("🌍 Example 2: International Financial Coordination\n")

query = "international monetary cooperation foreign exchange intervention"
results = search.search(query, max_results=3)

print(f"Query: '{query}'")
print(f"Found: {results['total_results']} results\n")

for i, result in enumerate(results['results'], 1):
    print(f"🌐 Result {i}: {result['filename']} ({result['date'][:10]})")
    print(f"   Similarity: {result['similarity_score']:.4f}")
    print(f"   Text: {result['chunk_text'][:200]}...")
    print()

### Date-Filtered Search
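
A date filter restricts results to chunks whose meeting date falls inside a window. How `search.search` applies `date_range` internally is not shown in this notebook, so the following is only a sketch of the idea, assuming ISO-formatted date strings like those in the chunk metadata and inclusive bounds. The helper `in_date_range` and the toy filenames are hypothetical.

```python
from datetime import date


def in_date_range(iso_timestamp, start: str, end: str) -> bool:
    """True if the timestamp's calendar day lies in [start, end], inclusive."""
    if not iso_timestamp:
        return False  # chunks without a date never match a date filter
    day = date.fromisoformat(iso_timestamp[:10])
    return date.fromisoformat(start) <= day <= date.fromisoformat(end)


# Toy records shaped like search results (filenames are made up)
toy_results = [
    {"filename": "a.txt", "date": "1971-08-16T00:00:00"},
    {"filename": "b.txt", "date": "1970-01-05T00:00:00"},
    {"filename": "c.txt", "date": None},
]
kept = [
    r for r in toy_results
    if in_date_range(r["date"], "1971-07-01", "1972-03-31")
]
print([r["filename"] for r in kept])
```

Parsing to `date` objects rather than comparing raw strings keeps the comparison correct even if some timestamps carry a time component.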

In [None]:
# Example 3: Nixon Shock Period Analysis
print("⚡ Example 3: Inflation Concerns Around Nixon Shock\n")

query = "inflation price stability wage controls"
date_range = ("1971-07-01", "1972-03-31")

results = search.search(
    query=query,
    date_range=date_range,
    max_results=5
)

print(f"Query: '{query}'")
print(f"Period: {date_range[0]} to {date_range[1]} (Nixon Shock era)")
print(f"Found: {results['total_results']} results")

if results['results']:
    print("\n📅 Timeline of Results:")
    dates = sorted(set(r['date'][:10] for r in results['results'] if r['date']))
    for date in dates[:5]:
        # Skip results with a missing date, matching the filter used to build `dates`
        count = sum(1 for r in results['results'] if r['date'] and r['date'][:10] == date)
        print(f"   {date}: {count} relevant chunks")
        
    print(f"\n🔍 Top Result:")
    top = results['results'][0]
    print(f"   {top['filename']} ({top['date'][:10]})")
    print(f"   {top['chunk_text'][:250]}...")

### Topic-Based Search
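
Each chunk carries a `topics` list in its metadata, so one simple way to think about topic search is as a metadata filter followed by a tally by year. The internals of `search.search_by_topic` are not shown here. The sketch below is an assumption about the general idea, using hypothetical helpers `filter_by_topic` and `count_by_year` on toy records.

```python
from collections import Counter


def filter_by_topic(chunks: list[dict], topic_key: str) -> list[dict]:
    """Keep chunks whose `topics` list contains topic_key."""
    return [c for c in chunks if topic_key in (c.get("topics") or [])]


def count_by_year(chunks: list[dict]) -> dict[str, int]:
    """Count chunks per calendar year, skipping chunks with no date."""
    years = (c["date"][:4] for c in chunks if c.get("date"))
    return dict(sorted(Counter(years).items()))


# Toy chunk metadata (values are made up for illustration)
toy_chunks = [
    {"date": "1966-03-01T00:00:00", "topics": ["monetary_policy"]},
    {"date": "1966-09-13T00:00:00", "topics": ["monetary_policy", "banking_regulation"]},
    {"date": "1971-08-24T00:00:00", "topics": ["international_finance"]},
    {"date": None, "topics": ["monetary_policy"]},
    {"date": "1972-01-11T00:00:00", "topics": None},
]
matches = filter_by_topic(toy_chunks, "monetary_policy")
print(len(matches), count_by_year(matches))
```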

In [None]:
# Example 4: Topic Analysis
print("📊 Example 4: Topic-Based Search Results\n")

topics = {
    'Monetary Policy': 'monetary_policy',
    'Banking Regulation': 'banking_regulation', 
    'International Finance': 'international_finance'
}

topic_results = {}
for topic_name, topic_key in topics.items():
    results = search.search_by_topic(topic_key, max_results=10)
    topic_results[topic_name] = results['total_results']
    
    # Show year distribution for this topic
    if results['results']:
        years = [r['date'][:4] for r in results['results'] if r['date']]
        year_dist = pd.Series(years).value_counts().sort_index()
        
        print(f"📈 {topic_name}:")
        print(f"   Total results: {results['total_results']}")
        print(f"   Year distribution: {dict(year_dist.head(3))}")
        print()

# Visualize topic distribution
plt.figure(figsize=(10, 6))
# Convert dict views to lists so the call also works on older matplotlib versions
plt.bar(list(topic_results.keys()), list(topic_results.values()), color=['steelblue', 'darkgreen', 'darkred'])
plt.title('Fed Minutes: Topic Search Results Distribution')
plt.ylabel('Number of Relevant Chunks')
plt.xticks(rotation=15)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## Summary

### 🎯 Knowledge Base Capabilities

This Fed Minutes Knowledge Base provides:

**✅ Core Features:**
- **Semantic search** across 1,100+ Fed meeting minutes
- **Date-filtered searches** for specific time periods
- **Topic-based analysis** with predefined categories
- **Sub-second search performance** with rich metadata

**🔬 Research Applications:**
- Historical policy analysis during critical periods
- Decision-making pattern recognition
- Institutional behavior studies
- Economic event impact assessment
- Cross-temporal policy comparison

**🚀 Next Phase:**
Phase 3 will add AI-powered analysis including:
- RAG (Retrieval-Augmented Generation) for intelligent Q&A
- Automated insight generation and report creation
- Advanced pattern recognition and anomaly detection
- Natural language research query interface

---
*The knowledge base successfully transforms static historical documents into an intelligent research platform for understanding Federal Reserve decision-making during the pivotal 1965-1973 period.*