# Flatfile Chat Database - Interactive Demo

Welcome to the comprehensive demo of the Flatfile Chat Database System! This notebook will walk you through all the key features of this file-based storage solution for AI chat applications.

## What you'll learn:
- 💾 **Chat Storage**: Store messages, sessions, and user profiles
- 📄 **Document Processing**: Add documents and create embeddings
- 🔍 **Vector Search**: Semantic similarity search
- 🔎 **Advanced Search**: Text-based search with filters
- ⚙️ **Configuration**: The new modular configuration system
- 🧠 **PrismMind Integration**: Enhanced document processing

Let's get started!

## 1. Setup and Imports

In [7]:
import sys
import os
from pathlib import Path
import asyncio
from datetime import datetime
import json

# Add parent directory to path to import the flatfile database
sys.path.append('..')

# Add PrismMind directory to path (if available)
prismmind_path = '/home/markly2/prismmind'
if os.path.exists(prismmind_path):
    sys.path.append(prismmind_path)
    print(f"✅ Added PrismMind path: {prismmind_path}")
else:
    print(f"⚠️ PrismMind not found at: {prismmind_path}")

# Import the main components - NO LEGACY ADAPTER
from ff_storage_manager import FFStorageManager
from ff_class_configs.ff_configuration_manager_config import FFConfigurationManagerConfigDTO, load_config
from ff_class_configs.ff_chat_entities_config import (
    FFMessageDTO, FFSessionDTO, FFDocumentDTO, FFUserProfileDTO, MessageRole
)
from ff_search_manager import FFSearchManager, FFSearchQueryDTO
from ff_vector_storage_manager import FFVectorStorageManager
from ff_document_processing_manager import FFDocumentProcessingManager

print("✅ All imports successful!")
print(f"📁 Working directory: {os.getcwd()}")

✅ Added PrismMind path: /home/markly2/prismmind
✅ All imports successful!
📁 Working directory: /home/markly2/claude_code/flatfile_chat_database_v2/demo


## 2. Configuration Setup

Let's configure the database for our demo. We'll store everything in a dedicated `./demo_data` directory to avoid interfering with any existing data.

In [8]:
# Create a demo configuration using the new configuration system
demo_data_path = Path("./demo_data")
demo_data_path.mkdir(exist_ok=True)

# Initialize configuration using the new system
config = FFConfigurationManagerConfigDTO()
config.storage.base_path = str(demo_data_path)
config.storage.enable_compression = False  # Disable for easier inspection
config.locking.enable_file_locking = True

print(f"📍 Demo data will be stored in: {config.storage.base_path}")
print(f"🔒 File locking enabled: {config.locking.enable_file_locking}")
print(f"📊 Compression enabled: {config.storage.enable_compression}")

📍 Demo data will be stored in: demo_data
🔒 File locking enabled: True
📊 Compression enabled: False
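
The configuration above turns on `enable_file_locking`, but this notebook does not show how the library implements it. As an illustration only, the sketch below uses a sidecar lock file created with `O_CREAT | O_EXCL`, one common way to keep two writers off the same flat file. The helper name `exclusive_lock` and the `.lock` suffix are hypothetical and not part of the Flatfile Chat Database API.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def exclusive_lock(target: Path):
    """Hold a sidecar '<name>.lock' file while the caller works on `target`.

    O_CREAT | O_EXCL makes creation fail if the lock file already exists,
    so only one holder can succeed at a time. (Hypothetical helper.)
    """
    lock_path = target.with_name(target.name + ".lock")
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "messages.jsonl"
    with exclusive_lock(data_file) as lock_path:
        held_during = lock_path.exists()
        try:
            # A second attempt while the lock is held should be refused
            with exclusive_lock(data_file):
                second_acquired = True
        except FileExistsError:
            second_acquired = False
    released_after = not lock_path.exists()

print(held_during, second_acquired, released_after)
```

A lock file like this is simple to inspect, but it is left behind if the process crashes while holding it, so real implementations usually add a timeout or stale-lock check.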


## 3. Initialize Storage Manager

The `FFStorageManager` is the main interface for all database operations.

In [9]:
# Initialize the storage manager
storage_manager = FFStorageManager(config)

print("✅ FFStorageManager initialized successfully!")
print(f"🏠 Base path: {storage_manager.config.storage.base_path}")
print(f"🔧 Backend type: {type(storage_manager.backend).__name__}")

✅ FFStorageManager initialized successfully!
🏠 Base path: demo_data
🔧 Backend type: FFFlatfileStorageBackend


## 4. User Management

Let's create some demo users and profiles.

In [10]:
# Create demo users with updated DTO classes
users = [
    {
        "user_id": "alice",
        "profile": FFUserProfileDTO(
            user_id="alice",
            username="Alice Johnson",
            preferences={"theme": "dark", "language": "en"},
            metadata={"role": "data_scientist", "department": "AI Research"}
        )
    },
    {
        "user_id": "bob",
        "profile": FFUserProfileDTO(
            user_id="bob",
            username="Bob Smith",
            preferences={"theme": "light", "language": "en"},
            metadata={"role": "developer", "department": "Engineering"}
        )
    }
]

# Store user profiles
for user in users:
    await storage_manager.store_user_profile(user["profile"])
    print(f"👤 Created user: {user['profile'].username} ({user['user_id']})")

print("\n✅ All users created successfully!")

👤 Created user: Alice Johnson (alice)
👤 Created user: Bob Smith (bob)

✅ All users created successfully!


## 5. Chat Sessions and Messages

Now let's create a chat session to demonstrate the core functionality. Note that this notebook creates the session only and does not add messages to it.

In [12]:
# Create a chat session for Alice
alice_session_id = await storage_manager.create_session(
    user_id="alice",
    title="AI Research Discussion"
)
alice_session = await storage_manager.get_session("alice", alice_session_id)


print(f"💬 Created session: {alice_session.title}")
print(f"🆔 Session ID: {alice_session.session_id}")
print(f"📅 Created at: {alice_session.created_at}")

💬 Created session: AI Research Discussion
🆔 Session ID: chat_session_20250731_173523_044126
📅 Created at: 2025-07-31T17:35:23.044133
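
The session ID printed above appears to embed its creation time (compare `20250731_173523_044126` with the `created_at` value). Assuming that layout holds, which is inferred from this single output rather than from documentation, a small helper can recover the timestamp from an ID. The function name `parse_session_timestamp` is hypothetical.

```python
from datetime import datetime

def parse_session_timestamp(session_id: str, prefix: str = "chat_session_") -> datetime:
    """Parse the timestamp that appears to be embedded in a session ID.

    Assumption: IDs look like 'chat_session_YYYYMMDD_HHMMSS_ffffff', as in the
    output above. This layout is inferred, not a documented guarantee.
    """
    if not session_id.startswith(prefix):
        raise ValueError(f"Unexpected session ID format: {session_id!r}")
    return datetime.strptime(session_id[len(prefix):], "%Y%m%d_%H%M%S_%f")

parsed = parse_session_timestamp("chat_session_20250731_173523_044126")
print(parsed.isoformat())  # 2025-07-31T17:35:23.044126
```

Because the timestamp sorts lexicographically in this layout, sorting the IDs as plain strings would also order sessions chronologically.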


## 6. Document Processing and RAG Pipeline

Let's add some documents to our session and process them for semantic search.

In [None]:
# Create sample documents using updated DTO classes
sample_documents = [
    {
        "filename": "machine_learning_guide.md",
        "content": """# Machine Learning Guide

## Data Preprocessing
Data preprocessing is a crucial step in machine learning that involves cleaning and transforming raw data into a format suitable for modeling.

### Handling Missing Values
- **Numerical data**: Use mean, median, or mode imputation
- **Categorical data**: Use most frequent category or create a separate 'missing' category
- **Advanced methods**: KNN imputation, iterative imputation

### Feature Scaling
Feature scaling ensures all features contribute equally to the model:
- **StandardScaler**: Scales features to have mean=0 and std=1
- **MinMaxScaler**: Scales features to a fixed range (usually 0-1)
- **RobustScaler**: Uses median and IQR, robust to outliers

### Encoding Categorical Variables
- **One-hot encoding**: Creates binary columns for each category
- **Label encoding**: Assigns numerical values to categories
- **Target encoding**: Uses target variable statistics
"""
    },
    {
        "filename": "deep_learning_basics.md",
        "content": """# Deep Learning Basics

## Neural Networks
Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes (neurons).

### Architecture Components
- **Input Layer**: Receives the input data
- **Hidden Layers**: Process the data through weighted connections
- **Output Layer**: Produces the final prediction

### Activation Functions
- **ReLU**: Rectified Linear Unit, most commonly used
- **Sigmoid**: Outputs values between 0 and 1
- **Tanh**: Outputs values between -1 and 1
- **Softmax**: Used in multi-class classification

### Training Process
1. **Forward Pass**: Input data flows through the network
2. **Loss Calculation**: Compare prediction with actual target
3. **Backward Pass**: Calculate gradients using backpropagation
4. **Weight Update**: Adjust weights using optimization algorithm
"""
    }
]

# Store documents in the session using updated DTO classes
stored_docs = []
for doc_data in sample_documents:
    # Create document object
    document = FFDocumentDTO(
        filename=doc_data["filename"],
        content=doc_data["content"],
        metadata={"type": "markdown", "topic": "machine_learning"}
    )
    
    # Store document
    doc_id = await storage_manager.store_document(
        alice_session.session_id, 
        "alice", 
        document
    )
    stored_docs.append((doc_id, document))
    print(f"📄 Stored document: {document.filename} (ID: {doc_id})")

print("\n✅ All documents stored successfully!")

## 7. Vector Storage and Embeddings

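Let's process our documents to create embeddings for semantic search. Note: this demo uses mock embeddings (seeded random vectors) for simplicity, so it exercises the storage and search APIs, but the resulting similarity scores do not reflect actual semantic closeness.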

In [None]:
# Initialize document pipeline
doc_pipeline = FFDocumentProcessingManager(config)

print("🧠 FFDocumentProcessingManager initialized")
print(f"⚙️ Using PrismMind integration: {doc_pipeline.use_prismmind}")

In [None]:
# For demo purposes, let's create mock embeddings
import hashlib
import numpy as np

def create_mock_embedding(text: str, dim: int = 384) -> list:
    """Create a mock embedding based on text hash for demo purposes."""
    # Built-in hash() is salted per process for strings, so derive the seed
    # from a stable digest to keep the "embedding" reproducible across runs
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")
    rng = np.random.default_rng(seed)
    return rng.normal(0, 1, dim).tolist()

# Process documents to create embeddings
vector_storage = FFVectorStorageManager(config)

for doc_id, document in stored_docs:
    # Simple chunking - split by paragraphs
    chunks = [chunk.strip() for chunk in document.content.split('\n\n') if chunk.strip()]
    
    # Create mock embeddings for each chunk
    embeddings = [create_mock_embedding(chunk) for chunk in chunks]
    
    # Store vectors
    success = await vector_storage.store_vectors(
        session_id=alice_session.session_id,
        document_id=doc_id,
        chunks=chunks,
        vectors=embeddings,
        metadata={"document_name": document.filename}
    )
    
    print(f"🔢 Created {len(embeddings)} embeddings for {document.filename}")
    print(f"📊 Vector storage success: {success}")

print("\n✅ Vector embeddings created and stored!")
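
The paragraph split used above turns a standalone heading such as `# Machine Learning Guide` into its own chunk, which carries little content to embed. As a sketch of one possible refinement (not the library's chunking strategy), short pieces can be folded into the chunk that follows them. The function name `chunk_by_paragraph` and the `min_chars` threshold are assumptions chosen for illustration.

```python
def chunk_by_paragraph(text: str, min_chars: int = 40) -> list[str]:
    """Split on blank lines, folding very short pieces into the next chunk."""
    chunks: list[str] = []
    carry = ""
    for piece in (p.strip() for p in text.split("\n\n")):
        if not piece:
            continue
        candidate = f"{carry}\n\n{piece}" if carry else piece
        if len(candidate) < min_chars:
            carry = candidate  # too short to embed alone; hold it for the next piece
        else:
            chunks.append(candidate)
            carry = ""
    if carry:
        # Trailing short piece: attach it to the last chunk, or keep it if it is all we have
        if chunks:
            chunks[-1] = f"{chunks[-1]}\n\n{carry}"
        else:
            chunks.append(carry)
    return chunks

sample = (
    "# Deep Learning Basics\n\n"
    "## Neural Networks\n"
    "Neural networks are computing systems inspired by biological neural networks.\n\n"
    "### Activation Functions\n"
    "- **ReLU**: Rectified Linear Unit, most commonly used"
)
chunks = chunk_by_paragraph(sample)
print(len(chunks))  # 2
```

Here the document title is merged into the "Neural Networks" chunk instead of being embedded on its own.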

## 8. Searching and Retrieval

Now let's demonstrate the search capabilities - both text-based and vector-based search.

In [None]:
# Initialize search engine
search_engine = FFSearchManager(config)

print("🔍 FFSearchManager initialized")

In [None]:
# Text-based search across messages and documents using updated DTO classes
text_query = FFSearchQueryDTO(
    query_text="data preprocessing",
    user_id="alice",
    session_id=alice_session.session_id,
    include_messages=True,
    include_documents=True
)

text_results = await search_engine.search(text_query)

print("📝 Text Search Results for 'data preprocessing':")
print(f"Found {len(text_results.results)} results")
for i, result in enumerate(text_results.results[:3]):  # Show first 3
    print(f"\n{i+1}. Type: {result.result_type}")
    print(f"   Score: {result.score:.3f}")
    print(f"   Content: {result.content[:100]}...")
    if result.metadata:
        print(f"   Metadata: {result.metadata}")
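
How `FFSearchManager` computes `result.score` is not shown in this notebook. Purely as an illustration of what a text-relevance score can look like, the sketch below ranks texts by the fraction of query terms they contain. The function `term_overlap_score` is hypothetical and is not the library's scoring method.

```python
import re

def term_overlap_score(query: str, text: str) -> float:
    """Fraction of distinct query terms that also occur in the text."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    if not query_terms:
        return 0.0
    text_terms = set(re.findall(r"\w+", text.lower()))
    return len(query_terms & text_terms) / len(query_terms)

docs = {
    "machine_learning_guide.md": "Data preprocessing is a crucial step in machine learning.",
    "deep_learning_basics.md": "Neural networks are computing systems inspired by biological neural networks.",
}

# Rank documents from most to least relevant for the query
ranked = sorted(
    ((name, term_overlap_score("data preprocessing", body)) for name, body in docs.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

A scorer like this ignores term frequency and word order; it is only meant to make the idea of a ranked, scored result list concrete.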

In [None]:
# Vector-based semantic search
search_text = "How do I handle missing values in my dataset?"
search_embedding = create_mock_embedding(search_text)

vector_results = await vector_storage.search_similar(
    session_id=alice_session.session_id,
    query_vector=search_embedding,
    top_k=5
)

print(f"🔢 Vector Search Results for: '{search_text}'")
print(f"Found {len(vector_results)} similar chunks")

for i, result in enumerate(vector_results):
    print(f"\n{i+1}. Similarity: {result.similarity:.3f}")
    print(f"   Document: {result.metadata.get('document_name', 'Unknown')}")
    print(f"   Chunk: {result.text[:150]}...")
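
The vector search above returns chunks ranked by a similarity value. The notebook does not show how `search_similar` computes it; a common choice is cosine similarity, sketched here with the standard library only. The helper names `cosine_similarity` and `top_k_similar` are hypothetical, and the random vectors stand in for stored chunk embeddings.

```python
import math
import random

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query: list[float], vectors: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (index, similarity) pairs for the k vectors closest to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]

rng = random.Random(0)
stored = [[rng.gauss(0, 1) for _ in range(384)] for _ in range(4)]  # four stand-in chunk vectors
query = list(stored[2])  # a query identical to chunk 2

results = top_k_similar(query, stored, k=2)
best_index, best_score = results[0]
print(best_index, round(best_score, 3))
```

With random vectors like the notebook's mock embeddings, only an exact match scores near 1.0; unrelated vectors in 384 dimensions score close to zero, which is why real embeddings are needed for meaningful rankings.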

## 9. Data Inspection

Let's examine the file structure that was created to understand how the data is stored.

In [None]:
def print_directory_tree(path, prefix="", max_depth=3, current_depth=0):
    """Print directory tree structure."""
    if current_depth > max_depth:
        return
    
    path = Path(path)
    if not path.exists():
        return
    
    items = list(path.iterdir())
    items.sort(key=lambda x: (x.is_file(), x.name))
    
    for i, item in enumerate(items):
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item.name}")
        
        if item.is_dir() and current_depth < max_depth:
            next_prefix = prefix + ("    " if is_last else "│   ")
            print_directory_tree(item, next_prefix, max_depth, current_depth + 1)

print("📁 Generated File Structure:")
print_directory_tree(demo_data_path)

In [None]:
# Let's examine a sample message file
messages_file = demo_data_path / "users" / "alice" / alice_session.session_id / "messages.jsonl"

if messages_file.exists():
    print("💬 Sample Messages File Content:")
    print(f"📄 File: {messages_file}")
    print("─" * 50)
    
    with open(messages_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()[:2]  # Show first 2 messages
        for i, line in enumerate(lines, 1):
            msg_data = json.loads(line)
            print(f"Message {i}:")
            print(f"  Role: {msg_data['role']}")
            print(f"  Content: {msg_data['content'][:100]}...")
            print(f"  Timestamp: {msg_data['timestamp']}")
            print()
else:
    # No messages were added earlier in this notebook, so the file may be absent
    print(f"ℹ️ No messages file found at: {messages_file}")

In [None]:
# Let's examine the session metadata
session_file = demo_data_path / "users" / "alice" / alice_session.session_id / "session.json"

if session_file.exists():
    print("📋 Session Metadata:")
    print(f"📄 File: {session_file}")
    print("─" * 50)
    
    with open(session_file, 'r', encoding='utf-8') as f:
        session_data = json.load(f)
        for key, value in session_data.items():
            print(f"  {key}: {value}")
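
The `messages.jsonl` file inspected above uses the JSON Lines layout: one JSON object per line, which allows appending a message without rewriting the file. The sketch below shows that pattern in isolation, using the `role`, `content`, and `timestamp` fields read in the earlier cell. The helpers `append_message` and `read_messages` are hypothetical and write to a temporary directory, not to the demo data.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def append_message(path: Path, role: str, content: str) -> None:
    """Append one message as a single JSON line."""
    record = {
        "role": role,
        "content": content,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_messages(path: Path) -> list[dict]:
    """Read every non-empty line back into a dict, in file order."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "messages.jsonl"
    append_message(log_path, "user", "How should I handle missing values?")
    append_message(log_path, "assistant", "Median imputation is a common starting point.")
    messages = read_messages(log_path)

for message in messages:
    print(f"{message['role']}: {message['content']}")
```

Appending keeps writes cheap as a conversation grows, and each line can be parsed independently, which also makes partial reads (such as "first 2 messages") straightforward.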

## 10. Configuration System Demo

Let's explore the new configuration system: first its defaults, then an environment-specific configuration.

In [None]:
# New configuration system only - no legacy needed
config_example = FFConfigurationManagerConfigDTO()
print("🔧 New Configuration System:")
print(f"  Base Path: {config_example.storage.base_path}")
print(f"  Max Message Size: {config_example.storage.max_message_size_bytes} bytes")
print(f"  Search Top K: {config_example.vector.search_top_k}")
print(f"  File Locking: {config_example.locking.enable_file_locking}")

In [None]:
# New modular configuration system
try:
    from ff_class_configs.ff_configuration_manager_config import FFConfigurationManagerConfigDTO
    
    new_config = FFConfigurationManagerConfigDTO.from_environment("development")
    print("\n⚙️ New Modular Configuration System:")
    print(f"  Environment: {new_config.environment}")
    print(f"  Storage Base Path: {new_config.storage.base_path}")
    print(f"  Search Default Limit: {new_config.search.default_limit}")
    print(f"  Vector Embedding Provider: {new_config.vector.default_embedding_provider}")
    print(f"  Document Max Size: {new_config.document.max_file_size_bytes / 1_048_576:.1f}MB")
    
    # Show configuration summary
    summary = new_config.get_summary()
    print("\n📊 Configuration Summary:")
    for domain, count in summary.items():
        print(f"  {domain}: {count} settings")
        
except ImportError as e:
    print(f"\n⚠️ New configuration system not available: {e}")
except AttributeError as e:
    print(f"\n⚠️ Configuration method not available: {e}")

## 11. Performance and Statistics

Let's gather some basic statistics about our demo session.

In [None]:
# Get session statistics
session_stats = await storage_manager.get_session_stats(alice_session.session_id, "alice")

print("📊 Session Statistics:")
print(f"  Messages: {session_stats.get('message_count', 0)}")
print(f"  Documents: {session_stats.get('document_count', 0)}")
print(f"  Total Size: {session_stats.get('total_size_bytes', 0)} bytes")
print(f"  Created: {alice_session.created_at}")
print(f"  Last Updated: {session_stats.get('last_updated', 'N/A')}")

In [None]:
# Calculate storage usage
import os

def get_directory_size(path):
    """Calculate total size of directory."""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            if os.path.exists(file_path):
                total_size += os.path.getsize(file_path)
    return total_size

total_size = get_directory_size(demo_data_path)
file_count = sum(len(files) for _, _, files in os.walk(demo_data_path))

print("💾 Storage Usage Summary:")
print(f"  Total Size: {total_size:,} bytes ({total_size / 1024:.1f} KB)")
print(f"  Total Files: {file_count}")
print(f"  Average File Size: {total_size / max(file_count, 1):.1f} bytes")

## 12. Cleanup (Optional)

Uncomment and run this cell if you want to clean up the demo data.

In [None]:
# Uncomment to clean up demo data
# import shutil
# shutil.rmtree(demo_data_path, ignore_errors=True)
# print(f"🧹 Cleaned up demo data from {demo_data_path}")

## 🎉 Demo Complete!

Congratulations! You've successfully explored the Flatfile Chat Database System. Here's what we demonstrated:

### ✅ Features Covered:
- **User Management**: Created user profiles with metadata
- **Chat Sessions**: Created a chat session and inspected its metadata
- **Document Processing**: Added documents and created embeddings
- **Search Capabilities**: Both text-based and vector-based search
- **File Storage**: Examined the generated file structure
- **Configuration**: Explored the new configuration system, including environment-specific settings
- **Statistics**: Gathered usage and performance metrics

### 🚀 Next Steps:
- Explore the CLI demo (`demo/cli_interactive_demo.py`)
- Try the automated demo script (`demo/automated_demo_script.py`)
- Experiment with your own data
- Integrate PrismMind for enhanced document processing

### 📚 Key Benefits:
- **No Database Required**: Pure file-based storage
- **Human Readable**: JSON/JSONL files for easy inspection
- **Scalable**: Efficient for both small and large datasets
- **Flexible**: Configurable for different use cases
- **Search Ready**: Built-in text and semantic search

Thank you for trying the Flatfile Chat Database System! 🙏