# 🫧 Bubblebot Framework - Document Processing Demo

This notebook demonstrates the document processing capabilities of the Bubblebot chatbot framework.

## Features Demonstrated:
- Multi-format document processing (PDF, DOCX, TXT); the examples below use TXT files
- Intelligent text chunking with overlap
- Metadata extraction and tenant isolation
- Performance metrics and statistics
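
To make "chunking with overlap" concrete, here is a minimal sketch of a fixed-size character window that advances by `chunk_size - chunk_overlap`. The helper `chunk_text` is hypothetical and written only for illustration; the actual `DocumentProcessor` may split on sentence or paragraph boundaries instead, so treat this as an assumption about the general technique rather than the framework's implementation.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 300, chunk_overlap: int = 75) -> List[str]:
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration; not part of the Bubblebot API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(700))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

With this scheme the last `chunk_overlap` characters of one chunk are repeated at the start of the next, which is what lets a sentence cut by a chunk boundary remain readable in at least one chunk.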

---

In [None]:
import sys
import os
import asyncio
import tempfile
from pathlib import Path
import json
from typing import List

# Add the project root (the notebook directory's parent) to the Python path so `app` is importable
sys.path.append(str(Path().resolve().parent))

from app.services.document_processor import DocumentProcessor, ProcessingResult

print("✅ Imports successful!")
print(f"📁 Working directory: {os.getcwd()}")

## Initialize Document Processor

Create a DocumentProcessor instance with custom settings for demonstration.
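
As a rough guide to what the settings in the next cell imply, the sketch below estimates how many chunks a plain sliding window would produce for a given text length. The function `estimate_chunk_count` is hypothetical, and the formula assumes a pure character window; a processor that respects sentence boundaries would give somewhat different counts.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate chunk count for a plain sliding window (an assumption, not Bubblebot's logic)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((text_length - chunk_size) / step)


overlap_ratio = 75 / 300
print(f"Overlap ratio: {overlap_ratio:.0%}")
for length in (200, 700, 2500):
    print(length, estimate_chunk_count(length, 300, 75))
```

A larger overlap preserves more context across boundaries but shrinks the step, so the same document yields more chunks to store and search.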

In [None]:
# Initialize processor with demo-friendly settings
processor = DocumentProcessor(
    chunk_size=300,      # Smaller chunks for demo
    chunk_overlap=75,    # 25% overlap
    max_file_size_mb=10  # 10MB limit
)

print("🫧 Bubblebot Document Processor initialized!")
print(f"📏 Chunk size: {processor.chunk_size} characters")
print(f"🔄 Overlap: {processor.chunk_overlap} characters")
print(f"📦 Max file size: {processor.max_file_size_bytes / (1024*1024):.1f} MB")

## Create Sample Real Estate Documents

Let's create some realistic real estate documents to demonstrate the processing capabilities.

In [None]:
# Sample real estate FAQ content
real_estate_faq = """
REAL ESTATE FAQ - Your Complete Guide

BUYING A HOME

Q: What's the first step in buying a home?
A: Get pre-approved for a mortgage! This involves submitting financial documents to a lender who will determine how much you can borrow. Pre-approval gives you a clear budget and shows sellers you're a serious buyer.

Q: How much should I save for a down payment?
A: Most conventional loans require 10-20% down, but FHA loans allow as little as 3.5%. Don't forget closing costs (2-5% of purchase price), moving expenses, and an emergency fund for unexpected repairs.

Q: What is a home inspection and do I need one?
A: A home inspection is a thorough examination of the property by a licensed inspector. They check structural elements, electrical, plumbing, HVAC, roof, and more. While not always required, it's highly recommended to avoid costly surprises.

Q: How long does it take to buy a house?
A: From contract to closing typically takes 30-45 days for financed purchases. Cash buyers can close in 1-2 weeks. Factors affecting timeline include loan processing, inspections, appraisals, and title searches.

SELLING A HOME

Q: When is the best time to sell my home?
A: Spring and early summer typically see the most buyer activity. However, the "best" time depends on your local market conditions, personal circumstances, and current inventory levels.

Q: How do I determine my home's value?
A: Get a Comparative Market Analysis (CMA) from a real estate agent, or hire a professional appraiser. Online estimates are starting points but may not reflect recent sales or unique features.

Q: What repairs should I make before selling?
A: Focus on high-impact, low-cost improvements: fresh paint, deep cleaning, minor repairs, and enhanced curb appeal. Major renovations rarely provide full return on investment.

WORKING WITH AGENTS

Q: Do I need a real estate agent?
A: While not legally required, agents provide market expertise, negotiation skills, and handle complex paperwork. They can save you time and potentially money, especially for first-time buyers or sellers.

Q: How do I choose the right agent?
A: Look for local market experience, recent sales activity, good communication skills, and client references. Interview 2-3 agents and ask about their marketing strategy and commission structure.
""".strip()

# Sample property listing content
property_listing = """
LUXURY DOWNTOWN CONDO - 123 Main Street, Unit 45

PROPERTY DETAILS
Price: $875,000
Bedrooms: 2
Bathrooms: 2.5
Square Feet: 1,450
HOA Fee: $450/month
Year Built: 2018

DESCRIPTION
Stunning modern condo in the heart of downtown! This beautifully designed unit features floor-to-ceiling windows with city views, hardwood floors throughout, and a gourmet kitchen with quartz countertops and stainless steel appliances.

The master suite includes a walk-in closet and spa-like bathroom with dual vanities. The second bedroom is perfect for guests or a home office. Enjoy the private balcony overlooking the city skyline.

BUILDING AMENITIES
- 24/7 concierge service
- Rooftop deck with grills and seating
- Fitness center with yoga studio
- Business center and conference room
- Secure parking garage
- Pet-friendly with dog run

LOCATION HIGHLIGHTS
Walk to restaurants, shopping, and entertainment! Just 2 blocks from Metro station and 5 minutes to the financial district. Easy access to highways for commuting.

RECENT UPDATES
- New HVAC system (2023)
- Updated lighting fixtures
- Fresh paint throughout
- Professional deep cleaning

Contact Sarah Johnson, your downtown specialist, for a private showing!
Phone: (555) 123-4567
Email: sarah.johnson@realty.com
""".strip()

print("📝 Sample documents created!")
print(f"📊 FAQ document: {len(real_estate_faq)} characters")
print(f"🏠 Listing document: {len(property_listing)} characters")

## Process Documents

Now let's process these documents and see how they get chunked and analyzed.

In [None]:
async def process_sample_documents():
    """Process our sample documents and return results."""
    results = []
    temp_files = []
    
    documents = [
        ("real_estate_faq.txt", real_estate_faq, "agent_sarah_123"),
        ("luxury_condo_listing.txt", property_listing, "agent_sarah_123")
    ]
    
    try:
        for filename, content, tenant_id in documents:
            # Create temporary file
            with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8') as f:
                f.write(content)
                temp_file = Path(f.name)
                temp_files.append(temp_file)
            
            print(f"\n🔄 Processing: {filename}")
            result = await processor.process_file(temp_file, tenant_id)
            
            if result.success:
                print(f"✅ Success! Generated {result.total_chunks} chunks in {result.processing_time_seconds:.2f}s")
                print(f"📊 Total words: {result.total_words}")
            else:
                print(f"❌ Failed: {result.error_message}")
            
            results.append((filename, result))
    
    finally:
        # Clean up temp files
        for temp_file in temp_files:
            if temp_file.exists():
                temp_file.unlink()
    
    return results

# Run the processing
processing_results = await process_sample_documents()

## Analyze Chunks

Let's examine the chunks created from our documents to understand how the text was segmented.
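
Each chunk carries a `tenant_id` in its metadata. The sketch below shows one way that tag could be used to keep tenants apart when inspecting or querying chunks. `DemoChunk` and `group_by_tenant` are hypothetical names that only mirror the `content` and `metadata` fields used in this notebook; how Bubblebot enforces isolation internally is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DemoChunk:
    """Hypothetical stand-in exposing the `content` and `metadata` fields used in this notebook."""
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def group_by_tenant(chunks: List[DemoChunk]) -> Dict[str, List[DemoChunk]]:
    """Bucket chunks by the tenant_id stored in their metadata."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.metadata.get("tenant_id", "N/A")].append(chunk)
    return dict(grouped)


demo_chunks = [
    DemoChunk("Q: What's the first step in buying a home?", {"tenant_id": "agent_sarah_123"}),
    DemoChunk("Short document.", {"tenant_id": "test_tenant"}),
    DemoChunk("Price: $875,000", {"tenant_id": "agent_sarah_123"}),
]
by_tenant = group_by_tenant(demo_chunks)
for tenant_id, tenant_chunks in by_tenant.items():
    print(tenant_id, len(tenant_chunks))
```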

In [None]:
def analyze_chunks(filename: str, result: ProcessingResult):
    """Analyze and display chunk information."""
    if not result.success:
        print(f"❌ {filename} processing failed: {result.error_message}")
        return
    
    print(f"\n📋 CHUNK ANALYSIS: {filename}")
    print("=" * 50)
    
    for i, chunk in enumerate(result.chunks):
        print(f"\n🧩 Chunk {i + 1}/{len(result.chunks)}")
        print(f"📏 Length: {len(chunk.content)} characters")
        print(f"📝 Words: {chunk.word_count}")
        print(f"🏷️ Tenant: {chunk.metadata.get('tenant_id', 'N/A')}")
        
        # Show first 100 characters of content
        preview = chunk.content[:100].replace('\n', ' ')
        if len(chunk.content) > 100:
            preview += "..."
        print(f"📖 Preview: {preview}")
        
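        # Show remaining metadata (tenant_id is already printed above)
        relevant_metadata = {k: v for k, v in chunk.metadata.items() if k != 'tenant_id'}
        if relevant_metadata:
            # default=str keeps non-JSON values (e.g. dates or paths) from raising TypeError
            print(f"🏷️ Metadata: {json.dumps(relevant_metadata, indent=2, default=str)}")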

# Analyze each processed document
for filename, result in processing_results:
    analyze_chunks(filename, result)

## Generate Processing Statistics

Let's look at overall processing performance and statistics.
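
For readers who want to see how such statistics could be derived, here is a minimal sketch that aggregates a list of results into the same kind of summary. `DemoResult` and `summarize` are hypothetical; they mirror the `ProcessingResult` fields used in this notebook, and the real `get_processing_stats` may compute its values differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DemoResult:
    """Hypothetical stand-in mirroring the ProcessingResult fields used in this notebook."""
    success: bool
    total_chunks: int = 0
    total_words: int = 0
    processing_time_seconds: float = 0.0


def summarize(results: List[DemoResult]) -> Dict[str, float]:
    """Aggregate counts and timings; returns an empty dict when there are no results."""
    if not results:
        return {}
    ok = [r for r in results if r.success]
    total_time = sum(r.processing_time_seconds for r in ok)
    total_words = sum(r.total_words for r in ok)
    return {
        "total_documents": len(results),
        "successful_documents": len(ok),
        "failed_documents": len(results) - len(ok),
        "success_rate": 100.0 * len(ok) / len(results),
        "total_chunks": sum(r.total_chunks for r in ok),
        "total_words": total_words,
        "average_processing_time": total_time / len(ok) if ok else 0.0,
        # Guard against a zero elapsed time, which is plausible for tiny files.
        "words_per_second": total_words / total_time if total_time > 0 else 0.0,
    }


demo_results = [
    DemoResult(True, total_chunks=8, total_words=400, processing_time_seconds=0.02),
    DemoResult(True, total_chunks=4, total_words=200, processing_time_seconds=0.01),
    DemoResult(False),
]
demo_stats = summarize(demo_results)
print(demo_stats)
```

Computing throughput from the summed elapsed time, with an explicit zero check, avoids a `ZeroDivisionError` when every document is processed faster than the timer resolution.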

In [None]:
# Extract just the ProcessingResult objects
results_only = [result for _, result in processing_results]

# Generate comprehensive stats
stats = processor.get_processing_stats(results_only)

print("📊 PROCESSING STATISTICS")
print("=" * 40)

if stats:
    print(f"📁 Total documents processed: {stats['total_documents']}")
    print(f"✅ Successful: {stats['successful_documents']}")
    print(f"❌ Failed: {stats['failed_documents']}")
    print(f"📈 Success rate: {stats['success_rate']:.1f}%")
    print(f"🧩 Total chunks generated: {stats['total_chunks']}")
    print(f"📝 Total words processed: {stats['total_words']:,}")
    print(f"⏱️ Average processing time: {stats['average_processing_time']:.3f} seconds")
    
    # Calculate additional insights
    if stats['total_chunks'] > 0:
        avg_words_per_chunk = stats['total_words'] / stats['total_chunks']
        # Approximate total time as average time x successful documents
        total_time = stats['average_processing_time'] * stats['successful_documents']

        print("\n📈 PERFORMANCE INSIGHTS")
        print(f"📊 Average words per chunk: {avg_words_per_chunk:.1f}")
        # Tiny files can finish in ~0 seconds; avoid dividing by zero
        if total_time > 0:
            print(f"🚀 Processing speed: {stats['total_words'] / total_time:.0f} words/second")
        else:
            print("🚀 Processing speed: too fast to measure")
else:
    print("No statistics available - no documents were processed.")

## Demonstrate Chunk Overlap

Let's examine how the overlap mechanism works to maintain context between chunks.
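
Shared words near a boundary can be coincidental (common words such as "a" or "the" appear everywhere), so a stricter check is to measure the exact text two neighbouring chunks share. The sketch below finds the longest suffix of one string that is also a prefix of the next. `shared_boundary` is a hypothetical helper, and it assumes the chunker repeats text verbatim; if chunks are stripped or re-normalized, an exact match may not be found even though overlap exists.

```python
def shared_boundary(previous: str, following: str, max_overlap: int = 150) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `following`."""
    limit = min(max_overlap, len(previous), len(following))
    # Try the longest candidate first so the first hit is the maximal overlap.
    for size in range(limit, 0, -1):
        if previous[-size:] == following[:size]:
            return previous[-size:]
    return ""


first = "Get pre-approved for a mortgage before you start touring homes."
second = "before you start touring homes. Sellers take pre-approved buyers seriously."
overlap = shared_boundary(first, second)
print(len(overlap), repr(overlap))
```

Applied to real output, this could be called as `shared_boundary(result.chunks[i].content, result.chunks[i + 1].content)` and the length compared against `processor.chunk_overlap`.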

In [None]:
def demonstrate_overlap(result: ProcessingResult, max_chunks_to_show: int = 3):
    """Show how chunks overlap to maintain context."""
    if not result.success or len(result.chunks) < 2:
        print("Not enough chunks to demonstrate overlap.")
        return
    
    print("🔄 CHUNK OVERLAP DEMONSTRATION")
    print("=" * 45)
    
    chunks_to_analyze = min(max_chunks_to_show, len(result.chunks) - 1)
    
    for i in range(chunks_to_analyze):
        chunk1 = result.chunks[i]
        chunk2 = result.chunks[i + 1]
        
        print(f"\n🧩 Analyzing overlap between Chunk {i+1} and Chunk {i+2}")
        
        # Get last 100 chars of first chunk and first 100 chars of second chunk
        chunk1_end = chunk1.content[-100:].strip()
        chunk2_start = chunk2.content[:100].strip()
        
        print(f"📄 Chunk {i+1} ending: ...{chunk1_end}")
        print(f"📄 Chunk {i+2} beginning: {chunk2_start}...")
        
        # Find common words
        words1 = set(chunk1_end.lower().split())
        words2 = set(chunk2_start.lower().split())
        common_words = words1.intersection(words2)
        
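        if common_words:
            # Shared words are only a rough hint: frequent words can match by coincidence
            print(f"🔗 Words shared across the boundary (suggests overlap): {', '.join(sorted(common_words))}")
        else:
            print("ℹ️ No shared words found in these 100-character samples")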

# Demonstrate overlap for the FAQ document (likely to have multiple chunks)
for filename, result in processing_results:
    if "faq" in filename.lower() and result.success:
        demonstrate_overlap(result)
        break

## Test Edge Cases

Let's test how the processor handles various edge cases.
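
Whether the processor reports empty or whitespace-only input as a failure or as a success with zero chunks is not stated in this notebook. One way to make expectations explicit before running the tests is a small pre-check that labels each input. `describe_input` is a hypothetical helper for illustration, not part of the Bubblebot API.

```python
def describe_input(content: str) -> str:
    """Label a text input by the edge case it represents."""
    if content == "":
        return "empty"
    if not content.strip():
        return "whitespace_only"
    if not content.isascii():
        return "non_ascii"
    return "plain"


samples = {
    "empty.txt": "",
    "whitespace.txt": "   \n\n\t  \n  ",
    "short.txt": "Short document.",
    "special_chars.txt": "Document with émojis 🏠 and spéciàl characters!",
}
labels = {name: describe_input(text) for name, text in samples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```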

In [None]:
async def test_edge_cases():
    """Test various edge cases for document processing."""
    print("🧪 TESTING EDGE CASES")
    print("=" * 30)
    
    test_cases = [
        ("empty.txt", "", "Empty document"),
        ("whitespace.txt", "   \n\n\t  \n  ", "Whitespace only"),
        ("short.txt", "Short document.", "Very short document"),
        ("special_chars.txt", "Document with émojis 🏠 and spéciàl characters!", "Special characters"),
    ]
    
    temp_files = []
    
    try:
        for filename, content, description in test_cases:
            # Create temp file
            with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8') as f:
                f.write(content)
                temp_file = Path(f.name)
                temp_files.append(temp_file)
            
            print(f"\n🧪 Testing: {description}")
            result = await processor.process_file(temp_file, "test_tenant")
            
            if result.success:
                print(f"✅ Success: {result.total_chunks} chunks, {result.total_words} words")
            else:
                print(f"❌ Failed: {result.error_message}")
    
    finally:
        # Cleanup
        for temp_file in temp_files:
            if temp_file.exists():
                temp_file.unlink()

await test_edge_cases()

## Summary

This notebook demonstrated the core document processing capabilities of the Bubblebot framework:

### ✅ Features Demonstrated:
1. **Multi-format Processing** - The processor targets TXT, PDF, and DOCX files; only TXT was exercised in this notebook
2. **Intelligent Chunking** - Splits large documents while maintaining context
3. **Overlap Management** - Ensures context preservation between chunks
4. **Tenant Isolation** - Each chunk is tagged with its `tenant_id` in metadata, the basis for multi-tenant data separation
5. **Performance Metrics** - Detailed processing statistics
6. **Error Handling** - Graceful handling of edge cases

### 🚀 Next Steps:
- Vector embeddings generation (Day 3-4)
- Semantic search and retrieval (Day 3-4)
- Conversation engine integration (Day 5-7)

### 💡 Business Applications:
- Real estate agent FAQ processing
- Property listing content extraction
- MLS data integration
- Client document analysis

The document processing pipeline is now ready to serve as the foundation for our AI-powered chatbot system! 🫧