# Getting Started with Component Forge

Welcome to Component Forge! This notebook will guide you through the basics of working with the AI development environment.

## What You'll Learn

1. Environment setup and imports
2. Database connection and health checks
3. Exploring the sample data (users, documents, conversations)
4. Data visualization with matplotlib and seaborn
5. Raw SQL queries against the database

## Prerequisites

- Docker services running (`docker-compose up -d`)
- Database migrated (`alembic upgrade head`)
- Sample data loaded (`python scripts/seed_all.py`)
- Jupyter kernel installed (`./scripts/setup_jupyter.sh`)
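
Before running the cells below, it can help to confirm that the command-line tools behind the prerequisites are installed. The following is a minimal sketch using only the standard library. `missing_tools` is a hypothetical helper, not part of Component Forge, and the `jupyter` executable name is an assumption. It only checks that each executable is on `PATH`, not that the services are running or the migrations applied.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Tool names taken from the prerequisite commands above ("jupyter" is assumed)
required = ["docker-compose", "alembic", "jupyter"]
missing = missing_tools(required)
if missing:
    print(f"Missing tools: {', '.join(missing)}")
else:
    print("All prerequisite tools found on PATH")
```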

## 1. Environment Setup

In [None]:
# Standard library imports
import sys
import os
import asyncio
from pathlib import Path
from datetime import datetime, timedelta

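# Add project paths: backend/src for `core.*`, project root for `notebooks.utils.*`
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
backend_src = project_root / 'backend' / 'src'
for path in (project_root, backend_src):
    if str(path) not in sys.path:
        sys.path.insert(0, str(path))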

# Load environment variables
from dotenv import load_dotenv
load_dotenv(project_root / 'backend' / '.env')

print(f"📁 Project root: {project_root}")
print(f"🐍 Python path includes: {backend_src}")
print("✅ Environment loaded")

In [None]:
# Data science imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Project imports
from core.database import check_database_connection, database_health_check
from core.models import User, Document, Conversation, Message
from notebooks.utils.database_helpers import (
    sync_get_database_stats,
    sync_get_users_df,
    sync_get_documents_df,
    sync_get_conversations_df
)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

print("📊 Libraries imported successfully!")
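
The path setup in the first cell assumes the notebook runs either from the project root or from a `notebooks/` folder directly below it. If notebooks end up nested deeper (for example under `notebooks/experiments/`), one hedged alternative is to walk upwards until a marker directory is found. `find_project_root` and the choice of `backend` as the marker are assumptions made for illustration, not part of the project.

```python
from pathlib import Path


def find_project_root(start, marker="backend"):
    """Walk up from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}/'")
```

In a notebook this would be called as `project_root = find_project_root(Path.cwd())`. It raises `FileNotFoundError` rather than silently falling back to the current directory, so a misplaced notebook fails early.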

## 2. Database Connection Test

In [None]:
# Test database connection
connection_ok = await check_database_connection()
print(f"🗄️ Database connection: {'✅ OK' if connection_ok else '❌ Failed'}")

if connection_ok:
    health = await database_health_check()
    print("\n📊 Database Health:")
    print(f"  Status: {health['status']}")
    print(f"  Tables: {health.get('table_count', 'Unknown')}")
    print(f"  Connection time: {health.get('connection_time_ms', 'Unknown')}ms")
else:
    print("❌ Please check that the Docker services are running and the database is migrated")
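
Right after `docker-compose up -d` the database may need a few seconds before it accepts connections, so a single check can fail spuriously. Below is a small retry sketch. `wait_until_ready` is a hypothetical helper, and it assumes the check is an async callable that takes no arguments and returns a truthy value on success, which matches how `check_database_connection` is used above.

```python
import asyncio


async def wait_until_ready(check, attempts=5, delay_seconds=1.0):
    """Call the async `check` until it returns True or attempts run out."""
    for attempt in range(1, attempts + 1):
        if await check():
            return True
        if attempt < attempts:
            # Pause between attempts, but not after the final one
            await asyncio.sleep(delay_seconds)
    return False
```

In a notebook cell this could be used as `connection_ok = await wait_until_ready(check_database_connection)`.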

## 3. Explore Sample Data

In [None]:
# Get database statistics
stats = sync_get_database_stats()

print("📈 Database Statistics:")
for table, count in stats.items():
    print(f"  {table}: {count} records")

# Visualize data distribution
fig, ax = plt.subplots(figsize=(10, 6))
tables = list(stats.keys())
counts = list(stats.values())

bars = ax.bar(tables, counts, color=sns.color_palette("husl", len(tables)))
ax.set_title('Database Record Counts')
ax.set_ylabel('Number of Records')
ax.tick_params(axis='x', rotation=45)

# Add value labels on bars
for bar, count in zip(bars, counts):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.1,
            f'{count}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

## 4. Analyze Users and Activity

In [None]:
# Load users data
users_df = sync_get_users_df()
print(f"👥 Found {len(users_df)} users")

# Display users table
display(users_df[['username', 'email', 'is_active', 'is_admin', 'login_count', 'created_at']])

# Visualize user activity
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Login counts
ax1.bar(users_df['username'], users_df['login_count'], 
        color=sns.color_palette("viridis", len(users_df)))
ax1.set_title('User Login Counts')
ax1.set_ylabel('Login Count')
ax1.tick_params(axis='x', rotation=45)

# User types
user_types = users_df.groupby(['is_admin', 'is_active']).size().reset_index(name='count')
user_types['type'] = user_types.apply(
    lambda x: f"{'Admin' if x['is_admin'] else 'User'} ({'Active' if x['is_active'] else 'Inactive'})", 
    axis=1
)
ax2.pie(user_types['count'], labels=user_types['type'], autopct='%1.0f%%')
ax2.set_title('User Types Distribution')

plt.tight_layout()
plt.show()
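
The pie chart labels combine two boolean flags, `is_admin` and `is_active`, into a single category. The same grouping can be sketched without pandas, which makes the labelling rule easy to check on its own. `user_type_label` is a hypothetical helper and the flag pairs are made-up values, not the seeded users.

```python
from collections import Counter


def user_type_label(is_admin, is_active):
    """Build the same 'Role (Status)' label used for the pie chart."""
    role = "Admin" if is_admin else "User"
    status = "Active" if is_active else "Inactive"
    return f"{role} ({status})"


# Made-up (is_admin, is_active) pairs, not real seed data
sample_flags = [(True, True), (False, True), (False, True), (False, False)]
type_counts = Counter(user_type_label(admin, active) for admin, active in sample_flags)
print(type_counts)
```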

## 5. Explore Documents

In [None]:
# Load documents data
docs_df = sync_get_documents_df()
print(f"📄 Found {len(docs_df)} documents")

# Display documents overview
display(docs_df[['title', 'content_type', 'category', 'word_count', 'processing_status']])

# Document analysis
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# Content types
content_counts = docs_df['content_type'].value_counts()
ax1.pie(content_counts.values, labels=content_counts.index, autopct='%1.0f%%')
ax1.set_title('Document Content Types')

# Categories
category_counts = docs_df['category'].value_counts()
ax2.bar(category_counts.index, category_counts.values)
ax2.set_title('Document Categories')
ax2.tick_params(axis='x', rotation=45)

# Word count distribution
ax3.hist(docs_df['word_count'], bins=10, alpha=0.7, color='skyblue', edgecolor='black')
ax3.set_title('Word Count Distribution')
ax3.set_xlabel('Word Count')
ax3.set_ylabel('Number of Documents')

# File sizes
docs_df['file_size_kb'] = docs_df['file_size'] / 1024
ax4.scatter(docs_df['word_count'], docs_df['file_size_kb'], alpha=0.7)
ax4.set_title('Word Count vs File Size')
ax4.set_xlabel('Word Count')
ax4.set_ylabel('File Size (KB)')

plt.tight_layout()
plt.show()

# Document statistics
print("\n📊 Document Statistics:")
print(f"  Total words: {docs_df['word_count'].sum():,}")
print(f"  Average words per document: {docs_df['word_count'].mean():.0f}")
print(f"  Total file size: {docs_df['file_size'].sum() / 1024:.1f} KB")

## 6. Analyze Conversations

In [None]:
# Load conversations data
conversations_df = sync_get_conversations_df()
print(f"💬 Found {len(conversations_df)} conversations")

if len(conversations_df) > 0:
    # Display conversations overview
    display(conversations_df[['username', 'title', 'model_name', 'message_count', 'total_tokens_used']])

    # Conversation analysis
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

    # Conversations by user
    user_conv_counts = conversations_df['username'].value_counts()
    ax1.bar(user_conv_counts.index, user_conv_counts.values)
    ax1.set_title('Conversations by User')
    ax1.tick_params(axis='x', rotation=45)

    # Models used
    model_counts = conversations_df['model_name'].value_counts()
    ax2.pie(model_counts.values, labels=model_counts.index, autopct='%1.0f%%')
    ax2.set_title('Models Used')

    # Message count distribution
    ax3.hist(conversations_df['message_count'], bins=10, alpha=0.7, color='lightgreen', edgecolor='black')
    ax3.set_title('Message Count Distribution')
    ax3.set_xlabel('Messages per Conversation')
    ax3.set_ylabel('Number of Conversations')

    # Token usage
    ax4.scatter(conversations_df['message_count'], conversations_df['total_tokens_used'], alpha=0.7)
    ax4.set_title('Messages vs Token Usage')
    ax4.set_xlabel('Message Count')
    ax4.set_ylabel('Total Tokens Used')

    plt.tight_layout()
    plt.show()

    # Token statistics
    print("\n🎯 Token Usage Statistics:")
    print(f"  Total tokens used: {conversations_df['total_tokens_used'].sum():,}")
    print(f"  Average tokens per conversation: {conversations_df['total_tokens_used'].mean():.0f}")
    print(f"  Average messages per conversation: {conversations_df['message_count'].mean():.1f}")
else:
    print("No conversations found. Load the sample data (`python scripts/seed_all.py`, see Prerequisites) to add sample conversations.")
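
The scatter plot compares message counts with token totals. A per-conversation ratio is sometimes easier to compare across conversations. The sketch below is one way to compute it while guarding against conversations that have no messages yet. `tokens_per_message` is a hypothetical helper and the sample pairs are made up.

```python
def tokens_per_message(total_tokens, message_count):
    """Average tokens per message, or None when there are no messages."""
    if not message_count:
        return None
    return total_tokens / message_count


# Made-up (message_count, total_tokens_used) pairs for illustration
samples = [(4, 1200), (0, 0), (10, 2500)]
for messages, tokens in samples:
    print(messages, tokens, tokens_per_message(tokens, messages))
```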

## 7. Quick Database Query Examples

In [None]:
from notebooks.utils.database_helpers import sync_execute_query

# Example: Find most active users
active_users_query = """
SELECT 
    u.username,
    COUNT(DISTINCT c.id) as conversation_count,
    COUNT(m.id) as total_messages,
    SUM(CASE WHEN m.role = 'user' THEN 1 ELSE 0 END) as user_messages,
    SUM(m.total_tokens) as total_tokens
FROM users u
LEFT JOIN conversations c ON u.id = c.user_id
LEFT JOIN messages m ON c.id = m.conversation_id
GROUP BY u.id, u.username
HAVING COUNT(DISTINCT c.id) > 0
ORDER BY conversation_count DESC
"""

active_users = sync_execute_query(active_users_query)
print("👑 Most Active Users:")
display(active_users)
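
To see how the two `LEFT JOIN`s and the `HAVING` clause interact, the query can be tried against a tiny in-memory SQLite database. This is only a sketch: the table and column names are copied from the query above, but the rest of the schema and all rows are made up, and the project's real database engine may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE conversations (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    conversation_id INTEGER,
    role TEXT,
    total_tokens INTEGER
);
-- alice: two conversations with messages; bob: one empty conversation; carol: none
INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO conversations VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO messages VALUES
    (1, 1, 'user', 10),
    (2, 1, 'assistant', 20),
    (3, 2, 'user', 5);
""")

rows = conn.execute("""
SELECT
    u.username,
    COUNT(DISTINCT c.id) AS conversation_count,
    COUNT(m.id) AS total_messages,
    SUM(CASE WHEN m.role = 'user' THEN 1 ELSE 0 END) AS user_messages,
    SUM(m.total_tokens) AS total_tokens
FROM users u
LEFT JOIN conversations c ON u.id = c.user_id
LEFT JOIN messages m ON c.id = m.conversation_id
GROUP BY u.id, u.username
HAVING COUNT(DISTINCT c.id) > 0
ORDER BY conversation_count DESC
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

In this toy data, carol has no conversations and is dropped by `HAVING`. Bob has a conversation without messages, so his `total_tokens` comes back as `NULL` (`None` in Python); wrapping the sum in `COALESCE(..., 0)` is one option if a zero is preferred.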

In [None]:
# Example: Document categories and their average word counts
doc_categories_query = """
SELECT 
    category,
    COUNT(*) as document_count,
    AVG(word_count) as avg_word_count,
    SUM(word_count) as total_words
FROM documents
WHERE processing_status = 'completed'
GROUP BY category
ORDER BY document_count DESC
"""

doc_categories = sync_execute_query(doc_categories_query)
print("📚 Document Categories Analysis:")
display(doc_categories)
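
The status filter above is hard-coded into the SQL string. When the filter value varies, binding it as a parameter avoids string formatting in queries. Whether `sync_execute_query` accepts bound parameters is not shown in this notebook, so the sketch below uses the standard-library `sqlite3` module with made-up rows. `category_summary` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (category TEXT, word_count INTEGER, processing_status TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        ("guide", 100, "completed"),
        ("guide", 300, "completed"),
        ("reference", 50, "completed"),
        ("guide", 999, "pending"),  # excluded when filtering on 'completed'
    ],
)


def category_summary(connection, status):
    """Aggregate documents per category for one processing status."""
    query = """
    SELECT category,
           COUNT(*) AS document_count,
           AVG(word_count) AS avg_word_count,
           SUM(word_count) AS total_words
    FROM documents
    WHERE processing_status = ?
    GROUP BY category
    ORDER BY document_count DESC
    """
    # The status value is bound via the `?` placeholder, not formatted into the SQL
    return connection.execute(query, (status,)).fetchall()


completed = category_summary(conn, "completed")
print(completed)
```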

## 8. Next Steps

Congratulations! You've successfully explored the Component Forge environment. Here's what you can do next:

### 🧪 Experiments
- **RAG System Testing**: `notebooks/experiments/02_rag_testing.ipynb`
- **Embedding Models**: `notebooks/experiments/03_embedding_comparison.ipynb`
- **Prompt Engineering**: `notebooks/experiments/04_prompt_engineering.ipynb`

### 📊 Analysis
- **Performance Analysis**: `notebooks/evaluation/01_performance_analysis.ipynb`
- **User Behavior**: `notebooks/exploration/02_user_behavior.ipynb`
- **Token Usage**: `notebooks/exploration/03_token_analysis.ipynb`

### 🛠️ Development
- Explore the database models in `backend/src/core/models.py`
- Try the database helpers in `notebooks/utils/database_helpers.py`
- Set up your own experiments using this notebook as a template

### 📚 Resources
- Project documentation: `CLAUDE.md`
- Database design: `backend/data/fixtures/documents/database_design_principles.md`
- AI development guide: `backend/data/fixtures/documents/ai_development_guide.md`

Happy experimenting! 🚀