# tmux Agent Expiration Analysis

## Problem Statement

Currently, tmux agents are set to auto-expire after 24 hours, but most agents complete their tasks within minutes. This leads to:

1. **Idle sessions**: Completed agents sit idle for 23+ hours until the timeout fires
2. **Status confusion**: Monitoring tools report agents as "active" when they have already completed
3. **Resource consumption**: Each idle tmux session keeps consuming system resources until it expires

## Current State Analysis

Recent monitoring showed:
- 9 tmux sessions reported as "active"
- At least 4 of them had actually completed (based on log analysis)
- Sessions started at 03:31 (13+ hours before the check) were still showing as "active"

## Investigation Approach

1. Find where the 24-hour timeout is configured
2. Analyze current cleanup mechanisms
3. Propose 1-hour timeout implementation
4. Ensure proper completion detection

In [None]:
# First, let's find where the 24-hour timeout is configured
import subprocess

def search_for_timeout_config():
    """Search for 24-hour timeout configurations in the orchestration system."""
    
    search_patterns = [
        "24 hour",
        "24h", 
        "86400",  # 24 hours in seconds
        "1440",   # 24 hours in minutes
        "auto-close",
        "expire",
        "timeout"
    ]
    
    results = {}
    
    for pattern in search_patterns:
        try:
            result = subprocess.run(
                ["grep", "-r", "-n", "-i", pattern, "orchestration/", ".claude/commands/"],
                capture_output=True,
                text=True,
                cwd="/home/jleechan/projects/worldarchitect.ai/worktree_roadmap"
            )
            if result.stdout.strip():
                results[pattern] = result.stdout.strip().split('\n')
        except Exception as e:
            print(f"Error searching for {pattern}: {e}")
    
    return results

timeout_configs = search_for_timeout_config()
for pattern, matches in timeout_configs.items():
    print(f"\n=== Pattern: {pattern} ===")
    for match in matches[:5]:  # Show first 5 matches
        print(match)

In [None]:
# Let's also check the specific files that manage tmux sessions
import glob

def find_tmux_management_files():
    """Find files that manage tmux sessions."""
    
    patterns = [
        "orchestration/*tmux*",
        "orchestration/*agent*", 
        "orchestration/*monitor*",
        "orchestration/*session*",
        ".claude/commands/*orchestrate*"
    ]
    
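    files = []
    for pattern in patterns:
        # None of the patterns contains "**", so recursive=True would have no
        # effect; only direct children of each directory are matched.
        files.extend(glob.glob(pattern))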
    
    return sorted(set(files))

tmux_files = find_tmux_management_files()
print("TMux management files:")
for f in tmux_files:
    print(f"  {f}")

In [None]:
# Check the unified orchestration script for timeout configuration
def analyze_orchestration_unified():
    """Analyze the main orchestration script for timeout settings."""
    
    try:
        with open("orchestration/orchestrate_unified.py", "r") as f:
            content = f.read()
            
        # Look for timeout-related code
        lines = content.split('\n')
        timeout_lines = []
        
        for i, line in enumerate(lines, 1):
            lowered = line.lower()  # lowercase once per line, not once per keyword
            if any(keyword in lowered for keyword in ['24', 'hour', 'timeout', 'expire', 'close']):
                timeout_lines.append(f"{i}: {line.strip()}")
        
        return timeout_lines
    except Exception as e:
        return [f"Error reading file: {e}"]

timeout_lines = analyze_orchestration_unified()
print("Timeout-related lines in orchestrate_unified.py:")
for line in timeout_lines:
    print(line)

## Solution Design

Based on the analysis, three changes are needed:

### 1. Implement Smart Expiration

Instead of a fixed 24-hour timeout, implement:
- **1-hour base timeout** for general cleanup
- **Completion detection** to immediately clean up finished agents
- **Grace period** for agents that might still be working
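
The three rules above can be combined into one decision function. The following is a minimal sketch under stated assumptions, not the orchestration system's actual code: the name `should_expire`, its parameters, and the 5-minute grace value are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

BASE_TIMEOUT = timedelta(hours=1)    # proposed replacement for the 24-hour timeout
GRACE_PERIOD = timedelta(minutes=5)  # assumed value; adjust as needed


def should_expire(created_at: datetime,
                  completed_at: Optional[datetime],
                  now: datetime) -> bool:
    """Return True if a session should be cleaned up (hypothetical helper)."""
    if completed_at is not None:
        # A finished agent is removed once the grace period has elapsed.
        return now - completed_at >= GRACE_PERIOD
    # An unfinished agent is kept until the base timeout is reached.
    return now - created_at >= BASE_TIMEOUT
```

Keeping the rule a pure function of timestamps means it can be checked without a running tmux server; the caller decides where `created_at` and `completed_at` come from.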

### 2. Improved Status Detection

- Check for "Claude exit code: 0" in logs
- Look for "Agent completed successfully" markers
- Detect stale sessions based on last activity
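
As a rough illustration of these checks, the sketch below classifies an agent from its log text and last-activity time. The function name `classify_agent` and the 30-minute staleness threshold are assumptions, and the marker strings are taken from the list above on the assumption that the logs contain them verbatim.

```python
from datetime import datetime, timedelta

# Markers named in this analysis; the exact log wording is assumed to match.
COMPLETION_MARKERS = (
    "Claude exit code: 0",
    "Agent completed successfully",
)
MAX_IDLE = timedelta(minutes=30)  # assumed staleness threshold


def classify_agent(log_text: str, last_activity: datetime, now: datetime) -> str:
    """Classify an agent as 'completed', 'stale', or 'active' (hypothetical helper)."""
    # A completion marker wins over everything else.
    if any(marker in log_text for marker in COMPLETION_MARKERS):
        return "completed"
    # No marker and no recent activity: treat the session as stale.
    if now - last_activity > MAX_IDLE:
        return "stale"
    return "active"
```

One plausible source for `last_activity` is the modification time of the agent's log file, though any reliable activity timestamp would do.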

### 3. Active Cleanup Process

- Background cleanup script that runs every 15 minutes
- Immediate cleanup when agents signal completion
- Safe cleanup that preserves logs and results
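
A single cleanup pass might look like the sketch below. It relies on the standard tmux commands `list-sessions -F`, `capture-pane -p`, and `kill-session -t`; the function names, the `log_dir` path, and the choice to save only the visible pane content are assumptions made for illustration.

```python
import os
import subprocess
import time

BASE_TIMEOUT_SECONDS = 60 * 60  # 1-hour base timeout proposed in this analysis


def parse_sessions(listing: str) -> dict:
    """Parse 'name<TAB>created_epoch' lines into {name: created_epoch}."""
    sessions = {}
    for line in listing.splitlines():
        name, _, created = line.rpartition("\t")
        if name and created.isdigit():
            sessions[name] = int(created)
    return sessions


def expired_sessions(sessions: dict, now: float) -> list:
    """Return the sorted names of sessions at or past the base timeout."""
    return sorted(n for n, c in sessions.items() if now - c >= BASE_TIMEOUT_SECONDS)


def cleanup_once(log_dir: str = "/tmp/agent_logs") -> None:  # hypothetical path
    """Save each expired session's pane output, then kill the session."""
    os.makedirs(log_dir, exist_ok=True)
    listing = subprocess.run(
        ["tmux", "list-sessions", "-F", "#{session_name}\t#{session_created}"],
        capture_output=True, text=True,
    ).stdout
    for name in expired_sessions(parse_sessions(listing), time.time()):
        # Preserve what is on screen before the session disappears.
        pane = subprocess.run(
            ["tmux", "capture-pane", "-p", "-t", name],
            capture_output=True, text=True,
        ).stdout
        with open(os.path.join(log_dir, f"{name}.final.log"), "w") as f:
            f.write(pane)
        subprocess.run(["tmux", "kill-session", "-t", name])
```

Running `cleanup_once()` every 15 minutes, for example from cron or a loop around `time.sleep(15 * 60)`, would give the periodic behaviour described above. Completion-based cleanup would need an extra check on each agent's log before the timeout test.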

In [None]:
# Design the new timeout logic
def design_smart_timeout():
    """Design smart timeout logic for tmux agents."""
    
    timeout_config = {
        "base_timeout_minutes": 60,  # 1 hour instead of 24
        "cleanup_check_interval_minutes": 15,  # Check every 15 minutes
        "completion_markers": [
            "Claude exit code: 0",
            "Agent completed successfully",
            '"subtype":"success"'
        ],
        "grace_period_minutes": 5,  # 5 minutes after completion marker
        "max_idle_minutes": 30,  # Consider idle if no log activity for 30 min
    }
    
    implementation_plan = {
        "modify_agent_creation": "Update orchestrate_unified.py to set 1hr timeout",
        "add_completion_detection": "Monitor logs for completion markers", 
        "implement_cleanup_service": "Background process for smart cleanup",
        "update_monitoring": "Show accurate active vs completed status",
        "preserve_artifacts": "Keep logs and results after cleanup"
    }
    
    return timeout_config, implementation_plan

config, plan = design_smart_timeout()

print("Smart Timeout Configuration:")
for key, value in config.items():
    print(f"  {key}: {value}")

print("\nImplementation Plan:")
for key, value in plan.items():
    print(f"  {key}: {value}")

## Implementation Strategy

### Phase 1: Immediate Improvements
1. Change timeout from 24h to 1h in agent creation
2. Add completion detection to existing monitoring
3. Provide a manual cleanup command for immediate relief

### Phase 2: Smart Cleanup
1. Background cleanup service
2. Intelligent completion detection
3. Graceful session termination

### Phase 3: Enhanced Monitoring
1. Accurate status reporting (active vs completed)
2. Resource usage tracking
3. Performance metrics

## Benefits

- **Resource efficiency**: Roughly 95% less idle session time, since a finished agent lingers for at most 1 hour instead of 24
- **Accurate monitoring**: Clear distinction between active and completed
- **Better UX**: Faster status updates, less confusion
- **System health**: Reduced resource consumption
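
The resource-efficiency estimate can be sanity-checked with simple arithmetic. The sketch below assumes a task that finishes in 10 minutes (an illustrative figure, consistent with "most agents complete their tasks within minutes") and compares idle time under the 24-hour and 1-hour timeouts; the function names are hypothetical.

```python
def idle_minutes(timeout_min: float, task_min: float) -> float:
    """Minutes a session sits idle between task completion and expiry."""
    return max(timeout_min - task_min, 0.0)


def idle_reduction(old_timeout_min: float, new_timeout_min: float,
                   task_min: float) -> float:
    """Fractional reduction in idle time when the timeout is shortened."""
    old = idle_minutes(old_timeout_min, task_min)
    new = idle_minutes(new_timeout_min, task_min)
    if old == 0:
        return 0.0  # nothing to reduce
    return 1.0 - new / old


# 24 h -> 1 h for a 10-minute task: idle time drops from 1430 to 50 minutes.
print(f"{idle_reduction(24 * 60, 60, 10):.1%}")  # prints 96.5%
```

Under these assumptions the reduction is about 96%, in line with the estimate above; cleaning up on completion after a short grace period would reduce idle time further.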