# Phase2: Remediation

## Overview

Phase2 executes the Fix Plan generated by Phase1 to remediate the identified volume I/O issues, applying fixes for the root causes found during the investigation.

### Key Components

- **LangGraph StateGraph**: Manages the flow of the remediation process
- **Fix Plan Executor**: Executes the steps in the Fix Plan
- **System Tools**: Tools for interacting with the system (filesystem repairs, hardware diagnostics)
- **Kubernetes Tools**: Tools for interacting with the Kubernetes API

### Inputs and Outputs

- **Inputs**: 
  - Fix Plan from Phase1
  - Knowledge Graph and collected information from Phase0
  - Pod name, namespace, volume path
- **Outputs**: 
  - Remediation result with details of actions taken
  - Success/failure status of each remediation step
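
One way to picture these outputs is as a small pair of dataclasses. This is only a minimal sketch: the names `StepStatus`, `StepResult`, and `RemediationResult` are hypothetical and do not come from the Phase2 implementation, which in this notebook returns the result as a plain string.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class StepStatus(Enum):
    """Hypothetical per-step outcome labels."""
    SUCCESS = "success"
    WARNING = "warning"
    FAILURE = "failure"
    MANUAL = "manual"


@dataclass
class StepResult:
    """Outcome of a single remediation step (illustrative only)."""
    name: str
    status: StepStatus
    details: str = ""


@dataclass
class RemediationResult:
    """Aggregate result for one pod and volume (illustrative only)."""
    pod_name: str
    namespace: str
    volume_path: str
    steps: List[StepResult] = field(default_factory=list)

    @property
    def succeeded(self) -> bool:
        # Successful only if at least one step ran and none failed outright.
        return bool(self.steps) and all(
            step.status is not StepStatus.FAILURE for step in self.steps
        )


result = RemediationResult(
    pod_name="test-pod",
    namespace="default",
    volume_path="/var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount",
    steps=[
        StepResult("Unmount the filesystem", StepStatus.SUCCESS),
        StepResult("Check drive health", StepStatus.WARNING, "4 pending sectors"),
    ],
)
print(result.succeeded)
```

Keeping warnings separate from failures lets a run with a degraded drive still count as a completed remediation while flagging the step for follow-up.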

In [None]:
# Import necessary libraries
import asyncio
import json
from typing import Dict, List, Any

# Import mock data for demonstration
import sys
sys.path.append('../')
from tests.mock_knowledge_graph import create_mock_knowledge_graph
from tests.mock_kubernetes_data import get_mock_kubernetes_data
from tests.mock_system_data import get_mock_system_data

## Mock Remediation Phase

For demonstration purposes, we'll create a mock implementation of the Remediation Phase that simulates the execution of a Fix Plan.

In [None]:
class MockRemediationPhase:
    """
    Mock implementation of Remediation Phase for demonstration
    """
    
    def __init__(self, collected_info, config_data=None):
        """
        Initialize the mock remediation phase
        
        Args:
            collected_info: Pre-collected diagnostic information from Phase0
            config_data: Configuration data (optional)
        """
        self.collected_info = collected_info
        self.config_data = config_data or {}
        # Read from self.config_data so that config_data=None does not raise AttributeError
        self.interactive_mode = self.config_data.get('troubleshoot', {}).get('interactive_mode', False)
        print("Initializing Remediation Phase...")
    
    async def run_remediation_with_graph(self, query, graph, timeout_seconds=1800):
        """
        Run remediation using the provided LangGraph StateGraph
        
        Args:
            query: The initial query to send to the graph
            graph: LangGraph StateGraph to use
            timeout_seconds: Maximum execution time in seconds
            
        Returns:
            str: Remediation result
        """
        print("\nStarting remediation with LangGraph...")
        print("This may take a few minutes to complete.")
        
        # Format the query
        formatted_query = {"messages": [{"role": "user", "content": query}]}
        
        try:
            # Run graph with timeout
            response = await asyncio.wait_for(
                graph.ainvoke(formatted_query, config={"recursion_limit": 100}),
                timeout=timeout_seconds
            )
            print("Remediation complete!")
            
            # Extract the final message
            final_message = response["messages"][-1]["content"]
            return final_message
            
        except asyncio.TimeoutError:
            print("Remediation timed out!")
            return "Remediation phase timed out - manual intervention may be required."
    
    async def execute_fix_plan(self, fix_plan, pod_name, namespace, volume_path, message_list=None):
        """
        Execute the Fix Plan
        
        Args:
            fix_plan: Fix Plan generated by Phase1
            pod_name: Name of the pod with the error
            namespace: Namespace of the pod
            volume_path: Path of the volume with I/O error
            message_list: Optional message list for chat mode
            
        Returns:
            tuple: (Remediation result, Updated message list)
        """
        print(f"\nExecuting Fix Plan for pod {namespace}/{pod_name} with volume path {volume_path}")
        print(f"\nFix Plan:\n{fix_plan}")
        
        # Create LangGraph for remediation
        from tests.mock_knowledge_graph import MockLangGraphStateGraph
        graph = MockLangGraphStateGraph(self.collected_info, phase="phase2", config_data=self.config_data)
        
        # Create query for LangGraph
        query = f"Execute the Fix Plan for pod {pod_name} in namespace {namespace} with volume path {volume_path}:\n\n{fix_plan}"
        
        # Run LangGraph with the configured timeout (defaults to 1800 seconds)
        timeout_seconds = self.config_data.get('troubleshoot', {}).get('timeout_seconds', 1800)
        result = await self.run_remediation_with_graph(query, graph, timeout_seconds=timeout_seconds)
        
        # Update message list if provided
        if message_list is not None:
            message_list.append({"role": "assistant", "content": result})
        
        return result, message_list
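
The timeout handling in `run_remediation_with_graph` relies on `asyncio.wait_for`, which cancels the awaited call and raises `asyncio.TimeoutError` once the deadline passes. The self-contained sketch below reproduces that pattern with a deliberately slow stand-in graph; `SlowGraph` and `run_with_timeout` are hypothetical names used only for this illustration.

```python
import asyncio


class SlowGraph:
    """Hypothetical stand-in for a graph whose ainvoke takes a fixed time."""

    def __init__(self, delay_seconds: float):
        self.delay_seconds = delay_seconds

    async def ainvoke(self, query, config=None):
        await asyncio.sleep(self.delay_seconds)
        return {"messages": [{"role": "assistant", "content": "done"}]}


async def run_with_timeout(graph, query: str, timeout_seconds: float) -> str:
    """Mirror the wait_for pattern used by run_remediation_with_graph."""
    formatted_query = {"messages": [{"role": "user", "content": query}]}
    try:
        response = await asyncio.wait_for(
            graph.ainvoke(formatted_query, config={"recursion_limit": 100}),
            timeout=timeout_seconds,
        )
        return response["messages"][-1]["content"]
    except asyncio.TimeoutError:
        return "Remediation phase timed out - manual intervention may be required."


# Finishes well inside the timeout, so the final message is returned.
fast_result = asyncio.run(run_with_timeout(SlowGraph(0.01), "Execute the Fix Plan", 1.0))
# Exceeds the timeout, so the fallback message is returned instead.
slow_result = asyncio.run(run_with_timeout(SlowGraph(1.0), "Execute the Fix Plan", 0.05))
print(fast_result)
print(slow_result)
```

Inside a notebook, where an event loop is already running, use `await run_with_timeout(...)` in place of `asyncio.run(...)`.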

## Mock LangGraph StateGraph

Let's create a mock LangGraph StateGraph for the remediation phase.

In [None]:
class MockLangGraphStateGraph:
    """
    Mock implementation of LangGraph StateGraph for demonstration
    """
    
    def __init__(self, collected_info, phase="phase2", config_data=None):
        """
        Initialize the mock LangGraph StateGraph
        
        Args:
            collected_info: Pre-collected diagnostic information from Phase0
            phase: Phase name ("phase1" or "phase2")
            config_data: Configuration data (optional)
        """
        self.collected_info = collected_info
        self.phase = phase
        self.config_data = config_data or {}
        self.knowledge_graph = collected_info.get('knowledge_graph')
        print(f"Initializing LangGraph StateGraph for {phase}...")
    
    async def ainvoke(self, query, config=None):
        """
        Invoke the graph asynchronously
        
        Args:
            query: Query to send to the graph
            config: Configuration for the graph execution
            
        Returns:
            Dict[str, Any]: Graph execution result
        """
        print(f"Running LangGraph for {self.phase}...")
        
        # Extract the query text, which embeds the Fix Plan
        fix_plan = query["messages"][0]["content"]
        
        # Simulate graph execution with delay
        await asyncio.sleep(2)
        
        # Generate a response based on the fix plan
        if "filesystem" in fix_plan.lower():
            response = self._generate_filesystem_remediation_response()
        elif "hardware" in fix_plan.lower():
            response = self._generate_hardware_remediation_response()
        else:
            response = self._generate_generic_remediation_response()
        
        return response
    
    def _generate_filesystem_remediation_response(self):
        """
        Generate a mock response for filesystem remediation
        
        Returns:
            Dict[str, Any]: Mock remediation response
        """
        return {
            "messages": [
                {"role": "user", "content": "Execute the Fix Plan"},
                {"role": "assistant", "content": """# Remediation Results

I have executed the Fix Plan to resolve the filesystem corruption issue. Here are the results of each step:

## Step 1: Unmount the filesystem
✅ **Success**: Pod `test-pod` was deleted successfully.
✅ **Success**: Filesystem was unmounted successfully.

## Step 2: Run xfs_repair
✅ **Success**: XFS repair completed with the following output:
```
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
```

## Step 3: Check drive health
⚠️ **Warning**: SMART diagnostics show potential issues with the drive:
```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   095   095   000    Old_age   Always       -       5
197 Current_Pending_Sector  0x0012   098   098   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
```
The drive has 5 uncorrectable errors and 4 pending sectors. While not critical, these should be monitored.

## Step 4: Remount the filesystem
✅ **Success**: Filesystem was remounted successfully.

## Step 5: Restart the pod
✅ **Success**: Pod `test-pod` was created successfully and is now running.

# Summary

The filesystem corruption has been successfully repaired. The pod is now running and able to access the volume without I/O errors. 

## Recommendations

1. Monitor the drive health regularly using SMART diagnostics
2. Consider replacing the drive in the next maintenance window due to the uncorrectable errors
3. Enable regular filesystem checks to catch corruption early

All remediation steps have been completed successfully. The system is now operational."""}
            ]
        }
    
    def _generate_hardware_remediation_response(self):
        """
        Generate a mock response for hardware remediation
        
        Returns:
            Dict[str, Any]: Mock remediation response
        """
        return {
            "messages": [
                {"role": "user", "content": "Execute the Fix Plan"},
                {"role": "assistant", "content": """# Remediation Results

I have executed the Fix Plan to address the hardware issues. Here are the results of each step:

## Step 1: Back up important data
✅ **Success**: Pod `test-pod` was deleted successfully.
✅ **Success**: Data was backed up to `/backup` directory.

## Step 2: Check drive connections
⚠️ **Manual Step**: This step requires physical access to the server. Please have your datacenter team check the drive connections.

## Step 3: Run extended SMART diagnostics
❌ **Failure**: Extended SMART test failed with the following errors:
```
Error: SMART execute offline immediate command failed: I/O error
```
This indicates serious hardware issues with the drive.

## Step 4: Replace the drive
⚠️ **Manual Step**: This step requires physical access to the server. Please have your datacenter team replace the drive with a new one.

## Step 5: Restore data to the new drive
⚠️ **Pending**: This step is pending completion of Step 4 (drive replacement).

# Summary

The remediation process has been partially completed. The data has been successfully backed up, but the hardware issues require manual intervention from the datacenter team.

## Next Steps

1. Contact the datacenter team to check drive connections and replace the faulty drive
2. Once the drive is replaced, restore the data from the backup
3. Recreate the pod after data restoration

The automated portion of the remediation has been completed, but manual steps are required to fully resolve the issue."""}
            ]
        }
    
    def _generate_generic_remediation_response(self):
        """
        Generate a mock response for generic remediation
        
        Returns:
            Dict[str, Any]: Mock remediation response
        """
        return {
            "messages": [
                {"role": "user", "content": "Execute the Fix Plan"},
                {"role": "assistant", "content": """# Remediation Results

I have executed the Fix Plan to address the volume I/O issues. Here are the results of each step:

## Step 1: Restart the affected pod
✅ **Success**: Pod `test-pod` was deleted successfully.
✅ **Success**: Pod `test-pod` was recreated successfully and is now running.

## Step 2: Check system logs
✅ **Success**: System logs were checked and no new errors were found.
```
Jun 16 12:30:15 worker-1 kubelet[1234]: Volume mounted successfully: /var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount
Jun 16 12:30:18 worker-1 kubelet[1234]: Started container test-container
```

## Step 3: Monitor the system
✅ **Success**: System monitoring was enabled for 5 minutes and no issues were detected.
```
LAST SEEN   TYPE     REASON              OBJECT                  MESSAGE
30s         Normal   Scheduled           pod/test-pod            Successfully assigned default/test-pod to worker-1
25s         Normal   Pulling             pod/test-pod            Pulling image "nginx:latest"
15s         Normal   Pulled              pod/test-pod            Successfully pulled image "nginx:latest"
10s         Normal   Created             pod/test-pod            Created container test-container
5s          Normal   Started             pod/test-pod            Started container test-container
```

# Summary

The volume I/O issue appears to have been resolved by simply restarting the pod. This suggests that the issue was transient and not related to any persistent filesystem or hardware problems.

## Recommendations

1. Continue monitoring the pod for any recurrence of I/O errors
2. If the issue recurs, consider a more in-depth investigation of the underlying storage
3. Check for any recent system updates or changes that might have caused the transient issue

All remediation steps have been completed successfully. The system is now operational."""}
            ]
        }
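
Note that `ainvoke` checks for the keyword `filesystem` before `hardware`, so a Fix Plan that mentions both is routed to the filesystem response. The sketch below isolates that dispatch in a standalone function to make the precedence easy to check; `pick_response_kind` is a hypothetical name and is not part of the mock class.

```python
def pick_response_kind(fix_plan: str) -> str:
    """Replicate the keyword dispatch used by MockLangGraphStateGraph.ainvoke."""
    text = fix_plan.lower()
    if "filesystem" in text:
        return "filesystem"
    if "hardware" in text:
        return "hardware"
    return "generic"


print(pick_response_kind("Run xfs_repair to fix the filesystem corruption"))
print(pick_response_kind("Possible hardware issues with the underlying drive"))
print(pick_response_kind("XFS filesystem corruption; possible hardware issues"))
print(pick_response_kind("Restart the affected pod"))
```

The mock Fix Plan used later in this notebook mentions both filesystem corruption and possible hardware issues, so it produces the filesystem response.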

## Running Phase2: Remediation

Now let's run the Remediation phase with our mock implementation.

In [None]:
async def run_remediation_phase_with_fix_plan(pod_name, namespace, volume_path, collected_info, fix_plan, config_data=None, message_list=None):
    """
    Run Phase 2: Remediation with a Fix Plan
    
    Args:
        pod_name: Name of the pod with the error
        namespace: Namespace of the pod
        volume_path: Path of the volume with I/O error
        collected_info: Pre-collected diagnostic information from Phase0
        fix_plan: Fix Plan generated by Phase1
        config_data: Configuration data (optional)
        message_list: Optional message list for chat mode
        
    Returns:
        tuple: (Remediation result, Updated message list)
    """
    print("Starting Phase 2: Remediation with Fix Plan")
    
    # Initialize the remediation phase
    phase = MockRemediationPhase(collected_info, config_data)
    
    # Execute the fix plan
    result, message_list = await phase.execute_fix_plan(fix_plan, pod_name, namespace, volume_path, message_list)
    
    return result, message_list

In [None]:
# Create mock collected info from Phase0 (reusing the function from the Phase1 notebook)
def create_mock_collected_info():
    knowledge_graph = create_mock_knowledge_graph()
    kubernetes_data = get_mock_kubernetes_data()
    system_data = get_mock_system_data()
    
    return {
        "pod_info": kubernetes_data.get("pods", {}),
        "pvc_info": kubernetes_data.get("pvcs", {}),
        "pv_info": kubernetes_data.get("pvs", {}),
        "node_info": kubernetes_data.get("nodes", {}),
        "csi_driver_info": kubernetes_data.get("csi_driver", {}),
        "storage_class_info": kubernetes_data.get("storage_classes", {}),
        "system_info": system_data,
        "knowledge_graph_summary": {
            "pod_count": 1,
            "pvc_count": 1,
            "pv_count": 1,
            "node_count": 1,
            "issue_count": len(knowledge_graph.issues)
        },
        "issues": knowledge_graph.issues,
        "knowledge_graph": knowledge_graph
    }

# Create mock fix plan from Phase1
mock_fix_plan = """# Root Cause Analysis

After executing the Investigation Plan, I have identified the root cause of the volume I/O errors:

## Primary Issue
- **Issue**: XFS filesystem corruption detected
- **Severity**: critical
- **Entity**: Volume (gnode:Volume:volume-123-456)
- **Details**: XFS metadata corruption detected, causing I/O errors when accessing files

## Evidence
1. XFS filesystem corruption detected in the volume's metadata
2. Kernel logs show XFS_CORRUPT_INODES errors
3. I/O errors reported by the container runtime

## Contributing Factors
- Possible hardware issues with the underlying drive
- Multiple read failures recorded on the drive

# Fix Plan

The following steps should be taken to resolve the issue:

1. Unmount the corrupted filesystem
2. Run xfs_repair with the -L option to fix the filesystem corruption
3. Check the drive health with SMART tools
4. Remount the filesystem
5. Restart the affected pod

## Commands to Execute

```bash
# Step 1: Unmount the filesystem
kubectl delete pod test-pod -n default
umount /var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount

# Step 2: Run xfs_repair
xfs_repair -L /dev/mapper/volume-123-456

# Step 3: Check drive health
smartctl -a /dev/sda

# Step 4: Remount the filesystem
mount /dev/mapper/volume-123-456 /var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount

# Step 5: Restart the pod
kubectl create -f pod-definition.yaml
```
"""

# Define the target pod, namespace, and volume path
target_pod = "test-pod"
target_namespace = "default"
target_volume_path = "/var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount"

# Define configuration data
config_data = {
    "troubleshoot": {
        "timeout_seconds": 300,
        "interactive_mode": True
    }
}

# Create mock collected info
collected_info = create_mock_collected_info()

# Run the remediation phase
remediation_result, _ = await run_remediation_phase_with_fix_plan(
    target_pod, target_namespace, target_volume_path, 
    collected_info, mock_fix_plan, config_data
)

## Examining the Remediation Result

Let's examine the remediation result from Phase2, which includes the details of the actions taken and their outcomes.

In [None]:
# Display the remediation result
print(remediation_result)
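
Because the mock reports prefix each outcome line with ✅, ⚠️, or ❌, a quick tally gives a rough count per status. The helper below is a minimal sketch that assumes this marker convention; `count_step_statuses` is a hypothetical name, and a real Phase2 report may be formatted differently. Since ⚠️ marks warnings, manual steps, and pending steps alike in the mock reports, it is grouped under a single `needs_attention` label.

```python
from collections import Counter

# Marker convention taken from the mock reports in this notebook.
STATUS_MARKERS = {
    "\u2705": "success",          # check mark
    "\u26a0": "needs_attention",  # warning sign (with or without variation selector)
    "\u274c": "failure",          # cross mark
}


def count_step_statuses(report: str) -> Counter:
    """Count report lines that start with a known status marker."""
    counts = Counter()
    for line in report.splitlines():
        stripped = line.strip()
        for marker, label in STATUS_MARKERS.items():
            if stripped.startswith(marker):
                counts[label] += 1
                break
    return counts


sample_report = """## Step 1: Back up important data
✅ **Success**: Pod `test-pod` was deleted successfully.
✅ **Success**: Data was backed up to `/backup` directory.

## Step 2: Check drive connections
⚠️ **Manual Step**: This step requires physical access to the server.

## Step 3: Run extended SMART diagnostics
❌ **Failure**: Extended SMART test failed
"""

counts = count_step_statuses(sample_report)
print(dict(counts))
```

The same function can be applied to `remediation_result` from the cell above.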

## LangGraph Workflow

Phase2 uses a LangGraph StateGraph similar to the one in Phase1, but focused on executing remediation actions rather than investigating. The graph is built from three main components:

1. **call_model**: LLM reasoning node that decides which remediation action to take next
2. **tools_condition**: Routing condition that checks whether the model requested a tool
3. **serial_tools**: Tool execution node that runs the requested remediation tool

The flow is as follows:

1. The LLM (**call_model**) analyzes the Fix Plan and decides what action to take
2. If a tool is requested, the **tools_condition** routes to **serial_tools**
3. The **serial_tools** node executes the requested remediation action and returns the result
4. The result is fed back to the LLM (**call_model**) for further analysis
5. This loop continues until all remediation steps are completed or until a critical error occurs
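
The loop above can be sketched in plain Python without LangGraph. In this sketch, `call_model`, `tools_condition`, and `serial_tools` are simplified stand-ins that only share their names with the components described here; the scripted "model", the tool names, and the state layout are assumptions made for illustration, not the Phase2 implementation or the LangGraph API.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

# Hypothetical tools; each returns a short result string.
TOOLS: Dict[str, Callable[[], str]] = {
    "unmount_volume": lambda: "unmounted",
    "repair_filesystem": lambda: "repair finished",
    "remount_volume": lambda: "remounted",
}


def call_model(state: State) -> State:
    """Stand-in for the LLM: request the next pending step, or finish."""
    done = len(state["tool_results"])
    plan: List[str] = state["plan"]
    if done < len(plan):
        state["requested_tool"] = plan[done]
    else:
        state["requested_tool"] = None
        state["final_message"] = f"Completed {done} remediation steps."
    return state


def tools_condition(state: State) -> str:
    """Route to the tool node if a tool was requested, otherwise end."""
    return "serial_tools" if state["requested_tool"] else "end"


def serial_tools(state: State) -> State:
    """Run the requested tool and record its result."""
    name = state["requested_tool"]
    state["tool_results"].append((name, TOOLS[name]()))
    return state


def run_graph(plan: List[str], recursion_limit: int = 100) -> State:
    """Alternate between the model and the tool node until the model stops."""
    state: State = {"plan": plan, "tool_results": [], "requested_tool": None}
    for _ in range(recursion_limit):
        state = call_model(state)
        if tools_condition(state) == "end":
            return state
        state = serial_tools(state)
    raise RuntimeError("recursion limit reached before the plan finished")


final_state = run_graph(["unmount_volume", "repair_filesystem", "remount_volume"])
print(final_state["final_message"])
```

The `recursion_limit` argument plays the same role as the `recursion_limit` value passed to `ainvoke` earlier: it bounds the number of model/tool round trips so a plan that never finishes cannot loop forever.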

The remediation tools include:

- **Kubernetes tools**: For managing pods, services, and other Kubernetes resources
- **Filesystem tools**: For repairing and managing filesystems
- **Hardware diagnostic tools**: For checking drive health and hardware status
- **System tools**: For system-level operations like mounting/unmounting filesystems
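
As a rough illustration of these categories, the commands from the mock Fix Plan in this notebook can be grouped by executable. The mapping below is an assumption based only on those commands, and `classify_command` is a hypothetical helper rather than part of the Phase2 tool set.

```python
import shlex

# Category -> executables; grouping assumed from the mock Fix Plan commands.
TOOL_CATEGORIES = {
    "kubernetes": {"kubectl"},
    "filesystem": {"xfs_repair"},
    "hardware": {"smartctl"},
    "system": {"mount", "umount"},
}


def classify_command(command: str) -> str:
    """Return the tool category for a shell command, or 'unknown'."""
    parts = shlex.split(command)
    if not parts:
        return "unknown"
    executable = parts[0]
    for category, executables in TOOL_CATEGORIES.items():
        if executable in executables:
            return category
    return "unknown"


commands = [
    "kubectl delete pod test-pod -n default",
    "umount /var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount",
    "xfs_repair -L /dev/mapper/volume-123-456",
    "smartctl -a /dev/sda",
]
for cmd in commands:
    print(f"{classify_command(cmd):<10} {cmd}")
```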

## Summary

Phase2 (Remediation) executes the Fix Plan generated by Phase1 to resolve the identified issues with volume I/O. The remediation process involves executing a series of steps to fix the root cause of the problem, such as repairing filesystem corruption, addressing hardware issues, or resolving configuration problems.

In this notebook, we demonstrated:

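1. How the Remediation phase is initialized and executed
2. How the Fix Plan is passed to a mock LangGraph StateGraph, which returns a canned response selected by keyword
3. How the result of each remediation action is reported, including its success or failure status
4. How the report summarizes the system status after remediation and recommends follow-up actions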

The output of Phase2 includes a detailed report of the remediation actions taken, their success or failure status, and recommendations for further actions if needed.