# Phase0: Information Collection

## Overview

Phase0 collects diagnostic information about the Kubernetes cluster, focusing on the pod that reports volume I/O errors. It builds a Knowledge Graph that the subsequent phases use as their foundation.

### Key Components

- **ComprehensiveInformationCollector**: Collects data from the Kubernetes API, system logs, and hardware diagnostics
- **Knowledge Graph**: A graph-based representation of system entities and their relationships
- **Tool Executors**: Various tools that collect specific types of information

### Inputs and Outputs

- **Inputs**: Pod name, namespace, volume path
- **Outputs**:
  - Knowledge Graph with system entities and relationships
  - Collected diagnostic information (pod info, PVC info, PV info, node info, etc.)
  - Issues detected during information collection
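
To make the Knowledge Graph idea concrete, here is a minimal sketch of a graph that stores entities, relationships, and issues. The class name `SketchKnowledgeGraph` and its methods are hypothetical and are not the project's actual API. The `gnode:<Type>:<name>` ID pattern and the `USES` and `BOUND_TO` relationship names follow the sample output in this notebook; the edge directions and the PVC ID are assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class SketchKnowledgeGraph:
    """Hypothetical, simplified stand-in for the real Knowledge Graph."""

    nodes: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)
    issues: List[Dict[str, Any]] = field(default_factory=list)

    def add_node(self, entity_type: str, name: str, **attrs: Any) -> str:
        # Build an ID in the "gnode:<Type>:<name>" style seen in the output.
        node_id = f"gnode:{entity_type}:{name}"
        self.nodes[node_id] = {"entity_type": entity_type, **attrs}
        return node_id

    def add_edge(self, source: str, target: str, relationship: str) -> None:
        # Refuse edges that point at entities the graph does not know about.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges.append((source, target, relationship))

    def node_type_counts(self) -> Counter:
        # Count entities per type, similar to the "Node types" summary.
        return Counter(node["entity_type"] for node in self.nodes.values())


graph = SketchKnowledgeGraph()
pod = graph.add_node("Pod", "default/test-pod")
pvc = graph.add_node("PVC", "default/test-pvc")  # assumed ID format for PVCs
pv = graph.add_node("PV", "test-pv")
graph.add_edge(pod, pvc, "USES")      # assumed direction: Pod -> PVC
graph.add_edge(pvc, pv, "BOUND_TO")   # assumed direction: PVC -> PV
```

Keeping issues in the same object as the nodes and edges mirrors how the mock graph in this notebook exposes an `issues` attribute alongside the graph itself.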

In [4]:
# Import necessary libraries
import asyncio
import json
from typing import Dict, Any

# Import mock data for demonstration
import sys
sys.path.append('../')
from tests.mock_kubernetes_data import get_mock_kubernetes_data
from tests.mock_system_data import get_mock_system_data
from tests.mock_knowledge_graph import create_mock_knowledge_graph

## Mock Implementation of ComprehensiveInformationCollector

For demonstration purposes, we'll create a mock implementation of the ComprehensiveInformationCollector class that uses our mock data.

In [5]:
class MockComprehensiveInformationCollector:
    """
    Mock implementation of ComprehensiveInformationCollector for demonstration
    """
    
    def __init__(self, config_data=None):
        """
        Initialize the mock collector
        
        Args:
            config_data: Configuration data (optional)
        """
        self.config_data = config_data or {}
        print("Initializing ComprehensiveInformationCollector...")
    
    async def comprehensive_collect(self, target_pod, target_namespace, target_volume_path):
        """
        Collect comprehensive information about the target pod and volume
        
        Args:
            target_pod: Name of the target pod
            target_namespace: Namespace of the target pod
            target_volume_path: Path of the volume with I/O error
            
        Returns:
            Dict[str, Any]: Collection result
        """
        print(f"Collecting information for pod {target_namespace}/{target_pod} with volume path {target_volume_path}")
        
        # Simulate collection process with delays
        print("Collecting Kubernetes data...")
        await asyncio.sleep(1)  # Simulate API call delay
        kubernetes_data = get_mock_kubernetes_data()
        
        print("Collecting system data...")
        await asyncio.sleep(1)  # Simulate system data collection delay
        system_data = get_mock_system_data()
        
        print("Building Knowledge Graph...")
        await asyncio.sleep(1)  # Simulate graph building delay
        knowledge_graph = create_mock_knowledge_graph()
        
        # Format collected data
        collected_data = {
            "kubernetes": kubernetes_data,
            "system": system_data,
            "csi_baremetal": kubernetes_data.get("csi_driver", {})
        }
        
        # Create context summary
        context_summary = {
            "pod_count": len(kubernetes_data.get("pods", {})),
            "pvc_count": len(kubernetes_data.get("pvcs", {})),
            "pv_count": len(kubernetes_data.get("pvs", {})),
            "node_count": len(kubernetes_data.get("nodes", {})),
            "issue_count": len(knowledge_graph.issues)
        }
        
        print("Information collection complete!")
        
        return {
            "collected_data": collected_data,
            "context_summary": context_summary,
            "knowledge_graph": knowledge_graph
        }

## Mock Implementation of InformationCollectionPhase

Now we'll create a mock implementation of the InformationCollectionPhase class that uses our mock collector.

In [13]:
class MockInformationCollectionPhase:
    """
    Mock implementation of InformationCollectionPhase for demonstration
    """
    
    def __init__(self, config_data=None):
        """
        Initialize the mock phase
        
        Args:
            config_data: Configuration data (optional)
        """
        self.config_data = config_data or {}
        print("Initializing InformationCollectionPhase...")
    
    async def collect_information(self, pod_name, namespace, volume_path):
        """
        Collect all necessary diagnostic information
        
        Args:
            pod_name: Name of the pod with the error
            namespace: Namespace of the pod
            volume_path: Path of the volume with I/O error
            
        Returns:
            Dict[str, Any]: Pre-collected diagnostic information
        """
        print(f"\nStarting information collection for pod {namespace}/{pod_name}")
        
        # Initialize information collector
        info_collector = MockComprehensiveInformationCollector(self.config_data)
        
        # Run comprehensive collection
        collection_result = await info_collector.comprehensive_collect(
            target_pod=pod_name,
            target_namespace=namespace,
            target_volume_path=volume_path
        )
        
        # Get the knowledge graph from collection result
        knowledge_graph = collection_result.get('knowledge_graph')
        
        # Format collected data into expected structure
        collected_info = self._format_collected_data(collection_result, knowledge_graph)
        
        # Print Knowledge Graph summary
        self._print_knowledge_graph_summary(knowledge_graph)
        
        return collected_info
    
    def _format_collected_data(self, collection_result, knowledge_graph):
        """
        Format collected data into expected structure
        
        Args:
            collection_result: Result from comprehensive collection
            knowledge_graph: Knowledge Graph instance
            
        Returns:
            Dict[str, Any]: Formatted collected data
        """
        # Look up the shared sub-dictionaries once instead of on every line
        collected_data = collection_result.get('collected_data', {})
        kubernetes_data = collected_data.get('kubernetes', {})

        return {
            "pod_info": kubernetes_data.get('pods', {}),
            "pvc_info": kubernetes_data.get('pvcs', {}),
            "pv_info": kubernetes_data.get('pvs', {}),
            "node_info": kubernetes_data.get('nodes', {}),
            "csi_driver_info": collected_data.get('csi_baremetal', {}),
            "storage_class_info": kubernetes_data.get('storage_classes', {}),
            "system_info": collected_data.get('system', {}),
            "knowledge_graph_summary": collection_result.get('context_summary', {}),
            "issues": knowledge_graph.issues if knowledge_graph else [],
            "knowledge_graph": knowledge_graph
        }
    
    def _print_knowledge_graph_summary(self, knowledge_graph):
        """
        Print Knowledge Graph summary
        
        Args:
            knowledge_graph: Knowledge Graph instance
        """
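        # The graph is fetched with .get() and may be missing, so guard against None
        if knowledge_graph is None:
            print("No Knowledge Graph available; skipping summary.")
            return

        print("\n" + "=" * 80)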
        print("KNOWLEDGE GRAPH SUMMARY")
        print("=" * 80)
        
        # Print graph summary
        print(knowledge_graph.print_graph())
        
        # Print issues
        print("\n" + "=" * 80)
        print(f"DETECTED ISSUES: {len(knowledge_graph.issues)}")
        print("=" * 80)
        
        for i, issue in enumerate(knowledge_graph.issues, 1):
            print(f"\nIssue {i}: {issue['message']}")
            print(f"Severity: {issue['severity']}")
            print(f"Entity: {issue['entity_type']} ({issue['entity_id']})")
            print(f"Details: {issue['details']}")
            print("Possible causes:")
            for cause in issue['possible_causes']:
                print(f"  - {cause}")
            print("Recommended actions:")
            for action in issue['recommended_actions']:
                print(f"  - {action}")

## Running Phase0: Information Collection

Now let's run the Information Collection phase with our mock implementation.

In [7]:
async def run_information_collection_phase(pod_name, namespace, volume_path, config_data=None):
    """
    Run Phase 0: Information Collection - Gather all necessary data upfront
    
    Args:
        pod_name: Name of the pod with the error
        namespace: Namespace of the pod
        volume_path: Path of the volume with I/O error
        config_data: Configuration data (optional)
        
    Returns:
        Dict[str, Any]: Pre-collected diagnostic information
    """
    print("Starting Phase 0: Information Collection")
    
    # Initialize the phase
    phase = MockInformationCollectionPhase(config_data)
    
    # Run the collection
    collected_info = await phase.collect_information(pod_name, namespace, volume_path)
    
    return collected_info

In [8]:
# Define the target pod, namespace, and volume path
target_pod = "test-pod"
target_namespace = "default"
target_volume_path = "/var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount"

# Define configuration data
config_data = {
    "troubleshoot": {
        "timeout_seconds": 300,
        "interactive_mode": True
    }
}

# Run the information collection phase
collected_info = await run_information_collection_phase(target_pod, target_namespace, target_volume_path, config_data)

Starting Phase 0: Information Collection
Initializing InformationCollectionPhase...

Starting information collection for pod default/test-pod
Initializing ComprehensiveInformationCollector...
Collecting information for pod default/test-pod with volume path /var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount
Collecting Kubernetes data...
Collecting system data...
Building Knowledge Graph...
Information collection complete!

KNOWLEDGE GRAPH SUMMARY
Knowledge Graph Summary:
Total nodes: 8
Total edges: 8
Total issues: 2

Node types:
  - Pod: 1
  - PVC: 1
  - PV: 1
  - Node: 1
  - Drive: 1
  - Volume: 1
  - StorageClass: 1
  - System: 1

Relationship types:
  - USES: 5
  - RUNS_ON: 1
  - BOUND_TO: 1
  - IS_ON: 1

Issues by severity:
  - critical: 1

DETECTED ISSUES: 2

Issue 1: XFS filesystem corruption detected on volume test-pv
Severity: critical
Entity: System (gnode:System:filesystem)
Details: XFS metadata corruption found during filesystem check. This can lead to 
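
The cell above relies on top-level `await`, which works in Jupyter but not in a plain Python script. The `timeout_seconds` value in `config_data` is also never consumed by the mock classes. The sketch below shows one way both could be handled, assuming the timeout is meant to bound the whole collection. `collect_stub` and `run_with_timeout` are hypothetical names, and the stub stands in for `run_information_collection_phase` so the block stays self-contained.

```python
import asyncio


async def collect_stub(pod_name, namespace, volume_path):
    """Hypothetical stand-in for run_information_collection_phase."""
    await asyncio.sleep(0.01)  # simulate a short collection step
    return {"pod": f"{namespace}/{pod_name}", "volume_path": volume_path}


async def run_with_timeout(coro_fn, *args, timeout_seconds):
    """Await coro_fn(*args), returning None if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(coro_fn(*args), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        print(f"Collection did not finish within {timeout_seconds} seconds")
        return None


def main():
    config_data = {"troubleshoot": {"timeout_seconds": 300}}
    timeout = config_data["troubleshoot"]["timeout_seconds"]
    # asyncio.run replaces the notebook's top-level await in a script.
    return asyncio.run(
        run_with_timeout(
            collect_stub, "test-pod", "default", "/mnt/data",
            timeout_seconds=timeout,
        )
    )


if __name__ == "__main__":
    print(main())
```

Returning `None` on timeout is only one possible policy; a real phase might prefer to raise, or to return whatever partial data was collected.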

## Examining Collected Information

Let's examine some of the key information collected during Phase0.

In [14]:
# Examine the pod information
print("Pod Information:")
pod_info = collected_info['pod_info']
pod_key = list(pod_info.keys())[0]  # Get the first pod key
pod_data = pod_info[pod_key]

print(f"Name: {pod_data['metadata']['name']}")
print(f"Namespace: {pod_data['metadata']['namespace']}")
print(f"Status: {pod_data['status']['phase']}")
print(f"Node: {pod_data['spec']['nodeName']}")

# Check for container errors
for container_status in pod_data['status']['containerStatuses']:
    if 'lastState' in container_status and 'terminated' in container_status['lastState']:
        term_info = container_status['lastState']['terminated']
        if 'reason' in term_info and term_info['reason'] == 'Error':
            print(f"\nContainer Error: {container_status['name']}")
            print(f"Exit Code: {term_info['exitCode']}")
            print(f"Reason: {term_info['reason']}")
            print(f"Message: {term_info['message']}")

Pod Information:
Name: test-pod
Namespace: default
Status: Running
Node: worker-1

Container Error: test-container
Exit Code: 1
Reason: Error
Message: I/O error on volume
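
The cell above indexes directly into `containerStatuses`, which is fine for the mock data but would raise `KeyError` for a pod whose status does not include that field yet (for example, a pod that has not been scheduled). A more defensive version might look like the sketch below. The helper name `find_container_errors` is hypothetical, and the sample dictionaries contain only the fields printed above.

```python
from typing import Any, Dict, List


def find_container_errors(pod_data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one record per container whose last termination reason is 'Error'."""
    errors = []
    # Treat a missing or null containerStatuses field as "no containers yet".
    statuses = pod_data.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        terminated = (status.get("lastState") or {}).get("terminated")
        if terminated and terminated.get("reason") == "Error":
            errors.append({
                "container": status.get("name"),
                "exit_code": terminated.get("exitCode"),
                "message": terminated.get("message", ""),
            })
    return errors


# Sample data limited to the fields shown in the output above.
sample_pod = {
    "status": {
        "containerStatuses": [{
            "name": "test-container",
            "lastState": {
                "terminated": {
                    "exitCode": 1,
                    "reason": "Error",
                    "message": "I/O error on volume",
                }
            },
        }]
    }
}
pending_pod = {"status": {"phase": "Pending"}}

for error in find_container_errors(sample_pod):
    print(f"{error['container']}: exit {error['exit_code']} ({error['message']})")
```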


In [10]:
# Examine the volume and storage information
print("Volume and Storage Information:")

# PVC info
pvc_info = collected_info['pvc_info']
pvc_key = list(pvc_info.keys())[0]  # Get the first PVC key
pvc_data = pvc_info[pvc_key]

print(f"\nPVC Name: {pvc_data['metadata']['name']}")
print(f"PVC Namespace: {pvc_data['metadata']['namespace']}")
print(f"Storage Class: {pvc_data['spec']['storageClassName']}")
print(f"Volume Name: {pvc_data['spec']['volumeName']}")
print(f"Status: {pvc_data['status']['phase']}")
print(f"Capacity: {pvc_data['status']['capacity']['storage']}")

# PV info
pv_info = collected_info['pv_info']
pv_key = list(pv_info.keys())[0]  # Get the first PV key
pv_data = pv_info[pv_key]

print(f"\nPV Name: {pv_data['metadata']['name']}")
print(f"Storage Class: {pv_data['spec']['storageClassName']}")
print(f"Reclaim Policy: {pv_data['spec']['persistentVolumeReclaimPolicy']}")
print(f"Status: {pv_data['status']['phase']}")
print(f"CSI Driver: {pv_data['spec']['csi']['driver']}")
print(f"Volume Handle: {pv_data['spec']['csi']['volumeHandle']}")
print(f"FS Type: {pv_data['spec']['csi']['fsType']}")

Volume and Storage Information:

PVC Name: test-pvc
PVC Namespace: default
Storage Class: csi-baremetal-sc
Volume Name: test-pv
Status: Bound
Capacity: 10Gi

PV Name: test-pv
Storage Class: csi-baremetal-sc
Reclaim Policy: Delete
Status: Bound
CSI Driver: csi-baremetal
Volume Handle: volume-123-456
FS Type: xfs
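
The PVC and PV fields printed above can also be cross-checked against each other before digging into lower-level diagnostics. The sketch below is one possible consistency check; `check_pvc_pv_binding` is a hypothetical helper, not part of the collector, and the sample data only reproduces the values shown in the output.

```python
from typing import Any, Dict, List


def check_pvc_pv_binding(pvc_data: Dict[str, Any], pv_data: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable inconsistencies (empty if none found)."""
    problems = []
    pvc_spec = pvc_data.get("spec", {})
    pv_spec = pv_data.get("spec", {})
    pv_name = pv_data.get("metadata", {}).get("name")

    # The PVC should reference the PV by name.
    if pvc_spec.get("volumeName") != pv_name:
        problems.append(
            f"PVC references volume {pvc_spec.get('volumeName')!r}, but PV is named {pv_name!r}"
        )
    # Both objects should agree on the storage class.
    if pvc_spec.get("storageClassName") != pv_spec.get("storageClassName"):
        problems.append("PVC and PV use different storage classes")
    # Both objects should report the Bound phase.
    for label, obj in (("PVC", pvc_data), ("PV", pv_data)):
        phase = obj.get("status", {}).get("phase")
        if phase != "Bound":
            problems.append(f"{label} phase is {phase!r}, expected 'Bound'")
    return problems


sample_pvc = {
    "metadata": {"name": "test-pvc", "namespace": "default"},
    "spec": {"storageClassName": "csi-baremetal-sc", "volumeName": "test-pv"},
    "status": {"phase": "Bound"},
}
sample_pv = {
    "metadata": {"name": "test-pv"},
    "spec": {"storageClassName": "csi-baremetal-sc"},
    "status": {"phase": "Bound"},
}

print(check_pvc_pv_binding(sample_pvc, sample_pv) or "PVC and PV look consistent")
```

In the mock data everything lines up, which suggests the binding itself is healthy and the fault lies further down the stack.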


In [11]:
# Examine the system information related to the volume
print("System Volume Diagnostics:")
volume_diag = collected_info['system_info']['volume_diagnostics']

print("\nMount Info:")
print(f"Device: {volume_diag['mount_info']['device']}")
print(f"Mountpoint: {volume_diag['mount_info']['mountpoint']}")
print(f"Type: {volume_diag['mount_info']['type']}")
print(f"Options: {volume_diag['mount_info']['options']}")

print("\nXFS Repair Check:")
print(f"Status: {volume_diag['xfs_repair_check']['status']}")
print(f"Repair Recommended: {volume_diag['xfs_repair_check']['repair_recommended']}")
print("Errors Found:")
for error in volume_diag['xfs_repair_check']['errors_found']:
    print(f"  - {error}")

print("\nI/O Stats:")
print(f"Read Ops: {volume_diag['io_stats']['read_ops']}")
print(f"Write Ops: {volume_diag['io_stats']['write_ops']}")
print(f"Errors: {volume_diag['io_stats']['errors']}")

System Volume Diagnostics:

Mount Info:
Device: /dev/mapper/volume-123-456
Mountpoint: /var/lib/kubelet/pods/pod-123-456/volumes/kubernetes.io~csi/test-pv/mount
Type: xfs
Options: rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota

XFS Repair Check:
Status: error
Repair Recommended: True
Errors Found:
  - Inode 1234 has corrupt core.mode
  - Inode 5678 has corrupt core.size
  - Filesystem has corrupt metadata

I/O Stats:
Read Ops: 12345
Write Ops: 23456
Errors: 123
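
The raw I/O counters are easier to compare across volumes when expressed as a rate. The sketch below computes errors per operation from the `io_stats` values shown above. `io_error_rate` is a hypothetical helper, and it assumes that `errors` is not already included in `read_ops` or `write_ops`, which the mock data does not specify.

```python
from typing import Dict


def io_error_rate(io_stats: Dict[str, int]) -> float:
    """Return errors divided by total read and write operations."""
    total_ops = io_stats.get("read_ops", 0) + io_stats.get("write_ops", 0)
    if total_ops == 0:
        return 0.0  # avoid dividing by zero on an idle volume
    return io_stats.get("errors", 0) / total_ops


stats = {"read_ops": 12345, "write_ops": 23456, "errors": 123}
rate = io_error_rate(stats)
print(f"I/O error rate: {rate:.2%}")  # prints "I/O error rate: 0.34%"
```

This is only a rough indicator; what counts as an acceptable rate depends on the workload and the hardware.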


In [12]:
# Examine the detected issues
print("Detected Issues:")
issues = collected_info['issues']

for i, issue in enumerate(issues, 1):
    print(f"\nIssue {i}: {issue['message']}")
    print(f"Severity: {issue['severity']}")
    print(f"Category: {issue['category']}")
    print(f"Details: {issue['details']}")
    print(f"Related Entities: {', '.join(issue['related_entities'])}")

Detected Issues:

Issue 1: XFS filesystem corruption detected on volume test-pv
Severity: critical
Category: filesystem
Details: XFS metadata corruption found during filesystem check. This can lead to I/O errors and data loss.
Related Entities: gnode:PV:test-pv, gnode:Pod:default/test-pod

Issue 2: Multiple I/O errors detected on drive /dev/sda
Category: hardware
Details: The drive has reported multiple read failures which may indicate hardware degradation.
Related Entities: gnode:Volume:default/volume-123-456, gnode:Node:worker-1
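
A per-severity breakdown like the "Issues by severity" section of the graph summary can be rebuilt from the issue list itself. The sketch below is one way to do that; `count_by_severity` is a hypothetical helper. It uses `.get` with an `"unknown"` fallback so an issue without a `severity` key is still counted rather than raising `KeyError`. The sample issues reuse only the messages and categories printed above, and the second one omits `severity` to exercise the fallback.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_by_severity(issues: Iterable[Dict[str, Any]]) -> Counter:
    """Count issues per severity, bucketing missing values as 'unknown'."""
    return Counter(issue.get("severity", "unknown") for issue in issues)


sample_issues = [
    {
        "message": "XFS filesystem corruption detected on volume test-pv",
        "severity": "critical",
        "category": "filesystem",
    },
    {
        # No "severity" key here, to show the fallback bucket.
        "message": "Multiple I/O errors detected on drive /dev/sda",
        "category": "hardware",
    },
]

severity_counts = count_by_severity(sample_issues)
for severity, count in severity_counts.most_common():
    print(f"  - {severity}: {count}")
```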


## Summary

Phase0 (Information Collection) gathers diagnostic information about the Kubernetes cluster, focusing on the pod with volume I/O errors, and builds the Knowledge Graph that the subsequent phases rely on.

In this notebook, we demonstrated:

1. How the Information Collection phase is initialized and executed
2. How the ComprehensiveInformationCollector gathers data from various sources
3. How the Knowledge Graph is built and populated with entities and relationships
4. How issues are detected and added to the Knowledge Graph
5. How the collected information is formatted and returned for use in subsequent phases

The output of Phase0 includes:

1. A Knowledge Graph with system entities and relationships
2. Collected diagnostic information (pod info, PVC info, PV info, node info, etc.)
3. Issues detected during information collection

This information serves as the foundation for the Plan Phase, which will generate an Investigation Plan based on the collected data.
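
If the collected information needs to be saved or passed between processes, the `knowledge_graph` entry is a Python object and will generally not serialize to JSON directly. The sketch below shows one possible approach, assuming the remaining entries are plain dictionaries and lists. `to_serializable` is a hypothetical helper, and the sample dictionary is a trimmed-down stand-in for the real `collected_info`.

```python
import json
from typing import Any, Dict


def to_serializable(collected_info: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the in-memory graph object and keep the plain-data entries."""
    return {
        key: value
        for key, value in collected_info.items()
        if key != "knowledge_graph"
    }


# Trimmed-down stand-in for collected_info; object() plays the graph's role.
sample_info = {
    "pod_info": {"default/test-pod": {"status": {"phase": "Running"}}},
    "issues": [{"message": "XFS filesystem corruption detected on volume test-pv"}],
    "knowledge_graph": object(),
}

# default=str is a fallback for values json cannot encode natively.
payload = json.dumps(to_serializable(sample_info), indent=2, default=str)
print(payload)
```

The issues are still carried in the `issues` entry, so dropping the graph object does not lose the detected problems, only the entity relationships.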