# Triage data gathering

- **Pod Status Check**:
    - `Tool`: `K8sAPI.get_problematic_pods()`
    - `Output`: Pods with pending states, container errors, high restarts (>3)
- **Trace Analysis**:
    - `Tool`: `JaegerAPI.get_processed_traces()` and `JaegerAPI.get_slow_traces()`
    - `Output`: Traces with errors or high latency, including service sequences
- **Metrics Analysis**:
    - `Tool`: `PrometheusAPI.get_pod_triage_metrics()`
    - `Output`: Pods with thread saturation (>95%), high CPU load (>10.0), or network errors

In [None]:
from dotenv import load_dotenv
import os
# Get the path to the root directory of the repository
root_dir = os.path.abspath(os.path.join(os.getcwd(), '../..'))

# Load environment variables from .env file in the root directory
load_dotenv(os.path.join(root_dir, '.env'))

In [None]:
import sys

# Add MCP-server to path
mcp_server_path = os.path.abspath(os.path.join(os.getcwd(), '../../MCP-server'))
sys.path.insert(0, mcp_server_path)

### Kubernetes API

The `get_problematic_pods()` method identifies unhealthy pods in a Kubernetes namespace using four key heuristics:

1. **Pod Pending State**
- **Detection**: Pod is stuck in `Pending` phase with no container statuses
- **Meaning**: The pod cannot be scheduled or is waiting for resources
- **Common causes**: Insufficient cluster resources (CPU/memory), node selector mismatches, persistent volume issues, or scheduling constraints

2. **Container Waiting State**
- **Detection**: Container is in a `Waiting` state (not yet running)
- **Meaning**: The container cannot start properly
- **Common causes**: 
  - `ImagePullBackOff`: Cannot pull the container image (authentication, wrong tag, or network issues)
  - `CrashLoopBackOff`: Container starts but immediately crashes
  - `CreateContainerError`: The container runtime failed to create the container

3. **Container Terminated with Error**
- **Detection**: Container terminated with non-zero exit code
- **Meaning**: The container exited abnormally
- **Common causes**: Application crash, failed health checks, out of memory (OOMKilled), or configuration errors
- **Captures**: Exit code, reason, message, and restart count

4. **High Restart Count (Crash Loop)**
- **Detection**: Container has restarted more than 3 times
- **Threshold**: `restart_count > 3` (filters out transient single restarts)
- **Meaning**: Container is repeatedly crashing and restarting
- **Common causes**: Application bugs, missing dependencies, memory leaks, or failed liveness probes
- **Note**: Can appear even when container state is "Running" (between crashes)
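
The four heuristics above can be sketched on a plain dictionary. This is a minimal illustration, not the actual implementation of `get_problematic_pods()`: the function name `find_pod_problems`, the shape of the returned records, and the assumption that a pod arrives as a dict with `status.phase` and `status.containerStatuses` (mirroring the Kubernetes pod status JSON) are all hypothetical.

```python
RESTART_THRESHOLD = 3  # from the text: restart_count > 3


def find_pod_problems(pod):
    """Apply the four heuristics to one pod given as a plain dict (hypothetical shape)."""
    status = pod.get("status", {})
    container_statuses = status.get("containerStatuses") or []
    problems = []

    # 1. Pod is Pending and no container statuses have been reported
    if status.get("phase") == "Pending" and not container_statuses:
        problems.append({"type": "pending"})

    for cs in container_statuses:
        name = cs.get("name")
        state = cs.get("state", {})
        restarts = cs.get("restartCount", 0)

        # 2. Container is waiting (e.g. ImagePullBackOff, CrashLoopBackOff)
        waiting = state.get("waiting")
        if waiting:
            problems.append(
                {"type": "waiting", "container": name, "reason": waiting.get("reason")}
            )

        # 3. Container terminated with a non-zero exit code
        terminated = state.get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            problems.append(
                {
                    "type": "terminated",
                    "container": name,
                    "exit_code": terminated.get("exitCode"),
                    "reason": terminated.get("reason"),
                    "message": terminated.get("message"),
                    "restart_count": restarts,
                }
            )

        # 4. High restart count, checked independently of the current state
        if restarts > RESTART_THRESHOLD:
            problems.append(
                {"type": "high_restarts", "container": name, "restart_count": restarts}
            )

    return problems


# A container that is currently Running but has already restarted 5 times
crashing_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "cart", "restartCount": 5, "state": {"running": {}}}
        ],
    }
}
print(find_pod_problems(crashing_pod))
```

Because the restart check does not look at the current state, the example pod is flagged even though its container is running, which matches the note on heuristic 4.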

In [None]:
from api.k8s_api import K8sAPI

# Create the Kubernetes API client
k8s_api = K8sAPI()

# Get all services and pods
services = k8s_api.get_services_list()
pods = k8s_api.get_pods_list()

# Get the pods flagged by the heuristics described above
problematic_pods = k8s_api.get_problematic_pods()

In [None]:
problematic_pods

### Jaeger API (Trace Analysis)

The Jaeger API provides methods to identify problematic traces using server-side filtering for efficiency.

**1. `get_processed_traces(service, only_errors=False)`**
Returns all traces for a service with optional error filtering.

**2. `get_slow_traces(service, min_duration_ms, only_errors=False)`**
Returns only traces exceeding a latency threshold, sorted by slowest first.
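
To illustrate what server-side filtering could look like, the sketch below builds query parameters for a trace search and sorts the results slowest first. It assumes the HTTP endpoint that the Jaeger UI itself queries (`/api/traces`) and its `service`, `minDuration`, `tags`, and `limit` parameters; whether `JaegerAPI` uses this endpoint is not stated here. The helper names and the `duration_ms` key on each trace are hypothetical.

```python
import json


def build_trace_query(service, min_duration_ms=None, only_errors=False, limit=20):
    """Build query parameters so the filtering happens on the server (assumed names)."""
    params = {"service": service, "limit": limit}
    if min_duration_ms is not None:
        # Assumption: durations are passed as strings with a unit suffix
        params["minDuration"] = f"{min_duration_ms}ms"
    if only_errors:
        # Assumption: error spans carry the tag error=true
        params["tags"] = json.dumps({"error": "true"})
    return params


def sort_slowest_first(traces):
    """Order traces by duration, longest first (hypothetical `duration_ms` key)."""
    return sorted(traces, key=lambda trace: trace["duration_ms"], reverse=True)


params = build_trace_query("frontend", min_duration_ms=2000, only_errors=True)
print(params)

sample = [
    {"trace_id": "a", "duration_ms": 2100},
    {"trace_id": "b", "duration_ms": 5400},
]
print([t["trace_id"] for t in sort_slowest_first(sample)])
```

Pushing the duration and error filters into the query keeps the response small, which is the efficiency gain the text refers to; only the final ordering is done on the client.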

In [None]:
from api.jaeger_api import JaegerAPI

# Create the Jaeger API client
jaeger_api = JaegerAPI()

# Traces that contain errors
problematic_traces = jaeger_api.get_processed_traces(service="frontend", only_errors=True)

# Traces that take more than 2 seconds
slow_traces = jaeger_api.get_slow_traces(service="frontend", min_duration_ms=2000)

In [None]:
problematic_traces

In [None]:
slow_traces

### Prometheus API (Metrics Analysis)

The Prometheus API provides methods to analyze pod health using resource metrics and detect anomalies.

**`get_pod_triage_metrics(pod_name)`**
Performs a quick triage from instantaneous metrics that are available for every container, so it does not need the pod's resource requests or limits.

**Heuristics:**

**1. Thread Saturation**
- **Detection**: `container_threads / container_threads_max > 0.95`
- **Threshold**: 95% of maximum threads
- **Meaning**: The container is running out of available threads
- **Impact**: Application may hang, reject new requests, or crash

**2. High CPU Load**
- **Detection**: `container_cpu_load_average_10s > 10.0`
- **Threshold**: The 10-second CPU load average exceeds 10.0
- **Meaning**: The CPU is likely saturated and struggling to keep up with demand
- **Impact**: High latency, slow response times, request timeouts


**3. Network Errors & Packet Drops**
- **Detection**: Any of the following metrics > 1:
  - `container_network_receive_errors_total`: Network receive errors
  - `container_network_transmit_errors_total`: Network transmit errors
  - `container_network_receive_packets_dropped_total`: Dropped received packets
  - `container_network_transmit_packets_dropped_total`: Dropped transmitted packets
- **Threshold**: More than 1 occurrence during pod lifetime
- **Meaning**: The pod has experienced network connectivity issues
- **Impact**: Request failures, data loss, connectivity problems
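
The three checks above can be sketched as a pure function over already-fetched metric values. This is an illustration under stated assumptions, not the actual code of `get_pod_triage_metrics()`: the function name `triage_pod_metrics`, the `metrics` input dict, and every report key other than `is_anomalous` are hypothetical.

```python
THREAD_SATURATION_THRESHOLD = 0.95  # from the text: > 95% of maximum threads
CPU_LOAD_THRESHOLD = 10.0           # from the text: load average > 10.0
NETWORK_ERROR_THRESHOLD = 1         # from the text: more than 1 occurrence

NETWORK_METRICS = (
    "container_network_receive_errors_total",
    "container_network_transmit_errors_total",
    "container_network_receive_packets_dropped_total",
    "container_network_transmit_packets_dropped_total",
)


def triage_pod_metrics(pod_name, metrics):
    """Evaluate the three heuristics on a dict of metric name -> latest value."""
    anomalies = []

    # 1. Thread saturation; skipped when the maximum is missing or zero,
    #    which also avoids a division by zero
    threads = metrics.get("container_threads")
    threads_max = metrics.get("container_threads_max")
    if threads is not None and threads_max:
        ratio = threads / threads_max
        if ratio > THREAD_SATURATION_THRESHOLD:
            anomalies.append({"type": "thread_saturation", "value": ratio})

    # 2. High CPU load
    cpu_load = metrics.get("container_cpu_load_average_10s")
    if cpu_load is not None and cpu_load > CPU_LOAD_THRESHOLD:
        anomalies.append({"type": "high_cpu_load", "value": cpu_load})

    # 3. Network errors and packet drops
    for name in NETWORK_METRICS:
        value = metrics.get(name, 0)
        if value > NETWORK_ERROR_THRESHOLD:
            anomalies.append({"type": "network_errors", "metric": name, "value": value})

    return {"pod": pod_name, "is_anomalous": bool(anomalies), "anomalies": anomalies}


report = triage_pod_metrics(
    "cart-0",
    {
        "container_threads": 96,
        "container_threads_max": 100,
        "container_cpu_load_average_10s": 2.5,
        "container_network_receive_errors_total": 0,
    },
)
print(report["is_anomalous"], [a["type"] for a in report["anomalies"]])
```

All comparisons are strict, so a pod sitting exactly at a threshold (for example 95 of 100 threads) is not flagged.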

In [None]:
from api.prometheus_api import PrometheusAPI

# Create the Prometheus API client
prometheus_api = PrometheusAPI()

# Collect the triage reports of pods flagged as anomalous
problematic_pods_metrics = []

for pod in pods:
    triage_metric_report = prometheus_api.get_pod_triage_metrics(pod)
    if triage_metric_report["is_anomalous"]:
        problematic_pods_metrics.append(triage_metric_report)

In [None]:
problematic_pods_metrics