# **How to Run (Notebook Version)**

**1. Clone the repo or upload the files to Colab or Vertex AI Workbench.**

**2. Make sure you have:**

*   A GCP project
*   The Vertex AI API enabled
*   Gemini 2.5 Flash available in your region

**3. Install dependencies in the notebook**

In [None]:
!pip install gcsfs



In [None]:
!pip install python-docx

import gcsfs
import json
from docx import Document
import os

fs = gcsfs.GCSFileSystem()

path = "gs://demo-incidents/Incident.docx"

incidents = []

# Create a temporary local file to write the GCS content
local_file_path = "temp_incident.docx"
with fs.open(path, "rb") as gcs_file:
    with open(local_file_path, "wb") as local_file:
        local_file.write(gcs_file.read())

document = Document(local_file_path)
full_text = []
for para in document.paragraphs:
    if para.text.strip(): # Only add non-empty paragraphs
        full_text.append(para.text.strip())

# Store the full text as a single incident.
# If the DOCX holds several incidents, split the text here according to the document's structure.
if full_text:
    incidents.append({"content": "\n".join(full_text)})

# Clean up the temporary local file
os.remove(local_file_path)

len(incidents)



1
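
The cell above loads the whole document as one incident. If the DOCX contains several incidents, the paragraphs could be grouped before they are sent to Gemini. The following is a minimal sketch, assuming each incident starts with a paragraph such as `Incident 12:`; that marker, `INCIDENT_HEADER`, and `split_incidents` are hypothetical and need adapting to the real document layout.

```python
import re

# Hypothetical marker: a paragraph such as "Incident 12:" or "Incident INC-0042:"
# starts a new incident. Adjust the pattern to the real DOCX layout.
INCIDENT_HEADER = re.compile(r"^Incident\s+[\w-]+\s*:", re.IGNORECASE)

def split_incidents(paragraphs):
    """Group paragraphs into incidents, starting a new one at each header line."""
    grouped = []
    current = []
    for text in paragraphs:
        # A header closes the incident collected so far (if any)
        if INCIDENT_HEADER.match(text) and current:
            grouped.append({"content": "\n".join(current)})
            current = []
        current.append(text)
    if current:
        grouped.append({"content": "\n".join(current)})
    return grouped

sample_paragraphs = [
    "Incident 1: DB timeouts",
    "Service A could not reach the database.",
    "Incident 2: Disk full",
    "Cleanup job failed on node-7.",
]
split_result = split_incidents(sample_paragraphs)
print(len(split_result))
```

In the notebook, the `full_text` list built from `document.paragraphs` would take the place of `sample_paragraphs`.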

**4. Initialize Vertex AI**

In [None]:
# Install or upgrade the Vertex AI SDK (google-cloud-aiplatform)

!pip install google-cloud-aiplatform --upgrade



**5. Run the notebook cells in order:**

*   Load incidents
*   Call Gemini to enrich and summarize
*   Cluster incidents
*   Generate RCA per cluster
*   Export final_aiops_report.json

**6. After execution, download final_aiops_report.json from the notebook file browser.**

In [None]:
# Import Gemini + initialize Vertex AI

from google.cloud import aiplatform

aiplatform.init(
    project = "my-sentiment-478307",
    location = "us-central1"
)

from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.5-flash")

from google.cloud.aiplatform.utils import gcs_utils


In [None]:
# Build the function to send one incident to Gemini

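def analyze_incident(incident):
  # Each incident is a dict like {"content": "..."}; pass only its text to the model
  incident_text = incident["content"] if isinstance(incident, dict) else str(incident)
  prompt = f""" You are an AI assistant that analyzes IT incidents.

  INCIDENT:
  {incident_text}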

  Task:
  1. Classify severity: Critical / High / Medium / Low
  2. Identify root-cause category (Server, Network, DB, Application, Infra, Cloud)
  3. Suggest best next action
  4. Is this correlated to any known common issue? (Y/N)
  5. Provide correlation signature (short text)

  Return JSON Only:
  {{
    "severity": "",
    "category": "",
    "next_action": "",
    "correlation": "",
    "signature": ""
  }}
  """
  response = model.generate_content(prompt)
  return response.text
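
The prompt asks for JSON only, but the model's reply is free text, so it may be worth validating it before use. Below is a minimal sketch, assuming the key names and severity values from the prompt above; `parse_analysis`, `REQUIRED_KEYS`, and `ALLOWED_SEVERITIES` are hypothetical names, not part of the Vertex AI SDK.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker that sometimes wraps the reply
ALLOWED_SEVERITIES = {"Critical", "High", "Medium", "Low"}  # Values listed in the prompt
REQUIRED_KEYS = {"severity", "category", "next_action", "correlation", "signature"}

def parse_analysis(raw_text):
    """Return the parsed dict, or None if the reply is not the expected JSON object."""
    # Drop fence markers, with or without the "json" language tag
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw_text).strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    if data["severity"] not in ALLOWED_SEVERITIES:
        return None
    return data

sample_reply = (
    FENCE + "json\n"
    '{"severity": "High", "category": "DB", "next_action": "Check connection pool", '
    '"correlation": "Y", "signature": "Downstream DB Timeouts"}\n' + FENCE
)
parsed = parse_analysis(sample_reply)
print(parsed["signature"])
```

Returning `None` instead of raising keeps a single malformed reply from stopping a batch run; the caller decides whether to skip or retry that incident.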

In [None]:
# Send all incidents to Gemini

all_incident_results = []
for i, incident_data in enumerate(incidents):
  print(f"Processing incident {i+1}/{len(incidents)}")
  analysis = analyze_incident(incident_data)
  all_incident_results.append(analysis)

Processing incident 1/1
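
The loop above stops at the first failed model call. For larger batches, a small retry wrapper might make the run more resilient to transient errors such as quota limits or timeouts. This is a sketch only: `call_with_retry` and its parameters are hypothetical, and the demo uses a stand-in function rather than a real Gemini call.

```python
import time

def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on an exception, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception as exc:  # Broad on purpose for a sketch; narrow it in real use
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Stand-in for a model call that fails twice before succeeding
calls = {"count": 0}

def flaky(value):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return value.upper()

result = call_with_retry(flaky, "ok", sleep=lambda _: None)  # No real waiting in the demo
print(result)
```

In the notebook loop, the call would look like `call_with_retry(analyze_incident, incident_data)`.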


In [None]:
import json

with open("gemini_incident_analysis.jsonl", "w") as f:
  for r in all_incident_results:
    f.write(r + "\n")
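
Each raw Gemini reply can span several lines, so the file written above does not hold exactly one record per line. If a strict line-delimited file is wanted, each reply could be wrapped in a small JSON object first. A minimal sketch, assuming hypothetical helper names `write_jsonl` and `read_jsonl` and a temporary demo path:

```python
import json
import os
import tempfile

def write_jsonl(path, raw_replies):
    """Write one JSON object per line; json.dumps escapes embedded newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for index, raw in enumerate(raw_replies):
            f.write(json.dumps({"index": index, "raw": raw}) + "\n")

def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

demo_replies = ['{\n  "severity": "High"\n}', "not json at all"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_analysis.jsonl")
write_jsonl(demo_path, demo_replies)
records = read_jsonl(demo_path)
print(len(records))
```

Storing the untouched reply under `raw` keeps unparseable outputs available for later inspection.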

In [None]:
import json
import re # Import regex for removing markdown fences
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Process all_incident_results to parse the JSON strings
processed_gemini_outputs = []
for raw_gemini_output in all_incident_results:
    # Remove markdown code block fences (e.g., ```json\n...\n```)
    clean_json_str = re.sub(r'```json\n|```', '', raw_gemini_output).strip()
    try:
        parsed_data = json.loads(clean_json_str)
        # Gemini might return a list of objects or a single object
        if isinstance(parsed_data, list):
            processed_gemini_outputs.extend(parsed_data)
        else:
            processed_gemini_outputs.append(parsed_data)
    except json.JSONDecodeError as e:
        print(f"Warning: Failed to decode JSON from Gemini output: {e}. Data: {clean_json_str[:100]}...")
        # If parsing fails, you might want to skip this item or add a placeholder
        continue

# Extract text for clustering. 'signature' from Gemini's analysis seems appropriate.
# Keep only dict items with a non-empty signature so labels stay aligned with items.
clusterable_items = [
    item for item in processed_gemini_outputs
    if isinstance(item, dict) and str(item.get("signature", "")).strip()
]
incident_texts = [str(item["signature"]) for item in clusterable_items]

labels = []  # Stays empty when there is nothing to cluster

# Ensure there's enough data to cluster
if len(incident_texts) == 0:
    print("No valid incident texts (signatures) found for clustering.")
elif len(incident_texts) == 1:
    labels = [0]  # A single incident forms its own cluster
else:
    # KMeans requires n_samples >= n_clusters, so cap the demo value of 3
    num_clusters = min(3, len(incident_texts))
    if num_clusters < 3:
        print(f"Warning: Only {len(incident_texts)} incident texts found. Adjusting number of clusters.")

    # Vectorize signatures
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(incident_texts)

    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10) # n_init for newer KMeans
    labels = kmeans.fit_predict(X)

# Attach clusters to the items that were actually clustered
for item, label in zip(clusterable_items, labels):
    item["cluster"] = int(label)

print("\n=== CLUSTERED INCIDENTS ===\n")
for item in processed_gemini_outputs:
    text_to_display = item.get("signature", str(item)) # Display signature or full item if no signature
    cluster_info = item.get("cluster", "N/A")
    print(f"Cluster {cluster_info} \u2192 {text_to_display}")

# Update all_incident_results to contain the processed and clustered dictionaries
all_incident_results = processed_gemini_outputs


=== CLUSTERED INCIDENTS ===

Cluster 0 → Downstream DB Timeouts
Cluster 2 → High Disk Usage / Failed Cleanup Job
Cluster 0 → Application High CPU & OOM Error
Cluster 1 → OOMKilled Pod Restart Loop
Cluster 0 → Cache Eviction Thrashing / Connection Reset
Cluster 0 → InvalidTokenException Authentication Failures
Cluster 0 → Inter-Region Network Packet Loss
Cluster 0 → Application High CPU & Auto-scale Failure
Cluster 2 → High Disk Usage / Failed Cleanup Job
Cluster 0 → Downstream DB Timeouts
Cluster 1 → OOMKilled Pod Restart Loop
Cluster 0 → Application High CPU & ThreadPool Exhaustion
Cluster 0 → InvalidTokenException Authentication Failures
Cluster 0 → Cache Eviction Thrashing
Cluster 0 → Downstream DB Timeouts
Cluster 2 → High Disk Usage / Failed Cleanup Job
Cluster 0 → Downstream Cluster Timeouts
Cluster 1 → OOMKilled Pod Restart Loop
Cluster 0 → Application High CPU & Query Timeout
Cluster 0 → InvalidTokenException Authentication Failures
Cluster 2 → High Disk Usage / Failed Cleanup
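
In the output above, cluster 0 holds 14 of the 21 incidents and mixes quite different signatures, which is easier to see in a per-cluster summary. The following sketch counts signatures per cluster; `summarize_clusters` is a hypothetical helper, and the demo items are a small hand-picked subset of the signatures shown above.

```python
from collections import Counter, defaultdict

def summarize_clusters(items):
    """Return {cluster_id: Counter of signatures} for items that carry both fields."""
    summary = defaultdict(Counter)
    for item in items:
        if "cluster" in item and "signature" in item:
            summary[item["cluster"]][item["signature"]] += 1
    return dict(summary)

demo_items = [
    {"signature": "Downstream DB Timeouts", "cluster": 0},
    {"signature": "OOMKilled Pod Restart Loop", "cluster": 1},
    {"signature": "Downstream DB Timeouts", "cluster": 0},
    {"signature": "High Disk Usage / Failed Cleanup Job", "cluster": 2},
]
cluster_summary = summarize_clusters(demo_items)
for cluster_id, counts in sorted(cluster_summary.items()):
    total = sum(counts.values())
    top_signature, top_count = counts.most_common(1)[0]
    print(f"Cluster {cluster_id}: {total} incidents, most common: {top_signature} ({top_count})")
```

Running it on `all_incident_results` after the clustering cell would show how uneven the clusters are, which can inform whether `n_clusters=3` suits the data.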

In [None]:
# Build Auto-Root-Cause Analysis
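cluster_map = {}
for r in all_incident_results:
    # Skip items that were not clustered or have no signature
    if "cluster" not in r or "signature" not in r:
        continue
    cluster_map.setdefault(r["cluster"], []).append(r["signature"])

rca_results = {}

# Use a distinct loop name so the global `incidents` list is not overwritten
for cluster_id, cluster_signatures in cluster_map.items():
    prompt = f"""
    These incidents appear to be correlated:
    {cluster_signatures}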

    Give:
    1) Probable Root Cause
    2) Recommended Fix
    3) Monitoring Improvement Suggestions
    """

    response = model.generate_content(prompt)
    rca_results[cluster_id] = response.text

print("\n=== ROOT CAUSE ANALYSIS (AI GENERATED) ===\n")
for cid, text in rca_results.items():
    print(f"Cluster {cid}: \n{text}\n{'-'*80}")


=== ROOT CAUSE ANALYSIS (AI GENERATED) ===

Cluster 0: 
Based on the correlated incidents, here's a breakdown:

---

### 1) Probable Root Cause

The confluence of `Inter-Region Network Packet Loss`, various `Downstream Timeouts` (DB, Cluster, Infra), and widespread `Application High CPU` issues (leading to OOM, ThreadPool Exhaustion, and scaling failures) points to a systemic bottleneck.

**Primary Root Cause: Widespread Inter-Region Network Instability and/or Resource Exhaustion Leading to Cascade Failures.**

**Explanation:**
1.  **Inter-Region Network Packet Loss:** This is a critical indicator. If there's significant packet loss between regions, it will directly impact the performance and reliability of any service that communicates across these boundaries. This would immediately explain `Downstream DB Timeouts`, `Downstream Cluster Timeouts`, and `Downstream Infra Connection Reset / Timeouts` if those resources are in a different region or if the application needs to cross region

In [None]:
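# SRE Traits - Add severity prediction
# Note: this overwrites the "severity" value from analyze_incident with Gemini's P1/P2/P3 prediction and justification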

for r in all_incident_results:
    prompt = f"""
    Incident: {r['signature']}
    Predict severity (P1/P2/P3) and justify.
    """

    response = model.generate_content(prompt)
    r["severity"] = response.text.strip()
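
The severity cell stores Gemini's full free-text answer, justification included. If only the priority label is needed for filtering or sorting, it could be extracted with a regular expression. A minimal sketch, assuming the reply contains the label as a standalone token; `extract_priority` is a hypothetical helper and the demo reply is made up.

```python
import re

# Match P1, P2, or P3 as a standalone token (not part of a longer word or number)
PRIORITY_PATTERN = re.compile(r"\bP([123])\b")

def extract_priority(reply_text):
    """Return 'P1', 'P2', or 'P3' from a free-text reply, or None if absent."""
    match = PRIORITY_PATTERN.search(reply_text)
    return f"P{match.group(1)}" if match else None

demo_reply = "**Severity: P2**\n\nJustification: a single service is degraded."
priority = extract_priority(demo_reply)
print(priority)
```

The first match wins, so a reply that mentions several labels yields the one stated first.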

In [None]:
import json

output = {
    "incidents": all_incident_results,
    "rca": rca_results
}

with open("final_aiops_report.json", "w") as f:
    json.dump(output, f, indent=4)
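
When the report is read back, note that JSON object keys are always strings, so the integer cluster IDs used as keys in `rca_results` come back as `"0"`, `"1"`, and so on. The sketch below shows this round trip on a small made-up report; `load_report` and `incidents_per_cluster` are hypothetical helpers, and the demo writes to a temporary file rather than `final_aiops_report.json`.

```python
import json
import os
import tempfile
from collections import Counter

def load_report(path):
    """Load a report written with json.dump."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def incidents_per_cluster(report):
    """Count incidents per cluster, using string IDs to match the "rca" keys."""
    return Counter(str(item.get("cluster", "N/A")) for item in report["incidents"])

# Made-up report with the same top-level shape as the export cell
demo_report = {
    "incidents": [
        {"signature": "Downstream DB Timeouts", "cluster": 0},
        {"signature": "OOMKilled Pod Restart Loop", "cluster": 1},
        {"signature": "Downstream DB Timeouts", "cluster": 0},
    ],
    "rca": {0: "placeholder RCA text", 1: "placeholder RCA text"},
}
demo_path = os.path.join(tempfile.mkdtemp(), "demo_report.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump(demo_report, f, indent=4)

loaded = load_report(demo_path)
rca_keys = sorted(loaded["rca"])  # Keys are strings after the round trip
cluster_counts = incidents_per_cluster(loaded)
print(rca_keys, dict(cluster_counts))
```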