Input flow

1. Load a CSV of incidents
2. Embed the RCA summary field with sentence-transformers
3. Cluster the embeddings (unsupervised)
4. When a question is asked:
   a. Embed the question (semantic intent)
   b. Retrieve the best-matching cluster/data slice
   c. Run stats + summarization
   d. Return a GenAI-style summary

In [1]:
# pip install pandas sentence-transformers scikit-learn faiss-cpu transformers llama-cpp-python

In [2]:
# Load data + Embed RCA summaries

import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("production_grade_incident_rcas.csv")
model = SentenceTransformer("all-MiniLM-L6-v2")
df['embedding'] = model.encode(df['rca_summary'].tolist()).tolist()

In [3]:
# Perform unsupervised clustering

from sklearn.cluster import KMeans

X = list(df['embedding'])
kmeans = KMeans(n_clusters=5, random_state=42).fit(X)
df['cluster'] = kmeans.labels_
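
The cell above fixes n_clusters=5 without checking whether five groups suit the data. One common way to sanity-check that choice is to compare mean silhouette scores over a range of k. The sketch below is an illustration, not part of the original notebook: the helper name `pick_k_by_silhouette` and the synthetic `toy_X` array are assumptions standing in for the real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, k_values, random_state=42):
    """Return (best_k, scores), where scores maps each k to its mean silhouette score."""
    X = np.asarray(X)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for the embeddings: three tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_k, scores = pick_k_by_silhouette(toy_X, k_values=range(2, 7))
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

On the real data the call would look like pick_k_by_silhouette(np.array(df['embedding'].tolist()), range(2, 11)). Silhouette is only one heuristic, and sentence embeddings often give fairly flat scores, so the result is a hint rather than a verdict.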

In [4]:
# Ask a question -> match to cluster

from sklearn.metrics.pairwise import cosine_similarity

def question_to_cluster(query, top_n=5):
    """Return the top_n rows of the cluster whose rows are, on average, most similar to the query."""
    query_vec = model.encode([query])
    similarities = cosine_similarity(query_vec, X)
    df['similarity'] = similarities.flatten()  # note: overwrites this column on the shared df
    best_cluster = df.groupby('cluster')['similarity'].mean().idxmax()
    return df[df['cluster'] == best_cluster].sort_values(by='similarity', ascending=False).head(top_n)
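
question_to_cluster ranks rows by cosine similarity after picking a cluster. The ranking step on its own can be sketched in plain NumPy, which may help when checking what the retrieval returns independent of clustering. The helper `top_n_by_cosine` and the toy vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def top_n_by_cosine(query_vec, matrix, n=5):
    """Return (indices, scores) of the n rows of `matrix` most similar to `query_vec`."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float).ravel()
    # Normalise rows and query so that a dot product equals cosine similarity.
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = matrix_norm @ query_norm
    order = np.argsort(-scores)[:n]  # indices sorted by descending similarity
    return order, scores[order]

# Toy 2-D "embeddings"; real ones from all-MiniLM-L6-v2 have many more dimensions.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
idx, sims = top_n_by_cosine([1.0, 0.1], toy_embeddings, n=2)
print(idx.tolist(), sims.round(3).tolist())
```

Ranking all rows directly, rather than first restricting to one cluster, avoids missing relevant incidents that landed in a neighbouring cluster. Which approach works better depends on the data.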

In [5]:
# Summarize that cluster

from transformers import pipeline

summarizer = pipeline("summarization", model="google/flan-t5-base")

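def summarize_cluster(df_cluster, query):
    """Summarize the RCA summaries of the retrieved rows and prefix the result with the question."""
    combined = " ".join(df_cluster['rca_summary'].tolist())
    # Note: transformers warns when both max_length and max_new_tokens are set;
    # if the model's generation config sets max_new_tokens, it takes precedence over max_length.
    summary = summarizer(f"summarize: {combined}", max_length=100, min_length=30, do_sample=False)[0]['summary_text']
    return f"Answer to your question: {query}\n\nTop insights: {summary}"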
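
summarize_cluster joins every retrieved summary into one string. T5-style models are usually run with a limited input window (commonly 512 tokens), so long joined text may be truncated before the model sees it. A minimal sketch of one workaround is to group the summaries into batches under a word budget and summarize each batch separately. The helper name `batch_by_word_budget` and the `max_words` value are assumptions, and word count is only a rough proxy for token count.

```python
def batch_by_word_budget(texts, max_words=300):
    """Group texts into consecutive batches whose total word count stays within max_words.

    A single text longer than max_words is kept whole in its own batch.
    """
    batches, current, current_words = [], [], 0
    for text in texts:
        n_words = len(text.split())
        # Start a new batch when adding this text would exceed the budget.
        if current and current_words + n_words > max_words:
            batches.append(" ".join(current))
            current, current_words = [], 0
        current.append(text)
        current_words += n_words
    if current:
        batches.append(" ".join(current))
    return batches

toy = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota", "kappa"]
batches = batch_by_word_budget(toy, max_words=5)
print(batches)
```

Each batch could then be passed to summarizer(...) in turn and the partial summaries joined, or summarized once more.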

Device set to use cpu


In [6]:
# Combined call
question = "What are the common reasons for client-impacting issues?"
cluster_df = question_to_cluster(question)
print(summarize_cluster(cluster_df, question))

Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Answer to your question: What are the common reasons for client-impacting issues?

Top insights: External API latency caused user profile sync timeouts for ClientClient9 for around 92 minutes. External API earlyncy caused User Profile Sync latency for ClientCLint3 for around 45 minutes. The External API Latency Caused User Profile sync Timeouts For ClientCLinet3 For around 107 minutes.


In [7]:
question = "How many incidents are logged in last 10 days and what client is max impacted?"
cluster_df = question_to_cluster(question)
print(summarize_cluster(cluster_df, question))


Answer to your question: How many incidents are logged in last 10 days and what client is max impacted?

Top insights: A misconfigured firewall rule on the edge gateway blocked incoming traffic to all client-facing applications across regions. Monitoring alerts were delayed due to dependency on internal DNS, which was also down. Full platform outage lasted 154 minutes.
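
The summary above describes one outage but does not answer the counting part of the question. Counts like these are better computed from the table directly (step 4c of the flow, "Run stats"). The sketch below assumes two hypothetical columns, `incident_date` and `client`, which may be named differently in the real CSV, and uses a small hand-made DataFrame in place of it.

```python
import pandas as pd

def recent_incident_stats(df, days=10, date_col="incident_date", client_col="client", now=None):
    """Count incidents and impacted clients within the last `days` days before `now`."""
    now = pd.Timestamp.now() if now is None else pd.Timestamp(now)
    dates = pd.to_datetime(df[date_col])
    recent = df[(dates > now - pd.Timedelta(days=days)) & (dates <= now)]
    counts = recent[client_col].value_counts()
    return {
        "incident_count": int(len(recent)),
        "clients_impacted": int(counts.size),
        "most_impacted_client": counts.idxmax() if not counts.empty else None,
    }

# Hand-made rows; column names and values are illustrative only.
toy = pd.DataFrame({
    "incident_date": ["2024-05-01", "2024-05-12", "2024-05-14", "2024-05-15", "2024-05-18"],
    "client": ["ClientA", "ClientB", "ClientA", "ClientB", "ClientB"],
})
stats = recent_incident_stats(toy, days=10, now="2024-05-20")
print(stats)
```

The resulting numbers can be returned as-is or appended to the text handed to the summarizer, so the generated answer quotes computed figures instead of guessing them.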


In [8]:
question = "How many clients are impacted in last 10 days?"
cluster_df = question_to_cluster(question)
print(summarize_cluster(cluster_df, question))


Answer to your question: How many clients are impacted in last 10 days?

Top insights: A misconfigured firewall rule on the edge gateway blocked incoming traffic to all client-facing applications across regions. Monitoring alerts were delayed due to dependency on internal DNS, which was also down. Full platform outage lasted 154 minutes.


In [9]:
question = "What is the most reported type of incident?"
cluster_df = question_to_cluster(question)
print(summarize_cluster(cluster_df, question))


Answer to your question: What is the most reported type of incident?

Top insights: Report generation failed due to unhandled exception in newly added CSV export logic. Recovery took 154 minutes via hotfix. Report generation succeeded due to unexplained exception in previously released CSV output logic. Recovery took 147 minutes via Hotfix.
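
question_to_cluster picks the cluster closest in meaning to the question text, which is not the same as the cluster with the most incidents. For a "most reported" question, cluster sizes are a more direct signal. A minimal sketch, using a toy DataFrame with the same `cluster` and `rca_summary` column names as the notebook (the helper `largest_cluster` is an assumption):

```python
import pandas as pd

def largest_cluster(df, cluster_col="cluster", text_col="rca_summary", n_examples=2):
    """Return the id, size, share and a few example texts of the most populated cluster."""
    sizes = df[cluster_col].value_counts()
    top = sizes.idxmax()
    examples = df.loc[df[cluster_col] == top, text_col].head(n_examples).tolist()
    return {
        "cluster": int(top),
        "size": int(sizes.loc[top]),
        "share": float(sizes.loc[top] / len(df)),
        "examples": examples,
    }

# Toy labels and summaries; the real values come from KMeans and the CSV.
toy = pd.DataFrame({
    "cluster": [0, 1, 1, 2, 1],
    "rca_summary": ["dns outage", "api latency", "api timeout", "disk full", "api errors"],
})
result = largest_cluster(toy)
print(result)
```

The example texts give the summarizer (or a reader) something concrete to label the cluster with, since KMeans ids carry no meaning by themselves.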


In [13]:
# Sample workflow using Llama 3 via llama-cpp-python

# Prepare cluster summary
cluster_df = question_to_cluster("schema migration failures")
combined_summary = " ".join(cluster_df['rca_summary'].tolist())

# Build prompt
prompt = f"""
You are a production SRE assistant.

You are given incident RCA summaries. Provide:
1. A detailed summary of what these incidents are about.
2. Patterns you observe across them.
3. If there's a known team/client frequently impacted.

Data:
{combined_summary}
"""

# Send to LLaMA 3 (locally via llama.cpp or Ollama)
from llama_cpp import Llama

llm = Llama(model_path="meta-llama-3-8b-instruct.Q4_K_M.gguf")
response = llm(prompt, max_tokens=500, stop=["</s>"])

print(response['choices'][0]['text'].strip())


llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from meta-llama-3-8b-instruct.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.

---

**Summary:**
The incidents are about Kafka consumers in the pricing engine stalling due to incompatible schema updates. This led to stale prices being served for varying amounts of time (25-80 minutes) until the consumers were restarted.

**Patterns:**
Across the incidents, I observe the following patterns:

* The cause of the incidents is the same: incompatible schema updates in the Kafka consumers.
* The impact is similar: stale prices were served for some time before the consumers were restarted.
* The duration of the incidents varies (25-80 minutes), but they all result in the same outcome: stale prices being served.

**Known team/client frequently impacted:**
Based on the data, I don't see any information indicating a specific team or client that is frequently impacted by these incidents. However, it is possible that the pricing engine is a critical component for multiple teams or clients, and therefore, the impact of these incidents could be widespread. It would be beneficia
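
The response above stops mid-word, which is consistent with generation hitting max_tokens=500. A minimal sketch for detecting this programmatically, assuming the response dict follows the OpenAI-style completion shape with a `finish_reason` field per choice (the helper `completion_text` and the hand-written `fake_response` are illustrative, and no model is loaded here):

```python
def completion_text(response):
    """Return (text, truncated) from an OpenAI-style completion dict.

    `truncated` is True when the first choice reports finish_reason == "length".
    """
    choice = response["choices"][0]
    truncated = choice.get("finish_reason") == "length"
    return choice["text"].strip(), truncated

# Hand-written stand-in; a real response would come from `llm(prompt, ...)`.
fake_response = {"choices": [{"text": " It would be beneficia", "finish_reason": "length"}]}
text, truncated = completion_text(fake_response)
if truncated:
    print("Output hit the token limit; consider raising max_tokens or shortening the prompt.")
```

If the field is missing from a response, the helper simply reports truncated=False, so it is safe to call either way.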

In [14]:
# Sample workflow using Llama 3, this time with no RCA data in the prompt

# # Prepare cluster summary
# cluster_df = question_to_cluster("schema migration failures")
# combined_summary = " ".join(cluster_df['rca_summary'].tolist())

# Build prompt (no incident data is included, so the model can only answer from general knowledge)
prompt = """
What are the most common type of incidents?
"""

# Send to LLaMA 3 (locally via llama.cpp or Ollama)
from llama_cpp import Llama

llm = Llama(model_path="meta-llama-3-8b-instruct.Q4_K_M.gguf")
response = llm(prompt, max_tokens=500, stop=["</s>"])

print(response['choices'][0]['text'].strip())

Based on the data, the most common types of incidents reported to the IT department are:
1. Network connectivity issues (34%): This includes issues with Wi-Fi, Ethernet, and other network connections.
2. Password-related issues (20%): This includes issues with forgotten passwords, locked accounts, and password reset requests.
3. Hardware and software issues (15%): This includes issues with computer hardware, software applications, and printers.
4. Network security issues (10%): This includes issues with firewalls, antivirus software, and other security-related problems.
5. Miscellaneous issues (21%): This includes issues that don't fit into the above categories, such as issues with printers, scanners, and other peripherals.

It's worth noting that the most common type of incident may vary depending on the organization, its size, and its industry. The IT department should analyze the data to identify trends and patterns to better understand the types of incidents that occur and to devel
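
The prompt in this cell contains no incident data, so the categories and percentages in the answer above cannot have come from the CSV. One way to keep such answers grounded is to compute the counts first and place them in the prompt. The sketch below assumes a hypothetical list of incident type labels (for example from an `incident_type` column, which may not exist in the real data). The helper `build_grounded_prompt` is illustrative.

```python
from collections import Counter

def build_grounded_prompt(question, incident_types, top_k=3):
    """Build a prompt that embeds computed counts so the model does not have to guess them."""
    counts = Counter(incident_types)
    total = sum(counts.values())
    lines = [
        f"- {name}: {n} incidents ({n / total:.0%})"
        for name, n in counts.most_common(top_k)
    ]
    return (
        "Answer using only the counts below. If they do not answer the question, say so.\n\n"
        "Incident counts:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\n"
    )

# Toy labels; real ones would be read from the incident data.
toy_types = ["api latency", "schema change", "api latency", "firewall", "api latency", "schema change"]
prompt = build_grounded_prompt("What are the most common types of incidents?", toy_types, top_k=2)
print(prompt)
```

The resulting string can be passed to llm(prompt, max_tokens=...) in place of the bare question.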