## Automated HPC Log Analysis & Error Resolution System - analysis

In [20]:
import pandas as pd
import re
import requests
from ydata_profiling import ProfileReport
import io
import gzip

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

#### Load dataset

In [12]:
# Fetch log file from GitHub
log_url = "https://raw.githubusercontent.com/orsho/HPC-Log-Resolution/main/data/HPC.log"
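response = requests.get(log_url, timeout=60)
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
log_lines = response.text.splitlines()  # Split into lines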

log_data = []

# Parse each line into seven space-separated fields; lines that do not match are skipped
log_pattern = re.compile(r'(\d+) (\S+) (\S+) (\S+) (\d+) (\d+) (.+)')
for line in log_lines:
    match = log_pattern.match(line.strip())
    if match:
        log_data.append(match.groups())

# Convert to DataFrame
columns = ["ID", "Node", "Subsystem", "Event", "Timestamp", "Unknown", "Message"]
log_df = pd.DataFrame(log_data, columns=columns)
log_df["Timestamp"] = pd.to_datetime(log_df["Timestamp"].astype(int), unit="s", errors="coerce")

print(log_df.shape)
log_df.head()

(430225, 7)


,ID,Node,Subsystem,Event,Timestamp,Unknown,Message
0,2557285,node-233,unix.hw,state_change.unavailable,2004-01-01 08:32:01,1,Component State Change: Component \042alt0\042...
1,2562603,node-233,unix.hw,state_change.unavailable,2004-01-08 08:34:05,1,Component State Change: Component \042alt0\042...
2,2561225,node-228,unix.hw,state_change.unavailable,2004-01-06 07:25:07,1,Component State Change: Component \042alt0\042...
3,2598209,node-ms0,unix.hw,state_change.unavailable,2004-01-16 22:14:33,1,Component State Change: Component \042alt0\042...
4,2598216,node-ms0,unix.hw,state_change.unavailable,2004-01-16 22:19:30,1,Component State Change: Component \042alt0\042...
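
The parsing step above can be sketched as a small reusable function. This is a minimal sketch under stated assumptions, not the notebook's actual code: the function name `parse_log_line`, the named groups, and the sample line are all made up for illustration. The sample line only mimics the seven-field layout seen in the preview, and the timestamp field is assumed to hold Unix epoch seconds, as the `unit="s"` conversion in the notebook suggests.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Same seven-field layout as the notebook's regex, with named groups (names are assumptions).
LOG_PATTERN = re.compile(
    r"(?P<ID>\d+) (?P<Node>\S+) (?P<Subsystem>\S+) (?P<Event>\S+) "
    r"(?P<Timestamp>\d+) (?P<Unknown>\d+) (?P<Message>.+)"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the fields of one log line as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Hypothetical line, made up for illustration only.
sample_line = "1001 node-1 unix.hw state_change.unavailable 1072945921 1 Example message text"
parsed = parse_log_line(sample_line)

# Convert the epoch-seconds field to a timezone-aware UTC datetime.
parsed_time = datetime.fromtimestamp(int(parsed["Timestamp"]), tz=timezone.utc)

print(parsed)
print(parsed_time.isoformat())
print(parse_log_line(""))  # Blank lines do not match the pattern
```

Returning `None` for non-matching lines keeps the skip-on-mismatch behavior of the original loop while making it easy to count how many lines were dropped.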


In [19]:
# Sample the DataFrame
# df_sample = log_df.sample(frac=0.1, random_state=42)
# profile = ProfileReport(df_sample, title="Data Profiling Report", explorative=True)
# profile.to_file("1_log_profiling.html") # open "1_log_profiling.html" file if you can't see the iframe
# profile.to_notebook_iframe()

#### Load results (processed file with resolutions)

In [21]:
# GitHub raw file URL (compressed CSV)
csv_url = "https://raw.githubusercontent.com/orsho/HPC-Log-Resolution/main/results/processed_logs.csv.gz"
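response = requests.get(csv_url, timeout=60)  # TLS verification stays on (the requests default)
response.raise_for_status()  # Fail early on HTTP errors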

# Decompress and load CSV into pandas
with gzip.GzipFile(fileobj=io.BytesIO(response.content), mode="rb") as f:
    logs_with_resolutions = pd.read_csv(f)

# Display shape and preview
print(logs_with_resolutions.shape)
logs_with_resolutions.head()

(430225, 9)


,ID,Node,Subsystem,Event,Timestamp,Unknown,Message,log_level,Resolution_Steps
0,2557285,node-233,unix.hw,state_change.unavailable,2004-01-01 08:32:01,1,Component State Change: Component \042alt0\042...,ERROR,**Issue Summary:**\nThe HPC system has reporte...
1,2562603,node-233,unix.hw,state_change.unavailable,2004-01-08 08:34:05,1,Component State Change: Component \042alt0\042...,ERROR,**Issue Summary:**\nThe HPC system has reporte...
2,2561225,node-228,unix.hw,state_change.unavailable,2004-01-06 07:25:07,1,Component State Change: Component \042alt0\042...,ERROR,Here is my analysis and suggested remediation ...
3,2598209,node-ms0,unix.hw,state_change.unavailable,2004-01-16 22:14:33,1,Component State Change: Component \042alt0\042...,ERROR,Here is my analysis and suggested remediation ...
4,2598216,node-ms0,unix.hw,state_change.unavailable,2004-01-16 22:19:30,1,Component State Change: Component \042alt0\042...,ERROR,Here is my analysis and suggested remediation ...
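
The decompress-and-load step above can be exercised without network access. The following is a rough sketch under stated assumptions: the rows are made up and only borrow a few column names from the processed file, and the in-memory bytes stand in for `response.content`. It reads the data back through `gzip.GzipFile`, as the notebook does, and also through `pd.read_csv` with `compression="gzip"` as a possible alternative.

```python
import gzip
import io

import pandas as pd

# Made-up rows that only mimic some of the processed file's column names.
original = pd.DataFrame(
    {
        "ID": [1, 2],
        "Message": ["example message a", "example message b"],
        "log_level": ["ERROR", "INFO"],
    }
)

# Compress the CSV text into bytes, standing in for `response.content`.
compressed = gzip.compress(original.to_csv(index=False).encode("utf-8"))

# Path 1: explicit GzipFile wrapper, as in the notebook.
with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as f:
    via_gzipfile = pd.read_csv(f)

# Path 2: let pandas handle the decompression itself.
via_pandas = pd.read_csv(io.BytesIO(compressed), compression="gzip")

print(via_gzipfile.shape, via_pandas.shape)
```

Both paths should yield the same rows; the second simply drops the explicit wrapper.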


#### Sample two random error messages and print them with their suggested resolutions

In [23]:
# Filter ERROR rows once, then sample up to two of them reproducibly
error_logs = logs_with_resolutions[logs_with_resolutions["log_level"] == "ERROR"]
sample_size = min(2, len(error_logs))
sampled_errors = error_logs.sample(n=sample_size, random_state=42)

# Print the header once, then each sampled message with its suggested resolution
print("\nSample Error Logs with Resolutions:\n" + "=" * 80)
for _, row in sampled_errors.iterrows():
    print(f"Log Message: {row['Message']}\n")
    print(f"Suggested Resolution: {row['Resolution_Steps']}\n")


🔍 Sample Error Logs with Resolutions:
Log Message: Linkerror event interval expired

Suggested Resolution: **Issue Summary:**
The 'Linkerror event interval expired' error message indicates a timeout issue in the high-performance computing (HPC) system, suggesting that a link error event was not resolved within the expected time interval, resulting in a system failure or instability.

**Possible Causes:**

- **InfiniBand (IB) fabric issues**: The IB fabric might be experiencing congestion, packet loss, or faulty cables, leading to link errors that are not resolved within the expected time interval.
- **Node or switch hardware failures**: A faulty node or switch in the cluster can cause link errors, which may not be resolved within the expected time interval, triggering the error message.
- **Configuration issues**: Misconfigured IB settings, such as incorrect subnet manager configurations or invalid MTU settings, can lead to link errors and timeouts.
- **Software bugs or version incomp

#### Basic info from the processed results

In [24]:
# Total messages (not unique)
total_messages = len(logs_with_resolutions)

# Total ERROR messages (not unique)
total_errors = (logs_with_resolutions["log_level"] == "ERROR").sum()

# Unique messages
unique_messages = logs_with_resolutions["Message"].nunique()

# Unique ERROR messages
unique_error_messages = logs_with_resolutions[logs_with_resolutions["log_level"] == "ERROR"]["Message"].nunique()

# Compute ratios
error_ratio_total = total_errors / total_messages if total_messages > 0 else 0
error_ratio_unique = unique_error_messages / unique_messages if unique_messages > 0 else 0

# Print results
print("Log Message Statistics:\n" + "-" * 40)
print(f"Total Messages (incl. duplicates): {total_messages}")
print(f"Total ERROR Messages: {total_errors}")
print(f"Error Ratio (Total): {error_ratio_total:.2%}\n")

print(f"Unique Messages: {unique_messages}")
print(f"Unique ERROR Messages: {unique_error_messages}")
print(f"Error Ratio (Unique): {error_ratio_unique:.2%}")

Log Message Statistics:
----------------------------------------
Total Messages (incl. duplicates): 430225
Total ERROR Messages: 145029
Error Ratio (Total): 33.71%

Unique Messages: 9961
Unique ERROR Messages: 3369
Error Ratio (Unique): 33.82%
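
The two ratios above answer different questions: the total ratio counts every occurrence, while the unique ratio counts each distinct message once. A minimal sketch of that distinction on a toy table follows. The helper name `error_stats` and the toy rows are assumptions made for illustration; only the column names `Message` and `log_level` come from the notebook.

```python
import pandas as pd


def error_stats(df: pd.DataFrame, level: str = "ERROR") -> dict:
    """Compute the total and unique-message ratios for one log level."""
    is_level = df["log_level"] == level
    total = len(df)
    unique = df["Message"].nunique()
    level_total = int(is_level.sum())
    level_unique = df.loc[is_level, "Message"].nunique()
    return {
        "ratio_total": level_total / total if total else 0.0,
        "ratio_unique": level_unique / unique if unique else 0.0,
    }


# Made-up rows: one ERROR message repeated twice, plus two distinct INFO messages.
toy = pd.DataFrame(
    {
        "Message": ["disk failure", "disk failure", "link up", "job started"],
        "log_level": ["ERROR", "ERROR", "INFO", "INFO"],
    }
)

stats = error_stats(toy)
print(f"Error Ratio (Total): {stats['ratio_total']:.2%}")
print(f"Error Ratio (Unique): {stats['ratio_unique']:.2%}")
```

In the toy table the repeated message inflates the total ratio (2 of 4 rows) relative to the unique ratio (1 of 3 distinct messages). In the real data the two ratios are nearly equal.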
