Mohsen Seyedkazemi Ardebili

Platform Engineer · AIOps · MLOps · LLM-Orchestrated Infrastructure
Research Fellow, University of Bologna · Bologna, Italy


I build autonomous AI systems that act on infrastructure rather than just explain it. Seven years of hands-on operations work in mission-critical industrial environments, followed by a PhD in HPC systems, give me a different lens: I care about correctness, observability, and production trust.


Featured Project

KubeIntellect — Autonomous Kubernetes Operations

LLM-orchestrated multi-agent framework for root cause analysis, diagnosis, and human-gated cluster operations across the full Kubernetes API surface.

Python LangGraph FastAPI Kubernetes

  • LangGraph FSM supervisor with PostgreSQL checkpoints and human-in-the-loop approval gates
  • Dynamic Code-Generator agent: sandboxed tool synthesis and validation at runtime
  • Modular domain agents: logs, metrics, RBAC, lifecycle, scheduling, exec, proxy
  • 93% tool synthesis success rate · 100% reliability across 200+ queries
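
The supervisor pattern in the bullets above (a state machine that checkpoints each transition and pauses for human approval before acting) can be illustrated with a minimal, standard-library-only sketch. This is not KubeIntellect's code and does not use the LangGraph API: the names `State`, `Run`, `step`, and `drive` are hypothetical, the "diagnosis" is a placeholder string, and an in-memory list stands in for the PostgreSQL checkpoint store.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class State(Enum):
    DIAGNOSE = "diagnose"
    AWAIT_APPROVAL = "await_approval"
    EXECUTE = "execute"
    DONE = "done"


@dataclass
class Run:
    """Hypothetical run record; a real system would persist this."""
    query: str
    state: State = State.DIAGNOSE
    proposed_action: str = ""
    log: List[str] = field(default_factory=list)


def step(run: Run, approve: Callable[[str], bool], checkpoints: List[dict]) -> Run:
    """Advance the run by one transition, then record a checkpoint."""
    if run.state is State.DIAGNOSE:
        # Placeholder diagnosis: a real agent would inspect logs or metrics here.
        run.proposed_action = f"restart workload mentioned in: {run.query}"
        run.state = State.AWAIT_APPROVAL
    elif run.state is State.AWAIT_APPROVAL:
        # The gate: nothing is executed unless the operator approves.
        if approve(run.proposed_action):
            run.state = State.EXECUTE
        else:
            run.log.append("rejected by operator")
            run.state = State.DONE
    elif run.state is State.EXECUTE:
        # Stand-in for the actual cluster operation.
        run.log.append(f"executed: {run.proposed_action}")
        run.state = State.DONE
    # Assumption: a list of dicts stands in for a durable checkpoint store.
    checkpoints.append({"state": run.state.value, "log": list(run.log)})
    return run


def drive(query: str, approve: Callable[[str], bool]) -> Tuple[Run, List[dict]]:
    """Run the state machine to completion and return the run and its checkpoints."""
    checkpoints: List[dict] = []
    run = Run(query)
    while run.state is not State.DONE:
        step(run, approve, checkpoints)
    return run, checkpoints


if __name__ == "__main__":
    run, checkpoints = drive("pod in CrashLoopBackOff", approve=lambda action: True)
    print(run.log)
    print([c["state"] for c in checkpoints])
```

Passing the approval decision in as a callable keeps the gate explicit and testable: swapping `lambda action: True` for a function that prompts an operator changes the policy without touching the transitions.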

Other Projects

| Project | Description | Key Metrics | Stack |
| --- | --- | --- | --- |
| kube_q | CLI + Python SDK for KubeIntellect | Streaming responses, Rich TUI | Python |
| AOBench | Agent Operations Benchmark: role-aware, permission-enforced, trace-based HPC agent evaluation | 80 tasks · 26 environments | Python, LLM Eval, MCP |
| GRAAFE | Graph anomaly anticipation for exascale HPC | AUC 0.91 · 1000+ nodes | Python, GCN |
| HazardNet | Thermal hazard prediction for datacenters | F1 0.99 · <100 ms inference | Python, TCN/LSTM |

Research

PhD: Design, Analysis, and Management of High-Performance Computing Systems · University of Bologna (2018–2022)

EU Projects: DECICE · Graph-Massivizer · EUROPEAN PILOT · REGALE · EPI SGA1 · SEANERGYS

Scholar:

| Citations | h-index | i10-index |
| --- | --- | --- |
| 179 (154 since 2021) | 7 | 6 |

Selected Publications

| Title | Venue | Year | Citations |
| --- | --- | --- | --- |
| KubeIntellect: A Modular LLM-Orchestrated Agent Framework for Kubernetes Management | arXiv | 2025 | |
| M100 ExaData: A Data Collection Campaign on CINECA's Marconi100 Tier-0 Supercomputer | Nature Scientific Data | 2023 | 50 |
| PM100: A Job Power Consumption Dataset of a Large-Scale Production HPC System | SC'23 Workshops | 2023 | 21 |
| GRAAFE: Graph Anomaly Anticipation Framework for Exascale HPC Systems | FGCS | 2024 | 17 |
| HazardNet: Thermal Hazard Prediction Framework for Datacenters | FGCS | 2024 | |
| Multi-level Anomaly Prediction in Tier-0 Datacenter | ACM Computing Frontiers | 2022 | |

All publications →


Stack

Platform & Infrastructure

Kubernetes Helm Terraform Azure Docker Linux

AI / ML

Python PyTorch LangGraph FastAPI MLflow

HPC

Slurm MPI OpenMP

Observability

Prometheus Grafana OpenTelemetry


Academic Service

PC Member: PDP 2025 · PDP 2026 · AsHES 2026

Reviewer: IEEE TCAD · FGCS · Journal of Grid Computing · SC · ACM CF · DATE · PDP · AsHES

Supervision: 2 PhD co-advisees (ongoing) · 5 MSc theses completed · Lab of Big Data Architectures, UniBo (2020–2024)


Pinned

  1. Vagrant-Kubernetes-ROS2-Deployment (Public)

    "This repository provides a step-by-step guide to deploying ROS2 Talker and Listener Nodes on a Kubernetes cluster using Vagrant. It includes instructions for setting up the cluster, installing the…

    Jupyter Notebook 4 2

  2. GRAAFE (Public)

    Jupyter Notebook

  3. HazardNet (Public)

    Python

  4. kubeintellect (Public)

    Chat with your Kubernetes cluster in plain English. Multi-agent AI for root cause analysis, diagnostics, and HITL-gated cluster operations.

    Python 4

  5. aobench (Public)

    Benchmark framework for evaluating AI agents in HPC environments — role-aware, tool-using, trace-based, and reproducible.

    Python 1

  6. ai-agent-systems-course (Public)

    Python 1