Autonomous IT Support Incident Response Agent for Level 1 Operations

1. Project Title & Value Proposition

An autonomous IT Support Incident Response Agent for Level 1 operations teams. It reduces initial incident triage and response time by automating health checks, log retrieval, restart decisions, and escalation summaries for common server incidents.

2. Background & Problem Context

In fast-moving production environments, Level 1 support teams are often the first responders for server slowdowns, crashes, memory spikes, and dependency failures. A single shift may involve handling dozens of alerts across application servers, database nodes, authentication services, and search infrastructure.

Manual triage is repetitive and time-sensitive:

  • check server health
  • inspect recent logs
  • decide whether a restart is safe
  • escalate to an engineer if the issue is deeper

At scale, this leads to delayed response times, inconsistent decisions across responders, and unnecessary escalation noise. Simple scripts can gather metrics or logs, but they cannot reliably decide when to restart a service versus when to escalate based on combined health and log context.

3. Target User & Job To Be Done (JTBD)

Primary User

  • Role: Level 1 IT Support / Operations Responder
  • Environment: Internal production support or infrastructure operations team

JTBD

Within seconds of receiving an incident description, determine whether the issue can be safely handled through automated triage and restart, or whether it should be escalated to a human engineer with a clean incident summary.

Secondary Users

  • SRE / Platform Engineers: receive cleaner, context-rich escalations
  • Engineering Managers: monitor incident handling consistency and operational efficiency
  • Operations Leads: reduce repetitive manual triage work and improve response quality

4. Why an Agentic Approach

This problem is not just about pulling data — it requires reasoning across multiple signals.

A script can fetch CPU and memory usage. Another script can fetch logs. But the system still needs to decide:

  • whether the issue is caused by resource exhaustion
  • whether restart is the correct action
  • whether logs indicate a dependency failure that restart will not fix
  • whether escalation is safer than taking an automated action

A simple workflow or RPA-style automation is too rigid because incidents do not follow one deterministic path. A chatbot would require the operator to manually drive every step. An agentic approach is more suitable because the system can:

  • dynamically choose which tool to call next
  • reason over health signals and logs together
  • decide between restart and escalation
  • generate a final human-readable incident response summary

5. Agent Role, Scope & Autonomy Level

Agent Responsibilities

  • Investigate incoming server incident descriptions
  • Retrieve server health metrics
  • Fetch recent logs for diagnosis
  • Decide whether automated restart is appropriate
  • Escalate unresolved or unsafe issues to a human engineer
  • Produce a concise final response for the operator

Autonomy & Boundaries

  • The agent can autonomously inspect health and logs
  • The agent can autonomously trigger restart for clear high CPU / high memory cases
  • The agent can autonomously escalate when logs indicate dependency failures or unknown issues
  • The agent does not perform destructive remediation beyond restart
  • The agent does not modify infrastructure configuration, rotate secrets, or change deployments

Operational Guardrail

The agent is limited to read-only diagnosis plus one bounded remediation action: service restart. All deeper corrective actions remain human-controlled.
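One way to enforce this guardrail in code is an allow-list check in the executor, so the model cannot trigger an unplanned action even if it requests a tool name outside the approved set. This is an illustrative sketch, not the repository's actual implementation; the function name `execute_guarded` is hypothetical:

```python
# Illustrative guardrail: the executor refuses any tool outside this
# allow-list, keeping the agent to read-only diagnosis plus restart.
ALLOWED_TOOLS = {
    "get_server_health",
    "fetch_recent_logs",
    "restart_service",
    "escalate_to_engineer",
}

def execute_guarded(tool_name: str, impls: dict, **kwargs):
    """Run a model-requested tool only if it is explicitly permitted."""
    if tool_name not in ALLOWED_TOOLS or tool_name not in impls:
        raise PermissionError(f"Tool '{tool_name}' is not permitted")
    return impls[tool_name](**kwargs)
```

Anything the model asks for that is not on the list (say, a hallucinated `rotate_secrets`) is rejected before execution rather than silently ignored.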

6. Agent Architecture & Components

The system is built around a centralized agent orchestrator using an LLM with function calling and a stateful execution loop.

a) Planner / Decision Layer

The LLM acts as the planner. It:

  • interprets the incident
  • decides which tools to call
  • reasons over outputs from health and logs
  • determines whether restart or escalation is appropriate

b) Executor Layer (Tools)

The executor layer consists of Python tool functions:

  • get_server_health(server_id) → retrieves CPU, memory, and status
  • fetch_recent_logs(server_id, lines) → retrieves recent logs
  • restart_service(server_id) → performs bounded restart action
  • escalate_to_engineer(summary) → creates escalation summary/ticket payload
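A minimal sketch of what such mock tools might look like. The metric values and log lines below are invented stand-ins for demonstration; the repository's actual implementations may differ:

```python
def get_server_health(server_id: str) -> dict:
    # Simulated metrics; a real implementation would query monitoring APIs.
    return {"server_id": server_id, "cpu_percent": 97.5,
            "memory_percent": 88.0, "status": "degraded"}

def fetch_recent_logs(server_id: str, lines: int = 50) -> list:
    # Simulated log tail; real code would read from a log aggregator.
    return [f"[{server_id}] ERROR: request timed out after 30s"
            for _ in range(min(lines, 5))]

def restart_service(server_id: str) -> dict:
    # The single bounded remediation action the agent may take.
    return {"server_id": server_id, "action": "restart", "result": "success"}

def escalate_to_engineer(summary: str) -> dict:
    # Packages an escalation payload for a human engineer.
    return {"action": "escalate", "summary": summary}
```

Because the tools are deterministic, they can be unit tested without an LLM in the loop.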

c) Memory

The current implementation uses short-term conversational state through the messages array passed back to the LLM between tool calls.

  • Short-term memory: conversation and tool outputs in current incident session
  • Long-term memory: not implemented in this version

d) Orchestration Logic

The orchestrator:

  • receives the user incident
  • sends tool schema to the model
  • executes requested tools
  • appends tool results back into the conversation
  • loops until the model provides a final response
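The loop above can be sketched as follows. This is a minimal sketch assuming the official `openai` Python client's Chat Completions tool-calling interface; error handling, retries, and the real tool registry are omitted:

```python
import json

def run_agent(client, model, messages, tool_schemas, tool_impls, max_turns=8):
    """Drive the function-calling loop: ask the model, run any tools it
    requests, append the results, and stop when it answers in plain text."""
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tool_schemas
        )
        msg = response.choices[0].message
        messages.append(msg)  # keep the assistant turn in conversation state
        if not msg.tool_calls:
            return msg.content  # final operator-facing summary
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = tool_impls[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return "Turn limit reached without a final answer; please escalate manually."
```

The `max_turns` cap is a safety bound so a confused model cannot loop forever calling tools.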

7. End-to-End Agent Workflow

  1. Input ingestion
    User submits an incident description such as:
    The payment-server-01 is extremely slow and timing out.

  2. Planning / issue interpretation
    The agent interprets the incident and decides which server diagnostics are needed.

  3. Tool execution: health check
    The agent calls get_server_health(server_id).

  4. Tool execution: logs retrieval
    The agent calls fetch_recent_logs(server_id, lines).

  5. Reasoning over combined signals
    The agent compares health metrics and logs:

    • high CPU / memory may indicate restart is useful
    • dependency or connection failures may indicate restart is insufficient
  6. Decision

    • If issue appears recoverable, call restart_service(server_id)
    • If issue appears deeper or external, call escalate_to_engineer(summary)
  7. Final output generation
    The agent provides a concise operator-facing summary of what it observed and what action it took.
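The restart-versus-escalate judgment at step 6 is made by the LLM, not by fixed rules, but a toy heuristic makes the decision logic concrete. The thresholds and log patterns below are illustrative assumptions:

```python
def decide_action(health: dict, logs: list) -> str:
    """Toy rule-based stand-in for the LLM's restart-vs-escalate decision."""
    text = " ".join(logs).lower()
    if "connection refused" in text or "dependency" in text:
        return "escalate"  # restart will not fix an external failure
    if health.get("cpu_percent", 0) > 90 or health.get("memory_percent", 0) > 90:
        return "restart"   # local resource exhaustion: bounded remediation
    return "escalate"      # unknown cause: safer to hand off to a human
```

Note the ordering: a dependency failure overrides high CPU, because restarting a server whose backend is down only adds churn.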

8. Tools, Models & Stack (With Rationale)

| Component | Technology | Rationale |
| --- | --- | --- |
| Core Model | GPT-4o | Strong reasoning and tool selection for multi-step incident handling |
| Orchestration | OpenAI Function Calling | Reliable structured tool invocation without brittle parsing |
| Language | Python | Simple, readable orchestration and tool implementation |
| Test Framework | pytest | Lightweight unit testing for deterministic tool behavior |
| Config Management | python-dotenv | Safe local environment variable loading without hardcoding secrets |
| Tool Layer | Mock Python functions | Deterministic simulation of infrastructure actions, easy to unit test |
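The OpenAI Function Calling row refers to JSON schemas the orchestrator sends alongside each request so the model can invoke tools with structured arguments. A sketch for one tool, following the Chat Completions `tools` format (the description strings are illustrative):

```python
GET_SERVER_HEALTH_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_server_health",
        "description": "Retrieve CPU, memory, and status for a server.",
        "parameters": {
            "type": "object",
            "properties": {
                "server_id": {
                    "type": "string",
                    "description": "Server identifier, e.g. payment-server-01",
                },
            },
            "required": ["server_id"],
        },
    },
}
```

Because arguments arrive as validated JSON rather than free text, the executor never has to parse tool requests out of prose.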

9. Evaluation Strategy & Metrics

10. Guardrails, Trust & Safety

11. Failure Modes & Tradeoffs

Known Failure Modes

Tradeoffs

12. Results, Learnings & Insights

What Worked

What Failed Initially

Key Learnings

13. Future Improvements & Iteration Plan

14. Demo & Artifacts

This repository includes:

  • research notebook version of the project
  • modular production-style Python implementation
  • unit tests for deterministic tools

Example Incident Trace

Input:
The payment-server-01 is extremely slow and timing out.

Agent behavior:

  1. checks server health
  2. fetches recent logs
  3. detects CPU saturation and hung process patterns
  4. triggers restart
  5. provides a final response summary

15. Role-Based Signal

For Product Managers

For Engineering Managers

For Software Engineers


Overview

This project is a production-style AI agent prototype that assists Level 1 IT support teams in investigating and responding to common server incidents. It demonstrates how an agent can reason over health metrics and logs, choose bounded remediation actions, and escalate intelligently when automation is insufficient.

Key Features

  • Tool-calling AI incident response workflow
  • Health check and recent log retrieval tools
  • Automated restart for recoverable cases
  • Escalation path for dependency or deeper failures
  • Modular Python project structure
  • Research notebook preserved in repo
  • Unit tests for core tool behavior

Tech Stack

  • Python
  • OpenAI API with function calling
  • pytest
  • python-dotenv

Project Structure

```
it-support-agent/
│── src/
│   ├── __init__.py
│   ├── main.py
│   ├── config.py
│   ├── prompts.py
│   ├── agent.py
│   └── tools/
│       ├── __init__.py
│       ├── it_support_tools.py
│       └── registry.py
│── tests/
│   └── test_it_support_tools.py
│── research/
│   └── it_support_agent_research.ipynb
│── docs/
│   └── architecture.md
│── README.md
│── requirements.txt
│── pytest.ini
│── .env.example
│── .gitignore
```

Installation Steps

1. Clone the repository

```shell
git clone <your-repo-url>
cd it-support-agent
```

2. Create and activate a virtual environment

```shell
python -m venv .venv
source .venv/bin/activate
```

3. Install dependencies

```shell
pip install -r requirements.txt
```

Configuration

Create a local .env file using .env.example:

```
OPENAI_API_KEY=your_real_openai_key_here
MODEL_NAME=gpt-4o
```
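The `src/config.py` module would then load these values at startup. A sketch assuming python-dotenv (variable names match the `.env` keys above; the exact contents of the real `config.py` may differ):

```python
import os
from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads .env from the project root, if present

# Fail fast with a KeyError if the API key is missing, rather than
# failing later inside the first model call.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o")  # sensible default
```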

Usage Instructions

Run the agent from the project root:

```shell
python -m src.main
```

Then enter an incident description such as:

The payment-server-01 is extremely slow and timing out.

Example Input

The search-index-09 is failing to serve results and seems broken.

Example Output

Investigated search-index-09. Health metrics do not indicate local CPU or memory pressure, but logs show critical dependency failure and connection refused errors to the search backend. Automated restart is unlikely to resolve the issue. Escalated to a human engineer with summary.

Running Tests

Run all tests:

```shell
pytest
```

Run one test file with verbose output:

```shell
pytest tests/test_it_support_tools.py -v
```
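Because the tools are deterministic mocks, their tests need no API key or network access. A hypothetical sketch of what a test in `tests/test_it_support_tools.py` could look like; the tool is inlined here so the snippet is self-contained, whereas the real test would import from `src.tools.it_support_tools`:

```python
def restart_service(server_id: str) -> dict:
    # Inlined stand-in for the real tool, for a self-contained example.
    return {"server_id": server_id, "action": "restart", "result": "success"}

def test_restart_service_targets_requested_server():
    outcome = restart_service("payment-server-01")
    assert outcome["server_id"] == "payment-server-01"
    assert outcome["action"] == "restart"
    assert outcome["result"] == "success"
```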

Future Improvements

Contributing Guidelines

License

Choose a license for this repository, such as MIT.

Author

Ravi Doddi
GitHub: https://github.com/iOSNinja/it-support-agent
LinkedIn: www.linkedin.com/in/ravi-doddi-32061110
