Autonomous IT Support Incident Response Agent for Level 1 Operations

1. Project Title & Value Proposition

An autonomous IT Support Incident Response Agent for Level 1 operations teams. It reduces initial incident triage and response time by automating health checks, log retrieval, restart decisions, and escalation summaries for common server incidents.

2. Background & Problem Context

In fast-moving production environments, Level 1 support teams are often the first responders for server slowdowns, crashes, memory spikes, and dependency failures. A single shift may involve handling dozens of alerts across application servers, database nodes, authentication services, and search infrastructure.

Manual triage is repetitive and time-sensitive:

  • check server health
  • inspect recent logs
  • decide whether a restart is safe
  • escalate to an engineer if the issue is deeper

At scale, this leads to delayed response times, inconsistent decisions across responders, and unnecessary escalation noise. Simple scripts can gather metrics or logs, but they cannot reliably decide when to restart a service versus when to escalate based on combined health and log context.

3. Target User & Job To Be Done (JTBD)

Primary User

  • Role: Level 1 IT Support / Operations Responder
  • Environment: Internal production support or infrastructure operations team

JTBD

Within seconds of receiving an incident description, determine whether the issue can be safely handled through automated triage and restart, or whether it should be escalated to a human engineer with a clean incident summary.

Secondary Users

  • SRE / Platform Engineers: receive cleaner, context-rich escalations
  • Engineering Managers: monitor incident handling consistency and operational efficiency
  • Operations Leads: reduce repetitive manual triage work and improve response quality

4. Why an Agentic Approach

This problem is not just about pulling data — it requires reasoning across multiple signals.

A script can fetch CPU and memory usage. Another script can fetch logs. But the system still needs to decide:

  • whether the issue is caused by resource exhaustion
  • whether restart is the correct action
  • whether logs indicate a dependency failure that restart will not fix
  • whether escalation is safer than taking an automated action

A simple workflow or RPA-style automation is too rigid because incidents do not follow one deterministic path. A chatbot would require the operator to manually drive every step. An agentic approach is more suitable because the system can:

  • dynamically choose which tool to call next
  • reason over health signals and logs together
  • decide between restart and escalation
  • generate a final human-readable incident response summary

5. Agent Role, Scope & Autonomy Level

Agent Responsibilities

  • Investigate incoming server incident descriptions
  • Retrieve server health metrics
  • Fetch recent logs for diagnosis
  • Decide whether automated restart is appropriate
  • Escalate unresolved or unsafe issues to a human engineer
  • Produce a concise final response for the operator

Autonomy & Boundaries

  • The agent can autonomously inspect health and logs
  • The agent can autonomously trigger restart for clear high CPU / high memory cases
  • The agent can autonomously escalate when logs indicate dependency failures or unknown issues
  • The agent does not perform destructive remediation beyond restart
  • The agent does not modify infrastructure configuration, rotate secrets, or change deployments

Operational Guardrail

The agent is limited to read-only diagnosis plus one bounded remediation action: service restart. All deeper corrective actions remain human-controlled.
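One way to enforce this guardrail in code is an allow-list check in the executor, so the model cannot trigger an unplanned action even if it requests a tool name outside the approved set. This is an illustrative sketch, not the repository's actual implementation; the function name `execute_guarded` is hypothetical:

```python
# Illustrative guardrail: the executor refuses any tool outside this
# allow-list, keeping the agent to read-only diagnosis plus restart.
ALLOWED_TOOLS = {
    "get_server_health",
    "fetch_recent_logs",
    "restart_service",
    "escalate_to_engineer",
}

def execute_guarded(tool_name: str, impls: dict, **kwargs):
    """Run a model-requested tool only if it is explicitly permitted."""
    if tool_name not in ALLOWED_TOOLS or tool_name not in impls:
        raise PermissionError(f"Tool '{tool_name}' is not permitted")
    return impls[tool_name](**kwargs)
```

Anything the model asks for that is not on the list (say, a hallucinated `rotate_secrets`) is rejected before execution rather than silently ignored.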

6. Agent Architecture & Components

The system is built around a centralized agent orchestrator using an LLM with function calling and a stateful execution loop.

a) Planner / Decision Layer

The LLM acts as the planner. It:

  • interprets the incident
  • decides which tools to call
  • reasons over outputs from health and logs
  • determines whether restart or escalation is appropriate

b) Executor Layer (Tools)

The executor layer consists of Python tool functions:

  • get_server_health(server_id) → retrieves CPU, memory, and status
  • fetch_recent_logs(server_id, lines) → retrieves recent logs
  • restart_service(server_id) → performs bounded restart action
  • escalate_to_engineer(summary) → creates escalation summary/ticket payload
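A minimal sketch of what such mock tools might look like. The metric values and log lines below are invented stand-ins for demonstration; the repository's actual implementations may differ:

```python
def get_server_health(server_id: str) -> dict:
    # Simulated metrics; a real implementation would query monitoring APIs.
    return {"server_id": server_id, "cpu_percent": 97.5,
            "memory_percent": 88.0, "status": "degraded"}

def fetch_recent_logs(server_id: str, lines: int = 50) -> list:
    # Simulated log tail; real code would read from a log aggregator.
    return [f"[{server_id}] ERROR: request timed out after 30s"
            for _ in range(min(lines, 5))]

def restart_service(server_id: str) -> dict:
    # The single bounded remediation action the agent may take.
    return {"server_id": server_id, "action": "restart", "result": "success"}

def escalate_to_engineer(summary: str) -> dict:
    # Packages an escalation payload for a human engineer.
    return {"action": "escalate", "summary": summary}
```

Because the tools are deterministic, they can be unit tested without an LLM in the loop.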

c) Memory

The current implementation uses short-term conversational state through the messages array passed back to the LLM between tool calls.

  • Short-term memory: conversation and tool outputs in current incident session
  • Long-term memory: not implemented in this version

d) Orchestration Logic

The orchestrator:

  • receives the user incident
  • sends tool schema to the model
  • executes requested tools
  • appends tool results back into the conversation
  • loops until the model provides a final response
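The loop above can be sketched as follows. This is a minimal sketch assuming the official `openai` Python client's Chat Completions tool-calling interface; error handling, retries, and the real tool registry are omitted:

```python
import json

def run_agent(client, model, messages, tool_schemas, tool_impls, max_turns=8):
    """Drive the function-calling loop: ask the model, run any tools it
    requests, append the results, and stop when it answers in plain text."""
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tool_schemas
        )
        msg = response.choices[0].message
        messages.append(msg)  # keep the assistant turn in conversation state
        if not msg.tool_calls:
            return msg.content  # final operator-facing summary
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = tool_impls[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return "Turn limit reached without a final answer; please escalate manually."
```

The `max_turns` cap is a safety bound so a confused model cannot loop forever calling tools.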

7. End-to-End Agent Workflow

  1. Input ingestion
    User submits an incident description such as:
    The payment-server-01 is extremely slow and timing out.

  2. Planning / issue interpretation
    The agent interprets the incident and decides which server diagnostics are needed.

  3. Tool execution: health check
    The agent calls get_server_health(server_id).

  4. Tool execution: logs retrieval
    The agent calls fetch_recent_logs(server_id, lines).

  5. Reasoning over combined signals
    The agent compares health metrics and logs:

    • high CPU / memory may indicate restart is useful
    • dependency or connection failures may indicate restart is insufficient
  6. Decision

    • If issue appears recoverable, call restart_service(server_id)
    • If issue appears deeper or external, call escalate_to_engineer(summary)
  7. Final output generation
    The agent provides a concise operator-facing summary of what it observed and what action it took.
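The restart-versus-escalate judgment at step 6 is made by the LLM, not by fixed rules, but a toy heuristic makes the decision logic concrete. The thresholds and log patterns below are illustrative assumptions:

```python
def decide_action(health: dict, logs: list) -> str:
    """Toy rule-based stand-in for the LLM's restart-vs-escalate decision."""
    text = " ".join(logs).lower()
    if "connection refused" in text or "dependency" in text:
        return "escalate"  # restart will not fix an external failure
    if health.get("cpu_percent", 0) > 90 or health.get("memory_percent", 0) > 90:
        return "restart"   # local resource exhaustion: bounded remediation
    return "escalate"      # unknown cause: safer to hand off to a human
```

Note the ordering: a dependency failure overrides high CPU, because restarting a server whose backend is down only adds churn.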

8. Tools, Models & Stack (With Rationale)

| Component | Technology | Rationale |
| --- | --- | --- |
| Core Model | GPT-4o | Strong reasoning and tool selection for multi-step incident handling |
| Orchestration | OpenAI Function Calling | Reliable structured tool invocation without brittle parsing |
| Language | Python | Simple, readable orchestration and tool implementation |
| Test Framework | pytest | Lightweight unit testing for deterministic tool behavior |
| Config Management | python-dotenv | Safe local environment variable loading without hardcoding secrets |
| Tool Layer | Mock Python functions | Deterministic simulation of infrastructure actions, easy to unit test |
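The OpenAI Function Calling row refers to JSON schemas the orchestrator sends alongside each request so the model can invoke tools with structured arguments. A sketch for one tool, following the Chat Completions `tools` format (the description strings are illustrative):

```python
GET_SERVER_HEALTH_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_server_health",
        "description": "Retrieve CPU, memory, and status for a server.",
        "parameters": {
            "type": "object",
            "properties": {
                "server_id": {
                    "type": "string",
                    "description": "Server identifier, e.g. payment-server-01",
                },
            },
            "required": ["server_id"],
        },
    },
}
```

Because arguments arrive as validated JSON rather than free text, the executor never has to parse tool requests out of prose.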

9. Evaluation Strategy & Metrics

10. Guardrails, Trust & Safety

11. Failure Modes & Tradeoffs

Known Failure Modes

Tradeoffs

12. Results, Learnings & Insights

What Worked

What Failed Initially

Key Learnings

13. Future Improvements & Iteration Plan

14. Demo & Artifacts

This repository includes:

  • research notebook version of the project
  • modular production-style Python implementation
  • unit tests for deterministic tools

Example Incident Trace

Input:
The payment-server-01 is extremely slow and timing out.

Agent behavior:

  1. checks server health
  2. fetches recent logs
  3. detects CPU saturation and hung process patterns
  4. triggers restart
  5. provides a final response summary

15. Role-Based Signal

For Product Managers

For Engineering Managers

For Software Engineers


Overview

This project is a production-style AI agent prototype that assists Level 1 IT support teams in investigating and responding to common server incidents. It demonstrates how an agent can reason over health metrics and logs, choose bounded remediation actions, and escalate intelligently when automation is insufficient.

Key Features

  • Tool-calling AI incident response workflow
  • Health check and recent log retrieval tools
  • Automated restart for recoverable cases
  • Escalation path for dependency or deeper failures
  • Modular Python project structure
  • Research notebook preserved in repo
  • Unit tests for core tool behavior

Tech Stack

  • Python
  • OpenAI API with function calling
  • pytest
  • python-dotenv

Project Structure

```
it-support-agent/
│── src/
│   ├── __init__.py
│   ├── main.py
│   ├── config.py
│   ├── prompts.py
│   ├── agent.py
│   └── tools/
│       ├── __init__.py
│       ├── it_support_tools.py
│       └── registry.py
│── tests/
│   └── test_it_support_tools.py
│── research/
│   └── it_support_agent_research.ipynb
│── docs/
│   └── architecture.md
│── README.md
│── requirements.txt
│── pytest.ini
│── .env.example
│── .gitignore
```

Installation Steps

1. Clone the repository

```shell
git clone <your-repo-url>
cd it-support-agent
```

2. Create and activate a virtual environment

```shell
python -m venv .venv
source .venv/bin/activate
```

3. Install dependencies

```shell
pip install -r requirements.txt
```

Configuration

Create a local .env file using .env.example:

```
OPENAI_API_KEY=your_real_openai_key_here
MODEL_NAME=gpt-4o
```
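The `src/config.py` module would then load these values at startup. A sketch assuming python-dotenv (variable names match the `.env` keys above; the exact contents of the real `config.py` may differ):

```python
import os
from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads .env from the project root, if present

# Fail fast with a KeyError if the API key is missing, rather than
# failing later inside the first model call.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o")  # sensible default
```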

Usage Instructions

Run the agent from the project root:

```shell
python -m src.main
```

Then enter an incident description such as:

The payment-server-01 is extremely slow and timing out.

Example Input

The search-index-09 is failing to serve results and seems broken.

Example Output

Investigated search-index-09. Health metrics do not indicate local CPU or memory pressure, but logs show critical dependency failure and connection refused errors to the search backend. Automated restart is unlikely to resolve the issue. Escalated to a human engineer with summary.

Running Tests

Run all tests:

```shell
pytest
```

Run one test file with verbose output:

```shell
pytest tests/test_it_support_tools.py -v
```
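Because the tools are deterministic mocks, their tests need no API key or network access. A hypothetical sketch of what a test in `tests/test_it_support_tools.py` could look like; the tool is inlined here so the snippet is self-contained, whereas the real test would import from `src.tools.it_support_tools`:

```python
def restart_service(server_id: str) -> dict:
    # Inlined stand-in for the real tool, for a self-contained example.
    return {"server_id": server_id, "action": "restart", "result": "success"}

def test_restart_service_targets_requested_server():
    outcome = restart_service("payment-server-01")
    assert outcome["server_id"] == "payment-server-01"
    assert outcome["action"] == "restart"
    assert outcome["result"] == "success"
```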

Future Improvements

Contributing Guidelines

License

Choose a license for this repository, such as MIT.

Author

Ravi Doddi
GitHub: https://github.com/iOSNinja/it-support-agent
LinkedIn: www.linkedin.com/in/ravi-doddi-32061110
