An autonomous IT Support Incident Response Agent for Level 1 operations teams. It reduces initial incident triage and action time by automating health checks, log retrieval, restart decisions, and escalation summaries for common server incidents.
In fast-moving production environments, Level 1 support teams are often the first responders for server slowdowns, crashes, memory spikes, and dependency failures. A single shift may involve handling dozens of alerts across application servers, database nodes, authentication services, and search infrastructure.
Manual triage is repetitive and time-sensitive:
- check server health
- inspect recent logs
- decide whether a restart is safe
- escalate to an engineer if the issue is deeper
At scale, this leads to delayed response times, inconsistent decisions across responders, and unnecessary escalation noise. Simple scripts can gather metrics or logs, but they cannot reliably decide when to restart a service versus when to escalate based on combined health and log context.
- Role: Level 1 IT Support / Operations Responder
- Environment: Internal production support or infrastructure operations team
Within seconds of receiving an incident description, determine whether the issue can be safely handled through automated triage and restart, or whether it should be escalated to a human engineer with a clean incident summary.
- SRE / Platform Engineers: receive cleaner, context-rich escalations
- Engineering Managers: monitor incident handling consistency and operational efficiency
- Operations Leads: reduce repetitive manual triage work and improve response quality
This problem is not just about pulling data — it requires reasoning across multiple signals.
A script can fetch CPU and memory usage. Another script can fetch logs. But the system still needs to decide:
- whether the issue is caused by resource exhaustion
- whether restart is the correct action
- whether logs indicate a dependency failure that restart will not fix
- whether escalation is safer than taking an automated action
A simple workflow or RPA-style automation is too rigid because incidents do not follow one deterministic path. A chatbot would require the operator to manually drive every step. An agentic approach is more suitable because the system can:
- dynamically choose which tool to call next
- reason over health signals and logs together
- decide between restart and escalation
- generate a final human-readable incident response summary
- Investigate incoming server incident descriptions
- Retrieve server health metrics
- Fetch recent logs for diagnosis
- Decide whether automated restart is appropriate
- Escalate unresolved or unsafe issues to a human engineer
- Produce a concise final response for the operator
- The agent can autonomously inspect health and logs
- The agent can autonomously trigger restart for clear high CPU / high memory cases
- The agent can autonomously escalate when logs indicate dependency failures or unknown issues
- The agent does not perform destructive remediation beyond restart
- The agent does not modify infrastructure configuration, rotate secrets, or change deployments
The agent is limited to read-only diagnosis plus one bounded remediation action: service restart. All deeper corrective actions remain human-controlled.
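One way to make this boundary concrete is a tool allowlist at the dispatch layer. The sketch below is illustrative: the `dispatch` helper and registry dict are assumed names, not taken from the repo.

```python
# Guardrail sketch: only the four registered tools may ever be executed,
# so the model cannot request any action outside the bounded set.
ALLOWED_TOOLS = {
    "get_server_health",
    "fetch_recent_logs",
    "restart_service",
    "escalate_to_engineer",
}

def dispatch(tool_name: str, args: dict, registry: dict):
    """Execute a model-requested tool only if it is on the allowlist."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool {tool_name!r} is not permitted")
    return registry[tool_name](**args)
```

Even if the model hallucinates a tool name, the dispatch layer refuses it rather than executing arbitrary code.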
The system is built around a centralized agent orchestrator using an LLM with function calling and a stateful execution loop.
The LLM acts as the planner. It:
- interprets the incident
- decides which tools to call
- reasons over outputs from health and logs
- determines whether restart or escalation is appropriate
The executor layer consists of Python tool functions:
- `get_server_health(server_id)` → retrieves CPU, memory, and status
- `fetch_recent_logs(server_id, lines)` → retrieves recent logs
- `restart_service(server_id)` → performs the bounded restart action
- `escalate_to_engineer(summary)` → creates an escalation summary/ticket payload
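As a sketch, the four tools can be implemented as deterministic mocks. All server names, metric values, and log lines below are invented for illustration; the real implementations live in `src/tools/it_support_tools.py`.

```python
def get_server_health(server_id: str) -> dict:
    """Return simulated CPU, memory, and status for a server."""
    # Saturated metrics for the example payment server, healthy baseline otherwise.
    if server_id == "payment-server-01":
        return {"server_id": server_id, "cpu_percent": 97,
                "memory_percent": 91, "status": "degraded"}
    return {"server_id": server_id, "cpu_percent": 23,
            "memory_percent": 41, "status": "ok"}

def fetch_recent_logs(server_id: str, lines: int = 20) -> list[str]:
    """Return the most recent simulated log lines for a server."""
    if server_id == "payment-server-01":
        sample = ["WARN cpu saturation detected",
                  "ERROR worker pool exhausted, requests timing out"]
    else:
        sample = ["INFO request handled in 120ms"]
    return (sample * lines)[-lines:]

def restart_service(server_id: str) -> dict:
    """Perform the single bounded remediation action: a service restart."""
    return {"server_id": server_id, "action": "restart", "result": "success"}

def escalate_to_engineer(summary: str) -> dict:
    """Create an escalation payload for a human engineer."""
    return {"action": "escalate", "ticket": {"summary": summary, "priority": "high"}}
```

Keeping the tools deterministic is what makes the pytest layer meaningful: tool behavior can be asserted exactly while the LLM's planning remains the only nondeterministic component.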
The current implementation uses short-term conversational state through the messages array passed back to the LLM between tool calls.
- Short-term memory: conversation and tool outputs in current incident session
- Long-term memory: not implemented in this version
The orchestrator:
- receives the user incident
- sends tool schema to the model
- executes requested tools
- appends tool results back into the conversation
- loops until the model provides a final response
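The loop above can be sketched with the OpenAI Python SDK's chat-completions tool calling. The `run_agent` name, system prompt, and turn limit are assumptions, not the repo's actual code.

```python
import json

def run_agent(client, incident: str, tools_schema: list, tool_registry: dict,
              model: str = "gpt-4o", max_turns: int = 8) -> str:
    """Call the model, execute any requested tools, append results,
    and stop once the model answers without requesting a tool."""
    messages = [
        {"role": "system", "content": "You are a Level 1 incident-response agent."},
        {"role": "user", "content": incident},
    ]
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools_schema
        )
        msg = response.choices[0].message
        if not msg.tool_calls:          # no tool requested: this is the final answer
            return msg.content
        messages.append(msg)            # keep the assistant turn in context
        for call in msg.tool_calls:
            result = tool_registry[call.function.name](
                **json.loads(call.function.arguments)
            )
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "Turn limit reached without a final response."
```

The bounded `max_turns` matters operationally: it guarantees the agent cannot loop indefinitely on an incident it cannot resolve.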
1. Input ingestion: the user submits an incident description such as:
   `The payment-server-01 is extremely slow and timing out.`
2. Planning / issue interpretation: the agent interprets the incident and decides which server diagnostics are needed.
3. Tool execution, health check: the agent calls `get_server_health(server_id)`.
4. Tool execution, logs retrieval: the agent calls `fetch_recent_logs(server_id, lines)`.
5. Reasoning over combined signals: the agent compares health metrics and logs:
   - high CPU / memory may indicate a restart is useful
   - dependency or connection failures may indicate a restart is insufficient
6. Decision:
   - if the issue appears recoverable, call `restart_service(server_id)`
   - if the issue appears deeper or external, call `escalate_to_engineer(summary)`
7. Final output generation: the agent provides a concise operator-facing summary of what it observed and what action it took.
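In the agent, the restart-versus-escalate judgment is made by the LLM over combined signals, but the policy it is expected to follow can be approximated by a deterministic heuristic. The thresholds and log patterns below are illustrative assumptions.

```python
def decide_action(health: dict, logs: list[str]) -> str:
    """Heuristic sketch: restart only for clear local resource exhaustion
    with no dependency-failure evidence in the logs."""
    dependency_failure = any(
        pattern in line.lower()
        for line in logs
        for pattern in ("connection refused", "dependency failure")
    )
    resource_exhaustion = (
        health.get("cpu_percent", 0) >= 90 or health.get("memory_percent", 0) >= 90
    )
    if dependency_failure:
        return "escalate"   # a restart will not fix an external dependency
    if resource_exhaustion:
        return "restart"    # bounded remediation for local saturation
    return "escalate"       # unknown issue: involving a human is safer
```

Note the ordering: dependency evidence overrides resource exhaustion, matching the search-index-09 example later in this document where high-signal logs prevent a pointless restart.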
| Component | Technology | Rationale |
|---|---|---|
| Core Model | GPT-4o | Strong reasoning and tool selection for multi-step incident handling |
| Orchestration | OpenAI Function Calling | Reliable structured tool invocation without brittle parsing |
| Language | Python | Simple, readable orchestration and tool implementation |
| Test Framework | pytest | Lightweight unit testing for deterministic tool behavior |
| Config Management | python-dotenv | Safe local environment variable loading without hardcoding secrets |
| Tool Layer | Mock Python functions | Deterministic, testable simulation of infrastructure actions |
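Function calling, listed above, requires each tool to be described to the model as a JSON schema. A sketch of one entry in the shape the OpenAI chat-completions API expects (the description strings are assumptions):

```python
# Schema the orchestrator would pass in the `tools` parameter for one tool.
GET_SERVER_HEALTH_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_server_health",
        "description": "Retrieve CPU, memory, and status for a server.",
        "parameters": {
            "type": "object",
            "properties": {
                "server_id": {
                    "type": "string",
                    "description": "Server identifier, e.g. payment-server-01",
                },
            },
            "required": ["server_id"],
        },
    },
}
```

The schema name must match the key used in the tool registry so that a model-issued call can be dispatched to the right Python function.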
This repository includes:
- research notebook version of the project
- modular production-style Python implementation
- unit tests for deterministic tools
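A unit test for the deterministic tool layer might look like the sketch below. The tool is stubbed inline so the example is self-contained; in the repo the import would come from `src.tools.it_support_tools` and the test name is assumed.

```python
# Stub standing in for src.tools.it_support_tools.restart_service.
def restart_service(server_id: str) -> dict:
    return {"server_id": server_id, "action": "restart", "result": "success"}

def test_restart_reports_success():
    """The restart tool must always report its bounded action explicitly."""
    result = restart_service("payment-server-01")
    assert result["action"] == "restart"
    assert result["result"] == "success"
```

Because the tools are deterministic, these tests need no mocking framework or API key: plain pytest assertions cover them.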
Input:
The payment-server-01 is extremely slow and timing out.
Agent behavior:
- checks server health
- fetches recent logs
- detects CPU saturation and hung process patterns
- triggers restart
- provides a final response summary
This project is a production-style AI agent prototype that assists Level 1 IT support teams in investigating and responding to common server incidents. It demonstrates how an agent can reason over health metrics and logs, choose bounded remediation actions, and escalate intelligently when automation is insufficient.
- Tool-calling AI incident response workflow
- Health check and recent log retrieval tools
- Automated restart for recoverable cases
- Escalation path for dependency or deeper failures
- Modular Python project structure
- Research notebook preserved in repo
- Unit tests for core tool behavior
- Python
- OpenAI API with function calling
- pytest
- python-dotenv
```
it-support-agent/
│── src/
│   ├── __init__.py
│   ├── main.py
│   ├── config.py
│   ├── prompts.py
│   ├── agent.py
│   └── tools/
│       ├── __init__.py
│       ├── it_support_tools.py
│       └── registry.py
│── tests/
│   └── test_it_support_tools.py
│── research/
│   └── it_support_agent_research.ipynb
│── docs/
│   └── architecture.md
│── README.md
│── requirements.txt
│── pytest.ini
│── .env.example
│── .gitignore
```
```
git clone <your-repo-url>
cd it-support-agent
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Create a local `.env` file using `.env.example`:

```
OPENAI_API_KEY=your_real_openai_key_here
MODEL_NAME=gpt-4o
```

Run the agent from the project root:

```
python -m src.main
```

Then enter an incident description such as:
The payment-server-01 is extremely slow and timing out.
The search-index-09 is failing to serve results and seems broken.
Investigated search-index-09. Health metrics do not indicate local CPU or memory pressure, but logs show critical dependency failure and connection refused errors to the search backend. Automated restart is unlikely to resolve the issue. Escalated to a human engineer with summary.
Run all tests:

```
pytest
```

Run one test file with verbose output:

```
pytest tests/test_it_support_tools.py -v
```

Choose a license for this repository, such as MIT.
Ravi Doddi
GitHub: https://github.com/iOSNinja/it-support-agent
LinkedIn: www.linkedin.com/in/ravi-doddi-32061110