Wireshark for AI-agent token traffic.
AgentShark analyzes AI-agent traces and tells you where your tokens went, which ones were wasted, and how to reduce cost without hurting answer quality.
python agentshark.py analyze examples/sample_trace_rag_overload.json
AgentShark v0.1
========================================
Total tokens: 6,200
Estimated cost: $0.0310
Breakdown:
System Prompt 800 (13%)
Conversation History 1,200 (19%)
Retrieved Context 2,900 (47%)
Tool Output 700 (11%)
Final Answer 600 (10%)
Diagnosis: RAG_OVERLOAD
Explanation: Retrieved context exceeds 40% of total tokens.
Recommendations:
1. Reduce RAG top_k
2. Add reranking before sending to model
3. Deduplicate retrieved chunks
Estimated savings: 35-50%
Quality risk: Low
Modern AI agents make multiple model calls, retrieve large knowledge chunks, call external tools, use long system prompts, and pass long conversation histories back into the model.
Existing observability tools tell you what happened and how much it cost.
They rarely tell you why it cost so much or what to change.
Developers often don't know:
- Which part of the workflow used the most tokens?
- Were those tokens actually useful?
- Did retrieved documents help the final answer?
- Did the agent make unnecessary model calls?
- How can cost be reduced without reducing quality?
AgentShark reads an AI-agent trace and produces a plain-English explanation:
- Token breakdown — where did each token go?
- Waste diagnosis — which pattern caused the cost?
- Recommendations — what should you change?
- Savings estimate — how much could you save?
- Quality risk — what is the risk of the optimization?
AgentShark is not another tracing dashboard. It is a focused analysis layer that sits on top of your existing observability setup.
From token counting to token intelligence.
| Wireshark | AgentShark |
|---|---|
| Understands internet traffic | Understands token traffic |
| Captures packets | Captures agent steps |
| Measures packet size | Measures token usage |
| Finds network bottlenecks | Finds token cost bottlenecks |
| Helps debug network problems | Helps debug expensive AI workflows |
Requirements: Python 3.8+. No external dependencies for v0.1.
# Clone the repo
git clone https://github.com/yourusername/agentshark.git
cd agentshark
# Run against the sample trace
python agentshark.py analyze examples/sample_trace_rag_overload.json
AgentShark reads a simple JSON trace file. You can export this from your own agent or use one of the included examples.
{
"trace_id": "trace_001",
"session_id": "session_abc123",
"total_tokens": 6200,
"estimated_cost": 0.031,
"steps": [
{ "type": "system_prompt", "tokens": 800 },
{ "type": "conversation_history", "tokens": 1200 },
{ "type": "retrieved_context", "tokens": 2900, "chunk_count": 8 },
{ "type": "tool_output", "tokens": 700 },
{ "type": "final_answer", "tokens": 600 }
]
}
Supported step types:
| Type | Description |
|---|---|
| `system_prompt` | Developer or operator system prompt |
| `conversation_history` | Prior turns passed back to the model |
| `retrieved_context` | RAG chunks sent as context |
| `tool_output` | Output from tool or API calls |
| `final_answer` | Model response to the user |
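As a rough illustration of how a trace in this format turns into the token breakdown, here is a minimal sketch. It is not AgentShark's actual implementation: `token_breakdown` is a hypothetical helper, and falling back to the summed step tokens when `total_tokens` is missing is an assumption.

```python
from collections import defaultdict

# Sample trace copied from the format shown above.
TRACE = {
    "trace_id": "trace_001",
    "session_id": "session_abc123",
    "total_tokens": 6200,
    "estimated_cost": 0.031,
    "steps": [
        {"type": "system_prompt", "tokens": 800},
        {"type": "conversation_history", "tokens": 1200},
        {"type": "retrieved_context", "tokens": 2900, "chunk_count": 8},
        {"type": "tool_output", "tokens": 700},
        {"type": "final_answer", "tokens": 600},
    ],
}


def token_breakdown(trace):
    """Return {step_type: (tokens, share_of_total)} for one trace.

    Hypothetical helper. Tokens are summed per type so a trace with
    several steps of the same type is still handled.
    """
    totals = defaultdict(int)
    for step in trace["steps"]:
        totals[step["type"]] += step["tokens"]
    # Assumption: use the summed step tokens if total_tokens is absent.
    total = trace.get("total_tokens") or sum(totals.values())
    if total <= 0:
        raise ValueError("trace has no tokens")
    return {kind: (count, count / total) for kind, count in totals.items()}


for kind, (count, share) in token_breakdown(TRACE).items():
    print(f"{kind:<22}{count:>6,} ({share:.0%})")
```

For the sample trace, the rounded shares match the percentages in the example report: 13%, 19%, 47%, 11%, and 10%.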
AgentShark classifies token waste into named categories.
| Category | Trigger | Description |
|---|---|---|
| `RAG_OVERLOAD` | Retrieved context > 40% of tokens | Too many chunks retrieved |
| `HISTORY_BLOAT` | Conversation history > 30% of tokens | Old turns bloating input |
| `TOOL_OUTPUT_BLOAT` | Tool output > 25% of tokens | Verbose tool responses |
| `PROMPT_BLOAT` | System prompt > 20% of tokens | Oversized system prompt |
| `MULTI_CALL_OVERHEAD` | Many LLM steps for simple request | Unnecessary model calls |
| `CACHE_MISS` | Repeated similar questions recomputed | No caching in place |
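The four percentage-based triggers in the table can be expressed as a small rule table. The sketch below is illustrative only and may differ from how AgentShark applies its rules: `SHARE_RULES` and `diagnose` are hypothetical names, returning every category that fires is an assumption, and `MULTI_CALL_OVERHEAD` and `CACHE_MISS` are left out because the table gives no numeric threshold for them.

```python
# Percentage triggers from the table above: step type -> (category, threshold).
SHARE_RULES = {
    "retrieved_context": ("RAG_OVERLOAD", 0.40),
    "conversation_history": ("HISTORY_BLOAT", 0.30),
    "tool_output": ("TOOL_OUTPUT_BLOAT", 0.25),
    "system_prompt": ("PROMPT_BLOAT", 0.20),
}


def diagnose(steps):
    """Return the categories whose share-of-tokens trigger fires.

    Hypothetical helper; `steps` is the "steps" list of a trace.
    """
    total = sum(step["tokens"] for step in steps)
    if total == 0:
        return []
    # Sum tokens per step type in case a type appears more than once.
    by_type = {}
    for step in steps:
        by_type[step["type"]] = by_type.get(step["type"], 0) + step["tokens"]
    found = []
    for step_type, (category, threshold) in SHARE_RULES.items():
        if by_type.get(step_type, 0) / total > threshold:
            found.append(category)
    return found


sample_steps = [
    {"type": "system_prompt", "tokens": 800},
    {"type": "conversation_history", "tokens": 1200},
    {"type": "retrieved_context", "tokens": 2900, "chunk_count": 8},
    {"type": "tool_output", "tokens": 700},
    {"type": "final_answer", "tokens": 600},
]
print(diagnose(sample_steps))  # ['RAG_OVERLOAD']
```

Keeping the thresholds in one dictionary means a new percentage-based rule is a one-line addition, which is one possible starting point for contributing a new waste rule.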
- JSON trace input
- Token breakdown
- Waste classification
- Savings recommendations
- Sample traces
- Import traces directly from Langfuse
- Batch analysis across sessions
- Markdown report export
- Cost by trace, client, and session
- Top waste reasons
- Estimated savings over time
- Arize Phoenix
- Helicone
- OpenTelemetry
- Before/after optimization comparison
- LLM judge scoring
- Cost vs quality tradeoff report
AgentShark is not a replacement for Langfuse, Arize Phoenix, Helicone, or LangSmith.
Those tools are excellent for tracing, monitoring, and evaluation.
AgentShark adds a focused layer on top:
Langfuse / Phoenix / Helicone / Custom Logs
↓
AgentShark
↓
Token breakdown + Waste diagnosis + Recommendations
Think of it as the analysis engine that explains what your existing traces are telling you.
Three sample traces are included:
| File | Scenario |
|---|---|
| `examples/sample_trace_simple.json` | Clean FAQ response, minimal waste |
| `examples/sample_trace_rag_overload.json` | RAG overload, 8 chunks retrieved |
| `examples/sample_trace_tool_loop.json` | Agent loop, repeated tool calls |
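To compare the three scenarios side by side, one option is a small wrapper that runs the documented `analyze` command on each file. This is a sketch, not part of AgentShark: `analyze_command` and `run_samples` are hypothetical helpers, and treating a non-zero exit code as a failure is an assumption about the CLI's behavior.

```python
import subprocess
import sys

# The sample traces listed in the table above.
SAMPLE_TRACES = [
    "examples/sample_trace_simple.json",
    "examples/sample_trace_rag_overload.json",
    "examples/sample_trace_tool_loop.json",
]


def analyze_command(trace_path):
    """Build the documented CLI invocation for one trace file."""
    return [sys.executable, "agentshark.py", "analyze", trace_path]


def run_samples(paths=SAMPLE_TRACES):
    """Run the analyzer on each path and return {path: exit code}.

    Call this from the repository root so agentshark.py resolves.
    Each report is printed to the terminal by the analyzer itself.
    """
    results = {}
    for path in paths:
        completed = subprocess.run(analyze_command(path), check=False)
        results[path] = completed.returncode
    return results
```

Calling `run_samples()` from the repository root prints the three reports in sequence and returns the exit code for each file.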
AgentShark is early and welcomes contributions.
Good first issues:
- Add a new connector (Langfuse, Phoenix, Helicone)
- Add a new waste rule
- Improve the report format
- Add a new sample trace
- Write a test
Please open an issue before starting a large PR.
AgentShark was built because the author needed it.
Running AI chatbots in production surfaces a problem quickly: you know what your LLM bill is, but you do not know why. Existing tools show the total. AgentShark shows the breakdown, names the waste, and tells you what to fix.
MIT
AgentShark v0.1 is a working CLI analyzer. It is early, focused, and intentionally small.
The goal is to be useful immediately — not to be complete eventually.
AgentShark. See where your tokens go.