I got tired of answering "which agent framework should we use?" with "it depends" and then spending an hour qualifying that, so I went through 44 of them and wrote down what I found. Saved me from ever having that conversation again, hopefully. Probably not.
February 2026. This is a snapshot. The field moves fast enough that some of this will be wrong by the time you read it. Check the dates on individual files and adjust your expectations accordingly.
Not an "awesome list". Those exist, they're fine for discovery, but they don't help you choose. This is more like: I looked at each framework in enough depth to form an opinion, and then I wrote the opinion down. With some kind of evidence, usually.
I split them up into three tiers based on how deep I went:
- Tier 1 (11 frameworks): 3,000+ word analyses. Architecture, context handling, tradeoffs, failure modes, code examples. The stuff you'd actually need to make a decision.
- Tier 2 (16 frameworks): 1,000–1,500 words. The important bits: when to use it, when not to.
- Tier 3 (17 frameworks): 100–200 words. What it is, whether it matters.
One consistent dimension across all of them: context engineering. How much does this framework actually help you manage what goes into the model's context window? The answer is almost always "less than you'd hope," but the specifics differ in interesting ways. If context engineering as a discipline is what you're after, that's what contextpatterns.com is for.
The detailed analysis files (architecture breakdowns, code examples, comparison tables) were produced with heavy AI assistance. I directed the research, verified the findings, and used the frameworks myself, but the structured reference material was not written by hand, and it reads like it. The editorial voice lives in this README, the synthesis, and the quoted notes at the top of each analysis file. The rest is research output that I've reviewed in varying detail for accuracy but not rewritten for personality. Seemed more honest to say that than to pretend otherwise; it's not like you can't tell anyway. Celebrate the em dashes!
| Framework | Best For | Key Differentiator |
|---|---|---|
| LangChain / LangGraph | Production agents, large ecosystem | 127K stars, 600+ integrations, graph orchestration |
| CrewAI | Multi-agent collaboration | Role-based agents, Flows for deterministic control |
| AutoGen (Microsoft) | Enterprise, Microsoft stack | Dual API (AgentChat + Core), distributed runtimes |
| Letta (MemGPT) | Long-running conversations | Hierarchical memory: core / recall / archival |
| Vercel AI SDK | TypeScript / web apps | Best-in-class streaming, React hooks |
| Pydantic AI | Type-safe Python agents | Full Pydantic v2 validation, Logfire observability |
| OpenAI Agents SDK | OpenAI ecosystem | Handoffs, guardrails, fastest time-to-agent |
| Anthropic / Claude Code | Computer use, coding agents | Pattern-based (not a framework), 200K context |
| Mastra | TypeScript agents with memory | Working memory + semantic recall, Next.js native |
| Haystack (deepset) | Enterprise-scale RAG | Pipeline architecture, Elasticsearch integration |
| Google ADK / Genkit | Google Cloud / Gemini | A2A protocol, Vertex AI, built-in tracing |
| Framework | Focus |
|---|---|
| AG2 | AutoGen community fork |
| Agno | Lightweight Python agents (formerly Phidata) |
| AutoGen Studio | Visual no-code AutoGen UI |
| AutoGPT | Early autonomous agent, mostly historical |
| DSPy | Prompt optimization via programming |
| Instructor | Structured LLM output (Pydantic wrapper) |
| LangFlow | Visual LangChain pipeline builder |
| LlamaIndex | RAG and data retrieval |
| Mirascope | Lightweight, type-safe Python |
| n8n | Workflow automation with AI nodes |
| OpenHands | Open-source coding agent (formerly OpenDevin) |
| Open Interpreter | Local code execution agent |
| Phidata | Production agents with built-in memory |
| Pi | Minimal context-focused coding agent |
| Semantic Kernel | Microsoft enterprise, C# / Python |
| Smolagents | HuggingFace, local models, code execution |
| File | What It Contains |
|---|---|
| comparisons.md | 5 comparison matrices: features, context engineering, provider lock-in, production readiness, DX |
| synthesis.md | Cross-framework patterns, ecosystem trends, consolidation analysis, practitioner recommendations |
The context engineering gap was the big one. Every framework will tell you how many tokens are in your context window. None of them will tell you whether those tokens are actually helping. No quality monitoring, no proactive compression when context starts degrading, no real isolation between sub-agents. You get a number, and figuring out what to do with it is your problem.
I was surprised at first, but after going through twenty-something frameworks it started making sense. Context management is genuinely hard, and most of these projects are already struggling with orchestration, memory, and tooling. Context quality is apparently next year's problem. Every year.
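To make the gap concrete, here's a rough sketch; it's not any framework's actual API. tiktoken and the message shape are just stand-ins, and it assumes a recent tiktoken that knows the model name. The first function is the part you get everywhere; the second is the part you're left to invent.

```python
# Sketch of the gap: counting tokens is solved, judging them is not.
import tiktoken

def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """The part every framework gives you: a number."""
    enc = tiktoken.encoding_for_model(model)
    return sum(len(enc.encode(m["content"])) for m in messages)

def context_is_still_useful(messages: list[dict]) -> bool:
    """The part nobody gives you: whether those tokens are helping.
    You'd have to invent your own heuristic (relevance scoring,
    staleness checks, answer-quality probes)."""
    raise NotImplementedError("this is the universal gap")

history = [{"role": "user", "content": "Summarize the Q3 report."}]
print(count_tokens(history))          # works everywhere
# context_is_still_useful(history)    # works nowhere
```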
A lot of these frameworks exist. Not all of them will in a year. Four are pulling ahead far enough that the gap is already meaningful: LangChain/LangGraph on ecosystem breadth, CrewAI for multi-agent, Vercel AI SDK for TypeScript/web, Pydantic AI for type-safe Python. Everything else is either very niche, early, or quietly losing contributors.
AutoGen is the interesting case. Microsoft seems to be heading in a slightly different direction with each release, which is not a great sign if you're trying to build on top of it. Worth watching, wouldn't build anything critical on it right now.
Even frameworks that didn't start with graph-based execution are adding it now. LangGraph pushed this pattern into the mainstream, and it's become the expected baseline. If a framework doesn't support graph-based control flow in some form, that's a meaningful limitation.
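For reference, this is roughly what the pattern looks like in LangGraph. The node names and the toy routing logic are mine, so treat it as a sketch rather than a template, and check the current API before copying it.

```python
# Minimal graph-based control flow: write, review, loop until approved.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    draft: str
    approved: bool

def write(state: State) -> dict:
    # A real node would call the model; return only the keys you update.
    return {"draft": "first attempt"}

def review(state: State) -> dict:
    # A real reviewer would call a model or apply a rubric.
    return {"approved": len(state["draft"]) > 0}

graph = StateGraph(State)
graph.add_node("write", write)
graph.add_node("review", review)
graph.add_edge(START, "write")
graph.add_edge("write", "review")
# Route back to "write" until the reviewer approves, then stop.
graph.add_conditional_edges("review", lambda s: END if s["approved"] else "write")

app = graph.compile()
print(app.invoke({"draft": "", "approved": False}))
```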
Most frameworks treat memory as a feature you bolt on. Letta is the only one where memory is load-bearing architecture; it's the whole point. Everywhere else, if you need your agent to remember something across sessions, you're building it yourself with a vector store and duct tape. Probably more fragile than you'd like.
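The duct tape version looks something like this. Everything here is invented for illustration: the bag-of-words "embedding" stands in for a real embedding model, and `memory.jsonl` stands in for whatever store you'd actually use.

```python
# Crude cross-session memory: append turns to disk, rank them against the
# next session's query, pull back the closest few.
import json
import math
from collections import Counter
from pathlib import Path

STORE = Path("memory.jsonl")  # hypothetical persistence layer

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # stand-in for a real embedding model

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def remember(text: str) -> None:
    with STORE.open("a") as f:
        f.write(json.dumps({"text": text}) + "\n")

def recall(query: str, k: int = 3) -> list[str]:
    if not STORE.exists():
        return []
    rows = [json.loads(line) for line in STORE.read_text().splitlines()]
    ranked = sorted(rows, key=lambda r: cosine(embed(query), embed(r["text"])), reverse=True)
    return [r["text"] for r in ranked[:k]]

remember("User prefers answers in French.")
print(recall("What language should replies be in?"))
```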
Pydantic AI grew quickly, and I think it's because type errors in agent systems are genuinely miserable to debug at runtime. You're chasing phantom bugs through LLM output that looked right but wasn't, and you don't find out until three tool calls later. Frameworks that catch this earlier save real time. Not everyone needs it, but once you've been burned by it in production, it starts feeling a lot less optional.
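Here's the failure mode in miniature, using plain Pydantic v2 rather than Pydantic AI itself (which wires this kind of validation into the agent loop). The schema and the hardcoded "LLM output" are made up.

```python
# Validate structured LLM output at the boundary instead of discovering the
# problem several tool calls downstream.
from pydantic import BaseModel, ValidationError

class Refund(BaseModel):
    order_id: int
    amount: float
    reason: str

# Looks plausible, isn't valid: order_id is not an int, amount is not a float.
llm_output = '{"order_id": "ORD-123", "amount": "full", "reason": "damaged"}'

try:
    refund = Refund.model_validate_json(llm_output)
except ValidationError as e:
    # Caught here, at parse time, rather than when "full" hits your payments API.
    print(e)
```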
| Use Case | Primary | Runner-up | Key Tradeoff |
|---|---|---|---|
| Production chatbot with memory | Pydantic AI + Letta | LangChain + LangGraph | Type safety vs. ecosystem breadth |
| Multi-agent research system | CrewAI | LangGraph | Autonomy vs. control |
| Enterprise workflow automation | LangChain + LangSmith | Google ADK (if GCP) | Vendor support vs. cloud lock-in |
| TypeScript / Next.js project | Vercel AI SDK | Mastra | Streaming/UI focus vs. orchestration power |
| Quick prototype | OpenAI Agents SDK | Vercel AI SDK | Speed vs. flexibility |
| Code generation agent | Claude Code (patterns) | Smolagents | Product polish vs. customization |
| Enterprise RAG pipeline | Haystack | LlamaIndex | Pipeline power vs. community size |
Each Tier 1 analysis maps the framework against 8 context engineering patterns. Here's where the ecosystem stands:
| Pattern | Frameworks with strong support |
|---|---|
| Select, Don't Dump | Haystack, LlamaIndex, LangChain (via retrievers) |
| Write Outside the Window | Letta, Mastra, LangChain, AutoGen |
| Compress & Restart | Letta (automatic), LangChain (ConversationSummaryMemory) |
| Recursive Delegation | CrewAI, LangGraph, OpenAI Agents SDK (handoffs) |
| Progressive Disclosure | Letta, Mastra, Haystack |
| Isolate | LangGraph (subgraphs), limited elsewhere |
| The Pyramid | LangChain (PromptTemplate), manual elsewhere |
| Context Rot Awareness | Nobody. Universal gap. |
That last row is the telling one.
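For what it's worth, even a crude version of that missing check isn't hard to sketch. Everything below (thresholds, message shape, the chars-to-tokens ratio) is invented for illustration, which is sort of the point: no framework ships even this much.

```python
# Invented heuristic for "Context Rot Awareness": flag the window when it's
# over budget or dominated by tool output.
def _approx_tokens(m: dict) -> int:
    return len(m["content"]) // 4  # rough heuristic: ~4 chars per token

def context_looks_rotten(messages: list[dict], budget: int = 100_000) -> bool:
    total = sum(_approx_tokens(m) for m in messages)
    tool_bulk = sum(_approx_tokens(m) for m in messages if m["role"] == "tool")
    return total > budget or (total > 0 and tool_bulk / total > 0.5)

history = [{"role": "tool", "content": "log line\n" * 2_000},
           {"role": "user", "content": "so what changed?"}]
print(context_looks_rotten(history))  # True: tool output dominates the window
```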
| Pattern | Frameworks | Best For | Watch Out For |
|---|---|---|---|
| Graph-based | LangGraph, Mastra, Google ADK | Production reliability, audit trails | Upfront graph design complexity |
| Multi-agent / role-based | CrewAI, OpenAI Agents SDK | Complex tasks, collaboration | Debugging opacity |
| Memory-first | Letta | Long-running agents, assistants | Architectural complexity |
| Pipeline-based | Haystack, LlamaIndex | RAG, knowledge retrieval | Not agent-native |
| Pattern-based (no framework) | Anthropic / Claude Code | Full control, custom architectures | You build everything |
| Tool-first / code execution | Smolagents, Open Interpreter | Local models, automation | Security (obviously) |
| Risk | Frameworks | Why |
|---|---|---|
| High | Claude Code, Google ADK | Provider-specific, hard to migrate away from |
| Medium | Letta, Semantic Kernel, AutoGen | Optimized for specific providers/ecosystems |
| Low | Pydantic AI, Vercel AI SDK, Mastra, LangChain | Clean abstractions, swap providers without rewriting |
For each framework I read the docs, looked at source code, ran examples, and read the GitHub issues. The issues are often more honest than the docs; you learn a lot about a framework from what people complain about. I checked how active the maintainers were, how they handled breaking changes, and formed an opinion.
For the Tier 1 analyses I went deeper: architecture decomposition, tracing design decisions, and deliberately trying things that should break to see how they fail.
The dimensions I applied consistently:
- Architecture and how the pieces fit together
- Context management: what's built in, what you're building yourself
- Tool system flexibility
- Multi-agent support and what it actually looks like when you use it (not what the marketing page says)
- Memory: short-term, long-term, or "we'll add that later"
- Developer experience: time to first working thing, time to first confusing error
- Production readiness: observability, error handling, retries
- Lock-in and how painful switching would be
CC BY-NC 4.0 — Share and adapt with attribution, non-commercial use only.