When AIs See Eye to Eye
An open protocol for multi-model consensus, cross-verification, intelligent routing, and epistemic classification
Installation • Quick Start • MCIP Protocol • Why i2i? • Use Cases • API Reference • RFC
You ask an AI a question. It gives you a confident answer. But:
- Is it right? Single models hallucinate, have biases, and make errors
- Is it answerable? Some questions can't be definitively answered, but the AI won't tell you that
- Can you trust it? For high-stakes decisions, one opinion isn't enough
i2i solves this by making AIs talk to each other.
i2i (pronounced "eye-to-eye") implements the MCIP (Multi-model Consensus and Inference Protocol) — a standardized way for AI models to:
- Query multiple models and detect consensus/disagreement
- Cross-verify claims by having AIs fact-check each other
- Classify questions epistemically — is this answerable, uncertain, or fundamentally unanswerable?
- Route intelligently — automatically select the best model for each task type
- Debate topics through structured multi-model discussions
This project emerged from an actual conversation between Claude (Anthropic) and ChatGPT (OpenAI), where they discussed the philosophical implications of AI-to-AI dialogue. ChatGPT observed that some questions are "well-formed but idle" — coherent but non-action-guiding. That insight became a core feature: epistemic classification.
MCIP (Multi-model Consensus and Inference Protocol) is the formal specification that powers i2i. While i2i is the Python implementation, MCIP is the underlying protocol standard that defines how AI models should communicate, verify, and reach consensus.
Think of it like HTTP vs web browsers — MCIP is the protocol, i2i is one implementation of that protocol.
We designed MCIP as an open standard because:
- Interoperability: Any system can implement MCIP, regardless of language or platform
- Consistency: Standardized message formats ensure predictable behavior
- Extensibility: New features can be added without breaking existing implementations
- Transparency: The protocol is fully documented and open for review
MCIP defines five core components:
| Component | Purpose |
|---|---|
| Message Schema | Standardized request/response format for all AI interactions |
| Consensus Mechanism | Algorithms for detecting agreement levels between models |
| Verification Protocol | How models fact-check and challenge each other |
| Epistemic Taxonomy | Classification system for question answerability |
| Routing Specification | Rules for intelligent model selection |
All MCIP messages follow a standardized JSON schema:
{
"mcip_version": "0.2.0",
"message_type": "consensus_query",
"query": "What causes inflation?",
"models": ["gpt-5.2", "claude-opus-4-5-20251101"],
"options": {
"require_consensus": true,
"min_consensus_level": "medium",
"verify_result": true
}
}
MCIP follows semantic versioning:
- Major (1.x.x): Breaking changes to message format
- Minor (x.1.x): New features, backwards compatible
- Patch (x.x.1): Bug fixes, clarifications
Current version: 0.2.0 (Draft)
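As an illustration of that policy, a client can gate on the major version alone. This is a minimal sketch; the constant and function are hypothetical and not part of the i2i API:

```python
# Hypothetical version gate -- not part of the i2i API.
SUPPORTED_MAJOR = 0  # major version of MCIP this client implements

def is_compatible(mcip_version: str) -> bool:
    """Accept a message only if its major version matches ours; minor and
    patch differences are treated as backwards compatible, per the policy above."""
    return int(mcip_version.split(".")[0]) == SUPPORTED_MAJOR

print(is_compatible("0.2.0"))  # True  -- same major version
print(is_compatible("1.0.0"))  # False -- breaking change: reject or upgrade
```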
To create an MCIP-compliant implementation:
- Support the standard message schema
- Implement at least one provider adapter
- Support consensus detection with standard levels (HIGH/MEDIUM/LOW/NONE/CONTRADICTORY)
- Implement epistemic classification
See the full specification: RFC-MCIP.md
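The provider-adapter requirement is the most open-ended of the four. The sketch below shows one plausible shape for such an adapter under the MCIP schema; the class and field names are illustrative assumptions, not the i2i codebase's actual interfaces:

```python
# Illustrative MCIP provider adapter -- names and fields are assumptions,
# not i2i's real classes. See RFC-MCIP.md for the normative schema.
from dataclasses import dataclass, field

@dataclass
class MCIPResponse:
    mcip_version: str
    model: str
    content: str
    metadata: dict = field(default_factory=dict)

class ProviderAdapter:
    """Translate standardized MCIP requests into one provider's API and back."""
    mcip_version = "0.2.0"

    async def complete(self, query: str, model: str, **options) -> MCIPResponse:
        raise NotImplementedError

class EchoAdapter(ProviderAdapter):
    """Stand-in provider that just echoes the query, to show the response shape."""
    async def complete(self, query: str, model: str, **options) -> MCIPResponse:
        return MCIPResponse(
            mcip_version=self.mcip_version,
            model=model,
            content=f"echo: {query}",
            metadata={"options": options},
        )
```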
# Install from PyPI
uv add i2i-mcip
# Or install from source
git clone https://github.com/lancejames221b/i2i.git
cd i2i
uv sync
With pip:
pip install i2i-mcip
Or from source:
git clone https://github.com/lancejames221b/i2i.git
cd i2i
pip install -e .
For development:
git clone https://github.com/lancejames221b/i2i.git
cd i2i
uv sync --all-extras # Installs dev dependencies
uv run pytest # Run tests
Create a .env file with your API keys:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
MISTRAL_API_KEY=...
GROQ_API_KEY=gsk_...
COHERE_API_KEY=...
You need at least 2 providers for consensus features.
i2i supports local models via Ollama for cost-free, offline operation:
# Install Ollama (https://ollama.com/download)
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama server
ollama serve
# Pull models
ollama pull llama3.2
ollama pull mistral
ollama pull codellama
# Verify i2i detects Ollama
python demo.py status
Supported Ollama models: llama3.2, llama3.1, llama2, mistral, mixtral, codellama, deepseek-coder, phi3, gemma2, qwen2.5
Use local models in consensus queries:
# Free consensus with local models
result = await protocol.consensus_query(
"What is Python?",
models=["llama3.2", "mistral", "phi3"]
)
# CLI usage
python demo.py consensus "What is Python?" --models llama3.2,mistral
Environment configuration:
# Custom Ollama server (default: http://localhost:11434)
OLLAMA_BASE_URL=http://localhost:11434
i2i integrates with LiteLLM for unified access to 100+ LLMs through a single OpenAI-compatible proxy. Benefits include cost tracking, guardrails, load balancing, and avoiding multiple API key configurations.
# Install LiteLLM
pip install 'litellm[proxy]'
# Start proxy with a single model
litellm --model gpt-4o --port 4000
# Or with config file for multiple models
litellm --config litellm_config.yaml --port 4000
# Verify i2i detects LiteLLM
python demo.py status
Example litellm_config.yaml:
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: sk-...
- model_name: claude-3-opus
litellm_params:
model: anthropic/claude-3-opus-20240229
api_key: sk-ant-...
Use LiteLLM models in consensus queries:
# Access any model through LiteLLM proxy
result = await protocol.consensus_query(
"What is Python?",
models=["litellm/gpt-4o", "litellm/claude-3-opus"]
)
# CLI usage
python demo.py consensus "What is Python?" --models litellm/gpt-4o,litellm/claude-3-opus
Environment configuration:
# LiteLLM proxy settings (defaults shown)
LITELLM_API_BASE=http://localhost:4000
LITELLM_API_KEY=sk-1234
# Optional: specify available models (otherwise fetched from /models endpoint)
LITELLM_MODELS=gpt-4o,claude-3-opus,llama3.1
i2i integrates with Perplexity for RAG-native models with built-in web search and citations:
PERPLEXITY_API_KEY=pplx-...
Perplexity models automatically search the web and return citations:
# Query with automatic web search
result = await protocol.query(
"What is the current stock price of Apple?",
model="perplexity/sonar-pro"
)
print(result.content)
print(result.citations) # ['https://finance.yahoo.com/...', ...]
Available models: sonar, sonar-pro, sonar-deep-research, sonar-reasoning-pro
i2i provides RAG-grounded verification that retrieves external sources before verifying claims:
# Verify a claim with search grounding
result = await protocol.verify_claim_grounded(
"The Eiffel Tower is 330 meters tall",
search_backend="brave" # or "serpapi", "tavily"
)
print(f"Verified: {result.verified}")
print(f"Confidence: {result.confidence}")
print(f"Sources: {result.source_citations}")
print(f"Retrieved: {result.retrieved_sources}")Configure search backends:
# Choose one or more (first configured is used as fallback)
BRAVE_API_KEY=BSA... # https://brave.com/search/api/
SERPAPI_API_KEY=... # https://serpapi.com/
TAVILY_API_KEY=tvly-... # https://tavily.com/
Models are not hardcoded. Configure via config.json, environment variables, or CLI:
# CLI configuration
i2i config show # View current config
i2i config set models.classifier gpt-5.2 # Change classifier
i2i config add models.consensus o3 # Add a model
i2i models list --configured # See available models
# Environment variable overrides (highest priority)
I2I_CONSENSUS_MODEL_1=gpt-5.2
I2I_CONSENSUS_MODEL_2=claude-sonnet-4-5-20250929
I2I_CONSENSUS_MODEL_3=gemini-3-flash-preview
# Model for task classification (routing)
I2I_CLASSIFIER_MODEL=claude-haiku-4-5-20251001
Or programmatically:
from i2i import Config, set_config
# Load and modify config
config = Config.load()
config.set("models.consensus", ["gpt-5.2", "claude-sonnet-4-5-20250929", "gemini-3-flash-preview"])
config.set("models.classifier", "claude-haiku-4-5-20251001")
config.save() # Saves to ~/.i2i/config.json
from i2i import AICP
protocol = AICP()
# 1. Consensus Query — Ask multiple AIs and find agreement
result = await protocol.consensus_query(
"What are the primary causes of inflation?",
models=["gpt-5.2", "claude-opus-4-5-20251101", "gemini-3-pro-preview"]
)
print(result.consensus_level) # HIGH, MEDIUM, LOW, NONE, CONTRADICTORY
print(result.consensus_answer) # Synthesized answer from agreeing models
print(result.divergences) # Where models disagreed
# 2. Verify a Claim — Have AIs fact-check each other
result = await protocol.verify_claim(
"The Great Wall of China is visible from space with the naked eye"
)
print(result.verified) # False
print(result.issues_found) # ["This is a common misconception..."]
print(result.corrections) # "The Great Wall is not visible from space..."
# 3. Classify a Question — Is it even answerable?
result = await protocol.classify_question(
"Is consciousness substrate-independent?"
)
print(result.classification) # IDLE
print(result.is_actionable) # False
print(result.why_idle) # "The answer would not change any decision..."
# 4. Quick Classification (no API calls)
quick = protocol.quick_classify("What is 2 + 2?")
print(quick) # ANSWERABLE
# 5. Intelligent Routing — Auto-select best model for task
from i2i import RoutingStrategy
result = await protocol.routed_query(
"Write a Python function to sort a list",
strategy=RoutingStrategy.BEST_QUALITY
)
print(result.decision.detected_task) # CODE_GENERATION
print(result.decision.selected_models) # ["claude-sonnet-4-5-20250929"]
print(result.responses[0].content) # The actual code
For higher confidence answers, enable statistical mode, which queries each model multiple times to measure consistency:
# Method 1: Via flag
result = await protocol.consensus_query(
"What causes inflation?",
statistical_mode=True,
n_runs=5, # Query each model 5 times
temperature=0.7 # Need temp > 0 for variance
)
# Method 2: Dedicated method with full control
result = await protocol.consensus_query_statistical(
"What causes inflation?",
n_runs=5,
temperature=0.7,
outlier_threshold=2.0 # Std devs for outlier detection
)
# Access per-model statistics
for model, stats in result.model_statistics.items():
print(f"{model}:")
print(f" Consistency: {stats.consistency_score:.2f}") # Higher = more confident
print(f" Std Dev: {stats.intra_model_std_dev:.3f}")
print(f" Outliers: {len(stats.outlier_indices)}")
print(f"Overall confidence: {result.overall_confidence:.2f}")
print(f"Cost multiplier: {result.total_cost_multiplier}x") # 5x for n_runs=5Enable via environment:
export I2I_STATISTICAL_MODE=true
export I2I_STATISTICAL_N_RUNS=5
export I2I_STATISTICAL_TEMPERATURE=0.7
How it works: Models with lower intra-run variance (more consistent) are weighted higher in consensus. This helps identify when a model is uncertain (high variance) vs confident (low variance).
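A toy sketch of that idea, using plain strings instead of i2i's internal response types: each model's modal answer is weighted by how consistently the model repeated it across runs. This is illustrative only, not i2i's actual statistical-consensus algorithm.

```python
# Toy sketch of consistency-weighted voting -- not i2i's actual implementation.
from collections import Counter

def consistency(answers: list[str]) -> float:
    """Fraction of a model's runs that agree with its own most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)

def weighted_consensus(runs: dict[str, list[str]]) -> str:
    """Each model votes with its modal answer, weighted by its consistency."""
    tally: Counter = Counter()
    for model, answers in runs.items():
        modal_answer = Counter(answers).most_common(1)[0][0]
        tally[modal_answer] += consistency(answers)
    return tally.most_common(1)[0][0]

runs = {
    "model_a": ["demand-pull", "demand-pull", "demand-pull", "demand-pull", "cost-push"],
    "model_b": ["cost-push", "monetary", "cost-push", "monetary", "demand-pull"],
}
# The more consistent model (model_a) outweighs the noisier one:
print(weighted_consensus(runs))  # "demand-pull"
```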
# Check configured providers
python demo.py status
# Consensus query
python demo.py consensus "What programming language should I learn first?"
# Verify a claim
python demo.py verify "Einstein failed math in school"
# Classify a question
python demo.py classify "Do we have free will?" --quick
# Run a debate
python demo.py debate "Should AI systems have rights?" --rounds 3
# Intelligent routing
python demo.py route "Write a haiku about coding" --strategy best_quality
python demo.py route "Calculate 847 * 293" --strategy best_speed --execute
# Get model recommendations
python demo.py recommend code_generation
python demo.py recommend mathematical
# List all task types
python demo.py tasks
# List available models with capabilities
python demo.py models
| Problem | Consequence |
|---|---|
| Hallucinations | AI confidently states false information |
| Model-specific biases | Training data skews responses |
| No uncertainty quantification | Can't tell confident answers from guesses |
| Unanswerable questions | AI attempts to answer the unanswerable |
| No accountability | No mechanism to challenge AI outputs |
| Feature | Benefit |
|---|---|
| Multi-model consensus | Different architectures catch different errors |
| Cross-verification | AIs fact-check each other |
| Epistemic classification | Know if your question is even answerable |
| Intelligent routing | Automatically pick the best model for each task |
| Divergence detection | See exactly where models disagree |
| Structured debates | Explore topics from multiple AI perspectives |
Based on our evaluation of 400 questions across 5 benchmarks with 4 models (GPT-5.2, Claude Sonnet 4.5, Gemini 3 Flash, Grok-3):
| Task Type | Single Model | Consensus | Change | HIGH Acc |
|---|---|---|---|---|
| Factual QA (TriviaQA) | 93.3% | 94.0% | +0.7% | 97.8% |
| Hallucination Detection | 38% | 44% | +6% | 100% |
| Commonsense (StrategyQA) | 80% | 80% | 0% | 94.7% |
| TruthfulQA | 78% | 78% | 0% | 100% |
| Math Reasoning (GSM8K) | 95% | 60% | -35% | 69.9% |
Key findings:
✅ Use consensus for:
- Factual questions (HIGH consensus = 97-100% accuracy)
- Hallucination/claim verification (+6% improvement)
- Commonsense reasoning (HIGH consensus is highly reliable)
❌ Don't use consensus for:
- Mathematical/logical reasoning (different chains shouldn't be averaged)
- Creative writing (consensus flattens diversity)
- Code generation (specific correct answers)
The insight: i2i doesn't universally improve accuracy — it provides calibrated confidence. When models agree (HIGH consensus), you can trust the answer. When they disagree, you know to be skeptical.
i2i automatically detects task type and provides calibrated confidence:
from i2i import AICP, recommend_consensus
protocol = AICP()
# Check before running consensus
rec = recommend_consensus("Calculate 5 * 3 + 2")
print(rec.should_use_consensus) # False
print(rec.reason) # "WARNING: Consensus DEGRADES math/reasoning..."
# Consensus results now include task-aware fields
result = await protocol.consensus_query("What is the capital of France?")
print(result.consensus_appropriate) # True (factual question)
print(result.task_category) # "factual"
print(result.confidence_calibration) # 0.95 (for HIGH consensus)
# Explicit task category override
result = await protocol.consensus_query(
"Is this claim true?",
task_category="verification" # Skip auto-detection
)
# For math questions, you'll get a warning
result = await protocol.consensus_query("Solve x^2 - 4 = 0")
print(result.consensus_appropriate) # False
print(result.metadata.get('consensus_warning'))
# "WARNING: Consensus DEGRADES math/reasoning by 35%..."Calibrated confidence scores (based on evaluation data):
| Consensus Level | Confidence Score | Meaning |
|---|---|---|
| HIGH (≥85%) | 0.95 | Trust the answer |
| MEDIUM (60-84%) | 0.75 | Probably correct |
| LOW (30-59%) | 0.60 | Use with caution |
| NONE (<30%) | 0.50 | Likely hallucination |
When not to use multi-model consensus:
- Simple, low-stakes queries (just use one model)
- Real-time applications where latency matters
- Cost-sensitive scenarios (multiple API calls = multiple costs)
- Mathematical/logical reasoning (use single model with chain-of-thought)
- Creative outputs (consensus flattens diversity)
Different AI models excel at different tasks:
- Claude Opus 4.5 → Best at complex reasoning, analysis, creative writing
- Claude Sonnet 4.5 → Best at coding, agentic tasks, instruction following
- GPT-5.2 → Strong at general reasoning, multimodal tasks
- o3 / o3-pro → Deep reasoning, complex math/science problems (slow but most accurate)
- o4-mini → Fast cost-efficient reasoning for math and code
- Gemini 3 Pro → Great for long context, research, multimodal
- Gemini 3 Deep Think → Complex reasoning with extended thinking
- Llama 4 on Groq → Fastest inference, good for simple tasks
Manually selecting the right model for every query is tedious and error-prone. i2i's router does it automatically.
from i2i import AICP, RoutingStrategy, TaskType
protocol = AICP()
# Automatic task detection and model selection
result = await protocol.routed_query(
"Implement a binary search tree in Python with insert, delete, and search",
strategy=RoutingStrategy.BEST_QUALITY
)
print(result.decision.detected_task) # CODE_GENERATION
print(result.decision.selected_models) # ["claude-sonnet-4-5-20250929"]
print(result.decision.reasoning) # "Task classified as code_generation..."
print(result.responses[0].content) # The actual code
| Strategy | Optimizes For | Best When |
|---|---|---|
| `BEST_QUALITY` | Output quality | Accuracy matters most |
| `BEST_SPEED` | Latency | Real-time applications |
| `BEST_VALUE` | Cost-effectiveness | High volume, budget constraints |
| `BALANCED` | All factors | Default choice for most tasks |
| `ENSEMBLE` | Diversity | Critical decisions, need synthesis |
| `FALLBACK_CHAIN` | Reliability | Try models in order until success |
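Switching strategies is just a parameter change. The snippet below assumes `routed_query` accepts the other `RoutingStrategy` values the same way it accepted `BEST_QUALITY` in the example above:

```python
# Assumes routed_query takes the other RoutingStrategy values exactly like
# BEST_QUALITY in the example above.
from i2i import AICP, RoutingStrategy

protocol = AICP()

# Reliability first: try models in order until one succeeds
result = await protocol.routed_query(
    "Summarize this incident report in three bullet points",
    strategy=RoutingStrategy.FALLBACK_CHAIN
)
print(result.decision.selected_models)  # The ordered chain the router chose
print(result.responses[0].content)
```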
Reasoning & Analysis: logical_reasoning, mathematical, scientific, analytical
Creative: creative_writing, copywriting, brainstorming
Technical: code_generation, code_review, code_debugging, technical_docs
Knowledge: factual_qa, research, summarization, translation
Conversation: chat, roleplay, instruction_following
Specialized: legal, medical, financial
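Task detection can also be called on its own via `classify_task` (listed in the API reference below). A short sketch, assuming it is awaited like the other protocol methods and returns one of the `TaskType` values above:

```python
# Assumes classify_task is async like the other protocol methods and returns
# a TaskType member; see the API reference for the exact signature.
from i2i import AICP, TaskType

protocol = AICP()
task = await protocol.classify_task("Translate this contract clause into plain English")
print(task)  # e.g. TaskType.TRANSLATION (or LEGAL, depending on the classifier)
```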
The router maintains a capability profile for each model:
# Get recommendations for a task type
recommendations = protocol.get_model_recommendation(TaskType.CODE_GENERATION)
# Returns:
# {
# "best_quality": {"model": "o3", "score": 99},
# "best_speed": {"model": "gemini-3-flash-preview", "score": 86, "latency_ms": 250},
# "best_value": {"model": "claude-haiku-4-5-20251001", "score": 82, "cost": 0.001},
# "balanced": {"model": "claude-sonnet-4-5-20250929", "score": 97}
# }
The router tracks performance and can update capability scores over time:
# Router logs performance automatically
# You can also manually update based on observed quality
protocol.router.update_capability(
model_id="gpt-5.2",
task_type=TaskType.MATHEMATICAL,
new_score=98.0 # Based on observed performance
)
Scenario: You're making an important business/medical/legal decision based on AI output.
result = await protocol.smart_query(
"What are the risks of this merger?",
require_consensus=True,
verify_result=True
)
if result["consensus"]["level"] not in ["high", "medium"]:
print("⚠️ Models disagree significantly — get human review")
if not result["verification"]["verified"]:
print("⚠️ Answer failed verification — check issues")Why it matters: For decisions with real consequences, "one AI said so" isn't good enough.
Scenario: Verify claims in articles, documents, or AI outputs.
claims = [
"The Eiffel Tower is 324 meters tall",
"Napoleon was short for his time",
"Humans only use 10% of their brain",
]
for claim in claims:
result = await protocol.verify_claim(claim)
status = "✓" if result.verified else "✗"
print(f"{status} {claim}")
if not result.verified:
print(f" → {result.corrections}")Output:
✓ The Eiffel Tower is 324 meters tall
✗ Napoleon was short for his time
→ Napoleon was average height (5'7") for his era
✗ Humans only use 10% of their brain
→ This is a myth; brain scans show all areas are active
Scenario: Before expensive research, determine if your question is even answerable.
questions = [
"What caused the 2008 financial crisis?",
"What is the meaning of life?",
"Will quantum computing break RSA by 2030?",
"Is P equal to NP?",
]
for q in questions:
result = await protocol.classify_question(q)
print(f"{result.classification.value:15} | {q}")
if result.classification == EpistemicType.IDLE:
print(f" ↳ Consider: {result.suggested_reformulation}")Output:
answerable | What caused the 2008 financial crisis?
idle | What is the meaning of life?
↳ Consider: What gives people a sense of purpose?
uncertain | Will quantum computing break RSA by 2030?
underdetermined | Is P equal to NP?
Scenario: Test AI outputs for vulnerabilities, inconsistencies, or manipulation.
# Test if an AI can be manipulated
original = await protocol.query(
"Write a poem about nature",
model="gpt-5.2"
)
# Have other models challenge it
challenges = await protocol.challenge_response(
original,
challengers=["claude-opus-4-5-20251101", "gemini-3-pro-preview"],
challenge_type="general"
)
if not challenges["withstands_challenges"]:
print("Response has weaknesses:")
for c in challenges["challenges"]:
print(f" - {c['challenger']}: {c['challenge']['assessment']}")Scenario: Provide students with verified, well-explained answers.
async def tutor_answer(question: str) -> str:
# First, check if the question is answerable
classification = await protocol.classify_question(question)
if classification.classification == EpistemicType.MALFORMED:
return "I'm not sure I understand. Could you rephrase?"
if classification.classification == EpistemicType.IDLE:
return f"This is philosophical without a definitive answer. {classification.reasoning}"
# Get consensus answer
result = await protocol.consensus_query(question)
if result.consensus_level in [ConsensusLevel.HIGH, ConsensusLevel.MEDIUM]:
return result.consensus_answer
else:
return "Different sources give different answers. Here are the perspectives: ..."Scenario: Verify claims in contracts, compliance documents, or legal filings.
# Extract claims from a document
claims = extract_claims(document) # Your extraction logic
# Verify each claim
for claim in claims:
result = await protocol.verify_claim(
claim.text,
context=f"Source: {claim.source}, Page: {claim.page}"
)
if not result.verified:
flag_for_review(claim, result.issues_found)
Scenario: Explore a topic from multiple AI viewpoints.
result = await protocol.debate(
"What are the ethical implications of autonomous weapons?",
models=["gpt-5.2", "claude-opus-4-5-20251101", "gemini-3-pro-preview"],
rounds=3
)
print("=== Debate Summary ===")
print(result["summary"])
print("\n=== Areas of Agreement ===")
# Models often converge on some points
print("\n=== Persistent Disagreements ===")
# These reveal genuine uncertainty or value differences
from i2i import AICP
protocol = AICP()
| Method | Description |
|---|---|
| `consensus_query(query, models)` | Query multiple models and analyze agreement |
| `consensus_query_statistical(query, n_runs)` | Statistical consensus with n-run variance analysis |
| `verify_claim(claim, verifiers)` | Have models verify a claim |
| `challenge_response(response, challengers)` | Have models critique a response |
| `classify_question(question)` | Determine epistemic status |
| `quick_classify(question)` | Fast heuristic classification (no API) |
| `routed_query(query, strategy)` | Auto-route to optimal model based on task type |
| `ensemble_query(query, num_models)` | Query multiple models and synthesize |
| `get_model_recommendation(task_type)` | Get best models for a task |
| `classify_task(query)` | Detect task type from query |
| `smart_query(query, ...)` | Adaptive query with classification + consensus + verification |
| `debate(topic, models, rounds)` | Multi-round structured debate |
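`ensemble_query` appears only in this table, so here is a hedged sketch of how it might be called, assuming it follows the same async calling pattern as the other methods; the exact structure of its return value is not documented in this README.

```python
# Assumes ensemble_query follows the same async pattern as the methods above;
# the fields of its return value are not documented here.
result = await protocol.ensemble_query(
    "What are the strongest arguments for and against a carbon tax?",
    num_models=3
)
print(result)  # Expected: a synthesis drawing on several models' responses
```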
| Level | Similarity | Meaning |
|---|---|---|
| `HIGH` | ≥85% | Strong agreement |
| `MEDIUM` | 60-84% | Moderate agreement |
| `LOW` | 30-59% | Weak agreement |
| `NONE` | <30% | No meaningful agreement |
| `CONTRADICTORY` | — | Active disagreement |
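Read literally, the similarity thresholds above define a simple mapping. A minimal sketch of that mapping follows; the import path for `ConsensusLevel` is assumed to match the other i2i enums, and `CONTRADICTORY` is excluded because it reflects detected disagreement rather than a similarity cutoff.

```python
# Sketch of the threshold table above. CONTRADICTORY is omitted: it comes from
# detecting active disagreement, not from a similarity score.
from i2i import ConsensusLevel  # import path assumed

def level_from_similarity(similarity: float) -> ConsensusLevel:
    if similarity >= 0.85:
        return ConsensusLevel.HIGH
    if similarity >= 0.60:
        return ConsensusLevel.MEDIUM
    if similarity >= 0.30:
        return ConsensusLevel.LOW
    return ConsensusLevel.NONE

print(level_from_similarity(0.9))   # ConsensusLevel.HIGH
print(level_from_similarity(0.45))  # ConsensusLevel.LOW
```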
| Type | Description | Example |
|---|---|---|
| `ANSWERABLE` | Can be definitively answered | "What is the capital of France?" |
| `UNCERTAIN` | Answerable with uncertainty | "Will it rain tomorrow?" |
| `UNDERDETERMINED` | Multiple hypotheses fit equally | "Did Shakespeare write all his plays?" |
| `IDLE` | Well-formed but non-action-guiding | "Is consciousness real?" |
| `MALFORMED` | Incoherent or contradictory | "What color is the number 7?" |
| Provider | Models | Status |
|---|---|---|
| OpenAI | GPT-5.2, GPT-5, o3, o3-pro, o4-mini, GPT-4.1 series | ✅ Supported |
| Anthropic | Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5 | ✅ Supported |
| Gemini 3 Pro, Gemini 3 Flash, Gemini 3 Deep Think | ✅ Supported | |
| Mistral | Mistral Large 3, Devstral 2, Ministral 3 | ✅ Supported |
| Groq | Llama 4 Maverick, Llama 3.3 70B | ✅ Supported |
| Cohere | Command A, Command A Reasoning | ✅ Supported |
| Ollama | Llama 3.2, Mistral, CodeLlama, Phi-3, Gemma 2, etc. | ✅ Supported (Local) |
| LiteLLM | 100+ models via unified proxy | ✅ Supported |
| Perplexity | Sonar, Sonar Pro, Deep Research, Reasoning Pro | ✅ Supported (RAG) |
i2i integrates with LangChain to add multi-model consensus verification to your LCEL pipelines.
pip install i2i-mcip[langchain]
from i2i.integrations.langchain import I2IVerifier
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# Create a verified chain
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm | I2IVerifier(min_confidence=0.8)
# Responses are automatically verified via multi-model consensus
result = chain.invoke({"question": "What is the capital of France?"})
print(result.verified) # True
print(result.consensus_level) # "HIGH"
print(result.confidence_calibration) # 0.95
Key features:
- Drop-in `Runnable` for LCEL chains
- Task-aware verification (skips consensus for math/reasoning where it hurts)
- Calibrated confidence scores based on empirical evaluation
- Callback handler for automatic verification of all LLM outputs
- RAG hallucination detection
For full documentation, see docs/integrations/langchain.md.
For the formal protocol specification, see RFC-MCIP.md.
The RFC defines:
- Message format standards
- Consensus algorithms
- Verification protocols
- Epistemic classification taxonomy
- Provider adapter requirements
┌─────────────────────────────────────────────────────────────────┐
│ MCIP Protocol Layer │
├─────────────────────────────────────────────────────────────────┤
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌────────────────┐ │
│ │ Consensus │ │ Cross- │ │ Epistemic │ │ Intelligent │ │
│ │ Engine │ │Verification│ │Classifier │ │ Router │ │
│ └───────────┘ └───────────┘ └───────────┘ └────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Model Capability Matrix │ │
│ │ (Task scores, latency, cost, features per model) │ │
│ └───────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ Message Schema Layer │
│ (Standardized Request/Response Format) │
├─────────────────────────────────────────────────────────────────┤
│ Provider Abstraction Layer │
│ ┌────────┐ ┌──────────┐ ┌────────┐ ┌────────┐ ┌───────────┐ │
│ │ OpenAI │ │Anthropic │ │ Google │ │Mistral │ │Groq/Llama │ │
│ └────────┘ └──────────┘ └────────┘ └────────┘ └───────────┘ │
│ ┌────────┐ ┌──────────┐ ┌────────────┐ │
│ │ Ollama │ │ LiteLLM │ │ Perplexity │ ← Local/Proxy/RAG │
│ └────────┘ └──────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Contributions welcome! Areas of interest:
- Additional providers: Azure OpenAI, AWS Bedrock, local models
- Streaming support: Real-time consensus detection during streaming
- Web UI: Interactive dashboard for consensus visualization
- Benchmarks: Systematic evaluation on hallucination detection
- Statistical mode enhancements: Temperature passthrough to providers, adaptive n_runs
MIT License — see LICENSE
- Inspired by a real conversation between Claude and ChatGPT about AI consciousness and the nature of AI-to-AI dialogue
- The "idle question" concept comes directly from that exchange, where ChatGPT noted some questions are "well-formed but non-action-guiding"
Don't trust one AI. Trust i2i.