@portkey-ai/mcp-tool-filter

Ultra-fast semantic tool filtering for MCP (Model Context Protocol) servers using embedding similarity. Reduce your tool context from 1000+ tools down to the most relevant 10-20 tools in under 10ms.

Features

  • ⚡ Lightning Fast: <10ms filtering latency for 1000+ tools with built-in optimizations
  • 🚀 Performance Optimized: 6-8x faster dot product, smart top-K selection, true LRU cache
  • 🎯 Semantic Understanding: Uses embeddings for intelligent tool matching
  • 📦 Zero Runtime Dependencies: Only requires an embedding provider API
  • 🔄 Flexible Input: Accepts chat completion messages or raw strings
  • 💾 Smart Caching: Caches embeddings and context for optimal performance
  • 🎛️ Configurable: Tune scoring thresholds, top-K, and always-include tools
  • 📊 Performance Metrics: Built-in timing for optimization

Installation

npm install @portkey-ai/mcp-tool-filter

Quick Start

import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';

// 1. Initialize the filter (choose embedding provider)

// Option A: Local Embeddings (RECOMMENDED for low latency < 5ms)
const filter = new MCPToolFilter({
  embedding: {
    provider: 'local',
  }
});

// Option B: API Embeddings (for highest accuracy)
const filter = new MCPToolFilter({
  embedding: {
    provider: 'openai',
    apiKey: process.env.OPENAI_API_KEY,
  }
});

// 2. Load your MCP servers (one-time setup)
await filter.initialize(mcpServers);

// 3. Filter tools based on context
const result = await filter.filter(
  "Search my emails for the Q4 budget discussion"
);

// 4. Use the filtered tools in your LLM request
console.log(result.tools); // Top 20 most relevant tools
console.log(result.metrics.totalTime); // e.g., "2ms" for local, "500ms" for API

Embedding Provider Options

Local Embeddings (Recommended)

Pros:

  • ⚡ Ultra-fast: 1-5ms latency
  • 🔒 Private: No data sent to external APIs
  • 💰 Free: No API costs
  • 🌐 Offline: Works without internet

Cons:

  • Slightly lower accuracy than API models
  • First initialization downloads model (~25MB)

const filter = new MCPToolFilter({
  embedding: {
    provider: 'local',
    model: 'Xenova/all-MiniLM-L6-v2', // Optional: default model
    quantized: true, // Optional: use quantized model for speed (default: true)
  }
});

Available Models:

  • Xenova/all-MiniLM-L6-v2 (default) - 384 dimensions, very fast
  • Xenova/all-MiniLM-L12-v2 - 384 dimensions, more accurate
  • Xenova/bge-small-en-v1.5 - 384 dimensions, good balance
  • Xenova/bge-base-en-v1.5 - 768 dimensions, higher quality

Performance:

  • Initialization: 100ms-4s (one-time, downloads model)
  • Filter request: 1-5ms
  • Cached request: <1ms
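
Because the model download happens on first use, one pattern is to pay the initialization cost once at startup, before serving traffic. A minimal sketch, assuming your server definitions are already loaded into an mcpServers array:

import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';

// Warm-up sketch: absorb the one-time model download/load at startup,
// so user-facing requests only ever see the 1-5ms filter path.
// `mcpServers` is assumed to hold your MCP server definitions.
export async function createWarmFilter(mcpServers: any[]) {
  const filter = new MCPToolFilter({ embedding: { provider: 'local' } });
  await filter.initialize(mcpServers); // one-time: downloads/loads the model
  await filter.filter('warm-up');      // primes the embedding pipeline
  return filter;
}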

API Embeddings

For highest accuracy, use OpenAI or other API providers:

const filter = new MCPToolFilter({
  embedding: {
    provider: 'openai',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'text-embedding-3-small', // Optional
    dimensions: 384, // Optional: match local model for fair comparison
  }
});

Pros:

  • 🎯 Highest accuracy: 5-15% better than local
  • 🔄 Easy to switch models
  • 🌐 No local resources needed

Cons:

  • 🐌 Slow: 400-800ms per request
  • 💰 Costs money: ~$0.02 per 1M tokens
  • 🔒 Data sent to external API
  • 📶 Requires internet connection

Performance:

  • Initialization: 200ms-60s (depends on tool count)
  • Filter request: 400-800ms
  • Cached request: 1-3ms

Quick Comparison

Aspect      Local             API                 Winner
Speed       1-5ms             400-800ms           🏆 Local (200x faster)
Accuracy    Good (85-90%)     Best (100%)         🏆 API
Cost        Free              ~$0.02/1M tokens    🏆 Local
Privacy     Fully local       Data sent to API    🏆 Local
Offline     ✅ Works offline   ❌ Needs internet    🏆 Local
Setup       Zero config       Needs API key       🏆 Local

📊 See TRADEOFFS.md for detailed analysis
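
If you want to switch between the two without code changes, one option is to select the provider from an environment variable. A sketch; the EMBEDDING_PROVIDER variable name is our own convention, not something the library reads:

import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';

// Provider-selection sketch: local in development, API in production.
// EMBEDDING_PROVIDER is an illustrative env var, not part of the library.
const filter = new MCPToolFilter({
  embedding:
    process.env.EMBEDDING_PROVIDER === 'openai'
      ? { provider: 'openai', apiKey: process.env.OPENAI_API_KEY! }
      : { provider: 'local' },
});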

MCP Server JSON Format

The library expects an array of MCP servers with the following structure:

[
  {
    "id": "gmail",
    "name": "Gmail MCP Server",
    "description": "Email management tools",
    "categories": ["email", "communication"],
    "tools": [
      {
        "name": "search_gmail_messages",
        "description": "Search and find email messages in Gmail inbox. Use when user wants to find, search, look up emails...",
        "keywords": ["email", "search", "inbox", "messages"],
        "category": "email-search",
        "inputSchema": {
          "type": "object",
          "properties": {
            "q": { "type": "string" }
          }
        }
      }
    ]
  }
]

Field Descriptions

Required Fields:

  • id: Unique identifier for the server
  • name: Human-readable server name
  • tools: Array of tool definitions
    • name: Unique tool name
    • description: Rich description of what the tool does and when to use it

Optional but Recommended:

  • description: Server-level description
  • categories: Array of category tags for hierarchical filtering
  • keywords: Array of synonym/related terms for better matching
  • category: Tool-level category
  • inputSchema: JSON schema for parameters (parameter names are used for matching)
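
Because the format is plain JSON, server definitions can live in a file and be loaded at startup. A minimal sketch; the file name mcp-servers.json is illustrative:

import { readFileSync } from 'node:fs';
import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';

// Load server definitions from disk (file name is illustrative)
const mcpServers = JSON.parse(readFileSync('./mcp-servers.json', 'utf8'));

const filter = new MCPToolFilter({ embedding: { provider: 'local' } });
await filter.initialize(mcpServers);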

Tips for Best Results

  1. Rich Descriptions: Write detailed descriptions with use cases

    "description": "Search emails in Gmail. Use when user wants to find, lookup, or retrieve messages, correspondence, or mail."
  2. Add Keywords: Include synonyms and variations

    "keywords": ["email", "mail", "inbox", "messages", "correspondence"]
  3. Mention Use Cases: Explicitly state when to use the tool

    "description": "... Use when user wants to draft, compose, write, or prepare an email to send later."

API Reference

MCPToolFilter

Main class for tool filtering.

Constructor

new MCPToolFilter(config: MCPToolFilterConfig)

Config Options:

{
  embedding: {
    // Local embeddings (recommended)
    provider: 'local',
    model?: string,               // Default: 'Xenova/all-MiniLM-L6-v2'
    quantized?: boolean,          // Default: true
    
    // OR API embeddings
    provider: 'openai' | 'voyage' | 'cohere',
    apiKey: string,
    model?: string,               // Default: 'text-embedding-3-small'
    dimensions?: number,          // Default: 1536 (or 384 for local)
    baseURL?: string,            // For custom endpoints
  },
  defaultOptions?: {
    topK?: number,              // Default: 20
    minScore?: number,          // Default: 0.3
    contextMessages?: number,   // Default: 3
    alwaysInclude?: string[],   // Always include these tools
    exclude?: string[],         // Never include these tools
    maxContextTokens?: number,  // Default: 500
  },
  includeServerDescription?: boolean,  // Default: false (see below)
  debug?: boolean               // Enable debug logging
}
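
For reference, a complete configuration pulling these options together might look like this sketch (the tool name in alwaysInclude is a placeholder):

const filter = new MCPToolFilter({
  embedding: {
    provider: 'local',
    model: 'Xenova/all-MiniLM-L6-v2',
    quantized: true,
  },
  defaultOptions: {
    topK: 20,
    minScore: 0.3,
    contextMessages: 3,
    alwaysInclude: ['web_search'], // placeholder tool name
  },
  includeServerDescription: false,
  debug: false,
});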

About includeServerDescription:

When enabled, this option includes the MCP server description in the tool embeddings, providing additional context about the domain/category of tools.

// Enable server descriptions in embeddings
const filter = new MCPToolFilter({
  embedding: { provider: 'local' },
  includeServerDescription: true  // Default: false
});

Tradeoffs:

  • ✅ Helps: General intent queries like "manage my local files" (+25% improvement)
  • ❌ Hurts: Specific tool queries like "Execute this SQL query" (-50% degradation)
  • ≈ Neutral: Overall impact is neutral (0% change)

Recommendation: Keep this disabled (default: false) unless your use case primarily involves high-level intent queries. See examples/benchmark-server-description.ts for detailed benchmarks.

Methods

initialize(servers: MCPServer[]): Promise<void>

Initialize the filter with MCP servers. This precomputes and caches all tool embeddings.

Note: Call this once during startup. It's an async operation that may take a few seconds depending on the number of tools.

await filter.initialize(servers);

filter(input: FilterInput, options?: FilterOptions): Promise<FilterResult>

Filter tools based on the input context.

Input Types:

// String input
await filter.filter("Search my emails about the project");

// Chat messages
await filter.filter([
  { role: 'user', content: 'What meetings do I have today?' },
  { role: 'assistant', content: 'Let me check your calendar.' }
]);

Options (all optional, override defaults):

{
  topK?: number,              // Max tools to return
  minScore?: number,          // Minimum similarity score (0-1)
  contextMessages?: number,   // How many recent messages to use
  alwaysInclude?: string[],   // Tool names to always include
  exclude?: string[],         // Tool names to exclude
  maxContextTokens?: number,  // Max context size
}
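
For example, a narrowly scoped request might override the defaults like this sketch (the query and excluded tool name are placeholders):

// Per-request override sketch
const result = await filter.filter('Delete the staging database backup', {
  topK: 5,       // return only the strongest candidates
  minScore: 0.5, // stricter than the 0.3 default
  exclude: ['drop_production_database'], // placeholder tool name
});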

Returns:

{
  tools: ScoredTool[],        // Filtered and ranked tools
  metrics: {
    totalTime: number,        // Total time in ms
    embeddingTime: number,    // Time to embed context
    similarityTime: number,   // Time to compute similarities
    toolsEvaluated: number,   // Total tools evaluated
  }
}
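
The scored tools can be inspected directly; the field names below (toolName, tool, score) follow the usage shown in the integration examples later in this README:

// Sketch: log each tool with its similarity score
const { tools, metrics } = await filter.filter('Summarize my unread emails');
for (const t of tools) {
  console.log(`${t.toolName}: ${t.score.toFixed(3)}`);
}
console.log(`Evaluated ${metrics.toolsEvaluated} tools in ${metrics.totalTime}ms`);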

getStats()

Get statistics about the filter state.

const stats = filter.getStats();
// {
//   initialized: true,
//   toolCount: 25,
//   cacheSize: 5,
//   embeddingDimensions: 1536
// }

clearCache()

Clear the context embedding cache.

filter.clearCache();

Performance Optimization

Built-in Optimizations

The library includes several performance optimizations out of the box:

  1. 🚀 Loop-Unrolled Dot Product - Vector similarity computation is 6-8x faster through CPU pipeline optimization (illustrated below)
  2. 📊 Smart Top-K Selection - Hybrid algorithm uses the fast built-in sort for typical workloads and switches to heap-based selection for 500+ tools
  3. 💾 True LRU Cache - Intelligent cache eviction based on access patterns, not just insertion order
  4. 🎯 In-Place Operations - Reduced memory allocations through in-place vector normalization
  5. ⚡ Set-Based Lookups - O(1) exclusion checking instead of O(n) array scanning

These optimizations are automatic and transparent - no configuration needed!
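
To illustrate the first technique (this is a sketch of the general approach, not the library's actual source), loop unrolling splits the accumulation across independent variables so the CPU can pipeline multiply-adds instead of stalling on a single running sum:

// Illustrative loop-unrolled dot product: four independent accumulators
// let the CPU overlap multiply-add operations across iterations.
function dotProductUnrolled(a: Float32Array, b: Float32Array): number {
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  const n = a.length - (a.length % 4);
  for (let i = 0; i < n; i += 4) {
    s0 += a[i] * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  let sum = s0 + s1 + s2 + s3;
  for (let i = n; i < a.length; i++) sum += a[i] * b[i]; // leftover tail
  return sum;
}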

Latency Breakdown

Typical performance for 1000 tools:

Building context:        <1ms
Embedding (local):       3-5ms  (cached: 0ms)
Similarity computation:  1-2ms  (6-8x faster with optimizations)
Sorting/filtering:       <1ms   (hybrid algorithm)
─────────────────────────────
Total:                   5-9ms

User Configuration Tips

  1. Use Smaller Embeddings: 512 or 1024 dimensions for faster computation

    embedding: {
      provider: 'openai',
      model: 'text-embedding-3-small',
      dimensions: 512  // Faster than 1536
    }
  2. Reduce Context Size: Fewer messages = faster embedding

    defaultOptions: {
      contextMessages: 2,  // Instead of 3-5
      maxContextTokens: 300
    }
  3. Leverage Caching: Identical contexts reuse cached embeddings (0ms); see the sketch after this list

  4. Tune topK: Request fewer tools if you don't need 20

    await filter.filter(input, { topK: 10 });
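
To see tip 3 in action, two calls with an identical context should show the embedding cost vanish on the second call:

// Cache demonstration: identical context strings reuse the cached embedding
const cold = await filter.filter('Find my recent invoices');
const warm = await filter.filter('Find my recent invoices');

console.log(cold.metrics.embeddingTime); // a few ms (cache miss)
console.log(warm.metrics.embeddingTime); // ~0 (cache hit)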

Performance Benchmarks

Micro-benchmarks showing optimization improvements:

Dot Product (1536 dims):        0.001ms vs 0.006ms (6x faster)
Vector Normalization:           0.003ms vs 0.006ms (2x faster)  
Top-K Selection (<500 tools):   Uses optimized built-in sort
Top-K Selection (500+ tools):   O(n log k) heap-based selection
LRU Cache Access:               True access-order tracking

See the existing benchmark examples for end-to-end performance testing:

npx ts-node examples/benchmark.ts

Integration Examples

With Portkey AI Gateway

import Portkey from 'portkey-ai';
import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';

const portkey = new Portkey({ apiKey: '...' });
const filter = new MCPToolFilter({ /* ... */ });

await filter.initialize(mcpServers);

// Filter tools based on conversation
const { tools } = await filter.filter(messages);

// Convert to OpenAI tool format
const openaiTools = tools.map(t => ({
  type: 'function',
  function: {
    name: t.toolName,
    description: t.tool.description,
    parameters: t.tool.inputSchema,
  }
}));

// Make LLM request with filtered tools
const completion = await portkey.chat.completions.create({
  model: 'gpt-4',
  messages: messages,
  tools: openaiTools,
});

With LangChain

import { ChatOpenAI } from 'langchain/chat_models/openai';
import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';

const filter = new MCPToolFilter({ /* ... */ });
await filter.initialize(mcpServers);

// Create a custom tool selector
async function selectTools(messages) {
  const { tools } = await filter.filter(messages);
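  // convertToLangChainTool is a user-supplied adapter from this library's
  // scored-tool shape to LangChain's tool format; it is not provided here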
  return tools.map(t => convertToLangChainTool(t));
}

// Use in your agent
const model = new ChatOpenAI();
const tools = await selectTools(messages);
const response = await model.invoke(messages, { tools });

Caching Strategy

// Recommended: Initialize once at startup
let filterInstance: MCPToolFilter;

async function getFilter() {
  if (!filterInstance) {
    filterInstance = new MCPToolFilter({ /* ... */ });
    await filterInstance.initialize(mcpServers);
  }
  return filterInstance;
}

// Use in request handlers
app.post('/chat', async (req, res) => {
  const filter = await getFilter();
  const result = await filter.filter(req.body.messages);
  // ... use filtered tools
});

Benchmarks

Performance on various tool counts (M1 Max):

Local Embeddings (Xenova/all-MiniLM-L6-v2):

Tools   Initialization   Filter (Cold)   Filter (Cached)
10      ~100ms           2ms             <1ms
100     ~500ms           3ms             <1ms
500     ~2s              4ms             1ms
1000    ~4s              5ms             1ms
5000    ~20s             8ms             2ms

API Embeddings (OpenAI text-embedding-3-small):

Tools   Initialization   Filter (Cold)   Filter (Cached)
10      ~200ms           500ms           1ms
100     ~1.5s            550ms           2ms
500     ~6s              600ms           2ms
1000    ~12s             650ms           3ms
5000    ~60s             800ms           4ms

Key Takeaways:

  • 🚀 Local embeddings are 200-300x faster for filter requests
  • ✅ Local embeddings meet the <50ms target easily
  • 💰 Local embeddings have no API costs
  • 📊 API embeddings may have slightly higher accuracy
  • ⚡ Both benefit significantly from caching

Note: Initialization is a one-time cost. Choose local embeddings for low latency, API embeddings for maximum accuracy.

When to Use Local vs API Embeddings

Use Local Embeddings when:

  • ⚡ You need ultra-low latency (<10ms)
  • 🔒 Privacy is important (no external API calls)
  • 💰 You want zero API costs
  • 🌐 You need offline operation
  • 📊 "Good enough" accuracy is acceptable

Use API Embeddings when:

  • 🎯 You need maximum accuracy
  • 🌍 You have good internet connectivity
  • 💵 API costs are not a concern
  • 📈 You're dealing with complex/nuanced queries

Recommendation: Start with local embeddings. Only switch to API if accuracy is insufficient.

Testing Local vs API

Compare performance for your use case:

npx ts-node examples/test-local-embeddings.ts

This will benchmark both providers and show you:

  • Initialization time
  • Average filter time
  • Cached filter time
  • Speed comparison

Debugging & Performance Monitoring

Enable Debug Logging

To see detailed timing logs for each request, enable debug mode:

const filter = new MCPToolFilter({
  embedding: { /* ... */ },
  debug: true  // Enable detailed timing logs
});

This will output detailed logs for each filter request:

=== Starting filter request ===
[1/5] Options merged: 0.12ms
[2/5] Context built (156 chars): 0.34ms
[3/5] Cache MISS (lookup: 0.08ms)
     → Embedding generated: 1247.56ms
[4/5] Similarities computed: 1.23ms (25 tools, 0.049ms/tool)
[5/5] Tools selected & ranked: 0.15ms (5 tools returned)
=== Total filter time: 1249.48ms ===
Breakdown: merge=0.12ms, context=0.34ms, cache=0.08ms, embedding=1247.56ms, similarity=1.23ms, selection=0.15ms

Timing Breakdown

Each filter request logs 5 steps:

  1. Options Merging (merge): Merge provided options with defaults
  2. Context Building (context): Build the context string from input messages
  3. Cache Lookup & Embedding (cache + embedding):
    • Cache HIT: 0ms embedding time (reuses cached embedding)
    • Cache MISS: Calls embedding API (typically 200-2000ms depending on provider)
  4. Similarity Computation (similarity): Compute cosine similarity for all tools
    • Also shows per-tool average time
  5. Tool Selection (selection): Filter by score and select top-K tools

Example: Testing Timings

See examples/test-timings.ts for a complete example:

export OPENAI_API_KEY=your-key-here
npx ts-node examples/test-timings.ts

This will run multiple filter requests showing:

  • Cache miss vs cache hit performance
  • Different query types
  • Chat message context handling

Performance Metrics

Every filter request returns detailed metrics:

const result = await filter.filter(input);

console.log(result.metrics);
// {
//   totalTime: 1249.48,      // Total request time in ms
//   embeddingTime: 1247.56,  // Time spent on embedding API
//   similarityTime: 1.23,    // Time computing similarities
//   toolsEvaluated: 25       // Number of tools evaluated
// }

Monitoring in Production

const result = await filter.filter(messages);

// Log metrics for monitoring
logger.info('Tool filter performance', {
  totalTime: result.metrics.totalTime,
  embeddingTime: result.metrics.embeddingTime,
  cached: result.metrics.embeddingTime === 0,
  toolsReturned: result.tools.length,
});

// Alert if too slow
if (result.metrics.totalTime > 5000) {
  logger.warn('Slow filter request', result.metrics);
}

Advanced Usage

Two-Stage Filtering

For very large tool sets, use hierarchical filtering:

// Stage 1: Narrow to servers whose categories match the user's intent
const relevantServers = mcpServers.filter(server =>
  server.categories?.some(cat => userIntent.includes(cat))
);

// Stage 2: Filter tools, excluding everything outside the relevant servers
const relevantIds = new Set(relevantServers.map(s => s.id));
const excludedTools = mcpServers
  .filter(s => !relevantIds.has(s.id))
  .flatMap(s => s.tools.map(t => t.name));

const result = await filter.filter(messages, { exclude: excludedTools });

Custom Scoring

Combine embedding similarity with keyword matching:

const { tools } = await filter.filter(input);

// Boost tools with exact keyword matches
const boostedTools = tools.map(tool => {
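  // `input` is assumed to be the raw query string here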
  const hasKeywordMatch = tool.tool.keywords?.some(kw => 
    input.toLowerCase().includes(kw.toLowerCase())
  );
  return {
    ...tool,
    score: hasKeywordMatch ? tool.score * 1.2 : tool.score
  };
}).sort((a, b) => b.score - a.score);

Always-Include Power Tools

Always include certain essential tools:

const filter = new MCPToolFilter({
  // ...
  defaultOptions: {
    alwaysInclude: [
      'web_search',           // Always useful
      'conversation_search',  // Access to context
    ],
  }
});

Troubleshooting

Slow First Request

Problem: First filter call is slow.

Solution: The first call for a given context must compute a fresh embedding (a few milliseconds with local embeddings, 400-800ms via an API provider). Subsequent calls with the same context hit the cache and are much faster.

// Warm up the cache
await filter.filter("hello"); // ~5ms
await filter.filter("hello"); // ~1ms (cached)

Poor Tool Selection

Problem: Wrong tools are being selected.

Solutions:

  1. Improve tool descriptions with more keywords and use cases
  2. Lower the minScore threshold
  3. Increase topK to include more tools
  4. Add important tools to alwaysInclude

Memory Usage

Problem: High memory usage with many tools.

Solution: Use smaller embedding dimensions (supported by API embedding models such as OpenAI's text-embedding-3 family):

embedding: {
  provider: 'openai',
  apiKey: process.env.OPENAI_API_KEY,
  dimensions: 512  // Instead of 1536
}

This reduces memory by ~66% with minimal accuracy loss.
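
As a rough sanity check on that figure (assuming embeddings are stored as 4-byte float32 values, which is an assumption about the internals):

// Back-of-the-envelope memory estimate, assuming float32 (4-byte) storage
const toolCount = 1000;
console.log((toolCount * 1536 * 4) / 1e6); // ≈ 6.1 MB at 1536 dimensions
console.log((toolCount * 512 * 4) / 1e6);  // ≈ 2.0 MB at 512 dimensions (~66% less)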

License

MIT

Contributing

Contributions welcome! Please open an issue or PR.
