Vector Semantic Search

Temp edited this page Oct 3, 2025 · 1 revision

Vector & Semantic Search Tools

8 specialized tools for vector embeddings, similarity search, and semantic operations using pgvector.

Requirements: pgvector extension (v0.5.0+)


📊 Overview

| Tool | Purpose | Feature |
| --- | --- | --- |
| vector_search | Similarity search | K-nearest neighbors |
| vector_similarity | Calculate similarity scores | Cosine, L2, inner product |
| vector_index_create | Create vector indexes | HNSW, IVFFlat |
| vector_index_optimize | Optimize vector indexes | Index tuning |
| vector_cluster | Cluster vectors | K-means clustering |
| vector_stats | Vector statistics | Dimensionality, distribution |
| semantic_search | Semantic text search | Natural language queries |
| embedding_insert | Insert embeddings | Batch insertion |

🔧 Tool Details

vector_search

Find the most similar vectors using various distance metrics.

Parameters:

  • table_name (string, required): Table containing vectors
  • vector_column (string, required): Column with vector embeddings
  • query_vector (list, required): Query vector to search for
  • distance_metric (string, optional): cosine, l2, inner_product (default: cosine)
  • limit (integer, optional): Number of results (default: 10)

Returns:

  • Matching rows with similarity scores
  • Distance/similarity values
  • Sorted by relevance

Example:

result = vector_search(
    table_name="documents",
    vector_column="embedding",
    query_vector=[0.1, 0.2, 0.3, ...],  # 384-dimensional vector
    distance_metric="cosine",
    limit=5
)
# Returns: [
#   {"id": 42, "title": "PostgreSQL Guide", "similarity": 0.95},
#   {"id": 18, "title": "Database Tutorial", "similarity": 0.89},
#   ...
# ]

Distance Metrics:

  • cosine - Cosine similarity (−1 to 1, higher is better; typically 0-1 for normalized text embeddings)
  • l2 - Euclidean distance (lower is better)
  • inner_product - Dot product (higher is better)
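
For reference, the three metrics can be sketched in plain Python (these helpers are illustrative, not part of the tool API):

```python
import math

def inner_product(a, b):
    # Dot product of two equal-length vectors; higher means more similar
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    # Euclidean distance; lower means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based similarity in [-1, 1]; 1 means identical direction
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)
```

For unit-normalized vectors, cosine similarity equals the inner product, and all three metrics produce the same ranking.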

Use Cases:

  • Semantic search
  • Document retrieval
  • Recommendation systems
  • Image similarity search

vector_similarity

Calculate similarity between two vectors, or between one vector and every row in a table.

Parameters:

  • vector1 (list, required): First vector
  • vector2 (list, optional): Second vector (for pairwise comparison)
  • table_name (string, optional): Table for batch comparison
  • vector_column (string, optional): Column containing vectors
  • distance_metric (string, optional): Similarity metric

Returns:

  • Similarity scores
  • Distance values
  • Comparison results

Example:

# Pairwise comparison
result = vector_similarity(
    vector1=[0.1, 0.2, 0.3],
    vector2=[0.15, 0.18, 0.32],
    distance_metric="cosine"
)
# Returns: {"similarity": 0.98, "distance": 0.02}

# Compare against table
result = vector_similarity(
    vector1=[0.1, 0.2, 0.3],
    table_name="documents",
    vector_column="embedding",
    distance_metric="cosine"
)

Use Cases:

  • Duplicate detection
  • Clustering validation
  • Quality assurance
  • Similarity thresholds
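
A typical duplicate-detection pass keeps only pairs above a similarity threshold. A minimal pure-Python sketch of the idea (the helper names and the 0.95 threshold are illustrative, not part of the tool API):

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_near_duplicates(embeddings, threshold=0.95):
    # embeddings: {doc_id: vector}. Returns (id1, id2, similarity) for
    # every pair whose cosine similarity meets the threshold.
    pairs = []
    for (id1, v1), (id2, v2) in combinations(embeddings.items(), 2):
        sim = cosine(v1, v2)
        if sim >= threshold:
            pairs.append((id1, id2, sim))
    return pairs
```

At scale you would use vector_search with an index instead of the O(n²) pairwise loop; the threshold logic is the same.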

vector_index_create

Create optimized indexes for vector similarity search.

Parameters:

  • table_name (string, required): Target table
  • vector_column (string, required): Vector column
  • index_type (string, required): hnsw or ivfflat
  • index_name (string, optional): Custom index name
  • parameters (object, optional): Index-specific parameters

Returns:

  • Index creation status
  • Estimated build time
  • Index configuration

Example:

# HNSW index (recommended for most use cases)
result = vector_index_create(
    table_name="documents",
    vector_column="embedding",
    index_type="hnsw",
    parameters={
        "m": 16,              # Max connections per layer
        "ef_construction": 64 # Build-time accuracy
    }
)

# IVFFlat index (for very large datasets)
result = vector_index_create(
    table_name="documents",
    vector_column="embedding",
    index_type="ivfflat",
    parameters={
        "lists": 100  # Number of clusters
    }
)

Index Types:

HNSW (Hierarchical Navigable Small World):

  • Best for: < 10M vectors
  • Pros: Fast queries, high accuracy
  • Cons: Slower to build, more memory

IVFFlat (Inverted File with Flat Compression):

  • Best for: > 10M vectors
  • Pros: Fast to build, less memory
  • Cons: Requires training, lower accuracy
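
For sizing the lists parameter, the pgvector README suggests roughly rows / 1000 for tables up to about 1M rows and sqrt(rows) beyond that. A small helper sketch (the function itself is illustrative):

```python
import math

def suggest_ivfflat_lists(row_count):
    # Heuristic from the pgvector README: rows/1000 up to 1M rows,
    # sqrt(rows) beyond that. Clamp to at least 1 list.
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, int(math.sqrt(row_count)))
```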

Use Cases:

  • Performance optimization
  • Large-scale deployments
  • Production readiness

vector_index_optimize

Optimize existing vector indexes for better performance.

Parameters:

  • table_name (string, required): Table with index
  • index_name (string, required): Index to optimize
  • operation (string, required): reindex, analyze, tune

Returns:

  • Optimization results
  • Performance improvements
  • Recommendations

Example:

# Reindex for accuracy
result = vector_index_optimize(
    table_name="documents",
    index_name="documents_embedding_idx",
    operation="reindex"
)

# Analyze index usage
result = vector_index_optimize(
    table_name="documents",
    index_name="documents_embedding_idx",
    operation="analyze"
)
# Returns: {
#   "index_scans": 15420,
#   "tuples_read": 154200,
#   "avg_scan_time_ms": 2.5,
#   "recommendations": ["Consider increasing ef_search for better accuracy"]
# }

Use Cases:

  • Performance tuning
  • Index maintenance
  • Accuracy optimization

vector_cluster

Cluster vectors using K-means or similar algorithms.

Parameters:

  • table_name (string, required): Source table
  • vector_column (string, required): Vector column
  • num_clusters (integer, required): Number of clusters
  • algorithm (string, optional): kmeans (default)
  • max_iterations (integer, optional): Maximum iterations

Returns:

  • Cluster assignments
  • Cluster centroids
  • Cluster statistics

Example:

result = vector_cluster(
    table_name="documents",
    vector_column="embedding",
    num_clusters=10,
    algorithm="kmeans"
)
# Returns: {
#   "clusters": [
#     {
#       "cluster_id": 0,
#       "size": 523,
#       "centroid": [0.15, 0.23, ...],
#       "avg_distance": 0.12
#     },
#     ...
#   ],
#   "total_documents": 5234,
#   "iterations": 15
# }

Use Cases:

  • Document categorization
  • Content organization
  • Recommendation systems
  • Data exploration
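
The underlying algorithm can be sketched in a few lines of plain Python (a toy Lloyd's-iteration K-means for illustration, not the server-side implementation):

```python
import math
import random

def kmeans(vectors, k, max_iterations=100, seed=0):
    # Lloyd's algorithm: assign each vector to its nearest centroid,
    # then recompute centroids as cluster means, until assignments settle.
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assignments = [0] * len(vectors)
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids
```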

vector_stats

Calculate statistics on vector embeddings.

Parameters:

  • table_name (string, required): Source table
  • vector_column (string, required): Vector column

Returns:

  • Dimensionality information
  • Distribution statistics
  • Quality metrics

Example:

result = vector_stats(
    table_name="documents",
    vector_column="embedding"
)
# Returns: {
#   "total_vectors": 10000,
#   "dimensions": 384,
#   "avg_norm": 1.02,
#   "std_norm": 0.15,
#   "null_count": 0,
#   "avg_sparsity": 0.0,
#   "recommendations": [
#     "Vectors are well-normalized",
#     "No null embeddings detected"
#   ]
# }
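
The norm statistics in that report can be reproduced client-side. A minimal sketch over an in-memory list (illustrative only):

```python
import math
import statistics

def vector_norm_stats(vectors):
    # L2 norm of each vector, then mean and population standard deviation.
    # Well-normalized embeddings have avg_norm close to 1.0 and small std_norm.
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return {
        "total_vectors": len(vectors),
        "dimensions": len(vectors[0]) if vectors else 0,
        "avg_norm": statistics.fmean(norms),
        "std_norm": statistics.pstdev(norms),
    }
```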

Use Cases:

  • Data quality checks
  • Embedding validation
  • Performance diagnostics

semantic_search

Natural language semantic search over text with embeddings.

Parameters:

  • table_name (string, required): Table with text and embeddings
  • text_column (string, required): Text column for display
  • vector_column (string, required): Embedding column
  • query_text (string, required): Natural language query
  • limit (integer, optional): Number of results

Returns:

  • Relevant text matches
  • Semantic similarity scores
  • Ranked results

Example:

result = semantic_search(
    table_name="articles",
    text_column="content",
    vector_column="embedding",
    query_text="How do I optimize PostgreSQL performance?",
    limit=5
)
# Returns: [
#   {
#     "id": 42,
#     "title": "PostgreSQL Performance Tuning",
#     "content": "...",
#     "similarity": 0.92
#   },
#   ...
# ]

Note: Query text is automatically embedded using the same model as the stored embeddings.

Use Cases:

  • Knowledge base search
  • FAQ systems
  • Content discovery
  • Semantic Q&A

embedding_insert

Efficiently insert vectors in batches.

Parameters:

  • table_name (string, required): Target table
  • vector_column (string, required): Vector column
  • vectors (list, required): List of vectors to insert
  • metadata (list, optional): Associated metadata

Returns:

  • Insertion status
  • Number of vectors inserted
  • Performance metrics

Example:

result = embedding_insert(
    table_name="documents",
    vector_column="embedding",
    vectors=[
        [0.1, 0.2, 0.3, ...],
        [0.15, 0.18, 0.32, ...],
        [0.12, 0.25, 0.28, ...]
    ],
    metadata=[
        {"title": "Doc 1", "source": "web"},
        {"title": "Doc 2", "source": "pdf"},
        {"title": "Doc 3", "source": "api"}
    ]
)
# Returns: {
#   "inserted": 3,
#   "failed": 0,
#   "duration_ms": 15.2
# }

Use Cases:

  • Bulk data ingestion
  • ETL pipelines
  • Data migration
  • Initial setup

🎯 Common Workflows

Document Similarity Search

# 1. Create vector index
vector_index_create(
    table_name="documents",
    vector_column="embedding",
    index_type="hnsw"
)

# 2. Search for similar documents
results = vector_search(
    table_name="documents",
    vector_column="embedding",
    query_vector=query_embedding,
    distance_metric="cosine",
    limit=10
)

# 3. Filter by similarity threshold
relevant = [r for r in results if r["similarity"] > 0.8]

Semantic Search Pipeline

# 1. Insert embeddings
embedding_insert(
    table_name="articles",
    vector_column="embedding",
    vectors=embeddings,
    metadata=article_metadata
)

# 2. Create optimized index
vector_index_create(
    table_name="articles",
    vector_column="embedding",
    index_type="hnsw",
    parameters={"m": 16, "ef_construction": 64}
)

# 3. Perform semantic search
results = semantic_search(
    table_name="articles",
    text_column="content",
    vector_column="embedding",
    query_text="machine learning tutorials",
    limit=5
)

Clustering and Analysis

# 1. Check vector statistics
stats = vector_stats(
    table_name="documents",
    vector_column="embedding"
)

# 2. Cluster vectors
clusters = vector_cluster(
    table_name="documents",
    vector_column="embedding",
    num_clusters=10
)

# 3. Analyze clusters
for cluster in clusters["clusters"]:
    # Find representative documents from each cluster
    representative = vector_search(
        table_name="documents",
        vector_column="embedding",
        query_vector=cluster["centroid"],
        limit=3
    )

🚀 Best Practices

1. Index Selection

# Small to medium datasets (< 10M vectors): HNSW
vector_index_create(
    table_name="documents",
    vector_column="embedding",
    index_type="hnsw",
    parameters={"m": 16, "ef_construction": 64}
)

# Very large datasets (> 10M vectors): IVFFlat
vector_index_create(
    table_name="documents",
    vector_column="embedding",
    index_type="ivfflat",
    parameters={"lists": 1000}
)

2. Normalize Vectors

-- Ensure vectors are unit length so cosine similarity and inner
-- product agree (l2_normalize requires pgvector 0.7.0+)
UPDATE documents
SET embedding = l2_normalize(embedding);
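
Normalization can also be done client-side before calling embedding_insert. A minimal Python sketch (not part of the tool API):

```python
import math

def l2_normalize(vector):
    # Scale a vector to unit length; with unit vectors, cosine similarity
    # and inner product give identical rankings. Zero vectors pass through.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]
```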

3. Monitor Performance

# Regular optimization
vector_index_optimize(
    table_name="documents",
    index_name="documents_embedding_idx",
    operation="analyze"
)

# Check vector quality
stats = vector_stats(
    table_name="documents",
    vector_column="embedding"
)

See Home for more tool categories.