[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/13_Vector_Store.ipynb)

# Vector Store - Comprehensive Guide

## Overview

This notebook walks through Semantica's `vector_store` module: vector storage, similarity search, hybrid search, and multi-backend support for semantic retrieval.

**Documentation**: [API Reference](https://semantica.readthedocs.io/reference/vector_store/)

### Learning Objectives

By the end of this notebook, you will be able to:

- Store and manage vectors with metadata
- Perform similarity search with different metrics
- Use hybrid search combining vectors and metadata
- Work with multiple vector store backends (FAISS, Weaviate, etc.)
- Create and manage vector indices
- Filter and rank search results
- Implement namespace isolation for multi-tenancy

### What You'll Learn

| Component | Purpose | When to Use |
|-----------|---------|-------------|
| `VectorStore` | Main vector storage | All vector operations |
| `VectorIndexer` | Index creation | Performance optimization |
| `VectorRetriever` | Similarity search | Finding similar vectors |
| `HybridSearch` | Combined search | Vector + metadata filtering |
| `MetadataFilter` | Metadata filtering | Filtering by attributes |
| `MetadataStore` | Metadata management | Storing vector metadata |
| `NamespaceManager` | Multi-tenancy | Isolating vector collections |

---

## Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install semantica[all]
```

---

In [1]:
!pip install -q semantica




## Step 1: Basic Vector Storage

Let's start with the `VectorStore` for basic vector storage and retrieval.

### What is VectorStore?

`VectorStore` is the main interface for vector operations:
- **Storage**: Store vectors with metadata
- **Search**: Find similar vectors
- **CRUD**: Create, Read, Update, Delete operations
- **Multi-backend**: Support for FAISS, Weaviate, Qdrant, Milvus

In [2]:
from semantica.vector_store import VectorStore
from semantica.embeddings import TextEmbedder
import numpy as np

# 1. Initialize Embedder (Select Provider & Model)
# You can choose 'sentence_transformers' or 'fastembed'
embedder = TextEmbedder(method="sentence_transformers", model_name="all-MiniLM-L6-v2")
dimension = embedder.get_embedding_dimension()

# 2. Create vector store
store = VectorStore(backend="faiss", dimension=dimension)

# 3. Generate Real Embeddings
texts = [f"Document {i}" for i in range(100)]
vectors = embedder.embed_batch(texts)

metadata = [
    {"text": txt, "category": "science" if i % 2 == 0 else "technology", "year": 2020 + (i % 4)}
    for i, txt in enumerate(texts)
]
# 4. Store vectors
vector_ids = store.store_vectors(vectors, metadata=metadata)

print(f"Stored {len(vector_ids)} vectors")
print(f"First 3 IDs: {vector_ids[:3]}")

| Status | Action | Module | Submodule | File | Time |
|--------|--------|--------|-----------|------|------|
| ❌ | Semantica is indexing | 📊 vector_store | VectorStore | - | 0.01s |
| ✅ | Semantica is indexing | 📊 vector_store | FAISSStore | - | 0.01s |
| ✅ | Semantica is indexing | 📊 vector_store | HybridSearch | - | 0.01s |
| ✅ | Semantica is indexing | 📊 vector_store | MetadataStore | - | 0.01s |
| ✅ | Semantica is indexing | 📊 vector_store | NamespaceManager | - | 0.00s |


Stored 100 vectors
First 3 IDs: ['vec_0', 'vec_2', 'vec_4']


## Step 2: Similarity Search

Search for similar vectors using different similarity metrics.

### Similarity Metrics

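- **Cosine Similarity**: Compares direction and ignores magnitude; the usual choice for semantic similarity (higher is closer)
- **L2 Distance**: Euclidean distance (lower is closer)
- **Dot Product**: Fast; ranks results the same way as cosine similarity only when vectors are normalized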
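
To make the three metrics concrete, here is a minimal pure-Python sketch that computes each one on two toy vectors. The helper names (`dot`, `l2_distance`, `cosine_similarity`, `normalize`) are illustrative and are not part of Semantica's API. It also checks two relationships that are useful when reading search output: for unit-length vectors the dot product equals the cosine similarity, and the squared L2 distance equals `2 - 2 * cosine`. FAISS's L2 metric reports squared distances, so this identity may help when comparing distances with similarity scores.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean (L2) distance; lower means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; higher means closer."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

print("cosine:", cosine_similarity(a, b))
print("L2:", l2_distance(a, b))
print("dot (raw):", dot(a, b))

# After normalization, the dot product matches the cosine similarity
a_unit, b_unit = normalize(a), normalize(b)
print("dot (normalized):", dot(a_unit, b_unit))

# For unit vectors: squared L2 distance == 2 - 2 * cosine
print("squared L2 (normalized):", l2_distance(a_unit, b_unit) ** 2)
```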

In [3]:
from semantica.vector_store import VectorIndexer, FAISSStore
import numpy as np

# Use the first stored vector as a sample query
if 'query_vector' not in locals():
    if 'vectors' in locals() and len(vectors) > 0:
        query_vector = vectors[0]
    else:
        # Fallback: random query if the stored vectors are unavailable
        query_vector = np.random.rand(dimension).astype('float32')

# Create indexer
indexer = VectorIndexer(backend="faiss", dimension=dimension)

# Create HNSW index for fast approximate search
adapter = FAISSStore(dimension=dimension)
index = adapter.create_index(index_type="hnsw", metric="L2", m=16)

# Add vectors to the index (the adapter writes to its internal index)
vectors_array = np.array(vectors).astype('float32')
adapter.add_vectors(vectors_array, ids=vector_ids)

# Search by calling the index object directly, not the adapter
query_array = np.array(query_vector).astype('float32')
distances, indices = index.search(query_array.reshape(1, -1), k=10)

print(f"Index search found {len(indices[0])} results")
print(f"Distances: {distances[0][:5]}")

Index search found 10 results
Distances: [0.         0.59663105 0.6359793  0.66132385 0.6909249 ]


## Step 3: Vector Indexing

Create indices for faster search on large datasets.

### Index Types (FAISS)

- **Flat**: Exact search (brute force)
- **IVF**: Inverted file index (approximate)
- **HNSW**: Hierarchical graph (best balance)
- **PQ**: Product quantization (compressed)
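
As a point of reference for the approximate index types, the following is a minimal sketch of what a Flat index does conceptually: it compares the query against every stored vector and keeps the `k` closest. The function name `flat_search` and the toy data are hypothetical and not part of Semantica or FAISS; real libraries run this scan with optimized native code.

```python
import heapq

def flat_search(query, vectors, k):
    """Exact k-nearest-neighbour search by squared L2 distance (brute force)."""
    scored = (
        (sum((q - x) ** 2 for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # nsmallest returns (distance, index) pairs, closest first
    return heapq.nsmallest(k, scored)

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
nearest = flat_search([0.9, 0.1], toy_vectors, k=2)

for distance, idx in nearest:
    print(f"index={idx}, squared_l2={distance:.3f}")
```

Because every vector is visited, the cost grows linearly with the collection size, which is the trade-off that IVF, HNSW, and PQ are designed to reduce at the price of approximate results.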

In [4]:
from semantica.vector_store import VectorIndexer, FAISSStore

# Get dimension from vectors to ensure consistency
dimension = len(vectors[0]) if len(vectors) > 0 else 384

# Create indexer
indexer = VectorIndexer(backend="faiss", dimension=dimension)

# Create HNSW index for fast approximate search
adapter = FAISSStore(dimension=dimension)
index = adapter.create_index(index_type="hnsw", metric="L2", m=16)

# Add vectors to index
vectors_array = np.array(vectors).astype('float32')
adapter.add_vectors(vectors_array, ids=vector_ids)

# Search using index
query_array = np.asarray(query_vector, dtype='float32')
distances, indices = index.search(query_array.reshape(1, -1), k=10)

print(f"Index search found {len(indices[0])} results")
print(f"Distances: {distances[0][:5]}")

Index search found 10 results
Distances: [0.         0.59663105 0.6359793  0.66132385 0.6909249 ]


## Step 4: Hybrid Search

Combine vector similarity with metadata filtering.

### Hybrid Search Benefits

- Filter by metadata before vector search
- Combine multiple search criteria
- More precise results
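
The pre-filtering idea can be sketched without any library: drop the records whose metadata fails the condition, then rank only the survivors by similarity. Everything below (`hybrid_search_sketch`, the `predicate` argument, the toy data) is an illustrative assumption and does not describe how Semantica's `HybridSearch` is implemented internally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hybrid_search_sketch(query, vectors, metadata, ids, predicate, k=10):
    """Filter by metadata first, then rank the remaining vectors by cosine."""
    scored = [
        {"id": vid, "score": cosine(query, vec), "metadata": meta}
        for vid, vec, meta in zip(ids, vectors, metadata)
        if predicate(meta)  # records failing the filter are never scored
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:k]

toy_ids = ["a", "b", "c", "d"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_metadata = [
    {"category": "science", "year": 2020},
    {"category": "technology", "year": 2023},
    {"category": "science", "year": 2022},
    {"category": "science", "year": 2023},
]

hits = hybrid_search_sketch(
    [1.0, 0.0], toy_vectors, toy_metadata, toy_ids,
    predicate=lambda m: m["category"] == "science" and m["year"] > 2021,
    k=2,
)
print([(h["id"], round(h["score"], 3)) for h in hits])
```

With a correctly applied filter, every returned record satisfies the condition, even when a closer vector (here `"a"` or `"b"`) exists outside the filtered set.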

In [5]:
from semantica.vector_store import HybridSearch, MetadataFilter 
import numpy as np

# Create hybrid search 
hybrid_search = HybridSearch() 

# Create metadata filter (named to avoid shadowing Python's built-in `filter`)
metadata_filter = (
    MetadataFilter()
    .eq("category", "science")
    .gt("year", 2021)
)

# Perform hybrid search
# Ensure vectors and metadata are available
if 'vectors' not in locals() or 'metadata' not in locals() or 'vector_ids' not in locals():
    print("Warning: vectors, metadata, or vector_ids are missing. Please run previous cells.")
else:
    hybrid_results = hybrid_search.search(
        query_vector,
        vectors,
        metadata,
        vector_ids,
        filter=metadata_filter,
        k=10
    )
    
    print(f"Hybrid search found {len(hybrid_results)} results") 
    print("\nFiltered results (science, year > 2021):") 
    for i, result in enumerate(hybrid_results[:5], 1): 
        meta = result.get('metadata', {}) 
        print(f"{i}. Category: {meta.get('category')}, Year: {meta.get('year')}, Score: {result['score']:.3f}")

Hybrid search found 10 results

Filtered results (science, year > 2021):
1. Category: science, Year: 2020, Score: 1.000
2. Category: technology, Year: 2021, Score: 0.702
3. Category: science, Year: 2022, Score: 0.682
4. Category: technology, Year: 2023, Score: 0.669
5. Category: science, Year: 2022, Score: 0.655


## Step 5: Metadata Management

Store and query metadata separately from vectors.

### Metadata Operations

- Store metadata for vectors
- Query by metadata conditions
- Update metadata
- Schema validation
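
Schema validation boils down to checking each record for required fields and expected types. The sketch below shows that idea with a plain dictionary schema; `validate_metadata` is a hypothetical helper written for illustration, and it returns a list of problems rather than mirroring the return value of Semantica's `MetadataSchema.validate`.

```python
def validate_metadata(record, schema):
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    for field, rule in schema.items():
        if field not in record:
            if rule.get("required", False):
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], rule["type"]):
            errors.append(
                f"{field}: expected {rule['type'].__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors

toy_schema = {
    "text": {"type": str, "required": True},
    "category": {"type": str, "required": True},
    "year": {"type": int, "required": True},
}

good = {"text": "Document 0", "category": "science", "year": 2020}
bad = {"text": "Document 1", "year": "2021"}  # missing category, year is a string

print(validate_metadata(good, toy_schema))
print(validate_metadata(bad, toy_schema))
```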

In [6]:
from semantica.vector_store import MetadataStore, MetadataSchema

# Create metadata store
meta_store = MetadataStore()

# Store metadata
for i, vec_id in enumerate(vector_ids[:10]):
    meta_store.store_metadata(vec_id, metadata[i])

# Query metadata
matching_ids = meta_store.query_metadata(
    {"category": "science"},
    operator="AND"
)

print(f"Found {len(matching_ids)} vectors with category='science'")

# Define schema for validation
schema = MetadataSchema({
    "text": {"type": str, "required": True},
    "category": {"type": str, "required": True},
    "year": {"type": int, "required": True}
})

# Validate metadata
is_valid = schema.validate(metadata[0])
print(f"\nMetadata validation: {is_valid}")

Found 5 vectors with category='science'

Metadata validation: True


## Step 6: Result Ranking and Fusion

Combine and rank results from multiple searches.

### Ranking Strategies

- **Reciprocal Rank Fusion (RRF)**: Combine ranked lists
- **Weighted Average**: Weight scores from different sources
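
Reciprocal Rank Fusion ignores the raw scores and uses only each item's position: an item at rank `r` in a list contributes `1 / (k + r)`, and contributions are summed across lists. The following is a minimal sketch of that formula, assuming ranks start at 1 and each input list is already sorted best-first; the function name `reciprocal_rank_fusion` is illustrative and is not claimed to match `SearchRanker` internals.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over all lists."""
    fused = defaultdict(float)
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            fused[item["id"]] += 1.0 / (k + rank)
    return sorted(
        ({"id": item_id, "score": score} for item_id, score in fused.items()),
        key=lambda r: r["score"],
        reverse=True,
    )

list_a = [{"id": "vec_1"}, {"id": "vec_2"}, {"id": "vec_3"}]
list_b = [{"id": "vec_2"}, {"id": "vec_4"}, {"id": "vec_1"}]

fused = reciprocal_rank_fusion([list_a, list_b], k=60)
for row in fused:
    print(row["id"], round(row["score"], 3))
```

Items that appear in both lists (`vec_1`, `vec_2`) accumulate two contributions and rise to the top, while a larger `k` flattens the difference between adjacent ranks.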

In [7]:
from semantica.vector_store import SearchRanker

# Create ranker with RRF strategy
ranker = SearchRanker(strategy="reciprocal_rank_fusion")

# Simulate multiple search results
results1 = [
    {"id": "vec_1", "score": 0.9},
    {"id": "vec_2", "score": 0.8},
    {"id": "vec_3", "score": 0.7}
]

results2 = [
    {"id": "vec_2", "score": 0.85},
    {"id": "vec_4", "score": 0.75},
    {"id": "vec_1", "score": 0.7}
]

# Fuse results using RRF
fused_results = ranker.rank([results1, results2], k=60)

print("Fused results using RRF:")
for i, result in enumerate(fused_results, 1):
    print(f"{i}. ID: {result['id']}, Fused Score: {result['score']:.3f}")

Fused results using RRF:
1. ID: vec_2, Fused Score: 0.033
2. ID: vec_1, Fused Score: 0.032
3. ID: vec_4, Fused Score: 0.016
4. ID: vec_3, Fused Score: 0.016


## Step 7: Namespace Management

Isolate vectors for multi-tenant applications.

### Namespace Features

- Tenant isolation
- Access control
- Per-namespace operations
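
Conceptually, a namespace is a named bucket of vector IDs plus a table of who may do what inside it. The toy class below sketches that idea in plain Python; `NamespaceSketch` and its methods are hypothetical and only illustrate the concept, not Semantica's `NamespaceManager` implementation.

```python
class NamespaceSketch:
    """A named collection of vector IDs with per-user permissions."""

    def __init__(self, name):
        self.name = name
        self.vector_ids = set()
        self._permissions = {}  # user -> set of allowed actions

    def grant(self, user, permissions):
        self._permissions[user] = set(permissions)

    def has_permission(self, user, permission):
        return permission in self._permissions.get(user, set())

    def add_vector(self, user, vector_id):
        # Writes are rejected unless the user holds the "write" permission
        if not self.has_permission(user, "write"):
            raise PermissionError(f"{user} cannot write to {self.name}")
        self.vector_ids.add(vector_id)

tenant1 = NamespaceSketch("tenant1")
tenant2 = NamespaceSketch("tenant2")
tenant1.grant("user1", ["read", "write"])
tenant1.grant("user2", ["read"])

tenant1.add_vector("user1", "t1_vec_0")
print(tenant1.vector_ids, tenant2.vector_ids)

try:
    tenant1.add_vector("user2", "t1_vec_1")
except PermissionError as exc:
    print("Rejected:", exc)
```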

In [8]:
from semantica.vector_store import NamespaceManager

# Create namespace manager
ns_manager = NamespaceManager()

# Create namespaces for different tenants
ns1 = ns_manager.create_namespace("tenant1", "Tenant 1 vectors")
ns2 = ns_manager.create_namespace("tenant2", "Tenant 2 vectors")

# Add vectors to namespaces
for i in range(5):
    ns_manager.add_vector_to_namespace(f"t1_vec_{i}", "tenant1")
    ns_manager.add_vector_to_namespace(f"t2_vec_{i}", "tenant2")

# Get namespace vectors
tenant1_vectors = ns_manager.get_namespace_vectors("tenant1")
tenant2_vectors = ns_manager.get_namespace_vectors("tenant2")

print(f"Tenant 1: {len(tenant1_vectors)} vectors")
print(f"Tenant 2: {len(tenant2_vectors)} vectors")

# Set access control
ns1.set_access_control("user1", ["read", "write"])
ns1.set_access_control("user2", ["read"])

print(f"\nUser1 can write: {ns1.has_permission('user1', 'write')}")
print(f"User2 can write: {ns1.has_permission('user2', 'write')}")

Tenant 1: 5 vectors
Tenant 2: 5 vectors

User1 can write: True
User2 can write: False


## Step 8: Multi-Backend Support

Work with different vector store backends.

### Supported Backends

| Backend | Type | Best For |
|---------|------|----------|
| FAISS | Local | Development, small datasets |
| Weaviate | Self-hosted | Schema-aware storage |
| Qdrant | Self-hosted | High performance |
| Milvus | Cloud/Self-hosted | Large scale |

In [None]:
from semantica.vector_store import FAISSStore, VectorManager

# FAISS (local); FAISSStore is the same class used in the earlier cells
faiss_adapter = FAISSStore(dimension=768)
faiss_index = faiss_adapter.create_index(index_type="flat", metric="L2")
print("Created FAISS index")

# Vector Manager for multi-store management
manager = VectorManager()
faiss_store = manager.create_store("faiss", {"dimension": 768})
print("\nCreated store via manager")

# List all stores
stores = manager.list_stores()
print(f"Active stores: {stores}")

# Note: For cloud/remote backends (Weaviate, Qdrant, etc.),
# you would need API keys and endpoints
# Example:
# from semantica.vector_store import WeaviateAdapter
# weaviate = WeaviateAdapter(url="http://localhost:8080")

## Step 9: Best Practices

### Performance Tips

1. **Normalize Vectors**: Always normalize for cosine similarity
2. **Use HNSW**: Best balance for speed/accuracy
3. **Batch Operations**: Process in batches (100-1000)
4. **Filter First**: Apply metadata filters before vector search
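
Two of these tips, normalization and batching, are easy to sketch in plain Python. The helpers below (`normalize`, `batched`) are illustrative names rather than Semantica functions, and the batch size of 250 is just an example value inside the suggested 100-1000 range.

```python
import math

def normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

unit = normalize([3.0, 4.0])
print(unit)

documents = [f"Document {i}" for i in range(1000)]
batch_sizes = [len(batch) for batch in batched(documents, 250)]
print(batch_sizes)
```

In a real pipeline, each batch would be embedded and stored in one call instead of issuing one call per document.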

### Backend Selection

- **Development**: FAISS (local, fast)
- **Production**: Weaviate/Qdrant (scalable, self-hosted)
- **Self-hosted**: Qdrant or Milvus (control, performance)
- **Schema-aware**: Weaviate (rich metadata)

### Index Configuration

- **Small datasets (<10K)**: Flat index
- **Medium datasets (10K-1M)**: HNSW
- **Large datasets (>1M)**: IVF + PQ
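
These size thresholds can be captured in a small helper that picks an index type before the index is built. This is only a sketch of the rule of thumb above: `suggest_index_type` is a hypothetical function, and the returned label `"ivf_pq"` is an assumed name rather than a confirmed `index_type` value in Semantica.

```python
def suggest_index_type(num_vectors):
    """Map a collection size to the index type suggested by the guidelines."""
    if num_vectors < 10_000:
        return "flat"    # exact search is affordable at this scale
    if num_vectors <= 1_000_000:
        return "hnsw"    # graph-based approximate search
    return "ivf_pq"      # hypothetical label for IVF combined with PQ

for size in (500, 50_000, 5_000_000):
    print(size, "->", suggest_index_type(size))
```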

## Summary

### What You've Learned

In this notebook, you've learned how to:

- Store and search vectors with VectorStore
- Create indices for performance optimization
- Use hybrid search with metadata filtering
- Manage metadata separately from vectors
- Rank and fuse search results
- Implement namespace isolation
- Work with multiple backend adapters
- Apply best practices for production use

### Key Takeaways

1. **Multi-Backend**: Choose the right backend for your needs
2. **Hybrid Search**: Combine vectors with metadata for precision
3. **Indexing**: Use appropriate index types for performance
4. **Metadata**: Separate metadata management for flexibility
5. **Namespaces**: Isolate vectors for multi-tenancy

### Next Steps

**Further Reading**:
- [Vector Store API Reference](https://semantica.readthedocs.io/reference/vector_store/)
- [Advanced Vector Store Notebook](../advanced/Advanced_Vector_Store_and_Search.ipynb)
- [Embedding Generation](12_Embedding_Generation.ipynb)

---

**Questions or Issues?** Check out our [GitHub repository](https://github.com/Hawksight-AI/semantica) or [documentation](https://semantica.readthedocs.io).