# 05 - LanceDB Vector Search & AI Workflows

This notebook demonstrates how to use LanceDB for vector storage, similarity search, and AI/ML workflows in your lakehouse environment.

## What you'll learn:
- How to connect to the LanceDB service
- Vector storage and retrieval operations
- Similarity search and semantic matching
- Integration with embeddings and AI models
- Building AI-powered search applications
- Working with different vector types (documents, images, etc.)

In [None]:
# Install required packages
import subprocess
import sys

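def install_package(package, import_name=None):
    """Install a package with pip if it cannot be imported.

    import_name covers packages whose module name differs from the pip name
    (for example, scikit-learn is imported as sklearn).
    """
    module = import_name or package.split('[')[0]
    try:
        __import__(module)
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Install the HTTP client and analysis packages used in this notebook
install_package('requests')
install_package('numpy')
install_package('pandas')
install_package('scikit-learn', import_name='sklearn')  # For sample embeddings
install_package('matplotlib')
install_package('seaborn')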

print("✅ All packages installed successfully!")

In [None]:
import requests
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("📊 Libraries loaded successfully!")

## 1. Connect to LanceDB Service

The lakehouse stack includes a LanceDB REST API service for vector operations.
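
Containers in a compose stack rarely become ready at the same moment, so the first health request can fail even when nothing is wrong. A minimal sketch of polling with exponential backoff follows; `wait_until_healthy`, its `check` callable, and the delay values are illustrative assumptions, not part of the LanceDB service or of this notebook's helpers.

```python
import time

def wait_until_healthy(check, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out.

    check: zero-argument callable that returns True when the service is
           healthy (hypothetical; it could wrap a request to /health).
    Returns the number of attempts used, or None if the service never came up.
    """
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return attempt
        except Exception as exc:
            print(f"Attempt {attempt} failed: {exc}")
        if attempt < attempts:
            # Double the wait after every failed attempt
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

In this notebook, `check` could be something like `lambda: requests.get(f'{LANCEDB_URL}/health', timeout=5).status_code == 200`, once `LANCEDB_URL` is defined.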

In [None]:
# LanceDB service connection
LANCEDB_URL = 'http://lancedb:8000'  # Container-to-container connection

# Test connection and get service info
try:
    response = requests.get(f'{LANCEDB_URL}/health', timeout=5)
    if response.status_code == 200:
        health_info = response.json()
        print("✅ Connected to LanceDB service!")
        print(f"Status: {health_info['status']}")
        print(f"Available tables: {health_info['tables']}")
        print(f"Data directory: {health_info['data_directory']}")
    else:
        print(f"❌ LanceDB service responded with status {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"❌ Could not connect to LanceDB service: {e}")
    print("💡 Make sure the lakehouse stack is running with: docker compose up -d")

# Get general service information
try:
    response = requests.get(f'{LANCEDB_URL}/', timeout=5)
    if response.status_code == 200:
        service_info = response.json()
        print(f"\n🔧 Service Info:")
        print(f"Version: {service_info['version']}")
        print(f"Available endpoints: {list(service_info['endpoints'].keys())}")
except Exception as e:
    # Service info is optional, so report the problem and continue
    print(f"ℹ️  Could not read service info: {e}")

## 2. Explore Available Tables

Let's see what vector tables are already available and examine their structure.

In [None]:
# List all available tables
def get_tables():
    """Get list of all available LanceDB tables"""
    try:
        response = requests.get(f'{LANCEDB_URL}/tables')
        if response.status_code == 200:
            return response.json()['tables']
        else:
            print(f"Error getting tables: {response.status_code}")
            return []
    except Exception as e:
        print(f"Error: {e}")
        return []

# Get table information
def get_table_info(table_name):
    """Get detailed information about a specific table"""
    try:
        response = requests.get(f'{LANCEDB_URL}/tables/{table_name}/info')
        if response.status_code == 200:
            return response.json()
        else:
            print(f"Error getting table info: {response.status_code}")
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None

# Explore all tables
tables = get_tables()
print(f"📊 Found {len(tables)} tables:")
print("="*50)

for table in tables:
    print(f"\n🗂️  Table: {table['name']}")
    print(f"   Records: {table['count']:,}")
    
    # Get detailed info
    info = get_table_info(table['name'])
    if info:
        print(f"   Columns: {info['columns']}")
        print(f"   Version: {info['version']}")
        
        # Show sample data
        if info['sample_data']:
            sample = info['sample_data'][0]
            print(f"   Sample record keys: {list(sample.keys())}")
            if 'text' in sample:
                print(f"   Sample text: '{sample['text'][:60]}...'")
            if 'vector' in sample:
                print(f"   Vector dimension: {len(sample['vector'])}")

## 3. Vector Search Operations

Let's perform similarity searches on the existing vector data.
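
Conceptually, a similarity search scores every stored vector against the query and returns the best matches. The sketch below does this by brute force on toy in-memory data; the names `cosine`, `top_k`, and `rows` are illustrative, and a vector database would normally avoid the full scan by using an index.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, rows, k=3):
    """Return the k rows most similar to query as (score, row_id) pairs, best first.

    rows: list of (row_id, vector) pairs, a toy stand-in for a table.
    """
    scored = [(cosine(query, vec), row_id) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy data: three 3-dimensional "embeddings"
rows = [
    ("doc-a", [1.0, 0.0, 0.0]),
    ("doc-b", [0.8, 0.6, 0.0]),
    ("doc-c", [0.0, 0.0, 1.0]),
]
for score, row_id in top_k([1.0, 0.1, 0.0], rows, k=2):
    print(f"{row_id}: {score:.3f}")
```

The zero-vector guard matters in practice: a query that shares no features with the stored data has no direction, so its cosine similarity is undefined and is reported here as 0.0.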

In [None]:
# Function to search vectors
def search_vectors(table_name, query_vector=None, limit=5):
    """Search for similar vectors in a table"""
    try:
        payload = {
            'vector': query_vector,
            'limit': limit
        }
        
        response = requests.post(f'{LANCEDB_URL}/tables/{table_name}/search', json=payload)
        if response.status_code == 200:
            return response.json()
        else:
            print(f"Search error: {response.status_code}")
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None

# Let's search the document embeddings table
print("🔍 Searching document embeddings...")

# First, let's get some sample data without a query vector
doc_results = search_vectors('document_embeddings', query_vector=None, limit=3)

if doc_results:
    print(f"\n📝 Found {doc_results['count']} sample documents:")
    print("-" * 60)
    
    for i, doc in enumerate(doc_results['results'][:3], 1):
        print(f"\n{i}. Document ID: {doc['id']}")
        print(f"   Category: {doc['category']}")
        print(f"   Text: {doc['text'][:100]}...")
        print(f"   Vector dims: {len(doc['vector'])}")

In [None]:
# Now let's do a real similarity search
# We'll use one of the existing vectors as a query

if doc_results and doc_results['results']:
    # Use the first document's vector as our query
    query_vector = doc_results['results'][0]['vector']
    original_text = doc_results['results'][0]['text']
    
    print(f"🎯 Searching for documents similar to:")
    print(f"'{original_text[:80]}...'")
    print()
    
    # Search for similar documents
    similar_docs = search_vectors('document_embeddings', query_vector=query_vector, limit=5)
    
    if similar_docs:
        print(f"🔍 Found {similar_docs['count']} similar documents:")
        print("=" * 70)
        
        for i, doc in enumerate(similar_docs['results'], 1):
            print(f"\n{i}. Distance: {doc.get('_distance', 'N/A')} (lower is closer)")
            print(f"   Document ID: {doc['id']}")
            print(f"   Category: {doc['category']}")
            print(f"   Text: {doc['text'][:120]}...")
            
            # Calculate manual cosine similarity for demonstration
            if 'vector' in doc:
                similarity = cosine_similarity([query_vector], [doc['vector']])[0][0]
                print(f"   Cosine Similarity: {similarity:.4f}")

## 4. Working with Image Embeddings

Let's explore the image embeddings table. The same `search_vectors` helper performs a visual similarity search when it is given an image vector as the query.

In [None]:
# Explore image embeddings
print("🖼️  Exploring image embeddings...")

image_results = search_vectors('image_embeddings', query_vector=None, limit=5)

if image_results:
    print(f"\n📊 Found {image_results['count']} sample images:")
    print("-" * 60)
    
    image_data = []
    
    for i, img in enumerate(image_results['results'], 1):
        print(f"\n{i}. Image: {img['filename']}")
        print(f"   Tags: {', '.join(img['tags'])}")
        print(f"   Vector dims: {len(img['vector'])}")
        
        # Parse metadata
        metadata = json.loads(img['metadata']) if isinstance(img['metadata'], str) else img['metadata']
        print(f"   Dimensions: {metadata.get('width', 'N/A')}x{metadata.get('height', 'N/A')}")
        
        image_data.append({
            'id': img['id'],
            'filename': img['filename'],
            'tags': ', '.join(img['tags']),
            'vector_dim': len(img['vector']),
            'width': metadata.get('width', 0),
            'height': metadata.get('height', 0)
        })
    
    # Create a summary DataFrame
    if image_data:
        df_images = pd.DataFrame(image_data)
        print(f"\n📈 Image Collection Summary:")
        print(df_images.to_string(index=False))

## 5. Creating Custom Embeddings

Let's create our own embeddings and add them to a new table.
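
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` documents as its default. The tokenization is deliberately simplified (lowercase, whitespace-separated), and `tfidf_vectors` is an illustrative name, so results will not match the library exactly.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalization.

    Simplification: tokens are lowercase whitespace-separated words; real
    vectorizers also handle punctuation and stop words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vocab, vectors

docs = ["spark processes big data", "minio stores data", "airflow schedules pipelines"]
vocab, vecs = tfidf_vectors(docs)
print(vocab)
```

Terms that appear in many documents (here, "data") receive a lower weight than terms unique to one document, which is why rare words dominate the similarity scores.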

In [None]:
# Sample lakehouse-related documents for embedding
sample_texts = [
    "Data lakehouse architecture combines the best of data warehouses and data lakes",
    "Apache Spark provides distributed computing for big data processing",
    "MinIO offers S3-compatible object storage for cloud-native applications",
    "PostgreSQL is a powerful relational database for analytics workloads", 
    "Apache Airflow orchestrates complex data workflows and ETL pipelines",
    "Jupyter notebooks enable interactive data science and exploration",
    "Apache Superset provides modern business intelligence dashboards",
    "Vector databases enable semantic search and AI-powered applications",
    "Machine learning models require feature stores and model registries",
    "Real-time streaming data processing using Apache Kafka and Spark",
    "Data quality monitoring and observability in modern data stacks",
    "Cloud-native data platforms with Kubernetes orchestration"
]

print(f"📝 Created {len(sample_texts)} sample documents")

# Create TF-IDF embeddings (simple example - in production you'd use more sophisticated embeddings)
vectorizer = TfidfVectorizer(max_features=128, stop_words='english')
tfidf_vectors = vectorizer.fit_transform(sample_texts)

# Convert to dense numpy arrays
embeddings = tfidf_vectors.toarray()

print(f"✅ Generated {embeddings.shape[0]} embeddings with {embeddings.shape[1]} dimensions")
print(f"Feature words: {vectorizer.get_feature_names_out()[:10]}...")

# Prepare data for LanceDB
lakehouse_docs = []
for i, (text, embedding) in enumerate(zip(sample_texts, embeddings)):
    lakehouse_docs.append({
        'id': i + 100,  # Start with ID 100 to avoid conflicts
        'text': text,
        'vector': embedding.tolist(),
        'category': 'lakehouse-guide',
        'created_at': datetime.now().isoformat(),
        'source': 'notebook-generated'
    })

print(f"📋 Prepared {len(lakehouse_docs)} documents for insertion")

In [None]:
# Create a new table for our custom embeddings
def create_table(table_name, table_type='custom'):
    """Create a new LanceDB table"""
    try:
        payload = {'type': table_type}
        response = requests.post(f'{LANCEDB_URL}/tables/{table_name}/create', json=payload)
        
        if response.status_code == 200:
            return response.json()
        else:
            result = response.json()
            if 'already exists' in result.get('message', ''):
                print(f"ℹ️  Table '{table_name}' already exists")
                return result
            else:
                print(f"Error creating table: {response.status_code}")
                return None
    except Exception as e:
        print(f"Error: {e}")
        return None

# Function to insert data into a table
def insert_data(table_name, records):
    """Insert records into a LanceDB table"""
    try:
        payload = {'records': records}
        response = requests.post(f'{LANCEDB_URL}/tables/{table_name}/insert', json=payload)
        
        if response.status_code == 200:
            return response.json()
        else:
            print(f"Insert error: {response.status_code}")
            print(response.text)
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None

# Create table for lakehouse documentation
table_name = 'lakehouse_docs'
print(f"🗂️  Creating table: {table_name}")

create_result = create_table(table_name)
if create_result:
    print(f"✅ Table '{table_name}' is ready")
    
    # Insert our custom embeddings
    print(f"📥 Inserting {len(lakehouse_docs)} documents...")
    
    insert_result = insert_data(table_name, lakehouse_docs)
    if insert_result:
        print(f"✅ Inserted {insert_result['inserted_count']} records")
        print(f"   Total records in table: {insert_result['total_count']}")
    else:
        print("❌ Failed to insert data")
else:
    print("❌ Failed to create table")

## 6. Semantic Search Demo

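Now let's search our custom lakehouse documentation. Because the embeddings here are TF-IDF vectors, matches reflect shared vocabulary rather than meaning; a trained embedding model is needed for truly semantic matching.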
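
Search results carry a `_distance` field, while the code in this notebook reports cosine similarity. The two are related when vectors have unit length, which holds for TF-IDF vectors under scikit-learn's default L2 normalization: squared Euclidean distance d² and cosine similarity satisfy cos = 1 - d²/2. Whether the service's `_distance` is a squared L2 distance is an assumption here, so check which metric the service is configured with. A small numerical check, with purely illustrative helper names:

```python
import math

def normalize(v):
    """Scale v to unit L2 length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    """Valid only when both vectors have unit length: cos = 1 - d^2 / 2."""
    return 1.0 - d2 / 2.0

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
dot = sum(x * y for x, y in zip(a, b))  # equals the cosine for unit vectors
recovered = cosine_from_squared_l2(squared_l2(a, b))
print(f"dot={dot:.4f} recovered={recovered:.4f}")
```

If the vectors are not normalized, the identity no longer holds and the distance cannot be converted without knowing both norms.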

In [None]:
# Function to create query embeddings
def create_query_embedding(query_text):
    """Create embedding for a query text"""
    # Transform the query using our trained vectorizer
    query_vector = vectorizer.transform([query_text])
    return query_vector.toarray()[0].tolist()

# Function to perform semantic search
def semantic_search(query_text, table_name='lakehouse_docs', limit=3):
    """Perform semantic search for a query"""
    print(f"🔍 Searching for: '{query_text}'")
    print("-" * 50)
    
    # Create query embedding
    query_embedding = create_query_embedding(query_text)
    
    # Search the database
    results = search_vectors(table_name, query_vector=query_embedding, limit=limit)
    
    if results and results['results']:
        print(f"Found {len(results['results'])} relevant documents:\n")
        
        for i, doc in enumerate(results['results'], 1):
            # Calculate similarity score
            similarity = cosine_similarity([query_embedding], [doc['vector']])[0][0]
            
            print(f"{i}. Similarity: {similarity:.4f}")
            print(f"   Text: {doc['text']}")
            print(f"   Category: {doc['category']}")
            print()
    else:
        print("No results found")
    
    return results

# Test different semantic searches
search_queries = [
    "distributed computing and big data",
    "object storage for cloud applications", 
    "workflow orchestration and ETL",
    "interactive data analysis notebooks",
    "AI and machine learning infrastructure"
]

print("🎯 Performing semantic searches on lakehouse documentation\n")
print("=" * 70)

for query in search_queries[:2]:  # Test first 2 queries
    semantic_search(query, limit=2)
    print("=" * 70)

## 7. Vector Analytics and Visualization

Let's analyze and visualize our vector data.

In [None]:
# Get all documents from our custom table
all_docs = search_vectors('lakehouse_docs', query_vector=None, limit=50)

if all_docs and all_docs['results']:
    # Extract vectors and metadata
    vectors = np.array([doc['vector'] for doc in all_docs['results']])
    texts = [doc['text'][:30] + '...' for doc in all_docs['results']]
    
    print(f"📊 Analyzing {len(vectors)} document vectors")
    print(f"Vector dimension: {vectors.shape[1]}")
    
    # Calculate pairwise similarities
    similarity_matrix = cosine_similarity(vectors)
    
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Similarity heatmap
    sns.heatmap(similarity_matrix, 
                annot=False, 
                cmap='viridis', 
                ax=axes[0,0],
                cbar_kws={'label': 'Cosine Similarity'})
    axes[0,0].set_title('Document Similarity Matrix')
    axes[0,0].set_xlabel('Document Index')
    axes[0,0].set_ylabel('Document Index')
    
    # 2. Vector magnitude distribution
    vector_norms = np.linalg.norm(vectors, axis=1)
    axes[0,1].hist(vector_norms, bins=15, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0,1].set_title('Distribution of Vector Magnitudes')
    axes[0,1].set_xlabel('L2 Norm')
    axes[0,1].set_ylabel('Frequency')
    
    # 3. Average similarity by document
    avg_similarities = np.mean(similarity_matrix, axis=1)
    doc_indices = range(len(avg_similarities))
    axes[1,0].bar(doc_indices, avg_similarities, alpha=0.7, color='coral')
    axes[1,0].set_title('Average Similarity Score by Document')
    axes[1,0].set_xlabel('Document Index')
    axes[1,0].set_ylabel('Average Similarity')
    
    # 4. Feature importance (top TF-IDF terms)
    feature_names = vectorizer.get_feature_names_out()
    avg_tfidf_scores = np.mean(vectors, axis=0)
    top_features_idx = np.argsort(avg_tfidf_scores)[-10:]
    top_features = [feature_names[i] for i in top_features_idx]
    top_scores = avg_tfidf_scores[top_features_idx]
    
    axes[1,1].barh(top_features, top_scores, alpha=0.7, color='lightgreen')
    axes[1,1].set_title('Top 10 Important Terms (Avg TF-IDF)')
    axes[1,1].set_xlabel('Average TF-IDF Score')
    
    plt.tight_layout()
    plt.show()
    
    # Print some statistics (exclude the diagonal, where each document matches itself)
    off_diagonal = similarity_matrix[~np.eye(len(vectors), dtype=bool)]
    print(f"\n📈 Vector Analytics:")
    print(f"   Average similarity: {np.mean(off_diagonal):.4f}")
    print(f"   Max similarity: {np.max(off_diagonal):.4f}")
    print(f"   Min similarity: {np.min(off_diagonal):.4f}")
    print(f"   Vector norm range: {np.min(vector_norms):.4f} - {np.max(vector_norms):.4f}")
else:
    print("No vector data available for analysis")

## 8. Advanced Use Cases

### A. Document Clustering
Group similar documents together using vector similarity.
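
K-means, used in the cell below, alternates two steps: assign each vector to its nearest centroid, then move each centroid to the mean of its members. The following pure-Python sketch shows that loop on toy 2-D points. It seeds the centroids with the first `k` points purely to stay deterministic, so the input order affects the result; library implementations use better seeding, and the function name and data are illustrative.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with the first k points as initial centroids.

    Returns (labels, centroids). The naive seeding keeps the sketch
    reproducible but can converge to a poor local optimum.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

points = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0]]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```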

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

if 'vectors' in locals() and len(vectors) > 5:
    # Perform K-means clustering
    n_clusters = min(4, len(vectors))
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(vectors)
    
    # Reduce dimensionality for visualization
    pca = PCA(n_components=2, random_state=42)
    vectors_2d = pca.fit_transform(vectors)
    
    # Visualize clusters
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], 
                         c=clusters, cmap='tab10', alpha=0.7, s=100)
    
    # Add labels
    for i, txt in enumerate(texts):
        plt.annotate(f"{i}: {txt[:20]}...", 
                    (vectors_2d[i, 0], vectors_2d[i, 1]), 
                    xytext=(5, 5), textcoords='offset points', 
                    fontsize=8, alpha=0.7)
    
    plt.colorbar(scatter, label='Cluster')
    plt.title('Document Clustering in 2D PCA Space')
    plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
    plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Print cluster analysis
    print(f"📊 Cluster Analysis ({n_clusters} clusters):")
    for i in range(n_clusters):
        cluster_docs = [texts[j] for j in range(len(texts)) if clusters[j] == i]
        print(f"\nCluster {i} ({len(cluster_docs)} documents):")
        for doc in cluster_docs[:3]:  # Show first 3 docs per cluster
            print(f"  • {doc}")

else:
    print("Need more vector data for clustering analysis")

### B. Recommendation System
Build a simple content recommendation system.

In [None]:
def recommend_similar_content(query_text, exclude_exact_match=True, top_k=3):
    """
    Recommend similar content based on a query
    """
    print(f"🎯 Content recommendations for: '{query_text}'")
    print("=" * 60)
    
    # Get recommendations from our vector database
    recommendations = semantic_search(query_text, limit=top_k + 1)
    
    if recommendations and recommendations['results']:
        print("\n💡 You might also be interested in:")
        
        filtered_results = recommendations['results']
        if exclude_exact_match:
            # Remove exact matches
            filtered_results = [r for r in filtered_results if r['text'].lower() != query_text.lower()]
        
        # Embed the query once and reuse it for every recommendation
        query_embed = create_query_embedding(query_text)
        for i, rec in enumerate(filtered_results[:top_k], 1):
            similarity = cosine_similarity([query_embed], [rec['vector']])[0][0]
            
            print(f"\n{i}. {rec['text']}")
            print(f"   📊 Relevance: {similarity:.3f}")
            print(f"   🏷️  Category: {rec['category']}")
    
    return recommendations

# Test the recommendation system
test_queries = [
    "I need help with data processing",
    "How do I store large datasets",
    "Building dashboards for analytics"
]

for query in test_queries:
    recommend_similar_content(query, top_k=2)
    print("\n" + "="*60 + "\n")

## 9. Integration with Lakehouse Ecosystem

### Connecting Vector Search with Other Services

In [None]:
# Example: Create a comprehensive search function that could integrate with other services
def lakehouse_intelligent_search(user_query, search_types=('vector', 'sql')):
    """
    Intelligent search across multiple lakehouse components
    """
    print(f"🔍 Lakehouse Intelligence Search: '{user_query}'")
    print("=" * 70)
    
    results = {}
    
    # 1. Vector-based semantic search
    if 'vector' in search_types:
        print("\n📊 Vector Search Results:")
        vector_results = semantic_search(user_query, limit=2)
        results['vector'] = vector_results
    
    # 2. Could integrate with SQL search (example structure)
    if 'sql' in search_types:
        print("\n🗃️  Database Search Suggestions:")
        # In a real implementation, this would query PostgreSQL
        sql_suggestions = [
            "SELECT * FROM orders WHERE description ILIKE '%data%'",
            "SELECT service_name, count(*) FROM metrics GROUP BY service_name"
        ]
        
        for i, suggestion in enumerate(sql_suggestions, 1):
            print(f"   {i}. {suggestion}")
        results['sql_suggestions'] = sql_suggestions
    
    # 3. Could integrate with MinIO object search
    print("\n📁 Object Storage Suggestions:")
    storage_suggestions = [
        "s3://lakehouse/raw-data/sample_orders.csv",
        "s3://processed-data/analytics/monthly_reports/"
    ]
    
    for i, obj in enumerate(storage_suggestions, 1):
        print(f"   {i}. {obj}")
    results['storage'] = storage_suggestions
    
    return results

# Test the intelligent search
search_result = lakehouse_intelligent_search("data processing workflows")

print("\n\n💡 This sketches how LanceDB could be integrated with:")
print("   • PostgreSQL for structured data queries")
print("   • MinIO for object storage search")
print("   • Airflow for workflow recommendations")
print("   • Jupyter for notebook suggestions")
print("   • Superset for dashboard templates")

## 10. Performance and Monitoring

Monitor your vector operations and optimize performance.
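
With only a handful of timings, the mean is easily skewed by a cold first request. A hedged sketch of a small timing helper that discards warm-up calls and reports the median and a nearest-rank 95th percentile is shown below; `benchmark` and its parameters are illustrative, and in this notebook `fn` could wrap a call to `search_vectors`.

```python
import math
import statistics
import time

def benchmark(fn, runs=20, warmup=2):
    """Time fn() with a monotonic clock and summarize latency in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not measured
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }

stats = benchmark(lambda: sum(range(10_000)), runs=10)
print(stats)
```

Reporting the median alongside a high percentile separates typical latency from occasional slow requests, which a single average hides.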

In [None]:
import time

# Performance testing function
def benchmark_vector_search(table_name='document_embeddings', num_searches=5):
    """
    Benchmark vector search performance
    """
    print(f"🚀 Benchmarking LanceDB performance ({num_searches} searches)")
    print("-" * 50)
    
    # Get a sample vector for searching
    sample_data = search_vectors(table_name, limit=1)
    if not sample_data or not sample_data['results']:
        print("No data available for benchmarking")
        return
    
    query_vector = sample_data['results'][0]['vector']
    
    # Perform benchmark searches
    search_times = []
    
    for i in range(num_searches):
        start_time = time.perf_counter()  # monotonic, high-resolution clock
        
        # Perform search
        results = search_vectors(table_name, query_vector=query_vector, limit=5)
        
        end_time = time.perf_counter()
        search_time = end_time - start_time
        search_times.append(search_time)
        
        if results:
            print(f"Search {i+1}: {search_time:.3f}s ({results['count']} results)")
        else:
            print(f"Search {i+1}: {search_time:.3f}s (failed)")
    
    # Calculate statistics
    if search_times:
        avg_time = np.mean(search_times)
        min_time = np.min(search_times)
        max_time = np.max(search_times)
        
        print(f"\n📊 Performance Summary:")
        print(f"   Average search time: {avg_time:.3f}s")
        print(f"   Fastest search: {min_time:.3f}s")
        print(f"   Slowest search: {max_time:.3f}s")
        print(f"   Searches per second: {1/avg_time:.1f}")
        
        # Visualize performance
        plt.figure(figsize=(10, 4))
        plt.plot(range(1, len(search_times) + 1), search_times, 'bo-', alpha=0.7)
        plt.axhline(y=avg_time, color='r', linestyle='--', alpha=0.7, label=f'Average: {avg_time:.3f}s')
        plt.title('Vector Search Performance')
        plt.xlabel('Search Number')
        plt.ylabel('Response Time (seconds)')
        plt.grid(True, alpha=0.3)
        plt.legend()
        plt.show()

# Run benchmark
benchmark_vector_search()

## 11. Best Practices & Next Steps

### 🎯 **Production Tips:**

1. **Vector Quality**: Use high-quality embeddings (OpenAI, Sentence-BERT, etc.)
2. **Indexing**: Build an approximate nearest-neighbor (ANN) index for large tables; LanceDB supports IVF-PQ indexes natively, and FAISS or Annoy are standalone alternatives
3. **Caching**: Cache frequently accessed vectors
4. **Monitoring**: Track search performance and accuracy
5. **Backup**: Regularly back up your vector data
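
The caching tip can be sketched with the standard library's `functools.lru_cache`, assuming query texts repeat and the embedding call is the expensive step. The `embed` function below is a hypothetical stand-in for a real embedding model, and `CALLS` exists only to show that the second lookup skips it.

```python
from functools import lru_cache

CALLS = {"embed": 0}

def embed(text):
    """Stand-in for an expensive embedding call (hypothetical)."""
    CALLS["embed"] += 1
    return [float(len(word)) for word in text.split()]

@lru_cache(maxsize=1024)
def cached_embedding(text):
    # Return a tuple: cached values are shared between callers, so keep them immutable
    return tuple(embed(text))

cached_embedding("vector search")
cached_embedding("vector search")  # served from the cache
print(CALLS["embed"], cached_embedding.cache_info().hits)
```

Convert the tuple back with `list(...)` before sending it as JSON. If the embedding model changes, call `cached_embedding.cache_clear()` so stale vectors are not reused.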

### 🚀 **Advanced Use Cases:**

1. **RAG Systems**: Retrieval-Augmented Generation with LLMs
2. **Recommendation Engines**: Product/content recommendations
3. **Image Search**: Visual similarity search for images
4. **Anomaly Detection**: Find outliers in vector space
5. **Clustering**: Group similar items automatically

### 🔗 **Integration Opportunities:**

- **Airflow**: Automate embedding generation pipelines
- **Superset**: Create vector search dashboards
- **PostgreSQL**: Hybrid vector + relational queries
- **MinIO**: Store and version vector models
- **Jupyter**: Interactive vector analysis workflows

In [None]:
# Summary and service information
print("🎉 LanceDB Vector Search Tutorial Complete!")
print("=" * 50)
print(f"\n🔗 **Access your LanceDB service:**")
print(f"   • API: http://lancedb:8000 (from notebook containers)")
print(f"   • External API: http://localhost:9080 (from host)")
print(f"   • Documentation: http://localhost:9080/docs")
print(f"   • Health check: http://localhost:9080/health")

print(f"\n📊 **What we've covered:**")
print(f"   ✅ Connected to LanceDB service")
print(f"   ✅ Explored existing vector tables")
print(f"   ✅ Performed similarity searches")
print(f"   ✅ Created custom embeddings")
print(f"   ✅ Built semantic search functionality")
print(f"   ✅ Analyzed vector data with visualizations")
print(f"   ✅ Implemented clustering and recommendations")
print(f"   ✅ Benchmarked performance")

print(f"\n🎯 **Next steps:**")
print(f"   1. Explore the LanceDB API docs at /docs endpoint")
print(f"   2. Integrate vector search with your applications")
print(f"   3. Try other notebook tutorials in the lakehouse lab")
print(f"   4. Build production-ready embedding pipelines")

print(f"\n---")
print(f"🏠 **Lakehouse Lab** - AI-Powered Analytics Platform")