# Visualizing LLM Embeddings

This notebook demonstrates how to visualize embeddings from indexed LinkML-Store collections using dimensionality reduction techniques like UMAP and t-SNE.

## Setup

First, let's import the libraries used throughout the notebook.

In [13]:
from linkml_store import Client
from linkml_store.index.implementations.llm_indexer import LLMIndexer
from linkml_store.utils.embedding_utils import (
    extract_embeddings_from_collection,
    extract_embeddings_from_multiple_collections,
    compute_embedding_statistics
)
from linkml_store.plotting.dimensionality_reduction import (
    reduce_dimensions,
    get_optimal_parameters
)
from linkml_store.plotting.embedding_plot import (
    plot_embeddings,
    plot_embeddings_comparison,
    EmbeddingPlotConfig
)
import numpy as np

## Creating Sample Data

Let's create two collections with different types of documents and index them with LLM embeddings.

In [14]:
# Create a client and database
client = Client()
db = client.get_database("duckdb:///:memory:")

# Create first collection: Scientific papers
papers = db.get_collection("papers")
papers_data = [
    {
        "id": "paper1",
        "title": "Deep Learning for Natural Language Processing",
        "category": "AI",
        "year": 2023,
        "abstract": "This paper explores recent advances in deep learning models for NLP tasks."
    },
    {
        "id": "paper2",
        "title": "Quantum Computing Applications in Cryptography",
        "category": "Quantum",
        "year": 2023,
        "abstract": "We investigate the implications of quantum computing for modern cryptographic systems."
    },
    {
        "id": "paper3",
        "title": "Climate Change Impact on Marine Ecosystems",
        "category": "Climate",
        "year": 2024,
        "abstract": "Analysis of temperature changes affecting coral reefs and marine biodiversity."
    },
    {
        "id": "paper4",
        "title": "Transformer Architectures for Computer Vision",
        "category": "AI",
        "year": 2024,
        "abstract": "Adapting transformer models from NLP to solve computer vision problems."
    },
    {
        "id": "paper5",
        "title": "Sustainable Energy Solutions for Urban Areas",
        "category": "Climate",
        "year": 2023,
        "abstract": "Exploring renewable energy integration in modern city infrastructure."
    }
]
papers.insert(papers_data)

# Create second collection: News articles
news = db.get_collection("news")
news_data = [
    {
        "id": "news1",
        "headline": "Tech Giant Announces New AI Assistant",
        "topic": "Technology",
        "sentiment": "positive",
        "content": "Major technology company reveals advanced AI assistant with multimodal capabilities."
    },
    {
        "id": "news2",
        "headline": "Stock Market Reaches Record High",
        "topic": "Finance",
        "sentiment": "positive",
        "content": "Markets surge as investors show confidence in economic recovery."
    },
    {
        "id": "news3",
        "headline": "New Climate Agreement Signed by Nations",
        "topic": "Environment",
        "sentiment": "neutral",
        "content": "Countries commit to reducing emissions by 50% over the next decade."
    },
    {
        "id": "news4",
        "headline": "Healthcare Breakthrough in Cancer Treatment",
        "topic": "Health",
        "sentiment": "positive",
        "content": "Researchers develop new immunotherapy showing promising results."
    },
    {
        "id": "news5",
        "headline": "Cybersecurity Threats on the Rise",
        "topic": "Technology",
        "sentiment": "negative",
        "content": "Experts warn of increasing sophisticated attacks targeting infrastructure."
    }
]
news.insert(news_data)

print(f"Created {papers.find().num_rows} papers and {news.find().num_rows} news articles")

Created 5 papers and 5 news articles


## Indexing Collections

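Now let's index both collections. To keep the demo self-contained, the cell below uses `SimpleIndexer`. Switching to `LLMIndexer` (shown commented out at the end of the cell) requires an OpenAI API key or another configured LLM provider.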

In [15]:
# For demo purposes, we'll use a simple indexer
# In production, use LLMIndexer with proper API configuration
from linkml_store.index.implementations.simple_indexer import SimpleIndexer

# Index papers collection
papers_indexer = SimpleIndexer(
    name="semantic",
    text_template="{title} {abstract}",
    dimensions=100  # Requested size for the demo; the recorded output below reports 1000, so this may not take effect
)
papers.attach_indexer(papers_indexer)

# Index news collection
news_indexer = SimpleIndexer(
    name="semantic",
    text_template="{headline} {content}",
    dimensions=100  # Requested size for the demo; the recorded output below reports 1000, so this may not take effect
)
news.attach_indexer(news_indexer)

print("Collections indexed successfully")

# For real LLM indexing (requires API key):
# papers_indexer = LLMIndexer(
#     name="semantic",
#     text_template="{title} {abstract}",
#     cached_embeddings_database="embeddings_cache.db"
# )

Collections indexed successfully
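
To make the role of `text_template` concrete, the sketch below renders one record into the text that would be passed to the indexer. It assumes the template uses Python `str.format`-style placeholders, which matches the `{title} {abstract}` syntax above but is not confirmed by this notebook. `render_text` is a hypothetical helper, not part of LinkML-Store.

```python
def render_text(template: str, record: dict) -> str:
    """Fill a str.format-style template from a record (hypothetical helper)."""
    # Keys not named in the template (such as "id") are simply ignored
    return template.format(**record)


paper = {
    "id": "paper1",
    "title": "Deep Learning for Natural Language Processing",
    "abstract": "This paper explores recent advances in deep learning models for NLP tasks.",
}

text = render_text("{title} {abstract}", paper)
print(text)
```

Only the fields named in the template contribute to the embedding, so `category` and `year` do not influence where a paper lands in the plots that follow.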


## Extracting Embeddings

Let's extract embeddings from both collections.

In [16]:
# Extract embeddings from multiple collections
embedding_data = extract_embeddings_from_multiple_collections(
    database=db,
    collection_names=["papers", "news"],
    index_name="semantic",
    include_metadata=True,
    normalize=True
)

print(f"Extracted {embedding_data.n_samples} embeddings")
print(f"Embedding dimensions: {embedding_data.n_dimensions}")

# Compute statistics
stats = compute_embedding_statistics(embedding_data)
print("\nEmbedding statistics:")
for key, value in stats.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.3f}")
    else:
        print(f"  {key}: {value}")

Extracted 10 embeddings
Embedding dimensions: 1000

Embedding statistics:
  n_samples: 10
  n_dimensions: 1000
  n_collections: 2
  collections: ['papers', 'news']
  samples_per_collection: {'papers': 5, 'news': 5}
  mean_norm: 1.000
  std_norm: 0.000
  mean_similarity: 0.189
  std_similarity: 0.053
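
The statistics above can be read as follows: with `normalize=True` every vector has unit length, which is why `mean_norm` is 1.000 and `std_norm` is 0.000. The sketch below reproduces that kind of summary with NumPy on random stand-in vectors. It assumes the similarity figures are pairwise cosine similarities with self-pairs excluded. The exact definition used by `compute_embedding_statistics` is not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 1000))  # stand-in for 10 embeddings, not real data

# L2-normalize each row so every vector has unit length
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
norms = np.linalg.norm(unit, axis=1)

# For unit vectors, cosine similarity is just the dot product
sim = unit @ unit.T
# Take each unordered pair once and skip the diagonal (self-similarity)
pairs = sim[np.triu_indices(len(unit), k=1)]

print(f"mean_norm: {norms.mean():.3f}")
print(f"std_norm: {norms.std():.3f}")
print(f"mean_similarity: {pairs.mean():.3f}")
print(f"std_similarity: {pairs.std():.3f}")
```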


## Dimensionality Reduction

Now let's reduce the high-dimensional embeddings to 2D for visualization.

In [17]:
# Try different reduction methods
methods = ["pca", "tsne", "umap"]
reductions = {}

for method in methods:
    try:
        # Get optimal parameters for the method
        params = get_optimal_parameters(method, embedding_data.n_samples)
        
        # Perform reduction
        reduction = reduce_dimensions(
            embedding_data.vectors,
            method=method,
            random_state=42,
            **params
        )
        reductions[method] = reduction
        
        print(f"{method.upper()}: Reduced to {reduction.n_components} dimensions")
        if reduction.explained_variance:
            print(f"  Explained variance: {reduction.explained_variance:.2%}")
    except ImportError as e:
        print(f"{method.upper()}: Skipped - {e}")
    except Exception as e:
        print(f"{method.upper()}: Error - {e}")

PCA: Reduced to 2 dimensions
  Explained variance: 27.68%



'n_iter' was renamed to 'max_iter' in version 1.5 and will be removed in 1.7.



TSNE: Reduced to 2 dimensions
UMAP: Skipped - umap-learn is required for UMAP. Install with: pip install umap-learn
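
For intuition about the PCA line in the output, the sketch below performs a plain PCA with NumPy's SVD and reports the share of variance kept by the first two components. It is a textbook formulation on random stand-in data, not the internals of `reduce_dimensions`, and `pca_2d` is a hypothetical helper.

```python
import numpy as np


def pca_2d(vectors: np.ndarray):
    """Project rows onto the top two principal components (hypothetical helper)."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions; s holds the singular values
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Squared singular values are proportional to the variance per component
    variance = s ** 2
    explained = variance[:2].sum() / variance.sum()
    return coords, explained


rng = np.random.default_rng(42)
sample = rng.normal(size=(10, 50))  # stand-in data: 10 samples, 50 dimensions
coords, explained = pca_2d(sample)
print(coords.shape, f"{explained:.2%}")
```

A low explained-variance figure, such as the 27.68% reported above, means the 2D picture discards most of the variation in the original vectors, so distances in the plot should be read with care.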


## Basic Visualization

Create a basic plot in which each collection gets its own color and marker shape.

In [18]:
print(f"Available reduction methods: {list(reductions.keys())}")

# Use the first available reduction method
if reductions and len(reductions) > 0:
    method_name, reduction = next(iter(reductions.items()))

    # Create plot configuration
    config = EmbeddingPlotConfig(
        color_field="collection",  # Color by collection
        shape_field="collection",  # Different shapes for each collection
        hover_fields=["id", "title", "headline", "category", "topic"],
        title=f"Document Embeddings ({method_name.upper()})",
        width=800,
        height=600,
        point_size=10,
        opacity=0.8
    )

    # Create the plot
    fig = plot_embeddings(
        embedding_data=embedding_data,
        reduction_result=reduction,
        config=config
    )

    # Display the plot
    # fig.show()  # Commented for papermill testing
else:
    print("No reduction methods available. Please install scikit-learn or umap-learn.")

Available reduction methods: ['pca', 'tsne']


## Advanced Visualization: Color by Metadata

Let's create a more sophisticated visualization where we color points by their category/topic.

In [19]:
print(f"Available reduction methods: {list(reductions.keys())}")

# Add a unified category field for coloring
for i, meta in enumerate(embedding_data.metadata):
    if "category" in meta:
        meta["unified_category"] = meta["category"]
    elif "topic" in meta:
        meta["unified_category"] = meta["topic"]
    else:
        meta["unified_category"] = "Unknown"

if reductions and len(reductions) > 0:
    # Create configuration with category coloring
    config_advanced = EmbeddingPlotConfig(
        color_field="unified_category",  # Color by category/topic
        shape_field="collection",        # Shape by collection type
        hover_fields=["id", "title", "headline", "unified_category", "year", "sentiment"],
        title="Document Embeddings by Category",
        width=900,
        height=700,
        point_size=12,
        opacity=0.7,
        color_discrete_map={
            "AI": "#FF6B6B",
            "Climate": "#4ECDC4",
            "Quantum": "#45B7D1",
            "Technology": "#96CEB4",
            "Finance": "#FECA57",
            "Environment": "#48C9B0",
            "Health": "#BB8FCE"
        }
    )

    fig_advanced = plot_embeddings(
        embedding_data=embedding_data,
        reduction_result=reduction,
        config=config_advanced
    )

    # fig_advanced.show()  # Commented for papermill testing

Available reduction methods: ['pca', 'tsne']
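
The unified field exists because the two collections use different property names (`category` for papers, `topic` for news). The same fallback logic can be written as a small reusable function. The helper below is a hypothetical sketch that works on plain dictionaries and does not depend on LinkML-Store.

```python
def unified_category(meta: dict, keys=("category", "topic"), default="Unknown") -> str:
    """Return the value of the first key in `keys` present in `meta`, else `default`."""
    for key in keys:
        if key in meta:
            return meta[key]
    return default


records = [
    {"id": "paper1", "category": "AI"},
    {"id": "news2", "topic": "Finance"},
    {"id": "other"},
]
labels = [unified_category(r) for r in records]
print(labels)  # ['AI', 'Finance', 'Unknown']
```

Passing a different `keys` tuple extends the same idea to collections with other property names.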


## Comparing Different Reduction Methods

If multiple reduction methods are available, let's compare them side by side.

In [20]:
if False and len(reductions) > 1:  # Comparison is switched off here; remove "False and" to enable it
    # Prepare datasets for comparison
    comparison_data = {
        method: (embedding_data, reduction)
        for method, reduction in reductions.items()
    }
    
    # Create comparison plot
    comparison_config = EmbeddingPlotConfig(
        color_field="collection",
        title="Comparison of Dimensionality Reduction Methods",
        width=800,
        height=600,
        point_size=8
    )
    
    fig_comparison = plot_embeddings_comparison(
        embedding_datasets=comparison_data,
        config=comparison_config
    )
    
    # fig_comparison.show() # Commented for papermill testing
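else:
    print(f"Comparison skipped: {len(reductions)} reduction(s) available; the plot needs at least 2 and the guard above to be enabled.")

Comparison skipped: 2 reduction(s) available; the plot needs at least 2 and the guard above to be enabled.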


## Using Real Data from a Database

Here's how you would use this with your actual bervo.ddb database:

In [9]:
# Example code for your actual database (remove the triple quotes to use)
"""
# Connect to your database
client = Client("duckdb:///~/databases/bervo.ddb")
db = client.get_database()

# List available collections
collections = db.list_collection_names()
print(f"Available collections: {collections}")

# Extract embeddings from your collections
embedding_data = extract_embeddings_from_multiple_collections(
    database=db,
    collection_names=collections[:2],  # Use first two collections
    index_name="llm",  # or whatever index name you used
    limit_per_collection=1000,  # Limit for performance
    normalize=True
)

# Perform UMAP reduction
reduction = reduce_dimensions(
    embedding_data.vectors,
    method="umap",
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)

# Create visualization
config = EmbeddingPlotConfig(
    color_field="your_property",  # Replace with actual property
    shape_field="collection",
    hover_fields=["id", "name", "description"],  # Adjust to your schema
    title="Bervo Database Embeddings",
    width=1000,
    height=800
)

fig = plot_embeddings(
    embedding_data=embedding_data,
    reduction_result=reduction,
    config=config,
    output_file="bervo_embeddings.html"
)

# fig.show()  # Commented for papermill testing
"""


## CLI Usage

You can also use the command-line interface to generate these plots:

```bash
# Basic usage
linkml-store -d ~/databases/bervo.ddb plot-embeddings \
  -c collection1,collection2 \
  --method umap \
  -o embeddings.html

# Advanced usage with custom parameters
linkml-store -d ~/databases/bervo.ddb plot-embeddings \
  -c collection1,collection2 \
  --method umap \
  --color-field category \
  --shape-field collection \
  --hover-fields id,name,description \
  --n-neighbors 30 \
  --min-dist 0.05 \
  --width 1200 \
  --height 900 \
  --dark-mode \
  -o embeddings_advanced.html

# Using t-SNE instead
linkml-store -d ~/databases/bervo.ddb plot-embeddings \
  -c collection1,collection2 \
  --method tsne \
  --perplexity 50 \
  --limit-per-collection 500 \
  -o embeddings_tsne.html
```
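
When the same plot is needed for several methods, the commands can be generated from Python instead of being typed by hand. The sketch below only assembles argument lists from the flags shown above and prints them. `plot_command` is a hypothetical helper, and the `subprocess` call is left commented out so nothing runs unless you enable it.

```python
import os
import shlex


def plot_command(db_path: str, collections: list[str], method: str, output: str) -> list[str]:
    """Build a plot-embeddings command from the flags shown above (hypothetical helper)."""
    return [
        "linkml-store", "-d", db_path, "plot-embeddings",
        "-c", ",".join(collections),
        "--method", method,
        "-o", output,
    ]


# Expand "~" here because no shell is involved when passing a list to subprocess
db_path = os.path.expanduser("~/databases/bervo.ddb")
commands = [
    plot_command(db_path, ["collection1", "collection2"], method, f"embeddings_{method}.html")
    for method in ("umap", "tsne")
]

for cmd in commands:
    print(shlex.join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to execute
```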

## Adding Clustering

Let's add clustering to identify groups in the embedding space.

In [10]:
if reductions:
    try:
        from sklearn.cluster import KMeans
        from linkml_store.plotting.embedding_plot import plot_embedding_clusters
        
        # Perform clustering on the reduced dimensions
        n_clusters = 3
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        cluster_labels = kmeans.fit_predict(reduction.coordinates)
        
        # Create cluster visualization
        cluster_config = EmbeddingPlotConfig(
            shape_field="collection",
            hover_fields=["id", "title", "headline", "unified_category"],
            title=f"Document Clusters (K-Means, k={n_clusters})",
            width=900,
            height=700
        )
        
        fig_clusters = plot_embedding_clusters(
            embedding_data=embedding_data,
            reduction_result=reduction,
            cluster_labels=cluster_labels,
            config=cluster_config
        )
        
        # fig_clusters.show()  # Commented for papermill testing
        
    except ImportError:
        print("scikit-learn required for clustering. Install with: pip install scikit-learn")
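
To show what the clustering step does to the 2D coordinates, here is a minimal NumPy sketch of the standard k-means iteration (assign each point to its nearest center, then move each center to the mean of its points). It is an illustration on synthetic points, not scikit-learn's implementation, which adds smarter initialization among other refinements.

```python
import numpy as np


def kmeans(points: np.ndarray, k: int, n_iter: int = 50, seed: int = 42):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0, 0), scale=0.1, size=(5, 2))  # synthetic group near the origin
blob_b = rng.normal(loc=(5, 5), scale=0.1, size=(5, 2))  # synthetic group near (5, 5)
labels, centers = kmeans(np.vstack([blob_a, blob_b]), k=2)
print(labels)
```

Because the cell above clusters the reduced coordinates rather than the original embeddings, the groups it finds reflect the 2D projection and can differ from clusters computed in the full embedding space.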

## Saving and Exporting Results

Finally, let's save our results for later use.

In [11]:
# Save the plot to HTML
if reductions and 'fig' in locals():
    output_file = "embedding_visualization.html"
    fig.write_html(output_file)
    print(f"Plot saved to {output_file}")
    
    # Export data for external analysis
    import pandas as pd
    
    # Create DataFrame with coordinates and metadata
    export_data = {
        "x": reduction.coordinates[:, 0],
        "y": reduction.coordinates[:, 1],
        "collection": embedding_data.collection_names,
        "id": embedding_data.object_ids
    }
    
    # Add metadata fields
    for key in ["title", "headline", "category", "topic"]:
        values = embedding_data.get_metadata_values(key)
        if any(v is not None for v in values):
            export_data[key] = values
    
    df = pd.DataFrame(export_data)
    df.to_csv("embeddings_data.csv", index=False)
    print("Data exported to embeddings_data.csv")
    
    # Show summary
    print("\nData summary:")
    print(df.describe())
    print("\nCollection distribution:")
    print(df['collection'].value_counts())
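
The exported CSV can be analyzed without LinkML-Store or pandas. As a sketch, the function below reads a file with the `x`, `y`, and `collection` columns written above and averages the coordinates per collection. `collection_centroids` is a hypothetical helper, and an in-memory sample with made-up coordinates stands in for `embeddings_data.csv`.

```python
import csv
import io
from collections import defaultdict


def collection_centroids(csv_file) -> dict:
    """Average the x/y coordinates per collection from an exported CSV."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running x total, y total, row count
    for row in csv.DictReader(csv_file):
        acc = sums[row["collection"]]
        acc[0] += float(row["x"])
        acc[1] += float(row["y"])
        acc[2] += 1
    return {name: (sx / n, sy / n) for name, (sx, sy, n) in sums.items()}


# Stand-in for open("embeddings_data.csv", newline=""); the values are illustrative
sample = io.StringIO(
    "x,y,collection,id\n"
    "0.0,1.0,papers,paper1\n"
    "2.0,3.0,papers,paper2\n"
    "-1.0,0.5,news,news1\n"
)
centroids = collection_centroids(sample)
print(centroids)  # {'papers': (1.0, 2.0), 'news': (-1.0, 0.5)}
```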

## Summary

This notebook demonstrated how to:

1. Extract embeddings from indexed LinkML-Store collections
2. Apply dimensionality reduction (PCA, t-SNE, UMAP)
3. Create interactive visualizations with different encoding schemes
4. Compare multiple collections and reduction methods
5. Add clustering to identify groups
6. Export results for further analysis

The embedding visualization tools help you:
- Understand the structure of your indexed data
- Identify clusters and patterns
- Compare different collections
- Debug indexing and search issues
- Explore semantic relationships in your data