![umap in atlas](https://docs.nomic.ai/img/umap-with-nomic-atlas.png)

In [12]:
from IPython.display import HTML
import requests

def play(url):
	response = requests.get(url)
	response.raise_for_status()
	html = f'<video width=1000 controls autoplay loop><source src="{url}" type="video/mp4"></video>'
	return HTML(html)

# UMAP with Nomic Atlas

UMAP is available as a projection method in Nomic Atlas. Atlas builds interactive maps of your data and adds AI analysis, vector search APIs, and features such as duplicate detection and topic label generation.

## Example 1: Visualizing text embeddings

In [13]:
play('https://assets.nomicatlas.com/airline-reviews-umap.mp4')

In [7]:
import pandas as pd

# Example data
df = pd.read_csv("https://docs.nomic.ai/singapore_airlines_reviews.csv")
df['id'] = df.index.astype(str)
df.head()

| Unnamed: 0 | published_date | published_platform | rating | type | text | title | helpful_votes | id |
|---|---|---|---|---|---|---|---|---|
| 0 | 2024-03-12T14:41:14-04:00 | Desktop | 1 | review | We used this airline to go from Singapore to L... | Ok | 0 | 0 |
| 1 | 2024-03-11T19:39:13-04:00 | Desktop | 2 | review | The service on Singapore Airlines Suites Class... | The service in Suites Class makes one feel lik... | 0 | 1 |
| 2 | 2024-03-11T12:20:23-04:00 | Desktop | 0 | review | Booked, paid and received email confirmation f... | Don’t give them your money | 0 | 2 |
| 3 | 2024-03-11T07:12:27-04:00 | Desktop | 2 | review | Best airline in the world, seats, food, servic... | Best Airline in the World | 0 | 3 |
| 4 | 2024-03-10T05:34:18-04:00 | Desktop | 0 | review | Premium Economy Seating on Singapore Airlines ... | Premium Economy Seating on Singapore Airlines ... | 0 | 4 |
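
Atlas identifies each row by the field passed as `unique_id_field` (here `id`, built from the DataFrame index). Before uploading, it can help to confirm that this column exists, has no missing values, and contains no duplicates. The helper below is a minimal sketch: `check_unique_ids` is a hypothetical name, not part of the `nomic` client, and the toy DataFrame stands in for the reviews data.

```python
import pandas as pd

def check_unique_ids(df: pd.DataFrame, id_field: str = "id") -> None:
    """Raise ValueError if `id_field` is absent, has missing values, or has duplicates."""
    if id_field not in df.columns:
        raise ValueError(f"column {id_field!r} not found")
    ids = df[id_field]
    if ids.isna().any():
        raise ValueError(f"column {id_field!r} contains missing values")
    duplicated = ids[ids.duplicated()].unique().tolist()
    if duplicated:
        raise ValueError(f"duplicate ids in {id_field!r}: {duplicated[:5]}")

# Toy stand-in for the reviews DataFrame (assumption: not the real data)
toy = pd.DataFrame({"text": ["great flight", "lost luggage", "ok"]})
toy["id"] = toy.index.astype(str)  # same pattern as the cell above
check_unique_ids(toy)              # passes silently when ids are valid
```

Deriving the id from the index is safe as long as the index itself is unique, which holds for a freshly read CSV with the default `RangeIndex`.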


### Upload to Nomic Atlas

In [6]:
from nomic import AtlasDataset
from nomic.data_inference import ProjectionOptions

dataset = AtlasDataset("example-dataset-airline-reviews", unique_id_field="id")

dataset.add_data(df)

atlas_map = dataset.create_index(
    indexed_field='text',
    projection=ProjectionOptions(
        model="umap",
        n_neighbors=20,
        min_dist=0.01,
        n_epochs=200
    )
)

2025-05-10 21:07:19.282 | INFO     | nomic.dataset:_create_project:867 - Organization name: `nomic`
2025-05-10 21:07:19.794 | INFO     | nomic.dataset:_create_project:895 - Creating dataset `example-dataset-airline-reviews`
100%|██████████| 2/2 [00:02<00:00,  1.46s/it]
2025-05-10 21:07:23.142 | INFO     | nomic.dataset:_add_data:1702 - Upload succeeded.
2025-05-10 21:07:24.594 | INFO     | nomic.dataset:create_index:1289 - Created map `0196bce1-e7c2-5b15-31cd-ddd20c4fb6f4` in dataset `nomic/example-dataset-airline-reviews`: https://atlas.nomic.ai/data/nomic/example-dataset-airline-reviews
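
For intuition about `n_neighbors`: UMAP begins by building a k-nearest-neighbor graph over the embeddings, and `n_neighbors` sets k, so smaller values emphasize local structure and larger values emphasize global structure. The sketch below computes such a neighbor list by brute force with NumPy on random stand-in vectors. It illustrates the idea only and is not how Atlas computes neighbors; `knn_indices` is a hypothetical helper.

```python
import numpy as np

def knn_indices(X: np.ndarray, n_neighbors: int) -> np.ndarray:
    """For each row of X, return the indices of its n_neighbors nearest other rows (Euclidean)."""
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :n_neighbors]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # stand-in for 100 embeddings of dimension 16
neighbors = knn_indices(X, n_neighbors=20)
print(neighbors.shape)  # (100, 20)
```

Brute force is quadratic in the number of points, so it only suits small samples; it is used here because it keeps the example short.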


## Example 2: Visualizing MNIST model training

In [14]:
play('https://assets.nomicatlas.com/umap-with-nomic-atlas.mp4')

In [15]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset
import numpy as np
from nomic import AtlasDataset
import time

# --- Hyperparameters ---
NUM_EPOCHS = 15
LEARNING_RATE = 3e-6
BATCH_SIZE = 128
NUM_VIS_SAMPLES = 3000
EMBEDDING_DIM = 128
ATLAS_DATASET_NAME = "mnist_training_embeddings"

# Determine device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}\n")

Using device: cpu



In [16]:

# --- 1. Define PyTorch Model ---
class MNIST_CNN(nn.Module):
    def __init__(self, embedding_dim=128):
        super(MNIST_CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2) # 28x28 -> 14x14
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2) # 14x14 -> 7x7
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(64 * 7 * 7, embedding_dim) # Embedding layer
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(embedding_dim, 10) # Output layer

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = self.flatten(x)
        embeddings = self.relu3(self.fc1(x))
        output = self.fc2(embeddings)
        return output, embeddings

# --- 2. Load MNIST Data ---
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# DataLoader for training
# Use worker processes only on CUDA; on MPS and CPU, load data in the main process
# (persistent workers are not well supported on MPS).
persistent_workers_flag = device.type not in ['mps', 'cpu']
num_workers_val = 2 if persistent_workers_flag else 0  # persistent_workers requires num_workers > 0

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=num_workers_val, persistent_workers=persistent_workers_flag)

# Create a subset of the test dataset for visualization
vis_indices = list(range(NUM_VIS_SAMPLES))
vis_subset = Subset(test_dataset, vis_indices)
test_loader_for_vis = DataLoader(vis_subset, batch_size=BATCH_SIZE, shuffle=False, num_workers=num_workers_val, persistent_workers=persistent_workers_flag)



print(f"Training on {len(train_dataset)} samples, visualizing {NUM_VIS_SAMPLES} test samples per epoch.\n")

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 404: Not Found

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 28680519.48it/s]


Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 404: Not Found

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 1419897.25it/s]

Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 404: Not Found

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 10652550.42it/s]


Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 404: Not Found

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 2080433.41it/s]

Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw

Training on 60000 samples, visualizing 3000 test samples per epoch.
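
The shape comments in `MNIST_CNN` (28x28 -> 14x14 -> 7x7, and `64 * 7 * 7` inputs to `fc1`) follow from the usual output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. A small pure-Python check of that arithmetic, where `out_size` is a hypothetical helper written for this walkthrough:

```python
def out_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

n = 28                                 # MNIST images are 28x28
n = out_size(n, kernel=3, padding=1)   # conv1 keeps 28
n = out_size(n, kernel=2, stride=2)    # pool1 -> 14
n = out_size(n, kernel=3, padding=1)   # conv2 keeps 14
n = out_size(n, kernel=2, stride=2)    # pool2 -> 7
flattened = 64 * n * n                 # 64 channels come out of conv2
print(n, flattened)  # 7 3136
```

If you change the kernel sizes, padding, or number of pooling layers, rerunning this arithmetic gives the new input size for `fc1`.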






In [19]:
import base64
import io
from PIL import Image

# --- 3. Initialize Model, Optimizer, Criterion ---
model = MNIST_CNN(embedding_dim=EMBEDDING_DIM).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# --- 4. Training Loop & Embedding Extraction ---
all_embeddings_list = []
all_metadata_list = []
all_images_html = []  # Store HTML representations of images

# Helper function to convert tensor to HTML image
def tensor_to_html(tensor):
    # Denormalize the image
    img = tensor.clone().detach().cpu().squeeze(0)
    img = img * 0.3081 + 0.1307  # Reverse the normalization
    img = torch.clamp(img, 0, 1)

    # Convert to an 8-bit grayscale PIL image
    img_pil = Image.fromarray((img.numpy() * 255).astype('uint8'), mode='L')

    # Encode the image as a base64 PNG so it can be embedded inline in HTML
    buffered = io.BytesIO()
    img_pil.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()

    return f'<img src="data:image/png;base64,{img_str}" width="28" height="28">'

overall_start_time = time.time()
for epoch in range(NUM_EPOCHS):
    epoch_start_time = time.time()
    model.train()
    running_loss = 0.0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        outputs, _ = model(data)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if (batch_idx + 1) % 200 == 0: # Print every 200 mini-batches
            print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], Batch [{batch_idx+1}/{len(train_loader)}], Avg Loss: {running_loss / 200:.4f}')
            running_loss = 0.0

    print(f"Epoch {epoch+1}/{NUM_EPOCHS} training finished in {time.time() - epoch_start_time:.2f}s.\n")

    # Extract embeddings for visualization subset
    model.eval()
    vis_samples_collected_this_epoch = 0
    image_offset_in_vis_subset = 0 # Tracks the index within the vis_subset (0 to NUM_VIS_SAMPLES-1)
    with torch.no_grad():
        for data, target in test_loader_for_vis:
            data, target = data.to(device), target.to(device)
            _, embeddings_batch = model(data)
            for i in range(embeddings_batch.size(0)):
                # original_idx_in_subset is the true index of this image within the NUM_VIS_SAMPLES selected for visualization
                original_idx_in_subset = image_offset_in_vis_subset + i 
                if original_idx_in_subset >= NUM_VIS_SAMPLES: # Should not happen if test_loader_for_vis is setup correctly
                    continue
                
                all_embeddings_list.append(embeddings_batch[i].cpu().numpy())
                
                # Generate HTML representation of the image
                img_html = tensor_to_html(data[i])
                all_images_html.append(img_html)
                
                all_metadata_list.append({
                    'id': f'vis_img_{original_idx_in_subset}_epoch_{epoch}', # Unique ID for Atlas
                    'epoch': epoch,
                    'label': f'Digit: {target[i].item()}',
                    'vis_sample_idx': original_idx_in_subset, # Index within the 0..NUM_VIS_SAMPLES-1 range
                    'image_html': img_html  # Add the HTML representation to metadata
                })
                vis_samples_collected_this_epoch += 1
            image_offset_in_vis_subset += embeddings_batch.size(0) # Move offset by batch size
            if vis_samples_collected_this_epoch >= NUM_VIS_SAMPLES: # Ensure we don't collect more than needed
                break
                
    print(f"Collected {vis_samples_collected_this_epoch} embeddings for visualization in epoch {epoch+1}.\n")

total_script_time = time.time() - overall_start_time
print(f"Total training and embedding extraction time: {total_script_time:.2f}s\n")


Epoch [1/15], Batch [200/469], Avg Loss: 2.2695
Epoch [1/15], Batch [400/469], Avg Loss: 2.1794
Epoch 1/15 training finished in 16.47s.

Collected 3000 embeddings for visualization in epoch 1.

Epoch [2/15], Batch [200/469], Avg Loss: 2.0083
Epoch [2/15], Batch [400/469], Avg Loss: 1.8469
Epoch 2/15 training finished in 16.46s.

Collected 3000 embeddings for visualization in epoch 2.

Epoch [3/15], Batch [200/469], Avg Loss: 1.6037
Epoch [3/15], Batch [400/469], Avg Loss: 1.4239
Epoch 3/15 training finished in 18.34s.

Collected 3000 embeddings for visualization in epoch 3.

Epoch [4/15], Batch [200/469], Avg Loss: 1.2073
Epoch [4/15], Batch [400/469], Avg Loss: 1.0713
Epoch 4/15 training finished in 17.51s.

Collected 3000 embeddings for visualization in epoch 4.

Epoch [5/15], Batch [200/469], Avg Loss: 0.9194
Epoch [5/15], Batch [400/469], Avg Loss: 0.8350
Epoch 5/15 training finished in 16.78s.

Collected 3000 embeddings for visualization in epoch 5.

Epoch [6/15], Batch [200/469],
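
Once the loop has finished, `all_metadata_list` and `all_embeddings_list` should each hold one entry per visualization sample per epoch (`NUM_EPOCHS * NUM_VIS_SAMPLES` rows), in the same order, with unique `id` values. Those properties are worth asserting before uploading. A minimal sketch on small stand-in data follows; `check_alignment` is a hypothetical helper, and the toy sizes replace the real 15 x 3000.

```python
import numpy as np

def check_alignment(metadata: list, embeddings: list, n_epochs: int, n_samples: int, dim: int) -> np.ndarray:
    """Validate row counts, id uniqueness, and embedding shape; return the stacked array."""
    emb = np.asarray(embeddings)
    expected_rows = n_epochs * n_samples
    assert len(metadata) == expected_rows, f"expected {expected_rows} metadata rows, got {len(metadata)}"
    assert emb.shape == (expected_rows, dim), f"unexpected embedding shape {emb.shape}"
    ids = [row["id"] for row in metadata]
    assert len(set(ids)) == len(ids), "ids must be unique across epochs"
    return emb

# Toy stand-in that mirrors the id scheme used in the loop above
n_epochs, n_samples, dim = 2, 3, 4
metadata = [
    {"id": f"vis_img_{i}_epoch_{e}", "epoch": e, "vis_sample_idx": i}
    for e in range(n_epochs) for i in range(n_samples)
]
embeddings = [np.zeros(dim, dtype=np.float32) for _ in metadata]
emb = check_alignment(metadata, embeddings, n_epochs, n_samples, dim)
print(emb.shape)  # (6, 4)
```

With the notebook's own variables, the call would be `check_alignment(all_metadata_list, all_embeddings_list, NUM_EPOCHS, NUM_VIS_SAMPLES, EMBEDDING_DIM)`.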

In [21]:
from nomic import AtlasDataset

dataset = AtlasDataset("mnist-training-embeddings", unique_id_field='id')
dataset.add_data(data=all_metadata_list, embeddings=np.array(all_embeddings_list))

2025-05-11 15:02:50.075 | INFO     | nomic.dataset:__init__:804 - Loading existing dataset `nomic/mnist-training-embeddings`.
100%|██████████| 9/9 [00:12<00:00,  1.42s/it]
2025-05-11 15:03:03.282 | INFO     | nomic.dataset:_add_data:1702 - Upload succeeded.


In [22]:
dataset.create_index(projection='umap', topic_model=False)

2025-05-11 15:03:25.894 | INFO     | nomic.dataset:create_index:1289 - Created map `0196c0bb-07a7-f93a-5c4d-15ab8c640e70` in dataset `nomic/mnist-training-embeddings`: https://atlas.nomic.ai/data/nomic/mnist-training-embeddings


Your map in Atlas should look similar to the video at the start of this example.