# Block2Vec: Learning Minecraft Block Embeddings

## What This Notebook Does

This notebook trains a neural network to learn **meaningful numerical representations** (called "embeddings") for Minecraft blocks. After training, similar blocks should have similar embeddings: for example, `oak_planks` and `spruce_planks` should end up close together in the embedding space, while `oak_planks` and `lava` should be far apart.

## Why Do We Need This?

This is **Phase 2** of a larger project to generate Minecraft structures from text prompts. The pipeline is:

1. **Phase 1 (Done)**: Prepare training data - 4,462 Minecraft builds as 3D arrays
2. **Phase 2 (This Notebook)**: Train Block2Vec to learn block embeddings
3. **Phase 3**: Train a VQ-VAE to compress/decompress structures
4. **Phase 4**: Connect text descriptions to the VQ-VAE
5. **Phase 5**: Generate new structures from text!

The embeddings we learn here will be used as input to the VQ-VAE in Phase 3.

---

# Part 1: Understanding Embeddings

## What is an Embedding?

An **embedding** is a way to represent something (a word, a block, an image) as a list of numbers (a vector). The key insight is that we want **similar things to have similar vectors**.

### Example: Representing Colors

Imagine you want a computer to understand colors. You could:

**Bad approach - One-hot encoding:**
```
red   = [1, 0, 0, 0, 0]
blue  = [0, 1, 0, 0, 0]
green = [0, 0, 1, 0, 0]
pink  = [0, 0, 0, 1, 0]
navy  = [0, 0, 0, 0, 1]
```

Problem: Every color is equally different from every other color. But pink is more similar to red than to green!

**Good approach - Learned embeddings:**
```
red   = [0.9, 0.1, 0.2]  # High red, low blue, low green
pink  = [0.8, 0.1, 0.5]  # Similar to red!
blue  = [0.1, 0.9, 0.2]
navy  = [0.1, 0.7, 0.1]  # Similar to blue!
green = [0.2, 0.1, 0.9]
```

Now similar colors have similar numbers!
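
To make "similar numbers" concrete, here is a small sketch that measures similarity with cosine similarity. The vectors are the hand-picked toy values from this section (not trained embeddings), and `cosine` is a helper written just for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors: any two different colors share no non-zero position
one_hot = {
    "red":   [1, 0, 0, 0, 0],
    "green": [0, 0, 1, 0, 0],
    "pink":  [0, 0, 0, 1, 0],
}

# Toy "learned" vectors from the example above (hand-picked, not trained)
learned = {
    "red":   [0.9, 0.1, 0.2],
    "pink":  [0.8, 0.1, 0.5],
    "green": [0.2, 0.1, 0.9],
}

print("one-hot red vs pink :", cosine(one_hot["red"], one_hot["pink"]))
print("one-hot red vs green:", cosine(one_hot["red"], one_hot["green"]))
print("learned red vs pink :", round(cosine(learned["red"], learned["pink"]), 3))
print("learned red vs green:", round(cosine(learned["red"], learned["green"]), 3))
```

With one-hot vectors both comparisons come out identical, so the encoding carries no notion of "closer". With the toy learned vectors, red scores higher against pink than against green.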

### Why 32 Dimensions?

We use 32-dimensional embeddings (each block becomes a list of 32 numbers). Why 32?

- **Too few dimensions (e.g., 2)**: Can't capture enough nuance. Imagine trying to describe a person with only 2 numbers!
- **Too many dimensions (e.g., 1000)**: Wastes memory, slower training, and risks "overfitting" (memorizing instead of learning)
- **32 dimensions**: A sweet spot that captures block relationships without being excessive

This is a **hyperparameter** - a choice we make before training. Finding good hyperparameters is part art, part science.
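
As a rough sketch of what "each block becomes a list of 32 numbers" means in practice, an embedding table can be pictured as a 2D array with one row per block token. The example below uses NumPy with random values in place of trained ones; the token id `307` is arbitrary and only there for illustration.

```python
import numpy as np

VOCAB_SIZE = 3717     # number of block states used in this notebook
EMBEDDING_DIM = 32    # numbers per block

rng = np.random.default_rng(0)
# One row per block token; random here, whereas training would adjust these values
embedding_table = rng.uniform(-0.5, 0.5, size=(VOCAB_SIZE, EMBEDDING_DIM))

token = 307  # arbitrary token id, chosen only for illustration
vector = embedding_table[token]  # an "embedding lookup" is just row indexing

print("one embedding has shape:", vector.shape)
print("values stored in the table:", embedding_table.size)

# How the table grows with the embedding dimension
for dim in (2, 32, 1000):
    print(f"dim={dim:>4}: {VOCAB_SIZE * dim:,} values")
```

The table size grows linearly with the dimension, which is one reason very large dimensions cost more memory and training time.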

---

# Part 2: The Skip-Gram Model (Word2Vec for Blocks)

## The Core Idea

Block2Vec is inspired by **Word2Vec**, a famous algorithm for learning word embeddings. The key insight:

> **"You shall know a word by the company it keeps"** - J.R. Firth, 1957

In text, words that appear in similar contexts have similar meanings:
- "The **cat** sat on the mat"
- "The **dog** sat on the mat"

Since "cat" and "dog" appear in similar contexts, they should have similar embeddings.

## Applying This to Minecraft

In Minecraft, we use **spatial context** instead of textual context:

- Blocks that appear **next to similar blocks** should have similar embeddings
- Oak stairs often appear next to oak planks
- Stone bricks often appear next to other stone bricks
- Torches often appear on walls (next to stone, wood, etc.)

## Skip-Gram Architecture

The Skip-Gram model works by:

1. **Input**: A center block (e.g., `oak_planks`)
2. **Goal**: Predict the neighboring blocks
3. **Learning**: Adjust embeddings so prediction gets better

```
     [air]     [oak_stairs]
        \         /
         \       /
    [oak_log]--[OAK_PLANKS]--[oak_log]
         /       \
        /         \
   [stone]      [glass]
   
Center block: oak_planks
Context blocks: air, oak_stairs, oak_log, oak_log, stone, glass
```

The model learns that `oak_planks` should be close to `oak_log` and `oak_stairs` in embedding space because they frequently appear together.
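
The idea of "center block plus its neighbors" can be sketched on a tiny 3D array. This is a simplified illustration using the 6 face-adjacent neighbors; `context_pairs` is a helper name made up for this example, and the token ids are just the numbers 0 to 26.

```python
import numpy as np

# The 6 face-adjacent offsets (one step along each axis)
NEIGHBORS_6 = [
    (-1, 0, 0), (1, 0, 0),
    (0, -1, 0), (0, 1, 0),
    (0, 0, -1), (0, 0, 1),
]

def context_pairs(build, y, x, z):
    """Return (center, context) token pairs for one position, skipping out-of-bounds neighbors."""
    h, w, d = build.shape
    center = int(build[y, x, z])
    pairs = []
    for dy, dx, dz in NEIGHBORS_6:
        ny, nx, nz = y + dy, x + dx, z + dz
        if 0 <= ny < h and 0 <= nx < w and 0 <= nz < d:
            pairs.append((center, int(build[ny, nx, nz])))
    return pairs

# A tiny 3x3x3 "build" filled with made-up token ids 0..26
build = np.arange(27).reshape(3, 3, 3)

print(context_pairs(build, 1, 1, 1))  # interior block: all 6 neighbors exist
print(context_pairs(build, 0, 0, 0))  # corner block: only 3 neighbors are in bounds
```

Each returned pair is one positive training example for the Skip-Gram model.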

---

# Part 3: Negative Sampling

## The Problem with Naive Training

A naive approach would be:

1. For center block `oak_planks`, predict probability of ALL 3,717 blocks being neighbors
2. The correct neighbors get high probability, others get low
3. This requires computing 3,717 probabilities per training example!

This is **extremely slow** because we have millions of training examples.

## The Solution: Negative Sampling

Instead of predicting all blocks, we:

1. Take one **positive pair**: (oak_planks, oak_log) - they ARE neighbors
2. Sample a few **negative pairs**: (oak_planks, diamond_ore), (oak_planks, lava) - random blocks that probably AREN'T neighbors
3. Train the model to distinguish positive from negative

```
Positive: oak_planks + oak_log → Should output HIGH score (they're neighbors)
Negative: oak_planks + diamond_ore → Should output LOW score (random, unlikely neighbors)
Negative: oak_planks + lava → Should output LOW score
```

This is **much faster**: with 5 negative samples per positive pair (the setting used later in this notebook), we compute only 6 scores instead of 3,717!
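
Here is a minimal sketch of how one training example could be assembled. It draws negatives uniformly at random, which is a simplification for illustration; the dataset class later in this notebook samples them in proportion to block frequency raised to the power 0.75. The token ids `10` and `11` and the helper `make_training_example` are made up for this example.

```python
import random

VOCAB_SIZE = 3717
NUM_NEGATIVES = 5  # 1 positive + 5 negatives = 6 scores per example

def make_training_example(center, context, rng):
    """Pair one true neighbor with NUM_NEGATIVES randomly drawn tokens."""
    negatives = [rng.randrange(VOCAB_SIZE) for _ in range(NUM_NEGATIVES)]
    return center, context, negatives

rng = random.Random(0)
center, context, negatives = make_training_example(center=10, context=11, rng=rng)

scores_needed = 1 + len(negatives)
print(f"center={center}, positive context={context}, negatives={negatives}")
print(f"scores per example: {scores_needed} instead of {VOCAB_SIZE}")
```

A randomly drawn token can occasionally be a real neighbor; with thousands of block types this is rare enough that the approach still works in practice.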

## How Negative Sampling Works Mathematically

For each (center, context) positive pair:

1. **Positive score** = dot product of center embedding and context embedding
2. We want this to be **high** (close to 1 after sigmoid)

For each (center, random_negative) negative pair:

1. **Negative score** = dot product of center embedding and negative embedding  
2. We want this to be **low** (close to 0 after sigmoid)

The **loss function** penalizes the model when:
- Positive scores are low (should be high!)
- Negative scores are high (should be low!)
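
The two penalties above can be written as a single formula: `loss = -[log σ(positive score) + Σ log σ(-negative score)]`. The sketch below implements it in plain Python for one positive pair; the scores are made-up numbers, and `sgns_loss` is a name chosen for this example.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sgns_loss(pos_score, neg_scores):
    """Negative-sampling loss for one positive pair and its negative scores."""
    return -(log_sigmoid(pos_score) + sum(log_sigmoid(-s) for s in neg_scores))

# Scores the way we want them: positive pair high, negative pairs low
good = sgns_loss(pos_score=4.0, neg_scores=[-4.0, -3.0])
# Scores the wrong way round
bad = sgns_loss(pos_score=-4.0, neg_scores=[4.0, 3.0])

print(f"loss when scores are right: {good:.4f}")
print(f"loss when scores are wrong: {bad:.4f}")
```

The loss is small when the scores match what we want and large when they are reversed, which is the signal training uses to adjust the embeddings.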

---

# Part 4: Subsampling Frequent Blocks

## The Air Problem

In Minecraft builds, **air** blocks are everywhere - often 70-80% of all blocks! If we train on every air block:

1. Training is dominated by air-related pairs
2. The model mostly learns about air, not interesting blocks
3. Other blocks don't get enough training signal

## The Solution: Subsampling

We randomly **skip** frequent blocks during training. The probability of keeping a block is:

```
P(keep) = sqrt(threshold / frequency)
```

For example, if `threshold = 0.001` and air has `frequency = 0.75`:

```
P(keep air) = sqrt(0.001 / 0.75) ≈ 0.037 = 3.7%
```

So we keep only about 3.7% of air blocks! This stops air from dominating the training pairs.

For rare blocks like `diamond_block` with `frequency = 0.0001`:

```
P(keep diamond_block) = sqrt(0.001 / 0.0001) = 3.16 → capped at 100%
```

Rare blocks are always kept.
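
The formula and the cap at 100% fit in a few lines of Python. This is a small sketch using the example frequencies from this section; `keep_probability` is a helper name made up here.

```python
import math

def keep_probability(frequency, threshold=0.001):
    """P(keep) = sqrt(threshold / frequency), capped at 1.0."""
    return min(1.0, math.sqrt(threshold / frequency))

for name, freq in [("air", 0.75), ("diamond_block", 0.0001)]:
    print(f"{name:>13}: frequency={freq}, P(keep)={keep_probability(freq):.3f}")
```

Any block whose frequency is at or below the threshold gets a keep probability of 1.0, so only common blocks are ever skipped.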

---

# Part 5: The Training Loop

## What Happens During Training

Each training step:

1. **Forward Pass**: 
   - Get embeddings for center blocks
   - Get embeddings for context blocks (positive)
   - Get embeddings for negative samples
   - Compute dot products (similarity scores)
   - Compute loss (how wrong were we?)

2. **Backward Pass** (Backpropagation):
   - Calculate gradients (which direction should each embedding move?)
   - This is done automatically by PyTorch!

3. **Optimizer Step**:
   - Update embeddings in the direction that reduces loss
   - Learning rate controls how big each step is
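
To see the three steps without any framework, here is a hand-written sketch for a single positive pair. It computes the gradients manually (the part PyTorch does automatically) and uses a plain gradient-descent update rather than the Adam-style optimizer this notebook uses later. The 3-dimensional vectors and the large learning rate are made up so the effect is visible after one step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 3-dimensional embeddings for one positive (center, context) pair
center = [0.1, -0.2, 0.05]
context = [0.3, 0.1, -0.1]
learning_rate = 0.5  # deliberately large so one step makes a visible difference

# 1. Forward pass: score, then loss = -log(sigmoid(score))
score_before = dot(center, context)
loss_before = -math.log(sigmoid(score_before))

# 2. Backward pass: d(loss)/d(score) = sigmoid(score) - 1, then the chain rule
grad_score = sigmoid(score_before) - 1.0
grad_center = [grad_score * c for c in context]
grad_context = [grad_score * c for c in center]

# 3. Optimizer step: move each embedding against its gradient
center = [c - learning_rate * g for c, g in zip(center, grad_center)]
context = [c - learning_rate * g for c, g in zip(context, grad_context)]

score_after = dot(center, context)
loss_after = -math.log(sigmoid(score_after))
print(f"score: {score_before:.4f} -> {score_after:.4f}")
print(f"loss:  {loss_before:.4f} -> {loss_after:.4f}")
```

After the update the score for the positive pair goes up and the loss goes down, which is exactly what one training step is supposed to achieve.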

## Key Concepts

### Gradient Descent

Imagine you're blindfolded on a hill, trying to find the lowest point:

1. Feel the ground around you (compute gradient)
2. Take a step downhill (update weights)
3. Repeat until you reach the bottom (loss is minimized)

The **gradient** tells us which direction is "downhill" for each embedding value.
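
The hill analogy can be tried on a one-dimensional toy problem. The sketch below minimizes `f(x) = (x - 3)^2`, a function chosen only for illustration because its lowest point is known to be at `x = 3`.

```python
def loss(x):
    """A simple 'hill': lowest point at x = 3."""
    return (x - 3.0) ** 2

def gradient(x):
    """Slope of the loss at x; positive means 'uphill to the right'."""
    return 2.0 * (x - 3.0)

x = 0.0               # starting position
learning_rate = 0.1   # step size

for step in range(50):
    x -= learning_rate * gradient(x)  # step downhill

print(f"x after 50 steps: {x:.4f}, loss: {loss(x):.6f}")
```

Training Block2Vec follows the same loop, except that `x` is replaced by every number in the embedding tables.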

### Learning Rate

How big are your steps?

- **Too large**: You overshoot and bounce around, never converging
- **Too small**: Training takes forever
- **Just right**: Steady progress toward the minimum

We use `learning_rate = 0.001` which is a common starting point.
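
The three cases can be reproduced on a toy problem, here plain gradient descent on `f(x) = (x - 3)^2`. The specific rates below only make sense for this toy function and are not comparable to the `0.001` used with the Adam-style optimizer in this notebook.

```python
def distance_after_training(learning_rate, steps=50, start=0.0):
    """Run gradient descent on f(x) = (x - 3)^2 and return the final distance from the minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * (x - 3.0)
    return abs(x - 3.0)

for label, lr in [("too large", 1.1), ("too small", 0.001), ("just right", 0.1)]:
    print(f"{label:>10} (lr={lr}): distance from minimum = {distance_after_training(lr):.4g}")
```

The oversized rate ends up farther from the minimum than where it started, the tiny rate has barely moved, and the middle value lands very close to the minimum.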

### Batch Size

We don't update after every single example - that would be noisy and slow. Instead:

1. Collect a **batch** of examples (e.g., 4096; this notebook uses 8192)
2. Compute average loss across the batch
3. Update once based on the average

Larger batches = more stable gradients, but need more memory.
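
The "more stable gradients" claim can be checked with a quick simulation. The sketch below uses random numbers as a stand-in for per-example gradients (an assumption made for illustration, not real Block2Vec gradients) and measures how much the batch average jumps around for different batch sizes.

```python
import random
import statistics

rng = random.Random(0)
# Stand-in for per-example gradients: noisy values around a true mean of 1.0
per_example_grads = [rng.gauss(1.0, 2.0) for _ in range(40_960)]

def spread_of_batch_means(batch_size):
    """Standard deviation of the batch-averaged gradient across all batches."""
    means = [
        statistics.fmean(per_example_grads[i:i + batch_size])
        for i in range(0, len(per_example_grads), batch_size)
    ]
    return statistics.stdev(means)

for batch_size in (1, 16, 256):
    print(f"batch_size={batch_size:>3}: spread of the update direction = "
          f"{spread_of_batch_means(batch_size):.3f}")
```

The spread shrinks as the batch grows, so each update points more reliably in the right direction, at the cost of holding more examples in memory at once.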

---

# Part 6: Let's Start Coding!

Now that you understand the concepts, let's implement it. First, let's set up our environment.

In [None]:
# ============================================================
# CELL 1: Imports and Setup
# ============================================================
# These are the libraries we need:
# - torch: PyTorch, the deep learning framework
# - numpy: For numerical operations on arrays
# - h5py: For reading HDF5 files (our training data)
# - json: For reading the vocabulary file
# - matplotlib: For visualization
# - sklearn: For t-SNE dimensionality reduction

import json
import random
import time
from pathlib import Path
from typing import Iterator, Optional

import h5py
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from torch.utils.data import DataLoader, IterableDataset
from tqdm.notebook import tqdm

# Check if GPU is available - this is why we're using Kaggle!
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# ============================================================
# CELL 2: Configuration
# ============================================================
# These are our HYPERPARAMETERS - values we choose before training.
# Finding good hyperparameters is crucial for ML success.

# === Data Paths (adjust if needed for your Kaggle dataset) ===
DATA_DIR = "/kaggle/input/minecraft-schematics/minecraft_splits/splits/train"  # Training data
VOCAB_PATH = "/kaggle/input/minecraft-schematics/tok2block.json"  # Block vocabulary
OUTPUT_DIR = "/kaggle/working"  # Where to save results

# === Model Architecture ===
EMBEDDING_DIM = 32  # Size of each block's embedding vector
                     # 32 is a good balance between expressiveness and efficiency

# === Training Hyperparameters ===
EPOCHS = 50          # How many times to iterate through all data
                     # More epochs usually improve embeddings (with diminishing returns), but take longer
                     
BATCH_SIZE = 8192    # How many (center, context) pairs per update
                     # Larger = more stable gradients, but needs more memory
                     
LEARNING_RATE = 0.001  # How big of a step to take each update
                        # 0.001 is a common starting point for Adam optimizer
                        
WEIGHT_DECAY = 0.0001  # Weight decay - regularization that discourages large weights
                        # (AdamW applies it as decoupled decay rather than an L2 loss term)

# === Skip-Gram Settings ===
NUM_NEGATIVE_SAMPLES = 5  # How many negative samples per positive pair
                           # More = better distinction, but slower
                           
CONTEXT_TYPE = "neighbors_6"  # Use 6 adjacent neighbors (up/down/left/right/front/back)
                               # Alternative: "neighbors_26" for full 3x3x3 cube
                               # Note: the dataset class below hard-codes the 6-neighbor
                               # context and does not read this setting

# === Subsampling ===
SUBSAMPLE_THRESHOLD = 0.001  # Blocks more frequent than this get subsampled
                              # Air is ~75% of blocks, so it will be heavily subsampled

# === Other ===
SEED = 42  # Random seed for reproducibility
            # Same seed = repeatable runs (GPU operations can still cause small differences)

print("Configuration loaded!")
print(f"  Embedding dimension: {EMBEDDING_DIM}")
print(f"  Epochs: {EPOCHS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LEARNING_RATE}")

In [None]:
# ============================================================
# CELL 3: Load Vocabulary
# ============================================================
# The vocabulary maps integer tokens (0, 1, 2, ...) to block names.
# Our training data uses tokens, but we need names for interpretation.

with open(VOCAB_PATH, 'r') as f:
    tok2block = {int(k): v for k, v in json.load(f).items()}

VOCAB_SIZE = len(tok2block)
print(f"Vocabulary size: {VOCAB_SIZE} unique block states")

# Let's look at some examples
print("\nSample blocks:")
for tok in [0, 102, 307, 500, 1000]:
    if tok in tok2block:
        print(f"  Token {tok}: {tok2block[tok]}")

# Find the air token (we'll need this later)
AIR_TOKEN = None
for tok, name in tok2block.items():
    if name == "minecraft:air":
        AIR_TOKEN = tok
        break
print(f"\nAir token: {AIR_TOKEN}")

---

# Part 7: The Block2Vec Model

Now let's define our neural network. In PyTorch, we create a class that inherits from `nn.Module`.

In [None]:
# ============================================================
# CELL 4: Block2Vec Model Definition
# ============================================================

class Block2Vec(nn.Module):
    """
    Skip-gram model for learning Minecraft block embeddings.
    
    The model has TWO embedding matrices:
    1. center_embeddings: Used when a block is the CENTER of a context window
    2. context_embeddings: Used when a block is in the CONTEXT (neighbor)
    
    Why two matrices? This is a trick from Word2Vec that works better in practice.
    At the end, we use center_embeddings as our final embeddings.
    
    Architecture:
        Input: (center_token, context_token, [negative_tokens])
        ↓
        Look up embeddings from the matrices
        ↓
        Compute dot products (similarity scores)
        ↓
        Output: Loss value (how wrong were our predictions?)
    """
    
    def __init__(self, vocab_size: int, embedding_dim: int = 32):
        """
        Initialize the model.
        
        Args:
            vocab_size: Number of unique blocks (3717 in our case)
            embedding_dim: Size of each embedding vector (32)
        """
        # This calls the parent class constructor - required for PyTorch
        super().__init__()
        
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        
        # nn.Embedding is a lookup table:
        # - Input: integer token (e.g., 307 for stone)
        # - Output: embedding vector of size embedding_dim
        # The table starts with random values and learns during training
        
        self.center_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Initialize with small random values
        # This is important - bad initialization can prevent learning
        self._init_weights()
    
    def _init_weights(self):
        """
        Initialize embeddings with small uniform random values.
        
        We use range [-0.5/dim, 0.5/dim] so that initial dot products
        are close to zero (neither strongly positive nor negative).
        """
        init_range = 0.5 / self.embedding_dim
        self.center_embeddings.weight.data.uniform_(-init_range, init_range)
        self.context_embeddings.weight.data.uniform_(-init_range, init_range)
    
    def forward(
        self,
        center_ids: torch.Tensor,
        context_ids: torch.Tensor,
        negative_ids: torch.Tensor,
    ) -> torch.Tensor:
        """
        Compute the Skip-gram loss with negative sampling.
        
        This is called automatically when you do: loss = model(centers, contexts, negatives)
        
        Args:
            center_ids: Tensor of center block tokens [batch_size]
            context_ids: Tensor of positive context tokens [batch_size]
            negative_ids: Tensor of negative sample tokens [batch_size, num_negatives]
            
        Returns:
            Loss value (scalar tensor)
        """
        # Step 1: Look up embeddings
        # center_ids might be [307, 102, 500, ...] (batch of center tokens)
        # center_emb will be [batch_size, embedding_dim] - the embedding for each
        center_emb = self.center_embeddings(center_ids)    # [B, D]
        context_emb = self.context_embeddings(context_ids)  # [B, D]
        neg_emb = self.context_embeddings(negative_ids)     # [B, N, D]
        
        # Step 2: Compute positive score (dot product)
        # For each pair, compute: center · context (element-wise multiply, then sum)
        # High score = model thinks they're neighbors
        pos_score = torch.sum(center_emb * context_emb, dim=1)  # [B]
        
        # Step 3: Positive loss
        # We want pos_score to be HIGH, so we use log-sigmoid
        # log(sigmoid(x)) is close to 0 when x is large and positive (good!)
        # log(sigmoid(x)) is very negative when x is very negative (bad!)
        pos_loss = F.logsigmoid(pos_score)  # [B]
        
        # Step 4: Compute negative scores
        # We need to broadcast: each center against all its negatives
        center_emb_expanded = center_emb.unsqueeze(1)  # [B, 1, D]
        neg_score = torch.sum(center_emb_expanded * neg_emb, dim=2)  # [B, N]
        
        # Step 5: Negative loss
        # We want neg_score to be LOW, so we use log-sigmoid of the NEGATED score
        # log(sigmoid(-x)) is close to 0 when x is very negative (good!)
        neg_loss = F.logsigmoid(-neg_score).sum(dim=1)  # [B]
        
        # Step 6: Total loss
        # We negate because we MAXIMIZE log-likelihood, but PyTorch MINIMIZES loss
        loss = -(pos_loss + neg_loss).mean()
        
        return loss
    
    def get_embeddings(self) -> np.ndarray:
        """
        Get the learned embeddings as a numpy array.
        
        Returns:
            Array of shape [vocab_size, embedding_dim]
        """
        return self.center_embeddings.weight.data.cpu().numpy()


# Let's test it!
model = Block2Vec(vocab_size=VOCAB_SIZE, embedding_dim=EMBEDDING_DIM)
print("Model created!")
print(f"  Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"  = {VOCAB_SIZE} blocks × {EMBEDDING_DIM} dims × 2 matrices")

---

# Part 8: The Dataset

Now we need to create training data. We'll iterate through each build, extract (center, context) pairs, and sample negatives.

In [None]:
# ============================================================
# CELL 5: Dataset Class
# ============================================================

class Block2VecDataset(IterableDataset):
    """
    Dataset that yields (center, context, negatives) tuples for training.
    
    This is an "iterable" dataset - it generates data on-the-fly rather than
    loading everything into memory. This is necessary because we have millions
    of training pairs!
    
    How it works:
    1. Load each H5 file (one Minecraft build)
    2. For each block position in the 3D array:
       - Get the center block
       - Get its neighbors (context)
       - Sample random blocks (negatives)
       - Yield the training example
    """
    
    # The 6 neighbors: +/- in each axis (x, y, z)
    NEIGHBORS_6 = [
        (-1, 0, 0), (1, 0, 0),  # left, right
        (0, -1, 0), (0, 1, 0),  # down, up
        (0, 0, -1), (0, 0, 1),  # back, front
    ]
    
    def __init__(
        self,
        data_dir: str,
        vocab_size: int,
        num_negative_samples: int = 5,
        subsample_threshold: float = 0.001,
        air_token: int = 102,
        seed: int = 42,
    ):
        self.data_dir = Path(data_dir)
        self.vocab_size = vocab_size
        self.num_negative_samples = num_negative_samples
        self.subsample_threshold = subsample_threshold
        self.air_token = air_token
        self.seed = seed
        
        # Find all H5 files
        self.h5_files = sorted(self.data_dir.glob("*.h5"))
        print(f"Found {len(self.h5_files)} training files")
        
        # These will be computed on first iteration
        self._block_freqs = None
        self._negative_table = None
        self._subsample_probs = None
    
    def _compute_frequencies(self):
        """
        Count how often each block appears across all data.
        
        This is used for:
        1. Subsampling - skip frequent blocks sometimes
        2. Negative sampling - sample proportional to frequency^0.75
        """
        print("Computing block frequencies (this only happens once)...")
        freqs = np.zeros(self.vocab_size, dtype=np.float64)
        
        for h5_path in tqdm(self.h5_files, desc="Counting blocks"):
            with h5py.File(h5_path, 'r') as f:
                build = f[list(f.keys())[0]][:]
                unique, counts = np.unique(build, return_counts=True)
                for tok, count in zip(unique, counts):
                    if tok < self.vocab_size:
                        freqs[tok] += count
        
        # Normalize to probabilities
        freqs /= freqs.sum()
        self._block_freqs = freqs
        
        # Build negative sampling table
        # We raise to power 0.75 to reduce dominance of frequent blocks
        weighted = np.power(freqs, 0.75)
        weighted /= weighted.sum()
        
        # Pre-sample 10 million negatives for efficiency
        self._negative_table = np.random.choice(
            self.vocab_size, 
            size=10_000_000, 
            p=weighted
        )
        
        # Compute subsampling probabilities
        self._subsample_probs = np.ones(self.vocab_size, dtype=np.float32)
        for i, freq in enumerate(freqs):
            if freq > self.subsample_threshold:
                self._subsample_probs[i] = np.sqrt(self.subsample_threshold / freq)
        
        print(f"  Air frequency: {freqs[self.air_token]:.2%}")
        print(f"  Air keep probability: {self._subsample_probs[self.air_token]:.2%}")
    
    def __iter__(self) -> Iterator:
        """
        Iterate through all data, yielding training examples.
        
        Yields:
            Tuple of (center_token, context_token, negative_tokens)
        """
        # Compute frequencies on first iteration
        if self._block_freqs is None:
            self._compute_frequencies()
        
        rng = random.Random(self.seed)
        neg_idx = 0
        
        for h5_path in self.h5_files:
            # Load the build
            with h5py.File(h5_path, 'r') as f:
                build = f[list(f.keys())[0]][:]
            
            h, w, d = build.shape
            
            # Iterate through every position
            for y in range(h):
                for x in range(w):
                    for z in range(d):
                        center = int(build[y, x, z])
                        
                        # Subsampling check - skip frequent blocks sometimes
                        if rng.random() >= self._subsample_probs[center]:
                            continue
                        
                        # Check each neighbor
                        for dy, dx, dz in self.NEIGHBORS_6:
                            ny, nx, nz = y + dy, x + dx, z + dz
                            
                            # Skip if out of bounds
                            if not (0 <= ny < h and 0 <= nx < w and 0 <= nz < d):
                                continue
                            
                            context = int(build[ny, nx, nz])
                            
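                            # Sample negatives from pre-computed table,
                            # wrapping to the start if the slice would run past the end
                            # (otherwise the last slice could be too short)
                            if neg_idx + self.num_negative_samples > len(self._negative_table):
                                neg_idx = 0
                            negatives = self._negative_table[
                                neg_idx : neg_idx + self.num_negative_samples
                            ]
                            neg_idx += self.num_negative_samples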
                            
                            yield center, context, negatives


def collate_fn(batch):
    """
    Convert a list of examples into tensors for the model.
    
    This is called automatically by the DataLoader.
    """
    centers = torch.tensor([b[0] for b in batch], dtype=torch.long)
    contexts = torch.tensor([b[1] for b in batch], dtype=torch.long)
    negatives = torch.tensor(np.array([b[2] for b in batch]), dtype=torch.long)
    return centers, contexts, negatives


print("Dataset class defined!")

---

# Part 9: Training Loop

Now let's put it all together and train!

In [None]:
# ============================================================
# CELL 6: Create Dataset and DataLoader
# ============================================================

# Set random seeds for reproducibility
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

# Create dataset
dataset = Block2VecDataset(
    data_dir=DATA_DIR,
    vocab_size=VOCAB_SIZE,
    num_negative_samples=NUM_NEGATIVE_SAMPLES,
    subsample_threshold=SUBSAMPLE_THRESHOLD,
    air_token=AIR_TOKEN,
    seed=SEED,
)

# DataLoader handles batching and (optionally) parallel loading
# pin_memory=True speeds up CPU->GPU transfer
dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    collate_fn=collate_fn,
    pin_memory=(device == "cuda"),
)

print(f"DataLoader ready with batch size {BATCH_SIZE}")

In [None]:
# ============================================================
# CELL 7: Create Model and Optimizer
# ============================================================

# Create the model and move to GPU
model = Block2Vec(vocab_size=VOCAB_SIZE, embedding_dim=EMBEDDING_DIM)
model = model.to(device)

print(f"Model on {device}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Create optimizer
# AdamW is an improved version of Adam with better weight decay handling
# Adam = Adaptive Moment Estimation - adjusts learning rate for each parameter
optimizer = optim.AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
)

print(f"Optimizer: AdamW (lr={LEARNING_RATE}, weight_decay={WEIGHT_DECAY})")

In [None]:
# ============================================================
# CELL 8: Training Loop
# ============================================================

print("="*60)
print("Starting Training")
print("="*60)

# Track losses for plotting
epoch_losses = []
start_time = time.time()

for epoch in range(EPOCHS):
    model.train()  # Set model to training mode
    epoch_loss = 0.0
    num_batches = 0
    
    # Progress bar for this epoch
    pbar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{EPOCHS}")
    
    for center_ids, context_ids, negative_ids in pbar:
        # Move data to GPU
        center_ids = center_ids.to(device)
        context_ids = context_ids.to(device)
        negative_ids = negative_ids.to(device)
        
        # Forward pass: compute loss
        loss = model(center_ids, context_ids, negative_ids)
        
        # Backward pass: compute gradients
        optimizer.zero_grad()  # Clear old gradients
        loss.backward()        # Compute new gradients
        
        # Optimizer step: update weights
        optimizer.step()
        
        # Track progress
        epoch_loss += loss.item()
        num_batches += 1
        pbar.set_postfix({"loss": f"{loss.item():.4f}"})
    
    # Epoch complete
    avg_loss = epoch_loss / max(num_batches, 1)  # Guard against an empty dataloader
    epoch_losses.append(avg_loss)
    
    elapsed = time.time() - start_time
    print(f"Epoch {epoch+1}: loss={avg_loss:.4f}, time={elapsed:.0f}s")

print("\n" + "="*60)
print(f"Training complete in {time.time() - start_time:.0f}s")
print("="*60)

In [None]:
# ============================================================
# CELL 9: Save Results
# ============================================================

# Save the model
model_path = f"{OUTPUT_DIR}/block2vec_model.pt"
torch.save(model.state_dict(), model_path)
print(f"Model saved to {model_path}")

# Save embeddings as numpy file (easy to load later)
embeddings = model.get_embeddings()
embeddings_path = f"{OUTPUT_DIR}/block_embeddings.npy"
np.save(embeddings_path, embeddings)
print(f"Embeddings saved to {embeddings_path}")
print(f"  Shape: {embeddings.shape}")

# Save training stats
stats = {
    "epoch_losses": epoch_losses,
    "config": {
        "embedding_dim": EMBEDDING_DIM,
        "epochs": EPOCHS,
        "batch_size": BATCH_SIZE,
        "learning_rate": LEARNING_RATE,
    }
}
with open(f"{OUTPUT_DIR}/training_stats.json", 'w') as f:
    json.dump(stats, f, indent=2)
print("Training stats saved")

---

# Part 10: Visualizing the Results

Now let's see what the model learned! We'll create:

1. **Training loss plot** - Did the model learn?
2. **t-SNE visualization** - 2D projection of embeddings
3. **Similarity analysis** - Which blocks are most similar?
4. **Nearest neighbors** - For key blocks, what's closest?

In [None]:
# ============================================================
# CELL 10: Plot Training Loss
# ============================================================

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(epoch_losses) + 1), epoch_losses, 'b-', linewidth=2, marker='o')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Block2Vec Training Loss', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/training_loss.png", dpi=150)
plt.show()

print(f"Initial loss: {epoch_losses[0]:.4f}")
print(f"Final loss: {epoch_losses[-1]:.4f}")
print(f"Improvement: {(epoch_losses[0] - epoch_losses[-1]) / epoch_losses[0] * 100:.1f}%")

In [None]:
# ============================================================
# CELL 11: t-SNE Visualization
# ============================================================
# t-SNE (t-Distributed Stochastic Neighbor Embedding) reduces our
# 32-dimensional embeddings to 2D for visualization.
# Similar blocks should appear close together.

import re

def get_block_category(block_name: str) -> str:
    """Extract a category from block name for coloring."""
    name = block_name.replace("minecraft:", "")
    name = re.sub(r"\[.*\]", "", name)  # Remove block state
    
    if any(x in name for x in ["planks", "log", "wood", "fence", "door"]):
        return "wood"
    elif any(x in name for x in ["stone", "cobble", "brick", "andesite", "diorite", "granite"]):
        return "stone"
    elif "ore" in name or any(x in name for x in ["diamond", "gold", "iron", "coal", "emerald"]):
        return "ore/mineral"
    elif "glass" in name:
        return "glass"
    elif "wool" in name or "carpet" in name:
        return "wool"
    elif "concrete" in name:
        return "concrete"
    elif "leaves" in name:
        return "leaves"
    elif "water" in name:
        return "water"
    elif "lava" in name:
        return "lava"
    elif "air" in name:
        return "air"
    else:
        return "other"

# Sample blocks for visualization (too many would be cluttered)
sample_size = 1000
indices = np.random.choice(VOCAB_SIZE, min(sample_size, VOCAB_SIZE), replace=False)
sampled_embeddings = embeddings[indices]

# Get categories for coloring
categories = [get_block_category(tok2block.get(i, "unknown")) for i in indices]
unique_cats = sorted(set(categories))  # Sorted so colors and legend order are stable across runs

print(f"Running t-SNE on {len(indices)} blocks...")
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
coords = tsne.fit_transform(sampled_embeddings)

# Plot
plt.figure(figsize=(14, 10))
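# plt.cm.get_cmap was removed in newer Matplotlib releases; plt.get_cmap still works.
# tab20 is used because there can be 11 categories, one more than tab10 provides.
cmap = plt.get_cmap('tab20', len(unique_cats))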

for i, cat in enumerate(unique_cats):
    mask = [c == cat for c in categories]
    plt.scatter(
        coords[mask, 0], coords[mask, 1],
        c=[cmap(i)], label=cat, alpha=0.6, s=20
    )

plt.title('Block2Vec Embeddings (t-SNE projection)', fontsize=14)
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.legend(loc='upper right')
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/tsne_embeddings.png", dpi=150)
plt.show()

print("\nt-SNE complete! Similar blocks should be clustered together.")

In [None]:
# ============================================================
# CELL 12: Find Nearest Neighbors
# ============================================================
# For some key blocks, let's find their most similar blocks.
# This tells us what relationships the model learned.

def find_similar_blocks(block_name: str, embeddings: np.ndarray, tok2block: dict, top_k: int = 10):
    """
    Print the top_k blocks most similar to a given block.
    
    Uses cosine similarity: similarity = dot(a, b) / (|a| * |b|)
    Ranges from -1 (opposite direction) to 1 (same direction).
    """
    def base_name(name: str) -> str:
        # Strip the namespace and any block state, e.g.
        # "minecraft:oak_log[axis=y]" -> "oak_log"
        return re.sub(r"\[.*\]", "", name.replace("minecraft:", ""))
    
    # Prefer blocks whose base name equals the query, so that "stone" does not
    # resolve to "cobblestone"; fall back to a substring match otherwise.
    matching = [k for k, v in tok2block.items() if base_name(v) == block_name]
    if not matching:
        matching = [k for k, v in tok2block.items() if block_name in v]
    if not matching:
        print(f"Block '{block_name}' not found")
        return
    
    token = matching[0]  # Use first match (a block may appear with several states)
    query = embeddings[token].reshape(1, -1)
    
    # Compute cosine similarity to all blocks
    similarities = cosine_similarity(query, embeddings)[0]
    
    # Rank all blocks from most to least similar; the query itself is skipped below
    top_indices = np.argsort(similarities)[::-1]
    
    print(f"\nMost similar to '{tok2block[token]}':")
    print("-" * 60)
    
    count = 0
    for idx in top_indices:
        if idx != token:
            # Get base block name (without namespace or state)
            short_name = base_name(tok2block.get(idx, "unknown"))
            print(f"  {count+1}. {short_name:30} (similarity: {similarities[idx]:.4f})")
            count += 1
            if count >= top_k:
                break

# Test with some interesting blocks
test_blocks = [
    "oak_planks",
    "stone",
    "glass",
    "torch",
    "water",
    "diamond_block",
]

for block in test_blocks:
    find_similar_blocks(block, embeddings, tok2block, top_k=5)

In [None]:
# ============================================================
# CELL 13: Similarity Heatmap
# ============================================================

# Select some common blocks for the heatmap
blocks_to_compare = [
    "stone", "cobblestone", "stone_bricks",
    "oak_planks", "spruce_planks", "oak_log",
    "glass", "white_wool", "red_wool",
    "iron_block", "gold_block", "diamond_block",
    "dirt", "grass_block", "sand",
    "water", "air",
]

# Find tokens for these blocks
selected_tokens = []
selected_names = []

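for block in blocks_to_compare:
    for tok, name in tok2block.items():
        # Compare base names (namespace and [state] stripped) so that "stone"
        # does not match "cobblestone", and blocks that always carry a state
        # (e.g. water, oak_log) are still found.
        base = re.sub(r"\[.*\]", "", name.replace("minecraft:", ""))
        if base == block:
            selected_tokens.append(tok)
            selected_names.append(block)
            break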

if len(selected_tokens) > 1:
    # Get embeddings and compute similarity matrix
    selected_embeddings = embeddings[selected_tokens]
    sim_matrix = cosine_similarity(selected_embeddings)
    
    # Plot heatmap
    plt.figure(figsize=(12, 10))
    im = plt.imshow(sim_matrix, cmap='RdYlBu_r', vmin=-1, vmax=1)
    
    plt.xticks(range(len(selected_names)), selected_names, rotation=45, ha='right')
    plt.yticks(range(len(selected_names)), selected_names)
    plt.colorbar(im, label='Cosine Similarity')
    plt.title('Block Embedding Similarity', fontsize=14)
    plt.tight_layout()
    plt.savefig(f"{OUTPUT_DIR}/similarity_heatmap.png", dpi=150)
    plt.show()
    
    print("\nHeatmap interpretation:")
    print("  - Red/orange = similar blocks (tend to appear in similar surroundings)")
    print("  - Blue = dissimilar blocks (rarely share surroundings)")
else:
    print("Not enough blocks found for heatmap")

---

# Part 11: Summary and Next Steps

## What We Learned

1. **Embeddings** represent blocks as vectors where similar blocks are close together
2. **Skip-gram** learns embeddings by predicting neighbors from center blocks
3. **Negative sampling** makes training efficient by contrasting each true neighbor with a few randomly sampled "wrong" blocks instead of scoring the whole vocabulary
4. **Subsampling** prevents frequent blocks (like air) from dominating training
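
To make points 2 and 3 concrete, here is a minimal NumPy sketch of the skip-gram negative-sampling loss for a single (center, neighbor) pair. It follows the standard word2vec formulation and is an illustration only: the training cells in this notebook may differ in detail, and the names `log_sigmoid` and `sgns_loss` are made up for this example.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def sgns_loss(center, positive, negatives):
    """Skip-gram negative-sampling loss for one (center, neighbor) pair.

    center:    (d,)   embedding of the center block
    positive:  (d,)   context embedding of a true neighbor
    negatives: (k, d) context embeddings of k randomly sampled blocks
    """
    # Reward a high dot product with the true neighbor...
    pos_term = log_sigmoid(center @ positive)
    # ...and a low dot product with each sampled "wrong" block.
    neg_term = log_sigmoid(-(negatives @ center)).sum()
    return -(pos_term + neg_term)

# Random stand-in vectors (not trained embeddings)
rng = np.random.default_rng(0)
center = rng.normal(size=32)
negatives = rng.normal(size=(5, 32))

# A neighbor pointing the same way as the center gives a lower loss
# than a neighbor pointing the opposite way.
loss_aligned = sgns_loss(center, center, negatives)
loss_opposed = sgns_loss(center, -center, negatives)
print(loss_aligned < loss_opposed)
```

Only `k + 1` dot products are needed per training pair, which is why this is much cheaper than a softmax over every block in the vocabulary.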

## What's Next?

These embeddings will be used as input to a **VQ-VAE** (Vector Quantized Variational Autoencoder) in Phase 3:

1. The VQ-VAE encoder compresses 32×32×32 builds into a small latent representation
2. The VQ-VAE decoder reconstructs builds from the latent representation
3. The block embeddings help the model understand that similar blocks are interchangeable
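
As a rough preview of how Phase 3 might consume these embeddings, the sketch below turns a grid of block tokens into a grid of embedding vectors with a single indexing operation. The arrays are random stand-ins rather than the real `block_embeddings.npy` or a real build, and the variable names are hypothetical.

```python
import numpy as np

demo_vocab_size, demo_embedding_dim = 100, 32  # stand-in sizes for this demo
rng = np.random.default_rng(0)

# Stand-in for np.load("block_embeddings.npy"): one row per block token.
embeddings = rng.normal(size=(demo_vocab_size, demo_embedding_dim)).astype(np.float32)

# Stand-in for one build: a 32x32x32 grid of integer block tokens.
build_tokens = rng.integers(0, demo_vocab_size, size=(32, 32, 32))

# Integer-array indexing looks up the embedding row for every voxel at once,
# appending the embedding axis to the grid shape.
build_vectors = embeddings[build_tokens]
print(build_vectors.shape)  # (32, 32, 32, 32): x, y, z, embedding
```

Each voxel now carries a 32-number vector instead of an arbitrary integer ID, so nearby vectors correspond to blocks the model learned to treat as similar.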

## Download Your Results

Make sure to download:
- `block_embeddings.npy` - The learned embeddings
- `block2vec_model.pt` - The full model
- The visualization images

In [None]:
# ============================================================
# CELL 14: Final Summary
# ============================================================

print("="*60)
print("TRAINING COMPLETE!")
print("="*60)
print("\nModel: Block2Vec")
print(f"  Vocabulary size: {VOCAB_SIZE}")
print(f"  Embedding dimension: {EMBEDDING_DIM}")
print(f"  Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print("\nTraining:")
print(f"  Epochs: {EPOCHS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Initial loss: {epoch_losses[0]:.4f}")
print(f"  Final loss: {epoch_losses[-1]:.4f}")
print(f"\nOutput files in {OUTPUT_DIR}:")
print("  - block_embeddings.npy")
print("  - block2vec_model.pt")
print("  - training_stats.json")
print("  - training_loss.png")
print("  - tsne_embeddings.png")
print("  - similarity_heatmap.png")
print("\n" + "="*60)