# Field Indexes

This notebook demonstrates how to build and use **field indexes** for an optimized cache.
An index maps each unique field value to the sample indices that have that value,
enabling fast class-based subsetting without copying datasets.

**Use case**: Get all samples for a specific ImageNet class (or set of classes)
and pass them to `SlipstreamLoader(indices=...)` for Imagenette / ImageNet-100 style subsets.

In [None]:
LITDATA_VAL_PATH = "s3://visionlab-datasets/imagenet1k/pre-processed/s256-l512-jpgbytes-q100-streaming/val/"

## 1. Build the Optimized Cache

First we need a dataset and its optimized cache. If the cache already exists, it is loaded from disk rather than rebuilt, which is much faster.

In [None]:
from slipstream import SlipstreamDataset, OptimizedCache

dataset = SlipstreamDataset(
    remote_dir=LITDATA_VAL_PATH,
    decode_images=False,
)

# Build or load cache
if OptimizedCache.exists(dataset.cache_path):
    cache = OptimizedCache.load(dataset.cache_path)
else:
    cache = OptimizedCache.build(dataset)

cache

## 2. Build a Field Index

`write_index` reads all values for a field and builds a mapping from each unique value
to the array of sample indices that have that value.
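To make the mapping concrete, here is a minimal NumPy sketch of the same idea on a toy label array. The helper name `build_field_index` is hypothetical and this is not slipstream's implementation; it only illustrates what "value to sample indices" means.

```python
import numpy as np

def build_field_index(values):
    """Map each unique value to the ascending array of sample indices holding it."""
    values = np.asarray(values)
    # A stable sort keeps sample indices ascending within each group of equal values
    order = np.argsort(values, kind="stable")
    unique_vals, starts = np.unique(values[order], return_index=True)
    # Split the sorted positions at the first occurrence of each new value
    groups = np.split(order, starts[1:])
    return {val.item(): grp for val, grp in zip(unique_vals, groups)}

labels = np.array([2, 0, 1, 0, 2, 2])  # toy labels, one per sample
index = build_field_index(labels)
print(index[0])  # [1 3]
print(index[2])  # [0 4 5]
```

Building the mapping costs one sort over the field, after which looking up all samples for a value is a single dictionary access.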

In [None]:
from slipstream import write_index

write_index(cache, fields=['label'])

The index is saved as `label_index.npy` in the cache directory and is automatically
discovered on `OptimizedCache.load()`. Since we passed the cache directly, it's also
available immediately:

In [None]:
label_index = cache.get_index('label')

print(f"Unique labels: {len(label_index)}")
print(f"Samples with label 0: {len(label_index[0])}")
print(f"Sample indices for label 0: {label_index[0]}")

## 3. Auto-discovery on Load

If you reload the cache later, indexes are discovered automatically:

In [None]:
cache2 = OptimizedCache.load(dataset.cache_path)
label_index2 = cache2.get_index('label')
print(f"Unique labels (reloaded): {len(label_index2)}")

## 4. Class-Based Subsetting with SlipstreamLoader

The real power: use the index to create a loader for a subset of classes.
This is how you'd create an Imagenette-style subset (10 classes) from ImageNet-1k.
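The subsetting step itself is just a concatenation of per-class index arrays. The sketch below shows it on toy data with plain NumPy; `toy_index` and `wanted` are made-up stand-ins, not slipstream objects.

```python
import numpy as np

# Toy stand-ins: per-sample labels and a value -> indices mapping
labels = np.array([3, 1, 3, 0, 1, 2, 0, 3])
toy_index = {int(v): np.flatnonzero(labels == v) for v in np.unique(labels)}

wanted = [0, 3]  # hypothetical class subset
subset = np.concatenate([toy_index[c] for c in wanted])

# Every selected sample belongs to a wanted class, and none are missed
assert np.isin(labels[subset], wanted).all()
assert len(subset) == np.isin(labels, wanted).sum()
print(subset)  # [3 6 0 2 7]
```

Note that the concatenated array is grouped by class in the order the classes were listed; `np.sort(subset)` restores dataset order if that matters for your use.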

In [None]:
import numpy as np
from slipstream import SlipstreamLoader, DecodeRandomResizedCrop

# Imagenette class labels (10 "easily classifiable" ImageNet classes)
imagenette_labels = [0, 217, 482, 491, 497, 566, 569, 571, 574, 701]

# Get all sample indices for these classes
subset_indices = np.concatenate([label_index[lbl] for lbl in imagenette_labels])
print(f"Imagenette subset: {len(subset_indices)} samples from {len(imagenette_labels)} classes")

# Create a loader for just these samples
loader = SlipstreamLoader(
    dataset,
    batch_size=64,
    indices=subset_indices,
    pipelines={'image': [DecodeRandomResizedCrop(224)]},
    exclude_fields=['path'],
)

loader

In [None]:
loader.cache.cache_dir

In [None]:
batch = next(iter(loader))
print(f"Batch image shape: {batch['image'].shape}")
print(f"Batch labels: {batch['label'][:10]}")
print(f"Unique labels in batch: {sorted(batch['label'].unique().tolist())}")

## 5. Single-Class Loading

You can also load just one class — useful for few-shot learning or debugging.
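For the few-shot case, one option is to draw a fixed number of samples per class from the index before building the loader. The following is a sketch under assumptions: `sample_k_per_class` is a hypothetical helper, and `toy_index` stands in for a real value-to-indices mapping.

```python
import numpy as np

def sample_k_per_class(index, classes, k, seed=0):
    """Pick up to k sample indices per class, without replacement."""
    rng = np.random.default_rng(seed)
    picks = []
    for c in classes:
        candidates = index[c]
        n = min(k, len(candidates))  # a class may have fewer than k samples
        picks.append(rng.choice(candidates, size=n, replace=False))
    return np.concatenate(picks)

toy_index = {0: np.array([1, 3, 8]), 1: np.array([2, 5]), 2: np.array([0, 4, 6, 7])}
few_shot = sample_k_per_class(toy_index, classes=[0, 2], k=2)
print(len(few_shot))  # 4
```

The resulting array can be passed as `indices=` in the same way as the full per-class array used in the next cell.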

In [None]:
# All samples for class 0 (tench)
class_0_indices = label_index[0]
print(f"Class 0 has {len(class_0_indices)} samples")

loader_class0 = SlipstreamLoader(
    dataset,
    batch_size=len(class_0_indices),
    indices=class_0_indices,
    pipelines={'image': [DecodeRandomResizedCrop(224)]},
    exclude_fields=['path'],
    shuffle=False,
)

batch = next(iter(loader_class0))
print(f"All labels are 0: {(batch['label'] == 0).all()}")
print(f"Batch shape: {batch['image'].shape}")

## Summary

**Field indexes** enable fast class-based subsetting:

```python
from slipstream import OptimizedCache, write_index, SlipstreamLoader, DecodeRandomResizedCrop

# Build once (`cache` is an OptimizedCache that has already been built or loaded)
write_index(cache, fields=['label'])

# Use anytime (`cache_dir` and `target_class` are placeholders for your own values)
cache = OptimizedCache.load(cache_dir)  # auto-discovers indexes
indices = cache.get_index('label')[target_class]
loader = SlipstreamLoader(
    dataset,
    batch_size=64,
    indices=indices,
    pipelines={'image': [DecodeRandomResizedCrop(224)]},
)
```

- Index files are stored alongside cache data as `{field}_index.npy`
- Auto-discovered on `OptimizedCache.load()` — no extra config needed
- Works with numeric fields (`int`, `float`) and string fields