# LiteEncoder Demo on a Real Dataset

This notebook demonstrates how to use `LiteEncoder` from the `afterthoughts` library to generate sentence-chunk embeddings on a real dataset downloaded from Hugging Face.

In [None]:
import torch
from datasets import load_dataset

from afterthoughts import LiteEncoder, configure_logging

configure_logging(level="INFO")

%load_ext autoreload
%autoreload 2

## Load Dataset

We'll use the AG News dataset, a popular news classification dataset with 4 categories: World, Sports, Business, and Sci/Tech.

In [None]:
# Load a subset of AG News for demonstration
dataset = load_dataset("ag_news", split="train[:1000]")
print(f"Loaded {len(dataset)} documents")
print(f"Columns: {dataset.column_names}")
print(f"\nExample document:\n{dataset[0]['text'][:500]}...")

In [None]:
# Extract the text documents
docs = dataset["text"]
labels = dataset["label"]

## Initialize LiteEncoder

`LiteEncoder` is a memory-efficient variant that supports:
- **Quantization options**: `"float16"` (2x compression) or `"binary"` (32x compression), both relative to float32
- **PCA dimensionality reduction** (GPU-accelerated)
- **Dimension truncation**
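
The compression ratios in the list above follow from simple byte counts. The NumPy sketch below is only an illustration, not `LiteEncoder`'s implementation: the random array `emb` stands in for real embeddings, and sign-based thresholding is assumed as the binary scheme (a common choice, but the library may do something different).

```python
import numpy as np

# Hypothetical stand-in for 4 float32 embeddings of dimension 768
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 768)).astype(np.float32)

# float16: 2 bytes per dimension instead of 4
emb_fp16 = emb.astype(np.float16)

# Binary (assumed scheme): keep one bit per dimension (the sign),
# then pack 8 dimensions into each byte
emb_bin = np.packbits(emb > 0, axis=1)

print(f"float32: {emb.nbytes} bytes")
print(f"float16: {emb_fp16.nbytes} bytes ({emb.nbytes // emb_fp16.nbytes}x smaller)")
print(f"binary:  {emb_bin.nbytes} bytes ({emb.nbytes // emb_bin.nbytes}x smaller)")
```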

We use `multi-qa-mpnet-base-dot-v1`, a model trained specifically for semantic search with questions and answers.

In [None]:
# Initialize LiteEncoder without PCA
encoder = LiteEncoder(
    model_name="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    amp=True,  # Enable automatic mixed precision
    quantize="float16",  # Options: None, "float16" (2x), "binary" (32x)
    normalize=True,  # Normalize embeddings to unit length
    device="cuda" if torch.cuda.is_available() else "cpu",
)
print(f"Model loaded on device: {encoder.device}")

## Encode Documents

The `encode()` method extracts sentence-chunk embeddings from documents. Each chunk is a group of consecutive sentences.
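
To make the idea of overlapping sentence chunks concrete, here is a minimal pure-Python sketch. It is an assumption about how a sliding window over sentences could work, not the library's actual algorithm. The helper `chunk_sentences` is hypothetical, and its parameters only mirror the `num_sents` and `chunk_overlap` arguments used in this notebook.

```python
def chunk_sentences(sentences, num_sents=2, chunk_overlap=0.5):
    """Group consecutive sentences into overlapping chunks (illustrative only)."""
    if not sentences:
        return []
    # Number of sentences shared by neighboring chunks, e.g. 2 * 0.5 -> 1
    shared = int(num_sents * chunk_overlap)
    # Always advance by at least one sentence so the window keeps moving
    stride = max(1, num_sents - shared)
    # Last position where a full window still fits (0 for short documents)
    last_start = max(len(sentences) - num_sents, 0)
    return [
        " ".join(sentences[start : start + num_sents])
        for start in range(0, last_start + 1, stride)
    ]


sentences = ["A.", "B.", "C.", "D."]
print(chunk_sentences(sentences, num_sents=2, chunk_overlap=0.5))
# ['A. B.', 'B. C.', 'C. D.']
```

With two sentences per chunk and 50% overlap, the window advances one sentence at a time, so neighboring chunks share one sentence. As a simplification, this sketch drops trailing sentences that do not fill a complete window.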

In [None]:
# Encode documents with 2-sentence chunks
df, embeddings = encoder.encode(
    docs,
    num_sents=2,  # Each chunk contains 2 consecutive sentences
    chunk_overlap=0.5,  # 50% overlap between chunks (1 sentence)
    batch_tokens=8192,  # Tokens per batch
    return_frame="pandas",
)

print(f"Generated {len(df)} chunks from {len(docs)} documents")
print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dtype: {embeddings.dtype}")

In [None]:
# View the results dataframe
df.head(10)

In [None]:
df.head(10).style

## Semantic Search Demo

Let's demonstrate semantic search by encoding a few queries and finding the chunks most similar to each one.
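
Because the encoder was created with `normalize=True`, cosine similarity and the dot product should give the same scores for unit-length vectors. The sketch below checks this on random toy vectors with plain NumPy. The names `toy_chunks` and `toy_query` are made up for illustration and are unrelated to the encoder's real output.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_chunks = rng.normal(size=(5, 8))  # 5 hypothetical chunk vectors
toy_query = rng.normal(size=8)        # 1 hypothetical query vector

# Cosine similarity computed on the raw (unnormalized) vectors
cosine = (toy_chunks @ toy_query) / (
    np.linalg.norm(toy_chunks, axis=1) * np.linalg.norm(toy_query)
)

# Dot product computed after scaling every vector to unit length
chunks_unit = toy_chunks / np.linalg.norm(toy_chunks, axis=1, keepdims=True)
query_unit = toy_query / np.linalg.norm(toy_query)
dot = chunks_unit @ query_unit

print(np.allclose(cosine, dot))  # True

# Indices of the 3 highest-scoring chunks, best first
top_k = np.argsort(-dot)[:3]
print(top_k)
```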

In [None]:
# Encode a few example queries
queries = [
    "stock market performance and financial news",
    "sports championship results",
    "technology innovation and AI",
]

query_embeds = encoder.encode_queries(queries)
print(f"Query embedding shape: {query_embeds.shape}")

In [None]:
from sklearn.neighbors import NearestNeighbors

# Build a nearest-neighbor index with cosine distance (embeddings are normalized, so the ranking matches dot product)
nn = NearestNeighbors(n_neighbors=5, metric="cosine")
nn.fit(embeddings)


def semantic_search(query_embed, top_k=5):
    """Find top-k most similar chunks to a query."""
    distances, indices = nn.kneighbors([query_embed], n_neighbors=top_k)
    # Convert cosine distance to similarity (1 - distance)
    similarities = 1 - distances[0]

    results = []
    for idx, sim in zip(indices[0], similarities, strict=False):
        results.append(
            {
                "chunk": df.iloc[idx]["chunk"],
                "document_idx": df.iloc[idx]["document_idx"],
                "similarity": sim,
            }
        )
    return results

In [None]:
# Search for each query
for i, query in enumerate(queries):
    print(f"\n{'='*60}")
    print(f"Query: '{query}'")
    print("=" * 60)

    results = semantic_search(query_embeds[i], top_k=3)

    for j, result in enumerate(results, 1):
        print(f"\n{j}. [Similarity: {result['similarity']:.4f}]")
        print(f"   Doc #{result['document_idx']}: {result['chunk'][:200]}...")

## Multiple Chunk Sizes

`LiteEncoder` can extract chunks of multiple sizes in a single pass.

In [None]:
# Encode with multiple chunk sizes (1, 2, and 3 sentences per chunk)
df_multi, embeddings_multi = encoder.encode(
    docs[:100],  # Use fewer docs for demo
    num_sents=[1, 2, 3],  # Multiple chunk sizes
    chunk_overlap=0.5,
    batch_tokens=8192,
    return_frame="pandas",
)

print(f"Generated {len(df_multi)} chunks with multiple sizes")
print("\nChunk size distribution:")
print(df_multi["chunk_size"].value_counts().sort_index())

In [None]:
# View chunks of different sizes from the same document
doc_0_chunks = df_multi[df_multi["document_idx"] == 0][["chunk_idx", "chunk_size", "chunk"]]
print("Chunks from document 0:")
doc_0_chunks

## Memory Efficiency

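Let's estimate the memory savings from float16 quantization. No PCA was configured for this encoder, so the savings computed here come from quantization alone.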

In [None]:
# Compare memory usage with float16 quantization
original_dim = 768  # multi-qa-mpnet-base-dot-v1 output dimension
reduced_dim = embeddings.shape[1]

original_bytes_per_embed = original_dim * 4  # float32
reduced_bytes_per_embed = reduced_dim * 2  # float16

num_embeds = len(embeddings)
original_memory_mb = (num_embeds * original_bytes_per_embed) / (1024 * 1024)
reduced_memory_mb = (num_embeds * reduced_bytes_per_embed) / (1024 * 1024)

print(f"Number of embeddings: {num_embeds:,}")
print(f"Original ({original_dim} x float32): {original_memory_mb:.2f} MB")
print(f"With float16 ({reduced_dim} x float16): {reduced_memory_mb:.2f} MB")
print(f"Memory reduction: {(1 - reduced_memory_mb / original_memory_mb) * 100:.1f}%")