# Lab 1.7: Dataset Exploration

**Objective**: Load and explore datasets from the Hugging Face Hub

**Duration**: 25 minutes

**Prerequisites**: Lab 1.3

## Learning Outcomes
- Load datasets with streaming
- Explore dataset structure and features
- Filter and sample datasets

In [None]:
# Environment setup
import sys
sys.path.insert(0, "../../../src")

from hf_ecosystem import __version__
print(f"hf-ecosystem version: {__version__}")

In [None]:
# Imports
from datasets import load_dataset
from hf_ecosystem.data import stream_dataset
from hf_ecosystem.data.streaming import take_samples, filter_by_length

## 1. Loading Datasets

The `datasets` library loads data from the Hub into memory-mapped Arrow tables, and can optionally stream examples instead of downloading the full dataset first.

In [None]:
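# Load a small IMDB slice (non-streaming is fine for small datasets).
# The IMDB train split appears to be ordered by label, so train[:100] alone
# would likely contain only negative reviews; take rows from both ends instead.
imdb = load_dataset("imdb", split="train[:50]+train[-50:]")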
print(f"Dataset size: {len(imdb)}")
print(f"Features: {imdb.features}")

In [None]:
# View a sample
sample = imdb[0]
print(f"Text: {sample['text'][:200]}...")
print(f"Label: {sample['label']} ({'positive' if sample['label'] == 1 else 'negative'})")

## 2. Streaming Large Datasets

Streaming iterates over a dataset lazily, one example at a time, so you can work with data that doesn't fit in memory without downloading it in full first.

In [None]:
# Stream a large dataset
streamed = stream_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

# Take first 5 samples
samples = take_samples(streamed, n=5)
for i, s in enumerate(samples):
    print(f"Sample {i}: {s['text'][:50]}...")

## 3. Filtering Datasets

In [None]:
# Filter IMDB by label
positive_reviews = imdb.filter(lambda x: x["label"] == 1)
print(f"Positive reviews: {len(positive_reviews)}")

## 4. Exercise: Explore a Dataset

Load the `rotten_tomatoes` dataset and find the average review length.

In [None]:
# YOUR CODE HERE
rt = load_dataset("rotten_tomatoes", split="train")
# Read the text column directly instead of building a dict for every row
avg_length = sum(len(text) for text in rt["text"]) / len(rt)
print(f"Average review length: {avg_length:.1f} characters")

## Verification

In [None]:
def verify_lab():
    """Verify lab completion."""
    assert len(imdb) > 0, "Must load IMDB"
    assert len(samples) == 5, "Must take 5 samples"
    assert len(positive_reviews) > 0, "Must filter reviews"
    print("✅ Lab completed successfully!")

verify_lab()