# Sample data for training and testing

Create training, validation, and test splits with random or stratified sampling.

## Problem

You have a large dataset and need to create subsets for ML training: random samples for quick experiments, stratified samples for balanced classes, or reproducible splits for benchmarking.

| Need | Method |
|------|--------|
| Quick experiment | Random sample of N rows |
| Balanced classes | Stratified by label |
| Reproducible | Fixed seed |

## Solution

**What's in this recipe:**
- Random sampling with `sample(n=...)`
- Percentage-based sampling with `sample(fraction=...)`
- Stratified sampling with `stratify_by=`

Call `sample()` on a table or query to create random subsets, with optional stratification for a balanced class distribution.

### Setup

In [None]:
%pip install -qU pixeltable

In [None]:
import pixeltable as pxt

In [None]:
# Create a fresh directory
pxt.drop_dir('sampling_demo', force=True)
pxt.create_dir('sampling_demo')

### Create sample dataset

In [None]:
# Create a dataset with labels
data = pxt.create_table(
    'sampling_demo.data',
    {'text': pxt.String, 'label': pxt.String, 'score': pxt.Float}
)

# Insert sample data with imbalanced classes
samples = [
    {'text': 'Great product!', 'label': 'positive', 'score': 0.9},
    {'text': 'Love it', 'label': 'positive', 'score': 0.85},
    {'text': 'Amazing quality', 'label': 'positive', 'score': 0.95},
    {'text': 'Best purchase ever', 'label': 'positive', 'score': 0.88},
    {'text': 'Highly recommend', 'label': 'positive', 'score': 0.92},
    {'text': 'Fantastic!', 'label': 'positive', 'score': 0.91},
    {'text': 'Terrible', 'label': 'negative', 'score': 0.1},
    {'text': 'Waste of money', 'label': 'negative', 'score': 0.15},
    {'text': 'It is okay', 'label': 'neutral', 'score': 0.5},
    {'text': 'Average product', 'label': 'neutral', 'score': 0.55},
]
data.insert(samples)

### Random sampling

In [None]:
# Sample exactly N rows
sample_5 = data.sample(n=5, seed=42).collect()
for row in sample_5:
    # Show which rows were drawn
    print(row['label'], row['text'])

In [None]:
# Sample a percentage of rows
sample_50pct = data.sample(fraction=0.5, seed=42).collect()

### Stratified sampling

In [None]:
# Stratified sampling: 50% from each class
stratified = data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect()
label_counts = {}
for row in stratified:
    label_counts[row['label']] = label_counts.get(row['label'], 0) + 1
for label, count in sorted(label_counts.items()):
    # Report how many rows each class contributed
    print(f'{label}: {count}')

In [None]:
# Equal allocation: N rows from each class
equal_alloc = data.sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect()
for row in equal_alloc:
    # One row per label is expected with n_per_stratum=1
    print(row['label'], row['text'])

### Sampling from filtered data

In [None]:
# Sample from filtered query (high-confidence predictions only)
high_conf_sample = data.where(data.score > 0.8).sample(n=3, seed=42).collect()
for row in high_conf_sample:
    # Every sampled row should have score > 0.8
    print(row['score'], row['text'])

### Persist samples as tables

In [None]:
# Draw samples to persist for dev/test.
# The two samples are drawn independently, so the same row may appear in both.
train_sample = data.sample(fraction=0.8, seed=42)
test_sample = data.sample(fraction=0.2, seed=43)

# Persist as new tables
train_table = pxt.create_table('sampling_demo.train', source=train_sample)
test_table = pxt.create_table('sampling_demo.test', source=test_sample)
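
Because the train and test samples above come from two independent draws, they are not guaranteed to be disjoint. When a benchmark needs non-overlapping splits, one common approach is to shuffle the row identifiers once with a fixed seed and slice the result. The following is a minimal pure-Python sketch of that idea, not Pixeltable API: `split_ids` is a hypothetical helper, and the integer ids stand in for whatever unique key your table has.

```python
import random

def split_ids(ids, train_fraction, seed):
    """Shuffle ids once with a fixed seed, then cut into two disjoint parts.

    Hypothetical helper for illustration; not part of Pixeltable.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Ten placeholder row ids, split 80/20
train_ids, test_ids = split_ids(range(10), train_fraction=0.8, seed=42)
```

The resulting id lists could then drive a filter on a key column, assuming the table has one; every row ends up in exactly one split.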

## Explanation

**Sampling methods:**

| Method | Parameter | Behavior |
|--------|-----------|----------|
| Fixed count | `n=100` | Exactly 100 rows |
| Percentage | `fraction=0.1` | 10% of rows |
| Per-class | `n_per_stratum=10` | 10 from each class |
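
To make the difference between count-based and fraction-based sampling concrete, here is a small pure-Python sketch using only the standard library. It illustrates the semantics in the table and is not how Pixeltable implements `sample()`; `sample_n` and `sample_fraction` are hypothetical helpers, and rounding the fraction to a whole row count is an assumption of this sketch.

```python
import random

def sample_n(rows, n, seed):
    """Draw n rows without replacement (all rows if fewer than n exist)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

def sample_fraction(rows, fraction, seed):
    """Draw a share of the rows; the count is rounded to a whole number."""
    rng = random.Random(seed)
    return rng.sample(rows, round(fraction * len(rows)))

rows = list(range(10))  # stand-in for a 10-row table

fixed = sample_n(rows, 5, seed=42)           # fixed count
half = sample_fraction(rows, 0.5, seed=42)   # share of the table
repeat = sample_n(rows, 5, seed=42)          # same seed, same rows
```

With a count the sample size stays constant as the table grows, while with a fraction it scales with the table.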

**Stratification options:**

| Use case | Parameters |
|----------|------------|
| Proportional | `fraction=0.1, stratify_by=col` |
| Equal allocation | `n_per_stratum=10, stratify_by=col` |
| Reproducible | Add `seed=42` |
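
The two stratification modes differ in how many rows each class contributes. The sketch below reproduces both on the class distribution used in this recipe (6 positive, 2 negative, 2 neutral). It is a standard-library illustration of the idea, not Pixeltable's implementation; `stratified_fraction` and `stratified_per_stratum` are hypothetical names.

```python
import random
from collections import Counter, defaultdict

def group_by(rows, key):
    """Bucket rows by the value of row[key]."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups

def stratified_fraction(rows, key, fraction, seed):
    """Proportional allocation: the same share is drawn from every class."""
    rng = random.Random(seed)
    out = []
    for _, members in sorted(group_by(rows, key).items()):
        out.extend(rng.sample(members, round(fraction * len(members))))
    return out

def stratified_per_stratum(rows, key, n, seed):
    """Equal allocation: up to n rows are drawn from every class."""
    rng = random.Random(seed)
    out = []
    for _, members in sorted(group_by(rows, key).items()):
        out.extend(rng.sample(members, min(n, len(members))))
    return out

labels = ['positive'] * 6 + ['negative'] * 2 + ['neutral'] * 2
rows = [{'id': i, 'label': label} for i, label in enumerate(labels)]

proportional = Counter(r['label'] for r in stratified_fraction(rows, 'label', 0.5, seed=42))
equal = Counter(r['label'] for r in stratified_per_stratum(rows, 'label', 1, seed=42))
```

Proportional allocation preserves the original imbalance in a smaller sample, whereas equal allocation flattens it, which is usually what you want for a class-balanced evaluation set.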

**Tips:**
- Always set `seed` for reproducible experiments
- Use stratified sampling for imbalanced datasets
- Combine with `.where()` to sample from subsets
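
As a rough illustration of the last tip, filtering first and sampling second means the sample can only contain rows that passed the filter. The sketch below mimics that order of operations in plain Python; `filtered_sample` is a hypothetical helper and the rows are a subset of the recipe's sample data.

```python
import random

rows = [
    {'text': 'Great product!', 'score': 0.9},
    {'text': 'Love it', 'score': 0.85},
    {'text': 'Amazing quality', 'score': 0.95},
    {'text': 'Best purchase ever', 'score': 0.88},
    {'text': 'Terrible', 'score': 0.1},
    {'text': 'It is okay', 'score': 0.5},
]

def filtered_sample(rows, predicate, n, seed):
    """Keep rows matching the predicate, then draw up to n of them."""
    candidates = [row for row in rows if predicate(row)]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))

# Filter to high scores, then sample; the same seed gives the same rows
picked = filtered_sample(rows, lambda r: r['score'] > 0.8, n=3, seed=42)
again = filtered_sample(rows, lambda r: r['score'] > 0.8, n=3, seed=42)
```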

## See also

- [Export for ML training](https://docs.pixeltable.com/howto/cookbooks/data/data-export-pytorch) - PyTorch DataLoader export
- [Import Hugging Face datasets](https://docs.pixeltable.com/howto/cookbooks/data/data-import-huggingface) - Load pre-split datasets