# Import data from Hugging Face datasets

Load datasets from Hugging Face Hub into Pixeltable tables for processing with AI models.

## Problem

You want to use a dataset from Hugging Face Hub for fine-tuning, evaluation, or analysis. You need to load it into a format where you can add computed columns, embeddings, or AI transformations.

| Dataset | Size | Use case |
|---------|------|----------|
| imdb | 50K reviews | Sentiment analysis |
| squad | 100K Q&A | RAG evaluation |
| coco | 330K images | Vision model training |

## Solution

**What's in this recipe:**
- Import Hugging Face datasets directly into tables
- Handle datasets with multiple splits (train/test/validation)
- Work with image datasets

Pass a Hugging Face dataset to `pxt.create_table()` as the `source` parameter. Pixeltable automatically maps HF types to Pixeltable column types.

### Setup

In [None]:
%pip install -qU pixeltable datasets

In [None]:
import pixeltable as pxt
from datasets import load_dataset

In [None]:
# Create a fresh directory
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')

### Import a single split

Load a specific split from a dataset:

In [None]:
# Load a small subset for demo (first 100 rows of rotten_tomatoes)
hf_dataset = load_dataset('rotten_tomatoes', split='train[:100]')

In [None]:
# Import into Pixeltable
reviews = pxt.create_table(
    'hf_demo.reviews',
    source=hf_dataset
)

In [None]:
# View imported data
reviews.head(5)

### Import multiple splits

Load several splits at once as a `DatasetDict`, then import each split into its own table so that the table name records where each row came from:

In [None]:
# Load dataset with multiple splits (small subset for demo)
hf_dataset_dict = load_dataset(
    'rotten_tomatoes',
    split={'train': 'train[:50]', 'test': 'test[:50]'}
)

In [None]:
# Import each split separately for clarity
train_data = pxt.create_table(
    'hf_demo.reviews_train',
    source=hf_dataset_dict['train']
)
test_data = pxt.create_table(
    'hf_demo.reviews_test',
    source=hf_dataset_dict['test']
)

In [None]:
# View training data
train_data.head(5)

In [None]:
# View test data
test_data.head(3)
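
If you would rather keep all splits in one table, one option is to tag each row with its split name before importing. The following is a minimal sketch in plain Python; `tag_rows_with_split` and the `split` column name are hypothetical and not part of Pixeltable or `datasets`:

```python
from typing import Any, Iterable, Mapping


def tag_rows_with_split(
    splits: Mapping[str, Iterable[Mapping[str, Any]]],
    split_column: str = "split",
) -> list[dict[str, Any]]:
    """Flatten {split_name: rows} into one list, recording the split on each row."""
    tagged: list[dict[str, Any]] = []
    for split_name, rows in splits.items():
        for row in rows:
            # Refuse to overwrite an existing column with the same name.
            if split_column in row:
                raise ValueError(f"rows already contain a {split_column!r} column")
            tagged.append({**row, split_column: split_name})
    return tagged
```

A `DatasetDict` behaves like a dictionary of splits, and iterating over a split yields one dictionary per row, so `tag_rows_with_split(hf_dataset_dict)` should produce rows ready for import. Whether `pxt.create_table()` accepts such a list directly as `source`, and how it infers column types from it, is an assumption to check against the Pixeltable documentation.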

### Add AI-powered computed columns

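Once imported, the table accepts computed columns. This example adds a simple text-length column; model-based functions such as embeddings attach the same way: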

In [None]:
# Add a computed column for text length
reviews.add_computed_column(text_length=reviews.text.apply(len, col_type=pxt.Int))

In [None]:
# View with computed column
reviews.select(reviews.text, reviews.label, reviews.text_length).head(5)
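
The same `apply` pattern should work with any plain Python function. The following is a minimal sketch, assuming a hypothetical `word_count` helper and `add_word_count` wrapper (neither is part of Pixeltable); the `reviews.text.apply(..., col_type=pxt.Int)` call mirrors the text-length cell above:

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens in a review."""
    return len(text.split())


def add_word_count(reviews) -> None:
    """Attach word_count as a computed column on the given table.

    Assumes `reviews` is a Pixeltable table with a `text` column,
    such as the one created earlier in this recipe.
    """
    # Imported lazily so word_count stays usable without Pixeltable installed.
    import pixeltable as pxt

    reviews.add_computed_column(
        word_count=reviews.text.apply(word_count, col_type=pxt.Int)
    )
```

Calling `add_word_count(reviews)` once is enough: like `text_length`, the column is then available in `reviews.select(...)`.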

### Type mapping

Pixeltable automatically maps Hugging Face types to Pixeltable types:

| Hugging Face Type | Pixeltable Type |
|-------------------|-----------------|
| `Value('string')` | `pxt.String` |
| `Value('int64')` | `pxt.Int` |
| `Value('float32')` | `pxt.Float` |
| `ClassLabel` | `pxt.String` |
| `Image` | `pxt.Image` |
| `Sequence` | `pxt.Array` or `pxt.Json` |

Use `schema_overrides` to customize type mapping when needed.
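
As a rough illustration, the sketch below passes `schema_overrides` to `pxt.create_table()` after checking that every override names a real column. The helpers `unknown_override_columns` and `import_with_overrides` are hypothetical, and which override types Pixeltable accepts for a given Hugging Face feature is an assumption to verify against the Pixeltable documentation:

```python
from typing import Any, Iterable, Mapping


def unknown_override_columns(
    column_names: Iterable[str], overrides: Mapping[str, Any]
) -> list[str]:
    """Return override keys that do not match any dataset column, sorted."""
    known = set(column_names)
    return sorted(name for name in overrides if name not in known)


def import_with_overrides(path: str, hf_dataset, overrides: Mapping[str, Any]):
    """Import a Hugging Face dataset, overriding the inferred column types."""
    # Imported lazily so the validation helper works without Pixeltable installed.
    import pixeltable as pxt

    # `column_names` is the list of columns on a `datasets.Dataset`.
    unknown = unknown_override_columns(hf_dataset.column_names, overrides)
    if unknown:
        raise KeyError(f"schema_overrides names unknown columns: {unknown}")
    return pxt.create_table(path, source=hf_dataset, schema_overrides=dict(overrides))
```

For example, `import_with_overrides('hf_demo.reviews_typed', hf_dataset, {'text': pxt.String})` would pin the `text` column to `pxt.String`; the table name `hf_demo.reviews_typed` is illustrative only.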

## Explanation

**Why import Hugging Face datasets into Pixeltable:**

1. **Add computed columns** - Enrich data with embeddings, AI analysis, or transformations
2. **Incremental processing** - Add new rows without reprocessing existing data
3. **Persistent storage** - Keep processed results across sessions
4. **Query capabilities** - Filter, aggregate, and join with other tables

**Working with large datasets:**

For very large datasets, consider loading in batches or using streaming mode in the `datasets` library before importing.
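
Batch loading can be sketched as below. The helpers `batched_rows` and `insert_in_batches` are hypothetical; the sketch assumes the target table already exists, that its columns match the row keys and value types, and that `table.insert()` accepts a list of row dictionaries:

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def batched_rows(
    rows: Iterable[dict[str, Any]], batch_size: int
) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    # islice pulls the next batch_size rows; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def insert_in_batches(table, rows: Iterable[dict[str, Any]], batch_size: int = 1000) -> int:
    """Insert rows into the table one batch at a time; return the row count."""
    total = 0
    for batch in batched_rows(rows, batch_size):
        table.insert(batch)  # assumption: insert() takes a list of dicts
        total += len(batch)
    return total
```

With streaming mode, `load_dataset('rotten_tomatoes', split='train', streaming=True)` returns an iterable of row dictionaries that can be passed as `rows`, so the full dataset never has to fit in memory. Streamed rows carry raw values (for example, integer class labels), so the table schema needs to match those values rather than the types produced by a direct import.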

## See also

- [Import CSV files](https://docs.pixeltable.com/howto/cookbooks/data/data-import-csv) - For CSV and Excel imports
- [Semantic text search](https://docs.pixeltable.com/howto/cookbooks/search/search-semantic-text) - Add embeddings to text data
- [Hugging Face integration notebook](https://docs.pixeltable.com/howto/providers/working-with-hugging-face) - Full integration guide