# Import data from Hugging Face datasets

Load datasets from Hugging Face Hub into Pixeltable tables for processing with AI models.

## Problem

You want to use a dataset from Hugging Face Hub, whether for fine-tuning, evaluation, or analysis. You need to load it into a format where you can add computed columns, embeddings, or AI transformations.

| Dataset | Size | Use case |
|---------|------|----------|
| imdb | 50K reviews | Sentiment analysis |
| squad | 100K Q&A | RAG evaluation |
| coco | 330K images | Vision model training |

## Solution

**What's in this recipe:**
- Import Hugging Face datasets directly into tables
- Handle datasets with multiple splits (train/test/validation)
- Work with image datasets

Pass a Hugging Face dataset as the `source` parameter of `pxt.create_table()`. Pixeltable automatically maps HF types to Pixeltable column types.

### Setup

In [1]:
%pip install -qU pixeltable datasets

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pixeltable as pxt
from datasets import load_dataset



In [3]:
# Create a fresh directory
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'hf_demo'.


<pixeltable.catalog.dir.Dir at 0x3162f4150>

### Import a single split

Load a specific split from a dataset:

In [4]:
# Load a small subset for demo (first 100 rows of rotten_tomatoes)
hf_dataset = load_dataset('rotten_tomatoes', split='train[:100]')

In [5]:
# Import into Pixeltable
reviews = pxt.create_table(
    'hf_demo.reviews',
    source=hf_dataset
)

Created table 'reviews'.
Inserting rows into `reviews`: 100 rows [00:00, 13303.43 rows/s]
Inserted 100 rows with 0 errors.


In [6]:
# View imported data
reviews.head(5)

text,label
"the rock is destined to be the 21st century's new "" conan "" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .",pos
"the gorgeously elaborate continuation of "" the lord of the rings "" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .",pos
effective but too-tepid biopic,pos
"if you sometimes like to go to the movies to have fun , wasabi is a good place to start .",pos
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",pos


### Import multiple splits

Load several splits in one call, then import each split into its own table so that the table name records which split a row came from:

In [7]:
# Load dataset with multiple splits (small subset for demo)
hf_dataset_dict = load_dataset(
    'rotten_tomatoes',
    split={'train': 'train[:50]', 'test': 'test[:50]'}
)

In [8]:
# Import each split separately for clarity
train_data = pxt.create_table(
    'hf_demo.reviews_train',
    source=hf_dataset_dict['train']
)
test_data = pxt.create_table(
    'hf_demo.reviews_test',
    source=hf_dataset_dict['test']
)

Created table 'reviews_train'.
Inserting rows into `reviews_train`: 50 rows [00:00, 11806.29 rows/s]
Inserted 50 rows with 0 errors.
Created table 'reviews_test'.
Inserting rows into `reviews_test`: 50 rows [00:00, 11534.22 rows/s]
Inserted 50 rows with 0 errors.


In [9]:
# View training data
print("Training data:")
train_data.head(5)

Training data:


text,label
"the rock is destined to be the 21st century's new "" conan "" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .",pos
"the gorgeously elaborate continuation of "" the lord of the rings "" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .",pos
effective but too-tepid biopic,pos
"if you sometimes like to go to the movies to have fun , wasabi is a good place to start .",pos
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",pos


In [10]:
# View test data
print("Test data:")
test_data.head(3)

Test data:


text,label
"lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .",pos
consistently clever and suspenseful .,pos
"it's like a "" big chill "" reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .",pos


### Add AI-powered computed columns

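Computed columns enrich the imported data. This example derives a simple text-length column; model-based columns such as embeddings are added through the same `add_computed_column()` call: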

In [11]:
# Add a computed column for text length.
# The return type of the built-in `len` cannot be inferred, so declare it with `col_type`.
reviews.add_computed_column(text_length=reviews.text.apply(len, col_type=pxt.Int))

In [None]:
# View with computed column
reviews.select(reviews.text, reviews.label, reviews.text_length).head(5)

### Type mapping

Pixeltable automatically maps Hugging Face types to Pixeltable types:

| Hugging Face Type | Pixeltable Type |
|-------------------|-----------------|
| `Value('string')` | `pxt.String` |
| `Value('int64')` | `pxt.Int` |
| `Value('float32')` | `pxt.Float` |
| `ClassLabel` | `pxt.String` |
| `Image` | `pxt.Image` |
| `Sequence` | `pxt.Array` or `pxt.Json` |

To override the inferred type of a column, pass `schema_overrides` to `pxt.create_table()`.

## Explanation

**Why import Hugging Face datasets into Pixeltable:**

1. **Add computed columns** - Enrich data with embeddings, AI analysis, or transformations
2. **Incremental processing** - Add new rows without reprocessing existing data
3. **Persistent storage** - Keep processed results across sessions
4. **Query capabilities** - Filter, aggregate, and join with other tables

**Working with large datasets:**

For very large datasets, consider loading in batches or using streaming mode in the `datasets` library before importing.
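
One way to load in batches is to reuse the split-slicing syntax from the `train[:100]` example above. The sketch below only generates the slice strings. The helper name `split_slices` is hypothetical and not part of Pixeltable or `datasets`, and the total row count is assumed to be known in advance.

```python
def split_slices(split: str, total_rows: int, batch_size: int) -> list[str]:
    """Build slice expressions such as 'train[0:1000]' that together cover total_rows.

    Hypothetical helper for illustration; not part of Pixeltable or `datasets`.
    """
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    return [
        # Clamp the upper bound so the last slice stops at total_rows
        f'{split}[{start}:{min(start + batch_size, total_rows)}]'
        for start in range(0, total_rows, batch_size)
    ]


# Assumed numbers for illustration: 2,500 rows in batches of 1,000
slices = split_slices('train', total_rows=2500, batch_size=1000)
print(slices)
# ['train[0:1000]', 'train[1000:2000]', 'train[2000:2500]']
```

Each string can be passed as the `split` argument of `load_dataset()`, so only one batch is in memory at a time. How each batch is then appended to an existing table depends on your Pixeltable version, so that step is left out here.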

## See also

- [Import CSV files](https://docs.pixeltable.com/howto/cookbooks/data/data-import-csv) - For CSV and Excel imports
- [Semantic text search](https://docs.pixeltable.com/howto/cookbooks/search/search-semantic-text) - Add embeddings to text data
- [Hugging Face integration notebook](https://docs.pixeltable.com/howto/providers/working-with-hugging-face) - Full integration guide