Faceberg

Bridge HuggingFace datasets with Apache Iceberg tables — no data copying, just metadata.

Faceberg maps HuggingFace datasets to Apache Iceberg tables. Your catalog metadata lives on HuggingFace Spaces with an auto-deployed REST API, and any Iceberg-compatible query engine can access the data.

Installation

pip install faceberg

Quick Start

export HF_TOKEN=your_huggingface_token

# Create a catalog on HuggingFace Hub
faceberg user/mycatalog init

# Add datasets
faceberg user/mycatalog add stanfordnlp/imdb --config plain_text
faceberg user/mycatalog add openai/gsm8k --config main

# Query with interactive DuckDB shell
faceberg user/mycatalog quack
SELECT label, substr(text, 1, 100) as preview
FROM iceberg_catalog.stanfordnlp.imdb
LIMIT 10;

How It Works

HuggingFace Hub
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  ┌─────────────────────┐    ┌─────────────────────────┐ │
│  │  HF Datasets        │    │  HF Spaces (Catalog)    │ │
│  │  (Original Parquet) │◄───│  • Iceberg metadata     │ │
│  │                     │    │  • REST API endpoint    │ │
│  │  stanfordnlp/imdb/  │    │  • faceberg.yml         │ │
│  │   └── *.parquet     │    │                         │ │
│  └─────────────────────┘    └───────────┬─────────────┘ │
│                                         │               │
└─────────────────────────────────────────┼───────────────┘
                                          │ Iceberg REST API
                                          ▼
                              ┌─────────────────────────┐
                              │     Query Engines       │
                              │  DuckDB, Pandas, Spark  │
                              └─────────────────────────┘

No data is copied — only metadata is created. Query with DuckDB, PyIceberg, Spark, or any Iceberg-compatible tool.
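
Because the catalog Space speaks the standard Iceberg REST protocol, plain PyIceberg can connect to it without going through Faceberg at all. A minimal sketch, assuming the catalog URI is the Space URL from the example above and that your HuggingFace token is accepted as the bearer token (both are assumptions, not documented behavior):

import os
from pyiceberg.catalog import load_catalog

# Connect to the catalog's REST endpoint (URI and token handling are assumptions)
catalog = load_catalog(
    "cat",
    **{
        "type": "rest",
        "uri": "https://user-mycatalog.hf.space",
        "token": os.environ.get("HF_TOKEN"),
    },
)

table = catalog.load_table("stanfordnlp.imdb")
print(table.schema())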

Python API

import os
from faceberg import catalog

cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))
table = cat.load_table("stanfordnlp.imdb")
df = table.scan(limit=100).to_pandas()
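
Judging by the scan call above, the object returned by load_table behaves like a regular PyIceberg table (an assumption), so the usual scan options such as a row filter, column projection, and Arrow output should apply. The text and label columns are taken from the IMDB example in the Quick Start:

# Filtered, projected scan returning an Arrow table
arrow_table = table.scan(
    row_filter="label = 1",
    selected_fields=("text", "label"),
    limit=100,
).to_arrow()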

Share Your Catalog

Your catalog is accessible to anyone via the REST API:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")
conn.execute("ATTACH 'https://user-mycatalog.hf.space' AS cat (TYPE ICEBERG)")

result = conn.execute("SELECT * FROM cat.stanfordnlp.imdb LIMIT 5").fetchdf()
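
The diagram above also lists Spark; an Iceberg REST catalog can be registered there with the standard Spark catalog properties. A sketch, assuming Spark 3.5 with a matching iceberg-spark-runtime package and no extra authentication configuration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Iceberg Spark runtime; version must match your Spark/Scala build (assumed 3.5 / 2.12)
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    # Register the catalog Space under the name `cat`, same as the DuckDB example
    .config("spark.sql.catalog.cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.cat.type", "rest")
    .config("spark.sql.catalog.cat.uri", "https://user-mycatalog.hf.space")
    .getOrCreate()
)

spark.sql("SELECT label, count(*) FROM cat.stanfordnlp.imdb GROUP BY label").show()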

Documentation

Read the docs →

Development

git clone https://github.com/kszucs/faceberg
cd faceberg
pip install -e .

License

Apache 2.0
