## Convert Between HuggingFace and Space Datasets

The [HuggingFace hub](https://huggingface.co/docs/datasets-server/en/parquet) automatically converts every dataset to the Parquet format. These Parquet files can be appended to a Space dataset without rewriting any data files. Such a metadata-only append is fast and low-cost.

Similarly, a Space dataset that stores all fields in Parquet can be converted to a HuggingFace dataset by reusing the existing data files. Users can therefore use Space as a tool for data manipulation, materialized views, and version management of HuggingFace datasets.

The `parquet_files` function below lists the URLs of all Parquet files of a HuggingFace dataset in the hub. It takes the dataset name `ds` (e.g., `ibm/duorc`), an optional list of `splits`, and a `config`. Replace these parameters with the values for your dataset. See more details in the [HuggingFace docs](https://huggingface.co/docs/datasets-server/en/parquet).

In [None]:
import requests
from typing import List, Optional

# Change it to your HuggingFace API token.
API_TOKEN = "my_huggingface_api_token"

def parquet_files(ds: str, splits: Optional[List[str]] = None,
    config: str = "default") -> List[str]:
  """Lists URLs of the hub's Parquet files for the given config and splits."""
  headers = {"Authorization": f"Bearer {API_TOKEN}"}
  # Pass the dataset name via `params` so that it is URL-encoded.
  response = requests.get("https://datasets-server.huggingface.co/parquet",
    params={"dataset": ds}, headers=headers)
  response.raise_for_status()
  body = response.json()
  # `partial` is true when the hub converted only part of the dataset.
  assert not body["partial"], f"Parquet conversion of {ds} is partial"

  splits_set = set(splits) if splits else None
  def filter_(f):
    return (not splits_set or f["split"] in splits_set) and f["config"] == config

  return [f["url"] for f in body["parquet_files"] if filter_(f)]
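
The filtering step can be tried without network access or an API token. The sketch below is a hypothetical helper, `select_parquet_urls`, that applies the same split and config filter to an already parsed response. The `sample` dictionary is made up for illustration: it only mimics the keys the filter reads (`parquet_files`, `config`, `split`, `url`), and its URLs are placeholders.

```python
from typing import Any, Dict, List, Optional


def select_parquet_urls(response: Dict[str, Any],
                        splits: Optional[List[str]] = None,
                        config: str = "default") -> List[str]:
  """Returns URLs of files matching `config` and, if given, `splits`."""
  splits_set = set(splits) if splits else None
  return [
      f["url"] for f in response["parquet_files"]
      if f["config"] == config
      and (splits_set is None or f["split"] in splits_set)
  ]


# Made-up response for illustration; the URLs are placeholders.
sample = {
    "parquet_files": [
        {"config": "ParaphraseRC", "split": "train",
         "url": "https://example.com/ParaphraseRC/train/0000.parquet"},
        {"config": "ParaphraseRC", "split": "test",
         "url": "https://example.com/ParaphraseRC/test/0000.parquet"},
        {"config": "SelfRC", "split": "train",
         "url": "https://example.com/SelfRC/train/0000.parquet"},
    ],
    "partial": False,
}

print(select_parquet_urls(sample, splits=["train"], config="ParaphraseRC"))
```

Omitting `splits` keeps every split of the chosen config. A config name that does not occur in the response yields an empty list rather than an error, so it is worth checking the result before downloading.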

Download the Parquet files to a directory (local or Cloud Storage):

In [None]:
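import os
import subprocess

OUTPUT_DIR = "/path/to/download/files"

for url in parquet_files("ibm/duorc", splits=["train"], config="ParaphraseRC"):
  # Pass the arguments as a list so that the URL is not parsed by a shell;
  # `check=True` raises an error if a download fails.
  subprocess.run(["wget", url, "-P", OUTPUT_DIR], check=True)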
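
Where `wget` is not installed, the download can be done with the Python standard library instead. The following is a minimal sketch, not part of Space or the HuggingFace tooling: the `download` helper is hypothetical, it writes to a local directory only (not Cloud Storage), and it names the local file after the last path segment of the URL.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def download(url: str, output_dir: str) -> Path:
  """Streams `url` into `output_dir` and returns the local file path."""
  out_dir = Path(output_dir)
  out_dir.mkdir(parents=True, exist_ok=True)
  # Use the last segment of the URL path as the local file name.
  target = out_dir / Path(unquote(urlparse(url).path)).name
  # Copy in chunks so that large files are not held in memory.
  with urllib.request.urlopen(url) as resp, open(target, "wb") as f:
    shutil.copyfileobj(resp, f)
  return target
```

It can replace the `wget` call in the loop, e.g. `download(url, OUTPUT_DIR)`. Files from different splits or configs may share a name such as `0000.parquet`, in which case this sketch overwrites the earlier file, so download each split into its own directory.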

Obtain the schema of the Parquet files. It will be used as the Space dataset schema:

In [None]:
import pyarrow.parquet as pq

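# Read the schema from the first Parquet file in the download directory.
file_name = sorted(f for f in os.listdir(OUTPUT_DIR) if f.endswith(".parquet"))[0]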
schema = pq.read_schema(os.path.join(OUTPUT_DIR, file_name))
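
Before reading a schema from a downloaded file, it can help to confirm that the file is a complete Parquet file, because an interrupted download leaves a truncated file behind. A standard, unencrypted Parquet file starts and ends with the 4-byte magic `PAR1`. The `looks_like_parquet` helper below is a hypothetical, standard-library-only check of those bytes; it does not validate the file contents.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path: str) -> bool:
  """Returns True if the file starts and ends with the Parquet magic bytes."""
  # Smallest possible layout: magic + 4-byte footer length + magic.
  if os.path.getsize(path) < 12:
    return False
  with open(path, "rb") as f:
    head = f.read(4)
    f.seek(-4, os.SEEK_END)
    tail = f.read(4)
  return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A file that fails this check is a candidate for re-downloading before it is registered in a dataset.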

Create an empty Space dataset with the same schema as the downloaded Parquet files. Space requires a primary key, but its uniqueness is not enforced when files are appended; uniqueness is required for insert and upsert operations. We simply choose `plot_id` as the primary key for demonstration purposes. `append_parquet` adds the Parquet files to the Space dataset by registering them in its metadata.

In [None]:
from space import Dataset, DirCatalog

catalog = DirCatalog("/path/to/my/tables")
ds = catalog.create_dataset("huggingface_demo",
  schema, primary_keys=["plot_id"], record_fields=[])

# Append existing Parquet files into Space.
# TODO: the files are outside the Space dataset's `data` folder;
# support an option to move/copy these files into the dataset
# folder.
ds.local().append_parquet(f"{OUTPUT_DIR}/*.parquet")

print(ds.local().read_all(fields=["plot_id"]).num_rows)
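
To illustrate what a metadata-only append means, here is a toy model. It is not Space's implementation: the `ToyManifest` class and its methods are hypothetical and exist only to show the idea that an append records file paths and a tag freezes the current file list, while the data files themselves are never read or copied.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyManifest:
  """Tracks which data files belong to a dataset; never opens the files."""
  files: List[str] = field(default_factory=list)
  tags: Dict[str, List[str]] = field(default_factory=dict)

  def append_files(self, paths: List[str]) -> None:
    # Metadata-only append: record the paths, copy nothing.
    for path in paths:
      if path not in self.files:
        self.files.append(path)

  def add_tag(self, name: str) -> None:
    # Freeze a copy of the current file list under a tag name.
    self.tags[name] = list(self.files)

  def files_at(self, tag: str) -> List[str]:
    # Return the file list that was frozen under `tag`.
    return list(self.tags[tag])


manifest = ToyManifest()
manifest.append_files(["a.parquet", "b.parquet"])
manifest.add_tag("v1")
manifest.append_files(["c.parquet"])
print(manifest.files_at("v1"))
print(manifest.files)
```

Because only paths are stored, the cost of an append in this model depends on the number of files and not on their size, which is the property the walkthrough relies on.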

Example of manipulating data in a Space dataset:

In [None]:
import pyarrow.compute as pc

# Delete rows.
ds.local().delete(pc.field("plot_id") == "/m/03vyhn")
ds.add_tag("after_delete")

# Show all versions.
print(ds.versions().to_pandas())

Use the `index_files` method to list Parquet files of a dataset version, and construct a HuggingFace dataset from the files:

In [None]:
from datasets import load_dataset

huggingface_ds = load_dataset("parquet",
  data_files={"train": ds.index_files(version="after_delete")})

print(huggingface_ds.num_rows)