# Week 5: Host a F.A.I.R. Dataset on Hugging Face

### **TAs: Chiku Parida (chipa@dtu.dk)**

**Outcome:** By the end, you will have published a small *F.A.I.R.* (Findable, Accessible, Interoperable, Reusable) dataset to the Hugging Face Hub and learned how others can discover, cite, and reuse it.

We will:
1. Structure a small materials dataset (CSV + JSONL + optional EXTXYZ)  
2. Author a dataset card (README) with license, citation, and provenance  
3. Validate data against a schema for *R*eusability  
4. Upload to the Hub using `huggingface_hub`
5. Load the dataset back for analysis

> **Note:** You can adapt this template to large atomic datasets, polymer property tables, etc.


| Optional YouTube Videos |
| --- |
| [YouTube Video 1](https://youtu.be/VqqwTz1z1SE "YouTube Video") |
| [YouTube Video 2](https://youtu.be/EVdYKvTdLqw "YouTube Video") |


## 1) Install & Imports


In [None]:
# If needed, install
# !pip install -q huggingface_hub datasets jsonschema ase pandas

In [None]:

from huggingface_hub import login, HfApi, create_repo, upload_folder, hf_hub_download
from datasets import load_dataset
from pathlib import Path
import pandas as pd
import json, getpass, os
from jsonschema import validate, Draft202012Validator
from jsonschema.exceptions import ValidationError
from ase import Atoms
from ase.io import write

## 2) Log in to Hugging Face
Create an account on https://huggingface.co if you don't have one.  
Generate a User Access Token (Settings → Access Tokens → *New token*, scope: **Write**).

In [None]:
if 'HF_TOKEN' not in os.environ:
    token = getpass.getpass("Paste your Hugging Face token (will be hidden): ")
    login(token=token)
else:
    login(token=os.environ['HF_TOKEN'])
print("🔐 Logged in to Hugging Face Hub")

## 3) Build a Tiny Example Materials Dataset (local)
We'll prepare a small dataset about hypothetical oxide materials with (formula, system, bandgap) and optional atomic structures saved in EXTXYZ.

Folder layout (**recommended**):
```
fair_dataset_demo/
├─ data/
│  ├─ table.csv                 # tabular main data
│  ├─ records.jsonl             # line-delimited JSON for provenance
│  └─ structures/
│     ├─ ABO3_0001.xyz          # optional: atomic structures
│     └─ ...
├─ metadata/
│  └─ schema.json               # JSON Schema for validation
├─ LICENSE                      # compatible license
├─ CITATION.cff                 # citation info
└─ README.md                    # Dataset card with YAML frontmatter
```
💡 This structure is simple, discoverable, and compatible with the Hub's dataset hosting.

In [None]:
root = Path("fair_dataset_demo")
data_dir = root / "data"
struct_dir = data_dir / "structures"
meta_dir = root / "metadata"
for p in [data_dir, struct_dir, meta_dir]:
    p.mkdir(parents=True, exist_ok=True)

# Create a tiny CSV table
rows = [
    {"id": "ABO3_0001", "formula": "SrTiO3", "system": "perovskite", "bandgap_eV": 3.2, "structure_path": "data/structures/ABO3_0001.xyz"},
    {"id": "ABO3_0002", "formula": "BaZrO3", "system": "perovskite", "bandgap_eV": 5.0, "structure_path": "data/structures/ABO3_0002.xyz"},
    {"id": "ABO3_0003", "formula": "LaAlO3", "system": "perovskite", "bandgap_eV": 5.6, "structure_path": "data/structures/ABO3_0003.xyz"},
]
df = pd.DataFrame(rows)
df.to_csv(data_dir/"table.csv", index=False)

# Mirror as JSONL
with open(data_dir/"records.jsonl", "w") as f:
    for r in rows:
        f.write(json.dumps(r) + "\n")

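# Create placeholder cubic perovskite structures and save them as EXTXYZ
a = 3.905  # lattice constant in Å (~SrTiO3), reused for every entry as a placeholder
for r in rows:
    # Ideal cubic ABO3 cell: A at the corner, B at the body center, O at the face centers.
    # Build the atoms from the row's own formula so each file matches its metadata.
    atoms = Atoms(
        symbols=r["formula"],
        positions=[(0, 0, 0), (a/2, a/2, a/2), (a/2, a/2, 0), (a/2, 0, a/2), (0, a/2, a/2)],
        cell=[a, a, a],
        pbc=True,
    )
    atoms.info.update({
        "id": r["id"],
        "formula": r["formula"],
        "system": r["system"],
        "bandgap_eV": r["bandgap_eV"],
    })
    # Pass the format explicitly so the file is written as extended XYZ
    write(struct_dir/f"{r['id']}.xyz", atoms, format="extxyz")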

print("📦 Local example dataset created at:", root.resolve())

In [None]:
schema = {
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "SimpleMaterialsRow",
  "type": "object",
  "required": ["id", "formula", "system", "bandgap_eV", "structure_path"],
  "properties": {
    "id": {"type": "string"},
    "formula": {"type": "string"},
    "system": {"type": "string", "enum": ["perovskite", "spinel", "rocksalt", "other"]},
    "bandgap_eV": {"type": "number", "minimum": 0},
    "structure_path": {"type": "string"}
  },
  "additionalProperties": False
}
# File name matches the recommended layout (metadata/schema.json)
(meta_dir/"schema.json").write_text(json.dumps(schema, indent=2))

# Validate the CSV rows
validator = Draft202012Validator(schema)
errors = []
for i, r in df.iterrows():
    try:
        validator.validate(r.to_dict())
    except ValidationError as e:
        errors.append((i, str(e)))
if errors:
    print("❌ Validation errors:")
    for ix, msg in errors:
        print(ix, msg)
else:
    print("All rows pass schema validation")

## 4) Create a Dataset Repo on the Hub & Upload

- **Name:** `fair_dataset_demo`  
- Set `private=False` for teaching visibility (or `True` until review)  
- `upload_folder` handles large files without requiring Git LFS locally
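
Before uploading, a quick pre-flight check can confirm that the folder matches the recommended layout. The helper below (`missing_paths`, a hypothetical name) lists expected files that are absent. The `EXPECTED` list mirrors the layout from section 3 and is an assumption to adjust for your own dataset.

```python
from pathlib import Path

# Relative paths taken from the recommended layout in section 3
EXPECTED = [
    "README.md",
    "LICENSE",
    "CITATION.cff",
    "data/table.csv",
    "data/records.jsonl",
    "metadata/schema.json",
]

def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist as files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]

missing = missing_paths("fair_dataset_demo")
if missing:
    print("Missing before upload:", ", ".join(missing))
else:
    print("Layout complete")
```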

In [None]:
api = HfApi()
username = api.whoami()['name']
repo_name = "fair_dataset_demo"
repo_id = f"{username}/{repo_name}"

# Create (idempotent): if exists, do nothing
create_repo(repo_id=repo_id, repo_type="dataset", private=False, exist_ok=True)

# Upload entire folder
upload_folder(
    repo_id=repo_id,
    repo_type="dataset",
    folder_path=str(root),
    path_in_repo=".",
)
print(f"🚀 Uploaded to https://huggingface.co/datasets/{repo_id}")

## 5) Verify by Loading from the Hub
Demonstrate both `datasets` loading and direct file download.

In [None]:
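print("Attempting to load the CSV via datasets...")
# Use the repo_id from the upload step so you verify your own copy
# (the TA's example repo is "cparidaAI/fair_dataset_demo")
ds = load_dataset(repo_id, data_files={"train": "data/table.csv"})
display(ds)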



In [None]:
# Download a single structure file directly (subfolder + bare filename)
xyz_path = hf_hub_download(
    repo_id=repo_id,                # defined in the upload step
    repo_type="dataset",
    subfolder="data/structures",
    filename="ABO3_0001.xyz",
)
print("Downloaded to:", xyz_path)

In [None]:
from huggingface_hub import snapshot_download

targetdir = "/local/path/downloaded_data"  # placeholder: change to a directory on your machine

local_path = snapshot_download(
    repo_id=repo_id,                # defined in the upload step
    repo_type="dataset",
    local_dir=targetdir,
    allow_patterns=["data/**"],     # only grab files under data/
)
print("Downloaded data/* to:", local_path)
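
One way to confirm that the downloaded copy matches what you uploaded is to compare SHA-256 digests of the two folders. This is a sketch using only the standard library. `sha256_manifest` and `differing_files` are hypothetical helpers, and the directory paths in the usage comment are assumptions based on the cells above.

```python
import hashlib
from pathlib import Path

def sha256_manifest(folder):
    """Map each file's path (relative to folder, POSIX style) to its SHA-256 hex digest."""
    folder = Path(folder)
    manifest = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(folder).as_posix()] = digest
    return manifest

def differing_files(local_dir, downloaded_dir):
    """Return relative paths whose content differs or that exist on only one side."""
    a = sha256_manifest(local_dir)
    b = sha256_manifest(downloaded_dir)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

# Usage (paths assumed from the cells above):
# differing_files("fair_dataset_demo/data", Path(targetdir) / "data")
```

An empty list means the two folders hold byte-identical files. Comparing only the `data/` subfolders sidesteps any bookkeeping files the download may place at the top level of `targetdir`.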
