# MMathCoT-1M | Local Downloader, Validator, Packager
---

This notebook downloads every image archive referenced by `URSA-MATH/MMathCoT-1M`, validates coverage against the dataset's `image_url` column, builds a manifest, and packs the images into **tar shards** in which each member's **arcname equals its `image_url`**. An optional final step uploads the shards and manifest to a Hugging Face dataset repository.
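
Because each tar member is stored under its `image_url`, a consumer can use the dataset's `image_url` value directly as the member name. The sketch below illustrates the idea with the standard `tarfile` module. The helper names `read_member` and `demo`, and the example URL, are hypothetical and not part of the notebook.

```python
import tarfile
import tempfile
from pathlib import Path


def read_member(shard_path: Path, image_url: str) -> bytes:
    """Return the raw bytes stored under `image_url` in a tar shard."""
    with tarfile.open(shard_path, "r") as tar:
        member = tar.extractfile(image_url)  # raises KeyError if the name is absent
        if member is None:  # directories and special files have no data stream
            raise ValueError(f"{image_url} is not a regular file")
        return member.read()


def demo() -> bytes:
    """Pack one fake image under a hypothetical URL and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "local_name.png"
        src.write_bytes(b"fake-png-bytes")
        shard = Path(tmp) / "images-000.tar"
        with tarfile.open(shard, "w") as tar:
            # arcname decouples the stored name from the local file name
            tar.add(src, arcname="Geo170K/images/example.png")
        return read_member(shard, "Geo170K/images/example.png")
```

Opening a shard for every lookup rescans the archive, so for many lookups it is probably better to keep the tar open or to build a name index once.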

## Table of Contents
---
1. [Install Dependencies](#1-install-dependencies)
2. [Configuration](#2-configuration)
3. [Load Dataset Metadata](#3-load-dataset-metadata)
4. [Download and Extract Sources](#4-download-and-extract-sources)
5. [Per-Source Validation](#5-per-source-validation)
6. [Global Manifest (URL → abs\_path)](#6-global-manifest-url--abs_path)
7. [Sharding: Build URL-Aligned Tar Files](#7-sharding-build-url-aligned-tar-files)
8. [Quick Verification](#8-quick-verification)
9. [Optional: Upload to Hugging Face](#9-optional-upload-to-hugging-face)

### 1. Install Dependencies
---

In [5]:
%%capture
%pip -q install ipywidgets python-dotenv huggingface_hub datasets py7zr tqdm pandas pillow

### 2. Configuration
---

In [28]:
import os
from pathlib import Path
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv(usecwd=True), override=False)

# Root directory for downloads and outputs
BASE_DIR = Path("/Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images")
HF_CACHE = BASE_DIR / "_hf_cache"
TEMP_DIR = BASE_DIR / "_temp"
SHARDS_DIR = BASE_DIR / "shards"

# Target payload per tar shard (sum of file sizes; tar headers and padding are extra)
MAX_BYTES_PER_SHARD = 5_000_000_000  # 5 GB (decimal)

# Hugging Face dataset upload settings
HF_DATASET_REPO_ID = "michaelburiek/mmathcot1m-images"
HF_TOKEN = os.getenv("HF_TOKEN", "")
UPLOAD_TO_HF = False  # set True only when you're ready to upload

# Create directories
for d in [BASE_DIR, HF_CACHE, TEMP_DIR, SHARDS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print(f"[CONFIG] BASE_DIR   = {BASE_DIR}")
print(f"[CONFIG] HF_CACHE   = {HF_CACHE}")
print(f"[CONFIG] TEMP_DIR   = {TEMP_DIR}")
print(f"[CONFIG] SHARDS_DIR = {SHARDS_DIR}")

if UPLOAD_TO_HF and not HF_TOKEN:
    raise RuntimeError("UPLOAD_TO_HF=True but HF_TOKEN is not set in .env")

[CONFIG] BASE_DIR   = /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images
[CONFIG] HF_CACHE   = /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/_hf_cache
[CONFIG] TEMP_DIR   = /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/_temp
[CONFIG] SHARDS_DIR = /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards


### 3. Load Dataset Metadata
---

In [6]:
from datasets import load_dataset

print("[INFO] Loading MMathCoT-1M dataset metadata...")
dataset = load_dataset("URSA-MATH/MMathCoT-1M", split="train")
print(f"[SUCCESS] Loaded {len(dataset):,} entries")

# Read the column directly; indexing row by row decodes every field of each record
unique_urls = sorted({u for u in dataset["image_url"] if u})
print(f"[INFO] Unique image URLs: {len(unique_urls):,}")

[INFO] Loading MMathCoT-1M dataset metadata...
[SUCCESS] Loaded 1,019,059 entries
[INFO] Unique image URLs: 574,534


### 4. Download and Extract Sources
---

In [7]:
import shutil, zipfile, py7zr
from tqdm.auto import tqdm
from huggingface_hub import hf_hub_download

def count_images(path: Path):
    exts = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp', '.tif', '.tiff'}
    return sum(1 for p in path.rglob("*") if p.is_file() and p.suffix.lower() in exts)

def extract_7z(archive_path: Path, extract_dir: Path):
    extract_dir.mkdir(parents=True, exist_ok=True)
    with py7zr.SevenZipFile(archive_path, mode='r') as z:
        z.extractall(path=extract_dir)

def extract_zip(zip_path: Path, dest_dir: Path):
    temp = dest_dir.parent / (dest_dir.name + "_tmp")
    if temp.exists():
        shutil.rmtree(temp)
    temp.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, 'r') as z:
        names = z.namelist()
        for name in tqdm(names, desc=f"Extracting {zip_path.name}"):
            z.extract(name, temp)
    if dest_dir.exists():
        shutil.rmtree(dest_dir)
    shutil.move(str(temp), str(dest_dir))

# Geo170K (URSA-MATH/MMathCoT-1M: Geo170K.7z)
print("\n" + "="*80)
print("DATASET 1/6: Geo170K")
print("="*80)
g_dir = BASE_DIR / "Geo170K"
g_marker = g_dir / ".download_complete"
if not g_marker.exists():
    arch = hf_hub_download("URSA-MATH/MMathCoT-1M", filename="Geo170K.7z", repo_type="dataset", cache_dir=str(HF_CACHE))
    extract_7z(Path(arch), g_dir)
    print(f"[SUCCESS] Extracted ~{count_images(g_dir):,} images")
    g_marker.touch()
else:
    print("[SKIP] Geo170K already present.")

# MathV-360k (Zhiqiang007/MathV360K: data_images.zip)
print("\n" + "="*80)
print("DATASET 2/6: MathV-360k")
print("="*80)
m_dir = BASE_DIR / "MathV-360k"
m_marker = m_dir / ".download_complete"
if not m_marker.exists():
    zpath = hf_hub_download("Zhiqiang007/MathV360K", filename="data_images.zip", repo_type="dataset", cache_dir=str(HF_CACHE))
    extract_zip(Path(zpath), m_dir)
    print(f"[SUCCESS] Extracted ~{count_images(m_dir):,} images")
    m_marker.touch()
else:
    print("[SKIP] MathV-360k already present.")

# VarsityTutors (URSA-MATH/MMathCoT-1M: VarsityTutors.7z)
print("\n" + "="*80)
print("DATASET 3/6: VarsityTutors")
print("="*80)
v_dir = BASE_DIR / "VarsityTutors"
v_marker = v_dir / ".download_complete"
if not v_marker.exists():
    arch = hf_hub_download("URSA-MATH/MMathCoT-1M", filename="VarsityTutors.7z", repo_type="dataset", cache_dir=str(HF_CACHE))
    extract_7z(Path(arch), v_dir)
    print(f"[SUCCESS] Extracted ~{count_images(v_dir):,} images")
    v_marker.touch()
else:
    print("[SKIP] VarsityTutors already present.")

# DataEngine_Geometry (URSA-MATH/MMathCoT-1M: DataEngine_Geometry.7z)
print("\n" + "="*80)
print("DATASET 4/6: DataEngine_Geometry")
print("="*80)
d_dir = BASE_DIR / "DataEngine_Geometry"
d_marker = d_dir / ".download_complete"
if not d_marker.exists():
    arch = hf_hub_download("URSA-MATH/MMathCoT-1M", filename="DataEngine_Geometry.7z", repo_type="dataset", cache_dir=str(HF_CACHE))
    extract_7z(Path(arch), d_dir)
    print(f"[SUCCESS] Extracted ~{count_images(d_dir):,} images")
    d_marker.touch()
else:
    print("[SKIP] DataEngine_Geometry already present.")

# Multimath (pengshuai-rin/multimath-300k: images.zip)
print("\n" + "="*80)
print("DATASET 5/6: Multimath")
print("="*80)
mm_dir = BASE_DIR / "Multimath"
mm_marker = mm_dir / ".download_complete"
if not mm_marker.exists():
    zpath = hf_hub_download("pengshuai-rin/multimath-300k", filename="images.zip", repo_type="dataset", cache_dir=str(HF_CACHE))
    extract_zip(Path(zpath), mm_dir)
    print(f"[SUCCESS] Extracted ~{count_images(mm_dir):,} images")
    mm_marker.touch()
else:
    print("[SKIP] Multimath already present.")

# Mavis_Extra (URSA-MATH/MMathCoT-1M: Mavis_Extra.7z)
print("\n" + "="*80)
print("DATASET 6/6: Mavis_Extra (partial in HF)")
print("="*80)
me_dir = BASE_DIR / "Mavis_Extra"
me_marker = me_dir / ".download_complete"
if not me_marker.exists():
    arch = hf_hub_download("URSA-MATH/MMathCoT-1M", filename="Mavis_Extra.7z", repo_type="dataset", cache_dir=str(HF_CACHE))
    extract_7z(Path(arch), me_dir)
    print(f"[SUCCESS] Extracted ~{count_images(me_dir):,} images")
    me_marker.touch()
else:
    print("[SKIP] Mavis_Extra already present.")

print("\n[NOTE] The Hugging Face version of Mavis_Extra is intentionally partial (primarily 'function_wo').")
print("[NOTE] For complete MAVIS, use the original Google Drive source and place under BASE_DIR/Mavis_Extra.")


DATASET 1/6: Geo170K
[SUCCESS] Extracted ~12,031 images

DATASET 2/6: MathV-360k


Extracting data_images.zip: 100%|██████████| 38785/38785 [00:10<00:00, 3667.75it/s]


[SUCCESS] Extracted ~38,734 images

DATASET 3/6: VarsityTutors
[SUCCESS] Extracted ~80,639 images

DATASET 4/6: DataEngine_Geometry
[SUCCESS] Extracted ~193,914 images

DATASET 5/6: Multimath


Extracting images.zip: 100%|██████████| 353077/353077 [00:48<00:00, 7240.74it/s]


[SUCCESS] Extracted ~353,076 images

DATASET 6/6: Mavis_Extra (partial in HF)
  2025-10-13T01:13:22.635389Z  WARN  Reqwest(reqwest::Error { kind: Request, url: "https://transfer.xethub.hf.co/xorbs/default/e0debc5fa67f4ba6cb19b04fb39ff12790228b0246b45cfa7b9702ca5f44fe54?…
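
The six download blocks above repeat one pattern: skip the source if a `.download_complete` marker exists, otherwise fetch the archive, extract it, and write the marker. A minimal sketch of that pattern as a single helper follows. The name `ensure_source` and the callable parameters `fetch` and `extract` are hypothetical, introduced only for illustration.

```python
from pathlib import Path
from typing import Callable


def ensure_source(
    dest: Path,
    fetch: Callable[[], Path],
    extract: Callable[[Path, Path], None],
) -> bool:
    """Fetch and extract into `dest` unless a completion marker exists.

    Returns True if work was done, False if the source was skipped.
    """
    marker = dest / ".download_complete"
    if marker.exists():
        return False
    archive = fetch()        # e.g. a wrapper around hf_hub_download
    extract(archive, dest)   # e.g. extract_7z or extract_zip
    dest.mkdir(parents=True, exist_ok=True)
    marker.touch()           # written last, so an interrupted run is retried
    return True
```

With this helper, each source would reduce to one call that passes a small function wrapping `hf_hub_download` as `fetch` and either `extract_7z` or `extract_zip` as `extract`.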

### 5. Per-Source Validation
---

In [8]:
from collections import defaultdict
from urllib.parse import urlparse, unquote

def validate_source(source_name: str, source_root: Path, ds) -> dict:
    exts = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp', '.tif', '.tiff'}
    # Build filename index for this source
    files = [p for p in source_root.rglob("*") if p.is_file() and p.suffix.lower() in exts]
    index = defaultdict(list)
    for p in files:
        index[p.name].append(p)

    # Count expected from dataset
    expected = 0
    found = 0
    missing = []

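    for u in ds["image_url"]:  # column access avoids decoding whole rows
        if not u or not u.startswith(source_name + "/"):
            continue
        expected += 1
        # Match on file name only; the directory part of the URL is not checked here
        fname = Path(unquote(urlparse(u).path)).name
        if fname in index: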
            found += 1
        else:
            missing.append(u)

    rate = (found / expected * 100) if expected else 0.0
    return {"source": source_name, "expected": expected, "found": found, "rate": rate, "missing": missing}

sources = {
    "Geo170K": (BASE_DIR / "Geo170K" / "Geo170K"),
    "MathV-360k": (BASE_DIR / "MathV-360k" / "data_images"),
    "VarsityTutors": (BASE_DIR / "VarsityTutors" / "VarsityTutors"),
    "DataEngine_Geometry": (BASE_DIR / "DataEngine_Geometry" / "DataEngine_Geometry"),
    "Multimath": (BASE_DIR / "Multimath" / "RGB_images"),
    "Mavis_Extra": (BASE_DIR / "Mavis_Extra" / "Mavis_Extra")
}

summary = []
for name, root in sources.items():
    if root.exists():
        res = validate_source(name, root, dataset)
        summary.append(res)
        print(f"{name:<22} expected={res['expected']:,} found={res['found']:,} rate={res['rate']:.2f}%")
        if res["missing"]:
            miss_path = BASE_DIR / f"{name}_missing.txt"
            miss_path.write_text("\n".join(res["missing"]) + "\n")
            print(f"  -> wrote missing list: {miss_path}")
    else:
        print(f"{name:<22} [WARN] root not found: {root}")

overall_expected = sum(s["expected"] for s in summary)
overall_found = sum(s["found"] for s in summary)
overall_rate = (overall_found / overall_expected * 100) if overall_expected else 0.0
print("-"*80)
print(f"OVERALL          expected={overall_expected:,} found={overall_found:,} rate={overall_rate:.2f}%")

Geo170K                expected=52,196 found=52,196 rate=100.00%
MathV-360k             expected=319,619 found=319,619 rate=100.00%
VarsityTutors          expected=55,162 found=55,162 rate=100.00%
DataEngine_Geometry    expected=186,471 found=186,471 rate=100.00%
Multimath              expected=264,205 found=264,205 rate=100.00%
Mavis_Extra            expected=141,406 found=141,406 rate=100.00%
--------------------------------------------------------------------------------
OVERALL          expected=1,019,059 found=1,019,059 rate=100.00%
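
The check above matches on file name alone, so a file sitting in an unexpected subfolder still counts as found. A stricter variant compares the full relative path. The sketch below assumes, as Section 6 does, that the part of the URL after `<source_name>/` mirrors the path under the source root. The helper name `find_missing_by_path` is hypothetical.

```python
from pathlib import Path
from typing import Iterable


def find_missing_by_path(
    source_name: str, source_root: Path, urls: Iterable[str]
) -> list[str]:
    """Return URLs under `source_name/` whose exact relative path is not on disk."""
    prefix = source_name + "/"
    missing = []
    for url in urls:
        if not url.startswith(prefix):
            continue  # belongs to a different source
        rel = url[len(prefix):]
        if not (source_root / rel).is_file():
            missing.append(url)
    return missing
```

Running both checks side by side would show whether any URL resolves only by file name, which would point to a root folder that does not line up with the URL layout.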


### 6. Global Manifest (URL → abs_path)
---

In [13]:
from pathlib import Path
import pandas as pd

# Reuse: BASE_DIR, unique_urls already defined.

# Point each source to the folder whose *relative path* matches the URL after "<source_name>/"
# (i.e., no extra duplicated prefix in the key)
sources_manifest = {
    "Geo170K":             BASE_DIR / "Geo170K" / "Geo170K",
    "MathV-360k":          BASE_DIR / "MathV-360k" / "data_images",
    "VarsityTutors":       BASE_DIR / "VarsityTutors" / "VarsityTutors",
    "DataEngine_Geometry": BASE_DIR / "DataEngine_Geometry" / "DataEngine_Geometry",
    "Multimath":           BASE_DIR / "Multimath",                 # URLs include "RGB_images/..."
    "Mavis_Extra":         BASE_DIR / "Mavis_Extra" / "Mavis_Extra",
}

EXTS = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp', '.tif', '.tiff'}

def build_url_to_abs_index(sources_map: dict[str, Path]) -> dict[str, Path]:
    idx: dict[str, Path] = {}
    for source_name, root in sources_map.items():
        root = Path(root)
        if not root.exists():
            print(f"[WARN] Source root missing: {source_name} -> {root}")
            continue
        for p in root.rglob("*"):
            if p.is_file() and p.suffix.lower() in EXTS:
                rel_under_root = p.relative_to(root).as_posix()
                url_key = f"{source_name}/{rel_under_root}"
                # keep first occurrence deterministically
                if url_key not in idx:
                    idx[url_key] = p
    return idx

# Build mapping and resolve
url_to_abs = build_url_to_abs_index(sources_manifest)

rows, unresolved = [], []
for u in unique_urls:
    p = url_to_abs.get(u)
    if p and p.exists() and p.suffix.lower() in EXTS:
        rows.append({
            "image_url": u,
            "rel_path": u,        # arcname inside tar must equal image_url
            "abs_path": str(p),
            "source": u.split("/", 1)[0],
        })
    else:
        unresolved.append(u)

print(f"[INFO] Resolved:   {len(rows):,}")
print(f"[INFO] Unresolved: {len(unresolved):,}")

if unresolved:
    out = BASE_DIR / "unresolved_image_urls.txt"
    out.write_text("\n".join(unresolved) + "\n")
    print(f"[WARN] Wrote unresolved list to: {out}")
    print("First 10 unresolved:")
    for x in unresolved[:10]:
        print("  -", x)

# Save manifest
manifest_csv = BASE_DIR / "manifest.csv"
manifest_parquet = BASE_DIR / "manifest.parquet"
df = pd.DataFrame(rows)
df.to_csv(manifest_csv, index=False)
try:
    df.to_parquet(manifest_parquet, index=False)
    print(f"[INFO] Wrote manifest: {manifest_csv}")
    print(f"[INFO] Wrote manifest: {manifest_parquet}")
except Exception as e:
    print(f"[INFO] Wrote manifest: {manifest_csv}")
    print("[WARN] Parquet write failed:", e)

[INFO] Resolved:   574,534
[INFO] Unresolved: 0
[INFO] Wrote manifest: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/manifest.csv
[INFO] Wrote manifest: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/manifest.parquet


### 7. Sharding: Build URL-Aligned Tar Files
---

In [14]:
import tarfile
from pathlib import Path

# 1) Clean out any old shards
deleted = 0
for p in SHARDS_DIR.glob("*.tar"):
    try:
        p.unlink()
        deleted += 1
    except Exception as e:
        print("[WARN] Could not delete", p, e)
print(f"[INFO] Deleted {deleted} old tar(s)")

# 2) Size-based sharding
def shard_by_bytes(rows, max_bytes: int):
    shards, current, used = [], [], 0
    for r in rows:
        try:
            sz = Path(r["abs_path"]).stat().st_size
        except FileNotFoundError:
            continue
        if current and (used + sz > max_bytes):
            shards.append(current)
            current, used = [], 0
        current.append(r)
        used += sz
    if current:
        shards.append(current)
    return shards

# Sort for deterministic output (use rel_path which is the URL)
rows_sorted = sorted(rows, key=lambda r: r["rel_path"])
shards = shard_by_bytes(rows_sorted, MAX_BYTES_PER_SHARD)
print(f"[INFO] Creating {len(shards)} shard(s) with max_bytes={MAX_BYTES_PER_SHARD:,}")

# Normalize tar member metadata for reproducibility
def _normalize_tarinfo(ti: tarfile.TarInfo) -> tarfile.TarInfo:
    ti.uid = 0
    ti.gid = 0
    ti.uname = "root"
    ti.gname = "root"
    ti.mtime = 0
    ti.mode = ti.mode & 0o777
    return ti

# 3) Write shards
total = 0
for i, group in enumerate(shards):
    tar_path = SHARDS_DIR / f"images-{i:03d}.tar"
    with tarfile.open(tar_path, "w", format=tarfile.PAX_FORMAT) as tar:
        for r in group:
            src = Path(r["abs_path"])
            arc = r["rel_path"]  # must equal image_url
            tar.add(src, arcname=arc, filter=_normalize_tarinfo)
    print(f"[INFO] Wrote {tar_path} ({len(group)} files)")
    total += len(group)

print(f"[DONE] Files packed: {total}")

[INFO] Deleted 0 old tar(s)
[INFO] Creating 6 shard(s) with max_bytes=5,000,000,000
[INFO] Wrote /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-000.tar (95538 files)
[INFO] Wrote /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-001.tar (45123 files)
[INFO] Wrote /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-002.tar (31963 files)
[INFO] Wrote /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-003.tar (29877 files)
[INFO] Wrote /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-004.tar (49454 files)
[INFO] Wrote /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-005.tar (322579 files)
[DONE] Files packed: 574534
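
`MAX_BYTES_PER_SHARD` bounds only the sum of file sizes, so the archives can come out somewhat larger than the cap; the upload log in Section 9 reports shards of 5.02 to 5.07 GB. One plausible contributor is tar framing. The sketch below models that overhead under the assumption that every member needs exactly one 512-byte header (short ASCII names, no PAX extended records). The names `estimate_tar_size`, `_zero_metadata`, and `demo` are hypothetical.

```python
import os
import tarfile
import tempfile
from pathlib import Path

BLOCK = 512      # tar header size and data block size
RECORD = 10240   # Python's tarfile pads the archive to a multiple of this


def estimate_tar_size(file_sizes: list[int]) -> int:
    """Estimate archive size when every member needs only one header block."""
    total = 0
    for size in file_sizes:
        data_blocks = -(-size // BLOCK)        # ceiling division
        total += BLOCK + data_blocks * BLOCK   # header + padded data
    total += 2 * BLOCK                         # end-of-archive marker
    return -(-total // RECORD) * RECORD        # pad to a full record


def _zero_metadata(ti: tarfile.TarInfo) -> tarfile.TarInfo:
    """Zero owner and time fields so no extended header is required."""
    ti.uid = ti.gid = 0
    ti.uname = ti.gname = "root"
    ti.mtime = 0
    return ti


def demo() -> tuple[int, int]:
    """Return (estimated, actual) size for a small throwaway archive."""
    sizes = [1, 512, 700, 4096, 12345]
    with tempfile.TemporaryDirectory() as tmp:
        tar_path = Path(tmp) / "demo.tar"
        with tarfile.open(tar_path, "w", format=tarfile.PAX_FORMAT) as tar:
            for i, size in enumerate(sizes):
                src = Path(tmp) / f"f{i}.bin"
                src.write_bytes(b"\0" * size)
                tar.add(src, arcname=f"src/f{i}.bin", filter=_zero_metadata)
        return estimate_tar_size(sizes), os.path.getsize(tar_path)
```

If a hard upper bound on archive size matters, the per-file cost `512 + padded size` could be used in place of the raw file size when accumulating `used` in `shard_by_bytes`.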


### 8. Quick Verification
---

In [15]:
from PIL import Image
import random

sample = random.sample(rows, min(25, len(rows)))
ok, bad = 0, 0
for r in sample:
    try:
        with Image.open(r["abs_path"]) as im:
            im.verify()
        with Image.open(r["abs_path"]) as im:
            _ = im.size
        ok += 1
    except Exception as e:
        bad += 1
        print("[FAIL]", r["abs_path"], e)

print(f"Verified OK: {ok}  FAIL: {bad}")

Verified OK: 25  FAIL: 0
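
The check above opens the source files on disk; it does not inspect the shards themselves. A complementary sketch that compares the member names across all shards with the expected URL set is shown below. The helper name `check_shard_coverage` is hypothetical, and the expected set is assumed to be something like `set(unique_urls)`.

```python
import tarfile
from collections import Counter
from pathlib import Path


def check_shard_coverage(shard_paths: list[Path], expected_urls: set[str]) -> dict:
    """Compare regular-file member names across shards with the expected URLs."""
    seen: Counter = Counter()
    for shard in shard_paths:
        with tarfile.open(shard, "r") as tar:
            for member in tar:
                if member.isfile():
                    seen[member.name] += 1
    names = set(seen)
    return {
        "missing": sorted(expected_urls - names),      # expected but not packed
        "unexpected": sorted(names - expected_urls),   # packed but not expected
        "duplicated": sorted(n for n, c in seen.items() if c > 1),
    }
```

Three empty lists would indicate that every URL appears exactly once across the shards. Scanning headers this way reads through each archive once, so it may take a while on multi-gigabyte shards.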


### 9. Optional: Upload to Hugging Face
---

In [33]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(usecwd=True), override=True)

In [34]:
from huggingface_hub import HfApi
import os

api = HfApi()
token = os.getenv("HF_TOKEN", "")
print(api.whoami(token))

{'type': 'user', 'id': '6893c075add77868b890bf13', 'name': 'michaelburiek', 'fullname': 'Michael Buriek', 'isPro': False, 'avatarUrl': '/avatars/63cb09a7595ab144ff1ed44d56e700ac.svg', 'orgs': [], 'auth': {'type': 'access_token', 'accessToken': {'displayName': 'dcvlr', 'role': 'fineGrained', 'createdAt': '2025-10-13T01:56:32.122Z', 'fineGrained': {'canReadGatedRepos': False, 'global': [], 'scoped': [{'entity': {'_id': '6893c075add77868b890bf13', 'type': 'user', 'name': 'michaelburiek'}, 'permissions': ['repo.content.read', 'repo.write']}]}}}}


In [35]:
from huggingface_hub import HfApi, create_repo
from pathlib import Path
import os, time

# ---------- Config ----------
BASE_DIR   = Path("/Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images")
SHARDS_DIR = BASE_DIR / "shards"

# Read the repo id from the HF_DATASET_REPO_ID environment variable if set; otherwise use the default repo
REPO_ID = os.environ.get("HF_DATASET_REPO_ID", "").strip() or "michaelburiek/mmathcot1m-images"
TOKEN   = os.environ.get("HF_TOKEN", "").strip()

if not TOKEN:
    raise RuntimeError("HF_TOKEN is empty. Put it in your .env or export it in your shell, then re-run.")

api = HfApi()

# ---------- Sanity checks ----------
who = api.whoami(token=TOKEN)
acct = who.get("name") or (who.get("orgs") or [{}])[0].get("name")
print(f"[INFO] Authenticated as: {acct}")

wanted_ns = REPO_ID.split("/", 1)[0]
if wanted_ns != acct:
    raise RuntimeError(
        f"Namespace mismatch: REPO_ID starts with '{wanted_ns}' but your token is for '{acct}'. "
        f"Fix REPO_ID to '{acct}/mmathcot1m-images' or use a token that belongs to '{wanted_ns}'."
    )

# ---------- Ensure dataset repo exists (idempotent) ----------
print(f"[INFO] Ensuring dataset repo exists: {REPO_ID}")
create_repo(repo_id=REPO_ID, repo_type="dataset", exist_ok=True, token=TOKEN)

# ---------- Collect files to upload (no shard changes) ----------
to_upload = []

m_csv  = BASE_DIR / "manifest.csv"
m_parq = BASE_DIR / "manifest.parquet"
if m_csv.exists():  to_upload.append((m_csv,  "manifest.csv"))
if m_parq.exists(): to_upload.append((m_parq, "manifest.parquet"))

unresolved = BASE_DIR / "unresolved_image_urls.txt"
if unresolved.exists() and unresolved.stat().st_size > 0:
    to_upload.append((unresolved, "unresolved_image_urls.txt"))

shard_files = sorted(SHARDS_DIR.glob("images-*.tar"))
if not shard_files:
    raise RuntimeError(f"No shard files found in {SHARDS_DIR}")
for sf in shard_files:
    to_upload.append((sf, f"shards/{sf.name}"))

# Lightweight README
readme_text = """# MMathCoT-1M Images (Sharded)

Local mirror of image assets referenced by the
[`URSA-MATH/MMathCoT-1M`](https://huggingface.co/datasets/URSA-MATH/MMathCoT-1M) dataset.

- Shards: tar files in `shards/`
- Manifest: `manifest.csv` and `manifest.parquet`

Each tar member's arcname equals the original `image_url` from the dataset.
"""
readme_tmp = BASE_DIR / "_README.md"
readme_tmp.write_text(readme_text)
to_upload.append((readme_tmp, "README.md"))

# ---------- Uploader with simple retries ----------
def upload_with_retries(path: Path, dest: str, max_retries: int = 4) -> bool:
    for attempt in range(1, max_retries + 1):
        try:
            api.upload_file(
                path_or_fileobj=str(path),
                path_in_repo=dest,
                repo_id=REPO_ID,
                repo_type="dataset",
                token=TOKEN,
            )
            return True
        except Exception as e:
            if attempt == max_retries:
                print(f"[ERROR] Failed: {path} → {dest}: {e}")
                return False
            backoff = min(60, 2 ** attempt)
            print(f"[WARN] Retry {attempt}/{max_retries} in {backoff}s for {dest}: {e}")
            time.sleep(backoff)

# ---------- Size summary ----------
def _fmt_bytes(n: int) -> str:
    units = ["B", "KB", "MB", "GB", "TB"]
    i, x = 0, float(n)
    while x >= 1024 and i < len(units) - 1:
        x /= 1024
        i += 1
    return f"{x:.2f} {units[i]}"

total_bytes = sum(p.stat().st_size for p, _ in to_upload if p.exists())
print(f"[INFO] Uploading {len(to_upload)} items (~{_fmt_bytes(total_bytes)}) to hf://datasets/{REPO_ID}")

# ---------- Do uploads (idempotent; overwrites same paths only) ----------
ok = 0
for p, dest in to_upload:
    print(f"[INFO] Uploading: {p}  →  {dest}")
    if upload_with_retries(p, dest):
        ok += 1

print(f"[DONE] Uploaded {ok}/{len(to_upload)} items to {REPO_ID}")

# Cleanup temp README file
try:
    readme_tmp.unlink(missing_ok=True)
except Exception:
    pass

[INFO] Authenticated as: michaelburiek
[INFO] Ensuring dataset repo exists: michaelburiek/mmathcot1m-images
[INFO] Uploading 9 items (~28.12 GB) to hf://datasets/michaelburiek/mmathcot1m-images
[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/manifest.csv  →  manifest.csv


Processing Files (1 / 1): 100%|██████████|  151MB /  151MB, 7.51MB/s  
New Data Upload: 100%|██████████|  151MB /  151MB, 7.51MB/s  


[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/manifest.parquet  →  manifest.parquet


Processing Files (1 / 1): 100%|██████████| 33.5MB / 33.5MB, 1.61MB/s  
New Data Upload: 100%|██████████| 23.3MB / 23.3MB, 1.12MB/s  


[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-000.tar  →  shards/images-000.tar


Processing Files (0 / 1): 100%|█████████▉| 5.07GB / 5.07GB,  0.00B/s  
New Data Upload: 100%|██████████| 4.97GB / 4.97GB,  0.00B/s  


[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-001.tar  →  shards/images-001.tar


Processing Files (1 / 1): 100%|██████████| 5.03GB / 5.03GB,  0.00B/s  
New Data Upload: 100%|██████████| 4.82GB / 4.82GB,  0.00B/s  


[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-002.tar  →  shards/images-002.tar


Processing Files (0 / 1): 100%|█████████▉| 5.02GB / 5.02GB,  0.00B/s  
New Data Upload: 100%|██████████| 4.69GB / 4.69GB,  0.00B/s  


[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-003.tar  →  shards/images-003.tar


Processing Files (1 / 1): 100%|██████████| 5.02GB / 5.02GB,  0.00B/s  
New Data Upload: 100%|██████████| 3.17GB / 3.17GB,  0.00B/s  


[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-004.tar  →  shards/images-004.tar


Processing Files (0 / 1): 100%|█████████▉| 5.04GB / 5.04GB,  0.00B/s  
New Data Upload: 100%|██████████| 3.48GB / 3.48GB,  0.00B/s  


[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/shards/images-005.tar  →  shards/images-005.tar


Processing Files (0 / 1):  93%|█████████▎| 4.48GB / 4.81GB, 2.19MB/s  

  2025-10-13T04:13:17.435280Z  WARN  Status Code: 502. Retrying..., request_id: ""
    at /Users/runner/work/xet-core/xet-core/cas_client/src/http_client.rs:227



Processing Files (0 / 1): 100%|█████████▉| 4.81GB / 4.81GB,  0.00B/s  
New Data Upload: 100%|██████████| 4.56GB / 4.56GB,  0.00B/s  


[INFO] Uploading: /Users/michaelburiek/Documents/GitHub/dcvlr/mmath_cot_1m/images/_README.md  →  README.md


- empty or missing yaml metadata in repo card


[DONE] Uploaded 9/9 items to michaelburiek/mmathcot1m-images
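
The uploader above retries with exponential backoff: it waits `min(60, 2 ** attempt)` seconds after each failed attempt. A generic sketch of the same schedule follows, with the sleep function injected so the behaviour can be checked without waiting. The name `call_with_retries` is hypothetical, and unlike `upload_with_retries` this variant re-raises the last error instead of returning `False`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` until it succeeds; re-raise the last error when attempts run out."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(min(60, 2 ** attempt))  # 2 s, 4 s, 8 s, ... capped at 60 s
    raise ValueError("max_retries must be at least 1")
```

With the default of four attempts, a call that keeps failing waits 2, 4, and 8 seconds before the final error is raised.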
