# Access GrandTour Data using HuggingFace 🤗
© 2025 ETH Zurich
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/leggedrobotics/grand_tour_dataset/blob/main/examples/%5B0%5D_Accessing_GrandTour_Data.ipynb)


## Overview
> GrandTour data is available in two formats, hosted on two platforms:

<table>
  <tr>
    <th style="padding:10px;text-align:left;">Format</th>
    <th style="padding:10px;text-align:left;"> </th>
    <th style="padding:10px;text-align:left;">Hosted&nbsp;on</th>
    <th style="padding:10px;text-align:left;"> </th>
  </tr>

  <tr>
    <td><img src="https://raw.githubusercontent.com/leggedrobotics/grand_tour_dataset/main/assets/ros-logo.png"  height="30" alt="ROS logo"></td>
    <td style="padding-left:15px;"><a href="https://wiki.ros.org/rosbag">ROS&nbsp;Bags</a></td>
    <td><img src="https://raw.githubusercontent.com/leggedrobotics/grand_tour_dataset/main/assets/rsl-logo.png"  height="30" alt="RSL logo"></td>
    <td style="padding-left:15px;">Kleinkram</td>
  </tr>


  <tr>
    <td><img src="https://raw.githubusercontent.com/leggedrobotics/grand_tour_dataset/main/assets/zarr-logo.png" height="40" alt="Zarr logo"></td>
    <td style="padding-left:15px;"><a href="https://zarr.dev/">ZARR</a></td>
    <td><img src="https://raw.githubusercontent.com/leggedrobotics/grand_tour_dataset/main/assets/hf-logo.png"  height="30" alt="Hugging Face logo"></td>
    <td style="padding-left:15px;">HuggingFace</td>
  </tr>
</table>

> This notebook explains how to download the Zarr/PNG-converted dataset hosted on HuggingFace.
>
> 
> 💡 Please refer to `examples_hugging_face/explore.ipynb` for how to use the data.
 
## Downloading
> We provide the entire dataset on HuggingFace in `.zarr`, `.png`, and `.yaml` format.
> 
> To avoid checking in more than 1M individual files on the Hugging Face Hub, we created one tarball (`.tar`) for each topic per mission.

> HuggingFace has an easy-to-use Python download API called `huggingface_hub`.
> It is possible to download directly from the [GrandTour HuggingFace repo UI](https://huggingface.co/leggedrobotics), but we strongly recommend using `huggingface_hub`, as it handles file caching, resumes interrupted downloads, and fetches only files that have been updated.

> First, install `huggingface_hub`. Using it to download the data requires a HuggingFace account; you can create one for free at [huggingface.co](https://huggingface.co/).

In [None]:
! pip install -q huggingface_hub # Should be already installed when following the README.md and uv installation!

> Then, log in using the CLI. This stores an authentication token on your machine and allows you to use the API to download data.

In [None]:
# If your notebook isn't able to take input from the command line, run this in a local terminal instead
! huggingface-cli login

> Now you can download a mission of your choice. The next tutorial, _[1] Exploring GrandTour Data_, uses `2024-10-01-11-29-55`, so we will download it here in anticipation.

In [None]:
from huggingface_hub import snapshot_download

# Specify the mission you want to download.
mission = "2024-10-01-11-29-55"

# Default: download all data from a single mission.
allow_patterns = [f"{mission}/*"]

# Alternative: download the full dataset (all missions).
# allow_patterns = ["*"]

# Alternative: download a specific topic, plus the mission's .yaml files.
# topic = "alphasense_front_center"
# allow_patterns = [f"{mission}/*{topic}*", f"{mission}/*.yaml"]

# If the download is interrupted, simply re-run this block: huggingface_hub resumes
# without re-downloading files that are already in the cache.
hugging_face_data_cache_path = snapshot_download(repo_id="leggedrobotics/grand_tour_dataset", allow_patterns=allow_patterns, repo_type="dataset")
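
> To get a feel for which files a set of `allow_patterns` selects before starting a large download, the patterns can be tried against a list of paths. This is a minimal sketch that assumes shell-style `fnmatch` matching (our understanding of how `huggingface_hub` filters files, where `*` also matches `/`). The helper `preview_selection` and all file names below are hypothetical and only illustrate the one-tarball-per-topic layout.

```python
from fnmatch import fnmatch

def preview_selection(paths, allow_patterns):
    """Return the paths matched by at least one glob pattern."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in allow_patterns)]

mission = "2024-10-01-11-29-55"
topic = "alphasense_front_center"

# Hypothetical repo paths, for illustration only.
example_paths = [
    f"{mission}/data/{topic}.tar",
    f"{mission}/data/hypothetical_other_topic.tar",
    f"{mission}/hypothetical_metadata.yaml",
    "2099-01-01-00-00-00/data/hypothetical_other_topic.tar",
]

# Same patterns as the "specific topic" alternative in the cell above.
selected = preview_selection(
    example_paths, [f"{mission}/*{topic}*", f"{mission}/*.yaml"]
)
print(selected)
```

> With these example paths, only the tarball of the chosen topic and the mission's `.yaml` file are selected; the other topic and the other mission are skipped.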

> The downloaded data is packed into `.tar` files, which must be extracted before use. We recommend extracting to a destination of your choice outside the HuggingFace cache directory:

In [None]:
from pathlib import Path

# Define the destination directory
dataset_folder = Path("~/grand_tour_dataset").expanduser()
dataset_folder.mkdir(parents=True, exist_ok=True)

# Print for confirmation
print(f"Data will be extracted to: {dataset_folder}")

> Define a `.tar` extractor helper function and extract the files:

In [None]:
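import re
import shutil
import tarfile
from pathlib import Path

def move_dataset(cache, dataset_folder, allow_patterns=("*",)):
    """Extract matching .tar files from `cache` into `dataset_folder` and copy the other matching files."""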

    def convert_glob_patterns_to_regex(glob_patterns):
        regex_parts = []
        for pat in glob_patterns:
            # Escape regex special characters except for * and ?
            pat = re.escape(pat)
            # Convert escaped glob wildcards to regex equivalents
            pat = pat.replace(r'\*', '.*').replace(r'\?', '.')
            # Make sure it matches full paths
            regex_parts.append(f".*{pat}$")
        
        # Join with |
        combined = "|".join(regex_parts)
        return re.compile(combined)
    
    pattern = convert_glob_patterns_to_regex(allow_patterns)
    files = [f for f in Path(cache).rglob("*") if pattern.match(str(f))]
    tar_files = [f for f in files if f.suffix == ".tar" ]
    
    for source_path in tar_files:
        dest_path = dataset_folder / source_path.relative_to(cache)
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        
        try:
            with tarfile.open(source_path, "r") as tar:
                tar.extractall(path=dest_path.parent)
        except tarfile.ReadError as e:
            print(f"Error opening or extracting tar file '{source_path}': {e}")
        except Exception as e:
            print(f"An unexpected error occurred while processing {source_path}: {e}")
    
    other_files = [f for f in files if f.suffix != ".tar" and f.is_file()]
    for source_path in other_files:
        dest_path = dataset_folder / source_path.relative_to(cache)
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_path,dest_path)

    print(f"Extracted and copied data from {cache} to {dataset_folder}!")


move_dataset(hugging_face_data_cache_path, dataset_folder, allow_patterns=allow_patterns)
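
> Python 3.12 and 3.13 emit a `DeprecationWarning` when `extractall` is called without an extraction filter. The following is a minimal sketch, not part of the GrandTour tooling, of extracting with the standard-library `"data"` filter when the running Python supports it. `extract_tar_safely` is a hypothetical helper, and the archive built here is a throwaway stand-in for a real topic tarball.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_tar_safely(tar_path, dest_dir):
    """Extract tar_path into dest_dir, using the 'data' filter if available."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r") as tar:
        if hasattr(tarfile, "data_filter"):
            # Blocks members that would land outside dest_dir and
            # special files such as devices.
            tar.extractall(path=dest_dir, filter="data")
        else:
            # Older Python without extraction filters.
            tar.extractall(path=dest_dir)

# Self-contained demonstration on a throwaway archive.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "src" / "hypothetical_topic"
    src.mkdir(parents=True)
    (src / "chunk.bin").write_bytes(b"\x00" * 16)

    tar_path = tmp / "hypothetical_topic.tar"
    with tarfile.open(tar_path, "w") as tar:
        tar.add(src, arcname="hypothetical_topic")

    out_dir = tmp / "out"
    extract_tar_safely(tar_path, out_dir)
    extracted = sorted(p.relative_to(out_dir).as_posix() for p in out_dir.rglob("*"))
    print(extracted)
```

> The same `filter="data"` argument could be passed to the `extractall` call inside `move_dataset` if the warning appears in your environment.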

> You should now be able to load the dataset in `.zarr` format and inspect the contents:

In [None]:
import zarr.storage

store = zarr.storage.LocalStore(dataset_folder / mission / "data")
root = zarr.group(store=store)

# List the top-level entries of the store
print(list(root.keys()))
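
> Before opening the store, it can help to confirm that the extracted folder exists and is not empty. The sketch below does this with `pathlib` only, so it does not depend on `zarr`. It assumes that the entries of the store appear as sub-directories of `<mission>/data` after the extraction step above; `list_store_entries` and the temporary layout are hypothetical and serve only as a demonstration.

```python
import tempfile
from pathlib import Path

def list_store_entries(store_dir):
    """Return the sorted names of the sub-directories of store_dir."""
    store_dir = Path(store_dir)
    if not store_dir.is_dir():
        raise FileNotFoundError(f"No extracted data found at {store_dir}")
    return sorted(p.name for p in store_dir.iterdir() if p.is_dir())

# Demonstration on a hypothetical layout in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_store = Path(tmp) / "2024-10-01-11-29-55" / "data"
    (demo_store / "hypothetical_topic_a").mkdir(parents=True)
    (demo_store / "hypothetical_topic_b").mkdir()
    entries = list_store_entries(demo_store)
    print(entries)
```

> On the real data, this would be called as `list_store_entries(dataset_folder / mission / "data")`.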