<a href="https://colab.research.google.com/github/jonbaer/googlecolab/blob/master/core_five.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌐 core-five: Multi-Modal Geospatial Dataset for Foundational Research
Welcome to the official Colab notebook for accessing and exploring **core-five**, a harmonized dataset across multiple satellite modalities (HighRes, Sentinel-1, Sentinel-2, Landsat, MODIS).

> **🧠 What’s inside?**  
> - Unified geospatial datacube  
> - Temporal alignment across sensors  
> - Ready for Foundation Models, SSL, and benchmarking

---

## 📦 Accessing core-five Data using HF
We'll use `fsspec`, `geopandas`, and `xarray` to load data directly from the Hugging Face Hub.

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd

import os
import fsspec
import xarray as xr
import huggingface_hub as hf

> 💡 **Tip:** Pass `chunks={}` to `xr.open_datatree(...)` to load variables lazily as Dask arrays, which helps when a workflow spans many cubes.

### 🧬 Metadata Loader: Zero-Copy, Zero-Hassle Metadata for Each Cube

Stream the GeoParquet index directly from Hugging Face with `fsspec` and `GeoPandas`; nothing is saved to local disk. Each row holds the relative path of one cube and its footprint polygon:


In [None]:
metadata_url = "https://huggingface.co/datasets/gajeshladhar/core-five/resolve/main/metadata.parquet"
# Open the remote file through fsspec and close the handle once the table is read
with fsspec.open(metadata_url) as f:
    df_metadata = gpd.read_parquet(f)
df_metadata.head()

Unnamed: 0,path,geometry
0,src/datatree/0097b3/00964b34.nc,"POLYGON ((-41.91965 -22.51009, -41.90972 -22.5..."
1,src/datatree/0097b3/00964b4c.nc,"POLYGON ((-41.91965 -22.51976, -41.90972 -22.5..."
2,src/datatree/00989b/00988334.nc,"POLYGON ((-43.32947 -21.75015, -43.31981 -21.7..."
3,src/datatree/00989b/0098834c.nc,"POLYGON ((-43.32947 -21.75967, -43.31981 -21.7..."
4,src/datatree/0098f5/009858ac.nc,"POLYGON ((-43.17472 -22.13415, -43.16503 -22.1..."
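
Because the metadata table carries one footprint polygon per cube, it can be filtered spatially before any imagery is fetched. The sketch below is illustrative only: it uses a tiny synthetic table in place of `df_metadata`, and the helper name `cubes_in_bbox`, the two paths, and the bounding box are hypothetical. It also assumes the footprints are longitude/latitude in EPSG:4326, which the coordinates above suggest but this notebook does not state.

```python
import geopandas as gpd
from shapely.geometry import box


def cubes_in_bbox(gdf, min_lon, min_lat, max_lon, max_lat):
    """Return the rows whose footprint intersects the given lon/lat box."""
    aoi = box(min_lon, min_lat, max_lon, max_lat)
    return gdf[gdf.intersects(aoi)]


# Synthetic stand-in for df_metadata: two small footprints with made-up paths.
demo = gpd.GeoDataFrame(
    {"path": ["src/datatree/aaaaaa/cube_a.nc", "src/datatree/bbbbbb/cube_b.nc"]},
    geometry=[box(-41.92, -22.52, -41.91, -22.51), box(10.00, 45.00, 10.01, 45.01)],
    crs="EPSG:4326",  # assumption: footprints are lon/lat
)

# Keep only the cubes that touch an area of interest around the first footprint.
selected = cubes_in_bbox(demo, -42.0, -23.0, -41.0, -22.0)
print(selected.path.tolist())
```

With the real table, the same call would take `df_metadata` as its first argument.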


### 🎲 On-the-Fly Access to a Random Scene

Pick a random row from the metadata table and build the URL of its `.nc` file on the Hugging Face Hub:


In [None]:
# Build the URL by string concatenation; os.path.join would insert backslashes on Windows
base_url = "https://huggingface.co/datasets/gajeshladhar/core-five/resolve/main/"
path = base_url + df_metadata.sample(n=1).path.iloc[0]
path

'https://huggingface.co/datasets/gajeshladhar/core-five/resolve/main/src/datatree/602187/60218764.nc'
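
The URL construction above can be wrapped in a small helper so that it behaves the same on every operating system. This is a minimal sketch: `cube_url` and `BASE_URL` are hypothetical names introduced here, and the base URL is simply copied from the cell above.

```python
# Base URL copied from the cell above.
BASE_URL = "https://huggingface.co/datasets/gajeshladhar/core-five/resolve/main/"


def cube_url(relative_path, base_url=BASE_URL):
    """Join a metadata `path` value onto the dataset base URL."""
    # Strip stray slashes so exactly one separates the two parts.
    return base_url.rstrip("/") + "/" + relative_path.lstrip("/")


print(cube_url("src/datatree/602187/60218764.nc"))
```

With the real table this could be called as `cube_url(df_metadata.sample(n=1, random_state=0).path.iloc[0])`; passing `random_state` makes the random draw repeatable.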

In [None]:
%%time
tree = xr.open_datatree(path)
tree

CPU times: user 1.52 s, sys: 220 ms, total: 1.74 s
Wall time: 10.2 s


### 🌳 Navigating the DataTree: Modalities Unfolded

Each `.nc` file opens as an `xarray.DataTree`, so nested groups and their variables can be addressed with filesystem-style paths such as `tree['s2/B04']`.

Below is a visual representation of how different modalities are structured:

```
📂 root
├── 🛰️ s2
│   ├── B02  ← Blue
│   ├── B03  ← Green
│   ├── B04  ← Red
│   ├── B05  ← Red Edge 1
│   ├── B06  ← Red Edge 2
│   └── ...  ← More bands (B07 to B12)
├── 📡 s1
│   ├── vv   ← Vertical transmit/receive
│   ├── vh   ← Vertical transmit, horizontal receive
│   └── ...  ← (Polarimetric or derived bands)
├── 🛰️ landsat
│   ├── red
│   ├── green
│   ├── blue
│   ├── nir
│   ├── swir1
│   ├── swir2
│   └── ...  ← (Thermal or quality bands)
├── 🌍 modis
│   ├── sur_refl_b01  ← Red
│   ├── sur_refl_b03  ← Blue
│   ├── sur_refl_b04  ← Green
│   ├── sur_refl_b05
│   └── ...
└── 🔍 hr
    └── data  ← High-res 25cm image
```



In [None]:
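# Stack the red, green, and blue bands of each optical sensor along a new 'band' dimension
ds_s2 = xr.concat([tree['s2/B04'], tree['s2/B03'], tree['s2/B02']], dim=pd.Index(['B04', 'B03', 'B02'], name='band'))
# Sentinel-1: stack the two polarizations
ds_s1 = xr.concat([tree['s1/vv'], tree['s1/vh']], dim=pd.Index(['vv', 'vh'], name='band'))

ds_landsat = xr.concat([tree['landsat/red'], tree['landsat/green'], tree['landsat/blue']], dim=pd.Index(['red', 'green', 'blue'], name='band'))
# MODIS surface reflectance: b01 = red, b04 = green, b03 = blue
ds_modis = xr.concat([tree['modis/sur_refl_b01'], tree['modis/sur_refl_b04'], tree['modis/sur_refl_b03']], dim=pd.Index(['red', 'green', 'blue'], name='band'))

# High-resolution imagery is stored as a single variable
ds_hr = tree['hr/data']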
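
The stacking pattern used above can be checked on synthetic data. This sketch assumes each band is a 2-D array with dimensions named `y` and `x` (the real dimension names are not shown in this notebook) and illustrates that `xr.concat` with a named `pd.Index` adds a labelled `band` dimension at the front.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three synthetic single-band images standing in for red, green, and blue.
bands = {
    name: xr.DataArray(np.full((4, 5), value, dtype="float32"), dims=("y", "x"))
    for name, value in [("red", 0.3), ("green", 0.2), ("blue", 0.1)]
}

# A named pandas Index supplies both the new dimension name and its labels.
rgb = xr.concat(list(bands.values()), dim=pd.Index(list(bands), name="band"))

print(rgb.dims)
print(rgb["band"].values.tolist())

# Label-based selection works on the new dimension.
green = rgb.sel(band="green")
print(green.dims)
```

Selecting by label, as in `ds_landsat.sel(band='red')`, is then available on each stacked array.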

## 🛰️ Quick Visualization
We'll preview one modality using `hvplot` or `matplotlib` for a quick visual sanity check.
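
One way to run that check with `matplotlib` is to rescale a three-band stack to the 0 to 1 range and pass it to `imshow`. The sketch below is an example under stated assumptions rather than this notebook's own plotting code: `stretch_to_unit` and `preview_rgb` are hypothetical helpers, the 2nd/98th percentile stretch is an arbitrary choice, and the input is assumed to have a `band` dimension of length three plus exactly two spatial dimensions.

```python
import numpy as np
import xarray as xr


def stretch_to_unit(rgb, low=2, high=98):
    """Clip to the given percentiles and rescale linearly to [0, 1]."""
    values = rgb.values.astype("float64")
    lo, hi = np.nanpercentile(values, [low, high])
    if hi <= lo:  # constant image: avoid division by zero
        return xr.zeros_like(rgb, dtype="float64")
    scaled = np.clip((values - lo) / (hi - lo), 0.0, 1.0)
    return rgb.copy(data=scaled)


def preview_rgb(rgb, title=""):
    """Show a three-band array as an RGB image; requires matplotlib."""
    import matplotlib.pyplot as plt  # imported here so the stretch works without it

    # imshow expects the colour channel last.
    image = stretch_to_unit(rgb).transpose(..., "band").values
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.imshow(image)
    ax.set_title(title)
    ax.axis("off")
    return fig


# Synthetic stand-in for a stacked composite such as ds_s2.
demo_rgb = xr.DataArray(
    np.random.default_rng(0).uniform(0, 3000, (3, 8, 8)),
    dims=("band", "y", "x"),
)
stretched = stretch_to_unit(demo_rgb)
print(float(stretched.min()), float(stretched.max()))
```

With a real cube this could be called as `preview_rgb(ds_s2, title="Sentinel-2 RGB")`, after selecting a single time step if the array has a time dimension. xarray's built-in `ds_s2.plot.imshow(rgb="band", robust=True)` is a shorter alternative.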