# 📓 Brick E1 – Cloud Object Storage & Cloud-Native Formats

**Part of Modern GIS Bricks**  
*Learn to upload local GIS files, convert them to cloud-native formats, and write them back to object storage.*

**Goals**  
1. Upload GeoTIFF & Shapefile → S3/MinIO  
2. Convert:
   - GeoTIFF → COG
   - Shapefile → GeoParquet  
3. Write outputs back to cloud storage  

## 1. Setup

We’ll import required libraries and configure access to S3/MinIO via `fsspec`.

import os
import fsspec
import rasterio
import geopandas as gpd

# ─── Local MinIO creds & endpoint ──────────────────────────────────────────────
MINIO_ACCESS_KEY = "minioadmin"
MINIO_SECRET_KEY = "minioadmin"
MINIO_ENDPOINT   = "http://localhost:9000"

# ─── Create an S3-filesystem against local MinIO ───────────────────────────────
fs = fsspec.filesystem(
    "s3",
    key=MINIO_ACCESS_KEY,
    secret=MINIO_SECRET_KEY,
    client_kwargs={"endpoint_url": MINIO_ENDPOINT},
    use_ssl=False,               # because we're on HTTP
    # config_kwargs={"signature_version": "s3v4"},  # optional
)

# Now you can treat `fs` exactly like S3FS but it talks to your local MinIO.
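
Hard-coded credentials are fine for a throwaway local MinIO, but a shared notebook may prefer to read them from the environment. A minimal sketch, assuming hypothetical environment variable names `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and `MINIO_ENDPOINT` (nothing else in this brick defines them) and falling back to the local defaults used above; `minio_settings` is an illustrative helper, not part of `fsspec`:

```python
import os

def minio_settings(env=None):
    """Build keyword arguments for fsspec.filesystem("s3", ...) from env vars."""
    env = os.environ if env is None else env
    endpoint = env.get("MINIO_ENDPOINT", "http://localhost:9000")
    return {
        "key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "secret": env.get("MINIO_SECRET_KEY", "minioadmin"),
        "client_kwargs": {"endpoint_url": endpoint},
        # Only enable SSL when the endpoint is actually HTTPS.
        "use_ssl": endpoint.startswith("https://"),
    }

settings = minio_settings({})  # empty mapping: every value falls back to its default
print(settings["client_kwargs"]["endpoint_url"])
```

The result can be splatted into the call above, e.g. `fs = fsspec.filesystem("s3", **minio_settings())`.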

## 2. Upload Local Files to MinIO

We’ll push these local files into your MinIO “S3” bucket:

- `data/sample.tif`  
- `data/sample_shapefile/` (folder with `.shp`, `.shx`, `.dbf`, `.prj`)

Destination: `my-minio-bucket/e1-input/`

# TODO: set your local paths & MinIO bucket/prefix
local_raster  = "data/sample.tif"
local_shp_dir = "data/sample_shapefile"
bucket        = "my-minio-bucket"
prefix        = "e1-input"

# Create bucket if missing
if not fs.exists(bucket):
    fs.mkdir(bucket)

# Upload GeoTIFF
fs.put(local_raster, f"{bucket}/{prefix}/sample.tif")

# Upload Shapefile pieces
for ext in ["shp", "shx", "dbf", "prj"]:
    fs.put(f"{local_shp_dir}/sample.{ext}", f"{bucket}/{prefix}/sample.{ext}")

print("Uploaded files:", fs.ls(f"{bucket}/{prefix}"))
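
The upload loop hard-codes four extensions, so an optional sidecar such as `.cpg` would be skipped and a missing `.shx` would only surface later. A small sketch of collecting the parts from disk instead; `shapefile_parts` is a hypothetical helper, and treating `.shp`, `.shx`, and `.dbf` as the required set is an assumption of this sketch:

```python
import tempfile
from pathlib import Path

REQUIRED = {".shp", ".shx", ".dbf"}

def shapefile_parts(shp_dir, stem):
    """Return every file in shp_dir whose name is `<stem>.<ext>`, sorted by name."""
    parts = sorted(
        p for p in Path(shp_dir).iterdir()
        if p.is_file() and p.stem == stem
    )
    missing = REQUIRED - {p.suffix.lower() for p in parts}
    if missing:
        raise FileNotFoundError(f"{stem}: missing {sorted(missing)}")
    return parts

# Demo on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for ext in ("shp", "shx", "dbf", "prj", "cpg"):
        (Path(tmp) / f"sample.{ext}").touch()
    (Path(tmp) / "notes.txt").touch()  # unrelated file, ignored
    names = [p.name for p in shapefile_parts(tmp, "sample")]

print(names)
```

In this notebook the upload could then become `for part in shapefile_parts(local_shp_dir, "sample"): fs.put(str(part), f"{bucket}/{prefix}/{part.name}")`.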


## 3.1 COG Best Practices

A Cloud-Optimized GeoTIFF (COG) is a valid GeoTIFF laid out for fast HTTP range requests. Key recommendations:

- **Internal tiling**: 256×256 or 512×512 blocks  
- **Overviews**: generate downsampled levels (smallest ≲256 px)  
- **Compression**: DEFLATE or LZW with PREDICTOR=2 (ints) or 3 (floats)  
- **NoData**: set it explicitly (e.g. `NaN` for floats, the most negative representable value for signed ints)  
- **Projection**: use a known EPSG code, WKT2 format  
- **Web-optimized** (optional): align tiles & overviews to Web Mercator 256 px grid  
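
To make the overview recommendation concrete, here is a rough sketch of how many power-of-two overview levels a raster needs before its longest side fits within 256 px. `overview_factors` is a hypothetical helper for illustration only; as far as I know, `rio-cogeo` infers overview levels from the data size on its own when none are given, so this is about understanding the rule rather than something the conversion requires.

```python
import math

def overview_factors(width, height, min_size=256):
    """Return decimation factors (2, 4, 8, ...) until the longest side <= min_size."""
    factors = []
    factor = 1
    while math.ceil(max(width, height) / factor) > min_size:
        factor *= 2
        factors.append(factor)
    return factors

print(overview_factors(10000, 8000))  # [2, 4, 8, 16, 32, 64]
print(overview_factors(200, 100))     # [] (already small enough)
```

A 10 000 × 8 000 px raster therefore ends up with six overview levels, the last one about 157 × 125 px.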


from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

# MinIO paths
in_raster = f"{bucket}/{prefix}/sample.tif"
out_cog   = f"{bucket}/e1-output/sample_cog.tif"

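# 1️⃣ Pick the “deflate” COG profile (tiled, DEFLATE-compressed);
#    cog_translate builds the overviews during conversion
profile = cog_profiles.get("deflate")

# 2️⃣ Optionally tweak tile size
profile["blockxsize"] = 512
profile["blockysize"] = 512

# 3️⃣ Build the COG locally, then upload it to MinIO.
#    cog_translate expects a filesystem path for its output rather than an
#    fsspec file object, so both files are staged in a temporary directory.
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    local_in  = os.path.join(tmpdir, "sample.tif")
    local_cog = os.path.join(tmpdir, "sample_cog.tif")
    fs.get(in_raster, local_in)           # download the source GeoTIFF
    cog_translate(
        local_in,
        local_cog,
        profile,
        nodata=None,      # TODO: set your NoData if needed
        in_memory=False,  # keep the intermediate raster on disk, not in RAM
    )
    fs.put(local_cog, out_cog)            # upload the finished COG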

print("COG created at", out_cog)

## 3.2 GeoParquet Best Practices

When writing GeoParquet:

- **Version**: 1.1 with the `bbox` covering struct  
- **Compression**: ZSTD  
- **Row-group size**: 100 000–200 000 rows  
- **Spatial partitions**: geohash, S2/H3, or admin boundaries  
- **Hive partitions** (optional): e.g. `country_iso=*` to prune files server-side  
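
Since geohash is suggested as a partition key, it may help to see what the encoding does: it alternately bisects the longitude and latitude ranges, emits one bit per bisection, and packs every five bits into a base-32 character, so nearby points share a prefix. The function below is a plain-Python sketch of that idea with a hypothetical name (`geohash_encode`); it is not the implementation used by the `geohash` package imported in the next cell.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=9):
    """Encode a WGS84 lat/lon pair (degrees) as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(0.0, 0.0, precision=5))  # s0000
```

Each extra character shrinks the cell by a factor of 32, so a lower precision gives fewer, larger partitions; precision 9 is very fine-grained and is usually better suited to sorting than to directory-style partitioning.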


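import tempfile
import geohash

# MinIO paths
shp_prefix = f"{bucket}/{prefix}"
out_pq     = f"{bucket}/e1-output/sample.parquet"

# 1️⃣ Read the shapefile from MinIO.
#    A shapefile is several files (.shp, .shx, .dbf, .prj), so download all
#    parts side by side before opening the .shp.
with tempfile.TemporaryDirectory() as tmpdir:
    for ext in ["shp", "shx", "dbf", "prj"]:
        fs.get(f"{shp_prefix}/sample.{ext}", os.path.join(tmpdir, f"sample.{ext}"))
    gdf = gpd.read_file(os.path.join(tmpdir, "sample.shp"))

# 2️⃣ Compute a partition key (geohash at precision 9).
#    Geohash works on lon/lat degrees, so reproject to EPSG:4326 first
#    (this needs the CRS from the .prj), and use a representative point
#    so lines and polygons work as well as points.
pts = gdf.geometry.to_crs(epsg=4326).representative_point()
gdf["geohash"] = [
    geohash.encode(pt.y, pt.x, precision=9) for pt in pts
]  # TODO: adjust precision

# 3️⃣ Sort by geohash so nearby features land in the same row groups
gdf_sorted = gdf.sort_values("geohash")

# 4️⃣ Write GeoParquet with best-practice options.
#    GeoDataFrame.to_parquet keeps the geometry column and writes the "geo"
#    metadata; schema_version="1.1.0" and write_covering_bbox need GeoPandas ≥ 1.0.
with fs.open(out_pq, "wb") as dst:
    gdf_sorted.to_parquet(
        dst,
        compression="zstd",
        schema_version="1.1.0",
        write_covering_bbox=True,  # adds the per-row bbox struct column
        row_group_size=150_000,    # between 100k–200k
    )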

print("GeoParquet written at", out_pq)

## 4. (Optional) Preview Files in MinIO

List and inspect the generated COG and GeoParquet.

# List outputs
print("Outputs:", fs.ls(f"{bucket}/e1-output"))

# Inspect COG metadata
with fs.open(out_cog, "rb") as f:
    with rasterio.open(f) as src:
        print("COG bounds:", src.bounds)

# Inspect GeoParquet
with fs.open(out_pq, "rb") as f:
    gdf2 = gpd.read_parquet(f)
print("GeoParquet rows:", len(gdf2))

## 5. Basic Tests

Assert both files exist.

assert fs.exists(out_cog), "COG missing!"
assert fs.exists(out_pq),  "GeoParquet missing!"
print("All tests passed!")
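
An existence check passes even for an empty or truncated object. A slightly stronger, still cheap, test is to look at the file signatures: TIFF and BigTIFF files start with a 4-byte byte-order marker plus magic number, and Parquet files start and end with `PAR1`. The helpers below are hypothetical names for illustration; they do not prove the raster is a valid COG (rio-cogeo's `cog_validate` is the tool for that).

```python
def looks_like_tiff(header: bytes) -> bool:
    """True if the first 4 bytes are a classic TIFF or BigTIFF signature."""
    return header[:4] in (
        b"II*\x00", b"MM\x00*",   # classic TIFF (magic 42), little/big endian
        b"II+\x00", b"MM\x00+",   # BigTIFF (magic 43), little/big endian
    )

def looks_like_parquet(header: bytes, footer: bytes) -> bool:
    """True if the file both starts and ends with the Parquet magic bytes."""
    return header[:4] == b"PAR1" and footer[-4:] == b"PAR1"

print(looks_like_tiff(b"II*\x00\x08\x00\x00\x00"))  # True
print(looks_like_parquet(b"PAR1", b"xxxx"))         # False

# In this notebook (assumes `fs`, `out_cog`, `out_pq` from the cells above):
# with fs.open(out_cog, "rb") as f:
#     assert looks_like_tiff(f.read(4)), "COG is not a TIFF!"
# with fs.open(out_pq, "rb") as f:
#     head = f.read(4); f.seek(-4, 2); tail = f.read(4)
#     assert looks_like_parquet(head, tail), "GeoParquet is not a Parquet file!"
```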

## 6. Summary & Next Steps

- ✅ Uploaded raw files to MinIO  
- ✅ Converted to COG & GeoParquet using best practices  
- Files stored under `my-minio-bucket/e1-output/`  

**Next:** Brick E2 – Iceberg partitioning & time-travel