# NetCDF to Cloud Optimized GeoTIFF (COG) Transformation

This notebook demonstrates how to transform NetCDF files into Cloud Optimized GeoTIFFs for ingestion into the GHG Center.

**Dataset Example**: CASA-GFED3 Land Carbon Flux  
**Source Format**: NetCDF  
**Target Format**: Cloud Optimized GeoTIFF (COG)

## Configuration

Define the ingestion configuration including S3 buckets and transformation parameters.

In [None]:
import os
import xarray
import rioxarray  # noqa: F401 (importing it registers the .rio accessor used below)
import re
import pandas as pd
import json
import tempfile
import boto3
from datetime import datetime
import numpy as np

In [None]:
# Configuration
config = {
    "data_acquisition_method": "s3",
    "raw_data_bucket": "ghgc-data-store-dev",
    "raw_data_prefix": "raw_data/casa-gfed/",
    "cog_data_bucket": "ghgc-data-store-dev",
    "cog_data_prefix": "transformed_cogs/casa-gfed-v3",
    "date_fmt": "%Y%m",
    "transformation": {
        "reproject_to": "EPSG:4326",
        "compression": "DEFLATE"
    }
}

# Initialize S3 client
session = boto3.session.Session()
s3_client = session.client("s3")
bucket_name = config["cog_data_bucket"]
date_fmt = config["date_fmt"]

## Transformation Process

Process each NetCDF file:
1. Open the dataset
2. Fix coordinate issues (longitude wrap)
3. Extract variables
4. Create COGs for each time step and variable
5. Upload to S3

In [None]:
# Track processed files
files_processed = pd.DataFrame(columns=["file_name", "COGs_created"])

# Process NetCDF files
for name in os.listdir("geoscarb"):
    # Open NetCDF dataset
    xds = xarray.open_dataset(
        f"geoscarb/{name}",
        engine="netcdf4",
    )
    
    # Fix longitude coordinates (wrap from 0-360 to -180-180)
    xds = xds.assign_coords(
        longitude=(((xds.longitude + 180) % 360) - 180)
    ).sortby("longitude")
    
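    # Get list of data variables
    variable = list(xds.data_vars)

    # Process each time step
    for time_increment in range(len(xds.time)):
        # Process each variable; the last entry is skipped on the assumption
        # that it is an auxiliary variable (e.g., time_bounds)
        for var in variable[:-1]:
            # Strip any directory component, then split the name on "_", " ", and "."
            filename = name.split("/")[-1]
            filename_elements = re.split("[_ .]", filename)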
            
            # Extract data for this time step and variable
            data = getattr(xds.isel(time=time_increment), var)
            
            # Flip latitude to match expected orientation
            data = data.isel(latitude=slice(None, None, -1))
            
            # Set spatial dimensions and CRS
            data.rio.set_spatial_dims("longitude", "latitude", inplace=True)
            data.rio.write_crs("epsg:4326", inplace=True)

            # Format date for filename
            date = data.time.dt.strftime(date_fmt).item(0)
            
            # Create COG filename
            filename_elements.pop()  # Remove extension
            filename_elements[-1] = date  # Replace with formatted date
            filename_elements.insert(2, var)  # Insert variable name
            cog_filename = "_".join(filename_elements)
            cog_filename = f"{cog_filename}.tif"

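            # Write COG to temporary file and upload to S3
            with tempfile.NamedTemporaryFile() as temp_file:
                data.rio.to_raster(
                    temp_file.name,
                    driver="COG",
                    # Apply the compression declared in the configuration
                    compress=config["transformation"]["compression"],
                )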
                s3_client.upload_file(
                    Filename=temp_file.name,
                    Bucket=bucket_name,
                    Key=f"{config['cog_data_prefix']}/{cog_filename}",
                )

            # Track processed files (pd.concat replaces the private DataFrame._append)
            files_processed = pd.concat(
                [
                    files_processed,
                    pd.DataFrame([{"file_name": name, "COGs_created": cog_filename}]),
                ],
                ignore_index=True,
            )

            print(f"Generated and saved COG: {cog_filename}")

## Save Metadata

Extract metadata from the last NetCDF file opened in the loop above (the `xds` variable still refers to it) and save it for the STAC collection.

In [None]:
# Save metadata to S3
with tempfile.NamedTemporaryFile(mode="w+") as fp:
    metadata = {
        "attributes": xds.attrs,
        # xds.sizes maps each dimension name to its length
        "data_dimensions": dict(xds.sizes),
        "data_variables": list(xds.data_vars),
        "spatial_extent": {
            "xmin": float(xds.longitude.min()),
            "xmax": float(xds.longitude.max()),
            "ymin": float(xds.latitude.min()),
            "ymax": float(xds.latitude.max())
        },
        "temporal_extent": {
            "start": str(xds.time.min().values),
            "end": str(xds.time.max().values)
        }
    }
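    # default=str stringifies attribute values that json cannot encode (e.g., NumPy scalars)
    json.dump(metadata, fp, indent=2, default=str)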
    fp.flush()

    s3_client.upload_file(
        Filename=fp.name,
        Bucket=bucket_name,
        Key=f"{config['cog_data_prefix']}/metadata.json",
    )

# Save conversion log (pandas needs the s3fs package to write to an s3:// URL)
files_processed.to_csv(
    f"s3://{bucket_name}/{config['cog_data_prefix']}/files_converted.csv",
)

print("Done generating COGs")
# files_processed holds one row per COG, not one row per source file
print(f"Total COGs created: {len(files_processed)}")

## Key Considerations

### 1. Coordinate Handling
- NetCDF files may use either longitude convention (0-360 or -180-180)
- Verify the convention, wrap longitudes to -180-180 where needed, and write the CRS as EPSG:4326 for consistency
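
The longitude wrap used in the transformation loop can be checked in isolation. The following is a minimal sketch in plain Python; `wrap_longitude` and the sample values are illustrative names and data, not part of the notebook.

```python
def wrap_longitude(lon_deg: float) -> float:
    """Map a longitude in degrees from the 0-360 convention to -180-180."""
    return ((lon_deg + 180.0) % 360.0) - 180.0


# Hypothetical sample longitudes in the 0-360 convention
lons = [0.0, 90.0, 180.0, 270.0, 359.5]

# Sorting afterwards mirrors the .sortby("longitude") call in the notebook,
# which keeps the coordinate monotonically increasing after the wrap
wrapped = sorted(wrap_longitude(lon) for lon in lons)
print(wrapped)  # [-180.0, -90.0, -0.5, 0.0, 90.0]
```

Note that 180 maps to -180, so the wrapped range is half-open: it includes -180 and excludes 180.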

### 2. Variable Selection
- Process each variable as a separate asset
- Skip auxiliary variables (e.g., time_bounds)
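
The notebook skips the auxiliary variable by position (`variable[:-1]`), which depends on the order of variables in the file. One possible alternative is to filter by name. The sketch below assumes auxiliary variables end in `_bounds` or `_bnds`; the helper name, the suffix list, and the sample variable names are all hypothetical.

```python
# Assumed naming pattern for auxiliary variables; adjust to the dataset at hand
AUXILIARY_SUFFIXES = ("_bounds", "_bnds")


def select_data_variables(names, auxiliary_suffixes=AUXILIARY_SUFFIXES):
    """Return the variable names that do not look like auxiliary variables."""
    return [name for name in names if not name.endswith(auxiliary_suffixes)]


# Hypothetical variable names, standing in for list(xds.data_vars)
all_vars = ["NPP", "Rh", "time_bounds"]
print(select_data_variables(all_vars))  # ['NPP', 'Rh']
```

Filtering by name keeps working if the file lists its variables in a different order, at the cost of maintaining the suffix list.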

### 3. Temporal Processing
- Extract individual time steps for temporal datasets
- Use consistent date formatting in filenames
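
The filename logic from the transformation loop can be pulled into a small function so the date format is applied in one place. This is a sketch that mirrors the notebook's steps; `build_cog_filename` and the sample source filename are hypothetical.

```python
import re
from datetime import datetime


def build_cog_filename(source_name, variable, timestamp, date_fmt="%Y%m"):
    """Build a COG filename from a NetCDF filename, a variable, and a time step."""
    parts = re.split(r"[_ .]", source_name)    # split on "_", " ", and "."
    parts.pop()                                # drop the file extension
    parts[-1] = timestamp.strftime(date_fmt)   # replace the last element with the date
    parts.insert(2, variable)                  # insert the variable name
    return "_".join(parts) + ".tif"


# Hypothetical source filename and variable
print(build_cog_filename("casa_gfed_flux_2017.nc", "NPP", datetime(2017, 1, 1)))
# casa_gfed_NPP_flux_201701.tif
```

Because the function assumes the last name element before the extension is a date, a source file that follows a different naming pattern would need its own parsing.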

### 4. COG Generation
- Use appropriate compression (DEFLATE for general use)
- Consider a predictor to improve compression: horizontal differencing for integer data, the floating-point predictor for floating-point data
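
One way to keep these choices consistent is to derive the write options from the data type. The sketch below builds a dictionary of keyword arguments; it assumes they are forwarded as GDAL creation options when passed to `data.rio.to_raster(path, **options)` and that the installed GDAL COG driver accepts the `STANDARD` and `FLOATING_POINT` predictor values. The helper name is hypothetical.

```python
def cog_creation_options(dtype_name, compression="DEFLATE"):
    """Return keyword arguments for writing a COG, choosing a predictor by dtype."""
    options = {"driver": "COG", "compress": compression}
    if dtype_name.startswith("float"):
        # Floating-point predictor for float32/float64 rasters
        options["predictor"] = "FLOATING_POINT"
    elif dtype_name.startswith(("int", "uint")):
        # Horizontal differencing for integer rasters
        options["predictor"] = "STANDARD"
    # Any other dtype is written without a predictor
    return options


print(cog_creation_options("float32"))
# {'driver': 'COG', 'compress': 'DEFLATE', 'predictor': 'FLOATING_POINT'}
```

In the notebook, `dtype_name` could come from `str(data.dtype)`. Whether a predictor helps depends on the data, so it is worth comparing output sizes on a sample file.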

### 5. Metadata Preservation
- Extract and save original NetCDF attributes
- Document spatial and temporal extents
- Track all processed files
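
NetCDF attributes often hold NumPy scalars or arrays, which `json.dump` cannot encode directly. A minimal sketch of a converter that could be applied to `xds.attrs` before saving follows; `to_json_safe` and the sample attributes are hypothetical.

```python
import json
from datetime import datetime


def to_json_safe(value):
    """Recursively convert a value into something json.dumps accepts."""
    if isinstance(value, dict):
        return {str(key): to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    if hasattr(value, "tolist"):
        # NumPy scalars and arrays expose tolist(), which yields native Python types
        return to_json_safe(value.tolist())
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Fall back to a string for anything else (e.g., datetime objects)
    return str(value)


# Hypothetical attributes standing in for xds.attrs
attrs = {"title": "example", "valid_range": (0.0, 1.0), "created": datetime(2020, 1, 1)}
print(json.dumps(to_json_safe(attrs), indent=2))
```

Unlike stringifying every unsupported value, this keeps numbers as numbers in the saved metadata.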