# NetCDF to GeoTIFF Converter

This notebook converts NetCDF files to GeoTIFF format for easier processing with standard GIS tools.

**Use cases:**
- Converting PM2.5 NetCDF data to TIFFs
- Processing climate/atmospheric data
- Batch conversion of time-series NetCDF files

**Requirements:**
- `xarray`: For reading NetCDF files
- `rioxarray`: For spatial reference handling and GeoTIFF export
- `netCDF4`: NetCDF backend

## Setup

In [1]:
import xarray as xr
import rioxarray
from pathlib import Path
import numpy as np
from tqdm.auto import tqdm

print("✓ Imports successful")

✓ Imports successful


## Configuration

Set your input/output paths and NetCDF variable information.

In [2]:
# Define paths
project_root = Path.cwd().parent
DATA_ROOT = project_root.parent / "data"

# Input: Directory containing NetCDF files or single NetCDF file
INPUT_PATH = DATA_ROOT / "00_source" / "archives" / "pm25" /"2019" # NC files location

# Output: Directory to save GeoTIFFs
OUTPUT_DIR = DATA_ROOT / "temp" / "pm25_tiffs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# NetCDF variable settings
VARIABLE_NAME = None  # Set to specific variable name, or None to auto-detect
CRS = "EPSG:4326"  # Default coordinate reference system (WGS84)

# Dimension names (update if your NetCDF uses different names)
X_DIM = "lon"  # or "longitude", "x"
Y_DIM = "lat"  # or "latitude", "y"
TIME_DIM = "time"  # or None if no time dimension

print(f"Input path: {INPUT_PATH}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Target CRS: {CRS}")

Input path: /Users/juancheeto/Library/CloudStorage/Box-Box/UrbanStructureStudies/AfricaProject/data/00_source/archives/pm25/2019
Output directory: /Users/juancheeto/Library/CloudStorage/Box-Box/UrbanStructureStudies/AfricaProject/data/temp/pm25_tiffs
Target CRS: EPSG:4326


## Step 1: Explore NetCDF Structure

First, let's examine a sample NetCDF file to understand its structure.

In [3]:
# Find NetCDF files
if INPUT_PATH.is_file():
    nc_files = [INPUT_PATH]
else:
    nc_files = sorted(list(INPUT_PATH.rglob("*.nc")) + list(INPUT_PATH.rglob("*.nc4")))

print(f"Found {len(nc_files)} NetCDF file(s)")
if nc_files:
    for f in nc_files[:5]:
        print(f"  - {f.name}")
    if len(nc_files) > 5:
        print(f"  ... and {len(nc_files) - 5} more")
else:
    print("⚠ No NetCDF files found. Please check INPUT_PATH.")

Found 12 NetCDF file(s)
  - V6GL02.04.CNNPM25.GL.201901-201901.nc
  - V6GL02.04.CNNPM25.GL.201902-201902.nc
  - V6GL02.04.CNNPM25.GL.201903-201903.nc
  - V6GL02.04.CNNPM25.GL.201904-201904.nc
  - V6GL02.04.CNNPM25.GL.201905-201905.nc
  ... and 7 more


In [4]:
# Inspect first NetCDF file
if nc_files:
    sample_file = nc_files[0]
    print(f"Inspecting: {sample_file.name}")
    print("=" * 70)
    
    with xr.open_dataset(sample_file) as ds:
        print("\nDataset Overview:")
        print(ds)
        
        print("\n" + "=" * 70)
        print("Data Variables:")
        for var in ds.data_vars:
            print(f"  - {var}: {ds[var].dims} {ds[var].shape} ({ds[var].dtype})")
        
        print("\nCoordinates:")
        for coord in ds.coords:
            print(f"  - {coord}: {ds[coord].shape}")
        
        print("\nAttributes:")
        for attr, value in ds.attrs.items():
            print(f"  - {attr}: {value}")

Inspecting: V6GL02.04.CNNPM25.GL.201901-201901.nc

Dataset Overview:
<xarray.Dataset> Size: 2GB
Dimensions:  (lat: 13000, lon: 36000)
Coordinates:
  * lat      (lat) float32 52kB -59.99 -59.99 -59.97 -59.97 ... 69.97 69.99 70.0
  * lon      (lon) float32 144kB -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
Data variables:
    PM25     (lat, lon) float32 2GB ...
Attributes:
    TITLE:            Convolutional Neural Network Monthly PM2.5 Estimation o...
    CONTACT:          SIYUAN SHEN <s.siyuan@wustl.edu>
    LAT_DELTA:        0.01
    LON_DELTA:        0.01
    SPATIALCOVERAGE:  GL
    TIMECOVERAGE:     201901

Data Variables:
  - PM25: ('lat', 'lon') (13000, 36000) (float32)

Coordinates:
  - lat: (13000,)
  - lon: (36000,)

Attributes:
  - TITLE: Convolutional Neural Network Monthly PM2.5 Estimation over GL Area. (0.01x0.01 resolution)
  - CONTACT: SIYUAN SHEN <s.siyuan@wustl.edu>
  - LAT_DELTA: 0.01
  - LON_DELTA: 0.01
  - SPATIALCOVERAGE: GL
  - TIMECOVERAGE: 201901


## Step 2: Convert NetCDF to GeoTIFF

Convert each NetCDF file (or time slice) to GeoTIFF format.

**Important Note on Data Orientation:**  
Many NetCDF files (especially climate/atmospheric data like PM2.5) store latitude coordinates in **descending order** (90° to -90° instead of -90° to 90°). This causes images to be rendered **upside down** (south at top, north at bottom), which **breaks spatial operations** like clipping to country boundaries.

The conversion function below automatically detects and corrects this issue by:
1. Checking if Y-coordinates are in descending order
2. Flipping the Y-axis using `sortby()` to ensure proper orientation
3. Confirming the correction in the console output

You'll see messages like `"⚠ Detected descending Y-axis"` followed by `"✓ Y-axis corrected"` if this fix is applied.

In [5]:
def convert_netcdf_to_geotiff(
    nc_path: Path,
    output_dir: Path,
    variable_name: str = None,
    crs: str = "EPSG:4326",
    x_dim: str = "lon",
    y_dim: str = "lat",
    time_dim: str = "time"
) -> list:
    """
    Convert NetCDF file to GeoTIFF(s).
    
    If the NetCDF has a time dimension, creates one GeoTIFF per time step.
    Otherwise, creates a single GeoTIFF.
    
    Returns:
        List of created GeoTIFF file paths
    """
    output_files = []
    
    with xr.open_dataset(nc_path) as ds:
        # Auto-detect variable if not specified
        if variable_name is None:
            # Get first data variable
            data_vars = list(ds.data_vars)
            if not data_vars:
                raise ValueError(f"No data variables found in {nc_path}")
            variable_name = data_vars[0]
            print(f"  Auto-detected variable: {variable_name}")
        
        # Get the data array
        da = ds[variable_name]
        
        # FIX FOR UPSIDE-DOWN IMAGES:
        # Many NetCDF files (especially climate/atmospheric data) have latitude coordinates
        # in descending order (90 to -90 instead of -90 to 90). This causes images to be
        # rendered upside down (south at top, north at bottom), which breaks spatial operations
        # like clipping. We need to flip the Y-axis if latitudes are descending.
        if y_dim in da.coords:
            y_coords = da.coords[y_dim].values
            # Check if Y coordinates are in descending order (e.g., 90, 89, 88... -> -90)
            if len(y_coords) > 1 and y_coords[0] > y_coords[-1]:
                print(f"  ⚠ Detected descending Y-axis (upside-down data)")
                print(f"    Y range: {y_coords[0]:.2f} to {y_coords[-1]:.2f}")
                print(f"  → Flipping Y-axis to correct orientation...")
                # Reverse the Y dimension to get ascending order
                da = da.sortby(y_dim)
                print(f"  ✓ Y-axis corrected: {da.coords[y_dim].values[0]:.2f} to {da.coords[y_dim].values[-1]:.2f}")
        
        # Set spatial dimensions
        da = da.rename({x_dim: 'x', y_dim: 'y'} if x_dim in da.dims else {})
        
        # Assign CRS if not present
        if not hasattr(da, 'rio'):
            da = da.rio.write_crs(crs)
        elif da.rio.crs is None:
            da = da.rio.write_crs(crs)
        
        # Check if there's a time dimension
        has_time = time_dim in da.dims
        
        if has_time:
            # Process each time step
            time_steps = da[time_dim].values
            print(f"  Processing {len(time_steps)} time steps...")
            
            for i, time_val in enumerate(tqdm(time_steps, desc=f"  {nc_path.name}")):
                # Select single time step
                da_time = da.sel({time_dim: time_val})
                
                # Create output filename
                time_str = str(time_val).replace(':', '-').replace(' ', '_')
                if 'T' in time_str:
                    time_str = time_str.split('T')[0]  # Keep just the date
                
                output_file = output_dir / f"{nc_path.stem}_{time_str}.tif"
                
                # Write to GeoTIFF
                da_time.rio.to_raster(output_file, driver="GTiff", compress="lzw")
                output_files.append(output_file)
        else:
            # Single time step or no time dimension
            output_file = output_dir / f"{nc_path.stem}.tif"
            da.rio.to_raster(output_file, driver="GTiff", compress="lzw")
            output_files.append(output_file)
            print(f"  ✓ Created: {output_file.name}")
    
    return output_files

print("✓ Conversion function defined")

✓ Conversion function defined


In [6]:
# Convert all NetCDF files
print("Converting NetCDF files to GeoTIFF...")
print("=" * 70)

all_output_files = []

for nc_file in nc_files:
    print(f"\n{nc_file.name}:")
    try:
        output_files = convert_netcdf_to_geotiff(
            nc_path=nc_file,
            output_dir=OUTPUT_DIR,
            variable_name=VARIABLE_NAME,
            crs=CRS,
            x_dim=X_DIM,
            y_dim=Y_DIM,
            time_dim=TIME_DIM
        )
        all_output_files.extend(output_files)
        print(f"  ✓ Created {len(output_files)} GeoTIFF file(s)")
    except Exception as e:
        print(f"  ✗ Error: {e}")

print("\n" + "=" * 70)
print(f"✓ Conversion complete!")
print(f"Total GeoTIFF files created: {len(all_output_files)}")
print(f"Output directory: {OUTPUT_DIR}")

Converting NetCDF files to GeoTIFF...

V6GL02.04.CNNPM25.GL.201901-201901.nc:
  Auto-detected variable: PM25
  ✓ Created: V6GL02.04.CNNPM25.GL.201901-201901.tif
  ✓ Created 1 GeoTIFF file(s)

V6GL02.04.CNNPM25.GL.201902-201902.nc:
  Auto-detected variable: PM25
  ✓ Created: V6GL02.04.CNNPM25.GL.201902-201902.tif
  ✓ Created 1 GeoTIFF file(s)

V6GL02.04.CNNPM25.GL.201903-201903.nc:
  Auto-detected variable: PM25
  ✓ Created: V6GL02.04.CNNPM25.GL.201903-201903.tif
  ✓ Created 1 GeoTIFF file(s)

V6GL02.04.CNNPM25.GL.201904-201904.nc:
  Auto-detected variable: PM25
  ✓ Created: V6GL02.04.CNNPM25.GL.201904-201904.tif
  ✓ Created 1 GeoTIFF file(s)

V6GL02.04.CNNPM25.GL.201905-201905.nc:
  Auto-detected variable: PM25
  ✓ Created: V6GL02.04.CNNPM25.GL.201905-201905.tif
  ✓ Created 1 GeoTIFF file(s)

V6GL02.04.CNNPM25.GL.201906-201906.nc:
  Auto-detected variable: PM25
  ✓ Created: V6GL02.04.CNNPM25.GL.201906-201906.tif
  ✓ Created 1 GeoTIFF file(s)

V6GL02.04.CNNPM25.GL.201907-201907.nc:
  Au

## Step 3: Verify Output

Check the created GeoTIFF files.

In [7]:
import rasterio

if all_output_files:
    # Show sample of created files
    print("Created GeoTIFF files:")
    for f in all_output_files[:10]:
        size_mb = f.stat().st_size / (1024**2)
        print(f"  - {f.name} ({size_mb:.1f} MB)")
    
    if len(all_output_files) > 10:
        print(f"  ... and {len(all_output_files) - 10} more")
    
    # Inspect first file
    print(f"\nInspecting first GeoTIFF: {all_output_files[0].name}")
    print("=" * 70)
    
    with rasterio.open(all_output_files[0]) as src:
        print(f"Dimensions: {src.width} x {src.height}")
        print(f"Bands: {src.count}")
        print(f"CRS: {src.crs}")
        print(f"Bounds: {src.bounds}")
        print(f"Data type: {src.dtypes[0]}")
        print(f"NoData value: {src.nodata}")
        
        # Read and show statistics
        data = src.read(1)
        valid_data = data[data != src.nodata] if src.nodata is not None else data
        
        print(f"\nData statistics:")
        print(f"  Min: {np.min(valid_data):.4f}")
        print(f"  Max: {np.max(valid_data):.4f}")
        print(f"  Mean: {np.mean(valid_data):.4f}")
        print(f"  Std: {np.std(valid_data):.4f}")
else:
    print("No files were created.")

Created GeoTIFF files:
  - V6GL02.04.CNNPM25.GL.201901-201901.tif (734.4 MB)
  - V6GL02.04.CNNPM25.GL.201902-201902.tif (733.0 MB)
  - V6GL02.04.CNNPM25.GL.201903-201903.tif (731.6 MB)
  - V6GL02.04.CNNPM25.GL.201904-201904.tif (731.4 MB)
  - V6GL02.04.CNNPM25.GL.201905-201905.tif (733.5 MB)
  - V6GL02.04.CNNPM25.GL.201906-201906.tif (734.8 MB)
  - V6GL02.04.CNNPM25.GL.201907-201907.tif (736.1 MB)
  - V6GL02.04.CNNPM25.GL.201908-201908.tif (734.3 MB)
  - V6GL02.04.CNNPM25.GL.201909-201909.tif (733.5 MB)
  - V6GL02.04.CNNPM25.GL.201910-201910.tif (733.4 MB)
  ... and 2 more

Inspecting first GeoTIFF: V6GL02.04.CNNPM25.GL.201901-201901.tif
Dimensions: 36000 x 13000
Bands: 1
CRS: EPSG:4326
Bounds: BoundingBox(left=-179.99999511705187, bottom=70.00000274664657, right=179.99999511705187, top=-59.99999893194933)
Data type: float32
NoData value: None

Data statistics:
  Min: -999.9000
  Max: 605.0477
  Mean: -624.7463
  Std: 490.0598


## Step 4: Quick Visualization (Optional)

Visualize one of the converted GeoTIFFs.

In [None]:
import matplotlib.pyplot as plt
from rasterio.plot import show as rioshow

if all_output_files:
    # Visualize first file
    with rasterio.open(all_output_files[0]) as src:
        fig, ax = plt.subplots(figsize=(12, 8))
        
        # Use rasterio.plot.show() which handles geospatial orientation correctly
        # This ensures the image is displayed right-side up with proper georeferencing
        rioshow(src, ax=ax, cmap='viridis')
        
        ax.set_title(f"{all_output_files[0].name}", fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        # Also show data statistics
        data = src.read(1)
        if src.nodata is not None:
            valid_data = data[data != src.nodata]
        else:
            valid_data = data
        
        print(f"\nData range: {valid_data.min():.2f} to {valid_data.max():.2f}")
        print(f"Mean: {valid_data.mean():.2f}")
else:
    print("No files to visualize.")

## Summary

Your NetCDF files have been converted to GeoTIFF format and are ready for further processing with geoworkflow or other GIS tools.

In [9]:
print("=" * 70)
print("CONVERSION SUMMARY")
print("=" * 70)
print(f"Input NetCDF files: {len(nc_files)}")
print(f"Output GeoTIFF files: {len(all_output_files)}")
print(f"Output location: {OUTPUT_DIR}")

if all_output_files:
    total_size = sum(f.stat().st_size for f in all_output_files) / (1024**2)
    print(f"Total size: {total_size:.1f} MB")
    print(f"\nNext steps:")
    print(f"  1. Use these GeoTIFFs with geoworkflow processors")
    print(f"  2. Clip to country boundaries using spatial/clipper")
    print(f"  3. Compute temporal statistics with temporal_raster_utils")

print("=" * 70)

CONVERSION SUMMARY
Input NetCDF files: 12
Output GeoTIFF files: 12
Output location: /Users/juancheeto/Library/CloudStorage/Box-Box/UrbanStructureStudies/AfricaProject/data/temp/pm25_tiffs
Total size: 8806.6 MB

Next steps:
  1. Use these GeoTIFFs with geoworkflow processors
  2. Clip to country boundaries using spatial/clipper
  3. Compute temporal statistics with temporal_raster_utils
