# Channel Belt Extractor
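This Python script processes raster tiles to extract channel belt polygons for rivers, using input data specified in a CSV file. The CSV must contain two columns: river_name (unique identifier for each river) and working_directory (base folder for input and output files). Each working directory must contain the prerequisite datasets in the folder structure the functions expect: the tiles of the Nyberg global channel-belt raster in `ChannelBelts/tifs`, the tile index shapefile `ChannelBelts/GRM_index.shp`, and, for each river, a polygon shapefile outlining the channel belt to process, saved as `{river_name}_outline.shp` in `ChannelBelts/Extracted_ChannelBelts/{river_name}`. Splitting the channel belt by reach additionally requires `{river_name}_reaches.shp` in `HydroATLAS/HydroRIVERS/Extracted_Rivers/{river_name}`, with an integer ds_order field that numbers the reaches consecutively from 1.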

The script produces three main outputs for each river: a polygon shapefile of the largest connected component of the channel belt, a shapefile containing the channel belt split by reach, and a CSV of channel belt area in square kilometers by reach. All three are saved in `ChannelBelts/Extracted_ChannelBelts/{river_name}` within the working directory. To use the script, pass the path to the CSV file to generate_channel_belt_masks, which writes the unsplit channel belt for each river listed in the CSV, and then to process_channelbelt_byreach, which writes the split shapefile and the area CSV.
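
As a rough illustration of the inputs described above, the sketch below writes a two-river datasheet and lists which prerequisite files are missing. The helper `expected_input_paths`, the temporary folder, and the file name `example_datasheet.csv` are hypothetical and are not part of the script; the paths mirror the ones the functions below construct.

```python
import os
import tempfile

import pandas as pd


def expected_input_paths(river_name, working_dir):
    """Hypothetical helper: the input files the extraction functions look for."""
    cb_dir = os.path.join(working_dir, "ChannelBelts")
    return {
        "outline": os.path.join(cb_dir, "Extracted_ChannelBelts", river_name,
                                f"{river_name}_outline.shp"),
        "tile_index": os.path.join(cb_dir, "GRM_index.shp"),
        "raster_dir": os.path.join(cb_dir, "tifs"),
        "reaches": os.path.join(working_dir, "HydroATLAS", "HydroRIVERS",
                                "Extracted_Rivers", river_name,
                                f"{river_name}_reaches.shp"),
    }


# Illustrative datasheet; the working directory is an empty throwaway folder
example_dir = tempfile.mkdtemp()
datasheet = pd.DataFrame({
    "river_name": ["Koyukuk_Huslia", "Yukon_Beaver"],
    "working_directory": [example_dir, example_dir],
})
example_csv = os.path.join(example_dir, "example_datasheet.csv")
datasheet.to_csv(example_csv, index=False)

# Collect (river, input) pairs that are missing before running the extraction
missing = []
for row in pd.read_csv(example_csv).itertuples(index=False):
    paths = expected_input_paths(row.river_name, row.working_directory)
    for kind, path in paths.items():
        if not os.path.exists(path):
            missing.append((row.river_name, kind))

print(missing)
```

Because the example folder is empty, every input is reported as missing; in a prepared working directory the list should come back empty.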

The script draws on the following dataset to extract channel belts:
Nyberg, B., Henstra, G., Gawthorpe, R.L., Ravnås, R., Ahokas, J., 2023. Global scale analysis on the extent of river channel belts. Nat Commun 14, 2163. https://doi.org/10.1038/s41467-023-37852-8

Author: James (Huck) Rees; PhD Student, UCSB Geography

Date: January 12, 2025

## Import packages

In [1]:
import rasterio
import geopandas as gpd
import numpy as np
import pandas as pd
from rasterio.merge import merge
from rasterio.mask import mask
from rasterio.features import shapes as raster_shapes
from scipy.ndimage import label, binary_fill_holes
from shapely.geometry import shape, LineString, Point, MultiPolygon, MultiLineString
from shapely.ops import split, nearest_points
import os
import math

## Initialize functions to extract channel belt from Nyberg dataset

In [2]:
def find_tiles_to_process(outline_path, index_path, raster_dir):
    """
    Find the tiles intersecting with the outline shapefile and return their paths.

    Parameters:
        outline_path (str): Path to the outline shapefile.
        index_path (str): Path to the tile index shapefile.
        raster_dir (str): Directory containing the raster tiles.

    Returns:
        list: Paths of intersecting raster tiles.
    """
    # Load the outline and index shapefiles
    outline = gpd.read_file(outline_path)
    index = gpd.read_file(index_path)

    # Ensure both layers use the same CRS
    if outline.crs != index.crs:
        outline = outline.to_crs(index.crs)

    # Find intersecting tiles
    intersecting_tiles = gpd.overlay(index, outline, how="intersection")

    if intersecting_tiles.empty:
        raise ValueError("No intersecting tiles found.")

    # Extract tile identifiers and construct their file paths
    tile_ids = intersecting_tiles['filename'].unique()
    tile_paths = [os.path.join(raster_dir, f"{tile_id}") for tile_id in tile_ids]

    return tile_paths

def stitch_tiles(tile_paths):
    """
    Stitch multiple raster tiles into a single raster.

    Parameters:
        tile_paths (list): Paths to raster tiles to stitch.

    Returns:
        tuple: (stitched_array, stitched_transform, meta)
    """
    rasters = [rasterio.open(path) for path in tile_paths]
    stitched_array, stitched_transform = merge(rasters)
    meta = rasters[0].meta.copy()

    # Update metadata for the stitched raster
    meta.update({
        "height": stitched_array.shape[1],
        "width": stitched_array.shape[2],
        "transform": stitched_transform,
        "count": stitched_array.shape[0],
    })

    for raster in rasters:
        raster.close()

    return stitched_array, stitched_transform, meta

def generate_channel_belt_masks(csv_path):
    """
    Generate channel belt polygons for each unique river listed in the CSV file.

    Parameters:
        csv_path (str): Path to the CSV file containing river data.

    Returns:
        gpd.GeoDataFrame: GeoDataFrame containing the channel belt polygons.
    """
    
    river_data = pd.read_csv(csv_path)

    all_polygons = []
    for river_name, group in river_data.groupby('river_name'):
        working_dir = group['working_directory'].iloc[0]
        outline_path = os.path.join(working_dir, 'ChannelBelts', 'Extracted_ChannelBelts', river_name, f"{river_name}_outline.shp")
        index_path = os.path.join(working_dir, 'ChannelBelts', 'GRM_index.shp')
        raster_dir = os.path.join(working_dir, 'ChannelBelts', 'tifs')
        
        tile_paths = find_tiles_to_process(outline_path, index_path, raster_dir)

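        if len(tile_paths) > 1:
            stitched_array, stitched_transform, meta = stitch_tiles(tile_paths)
        else:
            with rasterio.open(tile_paths[0]) as src:
                stitched_array = src.read()
                stitched_transform = src.transform
                meta = src.meta.copy()

        # Reproject the outline to the raster CRS so the clip and the output polygons align
        outline = gpd.read_file(outline_path)
        if outline.crs != meta["crs"]:
            outline = outline.to_crs(meta["crs"])

        # Clip the stitched raster (not only the first tile) via an in-memory dataset
        with rasterio.io.MemoryFile() as memfile:
            with memfile.open(**meta) as dataset:
                dataset.write(stitched_array)
            with memfile.open() as src:
                clipped_image, clipped_transform = mask(src, outline.geometry, crop=True, nodata=0)
                src_nodata = src.nodata

        clipped_image = np.squeeze(clipped_image)
        # Treat the raster's own nodata value (if it defines one) as background
        if src_nodata is not None:
            clipped_image = np.where(clipped_image == src_nodata, 0, clipped_image)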

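        # Keep raster classes 2 through 6 as channel belt; everything else is background
        binary_image = np.where((clipped_image >= 2) & (clipped_image <= 6), 1, 0).astype(np.uint8)

        # Label 8-connected components and keep only the largest one
        labeled_array, num_features = label(binary_image, structure=np.ones((3, 3), dtype=int))
        if num_features == 0:
            print(f"No channel belt pixels found within the outline for river {river_name}")
            continue
        component_sizes = np.bincount(labeled_array.ravel())
        largest_component_label = component_sizes[1:].argmax() + 1
        largest_component = (labeled_array == largest_component_label)

        # Fill interior holes so the belt polygon has no islands
        filled_component = binary_fill_holes(largest_component)
        final_result = np.where(filled_component, 1, 0).astype(np.uint8)

        # Convert the filled mask to polygons in map coordinates
        polygon_shapes = [
            shape(geom) 
            for geom, value in raster_shapes(final_result, transform=clipped_transform, connectivity=8) 
            if value == 1
        ]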

        gdf = gpd.GeoDataFrame(geometry=polygon_shapes, crs=outline.crs)

        # Export unsplit channel belt polygons to a shapefile
        shapefile_output_dir = os.path.join(working_dir, 'ChannelBelts', 'Extracted_ChannelBelts', river_name)
        os.makedirs(shapefile_output_dir, exist_ok=True)
        shapefile_output_path = os.path.join(shapefile_output_dir, f"{river_name}_channelbelt.shp")
        gdf.to_file(shapefile_output_path)

        print(f"Unsplit channel belt shapefile saved at {shapefile_output_path}")
        all_polygons.append(gdf)

    return gpd.GeoDataFrame(pd.concat(all_polygons, ignore_index=True))
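
The masking steps inside generate_channel_belt_masks (threshold to classes 2 through 6, keep the largest 8-connected component, fill its holes) can be checked on a small array before running them on real tiles. The toy raster below is made up for illustration and is not taken from the Nyberg dataset.

```python
import numpy as np
from scipy.ndimage import label, binary_fill_holes

# Toy class raster: a ring of belt pixels with a hole, plus two stray pixels
toy = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 2, 3, 4, 0, 0, 5],
    [0, 2, 1, 4, 0, 0, 0],
    [0, 2, 3, 4, 0, 0, 0],
    [0, 0, 0, 0, 0, 6, 0],
])

# Classes 2 through 6 count as channel belt, as in the function above
toy_binary = ((toy >= 2) & (toy <= 6)).astype(np.uint8)

# Label components using 8-connectivity (diagonal neighbours are connected)
toy_labeled, toy_count = label(toy_binary, structure=np.ones((3, 3), dtype=int))

# Component sizes, skipping label 0 (background), then keep the largest
toy_sizes = np.bincount(toy_labeled.ravel())
toy_largest = toy_labeled == (toy_sizes[1:].argmax() + 1)

# Fill the enclosed hole in the ring
toy_filled = binary_fill_holes(toy_largest).astype(np.uint8)

print(toy_count)
print(toy_filled)
```

The two stray pixels form their own components and are dropped, while the class-1 pixel enclosed by the ring is filled in.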

## Initialize functions to split extracted channel belt into study reaches and calculate area of channel belt by reach

In [3]:
def generate_reach_straights(reach_gdf):
    """
    Simplify each reach in a GeoDataFrame to a straight line connecting its start and end points.

    Parameters:
        reach_gdf (gpd.GeoDataFrame): GeoDataFrame containing reach line geometries.

    Returns:
        gpd.GeoDataFrame: GeoDataFrame with simplified straight-line geometries.
    """
    straight_gdf = reach_gdf.copy()
    straight_gdf["geometry"] = reach_gdf["geometry"].apply(
        lambda geom: LineString([geom.coords[0], geom.coords[-1]]) if geom and len(geom.coords) > 1 else geom
    )

    return straight_gdf

def get_reach_bnd_points(straight_gdf):
    """
    Generate upstream and downstream boundary points for each reach line in the straight_gdf.

    Parameters:
        straight_gdf (gpd.GeoDataFrame): GeoDataFrame containing straight-line reach geometries.

    Returns:
        gpd.GeoDataFrame: GeoDataFrame containing boundary points with their associated ds_order.
    """
    points = []
    ds_orders = straight_gdf['ds_order'].sort_values().tolist()

    for i, row in straight_gdf.iterrows():
        reach = row.geometry
        ds_order = row['ds_order']

        # Every reach contributes its upstream endpoint
        upstream_point = Point(reach.coords[0])
        points.append({'geometry': upstream_point, 'ds_order': ds_order})

        # Add downstream point for the last reach
        if ds_order == max(ds_orders):
            downstream_point = Point(reach.coords[-1])
            points.append({'geometry': downstream_point, 'ds_order': ds_order + 1})

    reach_endpoint_gdf = gpd.GeoDataFrame(points, crs=straight_gdf.crs)

    return reach_endpoint_gdf

def generate_bisectors(straight_gdf, reach_endpts_gdf):
    """
    Generate bisecting lines at each endpoint in the reach_endpts_gdf.

    Parameters:
        straight_gdf (gpd.GeoDataFrame): GeoDataFrame containing straight-line reach geometries.
        reach_endpts_gdf (gpd.GeoDataFrame): GeoDataFrame containing reach endpoints with their associated ds_order.

    Returns:
        gpd.GeoDataFrame: GeoDataFrame containing bisecting lines as geometries with associated ds_order.
    """
    bisectors = []

    for i, point_row in reach_endpts_gdf.iterrows():
        endpoint = point_row.geometry
        ds_order = point_row.ds_order

        if ds_order == 1:
            reach = straight_gdf[straight_gdf['ds_order'] == ds_order].iloc[0].geometry
            vec = np.array(reach.coords[-1]) - np.array(reach.coords[0])
            vec = vec / np.linalg.norm(vec)
            bisector_angle = np.arctan2(vec[1], vec[0]) + np.pi / 2
            bisector_vec = np.array([np.cos(bisector_angle), np.sin(bisector_angle)])
            length = reach.length
        elif ds_order == straight_gdf['ds_order'].max() + 1:
            reach = straight_gdf[straight_gdf['ds_order'] == ds_order - 1].iloc[0].geometry
            vec = np.array(reach.coords[-1]) - np.array(reach.coords[0])
            vec = vec / np.linalg.norm(vec)
            bisector_angle = np.arctan2(vec[1], vec[0]) + np.pi / 2
            bisector_vec = np.array([np.cos(bisector_angle), np.sin(bisector_angle)])
            length = reach.length
        else:
            reach1 = straight_gdf[straight_gdf['ds_order'] == ds_order - 1].iloc[0].geometry
            reach2 = straight_gdf[straight_gdf['ds_order'] == ds_order].iloc[0].geometry

            vec1 = np.array(reach1.coords[-1]) - np.array(reach1.coords[0])
            vec2 = np.array(reach2.coords[-1]) - np.array(reach2.coords[0])

            vec1 = vec1 / np.linalg.norm(vec1)
            vec2 = vec2 / np.linalg.norm(vec2)

            mean_vec = (vec1 + vec2) / 2
            mean_vec = mean_vec / np.linalg.norm(mean_vec)

            bisector_angle = np.arctan2(mean_vec[1], mean_vec[0]) + np.pi / 2
            bisector_vec = np.array([np.cos(bisector_angle), np.sin(bisector_angle)])
            length = (reach1.length + reach2.length) / 2

        bisector_start = Point(
            endpoint.x - (length / 2) * bisector_vec[0],
            endpoint.y - (length / 2) * bisector_vec[1]
        )
        bisector_end = Point(
            endpoint.x + (length / 2) * bisector_vec[0],
            endpoint.y + (length / 2) * bisector_vec[1]
        )

        bisector_line = LineString([bisector_start, bisector_end])
        bisectors.append({"geometry": bisector_line, "ds_order": ds_order})

    bisector_gdf = gpd.GeoDataFrame(bisectors, crs=straight_gdf.crs)

    return bisector_gdf

def split_polygon_by_polylines(channel_belt_gdf, bisectors_gdf):
    """
    Splits a single polygon feature in a GeoDataFrame by multiple polylines.

    Parameters:
        channel_belt_gdf (gpd.GeoDataFrame): A GeoDataFrame containing one polygon.
        bisectors_gdf (gpd.GeoDataFrame): A GeoDataFrame containing polylines that will split the polygon.

    Returns:
        gpd.GeoDataFrame: A GeoDataFrame containing the split polygon features.
    """
    if len(channel_belt_gdf) != 1:
        raise ValueError("The polygon GeoDataFrame must contain exactly one polygon feature.")

    channel_belt_gdf = channel_belt_gdf.to_crs(bisectors_gdf.crs)
    polygon = channel_belt_gdf.iloc[0].geometry

    merged_polylines = bisectors_gdf.unary_union
    if not isinstance(merged_polylines, MultiLineString):
        merged_polylines = MultiLineString([merged_polylines])

    geometries = [polygon]

    for line in merged_polylines.geoms:
        new_geometries = []
        for geom in geometries:
            if geom.geom_type == "Polygon":
                split_result = split(geom, line)
                new_geometries.extend(split_result.geoms)
            else:
                new_geometries.append(geom)
        geometries = new_geometries

    split_gdf = gpd.GeoDataFrame(geometry=geometries, crs=bisectors_gdf.crs)

    return split_gdf

def clean_split_cb(split_channel_belt, reach_gdf):
    """
    Cleans the split channel belt polygons by associating them with the reach lines they most overlap.

    Parameters:
        split_channel_belt (gpd.GeoDataFrame): A GeoDataFrame containing split polygon features.
        reach_gdf (gpd.GeoDataFrame): A GeoDataFrame containing reach lines with a 'ds_order' field.

    Returns:
        gpd.GeoDataFrame: A cleaned GeoDataFrame containing only the polygons associated with reach lines,
                          with a 'ds_order' field assigned.
    """
    cleaned_polygons = []

    for _, reach_row in reach_gdf.iterrows():
        reach_line = reach_row.geometry
        ds_order = reach_row['ds_order']

        max_overlap = 0
        best_polygon = None

        for _, poly_row in split_channel_belt.iterrows():
            polygon = poly_row.geometry
            overlap = reach_line.intersection(polygon).length

            if overlap > max_overlap:
                max_overlap = overlap
                best_polygon = polygon

        if best_polygon is not None:
            cleaned_polygons.append({"geometry": best_polygon, "ds_order": ds_order})

    cleaned_split_cb_gdf = gpd.GeoDataFrame(cleaned_polygons, crs=split_channel_belt.crs)

    # Add a new field for area in square kilometers (assumes a projected CRS with units of meters)
    cleaned_split_cb_gdf["area_sq_km"] = cleaned_split_cb_gdf.geometry.area / 1e6

    return cleaned_split_cb_gdf

def output_cb_areas(cleaned_split_cb_gdf, output_path):
    """
    Outputs the areas of each feature in the cleaned split channel belt in square kilometers,
    along with their associated ds_order, to a CSV file.

    Parameters:
        cleaned_split_cb_gdf (gpd.GeoDataFrame): A GeoDataFrame containing cleaned split channel belt polygons
                                                 with 'ds_order' and 'area_sq_km' fields.
        output_path (str): The file path where the CSV should be saved.

    Returns:
        None
    """

    output_df = cleaned_split_cb_gdf[["ds_order", "area_sq_km"]].copy()
    output_df.to_csv(output_path, index=False)
    print(f"Channel belt area by reach saved at {output_path}")

def process_channelbelt_byreach(csv_path):
    """
    Main function to process river data and split channel belt polygons by reach lines.

    Parameters:
        csv_path (str): Path to the CSV file containing river data.

    Returns:
        dict: A dictionary with river names as keys and cleaned split channel belt GeoDataFrames as values.
    """
    river_data = pd.read_csv(csv_path)
    results = {}

    for river_name, group in river_data.groupby('river_name'):
        working_dir = group['working_directory'].iloc[0]
        reach_path = os.path.join(working_dir, 'HydroATLAS', 'HydroRIVERS', 'Extracted_Rivers', river_name, f"{river_name}_reaches.shp")
        channel_belt_path = os.path.join(working_dir, 'ChannelBelts', 'Extracted_ChannelBelts', river_name, f"{river_name}_channelbelt.shp")

        if not os.path.exists(reach_path) or not os.path.exists(channel_belt_path):
            print(f"Required files not found for river {river_name}")
            continue

        reach_gdf = gpd.read_file(reach_path)
        channel_belt_gdf = gpd.read_file(channel_belt_path)

        straight_gdf = generate_reach_straights(reach_gdf)
        reach_endpts_gdf = get_reach_bnd_points(straight_gdf)
        bisectors_gdf = generate_bisectors(straight_gdf, reach_endpts_gdf)
        split_channel_belt = split_polygon_by_polylines(channel_belt_gdf, bisectors_gdf)
        cleaned_split_cb_gdf = clean_split_cb(split_channel_belt, reach_gdf)

        # Save the cleaned split channel belt to a shapefile
        output_path = os.path.join(working_dir, 'ChannelBelts', 'Extracted_ChannelBelts', river_name, f"{river_name}_channelbelt_split.shp")
        cleaned_split_cb_gdf.to_file(output_path)
        
        # Save channel belt areas by reach to a .csv
        area_output_path = os.path.join(working_dir, 'ChannelBelts', 'Extracted_ChannelBelts', river_name, f"{river_name}_channelbelt_areas.csv")
        output_cb_areas(cleaned_split_cb_gdf, area_output_path)
        
        # Print out that river processing has been completed
        print(f"Split (by reach) channel belt shapefile saved at {output_path}")
        results[river_name] = cleaned_split_cb_gdf

    return results
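
The geometry behind generate_bisectors, split_polygon_by_polylines, and clean_split_cb can be sketched on a toy case: two straight reaches meeting at a junction, a rectangular belt, and a bisector perpendicular to the mean reach direction. The coordinates, the `unit` helper, and the variable names are made up for illustration, and coordinates are assumed to be in meters so that areas divide by 1e6 to give square kilometers.

```python
import numpy as np
from shapely.geometry import LineString, Polygon
from shapely.ops import split

# Two straight reaches meeting at (1000, 0); units assumed to be meters
toy_reach1 = LineString([(0, 0), (1000, 0)])
toy_reach2 = LineString([(1000, 0), (2000, 0)])


def unit(line):
    """Unit vector from the first to the last coordinate of a line."""
    v = np.array(line.coords[-1]) - np.array(line.coords[0])
    return v / np.linalg.norm(v)


# Mean flow direction at the junction, rotated by 90 degrees
mean_vec = unit(toy_reach1) + unit(toy_reach2)
mean_vec = mean_vec / np.linalg.norm(mean_vec)
angle = np.arctan2(mean_vec[1], mean_vec[0]) + np.pi / 2
direction = np.array([np.cos(angle), np.sin(angle)])

# Bisector centred on the junction, as long as the mean reach length
half_length = (toy_reach1.length + toy_reach2.length) / 4
junction = np.array(toy_reach1.coords[-1])
toy_bisector = LineString([
    tuple(junction - half_length * direction),
    tuple(junction + half_length * direction),
])

# Rectangular belt, 2000 m long and 400 m wide, cut by the bisector
toy_belt = Polygon([(0, -200), (2000, -200), (2000, 200), (0, 200)])
toy_pieces = list(split(toy_belt, toy_bisector).geoms)

# Assign each reach the piece it overlaps most, then report area in km^2
toy_areas_km2 = {}
for order, reach in enumerate([toy_reach1, toy_reach2], start=1):
    best = max(toy_pieces, key=lambda piece: reach.intersection(piece).length)
    toy_areas_km2[order] = best.area / 1e6

print(toy_areas_km2)
```

Note that a bisector only splits the belt if it crosses the polygon completely, which is why its length is tied to the reach lengths rather than to the belt width.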

## Initialize path to RivMapper .csv

In [4]:
csv_file_path = r"D:\Dissertation\Data\Geyman_river_datasheet.csv"

## Extract channel belt from Nyberg dataset

In [6]:
channel_belt_gdf = generate_channel_belt_masks(csv_file_path)

Unsplit channel belt shapefile saved at D:\Dissertation\Data\ChannelBelts\Extracted_ChannelBelts\Koyukuk_Huslia\Koyukuk_Huslia_channelbelt.shp
Unsplit channel belt shapefile saved at D:\Dissertation\Data\ChannelBelts\Extracted_ChannelBelts\Yukon_Beaver\Yukon_Beaver_channelbelt.shp


## Split channel belt by reach and calculate areas

In [7]:
split_channel_belts = process_channelbelt_byreach(csv_file_path)

Channel belt area by reach saved at D:\Dissertation\Data\ChannelBelts\Extracted_ChannelBelts\Koyukuk_Huslia\Koyukuk_Huslia_channelbelt_areas.csv
Split (by reach) channel belt shapefile saved at D:\Dissertation\Data\ChannelBelts\Extracted_ChannelBelts\Koyukuk_Huslia\Koyukuk_Huslia_channelbelt_split.shp
Channel belt area by reach saved at D:\Dissertation\Data\ChannelBelts\Extracted_ChannelBelts\Yukon_Beaver\Yukon_Beaver_channelbelt_areas.csv
Split (by reach) channel belt shapefile saved at D:\Dissertation\Data\ChannelBelts\Extracted_ChannelBelts\Yukon_Beaver\Yukon_Beaver_channelbelt_split.shp
