# Validating discharge estimates from SWOT SoS using observations from SWOT SHCQ

The SWOT mission aims to provide global estimates of river discharge. To ensure those estimates are credible, they need to be benchmarked against real-world observations. Many regions lack dense, high-quality gauge networks. The SWOT Hydrology Community Discharge repository (SWOT SHCQ) addresses this gap by aggregating in-situ streamflow data from a wide range of sources. These community-contributed datasets are essential for evaluating and constraining SWOT discharge estimates and directly inform the refinement of algorithms such as HiVDI by highlighting areas of strong or weak model performance.

<img src="Global-River-Discharge-from-SWOT.jpg" alt="SWOTDischarge" width="500"/>

<br>

Image Source: Michael Durand, Colin J. Gleason, Tamlin M. Pavelsky, Renato Prata de Moraes Frasson, Michael Turmon, Cédric H. David, Elizabeth H. Altenau, Nikki Tebaldi, Kevin Larnier, Jerome Monnier, Pierre Olivier Malaterre, Hind Oubanas, George H. Allen, Brian Astifan, Craig Brinkerhoff, Paul D. Bates, David Bjerklie, Stephen Coss, Robert Dudley, Luciana Fenoglio, Pierre-André Garambois, Augusto Getirana, Peirong Lin, Steven A. Margulis, Pascal Matte, J. Toby Minear, Aggrey Muhebwa, Ming Pan, Daniel Peters, Ryan Riggs, Md Safat Sikder, Travis Simmons, Cassie Stuurman, Jay Taneja, Angelica Tarpanelli, Kerstin Schulze, Mohammad J. Tourian, Jida Wang. 2023. A Framework for Estimating Global River Discharge From the Surface Water and Ocean Topography Satellite Mission, *Water Resources Research*, 59(4). [https://doi.org/10.1029/2021WR031614](https://doi.org/10.1029/2021WR031614)


Key documents: https://podaac.jpl.nasa.gov/dataset/SWOT_L4_DAWG_SOS_DISCHARGE

This notebook was written by Anthony Castronova (acastronova@cuahsi.org) and Irene Garousi-Nejad (igarousi@cuahsi.org), CUAHSI.

## 1 Set up environment

The Python cells below need to be run each time the notebook is executed. They set up the libraries needed to run in CUAHSI's JupyterHub cloud environment. You'll need valid **Earthdata** login credentials to access the data used in this notebook. If you don't already have an account, you can create one at https://urs.earthdata.nasa.gov/. **Without these credentials, you won't be able to access the required datasets or run the notebook as intended.**

In [None]:
!pip install -r requirements.txt

In [None]:
import os
import xarray
import earthaccess
import h5netcdf
import hsclient
import getpass
import numpy as np
import plotly.express as px
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# Prompt the user for Earthdata credentials interactively
username = input("Enter your Earthdata username: ")
password = getpass.getpass("Enter your Earthdata password (input hidden): ")

# Set environment variables
os.environ["EARTHDATA_USERNAME"] = username
os.environ["EARTHDATA_PASSWORD"] = password

# Login using the credentials
auth = earthaccess.login(strategy="environment")
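
Because these cells run on every execution, re-typing credentials can become tedious. The sketch below shows one possible way to prompt only when a variable is not already set in the environment. `ensure_env` is a hypothetical helper written for this illustration; it is not part of `earthaccess`.

```python
import getpass
import os


def ensure_env(name, prompt, reader=input):
    """Return os.environ[name], asking the user only when it is unset or empty.

    `ensure_env` is a hypothetical helper written for this sketch.
    """
    value = os.environ.get(name)
    if not value:
        value = reader(prompt)
        os.environ[name] = value
    return value


# Possible use before earthaccess.login(strategy="environment"):
# ensure_env("EARTHDATA_USERNAME", "Enter your Earthdata username: ")
# ensure_env("EARTHDATA_PASSWORD", "Enter your Earthdata password (input hidden): ",
#            reader=getpass.getpass)
```

Passing `getpass.getpass` as the reader keeps the password hidden while typing, as in the cell above.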

## 2 Access observed discharge data from the SWOT SHCQ Repository


The community-contributed SWOT SHCQ dataset is available through the following HydroShare collection: https://www.hydroshare.org/resource/38feeef698ca484b907b7b3eb84ad05b. 

### Explore the HydroShare collection containing SWOT SHCQ datasets


Each HydroShare record has a unique identifier at the end of its URL, which can be used in combination with the `hsclient` to programmatically access and work with the data.

In [None]:
# Define the HydroShare resource identifier
guid = '38feeef698ca484b907b7b3eb84ad05b'

# Connect to HydroShare using the hsclient Python library. Since we're not using credentials, we'll only have access to public data.
hs = hsclient.HydroShare()

# Load this resource into memory
swot_collection = hs.resource(guid)

This resource happens to contain references to other resources that contain the data we want. Display all of these associated resources so we can choose which one we want to work with.

In [None]:
def print_metadata(resource, indent=''):
    print(f"{indent}{'Title:': <20} {resource.metadata.title}")
    print(f"{indent}{'URL:': <20} {resource.metadata.url}")
    print(f"{indent}{'Subject Keywords:': <20} {resource.metadata.subjects}")
    

print('----------------')
print('Resource Summary')
print('----------------')
swot_collection = hs.resource(guid)
print_metadata(swot_collection)
print('Related Resources')

for relation in swot_collection.metadata.relations:
    # get the resource metadata for each relation
    # using a guid extracted from the relation metadata
    try:
        resource = hs.resource(relation.value.split('/')[-1])
        print()
        print_metadata(resource, indent='    ')
    except Exception:
        # we may encounter exceptions if we try to access resources
        # that we do not have permissions for.
        pass
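
The loop above takes the identifier with `split('/')[-1]`, which returns an empty string if a URL ends with a trailing slash. A slightly more defensive variant is sketched below. It assumes identifiers are 32 hexadecimal characters, as in the examples used in this notebook; `guid_from_url` is a hypothetical helper, not part of `hsclient`.

```python
import re

# Assumption: a resource identifier is 32 hex characters at the end of the text,
# optionally followed by a trailing slash.
GUID_AT_END = re.compile(r"([0-9a-fA-F]{32})/?\s*$")


def guid_from_url(text):
    """Return the identifier at the end of a resource URL, or None if absent."""
    match = GUID_AT_END.search(text)
    return match.group(1).lower() if match else None


url = "https://www.hydroshare.org/resource/38feeef698ca484b907b7b3eb84ad05b/"
guid_example = guid_from_url(url)
```

Returning `None` for text without an identifier lets the caller skip that relation explicitly instead of relying on a broad `except`.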

### Select a community-contributed SWOT SHCQ dataset of interest

Select a globally unique identifier (GUID) from the URLs provided above to download the corresponding streamflow data for analysis.

In [None]:
# Define the identifier for a resource of interest
res = hs.resource('11ddd3102dee413da781de9164bee16e')

Now that this resource is loaded into memory, we can query the files that are associated with it. Once we've identified a file that we're interested in, we can download it and begin working with it.

In [None]:
# Preview the content files
res.files()

In [None]:
# Download data from the HydroShare resource to the working directory
res.file_download('BFG_Rhine_SHCQ2_V3_2020-2024.csv')

Load this dataset using `pandas`. We'll need to do a little cleaning to fix datetime formats.

In [None]:
# Load the csv file
df = pd.read_csv('BFG_Rhine_SHCQ2_V3_2020-2024.csv')

# Set the index to the date listed in the dataset.
# The column name indicates day-first dates, so parse them as such.
df['date'] = pd.to_datetime(df["Time_('dd-mm-yyyy')"], dayfirst=True, errors='coerce')
df.set_index(df.date, inplace=True)

# Drop times that couldn't be converted
df = df[~df.index.isnull()]

# Print the first few rows
df.head()
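
The column name `Time_('dd-mm-yyyy')` suggests day-first dates, where a string such as `02-03-2021` is ambiguous unless the order is stated. The standard-library sketch below illustrates what day-first parsing with coercion amounts to: values that do not match become `None` and are then dropped. The `"%d-%m-%Y"` format is an assumption taken from the column name, and the sample strings are made up for illustration.

```python
from datetime import datetime


def parse_day_first(text, fmt="%d-%m-%Y"):
    """Parse a day-first date string; return None when it does not match."""
    try:
        return datetime.strptime(text.strip(), fmt)
    except (ValueError, AttributeError):
        # ValueError: wrong format; AttributeError: value is not a string
        return None


# Made-up sample values, not taken from the downloaded file
raw = ["13-01-2020", "02-03-2021", "not a date", None]
parsed = [parse_day_first(value) for value in raw]

# Equivalent of dropping rows whose date could not be converted
valid = [value for value in parsed if value is not None]
```

Comparing the number of rows before and after dropping unparsed dates is a quick check that the parsing step is not silently discarding observations.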

List the reach identifiers that exist in the data downloaded from HydroShare.

In [None]:
df.Reach_ID.unique()

Let's plot the observed discharge for one of these reaches.

In [None]:
# Filter data for the selected reach
reach_id = 23267000081
df.sort_index(inplace=True)
df_reach = df[df.Reach_ID == reach_id]

# Create interactive time series plot
fig = px.scatter(
    df_reach,
    x=df_reach.index,
    y='Q_(m^3/s_daily)',
    title=f"Measured River Discharge @ {reach_id}",
    labels={
        'Q_(m^3/s_daily)': 'Daily Discharge [cms]',
        'index': 'Date'
    }
)

# Customize layout
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Daily Discharge [cms]'
)

fig.show()

## 3 Access modeled discharge data from the SoS project


Next, we will retrieve modeled discharge data from the SWORD of Science (SoS) project for the same reach of interest. Make sure you are already logged in to Earthdata, as required at the beginning of this notebook.

### Search and locate SWOT SoS data granules

We've prepared a complete Jupyter notebook that offers a detailed, hands-on walkthrough for accessing, exploring, and analyzing global river discharge data derived from the SoS dataset. The notebook uses NASA's `earthaccess` library and `xarray` to access and work with cloud-hosted data. For the full learning experience, see:

Castronova, A. M., I. Garousi-Nejad (2025). Visualizing SWOT Discharge Data from the SWORD of Science (SoS) Dataset, HydroShare. http://www.hydroshare.org/resource/cdae2ab2435c4baa999c1dca091073ab

Here, we briefly outline the key steps to keep the notebook more manageable and to place particular emphasis on accessing and comparing modeled discharge with validation data from the SWOT SHCQ.

In [None]:
# Search and locate granules
granule_info = earthaccess.search_data(
    short_name="SWOT_L4_DAWG_SOS_DISCHARGE",
    temporal=("2022-01-01", "2025-01-01"),
)

# Open the NetCDF files that are stored for each granule.
files = earthaccess.open(granule_info)



We will load two groups from the SWOT SoS dataset: `reaches`, which contains metadata about each SWORD reach, and `hivdi`, which holds modeled discharge estimates from the HiVDI algorithm. Both are needed for linking reach attributes to their corresponding discharge time series.

In [None]:
%%time

print(f'Loading the "Reaches" group in file: {files[4].full_name}')
ds_reaches = xarray.open_dataset(files[4],
                                 group='reaches',
                                 engine='h5netcdf',
                                 decode_cf=False,    
                                 decode_times=False, 
                                 decode_coords=False)
ds_reaches

In [None]:
%%time

print(f'Loading the "hivdi" group in file: {files[4].full_name}')
ds_hivdi = xarray.open_dataset(files[4],
                           group='hivdi',
                           engine='h5netcdf',
                           decode_cf=False,    
                           decode_times=False, 
                           decode_coords=False,
                        )
ds_hivdi


Combine these datasets using the common `num_reaches` coordinate. This is necessary for us to select HiVDI estimates by reach name or identifier.

In [None]:
ds_merged = xarray.combine_by_coords([ds_reaches, ds_hivdi])

### Extract modeled discharge for the reach of interest

Isolate the data corresponding with the reach we have observations for.

In [None]:
# Build a boolean mask for the reach of interest and use it to filter the data
mask = (ds_merged.reach_id == reach_id).compute()
ds_filtered = ds_merged.where(mask, drop=True)
d = ds_filtered.to_dataframe().reset_index().explode(['time', 'Q'])

The SWOT SoS files store time as seconds elapsed since January 1, 2000. Convert the `time` column into standard Python datetime objects so that the discharge data can be plotted on a familiar time scale.

In [None]:
reference_date = datetime(2000, 1, 1, 0, 0, 0)
d['datetime'] = d['time'].apply(lambda x: reference_date + timedelta(seconds=x))
d.set_index('datetime', inplace=True)
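
Because the datasets were opened with `decode_cf=False`, missing times are not masked automatically and may show up as NaN or as a raw fill value. The sketch below shows the same seconds-since-2000 conversion with a guard for such values. The default `fill_value` is an assumption; the actual value should be read from the variable's `_FillValue` attribute. `seconds_to_datetime` is a hypothetical helper.

```python
import math
from datetime import datetime, timedelta

REFERENCE_DATE = datetime(2000, 1, 1)


def seconds_to_datetime(seconds, fill_value=-999999999999):
    """Convert seconds since 2000-01-01 to a datetime, or None if missing.

    The fill_value default is an assumption for this sketch.
    """
    if seconds is None:
        return None
    seconds = float(seconds)
    if math.isnan(seconds) or seconds == fill_value:
        return None
    return REFERENCE_DATE + timedelta(seconds=seconds)


one_day_later = seconds_to_datetime(86400)
```

A function like this could be passed to `d['time'].apply(...)` in place of the lambda, followed by dropping rows where the result is missing.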

## 4 Compare modeled and observed discharge

Create a plot containing the observed discharge that was collected from HydroShare and the `hivdi` model results.

In [None]:
import plotly.graph_objects as go

# Create figure
fig = go.Figure()

# Plot modeled discharge (from SoS)
fig.add_trace(go.Scatter(
    x=d.index,
    y=d.Q,
    mode='lines',
    name='SoS Modeled Discharge',
    line=dict(color='blue')
))

# Plot observed discharge
df.sort_index(inplace=True)
df_obs = df[df.Reach_ID == reach_id]
fig.add_trace(go.Scatter(
    x=df_obs.index,
    y=df_obs['Q_(m^3/s_daily)'],
    mode='markers',
    name='Observed Discharge',
    marker=dict(color='orange', size=6)
))

# Customize layout
fig.update_layout(
    title=f"Discharge Timeseries (HiVDI) for {ds_reaches.where(ds_reaches.reach_id == reach_id, drop=True).river_name.item()}, ID: {reach_id}",
    xaxis_title='Time',
    yaxis_title='Discharge [cms]',
    legend=dict(x=0.01, y=0.99),
    height=500
)

fig.show()
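
A visual comparison can be complemented with summary statistics. The sketch below is one possible follow-up, not part of the original workflow: it pairs observed and modeled values that share a calendar date and computes bias, root-mean-square error (RMSE), and Nash-Sutcliffe efficiency (NSE). The function names and the toy values are hypothetical, and matching by calendar date assumes that a daily observation is comparable to a satellite estimate from the same day.

```python
import math
from datetime import date


def match_by_date(observed, modeled):
    """Pair observed and modeled values that share a calendar date.

    Both arguments map a datetime.date to a discharge value in m^3/s.
    """
    common = sorted(set(observed) & set(modeled))
    return [(observed[day], modeled[day]) for day in common]


def validation_metrics(pairs):
    """Return bias, RMSE and Nash-Sutcliffe efficiency for (obs, mod) pairs."""
    n = len(pairs)
    mean_obs = sum(o for o, _ in pairs) / n
    squared_error = sum((m - o) ** 2 for o, m in pairs)
    bias = sum(m - o for o, m in pairs) / n
    rmse = math.sqrt(squared_error / n)
    # NSE is undefined when all observations are identical (zero variance)
    nse = 1 - squared_error / sum((o - mean_obs) ** 2 for o, _ in pairs)
    return {"bias": bias, "rmse": rmse, "nse": nse}


# Toy values for illustration only; not taken from the Rhine dataset
observed = {date(2023, 5, 1): 100.0, date(2023, 5, 2): 120.0, date(2023, 5, 3): 140.0}
modeled = {date(2023, 5, 1): 110.0, date(2023, 5, 3): 130.0, date(2023, 5, 9): 90.0}

pairs = match_by_date(observed, modeled)
metrics = validation_metrics(pairs)
```

In this notebook, the two dictionaries could plausibly be built from `df_obs` and `d` by keying each discharge value on the date part of its index, for example with `dict(zip(df_obs.index.date, df_obs['Q_(m^3/s_daily)']))`. An NSE of 1 indicates a perfect match, while values at or below 0 mean the model performs no better than the mean of the observations.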
