# Dataset Downloads

This notebook handles the download of all datasets used in this study.


In [1]:
import cdsapi
import pandas as pd
from tqdm.std import tqdm

from deeprec.utils import ROOT_DIR, download_file, download_zip

# Set download path
DL_PATH = ROOT_DIR / "data/raw"
DL_PATH.mkdir(parents=True, exist_ok=True)
print(f"Download path: {DL_PATH}")

Download path: /Users/lgentn/Repositories/deeprec/data/raw
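
The `download_file` helper imported from `deeprec.utils` is not shown in this notebook. As a rough mental model only, the sketch below shows what such a helper could look like with the standard library. The function name `download_file_sketch` and its behaviour (creating the output folder, deriving the file name from the URL) are assumptions for illustration, not the actual implementation.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Optional
from urllib.parse import unquote, urlparse


def download_file_sketch(url: str, outdir: Path, filename: Optional[str] = None) -> Path:
    """Hypothetical stand-in for `download_file`; the real helper may differ."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # Fall back to the last URL path segment, decoding escapes such as %20.
    name = filename or Path(unquote(urlparse(url).path)).name
    target = outdir / name
    # Stream the response to disk instead of holding it in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return target
```

Under this assumption, a URL that ends in a generic segment such as `/content` would produce a file called `content`, which is one reason to pass an explicit `filename`.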


## Mascons

### JPL Mascons

Global surface mass changes (land + ocean), updated monthly and provided on a 0.5-degree global grid ([Dataset description](https://grace.jpl.nasa.gov/data/get-data/jpl_global_mascons/)).

In [6]:
start_date = "2002-04-04"
end_date = "2024-12-31"

dataset = "TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06.3_V4"
outdir = DL_PATH / "targets/jpl-mascons"

In [3]:
def to_podaac_datetime(date: str) -> str:
    return pd.to_datetime(date).strftime("%Y-%m-%dT%H:%M:%SZ")
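
To make the timestamp format concrete, here is a small standard-library sketch that produces the same kind of string for plain `YYYY-MM-DD` inputs. The name `to_utc_timestamp` is hypothetical and only serves this illustration; the notebook itself uses the pandas-based helper above.

```python
from datetime import datetime


def to_utc_timestamp(date: str) -> str:
    # Parse an ISO date (YYYY-MM-DD); the time of day defaults to midnight.
    return datetime.fromisoformat(date).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_utc_timestamp("2002-04-04"))  # 2002-04-04T00:00:00Z
print(to_utc_timestamp("2024-12-31"))  # 2024-12-31T00:00:00Z
```

Note that the end date resolves to 00:00:00 on that day, not to the end of the day.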

In [None]:
!(podaac-data-downloader -c { dataset } -d { outdir } -sd { to_podaac_datetime(start_date) } -ed { to_podaac_datetime(end_date) } -e ".nc")

### CSR Mascons

[Website](https://www2.csr.utexas.edu/grace/RL06_mascons.html)

In [None]:
outdir = DL_PATH / "targets/csr-mascons"
url = "https://download.csr.utexas.edu/outgoing/grace/RL0603_mascons/CSR_GRACE_GRACE-FO_RL0603_Mascons_all-corrections.nc"
download_file(url, outdir)

### GSFC Mascons

[Website](https://earth.gsfc.nasa.gov/geo/data/grace-mascons)


In [4]:
outdir = DL_PATH / "targets/gsfc-mascons"
url = "https://earth.gsfc.nasa.gov/sites/default/files/geo/gsfc.glb_.200204_202406_rl06v2.0_obp-ice6gd_halfdegree.nc"
download_file(url, outdir)

## Hydrologic models

### WaterGAP Global Hydrology Model (WGHM) 2.2e

- [Dataset on the Goethe University Data Repository](https://gude.uni-frankfurt.de/entities/researchdata/c53bb505-a620-4860-b2a2-d5a6de74dbd9/details)
- [Preprint of paper on WaterGAP 2.2e](https://doi.org/10.5194/gmd-2023-213)

*Download links might change in the future; currently, no public API is available.*

In [2]:
outdir = DL_PATH / "inputs/watergap22e"
url = "https://api.gude.uni-frankfurt.de/api/core/bitstreams/879ce7c3-4d21-4ee1-a83c-e830b13b9d2e/content"
name = "watergap22e_gswp3-era5_tws_histsoc_monthly_1901_2022.nc"
download_file(url, outdir, filename=name)

## Weather and climate data

### ERA5

ERA5 data could be combined with the ISIMIP 20CRv3-ERA5 dataset to fill in the years 2022 and 2023, which are missing there.

- Temporal coverage: 1940 to present
- Temporal frequency:
  - hourly
  - monthly averaged by hour of day (synoptic monthly means)
  - monthly averaged
- Spatial resolution:
  - Reanalysis: 0.25° x 0.25° (atmosphere), 0.5° x 0.5° (ocean waves)
  - Mean, spread and members: 0.5° x 0.5° (atmosphere), 1° x 1° (ocean waves)


Useful links:
- [CDS: ERA5 single levels monthly means](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels-monthly-means)
- [ERA5 data documentation](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
- [How to download ERA5](https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5)

In [None]:
START_YEAR = 1940
END_YEAR = 2023
VARIABLES = [
    "total_precipitation",
    "2m_temperature",
    "2m_dewpoint_temperature",
    "high_vegetation_cover",
    "low_vegetation_cover",
    "evaporation",
    "potential_evaporation",
    "runoff",
    "snowfall",
    "snowmelt",
    "snow_depth",
    "snow_evaporation",
    "surface_pressure",
    "leaf_area_index_high_vegetation",
    "leaf_area_index_low_vegetation",
    "sub_surface_runoff",
    "surface_runoff",
    "volumetric_soil_water_layer_1",
    "volumetric_soil_water_layer_2",
    "volumetric_soil_water_layer_3",
    "volumetric_soil_water_layer_4",
]

dataset_path = DL_PATH / "inputs/era5-monthly"
# Create intermediate folders too, in case `inputs/` does not exist yet
dataset_path.mkdir(parents=True, exist_ok=True)

c = cdsapi.Client()
dataset = "reanalysis-era5-single-levels-monthly-means"
for variable in VARIABLES:
    print(f"{'=' * 40}")
    print(f"Downloading `{variable}`...")

    request = {
        "product_type": ["monthly_averaged_reanalysis"],
        "variable": [variable],
        "year": [f"{year}" for year in range(START_YEAR, END_YEAR + 1)],
        "month": [f"{month:02}" for month in range(1, 12 + 1)],
        "time": ["00:00"],
        "data_format": "netcdf",
        "download_format": "unarchived",
    }
    dataset_file = dataset_path / f"era5-monthly_{variable}_{START_YEAR}-{END_YEAR}.nc"
    c.retrieve(dataset, request, dataset_file)

print("Download completed.")
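
The loop above requests every variable each time the cell runs. If a rerun should only fetch what is missing, one option is to filter the variable list first. The sketch below assumes the same file naming pattern as the cell above; the helper name `pending_variables` is hypothetical.

```python
from pathlib import Path


def pending_variables(variables, dataset_path: Path, start_year: int, end_year: int):
    """Return (variable, target) pairs whose target file is not on disk yet."""
    pending = []
    for variable in variables:
        # Same naming pattern as in the download loop above.
        target = dataset_path / f"era5-monthly_{variable}_{start_year}-{end_year}.nc"
        if not target.exists():
            pending.append((variable, target))
    return pending
```

The download loop could then iterate over `pending_variables(VARIABLES, dataset_path, START_YEAR, END_YEAR)` instead of `VARIABLES`. A file left behind by an interrupted download would still count as present, so this check is only a convenience.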

### NOAA Reconstructed Sea Surface Temperature

Download sea surface temperatures (SST) for calculating the Oceanic Niño Index (ONI). The ONI climate indices provided by [NOAA](https://psl.noaa.gov/data/climateindices/list/) only go back to 1950.

[Dataset description](https://psl.noaa.gov/data/gridded/data.noaa.ersst.v5.html)

In [2]:
outdir = DL_PATH / "inputs/noaa-ersst-v5"
url = "https://downloads.psl.noaa.gov/Datasets/noaa.ersst.v5/sst.mnmean.nc"
download_file(url, outdir)
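
The ONI is commonly defined as the 3-month running mean of SST anomalies in the Niño 3.4 region. The sketch below illustrates only that arithmetic on a synthetic, already region-averaged monthly series in plain Python. It uses a single fixed base period, which is a simplification, and the function names are hypothetical; it is not the calculation used elsewhere in this project.

```python
from statistics import mean


def monthly_anomalies(sst, base_start, base_stop):
    """Subtract the calendar-month mean of the base period from each value.

    `sst` is a monthly series starting in January. `base_start` and
    `base_stop` are indices delimiting the base period; `base_start`
    must be a multiple of 12.
    """
    clim = [mean(sst[i] for i in range(base_start + m, base_stop, 12)) for m in range(12)]
    return [value - clim[i % 12] for i, value in enumerate(sst)]


def running_mean_3(values):
    """Centered 3-month running mean; the first and last months are dropped."""
    return [mean(values[i - 1 : i + 2]) for i in range(1, len(values) - 1)]


# Synthetic example: a seasonal cycle, with the second year 1 °C warmer.
sst = [20.0 + m for m in range(12)] + [21.0 + m for m in range(12)]
anomalies = monthly_anomalies(sst, base_start=0, base_stop=24)
oni = running_mean_3(anomalies)
print(len(oni), oni[0], oni[-1])
```

With real data, the input series would first be averaged over the Niño 3.4 region from the downloaded SST grid.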

## Human influences

### ISIMIP Land Use

[Dataset description](https://www.isimip.org/gettingstarted/input-data-bias-adjustment/details/82/)

In [9]:
outdir = DL_PATH / "inputs/landuse"
urls = {
    "5crops": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-5crops_histsoc_annual_1901_2021.nc",
    "15crops": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-15crops_histsoc_annual_1901_2021.nc",
    "pastures": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-pastures_histsoc_annual_1901_2021.nc",
    "totals": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-totals_histsoc_annual_1901_2021.nc",
    "urbanareas": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-urbanareas_histsoc_annual_1901_2021.nc",
}
for name, url in (pbar := tqdm(urls.items(), desc="Downloading landuse")):
    pbar.set_postfix_str(name)
    download_file(url, outdir)

Downloading landuse: 100%|██████████| 5/5 [00:11<00:00,  2.23s/it, urbanareas]


### ISIMIP Lake area fraction

[Dataset description](https://www.isimip.org/gettingstarted/input-data-bias-adjustment/details/132/)

In [3]:
outdir = DL_PATH / "inputs/pctlake"
url = "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/lakes/histsoc/pctlake_histsoc_1901_2021.nc"
download_file(url, outdir)

## Shapes

### GRDC Major River Basins

Major river basins provided by the Global Runoff Data Centre (GRDC). The basins are named and incorporate HydroBASINS data. They are available as Shapefile and GeoJSON.

- [Description](https://www.bafg.de/GRDC/EN/02_srvcs/22_gslrs/221_MRB/riverbasins_node.html)
- [Map and download](https://mrb.grdc.bafg.de/)

In [2]:
url = "https://grdc.bafg.de/downloads/GRDC_Major_River_Basins_shp.zip"
path = DL_PATH / "shapefiles/mrb"

download_zip(url, path)
print("Download completed.")

Download completed.
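
The `download_zip` helper is also imported from `deeprec.utils` and not shown here. A minimal sketch of the general idea (fetch an archive, then extract it into the target folder) could look as follows. The name `download_zip_sketch` and all details are assumptions for illustration; the real helper may behave differently.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def download_zip_sketch(url: str, outdir: Path) -> list:
    """Hypothetical stand-in for `download_zip`; returns the extracted member names."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # Read the whole archive into memory, since ZipFile needs a seekable stream.
    with urllib.request.urlopen(url) as response:
        payload = io.BytesIO(response.read())
    with zipfile.ZipFile(payload) as archive:
        archive.extractall(outdir)
        return archive.namelist()
```

Buffering the archive in memory keeps the sketch short; for large archives, writing to a temporary file first would be the safer choice.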


### Natural Earth 1:50m countries

[Description](https://www.naturalearthdata.com/downloads/50m-cultural-vectors/50m-admin-0-countries-2/)

In [3]:
url = "https://naturalearth.s3.amazonaws.com/5.0.1/50m_cultural/ne_50m_admin_0_countries.zip"
path = DL_PATH / "shapefiles/naturalearth"

download_zip(url, path)
print("Download completed.")

Download completed.


## Previous TWS reconstructions

### Humphrey, 2019

[Data on figshare](https://figshare.com/articles/dataset/GRACE-REC_A_reconstruction_of_climate-driven_water_storage_changes_over_the_last_century/7670849)

In [4]:
url = "https://figshare.com/ndownloader/files/17990285"
path = DL_PATH / "reconstructions/humphrey"
path.mkdir(parents=True, exist_ok=True)

download_zip(url, path)
print("Download completed.")

Download completed.


### Li, 2021

This dataset cannot be downloaded with `requests`. Please download the file with your web browser and place it in the folder created below.

[Download here on Dryad](https://datadryad.org/stash/dataset/doi:10.5061/dryad.z612jm6bt)

In [None]:
path = DL_PATH / "reconstructions/li"
path.mkdir(parents=True, exist_ok=True)
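
Because this file has to be placed by hand, a small check can catch a missing download before later steps rely on it. The sketch below only verifies that the folder contains at least one file; the helper name `require_files` is hypothetical, and no assumption is made about the file name or format.

```python
from pathlib import Path


def require_files(folder: Path) -> list:
    """Return the sorted file names in `folder`, or raise if there are none."""
    files = sorted(p.name for p in Path(folder).iterdir() if p.is_file())
    if not files:
        raise FileNotFoundError(
            f"No files found in {folder}; download the dataset manually first."
        )
    return files
```

Calling `require_files(path)` after the manual download would list what was found in the folder.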

### Yin, 2023

[Data on Zenodo](https://zenodo.org/records/10040927)

In [6]:
path = DL_PATH / "reconstructions/yin"
urls = [
    "https://zenodo.org/records/10040927/files/CSR-based%20GTWS-MLrec%20TWS.nc",
    "https://zenodo.org/records/10040927/files/GSFC-based%20GTWS-MLrec%20TWS.nc",
    "https://zenodo.org/records/10040927/files/JPL-based%20GTWS-MLrec%20TWS.nc",
]

for url in tqdm(urls):
    download_file(url, path)


100%|██████████| 3/3 [04:40<00:00, 93.53s/it] 


### Palazzoli, 2025

[Data on Zenodo](https://zenodo.org/records/10953658)

In [7]:
url = "https://zenodo.org/records/10953658/files/GRAiCE_BiLSTM.nc"
path = DL_PATH / "reconstructions/palazzoli"

download_file(url, path)
print("Download completed.")


Download completed.


## Sea Level Rise Contributors

Download the global mean sea level (GMSL) time series and its contributors from [Frederikse et al. (2020)](https://doi.org/10.1038/s41586-020-2591-3).

[Data on Zenodo](https://zenodo.org/records/3862995)

In [3]:
url = "https://zenodo.org/records/3862995/files/global_basin_timeseries.xlsx"
path = DL_PATH / "eval/sea-level/frederikse"

path.mkdir(parents=True, exist_ok=True)
download_file(url, path)
print("Download completed.")


Download completed.


## Extreme Event Intensity

Download the intensity of hydroclimatic extreme events from [Rodell & Li (2023)](https://doi.org/10.1038/s44221-023-00040-5).

[Data on Zenodo](https://doi.org/10.5281/zenodo.7599831)

In [5]:
url = "https://zenodo.org/records/7599831/files/Figure2_data.xlsx"
path = DL_PATH / "eval/intensity/rodell"

path.mkdir(parents=True, exist_ok=True)
download_file(url, path)
print("Download completed.")

Download completed.
