# Dataset Downloads

This notebook handles the download of all datasets used in this study.


In [1]:
import codecs
import os
import zipfile

import cdsapi
import numpy as np
import pandas as pd
import xarray as xr
from isimip_client.client import ISIMIPClient
from tqdm.std import tqdm, trange

from deepwaters.utils import ROOT_DIR, download_file, download_zip

# Set download path
DL_PATH = ROOT_DIR / "data/raw"
DL_PATH.mkdir(parents=True, exist_ok=True)
print(f"Download path: {DL_PATH}")

Download path: C:\Users\luisg\Repositories\deep-waters\data\raw


## Mascons

### JPL Mascons

Global surface mass changes (land and ocean), updated monthly and provided on a 0.5-degree global grid ([Dataset description](https://grace.jpl.nasa.gov/data/get-data/jpl_global_mascons/)).

In [7]:
start_date = "2002-04-04"
end_date = "2023-12-31"

dataset = "TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06.1_V3"
outdir = DL_PATH / "targets/jpl-mascons"

In [8]:
def to_podaac_datetime(date: str) -> str:
    return pd.to_datetime(date).strftime("%Y-%m-%dT%H:%M:%SZ")
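
For reference, `podaac-data-downloader` expects its `-sd`/`-ed` arguments as ISO 8601 timestamps with a trailing `Z`. The sketch below reproduces the conversion with the standard library only. `to_podaac_datetime_stdlib` is a hypothetical name, and the function assumes plain `YYYY-MM-DD` input, whereas the `pandas` version above accepts many more date formats.

```python
from datetime import datetime


def to_podaac_datetime_stdlib(date: str) -> str:
    """Format an ISO date (YYYY-MM-DD) as YYYY-MM-DDTHH:MM:SSZ."""
    return datetime.fromisoformat(date).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_podaac_datetime_stdlib("2002-04-04"))  # 2002-04-04T00:00:00Z
print(to_podaac_datetime_stdlib("2023-12-31"))  # 2023-12-31T00:00:00Z
```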

In [9]:
!(podaac-data-downloader -c { dataset } -d { outdir } -sd { to_podaac_datetime(start_date) } -ed { to_podaac_datetime(end_date) } -e ".nc")

[2023-12-22 22:46:16,344] {podaac_data_downloader.py:270} INFO - Found 1 total files to download
[2023-12-22 22:46:16,504] {podaac_data_downloader.py:305} INFO - 2023-12-22 22:46:16.504769 SKIPPED: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06.1_V3/GRCTellus.JPL.200204_202309.GLO.RL06.1M.MSCNv03CRI.nc
[2023-12-22 22:46:16,504] {podaac_data_downloader.py:324} INFO - Downloaded Files: 0
[2023-12-22 22:46:16,504] {podaac_data_downloader.py:325} INFO - Failed Files:     0
[2023-12-22 22:46:16,504] {podaac_data_downloader.py:326} INFO - Skipped Files:    1
[2023-12-22 22:46:16,504] {podaac_data_downloader.py:334} INFO - END




Scale factors file:

In [10]:
url = "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-docs/tellus/open/L3/mascon/docs/CLM4.SCALE_FACTOR.JPL.MSCNv03CRI.nc"
download_file(url, outdir)

Placement file:

In [11]:
url = "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-docs/tellus/open/L3/mascon/docs/JPL_MSCNv03_PLACEMENT.nc"
download_file(url, outdir)

### CSR Mascons

In [12]:
outdir = DL_PATH / "targets/csr-mascons"
url = "http://download.csr.utexas.edu/outgoing/grace/RL0602_mascons/CSR_GRACE_GRACE-FO_RL0602_Mascons_all-corrections.nc"
download_file(url, outdir)

### GSFC Mascons

[Website](https://earth.gsfc.nasa.gov/geo/data/grace-mascons)

Summary:
> Global mascon solution where the GAD product has been restored, meaning the ocean mascons describe ocean bottom pressure (OBP). This product is comparable to the JPL and CSR mascon products. The 1-arc-degree equal area values have been placed on an equal angle 0.5x0.5 degree grid. Land values are determined with a least squares estimator that conserves mass over each region, while ocean values have been interpolated/extrapolated.

Grid: 0.5x0.5

In [2]:
outdir = DL_PATH / "targets/gsfc-mascons"
url = "https://earth.gsfc.nasa.gov/sites/default/files/geo/gsfc.glb_.200204_202309_rl06v2.0_obp-ice6gd_halfdegree.nc"
download_file(url, outdir)

## Hydrologic models

### WaterGAP Global Hydrology Model (WGHM) 2.2e

- [Dataset on the Goethe University Data Repository](https://gude.uni-frankfurt.de/entities/researchdata/c53bb505-a620-4860-b2a2-d5a6de74dbd9/details)
- [Preprint of paper on WaterGAP 2.2e](https://doi.org/10.5194/gmd-2023-213)

*Download links might change in the future; there is currently no public API available.*

In [4]:
outdir = DL_PATH / "inputs/watergap22e"
urls = {
    "20crv-era5": "https://api.gude.uni-frankfurt.de/api/core/bitstreams/07183cd6-9d47-4cb2-bc60-00436b0ecd39/content",
    "gswp3-era5": "https://api.gude.uni-frankfurt.de/api/core/bitstreams/879ce7c3-4d21-4ee1-a83c-e830b13b9d2e/content",
    "20crv-w5e5": "https://api.gude.uni-frankfurt.de/api/core/bitstreams/2b22924a-0981-4f6c-886c-542d19db7783/content",
    "gswp3-w5e5": "https://api.gude.uni-frankfurt.de/api/core/bitstreams/adee0d04-c414-420e-85ef-89d3e83e32e9/content",
}
for name, url in (pbar := tqdm(urls.items())):
    pbar.set_postfix_str(f"Downloading watergap22e_{name}")
    download_file(url, outdir)

100%|██████████| 4/4 [00:12<00:00,  3.12s/it, Downloading watergap22e_gswp3-w5e5]


## Weather data
### ISIMIP 20CRv3-ERA5

ISIMIP3a dataset covering 1901-2021 on a 0.5°x0.5° lat-lon grid.
Combines 20CRv3 (1901-1978), homogenized to ERA5, with ERA5 (1979 - present).

Useful links:
- [Dataset description](https://www.isimip.org/gettingstarted/input-data-bias-adjustment/details/105/)
- [Download link (database)](https://data.isimip.org/search/tree/ISIMIP3a/InputData/climate/atmosphere/20crv3-era5/)
- [The Twentieth Century Reanalysis Project](https://www.psl.noaa.gov/data/20thC_Rean/)
- [NOAA/CIRES/DOE 20th Century Reanalysis (V3)](https://www.psl.noaa.gov/data/gridded/data.20thC_ReanV3.html)

| Variable | Description |
|----------|-------------|
| hurs | Near-surface relative humidity |
| huss | Near-surface specific humidity |
| sfcWind | Near-surface wind speed |
| tas | Daily mean temperature |
| tasmin | Daily minimum temperature |
| uas | Eastward near-surface wind |
| rlds | Surface downwelling longwave radiation |
| rsds | Surface downwelling shortwave radiation |
| ps | Surface air pressure |
| pr | Total precipitation |
In [13]:
isimip_out = DL_PATH / "inputs/isimip-climate"
client = ISIMIPClient()

# search the ISIMIP repository using specifiers
response = client.datasets(
    simulation_round="ISIMIP3a",
    product="InputData",
    climate_forcing=["20crv3-era5", "20crv3-w5e5"],
    climate_scenario="obsclim",
    climate_variable=["pr", "tas"],
)

In [14]:
# List returned datasets and count files
filenum = 0

for dataset in response["results"]:
    print(dataset["path"])
    filenum += len(dataset["files"])

ISIMIP3a/InputData/climate/atmosphere/obsclim/global/daily/historical/20CRv3-ERA5/20crv3-era5_obsclim_pr_global_daily
ISIMIP3a/InputData/climate/atmosphere/obsclim/global/daily/historical/20CRv3-ERA5/20crv3-era5_obsclim_tas_global_daily
ISIMIP3a/InputData/climate/atmosphere/obsclim/global/daily/historical/20CRv3-W5E5/20crv3-w5e5_obsclim_pr_global_daily
ISIMIP3a/InputData/climate/atmosphere/obsclim/global/daily/historical/20CRv3-W5E5/20crv3-w5e5_obsclim_tas_global_daily


In [None]:
# Loop over files and download
with trange(filenum) as pbar:
    for dataset in response["results"]:
        for file in dataset["files"]:
            # Show the current file in the progress bar instead of printing over it
            pbar.set_postfix_str(file["name"])
            client.download(file["file_url"], path=isimip_out / dataset["name"])
            pbar.update()
print("Download completed.")

### ERA5

ERA5 data could be combined with the ISIMIP 20CRv3-ERA5 dataset to fill in the years 2022 and 2023, which the ISIMIP dataset (1901-2021) does not cover.

- Temporal coverage: 1940 to present
- Temporal frequency:
  - hourly
  - monthly averaged by hour of day (synoptic monthly means)
  - monthly averaged
- Spatial resolution:
  - Reanalysis: 0.25° x 0.25° (atmosphere), 0.5° x 0.5° (ocean waves)
  - Mean, spread and members: 0.5° x 0.5° (atmosphere), 1° x 1° (ocean waves)


Useful links:
- [CDS: ERA5 single levels monthly means](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels-monthly-means)
- [ERA5 data documentation](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
- [How to download ERA5](https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5)

In [2]:
START_YEAR = 1940
END_YEAR = 2023
VARIABLES = [
    "total_precipitation",
    "2m_temperature",
    "2m_dewpoint_temperature",
    "high_vegetation_cover",
    "low_vegetation_cover",
    "evaporation",
    "potential_evaporation",
    "runoff",
    "surface_pressure",
    "leaf_area_index_high_vegetation",
    "leaf_area_index_low_vegetation",
    "sub_surface_runoff",
    "surface_runoff",
    "volumetric_soil_water_layer_1",
    "volumetric_soil_water_layer_2",
    "volumetric_soil_water_layer_3",
    "volumetric_soil_water_layer_4",
]


dataset_path = DL_PATH / "inputs/era5-monthly"
dataset_path.mkdir(parents=True, exist_ok=True)

c = cdsapi.Client()
for variable in VARIABLES:
    print("=" * 40)
    print(f"Downloading `{variable}`...")
    c.retrieve(
        "reanalysis-era5-single-levels-monthly-means",
        {
            "product_type": "monthly_averaged_reanalysis",
            "variable": variable,
            "year": [f"{year}" for year in range(START_YEAR, END_YEAR + 1)],
            "month": [f"{month:02}" for month in range(1, 12 + 1)],
            "time": "00:00",
            # "grid": [0.5, 0.5],
            "format": "netcdf",
        },
        dataset_path / f"era5-monthly_{variable}_{START_YEAR}-{END_YEAR}.nc",
    )
print("Download completed.")

Downloading `volumetric_soil_water_layer_1`...


2024-03-14 22:39:44,474 INFO Welcome to the CDS
2024-03-14 22:39:44,475 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-single-levels-monthly-means
2024-03-14 22:39:44,686 INFO Request is queued
2024-03-15 00:56:23,625 INFO Request is running
2024-03-15 01:08:25,550 INFO Request is completed
2024-03-15 01:08:25,551 INFO Downloading https://download-0003-clone.copernicus-climate.eu/cache-compute-0003/cache/data5/adaptor.mars.internal-1710461141.9315538-10566-4-e952aa10-eed0-4001-956c-3f5a8ad94fbc.nc to C:\Users\luisg\Repositories\deep-waters\data\raw\inputs\era5-monthly\era5-monthly_volumetric_soil_water_layer_1_1940-2023.nc (1.9G)
2024-03-15 01:09:13,939 INFO Download rate 41.3M/s  
2024-03-15 01:09:13,999 INFO Welcome to the CDS
2024-03-15 01:09:13,999 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-single-levels-monthly-means
2024-03-15 01:09:14,083 INFO Request is queued


Downloading `volumetric_soil_water_layer_2`...


2024-03-15 03:05:49,819 INFO Request is running
2024-03-15 03:15:51,435 INFO Request is completed
2024-03-15 03:15:51,435 INFO Downloading https://download-0014-clone.copernicus-climate.eu/cache-compute-0014/cache/data6/adaptor.mars.internal-1710468887.9706395-17953-7-5667c1f5-12c6-4273-9675-61335ac12eac.nc to C:\Users\luisg\Repositories\deep-waters\data\raw\inputs\era5-monthly\era5-monthly_volumetric_soil_water_layer_2_1940-2023.nc (1.9G)
2024-03-15 03:17:39,744 INFO Download rate 18.4M/s  
2024-03-15 03:17:40,080 INFO Welcome to the CDS
2024-03-15 03:17:40,080 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-single-levels-monthly-means
2024-03-15 03:17:40,166 INFO Request is queued


Downloading `volumetric_soil_water_layer_3`...


2024-03-15 05:42:20,626 INFO Request is running
2024-03-15 05:54:23,812 INFO Request is completed
2024-03-15 05:54:23,812 INFO Downloading https://download-0018.copernicus-climate.eu/cache-compute-0018/cache/data4/adaptor.mars.internal-1710478416.4711475-25374-7-20600bea-d898-49e0-8569-0d001d1042d7.nc to C:\Users\luisg\Repositories\deep-waters\data\raw\inputs\era5-monthly\era5-monthly_volumetric_soil_water_layer_3_1940-2023.nc (1.9G)
2024-03-15 05:57:45,825 INFO Download rate 9.9M/s   
2024-03-15 05:57:46,159 INFO Welcome to the CDS
2024-03-15 05:57:46,160 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-single-levels-monthly-means
2024-03-15 05:57:46,238 INFO Request is queued


Downloading `volumetric_soil_water_layer_4`...


2024-03-15 08:50:31,150 INFO Request is running
2024-03-15 09:04:33,274 INFO Request is completed
2024-03-15 09:04:33,274 INFO Downloading https://download-0014-clone.copernicus-climate.eu/cache-compute-0014/cache/data8/adaptor.mars.internal-1710489730.975898-18636-16-91e6800a-38de-4d5a-afdd-b559f2d2ef47.nc to C:\Users\luisg\Repositories\deep-waters\data\raw\inputs\era5-monthly\era5-monthly_volumetric_soil_water_layer_4_1940-2023.nc (1.9G)
2024-03-15 09:06:00,150 INFO Download rate 23M/s    


Download completed.
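
Each request in the log above spent roughly two hours in the CDS queue, so re-running the whole cell after an interruption is expensive. One possible way to resume is to request only the variables whose target file is still missing. The sketch below covers just that bookkeeping. `pending_targets` is a hypothetical helper, it assumes the file naming used in the cell above, and it does not detect partially downloaded files.

```python
import tempfile
from pathlib import Path


def pending_targets(variables, directory, start_year, end_year):
    """Return (variable, target) pairs whose target file does not exist yet."""
    directory = Path(directory)
    pending = []
    for variable in variables:
        target = directory / f"era5-monthly_{variable}_{start_year}-{end_year}.nc"
        if not target.exists():
            pending.append((variable, target))
    return pending


# Demo in a temporary directory: pretend `runoff` was already downloaded
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "era5-monthly_runoff_1940-2023.nc").touch()
    todo = pending_targets(["runoff", "evaporation"], tmp, 1940, 2023)
    print([variable for variable, _ in todo])  # ['evaporation']
```

In the download loop, iterating over `pending_targets(VARIABLES, dataset_path, START_YEAR, END_YEAR)` instead of `VARIABLES` would then skip the files that are already on disk.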


### CPC Soil Moisture V2
[Data description](https://psl.noaa.gov/data/gridded/data.cpcsoil.html)
- Temporal coverage: 1948/01 to present
- Spatial resolution: 0.5° x 0.5°

In [2]:
outdir = DL_PATH / "inputs/cpc-soil"
url = "https://downloads.psl.noaa.gov/Datasets/cpcsoil/soilw.mon.mean.v2.nc"
download_file(url, outdir)

### NOAA Reconstructed Sea Surface Temperature

Download sea surface temperatures (SST) for calculating the Oceanic Niño Index (ONI). The ONI climate indices provided by [NOAA](https://psl.noaa.gov/data/climateindices/list/) only go back to 1950.

[Dataset description](https://psl.noaa.gov/data/gridded/data.noaa.ersst.v5.html)

In [2]:
outdir = DL_PATH / "inputs/noaa-ersst-v5"
url = "https://downloads.psl.noaa.gov/Datasets/noaa.ersst.v5/sst.mnmean.nc"
download_file(url, outdir)
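
The ONI is defined as the 3-month running mean of SST anomalies in the Niño 3.4 region. A minimal sketch of that running mean, using only the standard library, is given below. The function name and the input values are illustrative placeholders, not real data, and computing the anomalies themselves (area averaging and subtracting a climatology) is not shown.

```python
def running_mean_3(values):
    """Centered 3-point running mean; the first and last points are dropped."""
    return [sum(values[i - 1 : i + 2]) / 3 for i in range(1, len(values) - 1)]


# Placeholder monthly Niño 3.4 anomalies in °C (made up for illustration)
anomalies = [0.3, 0.6, 0.9, 1.2, 0.9]
oni = running_mean_3(anomalies)
print([round(value, 2) for value in oni])
```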

### Glacier mass change

[Dataset description](https://cds.climate.copernicus.eu/cdsapp#!/dataset/derived-gridded-glacier-mass-change?tab=overview)

In [18]:
START_YEAR = 1975
END_YEAR = 2021

dataset_path = DL_PATH / "inputs/wgms-fog"
dataset_path.mkdir(parents=True, exist_ok=True)
dataset_file = dataset_path / "wgms_fog_2023_09.zip"

c = cdsapi.Client()

c.retrieve(
    "derived-gridded-glacier-mass-change",
    {
        "variable": "glacier_mass_change",
        "product_version": "wgms_fog_2023_09",
        "format": "zip",
        "hydrological_year": [
            f"{year}_{(year + 1) % 100 :02}" for year in range(START_YEAR, END_YEAR + 1)
        ],
    },
    dataset_file,
)

print("Download completed.")
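
The `hydrological_year` labels in the request follow the pattern `YYYY_YY`: the start year plus the last two digits of the following year. A quick check of what the expression produces, including the century rollover, is sketched below. `hydrological_year_label` is a hypothetical helper name.

```python
def hydrological_year_label(year: int) -> str:
    """Build a label such as '1975_76' from the starting year."""
    return f"{year}_{(year + 1) % 100:02}"


labels = [hydrological_year_label(year) for year in (1975, 1999, 2021)]
print(labels)  # ['1975_76', '1999_00', '2021_22']
```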

In [None]:
# Extract ZIP (the context manager closes the archive afterwards)
with zipfile.ZipFile(dataset_file) as zip_file:
    zip_file.extractall(dataset_path)

## Climate Indices

- [Climate indices from NOAA](https://psl.noaa.gov/data/climateindices/list)
- [Climate indices from CPC](https://www.cpc.ncep.noaa.gov/data/indices/ersst5.nino.mth.91-20.ascii)

In [2]:
# Download CPC Niño region SST indices (Niño 1+2, 3, 4, and 3.4)
outdir = DL_PATH / "inputs/climate-indices"
url = "https://www.cpc.ncep.noaa.gov/data/indices/ersst5.nino.mth.91-20.ascii"
download_file(url, outdir)

In [3]:
with codecs.open(outdir / "ersst5.nino.mth.91-20.ascii", encoding="utf-8-sig") as f:
    # Split each non-empty line into its whitespace-separated fields
    nino = np.array([line.split() for line in f if line.strip()])
# Skip the header row and keep only the anomaly columns (Niño 1+2, 3, 4, 3.4)
anomalies = nino[1:, (3, 5, 7, 9)].astype("float32")
# Convert to Xarray Dataset (the time index assumes the first row is January 1950)
time_idx = pd.date_range("1950-01-01", periods=len(anomalies), freq="MS")
ds = xr.Dataset(
    {
        "nino_12": ("time", anomalies[:, 0]),
        "nino_3": ("time", anomalies[:, 1]),
        "nino_4": ("time", anomalies[:, 2]),
        "nino_34": ("time", anomalies[:, 3]),
    },
    coords={"time": time_idx},
)
ds.to_netcdf(outdir / "cpc-ersst5-nino-anomalies.nc")
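
The cell above assumes that the first data row is January 1950. A more defensive sketch derives the dates from the year and month columns instead. It assumes the file keeps its current layout: a header row followed by the columns `YR MON NINO1+2 ANOM NINO3 ANOM NINO4 ANOM NINO3.4 ANOM`. The sample rows are made-up placeholders, not real index values, and `parse_nino_rows` is a hypothetical helper.

```python
from datetime import date


def parse_nino_rows(lines):
    """Return (dates, anomalies); anomalies maps a region name to a list of floats."""
    names = ("nino_12", "nino_3", "nino_4", "nino_34")
    cols = (3, 5, 7, 9)  # anomaly columns, as in the cell above
    dates = []
    anomalies = {name: [] for name in names}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        dates.append(date(int(fields[0]), int(fields[1]), 1))
        for name, col in zip(names, cols):
            anomalies[name].append(float(fields[col]))
    return dates, anomalies


sample = [
    "YR MON NINO1+2 ANOM NINO3 ANOM NINO4 ANOM NINO3.4 ANOM",
    "1950 1 23.0 -1.0 24.0 -2.0 27.0 -3.0 25.0 -4.0",  # placeholder values
    "1950 2 24.0 -0.5 25.0 -1.5 27.5 -2.5 25.5 -3.5",  # placeholder values
]
dates, anomalies = parse_nino_rows(sample)
print(dates[0], anomalies["nino_34"])
```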

## Human influences

### ISIMIP Land Use

[Data set description](https://www.isimip.org/gettingstarted/input-data-bias-adjustment/details/82/)

In [4]:
outdir = DL_PATH / "inputs/landuse"
urls = {
    "5crops": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-5crops_histsoc_annual_1901_2021.nc",
    "15crops": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-15crops_histsoc_annual_1901_2021.nc",
    "pastures": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-pastures_histsoc_annual_1901_2021.nc",
    "totals": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-totals_histsoc_annual_1901_2021.nc",
    "urbanareas": "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/landuse/histsoc/landuse-urbanareas_histsoc_annual_1901_2021.nc",
}
for name, url in (pbar := tqdm(urls.items(), desc="Downloading landuse")):
    pbar.set_postfix_str(name)
    download_file(url, outdir)

Downloading landuse: 100%|██████████| 5/5 [00:11<00:00,  2.38s/it, urbanareas]


### ISIMIP Lake area fraction

[Data set description](https://www.isimip.org/gettingstarted/input-data-bias-adjustment/details/132/)

In [3]:
outdir = DL_PATH / "inputs/pctlake"
url = "https://files.isimip.org/ISIMIP3a/InputData/socioeconomic/lakes/histsoc/pctlake_histsoc_1901_2021.nc"
download_file(url, outdir)

## Basin shapes
### HydroBASINS

The HydroBASINS dataset depicts sub-basin boundaries at a global scale. It consists of a series of vectorized polygon layers, available for 12 different Pfafstetter levels (with decreasing basin sizes). It was created by the HydroSHEDS project on behalf of the World Wildlife Fund. Cite: [10.1002/hyp.9740](https://doi.org/10.1002/hyp.9740)

File names follow the syntax:

    Hybas_XX_levYY_v1c.shp

where XX indicates the region and YY indicates the Pfafstetter level (01-12). The regional extents
are defined by a two-letter identifier:

| Identifier | Region                      |
|------------|-----------------------------|
| af         | Africa                      |
| ar         | North American Arctic       |
| as         | Central and South-East Asia |
| au         | Australia and Oceania       |
| eu         | Europe and Middle East      |
| gr         | Greenland                   |
| na         | North America and Caribbean |
| sa         | South America               |
| si         | Siberia                     |
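
The naming scheme can be sketched as a small helper that validates the region identifier and zero-pads the level. `hybas_filename` is a hypothetical function, and the lowercase `hybas_` prefix is an assumption taken from the download URLs used in this notebook.

```python
REGIONS = ("af", "ar", "as", "au", "eu", "gr", "na", "sa", "si")


def hybas_filename(region: str, level: int) -> str:
    """Build a shapefile name for a region identifier and Pfafstetter level."""
    if region not in REGIONS:
        raise ValueError(f"Unknown region: {region!r}")
    if not 1 <= level <= 12:
        raise ValueError("Pfafstetter level must be between 1 and 12")
    return f"hybas_{region}_lev{level:02d}_v1c.shp"


print(hybas_filename("eu", 6))  # hybas_eu_lev06_v1c.shp
```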

In [None]:
# Identifiers of basin regions
hybas_ids = ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
hbas_out = DL_PATH / "basins/hybas"

# Use `region_id` to avoid shadowing the built-in `id`
for region_id in tqdm(hybas_ids):
    hybas_url = (
        "https://data.hydrosheds.org/file/HydroBASINS/standard/"
        f"hybas_{region_id}_lev01-06_v1c.zip"
    )
    download_zip(url=hybas_url, path=hbas_out)

print("Download completed.")

### TRIP

Major River Basin Templates from the [Total Runoff Integrating Pathways (TRIP)](https://hydro.iis.u-tokyo.ac.jp/~taikan/TRIPDATA/TRIPDATA.html) project.

DOI: [10.1175/1087-3562(1998)002%3C0001:DOTRIP%3E2.3.CO;2](https://doi.org/10.1175/1087-3562(1998)002%3C0001:DOTRIP%3E2.3.CO;2)

In [19]:
# Download netCDF file
url = "https://hydro.iis.u-tokyo.ac.jp/~taikan/TRIPDATA/Data/trip_0.5x0.5.nc"
path = DL_PATH / "basins/trip"
download_file(url, path)

In [20]:
# Download index file (containing basin names)
url = "https://hydro.iis.u-tokyo.ac.jp/~taikan/TRIPDATA/Data/rivnum05.txt"
download_file(url, path)

### GRDC Major River Basins

Major river basins from the Global Runoff Data Centre (GRDC). The basins incorporate HydroBASINS data and are named. They are available as Shapefile and GeoJSON.

- [Description](https://www.bafg.de/GRDC/EN/02_srvcs/22_gslrs/221_MRB/riverbasins_node.html)
- [Map and download](https://mrb.grdc.bafg.de/)

In [21]:
url = "https://www.bafg.de/SharedDocs/ExterneLinks/GRDC/mrb_shp_zip.zip?__blob=publicationFile"
path = DL_PATH / "basins/mrb"

download_zip(url, path)
print("Download completed.")

Download completed.


## Country shapes
### Natural Earth 1:50m countries

[Description](https://www.naturalearthdata.com/downloads/50m-cultural-vectors/50m-admin-0-countries-2/)

In [27]:
url = "https://naturalearth.s3.amazonaws.com/5.0.1/50m_cultural/ne_50m_admin_0_countries.zip"
path = DL_PATH / "countries/naturalearth"

download_zip(url, path)
print("Download completed.")

Download completed.


## Comparison products
### Humphrey's GRACE-REC

[Data on figshare](https://figshare.com/articles/dataset/GRACE-REC_A_reconstruction_of_climate-driven_water_storage_changes_over_the_last_century/7670849)

In [3]:
url = "https://figshare.com/ndownloader/files/17990285"
path = DL_PATH / "reconstructions/humphrey"

download_zip(url, path)
print("Download completed.")

Download completed.


### Yin's GTWS-MLrec

[Data on zenodo](https://zenodo.org/records/10040927)

In [None]:
path = DL_PATH / "reconstructions/yin"
urls = [
    "https://zenodo.org/records/10040927/files/CSR-based%20GTWS-MLrec%20TWS.nc?download=1",
    "https://zenodo.org/records/10040927/files/GSFC-based%20GTWS-MLrec%20TWS.nc?download=1",
    "https://zenodo.org/records/10040927/files/JPL-based%20GTWS-MLrec%20TWS.nc?download=1",
]

for url in tqdm(urls):
    download_file(url, path)

### Li's GRACE-REC

[Data on DRYAD](https://datadryad.org/stash/dataset/doi:10.5061/dryad.z612jm6bt)

In [4]:
url = "https://datadryad.org/stash/downloads/file_stream/665199"
path = DL_PATH / "reconstructions/li"
download_file(url, path)
# Fix file name
os.rename(path / "665199", path / "GRID_CSR_GRACE_REC.mat")
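
`download_file` comes from `deepwaters.utils` and is not shown here. Judging from the rename above, it appears to name the local file after the last URL path segment, which for Dryad is just a numeric ID. The sketch below shows how such a name could be derived. It is an assumption about the helper, not its actual implementation, and `filename_from_url` is a hypothetical function.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring the query string."""
    return unquote(PurePosixPath(urlsplit(url).path).name)


# The Dryad link carries no file name, only a numeric ID
print(filename_from_url("https://datadryad.org/stash/downloads/file_stream/665199"))
# The Zenodo link carries a percent-encoded name plus a query string
print(
    filename_from_url(
        "https://zenodo.org/records/10040927/files/CSR-based%20GTWS-MLrec%20TWS.nc?download=1"
    )
)
```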