### Accessing NASA IMERG Final Run V07 Precipitation Data (May 2024)

**Author:** Gaurav Somani  
**Course:** Earth System Data Processing 1 (WS 2025/26)  
**Notebook:** load_nasa_imerg.ipynb

#### Introducing the NASA GPM IMERG Half-Hourly Precipitation Dataset (Final Run, V07)

This notebook shows how to access and download half-hourly precipitation estimates from the NASA GPM IMERG Final Run (Version 07) dataset programmatically. IMERG combines microwave, infrared, and satellite radar observations from multiple platforms to produce global precipitation fields at 0.1° spatial resolution and 30-minute temporal resolution. The “Final Run” is the research-grade product: it incorporates additional calibration and gauge-correction steps beyond those of the Early and Late Runs.

The goals of this notebook are to:

- authenticate with NASA’s Earthdata system
- identify and download half-hourly IMERG files for a five-day period
- document the technical workflow, execution time, and any access challenges
- briefly inspect the dataset structure

The worked example covers a 5-day period in May 2024 (file version V07B) and includes authentication setup, download procedures, timing, and file inspection.

#### Dataset Characteristics

- Online Data Directory: https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/
- Product: GPM IMERG Final Run (Half-Hourly), Version 07
- Temporal Resolution: 30 minutes
- Spatial Resolution: 0.1° × 0.1°
- Coverage: Global, 90°N–90°S (1800 latitude rows at 0.1°)
- Format: HDF5
- Files Per Day: 48
- File Size: ~7 MB each
- Date Range Used: May 1–5, 2024 (DOY 122–126)

**Estimated Data Size**

longitude = 3600  
latitude  = 1800  
grid cells = 3600 × 1800 = 6,480,000

Total files downloaded:

5 days × 48 files = 240 files  
240 × ~7 MB ≈ 1.68 GB (the total measured after the download further below is about 1.95 GB, roughly 8 MB per file)
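
The grid arithmetic above can be extended to estimate the uncompressed in-memory size of the data. This is a rough sketch only: it assumes the three float32 fields and one int16 field listed in the file inspection later in this notebook, and it ignores the coordinate arrays and the Intermediate group.

```python
# Rough in-memory size estimate for the main IMERG grid variables.
# Assumption: three float32 fields and one int16 field per file
# (coordinates and the "Intermediate" group are ignored).
n_lon, n_lat = 3600, 1800
cells = n_lon * n_lat                      # 6,480,000 grid cells

bytes_float32 = cells * 4                  # one float32 field
bytes_int16 = cells * 2                    # one int16 field
bytes_per_file = 3 * bytes_float32 + bytes_int16

n_files = 5 * 48                           # 5 days x 48 half-hourly files
total_bytes = n_files * bytes_per_file

print(f"One float32 field: {bytes_float32 / 1e6:.2f} MB")
print(f"Main fields per file: {bytes_per_file / 1e6:.2f} MB")
print(f"All {n_files} files: {total_bytes / 1e9:.2f} GB")
```

The uncompressed figure per file is roughly an order of magnitude larger than the file size on disk, so it is worth reading only the variables and slices that are needed.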

In [5]:

%load_ext autotime

import time
from pathlib import Path
import requests
from bs4 import BeautifulSoup
import h5py
import numpy as np


time: 362 µs (started: 2025-11-30 21:30:11 +01:00)


#### Authentication

NASA GES DISC does not use API keys. Instead, access is authenticated with Earthdata Login credentials stored in a .netrc file. This requires:

- an Earthdata Login account
- a .netrc file at ~/.netrc with an entry for urs.earthdata.nasa.gov
- restrictive file permissions (chmod 600 ~/.netrc)
- authorizing the application "NASA GESDISC DATA ARCHIVE" in the Earthdata profile

Once this is set up, Python requests picks up the credentials automatically.
See **README_nasa_imerg.md** for a more detailed explanation and follow the steps described there.
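
As an illustration of the .netrc requirements listed above, the sketch below writes a single-entry file with owner-only permissions and reads it back with the standard-library netrc module. The helper name `write_netrc_entry`, the demo credentials, and the temporary path are hypothetical; a real setup would target ~/.netrc, and this helper overwrites the target file rather than appending to it.

```python
import netrc
import stat
import tempfile
from pathlib import Path

EARTHDATA_HOST = "urs.earthdata.nasa.gov"  # Earthdata Login host


def write_netrc_entry(path, login, password, host=EARTHDATA_HOST):
    """Write a one-entry netrc file and restrict it to the owner (mode 600).

    Hypothetical helper for illustration; it overwrites `path`.
    """
    path = Path(path)
    path.write_text(f"machine {host} login {login} password {password}\n")
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return path


# Demo against a temporary file so that no real ~/.netrc is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = write_netrc_entry(Path(tmp) / "netrc_demo", "my_username", "my_password")
    login, _, password = netrc.netrc(str(demo)).authenticators(EARTHDATA_HOST)
    mode = stat.S_IMODE(demo.stat().st_mode)

print("login:", login)
print("permissions:", oct(mode))
```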

In [6]:
# Create a session; requests reads the credentials from ~/.netrc
session = requests.Session()
session.trust_env = True   # default is True; set explicitly because the .netrc lookup depends on it

# Quick connectivity test against the dataset directory
test_url = "https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/"

r = session.get(test_url)
print("Status:", r.status_code)

Status: 200
time: 733 ms (started: 2025-11-30 21:45:12 +01:00)


In [7]:
# DOY 121–125 of 2024 (a leap year) correspond to April 30 – May 4.
# May 1–5, 2024 would be DOY 122–126, i.e. range(122, 127).
days = list(range(121, 126))

base_dir = Path("data/imerg_finalrun_2025_05")
base_dir.mkdir(parents=True, exist_ok=True)

days

[121, 122, 123, 124, 125]

time: 19.3 ms (started: 2025-11-30 21:49:07 +01:00)
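
Because the day-of-year for a calendar date shifts by one between leap and non-leap years, it is safer to derive it from the date than to hardcode it. The sketch below does this with the standard library; the helper name `doy_range` is an assumption introduced here for illustration.

```python
from datetime import date, timedelta


def doy_range(start, n_days):
    """Return day-of-year numbers for n_days consecutive dates from start."""
    return [(start + timedelta(days=i)).timetuple().tm_yday for i in range(n_days)]


may_2024 = doy_range(date(2024, 5, 1), 5)  # leap year
may_2025 = doy_range(date(2025, 5, 1), 5)  # non-leap year

print("May 1-5, 2024:", may_2024)  # [122, 123, 124, 125, 126]
print("May 1-5, 2025:", may_2025)  # [121, 122, 123, 124, 125]
```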


In [13]:
def list_files_for_day(doy):
    """Return list of IMERG HDF5 file URLs for a given day-of-year."""
    base_url = f"https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2024/{doy:03d}/"
    
    r = session.get(base_url)
    if r.status_code != 200:
        print(f"Failed to access {base_url}")
        return []
    
    soup = BeautifulSoup(r.text, "html.parser")
    
    urls = []
    for link in soup.find_all("a"):
        href = link.get("href", "")
        # STRICT FILTER — only true HDF5 data files
        if href.endswith(".HDF5") and "xml" not in href.lower():
            urls.append(base_url + href)
    
    # Remove accidental duplicates
    urls = sorted(set(urls))
    return urls


time: 960 µs (started: 2025-11-30 22:01:15 +01:00)


In [14]:
test_urls = list_files_for_day(122)
len(test_urls), test_urls[:3]

(48,
 ['https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2024/122/3B-HHR.MS.MRG.3IMERG.20240501-S000000-E002959.0000.V07B.HDF5',
  'https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2024/122/3B-HHR.MS.MRG.3IMERG.20240501-S003000-E005959.0030.V07B.HDF5',
  'https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2024/122/3B-HHR.MS.MRG.3IMERG.20240501-S010000-E012959.0060.V07B.HDF5'])

time: 640 ms (started: 2025-11-30 22:01:16 +01:00)
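
The file names in the listing above encode the date, the start and end time of each half-hour window, the minute of the day, and the version. The following sketch parses these fields and reports which of the 48 half-hour slots are absent from a list of names. The regular expression is inferred from the three names shown above, and the helpers `parse_imerg_name` and `missing_slots` are hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the file names shown in the listing (an assumption).
IMERG_NAME = re.compile(r"3IMERG\.(\d{8})-S(\d{6})-E(\d{6})\.(\d{4})\.(V\w+)\.HDF5$")


def parse_imerg_name(name):
    """Extract start/end time, minute of day, and version from a file name or URL."""
    m = IMERG_NAME.search(name)
    if m is None:
        raise ValueError(f"Not an IMERG half-hourly file name: {name}")
    day, start, end, minute, version = m.groups()
    return {
        "start": datetime.strptime(day + start, "%Y%m%d%H%M%S"),
        "end": datetime.strptime(day + end, "%Y%m%d%H%M%S"),
        "minute_of_day": int(minute),
        "version": version,
    }


def missing_slots(names):
    """Return the half-hour slots (minute of day) not covered by the given names."""
    found = {parse_imerg_name(n)["minute_of_day"] for n in names}
    return sorted(set(range(0, 1440, 30)) - found)


names = [
    "3B-HHR.MS.MRG.3IMERG.20240501-S000000-E002959.0000.V07B.HDF5",
    "3B-HHR.MS.MRG.3IMERG.20240501-S003000-E005959.0030.V07B.HDF5",
    "3B-HHR.MS.MRG.3IMERG.20240501-S010000-E012959.0060.V07B.HDF5",
]
info = parse_imerg_name(names[1])
gaps = missing_slots(names)
print(info["start"], info["end"], info["version"])
print("Missing slots:", len(gaps))
```

For a complete day, `missing_slots` should return an empty list, which gives a quick completeness check before any analysis.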


In [10]:
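def download_file(url, out_folder):
    """Download a single IMERG HDF5 file; return its size in bytes, or None on failure."""
    filename = out_folder / url.split("/")[-1]

    if filename.exists():
        return filename.stat().st_size  # already downloaded, skip

    # Stream to a temporary file so an interrupted download
    # is never mistaken for a complete file on the next run
    tmp = filename.with_name(filename.name + ".part")
    with session.get(url, stream=True, timeout=120) as r:
        if r.status_code != 200:
            print(f"Failed ({r.status_code}): {url}")
            return None
        with open(tmp, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    tmp.rename(filename)
    return filename.stat().st_size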

time: 3.31 ms (started: 2025-11-30 21:51:44 +01:00)


In [16]:
from IPython.display import clear_output

all_files = []
start = time.time()

for doy in days:
    day_folder = base_dir / f"{doy:03d}"
    day_folder.mkdir(exist_ok=True)

    urls = list_files_for_day(doy)

    for i, url in enumerate(urls, 1):
        size = download_file(url, day_folder)
        if size is not None:
            all_files.append(url)  # count only files that are actually on disk
        clear_output(wait=True)
        print(f"DOY {doy}: Downloaded {i}/{len(urls)} files")

end = time.time()

print(f"Total files downloaded: {len(all_files)}")
print(f"Total time: {end - start:.2f} seconds")


DOY 125: Downloaded 48/48 files
Total files downloaded: 240
Total time: 470.66 seconds
time: 7min 50s (started: 2025-11-30 22:02:53 +01:00)


In [17]:
total_bytes = 0
for folder in sorted(base_dir.iterdir()):
    if folder.is_dir():
        for f in folder.glob("*.HDF5"):
            total_bytes += f.stat().st_size

print(f"Total dataset size: {total_bytes/1e6:.2f} MB")


Total dataset size: 1950.56 MB
time: 87.5 ms (started: 2025-11-30 22:10:56 +01:00)
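
The recorded totals above (240 files, 470.66 seconds, 1950.56 MB) allow a quick estimate of the effective transfer rate. This is simple arithmetic on the values printed by the cells above; the elapsed time also includes the directory listing requests, so the true per-file download rate is somewhat higher.

```python
# Values copied from the recorded cell outputs above.
n_files = 240
total_seconds = 470.66
total_mb = 1950.56

throughput_mb_s = total_mb / total_seconds   # effective transfer rate
seconds_per_file = total_seconds / n_files   # average wall time per file
avg_file_mb = total_mb / n_files             # average file size on disk

print(f"Throughput: {throughput_mb_s:.2f} MB/s")
print(f"Time per file: {seconds_per_file:.2f} s")
print(f"Average file size: {avg_file_mb:.2f} MB")
```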


In [18]:
# Inspect one file
sample_file = sorted((base_dir / "122").glob("*.HDF5"))[0]
sample_file

PosixPath('data/imerg_finalrun_2025_05/122/3B-HHR.MS.MRG.3IMERG.20240501-S000000-E002959.0000.V07B.HDF5')

time: 4.9 ms (started: 2025-11-30 22:13:07 +01:00)


In [21]:
with h5py.File(sample_file, "r") as f:
    print("Top-level groups:", list(f.keys()))
    grid = f["Grid"]
    print("Grid variables:", list(grid.keys()))
    print("Precip shape:", grid["precipitation"].shape)


Top-level groups: ['Grid']
Grid variables: ['Intermediate', 'nv', 'lonv', 'latv', 'time', 'lon', 'lat', 'time_bnds', 'lon_bnds', 'lat_bnds', 'precipitation', 'randomError', 'probabilityLiquidPrecipitation', 'precipitationQualityIndex']
Precip shape: (1, 3600, 1800)
time: 15.2 ms (started: 2025-11-30 22:18:16 +01:00)


In [23]:
with h5py.File(sample_file, "r") as f:
    grid = f["Grid"]

    print("Variable shapes & dtypes:")
    print("-" * 50)

    for name, obj in grid.items():
        if isinstance(obj, h5py.Dataset):
            print(f"{name:<35} shape={obj.shape}, dtype={obj.dtype}")
        else:
            print(f"{name:<35} (Group)")


Variable shapes & dtypes:
--------------------------------------------------
Intermediate                        (Group)
nv                                  shape=(2,), dtype=int32
lonv                                shape=(2,), dtype=int32
latv                                shape=(2,), dtype=int32
time                                shape=(1,), dtype=int32
lon                                 shape=(3600,), dtype=float32
lat                                 shape=(1800,), dtype=float32
time_bnds                           shape=(1, 2), dtype=int32
lon_bnds                            shape=(3600, 2), dtype=float32
lat_bnds                            shape=(1800, 2), dtype=float32
precipitation                       shape=(1, 3600, 1800), dtype=float32
randomError                         shape=(1, 3600, 1800), dtype=float32
probabilityLiquidPrecipitation      shape=(1, 3600, 1800), dtype=int16
precipitationQualityIndex           shape=(1, 3600, 1800), dtype=float32
time: 25.2 ms (started:

In [24]:
with h5py.File(sample_file, "r") as f:
    # read coordinates
    lats = f["Grid/lat"][:]
    lons = f["Grid/lon"][:]
    
    # target: Cologne
    target_lat = 50.94
    target_lon = 6.96
    
    # find closest index
    lat_idx = np.abs(lats - target_lat).argmin()
    lon_idx = np.abs(lons - target_lon).argmin()
    
    # read precipitation field; dimensions are (time, lon, lat)
    precip = f["Grid/precipitation"][0, :, :]  # drop the time dimension -> (3600, 1800)
    
    # extract the value (note the lon-first index order)
    cologne_value = precip[lon_idx, lat_idx]
    
    print(f"Closest IMERG grid point to Cologne:")
    print(f"  Grid latitude = {lats[lat_idx]:.2f}°")
    print(f"  Grid longitude = {lons[lon_idx]:.2f}°")
    print(f"\nPrecipitation at Cologne grid point: {cologne_value:.3f} mm/hr")
    
    # show a 5×5 patch around Cologne
    lat_slice = slice(max(lat_idx-2,0), lat_idx+3)
    lon_slice = slice(max(lon_idx-2,0), lon_idx+3)
    
    print("\n5 x 5 precipitation patch around Cologne:\n")
    print(precip[lon_slice, lat_slice])

Closest IMERG grid point to Cologne:
  Grid latitude = 50.95°
  Grid longitude = 6.95°

Precipitation at Cologne grid point: 0.000 mm/hr

5 x 5 precipitation patch around Cologne:

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
time: 161 ms (started: 2025-11-30 22:23:34 +01:00)
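
As a cross-check on the nearest-neighbour lookup above, the grid indices can also be computed analytically without reading the coordinate arrays. The sketch assumes a regular 0.1° grid whose cell edges start at 180°W and 90°S (so the first cell centers are at -179.95° and -89.95°), which is consistent with the 6.95° and 50.95° centers printed above. The helper names are hypothetical.

```python
import math

RES = 0.1                    # grid spacing in degrees
LON0, LAT0 = -180.0, -90.0   # assumed western and southern grid edges
N_LON, N_LAT = 3600, 1800


def imerg_index(lat, lon):
    """Return (lon_idx, lat_idx) of the grid cell containing the point."""
    lon_idx = min(int(math.floor((lon - LON0) / RES)), N_LON - 1)
    lat_idx = min(int(math.floor((lat - LAT0) / RES)), N_LAT - 1)
    return lon_idx, lat_idx


def cell_center(lon_idx, lat_idx):
    """Return (lon, lat) of the center of a grid cell."""
    return LON0 + (lon_idx + 0.5) * RES, LAT0 + (lat_idx + 0.5) * RES


# Cologne, same target coordinates as in the cell above
lon_idx, lat_idx = imerg_index(50.94, 6.96)
center_lon, center_lat = cell_center(lon_idx, lat_idx)
print(lon_idx, lat_idx)
print(f"{center_lon:.2f}, {center_lat:.2f}")
```

The resulting index pair follows the same lon-first order as the precipitation array, so it can be used directly as `precip[lon_idx, lat_idx]`.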
