# Exercising PBP/PyPAM on NRS11 data

The main steps in this notebook are:

- Prepare the working space for downloaded and generated files
- Generate HMB for a single day
- Generate HMB for multiple days in parallel using Dask

## Preparations

We start in `/opt/pbp/pypam-based-processing` because some of the inputs are already in place there in the PBP image.

In [1]:
%cd /opt/pbp/pypam-based-processing

/opt/pbp/pypam-based-processing


## Some parameters for PBP

In [2]:
json_base_dir = "NRS11/noaa-passive-bioacoustic_nrs_11_2019-2021"
download_dir = "NRS11/DOWNLOADS"
output_dir = "NRS11/OUTPUT"
output_prefix = "NRS11_"

global_attrs_uri = "NRS11/globalAttributes_NRS11.yaml"
variable_attrs_uri = "NRS11/variableAttributes_NRS11.yaml"

voltage_multiplier = 2.5
sensitivity_uri = "NRS11/NRS11_H5R6_sensitivity_hms5kHz.nc"
subset_to = (10, 2_000)
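The download and output directories named above need to exist before any files are written to them. The PBP image may already provide them; in case it does not, the following is a minimal sketch that creates them on demand. The helper name `ensure_dirs` is hypothetical and not part of PBP.

```python
from pathlib import Path


def ensure_dirs(*dirs: str) -> list[Path]:
    """Create each directory (including parents) if it does not exist yet.

    Hypothetical helper, not part of PBP. Returns the directories as Path objects.
    """
    paths = []
    for d in dirs:
        path = Path(d)
        # exist_ok=True makes the call safe to repeat on an existing directory
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths
```

In this notebook it could be called as `ensure_dirs(download_dir, output_dir)` right after the parameters are set.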

## Code imports

In [3]:
## Some base imports:
import logging
import xarray as xr
import dask
import pandas as pd
import time
import sys
from google.cloud.storage import (
    Client as GsClient,
)  # To handle download of `gs:` resources

In [4]:
## Some PBP imports:
sys.path = ["."] + sys.path
from src.process_helper import ProcessHelper
from src.file_helper import FileHelper
from src.logging_helper import create_logger



# Supporting functions

PBP includes these two main modules that we will be using below:

- `FileHelper`: Facilitates input file reading. It supports reading local files as well as from GCP (`gs://` URIs) and AWS (`s3://` URIs).
- `ProcessHelper`: The main processing module.

We first define a function that takes care of HMB generation for a given date.

Based on that function, we then define a second function that dispatches multiple dates in parallel.


## A function to process a given day

Supported by those PBP modules, we define a function that takes care of processing a given day:

In [5]:
def process_date(date: str, gen_netcdf: bool = True):
    """
    Main function to generate the HMB product for a given day.

    It makes use of supporting elements in PBP in terms of logging,
    file handling, and PyPAM based HMB generation.

    :param date: Date to process, in YYYYMMDD format.

    :param gen_netcdf:  Allows caller to skip the `.nc` creation here
    and instead save the datasets after all days have been generated
    (see parallel execution below).

    :return: the generated xarray dataset.
    """

    log_filename = f"{output_dir}/{output_prefix}{date}.log"

    logger = create_logger(
        log_filename_and_level=(log_filename, logging.INFO),
        console_level=None,
    )

    # we are only downloading publicly accessible datasets:
    gs_client = GsClient.create_anonymous_client()

    file_helper = FileHelper(
        logger=logger,
        json_base_dir=json_base_dir,
        gs_client=gs_client,
        download_dir=download_dir,
        assume_downloaded_files=True,
        retain_downloaded_files=True,
    )

    process_helper = ProcessHelper(
        logger=logger,
        file_helper=file_helper,
        output_dir=output_dir,
        output_prefix=output_prefix,
        global_attrs_uri=global_attrs_uri,
        variable_attrs_uri=variable_attrs_uri,
        voltage_multiplier=voltage_multiplier,
        sensitivity_uri=sensitivity_uri,
        subset_to=subset_to,
    )

    ## now, get the HMB result:
    print(f"::: Started processing {date=}    {log_filename=}")
    result = process_helper.process_day(date)

    if gen_netcdf:
        nc_filename = f"{output_dir}/{output_prefix}{date}.nc"
        print(f":::   Ended processing {date=} =>  {nc_filename=}")
    else:
        print(f":::   Ended processing {date=} => (dataset generated in memory)")

    if result is not None:
        return result.dataset

    # No result: report it and make the `None` return value explicit.
    print(f"::: UNEXPECTED: no segments were processed for {date=}")
    return None
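Because `process_date` returns `None` when no segments were processed, a caller that collects results for several dates may want to separate the usable datasets from the dates that produced nothing. The following is a minimal sketch of that bookkeeping; `split_results` is a hypothetical helper, not part of PBP, and it treats the datasets as opaque objects.

```python
from typing import Any, Optional


def split_results(
    dates: list[str], datasets: list[Optional[Any]]
) -> tuple[list[tuple[str, Any]], list[str]]:
    """Pair each date with its dataset and list the dates that have no result.

    Hypothetical helper, not part of PBP. `dates` and `datasets` are
    assumed to be in the same order and of the same length.
    """
    ok = [(d, ds) for d, ds in zip(dates, datasets) if ds is not None]
    missing = [d for d, ds in zip(dates, datasets) if ds is None]
    return ok, missing
```

Only the pairs in `ok` would then be passed on to any saving step, while `missing` can be reported or retried.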

## A function to process multiple days

We use [Dask](https://examples.dask.org/delayed.html) to dispatch, in parallel, multiple instances of the `process_date` function defined above.

In [6]:
def process_multiple_dates(
    dates: list[str], gen_netcdf: bool = False
) -> list[xr.Dataset]:
    """
    Generates HMB for multiple days in parallel using Dask.
    Returns the resulting HMB datasets.

    :param dates: The dates to process, each in YYYYMMDD format.

    :param gen_netcdf:  Allows caller to skip the `.nc` creation here
    and instead save the datasets after all days have been generated.

    :return: the list of generated datasets.
    """

    @dask.delayed
    def delayed_process_date(date: str):
        return process_date(date, gen_netcdf=gen_netcdf)

    ## To display total elapsed time at the end of the processing:
    start_time = time.time()

    ## This will be called by Dask when all dates have completed processing:
    def aggregate(*datasets) -> list[xr.Dataset]:
        elapsed_time = time.time() - start_time
        print(
            f"===> All {len(datasets)} dates completed. Elapsed time: {elapsed_time:.1f} seconds ({elapsed_time/60:.1f} mins)"
        )
        # `datasets` arrives as a tuple; convert it to match the declared return type.
        return list(datasets)

    ## Prepare the processes:
    delayed_processes = [delayed_process_date(date) for date in dates]
    aggregation = dask.delayed(aggregate)(*delayed_processes)

    ## And launch them:
    return aggregation.compute()

# Generating the HMB products

## Processing a single day

In general, we are more interested in processing multiple dates, but we can process a single date by just calling `process_date` directly:

In [7]:
## Just uncomment the following line:
# process_date('20200101')
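`process_date` expects its argument in YYYYMMDD format. As a small sketch, assuming one wants to catch a malformed date before starting a run, the check below uses only the standard library. The name `is_valid_date` is hypothetical and not part of PBP.

```python
from datetime import datetime


def is_valid_date(date: str) -> bool:
    """Return True if `date` is a real calendar date written as YYYYMMDD.

    Hypothetical helper, not part of PBP.
    """
    # Require exactly eight ASCII digits before attempting to parse
    if len(date) != 8 or not (date.isascii() and date.isdigit()):
        return False
    try:
        datetime.strptime(date, "%Y%m%d")
    except ValueError:
        # e.g. month 13 or February 30
        return False
    return True
```

For example, `is_valid_date("20200101")` is true, while `"2020-01-01"` and `"20200230"` are both rejected.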

## Processing multiple days

We use the `process_multiple_dates` function defined above to launch the generation of multiple HMB datasets in parallel.

**NOTE**:
- The JSON files included in the current PBP image only cover January 1–31, 2020.
- Such JSON files could alternatively be located in external buckets.

In [8]:
## Here, we set `dates` as the list of 'YYYYMMDD' dates we want to process:

## For just a few dates, we can define the list explicitly:
# dates = ['20200110', '20200111', '20200112']

## but in general we can use pandas to help us generate the list:
date_range = pd.date_range(start="2020-01-01", end="2020-01-31")
dates = date_range.strftime("%Y%m%d").tolist()
dates

['20200101',
 '20200102',
 '20200103',
 '20200104',
 '20200105',
 '20200106',
 '20200107',
 '20200108',
 '20200109',
 '20200110',
 '20200111',
 '20200112',
 '20200113',
 '20200114',
 '20200115',
 '20200116',
 '20200117',
 '20200118',
 '20200119',
 '20200120',
 '20200121',
 '20200122',
 '20200123',
 '20200124',
 '20200125',
 '20200126',
 '20200127',
 '20200128',
 '20200129',
 '20200130',
 '20200131']

In [9]:
## Now, launch the generation:

print(f"Launching HMB generation for {len(dates)} {dates=}")

## NOTE: because of issues observed when saving the resulting netCDF files concurrently,
## this flag lets us postpone saving until after all datasets have been generated:
gen_netcdf = False

## Get all HMB datasets:
generated_datasets = process_multiple_dates(dates, gen_netcdf=gen_netcdf)

print(f"Generated datasets: {len(generated_datasets)}\n")

if not gen_netcdf:
    # so, we now do the file saving here:
    print("Saving generated datasets...")
    for date, ds in zip(dates, generated_datasets):
        nc_filename = f"{output_dir}/{output_prefix}{date}.nc"
        print(f"  Saving {nc_filename=}")
        try:
            ds.to_netcdf(
                nc_filename,
                engine="netcdf4",
                encoding={
                    "effort": {"_FillValue": None},
                    "frequency": {"_FillValue": None},
                    "sensitivity": {"_FillValue": None},
                },
            )
        except Exception as e:  # pylint: disable=broad-exception-caught
            print(f"Unable to save {nc_filename}: {e}")

Launching HMB generation for 31 dates=['20200101', '20200102', '20200103', '20200104', '20200105', '20200106', '20200107', '20200108', '20200109', '20200110', '20200111', '20200112', '20200113', '20200114', '20200115', '20200116', '20200117', '20200118', '20200119', '20200120', '20200121', '20200122', '20200123', '20200124', '20200125', '20200126', '20200127', '20200128', '20200129', '20200130', '20200131']
::: Started processing date='20200123'    log_filename='NRS11/OUTPUT/NRS11_20200123.log'
::: Started processing date='20200103'    log_filename='NRS11/OUTPUT/NRS11_20200103.log'
::: Started processing date='20200121'    log_filename='NRS11/OUTPUT/NRS11_20200121.log'
::: Started processing date='20200131'    log_filename='NRS11/OUTPUT/NRS11_20200131.log'
::: Started processing date='20200127'    log_filename='NRS11/OUTPUT/NRS11_20200127.log'
::: Started processing date='20200124'    log_filename='NRS11/OUTPUT/NRS11_20200124.log'
::: Started processing date='20200107'    log_filename=

In [10]:
print("\nListing *.nc in OUTPUT folder:")
!ls -l NRS11/OUTPUT/*.nc


Listing *.nc in OUTPUT folder:
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200101.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200102.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200103.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200104.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200105.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200106.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200107.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200108.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200109.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200110.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200111.nc
-rw-r--r-- 1 jovyan users 6316372 Feb 12 19:39 NRS11/OUTPUT/NRS11_20200112.nc
-rw-r--r-- 1 jovyan users 631637
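
Rather than reading the listing by eye, the requested dates can be compared against the files on disk. The following is a minimal sketch, assuming the same `{output_dir}/{output_prefix}{date}.nc` naming used when saving above; `missing_outputs` is a hypothetical helper, not part of PBP.

```python
from pathlib import Path


def missing_outputs(dates: list[str], output_dir: str, output_prefix: str) -> list[str]:
    """Return the dates for which no `.nc` file exists in `output_dir`.

    Hypothetical helper, not part of PBP. It assumes files are named
    `{output_prefix}{date}.nc`, as in the saving loop of this notebook.
    """
    base = Path(output_dir)
    return [d for d in dates if not (base / f"{output_prefix}{d}.nc").is_file()]
```

In this notebook, `missing_outputs(dates, output_dir, output_prefix)` would return an empty list if every date was saved.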