## Exercising PBP - PyPAM Based Processing

**NOTE**: PBP is not a Python _package_, so we clone the PBP repo to get the code.

PBP repo: https://github.com/mbari-org/pypam-based-processing

In short, the main steps in this notebook are:

- Clone PBP to support the generation of hybrid millidecade band (HMB) products
- Install dependencies, including PyPAM
- Prepare the workspace for downloaded and generated files
- Generate HMB for a single day
- Generate HMB for multiple days in parallel using Dask

## Code preparations

### PBP clone

We start by cloning the PBP repository:

In [None]:
## Clone:
!git clone https://github.com/mbari-org/pypam-based-processing.git

### NOTE: Skip this cell if you have already cloned the repository.


In [None]:
## Change directories to the clone location:
%cd pypam-based-processing

### NOTE: Re-run this cell after restarting the kernel.


### Install requirements

In [None]:
!pip install -r requirements.txt --no-cache-dir
!pip install --no-cache-dir git+https://github.com/lifewatch/pypam.git

In [None]:
## This performs some basic PBP tests
!python -m pytest

## Workspace preparations

In [None]:
## Our JSON and WAV input files for the demo are already located under these folders:
!ls -l /home/jovyan/shared/readonly/data/mbari/pypam-based-processing/NB_SPACE/JSON/2022
!ls -l /home/jovyan/shared/readonly/data/mbari/pypam-based-processing/NB_SPACE/DOWNLOADS


In [None]:
## Base directory from which PBP will read the input files:
INPUT_DIRECTORY = '/home/jovyan/shared/readonly/data/mbari/pypam-based-processing/NB_SPACE'

## Generated netCDF and log files will be stored in this location:
output_dir     = 'NB_SPACE/OUTPUT'

## So, make sure that output folder exists:
!mkdir -p NB_SPACE/OUTPUT

## The name of generated files will be given this prefix:
output_prefix  = 'MB05_'
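
The naming convention set up here (output directory, then prefix, then date, then extension) recurs later for both the log and the netCDF files. As a minimal sketch, it could be captured in a single helper. `output_path` is a hypothetical name, not part of PBP, and the directory and prefix values simply repeat the ones defined in this cell.

```python
from pathlib import Path

# Values repeated from the workspace cell.
output_dir = "NB_SPACE/OUTPUT"
output_prefix = "MB05_"


def output_path(date: str, extension: str) -> str:
    """Hypothetical helper: build '<output_dir>/<prefix><date>.<extension>'."""
    return (Path(output_dir) / f"{output_prefix}{date}.{extension}").as_posix()


# Same effect as `mkdir -p`: no error if the folder already exists.
Path(output_dir).mkdir(parents=True, exist_ok=True)

print(output_path("20220812", "nc"))
print(output_path("20220812", "log"))
```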


## Imports

In [None]:
import logging
import os
import sys
import xarray as xr
import dask
import time

In [None]:
## Make the cloned PBP sources importable (assumes the current directory is the clone):
sys.path.insert(0, '.')
from src.process_helper import ProcessHelper
from src.file_helper import FileHelper
from src.logging_helper import create_logger

In [None]:
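## NOTE: The needed files are already downloaded for this demo, but an S3 client
## is still created below because the input files are referenced by their S3 URIs.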
import boto3
from botocore import UNSIGNED
from botocore.client import Config

## A function to process a given day

PBP includes these two main modules that we will be using below:

- `FileHelper`, which facilitates input file reading (including from S3 buckets, though not really exercised in this notebook)
- `ProcessHelper`, which is the main processing module


In [None]:
## Supported by those PBP modules, we define a function that
## takes care of processing a given day:

def process_date(date: str, gen_netcdf: bool = True):
    """
    Main function to generate the HMB product for a given day.

    It makes use of supporting elements in PBP in terms of logging,
    file handling, and PyPAM based HMB generation.

    :param date: Date to process, in YYYYMMDD format.

    :param gen_netcdf:  Allows caller to skip the `.nc` creation here
    and instead save the datasets after all days have been generated
    (see parallel execution below).

    :return: the generated xarray dataset, or None if no segments were processed.
    """

    log_filename = f"{output_dir}/{output_prefix}{date}.log"

    logger = create_logger(
        log_filename_and_level=(log_filename, logging.INFO),
        console_level=None,
    )

    ## Note: we use S3 URIs and boto3 as the general mechanism to get our files from AWS.
    ## The files needed for this demonstration have already been downloaded, so the
    ## settings below let us keep using the original S3 URIs without triggering
    ## any new downloads.
    s3_client = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    file_helper = FileHelper(
        logger=logger,
        json_base_dir           = f'{INPUT_DIRECTORY}/JSON',
        s3_client               = s3_client,
        download_dir            = f'{INPUT_DIRECTORY}/DOWNLOADS',
        assume_downloaded_files = True,
        retain_downloaded_files = True,
    )

    process_helper = ProcessHelper(
        logger=logger,
        file_helper=file_helper,
        gen_netcdf             = gen_netcdf,
        output_dir             = output_dir,
        output_prefix          = output_prefix,
        global_attrs_uri       = 'metadata/mb05/globalAttributes_MB05.yaml',
        variable_attrs_uri     = 'metadata/mb05/variableAttributes_MB05.yaml',
        voltage_multiplier     = 1,
        sensitivity_flat_value = 176,
        subset_to              = (10, 24_000),
        # max_segments=50  #TESTING
    )

    ## now, get the HMB result:
    print(f'::: Started processing {date=}    {log_filename=}')
    result = process_helper.process_day(date)

    if gen_netcdf:
        nc_filename = f"{output_dir}/{output_prefix}{date}.nc"
        print(f':::   Ended processing {date=} =>  {nc_filename=}')
    else:
        print(f':::   Ended processing {date=} => (dataset generated in memory)')

    if result is not None:
        return result.dataset
    else:
        print(f'::: UNEXPECTED: no segments were processed for {date=}')
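
`process_date` expects the day as a `YYYYMMDD` string and, as written, does not check the value before handing it to PBP. A minimal sketch of an up-front check follows. `is_valid_date` is a hypothetical helper, not part of PBP.

```python
from datetime import datetime


def is_valid_date(date: str) -> bool:
    """Hypothetical check: True if `date` is a real calendar day in YYYYMMDD form."""
    # Reject anything that is not exactly eight digits before parsing.
    if len(date) != 8 or not date.isdigit():
        return False
    try:
        datetime.strptime(date, "%Y%m%d")
    except ValueError:
        # Eight digits, but not a real day (for example, February 31).
        return False
    return True


print(is_valid_date("20220812"))
print(is_valid_date("2022-08-12"))
print(is_valid_date("20220231"))
```

Such a check could be called at the top of `process_date`, or applied to the list of dates before any processing is dispatched.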

## Generating the HMB products

### Processing a day

We can call the `process_date` function defined above directly as follows:

In [None]:
start_time = time.time()

generated_dataset = process_date('20220812')

elapsed_time = time.time() - start_time
print(f'===> date completed. Elapsed time: {elapsed_time:.1f} seconds ({elapsed_time/60:.1f} mins)')

generated_dataset
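
The elapsed-time bookkeeping around the call is repeated later for the parallel run. One possible way to factor it out is a small context manager. Both `format_elapsed` and `timed` are hypothetical helpers introduced here for illustration, and the `time.sleep` call is only a stand-in for the real processing.

```python
import time
from contextlib import contextmanager


def format_elapsed(seconds: float) -> str:
    """Format a duration the same way as the notebook's progress messages."""
    return f"{seconds:.1f} seconds ({seconds / 60:.1f} mins)"


@contextmanager
def timed(label: str):
    """Hypothetical helper: report the wall-clock time spent in the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"===> {label} completed. Elapsed time: {format_elapsed(elapsed)}")


with timed("date"):
    time.sleep(0.1)  # stand-in for process_date('20220812')
```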

## Prepare process_date for parallel execution

We will use [Dask](https://examples.dask.org/delayed.html) to dispatch multiple instances of `process_date` in parallel.
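
As a minimal sketch of the pattern, independent of PBP: `dask.delayed` wraps a function so that calling it records a task instead of running it, and `compute()` then executes the recorded tasks. The `square` and `collect` functions are placeholders standing in for `process_date` and the aggregation step. The explicit `scheduler` and `num_workers` arguments are optional and shown only to illustrate how the degree of parallelism might be capped.

```python
import dask


@dask.delayed
def square(x: int) -> int:
    """Placeholder for a per-day processing function."""
    return x * x


def collect(*values: int) -> list[int]:
    """Placeholder aggregation: receives the computed results positionally."""
    return list(values)


# Calling a delayed function only builds the task graph; nothing runs yet.
tasks = [square(n) for n in range(4)]
graph = dask.delayed(collect)(*tasks)

# compute() executes the graph; results keep the order of the arguments.
result = graph.compute(scheduler="threads", num_workers=2)
print(result)
```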

In [None]:
def process_multiple_dates(dates: list[str], gen_netcdf: bool = False) -> list[xr.Dataset]:
    """
    Generates HMB for multiple days in parallel using Dask.
    Returns the resulting HMB datasets.
    
    :param dates: The dates to process, each in YYYYMMDD format.

    :param gen_netcdf:  Allows caller to skip the `.nc` creation here
    and instead save the datasets after all days have been generated.

    :return: the list of generated datasets.
    """

    @dask.delayed
    def delayed_process_date(date: str):
        return process_date(date, gen_netcdf=gen_netcdf)
    
    ## To display the total elapsed time at the end of the processing:
    start_time = time.time()

    ## This will be called by Dask when all dates have completed processing:
    def aggregate(*datasets) -> list[xr.Dataset]:
        elapsed_time = time.time() - start_time
        print(f'===> All {len(datasets)} dates completed. Elapsed time: {elapsed_time:.1f} seconds ({elapsed_time/60:.1f} mins)')
        return list(datasets)  # `*datasets` arrives as a tuple


    ## Prepare the processes:
    delayed_processes = [delayed_process_date(date) for date in dates]
    aggregation = dask.delayed(aggregate)(*delayed_processes)

    ## And launch them:
    return aggregation.compute()


### Processing multiple days

We use the `process_multiple_dates` function defined above to generate multiple HMB datasets in parallel.

In [None]:
## The dates to process in this demo:
dates = [
    '20220812', '20220813',
    '20220814', '20220815',
    '20220816', '20220817',
    '20220818', '20220819',
    '20220820', '20220821',
]

## NOTE: due to issues observed when saving the resulting netCDF files concurrently,
## this flag lets us postpone the saving until after all datasets have been generated:
gen_netcdf = False

## Get all HMB datasets:
generated_datasets = process_multiple_dates(dates, gen_netcdf=gen_netcdf)

print(f'Generated datasets: {len(generated_datasets)}\n')

if not gen_netcdf:
    # so, we now do the file saving here:
    print('Saving generated datasets...')
    for date, ds in zip(dates, generated_datasets):
        if ds is None:
            # process_date returns None when no segments were processed
            print(f'  Skipping {date=}: no dataset was generated')
            continue
        nc_filename = f'{output_dir}/{output_prefix}{date}.nc'
        print(f'  Saving {nc_filename=}')
        ds.to_netcdf(nc_filename,
                     engine="netcdf4",
                     encoding={
                        "effort": {"_FillValue": None},
                        "frequency": {"_FillValue": None},
                        "sensitivity": {"_FillValue": None},
                     },
        )

print('\nListing *.nc in OUTPUT folder:')
!ls -l NB_SPACE/OUTPUT/*.nc
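
The ten consecutive dates in the cell above are written out by hand. For longer runs, the list could instead be generated from a first and last day. The sketch below assumes the same `YYYYMMDD` convention. `date_range` is a hypothetical helper, not part of PBP.

```python
from datetime import datetime, timedelta


def date_range(first: str, last: str) -> list[str]:
    """Hypothetical helper: every day from `first` to `last` inclusive, as YYYYMMDD."""
    start = datetime.strptime(first, "%Y%m%d").date()
    end = datetime.strptime(last, "%Y%m%d").date()
    n_days = (end - start).days + 1  # +1 makes the end date inclusive
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(n_days)]


dates = date_range("20220812", "20220821")
print(len(dates), dates[0], dates[-1])
```

The resulting list can be passed to `process_multiple_dates` in place of the literal one.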