# **Example Usage of Pangeo-Fish Software**

**Overview:**
This Jupyter notebook demonstrates the usage of pangeo-fish, a tool for analyzing biologging data against Earth Observation (EO) reference data. It uses data from the biologging tag 'A19124' and reference data from the European Union Copernicus Marine Service Information (CMEMS) product 'NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013'. Both were used in the study by M. Gonze et al., "Combining acoustic telemetry with archival tagging to investigate the spatial dynamics of the understudied pollack *Pollachius pollachius*," published in the Journal of Fish Biology.

**Purpose:**
By executing this notebook, users will learn how to set up a workflow for utilizing the pangeo-fish software. The workflow involves ten steps outlined below:

1. **Configure the Notebook:** Prepare the notebook environment for analysis.
2. **Compare Reference Model with DST Tag Information:** Analyze and compare data from the reference model with information from the biologging data.
3. **Regrid the Grid from Reference Model Grid to Healpix Grid:** Transform the grid from the reference model grid to the healpix grid for further analysis.
4. **Construct Emission Matrix:** Create an emission matrix based on the transformed grid.
5. **Compute Additional Emission Probability Matrix:** Calculate an additional emission probability matrix from acoustic telemetry detections.
6. **Combine and Normalize Emission Matrix:** Merge the emission matrices and normalize the result for further processing.
7. **Estimate Model Parameters:** Determine the parameters of the model based on the normalized emission matrix.
8. **Compute State Probabilities:** Calculate the probabilities associated with different states within the model.
9. **Compute Tracks:** Analyze and compute tracks based on the model parameters and state probabilities.
10. **Visualization:** Visualize the results of the analysis for interpretation and insight.

Throughout this notebook, users will gain practical experience in setting up and executing a workflow using pangeo-fish, enabling them to apply similar methodologies to their own biologging data analysis tasks.


## 1. **Configure the Notebook:** Prepare the notebook environment for analysis.
This section sets up the notebook environment for analysis. It includes installing necessary packages, importing required libraries, setting up parameters, and configuring the cluster for distributed computing. It also retrieves the tag data needed for analysis.

    

In [None]:
# Run the following three commands to install pangeo-fish in the pangeo environment.
# Note: these commands install the packages required for the analysis.
# You may need to restart your kernel before executing the next cell.

!pip install git+https://github.com/iaocea/xarray-healpy
!pip install rich dask_image zstandard xmovie
!pip install -e ../.

In [None]:
# Import necessary libraries and modules for data analysis.
import dask
import numpy as np
import pandas as pd
import pint_xarray
import xarray as xr
from pint_xarray import unit_registry as ureg
from pangeo_fish.io import open_tag

In [None]:
# Set up execution parameters for the analysis.
# Note: This cell is tagged as parameters, allowing automatic updates when configuring with papermill.

# tag_name is the name of the biologging tag (DST identification number),
# which is also the path under which all information for the tagged fish is stored.
tag_name = "A19124"  

# tag_root specifies the root URL for tag data used for this computation.
tag_root = "https://data-taos.ifremer.fr/data_tmp/cleaned/tag/"

# catalog_url specifies the URL for the catalog for reference data used.
catalog_url = "https://data-taos.ifremer.fr/kerchunk/ref-copernicus.yaml"

# scratch_root specifies the root directory for storing output files.
scratch_root = "s3://destine-gfts-data-lake/demo"

# storage_options specifies options for the filesystem storing output files.
storage_options = {
    'anon': False, 
    'profile' : "gfts",
    'client_kwargs': {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    }
}

# If you are using the local file system, uncomment the following two lines.
# scratch_root = "."
# storage_options = None

#
# Parameters for Workflow 2. **Compare Reference Model with DST Tag Information:**
#
# bbox, bounding box, defines the latitude and longitude range for the analysis area.
bbox = {"lat": [47, 51], "lon": [-8, -1]} 

# relative_depth_threshold defines the acceptable fish depth relative to the maximum tag depth.
# It determines whether the fish can be considered to be in a certain location based on depth.
relative_depth_threshold = 0.6

#
# Parameters for Workflow 3. **Regrid the Grid from Reference Model Grid to Healpix Grid:** 
#
# nside defines the resolution of the healpix grid used for regridding.
nside = 4096*2

# rot defines the rotation angles for the healpix grid.
rot = {"lat": 0, "lon": 0}

# min_vertices sets the minimum number of vertices for a valid transcription for regridding.
min_vertices = 1

#
# Parameters for Workflow 4. **Construct Emission Matrix:**
#
# differences_std sets the standard deviation for scipy.stats.norm.pdf.
# It expresses the estimated certainty of the field of difference.
differences_std = 0.75

# recapture_std sets the standard deviation for the recapture event (its square is used as the covariance).
# It expresses the certainty of the final recapture area, if it is known.
recapture_std = 1e-2

# earth_radius defines the radius of the Earth used for distance calculations.
earth_radius = ureg.Quantity(6371, "km")

# maximum_speed sets the maximum allowable speed for the tagged fish.
maximum_speed = ureg.Quantity(60, "km / day")

# adjustment_factor loosens the search:
# it multiplies the allowed maximum displacement of the fish.
adjustment_factor = 2.5  

# truncate sets the truncation factor applied to the computed maximum allowed sigma for the convolution.
truncate = 1

#
# Parameters for Workflow 5. **Compute Additional Emission Probability Matrix:**
#
# receiver_buffer sets the maximum allowed detection distance for acoustic receivers.
receiver_buffer = ureg.Quantity(1000, "m")

#
# Parameters for Workflow 7. **Estimate Model Parameters:** 
#
# tolerance sets the tolerance level for EagerBoundsSearch calculations.
tolerance = 1e-2

#
# Parameters for Workflow 9. **Compute Tracks:**
#
# track_modes defines the modes for track calculation.
track_modes = ["mean", "mode"]

# additional_track_quantities sets quantities to compute for tracks.
additional_track_quantities = ["speed", "distance"]


In [None]:
# Define target and tracks root directories for storing analysis results.
target_root = f"{scratch_root}/{tag_name}"
tracks_root = f"{target_root}/tracks"

In [None]:
# Set up a local cluster for distributed computing.
from distributed import LocalCluster
cluster = LocalCluster()
client = cluster.get_client()
client

In [None]:
# Open and retrieve the tag data required for the analysis.
tag = open_tag(tag_root, tag_name)
tag

## 2. **Compare Reference Model with DST Tag Information:**
In this step, we compare the reference model data with DST (Data Storage Tag) information.
The process involves reading and cleaning the reference model, aligning time, converting depth units, 
subtracting tag data from the model, and saving the results.
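The binning and subtraction described here can be pictured on toy data. The sketch below is plain Python and is not the implementation of `reshape_by_bins` or `diff_z`: the sample values, the nearest-depth matching, the per-bin averaging, and the helper names `nearest_level` and `mean_difference_per_bin` are all assumptions made for this illustration.

```python
from bisect import bisect_right

# Hypothetical model: time bin edges (hours) and one temperature profile per bin.
bin_edges = [0, 24, 48]
model_depths = [0.0, 10.0, 20.0]  # metres
model_temperature = [
    [15.0, 14.0, 12.0],  # bin 0: [0 h, 24 h)
    [15.5, 14.5, 12.5],  # bin 1: [24 h, 48 h)
]

# Hypothetical tag log: (time in hours, depth in metres, temperature in deg C).
tag_observations = [(1, 9.0, 13.5), (12, 11.0, 14.5), (30, 19.0, 12.0)]


def nearest_level(depth, levels):
    """Index of the model depth level closest to an observed depth."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - depth))


def mean_difference_per_bin(observations, edges, depths, temperature):
    """Average (model - tag) temperature difference in each model time bin."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for time, depth, observed in observations:
        b = bisect_right(edges, time) - 1
        if not 0 <= b < n_bins:
            continue  # observation outside the model time range
        level = nearest_level(depth, depths)
        sums[b] += temperature[b][level] - observed
        counts[b] += 1
    # Bins without observations are reported as None.
    return [s / c if c else None for s, c in zip(sums, counts)]


mean_diff = mean_difference_per_bin(
    tag_observations, bin_edges, model_depths, model_temperature
)
print(mean_diff)
```

In the notebook itself the comparison is done per grid cell, so the result is a field of differences for each time step rather than a single number per bin.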

In [None]:
# Import necessary libraries
import intake
from pangeo_fish.cf import bounds_to_bins
from pangeo_fish.diff import diff_z
from pangeo_fish.io import open_copernicus_catalog
from pangeo_fish.tags import adapt_model_time, reshape_by_bins, to_time_slice

# Drop data outside the reference interval
time_slice = to_time_slice(tag["tagging_events/time"])
tag_log = tag["dst"].ds.sel(time=time_slice)

# Open and clean reference model
cat = intake.open_catalog(catalog_url)
model = open_copernicus_catalog(cat)

# Subset the reference model:
# - align the model time with the time range of tag_log,
# - restrict it to the latitude and longitude ranges of bbox,
# - drop depth levels that are unlikely given the maximum pressure observed in tag_log.
reference_model = (
    model.sel(time=adapt_model_time(time_slice))
    .sel(lat=slice(*bbox["lat"]), lon=slice(*bbox["lon"]))
    .pipe(
        lambda ds: ds.sel(
            depth=slice(None, (tag_log["pressure"].max() - ds["XE"].min()).compute())
        )
    )
)
reference_model

In [None]:
%%time
# Reshape the tag log, so that it bins to the time step of reference_model
reshaped_tag = reshape_by_bins(
    tag_log,
    dim="time",
    bins=(
        reference_model.cf.add_bounds(["time"], output_dim="bounds")
        .pipe(bounds_to_bins, bounds_dim="bounds")
        .get("time_bins")
    ),
    bin_dim="bincount",
    other_dim="obs",
).chunk({"time": 1})

# Subtract the time-binned tag_log from the reference_model.
# For each time bin, each observed value is compared with the corresponding depth of reference_model using the diff_z function.
diff = (
    diff_z(reference_model, reshaped_tag, depth_threshold=relative_depth_threshold)
    .assign_attrs({"tag_id": tag_name})
    .assign(
        {
            "H0": reference_model["H0"],
            "ocean_mask": reference_model["H0"].notnull(),
        }
    )
)

# Persist the diff data
diff = diff.chunk({"time": 1, "lat": -1, "lon": -1}).persist()
diff

In [None]:
# Verify the data
diff["diff"].count(["lat", "lon"]).plot()

In [None]:
%%time
# Save snapshot to disk
diff.to_zarr(
    f"{target_root}/diff.zarr", mode="w", storage_options=storage_options
)

# Cleanup
del tag_log, cat, model, reference_model, reshaped_tag, diff

## 3. Regrid the grid from reference model grid to healpix grid.

In this step, we regrid the data from the reference model grid to a Healpix grid. This process involves defining the Healpix grid, creating the target grid, computing interpolation weights, performing the regridding, and saving the regridded data.
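To get a feel for the target grid, the standard HEALPix relations between `nside`, the refinement level, and the number of cells can be evaluated directly. This is a small worked example using only the `math` module. The cell size is the usual square root of the per-cell solid angle, which is an approximation and not necessarily the value stored in the regridded dataset.

```python
import math

nside = 4096 * 2  # value used in the parameters cell
level = int(math.log2(nside))  # HEALPix refinement level
n_pixels = 12 * nside**2  # total number of equal-area cells on the sphere

# Approximate cell size: square root of the per-cell solid angle, in radians.
cell_size_rad = math.sqrt(4 * math.pi / n_pixels)
cell_size_km = cell_size_rad * 6371  # Earth radius in km, as in the parameters cell

print(level, n_pixels, round(cell_size_km, 3))
```

With these settings each cell is a little under one kilometre across, which is why the regridded dataset is large and is persisted and written in per-time-step chunks.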


In [None]:
# Import necessary libraries
from xarray_healpy import HealpyGridInfo, HealpyRegridder
from pangeo_fish.grid import center_longitude

In [None]:
%%time

# Open the diff data and perform cleaning operations to prepare it for regridding.

ds = (
    xr.open_dataset(f"{target_root}/diff.zarr", engine="zarr", chunks={},
                    storage_options=storage_options)
    .pipe(lambda ds: ds.merge(ds[["latitude", "longitude"]].compute()))
    .swap_dims({"lat": "yi", "lon": "xi"})
    .drop_vars(["lat", "lon"])
)
ds


In [None]:
%%time
# Define the target Healpix grid information
grid = HealpyGridInfo(level=int(np.log2(nside)), rot=rot)
target_grid = grid.target_grid(ds).pipe(center_longitude, 0)
target_grid

In [None]:
%%time
# Compute the interpolation weights for regridding the diff data
regridder = HealpyRegridder(
    ds[["longitude", "latitude", "ocean_mask"]],
    target_grid,
    method="bilinear",
    interpolation_kwargs={"mask": "ocean_mask", "min_vertices": min_vertices},
)
regridder


In [None]:
%%time
# Perform the regridding operation using the computed interpolation weights.
regridded = regridder.regrid_ds(ds)
regridded

In [None]:
%%time
# Reshape the regridded data to 2D
reshaped = grid.to_2d(regridded).pipe(center_longitude, 0)
reshaped = reshaped.persist()
reshaped


In [None]:
# This cell verifies the regridded data by plotting the count of non-NaN values.
reshaped["diff"].count(["x", "y"]).plot()

In [None]:
%%time
# Save the regridded data to Zarr format.
reshaped.chunk({"x": -1, "y": -1, "time": 1}).to_zarr(
    f"{target_root}/diff-regridded.zarr",
    mode="w",
    consolidated=True,
    compute=True,
    storage_options=storage_options,
)
# Cleanup unnecessary variables to free up memory
del ds, grid, target_grid, regridder, regridded, reshaped

## 4. Construct emission probability matrix

In this section, we construct the emission probability matrix based on the differences between the observed tag temperature and the reference sea temperature computed in Workflow 2 and regridded in Workflow 3. The emission probability matrix represents the likelihood of observing a specific temperature difference given the model parameters and configurations.
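The idea can be illustrated on a single row of temperature differences. The sketch below evaluates a zero-mean normal density with the notebook's `differences_std` and rescales the row to sum to 1. It is a plain-Python illustration, not the code of `pangeo_fish.pdf.normal`: the sample values, the use of `None` for land cells, and the normalisation step are assumptions.

```python
import math

differences_std = 0.75  # same value as in the parameters cell


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


# Hypothetical temperature differences (model - tag) for one time step;
# None marks a land cell without model data.
diff_row = [0.0, 0.5, -0.5, 2.0, None]

raw = [None if d is None else normal_pdf(d, std=differences_std) for d in diff_row]
total = sum(p for p in raw if p is not None)
emission_row = [None if p is None else p / total for p in raw]

print(emission_row)
```

Cells where the model temperature matches the tag closely receive the highest probability, and a smaller `differences_std` would concentrate the probability on fewer cells.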


In [None]:
# Import necessary libraries
from toolz.dicttoolz import valfilter
from pangeo_fish.distributions import create_covariances, normal_at
from pangeo_fish.pdf import normal
from pangeo_fish.utils import temporal_resolution

In [None]:
%%time
# Open the regridded diff data 
differences = xr.open_dataset(
    f"{target_root}/diff-regridded.zarr",
    engine="zarr",
    chunks={},
    storage_options=storage_options,
)
differences

In [None]:
%%time
# Compute initial and final position
grid = differences[["latitude", "longitude"]].compute()

initial_position = tag["tagging_events"].ds.sel(event_name="release")
cov = create_covariances(1e-6, coord_names=["latitude", "longitude"])
initial_probability = normal_at(
    grid, pos=initial_position, cov=cov, normalize=True, axes=["latitude", "longitude"]
)

final_position = tag["tagging_events"].ds.sel(event_name="fish_death")
if final_position[["longitude", "latitude"]].to_dataarray().isnull().all():
    final_probability = None
else:
    cov = create_covariances(recapture_std**2, coord_names=["latitude", "longitude"])
    final_probability = normal_at(
        grid,
        pos=final_position,
        cov=cov,
        normalize=True,
        axes=["latitude", "longitude"],
    )


In [None]:
# Compute maximum displacement for each reference model time step
# and estimate maximum sigma value for limiting the optimisation step

earth_radius_ = xr.DataArray(earth_radius, dims=None)

timedelta = temporal_resolution(differences["time"]).pint.quantify().pint.to("h")
grid_resolution = earth_radius_ * differences["resolution"].pint.quantify()

maximum_speed_ = xr.DataArray(maximum_speed, dims=None).pint.to("km / h")
max_grid_displacement = maximum_speed_ * timedelta * adjustment_factor / grid_resolution
max_sigma = max_grid_displacement.pint.to("dimensionless").pint.magnitude / truncate
max_sigma
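The arithmetic in the cell above can be checked by hand without units. In this worked example the speed, adjustment factor, and truncation come from the parameters cell, while the one-hour time step and the 0.8 km cell size are assumptions chosen only to make the numbers concrete.

```python
maximum_speed_km_per_day = 60.0  # from the parameters cell
adjustment_factor = 2.5  # from the parameters cell
truncate = 1  # from the parameters cell

# Assumptions for this example only.
time_step_hours = 1.0  # hypothetical model time step
cell_size_km = 0.8  # hypothetical grid resolution

speed_km_per_hour = maximum_speed_km_per_day / 24
max_displacement_km = speed_km_per_hour * time_step_hours * adjustment_factor
max_grid_displacement = max_displacement_km / cell_size_km  # in grid cells
example_max_sigma = max_grid_displacement / truncate

print(round(example_max_sigma, 2))
```

Under these assumptions the fish can move about 7.8 grid cells per time step, and that value becomes the upper bound of the parameter search in step 7.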

In [None]:
%%time
# Compute the emission probability matrix

emission_pdf = (
    normal(differences["diff"], mean=0, std=differences_std, dims=["y", "x"])
    .to_dataset(name="pdf")
    .assign(
        valfilter(
            lambda x: x is not None,
            {
                "initial": initial_probability,
                "final": final_probability,
                "mask": differences["ocean_mask"],
            },
        )
    )
    .assign_attrs(differences.attrs | {"max_sigma": max_sigma})
    .chunk({"time": 1, "y": -1, "x": -1})
)

emission_pdf = emission_pdf.persist()
emission_pdf

In [None]:
# Verify the data
emission_pdf["pdf"].count(["x", "y"]).plot()

In [None]:
# This cell saves the emission data to Zarr format, then cleans up unnecessary variables to free up memory.

emission_pdf.to_zarr(
    f"{target_root}/emission.zarr", mode="w", consolidated=True,
    storage_options=storage_options,
)


del differences, grid, initial_probability, final_probability, emission_pdf

## 5. Compute additional emission probability matrix (acoustic detections)
    1. open and read acoustic detections for the selected tag
    2. convert times to UTC
    3. aggregate detections and compute weights
    4. construct detection maps
    5. weighted sum of the detection maps
    6. save
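Steps 3 to 5 of this list can be sketched for a single time step. The example below is a plain-Python illustration and not the implementation of `acoustic.emission_probability`: the coordinates, detection counts, the uniform detection map inside the buffer, and the weighting by detection count are all assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000
receiver_buffer_m = 1000  # same value as in the parameters cell


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical grid cell centres and receivers (latitude, longitude in degrees).
cells = [(48.000, -5.0), (48.005, -5.0), (48.100, -5.0)]
receivers = {"r1": (48.000, -5.0), "r2": (48.100, -5.0)}
# Hypothetical number of detections per receiver within one time step.
detections = {"r1": 3, "r2": 1}

total_detections = sum(detections.values())
acoustic_row = [0.0] * len(cells)
for name, (rlat, rlon) in receivers.items():
    # Detection map: 1 inside the buffer around the receiver, 0 outside.
    detection_map = [
        1.0 if haversine_m(lat, lon, rlat, rlon) <= receiver_buffer_m else 0.0
        for lat, lon in cells
    ]
    n_inside = sum(detection_map)
    weight = detections[name] / total_detections
    # Weighted sum of the per-receiver maps, each normalised to sum to 1.
    for i, value in enumerate(detection_map):
        acoustic_row[i] += weight * value / n_inside

print(acoustic_row)
```

The two cells within 1000 m of the first receiver share its weight, and the cell next to the second receiver gets the remainder, so the row sums to 1.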

In [None]:
from pangeo_fish import acoustic, utils

open data and clean

In [None]:
emission = xr.open_dataset(
    f"{target_root}/emission.zarr", engine="zarr", chunks={"x": -1, "y": -1},
    storage_options=storage_options,
)

construct the emission probabilities

In [None]:
acoustic_pdf = acoustic.emission_probability(
    tag, emission[["time", "cell_ids", "mask"]].compute(), receiver_buffer
)
acoustic_pdf

In [None]:
acoustic_pdf = acoustic_pdf.persist()

Verify the data

In [None]:
import hvplot.xarray

(
    tag["acoustic"]["deployment_id"].hvplot.scatter(c="red", marker="x")
    * (acoustic_pdf["acoustic"] != 0).sum(dim=("y", "x")).hvplot()
)


In [None]:
combined = emission.merge(acoustic_pdf).chunk({"x": -1, "y": -1, "time": 1}).persist()
combined

save

In [None]:
combined.to_zarr(
    f"{target_root}/emission.zarr", mode="w", consolidated=True,
    storage_options=storage_options    
)

cleanup

In [None]:
del emission, acoustic_pdf, combined

## 6. Combine the emission matrix and normalise
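One common way to combine independent sources of evidence is to multiply them cell by cell and rescale each time step so it sums to 1. The sketch below shows that operation on a flattened toy grid. Whether `combine_emission_pdf` does exactly this is an assumption, and the values and the helper name `combine_and_normalise` are hypothetical.

```python
# Hypothetical per-cell probabilities for one time step (flattened grid).
temperature_row = [0.2, 0.5, 0.3, 0.0]
acoustic_row = [0.0, 0.5, 0.5, 0.0]
ocean_mask = [True, True, True, False]


def combine_and_normalise(pdfs, mask):
    """Multiply several pdfs cell by cell, mask land cells, rescale to sum to 1."""
    combined = []
    for i, is_ocean in enumerate(mask):
        product = 1.0
        for pdf in pdfs:
            product *= pdf[i]
        combined.append(product if is_ocean else 0.0)
    total = sum(combined)
    if total == 0:
        raise ValueError("the emission sources do not overlap anywhere")
    return [value / total for value in combined]


combined_row = combine_and_normalise([temperature_row, acoustic_row], ocean_mask)
print(combined_row)
```

A cell keeps a non-zero probability only if every source allows it, which is why the first cell drops out here despite a plausible temperature match.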

In [None]:
from pangeo_fish.pdf import combine_emission_pdf

In [None]:
combined = (
    xr.open_dataset(
        f"{target_root}/emission.zarr",
        engine="zarr",
        chunks={"x": -1, "y": -1, "time": "auto"},
        inline_array=True,
        storage_options=storage_options          
    )
    .pipe(combine_emission_pdf)
    .chunk({"x": -1, "y": -1, "time": "auto"})
    .persist()  # comment this line out if the emission matrix does *not* fit in memory
)
combined

Verify the data

In [None]:
combined["pdf"].sum(["x", "y"]).plot()

save

In [None]:
combined.to_zarr(
    f"{target_root}/combined.zarr", mode="w", consolidated=True,
    storage_options=storage_options    
)

## 7. Estimate the model parameter
    1. select and create estimator instance
    2. create an optimizer using the estimator and the expected parameter range
    3. fit the model to the data to get the model parameter
    4. save
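The fitting step is a one-dimensional search for the `sigma` that optimises a score within fixed bounds. The sketch below uses a golden-section search as a stand-in technique. It is not the optimizer behind `EagerBoundsSearch`, and the quadratic `toy_score` with its minimum at 3.0 is a made-up objective standing in for the model score.

```python
import math


def golden_section_minimize(func, lower, upper, xtol=1e-2):
    """Minimise a unimodal function on [lower, upper] to within xtol."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > xtol:
        if func(c) < func(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


def toy_score(sigma):
    """Hypothetical score to minimise; the real score comes from the estimator."""
    return (sigma - 3.0) ** 2


# Bounds mirror the notebook's (1e-4, max_sigma); 7.8 is an assumed max_sigma.
best_sigma = golden_section_minimize(toy_score, 1e-4, 7.8, xtol=1e-2)
print(round(best_sigma, 2))
```

The `xtol` argument plays the same role as the notebook's `tolerance`: a looser value stops the search earlier at the cost of a less precise `sigma`.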

In [None]:
import json
import fsspec
from pangeo_fish.hmm.estimator import EagerScoreEstimator
from pangeo_fish.hmm.optimize import EagerBoundsSearch

open the data

In [None]:
emission = (
    xr.open_dataset(
        f"{target_root}/combined.zarr",
        engine="zarr",
        chunks={"x": -1, "y": -1, "time": "auto"},
        inline_array=True,
        storage_options=storage_options            
    )
    .compute()  # comment this line out if the emission matrix does *not* fit in memory
)
emission

create and configure estimator and optimizer

In [None]:
estimator = EagerScoreEstimator()

optimizer = EagerBoundsSearch(
    estimator,
    (1e-4, emission.attrs["max_sigma"]),
    optimizer_kwargs={"disp": 3, "xtol": tolerance},
)

fit the model parameter to the data

In [None]:
%%time
optimized = optimizer.fit(emission)
optimized

save

In [None]:
params = optimized.to_dict()
# fsspec.open forwards extra keyword arguments to the filesystem,
# so the storage options are unpacked instead of passed as one argument.
with fsspec.open(
    f"{target_root}/parameters.json", mode="w", **(storage_options or {})
) as f:
    json.dump(params, f)

## 8. State probabilities
    1. use the configured estimator to predict the state probabilities
    2. save
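Predicting state probabilities in a hidden Markov model alternates between a movement step and an update with the emission probabilities. The sketch below runs only the forward pass on a one-dimensional toy grid with a Gaussian movement kernel. It is not the implementation of `predict_proba`: the grid, the kernel radius, the sample values, and all function names are assumptions for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def predict(prob, kernel):
    """Movement step: spread each cell's probability to its neighbours."""
    radius = len(kernel) // 2
    out = [0.0] * len(prob)
    for i, p in enumerate(prob):
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(prob):
                out[j] += p * w
    return out


def forward_filter(initial, emission, sigma, radius=2):
    """Forward pass: predict, multiply by the emission row, renormalise."""
    kernel = gaussian_kernel(sigma, radius)
    prob = initial
    states = []
    for t, row in enumerate(emission):
        if t > 0:
            prob = predict(prob, kernel)
        prob = [p * e for p, e in zip(prob, row)]
        total = sum(prob)
        prob = [p / total for p in prob]
        states.append(prob)
    return states


# Hypothetical release position (cell 0) and emission rows for three time steps.
initial = [1.0, 0.0, 0.0, 0.0, 0.0]
emission_rows = [
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [0.1, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.1, 0.0],
]
toy_states = forward_filter(initial, emission_rows, sigma=1.0)
most_likely = [row.index(max(row)) for row in toy_states]
print(most_likely)
```

A larger `sigma` lets probability spread further between time steps, which is the trade-off the parameter estimation in step 7 settles.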

recreate the estimator

In [None]:
with fsspec.open(
    f"{target_root}/parameters.json", mode="r", **(storage_options or {})
) as f:
    params = json.load(f)
optimized = EagerScoreEstimator(**params)
optimized

load the data

In [None]:
emission = (
    xr.open_dataset(
        f"{target_root}/combined.zarr",
        engine="zarr",
        chunks={"x": -1, "y": -1, "time": "auto"},
        inline_array=True,
        storage_options=storage_options            
    )
)
emission

predict the state probabilities

In [None]:
%%time
states = optimized.predict_proba(emission)
states

save

In [None]:
%%time
states.chunk({"time": 1, "x": -1, "y": -1}).to_zarr(
    f"{target_root}/states.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)

cleanup

In [None]:
del states

## 9. Compute tracks
    1. compute the mean and mode tracks from the precomputed state probabilities, and apply the Viterbi algorithm to the emission matrix to get the most probable track
    2. save
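The mean and mode tracks can be read directly off the state probabilities: the mean is the probability-weighted average position and the mode is the centre of the most probable cell. The sketch below shows both on a flattened toy grid. It does not cover the Viterbi track and is not the implementation of `decode`. The coordinates, probabilities, and function names are assumptions.

```python
# Hypothetical cell centres (latitude, longitude) of a flattened grid.
cells = [(48.0, -5.0), (48.0, -4.0), (49.0, -4.0)]
# Hypothetical state probabilities: one row per time step, one value per cell.
toy_states = [
    [0.6, 0.4, 0.0],
    [0.0, 0.25, 0.75],
]


def mean_track(states, cells):
    """Probability-weighted average position at each time step."""
    track = []
    for row in states:
        lat = sum(p * c[0] for p, c in zip(row, cells))
        lon = sum(p * c[1] for p, c in zip(row, cells))
        track.append((lat, lon))
    return track


def mode_track(states, cells):
    """Centre of the most probable cell at each time step."""
    return [cells[max(range(len(row)), key=row.__getitem__)] for row in states]


mean_positions = mean_track(toy_states, cells)
mode_positions = mode_track(toy_states, cells)
print(mean_positions)
print(mode_positions)
```

The mean can fall between cells, or even on land when the probability is split around an obstacle, whereas the mode always sits on a grid cell but may jump between distant cells from one time step to the next.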

In [None]:
from pangeo_fish.hmm.estimator import EagerScoreEstimator

open data

In [None]:
emission = None

In [None]:
# TODO: load the emission matrix only if the viterbi mode is requested
emission = (
    xr.open_dataset(
        f"{target_root}/combined.zarr",
        engine="zarr",
        chunks={"x": -1, "y": -1, "time": "auto"},
        inline_array=True,
        storage_options=storage_options,
    ).compute()
)
emission


In [None]:
with fsspec.open(
    f"{target_root}/parameters.json", mode="r", **(storage_options or {})
) as f:
    params = json.load(f)

In [None]:
optimized = EagerScoreEstimator(**params)

states = xr.open_dataset(
    f"{target_root}/states.zarr",
    engine="zarr",
    chunks={},
    inline_array=True,
    storage_options=storage_options,
).compute()
states

decode tracks

In [None]:
%%time
trajectories = optimized.decode(
    emission,
    states,
    mode=track_modes,
    progress=True,
    additional_quantities=additional_track_quantities,
)
trajectories

save

In [None]:
from pangeo_fish.io import save_trajectories

In [None]:
# TODO: this call needs storage options when tracks_root points to remote storage
save_trajectories(trajectories, tracks_root, format="parquet")

cleanup

In [None]:
del emission, states, trajectories

## 10. Visualise results
    1. plot the emission matrix
    2. plot the state probabilities
    3. plot each of the tracks
    4. Create movie

In [None]:
import cmocean
import geopandas as gpd
import holoviews as hv
import hvplot.xarray
import movingpandas as mpd
import xmovie

from pangeo_fish import visualization
from pangeo_fish.io import read_trajectories

In [None]:
trajectories = read_trajectories(tracks_root, track_modes, format="parquet")
trajectories

In [None]:
trajectories.hvplot(c="speed", tiles="CartoLight", cmap="cmo.speed")

In [None]:
plots = [
    traj.hvplot(c="speed", tiles="CartoLight", title=traj.id, cmap="cmo.speed")
    for traj in trajectories.trajectories
]

hv.Layout(plots).cols(2)

In [None]:
# Section 5 writes the temperature and acoustic emission data to emission.zarr.
emission = (
    xr.open_dataset(
        f"{target_root}/emission.zarr",
        engine="zarr",
        chunks={},
        inline_array=True,
        storage_options=storage_options,
    )
    .pipe(combine_emission_pdf)
    .rename_vars({"pdf": "emission"})
    .drop_vars(["final", "initial"])
)
states = xr.open_dataset(
    f"{target_root}/states.zarr",
    engine="zarr",
    chunks={},
    inline_array=True,
    storage_options=storage_options,
).where(emission["mask"])
data = xr.merge([states, emission.drop_vars(["mask"])])
data

In [None]:
plot1 = visualization.plot_map(data["states"])
plot2 = visualization.plot_map(data["emission"])

hv.Layout([plot1, plot2]).cols(1)

In [None]:
%%time
mov = xmovie.Movie(
    data.pipe(lambda ds: ds.merge(ds[["longitude", "latitude"]].compute())).pipe(
        visualization.filter_by_states
    ),
    plotfunc=visualization.create_frame,
    input_check=False,
    pixelwidth=15 * 400,
    pixelheight=12 * 400,
    dpi=400,
)

mov.save(f"{target_root}/states.mp4", parallel=True, overwrite_existing=True)