# **Example Usage of Pangeo-Fish Software**


**Overview:**
This Jupyter notebook demonstrates the usage of the Pangeo-Fish software, a tool designed for analyzing biologging data in reference to Earth Observation (EO) data. Specifically, it utilizes data employed in the study conducted by M. Gonze et al. titled "Combining acoustic telemetry with archival tagging to investigate the spatial dynamics of the understudied pollack *Pollachius pollachius*," accepted for publication in the Journal of Fish Biology.

We showcase the application using the biologging tag 'A19124' attached to a pollack, along with reference EO data from the European Union Copernicus Marine Service Information (CMEMS) product 'NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013'. The biologging data consist of Data Storage Tag (DST) records and teledetection by acoustic signals, along with the release and recapture times and locations of the tagged fish. Both the biologging data and the reference EO data are accessible over HTTPS, and the access methods are incorporated in this notebook.



**Purpose:**
By executing this notebook, users will learn how to set up a workflow for utilizing the Pangeo-Fish software. The workflow consists of 9 steps which are described below:

1. **Configure the Notebook:** Prepare the notebook environment for analysis.
2. **Compare Reference Model with DST Information:** Analyze and compare data from the reference model with information from the biologging data of the species in question. 
3. **Regrid the Grid from Reference Model Grid to HEALPix Grid:** Transform the grid from the reference model to the HEALPix grid for further analysis.
4. **Construct Emission Matrix:** Create an emission matrix based on the transformed grid.
5. **Compute Additional Emission Probability Matrix:** Calculate an additional emission probability matrix, particularly focusing on teledetection from acoustic signals.
6. **Combine and Normalize Emission Matrix:** Merge the emission matrix and normalize it for further processing.
7. **Estimate Model Parameters:** Determine the parameters of the model based on the normalized emission matrix.
8. **Compute State Probabilities and Tracks:** Calculate the probability distribution of the species in question and compute the tracks.
9. **Visualization:** Visualize the results of the analysis for interpretation and insight.

Throughout this notebook, users will gain practical experience in setting up and executing a workflow using Pangeo-Fish, enabling them to apply similar methodologies to their own biologging data analysis tasks.



## 1. **Configure the Notebook:** Prepare the notebook environment for analysis.

In this step, we set up the notebook environment for analysis. This includes installing the necessary packages, importing the required libraries, setting up parameters, and configuring the cluster for distributed computing. We also retrieve the tag data needed for the analysis.

    

In [None]:
!pip install rich zstandard
!pip install "xarray-healpy @ git+https://github.com/iaocea/xarray-healpy.git@0ffca6058f4008f4f22f076e2d60787fcf32ac82"
# !pip install -e ../.
!pip install movingpandas more_itertools
!pip install xarray --upgrade
!pip install xdggs healpix-convolution

In [None]:
from pint_xarray import unit_registry as ureg
import hvplot.xarray
import xarray as xr
import sys
sys.path.append("../")
import pangeo_fish

In [None]:
#
# Set up execution parameters for the analysis.
#
# Note: This cell is tagged as parameters, allowing automatic updates when configuring with papermill.

# tag_name is the name of the biologging tag (DST identification number).
# It is also used as the path under which all information for the fish carrying this tag is stored.
tag_name = "A19124"

# tag_root specifies the root URL for tag data used for this computation.
tag_root = "https://data-taos.ifremer.fr/data_tmp/cleaned/tag/"


# scratch_root specifies the root directory for storing output files.
scratch_root = "s3://destine-gfts-data-lake/demo"

# storage_options specifies options for the filesystem storing output files.
storage_options = {
    "anon": False,
    "profile": "gfts",
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
}

# If you are using the local file system, keep the following two lines:
# they override the S3 settings above. Comment them out to write to S3 instead.
scratch_root = "."
storage_options = None

# Default chunk size for the time dimension. This value depends on the configuration of your dask cluster.
chunk_time = 24

# Whether to use a HEALPix grid (["cells"]) or a 2D grid (["x", "y"])
dims = ["x", "y"]

#
# Parameters for step 2. **Compare Reference Model with DST Information:**
#
# bbox, bounding box, defines the latitude and longitude range for the analysis area.
bbox = {"latitude": [46, 51], "longitude": [-8, -1]}

# relative_depth_threshold defines the acceptable fish depth relative to the maximum tag depth.
# It determines whether the fish can be considered to be in a certain location based on depth.
relative_depth_threshold = 0.8

#
# Parameters for step 3. **Regrid the Grid from Reference Model Grid to HEALPix Grid:**
#
# optional rotation for the HEALPix grid
rot = {"lat": 0, "lon": 0}
# nside defines the resolution of the HEALPix grid used for regridding.
nside = 4096  # *2

# min_vertices sets the minimum number of vertices for a valid transcription for regridding.
min_vertices = 1

#
# Parameters for step 4. **Construct Emission Matrix:**
#
# differences_std sets the standard deviation for scipy.stats.norm.pdf.
# It expresses the estimated certainty of the field of difference.
differences_std = 0.75

# recapture_std sets the covariance for recapture event.
# It shows the certainty of the final recapture area if it is known.
recapture_std = 1e-2

# earth_radius defines the radius of the Earth used for distance calculations.
earth_radius = ureg.Quantity(6371, "km")

# maximum_speed sets the maximum allowable speed for the tagged fish.
maximum_speed = ureg.Quantity(60, "km / day")

# adjustment_factor adjusts parameters for a more fuzzy search.
# It will factor the allowed maximum displacement of the fish.
adjustment_factor = 5

# truncate sets the truncating factor for computed maximum allowed sigma for convolution process.
truncate = 4

#
# Parameters for step 5. **Compute Additional Emission Probability Matrix:**
#
# receiver_buffer sets the maximum allowed detection distance for acoustic receivers.
receiver_buffer = ureg.Quantity(1000, "m")

#
# Parameters for step 7. **Estimate Model Parameters:**
#
# tolerance sets the tolerance level for optimised parameter search computation.
# Smaller values will make the optimization iterate more.
# In this tutorial, if a 1D index (HEALPix grid) is used, we suggest setting the value to 1e-6.
tolerance = 1e-3 if dims == ["x", "y"] else 1e-6

#
# Parameters for step 8. **Compute State Probabilities and Tracks:**
#
# track_modes defines the modes for track calculation.
track_modes = ["mean", "mode"]

# additional_track_quantities sets the quantities to compute for tracks using movingpandas.
additional_track_quantities = ["speed", "distance"]


#
# Parameters for step 9. **Visualization:**
#
# time_step defines the interval, in timesteps, between frames of the movie of the state and emission distributions
time_step = 3

In [None]:
# Define target root directories for storing analysis results.
target_root = f"{scratch_root}/{tag_name}"

# Defines default chunk size for optimisation.
default_chunk = {"time": chunk_time, "lat": -1, "lon": -1}
default_chunk_dims = {"time": chunk_time}
default_chunk_dims.update({d: -1 for d in dims})

In [None]:
# Set up a local cluster for distributed computing.
from distributed import LocalCluster

cluster = LocalCluster()
client = cluster.get_client()
client

In [None]:
from pangeo_fish.helpers import load_tag
tag, tag_log, time_slice = load_tag(tag_root, tag_name)
tag

In [None]:
from pangeo_fish.helpers import plot_tag

plot = plot_tag(tag, tag_log, save_html=True, storage_options=storage_options, target_root=target_root)
plot

## 2. **Compare Reference Model with DST Information:** Analyze and compare data from the reference model with information from the biologging data of the species in question.

In this step, we compare the reference model data with Data Storage Tag information.
The process involves reading and cleaning the reference model, aligning time, converting depth units, subtracting tag data from the model, and saving the results.
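
To make the subtraction and the role of `relative_depth_threshold` concrete, here is a minimal NumPy sketch on toy arrays. It is not the `compute_diff` implementation: the function name, the array layout, and the model-minus-tag sign convention are assumptions made for illustration.

```python
import numpy as np


def temperature_difference(model_temp, sea_floor_depth, tag_temp, tag_depth,
                           relative_depth_threshold=0.8):
    """Toy model-minus-tag temperature difference (illustrative only).

    model_temp:          (time, y, x) model temperature at the tag depth
    sea_floor_depth:     (y, x) sea-floor depth in metres, positive downwards
    tag_temp, tag_depth: (time,) values logged by the tag
    """
    difference = model_temp - tag_temp[:, None, None]
    # Assumption: a cell is plausible only if its sea floor is at least
    # `relative_depth_threshold` times as deep as the depth logged by the tag.
    plausible = (
        sea_floor_depth[None, :, :]
        >= relative_depth_threshold * tag_depth[:, None, None]
    )
    # Implausible cells become NaN so that later steps ignore them.
    return np.where(plausible, difference, np.nan)


model_temp = np.full((2, 2, 2), 12.0)
sea_floor_depth = np.array([[10.0, 50.0], [80.0, 120.0]])
tag_temp = np.array([11.5, 12.5])
tag_depth = np.array([20.0, 100.0])

diff_demo = temperature_difference(model_temp, sea_floor_depth, tag_temp, tag_depth)
print(np.count_nonzero(~np.isnan(diff_demo), axis=(1, 2)))  # valid cells per timestep
```

The deeper the tag reading, the fewer cells remain valid, which is why the number of non-null values per timestep is a useful diagnostic.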

In [None]:
from pangeo_fish.helpers import load_model, compute_diff

reference_model = load_model(tag_log, time_slice, bbox=bbox, chunk_time=chunk_time)
diff = compute_diff(reference_model, tag_log, relative_depth_threshold, chunk_time=chunk_time)
diff = diff.compute()

_We can detect abnormal data by looking at the number of non-null values for each timestep._

In [None]:
diff["diff"].count(["lat", "lon"]).plot()
diff

In [None]:
diff.to_zarr(f"{target_root}/diff.zarr", mode="w", storage_options=storage_options)

## 3. **Regrid the Grid from Reference Model Grid to HEALPix Grid:** Transform the grid from the reference model to the HEALPix grid for further analysis.

In this step, we regrid the data from the reference model grid to a HEALPix grid. This process involves defining the HEALPix grid, creating the target grid, computing interpolation weights, performing the regridding, and saving the regridded data.
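
To get a feel for what a given `nside` means, the following sketch computes the number of HEALPix cells and an approximate cell width from the standard relation "number of cells = 12 × nside²". The helper names and the default Earth radius argument are illustrative and not part of `pangeo-fish`.

```python
import math


def healpix_cell_count(nside):
    """Total number of cells of a HEALPix grid with the given nside."""
    return 12 * nside**2


def healpix_cell_size_km(nside, earth_radius_km=6371.0):
    """Approximate cell width: square root of the (equal) cell area."""
    cell_area_sr = 4 * math.pi / healpix_cell_count(nside)
    return earth_radius_km * math.sqrt(cell_area_sr)


for candidate_nside in (1024, 2048, 4096):
    print(
        candidate_nside,
        healpix_cell_count(candidate_nside),
        round(healpix_cell_size_km(candidate_nside), 2),
    )
```

Doubling `nside` halves the cell width and quadruples the number of cells, so memory use and compute time grow accordingly.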


In [6]:
from pangeo_fish.helpers import open_diff_dataset, regrid_dataset

In [7]:
# Open the previous dataset (in case we resume the notebook)
diff = open_diff_dataset(target_root, storage_options)
diff

In [None]:
reshaped = regrid_dataset(
    diff,
    nside,
    min_vertices=min_vertices,
    rot=rot,
    dims=dims
)
reshaped

Let's plot the same chart as before to check that the HEALPix regridding hasn't changed the data

In [None]:
reshaped["diff"].count(dims).plot()

In [None]:
reshaped.chunk(default_chunk_dims).to_zarr(
    f"{target_root}/diff-regridded.zarr",
    mode="w",
    consolidated=True,
    compute=True,
    storage_options=storage_options,
)

## 4. **Construct Emission Matrix:** Create an emission matrix based on the transformed grid.

In this step, we construct the emission probability matrix based on the differences between the observed tag temperature and the reference sea temperature computed in **step 2** and regridded in **step 3**. The emission probability matrix represents the likelihood of observing a specific temperature difference given the model parameters and configurations.
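
As a rough illustration of how a temperature difference turns into a likelihood, the sketch below evaluates a zero-mean normal density with standard deviation `differences_std`, which is the role the parameter comments assign to `scipy.stats.norm.pdf`. The function name and the toy values are assumptions, and `compute_emission_pdf` may do more than this (for example, handling the release and recapture positions).

```python
import numpy as np


def gaussian_emission(differences, std):
    """Zero-mean normal density evaluated on temperature differences.

    NaN cells (land, or cells ruled out by the depth check) stay NaN.
    """
    return np.exp(-0.5 * (differences / std) ** 2) / (std * np.sqrt(2 * np.pi))


# One timestep on a 2x2 toy grid; the last cell is masked.
differences_demo = np.array([[[0.0, 0.75], [1.5, np.nan]]])
emission_demo = gaussian_emission(differences_demo, std=0.75)
print(emission_demo)
```

A smaller `std` concentrates the likelihood on cells whose modelled temperature closely matches the tag, while a larger one tolerates more model or sensor error.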


In [21]:
from pangeo_fish.helpers import compute_emission_pdf

In [None]:
# Open the previous dataset (in case we resume the notebook)
differences = xr.open_dataset(
    f"{target_root}/diff-regridded.zarr",
    engine="zarr",
    chunks={},
    storage_options=storage_options,
).pipe(lambda ds: ds.merge(ds[["latitude", "longitude"]].compute()))
# ... and compute the emission matrices
emission_pdf = compute_emission_pdf(
    differences,
    tag["tagging_events"].ds,
    differences_std,
    recapture_std,
    dims=dims,
    chunk_time=chunk_time
)
emission_pdf

Whatever your data, it's important to **never have** a timestep that contains only null values.

How can we check that visually? You guessed it, by using a plot similar to the previous one!
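
The plot gives a visual check. The same condition can also be tested numerically, as in this small NumPy sketch on a toy array; `empty_timesteps` is a hypothetical helper, not a `pangeo-fish` function.

```python
import numpy as np


def empty_timesteps(pdf):
    """Indices of timesteps whose cells are all NaN; `pdf` has shape (time, ...)."""
    flattened = pdf.reshape(pdf.shape[0], -1)
    valid_counts = np.count_nonzero(~np.isnan(flattened), axis=1)
    return np.flatnonzero(valid_counts == 0)


pdf_demo = np.array(
    [
        [[0.2, np.nan], [0.1, 0.3]],        # three valid cells
        [[np.nan, np.nan], [np.nan, np.nan]],  # no valid cell at all
    ]
)
print(empty_timesteps(pdf_demo))
```

Any index reported here points to a timestep that cannot be normalized later and is worth inspecting before going further.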

In [None]:
emission_pdf = emission_pdf.chunk(default_chunk_dims).persist()
emission_pdf["pdf"].count(dims).plot()

In [None]:
# Save the dataset
emission_pdf.to_zarr(
    f"{target_root}/emission.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)

## 5. **Compute Additional Emission Probability Matrix**
Calculate an additional emission probability matrix, particularly focusing on teledetection from acoustic signals.

In this step, we compute additional emission probabilities based on acoustic detections for the selected tag. 

These additional probabilities enhance the emission probability matrix constructed in **step 4** by incorporating information from acoustic telemetry.

In [27]:
from pangeo_fish.helpers import compute_acoustic_pdf

In [None]:
# Load the previous emission pdf and compute the emission probabilities based on acoustic detections
emission_pdf = xr.open_dataset(
    f"{target_root}/emission.zarr",
    engine="zarr",
    chunks={},
    storage_options=storage_options,
) # chunk?
acoustic_pdf = compute_acoustic_pdf(
    emission_pdf,
    tag,
    receiver_buffer,
    chunk_time=chunk_time,
    dims=dims
).persist()
acoustic_pdf

This time, we check the data as before while pinpointing when detections occur!

In [None]:
tag["acoustic"]["deployment_id"].hvplot.scatter(c="red", marker="x") * (
    acoustic_pdf["acoustic"] != 0
).sum(dim=dims).hvplot()

### Explanations
On the plot above, the number of counted values drops to just a few at detection times (`5` in this example).

This number corresponds to the number of pixels that cover the detection area.

Such a drop is expected: at those times we know that the fish was detected there, so it cannot be elsewhere.

These sporadic detections will strongly constrain the geolocation model during optimization!
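
To illustrate why only a handful of pixels remain at detection times, the sketch below marks the cells of a toy regular grid whose centres lie within a buffer distance of a receiver. The grid, the receiver position, and the helper names are made up for the example; `compute_acoustic_pdf` works on the notebook's own grid and may differ in detail.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def detection_mask(cell_lat, cell_lon, receiver_lat, receiver_lon, buffer_m=1000.0):
    """True for cells whose centre lies within `buffer_m` of the receiver."""
    return haversine_m(cell_lat, cell_lon, receiver_lat, receiver_lon) <= buffer_m


# Toy regular grid with 0.01 degree spacing, centred on a hypothetical receiver.
lat_axis = 48.0 + 0.01 * np.arange(-5, 6)
lon_axis = -4.5 + 0.01 * np.arange(-5, 6)
cell_lat, cell_lon = np.meshgrid(lat_axis, lon_axis, indexing="ij")

mask_demo = detection_mask(cell_lat, cell_lon, 48.0, -4.5)
print(int(mask_demo.sum()), "of", mask_demo.size, "cells lie within the buffer")
```

With a buffer comparable to the cell size, only the cell containing the receiver and a few neighbours pass the test, which matches the sharp drop seen in the plot.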

**The next cell is optional. It will save the acoustic probabilities. It is not necessary (see the next step).**

In [None]:
acoustic_pdf.to_zarr(
    f"{target_root}/acoustic.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)

## 6. **Combine and Normalize the 2 distributions** 
Merge the `emission` distribution with the `acoustic` one and normalize it for further processing.

In this step, we combine the emission probability matrices constructed in **steps 4** and **5**, then normalize the result so that the probabilities sum to one at each timestep. This prepares the combined emission matrix for further analysis and interpretation.
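
Conceptually, combining amounts to a cell-wise product followed by a per-timestep normalization. The NumPy sketch below shows this on toy arrays of shape `(time, y, x)`; it is an assumption about the general technique, not a copy of `combine_pdfs`.

```python
import numpy as np


def combine_and_normalize(*pdfs):
    """Multiply pdfs cell by cell, then scale each timestep to sum to 1."""
    combined = np.prod(np.stack(pdfs), axis=0)  # NaN in any input stays NaN
    totals = np.nansum(combined, axis=(1, 2), keepdims=True)
    return combined / totals


emission_demo = np.array([[[0.4, 0.2], [0.1, np.nan]]])
acoustic_demo = np.array([[[1.0, 1.0], [0.0, 1.0]]])  # 0 rules a cell out

combined_demo = combine_and_normalize(emission_demo, acoustic_demo)
print(combined_demo)
print(np.nansum(combined_demo, axis=(1, 2)))
```

If a timestep has no valid or non-zero cell, its total is zero and the division cannot produce a valid distribution, which is why the earlier null checks matter.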


In [31]:
from pangeo_fish.helpers import combine_pdfs

In [None]:
combined = combine_pdfs(emission_pdf, acoustic_pdf, default_chunk_dims, dims=dims)
combined.to_zarr(
    f"{target_root}/combined.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)

### In addition, we can check that our final temporal _pdf_ is valid, i.e., that it sums to `1` at every timestep.

In [None]:
combined["pdf"].sum(dims).plot(ylim=(0, 2))

## 7. **Estimate Model's Parameters**
Determine the parameters of the model based on the normalized emission matrix:

1. We estimate the maximum allowed value of the parameter we aim to optimize, namely `sigma`.  
2. We then create an optimizer with an expected parameter range and fit the model to the normalized emission matrix.  
3. Finally, the resulting `sigma` along with any additional parameters used during optimization is saved to a `.json` file.  
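
The first item can be sketched from the parameter descriptions in step 1: the maximum speed, loosened by `adjustment_factor`, bounds the displacement per timestep, and `truncate` relates that displacement to the width of the convolution kernel. The function below is a plain-Python illustration under those assumptions. `timestep_hours` and `cell_size_km` are hypothetical inputs chosen for the example, and `optimize_pdf` may compute the bound differently.

```python
def maximum_sigma(maximum_speed_km_per_day, timestep_hours, cell_size_km,
                  adjustment_factor=5, truncate=4):
    """Illustrative upper bound for sigma, in grid cells per timestep."""
    # Largest distance the fish may cover in one timestep, loosened by the
    # adjustment factor to allow for a fuzzier search.
    max_displacement_km = (
        maximum_speed_km_per_day / 24 * timestep_hours * adjustment_factor
    )
    max_displacement_cells = max_displacement_km / cell_size_km
    # A Gaussian kernel cut off at `truncate` standard deviations should
    # just span that displacement.
    return max_displacement_cells / truncate


# Assumed values: one-hour timesteps and cells about 1.59 km wide.
sigma_upper_bound = maximum_sigma(60, timestep_hours=1.0, cell_size_km=1.59)
print(round(sigma_upper_bound, 3))
```

The optimizer then only needs to search between a small positive value and this bound.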

In [34]:
from pangeo_fish.helpers import optimize_pdf

In [None]:
# Open the distributions
emission = xr.open_dataset(
    f"{target_root}/combined.zarr",
    engine="zarr",
    chunks=default_chunk_dims,
    inline_array=True,
    storage_options=storage_options,
)
# Call the optimization process
params = optimize_pdf(
    emission,
    earth_radius,
    adjustment_factor,
    truncate,
    maximum_speed,
    tolerance,
    dims=dims
)
params

In [36]:
# Save the results, mainly `sigma`
import pandas as pd
pd.DataFrame.from_dict(params, orient="index").to_json(
    f"{target_root}/parameters.json", storage_options=storage_options
)

## 8. **Compute State Probabilities and Trajectories**
Calculate the probability distribution of the species in question and compute the tracks (or trajectories).

This step involves predicting state probabilities using the optimised parameter `sigma` computed in the previous step together with the normalized emission matrix.
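
For intuition, the sketch below runs a forward filter of a hidden Markov model on a one-dimensional toy grid: each timestep spreads the previous distribution with a Gaussian kernel of width `sigma`, multiplies by the emission, and normalizes. It also derives "mean" and "mode" tracks from the result. This is a simplified assumption about the technique: `predict_positions` works on the notebook's grid and may include further steps, such as a backward pass.

```python
import numpy as np


def gaussian_kernel(sigma, truncate=4):
    """Discrete Gaussian kernel cut off at `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()


def forward_filter(emission, initial, sigma):
    """emission: (time, cells); initial: (cells,). Returns (time, cells)."""
    kernel = gaussian_kernel(sigma)
    states = np.empty_like(emission)
    previous = initial / initial.sum()
    for t in range(emission.shape[0]):
        predicted = np.convolve(previous, kernel, mode="same")  # movement
        updated = predicted * emission[t]                        # observation
        previous = updated / updated.sum()
        states[t] = previous
    return states


cells = np.arange(21)
initial = np.zeros(21)
initial[5] = 1.0  # known release position

# Toy emission whose peak drifts from cell 6 to cell 10.
emission_toy = np.stack(
    [np.exp(-0.5 * ((cells - centre) / 2.0) ** 2) for centre in (6, 8, 10)]
)
states_demo = forward_filter(emission_toy, initial, sigma=1.5)

mode_track = states_demo.argmax(axis=1)          # most probable cell
mean_track = (states_demo * cells).sum(axis=1)   # probability-weighted position
print(mode_track, mean_track.round(2))
```

The estimated position lags behind the emission peak because `sigma` limits how far the fish can move per timestep, which is the effect the optimization in step 7 tunes.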

In [37]:
from pangeo_fish.helpers import predict_positions

In [None]:
states, trajectories = predict_positions(target_root,
    storage_options,
    chunks=default_chunk_dims,
    track_modes=track_modes,
    additional_track_quantities=additional_track_quantities,
    dims=dims
)

Let's quickly check that the positional probability distribution `states` never sums to 0 at any timestep!

In [None]:
(
    states.sum(dims).hvplot(width=500, ylim=(0, 2), title="Sum of the probabilities") +
    states.count(dims).hvplot(width=500, title="Number of non-zero probabilities")
).opts(shared_axes=False)

## 9. **Visualization** 
Visualize the results of the analysis for interpretation and insight.


In this step, we visualize various aspects of the analysis results to gain insights and interpret the model outcomes. 

We plot the emission matrix, which represents the likelihood of observing a specific temperature difference given the model parameters and configurations. 

Additionally, we visualize the state probabilities, showing the likelihood of the system being in different states at each time step. 

We also plot each of the tracks of the tagged fish, displaying their movement patterns over time. 

Finally, we create a movie that combines the emission matrix and state probabilities to provide a comprehensive visualization of the analysis results.
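
When reading the tracks, it helps to know what the additional quantities (`speed`, `distance`) represent. The notebook computes them with movingpandas. The plain-Python sketch below only illustrates the idea on a made-up track, using the haversine formula, and flags segments faster than the 60 km/day limit used in step 1.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance in kilometres between points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def segment_quantities(track):
    """track: list of (day, lat, lon). Returns (distance_km, km_per_day) pairs."""
    results = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(track, track[1:]):
        distance = haversine_km(lat0, lon0, lat1, lon1)
        results.append((distance, distance / (t1 - t0)))
    return results


# Hypothetical daily positions (day, latitude, longitude).
track_demo = [(0.0, 48.0, -4.5), (1.0, 48.1, -4.5), (2.0, 48.1, -4.2)]
segments = segment_quantities(track_demo)
too_fast = [speed > 60 for _, speed in segments]
print(segments, too_fast)
```

Segments flagged as too fast would suggest that the track is inconsistent with the movement constraint assumed by the model.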

### 9.1 Plotting the trajectories 

In [47]:
from pangeo_fish.helpers import plot_trajectories

In [None]:
plot = plot_trajectories(
    target_root,
    track_modes,
    storage_options,
    save_html=True
)
plot

### 9.2 Plotting the `states` and `emission` distributions 

In [49]:
from pangeo_fish.helpers import open_distributions, render_distributions

In [None]:
data = open_distributions(target_root, storage_options, default_chunk_dims, chunk_time=chunk_time)
data

The interactive plot above is too large to be stored as an `HTML` file (as done earlier with the trajectories).

Fortunately, `pangeo-fish` can efficiently render images of `data` and build a video from them! 

In [None]:
video_filename = render_distributions(
    data,
    xlim=bbox["longitude"],
    ylim=bbox["latitude"],
    time_step=time_step,
    extension="mp4",
    frames_dir="images",
    remove_frames=True
)

if target_root.startswith("s3://"):
    import s3fs

    s3 = s3fs.S3FileSystem(**storage_options)
    s3.put_file(video_filename, f"{target_root}/{video_filename}")