# **Example Usage of Pangeo-Fish Software with Healpix Convolution**


**Overview:**
This Jupyter notebook demonstrates the usage of the Pangeo-Fish software, a tool designed for analyzing biologging data in reference to Earth Observation (EO) data. Specifically, it utilizes data employed in the study conducted by M. Gonze et al. titled "Combining acoustic telemetry with archival tagging to investigate the spatial dynamics of the understudied pollack *Pollachius pollachius*," accepted for publication in the Journal of Fish Biology.

We showcase the application using the biologging tag 'A19124' attached to a pollack, along with reference EO data from the European Union Copernicus Marine Service Information (CMEMS) product 'NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013'. The biologging data consist of Data Storage Tag (DST) records and teledetection by acoustic signals, along with the release and recapture times and locations of the tagged fish. Both the biologging data and the reference EO data are accessible over HTTPS, and the access methods are incorporated in this notebook.



**Purpose:**
By executing this notebook, users will learn how to set up a workflow for utilizing the Pangeo-Fish software. The workflow consists of 9 steps which are described below:

1. **Configure the Notebook:** Prepare the notebook environment for analysis.
2. **Compare Reference Model with DST Information:** Analyze and compare data from the reference model with information from the biologging data of the species in question. 
3. **Regrid the Grid from Reference Model Grid to Healpix Grid:** Transform the grid from the reference model to the Healpix grid for further analysis.
4. **Construct Emission Matrix:** Create an emission matrix based on the transformed grid.
5. **Compute Additional Emission Probability Matrix:** Calculate an additional emission probability matrix, particularly focusing on teledetection from acoustic signals.
6. **Combine and Normalize Emission Matrix:** Merge the emission matrix and normalize it for further processing.
7. **Estimate Model Parameters:** Determine the parameters of the model based on the normalized emission matrix.
8. **Compute State Probabilities and Tracks:** Calculate the probability distribution of the species in question and compute the tracks.
9. **Visualization:** Visualize the results of the analysis for interpretation and insight.

Throughout this notebook, users will gain practical experience in setting up and executing a workflow using Pangeo-Fish, enabling them to apply similar methodologies to their own biologging data analysis tasks.



## 1. **Configure the Notebook:** Prepare the notebook environment for analysis.

In this step, we set up the notebook environment for analysis. This includes installing the necessary packages, importing the required libraries, setting up parameters, and configuring the cluster for distributed computing. We also retrieve the tag data needed for the analysis.

    

In [11]:
# Import necessary libraries and modules.
import xarray as xr
from pint_xarray import unit_registry as ureg
from pangeo_fish.io import open_tag

In [12]:
#
# Set up execution parameters for the analysis.
#
# Note: This cell is tagged as parameters, allowing automatic updates when configuring with papermill.

# tag_name is the name of the biologging tag (DST identification number),
# which is also the path under which all information for the fish carrying this tag is stored.
tag_name = "A19124"

# tag_root specifies the root URL for tag data used for this computation.
tag_root = "https://data-taos.ifremer.fr/data_tmp/cleaned/tag/"

# catalog_url specifies the URL for the catalog for reference data used.
catalog_url = "https://data-taos.ifremer.fr/kerchunk/ref-copernicus.yaml"
open_catalog = True


# scratch_root specifies the root directory for storing output files.
scratch_root = "s3://destine-gfts-data-lake/demo"

# storage_options specifies options for the filesystem storing output files.
storage_options = {
    "anon": False,
    "profile": "gfts",
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
}

# If you are using the local file system, keep the following two lines active;
# comment them out to write to the S3 bucket configured above.
scratch_root = "."
storage_options = None

# Default chunk size for the time dimension. This value depends on the configuration of your dask cluster.
chunk_time = 24

#
# Parameters for step 2. **Compare Reference Model with DST Information:**
#
# bbox, bounding box, defines the latitude and longitude range for the analysis area.
bbox = {"latitude": [46, 51], "longitude": [-8, -1]}

# relative_depth_threshold defines the acceptable fish depth relative to the maximum tag depth.
# It determines whether the fish can be considered to be in a certain location based on depth.
relative_depth_threshold = 0.8

#
# Parameters for step 3. **Regrid the Grid from Reference Model Grid to Healpix Grid:**
#
# nside defines the resolution of the healpix grid used for regridding.
nside = 4096  # *2

# min_vertices sets the minimum number of vertices for a valid transcription for regridding.
min_vertices = 1

#
# Parameters for step 4. **Construct Emission Matrix:**
#
# differences_std sets the standard deviation for scipy.stats.norm.pdf.
# It expresses the estimated certainty of the field of difference.
differences_std = 0.75

# recapture_std sets the standard deviation for the recapture event.
# It expresses the certainty of the final recapture area, if it is known.
recapture_std = 1e-2

# earth_radius defines the radius of the Earth used for distance calculations.
earth_radius = ureg.Quantity(6371, "km")

# maximum_speed sets the maximum allowable speed for the tagged fish.
maximum_speed = ureg.Quantity(60, "km / day")

# adjustment_factor adjusts parameters for a more fuzzy search.
# It will factor the allowed maximum displacement of the fish.
adjustment_factor = 5

# truncate sets the truncating factor for computed maximum allowed sigma for convolution process.
truncate = 4

#
# Parameters for step 5. **Compute Additional Emission Probability Matrix:**
#
# receiver_buffer sets the maximum allowed detection distance for acoustic receivers.
receiver_buffer = ureg.Quantity(1000, "m")

#
# Parameters for step 7. **Estimate Model Parameters:**
#
# tolerance sets the tolerance level for the optimised parameter search computation.
tolerance = 1e-6

#
# Parameters for step 8. **Compute State Probabilities and Tracks:**
#
# track_modes defines the modes for track calculation.
track_modes = ["mean", "mode"]
#track_modes = ["mean", "mode","mode_minrk","mean_minrk"]

# additional_track_quantities sets the quantities to compute for tracks using movingpandas.
additional_track_quantities = ["speed", "distance"]


#
# Parameters for step 9. **Visualization:**
#
# time_step sets the stride between the time steps used as frames in the movie of the state and emission matrices.
time_step = 3

In [13]:
# Define target root directories for storing analysis results.
target_root = f"{scratch_root}/{tag_name}"

# Defines default chunk size for optimisation.
default_chunk = {"time": chunk_time, "lat": -1, "lon": -1}
default_chunk_xy = {"time": chunk_time, "x": -1, "y": -1}
default_chunk_cells = {"time": chunk_time, "cells": -1}
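The choice between S3 and local scratch storage made in the parameter cell can be wrapped in a small helper. This is only a minimal sketch: `resolve_storage` and its `use_s3` flag are hypothetical names that are not part of Pangeo-Fish, and the S3 values are simply copied from the parameter cell.

```python
def resolve_storage(use_s3: bool):
    """Return (scratch_root, storage_options) for S3 or the local file system."""
    if use_s3:
        # Values copied from the parameter cell above.
        return "s3://destine-gfts-data-lake/demo", {
            "anon": False,
            "profile": "gfts",
            "client_kwargs": {
                "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
                "region_name": "gra",
            },
        }
    # Local runs write next to the notebook and need no filesystem options.
    return ".", None


scratch_root_demo, storage_options_demo = resolve_storage(use_s3=False)
target_root_demo = f"{scratch_root_demo}/A19124"
print(target_root_demo)  # ./A19124
```

Keeping both values in one place avoids a mismatch between an S3 root and `storage_options = None`.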

In [14]:
# Set up a local cluster for distributed computing.
from distributed import LocalCluster

cluster = LocalCluster()
client = cluster.get_client()
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 33411 instead


LocalCluster (status: running, using processes)
Dashboard: http://127.0.0.1:33411/status
Workers: 4, total threads: 16, total memory: 24.46 GiB
Each worker: 4 threads, 6.11 GiB memory


In [15]:
tag_root

'https://data-taos.ifremer.fr/data_tmp/cleaned/tag/'

In [16]:
# Open and retrieve the tag data required for the analysis
tag = open_tag(tag_root, tag_name)
tag

## 2. **Compare Reference Model with DST Information:** Analyze and compare data from the reference model with information from the biologging data of the species in question. 

In this step, we compare the reference model data with Data Storage Tag information.
The process involves reading and cleaning the reference model, aligning time, converting depth units, subtracting tag data from the model, and saving the results.

In [None]:
# Import necessary libraries
import intake
from pangeo_fish.cf import bounds_to_bins
from pangeo_fish.diff import diff_z
from pangeo_fish.io import open_copernicus_catalog
from pangeo_fish.tags import adapt_model_time, reshape_by_bins, to_time_slice

In [None]:
# Drop tag data outside the tagged events interval
time_slice = to_time_slice(tag["tagging_events/time"])
tag_log = tag["dst"].ds.sel(time=time_slice)

# Verify the data
import hvplot.xarray
import cmocean
from pangeo_fish.io import save_html_hvplot

plot = (
    (-tag["dst"].pressure).hvplot(width=1000, height=500, color="blue")
    * (-tag_log).hvplot.scatter(
        x="time", y="pressure", color="red", size=5, width=1000, height=500
    )
    * (
        (tag["dst"].temperature).hvplot(width=1000, height=500, color="blue")
        * (tag_log).hvplot.scatter(
            x="time", y="temperature", color="red", size=5, width=1000, height=500
        )
    )
)
filepath = f"{target_root}/tags.html"

save_html_hvplot(plot, filepath, storage_options)

plot

In [None]:
# Open and clean reference model
if open_catalog:
    cat = intake.open_catalog(catalog_url)
    model = open_copernicus_catalog(cat)
else:
    from pangeo_fish.io import open_copernicus_zarr

    model = open_copernicus_zarr(
        # model='GLOBAL_ANALYSISFORECAST_PHY_001_024',
        # freq="D",
    )

In [None]:
# Subset the reference model:
# - align the model time with the time of tag_log,
# - drop depth layers that are unlikely given the pressure observed in tag_log,
# - restrict latitude and longitude to bbox.
#
reference_model = (
    model.sel(time=adapt_model_time(time_slice))
    .sel(lat=slice(*bbox["latitude"]), lon=slice(*bbox["longitude"]))
    .pipe(
        lambda ds: ds.sel(
            depth=slice(None, (tag_log["pressure"].max() - ds["XE"].min()).compute())
        )
    )
).chunk({"time": chunk_time, "lat": -1, "lon": -1, "depth": -1})
reference_model

In [None]:
%%time
# Reshape the tag log, so that it bins to the time step of reference_model
reshaped_tag = reshape_by_bins(
    tag_log,
    dim="time",
    bins=(
        reference_model.cf.add_bounds(["time"], output_dim="bounds")
        .pipe(bounds_to_bins, bounds_dim="bounds")
        .get("time_bins")
    ),
    bin_dim="bincount",
    other_dim="obs",
).chunk({"time": chunk_time})
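To make the binning idea concrete, the following sketch groups tag observations into model time steps using only the standard library. It illustrates the concept and is not the implementation of `reshape_by_bins`; the function `bin_observations`, the hourly bin edges and the 20-minute sampling are all assumptions chosen for the example.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def bin_observations(obs_times, bin_edges):
    """Group observation indices by the half-open bin [edge_i, edge_{i+1}) they fall in."""
    bins = [[] for _ in range(len(bin_edges) - 1)]
    for index, t in enumerate(obs_times):
        position = bisect_right(bin_edges, t) - 1
        if 0 <= position < len(bins):  # ignore observations outside all bins
            bins[position].append(index)
    return bins


start = datetime(2022, 6, 13, 0, 0)
# Hypothetical hourly model bins and tag observations every 20 minutes.
edges = [start + timedelta(hours=h) for h in range(3)]
observations = [start + timedelta(minutes=20 * i) for i in range(7)]
binned = bin_observations(observations, edges)
print(binned)  # [[0, 1, 2], [3, 4, 5]]
```

The last observation falls exactly on the final edge and is therefore dropped, which is why each bin holds three observations.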

In [None]:
# Subtract the time-binned tag_log from the reference_model.
# For each time bin, each observed value is compared with the corresponding depth of reference_model using the diff_z function.
#
diff = (
    diff_z(reference_model, reshaped_tag, depth_threshold=relative_depth_threshold)
    .assign_attrs({"tag_id": tag_name})
    .assign(
        {
            "H0": reference_model["H0"],
            "ocean_mask": reference_model["H0"].notnull(),
        }
    )
)

# Persist the diff data
diff = diff.chunk(default_chunk).persist()
diff
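The comparison performed in the cell above can be pictured for a single grid cell and a single time bin. The sketch below uses NumPy with made-up numbers. Linear interpolation in depth is an assumption made for illustration: `diff_z` may select depth levels differently, and it additionally applies `relative_depth_threshold`, which is not modelled here.

```python
import numpy as np

# Hypothetical model temperature profile for one grid cell and one time step.
model_depth = np.array([0.0, 10.0, 20.0, 40.0])  # metres
model_temperature = np.array([16.0, 15.0, 13.0, 11.0])  # degrees Celsius

# Hypothetical tag observations falling into the same time bin.
tag_depth = np.array([5.0, 30.0])
tag_temperature = np.array([15.0, 12.5])

# Interpolate the model to the observed depths, then subtract the tag values.
model_at_tag_depth = np.interp(tag_depth, model_depth, model_temperature)
difference = model_at_tag_depth - tag_temperature
print(difference)
```

A difference close to zero means that the model temperature at this cell agrees with what the tag recorded, so the cell is a plausible position for the fish.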

In [None]:
# Verify the data
diff["diff"].count(["lat", "lon"]).plot()

In [None]:
%%time
# Save snapshot to disk
diff.to_zarr(f"{target_root}/diff.zarr", mode="w", storage_options=storage_options)

# Cleanup
del tag_log, cat, model, reference_model, reshaped_tag, diff

## 3. **Regrid the Grid from Reference Model Grid to Healpix Grid:** Transform the grid from the reference model to the Healpix grid for further analysis.

In this step, we regrid the data from the reference model grid to a Healpix grid. This process involves defining the Healpix grid, creating the target grid, computing interpolation weights, performing the regridding, and saving the regridded data.
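As a quick worked example, the `nside` value from the parameter cell fixes both the HEALPix level and the approximate cell size. The sketch below relies only on the standard HEALPix relations (12 · nside² equal-area cells on the sphere) and on the Earth radius from the parameter cell. The square-root-of-area measure of resolution is a common approximation, not necessarily the one used internally by the regridding library.

```python
import math

nside = 4096
level = int(math.log2(nside))  # HEALPix refinement level
n_cells = 12 * nside**2  # number of cells on the full sphere
cell_area_sr = 4 * math.pi / n_cells  # equal-area cells, in steradians
resolution_rad = math.sqrt(cell_area_sr)
resolution_km = resolution_rad * 6371  # Earth radius from the parameter cell

print(level, n_cells, round(resolution_km, 2))  # 12 201326592 1.59
```

Doubling `nside` (the commented-out `*2` in the parameter cell) raises the level by one and halves the cell size, at the cost of four times as many cells.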


In [None]:
# Import necessary libraries
import numpy as np
from xarray_healpy import HealpyGridInfo, HealpyRegridder
from pangeo_fish.grid import center_longitude

In [None]:
%%time

# Open the diff data and perform cleaning operations to prepare it for regridding.

ds = (
    xr.open_dataset(
        f"{target_root}/diff.zarr",
        engine="zarr",
        chunks={},
        storage_options=storage_options,
    )
    .pipe(lambda ds: ds.merge(ds[["latitude", "longitude"]].compute()))
    .swap_dims({"lat": "yi", "lon": "xi"})
    .drop_vars(["lat", "lon"])
)
# Find the minimum and maximum values, ignoring NaN
min_diff = ds["diff"].min(skipna=True).compute()
max_diff = ds["diff"].max(skipna=True).compute()

print(f"Minimum value of diff (ignoring NaN): {min_diff}")
print(f"Maximum value of diff (ignoring NaN): {max_diff}")

In [None]:
%%time
# Define the target Healpix grid information
grid = HealpyGridInfo(level=int(np.log2(nside)))
target_grid = grid.target_grid(ds).pipe(center_longitude, 0)
target_grid

In [None]:
%%time
# Compute the interpolation weights for regridding the diff data
regridder = HealpyRegridder(
    ds[["longitude", "latitude", "ocean_mask"]],
    target_grid,
    method="bilinear",
    interpolation_kwargs={"mask": "ocean_mask", "min_vertices": min_vertices},
)
regridder

In [None]:
%%time
# Perform the regridding operation using the computed interpolation weights.
regridded = regridder.regrid_ds(ds).assign_coords(
    cell_ids=lambda ds: ds.cell_ids.astype("int64")
)
regridded
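The roles of the ocean mask and of `min_vertices` can be illustrated with a hand-written bilinear interpolation inside one source cell. This is a sketch under assumptions: `masked_bilinear` is a hypothetical helper, and the actual mask handling of the regridder may differ. The idea shown is that land vertices are skipped, the remaining weights are renormalised, and the result is invalid when fewer than `min_vertices` ocean vertices remain.

```python
def masked_bilinear(values, valid, fx, fy, min_vertices=1):
    """Interpolate inside a cell from 4 corner values, skipping masked corners.

    values/valid are ordered (x0, y0), (x1, y0), (x0, y1), (x1, y1);
    fx and fy are the fractional positions of the target point in [0, 1].
    """
    weights = [(1 - fx) * (1 - fy), fx * (1 - fy), (1 - fx) * fy, fx * fy]
    kept = [(w, v) for w, v, ok in zip(weights, values, valid) if ok]
    total = sum(w for w, _ in kept)
    if len(kept) < min_vertices or total == 0:
        return float("nan")  # too few ocean vertices: no valid value
    # Renormalise so that the remaining weights sum to one.
    return sum(w * v for w, v in kept) / total


corners = [10.0, 12.0, 14.0, 16.0]
print(masked_bilinear(corners, [True, True, True, True], 0.5, 0.5))  # 13.0
print(masked_bilinear(corners, [True, True, False, False], 0.5, 0.5))  # 11.0
```

With `min_vertices = 1`, as in the parameter cell, a target cell near the coast still receives a value as long as one surrounding source vertex is ocean.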

In [None]:
# This cell verifies the regridded data by plotting the count of non-NaN values.
regridded["diff"].count(["cells"]).plot()

In [None]:
%%time
regridded.to_zarr(
    f"{target_root}/diff-regridded-1D.zarr",
    mode="w",
    consolidated=True,
    compute=True,
    storage_options=storage_options,
)

# Cleanup unnecessary variables to free up memory
del ds, grid, target_grid, regridder, regridded

## 4. **Construct Emission Matrix:** Create an emission matrix based on the transformed 1D grid.

In this step, we construct the emission probability matrix based on the differences between the observed tag temperature and the reference sea temperature computed in step 2 and regridded in step 3. The emission probability matrix represents the likelihood of observing a specific temperature difference given the model parameters and configurations.
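The mapping from temperature differences to probabilities can be sketched in a few lines of plain Python. The example evaluates a zero-mean normal density with `differences_std` from the parameter cell and normalises over the valid cells. The normalisation over cells is an assumption about what `pangeo_fish.pdf.normal` does when given `dims=["cells"]`, and the difference values are invented.

```python
import math


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


differences_std = 0.75
# Hypothetical model-minus-tag temperature differences for four cells;
# None marks a land cell with no model value.
cell_differences = [0.0, 0.5, -1.5, None]

densities = [
    None if d is None else normal_pdf(d, mean=0.0, std=differences_std)
    for d in cell_differences
]
# Normalise over the valid cells so that the probabilities sum to one.
total = sum(p for p in densities if p is not None)
emission = [None if p is None else p / total for p in densities]
print(emission)
```

Cells whose model temperature matches the tag (difference near zero) receive the highest probability, and a larger `differences_std` flattens the distribution.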


In [None]:
# Import necessary libraries
from toolz.dicttoolz import valfilter
from pangeo_fish.distributions import create_covariances
from pangeo_fish.distributions.healpix import normal_at
from pangeo_fish.pdf import normal
import xdggs

In [None]:
%%time
# Open the regridded diff data
differences = xr.open_dataset(
    f"{target_root}/diff-regridded-1D.zarr",
    engine="zarr",
    chunks={},
    storage_options=storage_options,
).pipe(lambda ds: ds.merge(ds[["latitude", "longitude"]].compute()))
# Keep only the attributes required to decode the HEALPix grid
differences["cell_ids"].attrs["grid_name"] = "healpix"
attrs_to_keep = ["level", "grid_name"]
differences["cell_ids"].attrs = {
    key: value
    for key, value in differences["cell_ids"].attrs.items()
    if key in attrs_to_keep
}

differences = differences.pipe(xdggs.decode)
differences

In [None]:
%%time
# Compute initial and final position
grid = differences[["latitude", "longitude"]].compute()
# print(grid["cell_ids"])
initial_position = tag["tagging_events"].ds.sel(event_name="release")
#cov = create_covariances(1e-6, coord_names=["latitude", "longitude"])
# sigma 1e-3**2 in the calculation
initial_probability = normal_at(grid, pos=initial_position, sigma=1e-3)

final_position = tag["tagging_events"].ds.sel(event_name="fish_death")
if final_position[["longitude", "latitude"]].to_dataarray().isnull().all():
    final_probability = None
else:
    final_probability = normal_at(grid, pos=final_position, sigma=recapture_std)

In [None]:
grid
print(grid)

In [None]:
%%time
# compute emission probability matrix

emission_pdf = (
    normal(differences["diff"], mean=0, std=differences_std, dims=["cells"])
    .to_dataset(name="pdf")
    .assign(
        valfilter(
            lambda x: x is not None,
            {
                "initial": initial_probability,
                "final": final_probability,
                "mask": differences["ocean_mask"],
            },
        )
    )
    .assign_attrs(differences.attrs)  # | {"max_sigma": max_sigma})
)

emission_pdf = emission_pdf.chunk(default_chunk_cells).persist()
emission_pdf

In [None]:
# Verify the data
emission_pdf["pdf"].count(["cells"]).plot()

In [None]:
# This cell saves the emission data to Zarr format, then cleans up unnecessary variables to free up memory.

emission_pdf.to_zarr(
    f"{target_root}/emission_1D.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)


del differences, grid, initial_probability, final_probability, emission_pdf

## 5. **Compute Additional Emission Probability Matrix:** Calculate an additional emission probability matrix, particularly focusing on teledetection from acoustic signals.

In this step, we compute additional emission probabilities based on acoustic detections for the selected tag. These additional probabilities enhance the emission probability matrix constructed in step 4 by incorporating information from acoustic telemetry.
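The geometric core of an acoustic detection is a distance test: cells within `receiver_buffer` of the receiver are possible positions, all others are not. The sketch below shows this with a haversine distance in plain Python. The receiver position, the cell centres and the helper names are hypothetical, and the uniform weighting inside the buffer is an assumption, not necessarily what `acoustic.emission_probability` computes.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def detection_mask(cells, receiver, buffer_m=1000.0):
    """Flag the cells whose centre lies within buffer_m of the receiver."""
    return [haversine_m(lat, lon, *receiver) <= buffer_m for lat, lon in cells]


# Hypothetical receiver position and cell centres (latitude, longitude).
receiver = (48.30, -4.60)
cells = [(48.30, -4.60), (48.305, -4.60), (48.32, -4.60)]
mask = detection_mask(cells, receiver)
weights = [1 / sum(mask) if inside else 0.0 for inside in mask]
print(mask, weights)  # [True, True, False] [0.5, 0.5, 0.0]
```

The third cell lies about 2.2 km from the receiver and is excluded by the 1000 m buffer.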

In [None]:
%%time
# Import necessary libraries and open data and perform initial setup
from pangeo_fish import acoustic, utils
import hvplot.xarray

emission = xr.open_dataset(
    f"{target_root}/emission_1D.zarr",
    engine="zarr",
    chunks={},  # "x": -1, "y": -1},
    storage_options=storage_options,
)
emission

In [None]:
%%time
# Construct the emission probabilities based on acoustic detections
#emission.cell_ids.lat=0
emission.cell_ids.attrs['lat'] = 0
emission.cell_ids.attrs['lon'] = 0

acoustic_pdf = acoustic.emission_probability(
    tag,
    emission[["time", "cell_ids", "mask"]].compute(),
    receiver_buffer,
    nondetections="mask",
    chunk_time=chunk_time,
    cell_ids="keep",
    dims=["cells"]
)

acoustic_pdf = acoustic_pdf.persist()
print(acoustic_pdf)

In [None]:
# Verify the data and visualize the acoustic detections
tag["acoustic"]["deployment_id"].hvplot.scatter(c="red", marker="x") * (
    acoustic_pdf["acoustic"] != 0
).sum(dim=("cells")).hvplot()

In [None]:
acoustic_pdf["acoustic"].count(dim=("cells")).hvplot()

In [None]:
# Merge the emission probability matrix with the acoustic probabilities

combined = emission.merge(acoustic_pdf)
combined

In [None]:
combined.initial

In [None]:
# This cell saves the emission data to Zarr format, then cleans up unnecessary variables to free up memory.

combined.to_zarr(
    f"{target_root}/emission_1D_acoustic.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)
# cleanup

del emission, acoustic_pdf, combined

## 6. **Combine and Normalize Emission Matrix:** Merge the emission matrix and normalize it for further processing.

In this step, we combine the emission probability matrices constructed in steps 4 and 5, then normalize the result so that the probabilities sum to one. This prepares the combined emission matrix for further analysis and interpretation.
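Combining and normalizing amounts to multiplying the individual probability fields cell by cell and rescaling each time step. The NumPy sketch below uses invented numbers with shape (time, cells). Treating the temperature-based and acoustic fields as independent factors is an assumption made for illustration, not a statement about every detail of `combine_emission_pdf`.

```python
import numpy as np

# Hypothetical probabilities with shape (time, cells).
temperature_pdf = np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]])
acoustic_pdf = np.array([[1.0, 1.0, 1.0], [0.0, 0.5, 0.5]])  # detection at step 1

# Independent sources of evidence are multiplied cell by cell ...
combined = temperature_pdf * acoustic_pdf
# ... and each time step is rescaled so that its probabilities sum to one.
combined = combined / combined.sum(axis=1, keepdims=True)
print(combined.sum(axis=1))
```

At the second time step the detection rules out the first cell entirely, and the remaining probability is shared between the other two cells.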


In [None]:
# Import necessary libraries
from pangeo_fish.pdf import combine_emission_pdf
import hvplot.xarray

In [None]:
# Open and combine the emission probability matrix

combined = (
    xr.open_dataset(
        f"{target_root}/emission_1D_acoustic.zarr",
        engine="zarr",
        chunks=default_chunk_cells,
        inline_array=True,
        storage_options=storage_options,
    )
    .pipe(combine_emission_pdf)
    .chunk(default_chunk_cells)
    .persist()  # convert to comment if the emission matrix does *not* fit in memory
)
combined
print(combined.pdf.values)

In [None]:
# Verify the data and visualize the sum of probabilities
combined["pdf"].sum(["cells"]).hvplot(width=400)

In [None]:
# Save the combined and normalized emission matrix
combined.to_zarr(
    f"{target_root}/combined_1D.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)
del combined

## 7. **Estimate Model Parameters:** Determine the parameters of the model based on the normalized emission matrix.

This step first estimates the maximum allowed value of the model parameter 'sigma', max_sigma. We then
create an optimizer with an expected parameter range and fit the model to the normalized emission matrix.  
The resulting optimized parameters are saved to a JSON file.  

In [17]:
# Import necessary libraries and modules for data analysis.
import xarray as xr
import pandas as pd
from pangeo_fish.hmm.estimator import EagerEstimator
from pangeo_fish.hmm.optimize import EagerBoundsSearch
from pangeo_fish.utils import temporal_resolution
from pangeo_fish.hmm.prediction import Gaussian1DHealpix
from tlz.functoolz import curry

# Open the data
emission = xr.open_dataset(
    f"{target_root}/combined_1D.zarr",
    engine="zarr",
    chunks={},
    inline_array=True,
    storage_options=storage_options,
)
emission

[Output: xarray Dataset representation. It lists dask-backed arrays of shape (111617,) (int64 and float64, 872.01 kiB, 1 chunk each) and of shape (258, 111617) (float64, 219.71 MiB, 11 chunks of shape (24, 111617)).]


In [18]:
# Compute maximum displacement for each reference model time step
# and estimate maximum sigma value for limiting the optimisation step

earth_radius_ = xr.DataArray(earth_radius, dims=None)

timedelta = temporal_resolution(emission["time"]).pint.quantify().pint.to("h")
grid_resolution = earth_radius_ * emission["resolution"].pint.quantify()

maximum_speed_ = xr.DataArray(maximum_speed, dims=None).pint.to("km / h")
max_grid_displacement = maximum_speed_ * timedelta * adjustment_factor / earth_radius_
max_sigma = max_grid_displacement.pint.to("dimensionless").pint.magnitude / truncate
emission.attrs["max_sigma"] = max_sigma.item()
max_sigma



np.float64(0.0004905038455501491)
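The printed value of max_sigma can be reproduced by hand from the parameter cell. The sketch below is plain Python without units handling and assumes an hourly model time step (the notebook derives the actual step from `emission["time"]`); under that assumption the result agrees with the output above.

```python
earth_radius_km = 6371.0
maximum_speed_km_per_h = 60.0 / 24.0  # 60 km / day
timedelta_h = 1.0  # assumed hourly time step
adjustment_factor = 5
truncate = 4

# Largest angular displacement (radians) allowed during one time step.
max_grid_displacement = (
    maximum_speed_km_per_h * timedelta_h * adjustment_factor / earth_radius_km
)
max_sigma = max_grid_displacement / truncate
print(max_sigma)
```

In words: the fish may move at most 12.5 km per step once the adjustment factor is applied, which is about 0.00196 rad on the sphere, and dividing by `truncate` makes the truncated Gaussian kernel reach exactly that far.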

In [19]:
## Create and configure estimator and optimizer
import xdggs
emission = (
    emission.compute()
)  # Convert to comment if the emission matrix does *not* fit in memory

# Keep only the attributes required to decode the HEALPix grid
emission["cell_ids"].attrs["grid_name"] = "healpix"
attrs_to_keep = ["level", "grid_name"]
# should also have nest
emission["cell_ids"].attrs = {
    key: value
    for key, value in emission["cell_ids"].attrs.items()
    if key in attrs_to_keep
}

print(emission["cell_ids"].attrs)
emission = emission.pipe(xdggs.decode)

predictor_factory = curry(
    Gaussian1DHealpix,
    cell_ids=emission["cell_ids"].data,
    grid_info=emission.dggs.grid_info,
    truncate=4.0,
    weights_threshold=1e-8,
    pad_kwargs={"mode": "constant", "constant_value": 0},
    optimize_convolution=True,
)

estimator = EagerEstimator(
    sigma=None, predictor_factory=predictor_factory
)
optimizer = EagerBoundsSearch(
    estimator,
    (1e-4, emission.attrs["max_sigma"]),
    optimizer_kwargs={"disp": 3, "xtol": tolerance},
)

{'grid_name': 'healpix', 'level': 12}


In [None]:
%%time
# Fit the model parameter to the data
optimized = optimizer.fit(emission)
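Fitting a single parameter inside fixed bounds is a one-dimensional bounded minimisation. The sketch below shows the principle with a hand-written golden-section search and a toy cost function. It is not the optimizer used by Pangeo-Fish: the `disp` and `xtol` keywords passed above suggest a SciPy bounded scalar minimiser, but that is an assumption, and both `golden_section_minimize` and `toy_cost` are hypothetical.

```python
import math


def golden_section_minimize(func, lower, upper, xtol=1e-6):
    """Minimise a unimodal function on [lower, upper] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > xtol:
        if func(c) < func(d):
            # The minimum lies in [a, d]: shrink from the right.
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            # The minimum lies in [c, b]: shrink from the left.
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2


# Hypothetical cost with its minimum at sigma = 3e-4, inside the search bounds.
def toy_cost(sigma):
    return (sigma - 3e-4) ** 2


best = golden_section_minimize(toy_cost, 1e-4, 4.9e-4, xtol=1e-9)
print(f"{best:.6f}")  # 0.000300
```

In the notebook, the cost is derived from how well the model with a given sigma explains the emission matrix, and the upper bound is the max_sigma computed earlier.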

In [None]:
# Save the optimized parameters
params = optimized.to_dict()
pd.DataFrame.from_dict(params, orient="index").to_json(
    f"{target_root}/parameters.json", storage_options=storage_options
)

In [None]:
# Cleanup
del optimized, emission

## 8. **Compute State Probabilities and Tracks:** Calculate the probability distribution of the species in question and compute the tracks.

This step predicts the state probabilities using the optimised parameter sigma computed in the previous step together with the normalized emission matrix.  
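How sigma and the emission matrix interact can be seen in a toy forward filter on a regular 1D grid. Each step diffuses the previous state with a Gaussian kernel (prediction) and then multiplies by the emission probabilities (update). This is a sketch under assumptions: the notebook's predictor convolves on the HEALPix sphere, the estimator may also run a backward pass, and `gaussian_kernel` and `forward_filter` are hypothetical helpers.

```python
import numpy as np


def gaussian_kernel(sigma, truncate=4.0):
    """Discrete Gaussian kernel truncated at truncate * sigma cells."""
    radius = int(truncate * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()


def forward_filter(emission, initial, sigma):
    """Alternate prediction (diffusion) and update (emission) steps."""
    kernel = gaussian_kernel(sigma)
    states = np.empty_like(emission)
    states[0] = initial * emission[0]
    states[0] /= states[0].sum()
    for t in range(1, emission.shape[0]):
        predicted = np.convolve(states[t - 1], kernel, mode="same")
        updated = predicted * emission[t]
        states[t] = updated / updated.sum()
    return states


# Hypothetical toy problem: 3 time steps, 11 cells, no evidence at first.
emission = np.ones((3, 11))
emission[2] = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]  # evidence at the last step
initial = np.zeros(11)
initial[5] = 1.0  # release position
states = forward_filter(emission, initial, sigma=1.0)
print(states.round(3))
```

At the last step only cells 6 and 7 are compatible with the evidence, and cell 6 is more probable because it is closer to the release position.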

In [21]:
# Import necessary libraries and modules for data analysis.
import xarray as xr
import pandas as pd
import hvplot.xarray
from pangeo_fish.hmm.estimator import EagerEstimator
from pangeo_fish.io import save_trajectories

# Recreate the estimator from the saved parameters and the predictor factory of step 7
params = pd.read_json(
    f"{target_root}/parameters.json", storage_options=storage_options
).to_dict()[0] | {"predictor_factory": predictor_factory}
params.pop("predictor")
optimized = EagerEstimator(**params)
optimized.predictor_factory

<class 'pangeo_fish.hmm.prediction.Gaussian1DHealpix'>

In [None]:
%%time
# Load the Data
emission = xr.open_dataset(
    f"{target_root}/combined_1D.zarr",
    engine="zarr",
    chunks=default_chunk_cells,
    inline_array=True,
    storage_options=storage_options,
).compute()

# Predict the State Probabilities

states = optimized.predict_proba(emission)
states = states.to_dataset().chunk(default_chunk_cells).persist()
states

In [None]:
# Verify the data and visualize the sum of probabilities
plot = states.sum(["cells"]).hvplot() + states.count(["cells"]).hvplot()
hvplot.save(plot, f"{target_root}/states_count_1D.html")

In [None]:
%%time
# Save the probability distribution (state matrix).
states.chunk(default_chunk_cells).to_zarr(
    f"{target_root}/states_cells.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)
states

In [None]:
%%time
# decode tracks

trajectories = optimized.decode(
    emission,
    states.fillna(0),
    mode=track_modes,
    progress=False,
    additional_quantities=additional_track_quantities,
)
trajectories
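The "mean" and "mode" track modes summarise one time step of the state probabilities in different ways. The sketch below shows the usual meaning of the two terms in plain Python; the exact definitions used by `decode` are not shown in this notebook, so the helpers and numbers here are hypothetical illustrations.

```python
def mean_track_point(probabilities, latitudes, longitudes):
    """Probability-weighted average position."""
    total = sum(probabilities)
    lat = sum(p * y for p, y in zip(probabilities, latitudes)) / total
    lon = sum(p * x for p, x in zip(probabilities, longitudes)) / total
    return lat, lon


def mode_track_point(probabilities, latitudes, longitudes):
    """Position of the most probable cell."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return latitudes[best], longitudes[best]


# Hypothetical state probabilities for one time step over three cells.
probabilities = [0.25, 0.5, 0.25]
latitudes = [48.0, 48.5, 50.0]
longitudes = [-5.0, -4.5, -3.0]
print(mean_track_point(probabilities, latitudes, longitudes))  # (48.75, -4.25)
print(mode_track_point(probabilities, latitudes, longitudes))  # (48.5, -4.5)
```

The mean can fall between cells (or even on land) when the distribution is spread out, whereas the mode always coincides with a grid cell.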

In [None]:
# Save trajectories.
# Here we can choose the 'parquet' format for loading files from R,
# or the 'geoparquet' format for further analysis of tracks using
# geopandas.

save_trajectories(trajectories, target_root, storage_options, format="parquet")

In [None]:
# Cleanup
del optimized, emission, states, trajectories

## 9. **Visualization:** Visualize the results of the analysis for interpretation and insight.


In this step, we visualize various aspects of the analysis results to gain insights and interpret the model outcomes. We plot the emission matrix, which represents the likelihood of observing a specific temperature difference given the model parameters and configurations. Additionally, we visualize the state probabilities, showing the likelihood of the system being in different states at each time step. We also plot each of the tracks of the tagged fish, displaying their movement patterns over time. Finally, we create a movie that combines the emission matrix and state probabilities to provide a comprehensive visualization of the analysis results.


In [None]:
# Import necessary libraries
import holoviews as hv
import hvplot.xarray
import cmocean
import xmovie
from pangeo_fish import visualization
from pangeo_fish.io import read_trajectories, save_html_hvplot

In [None]:
%%time
# load trajectories
trajectories = read_trajectories(
    track_modes, target_root, storage_options, format="parquet"
)
print(trajectories)
# Plot the trajectories.

plots = [
    traj.hvplot(
        c="speed",
        tiles="CartoLight",
        title=traj.id,
        cmap="cmo.speed",
        # xlim=bbox["longitude"], ylim=bbox["latitude"],
        width=300,
        height=300,
    )
    for traj in trajectories.trajectories
]
plot = hv.Layout(plots).cols(2)

filepath = f"{target_root}/trajectories.html"
save_html_hvplot(plot, filepath, storage_options)

plot

In [None]:
%%time
# load files for plotting

emission = (
    xr.open_dataset(
        f"{target_root}/combined_1D.zarr",  # combined emission matrix saved in step 6
        engine="zarr",
        chunks={},
        inline_array=True,
        storage_options=storage_options,
    )
    .rename_vars({"pdf": "emission"})
    .drop_vars(["final", "initial"])
)  # .where(emission["mask"])
states = xr.open_dataset(
    f"{target_root}/states_cells.zarr",
    engine="zarr",
    chunks={},
    inline_array=True,
    storage_options=storage_options,
).where(emission["mask"])
data = xr.merge([states, emission.drop_vars(["mask"])])

# Visualize the states and emission matrix, and save the visualisation in an HTML file.
#
plot1 = visualization.plot_map(data["states"], bbox)
plot2 = visualization.plot_map(data["emission"], bbox)
plot = hv.Layout([plot1, plot2]).cols(2)
filepath = f"{target_root}/states_emission.html"
save_html_hvplot(plot, filepath, storage_options)

plot

In [None]:
%%time
## Create Movies
#
mov = xmovie.Movie(
    (
        data.isel(time=slice(0, data.time.size - 1, time_step))
        .chunk({"time": 1, "x": -1, "y": -1})
        .pipe(lambda ds: ds.merge(ds[["longitude", "latitude"]].compute()))
    ).pipe(visualization.filter_by_states),
    plotfunc=visualization.create_frame,
    input_check=False,
    pixelwidth=15 * 400,
    pixelheight=12 * 400,
    dpi=400,
)
## Workaround due to https://github.com/jbusecke/xmovie/issues/162:
# save to the local file system first, then upload.

if target_root.startswith("s3://"):
    !mkdir -p movie
    mov.save(
        f"./movie/states.mp4",
        overwrite_existing=True,
    )
    import s3fs

    s3 = s3fs.S3FileSystem(**storage_options)
    s3.put_file(f"./movie/states.mp4", f"{target_root}/states.mp4")
else:
    mov.save(
        f"{target_root}/states.mp4",
        overwrite_existing=True,
    )
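The effect of `time_step` on the movie length can be checked without loading any data. The sketch below applies the same slice as the movie cell to a plain list of indices, using the 258 time steps reported for the emission matrix in step 7.

```python
n_time_steps = 258  # length of the time axis reported in step 7
time_step = 3

# Same selection as data.isel(time=slice(0, data.time.size - 1, time_step)).
frame_indices = list(range(n_time_steps))[slice(0, n_time_steps - 1, time_step)]
print(len(frame_indices), frame_indices[:3], frame_indices[-1])  # 86 [0, 3, 6] 255
```

With `time_step = 3` the movie therefore contains 86 frames; a larger stride shortens the rendering time at the cost of temporal detail.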