# Multi-sigma minimization


**Overview.**

This Jupyter notebook demonstrates a more complex use of `pangeo-fish` than the one presented in the previous notebook.

_We thus recommend checking the latter first!_

In this guide, we follow the same workflow but consider a longer tag and fit two geolocation models:
1. One with one parameter (as in the previous notebook).
2. One with two parameters.

This lets us compare the results on two aspects:
1. Quantitatively, the values of the parameters themselves.
2. Qualitatively, the estimated trajectories of the fish.

We will use the biologging tag "A18831", which was also attached to a pollack fish, and the same reference data.

**Note that, because of the size of the tag, the computations may not be tractable on a typical laptop.**

**Workflow.**

The key differences from the previous notebook lie in the model-fitting steps (Sections 4 and 5 of this notebook), in which we fit two geolocation models.
Nothing changes for the first model, since it uses a single parameter. For the second one, we additionally:
1. Analyze the acoustic detections to identify **two** time intervals for **two** `sigma` parameters.
2. Stamp the observation data `combined` with indices that follow the two time intervals identified before.
3. Fit the _bi-sigma_ model.

## 1. Initialization and Biologging Data Preparation

In [None]:
!pip install rich zstandard
# !pip install "xarray-healpy @ git+https://github.com/iaocea/xarray-healpy.git@0ffca6058f4008f4f22f076e2d60787fcf32ac82"
!pip install xhealpixify
# !pip install -e ../.
!pip install movingpandas more_itertools
!pip install --upgrade "xarray<=2025.4.0"
!pip install xdggs
!pip install healpix-convolution
!pip install --upgrade "cf-xarray>=0.10.4"

In [None]:
from pint_xarray import unit_registry as ureg
import hvplot.xarray
import xarray as xr
import sys

sys.path.append("../")
import pangeo_fish

In [None]:
tag_name = "A18831"

tag_root = "https://data-taos.ifremer.fr/data_tmp/cleaned/tag/"

ref_url = "https://data-taos.ifremer.fr/kerchunk/ref-copernicus.yaml"

## example for remote storage
scratch_root = "s3://destine-gfts-data-lake/demo"
storage_options = {
    "anon": False,
    "profile": "gfts",
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
}
## example for using your local file system instead
scratch_root = "."
storage_options = None

chunk_time = 24
dims = ["cells"]
bbox = {"latitude": [46, 51], "longitude": [-8, -1]}


relative_depth_threshold = 0.8
rot = {"lat": 0, "lon": 0}
nside = 4096
min_vertices = 1

differences_std = 0.75
initial_std = 1e-6 if dims == ["x", "y"] else 1e-5
recapture_std = 1e-2

earth_radius = ureg.Quantity(6371, "km")
maximum_speed = ureg.Quantity(60, "km / day")
adjustment_factor = 5
truncate = 4

# receiver_buffer sets the maximum allowed detection distance for acoustic receivers.
receiver_buffer = ureg.Quantity(1000, "m")

# tolerance value for the minimization
tolerance = 1e-3 if dims == ["x", "y"] else 1e-6

track_modes = ["mean", "mode"]
additional_track_quantities = ["speed", "distance"]
time_step = 3

In [None]:
# Define target root directories for storing analysis results.
target_root = f"{scratch_root}/{tag_name}"

# Defines default chunk size for optimization.
default_chunk = {"time": chunk_time, "lat": -1, "lon": -1}
default_chunk_dims = {"time": chunk_time}
default_chunk_dims.update({d: -1 for d in dims})

In [None]:
# Set up a local cluster for distributed computing.
from distributed import LocalCluster

cluster = LocalCluster()
client = cluster.get_client()
client

In [None]:
from pangeo_fish.helpers import load_tag

tag, tag_log, time_slice = load_tag(
    tag_root=tag_root, tag_name=tag_name, storage_options=storage_options
)
tag

You can plot the time series of the DST with the function `plot_tag()`:

In [None]:
from pangeo_fish.helpers import plot_tag

plot = plot_tag(
    tag=tag,
    tag_log=tag_log,
    # you can directly save the plot if you want
    save_html=True,
    storage_options=storage_options,
    target_root=target_root,
)
plot

## 2. Reference Data Preparation

In [None]:
from pangeo_fish.helpers import load_model, compute_diff

reference_model = load_model(
    uri=ref_url,
    tag_log=tag_log,
    time_slice=time_slice,
    bbox=(bbox | {"max_depth": tag_log["pressure"].max()}),
    chunk_time=chunk_time,
)
diff = compute_diff(
    reference_model=reference_model,
    tag_log=tag_log,
    relative_depth_threshold=relative_depth_threshold,
    chunk_time=chunk_time,
)[0]

_We can detect abnormal data by looking at the number of non-null values for each time step._

In [None]:
diff = diff.compute()

In [None]:
diff["diff"].count(["lat", "lon"]).plot()
diff
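
The same check can be done programmatically instead of by eye. The following is a minimal sketch, assuming the per-time-step counts have been extracted into a plain Python list (for instance from the `count` call above). The function name `flag_sparse_steps` and the 50% threshold are hypothetical choices for illustration and are not part of `pangeo-fish`.

```python
from statistics import median


def flag_sparse_steps(counts, min_fraction=0.5):
    """Return the indices of time steps whose number of valid cells is
    below `min_fraction` times the median count (an assumed rule of thumb)."""
    if not counts:
        return []
    threshold = min_fraction * median(counts)
    return [i for i, c in enumerate(counts) if c < threshold]


# Illustrative counts only, not values taken from the tag.
example_counts = [1200, 1180, 1210, 300, 1195, 0, 1205]
print(flag_sparse_steps(example_counts))
```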

_You can save the dataset if you want to resume the notebook later:_

In [None]:
diff.to_zarr(f"{target_root}/diff.zarr", mode="w", storage_options=storage_options)

In [None]:
from pangeo_fish.helpers import regrid_dataset

reshaped = regrid_dataset(
    ds=diff, nside=nside, min_vertices=min_vertices, rot=rot, dims=dims
)[0]
reshaped

In [None]:
# Saves the result if needed
reshaped.chunk(default_chunk_dims).to_zarr(
    f"{target_root}/diff-regridded.zarr",
    mode="w",
    consolidated=True,
    compute=True,
    storage_options=storage_options,
)

## 3. Computation of the emission probability distribution

In [None]:
from pangeo_fish.helpers import compute_emission_pdf

In [None]:
# Open the previous dataset (only necessary if you resume the notebook from here)
differences = xr.open_dataset(
    f"{target_root}/diff-regridded.zarr",
    engine="zarr",
    chunks={},
    storage_options=storage_options,
)
# Or uncomment the instruction below to keep using the previous variable
# differences = reshaped

differences = differences.pipe(
    lambda ds: ds.merge(ds[["latitude", "longitude"]].compute())
)

emission_pdf = compute_emission_pdf(
    diff_ds=differences,
    events_ds=tag["tagging_events"].ds,
    differences_std=differences_std,
    initial_std=initial_std,
    recapture_std=recapture_std,
    dims=dims,
    chunk_time=chunk_time,
)[0]
emission_pdf

_Save the intermediate result if needed:_

In [None]:
emission_pdf.to_zarr(
    f"{target_root}/emission.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)

Next, we compute a second _pdf_ based on the acoustic detections and combine the two distributions.

In [None]:
from pangeo_fish.helpers import compute_acoustic_pdf

# Uncomment the following lines if you resume the notebook
# emission_pdf = xr.open_dataset(
#     f"{target_root}/emission.zarr",
#     engine="zarr",
#     chunks={},
#     storage_options=storage_options,
# )
acoustic_pdf = compute_acoustic_pdf(
    emission_ds=emission_pdf,
    tag=tag,
    receiver_buffer=receiver_buffer,
    chunk_time=chunk_time,
    dims=dims,
)[0].compute()
acoustic_pdf

In [None]:
from pangeo_fish.helpers import combine_pdfs

combined = combine_pdfs(
    emission_ds=emission_pdf,
    acoustic_ds=acoustic_pdf,
    chunks=default_chunk_dims,
    dims=dims,
)[0]
combined.to_zarr(
    f"{target_root}/combined.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)

**Let's perform a last check before fitting the models.**

In [None]:
combined["pdf"].sum(dims).plot(ylim=(0, 2))

_Each sum should equal `1`._
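
If you prefer a numerical check over reading the plot, the idea can be sketched as follows. This is a minimal illustration that uses nested lists as a stand-in for the per-time-step probabilities. The helper `unnormalized_steps` and its tolerance are assumptions made for this example, not a `pangeo-fish` function.

```python
import math


def unnormalized_steps(pdf_rows, rel_tol=1e-6):
    """Return the indices of time steps whose probabilities do not sum to 1."""
    return [
        i
        for i, row in enumerate(pdf_rows)
        if not math.isclose(math.fsum(row), 1.0, rel_tol=rel_tol)
    ]


# Illustrative data: three time steps over four cells.
example_pdf = [
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.2, 0.1, 0.1],  # sums to 0.9, so this step should be flagged
]
print(unnormalized_steps(example_pdf))
```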

## 4. One-sigma Model Estimation

We assume that you have already followed the previous tutorial, so we don't comment on the first minimization.

In [None]:
from pangeo_fish.helpers import optimize_pdf

# Open the distributions
emission = xr.open_dataset(
    f"{target_root}/combined.zarr",
    engine="zarr",
    chunks=default_chunk_dims,
    inline_array=True,
    storage_options=storage_options,
)
# Define the parameter's bounds and search for the best value
params, updated_emission = optimize_pdf(
    ds=emission,
    earth_radius=earth_radius,
    adjustment_factor=adjustment_factor,
    truncate=truncate,
    maximum_speed=maximum_speed,
    tolerance=tolerance,
    dims=dims,
    # we save the result of the first model under a subfolder "one_sigma"
    save_parameters=True,
    storage_options=storage_options,
    target_root=f"{target_root}/one_sigma",
)
params

In [None]:
# Optionally, we can save the emission dataset with the new attributes in the subfolder (but this is mostly redundant data)
updated_emission.to_zarr(
    f"{target_root}/one_sigma/combined.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)

## 5. Bi-sigma Model Estimation

As introduced at the beginning of this guide, we aim to compare the previously fitted 1-`sigma` model with a more complex, 2-`sigma` one.

So let's finally define the optimization task, which now has **two `sigma` values**.

### 5.1 Finding the time intervals for each `sigma`

Instead of choosing the times arbitrarily, we use a convenience function that relies on the acoustic detections.

In [None]:
from pangeo_fish.helpers import compute_detection_time_intervals

times = compute_detection_time_intervals(tag=tag)
times

**Explanations**

_The tag includes a few detections that occur around the same time._

_The function above omits them and returns the first detection time over this period._

_Since `times` contains a single time, the Brownian motion of the model has two parameters: the first `sigma` is used until `times[0]`, and the second one is used for the rest of the time._
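
To make the idea of "keeping the first detection of a group" concrete, here is a minimal sketch. It does not reproduce the implementation of `compute_detection_time_intervals()`. The function name `first_detection_per_burst`, the one-hour gap, and the timestamps are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def first_detection_per_burst(detections, max_gap=timedelta(hours=1)):
    """Keep the first timestamp of each burst of detections.

    Two consecutive detections belong to the same burst when they are
    separated by at most `max_gap` (an assumed threshold).
    """
    firsts = []
    previous = None
    for t in sorted(detections):
        if previous is None or t - previous > max_gap:
            firsts.append(t)
        previous = t
    return firsts


# Illustrative timestamps only: three detections within 20 minutes.
t0 = datetime(2000, 1, 1, 10, 0)
detections = [t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=20)]
burst_starts = first_detection_per_burst(detections)
print(burst_starts)  # a single time, hence two sigma intervals
```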

### 5.2 Stamping the observations with the parameter indices

In order for the function `optimize_pdf()` to know where to use and minimize each parameter, we need to add a variable called `predictor_index` to our observations.

Luckily, the `pangeo_fish.helpers` module has a function that, given a list of times, automatically adds these indices to the emission data:

In [None]:
from pangeo_fish.helpers import stamp_parameter_indices

emission = stamp_parameter_indices(pdf=emission, times=times)
emission

_Above, you can see that, since we have two parameters (`times` contains a single time), `predictor_index` is "0" for all the times before `times[0]` and "1" for the remaining observations._
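
The mapping from timestamps to parameter indices can be sketched in plain Python as below. This is an illustration of the rule described above, not the code of `stamp_parameter_indices()`. The name `parameter_indices` is hypothetical, and assigning an observation that falls exactly on `times[0]` to index 1 is an assumption derived from the wording "before `times[0]`".

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def parameter_indices(timestamps, change_times):
    """Map each timestamp to the index of the sigma that applies to it.

    Timestamps strictly before change_times[0] get index 0, those from
    change_times[0] up to (but excluding) change_times[1] get 1, and so on.
    """
    change_times = sorted(change_times)
    return [bisect_right(change_times, t) for t in timestamps]


# Illustrative data: six daily observations and one change time on day 3.
start = datetime(2000, 1, 1)
timestamps = [start + timedelta(days=d) for d in range(6)]
change_times = [start + timedelta(days=3)]
indices = parameter_indices(timestamps, change_times)
print(indices)
```

With `n` change times, the same rule yields `n + 1` distinct indices, one per `sigma`.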

When it comes to performing the parameter optimization itself, nothing changes compared to what we have already done: the function `optimize_pdf()` will automatically detect the parameter indices and optimize a multi-sigma model accordingly.

In [None]:
params2, updated_emission2 = optimize_pdf(
    ds=emission,
    earth_radius=earth_radius,
    adjustment_factor=adjustment_factor,
    truncate=truncate,
    maximum_speed=maximum_speed,
    tolerance=tolerance,
    dims=dims,
    # let's use another subfolder called "two_sigma"
    save_parameters=True,
    storage_options=storage_options,
    target_root=f"{target_root}/two_sigma",
)
params2

_As before, you can optionally save the emission dataset whose attributes have been updated with the results of the optimization..._

In [None]:
updated_emission2.to_zarr(
    f"{target_root}/two_sigma/combined.zarr",
    mode="w",
    consolidated=True,
    storage_options=storage_options,
)

## 6. State and Trajectory Estimation

Now, for each model, we generate the fish's location probabilities `states` as well as the `mean` and `mode` trajectories.

In [None]:
from pangeo_fish.helpers import predict_positions

for subfolder in ["one_sigma", "two_sigma"]:
    states, trajectories = predict_positions(
        target_root=f"{target_root}/{subfolder}",
        storage_options=storage_options,
        chunks=default_chunk_dims,
        track_modes=track_modes,
        additional_track_quantities=additional_track_quantities,
        save=True,
    )

**When predicting the fish's locations (dataset `states`) with more than one `sigma` parameter, the values of the parameters at each time are added to `states`:**

In [None]:
states

## 7. Result Comparison

Let's briefly illustrate how we can start comparing the estimation of the two models.

Feel free to change and adapt the code!

### 7.1 Quantitative Comparison

A first comparison can simply consist of checking the `sigma` values.

In case you haven't checked them already, you can browse the `.json` files with the following function:

In [None]:
import json

import fsspec


def open_json_file(filepath: str, storage_options: dict | None = None) -> dict:
    """Load a JSON file with fsspec; return an empty dict if it cannot be read."""
    # storage options are only relevant for remote (S3) paths
    options = (storage_options or {}) if filepath.startswith("s3://") else {}
    try:
        with fsspec.open(filepath, "r", **options) as file:
            return json.load(file)
    except Exception as e:
        print("The following error occurred upon opening the json file:", e)
        return {}

In [None]:
print(
    "Value(s) of the parameter(s) of the first model:",
    open_json_file(f"{target_root}/one_sigma/parameters.json", storage_options)[
        "sigmas"
    ],
)

In [None]:
print(
    "Value(s) of the parameter(s) of the second (bi-sigma) model:",
    open_json_file(f"{target_root}/two_sigma/parameters.json", storage_options)[
        "sigmas"
    ],
)

### 7.2 Qualitative Comparison

A less precise but more intuitive comparison of the results is to look at the trajectories estimated by each model:

In [None]:
from pangeo_fish.helpers import plot_trajectories

traj_plots = [
    plot_trajectories(
        target_root=f"{target_root}/{subfolder}",
        track_modes=track_modes,
        storage_options=storage_options,
        save_html=True,
    ).options(title=f"Model's Folder: {subfolder}")
    for subfolder in ["one_sigma", "two_sigma"]
]

(traj_plots[0] + traj_plots[1]).cols(2)
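
Beyond the side-by-side plots, one simple way to quantify how far apart two estimated tracks are is to compute the great-circle distance between them at each time step. The sketch below is a minimal, self-contained illustration using the haversine formula. The function names and the sample coordinates are hypothetical, and it assumes both tracks are sampled at the same times and given as lists of `(latitude, longitude)` pairs in degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same value as `earth_radius` in the configuration


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # clamp to 1.0 to guard against rounding errors before taking asin
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


def track_separation_km(track_a, track_b):
    """Distance at each time step between two tracks of (lat, lon) pairs."""
    if len(track_a) != len(track_b):
        raise ValueError("both tracks must have the same number of positions")
    return [haversine_km(*p, *q) for p, q in zip(track_a, track_b)]


# Illustrative positions only, not results of the models above.
track_one = [(48.0, -5.0), (48.1, -5.0), (48.2, -4.9)]
track_two = [(48.0, -5.0), (48.0, -5.0), (48.2, -5.0)]
separations = track_separation_km(track_one, track_two)
print([round(d, 1) for d in separations])
```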