# Temperature differences

In this notebook, we compute the difference between the temperature measured by the tag and the temperature from the reference model.

**Summary:**

1. Set up the dask cluster
2. Open the data: reference model (mars) and tag log
3. Align the data
4. Compute the differences
5. Save to disk

In [None]:
import cf_xarray
import dask
import fsspec
import intake
import numba
import numpy as np
import pandas as pd
import xarray as xr

from pangeo_fish.cf import bounds_to_bins
from pangeo_fish.diff import marc_diff_z
from pangeo_fish.model import marc_sigma_to_depth
from pangeo_fish.tags import adapt_model_time, reshape_by_bins, to_time_slice

Parametrize with [papermill](https://papermill.readthedocs.io/en/latest/):

In [None]:
#scheduler_address: str | None = None
catalog_parameters: dict = {}
relative_depth_threshold: float = 0.8
tag_name: str = "A18832"
working_path: str = "/home/datawork-taos-s/public/fish/"
# working_path: str = "/Users/todaka/python/git/pangeo-fish/data_local/fish-intel/"
ref_model_name: str = "marc-f1-2500"
cluster_size: int = 50
# Subset to a bounding box (lon_min, lat_min, lon_max, lat_max) around the
# Iroise Sea for quick computational tests.
# TODO: allow either `None` or a value here (for copernicus, mars, ...) so that
# the bounding box can be passed in as a parameter.
bbox = (
    -8,
    45,
    0,
    51,
)
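
The parameters in the cell above can be overridden at run time with papermill, provided the cell carries the `parameters` tag. A minimal sketch follows; the notebook file names and the `build_parameters` / `run_notebook` helpers are hypothetical, and only `papermill.execute_notebook` is the library's own function.

```python
def build_parameters(tag_name, ref_model_name="marc-f1-2500", bbox=(-8, 45, 0, 51)):
    """Collect the overrides to inject into the notebook's parameters cell."""
    return {
        "tag_name": tag_name,
        "ref_model_name": ref_model_name,
        # pass a list so the value serializes cleanly
        "bbox": list(bbox),
    }


def run_notebook(input_path, output_path, parameters):
    """Execute the notebook with the given parameters (requires papermill)."""
    import papermill as pm  # imported lazily so build_parameters works without it

    return pm.execute_notebook(input_path, output_path, parameters=parameters)


params = build_parameters("A18832")
# hypothetical file names, uncomment to run:
# run_notebook("diff.ipynb", "diff-A18832.ipynb", params)
```

Building the dictionary separately from the call keeps the set of overrides easy to inspect before a batch of tags is processed.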

In [None]:

# IPython shell capture: the output of `domainname` as a list of lines
domainname = !domainname

if domainname == ["nisdatarmor"]:
    # Datarmor
    tag_base_path = "/home/datawork-lops-iaocea/data/fish-intel/"
    catalog = "/home/datawork-taos-s/intranet/kerchunk/ref-marc.yaml"
    cluster_name = "datarmor"
else:
    # local PC
    tag_base_path: str = "/Users/todaka/python/git/pangeo-fish/data_local/fish-intel/"
    catalog = "https://data-taos.ifremer.fr/kerchunk/ref-marc.yaml"
    cluster_name = "local"

tag_url = tag_base_path + "tag/nc/" + tag_name + ".nc"
tag_db_path = tag_base_path + "acoustic/FishIntel_tagging_France.csv"
detections_path = tag_base_path + "acoustic/detections_recaptured_fishintel.csv"

output_path = working_path + tag_name + "/" + ref_model_name + "/diff.zarr"
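
Joining path fragments with `+` depends on every fragment carrying exactly the right slashes. As a sketch, `posixpath.join` from the standard library builds the same kind of path without that bookkeeping; the `join_path` helper name is made up for this example.

```python
import posixpath


def join_path(base, *parts):
    """Join path fragments with single forward slashes, tolerating stray ones."""
    cleaned = [part.strip("/") for part in parts]
    return posixpath.join(base.rstrip("/"), *cleaned)


tag_base_path = "/home/datawork-lops-iaocea/data/fish-intel/"
tag_url = join_path(tag_base_path, "tag/nc", "A18832.nc")
detections_path = join_path(
    tag_base_path, "/acoustic/", "detections_recaptured_fishintel.csv"
)
```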

In [None]:
import dask_hpcconfig
from distributed import Client

In [None]:
if domainname == ["nisdatarmor"]:
    overrides = {}
    # overrides = {"cluster.cores": 28, "cluster.processes": 6}
    cluster = dask_hpcconfig.cluster("datarmor", **overrides)
    # cluster = dask_hpcconfig.cluster("datarmor-local")
    cluster.scale(cluster_size)
else:
    cluster = dask_hpcconfig.cluster("local")

client = Client(cluster)
client

## Open the data: reference model (mars) and tag log

Open the tag log:

In [None]:
tag = xr.open_dataset(fsspec.open(tag_url).open(), engine="h5netcdf").load()
tag

Open the reference model:

TODO: for now, we read the data directly, but in the future we might want to use [xpublish](https://github.com/xpublish-community/xpublish) to hide the reading / preprocessing of the reference model (especially computing the depth / pressure and stitching together different models).

In [None]:
cat = intake.open_catalog(catalog)["marc"]
# note: this overrides the `catalog_parameters` value from the parameters cell
catalog_parameters: dict = {"region": "f1_e2500", "year": "2022"}

catalog_kwargs = {
    "chunks": {"ni": -1, "nj": -1, "level": -1, "time": 1},
    "inline_array": True,
}
ds = (
    cat(**catalog_kwargs, **catalog_parameters)
    .to_dask()[["H0", "level", "XE", "theta", "b", "hc", "TEMP"]]
    .assign_coords(time=lambda ds: ds.time.astype("datetime64[ns]"))
)
ds

## Data alignment

To compare the measured temperature with the model, we need to:

1. align the time ranges
2. calculate the modelled depth
3. group the measured data into bins

### Align time ranges

In [None]:
slice_ = to_time_slice(tag.times)

In [None]:
tag_log = tag.sel(time=slice_)
tag_log

In [None]:
model_subset = ds.sel(time=adapt_model_time(slice_))
model_subset
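
The idea behind the two slices can be sketched without pangeo-fish: the tag log is cut to the interval between two events, and the model is cut to a slightly wider interval so that the first and last measurements still fall inside a model time step. The one-hour padding, the event times and the helper names below are assumptions for illustration; `to_time_slice` and `adapt_model_time` may work differently.

```python
from datetime import datetime, timedelta


def pad_slice(slice_, padding=timedelta(hours=1)):
    """Widen a time slice by `padding` on both ends."""
    return slice(slice_.start - padding, slice_.stop + padding)


def select(times, slice_):
    """Keep the timestamps inside the slice (both ends inclusive, like label-based `.sel`)."""
    return [t for t in times if slice_.start <= t <= slice_.stop]


# made-up event times and an hourly model time axis
release = datetime(2022, 6, 1, 12, 30)
recapture = datetime(2022, 6, 1, 15, 10)
model_times = [datetime(2022, 6, 1, hour) for hour in range(10, 19)]

tag_slice = slice(release, recapture)
model_slice = pad_slice(tag_slice)
selected = select(model_times, model_slice)
```

Without the padding, the model time steps at 12:00 and 16:00 would be dropped even though measurements taken just after release or just before recapture belong to them.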

In [None]:

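def geo_subset(ds, bbox):
    """Keep only the pixels inside bbox = (lon_min, lat_min, lon_max, lat_max)."""
    x0, y0, x1, y1 = bbox
    # load the coordinates once instead of once per comparison
    longitude = ds.longitude.compute()
    latitude = ds.latitude.compute()
    cond = (
        (longitude >= x0)
        & (longitude <= x1)
        & (latitude >= y0)
        & (latitude <= y1)
    )

    return ds.where(cond, drop=True)


model_subset = model_subset.pipe(geo_subset, bbox)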
model_subset
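
The comment next to `bbox` in the parameters cell asks for the subset to be optional. One possible way to get there is sketched below on plain NumPy arrays: a hypothetical `bbox_mask` helper returns an all-true mask when `bbox` is `None`, and otherwise applies the same four comparisons as `geo_subset`. The coordinate values are made up.

```python
import numpy as np


def bbox_mask(longitude, latitude, bbox=None):
    """Boolean mask of the pixels inside bbox = (lon_min, lat_min, lon_max, lat_max)."""
    if bbox is None:
        # no subsetting requested: keep every pixel
        return np.ones(np.shape(longitude), dtype=bool)
    x0, y0, x1, y1 = bbox
    return (longitude >= x0) & (longitude <= x1) & (latitude >= y0) & (latitude <= y1)


# small made-up 2D coordinate grid (2 rows of latitude, 3 columns of longitude)
lon, lat = np.meshgrid(np.array([-10.0, -5.0, 1.0]), np.array([44.0, 48.0]))
mask = bbox_mask(lon, lat, (-8, 45, 0, 51))
```

A mask like this could be handed to `ds.where(mask, drop=True)` in the same way as `cond` in `geo_subset`, once wrapped with the dataset's dimensions.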

### Convert sigma level to depth

The formula for the computation of the depth is model-specific.

*TODO*:
- calculate the modelled pressure and use that to compute the diff – essentially, that's what the tag measured
- have the hosted model (using `xpublish`) calculate the depth – that way, we don't need to worry about the model-specific formula
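
As an illustration of what such a formula can look like, the sketch below implements the `ocean_s_coordinate` transform from the CF conventions, whose terms map naturally onto the variables loaded above (`level` as the s coordinate, `XE` as the surface elevation, `H0` as the bathymetry, and `hc`, `theta` and `b` as stretching parameters). That mapping, and whether `marc_sigma_to_depth` uses exactly this formula, are assumptions; the numbers are made up.

```python
import numpy as np


def s_coordinate_to_z(s, eta, depth, depth_c, a, b):
    """Height z (negative below the surface) of each s level, CF `ocean_s_coordinate`."""
    s = np.asarray(s, dtype=float)
    # stretching function C(s): -1 at the bottom (s = -1), 0 at the surface (s = 0)
    stretch = (1 - b) * np.sinh(a * s) / np.sinh(a) + b * (
        np.tanh(a * (s + 0.5)) / (2 * np.tanh(0.5 * a)) - 0.5
    )
    return eta * (1 + s) + depth_c * s + (depth - depth_c) * stretch


# one water column: 80 m deep, 0.3 m surface elevation, five levels
levels = np.linspace(-1.0, 0.0, 5)
z = s_coordinate_to_z(levels, eta=0.3, depth=80.0, depth_c=5.0, a=6.0, b=0.0)
```

Because the surface elevation varies in time and the bathymetry varies in space, the resulting depth is a function of time, level and position, which is what makes the later comparison with the tag a per-pixel operation.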

In [None]:
reference_model_ = marc_sigma_to_depth(model_subset)
reference_model_

### Reshape the tag data

To further align both datasets, we need to reshape the data into bins, such that the `temperature(measured_time)` and `depth(measured_time)` coordinates become `temperature(model_time, obs)` and `depth(model_time, obs)`.

Determine the bins:

In [None]:
reference_model = reference_model_.cf.add_bounds(["time"], output_dim="bounds").pipe(
    bounds_to_bins, bounds_dim="bounds"
)
reference_model

Reshape:

In [None]:
%%time
reshaped_tag = (
    tag_log[["water_temperature", "pressure"]]
    .pipe(
        reshape_by_bins,
        dim="time",
        bins=reference_model.time_bins,
        bin_dim="bincount",
        other_dim="obs",
    )
    .assign_coords(time=lambda ds: reference_model.time.isel(time=ds.bincount))
    .swap_dims({"bincount": "time"})
    .drop_vars(["bincount", "time_bins"])
    .chunk({"time": 1})
)
reshaped_tag
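
The reshaping step can be illustrated on plain arrays. The sketch below assigns each measurement to the model time step whose bin contains it and pads every bin to the same length with NaN, giving a `(model_time, obs)` array. It is not the `reshape_by_bins` implementation; the helper name, bin edges and values are made up.

```python
import numpy as np


def reshape_into_bins(times, values, bin_edges):
    """Group `values` by the bin their timestamp falls in, padding with NaN."""
    # index of the bin [edge[i], edge[i + 1]) containing each timestamp
    bin_index = np.digitize(times, bin_edges) - 1
    n_bins = len(bin_edges) - 1
    inside = (bin_index >= 0) & (bin_index < n_bins)
    counts = np.bincount(bin_index[inside], minlength=n_bins)
    out = np.full((n_bins, counts.max()), np.nan)
    position = np.zeros(n_bins, dtype=int)
    for i, value in zip(bin_index[inside], values[inside]):
        out[i, position[i]] = value
        position[i] += 1
    return out


# made-up measurement times (in hours) and temperatures, three hourly bins
times = np.array([0.1, 0.4, 0.9, 1.2, 2.5, 2.6])
temps = np.array([12.0, 12.1, 12.3, 11.8, 11.0, 11.1])
reshaped = reshape_into_bins(times, temps, bin_edges=np.array([0.0, 1.0, 2.0, 3.0]))
```

The NaN padding is what allows bins with different numbers of measurements to share one rectangular `obs` dimension.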

## Compute the differences

Now that both datasets are aligned, we can compute the actual difference. However, since the model's depth is a function of time and position, we can't just subtract the tag log from the model. Instead, we have to find the matching depth for each pixel separately, take the temperature at that depth and calculate the difference. Finally, we can compute the mean along the observation dimension to get a single value per pixel and timestep.

In [None]:
diff = (
    marc_diff_z(reference_model, reshaped_tag, depth_threshold=relative_depth_threshold)
    .to_dataset()
    .assign_attrs({"tag_id": tag_log.attrs["tag_id"]})
    .assign({"H0": reference_model.H0})
)
diff
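
For a single pixel and time step, the procedure described above might look like the following NumPy sketch: pick the model level nearest to each observed depth, subtract, and average over the observations. Nearest-level matching and the sign convention (model minus tag) are assumptions made for this example, the depth threshold is left out, and `marc_diff_z` may handle all of these differently; all values are made up.

```python
import numpy as np


def column_diff(model_depth, model_temp, tag_depth, tag_temp):
    """Mean (model - tag) temperature difference for one water column."""
    # ignore the NaN padding along the observation dimension
    valid = ~np.isnan(tag_depth) & ~np.isnan(tag_temp)
    if not valid.any():
        return np.nan
    # index of the model level closest to each observed depth
    nearest = np.abs(model_depth[:, None] - tag_depth[valid][None, :]).argmin(axis=0)
    return float(np.mean(model_temp[nearest] - tag_temp[valid]))


model_depth = np.array([0.0, 10.0, 20.0, 40.0])
model_temp = np.array([15.0, 14.0, 12.0, 10.0])
tag_depth = np.array([9.0, 22.0, np.nan])
tag_temp = np.array([13.5, 12.5, np.nan])
result = column_diff(model_depth, model_temp, tag_depth, tag_temp)
```

Repeating this for every pixel and time step yields one value per pixel and time step, which is the shape of the `diff` variable computed above.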

## Save the differences to disk

In [None]:
%%time
# the time bins have to be dropped since zarr cannot represent them
diff.drop_vars(["time_bins"]).to_zarr(output_path, mode="w", consolidated=True)

In [None]:
# read the result back from disk to check it
diff_ = xr.open_zarr(output_path)

In [None]:
import hvplot.xarray

diff_["diff"].isel(time=0).plot(x="longitude", y="latitude")