# Load data from Smart-Kages into `movement` datasets
Load all DLC .h5 pose files for each kage and concatenate them
into a single `movement` dataset.

Assign a datetime index across the `time` dimension for easy access.

Save the resulting dataset to a netCDF file.

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import xarray as xr
from matplotlib import pyplot as plt
from movement.io import load_poses

from smart_kages_movement.io import load_background_frame

## Configuration
Define some global variables and paths.

In [2]:
FPS = 2  # frames per second
PIXELS_PER_CM = 10  # pixels per centimetre (need to double-check this)
TIME_PRECISION = "ns"

# Configure seaborn for prettier plots
sns.set_context("notebook")
sns.set_style("ticks")

In [3]:
data_dir = (
    Path.home()
    / "UCL Dropbox"
    / "Loukia Katsouri"
    / "DataProtocolsEquipment"
    / "SmartKages"
    / "1.Analysis_DS_Apr-May2024"
    / "RawData_300525"
)
analysis_dir = data_dir / "movement_analysis"
# CSV files generated by the 01 notebook
df_path = analysis_dir / "all_segments.csv"
overlaps_path = analysis_dir / "segment_overlaps.csv"


for path in [data_dir, analysis_dir, df_path, overlaps_path]:
    if not path.exists():
        print(f"Path does not exist: {path}")

## Load CSV files as dataframes

We load the dataframe containing the paths to all the DLC .h5 files, which is generated in the 01 notebook.

In [4]:
df = pd.read_csv(
    df_path,
    index_col=[0, 1, 2],
    dtype={
        "date": str,
        "hour": str,
        "n_frames": int,
    },
    parse_dates=["start_datetime", "end_datetime"],
)
df.head()

| kage | date | hour | start_datetime | end_datetime | duration | n_frames | n_channels | height | width | pose_file_path | video_file_path |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kage1 | 20240403 | 9 | 2024-04-03 09:54:24 | 2024-04-03 09:59:56.500 | 0 days 00:05:32.500000 | 665 | 3 | 376 | 500 | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... |
| kage1 | 20240403 | 10 | 2024-04-03 10:00:06 | 2024-04-03 10:59:57.500 | 0 days 00:59:51.500000 | 7183 | 3 | 376 | 500 | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... |
| kage1 | 20240403 | 11 | 2024-04-03 11:01:07 | 2024-04-03 11:59:59.000 | 0 days 00:58:52 | 7064 | 1 | 376 | 500 | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... |
| kage1 | 20240403 | 12 | 2024-04-03 12:01:08 | 2024-04-03 12:59:57.500 | 0 days 00:58:49.500000 | 7059 | 3 | 376 | 500 | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... |
| kage1 | 20240403 | 13 | 2024-04-03 13:01:07 | 2024-04-03 13:59:56.000 | 0 days 00:58:49 | 7058 | 3 | 376 | 500 | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... | /Users/loukia/UCL Dropbox/Loukia Katsouri/Data... |


We see that the dataframe holds various information on each 1-hour segment of the data.

The index is hierarchical, organising the data first by `kage`, then by `date`, and finally by `hour`.

Of special relevance to us here:
- Paths: `pose_file_path`, `video_file_path`
- `start_datetime` and `end_datetime` for each segment
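
As a minimal sketch of how such a hierarchical index can be queried, the toy dataframe below mimics the `kage`/`date`/`hour` levels with made-up values (it is not the real `all_segments.csv`). Selecting by the first level alone drops that level, while a full three-part key returns a single value.

```python
import pandas as pd

# Toy stand-in for the segments dataframe (values are illustrative only)
index = pd.MultiIndex.from_tuples(
    [
        ("kage1", "20240403", "09"),
        ("kage1", "20240403", "10"),
        ("kage2", "20240403", "09"),
    ],
    names=["kage", "date", "hour"],
)
toy_df = pd.DataFrame({"n_frames": [665, 7183, 7200]}, index=index)

# Selecting by the first level returns a frame indexed by (date, hour)
toy_kage1 = toy_df.loc["kage1"]
print(toy_kage1.index.names)

# A full (kage, date, hour) key returns a single value
n_frames = toy_df.loc[("kage1", "20240403", "10"), "n_frames"]
print(n_frames)
```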

Now let's see which segments overlap with each other (these were pre-computed in the 01 notebook).

In [5]:
overlaps = pd.read_csv(
    overlaps_path,
    index_col=0,
    parse_dates=["end_A", "start_B"],
)
overlaps

| | segment_A | segment_B | end_A | start_B | overlap_duration |
|---|---|---|---|---|---|
| 0 | ('kage3', '20240425', '06') | ('kage3', '20240425', '07') | 2024-04-25 06:59:52.500 | 2024-04-25 06:59:21 | 00:00:31.500000 |
| 1 | ('kage3', '20240504', '06') | ('kage3', '20240504', '07') | 2024-05-04 06:59:39.000 | 2024-05-04 06:59:21 | 00:00:18 |
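
The `overlap_duration` column can be reproduced from `end_A` and `start_B`. The sketch below recomputes it for the first row shown above, and parses the stringified `segment_A` key back into a tuple with `ast.literal_eval`. The variable names are illustrative and not taken from the 01 notebook.

```python
import ast

import pandas as pd

# Values copied from the first row of the overlaps table
end_a = pd.Timestamp("2024-04-25 06:59:52.500")
start_b = pd.Timestamp("2024-04-25 06:59:21")

# Segment A ends after segment B starts, so the difference is positive
overlap = end_a - start_b
print(overlap)

# The segment keys are stored as the string form of a tuple
kage, date, hour = ast.literal_eval("('kage3', '20240425', '06')")
print(kage, date, hour)
```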


## Load all data from a given kage
We will create a function that, given a kage name, will load all the data from the DLC .h5 files and merge them into a single `movement` dataset.
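
One step of that function is generating a timestamp for every frame of a segment. Here is a small sketch of that step in isolation, assuming a constant frame rate of 2 fps and using the first segment from the table above (665 frames starting at 09:54:24). The variable names are illustrative.

```python
import pandas as pd

fps = 2  # assumed constant frame rate
n_frames = 665
segment_start = pd.Timestamp("2024-04-03 09:54:24")

# One timestamp per frame, spaced 1 / fps seconds apart
timestamps = pd.date_range(
    start=segment_start,
    periods=n_frames,
    freq=pd.Timedelta(seconds=1 / fps),
)
print(timestamps[0], timestamps[-1])

# The last frame falls (n_frames - 1) / fps seconds after the start
expected_last = segment_start + pd.Timedelta(seconds=(n_frames - 1) / fps)
```

Under these assumptions the last timestamp is 09:59:56, which is within one frame of the recorded `end_datetime` of 09:59:56.500.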

In [6]:
def kage_to_movement_ds(
    df: pd.DataFrame,
    kage: str,
    overlaps: pd.DataFrame | None = None,
) -> tuple[xr.Dataset, np.ndarray]:
    """Load all poses for a given kage and return an xarray Dataset.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing the paths to pose files as well as metadata
        for each 1-hour segment.
    kage : str
        The name of the kage to process, e.g., "kage1", "kage2", etc.
    overlaps : pd.DataFrame | None
        Optional, DataFrame containing information about overlapping segments.

    Returns
    -------
    xr.Dataset
        An xarray Dataset containing the poses for the specified kage,
        with time coordinates assigned based on the corrected timestamps.
    np.ndarray
        A background image (numpy array) loaded from the middle segment
        of the kage, used for visualization purposes.

    Notes
    -----
    The returned Dataset will have two time coordinates:
    - ``time``: the primary time coordinate based on the corrected timestamps.
    - ``seconds_elapsed``: the secondary time coordinate representing
       seconds elapsed since the start of the kage.
    """

    def _is_monotonic_increasing(arr):
        """Check if a 1D array is monotonically increasing."""
        return (arr[1:] >= arr[:-1]).all()

    print(f"Processing kage: {kage}")
    df_kage = df.loc[kage]
    df_kage = df_kage.sort_index()  # ensure chronological order
    n_days = df_kage.index.get_level_values("date").nunique()
    print(f"Number of days: {n_days}")
    n_segments = df_kage.shape[0]
    print(f"Number of 1-hour segments: {n_segments}")

    kage_start_datetime = pd.Timestamp(
        df_kage["start_datetime"].iloc[0], unit=TIME_PRECISION
    )

    ds_segments = []  # List of xarray Datasets for each 1-hour segment

    for date, hour in df_kage.index:
        # Load the pose data for the current 1-hour segment
        poses = load_poses.from_file(
            df_kage.loc[(date, hour), "pose_file_path"],
            source_software="DeepLabCut",
            fps=FPS,
        )

        # Assert that length of tracks matches the number of video frames
        n_frames = df_kage.loc[(date, hour), "n_frames"]
        assert poses.sizes["time"] == n_frames, (
            f"Number of tracked timepoints ({poses.sizes['time']}) does not "
            f"match the number of frames ({n_frames}) for {date} at {hour}!"
        )

        # Create timestamps starting from the segment start datetime
        # at the specified FPS
        segment_start_datetime = pd.Timestamp(
            df_kage.loc[(date, hour), "start_datetime"], unit=TIME_PRECISION
        )
        timestamps = pd.date_range(
            start=segment_start_datetime,
            periods=n_frames,
            freq=pd.Timedelta(seconds=1 / FPS),
            unit=TIME_PRECISION,
        )

        # Sanity check: the final timestamp should not differ by more than
        # 1 frame (in either direction) from the segment end datetime
        segment_end_datetime = pd.Timestamp(
            df_kage.loc[(date, hour), "end_datetime"], unit=TIME_PRECISION
        )
        tolerance = pd.Timedelta(1 / FPS, "sec")
        assert abs(timestamps[-1] - segment_end_datetime) <= tolerance, (
            f"Final timestamp {timestamps[-1]} differs by more than "
            f"{tolerance} from segment end datetime {segment_end_datetime}"
        )

        # assign time coordinates to the actual datetime timestamps
        poses = poses.assign_coords(time=timestamps)
        poses.attrs["time_unit"] = f"datetime64[{TIME_PRECISION}]"

        # If this segment overlaps with the next one,
        # we'll delete overlapping frames at the end of this segment
        segment_str = f"('{kage}', '{date}', '{hour}')"
        if (overlaps is not None) and segment_str in overlaps[
            "segment_A"
        ].values:
            row_index = overlaps.index[overlaps["segment_A"] == segment_str][0]
            # Delete everything that comes after the start time of segment B
            next_segment_start = pd.Timestamp(
                overlaps.loc[row_index, "start_B"], unit=TIME_PRECISION
            )
            poses = poses.sel(time=slice(None, next_segment_start - tolerance))
            n_removed = n_frames - poses.sizes["time"]
            print(
                f"Removed {n_removed} overlapping frames at the end of "
                f"segment {date} {hour} for {kage}."
            )

        # add to list of loaded segments
        ds_segments.append(poses)

    # Combine all segments into a single xarray Dataset
    ds_kage = xr.concat(ds_segments, dim="time")
    ds_kage.attrs["kage"] = kage
    ds_kage.attrs["kage_start_datetime"] = kage_start_datetime.isoformat()

    # Ensure the concatenated timestamps are monotonic increasing
    assert _is_monotonic_increasing(ds_kage.time.values), (
        f"Combined timestamps for {kage} are not monotonic increasing!"
    )

    # Assign secondary time coordinate as seconds elapsed since kage start
    seconds_since_kage_start = (
        ds_kage.time.data - np.datetime64(kage_start_datetime)
    ) / pd.Timedelta("1s")
    ds_kage = ds_kage.assign_coords(
        seconds_elapsed=("time", seconds_since_kage_start)
    )

    # load image to use as background frame
    video_path = df_kage.iloc[n_segments // 2]["video_file_path"]
    background_img = load_background_frame(
        video_path=video_path, i=0, n_average=100
    )
    print(f"Loaded background image for {kage} from {video_path} \n")

    return ds_kage, background_img
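
The trimming and concatenation logic can be tried out without any pose files. The following is a minimal sketch on two toy segments that overlap by a few frames. It is not the function above: `toy_segment` and the variable `x` are hypothetical stand-ins for the loaded pose data.

```python
import numpy as np
import pandas as pd
import xarray as xr

frame_interval = pd.Timedelta(seconds=0.5)  # 2 fps


def toy_segment(start: str, n_frames: int) -> xr.Dataset:
    """Build a toy dataset with one variable along a datetime `time` axis."""
    time = pd.date_range(start=start, periods=n_frames, freq=frame_interval)
    return xr.Dataset(
        {"x": ("time", np.arange(n_frames, dtype=float))},
        coords={"time": time},
    )


# Segment A runs past the start of segment B by a few frames
seg_a = toy_segment("2024-04-25 06:59:50", 10)  # last frame at 06:59:54.5
seg_b = toy_segment("2024-04-25 06:59:53", 10)

# Keep only frames of A strictly before the start of B.
# Label-based slices are inclusive, hence the one-frame offset.
start_b = pd.Timestamp(seg_b.time.values[0])
seg_a_trimmed = seg_a.sel(time=slice(None, start_b - frame_interval))

combined = xr.concat([seg_a_trimmed, seg_b], dim="time")

# Timestamps should now be strictly increasing
time_diffs = np.diff(combined.time.values)
is_increasing = bool((time_diffs > np.timedelta64(0, "ns")).all())

# Secondary coordinate: seconds elapsed since the first frame
elapsed = combined.time.values - combined.time.values[0]
seconds = elapsed / np.timedelta64(1, "s")
combined = combined.assign_coords(seconds_elapsed=("time", seconds))
print(combined.sizes["time"], is_increasing)
```

Here 6 of the 10 frames in segment A survive the trim, so the combined toy dataset holds 16 frames.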

Now let's create a combined `movement` dataset for each kage.

We also assign a background image to each kage, which is used for visualisation purposes.

In [7]:
ds_dict = {}  # Dict mapping each kage to its `movement` dataset
img_dict = {}  # Dict mapping each kage to its background image

for kage in df.index.get_level_values("kage").unique():
    ds_dict[kage], img_dict[kage] = kage_to_movement_ds(df, kage, overlaps)

Processing kage: kage1
Number of days: 35
Number of 1-hour segments: 760
Loaded background image for kage1 from /Users/loukia/UCL Dropbox/Loukia Katsouri/DataProtocolsEquipment/SmartKages/1.Analysis_DS_Apr-May2024/RawData_300525/kage1/videos/2024/04/23/kage1_20240423_130002.mp4 

Processing kage: kage10
Number of days: 38
Number of 1-hour segments: 859
Loaded background image for kage10 from /Users/loukia/UCL Dropbox/Loukia Katsouri/DataProtocolsEquipment/SmartKages/1.Analysis_DS_Apr-May2024/RawData_300525/kage10/videos/2024/04/21/kage10_20240421_010002.mp4 

Processing kage: kage11
Number of days: 38
Number of 1-hour segments: 866
Loaded background image for kage11 from /Users/loukia/UCL Dropbox/Loukia Katsouri/DataProtocolsEquipment/SmartKages/1.Analysis_DS_Apr-May2024/RawData_300525/kage11/videos/2024/04/21/kage11_20240421_040002.mp4 

Processing kage: kage12
Number of days: 38
Number of 1-hour segments: 854
Loaded background image for kage12 from /Users/loukia/UCL Dropbox/Loukia Ka

In [9]:
# Inspect the dataset for kage3
ds_dict["kage3"]
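
With datetimes on the `time` axis, subsets can be selected by wall-clock time. A minimal sketch on a toy dataset follows (it does not use the real `ds_dict` contents, and the variable `x` is a placeholder):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset: two hours of frames at 2 fps, starting at 06:00:00
time = pd.date_range(
    start="2024-04-25 06:00:00",
    periods=14400,
    freq=pd.Timedelta(seconds=0.5),
)
toy_ds = xr.Dataset(
    {"x": ("time", np.zeros(time.size))},
    coords={"time": time},
)

# Label-based slices include both endpoints
one_minute = toy_ds.sel(
    time=slice(
        pd.Timestamp("2024-04-25 06:30:00"),
        pd.Timestamp("2024-04-25 06:31:00"),
    )
)
print(one_minute.sizes["time"])

# Keep only frames recorded during the 07:00 hour
hour_7 = toy_ds.where(toy_ds.time.dt.hour == 7, drop=True)
print(hour_7.sizes["time"])
```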

## Save the combined datasets to netCDF files
Finally, we save the combined `movement` dataset for each kage to its own netCDF file.

We also save the background image for each kage in the same directory.

In [10]:
for kage, ds in ds_dict.items():
    print(f"Saving dataset for {kage}...")
    kage_dir = analysis_dir / kage
    kage_dir.mkdir(parents=True, exist_ok=True)
    ds_file_path = kage_dir / f"{kage}.nc"
    ds.to_netcdf(ds_file_path, mode="w", engine="netcdf4", format="NETCDF4")
    print(f"Dataset for {kage} saved to {ds_file_path}.")

    img_file_path = kage_dir / f"{kage}_background.png"
    plt.imsave(img_file_path, img_dict[kage])
    print(f"Background image for {kage} saved to {img_file_path}.\n")

Saving dataset for kage1...
Dataset for kage1 saved to /Users/loukia/UCL Dropbox/Loukia Katsouri/DataProtocolsEquipment/SmartKages/1.Analysis_DS_Apr-May2024/RawData_300525/movement_analysis/kage1/kage1.nc.
Background image for kage1 saved to /Users/loukia/UCL Dropbox/Loukia Katsouri/DataProtocolsEquipment/SmartKages/1.Analysis_DS_Apr-May2024/RawData_300525/movement_analysis/kage1/kage1_background.png.

Saving dataset for kage10...
Dataset for kage10 saved to /Users/loukia/UCL Dropbox/Loukia Katsouri/DataProtocolsEquipment/SmartKages/1.Analysis_DS_Apr-May2024/RawData_300525/movement_analysis/kage10/kage10.nc.
Background image for kage10 saved to /Users/loukia/UCL Dropbox/Loukia Katsouri/DataProtocolsEquipment/SmartKages/1.Analysis_DS_Apr-May2024/RawData_300525/movement_analysis/kage10/kage10_background.png.

Saving dataset for kage11...
Dataset for kage11 saved to /Users/loukia/UCL Dropbox/Loukia Katsouri/DataProtocolsEquipment/SmartKages/1.Analysis_DS_Apr-May2024/RawData_300525/movemen