# Parse data structure from Smart-Kages
Parse data paths from the Smart-Kages folder structure and store them in a pandas DataFrame.

Also, load corrected timestamps to help reconstruct the correct time axis.

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import sleap_io as sio

from smart_kages_movement.io import (
    load_corrected_timestamps,
    parse_data_into_df,
)

## Summarise data paths into a single dataframe

First let's define the path to the folder containing all the data.

In [2]:
data_dir = Path.home() / "Data" / "Smart-Kages"
assert data_dir.exists(), f"Data directory {data_dir} does not exist."

The data is stored per Smart-Kage, in folders named `kageN`, e.g. `kage1`, `kage2`, etc.

Each Smart-Kage folder contains:
- Daily videos in `videos/YYYY/MM/DD/`, split into 1-hour segments. Each 1-hour segment is an `.mp4` file named `kageN_YYYYMMDD_HHMMSS.mp4`.
- The corresponding DeepLabCut (DLC) predictions in `analysis/dlc_output/YYYY/MM/DD/`. Each 1-hour `.h5` file therein is prefixed with `kageN_YYYYMMDD_HHMMSS`.
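
The naming convention above carries enough information to recover the kage, date, and start time of each segment. Below is a minimal sketch of that idea, assuming file names start with `kageN_YYYYMMDD_HHMMSS` exactly. `parse_segment_stem` is a hypothetical helper for illustration and is not necessarily how `parse_data_into_df` works.

```python
import re
from datetime import datetime

# Hypothetical pattern and helper, not part of smart_kages_movement
SEGMENT_PATTERN = re.compile(r"^(kage\d+)_(\d{8})_(\d{6})")


def parse_segment_stem(stem: str) -> dict:
    """Split a name starting with kageN_YYYYMMDD_HHMMSS into its parts."""
    match = SEGMENT_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {stem}")
    kage, date, time = match.groups()
    start = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return {
        "kage": kage,
        "date": date,
        "hour": start.hour,
        "start_datetime": start,
    }


info = parse_segment_stem(
    "kage3_20240510_080001DLC_resnet101_v2Jan17shuffle2_580000"
)
print(info)
```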

Let's parse the relevant parts of the data structure into a single dataframe.

In [3]:
df = parse_data_into_df(data_dir)
df.head()

Found 2 kage directories:  kage1 kage3
Found a total of 1615 .h5 pose files output by DLC.


| kage | date | hour | start_datetime | pose_file_path | video_exists | video_file_path |
|---|---|---|---|---|---|---|
| kage1 | 20240403 | 9 | 2024-04-03 09:54:20 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... |
| kage1 | 20240403 | 10 | 2024-04-03 10:00:02 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... |
| kage1 | 20240403 | 11 | 2024-04-03 11:01:03 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... |
| kage1 | 20240403 | 12 | 2024-04-03 12:01:04 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... |
| kage1 | 20240403 | 13 | 2024-04-03 13:01:03 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... |


## Let's add some video metadata
We read each video's `n_frames`, `height`, `width`, and `n_channels` from the video file itself, using `sleap-io`.

These metadata are added as columns to the dataframe.

In [4]:
video_shapes = pd.DataFrame(
    np.zeros((len(df), 4), dtype=int),
    index=df.index,
    columns=["n_frames", "height", "width", "n_channels"],
)

for idx, row in df.iterrows():
    video_path = row["video_file_path"]
    video = sio.load_video(video_path)  # Lazy-Load the video using sleap_io
    # Extract video shape information
    video_shapes.loc[idx, "n_frames"] = video.shape[0]
    video_shapes.loc[idx, "height"] = video.shape[1]
    video_shapes.loc[idx, "width"] = video.shape[2]
    video_shapes.loc[idx, "n_channels"] = (
        video.shape[3] if len(video.shape) > 3 else 1
    )
    video.close()  # Close the video to free resources

# Concatenate the video shapes with the original DataFrame
df = pd.concat([df, video_shapes], axis=1)

In [5]:
df.head()

| kage | date | hour | start_datetime | pose_file_path | video_exists | video_file_path | n_frames | height | width | n_channels |
|---|---|---|---|---|---|---|---|---|---|---|
| kage1 | 20240403 | 9 | 2024-04-03 09:54:20 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 665 | 376 | 500 | 3 |
| kage1 | 20240403 | 10 | 2024-04-03 10:00:02 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7183 | 376 | 500 | 3 |
| kage1 | 20240403 | 11 | 2024-04-03 11:01:03 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7064 | 376 | 500 | 1 |
| kage1 | 20240403 | 12 | 2024-04-03 12:01:04 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7059 | 376 | 500 | 3 |
| kage1 | 20240403 | 13 | 2024-04-03 13:01:03 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7058 | 376 | 500 | 3 |


## Load corrected timestamps

There is one file per day, stored in `kageN/analysis/dlc_output/YYYY/MM/DD/corrected_timestamps.pkl`.

This file contains a dictionary mapping each pose .h5 file to an array of seconds since the start of the hour.

Let's load the corrected timestamps as a dictionary, where the keys are the pose .h5 file names and the values are the corresponding arrays of seconds.
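
Gathering the per-day files into one dictionary can be sketched as follows. This assumes each `corrected_timestamps.pkl` holds a plain `dict` mapping file names to NumPy arrays. `merge_timestamp_pickles` is a hypothetical helper and may differ from what `load_corrected_timestamps` does internally. The example writes a toy file to a temporary folder so that it runs without the real data.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def merge_timestamp_pickles(root: Path) -> dict:
    """Merge every corrected_timestamps.pkl found under root into one dict."""
    merged = {}
    for pkl_path in sorted(root.rglob("corrected_timestamps.pkl")):
        with open(pkl_path, "rb") as f:
            merged.update(pickle.load(f))
    return merged


# Toy data standing in for one day of one kage (file name and values are made up)
with tempfile.TemporaryDirectory() as tmp:
    day_dir = (
        Path(tmp) / "kage1" / "analysis" / "dlc_output" / "2024" / "04" / "03"
    )
    day_dir.mkdir(parents=True)
    toy = {"kage1_20240403_100002_example.h5": np.array([6.0, 6.5, 7.0])}
    with open(day_dir / "corrected_timestamps.pkl", "wb") as f:
        pickle.dump(toy, f)

    merged = merge_timestamp_pickles(Path(tmp))

print(list(merged.keys()))
```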

In [6]:
corrected_timestamps = load_corrected_timestamps(data_dir)
# Display first 10 values for a specific pose file
last_pose_file = list(corrected_timestamps.keys())[-1]
print(f"Corrected timestamps for: {last_pose_file}")
print(corrected_timestamps[last_pose_file][:10])

Corrected timestamps for: kage3_20240510_080001DLC_resnet101_v2Jan17shuffle2_580000.h5
[5.         5.50230416 6.00460832 6.50691248 7.00921663 7.51152079
 8.01382495 8.51612911 9.01843327 9.52073743]


The timestamps are expressed in seconds since the start of the hour. Each segment starts a few seconds past the hour (5 s in the example above), so a few seconds are missing between consecutive segments.
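
The spacing between consecutive timestamps also gives a rough estimate of the frame rate. Here is a short sketch using the ten values printed above; the variable names are illustrative only.

```python
import numpy as np

# First ten corrected timestamps printed above (seconds)
seconds = np.array([
    5.0, 5.50230416, 6.00460832, 6.50691248, 7.00921663,
    7.51152079, 8.01382495, 8.51612911, 9.01843327, 9.52073743,
])

frame_intervals = np.diff(seconds)  # time between consecutive frames
median_interval = np.median(frame_intervals)
approx_fps = 1.0 / median_interval

print(f"median interval: {median_interval:.3f} s, approx. {approx_fps:.2f} fps")
```

For these values the interval is about 0.5 s, i.e. roughly 2 frames per second, which is consistent with about 7000 frames per hour-long video.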

Now let's verify that the number of timestamps per pose file matches the number of frames in the corresponding video.

In [7]:
df["n_timestamps"] = 0

for idx, row in df.iterrows():
    pose_file = row["pose_file_path"].name
    if pose_file in corrected_timestamps:
        df.loc[idx, "n_timestamps"] = len(corrected_timestamps[pose_file])
    else:
        print(f"Warning: no corrected timestamps for {pose_file}")

# Check if n_timestamps matches the number of frames for all rows
assert (df["n_timestamps"] == df["n_frames"]).all(), (
    "Mismatch between n_timestamps and n_frames"
)

In [8]:
df.head()

| kage | date | hour | start_datetime | pose_file_path | video_exists | video_file_path | n_frames | height | width | n_channels | n_timestamps |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kage1 | 20240403 | 9 | 2024-04-03 09:54:20 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 665 | 376 | 500 | 3 | 665 |
| kage1 | 20240403 | 10 | 2024-04-03 10:00:02 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7183 | 376 | 500 | 3 | 7183 |
| kage1 | 20240403 | 11 | 2024-04-03 11:01:03 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7064 | 376 | 500 | 1 | 7064 |
| kage1 | 20240403 | 12 | 2024-04-03 12:01:04 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7059 | 376 | 500 | 3 | 7059 |
| kage1 | 20240403 | 13 | 2024-04-03 13:01:03 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7058 | 376 | 500 | 3 | 7058 |


Let's convert the timestamps for each pose file to datetime format.
- Our input is `corrected_timestamps`, a dictionary mapping each hour-long pose file to an array of seconds since the start of the hour.
- We use these to compute datetime timestamps and save them to files named `kageN_YYYYMMDD_HH_timestamps.txt` (1 file per hour-long segment).

In this process, we'll also get the chance to correct the `start_datetime` for each hour-long segment.
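
As a small worked example of this conversion, take the kage3 segment shown earlier, recorded on 2024-05-10 during hour 8, whose first corrected timestamps are 5.0 s and 5.50230416 s. The variable names here are illustrative only.

```python
import numpy as np
import pandas as pd

date, hour = "20240510", 8
seconds_since_hour = np.array([5.0, 5.50230416])

# Offset of each frame since midnight, added to midnight of the recording day
time_since_midnight = pd.to_timedelta(
    hour * 3600 + seconds_since_hour, unit="s"
)
frame_datetimes = pd.Timestamp(date) + time_since_midnight

print(frame_datetimes[0].strftime("%Y-%m-%d %H:%M:%S"))
```

The first frame lands at 08:00:05, four seconds later than the 08:00:01 encoded in the file name.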

In [9]:
for idx, row in df.iterrows():
    kage = row.name[0]
    date = row.name[1]
    hour = row.name[2]

    pose_file_name = row["pose_file_path"].name
    seconds_since_hour = corrected_timestamps.get(pose_file_name, np.array([]))
    time_since_midnight = pd.to_timedelta(
        int(hour) * 3600 + seconds_since_hour, unit="s"
    )
    # Update the datetime column accordingly
    datetime = pd.Timestamp(date) + time_since_midnight
    start_datetime = datetime[0]
    df.loc[idx, "start_datetime"] = start_datetime.strftime(
        "%Y-%m-%d %H:%M:%S"
    )

    # Save the absolute datetime timestamps to a txt file per pose file
    save_dir = data_dir / "movement_analysis" / "timestamps"
    save_dir.mkdir(parents=True, exist_ok=True)
    timestamps_file = save_dir / f"{kage}_{date}_{hour}_timestamps.txt"
    # save pandas timestamp series to a text file
    pd.Series(datetime).to_csv(timestamps_file, index=False, header=False)
    # Store the path to the timestamps file in the DataFrame
    df.loc[idx, "timestamps_file_path"] = timestamps_file

In [11]:
df.head()

| kage | date | hour | start_datetime | pose_file_path | video_exists | video_file_path | n_frames | height | width | n_channels | n_timestamps | timestamps_file_path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kage1 | 20240403 | 9 | 2024-04-03 09:54:24 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 665 | 376 | 500 | 3 | 665 | /Users/nsirmpilatze/Data/Smart-Kages/movement_... |
| kage1 | 20240403 | 10 | 2024-04-03 10:00:06 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7183 | 376 | 500 | 3 | 7183 | /Users/nsirmpilatze/Data/Smart-Kages/movement_... |
| kage1 | 20240403 | 11 | 2024-04-03 11:01:07 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7064 | 376 | 500 | 1 | 7064 | /Users/nsirmpilatze/Data/Smart-Kages/movement_... |
| kage1 | 20240403 | 12 | 2024-04-03 12:01:08 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7059 | 376 | 500 | 3 | 7059 | /Users/nsirmpilatze/Data/Smart-Kages/movement_... |
| kage1 | 20240403 | 13 | 2024-04-03 13:01:07 | /Users/nsirmpilatze/Data/Smart-Kages/kage1/ana... | True | /Users/nsirmpilatze/Data/Smart-Kages/kage1/vid... | 7058 | 376 | 500 | 3 | 7058 | /Users/nsirmpilatze/Data/Smart-Kages/movement_... |


## Save the dataframe to a CSV file

In [12]:
save_dir = data_dir / "movement_analysis"
save_dir.mkdir(parents=True, exist_ok=True)
df.to_csv(save_dir / "dlc_files.csv")
print(f"Data saved to {save_dir / 'dlc_files.csv'}")

Data saved to /Users/nsirmpilatze/Data/Smart-Kages/movement_analysis/dlc_files.csv
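
When reloading this file later, the three index levels need to be restored explicitly. Below is a minimal sketch with a toy table and a temporary file, assuming the index levels are named `kage`, `date`, and `hour` as in the tables above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy table with the same index levels as the real one
index = pd.MultiIndex.from_tuples(
    [("kage1", "20240403", 9), ("kage1", "20240403", 10)],
    names=["kage", "date", "hour"],
)
toy_df = pd.DataFrame({"n_frames": [665, 7183]}, index=index)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "dlc_files.csv"
    toy_df.to_csv(csv_path)
    # Restore the MultiIndex from the three index columns
    reloaded = pd.read_csv(csv_path, index_col=["kage", "date", "hour"])

print(reloaded)
```

CSV does not store data types, so an index level such as `date` may come back as integers rather than strings, and path columns come back as plain strings rather than `Path` objects.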
