In [2]:
%load_ext autoreload
%autoreload 2

# Vectronics Preprocessing

**Author:** Medha Agarwal | 
**Last Modified:** December 12, 2025

This notebook documents the preprocessing pipeline used to generate summary-level Vectronics data for training models that are subsequently applied to RVC data. The raw Vectronics data are collected at a high temporal resolution (16 Hz). To align these data with the coarser temporal granularity of the RVC dataset, we aggregate the raw signals into 30-second windows and compute summary statistics for each window.
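As a quick check of the arithmetic behind this aggregation, the sketch below computes how many raw samples fall into one window and how many windows a full day yields. The constant names are illustrative only; the pipeline itself reads the window duration from a YAML config and the sampling rate from `config.SAMPLING_RATE`, which is assumed here to be 16 Hz as stated above.

```python
# Illustrative constants (names are hypothetical, values taken from the text).
SAMPLING_RATE_HZ = 16      # raw Vectronics sampling rate
WINDOW_DURATION_S = 30.0   # summary window duration

# Number of raw samples aggregated into one summary window.
samples_per_window = int(WINDOW_DURATION_S * SAMPLING_RATE_HZ)

# Number of non-overlapping windows in a full 24-hour recording.
windows_per_day = int(24 * 60 * 60 / WINDOW_DURATION_S)

print(samples_per_window)  # 480
print(windows_per_day)     # 2880
```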

Using the raw Vectronics data together with behavior annotations, we construct two primary data products:

1. **Labeled Vectronics summary data**
2. **Unlabeled Vectronics summary data**

### Required Files

The notebook expects the following input files:

1. Video annotations: `data/2025_10_31_awd_video_annotations.csv`  
2. Audio annotations: `data/2025_10_31_awd_audio_annotations.csv`  
3. Metadata: `data/metadata.csv`  
4. Vectronics acceleration files, organized by half-day as specified in the `file path` column of the metadata file


---

#### 1. Labeled Vectronics Summary Data

The labeled summary dataset is created through the following sequence of steps:

1.1. **Annotation matching**: Match the high-frequency Vectronics data with the available behavior annotations.

1.2. **Window extraction**: Segment the matched portions of the Vectronics data into non-overlapping 30-second windows.

1.3. **Feature computation**: Compute nine summary statistics for each 30-second window.

Finally, the resulting summarized Vectronics dataset is saved for downstream modeling.

---

#### 2. Unlabeled Vectronics Summary Data

The unlabeled summary dataset is generated as follows:

2.1. **Data loading**: Load raw Vectronics data for a given (animal ID, date) pair.

2.2. **Windowing**: Partition the data into consecutive 30-second windows.

2.3. **Feature computation**: Compute the same nine summary statistics for each window.

2.4. **Iteration over metadata**: Repeat steps 2.1–2.3 for all unique (animal ID, date) pairs listed in the metadata.

---

#### Implementation Notes
All preprocessing steps described above are implemented in  
`scripts/run_Vectronics_preprocessing.py`.  
This notebook walks through the same pipeline step by step and illustrates each stage.


In [3]:
import sys
import os
sys.path.append('.')
sys.path.append('../')
sys.path.append('../../')

import yaml
import pandas as pd
import numpy as np
import warnings
from tqdm import tqdm

In [6]:
import src.utils.io as io
from src.utils.Vectronics_preprocessing import (create_max_windows,
                                                create_summary_data,
                                                load_annotations,
                                                create_windowed_features)
import config as config
from src.utils.data_prep import create_matched_data, create_metadata

In [7]:
with open(config.VECTRONICS_PREPROCESSING_YAML) as f:
    Vectronics_preprocessing_config = yaml.safe_load(f)

window_duration = Vectronics_preprocessing_config['window_duration']
window_length = int(window_duration*config.SAMPLING_RATE)

## 1. Labeled Vectronics Summary Data

We begin by loading the behavior annotations using the function `load_annotations`. This function merges video-based annotations obtained from `io.get_video_labels_path()` with audio-based annotations obtained from `io.get_audio_labels_path()` into a single consolidated table.

The resulting `all_annotations` CSV is expected to contain the following columns:

- `id`  
- `Behavior`  
- `Timestamp_start`  
- `Timestamp_end`  
- `Source`  
- `Confidence (H-M-L)`  
- `Eating intensity`  
- `duration`

In addition to the annotations, a metadata CSV is created to organize and iterate over the Vectronics files saved as half days. The expected columns in the metadata file are:

- `file path`  
- `individual ID`  
- `year`  
- `UTC Date [yyyy-mm-dd]`  
- `am/pm`  
- `half day [yyyy-mm-dd_am/pm]`  
- `avg temperature [C]`

This metadata enables identification of unique animal–date combinations and provides contextual information (e.g., recording period and temperature) for each Vectronics file.
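To make the role of the metadata concrete, here is a minimal sketch of how the unique animal–date combinations and their half-day files can be recovered with a `groupby`. The table is a toy example: the file names and dates are made up, and only a subset of the expected columns is included.

```python
import pandas as pd

# Toy metadata with a subset of the expected columns (values are made up).
toy_metadata = pd.DataFrame({
    "file path": ["a_am.csv", "a_pm.csv", "b_am.csv"],
    "individual ID": ["jessie", "jessie", "green"],
    "UTC Date [yyyy-mm-dd]": ["2022-01-01", "2022-01-01", "2022-01-01"],
    "am/pm": ["am", "pm", "am"],
})

# One group per (individual, date) pair; each group holds up to two half-day files.
grouped = toy_metadata.groupby(["individual ID", "UTC Date [yyyy-mm-dd]"])
pairs = list(grouped.groups.keys())
files_per_pair = grouped["file path"].apply(list).to_dict()

for pair in pairs:
    print(pair, files_per_pair[pair])
```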

> **Note:** Generating this metadata is computationally expensive and takes approximately **50 minutes** to run. This step should be executed only once to create the metadata corresponding to your local data paths. The resulting file is saved in the `data/` directory and can be reused in subsequent runs.




In [10]:
# load annotations and metadata

all_annotations = load_annotations()
if not os.path.exists(io.get_vectronics_metadata_path()):
    create_metadata(config.AWD_VECTRONICS_PATHS, io.get_vectronics_metadata_path())
metadata = pd.read_csv(io.get_vectronics_metadata_path())

individual jessie has 506 halfdays.


100%|██████████| 506/506 [05:33<00:00,  1.52it/s]


individual green has 900 halfdays.


100%|██████████| 900/900 [10:57<00:00,  1.37it/s]


individual palus has 744 halfdays.


100%|██████████| 744/744 [09:03<00:00,  1.37it/s]


individual ash has 792 halfdays.


100%|██████████| 792/792 [09:51<00:00,  1.34it/s]


individual fossey has 886 halfdays.


100%|██████████| 886/886 [11:35<00:00,  1.27it/s]


#### Step 1.1: Annotation Matching

To ensure that each 30-second window carries a well-defined behavioral label, we enforce a majority-behavior criterion: the behavior assigned to a window must be the dominant annotated behavior over that interval, that is, it must cover more than half of the window. We apply the following procedure:

- For each contiguous annotation segment longer than 15 seconds, we extract the corresponding high-frequency Vectronics signal.
- These matched signal segments are padded symmetrically on the left and right until their total duration reaches 30 seconds, so the annotated behavior still covers more than half of the padded window.
- Annotation segments shorter than 15 seconds are not padded; they are discarded later in the pipeline and do not contribute to the labeled Vectronics summary data.
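The padding rule can be sketched on timestamps alone. This is an illustrative helper, not the logic inside `create_matched_data`: the name `pad_interval` is hypothetical, the treatment of a segment lasting exactly 15 seconds is an assumption, and clipping at file boundaries is not handled.

```python
import pandas as pd

def pad_interval(start, end, target_s=30.0, min_s=15.0):
    """Symmetrically pad [start, end] to target_s seconds.

    Returns None for segments too short to be padded. Treating a segment of
    exactly min_s seconds as too short is an assumption of this sketch.
    """
    duration = (end - start).total_seconds()
    if duration <= min_s:
        return None
    if duration >= target_s:
        return start, end  # already long enough; no padding needed
    pad = pd.Timedelta(seconds=(target_s - duration) / 2)
    return start - pad, end + pad

start = pd.Timestamp("2022-01-01 00:00:10")
padded = pad_interval(start, start + pd.Timedelta(seconds=20))
too_short = pad_interval(start, start + pd.Timedelta(seconds=10))
print(padded)
print(too_short)
```

In this example the 20-second annotation is extended by 5 seconds on each side, so the annotated behavior covers 20 of the 30 seconds in the resulting window.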

In [8]:
min_window_for_padding = 15.0
_, acc_data, _, _ = create_matched_data(filtered_metadata=metadata, 
                                                            annotations=all_annotations, 
                                                            verbose=True, 
                                                            min_window_for_padding=min_window_for_padding,
                                                            min_matched_duration=window_duration)


individual jessie has 506 halfdays in the filtered metadata.


Processing unique half days for jessie: 100%|██████████| 506/506 [00:53<00:00,  9.40it/s]


individual green has 900 halfdays in the filtered metadata.


Processing unique half days for green: 100%|██████████| 900/900 [00:37<00:00, 24.31it/s]


individual palus has 744 halfdays in the filtered metadata.


Processing unique half days for palus: 100%|██████████| 744/744 [00:22<00:00, 32.88it/s]


individual ash has 792 halfdays in the filtered metadata.


Processing unique half days for ash: 100%|██████████| 792/792 [00:51<00:00, 15.26it/s]


individual fossey has 448 halfdays in the filtered metadata.


Processing unique half days for fossey: 100%|██████████| 448/448 [00:39<00:00, 11.22it/s]


#### Step 1.2: Window Extraction

From the matched and padded signal segments obtained in Step 1.1, we extract fixed-length 30-second windows for downstream processing.

- For all matched signal chunks with a total duration of at least 30 seconds, we segment the data into non-overlapping 30-second windows.
- After window extraction, any remaining matched chunks with duration shorter than 30 seconds are discarded.
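A minimal sketch of this segmentation on a NumPy array follows. It is not the implementation of `create_max_windows`; the helper name `split_into_windows` is hypothetical, and the input is assumed to be an array of shape `(n_samples, 3)` holding the X, Y, and Z axes.

```python
import numpy as np

def split_into_windows(signal, window_length):
    """Split (n_samples, n_axes) data into non-overlapping windows.

    Trailing samples that do not fill a complete window are dropped.
    """
    n_windows = len(signal) // window_length
    trimmed = signal[: n_windows * window_length]
    return trimmed.reshape(n_windows, window_length, *signal.shape[1:])

window_length = int(30.0 * 16)           # 480 samples per window at 16 Hz
chunk = np.arange(1000 * 3, dtype=float).reshape(1000, 3)
windows = split_into_windows(chunk, window_length)
print(windows.shape)                     # (2, 480, 3); the last 40 samples are dropped
```

A chunk shorter than one window yields an empty array with zero windows, which mirrors the rule that short chunks are discarded.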

In [12]:
print(f"Creating windows of durations {window_duration}...")
acc_data_split = create_max_windows(acc_data=acc_data, window_duration=window_duration, sampling_rate=config.SAMPLING_RATE)
acc_data_split = acc_data_split[acc_data_split.duration >= window_duration]


Creating windows of durations 30.0...


#### Step 1.3: Feature Computation

With the 30-second windows defined in Step 1.2, we compute a set of summary statistics for each window. 

- Each 30-second window is summarized using **nine predefined statistics**: mean peak-to-peak, max peak-to-peak, and mean acceleration, each computed for the X, Y, and Z axes (three statistics × three axes).
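The notebook does not spell out how the peak-to-peak statistics are defined, so the sketch below shows one plausible reading only: peak-to-peak amplitude is computed within one-second sub-windows, then averaged and maximized over the 30-second window. The sub-window length, the helper name `summarize_window`, and the feature names are all assumptions; `create_summary_data` may compute these quantities differently.

```python
import numpy as np

def summarize_window(window, sampling_rate=16):
    """Summarize one window of shape (n_samples, 3) into nine statistics.

    Assumption: peak-to-peak is taken per one-second sub-window.
    """
    n_seconds = window.shape[0] // sampling_rate
    per_second = window[: n_seconds * sampling_rate].reshape(n_seconds, sampling_rate, 3)
    ptp = per_second.max(axis=1) - per_second.min(axis=1)  # shape (n_seconds, 3)

    features = {}
    for i, axis in enumerate("xyz"):
        features[f"{axis}_ptp_mean"] = float(ptp[:, i].mean())
        features[f"{axis}_ptp_max"] = float(ptp[:, i].max())
        features[f"{axis}_mean"] = float(window[:, i].mean())
    return features

rng = np.random.default_rng(0)
example = summarize_window(rng.normal(size=(480, 3)))
print(len(example))  # 9
```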


In [13]:
print("Creating summary statistics...")
df_preprocessed = create_summary_data(acc_data_split, sampling_rate=config.SAMPLING_RATE)
df_preprocessed = df_preprocessed.drop(columns=['duration'])

Creating summary statistics...


In [14]:
print(f"Saving preprocessed data to {io.get_Vectronics_preprocessed_path(window_duration)}")
df_preprocessed.to_csv(io.get_Vectronics_preprocessed_path(window_duration), index=False)

Saving preprocessed data to /home/medhaaga/BotswanaML/data/Vectronics_preprocessed_duration30.0.csv


## 2. Unlabeled Vectronics Summary Data

The unlabeled Vectronics data are processed similarly to the labeled data but without associating behavioral annotations. Due to the large volume of raw Vectronics recordings, generating the full summary dataset can take approximately **50 minutes** to complete. 

**Recommendation:**  
For full-scale preprocessing, it is advisable to run the script  
`scripts/run_Vectronics_preprocessing.py` from the terminal.

In this notebook, we demonstrate the preprocessing on a **subset of 10 random (animal ID, date) pairs** as a sanity check. This allows verification of the pipeline without incurring the full runtime.
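If the sanity check needs to be repeatable across runs, the random subset can be drawn with a seeded generator. The following is a small sketch under that assumption; the helper `sample_pairs` and the seed value are hypothetical and are not part of the pipeline.

```python
import numpy as np

def sample_pairs(pairs, n=10, seed=0):
    """Draw up to n distinct (animal ID, date) pairs reproducibly."""
    rng = np.random.default_rng(seed)
    n = min(n, len(pairs))
    idx = rng.choice(len(pairs), size=n, replace=False)
    return [pairs[i] for i in idx]

# Toy list of (animal ID, date) pairs; the values are made up.
toy_pairs = [("jessie", f"2022-01-{day:02d}") for day in range(1, 26)]
subset = sample_pairs(toy_pairs)
print(len(subset))  # 10
```

Fixing the seed means the same ten pairs are processed each time, which makes it easier to compare outputs before and after a change to the pipeline.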


In [19]:
grouped = metadata.groupby(["individual ID", "UTC Date [yyyy-mm-dd]"])
group_keys = list(grouped.groups.keys())   # list of (individual, date) tuples
np.random.shuffle(group_keys)    

In [23]:
results = []
i = 0

for individual, date in tqdm(group_keys, total=len(group_keys)):
    group = grouped.get_group((individual, date))

    # load all half-day files for this animal/day
    dfs = []
    for _, row in group.iterrows():
        df_half = pd.read_csv(row['file path'])
        df_half['Timestamp'] = pd.to_datetime(df_half['Timestamp'], utc=True, format='%Y-%m-%d %H:%M:%S.%f')
        dfs.append(df_half)

    full_day_data = pd.concat(dfs, ignore_index=True).sort_values("Timestamp")

    if len(full_day_data) < window_length:
        warnings.warn(f'{individual}-{date} has fewer samples than the window length. Skipped.')
        continue

    features = create_windowed_features(full_day_data, sampling_frequency=config.SAMPLING_RATE, 
                                        window_duration=window_duration, window_length=window_length)
    features['animal_id'] = individual
    features['UTC date [yyyy-mm-dd]'] = date
    results.append(features) 

    i+=1
    if i == 10:
        break

df = pd.concat(results, ignore_index=True)    

  0%|          | 0/1708 [00:00<?, ?it/s]

  1%|          | 9/1708 [00:16<52:43,  1.86s/it]


In [None]:
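# Note: in this notebook `df` holds only the 10-pair subset built above,
# so running this cell writes that subset to the full-summary path.
print(f"Saving preprocessed data to {io.get_Vectronics_full_summary_path()}")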
df.to_csv(io.get_Vectronics_full_summary_path(), index=False)