In [10]:
%load_ext autoreload
%autoreload 2



# Vectronics Preprocessing

**Author:** Medha Agarwal | 
**Last Modified:** December 12, 2025

This notebook documents the preprocessing pipeline used to generate summary-level Vectronics data for training models that are subsequently applied to RVC data. The raw Vectronics data are collected at a high temporal resolution (16 Hz). To align these data with the coarser temporal granularity of the RVC dataset, we aggregate the raw signals into 30-second windows and compute summary statistics for each window.

Using the raw Vectronics data together with behavior annotations, we construct two primary data products:

1. **Labeled Vectronics summary data**
2. **Unlabeled Vectronics summary data**

### Required Files

The notebook expects the following input files:

1. Video annotations: `data/2025_10_31_awd_video_annotations.csv`  
2. Audio annotations: `data/2025_10_31_awd_audio_annotations.csv`  
3. Raw Vectronics data: `config.AWD_VECTRONICS_PATHS`, a dictionary with one entry per individual. Each key is the individual's name (`'jessie'`, `'palus'`, etc.) and each value is the path to the directory containing that individual's yearly CSV files of raw data. Please adjust the paths in this dictionary in `/config.paths.py`.

---

#### 1. Labeled Vectronics Summary Data

The labeled summary dataset is created through the following sequence of steps:

1.1. **Creating half-day segments & metadata**: Split the raw acceleration files, which are stored in a separate folder for each individual, into half-day segments. Store the attributes of each half-day file (file path, individual, half day, etc.) in a metadata file for easy indexing.

1.2. **Annotation matching**: Match the high-frequency Vectronics data with the available behavior annotations.

1.3. **Window extraction**: Segment the matched portions of the Vectronics data into non-overlapping 30-second windows.

1.4. **Feature computation**: Compute nine summary statistics for each 30-second window.

Finally, the resulting summarized Vectronics dataset is saved for downstream modeling.

---

#### 2. Unlabeled Vectronics Summary Data

The unlabeled summary dataset is generated as follows:

2.1. **Data loading**: Load raw Vectronics data for a given (animal ID, date) pair.

2.2. **Windowing**: Partition the data into consecutive 30-second windows.

2.3. **Feature computation**: Compute the same nine summary statistics for each window.

2.4. **Iteration over metadata**: Repeat steps 2.1–2.3 for all unique (animal ID, date) pairs listed in the metadata.

---

#### Implementation Notes
All preprocessing steps described above are implemented in  
`scripts/run_Vectronics_preprocessing.py`.  
This notebook provides a step-by-step explanation of the pipeline and illustrates each stage of the preprocessing process in detail.


In [20]:
import sys
import os
sys.path.append('.')
sys.path.append('../')
sys.path.append('../../')

import yaml
import pandas as pd
import numpy as np
import warnings
from tqdm import tqdm
from typing import Optional
from dataclasses import dataclass


In [21]:
import src.utils.io as io
from src.utils.vectronics_preprocessing import (create_max_windows,
                                                create_summary_data,
                                                load_annotations,
                                                create_windowed_features)
from src.utils.vectronics_data_prep import (create_vectronics_halfday_segments,
                                            create_matched_data,
                                            create_metadata)
import config

In [22]:
with open(config.VECTRONICS_PREPROCESSING_YAML) as f:
    Vectronics_preprocessing_config = yaml.safe_load(f)

window_duration = Vectronics_preprocessing_config['window_duration']
window_length = int(window_duration*config.SAMPLING_RATE)

In [23]:
@dataclass
class TrainConfig:

    # ---------------- Preprocessing ----------------
    min_window_for_padding: Optional[float] = None # minimum signal duration above which the signal is padded; None means no padding
    seed: int = 0

    # ---------------- Data ----------------
    source_data_path: Optional[str] = None
    
args = TrainConfig()

np.random.seed(seed=args.seed)

## 1. Labeled Vectronics Summary Data

We begin by creating half-day segments from the raw Vectronics data, which are expected to be stored in the following layout:

- root data directory
  - individual 1
    - 2022.csv
    - 2023.csv
    - 2024.csv
  - individual 2
    - 2022.csv
    - 2023.csv
  - individual 3
    - 2023.csv
    - 2024.csv

**Note**: An individual's name cannot contain an underscore (`_`).
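
As a minimal sketch of how such a layout could be traversed, the snippet below collects the yearly CSV files for each individual and rejects names that contain an underscore. The helper `list_yearly_csvs` is hypothetical and not part of the repository; it only assumes a dictionary that maps each individual's name to a directory.

```python
import tempfile
from pathlib import Path


def list_yearly_csvs(paths):
    """Map each individual to the sorted names of its yearly CSV files.

    `paths` is assumed to map individual name -> directory.
    This helper is hypothetical and only illustrates the layout.
    """
    found = {}
    for individual, directory in paths.items():
        if "_" in individual:
            raise ValueError(f"Individual name {individual!r} contains an underscore.")
        found[individual] = sorted(p.name for p in Path(directory).glob("*.csv"))
    return found


# Build a throwaway directory tree that mimics the expected layout.
with tempfile.TemporaryDirectory() as root:
    for name, years in {"jessie": [2022, 2023], "palus": [2021]}.items():
        folder = Path(root) / name
        folder.mkdir()
        for year in years:
            (folder / f"{year}.csv").touch()
    demo_paths = {name: str(Path(root) / name) for name in ("jessie", "palus")}
    yearly_files = list_yearly_csvs(demo_paths)

print(yearly_files)
```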

Each acceleration data CSV is expected to have the following columns:

| Column Name                     | Data Type | Description                                      |
|---------------------------------|-----------|--------------------------------------------------|
| UTC Date[mm/dd]                 | string    | date in format mm/dd                             |
| UTC DateTime                    | string    | time in format HH:MM:SS                          |
| Milliseconds                    | int       | milliseconds (three digits)                      |
| Acc X [g]                       | float     | acceleration reading along X axis                |
| Acc Y [g]                       | float     | acceleration reading along Y axis                |
| Acc Z [g]                       | float     | acceleration reading along Z axis                |
| Temperature [Celsius]           | float     | temperature reading in celsius                   |
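
Because the date column carries no year, a full timestamp has to combine it with the year of the file. The sketch below shows one way to do this, assuming the year is taken from the CSV file name (e.g. `2023.csv`); the function `build_timestamp` is hypothetical and may differ from what the repository does.

```python
from datetime import datetime, timezone


def build_timestamp(year, date_mmdd, time_hms, milliseconds):
    """Combine year, 'mm/dd' date, 'HH:MM:SS' time, and milliseconds into a UTC timestamp.

    The year is assumed to come from the file name; it is not a CSV column.
    """
    base = datetime.strptime(f"{year}/{date_mmdd} {time_hms}", "%Y/%m/%d %H:%M:%S")
    return base.replace(microsecond=int(milliseconds) * 1000, tzinfo=timezone.utc)


ts = build_timestamp(2023, "01/01", "00:00:05", 125)
print(ts.isoformat())  # 2023-01-01T00:00:05.125000+00:00
```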


From these files, a `combined_acc` directory is created inside each individual's directory to store the data split into half days. To make the half-day segments easy to index and iterate over, a metadata CSV is created. The expected columns in the metadata file are:

- `file path`  
- `individual ID`  
- `year`  
- `UTC Date [yyyy-mm-dd]`  
- `am/pm`  
- `half day [yyyy-mm-dd_am/pm]`  
- `avg temperature [C]`
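
The half-day key can be derived directly from a timestamp. The following is a small sketch, assuming that AM covers 00:00–11:59 UTC and PM covers 12:00–23:59 UTC; the helper `halfday_label` is hypothetical.

```python
from datetime import datetime


def halfday_label(ts):
    """Return a label such as '2023-01-01_AM' for a timestamp.

    Assumes AM is any hour before 12 and PM is hour 12 or later.
    """
    return f"{ts:%Y-%m-%d}_{'AM' if ts.hour < 12 else 'PM'}"


print(halfday_label(datetime(2023, 1, 1, 11, 59, 59)))  # 2023-01-01_AM
print(halfday_label(datetime(2023, 1, 1, 12, 0, 0)))    # 2023-01-01_PM
```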


Next, we load the behavior annotations using the function `load_annotations`. This function merges video-based annotations obtained from `io.get_video_labels_path()` with audio-based annotations obtained from `io.get_audio_labels_path()` into a single consolidated table.
The resulting `all_annotations` CSV is expected to contain the following columns:

- `id`  
- `Behavior`  
- `Timestamp_start`  
- `Timestamp_end`  
- `Source`  
- `Confidence (H-M-L)`  
- `Eating intensity`  
- `duration`
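
To illustrate the kind of merge involved, here is a sketch that stacks two toy annotation tables, tags each row with its source, and derives `duration` in seconds. The rows, behavior names, and `Source` values are made up for illustration, and the actual `load_annotations` implementation may differ.

```python
import pandas as pd

# Toy tables standing in for the video and audio annotation CSVs (made-up rows).
video = pd.DataFrame({
    "id": ["jessie"],
    "Behavior": ["Running"],
    "Timestamp_start": ["2022-05-09 10:00:00"],
    "Timestamp_end": ["2022-05-09 10:00:40"],
})
audio = pd.DataFrame({
    "id": ["palus"],
    "Behavior": ["Eating"],
    "Timestamp_start": ["2021-03-24 15:00:00"],
    "Timestamp_end": ["2021-03-24 15:00:12"],
})

# Tag each table with its source, then stack them into one table.
merged = pd.concat(
    [video.assign(Source="Video"), audio.assign(Source="Audio")],
    ignore_index=True,
)

# Parse the timestamps and derive the annotation duration in seconds.
for col in ("Timestamp_start", "Timestamp_end"):
    merged[col] = pd.to_datetime(merged[col], utc=True)
merged["duration"] = (merged["Timestamp_end"] - merged["Timestamp_start"]).dt.total_seconds()

print(merged[["id", "Behavior", "Source", "duration"]])
```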



This metadata enables identification of unique animal–date combinations and provides contextual information (e.g., recording period and temperature) for each Vectronics file.

> **Note:** Generating this metadata is computationally expensive and takes approximately **50 minutes** to run. This step should be executed only once to create the metadata corresponding to your local data paths. The resulting file is saved in the `data/` directory and can be reused in subsequent runs.




#### Step 1.1: Creating half-day segments and metadata

Generating half-day segments and associated metadata from approximately two years of data across five dogs takes **~2.5 hours** to complete. Creating the metadata takes **~20 minutes**. For full-scale processing, it is recommended to run the script `src/utils/vectronics_data_prep.py` in the background (e.g., using `tmux`).

In this tutorial, we generate halfdays from **only one chunk (17.36 hours long)** per yearly CSV per individual to validate that the pipeline is functioning correctly. This is achieved by setting `max_chunks=1` in the relevant data-processing functions. 

Using the default value `max_chunks=None` processes **all chunks**, thereby generating half-day files for the entire dataset.
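
The effect of a `max_chunks` limit can be sketched with pandas' chunked CSV reader. The generator `iter_chunks` below is hypothetical and is not the repository's implementation. The chunk size of 1,000,000 rows in the final calculation is also an assumption, chosen because at 16 Hz it spans about 17.36 hours, which is consistent with the chunk duration quoted above.

```python
import io

import pandas as pd


def iter_chunks(csv_source, chunk_size, max_chunks=None):
    """Yield DataFrame chunks, stopping after `max_chunks` chunks if it is set."""
    reader = pd.read_csv(csv_source, chunksize=chunk_size)
    for n, chunk in enumerate(reader):
        if max_chunks is not None and n >= max_chunks:
            break
        yield chunk


# A tiny in-memory CSV with 10 rows, read in chunks of 4 rows.
csv_text = "x\n" + "\n".join(str(v) for v in range(10))
first_only = list(iter_chunks(io.StringIO(csv_text), chunk_size=4, max_chunks=1))
all_chunks = list(iter_chunks(io.StringIO(csv_text), chunk_size=4))
print(len(first_only), len(all_chunks))  # 1 3

# Duration of an assumed 1,000,000-row chunk sampled at 16 Hz, in hours.
chunk_hours = 1_000_000 / 16 / 3600
print(round(chunk_hours, 2))  # 17.36
```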


In [None]:
# create half-day segments of vectronics data by reading it in chunks
create_vectronics_halfday_segments(config.VECTRONICS_PATHS, max_chunks=1)

Processing individual:         jessie
Files for this individual :    ['2023.csv', '2022.csv']
Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2022_44934_Samurai_Jessie/2023.csv


Reading 2023:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2023-01-01_AM' '2023-01-01_PM'], chunk duration: 17.36 hrs.


Reading 2023:   0%|          | 0/1 [00:39<?, ?chunk/s]


Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2022_44934_Samurai_Jessie/2022.csv


Reading 2022:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2022-05-09_PM' '2022-05-09_AM' '2022-05-10_AM' '2022-05-10_PM'
 '2022-05-11_AM'], chunk duration: 17.36 hrs.


Reading 2022:   0%|          | 0/1 [00:20<?, ?chunk/s]



Processing individual:         green
Files for this individual :    ['2021.csv', '2023.csv', '2022.csv']
Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2021_44915_Samurai_Green/2021.csv


Reading 2021:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2021-03-24_PM' '2021-03-24_AM' '2021-03-25_AM' '2021-03-25_PM'
 '2021-03-26_AM'], chunk duration: 17.36 hrs.


Reading 2021:   0%|          | 0/1 [00:20<?, ?chunk/s]


Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2021_44915_Samurai_Green/2023.csv


Reading 2023:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2023-01-26_AM' '2023-02-09_PM'], chunk duration: 0.26 hrs.


Reading 2023:   0%|          | 0/1 [00:00<?, ?chunk/s]


Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2021_44915_Samurai_Green/2022.csv


Reading 2022:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2022-01-01_AM' '2022-01-01_PM'], chunk duration: 17.36 hrs.


Reading 2022:   0%|          | 0/1 [00:15<?, ?chunk/s]



Processing individual:         palus
Files for this individual :    ['2021.csv', '2022.csv']
Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2021_44910_Aqua_Palus/2021.csv


Reading 2021:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2021-03-24_PM' '2021-03-24_AM' '2021-03-25_AM' '2021-03-25_PM'
 '2021-03-26_AM'], chunk duration: 17.36 hrs.


Reading 2021:   0%|          | 0/1 [00:16<?, ?chunk/s]


Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2021_44910_Aqua_Palus/2022.csv


Reading 2022:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2022-01-01_AM' '2022-01-01_PM'], chunk duration: 17.36 hrs.


Reading 2022:   0%|          | 0/1 [00:15<?, ?chunk/s]



Processing individual:         ash
Files for this individual :    ['2021.csv', '2022.csv']
Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2021_44904_Ninja_Ash/2021.csv


Reading 2021:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2021-03-24_PM' '2021-03-24_AM' '2021-03-25_AM' '2021-03-25_PM'
 '2021-03-26_AM'], chunk duration: 17.36 hrs.


Reading 2021:   0%|          | 0/1 [00:15<?, ?chunk/s]


Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2021_44904_Ninja_Ash/2022.csv


Reading 2022:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2022-01-01_AM' '2022-01-01_PM'], chunk duration: 17.36 hrs.


Reading 2022:   0%|          | 0/1 [00:16<?, ?chunk/s]



Processing individual:         fossey
Files for this individual :    ['2023.csv', '2022.csv']
Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2022_44907_Aqua_Fossey/2023.csv


Reading 2023:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2023-01-01_AM' '2023-01-01_PM'], chunk duration: 17.36 hrs.


Reading 2023:   0%|          | 0/1 [00:16<?, ?chunk/s]


Handling the csv:              /mnt/ssd/medhaaga/wildlife/Vectronics/2022_44907_Aqua_Fossey/2022.csv


Reading 2022:   0%|          | 0/1 [00:00<?, ?chunk/s]

Half days in chunk:            ['2022-06-12_AM' '2022-06-12_PM' '2022-06-13_AM' '2022-06-13_PM'
 '2022-06-14_AM' '2022-06-14_PM' '2022-06-23_AM' '2022-07-03_AM'
 '2022-07-03_PM'], chunk duration: 17.36 hrs.


Reading 2022:   0%|          | 0/1 [00:16<?, ?chunk/s]





In [None]:
# create a metadata of the vectronics data
if os.path.exists(io.get_vectronics_metadata_path()):
    print(f"Metadata is already saved in {io.get_vectronics_metadata_path()}. Creation skipped.")
else:
    create_metadata(config.VECTRONICS_PATHS, io.get_vectronics_metadata_path())

Metadata is already saved in /home/medhaaga/BotswanaML/data/vectronics_metadata.csv. Creation skipped.


In [24]:
# load annotations and metadata

all_annotations = load_annotations()
metadata = pd.read_csv(io.get_vectronics_metadata_path())

#### Step 1.2: Annotations Matching

To ensure that each 30-second window is associated with a well-defined behavioral label, we enforce a majority-behavior criterion within each window. Specifically, the behavior assigned to a window must be the dominant annotated behavior over that interval. We apply the following procedure:

- For each contiguous annotation segment with duration greater than 15 seconds, we extract the corresponding high-frequency Vectronics signal.
- These matched signal segments are symmetrically padded on the left and right until a total duration of 30 seconds is reached. 
- Annotation segments with duration shorter than 15 seconds are discarded later and are not used to create labeled Vectronics summary data.
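
The padding rule can be sketched as follows. An annotation longer than 15 seconds still occupies more than half of the padded 30-second window, which is what preserves the majority-behavior criterion. The function `pad_interval` and its handling of the exact 15-second boundary are assumptions, not the repository's code.

```python
from datetime import datetime, timedelta

WINDOW = 30.0        # target window duration in seconds
MIN_DURATION = 15.0  # annotations must be longer than this to be padded (assumed boundary)


def pad_interval(start, end):
    """Symmetrically pad [start, end] to WINDOW seconds.

    Returns None for annotations that are too short, and leaves
    intervals that are already at least WINDOW seconds long unchanged.
    """
    duration = (end - start).total_seconds()
    if duration <= MIN_DURATION:
        return None
    if duration >= WINDOW:
        return start, end
    pad = timedelta(seconds=(WINDOW - duration) / 2)
    return start - pad, end + pad


start = datetime(2022, 5, 9, 10, 0, 0)
padded = pad_interval(start, start + timedelta(seconds=20))
print(padded)  # 20 s annotation padded by 5 s on each side
```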

In [None]:
_, acc_data, _, _ = create_matched_data(filtered_metadata=metadata,
                                        annotations=all_annotations,
                                        verbose=True,
                                        min_window_for_padding=args.min_window_for_padding,
                                        min_matched_duration=window_duration)


individual jessie has 506 halfdays in the filtered metadata.


Processing unique half days for jessie: 100%|██████████| 506/506 [00:53<00:00,  9.40it/s]


individual green has 900 halfdays in the filtered metadata.


Processing unique half days for green: 100%|██████████| 900/900 [00:37<00:00, 24.31it/s]


individual palus has 744 halfdays in the filtered metadata.


Processing unique half days for palus: 100%|██████████| 744/744 [00:22<00:00, 32.88it/s]


individual ash has 792 halfdays in the filtered metadata.


Processing unique half days for ash: 100%|██████████| 792/792 [00:51<00:00, 15.26it/s]


individual fossey has 448 halfdays in the filtered metadata.


Processing unique half days for fossey: 100%|██████████| 448/448 [00:39<00:00, 11.22it/s]


#### Step 1.3: Window Extraction

From the matched and padded signal segments obtained in Step 1.2, we extract fixed-length 30-second windows for downstream processing.

- For all matched signal chunks with total duration greater than 30 seconds, we segment the data into 30-second windows.
- After window extraction, any remaining matched chunks with duration shorter than 30 seconds are discarded.
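
A minimal sketch of non-overlapping window extraction is shown below. At 16 Hz, a 30-second window holds 480 samples, and trailing samples that do not fill a whole window are dropped. The function `split_into_windows` is hypothetical and only illustrates the idea behind `create_max_windows`, whose internals are not shown in this notebook.

```python
import numpy as np


def split_into_windows(signal, window_length):
    """Split an (n_samples, n_axes) array into non-overlapping windows.

    Trailing samples that do not fill a whole window are discarded.
    """
    n_windows = len(signal) // window_length
    trimmed = signal[: n_windows * window_length]
    return trimmed.reshape(n_windows, window_length, signal.shape[1])


sampling_rate = 16                          # Hz, as stated for the raw data
window_length = int(30.0 * sampling_rate)   # 480 samples per 30-second window

segment = np.zeros((70 * sampling_rate, 3))  # a 70-second, 3-axis toy segment
windows = split_into_windows(segment, window_length)
print(windows.shape)  # (2, 480, 3): the last 10 seconds are dropped
```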

In [12]:
print(f"Creating windows of durations {window_duration}...")
acc_data_split = create_max_windows(acc_data=acc_data, window_duration=window_duration, sampling_rate=config.SAMPLING_RATE)
acc_data_split = acc_data_split[acc_data_split.duration >= window_duration]


Creating windows of durations 30.0...


#### Step 1.4: Feature Computation

With the 30-second windows defined in Step 1.3, we compute a set of summary statistics for each window.

- Each 30-second window is summarized using **nine predefined statistics**: the mean peak-to-peak amplitude, the max peak-to-peak amplitude, and the mean acceleration, each computed for the X, Y, and Z axes.
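
The sketch below shows one plausible way to obtain nine such statistics from a single window. It assumes that peak-to-peak amplitude is computed within one-second blocks and then averaged or maximized over the window; this definition, the function `summarize_window`, and the output names are all assumptions, and `create_summary_data` may compute these quantities differently.

```python
import numpy as np


def summarize_window(window, sampling_rate=16):
    """Return nine statistics for one (n_samples, 3) acceleration window.

    Assumption: peak-to-peak amplitude is computed within each one-second
    block, then reduced with mean and max over the whole window.
    """
    n_seconds = window.shape[0] // sampling_rate
    blocks = window[: n_seconds * sampling_rate].reshape(n_seconds, sampling_rate, 3)
    p2p = blocks.max(axis=1) - blocks.min(axis=1)  # shape (n_seconds, 3)
    stats = {}
    for k, axis in enumerate("xyz"):
        stats[f"acc_{axis}_ptp_mean"] = float(p2p[:, k].mean())
        stats[f"acc_{axis}_ptp_max"] = float(p2p[:, k].max())
        stats[f"acc_{axis}_mean"] = float(window[:, k].mean())
    return stats


rng = np.random.default_rng(0)
toy_window = rng.normal(size=(480, 3))  # 30 s at 16 Hz, synthetic data
features = summarize_window(toy_window)
print(len(features))  # 9
```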


In [13]:
print(f"Creating summary statistics...")
df_preprocessed = create_summary_data(acc_data_split, sampling_rate=config.SAMPLING_RATE)
df_preprocessed = df_preprocessed.drop(columns=['duration'])

Creating summary statistics...


The preprocessed Vectronics data is saved to `args.source_data_path` if it is not `None`. Otherwise, it is saved to the path returned by `io.get_Vectronics_preprocessed_path(window_duration)`. With `window_duration = 30.0`, this resolves to the following file inside the repository's `data/` directory:

```
data/Vectronics_preprocessed_duration30.0.csv
```


In [None]:
# save to the user-specified path if given, otherwise to the default path
if args.source_data_path is not None:
    out_path = args.source_data_path
else:
    out_path = io.get_Vectronics_preprocessed_path(window_duration)
print(f"Saving preprocessed data to {out_path}")
df_preprocessed.to_csv(out_path, index=False)

Saving preprocessed data to /home/medhaaga/BotswanaML/data/Vectronics_preprocessed_duration30.0.csv


## 2. Unlabeled Vectronics Summary Data

The unlabeled Vectronics data are processed similarly to the labeled data but without associating behavioral annotations. Due to the large volume of raw Vectronics recordings, generating the full summary dataset can take approximately **50 minutes** to complete. 

**Recommendation:**  
For full-scale preprocessing, it is advisable to run the script  
`scripts/run_Vectronics_preprocessing.py` from the terminal.

In this notebook, we demonstrate the preprocessing on a **subset of 10 random (animal ID, date) pairs** as a sanity check. This allows verification of the pipeline without incurring the full runtime.


In [19]:
grouped = metadata.groupby(["individual ID", "UTC Date [yyyy-mm-dd]"])
group_keys = list(grouped.groups.keys())   # list of (individual, date) tuples
np.random.shuffle(group_keys)    

In [23]:
results = []
i = 0

for individual, date in tqdm(group_keys, total=len(group_keys)):
    group = grouped.get_group((individual, date))

    # load all half-day files for this animal/day
    dfs = []
    for _, row in group.iterrows():
        df_half = pd.read_csv(row['file path'])
        df_half['Timestamp'] = pd.to_datetime(df_half['Timestamp'], utc=True, format='%Y-%m-%d %H:%M:%S.%f')
        dfs.append(df_half)

    full_day_data = pd.concat(dfs, ignore_index=True).sort_values("Timestamp")

    if len(full_day_data) < window_length:
        warnings.warn(f'{individual}-{date} has fewer samples than the window length. Skipped.')
        continue

    features = create_windowed_features(full_day_data, sampling_frequency=config.SAMPLING_RATE, 
                                        window_duration=window_duration, window_length=window_length)
    features['animal_id'] = individual
    features['UTC date [yyyy-mm-dd]'] = date
    results.append(features) 

    i+=1
    if i == 10:
        break

df = pd.concat(results, ignore_index=True)    

  0%|          | 0/1708 [00:00<?, ?it/s]

  1%|          | 9/1708 [00:16<52:43,  1.86s/it]


In [None]:
print(f"Saving preprocessed data to {io.get_Vectronics_full_summary_path()}")
df.to_csv(io.get_Vectronics_full_summary_path(), index=False)