In [1]:
%load_ext autoreload
%autoreload 2

# RVC Data Preprocessing

**Author:** Medha Agarwal  
**Last Modified:** December 12, 2025

This notebook documents the preprocessing pipeline for RVC data. The goal of this pipeline is to standardize raw sensor outputs, enforce physical constraints based on sensor specifications, and derive weak behavioral labels for downstream analysis.

The RVC data preprocessing consists of the following steps:

1. **Sensor calibration** : Calibrate the raw acceleration signals according to the specifications of the sensor used at the time of data collection. This step converts all measurements into uniform units of gravitational acceleration (g).

2. **Range-based thresholding** : Threshold the calibrated signals based on the maximum physically achievable range of the sensors.

3. **Weak label aggregation** : Construct weak behavioral labels by aggregating behavior indicator binaries present in the RVC data.

This code requires two files:

1. The weakly-labeled RVC data stored in `config.RVC_ACC_ANNOTATED_COMBINED`. Edit this path appropriately based on where the data is stored on your device. 

2. The metadata file stored in `data/RVC_merged_metadata.xlsx`.

In [6]:
import yaml
import pandas as pd
import os
import sys
import numpy as np
sys.path.append('.')
sys.path.append('../')

from src.utils.RVC_preprocessing import remove_duplicates
from src.utils.RVC_calibration import calibrate_RVC_data, threshold_RVC
import config as config
import src.utils.io as io

First, we load the RVC data, which is stored as a single consolidated dataset.

Next, we load the metadata containing sensor-specific information required for calibration and downstream processing.

The RVC dataframe is expected to contain the following columns, representing summary acceleration features, behavioral indicators, and identifiers:

| Time & ID | Acc (Max Peak-to-Peak) | Acc (Mean Peak-to-Peak) | Acc (Mean) | Behavioral Labels |
|-----------|------------------------|-------------------------|------------|-------------------|
| `UTC.time..yyyy.mm.dd.HH.MM.SS.` | `Max.accel.peak.X` | `Mean.accel.peak.X` | `Mean.accel.X` | `resting_binary` |
| `GPS.time..s.` | `Max.accel.peak.Y` | `Mean.accel.peak.Y` | `Mean.accel.Y` | `moving_binary` |
| `id` | `Max.accel.peak.Z` | `Mean.accel.peak.Z` | `Mean.accel.Z` | `feeding_binary` |



The metadata dataframe is expected to contain the following columns:

| Identification | Timestamps | Hardware & Firmware | Calibration Parameters |
|---------------|---------------|---------------------|------------------------|
| `animal_id` | `start_date_dd_mm_yyyy` | `firmware_major_version` | `six_point_X` |
| `collar_number` | `end_date_dd_mm_yyyy` | `range` | `six_point_Y` |
|  | | | `six_point_Z` |
|  |  |  | `sensitivity_X` |
|  |  |  | `sensitivity_Y` |
|  |  |  | `sensitivity_Z` |
|  |  |  | `offset_X` |
|  |  |  | `offset_Y` |
|  |  |  | `offset_Z` |

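Before running the loading cell, it can help to confirm that both dataframes actually carry the columns listed above. The following is a minimal sketch, not part of the pipeline: the helper `missing_columns` and the two constant lists are hypothetical names introduced here, and the column names are copied from the tables above.

```python
import pandas as pd

# Column names copied from the tables above.
EXPECTED_RVC_COLUMNS = [
    "UTC.time..yyyy.mm.dd.HH.MM.SS.", "GPS.time..s.", "id",
    "Max.accel.peak.X", "Max.accel.peak.Y", "Max.accel.peak.Z",
    "Mean.accel.peak.X", "Mean.accel.peak.Y", "Mean.accel.peak.Z",
    "Mean.accel.X", "Mean.accel.Y", "Mean.accel.Z",
    "resting_binary", "moving_binary", "feeding_binary",
]
EXPECTED_METADATA_COLUMNS = [
    "animal_id", "collar_number",
    "start_date_dd_mm_yyyy", "end_date_dd_mm_yyyy",
    "firmware_major_version", "range",
    "six_point_X", "six_point_Y", "six_point_Z",
    "sensitivity_X", "sensitivity_Y", "sensitivity_Z",
    "offset_X", "offset_Y", "offset_Z",
]


def missing_columns(df, expected):
    """Return the expected column names that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]


# Toy frame that has only two of the expected RVC columns.
toy_df = pd.DataFrame({"id": ["mj"], "resting_binary": [1]})
missing = missing_columns(toy_df, EXPECTED_RVC_COLUMNS)
print(f"{len(missing)} expected columns are missing")
```

On the real data, an empty list from `missing_columns(RVC_df, EXPECTED_RVC_COLUMNS)` (checked right after `pd.read_csv`, before the rename) would indicate that the file matches the expected layout.
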
In [None]:
# Load the preprocessing config (provides 'feature_cols' and 'helper_cols')
with open(config.RVC_PREPROCESSING_YAML) as f:
    RVC_preprocessing_config = yaml.safe_load(f)

# Load the RVC data
print("Loading the RVC data...")
RVC_df = pd.read_csv(config.RVC_ACC_ANNOTATED)
# Normalize animal IDs: capitalize the first letter, except 'Mj', which becomes 'MJ'
RVC_df['animal_id'] = RVC_df['id'].str.capitalize()
RVC_df['animal_id'] = RVC_df['animal_id'].replace({'Mj': 'MJ'})
RVC_df = RVC_df.rename(columns={'UTC.time..yyyy.mm.dd.HH.MM.SS.': 'UTC time [yyyy-mm-dd HH:MM:SS]',
                            'GPS.time..s.': 'GPS time', 
                            'Max.accel.peak.X': 'acc_x_ptp_max',
                            'Max.accel.peak.Y': 'acc_y_ptp_max',
                            'Max.accel.peak.Z': 'acc_z_ptp_max',
                            'Mean.accel.peak.X': 'acc_x_ptp_mean',
                            'Mean.accel.peak.Y': 'acc_y_ptp_mean',
                            'Mean.accel.peak.Z': 'acc_z_ptp_mean',
                            'Mean.accel.X': 'acc_x_mean',
                            'Mean.accel.Y': 'acc_y_mean',
                            'Mean.accel.Z': 'acc_z_mean'
                            })
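# Parse timestamps, sort chronologically within each animal, and derive a date column
RVC_df['UTC time [yyyy-mm-dd HH:MM:SS]'] = pd.to_datetime(RVC_df['UTC time [yyyy-mm-dd HH:MM:SS]'])
RVC_df = RVC_df.sort_values(by=['animal_id', 'UTC time [yyyy-mm-dd HH:MM:SS]'])
RVC_df['UTC date [yyyy-mm-dd]'] = RVC_df['UTC time [yyyy-mm-dd HH:MM:SS]'].dt.date
# Drop duplicated rows (the helper reports how many were removed)
RVC_df = remove_duplicates(RVC_df)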

# Load the metadata and parse the deployment dates (stored day-first as dd/mm/yyyy)
print("Loading the RVC metadata...")
metadata_df = pd.read_excel(io.get_RVC_merged_metadata_path())
metadata_df['start_date_dd_mm_yyyy'] = pd.to_datetime(metadata_df['start_date_dd_mm_yyyy'], format='%d/%m/%Y')
metadata_df['end_date_dd_mm_yyyy'] = pd.to_datetime(metadata_df['end_date_dd_mm_yyyy'], format='%d/%m/%Y')

Loading the RVC data...
Removed 0 duplicates.


Next, we calibrate the acceleration values according to the sensor specifications to convert all measurements into consistent units of gravitational acceleration (g). Following calibration, we apply range-based thresholding to remove values that fall outside the physically valid limits defined by the sensor specifications. Finally, we discard rows dated before the year 2000.

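To make the two steps concrete, here is a minimal sketch on toy data. It assumes a simple linear calibration model, `g = (raw - offset) / sensitivity`, and a symmetric range check; the actual logic lives in `calibrate_RVC_data` and `threshold_RVC` and may differ (for example, in how the `six_point_*` parameters are used). The helpers `calibrate_linear` and `within_range` and all numeric values below are hypothetical.

```python
import pandas as pd


def calibrate_linear(raw, offset, sensitivity):
    """Convert raw sensor counts to g under an assumed linear model."""
    return (raw - offset) / sensitivity


def within_range(df, cols, sensor_range_g):
    """Boolean mask: True where every listed column lies within +/- sensor_range_g."""
    return df[cols].abs().le(sensor_range_g).all(axis=1)


# Toy raw readings in sensor counts (hypothetical values).
toy = pd.DataFrame({
    "acc_x_mean": [512.0, 768.0, 4608.0],
    "acc_y_mean": [512.0, 256.0, 512.0],
})
# Hypothetical calibration parameters and sensor range.
offset, sensitivity, sensor_range_g = 512.0, 256.0, 8.0

cols = ["acc_x_mean", "acc_y_mean"]
calibrated = toy.copy()
calibrated[cols] = calibrate_linear(toy[cols], offset, sensitivity)

# Keep only rows whose calibrated values are physically achievable.
kept = calibrated[within_range(calibrated, cols, sensor_range_g)]
print(f"Number of outliers removed: {len(calibrated) - len(kept)}/{len(calibrated)}")
```

In this toy example the third row calibrates to 16 g on the x axis, which exceeds the assumed 8 g range, so it is the only row dropped.
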

In [8]:
# Calibration
print("Calibrating the RVC data...")
RVC_df = calibrate_RVC_data(RVC_df, metadata_df)

# Thresholding
print("Thresholding the RVC data...")
RVC_df = threshold_RVC(RVC_df)
# Keep only rows dated in 2000 or later
RVC_df['UTC date [yyyy-mm-dd]'] = pd.to_datetime(RVC_df['UTC date [yyyy-mm-dd]'])
RVC_df = RVC_df[RVC_df['UTC date [yyyy-mm-dd]'].dt.year >= 2000]

Calibrating the RVC data...
No data found for animal_id Augustus in the date range NaT to NaT
No data found for animal_id Bali in the date range 2013-12-23 00:00:00 to 2014-09-17 00:00:00
No data found for animal_id Scorpion in the date range 2012-10-08 00:00:00 to 2012-10-30 00:00:00
No data found for animal_id Scorpion in the date range 2012-11-02 00:00:00 to 2013-10-05 00:00:00
No data found for animal_id Seronera in the date range NaT to NaT
No data found for animal_id Yolo in the date range 2012-09-21 00:00:00 to 2012-10-23 00:00:00
No data found for animal_id Gobi in the date range NaT to NaT
Number of rows without calibration metadata: 0/7230570
Thresholding the RVC data...
Number of outliers removed: 1330674/7230570.


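Based on the binary behavior indicators for feeding, moving, and resting, we assign a single weak label per row in a new `behavior` column. When more than one indicator is set, `np.select` uses the first matching condition, so feeding takes precedence over moving, and moving over resting. Resting rows are labeled `Stationary`, and rows with no indicator set receive `None`.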

In [9]:
# Create behavior column
RVC_df['behavior'] = np.select(
    [
        RVC_df['feeding_binary'] == 1,
        RVC_df['moving_binary'] == 1,
        RVC_df['resting_binary'] == 1
    ],
    ['Feeding', 'Moving', 'Stationary'],
    default=None
)

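The precedence among the indicators is easiest to see on a small example. The sketch below applies the same `np.select` call to a toy dataframe (the values are made up for illustration) that includes a row with no indicator set and a row with two indicators set.

```python
import numpy as np
import pandas as pd

# Toy indicator columns; row 3 has no behavior flagged, row 4 has two.
toy = pd.DataFrame({
    "feeding_binary": [1, 0, 0, 0, 1],
    "moving_binary":  [0, 1, 0, 0, 1],
    "resting_binary": [0, 0, 1, 0, 0],
})

# Conditions are checked in order; the first one that holds decides the label.
toy["behavior"] = np.select(
    [
        toy["feeding_binary"] == 1,
        toy["moving_binary"] == 1,
        toy["resting_binary"] == 1,
    ],
    ["Feeding", "Moving", "Stationary"],
    default=None,
)
print(toy)
```

The row with both feeding and moving flagged is labeled `Feeding`, and the row with nothing flagged keeps `None`, so unlabeled rows can be filtered later with `toy["behavior"].notna()`.
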
In [None]:
# Keep only the feature and helper columns listed in the preprocessing config
RVC_df = RVC_df[RVC_preprocessing_config['feature_cols'] + RVC_preprocessing_config['helper_cols']]
print("Saving the preprocessed RVC data...")
RVC_df.to_csv(io.get_RVC_preprocessed_path(), index=False)
