## Validate Python modules against the full dataset

This notebook checks that the existing Python modules work correctly when applied to the complete dataset.

The full dataset includes additional overview columns (e.g. battery changes and recording end dates), and all device detections are stored in a single Excel sheet. Minor adjustments may be required to produce a clean pandas DataFrame suitable for analysis.

Currently the schema ignores any extra columns. I will keep it this way and introduce extra columns when required.
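
The "ignore extra columns" behaviour can be illustrated with plain pandas. This is only a minimal sketch and not the actual `AudioMothSchema`: the helper `check_required_columns` and the column names in the example frame are hypothetical, chosen just to show required columns being checked and cast while unknown columns pass through untouched.

```python
import pandas as pd


def check_required_columns(df: pd.DataFrame, required: dict) -> pd.DataFrame:
    """Cast the required columns to the given dtypes; leave extra columns as-is.

    Hypothetical helper for illustration only, not part of `src`.
    """
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # astype with a mapping only converts the listed columns and returns a new frame
    return df.astype(required)


example = pd.DataFrame(
    {
        "device_id": ["A1", "A2"],
        "detections": ["3", "5"],
        "battery_change": ["2024-05-01", None],  # extra column, unknown to the check
    }
)
validated = check_required_columns(
    example, {"device_id": "string", "detections": "int64"}
)
```

The extra `battery_change` column survives unchanged, so it can be added to the required mapping later without touching the rest of the pipeline.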

In [None]:
import sys
from pathlib import Path
import pandas as pd


# Go up one level to .../audiomoth
PROJECT_ROOT = Path.cwd().resolve().parent

# Add project root to sys.path so `src` is importable
sys.path.insert(0, str(PROJECT_ROOT))

EXCEL_PATH = PROJECT_ROOT / "data_raw" / "helman_tor_audiomoth_data.xlsx"

# Make pandas show more columns/rows while exploring
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

## Basic normalisation
Standardise column names and parse timestamps if present.
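
As a rough illustration of the timestamp step, the sketch below merges separate date and time columns into one datetime column. It is not the implementation of `normaliser.combine_date_and_time`. The function name `combine_date_and_time_sketch` is hypothetical, and it assumes both columns hold plain strings.

```python
import pandas as pd


def combine_date_and_time_sketch(df, date_col, time_col, output_col):
    """Return a copy of df with date_col and time_col merged into output_col.

    Hypothetical stand-in for illustration; assumes both columns hold strings.
    """
    out = df.copy()
    combined = out[date_col].astype(str) + " " + out[time_col].astype(str)
    # errors="coerce" turns unparseable values into NaT instead of raising
    out[output_col] = pd.to_datetime(combined, errors="coerce")
    return out


example = pd.DataFrame(
    {"date": ["2024-05-01", "2024-05-02"], "time": ["21:15:00", "bad"]}
)
result = combine_date_and_time_sketch(
    example, date_col="date", time_col="time", output_col="detection_timestamp"
)
```

Coercing bad values to `NaT` keeps the row in the frame, so malformed timestamps can be counted and inspected rather than silently stopping the run.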


In [None]:
import src.audio_moth_schema as audio_moth_schema
import src.normaliser as normaliser

# Get all the Excel sheets available in the AudioMoth Excel file
sheets = normaliser.get_excel_sheets(EXCEL_PATH)

# Before merging, combine the date and time columns in the Overview sheet
sheets["Overview"] = normaliser.combine_date_and_time(
    sheets["Overview"],
    date_col="deployment_date",
    time_col="deployment_time",
    output_col="deployment_timestamp",
)


# Flatten all the sheets into a single DataFrame
df = normaliser.flatten_data(sheets)


# Combine the detection date and time columns into a single timestamp column
df = normaliser.combine_date_and_time(
    df, date_col="date", time_col="time", output_col="detection_timestamp"
)

# Validate and convert types according to AudioMoth schema
df = audio_moth_schema.AudioMothSchema.validate(df)

df.head()
# df.shape