## Amphibian Audio Data ETL Notebook

### Overview:

This notebook serves as a tool for the reformatting of amphibian audio data collected from 2019 through 2023.

### Objective:

The primary objective of this notebook is to reformat raw audio recordings into an analytically usable format and create a common file structure. The audio recordings are clipped into positive and negative samples, as defined by the annual summary reports provided through WildTrax.

### Contents:

1. **Data Extraction**:
   - Accessing and retrieving historical audio files from internal repositories.
   - Reviewing the structure and organization of audio data collected during the specified timeframe.
   
2. **Data Preprocessing**:
   - Conducting thorough cleaning and filtering of historical audio recordings.
   - Standardizing audio formats and associated metadata to ensure consistency across datasets.
   
3. **Feature Extraction**:
   - Extracting pertinent features from historical audio signals to aid in subsequent analysis.
   - Calculating acoustic metrics essential for amphibian species identification and classification within the designated timeframe.
   
4. **Data Loading**:
   - Efficiently loading processed audio data into structured databases or file systems.
   - Establishing robust data pipelines for automated ETL processes, ensuring scalability and repeatability.

### For future use:

1. Execute each code cell sequentially by pressing Shift + Enter.
2. Adhere to the provided instructions and comments within the code cells for guidance throughout the reformatting process.
3. Tailor the code and parameters to suit the specific requirements and characteristics of the file structures of the data obtained after 2023.

#### Environment Setup:

Before starting the reformatting process, make sure the required Python libraries are installed; pandas and numpy are the only dependencies outside the standard library. Managing the Python environment with Anaconda or a virtual environment is recommended.
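
As a quick pre-flight check, the sketch below reports which of the modules imported by this notebook cannot be found in the active environment. The helper name `find_missing_modules` is an illustrative assumption and is not part of the notebook itself.

```python
import importlib.util

# Modules imported by this notebook; pandas and numpy are the third-party ones.
REQUIRED_MODULES = ["os", "glob", "wave", "csv", "pandas", "numpy"]


def find_missing_modules(module_names):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {missing}")
else:
    print("All required modules are available.")
```

Any module reported as missing can then be installed with the environment manager in use before running the remaining cells.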
__________________________________________________________________________________________________________________________________________________


## 1. Data Extraction

In [1]:
#py v 3.12.3
import os
import glob
import wave
import csv

#Not in base python
import pandas as pd
import numpy as np

In [2]:
#specify years for data collection
year_list = ['2019', '2021', '2022', '2023']

In [3]:
#Create folders in parent directory (./analysis) to store wofr, weto, and negative audio samples (non-weto & non-wofr)

parent = os.path.dirname(os.getcwd())

wofr_path = os.path.join(parent, 'data', 'wofr')
weto_path = os.path.join(parent, 'data', 'weto')
negative_path = os.path.join(parent, 'data', 'negative')

#exist_ok=True avoids an error when a folder is already present
for path in (wofr_path, weto_path, negative_path):
    os.makedirs(path, exist_ok=True)

In [4]:
#Collect filepaths for annual data - preference shown for copies with naming convention that is consistent with WildTrax uploads

wav_root = os.path.join('\\\\BAN-NAS-DATA', 'EI_Monitoring', 'Amphibian recordings')

filepath_wav = [glob.glob(os.path.join(wav_root, '2019', 'Site*', '2019*', '*.wav')),
                glob.glob(os.path.join(wav_root, '2021', '*for WildTrax', '2021*', '*.wav')),
                glob.glob(os.path.join(wav_root, '2022', '*WildTrax copies', '*.wav')),
                glob.glob(os.path.join(wav_root, '2023', '*.wav'))]


#Create dictionary with years as the keys and the audio .wav filepaths as values

wav_dict = dict(zip(year_list, filepath_wav))
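
Because `glob.glob` returns an empty list rather than raising an error when a pattern matches nothing (for example, if the network share is unreachable), it can help to check the number of files found per year before continuing. The sketch below is a minimal illustration; the helper names and the example dictionary are hypothetical stand-ins for `wav_dict`.

```python
def count_files_per_year(file_dict):
    """Map each year to the number of filepaths collected for it."""
    return {year: len(paths) for year, paths in file_dict.items()}


def years_without_files(file_dict):
    """List the years whose glob pattern matched no files."""
    return [year for year, paths in file_dict.items() if not paths]


# Hypothetical stand-in for wav_dict; the paths are made up for illustration.
example_wav_dict = {
    '2019': ['site1/a.wav', 'site1/b.wav'],
    '2021': [],
    '2022': ['c.wav'],
}

counts = count_files_per_year(example_wav_dict)
empty_years = years_without_files(example_wav_dict)
print(counts)
print(empty_years)
```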

In [5]:
#Collect metadata (main_report and recording_report CSVs from WildTrax)

metadata_root = os.path.join('\\\\Ban-files-01', 'groups', 'Resource Conservation', 'EI Monitoring', 'Amphibians', '_ARUs', 'Data and metadata', 'Data from WildTrax')
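#glob does not guarantee any ordering; sorted() makes the pairing with year_list deterministic,
#assuming the per-year folder names sort chronologically
filepath_metadata_tags = sorted(glob.glob(os.path.join(metadata_root, '*', '*main_report.csv')))
filepath_metadata_index = sorted(glob.glob(os.path.join(metadata_root, '*', '*recording_report.csv')))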


#Create dictionary with years as keys and the 'main_report' .csv filepaths as values

metadata_tag_dict = dict(zip(year_list, filepath_metadata_tags))


#Create dictionary with years as keys and the 'recording_report' .csv filepaths as values

metadata_index_dict = dict(zip(year_list, filepath_metadata_index))
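
Pairing years with report files through `zip` depends on the order in which the paths are listed. One possible alternative, sketched below, keys each path by a four-digit year found in its parent folder name or file name. This assumes the year actually appears in those names, which is not confirmed by this notebook; the helper `map_paths_by_year` and the example paths are hypothetical.

```python
import os
import re

# A standalone four-digit year starting with "20", not embedded in a longer number.
YEAR_PATTERN = re.compile(r'(?<!\d)(20\d{2})(?!\d)')


def map_paths_by_year(paths, years):
    """Map each year to the first path whose folder or file name contains it."""
    wanted = set(years)
    mapping = {}
    for path in paths:
        folder = os.path.basename(os.path.dirname(path))
        filename = os.path.basename(path)
        # Check the parent folder name first, then the file name.
        for part in (folder, filename):
            match = YEAR_PATTERN.search(part)
            if match and match.group(1) in wanted:
                mapping.setdefault(match.group(1), path)
                break
    return mapping


# Made-up paths for illustration only.
example_paths = [
    'root/2021 data/x_main_report.csv',
    'root/2019 data/y_main_report.csv',
    'root/misc/z_main_report.csv',
]
example_mapping = map_paths_by_year(example_paths, ['2019', '2021', '2022', '2023'])
print(example_mapping)
```

Years with no matching path are simply absent from the result, which makes a missing report easy to detect before it causes a lookup error later.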

In [6]:
#Create dictionary mapping 'recording_id' to 'source_file_name' using the recording_report CSVs.
#This step is necessary to index tags from WildTrax back to the original wav files

fileindex_dict = {}

for year in year_list:
    with open(metadata_index_dict[year], 'r', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            #Column 7 holds the recording_id; the second-to-last column holds the source file name
            if row[7] not in fileindex_dict: #Prevent duplicate entries
                fileindex_dict[row[7]] = row[-2]
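
Positional indices such as `row[7]` and `row[-2]` break silently if WildTrax changes the column order. A possible alternative, sketched here, looks the columns up by name with `csv.DictReader`. It assumes the report has header columns named `recording_id` and `source_file_name`; the helper `build_file_index` and the sample rows are hypothetical.

```python
import csv
import io


def build_file_index(csv_file, key_col='recording_id', value_col='source_file_name'):
    """Map key_col to value_col, keeping the first value seen for each key."""
    index = {}
    for row in csv.DictReader(csv_file):
        index.setdefault(row[key_col], row[value_col])
    return index


# Made-up sample standing in for a recording_report CSV.
sample_report = io.StringIO(
    "recording_id,source_file_name\n"
    "101,SITE-1_20190505_190000.wav\n"
    "101,SITE-1_duplicate.wav\n"
    "102,SITE-2_20190506_200000.wav\n"
)
sample_index = build_file_index(sample_report)
print(sample_index)
```

With this approach the header row is consumed as column names instead of being stored as a dictionary entry.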

In [7]:
#Function to accept a wav filepath and cut it according to the specified start time and clip duration
#The resulting file is written to outpathfile

def snip_file(filepath, start, duration, outpathfile, min_snip_length = 2):
    # file to extract the snippet from
    with wave.open(filepath, "rb") as infile:
        # get file data
        nchannels = infile.getnchannels()
        sampwidth = infile.getsampwidth()
        framerate = infile.getframerate()
        # extract data
        # adjust so that the minimum extraction length is min_snip_length seconds
        if duration > min_snip_length:
            # set position in wave to start of segment
            infile.setpos(int(start * framerate))
            data = infile.readframes(int(duration * framerate))
        else:
            # centre a window of min_snip_length seconds on the tag, clamped to the start of the file
            start_adj = max(0, start + (duration/2) - (min_snip_length/2))
            infile.setpos(int(start_adj * framerate))
            data = infile.readframes(int(min_snip_length * framerate))

    # write the extracted data to a new file
    with wave.open(outpathfile, 'wb') as outfile:
        outfile.setnchannels(nchannels)
        outfile.setsampwidth(sampwidth)
        outfile.setframerate(framerate)
        # one frame holds one sample for every channel
        outfile.setnframes(len(data) // (sampwidth * nchannels))
        outfile.writeframes(data)


#Row-wise function that reads the recording year, the tag start time, and the tag duration from a row,
#locates the matching wav filepath, and clips it.

def snip_row(df_row, index, file_dict, outpath):
    recording_time = df_row['recording_date_time']
    year = str(recording_time.year)
    start = df_row['detection_time']
    duration = df_row['tag_duration']
    recording_id = df_row['recording_id']
    filename = index[recording_id]
    matching_filepaths = [filepath for filepath in file_dict[year] if filename in filepath]
    if len(matching_filepaths) > 0:
        filepath = matching_filepaths[0]
        #Output name: original file name without '.wav', plus the start time and duration of the tag
        outpathfile = os.path.join(outpath, f"{filename[:-4]}_{start}_{duration}.wav")
        snip_file(filepath = filepath, start = start, duration = duration, outpathfile = outpathfile)
    else:
        print(f"No matching wave file for {filename}")


#Wrapper to accept csv filepath as input and process with above functions

def snip_csv(wildtrax_mainreport_filepath, species_code, index, outpath, file_dict):
    #Validate the species code before reading the CSV
    species_types = ['weto', 'wofr']
    if species_code.casefold() not in species_types:
        raise ValueError("Invalid species_code. Expected one of: %s" % species_types)
    df = pd.read_csv(wildtrax_mainreport_filepath, parse_dates = ['recording_date_time'], dtype = {'recording_id': str})
    #na = False treats rows with a missing species_code as non-matches instead of producing NaN in the mask
    species_mask = df['species_code'].str.contains(species_code, case = False, na = False)
    df[species_mask].apply(snip_row, index = index, outpath = outpath, file_dict = file_dict, axis = 1)
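
The clipping window chosen by `snip_file` can be worked through in isolation. The sketch below reproduces only the window arithmetic (no audio is read or written): a tag longer than the minimum snip length is clipped as tagged, while a shorter tag is replaced by a window of the minimum length centred on the tag and clamped so that it does not start before the beginning of the file. The helper name `clip_window` is an illustrative assumption.

```python
def clip_window(start, duration, min_snip_length=2):
    """Return (clip_start, clip_length) in seconds for a tag."""
    if duration > min_snip_length:
        # Long enough: clip exactly the tagged segment.
        return start, duration
    # Too short: centre a window of min_snip_length seconds on the tag,
    # clamped so it never starts before the beginning of the recording.
    clip_start = max(0, start + duration / 2 - min_snip_length / 2)
    return clip_start, min_snip_length


print(clip_window(3, 5))      # tag longer than the minimum length
print(clip_window(10, 1))     # short tag, window centred on 10.5 s
print(clip_window(0.2, 0.5))  # short tag near the start of the file
```

For a 1 s tag starting at 10 s, the window starts at 10 + 0.5 - 1 = 9.5 s and lasts 2 s. For a 0.5 s tag starting at 0.2 s, the centred window would start at -0.55 s, so it is clamped to 0 s.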

In [17]:
#Populate folder with snipped audio samples for weto and wofr

for year in year_list:
    snip_csv(metadata_tag_dict[year], species_code = 'WETO', index = fileindex_dict , outpath = weto_path, file_dict = wav_dict)
    snip_csv(metadata_tag_dict[year], species_code = 'WOFR', index = fileindex_dict , outpath = wofr_path, file_dict = wav_dict)

No matcing wave file for BANFF-A-11_20190505_190000.wav
No matcing wave file for BANFF-A-11_20190505_200000.wav
No matcing wave file for BANFF-A-11_20190505_205345.wav
No matcing wave file for BANFF-A-147_20190508_220000.wav
No matcing wave file for BANFF-A-11_20190505_190000.wav
No matcing wave file for BANFF-A-11_20190505_190000.wav
No matcing wave file for BANFF-A-11_20190505_190000.wav
No matcing wave file for BANFF-A-11_20190505_190000.wav
No matcing wave file for BANFF-A-11_20190505_190000.wav
No matcing wave file for BANFF-A-11_20190505_200000.wav
No matcing wave file for BANFF-A-11_20190505_200000.wav
No matcing wave file for BANFF-A-11_20190505_200000.wav
No matcing wave file for BANFF-A-11_20190505_200000.wav
No matcing wave file for BANFF-A-11_20190505_200000.wav
No matcing wave file for BANFF-A-11_20190505_200000.wav
No matcing wave file for BANFF-A-11_20190505_205345.wav
No matcing wave file for BANFF-A-11_20190505_205345.wav
No matcing wave file for BANFF-A-11_20190505_20