## Amphibian Audio Data ETL Notebook

### Overview:

This notebook reformats amphibian audio data collected from 2019 through 2023 (the years processed are 2019, 2021, 2022, and 2023).

### Objective:

The primary objective of this notebook is to reformat raw audio recordings into an analytically usable format and create a common file structure. The audio recordings are clipped into positive and negative samples, as defined by the annual summary reports provided through WildTrax.

### Contents:

1. **Data Extraction**:
   - Accessing and retrieving historical audio files from internal repositories.
   - Reviewing the structure and organization of audio data collected during the specified timeframe.
   
2. **Data Preprocessing**:
   - Conducting thorough cleaning and filtering of historical audio recordings.
   - Standardizing audio formats and associated metadata to ensure consistency across datasets.
   
3. **Feature Extraction**:
   - Extracting pertinent features from historical audio signals to aid in subsequent analysis.
   - Calculating acoustic metrics essential for amphibian species identification and classification within the designated timeframe.
   
4. **Data Loading**:
   - Efficiently loading processed audio data into structured databases or file systems.
   - Establishing robust data pipelines for automated ETL processes, ensuring scalability and repeatability.

### For future use:

1. Execute each code cell sequentially by pressing Shift + Enter.
2. Adhere to the provided instructions and comments within the code cells for guidance throughout the reformatting process.
3. Tailor the code and parameters to suit the specific requirements and characteristics of the file structures of the data obtained after 2023.

#### Environment Setup:

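Before running the notebook, install the required Python libraries. Section 1 was run with Python 3.12.3 and needs pandas and NumPy; it also uses the standard-library `audioop` module, which was removed in Python 3.13, so it requires Python 3.12 or earlier. Section 2 was run in a separate conda environment with Python 3.9.19 and needs TensorFlow, tensorflow-io, NumPy, and matplotlib. Anaconda or virtual environments are recommended for keeping the two environments separate.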
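
A minimal sketch of a pre-flight check, assuming only that the third-party packages imported in section 1 (pandas and NumPy) should be importable before any cell is run. The helper name `missing_packages` is hypothetical and is not part of the notebook.

```python
import importlib.util
import sys

# Third-party packages imported in section 1 of this notebook.
# Section 2 additionally imports tensorflow, tensorflow_io and matplotlib.
REQUIRED = ["pandas", "numpy"]

def missing_packages(names):
    """Return the top-level package names that cannot be located (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Python", sys.version.split()[0])
print("Missing packages:", missing_packages(REQUIRED) or "none")
```

Running this in each environment before the first cell shows which installs are still needed without triggering an `ImportError` halfway through the notebook.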
__________________________________________________________________________________________________________________________________________________


## 1. Data Extraction

In [20]:
#py v 3.12.3
import os
import glob
import wave
import csv
import audioop

#Not in base python
import pandas as pd
import numpy as np

In [21]:
#specify years for data collection
year_list = ['2019', '2021', '2022', '2023']

In [22]:
#Create folders under the data directory (../data/, relative to the working directory) to store positive training samples for wofr and weto

train_parent = os.path.join(os.path.dirname(os.getcwd()), 'data')

train_wofr_path = os.path.join(train_parent, 'wofr', 'train', 'positive')
os.makedirs(train_wofr_path, exist_ok = True) #no error if the folder already exists

train_weto_path = os.path.join(train_parent, 'weto', 'train', 'positive')
os.makedirs(train_weto_path, exist_ok = True)


#Create matching folders to store negative training samples (clips from recordings without weto, or without wofr)

train_weto_negative_path = os.path.join(train_parent, 'weto', 'train', 'negative')
os.makedirs(train_weto_negative_path, exist_ok = True)

train_wofr_negative_path = os.path.join(train_parent, 'wofr', 'train', 'negative')
os.makedirs(train_wofr_negative_path, exist_ok = True)

In [23]:
#Collect filepaths for annual data - preference shown for copies with naming convention that is consistent with WildTrax uploads

wav_root = os.path.join('\\\\BAN-NAS-DATA', 'EI_Monitoring', 'Amphibian recordings')

filepath_wav = [glob.glob(os.path.join(wav_root, '2019', 'Site*', '2019*', '*.wav')),
                glob.glob(os.path.join(wav_root, '2021', '*for WildTrax', '2021*', '*.wav')),
                glob.glob(os.path.join(wav_root, '2022', '*WildTrax copies', '*.wav')),
                glob.glob(os.path.join(wav_root, '2023', '*.wav'))]


#Create dictionary with years as the keys and the audio .wav filepaths as values

wav_dict = dict(zip(year_list, filepath_wav))

In [24]:
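#Collect metadata (main_report and recording_report CSVs from WildTrax)
#NOTE: zip() below pairs these lists with year_list by position, and glob does not guarantee any order, so the results are sorted.
#      This assumes one report folder per year in year_list, named so that the folders sort chronologically.

metadata_root = os.path.join('\\\\Ban-files-01', 'groups', 'Resource Conservation', 'EI Monitoring', 'Amphibians', '_ARUs', 'Data and metadata', 'Data from WildTrax')
filepath_metadata_tags = sorted(glob.glob(os.path.join(metadata_root, '*', '*main_report.csv')))
filepath_metadata_index = sorted(glob.glob(os.path.join(metadata_root, '*', '*recording_report.csv')))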


#Create dictionary with years as keys and the 'main_report' .csv filepaths as values

metadata_tag_dict = dict(zip(year_list, filepath_metadata_tags))


#Create dictionary with years as keys and the 'recording_report' .csv filepaths as values

metadata_index_dict = dict(zip(year_list, filepath_metadata_index))
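
Pairing reports with years through `zip` depends on the order of the file lists. A minimal sketch of an order-independent alternative follows, assuming each report path contains its four-digit year somewhere in a folder or file name. The function `paths_by_year` and the example paths are hypothetical and only illustrate the idea.

```python
import re

def paths_by_year(filepaths, years):
    """Return {year: path} by searching each path for a four-digit year.

    Raises ValueError when a year has no match or more than one match, so a
    missing or duplicated report is caught instead of silently mis-paired.
    """
    result = {}
    for year in years:
        # Match the year only when it is not part of a longer run of digits
        pattern = re.compile(rf"(?<!\d){year}(?!\d)")
        matches = [p for p in filepaths if pattern.search(p)]
        if len(matches) != 1:
            raise ValueError(f"Expected one file for {year}, found {len(matches)}")
        result[year] = matches[0]
    return result

# Hypothetical folder and file names, deliberately out of order
example_paths = [
    "reports/2022_project/amphibians_main_report.csv",
    "reports/2019_project/amphibians_main_report.csv",
    "reports/2023_project/amphibians_main_report.csv",
    "reports/2021_project/amphibians_main_report.csv",
]
paired = paths_by_year(example_paths, ["2019", "2021", "2022", "2023"])
print(paired["2019"])
```

The explicit error makes a missing year visible immediately, whereas `zip` would quietly shift every later report onto the wrong year.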

In [25]:
#Create dictionary mapping 'recording_id' to 'source_file_name', both read from recording_report.csv.
#This step is necessary to index tags from WildTrax (main_report.csv) back to the original wav files
#NOTE: The source filenames listed in the WildTrax metadata (recording_report.csv) do not always match existing wav files
#       This appears to be true for all of the 2019 data

fileindex_dict = {}

for year in year_list:
    #fileindex_dict = {}
    with open(metadata_index_dict[year],'r') as f: 
        reader = csv.reader(f)
        for row in reader:
            if row[7] not in fileindex_dict.keys(): #Prevent duplicate entries
                fileindex_dict.update({row[7]: row[-2]})

In [26]:
#Create dict of all possible tag codes and common names that have been used for amphibian recordings in the WildTrax database

tagname_tagcode_dict = {}

for year in year_list:
    tag_df = pd.read_csv(metadata_tag_dict[year], usecols = ['species_code', 'species_common_name']) #read both columns in a single pass
    tmp_dict = dict(zip(tag_df['species_code'], tag_df['species_common_name']))
    tagname_tagcode_dict = {**tagname_tagcode_dict, **tmp_dict}

In [27]:
#Create three functions that call one another (snip_csv -> snip_row -> snip_file) to cut a wav file at a specified start time and clip duration
#The resulting clip is written to outpathfile

def snip_file(filepath, start, duration, outpathfile, min_snip_length = 2):
    # filepath is the wav file to extract the snippet from

    if duration != duration or start != start: # NaN is the only value that is not equal to itself
        print(f"{filepath} contains NaN for start and/or duration")
    else:
        with wave.open(filepath, "rb") as infile:
            # get file data
            nchannels = infile.getnchannels()
            sampwidth = infile.getsampwidth()
            framerate = infile.getframerate()
            # extract data
            # adjust so that the minimum extraction length is min_snip_length seconds
            if duration > min_snip_length:
                # set position in wave to start of segment
                infile.setpos(int(start * framerate))
                data = infile.readframes(int(duration * framerate))
            else:
                # Centre a window of min_snip_length seconds on the middle of the tag
                start_adj = start + (duration/2) - (min_snip_length/2)
                # Set start time floor at 0s
                if start_adj < 0:
                    start_adj = 0
                infile.setpos(int((start_adj) * framerate))
                data = infile.readframes(int(min_snip_length * framerate))

        # write the extracted data to a new file
        with wave.open(outpathfile, 'w') as outfile:
            outfile.setnchannels(nchannels)
            outfile.setsampwidth(sampwidth)
            outfile.setframerate(framerate)
            outfile.setnframes(int(len(data) / sampwidth))
            outfile.writeframes(data)


#Row-wise function that extracts the year, the start time, and the duration of a tag, then locates the matching wav filepath and clips it.

def snip_row(df_row, index, file_dict, outpath):
    recording_datetime = df_row['recording_date_time']
    year = str(recording_datetime.year)
    start = df_row['detection_time']
    duration = df_row['tag_duration']
    recording_id = df_row['recording_id']
    filename = index[recording_id]
    if pd.isna(start) or pd.isna(duration): #int() below cannot convert NaN, so skip these rows here
        print(f"{filename} contains NaN for start and/or duration")
        return
    matching_filepaths = [filepath for filepath in file_dict[year] if filename in filepath]
    if len(matching_filepaths) > 0:
        filepath = matching_filepaths[0]
        outpathfile = os.path.join(outpath, filename[:-4] + '_' + str(int(start)) + '_' + str(int(duration)) + '.wav')
        snip_file(filepath = filepath, start = start, duration = duration, outpathfile = outpathfile)
    else:
        print(f"No matching wave file for {filename}")


#Wrapper to accept csv filepath as input and process with above functions

def snip_csv(wildtrax_mainreport_filepath, species_code, index, outpath, file_dict, regex= False):
    df = pd.read_csv(wildtrax_mainreport_filepath, parse_dates = ['recording_date_time'], dtype = {'recording_id': str})
    species_types = list(tagname_tagcode_dict.keys()) #relies on the tag dictionary built in an earlier cell
    if species_code.casefold() not in [str.casefold(accepted_code) for accepted_code in species_types] and not regex:
        raise ValueError(f"Invalid species_code. Expected one of: {species_types}")
    species_mask = df['species_code'].str.contains(species_code, case = False, regex = regex)
    df[species_mask].apply(snip_row, index = index, outpath = outpath, file_dict = file_dict, axis = 1)
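
The window arithmetic inside `snip_file` can be checked without any audio files. The sketch below isolates that calculation, assuming the short-tag threshold is meant to equal `min_snip_length`. The function name `snip_window` is hypothetical.

```python
def snip_window(start, duration, min_snip_length=2.0):
    """Return (clip_start, clip_length) in seconds for a tagged detection.

    Tags longer than min_snip_length are clipped as tagged. Shorter tags get
    a window of min_snip_length seconds centred on the middle of the tag,
    with the window start floored at 0 s.
    """
    if duration > min_snip_length:
        return start, duration
    centred_start = start + duration / 2 - min_snip_length / 2
    return max(centred_start, 0.0), min_snip_length

print(snip_window(10.0, 5.0))  # long tag: clipped as tagged
print(snip_window(10.0, 1.0))  # short tag: widened around its centre
print(snip_window(0.2, 0.5))   # short tag near the file start: floored at 0 s
```

Multiplying both values by the file's frame rate gives the frame positions passed to `setpos` and `readframes`.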

##### Extract positive samples

In [28]:
#Populate folder with snipped audio samples for weto and wofr

for year in year_list:
    snip_csv(metadata_tag_dict[year], species_code = 'WETO', index = fileindex_dict , outpath = train_weto_path, file_dict = wav_dict)
    snip_csv(metadata_tag_dict[year], species_code = 'WOFR', index = fileindex_dict , outpath = train_wofr_path, file_dict = wav_dict)

No matching wave file for BANFF-A-11_20190505_190000.wav
No matching wave file for BANFF-A-11_20190505_200000.wav
No matching wave file for BANFF-A-11_20190505_205345.wav
No matching wave file for BANFF-A-147_20190508_220000.wav
No matching wave file for BANFF-A-11_20190505_190000.wav
No matching wave file for BANFF-A-11_20190505_190000.wav
No matching wave file for BANFF-A-11_20190505_190000.wav
No matching wave file for BANFF-A-11_20190505_190000.wav
No matching wave file for BANFF-A-11_20190505_190000.wav
No matching wave file for BANFF-A-11_20190505_200000.wav
No matching wave file for BANFF-A-11_20190505_200000.wav
No matching wave file for BANFF-A-11_20190505_200000.wav
No matching wave file for BANFF-A-11_20190505_200000.wav
No matching wave file for BANFF-A-11_20190505_200000.wav
No matching wave file for BANFF-A-11_20190505_200000.wav
No matching wave file for BANFF-A-11_20190505_205345.wav
No matching wave file for BANFF-A-11_20190505_205345.wav
No matching wave file for BANF

In [29]:
#View the tag codes and corresponding names

for code in sorted(tagname_tagcode_dict): print(code, ":", tagname_tagcode_dict[code]) 

CANG : Canada Goose
COSN : Common Snipe
CSFR : Columbia Spotted Frog
HETF : Heavy traffic
HETN : Heavy train
HEWI : Heavy wind
LIBA : Light Background Noise
LIRA : Light rain
LITF : Light traffic
MOAI : Moderate aircraft
MOBA : Moderate Background Noise
MORA : Moderate rain
MOTF : Moderate traffic
MOTN : Moderate train
MOWI : Moderate wind
NONE : NONE
NSWO : Northern Saw-whet Owl
RESQ : Red Squirrel
UNDU : Unidentified Duck
UNFR : Unidentified Frog
UNKN : Unidentified signal
UNPA : Unidentified Passerine
UNWO : Unidentified Woodpecker
WETO : Western Toad
WISN : Wilson's Snipe
WOFR : Wood Frog


##### Extract negative samples

In [30]:
#Function to subset the main_report csvs from WildTrax to only include recordings that DO NOT contain the specified species code tag
#NOTE: file_dict is accepted for consistency with the calls below but is not used inside the function

def get_neg_filepaths(neg_species:str, main_report_filepath:str, file_dict):
    df = pd.read_csv(main_report_filepath, parse_dates = ['recording_date_time'], dtype = {'recording_id': str}) #read in csv
    neg_mask = df[['recording_id', 'species_code']].groupby(['recording_id'])['species_code'].apply(lambda x: any(x == neg_species))
    neg_dict = dict(zip(neg_mask.index, neg_mask.values)) #dict of key= recording_id and value= target species presence
    neg_record_ids = [record_id for record_id, presence in neg_dict.items() if not presence] #extract record_ids of audio files that do not contain target species
    #create mask: read_csv stores missing codes as real NaN values, which never equal the string 'NaN', so test with notna()
    non_null_mask = (df['species_code'].notna()) & (df['species_code'] != 'NONE') & (df['recording_id'].isin(neg_record_ids))
    masked_df = df[non_null_mask] #apply mask
    return masked_df
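
The selection rule for negative samples can be illustrated on a handful of rows. This is a sketch in plain Python rather than pandas, so it runs without third-party packages; the function `negative_rows` and the recording IDs are hypothetical, and only the rule itself mirrors `get_neg_filepaths`.

```python
def negative_rows(rows, target):
    """Keep tags from recordings that never contain the target species.

    rows is a list of (recording_id, species_code) tuples. A recording is
    excluded entirely if any of its tags equals target; rows with a missing
    code (None) or the 'NONE' code are dropped as well.
    """
    positive_ids = {rec_id for rec_id, code in rows if code == target}
    return [
        (rec_id, code)
        for rec_id, code in rows
        if rec_id not in positive_ids and code is not None and code != "NONE"
    ]

rows = [
    ("101", "WETO"), ("101", "MOTN"),  # contains the target: whole recording excluded
    ("102", "WOFR"), ("102", "LIRA"),  # no WETO present: both tags become negatives
    ("103", "NONE"),                   # nothing tagged: excluded
    ("104", None),                     # missing code: excluded
]
weto_negatives = negative_rows(rows, "WETO")
print(weto_negatives)
```

Note that a train tag in recording 101 is not used as a WETO negative, because a toad call elsewhere in the same recording excludes the whole recording.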

In [31]:
#Identify and extract audio clips that DO NOT contain the specified target species, but DO contain non-target species/sounds (e.g. train)
#populate the negative directories

for year in year_list:
    #Apply function to create df that does not contain target species, but does contain tagged vocalizations/sounds
    weto_masked_df = get_neg_filepaths(neg_species ='WETO', main_report_filepath = metadata_tag_dict[year], file_dict = wav_dict[year])
    wofr_masked_df = get_neg_filepaths(neg_species ='WOFR', main_report_filepath = metadata_tag_dict[year], file_dict = wav_dict[year])

    #apply snip_row function (created above) to populate negative folders
    weto_masked_df.apply(snip_row, index = fileindex_dict, outpath = train_weto_negative_path, file_dict = wav_dict, axis = 1)
    wofr_masked_df.apply(snip_row, index = fileindex_dict, outpath = train_wofr_negative_path, file_dict = wav_dict, axis = 1)

No matching wave file for BANFF-A-11_20190505_170000.wav
No matching wave file for BANFF-A-11_20190505_170000.wav
No matching wave file for BANFF-A-11_20190505_170000.wav
No matching wave file for BANFF-A-11_20190505_170000.wav
No matching wave file for BANFF-A-11_20190505_170000.wav
No matching wave file for BANFF-A-11_20190505_180000.wav
No matching wave file for BANFF-A-11_20190505_180000.wav
No matching wave file for BANFF-A-11_20190505_180000.wav
No matching wave file for BANFF-A-11_20190505_180000.wav
No matching wave file for BANFF-A-11_20190505_180000.wav
No matching wave file for BANFF-A-147_20190508_160000.wav
No matching wave file for BANFF-A-147_20190508_160000.wav
No matching wave file for BANFF-A-147_20190508_160000.wav
No matching wave file for BANFF-A-147_20190508_160000.wav
No matching wave file for BANFF-A-147_20190508_160000.wav
No matching wave file for BANFF-A-147_20190508_160000.wav
No matching wave file for BANFF-A-147_20190508_160000.wav
No matching wave file fo

#### Partition the audio files into a training and testing set
Split the audio files randomly (using np.random) such that 10% of each data type (weto/wofr; pos/neg) is withheld for testing.
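
A minimal sketch of the 10% hold-out, written so that it only returns two lists and moves no files. It uses the standard-library `random` module in place of `np.random` so it runs without NumPy, which means the files it picks will differ from those the notebook picks. The function `split_holdout` and the clip names are hypothetical.

```python
import random

def split_holdout(filenames, test_fraction=0.1, seed=42):
    """Return (train, test) lists with test_fraction of the files held out.

    The test size is truncated with int(), matching the cell below, and the
    sample is drawn without replacement.
    """
    rng = random.Random(seed)  # local generator, so the global state is untouched
    n_test = int(len(filenames) * test_fraction)
    test = set(rng.sample(filenames, n_test))
    train = [name for name in filenames if name not in test]
    return train, sorted(test)

# 313 hypothetical clip names, the same count as the weto positive folder
names = [f"clip_{i:03d}.wav" for i in range(313)]
train, test = split_holdout(names)
print(len(train), len(test))
```

Because the test size is truncated, small folders lose slightly less than 10%: 64 files give 6 test clips, not 7.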

In [32]:
#Create folders under the data directory (../data/, relative to the working directory) to store positive testing samples for wofr and weto

test_parent = os.path.join(os.path.dirname(os.getcwd()), 'data')

test_wofr_path = os.path.join(test_parent, 'wofr', 'test', 'positive')
os.makedirs(test_wofr_path, exist_ok = True) #no error if the folder already exists

test_weto_path = os.path.join(test_parent, 'weto', 'test', 'positive')
os.makedirs(test_weto_path, exist_ok = True)


#Create matching folders to store negative testing samples (clips from recordings without weto, or without wofr)

test_weto_negative_path = os.path.join(test_parent, 'weto', 'test', 'negative')
os.makedirs(test_weto_negative_path, exist_ok = True)

test_wofr_negative_path = os.path.join(test_parent, 'wofr', 'test', 'negative')
os.makedirs(test_wofr_negative_path, exist_ok = True)

In [33]:
print(len(os.listdir(train_weto_path)), len(os.listdir(test_weto_path)))
print(len(os.listdir(train_weto_negative_path)), len(os.listdir(test_weto_negative_path)))
print(len(os.listdir(train_wofr_path)), len(os.listdir(test_wofr_path)))
print(len(os.listdir(train_wofr_negative_path)), len(os.listdir(test_wofr_negative_path)))

313 0
779 0
1038 0
64 0


In [34]:
training_directories = {"weto_pos":[train_weto_path, test_weto_path], 
                        "weto_neg":[train_weto_negative_path, test_weto_negative_path], 
                        "wofr_pos":[train_wofr_path, test_wofr_path], 
                        "wofr_neg":[train_wofr_negative_path, test_wofr_negative_path]}

np.random.seed(42) #Set seed for reproducibility

for key, value in training_directories.items():
    train_dir = value[0]
    test_dir = value[1]
    if len(os.listdir(test_dir)) ==0:
        files = glob.glob(os.path.join(train_dir, '*.wav'))
        random_files = np.random.choice(a = files,                  
                                        size =  int(len(files)*.1), 
                                        replace = False)            #Sample without replacement
        for filepath in random_files:
            filename = os.path.relpath(filepath, train_dir)

            test_filepath = os.path.join(test_dir, filename)
            os.rename(filepath, test_filepath)

In [35]:
print(len(os.listdir(train_weto_path)), len(os.listdir(test_weto_path)))
print(len(os.listdir(train_weto_negative_path)), len(os.listdir(test_weto_negative_path)))
print(len(os.listdir(train_wofr_path)), len(os.listdir(test_wofr_path)))
print(len(os.listdir(train_wofr_negative_path)), len(os.listdir(test_wofr_negative_path)))

282 31
702 77
935 103
58 6


In [74]:
#Audio clips of three seconds will be used to train the model
#Extracted audio longer than 3s is split into sequential 3s clips, and the original file is then removed.

parent = os.path.join(os.path.dirname(os.getcwd()), 'data')

train_weto_path = os.path.join(parent, 'weto', 'train', 'positive')
train_wofr_path = os.path.join(parent, 'wofr', 'train', 'positive')

train_no_weto_path = os.path.join(parent, 'weto', 'train', 'negative')
train_no_wofr_path = os.path.join(parent, 'wofr', 'train', 'negative')

for dir in [train_weto_path, train_wofr_path, train_no_weto_path, train_no_wofr_path]:
    for filename in os.listdir(dir):
        filepath = os.path.join(dir, filename)
        with wave.open(filepath, "rb") as infile:
            framerate = infile.getframerate()
            nchannels = infile.getnchannels()
            sampwidth = infile.getsampwidth()
            frames = infile.getnframes()
            n_3s_segments = round(frames/(framerate*3)) # round() gives the nearest number of 3s segments (not a floor), so the final segment may be shorter than 3s
            start = 0
            for i in range(0, n_3s_segments):
                infile.setpos(1+int(start * framerate))
                data = infile.readframes(int(3 * framerate))
                outpathfile = os.path.join(dir, str.split(filename, ".wav")[0] + "_" + str(start) + "-" +str(start+3) + ".wav")
                
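                #Resample to 16 kHz
                #Reduces file size and computation while retaining content below 8 kHz (the Nyquist limit at 16 kHz)
                try:
                    data_16k = audioop.ratecv(data,       #data
                                            sampwidth,    #width
                                            nchannels,    #nchannels
                                            framerate,    #inrate
                                            16000,        #outrate
                                            None)         #state
                except Exception:
                    print(f"Failed to downsample {filepath}")
                    start += 3 #keep the segment counter in step before skipping
                    continue   #skip the write so a stale or undefined data_16k is never saved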
                with wave.open(outpathfile, 'w') as outfile:
                    outfile.setnchannels(nchannels)
                    outfile.setsampwidth(sampwidth)
                    outfile.setframerate(16000)
                    outfile.setnframes(int(len(data_16k[0]) / sampwidth))
                    outfile.writeframes(data_16k[0])    
                start += 3
        try:
            os.remove(filepath)
        except:
            print(f"Could not remove {filepath}")
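
The segment boundaries produced by the cell above can be previewed without reading any audio. This sketch reproduces only the counting logic, including its use of `round`; the function name `segment_bounds` is hypothetical.

```python
def segment_bounds(n_frames, framerate, segment_s=3):
    """Return [(start_s, end_s), ...] for the clips cut from one file.

    The segment count uses round(), as in the cell above, so a trailing
    remainder of more than half a segment still produces a final clip that
    holds fewer than segment_s seconds of audio.
    """
    n_segments = round(n_frames / (framerate * segment_s))
    return [(i * segment_s, (i + 1) * segment_s) for i in range(n_segments)]

print(segment_bounds(10 * 44100, 44100))  # 10 s clip at 44.1 kHz
print(segment_bounds(2 * 44100, 44100))   # 2 s clip (the minimum snip length)
print(segment_bounds(1 * 44100, 44100))   # 1 s clip
```

A 2 s clip still yields one segment labelled 0-3 that contains only 2 s of audio, while a 1 s clip yields none, so in the cell above such a file would be removed without any replacement being written.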

## 2. Data Transformation

In [1]:
#Switch to conda environment (for TensorFlow)
#Signal processing is done with TensorFlow rather than librosa to ensure consistency with incoming data during model implementation

#py v 3.9.19
import os
import wave

#Not in base python
import tensorflow as tf 
import tensorflow_io as tfio
import numpy as np
import matplotlib.pyplot as plt

In [116]:
file_contents = tf.io.read_file(filename)
wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
wav = tfio.audio.resample(
                            tf.squeeze(wav, axis=-1), #remove trailing axis
                            rate_in=tf.cast(sample_rate, dtype=tf.int64), #specify input sampling rate as int64
                            rate_out=16000 #specify desired output sampling rate
                          )

wav = wav[:48000]
zero_padding = tf.zeros([48000] - tf.shape(wav), dtype=tf.float32)
wav = tf.concat([zero_padding, wav],0)
spectrogram = tf.signal.stft(wav, frame_length=255, frame_step=64)
spectrogram = tf.abs(spectrogram)
spectrogram = spectrogram[..., tf.newaxis]


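from IPython import display #without this import, the bare name 'display' is IPython's display() function, which has no .display or .Audio attribute

display.display(display.Audio(wav.numpy(), rate=16000)) #play the resampled, zero-padded clip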
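
The spectrogram dimensions can be predicted from the STFT parameters alone. The sketch below assumes the defaults of `tf.signal.stft` used in the cell above, namely `pad_end=False` and an `fft_length` equal to the smallest power of two that encloses `frame_length`. The function name `stft_shape` is hypothetical.

```python
def stft_shape(n_samples, frame_length, frame_step):
    """Return (n_frames, n_bins) for an STFT without end padding.

    Assumes n_samples >= frame_length and an FFT length equal to the smallest
    power of two that is at least frame_length.
    """
    n_frames = 1 + (n_samples - frame_length) // frame_step
    fft_length = 1 << (frame_length - 1).bit_length()
    n_bins = fft_length // 2 + 1  # only non-negative frequencies are returned
    return n_frames, n_bins

# 3 s at 16 kHz = 48000 samples, with the frame settings used above
frames, bins = stft_shape(48000, 255, 64)
print(frames, bins)
```

With the trailing channel axis added by `tf.newaxis`, each 3 s clip therefore becomes a tensor of shape `(frames, bins, 1)`.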