# This notebook produces spectrograms which represent a collection of decimated WAV files. These spectrograms will be stored in the "data" directory. A LST file is also produced, which contains each spectrogram's annotation information. Finally, either a REC file or Augmented Manifest file is created for the purpose of model training.

#### Make sure that the kernel is either "Python 3 Data Science" (for SageMaker Studio) or "conda_amazonei_pytorch_latest_p37" (for SageMaker Notebook Instances).

#### Change the name of the S3 Bucket (wherever it appears in the code) to reflect the name of the S3 Bucket in your AWS Account.

## Installs and Imports

OPTIONAL: Upgrades "pip" (only run this code chunk once each time you start the notebook instance)

In [None]:
# I recommend leaving this commented out since package installation in this notebook is very finicky.
#!pip install --upgrade pip

Installs the "librosa" library for spectrogram creation (always needed; must run exactly once each time you start the notebook instance).
If the installation processes "locks up", perform the following steps: log out of JupyterLab, stop and start the notebook instance again, reopen JupyterLab, and try again.

In [1]:
# This takes roughly five minutes to fully install when it is your first time ever running it, but it is absolutely necessary.
    # (It should only take a couple of minutes to install every other time.)
!conda install -y -c conda-forge librosa

Retrieving notices: ...working... done
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3

  added / updated specs:
    - librosa


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    audioread-3.0.1            |  py311h267d04e_1          44 KB  conda-forge
    ffmpeg-4.3.2               |       h38cfed3_3         9.3 MB  conda-forge
    gettext-0.22.5             |       h8fbad5d_2         471 KB  conda-forge
    gettext-tools-0.22.5       |       h8fbad5d_2         2.4 MB  conda-forge
    gnutls-3.6.13              |       h706517b_1         2.0 MB  conda-forge
    lame-3.100                 |    h1a8c8d9_1003         516 KB  conda-forge
    libasprintf-0.22.5         |       h8fbad5d_2          40 KB  conda-forge
    libas

Installs the "mxnet" library for REC file creation (currently always needed; must run exactly once each time you start the notebook instance)

In [2]:
# This is absolutely necessary, but the installation should only take less than a minute every time.
!pip install mxnet

Collecting mxnet
  Downloading mxnet-1.6.0-py2.py3-none-any.whl.metadata (3.4 kB)
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading graphviz-0.8.4-py2.py3-none-any.whl.metadata (6.4 kB)
Downloading mxnet-1.6.0-py2.py3-none-any.whl (68.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m68.7/68.7 MB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hDownloading graphviz-0.8.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: graphviz, mxnet
Successfully installed graphviz-0.8.4 mxnet-1.6.0


Properly installs the "opencv" library for use in im2rec.py (only needed if working in SageMaker Studio)

In [3]:
"""WARNING: Only run this if you are working in SageMaker Studio, not if you are working in SageMaker Notebook Instances."""
#!pip install opencv-python-headless



Import Statements

In [7]:
# Package Imports (leave in this order)
import math
import pandas as pd
import numpy as np
import librosa
import warnings
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from scipy.io import wavfile
from collections import OrderedDict
from tqdm import tqdm
import pickle
import json
import glob
import os
from os import path
import boto3
from PIL import Image
import json
from os.path import exists
import io

## Specifications

Constants and Variables

In [8]:
# Original annotation files' column names
left_col, right_col = "Begin Time (s)", "End Time (s)"
top_col, bot_col = "High Freq (Hz)", "Low Freq (Hz)"
class_col = "Species"

# Files and Directories
# Note that you should create this "data" folder in the same directory as this notebook.
output_dir = "./data"

# Desired format for the final data file (either "REC" or "AugmentedManifest")
file_format = "REC"

# SPECTROGRAM CONSTANTS
# Window size (n_fft) in seconds
WINDOW_SIZE_SEC = 3/20
# Hop Length in seconds
HOP_LEN_SEC = 15/300
# Number of frequency bands (y dimension of spectrogram)
N_MELS = 300
# Maximum frequency considered (highest value in y dimension)
FREQUENCY_MAX = 1600
# Length of one chunk in seconds
CHUNK_SIZE_SEC = 30
# Amount of overlap between subsequent spectrograms (in seconds)
SPEC_OVERLAP = 3.0
# Proportion of spectrogram that reflects the minimum area hand-annotation boxes will be allowed to have
MIN_BOX_AREA = 0.0006

# Dictionary mapping annotators' inconsistent species labels to consistent species labels (i.e., class labels)
    # All species labels found so far: 
    # {nan, '?', 'sl', 'rf', 'KW', 'hhb', 'jn', 'hb3', '3l', 'SL', 'sl3', 'kw?', 'hb?', 'al', 'lw', 's;', 'hb', 'kw', 'Hb', 'hn', 'hb ', 'jhb'}

CLASS_LABEL_MAP = {
        "humpback whale": "hb",
        "hb whale": "hb",
        "hb?": "hb",
        "hhb": "hb",
        "hb": "hb",
        "hb3": "hb",
        "Hb": "hb",
        "jhb": "hb",
        "jn": "hb",
        "hn": "hb",
        "hb ": "hb",
        "hbn": "hb",
    
        "killer whale": "kw",
        "kw": "kw",
        "kw ": "kw",
        "KW": "kw",
        "kw?": "kw",
        "lw": "kw",
    
        "rockfish": "rf",
        "rf": "rf",

        "sea lion": "sl",
        "sl": "sl",
        "sl ": "sl",
        "SL": "sl",
        "s;": "sl",
        "al": "sl",
        "3l": "sl",
        "sl3": "sl",
    
        "mech": "mech",
        "mech ": "mech",
        "mech.": "mech",
        "mechanical": "mech",
    
        "?": "?",
        "? ": "?"
    }

# Specifies which classes' annotations to remove when preparing the annotation data
DISALLOWED_CLASSES = ["?", "mech"]

## Setup

Creates a connection to the S3 Bucket where the decimated WAV files and annotation files are stored.

In [9]:
def awsKeys(file):
    awsKeys = pd.read_csv(file)
    access_key = awsKeys['Access key ID'][0]
    secret_key = awsKeys['Secret access key'][0]
    return access_key, secret_key


def clientAndBucket(file, region='us-west-2'):
    aws_access_key_id, aws_secret_access_key = awsKeys(file)
    s3_client = boto3.client(
        's3',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region
    )
    bucket_name = 'whale-recordings'
    s3 = boto3.resource('s3',
                        aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key,
                        region_name=region
                        )
    bucket = s3.Bucket(bucket_name)
    return s3_client, bucket


KEYS = "ssundar_accessKeys.csv"
s3_client, bucket = clientAndBucket(KEYS)

%run model_functions.ipynb
warnings.filterwarnings('ignore')

# List to store processed data
processed_data = []
D2 = []
backgroundFiles = []

path = "CPhydrophone/Avila/Deployment 2/selection-tables/"

keys = [obj.key for obj in bucket.objects.all()]
selectionTables = [(obj.split("/")[-1], obj) for obj in keys if path in obj][1:]

KEYS = "ssundar_accessKeys.csv"
s3_client, bucket = clientAndBucket(KEYS)

finished preprocessing


## Download Script

Downloads all **39** audio files and the corresponding annotated .txt files from the S3 bucket

to a folder named "files" 

In [10]:
from tqdm import tqdm
wavPath = "CPhydrophone/Avila/Deployment 2/wav-files/decimated_files/"
backgroundFiles = []
! cd files
! find . -type f \( -name "*.wav" -o -name "*-SS.txt" \) -exec rm {} +
! cd ..
FOLDER = "files_aws"
os.makedirs(FOLDER, exist_ok=True)
for item in tqdm(selectionTables):
    try:
        ss = item[0]
        # wav = ss.split("-SS.txt")[0] + "_processed.wav"
        wav = ss.split("-SS.txt")[0]
        p1 = os.path.join('files', ss)
        p2 = os.path.join('files', wav)
        if not os.path.exists(p1):
            s3_client.download_file(bucket_name, item[1], f'{FOLDER}/{ss}')
        if not os.path.exists(p2):
            s3_client.download_file(bucket_name, wavPath + wav, f'{FOLDER}/{wav}')
        
    except:
        continue
len(selectionTables)

100%|██████████| 39/39 [00:10<00:00,  3.83it/s]


39

In [44]:
# len([f for f in os.listdir('files') if f.endswith('_processed.wav')])/

38

In [None]:
def read_wavfile(wav_name, normalize=True, verbose=False):
    """
    Reads in a decimated wav file from the S3 Bucket.
    
    PARAMETERS
    ----------
        wav_name: string
            Numeric portion of the decimated WAV file's name
        normalize: boolean
            Indicates whether or not to normalize the sound data (i.e., the amplitudes of the sound wave recorded by the hydrophone)
        verbose: boolean
            Indicates whether or not to make output excessively detailed
    ----------
    
    RETURNS
    ----------
        sr: int
            Sampling rate of WAV file
        data: numpy array
            Contains floats representing the amplitudes of the sound wave for each sample (automatically ordered from earliest to latest)
    ----------
    """
    # Downloads the decimated WAV file from the S3 Bucket where it is stored.
    wav_name = np.random.choice([file for file in os.listdir('files') if file.endswith('_processed.wav')])
    file_name = f"{wav_name}_processed.wav"
    bucket_path = f"CPhydrophone/Avila/Deployment 2/wav-files/decimated_files/{file_name}"
    # bucket.download_file(bucket_path, f'files/{file_name}')
    print("read_wav", file_name)
    
    # Reads-in the decimated WAV file's information.
    if verbose:
        print("Reading {}".format(file_name))     
    sr, data = wavfile.read(file_name)
        # NOTE: Sampling rate (sr) seems to be 8000 samples per second
    print(sr)
    
    # Removes the WAV file from our working directory since we have obtained the information we need
    os.remove(file_name)
    
    # Normalizes the decimated WAV file and returns important information
    if verbose:
        print("{} samples at {} samples/sec --> {} seconds".format(data.shape[0], sr, data.shape[0]/sr))
    if normalize:
        data = data.astype(float)
        data = data - data.min()
        data = data / data.max()
        data = data - 0.5
    return sr, data

def read_annotations(fname, verbose=False):
    """
    Given the name of a WAV file (fname), tries to find the corresponding annotation file (i.e., selection table) in the S3 Bucket.
    
    PARAMETERS
    ----------
        fname: string
            Numeric portion of the TXT file's name (i.e., the TXT file which contains the WAV file's annotations)
        verbose: boolean
            Indicates whether or not to make output excessively detailed
    ----------
    
    RETURNS
    ----------
        annotations: pandas dataframe
            Contains information on the hand-annotation boxes (i.e., "ground truth" information) for the given WAV file.
            This data frame is derived from a "Raven" selection table (each row is an annotation box).
            The professor's "Bio Team" makes each annotation by drawing a selection box around an animal vocalization using the "Raven" app.
    ----------
    """
    # Establishes the initials for first name and last name of every single annotator so far
        # NOTE: Add to this list if new annotators join the team.
    annotators = ['AS.txt', 'AW.txt', 'JW.txt', 'MS.txt', 'SS.txt']
    
    # Finds the annotation files corresponding to "fname" and creates a list containing their names
    # annot_files = []
    # for annotator in annotators:
    #     file_name = f"{fname}-{annotator}"
    #         # EXAMPLE: file_name = f"{fname}-AW.txt"
    #     bucket_path = f"CPhydrophone/Avila/Deployment 2/selection-tables/{file_name}"
    #     try:
    #         print(bucket_path)
    #         bucket.download_file(bucket_path, file_name)
    #         annot_files.append(file_name)
    #     except Exception as e:
    #         print(e)
    #         continue
    
    annotators = ['AS.txt', 'AW.txt', 'JW.txt', 'MS.txt', 'SS.txt']
    annot_files = []
    for annotator in annotators:
        file_name = f"{fname}-{annotator}"
        local_path = f"{file_name}"
        if os.path.exists(local_path):
            annot_files.append(local_path)
        else:
            print(f"File {local_path} does not exist.")
            continue
    
    print(len(annot_files))
    # Accounts for the misnamed annotation file which includes the word "txt" twice
    if len(annot_files) == 0 and fname == "671658014.181007063421":
        print("File not found. Assuming it is misnamed.")
        file_name = "671658014.181007063421-AStxt.txt"
        bucket_path = f"selection-tables/{file_name}"
        try:
            bucket.download_file(bucket_path, file_name)
            annot_files.append(file_name)
        except Exception:
            print("Did not find misnamed file.")
            exit(1)
    
    # Gets the "better" annotation file out of the two that correspond to the WAV file called "671658014.180929213545.wav"
    if len(annot_files) == 2 and fname == "671658014.180929213545":
        print("Two annotation files found. Using AW's annotations.")
        os.remove("671658014.180929213545-SS.txt")
        file_name = "671658014.180929213545-AW.txt"
        bucket_path = f"selection-tables/{file_name}"
        try:
            bucket.download_file(bucket_path, file_name)
            annot_files = [file_name]
        except Exception:
            print("Did not find AW's annotations.")
            exit(1)
    
    # Gets the "better" annotation file out of the two that correspond to the decimated WAV file called "671658014.180929033558.wav"
    if len(annot_files) == 2 and fname == "671658014.180929033558":
        print("Two annotation files found. Using AS's annotations.")
        os.remove("671658014.180929033558-JW.txt")
        file_name = "671658014.180929033558-AS.txt"
        bucket_path = f"selection-tables/{file_name}"
        try:
            bucket.download_file(bucket_path, file_name)
            annot_files = [file_name]
        except Exception:
            print("Did not find AS's annotations.")
            exit(1)
    
    # Takes the "list" of annotation file names (which really only contains one name) and reads its information as a PANDAS data frame
    print(annot_files)
    annots = []
    for file_name in annot_files:
        file = pd.read_csv(file_name, sep="\t")
        
        # Corrects a known mispelling of the "Species" column name in one of the WAV files
        try:
            len(file[class_col])
        except Exception as e:
            file.rename(columns = {"Spcies": "Species"}, inplace = True)
        
        # Adds the data frame to a "list" of the WAV file's annotation data frames (which really only contains one data frame)
        annots.append(file)
    
    
    # Catches any situations where an unexpected number of annotation files is found
    if len(annots) == 0:
        print("ERROR: File not found. Terminating Program.")
        exit(1)
    elif len(annots) == 1:
        annotations = annots[0]
    else:
        print("There are multiple annotation files for this WAV file. It is unclear which one you wish to use.")
        exit(1)
        
        
    if verbose:
        print("Read {} annotations from {}".format(len(annotations), fname))
        print("Columns:", ",".join([" {} ({})".format(c, type(c)) for c in annotations.columns]))
    
    # Removes the annotation files from our working directory since we have the information we need
    for file_name in annot_files:
        os.remove(file_name)
    return annotations

randomFile = np.random.choice([file for file in os.listdir('files') if file.endswith('_processed.wav')])
read_wavfile(f'files/{randomFile}', verbose=True)

Defines and calls the function that creates the training, validation, and testing datasets. It gets all the decimated WAV files from the relevant S3 Bucket. Then, it sets aside the hardcoded WAV files reserved for testing and validation purposes. The remaining files become the training data.

In [55]:
def get_data_sets():
    """
    Gets the file names associated with the training data, validation data, and testing data.
        *Note that the "validation" data is used to tune parameters, while the "testing" data provides an unbiased estimate of model performance.
        
    PARAMETERS
    ----------
        N/A
    ----------
    
    RETURNS
    ----------
        train_set: list of strings
            Contains the names (numeric portions only) of the WAV files which belong to the training dataset.
        validation_set: list of strings
            Contains the names (numeric portions only) of the WAV files which belong to the validation dataset.
        testing_set: list of strings
            Contains the names (numeric portions only) of the WAV files which belong to the testing dataset.
    ----------
    """
    
    # Testing Set: Represents "new data" after the model has been trained/validated
    # testing_set = ['671658014.181008003414', '671658014.180929003601']
    # 2024 Training Set
    files = [f for f in os.listdir('files') if f.endswith('_processed.wav')]
    train_set = np.random.choice(files, int(len(files)*0.8), replace=False)
    validation_set = np.random.choice([file for file in files if file not in train_set], int(len(files)*0.1), replace=False)
    testing_set = [file for file in files if file not in train_set and file not in validation_set]
    # Capstone Team's testing set
    #testing_set = ['671658014.181008003414']

    # Validation Set: Evaluates the model after training and helps tune post-processing parameters
    # validation_set = ['671658014.181008033412', '671658014.180930183532']
    # Capstone Team's validation set
    #validation_set = ['671658014.181008033412']
    
    # Gets all the dataset names from the S3 Bucket
    annotatedFiles = [file.key.split("/")[1] for file in bucket.objects.all() if (file.key[-1] != '/' and 
                                                                                  file.key.split("/")[0] == "selection-tables")]
    dataset = [file.split("-")[0] for file in annotatedFiles]

    # Training Set: Teaches the model how to make predictions
        # NOTE: It contains any data that is not in the testing set or validation set
    # notAllowedSet = testing_set + validation_set
    # train_set = [file for file in dataset if all(file not in notAllowed for notAllowed in notAllowedSet)]
    
    return train_set, validation_set, testing_set

train_set, validation_set, testing_set = get_data_sets()
len(train_set), len(validation_set), len(testing_set)

(28, 3, 4)

Displays the WAV file names that belong to the training data (with and without duplicate names removed)

In [37]:
# Duplicate names indicate the presence of multiple annotation files for a single WAV file (only one will get used)
print("Size with Duplicates: ", len(train_set))
print(train_set)
print()

# Removes duplicate names from the training set
train_set = list(set(train_set))
print("Size without Duplicates: ", len(train_set))
print(train_set)

Size with Duplicates:  30
['6805.230204003826_processed.wav' '6805.230203210826_processed.wav'
 '6805.230206030826_processed.wav' '6805.230204180826_processed.wav'
 '6805.230206163827_processed.wav' '6805.230205183826_processed.wav'
 '6805.230206090827_processed.wav' '6805.230207000827_processed.wav'
 '6805.230203090826_processed.wav' '6805.230205210826_processed.wav'
 '6805.230204210826_processed.wav' '6805.230203110826_processed.wav'
 '6805.230202030825_processed.wav' '6805.230206233827_processed.wav'
 '6805.230202100825_processed.wav' '6805.230205180826_processed.wav'
 '6805.230206000826_processed.wav' '6805.230205090826_processed.wav'
 '6805.230202120825_processed.wav' '6805.230202150825_processed.wav'
 '6805.230203000825_processed.wav' '6805.230201210825_processed.wav'
 '6805.230205000826_processed.wav' '6805.230201180825_processed.wav'
 '6805.230207043827_processed.wav' '6805.230205150826_processed.wav'
 '6805.230202180825_processed.wav' '6805.230204120826_processed.wav'
 '6805.2

Defines and calls the function that uses CLASS_LABEL_MAP to determine which species names are present in the annotation data.
Also produces a dictionary that maps the correct class labels to unique class numbers.

In [10]:
def get_all_classes(annotation_fnames, verbose=False):
    """
    Returns a list of all classes (i.e., species names) seen in the annotation files after class mapping, sorted alphabetically.
    Note that this function does NOT alter the annotation files to reflect the class mapping (this occurs in a later function).
    
    PARAMETERS
    ----------
        annotation_fnames: list of strings
            Contains the file names (numeric portions only) of the annotation files whose classes should be considered 
            when developing the class mapping
        verbose: boolean
            Indicates whether or not to make output excessively detailed
    ----------
    
    RETURNS
    ----------
        clean_classes: set
            After interpreting mislabeled species names as their correct names, contains every species name seen in annotation_fnames
            (Since this is a set, it does not include duplicate species names.)
            Names are sorted in alphabetical order.
    ----------
    """
    # Gets every class name seen in annotation_fnames
    classes = set()
    for annot_fname in annotation_fnames:
        try:
            classes.update(list(read_annotations(annot_fname)[class_col].unique()))
        except Exception as e:
            pass
    print("Raw classes: ", classes)
    
    # Combines class names that refer to the same animal into a single class name
    clean_classes = set()
    for cname in classes:
        # Ignores annotations that lack a class name
        if type(cname) == float:
            continue
        clean_classes.add(CLASS_LABEL_MAP[cname])
    
    # Sorts the class names in alphabetical order
    clean_classes = sorted([s for s in list(clean_classes)])
    if verbose:
        print("Cleaned Class Names: ", clean_classes)
    return clean_classes


        
# Gets all class names seen across the annotation files and removes undesired class names
classes = get_all_classes(train_set+validation_set+testing_set, verbose=True)
classes = [c for c in classes if c not in DISALLOWED_CLASSES]
print("Allowed Classes: ", classes)

# Produces dictionaries that map each "Allowed Class" to a unique class number (class numbering starts at 1)
class_map = {}
rev_class_map = {}
for i in range(len(classes)):
    # Class numbers are the keys and class names are the values
        # UNUSED
    class_map[i+1] = classes[i]
    # Class names are the keys and class numbers are the values
        # USED
    rev_class_map[classes[i]] = i+1

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U31'), dtype('<U31')) -> None

Displays the class mapping that will get used.

In [11]:
print(rev_class_map)

NameError: name 'rev_class_map' is not defined

Defines a function that calculates the area of a given annotation box (measured as the proportion of the spectrogram it covers).

In [12]:
def get_area(annotation):
    """
    Given an annotated bounding box, returns the calculated area of the box.
    
    PARAMETERS
    ----------
        annotation: row of a PANDAS data frame
            Contains the information denoting a hand-annotated bounding box.
    ----------
    
    RETURNS
    ----------
        A float denoting the calculated area of the box (relative to the full spectrogram, which has an area of 1).
    ----------
    """
    return ((annotation[right_col] - annotation[left_col])
            * (annotation[bot_col] - annotation[top_col]))

Defines functions that use "data chunks" to create spectrograms and add their annotation data to the corresponding LST file.
Each line of the LST file contains the annotation data for a given spectrogram, and the entire LST file contains the annotation data for a single dataset (such as the training set).

In [13]:
def process_file(wav_filename, annot_filename, min_bound, max_bound, chunk_size, lst_file_name, chunk_layout="dense",
                 drop_last_chunk=False, verbose=False):
    """
    Iterates through all data chunks in the current WAV file, serving as a shell for extract_chunk().
    
    PARAMETERS
    ----------
        wav_filename: string
            Numeric portion of the decimated WAV file's name
        annot_filename: string
            Numeric portion of the TXT file's name (will correspond to "wav_filename")
        min_bound: float 
            Used during spectrogram preparation within extract_chunk()
        max_bound: float 
            Used during spectrogram preparation within extract_chunk()
        chunk_size: int
            Specifies the total elapsed time (in seconds) that each spectrogram will cover along its x-axis
        lst_file_name: string
            Specifies the name of the LST file (including the ".lst" portion) that extract_chunk() should write its information to
        chunk_layout: string
            Either "dense" or "sparse".
            It seems like the intention of this parameter was to modify the calculation of spectrogram overlap depending on how many 
                annotations were within the current spectrogram (i.e., the current chunk). However, its value is hardcoded as "dense", 
                and the code for spectrogram overlap has since been modified to where all spectrograms feature the same amount of 
                overlap regardless of chunk_layout.
        drop_last_chunk: boolean
            Boolean indicating whether or not to skip over the final spectrogram for the WAV file ("False" means, "include the spectrogram")
        verbose: boolean
            Indicates whether or not to make output excessively detailed
    ----------
    
    RETURNS
    ----------
        examples: list of dictionaries
            For each spectrogram, contains a dictionary mapping important spectrogram information (and annotation information) to 
                intuitive keywords.
            Note that, within this function, some of the information gets written to the LST file.
    ----------
    """
    # Reads-in WAV file information and annotation information
        # For every decimated WAV file, sampling rate (sr) seems to be 8000 samples per second.
    sr, data = read_wavfile(wav_filename, normalize=True, verbose=verbose)
    annotations = read_annotations(annot_filename, verbose=verbose)
    
    # Converts spectrogram constants from being measured in seconds to being measured in samples
    n_fft = int(WINDOW_SIZE_SEC * sr)
    hop_len = int(HOP_LEN_SEC * sr)
    chunk_size = int(chunk_size * sr)
    
    # Implements spectrogram overlap 
        # Example: 
        # If SPEC_OVERLAP = 3.0, the start time for one spectrogram is 3 seconds earlier than the end time of the previous spectrogram.
    if chunk_layout == "dense":
        step = chunk_size - int(SPEC_OVERLAP * sr)
    elif chunk_layout == "sparse":
        step = chunk_size - int(SPEC_OVERLAP * sr)
    
    # Builds a list of "start values", where each value specifies the index where a "chunk's" information starts in the "data" list
        # Each chunk will be used to produce a single spectrogram.
    start_vals = [s for s in range(0, len(data), step)]
    
    # Removes the last "start value" from the list if the corresponding chunk is too small
    if len(data) - start_vals[-1] < int(chunk_size / 2):
        start_vals = start_vals[:-1]
        

    def extract_chunk(start_i, end_i, spec_name, annot_name, json_name, index, use_pcen=True, M_init=None):
        """
        Uses a single chunk of audio data to produce a spectrogram (and its corresponding line in the LST file).
        This function gets called by process_file() multiple times until every spectrogram has been produced.
        
        PARAMETERS
        ----------
            start_i: int
                Specifies the index within the audio dataset where the current spectrogram begins
                (Note that a value of 0 represents the first sample in the decimated WAV file.)
            end_i: int
                Specifies the index within the audio dataset where the next spectrogram begins
                (Note that a value of len(data) represents the end of the decimated WAV file.)
            spec_name: string
                The name of the current spectrogram for when it gets saved to a PNG file (note that this includes the ".png" portion).
            annot_name: string
                Currently-unused name for a TXT file which does not get created
                (This nonexistant file would presumably contain the current spectrogram's annotations).
            json_name: string
                Currently-unused name for a JSON Lines file which does not get created
                (This nonexistant file would presumably be the Augmented Manifest File for the current WAV file).
            index: int
                Indicates which spectrogram is currently being produced
                (Note that this starts at 0 and increases by 1 with each subsequent spectrogram.
                    This information is recorded within the LST file, so that it is clear which row corresponds to which spectrogram.)
            use_pcen: boolean
                Currently unused, but would indicate whether or not PCEN should be applied to the current spectrogram.
            M_init: value (optional)
                Would indicate the "final filter delay value" corresponding to the current spectrogram (for use in PCEN streaming).
                    (Note that "spectrogram", "chunk", and "block" are presumably interchangeable terms in this context).
                Since PCEN (let alone PCEN streaming) is currently unused, this is set to None.
        ----------
    
        RETURNS
        ----------
            example_dict: dictionary
                Maps important spectrogram information (and annotation information) to intuitive keywords for ease-of-access.
                Note that, within this function, some of the information gets written to the LST file.
            next_M_init: value (optional)
                Would indicate the "final filter delay value" telling when to produce the next spectrogram (for use in PCEN streaming).
                    (Note that "spectrogram", "chunk", and "block" are presumably interchangeable terms in this context).
                Since PCEN (let alone PCEN streaming) is currently unused, this is set to None.
        ----------
        """
        # Produces a "spectrogram dataset" for the current chunk
        mel_spec = librosa.feature.melspectrogram(y=data[start_i:end_i],
                                                  sr=sr,
                                                  n_fft=n_fft,
                                                  hop_length=hop_len,
                                                  n_mels=N_MELS,
                                                  fmax=FREQUENCY_MAX,
                                                  center=False)
        
        # Attempt to Implement PCEN (Per-Channel Energy Normalization)
            # SOURCE: https://librosa.org/doc/main/generated/librosa.pcen.html
            # NOTE: Sampling rate is hardcoded as 8000 samples per second
            # NOTE: "hop_length" seems to represent the number of audio samples within a 30-second spectrogram
                # Calculation of "hop_length": 
                    # (8000 samples per second) * (60 seconds per minute) * (180 minutes per audio file) / (360 spectrograms per file)
                    # Equals 240,000 samples per spectrogram
                    # Assumes "hop_length" is simply the length of a 30-second spectrogram (hence the lack of spectrogram overlap)
        #mel_spec = librosa.pcen(mel_spec, sr = 8000, hop_length = 240000, max_size = FREQUENCY_MAX)
        # Using new value for hop_len that is more likely to be correct (equal to the hop len specified for the mel spectrogram)
        #mel_spec = librosa.pcen(mel_spec * (2**31), sr = 8000, hop_length = hop_len, max_size = FREQUENCY_MAX)
        # End of Attempt to Implement PCEN
        
        # Prepares current spectrogram
        next_M_init = None
        mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
            # In-Progress: PCEN Implementation
        ###S = librosa.feature.melspectrogram(y=data[start_i:end_i], sr=sr, power = 1)
        ###mel_spec = librosa.pcen(S * (2**31), sr = sr, hop_length = hop_len, max_size = FREQUENCY_MAX)
        ###mel_spec = librosa.pcen(S, sr = sr, hop_length = hop_len, power = 1/4, time_constant = 0.7, max_size = FREQUENCY_MAX)
            # End of PCEN Implementation
        mel_spec = np.clip((mel_spec - min_bound) / (max_bound - min_bound) * 255, a_min=0, a_max=255)
        mel_spec = mel_spec.astype(np.uint8)
        spec_height, spec_width = mel_spec.shape
        
        # Creates a white horizontal stripe at the second highest frequency band on the current spectrogram
        nrow, ncol = mel_spec.shape
        for i in range(ncol):
            # Draws the stripe on the spectrogram
            mel_spec[nrow-2, i] = 255

        # Gets annotions for the current chunk
        start_s, end_s = start_i/sr, end_i/sr
        freq_axis_low, freq_axis_high = librosa.hz_to_mel(0.0), librosa.hz_to_mel(FREQUENCY_MAX)
        chunk_annotations = annotations.loc[~((annotations[left_col] > end_s)
                                              | (annotations[right_col] < start_s))].copy()
        print(start_s, end_s)
        

        # Rescale axes to 0.0-1.0 based on location inside chunk
        chunk_annotations.loc[:,[left_col,right_col]] = ((chunk_annotations[[left_col,right_col]]
                                                         - start_s) / (end_s - start_s))

        chunk_annotations.loc[:,[bot_col,top_col]] = (1.0 - ((librosa.hz_to_mel(chunk_annotations[[bot_col,top_col]])
                                                      - freq_axis_low) / (freq_axis_high - freq_axis_low)))
        
        
        # Takes any mispelled or inconsistent annotation labels and replaces them with a consistent and correctly-spelled label
        chunk_annotations[class_col] = chunk_annotations[class_col].map(CLASS_LABEL_MAP, na_action = "ignore")
        # Code appears to filter the annotations based on which classes are allowed according to "Allowed Classes"
        chunk_annotations = chunk_annotations.loc[chunk_annotations[class_col].isin(classes)]
        
        
        # Clips hand-annotated bounding boxes that extend outside of the spectrogram (so that they end at the spectrogram's border)
        trimmed_annots = chunk_annotations.copy()
        trimmed_annots[left_col] = trimmed_annots[left_col].clip(lower=0, upper=1.0)
        trimmed_annots[right_col] = trimmed_annots[right_col].clip(lower=0, upper=1.0)
        trimmed_annots[bot_col] = trimmed_annots[bot_col].clip(lower=0, upper=1.0)
        trimmed_annots[top_col] = trimmed_annots[top_col].clip(lower=0, upper=1.0)
        
        
        # Implements a minimum area requirement for hand-annotated boxes
        trimmed_box_areas = []
        for i in trimmed_annots.index:
            trimmed_box = trimmed_annots.loc[i]
            trimmed_box_area = get_area(trimmed_box)
            trimmed_box_areas.append(trimmed_box_area)
        trimmed_annots["Trimmed_Box_Area"] = trimmed_box_areas
        trimmed_annots = trimmed_annots.loc[trimmed_annots["Trimmed_Box_Area"] >= MIN_BOX_AREA]


        if verbose:
            print("Found {} annotations in chunk".format(len(chunk_annotations)))


        if verbose:
            print("Saved spectrogram to '{}'".format(spec_name))

            
        # Prepares spectrogram and annotation information to be written to files
        image_filepath = path.join(output_dir, spec_name)
        example_dict = {
            "filepath": spec_name,
            "height": spec_height,
            "width": spec_width,
            "xmins": trimmed_annots[left_col].tolist(),
            "xmaxs": trimmed_annots[right_col].tolist(),
            "ymins": trimmed_annots[top_col].tolist(),
            "ymaxs": trimmed_annots[bot_col].tolist(),
            "classes_text": trimmed_annots[class_col].tolist(),
            "classes": trimmed_annots[class_col].map(rev_class_map).tolist()
        }


        # Saves chunk as PNG image (lossless compression)
        im = Image.fromarray(mel_spec[::-1, :])
        im = im.convert("L")

        image_filepath = path.join(output_dir, spec_name)
        im.save(image_filepath)
        
        # Starts preparing the information for the LST file
        res = [index, 2, 5]
        for i in range(len(example_dict["xmins"])):
            # Skips "flattened" annotations, since they do not provide valuable information to the model
                # NOTE: Assumes clipping has "flattened" annotations that are "offscreen"
            if example_dict["ymins"][i] == example_dict["ymaxs"][i] or example_dict["xmins"][i] == example_dict["xmaxs"][i]:
                continue
            # Obtains annotation information:
            temp = [example_dict["classes"][i], example_dict["xmins"][i], example_dict["ymins"][i], 
                    example_dict["xmaxs"][i], example_dict["ymaxs"][i]]
            res.extend(temp)
            
        # Creates an annotation for the horizontal stripe near the top of the spectrogram.
            # Note that the "stripe annotation" (often called a "blank annotation") receives a class number of 0
        BLANK_CLASS_NUM = 0
        temp = [BLANK_CLASS_NUM, 0, 1/(2*N_MELS), 1, 5/(2*N_MELS)]
        res.extend(temp)
        # Updates "example_dict" to reflect the new blank annotation
        new_xmins = [temp[1]] + example_dict["xmins"]
        new_ymins = [temp[2]] + example_dict["ymins"]
        new_xmaxs = [temp[3]] + example_dict["xmaxs"]
        new_ymaxs = [temp[4]] + example_dict["ymaxs"]
        new_classes_text = ["blank"] + example_dict["classes_text"]
        new_classes = [temp[0]] + example_dict["classes"]
        new_dict = {
            "filename": spec_name,
            "height": spec_height,
            "width": spec_width,
            "xmins": new_xmins,
            "xmaxs": new_xmaxs,
            "ymins": new_ymins,
            "ymaxs": new_ymaxs,
            "classes_text": new_classes_text,
            "classes": new_classes
        }
        example_dict.update(new_dict)
        
        # Ensures that the file path to the current spectrogram is included in the LST file
        res.append(image_filepath) 

        # Writes information to LST file
        text = "\t".join([str(el) for el in res])
        with open(lst_file_name, "a") as f:
            f.write(text)
            f.write('\n')

        return example_dict, next_M_init
    
    # NOTE THAT "extract_chunk()" ENDS HERE 
    # ALSO NOTE THAT "process_file()" RESUMES HERE
    
    # Iterates through the WAV file, producing each chunk's spectrogram (and its line in the LST file) along the way
    examples = []
    M_init = None
    for ind, start_i in enumerate(start_vals[:-1]):
        spec_name = "{}-{}.png".format(wav_filename, ind)
        annot_name = "{}-{}-labels.txt".format(wav_filename, ind)
        json_name = f"{wav_filename}.jsonl"
        ex, M_init = extract_chunk(start_i, start_i+chunk_size, spec_name, annot_name, json_name, ind, M_init=M_init)
        examples.append(ex)
    if not drop_last_chunk:
        spec_name = "{}-{}.png".format(wav_filename, len(start_vals)-1)
        annot_name = "{}-{}-labels.txt".format(wav_filename, len(start_vals)-1)
        json_name = f"{wav_filename}.jsonl"
        ex, _ = extract_chunk(start_vals[-1], len(data), spec_name, annot_name, json_name, len(start_vals)-1, M_init=M_init)
        examples.append(ex)
    else:
        print("Dropping Last Chunk.")
    return examples

This function creates the LST file and obtains the information required to produce an Augmented Manifest file.

In [14]:
def create_lst_file(dataset, lst_file_name):
    """
    Creates spectrograms for the files included in "dataset" (storing them in the "data" directory) and creates the corresponding LST file.
        *NOTE: Before this function gets called, create a folder called "data" (in the same directory as this notebook) for the spectrograms.
        *NOTE: Make sure the "data" folder is empty before this function gets called. Calling cleanup() before create_data_files() will do this.
        
    PARAMETERS
    ----------
        dataset: string
            Specifies which dataset (train_set, validation_set, or testing_set) to produce a LST file for
        lst_file_name: string
            Specifies the name that you want the LST file to have (including the ".lst" portion)
    ----------
    
    RETURNS
    ----------
        aug_manif_info: two-dimensional list of dictionaries
            For each spectrogram in the LST file, contains a dictionary mapping important spectrogram information (and annotation information) to 
                intuitive keywords.
            Each list contains the dictionaries associated with a specific WAV file.
                (These lists are all contained within one big list.)
            This only gets used when producing an Augmented Manifest file, so it will not be used if file_format = "REC".
    ----------
    """
    # Initializes important variables
    index = 0
    aug_manif_info = []
    
    # Iterates through the WAV file names in the "dataset"
    for file in dataset:
        # Displays progress update
        print(f"{index + 1}/{len(dataset)} wav files converted")
        # Increments index for next progress update
        index += 1
        
        # Produces the spectrograms and LST file lines corresponding to the current WAV file
        cur_info = process_file(file, file, -80.0, 0, CHUNK_SIZE_SEC, lst_file_name,chunk_layout="dense", drop_last_chunk=False, verbose=False)
        # Appends important spectrogram information and annotation information to the list
        aug_manif_info.append(cur_info)
        
    return aug_manif_info

This function creates the REC file corresponding to the given LST file and the spectrograms in the "data/" directory.
REC files are the recommended format for data during model training, and they contain everything necessary to reproduce the spectrograms and annotation boxes.

In [15]:
def create_rec_file(lst_file_name):
    """
    Runs the code contained within im2rec.py, creating a REC file that corresponds to the given LST file.
    The REC file will have the same name as the LST file (with ".rec" instead of ".lst").
    
    PARAMETERS
    ----------
        lst_file_name: string
            Specifies the name of the LST file (including the ".lst" portion) that should be used to make the REC file.
    ----------
    
    RETURNS
    ----------
        N/A
    ----------
    """
    RESIZE_SIZE = 256
    !python im2rec.py --resize $RESIZE_SIZE --pack-label $lst_file_name .

These functions create the Augmented Manifest file corresponding to the information returned from create_lst_file().

Augmented Manifest Files are an alternative format for data during model training. They contain all annotation information as well as the file paths (relative to the S3 Bucket) for each spectrogram. This means that, if Augmented Manifest files are used, all spectrograms must be uploaded to the S3 Bucket (in the correct folder) before model training.

Note that, so far, model training has only been successful using REC files.

In [16]:
import json

def create_json(file_prefix, spectro_with_annots):
    """
    Creates a JSON object containing a single spectrogram's information along with the corresponding annotation information.
    This JSON object is properly formatted to be a line in an Augmented Manifest File (JSON Lines file) for SageMaker training jobs.
    
    PARAMETERS
    ----------
        file_prefix: string
            Specifies the name that you want the Augmented Manifest file to have (without the "_AugmentedManifestFile.jsonl" portion).
                (Corresponds to the LST file created at the same time as "aug_manif_info".)
        spectro_with_annots: dictionary
            Maps important spectrogram information (and annotation information) for the current spectrogram to intuitive keywords.
    ----------
    
    RETURNS
    ----------
        json_obj: JSON Object
            Contains a single spectrogram's information along with the corresponding annotation information.
            Properly formatted to be a line in an Augmented Manifest File (JSON Lines file) for SageMaker training jobs.
    ----------
    """
    # Creates intuitive variable names to reference important spectrogram information
    spectro_name = spectro_with_annots["filename"]
    s3_location = f"s3://sagemaker-us-west-2-************/{file_prefix}_manifest/spectrograms/{spectro_name}"
    width = spectro_with_annots["width"]
    height = spectro_with_annots["height"]
        # "depth" was found by downloading a spectrogram, opening it in the "Photos" application on Windows, and viewing file information.
    depth = 8
    
    # Properly formats spectrogram size dimensions
    image_size = [{
        "width": width,
        "height": height,
        "depth": depth
    }]
    
    # Creates intuitive variable names to reference important annotation information
    classes = spectro_with_annots["classes"]
    xmins = spectro_with_annots["xmins"]
    ymins = spectro_with_annots["ymins"]
    xmaxs = spectro_with_annots["xmaxs"]
    ymaxs = spectro_with_annots["ymaxs"]
    
    # Properly formats annotation information, rescaling measurements to reflect the spectrogram's size dimensions
    annotations = []
    for i in range(len(spectro_with_annots["classes"])):
        cur_class = classes[i]
        left = width*xmins[i]
        top = height*ymins[i]
        box_width = width*xmaxs[i]-left
        box_height = height*ymaxs[i]-top
        cur_annot = {
            "class_id": cur_class, 
            "left": left,
            "top": top,
            "width": box_width,
            "height": box_height
        }
        annotations.append(cur_annot)
        
    # Builds (and returns) a JSON object from the formatted information
    boxes = {"image_size": image_size, "annotations": annotations}
    json_obj_info = {"spectrogram": s3_location, "boxes": boxes}
    json_obj = json.dumps(json_obj_info, indent=4)
    return json_obj

In [17]:
def create_augmented_manifest_file(file_prefix, aug_manif_info):
    """
    Iterates through every spectrogram in the LST file (specified by file_prefix).
    Produces a JSON object for each spectrogram (using aug_manif_info).
    Writes the JSON objects to a JSON Lines file.
    
    PARAMETERS
    ----------
        file_prefix: string
            Specifies the name that you want the Augmented Manifest file to have (without the "_AugmentedManifestFile.jsonl" portion).
                (Corresponds to the LST file created at the same time as "aug_manif_info".)
        aug_manif_info: two-dimensional list of dictionaries
            For each spectrogram in the LST file, contains a dictionary mapping important spectrogram information (and annotation information) to 
                intuitive keywords.
            Each list contains the dictionaries associated with a specific WAV file.
                (These lists are all contained within one big list.)
    ----------
    
    RETURNS
    ----------
        N/A
    ----------
    """
    # Creates every JSON object and appends them to a list
    all_json_objects = []
    for wav_file_data in aug_manif_info:
        for image_info in wav_file_data:
            cur_json_obj = create_json(file_prefix, image_info)
            all_json_objects.append(cur_json_obj)
            
    # Removes the previous Augmented Manifest File from the working directory (if one exists)
    filename = file_prefix + "_AugmentedManifestFile.jsonl"
    if exists(filename):
        print(f"{filename} exists, removing now")
        !rm $filename
        
    # Writes the JSON objects to a JSON Lines file.
    with open(f"{file_prefix}_AugmentedManifestFile.jsonl", "a") as f:
        for obj in all_json_objects:
            text = str(json.loads(obj))
            f.write(text)
            f.write('\n')

This function removes all spectrograms from the data folder.

In [18]:
def cleanup():
    !rm data/*

This function removes an old LST file from the working directory when a new one is being created to replace it.

In [19]:
def remove_Lst_fileIfOpen(file_name):
    """
    Removes a LST file from the working directory with the same name as the one being created (if one was created previously).
    """
    if exists(file_name):
        print(f"{file_name} exists, removing now")
        !rm $file_name

The "copy_to_bucket()" function can copy any file from this notebook's working directory to the S3 Bucket.

In [None]:
# def copy_to_bucket(fileSource, fileDestination):
#     """
#     Copies a file from this notebook's working directory to the S3 Bucket.
#     """
#     # NOTE: Change the following name of the S3 Bucket (in parentheses) to reflect the name of the S3 Bucket for your current AWS account.
#     """WARNING: This S3 Bucket should be the one that contains SageMaker files (NOT the one with WAV files and TXT files)."""
#     write_bucket = s3.Bucket('sagemaker-us-...')
#     write_bucket.upload_file(fileSource, fileDestination)

This function takes in a list of WAV file names (along with the desired name for the final data file). 
It then creates the corresponding spectrograms, LST file, and final data file.
Depending on the "file_format", the final data file will either be a REC file or an Augmented Manifest file.

In [20]:
def create_data_files(dataset, file_prefix, file_format = file_format):
    """
    This function takes in a list of WAV file names, and the name of the REC file the annotations need to be stored in, 
    and then creates the corresponding spectrograms, LST file, and REC file for the WAV files.
    
    PARAMETERS
    ----------
        dataset: string
            Specifies which dataset (train_set, validation_set, or testing_set) to produce a LST file and final data file for.
        file_prefix: string
            Specifies the name that you want the LST file and final data file to have (without the ".lst", ".rec", or ".jsonl" portions).
                (Corresponds to the LST file created at the same time as "aug_manif_info".)
        file_format: string
            Specifies the desired format for the final data file (which would be used during any training jobs and tuning jobs).
                (Either "REC" or "AugmentedManifest")
    ----------
    
    RETURNS
    ----------
        N/A
    ----------
    """
    # Produces spectrograms and LST file
    lst_file_name = f"{file_prefix}.lst"
    remove_Lst_fileIfOpen(lst_file_name)
    aug_manif_info = create_lst_file(dataset, lst_file_name)
    
    # Produces either a REC file or Augmented Manifest file (depending on file_format)
    if file_format == "REC":
        create_rec_file(lst_file_name)
    elif file_format == "AugmentedManifest":
        create_augmented_manifest_file(file_prefix, aug_manif_info)
        
    # Displays message indicating that all desired files have been created
    print("Done!")

Remember to call cleanup() before any call to create_data_files().

## Getting Training Data

Deletes all spectrograms in the "data" directory.

In [22]:
cleanup()

zsh:1: no matches found: data/*


|Creates spectrograms, LST file, and final data file for the training data.

In [33]:
create_data_files(train_set, "train_full", file_format = file_format)

1/31 wav files converted


FileNotFoundError: [Errno 2] No such file or directory: '6805.230207120827_processed.wav_processed.wav'

Copies training REC file from here to the S3 Bucket.

In [None]:
"""WARNING: I recommend archiving the train_full.rec file currently in the S3 Bucket before replacing it with the new one."""
#copy_to_bucket("train_full.rec", "train/train_full.rec")

Copies training Augmented Manifest File to the S3 Bucket.

In [None]:
"""WARNING: I recommend archiving the train_full_AugmentedManifestFile.jsonl file currently in the S3 Bucket before replacing it with the new one."""
#copy_to_bucket("train_full_AugmentedManifestFile.jsonl", "train_full_manifest/train_full_AugmentedManifestFile.jsonl")

Copies training spectrograms to the correct location in the S3 Bucket (specified in the Augmented Manifest File). Note that, in this case, the spectrograms should be produced in the professor's S3 Bucket, so that they can be easily copied over for model training.

In [44]:
import glob

# Gets the file path to each PNG file in the "data/" directory
files = glob.glob('data/*.png')

# Iterates through all spectrograms in "files", copying each one to the S3 Bucket
for file in files:
    spec_name = file.split("/")[-1]
    copy_to_bucket(file, "train_full_manifest/spectrograms/" + spec_name)

## Getting Validation Data

Deletes all spectrograms in the "data" directory.

In [None]:
cleanup()

Creates spectrograms, LST file, and final data file for the validation data.

In [None]:
create_data_files(validation_set, "val", file_format = file_format)

Copies validation REC file from here to the S3 Bucket.

In [None]:
"""WARNING: I recommend archiving the val.rec file currently in the S3 Bucket before replacing it with the new one."""
#copy_to_bucket("val.rec", "validation/val.rec")

Copies validation Augmented Manifest File to the S3 Bucket.

In [None]:
"""WARNING: I recommend archiving the val_AugmentedManifestFile.jsonl file currently in the S3 Bucket before replacing it with the new one."""
#copy_to_bucket("val_AugmentedManifestFile.jsonl", "val_manifest/val_AugmentedManifestFile.jsonl")

Copies validation spectrograms to the correct location in the S3 Bucket (specified in the Augmented Manifest File). Note that, in this case, the spectrograms should be produced in the professor's S3 Bucket, so that they can be easily copied over for model training.

In [None]:
import glob

# Gets the file path to each PNG file in the "data/" directory
files = glob.glob('data/*.png')

# Iterates through all spectrograms in "files", copying each one to the S3 Bucket
for file in files:
    spec_name = file.split("/")[-1]
    copy_to_bucket(file, "val_manifest/spectrograms/" + spec_name)

## Getting Testing Data

Deletes all spectrograms in the "data" directory.

In [None]:
cleanup()

Creates spectrograms, LST file, and final data file for the testing data.

In [None]:
create_data_files(testing_set, "test", file_format = file_format)

Copies testing REC file from here to the S3 Bucket.

In [None]:
"""WARNING: I recommend archiving the test.rec file currently in the S3 Bucket before replacing it with the new one."""
#copy_to_bucket("test.rec", "testing/test.rec")