<center><img src="https://keras.io/img/logo-small.png" alt="Keras logo" width="100"><br/>
This starter notebook is provided by the Keras team.</center>

# BirdCLEF 2024 with [KerasCV](https://github.com/keras-team/keras-cv) and [Keras](https://github.com/keras-team/keras)

> The objective of this competition is to identify under-studied Indian bird species by their calls.

<div align="center">
  <img src="https://i.ibb.co/47F4P9R/birdclef2024.png">
</div>

This notebook walks through running inference with a deep learning model that recognizes bird species from their songs (audio data). Because inference must run on the `CPU` only, training and inference are split into separate notebooks. You can find the [training notebook here](https://www.kaggle.com/code/awsaf49/birdclef24-kerascv-starter-train). As a recap, the training notebook uses the EfficientNetV2 backbone from KerasCV on the competition dataset and also demonstrates how to convert audio data to mel-spectrograms using Keras.

<u>Fun fact</u>: Both the training and inference notebooks are backend-agnostic, supporting TensorFlow, PyTorch, and JAX. Utilizing KerasCV and Keras allows us to choose our preferred backend. Explore more details on [Keras](https://keras.io/keras_core/announcement/).

In this notebook, you will learn how to:

- Design a data pipeline for audio data, including audio-to-spectrogram conversion.
- Load the data efficiently using [`tf.data`](https://www.tensorflow.org/guide/data).
- Create the model using KerasCV presets.
- Run inference with the trained model.

**Note**: For a more in-depth understanding of KerasCV, refer to the [KerasCV guides](https://keras.io/guides/keras_cv/).

# Import Libraries 📚

In [1]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"  # "jax" or "tensorflow" or "torch"

import keras_cv
import keras
import keras.backend as K
import tensorflow as tf
import tensorflow_io as tfio

import numpy as np
import pandas as pd

from glob import glob
from tqdm import tqdm

import librosa
import IPython.display as ipd
import librosa.display as lid

import matplotlib.pyplot as plt
import matplotlib as mpl

cmap = mpl.colormaps["coolwarm"]  # `mpl.cm.get_cmap` is deprecated



In [2]:
# Limit the numerical libraries to a single thread
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

# Configuration ⚙️

In [3]:
class CFG:
    seed = 42

    # Input spectrogram size: [mel bins, time frames]
    img_size = [128, 384]

    # Audio duration, sample rate, and length
    duration = 15  # seconds
    sample_rate = 32000
    audio_len = duration * sample_rate

    # STFT parameters
    nfft = 2028
    window = 2048
    hop_length = audio_len // (img_size[1] - 1)
    fmin = 20
    fmax = 16000

    # Pretrained backbone preset
    preset = "efficientnetv2_b2_imagenet"

    # Class Labels for BirdCLEF 24
    class_names = sorted(os.listdir("../data/birdclef-2024/train_audio"))
    num_classes = len(class_names)
    class_labels = list(range(num_classes))
    label2name = dict(zip(class_labels, class_names))
    name2label = {v: k for k, v in label2name.items()}

In [4]:
CFG.img_size[1]

384
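
The relationship between `duration`, `hop_length`, and the spectrogram width can be checked with plain arithmetic. The sketch below copies the values from `CFG`; the helper `approx_num_frames` is hypothetical and assumes a center-padded STFT, so the exact frame count produced by a given spectrogram layer may differ by a frame or so depending on its padding mode.

```python
# Values copied from CFG above.
duration = 15          # seconds
sample_rate = 32000    # Hz
target_width = 384     # CFG.img_size[1]

audio_len = duration * sample_rate            # 480000 samples
hop_length = audio_len // (target_width - 1)  # 1253 samples between frames


def approx_num_frames(num_samples, hop):
    """Hypothetical helper: frame count of a center-padded STFT (an assumption)."""
    return 1 + num_samples // hop


frames_15s = approx_num_frames(audio_len, hop_length)
frames_5s = approx_num_frames(5 * sample_rate, hop_length)
print(hop_length, frames_15s, frames_5s)
```

Under that assumption, a 15-second clip maps to the configured width of 384 frames, while a 5-second clip maps to roughly a third of it.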

# Reproducibility ♻️
Set the random seed so that results are repeatable across runs.

In [5]:
keras.utils.set_random_seed(CFG.seed)

# Dataset Path 📁

In [6]:
BASE_PATH = "../data/birdclef-2024"

# Test Data 📖

In [7]:
test_paths = glob(f"{BASE_PATH}/test_soundscapes/*.ogg")
# During commit, fall back to the `unlabeled` data because there is no `test` data.
# During submission, the `test` data is populated automatically.
if len(test_paths) == 0:
    test_paths = glob(f"{BASE_PATH}/unlabeled_soundscapes/*.ogg")
# Only the first 90 files are kept; remove the slice to predict on every file.
test_df = pd.DataFrame(test_paths, columns=["filepath"])[0:90]
test_df.head()

| | filepath |
|---|---|
| 0 | ../data/birdclef-2024/unlabeled_soundscapes/13... |
| 1 | ../data/birdclef-2024/unlabeled_soundscapes/92... |
| 2 | ../data/birdclef-2024/unlabeled_soundscapes/13... |
| 3 | ../data/birdclef-2024/unlabeled_soundscapes/19... |
| 4 | ../data/birdclef-2024/unlabeled_soundscapes/91... |


In [8]:
len(test_paths)

8444

# Modeling 🤖

Note that our model was trained on `10-second` audio clips, but we will run inference on `5-second` clips (as per the competition rules). To make this possible, the model input shape is set to `(None, None, 3)`, which allows variable-length input during both training and inference.
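
One reason a convolutional classifier can accept inputs of different widths is that global average pooling collapses the spatial dimensions before the final dense layer. The NumPy sketch below illustrates only that pooling step on random stand-in feature maps; the shapes and the helper `global_average_pool` are illustrative assumptions, not the actual KerasCV internals.

```python
import numpy as np


def global_average_pool(feature_map):
    """Average over height and width: (batch, H, W, C) -> (batch, C)."""
    return feature_map.mean(axis=(1, 2))


rng = np.random.default_rng(0)
wide = rng.random((2, 4, 12, 8))    # stand-in for features of a longer clip
narrow = rng.random((2, 4, 4, 8))   # stand-in for features of a shorter clip

pooled_wide = global_average_pool(wide)
pooled_narrow = global_average_pool(narrow)
print(pooled_wide.shape, pooled_narrow.shape)
```

Both pooled outputs have the same shape, so the same classification head can be applied regardless of the clip length.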

In [9]:
# Create an input layer for the model
inp = keras.layers.Input(shape=(None, None, 3))
# Pretrained backbone
backbone = keras_cv.models.EfficientNetV2Backbone.from_preset(
    CFG.preset,
)
out = keras_cv.models.ImageClassifier(
    backbone=backbone, num_classes=CFG.num_classes, name="classifier"
)(inp)
# Build model
model = keras.models.Model(inputs=inp, outputs=out)
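# Load weights of the trained model. Uncomment and adjust the path before
# submitting; without it the classification head stays randomly initialized.
# model.load_weights(
#     "/kaggle/input/birdclef24-kerascv-starter-train/best_model.weights.h5"
# )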



# Data Loader 🍚

The following code decodes the raw audio from each `.ogg` file, splits it into 5-second frames, and converts each frame to a mel-spectrogram. Z-score standardization followed by Min-Max normalization is then applied to keep the model inputs on a consistent scale.
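
The two scaling steps can be tried in isolation. This is a minimal NumPy sketch of standardization followed by Min-Max normalization, including the guards for a constant input; the function name `standardize_then_minmax` and the example values are made up for illustration.

```python
import numpy as np


def standardize_then_minmax(spec):
    """Z-score the array, then rescale it to the range [0, 1]."""
    spec = np.asarray(spec, dtype=np.float32)
    mean, std = spec.mean(), spec.std()
    # Skip the division when the input is constant (std == 0)
    spec = spec - mean if std == 0 else (spec - mean) / std
    lo, hi = spec.min(), spec.max()
    # Skip the division when all values are equal after standardization
    return spec - lo if hi == lo else (spec - lo) / (hi - lo)


example = np.array([[-80.0, -40.0], [-20.0, 0.0]])  # made-up dB-like values
scaled = standardize_then_minmax(example)
flat = standardize_then_minmax(np.full((2, 2), -30.0))
print(scaled.min(), scaled.max(), flat)
```

A constant input, such as a silent clip, comes out as all zeros instead of producing a division by zero.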

In [10]:
# Decodes Audio
def build_decoder(with_labels=True, dim=1024):
    def get_audio(filepath):
        file_bytes = tf.io.read_file(filepath)
        audio = tfio.audio.decode_vorbis(file_bytes)  # decode .ogg file
        audio = tf.cast(audio, tf.float32)
        if tf.shape(audio)[1] > 1:  # stereo -> mono
            audio = audio[..., 0:1]
        audio = tf.squeeze(audio, axis=-1)
        return audio

    def create_frames(audio, duration=5, sr=32000):
        frame_size = int(duration * sr)
        # Zero-pad the end so the length becomes a multiple of frame_size
        pad_len = (-tf.shape(audio)[0]) % frame_size
        audio = tf.pad(audio, [[0, pad_len]])
        frames = tf.reshape(audio, [-1, frame_size])  # shape: [num_frames, frame_size]
        return frames

    def apply_preproc(spec):
        # Standardize
        mean = tf.math.reduce_mean(spec)
        std = tf.math.reduce_std(spec)
        spec = tf.where(tf.math.equal(std, 0), spec - mean, (spec - mean) / std)

        # Normalize using Min-Max
        min_val = tf.math.reduce_min(spec)
        max_val = tf.math.reduce_max(spec)
        spec = tf.where(
            tf.math.equal(max_val - min_val, 0),
            spec - min_val,
            (spec - min_val) / (max_val - min_val),
        )
        return spec

    def decode(path):
        # Load audio file
        audio = get_audio(path)
        # Split audio file into frames, each with a 5-second duration
        audio = create_frames(audio)
        # Convert audio to spectrogram
        spec = keras.layers.MelSpectrogram(
            num_mel_bins=CFG.img_size[0],
            fft_length=CFG.nfft,
            sequence_stride=CFG.hop_length,
            sampling_rate=CFG.sample_rate,
        )(audio)
        # Apply normalization and standardization
        spec = apply_preproc(spec)
        # Convert spectrogram to a 3-channel image (for ImageNet weights)
        spec = tf.tile(spec[..., None], [1, 1, 1, 3])
        return spec

    return decode
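
Splitting a recording into fixed-size frames only works when the padded length is an exact multiple of the frame size. The NumPy sketch below shows one way to compute that padding; `split_into_frames` is a hypothetical helper and the 12-second signal is synthetic.

```python
import numpy as np


def split_into_frames(audio, frame_size):
    """Zero-pad the tail to a multiple of frame_size, then reshape into frames."""
    pad_len = (-len(audio)) % frame_size  # 0 when already a multiple
    padded = np.pad(audio, (0, pad_len))
    return padded.reshape(-1, frame_size)


sr = 32000
frame_size = 5 * sr                          # 5-second frames
audio = np.ones(12 * sr, dtype=np.float32)   # synthetic 12-second signal
frames = split_into_frames(audio, frame_size)
print(frames.shape)
```

The 12-second signal yields three frames, and the last one holds 2 seconds of audio followed by 3 seconds of zeros.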

In [11]:
# Build data loader
def build_dataset(paths, batch_size=1, decode_fn=None, cache=False):
    if decode_fn is None:
        decode_fn = build_decoder(dim=CFG.audio_len)  # decoder
    AUTO = tf.data.AUTOTUNE
    slices = (paths,)
    ds = tf.data.Dataset.from_tensor_slices(slices)
    ds = ds.map(
        decode_fn, num_parallel_calls=AUTO
    )  # decode audio, split it into frames, then convert frames to spectrograms
    ds = ds.cache() if cache else ds  # cache files
    ds = ds.batch(batch_size, drop_remainder=False)  # create batches
    ds = ds.prefetch(AUTO)
    return ds

In [12]:
# get size of model
# model.build((None, *CFG.img_size))
model.summary()

# Inference 🏃

In [16]:
import torch

# Initialize empty list to store ids
ids = []

# Initialize empty array to store predictions
preds = np.empty(shape=(0, CFG.num_classes), dtype="float32")

# Build test dataset
test_paths = test_df.filepath.tolist()
test_ds = build_dataset(paths=test_paths, batch_size=1)

# Iterate over each audio file in the test dataset
for idx, specs in enumerate(tqdm(iter(test_ds), desc="test ", total=len(test_df))):
    # Extract the filename without the extension
    filename = test_paths[idx].split("/")[-1].replace(".ogg", "")

    # Convert to backend-specific tensor while excluding extra dimension
    specs = keras.ops.convert_to_tensor(specs[0])

    # # tf tensor into pytorch tensor
    # specs = specs.numpy()
    # specs = torch.from_numpy(specs).permute(0, 3, 1, 2).float()

    # Predict bird species for all frames in a recording
    frame_preds = model.predict(specs, verbose=0)

    # Create an ID for each frame in a recording using the filename and the frame's end time in seconds
    frame_ids = [f"{filename}_{(frame_id+1)*5}" for frame_id in range(len(frame_preds))]

    # Concatenate the ids
    ids += frame_ids
    # Concatenate the predictions
    preds = np.concatenate([preds, frame_preds], axis=0)

test : 100%|██████████| 90/90 [00:05<00:00, 17.73it/s]
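
The row IDs follow the pattern `<filename>_<end second>`, one per 5-second frame. The sketch below reproduces that bookkeeping on made-up predictions, collecting per-file arrays in a list and concatenating once at the end; the file names, the three-class size, and the helper `make_row_ids` are illustrative assumptions.

```python
import numpy as np


def make_row_ids(filename, num_frames, frame_seconds=5):
    """Build one ID per frame, suffixed with the frame's end time in seconds."""
    return [f"{filename}_{(i + 1) * frame_seconds}" for i in range(num_frames)]


num_classes = 3  # small stand-in for CFG.num_classes
fake_outputs = {
    "rec_a": np.full((2, num_classes), 0.1, dtype="float32"),  # 2 frames
    "rec_b": np.full((3, num_classes), 0.2, dtype="float32"),  # 3 frames
}

ids, chunks = [], []
for filename, frame_preds in fake_outputs.items():
    ids += make_row_ids(filename, len(frame_preds))
    chunks.append(frame_preds)

# Concatenate once instead of growing the array inside the loop
preds = np.concatenate(chunks, axis=0)
print(ids, preds.shape)
```

Concatenating once avoids copying the growing prediction array on every iteration, which matters more as the number of recordings increases.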


In [19]:
import torch
import timm

model = timm.create_model(
    "tf_efficientnet_b0_ns",
    pretrained=True,
    num_classes=CFG.num_classes,
    global_pool="avg",
    in_chans=3,
)

model = model.eval()
# Initialize empty list to store ids
ids = []

# Initialize empty array to store predictions
preds = np.empty(shape=(0, CFG.num_classes), dtype="float32")

# Build test dataset
test_paths = test_df.filepath.tolist()
test_ds = build_dataset(paths=test_paths, batch_size=1)

# Iterate over each audio file in the test dataset
for idx, specs in enumerate(tqdm(iter(test_ds), desc="test ", total=len(test_df))):
    # Extract the filename without the extension
    filename = test_paths[idx].split("/")[-1].replace(".ogg", "")

    # Convert the TF tensor to a channels-first PyTorch tensor, dropping the batch dimension
    specs = torch.from_numpy(specs[0].numpy()).permute(0, 3, 1, 2).float()

    # Predict bird species for all frames in a recording. A timm model is a
    # plain PyTorch module, so call it directly instead of using `.predict`;
    # softmax turns the logits into per-class probabilities.
    with torch.no_grad():
        frame_preds = model(specs).softmax(dim=1).numpy()

    # Create an ID for each frame in a recording using the filename and the frame's end time in seconds
    frame_ids = [f"{filename}_{(frame_id+1)*5}" for frame_id in range(len(frame_preds))]

    # Concatenate the ids
    ids += frame_ids
    # Concatenate the predictions
    preds = np.concatenate([preds, frame_preds], axis=0)

test : 100%|██████████| 90/90 [00:05<00:00, 17.92it/s]


# Submission ✉️

In [13]:
# Submit prediction
pred_df = pd.DataFrame(ids, columns=["row_id"])
pred_df.loc[:, CFG.class_names] = preds
pred_df.to_csv("submission.csv", index=False)
pred_df.head()

| | row_id | asbfly | ashdro1 | ashpri1 | ashwoo2 | asikoe2 | asiope1 | aspfly1 | aspswi1 | barfly1 | ... | whbwoo2 | whcbar1 | whiter2 | whrmun | whtkin2 | woosan | wynlau1 | yebbab1 | yebbul3 | zitcis1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1384345978_5 | 0.004106 | 0.005859 | 0.006027 | 0.006039 | 0.006472 | 0.004889 | 0.007669 | 0.006761 | 0.006802 | ... | 0.005133 | 0.005576 | 0.005953 | 0.007071 | 0.005214 | 0.004628 | 0.005736 | 0.005262 | 0.006484 | 0.004008 |
| 1 | 1384345978_10 | 0.004106 | 0.005853 | 0.006028 | 0.006041 | 0.006469 | 0.004892 | 0.007661 | 0.00676 | 0.006799 | ... | 0.005134 | 0.005575 | 0.005958 | 0.007072 | 0.005214 | 0.00463 | 0.005738 | 0.005261 | 0.006479 | 0.004012 |
| 2 | 1384345978_15 | 0.004105 | 0.005854 | 0.006026 | 0.006039 | 0.006473 | 0.004891 | 0.007668 | 0.006758 | 0.006799 | ... | 0.005134 | 0.005572 | 0.005952 | 0.007076 | 0.005212 | 0.00463 | 0.005738 | 0.005261 | 0.006479 | 0.004005 |
| 3 | 1384345978_20 | 0.004104 | 0.005857 | 0.006026 | 0.00604 | 0.006468 | 0.004889 | 0.007661 | 0.006761 | 0.006794 | ... | 0.005134 | 0.005573 | 0.005957 | 0.007075 | 0.005208 | 0.004631 | 0.005738 | 0.00526 | 0.006473 | 0.004012 |
| 4 | 1384345978_25 | 0.004104 | 0.005854 | 0.006032 | 0.006042 | 0.006471 | 0.004886 | 0.007663 | 0.006758 | 0.006805 | ... | 0.005132 | 0.005575 | 0.005959 | 0.007071 | 0.005218 | 0.004634 | 0.00574 | 0.005261 | 0.00648 | 0.004007 |


In [16]:
len(pred_df)

4320
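
The row count of 4320 is consistent with 90 recordings of 48 five-second frames each. A few structural checks like this are cheap to run before writing the CSV. The sketch below builds a tiny submission-shaped frame from made-up probabilities; `build_submission` is a hypothetical helper, and only three of the species columns are used to keep the example short.

```python
import numpy as np
import pandas as pd


def build_submission(ids, preds, class_names):
    """Assemble a frame with `row_id` first, followed by one column per class."""
    df = pd.DataFrame(preds, columns=class_names)
    df.insert(0, "row_id", ids)
    return df


class_names = ["asbfly", "ashdro1", "zitcis1"]  # three of the species columns
ids = ["1384345978_5", "1384345978_10"]
preds = np.array([[0.2, 0.3, 0.5], [0.1, 0.6, 0.3]], dtype="float32")  # made-up values

sub = build_submission(ids, preds, class_names)
print(sub.shape)
```

Checking that `row_id` values are unique and that the column order matches the class list helps catch misaligned predictions early.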

# Reference ✍️
* [Fake Speech Detection: Conformer [TF]](https://www.kaggle.com/code/awsaf49/fake-speech-detection-conformer-tf/) by @awsaf49
* [RANZCR: EfficientNet TPU Training](https://www.kaggle.com/code/xhlulu/ranzcr-efficientnet-tpu-training) by @xhlulu
* [Triple Stratified KFold with TFRecords](https://www.kaggle.com/code/cdeotte/triple-stratified-kfold-with-tfrecords) by @cdeotte