## About

This notebook makes a submission with a ResNet-based model trained on log-mel spectrograms. I will publish a separate notebook showing how I trained the model; here I briefly describe the approach.

* Randomly crop 5 seconds from each training audio clip every epoch.
* No augmentation.
* Start from the pretrained weights of `torchvision.models.resnet50`.
* Use `BCELoss`.
* Train for 100 epochs and keep the weights with the best F1 (reached at epoch 92).
* `Adam` optimizer (`lr=0.001`) with `CosineAnnealingLR` (`T_max=10`).
* Split the dataset with `StratifiedKFold(n_splits=5)` and use only the first fold.
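
As a rough illustration of two items in the list above, the sketch below picks a random 5-second window and evaluates the learning rate a cosine-annealing schedule would give at a few epochs. It uses only the standard library. `random_crop_bounds` and `cosine_annealing_lr` are hypothetical helper names, and the formula is the usual closed form for cosine annealing with a minimum rate of 0, which is an assumption; the training notebook may differ.

```python
import math
import random


def random_crop_bounds(n_samples, sr=32000, seconds=5, rng=random):
    """Return (start, end) sample indices of a random window of `seconds`."""
    crop_len = sr * seconds
    if n_samples <= crop_len:
        # Clip is not longer than the window: use all of it
        # (any padding is left to the caller).
        return 0, n_samples
    start = rng.randint(0, n_samples - crop_len)  # inclusive on both ends
    return start, start + crop_len


def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=10, eta_min=0.0):
    """Closed-form cosine annealing learning rate for a given epoch."""
    cos_term = 1 + math.cos(math.pi * epoch / t_max)
    return eta_min + (base_lr - eta_min) * cos_term / 2


start, end = random_crop_bounds(n_samples=32000 * 30)
print(end - start)  # 160000 samples = 5 s at 32 kHz

for epoch in (0, 5, 10, 15, 20):
    print(epoch, f"{cosine_annealing_lr(epoch):.6f}")
```

Under this closed form, `T_max=10` makes the rate fall to its minimum at epoch 10 and climb back to `lr` at epoch 20, so the pattern repeats every 20 epochs over a 100-epoch run.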

Here are the parameter details.

* `batch_size`: 100 (on a V100, 100 epochs took 2 to 3 hours)
* melspectrogram parameters
  - `n_mels`: 128
  - `fmin`: 20
  - `fmax`: 16000
* image size: 224 x 541 (I don't remember the exact width)
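
The image width can be estimated from the parameters above. The sketch below assumes librosa's default `hop_length=512` and `center=True` for `melspectrogram` (neither is stated in this notebook) and the aspect-preserving resize to a height of 224 that `TestDataset` applies later. Under those assumptions a 5-second clip at 32 kHz comes out 547 pixels wide. `melspec_image_shape` is a hypothetical helper name.

```python
def melspec_image_shape(seconds=5, sr=32000, n_mels=128,
                        hop_length=512, img_size=224):
    """Estimate (n_frames, (height, width)) of the resized spectrogram image."""
    n_samples = seconds * sr
    # With center=True the signal is padded, giving 1 + n_samples // hop frames.
    n_frames = 1 + n_samples // hop_length
    # Same width formula as the cv2.resize call in TestDataset.
    width = int(n_frames * img_size / n_mels)
    return n_frames, (img_size, width)


n_frames, shape = melspec_image_shape()
print(n_frames, shape)  # 313 (224, 547)
```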

## Libraries

In [1]:
import cv2 #image processing
import audioread #reading and processing audio files
import logging #record events and errors for debugging and monitoring
import os #file and directory manipulation
import random #shuffling data
import time #calculate time taken or introduce delays
import warnings #issuing warning messages about potential issues

import librosa 
import numpy as np
import pandas as pd
import soundfile as sf #reading and writing audio files
import torch #core module of PyTorch
import torch.nn as nn #building and training neural networks
import torch.nn.functional as F #for element-wise functions
import torch.utils.data as data #for loading and batching data during training

from contextlib import contextmanager #to ensure that resources are properly managed
from pathlib import Path #for working with paths in a more object-oriented way
from typing import Optional #type hint

from fastprogress import progress_bar 
from sklearn.metrics import f1_score
from torchvision import models #provides pre-trained models

## Utilities

In [2]:
def set_seed(seed: int = 42): #sets random seed for different libraries
    random.seed(seed) 
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed) #for hash functions
    torch.manual_seed(seed) #random of pytorch
    torch.cuda.manual_seed(seed)  # type: ignore #same but for GPU
    torch.backends.cudnn.deterministic = True  # type: ignore #deterministic mode
    torch.backends.cudnn.benchmark = False  # type: ignore #autotuner may select non-deterministic kernels
    
    
def get_logger(out_file=None): 
    logger = logging.getLogger() 
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s") #timestamp, log level, message
    logger.handlers = [] #drop existing handlers to avoid duplicated log lines
    logger.setLevel(logging.INFO) #capture INFO level and above

    handler = logging.StreamHandler() #to the console
    handler.setFormatter(formatter)
    handler.setLevel(logging.INFO)
    logger.addHandler(handler)

    if out_file is not None:
        fh = logging.FileHandler(out_file)
        fh.setFormatter(formatter)
        fh.setLevel(logging.INFO)
        logger.addHandler(fh)
    logger.info("logger set up")
    return logger
    
    
@contextmanager #measure execution time 
def timer(name: str, logger: Optional[logging.Logger] = None):
    t0 = time.time() #current time
    msg = f"[{name}] start" 
    if logger is None:
        print(msg)
    else:
        logger.info(msg)
    yield

    msg = f"[{name}] done in {time.time() - t0:.2f} s"
    if logger is None:
        print(msg)
    else:
        logger.info(msg)

In [20]:
logger = get_logger("main.log")
set_seed(1213)

2024-01-10 14:09:25,439 - INFO - logger set up
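
A brief usage sketch for the `timer` helper follows. To keep the block self-contained it redefines a simplified `timer` that only prints. The `try`/`finally` is an addition that the version above does not have; it makes the elapsed time appear even if the body raises.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(name: str):
    """Simplified stand-in for the notebook's timer (print only)."""
    t0 = time.time()
    print(f"[{name}] start")
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises.
        print(f"[{name}] done in {time.time() - t0:.2f} s")


with timer("sleep"):
    time.sleep(0.1)
```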


## Data Loading

In [15]:
TARGET_SR = 32000

In [4]:
test = pd.read_csv("/kaggle/input/birdcall-check/test.csv")
test_audio = "/kaggle/input/birdcall-check/test_audio"

test.head()

Unnamed: 0,site,row_id,seconds,audio_id
0,site_1,site_1_41e6fe6504a34bf6846938ba78d13df1_5,5.0,41e6fe6504a34bf6846938ba78d13df1
1,site_1,site_1_41e6fe6504a34bf6846938ba78d13df1_10,10.0,41e6fe6504a34bf6846938ba78d13df1
2,site_1,site_1_41e6fe6504a34bf6846938ba78d13df1_15,15.0,41e6fe6504a34bf6846938ba78d13df1
3,site_1,site_1_41e6fe6504a34bf6846938ba78d13df1_20,20.0,41e6fe6504a34bf6846938ba78d13df1
4,site_1,site_1_41e6fe6504a34bf6846938ba78d13df1_25,25.0,41e6fe6504a34bf6846938ba78d13df1


## Define Model

In [5]:
class ResNet(nn.Module): #torchvision ResNet encoder with a custom classification head
    def __init__(self, base_model_name: str, pretrained=False, #pretrained: load ImageNet weights
                 num_classes=264):
        super().__init__() #initializing the base class
        base_model = getattr(models, base_model_name)(
            pretrained=pretrained)
        layers = list(base_model.children())[:-2] #except pooling and dense
        layers.append(nn.AdaptiveMaxPool2d(1)) 
        self.encoder = nn.Sequential(*layers)

        in_features = base_model.fc.in_features #number of input features

        self.classifier = nn.Sequential(
            nn.Linear(in_features, 1024), nn.ReLU(), nn.Dropout(p=0.2), 
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p=0.2),
            nn.Linear(1024, num_classes))

    def forward(self, x):
        batch_size = x.size(0) #number of images in the batch
        x = self.encoder(x).view(batch_size, -1) #flatten to (batch_size, in_features)
        x = self.classifier(x) 
        multiclass_proba = F.softmax(x, dim=1)
        multilabel_proba = torch.sigmoid(x) #F.sigmoid is deprecated
        return {
            "logits": x,
            "multiclass_proba": multiclass_proba,
            "multilabel_proba": multilabel_proba
        }
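
The forward pass returns both a softmax and a sigmoid view of the same logits. The pure-Python sketch below, with made-up logits and no PyTorch, illustrates why the prediction loop thresholds the sigmoid output: sigmoid scores each class independently, so several birds (or none) can pass a threshold, whereas softmax scores always sum to 1.

```python
import math


def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 2.5, -4.0]  # hypothetical logits for three classes

multilabel = sigmoid(logits)
multiclass = softmax(logits)

print([round(p, 3) for p in multilabel])
print([round(p, 3) for p in multiclass])

# Two classes clear a 0.8 threshold under sigmoid; none do under softmax.
print([i for i, p in enumerate(multilabel) if p >= 0.8])  # [0, 1]
print([i for i, p in enumerate(multiclass) if p >= 0.8])  # []
```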

## Parameters

In [6]:
model_config = {
    "base_model_name": "resnet50",
    "pretrained": False,
    "num_classes": 264
}

melspectrogram_parameters = {
    "n_mels": 128, #number of Mel bins
    "fmin": 20,
    "fmax": 16000
}

weights_path = "../input/birdcall-resnet50-init-weights/best.pth"

In [7]:
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv("/kaggle/input/birdsong-recognition/train.csv")

unique_bird_names = df['ebird_code'].unique()

label_encoder = LabelEncoder()

encoded_labels = label_encoder.fit_transform(unique_bird_names)

BIRD_CODE = dict(zip(unique_bird_names, encoded_labels))

# for bird_name, label in BIRD_CODE.items():
#     print(f"{bird_name}: {label}")
    
INV_BIRD_CODE = {v: k for k, v in BIRD_CODE.items()}
# for bird_name, label in INV_BIRD_CODE.items():
#     print(f"{bird_name}: {label}")
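
`LabelEncoder` assigns integer labels in sorted order of the class names, so the mapping built above is effectively alphabetical. Below is a standard-library sketch of the same idea, using a few species codes chosen purely for illustration. The indices are only meaningful if the training notebook built its mapping the same way, which this notebook does not show.

```python
codes = ["amecro", "aldfly", "amebit", "ameavo"]  # illustrative subset, unsorted

# Sorted unique names get consecutive integer labels starting at 0.
bird_code = {name: idx for idx, name in enumerate(sorted(set(codes)))}
inv_bird_code = {idx: name for name, idx in bird_code.items()}

print(bird_code)         # {'aldfly': 0, 'ameavo': 1, 'amebit': 2, 'amecro': 3}
print(inv_bird_code[0])  # aldfly
```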

## Define Dataset

In [8]:
def mono_to_color(X: np.ndarray,
                  mean=None,
                  std=None,
                  norm_max=None,
                  norm_min=None,
                  eps=1e-6):
    
    X = np.stack([X, X, X], axis=-1)

    # Standardize
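    mean = X.mean() if mean is None else mean  # `mean or ...` would ignore an explicit 0.0
    X = X - mean
    std = X.std() if std is None else std
    Xstd = X / (std + eps)
    _min, _max = Xstd.min(), Xstd.max()
    norm_max = _max if norm_max is None else norm_max
    norm_min = _min if norm_min is None else norm_min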
    
    if (_max - _min) > eps:
        # Normalize to [0, 255]
        V = Xstd
        V[V < norm_min] = norm_min
        V[V > norm_max] = norm_max
        V = 255 * (V - norm_min) / (norm_max - norm_min)
        V = V.astype(np.uint8)
    else:
        V = np.zeros_like(Xstd, dtype=np.uint8)
    return V


class TestDataset(data.Dataset):
    def __init__(self, df: pd.DataFrame, clip: np.ndarray,
                 img_size=224, melspectrogram_parameters=None):
        self.df = df
        self.clip = clip
        self.img_size = img_size
        self.melspectrogram_parameters = melspectrogram_parameters or {} #avoid a shared mutable default
        
    def __len__(self):
        return len(self.df) #number of samples
    
    def __getitem__(self, idx: int):
        SR = 32000
        sample = self.df.loc[idx, :] #return row
        site = sample.site
        row_id = sample.row_id
        
        if site == "site_3":
            y = self.clip.astype(np.float32)
            len_y = len(y)
            start = 0
            end = SR * 5
            images = []
            while len_y > start:
                y_batch = y[start:end].astype(np.float32)
                if len(y_batch) != (SR * 5):
                    break
                start = end
                end = end + SR * 5
                
                melspec = librosa.feature.melspectrogram(y=y_batch,
                                                         sr=SR,
                                                         **self.melspectrogram_parameters)
                melspec = librosa.power_to_db(melspec).astype(np.float32)
                image = mono_to_color(melspec)
                height, width, _ = image.shape
                image = cv2.resize(image, (int(width * self.img_size / height), self.img_size))
                image = np.moveaxis(image, 2, 0) #color channel axis to the first dimension
                image = (image / 255.0).astype(np.float32)
                images.append(image)
            images = np.asarray(images)
            return images, row_id, site
        else:
            end_seconds = int(sample.seconds)
            start_seconds = int(end_seconds - 5)
            
            start_index = SR * start_seconds
            end_index = SR * end_seconds
            
            y = self.clip[start_index:end_index].astype(np.float32)

            melspec = librosa.feature.melspectrogram(y=y, sr=SR, **self.melspectrogram_parameters)
            melspec = librosa.power_to_db(melspec).astype(np.float32)

            image = mono_to_color(melspec)
            height, width, _ = image.shape
            image = cv2.resize(image, (int(width * self.img_size / height), self.img_size))
            image = np.moveaxis(image, 2, 0)
            image = (image / 255.0).astype(np.float32)

            return image, row_id, site
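
The two branches of `__getitem__` slice the clip differently. For `site_1` and `site_2`, each row covers the 5 seconds ending at `seconds`. For `site_3`, the whole clip is cut into consecutive 5-second chunks and a trailing partial chunk is dropped. The sketch below reproduces only the index arithmetic; the function names are hypothetical.

```python
SR = 32000
WINDOW = 5  # seconds


def row_window(end_seconds, sr=SR):
    """Sample range for a site_1/site_2 row: the 5 s ending at end_seconds."""
    start_seconds = end_seconds - WINDOW
    return sr * start_seconds, sr * end_seconds


def full_clip_windows(n_samples, sr=SR):
    """Consecutive, non-overlapping 5 s ranges; a short final chunk is skipped."""
    step = sr * WINDOW
    return [(start, start + step)
            for start in range(0, n_samples - step + 1, step)]


print(row_window(10))              # (160000, 320000)
print(full_clip_windows(SR * 12))  # [(0, 160000), (160000, 320000)]
```

A 12-second clip therefore yields two full chunks, and the final 2 seconds are never scored.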

## Prediction loop

In [9]:
def get_model(config: dict, weights_path: str):
    model = ResNet(**config)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    checkpoint = torch.load(weights_path, map_location=device) #checkpoint saved during training
    model.load_state_dict(checkpoint["model_state_dict"]) #restore the trained parameters
    model.to(device)
    model.eval()
    return model

In [10]:
def prediction_for_clip(test_df: pd.DataFrame, 
                        clip: np.ndarray, 
                        model: ResNet, 
                        mel_params: dict, 
                        threshold=0.5):

    dataset = TestDataset(df=test_df, 
                          clip=clip,
                          img_size=224,
                          melspectrogram_parameters=mel_params)
    loader = data.DataLoader(dataset, batch_size=1, shuffle=False)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    model.eval()
    prediction_dict = {}
    for image, row_id, site in progress_bar(loader):
        site = site[0]
        row_id = row_id[0]
        if site in {"site_1", "site_2"}:
            image = image.to(device)

            with torch.no_grad():
                prediction = model(image)
                proba = prediction["multilabel_proba"].detach().cpu().numpy().reshape(-1)

            events = proba >= threshold
            labels = np.argwhere(events).reshape(-1).tolist()

        else:
            # to avoid prediction on large batch
            image = image.squeeze(0)
            batch_size = 16
            whole_size = image.size(0)
            if whole_size % batch_size == 0:
                n_iter = whole_size // batch_size
            else:
                n_iter = whole_size // batch_size + 1
                
            all_events = set()
            for batch_i in range(n_iter):
                batch = image[batch_i * batch_size:(batch_i + 1) * batch_size]
                if batch.ndim == 3:
                    batch = batch.unsqueeze(0)

                batch = batch.to(device)
                with torch.no_grad():
                    prediction = model(batch)
                    proba = prediction["multilabel_proba"].detach().cpu().numpy()
                    
                events = proba >= threshold
                for i in range(len(events)):
                    event = events[i, :]
                    labels = np.argwhere(event).reshape(-1).tolist()
                    for label in labels:
                        all_events.add(label)
                        
            labels = sorted(all_events) #sorted so the label order is deterministic
        if len(labels) == 0:
            prediction_dict[row_id] = "nocall"
        else:
            labels_str_list = [INV_BIRD_CODE[label] for label in labels]
            label_string = " ".join(labels_str_list)
            prediction_dict[row_id] = label_string
    return prediction_dict
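
The final step of `prediction_for_clip` turns per-class probabilities into the submission string. Here is a minimal standalone sketch with a made-up three-class mapping; the real `INV_BIRD_CODE` is built from `train.csv`, and `to_label_string` is a hypothetical name.

```python
inv_bird_code = {0: "aldfly", 1: "ameavo", 2: "amebit"}  # hypothetical subset


def to_label_string(proba, threshold=0.5):
    """Join the names of all classes at or above threshold, or return 'nocall'."""
    labels = [i for i, p in enumerate(proba) if p >= threshold]
    if not labels:
        return "nocall"
    return " ".join(inv_bird_code[i] for i in labels)


print(to_label_string([0.91, 0.10, 0.85], threshold=0.8))  # aldfly amebit
print(to_label_string([0.30, 0.10, 0.20], threshold=0.8))  # nocall
```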

In [13]:
def prediction(test_df: pd.DataFrame,
               test_audio: Path,
               model_config: dict,
               mel_params: dict,
               weights_path: str,
               threshold=0.5):
    model = get_model(model_config, weights_path)
    unique_audio_id = test_df.audio_id.unique()

    warnings.filterwarnings("ignore")
    prediction_dfs = []
    for audio_id in unique_audio_id:
        with timer(f"Loading {audio_id}", logger):
            clip, _ = librosa.load(os.path.join(test_audio, audio_id + ".mp3"),
                                   sr=TARGET_SR,
                                   mono=True,
                                   res_type="kaiser_fast")
        
        test_df_for_audio_id = test_df.query(
            f"audio_id == '{audio_id}'").reset_index(drop=True)
        with timer(f"Prediction on {audio_id}", logger):
            prediction_dict = prediction_for_clip(test_df_for_audio_id,
                                                  clip=clip,
                                                  model=model,
                                                  mel_params=mel_params,
                                                  threshold=threshold)
        row_id = list(prediction_dict.keys())
        birds = list(prediction_dict.values())
        prediction_df = pd.DataFrame({
            "row_id": row_id,
            "birds": birds
        })
        prediction_dfs.append(prediction_df)
    
    prediction_df = pd.concat(prediction_dfs, axis=0, sort=False).reset_index(drop=True)
    return prediction_df

## Prediction

In [16]:
submission = prediction(test_df=test,
                        test_audio=test_audio,
                        model_config=model_config,
                        mel_params=melspectrogram_parameters,
                        weights_path=weights_path,
                        threshold=0.8)
submission.to_csv("submission.csv", index=False)

2024-01-10 13:40:14,516 - INFO - [Loading 41e6fe6504a34bf6846938ba78d13df1] start
2024-01-10 13:40:15,842 - INFO - [Loading 41e6fe6504a34bf6846938ba78d13df1] done in 1.33 s
2024-01-10 13:40:15,856 - INFO - NumExpr defaulting to 4 threads.
2024-01-10 13:40:15,863 - INFO - [Prediction on 41e6fe6504a34bf6846938ba78d13df1] start


2024-01-10 13:40:18,221 - INFO - [Prediction on 41e6fe6504a34bf6846938ba78d13df1] done in 2.36 s
2024-01-10 13:40:18,223 - INFO - [Loading cce64fffafed40f2b2f3d3413ec1c4c2] start
2024-01-10 13:40:18,889 - INFO - [Loading cce64fffafed40f2b2f3d3413ec1c4c2] done in 0.67 s
2024-01-10 13:40:18,896 - INFO - [Prediction on cce64fffafed40f2b2f3d3413ec1c4c2] start


2024-01-10 13:40:19,078 - INFO - [Prediction on cce64fffafed40f2b2f3d3413ec1c4c2] done in 0.18 s
2024-01-10 13:40:19,080 - INFO - [Loading 99af324c881246949408c0b1ae54271f] start
2024-01-10 13:40:19,771 - INFO - [Loading 99af324c881246949408c0b1ae54271f] done in 0.69 s
2024-01-10 13:40:19,777 - INFO - [Prediction on 99af324c881246949408c0b1ae54271f] start


2024-01-10 13:40:19,960 - INFO - [Prediction on 99af324c881246949408c0b1ae54271f] done in 0.18 s
2024-01-10 13:40:19,962 - INFO - [Loading 6ab74e177aa149468a39ca10beed6222] start
2024-01-10 13:40:20,584 - INFO - [Loading 6ab74e177aa149468a39ca10beed6222] done in 0.62 s
2024-01-10 13:40:20,591 - INFO - [Prediction on 6ab74e177aa149468a39ca10beed6222] start


2024-01-10 13:40:20,746 - INFO - [Prediction on 6ab74e177aa149468a39ca10beed6222] done in 0.16 s
2024-01-10 13:40:20,749 - INFO - [Loading b2fd3f01e9284293a1e33f9c811a2ed6] start
2024-01-10 13:40:21,399 - INFO - [Loading b2fd3f01e9284293a1e33f9c811a2ed6] done in 0.65 s
2024-01-10 13:40:21,405 - INFO - [Prediction on b2fd3f01e9284293a1e33f9c811a2ed6] start


2024-01-10 13:40:21,586 - INFO - [Prediction on b2fd3f01e9284293a1e33f9c811a2ed6] done in 0.18 s
2024-01-10 13:40:21,588 - INFO - [Loading de62b37ebba749d2abf29d4a493ea5d4] start
2024-01-10 13:40:21,901 - INFO - [Loading de62b37ebba749d2abf29d4a493ea5d4] done in 0.31 s
2024-01-10 13:40:21,906 - INFO - [Prediction on de62b37ebba749d2abf29d4a493ea5d4] start


2024-01-10 13:40:21,940 - INFO - [Prediction on de62b37ebba749d2abf29d4a493ea5d4] done in 0.03 s
2024-01-10 13:40:21,942 - INFO - [Loading 8680a8dd845d40f296246dbed0d37394] start
2024-01-10 13:40:22,706 - INFO - [Loading 8680a8dd845d40f296246dbed0d37394] done in 0.76 s
2024-01-10 13:40:22,712 - INFO - [Prediction on 8680a8dd845d40f296246dbed0d37394] start


2024-01-10 13:40:22,945 - INFO - [Prediction on 8680a8dd845d40f296246dbed0d37394] done in 0.23 s
2024-01-10 13:40:22,947 - INFO - [Loading 940d546e5eb745c9a74bce3f35efa1f9] start
2024-01-10 13:40:24,027 - INFO - [Loading 940d546e5eb745c9a74bce3f35efa1f9] done in 1.08 s
2024-01-10 13:40:24,033 - INFO - [Prediction on 940d546e5eb745c9a74bce3f35efa1f9] start


2024-01-10 13:40:24,380 - INFO - [Prediction on 940d546e5eb745c9a74bce3f35efa1f9] done in 0.35 s
2024-01-10 13:40:24,382 - INFO - [Loading 07ab324c602e4afab65ddbcc746c31b5] start
2024-01-10 13:40:24,922 - INFO - [Loading 07ab324c602e4afab65ddbcc746c31b5] done in 0.54 s
2024-01-10 13:40:24,929 - INFO - [Prediction on 07ab324c602e4afab65ddbcc746c31b5] start


2024-01-10 13:40:25,074 - INFO - [Prediction on 07ab324c602e4afab65ddbcc746c31b5] done in 0.15 s
2024-01-10 13:40:25,076 - INFO - [Loading 899616723a32409c996f6f3441646c2a] start
2024-01-10 13:40:25,914 - INFO - [Loading 899616723a32409c996f6f3441646c2a] done in 0.84 s
2024-01-10 13:40:25,921 - INFO - [Prediction on 899616723a32409c996f6f3441646c2a] start


2024-01-10 13:40:26,179 - INFO - [Prediction on 899616723a32409c996f6f3441646c2a] done in 0.26 s
2024-01-10 13:40:26,181 - INFO - [Loading 9cc5d9646f344f1bbb52640a988fe902] start
2024-01-10 13:40:29,395 - INFO - [Loading 9cc5d9646f344f1bbb52640a988fe902] done in 3.21 s
2024-01-10 13:40:29,401 - INFO - [Prediction on 9cc5d9646f344f1bbb52640a988fe902] start


2024-01-10 13:40:31,440 - INFO - [Prediction on 9cc5d9646f344f1bbb52640a988fe902] done in 2.04 s
2024-01-10 13:40:31,442 - INFO - [Loading a56e20a518684688a9952add8a9d5213] start
2024-01-10 13:40:32,021 - INFO - [Loading a56e20a518684688a9952add8a9d5213] done in 0.58 s
2024-01-10 13:40:32,029 - INFO - [Prediction on a56e20a518684688a9952add8a9d5213] start


2024-01-10 13:40:32,958 - INFO - [Prediction on a56e20a518684688a9952add8a9d5213] done in 0.93 s
2024-01-10 13:40:32,960 - INFO - [Loading 96779836288745728306903d54e264dd] start
2024-01-10 13:40:33,408 - INFO - [Loading 96779836288745728306903d54e264dd] done in 0.45 s
2024-01-10 13:40:33,414 - INFO - [Prediction on 96779836288745728306903d54e264dd] start


2024-01-10 13:40:34,655 - INFO - [Prediction on 96779836288745728306903d54e264dd] done in 1.24 s
2024-01-10 13:40:34,657 - INFO - [Loading f77783ba4c6641bc918b034a18c23e53] start
2024-01-10 13:40:35,008 - INFO - [Loading f77783ba4c6641bc918b034a18c23e53] done in 0.35 s
2024-01-10 13:40:35,014 - INFO - [Prediction on f77783ba4c6641bc918b034a18c23e53] start


2024-01-10 13:40:35,050 - INFO - [Prediction on f77783ba4c6641bc918b034a18c23e53] done in 0.04 s
2024-01-10 13:40:35,052 - INFO - [Loading 856b194b097441958697c2bcd1f63982] start
2024-01-10 13:40:35,634 - INFO - [Loading 856b194b097441958697c2bcd1f63982] done in 0.58 s
2024-01-10 13:40:35,641 - INFO - [Prediction on 856b194b097441958697c2bcd1f63982] start


2024-01-10 13:40:35,750 - INFO - [Prediction on 856b194b097441958697c2bcd1f63982] done in 0.11 s


In [17]:
submission

Unnamed: 0,row_id,birds
0,site_1_41e6fe6504a34bf6846938ba78d13df1_5,aldfly
1,site_1_41e6fe6504a34bf6846938ba78d13df1_10,aldfly
2,site_1_41e6fe6504a34bf6846938ba78d13df1_15,aldfly
3,site_1_41e6fe6504a34bf6846938ba78d13df1_20,nocall
4,site_1_41e6fe6504a34bf6846938ba78d13df1_25,aldfly
...,...,...
71,site_3_9cc5d9646f344f1bbb52640a988fe902,aldfly
72,site_3_a56e20a518684688a9952add8a9d5213,aldfly
73,site_3_96779836288745728306903d54e264dd,aldfly
74,site_3_f77783ba4c6641bc918b034a18c23e53,aldfly


## EOF

In [1]:
import librosa

print(librosa.__version__)


0.7.2
