<a href="https://colab.research.google.com/github/lzziuhh/machine_learning/blob/master/pytorch_metric_learning_effnet_arcface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[<img src="https://github.com/KevinMusgrave/pytorch-metric-learning/raw/master/docs/imgs/Logo2.png">]()

## Introduction

This notebook makes use of the fantastic library `pytorch-metric-learning` developed and maintained by Kevin Musgrave. You can find the github at the following link:

- https://github.com/KevinMusgrave/pytorch-metric-learning

You can find a ton of useful metric learning modules there, along with a super friendly API for rapid training and evaluation. I recommend reading through the example notebooks because they are very well put together (below borrows from them heavily).

Here we use the library to train a basic whale detector using an efficient net backbone (https://arxiv.org/abs/1905.11946) with ArcFace loss (https://arxiv.org/abs/1801.07698). This is a very straightforward example and there are many ways to improve. Here are some suggestions:

- Change the train/validation split to better resemble the public LB.
- Change the model trunk.
- Pre-process the images by e.g. applying bounding boxes.
- Experiment with the training proceedure.

I will continue to develop this notebook over time and hopefully improve the results.

All feedback appreciated.

**Change Log**

- Version 9: switched to 384x384 dataset, added training augmentation, and switched from Adam to SGD with cosine schedule.
- Version 8 (LB: 0.245): fixed bug where same individual predicted multiple times for single image and increased the KNN search range.
- Version 6 (LB: 0.229): switched to cropped YOLO5 input, switched to b3 model, reduced epochs, and updated logging.
- Version 4 (LB: 0.190): initial notebook completed.

## Check GPU Type

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Feb 21 23:09:08 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    22W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Google drive connection

In [2]:
import glob
from google.colab import drive
drive.mount('/content/gdrive')

!ln -s /content/gdrive/My\ Drive/ /mydrive

!ls /mydrive

Mounted at /content/gdrive
 bangali	    download	       JTdemo  'My Drive'
'Colab Notebooks'   download.gslides   kaggle


# Installing the [Kaggle API](https://github.com/Kaggle/kaggle-api) in Colab

In [3]:
!pip install kaggle



# Authenticating with Kaggle using kaggle.json

Navigate to https://www.kaggle.com. Then go to the [Account tab of your user profile](https://www.kaggle.com/me/account) and select Create API Token. This will trigger the download of kaggle.json, a file containing your API credentials.

Then run the cell below to upload kaggle.json to your Colab runtime.

In [4]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 68 bytes


# Using the Kaggle API

For a more complete list of what you can do with the API, visit https://github.com/Kaggle/kaggle-api.

## Listing competitions

In [5]:
!kaggle competitions list

ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
contradictory-my-dear-watson                   2030-07-01 23:59:00  Getting Started     Prizes         67           False  
gan-getting-started                            2030-07-01 23:59:00  Getting Started     Prizes         95           False  
store-sales-time-series-forecasting            2030-06-30 23:59:00  Getting Started  Knowledge        782           False  
tpu-getting-started                            2030-06-03 23:59:00  Getting Started  Knowledge        163           False  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       1791           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      13634            True  
house-pr

## download dataset

In [6]:
!kaggle datasets download -d rdizzl3/jpeg-happywhale-384x384

Downloading jpeg-happywhale-384x384.zip to /content
100% 4.41G/4.42G [01:06<00:00, 37.0MB/s]
100% 4.42G/4.42G [01:06<00:00, 71.2MB/s]


In [7]:
# unzip the download dataset
!mkdir /kaggle
!mkdir /kaggle/input
!mkdir /kaggle/logs
!mkdir /kaggle/models


In [8]:
!unzip -q ./jpeg-happywhale-384x384.zip -d /kaggle/input/

In [9]:
# !kaggle competitions files  happy-whale-and-dolphin/train.csv

## Dependencies

In [10]:
!pip install timm
!pip install pytorch-metric-learning[with-hooks]

Collecting timm
  Downloading timm-0.5.4-py3-none-any.whl (431 kB)
[?25l[K     |▊                               | 10 kB 22.5 MB/s eta 0:00:01[K     |█▌                              | 20 kB 30.3 MB/s eta 0:00:01[K     |██▎                             | 30 kB 36.1 MB/s eta 0:00:01[K     |███                             | 40 kB 34.6 MB/s eta 0:00:01[K     |███▉                            | 51 kB 14.7 MB/s eta 0:00:01[K     |████▋                           | 61 kB 17.0 MB/s eta 0:00:01[K     |█████▎                          | 71 kB 14.1 MB/s eta 0:00:01[K     |██████                          | 81 kB 11.7 MB/s eta 0:00:01[K     |██████▉                         | 92 kB 13.0 MB/s eta 0:00:01[K     |███████▋                        | 102 kB 13.6 MB/s eta 0:00:01[K     |████████▍                       | 112 kB 13.6 MB/s eta 0:00:01[K     |█████████▏                      | 122 kB 13.6 MB/s eta 0:00:01[K     |█████████▉                      | 133 kB 13.6 MB/s eta 0:00:01

## Imports

In [11]:
import os
import glob
import pandas as pd
import numpy as np
import logging
import timm
from tqdm.notebook import tqdm

import torch
import torch.nn as nn
import torch.optim as optim

from torch.utils.data import Dataset, DataLoader
from torchvision.io import ImageReadMode, read_image
from torchvision.transforms import Compose, Lambda, Normalize, AutoAugment, AutoAugmentPolicy

import pytorch_metric_learning
import pytorch_metric_learning.utils.logging_presets as LP
from pytorch_metric_learning.utils import common_functions
from pytorch_metric_learning import losses, miners, samplers, testers, trainers
from pytorch_metric_learning.utils.accuracy_calculator import AccuracyCalculator
from pytorch_metric_learning.utils.inference import InferenceModel

for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

logging.getLogger().setLevel(logging.INFO)
logging.info("VERSION %s" % pytorch_metric_learning.__version__)

INFO:root:VERSION 1.1.2


In [12]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

## Parameters

There is no logic behind these, really. Go wild.

In [13]:
MODEL_NAME='tf_efficientnet_b3_ns'
N_CLASSES=15587
OUTPUT_SIZE = 1536
EMBEDDING_SIZE = 512
N_EPOCH=15
BATCH_SIZE=24
MODEL_LR = 1e-3
PCT_START=0.3
PATIENCE=5
N_WORKER=2
N_NEIGHBOURS = 1000

## Directories

We have now switched to using cropped images provided by Awsaf in the following notebook: https://www.kaggle.com/awsaf49/happywhale-cropped-dataset-yolov5. Please go give him an upvote if you like this notebook.

In [14]:
TRAIN_DIR = '/kaggle/input/train_images-384-384/train_images-384-384'
TEST_DIR = '/kaggle/input/test_images-384-384/test_images-384-384'
LOG_DIR = "/kaggle/logs/{}".format(MODEL_NAME)
MODEL_DIR = "/kaggle/models/{}".format(MODEL_NAME)

## Dataset

Create a basic dataset for loading images. 

Since we're planning to use pre-trained imagenet weights we need to normalize appropriately.

In [15]:

class HappyWhaleDataset(Dataset):
    def __init__(
        self,
        df: pd.DataFrame,
        image_dir: str,
        return_labels=True,
    ):
        self.df = df
        self.images = self.df["image"]
        self.image_dir = image_dir
        self.image_transform = Compose(
            [
                AutoAugment(AutoAugmentPolicy.IMAGENET),
                Lambda(lambda x: x / 255),
                Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
                
            ]
        )
        self.return_labels = return_labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        
        image_path = os.path.join(self.image_dir, self.images.iloc[idx])
        image = read_image(path=image_path)
        image = self.image_transform(image)
        
        if self.return_labels:
            label = self.df['label'].iloc[idx]
            return image, label
        else:
            return image


# Data Split

Load in the csv:

In [16]:
!cp /content/gdrive/MyDrive/kaggle/Whale/input/train.csv /kaggle/input/
!cp /content/gdrive/MyDrive/kaggle/Whale/input/sample_submission.csv /kaggle/input/

df = pd.read_csv('/kaggle/input/train.csv')
df.head()

Unnamed: 0,image,species,individual_id
0,00021adfb725ed.jpg,melon_headed_whale,cadddb1636b9
1,000562241d384d.jpg,humpback_whale,1a71fbb72250
2,0007c33415ce37.jpg,false_killer_whale,60008f293a2b
3,0007d9bca26a99.jpg,bottlenose_dolphin,4b00fe572063
4,00087baf5cef7a.jpg,humpback_whale,8e5253662392


Add a label for the classes:

In [17]:
df['label'] = df.groupby('individual_id').ngroup()
df['label'].describe()

count    51033.000000
mean      7651.356240
std       4465.552697
min          0.000000
25%       3748.000000
50%       7605.000000
75%      11443.000000
max      15586.000000
Name: label, dtype: float64

Split into training and validation:

In [18]:
valid_proportion = 0.1

valid_df = df.sample(frac=valid_proportion, replace=False, random_state=1).copy()
train_df = df[~df['image'].isin(valid_df['image'])].copy()

print(train_df.shape)
print(valid_df.shape)

(45930, 4)
(5103, 4)


Reset index on both since we want to use it for KNN lookups later:

In [19]:
train_df.reset_index(drop=True, inplace=True)
valid_df.reset_index(drop=True, inplace=True)

Create our dataset objects:

In [20]:
train_dataset = HappyWhaleDataset(df=train_df, image_dir=TRAIN_DIR, return_labels=True)
len(train_dataset)

45930

In [21]:
valid_dataset = HappyWhaleDataset(df=valid_df, image_dir=TRAIN_DIR, return_labels=True)
len(valid_dataset)

5103

In [22]:
dataset_dict = {"train": train_dataset, "val": valid_dataset}

## Model Setup

We need to specify three components to build our model:

- Trunk
- Embedder
- Loss

Setup the trunk using a pre-trained model from timm:

In [23]:
trunk = timm.create_model(MODEL_NAME, pretrained=True)
trunk.classifier = common_functions.Identity()
trunk = trunk.to(device)
trunk_optimizer = optim.SGD(trunk.parameters(), lr=MODEL_LR, momentum=0.9)
trunk_schedule = optim.lr_scheduler.OneCycleLR(
    trunk_optimizer,
    max_lr=MODEL_LR,
    total_steps = N_EPOCH * int(len(train_dataset)/BATCH_SIZE),
    pct_start = PCT_START
)

INFO:timm.models.helpers:Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/tf_efficientnet_b3_ns-9d44bf68.pth)
Downloading: "https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/tf_efficientnet_b3_ns-9d44bf68.pth" to /root/.cache/torch/hub/checkpoints/tf_efficientnet_b3_ns-9d44bf68.pth


Add our embedder. This is just a linear layer that will create the embeddings for KNN:

In [None]:
embedder = nn.Linear(OUTPUT_SIZE, EMBEDDING_SIZE).to(device)
embedder_optimizer = optim.SGD(trunk.parameters(), lr=MODEL_LR, momentum=0.9)
embedder_schedule = optim.lr_scheduler.OneCycleLR(
    embedder_optimizer,
    max_lr=MODEL_LR,
    total_steps = N_EPOCH * int(len(train_dataset)/BATCH_SIZE),
    pct_start = PCT_START
)

And add the loss function:

In [None]:
loss_func = losses.ArcFaceLoss(num_classes=N_CLASSES, embedding_size=EMBEDDING_SIZE).to(device)
loss_optimizer = optim.SGD(trunk.parameters(), lr=MODEL_LR, momentum=0.9)
loss_schedule = optim.lr_scheduler.OneCycleLR(
    loss_optimizer,
    max_lr=MODEL_LR,
    total_steps = N_EPOCH * int(len(train_dataset)/BATCH_SIZE),
    pct_start = PCT_START
)

Setup some hooks for validation, logging and model saving at the end of the epoch:

In [None]:
record_keeper, _, _ = LP.get_record_keeper(LOG_DIR)
hooks = LP.get_hook_container(record_keeper, primary_metric='mean_average_precision')

In [None]:
tester = testers.GlobalEmbeddingSpaceTester(
    end_of_testing_hook=hooks.end_of_testing_hook,
    accuracy_calculator=AccuracyCalculator(
        include=['mean_average_precision'],
        device=torch.device("cpu"),
        k=5),
    dataloader_num_workers=N_WORKER,
    batch_size=BATCH_SIZE
)

By adding the tester as an end of epoch hook in this way, it will automatically use the embedder model to generate train and validation embeddings, then for each validation embedding find the k nearest neighbours and evaluate MAP@5. This won't take into account the `new_individual` problem, but it should give us an idea of model performance on the task regardless.

In [28]:
end_of_epoch_hook = hooks.end_of_epoch_hook(
    tester, 
    dataset_dict,
    MODEL_DIR,
    test_interval=1, 
    patience=PATIENCE, 
    splits_to_eval = [('val', ['train'])]
)

Finally, setup our trainer object:

In [29]:
trainer = trainers.MetricLossOnly(
    models={"trunk": trunk, "embedder": embedder},
    optimizers={"trunk_optimizer": trunk_optimizer, "embedder_optimizer": embedder_optimizer, "metric_loss_optimizer": loss_optimizer},
    batch_size=BATCH_SIZE,
    loss_funcs={"metric_loss": loss_func},
    mining_funcs={},
    dataset=train_dataset,
    dataloader_num_workers=N_WORKER,
    end_of_epoch_hook=end_of_epoch_hook,
    lr_schedulers={
        'trunk_scheduler_by_iteration': trunk_schedule,
        'embedder_scheduler_by_iteration': embedder_schedule,
        'metric_loss_scheduler_by_iteration': loss_schedule,
    }
)

## Model Training

Train the model:

In [None]:
trainer.train(num_epochs=36)

INFO:PML:Initializing dataloader
INFO:PML:Initializing dataloader iterator
INFO:PML:Done creating dataloader iterator
INFO:PML:TRAINING EPOCH 1
total_loss=41.79399: 100%|██████████| 1913/1913 [10:42<00:00,  2.98it/s]
INFO:PML:Evaluating epoch 1
INFO:PML:Getting embeddings for the val split
100%|██████████| 213/213 [00:49<00:00,  4.29it/s]
INFO:PML:Getting embeddings for the train split
100%|██████████| 1914/1914 [07:19<00:00,  4.36it/s]
INFO:PML:Computing accuracy for the val split w.r.t ['train']
INFO:PML:running k-nn with k=5
INFO:PML:embedding dimensionality is 512
INFO:PML:New best accuracy! 0.06143222006577173
INFO:PML:TRAINING EPOCH 2
total_loss=40.85210: 100%|██████████| 1913/1913 [19:06<00:00,  1.67it/s]
INFO:PML:Evaluating epoch 2
INFO:PML:Getting embeddings for the val split
100%|██████████| 213/213 [01:27<00:00,  2.43it/s]
INFO:PML:Getting embeddings for the train split
100%|██████████| 1914/1914 [10:41<00:00,  2.99it/s]
INFO:PML:Computing accuracy for the val split w.r.t ['

## Inference (validation set)

Here we want to use the validation set to help us choose the appropriate distance threshold between our query and reference images after which we classify the former as a `new_individual`. To do so, we loop through the validation set for a number of thresholds and find that which maximises our MAP@5.

Load in the best weights:

In [None]:
logging.getLogger().setLevel(logging.WARNING)

In [None]:
best_trunk_weights = glob.glob('../models/{}/trunk_best*.pth'.format(MODEL_NAME))[0]
trunk.load_state_dict(torch.load(best_trunk_weights))

In [None]:
best_embedder_weights = glob.glob('../models/{}/embedder_best*.pth'.format(MODEL_NAME))[0]
embedder.load_state_dict(torch.load(best_embedder_weights))

Setup the inference model object to easily generate embeddings and find nearest neighbours:

In [None]:
inference_model = InferenceModel(
    trunk=trunk,
    embedder=embedder,
    normalize_embeddings=True,
)

Train this on the training data:

In [None]:
inference_model.train_knn(train_dataset)

Loop through the validation data and loop through to find k nearest neighbours:

In [None]:
valid_dataloader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=N_WORKER, pin_memory=True)

In [None]:
valid_labels_list = []
valid_distance_list = []
valid_indices_list = []

for images, labels in tqdm(valid_dataloader):

    distances, indices = inference_model.get_nearest_neighbors(images, k=N_NEIGHBOURS)
    valid_labels_list.append(labels)
    valid_distance_list.append(distances)
    valid_indices_list.append(indices)

valid_labels = torch.cat(valid_labels_list, dim=0).cpu().numpy()
valid_distances = torch.cat(valid_distance_list, dim=0).cpu().numpy()
valid_indices = torch.cat(valid_indices_list, dim=0).cpu().numpy()

We have the indices of the nearest neighbours in our training set, so setup the lookups to return the `individual_id`:

In [None]:
new_whale_idx = -1

train_labels = train_df['individual_id'].unique()
train_idx_lookup = train_df['individual_id'].copy().to_dict()
train_idx_lookup[-1] = 'new_individual'

valid_class_lookup = valid_df.set_index('label')['individual_id'].copy().to_dict()

Loop through a range of thresholds and find which maximises our MAP@5:

In [None]:
thresholds = [np.quantile(valid_distances, q=q) for q in np.arange(0, 1.0, 0.01)]

In [None]:
results = []

for threshold in tqdm(thresholds):

    prediction_list = []
    running_map=0

    for i in range(len(valid_distances)):

        pred_knn_idx = valid_indices[i, :].copy()  
        insert_idx = np.where(valid_distances[i, :] > threshold) 

        if insert_idx[0].size != 0:  
            pred_knn_idx = np.insert(pred_knn_idx, np.min(insert_idx[0]), new_whale_idx) 

        predicted_label_list = []
        
        for predicted_idx in pred_knn_idx:
            predicted_label = train_idx_lookup[predicted_idx]
            if len(predicted_label_list) == 5:
                break
            if (predicted_label == 'new_individual') | (predicted_label not in predicted_label_list):
                predicted_label_list.append(predicted_label)

        gt = valid_class_lookup[valid_labels[i]]

        if gt not in train_labels:
            gt = "new_individual"

        precision_vals = []

        for j in range(5):
            if predicted_label_list[j] == gt:
                precision_vals.append(1/(j+1))
            else:
                precision_vals.append(0)

        running_map += np.max(precision_vals)

    results.append([threshold, running_map / len(valid_distances)])

results_df = pd.DataFrame(results, columns=['threshold','map5'])

In [None]:
results_df = results_df.sort_values(by='map5', ascending=False).reset_index(drop=True)
results_df.head(5)

Grab the best result:

In [None]:
threshold = results_df.loc[0, 'threshold']
threshold

## Inference (test set)

We want to make sure we use both our training and validation images for comparison. Combine the two dataframes and add a new dataset: 

In [None]:
combined_df = pd.concat([train_df, valid_df], axis=0).reset_index(drop=True)
combined_dataset = HappyWhaleDataset(df=combined_df, image_dir=TRAIN_DIR, return_labels=True)
len(combined_dataset)

Re-train the KNN model on this:

In [None]:
inference_model.train_knn(combined_dataset)

Grab the submission file:

In [None]:
# test_df = pd.read_csv('../input/happy-whale-and-dolphin/sample_submission.csv')

In [None]:
test_df = pd.read_csv('/kaggle/input/sample_submission.csv')

Create our dataset and dataloader objects for the test set:

In [None]:
test_dataset = HappyWhaleDataset(df=test_df, image_dir=TEST_DIR, return_labels=False)
len(test_dataset)

In [None]:
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=N_WORKER, pin_memory=True)

Find the k nearest neighbours in our combined dataset:

In [None]:
test_distance_list = []
test_indices_list = []

for images in tqdm(test_dataloader):

    distances, indices = inference_model.get_nearest_neighbors(images, k=N_NEIGHBOURS)
    test_distance_list.append(distances)
    test_indices_list.append(indices)

test_distances = torch.cat(test_distance_list, dim=0).cpu().numpy()
test_indices = torch.cat(test_indices_list, dim=0).cpu().numpy()

Prepare the labels for lookup based on index:

In [None]:
combined_idx_lookup = combined_df['individual_id'].copy().to_dict()
combined_idx_lookup[-1] = 'new_individual'

Loop through applying the threshold we found earlier to insert `new_individual`:

In [None]:
results = []

prediction_list = []

for i in range(len(test_distances)):

    pred_knn_idx = test_indices[i, :].copy() 
    insert_idx = np.where(test_distances[i, :] > threshold)  

    if insert_idx[0].size != 0:  
        pred_knn_idx = np.insert(pred_knn_idx, np.min(insert_idx[0]), new_whale_idx)  

    predicted_label_list = []

    for predicted_idx in pred_knn_idx:
        predicted_label = combined_idx_lookup[predicted_idx]
        if len(predicted_label_list) == 5:
            break
        if (predicted_label == 'new_individual') | (predicted_label not in predicted_label_list):
            predicted_label_list.append(predicted_label)

    prediction_list.append(predicted_label_list)

prediction_df = pd.DataFrame(prediction_list)
prediction_df.head()

Create the prediction label:

In [None]:
prediction_df['predictions'] = prediction_df[0].astype(str) + ' ' + prediction_df[1].astype(str) + ' ' + prediction_df[2 ].astype(str) + ' ' + prediction_df[3].astype(str) + ' ' + prediction_df[4].astype(str)
prediction_df.head()

Attach this to the submission:

In [None]:
submission = pd.read_csv('/kaggle/input/sample_submission.csv')
submission['predictions'] = prediction_df['predictions']
submission.head(1)

Save our submission:

In [None]:
submission.to_csv('submission.csv', index=False)

In [None]:
!cp submission.csv  /content/gdrive/MyDrive/kaggle/Whale/ouput/