# Wheat head detection with optimal transport-based domain adaptation

This notebook preprocesses the images of the GWHD 2021 dataset with domain adaptation based on Sinkhorn divergences, computed with the GeomLoss package. The process creates a copy of the dataset in which all images have a similar color palette, which improves the detection metrics of a YOLO model.

## Preparing the images for the YOLO model

Running the `setup.py` script creates the required directory structure and the necessary label files in YOLO format, and copies those labels to the folders where the modified copy of the dataset will be stored. This lets us use the Ultralytics module to train and validate the YOLO model. Finally, `setup.py` also creates the YAML files that tell Ultralytics where the images and their labels are located.
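
As an illustration of the label conversion described above, here is a minimal sketch that turns pixel-coordinate boxes into YOLO label lines. The contents of `setup.py` are not shown in this notebook, so the helper name `boxes_to_yolo` and the input format (a string of `x_min y_min x_max y_max` boxes separated by `;`, with `no_box` marking an image without wheat heads) are assumptions made for the example.

```python
def boxes_to_yolo(boxes_string, img_w, img_h, class_id=0):
    """Convert 'x_min y_min x_max y_max;...' pixel boxes to YOLO label lines.

    Hypothetical helper: the input format is an assumption, not taken from setup.py.
    """
    if boxes_string.strip() in ("", "no_box"):
        return []  # image without wheat heads -> empty label file

    lines = []
    for box in boxes_string.split(";"):
        x_min, y_min, x_max, y_max = (float(v) for v in box.split())
        # YOLO expects the box centre and size, normalised by the image dimensions
        x_c = (x_min + x_max) / 2 / img_w
        y_c = (y_min + y_max) / 2 / img_h
        w = (x_max - x_min) / img_w
        h = (y_max - y_min) / img_h
        lines.append(f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines


print(boxes_to_yolo("0 0 512 512;512 256 1024 768", 1024, 1024))
```

Each returned string corresponds to one line of a YOLO `.txt` label file: class index, then the normalised box centre and size.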

In [2]:
!python 'setup.py'

4563856cc6d75c670eafd86d5eb7245fbe8f273c28f9e36f7c6aaf097c7ce423.png is not located in ../gwhd_2021/images/
a2a15938845d9812de03bd44799c4b1bf856a8ad11752e81c94dc8d138515021.png is not located in ../gwhd_2021/images/
401f89a2bb6ab63e3f406bd59b9cadccfe953230feb6cdd7d1ce8a0f19be7d2b.png is not located in ../gwhd_2021/images/
0a3937653483c36dfb4d957b6f82ae96dbdc7ba36cc3d8bdb633bada3023c085.png is not located in ../gwhd_2021/images/
be1652110a44acd24b42784356e965ce84a04893c3f1bb3958b09fc7bc4eda2e.png is not located in ../gwhd_2021/images/
b3884abdad8e98013cb50b37733929975b55d84c44cfa7b097e3b88384eaf78a.png is not located in ../gwhd_2021/images/
891ea3e0214fa8d0b1fa79300479050a260dab709ede432056e4429d5def3593.png is not located in ../gwhd_2021/images/
48403161f9d95c58f7f3f7e71eb27390796bbfcee770132a59d956918b346f66.png is not located in ../gwhd_2021/images/
696abfd0977d0046cf6e493f8ba2e66b301d1e320fd29c58cda14065306a1663.png is not located in ../gwhd_2021/images/
82b3f5da5af875fe376c9fe5a408

## Perform the OT domain adaptation on the images and save them to the new folders

In [3]:
import os
import pandas as pd
import numpy as np
import random
import imageio
import torch
from geomloss import SamplesLoss
from PIL import Image
from utils import RGB_cloud, color_transfer, copy_files


[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK


In [4]:
use_cuda = torch.cuda.is_available()
dtype = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
sampling = 8 if not use_cuda else 1

# Apply domain adaptation to every image from the source domains (all images that don't belong to the target domain).
# Each iteration picks a random image from the target domain and uses a Sinkhorn divergence as the loss for the color transfer function.
# Note: relies on the globals train_folder and target_list, which are defined in a later cell.
def OT_and_save(source_domain_folder, img_destination_folder, img_list):
    for img in img_list:
        source_image_path = os.path.join(source_domain_folder, img) 
        target_image_path = os.path.join(train_folder, random.choice(target_list))

        X_i = RGB_cloud(source_image_path, sampling, dtype)
        Y_j = RGB_cloud(target_image_path, sampling, dtype)
        
        # blur and reach relate to ε (entropy regularization) and ρ (unbalanced transport), respectively; reach=None selects balanced transport
        new_cloud = color_transfer(X_i, Y_j, SamplesLoss("sinkhorn", blur=0.6, reach=None))
        
        # The new RGB cloud is reshaped into a W x W image (this assumes a square input) and converted to uint8 so it can be saved
        W = int(np.sqrt(len(new_cloud)))
        img_matrix = new_cloud.view(W, W, 3).detach().cpu().numpy()
        new_image = np.clip(img_matrix, 0, 1)
        final_image = (new_image * 255).astype(np.uint8)

        # Images are saved to be used later by the YOLO model
        save_path = os.path.join(img_destination_folder, img)
        imageio.imwrite(save_path, final_image)
        print(f'New OT image {img} successfully created at {img_destination_folder}')
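
The `color_transfer` helper lives in `utils.py`, which is not shown here. To make the idea concrete, the sketch below solves a small entropy-regularized OT problem between two RGB clouds with plain NumPy Sinkhorn iterations, then moves each source color to the weighted mean of the target colors it is coupled with. This is a swapped-in illustration, not the GeomLoss-based implementation used above: the function name `sinkhorn_color_transfer`, the `eps` value and the dense cost matrix are all assumptions, and a dense matrix only scales to a few thousand points.

```python
import numpy as np


def sinkhorn_color_transfer(X, Y, eps=0.05, n_iter=200):
    """Map source colors X (N, 3) toward target colors Y (M, 3), both in [0, 1].

    Illustrative stand-in for color_transfer: entropic OT with uniform weights,
    followed by a barycentric projection of the transport plan.
    """
    n, m = len(X), len(Y)
    a = np.full(n, 1.0 / n)  # uniform weights on source pixels
    b = np.full(m, 1.0 / m)  # uniform weights on target pixels

    # Squared Euclidean cost between every source/target color pair
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-C / eps)  # Gibbs kernel

    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)

    P = u[:, None] * K * v[None, :]  # transport plan
    # Barycentric projection: weighted mean of the target colors each source pixel is sent to
    return (P @ Y) / P.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
source = rng.uniform(0.0, 0.4, size=(500, 3))  # dark synthetic palette
target = rng.uniform(0.6, 1.0, size=(400, 3))  # bright synthetic palette
moved = sinkhorn_color_transfer(source, target)
print(source.mean(axis=0), moved.mean(axis=0), target.mean(axis=0))
```

Because every output color is a convex combination of target colors, the transferred cloud always stays inside the target palette's range, which is the effect the notebook relies on to homogenize colors across domains.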

In [5]:
# Paths of the CSV files that contain the data for the train, valid and test splits
csv_train_path = "../gwhd_2021/competition_train.csv"
csv_valid_path = "../gwhd_2021/competition_val.csv"
csv_test_path = "../gwhd_2021/competition_test.csv"

# Read the csv files
train = pd.read_csv(csv_train_path) 
valid = pd.read_csv(csv_valid_path)
test = pd.read_csv(csv_test_path) 

# Print the names of the domains in the training set and the number of images in each of them, so we can choose a target domain in the next step
train.value_counts("domain", normalize=False)

domain
ETHZ_1            747
Arvalis_3         588
Arvalis_5         448
Rres_1            432
Arvalis_2         401
Arvalis_4         204
Inrae_1           176
Arvalis_6         160
NMBU_2             98
NMBU_1             82
Arvalis_1          66
Arvalis_11         60
Arvalis_10         60
Arvalis_9          32
ULiège-GxABT_1     30
Arvalis_12         29
Arvalis_7          24
Arvalis_8          20
Name: count, dtype: int64
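
The notebook does not state how the target domain is chosen. One plausible criterion, assumed here purely for illustration, is to take the domain with the most training images, which can be read directly from the counts above. The toy DataFrame below stands in for `train`; only the `domain` and `image_name` column names come from the notebook, and the file names are made up.

```python
import pandas as pd

# Toy stand-in for the training CSV (file names are made up)
toy_train = pd.DataFrame({
    "image_name": ["a.png", "b.png", "c.png", "d.png", "e.png"],
    "domain": ["ETHZ_1", "ETHZ_1", "ETHZ_1", "Arvalis_3", "Rres_1"],
})

# Most frequent domain = candidate target domain (assumed selection rule)
target_domain = toy_train["domain"].value_counts().idxmax()

# Split the image names into target and source lists
is_target = toy_train["domain"] == target_domain
target_images = toy_train.loc[is_target, "image_name"].tolist()
source_images = toy_train.loc[~is_target, "image_name"].tolist()

print(target_domain, target_images, source_images)
```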

In [6]:
# Create a list of files that belong to the chosen target domain so we can transport the other images to this domain
target_list = train[train["domain"] == "ETHZ_1"]["image_name"].tolist()

# Create lists of files from the source domains, i.e. every domain except target domain
source_train_list = train[train["domain"] != "ETHZ_1"]["image_name"].tolist()
source_valid_list = valid["image_name"].tolist()
source_test_list = test["image_name"].tolist()

# Define the folders where the original images are stored
train_folder = "../gwhd_2021/Original/train/images/"
valid_folder = "../gwhd_2021/Original/valid/images/"
test_folder = "../gwhd_2021/Original/test/images/"

# Path to destination folders for the images after OT is applied to them
train_img_destination = "../gwhd_2021/OT/train/images/"
valid_img_destination = "../gwhd_2021/OT/valid/images/"
test_img_destination = "../gwhd_2021/OT/test/images/"

# Create destination folders in case they don't exist
os.makedirs(train_img_destination, exist_ok=True)
os.makedirs(valid_img_destination, exist_ok=True)
os.makedirs(test_img_destination, exist_ok=True)


In [7]:
# Make a copy of the target domain images in the new folder, as we're not applying OT to them
copy_files(train_folder, target_list, train_img_destination)

06349f1520b395d898703994bf55283eeaf85e1071fe18412afbedfcfd563c17.png successfully copied to ../gwhd_2021/OT/train/images/
b772dbb8a3ea1e9548993811b3a3db03d03c11be436a295563fc8084ec26abd4.png successfully copied to ../gwhd_2021/OT/train/images/
d44472c550c55380e442c63e425755db543ca1d9315f2e2ac57065a44cfd97ac.png successfully copied to ../gwhd_2021/OT/train/images/
2d489ca9d8030d4e23a09142cc45a86c4103cde700fd2ff86641963b60261cdd.png successfully copied to ../gwhd_2021/OT/train/images/
0ef333bd24f8020b805fa7623207de23ddae642749c15ac4bb535c1f47206403.png successfully copied to ../gwhd_2021/OT/train/images/
5f06808f5a6386d54ca4fd85f488288c35cd52f8bd684a5519f0159835dd9bbf.png successfully copied to ../gwhd_2021/OT/train/images/
d0e06edbf7ad712dd8bd0d510311cc1af01a920cd654cd3c95e950ed85f4af17.png successfully copied to ../gwhd_2021/OT/train/images/
9e9489a011529057edcc877b6c87140e3e96275cd3e2a9c5edcc662767087884.png successfully copied to ../gwhd_2021/OT/train/images/
8c5715c27dadd6b734696abb

In [None]:
# Set a seed so we can reproduce the OT results
random.seed(0)

# Apply optimal transport to every source image and store the results in the corresponding folders
OT_and_save(train_folder, train_img_destination, source_train_list)
OT_and_save(valid_folder, valid_img_destination, source_valid_list)
OT_and_save(test_folder, test_img_destination, source_test_list)
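
Because `OT_and_save` draws its target images with `random.choice`, the single `random.seed(0)` call fixes the source/target pairings only if the three calls always run in the same order over the same lists. A possible alternative, sketched below with the hypothetical helper `pair_with_targets`, is to draw the pairings from a dedicated `random.Random` instance per split, so that each split is reproducible on its own.

```python
import random


def pair_with_targets(source_images, target_images, seed):
    """Return (source, target) pairs drawn with a private, seeded generator.

    Hypothetical helper, not part of the notebook's utils module.
    """
    rng = random.Random(seed)  # independent of the global random state
    return [(src, rng.choice(target_images)) for src in source_images]


targets = ["t1.png", "t2.png", "t3.png"]
sources = ["s1.png", "s2.png", "s3.png", "s4.png"]

first = pair_with_targets(sources, targets, seed=0)
random.random()  # disturbing the global generator does not affect the pairing
second = pair_with_targets(sources, targets, seed=0)
print(first == second)
```

With this design, re-running only the validation or test split reproduces the same pairings, regardless of what was executed before it.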

## YOLOv5s for wheat head detection in GWHD 2021

### Training and validation of YOLO model with the original dataset

In [None]:
import torch
from ultralytics import YOLO

In [None]:
model_path = '../gwhd_2021/yolov5su.pt'
data_path = '../gwhd_2021/gwhd_2021.yaml'

# Load a COCO pretrained YOLOv5s model
model = YOLO(model_path) 

# Train the model on the GWHD 2021 dataset
results = model.train(data=data_path, epochs=100, imgsz=640, patience=10, seed=0)

In [None]:
best_model = "runs/detect/train/weights/best.pt"
data_path = '../gwhd_2021/gwhd_2021.yaml'

# Load the weights for the best performing epoch in the training above
model = YOLO(best_model)

# Obtain results for the test set
validation_results = model.val(data=data_path, imgsz=640, split="test")

### Training and validation with the OT modified dataset

In [None]:
import torch
from ultralytics import YOLO

In [None]:
model_path = '../gwhd_2021/yolov5su.pt'
data_path = '../gwhd_2021/gwhd_2021_OT.yaml'

# Load a COCO pretrained YOLOv5s model
model = YOLO(model_path) 

# Train the model on the version of the GWHD 2021 dataset modified by OT domain adaptation
results = model.train(data=data_path, epochs=100, imgsz=640, patience=10, seed=0)

In [None]:
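# Note: Ultralytics auto-increments run folders (train, train2, ...). If the original-dataset run above
# already wrote to runs/detect/train, the weights of the OT run are in a later folder; adjust this path accordingly.
best_model = "runs/detect/train/weights/best.pt"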
data_path = '../gwhd_2021/gwhd_2021_OT.yaml'

# Load the weights for the best performing epoch in the training above
model = YOLO(best_model)

# Obtain results for the test set
validation_results = model.val(data=data_path, imgsz=640, split="test")