<img align="left" src="https://panoptes-uploads.zooniverse.org/project_avatar/86c23ca7-bbaa-4e84-8d8a-876819551431.png" type="image/png" height=100 width=100>
</img>
<h1 align="right">KSO Notebook #5: Train ML models</h1>
<h3 align="right"><a href="https://colab.research.google.com/github/ocean-data-factory-sweden/kso/blob/main/notebooks/05_Train_ML_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></h3>
<h3 align="right">Written by the KSO Team</h3>

This notebook takes you through importing a baseline model, training it on a dataset and evaluating the quality of the resulting model. If you do not have a project with us yet, you can run the template project to get a taste of how it all works. This notebook assumes that you have already prepared the dataset for model training; see Tutorial #8 for details on the required setup.

🔴 <span style="color:red">&nbsp;NOTE: To run this notebook, you need a Weights and Biases account. If you want to become a member of our Koster team on Weights and Biases, you can request access by contacting jurie.germishuys@combine.se. Team membership is not necessary to run the template project. </span>

# Set up KSO requirements

### Install all the requirements

Installing the requirements in Google Colab takes about 4 minutes and may automatically crash or restart the session. If that happens, rerun this cell until you see the "Successful installation... you're good to go!" message.

In [1]:
import sys
import os

# Check if notebook is running in colab
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    # Clone kso repo and install requirements
    if not os.path.exists("kso"):
        print("Installing all dependencies...")
        !git clone https://github.com/ocean-data-factory-sweden/kso.git
        !pip install -r /content/kso/requirements_colab.txt

    # Enable external widgets and navigate to the kso tutorial folder
    try:
        from google.colab import output

        output.enable_custom_widget_manager()
        os.chdir("kso/notebooks")
    except ImportError:
        pass

# Prepare the dev settings if needed
try:
    if "kso_utils" not in sys.modules:
        sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "..")))
        import kso_utils

        print("Using development version...")
        # Enables testing changes in utils
        %load_ext autoreload
        %autoreload 2
except ImportError:
    print("Installing latest version from PyPI...")
    %pip install -q kso-utils

if IN_COLAB:

    def restart_runtime():
        os.kill(os.getpid(), 9)

    # Check if there are any issues with previously imported packages
    try:
        from kso_utils.project import ProjectProcessor
    except Exception as e:
        print(f"Error importing package: {e}")
        print("Restarting runtime to apply package changes...")
        restart_runtime()

# Avoid issues with widgets not displaying properly
!jupyter nbextension enable --user --py widgetsnbextension
!jupyter nbextension enable --user --py jupyter_bbox_widget
!jupyter nbextension enable --user --py ipysheet

# Load the clear output function to keep things clean
from IPython.display import clear_output

clear_output()
print("Successful installation... you're good to go!")

Successful installation... you're good to go!


### Import python packages

In [2]:
# Import required modules for tut#5
import kso_utils.widgets as kso_widgets
import kso_utils.project_utils as p_utils
import kso_utils.server_utils as s_utils
import kso_utils.yolo_utils as y_utils
from kso_utils.project import ProjectProcessor, MLProjectProcessor
from ipyfilechooser import FileChooser
from IPython.display import display

print("Packages loaded successfully")

Packages loaded successfully


### Choose your project

In [3]:
project_name = kso_widgets.choose_project()

Dropdown(description='Project:', options=('Template project', 'Koster_Seafloor_Obs', 'Spyfish_Aotearoa', 'SGU'…

### Initiate project's database

In [4]:
# Find project
project = p_utils.find_project(project_name=project_name.value)
# Initialise pp
pp = ProjectProcessor(project)

INFO:root:Koster_Seafloor_Obs loaded succesfully
INFO:root:Running locally, no external connection to server needed.
INFO:root:Running locally so no csv files were downloaded from the server.
INFO:root:Updated species table from the temporary database
INFO:root:Updated sites table from the temporary database
INFO:root:Updated photos table from the temporary database
INFO:root:Updated movies table from the temporary database


In [5]:
# Initiate mlp
mlp = MLProjectProcessor(pp)

In [None]:
# Only for Template Project (downloading prepared data)
s_utils.get_ml_data(project)

# Train the model

### Configure data paths

If you are running the Template project, select the ml-template-data folder as the output folder; its path is printed by the cell above. For any other project, select the folder where you have saved your data.

In [6]:
# Specify the path containing the images and labels folders.
# Fall back to the current directory if the project has no photo folder.
mlp.output_path = kso_widgets.choose_folder(
    project.photo_folder if project.photo_folder != "None" else ".", "output"
)

FileChooser(path='.', filename='', title='HTML(value='Choose location of output')', show_hidden='False', use_d…

🔴 <span style="color:red">&nbsp;NOTE: Each model type requires a specific folder structure. To train your own object detection models, your data path must contain YAML files for the data and the hyperparameters. See https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data#11-create-datasetyaml. For image classification models, there should be 3 folders (train, val, test), each containing images in class_name folders. For segmentation models, polygon coordinates are also required. </span>
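
As a rough illustration of the image classification layout described in the note above, the sketch below checks that a folder contains train, val and test subfolders, each holding class folders with at least one image. The function name `check_classification_layout` and the list of image extensions are assumptions made for this example; they are not part of kso_utils.

```python
from pathlib import Path

# Assumed set of image extensions; extend it to match your own data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}
SPLITS = ("train", "val", "test")


def check_classification_layout(root):
    """Return a list of problems found; an empty list means the layout looks fine."""
    root = Path(root)
    problems = []
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing folder: {split}")
            continue
        # Every subfolder of a split is treated as one class.
        class_dirs = sorted(d for d in split_dir.iterdir() if d.is_dir())
        if not class_dirs:
            problems.append(f"no class folders in: {split}")
        for class_dir in class_dirs:
            has_image = any(
                f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
                for f in class_dir.iterdir()
            )
            if not has_image:
                problems.append(f"no images in: {split}/{class_dir.name}")
    return problems
```

Calling `check_classification_layout("path/to/your/data")` before training gives a quick list of anything that is missing.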

In [7]:
# Fix important paths
mlp.setup_paths()

INFO:root:Success! Paths to data.yaml and hyps.yaml found.


### Choose a suitable experiment name

In [8]:
exp_name = kso_widgets.choose_experiment_name()

Text(value='exp_name', description='Experiment name:', placeholder='Choose an experiment name', style=TextStyl…

### Choose model to use for training

In the next cell, specify the folder (any folder of your choice) where the baseline model should be downloaded. In the cell after that, select the baseline model itself, which will be used as the starting point for training.

In [9]:
# Specify the path to download the baseline model to.
# Fall back to the current directory if the project has no photo folder.
download_folder = kso_widgets.choose_folder(
    project.photo_folder if project.photo_folder != "None" else ".",
    "model download",
)

FileChooser(path='.', filename='', title='HTML(value='Choose location of model download')', show_hidden='False…

In [10]:
weights = mlp.choose_baseline_model(download_folder.selected)

Dropdown(description='Select model:', layout=Layout(width='50%'), options=('runs:/b61b1002052e4c788118284e6053…

Output()

### Train model with given configuration

The cell below asks for the batch size and the number of epochs to use during training. There are no strict rules for these; the best settings depend on your GPU and on some randomness that we have encountered while training models, so expect some trial and error. As a starting point, we advise a batch size of 8. For smaller datasets, we have found 50-100 epochs sufficient to reach good performance (metrics that have reached a plateau) without overfitting to the training set.
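
To get a feel for how batch size and epochs translate into work for the GPU, the sketch below counts how many batches are processed during a training run. The helper name `total_batches` and the dataset size of 1000 images are made up for this example and do not come from kso_utils or from any real project.

```python
import math


def total_batches(n_train_images, batch_size, epochs):
    """Number of batches processed over a whole training run."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    batches_per_epoch = math.ceil(n_train_images / batch_size)
    return batches_per_epoch * epochs


# Hypothetical dataset of 1000 training images, batch size 8, 50 epochs.
print(total_batches(n_train_images=1000, batch_size=8, epochs=50))
```

Doubling the batch size roughly halves the number of batches per epoch but increases the memory needed per batch, which is why the best value depends on your GPU.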

In [11]:
batch_size, epochs, img_h, img_w = mlp.choose_train_params()

HBox(children=(FloatLogSlider(value=1.0, base=2.0, description='Batch size:', max=10.0, readout_format='d', st…

In [12]:
# Give your WandB username, or the team name where you want to send the runs to.
# If you are part of the koster project, you can keep the default 'koster'.
entity = mlp.choose_entity(alt_name=False)

INFO:root:Found team name: koster. If you want to use a different team name for this experiment set the argument alt_name to True


In [13]:
mlp.train_yolo(
    exp_name=exp_name.value,
    weights=weights.artifact_path,
    project=mlp.project_name,
    epochs=epochs.value,
    batch_size=batch_size.value,
    img_size=img_h.value,  # this requires an int
)

EmptyDataError: No columns to parse from file

# Evaluate model performance

The model has now finished training. To see the loss, precision, recall and other metrics per training epoch, click on the link in the output of the previous cell, which opens your run in Weights and Biases. To evaluate the resulting model, run the cells below. These execute the standard YOLO evaluation process.
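
When looking at the per-epoch metrics, one simple way to judge whether a metric has stopped improving is to compare the best value of the most recent epochs with the best value seen before them. The sketch below is only a rule of thumb; the function name `has_plateaued` and the default `window` and `min_gain` values are assumptions chosen for illustration.

```python
def has_plateaued(values, window=10, min_gain=0.005):
    """Return True if the last `window` epochs improved the best value by less than `min_gain`.

    `values` holds one metric value per epoch (higher is better), oldest first.
    """
    if len(values) <= window:
        # Not enough history to compare recent epochs against earlier ones.
        return False
    best_before = max(values[:-window])
    best_recent = max(values[-window:])
    return best_recent - best_before < min_gain
```

You could, for example, copy the per-epoch mAP values of your run into a list and pass that list to `has_plateaued`.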

For a biological evaluation of the model, please see Notebook 6.

In [None]:
conf_thres = kso_widgets.choose_eval_params()

In [None]:
# Choose model: The folder you want to select for eval_model is the folder with your experiment_name.
eval_model = FileChooser(".")
display(eval_model)

When you run the cell below, some numbers are logged on the screen, and 3 files are stored in the folder 'your_experiment_name'_val.

The numbers logged on the screen represent the following:
* The first 7 numbers are: the mean precision, the mean recall, the mean average precision calculated at an IOU threshold of 0.5 (map@0.5), the mean average precision averaged over IOU thresholds from 0.5 to 0.95 in steps of 0.05 (map@0.5:0.95), and then 3 training losses based on predicting the box, object or class.
* The array gives the ap@0.5 per class.
* The last 3 numbers are the same as the numbers that are already printed in a line above, where it says 'Speed: … ms per....'
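
The IOU thresholds mentioned above can be made concrete with a small sketch. It computes the intersection over union of two boxes and lists the ten thresholds over which map@0.5:0.95 is averaged. The box format `(x1, y1, x2, y2)` and the names `iou` and `IOU_THRESHOLDS` are assumptions for this example; this is not the code used by the YOLO evaluation itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Thresholds from 0.5 to 0.95 in steps of 0.05.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

A prediction counts as a match at a given threshold only if its IOU with a ground-truth box is at least that threshold, so the same prediction can be a hit at 0.5 and a miss at 0.75.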

In [None]:
# Evaluate YOLO Model on Unseen Test data
mlp.eval_yolo(exp_name=exp_name.value, conf_thres=conf_thres.value)

# (Optional): Enhance annotations using trained model

Enhancement uses the trained model to increase the number of annotations in the training data. Only do this when absolutely necessary, as bad predictions lead to worse predictions when they are used to train the next iteration of the model.


🔴 <span style="color:red">&nbsp;NOTE: We recommend using a relatively high confidence threshold when enhancing annotations with a trained model, as low-confidence predictions could significantly reduce the quality of your annotated data. Enhancement is currently only available for object detection models.  </span>
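
To illustrate what a confidence threshold does, the sketch below keeps only the predictions whose confidence is at or above the threshold. The prediction format (a list of dictionaries with `label` and `conf` keys), the function name `filter_by_confidence` and the example values are all hypothetical; this is not how kso_utils stores or filters predictions internally.

```python
def filter_by_confidence(predictions, conf_thres):
    """Keep only the predictions with a confidence of at least `conf_thres`."""
    return [p for p in predictions if p["conf"] >= conf_thres]


# Made-up predictions, for illustration only.
example_predictions = [
    {"label": "species_a", "conf": 0.92},
    {"label": "species_b", "conf": 0.41},
    {"label": "species_a", "conf": 0.77},
]

kept = filter_by_confidence(example_predictions, conf_thres=0.75)
print(len(kept))
```

With a higher threshold, fewer but more reliable predictions end up as new annotations.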

In [None]:
eh_conf_thres = kso_widgets.choose_eval_params()

In [None]:
# Choose an input path
input_path = FileChooser(mlp.project_name)
display(input_path)

In [None]:
# Find the project path
project_path = FileChooser(mlp.project_name)
display(project_path)

In [None]:
mlp.enhance_yolo(
    in_path=input_path.selected,
    project_path=project_path.selected,
    conf_thres=eh_conf_thres.value,
    img_size=[640, 640],
)

### Choose run to use as enhanced annotations

In [None]:
runs = FileChooser(".")
display(runs)

In [None]:
# Move enhanced annotations to original run folder (NB: This will replace the original annotations)
mlp.enhance_replace(runs.selected)

#### Once you have moved the new labels to the original label location, you can return to the "Train the model" section and train your model again.

In [None]:
# END