Author: Ioannis Kalfas (email1, email2)
This research project aims to classify images of insects using deep learning techniques. This repository follows up on the stickybugs-dataextraction repository, which generates folders of insect images (tiles / bounding boxes). The project provides a set of tools to organize insect image data, train classification models using the timm library for PyTorch, and visualize performance metrics such as confusion matrices. Data splitting is performed at the sticky plate level, i.e. insect images belonging to the same plate always end up in the same data split (train/validation/test).
All settings are read from a 'config.yaml' file using the PyYAML library. This allows for easy customization of experimental parameters and makes results easier to reproduce. Moreover, the project provides a standardized directory structure for exporting models and results, making it easy to share findings and compare results across experiments. The user only needs to set the base_dir in the config file.
To assist with experiment tracking and visualization, the project integrates with Weights and Biases, enabling users to log and visualize experiment progress and results.
A structured workflow is provided for organizing and preprocessing image data, training models, and evaluating their performance.
There are Python scripts and bash scripts that can be used to run the various steps of the workflow. All settings are stored in the config.yaml file. settings.py defines a Python dataclass so that we can create an object (called settings) to pass around everywhere in our workflow. The settings are edited or read by the Python scripts edit_config_file.py and read_config_file.py.
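As a rough illustration of the settings object described above, a dataclass populated from config.yaml could look like the following minimal sketch. Only base_dir is named in this README; the systems and use_wandb fields, the from_dict helper, and load_settings are hypothetical and are not taken from settings.py.

```python
from dataclasses import dataclass, field, fields
from pathlib import Path
from typing import Any


@dataclass
class Settings:
    # base_dir is the one setting this README names; the rest are
    # hypothetical placeholders for illustration only.
    base_dir: Path
    systems: list[str] = field(default_factory=list)
    use_wandb: bool = False

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Settings":
        # Keep only the keys that the dataclass declares.
        known = {f.name for f in fields(cls)}
        kwargs = {key: value for key, value in raw.items() if key in known}
        # Raises KeyError if base_dir is missing, which is intended here.
        kwargs["base_dir"] = Path(kwargs["base_dir"])
        return cls(**kwargs)


def load_settings(path: str = "config.yaml") -> Settings:
    # PyYAML is imported lazily so the dataclass is usable without it.
    import yaml

    with open(path, "r", encoding="utf-8") as handle:
        return Settings.from_dict(yaml.safe_load(handle))
```

A single object built this way can be passed to each step of the workflow instead of re-reading the YAML file in every script.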
The bash scripts (.sh) use edit_config_file and read_config_file to change the settings before running the main Python scripts (001_data_preparation.py, 002_data_splitting.py, 003_model_training.py, etc.).
You can always just use the main Python scripts and manually change the config.yaml file, but the bash scripts provide higher-level options. For example, in run_000_data_preparations.sh we define specific systems (fuji, photobox, etc.) and loop over them to save dataframes only for those systems. Similarly, there are bash scripts for data splitting and for training with specific imaging systems.
- Data preparation: Organize data into a standardized dataframe (for the setup specified in the config file). The dataframe contains information about the images, such as the class, the path to the image, the system that was used to capture the image, the year, bounding box coordinates, etc. We also calculate some custom features for model error analysis, such as the mean RGB values, the number of objects in each tile, etc. The dataframe is saved as a parquet file.
- Data cleaning: Remove images that are either mislabeled or of poor quality. Here we train a model to classify all classes for a specific system, and we discard images with high loss. This is an iterative process that can be repeated multiple times. The discarded images are copied (not moved!) to a separate folder with the suffix _outliers. The data_splitting script knows to ignore these images (check_if_outlier function). Note again that they are not removed from the original folder; they are only copied to the _outliers folder. This avoids having to run the cleaning process multiple times, because it is time-consuming.
- Data splitting: Split the data into train/validation/test sets. This is done at the sticky plate level. The script has options to make sure that specific classes are present in all splits. When done, the script saves the dataframes for each split as parquet files. These are read by the model training script. Some histograms are also saved in the exports folder to visualize the class distribution in the splits.
- Model training: Train, validate and test the model. A "coding" system is implemented in the settings file to specify the imaging setups that were used for training the model (0:cannon, 1:fuji, 2:photobox, 3:phoneboxS20FE, 4:phoneboxS22Ultra). This script reads the dataframes for the train/validation/test splits and trains the model. The model is saved in the exports folder. The script also saves the confusion matrix and the classification report for the test set. There is an option for Weights and Biases monitoring. It is also possible to continue training from a model you trained previously. For example, you can train a model on fuji & photobox (multi12) and then train a model for phoneboxS22Ultra only on top of it. This creates a folder in the results folder called multi4_PTmulti12, meaning that a model is trained on multi4 (phoneboxS22Ultra) on top of a model pre-trained on multi12 (fuji & photobox). PT stands for pre-trained.
- Model results: Visualize the performance of the model using confusion matrices and other metrics. The model_results script loads the model and the train/validation/test dataframes and generates some plots to show the performance of the model. The plots are saved in the results folder. The script also saves the classification report for the test set and a folder of misclassified images.
- Results visualizations: In its current state, the results_visualizations script supports the bash scripts run_010_cumulative_weeks and run_005_results_visualization, which together produce a boxplot with the accuracy of the model for a given class on the Y axis and the number of weeks used for training/validating and testing on the X axis. First run run_010_cumulative_weeks to get the results of the model for each week over a number of trials. Then run run_005_results_visualization to get the boxplot (you might have to run this twice due to a bug). The boxplot is saved in the exports folder. Within the results_visualizations script, you can define the insect class for which you want to see the boxplot. Ideally, the class would be passed as an argument in the bash script, but this is not implemented yet.
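The plate-level splitting described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in 002_data_splitting.py: the (image_path, plate_id) record layout, the split_by_plate name, and the 70/15/15 fractions are all hypothetical, and the options for forcing specific classes into every split are left out.

```python
import random
from collections import defaultdict


def split_by_plate(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole plates to train/validation/test.

    records: list of (image_path, plate_id) pairs (hypothetical layout).
    fractions: share of plates for train and validation; the rest is test.
    """
    # Shuffle the unique plate ids reproducibly.
    plates = sorted({plate for _, plate in records})
    random.Random(seed).shuffle(plates)

    n_train = round(fractions[0] * len(plates))
    n_val = round(fractions[1] * len(plates))

    # Every plate gets exactly one split label.
    split_of = {}
    for index, plate in enumerate(plates):
        if index < n_train:
            split_of[plate] = "train"
        elif index < n_train + n_val:
            split_of[plate] = "validation"
        else:
            split_of[plate] = "test"

    # Images inherit the split of the plate they came from.
    splits = defaultdict(list)
    for path, plate in records:
        splits[split_of[plate]].append(path)
    return dict(splits)
```

Because the split is drawn over plates rather than over images, two images from the same plate can never land in different splits, which is the leakage this scheme is meant to prevent.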
The bash scripts run the main Python scripts with different settings by changing the config.yaml file before each run. Here's a short description of the bash scripts:
- run_000_data_preparations.sh: Run the data preparation script for specific systems. The script loops over the systems and saves the dataframes for each system. The dataframes are saved in the exports folder.
- run_001_data_cleaning.sh: Run the data cleaning script for specific systems. The script loops over the systems and cleans the data for each system. By cleaning, we mean the creation of folders per system with the _outliers suffix. The images that are considered outliers are copied to these folders.
- run_002_data_splitting.sh: Run the data splitting script for specific systems. The script loops over the systems and saves the dataframes for each split. The dataframes are saved in the exports folder.
- run_003_004_model_training_model_results.sh: Run the model training and model results scripts for specific systems. You can also select whether you want to use a pre-trained model and specify which systems the pre-trained model was trained on. It's possible to exclude some classes from the training in case there's not enough data. There is also a "weeks" argument where you can select the first N weeks of the data to be used.
- run_005_results_visualization.sh & run_010_cumulative_weeks.sh: Run the results_visualizations script to generate a boxplot with the accuracy of the model for a given class on the Y axis and the number of weeks used for training/validating and testing on the X axis. First run run_010_cumulative_weeks to get the results of the model for each week over a number of trials. Then run run_005_results_visualization to get the boxplot (you might have to run this twice due to a bug). The boxplot is saved in the exports folder.
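The copy-to-_outliers step performed by the data cleaning script can be illustrated with the sketch below. It assumes per-image losses are already available as a dict and uses a simple quantile cutoff; the copy_high_loss_images name, the input layout, and the cutoff rule are hypothetical choices for illustration, not the criterion the repository actually uses.

```python
import shutil
from pathlib import Path


def copy_high_loss_images(losses, image_dir, quantile=0.95):
    """Copy the highest-loss images into '<image_dir>_outliers'.

    losses: dict mapping image file name -> loss (hypothetical input).
    quantile: images at or above this loss quantile count as outliers.
    """
    if not losses:
        return []

    image_dir = Path(image_dir)
    outlier_dir = image_dir.with_name(image_dir.name + "_outliers")
    outlier_dir.mkdir(parents=True, exist_ok=True)

    # Loss value at the requested quantile position.
    ranked = sorted(losses.values())
    cutoff = ranked[min(int(quantile * len(ranked)), len(ranked) - 1)]

    copied = []
    for name, loss in losses.items():
        if loss >= cutoff:
            # Copy rather than move, so the original folder stays intact.
            shutil.copy2(image_dir / name, outlier_dir / name)
            copied.append(name)
    return copied
```

Keeping the originals in place means a later, stricter or looser cleaning pass only needs to rebuild the _outliers folder.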
For all scripts, I use a three-digit code to create some semantic order. The last digit is the main step of the workflow (0: data preparation, 1: data cleaning, 2: data splitting, 3: model training, 4: model results, 5: results visualizations). The second-to-last digit is not used at the moment, but it could be used to build another semantic layer on top of the existing one, or to create a new workflow that is not part of the main workflow. For example, in the bash scripts we have run_000, run_001, run_002, run_003_004, run_005 and then run_010. run_010 is not part of the main workflow; it is a script that uses most of the previous scripts for its own purpose. One could then create scripts run_011, run_012, etc. to continue the "work" of run_010, or create a new workflow with run_020, run_030, etc.
As one comes to realize with time, coding for Research is not the same as coding for Software Development. In Research, we have to be flexible and adapt to new ideas and new workflows. Sometimes a wild idea appears and we might have to create a new workflow simply to test it. Things that persist and appear often as needed are then integrated into the main workflow.
This repository has been built for fast prototyping and has been continuously evolving. This has some advantages, like saving lots of time on tasks that have been needed repeatedly over the past years, but it has some disadvantages, too. The main disadvantage is that it's not very easy to follow for an "outsider". Hopefully, the documentation above can help with that.
One of the main features I plan to develop is saving a configuration file inside the results folder that carries the name of the experiment (e.g. multi4_PTmulti12). This configuration file (e.g. .yaml) will contain all the settings that were used for the experiment, which will make it easier to reproduce the results. Ideally, it could be picked up by any of the main scripts and used to run the whole experiment again.
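Experiment names such as multi4_PTmulti12 follow from the system codes listed under model training. A minimal sketch of that naming scheme is shown below; the code table comes from this README, while the function names and the choice to sort the digits in ascending order are assumptions.

```python
# System codes as listed in this README.
SYSTEM_CODES = {
    "cannon": 0,
    "fuji": 1,
    "photobox": 2,
    "phoneboxS20FE": 3,
    "phoneboxS22Ultra": 4,
}


def setup_code(systems):
    # "multi" followed by the code digits, e.g. fuji + photobox -> multi12.
    # Sorting the digits is an assumption made for this sketch.
    digits = sorted(SYSTEM_CODES[name] for name in systems)
    return "multi" + "".join(str(digit) for digit in digits)


def experiment_name(systems, pretrained_on=None):
    # Append "_PT<code>" when training starts from a pre-trained model.
    name = setup_code(systems)
    if pretrained_on:
        name += "_PT" + setup_code(pretrained_on)
    return name
```

For instance, `experiment_name(["phoneboxS22Ultra"], pretrained_on=["fuji", "photobox"])` reproduces the folder name used in the example above.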
- The code was developed as part of Yannis Kalfas' PhD and postdoc tenure to analyze, pre-process, model (and perform model-error analysis on) insect image data using PyTorch. Reach out to me here if something is not clear in the documentation.
- Another purpose of this repository is to train models for the MeBioS Streamlit app (repo) and the MeBioS FastAPI server (repo).
- All data shown as examples in figures in this repo belong to KU Leuven.