# Initial analysis

## Executive summary

- most important bullet points
- the training dataset was modified for this study: the images duplicated in the validation set were removed from it
- not all images appear to be correctly annotated

## More information

**1 - restructuring the validation and train datasets**

As explained in the dataset repository and the paper, the validation dataset is a subset of the training set. Validating on images the model has already seen during training gives an overly optimistic estimate of its performance. Because of that, the current train set is treated as train+val, and the new train_fixed is the original training set without the validation images. This structure fixes the methodological mistake in the original dataset (validating on training data).

**2 - EDA**

From the images it is possible to see that the labels and annotations are not perfect: some boxes look imprecise and some cells are not annotated at all. Since the model can only be as good as its input data, this may set an upper limit on its performance.

- xxxx
- xxxx


## imports & configs


In [1]:
#### default imports ####
import numpy as np
import os
import sys
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### specific imports ###
import xml.etree.ElementTree as ET
from matplotlib.patches import Ellipse

# forces local code to be reloaded to avoid problems
%load_ext autoreload
%autoreload 2

#### important configs ####
# uses seaborn configs for prettier graphs
sns.set_theme()
# shows thousand separator for values
pd.options.display.float_format = '{:,.2f}'.format
# enable import from src/
sys.path.append('..')  

#### paths ####
# change path to base folder
project_path = "/mnt/c/Users/nicol/My Drive/personal/coding projects/2024/blood-cell-detection"

## auxiliary functions


In [2]:
def get_annotations(xml_path):
    """Parse an annotation XML file into [label, xmin, ymin, xmax, ymax] rows."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    sample_annotations = []

    # every <object> element holds one labeled bounding box
    for obj in root.iter("object"):
        label = obj.find("name").text
        bndbox = obj.find("bndbox")
        xmin = int(bndbox.find("xmin").text)
        ymin = int(bndbox.find("ymin").text)
        xmax = int(bndbox.find("xmax").text)
        ymax = int(bndbox.find("ymax").text)

        sample_annotations.append([label, xmin, ymin, xmax, ymax])

    return sample_annotations
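
As a quick sanity check, the parsing logic can be exercised on a small hand-written annotation. The XML below is a hypothetical sample that only mirrors the fields the function reads (`object`, `name`, `bndbox`); it is not taken from the dataset. The sketch declares its own copy of the parser (`parse_annotations`, an illustrative name) so that it runs on its own.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical annotation with the same fields get_annotations reads.
SAMPLE_XML = """<annotation>
  <object>
    <name>RBC</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>140</ymax></bndbox>
  </object>
  <object>
    <name>WBC</name>
    <bndbox><xmin>200</xmin><ymin>50</ymin><xmax>400</xmax><ymax>250</ymax></bndbox>
  </object>
</annotation>"""


def parse_annotations(xml_source):
    """Same logic as get_annotations; accepts a path or a file-like object."""
    root = ET.parse(xml_source).getroot()
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = [int(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append([obj.find("name").text, *coords])
    return boxes


sample_boxes = parse_annotations(io.StringIO(SAMPLE_XML))
print(sample_boxes)
```

Each row has the `[label, xmin, ymin, xmax, ymax]` layout that the plotting and label-creation cells unpack.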

# 0 - data prep


In [3]:
%cd {project_path}

# # clone dataset in the data/raw folder
# import shutil
# if os.path.exists('data/'):
#     shutil.rmtree('data/')  # os.removedirs only removes empty directories
# os.makedirs('data/processed')
# os.makedirs('data/raw')

# %cd "data/raw"

# !git clone git@github.com:MahmudulAlam/Complete-Blood-Cell-Count-Dataset.git

# %cd {project_path}

/mnt/c/Users/nicol/My Drive/personal/coding projects/2024/blood-cell-detection




# 1 - validation dataset problem


In [4]:
# # where the images & labels are
# raw_path = "data/raw/Complete-Blood-Cell-Count-Dataset"

# # gets the path for all files
# df_all = pd.DataFrame()
# for dirname, _, filenames in os.walk(raw_path):
#     paths = [dirname + "/" + filename for filename in filenames]
#     folder_name = os.path.split(dirname)[-1]
#     df_all = pd.concat([df_all, pd.DataFrame({"path": paths})], ignore_index=True)

# # transforms to df
# df_all = pd.DataFrame(df_all)

# # also gets the filename
# df_all["filename"] = df_all["path"].apply(lambda s: s.split("/")[-1])

# # and finally check possible extensions
# extensions = df_all["path"].apply(lambda s: s.split(".")[-1])
# extensions.value_counts()

In [5]:
# # creates a reference for the dataset (which folder it is from)
# df_all["dataset"] = df_all["path"].apply(lambda s: s.split("/")[-3])
# df_all["dataset"].value_counts()

In [6]:
# # check if all the files in validation dataset are also in the training one
# for filename in df_all[df_all["dataset"] == "Validation"]["filename"]:
#     if filename not in df_all[df_all["dataset"] == "Training"]["filename"].values:
#         print(filename)

**Conclusion:** All files in the validation folder are duplicates of files in the training dataset, as explained in the paper & GitHub. Validating on training data is a methodological problem, so the validation split will not be used as-is in our study.
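
The duplicate check can also be written with sets instead of a loop, which makes the overlap explicit. This is a minimal sketch, not the code used in the study: `split_overlap` is an illustrative helper and the filenames are hypothetical.

```python
def split_overlap(train_filenames, val_filenames):
    """Return (duplicated, val_only) filename sets for two splits."""
    train_set = set(train_filenames)
    val_set = set(val_filenames)
    return val_set & train_set, val_set - train_set


# Hypothetical filenames, for illustration only.
train_files = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
val_files = ["img_002.jpg", "img_003.jpg"]

duplicated, val_only = split_overlap(train_files, val_files)
print(f"{len(duplicated)} validation files also in training, {len(val_only)} only in validation")
```

With the notebook's dataframe, the two arguments would be the `filename` columns of `df_all` filtered by `dataset == "Training"` and `dataset == "Validation"`. An empty `val_only` set means every validation file is duplicated in training.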


# 2 - EDA


In [7]:
# # select only images
# images = df_all[df_all["filename"].apply(lambda s: s.split(".")[-1] in ["jpg"])]

# label_colors = {"RBC": "red", "WBC": "white", "Platelets": "purple"}

# # show 3 images
# fig, ax = plt.subplots(1, 3, figsize=(15, 5))
# for i, (index, row) in enumerate(images.sample(3, random_state=42).iterrows()):
#     # img show
#     img = plt.imread(row["path"])
#     ax[i].imshow(img)
#     ax[i].axis("off")
#     ax[i].set_title(f"{row['dataset']} - {row['filename']}")

#     # get annotations
#     annotations = get_annotations(
#         row["path"].replace("Images", "Annotations").replace("jpg", "xml")
#     )
#     print(annotations)

#     # show annotations
#     for label, xmin, ymin, xmax, ymax in annotations:
#         ax[i].add_patch(
#             # plt.Rectangle(
#             #     (xmin, ymin),
#             #     xmax - xmin,
#             #     ymax - ymin,
#             #     linewidth=2,
#             #     edgecolor=label_colors[label],
#             #     facecolor="none",
#             # )
#             Ellipse(
#                 ((xmin + xmax) / 2, (ymin + ymax) / 2),
#                 xmax - xmin,
#                 ymax - ymin,
#                 linewidth=2,
#                 edgecolor=label_colors[label],
#                 facecolor="none",
#             )
#         )
#         ax[i].text(xmin, ymin, label, fontsize=12, color="k")

**Info:** From the images it is possible to see that the labels and annotations are not perfect: some boxes look imprecise and some cells are not annotated at all. Since the model can only be as good as its input data, this may set an upper limit on its performance.
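
Part of this visual impression can be backed by simple mechanical checks. The sketch below counts labels and flags boxes that are degenerate or fall outside the image. `check_annotations` is an illustrative helper, the example rows are hypothetical, and the default 640x480 size matches the `WIDTH` and `HEIGHT` constants used later in this notebook. It cannot detect missing annotations, which still require looking at the images.

```python
from collections import Counter


def check_annotations(annotations, width=640, height=480):
    """Count labels and collect boxes that are degenerate or out of bounds."""
    counts = Counter(label for label, *_ in annotations)
    suspicious = []
    for label, xmin, ymin, xmax, ymax in annotations:
        degenerate = xmax <= xmin or ymax <= ymin
        out_of_bounds = xmin < 0 or ymin < 0 or xmax > width or ymax > height
        if degenerate or out_of_bounds:
            suspicious.append([label, xmin, ymin, xmax, ymax])
    return counts, suspicious


# Hypothetical annotations in the format returned by get_annotations.
example = [
    ["RBC", 10, 20, 110, 140],
    ["RBC", 300, 100, 300, 180],  # zero width
    ["WBC", 500, 400, 700, 470],  # extends past the right edge
]
counts, suspicious = check_annotations(example)
print(counts)
print(suspicious)
```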


# 3 - creating datasets


## .1 - adapting train, val & test


In [8]:
# # copying all the datasets to the processed folder
# !cp -r data/raw/Complete-Blood-Cell-Count-Dataset/Training data/processed/Training
# !cp -r data/raw/Complete-Blood-Cell-Count-Dataset/Validation data/processed/Validation
# !cp -r data/raw/Complete-Blood-Cell-Count-Dataset/Testing data/processed/Testing
# print('Done!')

In [9]:
# # removing the duplicated images from the training dataset
# removed_files = 0
# for validation_file_path in df_all[df_all["dataset"] == "Validation"]["path"]:
#     validation_file_path_processed_folder = validation_file_path.replace(
#         "raw", "processed"
#     ).replace("Complete-Blood-Cell-Count-Dataset/", "")

#     if os.path.exists(
#         validation_file_path_processed_folder.replace("Validation", "Training")
#     ):
#         os.remove(
#             validation_file_path_processed_folder.replace("Validation", "Training")
#         )
#         removed_files += 1

# print(f"Removed {removed_files} files")

In [10]:
# rename Images folders to images
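
The cell above only states the intent. One possible way to do the rename is sketched here, assuming every split keeps its pictures in a folder named `Images`; `rename_images_folders` is an illustrative helper, not the code that was actually run. The rename goes through a temporary name so that it also works on case-insensitive filesystems.

```python
import os
import tempfile


def rename_images_folders(root, old="Images", new="images"):
    """Rename every `old` directory under `root` to `new`; return the count."""
    renamed = 0
    # topdown=False visits children first, so renaming never invalidates the walk
    for dirpath, dirnames, _ in os.walk(root, topdown=False):
        for dirname in dirnames:
            if dirname == old:
                source = os.path.join(dirpath, dirname)
                temporary = os.path.join(dirpath, dirname + "_tmp_rename")
                # two-step rename so it also works on case-insensitive filesystems
                os.rename(source, temporary)
                os.rename(temporary, os.path.join(dirpath, new))
                renamed += 1
    return renamed


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    for split in ["Training", "Validation", "Testing"]:
        os.makedirs(os.path.join(root, split, "Images"))
    renamed_count = rename_images_folders(root)
    remaining = sorted(os.listdir(os.path.join(root, "Training")))
    print(renamed_count, remaining)
```

On the real data the call would be `rename_images_folders("data/processed")`. Note that the later cells build annotation and label paths by replacing `"Images"` in each path, so they assume the original folder name; the rename has to run after them, or those replacements have to be adjusted.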

## .2 - verify and recreate df


In [11]:
# # where the images & labels are
# processed_path = "data/processed"

# # gets the path for all files
# df_processed = pd.DataFrame()
# for dirname, _, filenames in os.walk(processed_path):
#     paths = [dirname + "/" + filename for filename in filenames]
#     folder_name = os.path.split(dirname)[-1]
#     df_processed = pd.concat(
#         [df_processed, pd.DataFrame({"path": paths})], ignore_index=True
#     )

# # transforms to df
# df_processed = pd.DataFrame(df_processed)

# # also gets the filename
# df_processed["filename"] = df_processed["path"].apply(lambda s: s.split("/")[-1])

# # and finally check possible extensions
# extensions = df_processed["path"].apply(lambda s: s.split(".")[-1])
# extensions.value_counts()

# processed_images = df_processed[
#     df_processed["filename"].apply(lambda s: s.split(".")[-1] in ["jpg"])
# ]

In [12]:
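# # check that no file in the validation dataset is still in the training one
# df_processed["dataset"] = df_processed["path"].apply(lambda s: s.split("/")[-3])
# training_filenames = set(
#     df_processed[df_processed["dataset"] == "Training"]["filename"]
# )
# for filename in df_processed[df_processed["dataset"] == "Validation"]["filename"]:
#     if filename in training_filenames:
#         print(filename)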

The validation images & labels were removed from the training dataset.


## .2 - creating labels


In [13]:
# # definitions for the dataset
# WIDTH = 640
# HEIGHT = 480
# cells_id = {"RBC": 0, "WBC": 1, "Platelets": 2}

# cells_classes = list(cells_id.keys())
# cells_classes
# # saves the dataset into the yolo format
# for i, (index, row) in enumerate(processed_images.iterrows()):
#     # get annotations
#     annotations = get_annotations(
#         row["path"].replace("Images", "Annotations").replace("jpg", "xml")
#     )

#     # get label path
#     label_path = row["path"].replace("Images", "labels").replace("jpg", "txt")

#     # create folders
#     os.makedirs(os.path.split(label_path)[0], exist_ok=True)

#     # save annotations
#     with open(label_path, "w") as file:
#         for label, xmin, ymin, xmax, ymax in annotations:
#             # get the center of the rectangle
#             x_center = (xmin + xmax) / 2
#             y_center = (ymin + ymax) / 2

#             # normalize the values
#             x_center /= WIDTH
#             y_center /= HEIGHT
#             width = (xmax - xmin) / WIDTH
#             height = (ymax - ymin) / HEIGHT

#             # save the values
#             file.write(f"{cells_id[label]} {x_center} {y_center} {width} {height}\n")

# print("done!")
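
The conversion in the loop above can be isolated in a small function, which makes it easy to check by hand. This is a minimal sketch: `voc_to_yolo` is an illustrative name, the box is hypothetical, and the default image size repeats the `WIDTH` and `HEIGHT` constants from the cell.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_width=640, img_height=480):
    """Convert pixel corner coordinates to normalized (x_center, y_center, width, height)."""
    x_center = (xmin + xmax) / 2 / img_width
    y_center = (ymin + ymax) / 2 / img_height
    width = (xmax - xmin) / img_width
    height = (ymax - ymin) / img_height
    return x_center, y_center, width, height


# Hypothetical box: 160 px wide and 120 px tall
yolo_box = voc_to_yolo(160, 120, 320, 240)
print(yolo_box)  # (0.375, 0.375, 0.25, 0.25)
```

The box center is at (240, 180) pixels, which gives 240 / 640 = 0.375 and 180 / 480 = 0.375; the size is 160 / 640 = 0.25 by 120 / 480 = 0.25. All four values are fractions of the image size, so the labels only stay valid if every image really is 640x480.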

## .3 - create dataset yaml


In [14]:
# yaml_file = "data/processed/blood_cell_dataset.yaml"

# full_path = "/mnt/c/Users/nicol/My Drive/personal/coding projects/2024/blood-cell-detection/data/processed"
# train_images_dir = "Training/images/"
# val_images_dir = "Validation/images/"
# test_images_dir = "Testing/images/"

# names_str = ""
# for item in cells_classes:
#     names_str = names_str + ", '%s'" % item
# names_str = "names: [" + names_str[1:] + "]"

# with open(yaml_file, "w") as wobj:
#     wobj.write("path: %s\n" % full_path)
#     wobj.write("train: %s\n" % train_images_dir)
#     wobj.write("val: %s\n" % val_images_dir)
#     # wobj.write("test: %s\n" % test_images_dir)
#     wobj.write("nc: %d\n" % len(cells_classes))
#     wobj.write(names_str + "\n")
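
The same file content can be built as a single string before writing it, which avoids the manual comma handling. This is a sketch under assumptions: `build_dataset_yaml` is an illustrative helper, and the relative base path in the example stands in for the absolute path used in the cell above.

```python
def build_dataset_yaml(base_path, train_dir, val_dir, class_names):
    """Return the dataset description as YAML text (same keys as the cell above)."""
    quoted_names = ", ".join(f"'{name}'" for name in class_names)
    lines = [
        f"path: {base_path}",
        f"train: {train_dir}",
        f"val: {val_dir}",
        f"nc: {len(class_names)}",
        f"names: [{quoted_names}]",
    ]
    return "\n".join(lines) + "\n"


yaml_text = build_dataset_yaml(
    "data/processed",  # assumption: placeholder for the absolute project path
    "Training/images/",
    "Validation/images/",
    ["RBC", "WBC", "Platelets"],
)
print(yaml_text)
```

The order of `class_names` has to match the ids in `cells_id`, since the label files store only the numeric id.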

# 4 - training


## .1 - initial baseline


In [21]:
from ultralytics import YOLO

# Load a model
model = YOLO("models/yolov8n.pt")  # load a pretrained model (recommended for training)

results = model.train(
    data="data/processed/blood_cell_dataset.yaml", epochs=3, imgsz=640, batch=4
)

# show results
for dataset in ["train", "val"]:
    print("-" * 30)
    print(dataset)
    results = model.val(split=dataset)
    print(results.results_dict)

New https://pypi.org/project/ultralytics/8.1.34 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.1.33 🚀 Python-3.11.8 torch-2.2.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3050 6GB Laptop GPU, 6144MiB)
engine/trainer: task=detect, mode=train, model=models/yolov8n.pt, data=data/processed/blood_cell_dataset.yaml, epochs=3, time=None, patience=100, batch=4, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=train31, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, class

train: Scanning /mnt/c/Users/nicol/My Drive/personal/coding projects/2024/blood-cell-detection/data/processed/Training/labels.cache... 240 images, 0 backgrounds, 0 corrupt: 100%|██████████| 240/240 [00:00<?, ?it/s]
val: Scanning /mnt/c/Users/nicol/My Drive/personal/coding projects/2024/blood-cell-detection/data/processed/Validation/labels.cache... 60 images, 0 backgrounds, 0 corrupt: 100%|██████████| 60/60 [00:00<?, ?it/s]

## .2 - hyperparam tuning


## .3 - pre & post-processing


### A - boxes on same location


### B - confidence of result


### C - model to predict this part


# 5 - retraining


# 6 - model results


# 7 - Analysis


# 8 - conclusion
