# Classification of Sugarcane Diseases based on Images

## Initial Setup

Examining the training data shows that there are six (6) classes in total:

In [9]:
classes = [
    "Banded_Chlorosis",
    "Brown_Rust",
    "Brown_Spot",
    "Viral",
    "Yellow_Leaf",
    "Healthy",
]

To make the image data easier to handle, we define a Python class called `Classification` that uses the OpenCV library.

The class provides a `load_images` method for loading the images and a `resize_images` method for resizing all of them.

The properties `image_count`, `image_dimensions`, and `image_dimensions_distribution` let us inspect the loaded images and decide whether further preprocessing is needed.

In [10]:
from pathlib import Path

import cv2
import numpy as np


class Classification:
    name: str
    images: list[np.ndarray]

    def __init__(self, name: str):
        self.name = name
        self.images = list()

    @property
    def image_count(self) -> int:
        return len(self.images)

    @property
    def image_dimensions(self) -> dict[tuple[int, int, int], int]:
        """Maps each image shape (height, width, channels) to its image count."""
        dims = {}
        for img in self.images:
            dims[img.shape] = dims.get(img.shape, 0) + 1
        return dims

    @property
    def image_dimensions_distribution(self) -> dict[tuple[int, int, int], float]:
        """Maps each image shape to its share of the loaded images."""
        distrib = {}
        for key, value in self.image_dimensions.items():
            distrib[key] = value / self.image_count
        return distrib

    def load_images(self, top_folder: str = "train") -> None:
        """
        Empties self.images, then loads every file under
        <top_folder>/<self.name> using OpenCV's imread.

        Returns None

        Raises an exception if the folder cannot be read.
        Files that OpenCV cannot decode are skipped.
        """
        self.images = list()
        folder = Path(top_folder) / self.name
        for img_path in folder.iterdir():
            if not img_path.is_file():
                continue
            img = cv2.imread(str(img_path))
            # cv2.imread returns None instead of raising when decoding fails
            if img is not None:
                self.images.append(img)

    def resize_images(self, image_size: tuple[int, int]) -> None:
        """
        Resizes all images in self.images to the specified image size,
        given as (width, height), the order cv2.resize expects.

        Returns None
        """
        for index in range(len(self.images)):
            self.images[index] = cv2.resize(self.images[index], image_size)
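
The tally behind `image_dimensions` and `image_dimensions_distribution` can be illustrated in isolation. The sketch below is only an illustration: it uses `collections.Counter` instead of the manual dictionary, and the `shapes` list is a made-up stand-in for the `img.shape` values of loaded images.

```python
from collections import Counter

# Hypothetical stand-in for [img.shape for img in self.images]
shapes = [(1024, 768, 3)] * 3 + [(576, 768, 3)]

# Count how many images share each shape
tally = Counter(shapes)

# Convert the counts into each shape's share of all images
distribution = {shape: n / len(shapes) for shape, n in tally.items()}

print(dict(tally))
print(distribution)
```

With three images of one shape and one of the other, the shares come out as 0.75 and 0.25.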

## Loading the data

First, create the instances for each class.

In [11]:
classifications = [Classification(_class) for _class in classes]

Then, for each instance, let us load the images under the path `./train/<class_name>`.

We will also print the number of images, the image dimensions found, and the share of images with each dimension.

In [15]:
for x in classifications:
    x.load_images()
    print(f"{x.name}\n image count: {x.image_count}\n image dimensions: {x.image_dimensions}\n dimension distribution: {x.image_dimensions_distribution} ")

Banded_Chlorosis
 image count: 424
 image dimensions: {(1024, 768, 3): 404, (576, 768, 3): 20}
 dimension distribution: {(1024, 768, 3): 0.9528301886792453, (576, 768, 3): 0.04716981132075472} 
Brown_Rust
 image count: 282
 image dimensions: {(1024, 768, 3): 280, (576, 768, 3): 2}
 dimension distribution: {(1024, 768, 3): 0.9929078014184397, (576, 768, 3): 0.0070921985815602835} 
Brown_Spot
 image count: 1550
 image dimensions: {(1024, 768, 3): 1481, (576, 768, 3): 69}
 dimension distribution: {(1024, 768, 3): 0.9554838709677419, (576, 768, 3): 0.044516129032258066} 
Viral
 image count: 597
 image dimensions: {(1024, 768, 3): 501, (576, 768, 3): 96}
 dimension distribution: {(1024, 768, 3): 0.8391959798994975, (576, 768, 3): 0.16080402010050251} 
Yellow_Leaf
 image count: 1074
 image dimensions: {(1024, 768, 3): 1005, (576, 768, 3): 69}
 dimension distribution: {(1024, 768, 3): 0.9357541899441341, (576, 768, 3): 0.06424581005586592} 
Healthy
 image count: 387
 image dimensions: {(1024,

Two important pieces of information can be gleaned from the output.

1. The six (6) classes are imbalanced in image count.

This is a problem because it may bias our models toward the larger classes.
To avoid this, we must use proper sampling techniques when training the models so that every class is equally represented.

2. The images are rectangular and their dimensions are not homogeneous.

Most images have a size of `1024x768`, but some are `576x768` instead.
The difference in aspect ratios means that resizing these images to match the majority would introduce distortion that may lead to unintended effects.
The distribution shows that they make up a small enough share of each class to be dropped safely.
The class `Viral` has the highest percentage of `576x768` images at about 16%.
However, class `Viral` would still have more `1024x768` images than class `Brown_Rust` even after dropping all `576x768` images.
Therefore, these images can be dropped.
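
The distortion and the class-size argument can be checked with a little arithmetic. The sketch below assumes the shapes are `(height, width, channels)`, which is how `img.shape` reports them; the helper `aspect_ratio` is a hypothetical name introduced only for this illustration, and the counts are copied from the output above.

```python
# Shapes as reported by img.shape: (height, width, channels)
majority_shape = (1024, 768, 3)
minority_shape = (576, 768, 3)


def aspect_ratio(shape):
    """Height divided by width for a (height, width, channels) shape."""
    height, width = shape[:2]
    return height / width


# Scale factors needed to force a minority image into the majority shape
height_scale = majority_shape[0] / minority_shape[0]
width_scale = majority_shape[1] / minority_shape[1]

print(f"majority aspect ratio: {aspect_ratio(majority_shape):.3f}")
print(f"minority aspect ratio: {aspect_ratio(minority_shape):.3f}")
print(f"height scale: {height_scale:.3f}, width scale: {width_scale:.3f}")

# Counts of (1024, 768, 3) images, copied from the output above
viral_majority = 501
brown_rust_majority = 280
print(f"Viral still larger than Brown_Rust: {viral_majority > brown_rust_majority}")
```

Resizing would stretch the height by a factor of about 1.78 while leaving the width unchanged, so the leaf features would be visibly elongated.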


In [14]:
# Remove all images with dimensions (576, 768, 3)
for x in classifications:
    x.images = list(filter(lambda img: img.shape != (576, 768, 3), x.images))
    print(f"{x.name}\n image count: {x.image_count}\n image dimensions: {x.image_dimensions}\n")

Banded_Chlorosis
 image count: 404
 image dimensions: {(1024, 768, 3): 404}

Brown_Rust
 image count: 280
 image dimensions: {(1024, 768, 3): 280}

Brown_Spot
 image count: 1481
 image dimensions: {(1024, 768, 3): 1481}

Viral
 image count: 501
 image dimensions: {(1024, 768, 3): 501}

Yellow_Leaf
 image count: 1005
 image dimensions: {(1024, 768, 3): 1005}

Healthy
 image count: 374
 image dimensions: {(1024, 768, 3): 374}
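
The class imbalance noted earlier remains after filtering. As a minimal sketch of one possible sampling technique (random oversampling with replacement, which is an assumption and not necessarily the approach used for training), the code below draws the same number of image indices from every class. The function name `balanced_sample_indices` and the choice of the largest class as the target size are hypothetical; the counts are copied from the output above.

```python
import random

# Image counts per class after dropping the (576, 768, 3) images
counts = {
    "Banded_Chlorosis": 404,
    "Brown_Rust": 280,
    "Brown_Spot": 1481,
    "Viral": 501,
    "Yellow_Leaf": 1005,
    "Healthy": 374,
}


def balanced_sample_indices(counts, per_class, seed=0):
    """Draw `per_class` image indices for every class, sampling with replacement."""
    rng = random.Random(seed)
    return {
        name: rng.choices(range(count), k=per_class)
        for name, count in counts.items()
    }


# Assumption: oversample every class up to the size of the largest one
per_class = max(counts.values())
sample = balanced_sample_indices(counts, per_class)

for name, indices in sample.items():
    print(name, len(indices))
```

Each list of indices could then be used to pick images from the matching `Classification.images` list. Sampling with replacement repeats images from the smaller classes, so it is usually paired with augmentation to limit overfitting.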

