# **DATA VISUALIZATION**

## Objectives

1. Answer Business Requirement 1: Provide visual insights to differentiate between healthy and mildew-affected leaves.
2. Visualize class distributions and image characteristics.
3. Generate visuals to aid in building the Streamlit dashboard.

## Inputs

- Dataset directories:
    - inputs/mildew_dataset/cherry-leaves/train
    - inputs/mildew_dataset/cherry-leaves/validation
    - inputs/mildew_dataset/cherry-leaves/test

## Outputs

- Image shape embeddings pickle file.
- Mean and variability of images per label plot.
- Visualization of class distributions.
- Image montage for use in the Streamlit dashboard.

## Additional Comments

- Data visualization provides insights into dataset quality, balance, and structure.
- Helps in identifying potential biases or imbalances in the dataset.
- Supports informed decisions regarding preprocessing, data augmentation, and modeling strategies.
- Ensures that the dataset is well-prepared for the modeling phase by revealing class imbalances and image variability.



---

## Set Up Environment

---

### Import libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
sns.set_style("white")
from matplotlib.image import imread

### Set working directory

In [None]:
cwd = os.getcwd()

In [None]:
os.chdir('/workspace/cv-mildew-detector')
print("You set a new current directory")

In [None]:
work_dir = os.getcwd()
work_dir

### Set input directories

In [None]:
# Set input directories
my_data_dir = 'inputs/mildew_dataset/cherry-leaves'
train_path = os.path.join(my_data_dir, 'train')
val_path = os.path.join(my_data_dir, 'validation')
test_path = os.path.join(my_data_dir, 'test')

### Set output directory

In [None]:
version = 'v1'
file_path = f'outputs/{version}'

# Create the versioned output folder only if it does not exist yet
if 'outputs' in os.listdir(work_dir) and version in os.listdir(work_dir + '/outputs'):
    print('Old version is already available. Create a new version.')
else:
    os.makedirs(name=file_path)

### Set label names

In [None]:
# Set the labels for the images
labels = os.listdir(train_path)
print('Label for the images are', labels)

## Data visualization of image data

### Image shape

In [None]:
# Calculate average image size across training dataset
dim1, dim2 = [], []
for label in labels:
    for image_filename in os.listdir(os.path.join(train_path, label)):
        # Use a separate name so the output directory stored in `file_path` is not overwritten
        img_path = os.path.join(train_path, label, image_filename)

        try:
            img = imread(img_path)
            if len(img.shape) == 3:  # Ensure image has three dimensions (height, width, channels)
                d1, d2, colors = img.shape
                dim1.append(d1)
                dim2.append(d2)
        except Exception as e:
            print(f"Skipping file {image_filename}: {e}")

# Convert lists to NumPy arrays for efficient numerical operations
dim1 = np.array(dim1)
dim2 = np.array(dim2)

# Compute average width and height of all images in the training dataset
dim1_mean = int(dim1.mean())
dim2_mean = int(dim2.mean())

# Plot image dimensions
sns.set_style("whitegrid")
fig, axes = plt.subplots(figsize=(6,6))
sns.scatterplot(x=dim2, y=dim1, alpha=0.2)
axes.set_xlabel("Width (pixels)")
axes.set_ylabel("Height (pixels)")
axes.axvline(x=dim2_mean, color='r', linestyle='--', label=f'Mean Width: {dim2_mean}')
axes.axhline(y=dim1_mean, color='r', linestyle='--', label=f'Mean Height: {dim1_mean}')
axes.legend()
plt.show()

print(f"Width average: {dim2_mean} \nHeight average: {dim1_mean}")

Image size is set to **128×128** to reduce computational cost and overfitting risk while preserving essential features for classification.  

In [None]:
image_shape = (128, 128, 3)
image_shape
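
To make the cost argument concrete, the sketch below compares the number of values per image before and after resizing. The 256×256 source size is an assumption for illustration only; in practice, substitute the mean dimensions printed above (`dim1_mean`, `dim2_mean`). `values_per_image` is a hypothetical helper, not part of the notebook.

```python
import math


def values_per_image(shape):
    """Number of scalar values in one image of shape (height, width, channels)."""
    return math.prod(shape)


# Hypothetical original size, for illustration only;
# use (dim1_mean, dim2_mean, 3) from the cell above in practice
original_shape = (256, 256, 3)
resized_shape = (128, 128, 3)

reduction = values_per_image(original_shape) / values_per_image(resized_shape)
print(f"Values per image: {values_per_image(original_shape)} -> {values_per_image(resized_shape)}")
print(f"Reduction factor: {reduction:.1f}x")
```

Halving both the height and the width cuts the input size by a factor of four, which is where most of the saving comes from.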

### Save the image shape embeddings

In [None]:
joblib.dump(value=image_shape,
            filename=f"{file_path}/image_shape.pkl")
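
A later notebook or the dashboard can read the saved shape back with `joblib.load`. The following is a minimal round-trip sketch that writes to a temporary directory as a stand-in for the real output folder; the variable names are illustrative.

```python
import os
import tempfile

import joblib

# Round-trip the shape through a temporary directory (stand-in for the output folder)
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "image_shape.pkl")
    joblib.dump(value=(128, 128, 3), filename=pkl_path)
    loaded_shape = joblib.load(filename=pkl_path)

print(loaded_shape)               # the tuple is restored unchanged
height, width = loaded_shape[:2]  # e.g. a target size for resizing
```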

### Image Statistics (Mean & Variability)

### Function to load images into arrays

In [None]:
from tensorflow.keras.preprocessing import image

def load_image_as_array(my_data_dir, new_size=image_shape[:2], n_images_per_label=20):
    """
    Loads images, resizes them, and returns them as arrays with labels.
    """
    X, y = np.array([], dtype='int'), np.array([], dtype='object')
    labels = os.listdir(my_data_dir)

    for label in labels:
        counter = 0
        for image_filename in os.listdir(my_data_dir + '/' + label):
            if counter < n_images_per_label:
                img = image.load_img(
                    my_data_dir + '/' + label + '/' + image_filename, 
                    target_size=new_size)  # (height, width)

                img_resized = image.img_to_array(img) / 255  # Normalize pixel values

                # Append image data and reshape correctly
                X = np.append(X, img_resized).reshape(-1, new_size[0], new_size[1], img_resized.shape[2])
                y = np.append(y, label)
                counter += 1

    return X, y

### Load image shapes and labels in an array

In [None]:
# Pass only (height, width); the channel count is taken from the loaded image
X, y = load_image_as_array(my_data_dir=train_path,
                           new_size=image_shape[:2],
                           n_images_per_label=30)
print(X.shape, y.shape)
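
`np.append` returns a new copy of the whole array on every call, so the loading loop becomes slower as more images are added. One possible alternative, sketched below with synthetic arrays in place of real leaf images, is to collect the images in a list and stack them once at the end. `stack_images` is a hypothetical helper.

```python
import numpy as np


def stack_images(images):
    """Stack equally sized (H, W, C) arrays into one (N, H, W, C) array."""
    return np.stack(images, axis=0)


# Synthetic stand-ins for resized, normalized images (not real leaf data)
rng = np.random.default_rng(seed=0)
fake_images, fake_labels = [], []
for label in ["Healthy", "Infected"]:
    for _ in range(3):
        fake_images.append(rng.random((128, 128, 3), dtype=np.float32))
        fake_labels.append(label)

X_demo = stack_images(fake_images)
y_demo = np.array(fake_labels, dtype="object")
print(X_demo.shape, y_demo.shape)  # (6, 128, 128, 3) (6,)
```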

### Plot mean and variability

In [None]:
def plot_mean_variability_per_labels(X, y, figsize=(12, 5), save_image=False):
    """
    The pseudo-code for the function is:
    * Iterate through all unique labels in the dataset.
    * Filter the dataset to include only images corresponding to the current label.
    * Calculate the mean and standard deviation for the filtered subset.
    * Create a figure with two subplots:
        - One displaying the mean image for the label.
        - The other showing the variability (standard deviation) image.
    * Optionally save the generated plots to the specified directory.
    """

    for label_to_display in np.unique(y):
        sns.set_style("white")  # Set the plot style

        # Create a boolean mask to filter images for the current label
        boolean_mask = (y == label_to_display)
        arr = X[boolean_mask]

        # Compute the mean image and standard deviation image for the current label
        avg_img = np.mean(arr, axis=0)
        std_img = np.std(arr, axis=0)

        print(f"==== Label {label_to_display} ====")
        print(f"Image Shape: {avg_img.shape}")

        # Create a figure to display the average and variability images
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
        axes[0].set_title(f"Average image for label {label_to_display}")
        if avg_img.shape[-1] == 3:  # Check if RGB
            axes[0].imshow(avg_img)  # No cmap for RGB
            axes[1].imshow(std_img)
        else:  # Grayscale
            axes[0].imshow(avg_img, cmap='gray')
            axes[1].imshow(std_img, cmap='gray')

        axes[1].set_title(f"Variability image for label {label_to_display}")

        # Save or display the figure based on the `save_image` argument
        if save_image:
            plt.savefig(f"{file_path}/avg_var_{label_to_display}.png",
                        bbox_inches='tight', dpi=150)
        else:
            plt.tight_layout()  # Adjust layout for better spacing
            plt.show()
        print("\n")

In [None]:
plot_mean_variability_per_labels(X=X, y=y, figsize=(12, 5), save_image=True)
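
The variability images are compared by eye. To back a visual impression with a number, one option is to average each label's standard deviation image into a single score. The sketch below is run on synthetic data, so the labels `a` and `b` and the helper `variability_score_per_label` are illustrative assumptions rather than part of the analysis above.

```python
import numpy as np


def variability_score_per_label(X, y):
    """Return {label: mean of the per-pixel standard deviation image}."""
    scores = {}
    for label in np.unique(y):
        std_img = np.std(X[y == label], axis=0)
        scores[label] = float(std_img.mean())
    return scores


# Synthetic data: class "b" is drawn with a wider spread than class "a"
rng = np.random.default_rng(seed=1)
X_demo = np.concatenate([
    rng.normal(0.5, 0.05, size=(20, 8, 8, 3)),
    rng.normal(0.5, 0.20, size=(20, 8, 8, 3)),
])
y_demo = np.array(["a"] * 20 + ["b"] * 20, dtype="object")

scores = variability_score_per_label(X_demo, y_demo)
print(scores)
```

Calling the function with the `X` and `y` arrays loaded earlier would give one score per leaf class.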

---

The next cell counts the image files in each set and label directory, reports any directory that is missing, and prints the overall total.

In [None]:
sets = ['train', 'test', 'validation']
labels = ['Healthy', 'Infected']

# Count the images in each set/label directory and accumulate the total
total_images = 0
for set_name in sets:
    for label in labels:
        path = f'inputs/mildew_dataset/cherry-leaves/{set_name}/{label}'
        try:
            number_of_files = len(os.listdir(path))
        except FileNotFoundError:
            print(f"Error: Directory '{path}' not found.")
            continue
        print(f'There are {number_of_files} images in {set_name}/{label}')
        total_images += number_of_files

print(f"\nTotal number of images: {total_images}")
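
The Outputs section lists a class distribution visual, while the cell above prints the counts as text. A minimal sketch of a bar chart built from the same counts follows. `count_images` and `plot_class_distribution` are hypothetical helpers, and any output file name is an assumption.

```python
import os

import matplotlib.pyplot as plt


def count_images(data_dir, sets, labels):
    """Return {(set_name, label): file count}, skipping missing directories."""
    counts = {}
    for set_name in sets:
        for label in labels:
            path = os.path.join(data_dir, set_name, label)
            if os.path.isdir(path):
                counts[(set_name, label)] = len(os.listdir(path))
    return counts


def plot_class_distribution(counts, save_path=None):
    """Draw one bar per set/label pair; save to save_path if given, else show."""
    names = [f"{set_name}/{label}" for set_name, label in counts]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(names, list(counts.values()))
    ax.set_ylabel("Number of images")
    ax.set_title("Class distribution per set")
    ax.tick_params(axis="x", rotation=45)
    if save_path:
        fig.savefig(save_path, bbox_inches="tight", dpi=150)
    else:
        plt.show()
    plt.close(fig)
```

In this notebook the call could look like `plot_class_distribution(count_images(my_data_dir, sets, labels))`, with `save_path` pointing into the versioned output folder if the figure is needed for the dashboard.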

---

### Comparing Healthy vs. Infected Leaves

---

In [None]:
def subset_image_label(X, y, label_to_display):
    """
    Filters the dataset to include only the images that belong to a specific label.
    """
    # Create a boolean mask to filter for the given label
    boolean_mask = (y == label_to_display)
    df = X[boolean_mask]  # Subset the dataset
    return df


def diff_bet_avg_image_labels_data_as_array(X, y, label_1, label_2, figsize=(20, 5), save_image=False):
    """
    Compares the average images between two specified labels.

    - Verifies that both labels exist in the dataset.
    - Calculates the mean image for each label.
    - Calculates the pixel-wise difference between the class averages.
    - Displays or optionally saves a figure with:
        * Average image for label_1
        * Average image for label_2
        * Difference between the two averages
    """
    sns.set_style("white")

    # Validate that both labels are present in the dataset
    unique_labels = np.unique(y)
    if (label_1 not in unique_labels) or (label_2 not in unique_labels):
        print(f"Either label {label_1} or label {label_2} is not in {unique_labels}")
        return

    # Calculate mean from label_1
    images_label = subset_image_label(X, y, label_1)
    label1_avg = np.mean(images_label, axis=0)

    # Calculate mean from label_2
    images_label = subset_image_label(X, y, label_2)
    label2_avg = np.mean(images_label, axis=0)

    # Calculate pixel-wise difference between class averages
    difference_mean = label1_avg - label2_avg

    # Create and display a plot with the results
    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=figsize)
    
    # Determine if the images are RGB or grayscale and pick the colormap once
    is_rgb = label1_avg.shape[-1] == 3
    cmap = None if is_rgb else 'gray'

    axes[0].imshow(label1_avg, cmap=cmap)
    axes[0].set_title(f'Average {label_1}')

    axes[1].imshow(label2_avg, cmap=cmap)
    axes[1].set_title(f'Average {label_2}')

    # The difference can be negative; for RGB input, imshow clips values outside [0, 1]
    axes[2].imshow(difference_mean, cmap=cmap)
    axes[2].set_title(f'Difference image: Avg {label_1} & {label_2}')

    # Save the plot to a file if save_image=True, otherwise display it
    if save_image:
        plt.savefig(f"{file_path}/avg_diff.png", bbox_inches='tight', dpi=150)
    else:
        plt.tight_layout()
        plt.show()

In [None]:
diff_bet_avg_image_labels_data_as_array(X=X, y=y,
                                        label_1='Healthy', label_2='Infected',
                                        figsize=(12, 10),
                                        save_image=True
                                        )
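
The difference of two average images is signed, and `imshow` clips float RGB values outside [0, 1], so negative differences are drawn as black. One way to keep both directions visible is to rescale the difference so that zero maps to mid-gray. The sketch below uses synthetic averages; `rescale_signed_difference` is a hypothetical helper, not part of the function above.

```python
import numpy as np


def rescale_signed_difference(diff):
    """Map a signed difference image to [0, 1], with zero difference at 0.5."""
    max_abs = np.abs(diff).max()
    if max_abs == 0:
        return np.full_like(diff, 0.5, dtype=float)
    return 0.5 + diff / (2 * max_abs)


# Synthetic averages standing in for label1_avg and label2_avg
avg_a = np.full((4, 4, 3), 0.6)
avg_b = np.full((4, 4, 3), 0.4)
avg_b[0, 0] = 0.9  # one pixel where the second class is brighter

display_diff = rescale_signed_difference(avg_a - avg_b)
print(display_diff.min(), display_diff.max())
```

Values above 0.5 mark pixels where the first label is brighter on average, and values below 0.5 mark pixels where the second label is brighter.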

### Image Montage

In [None]:
import itertools
import random
sns.set_style("white")


def image_montage(dir_path, label_to_display, nrows, ncols, figsize=(15, 10)):
    """
    - Verify if the specified label exists in the directory.
    - Ensure the grid size (nrows * ncols) does not exceed the number of available images.
    - Select random images to fill the montage grid
    - Create a figure to display the images, loading and plotting each in the respective grid space.
    """

    labels = os.listdir(dir_path)

    # Check if the specified label exists in the directory
    if label_to_display in labels:

        # Validate that the montage grid can fit the available images
        images_list = os.listdir(dir_path + '/' + label_to_display)
        if nrows * ncols <= len(images_list):  
            img_idx = random.sample(images_list, nrows * ncols)
        else:
            print(
                f"Reduce the number of rows or columns for the montage. \n"
                f"There are only {len(images_list)} images available. "
                f"You requested a grid for {nrows * ncols} images.")
            return

        # Generate grid indices based on the number of rows and columns
        list_rows = range(0, nrows)
        list_cols = range(0, ncols)
        plot_idx = list(itertools.product(list_rows, list_cols))

        # Create a figure and populate it with the selected images
        # squeeze=False keeps `axes` two-dimensional even for a single row or column
        fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize, squeeze=False)
        for x in range(0, nrows * ncols):
            img = imread(dir_path + '/' + label_to_display + '/' + img_idx[x])
            img_shape = img.shape
            axes[plot_idx[x][0], plot_idx[x][1]].imshow(img)
            axes[plot_idx[x][0], plot_idx[x][1]].set_title(
                f"Width: {img_shape[1]}px, Height: {img_shape[0]}px")
            axes[plot_idx[x][0], plot_idx[x][1]].set_xticks([])
            axes[plot_idx[x][0], plot_idx[x][1]].set_yticks([])
        plt.tight_layout()
        plt.show()
        plt.close(fig)  

    else:
        # Notify the user if the selected label does not exist
        print(f"The selected label '{label_to_display}' does not exist.")
        print(f"Available labels are: {labels}")

### Run Montage

In [None]:
for label in labels:
    print(label)
    image_montage(dir_path=train_path,
                  label_to_display=label,
                  nrows=3, ncols=3,
                  figsize=(10, 15)
                  )
    print("\n")
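
Because the montage samples images at random, each run shows a different selection. If the dashboard should display a stable montage, the selection step can be seeded. The sketch below isolates that step: it checks the grid size, samples file names, and pairs them with grid cells. `pick_montage_cells` and the file names are hypothetical.

```python
import itertools
import random


def pick_montage_cells(images_list, nrows, ncols, seed=None):
    """Return {(row, col): filename} for a random montage, or None if too few images."""
    n_cells = nrows * ncols
    if n_cells > len(images_list):
        return None
    rng = random.Random(seed)  # a fixed seed reproduces the same montage
    chosen = rng.sample(images_list, n_cells)
    cells = itertools.product(range(nrows), range(ncols))
    return dict(zip(cells, chosen))


# Hypothetical file names, for illustration only
filenames = [f"leaf_{i}.jpg" for i in range(10)]
layout = pick_montage_cells(filenames, nrows=2, ncols=3, seed=42)
for cell, name in layout.items():
    print(cell, name)
```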

---

## Conclusion and Next Steps

### Summary of Findings
- This notebook provided **an in-depth visual analysis** of the cherry leaf dataset.
- Key insights:
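  - **Image dimensions:** The scatter plot shows the width and height of the training images together with their means; images are resized to **128×128 pixels** for modeling to reduce computational cost.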
  - **Class distribution:** The dataset appears **balanced** between healthy and infected leaves.
  - **Mean & variability analysis:** 
    - **Healthy leaves** show **consistent structure**.
    - **Infected leaves** exhibit **higher variability**, likely due to **different infection stages**.
  - **Image Montage:** Representative samples provide **visual confirmation** of dataset quality.

### Next Steps
1. **Integrate Visuals into Streamlit Dashboard**:
   - Use **image montages** to allow users to explore dataset samples interactively.
   - Display **class distributions and variability** plots.

2. **Enhance Data Preprocessing**:
   - Implement **data augmentation** based on observed class variability.
   - Consider **histogram equalization** for better image contrast.

3. **Modeling Considerations**:
   - Use the **image shape information** to preprocess input data for model training.
   - Experiment with **feature extraction techniques** (e.g., **PCA, edge detection**) to aid classification.

4. **Further Improvements**:
   - Investigate **color distributions** between classes (e.g., **HSV transformation**).
   - Expand **visualizations** to include **frequency distributions of pixel intensities**.
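
As a starting point for the pixel intensity distributions mentioned in the last item, the sketch below computes a normalized histogram per color channel with `np.histogram`. It assumes images scaled to [0, 1], as produced by the loading function, and runs on a synthetic batch. `channel_histograms` is a hypothetical helper.

```python
import numpy as np


def channel_histograms(images, bins=32):
    """Return (bin_edges, {channel: normalized counts}) for images scaled to [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    histograms = {}
    for idx, channel in enumerate(["red", "green", "blue"]):
        counts, _ = np.histogram(images[..., idx], bins=edges)
        histograms[channel] = counts / counts.sum()
    return edges, histograms


# Synthetic batch standing in for X[y == label] (not real leaf data)
rng = np.random.default_rng(seed=2)
demo_batch = rng.random((5, 16, 16, 3))

edges, hists = channel_histograms(demo_batch, bins=8)
for channel, hist in hists.items():
    print(channel, np.round(hist, 3))
```

Computing the histograms separately for each label and plotting them side by side would show whether the classes differ in overall brightness or color balance.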

This concludes the **Data Visualization phase**. The next step is to **prepare data for modeling** and ensure effective preprocessing strategies are in place.
