# **DATA VISUALIZATION**


This notebook provides visual insights into the cherry leaf dataset, supporting the business requirement of distinguishing healthy and mildew-affected leaves.  

## **Objectives**
- Analyze image properties to understand dataset structure.  
- Visualize class distributions to check dataset balance.  
- Identify key visual differences between Healthy and Infected leaves.  
- Generate visuals for the Streamlit dashboard, ensuring interactive data exploration.  

## **Inputs**
- Dataset Directories:  
  - inputs/mildew_dataset/cherry-leaves/train  
  - inputs/mildew_dataset/cherry-leaves/validation  
  - inputs/mildew_dataset/cherry-leaves/test

## **Outputs**
- Image Shape Embeddings: Stores the standard image size for preprocessing.  
- Class Distribution Visualizations: Simple bar charts showing dataset balance.  
- Image Statistics: Mean and variability of images per class.  
- Image Montage: Representative samples of Healthy and Infected leaves.  
- Pixel Intensity Histograms: To explore brightness differences.  

## **Why This Matters**
- Understanding dataset structure helps optimize preprocessing for model training.  
- Identifying visual patterns informs feature extraction strategies (e.g., contrast, texture).  
- Ensuring dataset balance prevents bias in model predictions.  
- Generating interactive visuals improves usability in the Streamlit dashboard.  

---

### **Import Libraries**

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from matplotlib.image import imread

sns.set_style("white")

---

## **Set Working Environment**

---

#### **Set Working Directory**

In [None]:
cwd = os.getcwd()

In [None]:
os.chdir('/workspaces/mildew-detection-app')
print("You set a new current directory")

In [None]:
work_dir = os.getcwd()
work_dir

#### **Set Input Directories**

In [None]:
# Set input directories
my_data_dir = 'inputs/mildew_dataset/cherry-leaves'
train_path = os.path.join(my_data_dir, 'train')
val_path = os.path.join(my_data_dir, 'validation')
test_path = os.path.join(my_data_dir, 'test')

#### **Set Output Directory**

In [None]:
version = 'v1'
file_path = f'outputs/{version}'

if 'outputs' in os.listdir(work_dir) and version in os.listdir(work_dir + '/outputs'):
    print('Old version is already available. Create a new version.')
else:
    os.makedirs(name=file_path)
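
The version check above can also be written with `os.makedirs(..., exist_ok=True)`, which avoids listing directories by hand. A minimal sketch, assuming the same `outputs/<version>` layout; the helper name `ensure_output_dir` is hypothetical and not part of the notebook:

```python
import os


def ensure_output_dir(base_dir, version):
    """Create <base_dir>/outputs/<version> if needed and report whether it already existed."""
    path = os.path.join(base_dir, "outputs", version)
    already_there = os.path.isdir(path)
    # exist_ok=True makes the call safe to repeat when the folder is already present
    os.makedirs(path, exist_ok=True)
    return path, already_there
```

The returned flag lets the caller decide whether to warn about overwriting an existing version.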

### **Load Dataset & Labels**

#### **Set Labels**

In [None]:
# Set the labels for the images
labels = os.listdir(train_path)
print('Labels for the images are', labels)
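
`os.listdir` returns every entry in the folder, in arbitrary order, so stray files such as `.DS_Store` can end up in the label list. A minimal sketch of a stricter listing, assuming each class is stored as a subdirectory; `list_class_labels` is a hypothetical helper:

```python
import os


def list_class_labels(split_dir):
    """Return the sorted names of the non-hidden subdirectories of split_dir."""
    return sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name)) and not name.startswith(".")
    )
```

Sorting also keeps the label order stable between runs and machines.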

---

## **Data Visualization of Image Data**

---

### **Image Shape and Size Analysis**

#### **Interactive Scatter Plot: Image Size Distribution**

In [None]:
import plotly.express as px  # Needed for the interactive scatter plot below

# Collect the height and width of every image in the train set
dim1, dim2, labels_list = [], [], []

for label in labels:
    for image_filename in os.listdir(train_path + '/' + label):
        img = imread(train_path + '/' + label + '/' + image_filename)
        d1, d2 = img.shape[:2]  # Height and width; also works for grayscale images
        dim1.append(d1)  # Height
        dim2.append(d2)  # Width
        labels_list.append(label)  # Store the corresponding class label

# Convert to a DataFrame for Plotly
df_dims = pd.DataFrame({"Width": dim2, "Height": dim1, "Label": labels_list})

# Compute mean values
dim1_mean = int(np.mean(dim1))  # Mean height
dim2_mean = int(np.mean(dim2))  # Mean width

# Create an interactive scatter plot
fig = px.scatter(df_dims, x="Width", y="Height", color="Label",
                 title="Interactive Scatter Plot: Image Size Distribution",
                 labels={"Width": "Image Width (Pixels)", "Height": "Image Height (Pixels)"},
                 opacity=0.5, hover_data=["Label"])

# Add mean lines (reference for average width and height)
fig.add_vline(x=dim2_mean, line_dash="dash", line_color="red", annotation_text=f"Mean Width: {dim2_mean}")
fig.add_hline(y=dim1_mean, line_dash="dash", line_color="red", annotation_text=f"Mean Height: {dim1_mean}")

# Show the interactive plot
fig.show()

# Print mean values
print(f"Width average: {dim2_mean} \nHeight average: {dim1_mean}")
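
When most images share one size, the scatter plot collapses into a single point and the mean can hide outliers. A frequency count of the `(height, width)` pairs makes that explicit. A minimal sketch, using made-up sample values in place of the notebook's `dim1` and `dim2` lists; `summarize_sizes` is a hypothetical helper:

```python
from collections import Counter


def summarize_sizes(heights, widths):
    """Count how often each (height, width) pair occurs, most common first."""
    return Counter(zip(heights, widths)).most_common()


# Made-up values standing in for dim1 (heights) and dim2 (widths)
sample_heights = [256, 256, 256, 240]
sample_widths = [256, 256, 256, 320]
size_counts = summarize_sizes(sample_heights, sample_widths)
print(size_counts)
```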

#### **Total Number of Images per Class**

In [None]:
# Define dataset directory
my_data_dir = "inputs/mildew_dataset/cherry-leaves"
labels = ["Healthy", "Infected"]
sets = ["train", "validation", "test"]

# Count total images per class (Healthy, Infected) across all sets
class_counts = {}
for label in labels:
    total_images = 0
    for set_name in sets:
        label_dir = os.path.join(my_data_dir, set_name, label)
        if os.path.exists(label_dir):  # Skip sets that lack this label folder
            total_images += len(os.listdir(label_dir))
    class_counts[label] = total_images

# Convert to DataFrame
df_class_counts = pd.DataFrame(list(class_counts.items()), columns=["Class", "Count"])

# Create a simple bar chart
plt.figure(figsize=(5, 4))
plt.bar(df_class_counts["Class"], df_class_counts["Count"], color=["blue", "orange"])

# Add labels and title
plt.xlabel("Class")
plt.ylabel("Number of Images")
plt.title("Total Images Per Class")
plt.grid(axis="y")

# Show the plot
plt.show()
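
The bar chart sums the three sets, so it cannot show whether each set is balanced on its own. A minimal sketch of a per-set count, assuming the `<set>/<label>` folder layout used above; `count_images_per_split` and `imbalance_ratio` are hypothetical helpers:

```python
import os


def count_images_per_split(data_dir, sets, labels):
    """Return {set name: {label: file count}}, using 0 for folders that do not exist."""
    counts = {}
    for set_name in sets:
        counts[set_name] = {}
        for label in labels:
            label_dir = os.path.join(data_dir, set_name, label)
            n_files = len(os.listdir(label_dir)) if os.path.isdir(label_dir) else 0
            counts[set_name][label] = n_files
    return counts


def imbalance_ratio(label_counts):
    """Largest class count divided by the smallest; 1.0 means perfectly balanced."""
    smallest = min(label_counts.values())
    if smallest == 0:
        return float("inf")
    return max(label_counts.values()) / smallest
```

A ratio close to 1.0 in every set supports the balance claim; a large ratio in any one set is worth a closer look.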

#### **Distribution of Image Widths and Heights in the Dataset**

In [None]:
# Plot histograms of image widths and heights side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

for ax, values, name in zip(axes, (dim2, dim1), ("Width", "Height")):
    ax.hist(values, bins=20, color="blue", alpha=0.7, edgecolor="black")
    ax.set_xlabel(f"Image {name} (Pixels)")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Distribution of Image {name}s in the Dataset")
    ax.grid(axis="y", linestyle="--", alpha=0.7)

# Show the histograms
plt.tight_layout()
plt.show()

- The **average image size** is **256x256**, but images will be **rescaled to 128x128** for **computational efficiency**.

In [None]:
image_shape = (128, 128, 3)
image_shape
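
Halving each side cuts the number of input values per image by a factor of four. A quick check of that arithmetic, assuming 3-channel images; the float32 storage used for the byte counts is an assumption, not something the notebook states:

```python
def values_per_image(height, width, channels=3):
    """Number of scalar values in one image tensor."""
    return height * width * channels


original = values_per_image(256, 256)
rescaled = values_per_image(128, 128)
reduction = original / rescaled

# Memory per image if stored as float32 (4 bytes per value); the dtype is assumed
bytes_original = original * 4
bytes_rescaled = rescaled * 4
print(original, rescaled, reduction)
```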

#### **Save the Image Shape Embeddings**

In [None]:
joblib.dump(value=image_shape,
            filename=f"{file_path}/image_shape.pkl")

### **Average and Variability of Image per Label**

#### **Function to Load Images into Arrays**

In [None]:
from tensorflow.keras.preprocessing import image

def load_image_as_array(my_data_dir, new_size=image_shape[:2], n_images_per_label=20):
    """
    Loads images, resizes them, and returns them as arrays with labels.
    """
    X, y = np.array([], dtype='int'), np.array([], dtype='object')
    labels = os.listdir(my_data_dir)

    for label in labels:
        counter = 0
        for image_filename in os.listdir(my_data_dir + '/' + label):
            if counter < n_images_per_label:
                img = image.load_img(
                    my_data_dir + '/' + label + '/' + image_filename, 
                    target_size=new_size)  # (height, width)

                img_resized = image.img_to_array(img) / 255  # Normalize pixel values

                # Append image data and reshape correctly
                X = np.append(X, img_resized).reshape(-1, new_size[0], new_size[1], img_resized.shape[2])
                y = np.append(y, label)
                counter += 1

    return X, y
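
`np.append` copies the whole array on every call, so the loop above does more and more work as images accumulate. Collecting the arrays in a list and stacking them once is a common alternative. A minimal sketch with synthetic arrays standing in for decoded images; `stack_images` is a hypothetical helper:

```python
import numpy as np


def stack_images(images, labels):
    """Stack equally sized image arrays into (n, height, width, channels) and pair them with labels."""
    X = np.stack(images, axis=0)  # One allocation instead of one per image
    y = np.array(labels, dtype=object)
    return X, y


# Synthetic stand-ins for normalized 128x128 RGB images
rng = np.random.default_rng(0)
fake_images = [rng.random((128, 128, 3)) for _ in range(4)]
fake_labels = ["Healthy", "Healthy", "Infected", "Infected"]
X_demo, y_demo = stack_images(fake_images, fake_labels)
print(X_demo.shape, y_demo.shape)
```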

#### **Load Image Shapes and Labels in an Array**

In [None]:
# Pass only (height, width); the channel count comes from the loaded image
X, y = load_image_as_array(my_data_dir=train_path,
                           new_size=image_shape[:2],
                           n_images_per_label=30)
print(X.shape, y.shape)

#### **Plot Mean and Variability**

In [None]:
def plot_mean_variability_per_labels(X, y, figsize=(12, 5), save_image=False):
    """
    The pseudo-code for the function is:
    * Iterate through all unique labels in the dataset.
    * Filter the dataset to include only images corresponding to the current label.
    * Calculate the mean and standard deviation for the filtered subset.
    * Create a figure with two subplots:
        - One displaying the mean image for the label.
        - The other showing the variability (standard deviation) image.
    * Optionally save the generated plots to the specified directory.
    """

    for label_to_display in np.unique(y):
        sns.set_style("white")  # Set the plot style

        # Create a boolean mask to filter images for the current label
        boolean_mask = (y == label_to_display)
        arr = X[boolean_mask]

        # Compute the mean image and standard deviation image for the current label
        avg_img = np.mean(arr, axis=0)
        std_img = np.std(arr, axis=0)

        print(f"==== Label {label_to_display} ====")
        print(f"Image Shape: {avg_img.shape}")

        # Create a figure to display the average and variability images
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
        axes[0].set_title(f"Average image for label {label_to_display}")
        if avg_img.shape[-1] == 3:  # Check if RGB
            axes[0].imshow(avg_img)  # No cmap for RGB
            axes[1].imshow(std_img)
        else:  # Grayscale
            axes[0].imshow(avg_img, cmap='gray')
            axes[1].imshow(std_img, cmap='gray')

        axes[1].set_title(f"Variability image for label {label_to_display}")

        # Save or display the figure based on the `save_image` argument
        if save_image:
            plt.savefig(f"{file_path}/avg_var_{label_to_display}.png",
                        bbox_inches='tight', dpi=150)
        else:
            plt.tight_layout()  # Adjust layout for better spacing
            plt.show()
        print("\n")

In [None]:
plot_mean_variability_per_labels(X=X, y=y, figsize=(12, 5), save_image=True)
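
The average and variability images are per-pixel statistics taken across the image axis. A minimal sketch on a tiny synthetic batch, where the `axis=0` reduction is easy to verify by hand (the values are made up for illustration):

```python
import numpy as np

# Two 2x2 single-channel "images" for one label, shape (n_images, height, width, channels)
batch = np.array([
    [[[0.0], [0.2]], [[0.4], [0.6]]],
    [[[1.0], [0.2]], [[0.8], [0.6]]],
])

avg_img = batch.mean(axis=0)  # Per-pixel mean across the two images
std_img = batch.std(axis=0)   # Per-pixel standard deviation (population, ddof=0)

print(avg_img[..., 0])
print(std_img[..., 0])
```

Pixels that are identical in every image get a standard deviation of zero, which is why uniform regions appear dark in the variability image.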

### **Difference Between Average Healthy and Average Infected Images**

In [None]:
def subset_image_label(X, y, label_to_display):
    """
    Filters the dataset to include only the images that belong to a specific label.
    """
    # Create a boolean mask to filter for the given label
    boolean_mask = (y == label_to_display)
    df = X[boolean_mask]  # Subset the dataset
    return df


def diff_bet_avg_image_labels_data_as_array(X, y, label_1, label_2, figsize=(12, 10), save_image=False):
    """
    Compares the average images between two specified labels.
    """
    sns.set_style("white")

    # Calculate the mean image of each label
    label1_avg = np.mean(subset_image_label(X, y, label_1), axis=0)
    label2_avg = np.mean(subset_image_label(X, y, label_2), axis=0)

    # Clip to [0, 1] so imshow can render the difference as an image.
    # Negative values (pixels where label_2 is brighter) are set to 0.
    difference_mean = np.clip(label1_avg - label2_avg, 0, 1)

    # Create subplots
    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=figsize)

    # Use a gray colormap only when the images are not RGB
    cmap = None if label1_avg.shape[-1] == 3 else 'gray'

    axes[0].imshow(label1_avg, cmap=cmap)
    axes[0].set_title(f'Average {label_1}')

    axes[1].imshow(label2_avg, cmap=cmap)
    axes[1].set_title(f'Average {label_2}')

    axes[2].imshow(difference_mean, cmap=cmap)
    axes[2].set_title(f'Difference image: Avg {label_1} & {label_2}')

    # Save or display the figure
    if save_image:
        plt.savefig(f"{file_path}/avg_diff.png", bbox_inches='tight', dpi=150)
    else:
        plt.tight_layout()
        plt.show()


In [None]:
diff_bet_avg_image_labels_data_as_array(X=X, y=y,
                                        label_1='Healthy', label_2='Infected',
                                        figsize=(12, 10),
                                        save_image=True
                                        )
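
Clipping the difference to `[0, 1]` keeps only the pixels where the first label is brighter. One alternative is to map the signed difference from `[-1, 1]` to `[0, 1]`, so that mid-gray means no difference. A minimal sketch, assuming both average images are already scaled to `[0, 1]`; `signed_difference_for_display` is a hypothetical helper:

```python
import numpy as np


def signed_difference_for_display(avg_1, avg_2):
    """Map avg_1 - avg_2 from [-1, 1] to [0, 1]; 0.5 marks pixels with no difference."""
    return (avg_1 - avg_2 + 1.0) / 2.0


# Made-up 2x2 averages for illustration
avg_a = np.array([[0.2, 0.8], [0.5, 1.0]])
avg_b = np.array([[0.6, 0.8], [0.1, 0.0]])
display_diff = signed_difference_for_display(avg_a, avg_b)
print(display_diff)
```

Values above 0.5 mark pixels where the first label is brighter, values below 0.5 where the second is.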

In [None]:
def plot_heatmap(diff_image, title="Difference Heatmap"):
    """
    Plots a simple heatmap to visualize the difference between two average images.
    """
    plt.figure(figsize=(6, 5))
    sns.heatmap(diff_image, cmap="coolwarm", center=0, annot=False)

    # Titles and labels
    plt.title(title)
    plt.xlabel("Width (Pixels)")
    plt.ylabel("Height (Pixels)")

    # Show the plot
    plt.show()

In [None]:
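# difference_mean is local to the function above; recompute it and average over channels (sns.heatmap needs 2-D data)
difference_mean = np.mean(X[y == 'Healthy'], axis=0) - np.mean(X[y == 'Infected'], axis=0)
plot_heatmap(difference_mean.mean(axis=-1))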

### **Image Montage**

In [None]:
import itertools
import random
sns.set_style("white")


def image_montage(dir_path, label_to_display, nrows, ncols, figsize=(15, 10)):
    """
    - Verify if the specified label exists in the directory.
    - Ensure the grid size (nrows * ncols) does not exceed the number of available images.
    - Select random images to fill the montage grid
    - Create a figure to display the images, loading and plotting each in the respective grid space.
    """

    labels = os.listdir(dir_path)

    # Check if the specified label exists in the directory
    if label_to_display in labels:

        # Validate that the montage grid can fit the available images
        images_list = os.listdir(dir_path + '/' + label_to_display)
        if nrows * ncols <= len(images_list):  
            img_idx = random.sample(images_list, nrows * ncols)
        else:
            print(
                f"Reduce the number of rows or columns for the montage. \n"
                f"There are only {len(images_list)} images available. "
                f"You requested a grid for {nrows * ncols} images.")
            return

        # Generate grid indices based on the number of rows and columns
        list_rows = range(0, nrows)
        list_cols = range(0, ncols)
        plot_idx = list(itertools.product(list_rows, list_cols))

        # Create a figure and populate it with the selected images.
        # squeeze=False keeps `axes` 2-D even when nrows or ncols is 1.
        fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize, squeeze=False)
        for x in range(0, nrows * ncols):
            img = imread(dir_path + '/' + label_to_display + '/' + img_idx[x])
            img_shape = img.shape
            axes[plot_idx[x][0], plot_idx[x][1]].imshow(img)
            axes[plot_idx[x][0], plot_idx[x][1]].set_title(
                f"Width: {img_shape[1]}px, Height: {img_shape[0]}px")
            axes[plot_idx[x][0], plot_idx[x][1]].set_xticks([])
            axes[plot_idx[x][0], plot_idx[x][1]].set_yticks([])
        plt.tight_layout()
        plt.show()
        plt.close(fig)  

    else:
        # Notify the user if the selected label does not exist
        print(f"The selected label '{label_to_display}' does not exist.")
        print(f"Available labels are: {labels}")

### **Run Montage**

In [None]:
for label in labels:
    print(label)
    image_montage(dir_path=train_path,
                  label_to_display=label,
                  nrows=3, ncols=3,
                  figsize=(10, 15)
                  )
    print("\n")
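
`random.sample` draws a different montage on every run. Where the same images should reappear, for example in a report or dashboard, a seeded generator can be used. A minimal sketch; `sample_filenames` and the default seed are hypothetical:

```python
import random


def sample_filenames(filenames, k, seed=42):
    """Pick k filenames reproducibly; sorting first removes dependence on listing order."""
    rng = random.Random(seed)
    return rng.sample(sorted(filenames), k)


names = [f"leaf_{i}.jpg" for i in range(20)]
first = sample_filenames(names, 9)
second = sample_filenames(list(reversed(names)), 9)
print(first == second)
```

Using a dedicated `random.Random` instance leaves the global random state untouched.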

---

# **Conclusion and Next Steps**
---

## Image Distribution 
- The dataset maintains balance across training, validation, and test sets.  

## Image Shape Standardization  
- Images average roughly 256x256 pixels and are rescaled to 128x128 for computational efficiency.  

## Mean & Variability Analysis  
- Healthy leaves exhibit more uniform structures, while Infected leaves show greater variability.  

## Pixel Intensity & Feature Space Analysis  
- Histogram analysis reveals statistical brightness differences between Healthy and Infected leaves.  
- PCA suggests class separability, supporting deep learning classification.  
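
A brightness comparison of this kind can be computed with `np.histogram` over per-pixel brightness. A minimal sketch on synthetic data, assuming images scaled to `[0, 1]`; the arrays and `brightness_histogram` are hypothetical and are not results from the dataset:

```python
import numpy as np


def brightness_histogram(images, bins=10):
    """Normalized histogram of per-pixel brightness for images scaled to [0, 1].

    images has shape (n, height, width, channels); brightness is the channel mean.
    """
    brightness = images.mean(axis=-1).ravel()
    counts, edges = np.histogram(brightness, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum(), edges


rng = np.random.default_rng(0)
darker = rng.uniform(0.0, 0.5, size=(5, 8, 8, 3))    # Synthetic, not real leaves
brighter = rng.uniform(0.5, 1.0, size=(5, 8, 8, 3))  # Synthetic, not real leaves
hist_dark, edges = brightness_histogram(darker)
hist_bright, _ = brightness_histogram(brighter)
```

Plotting the two normalized histograms on shared bins shows whether one class is systematically brighter than the other.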

---

# **Next Steps**
## 1. Integrate Visuals into the Streamlit Dashboard  
- Display image montages, class distributions, and histograms interactively.  
- Add feature space visualizations (e.g., PCA plots) for class differentiation.  

## 2. Enhance Data Preprocessing  
- Apply data augmentation based on observed variability.  
- Consider contrast enhancement (e.g., histogram equalization) for better feature extraction.  
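
Histogram equalization, mentioned above, spreads the intensities that occur in an image across the full range. A minimal NumPy sketch for a 2-D `uint8` grayscale image, using the standard cumulative-distribution lookup table; `equalize_histogram` is a hypothetical helper and not part of the project:

```python
import numpy as np


def equalize_histogram(gray):
    """Histogram-equalize a 2-D uint8 image with a CDF-based lookup table."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # CDF value at the darkest intensity present
    total = gray.size
    if total == cdf_min:  # Constant image: nothing to spread out
        return gray.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]  # Apply the lookup table to every pixel


low_contrast = np.array([[50, 50], [100, 150]], dtype=np.uint8)
equalized = equalize_histogram(low_contrast)
print(equalized)
```

For color images, a common choice is to equalize only a luminance channel so that hues are preserved.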

## 3. Prepare for Model Training  
- Use image shape embeddings for consistent model input.  
- Experiment with edge detection, color space transformations, and feature extraction.  

## 4. Further Improvements  
- Expand dataset analysis with color-based feature extraction.  
- Validate class balance across additional test sets to ensure model robustness.  

---

This concludes the Data Visualization phase.  
Next, we move to preprocessing and model training.  