# **DATA VISUALIZATION**

## Objectives

1. Answer Business Requirement 1: Provide visual insights to differentiate between healthy and mildew-affected leaves.
2. Visualize class distributions and image characteristics.
3. Generate visuals to aid in building the Streamlit dashboard.

## Inputs

Dataset directories:
- inputs/mildew_dataset/cherry-leaves/train
- inputs/mildew_dataset/cherry-leaves/validation
- inputs/mildew_dataset/cherry-leaves/test

## Outputs

- Image shape embeddings pickle file.
- Mean and variability of images per label plot.
- Visualization of class distributions.
- Image montage for use in the Streamlit dashboard.

## Additional Comments

- Data visualization provides insights into dataset quality, balance, and structure.
- Helps in identifying potential biases or imbalances in the dataset.
- Supports informed decisions regarding preprocessing, data augmentation, and modeling strategies.
- Ensures that the dataset is well-prepared for the modeling phase by revealing class imbalances and image variability.



---

# Set Up Environment

### Import Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
sns.set_style("white")
from matplotlib.image import imread

### Set Working Directory

In [None]:
cwd = os.getcwd()

In [None]:
os.chdir('/workspace/mildew-detection-app')
print("You set a new current directory")

In [None]:
work_dir = os.getcwd()
work_dir

### Set Input Directories

In [None]:
# Set input directories
my_data_dir = 'inputs/mildew_dataset/cherry-leaves'
train_path = os.path.join(my_data_dir, 'train')
val_path = os.path.join(my_data_dir, 'validation')
test_path = os.path.join(my_data_dir, 'test')

### Set Output Directory

In [None]:
version = 'v1'
file_path = f'outputs/{version}'

if 'outputs' in os.listdir(work_dir) and version in os.listdir(work_dir + '/outputs'):
    print('Old version is already available. Create a new version.')
else:
    os.makedirs(name=file_path)
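
The existence check above can also be written with `os.makedirs(..., exist_ok=True)`, which does not raise if the directory is already there. The following is a minimal sketch rather than the notebook's own cell; the helper name `ensure_output_dir` and its `base_dir` parameter are assumptions made for illustration.

```python
import os
import tempfile


def ensure_output_dir(base_dir, version):
    """Create <base_dir>/outputs/<version> if it is missing and return its path."""
    path = os.path.join(base_dir, "outputs", version)
    already_there = os.path.isdir(path)
    # exist_ok=True makes the call safe to repeat without raising FileExistsError
    os.makedirs(path, exist_ok=True)
    if already_there:
        print(f"{path} already exists; consider creating a new version.")
    return path


# Demonstration in a throwaway directory so the real project tree is untouched
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_output_dir(tmp, "v1")
    second = ensure_output_dir(tmp, "v1")  # second call only prints the notice
    created = os.path.isdir(first)
```

In the notebook this would be called as `ensure_output_dir(work_dir, version)`.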

### Set Label Names

In [None]:
# Set the labels for the images
labels = os.listdir(train_path)
print('Labels for the images are', labels)

---

## Data Visualization of Image Data

### Image Shape

In [None]:
# Compute average image dimensions on the train set
dim1, dim2 = [], []
for label in labels:
    for image_filename in os.listdir(train_path + '/' + label):
        img = imread(train_path + '/' + label + '/' + image_filename)
        d1, d2, colors = img.shape
        dim1.append(d1)  # image height
        dim2.append(d2)  # image width

# Plot image dimensions
sns.set_style("whitegrid")
fig, axes = plt.subplots()
sns.scatterplot(x=dim2, y=dim1, alpha=0.2)
axes.set_xlabel("Width (pixels)")
axes.set_ylabel("Height (pixels)")
dim1_mean = int(np.array(dim1).mean())
dim2_mean = int(np.array(dim2).mean())
axes.axvline(x=dim2_mean, color='r', linestyle='--')  # mean width on the x-axis
axes.axhline(y=dim1_mean, color='r', linestyle='--')  # mean height on the y-axis
plt.show()
print(f"Width average: {dim2_mean} \nHeight average: {dim1_mean}")
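
As a complement to the scatter plot, a tally of distinct image shapes shows directly whether all images share the same dimensions. This is a minimal sketch, assuming shapes are collected as `(height, width)` tuples; the function `summarize_shapes` and the example values are hypothetical and not taken from the dataset.

```python
from collections import Counter

import numpy as np


def summarize_shapes(shapes):
    """Return (mean_height, mean_width, counts per distinct shape)."""
    heights = np.array([s[0] for s in shapes])
    widths = np.array([s[1] for s in shapes])
    return heights.mean(), widths.mean(), Counter(shapes)


# Hypothetical (height, width) pairs standing in for the values in dim1 and dim2
example_shapes = [(256, 256), (256, 256), (256, 250), (256, 256)]
mean_h, mean_w, shape_counts = summarize_shapes(example_shapes)
print(f"Mean height: {mean_h:.1f}, mean width: {mean_w:.1f}")
print(shape_counts.most_common())
```

With the notebook's variables, the call would be `summarize_shapes(list(zip(dim1, dim2)))`.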

Image size is set to 128×128 to reduce computational cost and overfitting risk while preserving essential features for classification.

In [None]:
image_shape = (128, 128, 3)
image_shape

### Save the Image Shape Embeddings

In [None]:
joblib.dump(value=image_shape,
            filename=f"{file_path}/image_shape.pkl")

## Image Statistics (Mean & Variability)

### Function to Load Images into Arrays

In [None]:
from tensorflow.keras.preprocessing import image

def load_image_as_array(my_data_dir, new_size=image_shape[:2], n_images_per_label=20):
    """
    Loads images, resizes them, and returns them as arrays with labels.
    """
    X, y = np.array([], dtype='int'), np.array([], dtype='object')
    labels = os.listdir(my_data_dir)

    for label in labels:
        counter = 0
        for image_filename in os.listdir(my_data_dir + '/' + label):
            if counter < n_images_per_label:
                img = image.load_img(
                    my_data_dir + '/' + label + '/' + image_filename, 
                    target_size=new_size)  # (height, width)

                img_resized = image.img_to_array(img) / 255  # Normalize pixel values

                # Append image data and reshape correctly
                X = np.append(X, img_resized).reshape(-1, new_size[0], new_size[1], img_resized.shape[2])
                y = np.append(y, label)
                counter += 1

    return X, y
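
`np.append` returns a new copy of the array on every call, so the loop above copies all previously loaded images each time it adds one. For larger samples, one alternative is to collect the images in a Python list and stack them once at the end. The sketch below uses synthetic arrays instead of real files; `stack_images` and the demo variables are illustrative names, not part of the notebook.

```python
import numpy as np


def stack_images(images, labels):
    """Stack equally sized image arrays into one (n, height, width, channels) array."""
    X = np.stack(images).astype("float32")
    y = np.array(labels, dtype="object")
    return X, y


# Synthetic stand-ins for resized, normalized images (values in [0, 1))
rng = np.random.default_rng(0)
fake_images = [rng.random((128, 128, 3)) for _ in range(4)]
fake_labels = ["Healthy", "Healthy", "Infected", "Infected"]

X_demo, y_demo = stack_images(fake_images, fake_labels)
print(X_demo.shape, y_demo.shape)
```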

### Load Image Shapes and Labels in an Array

In [None]:
X, y = load_image_as_array(my_data_dir=train_path,
                           new_size=image_shape[:2],  # (height, width) only
                           n_images_per_label=30)
print(X.shape, y.shape)

### Plot Mean and Variability

In [None]:
def plot_mean_variability_per_labels(X, y, figsize=(12, 5), save_image=False):
    """
    The pseudo-code for the function is:
    * Iterate through all unique labels in the dataset.
    * Filter the dataset to include only images corresponding to the current label.
    * Calculate the mean and standard deviation for the filtered subset.
    * Create a figure with two subplots:
        - One displaying the mean image for the label.
        - The other showing the variability (standard deviation) image.
    * Optionally save the generated plots to the specified directory.
    """

    for label_to_display in np.unique(y):
        sns.set_style("white")  # Set the plot style

        # Create a boolean mask to filter images for the current label
        boolean_mask = (y == label_to_display)
        arr = X[boolean_mask]

        # Compute the mean image and standard deviation image for the current label
        avg_img = np.mean(arr, axis=0)
        std_img = np.std(arr, axis=0)

        print(f"==== Label {label_to_display} ====")
        print(f"Image Shape: {avg_img.shape}")

        # Create a figure to display the average and variability images
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
        axes[0].set_title(f"Average image for label {label_to_display}")
        if avg_img.shape[-1] == 3:  # Check if RGB
            axes[0].imshow(avg_img)  # No cmap for RGB
            axes[1].imshow(std_img)
        else:  # Grayscale
            axes[0].imshow(avg_img, cmap='gray')
            axes[1].imshow(std_img, cmap='gray')

        axes[1].set_title(f"Variability image for label {label_to_display}")

        # Save or display the figure based on the `save_image` argument
        if save_image:
            plt.savefig(f"{file_path}/avg_var_{label_to_display}.png",
                        bbox_inches='tight', dpi=150)
        else:
            plt.tight_layout()  # Adjust layout for better spacing
            plt.show()
        print("\n")

In [None]:
plot_mean_variability_per_labels(X=X, y=y, figsize=(12, 5), save_image=True)

### Count Image Files per Set and Label

In [None]:
import os

sets = ['train', 'test', 'validation']
labels = ['Healthy', 'Infected']  

for set_name in sets:
    for label in labels:
        path = f'inputs/mildew_dataset/cherry-leaves/{set_name}/{label}' 
        try:
            number_of_files = len(os.listdir(path))
            print(f'There are {number_of_files} images in {set_name}/{label}')
        except FileNotFoundError:
            print(f"Error: Directory '{path}' not found.")

# Calculate and print total number of images
total_images = 0
for set_name in sets:
    for label in labels:
        path = f'inputs/mildew_dataset/cherry-leaves/{set_name}/{label}'  
        try:
            total_images += len(os.listdir(path))
        except FileNotFoundError:
            pass
print(f"\nTotal number of images: {total_images}")
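
The two loops above can be folded into a single pass that stores the counts and derives the total from them. The following sketch uses `pathlib` and counts only files with an image extension; the extension set, the function name `count_images`, and the temporary directory tree are assumptions made for the demonstration.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def count_images(root, sets, labels):
    """Return {(set_name, label): image count}, using 0 for missing directories."""
    counts = {}
    for set_name in sets:
        for label in labels:
            folder = Path(root) / set_name / label
            if folder.is_dir():
                counts[(set_name, label)] = sum(
                    1 for p in folder.iterdir()
                    if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
                )
            else:
                counts[(set_name, label)] = 0
    return counts


# Demonstration on a temporary directory tree with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    healthy = Path(tmp) / "train" / "Healthy"
    healthy.mkdir(parents=True)
    (healthy / "a.jpg").touch()
    (healthy / "b.JPG").touch()
    (healthy / "notes.txt").touch()  # ignored: not an image
    demo_counts = count_images(tmp, ["train"], ["Healthy", "Infected"])

print(demo_counts)
print("Total:", sum(demo_counts.values()))
```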

## Dataset Distribution & Image Characteristics

The bar chart shows the number of images per label in the train, validation, and test sets. The pie chart shows the overall label distribution across all three sets.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Initialize dictionary to store dataset statistics
data = {
    'Set': [],
    'Label': [],
    'Frequency': []
}

# Define dataset folders: train, validation, and test
folders = ['train', 'validation', 'test']

# Iterate through dataset folders and count images per label
for folder in folders:
    for label in labels:
        row = {
            'Set': folder,
            'Label': label,
            'Frequency': int(len(os.listdir(os.path.join(my_data_dir, folder, label))))
        }
        for key, value in row.items():
            data[key].append(value)

# Convert the dictionary into a DataFrame
df_freq = pd.DataFrame(data)

# Bar chart of image distribution per set and label
sns.set_style("whitegrid")
plt.figure(figsize=(8, 5))
sns.barplot(data=df_freq, x='Set', y='Frequency', hue='Label')
plt.title("Image Distribution in Dataset")
plt.savefig(f'{file_path}/labels_distribution.png', bbox_inches='tight', dpi=150)
plt.show()

# Pie chart of overall label distribution
plt.figure(figsize=(6, 6))
label_distribution = df_freq.groupby("Label")["Frequency"].sum()

plt.pie(label_distribution, labels=label_distribution.index, autopct='%1.1f%%', colors=["#1f77b4", "#ff7f0e"], startangle=90)
plt.title("Overall Class Distribution (Healthy vs Infected)")
plt.savefig(f'{file_path}/labels_pie_chart.png', bbox_inches='tight', dpi=150)
plt.show()
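
The charts can be backed by two numbers: the share of each label and the ratio between the largest and smallest class. This is a small sketch; `class_balance` is a hypothetical helper and the counts below are placeholders, not the dataset's real numbers.

```python
def class_balance(counts):
    """Return per-label proportions and the majority/minority ratio."""
    total = sum(counts.values())
    proportions = {label: n / total for label, n in counts.items()}
    smallest = min(counts.values())
    # An empty class makes the ratio undefined; report infinity instead of failing
    ratio = max(counts.values()) / smallest if smallest > 0 else float("inf")
    return proportions, ratio


# Hypothetical counts, not the dataset's real numbers
example_counts = {"Healthy": 1500, "Infected": 1000}
proportions, ratio = class_balance(example_counts)
print(proportions)
print(f"Imbalance ratio: {ratio:.2f}")
```

With the notebook's data, the input would be `label_distribution.to_dict()`. A ratio close to 1 indicates balanced classes.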

---

## Comparing Healthy vs. Infected Leaves

In [None]:
def subset_image_label(X, y, label_to_display):
    """
    Filters the dataset to include only the images that belong to a specific label.
    """
    # Create a boolean mask to filter for the given label
    boolean_mask = (y == label_to_display)
    df = X[boolean_mask]  # Subset the dataset
    return df


def diff_bet_avg_image_labels_data_as_array(X, y, label_1, label_2, figsize=(20, 5), save_image=False):
    """
    Compares the average images between two specified labels.

    - Verifies that both labels exist in the dataset.
    - Calculates the mean image for each label.
    - Calculates the pixel-wise difference between the class averages.
    - Displays or optionally saves a figure with:
        * Average image for label_1
        * Average image for label_2
        * Difference between the two averages
    """
    sns.set_style("white")

    # Validate that both labels are present in the dataset
    unique_labels = np.unique(y)
    if (label_1 not in unique_labels) or (label_2 not in unique_labels):
        print(f"Either label {label_1} or label {label_2} is not in {unique_labels}")
        return

    # Calculate mean from label_1
    images_label = subset_image_label(X, y, label_1)
    label1_avg = np.mean(images_label, axis=0)

    # Calculate mean from label_2
    images_label = subset_image_label(X, y, label_2)
    label2_avg = np.mean(images_label, axis=0)

    # Calculate pixel-wise difference between class averages
    difference_mean = label1_avg - label2_avg

    # Create and display a plot with the results
    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=figsize)
    
    # Determine if the images are RGB or grayscale
    is_rgb = label1_avg.shape[-1] == 3
    cmap = None if is_rgb else 'gray'

    axes[0].imshow(label1_avg, cmap=cmap)
    axes[0].set_title(f'Average {label_1}')

    axes[1].imshow(label2_avg, cmap=cmap)
    axes[1].set_title(f'Average {label_2}')

    # Note: for RGB float input, imshow clips values outside [0, 1],
    # so negative differences are displayed as 0
    axes[2].imshow(difference_mean, cmap=cmap)
    axes[2].set_title(f'Difference image: Avg {label_1} & {label_2}')

    # Save the plot to a file if save_image=True, otherwise display it
    if save_image:
        plt.savefig(f"{file_path}/avg_diff.png", bbox_inches='tight', dpi=150)
    else:
        plt.tight_layout()
        plt.show()

In [None]:
diff_bet_avg_image_labels_data_as_array(X=X, y=y,
                                        label_1='Healthy', label_2='Infected',
                                        figsize=(12, 10),
                                        save_image=True
                                        )
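
Because the images are normalized to [0, 1], the difference of two averages lies in [-1, 1], and Matplotlib clips RGB float values outside [0, 1] when displaying them. One option is to shift and scale the difference before plotting so that negative values stay visible. This is a sketch of that idea only; `rescale_for_display` and the synthetic arrays are illustrative and not part of the notebook.

```python
import numpy as np


def rescale_for_display(diff):
    """Linearly map an array with values in [-1, 1] to [0, 1]; 0 becomes 0.5."""
    return (diff + 1.0) / 2.0


# Synthetic average images standing in for label1_avg and label2_avg
avg_a = np.full((2, 2, 3), 0.2)
avg_b = np.full((2, 2, 3), 0.7)
diff = avg_a - avg_b  # every value is about -0.5, which imshow would clip to 0
display = rescale_for_display(diff)
print(diff.min(), display.min())
```

After rescaling, mid-gray means no difference, brighter pixels mean the first label is brighter, and darker pixels mean the second label is brighter.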

### Image Montage

In [None]:
import itertools
import random
sns.set_style("white")


def image_montage(dir_path, label_to_display, nrows, ncols, figsize=(15, 10)):
    """
    - Verify if the specified label exists in the directory.
    - Ensure the grid size (nrows * ncols) does not exceed the number of available images.
    - Select random images to fill the montage grid
    - Create a figure to display the images, loading and plotting each in the respective grid space.
    """

    labels = os.listdir(dir_path)

    # Check if the specified label exists in the directory
    if label_to_display in labels:

        # Validate that the montage grid can fit the available images
        images_list = os.listdir(dir_path + '/' + label_to_display)
        if nrows * ncols <= len(images_list):  
            img_idx = random.sample(images_list, nrows * ncols)
        else:
            print(
                f"Reduce the number of rows or columns for the montage. \n"
                f"There are only {len(images_list)} images available. "
                f"You requested a grid for {nrows * ncols} images.")
            return

        # Generate grid indices based on the number of rows and columns
        list_rows = range(0, nrows)
        list_cols = range(0, ncols)
        plot_idx = list(itertools.product(list_rows, list_cols))

        # Create a figure and populate it with the selected images
        fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
        for x in range(0, nrows * ncols):
            img = imread(dir_path + '/' + label_to_display + '/' + img_idx[x])
            img_shape = img.shape
            axes[plot_idx[x][0], plot_idx[x][1]].imshow(img)
            axes[plot_idx[x][0], plot_idx[x][1]].set_title(
                f"Width: {img_shape[1]}px, Height: {img_shape[0]}px")
            axes[plot_idx[x][0], plot_idx[x][1]].set_xticks([])
            axes[plot_idx[x][0], plot_idx[x][1]].set_yticks([])
        plt.tight_layout()
        plt.show()
        plt.close(fig)  

    else:
        # Notify the user if the selected label does not exist
        print(f"The selected label '{label_to_display}' does not exist.")
        print(f"Available labels are: {labels}")

### Run Montage

In [None]:
for label in labels:
    print(label)
    image_montage(dir_path=train_path,
                  label_to_display=label,
                  nrows=3, ncols=3,
                  figsize=(10, 15)
                  )
    print("\n")

## Statistical Testing on Image Characteristics

A two-sample t-test on normalized grayscale pixel intensities quantitatively compares Healthy and Infected leaves, supporting the business requirement of differentiating between them.

In [None]:
from scipy.stats import ttest_ind
import numpy as np
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Paths to image data
test_healthy_dir = os.path.join(test_path, "Healthy")
test_infected_dir = os.path.join(test_path, "Infected")

# List of images (50 from each class)
healthy_images = [os.path.join(test_healthy_dir, img) for img in os.listdir(test_healthy_dir)[:50]]
infected_images = [os.path.join(test_infected_dir, img) for img in os.listdir(test_infected_dir)[:50]]

def preprocess_grayscale(image_path, target_size=(128, 128)):
    img = load_img(image_path, target_size=target_size, color_mode="grayscale")
    img_array = img_to_array(img) / 255.0  # Normalize
    return img_array.flatten()  

healthy_data = np.array([preprocess_grayscale(img) for img in healthy_images])
infected_data = np.array([preprocess_grayscale(img) for img in infected_images])

# Perform t-test
t_stat, p_value = ttest_ind(healthy_data.flatten(), infected_data.flatten())

# Display results
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4e}")

if p_value < 0.05:
    print("There is a significant difference in pixel intensity distributions.")
else:
    print("No statistically significant difference between classes.")
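
The test above pools every pixel from all sampled images, so pixels from the same image are treated as independent observations and the sample size becomes very large. One alternative, sketched here, is to reduce each image to a single mean-brightness value and compare those values, reporting an effect size alongside the test statistic. The function `welch_t_and_cohens_d` and the synthetic arrays are assumptions for illustration; the arrays only mimic the `(n_images, n_pixels)` layout of `healthy_data` and `infected_data`.

```python
import numpy as np


def welch_t_and_cohens_d(a, b):
    """Return Welch's t statistic and Cohen's d for two 1-D samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)
    t_stat = (a.mean() - b.mean()) / np.sqrt(var_a / len(a) + var_b / len(b))
    pooled_sd = np.sqrt(
        ((len(a) - 1) * var_a + (len(b) - 1) * var_b) / (len(a) + len(b) - 2)
    )
    cohens_d = (a.mean() - b.mean()) / pooled_sd
    return t_stat, cohens_d


# Synthetic stand-ins shaped like healthy_data and infected_data: (n_images, n_pixels)
rng = np.random.default_rng(42)
healthy_demo = rng.normal(0.45, 0.05, size=(50, 1)) + rng.normal(0, 0.1, size=(50, 64))
infected_demo = rng.normal(0.50, 0.05, size=(50, 1)) + rng.normal(0, 0.1, size=(50, 64))

# One brightness value per image, so each image counts as one observation
healthy_means = healthy_demo.mean(axis=1)
infected_means = infected_demo.mean(axis=1)

t_stat_demo, d_demo = welch_t_and_cohens_d(healthy_means, infected_means)
print(f"Welch t: {t_stat_demo:.3f}, Cohen's d: {d_demo:.3f}")
```

To obtain a p-value for the per-image means, they can be passed to `ttest_ind(healthy_means, infected_means, equal_var=False)` from `scipy.stats`.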

## Feature Differences

### PCA Visualization

PCA projects the flattened images onto two components, which helps visualize class differences in feature space and supports the objective of highlighting visual differences.

In [None]:
from sklearn.decomposition import PCA
import seaborn as sns

# Convert to NumPy array
X_flat = X.reshape(len(X), -1)  # Flatten images for PCA

# Apply PCA (reduce to 2 dimensions)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_flat)

# Visualize results
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y, palette={"Healthy": "blue", "Infected": "red"}, alpha=0.7)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Visualization of Leaf Images")
plt.legend(title="Class")
plt.grid(True)
plt.show()
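
When reading the scatter plot, it helps to know how much of the total variance the two components capture. The sketch below reproduces the core PCA computation with NumPy's SVD on synthetic data to show where that number comes from; `pca_2d` and the demo data are illustrative. With scikit-learn, the same quantity is available as `pca.explained_variance_ratio_` after fitting.

```python
import numpy as np


def pca_2d(X_flat):
    """Project rows of X_flat onto the first two principal components via SVD."""
    centered = X_flat - X_flat.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    projected = centered @ Vt[:2].T
    return projected, explained_ratio[:2]


# Synthetic data: 60 flattened "images" with 48 features each
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 48))
X_demo[:30, 0] += 5.0  # shift one feature for half of the samples

projected, ratio = pca_2d(X_demo)
print(projected.shape, ratio)
```

A low combined ratio means the 2-D plot hides most of the variation, so apparent overlap between classes should be interpreted with care.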

### t-SNE Visualization

t-SNE helps assess how separable the classes are in feature space, which is useful for analyzing visual clusters.

In [None]:
from sklearn.manifold import TSNE

# Apply t-SNE (reduce to 2 dimensions)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_flat)

# Visualize results
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y, palette={"Healthy": "blue", "Infected": "red"}, alpha=0.7)
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.title("t-SNE Visualization of Leaf Images")
plt.legend(title="Class")
plt.grid(True)
plt.show()

### Histogram of Pixel Intensity

This shows how grayscale pixel intensities are distributed in each of the two classes.

In [None]:
# Create Histogram
plt.figure(figsize=(10, 5))
plt.hist(healthy_data.flatten(), bins=30, alpha=0.7, label="Healthy", color="blue", density=True)
plt.hist(infected_data.flatten(), bins=30, alpha=0.7, label="Infected", color="red", density=True)
plt.xlabel("Pixel Intensity")
plt.ylabel("Density")
plt.legend()
plt.title("Histogram of Pixel Intensity (Healthy vs Infected)")
plt.show()
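
With `density=True`, the bar heights are normalized so that the area under the histogram equals 1, which makes the two classes comparable even if they contain different numbers of pixels. The check below illustrates this with `np.histogram`, which uses the same normalization, on synthetic intensities rather than the real image data.

```python
import numpy as np

rng = np.random.default_rng(1)
pixels = rng.random(10_000)  # synthetic intensities in [0, 1)

density, edges = np.histogram(pixels, bins=30, range=(0.0, 1.0), density=True)
bin_widths = np.diff(edges)
area = np.sum(density * bin_widths)  # integrates to 1 by construction
print(f"Area under histogram: {area:.6f}")
```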

## Image Explainability & Feature Importance

### Grad-CAM Heatmap for Explainability

This shows which parts of an image contribute most to classification.

In [None]:
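import tensorflow as tf
from tensorflow.keras.models import Model
import cv2

# NOTE: `model` must be a trained Keras model loaded beforehand; it is not
# defined in this notebook. The layer name 'conv2d_2' must match a
# convolutional layer of that model (check model.summary()).

def get_grad_cam(model, img_array, layer_name):
    """Return a Grad-CAM heatmap scaled to [0, 1] for a batch of one image."""
    grad_model = Model(inputs=model.input,
                       outputs=[model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_outputs, predictions = grad_model(tf.cast(img_array, tf.float32))
        loss = predictions[:, 0]
    grads = tape.gradient(loss, conv_outputs)
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))  # one weight per channel
    heatmap = tf.reduce_sum(tf.multiply(pooled_grads, conv_outputs), axis=-1)
    heatmap = np.maximum(heatmap.numpy(), 0)  # keep positive contributions only
    max_value = heatmap.max()
    if max_value > 0:  # avoid division by zero for an all-zero heatmap
        heatmap = heatmap / max_value
    return heatmap[0]

# Apply to a sample image
grad_cam_map = get_grad_cam(model, np.expand_dims(X[0], axis=0), 'conv2d_2')

# Overlay heatmap on original image (addWeighted needs matching dtype and channels)
heatmap = cv2.resize(grad_cam_map, (128, 128))
heatmap = cv2.applyColorMap(np.uint8(255 * heatmap), cv2.COLORMAP_JET)
heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)  # OpenCV color maps are BGR
original = np.uint8(255 * X[0])  # X holds floats in [0, 1]
superimposed_img = cv2.addWeighted(original, 0.5, heatmap, 0.5, 0)

plt.imshow(superimposed_img)
plt.title("Grad-CAM Visualization")
plt.axis("off")
plt.show()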
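
The weighting step inside Grad-CAM can be examined without a trained model. The sketch below applies the same arithmetic (channel weights from spatially averaged gradients, a weighted sum of feature maps, ReLU, then scaling to [0, 1]) to synthetic arrays; `grad_cam_from_arrays` and its inputs are illustrative stand-ins for the tensors a real model would provide.

```python
import numpy as np


def grad_cam_from_arrays(conv_outputs, grads):
    """Combine feature maps (H, W, C) with gradients (H, W, C) into a heatmap (H, W)."""
    channel_weights = grads.mean(axis=(0, 1))       # one weight per channel
    heatmap = np.sum(conv_outputs * channel_weights, axis=-1)
    heatmap = np.maximum(heatmap, 0)                # ReLU: keep positive evidence
    peak = heatmap.max()
    return heatmap / peak if peak > 0 else heatmap  # avoid dividing by zero


# Synthetic feature maps and gradients; a real run takes them from the model
rng = np.random.default_rng(0)
conv_demo = rng.random((8, 8, 4))
grads_demo = rng.random((8, 8, 4))

heatmap_demo = grad_cam_from_arrays(conv_demo, grads_demo)
print(heatmap_demo.shape, heatmap_demo.min(), heatmap_demo.max())
```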

### Occlusion Sensitivity Map

This helps understand how image regions impact classification.

In [None]:
def occlusion_sensitivity(model, img_array, patch_size=20, stride=10):
    img_height, img_width, _ = img_array.shape[1:]
    orig_pred = model.predict(img_array)[0, 0]  
    sensitivity_map = np.zeros((img_height, img_width))

    for y in range(0, img_height, stride):
        for x in range(0, img_width, stride):
            occluded_img = img_array.copy()
            occluded_img[:, y:y+patch_size, x:x+patch_size, :] = 0  

            new_pred = model.predict(occluded_img)[0, 0]
            sensitivity_map[y:y+patch_size, x:x+patch_size] = abs(orig_pred - new_pred)

    return sensitivity_map

# Generate occlusion map
sensitivity_map = occlusion_sensitivity(model, np.expand_dims(X[0], axis=0))

# Plot occlusion sensitivity map
plt.figure(figsize=(6, 6))
plt.imshow(sensitivity_map, cmap="hot")
plt.colorbar()
plt.title("Occlusion Sensitivity Map")
plt.show()
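
Because the stride is smaller than the patch size, patches overlap and each assignment in the loop above overwrites values written by earlier patches. A variant that averages the score change over all patches covering a pixel is sketched below. It uses a toy scoring function in place of `model.predict`, so `occlusion_map`, `toy_score`, and the demo image are assumptions for illustration only.

```python
import numpy as np


def occlusion_map(predict_fn, img, patch_size=4, stride=2):
    """Average the absolute score change over every patch that covers a pixel."""
    height, width = img.shape[:2]
    baseline = predict_fn(img)
    total = np.zeros((height, width))
    hits = np.zeros((height, width))
    for top in range(0, height, stride):
        for left in range(0, width, stride):
            occluded = img.copy()
            occluded[top:top + patch_size, left:left + patch_size] = 0
            change = abs(baseline - predict_fn(occluded))
            total[top:top + patch_size, left:left + patch_size] += change
            hits[top:top + patch_size, left:left + patch_size] += 1
    return total / hits


def toy_score(img):
    """Hypothetical scorer: mean brightness of the top-left 4x4 corner."""
    return float(img[:4, :4].mean())


img_demo = np.ones((8, 8, 3))
sens_demo = occlusion_map(toy_score, img_demo)
print(sens_demo.round(2))
```

In this toy case the map is highest in the top-left corner, the only region the scorer looks at, and zero in the bottom-right corner.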

---

## Conclusion and Next Steps

### Summary of Findings
This notebook provided **a comprehensive visual analysis** of the cherry leaf dataset. Key takeaways include:

- Image Dimensions: Images are resized to 128×128 pixels, which standardizes the model input and reduces computational cost.
- Class Distribution: The dataset appears balanced between healthy and infected leaves.
- Mean & Variability Analysis:
  - Healthy leaves show consistent structure.
  - Infected leaves exhibit higher variability, likely due to different infection stages.
- Image Montage: Randomly sampled images provide a qualitative check of image quality for each label.

### Next Steps
1. Integrate Visuals into the Streamlit Dashboard
   - Use image montages for interactive dataset exploration.
   - Display class distributions and image variability.

2. Enhance Data Preprocessing
   - Apply data augmentation based on observed variability.
   - Consider histogram equalization to improve contrast.

3. Model Training & Feature Engineering
   - Utilize the image shape information for preprocessing.
   - Experiment with feature extraction (e.g., PCA, edge detection).

4. Further Improvements
   - Analyze color distribution differences between classes (e.g., HSV transformation).
   - Expand visualizations to include pixel intensity distributions.

This concludes the Data Visualization phase. The next step is to prepare the dataset for model training and implement necessary preprocessing techniques.