# **Modeling and Evaluation**

## Objectives

- Address Business Requirement 2: Develop a model to determine whether a given leaf is infected with powdery mildew.
- Implement machine learning techniques to train and evaluate a classification model with hyperparameter tuning.

## Inputs

Dataset Directories:
- inputs/mildew_dataset/cherry-leaves/train
- inputs/mildew_dataset/cherry-leaves/test
- inputs/mildew_dataset/cherry-leaves/validation
- Image shape embeddings: the `image_shape.pkl` file saved by the Data Visualization Notebook.

## Outputs

- Image distribution plot for training, validation, and test sets.
- Implementation of image augmentation techniques with real-time sample visualization.
- Class indices mapping for label interpretation during inference.
- Feature scaling and selection pipeline using GridSearchCV. 
- Optimized model with hyperparameter tuning using GridSearchCV.
- Best hyperparameter combination selected through cross-validation.
- Trained machine learning model using the best configuration.
- Saved trained model for future inference.
- Learning curve plot illustrating model performance over epochs.
- Model evaluation metrics (Accuracy, Precision, Recall, F1-score) saved as a pickle file.
- Confusion matrix and classification report to analyze prediction performance.
- Prediction on a randomly selected image from the test set with probability scores.
- Multiple image predictions comparing ground truth vs. model predictions.

## Additional Comments

- This notebook focuses on developing and training a classification model using the structured dataset.
- Performance evaluation ensures that the model meets the defined business requirement.
- Proper validation and testing procedures ensure model robustness before deployment.
- The trained model will serve as the backbone for the mildew detection application, aiding in real-time predictions.



---

## Import packages

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

### Set up environment

In [None]:
os.chdir('/workspace/powdery-mildew-detector')
work_dir = os.getcwd()  # Record the project root after changing directory
print(f"Current working directory: {work_dir}")

### Set input directories

In [None]:
my_data_dir = 'inputs/mildew_dataset/cherry-leaves'
train_path = my_data_dir + '/train'
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'

### Set output directory

In [None]:
version = 'v1'
file_path = f'outputs/{version}'

# Create the versioned output folder only if it does not exist yet
if os.path.isdir(file_path):
    print('This version already exists. Create a new version to avoid overwriting outputs.')
else:
    os.makedirs(name=file_path)

### Set labels

In [None]:
# Sort so the order matches the alphabetical class indices Keras assigns
labels = sorted(os.listdir(train_path))
print(f"Project Labels: {labels}")

### Set image shape

In [None]:
import joblib
version = 'v1'
image_shape = joblib.load(filename=f"outputs/{version}/image_shape.pkl")
image_shape

---

### Number of images in train, test and validation data

In [None]:
import pandas as pd

# Create an empty dictionary to store data
data = {
    'Set': [],
    'Label': [],
    'Frequency': []
}

# List of dataset folders
folders = ['train', 'validation', 'test']

# Go through each folder and label to count the images
for folder in folders:
    for label in labels:
        # Count the files once and reuse the value for the table and the printout
        n_images = len(os.listdir(os.path.join(my_data_dir, folder, label)))
        data['Set'].append(folder)
        data['Label'].append(label)
        data['Frequency'].append(n_images)
        print(f"* {folder} - {label}: {n_images} images")

# Convert the dictionary into a DataFrame
df_freq = pd.DataFrame(data)

print("\n")

# Set plot style
sns.set_style("whitegrid")
plt.figure(figsize=(8, 5))

# Create a bar chart to show image distribution
sns.barplot(data=df_freq, x='Set', y='Frequency', hue='Label')
plt.savefig(f'{file_path}/labels_distribution.png',
            bbox_inches='tight', dpi=150)
plt.show()
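
The per-label counts can also be read as class proportions within each split, which helps judge whether the classes are balanced. The following is a minimal sketch; the label names and counts are hypothetical placeholders, not values from this dataset.

```python
from collections import defaultdict


def label_proportions(rows):
    """Return the share of each label within its split.

    rows: iterable of (set_name, label, count) tuples.
    """
    rows = list(rows)
    totals = defaultdict(int)
    for set_name, _, count in rows:
        totals[set_name] += count
    return {
        (set_name, label): count / totals[set_name]
        for set_name, label, count in rows
        if totals[set_name] > 0
    }


# Hypothetical counts, for illustration only
example_rows = [
    ("train", "healthy", 70), ("train", "powdery_mildew", 30),
    ("test", "healthy", 10), ("test", "powdery_mildew", 10),
]
proportions = label_proportions(example_rows)
for key, share in proportions.items():
    print(key, f"{share:.2f}")
```

With the notebook's own table, the same rows could be obtained from `df_freq.itertuples(index=False)`.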

---

### Image data augmentation

---

Image data generator

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

- Initialize image data generator

In [None]:
augmented_image_data = ImageDataGenerator(rotation_range=20,
                                          width_shift_range=0.10,
                                          height_shift_range=0.10,
                                          shear_range=0.1,
                                          zoom_range=0.1,
                                          horizontal_flip=True,
                                          vertical_flip=True,
                                          fill_mode='nearest',
                                          rescale=1./255)
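
To make the generator settings more concrete, the sketch below reproduces three of them (rescaling, horizontal flip, vertical flip) with plain NumPy on a tiny hypothetical image. It is only an illustration of the array operations involved: the Keras generator applies flips at random and also handles rotation, shifts, shear, and zoom, none of which are shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 4x4 RGB image with 8-bit pixel values (not from the dataset)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# rescale=1./255 maps pixel values from [0, 255] to [0, 1]
rescaled = image.astype(np.float64) / 255.0

# A horizontal flip mirrors the columns, a vertical flip mirrors the rows
h_flipped = rescaled[:, ::-1, :]
v_flipped = rescaled[::-1, :, :]

print("value range:", rescaled.min(), rescaled.max())
print("first column after horizontal flip equals original last column:",
      np.array_equal(h_flipped[:, 0], rescaled[:, -1]))
```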

- Augment training image dataset

In [None]:
batch_size = 20  # Number of images processed in each batch
train_set = augmented_image_data.flow_from_directory(train_path,
                                                     target_size=image_shape[:2],
                                                     color_mode='rgb',
                                                     batch_size=batch_size,
                                                     class_mode='binary',
                                                     shuffle=True
                                                     )

# Print dataset information
print("Class indices:", train_set.class_indices)  # Dictionary mapping labels to indices
print("Number of classes:", len(train_set.class_indices))  # Total number of classes
print("Total number of images in dataset:", train_set.samples)  # Total number of images (before augmentation)

- Prepare validation image dataset (rescaling only, no augmentation)

In [None]:
# Preprocessing the validation images: Normalize pixel values to the range [0, 1]
validation_set = ImageDataGenerator(rescale=1./255).flow_from_directory(val_path,
                                                                        target_size=image_shape[:2],
                                                                        color_mode='rgb',
                                                                        batch_size=batch_size,
                                                                        class_mode='binary',
                                                                        shuffle=False
                                                                        )

# Display class indices (label mapping)
print(validation_set.class_indices)

- Prepare test image dataset (rescaling only, no augmentation)

In [None]:
# Preprocessing the test images: Normalize pixel values to the range [0, 1]
test_set = ImageDataGenerator(rescale=1./255).flow_from_directory(test_path,
                                                                  target_size=image_shape[:2],
                                                                  color_mode='rgb',
                                                                  batch_size=batch_size,
                                                                  class_mode='binary',
                                                                  shuffle=False
                                                                  )

# Display class indices (label mapping)
print(test_set.class_indices)

### Plot augmented images

Training images

In [None]:
for _ in range(3):
    img, label = next(train_set)
    print(img.shape)  # (batch_size, height, width, 3), e.g. (20, 256, 256, 3)
    plt.imshow(img[0])
    plt.show()

Validation images

In [None]:
for _ in range(3):
    img, label = next(validation_set)
    print(img.shape)  # (batch_size, height, width, 3), e.g. (20, 256, 256, 3)
    plt.imshow(img[0])
    plt.show()

Test images

In [None]:
for _ in range(3):
    img, label = next(test_set)
    print(img.shape)  # (batch_size, height, width, 3), e.g. (20, 256, 256, 3)
    plt.imshow(img[0])
    plt.show()

### Save class indices

In [None]:
joblib.dump(value=train_set.class_indices,
            filename=f"{file_path}/class_indices.pkl")
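
The saved mapping goes from label name to index, so at inference time it has to be inverted to turn a model output back into a label. The sketch below shows one way this could look for a binary classifier that outputs a single probability for class index 1. The mapping contents, the `decode_prediction` helper, and the 0.5 threshold are assumptions for illustration.

```python
# Hypothetical mapping in the form returned by `class_indices`
class_indices = {"healthy": 0, "powdery_mildew": 1}

# Invert the mapping: index -> label name
index_to_label = {index: name for name, index in class_indices.items()}


def decode_prediction(probability, threshold=0.5):
    """Map a probability for class index 1 to (label, confidence in that label)."""
    index = int(probability >= threshold)
    confidence = probability if index == 1 else 1.0 - probability
    return index_to_label[index], confidence


print(decode_prediction(0.92))
print(decode_prediction(0.10))
```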

---

## CNN Model Creation

---

- Import model packages

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, Conv2D, MaxPooling2D
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, recall_score

In [None]:
def create_tf_model():
    """ 
    Create a CNN model with multiple Conv2D and MaxPooling2D layers.
    The model uses different filter sizes and layer structures 
    to extract features from images.
    """
    
    model = Sequential([
        Conv2D(filters=32, kernel_size=(3, 3), activation='relu', input_shape=image_shape),
        MaxPooling2D(pool_size=(2, 2)),

        Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),

        Conv2D(filters=128, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),

        Conv2D(filters=128, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),

        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.3),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    
    return model

- Model Summary

In [None]:
create_tf_model().summary()
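
The shapes reported by the summary can be checked by hand. The sketch below traces the spatial size through the four Conv2D/MaxPooling2D blocks, assuming a 256x256 input (the actual value comes from `image_shape.pkl`) and the Keras defaults of `padding='valid'` and stride 1 for the convolutions.

```python
def conv_output(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool


size = 256  # assumed input height and width
filters_per_block = [32, 64, 128, 128]
for filters in filters_per_block:
    size = pool_output(conv_output(size))
    print(f"after block with {filters} filters: {size}x{size}x{filters}")

flattened = size * size * filters_per_block[-1]
dense_params = flattened * 128 + 128  # weights plus biases of Dense(128)
print("flattened features:", flattened)
print("Dense(128) parameters:", dense_params)
```

Under these assumptions most of the model's parameters sit in the first dense layer, which is why the flattened size is worth checking before training.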

- Early Stopping 

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
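
The effect of `patience=5` can be illustrated with a small simulation. This is a simplified sketch of the patience logic rather than the Keras implementation: the `simulate_early_stopping` helper and the validation losses are hypothetical, and details such as `min_delta` are ignored.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch (0-indexed)
losses = [0.90, 0.70, 0.60, 0.65, 0.64, 0.63, 0.62, 0.61, 0.50]
stopped_at, best_epoch = simulate_early_stopping(losses, patience=5)
print("stopped at epoch", stopped_at, "| best epoch", best_epoch)
```

In this example training stops before the later, lower loss is ever reached. With `restore_best_weights=True`, the weights kept would be those from the best epoch seen so far, not from the last epoch run.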

### Define Pipeline with Feature Scaling & Selection

In [None]:
def pipeline_clf():
    return Pipeline([
        ("scaler", StandardScaler()),
        ("feature_selection", SelectFromModel(RandomForestClassifier(random_state=42))),
        ("model", RandomForestClassifier(random_state=42))
    ])
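
A scikit-learn pipeline like this one expects a 2-D feature array and a 1-D label array rather than image batches. One possible way to bridge the two is to stack a fixed number of batches and flatten each image, as sketched below. The `collect_features` helper is hypothetical, and `fake_batches` is a stand-in for the Keras iterators so that the example runs without the dataset.

```python
import numpy as np


def collect_features(batches, n_batches):
    """Stack `n_batches` (images, labels) batches into 2-D X and 1-D y."""
    X_parts, y_parts = [], []
    # zip with range() bounds the loop, since the batch source may be endless
    for _, (images, labels) in zip(range(n_batches), batches):
        images = np.asarray(images)
        X_parts.append(images.reshape(len(images), -1))  # flatten each image
        y_parts.append(np.asarray(labels))
    return np.concatenate(X_parts), np.concatenate(y_parts)


def fake_batches(batch_size=4, shape=(8, 8, 3)):
    """Endless stand-in generator yielding random images and binary labels."""
    rng = np.random.default_rng(0)
    while True:
        yield rng.random((batch_size, *shape)), rng.integers(0, 2, batch_size)


X, y = collect_features(fake_batches(), n_batches=3)
print(X.shape, y.shape)
```

With the notebook's generators, a call might look like `collect_features(train_set, len(train_set))`. Note that flattening a 256x256 RGB image yields 196,608 features per sample, so memory use grows quickly with dataset size.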

### Define Hyperparameter Grid

In [None]:
param_grid = {
    "model__n_estimators": [50, 100, 150],
    "model__max_depth": [10, 20, None],
    "model__min_samples_split": [2, 5, 10]
}

### GridSearchCV with Recall Score

In [None]:
scorer = make_scorer(recall_score, pos_label=1)
grid_search = GridSearchCV(estimator=pipeline_clf(), param_grid=param_grid, cv=3, scoring=scorer, verbose=2, n_jobs=-1)
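
The cost of the search follows directly from the grid and the number of folds. The short sketch below counts the candidate combinations using only the standard library; the grid values and `cv=3` are copied from the cells above.

```python
from itertools import product

param_grid = {
    "model__n_estimators": [50, 100, 150],
    "model__max_depth": [10, 20, None],
    "model__min_samples_split": [2, 5, 10],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds
print(len(combinations), "candidates x", cv_folds, "folds =", total_fits, "fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training data after cross-validation.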

### Fit Model

In [None]:
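# Assumes X_train (2-D features) and y_train (labels) were built from the training images
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
joblib.dump(best_model, f'outputs/{version}/best_model.pkl')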

### Model Evaluation

In [None]:
# Assumes X_test and y_test were built from the test images in the same way as the training arrays
y_pred = best_model.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=labels))

### Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.savefig(f'{file_path}/confusion_matrix.png', bbox_inches='tight', dpi=150)
plt.show()
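
The figures in the classification report can be derived from the confusion matrix itself. The sketch below does this for a binary matrix laid out as scikit-learn returns it (rows are actual classes, columns are predicted classes, class 1 is the positive class). The matrix values and the `binary_metrics` helper are hypothetical and are not results from this notebook.

```python
import numpy as np


def binary_metrics(cm):
    """Compute accuracy, precision, recall, and F1 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = (int(v) for v in np.asarray(cm).ravel())
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical confusion matrix: [[TN, FP], [FN, TP]]
example_cm = [[50, 10], [5, 35]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall is the share of actual positives that the model catches, which is the quantity the grid search scorer optimizes.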

---

### Print Best Hyperparameters

In [None]:
print("Best Hyperparameters:")
print(grid_search.best_params_)

---

## Conclusion and next steps

---