<h1>Introduction</h1>

This notebook implements a wildfire prediction pipeline using the Wildfire Prediction Dataset from Kaggle. The dataset contains satellite images (350x350px) organized into train, test, and valid folders, each with wildfire and nowildfire subfolders. The goal is to classify images as wildfire or non-wildfire using two models: a custom Convolutional Neural Network (CNN) and a pretrained ResNet50 model with transfer learning. The pipeline follows these steps:

<b>Data Cleaning:</b> Check for corrupt images or inconsistencies and visualize the dataset.

<b>Data Preprocessing:</b> Prepare images for modeling using TensorFlow’s ImageDataGenerator.

<b>Modeling:</b> Implement a custom CNN and a pretrained ResNet50 model.

<b>Results and Comparison:</b> Evaluate and compare model performance using accuracy, precision, recall, F1-score, confusion matrices, and ROC curves.




In [5]:
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')
# Set random seed for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

In [6]:
base_dir = 'C:\\Users\\DELL\\Downloads\\wildfire-prediction-dataset'

In [7]:
train_dir = os.path.join(base_dir, 'train')
valid_dir = os.path.join(base_dir, 'valid')
test_dir = os.path.join(base_dir, 'test') 

<h1>Step 1: Data Cleaning</h1>

<b>Objective:</b> Ensure the dataset is clean by checking for corrupt images, verifying folder structure, and visualizing class distribution.

<b>Rationale:</b> Image datasets can have corrupt files or imbalanced classes, which can affect model training. Check for issues and visualize the number of images per class to understand the dataset.

In [8]:
# Function to check for corrupt images
import glob
from PIL import Image
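def verify_image(file):
    """Return the file path if the image is corrupt or unreadable, else None."""
    try:
        # The context manager closes the file even if verify() raises
        with Image.open(file) as img:
            img.verify()
        return None
    except Exception:
        return file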

In [9]:
# Function to check a random sample of images in a directory in parallel
def check_images(directory, max_images=1000):
    print(f"Processing {directory}...")
    corrupt_files = []
    
    # Collect image paths (limit to max_images)
    image_paths = []
    for subdir in ['wildfire', 'nowildfire']:
        path = os.path.join(directory, subdir, '*.jpg')  # Assuming JPG
        subdir_paths = glob.glob(path)
        np.random.shuffle(subdir_paths)  # Random sample
        image_paths.extend(subdir_paths[:max_images//2])  # 500 per class
    
    if not image_paths:
        print(f"No images found in {directory}. Check path or file extensions.")
        return corrupt_files
    
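    # Verify images with a thread pool. On Windows, multiprocessing.Pool cannot
    # import functions defined in a notebook cell, which can leave the progress
    # bar stuck at 0%. Threads avoid that problem.
    from concurrent.futures import ThreadPoolExecutor
    start_time = time.time()
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
        results = list(tqdm(executor.map(verify_image, image_paths),
                            total=len(image_paths),
                            desc=f"Checking {os.path.basename(directory)}"))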
    
    corrupt_files = [r for r in results if r is not None]
    
    if corrupt_files:
        print(f"Found {len(corrupt_files)} corrupt files in {directory}:")
        for file in corrupt_files:
            print(f"Corrupt: {file}")
    else:
        print(f"No corrupt files found in {directory} sample.")
    
    print(f"Finished {directory} in {time.time() - start_time:.2f} seconds.")
    return corrupt_files

In [None]:
# Check for corrupt images in train, valid, and test sets

from multiprocessing import Pool,cpu_count
from tqdm import tqdm
import time

print("Starting corrupt image check...")
train_corrupt = check_images(train_dir)
valid_corrupt = check_images(valid_dir)
test_corrupt = check_images(test_dir)



Starting corrupt image check...
Processing C:\Users\DELL\Downloads\wildfire-prediction-dataset\train...


Checking train:   0%|                                                                         | 0/1000 [00:00<?, ?it/s]

In [1]:
# Report corrupt files
total_corrupt = len(train_corrupt + valid_corrupt + test_corrupt)
if total_corrupt > 0:
    print(f"\nTotal: Found {total_corrupt} corrupt files. Consider removing them.")
else:
    print("\nTotal: No corrupt files found. Dataset is clean!")

NameError: name 'train_corrupt' is not defined

In [None]:
# Count images per class
def count_images(directory):
    wildfire_count = len(glob.glob(os.path.join(directory, 'wildfire', '*.jpg')))
    nowildfire_count = len(glob.glob(os.path.join(directory, 'nowildfire', '*.jpg')))
    return wildfire_count, nowildfire_count

train_wf, train_nwf = count_images(train_dir)
valid_wf, valid_nwf = count_images(valid_dir)
test_wf, test_nwf = count_images(test_dir)

In [None]:
# Visualize class distribution
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
labels = ['Wildfire', 'No Wildfire']
train_counts = [train_wf, train_nwf]
valid_counts = [valid_wf, valid_nwf]
test_counts = [test_wf, test_nwf]

plt.figure(figsize=(10, 5))
x = np.arange(len(labels))
width = 0.25

plt.bar(x - width, train_counts, width, label='Train')
plt.bar(x, valid_counts, width, label='Validation')
plt.bar(x + width, test_counts, width, label='Test')
plt.xlabel('Class')
plt.ylabel('Number of Images')
plt.title('Class Distribution Across Dataset Splits')
plt.xticks(x, labels)
plt.legend()
plt.show()

<h2>Insights:</h2>

If corrupt files are found, they should be removed manually or programmatically to avoid errors during training.

If the class distribution is imbalanced, we’ll address it in preprocessing (e.g., using class weights or data augmentation).
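
As a minimal sketch of the class-weight option, the snippet below derives a weight for each class that is inversely proportional to its frequency, using the formula total / (n_classes * count). The function name balanced_class_weights and the counts are hypothetical and for illustration only; they are not the dataset's real numbers.

```python
def balanced_class_weights(counts):
    """Map class index -> total / (n_classes * count); rarer classes get larger weights."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Hypothetical counts: class 0 (nowildfire) = 300 images, class 1 (wildfire) = 100 images
example_counts = {0: 300, 1: 100}
weights = balanced_class_weights(example_counts)
print(weights)
```

With the real counts from count_images(train_dir), a dictionary like this could be passed to Keras through the class_weight argument of fit. If the two classes are roughly balanced, both weights come out close to 1 and the step can be skipped.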

<h1>Step 2: Data Preprocessing</h1>
<b>Objective:</b> Load and preprocess images for modeling, including resizing, normalization, and data augmentation. Use the provided train, valid, and test splits.

<b>Rationale:</b> Images need to be resized to a consistent size (e.g., 224x224 for ResNet50 compatibility), normalized to [0,1], and augmented to improve model generalization. The dataset is already split, so we’ll use the provided folders.

In [None]:
# Define image parameters
IMG_SIZE = (224, 224)  # Standard size for ResNet50 and custom CNN
BATCH_SIZE = 32

In [None]:
# Create data generators with augmentation for training
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(
    rescale=1./255,  # Normalize pixel values
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

In [None]:
# Validation and test sets: only rescale, no augmentation
valid_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)



In [None]:
# Load data from directories
# Train set
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='binary',
    shuffle=True
)
# Validation set
valid_generator = valid_datagen.flow_from_directory(
    valid_dir,
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='binary',
    shuffle=False
)
# Test set
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='binary',
    shuffle=False
)

In [None]:
# Verify class indices
print("Class indices:", train_generator.class_indices)

<h2>Insights:</h2>

<b>Data Augmentation:</b> Applied to the training set to increase robustness (e.g., rotations, flips). Not applied to validation/test sets to ensure unbiased evaluation.

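<b>Normalization:</b> Rescaling to [0,1] gives the networks inputs in a consistent range. Note that the Keras ResNet50 ImageNet weights were trained on inputs processed with tf.keras.applications.resnet50.preprocess_input rather than [0,1] rescaling, so the frozen base may extract weaker features from rescaled inputs.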

<b>Class Indices:</b> Confirm that wildfire is class 1 and nowildfire is class 0 for binary classification.
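
flow_from_directory assigns label indices in alphanumeric order of the subfolder names, which is why nowildfire is expected to be 0 and wildfire 1. The sketch below reproduces that ordering in plain Python; infer_class_indices is a hypothetical helper, not part of Keras.

```python
def infer_class_indices(subfolders):
    """Mimic alphanumeric class ordering: sorted folder names -> 0, 1, ..."""
    return {name: idx for idx, name in enumerate(sorted(subfolders))}


indices = infer_class_indices(['wildfire', 'nowildfire'])
print(indices)  # {'nowildfire': 0, 'wildfire': 1}

# Fail early if the positive class is not the one the metrics assume
assert indices['wildfire'] == 1, "wildfire should be the positive class"
```

The same assertion can be applied directly to train_generator.class_indices so that a renamed folder does not silently flip the labels.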

<h1>Step 3: Modeling</h1>

We’ll implement two models:

<b>Custom CNN:</b> A simple convolutional neural network designed for this task.

<b>Pretrained ResNet50:</b> A transfer learning model using ResNet50 with weights pretrained on ImageNet.

<h3>Custom CNN</h3>
<b>Rationale:</b> A custom CNN allows us to design a lightweight model tailored to the dataset. We’ll use a few convolutional layers followed by dense layers for binary classification.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

In [None]:
# Build custom CNN
custom_cnn = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')  # Binary classification
])
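
To see how lightweight this architecture really is, the following sketch walks through the layer arithmetic by hand. It assumes the Keras defaults used above: 3x3 convolutions with 'valid' padding and stride 1, and 2x2 max pooling. The helper names conv_out and pool_out are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


size, channels, total_params = 224, 3, 0
for filters in (32, 64, 128):
    # Each filter has 3*3*channels weights plus one bias
    total_params += (3 * 3 * channels + 1) * filters
    size = pool_out(conv_out(size))
    channels = filters
    print(f"after conv({filters}) + pool: {size}x{size}x{channels}")

flat = size * size * channels          # Flatten output length
total_params += (flat + 1) * 128       # Dense(128)
total_params += (128 + 1) * 1          # Dense(1)
print(f"flattened features: {flat}, total parameters: {total_params}")
```

Almost all of the parameters sit in the first dense layer, because Flatten hands it 86,528 features. The result can be compared against custom_cnn.summary().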

In [None]:
# Compile the model
custom_cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model
custom_cnn_history = custom_cnn.fit(
    train_generator,
    epochs=1,
    validation_data=valid_generator
)

<h3>Pretrained ResNet50</h3>
<b>Rationale:</b> ResNet50, pretrained on ImageNet, leverages learned features for better performance on image classification tasks, especially with limited data.

In [None]:
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D

In [None]:
# Load ResNet50 with ImageNet weights, exclude top layers
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

In [None]:
# Freeze base model layers
base_model.trainable = False

In [None]:
# Build transfer learning model
resnet_model = Sequential([
    base_model,
    GlobalAveragePooling2D(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

In [None]:
# Compile the model
resnet_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model
resnet_history = resnet_model.fit(
    train_generator,
    epochs=1,
    validation_data=valid_generator
)

<h2>Insights:</h2>

<b>Custom CNN:</b> Simple architecture, but may struggle with complex patterns due to limited depth.

<b>ResNet50:</b> Likely to perform better due to pretrained features, but may overfit if not enough data augmentation is used.

<b>Challenges:</b> If the dataset is small, ResNet50 may not generalize well without fine-tuning. We kept the base model frozen for simplicity but could unfreeze layers for better performance.

<h1>Step 4: Results and Comparison</h1>

<b>Objective:</b> Evaluate both models on the test set using accuracy, precision, recall, and F1-score. Visualize performance with confusion matrices, ROC curves, and loss curves.
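
For reference, all four metrics can be derived from the four cells of a binary confusion matrix. The sketch below shows the formulas with wildfire as the positive class; binary_metrics is a hypothetical helper and the counts are made up for illustration, not results from either model.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 80 fires caught, 20 missed, 10 false alarms, 90 correct rejections
m = binary_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

For wildfire detection, recall deserves particular attention, since a missed fire (false negative) is usually more costly than a false alarm.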

In [None]:
# Evaluate models on test set
custom_cnn_results = custom_cnn.evaluate(test_generator)
resnet_results = resnet_model.evaluate(test_generator)

print(f"Custom CNN - Test Loss: {custom_cnn_results[0]:.4f}, Test Accuracy: {custom_cnn_results[1]:.4f}")
print(f"ResNet50 - Test Loss: {resnet_results[0]:.4f}, Test Accuracy: {resnet_results[1]:.4f}")

In [None]:
# Get predictions for confusion matrix and classification report
# ravel() flattens the (N, 1) sigmoid output to match the 1-D label array
custom_cnn_pred = (custom_cnn.predict(test_generator) > 0.5).astype("int32").ravel()
resnet_pred = (resnet_model.predict(test_generator) > 0.5).astype("int32").ravel()
true_labels = test_generator.classes

In [None]:
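# Classification report
# scikit-learn metrics used in this and the following cells
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc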
print("\nCustom CNN Classification Report:")
print(classification_report(true_labels, custom_cnn_pred, target_names=['No Wildfire', 'Wildfire']))

print("\nResNet50 Classification Report:")
print(classification_report(true_labels, resnet_pred, target_names=['No Wildfire', 'Wildfire']))

In [None]:
# Plot confusion matrices
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(confusion_matrix(true_labels, custom_cnn_pred), annot=True, fmt='d', cmap='Blues', ax=ax1)
ax1.set_title('Custom CNN Confusion Matrix')
ax1.set_xlabel('Predicted')
ax1.set_ylabel('True')

sns.heatmap(confusion_matrix(true_labels, resnet_pred), annot=True, fmt='d', cmap='Blues', ax=ax2)
ax2.set_title('ResNet50 Confusion Matrix')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('True')
plt.show()

In [None]:
# Plot ROC curves
custom_cnn_prob = custom_cnn.predict(test_generator)
resnet_prob = resnet_model.predict(test_generator)

custom_fpr, custom_tpr, _ = roc_curve(true_labels, custom_cnn_prob)
resnet_fpr, resnet_tpr, _ = roc_curve(true_labels, resnet_prob)
custom_auc = auc(custom_fpr, custom_tpr)
resnet_auc = auc(resnet_fpr, resnet_tpr)

plt.figure(figsize=(8, 6))
plt.plot(custom_fpr, custom_tpr, label=f'Custom CNN (AUC = {custom_auc:.2f})')
plt.plot(resnet_fpr, resnet_tpr, label=f'ResNet50 (AUC = {resnet_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.show()
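
One way to interpret the AUC values in the legend: AUC equals the probability that a randomly chosen wildfire image receives a higher score than a randomly chosen non-wildfire image, with ties counted as half. The sketch below computes it by direct pair counting. The function pairwise_auc is a hypothetical helper, the labels and scores are toy values, and the quadratic loop is meant for illustration rather than for large test sets.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random ranking), and 1.0 means every wildfire image outscores every non-wildfire image.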

In [None]:
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(custom_cnn_history.history['loss'], label='Custom CNN Train Loss')
plt.plot(custom_cnn_history.history['val_loss'], label='Custom CNN Val Loss')
plt.plot(resnet_history.history['loss'], label='ResNet50 Train Loss')
plt.plot(resnet_history.history['val_loss'], label='ResNet50 Val Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(custom_cnn_history.history['accuracy'], label='Custom CNN Train Acc')
plt.plot(custom_cnn_history.history['val_accuracy'], label='Custom CNN Val Acc')
plt.plot(resnet_history.history['accuracy'], label='ResNet50 Train Acc')
plt.plot(resnet_history.history['val_accuracy'], label='ResNet50 Val Acc')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()