![](../images/data_augmentation.png)

## Introduction

In this notebook, you'll tackle **image classification** using **Convolutional Neural Networks (CNNs)**. As demonstrated in lectures, there's a fundamental principle in deep learning: **deeper networks typically possess greater model capacity**. However, this increased capacity often leads to a higher risk of **overfitting**, as the number of learnable parameters escalates significantly.
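To make the link between depth and parameter count concrete, here is a minimal sketch that counts learnable parameters with the usual formulas for convolutional and dense layers. The helper names and the layer sizes (3×3 kernels, 32/64/128 filters) are assumptions chosen for illustration; they are not the architecture you are asked to build.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per output channel for a standard 2D convolution."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def dense_params(in_features, out_features):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (in_features + 1) * out_features


# Hypothetical layer sizes, used only to show how the count grows with depth.
shallow = conv2d_params(3, 3, 32)
deeper = shallow + conv2d_params(3, 32, 64) + conv2d_params(3, 64, 128)

print(shallow)  # 896
print(deeper)   # 93248
print(dense_params(512, 10))  # 5130
```

Adding two more convolutional layers multiplies the parameter count by roughly a hundred in this toy case, which is why deeper models need more data (or augmentation) to avoid memorizing the training set.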

### The Role of Data Augmentation

In CNNs, **Data Augmentation** emerges as a crucial technique to bolster the model's robustness against overfitting. By applying transformations such as:

- **Rotation** (random angles)
- **Shifting** (horizontal and vertical translation)
- **Flipping** (horizontal/vertical mirroring)
- **Zooming** (scaling in/out)
- **Shearing** (slanting transformations)

Data Augmentation artificially **diversifies the training dataset**, helping the model generalize better to unseen data. The idea predates deep CNNs, but it was popularized by **AlexNet (2012)** and has since become a cornerstone technique in modern computer vision.
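As a rough illustration of what such transformations do to pixel data, the sketch below applies three of them to a tiny 3×3 array with plain NumPy. This is only a toy: `np.roll` wraps pixels around the edge, whereas the generators used later in this notebook fill exposed pixels according to `fill_mode`, and they sample arbitrary angles rather than fixed 90° turns.

```python
import numpy as np

# A tiny 3x3 "image" so the effect of each transform is easy to read.
image = np.arange(9).reshape(3, 3)

flipped = np.fliplr(image)                 # horizontal mirror
shifted = np.roll(image, shift=1, axis=1)  # shift right by one pixel (wraps around)
rotated = np.rot90(image)                  # 90 degree counter-clockwise rotation

print(flipped)
print(shifted)
print(rotated)
```

Each result is a new training example with the same label as the original image, which is how augmentation enlarges the effective dataset without collecting new data.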

### About ImageDataGenerator

`ImageDataGenerator` is a powerful utility class within the Keras library, facilitating:
- **Image preprocessing** (rescaling, normalization)
- **Real-time data augmentation** during model training
- **Efficient batch loading** from disk

It's primarily employed in tasks like image classification, object detection, and image segmentation.

**Directory Structure Requirements:**

To leverage `ImageDataGenerator` effectively, organize your image data into separate directories, each representing a distinct class:

![Directory Structure](https://expoundai.files.wordpress.com/2019/04/directorystructure.png)
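The sketch below mimics how a one-folder-per-class layout can be turned into labeled examples. It assumes class indices are assigned from the subfolder names in sorted order, which matches the alphanumeric ordering documented for `flow_from_directory`. The helper `list_labeled_files` and the class and file names are hypothetical.

```python
import os
import tempfile


def list_labeled_files(root):
    """Return the class names and (relative_path, class_index) pairs under root."""
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    pairs = []
    for index, name in enumerate(class_names):
        for filename in sorted(os.listdir(os.path.join(root, name))):
            pairs.append((os.path.join(name, filename), index))
    return class_names, pairs


# Build a toy layout with hypothetical class and file names (empty files).
demo_root = tempfile.mkdtemp()
for class_name, filenames in {"daisy": ["a.jpg"], "rose": ["b.jpg", "c.jpg"]}.items():
    os.makedirs(os.path.join(demo_root, class_name))
    for filename in filenames:
        open(os.path.join(demo_root, class_name, filename), "w").close()

class_names, pairs = list_labeled_files(demo_root)
print(class_names)
print(pairs)
```

Because the ordering is alphanumeric, folders named `0` to `9` map to labels 0 to 9, but a folder named `10` would sort before `2`.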

### What You'll Do

In this assignment, you'll:

1. **Build a CNN with data augmentation** and evaluate its performance
2. **Train the same model WITHOUT augmentation** to observe overfitting

This comparison will clearly illustrate the effectiveness of data augmentation in mitigating overfitting.

Let's get started!

## Setup: Import Required Libraries

First, import the necessary libraries for image processing, model building, and visualization.

In [None]:
import os
os.environ['KERAS_BACKEND'] = 'torch'

import zipfile
import random
import shutil
import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from shutil import copyfile
import matplotlib.pyplot as plt
import numpy as np

print(f"Keras version: {keras.__version__}")
print(f"Using backend: {keras.backend.backend()}")

### Check Device Availability

Let's verify whether GPU acceleration is available for faster training:

In [None]:
# Check for GPU availability
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print("  Training will be significantly faster on GPU!")
else:
    device = torch.device("cpu")
    print("⚠ No GPU detected - using CPU")
    print("  Training will be slower. Consider using Google Colab or Kaggle Kernel for free GPU access.")

## Step 1: Load the 10-Flowers Dataset

Load the dataset, which contains images of **10 different flower species** organized into subfolders. Each subfolder represents a different flower class (labeled 0-9).

In [None]:

local_zip = '10flows.zip'

if not os.path.exists(local_zip):
    print("Please copy the 10flows.zip file to the current directory")
else:
    # The context manager closes the archive even if extraction fails
    with zipfile.ZipFile(local_zip, 'r') as zip_ref:
        zip_ref.extractall('.')

### Dataset Overview

Let's examine the **class distribution** — the number of images for each flower class:

In [None]:
source_path = 'flowers'

for label in range(10):
    print(f"There are {len(os.listdir(os.path.join(source_path, str(label))))} images of {label}.")

### Visual Exploration

Display **5 random sample images** from each of the 10 flower classes to get a sense of the dataset's diversity:

In [None]:
# show 5 random images from each class in a 10x5 grid
fig = plt.figure(figsize=(10, 10))
for i in range(10):
    for j in range(5):
        img = plt.imread(f'flowers/{i}/{random.choice(os.listdir(f"flowers/{i}"))}')
        fig.add_subplot(10, 5, i*5+j+1)
        plt.axis('off')
        plt.imshow(img)

In [None]:
# print the shape of one randomly chosen image from class 0
img = plt.imread(f'flowers/0/{random.choice(os.listdir("flowers/0"))}')
print(f"Image size: {img.shape}")

## Step 2: Split Dataset into Training, Validation, and Test Sets

We'll organize the data into a directory structure compatible with Keras' `ImageDataGenerator`.

**Split Ratios:**
- **Training**: 70% (for learning patterns)
- **Validation**: 20% (for hyperparameter tuning and monitoring)
- **Test**: 10% (for final evaluation)

You don't need to modify anything in this step — just run the code to create the proper directory structure.
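As a quick worked example of the arithmetic behind these ratios, the sketch below assumes the split sizes are computed by truncating to integers, with the test set receiving whatever is left over. The helper `split_sizes` and the file counts are hypothetical.

```python
def split_sizes(n_files, train_split, val_split):
    """Return (train, val, test) counts; the test set gets the remainder."""
    train_size = int(n_files * train_split)
    val_size = int(n_files * val_split)
    test_size = n_files - train_size - val_size
    return train_size, val_size, test_size


print(split_sizes(100, 0.7, 0.2))  # (70, 20, 10)
print(split_sizes(105, 0.7, 0.2))  # (73, 21, 11)
```

When the class size is not a multiple of 10, truncation rounds the training and validation counts down, so the test set ends up slightly larger than 10%.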

In [None]:
# Define root directory
root_dir = 'sandbox'

# Empty the directory to prevent FileExistsError if this cell is run several times
if os.path.exists(root_dir):
  shutil.rmtree(root_dir)

def create_train_val_dirs(root_path):
  """
  Creates directories for the training, validation, and test sets

  Args:
    root_path (string) - the base directory path to create subdirectories from

  Returns:
    None
  """
  for name in range(10):
    for name2 in ['training','validation','test']:
      os.makedirs(os.path.join(root_path,name2,str(name)))

try:
  create_train_val_dirs(root_path=root_dir)
except FileExistsError:
  print("You should not be seeing this since the upper directory is removed beforehand")

In [None]:
for rootdir, dirs, files in os.walk(root_dir):
    for subdir in dirs:
        print(os.path.join(rootdir, subdir))

In [None]:
def split_data(SOURCE_DIR, TRAINING_DIR, VALIDATION_DIR, TEST_DIR, TRAIN_SPLIT, VAL_SPLIT):
    """
    Splits the data into train, validation, and test sets.

    Args:
        SOURCE_DIR (string): directory path containing the images
        TRAINING_DIR (string): directory path to be used for training
        VALIDATION_DIR (string): directory path to be used for validation
        TEST_DIR (string): directory path to be used for testing
        TRAIN_SPLIT (float): proportion of the dataset to be used for training
        VAL_SPLIT (float): proportion of the dataset to be used for validation
    """

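    random.seed(42)  # seed Python's RNG, which random.shuffle uses (np.random.seed has no effect on it)
    files = []
    for name in sorted(os.listdir(SOURCE_DIR)):  # sort first so the shuffle is reproducible
        if os.path.getsize(os.path.join(SOURCE_DIR, name)) > 0:
            files.append(name)
    random.shuffle(files)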

    train_size = int(len(files) * TRAIN_SPLIT)
    val_size = int(len(files) * VAL_SPLIT)

    for name in files[:train_size]:
        shutil.copy(os.path.join(SOURCE_DIR, name), os.path.join(TRAINING_DIR, name))
    for name in files[train_size:train_size + val_size]:
        shutil.copy(os.path.join(SOURCE_DIR, name), os.path.join(VALIDATION_DIR, name))
    for name in files[train_size + val_size:]:
        shutil.copy(os.path.join(SOURCE_DIR, name), os.path.join(TEST_DIR, name))

### Execute the Data Split

Run the cell below to split the dataset and verify the distribution across training, validation, and test sets:

In [None]:
TRAINING_DIR = "sandbox/training/"
VALIDATION_DIR = "sandbox/validation/"
TEST_DIR = "sandbox/test/"

# Empty directories in case you run this cell multiple times
for i in range(10):
  label_train_dir = os.path.join(TRAINING_DIR, str(i))
  label_val_dir = os.path.join(VALIDATION_DIR, str(i))
  label_test_dir = os.path.join(TEST_DIR, str(i))
  if len(os.listdir(label_train_dir)) > 0:
    for file in os.scandir(label_train_dir):
      os.remove(file.path)
  if len(os.listdir(label_val_dir)) > 0:
    for file in os.scandir(label_val_dir):
      os.remove(file.path)
  if len(os.listdir(label_test_dir)) > 0:
    for file in os.scandir(label_test_dir):
      os.remove(file.path)

# Define the proportion of images used for each split
train_split = 0.7
val_split = 0.2
test_split = 0.1  # not passed to split_data: the test set receives whatever remains

# Run the function
for i in range(10):
  label_train_dir = os.path.join(TRAINING_DIR, str(i))
  label_val_dir = os.path.join(VALIDATION_DIR, str(i))
  label_test_dir = os.path.join(TEST_DIR, str(i))
  split_data(SOURCE_DIR=os.path.join(source_path, str(i)),
             TRAINING_DIR=label_train_dir,
             VALIDATION_DIR=label_val_dir,
             TEST_DIR=label_test_dir,
             TRAIN_SPLIT=train_split,
             VAL_SPLIT=val_split)

# Check that the number of images matches the expected output
for i in range(10):
  print(f"There are {len(os.listdir(os.path.join(TRAINING_DIR, str(i))))} images of {i} in the training set")
  print(f"There are {len(os.listdir(os.path.join(VALIDATION_DIR, str(i))))} images of {i} in the validation set")
  print(f"There are {len(os.listdir(os.path.join(TEST_DIR, str(i))))} images of {i} in the test set")


## Step 3: Create Data Generators with Augmentation

Now that you've organized the data properly, it's time to create **data generators** that will:

1. **Load images in batches** (memory efficient)
2. **Apply real-time augmentation** during training
3. **Standardize image sizes** to a consistent resolution

### Key Parameters:

- **`target_size=(224, 224)`**: Resizes all images to 224×224 pixels (standard for many CNN architectures)
- **`class_mode='sparse'`**: Uses integer labels (0-9) instead of one-hot encoding
- **`rescale=1./255`**: Normalizes pixel values from [0, 255] to [0, 1]
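To see what two of these parameters mean in practice, here is a small NumPy sketch with made-up values. It shows the arithmetic behind `rescale=1./255` and the difference between sparse integer labels and their one-hot equivalents; the pixel values and labels are hypothetical.

```python
import numpy as np

# Hypothetical 8-bit pixel values.
pixels = np.array([0, 51, 255], dtype=np.uint8)
rescaled = pixels * (1.0 / 255)  # same arithmetic as rescale=1./255

# Sparse labels are plain class indices; one-hot labels use one column per class.
sparse_labels = np.array([0, 3, 9])
one_hot = np.eye(10)[sparse_labels]

print(rescaled)
print(sparse_labels.shape, one_hot.shape)
```

Sparse labels carry the same information as the one-hot matrix in a tenth of the space, which is why they are convenient for a 10-class problem.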

### Data Augmentation Transformations (Training Only):

- `rotation_range=40`: Random rotations up to ±40°
- `width_shift_range=0.2`: Random horizontal shifts (20% of width)
- `height_shift_range=0.2`: Random vertical shifts (20% of height)
- `shear_range=0.2`: Random shearing (shear angle of up to 0.2°)
- `zoom_range=0.2`: Random zoom in/out (80%-120%)
- `horizontal_flip=True`: Random horizontal flipping
- `fill_mode='nearest'`: Fills pixels exposed by a transformation with the value of the nearest existing pixel

**Note:** Validation and test sets only receive rescaling (no augmentation) to ensure consistent evaluation.
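To get a feel for the magnitudes involved, the sketch below draws one set of random transformation parameters for a 224×224 image. It assumes each value is sampled uniformly within its range and is an illustration only, not the Keras implementation; the dictionary keys are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width = 224, 224

# One random draw per image; a fresh draw is made every time an image is loaded.
params = {
    "rotation_deg": rng.uniform(-40, 40),
    "shift_x_px": rng.uniform(-0.2, 0.2) * width,   # at most 44.8 px either way
    "shift_y_px": rng.uniform(-0.2, 0.2) * height,
    "zoom_factor": rng.uniform(1 - 0.2, 1 + 0.2),
    "flip_horizontal": bool(rng.random() < 0.5),
}

for name, value in params.items():
    print(f"{name}: {value}")
```

Because the parameters are redrawn on every load, the model rarely sees exactly the same pixels twice across epochs.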

Run the cell below to create the generators:

In [None]:

def train_val_generators(TRAINING_DIR, VALIDATION_DIR, TEST_DIR):
  """
  Creates the training, validation, and test data generators

  Args:
    TRAINING_DIR (string): directory path containing the training images
    VALIDATION_DIR (string): directory path containing the validation images
    TEST_DIR (string): directory path containing the test images

  Returns:
    train_generator, validation_generator, test_generator - tuple containing the generators
  """
  ### START CODE HERE

  # Instantiate the ImageDataGenerator class for the training set
  train_datagen = ImageDataGenerator(rescale=1./255,
                                     rotation_range=40,
                                     width_shift_range=0.2,
                                     height_shift_range=0.2,
                                     shear_range=0.2,
                                     zoom_range=0.2,
                                     horizontal_flip=True,
                                     fill_mode='nearest')
  # only applying the rescale transformation to the validation and test sets
  val_test_datagen = ImageDataGenerator(rescale=1./255)

  # Pass in the appropriate arguments to the flow_from_directory method
  train_generator = train_datagen.flow_from_directory(directory=TRAINING_DIR,
                                                      batch_size=32,
                                                      class_mode='sparse',
                                                      target_size=(224, 224))
  validation_generator = val_test_datagen.flow_from_directory(directory=VALIDATION_DIR,
                                                              batch_size=32,
                                                              class_mode='sparse',
                                                              target_size=(224, 224))
  test_generator = val_test_datagen.flow_from_directory(directory=TEST_DIR,
                                                        batch_size=32,
                                                        class_mode='sparse',
                                                        target_size=(224, 224))
  return train_generator, validation_generator, test_generator


In [None]:
# Test your generators
train_generator, validation_generator, test_generator = train_val_generators(TRAINING_DIR, VALIDATION_DIR, TEST_DIR)

## Task 1: Design Your CNN Architecture

**Objective:** Build a CNN model that achieves strong performance on the flower classification task.

To pass this task, your model should:

✅ **Test Accuracy**: ≥ 80% with `epochs` = 20.  

In [None]:
# GRADED FUNCTION: create_model
def create_model():
  """
  Creates a CNN model for flower classification.

  Architecture:
    - 4 Convolutional blocks with progressive filter increase (32→64→128→128)
    - MaxPooling after each conv block for dimensionality reduction
    - Dense layer with 512 units for high-level feature learning
    - Output layer with 10 units (softmax) for 10-class classification

  Returns:
    Compiled Keras model ready for training
  """

  model = keras.models.Sequential([

    # start building the model here


    # end building the model here
  ])


  # Compile with appropriate loss for multi-class classification


  return model


In [None]:
# create the untrained model


# Display model architecture


# Count parameters




# Train the model


### Visualize Training Progress

Once training has finished, visualize the **training and validation metrics** across epochs.



In [None]:
# Extract training history
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(len(acc))

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Accuracy
ax1.plot(epochs_range, acc, 'r-', linewidth=2, label="Training Accuracy")
ax1.plot(epochs_range, val_acc, 'b-', linewidth=2, label="Validation Accuracy")
ax1.set_title('Training and Validation Accuracy', fontsize=14, fontweight='bold')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, 1])

# Plot 2: Loss
ax2.plot(epochs_range, loss, 'r-', linewidth=2, label="Training Loss")
ax2.plot(epochs_range, val_loss, 'b-', linewidth=2, label="Validation Loss")
ax2.set_title('Training and Validation Loss', fontsize=14, fontweight='bold')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Loss', fontsize=12)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print summary statistics
print(f"\n{'='*60}")
print(f"  Training Summary")
print(f"{'='*60}")
print(f"  Best Training Accuracy:   {max(acc):.4f} (epoch {acc.index(max(acc))+1})")
print(f"  Best Validation Accuracy: {max(val_acc):.4f} (epoch {val_acc.index(max(val_acc))+1})")
print(f"  Final Training Accuracy:  {acc[-1]:.4f}")
print(f"  Final Validation Accuracy:{val_acc[-1]:.4f}")
print(f"  Accuracy Gap (final):     {abs(acc[-1] - val_acc[-1]):.4f}")
print(f"{'='*60}")

In addition to reporting your training and validation accuracy above, please also include your test accuracy below. Finally, comment on whether your model shows signs of overfitting.

In [None]:
# Evaluate on test set


## ⭐ Bonus Challenge: Achieve Test Accuracy Above 85%

This task is **optional** — you may skip it if you prefer.

If your model reaches **> 85% test accuracy**, you will earn **1 bonus point** (added to your total score) at the end!

Feel free to experiment with:

- Different model architectures
- Optimizers & learning rate schedules
- Regularization (dropout, batch norm, early stopping, etc.)
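If you want to try early stopping, the sketch below shows the patience logic in plain Python: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The function name and the loss values are hypothetical and are not results from this notebook. In practice Keras provides `keras.callbacks.EarlyStopping`, which you can pass to `model.fit` through the `callbacks` argument.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: they improve until epoch 4, then get worse.
losses = [1.9, 1.4, 1.1, 1.0, 1.05, 1.2, 1.3, 1.5]
print(early_stopping_epoch(losses, patience=3))  # 7
```

Here the best loss occurs at epoch 4 and training stops at epoch 7, so restoring the weights from the best epoch is usually worth considering as well.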

## Task 2: Experience Overfitting Without Data Augmentation

Simply train the same model architecture **WITHOUT data augmentation** to observe the overfitting phenomenon.

In [None]:
def load_images_and_labels(directory):
    images = []
    labels = []
    # Sort the class folders so that folder "0" maps to label 0, "1" to label 1, and so on
    for label, class_name in enumerate(sorted(os.listdir(directory))):
        class_dir = os.path.join(directory, class_name)
        for image_name in os.listdir(class_dir):
            image = keras.preprocessing.image.load_img(
                os.path.join(class_dir, image_name),
                target_size=(224, 224)
            )
            image = keras.preprocessing.image.img_to_array(image)
            images.append(image)
            labels.append(label)
    return np.array(images), np.array(labels)

In [None]:
# Load images and labels
images, labels = load_images_and_labels('flowers')

In [None]:
from sklearn.model_selection import train_test_split
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(images, labels, test_size=0.1, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Get the untrained model
model2 = create_model()

# IMPORTANT: Normalize the training and test data (divide by 255)
X_train_normalized = X_train / 255.0
X_test_normalized = X_test / 255.0

# Train the model (without data augmentation)
history_2 = model2.fit(
    X_train_normalized,
    y_train,
    epochs=20,
    verbose=1
)

In [None]:
#-----------------------------------------------------------
# Retrieve the per-epoch training accuracy and loss
# (fit was called without validation data, so there are no val_* entries)
#-----------------------------------------------------------
acc=history_2.history['accuracy']
loss=history_2.history['loss']

epochs=range(len(acc)) # Get number of epochs

#------------------------------------------------
# Plot training accuracy per epoch
#------------------------------------------------
plt.plot(epochs, acc, 'r', label="Training Accuracy")
plt.title('Training accuracy')
plt.legend()
plt.show()
print("")

#------------------------------------------------
# Plot training loss per epoch
#------------------------------------------------
plt.plot(epochs, loss, 'r', label="Training Loss")
plt.title('Training loss')
plt.legend()
plt.show()

In [None]:
# Evaluate the model on test set
test_loss, test_accuracy = model2.evaluate(X_test_normalized, y_test, verbose=0)

# Print results
print(f"\n{'='*50}")
print(f"  Final Training Accuracy: {history_2.history['accuracy'][-1]:.4f}")
print(f"  Final Test Accuracy:     {test_accuracy:.4f}")
print(f"  Accuracy Gap:            {abs(history_2.history['accuracy'][-1] - test_accuracy):.4f}")
print(f"{'='*50}")

# Flag overfitting only when training accuracy exceeds test accuracy by a wide margin
if history_2.history['accuracy'][-1] - test_accuracy > 0.15:
    print("\n⚠️  OVERFITTING DETECTED!")
    print("   Training accuracy is much higher than test accuracy.")
    print("   This is why data augmentation is important!")
else:
    print("\n✓ Model generalizes well (minimal overfitting)")

## Task 3: Summarize what you have learned from this notebook below.