# Hands-On Convolutional Neural Network

Coral reefs are among the most diverse and valuable ecosystems on Earth, providing habitat for 25% of all marine species and supporting the livelihoods of over half a billion people worldwide. However, these ecosystems face unprecedented threats from climate change, ocean acidification, and other human activities, with many species now endangered.

We are given a dataset containing images of three different coral species:
 - *Acropora cervicornis* (Staghorn Coral)
 - *Colpophyllia natans* (Boulder Brain Coral)
 - *Montastraea cavernosa* (Greater Star Coral)

Our task is to build a Convolutional Neural Network (CNN) that can classify the coral images into the correct species. This technology can help automate coral reef monitoring efforts and support conservation initiatives by enabling rapid, large-scale species identification.


## Part 1: Building a CNN Model from Scratch
### Step 0. Check GPU Availability and TensorFlow Version

Before training deep learning models, it's important to check whether TensorFlow can access a GPU. Training on a GPU is significantly faster than on a CPU, especially for large image datasets.

If you've followed the setup instructions in the [GitHub README](https://github.com/kbeavers/coral-species-CNN-tutorial/blob/main/README.md), and you've run the `install_kernel.sh` script on **Frontera**, you should now be running this notebook inside a containerized Jupyter kernel that includes:
- TensorFlow with GPU support
- CUDA libraries compatible with the system
- All required Python packages pre-installed

This cell will confirm that your environment is correctly configured. (Tip: make sure you change your kernel to `tf-cuda101`.)
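A minimal sketch of such a check, assuming a TensorFlow 2.x installation where `tf.config.list_physical_devices` is available:

```python
import tensorflow as tf

# Report the TensorFlow version and any GPUs that TensorFlow can see.
print("TensorFlow version:", tf.__version__)

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    for gpu in gpus:
        print("Found GPU:", gpu.name)
else:
    print("No GPU detected; training will fall back to the CPU.")
```

An empty list usually means the wrong kernel is selected or the CUDA libraries are not visible to TensorFlow.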

### Step 1. Data Loading and Organization
In this step, we load all coral images from the dataset directory and organize them into a DataFrame. Each image is assigned a label based on the name of the directory it's stored in (i.e., 'ACER' - *Acropora cervicornis*, 'CNAT' - *Colpophyllia natans*, 'MCAV' - *Montastraea cavernosa*).

This DataFrame will serve as the foundation for splitting our data into training, validation, and test sets later in the tutorial.

#### 1.1. List Dataset Directory Contents
Before loading the images, we first want to inspect the directory structure to make sure everything is in the right place. The code below lists the contents of the `coral-species` data directory to verify that the subdirectories for each coral species are present and correctly named:
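One way to do this with the standard library is sketched here. The `DATA_DIR` location and the helper name `list_class_dirs` are assumptions for illustration; point the path at wherever your copy of the dataset lives.

```python
from pathlib import Path

# Hypothetical location of the dataset; adjust to your own setup.
DATA_DIR = Path("coral-species")


def list_class_dirs(data_dir):
    """Return the sorted names of the subdirectories inside data_dir."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return []  # missing directory: nothing to list
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())


print(list_class_dirs(DATA_DIR))
```

For this dataset we would expect to see the three species directories `ACER`, `CNAT`, and `MCAV`.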

#### 1.2. Check File Extensions
Next, we scan the dataset directory and all its subdirectories to find out what types of image files are present. This helps us catch unexpected or unsupported file types (e.g., GIFs, txt files, etc.), which could cause problems later when loading images.

This also allows us to see if the images are all in the same format or not.
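A rough sketch of this scan, using only the standard library. The helper name `count_extensions` and the `coral-species` path are illustrative assumptions.

```python
from collections import Counter
from pathlib import Path


def count_extensions(data_dir):
    """Count files under data_dir (recursively) by lowercase extension."""
    counts = Counter()
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return counts  # missing directory: return an empty tally
    for path in data_dir.rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


# Hypothetical dataset path; prints nothing if the directory is absent.
for ext, n in count_extensions("coral-species").most_common():
    print(f"{ext or '(no extension)'}: {n}")
```

Lowercasing the suffix means `.JPG` and `.jpg` are counted together, which is usually what we want when checking formats.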

#### 1.3. Explore Image Dimensions and Color Modes
Before feeding images into a CNN, it's important to understand the basic properties of the dataset. In this step, we examine the **dimensions** (width x height) as well as the **color mode** (e.g., RGB, RGBA, grayscale) of each image. This helps us decide if we need to resize or convert images before we begin training our CNN.

The script below prints a summary and gives recommendations if inconsistencies are found.
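A simplified version of such a script might look like the following. It assumes Pillow is installed and that the images use a `.jpg` extension; the helper name `summarize_images` is hypothetical, and the recommendation logic is left out.

```python
from collections import Counter
from pathlib import Path

from PIL import Image


def summarize_images(data_dir, pattern="*.jpg"):
    """Tally image sizes (width, height) and color modes under data_dir."""
    sizes, modes = Counter(), Counter()
    for path in sorted(Path(data_dir).rglob(pattern)):
        with Image.open(path) as img:
            sizes[img.size] += 1   # (width, height) in pixels
            modes[img.mode] += 1   # e.g. "RGB", "RGBA", "L" (grayscale)
    return sizes, modes


# Hypothetical dataset path; only runs if the directory exists.
if Path("coral-species").is_dir():
    sizes, modes = summarize_images("coral-species")
    print(f"{sum(sizes.values())} images, {len(sizes)} distinct sizes")
    print("Color modes:", dict(modes))
```

More than one distinct size means the images need resizing before training, and more than one color mode means they need converting to a common mode.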

Our dataset analysis reveals some important characteristics that we'll need to keep in mind as we proceed with the tutorial:
 1. **Image Size Variation**: We have 451 total images in our dataset, with 88 different image sizes (dimensions). Also notice that some images are in portrait orientation (height > width) while others are landscape (width > height). CNNs expect all images to have the same dimensions, so we'll need to resize them to a standard size before training our model. 
 2. **Color Mode**: All images share the same color mode. Great!

We will address this again in Step 4 when we prepare our data for input into the CNN. 

#### 1.4 Check for Corrupted Images

Before continuing, we want to make sure that all image files are readable. Corrupted files can break your model training or cause unexpected errors during preprocessing.

In this step, we:
1. Attempt to open each '.jpg' file using PIL
2. Discard any files that fail to load

This ensures we only keep clean, valid images for training.

If there are any corrupted images in your dataset, this code will automatically remove them.
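The check itself can be sketched as below, assuming Pillow is installed. Unlike the cell described above, this hypothetical helper (`find_corrupted_images`) only reports unreadable files rather than deleting them, so you can review the list first.

```python
from pathlib import Path

from PIL import Image


def find_corrupted_images(data_dir, pattern="*.jpg"):
    """Return the paths of image files that PIL cannot open and verify."""
    corrupted = []
    for path in sorted(Path(data_dir).rglob(pattern)):
        try:
            with Image.open(path) as img:
                img.verify()  # integrity check without decoding all pixels
        except Exception:
            corrupted.append(path)
    return corrupted


# Hypothetical dataset path; only runs if the directory exists.
if Path("coral-species").is_dir():
    for bad_path in find_corrupted_images("coral-species"):
        print("Unreadable image:", bad_path)
```

To actually discard the files, call `bad_path.unlink()` on each entry once you are happy with the list.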

#### 1.5. Create a DataFrame of Image Paths and Labels
Now that we have taken a peek at the format of our data and have removed any corrupted images, we can start setting up our data for training. In this step, we build a `pandas.DataFrame` that organizes all the image data into two columns:
 1. **filepath**: The full path to each image file
 2. **label**: The class label for each image, taken from the directory name

This structured DataFrame is essential for training with Keras' `flow_from_dataframe` method that we'll use later in this tutorial.
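A minimal sketch of building this DataFrame, assuming each image sits directly inside its class directory and uses a `.jpg` extension (the helper name `build_image_dataframe` is hypothetical):

```python
from pathlib import Path

import pandas as pd


def build_image_dataframe(data_dir, pattern="*.jpg"):
    """Build a DataFrame with one row per image: its path and class label."""
    records = [
        # The label is the name of the directory that contains the image.
        {"filepath": str(path), "label": path.parent.name}
        for path in sorted(Path(data_dir).rglob(pattern))
    ]
    return pd.DataFrame(records, columns=["filepath", "label"])


# Hypothetical dataset path; yields an empty DataFrame if it is absent.
df = build_image_dataframe("coral-species")
print(df.head())
```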

### Step 2. Visualize the Data
#### 2.1 Visualize the Class Distribution
Before we start training, it's important to understand how many images we have for each class (in this case, coral species).

In this step we:
 1. Count how many images belong to each class
 2. Plot the class distribution as a pie chart and bar graph

If the dataset is imbalanced (i.e., some classes have far more images than others), we may need to account for this later using **class weights** or **data augmentation**. 
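The counting part of this step can be sketched with pandas as follows. The labels here are made-up stand-ins for `df["label"]`, so the numbers are for illustration only and do not reflect the real dataset.

```python
import pandas as pd

# Made-up labels standing in for df["label"]; real counts will differ.
labels = pd.Series(["ACER"] * 7 + ["MCAV"] * 7 + ["CNAT"] * 6, name="label")

counts = labels.value_counts()
percent = (labels.value_counts(normalize=True) * 100).round(1)

# Combine counts and percentages into one summary table.
summary = pd.DataFrame({"count": counts, "percent": percent})
print(summary)
```

The same `counts` Series can be passed to a plotting library to draw the pie chart and bar graph.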

**Thought Challenge**: Describe the class distribution in your own words. How much of the dataset is made up by the largest class? The smallest class? Is there anything that we need to address before continuing?

**Answer**:

Our dataset contains three coral species: ACER (*Acropora cervicornis*), MCAV (*Montastraea cavernosa*), and CNAT (*Colpophyllia natans*). The class distribution is well balanced, with the largest class (ACER) making up ~35% of the dataset and the smallest class (CNAT) making up ~31% of the dataset. 

Because all three classes are similarly represented, we do not need to apply class weighting or balancing techniques during training. 

#### 2.2 Visualize Images from the Dataset
It's helpful to look at a few images from each class to get a better understanding of the dataset. This will give us a better sense of:
 - What each coral species looks like
 - How much visual variation exists within each class (e.g., different angles, lighting, etc.)
 - Whether the dataset includes noise, blur, or other artifacts

We'll display a grid of randomly selected images, grouped by class.

**Thought Challenge**: Try changing the `random.seed` value a few times to view different images from our dataset. What do you notice? Take a moment to write down your observations.

*Remember: the quality of a machine learning model is decided largely by the quality of the dataset it was trained on!*

### Step 3: Split the Dataset and Handle Class Imbalance

#### 3.1. Split the Dataset into Training, Validation and Test Sets
We are now ready to split our labelled image dataset into three parts:

 - **Training Set**: Used to train the model
 - **Validation Set**: Used to tune hyperparameters and monitor training
 - **Test Set**: Used to evaluate the final model's performance after training is complete

We use `train_test_split()` from scikit-learn in two stages:

 1. First, we split the original dataset into **training + test** sets
 2. Then, we split the training set again into **training + validation**.

This ensures that our CNN *never sees the test set* during training, which is important for obtaining an unbiased estimate of the model's performance. 

To preserve the class distribution across all splits, we use `stratify=df["label"]` to ensure each subset reflects the original proportions of each class. This is called **stratified sampling**.
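The two-stage stratified split might look like this. The split fractions, the random seed, and the helper name `split_dataframe` are assumptions, and the toy DataFrame uses made-up file names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataframe(df, test_size=0.2, val_size=0.2, seed=42):
    """Split df into train, validation, and test sets, stratified by label."""
    # Stage 1: hold out the test set.
    train_val_df, test_df = train_test_split(
        df, test_size=test_size, stratify=df["label"], random_state=seed
    )
    # Stage 2: carve a validation set out of the remaining data.
    train_df, val_df = train_test_split(
        train_val_df,
        test_size=val_size,
        stratify=train_val_df["label"],
        random_state=seed,
    )
    return train_df, val_df, test_df


# Toy DataFrame with made-up file names, 30 images per class.
class_names = ("ACER", "CNAT", "MCAV")
toy_df = pd.DataFrame({
    "filepath": [f"{c}/img_{i}.jpg" for c in class_names for i in range(30)],
    "label": [c for c in class_names for _ in range(30)],
})
train_df, val_df, test_df = split_dataframe(toy_df)
print(len(train_df), len(val_df), len(test_df))
```

Note that the second split stratifies on the labels of the remaining data, not the original DataFrame, because the two must have the same length.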

#### 3.2. Compute Class Weights

If our dataset is imbalanced (i.e., some classes have many more images than others), the model may learn to favor those majority classes. To address this, we can compute **class weights** based on the training data using the `compute_class_weight` function from scikit-learn.

These weights:
 - Assign higher importance to underrepresented classes
 - Are passed into `model.fit()` using the `class_weight` argument
 - Adjust how the loss is calculated during training

This technique helps the model give balanced attention to all classes during training.

While our dataset is quite balanced, we provide the code for computing class weights below:
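A sketch of the computation on a deliberately imbalanced set of made-up labels follows. With `class_weight="balanced"`, scikit-learn assigns each class the weight `n_samples / (n_classes * class_count)`. Mapping the weights to integer indices in alphabetical order is an assumption here; in practice, check it against the generator's `class_indices`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up, deliberately imbalanced training labels for illustration.
train_labels = np.array(["ACER"] * 50 + ["CNAT"] * 30 + ["MCAV"] * 20)

classes = np.unique(train_labels)  # sorted: ACER, CNAT, MCAV
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=train_labels
)

# Keras expects integer class indices as keys. We assume the indices
# follow the same alphabetical order as `classes`.
class_weight_dict = {i: float(w) for i, w in enumerate(weights)}
print(dict(zip(classes, weights)))
```

The rarest class (`MCAV` in this toy example) receives the largest weight, so its errors contribute more to the loss.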

### Step 4: Image Preprocessing and Data Generators
As we discovered in Step 1.3, we need to prepare our images before feeding them into the CNN. This step involves two key concepts:

**a. Data Generators**

Data generators are special tools that help us efficiently load and preprocess image data in small batches (instead of all at once). Keras provides a built-in data generator called `ImageDataGenerator` that can:
 - Resize all images to a consistent size
 - Normalize pixel values (e.g., from [0-255] to [0-1])
 - Augment the training data with random transformations to improve generalization

Data generators can be used with Keras model methods like `fit()`, `evaluate()`, and `predict()`, which is particularly useful when dealing with large datasets that don't all fit into memory at once. 

**b. Data Augmentation**

Data augmentation is a powerful technique that helps our model learn more robust features by creating variations of our training images. Augmentation techniques not only expand the size of our training set, but also help prevent overfitting by exposing our model to different variations of our images. 

Conveniently, `ImageDataGenerator` also provides a number of built-in augmentation techniques that we can use to augment our training data, such as:
 - Random rotations
 - Zooming in or out
 - Shifting the image left or right
 - Flipping the image horizontally

Each of these modifications creates a new, slightly different version of our training images, helping our model learn to recognize the same features in different orientations.

#### 4.1 Define Image Preprocessing and Augmentation
We will define three separate `ImageDataGenerator` objects, one for each dataset split (train, val, test):
 - `train_datagen` will apply both normalization and augmentation to the training data
 - `val_datagen` and `test_datagen` will only apply normalization (no augmentation)
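The three generators could be defined along these lines. The specific augmentation ranges are illustrative assumptions, not the tutorial's tuned values.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training data: normalize pixels to [0, 1] and apply random augmentation.
# The ranges below are assumed values chosen for illustration.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,       # random rotations of up to 20 degrees
    zoom_range=0.2,          # random zoom in or out by up to 20%
    width_shift_range=0.1,   # random horizontal shifts of up to 10%
    horizontal_flip=True,    # random left-right flips
)

# Validation and test data: normalization only, no augmentation.
val_datagen = ImageDataGenerator(rescale=1.0 / 255)
test_datagen = ImageDataGenerator(rescale=1.0 / 255)
```

Keeping augmentation out of the validation and test generators means those splits measure performance on unmodified images.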

#### 4.2 Load Images Using `flow_from_dataframe()`
Now that our preprocessing methods are defined, we can use `flow_from_dataframe()` to load images in batches directly from our labeled DataFrames (`train_df`, `val_df`, and `test_df`).

All generators return batches of preprocessed image tensors and their corresponding labels.
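A small wrapper around `flow_from_dataframe()` is sketched below. The helper name `make_generator`, the 224×224 target size, and the batch size of 32 are assumptions; the column names match the DataFrame built in Step 1.5.

```python
def make_generator(datagen, df, shuffle, target_size=(224, 224), batch_size=32):
    """Create a batch generator from a DataFrame of file paths and labels."""
    return datagen.flow_from_dataframe(
        dataframe=df,
        x_col="filepath",          # column holding the image paths
        y_col="label",             # column holding the class names
        target_size=target_size,   # every image is resized to this size
        batch_size=batch_size,
        class_mode="categorical",  # one-hot encoded labels
        shuffle=shuffle,
    )


# Typical usage, assuming the generators and DataFrames defined earlier:
# train_generator = make_generator(train_datagen, train_df, shuffle=True)
# val_generator = make_generator(val_datagen, val_df, shuffle=False)
# test_generator = make_generator(test_datagen, test_df, shuffle=False)
```

Turning shuffling off for the validation and test generators keeps predictions in the same order as the DataFrame rows, which makes later evaluation simpler.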

#### Sanity Check: Inspect a Batch from the Training Generator

Let's inspect the output of the `train_generator` to make sure it's working as expected.

In the code below, we:
 - Retrieve one batch of images and labels from the training generator
 - Check the shape of the batch
 - Display a few image-label pairs to confirm the generator is working

#### Visualize a Few Images from the Training Generator

Let's display a few images from the training generator along with their decoded class labels.

**Thought Challenge**: Look carefully at the images displayed above. Try running the code cell multiple times and changing the code to display images from the validation and test generators. What do you notice about the images that you didn't see before (in Step 2.2)? Do you notice any differences in the images each time you run the cell? Think about why this might be happening.

### Step 5. Define Your CNN Architecture

Congratulations! Our data is now ready to be used to train a Convolutional Neural Network (CNN) to classify our coral images. 

In this step, we will define the architecture of our CNN model. Below, we define a model that consists of three main parts:

1. **Convolutional Blocks** (Feature Extraction):
   - Block 1: 32 filters (3×3), followed by AveragePooling
   - Block 2: 64 filters (3×3), followed by AveragePooling
   - Block 3: 128 filters (3×3), followed by AveragePooling

   Each block increases the number of filters, allowing the network to detect more complex patterns.

2. **Flatten Layer**:
   - Converts the 3D feature maps to a 1D vector for the dense layers

3. **Dense Layers** (Classification):
   - First dense layer: 128 perceptrons
   - Second dense layer: 64 perceptrons
   - Output layer: 3 perceptrons (one for each coral species)

Once you have filled in the blanks and defined your model, let's compile it:
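One possible way to fill in the architecture described above and compile it is sketched here. The 224×224 input size, ReLU activations, 2×2 pooling windows, softmax output, and categorical cross-entropy loss are all assumptions, and `build_cnn` is a hypothetical helper name.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import RMSprop


def build_cnn(input_shape=(224, 224, 3), num_classes=3):
    """Three convolutional blocks followed by a dense classifier."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        # Convolutional blocks (feature extraction)
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.AveragePooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.AveragePooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.AveragePooling2D((2, 2)),
        # Flatten the 3D feature maps to a 1D vector
        layers.Flatten(),
        # Dense layers (classification)
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])


cnn_model = build_cnn()
cnn_model.compile(
    optimizer=RMSprop(learning_rate=1e-4),
    loss="categorical_crossentropy",  # assumed; matches one-hot labels
    metrics=["accuracy"],
)
```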

In the code above, we use the `RMSprop` optimizer, which adapts the learning rate based on recent gradients and is a popular choice for image classification tasks. We set its initial learning rate to `1e-4`.

*Note: while these are good starting choices, you may want to experiment with different optimizers or learning rates based on your model's performance*

Finally, let's display our model architecture and parameter count:

### Step 6: Train the CNN Model

Now that our CNN architecture is defined, we can train it using the `fit()` method.

During training, the model will learn patterns in the training data and adjust its parameters to minimize the loss function. After each epoch, the model's performance is evaluated on the validation set.

Here, we also pass in `class_weight` to demonstrate how to handle imbalanced data. 

We also track the training history, which we’ll use later to visualize performance over time.

```python
cnn_history = cnn_model.fit(
    train_generator,
    validation_data=val_generator,
    epochs=15,
    class_weight=class_weight_dict,  # computed in Step 3.2
)
```

#### Visualize Training History

After training the model, we can visualize the accuracy and loss over time to better understand how the model is learning. These plots can help us identify overfitting, underfitting, or confirm that the model is learning as expected.

We can use the `cnn_history` object returned by the `fit()` method to plot the training and validation accuracy and loss:
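A plotting helper along these lines would work with `cnn_history.history`. It assumes matplotlib is installed and that the model was compiled with `metrics=["accuracy"]`, so the history contains the keys `accuracy`, `val_accuracy`, `loss`, and `val_loss`. The name `plot_history` is hypothetical.

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Plot training and validation accuracy and loss side by side."""
    epochs = range(1, len(history["loss"]) + 1)
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 4))

    ax_acc.plot(epochs, history["accuracy"], label="Training")
    ax_acc.plot(epochs, history["val_accuracy"], label="Validation")
    ax_acc.set_title("Accuracy")
    ax_acc.set_xlabel("Epoch")
    ax_acc.legend()

    ax_loss.plot(epochs, history["loss"], label="Training")
    ax_loss.plot(epochs, history["val_loss"], label="Validation")
    ax_loss.set_title("Loss")
    ax_loss.set_xlabel("Epoch")
    ax_loss.legend()

    fig.tight_layout()
    return fig


# Typical usage after training:
# plot_history(cnn_history.history)
# plt.show()
```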

The plots above show the training and validation accuracy/loss over 15 epochs. 

**Thought Challenge**: What do you notice about the training and validation accuracy and loss? What does this tell you about the model's learning performance (i.e. overfitting, underfitting, or healthy learning)? 

### Step 7: Evaluate the Model on Test Set

Now that we've trained our model, it's time to evaluate its performance on the test set. This step is crucial because it helps us understand how well the model generalizes to new, unseen data, which is a good indicator of its real-world performance.

#### Evaluate Test Accuracy and Loss

We use `model.evaluate()` to calculate the test accuracy and loss. These metrics give us a quick overview of the model's performance.

Our model correctly classifies the test images about 45% of the time, and our loss is still quite high. While these numbers provide a snapshot of performance, they don't tell the whole story. Let's dig deeper with a confusion matrix.

#### Visualize Predictions with a Confusion Matrix

A confusion matrix provides a detailed breakdown of the model's predictions compared to the true labels. It helps identify which classes are being confused with each other.
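To make the idea concrete, here is a small sketch using scikit-learn. The probabilities are made up and stand in for the output of `model.predict()`; they are not results from our model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ["ACER", "CNAT", "MCAV"]

# Made-up true labels and softmax outputs for six test images.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.6, 0.1, 0.3],
])

# The predicted class is the index of the highest probability.
y_pred = y_prob.argmax(axis=1)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

Correct predictions fall on the diagonal; each off-diagonal entry counts images of the row's class that were predicted as the column's class.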

#### Detailed Performance with a Classification Report

The classification report provides precision, recall, and F1-scores for each class, offering a more nuanced view of model performance.
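As a rough illustration, the report can be generated with scikit-learn's `classification_report`. The labels below are made up and are not our model's predictions.

```python
import numpy as np
from sklearn.metrics import classification_report

class_names = ["ACER", "CNAT", "MCAV"]

# Made-up true and predicted class indices for six test images.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Human-readable text report.
print(classification_report(y_true, y_pred, target_names=class_names))

# The same numbers as a nested dict, handy for programmatic checks.
report = classification_report(
    y_true, y_pred, target_names=class_names, output_dict=True
)
```

Precision answers "of the images predicted as this class, how many were right?", while recall answers "of the images that truly belong to this class, how many did we find?".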


**Thought Challenge**: Critically assess the performance of our model based on the accuracy/loss values, confusion matrix, and classification report. Are there any classes that the model is particularly good or bad at predicting? Think about the data and why the model might be performing better or worse for certain classes.

## Part 2: Transfer Learning with VGG19

In this section, we apply a technique called **transfer learning** to improve model performance on our coral classification task.

**Transfer learning** is a deep learning technique where we *reuse a model that has already been trained on a large dataset for a different but related task*. Instead of starting from scratch, we "transfer" the knowledge learned by the pre-trained model to our new task.

This is especially useful when you have a limited dataset, you want to train a model faster, or you want to achieve better accuracy with less computational effort.

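We use the **VGG19 model**, a classic convolutional neural network developed by researchers at Oxford University. It was trained on the **ImageNet** dataset, which contains over 14 million images in total; the pre-trained weights come from the ImageNet challenge subset of roughly 1.2 million training images across 1,000 classes.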

### Step 1: Prepare Data for VGG19

#### 1.1 Define Image Preprocessing and Augmentation
VGG19 expects input images to be preprocessed in a very specific way because of the way it was trained. We use the `preprocess_input()` function from `tensorflow.keras.applications.vgg19` to preprocess our images. Specifically, this function converts RGB pixel values to the format VGG19 was originally trained on (i.e., channels in BGR order, zero-centered with respect to ImageNet).

Let's create new data generators for VGG19 using `ImageDataGenerator` with:
 - `preprocess_input` for normalization
 - Augmentation on the training set
 - No augmentation on the validation and test sets
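A sketch of these generators follows, with a quick check of what `preprocess_input` does to a dummy image. The augmentation ranges and the `vgg_*` variable names are assumptions.

```python
import numpy as np
from tensorflow.keras.applications.vgg19 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training data: VGG19 preprocessing plus augmentation (assumed ranges).
vgg_train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=20,
    zoom_range=0.2,
    width_shift_range=0.1,
    horizontal_flip=True,
)

# Validation and test data: VGG19 preprocessing only.
vgg_val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
vgg_test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

# Check on a black RGB image: channels are reordered to BGR and the
# per-channel ImageNet means are subtracted, so zeros become negative.
dummy = np.zeros((1, 4, 4, 3), dtype="float32")
processed = preprocess_input(dummy.copy())
print(processed[0, 0, 0])
```

Because `preprocess_input` handles the scaling itself, these generators do not use `rescale=1.0 / 255`.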

#### 1.2 Load Images Using `flow_from_dataframe()`

Just like we did for our CNN model, we can use `flow_from_dataframe()` to load images in batches directly from our labeled DataFrames (`train_df`, `val_df`, and `test_df`).

### Step 2: Define and Train the VGG19 Model
#### 2.1 Load VGG19 Base Model and Stack a Custom Classifier

We now load the **VGG19 base model**, which has been pre-trained on ImageNet. We exclude the original classification head (`include_top=False`) and freeze all convolutional layers for now.

Next, we stack a **custom classifier** on top using Keras’ `Sequential` API:
- Flatten the output of VGG19’s last convolutional layer
- Add the same fully connected (dense) layers that we had in our original CNN built from scratch
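A sketch of this model is shown below. The 224×224 input size and the activations are assumptions, and `build_vgg19_classifier` is a hypothetical helper. Passing `weights="imagenet"` downloads the pre-trained weights on first use; `weights=None` builds the same architecture with random weights, which is only useful for checking shapes.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19


def build_vgg19_classifier(input_shape=(224, 224, 3), num_classes=3,
                           weights="imagenet"):
    """Frozen VGG19 base with a small dense classifier on top."""
    base = VGG19(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # freeze all convolutional layers

    return models.Sequential([
        layers.Input(shape=input_shape),
        base,
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])


# Typical usage (downloads the ImageNet weights the first time):
# vgg_model = build_vgg19_classifier()
```

With the base frozen, only the three dense layers are updated during training.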

Now, let's compile the model with the same optimizer and loss function as our previous model.

#### 2.2 Define Training Callbacks
Next, let's define some **training callbacks**. Callbacks are functions executed during training that allow the training process to change its behavior dynamically.

Some common callbacks include:
- **EarlyStopping**: This callback stops training when a monitored metric (e.g., validation accuracy) stops improving. It helps prevent overfitting by halting training once the model's performance plateaus.
- **ReduceLROnPlateau**: This callback reduces the learning rate when a monitored metric (e.g., validation loss) stops improving. By lowering the learning rate, the model can converge to a better local minimum (preventing it from getting stuck in a suboptimal solution).
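These two callbacks might be configured as follows. The monitored metric, patience values, reduction factor, and minimum learning rate are illustrative assumptions.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Stop when validation loss has not improved for 5 epochs, then restore
# the weights from the best epoch seen so far.
early_stopping = EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

# Halve the learning rate when validation loss stalls for 2 epochs.
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=2, min_lr=1e-6
)

callbacks = [early_stopping, reduce_lr]
```

The list is passed to training through the `callbacks` argument of `fit()`. Giving `ReduceLROnPlateau` a shorter patience than `EarlyStopping` lets the learning rate drop at least once before training is stopped.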

#### Visualizing Training History
Just like we did for our first CNN model, let's plot the training and validation performance over time. 

Refer back to Part 1: Step 6 – *Visualize Training History* for a refresher on how to do this.

**Thought Challenge**: Compare the performance of our VGG19 model to our previous CNN model. What are some major differences in the training curves?

### Step 3: Evaluate the VGG19 Model on the Test Set

Just like we did for our first CNN model, let's evaluate the VGG19 model on the test set.

#### Evaluate Test Accuracy and Loss
First, let's calculate the test accuracy and loss. Can you recall how to do this?

Our model correctly classifies the test images about 84% of the time. What an improvement!

#### Visualize Predictions with a Confusion Matrix
Now, let's visualize the predictions of our VGG19 model on the test set with a confusion matrix.

Refer back to Part 1: Step 7 – *Visualize Predictions with a Confusion Matrix* for a refresher on how to do this.

Notice how the confusion matrix shows a distinct diagonal pattern, where the true and predicted labels are the same more often than not? This indicates that our model is performing well on all classes. Nice!

#### Detailed Performance with a Classification Report
Finally, let's print out the full classification report.

**Thought Challenge**: Compare the performance of our VGG19 model to our previous CNN model. What are some major differences in the classification report? Are there still any problematic classes that the model is struggling with? If so, what do you think is causing this?

### Step 4: Visualize Predictions from the Test Set
Let's display a few test images along with their predicted labels, true labels, and the model's confidence scores.

This helps visually confirm whether predictions make sense – and helps identify patterns in misclassifications.

## Final Thoughts and Wrap-Up

Congratulations! You've now built and evaluated two deep learning models for image classification:

1. **A CNN from scratch**  
   - Gave you hands-on experience building a model layer-by-layer
   - Showed the challenges of training with limited data (e.g. overfitting)

2. **A Transfer Learning model using VGG19**  
   - Leveraged features learned from millions of images (ImageNet)
   - Achieved higher accuracy and better generalization by training only a small classifier on top of frozen, pre-trained layers

---

### Next Steps

If you're interested in improving this model further, here are some ideas to help push toward **perfect classification accuracy** on this dataset:

- **Fine-tune VGG19**: Unfreeze some of the deeper convolutional layers and retrain the model to better adapt to your specific dataset.
- **Explore Other Architectures**: Experiment with different pre-trained models like ResNet or Inception to compare their performance with VGG19.
- **Enhance Data Augmentation**: Implement more aggressive data augmentation techniques such as color jitter, brightness shifts, cropping, and noise addition to increase model robustness.
- **Improve Image Quality**: Apply image cleaning or filtering techniques to enhance the quality of your dataset.
- **Optimize Model Architecture**: Consider adding Batch Normalization, Dropout, or other regularization techniques to improve model generalization.

---

### Contribute to This Tutorial!

We encourage you to share your improvements and insights with the community. If you develop a model that surpasses our current implementation, we'd love to see it!

Here's how you can contribute:

- **Fork the Repository**: Create your own copy of the repository to work on.
- **Enhance and Document**: Add your new model architecture, results, and any notes or observations.
- **Submit a Pull Request**: Share your improvements by submitting a pull request to contribute to this tutorial.

Let's see what you can build!