## Data Introduction

The data come from the Kaggle competition Dogs vs. Cats Redux: Kernels Edition (Will Cukierski, 2016): https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition

It is a collection of images of cats and dogs. Our goal here is to build a CNN model that can reliably classify whether a given image shows a cat or a dog.

In the training data we have a total of 25,000 images. 

## Data load / Libraries load

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skimage import io
import seaborn as sns
import os
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import confusion_matrix, roc_curve, auc, accuracy_score, recall_score

print('TensorFlow version - ',tf.__version__)
# Check if GPU is available
gpu_available = tf.config.list_physical_devices('GPU')

if gpu_available:
    print("GPU detected: TensorFlow can use it for training.")
else:
    print("No GPU detected: TensorFlow will run on the CPU.")


In [None]:
# training images
# List all image paths
train_paths = [os.path.join(
    'data/cat&dog_data/train', f) for f in os.listdir(
        'data/cat&dog_data/train') if f.endswith('.jpg')]

# label 0 if cat and 1 if dog
labels = [0 if 'cat' in os.path.basename(path) else 1 for path in train_paths]

# Create DataFrame
train_image = pd.DataFrame({'filename': train_paths, 'label': labels})

# shuffle data
train_image = train_image.sample(frac=1, random_state=33).reset_index(drop=True)

# testing images
# List all image paths
test_path = [os.path.join(
    'data/cat&dog_data/test', f) for f in os.listdir(
        'data/cat&dog_data/test') if f.endswith('.jpg')]

# Create DataFrame
test_image = pd.DataFrame({'filename': test_path})
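
The labelling rule used above can be checked in isolation. The sketch below uses hypothetical paths that assume the competition's `cat.N.jpg` / `dog.N.jpg` naming pattern; `label_from_path` and `sample_paths` are illustrative names, not part of the notebook. It also shows why the check runs on the base name: the directory `cat&dog_data` itself contains the string `cat`.

```python
import os

# Hypothetical paths, assuming the cat.N.jpg / dog.N.jpg naming pattern.
sample_paths = [
    "data/cat&dog_data/train/cat.0.jpg",
    "data/cat&dog_data/train/dog.0.jpg",
    "data/cat&dog_data/train/dog.12499.jpg",
]


def label_from_path(path):
    """Return 0 for a cat image and 1 for a dog image, using the file name only."""
    return 0 if "cat" in os.path.basename(path) else 1


sample_labels = [label_from_path(p) for p in sample_paths]
print(sample_labels)  # [0, 1, 1]

# Matching against the full path would label every image as a cat,
# because the folder name "cat&dog_data" contains "cat".
naive_labels = [0 if "cat" in p else 1 for p in sample_paths]
print(naive_labels)  # [0, 0, 0]
```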

### We use ImageDataGenerator to turn the image files into DataFrameIterator batches
### The data are split into 80% training and 20% validation, and pixel values are rescaled by 1/255 to the range [0, 1]
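
As a rough illustration of what the split and the rescaling amount to, here is a plain-Python sketch. It assumes that Keras picks the validation subset as the first `validation_split` fraction of the dataframe rows and the training subset as the rest; that is my understanding of the library's behavior, not something stated in this notebook. `split_bounds` is a hypothetical helper.

```python
def split_bounds(num_files, validation_split, subset):
    """Row range (start, stop) for a subset, under the assumption described above."""
    if subset == "validation":
        lo, hi = 0.0, validation_split
    else:
        lo, hi = validation_split, 1.0
    return int(lo * num_files), int(hi * num_files)


num_images = 25_000
val_start, val_stop = split_bounds(num_images, 0.2, "validation")
train_start, train_stop = split_bounds(num_images, 0.2, "training")
print(val_stop - val_start, train_stop - train_start)  # 5000 20000

# rescale=1./255 multiplies every pixel value, mapping 0..255 onto 0..1
pixels = [0, 128, 255]
scaled = [p * (1.0 / 255) for p in pixels]
print([round(v, 3) for v in scaled])  # [0.0, 0.502, 1.0]
```

Because the subsets are taken by position, the shuffle applied to `train_image` beforehand is what keeps both classes present in each subset.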

In [None]:
# reduce overfitting by randomly rotating and horizontally flipping the training images
train_gen = ImageDataGenerator(rescale = 1./255, rotation_range = 20,
                               horizontal_flip=True,
                               validation_split = 0.2)
val_gen = ImageDataGenerator(rescale = 1./255, validation_split = 0.2)
test_gen = ImageDataGenerator(rescale = 1./255)

df_train = train_gen.flow_from_dataframe(
    dataframe=train_image,
    x_col='filename',
    y_col='label',
    target_size=(150, 150),
    batch_size=128,
    class_mode='raw',
    subset='training',
    seed = 33
)

df_val = val_gen.flow_from_dataframe(
    dataframe=train_image,
    x_col='filename',
    y_col='label',
    target_size=(150, 150),
    batch_size = 128,
    class_mode='raw',
    subset='validation',
    seed = 33,
    shuffle = False
)

df_test = test_gen.flow_from_dataframe(
    dataframe=test_image,
    x_col='filename',
    y_col=None,  # No labels
    target_size=(150, 150),
    batch_size=128,
    class_mode=None,  # Important: class_mode=None skips labels
    shuffle=False      # Preserve order for submission
)

### Checking Data Dimension

In [None]:
# inspect df_train
x_batch, y_batch = next(df_train)  # Get a batch of data from df_train
print(x_batch.shape)
print(y_batch.shape)
print(np.min(x_batch), np.max(x_batch))
print(np.unique(y_batch))  # See if labels are correct (0 and 1)

# inspect df_val
x_val_batch, y_val_batch = next(df_val)
print(x_val_batch.shape)
print(y_val_batch.shape)
print(np.min(x_val_batch), np.max(x_val_batch))
print(np.unique(y_val_batch))

## Data Presentation

In [None]:
# sample images
# plt.subplots creates its own figure, so no separate plt.figure() call is needed
f, axarr = plt.subplots(2, 2)

axarr[0,0].imshow(io.imread(train_image['filename'][20000]))
axarr[0,1].imshow(io.imread(train_image['filename'][20887]))
axarr[1,0].imshow(io.imread(train_image['filename'][1902]))
axarr[1,1].imshow(io.imread(train_image['filename'][4719]))

## Exploratory Data Analysis

In [None]:
train_image.shape

In [None]:
# label distribution
sns.histplot(train_image['label'])
plt.title('Distribution of Labels in Training Data')
plt.show()

### Since this is image data and all images are used to train the model, there is not much EDA to be done.


## Overview of findings and next steps

We are interested in testing these parameters in our models:
- Model Depth - Increasing the number of convolutional layers from 2 to 3.
- Batch Size - Comparing a batch size of 128 versus 64.
- Regularization - Incorporating L2 regularization (with a factor of 0.001) to assess its effect on overfitting.

All models maintained a consistent dropout rate of 0.4 in the dense layer and used the Adam optimizer with default learning rates (0.001).
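
One concrete effect of the batch-size comparison is the number of weight updates per epoch. The sketch below assumes 20,000 training images (80% of the 25,000 labelled images); `steps_per_epoch` is an illustrative helper, not a function used in the notebook.

```python
import math

train_images = 20_000  # assumption: 80% of the 25,000 labelled images


def steps_per_epoch(num_images, batch_size):
    """Number of batches (and therefore optimizer updates) in one pass over the data."""
    return math.ceil(num_images / batch_size)


for batch_size in (128, 64):
    steps = steps_per_epoch(train_images, batch_size)
    print(f"batch_size={batch_size}: {steps} updates per epoch")
# batch_size=128: 157 updates per epoch
# batch_size=64: 313 updates per epoch
```

Halving the batch size roughly doubles the number of updates per epoch, at the cost of noisier gradient estimates per update.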

## Model 1 Architecture

### 2 Conv2D layers (32 and 64 filters), Dropout = 0.4 after the Dense layer, Adam optimizer with the default learning rate of 0.001, batch size = 128

In [None]:
model_1 = models.Sequential([
    keras.Input(shape=(150,150,3)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),

    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),


    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(1, activation='sigmoid')
])

model_1.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model_1.summary()
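
The total that `model_1.summary()` reports can be cross-checked by hand. The sketch below is a back-of-the-envelope calculation, assuming 3x3 kernels with the default 'valid' padding, 2x2 pooling, and four parameters per channel in each BatchNormalization layer (two of them non-trainable); `count_params` is a hypothetical helper written for this check.

```python
def count_params(input_shape, conv_filters, dense_units, kernel=3, pool=2):
    """Return (flattened size, total parameter count) for a Conv/Pool/BatchNorm stack."""
    height, width, channels = input_shape
    total = 0
    for filters in conv_filters:
        # Conv2D: one kernel per (input channel, filter) pair plus one bias per filter
        total += kernel * kernel * channels * filters + filters
        # 'valid' padding shrinks each side by kernel - 1
        height, width = height - kernel + 1, width - kernel + 1
        # MaxPooling2D divides each side by the pool size (rounding down)
        height, width = height // pool, width // pool
        # BatchNormalization: gamma, beta, moving mean, moving variance per channel
        total += 4 * filters
        channels = filters
    flat = height * width * channels
    total += flat * dense_units + dense_units  # hidden Dense layer
    total += dense_units + 1                   # single sigmoid output unit
    return flat, total


flat_1, total_1 = count_params((150, 150, 3), [32, 64], 128)
print(flat_1, total_1)  # 82944 10636865
```

Almost all of the parameters sit in the first Dense layer, because the flattened feature map is still 36 x 36 x 64 values wide. Adding a third convolution and pooling stage shrinks that map, so a deeper variant can end up with fewer parameters overall.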

In [None]:
callbacks = [
    EarlyStopping(patience=3, restore_best_weights=True),
    ModelCheckpoint('model_1.keras', save_best_only=True)
]

print('start model 1 training')
start_time = time.time()

history = model_1.fit(
    df_train,
    epochs=10,
    validation_data=df_val,
    callbacks=callbacks
)

end_time = time.time()
print(f"Model 1 training time: {(end_time - start_time)/60:.2f} minutes")
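
To make the `EarlyStopping(patience=3, restore_best_weights=True)` setting more tangible, here is a small simulation of the stopping rule on made-up validation losses. It is a simplified sketch (it ignores options such as `min_delta` and `baseline`), and `simulate_early_stopping` is a hypothetical helper, not a Keras function.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch) as 0-based indices.

    stopped_epoch is None when training runs through every epoch.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember this epoch and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


made_up_losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.48, 0.44]  # illustrative values only
stopped, best = simulate_early_stopping(made_up_losses)
print(stopped, best)  # 5 2
```

In this toy run training stops after the sixth epoch (index 5), and with `restore_best_weights=True` the weights from the third epoch (index 2) would be kept. The later, lower loss of 0.44 is never reached.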

In [None]:
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.show()

### After 10 epochs, the validation accuracy is catching up with the training accuracy

## Model 2 Architecture

### 3 Conv2D layers (32, 64, and 128 filters), Dropout = 0.4 after the Dense layer, Adam optimizer with the default learning rate of 0.001, batch size = 128

In [None]:
model_2 = models.Sequential([
    keras.Input(shape=(150,150,3)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),

    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),

    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),

    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(1, activation='sigmoid')
])

model_2.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model_2.summary()

In [None]:
callbacks = [
    EarlyStopping(patience=3, restore_best_weights=True),
    ModelCheckpoint('model_2.keras', save_best_only=True)
]

print('start model 2 training')
start_time = time.time()

history = model_2.fit(
    df_train,
    epochs=10,
    validation_data=df_val,
    callbacks=callbacks
)

end_time = time.time()
print(f"Model 2 training time: {(end_time - start_time)/60:.2f} minutes")

In [None]:
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.show()

### After 10 epochs, the validation accuracy curve shows signs of overfitting

## Model 3 Architecture

### 3 Conv2D layers (32, 64, and 128 filters) with L2 regularization = 0.001, Dropout = 0.4 after the Dense layer, Adam optimizer with the default learning rate of 0.001, batch size = 64

In [None]:
df_train = train_gen.flow_from_dataframe(
    dataframe=train_image,
    x_col='filename',
    y_col='label',
    target_size=(150, 150),
    batch_size=64,
    class_mode='raw',
    subset='training',
    seed = 33
)

df_val = val_gen.flow_from_dataframe(
    dataframe=train_image,
    x_col='filename',
    y_col='label',
    target_size=(150, 150),
    batch_size = 64,
    class_mode='raw',
    subset='validation',
    seed = 33,
    shuffle = False
)

In [None]:
model_3 = models.Sequential([
    keras.Input(shape=(150,150,3)),
    layers.Conv2D(32, (3, 3), activation='relu',
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),

    layers.Conv2D(64, (3, 3), activation='relu',
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),

    layers.Conv2D(128, (3, 3), activation='relu',
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),

    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(1, activation='sigmoid')
])

model_3.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model_3.summary()
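
For intuition about `regularizers.l2(0.001)`: the penalty it contributes for one layer is the factor times the sum of squared kernel weights. The sketch below computes that term for a made-up three-weight kernel; `l2_penalty` and `toy_kernel` are illustrative names only.

```python
def l2_penalty(weights, factor=0.001):
    """L2 term for one layer: factor * sum of squared weights."""
    return factor * sum(w * w for w in weights)


toy_kernel = [0.5, -1.0, 2.0]  # made-up weights
penalty = l2_penalty(toy_kernel)  # 0.001 * (0.25 + 1.0 + 4.0)
print(round(penalty, 5))  # 0.00525
```

Large weights are penalized quadratically, which nudges the convolution kernels toward smaller values during training.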

In [None]:
callbacks = [
    EarlyStopping(patience=3, restore_best_weights=True),
    ModelCheckpoint('model_3.keras', save_best_only=True)
]

print('start model 3 training')
start_time = time.time()

history = model_3.fit(
    df_train,
    epochs=10,
    validation_data=df_val,
    callbacks=callbacks
)

end_time = time.time()
print(f"Model 3 training time: {(end_time - start_time)/60:.2f} minutes")


In [None]:
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.show()

### After 10 epochs, the validation accuracy curve shows signs of overfitting

## Model Evaluation

### Model 1

In [None]:
model_1 = tf.keras.models.load_model('model_1.keras')

model_1.summary()

In [None]:
model_1.evaluate(df_val)

y_pred_prob = model_1.predict(df_val).ravel()  # predicted probabilities
y_pred = (y_pred_prob > 0.5).astype(int)

y_true = df_val.labels

# Compute the confusion matrix
cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix for Model 1")
plt.show()

print(f'The accuracy score of Model 1 is: {accuracy_score(y_true, y_pred)}')
print(f'The recall score of Model 1 is: {recall_score(y_true, y_pred)}')
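
The numbers printed above all follow from thresholding the predicted probabilities at 0.5. As a sanity check on how they relate, here is a pure-Python sketch on made-up labels and probabilities; `threshold_metrics` is a hypothetical helper, and the matrix layout (rows = true label, columns = predicted label) follows the scikit-learn convention.

```python
def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Return (confusion matrix, accuracy, recall) for binary labels 0/1."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    matrix = [[tn, fp], [fn, tp]]        # rows: true 0/1, columns: predicted 0/1
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn)              # share of actual dogs (label 1) that were found
    return matrix, accuracy, recall


toy_true = [0, 0, 1, 1, 1, 0]               # made-up labels
toy_prob = [0.2, 0.7, 0.9, 0.4, 0.6, 0.1]   # made-up sigmoid outputs
matrix, accuracy, recall = threshold_metrics(toy_true, toy_prob)
print(matrix)  # [[2, 1], [1, 2]]
print(round(accuracy, 3), round(recall, 3))  # 0.667 0.667
```

Since dogs are labelled 1, recall here measures how many dog images are recognized as dogs.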

In [None]:
# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_true, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (area = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) for model 1")
plt.legend(loc="lower right")
plt.show()
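
One way to read the AUC value in the legend: it equals the probability that a randomly chosen dog image receives a higher score than a randomly chosen cat image (ties count as half). The sketch below computes that pairwise version on made-up scores; it is an illustration of the interpretation, not a replacement for `roc_curve` and `auc`, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


toy_true = [0, 0, 1, 1]              # made-up labels
toy_scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted probabilities
print(pairwise_auc(toy_true, toy_scores))  # 0.75
```

The quadratic loop is fine for a toy example but too slow for large validation sets, which is one reason to use the scikit-learn functions in practice.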

### Model 2

In [None]:
# Load the saved model from the .keras file
model_2 = tf.keras.models.load_model('model_2.keras')

# Optionally, display the model architecture
model_2.summary()

In [None]:
model_2.evaluate(df_val)

y_pred_prob = model_2.predict(df_val).ravel()  # predicted probabilities
y_pred = (y_pred_prob > 0.5).astype(int)

y_true = df_val.labels

# Compute the confusion matrix
cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix for Model 2")
plt.show()

print(f'The accuracy score of Model 2 is: {accuracy_score(y_true, y_pred)}')
print(f'The recall score of Model 2 is: {recall_score(y_true, y_pred)}')

In [None]:
# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_true, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (area = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) for model 2")
plt.legend(loc="lower right")
plt.show()

### Model 3

In [None]:
# Load the saved model from the .keras file
model_3 = tf.keras.models.load_model('model_3.keras')

# Optionally, display the model architecture
model_3.summary()

In [None]:
model_3.evaluate(df_val)

y_pred_prob = model_3.predict(df_val).ravel()  # predicted probabilities
y_pred = (y_pred_prob > 0.5).astype(int)

y_true = df_val.labels

# Compute the confusion matrix
cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix for Model 3")
plt.show()

print(f'The accuracy score of Model 3 is: {accuracy_score(y_true, y_pred)}')
print(f'The recall score of Model 3 is: {recall_score(y_true, y_pred)}')

In [None]:
# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_true, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (area = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) for model 3")
plt.legend(loc="lower right")
plt.show()

## Model Overview

model 1
- batch size = 128
- Conv2D layers = 2
- learning_rate = 0.001 (default)
- dropout = 0.4 at Dense
- runtime = 39 min
- Eval = loss: 0.3988 - accuracy: 0.8198

model 2
- batch size = 128
- Conv2D layers = 3
- learning_rate = 0.001 (default)
- dropout = 0.4 at Dense
- runtime = 49 min
- Eval = loss: 0.3568 - accuracy: 0.8540

model 3
- batch size = 64
- Conv2D layers = 3
- learning_rate = 0.001 (default)
- dropout = 0.4 at Dense
- l2 regularization = 0.001
- wider network than model 1 (Dense layer of 256 units instead of 128)
- runtime = 51 min
- Eval = loss: 0.4198 - accuracy: 0.8532

## Submission

In [None]:
model_1 = tf.keras.models.load_model('model_1.keras')

pred_probs = model_1.predict(df_test).ravel()

# Convert filename (e.g., "1.jpg") to integer image id (1)
test_image['image id'] = test_image['filename'].apply(lambda x: int(os.path.splitext(os.path.basename(x))[0]))

submission_df = pd.DataFrame({
    'id': test_image['image id'],
    'label': pred_probs
})


submission_df.sort_values('id', inplace=True)
# use a per-model file name so the three submissions do not overwrite each other
submission_df.to_csv('~/Desktop/submission_model_1.csv', index=False)
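
The id extraction and sorting step can be illustrated without pandas. The sketch below uses hypothetical test paths and made-up probabilities; `image_id` is an illustrative helper. It shows why the file name is converted to an integer before sorting: directory listings come back in arbitrary order, and sorting the names as strings would place `10.jpg` before `2.jpg`.

```python
import os


def image_id(path):
    """Turn a path such as '.../test/10.jpg' into the integer id 10."""
    return int(os.path.splitext(os.path.basename(path))[0])


# Hypothetical listing order and made-up predictions, aligned by position.
listed_paths = [
    "data/cat&dog_data/test/10.jpg",
    "data/cat&dog_data/test/2.jpg",
    "data/cat&dog_data/test/1.jpg",
]
made_up_probs = [0.55, 0.08, 0.91]

rows = sorted(zip((image_id(p) for p in listed_paths), made_up_probs))
print(rows)  # [(1, 0.91), (2, 0.08), (10, 0.55)]
```

Each probability stays attached to its own id because the pairs are built before sorting, which mirrors building the DataFrame first and calling `sort_values('id')` afterwards.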

In [None]:
model_2 = tf.keras.models.load_model('model_2.keras')

pred_probs = model_2.predict(df_test).ravel()

# Convert filename (e.g., "1.jpg") to integer image id (1)
test_image['image id'] = test_image['filename'].apply(lambda x: int(os.path.splitext(os.path.basename(x))[0]))

submission_df = pd.DataFrame({
    'id': test_image['image id'],
    'label': pred_probs
})


submission_df.sort_values('id', inplace=True)
# use a per-model file name so the three submissions do not overwrite each other
submission_df.to_csv('~/Desktop/submission_model_2.csv', index=False)

In [None]:
model_3 = tf.keras.models.load_model('model_3.keras')

pred_probs = model_3.predict(df_test).ravel()

# Convert filename (e.g., "1.jpg") to integer image id (1)
test_image['image id'] = test_image['filename'].apply(lambda x: int(os.path.splitext(os.path.basename(x))[0]))

submission_df = pd.DataFrame({
    'id': test_image['image id'],
    'label': pred_probs
})


submission_df.sort_values('id', inplace=True)
# use a per-model file name so the three submissions do not overwrite each other
submission_df.to_csv('~/Desktop/submission_model_3.csv', index=False)

## Reference

- Will Cukierski. 2016. Dogs vs. Cats Redux: Kernels Edition. Kaggle. https://kaggle.com/competitions/dogs-vs-cats-redux-kernels-edition
- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. https://www.tensorflow.org/