# **Lab 6: Convolutional Neural Networks**
### Authors: Will Lahners, Edward Powers, and Nino Castellano

## **Describing the Data**

## Preparation (3 points total)

In [11]:
import os
import cv2
import numpy as np

def load_dataset(folder_path, target_size=(224, 224)):
    X = []  # List to store image data
    y = []  # List to store labels
    
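    for character_folder in os.listdir(folder_path):
        character_path = os.path.join(folder_path, character_folder)

        # Skip stray files (e.g. .DS_Store) that are not class folders
        if not os.path.isdir(character_path):
            continue

        for image_name in os.listdir(character_path):
            image_path = os.path.join(character_path, image_name)

            # cv2.imread returns None for unreadable or non-image files
            image = cv2.imread(image_path)
            if image is None:
                continue
            image = cv2.resize(image, target_size)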
            
            # Optional: convert to grayscale (left disabled, so images stay 3-channel BGR as loaded by OpenCV)
            # image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            
            # Preprocess the image if needed (normalize, etc.)
            # image = preprocess_image(image)
            
            # Extract the label from the folder name
            label = character_folder
            
            # Append the image and label to the lists
            X.append(image)
            y.append(label)
    
    X = np.array(X)
    y = np.array(y)
    
    return X, y

# Define the folder path where your dataset is located
folder_path = 'one-piece/Data/'

# Load the dataset
X, y = load_dataset(folder_path)

# Print the shape of the loaded data
print("Shape of X (images):", X.shape)
print("Shape of y (labels):", y.shape)


Shape of X (images): (3255, 224, 224, 3)
Shape of y (labels): (3255,)
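
Before settling on a metric or a splitting strategy, it can help to see how many images each class has. The following is a minimal sketch, assuming `y` is the array of folder-name labels returned by `load_dataset`; the helper `class_counts` and the demo labels are hypothetical and only stand in for the real data.

```python
import numpy as np

def class_counts(labels):
    """Return a dict mapping each class label to its number of samples."""
    classes, counts = np.unique(labels, return_counts=True)
    return {str(c): int(n) for c, n in zip(classes, counts)}

# Hypothetical labels standing in for the `y` array returned by load_dataset
y_demo = np.array(["Luffy", "Zoro", "Luffy", "Nami", "Luffy", "Zoro"])

counts = class_counts(y_demo)
# Print classes from most to least frequent, with their share of the dataset
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {n} images ({n / len(y_demo):.1%})")
```

Calling `class_counts(y)` on the real labels would show whether the classes are roughly balanced, which bears on how informative plain accuracy would be.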


> [1.5 points] Choose and explain what metric(s) you will use to evaluate your algorithm's performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.

In [12]:
from sklearn.model_selection import StratifiedKFold

# Define the number of folds for cross-validation
num_folds = 10

# Initialize StratifiedKFold object
stratified_kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)

# Initialize lists to store train and test indices for each fold
train_indices_list = []
test_indices_list = []

# Perform stratified k-fold cross-validation
for train_indices, test_indices in stratified_kfold.split(X, y):
    train_indices_list.append(train_indices)
    test_indices_list.append(test_indices)

# Now train_indices_list and test_indices_list contain the indices for each fold

# Example usage:
for fold_idx, (train_idx, test_idx) in enumerate(zip(train_indices_list, test_indices_list)):
    print(f"Fold {fold_idx + 1}: Train set size: {len(train_idx)}, Test set size: {len(test_idx)}")


Fold 1: Train set size: 2929, Test set size: 326
Fold 2: Train set size: 2929, Test set size: 326
Fold 3: Train set size: 2929, Test set size: 326
Fold 4: Train set size: 2929, Test set size: 326
Fold 5: Train set size: 2929, Test set size: 326
Fold 6: Train set size: 2930, Test set size: 325
Fold 7: Train set size: 2930, Test set size: 325
Fold 8: Train set size: 2930, Test set size: 325
Fold 9: Train set size: 2930, Test set size: 325
Fold 10: Train set size: 2930, Test set size: 325


> [1.5 points] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Convince me that your cross validation method is a realistic mirroring of how an algorithm would be used in practice. 

We will use stratified 10-fold cross-validation because this is a multi-class problem. Stratification ensures that each class is represented proportionally in both the training and testing sets: every fold keeps the same class distribution as the original dataset. This means no fold is missing or over-representing a class, so each fold's score is a fairer estimate of how well the model generalizes to unseen data.
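
The claim that stratification preserves the class distribution can be checked directly. The sketch below uses synthetic, deliberately imbalanced labels (not the lab data) and the same `StratifiedKFold` settings as the cell above; `class_fractions` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, deliberately imbalanced labels (illustration only, not the lab data)
rng = np.random.default_rng(0)
y_demo = rng.choice(["a", "b", "c"], size=300, p=[0.6, 0.3, 0.1])
X_demo = np.zeros((len(y_demo), 1))  # feature values are ignored when splitting

def class_fractions(labels, classes):
    """Fraction of samples belonging to each class, in the order of `classes`."""
    return np.array([(labels == c).mean() for c in classes])

classes = np.unique(y_demo)
overall = class_fractions(y_demo, classes)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
max_gap = 0.0
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold = class_fractions(y_demo[test_idx], classes)
    # Track the largest difference between a fold's class share and the overall share
    max_gap = max(max_gap, float(np.abs(fold - overall).max()))

print("Overall class fractions:",
      {str(c): round(float(f), 3) for c, f in zip(classes, overall)})
print("Largest per-fold deviation:", round(max_gap, 3))
```

A small deviation indicates that every test fold mirrors the overall class mix; running the same check on the real `y` would confirm it for this dataset.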

## Modeling (6 points total)



> [1.5 points] Set up the training to use data expansion in Keras (also called data augmentation). Explain why the chosen data expansion techniques are appropriate for your dataset. You should make use of Keras augmentation layers, like in the class examples.
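
As one possible starting point, the sketch below stacks a few Keras augmentation layers into a `Sequential` block. The choice of transforms and their magnitudes are assumptions for illustration, not settings taken from this lab, and the input is a random batch shaped like the loaded images.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Assumed augmentation choices and magnitudes; tune them for the actual data
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.05),         # up to +/- 5% of a full turn (18 degrees)
        layers.RandomZoom(0.1),              # zoom in or out by up to 10%
        layers.RandomTranslation(0.1, 0.1),  # shift by up to 10% in height and width
    ],
    name="data_augmentation",
)

# Random batch shaped like the loaded images: (batch, 224, 224, 3)
batch = np.random.rand(4, 224, 224, 3).astype("float32")

# training=True is needed here because these layers are inactive at inference time
augmented = data_augmentation(batch, training=True)
print(augmented.shape)
```

Placed as the first layers of a model, such a block would apply fresh random transforms on every training step and pass images through unchanged during evaluation.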

> [2 points] Create a convolutional neural network to use on your data using Keras. Investigate at least two different convolutional network architectures and investigate changing one or more parameters of each architecture such as the number of filters. This means, at a minimum, you will train a total of four models (2 different architectures, with 2 parameters changed in each architecture). Use the method of train/test splitting and evaluation metric that you argued for at the beginning of the lab. Visualize the performance of the training and validation sets per iteration (use the "history" parameter of Keras). Be sure that models converge.

> [1.5 points] Visualize the final results of all the CNNs and interpret/compare the performances. Use proper statistics as appropriate, especially for comparing models. 

> [1 point] Compare the performance of your convolutional network to a standard multi-layer perceptron (MLP) using the receiver operating characteristic and area under the curve. Use proper statistical comparison techniques.
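
For a multi-class problem, one common way to summarize the ROC curve is a macro-averaged one-vs-rest AUC. The sketch below computes it with scikit-learn's `roc_auc_score` for two sets of synthetic predicted probabilities. The two "models" are simulated, not trained networks, and `fake_probabilities` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 4

# Synthetic ground-truth class indices (illustration only, not the lab data)
y_true = rng.integers(0, n_classes, size=n_samples)

def fake_probabilities(signal):
    """Softmax scores that favor the true class by `signal` (larger = better model)."""
    logits = rng.normal(size=(n_samples, n_classes))
    logits[np.arange(n_samples), y_true] += signal
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

probs_a = fake_probabilities(signal=2.0)  # simulated stronger model
probs_b = fake_probabilities(signal=0.5)  # simulated weaker model

# Macro-averaged one-vs-rest AUC over all classes
auc_a = roc_auc_score(y_true, probs_a, multi_class="ovr", average="macro")
auc_b = roc_auc_score(y_true, probs_b, multi_class="ovr", average="macro")
print(f"Macro one-vs-rest AUC, model A: {auc_a:.3f}")
print(f"Macro one-vs-rest AUC, model B: {auc_b:.3f}")
```

With real models, `probs_a` and `probs_b` would be the softmax outputs of the CNN and the MLP on the same test fold. Collecting one AUC per fold for each model would then allow a paired statistical comparison across the ten folds.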

## Exceptional Work (1 points total)