### **Load and Preprocess the Data**

In [5]:
import numpy as np
import pandas as pd
from tensorflow.keras.datasets import mnist

# Load the MNIST dataset

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Normalize the data

X_train = X_train / 255.0
X_test = X_test / 255.0

X_train = X_train.reshape(-1, 784)
X_test = X_test.reshape(-1, 784)

# Convert labels to categorical one-hot encoding

def to_categorical(labels, num_classes):
    return np.eye(num_classes)[labels]

y_train_cat = to_categorical(y_train, num_classes=10)
y_test_cat = to_categorical(y_test, num_classes=10)

### **What we did:**
    
**Loading the MNIST Dataset:**
    
The MNIST dataset is widely used for benchmarking image classification algorithms.
We used mnist.load_data() to load the dataset into training and testing sets.

**Normalizing the Data:**

Normalizing the data helps in faster convergence of neural networks by scaling the pixel values to [0, 1].
We divided each pixel value in X_train and X_test by 255.0.

**Flattening the Images:**

Fully connected layers take each input as a 1D vector rather than a 2D image.
We reshaped each image in X_train and X_test from (28, 28) to (784,), so the arrays go from shape (N, 28, 28) to (N, 784).

**Converting Labels to Categorical One-Hot Encoding:**

One-hot encoding puts each class label in the same shape as the network’s 10-node output layer, so the prediction and the target can be compared element by element.
We created a to_categorical function that converts integer labels to a binary matrix with one row per label, then applied it to y_train and y_test.
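
The three preprocessing steps can be checked on a tiny synthetic batch without downloading MNIST. The following is a minimal sketch: the arrays fake_images and fake_labels are made up for illustration and are not part of the original notebook.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 3 random "images" of 28x28 uint8 pixels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 0, 9])

# Same steps as in the cell above: scale to [0, 1], flatten, one-hot encode.
scaled = fake_images / 255.0
flat = scaled.reshape(-1, 784)
one_hot = np.eye(10)[fake_labels]

print(flat.shape)     # (3, 784)
print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # 1.0 at index 3, zeros elsewhere
```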

### **What happens if we change the variables?**

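**Normalization:** Leaving the pixels in [0, 255], or dividing by a different factor, changes the scale of the activations and gradients. With unscaled inputs the same learning rate usually produces much larger updates, which tends to slow or destabilize training.

**Flattening:** If the images are not reshaped to 784 values each, the input no longer matches the 784 rows of the first weight matrix and the matrix multiplication fails with a shape error.

**One-hot Encoding:** The network defined below computes its output error as output - y, which requires y to have one column per class. Integer labels would not work with this implementation. Some frameworks accept integer labels directly through a sparse loss (Keras, for example, offers sparse_categorical_crossentropy), so one-hot encoding is a requirement of this code rather than of multi-class classification in general.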

### Part A: Why are 784 nodes appropriate for the input layer?

**Explanation:**

**784 Input Features:** Each MNIST image is 28x28 pixels, which equals 784 pixels when flattened.

**Input Layer Configuration:** Neural networks require input data to be in a 1D vector format for fully connected layers.

### What happens if we change the input layer size?

**Less than 784:** The network would lose some pixel information, leading to worse performance.

**More than 784:** This would be unnecessary for the MNIST dataset and could lead to incorrect input dimensions.

### Part B: Why are 10 nodes appropriate for the output layer? Are there other options?

**Explanation:**

**10 Output Classes:** The MNIST dataset has 10 classes (digits 0 through 9).

**Output Layer Configuration:** Using 10 output nodes corresponds to the 10 possible digit classes, with softmax activation to output probabilities.

### What happens if we change the output layer size?

**Less than 10:** The network would not be able to classify all 10 digits.

**More than 10:** This would be unnecessary and would not match the number of classes.
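
To make the role of the 10 output nodes concrete, the sketch below applies a softmax to one made-up vector of 10 logits, one per digit. The logit values are invented for illustration. The softmax uses the same max-subtraction trick as the network class defined later in this notebook.

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Made-up logits for a single image: one score per digit 0-9.
logits = np.array([[0.1, 0.2, 0.0, 2.5, 0.3, 0.1, 0.0, 0.4, 0.2, 0.1]])
probs = softmax(logits)

print(probs.shape)               # (1, 10)
print(probs.sum(axis=1))         # sums to 1 up to floating-point error
print(np.argmax(probs, axis=1))  # [3], the index of the largest logit
```

The predicted digit is the index of the largest probability, which is why the number of output nodes has to equal the number of classes.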

### **Define the Neural Network**

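Define a SimpleNN class implemented with NumPy, plus a function create_model that takes hidden_layer_size and learning_rate as parameters and returns a SimpleNN instance. The class mirrors the Keras method names (compile, fit, evaluate), but compile is a no-op and no Keras model is involved.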

In [6]:
class SimpleNN:
    def __init__(self, input_dim, hidden_layer_size, output_dim, learning_rate):
        self.weights1 = np.random.randn(input_dim, hidden_layer_size) * 0.01 # Weights and bias from input layer to hidden layer.
        self.bias1 = np.zeros(hidden_layer_size)
        self.weights2 = np.random.randn(hidden_layer_size, output_dim) * 0.01 # Weights and bias from hidden layer to output layer.
        self.bias2 = np.zeros(output_dim)
        self.learning_rate = learning_rate

    def relu(self, x):
        return np.maximum(0, x)

    def softmax(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    def forward(self, x):
        self.z1 = np.dot(x, self.weights1) + self.bias1
        self.a1 = self.relu(self.z1)
        self.z2 = np.dot(self.a1, self.weights2) + self.bias2
        self.a2 = self.softmax(self.z2)
        return self.a2

    def backward(self, x, y, output):
        m = y.shape[0]
        dz2 = output - y #Calculating error
        dw2 = np.dot(self.a1.T, dz2) / m #Calculating gradient of the weights by doing matrix multiplication
        db2 = np.sum(dz2, axis=0) / m #Calculating gradient of the biases by summing the error in the output layer & dividing by number of rows.

        dz1 = np.dot(dz2, self.weights2.T) * (self.a1 > 0) #Calculating error in the hidden layer
        dw1 = np.dot(x.T, dz1) / m #Calculating gradient of the weights
        db1 = np.sum(dz1, axis=0) / m #Calculating gradient of the biases

        self.weights2 -= self.learning_rate * dw2 #Updating weights subtracting the gradient multiplied by the learning rate.
        self.bias2 -= self.learning_rate * db2 #Updating biases
        self.weights1 -= self.learning_rate * dw1
        self.bias1 -= self.learning_rate * db1

    def compile(self):
        pass  

    def fit(self, x, y, epochs, batch_size): #performing mini-batch gradient descent
        for epoch in range(epochs):
            indices = np.arange(x.shape[0])
            np.random.shuffle(indices) #random shuffle so that the model doesn't learn the order of the data and we can generalize better.
            x = x[indices]
            y = y[indices]

            for i in range(0, x.shape[0], batch_size):
                x_batch = x[i:i + batch_size]
                y_batch = y[i:i + batch_size]
                output = self.forward(x_batch)
                self.backward(x_batch, y_batch, output)

    def evaluate(self, x, y):
        output = self.forward(x)
        predictions = np.argmax(output, axis=1)
        targets = np.argmax(y, axis=1)
        accuracy = np.mean(predictions == targets)
        return accuracy

# Define a function to create the model

def create_model(hidden_layer_size, learning_rate):
    input_dim = 784
    output_dim = 10
    return SimpleNN(input_dim, hidden_layer_size, output_dim, learning_rate)

### **What we did:**
    
**Initialization:** 

We initialize a neural network with one hidden layer, specifying input, hidden, and output dimensions, and setting small random weights and zero biases.

**Activation Functions:** 

We use ReLU for the hidden layer and softmax for the output layer to handle non-linearity and produce probability distributions.

**Forward Pass:** 

We compute activations by performing matrix multiplications and applying activation functions.

**Backward Pass:** 

We calculate gradients and update weights and biases using gradient descent.

**Training:** 

We shuffle the data, split it into mini-batches, and iteratively update the model parameters through forward and backward passes over multiple epochs.

**Evaluation:**

We assess model performance by calculating the accuracy on the test data.
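
The backward pass starts from dz2 = output - y. That expression is the gradient of the cross-entropy loss with respect to the logits when the output activation is softmax. The notebook never names its loss, so cross-entropy is an assumption here, chosen because it is consistent with that line. The sketch below compares the analytic gradient against central finite differences on made-up logits and targets.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def cross_entropy(z, y):
    # Mean cross-entropy over the batch; y is one-hot. (Assumed loss.)
    p = softmax(z)
    return -np.sum(y * np.log(p)) / z.shape[0]

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 10))            # made-up logits for 4 samples
y = np.eye(10)[rng.integers(0, 10, 4)]  # made-up one-hot targets

# Analytic gradient of the mean loss w.r.t. the logits: (output - y) / m.
analytic = (softmax(z) - y) / z.shape[0]

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[i, j] += eps
        z_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # should be very small if the analytic gradient is right
```

The same kind of check can be extended to the weight gradients if the backward pass is ever modified.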

### What happens if we change the variables?

**hidden_layer_size:** More hidden units can capture more complex patterns but may lead to overfitting.
    
**learning_rate:** Adjusting the learning rate affects training speed and convergence.
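
The effect of the learning rate is easiest to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2, not on the MNIST network, so the specific rates are illustrative only and do not transfer to the model above.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    # Minimise f(w) = w**2, whose gradient is 2 * w. The minimum is at w = 0.
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

# Small rates converge slowly, moderate rates converge quickly,
# and a rate above 1.0 makes each step overshoot further than the last.
for lr in (0.01, 0.1, 0.5, 1.1):
    print(lr, gradient_descent(lr))
```

Each step multiplies w by (1 - 2 * lr), so the iterate shrinks when that factor has magnitude below 1 and grows when it exceeds 1.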

###  Part C: Run a backpropagation algorithm on a smaller subset

Use a subset of the training data to find reasonable values for the hidden layer size and learning rate.

Train and evaluate the model on the subset.

In [10]:
# We use a smaller subset for initial experiments

np.random.seed(42)
subset_indices = np.random.choice(X_train.shape[0], int(0.1 * X_train.shape[0]), replace=False) #Random 10% of the dataset.
X_train_subset = X_train[subset_indices]
y_train_subset = y_train_cat[subset_indices]

# Experiment with initial hidden layer size and learning rate
print("Shape of the training subset: ", X_train_subset.shape)
initial_hidden_layer_size = 50
initial_learning_rate = 0.17

# Create the model

model = create_model(initial_hidden_layer_size, initial_learning_rate)

# Train the model

def train_with_validation(model, X, y, epochs, batch_size, validation_split):
    split_index = int((1 - validation_split) * X.shape[0])
    X_train, X_val = X[:split_index], X[split_index:]
    y_train, y_val = y[:split_index], y[split_index:]

    for epoch in range(epochs):
        model.fit(X_train, y_train, epochs=1, batch_size=batch_size)
        val_accuracy = model.evaluate(X_val, y_val)
        print(f'Epoch {epoch+1}/{epochs}, Validation Accuracy: {val_accuracy}')
        #Mini-batch gradient descent

train_with_validation(model, X_train_subset, y_train_subset, epochs=10, batch_size=100, validation_split=0.2)

# Evaluate the model on the test set

accuracy = model.evaluate(X_test, y_test_cat)
print(f'Subset Training Accuracy: {accuracy}')

Shape of the training subset:  (6000, 784)
Epoch 1/10, Validation Accuracy: 0.6558333333333334
Epoch 2/10, Validation Accuracy: 0.825
Epoch 3/10, Validation Accuracy: 0.8808333333333334
Epoch 4/10, Validation Accuracy: 0.8941666666666667
Epoch 5/10, Validation Accuracy: 0.8966666666666666
Epoch 6/10, Validation Accuracy: 0.9008333333333334
Epoch 7/10, Validation Accuracy: 0.9133333333333333
Epoch 8/10, Validation Accuracy: 0.9125
Epoch 9/10, Validation Accuracy: 0.9166666666666666
Epoch 10/10, Validation Accuracy: 0.9141666666666667
Subset Training Accuracy: 0.9051


### **What we did:**
    
**Using a Smaller Subset for Initial Experiments:** 

We selected 10% of the training data to quickly iterate and experiment with different hyperparameters.

**Experimenting with Initial Hidden Layer Size and Learning Rate:** 

We chose a hidden layer size of 50 and a learning rate of 0.17 to start our experiments.

**Creating the Model:** 

We instantiated the model using the create_model function with the chosen hidden layer size and learning rate.

**Training the Model with Validation:** 

We split the subset data into training and validation sets, trained the model for 10 epochs, and printed the validation accuracy after each epoch to monitor performance.

**Evaluating the Model on the Test Set:** 

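We evaluated the model on the test set to measure its accuracy and generalization ability. The cell prints this value under the label "Subset Training Accuracy", but it is the test-set accuracy of the model trained on the subset.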

### What happens if we change the variables?

**hidden_layer_size:** Different sizes can capture different levels of complexity in the data.

**learning_rate:** Adjusting this affects how quickly the model converges.

**Conclusions:** Validation accuracy rises from 65.58% after the first epoch to 91.42% after the tenth. Most of the gain comes in the first three epochs, and later epochs fluctuate slightly (for example, 91.33% at epoch 7 and 91.25% at epoch 8). The test-set accuracy is 90.51%, about one percentage point below the final validation accuracy, so there is no strong sign of overfitting at this stage. The model saw only 4,800 training images (80% of the 6,000-image subset), so training on more data is the most obvious next step.
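
Part C asks for reasonable values of the hidden layer size and learning rate, which suggests trying several combinations rather than one. The sketch below shows one way to organise such a search. The grid_search helper, the score_fn callable, and the toy_score function are all hypothetical and do not appear in the notebook.

```python
from itertools import product

def grid_search(score_fn, hidden_sizes, learning_rates):
    """Return (best_score, best_params, all_results).

    score_fn is a hypothetical callable (hidden_layer_size, learning_rate) -> float,
    for example one that trains on the subset and returns validation accuracy.
    """
    results = []
    for h, lr in product(hidden_sizes, learning_rates):
        params = {"hidden_layer_size": h, "learning_rate": lr}
        results.append((score_fn(h, lr), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_score, best_params, results

# Toy scoring function standing in for "train, then evaluate on validation data".
def toy_score(h, lr):
    return -abs(h - 50) / 100 - abs(lr - 0.1)

best_score, best_params, _ = grid_search(toy_score, [25, 50, 100], [0.01, 0.1, 0.5])
print(best_params)
```

In this notebook, score_fn could wrap create_model, fit, and evaluate on a held-out validation split. Note that train_with_validation only prints the validation accuracy, so it would need to return that value to be used this way.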

### Part D: Run the backpropagation algorithm on the entire dataset

### **Train on the Entire Dataset**

Train the model with the chosen parameters on the entire dataset.

Evaluate the model on the test set.


In [8]:
print("Shape of the training dataset: ", X_train.shape)

hidden_layer_size = 50
learning_rate = 0.17

# Create the model with the parameters above
model = create_model(hidden_layer_size, learning_rate)

#The function to train the model has been specified in Part C
train_with_validation(model, X_train, y_train_cat, epochs=10, batch_size=100, validation_split=0.2)

# Evaluate the model on the test set

accuracy = model.evaluate(X_test, y_test_cat)
print(f'Full Training Accuracy: {accuracy}')

Shape of the training dataset:  (60000, 784)
Epoch 1/10, Validation Accuracy: 0.9180833333333334
Epoch 2/10, Validation Accuracy: 0.9369166666666666
Epoch 3/10, Validation Accuracy: 0.9454166666666667
Epoch 4/10, Validation Accuracy: 0.9524166666666667
Epoch 5/10, Validation Accuracy: 0.954
Epoch 6/10, Validation Accuracy: 0.9564166666666667
Epoch 7/10, Validation Accuracy: 0.9606666666666667
Epoch 8/10, Validation Accuracy: 0.9611666666666666
Epoch 9/10, Validation Accuracy: 0.9645833333333333
Epoch 10/10, Validation Accuracy: 0.9649166666666666
Full Training Accuracy: 0.9673


**What we did:**
    
**Hidden Layer Size:** 

We kept the hidden layer size of 50 from Part C.

**Learning Rate:** 

We kept the learning rate of 0.17 from Part C.

**Creating the Model:** 

We instantiated a fresh model using the create_model function with the specified hidden layer size and learning rate.

**Training the Model on the Entire Dataset:** 

We trained the model on the full dataset with a 20% validation split to monitor performance on unseen data during training, running for 10 epochs with a batch size of 100.

**Evaluating the Model on the Test Set:** 

We evaluated the model on the test set to measure its accuracy and generalization ability. The cell prints this value under the label "Full Training Accuracy", but it is the test-set accuracy of the model trained on the full dataset.

The model's validation accuracy increases at every epoch, from 91.81% after the first epoch to 96.49% after the tenth, with the largest gains in the first few epochs. The final test accuracy of 96.73% is in line with the final validation accuracy, which suggests good generalization to unseen data. Compared with the 10% subset in Part C (90.51% test accuracy), the same hidden layer size and learning rate gain roughly six percentage points from the additional training data. Continued tuning and potential architectural enhancements could further improve accuracy.

### What happens if we change the variables?

**hidden_layer_size:** Increasing the size can capture more complex patterns but may lead to overfitting. Decreasing it may simplify the model, potentially missing complex patterns.

**learning_rate:** A higher learning rate speeds up convergence but risks overshooting optimal weights, while a lower rate improves stability but requires more iterations.

**epochs:** More epochs allow the model to learn better but might lead to overfitting.

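**batch_size:** Smaller batches give a noisier estimate of the gradient and perform more weight updates per epoch, which usually makes each epoch slower. Larger batches give a more accurate gradient estimate and fewer, cheaper-per-sample updates per epoch, but each update uses more memory.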
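
The relationship between batch size and gradient noise can be illustrated with a small simulation. This is a sketch on made-up scalar "gradients", not on the MNIST network: it measures how much the mini-batch mean varies from batch to batch at different batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up per-example "gradients": 10,000 scalar values with true mean 1.0.
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=10_000)

def batch_gradient_std(batch_size, trials=500):
    # Standard deviation of the mini-batch mean across many random batches.
    estimates = [
        rng.choice(per_example_grads, size=batch_size, replace=False).mean()
        for _ in range(trials)
    ]
    return float(np.std(estimates))

# The spread of the estimate shrinks as the batch grows.
for bs in (10, 100, 1000):
    print(bs, batch_gradient_std(bs))
```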

### Final Conclusions

**Model Performance:**

**Validation Accuracy:** Increased at every epoch on the full dataset, from 91.81% after the first epoch to 96.49% after the tenth.

**Test Accuracy:** Final test accuracy of 96.73% suggests good generalization to unseen data.

**Hyperparameter Effectiveness:**

**Hidden Layer Size (50):** Sufficient to reach about 96.7% test accuracy, with no visible gap between validation and test accuracy.

**Learning Rate (0.17):** Training was stable and converged quickly at this rate with a batch size of 100.

### Validation Insights:

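**Rapid Improvement:** Most of the gain comes early: validation accuracy is already 91.81% after one epoch on the full dataset, and later epochs add progressively smaller improvements.

**Good Generalization:** The final validation accuracy (96.49%) and test accuracy (96.73%) are close, which suggests the model learned patterns that carry over to unseen data.

### Recommendations for Future Work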

**Model Selection:** Ensure that the chosen model architecture is well-suited to the data's complexity.

**Feature Engineering:** Further exploration of additional features and their interactions can provide better insights and improvements.

**Parameter Tuning:** Fine-tuning hyperparameters systematically can lead to optimized model performance.

**Regularization:** Implementing regularization techniques can help mitigate overfitting in neural networks.

**Cross-Validation:** Using cross-validation techniques can provide a more robust evaluation of model performance.

**Ensemble Methods:** Exploring ensemble methods might improve performance by leveraging the strengths of different models.
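
As one possible starting point for the cross-validation recommendation, the sketch below generates shuffled k-fold index splits with NumPy. The k_fold_indices helper is hypothetical and is not part of the notebook. Each pair of index arrays could be used to slice X_train and y_train_cat before calling fit and evaluate.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then cut the shuffled indices into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
```

Averaging the validation accuracy over the k folds gives a less split-dependent estimate than the single 80/20 split used above, at the cost of training k models.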