# MNIST Digit Recognition with CNN
## Theoretical Foundations in Machine Learning
## 1. Importing Required Libraries

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

## 2. Data Preprocessing
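**Normalization:** Scaling pixel values from [0, 255] to [0, 1] keeps the inputs in a small, consistent range, which helps stabilize gradient descent during training.

**One-hot Encoding:** Converts each integer class label (0 to 9) into a 10-element binary vector, the format that `categorical_crossentropy` expects for multi-class classification.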
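
To make the two steps concrete, here is a small pure-Python sketch of what scaling and one-hot encoding do to a few hypothetical values. The helper names `scale_pixels` and `one_hot` are illustrative only and are not part of Keras; the notebook itself relies on array division and `tf.keras.utils.to_categorical`.

```python
def scale_pixels(pixels, max_value=255):
    """Map integer pixel intensities in [0, max_value] to floats in [0, 1]."""
    return [p / max_value for p in pixels]


def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes})")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


# Hypothetical sample values, not taken from MNIST
print(scale_pixels([0, 51, 255]))  # [0.0, 0.2, 1.0]
print(one_hot(3))                  # 1.0 at position 3, zeros elsewhere
```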

In [None]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

# Add a channel dimension (N, 28, 28, 1) and scale pixels to [0, 1]
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255

# One-hot encode labels
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

## 3. Conceptual Questions

**Q1: Why use CNNs instead of traditional models like Random Forests or SVMs?**
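CNNs are designed for grid-structured data such as images. Convolutional filters share weights across positions and build up spatial hierarchies (edges, then strokes, then digit shapes), whereas Random Forests and SVMs typically take the image as a flat vector and ignore which pixels are neighbors. Weight sharing makes CNNs far more parameter-efficient than a fully connected model on the same input. Convolution is translation-equivariant, and pooling adds a degree of translation invariance, so CNNs usually generalize better on image data.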
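
The parameter-efficiency point can be checked with simple arithmetic. The sketch below counts the parameters of a 3x3 convolution with 32 filters on a 28x28x1 image and compares it with a hypothetical fully connected layer that produces the same number of outputs. The helper names are illustrative, and the dense layer is a thought experiment rather than a model anyone would train.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a standard 2D convolution."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_features, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return in_features * units + units


# 32 filters of size 3x3 applied to a single-channel image
conv_params = conv2d_params(3, 3, 1, 32)

# Hypothetical comparison: a dense layer mapping the flat 28x28 image to
# as many outputs as the convolution's 26x26x32 feature map.
dense_equiv = dense_params(28 * 28, 26 * 26 * 32)

print(conv_params)   # 320
print(dense_equiv)   # 16981120
```

Because the same 3x3 weights are reused at every position, the convolution needs a few hundred parameters where the dense equivalent would need millions.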

**Q2: Why are non-linear activation functions essential? Which one is most appropriate here and why?**
Non-linear activation functions let a network learn patterns beyond linear relationships. Without them, any stack of layers collapses to a single linear (affine) transformation, no matter how deep it is. **ReLU** is the most appropriate choice for the hidden layers here because its gradient is 1 for positive inputs, which mitigates the vanishing gradient problem and tends to speed up convergence. **Softmax** is used in the output layer to convert the ten raw outputs into class probabilities that sum to 1.
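
The collapse of stacked linear layers can be shown on a tiny example. The following sketch uses hypothetical 2x2 weight matrices (biases omitted for brevity) and plain Python lists: two linear layers give the same result as one layer with the product matrix, while inserting ReLU between them breaks that equivalence. A numerically stable softmax is included to show how raw outputs become probabilities.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(z):
    """Subtract the max before exponentiating to avoid overflow."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights and input, chosen only for illustration
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers
collapsed = matvec(matmul(W2, W1), x)        # one equivalent linear layer
with_relu = matvec(W2, relu(matvec(W1, x)))  # non-linearity in between

print(stacked, collapsed)  # [-3.5, 10.5] [-3.5, 10.5]
print(with_relu)           # [2.5, 7.5]
print(softmax(with_relu))  # probabilities that sum to 1
```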

## 4. Model Architecture

In [None]:
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

model.summary()
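
As a sanity check on the output of `model.summary()`, the feature-map shapes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults used in the model above: `'valid'` padding and stride 1 for `Conv2D`, and a pooling stride equal to the pool size for `MaxPooling2D`. The helper names are illustrative.

```python
def conv_out(size, kernel):
    """Output length of a stride-1, 'valid' convolution along one axis."""
    return size - kernel + 1


def pool_out(size, pool):
    """Output length of non-overlapping 'valid' pooling along one axis."""
    return size // pool


side, channels, total_params = 28, 1, 0
trace = []

# Two Conv2D(3x3) + MaxPooling2D(2x2) stages with 32 and 64 filters
for filters in (32, 64):
    side = conv_out(side, 3)
    total_params += 3 * 3 * channels * filters + filters
    channels = filters
    trace.append(("conv", side, side, channels))
    side = pool_out(side, 2)
    trace.append(("pool", side, side, channels))

flat = side * side * channels      # size after Flatten
total_params += flat * 128 + 128   # Dense(128)
total_params += 128 * 10 + 10      # Dense(10); Dropout has no parameters

for row in trace:
    print(row)
print("flattened features:", flat)       # 1600
print("total parameters:", total_params) # 225034
```

Most of the parameters sit in the first dense layer (1600 x 128 weights plus biases), which is one reason dropout is placed right after it.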

## 5. Model Training

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=64,
                    validation_split=0.2)

## 6. Model Evaluation

In [None]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_acc:.4f}")

y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

print(classification_report(y_true, y_pred))

In [None]:
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training vs Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

## 7. Overfitting Mitigation Strategies
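- **Data Augmentation**: Applying small random rotations, zooms, and shifts to training images so the model sees more varied examples.
- **Dropout Layers**: Randomly zeroing a fraction of activations during training (the model above uses a rate of 0.5) so the network cannot rely on any single unit.
- **L2 Regularization**: Adding a penalty proportional to the squared weights to the loss, which discourages large weights.
- **Early Stopping**: Halting training once validation loss stops improving, typically after a set number of epochs without improvement.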
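
Three of these strategies are simple enough to sketch in plain Python. The code below is an illustration of the underlying logic, not the Keras implementation; in Keras the corresponding tools are `layers.Dropout`, `tf.keras.regularizers.l2`, and `tf.keras.callbacks.EarlyStopping`. The function names, the sample loss values, and the `patience` setting are assumptions made for the example. Data augmentation is not covered here.

```python
import random


def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by
    1 / (1 - rate) so the expected value is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def l2_penalty(weights, lam):
    """L2 term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training stops: the first epoch
    where validation loss has failed to improve `patience` times in a row.
    If that never happens, return the last epoch."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1


rng = random.Random(0)
dropped = inverted_dropout([1.0] * 8, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 or 2.0

print(l2_penalty([1.0, -2.0], lam=0.01))

# Hypothetical validation losses: best at epoch 2, then two worse epochs
stop_at = early_stopping_epoch([0.30, 0.20, 0.15, 0.16, 0.17, 0.14], patience=2)
print(stop_at)  # 4
```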

## 8. Hyperparameter Tuning Guide
| Parameter              | Suggested Range |
|------------------------|-----------------|
| Learning Rate          | 1e-5 to 1e-2    |
| Batch Size             | 32 to 256       |
| Filters per Conv Layer | 32 to 128       |
| Dense Units            | 64 to 512       |

**Observations:**
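- More filters and dense units increase model capacity and can improve accuracy, but may lead to overfitting.
- Smaller learning rates give more stable but slower convergence; rates that are too large can cause training to diverge.
- Larger batch sizes use more memory but process each epoch faster; smaller batches give noisier gradient estimates per update.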
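
One simple way to explore the ranges in the table is random search, which this notebook does not itself prescribe; it is shown here only as a possible starting point. The sketch samples the learning rate on a log scale, since it spans several orders of magnitude, and picks the other values from power-of-two grids. The function name, the grids, and the seed are assumptions for illustration.

```python
import math
import random


def sample_config(rng):
    """Draw one hyperparameter configuration from the suggested ranges."""
    # Sample the exponent uniformly so each decade is equally likely
    log_lr = rng.uniform(math.log10(1e-5), math.log10(1e-2))
    return {
        "learning_rate": 10 ** log_lr,
        "batch_size": rng.choice([32, 64, 128, 256]),
        "filters": rng.choice([32, 64, 128]),
        "dense_units": rng.choice([64, 128, 256, 512]),
    }


rng = random.Random(42)  # fixed seed so the draws are reproducible
configs = [sample_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

Each sampled dictionary could then be used to build and train one model, keeping the configuration with the best validation accuracy.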