# Building Neural Networks with TensorFlow 2.x Core API

In this tutorial, we'll build a multi-layer neural network using TensorFlow 2.x **Core API**.
We'll define a **custom Dense layer**, explore **different activation functions**, and train the model on the **MNIST dataset**.

## **What You Will Learn**
- How to create a **custom Dense layer** using `tf.Module`.
- How to **manually implement forward propagation**.
- How **batches and epochs** work in training.
- How to experiment with **different activation functions**.
- How to train and evaluate a **multi-layer neural network** on MNIST.

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds

### Load and Prepare MNIST Dataset
The MNIST dataset contains **70,000 grayscale images** of handwritten digits (0-9), each 28×28 pixels.
It is split into **60,000 training** and **10,000 test** images.

In [None]:
(train_data, test_data), info = tfds.load('mnist', split=['train', 'test'], as_supervised=True, with_info=True)

In [None]:
# Count training dataset size
train_size = info.splits['train'].num_examples
print(f"Number of elements in train_data: {train_size}")

test_size = info.splits['test'].num_examples
print(f"Number of elements in test_data: {test_size}")


Number of elements in train_data: 60000
Number of elements in test_data: 10000


## **Understanding `map`, `batch`, and `shuffle` in tf.data**

- **`map`**: Applies a function to transform the dataset (e.g., normalize images).
- **`shuffle`**: Randomizes the order of samples so that mini-batches are not biased by the original dataset order. A buffer size of `10000` means the dataset keeps 10,000 elements in memory and draws each next element at random from that buffer.
- **`batch`**: Groups samples into mini-batches, improving performance.
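
To make the role of the shuffle buffer concrete, here is a minimal pure-Python sketch of buffer-based shuffling. It is an illustration of the idea, not TensorFlow's implementation; the function name `buffered_shuffle` and its `seed` parameter are assumptions made for this example.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]              # emit a random buffered element
        buffer[idx] = item             # replace it with the incoming element
    rng.shuffle(buffer)                # input exhausted: drain what is left
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

Because each output is drawn only from the current buffer, a small buffer gives only a local shuffle. A buffer at least as large as the dataset is needed for a uniform shuffle of all elements.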


In [None]:
def normalize_img(image, label):
    return tf.cast(image, tf.float32) / 255.0, tf.cast(label, tf.int32)

# Apply mapping function
train_data = train_data.map(normalize_img)
test_data = test_data.map(normalize_img)

# Shuffle training data (before batching, so individual samples are shuffled)
# With a buffer size of 10000, the dataset keeps 10,000 elements in memory
# and draws each next element at random from that buffer.
train_data = train_data.shuffle(10000)

# Batch the dataset with drop_remainder=True to ensure all batches have the same size
train_data = train_data.batch(128, drop_remainder=True)
test_data = test_data.batch(128, drop_remainder=True)

In [None]:
# Print the number of batches and the shape of one batch
# cardinality() returns the batch count without iterating over the dataset
num_train_batches = train_data.cardinality().numpy()
print(f"Number of batches in train_data: {num_train_batches}")

for batch in train_data.take(1):
    images, labels = batch
    print(f"Shape of one batch of images: {images.shape}")
    print(f"Shape of one batch of labels: {labels.shape}")

Number of batches in train_data: 468
Shape of one batch of images: (128, 28, 28, 1)
Shape of one batch of labels: (128,)
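
The batch count above follows from integer division of the dataset size by the batch size. The short check below (plain Python, with variable names chosen for this example) also shows how many examples `drop_remainder=True` discards in each split.

```python
train_examples, test_examples, batch_size = 60_000, 10_000, 128

# divmod returns (number of full batches, leftover examples)
train_batches, train_dropped = divmod(train_examples, batch_size)
test_batches, test_dropped = divmod(test_examples, batch_size)

print(train_batches, train_dropped)  # 468 full batches, 96 examples dropped
print(test_batches, test_dropped)    # 78 full batches, 16 examples dropped
```

Note that the test set also loses its last partial batch, so accuracy is measured on 9,984 of the 10,000 test images.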


## **Step 1: Creating a Custom Dense Layer**
Instead of using `tf.keras.layers.Dense`, we'll **create our own Dense layer** using `tf.Module`.

In [None]:
class CustomDense(tf.Module):
    def __init__(self, input_dim, output_dim, activation=tf.identity, name=None):
        super().__init__(name=name)
        # Small random initial weights (stddev=0.1) keep early activations and gradients in a reasonable range
        self.W = tf.Variable(tf.random.normal([input_dim, output_dim], stddev=0.1), name='weights')
        self.b = tf.Variable(tf.zeros([output_dim]), name='bias')
        self.activation = activation

    def __call__(self, x):
        z = tf.matmul(x, self.W) + self.b
        return self.activation(z)
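
As a shape check for the layer's forward pass, the following NumPy sketch mirrors `z = xW + b` followed by an activation. It is a stand-in for the TensorFlow code, and the batch size of 4 and the ReLU activation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_dim, output_dim = 4, 784, 128

x = rng.random((batch_size, input_dim))                  # a batch of flattened images
W = rng.normal(0.0, 0.1, size=(input_dim, output_dim))   # same init scale as CustomDense
b = np.zeros(output_dim)                                 # bias starts at zero

z = x @ W + b              # affine transform; b broadcasts across the batch
a = np.maximum(z, 0.0)     # ReLU as an example activation

print(z.shape, a.shape)
```

Each row of `x` is mapped independently, so the output keeps the batch dimension and replaces the feature dimension with `output_dim`.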

## **Step 2: Building a Multi-Layer Neural Network (MLP)**
MLP stands for Multi-Layer Perceptron.
We define an MLP with **two hidden layers** and an output layer, and then show how to increase the number of layers.

In [None]:
class MultiLayerNN(tf.Module):
    def __init__(self, input_dim, hidden_units=(128, 64), output_dim=10, activation=tf.nn.relu):
        super().__init__()
        self.flatten = tf.keras.layers.Flatten()
        self.hidden1 = CustomDense(input_dim, hidden_units[0], activation=activation)
        self.hidden2 = CustomDense(hidden_units[0], hidden_units[1], activation=activation)
        # Output raw logits: compute_loss uses from_logits=True, which applies softmax internally
        self.output_layer = CustomDense(hidden_units[1], output_dim, activation=tf.identity)

    def __call__(self, x):
        x = self.flatten(x)
        x = self.hidden1(x)
        x = self.hidden2(x)
        return self.output_layer(x)

To add more hidden layers, create additional `CustomDense` layers in `__init__` and call them in order in `__call__`.
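
One way to avoid writing each layer by hand is to build the layers in a loop over a list of sizes. The NumPy sketch below shows that pattern under stated assumptions: `init_mlp` and `forward` are hypothetical helper names, and the arrays stand in for `CustomDense` instances.

```python
import numpy as np

def init_mlp(input_dim, hidden_units, output_dim, seed=0):
    """Create one (W, b) pair per layer from a list of layer sizes."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim, *hidden_units, output_dim]
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x, activation=lambda z: np.maximum(z, 0.0)):
    """Flatten the input, apply every hidden layer, and return raw logits."""
    x = x.reshape(x.shape[0], -1)
    for W, b in params[:-1]:
        x = activation(x @ W + b)
    W, b = params[-1]
    return x @ W + b               # no activation on the output layer

params = init_mlp(784, [256, 128, 64], 10)      # three hidden layers
logits = forward(params, np.zeros((2, 28, 28, 1)))
print(len(params), logits.shape)
```

Pairing consecutive entries of `sizes` guarantees that each layer's input dimension matches the previous layer's output dimension, however many hidden layers are listed.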

## **Step 3: Training the Neural Network**
- **Sparse categorical cross-entropy** is used when labels are **integers** (e.g., `0-9` for MNIST).
- If labels are **one-hot encoded**, we would use `categorical_crossentropy` instead.
- Other loss functions:
  - **MSE (Mean Squared Error)**: Not ideal for classification.
  - **Binary Cross-Entropy**: Used for binary classification.
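
The sparse and one-hot variants of cross-entropy compute the same value and differ only in how the labels are encoded. A small NumPy check, using made-up probabilities for a 3-class problem:

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],      # predicted class probabilities
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])               # integer (sparse) labels
one_hot = np.eye(3)[labels]             # the same labels, one-hot encoded

# Sparse form: pick the log-probability of the true class directly
sparse_ce = -np.log(probs[np.arange(len(labels)), labels])
# Categorical form: sum over classes, weighted by the one-hot vector
categorical_ce = -np.sum(one_hot * np.log(probs), axis=1)

print(sparse_ce, categorical_ce)
```

The sparse form avoids building the one-hot matrix, which is why it is the natural choice when labels are already integers.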

In [None]:
# Why use from_logits=True?
# ✅ Better numerical stability: softmax and cross-entropy are computed together in log space.
# ✅ ***No need to apply softmax in the output layer***.
# ⚠ Note: With from_logits=True, y_pred must be raw logits (before softmax),
#   so the output layer should use an identity activation:
# self.output_layer = CustomDense(hidden_units[1], output_dim, activation=tf.identity)

def compute_loss(y_pred, y_true):
    return tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True))
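
With `from_logits=True`, the loss applies softmax itself, so `y_pred` must be raw scores. The NumPy sketch below is an illustration, not TensorFlow's implementation, and the helper names `log_softmax` and `sparse_ce_from_logits` are assumptions. It shows what happens when probabilities are passed in by mistake, so that softmax is applied twice.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the row maximum first."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, given raw logits."""
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()

labels = np.array([3])
logits = np.full((1, 10), -20.0)
logits[0, 3] = 20.0                      # a very confident, correct prediction

loss_logits = sparse_ce_from_logits(logits, labels)   # essentially zero
probs = np.exp(log_softmax(logits))                   # softmax output
loss_double = sparse_ce_from_logits(probs, labels)    # softmax applied twice
floor = np.log(1 + 9 / np.e)                          # lowest reachable value

print(loss_logits, loss_double, floor)
```

Since probabilities lie in [0, 1], a second softmax can never produce a confident distribution. For 10 classes the loss cannot fall below `log(1 + 9/e)`, about 1.46, even for a perfect prediction. A training loss that stalls just above that value while accuracy keeps rising is a typical symptom of a softmax output layer combined with `from_logits=True`.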

### **Understanding Batches and Epochs**
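- **Batch Size:** The number of training samples processed together in one forward and backward pass, after which the model parameters are updated once.
- **Epoch:** A full pass over the entire training dataset. With 468 batches per epoch, 5 epochs perform 2,340 parameter updates.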
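
The relationship between batches, epochs, and parameter updates can be sketched with a toy loop. This is plain Python on ten dummy samples; `iterate_minibatches` is a hypothetical helper and not part of the tutorial's code.

```python
def iterate_minibatches(samples, batch_size, drop_remainder=True):
    """Yield consecutive mini-batches, optionally dropping a short last batch."""
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break
        yield batch

samples = list(range(10))
epochs, batch_size = 2, 4
updates = 0

for epoch in range(epochs):
    for batch in iterate_minibatches(samples, batch_size):
        updates += 1               # one parameter update would happen here
    print(f"epoch {epoch + 1}: {updates} updates so far")
```

With 10 samples and a batch size of 4, each epoch yields two full batches and drops the last two samples, giving 4 updates over 2 epochs.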

### **Explanation: Predictions and Accuracy Computation**
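- `tf.argmax(y_pred, axis=1)`: Returns, for each sample, the index of the class with the highest score. The result is the same for logits and probabilities.
- `tf.equal(predictions, y_batch)`: Checks element-wise whether predictions match the true labels.
- `tf.reduce_mean(tf.cast(correct_preds, tf.float32))`: Converts the matches to 0/1 and averages them, which gives the batch accuracy.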
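
The same three steps can be traced by hand on a tiny batch. The sketch uses NumPy equivalents of the TensorFlow calls and made-up scores for four samples and three classes.

```python
import numpy as np

y_pred = np.array([[0.1, 2.0, -1.0],    # scores for 4 samples, 3 classes
                   [1.5, 0.3, 0.2],
                   [0.0, 0.1, 3.0],
                   [0.9, 1.0, 0.8]])
y_batch = np.array([1, 0, 2, 0])        # true labels

predictions = np.argmax(y_pred, axis=1)             # [1, 0, 2, 1]
correct_preds = predictions == y_batch              # [True, True, True, False]
accuracy = correct_preds.astype(np.float32).mean()  # 3 of 4 correct

print(predictions, accuracy)  # accuracy is 0.75
```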

#### Adam
Adam (Adaptive Moment Estimation) combines the ideas of Momentum and RMSprop: it keeps running averages of the gradients and of the squared gradients.

- Pros ✅
  - ✔ Adaptive Learning Rate: Scales the step size separately for each parameter.
  - ✔ Momentum-Based: The running average of gradients accelerates convergence.
  - ✔ Handles Sparse Gradients Well: Useful for parameters that receive gradients only occasionally (e.g., embeddings).
  - ✔ Less Hyperparameter Tuning: Default values work well for many tasks.

- Cons ❌
  - ✖ Uses More Memory: Stores two extra values per parameter (the momentum and squared-gradient averages).
  - ✖ Can Overshoot Optimum: With a learning rate that is too high, the momentum term can carry updates past a minimum.

- Best for:
  - ✅ Deep learning models (CNNs, RNNs, Transformers).
  - ✅ Large-scale datasets (like ImageNet, NLP tasks).
  - ✅ Sparse gradient problems (e.g., NLP embeddings).
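
The update rule behind these properties fits in a few lines. The following is a minimal scalar sketch of the standard Adam update, applied to the toy objective f(x) = (x - 3)²; the function name `adam_minimize` and the chosen learning rate and step count are assumptions for illustration.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a scalar function with Adam, given its gradient."""
    m, v = 0.0, 0.0                                # running averages (the extra memory)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g            # momentum: average of gradients
        v = beta2 * v + (1 - beta2) * g * g        # average of squared gradients
        m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)   # per-parameter scaled step
    return theta

# Gradient of f(x) = (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3
theta = adam_minimize(lambda x: 2 * (x - 3.0), theta=0.0)
print(theta)
```

Dividing by the square root of `v_hat` is what makes the step size adaptive: the first step has magnitude close to `lr` regardless of how large the gradient is.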

#### SGD
Stochastic Gradient Descent (SGD) is the simplest optimization method: each step moves the parameters against the gradient of the current mini-batch, scaled by the learning rate.

- Pros ✅
  - ✔ Lightweight: Stores no extra state per parameter.
  - ✔ Can Generalize Better: A well-tuned SGD can generalize as well as or better than adaptive optimizers, and it works well for convex problems (e.g., linear/logistic regression).
  - ✔ More Control: The learning rate (and any schedule for it) is set directly by the user.

- Cons ❌
  - ✖ Slow Convergence: Can get stuck in plateaus.
  - ✖ Learning Rate Sensitivity: Requires careful tuning.
  - ✖ Doesn’t Handle Curvature Well: Uses the same learning rate for every parameter.

- Best for:
  - ✅ Shallow networks (MLPs, logistic regression, linear regression).
  - ✅ Small datasets.
  - ✅ Situations where generalization matters more than training speed.
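
For comparison, plain gradient descent on the toy objective f(x) = (x - 3)² shows both the simplicity of the update and its sensitivity to the learning rate. The function name `sgd_minimize` and the two learning rates are illustrative assumptions.

```python
def sgd_minimize(grad_fn, theta, lr=0.1, steps=50):
    """Minimize a scalar function with fixed-step gradient descent."""
    for _ in range(steps):
        theta -= lr * grad_fn(theta)     # the entire update rule
    return theta

grad = lambda x: 2 * (x - 3.0)           # gradient of f(x) = (x - 3)^2

converged = sgd_minimize(grad, theta=0.0, lr=0.1)   # error shrinks by 0.8x per step
diverged = sgd_minimize(grad, theta=0.0, lr=1.1)    # error grows by 1.2x per step

print(converged, diverged)
```

On this objective each step multiplies the distance to the minimum by `1 - 2 * lr`, so any learning rate above 1.0 makes the iterates diverge.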



In [None]:
def train_model(model, dataset, learning_rate=0.01, epochs=5):
    optimizer = tf.optimizers.Adam(learning_rate)  # Alternative: tf.optimizers.SGD(learning_rate)

    for epoch in range(epochs):
        total_loss = 0
        total_acc = 0
        num_batches = 0

        for x_batch, y_batch in dataset:
            with tf.GradientTape() as tape:
                y_pred = model(x_batch)
                loss = compute_loss(y_pred, y_batch)

            gradients = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

            predictions = tf.argmax(y_pred, axis=1, output_type=tf.int32)
            accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, y_batch), tf.float32)).numpy()

            total_loss += loss.numpy()
            total_acc += accuracy
            num_batches += 1

        print(f"Epoch {epoch+1}, Loss: {total_loss / num_batches:.4f}, Accuracy: {total_acc / num_batches:.4f}")

## **Step 4: Evaluating the Model**
Let's check how well the trained model performs on the test dataset.

In [None]:
def evaluate_model(model, dataset):
    total_acc = 0
    num_batches = 0
    for x_batch, y_batch in dataset:
        y_pred = model(x_batch)
        predictions = tf.argmax(y_pred, axis=1, output_type=tf.int32)  # axis=1: argmax over classes, one prediction per row
        accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, y_batch), tf.float32)).numpy()
        total_acc += accuracy
        num_batches += 1
    return total_acc / num_batches
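
Averaging per-batch accuracies, as `evaluate_model` does, equals the overall accuracy only when every batch has the same size (which `drop_remainder=True` guarantees). The sketch below uses hypothetical per-batch counts to show the difference when the last batch is smaller.

```python
batch_correct = [120, 118, 10]   # hypothetical correct predictions per batch
batch_sizes = [128, 128, 16]     # the last batch is a short remainder

# Mean of per-batch accuracies: every batch counts equally
mean_of_batch_acc = sum(c / n for c, n in zip(batch_correct, batch_sizes)) / len(batch_sizes)

# Example-weighted accuracy: every example counts equally
example_weighted_acc = sum(batch_correct) / sum(batch_sizes)

print(f"{mean_of_batch_acc:.4f} vs {example_weighted_acc:.4f}")
```

If the remainder batch is kept, accumulating the number of correct predictions and dividing by the number of examples seen gives the exact accuracy.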


## **Step 5: Initialize and Train the Model**
**Hyperparameter Options:**
- **Activation Functions**
- **Hidden Units**
- **Learning Rates**
- **Epochs**

In [None]:
mlp_model = MultiLayerNN(input_dim=784, hidden_units=[256, 128], output_dim=10, activation=tf.nn.relu)
train_model(mlp_model, train_data, learning_rate=0.001, epochs=5)
test_accuracy = evaluate_model(mlp_model, test_data)
print(f"Test Accuracy: {test_accuracy:.4f}")

Epoch 1, Loss: 1.5804, Accuracy: 0.8944
Epoch 2, Loss: 1.5092, Accuracy: 0.9558
Epoch 3, Loss: 1.4968, Accuracy: 0.9669
Epoch 4, Loss: 1.4900, Accuracy: 0.9730
Epoch 5, Loss: 1.4851, Accuracy: 0.9778
Test Accuracy: 0.9732


## **Exercise**
Change the activation function, the number of hidden units, the learning rate, and the number of epochs, then observe and report the impact on performance.
- **Activation Functions:** `tf.nn.relu`, `tf.nn.sigmoid`, `tf.nn.tanh`
- **Hidden Units:** `[256, 128], [512, 128], [128, 64]`
- **Learning Rates:** `[0.001, 0.01, 0.0001]`
- **Epochs:** `[5, 10, 20]`
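
One way to organize the experiments is to enumerate the combinations up front. The sketch below builds the full grid with `itertools.product`; the activation names are plain strings standing in for `tf.nn.relu`, `tf.nn.sigmoid`, and `tf.nn.tanh`, and the dictionary keys are an assumption chosen to match the arguments used earlier.

```python
from itertools import product

activations = ["relu", "sigmoid", "tanh"]
hidden_units = [[256, 128], [512, 128], [128, 64]]
learning_rates = [0.001, 0.01, 0.0001]
epochs = [5, 10, 20]

# One dictionary per experiment configuration
grid = [
    dict(activation=a, hidden_units=h, learning_rate=lr, epochs=e)
    for a, h, lr, e in product(activations, hidden_units, learning_rates, epochs)
]

print(len(grid))  # 81 combinations
print(grid[0])
```

Training all 81 configurations is expensive, so a reasonable starting point is to vary one hyperparameter at a time while holding the others at their baseline values.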



## **Conclusion**
🎉 We've successfully built a neural network using TensorFlow 2.x **Core API**, implemented a **custom Dense layer**, trained it on MNIST, and seen how to swap in **different activation functions**.

🚀 **Next Steps:** Try changing the activation functions (`sigmoid`, `tanh`), or modify the number of layers and neurons!