# Deep Neural Networks and Backpropagation

Deep Neural Networks (DNNs) are a type of artificial neural network (ANN) with multiple hidden layers between the input and output layers. These networks are capable of learning complex patterns and representations from data, making them suitable for a wide range of tasks such as image recognition, natural language processing, and speech recognition.

## Structure of Deep Neural Networks

A DNN consists of multiple layers of interconnected neurons, organized into three main types of layers:

1. **Input Layer**: This layer consists of neurons that receive input data. Each neuron represents a feature or attribute of the input data.

2. **Hidden Layers**: These layers are responsible for learning and extracting meaningful features from the input data. Deep neural networks have multiple hidden layers, hence the term "deep". Each hidden layer performs transformations on the input data using weighted connections and activation functions.

3. **Output Layer**: The final layer of the network produces the output predictions. The number of neurons in this layer depends on the type of problem being solved. For binary classification tasks, there may be a single neuron with a sigmoid activation function, while for multi-class classification tasks, there may be multiple neurons with softmax activation.

## Activation Functions

Activation functions introduce non-linearity to the network, enabling it to learn complex mappings between inputs and outputs. Some commonly used activation functions in DNNs include:

1. **Sigmoid**: $f(z) = \frac{1}{1 + e^{-z}}$
2. **ReLU (Rectified Linear Unit)**: $f(z) = \max(0, z)$
3. **Tanh (Hyperbolic Tangent)**: $f(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
4. **Softmax**: $f(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$
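
As a rough illustration of the last two formulas, here is a minimal NumPy sketch. The column-wise layout (one sample per column) and the max-subtraction trick for numerical stability are assumptions of this sketch, not something the list above prescribes.

```python
import numpy as np

def tanh(z):
    # np.tanh evaluates (e^z - e^-z) / (e^z + e^-z) element-wise.
    return np.tanh(z)

def softmax(z):
    """Column-wise softmax for an array of shape (num_classes, num_samples)."""
    # Subtracting the per-column maximum leaves the result unchanged
    # but keeps np.exp from overflowing on large scores.
    shifted = z - np.max(z, axis=0, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

scores = np.array([[1.0, 1000.0],
                   [2.0, 1000.0],
                   [3.0, 1000.0]])
probs = softmax(scores)
print(probs.sum(axis=0))  # each column sums to 1
```

Without the shift, `np.exp(1000.0)` would overflow to `inf` and the second column would come out as `nan`.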

## Backpropagation

Backpropagation is the key algorithm used to train deep neural networks. It computes the gradient of the loss function with respect to the network's parameters (weights and biases); an optimizer then uses these gradients to update the parameters and reduce the loss. Each training iteration consists of two main steps:

1. **Forward Pass**: During the forward pass, input data is fed through the network, and predictions are made. The output of each layer is computed using the input data, weights, biases, and activation functions.

2. **Backward Pass**: In the backward pass, the gradient of the loss function with respect to each parameter in the network is computed using the chain rule of calculus. This gradient indicates how much the loss would change with a small change in the parameter. The gradients are then used to update the parameters using optimization algorithms such as gradient descent.
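
The two steps can be sketched on the smallest possible network: a single sigmoid neuron with one input and a binary cross-entropy loss. The numbers below are arbitrary illustration values, and the finite-difference comparison is only there as a sanity check on the chain-rule result.

```python
import numpy as np

def forward(w, b, x):
    z = w * x + b                    # weighted input
    a = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation
    return z, a

def bce(a, y):
    # Binary cross-entropy for a single sample
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b, x, y = 0.5, -0.2, 1.5, 1.0     # arbitrary example values

# Forward pass: compute the prediction
z, a = forward(w, b, x)

# Backward pass: chain rule dL/dw = dL/da * da/dz * dz/dw
dL_da = -(y / a) + (1 - y) / (1 - a)
da_dz = a * (1 - a)
dz_dw = x
grad_w = dL_da * da_dz * dz_dw

# Central finite difference as an independent check
h = 1e-6
numeric = (bce(forward(w + h, b, x)[1], y) - bce(forward(w - h, b, x)[1], y)) / (2 * h)
print(grad_w, numeric)
```

The two printed values should agree to several decimal places; a large gap would point to a mistake in one of the chain-rule factors.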

## Backpropagation Formulas

### Loss Function:
Let $L$ denote the loss function, $y_i$ the true label of sample $i$, and $\hat{y}_i$ the predicted probability that sample $i$ belongs to the positive class. For binary classification, the commonly used loss function is binary cross-entropy:

$$ L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$

For multi-class classification, cross-entropy loss or categorical cross-entropy is typically used:

$$ L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij}) $$

Where:
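- $N$ is the number of samples
- $C$ is the number of classes
- $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise (one-hot encoding)
- $\hat{y}_{ij}$ is the predicted probability that sample $i$ belongs to class $j$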
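
A minimal NumPy sketch of the multi-class formula follows. It assumes one-hot labels and predictions stored as arrays of shape $(N, C)$; the function name `categorical_cross_entropy` and the clipping constant `eps` are choices made for this example.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Mean cross-entropy; y_onehot and y_hat have shape (N, C)."""
    # Clip so that log(0) never occurs for a zero predicted probability.
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
ce = categorical_cross_entropy(y_onehot, y_hat)
print(ce)
```

Because the labels are one-hot, only the log-probability assigned to the correct class contributes for each sample, so here the loss reduces to $-(\log 0.7 + \log 0.8)/2$.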

### Gradient Calculation:
The gradients of the loss function with respect to the parameters of the network are computed using the chain rule. For a weight $w_{ij}^{(l)}$ connecting neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$, the gradient is given by:

$$ \frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_{j}^{(l)}} \frac{\partial a_{j}^{(l)}}{\partial z_{j}^{(l)}} \frac{\partial z_{j}^{(l)}}{\partial w_{ij}^{(l)}} $$

Where:
- $a_{j}^{(l)}$ is the activation of neuron $j$ in layer $l$
- $z_{j}^{(l)}$ is the weighted sum of inputs to neuron $j$ in layer $l$
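
To make the three factors concrete, assume the common convention (not stated explicitly above) that $z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$ and $a_j^{(l)} = g(z_j^{(l)})$ for an element-wise activation function $g$. Under that assumption the last two factors are:

$$ \frac{\partial a_{j}^{(l)}}{\partial z_{j}^{(l)}} = g'(z_{j}^{(l)}), \qquad \frac{\partial z_{j}^{(l)}}{\partial w_{ij}^{(l)}} = a_{i}^{(l-1)} $$

The first factor comes directly from the loss at the output layer. For a hidden layer it is accumulated from the layer that follows it:

$$ \frac{\partial L}{\partial a_{i}^{(l-1)}} = \sum_{j} \frac{\partial L}{\partial a_{j}^{(l)}} \, g'(z_{j}^{(l)}) \, w_{ij}^{(l)} $$

This recursion is what lets the backward pass move from the output layer toward the input, reusing each layer's result in the layer before it.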

### Parameter Update:
The parameters of the network (weights and biases) are updated using an optimization algorithm such as gradient descent. The update rule for a parameter $w_{ij}^{(l)}$ is given by:

$$ w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha \frac{\partial L}{\partial w_{ij}^{(l)}} $$

Where:
- $\alpha$ is the learning rate
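
The update rule can be tried out on a toy problem. The sketch below applies it to a hypothetical quadratic loss $L(w) = \lVert w - t \rVert^2$, whose gradient is $2(w - t)$; the helper name `gradient_descent`, the target `t`, and the step count are illustration choices.

```python
import numpy as np

def gradient_descent(grad_fn, w0, alpha, num_steps):
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        w = w - alpha * grad_fn(w)   # w <- w - alpha * dL/dw
    return w

# Toy loss L(w) = ||w - target||^2 with gradient 2 * (w - target)
target = np.array([3.0, -1.0])
w_final = gradient_descent(lambda w: 2 * (w - target),
                           w0=[0.0, 0.0], alpha=0.1, num_steps=100)
print(w_final)
```

With `alpha=0.1` each step shrinks the distance to the target by a factor of 0.8, so the iterate converges; for this particular loss, any `alpha` above 1.0 would make the iterates diverge instead.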

## Conclusion

Deep neural networks are powerful models capable of learning complex patterns from data. Backpropagation is a fundamental algorithm for training these networks, allowing them to learn from labeled data by adjusting their parameters to minimize a given loss function. Understanding the concepts and mathematics behind deep neural networks and backpropagation is essential for effectively designing and training neural network models for various tasks.

In [1]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [2]:
def sigmoid(Z):
    """
    Computes the sigmoid activation function element-wise on an input array.

    Args:
    Z (numpy.ndarray): Input array.

    Returns:
    numpy.ndarray: Output array with sigmoid activation applied element-wise.
    """
    return 1 / (1 + np.exp(-Z))

def relu(Z):
    """
    Computes the ReLU (Rectified Linear Unit) activation function element-wise on an input array.

    Args:
    Z (numpy.ndarray): Input array.

    Returns:
    numpy.ndarray: Output array with ReLU activation applied element-wise.
    """
    return np.maximum(0, Z)

def loss(y_hat, y):
    """
    Computes the binary cross-entropy loss between predicted probabilities and ground truth labels.

    Args:
    y_hat (numpy.ndarray): Predicted probabilities.
    y (numpy.ndarray): Ground truth labels.

    Returns:
    float: Binary cross-entropy loss.
    """
    epsilon = 1e-8  # Small epsilon value to ensure numerical stability
    loss = -np.mean(y * np.log(y_hat + epsilon) + (1 - y) * np.log(1 - y_hat + epsilon))
    return loss


In [3]:
class DenseLayer:
    """
    Dense (fully connected) layer in a neural network.

    Args:
    in_features (int): Number of input features.
    out_units (int): Number of output units/neurons.
    g (str): Activation function type. Supported values are "relu" or "sigmoid".

    Attributes:
    W (numpy.ndarray): Weight matrix.
    b (numpy.ndarray): Bias vector.
    g (str): Activation function type.
    """

    def __init__(self, in_features, out_units, g):
        """
        Initializes the DenseLayer with Xavier/Glorot-scaled random weights and zero biases.

        Args:
        in_features (int): Number of input features.
        out_units (int): Number of output units/neurons.
        g (str): Activation function type. Supported values are "relu" or "sigmoid".
        """
        self.W = np.random.randn(out_units, in_features) * np.sqrt(2.0 / (in_features + out_units))
        self.b = np.zeros((out_units, 1))
        self.g = g

    def forward(self, X):
        """
        Performs forward propagation through the layer.

        Args:
        X (numpy.ndarray): Input data.

        Returns:
        numpy.ndarray: Output of the layer after activation function.
        """
        self.X = X
        self.Z = np.dot(self.W, X) + self.b
        self.m = X.shape[1]
        if self.g == "relu":
            self.A = relu(self.Z)
        elif self.g == "sigmoid":
            self.A = sigmoid(self.Z)
        else:
            raise ValueError(f"Unsupported activation: {self.g!r}")
        return self.A

    def backward(self, dA):
        """
        Performs backward propagation through the layer.

        Args:
        dA (numpy.ndarray): Gradient of the loss with respect to the layer's output.

        Returns:
        numpy.ndarray: Gradient of the loss with respect to the layer's input.
        """
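        # Convert dL/dA into dL/dZ by multiplying with the activation's derivative.
        if self.g == "relu":
            # ReLU derivative is 1 where Z > 0 and 0 elsewhere.
            self.dZ = np.array(dA, copy=True)
            self.dZ[self.Z <= 0] = 0
        elif self.g == "sigmoid":
            # Sigmoid derivative is s * (1 - s).
            s = sigmoid(self.Z)
            self.dZ = dA * s * (1 - s)

        # Average the parameter gradients over the m samples in the batch.
        self.dW = (1 / self.m) * np.dot(self.dZ, self.X.T)
        self.db = (1 / self.m) * np.sum(self.dZ, axis=1, keepdims=True)
        # Gradient passed back to the previous layer (still per sample).
        self.dA = np.dot(self.W.T, self.dZ)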

        return self.dA

    def gradient_descent_step(self, alpha):
        """
        Performs one step of gradient descent update.

        Args:
        alpha (float): Learning rate.

        """
        self.W = self.W - alpha * self.dW
        self.b = self.b - alpha * self.db

In [4]:
class NNSequential:
    """
    Feed-forward neural network built as a sequential stack of layers and trained with full-batch gradient descent on a binary cross-entropy loss.

    Args:
    X (numpy.ndarray): Input data.
    Y (numpy.ndarray): Ground truth labels.
    alpha (float): Learning rate.
    num_epochs (int): Number of training epochs.

    Attributes:
    layers (list): List of layers in the neural network.
    X (numpy.ndarray): Input data.
    Y (numpy.ndarray): Ground truth labels.
    learning_rate (float): Learning rate.
    num_epochs (int): Number of training epochs.
    """

    def __init__(self, X, Y, alpha, num_epochs):
        """
        Initializes the NNSequential model.

        Args:
        X (numpy.ndarray): Input data.
        Y (numpy.ndarray): Ground truth labels.
        alpha (float): Learning rate.
        num_epochs (int): Number of training epochs.
        """
        self.layers = []
        self.X = X
        self.Y = Y
        self.learning_rate = alpha
        self.num_epochs = num_epochs

    def add_layer(self, layer):
        """
        Adds a layer to the neural network.

        Args:
        layer: Layer to be added to the neural network.
        """
        self.layers.append(layer)

    def fit(self):
        """
        Trains the neural network model.
        """
        if self.num_epochs <= 0:
            print("Number of epochs should be a positive integer.")
            return

        for i in range(self.num_epochs):
            A = np.array(self.X, copy=True)
            for layer in self.layers:
                A = layer.forward(A)

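            # Gradient of binary cross-entropy w.r.t. the output activations.
            # Clip A away from 0 and 1 so the divisions stay finite.
            A_safe = np.clip(A, 1e-8, 1 - 1e-8)
            dA = - (np.divide(self.Y, A_safe) - np.divide(1 - self.Y, 1 - A_safe))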

            for layer in reversed(self.layers):
                dA_l_minus_1 = layer.backward(dA)
                dA = dA_l_minus_1

            for layer in self.layers:
                layer.gradient_descent_step(self.learning_rate)

            print("Loss: " + str(loss(A, self.Y)))

    def predict(self, X, y_test):
        """
        Predicts the output for the given input data and evaluates performance metrics.

        Args:
        X (numpy.ndarray): Input data for prediction.
        y_test (numpy.ndarray): Ground truth labels for evaluation.

        Returns:
        numpy.ndarray: Predicted output.
        """
        A = np.array(X, copy=True)
        for layer in self.layers:
            A = layer.forward(A)

        y_pred = np.where(A > 0.5, 1, 0)
        print("Predicted Output Shape:", y_pred.shape)

        y_test = np.squeeze(y_test)
        accuracy = accuracy_score(y_test, y_pred.T)
        print("Accuracy:", accuracy)

        f1 = f1_score(y_test, y_pred.T)
        print("F1-score:", f1)

        precision = precision_score(y_test, y_pred.T)
        print("Precision:", precision)

        recall = recall_score(y_test, y_pred.T)
        print("Recall:", recall)

        return y_pred

In [5]:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)


X = mnist.data
y = mnist.target


y = y.astype(int)
print("Feature vectors shape:", X.shape)
print("Labels shape:", y.shape)


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=50000, test_size=20000, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Add bias column to X_train and X_test
X_train = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_test = np.c_[np.ones((X_test.shape[0], 1)), X_test]

print("X_train shape (with bias column):", X_train.shape)
print("X_test shape (with bias column):", X_test.shape)




Feature vectors shape: (70000, 784)
Labels shape: (70000,)
X_train shape: (50000, 784)
X_test shape: (20000, 784)
y_train shape: (50000,)
y_test shape: (20000,)
X_train shape (with bias column): (50000, 785)
X_test shape (with bias column): (20000, 785)


In [6]:
# Normalize pixel values to [0, 1] by dividing by 255, the maximum grayscale intensity.
# Note: this also scales the constant bias column from 1 to 1/255.
X_train /= 255
X_test  /= 255

In [7]:
def preprocess_labels(y, value_to_one):
    """
    Preprocesses the labels such that labels with the specified value are changed to 1
    and all other labels are changed to 0.

    Parameters:
    - y: numpy array, the original labels
    - value_to_one: int, the value to be changed to 1

    Returns:
    - y_processed: numpy array, the preprocessed labels
    """
    y_processed = (y == value_to_one).astype(int)
    return y_processed
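
A quick check of the relabeling on a toy array, written as a self-contained sketch that repeats the one-line function body; the sample labels are made up for illustration.

```python
import numpy as np

def preprocess_labels(y, value_to_one):
    # 1 where the label equals value_to_one, 0 everywhere else
    return (y == value_to_one).astype(int)

y_toy = np.array([3, 8, 8, 1])
y_binary = preprocess_labels(y_toy, 8)
print(y_binary)          # [0 1 1 0]
print(y_binary.mean())   # fraction of positive labels
```

Checking the fraction of positive labels is worthwhile in a one-vs-rest setup like this, because a single digit out of ten makes the positive class a minority, and accuracy alone can then look better than precision, recall, or F1 suggest.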

In [8]:
y_train = preprocess_labels(y_train, 8)
y_test = preprocess_labels(y_test, 8)

In [9]:
X_train.shape

(50000, 785)

In [10]:
model = NNSequential(X_train.T, y_train, 0.3, 40)
layer1 = DenseLayer(785, 395, "relu")
layer2 = DenseLayer(395, 200, "relu")
layer3 = DenseLayer(200, 100, "relu")
layer4 = DenseLayer(100, 20, "relu")
layer5 = DenseLayer(20, 1, "sigmoid")
model.add_layer(layer1)
model.add_layer(layer2)
model.add_layer(layer3)
model.add_layer(layer4)
model.add_layer(layer5)

In [11]:
model.fit()

Loss: 0.6769232302749445
Loss: 0.4731457871064196
Loss: 0.36120666619088
Loss: 0.3443704228772589
Loss: 0.33554247036164087
Loss: 0.3271241362197645
Loss: 0.31883781204130346
Loss: 0.3103004736751103
Loss: 0.3013139984139406
Loss: 0.29203153200114385
Loss: 0.2824937125847092
Loss: 0.2726961338102551
Loss: 0.2627090661276295
Loss: 0.25261691455129986
Loss: 0.24250532601691568
Loss: 0.2326800162332227
Loss: 0.22353467228063736
Loss: 0.2152557976410191
Loss: 0.20776776367654515
Loss: 0.2009515267289394
Loss: 0.1946947736761549
Loss: 0.18895410507198493
Loss: 0.1837711487653117
Loss: 0.1798973277343879
Loss: 0.18186625482790017
Loss: 0.2169254341888924
Loss: 0.3985554292447169
Loss: 0.4040889052465391
Loss: 0.292943634882196
Loss: 0.19782695083127114
Loss: 0.18247006680760827
Loss: 0.17480379675258048
Loss: 0.1693602110531448
Loss: 0.16498019102769504
Loss: 0.16110813283047506
Loss: 0.15752398145475038
Loss: 0.1541039420430122
Loss: 0.15081394730296374
Loss: 0.14763324506769374
Loss: 0.144

In [12]:
model.predict(X_test.T, y_test)

Predicted Output Shape: (1, 20000)
Accuracy: 0.95095
F1-score: 0.6625386996904025
Precision: 0.9678391959798995
Recall: 0.5036610878661087


array([[1, 0, 0, ..., 0, 0, 0]])