# 1. Introduction
In this notebook, we will explore how to implement a simple Dense Neural Network (DNN) from scratch using only basic Python and NumPy.

### 1.1. Theoretical and Mathematical Background

#### 1.1.1. Structure of a Dense Neural Network (DNN)

A Dense Neural Network (DNN) consists of **layers of neurons** where each neuron in one layer is connected to every neuron in the next layer. The main components of a DNN are:

- **Input Layer**: Takes the features of the dataset (e.g., pixel values of an image) as input.
- **Hidden Layers**: Intermediate layers where neurons compute activations through a weighted sum of inputs and pass them to an activation function.
- **Output Layer**: Produces the final prediction, e.g., classification into categories.

Each neuron computes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation function.

In matrix form, the whole layer $ l $ performs the following operation:

$$
Z^{[l]} = W^{[l]} A^{[l-1]} + B^{[l]}
$$
$$
A^{[l]} = \sigma(Z^{[l]})
$$

Where:
- $ W^{[l]} $ is the weight matrix of the layer $ l $.
- $ B^{[l]} $ is the bias vector of the layer $ l $.
- $ A^{[l-1]} $ is the output (activation) of the previous layer.
- $ Z^{[l]} $ is the linear component (weighted sum of inputs).
- $ A^{[l]} $ is the activation (output) of layer $ l $ after applying the activation function $ \sigma $.

The activation function $ \sigma $ introduces non-linearity into the network, allowing it to model complex relationships. Common activation functions are:

- **Sigmoid**: $ \sigma(z) = \frac{1}{1 + e^{-z}} $ (maps any real value to the interval $ (0, 1) $; commonly used in the output layer for binary classification).
- **ReLU** (Rectified Linear Unit): $ \sigma(z) = \max(0, z) $ (commonly used in hidden layers).
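
As a minimal sketch, the two activation functions above and their derivatives (which backpropagation needs later) could be written in NumPy as follows. The function names are illustrative and are not taken from the class in section 2.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """Derivative of ReLU: 1 where z > 0, else 0 (the value at z == 0 is a convention)."""
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("relu:   ", relu(z))
```

For large negative inputs `np.exp(-z)` can overflow and emit a warning; the result still saturates to 0, but a production implementation would usually guard against this.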

#### 1.1.2. Forward Propagation

Forward propagation is the process by which input data passes through the network to produce an output. It happens layer by layer, starting from the input layer and moving towards the output.

In a single layer:

1. The inputs from the previous layer (or the dataset features, in the case of the first hidden layer) are multiplied by the weights of the current layer.
2. A bias is added.
3. The resulting weighted sum is passed through an activation function to produce the output (or activation) of the layer.

The forward pass through the entire network applies the operations introduced above at every layer $ l = 1, \dots, L $:

$$
Z^{[l]} = W^{[l]} A^{[l-1]} + B^{[l]}
$$
$$
A^{[l]} = \sigma(Z^{[l]})
$$

Where:
- $ A^{[0]} = X $ (the input features).
- $ A^{[L]} $ (where $ L $ is the total number of layers) represents the final output of the network.
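
The layer-by-layer computation can be sketched as a short loop. This example assumes sigmoid activations everywhere, stores each $ W^{[l]} $ with shape (units in layer $ l $, units in layer $ l-1 $) so that it matches the formula directly, and keeps one example per column of $ X $. The layer sizes and variable names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Return the lists [Z^[1], ..., Z^[L]] and [A^[1], ..., A^[L]]."""
    Zs, As = [], []
    A = X  # A^[0] = X
    for W, B in zip(weights, biases):
        Z = W @ A + B  # B has shape (n_out, 1) and broadcasts over the m columns
        A = sigmoid(Z)
        Zs.append(Z)
        As.append(A)
    return Zs, As

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]  # hypothetical: 4 inputs, hidden layers of 5 and 3 units, 1 output
m = 6                 # number of examples

weights = [rng.normal(0.0, 0.01, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]
X = rng.normal(size=(sizes[0], m))

Zs, As = forward(X, weights, biases)
print([A.shape for A in As])  # [(5, 6), (3, 6), (1, 6)]
```

Each activation matrix has one row per unit in the layer and one column per example, which is a useful shape check when debugging.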

#### 1.1.3. Loss Function and Cost Function

In machine learning, the loss function measures how well the network’s predictions match the actual labels. For a binary classification problem, the **binary cross-entropy loss** is commonly used:

$$
\mathcal{L}(y, \hat{y}) = - \left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)
$$

Where:
- $ y $ is the true label (0 or 1).
- $ \hat{y} $ is the predicted probability from the model (between 0 and 1).

The **cost function** is the average of the loss over all examples in the dataset:

$$
J(W, B) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(y^{(i)}, \hat{y}^{(i)})
$$

Where:
- $ m $ is the number of training examples.
- $ y^{(i)} $ is the true label for the $ i $-th training example.
- $ \hat{y}^{(i)} $ is the network’s predicted output for the $ i $-th training example.
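
A minimal NumPy sketch of the cost above is shown below. The clipping constant `eps` is an addition that is not part of the formula; it is assumed here only to avoid evaluating $ \log(0) $ when a prediction is exactly 0 or 1.

```python
import numpy as np

def binary_cross_entropy_cost(y, y_hat, eps=1e-12):
    """Average binary cross-entropy loss over all examples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # assumption: guard against log(0)
    losses = -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    return float(losses.mean())

y = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.9, 0.1])  # close to the true labels
unsure = np.array([0.5, 0.5, 0.5, 0.5])     # no information

print("confident:", binary_cross_entropy_cost(y, confident))
print("unsure:   ", binary_cross_entropy_cost(y, unsure))
```

Predictions close to the true labels give a lower cost than uninformative predictions of 0.5, whose cost is $ \log 2 $.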

#### 1.1.4. Backpropagation

Backpropagation is the algorithm used to compute the gradients of the cost function with respect to the weights and biases. These gradients are then used to update the parameters via gradient descent.

The key idea of backpropagation is to propagate the error from the output layer back through the network, calculating how much each weight contributed to the error. This is done using the **chain rule of calculus**.

##### Backpropagation Steps:

1. **Calculate the error at the output layer**. For a sigmoid output combined with the cross-entropy loss, this is the difference between the predicted output and the true label:
   $$
   \delta^{[L]} = A^{[L]} - Y
   $$

2. **Propagate the error backward** to calculate the error at each hidden layer:
   $$
   \delta^{[l]} = \left( \left( W^{[l+1]} \right)^T \delta^{[l+1]} \right) \odot \sigma'(Z^{[l]})
   $$
   Where $ \sigma'(Z^{[l]}) $ is the derivative of the activation function for layer $ l $ and $ \odot $ denotes the element-wise product.

3. **Compute the gradients** for the weights and biases:
   $$
   \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m} \delta^{[l]} (A^{[l-1]})^T
   $$
   $$
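   \frac{\partial J}{\partial B^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} \delta^{[l](i)}
   $$
   Where $ \delta^{[l](i)} $ is the column of $ \delta^{[l]} $ that corresponds to the $ i $-th training example.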

The term $ \delta^{[l]} $ is often referred to as the **error term** or **delta** for layer $ l $, and it captures how much the neurons in that layer contributed to the final error.
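
One way to convince yourself that these formulas are correct is to compare them with a numerical derivative. The sketch below assumes a two-layer network with sigmoid activations and the binary cross-entropy cost, with weights stored as (units out, units in); all names and sizes are illustrative. It computes the gradients with the three steps above and checks one entry of $ \frac{\partial J}{\partial W^{[1]}} $ against a central finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, B1, W2, B2):
    A1 = sigmoid(W1 @ X + B1)
    A2 = sigmoid(W2 @ A1 + B2)
    return A1, A2

def cost(A2, Y):
    return float(np.mean(-(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))

def backward(X, Y, W2, A1, A2):
    m = X.shape[1]
    delta2 = A2 - Y                            # step 1: output error
    delta1 = (W2.T @ delta2) * A1 * (1 - A1)   # step 2: sigmoid'(Z1) = A1 * (1 - A1)
    dW2 = delta2 @ A1.T / m                    # step 3: gradients
    dB2 = delta2.sum(axis=1, keepdims=True) / m
    dW1 = delta1 @ X.T / m
    dB1 = delta1.sum(axis=1, keepdims=True) / m
    return dW1, dB1, dW2, dB2

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                    # 3 features, 5 examples
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
W1, B1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, B2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

A1, A2 = forward(X, W1, B1, W2, B2)
dW1, dB1, dW2, dB2 = backward(X, Y, W2, A1, A2)

# Central finite difference for the single entry W1[0, 0]
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (cost(forward(X, W1_plus, B1, W2, B2)[1], Y)
           - cost(forward(X, W1_minus, B1, W2, B2)[1], Y)) / (2 * eps)

print("backprop:", dW1[0, 0])
print("numeric: ", numeric)
```

The two values should agree to several decimal places. A large gap usually points to a missing transpose, a missing $ \frac{1}{m} $ factor, or a wrong activation derivative.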

#### 1.1.5. Gradient Descent

Once we have the gradients for each layer, we use **gradient descent** to update the weights and biases. The goal of gradient descent is to minimize the cost function $ J(W, B) $.

The update rule for gradient descent is:

$$
W^{[l]} := W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}
$$
$$
B^{[l]} := B^{[l]} - \alpha \frac{\partial J}{\partial B^{[l]}}
$$

Where $ \alpha $ is the **learning rate**, which controls how large the steps are that we take toward minimizing the cost function. A smaller learning rate may take longer to converge but provides more stable learning, while a larger learning rate can speed up training but risks overshooting the minimum.
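
The effect of the learning rate can be illustrated on a one-parameter toy problem. The sketch below assumes the cost $ J(w) = (w - 3)^2 $, chosen only for illustration, whose gradient is $ 2(w - 3) $ and whose minimum is at $ w = 3 $.

```python
def gradient_descent(alpha, w0=0.0, steps=50):
    """Run plain gradient descent on J(w) = (w - 3)^2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # dJ/dw
        w = w - alpha * grad    # update rule
    return w

for alpha in (0.01, 0.1, 1.1):
    print(f"alpha={alpha}: w after 50 steps = {gradient_descent(alpha)}")
```

With $ \alpha = 0.01 $ the parameter is still far from 3 after 50 steps, with $ \alpha = 0.1 $ it is very close to 3, and with $ \alpha = 1.1 $ every step overshoots by more than the previous one, so the iterates diverge.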

##### Steps in Gradient Descent:

1. **Initialize the weights** $ W $ and biases $ B $ randomly, typically with small values close to zero.
2. **Compute the forward pass** to get predictions $ A^{[L]} $.
3. **Calculate the loss** using the predictions and true labels.
4. **Backpropagate the error** to compute the gradients $ \frac{\partial J}{\partial W^{[l]}} $ and $ \frac{\partial J}{\partial B^{[l]}} $.
5. **Update the weights and biases** using the computed gradients.

Repeat the process for a certain number of **epochs** or until the cost function converges (i.e., the changes in the cost become negligibly small).

#### 1.1.6. Putting it All Together

Each training iteration consists of:
1. **Forward propagation** to compute the output of the network.
2. **Loss calculation** using the predicted output and true labels.
3. **Backpropagation** to compute the gradients.
4. **Gradient descent** to update the weights and biases.

After many iterations, the network should "learn" to make good predictions by minimizing the cost function.
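
The four steps can be sketched end to end on a toy problem. This example assumes a two-layer sigmoid network trained with the binary cross-entropy cost on synthetic, linearly separable data; the data, layer sizes, learning rate, and variable names are all made up for illustration and are independent of the class in section 2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(A, Y, eps=1e-12):
    A = np.clip(A, eps, 1 - eps)  # assumption: guard against log(0)
    return float(np.mean(-(Y * np.log(A) + (1 - Y) * np.log(1 - A))))

rng = np.random.default_rng(0)

# Synthetic data: 2 features, 200 examples, label 1 when x1 + x2 > 0
X = rng.normal(size=(2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)
m = X.shape[1]

# Small random weights, zero biases; weights are stored as (units out, units in)
W1, B1 = rng.normal(0.0, 0.1, size=(3, 2)), np.zeros((3, 1))
W2, B2 = rng.normal(0.0, 0.1, size=(1, 3)), np.zeros((1, 1))
alpha = 0.5
history = []

for epoch in range(500):
    # 1. Forward propagation
    A1 = sigmoid(W1 @ X + B1)
    A2 = sigmoid(W2 @ A1 + B2)
    # 2. Loss calculation
    history.append(cost(A2, Y))
    # 3. Backpropagation
    delta2 = A2 - Y
    delta1 = (W2.T @ delta2) * A1 * (1 - A1)
    dW2 = delta2 @ A1.T / m
    dB2 = delta2.mean(axis=1, keepdims=True)
    dW1 = delta1 @ X.T / m
    dB1 = delta1.mean(axis=1, keepdims=True)
    # 4. Gradient descent update
    W2 -= alpha * dW2
    B2 -= alpha * dB2
    W1 -= alpha * dW1
    B1 -= alpha * dB1

print(f"cost: {history[0]:.3f} -> {history[-1]:.3f}")
```

Recording the cost at every epoch, as `history` does here, is a simple way to check that training is making progress.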


# 2. Implementation
Now let us write a DNN class:

In [18]:
import numpy as np

class DenseNeuralNetwork:
    
    def __init__(self, layers:list, learning_rate:float, epochs:int):
        """
        layers: List of layer sizes that follow the input layer (hidden layer sizes, then the output layer size);
                the input layer size is taken from the data and prepended in initialize_layers()
        learning_rate: Learning rate for gradient descent
        epochs: Number of epochs to train the network
        """
        self.layers = layers
        self.learning_rate = learning_rate
        self.epochs = epochs

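    def initialize_layers(self, X_rows:int):
        """
        Initialize weights and biases for each layer.
        Note: this prepends X_rows to self.layers, so calling fit() more than
        once on the same instance adds an extra layer each time.
        """
        # Input layer size followed by hidden layers and output layer
        self.layers = [X_rows, *self.layers]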
        
        # Biases and weights initialization
        self.B = [np.random.randn(layer, 1)*0.01 for layer in self.layers[1:]]
        self.W = [np.random.randn(layer, next_layer)*0.01 for 
                  layer, next_layer in zip(self.layers[:-1], self.layers[1:])]

    def forward_propagation(self, X:np.array):
        """
        Perform forward propagation through the network
        """
        Z = []
        A = []
        
        for B, W in zip(self.B, self.W):
            # First layer uses input X
            temp_z = np.dot(W.T, A[-1]) + B if Z else np.dot(W.T, X) + B
            temp_a = self._sigmoid(temp_z)
            Z.append(temp_z)
            A.append(temp_a)
            
        return Z, A
    
    def backward_propagation(self, Z:list, A:list, y:np.array, X:np.array):
        """
        Perform backpropagation and compute gradients for weights and biases
        """
        dB = [np.zeros(B.shape) for B in self.B]
        dW = [np.zeros(W.shape) for W in self.W]
        
        last_layer = len(self.layers)-2
        deltas = []
        
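        for layer in range(last_layer, -1, -1):
            # Output layer: (A - y) * sigmoid'(Z) is the delta for a squared-error loss;
            # with the cross-entropy loss of section 1.1.3 it would reduce to A - y.
            # Hidden layers: propagate the next layer's delta back through its weights.
            delta = (self._dsigmoid(Z[layer]) * np.dot(self.W[layer+1], deltas[-1]) if layer != last_layer
                     else (A[layer]-y) * self._dsigmoid(Z[layer]))
            deltas.append(delta)
            # dB keeps one column per example; gradient_descent() averages it over the batch
            dB[layer] = delta
            # dW is summed over the examples here, not divided by m as in section 1.1.4
            dW[layer] = np.dot(A[layer-1], delta.T) if layer != 0 else np.dot(X, delta.T)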
            
        return dB, dW
            
    def gradient_descent(self, X:np.array, y:np.array):
        """
        Update weights and biases using gradient descent
        """
        Z, A = self.forward_propagation(X)
        
        dBs, dWs = self.backward_propagation(Z, A, y, X)
        self.B = [b - self.learning_rate*np.mean(dB, axis=1, keepdims=True) for b, dB in zip(self.B, dBs)]
        self.W = [w - self.learning_rate*dW for w, dW in zip(self.W, dWs)]
        
    def fit(self, X:np.array, y:np.array):
        """
        Train the neural network on the dataset
        """
        self.initialize_layers(X.shape[0])
        
        for i in range(self.epochs):
            self.gradient_descent(X, y)
            
    def predict(self, X:np.array):
        """
        Make predictions on new data
        """
        _, A = self.forward_propagation(X)
        return np.argmax(A[-1], axis=0)
    
    def _sigmoid(self, z:np.array):
        """
        Sigmoid activation function
        """
        return 1.0 / (1.0 + np.exp(-z))
    
    def _dsigmoid(self, z:np.array):
        """
        Derivative of the sigmoid activation function
        """
        s = self._sigmoid(z)  # evaluate the sigmoid once and reuse it
        return s * (1 - s)

### 2.1. Example: Classification
We will now train the network on a simple classification task with the Wine dataset from sklearn.

In [19]:
# Importing required libraries
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Load the Wine dataset from sklearn
wine = datasets.load_wine()
X = wine.data
y = wine.target.reshape(-1, 1)  # Reshape to a column vector

# Normalize the data (feature scaling)
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)  # Standardize features to have mean=0 and variance=1

# One-hot encode the labels
encoder = OneHotEncoder(sparse_output=False)
y_encoded = encoder.fit_transform(y)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Initialize the Dense Neural Network
# Layer sizes passed to the class. fit() prepends the input size (13 features),
# so the resulting network is 13 -> 13 -> 20 -> 10 -> 3: three hidden layers
# (13, 20, and 10 neurons) and 3 output neurons (one per class)
layers = [13, 20, 10, 3]
learning_rate = 0.1
epochs = 1000

dnn = DenseNeuralNetwork(layers=layers, learning_rate=learning_rate, epochs=epochs)

# Train the neural network
dnn.fit(X_train.T, y_train.T)

# Predict on the test set
y_pred = dnn.predict(X_test.T)

# Convert one-hot encoded test labels back to original form
y_test_labels = np.argmax(y_test, axis=1)

# Calculate accuracy
accuracy = np.mean(y_pred == y_test_labels) * 100
print(f"Test Accuracy: {accuracy:.2f}%")

Test Accuracy: 77.78%
