# Neural Networks introduction
## I) Concept and Theory

Neural Networks (NNs) are a subset of machine learning models inspired by the structure and function of the human brain. They are composed of interconnected nodes or "neurons," which are organized into layers. These networks are particularly powerful for capturing complex patterns in data and are widely used in applications such as image recognition, natural language processing, and predictive analytics.

### a) Perceptron Definition

A neuron is the fundamental unit of a neural network. Each neuron receives one or more inputs, processes them, and produces an output. Mathematically, a neuron can be represented as:

$$
y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)
$$

- $x_i$: Input features
- $w_i$: Weights associated with each input
- $b$: Bias term
- $f$: Activation function
- $y$: Output of the neuron

Activation functions are usually chosen for properties such as non-linearity, differentiability, and monotonicity, which allow the network to be trained with gradients and to capture complex patterns in the data.
<br><br>
The simplest form of neural network is the single-layer perceptron, which looks like this:
<br><br>
<img src="static/SingleLayerPerceptron.png" height="40%" width="40%">
<br>
Let's do a simple example following the picture (we took arbitrary numbers and functions) :
 - Input vector : $x=(x_1,x_2,x_3)=(1,2,3)$
 - Weights : $W=(w_1,w_2,w_3)=(0.5,0.3,2)$
 - Bias : $b=0$ 
 - Activation function : $f=f(x) = \frac{1}{1 + e^{-x}}$ (the sigmoid function)

The neuron's output is $a_1=f(x_1 \cdot w_1+x_2 \cdot w_2+x_3 \cdot w_3+b)=f(7.1)\approx 0.99$.
<br>
This is called the <b>Feed Forward Process</b>.
<br><br>
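
The forward pass above can be checked in a few lines of NumPy. This is a minimal sketch that reuses the example's inputs, weights and bias; the variable names are illustrative only. The exact sigmoid value is about 0.9992, which the text approximates as 0.99.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])   # input vector from the example
w = np.array([0.5, 0.3, 2.0])   # weights from the example
b = 0.0                         # bias from the example

z = np.dot(w, x) + b            # weighted sum: 0.5 + 0.6 + 6.0 = 7.1
a1 = sigmoid(z)                 # output of the neuron

print(f"z = {z:.1f}, a1 = {a1:.4f}")   # z = 7.1, a1 = 0.9992
```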

### b) Training a Perceptron

Training a neural network involves adjusting the weights and biases to minimize the difference between the predicted output and the actual target.
The loss function quantifies the error in the predictions. Common choices include the Mean Squared Error (MSE), $\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2$, for regression tasks and the Cross-Entropy Loss, $-\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$, for classification tasks, where $N$ is the number of observations and $C$ the number of classes. The goal is to minimize this function across all observations.
<br><br>
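
As an illustration, both losses can be written directly from their formulas. The sketch below assumes one-hot encoded targets for the cross-entropy and uses made-up sample values; the `eps` clipping is an added safeguard against `log(0)`, not part of the formula.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """MSE: average squared difference over the N samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one-hot targets and predicted probabilities, both of shape (N, C)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Regression: three made-up targets and predictions
mse = mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(f"MSE = {mse:.4f}")            # (0.25 + 0 + 1) / 3, about 0.4167

# Classification: two made-up samples, three classes
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
ce = cross_entropy(y_true, y_pred)
print(f"Cross-entropy = {ce:.4f}")   # -(ln 0.7 + ln 0.8) / 2, about 0.2899
```
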
Backpropagation is the algorithm used to compute the gradients needed to update the network's weights. It computes the gradient of the loss function with respect to each weight using the chain rule; the weights are then updated in the direction that reduces the loss.

Mathematically, the weight update rule can be expressed as:

$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial \mathcal{L}}{\partial w_{ij}}$

Where:
- $w_{ij}$ is the weight connecting neuron $j$ to neuron $i$
- $\eta$ is the learning rate
- $\mathcal{L}$ is the loss function

Gradient Descent is an optimization algorithm used to minimize the loss function. In each iteration, the algorithm updates the weights by moving them in the direction opposite to the gradient.
<br><br>
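
To make the update rule concrete, here is a minimal sketch of gradient descent on a toy one-dimensional loss $\mathcal{L}(w)=(w-3)^2$. The loss, the starting point and the function name `gradient_descent` are assumptions chosen for illustration, not part of the perceptron example.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=50):
    """Repeatedly move w against the gradient: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Toy loss L(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(f"w after 50 steps: {w_final:.4f}")
```

Each step shrinks the distance to the minimum by a constant factor here, so the iterate approaches 3 without ever jumping past it at this learning rate.
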
Let's do a simple example of backpropagation where :
 - Input vector : $x=(x_1,x_2,x_3)=(1,2,3)$
 - Weights : $W=(w_1,w_2,w_3)=(0.5,0.3,2)$
 - Target : $y=(0.2)$ 
 - Activation function : $f=f(x) = \frac{1}{1 + e^{-x}}$ (the sigmoid function)
 - Loss function : $\mathcal{L} =0.5*(\hat{y}-y)^2$
 - Learning Rate : $\eta=0.1$

 <br>

 1. Compute Forward Pass : $\hat{y}=0.99$ (using the approximation from section a)
 2. Compute the Loss : $\mathcal{L} = 0.5 (0.99-0.2)^2=0.31205$
 3. Compute the Gradient of the Loss : $\frac{\partial \mathcal{L}}{\partial \hat{y}}=\hat{y}-y=0.79$
 4. Compute the Derivative of the Activation with respect to the Weighted Sum : with $z=\sum_{i=1}^{n} w_i x_i=7.1$ and $f'(z)=\frac{e^{-z}}{(1 + e^{-z})^2}$, we get $\frac{\partial \hat{y}}{\partial z}=f'(7.1)=0.000823745$
 5. Compute the Gradient of the Loss for each weight : <br>

 Use the chain rule to compute the gradient of the loss with respect to each weight : $\frac{\partial \mathcal{L}}{\partial w_{i}}=\frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_{i}}$ and $\frac{\partial z}{\partial w_{i}}=x_i$, so we get $\frac{\partial \mathcal{L}}{\partial w_{i}}=(\hat{y}-y) \cdot f'(z) \cdot x_i$
 <br>
 $\frac{\partial \mathcal{L}}{\partial W}=(0.00065075855,0.0013015171,0.00195227565)$

 6. Update the Weights : Using the formula $w_{ij} \leftarrow w_{ij} - \eta \frac{\partial \mathcal{L}}{\partial w_{ij}}$, we get $W_{new}=(0.499934924145,0.29986984829,1.999804772435)$
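
The six steps can be reproduced numerically. The sketch below follows the text in using the rounded forward-pass value $\hat{y}=0.99$ for the loss and its gradient, while evaluating $f'(z)$ exactly at $z=7.1$; the variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # input vector
w = np.array([0.5, 0.3, 2.0])    # initial weights
y = 0.2                          # target
eta = 0.1                        # learning rate

z = np.dot(w, x)                             # weighted sum, 7.1 (no bias)
y_hat = 0.99                                 # step 1: rounded value used in the text
loss = 0.5 * (y_hat - y) ** 2                # step 2
dL_dyhat = y_hat - y                         # step 3
dyhat_dz = sigmoid(z) * (1.0 - sigmoid(z))   # step 4: equals e^{-z} / (1 + e^{-z})^2
grad_w = dL_dyhat * dyhat_dz * x             # step 5: chain rule, dz/dw_i = x_i
w_new = w - eta * grad_w                     # step 6: gradient descent update

print(f"loss   = {loss:.5f}")
print(f"f'(z)  = {dyhat_dz:.9f}")
print(f"grad_w = {grad_w}")
print(f"w_new  = {w_new}")
```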

You can iterate these steps; note that generally, we stop the iterations after a certain number of steps or when the loss function reaches a specified value. Additionally, for efficiency in terms of time and computational resources, we often use optimizers that implement stochastic gradient descent.
<br><br>
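
The two stopping rules mentioned above (a step budget and a target loss) might be combined as in the following sketch. It reuses the worked example's data; the function name `train_until` and the thresholds `max_steps` and `target_loss` are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_until(x, y, w, eta=0.1, max_steps=10_000, target_loss=1e-3):
    """Repeat forward pass, gradient and update until the loss is small enough
    or the step budget is used up. Returns (weights, loss, steps taken)."""
    w = np.array(w, dtype=float)             # work on a copy
    for step in range(max_steps):
        y_hat = sigmoid(np.dot(w, x))        # forward pass (no bias)
        loss = 0.5 * (y_hat - y) ** 2
        if loss <= target_loss:              # stopping rule 1: loss target reached
            return w, loss, step
        # chain rule: (y_hat - y) * f'(z) * x, with f'(z) = y_hat * (1 - y_hat)
        w -= eta * (y_hat - y) * y_hat * (1.0 - y_hat) * x
    # stopping rule 2: step budget exhausted, report the loss at the final weights
    y_hat = sigmoid(np.dot(w, x))
    return w, 0.5 * (y_hat - y) ** 2, max_steps

x = np.array([1.0, 2.0, 3.0])
w0 = np.array([0.5, 0.3, 2.0])
w_final, final_loss, steps = train_until(x, y=0.2, w=w0)
print(f"stopped after {steps} steps with loss {final_loss:.6f}")
```

Because the sigmoid is nearly flat at $z=7.1$, the first updates are tiny and progress is slow at the start; this is one reason the number of steps needed can be much larger than the single update shown in the example.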

## II) Building a Perceptron from scratch in Python

```python
import numpy as np

# Activation function (sigmoid) and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Loss function: half the mean squared error, matching L = 0.5 * (y_hat - y)^2 above
def mean_squared_error(y_true, y_pred):
    return 0.5 * np.mean((y_true - y_pred) ** 2)

# Single perceptron class
class Perceptron:
    def __init__(self, input_size, learning_rate=0.01):
        # Random initial weights and bias, drawn uniformly from [0, 1)
        self.weights = np.random.rand(input_size)
        self.bias = np.random.rand(1)
        self.learning_rate = learning_rate

    def forward(self, x):
        # Weighted sum followed by the activation; z is returned for the backward pass
        z = np.dot(x, self.weights) + self.bias
        return sigmoid(z), z

    def backward(self, x, y_true, y_pred, z):
        # Compute the gradients
        loss_derivative = y_pred - y_true
        z_derivative = sigmoid_derivative(z)

        # Gradient with respect to weights and bias
        weight_gradients = loss_derivative * z_derivative * x
        bias_gradient = loss_derivative * z_derivative

        # Update weights and bias
        self.weights -= self.learning_rate * weight_gradients
        self.bias -= self.learning_rate * bias_gradient

    def train(self, x, y_true):
        # Forward pass
        y_pred, z = self.forward(x)

        # Compute loss (for monitoring purposes)
        loss = mean_squared_error(y_true, y_pred)

        # Backward pass (update weights and bias)
        self.backward(x, y_true, y_pred, z)

        return loss

# Example usage
if __name__ == "__main__":
    # Input data (single sample with 3 features)
    x = np.array([0.5, 0.3, 0.2])

    # True label
    y_true = np.array([0])

    # Create perceptron
    perceptron = Perceptron(input_size=3, learning_rate=0.1)
    y_pred, _ = perceptron.forward(x)
    print(f"Predicted output before training: {y_pred}\n")

    # Train the perceptron with the sample
    for epoch in range(100):  # Train for 100 epochs
        loss = perceptron.train(x, y_true)
        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.4f}")

    # Test the perceptron after training
    y_pred, _ = perceptron.forward(x)
    print(f"\nPredicted output after training : {y_pred}")
```

```
Predicted output before training: [0.69243671]

Epoch 0, Loss: 0.2397
Epoch 10, Loss: 0.2095
Epoch 20, Loss: 0.1800
Epoch 30, Loss: 0.1526
Epoch 40, Loss: 0.1286
Epoch 50, Loss: 0.1082
Epoch 60, Loss: 0.0913
Epoch 70, Loss: 0.0776
Epoch 80, Loss: 0.0666
Epoch 90, Loss: 0.0576

Predicted output after training : [0.31738198]
```



## III) Other definitions

Some definitions:

**Input Layer**:
- The input layer is where data is fed into the model, typically from sources such as CSV files, images, or audio. This layer simply passes the data on without any processing. Together with the output layer, it is a visible layer of the network, in contrast to the hidden layers.

**Hidden Layers**:
- Hidden layers are the core of deep learning. These intermediate layers perform computations to extract features from the data. Multiple interconnected hidden layers can identify various features at different levels of complexity. For instance, in image processing, early hidden layers might detect edges and shapes, while later layers might recognize entire objects like cars or buildings.

**Output Layer**:
- The output layer receives input from the preceding hidden layers and produces the final prediction based on the model's training. For regression or binary classification, the output layer often consists of a single node; for multi-class classification it typically has one node per class. Its structure can vary depending on the specific problem and model design.
<br><br>
<img src="static/Deep-Neural-Network-architecture.ppm" height="40%" width="40%" />
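
As a rough illustration of how the three kinds of layers fit together, the sketch below runs a forward pass through a tiny network. The layer sizes (3 inputs, 4 hidden units, 1 output), the random weights and the sigmoid activations are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input layer to hidden layer
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hidden layer to output layer

def forward(x):
    """The input layer passes x through unchanged; the hidden and output layers transform it."""
    h = sigmoid(W1 @ x + b1)        # hidden layer activations (extracted features)
    y_hat = sigmoid(W2 @ h + b2)    # output layer: final prediction
    return h, y_hat

x = np.array([1.0, 2.0, 3.0])
h, y_hat = forward(x)
print(h.shape, y_hat.shape)         # (4,) (1,)
```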

**Epochs**: 
- An epoch is one complete pass through the entire training dataset. Multiple epochs allow the model to learn better, but too many can cause overfitting.

**Batch Size**: 
- The number of data points the model processes before updating its parameters. Smaller batches give noisier updates, while larger batches give more stable updates but require more computation per update. Together, epochs and batch size help balance the efficiency and effectiveness of training a model.
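
The relationship between epochs, batch size and the number of parameter updates can be sketched as follows. The helper `iterate_minibatches` and the dataset size are hypothetical; a real training loop would compute gradients on each batch where the comment indicates.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover every sample exactly once, in shuffled order."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
n_samples, batch_size, n_epochs = 10, 4, 3

updates = 0
for epoch in range(n_epochs):                   # one epoch = one full pass over the data
    for batch_idx in iterate_minibatches(n_samples, batch_size, rng):
        # a training loop would compute gradients on the samples in batch_idx
        # and update the parameters here
        updates += 1

print(updates)   # 3 batches per epoch (4 + 4 + 2 samples) * 3 epochs = 9
```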

**Optimizer**:
- An optimizer in neural networks is an algorithm that adjusts the model's weights and biases during training to minimize the loss function, improving the model's accuracy. Common optimizers include Stochastic Gradient Descent (SGD) and Adam.
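
To show what an optimizer does with a gradient, here is a minimal sketch of the SGD and Adam update rules applied to a toy loss $\mathcal{L}(w)=(w-3)^2$. The class interface (`step`), the toy loss and the learning rate of 0.1 are assumptions made for illustration and do not mirror any particular library's API.

```python
import numpy as np

class SGD:
    """Plain gradient descent step: w <- w - lr * grad."""
    def __init__(self, lr=0.1):
        self.lr = lr

    def step(self, w, grad):
        return w - self.lr * grad

class Adam:
    """Adam step: rescales the update using running averages of the gradient
    and of its square, with bias correction for the early steps."""
    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0   # running average of gradients
        self.v = 0.0   # running average of squared gradients
        self.t = 0     # step counter used for bias correction

    def step(self, w, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def minimize(optimizer, w0, steps=500):
    """Minimize the toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = optimizer.step(w, 2.0 * (w - 3.0))
    return w

w_sgd = minimize(SGD(lr=0.1), w0=[0.0])
w_adam = minimize(Adam(lr=0.1), w0=[0.0])
print(w_sgd, w_adam)   # both should end up close to the minimum at w = 3
```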

