
# Neural Networks and Deep Learning: From Theory to Implementation

This Jupyter Notebook is based on Chapter 3 of the book **[Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap3.html)** by **Michael Nielsen**. It is a hands-on guide to implementing a neural network from scratch, covering **weight initialization**, **cost functions**, **gradient descent**, and **regularization**.

---

## Table of Contents
1. [Introduction](#introduction)  
2. [Weight Initialization](#weight-initialization)  
3. [Cost Functions](#cost-functions)  
4. [Gradient Descent and Backpropagation](#gradient-descent-and-backpropagation)  
5. [Regularization](#regularization)  
6. [Training a Neural Network](#training-a-neural-network)  
7. [Conclusion](#conclusion)  

---

## 1. Introduction <a id="introduction"></a>

In this notebook, we implement a feedforward neural network with sigmoid neurons from scratch using Python and NumPy, then train it to classify handwritten digits from MNIST. We focus on the following key concepts:

- **Weight Initialization**: How to initialize weights and biases so that neurons do not start out saturated.
- **Cost Functions**: Understanding cross-entropy and quadratic cost functions.
- **Gradient Descent**: Computing gradients with the backpropagation algorithm and using them to update weights and biases.
- **Regularization**: Using L2 regularization to reduce overfitting.

Let's get started by importing the necessary libraries.

```python
import numpy as np
import random
import matplotlib.pyplot as plt
```

---

## 2. Weight Initialization <a id="weight-initialization"></a>

Proper weight initialization is crucial for training neural networks. We will implement two initialization methods:

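1. **Default Weight Initializer**: Draws each weight from a Gaussian distribution with mean 0 and standard deviation \(1/\sqrt{n}\), where \(n\) is the number of input connections to the neuron. This keeps the weighted input small enough that sigmoid neurons are unlikely to start out saturated.
2. **Large Weight Initializer**: Draws each weight from a Gaussian distribution with mean 0 and standard deviation 1. It is included for comparison.

In both methods, biases are drawn from a Gaussian distribution with mean 0 and standard deviation 1.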

```python
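class Network:
    def __init__(self, sizes, cost):
        """Create a network with `sizes[i]` neurons in layer i.

        `cost` is a cost class such as `CrossEntropyCost` that provides
        `fn` and `delta`.
        """
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost = cost

    def default_weight_initializer(self):
        """Draw weights with mean 0 and standard deviation 1/sqrt(x), where x
        is the number of inputs to the neuron, and biases with mean 0 and
        standard deviation 1. The input layer has no biases."""
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x) / np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def large_weight_initializer(self):
        """Draw weights and biases with mean 0 and standard deviation 1.
        Useful mainly for comparison with the default initializer."""
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]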
```
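To see why the scaling matters, the following sketch compares the spread of a single neuron's weighted input \(z = w \cdot x + b\) under both schemes. It is an illustration only: the input size `n_in`, the number of trials, and the choice of binary inputs with half of them switched on are assumptions made for this demonstration, not part of the `Network` class.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_in = 1000      # hypothetical number of inputs to one neuron
n_trials = 2000  # number of independent weight draws

# Binary input vector with half of the entries switched on.
x = np.zeros(n_in)
x[: n_in // 2] = 1.0

# One row of weights per trial, under each initialization scheme.
w_large = rng.standard_normal((n_trials, n_in))
w_scaled = rng.standard_normal((n_trials, n_in)) / np.sqrt(n_in)
b = rng.standard_normal(n_trials)  # biases use standard deviation 1 in both schemes

# Weighted input z for every trial.
z_large = w_large @ x + b
z_scaled = w_scaled @ x + b

std_large = z_large.std()
std_scaled = z_scaled.std()
print(f"std of z, large initializer:   {std_large:.2f}")
print(f"std of z, default initializer: {std_scaled:.2f}")
```

With unscaled weights, \(z\) is spread over a range where the sigmoid is almost flat, so many neurons start out saturated. With the \(1/\sqrt{n}\) scaling, \(z\) stays close to the region where the sigmoid responds to changes in its input.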

---

## 3. Cost Functions <a id="cost-functions"></a>

We will implement two cost functions:

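1. **Quadratic Cost**: Half the squared Euclidean distance between the network output and the desired output.
2. **Cross-Entropy Cost**: The sum of \(-y \ln a - (1 - y) \ln(1 - a)\) over the output neurons. With sigmoid output neurons, its output-layer error is simply \(a - y\), without the \(\sigma'(z)\) factor that slows learning under the quadratic cost when a neuron is saturated.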

```python
class QuadraticCost:
    @staticmethod
    def fn(a, y):
        """Return the quadratic cost for output `a` and desired output `y`."""
        return 0.5 * np.linalg.norm(a - y)**2

    @staticmethod
    def delta(z, a, y):
        """Return the output-layer error (a - y) * sigmoid_prime(z)."""
        # sigmoid_prime is defined in Section 4; run that cell before calling this.
        return (a - y) * sigmoid_prime(z)

class CrossEntropyCost:
    @staticmethod
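    def fn(a, y):
        """Return the cross-entropy cost for output `a` and desired output `y`."""
        # nan_to_num turns the nan from 0 * log(0), which occurs when `a` and `y`
        # are both exactly 0 or both exactly 1 in some slot, into the correct value 0.0.
        return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))

    @staticmethod
    def delta(z, a, y):
        """Return the output-layer error a - y.

        `z` is unused; it is kept so both cost classes share one interface.
        """
        return (a - y)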
```
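The practical difference between the two `delta` methods shows up when an output neuron is confidently wrong. The sketch below evaluates both error terms for a single sigmoid neuron whose desired output is 0; the chosen values of `z` are arbitrary illustrations, and the two helper functions are redefined here only so the cell runs on its own.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    s = sigmoid(z)
    return s * (1.0 - s)

y = 0.0      # desired output
deltas = {}  # maps z to (quadratic delta, cross-entropy delta)

for z in (0.0, 2.0, 5.0):  # larger z means the neuron is more confidently wrong
    a = sigmoid(z)
    quadratic_delta = (a - y) * sigmoid_prime(z)  # same formula as QuadraticCost.delta
    cross_entropy_delta = a - y                   # same formula as CrossEntropyCost.delta
    deltas[z] = (quadratic_delta, cross_entropy_delta)
    print(f"z={z:.1f}  a={a:.4f}  quadratic={quadratic_delta:.4f}  "
          f"cross-entropy={cross_entropy_delta:.4f}")
```

As the neuron's output moves further from the target, the quadratic error term shrinks because \(\sigma'(z)\) approaches 0, while the cross-entropy error term keeps growing with the size of the mistake.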

---

## 4. Gradient Descent and Backpropagation <a id="gradient-descent-and-backpropagation"></a>

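Gradient descent minimizes the cost function by repeatedly moving the weights and biases a small step against the gradient. Backpropagation computes that gradient efficiently, working layer by layer from the output back to the input. The `backprop` function below takes `self` as its first argument because it is meant to be used as a method of `Network`.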
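For reference, the quantities involved can be written out as equations. This is a summary in the book's notation, where \(\odot\) is the elementwise product, \(L\) is the output layer, \(z^l\) and \(a^l\) are the weighted inputs and activations of layer \(l\), and \(\eta\) is the learning rate:

\[
\begin{aligned}
\delta^L &= \nabla_a C \odot \sigma'(z^L) \\
\delta^l &= \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \\
\frac{\partial C}{\partial b^l} &= \delta^l \\
\frac{\partial C}{\partial w^l} &= \delta^l \, (a^{l-1})^T \\
w^l &\rightarrow w^l - \eta \frac{\partial C}{\partial w^l}, \qquad
b^l \rightarrow b^l - \eta \frac{\partial C}{\partial b^l}
\end{aligned}
\]

The first equation is what the `delta` method of a cost class returns; for the cross-entropy cost with sigmoid outputs it reduces to \(\delta^L = a^L - y\). The second equation is the recursion in the backward pass, and the third and fourth give the entries of `nabla_b` and `nabla_w`.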

```python
def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))

def backprop(self, x, y):
    """Return the gradient for the cost function C_x."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    
    # Feedforward
    activation = x
    activations = [x]
    zs = []
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    
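    # Backward pass: output-layer error first, then its gradients
    delta = self.cost.delta(zs[-1], activations[-1], y)
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())

    # l counts layers from the end: l = 2 is the second-to-last layer.
    # Negative indices let the same loop handle any number of layers.
    for l in range(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())

    return nabla_b, nabla_w

# backprop takes `self`, so bind it to the class before calling net.backprop(x, y)
Network.backprop = backprop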
```
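A common sanity check for a backpropagation implementation is to compare one analytic gradient entry with a finite-difference estimate. The sketch below does this for a tiny network with the cross-entropy cost. It is self-contained and therefore repeats the forward and backward steps as free functions rather than calling `Network`; the layer sizes, the seed, and the step size `eps` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(weights, biases, x):
    """Return the network output for input column vector x."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

def cross_entropy(a, y):
    return np.sum(-y * np.log(a) - (1 - y) * np.log(1 - a))

def backprop_weights(weights, biases, x, y):
    """Return the weight gradients of the cross-entropy cost for one example."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        zs.append(w @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    delta = activations[-1] - y  # output error for cross-entropy with sigmoid outputs
    nabla_w = [None] * len(weights)
    nabla_w[-1] = delta @ activations[-2].T
    for l in range(2, len(weights) + 1):
        s = sigmoid(zs[-l])
        delta = (weights[-l + 1].T @ delta) * s * (1 - s)
        nabla_w[-l] = delta @ activations[-l - 1].T
    return nabla_w

rng = np.random.default_rng(1)
sizes = [4, 3, 2]  # hypothetical tiny network
weights = [rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = rng.standard_normal((4, 1))
y = np.array([[1.0], [0.0]])

# Analytic derivative of the cost with respect to the first weight of layer 1.
analytic = backprop_weights(weights, biases, x, y)[0][0, 0]

# Central finite difference for the same weight.
eps = 1e-6
w_plus = [w.copy() for w in weights]
w_minus = [w.copy() for w in weights]
w_plus[0][0, 0] += eps
w_minus[0][0, 0] -= eps
numeric = (cross_entropy(feedforward(w_plus, biases, x), y)
           - cross_entropy(feedforward(w_minus, biases, x), y)) / (2 * eps)

print(f"analytic: {analytic:.8f}")
print(f"numeric:  {numeric:.8f}")
```

If the two numbers disagree beyond the finite-difference error, the usual suspects are an off-by-one in the layer indexing or a missing transpose.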

---

## 5. Regularization <a id="regularization"></a>

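Regularization reduces overfitting. We will implement **L2 regularization** (also called weight decay), which adds a penalty proportional to the sum of the squared weights to the cost function. Biases are not penalized. In the code below, `lmbda` is the regularization parameter and `n` is the size of the full training set.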
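Written out, with \(C_0\) the unregularized cost, \(\lambda\) the regularization parameter, \(n\) the training set size, \(m\) the mini-batch size, and \(\eta\) the learning rate, the regularized cost and the resulting update rules are:

\[
C = C_0 + \frac{\lambda}{2n} \sum_w w^2
\]

\[
w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w},
\qquad
b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b}
\]

The factor \(1 - \eta\lambda/n\) shrinks every weight slightly on each step, which is where the name weight decay comes from. The bias update is the same as without regularization.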

```python
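def update_mini_batch(self, mini_batch, eta, lmbda, n):
    """Apply one gradient descent step with L2 regularization.

    `mini_batch` is a list of (x, y) tuples, `eta` is the learning rate,
    `lmbda` is the regularization parameter, and `n` is the total size
    of the training set.
    """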
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    
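    # Weight decay: scale each weight by (1 - eta * lmbda / n), then take the
    # averaged gradient step. Biases are not regularized.
    self.weights = [(1 - eta * (lmbda / n)) * w - (eta / len(mini_batch)) * nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - (eta / len(mini_batch)) * nb
                   for b, nb in zip(self.biases, nabla_b)]

# update_mini_batch takes `self`, so bind it to the class before use
Network.update_mini_batch = update_mini_batch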
```
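To get a feel for the size of the decay factor `1 - eta * (lmbda / n)`, the sketch below plugs in the hyperparameters used in the training cell later in this notebook. The training set size of 50,000 is an assumption based on the MNIST split produced by the book's loader.

```python
eta = 0.5             # learning rate
lmbda = 5.0           # regularization parameter
n = 50000             # assumed training set size
mini_batch_size = 10

# Factor applied to every weight on each mini-batch update.
decay = 1 - eta * (lmbda / n)

# Number of updates in one pass over the training data.
updates_per_epoch = n // mini_batch_size

# Cumulative shrinkage over one epoch if the gradient term were zero.
shrink_per_epoch = decay ** updates_per_epoch

print(f"decay per update:  {decay}")
print(f"updates per epoch: {updates_per_epoch}")
print(f"shrink per epoch:  {shrink_per_epoch:.4f}")
```

A single update barely changes the weights, but over the thousands of updates in one epoch the decay compounds into a substantial pull toward zero, which the gradient term has to counteract for weights that matter.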

---

## 6. Training a Neural Network <a id="training-a-neural-network"></a>

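We will now train a neural network on the MNIST dataset of handwritten digits. The network has 784 input neurons (one per pixel of a 28x28 image), one hidden layer with 30 neurons, and 10 output neurons (one per digit). The training cell relies on two things that come from the book's code repository rather than from this notebook: the `mnist_loader` module and an `SGD` method on `Network`, as in the book's `network2.py`.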
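As a rough sketch, the `SGD` method called in the training cell could look like the following. It is loosely modeled on the book's `network2.py` but simplified: it only supports monitoring accuracy on the evaluation data, and it assumes that evaluation labels are integers, as produced by the book's loader for validation and test data. The `feedforward` and `accuracy` helpers are illustrative names introduced here.

```python
import random
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(self, a):
    """Return the network output for input column vector `a`."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

def accuracy(self, data):
    """Count the inputs whose most active output neuron matches the integer label."""
    return sum(int(np.argmax(feedforward(self, x)) == y) for x, y in data)

def SGD(self, training_data, epochs, mini_batch_size, eta, lmbda=0.0,
        evaluation_data=None, monitor_evaluation_accuracy=False):
    """Train with mini-batch stochastic gradient descent.

    Returns the list of per-epoch correct counts on `evaluation_data`
    (empty if monitoring is off).
    """
    training_data = list(training_data)  # the loader may return an iterator
    if evaluation_data is not None:
        evaluation_data = list(evaluation_data)
    n = len(training_data)
    evaluation_accuracy = []
    for epoch in range(epochs):
        # Reshuffle so that each epoch sees different mini-batches.
        random.shuffle(training_data)
        mini_batches = [training_data[k:k + mini_batch_size]
                        for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta, lmbda, n)
        if monitor_evaluation_accuracy and evaluation_data is not None:
            correct = accuracy(self, evaluation_data)
            evaluation_accuracy.append(correct)
            print(f"Epoch {epoch}: {correct} / {len(evaluation_data)}")
    return evaluation_accuracy
```

To use it with the class in this notebook, one option is to attach it with `Network.SGD = SGD`, provided that `backprop` and `update_mini_batch` are also available as methods of `Network`.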

```python
# Load the MNIST dataset (mnist_loader ships with the book's code repository
# and must be on the import path)
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

# Initialize the network
net = Network([784, 30, 10], cost=CrossEntropyCost)

# Train the network
net.SGD(training_data, epochs=30, mini_batch_size=10, eta=0.5, lmbda=5.0, 
        evaluation_data=validation_data, monitor_evaluation_accuracy=True)
```

---

## 7. Conclusion <a id="conclusion"></a>

In this notebook, we implemented a neural network from scratch and trained it on the MNIST dataset. We covered key concepts such as weight initialization, cost functions, gradient descent, and regularization. This implementation provides a solid foundation for understanding how neural networks work and can be extended to more complex architectures.

For further reading, check out the book **[Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap3.html)** by **Michael Nielsen**.

---

**References**:
- Michael Nielsen, *Neural Networks and Deep Learning*, 2015.
- [Machine Learning course by Andrew Ng](https://www.coursera.org/learn/machine-learning)