# Neural Network


### Problem Statement
The **handwritten digit recognition problem** is a **classification task** where the goal is to correctly identify digits (0-9) from images of handwritten numbers. In the context of **neural networks**, this problem is typically approached using **supervised learning**, where a model learns from labeled images.

### Problem Definition:
Given an input image \( X \), the neural network must output a predicted digit \( y \), where:

$ f(X) = y, \quad y \in \{0, 1, 2, ..., 9\}$

The function \( f(X) \) is learned by training a neural network on a dataset of images with corresponding digit labels.

### Neural Network Approach:
1. Input Representation
   - The input is a grayscale or color image of a digit, often resized to a fixed dimension (e.g., 28×28 pixels in MNIST).  
   - The pixel values are normalized (e.g., scaled between 0 and 1 or standardized).

2. Model Architecture
   - **Fully Connected Neural Network (MLP)**: A simple feedforward network with input, hidden, and output layers.  
   - **Convolutional Neural Network (CNN)**: Uses convolutional layers to capture spatial features like edges, curves, and shapes, improving accuracy.  
   - **Recurrent Neural Network (RNN)**: Sometimes used when dealing with sequential handwritten strokes.

3. Training Process
   - The network learns by minimizing a loss function, such as **mean squared error (MSE)** or **cross-entropy loss**, using an optimizer like **SGD** or **Adam**.  
   - **Backpropagation** updates weights to improve prediction accuracy.

4. Output Layer and Classification
   - The final layer has **10 neurons**, each representing a digit (0-9).

### **Datasets Used:**
- **MNIST**: 60,000 training and 10,000 testing images (28x28 grayscale).
- **EMNIST**: Extended version of MNIST.
- **SVHN**: Street View House Numbers dataset with real-world images.


We will use NumPy for the matrix computations.

**Note**: This is not production-level code; it is for educational purposes only.


## Dataset source

The **MNIST (Modified National Institute of Standards and Technology) dataset** is a widely used benchmark dataset for handwritten digit recognition. It contains grayscale images of digits (0-9) and is commonly used to train and evaluate machine learning and neural network models.

1. Image Characteristics
   - Each image is **28×28 pixels** in size.  
   - The images are **grayscale** with pixel intensity values ranging from **0 (black) to 255 (white)**.  
   - The digits are **centered** and **normalized** within the images.

2. Dataset Composition
   - **Training set**: 60,000 images  
   - **Test set**: 10,000 images  
   - Each image is labeled with a digit from **0 to 9**.

3. Class Distribution
   - The dataset is **balanced**, meaning each digit (0-9) appears roughly the same number of times in both training and testing sets.

4. Format
   - The dataset is distributed in the **IDX file format** (a simple binary format, typically gzip-compressed).  
   - Can be loaded using machine learning libraries like **TensorFlow, PyTorch, Keras**, or via `sklearn.datasets.fetch_openml('mnist_784')`.

5. Preprocessing Requirements
   - The images are often **normalized** to values between **0 and 1** (by dividing pixel values by 255).  
   - Flattening is required for **fully connected networks** (reshaping to a 1D vector of 784 features).  
   - Noisy or augmented versions may be used to improve robustness.

6. Challenges
   - Some digits are difficult to distinguish due to **variations in handwriting styles**.  
   - The dataset is relatively simple, making it **less representative of real-world handwritten text recognition**.

7. Variants
   - **EMNIST**: An extension that includes letters and additional handwritten data.  
   - **Fashion-MNIST**: A similar dataset but with clothing images instead of digits.

The MNIST dataset remains a **standard benchmark** for testing neural network architectures, particularly **Convolutional Neural Networks (CNNs)**, and serves as a stepping stone for more complex image classification tasks.
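
The preprocessing steps listed above (scaling pixel values to the range 0 to 1 and flattening each image to 784 features) might look like the following minimal sketch. The random `images` array is a stand-in for real MNIST data, and the function name `preprocess` is illustrative rather than part of this notebook.

```python
import numpy as np


def preprocess(images):
    """Scale uint8 pixel values to [0, 1] and flatten each 28x28 image to 784 values."""
    images = np.asarray(images, dtype=np.float32) / 255.0
    return images.reshape(len(images), -1)


# Stand-in data: 5 fake 28x28 grayscale images (not real MNIST samples)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

flat = preprocess(images)
print(flat.shape)  # (5, 784)
```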


http://yann.lecun.com/exdb/mnist/

https://github.com/mbornet-hl/MNIST

https://github.com/rupakraj/machine-learning/raw/refs/heads/main/datasets/mnist.pkl.gz

In [None]:
!wget https://github.com/rupakraj/machine-learning/raw/refs/heads/main/datasets/mnist.pkl.gz

In [None]:
import pickle
import gzip


def load_data():
    # latin1 decoding is needed when a pickle written under Python 2
    # is read from Python 3.
    with gzip.open('mnist.pkl.gz', 'rb') as mnist_file:
        training_data, validation_data, test_data = pickle.load(
            mnist_file, encoding='latin1')
    return (training_data, validation_data, test_data)

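One-hot encode the label: the digit $j$ becomes a 10-dimensional column vector with 1 at index $j$ and 0 elsewhere. For example, the digit 3 becomes $[0,0,0,1,0,0,0,0,0,0]$.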


In [None]:
import numpy as np


def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

Prepare the dataset for training

In [None]:
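def load_data_wrapper():
    # Assumption: each split is an (images, labels) pair with 784 values per image.
    train_data, val_data, test_data = load_data()

    def as_columns(images):
        return [np.reshape(x, (784, 1)) for x in images]
    # Training labels are one-hot encoded; validation and test labels stay integers.
    training_data = list(zip(as_columns(train_data[0]),
                             [vectorized_result(y) for y in train_data[1]]))
    validation_data = list(zip(as_columns(val_data[0]), val_data[1]))
    testing_data = list(zip(as_columns(test_data[0]), test_data[1]))
    return (training_data, validation_data, testing_data)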

## Implementation of Neural Network

**Class: `Network`**  
This class represents a simple **feedforward neural network** that can be trained using **stochastic gradient descent (SGD)**.


**Constructor:  `__init__(self, sizes)`**

*Initialize network structure*
   - Store the number of layers (`num_layers`) from the given list `sizes`.
   - Store the list `sizes`, which defines the number of neurons in each layer.
   
*Initialize biases and weights*
   - Initialize biases randomly for all layers except the input layer.
   - Initialize weights randomly for connections between layers.

```plaintext
1. Input: sizes (list of neurons in each layer)
2. Set num_layers = length of sizes
3. Store sizes on the network (self.sizes = sizes)
4. Initialize biases as random values for all layers except input layer
5. Initialize weights as random values between consecutive layers
6. Return the network object
```
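
A minimal sketch of these initialization steps as a standalone function. It assumes standard-normal random values (the text only says "random"), and `init_params` with its `seed` argument is an illustrative name, not part of the notebook's `Network` class.

```python
import numpy as np


def init_params(sizes, seed=0):
    """Return (biases, weights) for a network with the given layer sizes."""
    rng = np.random.default_rng(seed)
    # One bias column per non-input layer
    biases = [rng.standard_normal((y, 1)) for y in sizes[1:]]
    # One weight matrix per pair of consecutive layers, shaped (next, previous)
    weights = [rng.standard_normal((y, x))
               for x, y in zip(sizes[:-1], sizes[1:])]
    return biases, weights


biases, weights = init_params([784, 30, 10])
print([b.shape for b in biases])   # [(30, 1), (10, 1)]
print([w.shape for w in weights])  # [(30, 784), (10, 30)]
```

Shaping each weight matrix as (next layer, previous layer) lets the forward pass multiply it directly with a column vector of activations.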

**`feedforward(self, a)`**
This function computes the **output of the network** for a given input.

**Algorithm:**
1. For each layer in the network:
   - Compute the weighted sum of inputs plus biases (`z = w * a + b`, where `w * a` denotes the matrix-vector product).
   - Apply the activation function (`sigmoid(z)`) to compute activations for the next layer.
2. Return the final activation as the output.

**Pseudo-code:**
```plaintext
1. Input: a (input vector)
2. For each layer (b, w) in (biases, weights):
   3. Compute z = w * a + b
   4. Apply activation function: a = sigmoid(z)
5. Return a (final output)
```
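
The loop above can be traced on a tiny hand-built network. This is a sketch with made-up weights chosen so the result is easy to verify by hand; the standalone `feedforward` function here takes the parameters as arguments, which is an illustrative simplification of the method.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def feedforward(a, biases, weights):
    """Propagate the column vector a through every layer."""
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return a


# 2 -> 2 -> 1 network: the hidden layer copies its inputs through the sigmoid,
# the output layer subtracts the two hidden activations.
weights = [np.eye(2), np.array([[1.0, -1.0]])]
biases = [np.zeros((2, 1)), np.zeros((1, 1))]

x = np.array([[2.0], [2.0]])
out = feedforward(x, biases, weights)
print(out.shape)  # (1, 1)
```

Because both inputs are equal, the two hidden activations cancel in the output layer, so `z` is 0 there and the output is `sigmoid(0)`, that is 0.5.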

**`SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None)`**
This function trains the network using **Stochastic Gradient Descent (SGD)**.

**Algorithm:**
1. If test data is provided, store its size (`n_test`).
2. Store training dataset size (`n`).
3. Loop for `epochs` times:
   - Shuffle the training data randomly.
   - Split the data into `mini-batches`.
   - For each `mini-batch`, update network parameters using **backpropagation**.
   - If test data is provided, evaluate the model after each epoch.
   - Print progress information.

**Pseudo-code:**
```plaintext
1. Input: training_data, epochs, mini_batch_size, eta (learning rate), test_data (optional)
2. If test_data exists, set n_test = length of test_data
3. Set n = length of training_data
4. Repeat for j in range(epochs):
   5. Shuffle training_data
   6. Create mini_batches of size mini_batch_size
   7. For each mini_batch:
      8. Update network parameters using update_mini_batch(mini_batch, eta)
   9. If test_data exists:
      10. Evaluate performance and print results
   11. Else, print epoch completion time
```
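
The shuffle-and-split step can be sketched on its own as follows. The helper name `make_mini_batches` and its `seed` parameter are hypothetical, and this version copies the data before shuffling instead of modifying the caller's list.

```python
import random


def make_mini_batches(training_data, mini_batch_size, seed=None):
    """Shuffle a copy of the data and cut it into consecutive slices."""
    data = list(training_data)
    random.Random(seed).shuffle(data)
    return [data[k:k + mini_batch_size]
            for k in range(0, len(data), mini_batch_size)]


batches = make_mini_batches(range(10), mini_batch_size=4, seed=0)
print([len(b) for b in batches])  # [4, 4, 2]
```

When the dataset size is not a multiple of `mini_batch_size`, the last batch is simply smaller.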

**`update_mini_batch(self, mini_batch, eta)`**
This function updates network **weights and biases** using mini-batch **gradient descent**.

**Algorithm:**
1. Initialize gradient accumulators (`nabla_b`, `nabla_w`) as zero matrices.
2. For each `(x, y)` in the mini-batch:
   - Compute **gradients** using backpropagation.
   - Accumulate gradients over the mini-batch.
3. Update weights and biases using the averaged gradients.

**Pseudo-code:**
```plaintext
1. Input: mini_batch (list of (x, y)), eta (learning rate)
2. Initialize nabla_b and nabla_w as zero matrices
3. For each (x, y) in mini_batch:
   4. Compute gradients delta_nabla_b, delta_nabla_w using backprop(x, y)
   5. Accumulate gradients: nabla_b += delta_nabla_b, nabla_w += delta_nabla_w
6. Update weights: w = w - (eta / mini_batch_size) * nabla_w
7. Update biases: b = b - (eta / mini_batch_size) * nabla_b
8. Store the updated weights and biases on the network
```
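
As a small worked example of steps 6 and 7, the sketch below applies one update to a single weight matrix. The numbers are invented for illustration, and `apply_update` is a hypothetical helper rather than a method of the notebook's class.

```python
import numpy as np


def apply_update(params, summed_grads, eta, batch_size):
    """Take one gradient-descent step using gradients summed over a mini-batch."""
    return [p - (eta / batch_size) * g for p, g in zip(params, summed_grads)]


weights = [np.array([[1.0, 2.0]])]
nabla_w = [np.array([[4.0, -2.0]])]  # gradients summed over a batch of 2 samples

new_weights = apply_update(weights, nabla_w, eta=0.5, batch_size=2)
# Step size is 0.5 / 2 = 0.25, so the weights become [[0.0, 2.5]]
print(new_weights[0])
```

Dividing `eta` by the batch size turns the summed gradient into an average, so the effective step does not grow with the mini-batch size.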


**`backprop(self, x, y)`**
This function implements **Backpropagation** to compute gradients for updating weights and biases.

**Algorithm:**
1. **Forward pass:**
   - Compute activations for each layer.
   - Store intermediate results (`activations` and `zs`).

2. **Backward pass:**
   - Compute output layer error (`delta`).
   - Compute gradients for biases and weights at the output layer.
   - Backpropagate the error to earlier layers using **chain rule**.
   - Compute gradients for all layers.

3. Return computed gradients.

**Pseudo-code:**
```plaintext
1. Input: x (input), y (true label)
2. Initialize nabla_b and nabla_w as zero matrices
3. Forward pass:
   4. Set activation = x
   5. For each layer (b, w):
      6. Compute z = w * activation + b
      7. Store z in zs
      8. Compute activation = sigmoid(z)
      9. Store activation in activations
10. Backward pass:
   11. Compute output error: delta = cost_derivative(output_activation, y) * sigmoid_prime(last z)
   12. Compute gradient for last layer weights and biases
   13. For each previous layer:
      14. Compute delta = backpropagated error
      15. Compute gradients for biases and weights
16. Return (nabla_b, nabla_w)
```
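
Written as equations, the backward pass takes the following form. This assumes the quadratic cost implied by `cost_derivative` (output minus target). The notation is introduced here for illustration: the superscript $l$ indexes layers, $L$ is the output layer, $a^{l}$ and $z^{l}$ are the stored activations and weighted sums, and $\odot$ denotes element-wise multiplication.

$$ \delta^{L} = (a^{L} - y) \odot \sigma'(z^{L}) $$

$$ \delta^{l} = \left( (w^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l}) $$

$$ \frac{\partial C}{\partial b^{l}} = \delta^{l}, \qquad \frac{\partial C}{\partial w^{l}} = \delta^{l} \, (a^{l-1})^{T} $$

The first equation is step 11, the second is the backpropagated error of step 14, and the third gives the gradients of steps 12 and 15.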


**`evaluate(self, test_data)`**
This function evaluates the network's performance on test data.

**Algorithm:**
1. Compute predictions using `feedforward(x)`.
2. Compare predicted labels with actual labels.
3. Count the number of correct predictions.

**Pseudo-code:**
```plaintext
1. Input: test_data (list of (x, y))
2. For each (x, y) in test_data:
   3. Compute predicted digit using argmax(feedforward(x))
   4. Compare with y (true label)
5. Return total correct predictions
```
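
The counting logic can be tried without a trained network. In this sketch the network outputs are faked by `fake_output`, a hypothetical helper that builds a 10-element column vector with its largest value at a chosen index.

```python
import numpy as np


def count_correct(outputs, labels):
    """Count outputs whose largest activation sits at the true label's index."""
    return sum(int(np.argmax(out) == y) for out, y in zip(outputs, labels))


def fake_output(k):
    # Stand-in for feedforward(x): low activation everywhere except index k
    out = np.full((10, 1), 0.1)
    out[k] = 0.9
    return out


outputs = [fake_output(3), fake_output(7), fake_output(1)]
labels = [3, 7, 2]
print(count_correct(outputs, labels))  # 2
```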



**`cost_derivative(self, output_activations, y)`**
Computes the derivative of the cost function for backpropagation.

**Algorithm:**
1. Compute the difference between predicted output (`output_activations`) and actual target (`y`).

**Pseudo-code:**
```plaintext
1. Input: output_activations, y
2. Compute cost derivative: output_activations - y
3. Return the computed derivative
```
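
The difference computed here is what one obtains by differentiating a quadratic cost. The notebook does not name its cost function, so take the following as an assumption that is consistent with the pseudo-code rather than a statement of the author's choice. With output activations $a$ and target vector $y$:

$$ C = \frac{1}{2} \lVert a - y \rVert^{2} \quad \Rightarrow \quad \frac{\partial C}{\partial a} = a - y $$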

In [None]:
import random
import time

In [None]:
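class Network(object):
    def __init__(self, sizes):
        # sizes lists the neuron count of each layer, e.g. [784, 30, 10]
        self.num_layers = len(sizes)
        self.sizes = sizes
        # Assumption: standard-normal initialization (the text only says "random").
        # The input layer has no biases; weights[i] has shape (next, previous).
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            time1 = time.time()

            # shuffle in place, then slice into consecutive mini-batches
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size]
                            for k in range(0, n, mini_batch_size)]

            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)

            time2 = time.time()

            if test_data:
                print("Epoch {0}: {1} / {2}, took {3:.2f} seconds".format(
                    j, self.evaluate(test_data), n_test, time2-time1))
            else:
                print("Epoch {0} complete in {1:.2f} seconds".format(j, time2-time1))

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]

        for x, y in mini_batch:
            # backpropagate one sample and accumulate its gradients
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

        # step against the gradient averaged over the mini-batch
        step = eta / len(mini_batch)
        self.weights = [w-step*nw for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-step*nb for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]

        # feedforward, storing every z and activation for the backward pass
        activation = x
        activations = [x]
        zs = []
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)

        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())

        for l in range(2, self.num_layers):
            # l counts layers from the end: l = 2 is the second-to-last layer
            delta = np.dot(self.weights[-l+1].transpose(), delta) * \
                sigmoid_prime(zs[-l])
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def feedforward(self, a):
        # inference: propagate the input through every layer
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a

    def evaluate(self, test_data):
        # evaluate: only return number of correct samples
        return sum(int(np.argmax(self.feedforward(x)) == y)
                   for x, y in test_data)

    def cost_derivative(self, output_activations, y):
        return (output_activations-y)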

In [None]:
nn = Network([784, 30, 10])

**Neuron Activation and Derivatives**

**`sigmoid_prime(z)`**
This function calculates the derivative of the **sigmoid activation function**, which is used in **backpropagation** to update weights and biases.


**Algorithm:**
1. Compute the sigmoid function:  
   $ \sigma(z) = \frac{1}{1 + e^{-z}} $
2. Compute the derivative using the formula:  
   $ \sigma'(z) = \sigma(z) \cdot (1 - \sigma(z)) $
3. Return the computed derivative.

**Pseudo-code:**
```plaintext
1. Input: z (scalar or matrix)
2. Compute sigmoid(z) = 1 / (1 + exp(-z))
3. Compute derivative: sigmoid(z) * (1 - sigmoid(z))
4. Return the computed value
```

This derivative is crucial in **gradient-based learning**, as it helps propagate the error during **backpropagation** by determining how much each weight should be adjusted.
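
One way to gain confidence in the derivative formula is to compare it against a central finite difference. The sketch below does that on a handful of points; the step size `h` and the tolerance are arbitrary choices made for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))


z = np.linspace(-5.0, 5.0, 11)
h = 1e-5
# Central difference approximation of the derivative
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_err = np.max(np.abs(numeric - sigmoid_prime(z)))
print(max_err < 1e-8)  # True
```

The derivative peaks at `z = 0`, where it equals 0.25, and shrinks toward zero for large positive or negative `z`.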

In [None]:
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

In [None]:
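def cost_derivative(self, output_activations, y):
    # Gradient of the cost with respect to the output activations
    return (output_activations-y)

# Attach to the class (not to an instance) so it is bound as a method
Network.cost_derivative = cost_derivative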

In [None]:
training_data, validation_data, test_data = load_data_wrapper()

In [None]:
recognizer = Network([784, 50, 10])

In [None]:
recognizer.SGD(training_data, 1, 500, 3.0, test_data=test_data)

## Fun Task 1: Train the neural network to add two numbers

In [None]:
import random

def generate_unique_triples(n=500):
    unique_triples = set()
    while len(unique_triples) < n:
        a = random.randint(1, 500)
        b = random.randint(1, 500)
        c = a + b
        unique_triples.add((a, b, c))
    return list(unique_triples)

# Generate 500 unique (a, b, a + b) triples
unique_numbers = generate_unique_triples()
print(unique_numbers[:10])  # Print first 10 triples as an example
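
One possible way to turn such triples into `(x, y)` training pairs is sketched below. The scaling is an assumption made for this illustration: operands are divided by 500 and sums by 1000 so that every value lies between 0 and 1, the range a sigmoid output can represent. The function name `triples_to_training_data` is hypothetical.

```python
import numpy as np


def triples_to_training_data(triples, max_operand=500):
    """Convert (a, b, c) triples into (input column, target column) pairs."""
    data = []
    for a, b, c in triples:
        x = np.array([[a / max_operand], [b / max_operand]])  # shape (2, 1)
        y = np.array([[c / (2 * max_operand)]])               # shape (1, 1)
        data.append((x, y))
    return data


data = triples_to_training_data([(1, 2, 3), (500, 500, 1000)])
print(data[1][0].ravel(), data[1][1].ravel())
```

A network with two inputs and one output could then be trained on these pairs. Note that an `argmax`-based evaluation is not meaningful for a single output neuron, so the evaluation step would need to compare the rescaled output with the true sum instead.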

## Task: Work with real data (disease diagnosis from medical data)

- Dataset: UCI Heart Disease Dataset

    https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/data

- Features: Age, cholesterol levels, blood pressure, heart rate, etc.
- Classes: Disease present vs. Disease absent
