# Understand the Basics

## Neural Network Architecture
Understand the basic components: input layer, hidden layers, output layer, neurons, weights, and biases.
## Activation Functions
Learn about functions such as Sigmoid, ReLU, and Softmax.
## Loss Functions
Understand loss functions such as Mean Squared Error (MSE) and Cross-Entropy.

# Conceptual Outline of a Neural Network Project

## Introduction
- **Objective**: Design and implement a neural network using only NumPy, focusing on the fundamental principles of machine learning and linear algebra.

## 1. Neural Network Architecture
- **Linear Algebra Perspective**: View neurons and layers as vectors and matrices.
- **Layers**: Input, hidden, and output layers, each represented by matrices of weights and bias vectors.
- **Neurons**: Units of computation, aggregating input through matrix operations.

## 2. Mathematical Foundations
- **Matrix Operations**: Emphasize the use of matrix multiplication, addition, and transpose operations in the network's computations.
- **Activation Functions**: Conceptualize as vectorized non-linear transformations (e.g., Sigmoid or ReLU applied element-wise).

## 3. Forward Propagation
- **Linear Transformation**: `Z = W*X + b`, where `W` and `X` are matrices, and `b` is a bias vector.
- **Activation**: Apply an activation function on `Z` to introduce non-linearity.
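
The two steps above can be sketched in NumPy as follows. This is a minimal illustration, assuming observations are stored as columns of `X` and a sigmoid activation; the function name `forward` and the toy shapes are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, b):
    """One layer of forward propagation: linear step, then activation."""
    Z = W @ X + b   # b has shape (n_out, 1) and broadcasts across columns
    A = sigmoid(Z)  # non-linearity applied element-wise
    return Z, A

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))  # 3 features, 4 observations (one per column)
W = rng.normal(size=(2, 3))  # 2 output units
b = np.zeros((2, 1))
Z, A = forward(X, W, b)
print(Z.shape, A.shape)
```

Keeping `b` as a column vector lets NumPy broadcasting add the same bias to every observation without an explicit loop.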

## 4. Loss Function
- **Theory**: Understand the purpose of a loss function in quantifying prediction error.
- **Implementation**: Use a function like MSE or Cross-Entropy, represented in a linear algebra framework.
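
Both losses can be written as short array expressions. The sketch below assumes one-hot targets `T` and predictions stored column-wise; the small `eps` inside the logarithm is an assumption added for numerical safety, not part of the definition.

```python
import numpy as np

def mse(Y, T):
    """Mean squared error over all entries."""
    return np.mean((Y - T) ** 2)

def cross_entropy(P, T, eps=1e-12):
    """Average cross-entropy; columns of P are predicted class probabilities."""
    return -np.mean(np.sum(T * np.log(P + eps), axis=0))

T = np.array([[1.0, 0.0],
              [0.0, 1.0]])   # two observations, one-hot labels
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])   # predicted probabilities
mse_value = mse(P, T)
ce_value = cross_entropy(P, T)
print(mse_value, ce_value)
```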

## 5. Backward Propagation
- **Gradient Computation**: Utilize the chain rule in a matrix calculus context to compute gradients of the loss with respect to weights and biases.
- **Backpropagation Algorithm**: Conceptually, this is the reverse of forward propagation, applying the chain rule through layers.
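
To make the chain rule concrete, here is a hedged sketch for a single sigmoid layer with a squared-error loss averaged over observations. The choice of loss and the helper names `loss` and `gradients` are assumptions for illustration. A finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, b, X, T):
    """Half squared error, averaged over the p observations (columns)."""
    A = sigmoid(W @ X + b)
    return 0.5 * np.sum((A - T) ** 2) / X.shape[1]

def gradients(W, b, X, T):
    """Gradients of the loss with respect to W and b via the chain rule."""
    p = X.shape[1]
    A = sigmoid(W @ X + b)
    dZ = (A - T) * A * (1.0 - A) / p    # dL/dA times dA/dZ
    dW = dZ @ X.T                       # dZ/dW contributes X
    db = dZ.sum(axis=1, keepdims=True)  # bias is shared by all observations
    return dW, db

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
T = rng.uniform(size=(2, 5))
W = rng.normal(size=(2, 3))
b = rng.normal(size=(2, 1))
dW, db = gradients(W, b, X, T)

# Central finite difference on a single weight as a sanity check.
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += h
W_minus[0, 1] -= h
numeric = (loss(W_plus, b, X, T) - loss(W_minus, b, X, T)) / (2 * h)
print(numeric, dW[0, 1])
```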

## 6. Parameter Update
- **Gradient Descent**: Update the weights and biases by moving in the direction opposite to the gradient.
- **Learning Rate**: Control the step size in the gradient descent.
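
The update rule can be illustrated on a toy problem. The sketch below assumes a purely linear model with a squared-error loss, and the data, learning rate, and iteration count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))       # 2 features, 50 observations
W_true = np.array([[2.0, -1.0]])   # weights used to generate the targets
T = W_true @ X

W = np.zeros((1, 2))
learning_rate = 0.1
for _ in range(500):
    Y = W @ X
    grad = (Y - T) @ X.T / X.shape[1]  # gradient of the mean half squared error
    W -= learning_rate * grad          # step against the gradient
print(W)
```

A larger learning rate converges in fewer steps up to a point, after which the updates overshoot and diverge.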

## 7. Model Integration and Training
- **Epochs**: Iterate over the dataset multiple times.
- **Batch Processing**: Optionally introduce the concept of mini-batch gradient descent.
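
One way to organise epochs and mini-batches is a small generator that reshuffles the columns on every pass. The helper name `iterate_minibatches` and the batch size are assumptions for this sketch.

```python
import numpy as np

def iterate_minibatches(X, T, batch_size, rng):
    """Yield shuffled column-wise mini-batches covering the data once."""
    p = X.shape[1]
    order = rng.permutation(p)
    for start in range(0, p, batch_size):
        idx = order[start:start + batch_size]
        yield X[:, idx], T[:, idx]

rng = np.random.default_rng(0)
X = np.arange(20.0).reshape(2, 10)  # 2 features, 10 observations
T = np.arange(10.0).reshape(1, 10)
n_epochs = 3
batch_sizes = []
for epoch in range(n_epochs):
    for X_batch, T_batch in iterate_minibatches(X, T, batch_size=4, rng=rng):
        batch_sizes.append(X_batch.shape[1])
        # a gradient step on (X_batch, T_batch) would go here
print(batch_sizes)
```

The last batch of each epoch is smaller whenever the number of observations is not a multiple of the batch size.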

## 8. Evaluation and Prediction
- **Predictive Function**: Apply the trained model to new data.
- **Performance Metrics**: Evaluate accuracy, precision, recall, etc.
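
For a binary problem, these metrics reduce to counting true positives, false positives, and false negatives. The sketch below is a minimal version; the helper name `binary_metrics` and the toy labels are assumptions for illustration.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```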

## 9. Experimental Setup
- **Dataset**: Choose an appropriate dataset for testing.
- **Training/Testing Split**: Emphasize the importance of unbiased model evaluation.
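
A held-out test set can be created by shuffling the observation indices once and slicing. The sketch assumes observations are stored as columns; the function name `train_test_split_columns` and the 20% test fraction are assumptions for the example.

```python
import numpy as np

def train_test_split_columns(X, T, test_fraction, rng):
    """Randomly split column-wise observations into train and test sets."""
    p = X.shape[1]
    order = rng.permutation(p)
    n_test = int(round(test_fraction * p))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[:, train_idx], T[:, train_idx], X[:, test_idx], T[:, test_idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))
T = rng.normal(size=(2, 10))
X_train, T_train, X_test, T_test = train_test_split_columns(X, T, 0.2, rng)
print(X_train.shape, X_test.shape)
```

The test columns should be used only for the final evaluation, never for fitting.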

## 10. Advanced Topics (Optional)
- **Regularization**: Techniques like L1, L2 regularization to prevent overfitting.
- **Optimization Algorithms**: Beyond basic gradient descent, explore algorithms like Adam, RMSprop.


# Summary: Single Layer Neural Networks from a Statistical Perspective
(based on "An Introduction to Statistical Learning")

## Conceptual Overview
- **Purpose**: A neural network aims to build a nonlinear function `f(X)` to predict the response `Y`, using an input vector of variables `X = (X1, X2, ..., Xp)`.
- **Distinction**: Unlike trees, boosting, and generalized additive models, neural networks have a unique structure characterized by layers and units.

## Neural Network Model
- **Input Layer**: Consists of features `X1, ..., Xp` as units.
- **Hidden Layer**: Each input feeds into `K` hidden units (in this example, `K=5`).
- **Model Representation**:
  $$ f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k \, g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\Big) $$
- **Activation Functions**: `g(z)` is a nonlinear function. Popular choices are sigmoid and ReLU (Rectified Linear Unit).

## Activation Functions
- **Sigmoid**: 
  $ g(z) = \frac{1}{1 + e^{-z}} $
- **ReLU**:
  $ g(z) = \max(0, z) $
- **Efficiency**: ReLU is preferred for its computational efficiency.
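
Both activations are one-liners in NumPy and apply element-wise to arrays of any shape. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
sigmoid_values = sigmoid(z)
relu_values = relu(z)
print(sigmoid_values, relu_values)
```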

## Computation in Hidden Layer
- **Activations**: Calculated as $A_k = h_k(X) = g\big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\big)$.
- **Role of Activations**: Similar to basis functions, transforming original features.

## Output Layer
- **Formulation**: The output is a linear regression in the activations `Ak`.
- **Parameters**: The coefficients `β0, ..., βK` and the weights `w10, ..., wKp` (including the intercepts `β0` and `wk0`) are all estimated from data.
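
The model formula can be evaluated directly with two matrix products. In this sketch, observations are stored as rows of `X`, unlike the column convention used in the perceptron section. The function name `single_layer_predict` and all numeric values are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def single_layer_predict(X, w0, W, beta0, beta, g=relu):
    """Evaluate f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj * X_j).

    X: (n_obs, p), W: (K, p), w0: (K,), beta: (K,)
    """
    A = g(w0 + X @ W.T)     # hidden activations A_k, shape (n_obs, K)
    return beta0 + A @ beta  # linear regression in the activations

X = np.array([[1.0, 2.0],
              [0.0, -1.0]])            # 2 observations, p = 2
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])             # K = 3 hidden units
w0 = np.array([0.0, 0.5, -1.0])
beta0 = 1.0
beta = np.array([1.0, 2.0, -1.0])
f_values = single_layer_predict(X, w0, W, beta0, beta)
print(f_values)
```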

## Nonlinearity and Interaction Effects
- Nonlinear activation functions allow the model to capture complex patterns and interactions.
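- For example, with `p = 2` inputs, `K = 2` hidden units, and a quadratic activation `g(z) = z²`, suitable weights give `f(X) = X1 * X2`, a pure interaction term that no linear model in `X1` and `X2` can represent.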
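
As a concrete check of the quadratic-activation point, the sketch below uses two inputs, two hidden units, and `g(z) = z²`. The weights are chosen for illustration so that the network output equals the product of the two inputs.

```python
import numpy as np

def f(x1, x2):
    """Two hidden units with quadratic activation; output equals x1 * x2."""
    g = lambda z: z ** 2
    a1 = g(0 + 1 * x1 + 1 * x2)   # (x1 + x2)^2
    a2 = g(0 + 1 * x1 - 1 * x2)   # (x1 - x2)^2
    return 0 + 0.25 * a1 - 0.25 * a2

x1 = np.array([1.0, 2.0, -3.0])
x2 = np.array([4.0, 0.5, 2.0])
interaction = f(x1, x2)
print(interaction)
```

The identity behind it is `(x1 + x2)² - (x1 - x2)² = 4 * x1 * x2`.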

## Fitting the Neural Network
- **Loss Function**: Typically squared-error loss for quantitative responses.
- **Optimization**: Minimize the sum of squared errors between predictions and actual values.

*The following section is based on the paper "The Neural Network, its Techniques and Applications" by Casey Schafer (2016).*

# Basic Neural Network: Perceptron Explanation

A **perceptron** is the most basic form of a neural network. It consists of one layer of inputs (independent variables or features) and one layer of outputs (dependent variables). Let's break down the components and the operation of this network step by step, using the "mini MNIST" data as a running example:

## Layers and Vectors

1. **Layers**: In a perceptron, layers are visualized as a series of nodes (each node corresponding to a variable) aligned vertically.
   - For a single observation, the **input layer** is a column vector `x = [x1 x2 ... xm]ᵀ` with `m` features.
   - With `p` observations and `m` features, the layer is an `m × p` matrix, one column per observation.

2. **Input Layer (X)**: For a perceptron with two features and `p` observations, the input layer is a `2 × p` matrix `X`. We will model the "mini MNIST" data, which has 28 × 28 = 784 input features and 1,000 observations, so `X` has shape 784 × 1,000.

3. **Output Layer (Y)**:
   - The output layer (matrix `Y`) is `n × p`, with `n` output nodes and `p` observations. Here each of the 10 possible digits is modeled as a binary variable, so `Y` has shape 10 × 1,000.
   - `Y` is the output predicted by the network, as opposed to `T` (the label matrix), which is the known output in supervised learning.

## Goal of the Neural Network

- The objective is to minimize the difference between the known output `T` and the predicted output `Y`.
- The ideal network predicts `T` accurately using only the inputs.

## Weight Matrix (W)

- To map from `X` to `Y`, we introduce a weight matrix `W`.
- In general, `W` is an `n × m` matrix, where `n` is the number of output nodes and `m` is the number of input nodes; its size does not depend on the number of observations. Here there are 784 input features and 10 outputs, so `W` has shape 10 × 784.
- The fundamental equation of the network is: `T = W * X + b`, where `b` is a bias vector.

## Linearization

- To linearize this equation, augment `W` with `b` as `[W | b]` and add a row of 1’s to `X`.
- The equation becomes: `T = [W | b] * [X; 1]`.
- We redefine `W` to be `n × (m + 1)` and `X` to be `(m + 1) × p` with a row of 1’s at the bottom.
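
The augmentation trick is easy to verify numerically. The sketch below uses small random matrices (the sizes are arbitrary assumptions) and checks that the augmented product reproduces `W * X + b`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 3, 4, 5                 # outputs, inputs, observations
W = rng.normal(size=(n, m))
b = rng.normal(size=(n, 1))
X = rng.normal(size=(m, p))

W_aug = np.hstack([W, b])                # [W | b], shape n x (m + 1)
X_aug = np.vstack([X, np.ones((1, p))])  # X with a row of ones appended

print(np.allclose(W_aug @ X_aug, W @ X + b))
```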

## Solving for W

- We don't "solve" this equation in the traditional sense, because `X` is generally not invertible. Instead, we use the pseudoinverse of `X` to obtain the least-squares approximation of `W`, denoted `Ŵ`.
- `Ŵ` is calculated as `Ŵ = T * V * Σ⁻¹ * Uᵀ`, where `U`, `Σ`, and `V` come from the singular value decomposition `X = U * Σ * Vᵀ` (only the nonzero singular values are inverted).
- Multiplying `Ŵ` by `X` projects each row of `T` onto the row space of `X`.

## Predicted Output (Y)

- The predicted output `Y` is computed as `Y = Ŵ * X`.
- `Y` equals `T` only if every row of `T` lies in the row space of `X`, which is rare.
- Otherwise, `Y` is the closest matrix to `T`, in the least-squares sense, that can be written as `W * X`.
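
The pseudoinverse step can be sketched with `np.linalg.pinv`, which discards singular values below a relative cutoff instead of inverting them. The toy data below is an assumption for illustration: one feature is zero for every observation (much like a blank border pixel), which makes `X` rank-deficient, and the cutoff `rcond=1e-10` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 6, 20, 3                 # inputs, observations, outputs
X = rng.normal(size=(m, p))
X[2, :] = 0.0                      # an always-zero feature makes X rank-deficient
W_true = rng.normal(size=(n, m))
T = W_true @ X                     # targets that are exactly linear in X

X_pinv = np.linalg.pinv(X, rcond=1e-10)  # small singular values are dropped
W_hat = T @ X_pinv
Y = W_hat @ X

P = X_pinv @ X                     # projector onto the row space of X
print(np.allclose(Y, T), np.allclose(P @ P, P))
```

Because the targets here were generated as `W_true @ X`, the prediction `Y` reproduces `T` exactly; with real labels it would only be the closest least-squares fit. The weights attached to the always-zero feature come out as zero rather than as very large numbers.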

## Limitations and Extensions

- This linear algebra method works best if the relationship between `X` and `T` is linear, which is also rare.
- For more complex relationships, we extend the model to a more sophisticated neural network that can handle non-linear functions.


We will be working with a subset of the MNIST dataset: 1,000 images of 28 × 28 pixels, each with an integer label.

In [8]:
import pickle

with open('mini-mnist-1000.pickle', 'rb') as f:
    data = pickle.load(f)

images = data['images'] # a list of 1000 numpy image matrices
labels = data['labels'] # a list of 1000 integer labels

In [28]:
print("Original image shape:",images[0].shape) # (28, 28)

# Given we are working with a simple perceptron, we must flatten each image into a vector of length 784.

# Example:

print("Flattened image shape: ",images[0].flatten().shape)

# Now we can create a single matrix of all the images, where each row is an image vector.

import numpy as np

X = np.array([image.flatten() for image in images])
X = X.T # transpose so that each column is an image vector
print("X shape:",X.shape) # (784,1000)

# We also need to convert the labels so that they are one-hot encoded.
# Given that each label is an integer between 0 and 9, we can create a 10-dimensional vector for each label.
# The vector will have a 1 in the position of the label and 0s everywhere else.

T = np.zeros((len(labels), 10))
for i, label in enumerate(labels):
    T[i, label] = 1
T = T.T # transpose so that each column is a label vector
print("T shape:",T.shape) # (10,1000)

Original image shape: (28, 28)
Flattened image shape:  (784,)
X shape: (784, 1000)
T shape: (10, 1000)


In [27]:
import numpy as np

def initialize_weights(n_inputs, n_outputs):
    """
    Randomly initializes the weight matrix and bias vector.
    """
    W = np.random.randn(n_outputs, n_inputs)
    b = np.random.randn(n_outputs, 1)
    return W, b

def linear_transform(X, W, b):
    """
    Computes the linear transformation W * X + b.
    """
    return np.dot(W, X) + b

def compute_pseudoinverse(X):
    """
    Computes the pseudoinverse of matrix X.
    """
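    U, S, VT = np.linalg.svd(X, full_matrices=False)
    # Caution: 1 / S blows up when X has (near-)zero singular values, e.g. when
    # some pixels are zero in every image. np.linalg.pinv(X) avoids this by
    # discarding singular values below a cutoff.
    S_inv = np.diag(1 / S)
    X_pinv = np.dot(VT.T, np.dot(S_inv, U.T))
    return X_pinv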

def train_perceptron(X, T):
    """
    Trains the perceptron using the pseudoinverse to find the least-squares weights.
    X is used as given: no row of ones is appended, so no bias term is fitted.
    """
    X_pinv = compute_pseudoinverse(X)
    W_hat = np.dot(T, X_pinv)
    return W_hat

def predict(X, W):
    """
    Uses the trained perceptron to predict outputs.
    """
    return np.dot(W, X)

# Example usage
n_inputs = 784  # Number of input nodes
n_outputs = 10 # Number of output nodes
p = 1000         # Number of observations

# Random example data
#X = np.random.randn(n_inputs, p)  # Input data
#T = np.random.randn(n_outputs, p) # Target output

print("X shape:", X.T.shape)  # note: this prints the transposed shape; X itself is (784, 1000)
print("T shape:", T.shape)
print("X:", X)
print("T:", T)

# Training the perceptron
W_hat = train_perceptron(X, T)
print("W_hat shape:", W_hat.shape)
print("W_hat:", W_hat)

# Predicting new inputs
new_X = np.random.randn(n_inputs, p)

print("New X shape:", new_X.shape)
print("New X:", new_X)


predictions = predict(new_X, W_hat)

print("Predictions:", predictions)


X shape: (1000, 784)
T shape: (10, 1000)
X: [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
T: [[1. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 1. 1.]]
W_hat shape: (10, 784)
W_hat: [[ 2.11237658e+09 -1.04955834e+11 -5.13363801e+10 ...  3.85621959e+10
  -5.46846281e+10  9.56947394e+09]
 [-3.30079618e+10  1.10669870e+10 -2.14677754e+10 ...  2.03558402e+10
   3.91054613e+10  2.90604952e+10]
 [ 3.11741740e+10 -4.23477634e+10  7.80977818e+08 ... -1.43691528e+10
  -4.59679423e+10  5.12355510e+10]
 ...
 [ 6.33463940e+10 -1.11043005e+10 -2.16872418e+10 ...  5.43620773e+10
   6.17301138e+10  6.56674428e+10]
 [ 8.62112372e+10 -1.36373543e+10  6.63989055e+09 ...  4.41542226e+10
   7.58351793e+10  1.72868239e+10]
 [-1.63810193e+11  2.17931590e+10  3.70171170e+10 ... -9.68784683e+10
  -1.22973266e+11 -8.70428261e+10]]
New X shape