# **Neural Network: Multi-layer Network**

In [57]:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

def feedforward(X, w1, b1, w2, b2):
    # print(f"x = {X}")
    z1 = np.dot(X, w1) + b1
    # print(f"z1 = {z1}")
    a1 = sigmoid(z1)
    # print(f"a1 = {a1}")
    # print(f"b2 = {b2}")
    z2 = np.dot(a1, w2) + b2
    # print(f"z2 = {z2}")
    a2 = sigmoid(z2)
    # print(f"a2 = {a2}")
    return z1, a1, z2, a2

def backpropagation(X, y, z1, a1, z2, a2, w1, w2, b1, b2):
    dz2 = (a2 - y) * sigmoid_derivative(z2)
    # print(f"partial derivative of z2 (dz2) = {dz2}")
    dw2 = np.dot(a1.T, dz2)
    # print(f"partial derivative of w2 (dw2) = {dw2}")
    db2 = np.sum(dz2, axis=0)
    # print(f"partial derivative of b2 (db2) = {db2}")

    dz1 = np.dot(dz2, w2.T) * sigmoid_derivative(z1)
    # print(f"partial derivative of z1 (dz1) = {dz1}")
    dw1 = np.dot(X.T, dz1)
    # print(f"partial derivative of w1 (dw1) = {dw1}")
    db1 = np.sum(dz1, axis=0)
    # print(f"partial derivative of b1 (db1) = {db1}")

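    # learning_rate is not a parameter; it is read from the global scope,
    # so it must be defined before this function is called
    w2 = w2 - learning_rate * dw2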
    b2 = b2 - learning_rate * db2

    w1 = w1 - learning_rate * dw1
    b1 = b1 - learning_rate * db1

    return w1, b1, w2, b2

# define input and output
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])

# define weights and biases
w1 = np.array([[3, 3],[-4, -4]])
b1 = np.array([2, -2])
w2 = np.array([[-4],[4]])
b2 = np.array([2])

learning_rate = 0.1

# train the network
for i in range(10):
    z1, a1, z2, a2 = feedforward(X, w1, b1, w2, b2)
    w1, b1, w2, b2 = backpropagation(X, y, z1, a1, z2, a2, w1, w2, b1, b2)

# print the final output
z1, a1, z2, a2 = feedforward(X, w1, b1, w2, b2)
print(a2)


[[0.24628462]
 [0.81122028]
 [0.72020663]
 [0.29110887]]


## Error Calculation and Backward Propagation

* Once the network has produced an output, that output is scored against the target. A cost function compares the output y with the target value t; a common choice is the squared error, which for a single sample (with a conventional factor of 1/2 that simplifies the derivative) is:
$$ E =  \frac{1}{2}(t-y)^2 $$
* The backpropagation phase starts by computing the gradient of this error with respect to the output of the network. This gradient, ∂E/∂y, gives the direction and rate at which the error changes as the output changes:
$$  \frac{\partial E}{\partial y} = -(t-y) $$
* The gradient is then propagated back through the network, which requires computing the derivative of the output y with respect to the weighted sum Σ for the sigmoid function:
$$ \frac{\partial y}{\partial \sum} = y \cdot (1-y) $$

* This derivative reflects how changes in the weighted sum would affect the neuron’s output after the activation function is applied.
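
The three quantities above can be evaluated for one concrete case. The following is a minimal sketch; the target `t` and weighted sum `s` are illustrative values chosen here, not values from the notebook.

```python
import numpy as np

def sigmoid(s):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-s))

# Illustrative values (assumptions, not taken from the text above)
t = 1.0             # target value
s = 0.5             # weighted sum (Σ) arriving at the output neuron
y = sigmoid(s)      # neuron output

E = 0.5 * (t - y) ** 2      # squared error
dE_dy = -(t - y)            # gradient of the error w.r.t. the output
dy_ds = y * (1 - y)         # sigmoid derivative w.r.t. the weighted sum
dE_ds = dE_dy * dy_ds       # chain rule: gradient w.r.t. the weighted sum

print(f"y = {y:.4f}, E = {E:.4f}")
print(f"dE/dy = {dE_dy:.4f}, dy/dS = {dy_ds:.4f}, dE/dS = {dE_ds:.4f}")
```

Because the output is below the target here, `dE_dy` is negative: raising the output would lower the error. The sigmoid derivative is always positive and never exceeds 0.25, so it scales the gradient without changing its sign.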

### Updating the Weights and Bias
The ultimate goal of backpropagation is to use the error gradient to update the weights and bias in such a way that the error is reduced in subsequent iterations. The weights are updated by subtracting the product of the learning rate η and the gradient with respect to each weight:
$$ w_1^{\text{new}} = w_1 - \eta \cdot \frac{\partial E}{\partial w_1} $$

$$ w_2^{\text{new}} = w_2 - \eta \cdot \frac{\partial E}{\partial w_2} $$

Similarly, the bias is updated by:

$$ b^{\text{new}} = b - \eta \cdot \frac{\partial E}{\partial b} $$

To compute the gradients for the weights and bias, the chain rule is employed:
$$ \frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial \sum} \cdot \frac{\partial \sum}{\partial w_1}  $$

$$ \frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial \sum} \cdot \frac{\partial \sum}{\partial w_2}  $$

$$ \frac{\partial E}{\partial b} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial \sum} \cdot \frac{\partial \sum}{\partial b}  $$

The partial derivatives of the weighted sum Σ with respect to the weights and bias are the input values and a constant 1, respectively:
$$ \frac{\partial \sum}{\partial w_1} = x_1 $$

$$ \frac{\partial \sum}{\partial w_2} = x_2 $$

$$ \frac{\partial \sum}{\partial b} = 1 $$
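
The chain-rule expressions and update rules above can be combined into a single gradient-descent step for one sigmoid neuron with two inputs. This is a sketch under assumptions: the function names `neuron_step` and `error`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def error(x1, x2, t, w1, w2, b):
    """Squared error E = 1/2 (t - y)^2 for a single sigmoid neuron."""
    y = sigmoid(w1 * x1 + w2 * x2 + b)
    return 0.5 * (t - y) ** 2

def neuron_step(x1, x2, t, w1, w2, b, eta):
    """One gradient-descent update of w1, w2 and b (hypothetical helper)."""
    y = sigmoid(w1 * x1 + w2 * x2 + b)
    delta = -(t - y) * y * (1 - y)   # dE/dy * dy/dΣ, shared by all three gradients
    dE_dw1 = delta * x1              # dΣ/dw1 = x1
    dE_dw2 = delta * x2              # dΣ/dw2 = x2
    dE_db = delta * 1.0              # dΣ/db  = 1
    return w1 - eta * dE_dw1, w2 - eta * dE_dw2, b - eta * dE_db

# Illustrative input, target, parameters and learning rate (assumptions)
x1, x2, t = 1.0, 0.0, 1.0
w1, w2, b, eta = 0.2, -0.3, 0.1, 0.5

before = error(x1, x2, t, w1, w2, b)
w1_new, w2_new, b_new = neuron_step(x1, x2, t, w1, w2, b, eta)
after = error(x1, x2, t, w1_new, w2_new, b_new)

print("error before:", before)
print("error after: ", after)
print("w2 unchanged:", w2_new == w2)
```

Since `x2` is 0 in this example, the gradient with respect to `w2` is 0 and that weight does not move, while `w1` and `b` shift in the direction that lowers the error.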

By applying these updates repeatedly, the network adjusts its parameters to reduce the error on the training data. With a suitable learning rate the error typically shrinks over successive iterations, although gradient descent is not guaranteed to reach a global minimum.

Backpropagation is the standard mechanism for training neural networks: a forward pass computes the output and its error, and a backward pass applies the chain rule to obtain the gradient of that error with respect to every weight and bias. Repeating the two passes lets a multi-layer network fit relationships, such as XOR, that a single linear unit cannot represent.
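
One way to gain confidence in hand-written gradient formulas is to compare them with finite-difference estimates. The sketch below does this for the two-layer sigmoid network, using the initial weights from the NumPy cell at the top of the notebook. The finite-difference check and the helper names `loss` and `analytic_grads` are additions for illustration, and the loss is assumed to be the summed squared error with a factor of 1/2.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(X, y, w1, b1, w2, b2):
    """Summed squared error E = 1/2 * sum((a2 - y)^2) over all samples."""
    a1 = sigmoid(X @ w1 + b1)
    a2 = sigmoid(a1 @ w2 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

def analytic_grads(X, y, w1, b1, w2, b2):
    """Gradients from the backpropagation formulas (chain rule)."""
    a1 = sigmoid(X @ w1 + b1)
    a2 = sigmoid(a1 @ w2 + b2)
    dz2 = (a2 - y) * a2 * (1 - a2)
    dz1 = (dz2 @ w2.T) * a1 * (1 - a1)
    return X.T @ dz1, dz1.sum(axis=0), a1.T @ dz2, dz2.sum(axis=0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
w1 = np.array([[3.0, 3.0], [-4.0, -4.0]])
b1 = np.array([2.0, -2.0])
w2 = np.array([[-4.0], [4.0]])
b2 = np.array([2.0])

dw1, db1, dw2, db2 = analytic_grads(X, y, w1, b1, w2, b2)

# Central finite differences for every entry of w1
eps = 1e-6
dw1_num = np.zeros_like(w1)
for i in range(w1.shape[0]):
    for j in range(w1.shape[1]):
        step = np.zeros_like(w1)
        step[i, j] = eps
        plus = loss(X, y, w1 + step, b1, w2, b2)
        minus = loss(X, y, w1 - step, b1, w2, b2)
        dw1_num[i, j] = (plus - minus) / (2 * eps)

max_diff = np.max(np.abs(dw1 - dw1_num))
print("largest difference between analytic and numeric dw1:", max_diff)
```

A very small difference suggests the analytic formulas are consistent with the loss; the same loop could be repeated for `b1`, `w2` and `b2`.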


In [2]:
X

array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]])

In [3]:
y

array([[0],
       [1],
       [1],
       [0]])

In [16]:
np.array([2])

array([2])

In [5]:
b1

array([ 2.46422928, -2.18048758])

In [6]:
w2

array([[-5.34688132],
       [ 5.60314638]])

In [7]:
b2

array([2.45185307])

In [50]:
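# k equals z1 = X·w1 + b1 for the initial weights defined at the top
k = np.array([[2,-2],[-2,-6],[5,1],[1,-3]])
# Apply the sigmoid element-wise to obtain the hidden activations
a = sigmoid(k)
print(a)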

[[0.88079708 0.11920292]
 [0.11920292 0.00247262]
 [0.99330715 0.73105858]
 [0.73105858 0.04742587]]


# TensorFlow Code for the XOR Problem
The following Python code snippet provides a practical example of defining, training, and evaluating a neural network to solve the XOR problem using TensorFlow.

### Explanation of the TensorFlow code:

* We begin by importing TensorFlow, the library that will allow us to define and manipulate our neural network.
* The X and y variables hold our input data and the labels (or targets) respectively. For the XOR problem, we have a simple set of inputs and corresponding outputs.
* We then define hyperparameters: the learning_rate, which controls the size of the steps we take during optimization, and epochs, the number of times the learning algorithm will work through the entire training dataset.
* Next, we construct our neural network model. It’s a sequential model with two layers: the first with 4 neurons and a tanh activation function, and the second with a single neuron with a sigmoid activation function, appropriate for binary classification.
* We then instantiate an SGD (stochastic gradient descent) optimizer with our learning rate. SGD updates each parameter by stepping against the gradient of the loss, scaled by the learning rate.
* Our loss function is the mean squared error, which measures the average of the squares of the errors — that is, the average squared difference between the estimated values and the actual value.
* The compile method configures the model for training, associating it with its optimizer and loss function.
* The fit method trains the model for a fixed number of epochs (iterations on a dataset), and we set verbose=0 to suppress the output for a cleaner display.
* We evaluate the model with the evaluate method, which returns the loss value and, if any were configured, the metric values, computed in test mode.
* We predict the output for our inputs using the predict method.
* Finally, we print our inputs, the actual output, the predicted output, and the loss to observe how well our model performs.

In [60]:
import tensorflow as tf
print("GPUs:", tf.config.list_physical_devices('GPU'))

GPUs: []


In [62]:
import tensorflow as tf

# Define input data
# Data as float32 arrays
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=np.float32)
y = np.array([[0],[1],[1],[0]], dtype=np.float32)

# Define hyperparameters
learning_rate = 0.1
epochs = 500

# Define the model architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, input_shape=(2,), activation='tanh'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Define the optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)

# Define the loss
loss = tf.keras.losses.MeanSquaredError()

# Compile the model
model.compile(optimizer=optimizer, loss=loss)

# Train the model
history = model.fit(X, y, epochs=epochs, verbose=0)

# Evaluate the model
loss = model.evaluate(X, y, verbose=0)

# Predict the output
y_pred = model.predict(X)

# Print the output
print("Input: ", X)
print("Actual Output: ", y)
print("Predicted Output: ", y_pred)
print("Loss: ", loss)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step
Input:  [[0. 0.]
 [0. 1.]
 [1. 0.]
 [1. 1.]]
Actual Output:  [[0.]
 [1.]
 [1.]
 [0.]]
Predicted Output:  [[0.3313114 ]
 [0.5491716 ]
 [0.6538551 ]
 [0.36822328]]
Loss:  0.14210453629493713
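
As a sanity check on the loss definition, the mean squared error can be recomputed by hand from the predictions printed above. This is a small NumPy sketch; the 0.5 decision threshold used to turn outputs into class labels is a common convention assumed here, not something the notebook specifies.

```python
import numpy as np

# Targets and predictions copied from the output above
y_true = np.array([[0.0], [1.0], [1.0], [0.0]])
y_pred = np.array([[0.3313114], [0.5491716], [0.6538551], [0.36822328]])

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)

# Assumed threshold of 0.5 to convert sigmoid outputs to 0/1 labels
labels = (y_pred >= 0.5).astype(int)

print("MSE:", mse)
print("Thresholded labels:", labels.ravel())
```

The recomputed value agrees with the reported loss to within rounding of the printed predictions, and the thresholded labels match the XOR targets even though the raw outputs are still far from 0 and 1.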


# PyTorch Code for the XOR Problem
The following Python code snippet provides a practical example of defining, training, and evaluating a neural network to solve the XOR problem using PyTorch.

In [74]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define input data
X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Define hyperparameters
learning_rate = 0.1
epochs = 500

# Define the model architecture
class XORModel(nn.Module):
    def __init__(self):
        super(XORModel, self).__init__()
        self.layer1 = nn.Linear(2, 4)
        self.layer2 = nn.Linear(4, 1)
        self.tanh = nn.Tanh()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.tanh(self.layer1(x))
        x = self.sigmoid(self.layer2(x))
        return x

model = XORModel()

# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Define the loss
criterion = nn.MSELoss()

# Train the model
for epoch in range(epochs):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(X)

    # Compute the loss
    loss = criterion(y_pred, y)

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Evaluate the model
model.eval()
with torch.no_grad():
    y_pred = model(X)
    loss = criterion(y_pred, y)

# Print the output
print("Input: ", X.numpy())
print("Actual Output: ", y.numpy())
print("Predicted Output: ", y_pred.numpy())
print("Loss: ", loss.item())

MPS  available: True
Input:  [[0. 0.]
 [0. 1.]
 [1. 0.]
 [1. 1.]]
Actual Output:  [[0.]
 [1.]
 [1.]
 [0.]]
Predicted Output:  [[0.48436943]
 [0.46205842]
 [0.5463047 ]
 [0.44501305]]
Loss:  0.2319677323102951


# Using GPU

In [77]:
import torch
import torch.nn as nn
import torch.optim as optim

device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")
torch.manual_seed(0)

X = torch.tensor([[0.,0.],[0.,1.],[1.,0.],[1.,1.]], dtype=torch.float32, device=device)
y = torch.tensor([[0.],[1.],[1.],[0.]], dtype=torch.float32, device=device)

learning_rate = 0.1
epochs = 500

class XORModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.layer2 = nn.Linear(4, 1)
        self.tanh = nn.Tanh()
        self.sigmoid = nn.Sigmoid()
    def forward(self, x):
        x = self.tanh(self.layer1(x))
        x = self.sigmoid(self.layer2(x))   # using Sigmoid + BCELoss
        return x

model = XORModel().to(device)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.BCELoss()

# Report device placement once, before training, instead of on every epoch
print("Chosen device:", device)
print("X on:", X.device, "y on:", y.device)
print("Model params on:", next(model.parameters()).device)

model.train()
for e in range(epochs):
    y_pred = model(X)
    loss = criterion(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("y_pred on:", y_pred.device)

model.eval()
with torch.no_grad():
    y_pred = model(X)
    loss = criterion(y_pred, y)

print("MPS  available:", torch.backends.mps.is_available())

# Move to CPU before calling .numpy()
print("Input: ", X.cpu().numpy())
print("Actual Output: ", y.cpu().numpy())
print("Pred:", y_pred.cpu().numpy().ravel())
print("Loss:", loss.item())


Chosen device: mps
X on: mps:0 y on: mps:0
Model params on: mps:0
y_pred on: mps:0
Chos