# Training a Simple Artificial Neural Network with TensorFlow
This notebook provides a **step-by-step introduction** to the core concepts of training an Artificial Neural Network (ANN) using basic TensorFlow operations. Your task is to **fill in the missing code blocks** based on the accompanying theoretical explanation and hints to complete the learning process.

We will incrementally build a simple network, focusing on the fundamental operations:
1.  **Forward Pass** (Initialization and Prediction)
2.  **Backward Pass** (Gradient Calculation and Basic Weight Update)
3.  **Adding Bias**
4.  **Adding Activation Functions**
5.  **Adding Momentum**
6.  **Keras Implementation** (Comparing manual vs. high-level)

In [None]:
import tensorflow as tf

def loss_function(y_true, y_pred):
    """Calculates the Mean Squared Error (MSE), averaged over every sample in the batch."""
    # tf.reduce_mean averages the squared errors, so single samples and batches both work
    return tf.reduce_mean((y_true - y_pred) ** 2)

## Step 1: Forward Pass (No Bias, No Activation)

The **Forward Pass** is the process of calculating the network's output prediction. In this initial simplified model, the output of each layer is simply the **weighted sum** of its inputs: $h = x W^T$. 
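
As a cross-check of the arithmetic, here is a minimal NumPy sketch of the same weighted sums. It assumes the input and weight values from the code cell in this step and deliberately uses NumPy rather than TensorFlow, so it illustrates the shapes without replacing the exercise.

```python
import numpy as np

# Same numbers as this step's code cell, so the result can be compared
# with the TensorFlow output once the TODOs are filled in.
x = np.array([[2.0, 3.0]])                    # shape (1, 2): 1 sample, 2 features
w1 = np.array([[0.11, 0.21], [0.12, 0.08]])   # shape (2, 2): one row per hidden neuron
w2 = np.array([[0.14, 0.15]])                 # shape (1, 2): one row per output neuron

h1 = x @ w1.T        # (1, 2) @ (2, 2) -> (1, 2)
y_pred = h1 @ w2.T   # (1, 2) @ (2, 1) -> (1, 1)

print("h1:", h1)          # approximately [[0.85 0.48]]
print("y_pred:", y_pred)  # approximately [[0.191]]
```

Because each row of a weight matrix holds the weights of one neuron, the transpose is what lines up the feature axis of `x` with the feature axis of the weights.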

**Task:** Use `tf.matmul` and `tf.transpose` to calculate the hidden layer output (`h1`) and the final prediction (`y_pred`).

In [None]:
# Input tensor (1 sample, 2 features)
x = tf.constant([[2.0, 3.0]], dtype=tf.float32)  # shape (1, 2)

# Initial weights (must be tf.Variable for gradient tracking)
w1 = tf.Variable([[0.11, 0.21], [0.12, 0.08]], dtype=tf.float32)  # Hidden layer weights: shape (2, 2) (2 features -> 2 neurons)
w2 = tf.Variable([[0.14, 0.15]], dtype=tf.float32)  # Output layer weights: shape (1, 2) (2 hidden neurons -> 1 output neuron)

# Target output
y_true = tf.constant([[1.0]], dtype=tf.float32)

# Forward pass calculation:
# --------------------------------------------------------------------------------
# TODO: Calculate the hidden layer output (h1)
# Hint: h1 = x @ w1.T
# h1 = 

# TODO: Calculate the final prediction (y_pred)
# Hint: y_pred = h1 @ w2.T
# y_pred = 
# --------------------------------------------------------------------------------

# Show results
print("Prediction value:", y_pred.numpy())
print("Loss function (Initial MSE):", loss_function(y_true, y_pred).numpy())

## Step 2: Backward Pass and Basic Weight Update

The **Backward Pass** uses `tf.GradientTape` to calculate the gradient of the loss with respect to all `tf.Variable`s. The weights are then updated using the Gradient Descent rule:
$$
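\mathbf{W}_{new} = \mathbf{W}_{old} - \eta \, \nabla J(\mathbf{W}_{old})
$$
where $\eta$ is the learning rate (`lr`) and $\nabla J(\mathbf{W}_{old})$ is the gradient of the loss $J$ with respect to $\mathbf{W}$.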
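
The mechanics of one gradient descent step can be sketched on a toy problem. The scalar parameter `w`, the loss $w^2$, and the learning rate below are illustrative assumptions, not part of the notebook's network.

```python
import tensorflow as tf

w = tf.Variable(3.0)   # toy parameter (hypothetical, not one of the network's weights)
lr = 0.1               # toy learning rate

with tf.GradientTape() as tape:
    loss = w ** 2      # toy loss w^2, whose derivative is 2w

grad = tape.gradient(loss, w)   # 2 * 3.0 = 6.0
w.assign_sub(lr * grad)         # in-place update: w <- w - lr * grad = 3.0 - 0.6

print("gradient:", grad.numpy())   # 6.0
print("updated w:", w.numpy())     # approximately 2.4
```

`assign_sub` modifies the variable in place, so the tape can keep tracking the same `tf.Variable` object on the next step.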

**Task:** Complete the weight update loop using the `var.assign_sub()` method.

In [None]:
# Re-initialize weights to demonstrate the update
w1 = tf.Variable([[0.11, 0.21], [0.12, 0.08]], dtype=tf.float32)
w2 = tf.Variable([[0.14, 0.15]], dtype=tf.float32)
lr = 0.01

print("Initial w1:\n", w1.numpy())

# Backward pass to calculate gradients
with tf.GradientTape() as tape:
    h1 = tf.matmul(x, tf.transpose(w1))
    y_pred = tf.matmul(h1, tf.transpose(w2))
    loss = loss_function(y_true, y_pred)
    
grads = tape.gradient(loss, [w1, w2])

# Weight update using Gradient Descent
# --------------------------------------------------------------------------------
for var, grad in zip([w1, w2], grads):
    # TODO: Implement the weight update W = W - lr * dL/dW
    # Hint: Use var.assign_sub() with the learning rate (lr) and the gradient (grad)

    pass # Delete this line once implemented
# --------------------------------------------------------------------------------

print("\nLoss before update:", loss.numpy())
print("\nGradient for w1:\n", grads[0].numpy())
print("\nUpdated w1 (after 1 step):\n", w1.numpy())

### Practice: A Training Loop

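Combine the forward pass, gradient calculation, and weight update from the previous cells into a complete training loop. The targets are a linear function of the inputs ($y = 0.5 x_1 + 0.5 x_2$), so a linear network without bias is able to fit them.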

In [None]:
# Re-initialize weights for the training loop
w1 = tf.Variable(tf.random.normal([2, 2]), dtype=tf.float32)
w2 = tf.Variable(tf.random.normal([1, 2]), dtype=tf.float32)

x = tf.constant([[0., 0.],[0., 1.],[1., 0.],[1., 1.]], dtype=tf.float32)
y_true = tf.constant([[0.], [0.5], [0.5], [1.]], dtype=tf.float32)
lr = 0.01

print("Starting simple linear training")
for step in range(500):
    # --------------------------------------------------------------------------------
    with tf.GradientTape() as tape:
        # TODO: Implement the Forward Pass (Linear)
        # h1 = 
        # y_pred = 
        
        loss = loss_function(y_true, y_pred)
        
    grads = tape.gradient(loss, [w1, w2])
    
    # TODO: Implement the Weight Update (Gradient Descent)
    # for var, grad in zip(...):
    #     var.assign_sub(...)
    # --------------------------------------------------------------------------------
    
    if step % 100 == 0:
        print(f"Step {step}: Loss = {loss.numpy():.4f}")

print("\nFinal predictions:\n", y_pred.numpy())

## Step 3: Adding Bias

The **Bias term** $\mathbf{b}$ enables the model to learn functions that do not pass through the origin. The layer output equation is now:
$$\mathbf{h} = \mathbf{x} \mathbf{W}^T + \mathbf{b}$$
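
A short sketch of why the bias matters, using arbitrary illustrative weights and bias values (assumptions, not learned parameters). Without a bias, the input $(0, 0)$ always maps to zero, so a target such as $-1$ at the origin (as in this step's data) cannot be reached.

```python
import tensorflow as tf

x = tf.constant([[0., 0.], [1., 1.]])       # 2 samples; the first is the origin
w = tf.constant([[0.5, -0.3], [0.2, 0.7]])  # arbitrary illustrative weights, shape (2, 2)
b = tf.constant([0.1, -0.4])                # arbitrary illustrative bias, shape (2,)

no_bias = tf.matmul(x, tf.transpose(w))
with_bias = tf.matmul(x, tf.transpose(w)) + b  # b is broadcast across the batch dimension

print(no_bias.numpy()[0])    # all zeros: the origin maps to zero whatever w is
print(with_bias.numpy()[0])  # equals b: the bias shifts the output at the origin
```

The bias has shape `(2,)` while the matrix product has shape `(2, 2)`; broadcasting adds the same bias vector to every row, one row per sample.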

**Task:** Modify the forward pass to include the bias terms (`b1`, `b2`). Remember to include all four variables in `trainable_variables` for gradient calculation.

In [None]:
# Train data
x = tf.constant([[0., 0.],[0., 1.],[1., 0.],[1., 1.]], dtype=tf.float32)
y_true = tf.constant([[-1.], [0.], [0.], [1.]], dtype=tf.float32)

# Re-initialize weights and biases
w1 = tf.Variable(tf.random.normal([2, 2]), dtype=tf.float32)
b1 = tf.Variable(tf.zeros([2]), name='b1', dtype=tf.float32) # Bias for the hidden layer
w2 = tf.Variable(tf.random.normal([1, 2]), dtype=tf.float32)
b2 = tf.Variable(tf.zeros([1]), name='b2', dtype=tf.float32) # Bias for the output layer

lr = 0.01
# --------------------------------------------------------------------------------
# TODO: Initialize trainable_variables to include biases
# trainable_variables = []
# --------------------------------------------------------------------------------

print("Starting training with Bias")
for step in range(500):
    with tf.GradientTape() as tape:
        # --------------------------------------------------------------------------------
        # TODO: Add bias to the hidden layer output (tf.matmul + b1)
        # h1 = 
        # TODO: Add bias to the output layer output (tf.matmul + b2)
        # y_pred = 
        # --------------------------------------------------------------------------------
        loss = loss_function(y_true, y_pred)
        
    grads = tape.gradient(loss, trainable_variables)
    
    for var, grad in zip(trainable_variables, grads):
        if grad is not None:
            var.assign_sub(lr * grad)
            
    if step % 100 == 0:
        print(f"Step {step}: Loss = {loss.numpy():.4f}")

print("\nFinal predictions (linear targets, linear model with bias):\n", y_pred.numpy())

## Step 4: Adding Activation Functions

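A stack of linear layers is still a linear model, so it cannot fit a non-linear target such as XOR. To solve a non-linear problem, we must introduce **Activation Functions**. We will use the **Rectified Linear Unit (ReLU)** activation for the hidden layer: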
$$\text{ReLU}(z) = \max(0, z)$$
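
The sketch below illustrates what the activation buys us. The weights are hand-picked for illustration (an assumption, not learned values): without ReLU the two layers collapse into a single linear map, while with ReLU the same weights reproduce XOR.

```python
import numpy as np
import tensorflow as tf

x = tf.constant([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
w1 = tf.constant([[1.0, -1.0], [-1.0, 1.0]])  # hand-picked hidden weights (illustrative)
w2 = tf.constant([[1.0, 1.0]])                # hand-picked output weights (illustrative)

# Two stacked linear layers ...
stacked = tf.matmul(tf.matmul(x, tf.transpose(w1)), tf.transpose(w2))
# ... equal one linear layer whose weight matrix is w2 @ w1.
collapsed = tf.matmul(x, tf.transpose(tf.matmul(w2, w1)))
print(np.allclose(stacked.numpy(), collapsed.numpy()))  # True

# With ReLU in between, the output is no longer a linear function of x.
hidden = tf.nn.relu(tf.matmul(x, tf.transpose(w1)))
relu_out = tf.matmul(hidden, tf.transpose(w2))
print(relu_out.numpy().ravel())  # [0. 1. 1. 0.]
```

Gradient descent from a random start is not guaranteed to find these particular weights; the point is only that a solution exists once the hidden layer is non-linear.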

**Task:** Apply the ReLU function using `tf.nn.relu()` to the hidden layer output (`h1_pre_activation`) before passing it to the output layer.

In [None]:
# Train data
x = tf.constant([[0., 0.],[0., 1.],[1., 0.],[1., 1.]], dtype=tf.float32)
y_true = tf.constant([[0.], [1.], [1.], [0.]], dtype=tf.float32)

# Re-initialize weights and biases for a fresh start with the new component
w1 = tf.Variable(tf.random.normal([2, 2]), dtype=tf.float32)
b1 = tf.Variable(tf.zeros([2]), dtype=tf.float32) 
w2 = tf.Variable(tf.random.normal([1, 2]), dtype=tf.float32)
b2 = tf.Variable(tf.zeros([1]), dtype=tf.float32) 

lr = 0.01
trainable_variables = [w1, b1, w2, b2]

print("Starting training with Bias AND ReLU Activation (XOR data)...")
for step in range(2000): # Increased steps for non-linear problem
    with tf.GradientTape() as tape:
        
        # Hidden Layer: Weighted Sum + Bias
        h1_pre_activation = tf.matmul(x, tf.transpose(w1)) + b1 
        
        # --------------------------------------------------------------------------------
        # TODO: Apply Non-linear Activation (ReLU: tf.nn.relu) to the pre-activation output
        # h1 = 
        # --------------------------------------------------------------------------------
        
        # Output Layer
        y_pred = tf.matmul(h1, tf.transpose(w2)) + b2 

        loss = loss_function(y_true, y_pred)

    grads = tape.gradient(loss, trainable_variables)

    for var, grad in zip(trainable_variables, grads):
        if grad is not None: 
            var.assign_sub(lr * grad)
            
    if step % 200 == 0:
        print(f"Step {step}: Loss = {loss.numpy():.4f}")

# Final prediction (recalculate with the learned non-linear model)
final_h1_pre_activation = tf.matmul(x, tf.transpose(w1)) + b1
final_h1 = tf.nn.relu(final_h1_pre_activation)
final_output = tf.matmul(final_h1, tf.transpose(w2)) + b2

print("\nFinal predictions (XOR targets, non-linear model):\n", final_output.numpy())

## Step 5: Adding Momentum

**Momentum** is an optimization technique that helps accelerate gradient descent by accumulating a velocity vector $\mathbf{v}_{t}$ from previous steps. 

The update rule for a weight $\mathbf{W}$ becomes:
$$\mathbf{v}_{t} = \mu \mathbf{v}_{t-1} + \eta \nabla J(\mathbf{W}_{t})$$
$$\mathbf{W}_{t+1} = \mathbf{W}_{t} - \mathbf{v}_{t}$$
where $\mu$ is the momentum coefficient.
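
To see the two equations in action, here is a plain-Python sketch on a toy scalar problem. The loss $J(w) = w^2$, the helper `grad_J`, and the hyperparameter values are illustrative assumptions; the notebook's exercise applies the same rule to tensors.

```python
def grad_J(w):
    """Gradient of the toy loss J(w) = w**2 (hypothetical helper)."""
    return 2.0 * w

w, v = 1.0, 0.0      # toy parameter and initial velocity
lr, mu = 0.1, 0.9    # toy learning rate and momentum coefficient

history = []
for step in range(3):
    v = mu * v + lr * grad_J(w)   # v_t = mu * v_{t-1} + eta * grad
    w = w - v                     # W_{t+1} = W_t - v_t
    history.append((round(v, 4), round(w, 4)))

print(history)  # [(0.2, 0.8), (0.34, 0.46), (0.398, 0.062)]
```

The step size grows from 0.2 to 0.34 to 0.398 while the gradients keep pointing the same way, whereas plain gradient descent from the same start would take shrinking steps of 0.2, 0.16, and 0.128.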

**Task:** Implement the two-step momentum update inside the loop.

In [None]:
# Reset weights and biases
w1 = tf.Variable(tf.random.normal([2, 2]), dtype=tf.float32)
b1 = tf.Variable(tf.zeros([2]), dtype=tf.float32)
w2 = tf.Variable(tf.random.normal([1, 2]), dtype=tf.float32)
b2 = tf.Variable(tf.zeros([1]), dtype=tf.float32)

trainable_variables = [w1, b1, w2, b2]

# Initialize Momentum 'Velocity' vectors to zero
v_w1, v_b1 = tf.zeros_like(w1), tf.zeros_like(b1)
v_w2, v_b2 = tf.zeros_like(w2), tf.zeros_like(b2)
velocity_variables = [v_w1, v_b1, v_w2, v_b2]

# Hyperparameters
lr = 0.01
momentum = 0.9

print("Starting training with Momentum (and ReLU)...")
for step in range(2000):
    with tf.GradientTape() as tape:
        # Forward Pass (with Bias and ReLU)
        h1_pre_activation = tf.matmul(x, tf.transpose(w1)) + b1 
        h1 = tf.nn.relu(h1_pre_activation) 
        y_pred = tf.matmul(h1, tf.transpose(w2)) + b2 
        loss = loss_function(y_true, y_pred)

    grads = tape.gradient(loss, trainable_variables)

    # Apply updates with Momentum
    new_velocity_variables = []
    # --------------------------------------------------------------------------------
    for var, grad, vel in zip(trainable_variables, grads, velocity_variables):
        if grad is not None:
            # TODO 1: Calculate new velocity vector: v_t = mu * v_{t-1} + eta * grad
            # new_vel = 
            
            # TODO 2: Update weight: W = W - V_t (using var.assign_sub)
            # var.assign_sub(...)
            
            new_velocity_variables.append(new_vel)
        else:
            new_velocity_variables.append(vel) # keep old velocity if no gradient
            
    velocity_variables = new_velocity_variables
    # --------------------------------------------------------------------------------
    
    if step % 200 == 0:
        print(f"Step {step}: Loss = {loss.numpy():.4f}")

# Final prediction (recalculate with the learned non-linear model)
final_h1_pre_activation = tf.matmul(x, tf.transpose(w1)) + b1
final_h1 = tf.nn.relu(final_h1_pre_activation)
final_output = tf.matmul(final_h1, tf.transpose(w2)) + b2

print("\nFinal predictions (Momentum enabled):\n", final_output.numpy())

## Step 6: Using Keras for High-Level Implementation

In practice, most deep learning tasks use TensorFlow's high-level API, Keras, which handles all the manual steps we performed (variables, gradients, updates, momentum, etc.) automatically. This section replicates the XOR problem using Keras.
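
One difference from the manual code is worth knowing when comparing weights: a Keras `Dense` layer stores its kernel with shape `(inputs, units)`, the transpose of the `(neurons, features)` layout used for `w1` and `w2` in this notebook. The sketch below checks this on a throwaway layer; the choice of 3 units and the sample inputs are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

x = tf.constant([[0., 1.], [1., 1.]])   # 2 samples, 2 features
layer = tf.keras.layers.Dense(3)        # 3 units, chosen only to make the shapes distinguishable
out = layer(x)                          # the first call builds the layer's weights

kernel, bias = layer.get_weights()      # returned as NumPy arrays
print(kernel.shape, bias.shape)         # (2, 3) (3,)

manual = x.numpy() @ kernel + bias      # no transpose: the kernel is stored as (inputs, units)
print(np.allclose(out.numpy(), manual, atol=1e-6))  # True
```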

**Task:** Complete the model definition and the `model.compile` call. Note that we use the `tanh` activation here: with only two hidden neurons, a ReLU unit can get stuck outputting zero for every input, so `tanh` often trains more reliably on XOR.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# 1. Prepare Data (x and y_true still hold the XOR data defined in Step 4)

# 2. Build the Model (equivalent to our two-layer network)
# --------------------------------------------------------------------------------
# TODO: Define the hidden layer (2 neurons, 'tanh' activation, input shape (2,))
# TODO: Define the output layer (1 neuron, default linear activation)
model = Sequential([
    # Dense(2, activation='tanh', input_shape=(2,)),
    # Dense(...)
])

# 3. Compile the Model (Define Optimizer, Loss, and Metrics)
# TODO: Use the Adam optimizer with a learning rate of 0.1
model.compile(
    # optimizer=,
    loss='mse',
    metrics=['mean_squared_error']
)
# --------------------------------------------------------------------------------

# 4. Train the Model
print("\nStarting Keras Training")
history = model.fit(
    x, y_true,
    epochs=100, 
    verbose=1  # 1 prints progress for every epoch; set to 0 to silence it
)

# 5. Evaluate and Print Results
final_loss = history.history['loss'][-1]
print(f"Training Complete. Final Loss: {final_loss:.4f}")

final_predictions = model.predict(x)
print("\nFinal Keras Predictions (XOR targets):\n", final_predictions)