
Exercise 2.1: Manually Calculate Output of Simple Feedforward Neural Network with ReLU Activation
Problem:

Given a neural network with:

    2 input neurons,
    1 hidden layer with 3 neurons (ReLU activation),
    1 output neuron (ReLU activation),

calculate the output manually.

Let's denote:

    Input vector $x = [x_1, x_2]$
    Weights from input to hidden layer $W_1 = [[w_{11}, w_{12}], [w_{21}, w_{22}], [w_{31}, w_{32}]]$
    Biases for hidden layer $b_1 = [b_{h1}, b_{h2}, b_{h3}]$
    Weights from hidden layer to output layer $W_2 = [w_{out1}, w_{out2}, w_{out3}]$
    Bias for output layer $b_2 = [b_{out}]$

Using the ReLU function $\mathrm{ReLU}(x) = \max(0, x)$.
Solution:

In [2]:
import numpy as np
import torch
import torch.nn as nn

# Define inputs, weights, and biases
x = np.array([1.0, 2.0])  # Input vector
W1 = np.array([[0.5, -0.6], [0.3, 0.8], [-0.2, 0.1]])  # Weights from input to hidden layer
b1 = np.array([0.1, -0.3, 0.2])  # Biases for hidden layer

W2 = np.array([0.7, -1.2, 0.5])  # Weights from hidden layer to output
b2 = np.array([0.3])  # Bias for output layer

# Manual Calculation of Hidden Layer Activation
z_h = np.dot(W1, x) + b1
a_h = np.maximum(0, z_h)  # Apply ReLU activation
print("Hidden layer activation:", a_h)

# Manual Calculation of Output Layer Activation
z_out = np.dot(W2, a_h) + b2
a_out = np.maximum(0, z_out)  # Apply ReLU activation
print("Output layer activation:", a_out)


Hidden layer activation: [0.  1.6 0.2]
Output layer activation: [0.]
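
Since the exercise asks for a manual calculation, the printed values can also be reproduced by hand. A worked version using the numbers defined in the code above, where $z_h$ and $z_{out}$ are the pre-activations:

$$
z_h = W_1 x + b_1 =
\begin{bmatrix}
0.5 \cdot 1 - 0.6 \cdot 2 + 0.1 \\
0.3 \cdot 1 + 0.8 \cdot 2 - 0.3 \\
-0.2 \cdot 1 + 0.1 \cdot 2 + 0.2
\end{bmatrix}
=
\begin{bmatrix} -0.6 \\ 1.6 \\ 0.2 \end{bmatrix},
\qquad
a_h = \mathrm{ReLU}(z_h) = \begin{bmatrix} 0 \\ 1.6 \\ 0.2 \end{bmatrix}
$$

$$
z_{out} = W_2 a_h + b_2 = 0.7 \cdot 0 - 1.2 \cdot 1.6 + 0.5 \cdot 0.2 + 0.3 = -1.52,
\qquad
a_{out} = \mathrm{ReLU}(-1.52) = 0
$$

The output pre-activation is negative, so the ReLU clips the network output to 0.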


Exercise 2.2: Calculate Output with Sigmoid Activation
Problem:

A simple feedforward network with:

    2 input neurons,
    1 hidden layer with 1 neuron (Sigmoid activation),
    1 output neuron (Sigmoid activation).

Given input vector $x = [x_1, x_2]$, weights, and biases, calculate the output.
Solution:

In [3]:
# Define input, weights, and biases
x = np.array([1.0, 2.0])  # Input vector
W1 = np.array([0.4, -0.7])  # Weights for input to hidden layer
b1 = -0.1  # Bias for hidden layer

W2 = np.array([0.6])  # Weight for hidden layer to output layer
b2 = 0.2  # Bias for output layer

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Calculate hidden layer activation
z_h = np.dot(W1, x) + b1
a_h = sigmoid(z_h)  # Sigmoid activation
print("Hidden layer activation:", a_h)

# Calculate output layer activation
z_out = np.dot(W2, a_h) + b2
a_out = sigmoid(z_out)  # Sigmoid activation
print("Output layer activation:", a_out)


Hidden layer activation: 0.24973989440488245
Output layer activation: [0.58657973]
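
As a cross-check of the NumPy result, the same forward pass can be written out term by term with plain Python floats. This is a minimal sketch; the scalar names (`w1`, `w2`, `b_hidden`, `w_out`, `b_out`) and the helper `sigmoid_scalar` are introduced here for illustration and mirror the values in the cell above.

```python
import math

def sigmoid_scalar(z):
    """Logistic sigmoid for a single float."""
    return 1.0 / (1.0 + math.exp(-z))

# Same values as in the NumPy cell above
x1, x2 = 1.0, 2.0
w1, w2 = 0.4, -0.7        # input -> hidden weights
b_hidden = -0.1           # hidden bias
w_out, b_out = 0.6, 0.2   # hidden -> output weight and bias

# Hidden neuron: weighted sum of the inputs plus bias, then sigmoid
z_hidden = w1 * x1 + w2 * x2 + b_hidden   # 0.4 - 1.4 - 0.1 = -1.1
a_hidden = sigmoid_scalar(z_hidden)

# Output neuron: weighted hidden activation plus bias, then sigmoid
z_output = w_out * a_hidden + b_out
a_output = sigmoid_scalar(z_output)

print(f"z_hidden = {z_hidden:.4f}, a_hidden = {a_hidden:.6f}")
print(f"z_output = {z_output:.6f}, a_output = {a_output:.6f}")
```

The hidden pre-activation is $-1.1$, which the sigmoid maps to about 0.25, in agreement with the printed output above.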


Exercise 2.3: Weights Increase with Depth vs. Width
Problem:

For a network with:

    Input dimension $D_i = 1$,
    Output dimension $D_o = 1$,
    $K = 10$ layers,
    $D = 10$ hidden units per layer.

Determine whether increasing depth or width by 1 results in a greater increase in the number of weights.
Solution:

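Approximating every layer as a $D \times D$ weight matrix, the network has about $K \cdot D^2$ weights, so the baseline is $10 \times 10^2 = 1000$.

If we increase depth by 1:

    Total weights: $11 \times 10^2 = 1100$ (an increase of 100)

If we increase width by 1 (from 10 to 11 hidden units per layer):

    Total weights: $10 \times 11^2 = 1210$ (an increase of 210)

Conclusion: Increasing width by 1 adds roughly twice as many weights as increasing depth by 1, because the weight count grows quadratically with width but only linearly with depth.
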
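
The totals above treat every layer as a $D \times D$ matrix. An exact count also accounts for the narrower input and output layers. The sketch below assumes that $K$ counts the hidden layers and that biases are not counted; `count_weights` is a helper introduced here for illustration.

```python
def count_weights(d_in, d_out, num_layers, width):
    """Count weights (biases excluded) in a fully connected network with
    `num_layers` hidden layers of `width` units each."""
    input_weights = d_in * width                     # input -> first hidden layer
    hidden_weights = (num_layers - 1) * width ** 2   # between consecutive hidden layers
    output_weights = width * d_out                   # last hidden layer -> output
    return input_weights + hidden_weights + output_weights

baseline = count_weights(1, 1, 10, 10)
deeper = count_weights(1, 1, 11, 10)   # depth increased by 1
wider = count_weights(1, 1, 10, 11)    # width increased by 1

print("Baseline:", baseline)
print("Depth +1:", deeper, "increase:", deeper - baseline)
print("Width +1:", wider, "increase:", wider - baseline)
```

Under these assumptions the baseline has 920 weights, one extra layer adds 100, and one extra unit per layer adds 191, so the conclusion is unchanged.
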
Exercise 2.4: Handling Different Output Scales in Multivariate Regression
Problem:

In a multivariate regression problem where one output is height (in meters) and another is weight (in kilos), the range of values may differ significantly.
Solution:

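This discrepancy can cause issues in training: with a squared-error loss, the output with the larger numerical range (here, weight) produces larger errors and gradients, so it can dominate the loss while the height output is fitted poorly. Solutions include:

    Normalization: Standardize each output to have mean 0 and standard deviation 1 over the training set.
    Weighted Loss Function: Weight each output's loss term so that both contribute comparably, for example by using a separate variance per output.
    Scaled Outputs: Rescale the targets before training and apply the inverse scaling to the predictions afterwards.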
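
The normalization option can be sketched in a few lines of NumPy. The target values below are made up purely for illustration, and `standardize` / `unstandardize` are hypothetical helper names; the key point is that the statistics are computed per output column and reused to map predictions back to the original units.

```python
import numpy as np

# Made-up targets for illustration: columns are height (m) and weight (kg)
targets = np.array([
    [1.60, 55.0],
    [1.75, 72.0],
    [1.82, 90.0],
    [1.68, 63.0],
])

# Per-output statistics, computed on the training targets only
target_mean = targets.mean(axis=0)
target_std = targets.std(axis=0)

def standardize(y):
    """Map targets to zero mean and unit standard deviation per column."""
    return (y - target_mean) / target_std

def unstandardize(y_scaled):
    """Map standardized values (e.g. model predictions) back to original units."""
    return y_scaled * target_std + target_mean

scaled = standardize(targets)
restored = unstandardize(scaled)

print("Scaled means:", scaled.mean(axis=0))
print("Scaled stds: ", scaled.std(axis=0))
```

A model would be trained on `scaled`, and its predictions passed through `unstandardize` before being reported in meters and kilograms.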

Exercise 2.5: Comparing SGD and Adam Optimizers
Part (a): Explanation of Optimizers

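    SGD: Updates the weights in the direction of the negative gradient of the loss, estimated on a mini-batch, using one learning rate for all parameters. Adding momentum smooths the updates and can help the optimizer move through small local minima and flat regions.
    Adam: Combines momentum with per-parameter adaptive step sizes. It keeps two running averages, one of the gradients and one of the squared gradients, and scales each update by their ratio. It often converges in fewer iterations in practice.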
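
Written out, the two update rules look as follows. This is one common formulation (for SGD with momentum, the form used by PyTorch). The symbols are notation introduced here rather than taken from the exercise: $\theta$ for the parameters, $g_t$ for the mini-batch gradient at step $t$, $\alpha$ for the learning rate, $\mu$ for the momentum coefficient, and $u_t$ for the momentum buffer.

SGD with momentum:

$$
u_t = \mu \, u_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \alpha \, u_t
$$

Adam, with the two running averages $m_t$ and $v_t$ and their bias-corrected versions:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

The division by $\sqrt{\hat{v}_t}$ is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps.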

Part (b): Advantages and Disadvantages

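    SGD:
        Pros: Simple, with few hyperparameters and little memory overhead; often reported to generalize well when carefully tuned.
        Cons: Slower convergence; sensitive to the learning rate and its schedule.
    Adam:
        Pros: Faster convergence; adaptive per-parameter learning rates.
        Cons: Can generalize worse than well-tuned SGD on some tasks; stores two extra values per parameter.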

Part (c): Adam Hyperparameters

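    Beta: The pair $(\beta_1, \beta_2)$ sets the decay rates of the running averages of the gradients and the squared gradients (the defaults 0.9 and 0.999 are usually effective).
    Gamma: Not a hyperparameter in the original Adam paper; its meaning depends on the source. The PyTorch documentation, for example, writes the learning rate (the lr argument) as $\gamma$.
    Epsilon: A small constant added to the denominator of the update to prevent division by zero. Smaller values keep the update closer to the unmodified ratio but give less protection when the squared-gradient average is near zero.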
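
To make the role of each hyperparameter concrete, here is a minimal NumPy sketch of a single Adam update written by hand. The function `adam_step` and the toy objective are assumptions made for illustration; this is not how PyTorch implements the optimizer internally.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameters and state."""
    m = beta1 * m + (1 - beta1) * grad         # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (an assumption for illustration): 0.5 * ||theta||^2,
# whose gradient is simply theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

theta_after, m, v = adam_step(theta, grad=theta, m=m, v=v, t=1)
print("Parameters after one step:", theta_after)
```

On the first step the bias-corrected averages reduce to the gradient and its square, so each parameter moves by approximately `lr` against the sign of its gradient, regardless of the gradient's magnitude.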

Example code snippet for Adam in PyTorch:

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define model, loss, and optimizer
model = nn.Linear(2, 1)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-08)

# Training step (the input does not need requires_grad; only the model parameters are optimized)
input_data = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[0.5]])
optimizer.zero_grad()
output = model(input_data)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()

print("Updated weights:", model.weight)
print("Updated bias:", model.bias)


This code demonstrates a basic forward and backward pass with Adam, showing how the optimizer updates weights based on gradients.