### Exercises

1. Given a tensor of size 100 with random numbers, obtain a new tensor that contains the 10 largest elements

In [1]:
import torch
from torch import nn

In [None]:
# tensor of random numbers
torch.manual_seed(1729)
r = torch.rand(100)
print(r)


######### YOUR CODE HERE:
# using sort
sorted_r, _ = torch.sort(r, descending=True)  # sort returns (values, indices)
largest_r = sorted_r[:10]
print(largest_r)

# using argsort
inds = torch.argsort(r, descending=True)
largest_r = r[inds[:10]]
print(largest_r)
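
As a further alternative, `torch.topk` returns the k largest values together with their indices in a single call. The following is a minimal sketch using the same seeded tensor as above; the variable names `top_values` and `top_inds` are chosen here for illustration.

```python
import torch

torch.manual_seed(1729)
r = torch.rand(100)

# topk returns (values, indices); the values come back sorted in descending order by default
top_values, top_inds = torch.topk(r, k=10)
print(top_values)

# the result agrees with the sort-based solution
sorted_r, _ = torch.sort(r, descending=True)
print(torch.equal(top_values, sorted_r[:10]))
```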

2. Given a tensor of size 64x100, obtain for each row the index of the maximum value

In [None]:
# tensor of random numbers
torch.manual_seed(1729)
r = torch.rand(64, 100)
print(r)


######### YOUR CODE HERE:
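# using max, which returns both the maximum values and their indices
row_max, inds = torch.max(r, dim=1)
print(inds)

# using argmax, which returns only the indices
inds = torch.argmax(r, dim=1)
print(inds)

# the indices can be used to look up the maximum value of each row
row_max = r[torch.arange(64), inds]
print(row_max)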
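
When called with a `dim` argument, `torch.max` returns both the maximum values and their indices, so the two approaches can be cross-checked. The sketch below does this on a smaller tensor whose shape is chosen purely for illustration; `torch.gather` is one possible way to pick the element at a given column index in each row.

```python
import torch

torch.manual_seed(0)
r = torch.rand(4, 6)  # small example tensor (shape chosen for illustration)

values, inds = torch.max(r, dim=1)
print(inds)

# argmax yields the same indices
print(torch.equal(inds, torch.argmax(r, dim=1)))

# gather picks, for each row, the element at the given column index
picked = torch.gather(r, dim=1, index=inds[:, None]).squeeze(1)
print(torch.equal(picked, values))
```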

3. Replicate the result of the broadcasting example from the PyTorch dimensions section without broadcasting. \
Hint: You can use the torch.tile function to obtain a tensor with the same size as the image and perform elementwise multiplication.

In [None]:
# we create a pseudo-image first
image = torch.randn(3, 224, 224)
print(f'The shape of the image tensor: {image.shape}')
# we want to multiply the red channel by 0, the green channel by 0, and the blue channel by 1
multiplier = torch.tensor([0, 0, 1])
# We add dimensions to the multiplier tensor to make it compatible with the image tensor
multiplier = multiplier[:, None, None]


######### YOUR CODE HERE:
multiplier = torch.tile(multiplier, [1, 224, 224])
print(f'\nThe shape of the multiplier tensor: {multiplier.shape}')
product = image * multiplier
# All red and green values will be zeros
print('\nThe red and green values are zero:')
print(product[[0, 1], :, :])
# All blue values will be the same as in the original image
print('\nThe blue values are the same as in the original image:')
print(product[2, :, :])
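
To confirm that the tiled version really replicates the broadcasting result, both products can be compared directly. This is a small sketch; the names `product_broadcast` and `product_tiled` are introduced here for illustration.

```python
import torch

image = torch.randn(3, 224, 224)
multiplier = torch.tensor([0, 0, 1])[:, None, None]  # shape (3, 1, 1)

# with broadcasting: the (3, 1, 1) multiplier is expanded implicitly
product_broadcast = image * multiplier

# without broadcasting: the multiplier is tiled explicitly to (3, 224, 224)
product_tiled = image * torch.tile(multiplier, (1, 224, 224))

print(torch.equal(product_broadcast, product_tiled))
```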

4. Given a tensor of size 10, obtain a new tensor of size 100 where each element in the original tensor is repeated 10 times. \
This is different from repeating the full tensor 10 times which could be done with torch.tile.

In [None]:
r = torch.arange(10)


######### YOUR CODE HERE:
r_repeated = torch.repeat_interleave(r, 10)
print(r_repeated)
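
The difference between the two functions is easiest to see on a short tensor. A minimal sketch with a tensor of size 3 (chosen here only to keep the output readable):

```python
import torch

r = torch.arange(3)

interleaved = torch.repeat_interleave(r, 2)  # each element is repeated in place
tiled = torch.tile(r, (2,))                  # the whole tensor is repeated

print(interleaved)  # tensor([0, 0, 1, 1, 2, 2])
print(tiled)        # tensor([0, 1, 2, 0, 1, 2])
```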

5. What does the torch.flatten function do? Apply the function to a batch of pseudo images. \
Only flatten the images and not the batch dimension!

In [None]:
images = torch.rand(64, 3, 224, 224)


######### YOUR CODE HERE:
images_flattened = torch.flatten(images, start_dim=1)
print(images_flattened.shape)
# torch.flatten collapses a range of dimensions (from start_dim to end_dim) into a single dimension.
# By default (start_dim=0, end_dim=-1) all dimensions are collapsed, so the result is one-dimensional.
# The values from the flattened dimensions are concatenated into the new dimension.
# In this case the new dimension has a size of 3*224*224=150528.
# With start_dim=1, the first dimension is not flattened, so the batch size of 64 is preserved.
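
The effect of `start_dim` can be checked on a smaller batch. The sketch below uses a hypothetical batch of shape (4, 3, 8, 8) and compares `torch.flatten` with an equivalent `reshape` call.

```python
import torch

images = torch.rand(4, 3, 8, 8)  # small hypothetical batch

flat = torch.flatten(images, start_dim=1)
print(flat.shape)  # torch.Size([4, 192])

# reshape with -1 infers the size of the flattened dimension
print(torch.equal(flat, images.reshape(4, -1)))

# without start_dim, the batch dimension is flattened as well
print(torch.flatten(images).shape)  # torch.Size([768])
```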

6. Define a linear layer with 5 in_features and 2 out_features. Apply the layer to the inputs and inspect the output shape and values. \
Obtain the same output shape by manually defining a (random) weights matrix and multiplying the inputs with it using matrix multiplication.

In [None]:
inputs = torch.rand(8, 5)
print(inputs)


######### YOUR CODE HERE:
linear_layer = nn.Linear(in_features=5, out_features=2)
result = linear_layer(inputs)
print('\n')
print(result.shape)
print(result)

weights = torch.randn(5, 2)
result = torch.matmul(inputs, weights)
print('\n')
print(result.shape)
print(result)
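
The manual product above matches the layer output only in shape, because the weights are random and no bias is added. To reproduce the values as well, one can reuse the layer's own parameters. This is a sketch: `nn.Linear` stores its weight with shape (out_features, in_features), hence the transpose; the name `manual` is illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
inputs = torch.rand(8, 5)
linear_layer = nn.Linear(in_features=5, out_features=2)

# the weight has shape (2, 5), so it is transposed before the matrix multiplication
manual = torch.matmul(inputs, linear_layer.weight.T) + linear_layer.bias

print(linear_layer.weight.shape)
print(torch.allclose(manual, linear_layer(inputs), atol=1e-6))
```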

7. What does the nn.ReLU module do? Apply the module to a random tensor and inspect the output.

In [None]:
r = torch.randn(100)  # randn also yields negative values (rand only samples from [0, 1))


######### YOUR CODE HERE:
relu = nn.ReLU()
relu(r)
# The ReLU module sets all negative values to zero and leaves non-negative values unchanged.
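
The effect is only visible if the input contains negative values, which `torch.rand` never produces because it samples from [0, 1). The sketch below therefore uses `torch.randn` and compares the module with an elementwise clamp at zero.

```python
import torch
from torch import nn

torch.manual_seed(0)
r = torch.randn(100)  # standard normal samples, so negative values occur

out = nn.ReLU()(r)

# ReLU computes max(0, x) elementwise, which is the same as clamping at zero
print(torch.equal(out, torch.clamp(r, min=0)))

# every negative input ends up as a zero in the output
print((r < 0).sum().item(), (out == 0).sum().item())
```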

8. You already applied a linear layer with a specific number of input and output features to a batch of inputs in exercise 6. \
Now your task is to apply multiple linear layers and relu functions alternately to a batch of inputs. \
The three linear layers should have the following input/output feature sizes: 5/10, 10/20 and 20/1. \
Each linear layer except for the last one should be followed by a relu function.

In [None]:
inputs = torch.rand(8, 5)
print(inputs.shape)


######### YOUR CODE HERE:
# apply modules one by one:
linear1 = nn.Linear(5, 10)
linear2 = nn.Linear(10, 20)
linear3 = nn.Linear(20, 1)
relu1 = nn.ReLU()
relu2 = nn.ReLU()

x = linear1(inputs)
x = relu1(x)
x = linear2(x)
x = relu2(x)
outputs = linear3(x)
print(outputs.shape)

# you can also use nn.ModuleList to apply the modules in a loop:
module_list = nn.ModuleList([linear1, relu1, linear2, relu2, linear3])
x = inputs
for module in module_list:
    x = module(x)
print(x.shape)

# you can also use the nn.Sequential module to directly apply the modules in a sequence:
# It automatically applies the modules in the order they are passed:
# The star operator (*) can be used to unpack a list and pass its elements as arguments to a function.
modules = nn.Sequential(*module_list)
outputs = modules(inputs)
print(outputs.shape)
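
The modules can also be passed to `nn.Sequential` directly, without creating them one by one first. The sketch below does so and additionally counts the learnable parameters as a sanity check on the layer sizes; the names `model` and `n_params` are chosen here for illustration.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(5, 10),
    nn.ReLU(),
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 1),
)

inputs = torch.rand(8, 5)
print(model(inputs).shape)  # torch.Size([8, 1])

# weights plus biases: (5*10 + 10) + (10*20 + 20) + (20*1 + 1) = 301
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 301
```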

9. Bonus: We have not yet talked about why it makes sense to stack multiple linear layers and relu functions behind each other. \
So do not worry about this in too much detail for now. However, to provide some food for thought: \
Do you have an idea why it does NOT make sense to simply stack multiple linear layers behind each other? \
How is this counteracted by adding relu functions? \
Justify your answer in words (or with a small mathematical proof).

Hint: Consider the representational capacity of multiple linear layers stacked behind each other.



######### YOUR ANSWER HERE: 

The representational capacity of multiple linear layers stacked behind each other does not increase compared to a single linear layer! Consider a minimal example. The input is two-dimensional (x1, x2), and we apply two linear layers to it, each with 2 input and 2 output features. For simplicity, we omit the bias again. We denote the outputs of the first and second linear layer by (h1, h2) and (h3, h4) respectively:

first layer outputs: \
h1 = w1_1 * x1 + w1_2 * x2 \
h2 = w2_1 * x1 + w2_2 * x2

second layer outputs (takes result of first layer as input): \
h3 = w3_1 * h1 + w3_2 * h2 \
h4 = w4_1 * h1 + w4_2 * h2

We can now plug in the definition of h1 and h2 from the first layer output: \
h3 = w3_1 * (w1_1 * x1 + w1_2 * x2) + w3_2 * (w2_1 * x1 + w2_2 * x2) \
h4 = w4_1 * (w1_1 * x1 + w1_2 * x2) + w4_2 * (w2_1 * x1 + w2_2 * x2)

If we multiply everything out, we get: \
h3 = (w3_1 * w1_1 + w3_2 * w2_1) * x1 + (w3_1 * w1_2 + w3_2 * w2_2) * x2 \
h4 = (w4_1 * w1_1 + w4_2 * w2_1) * x1 + (w4_1 * w1_2 + w4_2 * w2_2) * x2

Notice that the outputs h3 and h4 are linear combinations of the inputs x1 and x2, where the coefficients are just new weights that are combinations of the original weights. Therefore, the result is still a simple linear transformation of the input:

h3 = q1_1 * x1 + q1_2 * x2 \
h4 = q2_1 * x1 + q2_2 * x2

where the q's are defined as follows: \
q1_1 = w3_1 * w1_1 + w3_2 * w2_1 \
q1_2 = w3_1 * w1_2 + w3_2 * w2_2 \
q2_1 = w4_1 * w1_1 + w4_2 * w2_1 \
q2_2 = w4_1 * w1_2 + w4_2 * w2_2

Conclusion:
This implies that stacking multiple linear layers does not increase the representational capacity beyond what a single linear layer can achieve. The reason is that a composition of linear transformations is still a linear transformation. Thus, stacking multiple linear layers without any non-linear activation functions between them is equivalent to a single linear layer.
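
The same argument can be sketched in matrix notation, this time including the biases that were omitted above. The symbols $W_1, W_2$ (weight matrices) and $b_1, b_2$ (bias vectors) are introduced here for illustration:

$$
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = Q x + c
$$

with $Q = W_2 W_1$ and $c = W_2 b_1 + b_2$. The entries of $Q$ are exactly the q's defined above, and $c$ is simply a new bias vector, so the stacked layers again reduce to a single linear layer.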

The Role of ReLU:
By adding ReLU (or any other non-linear activation function) between the layers, we introduce non-linearity into the model. This non-linearity allows the network to model more complex functions that go beyond simple linear transformations. Specifically, ReLU can "break" the linearity by zeroing out negative values, allowing the model to create more complex decision boundaries and learn more intricate patterns in the data (more detail on that tomorrow).

Here is a small code demonstration of the mini "proof" provided above:

In [None]:
# Define the input tensor
x = torch.tensor([1.0, 2.0])  # Example input (x1 = 1, x2 = 2)

# Define the weights for the first linear layer
W1 = torch.tensor([[1.0, 2.0],  # w1_1, w1_2
                   [3.0, 4.0]])  # w2_1, w2_2

# Define the weights for the second linear layer
W2 = torch.tensor([[5.0, 6.0],  # w3_1, w3_2
                   [7.0, 8.0]])  # w4_1, w4_2

# First linear layer
h = torch.matmul(W1, x)  # h1, h2

# Second linear layer
output_stacked = torch.matmul(W2, h)  # h3, h4

# Now, let's calculate the equivalent single linear transformation
W_combined = torch.matmul(W2, W1)
output_single = torch.matmul(W_combined, x)  # Equivalent single linear transformation

# Print results
print("Output of stacked linear layers:", output_stacked)
print("Output of single equivalent linear layer:", output_single)

# Check if they are equivalent
equivalent = torch.allclose(output_stacked, output_single)
print("Are the outputs equivalent?", equivalent)
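
The role of ReLU can be illustrated in a similar way. A linear map f always satisfies f(a + b) = f(a) + f(b); the sketch below shows that this holds for two stacked linear layers but fails once a ReLU is placed between them. The weights and inputs are hand-picked for illustration and are not taken from the example above.

```python
import torch

W1 = torch.eye(2)                # first layer: identity weights (chosen for illustration)
W2 = torch.tensor([[1.0, 1.0]])  # second layer: sums its two inputs

def stacked_linear(x):
    return torch.matmul(W2, torch.matmul(W1, x))

def stacked_with_relu(x):
    return torch.matmul(W2, torch.relu(torch.matmul(W1, x)))

a = torch.tensor([1.0, -1.0])
b = torch.tensor([-1.0, 1.0])

# without ReLU, additivity holds: both sides are 0
print(stacked_linear(a + b), stacked_linear(a) + stacked_linear(b))

# with ReLU, additivity breaks: the left side is 0, the right side is 2
print(stacked_with_relu(a + b), stacked_with_relu(a) + stacked_with_relu(b))
```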