## Assignment 12

## 1.	How does unsqueeze help us to solve certain broadcasting problems?

Ans=>

The unsqueeze function in PyTorch inserts a new dimension of size 1 into a tensor at a specified position. Broadcasting aligns shapes starting from the trailing dimension, so two tensors sometimes fail to broadcast, or broadcast along the wrong axis, even though the intended operation is clear. Inserting a size-1 dimension with unsqueeze fixes the alignment. For example, a vector of shape (3,) cannot be added directly to a tensor of shape (3, 4), because the trailing dimensions 3 and 4 do not match. Calling unsqueeze(1) turns the vector into shape (3, 1), which broadcasts across the 4 columns, so element i of the vector is added to every entry in row i. (A scalar needs no unsqueeze; it already broadcasts against any shape.)
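As a minimal sketch of a case where unsqueeze is actually required, the snippet below adds a length-3 vector to a (3, 4) tensor one row at a time. The tensor values and the names `m`, `v`, `col`, and `out` are made up for illustration.

```python
import torch

m = torch.arange(12.).reshape(3, 4)   # shape (3, 4)
v = torch.tensor([10., 20., 30.])     # shape (3,)

# m + v would raise an error: the trailing dimensions 4 and 3 do not match.
col = v.unsqueeze(1)                  # shape (3, 1)
out = m + col                         # (3, 4) + (3, 1) broadcasts to (3, 4)

print(col.shape)
print(out)
```

Here `v[i]` ends up added to every entry of row `i`, which plain `m + v` cannot express.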

## 2.	How can we use indexing to do the same operation as unsqueeze?



Ans=>

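We can use indexing with the special index None to insert a new dimension of size 1, just as unsqueeze does. For example, x[:, None] creates a new dimension at the second axis of the tensor x, which is equivalent to x.unsqueeze(1). Likewise, x[None] is equivalent to x.unsqueeze(0), and x[..., None] is equivalent to x.unsqueeze(-1).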
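A short check of the equivalence, using an arbitrary example tensor `x` (the names `a`, `b`, `c`, `d` are illustrative only):

```python
import torch

x = torch.arange(6).reshape(2, 3)

a = x[:, None]        # new axis at position 1, shape (2, 1, 3)
b = x.unsqueeze(1)    # same result via unsqueeze
c = x[None]           # new axis at position 0, shape (1, 2, 3)
d = x[..., None]      # new axis at the end, shape (2, 3, 1)

print(a.shape, c.shape, d.shape)
print(torch.equal(a, b))
```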

## 3.	How do we show the actual contents of the memory used for a tensor?


Ans=>

To show the actual contents of the memory used for a tensor, we can call the tensor's storage method to access the underlying storage object and then tolist to convert it to a list, for example x.storage().tolist(). The storage is the flat, one-dimensional buffer that holds the data; the tensor's shape, strides, and storage offset only describe how to read it. This works for non-contiguous tensors too: a transposed or sliced view reports the same storage as the tensor it was created from, so the list may be ordered differently from, or be longer than, the tensor's logical contents.
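The sketch below prints the storage of a transposed view to show what storage() returns for a tensor that is not contiguous. The tensor values are arbitrary, and recent PyTorch releases may emit a deprecation warning for storage().

```python
import torch

t = torch.arange(6).reshape(2, 3)
tt = t.t()                      # transposed view, shape (3, 2), not contiguous

print(tt.tolist())              # logical contents of the view
print(tt.storage().tolist())    # underlying memory, shared with t
print(tt.stride())              # how the view steps through that memory
print(tt.is_contiguous())
```

The view and the original tensor report the same flat buffer; only the strides differ.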

## 4.	When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)


Ans=>



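The vector is added to each row of the matrix. Broadcasting aligns trailing dimensions, so v of shape (3,) is treated as shape (1, 3) and repeated down the rows: v[j] is added to every element in column j.

```python
import torch

A = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v = torch.tensor([10, 11, 12])

# v (shape (3,)) is broadcast across the rows of A (shape (3, 3))
result = A + v

print(result)
# tensor([[11, 13, 15],
#         [14, 16, 18],
#         [17, 19, 21]])
```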


## 5.	Do broadcasting and expand_as result in increased memory use? Why or why not?

Ans=>

Broadcasting and expand_as do not increase memory use for the expanded operand, because no copy of its data is made. The expanded tensor is a view over the original storage whose stride is 0 along each expanded dimension, so the same elements are read repeatedly. The result of an operation on broadcast tensors is still a new tensor of the full broadcast shape, so that output does take memory, and calling contiguous() or clone() on an expanded view materializes a full copy.
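One way to see that no copy is made is to inspect the strides and the data pointer of an expanded tensor. This is a small sketch with arbitrary values; `m`, `v`, and `e` are illustrative names.

```python
import torch

m = torch.zeros(3, 3)
v = torch.tensor([10., 20., 30.])

e = v.expand_as(m)                      # shape (3, 3), a view over v's data

print(e.stride())                       # stride 0 along the expanded dimension
print(e.data_ptr() == v.data_ptr())     # same memory as v
print(e.contiguous().stride())          # contiguous() makes a real copy
```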

## 6.	Implement matmul using Einstein summation.



Ans=>



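```python
import torch

# create two batches of matrices
a = torch.randn(2, 3, 4)
b = torch.randn(2, 4, 5)

# batched matrix multiply using Einstein summation:
# n is the batch index, k is summed over because it is absent from the output
# (for plain 2-D matrices the string would be 'ik,kj->ij')
c = torch.einsum('nik,nkj->nij', a, b)

print(c.shape)  # output: torch.Size([2, 3, 5])
print(torch.allclose(c, a @ b, atol=1e-5))  # compare with the built-in matmul
```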


## 7.	What does a repeated index letter represent on the lefthand side of einsum?


Ans=>

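In einsum, an index letter that is repeated on the lefthand side (it appears in more than one input, as in 'ik,kj->ij') means the inputs are multiplied together along that dimension. If the letter does not also appear on the righthand side, the products are summed over it. For example, in

C_ij = A_ik * B_kj,

the k index is repeated on the left and does not appear in C_ij, so it is summed over: C_ij = sum over k of A_ik * B_kj.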
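A small sketch of the role of the repeated index k, using random example matrices. Keeping k in the output leaves the individual products unsummed, and summing them by hand recovers the matrix product.

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 5)

# k is repeated on the left and missing on the right, so it is summed over.
C = torch.einsum('ik,kj->ij', A, B)

# Keeping k on the right leaves the products A[i, k] * B[k, j] unsummed.
P = torch.einsum('ik,kj->ikj', A, B)    # shape (3, 4, 5)
C_manual = P.sum(dim=1)                 # sum over k by hand

print(torch.allclose(C, C_manual, atol=1e-5))
```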

## 8.	What are the three rules of Einstein summation notation? Why?

Ans=>

The three rules of Einstein summation notation are:

- Repeated indices on the lefthand side are implicitly summed over if they do not appear on the righthand side.
- Each index can appear at most twice on the lefthand side.
- The unrepeated indices on the lefthand side must appear on the righthand side.

These rules allow us to write complex tensor expressions in a compact and readable form, without having to write out every summation explicitly. The first rule is what removes the summation signs. The second and third rules keep every expression unambiguous: each index is either a summed (dummy) index shared by exactly two factors, or a free index that labels an axis of the result.

## 9.	What are the forward pass and backward pass of a neural network?


Ans=>

The forward pass of a neural network refers to the process of passing input data through the network to obtain a prediction or output. Each layer of the network applies a series of mathematical operations to the input data to produce a transformed output, which is then passed on to the next layer. This process continues until the final output is produced. During the forward pass, the network parameters are fixed, and the output is calculated by applying the weights and biases of the layers to the input data.

The backward pass, also known as backpropagation, is the process of computing the gradients of the loss with respect to the parameters of the network. It starts from the loss and moves backward through the layers, applying the chain rule of calculus at each layer to turn the gradient with respect to that layer's output into gradients with respect to its parameters and its input. An optimizer then uses these gradients to update the weights and biases so that the loss decreases.
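The two passes can be sketched on a single linear layer with a mean squared error loss. The data, the layer sizes, and the learning rate `lr` are arbitrary choices for illustration, and the update shown is plain gradient descent rather than any particular optimizer.

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 3)                      # inputs
y = torch.randn(64, 1)                      # targets
w = torch.randn(3, 1, requires_grad=True)   # weights
b = torch.zeros(1, requires_grad=True)      # bias

# Forward pass: parameters are held fixed while prediction and loss are computed.
pred = x @ w + b
loss = ((pred - y) ** 2).mean()

# Backward pass: autograd applies the chain rule and fills w.grad and b.grad.
loss.backward()

# Parameter update (a separate step from the backward pass).
lr = 0.05
with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad

new_loss = ((x @ w + b - y) ** 2).mean()
print(loss.item(), new_loss.item())
```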

## 10.	Why do we need to store some of the activations calculated for intermediate layers in the forward pass?


Ans=>

We store some of the intermediate activations from the forward pass because the backward pass needs them. By the chain rule, the gradient of a layer's weights depends on the input that layer received, and the gradient through a nonlinearity depends on the value it was applied to. If these activations were not stored, they would have to be recomputed during backpropagation, which can be computationally expensive. Stored activations are also useful for visualizing the activity of the network and for debugging during training.
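To make the dependency concrete, here is a hand-written backward pass for a tiny two-layer network, compared against autograd. The network, its sizes, and the sum used as a stand-in loss are assumptions chosen only to keep the sketch short.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 4)
w1 = torch.randn(4, 3, requires_grad=True)
w2 = torch.randn(3, 1, requires_grad=True)

# Forward pass, keeping the intermediate results.
z = x @ w1             # pre-activation
h = z.clamp_min(0.)    # ReLU activation
out = h @ w2
loss = out.sum()
loss.backward()        # reference gradients from autograd

# Manual backward pass: each step reuses a value saved from the forward pass.
with torch.no_grad():
    g_out = torch.ones_like(out)    # d loss / d out
    g_w2 = h.t() @ g_out            # needs the stored activation h
    g_h = g_out @ w2.t()
    g_z = g_h * (z > 0).float()     # needs the stored pre-activation z
    g_w1 = x.t() @ g_z              # needs the layer input x

print(torch.allclose(g_w1, w1.grad, atol=1e-5))
print(torch.allclose(g_w2, w2.grad, atol=1e-5))
```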

## 11.	What is the downside of having activations with a standard deviation too far away from 1?


Ans=>

Activations with a standard deviation too far from 1 cause problems because the effect compounds from layer to layer. If each layer shrinks the standard deviation, the activations decay toward zero as they pass through the network (they vanish), so later layers receive almost no signal and the gradients become tiny. If each layer grows it, the activations blow up (they explode): saturating activation functions such as sigmoid or tanh get pushed to their flat extremes, where gradients are close to zero, and in deep networks the values can overflow to infinity or NaN. In both cases training becomes slow or fails to converge.
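The compounding effect can be sketched by pushing a random batch through a stack of purely linear layers. The helper `run_layers`, the layer width of 100, and the depth of 50 are hypothetical choices for this demonstration, and no nonlinearity is used.

```python
import torch

torch.manual_seed(0)

def run_layers(scale, n_layers=50):
    # Hypothetical helper: apply n_layers linear layers with N(0, scale^2) weights.
    x = torch.randn(200, 100)
    for _ in range(n_layers):
        x = x @ (torch.randn(100, 100) * scale)
    return x

too_small = run_layers(0.01)   # each layer shrinks the scale, values collapse toward 0
too_large = run_layers(1.0)    # each layer grows the scale, values overflow

print(too_small.std())
print(torch.isfinite(too_large).all())
```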


## 12.	How can weight initialization help avoid this problem?


Ans=>

Weight initialization can help avoid the problem of having activations with a standard deviation that is too far away from 1. When weights are initialized too large or too small, the activations in the network become too large or too small as well. One common approach is Xavier (Glorot) or He (Kaiming) initialization, which draw the initial weights of each layer from a distribution with zero mean and a variance inversely proportional to the layer's size: Xavier uses 2 / (n_in + n_out), and He uses 2 / n_in, which compensates for ReLU zeroing about half of the activations. With this scaling, the variance of the activations stays roughly constant from layer to layer instead of shrinking or growing, so the standard deviation stays close to 1.
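As a rough sketch of the scaling idea, the loop below uses weights with standard deviation 1/sqrt(n_in) in a stack of linear layers without nonlinearities. The width, depth, and batch size are arbitrary, and the exact printed value depends on the random draw.

```python
import math
import torch

torch.manual_seed(0)
n_in = 100

x = torch.randn(200, n_in)
for _ in range(50):
    w = torch.randn(n_in, n_in) / math.sqrt(n_in)   # std = 1 / sqrt(fan_in)
    x = x @ w

print(x.std())   # stays on the order of 1 rather than vanishing or exploding
```

In practice the ready-made initializers in torch.nn.init, such as xavier_normal_ and kaiming_normal_, apply this kind of scaling to a layer's weight tensor.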
