## torch

### repeating samples along axis

- `repeat_interleave` along a dimension gives the same result as `unsqueeze` + `cat` + `view`

In [3]:
import torch

a = torch.tensor([[[1, 2]], [[2, 3]]])
print(a, a.shape)

# both expressions repeat each slice along dim 0 twice in a row
a.repeat_interleave(repeats=2, dim=0), torch.cat([a.unsqueeze(dim=1)] * 2, dim=1).view(-1, 1, 2)

tensor([[[1, 2]],

        [[2, 3]]]) torch.Size([2, 1, 2])


(tensor([[[1, 2]],
 
         [[1, 2]],
 
         [[2, 3]],
 
         [[2, 3]]]),
 tensor([[[1, 2]],
 
         [[1, 2]],
 
         [[2, 3]],
 
         [[2, 3]]]))
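
It is easy to confuse `repeat_interleave` with `repeat`. The following is a small sketch (the variable names `interleaved`, `stacked`, and `tiled` are mine, not from the cell above) that checks the equivalence programmatically and contrasts it with `repeat`, which tiles the whole tensor instead of repeating each slice in place.

```python
import torch

a = torch.tensor([[[1, 2]], [[2, 3]]])  # shape (2, 1, 2)

# repeat_interleave repeats each slice in place: slices come out as 0, 0, 1, 1
interleaved = a.repeat_interleave(repeats=2, dim=0)

# the cat-and-view route from the cell above
stacked = torch.cat([a.unsqueeze(dim=1)] * 2, dim=1).view(-1, 1, 2)

# repeat tiles the whole tensor: slices come out as 0, 1, 0, 1
tiled = a.repeat(2, 1, 1)

print(torch.equal(interleaved, stacked))  # True
print(torch.equal(interleaved, tiled))    # False
```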

### `permute`

Suppose a tensor has shape $[dim0, dim1, dim2, dim3]$ and we permute it to $[dim3, dim1, dim2, dim0]$ with `permute(3, 1, 2, 0)`.

- **consequence**: the value originally at index $[a, b, c, d]$ is now found at index $[d, b, c, a]$

In [None]:
import torch

a = torch.zeros([5, 4, 3, 2])
# say a=2, b=3, c=1, d=1
a[2, 3, 1, 1] = 1
print("original 1 at [2,3,1,1]: {}".format(a[2, 3, 1, 1]))
b = a.permute(3, 1, 2, 0)
# index [a,b,c,d] = [2,3,1,1] maps to [d,b,c,a] = [1,3,1,2]
print("permuted 1 at [1,3,1,2]: {}".format(b[1, 3, 1, 2]))
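
To check the index rule for every position rather than a single one, here is a minimal sketch. It assumes a tensor filled with distinct values via `torch.arange` (my choice, so that a wrong mapping cannot match by accident); the names `x`, `y`, and `all_match` are illustrative.

```python
import itertools
import torch

# distinct values so each position is identifiable
x = torch.arange(5 * 4 * 3 * 2).reshape(5, 4, 3, 2)
y = x.permute(3, 1, 2, 0)

print(y.shape)  # torch.Size([2, 4, 3, 5])

# the value at [a, b, c, d] in x should sit at [d, b, c, a] in y
all_match = all(
    x[a, b, c, d].item() == y[d, b, c, a].item()
    for a, b, c, d in itertools.product(range(5), range(4), range(3), range(2))
)
print(all_match)  # True
```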

### finding non-zero indices

In [9]:
import torch

a = torch.zeros(2,2)
a[1,0] = 1
a.nonzero()

tensor([[1, 0]])
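
By default `nonzero` returns one row per non-zero element. A short sketch of the `as_tuple=True` variant follows, which returns one index tensor per dimension and can be used directly for indexing. The 2x3 tensor and its values are made up for illustration.

```python
import torch

a = torch.zeros(2, 3)
a[1, 0] = 1
a[0, 2] = 5

# default form: shape (num_nonzero, ndim), one row per non-zero element
idx = a.nonzero()

# tuple form: one 1-D index tensor per dimension
rows, cols = a.nonzero(as_tuple=True)

# the tuple form plugs straight into advanced indexing
values = a[rows, cols]
print(idx, values)
```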

### tensor multiplication

The docs for the `matmul` function are terse, so here is an example:

In [5]:
import torch

a = torch.tensor([[[1,2,3],[1,0,0]],[[2,3,4],[0,1,0]]])
b = torch.tensor([[[1,0],[0,1],[1,1]],[[1,1],[0,0],[0,1]]])

a,a.shape, b, b.shape, b[0], b[0].shape,torch.matmul(a,b[0]), torch.matmul(a,b), torch.matmul(a[0],b), a[0].shape

(tensor([[[1, 2, 3],
          [1, 0, 0]],
 
         [[2, 3, 4],
          [0, 1, 0]]]),
 torch.Size([2, 2, 3]),
 tensor([[[1, 0],
          [0, 1],
          [1, 1]],
 
         [[1, 1],
          [0, 0],
          [0, 1]]]),
 torch.Size([2, 3, 2]),
 tensor([[1, 0],
         [0, 1],
         [1, 1]]),
 torch.Size([3, 2]),
 tensor([[[4, 5],
          [1, 0]],
 
         [[6, 7],
          [0, 1]]]),
 tensor([[[4, 5],
          [1, 0]],
 
         [[2, 6],
          [0, 0]]]),
 tensor([[[4, 5],
          [1, 0]],
 
         [[1, 4],
          [1, 1]]]),
 torch.Size([2, 3]))
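
The three `matmul` calls above differ only in how the batch dimension is handled. As a rough sketch of what happens, each one can be reproduced with an explicit loop over the batch; the names `batched`, `shared_right`, and `shared_left` are mine.

```python
import torch

a = torch.tensor([[[1, 2, 3], [1, 0, 0]], [[2, 3, 4], [0, 1, 0]]])  # (2, 2, 3)
b = torch.tensor([[[1, 0], [0, 1], [1, 1]], [[1, 1], [0, 0], [0, 1]]])  # (2, 3, 2)

# 3-D @ 3-D: matrices are multiplied pairwise along the batch dimension
batched = torch.matmul(a, b)
# 3-D @ 2-D: the single right-hand matrix is broadcast to every batch entry
shared_right = torch.matmul(a, b[0])
# 2-D @ 3-D: the single left-hand matrix is broadcast to every batch entry
shared_left = torch.matmul(a[0], b)

# the same results written as explicit per-batch products
print(torch.equal(batched, torch.stack([a[i] @ b[i] for i in range(2)])))
print(torch.equal(shared_right, torch.stack([a[i] @ b[0] for i in range(2)])))
print(torch.equal(shared_left, torch.stack([a[0] @ b[i] for i in range(2)])))
```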

## torch.nn

### embedding

Sometimes we have to build an embedding layer (a *lookup layer*) by hand. The resulting shape is straightforward: indexing an $n$-dimensional embedding tensor with an index tensor yields the shape of the index tensor followed by the last $n-1$ dimensions of the embedding tensor.

In [3]:
import torch

embedding_layer = torch.rand((5,3))
index_tensor = torch.tensor([[3,4],[0,1]]) # the index tensor must be integer-typed; torch.long (the default here) is the safe choice
print("1 dimensional embedding:{} of size {}\n".format(embedding_layer[index_tensor], embedding_layer[index_tensor].shape))

embedding_layer = torch.rand((5,5,3))
print("2 dimensional embedding:{} of size {}".format(embedding_layer[index_tensor], embedding_layer[index_tensor].shape))

1 dimensional embedding:tensor([[[0.5483, 0.6040, 0.8072],
         [0.2734, 0.9912, 0.5855]],

        [[0.5239, 0.0344, 0.4192],
         [0.0804, 0.4840, 0.0895]]]) of size torch.Size([2, 2, 3])

2 dimensional embedding:tensor([[[[0.7808, 0.7868, 0.6942],
          [0.5475, 0.7632, 0.5925],
          [0.4733, 0.4298, 0.9889],
          [0.2916, 0.9522, 0.1331],
          [0.3567, 0.5843, 0.9117]],

         [[0.9080, 0.9921, 0.0865],
          [0.8223, 0.4205, 0.7956],
          [0.0150, 0.7352, 0.6009],
          [0.7414, 0.3398, 0.8795],
          [0.7309, 0.1035, 0.8814]]],


        [[[0.2508, 0.4999, 0.8504],
          [0.1842, 0.2682, 0.2584],
          [0.4942, 0.1009, 0.9256],
          [0.5841, 0.6895, 0.1570],
          [0.9514, 0.1604, 0.2270]],

         [[0.8479, 0.4176, 0.6745],
          [0.4648, 0.4139, 0.1350],
          [0.7314, 0.8723, 0.5420],
          [0.6580, 0.8016, 0.8920],
          [0.6776, 0.1732, 0.4551]]]]) of size torch.Size([2, 2, 5, 3])
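
For the 2-D case, plain indexing should agree with PyTorch's built-in lookup. The sketch below compares it against `torch.nn.functional.embedding` and `torch.nn.Embedding.from_pretrained`; the variable names are illustrative, and the built-in versions only accept a 2-D weight matrix, so the 3-D example above has no direct counterpart here.

```python
import torch
import torch.nn.functional as F

weight = torch.rand(5, 3)                       # 5 entries, 3 features each
index_tensor = torch.tensor([[3, 4], [0, 1]])   # dtype torch.long by default

by_indexing = weight[index_tensor]
by_functional = F.embedding(index_tensor, weight)
by_module = torch.nn.Embedding.from_pretrained(weight)(index_tensor)

# all three have shape index_tensor.shape + weight.shape[1:]
print(by_indexing.shape)
print(torch.equal(by_indexing, by_functional), torch.equal(by_indexing, by_module))
```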


### Cosine Similarity
`PyTorch` provides a convenient API for computing cosine similarity between two tensors, but its behavior can be confusing when the tensors have more than one dimension.

- From my perspective, the `dim` parameter is the dimension that gets *compressed*: computing cosine similarity along `dim` reduces each vector lying on that dimension to a single value, so that dimension disappears from the output.
- To compute it, slice both tensors into vectors along the given `dim`, then take the cosine similarity of each corresponding pair of vectors.
- When the tensors have more than one dimension:
    - `dim` = 0: the values at the same position across the batch are packed into a vector
    - `dim` = -1: the values along the last dimension are collected into a vector

In [None]:
import torch
from torch.nn import CosineSimilarity

# example for cosine similarity along the last dimension
cos = CosineSimilarity(dim=2)

a = torch.rand((3,2,3))
b = torch.rand((3,2,3))

c = a[0]
d = b[0]

e = a[:,0,:].unsqueeze(dim=1)
f = b[:,0,:].unsqueeze(dim=2)
g = a[:,1,:].unsqueeze(dim=1)
h = b[:,1,:].unsqueeze(dim=2)

result_1 = torch.matmul(e,f) / torch.sqrt(torch.matmul(e,e.permute(0,2,1)) * torch.matmul(f.permute(0,2,1),f))
result_2 = torch.matmul(g,h) / torch.sqrt(torch.matmul(g,g.permute(0,2,1)) * torch.matmul(h.permute(0,2,1),h))

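cos_2 = CosineSimilarity(dim=1)
# cos(a,b)[0] should match cos_2(c,d); cos(a,b)[:,0] and cos(a,b)[:,1] should match result_1 and result_2
cos(a,b), cos_2(c,d), result_1.squeeze(), result_2.squeeze()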
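
The "compress along `dim`" view can also be checked for several values of `dim` at once. This is a minimal sketch, assuming a hand-written helper `manual_cosine` (hypothetical, not part of PyTorch) that reduces the dot product and both norms along the same dimension. It ignores the small `eps` clamp that PyTorch applies to the norms, which does not matter for random inputs like these.

```python
import torch
import torch.nn.functional as F

a = torch.rand(3, 2, 4)
b = torch.rand(3, 2, 4)

def manual_cosine(x, y, dim):
    # dot product and both norms are reduced along the same dimension
    dot = (x * y).sum(dim=dim)
    norm_x = x.pow(2).sum(dim=dim).sqrt()
    norm_y = y.pow(2).sum(dim=dim).sqrt()
    return dot / (norm_x * norm_y)

shapes = {}
for dim in (0, 1, -1):
    out = F.cosine_similarity(a, b, dim=dim)
    shapes[dim] = tuple(out.shape)
    # the reduced dimension is gone from the output shape
    print(dim, shapes[dim], torch.allclose(out, manual_cosine(a, b, dim), atol=1e-6))
```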

### Layer Normalization

Layer Normalization is applied over the last dimensions of the input tensor, as given by `normalized_shape`; i.e. `mean` and `variance` are calculated within each individual sample rather than across the whole batch.

In [4]:
import torch

layer_norm = torch.nn.LayerNorm((2,3))  # normalize over the last two dimensions
a = torch.tensor([[[-0.5,0.5,0],[0,1,-1]],[[1,2,3],[5,9,11]]])
a,layer_norm(a)

(tensor([[[-0.5000,  0.5000,  0.0000],
          [ 0.0000,  1.0000, -1.0000]],
 
         [[ 1.0000,  2.0000,  3.0000],
          [ 5.0000,  9.0000, 11.0000]]]),
 tensor([[[-0.7746,  0.7746,  0.0000],
          [ 0.0000,  1.5492, -1.5492]],
 
         [[-1.1352, -0.8627, -0.5903],
          [-0.0454,  1.0444,  1.5893]]], grad_fn=<NativeLayerNormBackward>))

## torch.autograd
You only need to write a custom `torch.autograd.Function` when an operation in the forward pass is **not differentiable** but you still want the gradient to pass through it. In that case you can define your own backward function that supplies an approximate gradient for the **non-differentiable** operation.
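
As one possible illustration, here is a sketch of a custom function for rounding. Its built-in gradient is zero almost everywhere, so the backward below simply passes the incoming gradient through unchanged (often called a straight-through estimator). The class name `RoundSTE` and the choice of rounding are my assumptions for the example, not something prescribed by PyTorch.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass, pass the gradient through unchanged in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # approximate gradient: treat the forward operation as the identity
        return grad_output

x = torch.tensor([0.2, 1.7, -0.6], requires_grad=True)
y = RoundSTE.apply(x)
y.sum().backward()

print(y)       # rounded values
print(x.grad)  # all ones, instead of the zeros torch.round would give
```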