# CS6493 - Tutorial 1
## Introduction to JupyterHub and PyTorch

Welcome to the CS6493 tutorial. In this tutorial, you will get familiar with our experimental environment and practice some basic PyTorch operations.

## 1. JupyterHub

You can use JupyterHub to run the toy models. Here are some notes:

- You are expected to be familiar with Python and Jupyter.
- We use **Google Colab** for the following experiments. Please log in with your Gmail account at https://colab.research.google.com/ (if you do not have one, please register one).
- Go to Edit->Notebook Settings, then select Python 3 and a GPU as the hardware accelerator.
- Before running a specific model, check which resources you need and which resources you have with **!nvidia-smi**.
- We are glad to help throughout the tutorial if you have any questions.

## 2. PyTorch

We use the [PyTorch](https://pytorch.org/) framework for the implementations. This section introduces the installation and the basic operations of PyTorch.

### 2.1 Installation
Since Colab has PyTorch installed by default, you can check the PyTorch version and whether it supports GPUs with the following commands.

In [2]:
# check the GPU resource in the Colab
!nvidia-smi

Tue Jan 14 02:33:28 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
import torch
print("PyTorch version: ", torch.__version__)

PyTorch version:  2.5.1+cu121


Additionally, if a repository requires a specific torch version, you can look up that version on the official PyTorch website. A full command with explicit version numbers is recommended, like this:
```
# CUDA 12.1
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
```



In [None]:
# you can try this if you want to install a specific version of PyTorch
!pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

You can use the following code to check more details about the available GPUs.

In [None]:
import torch
print("PyTorch version: ", torch.__version__)
print("GPU support: ", torch.cuda.is_available())
print("Available devices count: ", torch.cuda.device_count())

PyTorch version:  2.5.1+cu121
GPU support:  True
Available devices count:  1
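The check above can be extended to report which GPU is visible and how much memory it has. The following is a minimal sketch, assuming the first visible GPU (index 0) is the one of interest; it falls back to the CPU when no GPU is present.

```python
import torch

# Pick the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Selected device:", device)

if device.type == "cuda":
    # Properties of the first visible GPU (index 0 is an assumption).
    props = torch.cuda.get_device_properties(0)
    print("GPU name:", props.name)
    print(f"Total memory: {props.total_memory / 1024**3:.1f} GiB")
```

Keeping the chosen device in one variable makes it easy to reuse later when creating or moving tensors.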


## 2.2 Quick start - Tensor in PyTorch

In this section, we introduce some basic concepts and operations of tensors.

In [None]:
import numpy as np

Tensors are a specialized data structure that is very similar to arrays and matrices. In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters.

Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or other hardware accelerators.

One simple way to understand and use a tensor is to know what each dimension represents.
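To make the role of each dimension concrete, here is a small sketch. The names `batch_size`, `seq_len`, and `hidden_dim` and their sizes are assumptions chosen only for illustration; they follow a layout that is common for NLP inputs.

```python
import torch

# Hypothetical sizes, chosen for illustration only.
batch_size, seq_len, hidden_dim = 2, 4, 8
x = torch.rand(batch_size, seq_len, hidden_dim)

print(x.shape)               # torch.Size([2, 4, 8])
print(x[0].shape)            # first sequence in the batch: torch.Size([4, 8])
print(x[0, 1].shape)         # second token of that sequence: torch.Size([8])
print(x.mean(dim=-1).shape)  # average over the hidden dimension: torch.Size([2, 4])
```

Indexing removes a dimension from the left, while a reduction such as `mean(dim=-1)` removes the dimension it is applied to.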

### Create Tensors

Tensors can be created directly from data or from NumPy arrays. You can specify the data type of the tensor; otherwise, the data type is inferred automatically.

In [None]:
data = [[0,1], [2,3]]
tensor_data = torch.tensor(data)
tensor_data_float = torch.tensor(data).float()
print(f"Long Tensor: \n {tensor_data} \n")  # the data type is LongTensor
print(f"Float Tensor: \n {tensor_data_float} \n")

Long Tensor: 
 tensor([[0, 1],
        [2, 3]]) 

Float Tensor: 
 tensor([[0., 1.],
        [2., 3.]]) 



In [None]:
np_data = np.array(data)
tensor_np_data = torch.tensor(np_data)
tensor_np_data_float = torch.tensor(np_data).float()
print(f"Long Tensor: \n {tensor_np_data} \n")  # the data type is LongTensor
print(f"Float Tensor: \n {tensor_np_data_float} \n")

Long Tensor: 
 tensor([[0, 1],
        [2, 3]]) 

Float Tensor: 
 tensor([[0., 1.],
        [2., 3.]]) 
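The cells above convert to float after creation. As a complementary sketch, the data type can also be passed at creation time with the `dtype` argument, and `torch.from_numpy` can be used when sharing memory with the NumPy array is acceptable. The variable names below are illustrative.

```python
import numpy as np
import torch

data = [[0, 1], [2, 3]]

# Set the data type at creation time instead of converting afterwards.
t_float = torch.tensor(data, dtype=torch.float32)
print(t_float.dtype)  # torch.float32

# torch.tensor copies the NumPy data; torch.from_numpy shares memory with it.
np_data = np.array(data)
t_copy = torch.tensor(np_data)
t_shared = torch.from_numpy(np_data)

np_data[0, 0] = 100
print(t_copy[0, 0].item())    # 0, the copy is unaffected
print(t_shared[0, 0].item())  # 100, the shared tensor sees the change
```

Copying is the safer default; sharing avoids an extra allocation but ties the tensor to later changes in the array.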



You can also create tensors filled with constant values (e.g., 0 or 1) or random values:

In [None]:
zeros_tensor = torch.zeros((2,3))
ones_tensor = torch.ones((2,3))
random_tensor = torch.rand((2,3))
print(f"Zeros Tensor: \n {zeros_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Random Tensor: \n {random_tensor} \n")

Zeros Tensor: 
 tensor([[0., 0., 0.],
        [0., 0., 0.]]) 

Ones Tensor: 
 tensor([[1., 1., 1.],
        [1., 1., 1.]]) 

Random Tensor: 
 tensor([[0.8107, 0.6311, 0.1082],
        [0.0881, 0.4793, 0.0343]]) 
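Random tensors differ from run to run, which can make results hard to compare. A minimal sketch, assuming reproducibility on a single device is all that is needed, fixes the seed with `torch.manual_seed`; it also shows `torch.zeros_like`, which copies the shape and data type of an existing tensor.

```python
import torch

# Re-seeding the generator reproduces the same random values.
torch.manual_seed(0)
a = torch.rand(2, 3)
torch.manual_seed(0)
b = torch.rand(2, 3)
print(torch.equal(a, b))  # True

# Create a constant tensor with the same shape and dtype as an existing one.
zeros_like_a = torch.zeros_like(a)
print(zeros_like_a.shape, zeros_like_a.dtype)
```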



### Attributes of a Tensor

Tensor attributes describe their shape, datatype, and the device on which they are stored.

In [None]:
tensor = torch.rand(2,3)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Shape of tensor: torch.Size([2, 3])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


### Operations on Tensors

There are over 100 tensor operations, including arithmetic, linear algebra, matrix manipulation, and more. In this section, we only introduce operations that are frequently used in our later tutorials and projects.

**Move Tensor to Device**

By default, tensors are created on the CPU. We need to explicitly move tensors to the GPU using the `.to()` method (after checking for GPU availability). Keep in mind that copying large tensors across devices can be expensive in terms of time and memory!

In [None]:
# move tensor to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
tensor = tensor.to(device)
print(f"Device tensor is stored on: {tensor.device}")

Device tensor is stored on: cuda:0
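Because cross-device copies are costly, it can help to allocate a tensor on the target device from the start. The sketch below assumes the same `device` selection as above and uses the `device` argument of the creation functions; the variable names are illustrative.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Allocate directly on the target device, so no CPU-to-GPU copy is needed.
x = torch.rand(2, 3, device=device)

# Tensors that take part in one operation must live on the same device.
y = torch.ones(2, 3, device=device)
z = x + y

# Bring the result back to the CPU before converting it to NumPy.
z_np = z.cpu().numpy()
print(z.device, z_np.shape)
```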


**Tensor indexing, slicing and reshape**

In [None]:
import torch
tensor = torch.rand(4, 6)
tensor

tensor([[0.1444, 0.1404, 0.6457, 0.3260, 0.3766, 0.6086],
        [0.5304, 0.9665, 0.8526, 0.4024, 0.1897, 0.8329],
        [0.0054, 0.0091, 0.7874, 0.5189, 0.8398, 0.3064],
        [0.2892, 0.0357, 0.5813, 0.9860, 0.4902, 0.4146]])

In [None]:
# let's take a look at its first row, first column, and last column
print(f"First row: {tensor[0]}")
print(f"First column: {tensor[:,0]}")
print(f"Last column: {tensor[:, -1]}")

First row: tensor([0.1444, 0.1404, 0.6457, 0.3260, 0.3766, 0.6086])
First column: tensor([0.1444, 0.5304, 0.0054, 0.2892])
Last column: tensor([0.6086, 0.8329, 0.3064, 0.4146])


In [None]:
# reshape with view; -1 lets PyTorch infer the size of that dimension
print(f"Reshape to (2,12): \n {tensor.view(2, 12)} \n")
print(f"Reshape to (2,2,6): \n {tensor.view(-1, 2, 6)} \n")

Reshape to (2,12): 
 tensor([[0.1444, 0.1404, 0.6457, 0.3260, 0.3766, 0.6086, 0.5304, 0.9665, 0.8526,
         0.4024, 0.1897, 0.8329],
        [0.0054, 0.0091, 0.7874, 0.5189, 0.8398, 0.3064, 0.2892, 0.0357, 0.5813,
         0.9860, 0.4902, 0.4146]]) 

Reshape to (2,2,6): 
 tensor([[[0.1444, 0.1404, 0.6457, 0.3260, 0.3766, 0.6086],
         [0.5304, 0.9665, 0.8526, 0.4024, 0.1897, 0.8329]],

        [[0.0054, 0.0091, 0.7874, 0.5189, 0.8398, 0.3064],
         [0.2892, 0.0357, 0.5813, 0.9860, 0.4902, 0.4146]]]) 
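The cell above reshapes with `.view()`, which requires a compatible memory layout. As a hedged illustration, the sketch below shows a case where `.view()` fails on a transposed (non-contiguous) tensor while `.reshape()` still works by copying when needed.

```python
import torch

t = torch.arange(24).view(4, 6)
t_T = t.T  # transposing only changes strides, so t_T is not contiguous
print(t_T.is_contiguous())  # False

# view needs a compatible memory layout and raises an error here.
try:
    t_T.view(24)
    view_failed = False
except RuntimeError:
    view_failed = True
print("view failed:", view_failed)  # view failed: True

# reshape falls back to a copy when a view is not possible.
flat = t_T.reshape(24)
# Equivalent explicit form: make the tensor contiguous first, then view.
flat2 = t_T.contiguous().view(24)
print(torch.equal(flat, flat2))  # True
```

A reasonable rule of thumb is to use `.view()` when the tensor must not be copied and `.reshape()` otherwise.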



**Joining tensors.** You can use torch.cat to concatenate a sequence of tensors along a given dimension.

In [None]:
t1 = torch.zeros(4, 2)
new_t = torch.cat([tensor, t1, t1], dim=1)
new_t

tensor([[0.1444, 0.1404, 0.6457, 0.3260, 0.3766, 0.6086, 0.0000, 0.0000, 0.0000,
         0.0000],
        [0.5304, 0.9665, 0.8526, 0.4024, 0.1897, 0.8329, 0.0000, 0.0000, 0.0000,
         0.0000],
        [0.0054, 0.0091, 0.7874, 0.5189, 0.8398, 0.3064, 0.0000, 0.0000, 0.0000,
         0.0000],
        [0.2892, 0.0357, 0.5813, 0.9860, 0.4902, 0.4146, 0.0000, 0.0000, 0.0000,
         0.0000]])
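`torch.cat` joins along an existing dimension. A related function, `torch.stack`, joins along a new dimension instead. The short sketch below contrasts the resulting shapes; the tensors `a` and `b` are illustrative.

```python
import torch

a = torch.zeros(4, 2)
b = torch.ones(4, 2)

# cat keeps the number of dimensions and grows one of them.
print(torch.cat([a, b], dim=0).shape)    # torch.Size([8, 2])
print(torch.cat([a, b], dim=1).shape)    # torch.Size([4, 4])

# stack adds a new dimension; all inputs must have the same shape.
print(torch.stack([a, b], dim=0).shape)  # torch.Size([2, 4, 2])
```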

**Arithmetic operations**

The basic arithmetic operations of PyTorch are similar to those in NumPy, such as `.pow()`, `.div()`, `.sum()`, and more. Here we focus on multiplication in PyTorch.

In [None]:
# This computes the matrix multiplication between two tensors. y1 and y2 will have the same value
print(f"Shape of original tensor: {tensor.shape}")
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)

print(f"Shape of matrix multiplication resulting tensor: {y1.shape}")

# This computes the element-wise product. z1 and z2 will have the same value
z1 = tensor * tensor
z2 = tensor.mul(tensor)

print(f"Shape of element-wise product resulting tensor: {z1.shape}")

Shape of original tensor: torch.Size([4, 6])
Shape of matrix multiplication resulting tensor: torch.Size([4, 4])
Shape of element-wise product resulting tensor: torch.Size([4, 6])
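Matrix multiplication also works on batched (3-D) tensors, and element-wise operations broadcast over matching trailing dimensions. The following is a small sketch with illustrative tensors `a`, `b`, and `scale`.

```python
import torch

a = torch.rand(2, 4, 8)
b = torch.rand(2, 4, 8)

# Batched matmul: the leading dimension is treated as the batch,
# so (2, 4, 8) @ (2, 8, 4) gives (2, 4, 4).
c = a @ b.transpose(1, 2)
print(c.shape)  # torch.Size([2, 4, 4])

# Each batch element is multiplied independently.
print(torch.allclose(c[0], a[0] @ b[0].T))  # True

# Broadcasting: a vector of length 8 scales the last dimension of a.
scale = torch.arange(8, dtype=torch.float32)
scaled = a * scale
print(scaled.shape)  # torch.Size([2, 4, 8])
```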


## 2.3 Practice

In NLP, there is a very popular technique termed **Attention**, which is used to measure the importance of each component relative to the others. Formally, we define the attention mechanism as:

$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$

$\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

You can try to implement the softmax function and attention yourself.

Hint: decompose the equation into basic components and work out how to implement each one. Search for your questions and you can find answers on Stack Overflow or in the official documentation of NumPy or PyTorch.
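As a rough aid for the decomposition, the shapes can be tracked through the formula. This is a sketch that assumes a single batch element with $n$ queries, keys, and values of dimension $d_k$ (the symbol $n$ is introduced here only for illustration):

$\mathbf{Q},\mathbf{K},\mathbf{V} \in \mathbb{R}^{n \times d_k} \;\Rightarrow\; \mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{n \times n} \;\Rightarrow\; \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right) \in \mathbb{R}^{n \times n} \;\Rightarrow\; \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V} \in \mathbb{R}^{n \times d_k}$

Under this reading, the softmax is taken over each row of the $n \times n$ score matrix, so every row of weights sums to 1.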

In [None]:
# note that we have a batch size of 2,
# and each of Q, K, V has shape 4x8 per batch element
# you can regard them as 4 queries, 4 keys and 4 values, each represented by an 8-dimensional vector
v = torch.rand((2, 4, 8))
k = v
q = torch.rand((2,4,8))
d_k = 8

In [None]:
# insert your code
def attention(q, k, v):
    pass

def softmax(x):
    pass


# Answer for the practice
## Hint: we can decompose the equation and write the logic process first, then search for the implementation

In [36]:
# note that we have a batch size of 2,
# and each of Q, K, V has shape 4x8 per batch element
# you can regard them as 4 queries, 4 keys and 4 values, each represented by an 8-dimensional vector

# random initialization for further tests
v = torch.rand((2, 4, 8))
k = v
q = torch.rand((2,4,8))
d_k = 8

# prepare a test case for validation
# for a simple check, we use a batch size of 1 and matrices of shape (2, 4)
# note that the dtype should be float
v_test = [[[1.0, 2.0, 3.0, 1.0], [2.0, 2.0, 4.0, 3.0]]]
v_test = torch.tensor(v_test)
k_test = v_test
q_test = [[[4.0, 2.0, 1.0, 4.0], [7.0, 2.0, 5.0, 6.0]]]
q_test = torch.tensor(q_test)
# check the data shapes
print(v_test.shape, k_test.shape, q_test.shape)
d_k_test = 4  # the sqrt of d_k is 2 for easy testing

torch.Size([1, 2, 4]) torch.Size([1, 2, 4]) torch.Size([1, 2, 4])


In [37]:
# Note: since this is the first time we practice with PyTorch, we provide a very detailed process, which includes
# -- the decomposition of the task/equation
# -- the corresponding code for implementation, validation and cross-checking
# -- the intermediate thinking process

# We may not provide such details for the following tasks, leaving the thinking and exploration to you :)
# If you have any questions, find any errors, or have any suggestions, please feel free to contact us.


import math
def attention(q, k, v):
    # we highly recommend using torch methods and consulting the official documentation
    # do the matrix multiplication first; note the transpose
    # check the shape of the transpose of k if you are not sure about the method
    # - torch.transpose(k, 1, 2).shape

    # decompose the equation
    # numerator = torch.matmul(q, torch.transpose(k, 1, 2))
    # denominator = math.sqrt(k.shape[-1])
    # soft_result = softmax(numerator / denominator, -1)
    # final_result = torch.matmul(soft_result, v)

    # combine the above together
    attn_compute = torch.matmul(softmax(torch.matmul(q, torch.transpose(k, 1, 2)) / math.sqrt(k.shape[-1]), -1), v)
    return attn_compute

def softmax(x, dim_):
    # decompose the equation into two parts
    # 1. get the exp of each element
    # 2. sum the exps along the chosen dimension and divide each exp by that sum
    # you can check your own softmax against torch.nn.functional.softmax
    # !!! note that the softmax is not taken over the whole matrix but along a specific dimension, usually the last one
    exp_x = torch.exp(x)
    sum_x = torch.sum(exp_x, dim_, keepdim=True)
    return exp_x / sum_x
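The direct implementation above can overflow when the inputs are large, because `exp` grows quickly. A common remedy, shown here as a sketch rather than as part of the original answer, is to subtract the maximum along the softmax dimension before exponentiating. The function name `stable_softmax` is hypothetical.

```python
import torch

def stable_softmax(x, dim=-1):
    # Subtracting the maximum along `dim` does not change the result
    # mathematically, but it keeps exp() from overflowing.
    shifted = x - x.max(dim=dim, keepdim=True).values
    exp_x = torch.exp(shifted)
    return exp_x / exp_x.sum(dim=dim, keepdim=True)

x = torch.tensor([[1000.0, 1001.0, 1002.0]])

# The direct formula overflows: exp(1000) is inf in float32, and inf / inf is nan.
naive = torch.exp(x) / torch.exp(x).sum(dim=-1, keepdim=True)
print(naive)  # tensor([[nan, nan, nan]])

# The shifted version stays finite and agrees with the built-in softmax.
stable = stable_softmax(x)
print(stable)
print(torch.allclose(stable, torch.softmax(x, dim=-1)))  # True
```

For the random inputs in this tutorial, which lie in [0, 1), both versions give the same values.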

In [38]:
# check the softmax implementation
exp_own = softmax(v, -1)
exp_torch = torch.nn.functional.softmax(v, dim=-1)
print(exp_own)
print(exp_torch)

tensor([[[0.1164, 0.1182, 0.1174, 0.1730, 0.0948, 0.1464, 0.1248, 0.1089],
         [0.0823, 0.1113, 0.1495, 0.1220, 0.1117, 0.1010, 0.1832, 0.1390],
         [0.1228, 0.1672, 0.1130, 0.1723, 0.1109, 0.0665, 0.1086, 0.1387],
         [0.2289, 0.1224, 0.1217, 0.0889, 0.1226, 0.0875, 0.1016, 0.1264]],

        [[0.1281, 0.1507, 0.0901, 0.0806, 0.0802, 0.1052, 0.1586, 0.2065],
         [0.1283, 0.1416, 0.0766, 0.1503, 0.1468, 0.0767, 0.1697, 0.1099],
         [0.1172, 0.1628, 0.1011, 0.1316, 0.1827, 0.0762, 0.1497, 0.0787],
         [0.0967, 0.0737, 0.1237, 0.1412, 0.1617, 0.1593, 0.0983, 0.1453]]])
tensor([[[0.1164, 0.1182, 0.1174, 0.1730, 0.0948, 0.1464, 0.1248, 0.1089],
         [0.0823, 0.1113, 0.1495, 0.1220, 0.1117, 0.1010, 0.1832, 0.1390],
         [0.1228, 0.1672, 0.1130, 0.1723, 0.1109, 0.0665, 0.1086, 0.1387],
         [0.2289, 0.1224, 0.1217, 0.0889, 0.1226, 0.0875, 0.1016, 0.1264]],

        [[0.1281, 0.1507, 0.0901, 0.0806, 0.0802, 0.1052, 0.1586, 0.2065],
         [0.1283, 0

In [40]:
# check the code
attn_result = attention(q,k,v)
# check with official method
attn_official = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(attn_result)
print(attn_official)

# you can use the test data as well for intermediate process checking
print(attention(q_test, k_test, v_test))
print(torch.nn.functional.scaled_dot_product_attention(q_test, k_test, v_test))

tensor([[[0.5415, 0.5852, 0.5622, 0.6506, 0.4228, 0.3014, 0.5880, 0.5845],
         [0.5435, 0.6083, 0.5669, 0.6708, 0.4330, 0.2850, 0.5912, 0.5998],
         [0.5381, 0.5992, 0.5638, 0.6743, 0.4251, 0.3037, 0.5907, 0.5903],
         [0.5368, 0.6034, 0.5660, 0.6776, 0.4278, 0.3002, 0.5935, 0.5943]],

        [[0.5038, 0.5648, 0.3291, 0.6084, 0.7233, 0.3573, 0.6873, 0.5573],
         [0.5000, 0.5521, 0.3263, 0.5782, 0.6823, 0.3747, 0.6793, 0.5981],
         [0.4960, 0.5416, 0.3366, 0.5809, 0.6887, 0.3882, 0.6706, 0.6010],
         [0.5034, 0.5741, 0.3197, 0.5882, 0.7020, 0.3455, 0.6901, 0.5605]]])
tensor([[[0.5415, 0.5852, 0.5622, 0.6506, 0.4228, 0.3014, 0.5880, 0.5845],
         [0.5435, 0.6083, 0.5669, 0.6708, 0.4330, 0.2850, 0.5912, 0.5998],
         [0.5381, 0.5992, 0.5638, 0.6743, 0.4251, 0.3037, 0.5907, 0.5903],
         [0.5368, 0.6034, 0.5660, 0.6776, 0.4278, 0.3002, 0.5935, 0.5943]],

        [[0.5038, 0.5648, 0.3291, 0.6084, 0.7233, 0.3573, 0.6873, 0.5573],
         [0.5000, 0