
# AIG 230 – Introduction to PyTorch for Natural Language Processing

## Notebook 1: PyTorch Fundamentals for NLP

This notebook is designed for **AIG 230 – Natural Language Processing (Seneca Polytechnic)**.
It introduces PyTorch fundamentals that will be reused throughout the course for:

- Sequence Models and RNNs  
- Attention and Transformer Architectures  
- Pretrained Models and Transfer Learning  
- Named Entity Recognition  
- Machine Translation and Seq2Seq  
- Question Answering and Summarization  
- Retrieval Augmented Generation and Text Search  

---



## Learning Objectives

By the end of this notebook, students will be able to:

- Explain what PyTorch is and why it is used in NLP
- Create and manipulate tensors
- Understand devices (CPU vs GPU)
- Explain automatic differentiation
- Build and train a simple neural network
- Read PyTorch code line by line

---



## 1. What is PyTorch and Why Do We Use It in NLP?

PyTorch is a deep learning framework used to build, train, and deploy neural networks.

In NLP, PyTorch is preferred because:

- It supports dynamic computation graphs
- It handles variable-length sequences naturally
- It integrates with Hugging Face Transformers
- It provides automatic differentiation

Every model used later in this course is built on these fundamentals.
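
As a minimal sketch of what a dynamic computation graph means in practice, the block below runs an ordinary Python loop over sequences of different lengths and still gets gradients from autograd. The function name `score_sequence` and the toy values are hypothetical and chosen only for illustration.

```python
import torch

# A single learnable weight shared across all time steps (illustrative only)
w = torch.tensor(0.5, requires_grad=True)

def score_sequence(values):
    """Accumulate a score with a plain Python loop; the graph is built as the code runs."""
    total = torch.tensor(0.0)
    for v in values:              # loop length depends on the input
        total = total + w * v
    return total

shorter = torch.tensor([1.0, 2.0])            # 2 "tokens"
longer = torch.tensor([1.0, 2.0, 3.0, 4.0])   # 4 "tokens"

loss = score_sequence(shorter) + score_sequence(longer)
loss.backward()

# d(loss)/dw is the sum of all inputs: (1 + 2) + (1 + 2 + 3 + 4) = 13
print(w.grad)  # tensor(13.)
```

No padding or fixed graph shape was declared in advance: each call simply records the operations it actually executed.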



### Student Checkpoint 1

1. Why are dynamic computation graphs important for NLP?
2. Name two NLP tasks from this course that require PyTorch.



## 2. Installing and Importing PyTorch

This notebook runs both **locally** and on **Google Colab**.
Colab already includes PyTorch, but we include a safe install cell.


In [None]:

# Install PyTorch (safe for Colab and local use)
# In Colab, pip will usually just report that the packages are already installed

!pip install torch torchvision torchaudio --quiet


In [1]:

import torch
torch.__version__


'2.9.0+cu126'


### Student Checkpoint 2

1. What does the `torch` library provide?
2. Why is checking the PyTorch version useful?



## 3. Tensors: The Core Data Structure

Tensors are multi-dimensional arrays used to store:

- Token IDs
- Embeddings
- Model weights
- Batches of text sequences


In [2]:

x = torch.tensor([1, 2, 3])
x


tensor([1, 2, 3])

In [3]:

x.shape


torch.Size([3])

In [4]:

x.dtype


torch.int64
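
To connect this to text data, here is a small sketch of a batch of token IDs stored as a tensor. The IDs are made up, and treating `0` as the padding ID is an assumption for this example rather than a fixed PyTorch convention.

```python
import torch

# Hypothetical token IDs for two sentences; 0 is assumed to be the padding ID
batch = torch.tensor([
    [12, 45, 7, 0],   # 3 real tokens + 1 pad
    [3, 18, 22, 9],   # 4 real tokens
])

print(batch.shape)   # torch.Size([2, 4]) -> (batch_size, sequence_length)
print(batch.dtype)   # torch.int64

# Boolean mask marking real (non-padding) tokens, then count them per sentence
mask = batch != 0
print(mask.sum(dim=1))  # tensor([3, 4])
```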

Tensors are PyTorch's most basic building block. Each tensor is a multi-dimensional array; for example, a 256x256 RGB image might be represented by a 3x256x256 tensor, where the first dimension indexes the color channel. Here's how to create a tensor from a nested Python list:

In [5]:
list_of_lists = [
  [1, 2, 3],
  [4, 5, 6],
]
print(list_of_lists)

[[1, 2, 3], [4, 5, 6]]


In [6]:
data = torch.tensor(list_of_lists)
print(data)

tensor([[1, 2, 3],
        [4, 5, 6]])


In [7]:
# Initializing a tensor
data = torch.tensor([
                     [0, 1],
                     [2, 3],
                     [4, 5]
                    ])
print(data)

tensor([[0, 1],
        [2, 3],
        [4, 5]])


Each tensor has a data type. The main ones you'll need to worry about are floats (`torch.float32`) and integers (`torch.int64`, the default for Python integers; `torch.int` is an alias for `torch.int32`). You can specify the data type explicitly when you create the tensor:

In [8]:
# Initializing a tensor with an explicit data type
# Notice the dots after the numbers, which specify that they're floats
data = torch.tensor([
                     [0, 1],
                     [2, 3],
                     [4, 5]
                    ], dtype=torch.float32)
print(data)

tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])


In [9]:
# With dtype=torch.float32, the integer entries are converted to floats as well
# Printed values are rounded to 4 decimal places for display
data = torch.tensor([
                     [0.11111111, 1],
                     [2, 3],
                     [4, 5]
                    ], dtype=torch.float32)
print(data)

tensor([[0.1111, 1.0000],
        [2.0000, 3.0000],
        [4.0000, 5.0000]])


In [10]:
# Initializing a tensor without an explicit data type
# One float entry is enough for PyTorch to infer torch.float32 for the whole tensor
data = torch.tensor([
                     [0.11111111, 1],
                     [2, 3],
                     [4, 5]
                    ])
print(data)

tensor([[0.1111, 1.0000],
        [2.0000, 3.0000],
        [4.0000, 5.0000]])
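
The dtype rules used in the cells above can be checked directly. The following sketch (variable names are illustrative) shows how PyTorch infers a dtype from Python values and how an existing tensor can be cast with `.to()`.

```python
import torch

ints = torch.tensor([1, 2, 3])        # all Python ints -> torch.int64
floats = torch.tensor([1.0, 2, 3])    # any float present -> torch.float32

print(ints.dtype, floats.dtype)       # torch.int64 torch.float32

# Cast an existing tensor to another dtype; the original is left unchanged
as_float = ints.to(torch.float32)
print(as_float)        # tensor([1., 2., 3.])
print(ints.dtype)      # torch.int64
```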



### Student Checkpoint 3

1. What is the shape of a tensor representing one sentence with 10 tokens?
2. Why do NLP models use tensors instead of Python lists?



## 4. Tensor Operations


In [11]:

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

a + b


tensor([5., 7., 9.])

In [12]:

a * b


tensor([ 4., 10., 18.])

In [13]:

M = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
v = torch.tensor([5.0, 6.0])

torch.matmul(M, v)


tensor([17., 39.])


### Student Checkpoint 4

1. What is the difference between element-wise and matrix multiplication?
2. Why is matrix multiplication essential for Transformers?



## 5. Devices: CPU vs GPU


In [14]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


device(type='cuda')

In [15]:

x = x.to(device)
x


tensor([1, 2, 3], device='cuda:0')
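
A short sketch of device-agnostic code, assuming the same `cuda`-or-`cpu` selection as above. It creates a tensor directly on the chosen device, keeps both operands of an operation on the same device, and moves the result back to the CPU before handing it to NumPy.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the target device instead of moving it afterwards
t = torch.zeros(2, 3, device=device)
print(t.device)

# Operands of an operation must live on the same device
other = torch.ones(2, 3).to(t.device)
total = t + other
print(total.device)

# NumPy only works with CPU memory, so move the tensor back before converting
arr = total.cpu().numpy()
print(arr.shape)  # (2, 3)
```

Written this way, the same code runs unchanged on a machine with or without a GPU.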


### Student Checkpoint 5

1. What happens if data and model are on different devices?
2. Why are GPUs critical for large language models?


Utility functions also exist to create tensors with given shapes and contents:

In [16]:
zeros = torch.zeros(2, 5)  # a tensor of all zeros
print(zeros)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])


In [17]:
ones = torch.ones(3, 4)   # a tensor of all ones
print(ones)

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])


In [18]:
rr = torch.arange(1, 10) # range from [1, 10)
print(rr)

tensor([1, 2, 3, 4, 5, 6, 7, 8, 9])


In [19]:
rr + 2

tensor([ 3,  4,  5,  6,  7,  8,  9, 10, 11])

In [20]:
rr * 2

tensor([ 2,  4,  6,  8, 10, 12, 14, 16, 18])

In [21]:
a = torch.tensor([[1, 2], [2, 3], [4, 5]])      # (3, 2)
b = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])  # (2, 4)

print("A is", a)
print("B is", b)
print("The product is", a.matmul(b)) #(3, 4)
print("The other product is", a @ b) # +, -, *, @

A is tensor([[1, 2],
        [2, 3],
        [4, 5]])
B is tensor([[1, 2, 3, 4],
        [5, 6, 7, 8]])
The product is tensor([[11, 14, 17, 20],
        [17, 22, 27, 32],
        [29, 38, 47, 56]])
The other product is tensor([[11, 14, 17, 20],
        [17, 22, 27, 32],
        [29, 38, 47, 56]])
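
The shape rule behind the product above can be sketched as follows: `(n, k) @ (k, m)` gives `(n, m)`, and PyTorch raises a `RuntimeError` when the inner dimensions differ. The names `left` and `right` are illustrative.

```python
import torch

left = torch.ones(3, 2)
right = torch.ones(2, 4)

print((left @ right).shape)   # torch.Size([3, 4]): inner dimensions (2 and 2) match

error_message = None
try:
    right @ left              # (2, 4) @ (3, 2): inner dimensions 4 and 3 do not match
except RuntimeError as err:
    error_message = str(err)
    print("Shape mismatch:", error_message)

# Element-wise multiplication instead requires equal (or broadcastable) shapes
print((left * left).shape)    # torch.Size([3, 2])
```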


The **shape** of a tensor (which can be accessed with `.shape`) is the size of the tensor along each of its dimensions. Here are some examples:

In [22]:
matr_2d = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(matr_2d.shape)
print(matr_2d)

torch.Size([2, 3])
tensor([[1, 2, 3],
        [4, 5, 6]])


In [23]:
matr_3d = torch.tensor([[[1, 2, 3, 4], [-2, 5, 6, 9]], [[5, 6, 7, 2], [8, 9, 10, 4]], [[-3, 2, 2, 1], [4, 6, 5, 9]]])
print(matr_3d)
print(matr_3d.shape)

tensor([[[ 1,  2,  3,  4],
         [-2,  5,  6,  9]],

        [[ 5,  6,  7,  2],
         [ 8,  9, 10,  4]],

        [[-3,  2,  2,  1],
         [ 4,  6,  5,  9]]])
torch.Size([3, 2, 4])


**Reshaping** tensors can be used to make batch operations easier (more on that later), but be careful that the data is reshaped in the order you expect:

In [24]:
rr = torch.arange(1, 16)
print("The shape is currently", rr.shape)
print("The contents are currently", rr)
print()
rr = rr.view(5, 3)
print("After reshaping, the shape is currently", rr.shape)
print("The contents are currently", rr)

The shape is currently torch.Size([15])
The contents are currently tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

After reshaping, the shape is currently torch.Size([5, 3])
The contents are currently tensor([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12],
        [13, 14, 15]])
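
To illustrate the warning about ordering, this sketch compares reshaping with transposing. Both produce a `(3, 2)` tensor, but the elements end up in different positions. It also shows `-1`, which lets PyTorch infer one dimension.

```python
import torch

seq = torch.arange(1, 7)          # tensor([1, 2, 3, 4, 5, 6])

rows = seq.view(2, 3)             # fills row by row: [[1, 2, 3], [4, 5, 6]]
reshaped = rows.view(3, 2)        # re-reads the same order: [[1, 2], [3, 4], [5, 6]]
transposed = rows.t()             # swaps axes instead: [[1, 4], [2, 5], [3, 6]]

print(reshaped)
print(transposed)
print(torch.equal(reshaped, transposed))  # False

# -1 asks PyTorch to infer that dimension from the total number of elements
print(seq.view(-1, 2).shape)      # torch.Size([3, 2])
```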


Finally, you can also inter-convert tensors with **NumPy arrays**:

In [25]:
import numpy as np

# numpy.ndarray --> torch.Tensor:
arr = np.array([[1, 0, 5]])
data = torch.tensor(arr)
print("This is a torch.tensor", data)

# torch.Tensor --> numpy.ndarray:
new_arr = data.numpy()
print("This is a np.ndarray", new_arr)

This is a torch.tensor tensor([[1, 0, 5]])
This is a np.ndarray [[1 0 5]]
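
One detail worth knowing when converting: `torch.tensor(arr)` copies the data, while `torch.from_numpy(arr)` shares memory with the NumPy array. A small sketch with illustrative variable names:

```python
import numpy as np
import torch

source = np.array([1, 0, 5])

copied = torch.tensor(source)       # makes an independent copy
shared = torch.from_numpy(source)   # shares memory with the NumPy array

source[0] = 99                      # modify the NumPy array in place

print(copied)  # unchanged: still starts with 1
print(shared)  # reflects the change: now starts with 99
```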


One of the reasons why we use **tensors** is *vectorized operations*: operations that can be conducted in parallel over a particular dimension of a tensor.

In [26]:
data = torch.arange(1, 36, dtype=torch.float32).reshape(5, 7)
print("Data is:", data)

# We can perform operations like *sum* over each row...
print("Taking the sum over rows:")
print(data.sum(dim=1)) #(5,)

# or over each column.
print("Taking the sum over columns:")
print(data.sum(dim=0)) #(7,)

# Other operations are available:
print("Taking the stdev over rows:")
print(data.std(dim=1))


Data is: tensor([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14.],
        [15., 16., 17., 18., 19., 20., 21.],
        [22., 23., 24., 25., 26., 27., 28.],
        [29., 30., 31., 32., 33., 34., 35.]])
Taking the sum over rows:
tensor([ 28.,  77., 126., 175., 224.])
Taking the sum over columns:
tensor([ 75.,  80.,  85.,  90.,  95., 100., 105.])
Taking the stdev over rows:
tensor([2.1602, 2.1602, 2.1602, 2.1602, 2.1602])


In [27]:
data = torch.arange(1, 7, dtype=torch.float32).reshape(1, 2, 3)
print(data)
print(data.sum(dim=0).sum(dim=0))
print(data.sum(dim=0).sum(dim=0).shape)

tensor([[[1., 2., 3.],
         [4., 5., 6.]]])
tensor([5., 7., 9.])
torch.Size([3])


In [28]:
data.sum()

tensor(21.)
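
Reductions drop the dimension they operate over unless `keepdim=True` is passed. The sketch below (with an illustrative tensor named `scores`) uses this to normalize each row by its own sum without writing a loop.

```python
import torch

scores = torch.arange(1, 7, dtype=torch.float32).reshape(2, 3)
# tensor([[1., 2., 3.],
#         [4., 5., 6.]])

row_sums = scores.sum(dim=1)                      # shape (2,)
row_sums_kept = scores.sum(dim=1, keepdim=True)   # shape (2, 1)

# Because the reduced dimension is kept, broadcasting divides each row by its own sum
normalized = scores / row_sums_kept
print(normalized.sum(dim=1))  # each row now sums to 1
```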

**Indexing**

You can access arbitrary elements of a tensor using the `[]` operator.

In [29]:
# Initialize an example tensor
x = torch.Tensor([
                  [[1, 2], [3, 4]],
                  [[5, 6], [7, 8]],
                  [[9, 10], [11, 12]]
                 ])
x

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])

In [30]:
x.shape

torch.Size([3, 2, 2])

In [31]:
# Access the 0th element along the first dimension, which is the first 2x2 matrix
x[0] # Equivalent to x[0, :]

tensor([[1., 2.],
        [3., 4.]])

In [32]:
x[:, 0]

tensor([[ 1.,  2.],
        [ 5.,  6.],
        [ 9., 10.]])

We can also index into multiple dimensions with `:`.

In [33]:
# Get the top left element of each element in our tensor
x[:, 0, 0]

tensor([1., 5., 9.])

In [34]:
x[:, :, :]

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])

We can also access arbitrary elements in each dimension.

In [35]:
# Let's access the 0th and 1st elements, each twice
# same as stacking x[0], x[0], x[1], x[1]
i = torch.tensor([0, 0, 1, 1])
x[i]

tensor([[[1., 2.],
         [3., 4.]],

        [[1., 2.],
         [3., 4.]],

        [[5., 6.],
         [7., 8.]],

        [[5., 6.],
         [7., 8.]]])
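
Indexing with a tensor of integers is the same mechanism behind an embedding lookup. As a hedged sketch, the table and token IDs below are made up for illustration; a real model would learn the table values.

```python
import torch

# Hypothetical embedding table: 5 vocabulary entries, 3 dimensions each
embedding_table = torch.arange(15, dtype=torch.float32).reshape(5, 3)

# Hypothetical token IDs for one sentence (ID 4 appears twice)
token_ids = torch.tensor([4, 0, 4])

vectors = embedding_table[token_ids]   # one row per token, shape (3, 3)
print(vectors)
```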

In [36]:
# Let's access the 0th elements of the 1st and 2nd elements

i = torch.tensor([1, 2])
j = torch.tensor([0])
x[i, j]

tensor([[ 5.,  6.],
        [ 9., 10.]])

We can get a `Python` scalar value from a tensor with `item()`.

In [37]:
x[0, 0, 0]

tensor(1.)

In [38]:
x[0, 0, 0].item()

1.0


## 6. Automatic Differentiation (Autograd)

PyTorch is well known for its automatic differentiation feature. Calling the `backward()` method on a result asks PyTorch to compute gradients with respect to every tensor created with `requires_grad=True`; the gradients are then stored in each tensor's `grad` attribute.


In [39]:

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()
x.grad


tensor(7.)

In [None]:
import pprint
pp = pprint.PrettyPrinter()

# Create an example tensor
# requires_grad=True tells PyTorch to track operations on x so gradients can be computed
x = torch.tensor([2.], requires_grad=True)

# Print the gradient if it has been calculated
# Currently None because backward() has not been called yet
pp.pprint(x.grad)

None


In [None]:
# Calculating the gradient of y with respect to x
y = x * x * 3 # 3x^2
y.backward()
pp.pprint(x.grad) # d(y)/d(x) = d(3x^2)/d(x) = 6x = 12

tensor([12.])


Let's run backprop from a different tensor again to see what happens.

In [None]:
z = x * x * 3 # 3x^2
z.backward()
pp.pprint(x.grad)

tensor([24.])


In [None]:
x.grad = None  # reset the accumulated gradient before running backward() again
z = x * x * 3 # 3x^2
z.backward()
# y = x * x * 3
pp.pprint(x.grad)

tensor([12.])


In [None]:
z = x * x * 3 # 3x^2
z.backward()
# y = x * x * 3
pp.pprint(x.grad)

tensor([24.])


In [None]:
z = x * x * 3 # 3x^2
z.backward()
# y = x * x * 3
pp.pprint(x.grad)

tensor([36.])


We can see that the `x.grad` is updated to be the sum of the gradients calculated so far. When we run backprop in a neural network, we sum up all the gradients for a particular neuron before making an update. This is exactly what is happening here! This is also the reason why we need to run `zero_grad()` in every training iteration (more on this later). Otherwise our gradients would keep building up from one training iteration to the other, which would cause our updates to be wrong.
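
The accumulate-then-reset pattern can be sketched as a hand-written gradient descent loop. The learning rate of `0.1` and the variable `w` are illustrative choices; an optimizer's `zero_grad()` and `step()` perform the same two jobs for all parameters of a model.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

for step in range(3):
    loss = (w * w * 3).sum()     # 3w^2, reduced to a scalar
    loss.backward()              # adds d(loss)/dw = 6w into w.grad

    with torch.no_grad():        # the update itself should not be tracked
        w -= 0.1 * w.grad

    w.grad.zero_()               # reset so the next step starts from zero
    print(step, w.item())
```

Without the `w.grad.zero_()` line, each step would use the sum of all earlier gradients instead of the current one.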

## Customized Backward Function
In some rare cases, you might want to design your own operators, or calculate higher-order gradients that are not supported by PyTorch. In these cases you can define your own function with customized forward and backward computation. However, always check whether something is already implemented in PyTorch (which is very likely) before writing your own forward and backward computation. See more at https://pytorch.org/docs/stable/notes/extending.html.
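
As a minimal sketch of the mechanism, the block below defines a toy operation with `torch.autograd.Function`. The class `Square` is a hypothetical example (PyTorch already supports squaring natively), used only to show where the forward and backward code goes.

```python
import torch

class Square(torch.autograd.Function):
    """Illustrative custom op computing x^2 with a hand-written gradient."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)    # keep x for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # chain rule: upstream gradient times d(x^2)/dx

inp = torch.tensor([3.0], requires_grad=True)
out = Square.apply(inp)             # custom Functions are called through .apply
out.sum().backward()
print(inp.grad)  # tensor([6.])
```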

## Neural Network Module

So far we have looked at tensors, their properties, and basic operations on them. These are especially useful to get familiar with if we are building the layers of our network from scratch. We will utilize these in Assignment 2, but moving forward, we will use predefined blocks in the `torch.nn` module of `PyTorch`. We will then put together these blocks to create complex networks. Let's start by importing this module with an alias so that we don't have to type `torch.nn` every time we use it.

In [None]:
import torch.nn as nn

### **Linear Layer**
We can use `nn.Linear(H_in, H_out)` to create a linear layer. It takes an input of shape `(N, *, H_in)` and produces an output of shape `(N, *, H_out)`. The `*` denotes that there can be an arbitrary number of dimensions in between. The linear layer performs the operation `Ax+b`, where the weight matrix `A` (of shape `(H_out, H_in)`) and the bias `b` are initialized randomly. If we don't want the linear layer to learn the bias parameters, we can initialize our layer with `bias=False`.

In [None]:
# Create the inputs: shape (N, *, H_in) = (2, 3, 4)
input = torch.ones(2,3,4)

# Make a linear layer transforming (N, *, H_in)-dimensional inputs
# to (N, *, H_out)-dimensional outputs
linear = nn.Linear(4, 2)
linear_output = linear(input)
linear_output

tensor([[[-0.8810,  0.3769],
         [-0.8810,  0.3769],
         [-0.8810,  0.3769]],

        [[-0.8810,  0.3769],
         [-0.8810,  0.3769],
         [-0.8810,  0.3769]]], grad_fn=<ViewBackward0>)

In [None]:
linear_output.shape

torch.Size([2, 3, 2])

In [None]:
list(linear.parameters()) # Ax + b

[Parameter containing:
 tensor([[-0.1443,  0.2130,  0.2116, -0.4267],
         [-0.3379,  0.2243,  0.3289, -0.2484]], requires_grad=True),
 Parameter containing:
 tensor([-0.4608,  0.0073], requires_grad=True)]
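
To make the `Ax+b` description concrete, the sketch below recomputes the layer's output by hand from its `weight` and `bias` attributes. The names `layer`, `inputs`, and `manual` are illustrative.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
inputs = torch.ones(2, 3, 4)

# nn.Linear stores its weight with shape (H_out, H_in), so the manual version transposes it
manual = inputs @ layer.weight.T + layer.bias

print(layer.weight.shape)                       # torch.Size([2, 4])
print(torch.allclose(layer(inputs), manual))    # True
```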

In [None]:
# Data of shape [batch_size, feature_dim]    (feature_dim = 4)
# Output of shape [batch_size, output_dim]   (output_dim = 2)

# The linear layer's weight has shape (output_dim, feature_dim) = (2, 4)

### **Other Module Layers**
There are several other preconfigured layers in the `nn` module. Some commonly used examples are `nn.Conv2d`, `nn.ConvTranspose2d`, `nn.BatchNorm1d`, `nn.BatchNorm2d`, `nn.Upsample` and `nn.MaxPool2d` among many others. We will learn more about these as we progress in the course. For now, the only important thing to remember is that we can treat each of these layers as plug and play components: we will be providing the required dimensions and `PyTorch` will take care of setting them up.

### **Activation Function Layer**
We can also use the `nn` module to apply activation functions to our tensors. Activation functions are used to add non-linearity to our network. Some examples of activation functions are `nn.ReLU()`, `nn.Sigmoid()` and `nn.LeakyReLU()`. Activation functions operate on each element separately, so the output tensor has the same shape as the input tensor.

In [None]:
linear_output

tensor([[[-0.8810,  0.3769],
         [-0.8810,  0.3769],
         [-0.8810,  0.3769]],

        [[-0.8810,  0.3769],
         [-0.8810,  0.3769],
         [-0.8810,  0.3769]]], grad_fn=<ViewBackward0>)

In [None]:
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
output

tensor([[[0.2930, 0.5931],
         [0.2930, 0.5931],
         [0.2930, 0.5931]],

        [[0.2930, 0.5931],
         [0.2930, 0.5931],
         [0.2930, 0.5931]]], grad_fn=<SigmoidBackward0>)
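
The element-wise behaviour can be checked on a small hand-picked tensor. This sketch (with illustrative names `values`, `relu`, and `sig`) applies two activations and compares the sigmoid against its formula, 1 / (1 + exp(-x)).

```python
import torch
import torch.nn as nn

values = torch.tensor([[-2.0, 0.0, 3.0]])

relu = nn.ReLU()
sig = nn.Sigmoid()

print(relu(values))   # negatives become 0: tensor([[0., 0., 3.]])
print(sig(values))    # each entry squashed into (0, 1)

# Sigmoid is 1 / (1 + exp(-x)), applied to every element independently
manual = 1 / (1 + torch.exp(-values))
print(torch.allclose(sig(values), manual))   # True
print(sig(values).shape == values.shape)     # True
```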

### **Putting the Layers Together**
So far we have seen that we can create layers and pass the output of one as the input of the next. Instead of creating intermediate tensors and passing them around, we can use `nn.Sequential`, which does exactly that.

In [None]:
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid()
)

input = torch.ones(2,3,4)
output = block(input)
output

tensor([[[0.6822, 0.5056],
         [0.6822, 0.5056],
         [0.6822, 0.5056]],

        [[0.6822, 0.5056],
         [0.6822, 0.5056],
         [0.6822, 0.5056]]], grad_fn=<SigmoidBackward0>)
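
A slightly longer stack shows that the container simply calls its layers in order. The layer sizes below are arbitrary, and the fixed seed is only there so repeated runs produce the same weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed only so repeated runs give the same weights

mlp = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

batch = torch.ones(2, 3, 4)

# Applying the layers one at a time gives the same result as calling the container
step_by_step = batch
for layer in mlp:
    step_by_step = layer(step_by_step)

print(mlp(batch).shape)                          # torch.Size([2, 3, 2])
print(torch.allclose(mlp(batch), step_by_step))  # True
```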


### Student Checkpoint 6

1. Why are gradients needed in training?
2. What does `backward()` compute?



## 7. Building a Simple Neural Network


In [None]:

import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)
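

As a usage sketch, the class can be instantiated and called like a function, which runs `forward()`. The class is repeated here only so the block runs on its own; `demo_model` and the random batch are illustrative.

```python
import torch
import torch.nn as nn

class SimpleModel(nn.Module):   # same definition as in the cell above
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

demo_model = SimpleModel()
batch = torch.randn(5, 10)            # 5 examples, 10 features each

out = demo_model(batch)               # calling the model runs forward()
print(out.shape)                      # torch.Size([5, 1])

n_params = sum(p.numel() for p in demo_model.parameters())
print(n_params)                       # 11 = 10 weights + 1 bias
```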



### Student Checkpoint 7

1. What does `nn.Module` represent?
2. Why must the `forward` method be defined?



## 8. Loss Functions and Optimizers


In [None]:

model = SimpleModel().to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
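

To see what `nn.MSELoss()` measures, this sketch compares it with the mean of squared differences computed by hand on two made-up values.

```python
import torch
import torch.nn as nn

preds = torch.tensor([2.0, 4.0])
targets = torch.tensor([1.0, 2.0])

mse = nn.MSELoss()
by_hand = ((preds - targets) ** 2).mean()   # (1^2 + 2^2) / 2

print(mse(preds, targets))   # tensor(2.5000)
print(by_hand)               # tensor(2.5000)
```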



### Student Checkpoint 8

1. What does the loss function measure?
2. Why is Adam commonly used in NLP?



## 9. Training Loop


In [None]:

x = torch.randn(5, 10).to(device)
y = torch.randn(5, 1).to(device)

for epoch in range(3):
    optimizer.zero_grad()
    predictions = model(x)
    loss = loss_fn(predictions, y)
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch}, Loss: {loss.item()}")
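

The loop above uses random targets, so there is little for the model to learn. As a hedged sketch of the same five steps on a learnable problem, the block below builds synthetic data from a fixed linear rule and tracks the loss. The data, the learning rate of `0.05`, and the 200 epochs are assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Synthetic regression data: targets are a fixed linear function of the inputs
features = torch.randn(64, 10)
true_weights = torch.randn(10, 1)
targets = features @ true_weights

net = nn.Linear(10, 1)
criterion = nn.MSELoss()
opt = torch.optim.Adam(net.parameters(), lr=0.05)

losses = []
for epoch in range(200):
    opt.zero_grad()                           # clear gradients from the previous step
    loss = criterion(net(features), targets)  # forward pass and loss
    loss.backward()                           # compute gradients
    opt.step()                                # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The last loss should be well below the first, since the targets follow a rule the linear layer can represent.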



### Student Checkpoint 9

1. Why are gradients reset every iteration?
2. Which line updates the model weights?



## 10. How This Maps to NLP Models

| PyTorch Concept | NLP Usage |
|----------------|----------|
| Tensor | Token IDs, embeddings |
| Linear Layer | Feedforward blocks |
| Autograd | Backpropagation |
| Optimizer | Fine-tuning |
| Training Loop | RNNs, Transformers, LLMs |
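
The rows of the table can be tied together in one small sketch: token IDs go through an embedding, are averaged into a sentence vector, and are passed to a linear layer, with autograd providing gradients. All sizes, token IDs, and labels are hypothetical, and averaging embeddings is a deliberately simple stand-in for the sequence models covered later in the course.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration
vocab_size, embed_dim, num_classes = 100, 16, 2

embedding = nn.Embedding(vocab_size, embed_dim)   # token IDs -> vectors
classifier = nn.Linear(embed_dim, num_classes)    # sentence vector -> class scores

token_ids = torch.tensor([[5, 23, 7, 42],         # tensor of token IDs,
                          [8, 1, 99, 3]])         # shape (batch=2, seq_len=4)
labels = torch.tensor([0, 1])

vectors = embedding(token_ids)        # (2, 4, 16)
sentence = vectors.mean(dim=1)        # average over tokens -> (2, 16)
logits = classifier(sentence)         # (2, 2)

loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()                       # autograd fills in gradients for both layers
print(logits.shape, embedding.weight.grad.shape)
```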



### Final Reflection

1. Which PyTorch concept is still unclear?
2. Which upcoming NLP topic will rely most on these fundamentals?
