# PyTorch Fundamentals: Tensors and Linear Regression  

This notebook provides an introduction to PyTorch fundamentals, focusing on tensor operations and implementing linear regression.

It covers:

* **PyTorch Tensors Essentials**: Basic tensor creation, properties (shape, dtype), and device management (CPU vs. GPU).
* **Implementing Linear Regression (Low-Level)**: A manual implementation of linear regression using raw PyTorch tensor operations, including forward pass, MSE loss calculation, backpropagation with loss.backward(), and parameter updates using torch.no_grad().
* **Linear Regression Using PyTorch's High-Level API**: An implementation of linear regression using torch.nn.Linear for model definition, torch.optim.SGD for optimization, and torch.nn.MSELoss for the loss function, demonstrating a more streamlined approach.  

The notebook uses the California Housing dataset as an example for linear regression, showing data preparation steps like splitting, conversion to tensors, and normalization.




This notebook is based on Chapter 10, "Building Neural Networks with PyTorch", of the "Hands-On ML" textbook.  

Sources:  
* [Online Chapter (subscription required)](https://learning.oreilly.com/library/view/hands-on-machine-learning/9798341607972/ch10.html)  
* [GitHub repo with code for chapter 10](https://github.com/ageron/handson-mlp/blob/main/10_neural_nets_with_pytorch.ipynb)

## PyTorch Tensors Essentials

In [None]:
import torch

In [None]:
# you can create a PyTorch tensor much like you would create a NumPy array.
X = torch.tensor([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])
X

tensor([[1., 4., 7.],
        [2., 3., 6.]])

In [None]:
# get a tensor’s shape and data type
X.dtype

torch.float32

Note that tensors of strings or objects are not supported.

In [None]:
X.shape

torch.Size([2, 3])

This means:
* 2 → the number of rows (samples, observations, or examples)  
* 3 → the number of columns (features, input dimensions, or variables)

You can think of this as:
> “We have 2 samples, each with 3 features.”

Or in the context of machine learning:

> Each row = one training example.  
> Each column = one feature describing that example.

Notice that the default precision for floats is 32 bits in PyTorch, whereas it’s 64 bits in NumPy. It’s generally better to use 32 bits in deep learning because this takes half the RAM and speeds up computations, and neural nets do not actually need the extra precision offered by 64-bit floats. So when calling the torch.tensor() function to convert a NumPy array to a tensor, it’s best to specify dtype=torch.float32. Alternatively, you can use torch.FloatTensor(), which automatically converts the array to 32 bits.
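As a minimal sketch of the conversion options described above (the array `arr` and the variable names are illustrative, not part of the original notebook):

```python
import numpy as np
import torch

arr = np.array([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])  # NumPy defaults to float64

t_default = torch.tensor(arr)                       # keeps NumPy's float64
t_float32 = torch.tensor(arr, dtype=torch.float32)  # explicit cast to 32 bits
t_legacy = torch.FloatTensor(arr)                   # always produces float32

print(t_default.dtype, t_float32.dtype, t_legacy.dtype)
```

Without an explicit dtype, the tensor inherits the 64-bit precision of the NumPy array, which is usually not what you want for training.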

### Specify device on Colab

In [None]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps" # for Apple
else:
    device = "cpu"

On a Colab GPU Runtime, device will be equal to "cuda". Now let’s create a tensor on that GPU. To do that, one option is to create the tensor on the CPU, then copy it to the GPU using the to() method:

In [None]:
M = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
M = M.to(device)

You can always tell which device a tensor lives on by looking at its device attribute:

In [None]:
M.device

device(type='cuda', index=0)

In [None]:
R = M @ M.T  # run some operations on the GPU
R

tensor([[14., 32.],
        [32., 77.]], device='cuda:0')

Note that the result R also lives on the GPU. This means we can perform multiple operations on the GPU without having to transfer data back and forth between the CPU and the GPU. This is crucial in deep learning because data transfer between devices can often become a performance bottleneck.

In [None]:
M = torch.rand((1000, 1000))  # on the CPU
%timeit M @ M.T


22.5 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:

M = torch.rand((1000, 1000), device="cuda")  # on the GPU
%timeit M @ M.T

647 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


GPUs work by breaking large operations into smaller operations and running them in parallel across thousands of cores. If the task is small, it cannot be broken up into that many pieces, and the performance gain is therefore smaller. In fact, when running many tiny tasks, it can sometimes be faster to just run the operations on the CPU.

## Implementing Linear Regression

This exercise uses the California Housing dataset.

In [None]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [None]:
from sklearn.model_selection import train_test_split
X_train_and_valid, X_test, y_train_and_valid, y_test = train_test_split(
    housing.data, housing.target, random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_and_valid, y_train_and_valid, random_state=42)

Convert the data to tensors and normalize it. We could use a StandardScaler for this, but let’s just use tensor operations instead, to get a bit of practice.

In [None]:
# By default, fetch_california_housing returns NumPy arrays (float64).
# Neural networks expect float32 tensors, so we cast with FloatTensor.

X_train = torch.FloatTensor(X_train)
X_valid = torch.FloatTensor(X_valid)
X_test = torch.FloatTensor(X_test)


means = X_train.mean(dim=0, keepdims=True)
stds = X_train.std(dim=0, keepdims=True)
X_train = (X_train - means) / stds
X_valid = (X_valid - means) / stds
X_test = (X_test - means) / stds

Normalization brings all features roughly to a mean ≈ 0 and standard deviation ≈ 1, which:  
* stabilizes gradient descent;  
* helps the network converge faster;  
* avoids one large-magnitude feature dominating the loss.  
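To see the effect of this normalization in isolation, here is a small sketch on synthetic data (the feature scales and offsets are made up for illustration). It applies the same mean/std operations as the cell above and checks the result:

```python
import torch

torch.manual_seed(0)
# Hypothetical data: three features on very different scales
scales = torch.tensor([10.0, 0.1, 5.0])
offsets = torch.tensor([100.0, -3.0, 0.0])
X = torch.randn(1000, 3) * scales + offsets

means = X.mean(dim=0, keepdim=True)  # per-feature mean, shape (1, 3)
stds = X.std(dim=0, keepdim=True)    # per-feature std, shape (1, 3)
X_norm = (X - means) / stds

print(X_norm.mean(dim=0))  # each entry close to 0
print(X_norm.std(dim=0))   # each entry close to 1
```

Note that the validation and test sets in the notebook are normalized with the training set's statistics, so their means and standard deviations will only be approximately 0 and 1.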

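We need to reshape the target tensors to column vectors of shape (n, 1) so that they match the shape of the predictions X @ w + b. With 1D targets of shape (n,), the subtraction in the loss would broadcast to an (n, n) matrix and silently compute the wrong loss.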
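The shape issue can be illustrated with a tiny sketch (the values are arbitrary): subtracting a 1D target from a column of predictions broadcasts to a square matrix instead of an element-wise difference.

```python
import torch

y_pred = torch.zeros(4, 1)                    # shape of X @ w + b for 4 instances
y_flat = torch.tensor([1.0, 2.0, 3.0, 4.0])   # 1D targets, shape (4,)
y_col = y_flat.reshape(-1, 1)                 # column vector, shape (4, 1)

bad = y_pred - y_flat   # broadcasts (4, 1) against (4,) -> (4, 4)
good = y_pred - y_col   # element-wise difference -> (4, 1)

print(bad.shape, good.shape)
```

No error is raised in the first case, which is why this mistake is easy to miss.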

In [None]:
y_train = torch.FloatTensor(y_train).reshape(-1, 1)
y_valid = torch.FloatTensor(y_valid).reshape(-1, 1)
y_test = torch.FloatTensor(y_test).reshape(-1, 1)

In [None]:
torch.manual_seed(42)
n_features = X_train.shape[1]  # there are 8 input features

In [None]:
n_features

8

In [None]:
w = torch.randn((n_features, 1), requires_grad=True)


In [None]:
b = torch.tensor(0., requires_grad=True)

Explanation:  
* **requires_grad=True** tells PyTorch to track gradients during backpropagation — essential for learning via gradient descent.    

Weights (w) learn how each input feature influences the prediction.  
* w is a column vector with one weight per input dimension.
* Each feature (e.g., house age, number of rooms, latitude, etc.) gets its own weight.
* These weights tell the model how strongly each feature contributes to the output.

Bias (b) shifts the entire prediction up or down, regardless of inputs.
* b is a single number that’s added equally to all predictions.
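The following sketch shows how these shapes combine in the forward pass. The batch of 5 instances and the bias value are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 8)    # 5 instances, 8 features
w = torch.randn(8, 1)    # one weight per feature
b = torch.tensor(0.5)    # scalar bias, broadcast to every prediction

y_pred = X @ w + b       # shape (5, 1)

# The first prediction is the weighted sum of the first row plus the bias
manual_first = (X[0] * w[:, 0]).sum() + b

print(y_pred.shape)
```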

#### Why random initialization matters  
* For linear regression, initializing all weights to zero would still work: there is only one output neuron, so there is no symmetry to break.  
* For neural networks, if all neurons in a layer start with identical weights, they’ll compute the same thing and receive identical gradients → they’ll remain identical forever (no diversity in learning).    

✅ Random initialization breaks this “symmetry” so each neuron learns a distinct feature representation.
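A rough sketch of the symmetry problem, using a hypothetical two-neuron hidden layer with constant initial weights (this toy network is not part of the notebook):

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 3)
y = torch.randn(16, 1)

# Two hidden neurons that start with identical weights (illustrative constants)
W1 = torch.full((3, 2), 0.5, requires_grad=True)  # input -> hidden
w2 = torch.full((2, 1), 0.5, requires_grad=True)  # hidden -> output

hidden = torch.tanh(X @ W1)
loss = ((hidden @ w2 - y) ** 2).mean()
loss.backward()

# Each column of W1.grad holds the gradient for one hidden neuron
print(W1.grad)
```

Both columns of the gradient are the same, so a gradient descent step would leave the two neurons identical.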




#### Why manual_seed can give you different results  

We called torch.manual_seed() to ensure that the results are reproducible. However, PyTorch does not guarantee perfectly reproducible results across different releases, platforms, or devices, so if you do not run the code in this chapter with PyTorch 2.5 on a Colab runtime with an Nvidia T4 GPU, you may get different results.

#### Deterministic vs stochastic algorithms

Moreover, since a GPU splits each operation into multiple chunks and runs them in parallel, the order in which these chunks finish may vary across runs, and this may slightly affect the result due to floating point precision errors. These minor differences may compound during training, and lead to very different models. To reduce this risk, you can tell PyTorch to use only deterministic algorithms by calling torch.use_deterministic_algorithms(True). However, deterministic algorithms are often slower than stochastic ones, and some operations don’t have a deterministic version at all, so you will get an error if your code tries to use one.
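Here is a minimal sketch of both mechanisms on the CPU: re-seeding reproduces the same random numbers, and the deterministic-algorithms flag can be switched on and queried. The flag is reset at the end so that it does not affect the rest of the session.

```python
import torch

# Same seed -> same random numbers (on the same platform and PyTorch version)
torch.manual_seed(42)
a = torch.randn(3)
torch.manual_seed(42)
b = torch.randn(3)
same = torch.equal(a, b)

# Ask PyTorch to use only deterministic algorithms, then check the flag
torch.use_deterministic_algorithms(True)
enabled = torch.are_deterministic_algorithms_enabled()
torch.use_deterministic_algorithms(False)  # restore the default

print(same, enabled)
```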

In [None]:
learning_rate = 0.4
n_epochs = 20
for epoch in range(n_epochs):
    # forward pass
    y_pred = X_train @ w + b
    loss = ((y_pred - y_train) ** 2).mean() # MSE
    # BACKPROP (AUTOGRAD)
    loss.backward()
    # PARAMETER UPDATE
    # Inside no_grad(), PyTorch uses less RAM and runs faster since it
    # doesn't have to keep track of the computation graph.
    with torch.no_grad():
        b -= learning_rate * b.grad
        w -= learning_rate * w.grad
        # Clear accumulated gradients
        b.grad.zero_()
        w.grad.zero_()
    print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item()}")

Epoch 1/20, Loss: 16.158456802368164
Epoch 2/20, Loss: 4.8793745040893555
Epoch 3/20, Loss: 2.255225419998169
Epoch 4/20, Loss: 1.3307636976242065
Epoch 5/20, Loss: 0.9680693745613098
Epoch 6/20, Loss: 0.8142675757408142
Epoch 7/20, Loss: 0.7417045831680298
Epoch 8/20, Loss: 0.7020700573921204
Epoch 9/20, Loss: 0.6765917539596558
Epoch 10/20, Loss: 0.6577963829040527
Epoch 11/20, Loss: 0.6426151394844055
Epoch 12/20, Loss: 0.6297222971916199
Epoch 13/20, Loss: 0.6184941530227661
Epoch 14/20, Loss: 0.6085968017578125
Epoch 15/20, Loss: 0.5998216271400452
Epoch 16/20, Loss: 0.592018723487854
Epoch 17/20, Loss: 0.5850691795349121
Epoch 18/20, Loss: 0.578873336315155
Epoch 19/20, Loss: 0.573345422744751
Epoch 20/20, Loss: 0.5684100389480591
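The training loop zeroes the gradients after each update because backward() adds to the existing .grad values rather than overwriting them. A tiny sketch with a single scalar parameter (values chosen for illustration):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()    # gradient of 3*w with respect to w

(3 * w).backward()
second = w.grad.item()   # the new gradient is added to the old one

w.grad.zero_()
cleared = w.grad.item()  # back to zero, ready for the next step

print(first, second, cleared)
```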


Implementing linear regression using PyTorch’s low-level API wasn’t too hard, but using this approach for more complex models would get really messy and difficult.

In [None]:
X_new = X_test[:3]  # pretend these are new instances
with torch.no_grad():
    y_pred = X_new @ w + b  # use the trained parameters to make predictions

y_pred

tensor([[0.8916],
        [1.6480],
        [2.6577]])

## Linear Regression Using PyTorch’s High-Level API

In [None]:
import torch.nn as nn  # by convention, this module is usually imported this way

torch.manual_seed(42)  # to get reproducible results
model = nn.Linear(in_features=n_features, out_features=1)

For most neural networks you will need to assemble many modules, as we will see later in this chapter, so you can think of modules as math LEGO® bricks.

In [None]:
model.bias

Parameter containing:
tensor([0.3117], requires_grad=True)

In [None]:
model.weight

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)

In [None]:
for param in model.parameters():
  print(param)

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)
Parameter containing:
tensor([0.3117], requires_grad=True)


In [None]:
model(X_train[:2])

tensor([[-0.4718],
        [ 0.1131]], grad_fn=<AddmmBackward0>)
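Calling the module computes the same affine function as the low-level code. As a sketch on random inputs (the input tensor `X` here is hypothetical), note that nn.Linear stores its weight with shape (out_features, in_features), hence the transpose:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
layer = nn.Linear(in_features=8, out_features=1)
X = torch.randn(2, 8)

out_module = layer(X)
out_manual = X @ layer.weight.T + layer.bias  # weight has shape (1, 8)

print(out_module.shape)
```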

Now that we have our model, we need to create an optimizer to update the model parameters, and we must also choose a loss function:

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) # learning_rate variable defined above

Optimizer:  
* We’re using the simple stochastic gradient descent (SGD) optimizer. Depending on how many training instances you feed it at each step, it can be used for:
  * SGD,
  * mini-batch GD, or
  * batch gradient descent.
* To initialize it, we must give it the model parameters and the learning rate.
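The same optimizer performs mini-batch gradient descent if each step only sees a slice of the data. The following is a rough sketch on synthetic data; the dataset, batch size, and learning rate are assumptions for illustration, not values from the notebook:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical noise-free data: y = 3*x0 - 2*x1 + 1
X = torch.randn(256, 2)
y = X @ torch.tensor([[3.0], [-2.0]]) + 1.0

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

initial_loss = criterion(model(X), y).item()
batch_size = 32
for epoch in range(20):
    perm = torch.randperm(len(X))  # reshuffle the instances every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        loss = criterion(model(X[idx]), y[idx])  # loss on one mini-batch
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
final_loss = criterion(model(X), y).item()

print(initial_loss, final_loss)
```

With a batch size of 1 this becomes plain SGD, and with a batch size equal to the dataset size it becomes batch gradient descent.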

In [None]:
mse = nn.MSELoss()

Loss function:  
For the loss function, we create an instance of the nn.MSELoss class: this is also a module, so we can use it like a function, giving it the predictions and the targets, and it will compute the MSE.
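As a quick check, nn.MSELoss with its default settings should match the manual MSE expression used in the low-level loop. The predictions and targets below are made-up values:

```python
import torch
import torch.nn as nn

y_pred = torch.tensor([[1.0], [2.0], [4.0]])
y_true = torch.tensor([[1.5], [2.0], [3.0]])

mse = nn.MSELoss()
loss_module = mse(y_pred, y_true)
loss_manual = ((y_pred - y_true) ** 2).mean()

print(loss_module.item(), loss_manual.item())
```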

Let’s write a small function to train our model. We’re now using higher-level constructs rather than working directly with tensors and autograd.

In [None]:
def train_bgd(model, optimizer, criterion, X_train, y_train, n_epochs):
    for epoch in range(n_epochs):
        y_pred = model(X_train)
        loss = criterion(y_pred, y_train) # nn.MSELoss() instance
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item()}")

**Criterion:**  
* In PyTorch, the **loss function object** is commonly referred to as the criterion, to distinguish it from the loss value itself (which is computed at each training iteration using the criterion). In this example, it’s the MSELoss instance.

The **optimizer.step()** line corresponds to the two lines that updated b and w in our earlier code.  

The **optimizer.zero_grad()** line corresponds to the two lines that zeroed out b.grad and w.grad. Notice that we don’t need to use with torch.no_grad() here since this is done automatically by the optimizer, inside the step() and zero_grad() functions.  
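To make the correspondence concrete, this sketch compares one step of plain SGD (no momentum or weight decay) against the manual update rule. The parameter `p` and the toy loss are illustrative assumptions:

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()
loss.backward()

# Manual update computed from the same gradient: p - lr * grad
expected = p.detach() - 0.1 * p.grad

optimizer.step()       # updates p in place
optimizer.zero_grad()  # clears the gradient for the next iteration

print(p.detach(), expected)
```

Depending on the PyTorch version, zero_grad() either sets the gradient to None or fills it with zeros; both have the same effect on the next backward pass.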

In [None]:
train_bgd(model, optimizer, mse, X_train, y_train, n_epochs)

Epoch 1/20, Loss: 4.3378496170043945
Epoch 2/20, Loss: 0.7802939414978027
Epoch 3/20, Loss: 0.6253842115402222
Epoch 4/20, Loss: 0.6060433983802795
Epoch 5/20, Loss: 0.5956299304962158
Epoch 6/20, Loss: 0.587356686592102
Epoch 7/20, Loss: 0.5802990794181824
Epoch 8/20, Loss: 0.5741382241249084
Epoch 9/20, Loss: 0.5687101483345032
Epoch 10/20, Loss: 0.5639079809188843
Epoch 11/20, Loss: 0.5596511363983154
Epoch 12/20, Loss: 0.5558737516403198
Epoch 13/20, Loss: 0.5525194406509399
Epoch 14/20, Loss: 0.5495392084121704
Epoch 15/20, Loss: 0.5468900203704834
Epoch 16/20, Loss: 0.5445339679718018
Epoch 17/20, Loss: 0.5424376726150513
Epoch 18/20, Loss: 0.5405716300010681
Epoch 19/20, Loss: 0.5389097332954407
Epoch 20/20, Loss: 0.5374288558959961


Now that the model is trained, you can use it to make predictions by simply calling it like a function (preferably inside a no_grad() context):

In [None]:
X_new = X_test[:3]
with torch.no_grad():
    y_pred = model(X_new)
y_pred


tensor([[0.8061],
        [1.7116],
        [2.6973]])

Code explanation:  
* Wrapping predictions in **torch.no_grad()** **disables gradient tracking**, which can improve performance and reduce memory consumption.  
  * Gradient tracking records the operations needed to compute gradients, which is necessary for training but not for prediction.
  * During training, the model computes the gradients of the loss function with respect to its weights, which are then used to update the weights and minimize the loss. During prediction, the model is not learning or updating its weights, so gradient tracking is unnecessary.
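The difference is visible on the output tensor itself. In this sketch (with a hypothetical small layer and random input), the output computed inside no_grad() carries no graph information:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
x = torch.randn(2, 3)

tracked = layer(x)        # part of a computation graph
with torch.no_grad():
    untracked = layer(x)  # same values, no graph recorded

print(tracked.requires_grad, untracked.requires_grad)
```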

These predictions are similar to the ones our previous model made, but not exactly the same: that’s because the nn.Linear module initializes the parameters slightly differently: it uses a uniform random distribution from $-1/\sqrt{n}$ to $+1/\sqrt{n}$, where $n$ is the number of input features (here $n = 8$, so roughly $\pm 0.354$), for both the weights and the bias term.
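The default nn.Linear initialization draws both the weights and the bias uniformly between -1/sqrt(in_features) and +1/sqrt(in_features). The following sketch checks that a freshly created layer respects this bound (the seed is arbitrary):

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(42)
layer = nn.Linear(in_features=8, out_features=1)

bound = 1 / math.sqrt(layer.in_features)  # about 0.354 for 8 features
weights_ok = bool((layer.weight.abs() <= bound).all())
bias_ok = bool((layer.bias.abs() <= bound).all())

print(bound, weights_ok, bias_ok)
```

By contrast, the low-level implementation drew w from a standard normal distribution and set b to zero, so the two models start from different points.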