# Exploring the rank of trained Neural Networks

In this final assignment, you're going to explore trained neural networks and study the rank of their matrices.

**Reminder**: The rank is the number of linearly independent columns of the matrix. If a matrix $A \in \mathbb{R}^{n\times m}$ has rank $k$, then $A$ can be factorized as

$$A = B \cdot C$$

where $B \in \mathbb{R}^{n\times k}$ and $C \in \mathbb{R}^{k\times m}$. If the inner dimension is chosen smaller than the rank, the product $B \cdot C$ can only approximate $A$.

You can find the rank of matrix $A$ by performing Gaussian elimination and counting the number of pivots. This can be done in a few lines of `numpy` code.
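
As a minimal sketch of the reminder above (the matrix `A` below is a made-up example, not part of the assignment): `np.linalg.matrix_rank` estimates the rank from the singular values instead of Gaussian elimination, which tends to be more robust in floating point, and the SVD also yields an explicit factorization into `B` and `C`.

```python
import numpy as np

# Hypothetical 4x3 example: the third column is the sum of the first two,
# so only two columns are linearly independent.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 2.0]])

# Counts the singular values that lie above a small numerical tolerance
k = np.linalg.matrix_rank(A)

# Build B (n x k) and C (k x m) from the k leading singular triplets
U, S, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] * S[:k]   # scale each kept column of U by its singular value
C = Vt[:k, :]

print("rank:", k)
print("shapes:", B.shape, C.shape)
print("A == B @ C (up to rounding):", np.allclose(A, B @ C))
```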

**References**:
- https://arxiv.org/pdf/1804.08838
- https://arxiv.org/pdf/2209.13569
- https://arxiv.org/pdf/2012.13255

Note: The references above are not needed to complete this notebook, but reading them might give you additional insights.

## Important

1. For all the training done, make sure to plot things like the loss values and accuracy on each epoch.

    - You can either use tensorboard or just make a static matplotlib plot.
    
2. Don't add biases to the layers in the network; they are not important for this notebook.
3. No need to use Dropout or BatchNorm on the network.
4. Remember to use GPUs during the training.
5. Always test your hypothesis on both training and testing sets; you might get a surprising result sometimes.
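
One possible way to satisfy point 1 with a static matplotlib plot is sketched below. The helper name `plot_history` and the layout of the `history` dictionary are assumptions made for illustration, and the numbers in the usage example are placeholders, not results from this notebook.

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Plot per-epoch loss and accuracy curves.

    `history` maps a split name (e.g. "train", "valid") to a dict holding
    two equally long lists, "loss" and "acc", with one entry per epoch.
    """
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    for split, values in history.items():
        epochs = range(1, len(values["loss"]) + 1)
        ax_loss.plot(epochs, values["loss"], label=split)
        ax_acc.plot(epochs, values["acc"], label=split)
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_loss.legend()
    ax_acc.legend()
    return fig


# Placeholder values only, to show the expected input format
example_history = {
    "train": {"loss": [0.9, 0.5, 0.3], "acc": [0.70, 0.85, 0.91]},
    "valid": {"loss": [0.8, 0.6, 0.4], "acc": [0.72, 0.82, 0.88]},
}
fig = plot_history(example_history)
```

In a training loop, the lists would be filled by appending the values returned for each epoch before calling the helper once at the end.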

## Task 1: Downloading MNIST and Dataloaders

Download the MNIST dataset and split into training and testing, and create dataloaders.

Link: https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.utils.data as data

import torchvision.transforms as transforms
import torchvision.datasets as datasets

from sklearn import metrics
from sklearn import decomposition
from sklearn import manifold
import matplotlib.pyplot as plt
import numpy as np
import copy

In [None]:
#loading train and test data
train_set = datasets.MNIST(root='./data', train=True, download=True)
test_set = datasets.MNIST(root='./data', train=False, download=True)

In [None]:
mean = train_set.data.float().mean() / 255
std = train_set.data.float().std() / 255

print(f'Calculated mean: {mean}')
print(f'Calculated std: {std}')

Calculated mean: 0.13066047430038452
Calculated std: 0.30810779333114624


In [None]:
#preprocessing our data and transforming it to a suitable format
train_transforms = transforms.Compose([
                            transforms.RandomRotation(5, fill=(0,)),
                            transforms.RandomCrop(28, padding = 2),
                            transforms.ToTensor(),
                            transforms.Normalize(mean = [mean], std = [std])
                                      ])

test_transforms = transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize(mean = [mean], std = [std])
                                     ])

In [None]:
#loading data again with the following transforms
train_data = datasets.MNIST(root = './data',
                            train = True,
                            download = True,
                            transform = train_transforms)

test_data = datasets.MNIST(root = './data',
                           train = False,
                           download = True,
                           transform = test_transforms)

In [None]:
TRAIN_RATIO = 0.9  # fraction of the training set kept for training; the rest is used for validation

n_train_examples = int(len(train_data) * TRAIN_RATIO)
n_valid_examples = len(train_data) - n_train_examples

train_data, valid_data = data.random_split(train_data,
                                  [n_train_examples, n_valid_examples])

In [None]:
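#validation data should not be augmented: copy the subset so that changing
#its transform does not affect the training split
valid_data = copy.deepcopy(valid_data)
valid_data.dataset.transform = test_transforms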

In [None]:
#creating data loaders
batch_size = 64
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
valid_iterator = data.DataLoader(valid_data, batch_size = batch_size, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=False)

## Task 2: Train a neural network

Build a simple Multi-layered Perceptron with ReLU activations, and train it on MNIST until achieving 95% accuracy or higher.


In [None]:
class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()

        self.input_fc = nn.Linear(input_dim, 250)
        self.hidden_fc = nn.Linear(250, 100)
        self.output_fc = nn.Linear(100, output_dim)

    def forward(self, x):

        batch_size = x.shape[0]
        x = x.view(batch_size, -1)
        h_1 = F.relu(self.input_fc(x))
        h_2 = F.relu(self.hidden_fc(h_1))
        y_pred = self.output_fc(h_2)
        return y_pred, h_2

In [None]:
#defining the model
INPUT_DIM = 28 * 28
OUTPUT_DIM = 10

model = MLP(INPUT_DIM, OUTPUT_DIM)

In [None]:
# defining the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = criterion.to(device)

#calculating accuracy
def calculate_accuracy(y_pred, y):
    top_pred = y_pred.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc

#defining training loop
def train(model, iterator, optimizer, criterion, device):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for (x, y) in iterator:

        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()

        y_pred, _ = model(x)
        loss = criterion(y_pred, y)
        acc = calculate_accuracy(y_pred, y)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

#defining evaluating loop
def evaluate(model, iterator, criterion, device):

    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():

        for (x, y) in iterator:
            x = x.to(device)
            y = y.to(device)
            y_pred, _ = model(x)
            loss = criterion(y_pred, y)
            acc = calculate_accuracy(y_pred, y)
            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
#train it on MNIST until achieving 95% accuracy or higher
EPOCHS = 10

for epoch in range(EPOCHS):
    train_loss, train_acc = train(model, train_loader, optimizer, criterion, device)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, device)

    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01
	Train Loss: 0.073 | Train Acc: 97.76%
	 Val. Loss: 0.051 |  Val. Acc: 98.38%
Epoch: 02
	Train Loss: 0.070 | Train Acc: 97.75%
	 Val. Loss: 0.078 |  Val. Acc: 97.62%
Epoch: 03
	Train Loss: 0.068 | Train Acc: 97.88%
	 Val. Loss: 0.055 |  Val. Acc: 98.32%
Epoch: 04
	Train Loss: 0.065 | Train Acc: 97.94%
	 Val. Loss: 0.067 |  Val. Acc: 97.86%
Epoch: 05
	Train Loss: 0.064 | Train Acc: 97.93%
	 Val. Loss: 0.069 |  Val. Acc: 98.02%
Epoch: 06
	Train Loss: 0.061 | Train Acc: 98.14%
	 Val. Loss: 0.062 |  Val. Acc: 98.18%
Epoch: 07
	Train Loss: 0.061 | Train Acc: 98.08%
	 Val. Loss: 0.056 |  Val. Acc: 98.40%
Epoch: 08
	Train Loss: 0.060 | Train Acc: 98.12%
	 Val. Loss: 0.052 |  Val. Acc: 98.48%
Epoch: 09
	Train Loss: 0.061 | Train Acc: 98.08%
	 Val. Loss: 0.054 |  Val. Acc: 98.54%
Epoch: 10
	Train Loss: 0.057 | Train Acc: 98.28%
	 Val. Loss: 0.057 |  Val. Acc: 98.45%


In [None]:
#evaluating the model on the test set
test_loss, test_acc = evaluate(model, test_loader, criterion, device)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.050 | Test Acc: 98.42%


## Task 3: Analyze the rank of the matrices in this network

Perform experiments and answer the following questions:
- What's the average rank of the matrices on all layers?
- How does the rank change as we go to deeper layers?
- Try the same MLP, but change the activation function to others ($\tanh, \sigma, \dots$). Do the answers change?

In [None]:
#function to calculate the rank of each layer's weight matrix
def calculate_ranks(model):
    ranks = []
    for name, param in model.named_parameters():
        if 'weight' in name:
            #move to the CPU first so this also works when the model lives on the GPU
            weight_matrix = param.detach().cpu().numpy()
            rank = np.linalg.matrix_rank(weight_matrix)
            ranks.append(rank)
            print(f"Layer: {name}, Rank: {rank}")
    return ranks

ranks = calculate_ranks(model)

Layer: input_fc.weight, Rank: 250
Layer: hidden_fc.weight, Rank: 100
Layer: output_fc.weight, Rank: 10


In [None]:
#calculating the average rank of the matrices
average_rank = np.mean(ranks)
print(f"Average Rank: {average_rank}")

Average Rank: 120.0


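As we can see, the rank decreases as we go deeper into the network.

It is also noticeable that each rank equals the output dimension of its layer. In other words, every weight matrix is full rank, and the decrease simply follows the shrinking layer widths.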
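
Full rank as reported by `np.linalg.matrix_rank` only says that no singular value falls below a very small numerical tolerance; it says nothing about how quickly the singular values decay. A hedged sketch of a complementary check follows. The helper `rank_at_energy` and the 0.99 threshold are illustrative assumptions, and a synthetic matrix stands in for a trained weight.

```python
import numpy as np


def rank_at_energy(weight, energy=0.99):
    """Smallest k whose top-k singular values hold `energy` of the squared Frobenius norm."""
    s = np.linalg.svd(weight, compute_uv=False)      # singular values, descending
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)  # fraction of energy kept by top-k
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))


rng = np.random.default_rng(0)
# Synthetic stand-in: a rank-3 signal plus small dense noise
signal = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 250))
noisy = signal + 1e-3 * rng.standard_normal((100, 250))

# The noise makes the matrix numerically full rank...
print("matrix_rank:", np.linalg.matrix_rank(noisy))
# ...while almost all of its energy sits in a handful of directions
print("rank at 99% energy:", rank_at_energy(noisy))
```

Applied to the weights of a trained layer, a gap between the two numbers would indicate that the matrix is close to low rank even though it is formally full rank.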

In [None]:
#same MLP but with the tanh activation function
class MLP_tanh(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()

        self.input_fc = nn.Linear(input_dim, 250)
        self.hidden_fc = nn.Linear(250, 100)
        self.output_fc = nn.Linear(100, output_dim)

    def forward(self, x):

        batch_size = x.shape[0]
        x = x.view(batch_size, -1)
        h_1 = F.tanh(self.input_fc(x))
        h_2 = F.tanh(self.hidden_fc(h_1))
        y_pred = self.output_fc(h_2)
        return y_pred, h_2

In [None]:
model_1 = MLP_tanh(INPUT_DIM, OUTPUT_DIM).to(device)  #move to the same device as the data
criterion = nn.CrossEntropyLoss()
optimizer_1 = optim.Adam(model_1.parameters())
EPOCHS = 10

for epoch in range(EPOCHS):
    train_loss, train_acc = train(model_1, train_loader, optimizer_1, criterion, device)
    valid_loss, valid_acc = evaluate(model_1, valid_iterator, criterion, device)

    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01
	Train Loss: 0.454 | Train Acc: 86.37%
	 Val. Loss: 0.155 |  Val. Acc: 95.63%
Epoch: 02
	Train Loss: 0.184 | Train Acc: 94.33%
	 Val. Loss: 0.106 |  Val. Acc: 96.55%
Epoch: 03
	Train Loss: 0.151 | Train Acc: 95.25%
	 Val. Loss: 0.106 |  Val. Acc: 96.66%
Epoch: 04
	Train Loss: 0.135 | Train Acc: 95.72%
	 Val. Loss: 0.098 |  Val. Acc: 96.97%
Epoch: 05
	Train Loss: 0.126 | Train Acc: 96.00%
	 Val. Loss: 0.094 |  Val. Acc: 97.06%
Epoch: 06
	Train Loss: 0.118 | Train Acc: 96.26%
	 Val. Loss: 0.080 |  Val. Acc: 97.61%
Epoch: 07
	Train Loss: 0.113 | Train Acc: 96.43%
	 Val. Loss: 0.071 |  Val. Acc: 97.60%
Epoch: 08
	Train Loss: 0.106 | Train Acc: 96.71%
	 Val. Loss: 0.075 |  Val. Acc: 97.82%
Epoch: 09
	Train Loss: 0.104 | Train Acc: 96.78%
	 Val. Loss: 0.071 |  Val. Acc: 97.79%
Epoch: 10
	Train Loss: 0.105 | Train Acc: 96.66%
	 Val. Loss: 0.086 |  Val. Acc: 96.88%


In [None]:
#evaluating the model on the test set
test_loss, test_acc = evaluate(model_1, test_loader, criterion, device)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.077 | Test Acc: 97.40%


In [None]:
ranks = calculate_ranks(model_1)
average_rank = np.mean(ranks)
print(f"Average Rank: {average_rank}")

Layer: input_fc.weight, Rank: 250
Layer: hidden_fc.weight, Rank: 100
Layer: output_fc.weight, Rank: 10
Average Rank: 120.0


The accuracy remained almost the same (97.40% test accuracy with $\tanh$ versus 98.42% with ReLU), and the ranks did not change at all.

## Task 4: Overfit by scaling the MLP

1. Create a bigger network and train it on MNIST, to the point of overfitting.
2. Now check the rank of the matrices in the network, and answer the same questions.

In [None]:
#creating a bigger MLP
class LargeMLP(nn.Module):
    def __init__(self):
        super(LargeMLP, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28*28, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 64)
        self.fc5 = nn.Linear(64, 10)  # output layer

    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.relu(self.fc4(x))
        x = self.fc5(x)
        return x

In [None]:
#Applying a larger model
large_model = LargeMLP()
large_criterion = nn.CrossEntropyLoss()
large_optimizer = optim.Adam(large_model.parameters(), lr=0.001)

large_model.to(device)

large_train_losses = []
large_train_accuracies = []
num_epochs = 40

for epoch in range(num_epochs):
    large_model.train()
    running_loss = 0.0
    correct_train = 0
    total_train = 0

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        large_optimizer.zero_grad()
        outputs = large_model(inputs)
        loss = large_criterion(outputs, labels)
        loss.backward()
        large_optimizer.step()

        running_loss += loss.item()
        total_train += labels.size(0)
        correct_train += (torch.argmax(outputs, dim=1) == labels).sum().item()

    train_loss = running_loss / len(train_loader)
    train_acc = correct_train / total_train
    large_train_losses.append(train_loss)
    large_train_accuracies.append(train_acc)

    print(f'Epoch [{epoch+1}/{num_epochs}],'f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')


Epoch [1/40],	Train Loss: 0.403 | Train Acc: 87.09%
Epoch [2/40],	Train Loss: 0.175 | Train Acc: 94.61%
Epoch [3/40],	Train Loss: 0.136 | Train Acc: 95.77%
Epoch [4/40],	Train Loss: 0.115 | Train Acc: 96.50%
Epoch [5/40],	Train Loss: 0.109 | Train Acc: 96.57%
Epoch [6/40],	Train Loss: 0.102 | Train Acc: 96.84%
Epoch [7/40],	Train Loss: 0.092 | Train Acc: 97.14%
Epoch [8/40],	Train Loss: 0.086 | Train Acc: 97.39%
Epoch [9/40],	Train Loss: 0.084 | Train Acc: 97.44%
Epoch [10/40],	Train Loss: 0.078 | Train Acc: 97.62%
Epoch [11/40],	Train Loss: 0.076 | Train Acc: 97.69%
Epoch [12/40],	Train Loss: 0.067 | Train Acc: 97.96%
Epoch [13/40],	Train Loss: 0.067 | Train Acc: 97.93%
Epoch [14/40],	Train Loss: 0.067 | Train Acc: 97.99%
Epoch [15/40],	Train Loss: 0.065 | Train Acc: 98.03%
Epoch [16/40],	Train Loss: 0.062 | Train Acc: 98.11%
Epoch [17/40],	Train Loss: 0.060 | Train Acc: 98.18%
Epoch [18/40],	Train Loss: 0.060 | Train Acc: 98.20%
Epoch [19/40],	Train Loss: 0.056 | Train Acc: 98.32%
Ep

In [None]:
ranks = calculate_ranks(large_model)
average_rank = np.mean(ranks)
print(f"Average Rank: {average_rank}")

Layer: fc1.weight, Rank: 512
Layer: fc2.weight, Rank: 256
Layer: fc3.weight, Rank: 128
Layer: fc4.weight, Rank: 64
Layer: fc5.weight, Rank: 10
Average Rank: 194.0


As we can see, even though the model is overfitted, all matrices are still full rank: each rank equals the output dimension of its layer.

## Task 5: Approximate low-rank

From some of the references given at the beginning, you can realize that trained neural networks have intrinsically low dimensionality (meaning low-rank matrices).

In this task, take the overparametrized network already trained in Task 4 and try to approximate each layer's matrix with a product of two low-rank matrices.

This means, if a layer has a matrix $A \in\mathbb{R}^{n\times m}$, then try to find two matrices $B \in \mathbb{R}^{n\times r}$ and $C \in \mathbb{R}^{r\times m}$ so that $\lVert A - B\cdot C\rVert_F$ is minimized, where $\lVert \cdot \rVert_F$ denotes the Frobenius norm. You can use a different norm if you think it makes sense. To learn $B$ and $C$, you can use gradient-descent-like algorithms, where you alternate between updating $B$ and $C$ on each optimization step.

**Ablate**:
Try different values for $r$ and analyze how good your approximation is (e.g., by taking the average Frobenius norm across all layers) as you increase $r$. Make a plot of the result.

Conclude what is the effective rank $r$: the smallest rank such that the approximation of that rank is good enough (meaning the Frobenius norm is smaller than some threshold chosen by you).
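
For reference, the Frobenius-norm problem above has a closed-form minimizer: by the Eckart-Young theorem, truncating the SVD of $A$ to its $r$ largest singular values gives the best rank-$r$ approximation. The sketch below can serve as a baseline to compare a gradient-descent factorization against. The helper name `svd_low_rank` and the random stand-in matrix are assumptions made for illustration.

```python
import torch


def svd_low_rank(A, r):
    """Return B (n x r) and C (r x m) such that B @ C is the best rank-r
    approximation of A in the Frobenius norm."""
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    B = U[:, :r] * S[:r]   # scale the kept left singular vectors
    C = Vh[:r, :]
    return B, C


torch.manual_seed(0)
A = torch.randn(64, 128)          # stand-in for a trained weight matrix
S = torch.linalg.svdvals(A)       # singular values, descending

errors = []
for r in (5, 10, 20, 30):
    B, C = svd_low_rank(A, r)
    error = torch.linalg.norm(A - B @ C).item()      # Frobenius norm of the residual
    # The residual norm is determined by the discarded singular values
    optimum = torch.sqrt((S[r:] ** 2).sum()).item()
    errors.append(error)
    print(f"r={r:2d}  error={error:.4f}  sqrt(sum of discarded s^2)={optimum:.4f}")
```

Because the optimal error can only shrink as $r$ grows, a gradient-descent run whose error increases with $r$ has most likely not converged.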

In [None]:
# Function to perform low-rank matrix factorization by gradient descent
def low_rank_approximation(weight_matrix, rank, num_iterations=1000, lr=0.01, verbose=True):
    m, n = weight_matrix.shape
    # B (m x rank) and C (rank x n) are the two factors being learned
    B = torch.randn(m, rank, requires_grad=True, device=device)
    C = torch.randn(rank, n, requires_grad=True, device=device)

    optimizer = optim.Adam([B, C], lr=lr)

    frobenius_norms = []

    for i in range(num_iterations):
        optimizer.zero_grad()
        approx_matrix = torch.matmul(B, C)
        loss = torch.norm(weight_matrix - approx_matrix, 'fro')  # Frobenius norm of the residual
        loss.backward()
        optimizer.step()

        if verbose and (i+1) % 100 == 0:
            print(f'Epoch [{i+1}/{num_iterations}], Loss: {loss.item():.4f}')

        frobenius_norms.append(loss.item())

    # Frobenius norm of the residual at the last iteration
    final_layer_loss = frobenius_norms[-1]
    print('Final loss on current layer: ', final_layer_loss)
    print()

    return B.detach(), C.detach(), frobenius_norms, final_layer_loss

def apply_low_rank_approximation(model, rank_values):
    # Snapshot the trained weights once, so that every rank approximates the
    # original matrices and not the approximation written by the previous rank
    original_weights = {name: param.detach().clone().to(device)
                        for name, param in model.named_parameters() if 'weight' in name}
    for rank in rank_values:
        print('Norm values for rank ', rank, ':')
        final_losses = []
        for name, param in model.named_parameters():

            if 'weight' in name:
                # low-rank approximation of the original trained matrix
                B, C, frobenius_norms, final_layer_loss = low_rank_approximation(original_weights[name], rank)
                final_losses.append(final_layer_loss)
                # updating weight matrix with low-rank approximation
                with torch.no_grad():
                    param.copy_(torch.matmul(B, C))
        mean = np.mean(final_losses)
        print('Mean frobenius value for current rank: ', mean)
        print()



In [None]:
rank_values = [5, 10, 20, 30]
apply_low_rank_approximation(large_model, rank_values)

low_rank_model_ranks = calculate_ranks(large_model)

#average rank of low-rank approximated model
low_rank_average_rank = np.mean(low_rank_model_ranks)
print(f'Average Rank of Matrices (Low-Rank Approximated Model): {low_rank_average_rank:.2f}')

Norm values for rank  5 :
Epoch [100/1000], Loss: 271.8906
Epoch [200/1000], Loss: 28.5072
Epoch [300/1000], Loss: 5.0118
Epoch [400/1000], Loss: 4.8898
Epoch [500/1000], Loss: 4.8747
Epoch [600/1000], Loss: 4.8588
Epoch [700/1000], Loss: 4.8143
Epoch [800/1000], Loss: 4.7573
Epoch [900/1000], Loss: 4.6312
Epoch [1000/1000], Loss: 4.4148
Final loss on current layer:  4.414775371551514

Epoch [100/1000], Loss: 125.7987
Epoch [200/1000], Loss: 15.8924
Epoch [300/1000], Loss: 13.9436
Epoch [400/1000], Loss: 10.8530
Epoch [500/1000], Loss: 9.4193
Epoch [600/1000], Loss: 8.0707
Epoch [700/1000], Loss: 6.3410
Epoch [800/1000], Loss: 4.6433
Epoch [900/1000], Loss: 3.3837
Epoch [1000/1000], Loss: 2.5697
Final loss on current layer:  2.569650650024414

Epoch [100/1000], Loss: 66.5053
Epoch [200/1000], Loss: 10.1900
Epoch [300/1000], Loss: 5.6859
Epoch [400/1000], Loss: 3.9254
Epoch [500/1000], Loss: 2.3704
Epoch [600/1000], Loss: 1.0769
Epoch [700/1000], Loss: 0.7306
Epoch [800/1000], Loss: 0.6

The smallest mean loss (Frobenius norm) was obtained for rank = 20, so we use it as the effective rank in the following tasks.

## Task 6: Learning with low-rank factorization

Once you found the effective rank $r$, take the same architecture from the previous task, and now replace each layer $A \in \mathbb{R}^{n\times m}$ by a layer that applies $B\cdot C$ with $B\in \mathbb{R}^{n\times r}$ and $C \in \mathbb{R}^{r\times m}$.

**Question**: How much memory do you save? (you can just count the number of parameters of the original network and compare to that of the new network).

Initialize these values with standard initialization, and train this network.

**Question**: How does the learning change? Does it converge faster or slower? What about accuracy on both training and testing sets?

**Question**: Now try doing inference, how much improvement do you see?

In [None]:
class approximated_model(nn.Module):
    def __init__(self, rank):
        super().__init__()
        self.rank = rank
        self.activation = nn.ReLU()
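        self.flatten = nn.Flatten()
        # each dense layer A (n x m) is replaced by C (m -> rank) followed by B (rank -> n)
        # note: the hidden widths used here (1024, 1024, 512, 256) differ from LargeMLP (512, 256, 128, 64)
        self.B1 = nn.Linear(self.rank, 1024, bias=False)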
        self.C1 = nn.Linear(784, self.rank, bias=False)
        self.B2 = nn.Linear(self.rank, 1024, bias=False)
        self.C2 = nn.Linear(1024, self.rank, bias=False)
        self.B3 = nn.Linear(self.rank, 512, bias=False)
        self.C3 = nn.Linear(1024, self.rank, bias=False)
        self.B4 = nn.Linear(self.rank, 256, bias=False)
        self.C4 = nn.Linear(512, self.rank, bias=False)
        self.A5 = nn.Linear(256, 10, bias=False)

    def forward(self, x):
        x = self.flatten(x)
        a1 = self.activation(self.B1(self.C1(x)))

        a2 = self.activation(self.B2(self.C2(a1)))
        a3 = self.activation(self.B3(self.C3(a2)))
        a4 = self.activation(self.B4(self.C4(a3)))

        logits = self.A5(a4)
        return logits

model_approximated = approximated_model(20).to(device)

In [None]:
#calculating memory savings
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
original_params = count_parameters(large_model)
modified_params = count_parameters(model_approximated)

memory_savings = original_params - modified_params
print(f"Memory savings: {memory_savings} parameters")

Memory savings: 449290 parameters
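
The count can be cross-checked by hand: a dense layer $A \in \mathbb{R}^{n\times m}$ with a bias stores $nm + n$ values, while a bias-free factorized layer stores $r(n + m)$. The sketch below recomputes the totals from the layer shapes used in `LargeMLP` and `approximated_model`; the helper names are illustrative.

```python
def dense_params(shapes, bias=True):
    """Parameters of a stack of dense layers given (in_features, out_features) pairs."""
    return sum(i * o + (o if bias else 0) for i, o in shapes)


def factorized_params(shapes, r):
    """Parameters when each (in, out) layer is replaced by two bias-free factors of rank r."""
    return sum(r * (i + o) for i, o in shapes)


# LargeMLP from Task 4 (nn.Linear keeps its default bias)
large_shapes = [(784, 512), (512, 256), (256, 128), (128, 64), (64, 10)]
# approximated_model: four factorized layers plus a dense, bias-free output layer
factored_shapes = [(784, 1024), (1024, 1024), (1024, 512), (512, 256)]

original = dense_params(large_shapes)
modified = factorized_params(factored_shapes, r=20) + dense_params([(256, 10)], bias=False)

print("original:", original)
print("modified:", modified)
print("savings: ", original - modified)
```

Note that the hidden widths of `approximated_model` differ from those of `LargeMLP`, so this comparison mixes the effect of the factorization with a change of architecture.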


In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_approximated.parameters(), lr=0.001)

train_losses = []
train_accuracies = []
num_epochs = 10

for epoch in range(num_epochs):
    model_approximated.train()
    running_loss = 0.0
    correct_train = 0
    total_train = 0

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model_approximated(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        total_train += labels.size(0)
        correct_train += (torch.argmax(outputs, dim=1) == labels).sum().item()

    train_loss = running_loss / len(train_loader)
    train_acc = correct_train / total_train
    train_losses.append(train_loss)
    train_accuracies.append(train_acc)

    print(f'Epoch [{epoch+1}/{num_epochs}],'f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')

Epoch [1/10],	Train Loss: 0.654 | Train Acc: 78.01%
Epoch [2/10],	Train Loss: 0.282 | Train Acc: 91.36%
Epoch [3/10],	Train Loss: 0.222 | Train Acc: 93.16%
Epoch [4/10],	Train Loss: 0.190 | Train Acc: 94.21%
Epoch [5/10],	Train Loss: 0.175 | Train Acc: 94.60%
Epoch [6/10],	Train Loss: 0.163 | Train Acc: 94.96%
Epoch [7/10],	Train Loss: 0.150 | Train Acc: 95.37%
Epoch [8/10],	Train Loss: 0.145 | Train Acc: 95.61%
Epoch [9/10],	Train Loss: 0.133 | Train Acc: 96.00%
Epoch [10/10],	Train Loss: 0.129 | Train Acc: 96.04%
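
For the inference question, one rough way to compare a dense layer with its factorized counterpart is to time forward passes on identical input. The sketch below uses small stand-in layers (a dense 784→512 layer versus a rank-20 factorization), not the trained models, and the helper `time_forward` is hypothetical. It measures CPU wall-clock time; on a GPU, `torch.cuda.synchronize()` would be needed before reading the clock.

```python
import time

import torch
import torch.nn as nn


def time_forward(module, x, repeats=50):
    """Average wall-clock seconds per forward pass (CPU timing)."""
    module.eval()
    with torch.no_grad():
        module(x)                          # warm-up pass, not timed
        start = time.perf_counter()
        for _ in range(repeats):
            module(x)
        return (time.perf_counter() - start) / repeats


torch.manual_seed(0)
dense = nn.Linear(784, 512, bias=False)
factorized = nn.Sequential(
    nn.Linear(784, 20, bias=False),        # plays the role of C
    nn.Linear(20, 512, bias=False),        # plays the role of B
)

x = torch.randn(64, 784)
dense_time = time_forward(dense, x)
factorized_time = time_forward(factorized, x)
print(f"dense:      {dense_time:.6f} s per batch")
print(f"factorized: {factorized_time:.6f} s per batch")
```

Timings of this kind are noisy and hardware dependent, so they are best averaged over several runs before drawing conclusions.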


## Task 7: Final conclusions

Based on all the previous experiments, report your conclusions and try to give an explanation to the behaviours you observed.

Can you think of other ways of using the low-rank factorizations? What about SVD? Provide an explanation.

As we noticed in the first tasks, the rank decreases as we go deeper into the network. This pattern does not mean that the weight matrices become rank-deficient: every matrix is full rank, and its rank equals the output width of its layer. The number of linearly independent columns drops only because the layers get narrower, which reflects a compression and abstraction of information as it flows through the network.

Therefore, in this specific MLP model, the ranks decrease as we go deeper into the network, indicating a decrease in the dimensionality of the learned representations.

It is also worth mentioning that, across the different MLP models we tried, the ranks we obtained were always equal to the output dimensions of the layers.

As for the model in Task 6, its training accuracy (96.04% after 10 epochs, versus 97.62% for the large model at the same epoch) is a bit worse than that of the large model created earlier, but still pretty good. As for memory, the original (large MLP) model has 449290 more parameters than the new approximated one, so the new model saves a considerable amount of memory.
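
Regarding the SVD question: instead of training $B$ and $C$ from scratch, a trained dense layer could be converted directly into two smaller layers by truncating its SVD. A minimal sketch under that assumption follows; `factorize_linear` is a hypothetical helper, and the randomly initialized layer stands in for a trained one.

```python
import torch
import torch.nn as nn


def factorize_linear(layer, r):
    """Turn a bias-free nn.Linear into two nn.Linear layers of inner width r
    using the truncated SVD of its weight."""
    W = layer.weight.detach()                                # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(layer.in_features, r, bias=False)      # applies C = Vh[:r]
    second = nn.Linear(r, layer.out_features, bias=False)    # applies B = U[:, :r] * S[:r]
    with torch.no_grad():
        first.weight.copy_(Vh[:r, :])
        second.weight.copy_(U[:, :r] * S[:r])
    return nn.Sequential(first, second)


def n_params(module):
    return sum(p.numel() for p in module.parameters())


torch.manual_seed(0)
dense = nn.Linear(128, 64, bias=False)
x = torch.randn(8, 128)

full = factorize_linear(dense, r=64)    # r = min(n, m): reproduces the layer
small = factorize_linear(dense, r=16)   # lossy, but with fewer parameters

print("max deviation at full rank:", (full(x) - dense(x)).abs().max().item())
print("parameters:", n_params(dense), "->", n_params(small))
```

The factorized layers could then be fine-tuned for a few epochs to recover accuracy lost by the truncation.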

## BONUS Task: LoRA

Propose ideas by which low-rank factorization could improve fine-tuning and training. What disadvantages does it have?

Read about LoRA (given in one of the references at the beginning of the notebook).

Now, take MNIST, and remove some digit from the dataset (keep the same labels, just remove the datapoints of a specific label).

Train a simple MLP on this modified dataset.
Fine-tune on the datapoints of the chosen digit by using LoRA.

Report the memory and time overheads.
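
A minimal sketch of the LoRA idea, as described in the LoRA paper: the pretrained weight is frozen and only a low-rank update $\frac{\alpha}{r} B A$ is trained, with $B$ initialized to zero so that fine-tuning starts from the pretrained function. The class name `LoRALinear`, the layer sizes, and the values of `r` and `alpha` are choices made here for illustration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pretrained weights
        # A starts small and random, B starts at zero, so B @ A = 0 initially
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # base(x) + scale * x (B A)^T, computed without forming the full matrix
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


torch.manual_seed(0)
base = nn.Linear(784, 250)            # stand-in for a pretrained layer
lora = LoRALinear(base, r=4)
x = torch.randn(16, 784)

trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(f"trainable parameters: {trainable} of {total}")
print("same output as the base layer before fine-tuning:", torch.equal(lora(x), base(x)))
```

Only `A` and `B` would be passed to the optimizer during fine-tuning, which is where the memory saving on gradients and optimizer state comes from.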