# SparseLinear Demonstration using MNIST 

##### Training models consisting of sparsely connected linear layers

## Table of Contents

- [Introduction](#intro)
- [Setup](#setup)
- [Time and memory efficiency](#efficiency)
- [Training with random inputs](#random)
- [Training on MNIST](#mnist)
- [Training sparse models with user-defined connections](#user)
- [Training sparse models with dynamic connections](#dynamic)
- [Training sparse models with small-world connections](#sw)
- [Utilizing the activation sparsity feature](#activation)
- [Training very wide and sparse models](#big)

## Introduction <a name="intro"></a>

SparseLinear is a PyTorch package that allows a user to create extremely wide and sparse linear layers efficiently. A sparsely connected network is a network where each node is connected to some fraction of available nodes.

The provided package is built on top of [PyTorch Sparse](https://github.com/rusty1s/pytorch_sparse), which provides optimized sparse matrix operations with autograd support in PyTorch.

In this tutorial, we demonstrate its basic usage and walk through training with each of the package's features. We recommend running the examples on a GPU, where training is much faster than on a CPU.

## Setup <a name="setup"></a>

We import PyTorch, which contains the (dense) linear module, and select the device.

In [1]:
import torch
import torch.nn as nn

import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


We create the linear layer and demonstrate some of its built-in attributes. 

In [2]:
fc1 = nn.Linear(10, 20)
fc1.extra_repr(), fc1.weight.shape, fc1.bias.shape

('in_features=10, out_features=20, bias=True',
 torch.Size([20, 10]),
 torch.Size([20]))

In a similar manner, we now import the `sparselinear` package. As can be observed, the custom layer's weights and biases can be accessed in the same way as before. The new layer also reports some extra attributes, which we discuss later.

In [3]:
import sparselinear as sl

In [4]:
sl1 = sl.SparseLinear(10,20)
sl1.extra_repr(), sl1.weight.shape, sl1.bias.shape

('in_features=10, out_features=20, bias=True, sparsity=0.9, connectivity=None, small_world=False',
 torch.Size([20, 10]),
 torch.Size([20]))

We now take a look at the two weight matrices.

In [5]:
fc1.weight

Parameter containing:
tensor([[ 2.2700e-01,  1.9950e-04,  1.5138e-01, -2.3028e-01,  3.0498e-01,
         -1.1592e-01,  8.9561e-03, -1.9644e-01, -1.8278e-01, -1.2167e-01],
        [ 1.8054e-01,  4.9672e-03, -2.7930e-01,  1.7971e-02, -2.5313e-01,
         -1.6389e-01,  2.8138e-02,  2.3216e-01,  8.5033e-02,  2.6193e-01],
        [-1.3997e-01, -2.0780e-01, -1.3777e-01,  9.5758e-02, -1.1465e-01,
         -3.0299e-01,  2.3639e-01,  2.3740e-01,  3.4879e-02, -2.8988e-01],
        [ 3.4024e-02, -3.4284e-02, -3.1449e-01, -7.3634e-02,  1.0884e-01,
          3.4649e-02,  2.2210e-01, -2.2692e-01,  1.7318e-01,  1.0567e-01],
        [ 1.8497e-01,  8.6446e-02, -1.3994e-02, -1.8335e-01,  7.1342e-02,
         -5.4367e-02, -1.2261e-01, -1.2711e-01,  1.2817e-01,  3.0136e-01],
        [ 2.7756e-01, -2.6505e-01,  2.1932e-02,  2.2353e-01, -2.0779e-01,
          2.9041e-01, -2.9108e-01,  2.5556e-02,  2.6355e-02,  9.2430e-02],
        [-6.9308e-02,  1.4349e-01,  2.1799e-01,  9.2573e-02, -1.1946e-01,
         -

In [6]:
sl1.weight

tensor(indices=tensor([[ 0,  0,  1,  2,  3,  3,  4,  5,  5,  6,  7, 11, 12, 12,
                        13, 14, 14, 15, 15, 19],
                       [ 3,  8,  1,  4,  0,  7,  1,  5,  8,  7,  0,  2,  3,  4,
                         7,  3,  7,  4,  7,  9]]),
       values=tensor([-0.0633, -0.0889, -0.1319, -0.0976,  0.0678, -0.0097,
                      -0.1309, -0.0588, -0.2626, -0.1929,  0.2060, -0.2528,
                      -0.0253,  0.1744,  0.2165,  0.1699, -0.0991, -0.2581,
                       0.3090,  0.0959]),
       size=(20, 10), nnz=20, layout=torch.sparse_coo)

As can be seen, the first weight matrix stores all 20 × 10 = 200 entries, while the second one stores only 20 non-zero entries, as reported by `nnz`. Each column of the `indices` tensor gives the (row, column) position of one non-zero entry, and the entry at the same position in the `values` tensor gives its value.
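To make the `indices`/`values` layout concrete, here is a minimal sketch that builds a small COO tensor by hand with `torch.sparse_coo_tensor` and converts it to its dense form. The 3 × 4 shape and the three entries are made up for illustration and are unrelated to `sl1`.

```python
import torch

# Each column of `indices` is the (row, col) coordinate of one stored entry.
indices = torch.tensor([[0, 0, 2],   # row indices
                        [1, 3, 2]])  # column indices
values = torch.tensor([0.5, -1.0, 2.0])

# Assemble a 3x4 sparse matrix in COO layout.
w = torch.sparse_coo_tensor(indices, values, size=(3, 4)).coalesce()

dense = w.to_dense()          # positions that are not stored become zero
nnz = w.values().numel()      # number of stored entries
print(dense)
print("stored entries:", nnz, "of", dense.numel())
```

Only the three stored values and their coordinates are kept in memory; the remaining nine zeros exist only in the dense view.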

## Time and memory efficiency <a name="efficiency"></a>

The `SparseLinear` class is ideal for very wide and sparse layers. Since we use sparse tensors and only store non-zero values (and their corresponding indices), `SparseLinear` consumes much less memory than applying a mask over a standard dense weight matrix, as is often done by researchers and practitioners. Since we only perform computations on non-zero values, we see speedups in computation time for large matrices as well. As hardware becomes better suited to sparse computations, these speedups will likely increase.

To show this, we create a (20000, 20000) `SparseLinear` layer with 99% sparsity and compare its runtime to that of a standard `Linear` layer. Later in this notebook, we create massive layers that would not be possible with standard `Linear` layers due to memory inefficiencies.
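A rough back-of-the-envelope estimate illustrates the memory argument for this layer size. The sketch below assumes 32-bit float values and two 64-bit integer indices per stored entry, and it ignores biases, gradients, optimizer state, and allocator overhead, so the figures are indicative only. The helper names `dense_bytes` and `coo_bytes` are ours, not part of the package.

```python
def dense_bytes(n_in, n_out, value_bytes=4):
    """Bytes needed to store a dense (n_out, n_in) weight matrix."""
    return n_in * n_out * value_bytes


def coo_bytes(n_in, n_out, sparsity, value_bytes=4, index_bytes=8):
    """Bytes for a COO matrix: one value plus two indices per stored entry."""
    nnz = round((1 - sparsity) * n_in * n_out)
    return nnz * (value_bytes + 2 * index_bytes)


dense_size = dense_bytes(20000, 20000)
sparse_size = coo_bytes(20000, 20000, sparsity=0.99)
print(f"dense:  {dense_size / 1e9:.2f} GB")
print(f"sparse: {sparse_size / 1e6:.0f} MB")
```

Under these assumptions a masked dense layer would need at least the dense amount, plus storage for the mask itself.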

We initialize two layers and define the input.

In [7]:
sl2 = sl.SparseLinear(20000, 20000, sparsity=.99).cuda()

# Reduce weight dimensions if memory errors are raised
fc2 = nn.Linear(20000, 20000).cuda()

x = torch.rand(20000, device=device)

We time the inference steps.

In [8]:
%timeit y = sl2(x)
%timeit y = fc2(x)

583 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.85 ms ± 93.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
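CUDA kernels are launched asynchronously, so GPU timings are often taken with explicit synchronization before reading the clock. The sketch below shows one way to do that. The helper `time_forward` and the small stand-in layer are hypothetical and were not used to produce the numbers above; whether synchronization changes those numbers depends on the setup.

```python
import time

import torch
import torch.nn as nn


def time_forward(layer, x, n_iters=100, n_warmup=10):
    """Average seconds per forward pass (hypothetical helper)."""
    use_cuda = x.is_cuda
    with torch.no_grad():
        for _ in range(n_warmup):        # warm-up runs are not timed
            layer(x)
        if use_cuda:
            torch.cuda.synchronize()     # wait for pending kernels
        start = time.perf_counter()
        for _ in range(n_iters):
            layer(x)
        if use_cuda:
            torch.cuda.synchronize()     # make sure all timed work finished
        elapsed = time.perf_counter() - start
    return elapsed / n_iters


probe_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
probe_layer = nn.Linear(256, 256).to(probe_device)   # small stand-in layer
probe_input = torch.rand(256, device=probe_device)
seconds = time_forward(probe_layer, probe_input)
print(f"{seconds * 1e6:.1f} microseconds per forward pass")
```

The same helper can be pointed at any module, including the two layers compared above, as long as the layer and its input live on the same device.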


We time the training step for SparseLinear.

In [9]:
%%timeit
y = sl2(x)
y.sum().backward()

789 µs ± 666 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)


We time the training step for Linear.

In [10]:
%%timeit
y = fc2(x)
y.sum().backward()

9.29 ms ± 86 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


We delete layers to save GPU memory when running this notebook.

In [11]:
del sl2, fc2

## Training with random inputs <a name="random"></a>

Next, we demonstrate how to train a two-layer network using the `SparseLinear` module provided in the package. The code has been built upon the PyTorch [tutorial](https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_nn.html) to highlight the parallels between the `nn.Linear` and `sl.SparseLinear` modules.

In [12]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 200, 1000, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our dense model as a sequence of layers. 
model_dense = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Use the sl package to define our sparse model as a sequence of layers.
# Note that the default sparsity is 90%.
model_sparse = torch.nn.Sequential(
    sl.SparseLinear(D_in, H),
    torch.nn.ReLU(),
    sl.SparseLinear(H, D_out),
)

# We will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

# We define our learning rates.
# Note that sparse and dense models may require different learning rates.
learning_rate_dense = 1e-4
learning_rate_sparse = 1e-3

for t in range(500):
    # Forward pass
    y_pred_dense = model_dense(x)
    y_pred_sparse = model_sparse(x)

    # Compute and print loss
    loss_dense = loss_fn(y_pred_dense, y)
    loss_sparse = loss_fn(y_pred_sparse, y)
    if t % 100 == 99:
        print("Dense model loss: %.3f; Sparse model loss: %.3f" %(loss_dense.item(), loss_sparse.item()))

    # Zero the gradients before running the backward pass.
    model_dense.zero_grad()
    model_sparse.zero_grad()

    # Backward pass
    loss_dense.backward()
    loss_sparse.backward()

    # Update the weights using gradient descent
    with torch.no_grad():
        for param in model_dense.parameters():
            param -= learning_rate_dense * param.grad
            
        for param in model_sparse.parameters():
            param -= learning_rate_sparse * param.grad

Dense model loss: 4.417; Sparse model loss: 5.111
Dense model loss: 0.075; Sparse model loss: 0.031
Dense model loss: 0.002; Sparse model loss: 0.000
Dense model loss: 0.000; Sparse model loss: 0.000
Dense model loss: 0.000; Sparse model loss: 0.000


As we can see, the loss value in both models decreases. Let's now build models using this module and train on the MNIST digit classification task.

## Training on MNIST <a name="mnist"></a>

We start by doing the initial imports, generating transforms, creating the dataset along with the dataloader, defining the loss function and some other helper functions.

In [13]:
import time
import torchvision
import torchvision.transforms as transforms
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import sampler

In [14]:
tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

batch_size = 64
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=tf)
testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=tf)
train_dataloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, drop_last=True)
test_dataloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, drop_last=True)

In [15]:
def train_model(model, optimizer, criterion, train_dataloader, test_dataloader, num_epochs=20):
    since = time.time()
    for epoch in range(num_epochs):
        cum_loss, total, correct = 0, 0, 0
        model.train()
        
        # Training epoch
        for i, (images, labels) in enumerate(train_dataloader, 0):
            images = images.to(device)
            labels = labels.to(device)

            # Forward pass & statistics
            out = model(images)
            predicted = out.argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
            loss = criterion(out, labels)
            cum_loss += loss.item()

            # Backwards pass & update
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            
        epoch_loss = images.shape[0] * cum_loss / total
        epoch_acc = 100 * (correct / total)
        print('Epoch %d' % (epoch + 1))
        print('Training Loss: {:.4f}; Training Acc: {:.4f}'.format(epoch_loss, epoch_acc))
        
        cum_loss, total, correct = 0, 0, 0
        model.eval()
        
        # Test epoch (gradients are not needed for evaluation)
        with torch.no_grad():
            for i, (images, labels) in enumerate(test_dataloader, 0):
                images = images.to(device)
                labels = labels.to(device)

                # Forward pass & statistics
                out = model(images)
                predicted = out.argmax(dim=1)
                correct += (predicted == labels).sum().item()
                total += labels.size(0)
                loss = criterion(out, labels)
                cum_loss += loss.item()
            
        epoch_loss = images.shape[0] * cum_loss / total
        epoch_acc = 100 * (correct / total)
        
        print('Test Loss: {:.4f}; Test Acc: {:.4f}'.format(epoch_loss, epoch_acc))
        print('------------')
    
    time_elapsed = time.time() - since
    print('\nTraining complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))

In [16]:
def flatten(x):
    N = x.shape[0]
    return x.view(N, -1)

class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)

In [17]:
criterion = nn.CrossEntropyLoss()

### Training a dense model

We start off with training a two-layer fully connected network. 

In [18]:
model = nn.Sequential(
    Flatten(),
    nn.Linear(784, 2000),
    nn.LayerNorm(2000),
    nn.ReLU(),
    nn.Linear(2000, 10),
)
model = model.to(device)

After we set everything up, we declare the optimizer and start training the dense model. We use SGD as the optimizer since we found its behavior to be slightly better than that of others. However, one is free to choose any optimizer as long as there exists an implementation for it to handle sparse tensors. 

In [19]:
learning_rate = 1e-2
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

#Perform the training 
train_model(model, optimizer, criterion, train_dataloader, test_dataloader)

Epoch 1
Training Loss: 0.2062; Training Acc: 93.5916
Test Loss: 0.1250; Test Acc: 95.9736
------------
Epoch 2
Training Loss: 0.0778; Training Acc: 97.6387
Test Loss: 0.0949; Test Acc: 96.9351
------------
Epoch 3
Training Loss: 0.0491; Training Acc: 98.4775
Test Loss: 0.0785; Test Acc: 97.5160
------------
Epoch 4
Training Loss: 0.0311; Training Acc: 99.0678
Test Loss: 0.0664; Test Acc: 97.9667
------------
Epoch 5
Training Loss: 0.0201; Training Acc: 99.4614
Test Loss: 0.0704; Test Acc: 97.7564
------------
Epoch 6
Training Loss: 0.0131; Training Acc: 99.7065
Test Loss: 0.0549; Test Acc: 98.3173
------------
Epoch 7
Training Loss: 0.0081; Training Acc: 99.8749
Test Loss: 0.0592; Test Acc: 98.2372
------------
Epoch 8
Training Loss: 0.0057; Training Acc: 99.9350
Test Loss: 0.0536; Test Acc: 98.3674
------------
Epoch 9
Training Loss: 0.0035; Training Acc: 99.9850
Test Loss: 0.0551; Test Acc: 98.3173
------------
Epoch 10
Training Loss: 0.0025; Training Acc: 99.9983
Test Loss: 0.0518; 

### Training a sparse model

(with the default configuration)

In the same way that we declared a dense model, we now declare a sparse model with the same number of input and output features but far fewer parameters. 

In [20]:
sparse_model = nn.Sequential(
    Flatten(),
    sl.SparseLinear(784, 2000),
    nn.LayerNorm(2000),
    nn.ReLU(),
    sl.SparseLinear(2000, 10),
)
sparse_model = sparse_model.to(device)

We now train this model. Note that the learning rate is an order of magnitude higher than the one used for the dense model (1e-1 versus 1e-2). We have found this to be a useful rule of thumb when training sparse models.

In [21]:
learning_rate = 1e-1
optimizer = optim.SGD(sparse_model.parameters(), lr=learning_rate, momentum=0.9)

train_model(sparse_model, optimizer, criterion, train_dataloader, test_dataloader)

Epoch 1
Training Loss: 0.2103; Training Acc: 93.6383
Test Loss: 0.1184; Test Acc: 96.3141
------------
Epoch 2
Training Loss: 0.0896; Training Acc: 97.2269
Test Loss: 0.1022; Test Acc: 96.7949
------------
Epoch 3
Training Loss: 0.0604; Training Acc: 98.0340
Test Loss: 0.0880; Test Acc: 97.4159
------------
Epoch 4
Training Loss: 0.0424; Training Acc: 98.6860
Test Loss: 0.0799; Test Acc: 97.6262
------------
Epoch 5
Training Loss: 0.0313; Training Acc: 99.0061
Test Loss: 0.0782; Test Acc: 97.7764
------------
Epoch 6
Training Loss: 0.0228; Training Acc: 99.2696
Test Loss: 0.0843; Test Acc: 97.6963
------------
Epoch 7
Training Loss: 0.0163; Training Acc: 99.5147
Test Loss: 0.0861; Test Acc: 97.8466
------------
Epoch 8
Training Loss: 0.0117; Training Acc: 99.6565
Test Loss: 0.0917; Test Acc: 97.7364
------------
Epoch 9
Training Loss: 0.0074; Training Acc: 99.8332
Test Loss: 0.0913; Test Acc: 97.8666
------------
Epoch 10
Training Loss: 0.0044; Training Acc: 99.9200
Test Loss: 0.0819; 

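As can be seen, the two models perform comparably.

However, while the linear layers of the dense model have a total of 1,590,010 parameters (weights and biases), those of the sparse model have only 160,810. The `LayerNorm` parameters are identical in both models and are not counted. This translates to a **~89.9%** parameter reduction in the sparse model!

We display the shapes of the weight parameters of the first and last layers of both models below.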

In [22]:
sparse_model[1].weights.shape, model[1].weight.shape, model[4].weight.shape, sparse_model[4].weights.shape

(torch.Size([156800]),
 torch.Size([2000, 784]),
 torch.Size([10, 2000]),
 torch.Size([2000]))
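The parameter totals quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that a `SparseLinear` layer stores `(1 - sparsity) * in_features * out_features` weights plus one bias per output, which matches the shapes printed above; the helper names are ours.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a dense linear layer."""
    return n_in * n_out + n_out


def sparse_params(n_in, n_out, sparsity=0.9):
    """Stored weights plus biases, assuming nnz = (1 - sparsity) * n_in * n_out."""
    return round((1 - sparsity) * n_in * n_out) + n_out


dense_total = dense_params(784, 2000) + dense_params(2000, 10)
sparse_total = sparse_params(784, 2000) + sparse_params(2000, 10)
reduction = 1 - sparse_total / dense_total
print(dense_total, sparse_total, f"{100 * reduction:.2f}%")
```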

## Training sparse models with user-defined connections <a name="user"></a>

Instead of using the random set of connections created during initialization, one can define the connections of a sparse linear layer explicitly. To do so, pass a long tensor of shape (2, `nnz`) through the `connectivity` argument; each column specifies one connection between an input and an output neuron. In the example below, the first row holds output-neuron indices and the second row holds input-neuron indices.

Below we create a connectivity matrix in which each of the 784 input neurons is connected to 200 randomly drawn output neurons. This is only a small demonstration; other connectivity patterns can be substituted here.

In [23]:
num_connections = 200
col = torch.arange(784).repeat_interleave(num_connections).view(1,-1).long()
row = torch.randint(low=0, high=2000, size=(784*num_connections,)).view(1,-1).long()
connections = torch.cat((row, col), dim=0)
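As an aside, `torch.randint` can draw the same output neuron more than once for a given input, so the matrix above may contain repeated (row, col) pairs. The sketch below is one possible duplicate-free variant that uses `torch.randperm` per input neuron. The name `connections_unique` is ours, the rest of this notebook keeps using `connections`, and how `SparseLinear` treats repeated pairs is not covered here.

```python
import torch

in_features, out_features, per_input = 784, 2000, 200

# For every input neuron, draw `per_input` distinct output neurons.
rows = torch.cat([torch.randperm(out_features)[:per_input]
                  for _ in range(in_features)])
cols = torch.arange(in_features).repeat_interleave(per_input)

# Row 0 holds output indices, row 1 holds input indices, as in the cell above.
connections_unique = torch.stack((rows, cols), dim=0).long()
print(connections_unique.shape)
```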

We provide our connectivity matrix as an input to the `SparseLinear` module and follow the same training procedure as before. 

In [24]:
sparse_model_user = nn.Sequential(
    Flatten(),
    sl.SparseLinear(784, 2000, connectivity=connections),
    nn.LayerNorm(2000),
    nn.ReLU(),
    sl.SparseLinear(2000, 10),
)
sparse_model_user = sparse_model_user.to(device)

In [25]:
learning_rate = 1e-1
optimizer = optim.SGD(sparse_model_user.parameters(), lr=learning_rate, momentum=0.9)

train_model(sparse_model_user, optimizer, criterion, train_dataloader, test_dataloader)

Epoch 1
Training Loss: 0.2136; Training Acc: 93.6283
Test Loss: 0.1089; Test Acc: 96.8249
------------
Epoch 2
Training Loss: 0.0902; Training Acc: 97.3302
Test Loss: 0.1025; Test Acc: 96.8049
------------
Epoch 3
Training Loss: 0.0619; Training Acc: 98.0606
Test Loss: 0.0877; Test Acc: 97.3658
------------
Epoch 4
Training Loss: 0.0437; Training Acc: 98.5859
Test Loss: 0.0814; Test Acc: 97.5160
------------
Epoch 5
Training Loss: 0.0340; Training Acc: 98.8661
Test Loss: 0.0854; Test Acc: 97.5461
------------
Epoch 6
Training Loss: 0.0239; Training Acc: 99.2496
Test Loss: 0.0815; Test Acc: 97.6062
------------
Epoch 7
Training Loss: 0.0165; Training Acc: 99.4997
Test Loss: 0.0901; Test Acc: 97.5060
------------
Epoch 8
Training Loss: 0.0122; Training Acc: 99.6298
Test Loss: 0.0873; Test Acc: 97.7063
------------
Epoch 9
Training Loss: 0.0080; Training Acc: 99.8016
Test Loss: 0.0917; Test Acc: 97.6963
------------
Epoch 10
Training Loss: 0.0041; Training Acc: 99.9450
Test Loss: 0.0837; 

## Training sparse models with dynamic connections <a name="dynamic"></a>

The default `SparseLinear` layer creates a random set of connections between the input and output neurons during initialization. An improvement over this strategy is to prune connections that turn out to be unnecessary and grow ones that are (hopefully) useful. We implement the [Rigging the Lottery](https://arxiv.org/pdf/1911.11134.pdf) algorithm to achieve this. Setting `dynamic` to `True` lets the layer alter its connections dynamically while training.

In [26]:
sparse_model_dynamic = nn.Sequential(
    Flatten(),
    sl.SparseLinear(784, 2000, dynamic=True),
    nn.LayerNorm(2000),
    nn.ReLU(),
    sl.SparseLinear(2000, 10, dynamic=True),
)
sparse_model_dynamic = sparse_model_dynamic.to(device)
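For intuition, one prune-and-grow update in the spirit of Rigging the Lottery can be sketched on a dense boolean mask. This is a simplified illustration and not the package's implementation: the function `prune_and_grow`, the swap `fraction`, and the use of a dense gradient tensor are all assumptions made for the example.

```python
import torch


def prune_and_grow(weight, grad, mask, fraction=0.3):
    """One simplified prune-and-grow update on a dense boolean mask.

    Drops the active weights with the smallest magnitude and activates the
    same number of inactive positions with the largest gradient magnitude.
    Newly grown weights start at zero.
    """
    n_active = int(mask.sum().item())
    n_inactive = mask.numel() - n_active
    n_swap = min(int(fraction * n_active), n_inactive)
    if n_swap == 0:
        return weight, mask

    # Prune: rank active weights by magnitude; inactive ones are excluded.
    active_mag = weight.abs().masked_fill(~mask, float("inf")).flatten()
    drop_idx = torch.topk(active_mag, n_swap, largest=False).indices

    # Grow: rank inactive positions by gradient magnitude.
    inactive_grad = grad.abs().masked_fill(mask, float("-inf")).flatten()
    grow_idx = torch.topk(inactive_grad, n_swap).indices

    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = False
    new_mask[grow_idx] = True
    new_mask = new_mask.view_as(mask)

    # Keep surviving weights; pruned and newly grown entries are zero.
    new_weight = weight * (mask & new_mask)
    return new_weight, new_mask


torch.manual_seed(0)
mask = torch.rand(20, 10) < 0.1      # roughly 90% sparse
mask[0, 0] = True                    # guarantee at least one active entry
weight = torch.randn(20, 10) * mask
grad = torch.randn(20, 10)           # stand-in for a dense gradient
new_weight, new_mask = prune_and_grow(weight, grad, mask)
print(int(mask.sum()), int(new_mask.sum()))
```

The number of active connections is unchanged by the update, so the overall sparsity of the layer stays fixed while the pattern of connections moves.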

In [27]:
learning_rate = 5e-3
optimizer = optim.SGD(sparse_model_dynamic.parameters(), lr=learning_rate, momentum=0.9)
train_model(sparse_model_dynamic, optimizer, criterion, train_dataloader, test_dataloader)

Epoch 1
Training Loss: 0.2137; Training Acc: 93.5616
Test Loss: 0.1246; Test Acc: 96.0737
------------
Epoch 2
Training Loss: 0.0915; Training Acc: 97.2269
Test Loss: 0.0815; Test Acc: 97.4960
------------
Epoch 3
Training Loss: 0.0613; Training Acc: 98.1057
Test Loss: 0.0907; Test Acc: 97.2155
------------
Epoch 4
Training Loss: 0.0452; Training Acc: 98.5676
Test Loss: 0.0839; Test Acc: 97.3658
------------
Epoch 5
Training Loss: 0.0336; Training Acc: 98.9695
Test Loss: 0.0789; Test Acc: 97.5761
------------
Epoch 6
Training Loss: 0.0219; Training Acc: 99.3647
Test Loss: 0.0681; Test Acc: 97.8365
------------
Epoch 7
Training Loss: 0.0128; Training Acc: 99.7182
Test Loss: 0.0670; Test Acc: 97.9267
------------
Epoch 8
Training Loss: 0.0116; Training Acc: 99.7665
Test Loss: 0.0672; Test Acc: 97.9267
------------
Epoch 9
Training Loss: 0.0110; Training Acc: 99.7816
Test Loss: 0.0667; Test Acc: 97.9367
------------
Epoch 10
Training Loss: 0.0106; Training Acc: 99.7932
Test Loss: 0.0669; 

## Training sparse models with small-world connections <a name="sw"></a>

Some sparsity patterns tend to perform better than others. Small-world sparsity yields a network that is mostly locally connected, with a few global, long-range connections scattered in (see [small-world networks](https://en.wikipedia.org/wiki/Small-world_network)). We implement an initialization strategy that builds small-world sparsity into the model. To enable it, set `small_world` to `True`.

In [28]:
sparse_model_sw = nn.Sequential(
    Flatten(),
    sl.SparseLinear(784, 2000, small_world=True),
    nn.LayerNorm(2000),
    nn.ReLU(),
    sl.SparseLinear(2000, 10, small_world=True),
)
sparse_model_sw = sparse_model_sw.to(device)

In [29]:
learning_rate = 1e-1
optimizer = optim.SGD(sparse_model_sw.parameters(), lr=learning_rate, momentum=0.9)
train_model(sparse_model_sw, optimizer, criterion, train_dataloader, test_dataloader)

Epoch 1
Training Loss: 0.2043; Training Acc: 93.7817
Test Loss: 0.1040; Test Acc: 96.7748
------------
Epoch 2
Training Loss: 0.0856; Training Acc: 97.3202
Test Loss: 0.0853; Test Acc: 97.2756
------------
Epoch 3
Training Loss: 0.0573; Training Acc: 98.2057
Test Loss: 0.0816; Test Acc: 97.5661
------------
Epoch 4
Training Loss: 0.0404; Training Acc: 98.7176
Test Loss: 0.0786; Test Acc: 97.5761
------------
Epoch 5
Training Loss: 0.0297; Training Acc: 99.0211
Test Loss: 0.0741; Test Acc: 97.7464
------------
Epoch 6
Training Loss: 0.0211; Training Acc: 99.3246
Test Loss: 0.0740; Test Acc: 97.9267
------------
Epoch 7
Training Loss: 0.0162; Training Acc: 99.4597
Test Loss: 0.0808; Test Acc: 97.8666
------------
Epoch 8
Training Loss: 0.0108; Training Acc: 99.7132
Test Loss: 0.0900; Test Acc: 97.6462
------------
Epoch 9
Training Loss: 0.0090; Training Acc: 99.7148
Test Loss: 0.0858; Test Acc: 97.8666
------------
Epoch 10
Training Loss: 0.0055; Training Acc: 99.8666
Test Loss: 0.0830; 

## Utilizing the activation sparsity feature <a name="activation"></a>

The `SparseLinear` layer is constructed for parameter sparsity and places no constraint on the sparsity (or density) of the activations. We therefore also include an option for sparse activations based on the k-winners strategy described in [this paper](https://arxiv.org/pdf/1903.11257.pdf), which we use to train both dense and sparse linear models.

In [30]:
import activationsparsity as asy

Below we train a dense linear model using this activation sparsity feature. The layer below uses the default `act_sparsity=0.65`, which for 2000 units means `k = (1 - 0.65) * 2000 = 700` active units.

In [31]:
model_asy = nn.Sequential(
    Flatten(),
    nn.Linear(784, 2000),
    nn.LayerNorm(2000),
    asy.ActivationSparsity(),
    nn.Linear(2000, 10),
)
model_asy = model_asy.to(device)
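The basic idea behind a k-winners activation can be pictured with a few lines of PyTorch: keep the `k` largest activations of each sample and set the rest to zero. This is a minimal sketch under our own assumptions, not the code of `asy.ActivationSparsity`, and it leaves out any refinements the paper or the package may apply (for example boosting).

```python
import torch


def k_winners(x, act_sparsity=0.65):
    """Keep the k largest activations in each row of x and zero the rest."""
    n_units = x.shape[1]
    k = int(round((1 - act_sparsity) * n_units))
    top = torch.topk(x, k, dim=1)
    out = torch.zeros_like(x)
    return out.scatter(1, top.indices, top.values)


x_demo = torch.randn(4, 2000)        # batch of 4 samples, 2000 hidden units
y_demo = k_winners(x_demo)
print((y_demo != 0).sum(dim=1))      # number of surviving units per sample
```

With 2000 units and `act_sparsity=0.65`, each sample keeps 700 activations, in line with the value of `k` given above.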

In [32]:
learning_rate = 5e-3
optimizer = optim.SGD(model_asy.parameters(), lr=learning_rate, momentum=0.9)

#Perform the training 
train_model(model_asy, optimizer, criterion, train_dataloader, test_dataloader)

Epoch 1
Training Loss: 0.2241; Training Acc: 93.3548
Test Loss: 0.1153; Test Acc: 96.5845
------------
Epoch 2
Training Loss: 0.0901; Training Acc: 97.3102
Test Loss: 0.0894; Test Acc: 97.1855
------------
Epoch 3
Training Loss: 0.0590; Training Acc: 98.3825
Test Loss: 0.0785; Test Acc: 97.6462
------------
Epoch 4
Training Loss: 0.0415; Training Acc: 98.8877
Test Loss: 0.0671; Test Acc: 97.8866
------------
Epoch 5
Training Loss: 0.0303; Training Acc: 99.2529
Test Loss: 0.0616; Test Acc: 98.0569
------------
Epoch 6
Training Loss: 0.0222; Training Acc: 99.5114
Test Loss: 0.0602; Test Acc: 98.1270
------------
Epoch 7
Training Loss: 0.0168; Training Acc: 99.6748
Test Loss: 0.0611; Test Acc: 98.0569
------------
Epoch 8
Training Loss: 0.0128; Training Acc: 99.8366
Test Loss: 0.0604; Test Acc: 98.0970
------------
Epoch 9
Training Loss: 0.0102; Training Acc: 99.8849
Test Loss: 0.0569; Test Acc: 98.2472
------------
Epoch 10
Training Loss: 0.0079; Training Acc: 99.9400
Test Loss: 0.0581; 

Now we train another model that uses the sparse linear module together with this activation. As mentioned before, the learning rate is an order of magnitude higher than the one used for the dense model.

In [33]:
model_asy_sparse = nn.Sequential(
    Flatten(),
    sl.SparseLinear(784, 2000),
    nn.LayerNorm(2000),
    asy.ActivationSparsity(),
    sl.SparseLinear(2000, 10),
)
model_asy_sparse = model_asy_sparse.to(device)

In [34]:
learning_rate = 5e-2
optimizer = optim.SGD(model_asy_sparse.parameters(), lr=learning_rate, momentum=0.9)

#Perform the training 
train_model(model_asy_sparse, optimizer, criterion, train_dataloader, test_dataloader)

Epoch 1
Training Loss: 0.2218; Training Acc: 93.2080
Test Loss: 0.1237; Test Acc: 96.1138
------------
Epoch 2
Training Loss: 0.0960; Training Acc: 97.0318
Test Loss: 0.0965; Test Acc: 97.0954
------------
Epoch 3
Training Loss: 0.0670; Training Acc: 97.9172
Test Loss: 0.0930; Test Acc: 97.1855
------------
Epoch 4
Training Loss: 0.0501; Training Acc: 98.4075
Test Loss: 0.0898; Test Acc: 97.2556
------------
Epoch 5
Training Loss: 0.0391; Training Acc: 98.7777
Test Loss: 0.0747; Test Acc: 97.7163
------------
Epoch 6
Training Loss: 0.0303; Training Acc: 99.0261
Test Loss: 0.0824; Test Acc: 97.5661
------------
Epoch 7
Training Loss: 0.0234; Training Acc: 99.2596
Test Loss: 0.0777; Test Acc: 97.7764
------------
Epoch 8
Training Loss: 0.0173; Training Acc: 99.4981
Test Loss: 0.0848; Test Acc: 97.4960
------------
Epoch 9
Training Loss: 0.0138; Training Acc: 99.6398
Test Loss: 0.0780; Test Acc: 97.8766
------------
Epoch 10
Training Loss: 0.0096; Training Acc: 99.7699
Test Loss: 0.0840; 

## Training very wide and sparse models <a name="big"></a>

The main advantage of using sparse tensors is that they enable us to train very wide models. Below we show an example of such a model. It is only a demonstration; the key takeaway is that models of this size can be built for more complex tasks, where the benefits are likely to be greater.

In [39]:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.sc1 = sl.SparseLinear(10 * 28 * 28, 50000, sparsity=0.999)
        self.sc2 = sl.SparseLinear(50000, 50000, sparsity=0.999)
        self.sc3 = sl.SparseLinear(50000, 50000, sparsity=0.999)
        self.sc4 = sl.SparseLinear(50000, 50000, sparsity=0.999)
        self.sc5 = sl.SparseLinear(50000, 50000, sparsity=0.999)
        
        self.input_scaling = nn.Parameter(torch.ones(10 * 28 * 28))
        self.input_shifting = nn.Parameter(torch.zeros(10 * 28 * 28))
        self.ln1 = nn.LayerNorm(50000)
        self.ln2 = nn.LayerNorm(50000)
        self.ln3 = nn.LayerNorm(50000)
        self.ln4 = nn.LayerNorm(50000)
        
    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = torch.repeat_interleave(x, 10, dim=1)
        x = self.input_scaling * x + self.input_shifting
        x = F.relu(self.ln1(self.sc1(x)))
        x = F.relu(self.ln2(self.sc2(x)))
        x = F.relu(self.ln3(self.sc3(x)))
        x = F.relu(self.ln4(self.sc4(x)))
        x = self.sc5(x)
        x = x.view(x.shape[0], -1, 10).sum(dim=1)  # sum 5000 outputs per class
        return x

sparse_big = Net().to(device)
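To get a feel for the size of this model, the weight counts of its five `SparseLinear` layers can be estimated by hand. As before, the sketch assumes that each layer stores `(1 - sparsity) * in_features * out_features` weights; biases and `LayerNorm` parameters are left out.

```python
# (in_features, out_features) of the five SparseLinear layers in Net.
layer_shapes = [(10 * 28 * 28, 50000)] + [(50000, 50000)] * 4
sparsity = 0.999

dense_weights = sum(n_in * n_out for n_in, n_out in layer_shapes)
sparse_weights = sum(round((1 - sparsity) * n_in * n_out)
                     for n_in, n_out in layer_shapes)

print(f"dense weights:  {dense_weights:,}")
print(f"sparse weights: {sparse_weights:,}")
print(f"dense float32 storage: {dense_weights * 4 / 1e9:.1f} GB")
```

Under these assumptions, a dense version of the same architecture would need tens of gigabytes for its weights alone, while the sparse version stores about a thousandth as many values.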

In [40]:
learning_rate = 5e-5
optimizer = optim.SGD(sparse_big.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)
train_model(sparse_big, optimizer, criterion, train_dataloader, test_dataloader)

Epoch 1
Training Loss: 0.1993; Training Acc: 93.9168
Test Loss: 0.1317; Test Acc: 95.9034
------------
Epoch 2
Training Loss: 0.0675; Training Acc: 98.0023
Test Loss: 0.0806; Test Acc: 97.5160
------------
Epoch 3
Training Loss: 0.0318; Training Acc: 99.2663
Test Loss: 0.0723; Test Acc: 97.6863
------------
Epoch 4
Training Loss: 0.0167; Training Acc: 99.7532
Test Loss: 0.0630; Test Acc: 97.9968
------------
Epoch 5
Training Loss: 0.0088; Training Acc: 99.9483
Test Loss: 0.0609; Test Acc: 98.0970
------------
Epoch 6
Training Loss: 0.0056; Training Acc: 99.9917
Test Loss: 0.0584; Test Acc: 98.2272
------------
Epoch 7
Training Loss: 0.0041; Training Acc: 100.0000
Test Loss: 0.0587; Test Acc: 98.1771
------------
Epoch 8
Training Loss: 0.0033; Training Acc: 100.0000
Test Loss: 0.0583; Test Acc: 98.1671
------------
Epoch 9
Training Loss: 0.0028; Training Acc: 100.0000
Test Loss: 0.0587; Test Acc: 98.1370
------------
Epoch 10
Training Loss: 0.0024; Training Acc: 100.0000
Test Loss: 0.05

In conclusion, we demonstrated the `SparseLinear` layer, which from a user's perspective is very similar to PyTorch's `Linear` layer. We also showed extra features, namely user-defined connections, dynamic sparsity, small-world connectivity, and activation sparsity.

Our experiments showed that, even with a roughly 90% reduction in parameters, the sparse models achieved performance on MNIST similar to that of their densely parameterised counterparts.

We hope this excites and enables people to build highly scalable sparse networks!

![Alt Text](https://media.giphy.com/media/L0O3TQpp0WnSXmxV8p/giphy.gif)