![Tome](../imgs/tome.png)

<center><small style="font-size: 12px;">Image taken from Meta's blog: "Token Merging: Your ViT but faster".</small></center>

<h1>Token Merging (ToMe) - Meta AI's New Optimization Technique to Make ViT Faster by 2x. But Can ViT be Even Faster?</h1>

<h2>Playing with ToMe and benchmarking it against other inference optimization strategies</h2>

The goal of this notebook is to explore Meta Research's new Token Merging (ToMe) optimization strategy, run some practical experiments with it, and benchmark ToMe against other state-of-the-art inference optimization techniques using the open-source library Speedster.
We will try to answer a few questions:
* What's the accuracy-latency trade-off with ToMe?
* Can we reproduce Meta's results?
* Meta tested ToMe's performance with an optimal batch size on a V100 GPU. How does ToMe perform with different batch sizes and on CPUs?
* How does ToMe perform compared to other optimization techniques, such as quantization or pruning and compilation?
* Can ToMe be combined with other optimization strategies to achieve a multiplier effect on throughput?
* Meta's images are super cool 🙌. How can you ToMe-nize one of your pictures?

You can also find a [blog post](https://www.nebuly.com/blog/token-merging-tome-meta-ais-new-optimization-technique-to-make-vit-faster) where we discuss all these points with less code 🌈

Let's first explore ToMe.

<h2>🧩 ToMe: Token Merging</h2>

<b>Token Merging (ToMe)</b> is a technique recently introduced by Meta AI to reduce the latency of existing Vision Transformer (ViT) models without the need for additional training. ToMe gradually combines similar tokens inside a transformer using an algorithm as lightweight as pruning while maintaining better accuracy.

ToMe introduces a module for token merging into an existing ViT, merging redundant tokens to improve both inference and training throughput.

![tomeintro](../imgs/tomeintro.png)

<center><small style="font-size: 12px;">ToMe accuracy vs inference speed performance. Image from Meta's blog: "Token Merging: Your ViT but faster".</small></center>

<h3>🗺️ Optimization Strategy</h3>
ViT converts image patches into "tokens". Then, it applies an attention mechanism to each layer that allows these tokens to collect information from one another proportionally to their similarity. To improve the speed of ViT while maintaining its accuracy, ToMe builds on two observations:

1. Computation speed and memory use are heavily tied to the number of tokens in the transformer
2. These tokens are often redundant

In each transformer block, r tokens are merged, so the token count drops by r per layer. Over the L blocks in the network, a total of rL tokens are merged. By varying the parameter r, we get a speed-accuracy trade-off: fewer tokens mean higher throughput but lower accuracy.
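To make the rL arithmetic concrete, here is a minimal sketch of how the token count shrinks across the blocks. The function name `token_schedule` is hypothetical, and the floor of one token is a simplifying assumption; the actual ToMe implementation may cap the number of merges per block differently.

```python
def token_schedule(num_tokens: int, r: int, num_layers: int) -> list[int]:
    """Token count at the input and after each block, merging r tokens per block."""
    counts = [num_tokens]
    for _ in range(num_layers):
        # Simplifying assumption: never go below a single token.
        counts.append(max(counts[-1] - r, 1))
    return counts


# ViT-B/16 on 224x224 images: 14 * 14 patches + 1 class token = 197 tokens, 12 blocks.
schedule = token_schedule(num_tokens=197, r=16, num_layers=12)
print(schedule)
print("tokens merged in total:", schedule[0] - schedule[-1])
```

With r=16 and L=12, the sketch removes 16 * 12 = 192 tokens, which is why r=16 is an aggressive setting for this model.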

The image below shows how the token merging step is applied between the attention and MLP branches of each transformer block. Step by step, the dog's fur is merged into a single token.

![Tome-arch](../imgs/tome-arch.png)

<center><small style="font-size: 12px;">Image taken from Meta's paper: "Token Merging: Your ViT but faster".</small></center>

<h3>🧑‍🤝‍🧑 The Concept of Similarity</h3>

ToMe reduces the number of tokens by combining similar ones. The similarity between tokens is defined using the self-attention QKV projections. Specifically, the keys (K) summarize the information contained in each token. A dot-product similarity metric (e.g. cosine similarity) between the keys of each pair of tokens is then used to measure how similar the tokens are, i.e. whether they contain similar information.
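As an illustration of the metric only, the sketch below computes the cosine similarity between toy key vectors and finds the most similar pair. It is kept dependency-free on purpose; the helper names and the key values are made up, and the exhaustive pair search is not the matching algorithm used by ToMe.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_pair(keys):
    """Return (score, i, j) for the pair of keys with the highest cosine similarity."""
    best = None
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            score = cosine_similarity(keys[i], keys[j])
            if best is None or score > best[0]:
                best = (score, i, j)
    return best


# Toy keys (made-up values): tokens 0 and 1 point in almost the same direction.
keys = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
score, i, j = most_similar_pair(keys)
print(f"most similar tokens: {i} and {j} (cosine similarity {score:.3f})")
```

Tokens whose keys are nearly parallel, like tokens 0 and 1 here, are the natural candidates for merging.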

<h3>📈 Paper Results</h3>

So, what's the accuracy-latency trade-off with ToMe?

Let's have a look at the results reported in the paper, which were obtained on a V100 GPU. I plotted the accuracy of ViT as a function of the hyperparameter r, where the original ViT corresponds to r=0. We can see that smaller values of r correspond to a model that is slower but with accuracy more faithful to the original model. Large values of r result in a considerable acceleration of the model but a loss in accuracy. For instance, to achieve 2x acceleration from the original model, the model loses 4 points of accuracy.

![tomeres](../imgs/infer-results.png)

<center><small style="font-size: 12px;">Results taken from Meta's paper: "Token Merging: Your ViT but faster".</small></center>

All results shown are for the ViT-B/16 model, which is also the model used for the experiments in this notebook.

<h2>👷‍♀️ Hands-on</h2>

Let's see if we can reproduce Meta's results.

<h3>⚙️ Libraries Installation</h3>

<h4>ToMe Installation</h4>

To install ToMe, follow the [instructions](https://github.com/facebookresearch/ToMe/blob/main/INSTALL.md) released with the implementation by Meta Research.

<h4>Speedster Installation</h4>

Install Speedster and its base requirements:

In [None]:
!pip install speedster

Then make sure to install all the available deep learning compilers:

In [None]:
!python -m nebullvm.installers.auto_installer --compilers all

<h4>Timm Installation</h4>

Timm exists only in pre-release version, so to install it run:

In [None]:
!pip install timm

In [4]:
from speedster import optimize_model, save_model, load_model

from nebullvm.tools.benchmark import benchmark

import timm, tome
import torch

<h3>🔥 ToMe Optimization</h3>

<h4>👩‍🔬 Experiments</h4>

I ran the experiments on the ViT-B/16 model on a V100 GPU with batch size 64, as in the paper, using the recommended value of the hyperparameter, r=16. Then, I tested ToMe on smaller batch sizes, down to batch size 1, and replicated the same experiments on an E5-2686 CPU. An AWS p3.2xlarge instance was used for all experiments.
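To clarify what the reported numbers mean, here is a minimal sketch of a warm-up-then-measure loop. It is not the implementation of the `benchmark` function used below; `simple_benchmark` and `toy_model` are hypothetical names, and the sketch assumes that throughput is counted in samples per second and latency is its reciprocal, which matches the outputs shown in this notebook (for example 114.39 data/second and 0.0087 seconds/data).

```python
import time


def simple_benchmark(fn, batches, batch_size, n_warmup=5, n_runs=20):
    """Time `fn` over `n_runs` batches after `n_warmup` untimed warm-up calls."""
    # Warm-up iterations are excluded from the measurement.
    for i in range(n_warmup):
        fn(batches[i % len(batches)])

    start = time.perf_counter()
    for i in range(n_runs):
        fn(batches[i % len(batches)])
    # Note: on a GPU, work is asynchronous, so a real harness would also
    # synchronize the device (e.g. torch.cuda.synchronize()) before reading the clock.
    elapsed = time.perf_counter() - start

    samples = batch_size * n_runs
    throughput = samples / elapsed  # data/second
    latency = elapsed / samples     # seconds/data
    return throughput, latency


def toy_model(batch):
    """Stand-in for a model forward pass (hypothetical)."""
    return [sum(x * x for x in sample) for sample in batch]


batch_size = 4
batches = [[[float(k) for k in range(1000)] for _ in range(batch_size)] for _ in range(3)]
throughput, latency = simple_benchmark(toy_model, batches, batch_size)
print(f"Average Throughput: {throughput:.2f} data/second")
print(f"Average Latency: {latency:.6f} seconds/data")
```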

<h3>Batch Size = 1</h3>

<h4>Original model benchmark - batch size = 1</h4>

In [2]:
input_data = [((torch.randn(1, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [2]:
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval();

<h5>GPU</h5>

In [7]:
benchmark(model, input_data)

2023-02-16 08:46:47 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:05<00:00,  9.66it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:08<00:00, 113.47it/s]

Batch size: 1
Average Throughput: 114.39 data/second
Average Latency: 0.0087 seconds/data





<h5>CPU</h5>

In [8]:
benchmark(model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 08:47:10 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:03<00:00,  5.37it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [00:17<00:00,  5.56it/s]

Batch size: 1
Average Throughput: 5.60 data/second
Average Latency: 0.1784 seconds/data





<h4> ToMe Optimization </h4>

In [9]:
tome.patch.timm(model)
# Set the number of tokens reduced per layer
model.r = 16

<h4>ToMe Optimized Model Benchmark - batch size = 1</h4>

<h5>GPU</h5>

In [10]:
benchmark(model, input_data)

2023-02-16 08:47:45 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 50.34it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:18<00:00, 53.13it/s]

Batch size: 1
Average Throughput: 53.50 data/second
Average Latency: 0.0187 seconds/data





<h5>CPU</h5>

In [11]:
benchmark(model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 08:48:13 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:02<00:00,  8.31it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [00:11<00:00,  8.58it/s]

Batch size: 1
Average Throughput: 8.69 data/second
Average Latency: 0.1151 seconds/data





With a batch size of 1, the original model reaches a throughput of <b>114.39 data/second</b> with a latency of <b>0.0087 seconds/data</b> on GPU, and <b>5.60 data/second</b> with a latency of <b>0.1784 seconds/data</b> on CPU. After optimizing the model with ToMe, we get a throughput of <b>53.50 data/second</b> and a latency of <b>0.0187 seconds/data</b> on GPU, and <b>8.69 data/second</b> and <b>0.1151 seconds/data</b> on CPU.

This means that at batch size 1 (a very common case in inference), ToMe slows the model down on GPU, roughly doubling its latency, while on CPU it already gives a speed-up of about 1.5x.

<h3>Batch Size = 2</h3>

<h4>Original model benchmark - batch size = 2</h4>

In [12]:
input_data = [((torch.randn(2, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [13]:
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval();

<h5>GPU</h5>

In [14]:
benchmark(model, input_data)

2023-02-16 08:48:44 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 109.44it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:09<00:00, 107.07it/s]

Batch size: 2
Average Throughput: 216.19 data/second
Average Latency: 0.0046 seconds/data





<h5>CPU</h5>

In [15]:
benchmark(model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 08:48:54 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:06<00:00,  3.01it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [00:33<00:00,  3.00it/s]

Batch size: 2
Average Throughput: 6.03 data/second
Average Latency: 0.1660 seconds/data





<h4>ToMe Optimization</h4>

In [16]:
tome.patch.timm(model)
# Set the number of tokens reduced per layer
model.r = 16

<h4>ToMe Optimized Model Benchmark - batch size = 2</h4>

<h5>GPU</h5>

In [17]:
benchmark(model, input_data)

2023-02-16 08:49:35 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 50.92it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:19<00:00, 50.94it/s]

Batch size: 2
Average Throughput: 102.59 data/second
Average Latency: 0.0097 seconds/data





<h5>CPU</h5>

In [18]:
benchmark(model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 08:49:56 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:04<00:00,  4.96it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [00:19<00:00,  5.05it/s]

Batch size: 2
Average Throughput: 10.16 data/second
Average Latency: 0.0984 seconds/data





With a batch size of 2, the original model reaches a throughput of <b>216.19 data/second</b> with a latency of <b>0.0046 seconds/data</b> on GPU, and <b>6.03 data/second</b> with a latency of <b>0.1660 seconds/data</b> on CPU. After optimizing the model with ToMe, we get a throughput of <b>102.59 data/second</b> and a latency of <b>0.0097 seconds/data</b> on GPU, and <b>10.16 data/second</b> and <b>0.0984 seconds/data</b> on CPU.

Here ToMe again speeds up inference on CPU, where the impact of batch size is limited, while it still slows down the model on GPU.

<h3>Batch Size = 4</h3>

<h4>Original model benchmark - batch size = 4</h4>

In [19]:
input_data = [((torch.randn(4, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [20]:
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval();

<h5>GPU</h5>

In [21]:
benchmark(model, input_data)

2023-02-16 08:50:23 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 82.22it/s] 
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:13<00:00, 72.73it/s]

Batch size: 4
Average Throughput: 292.85 data/second
Average Latency: 0.0034 seconds/data





<h5>CPU</h5>

In [22]:
benchmark(model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 08:50:38 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:11<00:00,  1.71it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [00:58<00:00,  1.71it/s]

Batch size: 4
Average Throughput: 6.86 data/second
Average Latency: 0.1457 seconds/data





<h4>ToMe Optimization</h4>

In [23]:
tome.patch.timm(model)
# Set the number of tokens reduced per layer
model.r = 16

<h4>ToMe Optimized Model Benchmark - batch size = 4</h4>

<h5>GPU</h5>

In [24]:
benchmark(model, input_data)

2023-02-16 08:51:49 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 50.34it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:19<00:00, 50.36it/s]

Batch size: 4
Average Throughput: 202.81 data/second
Average Latency: 0.0049 seconds/data





<h5>CPU</h5>

In [25]:
benchmark(model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 08:52:10 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:08<00:00,  2.46it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [00:34<00:00,  2.94it/s]

Batch size: 4
Average Throughput: 11.81 data/second
Average Latency: 0.0847 seconds/data





With a batch size of 4, the original model reaches a throughput of <b>292.85 data/second</b> with a latency of <b>0.0034 seconds/data</b> on GPU, and <b>6.86 data/second</b> with a latency of <b>0.1457 seconds/data</b> on CPU. After optimizing the model with ToMe, we get a throughput of <b>202.81 data/second</b> and a latency of <b>0.0049 seconds/data</b> on GPU, and <b>11.81 data/second</b> and <b>0.0847 seconds/data</b> on CPU.

In this case too, ToMe slows down the model on GPU while giving it a boost on CPU.

<h3>Batch Size = 8</h3>

<h4>Original model benchmark - batch size = 8</h4>

In [26]:
input_data = [((torch.randn(8, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [27]:
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval();

<h5>GPU</h5>

In [28]:
benchmark(model, input_data)

2023-02-16 08:52:57 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:01<00:00, 35.13it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:26<00:00, 38.15it/s]

Batch size: 8
Average Throughput: 307.51 data/second
Average Latency: 0.0033 seconds/data





<h5>CPU</h5>

In [29]:
benchmark(model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 08:53:25 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:22<00:00,  1.15s/it]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [01:55<00:00,  1.15s/it]

Batch size: 8
Average Throughput: 6.96 data/second
Average Latency: 0.1438 seconds/data





<h4>ToMe Optimization</h4>

In [30]:
tome.patch.timm(model)
# Set the number of tokens reduced per layer
model.r = 16

<h4>ToMe Optimized Model Benchmark - batch size = 8</h4>

<h5>GPU</h5>

In [31]:
benchmark(model, input_data)

2023-02-16 08:55:44 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 50.07it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:19<00:00, 50.30it/s]

Batch size: 8
Average Throughput: 405.17 data/second
Average Latency: 0.0025 seconds/data





<h5>CPU</h5>

In [32]:
benchmark(model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 08:56:05 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:12<00:00,  1.64it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [01:02<00:00,  1.60it/s]

Batch size: 8
Average Throughput: 12.82 data/second
Average Latency: 0.0780 seconds/data





With a batch size of 8, the original model reaches a throughput of <b>307.51 data/second</b> with a latency of <b>0.0033 seconds/data</b> on GPU, and <b>6.96 data/second</b> with a latency of <b>0.1438 seconds/data</b> on CPU. After optimizing the model with ToMe, we get a throughput of <b>405.17 data/second</b> and a latency of <b>0.0025 seconds/data</b> on GPU, and <b>12.82 data/second</b> and <b>0.0780 seconds/data</b> on CPU.

This is the first batch size at which ToMe speeds up the model on both CPU and GPU.

<h3>Batch Size = 32</h3>

<h4>Original model benchmark - batch size = 32</h4>

In [33]:
input_data = [((torch.randn(32, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [34]:
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval();

<h5>GPU</h5>

In [35]:
benchmark(model, input_data)

2023-02-16 08:57:28 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:04<00:00, 11.08it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [01:40<00:00,  9.90it/s]

Batch size: 32
Average Throughput: 319.47 data/second
Average Latency: 0.0031 seconds/data





<h5>CPU</h5>

In [37]:
benchmark(model, input_data, device="cpu", n_warmup=10, n_runs=15)

2023-02-16 09:00:07 | INFO     | Running benchmark on CPU


Performing warm up on 10 iterations: 100%|██████████| 10/10 [00:50<00:00,  5.06s/it]
Performing benchmark on 15 iterations: 100%|██████████| 15/15 [01:13<00:00,  4.92s/it]

Batch size: 32
Average Throughput: 6.50 data/second
Average Latency: 0.1539 seconds/data





<h4>ToMe Optimization</h4>

In [38]:
tome.patch.timm(model)
# Set the number of tokens reduced per layer
model.r = 16

<h4>ToMe Optimized Model Benchmark - batch size = 32</h4>

<h5>GPU</h5>

In [39]:
benchmark(model, input_data)

2023-02-16 09:02:12 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:02<00:00, 19.19it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:54<00:00, 18.30it/s]

Batch size: 32
Average Throughput: 589.79 data/second
Average Latency: 0.0017 seconds/data





<h5>CPU</h5>

In [40]:
benchmark(model, input_data, device="cpu", n_warmup=10, n_runs=15)

2023-02-16 09:03:10 | INFO     | Running benchmark on CPU


Performing warm up on 10 iterations: 100%|██████████| 10/10 [00:24<00:00,  2.47s/it]
Performing benchmark on 15 iterations: 100%|██████████| 15/15 [00:39<00:00,  2.65s/it]

Batch size: 32
Average Throughput: 12.07 data/second
Average Latency: 0.0828 seconds/data





With a batch size of 32, the original model reaches a throughput of <b>319.47 data/second</b> with a latency of <b>0.0031 seconds/data</b> on GPU, and <b>6.50 data/second</b> with a latency of <b>0.1539 seconds/data</b> on CPU. After optimizing the model with ToMe, we get a throughput of <b>589.79 data/second</b> and a latency of <b>0.0017 seconds/data</b> on GPU, and <b>12.07 data/second</b> and <b>0.0828 seconds/data</b> on CPU.

As the batch size increases, the speed-up that ToMe achieves on GPU grows, here to roughly 1.8x.

<h3>Batch Size = 64</h3>

<h4>Original model benchmark - batch size = 64</h4>

In [41]:
input_data = [((torch.randn(64, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [42]:
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval();

<h5>GPU</h5>

In [43]:
benchmark(model, input_data, n_runs=500)

2023-02-16 09:04:32 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:08<00:00,  5.80it/s]
Performing benchmark on 500 iterations: 100%|██████████| 500/500 [01:36<00:00,  5.19it/s]

Batch size: 64
Average Throughput: 333.49 data/second
Average Latency: 0.0030 seconds/data





<h5>CPU</h5>

In [44]:
benchmark(model, input_data, device="cpu", n_warmup=10, n_runs=15)

2023-02-16 09:06:19 | INFO     | Running benchmark on CPU


Performing warm up on 10 iterations: 100%|██████████| 10/10 [02:01<00:00, 12.15s/it]
Performing benchmark on 15 iterations: 100%|██████████| 15/15 [02:57<00:00, 11.84s/it]

Batch size: 64
Average Throughput: 5.40 data/second
Average Latency: 0.1850 seconds/data





<h4>ToMe Optimization</h4>

In [45]:
tome.patch.timm(model)
# Set the number of tokens reduced per layer
model.r = 16

<h4>ToMe Optimized Model Benchmark - batch size = 64</h4>

<h5>GPU</h5>

In [46]:
benchmark(model, input_data, n_runs=500)

2023-02-16 09:11:19 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:04<00:00, 10.18it/s]
Performing benchmark on 500 iterations: 100%|██████████| 500/500 [00:51<00:00,  9.71it/s]

Batch size: 64
Average Throughput: 625.73 data/second
Average Latency: 0.0016 seconds/data





<h5>CPU</h5>

In [47]:
benchmark(model, input_data, device="cpu", n_warmup=10, n_runs=15)

2023-02-16 09:12:17 | INFO     | Running benchmark on CPU


Performing warm up on 10 iterations: 100%|██████████| 10/10 [00:56<00:00,  5.70s/it]
Performing benchmark on 15 iterations: 100%|██████████| 15/15 [01:25<00:00,  5.70s/it]

Batch size: 64
Average Throughput: 11.23 data/second
Average Latency: 0.0891 seconds/data





In this final case, with a batch size of 64, the original model reaches a throughput of <b>333.49 data/second</b> with a latency of <b>0.0030 seconds/data</b> on GPU, and <b>5.40 data/second</b> with a latency of <b>0.1850 seconds/data</b> on CPU. After optimizing the model with ToMe, we get a throughput of <b>625.73 data/second</b> and a latency of <b>0.0016 seconds/data</b> on GPU, and <b>11.23 data/second</b> and <b>0.0891 seconds/data</b> on CPU.

This is the batch size used as the default within the library. Here the optimized model is roughly <b>2x</b> faster (about 1.9x on GPU and 2.1x on CPU).

The next step is to apply Speedster to the chosen model, again on both cpu and gpu and with different batch sizes.

<h3>👾 Results</h3>

The results are very interesting:

* ToMe is very simple to use, and we essentially reproduce Meta's 2x acceleration on a V100 for batch size 64 (we measured about 1.9x).
* On GPU, ToMe is very sensitive to the batch size: for small batch sizes it performs worse than the original model. With a small batch the highly parallel GPU is not fully utilized and still has room for further parallel computation, so removing tokens saves little time while the merging step adds overhead. The token reduction pays off on GPU only once the GPU is fully utilized.
* On CPU, the speed-up ranges from about 1.5x at batch size 1 to about 2x at batch size 64. Even for small batch sizes, the CPU's compute power is fully used by the network, so the token reduction already compensates for the overhead of ToMe.

![tomeresults](../imgs/tome-results.png)

<center><small style="font-size: 12px;">Throughput graph for the original model and the model to which ToMe was applied, as the batch size varies.</small></center>
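The speed-ups behind these observations can be recomputed directly from the throughputs reported in the benchmark outputs above. The sketch below does just that; the dictionary layout and the `speedup` helper are illustrative choices, while the numbers are copied from the runs in this notebook.

```python
# Throughputs (data/second) copied from the benchmark outputs above:
# batch size -> device -> (original model, ToMe model with r=16)
results = {
    1:  {"gpu": (114.39, 53.50),  "cpu": (5.60, 8.69)},
    2:  {"gpu": (216.19, 102.59), "cpu": (6.03, 10.16)},
    4:  {"gpu": (292.85, 202.81), "cpu": (6.86, 11.81)},
    8:  {"gpu": (307.51, 405.17), "cpu": (6.96, 12.82)},
    32: {"gpu": (319.47, 589.79), "cpu": (6.50, 12.07)},
    64: {"gpu": (333.49, 625.73), "cpu": (5.40, 11.23)},
}


def speedup(original, optimized):
    """Throughput ratio: values above 1 mean the optimized model is faster."""
    return optimized / original


speedups = {
    batch_size: {device: speedup(*pair) for device, pair in devices.items()}
    for batch_size, devices in results.items()
}

for batch_size, devices in speedups.items():
    print(f"batch size {batch_size:>2}: GPU {devices['gpu']:.2f}x | CPU {devices['cpu']:.2f}x")

# Smallest batch size at which ToMe is faster than the original model on GPU.
gpu_break_even = min(bs for bs, d in speedups.items() if d["gpu"] > 1.0)
print("first GPU speed-up at batch size", gpu_break_even)
```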

<h2>🚀 ToMe vs Other Optimization Techniques</h2>

As a next step, I compared the performance of ToMe with what can be achieved by other optimization techniques. I used the Speedster library to run the optimizations and see the performance on CPU and GPU.

<b>Speedster</b> is an open-source module designed to speed up AI inference in just a few lines of code. Its use is very simple. The library automatically applies the best set of SOTA optimization techniques to achieve the maximum inference speed-up (of latency and throughput, while compressing the model size) physically possible on the available hardware.

The optimization workflow consists of 3 steps: select, search, and serve.

📚 <b>Select step</b>: in this step, users input their model in their preferred deep learning framework and express their preferences regarding the maximum acceptable accuracy loss and the optimization time. This information is used to guide the optimization process and ensure that the resulting model meets the user's needs.

🔍 <b>Search step</b>: the library automatically tests every combination of optimization techniques across the software-to-hardware stack, such as sparsity, quantization, and compilers, that is compatible with the user's preferences and local hardware. This allows the library to find the optimal configuration of techniques for accelerating the model.

🍵 <b>Serve step</b>: in this final step, the library returns an accelerated version of the user's model in the DL framework of choice, providing a significant boost in performance.
The model is optimized by the 4 Speedster blocks shown in the image below. How they work is presented in the library documentation.

![speed](../imgs/speedsterdoc.png)

<center><small style="font-size: 12px;">Image taken from Speedster documentation.</small></center>

<h3>💫 ViT Optimization with Speedster</h3>

<h4>👩‍🔬 Experiments</h4>

I performed two experiments with Speedster. First, I applied only optimization techniques that have no impact on model accuracy. This is achieved by setting the parameter metric_drop_ths=0.

Next, I increased metric_drop_ths to 0.05 so that Speedster could also apply techniques that slightly change the accuracy in exchange for a better speed-up, such as quantization and compression. The 0.05 value is very low, which means that we expect the accuracy to remain essentially unchanged, as explained in the documentation.
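Conceptually, the threshold acts as a filter on the candidate models: only candidates whose accuracy drop stays within the threshold are eligible, and the fastest eligible one wins. The sketch below illustrates that idea only. It is not Speedster's internal logic, and the `Candidate` class, the `pick_best` helper, and all names and numbers in it are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # hypothetical label for an optimized variant
    latency: float      # seconds per iteration (made-up values)
    metric_drop: float  # measured degradation vs. the original model (made-up values)


def pick_best(candidates, metric_drop_ths):
    """Fastest candidate whose metric drop does not exceed the threshold."""
    allowed = [c for c in candidates if c.metric_drop <= metric_drop_ths]
    return min(allowed, key=lambda c: c.latency) if allowed else None


candidates = [
    Candidate("original", latency=0.0080, metric_drop=0.00),
    Candidate("compiled_fp32", latency=0.0060, metric_drop=0.00),
    Candidate("compiled_fp16", latency=0.0030, metric_drop=0.01),
    Candidate("compiled_int8", latency=0.0020, metric_drop=0.08),
]

lossless = pick_best(candidates, metric_drop_ths=0)
relaxed = pick_best(candidates, metric_drop_ths=0.05)
print("threshold 0   ->", lossless.name)
print("threshold 0.05 ->", relaxed.name)
```

With a threshold of 0 only the lossless variants compete, while a threshold of 0.05 also admits the half-precision variant in this toy example.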

In [6]:
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval();

<h3>Batch Size = 1 </h3>

In [7]:
input_data = [((torch.randn(1, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [4]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0
)

2023-02-16 09:20:49 | INFO     | Running Speedster on GPU:0
2023-02-16 09:20:56 | INFO     | Benchmark performance of original model
2023-02-16 09:20:58 | INFO     | Original model latency: 0.008123621940612794 sec/iter
2023-02-16 09:21:03 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:21:07 | INFO     | Optimized model latency: 0.006738185882568359 sec/iter
2023-02-16 09:21:07 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 09:21:32 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:21:35 | INFO     | Optimized model latency: 0.0060863494873046875 sec/iter
2023-02-16 09:21:35 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-02-16 09:22:17

In [8]:
optimized_model_metric_drop = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="unconstrained",
    metric_drop_ths=0.05
)

2023-02-16 11:00:21 | INFO     | Running Speedster on GPU:0
2023-02-16 11:00:28 | INFO     | Benchmark performance of original model
2023-02-16 11:00:30 | INFO     | Original model latency: 0.008265492916107177 sec/iter
2023-02-16 11:00:35 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 11:00:38 | INFO     | Optimized model latency: 0.0067327022552490234 sec/iter
2023-02-16 11:00:38 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-16 11:00:42 | INFO     | Optimized model latency: 0.007616758346557617 sec/iter
2023-02-16 11:00:42 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 11:01:08 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: Quantizat

<h4>Speedster Optimized Model Benchmark</h4>

<h5>GPU</h5>

In [5]:
benchmark(optimized_model, input_data)

2023-02-16 09:22:39 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 218.85it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:04<00:00, 218.38it/s]

Batch size: 1
Average Throughput: 220.19 data/second
Average Latency: 0.0045 seconds/data





In [9]:
benchmark(optimized_model_metric_drop, input_data)

2023-02-16 11:10:33 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 427.99it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:02<00:00, 424.11it/s]

Batch size: 1
Average Throughput: 427.80 data/second
Average Latency: 0.0023 seconds/data





<h5>CPU</h5>

In [7]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0,
    device="cpu"
)

2023-02-16 09:23:26 | INFO     | Running Speedster on CPU
2023-02-16 09:23:46 | INFO     | Benchmark performance of original model
2023-02-16 09:24:09 | INFO     | Original model latency: 0.17869094610214234 sec/iter
2023-02-16 09:24:14 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:24:21 | INFO     | Optimized model latency: 0.1728215217590332 sec/iter
2023-02-16 09:24:21 | INFO     | Optimizing with DeepSparseCompiler and q_type: None.


DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.3.2 COMMUNITY | (7d31c4bf) (release) (optimized) (system=avx2, binary=avx2)


2023-02-16 09:24:41 | INFO     | Optimized model latency: 0.1993265151977539 sec/iter
2023-02-16 09:24:41 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:24:50 | INFO     | Optimized model latency: 0.15670418739318848 sec/iter
2023-02-16 09:24:50 | INFO     | Optimizing with OpenVINOCompiler and q_type: None.
[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /tmp/tmpoaccyrgp/fp32/temp.xml
[ SUCCESS ] BIN file: /tmp/tmpoaccyrgp/fp32/tem

<h5>CPU</h5>

In [8]:
benchmark(optimized_model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 09:25:23 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:03<00:00,  6.10it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [00:16<00:00,  6.24it/s]

Batch size: 1
Average Throughput: 6.31 data/second
Average Latency: 0.1584 seconds/data





After trying several approaches, Speedster selected TensorRT as the technique that best optimizes the model on GPU, and ONNX Runtime on CPU.

With Speedster, model throughput on GPU improves already at batch size 1: the average throughput goes from <b>114.39</b> for the original model to <b>220.19</b> for the Speedster-optimized model. On CPU it goes from <b>5.60</b> for the unoptimized model to <b>6.31</b>.

<h3>Batch Size = 2 </h3>

In [12]:
input_data = [((torch.randn(2, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [10]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0
)

2023-02-16 09:25:54 | INFO     | Running Speedster on GPU:0
2023-02-16 09:25:58 | INFO     | Benchmark performance of original model
2023-02-16 09:25:59 | INFO     | Original model latency: 0.00877812623977661 sec/iter
2023-02-16 09:26:05 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:26:06 | INFO     | Optimized model latency: 0.008505582809448242 sec/iter
2023-02-16 09:26:06 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 09:26:26 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:26:27 | INFO     | Optimized model latency: 0.007837057113647461 sec/iter
2023-02-16 09:26:27 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-02-16 09:27:06 |

In [13]:
optimized_model_metric_drop = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="unconstrained",
    metric_drop_ths=0.05
)

2023-02-16 11:19:24 | INFO     | Running Speedster on GPU:0
2023-02-16 11:19:29 | INFO     | Benchmark performance of original model
2023-02-16 11:19:30 | INFO     | Original model latency: 0.009010717868804932 sec/iter
2023-02-16 11:19:35 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 11:19:36 | INFO     | Optimized model latency: 0.008727788925170898 sec/iter
2023-02-16 11:19:36 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-16 11:19:37 | INFO     | Optimized model latency: 0.0071909427642822266 sec/iter
2023-02-16 11:19:37 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 11:19:57 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: Quantizat

<h4>Speedster Optimized Model Benchmark</h4>

<h5>GPU</h5>

In [11]:
benchmark(optimized_model, input_data)

2023-02-16 09:27:08 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 145.93it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:06<00:00, 144.07it/s]

Batch size: 2
Average Throughput: 290.16 data/second
Average Latency: 0.0034 seconds/data





In [14]:
benchmark(optimized_model_metric_drop, input_data)

2023-02-16 11:29:11 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 353.16it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:02<00:00, 349.08it/s]

Batch size: 2
Average Throughput: 704.12 data/second
Average Latency: 0.0014 seconds/data





In [12]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0,
    device="cpu"
)

2023-02-16 09:28:44 | INFO     | Running Speedster on CPU
2023-02-16 09:29:20 | INFO     | Benchmark performance of original model
2023-02-16 09:30:03 | INFO     | Original model latency: 0.3269576716423035 sec/iter
2023-02-16 09:30:09 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:30:21 | INFO     | Optimized model latency: 0.3234107494354248 sec/iter
2023-02-16 09:30:21 | INFO     | Optimizing with DeepSparseCompiler and q_type: None.
2023-02-16 09:30:50 | INFO     | Optimized model latency: 0.3949728012084961 sec/iter
2023-02-16 09:30:50 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:31:05 | INFO     | Optimized model latency: 0.3019731044769287 sec/iter
2023-02-16 09:31:05 | INFO

<h5>CPU</h5>

In [13]:
benchmark(optimized_model, input_data, device="cpu", n_warmup=20, n_runs=100)

2023-02-16 09:31:26 | INFO     | Running benchmark on CPU


Performing warm up on 20 iterations: 100%|██████████| 20/20 [00:06<00:00,  3.20it/s]
Performing benchmark on 100 iterations: 100%|██████████| 100/100 [00:31<00:00,  3.22it/s]

Batch size: 2
Average Throughput: 6.49 data/second
Average Latency: 0.1542 seconds/data





As with a batch size of 1, Speedster selected TensorRT as the best-performing technique on GPU and ONNX Runtime on CPU.

With a batch size of 2, the model optimized with Speedster is faster than the original model on both GPU and CPU. If we accept a metric drop of up to 0.05 (metric_drop_ths=0.05), we get a model with more than a 3x speedup over the original on GPU.
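The optimization logs report latency in seconds per iteration (one full batch), while the benchmark reports throughput in data per second. The sketch below converts between the two to sanity-check the 3x figure. The numbers are copied from the batch size 2 outputs above; `throughput_from_latency` is a hypothetical helper, and combining a latency measured during optimization with a throughput measured by `benchmark` is only an approximation.

```python
def throughput_from_latency(batch_size: int, latency_sec_per_iter: float) -> float:
    """Samples processed per second when each iteration handles one full batch."""
    return batch_size / latency_sec_per_iter


# Original model latency logged by the metric_drop_ths=0.05 run (batch size 2).
original_latency = 0.009010717868804932  # sec/iter
# Average throughput reported by benchmark() for the metric-drop model.
optimized_throughput = 704.12  # data/second

original_throughput = throughput_from_latency(2, original_latency)
speedup = optimized_throughput / original_throughput
print(f"original: {original_throughput:.1f} data/s, speedup: {speedup:.2f}x")
```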

<h3>Batch Size = 4 </h3>

In [17]:
input_data = [((torch.randn(4, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [15]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0
)

2023-02-16 09:32:37 | INFO     | Running Speedster on GPU:0
2023-02-16 09:32:42 | INFO     | Benchmark performance of original model
2023-02-16 09:32:44 | INFO     | Original model latency: 0.013259525299072266 sec/iter
2023-02-16 09:32:49 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:32:50 | INFO     | Optimized model latency: 0.024447202682495117 sec/iter
2023-02-16 09:32:50 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 09:33:12 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:33:13 | INFO     | Optimized model latency: 0.014016389846801758 sec/iter
2023-02-16 09:33:13 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-02-16 09:33:53 |

In [18]:
optimized_model_metric_drop = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="unconstrained",
    metric_drop_ths=0.05,
)

2023-02-16 11:40:43 | INFO     | Running Speedster on GPU:0
2023-02-16 11:40:48 | INFO     | Benchmark performance of original model
2023-02-16 11:40:50 | INFO     | Original model latency: 0.013236365318298339 sec/iter
2023-02-16 11:40:56 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 11:40:57 | INFO     | Optimized model latency: 0.013654947280883789 sec/iter
2023-02-16 11:40:57 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-16 11:40:58 | INFO     | Optimized model latency: 0.007341861724853516 sec/iter
2023-02-16 11:40:58 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 11:41:18 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: Quantizati

<h4>Speedster Optimized Model Benchmark</h4>

<h5>GPU</h5>

In [16]:
benchmark(optimized_model, input_data)

2023-02-16 09:34:36 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 78.76it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:12<00:00, 78.17it/s]

Batch size: 4
Average Throughput: 315.12 data/second
Average Latency: 0.0032 seconds/data





In [19]:
benchmark(optimized_model_metric_drop, input_data)

2023-02-16 11:50:41 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 267.54it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:03<00:00, 262.84it/s]

Batch size: 4
Average Throughput: 1060.41 data/second
Average Latency: 0.0009 seconds/data





In [17]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0,
    device="cpu"
)

2023-02-16 09:35:49 | INFO     | Running Speedster on CPU
2023-02-16 09:36:52 | INFO     | Benchmark performance of original model
2023-02-16 09:38:07 | INFO     | Original model latency: 0.5760366940498352 sec/iter
2023-02-16 09:38:13 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:38:35 | INFO     | Optimized model latency: 0.5786547660827637 sec/iter
2023-02-16 09:38:35 | INFO     | Optimizing with DeepSparseCompiler and q_type: None.
2023-02-16 09:39:22 | INFO     | Optimized model latency: 0.8065593242645264 sec/iter
2023-02-16 09:39:22 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:39:48 | INFO     | Optimized model latency: 0.6088290214538574 sec/iter
2023-02-16 09:39:48 | INFO

<h5>CPU</h5>

In [18]:
benchmark(optimized_model, input_data, device="cpu", n_warmup=10, n_runs=15)

2023-02-16 09:40:26 | INFO     | Running benchmark on CPU


Performing warm up on 10 iterations: 100%|██████████| 10/10 [00:05<00:00,  1.72it/s]
Performing benchmark on 15 iterations: 100%|██████████| 15/15 [00:08<00:00,  1.72it/s]

Batch size: 4
Average Throughput: 6.89 data/second
Average Latency: 0.1452 seconds/data





Here Speedster selects TensorRT as the best technique on GPU and OpenVINO on CPU.

The results are again in line with those for batch sizes 1 and 2.

<h3>Batch Size = 8 </h3>

In [23]:
input_data = [((torch.randn(8, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [20]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0
)

2023-02-16 09:40:53 | INFO     | Running Speedster on GPU:0
2023-02-16 09:40:58 | INFO     | Benchmark performance of original model
2023-02-16 09:41:02 | INFO     | Original model latency: 0.025456199645996092 sec/iter
2023-02-16 09:41:09 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:41:10 | INFO     | Optimized model latency: 0.02623724937438965 sec/iter
2023-02-16 09:41:10 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 09:41:32 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:41:34 | INFO     | Optimized model latency: 0.02676844596862793 sec/iter
2023-02-16 09:41:34 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-02-16 09:42:15 |

In [24]:
optimized_model_metric_drop = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="unconstrained",
    metric_drop_ths=0.05,
)

2023-02-16 12:48:44 | INFO     | Running Speedster on GPU:0
2023-02-16 12:48:50 | INFO     | Benchmark performance of original model
2023-02-16 12:48:53 | INFO     | Original model latency: 0.025454015731811525 sec/iter
2023-02-16 12:49:01 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 12:49:03 | INFO     | Optimized model latency: 0.026031970977783203 sec/iter
2023-02-16 12:49:03 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-16 12:49:04 | INFO     | Optimized model latency: 0.007582187652587891 sec/iter
2023-02-16 12:49:04 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 12:49:28 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: Quantizati

<h4>Speedster Optimized Model Benchmark</h4>

<h5>GPU</h5>

In [21]:
benchmark(optimized_model, input_data)

2023-02-16 09:42:24 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:01<00:00, 41.08it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:24<00:00, 40.78it/s]

Batch size: 8
Average Throughput: 328.79 data/second
Average Latency: 0.0030 seconds/data





In [25]:
benchmark(optimized_model_metric_drop, input_data)

2023-02-16 12:59:24 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 156.99it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:06<00:00, 154.87it/s]

Batch size: 8
Average Throughput: 1248.45 data/second
Average Latency: 0.0008 seconds/data





In [22]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0,
    device="cpu"
)

2023-02-16 09:42:50 | INFO     | Running Speedster on CPU
2023-02-16 09:44:49 | INFO     | Benchmark performance of original model
2023-02-16 09:47:17 | INFO     | Original model latency: 1.147692756652832 sec/iter
2023-02-16 09:47:25 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:48:09 | INFO     | Optimized model latency: 1.1343605518341064 sec/iter
2023-02-16 09:48:09 | INFO     | Optimizing with DeepSparseCompiler and q_type: None.
2023-02-16 09:49:27 | INFO     | Optimized model latency: 1.5942871570587158 sec/iter
2023-02-16 09:49:27 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:50:16 | INFO     | Optimized model latency: 1.2054996490478516 sec/iter
2023-02-16 09:50:16 | INFO

<h5>CPU</h5>

In [23]:
benchmark(optimized_model, input_data, device="cpu", n_warmup=10, n_runs=15)

2023-02-16 09:51:09 | INFO     | Running benchmark on CPU


Performing warm up on 10 iterations: 100%|██████████| 10/10 [00:11<00:00,  1.11s/it]
Performing benchmark on 15 iterations: 100%|██████████| 15/15 [00:16<00:00,  1.11s/it]

Batch size: 8
Average Throughput: 7.22 data/second
Average Latency: 0.1386 seconds/data





Again, the best approaches are TensorRT on GPU and OpenVINO on CPU.

<h3>Batch Size = 32 </h3>

In [4]:
input_data = [((torch.randn(32, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [25]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0
)

2023-02-16 09:51:42 | INFO     | Running Speedster on GPU:0
2023-02-16 09:51:55 | INFO     | Benchmark performance of original model
2023-02-16 09:52:09 | INFO     | Original model latency: 0.09855757713317871 sec/iter
2023-02-16 09:52:21 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 09:52:25 | INFO     | Optimized model latency: 0.10088324546813965 sec/iter
2023-02-16 09:52:25 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 09:52:49 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 09:52:55 | INFO     | Optimized model latency: 0.1034550666809082 sec/iter
2023-02-16 09:52:55 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-02-16 09:53:37 |

In [5]:
optimized_model_metric_drop = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="unconstrained",
    metric_drop_ths=0.05,
)

2023-02-16 13:34:06 | INFO     | Running Speedster on GPU:0
2023-02-16 13:34:23 | INFO     | Benchmark performance of original model
2023-02-16 13:34:36 | INFO     | Original model latency: 0.09862235069274902 sec/iter
2023-02-16 13:34:48 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 13:34:55 | INFO     | Optimized model latency: 0.10131096839904785 sec/iter
2023-02-16 13:34:55 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-16 13:35:00 | INFO     | Optimized model latency: 0.028622865676879883 sec/iter
2023-02-16 13:35:00 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 13:35:26 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: Quantization

<h4>Speedster Optimized Model Benchmark</h4>

<h5>GPU</h5>

In [26]:
benchmark(optimized_model, input_data)

2023-02-16 09:53:41 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:04<00:00, 10.72it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [01:33<00:00, 10.64it/s]

Batch size: 32
Average Throughput: 342.10 data/second
Average Latency: 0.0029 seconds/data





In [6]:
benchmark(optimized_model_metric_drop, input_data)

2023-02-16 13:47:42 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:01<00:00, 46.61it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:21<00:00, 46.21it/s]

Batch size: 32
Average Throughput: 1488.99 data/second
Average Latency: 0.0007 seconds/data





<h3>Batch Size = 64 </h3>

In [6]:
input_data = [((torch.randn(64, 3, 224, 224), ), torch.tensor([0])) for _ in range(100)]

In [30]:
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0
)

2023-02-16 10:33:11 | INFO     | Running Speedster on GPU:0
2023-02-16 10:33:34 | INFO     | Benchmark performance of original model
2023-02-16 10:34:01 | INFO     | Original model latency: 0.1909184455871582 sec/iter
2023-02-16 10:34:19 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-16 10:34:27 | INFO     | Optimized model latency: 0.19694280624389648 sec/iter
2023-02-16 10:34:28 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-16 10:34:57 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-16 10:35:05 | INFO     | Optimized model latency: 0.19941234588623047 sec/iter
2023-02-16 10:35:05 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-02-16 10:35:52 |

In [None]:
optimized_model_metric_drop = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="unconstrained",
    metric_drop_ths=0.05,
)

<h4>Speedster Optimized Model Benchmark</h4>

<h5>GPU</h5>

In [31]:
benchmark(optimized_model, input_data)

2023-02-16 10:35:55 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:09<00:00,  5.31it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [03:09<00:00,  5.27it/s]

Batch size: 64
Average Throughput: 338.72 data/second
Average Latency: 0.0030 seconds/data





In [7]:
benchmark(optimized_model_metric_drop, input_data)

2023-02-16 14:17:56 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:02<00:00, 23.72it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:42<00:00, 23.51it/s]

Batch size: 64
Average Throughput: 1513.78 data/second
Average Latency: 0.0007 seconds/data





Overall, even for large batch sizes, Speedster noticeably accelerates the model, especially if we accept a drop in the reference metric.
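To make the trend across batch sizes easier to read, the following sketch gathers the GPU throughputs reported by the benchmark cells above and computes how much is gained by relaxing the metric drop threshold from 0 to 0.05. The numbers are copied from the outputs in this notebook; the dictionary layout and variable names are only illustrative.

```python
# Average GPU throughput (data/second) reported by the benchmark cells above.
# batch size: (metric_drop_ths=0, metric_drop_ths=0.05)
gpu_throughput = {
    2: (290.16, 704.12),
    4: (315.12, 1060.41),
    8: (328.79, 1248.45),
    32: (342.10, 1488.99),
    64: (338.72, 1513.78),
}

# Gain obtained by accepting a metric drop of up to 0.05, per batch size.
gains = {bs: relaxed / strict for bs, (strict, relaxed) in gpu_throughput.items()}

for bs, (strict, relaxed) in gpu_throughput.items():
    print(f"batch {bs:>2}: {strict:8.2f} -> {relaxed:8.2f} data/s ({gains[bs]:.2f}x)")
```

In these runs the gain from accepting the metric drop grows with the batch size, while the throughput of the model optimized without metric drop flattens out after batch size 8.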

<h3>👾 Results</h3>

Let's analyze the results:

* Speedster is easy to use and reports, during optimization, which techniques it tested and how each one performed.
* The library displays initial latency figures for each optimized model during optimization, without the need to run the benchmark function.
* Using half precision, flash attention and hardware-specific compilation techniques, Speedster on GPU can significantly accelerate the model while keeping the metric drop below 0.05 with respect to the original version. The default metric used by Speedster is numeric precision, which measures the average relative difference between the outputs of the original model and the optimized one. A model whose metric drop stays below 0.05 can therefore be considered to have practically no accuracy loss.
* On CPU, the original model and the Speedster-optimized model had similar throughput, and raising the metric drop threshold to 0.05 did not improve the speedup. This is because many layers do not have fp16 kernels on CPU, so converting fp32 tensors to fp16 and back inside the network increases latency. In addition, int8 conversion of the ViT model degraded numeric precision beyond the 0.05 constraint. This makes sense, because transformers are more sensitive to quantization due to "outliers" in their activations. Quantization-aware training (QAT), i.e. fine-tuning the model with simulated quantization before quantizing it, could be used to get a better speedup with Speedster on CPUs. Another way to improve Speedster's performance on CPUs might be to implement ToMe within the library.
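To make the metric and the outlier argument concrete, here is a small, self-contained sketch. It assumes that "average relative difference" means the mean of |original - optimized| / |original| over the outputs; this is an interpretation, not Speedster's actual code. The helpers `to_fp16` and `fake_int8`, the synthetic activations and the outlier value 60.0 are all hypothetical.

```python
import random
import struct


def mean_relative_difference(reference, candidate, eps=1e-8):
    """Assumed form of the metric: mean of |ref - cand| / |ref| over all values."""
    diffs = [abs(r - c) / (abs(r) + eps) for r, c in zip(reference, candidate)]
    return sum(diffs) / len(diffs)


def to_fp16(x):
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fake_int8(values):
    """Symmetric per-tensor int8 quantization followed by dequantization."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) * scale for v in values]


random.seed(0)
# Synthetic "activations" of moderate magnitude (illustrative only).
activations = [random.uniform(0.5, 1.5) for _ in range(1000)]
# The same activations plus a single large outlier.
with_outlier = activations + [60.0]

fp16_err = mean_relative_difference(activations, [to_fp16(v) for v in activations])
int8_err = mean_relative_difference(activations, fake_int8(activations))
int8_outlier_err = mean_relative_difference(with_outlier, fake_int8(with_outlier))

print(f"fp16:                {fp16_err:.5f}")
print(f"int8 (no outlier):   {int8_err:.5f}")
print(f"int8 (with outlier): {int8_outlier_err:.5f}")
```

A single outlier stretches the int8 scale so much that the ordinary values collapse onto a handful of quantization levels, which pushes the error above the 0.05 threshold in this toy setting, while the fp16 round trip stays far below it.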

![speedsterresults](../imgs/speedster-results.png)

<center><small style="font-size: 12px;">Throughput graph for the original model and the model to which Speedster was applied, as the batch size varies.</small></center>

<a id="play"></a>
<h2>🎨 Test ToMe With Your Own Picture</h2>

In this section you can test ToMe on your own images and change the hyperparameter <em>r</em>, which controls the level of optimization.

In [None]:
# change this with the path of your image
PATH_TO_IMAGE = "imgs/PATH_TO_YOUR_IMAGE.jpg"

In [None]:
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from PIL import Image, ImageFile

In [None]:
# The flag lives on ImageFile, not Image; it lets PIL load truncated image files
ImageFile.LOAD_TRUNCATED_IMAGES = True

In [None]:
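import timm
import tome

model_name = "vit_large_patch16_384"
model = timm.create_model(model_name, pretrained=True)
# trace_source=True records which input patches end up in each merged token (used by the visualization)
tome.patch.timm(model, trace_source=True)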

In [None]:
input_size = model.default_cfg["input_size"][1]

transform_list = [
    transforms.Resize(int((256 / 224) * input_size), interpolation=InterpolationMode.BICUBIC),
    transforms.CenterCrop(input_size)
]

transform_vis  = transforms.Compose(transform_list)
transform_norm = transforms.Compose(transform_list + [
    transforms.ToTensor(),
    transforms.Normalize(model.default_cfg["mean"], model.default_cfg["std"]),
])

In [None]:
img = Image.open(PATH_TO_IMAGE).convert('RGB')

In [None]:
img_vis = transform_vis(img)
img_norm = transform_norm(img)

You can change the hyperparameter <em>r</em>. The larger <em>r</em> is, the more tokens (image patches) are merged together in each layer:

In [None]:
model.r = 16
_ = model(img_norm[None, ...])
source = model._tome_info["source"]

print(f"{source.shape[1]} tokens at the end")
tome.make_visualization(img_vis, source, patch_size=16, class_token=True)
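As a rough guide to how <em>r</em> translates into the final token count printed above, the sketch below counts the tokens left after each transformer block. It assumes a constant <em>r</em> per layer, a ViT-L depth of 24 blocks, a protected class token, and the cap of merging at most half of the remaining tokens per layer described in the ToMe paper. The function `tokens_per_layer` is hypothetical and not part of the `tome` package.

```python
def tokens_per_layer(image_size, patch_size, depth, r, class_token=True):
    """Token count before the first block and after each block for a constant r."""
    protected = 1 if class_token else 0  # the class token is never merged
    tokens = (image_size // patch_size) ** 2 + protected
    counts = [tokens]
    for _ in range(depth):
        # Bipartite matching can merge at most half of the unprotected tokens.
        merged = min(r, (tokens - protected) // 2)
        tokens -= merged
        counts.append(tokens)
    return counts


# vit_large_patch16_384: 384x384 input, 16x16 patches, 24 blocks (assumed depth).
counts = tokens_per_layer(image_size=384, patch_size=16, depth=24, r=16)
print(f"{counts[0]} tokens at the start, {counts[-1]} tokens at the end")
```

Under these assumptions, r = 16 reduces the 577 input tokens (576 patches plus the class token) to 193 by the last layer.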

The results as r varies will be similar to these:

![thanos](../imgs/thanos.png)

<center><small style="font-size: 12px;">ToMe test on a picture of me.</small></center>

Yes, I admit it: unfortunately, your images will come without Thanos :(

<h2>🌈 Conclusions</h2>

ToMe makes it possible to accelerate Vision Transformer models, both on GPU and CPU. One interesting thing to notice is that ToMe improves the model's speed on CPU inference, but reduces it on GPU when the batch size is small. This can be explained by the fact that a CPU already uses its full compute power at small batch sizes, while a GPU still has room for parallel computation. Therefore, on CPU ToMe's overhead is offset by the token reduction, whereas on GPU this only happens once the batch size is large enough. This can also be seen from the graphs below:

![totalresults](../imgs/total-results.png)

<center><small style="font-size: 12px;">Results obtained with different optimization techniques, with various values for batch size.</small></center>

Here we can see that Speedster, when a metric drop of up to 0.05 is accepted, is significantly faster than the original model. Since ToMe also leads to some accuracy loss, the comparison between the two techniques can be considered fair. On CPU, on the other hand, ToMe appears to be the fastest technique, so it might be interesting to implement its automatic use within Speedster. I opened an [issue on Speedster GitHub](https://github.com/nebuly-ai/nebullvm/issues/174) so that anyone can contribute.

And that's it! If you are interested in AI optimization or if you liked this notebook, please leave a star on our [Speedster](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster) repo 💕🌟!

Thanks to the Nebuly team for their support in these analyses ❤️

<h2>💾 References</h2>

* [Speedster](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster)
* [Blog Post: Token Merging: Your ViT but faster](https://research.facebook.com/blog/2023/2/token-merging-your-vit-but-faster/?utm_source=linkedin&utm_medium=organic_social&utm_campaign=evergreen&utm_content=animation)
* [Paper: Token Merging: Your ViT but faster](https://arxiv.org/pdf/2210.09461.pdf)
* [Timm](https://github.com/rwightman/pytorch-image-models#getting-started-documentation)
* [Nebuly](https://www.nebuly.com/)