Merge pull request #29 from premAI-io/main
Merge from main.
Anindyadeep committed Jan 31, 2024
2 parents c9fd226 + ac46afa commit 560d6b3
Showing 3 changed files with 83 additions and 104 deletions.
39 changes: 19 additions & 20 deletions README.md
@@ -29,24 +29,24 @@
Take a first glance at the Llama-2-7B model performance metrics across different precisions and inference engines


| Engine | float32 | float16 | int8 | int4 |
|---------------------------------------------|--------------|----------------|---------------|---------------|
| [burn](/bench_burn/) | 10.04 ± 0.64 | - | - | - |
| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
| [tinygrad](/bench_tinygrad/) | - | 20.32 ± 0.06 | - | - |
| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | - |
| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |

*(Data updated: `31st January 2024`)*
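The `±` figures in the table presumably denote mean ± standard deviation of tokens/second across the benchmark repetitions (an assumption based on the `--repetitions` flag used in the example commands, not something stated on this page). A minimal sketch of how such a summary could be computed:

```python
import statistics

def summarize_throughput(tokens_generated: list[int], elapsed_seconds: list[float]) -> str:
    """Aggregate per-run throughput into a 'mean ± std' string.

    Each run yields a token count and a wall-clock time; throughput is
    tokens/second per run, then summarized across repetitions.
    """
    rates = [t / s for t, s in zip(tokens_generated, elapsed_seconds)]
    mean = statistics.mean(rates)
    std = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return f"{mean:.2f} ± {std:.2f}"

# Hypothetical repetitions: 512 tokens generated in each of three runs
print(summarize_throughput([512, 512, 512], [5.2, 5.0, 5.4]))
```

The sample standard deviation (`statistics.stdev`) is used here; the repository may aggregate differently.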

@@ -56,7 +56,6 @@ Take a first glance of Llama-2-7B Model Performance Metrics Across Different Pre

- Also, if you want more detailed information about each benchmark, you can find those details in the respective benchmark folders.

- If you want to compare side by side which inference engines support which precisions and devices, you can check out the [ml_engines.md](/docs/ml_engines.md) file. Please note that this file is incomplete; a better comparison of engines will be added in later versions.

## 🚀 Getting Started

@@ -110,7 +109,7 @@ For a comprehensive execution of all benchmarks, use the overarching `benchmark.

Again, customize the parameters according to your preferences, ensuring that `<file_path>` and `<path_to_models>` point to the correct locations.

Feel free to adjust the parameters as needed for your specific benchmarking requirements. Please note that running all the benchmarks collectively can require a lot of storage (around 500 GB), so make sure you have enough space before running them all at once.
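The flags shown in the example commands in this repository (`--repetitions`, `--max_tokens`, `--device`, `--prompt`) can also be assembled programmatically when sweeping several configurations. A hedged sketch (the helper below is illustrative, not part of the repository):

```python
import shlex

def build_benchmark_command(device: str,
                            repetitions: int = 10,
                            max_tokens: int = 512,
                            prompt: str = "Write an essay about the transformer model architecture") -> list[str]:
    """Assemble an argument list for benchmark.sh.

    Flag names are taken from the commands documented in this repository;
    defaults mirror the example invocations.
    """
    return [
        "./benchmark.sh",
        "--repetitions", str(repetitions),
        "--max_tokens", str(max_tokens),
        "--device", device,
        "--prompt", prompt,
    ]

# Print the shell-quoted command for a CPU run
print(shlex.join(build_benchmark_command("cpu")))
```

Passing the list form to `subprocess.run` avoids shell-quoting issues with the multi-word prompt.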

## 🤝 Contribute

74 changes: 32 additions & 42 deletions docs/llama2.md
@@ -9,24 +9,24 @@

**Performance Metrics:** (unit: Tokens / second)

| Engine | float32 | float16 | int8 | int4 |
|---------------------------------------------|--------------|----------------|---------------|---------------|
| [burn](/bench_burn/) | 10.04 ± 0.64 | - | - | - |
| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
| [tinygrad](/bench_tinygrad/) | - | 20.32 ± 0.06 | - | - |
| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | - |
| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |

*(Data updated: `31st January 2024`)*

@@ -41,35 +41,25 @@
- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cpu --prompt 'Write an essay about the transformer model architecture'`

**Performance Metrics:** (unit: Tokens / second)
| Engine | float32 | float16 | int8 | int4 |
|----------------------------------------|--------------|--------------|--------------|--------------|
| [burn](/bench_burn/) | 0.21 ± 0.12 | - | - | - |
| [candle](/bench_candle/) | - | 3.43 ± 0.02 | - | - |
| [llama.cpp](/bench_llamacpp/) | - | - | 13.24 ± 0.62 | 21.43 ± 0.47 |
| [ctranslate](/bench_ctranslate/) | - | - | 1.87 ± 0.14 | - |
| [tinygrad](/bench_tinygrad/) | - | 4.21 ± 0.38 | - | - |
| [ctransformers](/bench_ctransformers/) | - | - | 13.50 ± 0.48 | 20.57 ± 2.50 |
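One pattern visible in the CPU numbers: int4 quantization delivers a sizeable throughput gain over int8 for the same engine. A small helper to compute such speedups from the table values (the specific comparison below uses the llama.cpp figures above):

```python
def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Relative throughput gain between two tokens/second measurements,
    e.g. int8 baseline vs. int4 optimized."""
    return optimized_tps / baseline_tps

# llama.cpp on CPU: int8 = 13.24 tokens/s, int4 = 21.43 tokens/s
print(round(speedup(13.24, 21.43), 2))
```

Note that the ± spreads in the table mean such ratios carry uncertainty; treat them as rough indicators rather than exact figures.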


### GPU (Metal)

**Command:** `./benchmark.sh --repetitions 10 --max_tokens 512 --device metal --prompt 'Write an essay about the transformer model architecture'`

**Performance Metrics:** (unit: Tokens / second)
| Engine | float32 | float16 | int8 | int4 |
|-----------------------------------------|--------------|---------------|--------------|--------------|
| [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 |
| [tinygrad](/bench_tinygrad/) | - | 29.78 ± 1.18 | - | - |
| [ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 |

*(Data updated: `31st January 2024`)*
