Merge pull request #29 from premAI-io/main
Merge from main.
Anindyadeep committed Jan 31, 2024
2 parents c9fd226 + ac46afa commit 560d6b3
Showing 3 changed files with 83 additions and 104 deletions.
39 changes: 19 additions & 20 deletions README.md
@@ -29,24 +29,24 @@
Take a first glance at the Llama-2-7B model performance metrics across different precisions and inference engines


| Engine | float32 | float16 | int8 | int4 |
|---------------------------------------------|--------------|----------------|---------------|---------------|
| [burn](/bench_burn/) | 10.04 ± 0.64 | - | - | - |
| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
| [tinygrad](/bench_tinygrad/) | - | 20.32 ± 0.06 | - | - |
| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | - |
| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |

*(Data updated: `31st January 2024`)*
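The `±` figures in the table presumably denote mean ± standard deviation of tokens/second across the benchmark repetitions (an assumption based on the `--repetitions` flag used in the example commands, not something stated on this page). A minimal sketch of how such a summary could be computed:

```python
import statistics

def summarize_throughput(tokens_generated: list[int], elapsed_seconds: list[float]) -> str:
    """Aggregate per-run throughput into a 'mean ± std' string.

    Each run yields a token count and a wall-clock time; throughput is
    tokens/second per run, then summarized across repetitions.
    """
    rates = [t / s for t, s in zip(tokens_generated, elapsed_seconds)]
    mean = statistics.mean(rates)
    std = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return f"{mean:.2f} ± {std:.2f}"

# Hypothetical repetitions: 512 tokens generated in each of three runs
print(summarize_throughput([512, 512, 512], [5.2, 5.0, 5.4]))
```

The sample standard deviation (`statistics.stdev`) is used here; the repository may aggregate differently.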

@@ -56,7 +56,6 @@ Take a first glance of Llama-2-7B Model Performance Metrics Across Different Pre

- Also, if you want more detailed information about each benchmark, you can find those details in the respective benchmark folders.

- If you want to compare side by side which inference engines support which precisions and devices, you can check out the [ml_engines.md](/docs/ml_engines.md) file. Please note that this file is incomplete; a better comparison of engines will be added in later versions.

## 🚀 Getting Started

@@ -110,7 +109,7 @@ For a comprehensive execution of all benchmarks, use the overarching `benchmark.

Again, customize the parameters according to your preferences, ensuring that `<file_path>` and `<path_to_models>` point to the correct locations.

Feel free to adjust the parameters as needed for your specific benchmarking requirements. Please note that running all the benchmarks collectively can require a lot of storage (around 500 GB), so make sure you have enough space before running them all at once.
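The flags shown in the example commands in this repository (`--repetitions`, `--max_tokens`, `--device`, `--prompt`) can also be assembled programmatically when sweeping several configurations. A hedged sketch (the helper below is illustrative, not part of the repository):

```python
import shlex

def build_benchmark_command(device: str,
                            repetitions: int = 10,
                            max_tokens: int = 512,
                            prompt: str = "Write an essay about the transformer model architecture") -> list[str]:
    """Assemble an argument list for benchmark.sh.

    Flag names are taken from the commands documented in this repository;
    defaults mirror the example invocations.
    """
    return [
        "./benchmark.sh",
        "--repetitions", str(repetitions),
        "--max_tokens", str(max_tokens),
        "--device", device,
        "--prompt", prompt,
    ]

# Print the shell-quoted command for a CPU run
print(shlex.join(build_benchmark_command("cpu")))
```

Passing the list form to `subprocess.run` avoids shell-quoting issues with the multi-word prompt.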

## 🤝 Contribute

74 changes: 32 additions & 42 deletions docs/llama2.md
@@ -9,24 +9,24 @@

**Performance Metrics:** (unit: Tokens / second)

| Engine | float32 | float16 | int8 | int4 |
|---------------------------------------------|--------------|----------------|---------------|---------------|
| [burn](/bench_burn/) | 10.04 ± 0.64 | - | - | - |
| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
| [tinygrad](/bench_tinygrad/) | - | 20.32 ± 0.06 | - | - |
| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | - |
| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |

*(Data updated: `31st January 2024`)*

@@ -41,35 +41,25 @@
- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cpu --prompt 'Write an essay about the transformer model architecture'`

**Performance Metrics:** (unit: Tokens / second)
| Engine | float32 | float16 | int8 | int4 |
|----------------------------------------|--------------|--------------|--------------|--------------|
| [burn](/bench_burn/) | 0.21 ± 0.12 | - | - | - |
| [candle](/bench_candle/) | - | 3.43 ± 0.02 | - | - |
| [llama.cpp](/bench_llamacpp/) | - | - | 13.24 ± 0.62 | 21.43 ± 0.47 |
| [ctranslate](/bench_ctranslate/) | - | - | 1.87 ± 0.14 | - |
| [tinygrad](/bench_tinygrad/) | - | 4.21 ± 0.38 | - | - |
| [ctransformers](/bench_ctransformers/) | - | - | 13.50 ± 0.48 | 20.57 ± 2.50 |
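One pattern visible in the CPU numbers: int4 quantization delivers a sizeable throughput gain over int8 for the same engine. A small helper to compute such speedups from the table values (the specific comparison below uses the llama.cpp figures above):

```python
def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Relative throughput gain between two tokens/second measurements,
    e.g. int8 baseline vs. int4 optimized."""
    return optimized_tps / baseline_tps

# llama.cpp on CPU: int8 = 13.24 tokens/s, int4 = 21.43 tokens/s
print(round(speedup(13.24, 21.43), 2))
```

Note that the ± spreads in the table mean such ratios carry uncertainty; treat them as rough indicators rather than exact figures.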


### GPU (Metal)

**Command:** `./benchmark.sh --repetitions 10 --max_tokens 512 --device metal --prompt 'Write an essay about the transformer model architecture'`

**Performance Metrics:** (unit: Tokens / second)
| Engine | float32 | float16 | int8 | int4 |
|-----------------------------------------|--------------|---------------|--------------|--------------|
| [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 |
| [tinygrad](/bench_tinygrad/) | - | 29.78 ± 1.18 | - | - |
| [ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 |

*(Data updated: `31st January 2024`)*
