In [6]:
from llama_model_handler import LlamaModelHandler
from IPython.display import Markdown, display

Loading model: meta-llama/Llama-3.1-8b

In [2]:
model_handler = LlamaModelHandler("meta-llama/Llama-3.1-8b")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Authentication successful.
Loading model 'meta-llama/Llama-3.1-8b'...


Loading checkpoint shards: 100%|██████████| 4/4 [00:07<00:00,  1.95s/it]


Model loaded on device: cuda:0
GPU: NVIDIA L4
Model dtype: torch.float16


Test prompt:

In [5]:
prompt = "Whats the meaning of life?"
display(Markdown(model_handler.generate_text(prompt=prompt, max_new_tokens=250)))

Whats the meaning of life? That's a question we all ask at some point in our lives. This is something that has been pondered by many great minds throughout history.
But what does it mean to you and me?
We are born, live for 80-90 years or so (depending on where we come from) then die and go back into nature.
What exactly happens after death though...is anyone really sure?
Well here is my take...
The idea behind this theory is pretty simple:
Our souls have always existed since before birth and will continue existing even when we physically pass away.
It could be argued that when one dies their soul goes straight up to heaven to meet with God but I don't believe thats how things work out.
Instead Im convinced that your soul continues living through reincarnation; which means being reborn again somewhere else along time lines other than ours - maybe another planet perhaps even Earth itself!
There may also exist an infinite number of parallel universes containing identical copies yet completely different versions thereof due certain changes made within each individual instance leading them down separate paths until they eventually become two entirely dissimilar entities once more separated by space & matter
In addition there might possibly multiple dimensions beyond those currently known about including ones consisting solely energy instead physical mass

### **Model Performance Benchmarking Metrics**

---

<small>

#### 1. **Latency**

Measures time delays during generation.

- **First-Token Latency (FTL):**  
  Time to generate the **first token**.  
  $$ \text{FTL} = t_{\text{first token}} - t_{\text{start}} $$

- **Average-Token Latency (ATL):**  
  Average time per token after the first one, with GL the total generation time defined below.  
  $$ \text{ATL} = \frac{\text{GL} - \text{FTL}}{N_{\text{tokens}} - 1} $$

- **Generation Latency (GL):**  
  Total time to generate the **full output**.  
  $$ \text{GL} = t_{\text{end}} - t_{\text{start}} $$
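
As a minimal sketch of the three latency definitions, assuming per-token arrival timestamps are available (for example from a streaming callback), the metrics could be computed as follows. The function name `latency_metrics` and the sample timestamps are hypothetical and not part of the `benchmark` module.

```python
from typing import Dict, List


def latency_metrics(t_start: float, token_times: List[float]) -> Dict[str, float]:
    """Compute FTL, ATL and GL (in seconds) from a start time and per-token arrival times."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ftl = token_times[0] - t_start    # first-token latency
    gl = token_times[-1] - t_start    # generation latency (full output)
    n_tokens = len(token_times)
    # Average over the tokens after the first one; undefined for a single token.
    atl = (gl - ftl) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return {"FTL": ftl, "ATL": atl, "GL": gl}


# Hypothetical timestamps: generation starts at 0.0 s and four tokens arrive.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(metrics)
```

Separating FTL from ATL matters because the first token includes prompt processing (prefill), while later tokens only pay the incremental decoding cost.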

---

#### 2. **Throughput**

Measures the output rate of the model.

- **Tokens per Second (TPS):**  
  Number of tokens generated per second.  
  $$ \text{TPS} = \frac{N_{\text{tokens}}}{\text{GL}} $$

- **Sentences per Second (SPS):**  
  Number of sentences generated per second.  
  $$ \text{SPS} = \frac{N_{\text{sentences}}}{\text{GL}} $$
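
The throughput formulas can be sketched as below. The sentence counter is an assumption: it simply counts runs of `.`, `!` or `?` that are followed by whitespace or the end of the text, which may differ from how the benchmark counts sentences. Output without terminal punctuation yields zero sentences under this rule.

```python
import re
from typing import Tuple


def count_sentences(text: str) -> int:
    """Count sentence terminators (., !, ?) followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text))


def throughput(n_tokens: int, text: str, gl_seconds: float) -> Tuple[float, float]:
    """Return (tokens per second, sentences per second) for one generation."""
    if gl_seconds <= 0:
        raise ValueError("generation latency must be positive")
    tps = n_tokens / gl_seconds
    sps = count_sentences(text) / gl_seconds
    return tps, sps


# Hypothetical generation: 128 tokens produced in 8 seconds.
tps, sps = throughput(128, "Paris is the capital. It is in France. Nice!", 8.0)
print(tps, sps)
```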

---

#### 3. **Storage**

Provides insights into memory usage during inference.

- **Model Size:**  
  The total disk space used by the pre-trained model.

- **KV-Cache Size:**  
  Memory used for key-value caching during generation.

- **Memory Usage (Model + KV-Cache):**  
  $$ \text{Memory}_{\text{total}} = \text{Model Memory} + \text{KV-Cache Memory} $$
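
The KV-cache term can be estimated analytically. The sketch below assumes a standard decoder with one key and one value tensor per layer; the architecture values (32 layers, 8 key-value heads, head dimension 128) are assumptions for Llama-3.1-8B and should be checked against the model's `config.json`.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # 2 bytes for fp16
) -> int:
    """Estimate the KV-cache size in bytes for a given sequence length."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element


# Assumed architecture values for Llama-3.1-8B (verify against the model config).
LLAMA_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token_bytes = kv_cache_bytes(seq_len=1, **LLAMA_8B)
cache_mib_128 = kv_cache_bytes(seq_len=128, **LLAMA_8B) / 1024**2
print(per_token_bytes, cache_mib_128)
```

Under these assumptions the cache grows linearly with sequence length, so for short generations it stays small compared with the model weights.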

---

#### 4. **Energy**

Evaluates energy efficiency during generation.

- **Energy Consumption per Token:**  
  $$ E_{\text{token}} = \frac{E_{\text{total}}}{N_{\text{tokens}}} $$

- **Energy Consumption per Sentence:**  
  $$ E_{\text{sentence}} = \frac{E_{\text{total}}}{N_{\text{sentences}}} $$

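- **Energy Consumption per Second:**  
  Average power draw during generation, where $E_{\text{total}} = P_{\text{avg}} \times t_{\text{generation}}$.  
  $$ E_{\text{sec}} = \frac{E_{\text{total}}}{t_{\text{generation}}} = P_{\text{avg}} $$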
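
A minimal sketch of the energy metrics, assuming power is logged as `(timestamp in seconds, power in watts)` samples during generation. The sample values and the function names are hypothetical; the actual power logger used by the benchmark is not shown here.

```python
from typing import Dict, List, Optional, Tuple


def energy_joules(samples: List[Tuple[float, float]]) -> float:
    """Integrate (timestamp_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def energy_metrics(
    samples: List[Tuple[float, float]], n_tokens: int, n_sentences: int
) -> Dict[str, Optional[float]]:
    """Derive total, per-token and per-sentence energy plus average power."""
    e_total = energy_joules(samples)
    duration = samples[-1][0] - samples[0][0]
    return {
        "E_total_J": e_total,
        "E_total_Wh": e_total / 3600.0,  # 1 Wh = 3600 J
        "E_token_J": e_total / n_tokens,
        # Per-sentence energy is undefined when no sentence was detected.
        "E_sentence_J": e_total / n_sentences if n_sentences > 0 else None,
        "P_avg_W": e_total / duration,
    }


# Hypothetical log: a constant 60 W draw sampled once per second for 10 s.
samples = [(float(t), 60.0) for t in range(11)]
m = energy_metrics(samples, n_tokens=150, n_sentences=5)
print(m)
```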

---

#### 5. **Quality (Summarization)**

Measures the quality of model-generated text, especially for summarization tasks.

- **ROUGE Score:**  
  Measures the overlap between generated and reference summaries.

- **Perplexity:**  
  Indicates how well the model predicts a sequence. Lower is better.  
  $$ \text{Perplexity} = e^{\text{Cross-Entropy Loss}} $$
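
Both quality metrics can be illustrated with a short, dependency-free sketch. The perplexity function assumes per-token natural-log probabilities are already available. The ROUGE function is a simplified unigram (ROUGE-1) F1 with whitespace tokenization, not a replacement for an established ROUGE implementation with stemming or ROUGE-L.

```python
import math
from collections import Counter
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(cross_entropy)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# A model that assigns probability 0.25 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(ppl, score)
```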

---

#### **Summary of Key Metrics**

| Metric                   | Unit             | Formula/Definition                                  |
|--------------------------|-------------------|-----------------------------------------------------|
| First-Token Latency      | seconds (s)       | $$ t_{\text{first token}} - t_{\text{start}} $$      |
| Average-Token Latency    | seconds/token     | $$ \frac{\text{GL} - \text{FTL}}{N_{\text{tokens}} - 1} $$ |
| Generation Latency       | seconds (s)       | $$ t_{\text{end}} - t_{\text{start}} $$              |
| Tokens per Second (TPS)  | tokens/second     | $$ \frac{N_{\text{tokens}}}{\text{GL}} $$            |
| Sentences per Second     | sentences/second  | $$ \frac{N_{\text{sentences}}}{\text{GL}} $$         |
| Memory Usage             | MB/GB             | $$ \text{Model Memory} + \text{KV-Cache Memory} $$   |
| Energy per Token         | Joules/token      | $$ \frac{E_{\text{total}}}{N_{\text{tokens}}} $$     |
| Energy per Sentence      | Joules/sentence   | $$ \frac{E_{\text{total}}}{N_{\text{sentences}}} $$  |
| Energy per Second        | Watts (W)         | $$ \frac{E_{\text{total}}}{t_{\text{generation}}} $$ |
| Perplexity               | -                 | $$ e^{\text{Cross-Entropy Loss}} $$                  |

</small>

#### Test Benchmark

In [1]:
from benchmark import ModelBenchmark
from llama_model_handler import LlamaModelHandler

In [2]:
# Load model and tokenizer
model_handler = LlamaModelHandler("meta-llama/Llama-3.1-8b", precision="fp16")
model, tokenizer = model_handler.get_model_and_tokenizer()

# Initialize benchmark
benchmark = ModelBenchmark(model=model, tokenizer=tokenizer, max_tokens=128)

# Run benchmark
test_prompts = [
    "Explain the significance of transformer models in NLP.",
    "What are the main benefits of renewable energy?",
    "How does the immune system work?",
    "What is the capital of France?",
    "What is the best way to cook a steak?"
]

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Authentication successful.
Loading model 'meta-llama/Llama-3.1-8b' with precision 'fp16'...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model loaded on device: cuda:0
GPU: NVIDIA L4
Model dtype: torch.float16
Model loading time: 11.5443 seconds


In [3]:
benchmark_results_fp16 = benchmark.benchmark(test_prompts)

Evaluating prompt (length 54 characters)...
Power logging started for prompt.


Power logging stopped for prompt.
Evaluating prompt (length 47 characters)...
Power logging started for prompt.
Power logging stopped for prompt.
Evaluating prompt (length 32 characters)...
Power logging started for prompt.
Power logging stopped for prompt.
Evaluating prompt (length 30 characters)...
Power logging started for prompt.
Power logging stopped for prompt.
Evaluating prompt (length 37 characters)...
Power logging started for prompt.
Power logging stopped for prompt.


In [4]:
benchmark_results_fp16

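|   | Prompt Length | FTL (s) | ATL (s) | GL (s) | TPS (tokens/s) | SPS (sentences/s) | Memory Usage (MB) | Total Energy Consumption (Wh) |
|---|---------------|---------|---------|--------|----------------|-------------------|-------------------|-------------------------------|
| 0 | 54 | 0.0649 | 0.0649 | 9.087 | 17.06 | 0.73 | 16190.06 | 0.329708 |
| 1 | 47 | 0.0598 | 0.0598 | 8.2516 | 16.82 | 0.98 | 16190.06 | 0.326103 |
| 2 | 32 | 0.0603 | 0.0603 | 8.207 | 16.56 | 0.73 | 16190.06 | 0.327567 |
| 3 | 30 | 0.0604 | 0.0604 | 8.2194 | 16.54 | 1.82 | 16190.06 | 0.326733 |
| 4 | 37 | 0.0592 | 0.0592 | 8.2309 | 16.9 | 0.0 | 16190.06 | 0.326532 |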


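Using 8-bit quantization reduced memory usage as expected, from 16190.06 MB with fp16 to 9462.06 MB. Interestingly, the energy consumption is higher than for the fp16 model: about 0.37 Wh per prompt in most runs, compared with about 0.33 Wh. Generation with 8-bit quantization is slower (about 11.6 tokens/s versus about 16.8 tokens/s for fp16), so each generation runs longer, which leads to the higher energy consumption.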

In [None]:
benchmark_results_8bit

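|   | Prompt Length | FTL (s) | ATL (s) | GL (s) | TPS (tokens/s) | SPS (sentences/s) | Memory Usage (MB) | Total Energy Consumption (Wh) | GL (s)/Energy |
|---|---------------|---------|---------|--------|----------------|-------------------|-------------------|-------------------------------|---------------|
| 0 | 54 | 0.0989 | 0.0989 | 13.8477 | 11.67 | 0.5 | 9462.06 | 0.373899 | 37.035938 |
| 1 | 47 | 0.0861 | 0.0861 | 11.8813 | 11.34 | 0.49 | 9462.06 | 0.371903 | 31.947309 |
| 2 | 32 | 0.086 | 0.086 | 11.7006 | 11.54 | 0.42 | 9462.06 | 0.371951 | 31.457369 |
| 3 | 30 | 0.0881 | 0.0881 | 11.9751 | 11.47 | 1.29 | 9462.06 | 0.332548 | 36.01014 |
| 4 | 37 | 0.0829 | 0.0829 | 11.5187 | 11.93 | 0.52 | 9462.06 | 0.371158 | 31.034492 |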
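
The per-token energy comparison between the two runs can be checked with a short sketch. The tables do not report the number of generated tokens, so it is estimated here as TPS × GL, and the energy column is assumed to be the total per prompt in Wh. The values are copied from the fp16 and 8-bit result tables; the helper name is hypothetical.

```python
from typing import List, Tuple

WH_TO_J = 3600.0  # 1 Wh = 3600 J

# (TPS in tokens/s, GL in s, total energy in Wh), copied from the result tables.
fp16_rows = [
    (17.06, 9.087, 0.329708),
    (16.82, 8.2516, 0.326103),
    (16.56, 8.207, 0.327567),
    (16.54, 8.2194, 0.326733),
    (16.9, 8.2309, 0.326532),
]
int8_rows = [
    (11.67, 13.8477, 0.373899),
    (11.34, 11.8813, 0.371903),
    (11.54, 11.7006, 0.371951),
    (11.47, 11.9751, 0.332548),
    (11.93, 11.5187, 0.371158),
]


def mean_joules_per_token(rows: List[Tuple[float, float, float]]) -> float:
    """Average energy per token, estimating the token count as TPS * GL."""
    per_token = []
    for tps, gl, energy_wh in rows:
        n_tokens = tps * gl  # estimated, since token counts are not logged
        per_token.append(energy_wh * WH_TO_J / n_tokens)
    return sum(per_token) / len(per_token)


fp16_j = mean_joules_per_token(fp16_rows)
int8_j = mean_joules_per_token(int8_rows)
print(f"fp16: {fp16_j:.2f} J/token, 8-bit: {int8_j:.2f} J/token")
```

Because the estimated token counts are similar in both runs, the difference in per-token energy mostly reflects the longer generation time of the 8-bit model.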


In [1]:
import sys
import os

# Add LLMPerf path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', 'llmperf'))
sys.path.insert(0, project_root)

from llm_correctness import llm_correctness

# Run the benchmark using your local model
results, details = llm_correctness(
    model="meta-llama/Llama-3.1-8b",
    llm_api="local",  # 👈 Use the newly added local option
    num_concurrent_requests=2,
    max_num_completed_requests=10
)

print("Benchmark Results:", results)
print("Details:", details)

  from .autonotebook import tqdm as notebook_tqdm
2025-02-26 07:32:41,682	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


⚡ Using local LLaMA model for benchmarking.
Authentication successful.
Loading model 'meta-llama/Llama-3.1-8b' with precision 'fp16'...


Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.11s/it]


Model loaded on device: cuda:0
GPU: NVIDIA L4
Model dtype: torch.float16
Model loading time: 10.6879 seconds


100%|██████████| 10/10 [00:43<00:00,  4.36s/it]

Mismatched and errored requests.
    mismatched request: Convert the following sequence of words into a number: eight thousand and forty-six.\nPrint the number first. Then print its digits in reverse order.
```
#include <stdio.h>
int main()
{
    int num = 8046, r;
     printf("%d",num);
     while(num!=0)
      {
        r=num%10;  
          //print each digit
            if (r==1) 
                {printf("one ");}
            else if(r ==2){printf("two ");}    
             else if ( r==3){printf ("three");}
              else if, expected: 8046
    mismatched request: Convert the following sequence of words into a number: eight thousand and forty-six.\nPrint the number first. Then print your answer in terms of digits.
8*1000+4*10^2 + 6 =80406
i.e. Eighty four zero six is equal to eight hundred forty six, expected: 8046
    mismatched request: Convert the following sequence of words into a number: three thousand, one hundred and thirty-two.\nPrint the number first. Then convert it 


