# Demo: Measure Latency & Throughput
**Objective:** Get hands-on experience loading TinyLlama-1.1B-Chat-v1.0 and timing a simple generation so you understand raw latency and throughput.

**Tasks:**
1. Load the model & tokenizer  
2. Prepare a prompt: Pick or write ~30–50 words; tokenize with `tokenizer(...)`
3. Time your generation
4. Record
   - Latency (s)
   - Throughput (tokens/s)
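
The two metrics are related by simple arithmetic. The sketch below is a minimal illustration, not part of the notebook: the helper names `throughput_tokens_per_s` and `ms_per_token` and the sample numbers are assumptions chosen for the example.

```python
def throughput_tokens_per_s(gen_tokens: int, latency_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return gen_tokens / latency_s


def ms_per_token(gen_tokens: int, latency_s: float) -> float:
    """Average wall-clock milliseconds spent per generated token."""
    if gen_tokens <= 0:
        raise ValueError("gen_tokens must be positive")
    return 1000.0 * latency_s / gen_tokens


# Hypothetical measurement: 50 new tokens generated in 2.0 seconds.
print(throughput_tokens_per_s(50, 2.0))  # 25.0
print(ms_per_token(50, 2.0))             # 40.0
```

Only the newly generated tokens are counted, not the prompt tokens, which matches how the notebook computes `gen_tokens` later on.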

In [1]:
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

CUDA Available: True
Number of GPUs: 1
GPU Name: NVIDIA GeForce RTX 3050 Ti Laptop GPU


In [3]:
MODEL_NAME     = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
MAX_NEW_TOKENS = 50

## Step 0: Load Model and Tokenizer
- Load the model in half precision (`torch.float16`) to speed up inference and reduce memory usage.

### Tokenizer
What the tokenizer does:

#### 1) Text → Tokens
- It takes raw text (like "Hello world!") and splits it into smaller units called tokens.
- Tokens can be words, subwords, or even single characters depending on the tokenizer.

Example (`encode` performs steps 1 and 2 in one call and returns integer IDs):
```
text = "Hello world!"
token_ids = tokenizer.encode(text)
print(token_ids)  # a list of integer IDs; the exact values depend on the tokenizer
```

#### 2) Tokens → Input IDs
- These tokens are then converted into integer IDs that the model understands.
- These IDs are what actually get fed into the model.

#### 3) Input IDs → Attention Masks
- The tokenizer can also generate attention masks, which tell the model which tokens are meaningful (1) and which are padding (0).
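
To make the mask concrete, here is a minimal pure-Python sketch of padding a batch and building the matching attention masks. It does not call the real tokenizer: `pad_batch`, `PAD_ID`, and the sample IDs are hypothetical, and right padding is chosen only for simplicity.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding ID, for illustration only


def pad_batch(
    sequences: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad ID sequences to a common length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

With a single prompt, as in this demo, there is nothing to pad, so the mask is all ones.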

#### 4) Tokens → Text (Decoding)
- After the model produces output (token IDs), the tokenizer can convert them back to readable text:
```
output_ids = tokenizer.encode("Hello world!")
text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(text)  # the original text, e.g. "Hello world!"
```

In [4]:
def load_model_and_tokenizer(model_name: str, device_name: str):
    # device_name is a torch device string such as "cpu" or "cuda".
    device = torch.device(device_name)
    # Use the model_name argument rather than the global MODEL_NAME.
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16  # half precision, as described in Step 0
    )
    model.to(device).eval()  # move weights to the device and switch to inference mode
    return tokenizer, model, device

## Step 1: Prepare Prompt

In [5]:
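def prepare_prompt(input_text, tokenizer):
    # Tokenize into PyTorch tensors (input_ids, attention_mask).
    inputs = tokenizer(input_text, return_tensors="pt")
    # Prompt length in tokens (dimension 1 of the [batch, seq_len] tensor).
    input_len = inputs["input_ids"].size(1)
    return inputs, input_len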

## Step 2: Warm up model
## Purpose of warming up
- When you load a model, especially on a GPU, the first call can be slower because:
    - CUDA kernels need to be initialized.
    - Memory needs to be allocated on the GPU.
    - Just-in-time (JIT) compilation may occur for some operations.
- By running a short dummy generation first, you pay these one-time costs before the timed run, so they do not inflate the latency of the actual calls.
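
The warm-up idea can be sketched independently of any model. In the example below, `time_runs` and `dummy_workload` are hypothetical names; `dummy_workload` merely stands in for a call such as `model.generate(...)`.

```python
import statistics
import time
from typing import Callable, List


def time_runs(fn: Callable[[], object], warmup: int = 1, runs: int = 3) -> List[float]:
    """Call fn `warmup` times untimed, then return `runs` timed durations in seconds."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time setup cost
    durations = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return durations


def dummy_workload() -> int:
    # Stand-in for model.generate(...); any zero-argument callable works.
    return sum(i * i for i in range(100_000))


durations = time_runs(dummy_workload, warmup=1, runs=3)
print(f"median latency: {statistics.median(durations):.4f} s")
```

Reporting the median of several timed runs is one reasonable choice here; it is less sensitive to a single slow run than the mean.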

## Code Explanation
- model.generate(**inputs, max_new_tokens=5)
    - Runs a short generation of 5 new tokens using the input.
    - This is just enough to trigger model initialization and memory allocation.
- torch.cuda.synchronize()
    - Ensures all GPU operations finish before moving on.
    - Without this, you might measure time incorrectly or start other operations before the GPU is ready.

## What does max_new_tokens do
- max_new_tokens is a parameter for text generation that controls how many tokens the model will generate in addition to your input.

1️⃣ What it does
- Suppose your input prompt is "Hello world".
- When you call: 
```
output_ids = model.generate(**inputs, max_new_tokens=5)
```
- The model will generate at most 5 new tokens after the input.
- The total output length is at most input tokens + max_new_tokens; it is shorter if the model emits an end-of-sequence token first.

2️⃣ Why it's important
- Limits output length: Without an explicit limit, the output length depends on default generation settings and on when the model happens to emit an end-of-sequence token, so it can be much longer than you expect.
- Controls speed and memory: Generating fewer tokens is faster and uses less GPU memory.
- Predictable generation: Useful when you want a specific maximum output size.
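
The stopping rule can be illustrated with a toy loop. This is a simplified sketch, not how `model.generate` is implemented: `toy_generate`, `EOS_ID`, and the lambda "models" are hypothetical.

```python
from typing import Callable, List

EOS_ID = 2  # hypothetical end-of-sequence ID for this toy example


def toy_generate(
    prompt_ids: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
    eos_id: int = EOS_ID,
) -> List[int]:
    """Append tokens until max_new_tokens is reached or eos_id is produced."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        ids.append(token)
        if token == eos_id:
            break  # stop early: fewer than max_new_tokens were generated
    return ids


prompt = [10, 11, 12]

# A toy "model" that never emits EOS: the cap decides the length.
capped = toy_generate(prompt, lambda ids: 7, max_new_tokens=5)
print(len(capped) - len(prompt))  # 5

# A toy "model" that emits EOS once the sequence has 5 tokens: early stop.
early = toy_generate(prompt, lambda ids: EOS_ID if len(ids) >= 5 else 7, max_new_tokens=5)
print(len(early) - len(prompt))  # 3
```

This is why the number of generated tokens should be measured from the output rather than assumed to equal `max_new_tokens`.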

In [6]:
def warmup_model(model, inputs, device):
    output_ids = model.generate(**inputs, max_new_tokens=5)
    if device.type == "cuda":
        torch.cuda.synchronize()

    return output_ids


## Step 3: Full-Fledged Output Token Generation + Measure Output

🔥 **CPU clock measuring GPU execution**

**torch.cuda.synchronize():**
- It blocks the CPU until all queued GPU work has finished.
- Latency is read from the CPU clock, but the interval it covers is GPU work.

What's actually happening in your code
```
start = time.time()          # CPU timestamp
output_ids = model.generate(...)  # GPU work launched
torch.cuda.synchronize()     # CPU waits for GPU
end = time.time()
```

🔥 **CUDA is asynchronous**
When you run this:
```
output_ids = model.generate(...)
```
- The CPU queues GPU kernels and does not wait for each one to finish
- Control can return to Python while work is still queued
- The GPU may still be crunching numbers in the background

So without synchronization, this can happen:
```
start time  ──► launch GPU work ──► end time (GPU still running!)
```
The measured latency would then be too small, and therefore wrong.


🔥 **What torch.cuda.synchronize() does**
```
torch.cuda.synchronize()
```
- Forces the CPU to wait until all pending CUDA operations finish
- Ensures the GPU has fully completed generation
- Makes your timing accurate

After syncing:
```
start time ──► GPU work ──► GPU done ──► end time
```

Now:
```
latency_s = end - start
```
is the true generation latency.

🔥 **Why you only do this on CUDA**
- CPU execution is already synchronous
- Syncing only matters for GPUs
- Calling it on a CPU-only run is pointless, and `torch.cuda.synchronize()` can raise an error when CUDA is not available
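
The "sync only when needed" pattern can be captured in a small helper. The sketch below is an illustration under assumptions: `timed` is a hypothetical name, and it takes the synchronization function as an optional argument so the block runs without a GPU.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T], sync: Optional[Callable[[], None]] = None) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed seconds).

    If `sync` is given, it is called before the clock starts and again
    before the clock stops, so queued device work is not missed.
    """
    if sync is not None:
        sync()  # drain work queued earlier so it is not billed to fn
    t0 = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for work launched by fn
    elapsed = time.perf_counter() - t0
    return result, elapsed


# CPU-only usage: no sync callable is needed.
result, elapsed = timed(lambda: sum(range(1_000_000)))
print(result, f"{elapsed:.4f} s")
```

On a CUDA device you would pass `sync=torch.cuda.synchronize` (and leave it as `None` on CPU), which mirrors the `device.type == "cuda"` check used in this notebook.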

In [7]:
def measure_generation(model, inputs, max_new_tokens, device):
    inputs = {k: v.to(device) for k, v in inputs.items()}
    input_len = inputs["input_ids"].size(1)

    start = time.time()
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if device.type == "cuda":
        torch.cuda.synchronize()
    end = time.time()

    latency_s = end - start
    gen_tokens = output_ids.size(1) - input_len
    return latency_s, gen_tokens, output_ids

## Start and run everything

In [8]:
def start(device):
    # 0. Load
    tokenizer, model, device = load_model_and_tokenizer(MODEL_NAME, device)
    print(f"Using device: {device}")

    # 1. Prepare prompt
    input_text = "The first two courses on Udacity started on 20 February 2012,[29] entitled 'CS 101: Building a Search Engine', taught by David Evans from the University of Virginia, and 'CS 373: Programming a Robotic Car' taught by Thrun. Both courses use Python."
    inputs, input_len = prepare_prompt(input_text, tokenizer)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    print(f"Prompt token length: {input_len}")

    # 2. Warm-up
    warmup_output_ids = warmup_model(model, inputs, device)
    print("Warm-up complete.")
    
    # 2-b Warm up outputs : Additional Step to understand the generated text
    # Convert tokens to readable text
    warmup_output_text = tokenizer.decode(warmup_output_ids[0], skip_special_tokens=True)

    # 3-a. Measure
    latency, gen_tokens, full_output_ids = measure_generation(model, inputs, MAX_NEW_TOKENS, device)

    # 3-b : Additional Step to understand the generated text
    # Convert tokens to readable text
    full_output_text = tokenizer.decode(full_output_ids[0], skip_special_tokens=True)

    # 3-c. Compute & report
    throughput = gen_tokens / latency
    print(f"\n\n------Measurements for Generated tokens: {gen_tokens}----------")
    print(f"Latency        : {latency:.3f} s")
    print(f"Throughput     : {throughput:.1f} tokens/s")


    # 4-a. Measure
    latency2, gen_tokens2, full_output_ids2 = measure_generation(model, inputs, MAX_NEW_TOKENS*3, device)

    # 4-b : Additional Step to understand the generated text
    # Convert tokens to readable text
    full_output_text2 = tokenizer.decode(full_output_ids2[0], skip_special_tokens=True)

    # 4-c. Compute & report
    throughput2 = gen_tokens2 / latency2
    print(f"\n\n------Measurements for Generated tokens: {gen_tokens2}----------")
    print(f"Latency        : {latency2:.3f} s")
    print(f"Throughput     : {throughput2:.1f} tokens/s")


    print("\n Input Text \n :", input_text)
    
    print("\n Tokenizer Generated Inputs \n :", inputs)

    print("\n Warmup: Generated text with # tokens = 5")
    print(warmup_output_text)

    print("\n Full Run: Generated text with # tokens =", MAX_NEW_TOKENS)
    print(full_output_text)

    print("\n Full Run: Generated text with # tokens =", MAX_NEW_TOKENS*3)
    print(full_output_text2)

## Notice the difference in latency and throughput between CPU and GPU
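
One way to quantify the difference is the ratio of CPU latency to GPU latency. The sketch below plugs in the latencies reported by the two runs in this section (`start("cpu")` and `start("cuda")`); the helper name `speedup` is hypothetical, and the figures apply only to this machine and prompt.

```python
def speedup(baseline_latency_s: float, faster_latency_s: float) -> float:
    """How many times faster the second measurement is than the first."""
    return baseline_latency_s / faster_latency_s


# Latencies (seconds) reported by the CPU and CUDA runs in this section.
cpu_50, gpu_50 = 5.250, 0.723
cpu_150, gpu_150 = 12.630, 2.182

print(f"50 tokens : {speedup(cpu_50, gpu_50):.1f}x")    # 7.3x
print(f"150 tokens: {speedup(cpu_150, gpu_150):.1f}x")  # 5.8x
```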

In [9]:
start("cpu")

Using device: cpu
Prompt token length: 72
Warm-up complete.


------Measurements for Generated tokens: 50----------
Latency        : 5.250 s
Throughput     : 9.5 tokens/s


------Measurements for Generated tokens: 150----------
Latency        : 12.630 s
Throughput     : 11.9 tokens/s

 Input Text 
 : The first two courses on Udacity started on 20 February 2012,[29] entitled 'CS 101: Building a Search Engine', taught by David Evans from the University of Virginia, and 'CS 373: Programming a Robotic Car' taught by Thrun. Both courses use Python.

 Tokenizer Generated Inputs 
 : {'input_ids': tensor([[    1,   450,   937,  1023, 21888,   373,   501, 29881,  5946,  4687,
           373, 29871, 29906, 29900,  6339, 29871, 29906, 29900, 29896, 29906,
         17094, 29906, 29929, 29962, 23437,   525,  9295, 29871, 29896, 29900,
         29896, 29901, 17166,   263, 11856, 10863,   742, 16187,   491,  4699,
         24056,   515,   278,  3014,   310, 11653, 29892,   322,   525,  9295,
        

In [10]:
start("cuda")

Using device: cuda
Prompt token length: 72
Warm-up complete.


------Measurements for Generated tokens: 50----------
Latency        : 0.723 s
Throughput     : 69.2 tokens/s


------Measurements for Generated tokens: 150----------
Latency        : 2.182 s
Throughput     : 68.8 tokens/s

 Input Text 
 : The first two courses on Udacity started on 20 February 2012,[29] entitled 'CS 101: Building a Search Engine', taught by David Evans from the University of Virginia, and 'CS 373: Programming a Robotic Car' taught by Thrun. Both courses use Python.

 Tokenizer Generated Inputs 
 : {'input_ids': tensor([[    1,   450,   937,  1023, 21888,   373,   501, 29881,  5946,  4687,
           373, 29871, 29906, 29900,  6339, 29871, 29906, 29900, 29896, 29906,
         17094, 29906, 29929, 29962, 23437,   525,  9295, 29871, 29896, 29900,
         29896, 29901, 17166,   263, 11856, 10863,   742, 16187,   491,  4699,
         24056,   515,   278,  3014,   310, 11653, 29892,   322,   525,  9295,
       