# GPU Memory limits

## Goal

Find the maximum sequence length that can be used for training or inference.

## Code

- https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct
- https://huggingface.co/docs/transformers/en/main_classes/text_generation

In [None]:
import os
from arc25.utils import get_least_used_gpu_index
from arc25.logging import configure_logging, log_execution_time

configure_logging()
os.environ['CUDA_VISIBLE_DEVICES'] = str(get_least_used_gpu_index())

In [None]:
import logging
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
def generate_text_with_desired_length(model, tokenizer, sequence_length):
    """Generate exactly `sequence_length` new tokens from a short chat prompt and log the speed."""
    prompt = "Write a long essay about the impact of artificial intelligence on modern society."
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    t0 = time.time()
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=sequence_length,
        min_new_tokens=sequence_length,
    )
    logging.info(f"Generation speed {sequence_length/ (time.time() - t0):.2f} tokens/sec for {sequence_length} tokens")
    generated_ids = generated_ids[:, len(model_inputs.input_ids[0]):]
    assert len(generated_ids[0]) == sequence_length, f"Generated sequence length {len(generated_ids[0])} does not match the desired length {sequence_length}"
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [None]:
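def generate_text_with_long_prompt(model, tokenizer, input_length, sequence_length=128):
    """Generate `sequence_length` new tokens from a prompt repeated to roughly `input_length` tokens."""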
    prompt = "Write a long essay about the impact of artificial intelligence on modern society. "
    model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

    n_repeats = input_length // len(model_inputs.input_ids[0])
    model_inputs = tokenizer([prompt*n_repeats], return_tensors="pt").to(model.device)
    real_input_length = len(model_inputs.input_ids[0])
    t0 = time.time()
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=sequence_length,
        min_new_tokens=sequence_length,
    )
    logging.info(f"Generation speed {sequence_length/ (time.time() - t0):.2f} tokens/sec for {real_input_length} input tokens ({input_length})")
    generated_ids = generated_ids[:, len(model_inputs.input_ids[0]):]
    assert len(generated_ids[0]) == sequence_length, f"Generated sequence length {len(generated_ids[0])} does not match the desired length {sequence_length}"
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

## Experiments

In [None]:
parameters = '3B'
base_model_path = f'/home/gbarbadillo/models/Qwen2.5-Coder-{parameters}-Instruct'
model = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype="auto", device_map="auto", quantization_config=None)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

In [None]:
generate_text_with_desired_length(model, tokenizer, sequence_length=32700)

In [None]:
for i in range(3, 15):
    generate_text_with_desired_length(model, tokenizer, sequence_length=2**i)

In [None]:
for i in range(10, 20):
    generate_text_with_long_prompt(model, tokenizer, input_length=2**i)
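
The loops above probe powers of two only. To narrow down the largest length that still fits, a doubling phase followed by a bisection is one option. The sketch below is hypothetical and not part of the notebook: `find_max_length` and its `fits` callback are names introduced here, and it assumes that once a length fails, every longer length fails too.

```python
def find_max_length(fits, start=8, limit=2**20):
    """Return the largest length in [start, limit] for which fits(length) is True.

    Assumes `fits` is monotonic: if a length fails, all longer lengths fail.
    Returns None when even `start` does not fit.
    """
    if not fits(start):
        return None
    low, high = start, start * 2
    # Doubling phase: grow until a length fails or the limit is passed.
    while high <= limit and fits(high):
        low, high = high, high * 2
    # `high` is now an exclusive upper bound (a failing length or limit + 1).
    high = min(high, limit + 1)
    # Bisection phase: shrink the interval between the last success and the bound.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low


# Toy stand-in for a real GPU probe: pretend anything above 40000 tokens fails.
print(find_max_length(lambda n: n <= 40000))
```

On a GPU, `fits` could wrap `generate_text_with_desired_length` in a `try`/`except torch.cuda.OutOfMemoryError` and return `False` on failure. Each probe runs a full generation, so the search can be slow at long lengths.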

```
# 0.5B
1468MiB
2025-06-26 17:43:40,263 - root - INFO - generate_text_with_desired_length - Generation speed 13.90 tokens/sec for 8 tokens
2025-06-26 17:43:40,583 - root - INFO - generate_text_with_desired_length - Generation speed 50.32 tokens/sec for 16 tokens
2025-06-26 17:43:41,201 - root - INFO - generate_text_with_desired_length - Generation speed 51.90 tokens/sec for 32 tokens
2025-06-26 17:43:42,433 - root - INFO - generate_text_with_desired_length - Generation speed 52.05 tokens/sec for 64 tokens
2025-06-26 17:43:44,979 - root - INFO - generate_text_with_desired_length - Generation speed 50.33 tokens/sec for 128 tokens
2025-06-26 17:43:49,982 - root - INFO - generate_text_with_desired_length - Generation speed 51.19 tokens/sec for 256 tokens
2025-06-26 17:44:00,041 - root - INFO - generate_text_with_desired_length - Generation speed 50.91 tokens/sec for 512 tokens
2025-06-26 17:44:20,302 - root - INFO - generate_text_with_desired_length - Generation speed 50.55 tokens/sec for 1024 tokens
2025-06-26 17:44:59,880 - root - INFO - generate_text_with_desired_length - Generation speed 51.75 tokens/sec for 2048 tokens
2025-06-26 17:46:19,066 - root - INFO - generate_text_with_desired_length - Generation speed 51.73 tokens/sec for 4096 tokens
2025-06-26 17:48:58,518 - root - INFO - generate_text_with_desired_length - Generation speed 51.38 tokens/sec for 8192 tokens
2025-06-26 18:18:20,443 - root - INFO - generate_text_with_desired_length - Generation speed 49.80 tokens/sec for 32700 tokens
1942MiB

2025-06-26 18:02:58,654 - root - INFO - generate_text_with_long_prompt - Generation speed 43.89 tokens/sec for 953 input tokens (1024)
2025-06-26 18:03:01,348 - root - INFO - generate_text_with_long_prompt - Generation speed 47.64 tokens/sec for 1905 input tokens (2048)
2025-06-26 18:03:04,092 - root - INFO - generate_text_with_long_prompt - Generation speed 46.84 tokens/sec for 3823 input tokens (4096)
2025-06-26 18:03:06,782 - root - INFO - generate_text_with_long_prompt - Generation speed 47.91 tokens/sec for 7645 input tokens (8192)
2025-06-26 18:03:09,736 - root - INFO - generate_text_with_long_prompt - Generation speed 44.17 tokens/sec for 15289 input tokens (16384)
2025-06-26 18:03:13,410 - root - INFO - generate_text_with_long_prompt - Generation speed 35.73 tokens/sec for 30577 input tokens (32768)
Token indices sequence length is longer than the specified maximum sequence length for this model (61167 > 32768). Running this sequence through the model will result in indexing errors
OOM

# 1.5B
3700MiB
2025-06-26 18:35:18,660 - root - INFO - generate_text_with_desired_length - Generation speed 39.62 tokens/sec for 32700 tokens
4648MiB

# 3B
6882MiB
2025-06-26 19:30:47,143 - root - INFO - generate_text_with_desired_length - Generation speed 27.40 tokens/sec for 32700 tokens
8002MiB

# 7B
14782MiB
2025-06-26 17:49:53,393 - root - INFO - generate_text_with_desired_length - Generation speed 12.97 tokens/sec for 8 tokens
2025-06-26 17:49:53,819 - root - INFO - generate_text_with_desired_length - Generation speed 37.77 tokens/sec for 16 tokens
2025-06-26 17:49:54,590 - root - INFO - generate_text_with_desired_length - Generation speed 41.66 tokens/sec for 32 tokens
2025-06-26 17:49:56,079 - root - INFO - generate_text_with_desired_length - Generation speed 43.03 tokens/sec for 64 tokens
2025-06-26 17:49:59,066 - root - INFO - generate_text_with_desired_length - Generation speed 42.88 tokens/sec for 128 tokens
2025-06-26 17:50:04,982 - root - INFO - generate_text_with_desired_length - Generation speed 43.28 tokens/sec for 256 tokens
2025-06-26 17:50:17,005 - root - INFO - generate_text_with_desired_length - Generation speed 42.59 tokens/sec for 512 tokens
2025-06-26 17:50:40,825 - root - INFO - generate_text_with_desired_length - Generation speed 42.99 tokens/sec for 1024 tokens
2025-06-26 17:51:30,304 - root - INFO - generate_text_with_desired_length - Generation speed 41.39 tokens/sec for 2048 tokens
15094MiB
2025-06-26 17:53:14,419 - root - INFO - generate_text_with_desired_length - Generation speed 39.34 tokens/sec for 4096 tokens
15768MiB
2025-06-26 19:25:33,969 - root - INFO - generate_text_with_desired_length - Generation speed 32.29 tokens/sec for 8000 tokens
16552MiB
2025-06-26 19:20:14,560 - root - INFO - generate_text_with_desired_length - Generation speed 26.03 tokens/sec for 16000 tokens
21088MiB
2025-06-26 18:37:14,045 - root - INFO - generate_text_with_desired_length - Generation speed 18.63 tokens/sec for 32700 tokens
20344MiB

2025-06-26 18:00:57,276 - root - INFO - generate_text_with_long_prompt - Generation speed 35.11 tokens/sec for 953 input tokens (1024)
2025-06-26 18:01:00,889 - root - INFO - generate_text_with_long_prompt - Generation speed 35.50 tokens/sec for 1905 input tokens (2048)
2025-06-26 18:01:05,198 - root - INFO - generate_text_with_long_prompt - Generation speed 29.77 tokens/sec for 3823 input tokens (4096)
2025-06-26 18:01:11,104 - root - INFO - generate_text_with_long_prompt - Generation speed 21.84 tokens/sec for 7645 input tokens (8192)
2025-06-26 18:01:20,953 - root - INFO - generate_text_with_long_prompt - Generation speed 13.04 tokens/sec for 15289 input tokens (16384)
2025-06-26 18:01:39,310 - root - INFO - generate_text_with_long_prompt - Generation speed 7.00 tokens/sec for 30577 input tokens (32768)
Token indices sequence length is longer than the specified maximum sequence length for this model (61167 > 32768). Running this sequence through the model will result in indexing errors
OOM
```
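
To compare runs without reading the raw log, the speed lines can be turned into rows. This is a minimal sketch that assumes the two message formats shown in the log above. `parse_generation_log` is a hypothetical helper, not something provided by `arc25`.

```python
import re

# Matches both message formats produced by the two functions:
#   "Generation speed 51.19 tokens/sec for 256 tokens"
#   "Generation speed 43.89 tokens/sec for 953 input tokens (1024)"
LOG_PATTERN = re.compile(
    r"Generation speed (?P<speed>[\d.]+) tokens/sec "
    r"for (?P<tokens>\d+) (?P<kind>input tokens|tokens)"
)


def parse_generation_log(text):
    """Return a list of (kind, n_tokens, tokens_per_sec) tuples found in the log text."""
    rows = []
    for line in text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            kind = "prompt" if match["kind"] == "input tokens" else "generation"
            rows.append((kind, int(match["tokens"]), float(match["speed"])))
    return rows


sample = """\
2025-06-26 17:43:49,982 - root - INFO - generate_text_with_desired_length - Generation speed 51.19 tokens/sec for 256 tokens
1942MiB
2025-06-26 18:02:58,654 - root - INFO - generate_text_with_long_prompt - Generation speed 43.89 tokens/sec for 953 input tokens (1024)
"""
rows = parse_generation_log(sample)
print(rows)
```

Lines that do not match, such as the memory readings and the tokenizer warning, are skipped.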

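For short generations, inference speed is similar for the 0.5B and 7B models (about 51 vs. 43 tokens/sec between 32 and 2048 tokens), but GPU utilization is much higher with the 7B model. The gap widens with length: at 32700 generated tokens the 0.5B model still runs at 49.80 tokens/sec, while the 7B model drops to 18.63 tokens/sec.
Both models give an OOM error when the sequence goes beyond 32k tokens (the failing prompt had 61167 tokens), which is longer than the models' maximum sequence length of 32768 tokens.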
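
A rough estimate of the key/value cache size helps explain how memory grows with sequence length. The sketch below uses the standard formula of two tensors (keys and values) per layer. `kv_cache_mib` is a name introduced here, and the configuration values in the example are assumptions about the 0.5B model that should be checked against its `config.json` (`num_hidden_layers`, `num_key_value_heads`, and `hidden_size / num_attention_heads`).

```python
def kv_cache_mib(num_layers, num_kv_heads, head_dim, seq_len,
                 bytes_per_value=2, batch_size=1):
    """Estimate the KV cache size in MiB.

    bytes_per_value=2 corresponds to float16/bfloat16 storage.
    """
    # Factor 2: one tensor for keys and one for values in every layer.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * bytes_per_value * batch_size)
    return total_bytes / 2**20


# Assumed values for the 0.5B model (verify against config.json).
estimate = kv_cache_mib(num_layers=24, num_kv_heads=2, head_dim=64, seq_len=32768)
print(f"{estimate:.0f} MiB")  # prints "384 MiB"
```

Under these assumptions the estimate is of the same order as the growth seen in the 0.5B log (1468MiB to 1942MiB, about 474 MiB). The difference may come from activations and allocator overhead, which this formula ignores.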

Tricks to speed up inference: https://chatgpt.com/c/685d743d-c194-8012-a34d-8cc17e18c9d0