* This Notebook tests the capabilities of TinyLlama quantized models to generate raw token sequences given an input sequence

* To speed up model loading, the models are stored on a Google Drive folder that is mounted below

* The following Gemini thread has been used to help build this up both conceptually and in practice: https://g.co/gemini/share/c5212fd78fad

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [2]:
!nvidia-smi

Sun Jul  6 15:51:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   30C    P0             43W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
import os

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive/',force_remount=True)

path = '/content/drive/MyDrive/dev/'

Mounted at /content/drive/


# setup

I struggled with the installation with GPU support and was only able to fix it with help from https://gemini.google.com/app/f3ec74b59f92f56f

In [6]:
# Install the pre-built wheel for CUDA 12.4
!pip uninstall -y llama-cpp-python
!pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu124
Collecting llama-cpp-python
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.12-cu124/llama_cpp_python-0.3.12-cp311-cp311-linux_x86_64.whl (504.6 MB)
     504.6/504.6 MB 3.3 MB/s eta 0:00:00
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   45.5/45.5 kB 4.4 MB/s eta 0:00:00
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.12


In [62]:
import torch
import time
import numpy as np
import math
from llama_cpp import Llama, LlamaGrammar

In [8]:
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

CUDA available: True
CUDA version: 12.4


# models

## n_gpu_layers

* This Notebook is just to test running various quantized TinyLlama models

* As a result, we use either n_gpu_layers = 0 (CPU only) or n_gpu_layers = 33 (request 33 layers on the GPU)

* Refer to the Gemini explanation below to optimize n_gpu_layers when running the code to collect results

In [None]:
# Select how many layers to offload to the GPU (0 = CPU only)
n_gpu_layers = 0
#n_gpu_layers = 33

https://g.co/gemini/share/1de12d982e5f

That's an excellent and very important question for getting the best performance out of your models. Setting `n_gpu_layers` is the primary way you control the balance between CPU and GPU work.

Here’s a breakdown of what it means and how to find the optimal value for your specific system.

### What `n_gpu_layers` Does

This parameter determines how many layers of the neural network are "offloaded" to your GPU's VRAM.

  * **CPU:** Your computer's main processor. It has access to a large amount of slower system RAM.
  * **GPU:** Your graphics card. It has a smaller amount of very fast video RAM (VRAM) and is designed for the massive parallel calculations that LLMs require.

**The trade-off is simple:**

  * **More layers on the GPU (`n_gpu_layers` is high):** Inference (generating text) is much faster.
  * **Fewer layers on the GPU (`n_gpu_layers` is low):** More work is done on the slower CPU, but it uses less of your precious VRAM.

The "optimal" value is therefore: **The maximum number of layers you can fit into your GPU's VRAM without causing errors, while leaving a small buffer for your operating system.**
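
As a rough illustration of that rule, the sketch below estimates an upper bound for `n_gpu_layers` from a VRAM budget. The function name `max_layers_that_fit`, the assumption that every layer costs the same amount of memory, and the example figures are all hypothetical. Real usage also depends on the KV cache and context size, so treat the result as a starting guess rather than a measurement.

```python
def max_layers_that_fit(total_vram_mib, per_layer_mib, n_layers, reserve_mib=512):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers (an assumption)."""
    if per_layer_mib <= 0:
        raise ValueError("per_layer_mib must be positive")
    # Keep a buffer free for the operating system and other processes.
    usable_mib = max(total_vram_mib - reserve_mib, 0)
    return min(n_layers, int(usable_mib // per_layer_mib))


# Hypothetical numbers: an 8 GiB card and 33 layers of about 300 MiB each.
print(max_layers_that_fit(8192, 300, 33))   # (8192 - 512) // 300 = 25
```

The `reserve_mib` default of 512 is arbitrary; it only stands in for the "small buffer" mentioned above.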

-----

### How to Find the Optimal Value: A Practical Guide

The best way to find the perfect number is through a bit of experimentation. Here is a simple, effective process:

#### Step 1: Know Your Total VRAM

First, check how much VRAM your GPU has.

  * **On Windows:** Open the Task Manager, go to the "Performance" tab, and click on your GPU. Look for "Dedicated GPU Memory".
  * **On Linux (with NVIDIA):** Open a terminal and run the command `nvidia-smi`. It will show you your GPU's total memory (e.g., `8192MiB`).

#### Step 2: Start High and Let the Model Tell You

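You don't need to guess the number of layers. The easiest method is to set `n_gpu_layers` to a very high number. The value is capped at the number of layers the model actually has, so a very high number means "offload everything". The cap is based on the model's layer count, not on free VRAM: if the model does not fit, loading fails with an out-of-memory error (see "What If It Crashes?" below).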

Modify your model loading code like this:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/your/model.gguf",
    # Set a very high number to offload as many layers as possible
    n_gpu_layers=999,  # <--- START WITH A HIGH NUMBER
    n_ctx=2048,
    verbose=True  # <--- Set to True to see the loading output
)
```

#### Step 3: Observe the Output

When you run this code, `llama.cpp` will print detailed information as it loads the model. Look for a line that looks like this:

```
llama_model_load_internal: offloaded 33 of 33 layers to GPU
```

or

```
llm_load_tensors: offloaded 33/33 layers to GPU
```

**That number is your answer.** In this example, the model has 33 layers, and all of them fit into your VRAM. Your optimal `n_gpu_layers` is **33**. You can then set `n_gpu_layers=33` in your code for future runs to be precise.
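
If you capture the loader's log text, the offload count can be pulled out with a regular expression instead of being read by eye. This is a minimal sketch that only assumes the two line formats quoted above; the exact wording of the log may differ between `llama.cpp` versions, and the helper name `parse_offload_line` is hypothetical.

```python
import re

# Matches both styles quoted above:
#   "offloaded 33 of 33 layers to GPU" and "offloaded 33/33 layers to GPU"
OFFLOAD_RE = re.compile(r"offloaded\s+(\d+)\s*(?:of|/)\s*(\d+)\s+layers")


def parse_offload_line(log_text):
    """Return (offloaded, total) from the first matching log line, or None if absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_offload_line("llm_load_tensors: offloaded 33/33 layers to GPU"))  # (33, 33)
```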

#### What If It Crashes?

If you set the number too high manually and your system becomes unstable or the program crashes with a "CUDA out of memory" error, it means you've exceeded your VRAM. In that case, simply lower the number and try again. For example, if `n_gpu_layers=33` crashes, try `30`, then `28`, until you find a stable number.
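
The "lower the number and try again" loop can be automated. The sketch below assumes the failure surfaces as a Python exception; a hard crash of the process cannot be caught this way. `load_with_fallback` and `fake_loader` are hypothetical names, and `fake_loader` only simulates a loader so the example runs without a GPU.

```python
def load_with_fallback(load_model, start_layers, step=2):
    """Call load_model(n) with decreasing n until it succeeds; return (model, n)."""
    n = start_layers
    while True:
        try:
            return load_model(n), n
        except (MemoryError, RuntimeError, ValueError) as err:
            if n == 0:
                # Even the CPU-only configuration failed, so give up.
                raise RuntimeError("model failed to load with 0 GPU layers") from err
            n = max(n - step, 0)


def fake_loader(n_gpu_layers):
    # Stand-in for a real loader: pretends that more than 28 layers exceed VRAM.
    if n_gpu_layers > 28:
        raise MemoryError("simulated CUDA out of memory")
    return f"model with {n_gpu_layers} GPU layers"


model, layers = load_with_fallback(fake_loader, start_layers=33, step=2)
print(layers)  # tries 33, 31, 29, then succeeds at 27
```

In a real notebook, `load_model` would be a small function that wraps the `Llama(...)` call with the chosen `n_gpu_layers`.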

**Pro Tip:** Often, the last one or two layers of a model can be disproportionately large. If you find that offloading all layers works but your computer feels sluggish, try reducing the number by 1 or 2 (e.g., `n_gpu_layers=31` instead of `33`). This leaves a bit more VRAM for your operating system and can lead to a smoother overall experience with a negligible impact on performance.

-----

That's an excellent and very practical question. Finding the optimal `n_gpu_layers` is the key to getting the best performance out of your hardware. Here’s a breakdown of what it means and how to find the best value for your specific setup.

### What `n_gpu_layers` Does

This parameter controls how many layers of the model are "offloaded" from your system's RAM to the GPU's dedicated, high-speed VRAM.

  * **`n_gpu_layers = 0`**: The entire model runs on your CPU. This is the slowest option but uses the least VRAM.
  * **`n_gpu_layers > 0`**: The specified number of layers are moved to the GPU. Since GPUs are massively parallel, they can process these layers much faster than a CPU, leading to a significant speedup in generating tokens.
  * **The Goal**: Offload as many layers as possible to the GPU without running out of VRAM.

### The Trade-Off: Speed vs. VRAM

The main limiting factor is your GPU's VRAM capacity. Each layer you offload consumes a chunk of VRAM. If you try to offload more layers than your VRAM can hold, the program will crash with an out-of-memory error.

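The performance gain is not necessarily linear in the number of offloaded layers. The transformer blocks of a given model are typically the same size, so each offloaded layer moves a similar share of the work to the GPU, but passing data between CPU and GPU adds overhead while the model is split. Measure tokens per second at a few settings rather than assuming a trend.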

### How to Find the Optimal Value: A Simple Experiment

The best way to find the ideal number is to test it empirically.

**Step 1: Start High**

The easiest way to start is to set `n_gpu_layers` to `-1`, or to a very large number like `999`. `llama-cpp-python` treats `-1` as "offload all layers", and a value above the model's layer count has the same effect.

```python
# Try to offload all layers
llm = Llama(
    model_path="path/to/your/model.gguf",
    n_gpu_layers=-1, # -1 means "all possible layers"
    verbose=True # Set to True to see the loading output
)
```

**Step 2: Watch the Output During Loading**

When the model loads with `verbose=True`, it will print detailed information. Look for a line that looks like this:

`llama_model_load_internal: offloaded 33 of 33 layers to GPU`

This tells you the maximum number of layers the model has (`33` in this case) and how many it successfully moved to the GPU. **The number it successfully offloaded is your practical maximum.**

**Step 3: Monitor Your VRAM**

While the model is loaded, open your system's GPU monitoring tool:

  * **NVIDIA GPUs**: Use `nvidia-smi` in your command line/terminal.
  * **AMD GPUs**: Use `radeontop` or the monitoring utility in your driver software.
  * **Windows**: The Performance tab in the Task Manager (select your GPU).

Check how much VRAM is being used. If it's very close to the maximum (e.g., 7.8GB / 8.0GB), you are at the limit.
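
The same check can be scripted by parsing the memory column of the `nvidia-smi` table. This is a sketch under the assumption that the table uses the `used MiB / total MiB` layout shown in the `nvidia-smi` output earlier in this notebook; the helper name `vram_usage` is hypothetical.

```python
import re

# Matches the memory column, e.g. "0MiB /  40960MiB"
MEM_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def vram_usage(smi_text):
    """Return (used_mib, total_mib, fraction_used) for the first GPU row, or None."""
    match = MEM_RE.search(smi_text)
    if match is None:
        return None
    used, total = int(match.group(1)), int(match.group(2))
    fraction = used / total if total else 0.0
    return used, total, fraction


# Row copied from the nvidia-smi output at the top of this notebook.
row = "| N/A   30C    P0             43W /  400W |       0MiB /  40960MiB |      0%      Default |"
print(vram_usage(row))  # (0, 40960, 0.0)
```

For scripting, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv` reports the same figures in a form that is easier to parse.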

**Step 4: Fine-Tune if Necessary**

  * **If it crashes with an "out of memory" error:** Your GPU can't handle all the layers. Reduce the number. If the model has 33 layers, try `n_gpu_layers=28`, then `25`, and so on, until it loads successfully.
  * **If it loads but VRAM is maxed out:** You've found a good value. You might want to reduce it by 1 or 2 layers (`n_gpu_layers=31` instead of `33`) just to leave a little breathing room for your operating system or other applications, which can prevent stuttering.

**In summary, the ideal `n_gpu_layers` is the highest number you can set without running out of VRAM.** The "set it to -1 and see what happens" method is the quickest way to find that number.


## models

https://g.co/gemini/share/5e6bc8594d7d



In [38]:
model_path = path+".models/tinyllama-1.1b-chat-v1.0.Q2_K.gguf"
generator_q2_k = Llama(
    model_path=model_path,
    chat_format="disabled",  # This prevents the model from wrapping input in chat templates
    n_ctx=2048,
    n_threads=2,
    n_gpu_layers=n_gpu_layers,
    logits_all=True,
    verbose=False,
)

llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility


In [39]:
# Report the n_gpu_layers value passed at load time (this echoes the requested setting, not a measured offload count)
offloaded_layers = generator_q2_k.model_params.n_gpu_layers
print(f"✅ Successfully offloaded {offloaded_layers} layers to the GPU")

✅ Successfully offloaded 33 layers to the GPU


In [40]:
model_path = path+".models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
generator_q4_k_m = Llama(
    model_path=model_path,
    chat_format="disabled",  # This prevents the model from wrapping input in chat templates
    n_ctx=2048,
    n_threads=2,
    n_gpu_layers=n_gpu_layers,
    logits_all=True,
    verbose=False,
)

llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility


In [41]:
# Report the n_gpu_layers value passed at load time (this echoes the requested setting, not a measured offload count)
offloaded_layers = generator_q4_k_m.model_params.n_gpu_layers
print(f"✅ Successfully offloaded {offloaded_layers} layers to the GPU")

✅ Successfully offloaded 33 layers to the GPU


In [42]:
model_path = path+".models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
generator_q6_k = Llama(
    model_path=model_path,
    chat_format="disabled",  # This prevents the model from wrapping input in chat templates
    n_ctx=2048,
    n_threads=2,
    n_gpu_layers=n_gpu_layers,
    logits_all=True,
    verbose=False,
)

llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility


In [43]:
# Report the n_gpu_layers value passed at load time (this echoes the requested setting, not a measured offload count)
offloaded_layers = generator_q6_k.model_params.n_gpu_layers
print(f"✅ Successfully offloaded {offloaded_layers} layers to the GPU")

✅ Successfully offloaded 33 layers to the GPU


In [44]:
model_path = path+".models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf"
generator_q8_0 = Llama(
    model_path=model_path,
    chat_format="disabled",  # This prevents the model from wrapping input in chat templates
    n_ctx=2048,
    n_threads=2,
    n_gpu_layers=n_gpu_layers,
    logits_all=True,
    verbose=False,
)

llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility


In [45]:
# Report the n_gpu_layers value passed at load time (this echoes the requested setting, not a measured offload count)
offloaded_layers = generator_q8_0.model_params.n_gpu_layers
print(f"✅ Successfully offloaded {offloaded_layers} layers to the GPU")

✅ Successfully offloaded 33 layers to the GPU


# tests

In [109]:
# Define the grammar in GBNF format
# This grammar allows for a sequence of numbers (integers or decimals) separated by commas
# https://g.co/gemini/share/938d5c086f74

gbnf_grammar_str = r'''
root   ::= sequence
sequence ::= number ("," number)*
number ::= ("-")? ([0-9]+ | [0-9]+ "." [0-9]+)
'''
try:
    grammar = LlamaGrammar.from_string(gbnf_grammar_str)
    print("Strict grammar parsed successfully")
    print(grammar)
except Exception as e:
    print(f"Error parsing grammar: {e}")
    grammar = None

Strict grammar parsed successfully.
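
To check a generated string against the same rules without loading a model, the grammar can be mirrored as a regular expression. This is a sketch: the names `NUMBER`, `SEQUENCE_RE` and `matches_grammar` are hypothetical, and the regex is a hand translation of the three GBNF rules above rather than anything produced by `LlamaGrammar`.

```python
import re

# number ::= ("-")? ([0-9]+ | [0-9]+ "." [0-9]+)
NUMBER = r"-?[0-9]+(?:\.[0-9]+)?"
# sequence ::= number ("," number)*
SEQUENCE_RE = re.compile(rf"{NUMBER}(?:,{NUMBER})*")


def matches_grammar(text):
    """Return True if the whole string is a complete sequence under the grammar."""
    return SEQUENCE_RE.fullmatch(text) is not None


print(matches_grammar("6,7,8,9,10"))            # True
print(matches_grammar("6.2,7.4,8.9,10.4,12."))  # False: ends mid-number after "12."
```

The second case matters in practice: the grammar constrains each token, but generation can still stop at `max_tokens` partway through a number, which leaves a valid prefix rather than a complete sequence.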


In [None]:
# Define the prompt with number sequence

prompt = "1.0,2.1,3.3,4.2,5.7,"
#prompt = "1,2,3,4,5,"

In [113]:
# Define parameters for running the model
# https://g.co/gemini/share/5e0aed5cf4b8
# https://g.co/gemini/share/8e15743e2cd6
# https://g.co/gemini/share/1dcee2a72019

parameters = {
    'max_tokens': 20,
    'logprobs': 10,
    'grammar': grammar,
    'temperature': 0.9
}

In [76]:
generators = {
    'q2_k': generator_q2_k,
    'q4_k_m': generator_q4_k_m,
    'q6_k': generator_q6_k,
    'q8_0': generator_q8_0,
}

In [77]:
# Generate output sequence and logprobs

def time_execution(generator, prompt, params):
    """Run one completion and return its text, logprobs, and wall-clock time in seconds."""

    # perf_counter is a monotonic clock, better suited to timing than time.time
    start_time = time.perf_counter()
    response = generator(prompt, **params)
    elapsed_time = time.perf_counter() - start_time

    choice = response['choices'][0]

    return {'response': choice['text'],
            'logprobs_data': choice['logprobs'],
            'elapsed_time': elapsed_time}
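
Elapsed time is easier to compare across models as a throughput figure. The sketch below derives tokens per second from the dictionary that `time_execution` returns. The helper name `tokens_per_second` is hypothetical and the example values are made up for illustration, not measured. The elapsed time also covers prompt processing, so this is an end-to-end figure, not a pure generation rate.

```python
def tokens_per_second(result):
    """Generated tokens divided by elapsed seconds for one time_execution result."""
    n_tokens = len(result['logprobs_data']['tokens'])
    elapsed = result['elapsed_time']
    if elapsed <= 0:
        raise ValueError("elapsed_time must be positive")
    return n_tokens / elapsed


# Made-up result with the same keys that time_execution returns.
example = {'response': '6,7',
           'logprobs_data': {'tokens': ['6', ',', '7']},
           'elapsed_time': 0.5}
print(tokens_per_second(example))  # 3 tokens / 0.5 s = 6.0
```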

In [114]:
# Run every model in the generators dict and print its results

results = {}
r = []
t = []

for key, generator in generators.items():
    print(f'### Running {key}')
    results[key] = time_execution(generator, prompt, parameters)
    for field in ('response', 'logprobs_data', 'elapsed_time'):
        print(f'\t{field} = {results[key][field]}')

### Running q2_k
	response = 6.2,7.4,8.9,10.4,12.
	elapsed_time = {'tokens': ['6', '.', '2', ',', '7', '.', '4', ',', '8', '.', '9', ',', '1', '0', '.', '4', ',', '1', '2', '.'], 'text_offset': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], 'token_logprobs': [np.float32(-0.51337457), np.float32(-0.048911024), np.float32(-2.2620232), np.float32(-0.32777563), np.float32(-0.2769009), np.float32(-0.052663155), np.float32(-2.0914621), np.float32(-0.2954199), np.float32(-0.16116016), np.float32(-0.031512357), np.float32(-2.78941), np.float32(-0.25423846), np.float32(-0.31939754), np.float32(-0.059035994), np.float32(-0.122858405), np.float32(-2.2547197), np.float32(-0.79676276), np.float32(-0.13400947), np.float32(-0.2971776), np.float32(-0.05908758)], 'top_logprobs': [{'6': np.float32(-0.51337457), '7': np.float32(-1.759495), '8': np.float32(-2.716346), '1': np.float32(-3.0023444), '5': np.float32(-3.2694342), '9': np.float32(-4.275054), '\n': np.float32(-
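
The `token_logprobs` values are easier to read as probabilities. The sketch below converts them and also computes a perplexity for the generated sequence, assuming the values are natural logarithms. The helper name `summarize_logprobs` is hypothetical, and the sample uses the first four values from the q2_k output above, written as plain floats.

```python
import math


def summarize_logprobs(token_logprobs):
    """Convert log-probabilities to probabilities and compute sequence perplexity."""
    values = [float(lp) for lp in token_logprobs]  # also accepts np.float32 values
    if not values:
        raise ValueError("token_logprobs is empty")
    mean_logprob = sum(values) / len(values)
    return {
        'probs': [math.exp(lp) for lp in values],
        'mean_logprob': mean_logprob,
        # Perplexity is the exponential of the negative mean log-probability.
        'perplexity': math.exp(-mean_logprob),
    }


# First four token_logprobs of the q2_k run above.
sample = [-0.51337457, -0.048911024, -2.2620232, -0.32777563]
summary = summarize_logprobs(sample)
print(summary['probs'])
print(summary['perplexity'])
```

A perplexity close to 1 means the model was nearly certain of every token it emitted; larger values indicate more spread across alternatives.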

In [93]:
results['q8_0']['logprobs_data']

{'tokens': ['6',
  ',',
  '7',
  ',',
  '8',
  ',',
  '9',
  ',',
  '1',
  '0',
  '\n',
  '\n',
  '\n',
  '\n',
  '\n',
  '\n',
  '\n',
  '\n',
  '\n',
  '\n'],
 'text_offset': [10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29],
 'token_logprobs': [np.float32(-0.039400734),
  np.float32(-0.22049828),
  np.float32(-0.017327122),
  np.float32(-0.053606026),
  np.float32(-0.004326626),
  np.float32(-0.05266202),
  np.float32(-0.003440891),
  np.float32(-0.1316747),
  np.float32(-0.24304207),
  np.float32(-0.003549349),
  np.float32(-1.8983413),
  np.float32(-1.7544292),
  np.float32(-4.7411046),
  np.float32(-3.024836),
  np.float32(-2.5478287),
  np.float32(-2.4488103),
  np.float32(-1.4894649),
  np.float32(-0.7896662),
  np.float32(-0.56893784),
  np.float32(-0.36428946)],
 'top_logprobs': [{'6': np.float32(-0.039400734),
   ' ': np.float32(-5.1428456),
   '7': np.float32(-5.402578),
   '\n': np.float32(-5.5278707),
  