This notebook will demonstrate the process I went through to run `neuralbeagle14-7b` on my laptop's 8GB GPU in Windows, as seen in [my LinkedIn post from January 27th, 2024](https://www.linkedin.com/posts/jsundance_free-local-private-ai-on-my-laptop-thanks-activity-7157117360728862720-MWxn?utm_source=share&utm_medium=member_desktop). It pulls heavily from [this LangChain documentation](https://python.langchain.com/docs/integrations/llms/llamacpp).

At first I could only get `llama-cpp-python` to run _without_ my GPU, and it took me a couple of installs before it was really loading all of the layers onto the GPU. It was still fast without the GPU, but that's not the point. ;)

I'm using an NVIDIA RTX A4000 laptop GPU. I will be compiling `llama-cpp-python` instead of using the "usual" `pip install` because I _think_ this is a more reliable method. I will be using the cuBLAS backend, but you can use other backends for AMD or Apple or whatever (more on this later).

I will also describe an issue I had with a missing dll, and offer troubleshooting advice for that.

Joshua Bailey #LearningInPublic January 28, 2024

# Prerequisites (and gotchas)

## NVIDIA stuff

- [NVIDIA driver](https://www.nvidia.com/download/index.aspx)
- [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit)

## Microsoft Visual Studio stuff

From [the LangChain documentation](https://python.langchain.com/docs/integrations/llms/llamacpp):

- Visual Studio Community (make sure you install this with the following settings)
  - Desktop development with C++
  - Python development
  - Linux embedded development with C++

_side note_: I installed this stuff a while ago along with `cudnn` for ArcGIS deep learning, and I don't think I included the Linux embedded development thing (maybe), which is probably why I had the dll trouble I'll describe later. ;)

## Check `nvidia-smi` and `nvcc --version`

If either of these commands doesn't work, you'll have trouble.
Install the NVIDIA driver and CUDA toolkit first.

In [1]:
!nvidia-smi

Sun Jan 28 17:29:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.79                 Driver Version: 537.79       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX A4000 Laptop GPU  WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   55C    P8              16W / 110W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
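
Note that the two commands report different things: as far as I understand it, the `CUDA Version` shown by `nvidia-smi` (12.2 above) is the newest CUDA version the driver supports, while `nvcc --version` reports the toolkit that is actually installed (11.8 above). Here is a minimal sketch for checking both from Python. `find_cuda_tools` and `parse_nvcc_release` are hypothetical helper names, and the regex assumes the `release X.Y` wording shown in the output above.

```python
import re
import shutil
from typing import Optional


def find_cuda_tools() -> dict[str, Optional[str]]:
    """Return the resolved path of each tool, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in ("nvidia-smi", "nvcc")}


def parse_nvcc_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a 'release X.Y' string in the nvcc output")
    return int(match.group(1)), int(match.group(2))


# Sample line copied from the nvcc output above.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample))  # (11, 8)
print(find_cuda_tools())
```

A `None` value in the dictionary means that tool is not on your `PATH`, which is worth fixing before going any further.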


## Python stuff

First of all, I highly recommend using an environment management tool like `conda` to manage your environments, and never tinkering around in the base environment. That way, when your package versions get messed up or whatever, you can just start fresh. ;)

A common approach is to install [anaconda](https://anaconda.org/). 

Assuming you have `conda`, create a new Python environment. At the time of writing, version constraints meant that Python 3.12 was not supported, so:

```cmd
conda create -n llama-cpp-python python=3.11
conda activate llama-cpp-python
```

### `torch`

You have to install `torch` before installing `llama-cpp-python`. I think if you just `pip install torch` then you get the cpu-only version.

So assuming you'll be using CUDA 11.8, based on [the pytorch documentation](https://pytorch.org/get-started/locally/), run:

```cmd
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

(12.1 is available at `/whl/cu121` but 12.2 is apparently not supported yet)
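
The index URL follows a simple pattern (`cu` plus the CUDA version without the dot), so it can be built from the version that `nvcc` reports. This is just a sketch: `torch_index_url` is a hypothetical helper, and it does not check that PyTorch actually publishes wheels for the version you pass in.

```python
def torch_index_url(cuda_major: int, cuda_minor: int) -> str:
    """Build the PyTorch wheel index URL for a CUDA version, e.g. 11.8 -> .../whl/cu118."""
    return f"https://download.pytorch.org/whl/cu{cuda_major}{cuda_minor}"


url = torch_index_url(11, 8)
print(f"pip install torch torchvision torchaudio --index-url {url}")
```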

In [3]:
import torch

if not torch.cuda.is_available():
    raise RuntimeError("torch cannot see a CUDA device; check the driver and the torch build")

# Compiling and installing `llama-cpp-python`

```markdown
There are different options on how to install the llama-cpp package:

- CPU usage
- CPU + GPU (using one of many BLAS backends)
- Metal GPU (MacOS with Apple Silicon Chip)
```
(from https://python.langchain.com/docs/integrations/llms/llamacpp)

On Windows with cuBLAS:

```cmd
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install -v llama-cpp-python
rem or if you've already installed it and need to try again (or you just wanna be extra careful I guess?):
rem pip install -v --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```
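
If you'd rather drive the install from Python (for example from a notebook cell), the same environment variables and flags can be passed to `pip` through `subprocess`. This is a sketch under that assumption: `build_install_command` is a hypothetical helper, and it only builds the command so you can inspect it before running anything.

```python
import os
import sys


def build_install_command(force_reinstall: bool = False) -> tuple[dict[str, str], list[str]]:
    """Return (environment, command) for compiling llama-cpp-python with cuBLAS."""
    env = dict(os.environ)
    env["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=on"
    env["FORCE_CMAKE"] = "1"

    # Use the current interpreter's pip so the package lands in the active environment.
    cmd = [sys.executable, "-m", "pip", "install", "-v"]
    if force_reinstall:
        cmd += ["--upgrade", "--force-reinstall", "--no-cache-dir"]
    cmd.append("llama-cpp-python")
    return env, cmd


env, cmd = build_install_command(force_reinstall=True)
print(" ".join(cmd[1:]))
# To actually run the build (it may take a while):
# import subprocess; subprocess.run(cmd, env=env, check=True)
```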

## `No CUDA toolset found`? Check for `Nvda.Build.CudaTasks.v*.*.dll`

If you try to compile `llama-cpp-python` and get an error message like
```text
[...]
CMake Error at [...]
No CUDA toolset found.
[...]
*** CMake configuration failed.
[end of output]
```

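then take a look at [this GitHub issue comment](https://github.com/NVlabs/tiny-cuda-nn/issues/164#issuecomment-1280749170) and possibly use the following code to help find your problem. The code looks up the newest CUDA toolkit under `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA` and checks that its four Visual Studio integration files (`CUDA x.y.props`, `CUDA x.y.targets`, `CUDA x.y.xml`, and `Nvda.Build.CudaTasks.vx.y.dll`) also exist in Visual Studio's `BuildCustomizations` directory.

If everything's good, it will print what it found and not raise any errors.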

In [4]:
import os

from glob import glob
import re

def version_files(version: str) -> set[str]:
    return {
        f"CUDA {version}.props",
        f"CUDA {version}.targets",
        f"CUDA {version}.xml",
        f"Nvda.Build.CudaTasks.v{version}.dll",
    }

dll_pat = re.compile(r"^Nvda\.Build\.CudaTasks\.v(?P<major>\d+)\.(?P<minor>\d+)\.dll$")

nvidia_cuda_glob = f"C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\**\\extras\\visual_studio_integration\\MSBuildExtensions\\Nvda.Build.CudaTasks.v*.*.dll"
nvidia_cuda_files = glob(nvidia_cuda_glob, recursive=True)
print(f"Found files:")
print("\n".join(nvidia_cuda_files), "\n")

if not nvidia_cuda_files:
    raise RuntimeError("No Nvda.Build.CudaTasks dll found under the CUDA toolkit directory")

basenames = (os.path.basename(f) for f in nvidia_cuda_files)
matches = (dll_pat.match(bn) for bn in basenames)
groups = (match.groupdict() for match in matches if match)
sorted_versions = sorted(groups, key=lambda x: (int(x['major']), int(x['minor'])))
highest_version = sorted_versions[-1]
highest_str = highest_version['major'] + '.' + highest_version['minor']
highest_files = version_files(highest_str)

print(f"Highest version: {highest_str}\n")

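# NOTE: this pattern matches a "BuildTools" install under Program Files (x86).
# Other Visual Studio editions (e.g. Community) typically install elsewhere, so adjust the path if needed.
bc_dirs = glob("C:\\Program Files (x86)\\Microsoft Visual Studio\\*\\BuildTools\\MSBuild\\Microsoft\\VC\\v*\\BuildCustomizations")
if len(bc_dirs) != 1:
    print("Only expected to find one directory lol")
    print(bc_dirs)
    raise RuntimeError(f"Expected exactly one BuildCustomizations directory, found {len(bc_dirs)}")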
bc_dir = bc_dirs[0]

print(f"Build customizations dir: {bc_dir}\n")

expected_files = [os.path.join(bc_dir, file) for file in highest_files]
print("Checking for files:")
print("\n".join(expected_files), "\n")

for expected_file in expected_files:
    if not os.path.exists(expected_file):
        raise FileNotFoundError(expected_file)

print("All good")

Found files:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\extras\visual_studio_integration\MSBuildExtensions\Nvda.Build.CudaTasks.v11.4.dll
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\extras\visual_studio_integration\MSBuildExtensions\Nvda.Build.CudaTasks.v11.8.dll 

Highest version: 11.8

Build customizations dir: C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations

Checking for files:
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 11.8.xml
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 11.8.props
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\Nvda.Build.CudaTasks.v11.8.dll
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 11.8.targets 

All good
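
If the check fails with a `FileNotFoundError` instead, my understanding of the linked GitHub comment is that the fix is to copy the CUDA Visual Studio integration files into the `BuildCustomizations` directory. Below is a minimal sketch of that copy step, assuming that really is the fix for your setup. `missing_integration_files` and `copy_missing` are hypothetical helpers, and writing into `Program Files` will most likely need an elevated (administrator) prompt.

```python
import shutil
from pathlib import Path


def missing_integration_files(src_dir: Path, dst_dir: Path) -> list[Path]:
    """List files in src_dir that have no same-named file in dst_dir."""
    return sorted(
        f for f in src_dir.iterdir()
        if f.is_file() and not (dst_dir / f.name).exists()
    )


def copy_missing(src_dir: Path, dst_dir: Path) -> list[Path]:
    """Copy every missing file from src_dir into dst_dir and return the new paths."""
    return [
        Path(shutil.copy2(f, dst_dir / f.name))
        for f in missing_integration_files(src_dir, dst_dir)
    ]


# Example with the directories printed above (uncomment to use; adjust versions to match your system):
# src = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\extras\visual_studio_integration\MSBuildExtensions")
# dst = Path(r"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations")
# print(missing_integration_files(src, dst))  # dry run: see what would be copied
# copy_missing(src, dst)
```

Running `missing_integration_files` first works as a dry run, so you can see what would be copied before touching the Visual Studio directory.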


# Using `neuralbeagle14-7b` in `langchain`

## Enable LangSmith logging (optional)

`.env`:
```.env
LANGCHAIN_API_KEY=ls__...
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT="neuralbeagle-demo"
```

In [5]:
# for langsmith logging
from dotenv import load_dotenv
load_dotenv()

[k for k in os.environ.keys() if 'langchain' in k.lower()]

['LANGCHAIN_API_KEY',
 'LANGCHAIN_ENDPOINT',
 'LANGCHAIN_TRACING_V2',
 'LANGCHAIN_PROJECT']
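
Listing the keys (rather than the values) keeps the API key out of the notebook output. A slightly stricter version of the same check is sketched below: `missing_langsmith_vars` is a hypothetical helper, and the list of expected names is simply the four variables from the `.env` file above.

```python
import os
from collections.abc import Mapping

# The four variables from the .env file above.
EXPECTED_VARS = (
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
)


def missing_langsmith_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of expected variables that are unset or empty."""
    return [name for name in EXPECTED_VARS if not env.get(name)]


# Prints only variable names, never their values.
print(missing_langsmith_vars())
```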

## Call the model using `langchain_community.llms.LlamaCpp`

In [6]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp

In [7]:
%%time

model_path = r"C:\users\joshua.bailey\downloads\neuralbeagle14-7b.Q5_K_M.gguf"

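llm_kwargs = {
    "temperature": 0.75,
    # Upper bound only: prompt + output still have to fit in n_ctx (512 in the logs below).
    "max_tokens": 5000,
    "top_p": 1,
    # The following settings are from https://python.langchain.com/docs/integrations/llms/llamacpp
    # These settings used about 5.7GB GPU RAM on my system
    # 40 is more layers than this model has, so everything gets offloaded (33/33 in the logs below).
    "n_gpu_layers": 40,  # Change this value based on your model and your GPU VRAM pool.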
    "n_batch": 512,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    # Callbacks support token-wise streaming
    "callback_manager": CallbackManager([StreamingStdOutCallbackHandler()]),
    # Verbose is required to pass to the callback manager
    "verbose": True,
}

if not os.path.exists(model_path):
    raise FileNotFoundError(model_path)

llm = LlamaCpp(
    model_path=model_path,
    **llm_kwargs,
)

CPU times: total: 2.8 s
Wall time: 2.82 s


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | 
Model metadata: {'general.name': 'mlabonne_neuralbeagle14-7b', 'general.architecture': 'llama', 'llama.context_length': '32768', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '17', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '10000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.chat_template': "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['co

In [8]:
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

llm_chain = PromptTemplate(template=template, input_variables=["question"]) | llm
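
To see roughly what text reaches the model, you can fill in the template by hand. This sketch uses plain `str.format` as a stand-in for `PromptTemplate`, on the assumption that the default template format behaves the same way for a simple single-variable template like this one.

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

# Plain str.format stands in for PromptTemplate here (an assumption, see above).
prompt_text = template.format(question="What are Scrub-Jays?")
print(prompt_text)
```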

In [9]:
question = "What are Scrub-Jays? Should a biologist expect to find them in Virginia?"

In [10]:
%%time

answer = llm_chain.invoke(dict(question=question))

 First, let's consider what a Scrub-Jay is. There are two main types of Scrub-Jays found in North America: Western Scrub-Jay and the Eastern Scrub-Jay (also known as Blue Jay). The latter is also called the Florida Scrub-Jay because it has a limited range compared to the former, mostly restricted to peninsular Florida. This means that the Eastern or Florida Scrub-Jay would not be found in Virginia since its range does not extend there. On the other hand, the Western Scrub-Jay is widely distributed across the western United States and parts of Mexico. It's range includes California, Arizona, Nevada, Utah, Colorado, Oregon and Washington.

However, Virginia is located on the east coast of the USA and is in close proximity to the Atlantic Ocean, which is outside the distribution range of Western Scrub-Jays. Nonetheless, there is a third kind of JAY that can be found in Virginia, and it's known as the Blue Jay (Cyanocitta cristata) - this is an eastern species and is quite distinct from th

## View LangSmith run

I've shared the resulting run [here](https://smith.langchain.com/public/0451496b-78a6-4cf1-b2e0-e58d6997d0ad/r).

Time to first token: 207 ms

Total tokens: 457 tokens

Latency: 12.45 seconds
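
These numbers give a rough end-to-end throughput. In the sketch below, `tokens_per_second` is a hypothetical helper, and I'm assuming "Total tokens" counts prompt and completion tokens together, so treat the result as a ballpark figure only.

```python
def tokens_per_second(total_tokens: int, latency_seconds: float) -> float:
    """Average throughput over the whole request."""
    return total_tokens / latency_seconds


print(round(tokens_per_second(457, 12.45), 1))  # 36.7
```

That is a bit lower than the 42.30 tokens per second that the `llama_print_timings` eval line reports in the terminal logs, which I'd expect since the LangSmith latency covers the whole request rather than just token generation.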

## View logs in terminal

```text
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA RTX A4000 Laptop GPU, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\users\joshua.bailey\downloads\neuralbeagle14-7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mlabonne_neuralbeagle14-7b
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    85.94 MiB
llm_load_tensors:      CUDA0 buffer size =  4807.05 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =     9.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    80.30 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

llama_print_timings:        load time =     126.88 ms
llama_print_timings:      sample time =      60.32 ms /   464 runs   (    0.13 ms per token,  7692.56 tokens per second)
llama_print_timings: prompt eval time =     126.79 ms /    48 tokens (    2.64 ms per token,   378.57 tokens per second)
llama_print_timings:        eval time =   10946.64 ms /   463 runs   (   23.64 ms per token,    42.30 tokens per second)
llama_print_timings:       total time =   12363.03 ms /   511 tokens
```
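
The KV buffer size in these logs can be reproduced with a bit of arithmetic, which also shows why the context length matters so much for VRAM. This is a sketch, not llama.cpp's actual code: `kv_cache_mib` is a hypothetical helper, and the formula (one K and one V vector of `n_embd_k_gqa` f16 values per layer per context position) is my assumption, chosen because it matches the `64.00 MiB` reported above.

```python
def kv_cache_mib(n_layer: int, n_ctx: int, n_embd_kv: int, bytes_per_value: int = 2) -> float:
    """Estimate the combined K + V cache size in MiB (f16 values are 2 bytes each)."""
    k_bytes = n_layer * n_ctx * n_embd_kv * bytes_per_value
    return 2 * k_bytes / 1024**2  # K and V caches are the same size here


# Values from the logs: n_layer = 32, n_ctx = 512, n_embd_k_gqa = n_embd_v_gqa = 1024
print(kv_cache_mib(32, 512, 1024))    # 64.0
# Same model at its full training context (n_ctx_train = 32768)
print(kv_cache_mib(32, 32768, 1024))  # 4096.0
```

Under this estimate, the full 32768-token context would need about 4 GiB for the cache alone, on top of the roughly 4.7 GiB of model weights in the CUDA0 buffer, which would not fit in 8 GB of VRAM.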

## Check `nvidia-smi` to see current GPU usage

In [11]:
!nvidia-smi

Sun Jan 28 17:30:04 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.79                 Driver Version: 537.79       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX A4000 Laptop GPU  WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   66C    P0              99W / 100W |   5705MiB /  8192MiB |     81%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
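
If you want the memory numbers as values rather than reading them off the table, a small regex over the `nvidia-smi` text is enough. This is a sketch: `parse_memory_usage` is a hypothetical helper, and it assumes the `NNNMiB / NNNMiB` layout shown above and a single GPU (it only reads the first match).

```python
import re


def parse_memory_usage(nvidia_smi_output: str) -> tuple[int, int]:
    """Return (used_mib, total_mib) from the first 'NNNMiB / NNNMiB' pair."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'NNNMiB / NNNMiB' pair found in the nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


# Row copied from the nvidia-smi output above.
line = "| N/A   66C    P0              99W / 100W |   5705MiB /  8192MiB |     81%      Default |"
used, total = parse_memory_usage(line)
print(f"{used} MiB of {total} MiB ({used / total:.0%})")  # 5705 MiB of 8192 MiB (70%)
```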
                                                                    