This notebook runs the Llama 3.1 8B model locally on an NVIDIA GeForce RTX 4080 Laptop GPU and sets up inference with Unsloth and vLLM.

In [1]:
import time
import torch
from vllm import LLM, SamplingParams
from datasets import load_dataset
from evaluate import load
from huggingface_hub import login

# CUDA setup
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

**Checking if CUDA is available**

If CUDA was available earlier but is no longer detected after the machine resumes from sleep, try reloading the `nvidia_uvm` kernel module (`nvidia-smi` is a command-line tool, not a kernel module):
```
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
```
and then restart your IDE.

In [2]:
def check_cuda():
    if torch.cuda.is_available():
        print("CUDA is available!")
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")
        print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        print(f"Allocated GPU memory: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
        print(f"Cached GPU memory: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
        return True
    else:
        print("CUDA is not available. Using CPU.")
        return False

# Run CUDA check
check_cuda()

CUDA is available!
Using GPU: NVIDIA GeForce RTX 4080 Laptop GPU
Total GPU memory: 12.48 GB
Allocated GPU memory: 0.00 GB
Cached GPU memory: 0.00 GB


True

**Model loading**

We want to load the Llama 3.1 8B model. In FP16 the weights alone need roughly 16 GB (8 billion parameters × 2 bytes each), which exceeds the 12.48 GB available on this GPU.

We use ==Unsloth==, a library for faster, memory-optimised loading of LLMs, to load the model onto the local GPU.

We load it with 4-bit quantization to reduce the memory footprint.
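
The memory argument above is simple arithmetic, sketched here for a few precisions. The helper name `weight_memory_gb` and the nominal parameter count of 8e9 are assumptions made for illustration; the exact parameter count is slightly higher, and the estimate covers weights only (no activations, KV cache, or quantization metadata).

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # nominal "8B" (assumption); the real count is slightly higher

for label, bits in [("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

The 4-bit figure is a lower bound. Layers that stay unquantized, such as the embeddings and norms, add to it, which is consistent with the roughly 5.34 GB that vLLM reports for the weights later in this notebook.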

In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [4]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    # model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: NVIDIA GeForce RTX 4080 Laptop GPU. Max memory: 11.625 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
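
The banner's "Max memory: 11.625 GB" and the 12.48 GB printed by `check_cuda()` appear to be the same quantity in different units: `check_cuda()` divides by 1e9 (decimal GB), while the Unsloth figure matches a division by 2^30 (GiB). A quick check, with `gib_to_gb` as a hypothetical helper:

```python
GIB = 2**30  # bytes in one gibibyte
GB = 10**9   # bytes in one decimal gigabyte


def gib_to_gb(gib: float) -> float:
    """Convert gibibytes to decimal gigabytes."""
    return gib * GIB / GB


print(f"{gib_to_gb(11.625):.2f} GB")  # prints "12.48 GB"
```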




In [5]:
type(model)

transformers.models.llama.modeling_llama.LlamaForCausalLM

The model and tokenizer can be saved locally. A model modified with LoRA adapters can be merged and saved the same way. Refer to the Unsloth documentation for details on the other save methods.

In [6]:
tokenizer.save_pretrained("models/llama3.1-8b-tokenizer")

('models/llama3.1-8b-tokenizer/tokenizer_config.json',
 'models/llama3.1-8b-tokenizer/special_tokens_map.json',
 'models/llama3.1-8b-tokenizer/tokenizer.json')

In [7]:
# Save merged weights for vLLM: once in 16-bit and once in 4-bit
model.save_pretrained_merged("models/llama3.1-4bit-merged16", tokenizer, save_method = "merged_16bit",)
model.save_pretrained_merged("models/llama3.1-4bit-merged4", tokenizer, save_method = "merged_4bit_forced",)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 15.0 out of 30.98 RAM for saving.


100%|██████████| 32/32 [00:20<00:00,  1.57it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
Unsloth: Merging 4bit and LoRA weights to 4bit...
This might take 5 minutes...
Done.
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 10 minutes for Llama-7b... Done.


**Unsloth inference**

The first inference test uses Unsloth's native `FastLanguageModel`.

In [10]:
from transformers import TextStreamer
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,),

In [11]:
messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

In [12]:
inputs

tensor([[128000, 128006,    882, 128007,    271,  24433,    279,  16178,  27476,
          25471,   8668,     25,    220,     16,     11,    220,     16,     11,
            220,     17,     11,    220,     18,     11,    220,     20,     11,
            220,     23,     11, 128009, 128006,  78191, 128007,    271]],
       device='cuda:0')
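
To read the tensor above, it helps to map the large IDs back to Llama 3 special tokens. The sketch below is an illustration, not part of the original run: `annotate` is a hypothetical helper, and the ID-to-name table is limited to the four special tokens that occur in this prompt. Judging by the decoded output further down, 882 and 78191 correspond to `user` and `assistant`, and 271 to the blank line after each header.

```python
# IDs copied from the tensor above; names follow the Llama 3 special-token vocabulary.
SPECIAL_TOKENS = {
    128000: "<|begin_of_text|>",
    128006: "<|start_header_id|>",
    128007: "<|end_header_id|>",
    128009: "<|eot_id|>",
}


def annotate(ids):
    """Replace known special-token IDs with their names; keep other IDs as numbers."""
    return [SPECIAL_TOKENS.get(i, i) for i in ids]


prompt_head = [128000, 128006, 882, 128007, 271]       # start of the user turn
prompt_tail = [128009, 128006, 78191, 128007, 271]     # end of user turn + assistant header
print(annotate(prompt_head))
print(annotate(prompt_tail))
```

The trailing assistant header is what `add_generation_prompt = True` contributes: the prompt ends with an open assistant turn for the model to fill in.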

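The output is unpredictable and sometimes keeps going until `max_new_tokens` is reached. This is likely because `unsloth/Meta-Llama-3.1-8B` is the base model rather than an instruction-tuned one, so it tends to continue the text instead of answering in the chat format.
Still, it is a starting point.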

In [15]:
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144,?

A. 233
B. 233
C. 233
D. 233
E. 233

Answer: B<|end_of_text|>


**vLLM**

In [3]:
BASE_MODEL_LOC = '/home/miam/Experiments/llama_playground/models/llama3.1-4bit-merged4'
TOKENIZER_LOC = '/home/miam/Experiments/llama_playground/models/llama3.1-8b-tokenizer'

In [7]:
torch.cuda.empty_cache()  # Clear any cached memory

In [4]:
llm = LLM(
    model=BASE_MODEL_LOC,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,
    gpu_memory_utilization=0.8,
    max_model_len=4096
)

INFO 08-21 11:52:57 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/miam/Experiments/llama_playground/models/llama3.1-4bit-merged4', speculative_config=None, tokenizer='/home/miam/Experiments/llama_playground/models/llama3.1-4bit-merged4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/miam/Experiments/llama_playground/models/llama3.1-4bit-merged4, use_v2_block_manager=False, enable_prefix_caching



Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]




INFO 08-21 11:53:01 model_runner.py:732] Loading model weights took 5.3424 GB
INFO 08-21 11:53:04 gpu_executor.py:102] # GPU blocks: 1494, # CPU blocks: 2048


In [5]:
print(torch.cuda.memory_summary(device=None, abbreviated=False))

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   8466 MiB |   8466 MiB |  62537 MiB |  54070 MiB |
|       from large pool |   8360 MiB |   8360 MiB |  62226 MiB |  53865 MiB |
|       from small pool |    106 MiB |    107 MiB |    311 MiB |    204 MiB |
|---------------------------------------------------------------------------|
| Active memory         |   8466 MiB |   8466 MiB |  62537 MiB |  54070 MiB |
|       from large pool |   8360 MiB |   8360 MiB |  62226 MiB |  53865 MiB |
|       from small pool |    106 MiB |    107 MiB |    311 MiB |    204 MiB |
|---------------------------------------------------------------
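
The numbers in the log and the memory summary can be cross-checked with a little arithmetic. This is a sketch under stated assumptions: the layer count and `k_proj`/`v_proj` width come from the model printout earlier in this notebook, but the block size of 16 tokens is an assumed vLLM setting (the config line in the log is truncated), and the "5.3424 GB" in the log is read here as GiB.

```python
# Shapes read from the model printout earlier in this notebook.
NUM_LAYERS = 32          # "(0-31): 32 x LlamaDecoderLayer"
KV_PROJ_FEATURES = 1024  # out_features of k_proj and v_proj
BYTES_PER_VALUE = 2      # bfloat16 KV cache

# Assumptions not visible in the truncated log.
BLOCK_SIZE_TOKENS = 16   # assumed vLLM block size
WEIGHTS_GIB = 5.3424     # "Loading model weights took 5.3424 GB", read as GiB


def kv_bytes_per_token() -> int:
    """One key vector and one value vector per layer."""
    return 2 * NUM_LAYERS * KV_PROJ_FEATURES * BYTES_PER_VALUE


def kv_cache_mib(num_blocks: int) -> float:
    """Size of the GPU KV cache in MiB for a given number of blocks."""
    tokens = num_blocks * BLOCK_SIZE_TOKENS
    return tokens * kv_bytes_per_token() / 2**20


gpu_blocks = 1494  # "# GPU blocks: 1494" from the log
print(f"KV cache capacity: {gpu_blocks * BLOCK_SIZE_TOKENS} tokens")
print(f"KV cache size:     {kv_cache_mib(gpu_blocks):.0f} MiB")
print(f"Weights + cache:   {WEIGHTS_GIB * 1024 + kv_cache_mib(gpu_blocks):.0f} MiB")
```

Under these assumptions the cache holds 23904 tokens and takes 2988 MiB, and weights plus cache come to about 8459 MiB, within roughly 10 MiB of the 8466 MiB shown as allocated in the summary.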

**Testing LLM inference**

In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_LOC)

In [8]:
messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
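
The error says the tokenizer loaded from `TOKENIZER_LOC` has no chat template. A likely cause is that the tokenizer was saved (cell 6) before `get_chat_template` was applied to it (cell 10), so the template never reached disk. One option is to re-save the tokenizer after applying the template. Another, sketched below, is to build the prompt string by hand, following the Llama 3 layout visible in the decoded Unsloth output above. `build_llama3_prompt` and its `add_bos` flag are hypothetical names introduced for illustration; the role mapping mirrors the ShareGPT-style mapping used in the Unsloth cell.

```python
def build_llama3_prompt(messages, add_generation_prompt=True, add_bos=True):
    """Format ShareGPT-style messages ("from"/"value") in the Llama 3 chat layout."""
    role_map = {"human": "user", "gpt": "assistant"}  # same mapping as the Unsloth cell
    parts = ["<|begin_of_text|>"] if add_bos else []
    for message in messages:
        role = role_map.get(message["from"], message["from"])
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{message['value']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
prompt = build_llama3_prompt(messages)
print(prompt)
```

The resulting string can then be passed as a plain text prompt to `llm.generate` together with a `SamplingParams` object (both imported at the top of the notebook). If the tokenizer already prepends the begin-of-text token when encoding, pass `add_bos=False` to avoid a doubled marker.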