## Part 1: Set up the PyTorch/CUDA/GPU environment

In [1]:
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

In [2]:
import torch

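def print_gpu_memory():
    """Print total device memory plus PyTorch's allocated and reserved amounts."""
    if torch.cuda.is_available():
        print(f"Total GPU Memory: {torch.cuda.get_device_properties(0).total_memory / (1024**3):.2f} GB")
        # memory_allocated: bytes currently occupied by live tensors
        print(f"Used GPU Memory: {torch.cuda.memory_allocated(0) / (1024**3):.2f} GB")
        # memory_reserved: bytes held by PyTorch's caching allocator (this is not free memory)
        print(f"Reserved GPU Memory: {torch.cuda.memory_reserved(0) / (1024**3):.2f} GB")
    else:
        print("CUDA is not available. No GPU detected.")

print_gpu_memory()

Total GPU Memory: 15.74 GB
Used GPU Memory: 0.00 GB
Reserved GPU Memory: 0.00 GB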
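
`torch.cuda.memory_reserved` reports memory held by PyTorch's caching allocator, which is not the same as free memory. Below is a minimal sketch of how the figures relate, written as a pure function so it runs without a GPU. `summarize_gpu_memory` and its field names are hypothetical, and the inputs are assumed to be the byte counts returned by `total_memory`, `torch.cuda.memory_allocated`, and `torch.cuda.memory_reserved`:

```python
GIB = 1024 ** 3

def summarize_gpu_memory(total_bytes, allocated_bytes, reserved_bytes):
    """Convert raw byte counts to GiB and derive the gaps between them."""
    return {
        "total_gib": total_bytes / GIB,
        "allocated_gib": allocated_bytes / GIB,  # occupied by live tensors
        "reserved_gib": reserved_bytes / GIB,    # held by the caching allocator
        # reserved but not currently backing any tensor
        "cached_unused_gib": (reserved_bytes - allocated_bytes) / GIB,
        # not reserved by this process
        "unreserved_gib": (total_bytes - reserved_bytes) / GIB,
    }

# Illustrative numbers only, not measurements from this workstation
report = summarize_gpu_memory(16 * GIB, 2 * GIB, 3 * GIB)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

The `unreserved_gib` figure is only an upper bound on what is actually free, since other processes may also be using the GPU.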


## Part 2: Loading the model

In this project, the workstation has 16 GB of VRAM.
By a simple "divide by 4" rule of thumb (a float32 parameter takes 4 bytes), it can hold a model of at most about 4B parameters.
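
The rule of thumb amounts to dividing the available memory by the number of bytes each parameter occupies. Below is a minimal sketch, assuming the weights dominate memory use and ignoring activations, the KV cache, and framework overhead. `max_params_billions` is a hypothetical helper, not part of any library:

```python
def max_params_billions(vram_gb: float, bytes_per_param: float = 4.0) -> float:
    """Rough upper bound on model size (billions of parameters) that fits in VRAM."""
    if bytes_per_param <= 0:
        raise ValueError("bytes_per_param must be positive")
    # GB of memory divided by bytes per parameter gives billions of parameters
    return vram_gb / bytes_per_param

# Compare a few common weight precisions on a 16 GB card
for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: ~{max_params_billions(16, nbytes):.0f}B parameters")
```

Lower-precision weights raise the bound proportionally, which is the motivation for quantizing the model in the next cells.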

The configuration of this model can be found at: https://huggingface.co/docs/transformers/en/model_doc/gemma#transformers.GemmaConfig

In [3]:
import torch

# Clear any cached memory (might help in some cases)
torch.cuda.empty_cache()

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA (GPU support) is available in this environment!")
    print(f"Number of GPUs available: {torch.cuda.device_count()}")
    # Get the name of the GPU
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available. Using CPU instead.")

CUDA (GPU support) is available in this environment!
Number of GPUs available: 1
GPU Name: NVIDIA GeForce RTX 3080 Ti Laptop GPU


In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os

# Load the tokenizer and the model, and put the model in eval mode
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf").eval()

# Apply dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,  # the original model
    {torch.nn.Linear},  # the layers you want to dynamically quantize
    dtype=torch.qint8  # the target dtype for quantized weights
)

# Function to save the model and check its size
def save_model_check_size(model, filename):
    file_path = os.path.join(os.getcwd(), filename)
    torch.save(model.state_dict(), file_path)
    size = os.path.getsize(file_path)
    print(f"Size of the {filename}: {size / (1024**2):.2f} MB")

# Save and check the size of the original and quantized models
save_model_check_size(model, "original_model.pt")
save_model_check_size(quantized_model, "quantized_model.pt")

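# Check if a GPU is available and move the quantized model to it.
# Note: dynamically quantized Linear layers only have CPU kernels, so generate() fails after this move.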
if torch.cuda.is_available():
    model = quantized_model.to("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU found, using CPU instead.")

Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.70s/it]


Size of the original_model.pt: 25705.13 MB
Size of the quantized_model.pt: 6802.30 MB
Using GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU


## The effect of quantization is huge:

Dynamic int8 quantization of the `Linear` layers shrinks the saved state dict from 25705.13 MB to 6802.30 MB, about 26% of the original size (a roughly 3.8x reduction).
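
The size reduction can be quantified directly from the two reported file sizes. Below is a minimal sketch; `compression_summary` is a hypothetical helper. Going from float32 to int8 would ideally give a 4x reduction, and the measured factor is presumably a little lower because only the `torch.nn.Linear` layers were quantized while the remaining parameters stay in float32:

```python
def compression_summary(original_mb: float, quantized_mb: float) -> tuple:
    """Return (reduction factor, fraction of the original size that remains)."""
    if original_mb <= 0 or quantized_mb <= 0:
        raise ValueError("sizes must be positive")
    return original_mb / quantized_mb, quantized_mb / original_mb

# File sizes reported by save_model_check_size above
factor, kept = compression_summary(25705.13, 6802.30)
print(f"{factor:.2f}x smaller, {kept:.1%} of the original size")
```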

In [7]:
# Tokenize the input text and move the tensors to the GPU if available
input_text = "write me a python code that generate a simple game"

# The tokenizer itself does not need to be moved to CUDA: it only converts text to token IDs
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda") if torch.cuda.is_available() else inputs

# Generate output
outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0]))

NotImplementedError: Could not run 'quantized::linear_dynamic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear_dynamic' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

CPU: registered at ../aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp:662 [kernel]
Meta: registered at ../aten/src/ATen/core/MetaFallbackKernel.cpp:23 [backend fallback]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:154 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:498 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:324 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:86 [backend fallback]
AutogradOther: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:53 [backend fallback]
AutogradCPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:57 [backend fallback]
AutogradCUDA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:65 [backend fallback]
AutogradXLA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:69 [backend fallback]
AutogradMPS: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:77 [backend fallback]
AutogradXPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:61 [backend fallback]
AutogradHPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:90 [backend fallback]
AutogradLazy: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:73 [backend fallback]
AutogradMeta: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:81 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:297 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:378 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:244 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:720 [backend fallback]
BatchedNestedTensor: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:746 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:203 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:162 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:494 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:166 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:158 [backend fallback]


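So, the model loads and quantizes successfully on this PC, but generation on the GPU fails: as the error above shows, `quantized::linear_dynamic` has a kernel only for the CPU backend, so the dynamically quantized model has to stay on the CPU.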
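
The `NotImplementedError` above lists a `[kernel]` registration for `quantized::linear_dynamic` only under the CPU backend. One way to avoid it is to choose the device based on how the model was quantized. Below is a minimal sketch; `choose_device` is a hypothetical helper (not a PyTorch API) that encodes only what the error message states:

```python
def choose_device(dynamically_quantized: bool, cuda_available: bool) -> str:
    """Pick a device string for inference.

    Dynamically quantized Linear layers only have CPU kernels,
    so such a model stays on the CPU even when a GPU is present.
    """
    if dynamically_quantized:
        return "cpu"
    return "cuda" if cuda_available else "cpu"

print(choose_device(dynamically_quantized=True, cuda_available=True))
print(choose_device(dynamically_quantized=False, cuda_available=True))
```

In this notebook the call would be `choose_device(True, torch.cuda.is_available())`, with both the model and the tokenized inputs moved to the returned device.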

## Part 3: Benchmarking

(Skipped)