* Run the code below to install the dependencies

In [1]:
# Install bitsandbytes and transformers
!pip install bitsandbytes
!pip install transformers datasets huggingface_hub

# Install auto-gptq
!pip install auto-gptq

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1
Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.2.1-py3-none-any.whl.metadata (3.0 kB)
Collecting peft>=0.5.0 (from auto-gptq)
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)

* Run the code below and make sure a CUDA GPU is available

In [2]:
# Import libraries
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import bitsandbytes as bnb 
from huggingface_hub import login
import numpy as np
import torch
import random
import time
import os
# Both allocator options go in one comma-separated value;
# assigning the variable twice would keep only the last setting.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:64"

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

# Use the GPU if CUDA is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU CUDA available")
else:
    device = torch.device("cpu")
    print("No GPU CUDA available, CPU used")

# Log in to Hugging Face with the token stored in the Kaggle secrets
access_token = user_secrets.get_secret("HF_TOKEN")
login(token=access_token)

GPU CUDA available
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful
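
Beyond checking that CUDA is available, it can help to know how much GPU memory there is before quantizing. The helper below is a minimal sketch, not part of the original notebook: `describe_device` is a hypothetical name, and the lazy import is only there so the snippet still runs where PyTorch is missing.

```python
def describe_device() -> str:
    """Return a one-line description of the device PyTorch would use."""
    try:
        import torch  # imported lazily so the helper degrades gracefully
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "No CUDA GPU available, CPU will be used"

    # Properties of the first visible GPU (index 0 is an assumption)
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    return f"{props.name}: {total_gib:.1f} GiB of GPU memory"


print(describe_device())
```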


* Run the code below to quantize the model
* To choose the quantization bit width, change the number in square brackets in the `for` loop
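
To get a feel for what the bit width changes, the sketch below estimates the storage needed for the weights alone at several bit widths. The parameter count is a round figure assumed for a model of roughly 3B parameters, not a value read from the checkpoint, and the estimate ignores quantization scales, zero-points and any layers left unquantized.

```python
def estimated_weight_size_gib(n_params: int, bits: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits / 8 / 1024**3


N_PARAMS = 3_000_000_000  # assumption: round figure for a ~3B-parameter model

# 16 bits stands for the unquantized half-precision baseline
for bits in (4, 8, 16):
    size = estimated_weight_size_gib(N_PARAMS, bits)
    print(f"{bits:>2} bits: ~{size:.2f} GiB")
```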

In [3]:
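# Load the Llama tokenizer (the model weights are loaded later, inside quantization())
print('Llama loading...')
model_name = "meta-llama/Llama-3.2-3B"
# `token` replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token, use_fast=True, trust_remote_code=True)
print('Llama loaded !')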

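# Quantization
def quantization(model_name, bits):
    """Quantize `model_name` to `bits` bits with GPTQ and save the result."""
    torch.cuda.empty_cache()
    print("Model quantization ...")
    # group_size=64: each group of 64 weights shares its quantization parameters
    quantize_config = BaseQuantizeConfig(bits=bits, group_size=64, desc_act=False)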
    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config,
        use_flash_attention_2=False,
        low_cpu_mem_usage=True,
        use_cache=False
    )
    model.to(device)
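    # Calibration data for GPTQ; a single sentence keeps the demo fast,
    # but more calibration samples usually give a better quantized model
    examples = [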
        tokenizer(
            "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
        )
    ]
    model.quantize(examples, batch_size=1)
    save_dir = f"/kaggle/working/quantized_model/{bits}bit_quantized"
    model.save_quantized(save_dir)
    print('Model quantized !')
    

for bits in [8]:  # Bit widths to quantize to; edit the list in square brackets
    print(f"{bits} bits quantization ...")
    torch.cuda.empty_cache()
    quantization(model_name, bits)
    print(f"{bits} bits quantization finished !")

Llama loading...
Llama loaded !
8 bits quantization ...
Model quantization ...

INFO - Start quantizing layer 1/28
INFO - Quantizing self_attn.k_proj in layer 1/28...
INFO - Quantizing self_attn.v_proj in layer 1/28...
INFO - Quantizing self_attn.q_proj in layer 1/28...
INFO - Quantizing self_attn.o_proj in layer 1/28...
INFO - Quantizing mlp.up_proj in layer 1/28...
INFO - Quantizing mlp.gate_proj in layer 1/28...
INFO - Quantizing mlp.down_proj in layer 1/28...
INFO - Start quantizing layer 2/28
INFO - Quantizing self_attn.k_proj in layer 2/28...
INFO - Quantizing self_attn.v_proj in layer 2/28...
INFO - Quantizing self_attn.q_proj in layer 2/28...
INFO - Quantizing self_attn.o_proj in layer 2/28...
INFO - Quantizing mlp.up_proj in layer 2/28...
INFO - Quantizing mlp.gate_proj in layer 2/28...
INFO - Quantizing mlp.down_proj in layer 2/28...
INFO - Start quantizing layer 3/28
INFO - Quantizing self_attn.k_proj in layer 3/28...
INFO - Quantizing self_attn.v_proj in layer 3/28...
INFO - Quantizing self_attn.q_proj in layer 3/28...
INFO - Quantizing self_attn.o_pro

Model quantized !
8 bits quantization finished !
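
To illustrate what `bits` and `group_size` control, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization. It is deliberately not the GPTQ algorithm: GPTQ additionally uses calibration data to compensate for rounding error, which this sketch does not attempt. All function names and the random test weights are made up for the illustration.

```python
import random


def quantize_group(weights, bits):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    levels = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # constant group: avoid dividing by zero
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Map integer codes back to approximate float weights."""
    return [c * scale + w_min for c in codes]


def quantize_dequantize(weights, bits, group_size):
    """Quantize each group separately, then reconstruct the weights."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored.extend(dequantize_group(*quantize_group(group, bits)))
    return restored


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(256)]  # invented weights
for bits in (2, 4, 8):
    restored = quantize_dequantize(weights, bits, group_size=64)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits} bits: max abs error = {max_err:.4f}")
```

Each group stores its own `scale` and `w_min`, so a smaller `group_size` tracks the weights more closely at the cost of more stored metadata.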


* When the code above prints `Model quantized !`, run the code below to zip the quantized model
* Once the code below has run, save the notebook using the **Save Version** button at the top right of the interface
* This step is optional

In [None]:
!zip -r /kaggle/working/quantized_models.zip /kaggle/working/quantized_model
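
If the `zip` command is not available, the same archive can be produced from Python with the standard library. This is a sketch under assumptions: `zip_directory` is a hypothetical helper, and unlike the shell command above it stores paths relative to the parent folder (`quantized_model/...`) rather than the full `/kaggle/working/...` path.

```python
import shutil
from pathlib import Path


def zip_directory(source_dir: str, archive_base: str) -> str:
    """Zip source_dir into '<archive_base>.zip' and return the archive path."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")
    return shutil.make_archive(
        archive_base,
        "zip",
        root_dir=str(source.parent),  # archive paths are relative to this folder
        base_dir=source.name,         # only this folder is included
    )


# Paths taken from the notebook; adjust them if the model is stored elsewhere
model_dir = "/kaggle/working/quantized_model"
if Path(model_dir).is_dir():
    print(zip_directory(model_dir, "/kaggle/working/quantized_models"))
```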

* The code below evaluates a quantized model on the LAMBADA validation set
* Replace the path with the location of the quantized model folder in the Kaggle environment
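
The evaluation cell that follows removes the last word of each passage, asks the model for its next-token scores, and counts a hit when the removed word is among the top-k candidates. The sketch below reproduces that logic in plain Python, without a model. The vocabulary, scores and sentence are invented for illustration, and the helper names are hypothetical.

```python
def split_last_word(text: str):
    """Return (prompt, target), where target is the final whitespace-separated word."""
    words = text.split()
    return " ".join(words[:-1]), words[-1]


def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def is_top_k_hit(target, candidates):
    """True if target equals any candidate once surrounding whitespace is stripped."""
    return target in (c.strip() for c in candidates)


# Toy vocabulary and scores, invented for illustration only
vocab = [" cake", " table", " the", " dessert", " tower", " wedding"]
scores = [4.1, 3.0, 0.2, 2.5, 1.1, 1.9]

prompt, target = split_last_word("she cut a slice of the wedding cake")
candidates = [vocab[i] for i in top_k_indices(scores, k=5)]
print(prompt)
print(target, candidates, is_top_k_hit(target, candidates))
```

Because the comparison is against single tokens, a target word that the tokenizer splits into several tokens can never be counted as a hit, so the reported top-k accuracy is best read as a lower bound.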

In [5]:
# LAMBADA dataset loading
print('LAMBADA dataset loading ...')
dataset = load_dataset("lambada", split="validation")
print('LAMBADA dataset loaded !')

# Quantized model loading
accuracy_results = []
tokens_per_second_results = []
number_parameters_results = []

model = AutoGPTQForCausalLM.from_quantized('/kaggle/working/quantized_model/8bit_quantized') # Change the path here
model.to(device)

# Evaluation metrics
counter = 1
total_tokens = 0
k = 5
top_k_correct = 0
start_time = time.time()

# Model evaluation
print('Model evaluation ...')
for example in dataset:
    print("counter : " + str(counter))

    # Processing data
    sentence = example["text"]
    split_sentence = sentence.split()
    true_word = split_sentence.pop(-1)
    prompt = ' '.join(split_sentence)
        
    # Tokenize the prompt and run a single forward pass
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)

    # Take the k most likely next tokens after the prompt
    top_k_tokens = torch.topk(outputs.logits[0, -1, :], k=k).indices
    predicted_words = tokenizer.batch_decode(top_k_tokens)
    print('predicted_words : ' + str(predicted_words))

    # A hit requires the whole target word to be one of the k tokens,
    # so words split into several tokens are always counted as misses
    if true_word in [word.strip() for word in predicted_words]:
        top_k_correct += 1
    print(top_k_correct)
    
    total_tokens += inputs['input_ids'].size(1)
    print('total_tokens : ' + str(total_tokens))
    
    counter += 1
    print("time : " + str(time.time() - start_time))

print('Model evaluation finished !')

# Metrics computing
end_time = time.time()
elapsed = end_time - start_time
accuracy_top_k = top_k_correct / len(dataset)
tokens_per_second = total_tokens / elapsed
# Average wall-clock time per example, in seconds (printing overhead included)
inference_time = elapsed / len(dataset)
number_of_parameters = model.num_parameters()

accuracy_results.append(accuracy_top_k)
tokens_per_second_results.append(tokens_per_second)
number_parameters_results.append(number_of_parameters)

print(f"Top-{k} Accuracy: {accuracy_top_k * 100:.2f}%")
print(f"Speed: {tokens_per_second:.2f} Tokens/s")
print(f"Inference Time : {inference_time:.3f} s per example")
print(f"Number of parameters: {number_of_parameters:,}")

print("Results :")
print(f"Top-{k} Accuracy: {accuracy_results}")
print(f"Speed: {tokens_per_second_results}")
print(f"Number of parameters: {number_parameters_results}")

LAMBADA dataset loading ...

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
INFO - The layer lm_head is not quantized.


LAMBADA dataset loaded !
Model evaluation ...
counter : 1


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


predicted_words : [' cake', ' table', ' dessert', ' wedding', ' tower']
1
total_tokens : 106
time : 0.5491220951080322
counter : 2
predicted_words : [' my', " ''", ' row', ' little', ' princess']
1
total_tokens : 198
time : 0.9475893974304199
counter : 3
predicted_words : [' mat', ' other', ' mats', ' opposite', ' ground']
2
total_tokens : 267
time : 1.3458788394927979
counter : 4
predicted_words : [' a', ' an', ' trouble', ' another', ' the']
3
total_tokens : 363
time : 1.7461364269256592
counter : 5
predicted_words : [' children', ' arrows', ' kids', ' sons', ' two']
4
total_tokens : 446
time : 2.144033432006836
counter : 6
predicted_words : [' the', ' him', ' luc', ' my', ' his']
4
total_tokens : 544
time : 2.5590004920959473
counter : 7
predicted_words : [' fetch', ' find', ' do', ' get', ' use']
5
total_tokens : 629
time : 2.9579195976257324
counter : 8
predicted_words : [' ze', ' the', ' Zeus', ' z', ' god']
5
total_tokens : 716
time : 3.3554043769836426
counter : 9
predicted_wor