## Quantize a model and save it in quantized state

This notebook shows how to load the unquantized merged model with bitsandbytes 4-bit quantization and save it as a quantized model.

In [1]:
%%writefile quantize_model.py
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Path of the unquantized merged model and target path for the quantized copy
model_path = 'output/llama-3.2-1b-instruct-guanaco-fsdp_merged'
quant_path = model_path + '_bnb'

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize the weights to 4-bit NF4 while loading; computations run in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
)

# Save the quantized weights together with the tokenizer
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Quantized model saved to "{quant_path}".')

Writing quantize_model.py
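
Once the job further below has finished, one way to check that the saved model really is in a quantized state is to look at its `config.json`. The following is a minimal sketch, assuming that `save_pretrained` stores the bitsandbytes settings under a `quantization_config` key with a `load_in_4bit` entry; the helper names `read_quantization_config` and `looks_4bit_quantized` are hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` dict from config.json, or None if absent."""
    config_file = Path(model_dir) / "config.json"
    if not config_file.is_file():
        return None  # nothing has been saved at this location (yet)
    with config_file.open() as f:
        config = json.load(f)
    # Assumption: quantized checkpoints record their settings under this key.
    return config.get("quantization_config")


def looks_4bit_quantized(model_dir):
    """Return True if the saved config records 4-bit quantization."""
    quant_config = read_quantization_config(model_dir)
    return bool(quant_config and quant_config.get("load_in_4bit"))


# Same location as `quant_path` in quantize_model.py
quant_path = "output/llama-3.2-1b-instruct-guanaco-fsdp_merged_bnb"
print(looks_4bit_quantized(quant_path))
```

Before the job has run, the directory does not exist yet and the check simply reports `False`.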


In [2]:
%%writefile run_quantize_model.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 CPU cores per requested GPU
## Leonardo Booster: 512 GB of memory per node => request approx. 120 GB per requested GPU
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:10:00

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Construct command to run container:
export CONTAINER="singularity run --nv --home=$HOME $SINGULARITY_CONTAINER"

# Run AI scripts:
$CONTAINER python3 quantize_model.py

Writing run_quantize_model.slurm
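
The resource comments in the job script follow a simple rule of thumb: 8 CPU cores and about 120 GB of memory per requested GPU, with at most 4 GPUs per node. As a small sketch of that arithmetic, the hypothetical helpers `leonardo_resources` and `sbatch_lines` below (not part of the original notebook) compute the matching `#SBATCH` values for a given number of GPUs.

```python
def leonardo_resources(gpus):
    """Return Slurm resource values for `gpus` GPUs on one Leonardo Booster node."""
    if not 1 <= gpus <= 4:
        raise ValueError("a Leonardo Booster node has between 1 and 4 GPUs to request")
    return {
        "gpus_per_task": gpus,
        "cpus_per_task": 8 * gpus,  # 32 cores / 4 GPUs
        "mem_gb": 120 * gpus,       # approx. 512 GB / 4 GPUs, leaving some headroom
    }


def sbatch_lines(gpus):
    """Format the resource values as #SBATCH directives."""
    res = leonardo_resources(gpus)
    return [
        f"#SBATCH --gpus-per-task={res['gpus_per_task']}",
        f"#SBATCH --cpus-per-task={res['cpus_per_task']}",
        f"#SBATCH --mem={res['mem_gb']}GB",
    ]


print("\n".join(sbatch_lines(1)))
```

For a single GPU this reproduces the values used in `run_quantize_model.slurm`.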


Now submit the SLURM job:

In [3]:
!sbatch --job-name=$TRAINEE_USERNAME run_quantize_model.slurm

Submitted batch job 19798862
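
The job ID printed by `sbatch` also determines the name of the output file, because Slurm writes to `slurm-<JOBID>.out` by default when the script sets no `--output` option. A minimal sketch of extracting the ID and building that file name could look as follows; `parse_job_id` and `slurm_output_file` are hypothetical helper names.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from the message printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


def slurm_output_file(job_id):
    """Default Slurm output file name (assumes no --output option is set)."""
    return f"slurm-{job_id}.out"


job_id = parse_job_id("Submitted batch job 19798862")
print(slurm_output_file(job_id))
```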


Run `squeue` to check whether your job is already running:

In [4]:
!squeue --name=$TRAINEE_USERNAME 

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19798862 boost_usr   martin mpfister  R       0:03      1 lrdn1935
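
In the `ST` column, `R` means the job is running and `PD` means it is still pending in the queue. If you want to read the state programmatically, the sketch below parses the table shown above; `job_states` is a hypothetical helper and assumes the default column layout with no spaces inside the job name.

```python
def job_states(squeue_output):
    """Map each JOBID in a squeue table to its state code from the ST column."""
    lines = [line for line in squeue_output.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    id_col = header.index("JOBID")
    st_col = header.index("ST")
    states = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > max(id_col, st_col):  # skip incomplete rows
            states[fields[id_col]] = fields[st_col]
    return states


sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19798862 boost_usr   martin mpfister  R       0:03      1 lrdn1935
"""
print(job_states(sample))
```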


Once your job is running, look at the output of the job using the following command (replace the number with the JOBID from above):

In [6]:
!cat slurm-19798862.out

+ date
Wed Sep 10 11:32:34 CEST 2025
+ hostname
lrdn1935.leonardo.local
+ nvidia-smi
Wed Sep 10 11:32:34 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB           On  | 00000000:8F:00.0 Off |                    0 |
| N/A   43C    P0              59W / 449W |      2MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+-------

If you want to, you can also delete the files that we created above:

In [6]:
!rm quantize_model.py run_quantize_model.slurm slurm-*.out