## Quantize a model and save it in quantized state

This notebook shows how to load the unquantized merged model with bitsandbytes quantization and save it in quantized form.
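To get a feeling for why saving the quantized weights is worthwhile, here is a rough back-of-the-envelope estimate of the weight size before and after quantization. It is only a sketch under stated assumptions: the parameter count of 3.8 billion for Phi-3.5-mini, a quantization block size of 64 and one 32-bit scaling constant per block are assumptions, and the helper `estimate_weight_gib` is a hypothetical name introduced for illustration.

```python
def estimate_weight_gib(n_params: float, bits_per_param: float) -> float:
    """Return the approximate size of the model weights in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: Phi-3.5-mini has roughly 3.8 billion parameters.
N_PARAMS = 3.8e9

# bfloat16 stores every weight in 16 bits.
bf16 = estimate_weight_gib(N_PARAMS, 16)

# A 4-bit format stores 4 bits per weight plus one scaling constant per block.
# Assumption: block size 64 and a 32-bit constant, i.e. 32 / 64 = 0.5 extra bits.
nf4 = estimate_weight_gib(N_PARAMS, 4 + 32 / 64)

print(f"bf16: {bf16:.1f} GiB, 4-bit: {nf4:.1f} GiB")
```

The folder written by the script will most likely be somewhat larger than this estimate, because usually not every layer is quantized (the embeddings, for example, typically stay in higher precision).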

In [1]:
%%writefile quantize_model.py
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = 'output/phi-3.5-mini-instruct-guanaco-fsdp_merged'
quant_path = model_path + '_bnb'

tokenizer = AutoTokenizer.from_pretrained(model_path)

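# Load the merged model and quantize its weights to 4 bit while loading
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype=torch.bfloat16,  # dtype for the modules that are not quantized
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat data type
        bnb_4bit_compute_dtype=torch.bfloat16  # dtype used for computations
    )
)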

model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Quantized model saved to "{quant_path}".')

Overwriting quantize_model.py


In [2]:
%%writefile run_quantize_model.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 CPU cores per requested GPU
## Leonardo Booster: 512 GB of memory per node => request approx. 120 GB per requested GPU
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:10:00

# Load conda:
module purge
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin29

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Run AI scripts:
python3 quantize_model.py

Overwriting run_quantize_model.slurm
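The resource comments in the SLURM script follow a simple rule of thumb: 8 CPU cores and about 120 GB of memory per requested GPU, with at most 4 GPUs per node. The following sketch turns that rule into a small helper. The function names `leonardo_resources` and `sbatch_lines` are hypothetical and not part of any library.

```python
def leonardo_resources(gpus: int) -> dict:
    """Apply the rule of thumb from the SLURM script for one Leonardo Booster node."""
    if not 1 <= gpus <= 4:
        raise ValueError("A Leonardo Booster node has 4 GPUs, so gpus must be 1 to 4.")
    return {
        "gpus_per_task": gpus,
        "cpus_per_task": 8 * gpus,  # 32 cores / 4 GPUs
        "mem_gb": 120 * gpus,       # approx. 512 GB / 4 GPUs, with some headroom
    }


def sbatch_lines(gpus: int) -> list:
    """Format the resources as #SBATCH directives like the ones in the script."""
    res = leonardo_resources(gpus)
    return [
        f"#SBATCH --gpus-per-task={res['gpus_per_task']}",
        f"#SBATCH --mem={res['mem_gb']}GB",
        f"#SBATCH --cpus-per-task={res['cpus_per_task']}",
    ]


print("\n".join(sbatch_lines(1)))
```

For a single GPU this reproduces the values used in the script above; for more GPUs the numbers scale linearly.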


Now submit the SLURM job:

In [3]:
!sbatch run_quantize_model.slurm

Submitted batch job 16602244


Run `squeue` to see whether your job is already running:

In [4]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16602244 boost_usr run_quan mpfister CF       0:02      1 lrdn2589
          16599181 boost_usr jupyterl mpfister  R    1:38:29      1 lrdn2012
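The `ST` column holds the job state: `CF` means the job is still being configured, `R` means it is running, and `PD` means it is pending. If you would rather check the state from Python than read the table by eye, a minimal sketch could look like this. It assumes the default `squeue` column layout shown above and job names without spaces; `parse_squeue` and `job_state` are hypothetical helper names.

```python
from typing import Optional

# Sample text copied from the squeue output above. In the notebook you could
# obtain it with subprocess.run(["squeue", "--me"], capture_output=True, text=True).stdout
SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16602244 boost_usr run_quan mpfister CF       0:02      1 lrdn2589
          16599181 boost_usr jupyterl mpfister  R    1:38:29      1 lrdn2012
"""


def parse_squeue(text: str) -> list:
    """Turn the squeue table into a list of dicts keyed by the column names."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # The last column may contain spaces, so split at most len(header) - 1 times.
        fields = line.strip().split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


def job_state(text: str, job_id: str) -> Optional[str]:
    """Return the state code of a job, or None if it is no longer in the queue."""
    for job in parse_squeue(text):
        if job["JOBID"] == job_id:
            return job["ST"]
    return None


print(job_state(SAMPLE, "16602244"))
```

A return value of `None` means that the job is not listed any more, which usually indicates that it has finished.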


Once your job is running, view its output with the following command (replace the number with the JOBID from above):

In [5]:
!cat slurm-16602244.out

+ date
Tue Jun 10 18:45:26 CEST 2025
+ hostname
lrdn2589.leonardo.local
+ nvidia-smi
Tue Jun 10 18:45:26 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB            On | 00000000:1D:00.0 Off |                    0 |
| N/A   43C    P0               65W / 475W|      0MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+-------
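After the job has finished, you may want to check that the saved checkpoint really carries the quantization settings. One way to do that, sketched here, is to read the `config.json` file in the output folder. That the settings are stored there under the key `quantization_config` is an assumption about how the library serializes a quantized model, and `read_quantization_config` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the quantization settings stored in config.json, or None if absent."""
    config_file = Path(model_dir) / "config.json"
    with open(config_file) as f:
        config = json.load(f)
    # Assumption: quantized checkpoints store their settings under this key.
    return config.get("quantization_config")


# Same path as quant_path in quantize_model.py
quant_path = "output/phi-3.5-mini-instruct-guanaco-fsdp_merged_bnb"

# Only inspect the folder if the job has already written it.
if (Path(quant_path) / "config.json").exists():
    print(read_quantization_config(quant_path))
```

If the function returns `None`, the folder probably contains an unquantized checkpoint.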

If you want to, you can also delete the files created above:

In [6]:
!rm quantize_model.py run_quantize_model.slurm slurm-*.out