## Gradio

[Gradio](https://www.gradio.app) lets you build simple web interfaces for your software. In this example, we use Gradio to provide a simple chat interface to a large language model.

In [1]:
%%writefile gradio_example.py
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
import gradio as gr
import os
import random
import socket
import sys

ip = socket.gethostbyname(socket.gethostname())
hostname = socket.gethostname().split('.')[0]
port = random.randint(10000, 50000)

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# ! Change the trainee user name to the name in your personal URL:      !
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
trainee_user = 'trainee01'

print('Open the following URL in your web browser:')
print(f'https://hpctraining.org/{trainee_user}/proxy/absolute/{hostname}:{port}/')
print('')
sys.stdout.flush()

model_name = '/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/microsoft--phi-3.5-mini-instruct'
# model_name = 'microsoft/Phi-3.5-mini-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
)
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)
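def get_answer(question, history):
    # Gradio passes the previous messages as `history`. Build a new list
    # instead of mutating it; a mutable default such as `history=[]` would
    # be shared between calls.
    messages = list(history) + [{'role': 'user', 'content': question}]
    result = pipe(messages, max_new_tokens=500, return_full_text=False)
    return result[0]['generated_text'].strip()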
    # return question

chat_interface = gr.ChatInterface(get_answer, type='messages')
chat_interface.launch(share=False, server_name=ip, server_port=port, root_path=f'/{trainee_user}/proxy/absolute/{hostname}:{port}')

Overwriting gradio_example.py
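
The script above picks a random port and builds both the public URL and Gradio's `root_path` from the same three values (user name, host name, port). A randomly chosen port may already be in use on a shared node, so the following is a minimal sketch of an alternative that asks the operating system for a free port. The helper names `find_free_port` and `proxy_paths` are hypothetical and not part of Gradio; the URL layout is copied from the script.

```python
import socket


def find_free_port(host=''):
    """Ask the operating system for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


def proxy_paths(trainee_user, hostname, port, base_url='https://hpctraining.org'):
    """Return (root_path, public_url) in the form used by gradio_example.py."""
    root_path = f'/{trainee_user}/proxy/absolute/{hostname}:{port}'
    return root_path, f'{base_url}{root_path}/'


port = find_free_port()
root_path, url = proxy_paths('trainee01', 'lrdn0151', port)
print(url)
```

Note that the port is released again before Gradio binds to it, so another process could still grab it in between, and the port chosen by the OS may lie outside the 10000 to 50000 range used in the script. Whether the training proxy accepts every port is an assumption to check for your site.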


In [2]:
%%writefile run_gradio_example.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 CPU cores per requested GPU
## Leonardo Booster: 512 GB in total => request approx. 120 GB per requested GPU
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:10:00

# Load conda:
module purge
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin24

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Run AI scripts:
python3 gradio_example.py

Overwriting run_gradio_example.slurm
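
The comments in the job script give a rule of thumb for scaling CPU cores and memory with the number of GPUs. As a small sketch of that arithmetic (the function name `leonardo_resources` is hypothetical), the matching `#SBATCH` values for a given GPU count can be computed like this:

```python
def leonardo_resources(gpus):
    """Scale CPU cores and memory with the number of GPUs (1 to 4 per node)."""
    if not 1 <= gpus <= 4:
        raise ValueError('A Leonardo Booster node has 4 GPUs')
    return {
        '--gpus-per-task': gpus,
        '--cpus-per-task': 8 * gpus,  # 32 cores / 4 GPUs
        '--mem': f'{120 * gpus}GB',   # approx. 512 GB / 4 GPUs
    }


for option, value in leonardo_resources(2).items():
    print(f'#SBATCH {option}={value}')
```

For two GPUs this yields 16 CPU cores and 240GB of memory.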


Now submit the SLURM job:

In [3]:
!sbatch run_gradio_example.slurm

Submitted batch job 12959420
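
The job ID in this message is needed for the following steps. If you prefer not to copy it by hand, a minimal sketch for extracting it from the confirmation line could look as follows (the helper name `parse_job_id` is hypothetical):

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation line."""
    match = re.search(r'Submitted batch job (\d+)', sbatch_output)
    if match is None:
        raise ValueError(f'No job ID found in: {sbatch_output!r}')
    return match.group(1)


job_id = parse_job_id('Submitted batch job 12959420')
print(job_id)
```

Alternatively, `sbatch --parsable` prints the job ID without the surrounding text.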


Run `squeue` to check whether your job is already running:

In [4]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12959420 boost_usr run_grad mpfister PD       0:00      1 (None)
          12952669 boost_usr jupyterl mpfister  R    5:29:26      1 lrdn3456
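
In the `ST` column, `PD` means the job is still pending and `R` means it is running. The following sketch reads that column from `squeue` text output; the function name `job_states` is hypothetical, and the parsing assumes the default column layout shown above with no spaces inside the fields.

```python
def job_states(squeue_output):
    """Map each JOBID to its state code (ST column) from `squeue` text output."""
    lines = squeue_output.strip().splitlines()
    header = lines[0].split()
    id_col, st_col = header.index('JOBID'), header.index('ST')
    states = {}
    for line in lines[1:]:
        fields = line.split()
        states[fields[id_col]] = fields[st_col]
    return states


sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12959420 boost_usr run_grad mpfister PD       0:00      1 (None)
          12952669 boost_usr jupyterl mpfister  R    5:29:26      1 lrdn3456
"""
states = job_states(sample)
print(states['12959420'])  # PD: the Gradio job is still pending
```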


Once your job is running, look at the output of the job using the following command (replace the number with the JOBID from above):

In [7]:
!cat slurm-12959420.out

+ date
Mon Feb 24 21:39:29 CET 2025
+ hostname
lrdn0151.leonardo.local
+ nvidia-smi
Mon Feb 24 21:39:29 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB            On | 00000000:56:00.0 Off |                    0 |
| N/A   43C    P0               63W / 477W|      0MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+--------
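
Further down in the same file, `gradio_example.py` prints the URL to open. The sketch below searches a job output file for the first `https://` URL; the helper name `find_url` is hypothetical, and the demonstration uses a stand-in log file with a made-up port number instead of real job output.

```python
import re
import tempfile
from pathlib import Path


def find_url(log_path):
    """Return the first https:// URL in a job output file, or None if absent."""
    text = Path(log_path).read_text(errors='replace')
    match = re.search(r'https://\S+', text)
    return match.group(0) if match else None


# Demonstration with a stand-in log file; the port 12345 is made up.
sample_log = (
    '+ python3 gradio_example.py\n'
    'https://hpctraining.org/trainee01/proxy/absolute/lrdn0151:12345/\n'
)
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / 'slurm-example.out'
    log_file.write_text(sample_log)
    url = find_url(log_file)
print(url)
```

With a real job you would call `find_url('slurm-12959420.out')`, using your own job ID. The script prints the URL before it loads the model, so the page may only respond once loading has finished and Gradio has started.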

Finally, when you are finished, please cancel the SLURM job to free the resources:

In [8]:
!scancel 12959420

If you want to, you can also delete the files that we created above:

In [9]:
!rm gradio_example.py run_gradio_example.slurm slurm-*.out