# Fine-tuning the Voila chat model on CoVoST 2 (Chinese to English)

For this extension of the assignment I chose to fine-tune the multimodal chat model [Voila](https://huggingface.co/maitrix-org/Voila-base). According to its paper, this model cannot perform **speech translation** out of the box, but it can after fine-tuning. Because this is a new task for the model, I use the entire training set. This notebook lives in the repository https://github.com/psegmar1/Voila-finetuning, which is a fork of the Voila repo.

## Import all required libraries and dependencies

In [1]:
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer, default_data_collator
from whisper_normalizer.basic import BasicTextNormalizer
from torch.nn import CrossEntropyLoss
from datasets import load_dataset
from evaluate import load
import librosa
import torch
import copy

from voila_tokenizer import VoilaTokenizer
from model import VoilaModel


## Version and device

In [2]:
print("Torch version: ", torch.__version__)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device: ", device)

Torch version:  2.5.0+cu124
Device:  cuda


## Load the dataset

To stay consistent with the other notebooks, the dev set is limited to 1,000 samples.

In [3]:
DEV_SIZE = 1_000
seed = 42

raw_datasets = load_dataset("facebook/covost2", 'zh-CN_en', data_dir="/home/alumno.upv.es/psegmar1/TA_A3/cv-corpus-24.0-2025-12-05/zh-CN", trust_remote_code=True)

train_dataset = raw_datasets['train']
dev_dataset = (
    raw_datasets["validation"]
    .shuffle(seed=seed)
    .select(range(DEV_SIZE))
)

## Load the model

In [4]:
CHECKPOINT = "maitrix-org/Voila-chat"

quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
)

model = VoilaModel.from_pretrained(
    CHECKPOINT,
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)

model = model.to(device)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


## LLM backbone tokenizer and audio tokenizer

### Load both tokenizers

In [5]:
VOILA_TOKENIZER_PATH = "maitrix-org/Voila-Tokenizer"

tokenizer_llm_backbone = AutoTokenizer.from_pretrained(CHECKPOINT)
tokenizer_voila = VoilaTokenizer(model_path=VOILA_TOKENIZER_PATH, device="cuda")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



### Add new tokens

Voila's LLM (from the Llama family) is trained with task prefixes that are special tokens. For this reason, I add three new tokens to the LLM tokenizer: one for the task itself, one that indicates the source language, and one that closes the task prefix.

In [6]:
new_tokens = ['<s2tt>', '<zh>', '<s2tt_text_output>']
tokenizer_llm_backbone.add_special_tokens({'additional_special_tokens': new_tokens})

3
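
As a rough sketch of where the three new tokens end up, the snippet below composes the prompt prefix the same way the tokenization step of this notebook does. The token strings are copied from the notebook's constants; the helper name `build_s2tt_prompt` and the toy audio-token string are hypothetical and used only for illustration.

```python
# Token strings copied from the notebook's constants.
SYSTEM_START, SYSTEM_END = "<SYSTEM>", "</SYSTEM>"
TASK, SRC_LANG, TASK_END = "<s2tt>", "<zh>", "<s2tt_text_output>"
HUMAN, ASSISTANT = "<|HUMAN|>", "<|VOILA|>"


def build_s2tt_prompt(audio_token_str: str) -> str:
    """Hypothetical helper: task prefix, then the audio turn, then the assistant cue."""
    system = SYSTEM_START + TASK + SRC_LANG + TASK_END + SYSTEM_END
    return system + HUMAN + audio_token_str + ASSISTANT


# "<|17|><|42|>" is a made-up audio-token string for one codebook.
prompt = build_s2tt_prompt("<|17|><|42|>")
print(prompt)
```

The model then generates the English translation right after the `<|VOILA|>` cue, so the new tokens only ever appear in the input and never have to be predicted.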

### Tokenize the training and validation sets for training

In [7]:
DEFAULT_SYSTEM_START_TOKEN = "<SYSTEM>"
DEFAULT_SYSTEM_END_TOKEN   = "</SYSTEM>"
TASK_S2TT_TOKEN = "<s2tt>"
TASK_S2TT_END_TOKEN = "<s2tt_text_output>"
S2TT_CHINESE = "<zh>"
DEFAULT_HUMAN_TOKEN = "<|HUMAN|>"
DEFAULT_ASSISTANT_TOKEN = "<|VOILA|>"
AUDIO_TOKEN_FORMAT = "<|{}|>"
AUDIO_SR = 16000

MAK_TOKEN_LEN = 512
PAD_TOKEN_ID = tokenizer_llm_backbone.pad_token_id
EOS_TOKEN_ID = tokenizer_llm_backbone.eos_token_id

def _wrapper_audio_tokens(audio_tokens, num_codebooks, codebook_size):
    ret_audio_tokens = []
    for n in range(num_codebooks):
        audio_token = audio_tokens[n]
        ret_audio_tokens.append(''.join([AUDIO_TOKEN_FORMAT.format(au + n*codebook_size) if isinstance(au, int) else au for au in audio_token]))
    return ret_audio_tokens

def tokenize_for_training(samples):
    num_codebooks = model.config.num_codebooks
    codebook_size = model.config.codebook_size

    system = DEFAULT_SYSTEM_START_TOKEN + TASK_S2TT_TOKEN + S2TT_CHINESE + TASK_S2TT_END_TOKEN + DEFAULT_SYSTEM_END_TOKEN

    rv_input_ids = []
    rv_label_ids = []
    rv_attention_masks = []

    total_samples = len(samples["file"])

    for i in range(total_samples):

        # Copy into num_codebooks input ids
        input_ids_list = []
        for _ in range(num_codebooks):
            input_ids_list.append([])

        audio, _ = librosa.load(samples['file'][i], sr=AUDIO_SR)

        # get audio token
        audio_tokens = tokenizer_voila.encode(audio, sr=AUDIO_SR)
        audio_tokens = audio_tokens.cpu().numpy().tolist()
        audio_tokens = _wrapper_audio_tokens(audio_tokens, num_codebooks, codebook_size)
        
        labels = tokenizer_llm_backbone(samples['translation'][i], add_special_tokens=False)
        base_label_ids = labels["input_ids"] + [EOS_TOKEN_ID]

        sample_attention_mask = []
        sample_labels = []
        set_attention_mask_and_labels = True
        for n in range(num_codebooks):
            content = system + DEFAULT_HUMAN_TOKEN + audio_tokens[n] + DEFAULT_ASSISTANT_TOKEN
            content_ids = tokenizer_llm_backbone.encode(content, add_special_tokens=False, truncation=True,
                                    max_length=tokenizer_llm_backbone.model_max_length)
            
            # base_label_ids already ends with EOS, so the target carries two EOS tokens
            label_input_ids = base_label_ids + [EOS_TOKEN_ID]
            model_inputs_ids = content_ids + label_input_ids
            
            if set_attention_mask_and_labels:
                label_input_ids = [PAD_TOKEN_ID] * len(content_ids) + label_input_ids
                sample_attention_mask = [1] * len(model_inputs_ids)
                sample_attention_mask = [0] * (MAK_TOKEN_LEN - len(model_inputs_ids)) + sample_attention_mask
                # Left padding
                sample_labels =  [PAD_TOKEN_ID] * (
                    MAK_TOKEN_LEN - len(label_input_ids)
                ) + label_input_ids
                set_attention_mask_and_labels = False

            # Left padding
            model_inputs_ids =  [PAD_TOKEN_ID] * (
                MAK_TOKEN_LEN - len(model_inputs_ids)
            ) + model_inputs_ids
            
            input_ids_list[n] += copy.deepcopy(model_inputs_ids)

        for n in range(num_codebooks):
            input_ids_list[n] = input_ids_list[n][:tokenizer_llm_backbone.model_max_length]

        # To get [seq_len, num_codebooks]
        transposed_inputs = list(map(list, zip(*input_ids_list)))
        
        rv_input_ids.append(transposed_inputs)
        rv_label_ids.append(sample_labels)
        rv_attention_masks.append(sample_attention_mask)

    return {"input_ids": rv_input_ids, "labels": rv_label_ids, "attention_mask": rv_attention_masks}


tokenized_train = train_dataset.map(
    tokenize_for_training,
    batched=True
)

tokenized_dev = dev_dataset.map(
    tokenize_for_training,
    batched=True
)

# Tokenization only truncates at the model's maximum length, so examples longer than MAK_TOKEN_LEN survive it and are discarded here
tokenized_train = tokenized_train.filter(lambda x: len(x["input_ids"]) <= MAK_TOKEN_LEN and len(x["labels"]) <= MAK_TOKEN_LEN , desc=f"Discarding input with more than {MAK_TOKEN_LEN} tokens")
tokenized_dev = tokenized_dev.filter(lambda x: len(x["input_ids"]) <= MAK_TOKEN_LEN and len(x["labels"]) <= MAK_TOKEN_LEN , desc=f"Discarding input with more than {MAK_TOKEN_LEN} tokens")




Map: 100%|██████████| 7085/7085 [01:31<00:00, 77.54 examples/s]
Map: 100%|██████████| 1000/1000 [00:12<00:00, 77.36 examples/s]
Discarding input with more than 512 tokens: 100%|██████████| 7085/7085 [00:33<00:00, 209.21 examples/s]
Discarding input with more than 512 tokens: 100%|██████████| 1000/1000 [00:05<00:00, 191.98 examples/s]
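
The padding and masking logic inside `tokenize_for_training` can be hard to follow because it is interleaved with the codebook loop. The sketch below isolates it for a single codebook, under simplifying assumptions: the ids are toy values, and `left_pad` and `build_example` are hypothetical helper names, not part of the Voila code base.

```python
def left_pad(ids, target_len, pad_id):
    """Pad on the left up to target_len; longer sequences are returned unchanged."""
    return [pad_id] * (target_len - len(ids)) + ids


def build_example(prompt_ids, target_ids, max_len, pad_id):
    """Concatenate prompt and target, mask the prompt in the labels, then left-pad."""
    input_ids = prompt_ids + target_ids
    # Prompt positions are replaced by pad_id so the loss can ignore them.
    labels = [pad_id] * len(prompt_ids) + target_ids
    attention = [1] * len(input_ids)
    return {
        "input_ids": left_pad(input_ids, max_len, pad_id),
        "labels": left_pad(labels, max_len, pad_id),
        "attention_mask": left_pad(attention, max_len, 0),
    }


# Toy ids: a 3-token prompt, a 2-token target, padded to length 8.
ex = build_example(prompt_ids=[5, 6, 7], target_ids=[8, 9], max_len=8, pad_id=0)
print(ex)
```

Because multiplying a list by a negative number yields an empty list, an over-long example is left unpadded and keeps its original length, which is what lets the subsequent `filter` call discard it.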




## Preparing for fine-tuning

Because the model is quantized to 4 bits, the PEFT technique used here is QLoRA.

In [8]:
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False, gradient_checkpointing_kwargs={'use_reentrant':False})

# Resizing the embedding matrix is not necessary here because len(tokenizer_llm_backbone) < vocab_size.
# The lm_head is not retrained either, because the added special tokens are never predicted.
# https://mohitmayank.com/a_lazy_data_science_guide/machine_learning/ML_snippets/#lora-with-selective-token-training
# model.resize_token_embeddings(len(tokenizer_llm_backbone))

config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    inference_mode=False,
    target_modules=["q_proj", "k_proj"],
    trainable_token_indices={
        'embed_tokens': tokenizer_llm_backbone.convert_tokens_to_ids(new_tokens)
    },
)

lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 3,420,160 || all params: 8,162,145,280 || trainable%: 0.0419
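
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes Llama-3-8B-style backbone dimensions (hidden size 4096, 32 layers, 8 key/value heads of size 128); these values are not stated in the notebook and are an assumption, although they do reproduce the printed figure.

```python
# Assumed backbone dimensions (Llama-3-8B-style); not stated in the notebook.
HIDDEN = 4096
NUM_LAYERS = 32
KV_DIM = 8 * 128  # 8 key/value heads of size 128 (grouped-query attention)

R = 8             # LoRA rank from LoraConfig
NEW_TOKENS = 3    # <s2tt>, <zh>, <s2tt_text_output>


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)


per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)  # q_proj + k_proj
lora_total = per_layer * NUM_LAYERS
token_total = NEW_TOKENS * HIDDEN  # one trainable embedding row per new token
trainable = lora_total + token_total
print(f"{trainable:,}")  # 3,420,160
```

Under these assumptions almost all trainable weights sit in the LoRA matrices; the three trainable embedding rows contribute only 12,288 parameters.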


### More information about the model

In [9]:
def get_peft_model_sizes(model):
    total_params = sum(p.numel() for p in model.parameters())

    params_grad = 0
    params_grad_names = []
    for name, param in model.named_parameters():
        if param.requires_grad:
            params_grad += param.numel()
            params_grad_names.append(name)


    return {
        'total_params': total_params,
        'params_grad': params_grad,
        'params_grad_name': params_grad_names
    }

model_sizes = get_peft_model_sizes(lora_model)
print(model_sizes)

print(lora_model)

{'total_params': 4643124224, 'params_grad': 3420160, 'params_grad_name': ['base_model.model.model.embed_tokens.token_adapter.trainable_tokens_delta.default', 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.1.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.1.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.2.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.laye

### Define the loss function

The base code does not compute the loss for VoilaModel, so I define a function for it and pass it to the method in charge of training. I preferred to keep it here so that this notebook is reasonably self-contained. I did, however, make some changes in the file 'model.py'.

In [10]:
def compute_loss_func(outputs, labels, num_items_in_batch=None):
    logits = outputs.logits
    
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    loss_fct = CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)

    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)), 
        shift_labels.view(-1)
    )

    return loss
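
To make the shift in `compute_loss_func` concrete, here is a minimal pure-Python illustration of the same idea: the scores at position t are compared with the label at position t+1, and labels equal to the ignore index are skipped. It is a sketch with toy numbers, not the PyTorch implementation; the function name `shifted_cross_entropy` is hypothetical.

```python
import math


def shifted_cross_entropy(logits, labels, ignore_index):
    """Mean next-token cross-entropy over the labels that are not ignored.

    logits: one list of vocabulary scores per position.
    labels: one token id per position.
    """
    losses = []
    # Position t predicts the token at position t + 1.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # padding and prompt positions do not contribute
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses) if losses else float("nan")


# Toy setup: vocabulary of 3 tokens, id 0 plays the role of the pad token.
logits = [[0.0, 0.0, 0.0]] * 4   # uniform scores at every position
labels = [0, 0, 2, 1]            # two masked positions, then two targets
loss = shifted_cross_entropy(logits, labels, ignore_index=0)
print(loss)
```

With uniform scores over three tokens, each counted position contributes log(3), so the mean is log(3) as well. The scores at the last position are never used because there is no following label.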

## Training

In [11]:
batch_size = 4
gradient_accumulation_steps = 8

model_name = CHECKPOINT.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-s2tt-zh-to-en",
    save_strategy = "epoch",
    eval_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    weight_decay=0.01,
    num_train_epochs=10,
    report_to="none",
    # If True => KeyError: "The `metric_for_best_model` training argument is set to 'eval_loss', which is not found in the evaluation metrics. The available evaluation metrics are: []. Consider changing the `metric_for_best_model` via the TrainingArguments."
    load_best_model_at_end=False,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit"
)

trainer = Trainer(
    lora_model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=default_data_collator,
    compute_loss_func=compute_loss_func
)

# Save the LLM tokenizer with the new tokens
tokenizer_llm_backbone.save_pretrained("Voila-chat-tokenizer-s2tt")

trainer.train()

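| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | No log |
| 2 | No log | No log |
| 3 | 2.920600 | No log |
| 4 | 2.920600 | No log |
| 5 | 2.525900 | No log |
| 6 | 2.525900 | No log |
| 7 | 2.379300 | No log |
| 8 | 2.379300 | No log |
| 9 | 2.296500 | No log |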


TrainOutput(global_step=2210, training_loss=2.50663026835584, metrics={'train_runtime': 15842.459, 'train_samples_per_second': 4.472, 'train_steps_per_second': 0.139, 'total_flos': 6.589262311364493e+18, 'train_loss': 2.50663026835584, 'epoch': 9.957110609480813})
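
The `global_step` value can be sanity-checked with a little arithmetic. The sketch below assumes a single GPU, that the length filter removed at most a handful of the 7,085 training examples (the notebook does not print the size after filtering), and that the `Trainer` version used here counts optimizer steps per epoch with a floor division.

```python
import math

num_examples = 7085       # train split size shown in the Map progress output
per_device_batch = 4      # per_device_train_batch_size
grad_accum = 8            # gradient_accumulation_steps
epochs = 10               # num_train_epochs

effective_batch = per_device_batch * grad_accum                  # examples per optimizer step
batches_per_epoch = math.ceil(num_examples / per_device_batch)   # dataloader length
updates_per_epoch = batches_per_epoch // grad_accum              # assumed floor division
total_updates = updates_per_epoch * epochs

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
```

Under these assumptions the run performs 221 optimizer steps per epoch with an effective batch of 32 examples, which gives the 2,210 steps reported in `TrainOutput`.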

## Reload the dataset and tokenize the dev and test sets

To stay consistent with the other notebooks, the dev and test sets are limited to 1,000 samples each.

In [12]:
TEST_SIZE = 1_000

raw_datasets = load_dataset("facebook/covost2", 'zh-CN_en', data_dir="/home/alumno.upv.es/psegmar1/TA_A3/cv-corpus-24.0-2025-12-05/zh-CN", trust_remote_code=True)

dev_dataset = (
    raw_datasets["validation"]
    .shuffle(seed=seed)
    .select(range(DEV_SIZE))
)

test_dataset = (
    raw_datasets["test"]
    .shuffle(seed=seed)
    .select(range(TEST_SIZE))
)

def tokenize_for_inference(samples):
    num_codebooks = model.config.num_codebooks
    codebook_size = model.config.codebook_size

    system = DEFAULT_SYSTEM_START_TOKEN + TASK_S2TT_TOKEN + S2TT_CHINESE + TASK_S2TT_END_TOKEN + DEFAULT_SYSTEM_END_TOKEN

    rv_input_ids = []

    total_samples = len(samples["file"])

    for i in range(total_samples):

        # Copy into num_codebooks input ids
        input_ids_list = []
        for _ in range(num_codebooks):
            input_ids_list.append([])

        audio, _ = librosa.load(samples['file'][i], sr=AUDIO_SR)

        # get audio token
        audio_tokens = tokenizer_voila.encode(audio, sr=AUDIO_SR)
        audio_tokens = audio_tokens.cpu().numpy().tolist()
        audio_tokens = _wrapper_audio_tokens(audio_tokens, num_codebooks, codebook_size)
        
        for n in range(num_codebooks):
            content = system + DEFAULT_HUMAN_TOKEN + audio_tokens[n] + DEFAULT_ASSISTANT_TOKEN
            content_ids = tokenizer_llm_backbone.encode(content, add_special_tokens=False, truncation=True,
                                    max_length=tokenizer_llm_backbone.model_max_length)
            
            input_ids_list[n] += copy.deepcopy(content_ids)

        for n in range(num_codebooks):
            input_ids_list[n] = input_ids_list[n][:tokenizer_llm_backbone.model_max_length]
        
        rv_input_ids.append(input_ids_list)

    return {"input_ids": rv_input_ids}


tokenized_dev = dev_dataset.map(
    tokenize_for_inference,
    batched=True
)

tokenized_test = test_dataset.map(
    tokenize_for_inference,
    batched=True
)

# Tokenization only truncates at the model's maximum length, so examples longer than MAK_TOKEN_LEN survive it and are discarded here
tokenized_test = tokenized_test.filter(lambda x: len(x["input_ids"][0]) <= MAK_TOKEN_LEN, desc=f"Discarding input with more than {MAK_TOKEN_LEN} tokens")
tokenized_dev = tokenized_dev.filter(lambda x: len(x["input_ids"][0]) <= MAK_TOKEN_LEN, desc=f"Discarding input with more than {MAK_TOKEN_LEN} tokens")

ds_dict = {
    "DEV": tokenized_dev,
    "TEST": tokenized_test
}

Map: 100%|██████████| 1000/1000 [00:12<00:00, 82.80 examples/s]
Map: 100%|██████████| 1000/1000 [00:12<00:00, 83.15 examples/s]
Discarding input with more than 512 tokens: 100%|██████████| 1000/1000 [00:03<00:00, 262.72 examples/s]
Discarding input with more than 512 tokens: 100%|██████████| 1000/1000 [00:03<00:00, 264.51 examples/s]
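
Both tokenization functions rely on `_wrapper_audio_tokens` to give every codebook its own disjoint range of token ids. The sketch below shows that offsetting in isolation, assuming integer indices only; the function name `wrap_audio_tokens` and the toy sizes (2 codebooks of 10 entries) are hypothetical and much smaller than the real configuration.

```python
AUDIO_TOKEN_FORMAT = "<|{}|>"  # same format string as in the notebook


def wrap_audio_tokens(audio_tokens, num_codebooks, codebook_size):
    """Shift codebook n by n * codebook_size and render each index as a token string."""
    return [
        "".join(
            AUDIO_TOKEN_FORMAT.format(idx + n * codebook_size)
            for idx in audio_tokens[n]
        )
        for n in range(num_codebooks)
    ]


# The same raw indices (0 and 5) map to different tokens in each codebook.
wrapped = wrap_audio_tokens([[0, 5], [0, 5]], num_codebooks=2, codebook_size=10)
print(wrapped)  # ['<|0|><|5|>', '<|10|><|15|>']
```

The offset is what makes the range check in the inference loop possible: any id between the tokens for `0` and `codebook_size * num_codebooks - 1` is known to be an audio token.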




## Load evaluation metrics

In [13]:
bleu_metric = load("sacrebleu")
comet_metric = load("comet")


/home/alumno.upv.es/psegmar1/.conda/envs/voila_env/lib/python3.11/site-packages/lightning_fabric/utilities/cloud_io.py:73: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.


Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.6.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/2760a223ac957f30acfb18c8aa649b01cf1d75f2/checkpoints/model.ckpt`


Encoder model frozen.


/home/alumno.upv.es/psegmar1/.conda/envs/voila_env/lib/python3.11/site-packages/pytorch_lightning/core/saving.py:197: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']


## Inference

Because of the methods provided by the Voila class, inference runs sample by sample rather than in batches.

In [14]:
num_codebooks = model.config.num_codebooks
codebook_size = model.config.codebook_size

DEFAULT_AUDIO_TOKEN = "<au_token>"

AUDIO_MIN_TOKEN_ID = tokenizer_llm_backbone.convert_tokens_to_ids(AUDIO_TOKEN_FORMAT.format(0))
AUDIO_MAX_TOKEN_ID = tokenizer_llm_backbone.convert_tokens_to_ids(AUDIO_TOKEN_FORMAT.format(codebook_size*num_codebooks-1))
AUDIO_TOKEN_ID = tokenizer_llm_backbone.convert_tokens_to_ids(DEFAULT_AUDIO_TOKEN)
ASSISTANT_TOKEN_ID = tokenizer_llm_backbone.convert_tokens_to_ids(DEFAULT_ASSISTANT_TOKEN)

normalizer = BasicTextNormalizer()

for ds_name in ["DEV", "TEST"]:
    data = ds_dict[ds_name]
    total_samples = len(data["file"])

    hypothesis = []

    for i in range(total_samples):
        input_ids = data["input_ids"][i]
        input_ids = torch.as_tensor([input_ids]).transpose(1,2).cuda()

        gen_params = {
            "input_ids": input_ids,
            "max_new_tokens": MAK_TOKEN_LEN,
            "pad_token_id": PAD_TOKEN_ID,
            "eos_token_id": EOS_TOKEN_ID,
            "llm_audio_token_id": AUDIO_TOKEN_ID,
            "min_audio_token_id": AUDIO_MIN_TOKEN_ID,
            "temperature": 0.2,
            "top_k": 3,
            "use_audio_transformer": False
        }

        with torch.inference_mode():
            outputs = lora_model.run_generate(**gen_params)

            outputs = outputs[0].cpu().tolist()

            predict_outputs = outputs[input_ids.shape[1]:]
            text_outputs = []

            for item in predict_outputs:
                if item[0] != EOS_TOKEN_ID and not (item[0] >= AUDIO_MIN_TOKEN_ID and item[0] <= AUDIO_MAX_TOKEN_ID):
                    text_outputs.append(item[0])

            hypothesis.append(tokenizer_llm_backbone.decode(text_outputs))

        
    hypothesis_clean = [normalizer(text) for text in hypothesis]
    sentence_clean = [normalizer(text) for text in data["sentence"]]
    translation_clean = [normalizer(text) for text in data["translation"]]

    result = bleu_metric.compute(predictions=hypothesis_clean, references=translation_clean)
    print(f'BLEU en {ds_name}: {result["score"]:.1f}')
    comet_score = comet_metric.compute(predictions=hypothesis_clean, references=translation_clean, sources=sentence_clean)
    print(f"COMET en {ds_name}: {comet_score['mean_score'] * 100:.2f} %")

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


/home/alumno.upv.es/psegmar1/.conda/envs/voila_env/lib/python3.11/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /home/alumno.upv.es/psegmar1/.conda/envs/voila_env/l ...


GPU available: True (cuda), used: True


TPU available: False, using: 0 TPU cores


You are using a CUDA device ('NVIDIA L40S') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


BLEU en DEV: 4.1


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


COMET en DEV: 57.45 %



BLEU en TEST: 3.2



COMET en TEST: 55.70 %
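
The post-processing step in the inference loop keeps only text tokens from the first codebook before decoding. A minimal standalone sketch of that filter follows; the helper name `keep_text_ids` and all ids are made up for illustration and do not correspond to the real vocabulary.

```python
def keep_text_ids(first_codebook_ids, eos_id, audio_min_id, audio_max_id):
    """Drop EOS and any id that falls inside the audio-token range."""
    return [
        t
        for t in first_codebook_ids
        if t != eos_id and not (audio_min_id <= t <= audio_max_id)
    ]


# Toy ids: 2 is EOS, 5000..5999 is the audio range, everything else is text.
ids = keep_text_ids([11, 12, 5003, 13, 2], eos_id=2, audio_min_id=5000, audio_max_id=5999)
print(ids)  # [11, 12, 13]
```

The remaining ids are what gets passed to the LLM tokenizer's `decode` to obtain the hypothesis that BLEU and COMET score.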
