# Connecting Llama 3.2 and the Whisper encoder with an adapter for zero-shot instruction following

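1. In the top bar of Google Colab, go to Runtime -> Change runtime type, select "T4 GPU" as the hardware accelerator, and save.
1. [Log in](https://huggingface.co/login) to Hugging Face or [create an account](https://huggingface.co/join).
1. Accept the Llama 3.2 [license](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct). A confirmation email arrives within a few minutes, after which the model becomes accessible.
1. Create an [access token](https://huggingface.co/settings/tokens) with Write permission, copy it, and enter it when logging in below.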

In [None]:
# https://github.com/ryota-komatsu/slp2025/blob/main/slp2025-tutorial.pdf

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
The token `whis-llama3.2-test` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `whis-

In [None]:
!pip install datasets==3.6.0 \
    gcsfs==2025.3.0 \
    nvidia-cublas-cu12==12.4.5.8 \
    nvidia-cuda-cupti-cu12==12.4.127 \
    nvidia-cuda-nvrtc-cu12==12.4.127 \
    nvidia-cuda-runtime-cu12==12.4.127 \
    nvidia-cudnn-cu12==9.1.0.70 \
    nvidia-cufft-cu12==11.2.1.3 \
    nvidia-curand-cu12==10.3.5.147 \
    nvidia-cusolver-cu12==11.6.1.9 \
    nvidia-cusparse-cu12==12.3.1.170 \
    nvidia-nvjitlink-cu12==12.4.127

"""
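What each pinned CUDA package provides:
    nvidia-cublas-cu12==12.4.5.8: GPU-optimized BLAS (Basic Linear Algebra Subprograms), the standard API for matrix and vector operations; speeds up training
    nvidia-cuda-cupti-cu12==12.4.127: performance analysis for GPU applications
    nvidia-cuda-nvrtc-cu12==12.4.127: compiles CUDA C++ code at run time, enabling dynamic code generation and optimization for the execution environment
    nvidia-cuda-runtime-cu12==12.4.127: GPU management, memory allocation, and data transfer between CPU and GPU
    nvidia-cudnn-cu12==9.1.0.70: fast implementations of operations common in deep learning, such as convolution, pooling, and normalization
    nvidia-cufft-cu12==11.2.1.3: fast Fourier transforms (FFT) on the GPU, used in signal processing, image processing, and physics simulation
    nvidia-curand-cu12==10.3.5.147: fast, high-quality random number generation on the GPU, needed for neural network weight initialization and data augmentation
    nvidia-cusolver-cu12==11.6.1.9: dense linear algebra solvers such as LU and QR decomposition
    nvidia-cusparse-cu12==12.3.1.170: accelerated computation on sparse matrices (matrices in which most elements are zero)
    nvidia-nvjitlink-cu12==12.4.127: links and optimizes multiple pieces of CUDA code at run time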
"""

Collecting datasets==3.6.0
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting gcsfs==2025.3.0
  Downloading gcsfs-2025.3.0-py2.py3-none-any.whl.metadata (1.9 kB)
Collecting nvidia-cublas-cu12==12.4.5.8
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cufft-cu12==11.2.1.3
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metad

In [None]:
from typing import Any, Dict, List

import torch
import torch.nn.functional as F
from datasets import load_dataset
from torch import nn
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
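    AutoProcessor,  # preprocesses data into the format the model expects
    PretrainedConfig,  # configuration class that defines the model structure
    PreTrainedModel,  # shared functionality such as loading models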
    WhisperForConditionalGeneration,
)

In [None]:
class Adapter(nn.Module):
    def __init__(
        self,
        encoder_hidden_size: int,
        decoder_hidden_size: int,
        kernel_size: int,
        bias: bool,
    ):
        super().__init__()
        # Average each group of kernel_size adjacent frames into one, shortening the sequence to 1/kernel_size of its length
        self.pool = nn.AvgPool1d(kernel_size)
        # Inverted bottleneck: widen to 2 * decoder_hidden_size to increase expressiveness
        self.linear1 = nn.Linear(encoder_hidden_size, 2 * decoder_hidden_size, bias=bias)
        self.linear2 = nn.Linear(2 * decoder_hidden_size, decoder_hidden_size, bias=bias)

    def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
        # (batch, seq_len, feature) -> (batch, feature, seq_len), because nn.AvgPool1d treats the feature dimension as channels
        hidden_states = hidden_states.permute(0, 2, 1)
        hidden_states = self.pool(hidden_states)  # sequence length shrinks by a factor of kernel_size
        # Back to (batch, seq_len, feature), because nn.Linear operates on the last dimension
        hidden_states = hidden_states.permute(0, 2, 1)
        hidden_states = self.linear1(hidden_states)
        hidden_states = F.gelu(hidden_states)  # GELU activation; see https://data-analytics.fun/2020/09/04/understanding-gelu/
        hidden_states = self.linear2(hidden_states)
        return hidden_states
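
To make the shape changes concrete, the sketch below runs a dummy batch through a standalone copy of the adapter. The sizes are assumptions chosen for illustration (1500 encoder frames of width 768 and a decoder width of 2048); check them against the actual encoder and decoder configs before relying on them.

```python
import torch
import torch.nn.functional as F
from torch import nn


class Adapter(nn.Module):
    """Standalone copy of the adapter above so this sketch runs on its own."""

    def __init__(self, encoder_hidden_size, decoder_hidden_size, kernel_size, bias):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size)
        self.linear1 = nn.Linear(encoder_hidden_size, 2 * decoder_hidden_size, bias=bias)
        self.linear2 = nn.Linear(2 * decoder_hidden_size, decoder_hidden_size, bias=bias)

    def forward(self, hidden_states):
        # Pool along the time axis, then project to the decoder width.
        hidden_states = self.pool(hidden_states.permute(0, 2, 1)).permute(0, 2, 1)
        return self.linear2(F.gelu(self.linear1(hidden_states)))


# Assumed sizes, for illustration only.
adapter = Adapter(encoder_hidden_size=768, decoder_hidden_size=2048, kernel_size=4, bias=False)
dummy_encoder_output = torch.randn(2, 1500, 768)  # (batch, frames, encoder width)

with torch.no_grad():
    adapted = adapter(dummy_encoder_output)

# The frame count shrinks by kernel_size and the width becomes the decoder width.
print(adapted.shape)
```

With a kernel size of 4, every four encoder frames become one decoder input position, which shortens the sequence the language model has to attend over.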

In [None]:
# Define the structure of the combined Whisper + Llama model
class LlamaForSpeechLMConfig(PretrainedConfig):
    model_type = "llama_for_speech_lm"

    def __init__(
        self,
        encoder_id: str = "openai/whisper-small.en",
        decoder_id: str = "meta-llama/Llama-3.2-1B-Instruct",
        adapter_kernel_size: int = 4,
        adapter_linear_bias: bool = False,
        **kwargs,
    ):
        self.encoder_id = encoder_id
        self.decoder_id = decoder_id
        self.adapter_kernel_size = adapter_kernel_size
        self.adapter_linear_bias = adapter_linear_bias
        super().__init__(**kwargs)


class LlamaForSpeechLM(PreTrainedModel):
    config_class = LlamaForSpeechLMConfig
    _tied_weights_keys = ["decoder.lm_head.weight"]

    def __init__(self, config: LlamaForSpeechLMConfig):
        super().__init__(config)
        # Keep only the encoder; see https://zenn.dev/robes/articles/a72b95f9f76c39
        self.encoder = WhisperForConditionalGeneration.from_pretrained(config.encoder_id).model.encoder
        self.decoder = AutoModelForCausalLM.from_pretrained(config.decoder_id, torch_dtype=torch.bfloat16)
        self.adapter = Adapter(
            self.encoder.config.d_model,
            self.decoder.config.hidden_size,
            config.adapter_kernel_size,
            config.adapter_linear_bias,
        )

        self.encoder.requires_grad_(False)  # freeze the encoder
        self.decoder.requires_grad_(False)  # freeze the decoder; only the adapter is trained

    def get_input_embeddings(self):
        return self.decoder.model.embed_tokens

    def set_input_embeddings(self, value):
        self.decoder.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.decoder.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.decoder.lm_head = new_embeddings

    # Combine the audio and text information into a single embedding sequence
    def embed(
        self,
        input_features: torch.FloatTensor,
        input_ids: torch.LongTensor,
        encoder_attention_mask: torch.LongTensor,
        decoder_attention_mask: torch.LongTensor,
    ):
        encoder_outputs = self.encoder(input_features)
        encoder_hidden_states = encoder_outputs[0]  # pass the audio through Whisper to get a sequence of vectors

        lengths = self.encoder._get_feat_extract_output_lengths(encoder_attention_mask.sum(dim=1, keepdim=True))
        lengths = lengths // self.config.adapter_kernel_size
        max_len = lengths.max()  # longest sequence length in the batch

        encoder_hidden_states = self.adapter(encoder_hidden_states)
        # Drop everything past the longest sequence in the batch, since it is only padding.
        # Shorter sequences still end in padding frames, which the attention mask handles.
        encoder_hidden_states = encoder_hidden_states[:, :max_len]

        inputs_embeds = self.decoder.model.embed_tokens(input_ids)  # convert the text tokens to vectors
        # Concatenate audio and text vectors: [audio 1, audio 2, ..., text 1, text 2, ...]
        inputs_embeds = torch.cat((encoder_hidden_states, inputs_embeds), dim=1)

        attention_mask = torch.cat(
            (
                (
                    torch.arange(encoder_hidden_states.shape[1], device=decoder_attention_mask.device).unsqueeze(0)
                    < lengths
                ).long(),
                decoder_attention_mask,
            ),
            dim=1,
        )  # the attention mask, which marks real data versus padding, is concatenated in the same order as the embeddings
        return inputs_embeds, attention_mask

    def forward(
        self,
        input_features: torch.FloatTensor,
        input_ids: torch.LongTensor,
        encoder_attention_mask: torch.LongTensor,
        decoder_attention_mask: torch.LongTensor,
    ):
        """
        Args:
            input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, feature_length)`):
                Log mel spectrogram.
            input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                Token ids.
            encoder_attention_mask (`torch.LongTensor` of shape `(batch_size, feature_length)`):
                1: non-mask
                0: mask
            decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                1: non-mask
                0: mask
        """
        inputs_embeds, attention_mask = self.embed(
            input_features, input_ids, encoder_attention_mask, decoder_attention_mask
        )

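        # The model learns by predicting the next token, but the first part of the input is audio,
        # which has no "next token" to predict. Left-pad the labels with -100 so that the
        # Hugging Face loss computation ignores those positions.
        labels = F.pad(input_ids, (inputs_embeds.shape[1] - input_ids.shape[1], 0), value=-100)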

        decoder_outputs = self.decoder(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)
        return decoder_outputs.loss

    @torch.amp.autocast("cuda", dtype=torch.bfloat16)
    @torch.no_grad()
    def generate(
        self,
        input_features: torch.FloatTensor,
        input_ids: torch.LongTensor,
        encoder_attention_mask: torch.LongTensor,
        decoder_attention_mask: torch.LongTensor,
        **kwargs,
    ):
        inputs_embeds, attention_mask = self.embed(
            input_features, input_ids, encoder_attention_mask, decoder_attention_mask
        )

        generated_ids = self.decoder.generate(inputs_embeds=inputs_embeds, attention_mask=attention_mask, **kwargs)
        return generated_ids
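
The length bookkeeping in `embed` and the label padding in `forward` can be traced on toy tensors. This is a minimal sketch, not the model code itself: the frame counts and token ids are made up, and the halving formula `(n - 1) // 2 + 1` is an assumption about what `_get_feat_extract_output_lengths` computes for the Whisper encoder.

```python
import torch
import torch.nn.functional as F

adapter_kernel_size = 4

# Number of valid log-mel frames per example (hypothetical values).
feature_lengths = torch.tensor([[3000], [1600]])

# Assumption: the encoder roughly halves the frame count as (n - 1) // 2 + 1.
encoder_lengths = (feature_lengths - 1) // 2 + 1
lengths = encoder_lengths // adapter_kernel_size  # after average pooling
max_len = int(lengths.max())

# 1 where an audio position holds real data, 0 where it is padding.
audio_mask = (torch.arange(max_len).unsqueeze(0) < lengths).long()

# Toy token ids and text mask for two prompts of length 3.
input_ids = torch.tensor([[11, 12, 13], [21, 22, 23]])
text_mask = torch.ones_like(input_ids)

# Audio positions come first, then the text positions.
attention_mask = torch.cat((audio_mask, text_mask), dim=1)

# Left-pad the labels with -100 so the audio positions are ignored by the loss.
labels = F.pad(input_ids, (max_len, 0), value=-100)

print(lengths.squeeze(1).tolist())
print(attention_mask.sum(dim=1).tolist())
print(labels[0, max_len - 1 :].tolist())
```

The comparison `torch.arange(max_len).unsqueeze(0) < lengths` broadcasts a `(1, max_len)` row against a `(batch, 1)` column, which gives one mask row per example without a Python loop.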

In [None]:
# learning rate scheduler: linear warmup from min_lr to base_lr, then linear decay back to min_lr
def get_lr_schedule(
    optimizer,
    total_steps: int,
    warmup_steps: int,
    base_lr: float,
    min_lr: float,
) -> torch.optim.lr_scheduler.LambdaLR:
    def lr_schedule(current_step: int) -> float:
        if current_step < warmup_steps:
            return (min_lr + (base_lr - min_lr) * current_step / warmup_steps) / base_lr
        else:
            progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
            return (min_lr + (base_lr - min_lr) * (1 - progress)) / base_lr

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_schedule)
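
To see what the schedule produces, the multiplier can be evaluated at a few steps without building an optimizer. The sketch below repeats the same formula as a standalone function. The name `lr_multiplier` and the settings are hypothetical, chosen only for illustration.

```python
def lr_multiplier(step, total_steps, warmup_steps, base_lr, min_lr):
    """Same formula as lr_schedule above, written as a standalone function."""
    if step < warmup_steps:
        # Linear warmup from min_lr to base_lr.
        return (min_lr + (base_lr - min_lr) * step / warmup_steps) / base_lr
    # Linear decay from base_lr back to min_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return (min_lr + (base_lr - min_lr) * (1 - progress)) / base_lr


# Hypothetical settings chosen for illustration.
settings = dict(total_steps=100, warmup_steps=10, base_lr=1e-3, min_lr=1e-4)
for step in (0, 5, 10, 55, 100):
    print(step, round(lr_multiplier(step, **settings), 3))
```

`LambdaLR` multiplies the optimizer's base learning rate by the returned value, which is why the formula divides by `base_lr`. The decay branch divides by `total_steps - warmup_steps`, so `total_steps` has to be larger than `warmup_steps`.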

In [None]:
def train(
    model: LlamaForSpeechLM,
    loader: torch.utils.data.DataLoader,
    lr: float = 1e-3,
    epoch: int = 1,
    warmup_steps: int = 10,
    init_grad_scale: float = 1e32,
    clip_grad_norm: float = 1.0,
    grad_accumulation: int = 128,
):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # learning rate scheduler
    lr_scheduler = get_lr_schedule(
        optimizer,
        len(loader) // grad_accumulation * epoch,
        warmup_steps,
        lr,
        lr * 0.1,
    )

    scaler = torch.amp.GradScaler("cuda", init_scale=init_grad_scale)
    writer = SummaryWriter()

    step = 0

    for epoch in range(1, epoch + 1):
        model.train()

        for batch_idx, batch in enumerate(tqdm(loader, desc=f"epoch {epoch}")):
            with torch.amp.autocast("cuda", dtype=torch.bfloat16):
                loss = model(**batch)
                loss = loss / grad_accumulation
            scaler.scale(loss).backward()

            if (batch_idx + 1) % grad_accumulation == 0:
                # gradient clipping
                scaler.unscale_(optimizer)
                grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad_norm)

                # update
                scaler.step(optimizer)
                scale = scaler.get_scale()
                scaler.update()
                optimizer.zero_grad()

                # update learning rate
                lr = lr_scheduler.get_last_lr()[0]
                lr_scheduler.step()

                step += 1

                # tensorboard log
                writer.add_scalar("train/loss", loss.item(), step)
                writer.add_scalar("train/lr", lr, step)
                writer.add_scalar("train/scale", scale, step)
                writer.add_scalar("train/grad_norm", grad_norm.item(), step)
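
Because the update branch runs only when `(batch_idx + 1) % grad_accumulation == 0`, the number of optimizer steps depends on how the loader length compares with `grad_accumulation`. The sketch below counts those steps with a hypothetical helper, `count_optimizer_steps`, that mirrors only the loop condition and none of the training logic.

```python
def count_optimizer_steps(num_batches: int, grad_accumulation: int, epochs: int = 1) -> int:
    """Count how often the update branch of the training loop would run."""
    steps = 0
    for _ in range(epochs):
        for batch_idx in range(num_batches):
            if (batch_idx + 1) % grad_accumulation == 0:
                steps += 1
    return steps


print(count_optimizer_steps(num_batches=256, grad_accumulation=128))
print(count_optimizer_steps(num_batches=10, grad_accumulation=128))
print(count_optimizer_steps(num_batches=10, grad_accumulation=2))
```

When the loader has fewer batches than `grad_accumulation`, the count is zero, so gradients are accumulated but never applied. Batches after the last full group in an epoch likewise do not trigger an update within that epoch.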

In [None]:
"""
Convert a batch of audio clips into spectrograms.

Convert a batch of texts into token IDs.

Pad examples of different lengths so that everything in the batch has the same length.

Generate the attention masks.
"""

def get_collate_fn(encoder_processor, decoder_processor):
    # Prompt template with three slots, filled per example: the instruction, the transcript of the audio, and the response
    prompt = """<|start_header_id|>user<|end_header_id|>

    Below is an instruction that describes a task, paired with an audio input that provides further context. Transcribe the audio, and then write a response that appropriately completes the request.

    ### Instruction:
    {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

    ### Transcript:
    {}

    ### Response:
    {}<|eot_id|>"""

    def collate_fn(batch: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        """
        Args:
            batch: List of the following example:
                {
                    "instruction": "",
                    "input": "",
                    "output": "",
                    "text": "",
                    "audio": {"path": None, "array": tensor([...]), "sampling_rate": tensor(16000)},
                }
        """

        # Convert to mel spectrograms; see https://qiita.com/koshian2/items/ca99b4a489d164e9cec6#%E3%83%A1%E3%83%AB%E3%82%B9%E3%83%9A%E3%82%AF%E3%83%88%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0mel-spectrogram%E3%81%A3%E3%81%A6%E3%81%AA%E3%82%93%E3%81%A0%E3%81%A3%E3%81%91
        encoder_inputs = encoder_processor(
            [item["audio"]["array"].numpy() for item in batch],
            return_tensors="pt",
            return_attention_mask=True,
            sampling_rate=16000,
            device="cuda",
        ).to("cuda")

        # Convert to token IDs
        decoder_inputs = decoder_processor(
            [prompt.format(item["instruction"], item["input"], item["output"]) for item in batch],
            padding=True,
            return_tensors="pt",
        ).to("cuda")

        return {
            "input_features": encoder_inputs.input_features,
            "input_ids": decoder_inputs.input_ids,
            "encoder_attention_mask": encoder_inputs.attention_mask,
            "decoder_attention_mask": decoder_inputs.attention_mask,
        }

    return collate_fn
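
The three `{}` slots of the template are filled positionally with the instruction, the transcript, and the response. The sketch below fills a template that has the same slots. It is written at module level without leading indentation, and the example values are illustrative, not taken from a specific dataset row.

```python
# A template with the same three slots as the training prompt (written without indentation).
prompt = """<|start_header_id|>user<|end_header_id|>

Below is an instruction that describes a task, paired with an audio input that provides further context. Transcribe the audio, and then write a response that appropriately completes the request.

### Instruction:
{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

### Transcript:
{}

### Response:
{}<|eot_id|>"""

# Illustrative example with the same keys the collate function reads.
item = {
    "instruction": "Find the sentiment of the following text. Output 1 for positive sentiment and 0 for negative sentiment.",
    "input": "The TV show was good but not great.",
    "output": "0",
}

filled = prompt.format(item["instruction"], item["input"], item["output"])
print(filled)
```

Since the transcript comes before the response in the target text, the model is trained to transcribe the audio first and then answer, conditioning the response on its own transcript.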

In [None]:
# load model and dataset
model_id = "ryota-komatsu/Llama-for-SpeechLM-Instruct"
dataset_id = "ryota-komatsu/spoken-alpaca" # https://huggingface.co/datasets/ryota-komatsu/spoken-alpaca

model = LlamaForSpeechLM.from_pretrained(model_id).cuda()

encoder_processor = AutoProcessor.from_pretrained(model.config.encoder_id)
decoder_processor = AutoProcessor.from_pretrained(model.config.decoder_id)
decoder_processor.pad_token = decoder_processor.pad_token or decoder_processor.eos_token

def is_train_example(example):
    return (
        len(example["audio"]["array"]) < 16000 * 30
        and len(example["instruction"]) < 102
        and len(example["output"]) < 838
    )

# mutually exclusive with the train set: only instructions of 102 characters or more
def is_test_example(example):
    return (
        len(example["audio"]["array"]) < 16000 * 30
        and 102 <= len(example["instruction"])
        and len(example["output"]) < 838
    )

dataset = load_dataset(dataset_id, split="train")
dataset = dataset.with_format("torch")
trainset = dataset.filter(is_train_example)
testset = dataset.filter(is_test_example)
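
The two filters split the data by instruction length at a threshold of 102 characters, so no example can land in both sets. The sketch below checks this on synthetic examples. `make_example` is a hypothetical helper, and plain lists stand in for the audio tensors.

```python
def is_train_example(example):
    return (
        len(example["audio"]["array"]) < 16000 * 30
        and len(example["instruction"]) < 102
        and len(example["output"]) < 838
    )


def is_test_example(example):
    return (
        len(example["audio"]["array"]) < 16000 * 30
        and 102 <= len(example["instruction"])
        and len(example["output"]) < 838
    )


def make_example(instruction_chars, seconds=1, output_chars=10):
    """Build a synthetic example; a list of zeros stands in for 16 kHz audio."""
    return {
        "audio": {"array": [0.0] * (16000 * seconds)},
        "instruction": "x" * instruction_chars,
        "output": "y" * output_chars,
    }


short_instruction = make_example(instruction_chars=50)
long_instruction = make_example(instruction_chars=150)
too_long_audio = make_example(instruction_chars=50, seconds=31)

for name, example in [
    ("short instruction", short_instruction),
    ("long instruction", long_instruction),
    ("31 s of audio", too_long_audio),
]:
    print(name, is_train_example(example), is_test_example(example))
```

Examples with 30 seconds of audio or more, or with outputs of 838 characters or more, are dropped from both sets.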

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of WhisperForConditionalGeneration were not initialized from the model checkpoint at openai/whisper-small.en and are newly initialized: ['proj_out.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of LlamaForCausalLM were not initialized from the model checkpoint at meta-llama/Llama-3.2-1B-Instruct and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Generating train split:   0%|          | 0/11882 [00:00<?, ? examples/s]

Filter:   0%|          | 0/11882 [00:00<?, ? examples/s]

Filter:   0%|          | 0/11882 [00:00<?, ? examples/s]

In [None]:
# hyperparameters for training
batch_size = 4
lr = 1e-4
epoch = 1
warmup_steps = 10
init_grad_scale = 1e32
clip_grad_norm = 1.0
grad_accumulation = 128

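# Keep only a small subset (batch_size * warmup_steps = 40 examples, i.e. 10 batches) for a quick demo.
# Note: train() steps the optimizer only once every grad_accumulation batches, so with 10 batches
# and grad_accumulation = 128 no parameter update occurs in this run.
trainset = trainset.select(range(batch_size * warmup_steps))
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size, shuffle=True, collate_fn=get_collate_fn(encoder_processor, decoder_processor)
)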

train(
    model,
    trainloader,
    lr,
    epoch,
    warmup_steps,
    init_grad_scale,
    clip_grad_norm,
    grad_accumulation,
)

epoch 1: 100%|██████████| 10/10 [00:32<00:00,  3.22s/it]


In [None]:
# hyperparameters for inference
max_length = 4096
do_sample = False
num_beams = 5

prompt = """<|start_header_id|>user<|end_header_id|>

Below is an instruction that describes a task, paired with an audio input that provides further context. Transcribe the audio, and then write a response that appropriately completes the request.

### Instruction:
{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

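# Prompt used during training, shown for reference.
# The inference prompt above stops before "### Transcript:", so given the audio and the instruction
# the model generates the transcript followed by the response.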
"""
<|start_header_id|>user<|end_header_id|>

Below is an instruction that describes a task, paired with an audio input that provides further context. Transcribe the audio, and then write a response that appropriately completes the request.

### Instruction:
{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

### Transcript:
{}

### Response:
{}<|eot_id|>
"""

testloader = torch.utils.data.DataLoader(testset, shuffle=True)
testloader = iter(testloader)

In [None]:
# you can run this cell repeatedly
item = next(testloader)

encoder_inputs = encoder_processor(
    item["audio"]["array"].numpy(),
    return_tensors="pt",
    return_attention_mask=True,
    sampling_rate=16000,
    device="cuda",
).to("cuda")

decoder_inputs = decoder_processor(
    prompt.format(item["instruction"][0]),
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    encoder_inputs.input_features,
    decoder_inputs.input_ids,
    encoder_attention_mask=encoder_inputs.attention_mask,
    decoder_attention_mask=decoder_inputs.attention_mask,
    max_length=max_length,
    do_sample=do_sample,
    num_beams=num_beams,
)
generated_txt = decoder_processor.batch_decode(generated_ids, skip_special_tokens=True)

print(prompt.format(item["instruction"][0]) + generated_txt[0], end="\n\n")
print("Speech input:", item["input"][0], end="\n\n")
print("Correct answer:", item["output"][0])

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|start_header_id|>user<|end_header_id|>

Below is an instruction that describes a task, paired with an audio input that provides further context. Transcribe the audio, and then write a response that appropriately completes the request.

### Instruction:
Find the sentiment of the following text. Output 1 for positive sentiment and 0 for negative sentiment.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

### Transcript:
The TV show was good but not great.

### Response:
0

Speech input: The TV show was good but not great.

Correct answer: 0
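
Since the generated text follows the `### Transcript:` / `### Response:` layout shown above, the two parts can be separated with plain string operations. This is a minimal sketch: `split_generation` is a hypothetical helper, and it assumes the model kept to the training format.

```python
def split_generation(generated: str) -> dict:
    """Split generated text into its transcript and response parts."""
    transcript_tag, response_tag = "### Transcript:", "### Response:"
    # Everything after the response tag is the answer.
    before, _, after = generated.partition(response_tag)
    # Everything between the two tags is the transcript.
    transcript = before.partition(transcript_tag)[2]
    return {"transcript": transcript.strip(), "response": after.strip()}


generated = "### Transcript:\nThe TV show was good but not great.\n\n### Response:\n0"
parts = split_generation(generated)
print(parts)
```

Separating the transcript this way makes it possible to score the transcription and the answer independently, for example by comparing `parts["transcript"]` with `item["input"][0]` and `parts["response"]` with `item["output"][0]`.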


In [None]:
from IPython.display import Audio

# Play back the audio of the current test example
audio = item["audio"]["array"].numpy()
sr = 16000
Audio(audio, rate=sr)