Written by [Avihu Dekel](https://huggingface.co/Avihu).

## Spoken Question Answering with Granite Speech’s Two-Pass Design

[Granite Speech](https://huggingface.co/collections/ibm-granite/granite-speech-67e45da088d5092ff6b901c7) is a family of speech models that excel at speech recognition and speech translation. In particular, [granite-speech-3.3-8b](https://huggingface.co/ibm-granite/granite-speech-3.3-8b) leads the [OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) (as of June 2025).


Granite Speech was trained by aligning the speech modality to [Granite](https://huggingface.co/collections/ibm-granite/granite-33-language-models-67f65d0cca24bcbd1d3a08e3). This was achieved by projecting the embeddings of a pretrained speech encoder into Granite's embedding space and fine-tuning with lightweight LoRA adapters.

This enables a *two-pass design* with Granite Speech:
1. *Transcribe* the audio using Granite Speech
2. *Answer* the question within the transcription using Granite.

Granite Speech supports both steps seamlessly. It automatically enables the LoRA adapters when audio input is detected, and disables them when processing text-only input.
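
As a rough mental model of this switching, the sketch below shows a single linear layer whose low-rank (LoRA) update is applied only when audio is present. It is a toy illustration in plain Python, not Granite Speech's actual implementation: `ToyLoRALinear`, `has_audio`, and the example numbers are all made up for illustration.

```python
# Toy illustration only: a plain-Python stand-in for one linear layer
# with a LoRA adapter. All names here are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyLoRALinear:
    def __init__(self, base_weight, lora_a, lora_b, scale=1.0):
        self.base_weight = base_weight  # frozen base weights W (d_out x d_in)
        self.lora_a = lora_a            # low-rank factor A (r x d_in)
        self.lora_b = lora_b            # low-rank factor B (d_out x r)
        self.scale = scale              # LoRA scaling factor

    def forward(self, x, has_audio):
        base_out = matvec(self.base_weight, x)
        if not has_audio:
            # Text-only input: skip the adapter, so the output is exactly
            # what the base layer alone would produce.
            return base_out
        # Audio input: add the low-rank update scale * B (A x).
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base_out, delta)]


layer = ToyLoRALinear(
    base_weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],       # rank r = 1
    lora_b=[[0.5], [-0.5]],
    scale=2.0,
)
x = [1.0, 2.0]
text_out = layer.forward(x, has_audio=False)   # adapter off
audio_out = layer.forward(x, has_audio=True)   # adapter on
print(text_out)   # [1.0, 2.0]
print(audio_out)  # [4.0, -1.0]
```

With the adapter skipped, the layer returns the base output unchanged, which is why text-only requests behave like the original language model.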

In this guide, we demonstrate how to use the two-pass design for spoken question answering.
We'll showcase this using the [LlamaQuestions](https://github.com/google-research-datasets/LLAMA1-Test-Set) dataset.

In [1]:
# spoken question answering dataset
from datasets import load_dataset
ds = load_dataset("fixie-ai/llama-questions")["test"]


In [2]:
import torch
from transformers.models.granite_speech import GraniteSpeechForConditionalGeneration, GraniteSpeechProcessor
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-2b"
processor = GraniteSpeechProcessor.from_pretrained(model_name)
model = GraniteSpeechForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map=device)
tokenizer = processor.tokenizer


Let's write a simple function to transcribe a given audio input:

In [3]:
def transcribe(audio) -> list[str]:
    system_prompt = "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant"
    instruction = "Please transcribe the following audio to text<|audio|>"
    chat = [
        dict(role="system", content=system_prompt),
        dict(role="user", content=instruction)
    ]
    prompt = tokenizer.apply_chat_template(
        chat,
        add_generation_prompt=True,
        tokenize=False,
    )
    model_inputs = processor(prompt, audio, device=device, return_tensors="pt").to(device)
    model_outputs = model.generate(**model_inputs, max_new_tokens=200)
    num_input_tokens = model_inputs["input_ids"].shape[-1]
    new_tokens = model_outputs[:, num_input_tokens:]
    output_text = tokenizer.batch_decode(
        new_tokens, add_special_tokens=False, skip_special_tokens=True
    )
    return output_text

transcription = transcribe(ds[0]["audio"]["array"])[0]
print(transcription)

what is the capital of france


Now, let's use the base LLM (Granite 3.3 Instruct) to answer the question.
When the input contains only text (i.e., no audio), Granite Speech automatically *deactivates* the LoRA adapters and functions identically to the original Granite model.

In [4]:
def llm_response(query):
    chat = [dict(role="user", content=query)]
    prompt = tokenizer.apply_chat_template(
        chat,
        add_generation_prompt=True,
        tokenize=False
    )
    model_inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # text-only input: the LoRA adapters are disabled automatically, so this calls the base LLM
    model_outputs = model.generate(**model_inputs, max_new_tokens=200)
    num_input_tokens = model_inputs["input_ids"].shape[-1]
    new_tokens = model_outputs[:, num_input_tokens:]

    output_text = tokenizer.batch_decode(
        new_tokens, add_special_tokens=False, skip_special_tokens=True
    )
    return output_text

llm_response(transcription)[0]

'The capital of France is Paris.'
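
The two steps can also be wrapped in a single helper. The following is a minimal sketch that is not part of the original notebook: `spoken_qa` is a hypothetical name, and the two stub functions stand in for `transcribe` and `llm_response` so that the snippet runs without loading the model.

```python
def spoken_qa(audio, transcribe_fn, answer_fn):
    """Two-pass spoken QA: transcribe the audio, then answer the transcription.

    Both callables are assumed to return a list of strings, like the
    `transcribe` and `llm_response` functions defined in this guide.
    """
    transcription = transcribe_fn(audio)[0]  # pass 1: speech -> text
    response = answer_fn(transcription)[0]   # pass 2: text -> answer
    return transcription, response


# Stand-in functions (hypothetical) so the sketch runs without a model.
def fake_transcribe(audio):
    return ["what is the capital of france"]


def fake_answer(query):
    return [f"(answer to: {query})"]


question, answer = spoken_qa([0.0, 0.1, -0.1], fake_transcribe, fake_answer)
print(f"Q: {question}")
print(f"A: {answer}")
```

With the model loaded as above, the equivalent call would be `spoken_qa(ds[0]["audio"]["array"], transcribe, llm_response)`.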

Let’s run the full pipeline on a few more examples to see it in action.

In [5]:
for i in range(8):
    transcription = transcribe(ds[i]["audio"]["array"])[0]
    response = llm_response(transcription)[0]
    print(f"Q: {transcription}")
    print(f"A: {response}\n")

Q: what is the capital of france
A: The capital of France is Paris.

Q: which river is the longest in south america
A: The longest river in South America is the Amazon River. It stretches approximately 6,992 kilometers (4,345 miles) long.

Q: what is the highest mountain peak in north america
A: The highest mountain peak in North America is Mount Denali (formerly known as Denali Mountain or North America's "Great Mountain"), located in the Alaska Range in Denali National Park and Preserve. It stands at a height of approximately 20,310 feet (6,190 meters) above sea level.

Q: who was the first president of the united states
A: The first president of the United States was George Washington. He served two terms from 1789 to 1797.

Q: which city is located at the intersection of the tigris and euphrates rivers
A: The city located at the intersection of the Tigris and Euphrates rivers is ancient Mesopotamia, specifically the city of Babylon. However, in modern times, no city exists at this 