# Building a Voice Assistant with Granite Models

In this notebook, we'll build a voice assistant using the **Granite series of models**, with a particular focus on the **Granite Speech** model for speech-to-text transcription.

The core of our assistant will be powered by two main components:
1.  **Granite Speech 3.3 8B**: To accurately transcribe spoken audio from the microphone into text in real time.
2.  **Granite Instruct 3.3 8B**: A large language model (LLM) that will understand the transcribed text and generate intelligent, conversational responses.

## 1. Setup and Installation

First, we need to install the necessary packages.

In [None]:
%pip install git+https://github.com/ibm-granite-community/utils replicate soundfile sounddevice numpy webrtcvad scipy langchain-community langchain-core datasets ipywidgets torch transformers

## 2. A Quick Test: Transcribing a File

Before we start building the voice assistant, let's begin with a simpler task: transcribing a pre-existing audio file. This allows us to test our transcription function in a controlled way.

We will use the Hugging Face `datasets` library to download a sample from the `fixie-ai/llama-questions` dataset, which contains short audio recordings of spoken questions. We'll then extract the audio data and save it as a standard `.wav` file.

In [None]:
from datasets import load_dataset, Audio
import soundfile as sf, io

# 1) Load dataset
ds = load_dataset("fixie-ai/llama-questions", split="test")

# 2) Grab raw bytes for the first clip
ds = ds.cast_column("audio", Audio(decode=False))
audio_bytes = ds[0]["audio"]["bytes"]

# 3) Decode with soundfile
waveform, sr = sf.read(io.BytesIO(audio_bytes))

# 4) Save to a file
sf.write("sample.wav", waveform, sr)
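
As a quick sanity check, we can inspect the saved file with the standard-library `wave` module. This is only a minimal sketch: the helper name `describe_wav` is hypothetical, and it assumes the file is an uncompressed PCM WAV, which is the kind of file `wave` can read.

```python
import wave

def describe_wav(path: str) -> dict:
    """Return basic format facts for an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": rate,
            # Duration follows from frame count divided by sample rate.
            "duration_sec": frames / rate if rate else 0.0,
        }
```

Calling `describe_wav("sample.wav")` after the cell above should report the clip's channel count, sample rate, and duration.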

## 3. Transcribing with Granite Speech

With our sample audio file ready, we can now use the Granite Speech model to convert it to text.

### How it Works
The process involves two helper functions:

1.  `encode_audio(filepath)`: The Replicate API expects the audio data to be sent as a base64-encoded [Data URI](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs). This function reads our `.wav` file, encodes it, and formats it into the required string format.

2.  `transcribe(data_uri)`: This is the core function for speech-to-text. It uses the `langchain-community` integration to call the `ibm-granite/granite-speech-3.3-8b` model on Replicate. We pass the encoded audio and a simple prompt.

Let's call these functions and print the transcription of our sample file.

In [None]:
import base64
import mimetypes
from langchain_community.llms import Replicate  # provided by the langchain-community package installed above
from ibm_granite_community.notebook_utils import get_env_var

def encode_audio(filepath: str) -> str:
    mime_type, _ = mimetypes.guess_type(filepath)
    if not mime_type or not mime_type.startswith("audio"):
        raise ValueError(f"Unsupported audio type for {filepath}")

    with open(filepath, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
        return f"data:{mime_type};base64,{encoded}"

def transcribe(data_uri: str) -> str:

    stt = Replicate(
        model="ibm-granite/granite-speech-3.3-8b",
        model_kwargs={
            "audio": [data_uri],
            "top_k": 50,
            "top_p": 0.9,
            "prompt": "Transcribe the speech into written form.",
            "max_tokens": 512,
            "min_tokens": 0,
            "temperature": 0.6,
            "presence_penalty": 0,
            "frequency_penalty": 0
        },
        replicate_api_token=get_env_var("REPLICATE_API_TOKEN")
    )

    return stt.invoke("")

In [None]:
data_uri = encode_audio("sample.wav")
transcription = transcribe(data_uri)

print(transcription)
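
To make the data URI format more concrete, here is a small round-trip sketch. It assumes the `data:<mime>;base64,<payload>` layout described above; the helper names `encode_bytes_as_data_uri` and `decode_data_uri` are hypothetical and are not part of the notebook or of any library.

```python
import base64

def encode_bytes_as_data_uri(payload: bytes, mime_type: str = "audio/wav") -> str:
    """Wrap raw bytes in a base64 data URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def decode_data_uri(data_uri: str) -> tuple[str, bytes]:
    """Split a base64 data URI back into (mime_type, raw_bytes)."""
    header, _, encoded = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("Expected a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(encoded)

uri = encode_bytes_as_data_uri(b"RIFF")
print(uri)                   # data:audio/wav;base64,UklGRg==
print(decode_data_uri(uri))  # ('audio/wav', b'RIFF')
```

Decoding the URI returned by `encode_audio` and comparing the bytes with the file on disk is a cheap way to confirm that nothing was lost in encoding.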

## 4. Real-time Recording with Voice Activity Detection (VAD)

Transcribing a file is useful, but a true voice assistant needs to listen in real time. A key challenge here is knowing when the user has finished speaking.

To solve this, we'll implement **Voice Activity Detection (VAD)**. VAD is a technique used to distinguish between human speech and silence in an audio stream. We'll use the `webrtcvad` library, a fast and effective VAD implementation.
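
`webrtcvad` classifies fixed-size frames of 16-bit mono PCM, and it accepts only 10, 20, or 30 ms frames at 8000, 16000, 32000, or 48000 Hz. The sketch below works through the frame arithmetic for the settings used in this notebook (16 kHz, 30 ms frames, 0.75 s of trailing silence). The helper name `frame_layout` is hypothetical.

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def frame_layout(sample_rate: int, frame_ms: int, silence_sec: float) -> dict:
    """Compute frame sizes and the number of silent frames that end a segment."""
    samples = sample_rate * frame_ms // 1000
    return {
        "samples_per_frame": samples,
        "bytes_per_frame": samples * BYTES_PER_SAMPLE,
        "silence_frames": int(silence_sec * 1000 / frame_ms),
    }

print(frame_layout(16000, 30, 0.75))
# {'samples_per_frame': 480, 'bytes_per_frame': 960, 'silence_frames': 25}
```

In other words, each call to the VAD sees 480 samples (960 bytes), and 25 consecutive non-speech frames add up to 0.75 s of silence.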

### The `record_segment` function
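The code below defines a function `record_segment` that listens to the microphone and captures a single segment of speech. It returns the speech frames packed as WAV bytes together with a stop reason: `"vad"` when trailing silence ended the segment, or `"timeout"` when `MAX_SEGMENT_SEC` was reached.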

In [None]:
import sounddevice as sd
import numpy as np
import webrtcvad
import collections
import io
import wave
import base64

SAMPLE_RATE = 16000
FRAME_DURATION_MS = 30
FRAME_SIZE = int(SAMPLE_RATE * FRAME_DURATION_MS / 1000)
MAX_SEGMENT_SEC = 10
VAD_SILENCE_FRAMES = int(0.75 * 1000 / FRAME_DURATION_MS)

def record_segment():
    vad          = webrtcvad.Vad(3)
    voiced_frames = []                      
    ring_buffer   = collections.deque(maxlen=VAD_SILENCE_FRAMES)
    recording_done, stop_reason = False, "timeout"

    def callback(indata, frames, time, status):
        nonlocal recording_done, stop_reason
        pcm = (indata[:, 0] * 32767).astype(np.int16).tobytes()

        is_speech = vad.is_speech(pcm, SAMPLE_RATE)
        ring_buffer.append(is_speech)

        if is_speech:                       # keep only speech frames
            voiced_frames.append(pcm)

        # 0.75 s of silence (VAD_SILENCE_FRAMES consecutive non-speech frames) → stop
        if len(ring_buffer) == VAD_SILENCE_FRAMES and not any(ring_buffer):
            stop_reason, recording_done = "vad", True

    with sd.InputStream(samplerate=SAMPLE_RATE,
                        channels=1,
                        dtype='float32',
                        blocksize=FRAME_SIZE,
                        callback=callback):
        elapsed = 0
        while elapsed < MAX_SEGMENT_SEC and not recording_done:
            sd.sleep(FRAME_DURATION_MS)
            elapsed += FRAME_DURATION_MS / 1000

    # ---- pack the voiced frames into ONE in-memory WAV file for the whole segment ----
    wav_bytes = b''.join(voiced_frames)
    buffer = io.BytesIO()
    with wave.open(buffer, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(wav_bytes)

    return buffer.getvalue(), stop_reason
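
The end-of-speech rule inside `record_segment` can be studied without a microphone. The sketch below replays a list of per-frame speech flags through the same ring-buffer check; the helper name `frames_until_stop` and the toy inputs are made up for illustration.

```python
import collections

def frames_until_stop(speech_flags, silence_frames):
    """Return the 1-based frame index at which recording would stop, or None."""
    ring = collections.deque(maxlen=silence_frames)
    for i, is_speech in enumerate(speech_flags, start=1):
        ring.append(is_speech)
        # Stop once the buffer is full and contains no speech frames.
        if len(ring) == silence_frames and not any(ring):
            return i
    return None

print(frames_until_stop([True] * 5 + [False] * 3, silence_frames=3))  # 8
print(frames_until_stop([False] * 3 + [True] * 5, silence_frames=3))  # 3
```

The second call shows that leading silence also ends a segment before any speech has been captured. In that case the returned WAV contains no frames, so the caller should check for empty segments.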


## 5. Generating Responses with Granite LLM

Now that we can convert speech to text, we need a text-to-text language model so our system can respond to the transcribed user input. For this, we'll use the `ibm-granite/granite-3.3-8b-instruct` model.

The `chat` function below handles the conversation logic.

In [None]:
from langchain_community.llms import Replicate  # provided by the langchain-community package
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import BaseMessage
from transformers import AutoTokenizer
from ibm_granite_community.langchain import TokenizerChatPromptTemplate

def chat(
    messages: list[BaseMessage],
    replicate_model = "ibm-granite/granite-3.3-8b-instruct",
) -> str:
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(replicate_model)

    # Format conversation with tokenizer-backed chat prompt
    prompt = TokenizerChatPromptTemplate.from_messages(
        messages=messages,
        tokenizer=tokenizer,
    )

    # Granite LLM via Replicate
    llm = Replicate(
        model=replicate_model,
        replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
        model_kwargs={
            "max_tokens": 512,
            "temperature": 0.6,
            "top_p": 0.9,
            "top_k": 50
        }
    )

    # Chain and run
    chain = prompt | llm | StrOutputParser()
    return chain.invoke({})
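
Conceptually, a chat prompt template flattens the list of role-tagged messages into a single prompt string and ends with a cue for the assistant to answer. The sketch below illustrates that idea only. The `<role> text` layout and the helper name `render_chat` are made up; the real layout, including its special tokens, is defined by the model tokenizer's chat template.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten (role, content) pairs into one prompt string.

    The "<role> text" layout is a made-up stand-in for a real chat template.
    """
    lines = [f"<{role}> {content}" for role, content in messages]
    if add_generation_prompt:
        lines.append("<assistant>")  # cue the model to produce the next turn
    return "\n".join(lines)

print(render_chat([
    ("system", "You are a helpful assistant."),
    ("user", "What is the capital of France?"),
]))
# <system> You are a helpful assistant.
# <user> What is the capital of France?
# <assistant>
```

Because the whole history is re-rendered on every call, the model sees earlier turns each time it generates a reply.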


## 6. The Main Conversation Loop

All the core components are now in place:
*   `record_segment`: Records a user's speech segment from the microphone.
*   `transcribe`: Converts audio to text using the Granite Speech model.
*   `chat`: Generates a text response using the Granite LLM.

Now, we just need to put them together in a continuous loop to create our interactive voice assistant. The `start_conversation_loop` function below orchestrates this entire process.

The assistant runs with a **background listener** that continuously records speech segments and puts them into a queue. The main loop processes each segment as it arrives, so the assistant can keep capturing your speech even if you start talking while it's responding.

Here’s how each turn works:
1.  **Prompt**: The notebook prints `Conversation started. Speak to begin...` once at startup and `🎤 Speak now...` after each reply, then accumulates your transcribed speech as you talk.
2.  **Listen**: The background listener captures your speech and adds it to the queue.
3.  **Transcribe**: Each segment is transcribed and appended to the growing user input line.
4.  **Update History**: When you finish speaking, the full user message is added to the conversation history.
5.  **Get Response**: The complete conversation history is sent to the `chat` function to generate the assistant's reply.
6.  **Print Response**: The assistant's reply is printed below your message.
7.  **Repeat**: The loop starts over, ready to capture your next input as soon as you start speaking.

This design keeps the microphone active while earlier segments are being transcribed, which reduces the chance of missing speech and makes the exchange feel more conversational.
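
The queue-based flow above can be tried without a microphone or any model calls. The following sketch substitutes pre-made `(text, stop_reason)` pairs for recorded and transcribed segments. The names `produce` and `collect_turns` and the `None` sentinel are assumptions made for this illustration and are not part of the notebook.

```python
import asyncio

async def produce(queue: asyncio.Queue, segments) -> None:
    """Stand-in for the background listener: enqueue pre-made segments."""
    for segment in segments:
        await queue.put(segment)
    await queue.put(None)  # sentinel marking the end of input (illustration only)

async def collect_turns(segments) -> list:
    """Merge (text, stop_reason) segments into complete user turns."""
    queue: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(produce(queue, segments))
    turns, current = [], ""
    while (item := await queue.get()) is not None:
        text, stop_reason = item
        current = (current + " " + text).strip()
        # A "vad" stop marks the end of the user's turn; "timeout" does not.
        if stop_reason == "vad" and current:
            turns.append(current)
            current = ""
    await producer
    return turns
```

In a notebook, `await collect_turns([("what is", "timeout"), ("the capital of France", "vad")])` should return a single turn that joins both segments, because only the `"vad"` stop closes the turn.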

In [None]:
import asyncio, base64, io, wave
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage, BaseMessage

MIN_MS = 150  # discard segments shorter than this many milliseconds

def has_payload(wav_bytes: bytes, min_ms: int = MIN_MS) -> bool:
    """True if WAV has >0 frames and duration >= min_ms."""
    if not wav_bytes:
        return False
    try:
        with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
            frames = wf.getnframes()
            sr = wf.getframerate() or 1
        dur_ms = frames * 1000 / sr
        return frames > 0 and dur_ms >= min_ms
    except wave.Error:
        return False

async def background_listener(queue: asyncio.Queue):
    """Continuously record segments and enqueue only real audio."""
    while True:
        audio_bytes, stop_reason = await asyncio.to_thread(record_segment)
        if has_payload(audio_bytes):
            await queue.put((audio_bytes, stop_reason))

async def start_conversation_loop() -> None:
    print("Conversation started. Speak to begin...")
    history: list[BaseMessage] = [SystemMessage(content="You are a helpful assistant.")]
    queue: asyncio.Queue = asyncio.Queue()
    # Keep a reference to the task so it is not garbage-collected while running.
    listener_task = asyncio.create_task(background_listener(queue))

    current_utterance = ""

    try:
        while True:
            audio_bytes, stop_reason = await queue.get()

            # --- 1) Transcribe safely ---
            data_uri = "data:audio/wav;base64," + base64.b64encode(audio_bytes).decode()
            try:
                segment_text = await asyncio.to_thread(transcribe, data_uri)
            except Exception as e:  # catch ValueError/RuntimeError/etc.
                print("⚠️ STT error:", repr(e))
                continue

            segment_text = segment_text.strip()
            if not segment_text:
                continue

            # --- 2) Accumulate text ---
            current_utterance = (current_utterance + " " + segment_text).strip()

            # --- 3) Turn boundary (VAD silence) ---
            if stop_reason == "vad" and current_utterance:
                print(f"\n🗣️ User: {current_utterance}")
                history.append(HumanMessage(content=current_utterance))

                try:
                    response = await asyncio.to_thread(chat, history)
                except Exception as e:
                    print("⚠️ LLM error:", repr(e))
                    response = "[error generating response]"
                print(f"\n🤖 Assistant: {response}\n{'_'*72}\n🎤 Speak now...")

                history.append(AIMessage(content=response))
                current_utterance = ""  # reset

    except KeyboardInterrupt:
        print("\nConversation ended by keyboard interrupt.")


## 7. Run the Voice Assistant!

It's time to run our application. Executing the cell below will start the conversation. The program prints "Conversation started. Speak to begin..." when it's ready, and "Speak now..." after each reply.

Go ahead and start chatting!

To stop the assistant, simply interrupt the kernel (the "Stop" button in most notebook environments) or press `Ctrl-C` if running as a script.

In [None]:
await start_conversation_loop()
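
Top-level `await` works in notebooks, but a plain Python script needs `asyncio.run(start_conversation_loop())` instead. When running as a script, it can also help to cancel the background task explicitly on exit. The sketch below shows one way to do that under stated assumptions: `run_with_background_task` is a hypothetical helper, and `worker` and `main` stand in for the listener and the conversation loop.

```python
import asyncio

async def run_with_background_task(worker, main):
    """Run `main()` while `worker()` runs in the background, then clean up."""
    task = asyncio.create_task(worker())  # keep a reference to the task
    try:
        return await main()
    finally:
        task.cancel()  # ask the background worker to stop
        try:
            await task
        except asyncio.CancelledError:
            pass  # expected once the task has been cancelled
```

With this pattern the background worker is stopped whether `main()` returns normally or raises.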
