<a href="https://colab.research.google.com/github/milosz7/ml2025-26/blob/main/lab/audio_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Libraries

In [4]:
!pip install ffmpeg-python
!pip install --upgrade datasets[audio]
# Quote the requirement so the shell does not treat ">=" as an output redirect.
!pip install -q "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng


### Audio recording

In [6]:
from IPython.display import Javascript, HTML, display
from google.colab import output
import base64, io
import torch
import torchaudio
import torchaudio.transforms as T


waveform_16k = None

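def _record_callback(data):
    """Decode the browser's data URL and store a mono 16 kHz waveform."""
    global waveform_16k
    waveform_16k = None
    # `data` is a data URL ("data:audio/webm;base64,..."); keep only the payload.
    audio_bytes = base64.b64decode(data.split(",")[1])
    recorded_audio = io.BytesIO(audio_bytes)
    # Pass the file-like object itself; raw bytes are not a documented input.
    waveform, sample_rate = torchaudio.load(recorded_audio)

    # Downmix multi-channel recordings to mono.
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    # Resample to 16 kHz, the rate Whisper expects.
    resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform_16k = resampler(waveform)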


output.register_callback("notebook.recordCallback", _record_callback)

js = Javascript("""
let mediaRecorder;
let audioChunks = [];

function startRecording() {
  navigator.mediaDevices.getUserMedia({audio:true}).then(stream => {
    mediaRecorder = new MediaRecorder(stream);
    audioChunks = [];

    mediaRecorder.ondataavailable = e => audioChunks.push(e.data);
    mediaRecorder.onstop = async () => {
      const blob = new Blob(audioChunks, { type: 'audio/webm' });
      let reader = new FileReader();
      reader.readAsDataURL(blob);
      reader.onloadend = function() {
        google.colab.kernel.invokeFunction(
          'notebook.recordCallback', [reader.result], {});
      }
    };

    mediaRecorder.start();
    document.getElementById("status").innerHTML = "⏺ Recording...";
  });
}

function stopRecording() {
  if (mediaRecorder) {
    mediaRecorder.stop();
    document.getElementById("status").innerHTML = "⏹ Stopped! Processing...";
  }
}

let html = `
<button onclick="startRecording()" style="font-size:20px;padding:10px;margin:5px;">🎙 Start</button>
<button onclick="stopRecording()" style="font-size:20px;padding:10px;margin:5px;">🛑 Stop</button>
<div id="status" style="margin-top:10px;font-size:18px;color:green;"></div>
`;

document.body.insertAdjacentHTML("beforeend", html);
""")

display(js)



In [11]:
# Sanity check: if this is None, no recording has been captured yet; rerun the cell above and record again
waveform_16k

tensor([[ 1.9835e-15,  4.7072e-14,  1.8812e-13,  ..., -1.7671e-03,
         -1.7785e-03, -2.0214e-03]])
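
The first step of `_record_callback` above relies on the browser handing over the recording as a base64 data URL. As a minimal sketch of that decoding step in isolation, the helper below splits such a URL into its MIME type and raw bytes. The name `decode_data_url` and the four stand-in bytes are illustrative assumptions, not part of the notebook, and no real audio is involved.

```python
import base64


def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into its MIME type and raw payload bytes."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Round-trip a few stand-in bytes (not real audio) to show the shape of the data.
example_url = "data:audio/webm;base64," + base64.b64encode(b"\x1aE\xdf\xa3").decode("ascii")
mime_type, audio_bytes = decode_data_url(example_url)
print(mime_type, len(audio_bytes))
```

Checking the header before decoding is a design choice of this sketch: it fails early if the callback ever receives something other than a data URL, whereas `data.split(",")[1]` would raise a less descriptive `IndexError`.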

### Audio2Text Pipeline

In [8]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-medium"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

`torch_dtype` is deprecated! Use `dtype` instead!
Device set to use cuda


In [12]:
# Pass a 1-D array plus its sampling rate; the 2-D (1, N) tensor appears to trigger a mono-conversion warning.
output = pipe({"raw": waveform_16k.squeeze(0).numpy(), "sampling_rate": 16000})


In [13]:
print(output["text"])

 Hello, how are you doing?


### LLM Answer

In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = output["text"]
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # find the position just past the last </think> token (id 151668)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)


thinking content: <think>
Okay, the user just asked, "Hello, how are you doing?" I need to respond appropriately. Let me think about the best way to reply.

First, I should acknowledge their greeting. Maybe say "Hello!" to keep it friendly. Then, I can check if they need help. Since the user didn't specify anything else, a simple "Hello!" and asking if they need anything would be good. I should make sure the response is natural and not too formal. Let me put that together.
</think>
content: Hello! How are you doing today? 😊 Let me know if there's anything I can help with!
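
The split between thinking content and the final answer in the cell above can be tried without loading a model. The sketch below reproduces the same reverse-index logic on a made-up list of ids; `split_thinking` and the toy ids are hypothetical names and values chosen for illustration, and only `151668` is taken from the notebook.

```python
def split_thinking(output_ids: list[int], end_think_id: int) -> tuple[list[int], list[int]]:
    """Split generated ids just after the last end-of-thinking token."""
    try:
        # Index in the reversed list gives the distance from the end.
        cut = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No end-of-thinking token: treat everything as the answer.
        cut = 0
    return output_ids[:cut], output_ids[cut:]


END_THINK = 151668  # id used for </think> in the cell above
toy_ids = [1, 11, 12, END_THINK, 21, 22]  # made-up ids, not real tokens
thinking_ids, answer_ids = split_thinking(toy_ids, END_THINK)
print(thinking_ids, answer_ids)
```

In the notebook itself, the id could presumably be looked up with `tokenizer.convert_tokens_to_ids("</think>")` instead of being hardcoded; that variant is untested here.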


### Text2Audio Pipeline

In [15]:
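from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf

SAMPLE_RATE = 24000  # sample rate used for playback and for the saved files

# Named tts_pipeline so it does not shadow transformers.pipeline imported earlier.
tts_pipeline = KPipeline(lang_code='a')  # 'a' selects American English
generator = tts_pipeline(content, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    # gs: text of the segment, ps: its phoneme string, audio: the synthesized samples
    print(i, gs, ps)
    display(Audio(data=audio, rate=SAMPLE_RATE, autoplay=i==0))
    sf.write(f'{i}.wav', audio, SAMPLE_RATE)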







0 Hello! How are you doing today? 😊 Let me know if there's anything I can help with! həlˈO! hˌW ɑɹ ju dˈuɪŋ tədˈA? smˈIlɪŋ fˈAs wɪð smˈIlɪŋ ˈIz lˈɛt mˌi nˈO ɪf ðɛɹz ˈɛniθˌɪŋ ˌI kæn hˈɛlp wɪð!
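
The phoneme string above suggests that the emoji in the reply is spoken aloud as its name ("smiling face with smiling eyes"). If that is unwanted, one option is to filter such symbols out before synthesis. The following is a minimal sketch under that assumption; `strip_symbols` is a hypothetical helper, and it removes characters in Unicode category `So`, which covers most but not all emoji.

```python
import re
import unicodedata


def strip_symbols(text: str) -> str:
    """Drop characters in Unicode category 'So' (which includes most emoji)."""
    kept = "".join(
        ch for ch in text
        # Also drop the zero-width joiner and variation selector used in emoji sequences.
        if unicodedata.category(ch) != "So" and ch not in "\u200d\ufe0f"
    )
    # Collapse the whitespace runs left behind by removed symbols.
    return re.sub(r"\s{2,}", " ", kept).strip()


reply = "Hello! How are you doing today? 😊 Let me know if there's anything I can help with!"
print(strip_symbols(reply))
```

The cleaned string could then be passed to the TTS call in place of `content`. Note that the whitespace collapse also flattens line breaks, which may or may not be desirable for longer replies.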
