<a href="https://colab.research.google.com/github/milosz7/ml2025-26/blob/main/lab/audio_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Libraries

In [1]:
!pip install ffmpeg-python
!pip install --upgrade "datasets[audio]"
!pip install -q "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
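
After the install cell has run, a quick check can confirm that the packages and the `espeak-ng` binary are actually visible to the runtime. The following is only a sketch: the helper `find_missing` and the two lists are hypothetical names introduced here, and the list assumes the import names match the installed packages (`ffmpeg-python` installs the module `ffmpeg`).

```python
import importlib.util
import shutil

# Assumed import names for the packages installed above.
REQUIRED_MODULES = ["ffmpeg", "datasets", "kokoro", "soundfile"]
REQUIRED_BINARIES = ["espeak-ng"]


def find_missing(modules, binaries):
    """Return (missing_modules, missing_binaries) without importing anything."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_binaries = [b for b in binaries if shutil.which(b) is None]
    return missing_modules, missing_binaries


missing_modules, missing_binaries = find_missing(REQUIRED_MODULES, REQUIRED_BINARIES)
print("Missing Python modules:", missing_modules or "none")
print("Missing system binaries:", missing_binaries or "none")
```

If either list is non-empty, re-running the install cell (or restarting the runtime) is the likely fix.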



### Audio recording

In [23]:
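from IPython.display import Javascript, HTML, display
from google.colab import output
import base64, io
import torch
import torchaudio
import torchaudio.transforms as T


# Filled in by the callback once a recording has been processed.
waveform_16k = None

def _record_callback(data):
    """Decode the recorded data URL into a mono 16 kHz waveform."""
    global waveform_16k
    waveform_16k = None
    # The browser sends "data:audio/webm;base64,<payload>"; keep only the payload.
    audio_bytes = base64.b64decode(data.split(",")[1])
    recorded_audio = io.BytesIO(audio_bytes)
    # Pass the file-like object itself rather than the raw bytes.
    waveform, sample_rate = torchaudio.load(recorded_audio)

    # Downmix multi-channel recordings to mono.
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    # Whisper expects 16 kHz input.
    resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform_16k = resampler(waveform)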


output.register_callback("notebook.recordCallback", _record_callback)

js = Javascript("""
let mediaRecorder;
let audioChunks = [];

function startRecording() {
  navigator.mediaDevices.getUserMedia({audio:true}).then(stream => {
    mediaRecorder = new MediaRecorder(stream);
    audioChunks = [];

    mediaRecorder.ondataavailable = e => audioChunks.push(e.data);
    mediaRecorder.onstop = async () => {
      const blob = new Blob(audioChunks, { type: 'audio/webm' });
      let reader = new FileReader();
      reader.readAsDataURL(blob);
      reader.onloadend = function() {
        google.colab.kernel.invokeFunction(
          'notebook.recordCallback', [reader.result], {});
      }
    };

    mediaRecorder.start();
    document.getElementById("status").innerHTML = "⏺ Recording...";
  });
}

function stopRecording() {
  if (mediaRecorder) {
    mediaRecorder.stop();
    document.getElementById("status").innerHTML = "⏹ Stopped! Processing...";
  }
}

let html = `
<button onclick="startRecording()" style="font-size:20px;padding:10px;margin:5px;">🎙 Start</button>
<button onclick="stopRecording()" style="font-size:20px;padding:10px;margin:5px;">🛑 Stop</button>
<div id="status" style="margin-top:10px;font-size:18px;color:green;"></div>
`;

document.body.insertAdjacentHTML("beforeend", html);
""")

display(js)




In [32]:
# Sanity check: if this is None, re-run the cell above and record again.
waveform_16k

tensor([[-8.8354e-11,  3.8162e-09, -1.1231e-08,  ..., -1.4564e-04,
         -1.0950e-04, -9.6356e-05]])

### Audio2Text Pipeline

In [5]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


# Use the GPU with half precision when available; otherwise fall back to CPU and float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-medium"

# `dtype` replaces the deprecated `torch_dtype` keyword in recent transformers releases.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    dtype=torch_dtype,
    device=device,
)



Device set to use cuda


In [27]:
# Pass a 1-D array: the pipeline expects mono audio without a channel dimension.
# Note: `output` shadows the `google.colab.output` module imported in the recording cell.
output = pipe(waveform_16k.squeeze(0).numpy())


In [33]:
print(output["text"])

 Audio recording test


### Text2Audio Pipeline

In [29]:
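from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf

# Named `tts_pipeline` so it does not shadow the `pipeline` function from transformers.
tts_pipeline = KPipeline(lang_code='a')  # 'a' selects American English
generator = tts_pipeline(output["text"], voice='af_heart')
# Each item holds the text segment (gs), its phonemes (ps), and the audio samples.
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    display(Audio(data=audio, rate=24000, autoplay=i==0))  # Kokoro outputs 24 kHz audio
    sf.write(f'{i}.wav', audio, 24000)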


0 Audio recording test ˈɔdiO ɹəkˈɔɹdɪŋ tˈɛst
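
The loop above writes one file per generated segment (`0.wav`, `1.wav`, ...). For longer texts it can be convenient to join them into a single file. The sketch below uses only the standard-library `wave` module and assumes the segments are plain PCM WAV files with identical format, which should hold for files written by `sf.write` with its default settings. `concatenate_wavs` and the output name `combined.wav` are hypothetical names introduced here.

```python
import wave


def concatenate_wavs(segment_paths, output_path):
    """Append the frames of several WAV files into one file.

    All segments must share channel count, sample width, and sample rate.
    """
    if not segment_paths:
        raise ValueError("no segments to concatenate")
    params = None
    frames = []
    for path in segment_paths:
        with wave.open(path, "rb") as segment:
            fmt = (segment.getnchannels(), segment.getsampwidth(), segment.getframerate())
            if params is None:
                params = fmt  # The first segment defines the expected format.
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format: {fmt} != {params}")
            frames.append(segment.readframes(segment.getnframes()))
    with wave.open(output_path, "wb") as combined:
        combined.setnchannels(params[0])
        combined.setsampwidth(params[1])
        combined.setframerate(params[2])
        for chunk in frames:
            combined.writeframes(chunk)
    return output_path
```

A call such as `concatenate_wavs(["0.wav", "1.wav"], "combined.wav")` would then produce one file that can be played with `Audio("combined.wav")`.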
