# Speech Detection and Transcription with WebRTC VAD and Whisper

This notebook shows how to use WebRTC Voice Activity Detection (VAD) to check whether an audio file contains speech and, if it does, transcribe it with the Whisper model. The process has five steps:

1. **Load and Preprocess the Audio File**
2. **Initialize WebRTC VAD**
3. **Perform VAD to Detect Speech**
4. **Transcribe the Audio Using Whisper**
5. **Save the Transcription to a File**

Let's start by importing the necessary libraries.

In [1]:
# Import the libraries used in this notebook
import os

import numpy as np
import webrtcvad
import whisper

## Load and Preprocess the Audio File
We first load the audio file and preprocess it. Whisper works on 16 kHz mono audio, so the file has to be decoded and resampled to that format before it reaches either model.

In [2]:
# Define the file path
file_path = "aud.mp3"

# Stop early if the file is missing, since every later cell depends on it
if not os.path.exists(file_path):
    raise FileNotFoundError(f"Audio file not found: {file_path}")

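Next, we load the audio with Whisper's helper functions. `whisper.load_audio` decodes the file to 16 kHz mono float32 samples, and `whisper.pad_or_trim` pads or cuts the result to exactly 30 seconds, the window length the model expects. We then convert the same samples to 16-bit PCM bytes for WebRTC VAD.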

In [3]:
# Initialize Whisper model
model = whisper.load_model("base")

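# Load the audio as 16 kHz mono float32, then pad or trim it to 30 seconds
audio = whisper.load_audio(file_path)
audio = whisper.pad_or_trim(audio)
# Compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)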

# Convert audio to raw PCM bytes for VAD processing
audio_int16 = np.int16(audio * 32767)  # Convert float32 [-1, 1] to int16 [-32767, 32767]
audio_pcm = audio_int16.tobytes()  # Convert to raw PCM bytes
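
To make the byte layout concrete, here is a minimal sketch of the same float-to-PCM conversion written with only the standard library. The helper name `float_to_pcm16` is hypothetical and not part of Whisper or webrtcvad. The explicit clipping is an added safeguard, on the assumption that some inputs could fall outside [-1, 1].

```python
import struct

def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes.

    Hypothetical helper for illustration only.
    """
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clip so out-of-range values cannot overflow int16
        ints.append(int(s * 32767))  # int() truncates toward zero
    # "<" = little-endian, "h" = signed 16-bit integer
    return struct.pack("<%dh" % len(ints), *ints)

pcm = float_to_pcm16([0.0, 0.5, -1.0, 1.5])
print(len(pcm))  # 2 bytes per sample, so 4 samples give 8 bytes
```

Each sample occupies two bytes, which is why the VAD cell below multiplies the frame size in samples by 2 when slicing the byte string.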

## Initialize WebRTC VAD
We create a WebRTC VAD instance and set its aggressiveness mode. The mode controls how strictly the detector filters out non-speech audio.

In [4]:
# Initialize WebRTC VAD
vad = webrtcvad.Vad()
vad.set_mode(2)  # Aggressiveness from 0 to 3; 3 filters out non-speech most aggressively
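
WebRTC VAD only accepts 16-bit mono PCM at 8000, 16000, 32000, or 48000 Hz, in frames of 10, 20, or 30 ms. The sketch below checks those constraints and computes the frame size in bytes. The function name `frame_bytes` and the constant names are hypothetical, chosen for this example.

```python
SUPPORTED_RATES = (8000, 16000, 32000, 48000)  # Hz
SUPPORTED_FRAME_MS = (10, 20, 30)              # frame durations in milliseconds
BYTES_PER_SAMPLE = 2                           # 16-bit PCM

def frame_bytes(sample_rate, frame_ms):
    """Return the size in bytes of one VAD frame, or raise ValueError.

    Hypothetical helper for illustration only.
    """
    if sample_rate not in SUPPORTED_RATES:
        raise ValueError(f"Unsupported sample rate: {sample_rate}")
    if frame_ms not in SUPPORTED_FRAME_MS:
        raise ValueError(f"Unsupported frame duration: {frame_ms} ms")
    samples_per_frame = sample_rate * frame_ms // 1000
    return samples_per_frame * BYTES_PER_SAMPLE

print(frame_bytes(16000, 30))  # 480 samples * 2 bytes
```

Running a check like this before the VAD loop turns a malformed frame into a clear error message instead of a failure inside the detector.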

## Perform VAD to Detect Speech
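We run VAD over the preprocessed audio, which covers the first 30 seconds of the file because of the padding and trimming step. VAD classifies short frames (10, 20, or 30 ms each) one at a time, so we split the PCM bytes into 30 ms frames and stop at the first frame flagged as speech.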

In [5]:
# Run VAD frame by frame over the preprocessed audio
sample_rate = 16000  # whisper.load_audio resamples to 16 kHz
frame_duration_ms = 30  # WebRTC VAD accepts 10, 20, or 30 ms frames
frame_size = int(sample_rate * frame_duration_ms / 1000)  # Frame size in samples
bytes_per_frame = frame_size * 2  # 16-bit PCM uses 2 bytes per sample
num_frames = len(audio_pcm) // bytes_per_frame  # Number of complete frames

contains_speech = False

for i in range(num_frames):
    frame = audio_pcm[i * bytes_per_frame:(i + 1) * bytes_per_frame]
    if vad.is_speech(frame, sample_rate=sample_rate):
        contains_speech = True
        break  # One speech frame is enough to answer the question

# Check if the audio contains speech before transcribing
if contains_speech:
    print("Speech detected in the audio.")
else:
    print("No speech detected in the audio.")
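
A single flagged frame can be a false positive, so one possible refinement is to look at the fraction of frames classified as speech. The sketch below is not part of the original workflow. The names `iter_frames`, `speech_ratio`, and `fake_is_speech` are hypothetical, and the classifier is passed in as a callable so the example runs without webrtcvad installed.

```python
def iter_frames(pcm, bytes_per_frame):
    """Yield consecutive complete frames from a PCM byte string."""
    for start in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
        yield pcm[start:start + bytes_per_frame]

def speech_ratio(pcm, bytes_per_frame, is_speech):
    """Return the fraction of frames that `is_speech` flags as speech."""
    flags = [bool(is_speech(frame)) for frame in iter_frames(pcm, bytes_per_frame)]
    return sum(flags) / len(flags) if flags else 0.0

# Stand-in classifier for the demo: any non-zero byte counts as "speech".
# With the real detector, pass something like:
#   lambda frame: vad.is_speech(frame, sample_rate)
def fake_is_speech(frame):
    return any(frame)

silence = bytes(960 * 3)       # three silent 30 ms frames at 16 kHz
voiced = b"\x01\x00" * 480     # one non-silent frame
ratio = speech_ratio(silence + voiced, 960, fake_is_speech)
print(ratio)  # 1 of 4 frames is flagged
```

A decision such as `contains_speech = ratio > 0.1` would then replace the early `break`. The 0.1 threshold is an arbitrary starting point that would need tuning on real recordings.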

## Transcribe the Audio Using Whisper
If the audio contains speech, we transcribe it with the Whisper model and save the resulting text to a file. Note that `whisper.decode` works on the single 30-second window prepared earlier.

In [6]:
# Transcribe only if speech was detected
if contains_speech:
    # Half precision is typically only supported on GPU, so disable it on CPU
    use_fp16 = model.device.type == "cuda"
    options = whisper.DecodingOptions(language="en", task="transcribe", fp16=use_fp16)
    result = whisper.decode(model, mel, options)

    # Save the transcription to a text file
    with open("transcription.txt", "w", encoding="utf-8") as file:
        file.write(result.text)

    print("Transcription saved to 'transcription.txt'.")
else:
    print("Skipping transcription because no speech was detected.")
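
The save step can also be wrapped in a small reusable function. The following is a minimal sketch, assuming UTF-8 output and a trailing newline are wanted. The name `save_transcription` is hypothetical, and the demo writes to a temporary directory so it does not touch the notebook's own files.

```python
import tempfile
from pathlib import Path

def save_transcription(text, path):
    """Write transcription text to `path` as UTF-8 and return the path.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    path.write_text(text.strip() + "\n", encoding="utf-8")
    return path

# Demo with placeholder text in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out = save_transcription("  hello world ", Path(tmp) / "out" / "transcription.txt")
    saved = out.read_text(encoding="utf-8")

print(repr(saved))
```

In the notebook, the call would be `save_transcription(result.text, "transcription.txt")`.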

## Alternative Transcription Method
Alternatively, you can call the model's `transcribe` method on the file path. This bypasses VAD, handles loading and preprocessing itself, and processes the whole file with a sliding 30-second window rather than only the first 30 seconds.

In [7]:
# Alternative transcription method
# The "tiny" model is faster but generally less accurate than "base"
model = whisper.load_model("tiny")
result = model.transcribe(file_path)
print("Transcription:")
print(result["text"])
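
To illustrate why window length matters for longer recordings, the sketch below computes fixed 30-second chunk boundaries for a 16 kHz signal. It is a simplified illustration rather than Whisper's actual algorithm: `transcribe` advances its window based on predicted timestamps, not in fixed steps. The name `chunk_bounds` and the constants are hypothetical.

```python
SAMPLE_RATE = 16000                          # Hz, the rate Whisper works at
CHUNK_SECONDS = 30                           # length of one Whisper window
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples per window

def chunk_bounds(num_samples, chunk_samples=CHUNK_SAMPLES):
    """Return (start, end) sample indices of consecutive fixed-size chunks."""
    return [
        (start, min(start + chunk_samples, num_samples))
        for start in range(0, num_samples, chunk_samples)
    ]

# A 75-second recording spans two full windows and one partial window
bounds = chunk_bounds(75 * SAMPLE_RATE)
for start, end in bounds:
    print(start / SAMPLE_RATE, end / SAMPLE_RATE)
```

With the VAD-gated cells above, only the first of these windows is ever decoded, so anything after the 30-second mark is left out of `transcription.txt`.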