Question

Real-time Transcription: Design a system to transcribe continuous, potentially infinite audio streams in
real-time, similar to how YouTube captions work. This task cannot be performed by breaking the audio
into smaller files, saving to disk, or creating temporary files due to computational, memory, and
potential disk cost. Upon ending the stream, the system should output "stream ended" and cease
operation, not falling into an infinite loop.


Answer

I use OpenAI Whisper, an open-source speech-to-text model, for transcription.

The question is somewhat ambiguous, and there are two possible readings.
One is that a complete audio file is provided and converted to text.
The other is that audio arrives in chunks in real time, e.g. streaming.
Both implementations are provided below and both work. I prefer the first because it is more compatible with the current version of OpenAI Whisper.


In [None]:
!pip install -U openai-whisper

In [None]:
# Method 1
# Provided audio file

import whisper

filename = "input2.wav"
model = whisper.load_model("small")
result = model.transcribe(filename, fp16=False)
print(result["text"])
print("stream ended")


In Method 2 below, chunks are used to represent audio arriving in real time.
Using microphone audio would have been easier, but the question leans towards an audio file.

Note: The implementation below works, but it is not optimized. Whisper currently has no clear interface for audio passed as bytes (input as an ndarray is supported, and that is what this implementation uses). Most people pass a temporary file to the transcribe method.
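As a rough illustration of the ndarray route mentioned in the note, raw audio bytes can be turned into an array in memory without a temporary file. This is a minimal sketch, assuming the bytes are 16-bit little-endian mono PCM at 16 kHz. The helper name `pcm16_bytes_to_float32` is hypothetical and not part of Whisper.

```python
import numpy as np

def pcm16_bytes_to_float32(raw: bytes) -> np.ndarray:
    """Convert raw 16-bit little-endian mono PCM bytes to float32 in [-1.0, 1.0)."""
    # Drop a trailing odd byte so the buffer holds a whole number of samples
    usable = len(raw) - (len(raw) % 2)
    samples = np.frombuffer(raw[:usable], dtype="<i2")
    # Dividing by 32768 maps the int16 range onto [-1.0, 1.0)
    return samples.astype(np.float32) / 32768.0

# Example: four samples packed as raw bytes (assumed input format)
raw = np.array([0, 16384, -16384, -32768], dtype="<i2").tobytes()
audio = pcm16_bytes_to_float32(raw)
print(audio)
```

The resulting array could then go through `whisper.pad_or_trim` and `whisper.log_mel_spectrogram` in the same way as the chunks in Method 2.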


In [None]:
# Method 2
# real time

import numpy as np
import whisper
from scipy.io import wavfile


model = whisper.load_model("small")
samplerate, data = wavfile.read('input2.wav')
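# wavfile.read returns a 1-D array for mono and (n_samples, n_channels) otherwise,
# so check the number of dimensions rather than indexing shape[1]
if data.ndim > 1:
    print('Stereo channel detected. Converting to mono.')
    data = np.mean(data, axis=1)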

def generate_audio_chunks(data, chunk_size):
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]
        
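# 32000 samples is 2 seconds of audio if the file is sampled at 16 kHz, the rate Whisper expects
chunk_size = 32000
# Slicing never runs past the end, so the generator also handles inputs shorter than one chunk
data_chunks = generate_audio_chunks(data, chunk_size)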

for chunk in data_chunks:
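    # Scale samples to roughly [-1.0, 1.0]; this assumes the WAV file stores 16-bit PCM
    float_data = chunk.astype(np.float32, order='C') / 32768.0
    # Whisper decodes fixed 30-second windows, so each chunk is zero-padded to that length
    audio = whisper.pad_or_trim(float_data)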
    
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    options = whisper.DecodingOptions(fp16=False)
    result = whisper.decode(model, mel, options)

    print(result.text, flush=True, end=' ')

print("stream ended")
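The loop above stops because the file is finite. For a live source, the same termination behaviour could be sketched as follows: read fixed-size blocks from a file-like stream and stop at the first empty read. This is a sketch under stated assumptions: the stream signals its end with an empty read, and `transcribe_chunk` is a hypothetical callback standing in for the Whisper decoding step, not a Whisper function.

```python
import io
from typing import BinaryIO, Callable, Iterator, List

def read_blocks(stream: BinaryIO, block_bytes: int) -> Iterator[bytes]:
    """Yield successive blocks from the stream until it reports end of data."""
    while True:
        block = stream.read(block_bytes)
        if not block:  # an empty read means the stream has ended
            return
        yield block

def run(stream: BinaryIO, transcribe_chunk: Callable[[bytes], str],
        block_bytes: int = 64000) -> List[str]:
    """Transcribe block by block, then announce the end of the stream."""
    pieces = []
    for block in read_blocks(stream, block_bytes):
        text = transcribe_chunk(block)
        print(text, flush=True, end=" ")
        pieces.append(text)
    print("stream ended")
    return pieces

# Stand-in for a live source: 150000 bytes of silence held in memory
fake_stream = io.BytesIO(bytes(150000))
# Hypothetical callback that only reports the block size instead of decoding
pieces = run(fake_stream, lambda block: f"<{len(block)} bytes>")
```

Only one block is held in memory at a time, and the generator returns as soon as the source is exhausted, so the loop cannot spin forever on a closed stream. The default of 64000 bytes corresponds to 32000 16-bit samples, matching the chunk size used in Method 2.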

Question

Evaluation Metrics: After transcribing the audio, your system should be able to evaluate its
performance. Design and report appropriate metrics to measure the accuracy of the transcription.

Answer

I use JiWER to compute the evaluation metrics.

JiWER is a simple and fast Python package for evaluating an automatic speech recognition system. It supports the following measures:

word error rate (WER)
match error rate (MER)
word information lost (WIL)
word information preserved (WIP)
character error rate (CER)
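
To make the first and last measures concrete, here is a minimal pure-Python sketch of WER and CER as edit distance divided by reference length. It illustrates the definitions and is not JiWER's implementation. The function names are hypothetical, and the sketch assumes whitespace tokenization, a non-empty reference, and no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)

reference = "the quick brown fox"
hypothesis = "the quick brown box jumps"
wer = word_error_rate(reference, hypothesis)
cer = character_error_rate(reference, hypothesis)
print(f"WER={wer:.2f} CER={cer:.2f}")
```

JiWER exposes the same two measures as `jiwer.wer` and `jiwer.cer`, which take the reference first and the hypothesis second. Both metrics need a human-verified reference transcript of the audio to compare against.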