Question

Real-time Transcription: Design a system to transcribe continuous, potentially infinite audio streams in
real-time, similar to how YouTube captions work. This task cannot be performed by breaking the audio
into smaller files, saving to disk, or creating temporary files due to computational, memory, and
potential disk cost. Upon ending the stream, the system should output "stream ended" and cease
operation, not falling into an infinite loop.


Answer

I use OpenAI Whisper, an open-source speech-to-text model, for transcription.

The question was not entirely clear. There are two possible interpretations.
One is that an audio file is provided and converted to text.
The other is that audio arrives in chunks in real time, e.g. streaming.
Both implementations are provided below and work. I prefer the first implementation as it is more compatible with the current version of OpenAI Whisper.


In [None]:
!pip install -U openai-whisper

In [None]:
# Method 1
# Provided audio file

import whisper

filename = "input2.wav"
model = whisper.load_model("small")
result = model.transcribe(filename, fp16=False)
print(result["text"])
print("stream ended")


In Method 2 below, chunks are used to represent audio arriving in real time.
Using mic audio would have been easier, but the question leaned more towards an audio file.

Note: The implementation below works. There are limitations in terms of optimization, as Whisper currently has no clear interface for audio as bytes (although input as an ndarray is supported, which is what this implementation uses). Most people pass a temporary file as input to the transcribe method.


In [None]:
# Method 2
# real time

import numpy as np
import whisper
from scipy.io import wavfile


model = whisper.load_model("small")
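# Note: Whisper expects 16 kHz audio; this code does not resample input2.wav
samplerate, data = wavfile.read('input2.wav')
# wavfile.read returns a 1-D array for mono and (samples, channels) for multi-channel audio
if data.ndim > 1:
    print('stereo channel detected. Converting to mono.')
    data = np.mean(data, axis=1)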

def generate_audio_chunks(data, chunk_size):
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]
        
chunk_size = 32000  # samples per chunk (2 seconds at 16 kHz)
# The generator also covers audio shorter than one chunk (it yields a single chunk)
data_chunks = generate_audio_chunks(data, chunk_size)

for chunk in data_chunks:
    float_data = chunk.astype(np.float32, order='C') / 32768.0
    audio = whisper.pad_or_trim(float_data)
    
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    options = whisper.DecodingOptions(fp16=False)
    result = whisper.decode(model, mel, options)

    print(result.text, flush=True, end=' ')

print("stream ended")
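Method 2 reads the whole file before chunking, so it still needs the complete audio up front. Below is a minimal sketch of a streaming variant, assuming the audio arrives as raw 16-bit little-endian mono PCM at 16 kHz on a file-like object. The names `stream_windows`, `pcm16_to_float`, `run`, and the `transcribe_window` callback are hypothetical and not part of Whisper. Only one window is held in memory at a time, nothing is written to disk, and the loop stops when the source reports end of stream.

```python
import sys
from array import array

SAMPLE_RATE = 16000               # assumption: input is already 16 kHz mono
WINDOW_SAMPLES = 2 * SAMPLE_RATE  # 2-second windows, like chunk_size = 32000 above
BYTES_PER_SAMPLE = 2              # assumption: 16-bit signed little-endian PCM


def pcm16_to_float(raw):
    """Convert raw 16-bit PCM bytes to a list of floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(bytes(raw))
    if sys.byteorder == "big":    # array("h") is native-endian; input is little-endian
        samples.byteswap()
    return [s / 32768.0 for s in samples]


def stream_windows(stream, read_size=4096):
    """Yield fixed-size windows of samples as bytes arrive on `stream`."""
    pending = bytearray()
    window_bytes = WINDOW_SAMPLES * BYTES_PER_SAMPLE
    while True:
        block = stream.read(read_size)
        if not block:             # an empty read means the stream has ended
            break
        pending.extend(block)
        while len(pending) >= window_bytes:
            yield pcm16_to_float(pending[:window_bytes])
            del pending[:window_bytes]
    # Flush the final partial window, dropping any dangling odd byte
    usable = len(pending) - len(pending) % BYTES_PER_SAMPLE
    if usable:
        yield pcm16_to_float(pending[:usable])


def run(stream, transcribe_window):
    """Transcribe each window as it arrives, then report the end of the stream."""
    for window in stream_windows(stream):
        print(transcribe_window(window), flush=True, end=" ")
    print("stream ended")
```

A possible use is `run(sys.stdin.buffer, transcribe_window)`, where `transcribe_window` would wrap the `pad_or_trim`, `log_mel_spectrogram`, and `decode` steps from Method 2 after converting the samples to a float32 ndarray. Fixed windows can cut words at the boundaries, so overlapping windows may be worth trying.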

---

Question:

Evaluation Metrics: After transcribing the audio, your system should be able to evaluate its
performance. Design and report appropriate metrics to measure the accuracy of the transcription.

Answer:
Using jiwer

JiWER is a simple and fast Python package for evaluating an automatic speech recognition system. It supports the following measures:

word error rate (WER)
match error rate (MER)
word information lost (WIL)
word information preserved (WIP)
character error rate (CER)

In [None]:
original = '''
    First Speaker: To tell you basically what this is about is when I was watching Harvey Mackay at one of Harv Eker's things, he said he just finished the Boston marathon and you know, the guy is 76 and I went holy crap, you know, that is amazing. He looked so fit and he is so quick minded and so on I thought, all of a sudden it occurred to me I bet the way you eat, you know, is different. I bet you don't just eat a bunch of garbage and that started this thought. So, the basic three questions will be and I am recording it for you as well if I transcribe these for the book, but then I write about it and what has really been neat about it is that what started out as three same questions to everybody, everybody had kind of a different angle on it and I realized that they were creating the chapters for this book and of course Marci Shimoff read me right [???], I am not doing something where I did all the work and you are just transcribing it, but if you actually write in the book, I will do it. So I made her that promise and it was a hard promise, but it was a good one to make because it made me think more, you know. Second Speaker: Got you. First Speaker: So, what I would do is I basically introduce you and then you can add anything that you think is important to that introduction and let me get my history up here because I have you on here. So, how is Robby doing? Second Speaker: Good, hangin' in there. First Speaker: Yeah, did you guys have a nice holiday? Second Speaker: Well, we actually kind of had a [???] holiday, her father who is very old got sick and ended up passing away. First Speaker: Oh I am sorry to hear that. Second Speaker: But, you know, stuff happens, what are you going to do? First Speaker: So I am going to – is your best website, at the end I am going to ask you, you know, about your website and stuff, is rickfrishman.com the best one to go to or - Second Speaker: Yeah probably just for most stuff that is probably the best way to go yeah. 
First Speaker: You had a really good bio on one of your websites. Second Speaker: It is up there, there is one, you know, in most of them. I also have rickfrishmanblog.com, you know. First Speaker: Let me check that out, okay so the - Second Speaker: There is a bio on that one, but it is also a bio on just rickfrishman.com. First Speaker: There we go about Rick, yeah. Second Speaker: Sure. First Speaker: So you know, one of the things that I will bring up is, you know, you always talk about how you have the biggest Rolodex and I thought that was a really cool angle too because part of success is who you know and you know, I think that is important. I don't know what your angle is going to be on this, but you know, the questions will be do you think that that hypothesis is true that, you know, food affects your ability to succeed on some level and then if you - Second Speaker: Food affects your ability to - First Speaker: You know, if it plays into your level of success. In other words, you know, I know there are successful people who eat crappy food, but so far kind of the consensus has been, you know, it has run the gamut of extremes, but so far people seem to say, you know, they can't keep up their energy if you speak a lot. You do a lot of speaking so you know, and you have a hectic schedule, so I imagine that if you are, you know, full of two pizzas, you probably don't have the energy on stage that you normally would. Second Speaker: Right, it is true. First Speaker: So, that's kind of the angle, but...
'''

In [None]:
predicted = '''
     I'll tell you basically what this is about is when I was watching Harvey McKay at one of Harvecker's things, he said he'd just finished the Boston Marathon and you know the guy's 76 and I went holy crap you know that's amazing. He looked so fit and he's so quick-minded and so  on. I thought all of a sudden it occurred to me I bet the way you eat you know is different. I bet you don't just eat a bunch of garbage. And that started this thought so the basic three questions will be and I'm recording it for use as well if I transcribe these for the book but then I write about it and what's really been neat about it is that what started out as three same questions to everybody. Everybody had  kind of a different angle on it and I realized that they were creating the chapters for this book and of course Marcy Shymoff read me the riot act and said I'm not doing something where I did all the work and you're just transcribing it but if you actually write in the book I'll do it. So I made her that promise and it was a hard promise but it was a good one to make because it made me think more you know. So what I would do is I basically introduce you and then you can add anything that you think is important to that introduction and let me get my history up here because I have you on here. So how's Robbie doing? Good, hanging in there. Yeah, did you guys have a nice holiday? Well we actually got out of Cruddy Holiday. Her father who was very old got sick and then passed away. Oh golly. Oh I'm sorry to hear that. But you  know stuff happens, what are you going to do? Okay so I'm going to, is your best website, we're going to at the end I'm going to ask you know about your website and stuff is rickfrishman.com the best one to go to or Yeah probably just for most stuff that's probably the best way to go. You had a really good bio on one of your websites. Up there there's one you know and most of them, I also have rickfrishmanblog.com you know. Hey it's Rick Frishman. 
Let me check that out. I'm so happy you made it here. Okay so the. There's a bio on that one but it's also bio on rickfrishman.com. There we go about Rick. So you know one of the things that I'll bring up is you know you always talk about how  you have the biggest roller decks and I thought that was a really cool angle too because part of success is who you know and you know I think that's important. I don't know what your angle is going to be on this but you know the questions will be do you think that that hypothesis is true that you know food affects your ability to succeed on some level and then if you. That food affects you to ability to. You know  if it plays into your level of success. In other words you know I know there are successful people who eat crappy food but so far kind of the consensus has been you know it's run the gamut of extremes but so far people seem to say you know they can't keep up their energy if you speak a lot you do a lot of speaking so you know and you have a hectic schedule so I imagine that if you're you know full of two pizzas you probably don't have the energy on stage that you normally would. So that's kind of that's kind of the angle but.
'''

In [None]:
!pip install jiwer

In [None]:
import jiwer

wer = jiwer.wer(original, predicted)
print("Word Error Rate:", wer)


WER is the number of word-level substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of words in the reference.

CER is the same measure computed over characters instead of words.

Lower WER indicates better accuracy.
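
To make the definitions concrete, here is a minimal pure-Python sketch of both measures based on Levenshtein edit distance. It is an illustration rather than jiwer's implementation, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words, giving 2/6. Because insertions are counted as well, the value can exceed 1 when the hypothesis is much longer than the reference.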

In this case, a WER of 0.29885 means that the predicted transcription has an error rate of around 29.89%. 

Given the limited model setup I used, this is acceptable.
There are other evaluation metrics, but this should be enough for the answer.
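
Part of the measured error probably comes from formatting rather than recognition: the reference contains speaker labels ("First Speaker:", "Second Speaker:") and "[???]" markers that the model output never has, and casing and punctuation differ. Below is a minimal sketch of normalizing both texts before scoring. The `normalize` helper and its rules are assumptions chosen for this transcript, not part of jiwer.

```python
import re
import string

# Assumption: labels and markers exactly as they appear in the reference above
SPEAKER_LABEL = re.compile(r"\b(?:First|Second) Speaker:\s*")
UNCLEAR_MARK = "[???]"
# ASCII punctuation plus the en dash used in the reference
STRIP_PUNCT = str.maketrans("", "", string.punctuation + "\u2013")


def normalize(text):
    """Remove speaker labels and markers, lowercase, strip punctuation, collapse spaces."""
    text = SPEAKER_LABEL.sub("", text)
    text = text.replace(UNCLEAR_MARK, " ")
    text = text.lower().translate(STRIP_PUNCT)
    return " ".join(text.split())
```

The score would then be computed as `jiwer.wer(normalize(original), normalize(predicted))`. Contractions such as "I'm" versus "I am" still count as errors under this scheme.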

---

Question

Query-based Approach: Considering that the transcription may contain useful information for users
who don't have time to listen to the entire audio, your system should provide a way to query the
transcriptions. For example, a user might want to ask "what is the attention mechanism?" and your
system should be able to answer this based on the transcription document.


Answer:

From the question, I understand that the query should not be done by string search or regex, because answering requires an understanding of language. Therefore I use another model called GPT4All. This model may be too big for the current problem, depending on production requirements. There may be smaller models, but I use GPT4All as it does the job. I provide the transcription to GPT4All and ask questions based on it.
I use the open-source version of GPT4All.

In [None]:
!pip install gpt4all

In [None]:
# random text from the internet to show the query result
predicted = '''
Over the last couple of decades, the technological advances in storage and processing power have enabled some innovative products based on machine learning, such as Netflix’s recommendation engine and self-driving cars.

Machine learning is an important component of the growing field of data science. Through the use of statistical methods, algorithms are trained to make classifications or predictions, and to uncover key insights in data mining projects. These insights subsequently drive decision making within applications and businesses, ideally impacting key growth metrics. As big data continues to expand and grow, the market demand for data scientists will increase. They will be required to help identify the most relevant business questions and the data to answer them.

Contact at email mail.info@ml.com .

Machine learning algorithms are typically created using frameworks that accelerate solution development, such as TensorFlow and PyTorch.
'''

In [None]:
import gpt4all
gptj = gpt4all.GPT4All("ggml-gpt4all-j-v1.3-groovy")
content = f"Does the following text mention Mary Shymoff? The text: '{predicted}' "
messages = [{"role": "user", "content": content}]
gptj.chat_completion(messages)

GPT4All is CPU-only, so it takes a lot of time (see note below). I therefore provide a sample output.

Model Response (sample, from a query asking for the email address in the text):
The email address in the text is [mail.info@ml.com](mail.info@ml.com).


Note:

There is a lot of room for improvement and optimization. Firstly, a smaller model may be better. GPT4All currently uses only the CPU, so it takes a lot of time. A GPU-based model would be better and faster.
I had limited time, including the time to explore and download the models with limited internet speed. Machine limitations also apply. Still, the solution should be understandable.
Furthermore, the ML models would perform better when tuned.


I hardcoded the input to the query model (the predicted variable) and the output for simplicity of explanation. Using the direct output of transcription as input to the query model will work fine.
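
Here is a minimal sketch of how the transcription output could be turned into the message list used in the query cell. The helper name `build_query_messages`, the prompt wording, and the `max_chars` limit are assumptions. The limit only guards against transcripts longer than the model can accept and would need tuning for the model actually used.

```python
def build_query_messages(transcript, question, max_chars=6000):
    """Build a single-turn message list asking `question` about `transcript`."""
    # Collapse whitespace and truncate; max_chars is a hypothetical budget
    excerpt = " ".join(transcript.split())[:max_chars]
    content = (
        "Answer the question using only the text below.\n"
        f"Question: {question}\n"
        f"The text: '{excerpt}'"
    )
    return [{"role": "user", "content": content}]
```

With the objects from the cells above, this might be called as `gptj.chat_completion(build_query_messages(result["text"], "what is the attention mechanism?"))`. Plain truncation drops the end of long transcripts, so splitting the transcript and querying the most relevant part may work better.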