# Using Whisper and RAG to Build a Podcast Chat App

Whisper, an Automatic Speech Recognition (ASR) model, is a powerful tool that converts spoken language into written text. This capability is crucial in numerous applications, including transcription services, voice assistants, and more. It allows us to transform audio content into a format that can be processed and understood by text-based models.

In this notebook, we demonstrate how to use the [Whisper](https://huggingface.co/openai/whisper-large) model on the Groq API, along with Retrieval-Augmented Generation (RAG), to build an interactive application. The application converts podcast audio into text transcriptions, stores these transcriptions in a [Pinecone](https://www.pinecone.io/) vector database, and then uses [LangChain](https://www.langchain.com/) to perform RAG on those transcriptions to answer user queries based on the podcast content. By leveraging Groq's fast inference, we can efficiently process the podcast content and give users accurate, contextually relevant responses to their queries.

### Setup

In [1]:
from groq import Groq
import os
import pandas as pd
import numpy as np
from pydub import AudioSegment

from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document
from langchain_pinecone import PineconeVectorStore
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings

A Groq API Key is required for this demo - you can generate one for free [here](https://console.groq.com/keys). We will be using Pinecone as our vector database, which also requires an API key (their Starter plan lets you create one index for a small project for free). Finally, we will be using the `whisper-large-v3` model for audio transcription.

In [2]:
client = Groq(api_key = os.getenv('GROQ_API_KEY'))
model = 'whisper-large-v3'
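
Since both services need a key, it can help to confirm the environment is set up before running the rest of the notebook. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` is a hypothetical helper, and `PINECONE_API_KEY` is an assumed name for wherever your Pinecone key is stored (the notebook itself only reads `GROQ_API_KEY` explicitly).

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Hypothetical check; PINECONE_API_KEY is an assumed variable name.
missing = missing_env_vars(["GROQ_API_KEY", "PINECONE_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the helper easy to try out without touching your real environment.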

### Using Whisper for Audio Transcription

First, let's test out using Whisper to transcribe a single, smaller audio file. We will start with transcribing an .mp3 conversion of our CEO Jonathan Ross' talk on CNN: [Groq's AI Chip Breaks Speed Records](https://www.youtube.com/watch?v=pRUddK6sxDg):

In [3]:
filepath = "Groq's AI Chip Breaks Speed Records.mp3"
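def audio_to_text(filepath):
    # Send the audio file to Whisper on Groq and return the resulting text.
    # Note: the translations endpoint always returns English text; for English
    # audio like this interview, the result is effectively a transcription.
    with open(filepath, "rb") as file:
        translation = client.audio.translations.create(
            file=(filepath, file.read()),
            model="whisper-large-v3",
        )
    return translation.text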
    
translation_text = audio_to_text(filepath)

# Show just the beginning of the transcription
print(translation_text[:2000])

 Welcome back. You're with Connect the World. I'm Becky Anderson. We're at the World Government Summit in Dubai. And one thing I've noticed here is that whenever a discussion about artificial intelligence takes place, the rooms here, the huge halls get packed. That is because some of the leading minds behind the technological revolution have been gathered here at the conference over the past couple of days and in ai's race to the top my next guest is sprinting at speeds never seen before jonathan ross is the brain behind groq the world's first language processing unit now before i lose you in the technological jargon of ai let me put it this way what Ross created is a chip that can run programs like Meta's Lama 2 model for example faster than anything else in the world ten to 100 times faster in fact and he's here with me now to explain how that is possible before I ask you that groq why groq Thank You Becky it's groq and we We spell it with a Q, and it's because it comes from a scienc

With the audio from the interview successfully converted to text, we can now feed it into the context of a simple chat completion request and ask questions about the contents of the interview, using Meta's smaller Llama 3 8B model for text completion:

In [5]:
def transcript_chat_completion(client, transcript, user_question):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": '''Use this transcript or transcripts to answer any user questions, citing specific quotes:

                {transcript}
                '''.format(transcript=transcript)
            },
            {
                "role": "user",
                "content": user_question,
            }
        ],
        model="llama3-8b-8192",
    )

    print(chat_completion.choices[0].message.content)
    
user_question = "Explain the importance of fast language models"
transcript_chat_completion(client, translation_text, user_question)  

According to the transcript, Jonathan Ross explained the importance of fast language models as follows:

"The reason you care about the speed is it's about engagement. Imagine if I spoke that slowly, you'd just drift off, you'd go away. Most certainly. So the statistic is if you improve the speed by 100 milliseconds on a website on desktop, you will get about an 8% increase in user engagement. On mobile, it's 34%. People have no patience on mobile."

In this quote, Ross emphasizes that speed is crucial for keeping users engaged. He explains that even a small improvement in speed can lead to a significant increase in user engagement, with a 8% increase on desktop and a 34% increase on mobile for every 100 millisecond improvement in speed. This highlights the importance of fast language models in maintaining user attention and keeping them engaged with the application or website.
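
The request above places the whole transcript in the system message, which works for a short interview but will not scale to long recordings. As a hedged sketch (not the notebook's own code), the message construction can be separated from the API call so it can be inspected without a network request. `build_transcript_messages` and its `max_chars` parameter are hypothetical; trimming by characters is only a crude stand-in for a real token budget.

```python
def build_transcript_messages(transcript, user_question, max_chars=None):
    """Build the system and user messages, optionally trimming the transcript."""
    # Crude guard: cut by characters, not tokens (an assumption for illustration)
    if max_chars is not None and len(transcript) > max_chars:
        transcript = transcript[:max_chars]
    system_prompt = (
        "Use this transcript or transcripts to answer any user questions, "
        "citing specific quotes:\n\n" + transcript
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]

# Inspect the messages that would be sent for a short, made-up transcript
messages = build_transcript_messages("Speed drives engagement.", "Why does speed matter?")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The returned list has the same shape as the `messages` argument used in `transcript_chat_completion`.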


### Building a Podcast Chat App

I've stored .mp3 files of the 10 most recent episodes of the [This Week in Startups podcast](https://thisweekinstartups.com/) in the `mp3-files/` folder that we will transcribe and use for RAG. Please note that when using podcast RSS feeds or any digital content in your application, it's crucial to respect copyright permissions to avoid potential legal issues.

First, we will have to split these files into smaller chunks so that each one can be fully transcribed in a single Whisper request:

In [9]:
mp3_file_folder = "mp3-files"
mp3_chunk_folder = "mp3-chunks"
chunk_length_ms=1000000 # Split into 1000s chunks (16.67 min)
overlap_ms=10000 # 10s overlap between chunks

def split_m4a(mp3_file_folder, mp3_chunk_folder, episode_id, chunk_length_ms, overlap_ms, print_output):
    # Load the audio file (despite the function name, the input is an .mp3)
    audio = AudioSegment.from_file(mp3_file_folder + "/" + episode_id + ".mp3", format="mp3")
    
    # Calculate the number of chunks: consecutive chunks start (chunk_length_ms - overlap_ms) apart
    step_ms = chunk_length_ms - overlap_ms
    num_chunks = len(audio) // step_ms + (1 if len(audio) % step_ms else 0)
    
    # Split the file into chunks
    for i in range(num_chunks):
        start_ms = i * chunk_length_ms - (i * overlap_ms)
        end_ms = start_ms + chunk_length_ms
        chunk = audio[start_ms:end_ms]
        
        # Export each chunk to a file
        export_fp = mp3_chunk_folder + "/" + episode_id + f"_chunk{i+1}.mp3"
        chunk.export(export_fp, format="mp3")
        if print_output:
            print('Exporting', export_fp)
        
    return chunk # Return last chunk for demo purposes
    
print_output = True
for fil in os.listdir(mp3_file_folder):
    episode_id = fil.split('.')[0]
    print('Splitting Episode ID:', episode_id)
    chunk = split_m4a(mp3_file_folder, mp3_chunk_folder, episode_id, chunk_length_ms, overlap_ms, print_output)
    print_output = False
    
    
chunk

Splitting Episode ID: 215f919b-845b-4d5a-9593-860c29dbd7fa
Exporting mp3-chunks/215f919b-845b-4d5a-9593-860c29dbd7fa_chunk1.mp3
Exporting mp3-chunks/215f919b-845b-4d5a-9593-860c29dbd7fa_chunk2.mp3
Exporting mp3-chunks/215f919b-845b-4d5a-9593-860c29dbd7fa_chunk3.mp3
Exporting mp3-chunks/215f919b-845b-4d5a-9593-860c29dbd7fa_chunk4.mp3
Splitting Episode ID: ca8faab3-b039-47c6-85a9-3e430528c5b0
Splitting Episode ID: 4753dd85-7cf2-4f20-a919-1bdf54e194f5
Splitting Episode ID: e00c7094-c70b-4173-b9d3-0cfa6c0ea7d7
Splitting Episode ID: 615b84d2-44d5-4c76-8c03-3f7312640ccf
Splitting Episode ID: 0f7b6ff1-62e9-4319-8dc2-286fd909cbd7
Splitting Episode ID: 9b0c5bd4-9ec6-4ced-82af-d40206fc75ca
Splitting Episode ID: e85e445a-85fc-48ab-8507-eaf0ced99072
Splitting Episode ID: 96a380e9-a7d1-428c-9347-7e27fcb152a0
Splitting Episode ID: 897994d8-b0a0-4ca4-b9c5-2e74e1c4a5f4
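
To make the overlap arithmetic concrete, here is a small sketch that only computes chunk boundaries, with no audio library involved. It is an illustration rather than the notebook's exact function: `chunk_bounds` is a hypothetical helper, the 45-minute duration is made up, and unlike the loop above it stops as soon as a chunk reaches the end of the audio.

```python
def chunk_bounds(duration_ms, chunk_length_ms, overlap_ms):
    """Return (start_ms, end_ms) pairs covering the audio with overlapping chunks."""
    if not 0 <= overlap_ms < chunk_length_ms:
        raise ValueError("overlap_ms must be non-negative and smaller than chunk_length_ms")
    step_ms = chunk_length_ms - overlap_ms  # Distance between consecutive chunk starts
    bounds = []
    start_ms = 0
    while start_ms < duration_ms:
        end_ms = min(start_ms + chunk_length_ms, duration_ms)
        bounds.append((start_ms, end_ms))
        if end_ms == duration_ms:
            break  # The final chunk reaches the end of the audio
        start_ms += step_ms
    return bounds

# A hypothetical 45-minute episode with the notebook's settings
for start_ms, end_ms in chunk_bounds(45 * 60 * 1000, 1_000_000, 10_000):
    print(f"{start_ms / 1000:.0f}s -> {end_ms / 1000:.0f}s")
```

Each chunk after the first begins 10 seconds before the previous one ends, so a sentence cut off at a boundary appears whole in at least one chunk.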


Podcast metadata is stored in `episode_metadata.csv`, which we can use to attach relevant info like Date and Title to our transcriptions:

In [8]:
episode_metadata_df = pd.read_csv('episode_metadata.csv')
chunk_fps = os.listdir(mp3_chunk_folder)
episode_chunk_df = pd.DataFrame({
    'filepath': [mp3_chunk_folder + '/' + fp for fp in chunk_fps],
    'episode_id': [fp.split('_chunk')[0] for fp in chunk_fps]
    }
)
episodes_df = episode_chunk_df.merge(episode_metadata_df,on='episode_id')
episodes_df.head(10)

| | filepath | episode_id | published_date | title |
|---|---|---|---|---|
| 0 | mp3-chunks/215f919b-845b-4d5a-9593-860c29dbd7f... | 215f919b-845b-4d5a-9593-860c29dbd7fa | 4/23/2024 | AI Demos and News: Llama 3, Marblism, Lumona &... |
| 1 | mp3-chunks/215f919b-845b-4d5a-9593-860c29dbd7f... | 215f919b-845b-4d5a-9593-860c29dbd7fa | 4/23/2024 | AI Demos and News: Llama 3, Marblism, Lumona &... |
| 2 | mp3-chunks/215f919b-845b-4d5a-9593-860c29dbd7f... | 215f919b-845b-4d5a-9593-860c29dbd7fa | 4/23/2024 | AI Demos and News: Llama 3, Marblism, Lumona &... |
| 3 | mp3-chunks/215f919b-845b-4d5a-9593-860c29dbd7f... | 215f919b-845b-4d5a-9593-860c29dbd7fa | 4/23/2024 | AI Demos and News: Llama 3, Marblism, Lumona &... |
| 4 | mp3-chunks/0f7b6ff1-62e9-4319-8dc2-286fd909cbd... | 0f7b6ff1-62e9-4319-8dc2-286fd909cbd7 | 4/26/2024 | The power of super communication with Charles ... |
| 5 | mp3-chunks/0f7b6ff1-62e9-4319-8dc2-286fd909cbd... | 0f7b6ff1-62e9-4319-8dc2-286fd909cbd7 | 4/26/2024 | The power of super communication with Charles ... |
| 6 | mp3-chunks/0f7b6ff1-62e9-4319-8dc2-286fd909cbd... | 0f7b6ff1-62e9-4319-8dc2-286fd909cbd7 | 4/26/2024 | The power of super communication with Charles ... |
| 7 | mp3-chunks/0f7b6ff1-62e9-4319-8dc2-286fd909cbd... | 0f7b6ff1-62e9-4319-8dc2-286fd909cbd7 | 4/26/2024 | The power of super communication with Charles ... |
| 8 | mp3-chunks/0f7b6ff1-62e9-4319-8dc2-286fd909cbd... | 0f7b6ff1-62e9-4319-8dc2-286fd909cbd7 | 4/26/2024 | The power of super communication with Charles ... |
| 9 | mp3-chunks/897994d8-b0a0-4ca4-b9c5-2e74e1c4a5f... | 897994d8-b0a0-4ca4-b9c5-2e74e1c4a5f4 | 4/19/2024 | Redpoint Ventures and Stepstone Group on VC De... |
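
Note that `os.listdir` returns entries in arbitrary order, so chunk files are not guaranteed to come back in playback order. If order matters for your use case, one option is to parse the chunk number out of each filename and sort on it. The sketch below assumes the `<episode_id>_chunk<N>.mp3` naming used when exporting chunks; `parse_chunk_filename` and `sorted_chunk_files` are hypothetical helpers, and the example filenames are made up.

```python
import re

# Matches names like "<episode_id>_chunk<N>.mp3"
CHUNK_PATTERN = re.compile(r"^(?P<episode_id>.+)_chunk(?P<number>\d+)\.mp3$")

def parse_chunk_filename(filename):
    """Split 'abc_chunk3.mp3' into ('abc', 3); return None if the name does not match."""
    match = CHUNK_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("episode_id"), int(match.group("number"))

def sorted_chunk_files(filenames):
    """Order chunk files by episode ID, then by numeric chunk index, skipping other files."""
    parsed = [(parse_chunk_filename(name), name) for name in filenames]
    return [name for key, name in sorted(p for p in parsed if p[0] is not None)]

# Made-up filenames: a plain string sort would put chunk10 before chunk2
example_files = ["ep-a_chunk10.mp3", "ep-a_chunk2.mp3", ".DS_Store", "ep-a_chunk1.mp3"]
print(sorted_chunk_files(example_files))
```

Sorting on the integer chunk number avoids the lexicographic trap where `chunk10` sorts ahead of `chunk2`, and non-matching files such as hidden system files are dropped.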


Next, we'll loop through each episode chunk, transcribe it, split the transcription further so that each piece fits in the context window, and store the pieces in `Document` objects so that they integrate with the vector database. We'll use LangChain's `TokenTextSplitter` to split the text into 500-token chunks:

In [9]:
text_splitter = TokenTextSplitter(
    chunk_size=500, # 500 tokens is the max
    chunk_overlap=20 # Overlap of N tokens between chunks (to reduce chance of cutting out relevant connected text like middle of sentence)
)
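
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a minimal sliding-window sketch. It is not LangChain's implementation: `split_with_overlap` is a hypothetical function, and whitespace-separated words stand in for the tokens a real tokenizer would produce.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # This window already reaches the last token
        start += step
    return chunks

# Whitespace-separated words stand in for real tokenizer tokens here
words = "the quick brown fox jumps over the lazy dog".split()
for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

With a window of 4 and an overlap of 1, the last word of each chunk is repeated as the first word of the next, which is the same idea as the 20-token overlap configured above.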

In [10]:
documents = []
cnt = 0
for index, row in episodes_df.iterrows():
    cnt += 1
    audio_filepath = row['filepath']
    transcript = audio_to_text(audio_filepath)
    chunks = text_splitter.split_text(transcript)
    for chunk in chunks:
        header = f"Date: {row['published_date']}\nEpisode Title: {row['title']}\n\n"
        documents.append(Document(page_content=header + chunk, metadata={"source": "local"}))
        
    # Print transcription progress (it takes a few minutes)
    if np.mod(cnt ,round(len(episodes_df) / 5)) == 0:
        print(round(cnt / len(episodes_df),2) * 100, '% of transcripts processed...')
        

print('# Transcription Chunks: ', len(documents))

20.0 % of transcripts processed...
41.0 % of transcripts processed...
61.0 % of transcripts processed...
82.0 % of transcripts processed...
# Transcription Chunks:  329


Finally, we'll define our embedding function and store our transcription chunks in a Pinecone index called `twist-transcripts`. For more info on embeddings and building RAG solutions, see this [cookbook post](https://github.com/groq/groq-api-cookbook/blob/main/tutorials/presidential-speeches-rag/rag-langchain-presidential-speeches.ipynb).

In [13]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

pinecone_index_name = "twist-transcripts"
docsearch = PineconeVectorStore.from_documents(documents, embedding_function, index_name=pinecone_index_name)
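
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity in plain Python. It assumes the index uses the cosine metric, which depends on how the Pinecone index was created; `cosine_similarity`, `top_k`, and the 3-dimensional vectors are made up for illustration (real `all-MiniLM-L6-v2` embeddings have 384 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed_vectors, k=3):
    """Return the k (label, score) pairs most similar to the query, best first."""
    scored = [(label, cosine_similarity(query_vector, vec)) for label, vec in indexed_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up toy vectors standing in for embedded transcript chunks
toy_index = {
    "chunk about headphones": [0.9, 0.1, 0.0],
    "chunk about basketball": [0.0, 0.2, 0.9],
    "chunk about venture capital": [0.1, 0.9, 0.1],
}
print(top_k([1.0, 0.0, 0.1], toy_index, k=2))
```

A vector database performs the same ranking at scale, using approximate nearest-neighbor indexes instead of scoring every stored vector.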

Now, we can search our vector database to find relevant podcast transcripts that answer questions about recent episodes:

In [16]:
user_question = "what is Vibecheck?"

relevant_docs = docsearch.similarity_search(user_question)
relevant_transcripts = '\n\n------------------------------------------------------\n\n'.join([doc.page_content for doc in relevant_docs[:3]])
transcript_chat_completion(client, relevant_transcripts, user_question)

According to the transcript, Vibecheck is a machine learning model that produces a special website, vibecheck.market, which helps to find or recommend the best products based on user input. In the specific example given, the host asks the AI to find the best headphones, and Vibecheck provides a list of recommendations. The system uses a complex algorithm that takes into account various factors, including user preferences, reviews, and ratings from various online sources. The goal of Vibecheck is to provide personalized recommendations to users, making it easier for them to find the best products that match their needs and preferences.


In [17]:
user_question = "what basketball team does Jason like?"

relevant_docs = docsearch.similarity_search(user_question)
relevant_transcripts = '\n\n------------------------------------------------------\n\n'.join([doc.page_content for doc in relevant_docs[:3]])
transcript_chat_completion(client, relevant_transcripts, user_question)

According to the transcript from Episode 1937, Jason mentions that he likes the Golden State Warriors, referencing their winning championships with the help of multiple All-Star players.


### Conclusion

This notebook shows how converting audio content into text with the Whisper model on Groq makes information that was previously locked in audio accessible and searchable. By storing the text transcriptions in a Pinecone vector database and using Retrieval-Augmented Generation (RAG) for query handling, we can efficiently retrieve and present relevant information. While we demonstrated this capability by building a Podcast Chat App, the same approach applies well beyond this use case: accurate transcription combined with retrieval supports information retrieval, content analysis, and user interaction across a wide range of audio-based applications.