# WhisperPlus: Advancing Speech2Text and Text2Speech Processing 🚀

This Jupyter Notebook demonstrates the WhisperPlus library, a tool for speech-to-text and text-to-speech processing. Each section below covers one feature of WhisperPlus, with a short explanation followed by a code cell.

## 🛠️ Installation

Before we start, install the WhisperPlus package by running the following command:

In [None]:
!pip install -U whisperplus

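# Jupyter already runs an asyncio event loop; nest_asyncio patches asyncio
# so that code which starts its own loop can run inside the notebook.
import nest_asyncio

nest_asyncio.apply()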

### 🎵 YouTube URL to Audio

This section demonstrates how to convert a YouTube video to audio and transcribe it using WhisperPlus.

In [None]:
from whisperplus import SpeechToTextPipeline, download_and_convert_to_mp3

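# Download the video's audio track and convert it to an MP3 file.
url = "https://www.youtube.com/watch?v=di3rHkEZuUw"
audio_path = download_and_convert_to_mp3(url)

# Load the Whisper model and transcribe the audio in English.
pipeline = SpeechToTextPipeline(model_id="openai/whisper-large-v3")
transcript = pipeline(audio_path, "openai/whisper-large-v3", "english")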

print(transcript)
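
Later sections read the transcript from a text file, so it can be handy to save it once it has been produced. The following is a minimal sketch using only the standard library. It assumes the transcript is a plain string; `save_transcript` and the file name are hypothetical helpers introduced here and are not part of WhisperPlus.

```python
import tempfile
from pathlib import Path


def save_transcript(text: str, path) -> Path:
    """Write the transcript to a UTF-8 text file and return its path."""
    out = Path(path)
    out.write_text(text.strip() + "\n", encoding="utf-8")
    return out


# Stand-in string; in the notebook this would be the `transcript` variable.
example_transcript = "Hello and welcome to the video."

# A temporary directory is used here only to keep the example self-contained.
target = Path(tempfile.gettempdir()) / "transcript_example.txt"
saved_path = save_transcript(example_transcript, target)
print(saved_path.read_text(encoding="utf-8"))
```

In the notebook, you would pass `transcript` and a path of your choice instead of the stand-in values.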

### 📰 Summarization

Here, we showcase how to summarize text using the TextSummarizationPipeline in WhisperPlus.

In [None]:
from whisperplus import TextSummarizationPipeline

summarizer = TextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary = summarizer.summarize(transcript)
print(summary[0]["summary_text"])

### 🗞️ Long Text Support Summarization

This part shows how to summarize longer texts using the LongTextSummarizationPipeline, which is useful for documents too long to summarize in a single pass.

In [None]:
from whisperplus import LongTextSummarizationPipeline

summarizer = LongTextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary_text = summarizer.summarize(transcript)
print(summary_text)
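
Summarization models accept a limited amount of input, so a common way to handle long text is to split it into chunks, summarize each chunk, and join the results. The sketch below shows only the splitting step. It is an illustration of the general idea and not necessarily how `LongTextSummarizationPipeline` works internally; `split_into_chunks` and the `max_words` limit are assumptions made for this example.

```python
def split_into_chunks(text: str, max_words: int = 400) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Stand-in for a long transcript: 1000 repeated words.
long_text = "word " * 1000
chunks = split_into_chunks(long_text, max_words=400)
print(len(chunks))  # 3
print([len(c.split()) for c in chunks])  # [400, 400, 200]
```

Splitting on word counts is a rough proxy for the model's token limit; a tokenizer-based split would be more precise.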

### 💬 Speaker Diarization

This section demonstrates speaker diarization, which distinguishes between the different speakers in an audio clip.

In [None]:
from whisperplus import (
    ASRDiarizationPipeline,
    download_and_convert_to_mp3,
    format_speech_to_dialogue,
)

audio_path = download_and_convert_to_mp3("https://www.youtube.com/watch?v=mRB14sFHw2E")

device = "cuda"  # or "cpu" / "mps", depending on your hardware
pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-large-v3",
    diarizer_model="pyannote/speaker-diarization",
    use_auth_token=False,
    chunk_length_s=30,
    device=device,
)

output_text = pipeline(audio_path, num_speakers=2, min_speaker=1, max_speaker=2)
dialogue = format_speech_to_dialogue(output_text)
print(dialogue)
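
The cell above hard-codes `device = "cuda"`, which fails on machines without an NVIDIA GPU. As a hedged sketch, the device can be chosen at runtime by asking PyTorch what is available. `pick_device` is a hypothetical helper introduced here, not part of WhisperPlus, and it falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query; fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The resulting string can then be passed as the `device` argument in place of the hard-coded value.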

### ⭐ RAG - Chat with Video (LanceDB)

This part covers the 'Chat with Video' feature using LanceDB. It demonstrates how to interact with a video transcript using a chat interface.

In [None]:
from whisperplus import ChatWithVideo

chat = ChatWithVideo(
    input_file="transcript.txt",  # text file containing the video transcript
    llm_model_name="TheBloke/Mistral-7B-v0.1-GGUF",
    llm_model_file="mistral-7b-v0.1.Q4_K_M.gguf",
    llm_model_type="mistral",
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
)

query = "What is this video about?"
response = chat.run_query(query)
print(response)
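
Retrieval-augmented generation (RAG) first finds the transcript passages most relevant to the question and then hands them to the language model. The toy sketch below illustrates only the retrieval step, using word-count cosine similarity in place of a real embedding model and vector database. It is not how `ChatWithVideo` is implemented; all function names and sample passages are made up for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Made-up transcript passages standing in for a real video transcript.
chunks = [
    "The speaker introduces the history of the guitar.",
    "Next, the video explains how to tune the strings.",
    "Finally, the host thanks the viewers for watching.",
]
print(retrieve("How do I tune the strings?", chunks))
```

A real pipeline would replace `bag_of_words` with dense embeddings (such as those from the `sentence-transformers` model configured above) and store them in a vector database like LanceDB.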

### 🌠 RAG - Chat with Video (AutoLLM)

This section demonstrates the 'Chat with Video' feature using AutoLLM. It lets you query a video's content through a chat interface, with a configurable LLM, embedding model, and prompts.

In [None]:
from whisperplus.pipelines.autollm_chatbot import AutoLLMChatWithVideo

# service_context_params
system_prompt = """
You are a friendly AI assistant that helps users find the most relevant and accurate answers
to their questions based on the documents you have access to.
When answering the questions, mostly rely on the info in documents.
"""
query_wrapper_prompt = """
The document information is below.
---------------------
{context_str}
---------------------
Using the document information and mostly relying on it,
answer the query.
Query: {query_str}
Answer:
"""

chat = AutoLLMChatWithVideo(
    input_file="audio.mp3",
    openai_key="YOUR_OPENAI_KEY",
    huggingface_key="YOUR_HUGGINGFACE_KEY",
    llm_model="gpt-3.5-turbo",
    llm_max_tokens="256",
    llm_temperature="0.1",
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    embed_model="huggingface/BAAI/bge-large-zh",
)

query = "What is this video about?"
response = chat.run_query(query)
print(response)
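
The `query_wrapper_prompt` above contains two placeholders, `{context_str}` and `{query_str}`. Presumably the underlying framework fills these with the retrieved transcript passages and the user's question before calling the LLM; that substitution step is an assumption about the library, not something shown in this notebook. The sketch below mimics it with plain `str.format` and made-up values so you can see what the final prompt looks like.

```python
template = """
The document information is below.
---------------------
{context_str}
---------------------
Using the document information and mostly relying on it,
answer the query.
Query: {query_str}
Answer:
"""

# Made-up context and question, used only to show the filled-in prompt.
filled = template.format(
    context_str="The video is a tutorial on tuning a guitar.",
    query_str="What is this video about?",
)
print(filled)
```

Previewing the filled prompt this way is a cheap check that the template wording reads well before spending API calls on it.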

### 🎙️ Text to Speech

Finally, this section shows how to generate spoken audio from text using WhisperPlus.

In [None]:
from whisperplus import TextToSpeechPipeline

tts = TextToSpeechPipeline(model_id="suno/bark")
audio = tts(text="Hello World", voice_preset="v2/en_speaker_6")
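
The cell above keeps the generated audio in memory. The notebook does not show what type `tts(...)` returns, so the following sketch makes loud assumptions: that the audio can be treated as a sequence of mono float samples in the range [-1.0, 1.0], and that the sample rate is 24 kHz. Under those assumptions, the standard library is enough to write a WAV file. `write_wav` is a hypothetical helper, and a synthetic tone stands in for real TTS output.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def write_wav(samples, path, sample_rate=24_000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in signal: one second of a 440 Hz tone instead of real TTS output.
rate = 24_000  # assumed sample rate
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
out_path = Path(tempfile.gettempdir()) / "tts_example.wav"
write_wav(tone, out_path, sample_rate=rate)
print(out_path)
```

If the pipeline returns a different structure or sample rate, the conversion step would need to be adapted accordingly.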