In [None]:
# pip install -U sentence-transformers

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from IPython.display import clear_output
import time
import PyPDF2
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

DEFAULT_MODEL = "meta-llama/Llama-3.2-3B-Instruct"


model = AutoModelForCausalLM.from_pretrained(
    DEFAULT_MODEL,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    device_map=device,
)

tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL)
# Use the EOS token as the padding token.
tokenizer.pad_token_id = tokenizer.eos_token_id
# 128001 is the <|end_of_text|> token id in the Llama 3 vocabulary.
model.generation_config.pad_token_id = 128001

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
video_transcription = '''0:00:00.000,0:00:05.000
Have you wished you could turn long messy meeting recordings into concise structured summaries

0:00:05.000,0:00:10.000
 automatically. Today, I'm going to show you exactly how to do that using AI. The best part,

0:00:10.000,0:00:15.000
 it's fully automated. Whether your meeting is in English, Hindi, French or any other language,

0:00:15.000,0:00:21.000
 this method can transcribe and summarize effortlessly. We'll be leveraging two powerful

0:00:21.000,0:00:25.000
 models for this task, whisper and automatic speech recognition model that can

0:00:25.000,0:00:31.000
 transcribe meetings in any language. And another is Lama, a large language modeled by Meta, which is

0:00:31.000,0:00:37.000
 some, which will summarize the transcript into clear and actionable meetings minutes. If you're

0:00:37.000,0:00:42.000
 new to whisper or Lama, no worries, I've already made detailed videos explaining both. So check them

0:00:42.000,0:00:48.000
 out in my playlist or in the I button above and the description below. With that said let's jump right into the video and build the application.

0:00:48.000,0:00:52.000
 Before we dive into the code I want to make sure you have everything set up correctly.

0:00:53.000,0:00:58.000
 To follow along smoothly install the necessary packages and set up your environment first.

0:00:58.000,0:01:03.000
 You can either use Python's virtual environment or Konda. Once you've installed all the required

0:01:03.000,0:01:05.000
 libraries you're good to go.

0:01:05.000,0:01:08.000
 All the required details are present in my repository,

0:01:08.000,0:01:09.000
 which I'll link in the description below.

0:01:09.000,0:01:12.000
 Now let's talk about the meeting summarizer

0:01:12.000,0:01:14.000
 using the whisper model.

0:01:14.000,0:01:16.000
 Now let's talk about the meeting summarizer

0:01:16.000,0:01:17.000
 using the whisper model.

0:01:17.000,0:01:18.000
 Before running the code,

0:01:18.000,0:01:21.000
 make sure you have the FFMPEG installed.

0:01:21.000,0:01:23.000
 If you haven't installed it, don't worry.

0:01:23.000,0:01:25.000
 I provided a link here that will

0:01:25.000,0:01:31.000
 work you through the installation process. It's a little tedious if you are using Windows, but for

0:01:32.000,0:01:38.000
 Mac or Linux, it's pretty easy. It basically helps us to handle media files like audio, video.

0:01:39.000,0:01:51.000
 Once that setup, we are ready to do. We'll be working with the Hugging Phase Transformer library. In particular, we'll use the Pipeline method with an audio model for causal language modeling.

0:01:51.000,0:01:57.000
 Essentially a large language model will also use a tokenizer to process our text and specify

0:01:57.000,0:01:58.000
 our device.

0:01:58.000,0:02:01.000
 Here, we specify our device.

0:02:01.000,0:02:04.000
 Now we'll come to converting speech to text with the whisper model.

0:02:04.000,0:02:09.000
 Whether it input is audio or video, the first step is to transcribe. For this, we'll use OpenAI's

0:02:09.000,0:02:16.000
 whisper model. The more if you have a video file, let's say MP4 or AVI, you can run this

0:02:16.000,0:02:22.000
 particular lineup code to convert it into an audio file from a video file. This is the

0:02:22.000,0:02:25.000
 function convert MP4 to MP3. Now let's break down the

0:02:25.000,0:02:30.000
 key parameters of the wishbox. First we initialize the pipeline with automatic speech recognition

0:02:30.000,0:02:38.000
 as task type. Then we select the model variant we want to use there are whisper large whisper

0:02:38.000,0:02:45.000
 medium whisper small I'm using the small medium small version. Since audio files can be pretty large, we're going

0:02:45.000,0:02:50.000
 to divide it into 30 second chunks and process them chunk by chunk. By setting return time

0:02:50.000,0:02:57.000
 stamp equal to 2. We ensure that each transcribed segment includes a timestamp making it easy

0:02:57.000,0:03:01.000
 to follow along. And this final part is kind of important, whereas we do the language selection.

0:03:02.000,0:03:05.000
 We spoke either transcribe the original language or translate

0:03:05.000,0:03:13.000
 it into English. Here I am choosing the translation and I am choosing the language of my source audio

0:03:13.000,0:03:20.000
 as Hindi. But the source audio has both Hindi and English present in it. And the audio, the Usper

0:03:20.000,0:03:25.000
 model is smart enough on itself to transcribe, to translate when there is Hindi

0:03:25.000,0:03:30.000
 and just keep it keep it in English if the audio is in English. Here you can use French,

0:03:30.000,0:03:36.000
 Spanish, or any other language among like 99 different languages you can use in this

0:03:36.000,0:03:42.000
 case. Once the pipeline is initialized, all we have to do is pass in the path to our MP3

0:03:42.000,0:03:45.000
 file and the model takes care of the risks. For example,

0:03:45.000,0:03:49.000
 I choose this video to transcribe. I don't think China grew at the speed that it grew.

0:03:49.000,0:03:56.000
 I think a lot of data is printed on the page. So I'm pretty sure they didn't grow at the rate that.

0:03:56.000,0:04:02.000
 So you see this particular video is spoken both in Hindi as well as English. But the

0:04:02.000,0:04:06.000
 whisper model is smart enough to transcribe Hindi to English as well

0:04:06.000,0:04:11.000
 and just let it be English when they are speaking English and you see here is our transcription

0:04:11.000,0:04:18.000
 see do you think India will grow at the speed China grow and so now as all our audio is in one

0:04:18.000,0:04:26.000
 particular language it becomes very easy to create create a summarization of the meeting or minutes of the meeting,

0:04:26.000,0:04:30.000
 whatever you like. And for that, we're going to use a large language model. For this, I'm

0:04:30.000,0:04:36.000
 going to use the Lama 3.2, a three billion parameter model. If you have a lower end GPU,

0:04:36.000,0:04:43.000
 you can opt for one billion parameter variant as well. And even you can make it more optimized

0:04:43.000,0:04:47.000
 by using quantization techniques. And here how I created

0:04:47.000,0:04:52.000
 the conversation setup. In the prompt, I ask write the minutes of this meeting transcript in simple

0:04:52.000,0:04:59.000
 and precise English. And in the, this is my system prompt. And in the user query, I gave all the

0:04:59.000,0:05:11.000
 transcription that I just created from the whisper model. Then I use my tokenizer to convert this dictionary of conversation into actual prompt with different

0:05:11.000,0:05:12.000
 tokens.

0:05:12.000,0:05:17.000
 If you are not sure how, what do I mean by that?

0:05:17.000,0:05:22.000
 Please check out my previous video where I went deep into how the Lama model works and

0:05:22.000,0:05:24.000
 different tokens of the Lama model.

0:05:24.000,0:05:29.000
 After that I use the tokenizer to convert these words or texts into numbers that I can

0:05:29.000,0:05:32.000
 easily fit into my LLM.

0:05:32.000,0:05:36.000
 After that, I use the generate method to create the output.

0:05:36.000,0:05:42.000
 I have set maximum tokens to 1000 to keep the summary a little bit concise.

0:05:42.000,0:05:44.000
 Here do sample equal to true.

0:05:44.000,0:05:46.000
 It introduces stochasticity or a

0:05:46.000,0:05:50.000
 probabilistic nature, meaning each run might produce slightly different results by sampling

0:05:50.000,0:05:58.000
 the most verbal rewards. Finally, I decode the output back into readable text, skipping

0:05:58.000,0:06:04.000
 any special token such as like a startup sentence token or end of sentence token. These things

0:06:04.000,0:06:05.000
 are not necessary.

0:06:05.000,0:06:07.000
 And here you see the final output.

0:06:07.000,0:06:09.000
 See, I put them in the markdown file

0:06:09.000,0:06:12.000
 and you see they look pretty well.

0:06:12.000,0:06:13.000
 The speaker discusses the economic growth of India

0:06:13.000,0:06:17.000
 and China starting stating that India's growth

0:06:17.000,0:06:23.000
And it pretty well summarizes the actual meeting,
has been remarkable.

0:06:23.000,0:06:25.000
 even though it's in Hindi and English.

0:06:25.000,0:06:30.000
 The same approach can be used to summarize any YouTube video as well and with this you

0:06:30.000,0:06:33.000
 can automatically generate summaries in a few clicks.

0:06:33.000,0:06:35.000
 And yeah, that's a wrap for today's video.

0:06:35.000,0:06:39.000
 If you found this tutorial helpful, give it a thumbs up and subscribe to my channel for

0:06:39.000,0:06:44.000
 more deep dives into large language models and computer vision projects.

0:06:44.000,0:06:46.000
 Till then, stay curious and keep explaining.'''

In [11]:

conversation = [
    {"role": "system", "content": f'''**Prompt:**  

"Given the subtitles of a YouTube video, generate a structured list of timestamps in the format MM:SS along with their corresponding topics. Identify key moments in the video where the subject matter changes or an important point is introduced. Ensure that each timestamp represents a meaningful transition or highlight. Format the output as follows:  

MM:SS - Topic  

Example:  
00:30 - Introduction to the Video  
02:15 - Explanation of Key Concept  
05:45 - Demonstration of the Method  

Ensure that the topics are concise and accurately reflect the content of each section. There should be at most 5 topics, starting from the beginning and covering the entire duration. It should strictly follow MM:SS format. DO NOT USE MARKDOWN FORMAT'''},
    {"role": "user", "content": video_transcription},
]
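# Render the conversation into the model's chat format as a plain string,
# then tokenize it into tensors on the target device.
prompt = tokenizer.apply_chat_template(conversation, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print(inputs.input_ids.shape)

with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=True,  # sample tokens, so each run can differ
        max_new_tokens=256
    )


# Drop the prompt tokens plus the first 3 generated tokens. Because the
# template was rendered without add_generation_prompt=True, those 3 tokens
# appear to be the assistant header the model writes before its answer.
processed_text = tokenizer.decode(output[0][len(inputs.input_ids[0])+3:], skip_special_tokens=True)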

print(processed_text)

torch.Size([1, 3014])


Here is the list of timestamps with corresponding topics:

00:00:00.000 - Introduction to the Video
00:00:05.000 - Introduction to the Video (continued)
00:01:12.000 - Meeting Summarizer using Whisper Model
00:04:30.000 - Meeting Summarizer using Lama Model
00:05:42.000 - Summarization of Meeting using Large Language Model
00:06:07.000 - Final Output of Summarization


**Automating Meeting Summarization with AI: A Step-by-Step Guide**

Are you tired of listening to long, messy meeting recordings and wishing you could condense them into concise, structured summaries? With the help of AI, you can now do just that. In this tutorial, we'll show you how to use two powerful models, Whisper and Llama, to transcribe and summarize meetings in any language.

**Getting Started**

Before we dive into the code, make sure you have the necessary packages and environment set up. You can use either Python's virtual environments or Conda. Once you've installed all the required libraries, you're good to go. The setup details are in the accompanying repository, which is linked in the video description.

**Using Whisper for Automatic Speech Recognition**

We'll start with the Whisper model, an automatic speech recognition model that can transcribe meetings across the roughly 99 languages it supports. Before running the code, make sure you have FFmpeg installed, since it handles media files such as audio and video. Installation is a little tedious on Windows but straightforward on macOS or Linux. We'll work with the Hugging Face Transformers library: the `pipeline` function for speech recognition, `AutoModelForCausalLM` for the large language model, and a tokenizer to process text. We also specify the device to run on.
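
Because the transcription step depends on FFmpeg, a quick check up front can save a confusing error later. The snippet below is only a sketch: `ffmpeg_available` is a hypothetical helper name, and it assumes the executable is called `ffmpeg` and can be found on `PATH`.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("FFmpeg found at:", shutil.which("ffmpeg"))
    else:
        print("FFmpeg not found; install it before running the transcription step.")
```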

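To convert speech to text, we'll use OpenAI's Whisper model. If you have a video file such as an MP4 or AVI, run the `convert_mp4_to_mp3` function first to extract the audio. We then initialize the pipeline with automatic speech recognition as the task, pick a model variant (large, medium, or small; the video uses small), split long audio into 30-second chunks, and enable timestamps so that each transcribed segment can be located. Finally, we choose between transcribing in the original language and translating into English. For example, the sample recording mixes Hindi and English: with the source language set to Hindi and translation enabled, Whisper translates the Hindi speech into English and leaves the English speech as it is.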
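
The video names a `convert_mp4_to_mp3` function, but the transcript does not show its body. One plausible way to write it, assuming FFmpeg is on `PATH`, is to call the `ffmpeg` command line from Python. The flag choices (`-vn` to drop the video stream, `-y` to overwrite) and the helper `build_ffmpeg_command` are assumptions made for this sketch, not the author's code.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path: str, audio_path: str) -> list[str]:
    """Build the FFmpeg command that extracts the audio track."""
    return [
        "ffmpeg",
        "-y",               # overwrite the output file if it already exists
        "-i", video_path,   # input video (e.g. MP4 or AVI)
        "-vn",              # drop the video stream, keep audio only
        audio_path,         # output format is inferred from the extension
    ]


def convert_mp4_to_mp3(video_path: str) -> str:
    """Write an MP3 next to the video file and return the MP3 path."""
    audio_path = str(Path(video_path).with_suffix(".mp3"))
    # check=True raises CalledProcessError if FFmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
    return audio_path
```

Keeping the command construction in its own function makes it easy to inspect or log the exact FFmpeg invocation without touching any files.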

**Using Llama for Summarization**

For summarization, we'll use Llama 3.2, a 3 billion parameter model from Meta. If you have a lower-end GPU, you can opt for the 1 billion parameter variant, and quantization can reduce memory use further. To use the model, we'll create a conversation with a system prompt and a user query. The system prompt asks the model to write the minutes of the meeting transcript in simple and precise English, while the user query contains the transcription produced by Whisper.

Here's a sample conversation setup:

```
system prompt: Write the minutes of this meeting transcript in simple and precise English.
user query: <full transcription produced by Whisper>
```
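
To make the step from conversation to prompt less abstract, here is a simplified illustration of how a list of role/content messages becomes one prompt string. `render_llama3_prompt` is a hypothetical stand-in for `tokenizer.apply_chat_template`. It follows the general Llama 3 header layout, but the real Llama 3.2 template adds extra system text, so treat this as a sketch rather than the exact output.

```python
def build_conversation(transcription: str) -> list[dict]:
    """Pair the system prompt with the Whisper transcription as the user turn."""
    return [
        {"role": "system",
         "content": "Write the minutes of this meeting transcript in simple and precise English."},
        {"role": "user", "content": transcription},
    ]


def render_llama3_prompt(conversation: list[dict], add_generation_prompt: bool = True) -> str:
    """Simplified, hand-written approximation of the Llama 3 chat layout."""
    parts = ["<|begin_of_text|>"]
    for message in conversation:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # Open the assistant turn so the model starts writing the answer directly.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    demo = build_conversation("(transcription text goes here)")
    print(render_llama3_prompt(demo))
```

In practice you would call `tokenizer.apply_chat_template` rather than formatting the string by hand, since the tokenizer ships the template that matches the model.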

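We'll then use the tokenizer to convert the conversation into a prompt string with the model's special tokens, and then into token IDs that can be fed into the model. After that, we'll use the `generate` method to create the output, capping it at 1000 new tokens to keep the summary concise. Setting `do_sample=True` makes generation stochastic, so each run may produce a slightly different summary. Finally, we'll decode the output back into readable text, skipping special tokens such as the start-of-sequence and end-of-sequence markers.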
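
The video's `generate` call uses `do_sample=True`, which is why two runs over the same transcript can produce different summaries. The toy example below contrasts greedy selection with sampling over a made-up next-token distribution. The token names and probabilities are invented purely for illustration and are unrelated to the real model's vocabulary.

```python
import random

# Invented next-token probabilities, for illustration only.
next_token_probs = {"growth": 0.6, "economy": 0.3, "meeting": 0.1}


def pick_greedy(probs: dict[str, float]) -> str:
    """Always return the single most probable token (deterministic)."""
    return max(probs, key=probs.get)


def pick_sampled(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print("greedy :", [pick_greedy(next_token_probs) for _ in range(5)])
    print("sampled:", [pick_sampled(next_token_probs, rng) for _ in range(5)])
```

Greedy selection repeats the same choice every time, while sampling occasionally picks a less likely token, which is the source of the run-to-run variation.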

**Putting it All Together**

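Here's a minimal sketch of the complete pipeline using the Hugging Face Transformers library (model IDs and parameter values follow the description above; the code shown in the video may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize Whisper through the Transformers speech recognition pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,       # process long audio in 30-second chunks
    return_timestamps=True,  # keep a timestamp for each segment
    device=device,
)

# Initialize the Llama model and its tokenizer
LLAMA_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
llama_tokenizer = AutoTokenizer.from_pretrained(LLAMA_MODEL)
llama_model = AutoModelForCausalLM.from_pretrained(
    LLAMA_MODEL, torch_dtype=torch.bfloat16, device_map=device
)

# Convert speech to text using Whisper, translating Hindi speech into English
audio_file = "path_to_audio_file.mp3"
result = asr(audio_file, generate_kwargs={"task": "translate", "language": "hindi"})
transcription = result["text"]

# Summarize the transcription using Llama
conversation = [
    {"role": "system",
     "content": "Write the minutes of this meeting transcript in simple and precise English."},
    {"role": "user", "content": transcription},
]
prompt = llama_tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
# The chat template already adds the begin-of-text token, so skip adding it again
inputs = llama_tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)
with torch.no_grad():
    output = llama_model.generate(**inputs, do_sample=True, max_new_tokens=1000)

# Decode only the newly generated tokens back into readable text
new_tokens = output[0][inputs.input_ids.shape[1]:]
output_text = llama_tokenizer.decode(new_tokens, skip_special_tokens=True)
print(output_text)
```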

**Conclusion**

With this tutorial, you can now automatically generate summaries of meetings in just a few clicks. The same approach can be used to summarize any YouTube video, and we hope you found this tutorial helpful. Give it a thumbs up and subscribe to our channel for more deep dives into large language models and computer vision projects.