First, we need to install the Hugging Face Transformers library, which gives us access to state-of-the-art speech and language models. If you haven’t installed it yet, run:

```
pip install transformers torch
```

Now, let’s import the necessary libraries and set up our device for computation. If you have a GPU, the code will automatically use it for better performance!

## Decoding an MP3 file into a NumPy array requires FFmpeg. Follow the steps in this link to install it: [How to install FFmpeg on Windows](https://www.wikihow.com/Install-FFmpeg-on-Windows)

In [1]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
from IPython.display import clear_output


device = 'cuda' if torch.cuda.is_available() else 'cpu'
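
Since the speech pipeline relies on FFmpeg to decode audio files, it can help to confirm that the `ffmpeg` executable is reachable before going further. The following is a minimal sketch; `ffmpeg_available` is a hypothetical helper name introduced here for illustration, not part of any library.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on the system PATH.

    Hypothetical helper for this tutorial; it only checks that the binary
    exists, not that it was built with MP3 support.
    """
    return shutil.which(executable) is not None


if not ffmpeg_available():
    print("FFmpeg was not found on PATH; install it before running the cells below.")
```

If the message is printed, install FFmpeg using the link above and restart the notebook kernel so the updated PATH is picked up.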

# SPEECH-TO-TEXT WITH WHISPER

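Next, we’ll use OpenAI’s Whisper model to transcribe an audio file. We’re using the whisper-small model for efficiency, but you can switch to whisper-large for better accuracy. The `generate_kwargs` below tell Whisper that the spoken language is Hindi and that the output should be translated into English.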

In [2]:
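pipe = pipeline(
    "automatic-speech-recognition",
    "openai/whisper-small",
    chunk_length_s=30,       # Split long audio into 30-second chunks
    stride_length_s=5,       # Overlap between neighbouring chunks, in seconds
    return_timestamps=True,  # Return a timestamp range for each text segment
    device=device,
    generate_kwargs={"language": "Hindi", "task": "translate"},
)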

Device set to use cuda
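
To get a feel for what `chunk_length_s=30` and `stride_length_s=5` mean for a long recording, here is a simplified illustration of overlapping windows. It assumes the stride is applied on both sides of each chunk, so consecutive windows advance by `chunk_length_s - 2 * stride_length_s` seconds. `chunk_windows` is a hypothetical helper written for this explanation, not the library's actual implementation.

```python
def chunk_windows(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) times in seconds of overlapping audio windows.

    Simplified sketch: assumes the same stride on the left and right of
    every chunk, so each new window starts chunk_s - 2 * stride_s seconds
    after the previous one.
    """
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_s must be larger than twice stride_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:  # The last window reaches the end of the audio
            break
        start += step
    return windows


# A 70-second recording is covered by three overlapping windows
print(chunk_windows(70))
```

The overlap gives the model some context on both sides of each chunk boundary, which tends to reduce words being cut off where chunks meet.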


# CONVERTING MP4 (OR ANOTHER VIDEO FORMAT) TO MP3

In [9]:
import subprocess

def convert_mp4_to_mp3(input_file, output_file):
    """Convert an MP4 file to MP3 using ffmpeg."""
    try:
        command = [
            "ffmpeg",
            "-i", input_file,    # Input file
            "-vn",               # No video
            "-acodec", "libmp3lame",  # MP3 codec
            "-b:a", "192k",      # Audio bitrate
            output_file          # Output file
        ]
        subprocess.run(command, check=True)
        print(f"Conversion successful: {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Error during conversion: {e}")

# Example usage
convert_mp4_to_mp3("2025-03-15 10-36-27.mp4", "output.mp3")


ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
Conversion successful: output.mp3


size=   17682kB time=00:12:34.34 bitrate= 192.0kbits/s speed= 145x    
video:0kB audio:17681kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.003999%
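
As an optional variation, the audio can be extracted as 16 kHz mono WAV instead of MP3. Whisper works on 16 kHz audio and the pipeline resamples input for you, so this step is not required; it is only a sketch of how the FFmpeg arguments would change. `build_wav_command` and the file names are hypothetical and introduced here for illustration.

```python
def build_wav_command(input_file, output_file, sample_rate=16000):
    """Build an ffmpeg command that extracts mono audio at the given sample rate.

    Hypothetical helper: it only assembles the argument list, so it can be
    inspected before anything is executed.
    """
    return [
        "ffmpeg",
        "-y",                     # Overwrite the output file if it exists
        "-i", input_file,         # Input file
        "-vn",                    # No video
        "-ac", "1",               # Downmix to a single (mono) channel
        "-ar", str(sample_rate),  # Audio sample rate in Hz
        output_file,              # Output file
    ]


command = build_wav_command("input.mp4", "output.wav")
print(" ".join(command))

# To actually run the conversion (requires FFmpeg on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Building the argument list separately from running it makes the command easy to inspect or log before it is passed to `subprocess.run`.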


Now, let’s process an audio file. The model will transcribe it and translate it into English.

In [10]:
transcription = pipe("output.mp3")
# Once the transcription is complete, format the text with timestamps for better readability.
formatted_lyrics = ""
for line in transcription['chunks']:
    text = line["text"]
    ts = line["timestamp"]
    formatted_lyrics += f"{ts}-->{text}\n"

print(formatted_lyrics.strip())



(0.0, 311.0)--> Before using the code, I would like you to come to this video and install the necessary packages and set up your environment. You can either use python's virtual environment or condi environment. You have to install all this library and after that we should be good to go. You can use all this Jupyter notebook freely. Coming to the code of meeting summarizer. for the whisper model. Now coming to the code you will need to install ffmpg library. If you do not have it already, it is used to operate on media files, video, audio and all that. I have mentioned the link here. You can follow this link to install it properly. With that said, let's start. From the transformer library, I am going to use the pipeline method auto model for causal language model, which is the large language model that we are going to use. And then the tokenizer method to tokenize our text. I am also using the torch some clear output and this one is not really important. So I am going to also specify m
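
The raw `(start, end)` tuples are given in seconds, which gets hard to read for longer recordings. Below is a minimal sketch of an alternative formatter that prints `HH:MM:SS` ranges instead. `format_seconds` and `format_chunks` are hypothetical helper names, and the sketch assumes each chunk is a dict with `"timestamp"` and `"text"` keys, as in the loop above. It also allows for a missing timestamp, since the end time of the final chunk can come back as `None`.

```python
def format_seconds(seconds):
    """Convert a time in seconds to HH:MM:SS; None becomes a placeholder."""
    if seconds is None:
        return "??:??:??"
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_chunks(chunks):
    """Format pipeline chunks as '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_seconds(start)} - {format_seconds(end)}] {text}")
    return "\n".join(lines)


# Example with a hand-written chunk in the same shape as transcription["chunks"]
example_chunks = [{"timestamp": (0.0, 311.0), "text": " Before using the code..."}]
print(format_chunks(example_chunks))
```

With the real output, the call would be `format_chunks(transcription["chunks"])`.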

In [5]:
# Save the transcription to a text file so we can use it later.
with open("transcription.txt", "w", encoding="utf-8") as file:
    file.write(formatted_lyrics.strip())

print("Transcription saved to transcription.txt")

Transcription saved to transcription.txt


# SUMMARIZING WITH LLAMA
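Now, let’s take it a step further! We’ll use Meta’s Llama model to summarize the transcript. For this, we need the Llama-3.2-3B-Instruct model. It is a gated model on the Hugging Face Hub, so you need to accept Meta’s license on the model page and authenticate (for example with `huggingface-cli login`) before it can be downloaded.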

In [6]:
DEFAULT_MODEL = "meta-llama/Llama-3.2-3B-Instruct"


model = AutoModelForCausalLM.from_pretrained(
    DEFAULT_MODEL,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    device_map=device,
)

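# use_safetensors only applies to model weights, so the tokenizer does not need it
tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL)
# Llama has no dedicated padding token, so reuse the end-of-sequence token
tokenizer.pad_token_id = tokenizer.eos_token_id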

We’ll define a conversation prompt instructing LLaMA to summarize the meeting transcript in simple English.

In [7]:
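conversation = [
    {"role": "system", "content": "Write the minutes of this meeting transcript in simple and precise English."},
    {"role": "user", "content": formatted_lyrics.strip()},
]

# add_generation_prompt=True appends the assistant header, so the model starts
# its reply directly instead of generating the "assistant" role label itself.
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
# The chat template already inserts the beginning-of-text token, so do not add it again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)
# print(prompt)

with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=1000
    )

# Decode only the newly generated tokens, skipping the prompt.
processed_text = tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)

print(processed_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
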
**Meeting Minutes**

**Attendees:** N/A

**Date:** N/A

**Topic:** India's Economic Growth and Comparison with China

**Key Points:**

1. The speaker discusses the economic growth of India and China, stating that India's growth has been remarkable despite facing severe odds, including a lower per capita income when India opened up in 1991.
2. The speaker highlights the differences between India and China's economic models, noting that China's model has frailties that India's model does not have.
3. The speaker praises China's achievements in technology, but notes that India's achievements in IT services and cultural identity are also significant.
4. The speaker questions why China was able to assimilate with the West so quickly, and attributes this to the Cultural Revolution and the Chinese Communist Party's decision to westernize.
5. The speaker emphasizes the importance of holding onto one's identity and cultural heritage, citing this as a key factor in a country's success