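First, we need to install the Hugging Face Transformers library, which gives us access to state-of-the-art speech and language models, along with PyTorch and Accelerate (needed later because we load the language model with `device_map`). If you haven’t installed them yet, run:

```
pip install transformers torch accelerate
```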

Now, let’s import the necessary libraries and set up our device for computation. If you have a GPU, the code will automatically use it for better performance!

## Decoding an MP3 file into a NumPy array requires FFmpeg. Follow the steps in this link to install it: [How to install FFmpeg on Windows](https://www.wikihow.com/Install-FFmpeg-on-Windows)
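
Before running the transcription cells, it can help to confirm that FFmpeg is actually reachable from Python. The snippet below is a minimal sketch that only checks whether an `ffmpeg` executable is on the `PATH`; the helper name `ffmpeg_available` is made up for this example and is not part of Transformers.

```python
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("FFmpeg found at:", shutil.which("ffmpeg"))
else:
    print("FFmpeg not found; install it before transcribing MP3 files.")
```

If the check fails right after installing FFmpeg, restarting the notebook kernel (or the terminal) is usually needed so the updated `PATH` is picked up.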

In [1]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
from IPython.display import clear_output


# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# SPEECH-TO-TEXT WITH WHISPER

Next, we’ll use OpenAI’s Whisper model to transcribe an audio file. We’re using the whisper-small model for efficiency, but you can switch to whisper-large for better accuracy.

In [2]:
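pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,       # split long audio into 30-second windows
    stride_length_s=5,       # overlap neighbouring windows by 5 seconds on each side
    return_timestamps=True,  # return a (start, end) timestamp with each text segment
    device=device,
    # Treat the audio as Hindi and translate it into English
    generate_kwargs={"language": "Hindi", "task": "translate"},
)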

Device set to use cuda
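
To make the effect of `chunk_length_s=30` and `stride_length_s=5` more concrete, here is a simplified sketch of how overlapping windows could be laid out over a long recording. It assumes the stride is applied on both sides of each chunk, so consecutive windows advance by the chunk length minus twice the stride. The function `chunk_windows` is a hypothetical illustration, not the pipeline's actual implementation, which works on audio samples and handles edge cases differently.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) windows in seconds that cover `duration_s` with overlap.

    Simplified illustration only: assumes the same stride on the left and right
    of every chunk, so each new window starts `chunk - 2 * stride` seconds later.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than twice stride_length_s")

    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window already reaches the end of the audio
        start += step
    return windows


# A made-up 70-second recording
windows = chunk_windows(70.0)
print(windows)  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

The overlap gives the model some context on both sides of each window, which is what lets the pipeline stitch the per-chunk transcripts back together.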


Now, let’s process an audio file. The model will transcribe and translate it into English.

In [3]:
transcription = pipe("hindi_english_podcast.mp3")
# Once the transcription is complete, format the text with timestamps for better readability.
formatted_lyrics = ""
for line in transcription['chunks']:
    text = line["text"]
    ts = line["timestamp"]
    formatted_lyrics += f"{ts}-->{text}\n"

print(formatted_lyrics.strip())

You have passed task=translate, but also have set `forced_decoder_ids` to [[1, None], [2, 50359]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=translate.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


(0.0, 5.5)--> Do you think India will grow at the speed with China growth?
(5.5, 2.0)--> I don't think China grew at the speed that it grew. I think it has a lot of data. I am pretty sure they didn't grow at the rate that they grew because had they done so then they wouldn't have had the range of social and economic problems they are currently facing. They say that in the 70s, India and China were the same.
(2.0, 5.0)--> And now today, look at China's economy versus India's economy.
(5.0, 46.48)--> So, look at China's economy versus India's economy. So, people have definitely grown up. People say a lot. What people don't realise is that the British ravaged India to a greater extent than they ravaged East Asia, Southeast Asia.
(51.48, 52.92)--> And therefore the journey that India had to traverse through the 50s and 60s was a far more demanding journey.
(53.56, 56.84)--> It is through the Deng shopping opened up China in 78-79.
(56.84, 60.4)--> And obviously they benefited from the libe
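
The raw `(start, end)` tuples are a little hard to scan. As an optional alternative, the sketch below renders each chunk with `mm:ss` timestamps. It assumes the same `chunks` structure used above (a list of dicts with `"timestamp"` and `"text"` keys) and also guards against a missing end time, which I believe can occur for the final chunk. The helper names and the second sample chunk are made up for illustration, and the code does not try to repair out-of-order ranges such as `(5.5, 2.0)`.

```python
def format_timestamp(seconds):
    """Format a time in seconds as mm:ss, or ??:?? when the value is missing."""
    if seconds is None:
        return "??:??"
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_chunks(chunks):
    """Turn pipeline-style chunks into one readable line per chunk."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} - {format_timestamp(end)}] {text}")
    return "\n".join(lines)


# The first chunk is taken from the output above; the second is a made-up example
sample_chunks = [
    {"timestamp": (0.0, 5.5), "text": " Do you think India will grow at the speed with China growth?"},
    {"timestamp": (65.2, None), "text": " A hypothetical final chunk without an end time."},
]

print(format_chunks(sample_chunks))
```

With the real data, the call would be `format_chunks(transcription["chunks"])`.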

In [5]:
# Let’s also save the transcription to a text file so we can use it later!
with open("transcription.txt", "w", encoding="utf-8") as file:
    file.write(formatted_lyrics.strip())

print("Transcription saved to transcription.txt")

Transcription saved to transcription.txt


# SUMMARIZING WITH LLAMA
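Now, let’s take it a step further! We’ll use Meta’s Llama model to summarize the transcript. For this, we need the `meta-llama/Llama-3.2-3B-Instruct` checkpoint. It is a gated model on the Hugging Face Hub, so you need to accept its license and authenticate (for example with `huggingface-cli login`) before downloading it.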

In [6]:
DEFAULT_MODEL = "meta-llama/Llama-3.2-3B-Instruct"


model = AutoModelForCausalLM.from_pretrained(
    DEFAULT_MODEL,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    device_map=device,
)

tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL)
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token_id = tokenizer.eos_token_id

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
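
Loading the weights in `torch.bfloat16` roughly halves the memory needed compared with 32-bit floats. The back-of-the-envelope sketch below shows the arithmetic. It takes "3B" at face value as 3 billion parameters (an assumption; the exact count differs somewhat) and counts only the weights, not activations or the generation cache. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Approximate memory for the model weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


APPROX_PARAMS = 3_000_000_000  # assumption: "3B" taken at face value

bf16_gb = weight_memory_gb(APPROX_PARAMS, 2)  # bfloat16 uses 2 bytes per parameter
fp32_gb = weight_memory_gb(APPROX_PARAMS, 4)  # float32 uses 4 bytes per parameter

print(f"bfloat16: ~{bf16_gb:.0f} GB, float32: ~{fp32_gb:.0f} GB")
```

So a GPU with around 8 GB of memory is a plausible lower bound for this setup, with actual usage depending on the transcript length and generation settings.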

We’ll define a conversation prompt instructing LLaMA to summarize the meeting transcript in simple English.

In [7]:
conversation = [
    {"role": "system", "content": ''' Write the minutes of this meeting transcript in simple and precise English.'''},
    {"role": "user", "content": f'''{formatted_lyrics.strip()}'''},
]

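# add_generation_prompt=True ends the prompt with the assistant header, so the
# model starts writing the summary directly instead of first emitting "assistant"
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
# The chat template already inserts the begin-of-text token, so don't add it a second time
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)
# print(prompt)

with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=1000
    )

# Keep only the newly generated tokens by slicing off the prompt
processed_text = tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)

print(processed_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


**Meeting Minutes**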

**Attendees:** N/A

**Date:** N/A

**Topic:** India's Economic Growth and Comparison with China

**Key Points:**

1. The speaker discusses the economic growth of India and China, stating that India's growth has been remarkable despite facing severe odds, including a lower per capita income when India opened up in 1991.
2. The speaker highlights the differences between India and China's economic models, noting that China's model has frailties that India's model does not have.
3. The speaker praises China's achievements in technology, but notes that India's achievements in IT services and cultural identity are also significant.
4. The speaker questions why China was able to assimilate with the West so quickly, and attributes this to the Cultural Revolution and the Chinese Communist Party's decision to westernize.
5. The speaker emphasizes the importance of holding onto one's identity and cultural heritage, citing this as a key factor in a country's success
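
Since the transcript was saved to `transcription.txt` earlier, the summarization step can also be rerun later without transcribing the audio again. The sketch below shows one way to rebuild the conversation from that file. The helper names `load_transcript` and `build_conversation` are made up for this example, and the sample transcript line is taken from the output above.

```python
from pathlib import Path


def load_transcript(path):
    """Read a previously saved transcript from disk."""
    return Path(path).read_text(encoding="utf-8").strip()


def build_conversation(transcript, instruction):
    """Build the system + user messages expected by a chat template."""
    return [
        {"role": "system", "content": instruction.strip()},
        {"role": "user", "content": transcript.strip()},
    ]


instruction = "Write the minutes of this meeting transcript in simple and precise English."
sample_transcript = "(0.0, 5.5)--> Do you think India will grow at the speed with China growth?"

conversation = build_conversation(sample_transcript, instruction)
print(conversation)
```

In the notebook, `build_conversation(load_transcript("transcription.txt"), instruction)` would produce the same kind of message list that is passed to `tokenizer.apply_chat_template`.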