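First, we need to install the Hugging Face Transformers library, which gives us access to state-of-the-art speech and language models. We also install PyTorch, plus Accelerate, which Transformers needs when a model is loaded with device_map, as we do later. Reading audio files by path also typically requires ffmpeg to be installed on your system. If you haven’t installed the Python packages yet, run:

```
pip install transformers torch accelerate
```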

Now, let’s import the necessary libraries and set up our device for computation. If you have a GPU, the code will automatically use it for better performance!

In [3]:
from transformers import pipeline, AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from IPython.display import clear_output
import time

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# SPEECH-TO-TEXT WITH WHISPER

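Next, we’ll use OpenAI’s Whisper model to transcribe an audio file. We’re using the whisper-small checkpoint for efficiency; a larger checkpoint such as whisper-large-v3 is usually more accurate, but it is slower and needs more memory. Because the recording is in French and we set the task to "translate", Whisper will output English text directly.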

In [4]:
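pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,       # split long audio into 30-second windows
    stride_length_s=5,       # overlap neighbouring windows by 5 seconds on each side
    return_timestamps=True,  # return per-segment timestamps along with the text
    device=device,
    generate_kwargs={"language": "French", "task": "translate"},  # French audio in, English text out
)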

Device set to use cuda
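
To make chunk_length_s and stride_length_s more concrete, here is a small sketch of how a long recording can be cut into overlapping windows. It assumes the stride applies to both sides of each chunk, so consecutive windows advance by chunk_length_s - 2 * stride_length_s seconds. The helper name chunk_windows is made up for illustration, and this is a simplification rather than the library's actual code.

```python
def chunk_windows(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds covering total_s of audio.

    Hypothetical helper: assumes each window overlaps its neighbours by
    stride_s on both sides, so the step between windows is chunk_s - 2 * stride_s.
    """
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_s must be larger than twice stride_s")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, float(total_s))
        windows.append((start, end))
        # Stop once a window reaches the end of the audio.
        if start + chunk_s >= total_s:
            break
        start += step
    return windows


# A 70-second recording with the settings used above (30 s chunks, 5 s stride).
for start, end in chunk_windows(70):
    print(f"{start:5.1f} s -> {end:5.1f} s")
```

The overlap gives the model some context on both sides of every cut, which helps when a word straddles a chunk boundary.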


Now, let’s process an audio file. The model will transcribe it and translate it into English.

In [5]:
transcription = pipe("meeting.flac")

# Once the transcription is complete, we’ll format the text with timestamps for better readability.
formatted_lyrics = ""
for line in transcription['chunks']:
    text = line["text"]
    ts = line["timestamp"]
    formatted_lyrics += f"{ts}-->{text}\n"

print(formatted_lyrics.strip())

You have passed task=translate, but also have set `forced_decoder_ids` to [[1, None], [2, 50359]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=translate.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


(0.0, 2.0)--> Yes, yes.
(87.0, 90.0)--> Ok. How do you analyze with the instantanet data? Well, today, what we're going to do is that the Valiz-Pokéoki we will define for the first stage of use with installed cameras. And the second stage of use, it will be a control point where we just downloaded or posted images on a server. The analysis is not instant, it is a part of the organization that we do not use the same time. Because if we use the analysis to do the control of the four, we start by calling the control point. And there is only this environment with the cameras that are used. In fact, to use the same analysis for your applications, you have to stop the first take-off and put the images. But yes, the analysis is not going to be... Yes, the image analysis is instantaneous, we know. And then we can also do this programming, program the analysis of the light. So that, you are talking about stopping the first and start the second. It also depends on the level of familiarization wi
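
The raw (start, end) tuples are in seconds, and the warning above suggests the final end timestamp may sometimes be missing. As a sketch under that assumption, the helpers below render timestamps as HH:MM:SS and tolerate an end value of None. The names format_seconds and format_chunks and the sample data are invented for illustration; only the "timestamp" and "text" keys come from the loop above.

```python
def format_seconds(seconds):
    """Render a number of seconds as HH:MM:SS, or a placeholder if it is None."""
    if seconds is None:
        return "??:??:??"
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_chunks(chunks):
    """Turn a list of {"timestamp": (start, end), "text": ...} dicts into readable lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_seconds(start)} - {format_seconds(end)}] {text}")
    return "\n".join(lines)


# Made-up sample in the same shape as transcription["chunks"].
sample_chunks = [
    {"timestamp": (0.0, 2.0), "text": " Yes, yes."},
    {"timestamp": (87.0, None), "text": " Ok."},
]
print(format_chunks(sample_chunks))
```

With the real pipeline output, the call would be format_chunks(transcription["chunks"]).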

In [6]:
# Let’s also save the transcription to a text file so we can use it later!
with open("transcription.txt", "w", encoding="utf-8") as file:
    file.write(formatted_lyrics.strip())

print("Transcription saved to transcription.txt")

Transcription saved to transcription.txt
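
To use the saved file later, the lines can be parsed back into (start, end, text) tuples. The sketch below assumes the file keeps exactly the "(start, end)-->text" layout written above, with one segment per line. The function name parse_transcript and the sample string are hypothetical.

```python
import re

# Matches lines such as "(0.0, 2.0)--> Yes, yes." or "(87.0, None)--> Ok."
LINE_RE = re.compile(r"^\((?P<start>[^,]+), (?P<end>[^)]+)\)-->(?P<text>.*)$")


def _to_seconds(value):
    """Convert the textual timestamp back to a float, keeping None as None."""
    return None if value == "None" else float(value)


def parse_transcript(text):
    """Parse the saved transcript text into a list of (start, end, text) tuples."""
    segments = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected layout
        segments.append(
            (_to_seconds(match["start"]), _to_seconds(match["end"]), match["text"].strip())
        )
    return segments


# Made-up sample in the saved format.
sample = "(0.0, 2.0)--> Yes, yes.\n(87.0, None)--> Ok."
print(parse_transcript(sample))
```

For the file written above, the input would come from open("transcription.txt", encoding="utf-8").read().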


# SUMMARIZING WITH LLAMA
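Now, let’s take it a step further! We’ll use Meta’s Llama model to summarize the transcript. For this, we need the Llama-3.2-3B-Instruct model. It is a gated model on the Hugging Face Hub, so you need to accept Meta’s license on the model page and log in (for example with huggingface-cli login) before the download will work.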

In [7]:
DEFAULT_MODEL = "meta-llama/Llama-3.2-3B-Instruct"


model = AutoModelForCausalLM.from_pretrained(
    DEFAULT_MODEL,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    device_map=device,
)

tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL, use_safetensors=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

We’ll define a conversation prompt instructing Llama to summarize the meeting transcript in simple English: a system message carries the instruction, and a user message carries the transcript.
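
Before the model sees the conversation, the tokenizer's chat template flattens the list of messages into a single string. The sketch below approximates the Llama 3 layout by hand to show what that string roughly looks like and what a trailing assistant header (the "generation prompt") adds. The function render_llama3_chat is a hypothetical stand-in and deliberately simplified: the real template for this model also injects extra lines such as the knowledge cutoff and the current date.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout for a list of {"role", "content"} dicts.

    Simplified illustration only; in the notebook, tokenizer.apply_chat_template
    does the real work.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content'].strip()}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header signals that the next tokens are the model's reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Made-up messages in the same shape as the conversation defined below.
example = [
    {"role": "system", "content": "Summarize the following transcript."},
    {"role": "user", "content": "(0.0, 2.0)--> Yes, yes."},
]
print(render_llama3_chat(example))
```

Without the trailing assistant header, the prompt ends right after the user's message, and the model may continue that turn instead of answering it.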

In [9]:
conversation = [
    {"role": "system", "content": ''' Summarize the following office meeting transcript in simple and precise English.
     DO NOT USE MARKDOWN FORMAT'''},
    {"role": "user", "content": f'''{formatted_lyrics.strip()}'''},
]

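# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user's turn.
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
# The template already inserts the begin-of-text token, so don't add it a second time.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)
# print(prompt)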

with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=1000
    )

processed_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(processed_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

Cutting Knowledge Date: December 2023
Today Date: 13 Mar 2025

Summarize the following office meeting transcript in simple and precise English.
     DO NOT USE MARKDOWN FORMATuser

(0.0, 2.0)--> Yes, yes.
(87.0, 90.0)--> Ok. How do you analyze with the instantanet data? Well, today, what we're going to do is that the Valiz-Pokéoki we will define for the first stage of use with installed cameras. And the second stage of use, it will be a control point where we just downloaded or posted images on a server. The analysis is not instant, it is a part of the organization that we do not use the same time. Because if we use the analysis to do the control of the four, we start by calling the control point. And there is only this environment with the cameras that are used. In fact, to use the same analysis for your applications, you have to stop the first take-off and put the images. But yes, the analysis is not going to be... Yes, the image analysis is instantaneous, we know. And then w
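
The printed output above repeats the whole prompt, transcript included, because the sequence returned by generate contains the prompt tokens followed by the newly generated ones. One way to print only the summary is to slice off the prompt before decoding. The helper decode_new_tokens below is a hypothetical sketch of that idea, and ToyTokenizer is a stand-in that exists only so the example runs without downloading a model.

```python
def decode_new_tokens(tokenizer, output_ids, prompt_length):
    """Decode only the tokens that come after the first prompt_length tokens."""
    new_ids = output_ids[prompt_length:]
    return tokenizer.decode(new_ids, skip_special_tokens=True).strip()


class ToyTokenizer:
    """Stand-in for a real tokenizer, used only to make this sketch runnable."""

    vocab = {0: "Summarize", 1: "this", 2: "Short", 3: "summary"}

    def decode(self, ids, skip_special_tokens=True):
        return " ".join(self.vocab[i] for i in ids)


# The first two ids play the role of the prompt, the last two the generated text.
output_ids = [0, 1, 2, 3]
print(decode_new_tokens(ToyTokenizer(), output_ids, prompt_length=2))
```

With the notebook's variables, the call would look like decode_new_tokens(tokenizer, output[0], inputs["input_ids"].shape[1]).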