## Transcribing an mp3 file requires FFmpeg, which decodes the audio into a NumPy array. Follow the steps in this guide to install it: [How to install FFmpeg on Windows](https://www.wikihow.com/Install-FFmpeg-on-Windows)
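
Before loading any audio, it can help to confirm that FFmpeg is actually reachable from Python. The following is a minimal sketch that only checks whether an `ffmpeg` executable is on `PATH`; the helper name `ffmpeg_available` is an illustration, not part of the notebook or of any library.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("ffmpeg found at:", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found on PATH; install it before loading mp3 files.")
```

If the check fails right after installation on Windows, opening a new terminal (or restarting the notebook kernel) is usually needed so the updated `PATH` is picked up.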

In [11]:
import time
from transformers import pipeline, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch



In [14]:
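device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Whisper small with chunked long-form decoding and segment-level timestamps
pipe = pipeline("automatic-speech-recognition",
                model="openai/whisper-small",
                chunk_length_s=30,
                stride_length_s=5,
                return_timestamps=True,
                device=device,
                # the audio is already English, so transcribe rather than translate
                generate_kwargs={"language": "english", "task": "transcribe"})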

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [15]:
transcription = pipe(r"F:\Gen-AI-Mini-Projects\audio2text_from_mp3\english_audio.mp3")

# Each chunk holds a (start, end) timestamp in seconds and the text spoken in that span
formatted_transcript = ""
for line in transcription['chunks']:
    text = line["text"]
    ts = line["timestamp"]
    formatted_transcript += f"{ts}-->{text}\n"

print(formatted_transcript.strip())


(0.0, 3.0)--> What's your interest in it, Mr. Wayne?
(3.0, 5.0)--> I want to borrow it...
(5.0, 7.0)--> for, uh, spelunking.
(7.0, 9.0)--> Spelunking?
(9.0, 11.0)--> Yeah, you know, cave diving?
(12.0, 15.0)--> You're expecting to run into much gunfire in these caves?
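
The raw `(start, end)` tuples printed above can be made easier to scan by formatting them as minutes and seconds. This is a sketch under stated assumptions: `format_seconds` and `format_chunk` are hypothetical helper names, the sample chunk is copied from the output above, and the `None` handling covers the case where the pipeline does not report an end time for the final chunk.

```python
from typing import Optional


def format_seconds(seconds: Optional[float]) -> str:
    """Format seconds as MM:SS.ss, or a placeholder when the value is missing."""
    if seconds is None:
        return "??:??.??"
    seconds = round(seconds, 2)  # round first so 59.999 becomes 01:00.00
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def format_chunk(chunk: dict) -> str:
    """Render one pipeline chunk as '[start - end] text'."""
    start, end = chunk["timestamp"]
    return f"[{format_seconds(start)} - {format_seconds(end)}] {chunk['text'].strip()}"


# Sample chunk copied from the transcription output above
sample = {"timestamp": (0.0, 3.0), "text": " What's your interest in it, Mr. Wayne?"}
print(format_chunk(sample))
# prints: [00:00.00 - 00:03.00] What's your interest in it, Mr. Wayne?
```

In the notebook, the same helper could be applied to every entry of `transcription['chunks']` in place of the f-string in the loop.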


## Distil-Whisper

In [16]:
model_id = "distil-whisper/distil-small.en"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, low_cpu_mem_usage=True, use_safetensors=True
)

model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    return_timestamps=True,
    chunk_length_s=15,
    device=device,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
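
The first cell imports `time`, but none of the cells shown use it. One plausible use is timing each pipeline call so that Whisper small and Distil-Whisper can be compared on speed as well as output. The sketch below assumes a hypothetical helper named `timed`; it uses a stand-in workload so that it runs without a model or audio file.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; with the notebook's objects this would be
# something like: transcription, elapsed = timed(pipe, audio_path)
result, elapsed = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed:.4f}s")
```

The first call on a GPU often includes one-off warm-up cost, so timing a second run tends to give a fairer comparison.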


In [17]:
transcription = pipe(r"F:\Gen-AI-Mini-Projects\audio2text_from_mp3\english_audio.mp3")

# Each chunk holds a (start, end) timestamp in seconds and the text spoken in that span
formatted_transcript = ""
for line in transcription['chunks']:
    text = line["text"]
    ts = line["timestamp"]
    formatted_transcript += f"{ts}-->{text}\n"

print(formatted_transcript.strip())

(0.0, 3.0)--> What's your interest in it, Mr. Wayne?
(3.0, 7.0)--> I want to borrow it for uh, spill-lunking.
(7.0, 9.0)--> Spillunking?
(9.0, 11.0)--> Yeah, you know, cave diving.
(12.0, 14.44)--> Expecting to run into much much gunfire on these caves.
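
The two models disagree on a few words, and a word-level diff makes those differences explicit. The sketch below uses Python's standard `difflib`; `word_diff` is a hypothetical helper, and the two strings are the final lines of each output above. Neither line is a ground-truth reference, so this shows only where the models differ, not which one is correct.

```python
import difflib


def word_diff(reference: str, candidate: str):
    """Return (tag, reference_words, candidate_words) for each differing span."""
    ref_words, cand_words = reference.split(), candidate.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(ref_words[i1:i2]), " ".join(cand_words[j1:j2]))
            )
    return changes


# Final line of each transcript, copied from the outputs above
whisper_small_line = "You're expecting to run into much gunfire in these caves?"
distil_line = "Expecting to run into much much gunfire on these caves."

for tag, ref_span, cand_span in word_diff(whisper_small_line, distil_line):
    print(f"{tag:8s} whisper-small: {ref_span!r}  distil: {cand_span!r}")
```

Splitting on whitespace keeps punctuation attached to words, so `caves?` and `caves.` count as a difference; stripping punctuation first would give a looser comparison.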
