In [1]:
!nvidia-smi

Wed Dec  6 12:58:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [2]:
! pip install git+https://github.com/huggingface/transformers -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done


## Playing Audio with IPython Display

In this section, we use the `IPython.display` module to load and play an audio file in the notebook. The module provides classes and functions for displaying rich media in the notebook, such as images, videos, audio, and HTML.

The following code plays the audio file whose path is stored in the variable `audio`:

In [19]:
audio = 'Path_To_Audio_File'    # Change the path to your audio file

from IPython.display import Audio, display

display(Audio(audio, autoplay=True))
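
If you do not have a recording at hand, a short synthetic tone is enough to check that playback works. The sketch below writes a one-second sine wave to a WAV file using only the Python standard library. The function name `write_test_tone` and the file name `test_tone.wav` are illustrative choices, not part of the original notebook, and the 16 kHz sample rate is picked because it is the rate Whisper's feature extractor expects.

```python
import math
import struct
import wave


def write_test_tone(path, freq_hz=440.0, duration_s=1.0, sample_rate=16000):
    """Write a mono 16-bit PCM sine tone to `path` and return the path."""
    n_frames = int(duration_s * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        # Scale to 30% of full range to keep the tone comfortably quiet.
        sample = 0.3 * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        frames += struct.pack("<h", int(sample * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return path


# Hypothetical file name; any writable path works.
tone_path = write_test_tone("test_tone.wav")
```

Passing `tone_path` to `Audio(...)` in place of `audio` should play the tone. Since the file contains no speech, it is only useful for testing playback, not transcription.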

## Automatic Speech Recognition with Hugging Face Transformers and OpenAI Whisper

In this section, we use the Hugging Face Transformers library and the OpenAI Whisper model to transcribe audio samples to text. Whisper is a speech-to-text model trained on a large, diverse, multilingual dataset, so it can transcribe speech in many languages and recording conditions. It is an encoder-decoder Transformer: the audio is converted to a log-Mel spectrogram, passed through two convolutional layers and a stack of self-attention encoder blocks, and a decoder then generates the text tokens.

We will use the following code to create a pipeline for automatic speech recognition:

- We import the necessary modules from the `torch` and `transformers` libraries.

- We set the `device` and `torch_dtype` variables according to the availability of GPU and the data type we want to use. We use `torch.float16` for faster computation on GPU, and `torch.float32` for CPU.

- We set the `model_id` variable to the name of the pre-trained Whisper checkpoint we want to use. We use `openai/whisper-medium`, a medium-sized multilingual checkpoint with roughly 769M parameters. Other Whisper checkpoints are available on the Hugging Face Hub.

- We use the `AutoModelForSpeechSeq2Seq.from_pretrained` method to load the pre-trained Whisper model from the Hugging Face model hub. We pass the `model_id`, `torch_dtype`, and `use_safetensors` arguments to the method. The `use_safetensors` argument loads the weights from the safetensors file format, which stores raw tensor data and, unlike pickle-based checkpoints, cannot execute arbitrary code when a file is loaded.

- We use the `model.to` method to move the model to the device we specified earlier.

- We use the `AutoProcessor.from_pretrained` method to load the corresponding processor for the Whisper model. The processor is a class that combines a tokenizer and a feature extractor. The tokenizer converts between text and token IDs, and the feature extractor converts raw audio into the log-Mel features the model consumes. We pass the `model_id` argument to the method.

- We use the `pipeline` function to create a pipeline for automatic speech recognition. We pass the following arguments to the function:
 - `"automatic-speech-recognition"`: the name of the task we want to perform.
 - `model`: the pre-trained Whisper model we loaded earlier.
 - `tokenizer`: the tokenizer from the processor we loaded earlier.
 - `feature_extractor`: the feature extractor from the processor we loaded earlier.
 - `max_new_tokens`: the maximum number of tokens to generate per input. We set it to 128, which means the pipeline will generate at most 128 tokens (subword units, not words) for each audio sample, or for each chunk when chunking is enabled.
 - `chunk_length_s`: the length of each audio chunk in seconds. We set it to 30, which means the pipeline will split the audio samples into 30-second chunks and process them separately. This lets the pipeline transcribe recordings longer than the 30-second window that Whisper processes at once.
 - `batch_size`: the number of inputs to process in one forward pass. We set it to 16, which means the pipeline will batch up to 16 chunks together. Larger batches improve throughput on a GPU but increase memory usage, so lower this value if you run out of GPU memory.
 - `return_timestamps`: whether to return timestamps with the transcription. We set it to `True`, which means the pipeline will return the start and end time, in seconds, of each transcribed segment under `result["chunks"]`. This can help us align the text with the audio.
 - `torch_dtype`: the data type we want to use for the computation. We set it to the `torch_dtype` variable we defined earlier.
 - `device`: the device we want to use for the computation. We set it to the `device` variable we defined earlier.
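
To get a feel for what `chunk_length_s=30` means for a long recording, the sketch below computes approximate chunk boundaries in plain Python. It is a simplified model, not the pipeline's actual implementation: it assumes each chunk overlaps its neighbours by `chunk_length_s / 6` seconds on each side (the documented default of the pipeline's `stride_length_s` parameter) and it works in seconds rather than samples. The function name `chunk_spans` is hypothetical.

```python
def chunk_spans(duration_s, chunk_length_s=30.0, stride_s=None):
    """Return approximate (start, end) times, in seconds, of overlapping chunks."""
    if stride_s is None:
        # Assumed default overlap on each side of a chunk.
        stride_s = chunk_length_s / 6
    # Each new chunk starts one "step" later; the rest is overlap.
    step = chunk_length_s - 2 * stride_s
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break  # the last chunk reached the end of the audio
        start += step
    return spans


# A 70-second recording is covered by three overlapping chunks.
for span in chunk_spans(70.0):
    print(span)
```

The overlap gives the model context on both sides of a chunk boundary, which the pipeline uses when merging the per-chunk transcriptions. With `batch_size=16`, up to 16 of these chunks would be sent through the model together.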

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-medium"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

In [20]:
result = pipe(audio)
print(result["text"])

 What motivated you to pursue a career in data science?
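
Because the pipeline was created with `return_timestamps=True`, the result also contains a `"chunks"` list in which each entry has a `"timestamp"` tuple and a `"text"` string. The helper below is a small sketch for printing those segments. The name `format_chunks` is hypothetical, and the `sample_result` dictionary is hand-written for illustration: its text matches the output above, but the timestamp values are made up and are not taken from a real run.

```python
def format_chunks(result):
    """Turn a pipeline result with timestamps into printable lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time of the final segment may be missing (None).
        end_label = f"{end:6.2f}" if end is not None else "   end"
        lines.append(f"[{start:6.2f} -> {end_label}] {chunk['text'].strip()}")
    return lines


# Illustrative stand-in for `result`; the timestamps are invented.
sample_result = {
    "text": " What motivated you to pursue a career in data science?",
    "chunks": [
        {
            "timestamp": (0.0, 3.5),
            "text": " What motivated you to pursue a career in data science?",
        },
    ],
}

for line in format_chunks(sample_result):
    print(line)
```

In the notebook, calling `format_chunks(result)` on the real pipeline output should work the same way.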


## Sources
[Whisper documentation (Hugging Face Transformers)](https://huggingface.co/docs/transformers/model_doc/whisper)

[openai/whisper-large-v3 model card](https://huggingface.co/openai/whisper-large-v3)