# Whisper

Whisper was proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model.

It was trained on 680,000 hours of labeled audio data collected from the internet and carefully processed into a large-scale dataset. The approach is called "weak supervision" because it relies on data that already has labels, so no manual annotation is needed. The effort goes instead into meticulous data processing to ensure quality.

The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts a transcription in the same language as the audio. For speech translation, the model predicts text in a different language from the audio (Whisper was trained to translate into English).

Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

# Introduction to Hugging Face

Hugging Face is a company specializing in natural language processing (NLP) and machine learning. It has gained prominence for its contributions to open-source AI, particularly through its `Transformers` library, which provides pre-trained models for a range of NLP tasks. 

**Transformers Library:** A popular library that provides state-of-the-art pre-trained models for NLP tasks such as text classification, translation, and generation. It gives access to pre-trained models from leading AI research, such as BERT, GPT, and T5, and simplifies fine-tuning and deploying these models for specific tasks.

Several pretrained Whisper models of varying sizes are available, so you can choose the one that suits your purpose. See the OpenAI page on Hugging Face for details: https://huggingface.co/openai.

In this lab, we will use `openai/whisper-large-v3`.

## Load the model

```python
# Load the processor and model directly from the Hugging Face Hub
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")
```



## WhisperProcessor

The `WhisperProcessor` is used to:
- Pre-process the audio inputs (converting them to log-Mel spectrograms for the model)
- Post-process the model outputs (converting them from tokens to text)

Whisper was pretrained on audio with a sampling rate of 16,000 Hz, so we need to make sure our audio file has a sampling rate of 16,000 Hz as well.
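Before loading the audio, the sampling rate stored in a WAV file can be checked with the standard library alone. The sketch below is only an illustration: the helper names `wav_sampling_rate` and `needs_resampling` and the temporary demo file are assumptions made for this example and are not part of the lab.

```python
import os
import tempfile
import wave

TARGET_SR = 16000  # sampling rate Whisper was pretrained on


def wav_sampling_rate(path):
    """Return the sampling rate stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_sr=TARGET_SR):
    """Return True if the file's sampling rate differs from the target rate."""
    return wav_sampling_rate(path) != target_sr


# Demo: write one second of silence at 44.1 kHz to a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)       # mono
    wav_file.setsampwidth(2)       # 16-bit samples
    wav_file.setframerate(44100)   # 44.1 kHz, so resampling is needed
    wav_file.writeframes(b"\x00\x00" * 44100)

print(wav_sampling_rate(demo_path))   # 44100
print(needs_resampling(demo_path))    # True
```

A check like this only reads the header, so it is cheap to run over many files before deciding which ones need resampling.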

```python
import librosa

# Sampling rate expected by Whisper
target_sr = 16000

# Load the audio file (librosa resamples to 22,050 Hz by default)
audio_path = './test.wav'
data, sampling_rate = librosa.load(audio_path)

# Resample only if the audio is not already at 16,000 Hz
if sampling_rate != target_sr:
    y_resampled = librosa.resample(data, orig_sr=sampling_rate, target_sr=target_sr)
else:
    y_resampled = data

# WhisperProcessor pre-processes the audio input; its output is the log-Mel spectrogram of the audio.
input_features = processor(y_resampled, sampling_rate=target_sr, return_tensors="pt").input_features
input_features
```

tensor([[[-0.5486, -0.5486, -0.5162,  ..., -0.5486, -0.5486, -0.5486],
         [-0.5486, -0.4993, -0.3898,  ..., -0.5486, -0.5486, -0.5486],
         [-0.4974, -0.4346, -0.1367,  ..., -0.5486, -0.5486, -0.5486],
         ...,
         [-0.5486, -0.5486,  0.0812,  ..., -0.5486, -0.5486, -0.5486],
         [-0.5486, -0.5486,  0.2757,  ..., -0.5486, -0.5486, -0.5486],
         [-0.5486, -0.5486,  0.0576,  ..., -0.5486, -0.5486, -0.5486]]])
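The size of this tensor can be worked out by hand. The sketch below assumes the default Whisper feature-extractor settings (30-second chunks, a hop length of 160 samples) and the 128 Mel bins used by `whisper-large-v3`; these constants are assumptions stated here for illustration and should be checked against the loaded processor.

```python
SAMPLING_RATE = 16000   # Hz, the rate the audio was resampled to
CHUNK_SECONDS = 30      # assumed: each input is padded or truncated to 30 s
HOP_LENGTH = 160        # assumed: samples between successive spectrogram frames
N_MELS = 128            # assumed: Mel bins for large-v3 (earlier checkpoints use 80)

samples_per_chunk = SAMPLING_RATE * CHUNK_SECONDS   # 480000 samples
n_frames = samples_per_chunk // HOP_LENGTH          # 3000 frames
expected_shape = (1, N_MELS, n_frames)              # (batch, mel bins, frames)

print(expected_shape)   # (1, 128, 3000)
```

Comparing this with `input_features.shape` is a quick sanity check. Padding short clips up to the fixed length is also why long runs of an identical value can appear in the tensor.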

The model determines whether to perform transcription or translation by receiving specific "context tokens." These tokens form a sequence provided to the decoder at the beginning of the decoding process, following this structure:

- The transcription starts with the `<|startoftranscript|>` token.
- The second token indicates the language (e.g., `<|en|>` for English).
- The third token is the "task token," which can either be `<|transcribe|>` for speech recognition or `<|translate|>` for speech translation.
- Additionally, a `<|notimestamps|>` token is included if timestamp prediction is not required.

For example, this sequence instructs the model to decode in English for the task of speech recognition, without predicting timestamps:

```text
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
```
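The structure described above can be sketched as a small function that assembles the context-token string. `build_context_tokens` is a hypothetical helper written for this illustration. It is not part of the Transformers API, which builds these tokens internally.

```python
def build_context_tokens(language_code, task, timestamps=False):
    """Assemble the Whisper context tokens as a space-separated string.

    Hypothetical helper for illustration only.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = [
        "<|startoftranscript|>",   # always first
        f"<|{language_code}|>",    # language token, e.g. <|en|>
        f"<|{task}|>",             # task token
    ]
    if not timestamps:
        tokens.append("<|notimestamps|>")   # only when timestamps are not wanted
    return " ".join(tokens)


print(build_context_tokens("en", "transcribe"))
# <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
```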

These tokens can be either forced or unforced. If they are forced, the model is made to emit each given token at its assigned position, which gives control over the output language and task. If they are unforced, the Whisper model determines the output language and task automatically.

The context tokens can be configured as follows:

```python
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
```

This forces the model to predict in English for the task of speech recognition.
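To make the idea of forcing concrete, here is a toy greedy decoder in which forced positions override the model's own choice. It is a simplified sketch, not the Transformers implementation: the token ids and scores are made up, and the scores do not depend on previously generated tokens as they would in a real decoder. It does assume that forced ids are given as `(position, token_id)` pairs.

```python
def greedy_decode_with_forcing(step_scores, forced_decoder_ids, start_token):
    """Pick one token per position, overriding the choice at forced positions.

    step_scores[i] maps token id -> score for position i + 1.
    forced_decoder_ids is a list of (position, token_id) pairs.
    """
    forced = dict(forced_decoder_ids)
    output = [start_token]
    for position, scores in enumerate(step_scores, start=1):
        if position in forced:
            token = forced[position]            # forced: ignore the scores
        else:
            token = max(scores, key=scores.get)  # unforced: take the best score
        output.append(token)
    return output


# Hypothetical token ids, for illustration only
START, EN, FR, TRANSCRIBE, TRANSLATE, NOTIMESTAMPS, HELLO = range(7)

step_scores = [
    {EN: 0.2, FR: 0.8},                 # the toy model prefers French here
    {TRANSCRIBE: 0.3, TRANSLATE: 0.7},  # and prefers translation here
    {NOTIMESTAMPS: 0.9, HELLO: 0.1},
    {HELLO: 1.0},
]
forced_ids = [(1, EN), (2, TRANSCRIBE), (3, NOTIMESTAMPS)]

print(greedy_decode_with_forcing(step_scores, forced_ids, START))  # [0, 1, 3, 5, 6]
print(greedy_decode_with_forcing(step_scores, [], START))          # [0, 2, 4, 5, 6]
```

With forcing, the language and task tokens are fixed regardless of the scores. Without it, the toy model's own preferences decide them.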

```python
# build the forced context tokens for English transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# decode token ids to text
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Printing, in the only sense with which we are at present concerned, differs from most, if not from all the arts and crafts represented in the exhibition']


The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.

```python
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
transcription
```

[' Printing, in the only sense with which we are at present concerned, differs from most, if not from all the arts and crafts represented in the exhibition']
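The effect on the decoded string can be imitated with a regular expression. This is only a rough illustration under the assumption that special tokens always have the form `<|...|>`. The real tokenizer filters special token ids before converting to text, and `strip_special_tokens` is a hypothetical helper.

```python
import re

# Matches Whisper-style special tokens such as <|en|> or <|notimestamps|>
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text):
    """Remove every <|...|> marker from an already decoded string."""
    return SPECIAL_TOKEN_PATTERN.sub("", text)


raw = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Printing, in the only sense"
print(repr(strip_special_tokens(raw)))
# ' Printing, in the only sense'
```

The leading space is kept, as it is in the output above.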

## English to French

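Note that Whisper's `translate` task was trained to produce English text from audio in other languages, and the `language` argument names the language spoken in the audio. Producing French text from English audio is outside that training setup, so the output of this cell may not be a French translation.

```python
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```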