# Transformers

**Exercise:** Use the [small version of the Whisper model from OpenAI](https://huggingface.co/openai/whisper-small) to recognise [this audio in Spanish](https://huggingface.co/datasets/Narsil/asr_dummy/resolve/285aeb6e0cb9a9dbba1ce9b16a98f0b1655d4884/4.flac) from [this small dataset](https://huggingface.co/datasets/Narsil/asr_dummy).

In [9]:
!pip install datasets evaluate transformers[sentencepiece]



In [12]:
from transformers import pipeline
transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/285aeb6e0cb9a9dbba1ce9b16a98f0b1655d4884/4.flac", generate_kwargs={"language": "spanish"})

{'text': ' Y en las ramas medio sumergidas revoloteaban algunos pájaros de quimérico y legendario plumaque.'}

An alternative solution reproduces, step by step, what the pipeline does internally.

First, load [the small dataset](https://huggingface.co/datasets/Narsil/asr_dummy). It contains four files; the last one is the Spanish audio file referred to above.

In [14]:
import datasets as ds

# load dummy dataset including four audio files
dummy_ds = ds.load_dataset("Narsil/asr_dummy")
print(dummy_ds)

DatasetDict({
    test: Dataset({
        features: ['id', 'file'],
        num_rows: 4
    })
})


Then, resample the audio to 16 kHz, the sampling rate Whisper expects, using the [Audio class](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Audio). For more background, see [this post on audio datasets](https://huggingface.co/blog/audio-datasets).

In [15]:
dummy_ds = dummy_ds.cast_column("file", ds.Audio(sampling_rate=16_000))
sample = dummy_ds['test'][-1]
print(sample)

{'id': '3', 'file': {'path': '/root/.cache/huggingface/datasets/downloads/b5d16a62fa6856bfbf56c92328e152d4b76a7f1e0f242a9e094ff6821583a329', 'array': array([-2.91038305e-10,  2.32830644e-10,  7.56699592e-10, ...,
        1.64538494e-03,  8.10257625e-04,  1.00391894e-03]), 'sampling_rate': 16000}}
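
The `Audio` class handles the resampling for you. As a rough illustration of what changing the sampling rate means, here is a minimal sketch using plain linear interpolation. This is not the algorithm `datasets` uses (it delegates to an audio backend with proper filtering). The `resample_linear` helper and the 48 kHz ramp signal are hypothetical, chosen only for the example.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Naive linear-interpolation resampler (illustration only, no anti-aliasing)."""
    if not samples:
        return []
    # Same duration, different number of samples per second.
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of a hypothetical 48 kHz ramp signal.
one_second_48k = [i / 48_000 for i in range(48_000)]
one_second_16k = resample_linear(one_second_48k, 48_000, 16_000)

# The duration is unchanged, so the sample count scales with the rate.
print(len(one_second_48k), len(one_second_16k))
```

In practice you would keep using `cast_column` as in the cell above; the sketch only shows why the array length changes with the sampling rate while the duration stays the same.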


Load the [WhisperProcessor](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperProcessor), which bundles the audio feature extractor and the tokenizer, and the [WhisperForConditionalGeneration](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration) model with its language-modelling head. Then set the decoder prompt in the model configuration so that the model transcribes only in Spanish.

In [16]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# force the decoder prompt tokens so that the model transcribes in Spanish
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="spanish", task="transcribe")


Generate the Whisper input features (a log-mel spectrogram) from the raw audio sample.

In [17]:
input_features = processor(sample["file"]["array"], sampling_rate=sample["file"]["sampling_rate"], return_tensors="pt").input_features
print(input_features)


tensor([[[-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         ...,
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331]]])
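
The printed tensor hides its shape. The short calculation below sketches where that shape should come from, assuming the default Whisper feature-extractor settings as I understand them (16 kHz audio padded or truncated to 30 seconds, a hop of 160 samples, 80 mel bins). These constants are assumptions, not values read from the notebook; you can check them against `processor.feature_extractor` and `input_features.shape`. The repeated value in the printout is presumably the floor of the normalised log-mel scale, which appears in quiet or padded regions.

```python
# Assumed default Whisper feature-extractor settings (verify on your processor).
SAMPLING_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30       # every input is padded or truncated to this length
HOP_LENGTH = 160         # samples between consecutive spectrogram frames
N_MELS = 80              # mel frequency bins

n_samples = SAMPLING_RATE * CHUNK_SECONDS   # samples in one padded chunk
n_frames = n_samples // HOP_LENGTH          # spectrogram frames per chunk

# (batch size, mel bins, frames) for a single audio sample
expected_shape = (1, N_MELS, n_frames)
print(expected_shape)  # (1, 80, 3000)
```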


Run inference by calling the generic [generate function](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate), which returns the predicted token ids.

In [18]:
predicted_ids = model.generate(input_features)
print(predicted_ids)

tensor([[50258, 50262, 50359, 50363,   398,   465,  2439, 10211,   296, 22123,
          2408, 17025, 11382, 16908,  1370, 18165, 21078, 40639, 10150,   329,
           368,   421,   332,   526, 23776,   288,  9451,  4912, 25854, 23179,
            13, 50257]])
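
The id sequence above has a visible structure: a few prompt tokens, the text tokens, and an end marker. The sketch below separates them in plain Python. The id-to-name mapping is inferred by lining up these ids with the decoded output printed in this notebook with `skip_special_tokens=False`, so treat it as an illustration for this checkpoint rather than a general table.

```python
# Token ids copied from the output above.
predicted = [
    50258, 50262, 50359, 50363, 398, 465, 2439, 10211, 296, 22123,
    2408, 17025, 11382, 16908, 1370, 18165, 21078, 40639, 10150, 329,
    368, 421, 332, 526, 23776, 288, 9451, 4912, 25854, 23179,
    13, 50257,
]

# Special tokens, inferred from the decoded string shown in this notebook.
SPECIAL = {
    50258: "<|startoftranscript|>",
    50262: "<|es|>",
    50359: "<|transcribe|>",
    50363: "<|notimestamps|>",
    50257: "<|endoftext|>",
}

special_names = [SPECIAL[t] for t in predicted if t in SPECIAL]
text_ids = [t for t in predicted if t not in SPECIAL]

print(special_names)
print(len(text_ids), "text tokens")
```

The `<|es|>` and `<|transcribe|>` entries are the ones fixed by the forced decoder prompt; only the remaining ids carry the transcription itself.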


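Convert the token ids back to text using the [batch_decode](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode) function. With `skip_special_tokens=False` the output keeps the prompt and end-of-text markers; with `skip_special_tokens=True` they are removed.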

In [19]:
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
print(transcription)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)

['<|startoftranscript|><|es|><|transcribe|><|notimestamps|> Y en las ramas medio sumergidas revoloteaban algunos pájaros de quimérico y legendario plumaque.<|endoftext|>']
[' Y en las ramas medio sumergidas revoloteaban algunos pájaros de quimérico y legendario plumaque.']
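
Note that `batch_decode` returns a list with one string per input sequence, and the decoded text starts with a space. A minimal post-processing sketch, using the output above as a literal so that it runs on its own:

```python
# Decoded output copied from the cell above (one string per sequence).
transcription = [' Y en las ramas medio sumergidas revoloteaban algunos pájaros de quimérico y legendario plumaque.']

# Take the single result and drop the leading space.
text = transcription[0].strip()
print(text)
```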
