# Transformers

**Exercise:** Use the [small version of the Whisper model from OpenAI](https://huggingface.co/openai/whisper-small) to recognise [this audio in Spanish](https://huggingface.co/datasets/Narsil/asr_dummy/resolve/285aeb6e0cb9a9dbba1ce9b16a98f0b1655d4884/4.flac) from [this small dataset](https://huggingface.co/datasets/Narsil/asr_dummy).

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
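from transformers import pipeline

# build a speech-recognition pipeline around the small Whisper checkpoint
generator = pipeline(task="automatic-speech-recognition", model="openai/whisper-small", max_new_tokens=30)
# the pipeline accepts a URL and downloads the audio itself
generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/285aeb6e0cb9a9dbba1ce9b16a98f0b1655d4884/4.flac")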

{'text': ' Y en las ramas medio sumergidas revoloteaban algunos pájaros de quimérico y legendario plumaque.'}
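
The call above leaves the choice of language and task to the model. The following is a minimal sketch of how both could be pinned in the pipeline itself, assuming a `transformers` release in which the speech-recognition pipeline forwards `generate_kwargs` containing `language` and `task` to Whisper. The helper names `whisper_generate_kwargs` and `transcribe` are hypothetical and exist only for this illustration.

```python
def whisper_generate_kwargs(language="spanish", task="transcribe"):
    """Build generation settings that pin Whisper's language and task."""
    return {"language": language, "task": task}


def transcribe(audio, model_name="openai/whisper-small"):
    """Transcribe `audio` (a path or URL) with the language pinned.

    Assumption: the installed transformers version accepts
    `generate_kwargs` on the ASR pipeline.
    """
    # Imported lazily so that defining the helper does not load the library.
    from transformers import pipeline

    asr = pipeline(
        task="automatic-speech-recognition",
        model=model_name,
        generate_kwargs=whisper_generate_kwargs(),
    )
    return asr(audio)["text"]
```

Calling `transcribe(...)` with the URL of the Spanish file would download the model, so it is left out of the sketch.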

An alternative solution unpacks the steps that the pipeline performs. This is useful because the default pipeline call does not state the language or task explicitly, and it seems to allow translation as well as transcription.

First, load [the small dataset](https://huggingface.co/datasets/Narsil/asr_dummy), which contains four audio files. The last one is the Spanish audio file referred to above.

In [2]:
import datasets as ds

# load dummy dataset including four audio files
dummy_ds = ds.load_dataset("Narsil/asr_dummy")
print(dummy_ds)


DatasetDict({
    test: Dataset({
        features: ['id', 'file'],
        num_rows: 4
    })
})


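Then, resample the audio to 16 kHz, the sampling rate Whisper expects, by casting the `file` column to the [Audio class](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Audio). The cast also decodes each file into an array when a row is accessed. You can find [further information on Audio Datasets](https://huggingface.co/blog/audio-datasets).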

In [3]:
dummy_ds = dummy_ds.cast_column("file", ds.Audio(sampling_rate=16_000))
sample = dummy_ds['test'][-1]
print(sample)

{'id': '3', 'file': {'path': '/root/.cache/huggingface/datasets/downloads/b5d16a62fa6856bfbf56c92328e152d4b76a7f1e0f242a9e094ff6821583a329', 'array': array([-2.9103830e-10,  2.3283064e-10,  7.5669959e-10, ...,
        1.6453849e-03,  8.1025762e-04,  1.0039189e-03], dtype=float32), 'sampling_rate': 16000}}
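
To make the resampling step more concrete, here is a minimal sketch that changes the sampling rate of a mono signal by plain linear interpolation. This is not the algorithm `datasets` uses (the library delegates to an audio backend, and a production resampler also applies anti-alias filtering). The function name `resample_linear` and the 48 kHz ramp signal are hypothetical.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a mono signal by linear interpolation (illustration only)."""
    if orig_sr == target_sr:
        return list(samples)
    n_out = int(len(samples) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of a hypothetical 48 kHz ramp becomes 16000 samples at 16 kHz.
ramp = [i / 48_000 for i in range(48_000)]
resampled = resample_linear(ramp, orig_sr=48_000, target_sr=16_000)
print(len(resampled))  # 16000
```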


Load the [WhisperProcessor](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperProcessor), which bundles the audio feature extractor and the tokenizer, and the [WhisperForConditionalGeneration](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration) model, which includes the language modeling head. Then set the decoder prompt in the model configuration so that the model transcribes in Spanish instead of translating.

In [4]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="spanish", task="transcribe")
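
As a rough illustration of what the forced decoder ids do, the sketch below assumes that `get_decoder_prompt_ids` returns a list of `(position, token_id)` pairs and shows how such pairs pin the first decoder positions after the start token. The ids are taken from the `generate` output printed in this notebook. The function `build_prompt` is hypothetical and is not part of `transformers`.

```python
def build_prompt(start_token_id, forced_decoder_ids):
    """Return the decoder prefix fixed by (position, token_id) pairs."""
    prompt = [start_token_id]
    for position, token_id in sorted(forced_decoder_ids):
        # Each pair pins exactly the next free position in the prefix.
        if position != len(prompt):
            raise ValueError(f"gap in forced positions at {position}")
        prompt.append(token_id)
    return prompt


# Assumed value for language="spanish", task="transcribe":
# <|es|>, <|transcribe|>, <|notimestamps|> at positions 1, 2, 3.
forced = [(1, 50262), (2, 50359), (3, 50363)]
print(build_prompt(50258, forced))  # [50258, 50262, 50359, 50363]
```

Newer `transformers` releases may prefer passing `language` and `task` directly to `generate` over setting `forced_decoder_ids` in the configuration, so it is worth checking the documentation for the installed version.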



Generate the Whisper input features (a log-Mel spectrogram) from the raw audio sample.

In [5]:
input_features = processor(sample["file"]["array"], sampling_rate=sample["file"]["sampling_rate"], return_tensors="pt").input_features 
print(input_features)


tensor([[[-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         ...,
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331],
         [-0.6331, -0.6331, -0.6331,  ..., -0.6331, -0.6331, -0.6331]]])
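
The printed tensor hides its dimensions. The sketch below estimates the expected shape, assuming the default Whisper feature extractor settings for `whisper-small`: audio padded or trimmed to 30 seconds, a hop of 160 samples between frames, and 80 Mel bins. These constants are assumptions about the default configuration, not values read from this notebook.

```python
SAMPLING_RATE = 16_000  # Hz, as set when casting the audio column
CHUNK_SECONDS = 30      # assumed: audio is padded or trimmed to this length
HOP_LENGTH = 160        # assumed: samples between spectrogram frames (10 ms)
N_MELS = 80             # assumed: number of Mel bins for whisper-small


def expected_feature_shape(batch_size=1):
    """Return (batch, mel bins, frames) for one padded 30-second window."""
    n_samples = SAMPLING_RATE * CHUNK_SECONDS
    n_frames = n_samples // HOP_LENGTH
    return (batch_size, N_MELS, n_frames)


print(expected_feature_shape())  # (1, 80, 3000)
```

Printing `input_features.shape` in the notebook is a quick way to compare against this estimate.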


Run inference by calling the generic [generate function](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate), which returns the predicted token ids.

In [6]:
predicted_ids = model.generate(input_features)
print(predicted_ids)



tensor([[50258, 50262, 50359, 50363,   398,   465,  2439, 10211,   296, 22123,
          2408, 17025, 11382, 16908,  1370, 18165, 21078, 40639, 10150,   329,
           368,   421,   332,   526, 23776,   288,  9451,  4912, 25854, 23179,
            13, 50257]])
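
The first four ids and the last one are special tokens, not text. The sketch below separates them from the text ids. The id-to-token mapping is inferred by aligning the ids above with the decoded string printed at the end of this notebook, and the helper `split_special` is hypothetical.

```python
# Mapping inferred from this notebook's outputs, not read from the tokenizer.
SPECIAL_TOKENS = {
    50257: "<|endoftext|>",
    50258: "<|startoftranscript|>",
    50262: "<|es|>",
    50359: "<|transcribe|>",
    50363: "<|notimestamps|>",
}

# Ids copied from the generate() output above.
predicted = [
    50258, 50262, 50359, 50363, 398, 465, 2439, 10211, 296, 22123,
    2408, 17025, 11382, 16908, 1370, 18165, 21078, 40639, 10150, 329,
    368, 421, 332, 526, 23776, 288, 9451, 4912, 25854, 23179,
    13, 50257,
]


def split_special(ids):
    """Split a sequence into special-token names and plain text ids."""
    special = [SPECIAL_TOKENS[i] for i in ids if i in SPECIAL_TOKENS]
    text_ids = [i for i in ids if i not in SPECIAL_TOKENS]
    return special, text_ids


special, text_ids = split_special(predicted)
print(special)
print(len(text_ids))  # 27
```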


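Convert the ids to text using the function [batch_decode](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode). With `skip_special_tokens=False` the prompt tokens stay visible in the output, and with `skip_special_tokens=True` they are removed.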

In [7]:
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
print(transcription)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)

['<|startoftranscript|><|es|><|transcribe|><|notimestamps|> Y en las ramas medio sumergidas revoloteaban algunos pájaros de quimérico y legendario plumaque.<|endoftext|>']
[' Y en las ramas medio sumergidas revoloteaban algunos pájaros de quimérico y legendario plumaque.']
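
The two outputs differ only in the `<|...|>` markers. As a quick check, the sketch below removes those markers with a regular expression and keeps the plain text. This is a stand-in for illustration, not how the tokenizer implements `skip_special_tokens`, and `strip_special` is a hypothetical helper.

```python
import re

# Decoded string copied from the first print above.
raw = (
    "<|startoftranscript|><|es|><|transcribe|><|notimestamps|>"
    " Y en las ramas medio sumergidas revoloteaban algunos pájaros"
    " de quimérico y legendario plumaque.<|endoftext|>"
)


def strip_special(text):
    """Remove every <|...|> marker from a decoded Whisper string."""
    return re.sub(r"<\|[^|]*\|>", "", text)


clean = strip_special(raw)
# The leading space comes from the first text token; strip() drops it.
print(clean.strip())
```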
