<!-- @format -->

# Hands-on exercise

This exercise is not graded and is intended to help you become familiar with the
tools and libraries that you will be using throughout the rest of the course. If
you are already experienced in using Google Colab, 🤗 Datasets, librosa and 🤗
Transformers, you may choose to skip this exercise.

1. Create a [Google Colab](https://colab.research.google.com) notebook.
2. Use 🤗 Datasets to load the train split of the
   [`facebook/voxpopuli` dataset](https://huggingface.co/datasets/facebook/voxpopuli)
   in a language of your choice, using streaming mode.
3. Get the third example from the `train` part of the dataset and explore it.
   Given the features that this example has, what kinds of audio tasks can you
   use this dataset for?
4. Plot this example's waveform and spectrogram.
5. Go to the [🤗 Hub](https://huggingface.co/models), explore the pretrained
   models and find one that can be used for automatic speech recognition in the
   language that you picked earlier. Instantiate a corresponding pipeline with
   the model you found, and transcribe the example.
6. Compare the transcription that you get from the pipeline to the transcription
   provided in the example.

If you struggle with this exercise, feel free to take a peek at an
[example solution](https://colab.research.google.com/drive/1NGyo5wFpRj8TMfZOIuPaJHqyyXCITftc?usp=sharing).
Discovered something interesting? Found a cool model? Got a beautiful
spectrogram? Feel free to share your work and discoveries on Twitter!

In the next chapters you'll learn more about various audio transformer
architectures and will train your own model!


In [None]:
!pip install "datasets[audio]" librosa transformers


In [None]:
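from datasets import load_dataset

# Stream the English configuration so the data is fetched lazily.
ds = load_dataset("facebook/voxpopuli", name="en", streaming=True)
# Take the first three training examples and keep the last one (the third).
ds_head = ds["train"].take(3)
example = list(ds_head)[-1]
example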


In [None]:
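from IPython.display import Audio

# Use the sampling rate stored with the example rather than a hard-coded value.
Audio(example["audio"]["array"], rate=example["audio"]["sampling_rate"])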


In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt

array = example["audio"]["array"]
sampling_rate = example["audio"]["sampling_rate"]

# Plot amplitude over time on a wide figure.
plt.figure().set_figwidth(12)
librosa.display.waveshow(array, sr=sampling_rate)


In [None]:
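import numpy as np

# Short-time Fourier transform, then magnitudes in dB relative to the peak.
D = librosa.stft(array)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

plt.figure().set_figwidth(12)
# Pass the real sampling rate; otherwise specshow assumes 22050 Hz and
# mislabels both the time and the frequency axis.
librosa.display.specshow(S_db, sr=sampling_rate, x_axis="time", y_axis="hz")
plt.colorbar()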
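
If you are curious what the `amplitude_to_db(..., ref=np.max)` call above does to the magnitudes, here is a minimal NumPy-only sketch. The helper name `amplitude_to_db_sketch` is hypothetical, and the defaults `amin=1e-5` and `top_db=80.0` are assumed to mirror librosa's; treat it as an illustration of the idea, not as librosa's implementation.

```python
import numpy as np


def amplitude_to_db_sketch(S, amin=1e-5, top_db=80.0):
    """Convert magnitudes to dB relative to the largest magnitude (illustrative only)."""
    S = np.abs(np.asarray(S, dtype=float))
    ref = S.max()  # plays the role of ref=np.max
    # Floor tiny values at amin to avoid log(0), then express relative to the peak.
    db = 20.0 * np.log10(np.maximum(S, amin)) - 20.0 * np.log10(max(ref, amin))
    # Limit the dynamic range to top_db below the loudest bin.
    return np.maximum(db, db.max() - top_db)


magnitudes = np.array([[1.0, 0.1], [0.01, 1e-6]])
print(amplitude_to_db_sketch(magnitudes))
```

The loudest bin maps to 0 dB, every factor of 10 in amplitude is a 20 dB step, and anything more than 80 dB below the peak is clipped, which is why the colorbar of the spectrogram typically spans 0 dB down to -80 dB.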


In [None]:
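from transformers import pipeline

# The dataset was loaded with name="en", so pick an English ASR checkpoint.
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
)
# Model transcription, followed by the reference transcription from the dataset.
print(asr(example["audio"]["array"]))
print(example["raw_text"])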
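
Comparing the two printed lines by eye works for a single example. If you would like a number instead, one common choice is the word error rate (WER). Below is a minimal pure-Python sketch; the function name `word_error_rate` is hypothetical, and the only normalization applied is lowercasing, so punctuation differences still count as errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In the notebook you could call it as `word_error_rate(example["raw_text"], asr(example["audio"]["array"])["text"])`, assuming the example has a non-empty `raw_text`.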


