## Step 2: Load and Prepare Dataset
* Use the “VoxPopuli” dataset available on Hugging Face (`facebook/voxpopuli`), specifically the “it” (Italian) subset, and load only the training split.
* Shuffle the dataset with seed=42 and keep a random quarter of the entries. The smaller subset reduces processing time and is easier to handle on limited resources.
* Convert the audio in the dataset to a 16 kHz sampling rate, which is the rate the model you’ll be using expects.

In [1]:
from datasets import load_dataset
ds = load_dataset('facebook/voxpopuli', 'it', split='train')
ds = ds.shuffle(seed=42)

1. What is the original size of the train split of “facebook/voxpopuli” for the “it” (Italian) subset?

In [2]:
ds

Dataset({
    features: ['audio_id', 'language', 'audio', 'raw_text', 'normalized_text', 'gender', 'speaker_id', 'is_gold_transcript', 'accent'],
    num_rows: 22576
})

2. What is the sampling rate of the original audio?

In [4]:
set(ds.map(lambda x: {'sr': x['audio']['sampling_rate']}, remove_columns=ds.column_names)['sr'])

{16000}

3. How many unique characters are in the dataset?

In [5]:
uniq_chars = set()
for sample in ds:
    uniq_chars.update(sample['normalized_text'])
len(uniq_chars)

41

In [6]:
uniq_chars = set()
for sample in ds:
    uniq_chars.update(sample['raw_text'])
len(uniq_chars)

117
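
Iterating with `for sample in ds` decodes the audio of every row even though only the text is needed. The character count can instead be computed from the text column alone, for example by passing `ds['normalized_text']` to a small helper. The sketch below uses a hypothetical `unique_characters` function and two made-up sentences in place of the real column.

```python
def unique_characters(texts):
    """Collect the set of distinct characters across an iterable of strings."""
    chars = set()
    for text in texts:
        chars.update(text)
    return chars


# Made-up stand-in for ds['normalized_text'].
texts = ["buongiorno a tutti", "perché no"]
chars = unique_characters(texts)
print(len(chars), sorted(chars))
```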

4. How many tokens are in the “microsoft/speecht5_tts” tokenizer?

In [8]:
from transformers import SpeechT5Tokenizer
tokenizer = SpeechT5Tokenizer.from_pretrained('microsoft/speecht5_tts')



In [9]:
vocab = tokenizer.get_vocab()
len(vocab)

81

5. Are all the unique characters in the Italian train split present in the token list of “microsoft/speecht5_tts”? (true/false)

In [11]:
uniq_chars = set()
for sample in ds:
    uniq_chars.update(sample['normalized_text'])
len(uniq_chars)

41

In [12]:
uniq_chars - vocab.keys()

{' ', 'à', 'è', 'ì', 'í', 'ï', 'ò', 'ó', 'ù'}
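
Since the difference is not empty, the answer is false. The space is probably not a real gap: SpeechT5's tokenizer is SentencePiece-based and, as far as I know, represents spaces with the `▁` symbol. The accented vowels are genuinely missing. One common workaround, shown here as a sketch and not as the notebook's method, is to map each missing character to an unaccented token that the vocabulary does contain. The `REPLACEMENTS` table and `cleanup_text` are hypothetical names, and dropping accents does lose some pronunciation information.

```python
# Map each unsupported character found above to a supported one (assumption:
# the plain vowels a, e, i, o, u are in the tokenizer vocabulary).
REPLACEMENTS = {
    "à": "a", "è": "e", "ì": "i", "í": "i",
    "ï": "i", "ò": "o", "ó": "o", "ù": "u",
}
TRANSLATION_TABLE = str.maketrans(REPLACEMENTS)


def cleanup_text(text):
    """Replace unsupported characters; everything else is left untouched."""
    return text.translate(TRANSLATION_TABLE)


print(cleanup_text("città più"))  # citta piu
```

Applied to the dataset, this could look like `ds.map(lambda x: {'normalized_text': cleanup_text(x['normalized_text'])})`.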

7. How many speakers have less than or equal to 100 samples?

In [14]:
from collections import Counter
import pandas as pd

speaker_ids = Counter(ds['speaker_id'])
speaker_ids = pd.Series(speaker_ids)
speaker_ids.head()

124835     519
124851    1350
124778     236
96818      190
None      1248
dtype: int64

In [15]:
(speaker_ids <= 100).sum()

80

8. What is the length of the dataset after removing speakers with fewer than 100 samples or more than 400 samples?

In [20]:
def filter_by_length(sample):
    # Keep a sample only if its speaker has between 100 and 400 samples (inclusive).
    n_samples = speaker_ids[sample['speaker_id']]
    return (n_samples >= 100) and (n_samples <= 400)

ds_moderate = ds.filter(filter_by_length)
ds_moderate

Dataset({
    features: ['audio_id', 'language', 'audio', 'raw_text', 'normalized_text', 'gender', 'speaker_id', 'is_gold_transcript', 'accent'],
    num_rows: 10683
})
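
The same selection can be expressed without pandas by first building the set of speakers whose sample count falls in the allowed range and then testing membership. The sketch below uses a hypothetical `speakers_in_range` helper and a short made-up list of speaker ids; the bounds are inclusive, matching the filter above.

```python
from collections import Counter


def speakers_in_range(speaker_ids, low=100, high=400):
    """Return the speakers whose sample count lies in [low, high]."""
    counts = Counter(speaker_ids)
    return {speaker for speaker, n in counts.items() if low <= n <= high}


# Made-up ids: "a" has 3 samples, "b" has 1, "c" has 5.
ids = ["a"] * 3 + ["b"] * 1 + ["c"] * 5
keep = speakers_in_range(ids, low=2, high=4)
kept_rows = [s for s in ids if s in keep]
print(keep, len(kept_rows))
```

With the real data, `keep = speakers_in_range(ds['speaker_id'])` followed by `ds.filter(lambda sid: sid in keep, input_columns=['speaker_id'])` should give the same rows while reading only the `speaker_id` column.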