# Speaker recognition: Pre-work statistics

This Jupyter notebook aims to give a better understanding of the dataset.

## Load or download the LibriSpeech material

According to the dataset README: "*LibriSpeech is a corpus of read speech, based on LibriVox's public domain audio books.*"

We use the "*train-clean-100*" subset for two reasons:
- It is the smallest training subset of LibriSpeech: about 100 hours of recordings from about 250 different speakers, which is **more than enough** for our project.
- Its audio files were automatically classified as "clean", that is, **without noise or other artifacts** that could complicate the training of our model.

In [None]:
import os
import torchaudio

# Create data folder if missing
os.makedirs("./data", exist_ok=True)

# Download the train-clean-100 archive and extract it into the data folder,
# or load it directly if it was previously downloaded
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100",
                                          download=not os.path.isdir("data/LibriSpeech"))

## Compute basic statistics

Basic statistics give a better picture of the dataset, which is especially useful when deciding on the train/validation/test split or on data augmentation.
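
The statistics cell that follows relies on `dmp.parse_pipe`, a project helper whose source is not shown in this notebook. As a rough idea of what such a parser might do, here is a minimal sketch. It assumes that `SPEAKERS.TXT` is pipe-delimited, that lines starting with `;` are comments (including the header), and that the columns are `ID`, `SEX`, `SUBSET`, `MINUTES` and `NAME`. The function name `parse_pipe_sketch` and the sample rows are made up for illustration.

```python
import io


def parse_pipe_sketch(lines):
    """Hypothetical stand-in for dmp.parse_pipe: parse pipe-delimited speaker rows."""
    # Assumed column order; the header itself is a ';' comment line and is skipped
    columns = ["ID", "SEX", "SUBSET", "MINUTES", "NAME"]
    rows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and ';' comment lines
        if not line or line.startswith(";"):
            continue
        # Split on the first 4 pipes only, so a '|' inside NAME stays intact
        fields = [f.strip() for f in line.split("|", maxsplit=len(columns) - 1)]
        if len(fields) != len(columns):
            continue  # ignore malformed rows
        row = dict(zip(columns, fields))
        row["ID"] = int(row["ID"])
        row["MINUTES"] = float(row["MINUTES"])
        rows.append(row)
    return rows


# Made-up sample in the assumed layout (not real LibriSpeech rows)
sample = """\
; comment line
;ID  |SEX| SUBSET           |MINUTES| NAME
1    | F | train-clean-100  | 20.50 | Speaker One
2    | M | dev-clean        |  8.00 | Speaker Two
3    | M | train-clean-100  | 25.25 | Speaker |Three
"""

rows = parse_pipe_sketch(io.StringIO(sample))
subset_rows = [r for r in rows if r["SUBSET"] == "train-clean-100"]
print(len(subset_rows), sum(r["MINUTES"] for r in subset_rows))
```

With pandas available, `pd.DataFrame(rows)` would turn the result into a frame with the same column names as those used in the next cell.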

In [None]:
import pandas as pd
import utils.dataset_metadata_parser as dmp

# Load speakers metadata and keep only those in train-clean-100
speaker_df = dmp.parse_pipe("data/LibriSpeech/SPEAKERS.TXT")
filtered_speaker_df = speaker_df[speaker_df["SUBSET"] == "train-clean-100"]

# Number of rows in the speaker table. SPEAKERS.TXT has one row per speaker,
# so this counts speaker entries, not individual audio files.
total_extracts = len(filtered_speaker_df)

# Total duration of the subset (in minutes)
total_duration = filtered_speaker_df["MINUTES"].sum()

# Average duration per row, i.e. per speaker (in minutes)
average_duration = filtered_speaker_df["MINUTES"].mean()

# Total number of unique speakers, identified by ID
total_speakers = filtered_speaker_df["ID"].nunique()

# Average number of rows per speaker (1.00 when every speaker has a single row)
average_extracts_per_speaker = total_extracts / total_speakers

# Number of male (M) and female (F) speakers
morf_number = filtered_speaker_df["SEX"].value_counts()

# Print the results
print(f"Total number of audio extracts: {total_extracts}")
print(f"Total duration of dataset (minutes): {total_duration:.2f}")
print(f"Average duration per extract (minutes): {average_duration:.2f}")
print(f"Total number of speakers: {total_speakers}")
print(f"Average number of extracts per speaker: {average_extracts_per_speaker:.2f}")
print(f"Number of M: {morf_number['M']}, and F:{morf_number['F']}")

Total number of audio extracts: 251
Total duration of dataset (minutes): 6035.41
Average duration per extract (minutes): 24.05
Total number of speakers: 251
Average number of extracts per speaker: 1.00
Number of M: 126, and F:125
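
`SPEAKERS.TXT` holds one row per speaker, which is why the extract count above equals the speaker count and the per-speaker average is 1.00. To count actual audio files per speaker, one option is to walk the extracted archive. The sketch below assumes the usual `<subset>/<speaker_id>/<chapter_id>/*.flac` directory layout; `count_utterances_per_speaker` is a hypothetical helper name, not part of the project's `utils` package.

```python
from collections import Counter
from pathlib import Path


def count_utterances_per_speaker(subset_dir):
    """Count .flac files per speaker, assuming <subset>/<speaker>/<chapter>/*.flac."""
    counts = Counter()
    for flac_path in Path(subset_dir).glob("*/*/*.flac"):
        # parents[0] is the chapter directory, parents[1] the speaker directory
        speaker_id = flac_path.parents[1].name
        counts[speaker_id] += 1
    return counts


# Only runs if the subset has already been downloaded and extracted
subset_dir = Path("data/LibriSpeech/train-clean-100")
if subset_dir.is_dir():
    counts = count_utterances_per_speaker(subset_dir)
    total = sum(counts.values())
    print(f"Utterances: {total}, speakers: {len(counts)}")
    if counts:
        print(f"Average utterances per speaker: {total / len(counts):.2f}")
```

The per-speaker counts could then inform a split that keeps some utterances of every speaker in each of the train, validation and test sets.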
