# Exploratory data analysis

This notebook is a starting point for exploratory data analysis (EDA) on the tiny-voxceleb data.

To download the data to your computer, you can use `$ROOT_PROJECT_DIR/scripts/download_data.sh`.
Make sure to unzip `$ROOT_PROJECT_DIR/data/data.zip` afterwards.
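
If you prefer to unzip from Python rather than from the shell, a minimal sketch with the standard-library `zipfile` module could look like this. The helper name `unzip_data` is hypothetical, and extracting into the `data` directory next to the archive is an assumption based on the paths used further down in this notebook.

```python
import pathlib
import zipfile


def unzip_data(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return the member names."""
    zip_path = pathlib.Path(zip_path)
    out_dir = pathlib.Path(out_dir)
    # Create the target directory if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Hypothetical usage, assuming this notebook sits one level below the project root:
# unzip_data("../data/data.zip", "../data")
```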

Jupyter notebooks and git don't play well together. If you decide to use notebooks in this project, we advise sticking to the following conventions:
1. A notebook belongs to one person, and only they edit the file (alternatively, only commit notebooks whose output has been cleared).
2. Use the naming convention `[date]_[initials]_[description].ipynb`.
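
The naming convention can be applied consistently with a small helper. This is only a sketch: the function name `notebook_name` is hypothetical, and both the ISO date format and the hyphenated description are assumptions, since the convention above does not prescribe them.

```python
import datetime
import re


def notebook_name(initials, description, date=None):
    """Build a file name of the form [date]_[initials]_[description].ipynb."""
    # Assumption: use today's date in ISO format (YYYY-MM-DD) unless one is given.
    date = date or datetime.date.today()
    # Assumption: lowercase the description and join its words with hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{date.isoformat()}_{initials}_{slug}.ipynb"


print(notebook_name("AB", "Exploratory data analysis", datetime.date(2022, 1, 31)))
# prints 2022-01-31_AB_exploratory-data-analysis.ipynb
```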

In [None]:
import pathlib
import numpy as np
import pandas as pd

# Load and explore the meta CSV file

If you have downloaded the data to your own computer, you should find the meta file at `../data/tiny-voxceleb/tiny_meta.csv`.
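
Once the file is loaded, a per-column summary is a reasonable first look. The sketch below uses a hypothetical helper `summarize_frame` and a made-up toy table; apart from `id`, the column names are placeholders and not taken from the real meta file.

```python
import pandas as pd


def summarize_frame(df):
    """Return one row per column: dtype, number of missing values, number of unique values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),  # missing values are not counted as a unique value
        }
    )


# Toy stand-in for the meta table; the rows and the "gender" column are made up.
toy_meta = pd.DataFrame({"id": ["id1", "id2", "id3"], "gender": ["f", "m", None]})
summary = summarize_frame(toy_meta)
print(summary)
```

The same call should work on the real table, for example `summarize_frame(df_meta)` after the cell below has run.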

In [None]:
meta_path = pathlib.Path("../data/tiny-voxceleb/tiny_meta.csv")
df_meta = pd.read_csv(meta_path)

In [None]:
df_meta

# Explore the train set

The train data (as single wav files) is located at `../data/tiny-voxceleb/train/wav/`.
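
Beyond the total number of files, it can be useful to know how the wav files are distributed over IDs. The sketch below assumes that every top-level entry under the `wav/` directory is a folder named after an ID, with the wav files somewhere below it; `count_wavs_per_id` is a hypothetical helper name.

```python
import pathlib
from collections import Counter


def count_wavs_per_id(wav_root):
    """Count the wav files found below each top-level entry of wav_root."""
    wav_root = pathlib.Path(wav_root)
    counts = Counter()
    for wav_file in wav_root.rglob("*.wav"):
        # Assumption: the first path component below wav_root is the ID.
        counts[wav_file.relative_to(wav_root).parts[0]] += 1
    return counts


# Hypothetical usage with the train path used in this notebook:
# count_wavs_per_id("../data/tiny-voxceleb/train/wav/").most_common(5)
```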

In [None]:
train_path = pathlib.Path("../data/tiny-voxceleb/train/wav/")

In [None]:
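# Names of the top-level entries under train/wav/, matched against the meta file's 'id' column below
train_ids = [f.name for f in train_path.iterdir()]

In [None]:
# Recursively collect every wav file in the train split
train_audio_files = list(train_path.rglob("*.wav"))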

In [None]:
len(train_ids)

In [None]:
len(train_audio_files)

In [None]:
df_train = df_meta[df_meta['id'].isin(train_ids)]

In [None]:
df_train

# Explore the validation set

The validation data (as single wav files) is located at `../data/tiny-voxceleb/val/wav/`.
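
When comparing the IDs of two splits, sets are more informative than a plain equality check, because they show which IDs are shared and which appear in only one split. A minimal sketch, with `compare_ids` as a hypothetical helper and made-up toy IDs:

```python
def compare_ids(ids_a, ids_b):
    """Split two ID collections into IDs unique to each side and IDs they share."""
    a, b = set(ids_a), set(ids_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Toy IDs, made up for illustration only.
result = compare_ids(["id1", "id2", "id3"], ["id3", "id2", "id4"])
print(result)
# prints {'only_a': ['id1'], 'only_b': ['id4'], 'shared': ['id2', 'id3']}
```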

In [None]:
val_path = pathlib.Path("../data/tiny-voxceleb/val/wav/")

In [None]:
val_ids = [f.name for f in val_path.iterdir()]

In [None]:
val_audio_files = list(val_path.rglob("*.wav"))

In [None]:
len(val_ids)

In [None]:
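# Compare as sets: iterdir() yields entries in arbitrary order, so list equality is unreliable
set(train_ids) == set(val_ids)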

In [None]:
len(val_audio_files)

In [None]:
df_val = df_meta[df_meta['id'].isin(val_ids)]

In [None]:
df_val

# Explore the dev set

The dev data (as single wav files) is located at `../data/tiny-voxceleb/dev/wav/`.
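
Another quantity you may want to look at is the duration of each recording. The sketch below reads only the wav header with the standard-library `wave` module. It assumes the files are uncompressed PCM wav files that `wave` can open, which has not been verified for this dataset; `wav_duration_seconds` is a hypothetical helper name.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM wav file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wav:
        # Duration = number of frames / frames per second.
        return wav.getnframes() / wav.getframerate()


# Hypothetical usage, once dev_audio_files has been collected in the cells below:
# durations = [wav_duration_seconds(f) for f in dev_audio_files]
```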

In [None]:
dev_path = pathlib.Path("../data/tiny-voxceleb/dev/wav/")

In [None]:
dev_ids = [f.name for f in dev_path.iterdir()]

In [None]:
dev_audio_files = list(dev_path.rglob("*.wav"))

In [None]:
len(dev_ids)

In [None]:
len(dev_audio_files)

In [None]:
df_dev = df_meta[df_meta['id'].isin(dev_ids)]

In [None]:
df_dev