# Data Splitting & Analysis of Common Voice 8.0
*By Dragoș Alexandru Bălan*



This Jupyter Notebook contains the code for splitting the data into the subsets used throughout my experiments, as well as the code for analyzing the distribution of data by sex and age in each split.

**IMPORTANT:** The experiments were conducted in a Linux environment. I have not tested the notebook on Windows or macOS, so no support is provided for those operating systems.

### Dependencies
First, install the dependencies required to run this notebook using the commands below. I recommend first creating and activating a virtual environment, then running the commands in that environment. You only need to run them once.

In [None]:
!pip install librosa
!pip install pandas
!pip install numpy
!pip install datasets

### Calculating recording lengths in train, test, and dev splits of Common Voice (Experiment 1)
I suggest using a local copy of the dataset, since splitting the data later on requires it. However, if you only want the train, test, and dev splits from Common Voice and would rather not download the entire dataset, you can use the Huggingface code instead. Be advised that using Huggingface requires you to create an account on their platform. More details about that can be found below.

#### Local dataset downloaded from Common Voice website

In [None]:
import pandas as pd
from librosa import load

# CHANGE THE PATH IF NEEDED! The relative path below assumes the notebook
# runs from the parent directory of the folder that contains the dataset
path = './cv-corpus-8.0-2022-01-19/fy-NL/'
df_train = pd.read_csv(path + 'train.tsv', sep='\t')
df_test = pd.read_csv(path + 'test.tsv', sep='\t')
df_dev = pd.read_csv(path + 'dev.tsv', sep='\t')

# Initialize an empty array to which we will append the
# length of each recording in seconds
lengths = []
for i, row in df_train.iterrows():
  # Load the audio from the path defined in the TSV
  aud, sr = load(path + 'clips/' + row['path'], sr=None)
  # Divide the length of the audio array by the sampling rate to
  # get the number of seconds of the recording.
  lengths.append(len(aud) / sr)
# Lastly, assign the calculated array of lengths in seconds to
# a new column named 'length'
df_train['length'] = lengths

lengths = []
for i, row in df_test.iterrows():
  aud, sr = load(path + 'clips/' + row['path'], sr=None)
  lengths.append(len(aud) / sr)

df_test['length'] = lengths

lengths = []
for i, row in df_dev.iterrows():
  aud, sr = load(path + 'clips/' + row['path'], sr=None)
  lengths.append(len(aud) / sr)

df_dev['length'] = lengths

# Save the new dataframes that contain an extra 'length' column as CSVs
# No index required since the filename can act as an index
df_train.to_csv(path + 'train_len.csv', index=False)
df_test.to_csv(path + 'test_len.csv', index=False)
df_dev.to_csv(path + 'dev_len.csv', index=False)
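
The three loops above differ only in the dataframe they iterate over, so they can be folded into one helper. The following is a minimal sketch, not part of the original notebook: `add_length_column` and `fake_load` are hypothetical names, and the stand-in loader only exists so the example runs without audio files. In practice you would pass something like `lambda p: load(p, sr=None)` from librosa.

```python
import pandas as pd

def add_length_column(df, clip_dir, load_fn):
    """Return a copy of df with a 'length' column holding each clip's duration in seconds.

    load_fn is any callable that takes a file path and returns
    (samples, sampling_rate), mirroring librosa.load.
    """
    lengths = []
    for rel_path in df['path']:
        samples, sr = load_fn(clip_dir + rel_path)
        # Number of samples divided by the sampling rate gives seconds
        lengths.append(len(samples) / sr)
    out = df.copy()
    out['length'] = lengths
    return out

# Hypothetical stand-in loader: pretends every clip has 32000 samples at 16 kHz
def fake_load(_path):
    return [0.0] * 32000, 16000

demo = pd.DataFrame({'path': ['a.mp3', 'b.mp3']})
demo_len = add_length_column(demo, './clips/', fake_load)
print(demo_len)
```

The same helper could then be called once per split instead of repeating the loop.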

#### Dataset from Huggingface

In case you don't have the dataset cached already, you will need to log in to Huggingface so that an authentication token can be stored locally. First, register for an account here: https://huggingface.co/join

After creating an account, you will need an access token in order to log in to Huggingface from this notebook. You can create one by clicking the profile picture icon on the top-right of the website -> Settings -> Access Tokens. There, click on 'New token'. Input any name you want but **make sure to set the role to 'write'**.

Once your access token is generated, run the cell below, copy the access token, and paste it when requested.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Now that you are logged in, you can download and cache the dataset and its splits by running the cell below.

In [None]:
from datasets import load_dataset

# Load the Common Voice 8.0 splits from Huggingface. I recommend keeping this in a separate
# cell since loading datasets from Huggingface can be quite slow. After loading, the datasets
# can be used and modified throughout the notebook
cv_train = load_dataset("mozilla-foundation/common_voice_8_0", "fy-NL", split="train", use_auth_token=True)
cv_test = load_dataset("mozilla-foundation/common_voice_8_0", "fy-NL", split="test", use_auth_token=True)
cv_dev = load_dataset("mozilla-foundation/common_voice_8_0", "fy-NL", split="validation", use_auth_token=True)

In [None]:
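import os
import pandas as pd

# Output folder for the CSVs; it is created here in case the
# dataset was not downloaded locally
path = './cv-corpus-8.0-2022-01-19/fy-NL/'
os.makedirs(path, exist_ok=True)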
df_train = pd.DataFrame(cv_train)
df_test = pd.DataFrame(cv_test)
df_dev = pd.DataFrame(cv_dev)

lengths = []
for i, row in df_train.iterrows():
  # Calculate the length of the recording in seconds by dividing
  # the length of the audio array by the sampling rate
  lengths.append(len(row['audio']['array']) / row['audio']['sampling_rate'])
# Lastly, assign the calculated lengths in seconds to a new column
# named 'length'
df_train['length'] = lengths

lengths = []
for i, row in df_test.iterrows():
  lengths.append(len(row['audio']['array']) / row['audio']['sampling_rate'])

df_test['length'] = lengths

lengths = []
for i, row in df_dev.iterrows():
  lengths.append(len(row['audio']['array']) / row['audio']['sampling_rate'])

df_dev['length'] = lengths

df_train.to_csv(path + 'train_len.csv', index=False)
df_test.to_csv(path + 'test_len.csv', index=False)
df_dev.to_csv(path + 'dev_len.csv', index=False)
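
Each entry in the `audio` column returned by Huggingface `datasets` is a dictionary that holds the decoded samples under `array` and the sampling rate under `sampling_rate`, so the duration in seconds is the number of samples divided by the sampling rate. A small sketch with a hand-made entry (the helper name `audio_length_seconds` and the example values are illustrative assumptions, not real Common Voice data):

```python
def audio_length_seconds(audio):
    """Duration in seconds of one audio entry with 'array' and 'sampling_rate' keys."""
    return len(audio['array']) / audio['sampling_rate']

# Made-up entry: 96000 samples at 48 kHz
example = {
    'path': 'clip.mp3',
    'array': [0.0] * 96000,
    'sampling_rate': 48000,
}
length = audio_length_seconds(example)
print(length)
```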

### Calculating speaker statistics based on sex and age from train, test, and dev

In [None]:
import pandas as pd
import numpy as np

path = './cv-corpus-8.0-2022-01-19/fy-NL/'
# Load the CSVs saved above, which contain the extra 'length' column
df_train = pd.read_csv(path + 'train_len.csv')
df_test = pd.read_csv(path + 'test_len.csv')
df_dev = pd.read_csv(path + 'dev_len.csv')
# Replace undefined values in the dataframes with string 'unknown'
df_train = df_train.replace(np.nan, 'unknown', regex=True)
df_test = df_test.replace(np.nan, 'unknown', regex=True)
df_dev = df_dev.replace(np.nan, 'unknown', regex=True)

# Create the dictionary with keys based on age and sex (column name is 'gender')
len_test = {g:{el:0 for el in df_test['age'].unique()} for g in df_test['gender'].unique()}
for i, row in df_test.iterrows():
  # For each recording, add the length to the corresponding sum of sex and age combination
  len_test[row['gender']][row['age']] += row['length']
# Create a dataframe out of the statistics calculated
df = pd.DataFrame(len_test)
# Save the dataframe to a CSV
df.to_csv(path + 'test_stats.csv')


len_dev = {g:{el:0 for el in df_dev['age'].unique()} for g in df_dev['gender'].unique()}
for i, row in df_dev.iterrows():
  len_dev[row['gender']][row['age']] += row['length']
df = pd.DataFrame(len_dev)
df.to_csv(path + 'dev_stats.csv')


len_train = {g:{el:0 for el in df_train['age'].unique()} for g in df_train['gender'].unique()}
for i, row in df_train.iterrows():
  len_train[row['gender']][row['age']] += row['length']
df = pd.DataFrame(len_train)
df.to_csv(path + 'train_stats.csv')
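
The nested dictionaries above sum the recording lengths for each sex and age combination. As an alternative sketch, assuming the same `gender`, `age`, and `length` columns, pandas' `pivot_table` can build the same kind of table in one call. The toy dataframe below is made up for illustration and is not Common Voice data.

```python
import pandas as pd

# Made-up rows; None stands for speakers who did not report sex or age
demo = pd.DataFrame({
    'gender': ['male', 'female', 'male', None],
    'age': ['twenties', 'thirties', 'twenties', None],
    'length': [2.0, 3.0, 4.0, 1.5],
})
# Label missing demographic values as 'unknown'
demo[['gender', 'age']] = demo[['gender', 'age']].fillna('unknown')

# Rows are age groups, columns are sexes, values are summed seconds;
# combinations that never occur are filled with 0
stats = demo.pivot_table(index='age', columns='gender', values='length',
                         aggfunc='sum', fill_value=0)
print(stats)
```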

## Experiments 2, 6, 7 train dataset

This dataset contains all of the validated data from Common Voice 8.0 (~50h) except for the data found in the test and dev splits above. It adds up to ~41h, making it the largest train dataset used in my experiments.

### Extracting the train set

In [None]:
import pandas as pd

path = './cv-corpus-8.0-2022-01-19/fy-NL/'
# Load the TSV files corresponding to all of the validated data (validated.tsv),
# the test split (test.tsv), and the dev split (dev.tsv)
df_large = pd.read_csv(path + 'validated.tsv', sep='\t')
df_test = pd.read_csv(path + 'test.tsv', sep='\t')
df_dev = pd.read_csv(path + 'dev.tsv', sep='\t')

print('Number of entries before removing any data: ' + str(len(df_large)))

# Remove the test entries from validated: after an outer merge with
# indicator=True, rows found in both frames are marked 'both' and dropped
df_large = pd.merge(df_large, df_test, on=list(df_test.columns), how='outer', indicator=True)\
       .query("_merge != 'both'")\
       .drop('_merge', axis=1)\
       .reset_index(drop=True)

print('After removing data from test: ' + str(len(df_large)))
# Remove the dev entries from the remainder (validated minus test)
df_large = pd.merge(df_large, df_dev, on=list(df_dev.columns), how='outer', indicator=True)\
       .query("_merge != 'both'")\
       .drop('_merge', axis=1)\
       .reset_index(drop=True)
    
print('After removing data from test + dev (the resulting dataset): ' + str(len(df_large)))
# MAKE SURE TO CHANGE THE PATH. You must insert an absolute path here so that, 
# during training, the XLS-R model will know where to look for the audio
df_large['path'] = '/scratch/s3944867/cv-corpus-8.0-2022-01-19/fy-NL/clips/' + df_large['path']
df_large['sentence'] = '"' + df_large['sentence'] + '"'
# Save the resulting dataset
df_large.to_csv(path + 'train_large.csv', index=False)
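
The outer merge with `indicator=True` works as an anti-join: rows that occur in both frames are tagged `both` and dropped. Since the clip filename in `path` identifies each recording, a shorter way to express the same idea is to filter on `path` directly. This is only a sketch under that assumption, with made-up rows standing in for the TSV files:

```python
import pandas as pd

# Made-up stand-ins for validated.tsv, test.tsv, and dev.tsv
validated = pd.DataFrame({
    'path': ['a.mp3', 'b.mp3', 'c.mp3', 'd.mp3'],
    'sentence': ['one', 'two', 'three', 'four'],
})
test = validated.iloc[[1]]
dev = validated.iloc[[3]]

# Collect the filenames that must not end up in the train set
held_out = set(test['path']) | set(dev['path'])

# Keep only the validated rows whose filename is not held out
train_large = validated[~validated['path'].isin(held_out)].reset_index(drop=True)
print(len(validated), len(train_large))
```

Such a filter has to run before the absolute directory is prepended to `path`, since the filenames in the three files no longer match after that step.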

### Calculating length of each recording

**IMPORTANT:** This will take a while.

In [None]:
import pandas as pd
from librosa import load

path = './cv-corpus-8.0-2022-01-19/fy-NL/'
df_train = pd.read_csv(path + 'train_large.csv')

lengths = []
for i, row in df_train.iterrows():
  aud, sr = load(row['path'], sr=None)
  lengths.append(len(aud) / sr)

df_train['length'] = lengths
df_train.to_csv(path + 'train_large_len.csv', index=False)

### Statistics about speakers

In [None]:
import pandas as pd
import numpy as np

path = './cv-corpus-8.0-2022-01-19/fy-NL/'
df_train = pd.read_csv(path + 'train_large_len.csv')

df_train = df_train.replace(np.nan, 'unknown', regex=True)

len_train = {g:{el:0 for el in df_train['age'].unique()} for g in df_train['gender'].unique()}
for i, row in df_train.iterrows():
  len_train[row['gender']][row['age']] += row['length']
df = pd.DataFrame(len_train)
df.to_csv(path + 'train_large_stats.csv')

## Experiments 3, 4, 5 datasets (10h, 1h, 10m splits)

### Splitting the dataset into 10h, 1h, and 10m

Run all the cells below in order so that no issues with undefined variables or packages are encountered.

In [None]:
import pandas as pd
import numpy as np

path = './cv-corpus-8.0-2022-01-19/fy-NL/'
df = pd.read_csv(path + 'train_large_len.csv')
# Randomly sample data such that it adds up to ~10 min
df_10m = df.sample(frac=0.0041) # corresponds to ~10 min
# Value should be very close to 600 (60s * 10m)
print(np.sum(df_10m['length']))
# Remove the 'length' column so that the file can be used during training
df_10m = df_10m.drop(['length'], axis=1)

df_10m.to_csv(path + 'train_10m.csv', index=False)

In [None]:
df_1h = df.sample(frac=0.0246) # corresponds to ~1 h
# Value should be very close to 3600 (3600s in an hour)
print(np.sum(df_1h['length']))
df_1h = df_1h.drop(['length'], axis=1)

df_1h.to_csv(path + 'train_1h.csv', index=False)

In [None]:
df_10h = df.sample(frac=0.244) # corresponds to ~10 h
# Value should be very close to 36000 (3600s * 10h)
print(np.sum(df_10h['length']))
df_10h = df_10h.drop(['length'], axis=1)

df_10h.to_csv(path + 'train_10h.csv', index=False)
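
The fractions passed to `sample` follow from dividing the target duration by the total duration of the source dataset; for instance, 600 s out of roughly 41 h (147,600 s) gives about 0.0041. The sketch below derives the fraction from the data instead of hard-coding it and fixes `random_state` so that the split is reproducible. `sample_to_duration` and the toy dataframe are hypothetical, and because sampling is done by number of rows, the total duration only approximates the target when clip lengths vary.

```python
import pandas as pd

def sample_to_duration(df, target_seconds, seed=0):
    """Randomly sample rows so their summed 'length' is roughly target_seconds."""
    # Fraction of rows to keep, capped at 1.0 for targets longer than the data
    frac = min(1.0, target_seconds / df['length'].sum())
    return df.sample(frac=frac, random_state=seed)

# Toy data: 1000 clips of 5 seconds each (5000 seconds in total)
demo = pd.DataFrame({
    'path': [f'clip_{i}.mp3' for i in range(1000)],
    'length': [5.0] * 1000,
})
subset = sample_to_duration(demo, target_seconds=600)
print(len(subset), subset['length'].sum())
```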

### Calculating lengths & Analyzing the 10h, 1h, and 10m train splits

In [None]:
import pandas as pd
import numpy as np
from librosa import load

path = './cv-corpus-8.0-2022-01-19/fy-NL/'
df_10m = pd.read_csv(path + 'train_10m.csv')

lengths = []
for i, row in df_10m.iterrows():
  aud, sr = load(row['path'], sr=None)
  lengths.append(len(aud) / sr)

df_10m['length'] = lengths
df_10m = df_10m.replace(np.nan, 'unknown', regex=True)


len_10m = {g:{el:0 for el in df_10m['age'].unique()} for g in df_10m['gender'].unique()}
for i, row in df_10m.iterrows():
  len_10m[row['gender']][row['age']] += row['length']
df = pd.DataFrame(len_10m)
df_sum = df.sum(numeric_only = True)
df.to_csv(path + 'train_10m_stats.csv')


df_1h = pd.read_csv(path + 'train_1h.csv')

lengths = []
for i, row in df_1h.iterrows():
  aud, sr = load(row['path'], sr=None)
  lengths.append(len(aud) / sr)

df_1h['length'] = lengths
df_1h = df_1h.replace(np.nan, 'unknown', regex=True)


len_1h = {g:{el:0 for el in df_1h['age'].unique()} for g in df_1h['gender'].unique()}
for i, row in df_1h.iterrows():
  len_1h[row['gender']][row['age']] += row['length']
df = pd.DataFrame(len_1h)
df_sum = df.sum(numeric_only = True)
df.to_csv(path + 'train_1h_stats.csv')


df_10h = pd.read_csv(path + 'train_10h.csv')

lengths = []
for i, row in df_10h.iterrows():
  aud, sr = load(row['path'], sr=None)
  lengths.append(len(aud) / sr)

df_10h['length'] = lengths
df_10h = df_10h.replace(np.nan, 'unknown', regex=True)


len_10h = {g:{el:0 for el in df_10h['age'].unique()} for g in df_10h['gender'].unique()}
for i, row in df_10h.iterrows():
  len_10h[row['gender']][row['age']] += row['length']
df = pd.DataFrame(len_10h)
df_sum = df.sum(numeric_only = True)
df.to_csv(path + 'train_10h_stats.csv')