# Impact of Gender on Age Classification of Voice Recordings
### Final Exam for Machine Learning and Deep Learning [CDSCO2004U]
#### Daniel Henke, Jakob Hren, Heinrich Hegenbarth

**Disclaimer:** This notebook is very long and computationally expensive. So that the core parts can be run quickly, many outputs are hidden behind **global boolean variables** that can be set to `True` or `False`. Please set them according to your needs.

In [None]:
DATA_CONVERSION=False

EDA_AND_VISUALIZATION=False

GRID_SEARCH=False

MULTI_CLASSIFICATION=False

## Imports & Installations

We begin with possibly required installs.

In [None]:
# !pip install librosa
# !pip install playsound
# !pip install tqdm
# !pip install scikit-learn

Here are all the imports needed for this notebook:

In [None]:
import numpy as np
import pandas as pd
import librosa
from playsound import playsound
from tqdm import tqdm
import concurrent.futures
import tarfile
import os

import matplotlib.pyplot as plt
import seaborn as sns
from prettytable import PrettyTable
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer   
from sklearn.compose import make_column_selector
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.base import clone
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    roc_auc_score, log_loss, RocCurveDisplay
)
from sklearn.calibration import calibration_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample
from sklearn.multioutput import MultiOutputClassifier

## Data Conversion

**DISCLAIMER: This section can be skipped; the exported data is available below.**

To begin, we need to transform our data. We downloaded the raw mp3 files as tar.gz archives from [Mozilla Common Voice](https://commonvoice.mozilla.org/en/datasets). Our data includes the entire March 19 corpus for English, Spanish, French, German, Danish, and Swedish. The first step is to decompress all .gz files. Ideally, one would also fully unpack the resulting .tar files; however, because we need only very few audio files from each archive, we implemented a .tar extraction function instead.
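
As an illustration of the idea of pulling only a few clips out of a large archive, here is a minimal sketch based on the standard-library `tarfile` module. It is not the extraction function used in this notebook: the helper name `extract_selected_members` and its parameters are hypothetical, and it assumes that clips can be identified by their base file name.

```python
import tarfile
from pathlib import Path


def extract_selected_members(tar_path, wanted_names, out_dir):
    """Extract only the archive members whose base file name is in wanted_names.

    Hypothetical helper for illustration; not the notebook's own function.
    Returns the list of file names that were written to out_dir.
    """
    wanted = set(wanted_names)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(tar_path, "r") as tar:
        # Iterating over the TarFile reads the member headers one by one,
        # so nothing is unpacked unless it was requested.
        for member in tar:
            name = Path(member.name).name
            if member.isfile() and name in wanted:
                source = tar.extractfile(member)
                # Write by base name, which flattens archive-internal folders.
                (out_dir / name).write_bytes(source.read())
                extracted.append(name)
                if len(extracted) == len(wanted):
                    break  # every requested clip was found, stop scanning
    return extracted
```

Called with the list of file names that survive the filtering below, such a helper would write only those clips to `out_dir` and leave the rest of the archive untouched.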

### Overview & Filtering of Voice Files

In [None]:
def prepare_overview(list_of_files, folder_path=None):
    """
    Read a list of tab-separated metadata files and combine them into one dataframe for audio file conversion.

    list_of_files: list of TSV file names (or full paths if folder_path is None)
    folder_path: path to the folder where the files are located
    Returns the combined dataframe, or None if list_of_files is empty.
    """
    frames = []
    for file_name in list_of_files:
        file_path = f"{folder_path}/{file_name}" if folder_path else file_name
        # sentence_domain is read as a string to avoid mixed-type inference
        frames.append(pd.read_csv(file_path, sep="\t", dtype={'sentence_domain': str}, low_memory=False))
    if not frames:
        return None
    # a single concat at the end avoids repeatedly copying the growing dataframe
    return pd.concat(frames, ignore_index=True)

In [None]:
if DATA_CONVERSION:
    list_of_files = ["validated.tsv", "other.tsv"]
    folder_path_danish = "./data/cv-corpus-21.0-2025-03-14/da"
    folder_path_swedish = "./data/cv-corpus-21.0-2025-03-14/sv-SE"
    folder_path_german = "./data/cv-corpus-21.0-2025-03-14/de"
    folder_path_french = "./data/cv-corpus-21.0-2025-03-14/fr"
    folder_path_spanish = "./data/cv-corpus-21.0-2025-03-14/es"
    folder_path_english = "./data/cv-corpus-21.0-2025-03-14/en"
    # only one language folder is processed per run; here English is selected
    folder_path = folder_path_english
    overview = prepare_overview(list_of_files, folder_path)
    overview.info()
    # display() is needed: a bare expression inside an if-block is not rendered by the notebook
    display(overview.describe(include="all"))
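
The cell above processes one language folder per run. As a hedged sketch of how several languages could be loaded in one pass, the helper below reads the same TSV files for a list of locales and tags each row with its origin. The function name `load_language_overviews` and the `source_locale` column are hypothetical and not part of this notebook; the layout `<corpus_root>/<locale>/<file>.tsv` is assumed from the folder paths above.

```python
from pathlib import Path

import pandas as pd


def load_language_overviews(corpus_root, locales, tsv_names=("validated.tsv", "other.tsv")):
    """Read the listed TSV files for every locale into one dataframe.

    Hypothetical helper; assumes the layout <corpus_root>/<locale>/<tsv_name>.
    """
    frames = []
    for locale in locales:
        for tsv_name in tsv_names:
            tsv_path = Path(corpus_root) / locale / tsv_name
            if not tsv_path.exists():
                continue  # skip locales or files that were not downloaded
            frame = pd.read_csv(tsv_path, sep="\t", dtype={"sentence_domain": str}, low_memory=False)
            frame["source_locale"] = locale  # remember which folder the row came from
            frames.append(frame)
    # return an empty dataframe rather than failing when nothing was found
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

For example, `load_language_overviews("./data/cv-corpus-21.0-2025-03-14", ["da", "sv-SE"])` would return the Danish and Swedish metadata in a single dataframe.
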

In [None]:
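if DATA_CONVERSION:
    # guarded because overview only exists when the conversion section is run
    display(overview.groupby("gender").size())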

In [None]:
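def fix_gender(gender):
    """
    Map the short labels "male" and "female" onto "male_masculine" and "female_feminine"
    so that both spellings are treated as one category. Other values are returned unchanged.
    """
    if gender == "male":
        return "male_masculine"
    elif gender == "female":
        return "female_feminine"
    else:
        return gender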


def preprocess_overview(overview, f_path):
    """
    Filter and clean the overview dataframe for audio file conversion.

    overview: dataframe returned by prepare_overview
    f_path: folder that contains the audio clips; prepended to each file name in "path"
    """
    # keep only clips that have a gender label
    overview = overview.dropna(subset=["gender"])
    # drop columns that are irrelevant for our analysis
    overview = overview.drop(columns=["variant", "segment", "sentence_id", "up_votes", "down_votes"])
    # unify the gender labels
    overview["gender"] = overview["gender"].apply(fix_gender)
    # keep only clips labelled male or female;
    # this is not a political statement: we simply do not have enough data for other gender categories
    overview = overview[overview["gender"].isin(["female_feminine", "male_masculine"])]
    # limit each client_id to a maximum of 5 randomly chosen clips
    overview = overview.groupby("client_id").apply(lambda group: group.sample(n=min(len(group), 5), random_state=42)).reset_index(drop=True)
    # prefix each file name with the folder that holds the audio files
    overview["path"] = overview["path"].apply(lambda x: f"{f_path}/{x}")
    return overview.reset_index(drop=True)
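
The per-speaker cap can also be expressed without `groupby(...).apply(...)`. The sketch below uses a different technique, shuffling all rows once and then keeping the first rows of each speaker with `groupby(...).head(...)`. It is an alternative for illustration only: `cap_clips_per_speaker` and the toy data are hypothetical, and for the same seed it will generally pick different clips than `preprocess_overview`.

```python
import pandas as pd


def cap_clips_per_speaker(frame, max_clips=5, seed=42):
    """Keep at most max_clips randomly chosen rows per client_id (hypothetical helper)."""
    shuffled = frame.sample(frac=1, random_state=seed)  # random row order
    capped = shuffled.groupby("client_id").head(max_clips)  # first max_clips rows of each speaker
    return capped.reset_index(drop=True)


# Toy data: speaker "a" has 8 clips, speaker "b" has 3.
toy = pd.DataFrame({
    "client_id": ["a"] * 8 + ["b"] * 3,
    "path": [f"clip_{i}.mp3" for i in range(11)],
    "gender": ["female_feminine"] * 8 + ["male_masculine"] * 3,
})
capped = cap_clips_per_speaker(toy)
print(capped.groupby("client_id").size())  # speaker "a" is cut to 5 clips, "b" keeps all 3
```

Capping the number of clips per speaker keeps a few very active contributors from dominating the dataset, which is the purpose of the 5-clip limit above.
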

In [None]:
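# On our device, all audio files are in the same "clips" folder. Please change the path if your files are in a different location.
if DATA_CONVERSION:
    overview = preprocess_overview(overview, "clips")
    # display() is needed: a bare expression inside an if-block is not rendered by the notebook
    display(overview.head())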

In [None]:
if DATA_CONVERSION:
    display(overview.describe(include="all"))

In [None]:
if DATA_CONVERSION:
    display(overview.groupby("gender").size())

In [None]:
if DATA_CONVERSION:
    display(overview.groupby("age").size())