# Producing Tone, Age, and Gender Labels Using NBER SI 2022 Methods Lectures

In this tutorial, we show how to produce tone, age, and gender labels for long lectures, using the NBER SI 2022 Methods Lectures as an example. These videos are publicly available and downloadable on YouTube. To produce the labels, we need to

1) split each lecture into short, fixed-length segments (splits), 
2) extract audio features from the splits, 
3) feed the audio features into our trained models to generate labels, and 
4) impute each speaker's gender and select the matching gender-specific imputation results.

In [1]:
import ast
import random
import statistics
import librosa
import numpy as np
import pandas as pd
from pydub import AudioSegment
from pydub.utils import make_chunks
from helpers import evaluate, choose_gender_val

## Splitting the Data and Extracting Audio Features

The example below uses two separate NBER 2022 Methods Lectures by [Jiaying Gu](https://www.youtube.com/watch?v=z0cdgtcWs0c&t=1854s&pp=ygUUamlheWluZyBndSBuYmVyIDIwMjI%3D) and [Christopher Walters](https://www.youtube.com/watch?v=SdWt8xAAF_4&pp=ygUYY2hyaXN0IHdhbHRlcnMgbmJlciAyMDIy). We split each lecture into five-second segments, and compute a series of audio features as in [Handlan and Sheng (2023)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4316513).
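
To make the splitting step concrete, here is a minimal pure-Python sketch of the arithmetic involved. It is not the pipeline's actual code: the function names `split_samples` and `to_float` are hypothetical, and the sketch assumes 16-bit mono samples and that a shorter trailing chunk is kept (which, as far as we understand, is how pydub's `make_chunks` behaves). The feature dimension assumes librosa's default 12 chroma bins and 128 mel bands.

```python
import math

SAMPLE_RATE = 16000      # Hz, same value as `sample_rate` in the tutorial
CHUNK_LENGTH_MS = 5000   # five-second splits


def split_samples(samples, sample_rate=SAMPLE_RATE, chunk_length_ms=CHUNK_LENGTH_MS):
    """Split a flat sequence of mono samples into fixed-length chunks.

    The last chunk is kept even when it is shorter than the others.
    """
    chunk_size = sample_rate * chunk_length_ms // 1000
    n_chunks = math.ceil(len(samples) / chunk_size)
    return [samples[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]


def to_float(chunk, full_scale=32768):
    """Scale 16-bit integer samples to floats in [-1, 1)."""
    return [s / full_scale for s in chunk]


# Twelve seconds of dummy audio yields two full chunks and one partial chunk.
dummy = [0] * (12 * SAMPLE_RATE)
chunk_lengths = [len(c) for c in split_samples(dummy)]

# Each split is summarized by time-averaged MFCC, chroma, and mel features.
FEATURE_DIM = 40 + 12 + 128
```

Under these assumptions, every full split contains 80,000 samples and is reduced to a single 180-dimensional feature vector.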

In [2]:
# Speaker last names, which are also the audio file names (without the extension).
names = ['Gu', 'Walters']

In [3]:
feature_list = []
sample_rate = 16000
chunk_length_ms = 1000*5
for name in names:
    features = {}
    sound = AudioSegment.from_file(f'../data/NBER/{name}.mp3')
    sound = sound.set_channels(1)
    sound = sound.set_frame_rate(sample_rate)
    # split the audio into five-second chunks
    chunks = make_chunks(sound, chunk_length_ms)
    for i, chunk in enumerate(chunks):
        samples = chunk.get_array_of_samples()
        # convert the chunk to a float array scaled to [-1, 1) (assumes 16-bit samples)
        X = np.array(samples).astype(np.float32)/32768
        # extract the audio features: time-averaged MFCCs, chroma, and mel spectrogram
        stft = np.abs(librosa.stft(X))
        result = np.array([])
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40, fmax=8000).T, axis=0)
        result = np.hstack((result, mfccs))
        chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma))
        mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate, fmax=8000).T, axis=0)
        result = np.hstack((result, mel))
        features[i] = result
    file_feature = pd.DataFrame.from_dict(features, orient='index').reset_index()
    file_feature.rename(columns={'index': 'split'}, inplace=True)
    file_feature.loc[:, 'speaker'] = name
    feature_list.append(file_feature)

We can now concatenate the computed features. The table below lists the speaker and split index for every row, from which the number of splits per speaker can be read off:

In [4]:
feature_df = pd.concat(feature_list)
meta = feature_df[['speaker', 'split']].reset_index(drop=True)
# use a plain ndarray; np.matrix is no longer recommended by NumPy
features = feature_df.drop(columns=['speaker', 'split']).to_numpy()
meta

Unnamed: 0,speaker,split
0,Gu,0
1,Gu,1
2,Gu,2
3,Gu,3
4,Gu,4
...,...,...
1770,Walters,992
1771,Walters,993
1772,Walters,994
1773,Walters,995
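
As a small illustration of how the per-speaker counts could be tallied, the sketch below uses toy `(speaker, split)` records in place of the real `meta` rows. The records and variable names are made up for the example; only the five-second split length comes from the tutorial.

```python
from collections import Counter

# Toy (speaker, split) records standing in for the rows of `meta`.
records = [("Gu", 0), ("Gu", 1), ("Gu", 2), ("Walters", 0), ("Walters", 1)]

# Number of splits per speaker.
splits_per_speaker = Counter(speaker for speaker, _ in records)

# Approximate audio length per speaker, at five seconds per split.
seconds_per_speaker = {s: n * 5 for s, n in splits_per_speaker.items()}
```

On the actual DataFrame, `meta.groupby('speaker').size()` gives the same kind of count directly.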


The lectures are uninterrupted monologues. To run our algorithm on back-and-forth conversations, we recommend first splitting the audio files into utterances (continuous, uninterrupted speech segments by a single speaker). This can be done using [Trint](https://www.trint.com) or a speaker diarization tool (such as the [pyannote](https://huggingface.co/pyannote/speaker-diarization) library).
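
The sketch below shows one way diarization output might be grouped into utterances before feature extraction. It does not use Trint's or pyannote's API: the `(start, end, speaker)` tuple format, the function name `merge_into_utterances`, and the `max_gap` threshold are all assumptions made for illustration.

```python
def merge_into_utterances(segments, max_gap=0.5):
    """Merge consecutive segments by the same speaker into utterances.

    segments: iterable of (start_seconds, end_seconds, speaker) tuples.
    max_gap:  hypothetical threshold; pauses longer than this start a new utterance.
    """
    utterances = []
    for start, end, speaker in sorted(segments):
        same_speaker = bool(utterances) and utterances[-1][2] == speaker
        if same_speaker and start - utterances[-1][1] <= max_gap:
            prev_start, prev_end, _ = utterances[-1]
            # extend the previous utterance instead of starting a new one
            utterances[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            utterances.append((start, end, speaker))
    return utterances


# Made-up diarization output for a two-person exchange.
segments = [
    (0.0, 2.1, "A"), (2.3, 4.0, "A"),
    (4.2, 6.0, "B"), (6.1, 7.5, "B"),
    (9.0, 10.0, "A"),
]
utterances = merge_into_utterances(segments)
```

Each resulting utterance could then be cut out of the audio and passed through the same feature extraction as a lecture split.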

## Imputation

The following code snippets show how one can use our models to impute speaker gender and tone. For gender classification, we have five models trained on five splits of the data. For age and tone, we have gender-specific models trained on five splits of data for each gender. 
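
Because each label comes from five models, their predictions have to be combined into one value per split. The following standalone sketch illustrates the kind of aggregation used in this tutorial: averaging class probabilities for tone, and taking the modal label (with ties broken at random) for discrete outputs such as gender. The function names and toy inputs are hypothetical.

```python
import random
import statistics


def average_probabilities(per_model_probs):
    """Average class probabilities across models.

    per_model_probs: list of {class_name: probability} dicts, one per model.
    """
    classes = per_model_probs[0].keys()
    return {c: statistics.fmean(p[c] for p in per_model_probs) for c in classes}


def modal_label(per_model_labels, rng=random):
    """Return the most common label; ties are broken at random."""
    return rng.choice(statistics.multimode(per_model_labels))


# Toy predictions for one split.
tone = average_probabilities([
    {"happy": 0.25, "sad": 0.75},
    {"happy": 0.75, "sad": 0.25},
])
gender = modal_label(["Male", "Female", "Female", "Female", "Male"])
```

Averaging probabilities keeps the full tone distribution for each split, whereas a modal vote returns a single category.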

In [5]:
X = np.expand_dims(features, axis=2)
models_dir = "../model/"
# The complete list of model names is:
# ["gender_both", "emotion_Female", "emotion_Male", "age_Female", "age_Male"]
# The code below imputes only tone and gender labels (the age models are skipped)
models = ["emotion_Female", "emotion_Male", "gender_both"]
classification = meta
for model in models:
    m_dir = models_dir + model
    predictions = {}
    hyperparams = pd.read_csv(f'{m_dir}/hyperparams.csv')
    outcome = hyperparams['Label'][0]
    classes = ast.literal_eval(hyperparams['Classes'][0])
    gender = hyperparams['CNN_gender'][0]
    # adding prefixes for column names to indicate model used
    if outcome != 'gender':
        gender_pre = gender + '_'
    else:
        gender_pre = ''
    model_dir = f'{m_dir}/saved_models/'
    # check if output is a continuous variable (such as age)
    if 'continuous' in hyperparams['Classes'][0]:
        continuous = True
    else:
        continuous = False
    # iterate through the models to acquire imputation results
    for i in range(len(hyperparams['Iteration Number'])):
        print(f"Evaluating conference data based on {outcome}_{gender}_{i} model")
        model_name = f'cnn_{i}.h5'
        json_name = f'model_{i}.json'
        lb_path = f'{m_dir}/classes.npy'
        predictions[gender_pre + outcome + f'_{i}'] = evaluate(model_dir, model_name, json_name, lb_path, X, continuous, outcome, i).add_prefix(gender_pre)
        print(f"Evaluating conference data based on {outcome}_{gender}_{i} model")
    classification = pd.concat([classification] + list(predictions.values()), axis=1).sort_values(by=['speaker', 'split'])
    if outcome == 'emotion':
        labels = {emotion:[gender_pre + emotion + f"_{i}" for i in range(len(hyperparams['Iteration Number']))] for emotion in classes}
        for emotion in labels.keys():
            classification.loc[:, gender_pre + emotion] = classification[labels[emotion]].mean(axis=1)
    elif continuous:
        # take the mean prediction for continuous output variables across models
        classification.loc[:, gender_pre + outcome] = classification[list(predictions.keys())].mean(axis=1)
    else: 
        # take the modal prediction for discrete output variables
        classification.loc[:, gender_pre + outcome] = classification[list(predictions.keys())].apply(lambda x:random.choice(statistics.multimode(x)), axis=1)
classification = classification.loc[:, [not col[-1].isdigit() for col in classification.columns]]

Evaluating conference data based on emotion_Female_0 model
Evaluating conference data based on emotion_Female_0 model
Evaluating conference data based on emotion_Female_1 model
Evaluating conference data based on emotion_Female_1 model
Evaluating conference data based on emotion_Female_2 model
Evaluating conference data based on emotion_Female_2 model
Evaluating conference data based on emotion_Female_3 model
Evaluating conference data based on emotion_Female_3 model
Evaluating conference data based on emotion_Female_4 model
Evaluating conference data based on emotion_Female_4 model
Evaluating conference data based on emotion_Male_0 model
Evaluating conference data based on emotion_Male_0 model
Evaluating conference data based on emotion_Male_1 model
Evaluating conference data based on emotion_Male_1 model
Evaluating conference data based on emotion_Male_2 model
Evaluating conference data based on emotion_Male_2 model
Evaluating conference data based on emotion_Male_3 model
Evaluating 

The classification results look like the following:

In [6]:
classification.head()

Unnamed: 0,speaker,split,Female_sad,Female_angry,Female_neutral,Female_happy,Female_disgust,Female_fearful,Male_sad,Male_angry,Male_neutral,Male_happy,Male_disgust,Male_fearful,gender
0,Gu,0,0.0001995375,0.810416,0.00111165,0.161178,0.020575,0.006521,0.005634,0.275899,0.000582,0.185165,0.043534,0.489185,Male
1,Gu,1,1.439354e-05,0.834973,0.00011452,0.083911,0.07917,0.001817,0.002392,0.461593,0.000491,0.234406,0.127638,0.17348,Male
2,Gu,2,3.817395e-07,0.768155,2.099104e-07,0.217006,0.009066,0.005772,8.4e-05,0.463928,3.5e-05,0.389705,0.025509,0.12074,Male
3,Gu,3,1.052871e-08,0.993327,1.718877e-07,0.005184,0.001443,4.7e-05,4.7e-05,0.772937,6e-06,0.030513,0.014178,0.182319,Male
4,Gu,4,7.456817e-07,0.92512,4.974977e-06,0.06338,0.01087,0.000624,0.000526,0.603577,5.8e-05,0.077508,0.023929,0.294401,Male


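Per-split gender predictions can be noisy (the first five splits for Gu shown above are labeled Male), so we assign each speaker the modal prediction across all of that speaker's splits.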

In [7]:
gender = classification[['speaker', 'gender']].groupby(['speaker']).agg(lambda x: x.value_counts(ascending=False).index[0]).reset_index()
gender

Unnamed: 0,speaker,gender
0,Gu,Female
1,Walters,Male


We can now obtain the gender-specific tone predictions, and compute our final tone labels. 

For example, if the predicted gender for a speaker is female, the following code will use predictions from the column `Female_sad` in the `classification` dataframe as the final imputed value for `sad`.
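
The selection logic can be sketched as follows. This is not the actual `choose_gender_val` from `helpers`, whose implementation is not shown here; `choose_gender_val_sketch` is a hypothetical stand-in that assumes the `<Gender>_<tone>` column naming seen in the `classification` table.

```python
def choose_gender_val_sketch(row, tone):
    """Pick the tone score from the column matching the row's predicted gender.

    row:  mapping with a 'gender' key and '<Gender>_<tone>' keys
          (a dict here; a pandas row supports the same lookups).
    tone: tone name such as 'sad' or 'happy'.
    """
    return row[f"{row['gender']}_{tone}"]


# Toy rows with made-up scores.
female_row = {"gender": "Female", "Female_sad": 0.0002, "Male_sad": 0.0056}
male_row = {"gender": "Male", "Female_sad": 0.0002, "Male_sad": 0.0056}

female_sad = choose_gender_val_sketch(female_row, "sad")
male_sad = choose_gender_val_sketch(male_row, "sad")
```

Applied row by row, a function like this yields one score per tone taken from the model trained on the speaker's imputed gender.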

In [8]:
x = pd.merge(classification.drop(columns=['gender']), gender, on='speaker')
tones = ['happy', 'neutral', 'angry', 'sad', 'fearful', 'disgust']
for tone in tones:
    # `row` avoids shadowing the dataframe `x` inside the lambda
    x[tone] = x.apply(lambda row: choose_gender_val(row, tone), axis=1)
x = x[['speaker', 'split', 'gender'] + tones].sort_values(by=['speaker', 'split'])
x

Unnamed: 0,speaker,split,gender,happy,neutral,angry,sad,fearful,disgust
0,Gu,0,Female,0.161178,1.111650e-03,0.810416,1.995375e-04,0.006521,0.020575
1,Gu,1,Female,0.083911,1.145200e-04,0.834973,1.439354e-05,0.001817,0.079170
2,Gu,2,Female,0.217006,2.099104e-07,0.768155,3.817395e-07,0.005772,0.009066
3,Gu,3,Female,0.005184,1.718877e-07,0.993327,1.052871e-08,0.000047,0.001443
4,Gu,4,Female,0.063380,4.974977e-06,0.925120,7.456817e-07,0.000624,0.010870
...,...,...,...,...,...,...,...,...,...
1770,Walters,992,Male,0.296272,1.867157e-02,0.299780,5.978596e-03,0.074354,0.304944
1771,Walters,993,Male,0.090004,1.154762e-03,0.845318,3.560169e-04,0.045571,0.017597
1772,Walters,994,Male,0.139129,4.254307e-03,0.785329,2.005629e-03,0.036530,0.032751
1773,Walters,995,Male,0.326259,2.829968e-02,0.134785,1.204058e-02,0.072495,0.426121
