## Fine-tuned HuBERT model
This notebook trains a model to predict the label of a vocalization by individual `P01`. Labels with fewer than 30 examples in the training data are discarded. Remaining labels:
 - `selftalk` (444 examples)
 - `delighted` (287 examples)
 - `dysregulated` (179 examples)
 - `social` (143 examples)
 - `frustrated` (118 examples)
 - `request` (107 examples)
 - `dysregulation-sick` (59 examples)

The model uses features generated by a pre-trained HuBERT model, which are fed into a regularized multi-class logistic regression. Out-of-sample performance (10-fold cross-validation on the training data):
 - accuracy: 0.758
 - F1 score (unweighted): 0.781
 - cross-entropy: 0.691

In [1]:
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss
from sklearn.model_selection import (
    cross_val_predict,
    StratifiedKFold,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from skopt import BayesSearchCV
import torch
import torchaudio
from tqdm.notebook import tqdm

In [2]:
# Pretrained HuBERT model
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model()

# List of data files
data_files = pd.read_csv("../data/directory_w_train_test.csv")
data_files_p1 = data_files.loc[data_files.Participant == "P01"]
label_counts = data_files_p1.Label.value_counts()
training_files = data_files_p1.loc[
    data_files_p1.Label.isin(label_counts[label_counts >= 30].index)
    & (data_files_p1.is_test == 0)
]

In [3]:
# Extract acoustic features from one wav file using
# the pretrained HuBERT model
datadir = Path("../data/wav")
filename = training_files.Filename.iloc[0]
waveform, sample_rate = torchaudio.load(datadir / filename)
waveform = torchaudio.functional.resample(
    waveform, sample_rate, bundle.sample_rate
)
features, _ = model.extract_features(waveform)

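# features is a list of 12 tensors (one per transformer
# layer), each of shape (m, n, 768). m is the number of
# audio channels (1 for mono, 2 for stereo), because the
# model treats the channel axis returned by
# torchaudio.load as the batch axis. n is the number of
# time frames and grows with the length of the clip.
# Averaging over both m and n gives one 768-dimensional
# vector per clip.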
print(f"{type(features)=}, {len(features)=}")
print(features[0].mean((0, 1)).shape)

type(features)=<class 'list'>, len(features)=12
torch.Size([768])
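
How the frame count `n` depends on clip length can be made concrete. The sketch below assumes the standard HuBERT base convolutional front end (seven unpadded 1-D convolutions with kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2) and a 16 kHz sample rate. Neither is stated in this notebook, and `num_frames` is a helper introduced here for illustration.

```python
# Assumed (kernel_size, stride) of each convolution in the
# HuBERT base feature extractor; not taken from this notebook.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
SAMPLE_RATE = 16_000  # assumed value of bundle.sample_rate


def num_frames(num_samples: int) -> int:
    """Output length of an unpadded 1-D convolution stack."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length


for seconds in (0.5, 1.0, 2.0):
    samples = int(seconds * SAMPLE_RATE)
    print(f"{seconds:.1f} s -> {num_frames(samples)} frames")
```

Under these assumptions the strides multiply to 320 samples, so the model emits roughly one frame per 20 ms of audio, and a one-second clip yields 49 frames.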


In [4]:
# Generate features using the pretrained model.
# We will use only the first layer of generated features.
with torch.no_grad():
    t_list = []
    for filename in tqdm(training_files.Filename):
        waveform, sample_rate = torchaudio.load(datadir / filename)
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, bundle.sample_rate
        )

        features, _ = model.extract_features(waveform)
        t_list.append(features[0].mean((0, 1)))

X = torch.stack(t_list).detach()
labels = training_files.Label.unique()
y = torch.zeros(len(training_files), dtype=torch.int)
for idx, label in enumerate(labels):
    y[(training_files.Label == label).values] = idx
print(X.shape, y.shape)

torch.Size([1337, 768]) torch.Size([1337])


In [6]:
# There are 768 generated features, which is a lot
# relative to the number of training examples (1337),
# so we need regularization. Using scikit-optimize to
# tune the regularization strength C (this is overkill
# since there's just one parameter, but oh well)
est = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        max_iter=10**6,
    ),
)
opt = BayesSearchCV(
    est,
    {
        "logisticregression__C": (5e-3, 1, "log-uniform"),
    },
    n_iter=20,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=12345),
    scoring="accuracy",
)
opt.fit(
    X.reshape(len(X), -1),
    y,
)
print(opt.best_params_)
print("Best accuracy:", opt.best_score_)

OrderedDict([('logisticregression__C', 0.047272594888604615)])
Best accuracy: 0.7599259342385816
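
Since only one parameter is tuned, a plain grid over `C` is a simpler alternative to the Bayesian search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data from `make_classification`, which is a stand-in for the HuBERT features and not the notebook's data. The grid bounds mirror the search range above; the grid size and the `_demo` names are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (n_clips, 768) feature matrix.
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=40,
    n_informative=8,
    n_classes=3,
    random_state=0,
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000))

# Log-spaced grid covering the same range as the Bayesian search (5e-3 to 1).
c_grid = np.logspace(np.log10(5e-3), 0, num=8)

search = GridSearchCV(
    pipeline,
    {"logisticregression__C": c_grid},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=12345),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, f"{search.best_score_:.3f}")
```

A grid evaluates every candidate, so the cost is fixed and the results are easy to plot against `C`; the Bayesian search can place its evaluations more finely near the optimum.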


In [7]:
# Generate out-of-sample predictions using a logistic
# regression model, with the parameter determined by
# the optimization above.
est = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        C=opt.best_params_["logisticregression__C"],
        max_iter=10**6,
    ),
)
oos_pred_prob = cross_val_predict(
    est,
    X.reshape(len(X), -1),
    y,
    cv=StratifiedKFold(
        n_splits=10,
        shuffle=True,
        random_state=1234,  # Using different seed to avoid over-fitting parameter
    ),
    method="predict_proba",
)
oos_pred = oos_pred_prob.argmax(1)
print(f"{oos_pred_prob.shape=}, {oos_pred.shape=}")

print(f"Accuracy: {accuracy_score(y, oos_pred):.3f}")
print(f"Unweighted F1 score: {f1_score(y, oos_pred, average='macro'):.3f}")
print(f"Cross-entropy: {log_loss(y, oos_pred_prob):.3f}")

oos_pred_prob.shape=(1337, 7), oos_pred.shape=(1337,)
Accuracy: 0.758
Unweighted F1 score: 0.781
Cross-entropy: 0.691
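
To put the cross-entropy in context: `log_loss` is the mean negative log-probability assigned to the true label, so a model that always predicts a uniform distribution over the 7 labels would score ln 7 ≈ 1.946. The sketch below re-implements the metric in plain Python on made-up probabilities (not model outputs) to show the calculation. `cross_entropy` is an illustrative helper, not a scikit-learn function.

```python
import math


def cross_entropy(true_labels, pred_probs):
    """Mean negative log-probability assigned to each true label."""
    total = sum(
        math.log(probs[label]) for label, probs in zip(true_labels, pred_probs)
    )
    return -total / len(true_labels)


# Made-up example: three clips, seven classes, true class is always 0.
toy_labels = [0, 0, 0]
confident = [0.82] + [0.03] * 6  # probabilities sum to 1
uniform = [1 / 7] * 7

ce_confident = cross_entropy(toy_labels, [confident] * 3)
ce_uniform = cross_entropy(toy_labels, [uniform] * 3)
print(f"confident: {ce_confident:.3f}, uniform: {ce_uniform:.3f}")
```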


In [8]:
# Confusion matrix
# selftalk and delighted are frequently confused by this model
conf_matrix_df = pd.DataFrame(
    confusion_matrix(y, oos_pred), columns=labels, index=labels
)
conf_matrix_df.index.name = "actual_label"
conf_matrix_df.columns.name = "pred_label"
display(conf_matrix_df)

| actual_label (rows) / pred_label (columns) | dysregulation-sick | frustrated | dysregulated | social | selftalk | request | delighted |
|---|---:|---:|---:|---:|---:|---:|---:|
| dysregulation-sick | 50 | 2 | 1 | 0 | 3 | 1 | 2 |
| frustrated | 1 | 82 | 2 | 2 | 12 | 0 | 19 |
| dysregulated | 1 | 4 | 160 | 1 | 9 | 0 | 4 |
| social | 0 | 2 | 1 | 112 | 11 | 0 | 17 |
| selftalk | 6 | 6 | 2 | 15 | 341 | 17 | 57 |
| request | 0 | 0 | 0 | 0 | 12 | 84 | 11 |
| delighted | 3 | 7 | 1 | 9 | 80 | 3 | 184 |
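
The headline metrics can be recomputed directly from the confusion matrix, which also gives a per-label view. The sketch below hard-codes the counts from the matrix above (rows are actual labels, columns are predictions) and uses plain Python. The variable names are introduced here for illustration.

```python
label_names = [
    "dysregulation-sick", "frustrated", "dysregulated", "social",
    "selftalk", "request", "delighted",
]
# Rows: actual label. Columns: predicted label.
counts = [
    [50, 2, 1, 0, 3, 1, 2],
    [1, 82, 2, 2, 12, 0, 19],
    [1, 4, 160, 1, 9, 0, 4],
    [0, 2, 1, 112, 11, 0, 17],
    [6, 6, 2, 15, 341, 17, 57],
    [0, 0, 0, 0, 12, 84, 11],
    [3, 7, 1, 9, 80, 3, 184],
]

total = sum(map(sum, counts))
accuracy = sum(counts[i][i] for i in range(len(counts))) / total

f1_by_label = {}
for i, name in enumerate(label_names):
    true_pos = counts[i][i]
    actual = sum(counts[i])                    # row sum: clips with this label
    predicted = sum(row[i] for row in counts)  # column sum: clips predicted as it
    f1_by_label[name] = 2 * true_pos / (actual + predicted)

macro_f1 = sum(f1_by_label.values()) / len(f1_by_label)

print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
for name, f1 in sorted(f1_by_label.items(), key=lambda kv: kv[1]):
    print(f"{name:>20}: F1={f1:.3f}")
```

This reproduces the accuracy of 0.758 and unweighted F1 of 0.781 printed earlier. `delighted` has the lowest F1 of the seven labels, consistent with its frequent confusion with `selftalk`.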


The HuBERT features are outputs of an early layer in the HuBERT model. Fitting a logistic regression to these outputs can be interpreted as truncating HuBERT at an appropriate layer, appending a linear layer and a softmax layer, then training the new model while keeping the HuBERT parameters frozen. One possible way to improve this model would be to selectively unfreeze HuBERT layers at some point during training.
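
The frozen-then-unfrozen idea can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the notebook's method: `backbone` is a small hypothetical stand-in for the HuBERT encoder (three toy blocks), `head` plays the role of the logistic regression, and the learning rates are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained encoder: three small blocks.
backbone = nn.Sequential(
    nn.Sequential(nn.Linear(16, 16), nn.ReLU()),
    nn.Sequential(nn.Linear(16, 16), nn.ReLU()),
    nn.Sequential(nn.Linear(16, 16), nn.ReLU()),
)
# Linear layer over 7 labels; the softmax is applied inside the loss.
head = nn.Linear(16, 7)


def count_trainable(*modules):
    """Number of parameters that will receive gradients."""
    return sum(
        p.numel() for m in modules for p in m.parameters() if p.requires_grad
    )


# Phase 1: freeze the whole backbone and train only the head.
for param in backbone.parameters():
    param.requires_grad_(False)
phase1_trainable = count_trainable(backbone, head)

# Phase 2: unfreeze the last block and give it a smaller learning rate.
for param in backbone[-1].parameters():
    param.requires_grad_(True)
phase2_trainable = count_trainable(backbone, head)

optimizer = torch.optim.AdamW(
    [
        {"params": head.parameters(), "lr": 1e-3},
        {"params": backbone[-1].parameters(), "lr": 1e-5},
    ]
)

# One training step on random data, to show which layers get gradients.
inputs = torch.randn(8, 16)
targets = torch.randint(0, 7, (8,))
loss = nn.functional.cross_entropy(head(backbone(inputs)), targets)
loss.backward()
optimizer.step()

print(phase1_trainable, phase2_trainable)
```

After the backward pass, the parameters of the first two blocks have no gradient, while the last block and the head are updated. With a real HuBERT model the same pattern would apply to its transformer layers, at a much higher compute cost than the cached-feature approach used in this notebook.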