This final stage scores the child’s pronunciation. We take two approaches to predicting the score: one based on similarity comparison, and one based on the predictions of a fine-tuned model. This section covers the second approach. We use the data labeled in the few-shot learning stage to fine-tune the Wav2Vec2 model. To make the details of our model available to everyone, we trained it with Hugging Face’s libraries and uploaded the trained models to our Hugging Face model space (if you are curious, please refer to the links below). Since we have four target variables in total (accuracy, completeness, fluency, and prosodic), we ran four separate fine-tuning runs and uploaded each fine-tuned model. The overall process is as follows.

1. **Preprocessing the dataset:** We first load the final audio files and preprocess them into a format suitable for model training. At this stage, we split the whole dataset into train, validation, and test datasets.
2. **Fine tuning:** Using the preprocessed dataset, we fine-tune the model in four versions, one for each target variable: accuracy, completeness, fluency, and prosodic. We then upload these fine-tuned models to our Hugging Face space.
3. **Prediction:** Using the fine-tuned models, we predict the pronunciation scores of the test wav file and visualize them as a radar chart with the `plotly` library.

For this stage, we referred to the official guidance of Hugging Face.

[Audio classification guidance](https://huggingface.co/docs/transformers/tasks/audio_classification)

# 1. Download and import packages


First, download and import packages. Note that **YOU MUST RESTART YOUR RUNTIME** after downloading the packages to ensure the proper execution of the code.

In [None]:
! pip install -U accelerate
! pip install -U transformers
! pip install datasets evaluate



In [None]:
import torch
import evaluate
import librosa
import numpy as np
import pandas as pd
import warnings
import plotly.graph_objects as go
from datasets import Dataset
from tqdm import tqdm
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


If you want to share your fine-tuned model with the Hugging Face community, log in to your Hugging Face account by entering your own access token. If you don’t want to share the model, you can simply skip this step.

In [None]:
# login to your huggingface account for model upload (you can skip this step)
from huggingface_hub import notebook_login
notebook_login()

# 2. Load and preprocess dataset

To preprocess the dataset into the right format, we first load our `audio_reference_final.pkl` file.

In [None]:
# load the dataset
df = pd.read_pickle('your_own_path/audio_reference_final.pkl')

Then we split the dataset into train, validation, and test datasets. We set the train-to-test ratio to 8:2 and use 20% of the train dataset as a validation dataset. For the final evaluation of model performance, we save the test dataset as a `test.pkl` file.

In [None]:
# split the dataset into train, valid, and test dataset
train, test = train_test_split(df, test_size = 0.2)
train, val = train_test_split(train, test_size = 0.2)
test.to_pickle('your_own_path/test.pkl')
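Because the second split is applied to the 80% that remains after the first, the effective proportions are roughly 64% train, 16% validation, and 20% test. The sketch below only checks that arithmetic. It assumes a made-up dataset of 1,000 rows and that the held-out share is rounded up; the helper name `split_sizes` is ours and is not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_frac=0.2, val_frac=0.2):
    """Return (train, val, test) sizes for two successive hold-out splits."""
    n_test = math.ceil(n_rows * test_frac)       # first split: hold out the test set
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_frac)    # second split: hold out validation
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)       # 1000 is a hypothetical dataset size
print(n_train, n_val, n_test)
```

Passing a fixed `random_state` to `train_test_split` makes both splits reproducible across runs.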

Next, we create two dictionaries, `label2id` and `id2label`, that map each label name (bad, normal, good) to an integer id (0, 1, 2) and vice versa. This helps the model associate label names with their corresponding ids.

In [None]:
# 0 stands for 'bad', 1 stands for 'normal', 2 stands for 'good'
labels = ['bad', 'normal', 'good']
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

In [None]:
id2label[str(1)]

'normal'
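As a small aside, the same two mappings can be written as dictionary comprehensions. This is only an equivalent sketch of the loop above; note that the ids are stored as strings here, so a lookup with the integer `1` would not find a key.

```python
labels = ['bad', 'normal', 'good']

# Same mappings as the loop above, written as dict comprehensions
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Round trip: every label maps to an id and back to itself
for label in labels:
    assert id2label[label2id[label]] == label

print(label2id)
print(1 in id2label)  # integer keys are absent because the ids are strings
```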

Next, we load the Wav2Vec2 feature extractor and define a function that converts the data into the format the model expects (`preprocess_function`). This function loads each audio file from its path and processes the audio with the `feature_extractor`. Note that the audio must be resampled to a 16,000 Hz sampling rate, since the Wav2Vec2 model was pre-trained on 16,000 Hz audio. The processed input is returned in the variable `inputs`.

In [None]:
# load wav2vec 2.0 feature extractor model
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Define the preprocess function. The audio must be resampled to 16000 Hz to fine-tune wav2vec 2.0 properly;
# librosa.load resamples to 22050 Hz by default, so the target rate is passed explicitly.
def preprocess_function(examples):
    audio_arrays = [librosa.load(x, sr=feature_extractor.sampling_rate)[0] for x in examples["file_path"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs
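One consequence of the settings above is worth spelling out: `max_length` is counted in samples, so with `truncation=True` and a 16,000 Hz sampling rate, each clip is cut to at most its first second. The plain-Python sketch below illustrates the arithmetic only; it is not the feature extractor’s implementation, and the helper `truncate` and the 3-second clip are made up for illustration.

```python
SAMPLING_RATE = 16_000  # Hz, the rate Wav2Vec2 was pre-trained on
MAX_LENGTH = 16_000     # samples, as passed to the feature extractor above

def truncate(samples, max_length=MAX_LENGTH):
    """Keep at most max_length samples from the start of the clip."""
    return samples[:max_length]

clip = [0.0] * (3 * SAMPLING_RATE)     # a hypothetical 3-second silent clip
kept = truncate(clip)
print(len(kept) / SAMPLING_RATE)       # seconds of audio that survive truncation
```

If the recordings are longer than one second, a larger `max_length` may be worth trying, at the cost of more memory per batch.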

Using `preprocess_function`, we preprocess the train and validation datasets. Recall that `train` and `val` are the data frames holding the train and validation data. We collect the four pronunciation score columns in the lists `train_labels` and `val_labels`. For each set of labels, a dataset is built with the `Dataset.from_dict()` method. Each dataset has two keys: ‘label’, which holds the corresponding labels as a list, and ‘input_values’, which holds the preprocessed inputs. Lastly, the datasets are unpacked into separate variables, one per target, for both training and validation, so that each can be accessed easily during training.

In [None]:
# Preprocess the inputs (this may take quite a long time)
train_inputs = preprocess_function(train)['input_values']
val_inputs = preprocess_function(val)['input_values']

# Define label lists and input values for training and validation
train_labels = [train['accuracy'], train['completeness'], train['fluency'], train['prosodic']]
val_labels = [val['accuracy'], val['completeness'], val['fluency'], val['prosodic']]

# Create datasets for training
train_datasets = []
for label in train_labels:
    dataset = Dataset.from_dict({'label': label.to_list(), 'input_values': train_inputs})
    train_datasets.append(dataset)

# Create datasets for validation
val_datasets = []
for label in val_labels:
    dataset = Dataset.from_dict({'label': label.to_list(), 'input_values': val_inputs})
    val_datasets.append(dataset)

# Unpack datasets for training
ds_train1, ds_train2, ds_train3, ds_train4 = train_datasets

# Unpack datasets for validation
ds_val1, ds_val2, ds_val3, ds_val4 = val_datasets
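As an alternative to positional variables such as `ds_train1`, the per-target data could be keyed by target name. The sketch below uses plain dictionaries with made-up scores and very short stand-in inputs; each value has the same two-column shape that is passed to `Dataset.from_dict()` above. The names `scores`, `inputs`, and `columns_by_target` are hypothetical.

```python
targets = ['accuracy', 'completeness', 'fluency', 'prosodic']

# Toy stand-ins: two clips with made-up scores and very short "audio"
scores = {
    'accuracy':     [0, 2],
    'completeness': [1, 2],
    'fluency':      [1, 1],
    'prosodic':     [0, 1],
}
inputs = [[0.1, 0.2], [0.3, 0.4]]

# One column dictionary per target; all of them share the same inputs
columns_by_target = {
    name: {'label': scores[name], 'input_values': inputs}
    for name in targets
}

print(columns_by_target['fluency'])
```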

# 3. Fine tuning

To track a metric during training, we load an evaluation method (accuracy in this case) and define a function that computes it (`compute_metrics`).

In [None]:
# load the 'accuracy' metric with the Evaluate library
accuracy = evaluate.load("accuracy")

# Define evaluation function
def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
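To make the metric concrete, here is a dependency-free sketch of what `compute_metrics` calculates: take the index of the largest logit in each row as the prediction, then count the share of predictions that match the references. The toy logits and references are made up, and the helper names are ours.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, references):
    """Share of rows whose largest logit sits at the reference class."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

toy_logits = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.2, 0.3, 0.1],    # predicts class 1
    [-0.5, 0.0, 1.5],   # predicts class 2
    [1.0, 0.9, 0.8],    # predicts class 0
]
toy_references = [0, 1, 2, 2]

print(accuracy_from_logits(toy_logits, toy_references))  # 3 of 4 correct
```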

We then load the pre-trained Wav2Vec2 model, along with the number of expected labels (3 in this case).

In [None]:
# Load pre-trained wav2vec2.0 model
num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'projector.weight', 'classifier.weight', 'projector.bias', 'classifier.bias', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Finally, we can run the fine-tuning process. We first define the hyperparameters in `TrainingArguments`. Notable settings include the output directory for saving the model, the learning rate, the batch sizes for training and evaluation, and the number of training epochs. We then create a `Trainer` instance, passing in the model (`model`), the training arguments (`training_args`), the training dataset (`ds_train1`), the validation dataset (`ds_val1`), the feature extractor, and the metric function. Finally, we call the `train()` method on `trainer1`, whose target variable is the accuracy score (that is, pronunciation accuracy).

In [None]:
# Define your hyperparameters in TrainingArguments
training_args = TrainingArguments(
    output_dir="pronunciation_scoring_model_accuracy",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)
# Pass the training arguments to Trainer
trainer1 = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train1,
    eval_dataset=ds_val1,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)
# Train your model
trainer1.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | 1.0992 | 1.100095 | 0.337423 |
| 2 | 1.0949 | 1.099263 | 0.33589 |
| 4 | 1.0838 | 1.106014 | 0.33589 |
| 6 | 1.0757 | 1.094053 | 0.361963 |
| 8 | 1.0653 | 1.09554 | 0.377301 |
| 9 | 1.0595 | 1.095115 | 0.372699 |


TrainOutput(global_step=200, training_loss=1.083817992210388, metrics={'train_runtime': 419.2719, 'train_samples_per_second': 62.108, 'train_steps_per_second': 0.477, 'total_flos': 2.3077946887104e+17, 'train_loss': 1.083817992210388, 'epoch': 9.76})
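Two numbers in the configuration and output above are easy to relate with a little arithmetic. With gradient accumulation, the optimizer sees a larger effective batch than `per_device_train_batch_size` alone suggests, and the reported `global_step` counts optimizer updates rather than individual batches. The sketch below assumes a single device, which is a guess about the runtime rather than something stated above.

```python
per_device_train_batch_size = 32
gradient_accumulation_steps = 4
num_devices = 1                      # assumption: a single GPU

# Samples contributing to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

global_step = 200                    # from the TrainOutput above
num_train_epochs = 10
steps_per_epoch = global_step // num_train_epochs

print(effective_batch_size, steps_per_epoch)
```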

Once your model has finished training, you can share it. If you don’t want to, you can simply skip this step. If you run the code, you will get a new model space that looks similar to the image below.
![](https://velog.velcdn.com/images/pjh172839/post/14527c16-4bb4-46ec-af17-93be012a1821/image.png)

In [None]:
# Push your model to hugging face
trainer1.push_to_hub()

'https://huggingface.co/JunBro/pronunciation_scoring_model_accuracy/tree/main/'

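We repeat the whole fine-tuning process for the remaining target variables. Note that the cells below pass the same `model` object to each new `Trainer`, so each run continues from the weights left by the previous run; reload the pre-trained checkpoint with `from_pretrained` before each run if every target should start from the same weights.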
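Since the four runs differ only in `output_dir` and in the dataset they use, the shared settings could be kept in one place. The sketch below builds one keyword dictionary per target from the values used in this section; the names `shared_kwargs` and `kwargs_by_target` are ours. Each dictionary could then be unpacked into `TrainingArguments(**kwargs)`, which is left out here so that the example runs without any extra packages.

```python
targets = ['accuracy', 'completeness', 'fluency', 'prosodic']

# Settings shared by all four runs, copied from the training cells
shared_kwargs = dict(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

# Only the output directory changes from one target to the next
kwargs_by_target = {
    target: {**shared_kwargs, "output_dir": f"pronunciation_scoring_model_{target}"}
    for target in targets
}

for target, kwargs in kwargs_by_target.items():
    print(target, '->', kwargs["output_dir"])
```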

In [None]:
# Define your hyperparameters in TrainingArguments
training_args = TrainingArguments(
    output_dir="pronunciation_scoring_model_completeness",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)
# Pass the training arguments to Trainer
trainer2 = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train2,
    eval_dataset=ds_val2,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)
# Train your model
trainer2.train()

# Push your model to hugging face
trainer2.push_to_hub()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | 1.1062 | 1.10411 | 0.334356 |
| 2 | 1.0894 | 1.100914 | 0.351227 |
| 4 | 1.0746 | 1.10221 | 0.383436 |
| 6 | 1.0522 | 1.102586 | 0.384969 |
| 8 | 1.0288 | 1.096275 | 0.407975 |
| 9 | 1.019 | 1.094779 | 0.407975 |


'https://huggingface.co/JunBro/pronunciation_scoring_model_completeness/tree/main/'

In [None]:
# Define your hyperparameters in TrainingArguments
training_args = TrainingArguments(
    output_dir="pronunciation_scoring_model_fluency",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)
# Pass the training arguments to Trainer
trainer3 = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train3,
    eval_dataset=ds_val3,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)
# Train your model
trainer3.train()

# Push your model to hugging face
trainer3.push_to_hub()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | 1.114 | 1.108475 | 0.326687 |
| 2 | 1.083 | 1.095289 | 0.360429 |
| 4 | 1.049 | 1.109923 | 0.352761 |
| 6 | 1.0314 | 1.130276 | 0.381902 |
| 8 | 1.0072 | 1.118113 | 0.397239 |
| 9 | 0.9889 | 1.12796 | 0.400307 |


'https://huggingface.co/JunBro/pronunciation_scoring_model_fluency/tree/main/'

In [None]:
# Define your hyperparameters in TrainingArguments
training_args = TrainingArguments(
    output_dir="pronunciation_scoring_model_prosodic",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)
# Pass the training arguments to Trainer
trainer4 = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train4,
    eval_dataset=ds_val4,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)
# Train your model
trainer4.train()

# Push your model to hugging face
trainer4.push_to_hub()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | 1.1289 | 1.114514 | 0.325153 |
| 2 | 1.0828 | 1.102508 | 0.340491 |
| 4 | 1.0457 | 1.108755 | 0.361963 |
| 6 | 0.994 | 1.135027 | 0.349693 |
| 8 | 0.9643 | 1.141953 | 0.372699 |
| 9 | 0.9599 | 1.136199 | 0.372699 |


'https://huggingface.co/JunBro/pronunciation_scoring_model_prosodic/tree/main/'
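When reading the logs above, it helps to have a baseline in mind. For a three-class problem, guessing uniformly at random gives an expected accuracy of 1/3, and a model that outputs equal probabilities for every class has a cross-entropy loss of ln 3. The sketch below compares the highest validation accuracy shown in each log with that baseline. It does not account for class imbalance, which the logs do not reveal, so a majority-class baseline could be higher.

```python
import math

num_classes = 3
chance_accuracy = 1 / num_classes
chance_loss = math.log(num_classes)   # cross-entropy of a uniform prediction

# Highest validation accuracy shown in each training log above
best_val_accuracy = {
    'accuracy': 0.377301,
    'completeness': 0.407975,
    'fluency': 0.400307,
    'prosodic': 0.372699,
}

for target, acc in best_val_accuracy.items():
    print(f"{target:13s} {acc:.3f}  (+{acc - chance_accuracy:.3f} over chance)")
print(f"chance-level loss: {chance_loss:.4f}")
```

Under these assumptions, all four models sit only a few points above uniform guessing, and the validation losses stay close to the chance-level loss, so the predicted scores should be read with caution.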

# 4. Inference

Now that we have fine-tuned our models, we can use them for inference. In this step, we predict the four pronunciation scores of the `test.wav` file. First, we load the `test.wav` file and preprocess it.


In [None]:
# load the test wav file at 16000 Hz (librosa defaults to 22050 Hz) and preprocess it
x, _  = librosa.load('your_own_path/test.wav', sr=16000)
feature_extractor = AutoFeatureExtractor.from_pretrained("JunBro/pronunciation_scoring_model_accuracy")
inputs = feature_extractor(x, sampling_rate=16000, return_tensors="pt")

Then we load the four fine-tuned models, pass our test input to each model, and collect the logits.

In [None]:
# load models
model1 = AutoModelForAudioClassification.from_pretrained("JunBro/pronunciation_scoring_model_accuracy")
model2 = AutoModelForAudioClassification.from_pretrained("JunBro/pronunciation_scoring_model_completeness")
model3 = AutoModelForAudioClassification.from_pretrained("JunBro/pronunciation_scoring_model_fluency")
model4 = AutoModelForAudioClassification.from_pretrained("JunBro/pronunciation_scoring_model_prosodic")

# pass your inputs to the model and return the logits
with torch.no_grad():
    logits1 = model1(**inputs).logits
    logits2 = model2(**inputs).logits
    logits3 = model3(**inputs).logits
    logits4 = model4(**inputs).logits

In the last step, we pick the class with the highest probability. We use the `torch.argmax()` function to find the index of the largest value in each set of logits and convert it to a Python scalar with `.item()`. These indices are the predicted classes for each target. We then use each model’s `id2label` mapping to convert them to labels. Finally, the code prints the predicted label for each aspect (accuracy, completeness, fluency, prosodic).

In [None]:
# get the class with the highest probability
predicted_class_ids1 = torch.argmax(logits1).item()
predicted_class_ids2 = torch.argmax(logits2).item()
predicted_class_ids3 = torch.argmax(logits3).item()
predicted_class_ids4 = torch.argmax(logits4).item()

# use the model’s id2label mapping to convert it to a label
predicted_label1 = model1.config.id2label[predicted_class_ids1]
predicted_label2 = model2.config.id2label[predicted_class_ids2]
predicted_label3 = model3.config.id2label[predicted_class_ids3]
predicted_label4 = model4.config.id2label[predicted_class_ids4]

# print out the result
print('accuracy:', predicted_label1)
print('completeness:', predicted_label2)
print('fluency:', predicted_label3)
print('prosodic:', predicted_label4)

accuracy: bad
completeness: bad
fluency: normal
prosodic: normal
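The code above takes the argmax of the raw logits rather than of probabilities. The two agree because softmax, which turns logits into probabilities, preserves their order. The dependency-free sketch below illustrates this with made-up logits for a single clip; it is not taken from the models’ actual output.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [0.3, 1.2, -0.4]        # made-up logits for (bad, normal, good)
probs = softmax(toy_logits)

id_from_logits = max(range(len(toy_logits)), key=toy_logits.__getitem__)
id_from_probs = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print(id_from_logits == id_from_probs)
```

Looking at the probabilities as well as the predicted label shows how confident each model is, which the label alone hides.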


# 5. Graph visualization

Finally, using the `plotly` library, we visualize the child’s pronunciation scores as a radar chart. Note that the first trace, named ‘**Average Score**’, uses arbitrary values that stand in for the average pronunciation score of all children.

In [None]:
# graph visualization
fig = go.Figure()

categories = ['Accuracy', 'Completeness', 'Fluency', 'Prosodic']

fig.add_trace(go.Scatterpolar(
    r=[1.2,1.3,0.5,1.5],
    theta=categories,
    fill='toself',
    name="Average Score"
))

fig.add_trace(go.Scatterpolar(
    r=[predicted_class_ids1, predicted_class_ids2, predicted_class_ids3, predicted_class_ids4],
    theta=categories,
    fill='toself',
    name="Child Pronunciation Score"
))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            showticklabels=False,
            range=[0, 2]
        )),
    showlegend=True
)

fig.show()
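The ‘Average Score’ trace above uses arbitrary values. If predictions for several children were available, a real average could be computed per category and passed as `r` to the first trace. The sketch below assumes a hypothetical list of predicted class ids for four children; the data are made up for illustration.

```python
from statistics import mean

categories = ['Accuracy', 'Completeness', 'Fluency', 'Prosodic']

# Hypothetical predicted class ids (0 = bad, 1 = normal, 2 = good), one row per child
predictions = [
    [0, 0, 1, 1],
    [2, 1, 1, 2],
    [2, 2, 1, 0],
    [2, 1, 0, 2],
]

# Average each category across children (column-wise mean)
average_scores = [mean(column) for column in zip(*predictions)]

print(dict(zip(categories, average_scores)))
```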