Finally, this stage scores the child’s pronunciation and visualizes the result as a graph. We take two different approaches to predicting the child’s pronunciation score: one based on similarity comparison and the other based on prediction by a fine-tuned model. Here we cover the first approach, which predicts the child’s pronunciation score from its similarity to the reference data. The key assumption in this stage is that audio files with similar pronunciation will also be similar once they are vectorized. So if we have reference data consisting of audio files of different children pronouncing the same phrase, together with their pronunciation scores, we can determine the score of new input data from the reference. The overall process is as follows.

1. **Convert the reference audio files to tensors using Wav2Vec 2.0:** After augmentation, we built our reference dataset by converting all of the audio files into tensors using the Wav2Vec 2.0 model.
2. **Convert test data into a tensor and find the most similar tensors:** When a child’s recorded voice comes in through the AI speaker, we convert it to a tensor and find the references that are most similar to it. We use cosine similarity as the similarity measure.
3. **Visualize the child’s predicted pronunciation score:** We plot the child’s predicted pronunciation scores for the four categories in a radar chart, using the `plotly` library.
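
The three steps above amount to a nearest-neighbour lookup. As a rough sketch on made-up data, the idea looks like the following. The vectors, the scores, and the helper name `predict_scores` are illustrative assumptions, and NumPy arrays stand in for the Wav2Vec 2.0 tensors.

```python
import numpy as np

def predict_scores(ref_vectors, ref_scores, query, n):
    """Average the scores of the n references most similar to `query`.

    ref_vectors: (num_refs, dim) array, one embedding per reference clip
    ref_scores:  (num_refs, num_categories) array of labelled scores
    query:       (dim,) embedding of the new recording
    """
    # cosine similarity between the query and every reference vector
    norms = np.linalg.norm(ref_vectors, axis=1) * np.linalg.norm(query)
    sims = ref_vectors @ query / norms

    # indices of the n largest similarities (most similar first)
    top = np.argsort(sims)[::-1][:n]

    # predicted score per category = mean over the selected references
    return ref_scores[top].mean(axis=0)


# toy reference set: 4 clips, 2-dimensional embeddings, 2 score categories
ref_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
ref_scores = np.array([[2, 2], [2, 0], [0, 0], [0, 1]])
query = np.array([1.0, 0.05])

predicted = predict_scores(ref_vectors, ref_scores, query, n=2)
print(predicted)
```

In this toy case the two closest references are the first two rows, so the prediction is the mean of their scores.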

For the wav2vec 2.0 model, we referred to “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (Baevski et al., 2020), available on Papers with Code.

[Papers with Code - wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://paperswithcode.com/paper/wav2vec-2-0-a-framework-for-self-supervised)


# 1. Download and import packages

First, install the `pydub` library and import the packages.

In [None]:
!pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [None]:
from transformers import Wav2Vec2Processor, Wav2Vec2Model
from torch.nn.functional import cosine_similarity
from tqdm import tqdm
from pydub import AudioSegment
import plotly.graph_objects as go
import soundfile as sf
import pandas as pd
import numpy as np
import torch
import warnings
import os

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 2. Change wav files to tensors via pre-trained wav2vec 2.0 model

Next, check whether a GPU is available and load the Wav2Vec2 model and processor.

In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [None]:
# Load wav2vec2.0 model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").to(device)

Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Then, we define a `wav2vec` function that uses the Wav2Vec2 model to convert an audio file into a tensor. The function first loads the audio file from the specified path using the `sf.read` function from the `soundfile` library, and processes the audio input with the `processor` loaded above. Note that you have to resample your audio to a 16,000 Hz sampling rate before calling the function, since the Wav2Vec2 model was pre-trained on 16 kHz audio and the function itself does not resample (it discards the sampling rate returned by `sf.read`). The processed input is returned as a PyTorch tensor and stored in the variable `input_values`. Next, the function moves the processed input to the selected device (GPU or CPU), feeds `input_values` into the model, and reads the `last_hidden_state` attribute to retrieve the final hidden states. Lastly, it averages these features over the time dimension (`dim=1`) and returns the result as `fixed_length_vector`. (This is the same function as the `wav2vec` function in the “Labeling via few shot learning” stage.)

In [None]:
def wav2vec(audio_path):
    # load audio file
    audio_input, _ = sf.read(audio_path)

    # prepare input data
    input_values = processor(audio_input, return_tensors="pt", sampling_rate=16000).input_values

    # move input data to the selected device (GPU or CPU)
    input_values = input_values.to(device)

    # predict by using wav2vec model
    with torch.no_grad():
        features = model(input_values).last_hidden_state

    # transform to fixed_length vector
    fixed_length_vector = torch.mean(features, dim=1)

    return fixed_length_vector
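
To see why the final averaging step gives same-sized vectors for clips of different lengths, here is a small shape-only sketch. It assumes NumPy arrays as stand-ins for the model output and uses 4 features per frame instead of the 768 produced by `wav2vec2-base`, so that it runs without loading the model.

```python
import numpy as np

# pretend model output: batch of 1 clip, 5 time frames, 4 features per frame
short_clip = np.arange(20, dtype=float).reshape(1, 5, 4)
# a longer clip yields more frames but the same number of features per frame
long_clip = np.ones((1, 9, 4))

# averaging over the time axis (axis=1) removes the dependence on clip length
short_vec = short_clip.mean(axis=1)
long_vec = long_clip.mean(axis=1)

print(short_vec.shape, long_vec.shape)
print(short_vec)
```

Both results have shape `(1, 4)`, which is what makes the later pairwise similarity comparison possible.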

In [None]:
# load audio_reference_scored_augmented.pkl file
df = pd.read_pickle('your_own_path/audio_reference_scored_augmented.pkl')
df

| Unnamed: 0 | file_path | vector | accuracy | completeness | fluency | prosodic |
|---|---|---|---|---|---|---|
| 0 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 1 | 2 | 2 | 2 |
| 1 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 1 | 0 | 1 | 2 |
| 2 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 0 | 0 | 0 | 2 |
| 3 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 2 | 0 | 1 | 2 |
| 4 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 1 | 2 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 4065 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 1 | 1 | 0 | 0 |
| 4066 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 0 | 0 | 2 | 1 |
| 4067 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 2 | 0 | 0 | 2 |
| 4068 | /content/drive/My Drive/03. AI/kaggle_archive2... | | 2 | 1 | 2 | 0 |


In this step, we load the `audio_reference_scored_augmented.pkl` file (the cell above) and then build the final reference dataset by converting each audio file into a same-sized tensor. Note that, depending on the shape of your data frame, you will need to change the indexes passed to `df.iloc` and `df.iat` accordingly. Also, if you have large audio files, it is recommended to use a try-except statement, since they can cause a ‘**CUDA out of memory**’ error.


In [None]:
for i in tqdm(range(len(df))):
  try:
    y = wav2vec(df.iloc[i,0])
    df.iat[i,1] = y
  except Exception as e:
    print(i,e)

100%|██████████| 4070/4070 [32:22<00:00,  2.09it/s]
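
The loop above only prints failures. A possible variation, sketched here with hypothetical names (`embed_all`, `fake_embed`) and a stand-in for `wav2vec` so that it runs without audio files, records which rows failed so they can be retried or dropped later:

```python
def embed_all(paths, embed):
    """Apply `embed` to every path, collecting failures instead of stopping."""
    vectors = {}
    failed = []
    for i, path in enumerate(paths):
        try:
            vectors[i] = embed(path)
        except Exception as e:  # e.g. a CUDA out-of-memory RuntimeError
            failed.append((i, str(e)))
    return vectors, failed


# stand-in for wav2vec(): fails on one "file" to simulate an error
def fake_embed(path):
    if path == "too_long.wav":
        raise RuntimeError("CUDA out of memory")
    return [0.0, 1.0]


vectors, failed = embed_all(["a.wav", "too_long.wav", "b.wav"], fake_embed)
print(failed)
```

Rows listed in `failed` have no vector, so they would need to be excluded (or re-processed) before the similarity search in the next section.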


We then save the final reference file as `audio_reference_final.pkl` for later use.

In [None]:
# save the data frame in pickle file
df.to_pickle('your_own_path/audio_reference_final.pkl')

# 3. Convert new speech data into a tensor and find the n most similar tensors

Now we find the n most similar tensors and predict the child’s pronunciation score. First, load the `audio_reference_final.pkl` file.

In [None]:
df = pd.read_pickle('your_own_path/audio_reference_final.pkl')

To get a test input, we define a simple function that converts an m4a file into wav format (our recorded data was in m4a format) and use it to convert the test file.


In [None]:
def convert_m4a_to_wav(input_path, output_path):
    # load m4a file
    audio = AudioSegment.from_file(input_path, format="m4a")

    # save as wav file
    audio.export(output_path, format="wav")

In [None]:
m4a_file_path = "your_own_path/test.m4a"
wav_file_path = "your_own_path/test.wav"

convert_m4a_to_wav(m4a_file_path, wav_file_path)
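
Because the model expects 16 kHz audio, it may be worth checking the sampling rate of the converted file before vectorizing it. The sketch below uses only Python's standard `wave` module and assumes the exported file is uncompressed PCM. The helper name `wav_sample_rate` and the demo file are illustrative, not part of the original pipeline.

```python
import os
import tempfile
import wave

def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate()


# self-contained demo: write one second of 16-bit mono silence at 16 kHz
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

rate = wav_sample_rate(demo_path)
print(rate)
```

If the reported rate for your own file is not 16000, the audio would need resampling first. One option is to call pydub's `set_frame_rate(16000)` on the `AudioSegment` before exporting.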

After the file format conversion, we load the test file again, transform it into a tensor, and calculate its cosine similarity with each reference. We then predict the pronunciation score of the test file as the mean of the pronunciation scores of the 50 references most similar to it.
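
Written out as formulas, with notation introduced here only for illustration: $u$ is a reference vector, $v$ is the test vector, $N_n(v)$ is the set of the $n$ references with the largest similarity to $v$, and $s_{i,c}$ is the labelled score of reference $i$ in category $c$.

```latex
\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
\hat{s}_c = \frac{1}{n} \sum_{i \in N_n(v)} s_{i,c}
```

Cosine similarity ranges from $-1$ to $1$, and larger values mean more similar vectors, so the references with the highest values are the ones that enter the average.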

In [None]:
# load test audio file
test_path = 'your_own_path/test.wav'
test_vector = wav2vec(test_path)

# how many references to calculate the score?
n = 50

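# calculate the pronunciation score of the test voice from cosine similarity
# (.item() turns the 1-element similarity tensor into a plain float for sorting)
df['sim'] = df['vector'].apply(lambda x: cosine_similarity(x, test_vector).item())
# sort in descending order so that the n MOST similar references come first
score_df = df.sort_values('sim', ascending=False).iloc[:n][['accuracy','completeness','fluency','prosodic']]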

In [None]:
accuracy = score_df['accuracy'].mean()
completeness = score_df['completeness'].mean()
fluency = score_df['fluency'].mean()
prosodic = score_df['prosodic'].mean()

# 4. Graph a child's pronunciation score

Finally, using the `plotly` library, we visualize the child’s pronunciation score in a radar chart. Note that the first trace, named ‘**Average Score**’, uses arbitrary values to represent the average pronunciation score of all children.

In [None]:
# graph visualization via plotly
fig = go.Figure()

categories = ['Accuracy', 'Completeness', 'Fluency', 'Prosodic']

fig.add_trace(go.Scatterpolar(
    r=[1.2,1.3,0.5,1.5],
    theta=categories,
    fill='toself',
    name="Average Score"
))

fig.add_trace(go.Scatterpolar(
    r=[accuracy, completeness, fluency, prosodic],
    theta=categories,
    fill='toself',
    name="Child Pronunciation Score"
))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            showticklabels=False,
            range=[0, 2]
        )),
    showlegend=True
)

fig.show()