After labeling the pronunciation scores of the speech data, we performed data augmentation. Augmentation has many benefits; the most important ones are listed below.

1. **Improvement in Generalization Ability**: Augmentation helps the model to not overly rely on specific environments or conditions. This enables the model to maintain high performance in various real-world situations.
2. **Prevention of Overfitting**: Augmentation helps prevent overfitting when the training data is limited. Exposure to diverse forms and types of data improves generalization and keeps the model from fitting too closely to the training set.
3. **Creation of Robust Models**: Augmentation makes the model more robust and resilient. For example, it improves the model’s ability to handle noise, environmental variations, and imperfect speech in real-world scenarios.

Most importantly, data augmentation helps make the model more reliable. Speech recorded by an AI speaker is susceptible to ambient noise. By adding noise and other sound effects during augmentation, we can train a model that is robust to these conditions. To augment the wav files, we referred to “Data Augmenting Contrastive Learning of Speech Representations in the Time Domain” (Kharitonov et al., 2020), found on Papers with Code, and used the WavAugment library.

[Papers with Code - Data Augmenting Contrastive Learning of Speech Representations in the Time Domain](https://paperswithcode.com/paper/data-augmenting-contrastive-learning-of)

# 1. Download and import packages

First, install SoX (Sound eXchange), torchaudio, and WavAugment, then import the libraries.

In [None]:
! apt-get install libsox-fmt-all libsox-dev sox > /dev/null
! python -m pip install torchaudio > /dev/null
! python -m pip install git+https://github.com/facebookresearch/WavAugment.git > /dev/null

  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/WavAugment.git /tmp/pip-req-build-6kvn9ns5


In [None]:
from tqdm import tqdm
import pandas as pd
import numpy as np
import torchaudio
import torch
import augment
import random
import os

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 2. Audio augmentation

First, make sure your GPU is available.

In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


Then we load the `audio_reference_scored.pkl` file that was created in the previous stage. This pickle file contains the file path of each wav file, its tensor, and all of its pronunciation scores. If you want to skip the previous stage, you can simply use the `audio_reference_scored.pkl` file in the shared folder.

In [None]:
# load the scored reference data frame from the previous stage
df = pd.read_pickle('your_own_path/audio_reference_scored.pkl')

Next, we build the output file paths (where the augmented wav files will be saved) from the names of the input files (the original wav files) using the `os` library, and store them in the data frame.

In [None]:
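# build the output path of each augmented wav file
output_path = 'your_own_path/recordings/augmented'
original_path = 'your_own_path/recordings/original'  # folder holding the original wav files
os.makedirs(output_path, exist_ok=True)  # make sure the output folder exists
# derive the names from df['file_path'] so rows stay aligned (os.listdir order is arbitrary)
output_paths = [os.path.join(output_path, os.path.basename(path)) for path in df['file_path']]
df['output_path'] = output_paths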
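
Each output path has to correspond, row by row, to the source file it was derived from, so it can be worth comparing file names before augmenting. The helper below is a minimal sketch of such a check; the name `find_misaligned_rows` and the toy paths are hypothetical and not part of the original notebook.

```python
import os

def find_misaligned_rows(file_paths, output_paths):
    """Return the indices where source and output file names differ."""
    return [
        i
        for i, (src, dst) in enumerate(zip(file_paths, output_paths))
        if os.path.basename(src) != os.path.basename(dst)
    ]

# Toy paths for illustration only
sources = ["orig/a.wav", "orig/b.wav", "orig/c.wav"]
outputs = ["aug/a.wav", "aug/c.wav", "aug/b.wav"]
print(find_misaligned_rows(sources, outputs))  # [1, 2]
```

On the real data, `find_misaligned_rows(df['file_path'], df['output_path'])` should return an empty list if the two columns line up.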

After that, we define an `audio_modification` function that modifies the original wav files. It randomly applies one of four wav augmentation techniques to the source file.

1. **Pitch shift**: Raise or lower the pitch of the voice. For example, -200 lowers the pitch by 200 cents (two semitones).
2. **Reverberation**: Add reverberation (echo-like reflections) to a sound signal, conveying spatial depth and width. The three parameters of the `reverb()` function stand for reverberance, damping factor, and room size.
3. **Noise**: Apply additive noise. In this case, we used generated uniform noise.
4. **Time dropout**: Replace a brief segment of audio with silence. This method is frequently used in the literature.
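
To get a feel for what a shift in cents means, the short sketch below converts cents into a frequency ratio using the standard definition (1200 cents per octave, 100 cents per semitone). The helper name `cents_to_ratio` is ours, for illustration only.

```python
def cents_to_ratio(cents):
    """Frequency ratio for a pitch shift in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)

for cents in (-200, -100, 100):
    print(f"{cents:+d} cents -> frequency x {cents_to_ratio(cents):.4f}")
```

A shift of 100 cents (one semitone) changes the frequency by about 6%, so the shifts used in this notebook are subtle changes to the voice.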

In [None]:
# function to modify an original audio file
# randomly applies one of four effects: pitch shift, reverb, additive noise, or time dropout

def audio_modification(wave_path, random_pitch_shift, random_room_size):
  x, sr = torchaudio.load(wave_path)
  # random.randint includes both ends, so 1..4 gives each effect the same chance
  r = random.randint(1, 4)

  if r == 1:
    # pitch shift in cents, then resample to the source rate
    random_pitch_shift_effect = augment.EffectChain().pitch("-q", random_pitch_shift).rate(sr)
    y = random_pitch_shift_effect.apply(x, src_info={'rate': sr})
  elif r == 2:
    # reverb (reverberance, damping factor, room size), mixed down to one channel
    random_reverb = augment.EffectChain().reverb(50, 50, random_room_size).channels(1)
    y = random_reverb.apply(x, src_info={'rate': sr})
  elif r == 3:
    # additive uniform noise at a signal-to-noise ratio of 15
    noise_generator = lambda: torch.zeros_like(x).uniform_()
    y = augment.EffectChain().additive_noise(noise_generator, snr=15).apply(x, src_info={'rate': sr})
  else:
    # replace a random segment of up to 0.5 seconds with silence
    y = augment.EffectChain().time_dropout(max_seconds=0.5).apply(x, src_info={'rate': sr})

  return y
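
The `snr=15` argument controls how loud the added noise is relative to the speech. The sketch below illustrates the generic textbook definition of a signal-to-noise ratio in decibels using plain Python lists; it is not WavAugment's implementation, which may weight signal and noise differently. The helper `mix_at_snr` and the toy sine tone are assumptions made for this example.

```python
import math
import random

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power matches snr_db, then add it."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]

random.seed(0)
# toy 440 Hz tone sampled at 16 kHz (illustrative values, not the real recordings)
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1.0, 1.0) for _ in signal]
noisy = mix_at_snr(signal, noise, snr_db=15)

# measure the SNR that was actually achieved
residual = [y - s for y, s in zip(noisy, signal)]
achieved = 10 * math.log10(sum(s * s for s in signal) / sum(r * r for r in residual))
print(f"achieved SNR: {achieved:.2f} dB")
```

Under this definition, a higher SNR means quieter noise: at 15 dB the speech carries roughly 32 times the power of the noise.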

Then we define the `random_pitch_shift` and `random_room_size` generators. `random_room_size` draws an integer from 0 to 50, and `random_pitch_shift` draws an integer from -100 to 99 cents (`np.random.randint` excludes the upper bound), so that the pitch does not go too high or too low since we’re targeting kids. After that, we call the function and store the modified wav data, in tensor form, in the `modified_vector` column of the data frame that we already created.

In [None]:
# set randomized parameters
random_pitch_shift = lambda: np.random.randint(-100, +100)
random_room_size = lambda: np.random.randint(0, 51)

In [None]:
tqdm.pandas()
df['modified_vector'] = df['file_path'].progress_apply(lambda x: audio_modification(x, random_pitch_shift, random_room_size))

100%|██████████| 2035/2035 [04:10<00:00,  8.12it/s]


Finally, to check the augmented results, we converted each `modified_vector` into a wav file and saved it to the corresponding `output_path`.

In [None]:
# Generate augmented wav files
for i in tqdm(range(len(df))):
  output_path = df.loc[i, 'output_path']
  y = df.loc[i, 'modified_vector']
  # note: this assumes every source recording is sampled at 44.1 kHz
  torchaudio.save(output_path, y, sample_rate=44100)

100%|██████████| 2035/2035 [26:08<00:00,  1.30it/s]


# 3. Create a new reference pickle file for later use

We create a new reference file, `audio_reference_scored_augmented.pkl`, that contains the paths of all original and augmented wav files together with their pronunciation scores, for later use. At this step, we give each augmented wav file the same pronunciation score as its original wav file.

In [None]:
# Create new dataframe that contains original and augmented wav file paths and their features
file_path = df['file_path'].to_list() + df['output_path'].to_list()
accuracy = df['accuracy'].to_list() * 2
completeness = df['completeness'].to_list() * 2
fluency = df['fluency'].to_list() * 2
prosodic = df['prosodic'].to_list() * 2

result = pd.DataFrame(columns = ['file_path', 'vector', 'accuracy', 'completeness', 'fluency', 'prosodic'])
result['file_path'] = file_path
result['accuracy'] = accuracy
result['completeness'] = completeness
result['fluency'] = fluency
result['prosodic'] = prosodic
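
The cell above relies on list concatenation and list repetition keeping originals and augmented copies aligned: row `i` is an original file and row `i + n` is its augmented copy, where `n` is the number of originals. The toy example below, with made-up paths and scores, illustrates that alignment.

```python
# Toy data standing in for the real columns (values are made up for illustration)
original_paths = ["original/a.wav", "original/b.wav"]
augmented_paths = ["augmented/a.wav", "augmented/b.wav"]
accuracy = [7, 9]

all_paths = original_paths + augmented_paths
all_accuracy = accuracy * 2  # repeats the whole list: [7, 9, 7, 9]

n = len(original_paths)
for i in range(n):
    # row i is an original file, row i + n is its augmented copy
    assert all_accuracy[i] == all_accuracy[i + n]
    print(all_paths[i], all_paths[i + n], all_accuracy[i])
```

This only holds if `df['output_path']` is in the same row order as `df['file_path']`.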

In [None]:
# save the data frame
result.to_pickle('your_own_path/audio_reference_scored_augmented.pkl')