# Accent Transfer Techniques for Speech Synthesis
## Signal Minded


*  Pradnesh Prasad Kalkar (190101103), B.Tech. Computer Science and Engineering
*  Saket Kumar Singh (190101081), B.Tech. Computer Science and Engineering
*  Anirudh Phukan (190101104), B.Tech. Computer Science and Engineering
*  Mesharya M Choudhary (190101053), B.Tech. Computer Science and Engineering
*  Korada Pavan Kumar (190102093), B.Tech. Electronics and Communication Engineering






# Table of Contents

*   Overview
*   Results and Conclusions
*   Dataset Generation
*   Neural Style Transfer
*   Simulation Settings
*   Dynamic Time Warping
*   Fine Tuning on Audio Representations
*   Existing Work

# Overview

---
Note: This report highlights only the important code snippets. For the complete code, refer to [GitHub](https://github.com/sksingh1202/human-accent-transfer-learning).

*   Data Generation
    *   MelDatasetGen.py generates Mel spectrograms for each of the audio files.
    *   SpecAugmentGen.py generates Mel spectrograms for each of the audio files with the SpecAugment data augmentation technique applied.
    *   WindowAugGen.py generates spectrograms for each of the audio files with 4 different window sizes.
    *   preprocess.py splits the **.mp3** audio files of the Speech Accent Archive dataset into 5-second **.wav** files.
    *   split.py creates the training and validation split for a dataset whose files are organized by class.
*   Finetuning: FineTune.py fine-tunes the weights of the pre-trained VGG-19 model under the 5 settings described in the Simulation Settings section.
*   dynamic_time_warping.py uses **Dynamic Time Warping** to synchronize audio signals as a preprocessing step for neural style transfer.
*   neural_style_transfer.py performs NST on two audio files using 6 different sets of VGG-19 weights, including the ImageNet-pretrained weights. The output is the generated spectrogram saved as a PNG image.
*   image_to_spectrogram.py takes a generated spectrogram image and converts it to an audio file.









#  Results and Conclusions

![cantonese](https://github.com/sksingh1202/human-accent-transfer-learning/blob/master/images/cantonese.png?raw=true)

Cantonese

![hindi](https://github.com/sksingh1202/human-accent-transfer-learning/blob/master/images/hindi.png?raw=true)

Hindi

![english](https://github.com/sksingh1202/human-accent-transfer-learning/blob/master/images/english.png?raw=true)

English

![dutch](https://github.com/sksingh1202/human-accent-transfer-learning/blob/master/images/dutch.png?raw=true)

Dutch

![russian](https://github.com/sksingh1202/human-accent-transfer-learning/blob/master/images/russian.png?raw=true)

Russian


The following conclusions were drawn from the results:

*   The output audio from the fine-tuned models sounded slightly smoother than the output from the pre-trained model.
*   The fine-tuned model trained on Mel spectrograms produced the best outputs (qualitatively better than the pre-trained model).
*   The outputs of the other fine-tuned models were not classified correctly by their respective models.
*   In all cases, the model outputs were noisy.
*   The outputs generated with DTW as a preprocessing step were found to be noisier than those generated without it.



# Dataset Generation

In the code snippet below, we randomly choose 10% of the samples for the validation set.

```
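import math
import random

# filenames is the list of files of one class, built earlier in split.py
N = len(filenames)
# we want the validation set to contain 10% of the samples
M = math.floor(N / 10)
for filename in filenames:
    # Selection sampling: N is the number of files not yet visited and M the
    # number of validation slots still open, so each file is picked with
    # probability M/N and exactly floor(len(filenames)/10) files are chosen
    if random.randrange(N) < M:
        datasettype = "validation"
        M -= 1
    else:
        datasettype = "training"
    N -= 1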
```

SpecAugment is a data augmentation technique for spectrograms which works by time and frequency masking.

In the code snippet below, the number of mask bands and the maximum sizes of the frequency and time masks (as fractions of the spectrogram dimensions) are input parameters of the function. The size and the starting position of each frequency band and time band are chosen randomly, and the values in the locations covered by a band are set to 0.
```
import random
import numpy as np

# SpecAugment: applies time and frequency masking to a spectrogram for data augmentation
def spec_augment(spec: np.ndarray, num_mask=2, 
                 freq_masking_max_percentage=0.15, time_masking_max_percentage=0.3):

    spec = spec.copy()
    for i in range(num_mask):
        all_frames_num, all_freqs_num = spec.shape
        freq_percentage = random.uniform(0.0, freq_masking_max_percentage)
        
        num_freqs_to_mask = int(freq_percentage * all_freqs_num)
        f0 = np.random.uniform(low=0.0, high=all_freqs_num - num_freqs_to_mask)
        f0 = int(f0)
        spec[:, f0:f0 + num_freqs_to_mask] = 0

        time_percentage = random.uniform(0.0, time_masking_max_percentage)
        
        num_frames_to_mask = int(time_percentage * all_frames_num)
        t0 = np.random.uniform(low=0.0, high=all_frames_num - num_frames_to_mask)
        t0 = int(t0)
        spec[t0:t0 + num_frames_to_mask, :] = 0
    
    return spec
```





The snippet below is from preprocess.py, which splits the **.mp3** audio files of the Speech Accent Archive dataset into 5-second **.wav** files.

```
# Calculate the total number of samples for each clip
clip_samples = int(clip_duration * sr)
# Calculate the total number of clips that can be extracted from the audio signal
num_clips = int(np.floor(len(y) / clip_samples))
# Extract the clips and save them as WAV files
for i in range(num_clips):
    # Calculate the start and end sample indices for the current clip
    start_sample = i * clip_samples
    end_sample = (i + 1) * clip_samples
    # Extract the current clip from the audio signal
    clip = y[start_sample:end_sample]
    clip = librosa.util.normalize(clip)
```



# Neural Style Transfer

We take two audio files as input and compute a spectrogram from each. The spectrograms are stored as images and fed to the Neural Style Transfer (NST) model. To represent content and style, we use the VGG-19 model, whose weights remain frozen throughout the NST optimization process. Let C, S and G be the content, style and generated images respectively. G is initialized with the content image and then optimized using the cost function below. The optimized image is finally converted back to an audio signal using the inverse STFT.

The cost function is a weighted sum of the content cost and the style cost, with weights $\alpha$ and $\beta$ respectively:

$$J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)$$


#### Content Cost Function
One goal to aim for when performing NST is for the content in generated image G to match the content of image C. A method to achieve this is to calculate the content cost function, which will be defined as:

$$J_{content}(C,G) =  \frac{1}{4 \times n_H \times n_W \times n_C}\sum _{ \text{all entries}} (a^{(C)} - a^{(G)})^2\tag{1} $$

* Here, $n_H, n_W$ and $n_C$ are the height, width and number of channels of the hidden layer chosen, and appear in a normalization term in the cost. 
* Note that $a^{(C)}$ and $a^{(G)}$ are the 3D volumes corresponding to a hidden layer's activations. 
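
Equation (1) can be sketched in NumPy as follows. This is a minimal illustration, not the code from neural_style_transfer.py: the function name `content_cost` and the toy activation arrays are hypothetical, and the activations are assumed to have shape $(n_H, n_W, n_C)$.

```python
import numpy as np

def content_cost(a_C: np.ndarray, a_G: np.ndarray) -> float:
    """Content cost of Eq. (1) for activations of shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a_C.shape
    # Sum of squared differences over all entries, with the normalization term
    return float(np.sum((a_C - a_G) ** 2) / (4 * n_H * n_W * n_C))

# Toy activations (hypothetical values, not taken from VGG-19)
rng = np.random.default_rng(0)
a_C = rng.normal(size=(4, 4, 3))
a_G = a_C.copy()

# Identical activations give zero cost
print(content_cost(a_C, a_G))
# All-ones vs. all-zeros on a 2x2x1 volume: 4 / (4 * 2 * 2 * 1)
print(content_cost(np.ones((2, 2, 1)), np.zeros((2, 2, 1))))
```

Since G is initialized with the content image, the content cost starts at zero and grows only as the style term pulls G away from C.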

#### Gram Matrix 
We will compute the Style matrix by multiplying the "unrolled" activations matrix with its transpose as shown. The result is a matrix of dimension $(n_C,n_C)$ where $n_C$ is the number of filters (channels). 
* The value $G_{(gram)i,j}$ measures the correlation between the filters corresponding to the $i$th and $j$th channel i.e. how much do these features occur together.
* The diagonal elements $G_{(gram)ii}$ measure how "active" a filter $i$ is. 
* For example, suppose filter $i$ is detecting vertical textures in the image. Then $G_{(gram)ii}$ measures how common  vertical textures are in the image as a whole.
* If $G_{(gram)ii}$ is large, this means that the image has a lot of vertical texture. 
* In this way, the gram matrix represents the style of the image.

![gram_matrix](https://github.com/sksingh1202/human-accent-transfer-learning/blob/master/images/gram_matrix.png?raw=true)
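
The Gram matrix computation can be sketched in NumPy as below. This is an illustrative sketch under the assumption that the activations have shape $(n_H, n_W, n_C)$; the function name `gram_matrix` and the toy values are hypothetical and not taken from the repository.

```python
import numpy as np

def gram_matrix(a: np.ndarray) -> np.ndarray:
    """Gram matrix of shape (n_C, n_C) for activations of shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    # "Unroll" the spatial dimensions: one row per channel
    unrolled = a.reshape(n_H * n_W, n_C).T  # shape (n_C, n_H * n_W)
    # Multiply the unrolled matrix with its transpose
    return unrolled @ unrolled.T

# Toy example: channel 0 is active everywhere, channel 1 at a single position
a = np.zeros((2, 2, 2))
a[:, :, 0] = 1.0
a[0, 0, 1] = 2.0
G_gram = gram_matrix(a)
print(G_gram)
```

The diagonal entries show how active each channel is, and the off-diagonal entry is non-zero because both channels fire at position (0, 0).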

#### Style Cost Function
Now, for the style cost we will minimize the distance between the Gram matrix of the "style" image S and the Gram matrix of the "generated" image G. 
* Consider only a single hidden layer with activations $a^{[l]}$.  
* The corresponding style cost for this layer is defined as: 

$$J_{style}^{[l]}(S,G) = \frac{1}{4 \times {n_C}^2 \times (n_H \times n_W)^2} \sum _{i=1}^{n_C}\sum_{j=1}^{n_C}(G^{(S)}_{(gram)i,j} - G^{(G)}_{(gram)i,j})^2\tag{2} $$

* $G^{(S)}_{(gram)}$ is the Gram matrix of the "style" image.
* $G^{(G)}_{(gram)}$ is the Gram matrix of the "generated" image.
* We consider the weighted average of the style cost for 5 layers.
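
A minimal NumPy sketch of Equation (2), the weighted combination over layers, and the total cost $J(G)$ is given below. It is not the repository's implementation: the function names, the layer shapes, the equal layer weights of 0.2 and the default values of `alpha` and `beta` are all assumptions chosen for illustration.

```python
import numpy as np

def gram_matrix(a):
    """Gram matrix (n_C, n_C) of activations with shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    unrolled = a.reshape(n_H * n_W, n_C).T
    return unrolled @ unrolled.T

def layer_style_cost(a_S, a_G):
    """Style cost of Eq. (2) for a single layer."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2) / (4 * n_C ** 2 * (n_H * n_W) ** 2))

def style_cost(acts_S, acts_G, layer_weights):
    """Weighted combination of the per-layer style costs."""
    return sum(w * layer_style_cost(s, g)
               for w, s, g in zip(layer_weights, acts_S, acts_G))

def total_cost(j_content, j_style, alpha=1e-2, beta=1e2):
    """J(G) = alpha * J_content + beta * J_style."""
    return alpha * j_content + beta * j_style

# Five hypothetical layers with equal weights (assumed, weights sum to 1)
rng = np.random.default_rng(0)
shapes = [(8, 8, 4), (4, 4, 8), (4, 4, 8), (2, 2, 16), (2, 2, 16)]
acts_S = [rng.normal(size=s) for s in shapes]
acts_G = [rng.normal(size=s) for s in shapes]
layer_weights = [0.2] * 5

j_style = style_cost(acts_S, acts_G, layer_weights)
print(j_style, total_cost(0.0, j_style))
```

With weights that sum to 1, the weighted sum is the weighted average described above.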

For the code, refer to the neural_style_transfer.py file in our GitHub repository.

# Simulation Settings

### Transfer Learning
Transfer learning was performed on each dataset variant to generate five sets of weights, which are stored in the **weights** folder on [GitHub](https://github.com/sksingh1202/human-accent-transfer-learning). The files are:

*   **vgg19_fine_tuned.h5** - Weights fine-tuned on the [Speech Accent Dataset](https://www.kaggle.com/datasets/rtatman/speech-accent-archive), with the audio clips split into 5-second intervals.
*   **vgg19_fine_tuned_spec_augment.h5** - Weights fine-tuned on Mel spectrograms with SpecAugment applied.
*   **vgg19_fine_tuned_mel.h5** - Weights fine-tuned on Mel spectrograms.
*   **vgg19_fine_tuned_window_size.h5** - Weights fine-tuned with data augmentation over different time windows to address the time-frequency trade-off.
*   **vgg19_fine_tuned_original_18.h5** - Weights fine-tuned on the original [Arctic Dataset](http://festvox.org/cmu_arctic/).


### Neural style transfer experiment setup
The following four configurations were used:


*   Content weight: 1e-2, Style weight: 1
*   Content weight: 1e-2, Style weight: 1e2
*   Content weight: 1e-2, Style weight: 1e4
*   Content weight: 1e-2, Style weight: 1e6

Each of these configurations was used with VGG-19 with the pre-trained weights and with each of the 5 sets of weights obtained from fine-tuning, as described earlier. In addition, every simulation was run both with and without DTW as a preprocessing step.
This gives us 4 * 6 * 2 = 48 different simulations.
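
The simulation grid can be enumerated with a short sketch like the one below. The dictionary keys and the use of the string `"imagenet"` for the pre-trained weights are illustrative assumptions; the weight file names are the ones listed above.

```python
from itertools import product

content_weight = 1e-2
style_weights = [1, 1e2, 1e4, 1e6]
vgg_weights = [
    "imagenet",  # pre-trained weights
    "vgg19_fine_tuned.h5",
    "vgg19_fine_tuned_spec_augment.h5",
    "vgg19_fine_tuned_mel.h5",
    "vgg19_fine_tuned_window_size.h5",
    "vgg19_fine_tuned_original_18.h5",
]
use_dtw = [False, True]

# One entry per (style weight, VGG-19 weights, DTW on/off) combination
simulations = [
    {"content_weight": content_weight, "style_weight": s, "weights": w, "dtw": d}
    for s, w, d in product(style_weights, vgg_weights, use_dtw)
]
print(len(simulations))  # 48
```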

# Dynamic Time Warping

DTW is an algorithm that aligns two time series and is useful for synchronizing audio files that differ in tempo. By converting the audio files into Mel spectrograms, we can apply DTW to align the two representations and create a synchronized version of the audio. DTW uses dynamic programming to find the optimal alignment, warping one or both time series while minimizing a distance measure between them.
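
The dynamic programming idea can be sketched in plain Python as below. This is a textbook DTW on scalar sequences with absolute difference as the distance, given only for illustration. It is not the code in dynamic_time_warping.py, which operates on spectrogram frames; the function name `dtw` and the toy sequences are hypothetical.

```python
def dtw(x, y):
    """Return (total cost, warping path) aligning sequences x and y."""
    n, m = len(x), len(y)
    INF = float("inf")
    # D[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end to recover the optimal warping path
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(D[i - 1][j - 1], i - 1, j - 1),
                 (D[i - 1][j], i - 1, j),
                 (D[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return D[n][m], path

# The second sequence is a slowed-down version of the first
x = [0, 1, 2, 3]
y = [0, 0, 1, 2, 2, 3]
cost, path = dtw(x, y)
print(cost, path)
```

Each pair `(i, j)` in the path states that element `i` of the first sequence is aligned with element `j` of the second, which is the role played by the warping path `wp` in the snippets below.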



```
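# Assumed imports (not shown in the report); the alias P is taken to be NumPy
import numpy as P
import soundfile as sf
from scipy.signal import decimate

y1, fs = sf.read("./Dataset/american/arctic_a0001.wav")
y2, fs = sf.read("./Dataset/indian/arctic_a0001.wav")
# Add some simple padding: i1 and i2 are the first samples that exceed
# one third of the RMS level, i.e. the approximate onset of speech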
i1 = P.argmax( y1 > P.sqrt((y1**2).mean())/3 )
i2 = P.argmax( y2 > P.sqrt((y2**2).mean())/3 )
I = max(i1, i2)*2
z1 = y1[i1//5:(i1//5)*2]
y1 = P.hstack([z1]*((I-i1)//len(z1)) + [z1[:((I - i1)%len(z1))]] + [y1])
z2 = y2[i2//5:(i2//5)*2]
y2 = P.hstack([z2]*((I-i2)//len(z2)) + [z2[:((I - i2)%len(z2))]] + [y2])
print("Setting padding to {0:.2f} s".format(I/fs))
# Manually downsample by factor of 2
fs = fs//2
y1 = decimate(y1, 2, zero_phase=True)
y2 = decimate(y2, 2, zero_phase=True)
# Normalize loudness
v1 = P.sqrt((y1**2).mean())
v2 = P.sqrt((y2**2).mean())
y1 = y1/v1*.03
y2 = y2/v2*.03
```

This code block loads two audio files in WAV format and preprocesses them before synchronization.

Padding is prepended to both signals so that speech starts at the same offset in each. The onset of each signal is detected as the first sample that exceeds one third of its RMS level, and a short chunk of the leading low-level segment is repeated until the desired padding length is reached. Next, the code downsamples both signals by a factor of 2 using the **decimate** function, which halves the sample rate. Finally, the loudness of the two signals is normalized to the same RMS level.

The preprocessing step, especially normalization, is important for DTW to work effectively since it compares the shape of the two audio signals.



```
# Performing interpolation and warping
wp = wp[::-1, :]
y1_st, y1_end = wp[0, 0]*hop_size, wp[-1, 0]*hop_size
y2_st, y2_end = wp[0, 1]*hop_size, wp[-1, 1]*hop_size
y1 = y1[y1_st:y1_end]
y2 = y2[y2_st:y2_end]
wp[:, 0] = wp[:, 0] - wp[0,0]
wp[:, 1] = wp[:, 1] - wp[0,1]
wp_s = P.asarray(wp) * hop_size / fs
i, I = P.argsort(wp_s[-1, :])
x, y = close_points(
  P.array([wp_s[:,i]/wp_s[-1,i], wp_s[:,I]/wp_s[-1,I]]), s=1)
f = mono_interp(x, y, extrapolate=True)
yc,yo = (y1,y2) if i==1 else (y2, y1)
l_hop = 64
stft = librosa.stft(yc, n_fft=512, hop_length=l_hop)
z = len(yo)//l_hop + 1
t = P.arange(0, 1, 1/z)
time_steps = P.clip( f(t) * stft.shape[1], 0, None )
# Performing vocoder warping
warped_stft = variable_phase_vocoder(stft, time_steps, hop_length=l_hop)
```

The warping path **wp** is used to align the two audio files by cropping them and interpolating the time steps to create a synchronized version of the audio. The starting and ending indices for each audio file are calculated from **wp**, and the warping path is then shifted so that it starts at 0 in both coordinates. The two columns of the shifted path are ordered by their final time value, normalized, and used to build a monotonic mapping between the two time axes with **mono_interp**.

The interpolated time steps are then used to perform variable phase vocoding on the STFT (Short-Time Fourier Transform) of the signal being warped, using **variable_phase_vocoder**.

**Variable phase vocoding** (VPV) is a technique for modifying the timing of audio signals without changing their pitch or spectral content. It involves dividing the audio signal into overlapping short-time windows and converting each window into its corresponding complex-valued STFT representation.
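
To make the idea concrete, the sketch below resamples the frames of an STFT at arbitrary fractional frame positions while accumulating phase, following the standard phase-vocoder recipe. It is an assumption about what a function like **variable_phase_vocoder** might do, not the repository's implementation; the function name `sketch_variable_phase_vocoder` and the random test STFT are hypothetical.

```python
import numpy as np

def sketch_variable_phase_vocoder(stft, time_steps, hop_length):
    """Resample a complex STFT (n_bins, n_frames) at fractional frame positions."""
    n_bins, n_frames = stft.shape
    steps = np.clip(np.asarray(time_steps, dtype=float), 0, n_frames - 1)
    # Expected phase advance per hop for each frequency bin
    phi_advance = np.linspace(0, np.pi * hop_length, n_bins)
    phase_acc = np.angle(stft[:, 0])
    # Pad one zero frame so that reading frame k + 1 never runs out of bounds
    padded = np.pad(stft, [(0, 0), (0, 1)], mode="constant")
    out = np.zeros((n_bins, len(steps)), dtype=complex)
    for t, step in enumerate(steps):
        k = int(np.floor(step))
        left, right = padded[:, k], padded[:, k + 1]
        alpha = step - k
        # Linearly interpolate the magnitude between neighbouring frames
        mag = (1 - alpha) * np.abs(left) + alpha * np.abs(right)
        out[:, t] = mag * np.exp(1j * phase_acc)
        # Phase deviation from the expected advance, wrapped to [-pi, pi]
        dphase = np.angle(right) - np.angle(left) - phi_advance
        dphase -= 2 * np.pi * np.round(dphase / (2 * np.pi))
        phase_acc += phi_advance + dphase
    return out

# Hypothetical STFT with 257 bins (n_fft = 512) and 20 frames
rng = np.random.default_rng(0)
stft = rng.normal(size=(257, 20)) + 1j * rng.normal(size=(257, 20))
# Non-uniform time steps: slow at the start, faster towards the end
time_steps = np.linspace(0, 1, 30) ** 2 * (stft.shape[1] - 1)
warped = sketch_variable_phase_vocoder(stft, time_steps, hop_length=64)
print(warped.shape)
```

Because the time steps need not be evenly spaced, different parts of the signal can be stretched or compressed by different amounts, which is what the DTW-derived mapping requires.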



# Fine Tuning on Audio Representations

By utilizing pre-trained neural network models on large datasets such as ImageNet, similar problems with new datasets can be solved efficiently. The pre-trained models provide a wealth of knowledge that can be leveraged, reducing the computation cost and training time on a new dataset.

To implement this approach, the entire pre-trained model is first **frozen** and a new **classification layer** is added. Only this new layer is then trained for a few epochs, so that its randomly initialized weights adapt to the task before the rest of the network is updated. This is shown in the following code snippet:

```
import tensorflow as tf

# IMG_SHAPE, class_names, train_dataset, validation_dataset and initial_epochs
# are assumed to be defined elsewhere in FineTune.py.
# Assumption: preprocess_input is the VGG-19 preprocessing function.
preprocess_input = tf.keras.applications.vgg19.preprocess_input

# Load VGG-19 without its classification head and freeze all of its weights
base_model = tf.keras.applications.VGG19(include_top=False,
                                  input_shape=IMG_SHAPE,
                                  weights='imagenet')
base_model.trainable = False

# New classification head: global average pooling followed by a softmax layer
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
prediction_layer = tf.keras.layers.Dense(len(class_names), activation='softmax')

inputs = tf.keras.Input(shape=IMG_SHAPE)
x = preprocess_input(inputs)
x = base_model(x, training=False)
x = global_average_layer(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = prediction_layer(x)
model = tf.keras.Model(inputs, outputs)

base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=base_learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train only the new head; the VGG-19 base stays frozen
history = model.fit(train_dataset,
      epochs=initial_epochs,
      validation_data=validation_dataset)
```


Subsequently, all the layers are **unfrozen** and **trained together** with the new layer to fine-tune the entire model. This process allows the model to learn **task-specific** features while retaining the previously learned features, making it particularly useful when working with limited resources or small datasets. This can be seen in the following code snippet:

```
base_model.trainable = True

model.compile(optimizer = tf.keras.optimizers.RMSprop(learning_rate=base_learning_rate/10),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

fine_tune_epochs = 10
total_epochs =  initial_epochs + fine_tune_epochs

history_fine = model.fit(train_dataset,
                         epochs=total_epochs,
                         initial_epoch=history.epoch[-1],
                         validation_data=validation_dataset)
```




#  Existing Work

The reconstruction of historic artwork drew many artists and scientists to the problem of transferring style between images. Early work performed style transfer without neural networks, using image-based artistic rendering techniques. These include stroke-based methods (rendering strokes of the new style on the required content) and region-based methods (segmenting the main image in a divide-and-conquer fashion). Limitations such as primitive transfer and a narrow range of styles motivated the move to neural networks.

We want to transfer style in audio inputs for the previously mentioned applications. Very little material is available on audio style transfer, and the existing work also applies image style transfer techniques to audio properties. The spectrogram is an extremely useful tool for capturing the time-varying frequency properties of audio signals. We therefore apply convolutional neural networks (CNNs) to the generated spectrogram images.

Gatys et al. first introduced the NST algorithm using a CNN, proposing that content and style can be represented by the activations of internal layers.

Other methods use GANs to obtain mixed-style images, such as the work of Chen et al., which produces cartoonized images using a GAN. We did not pursue this direction since training and experimenting with GANs was not feasible with the time and resources available to us.

We used VGG19 as our CNN model because it is small enough that choosing the layers for the content and style representations is straightforward, and because it is trainable within our resource constraints.

 
