# Timbre transfer with RAVE

NOTE: This notebook is heavily (almost completely) adapted from a notebook created by Teresa Pelinski (2023).

RAVE is a Real-time Audio Variational autoEncoder (https://github.com/acids-ircam/RAVE) released by Caillon and Esling (ACIDS IRCAM) in November 2021. You can read the paper here: https://arxiv.org/abs/2111.05011. RAVE is a particularly light model that can generate audio in real time on the CPU, and even on embedded systems with low computational power such as the Raspberry Pi (here is a video: https://youtu.be/jAIRf4nGgYI). Still, training this model is computationally expensive: in the original paper, the authors trained for 3M steps, which took six days on a TITAN V GPU. 

Today we will use RAVE to perform timbre transfer on sounds downloaded from freesound.org (although you can use sounds from other sources if you prefer!). First, we will download the pretrained RAVE models from the internet. Then we will perform timbre transfer on some sample sounds and experiment with biasing the latent space. Finally, we will see some audio transformations we can do to combine and modify sounds. 

The task for this lab is to generate a 30s–1min composition by combining sounds. These can be sounds downloaded from freesound, sounds generated with RAVE, sounds modified with the techniques below, or a mix of original and timbre-transferred sounds. 


\* If you are interested in using RAVE for performing, the real-time implementation runs in MaxMSP and can be downloaded here: https://github.com/acids-ircam/nn_tilde

\* The RAVE generation code in this notebook is based on https://colab.research.google.com/github/hdparmar/AI-Music/blob/main/Latent_Soundings_workshop_RAVE.ipynb


## 1 - Setup
Make sure you are running this notebook in the `dmlap` conda environment.

### Installs and imports

From the terminal, first make sure that your DMLAP conda environment is active, e.g. 
```
conda activate dmlap
```
Then install the required dependencies with
```
pip install acids-rave 
conda install -c conda-forge ffmpeg
pip install wget
```

In [None]:
import torch
import IPython.display as ipd
import librosa as li
import soundfile as sf
import matplotlib.pyplot as plt
import numpy as np
import wget
import os
import sys
from scipy import signal

## 2 - Timbre transfer
### Download pretrained models
Some info on the pretrained models is available here: https://acids-ircam.github.io/rave_models_download

In [None]:
pt_path = "./models/rave-pretrained-models" # folder where pretrained models will be downloaded
os.makedirs(pt_path, exist_ok=True) # create the folder (and any missing parent folders) if it doesn't exist
    
def bar_progress(current, total, width=80): # progress bar for wget
    progress_message = "Downloading: %d%% [%d / %d] bytes" % (current / total * 100, current, total)
    # Don't use print() as it will print in new line every time.
    sys.stdout.write("\r" + progress_message)
    sys.stdout.flush()

pretrained_models = ["vintage", "percussion", "nasa", "darbouka_onnx", "VCTK"] # list of available pretrained models to download from https://acids-ircam.github.io/rave_models_download (you can select fewer if you want to spend less time on this cell)

for model in pretrained_models: # download pretrained models and save them in pt_path
    if not os.path.exists(os.path.join(pt_path, f"{model}.ts")): # only download if not already downloaded
        print(f"Downloading {model}.ts...")
        wget.download(f"https://play.forum.ircam.fr/rave-vst-api/get_model/{model}",f"{pt_path}/{model}.ts", bar=bar_progress)
    else:
        print(f"{model}.ts already downloaded")


### Load an audio file and listen to it
We can load an audio file using librosa (`li`). `li.load` returns the audio as an array in which each item is the amplitude at one time sample (together with the sample rate, which we discard here). You can convert from time in samples to time in seconds using `time = np.arange(0, len(input_data))/sample_rate`.

In [None]:
import os
sample_rate = 48000 # sample rate of the audio

input_file = "./sounds/trumpet.wav"  
sound_format = os.path.splitext(input_file)[1]
input_data = li.load(input_file, sr=sample_rate)[0] # load input audio

time = np.arange(0, len(input_data)) / sample_rate # to obtain the time in seconds, we need to divide the sample index by the sample rate
plt.plot(time,input_data)
plt.xlabel("Time (seconds)")
plt.ylabel("Amplitude")
plt.title(input_file.split("/")[-1])
plt.grid()

ipd.display(ipd.Audio(data=input_data, rate=sample_rate)) # display audio widget

### Load the model and generate
We can now load a pretrained model using `torch.jit.load` and encode the input audio into a latent representation. With the VCTK model, the input audio is encoded into a latent space trained on the [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) dataset, which consists of speech from 109 native speakers of English with various accents. We can then decode the latent representation and synthesise audio from it. The result sounds as if the original sound were made of speech sounds (timbre transfer).

In [None]:
generated_path = "generated" # folder where generated audio will be saved
if not os.path.exists(generated_path): # create the folder if it doesn't exist
    os.mkdir(generated_path)
    
pretrained_model =  "VCTK" # select the pretrained model to use. VCTK 

model = torch.jit.load(f"{pt_path}/{pretrained_model}.ts" ).eval() # load model
torch.set_grad_enabled(False) # disable gradients
    
x = torch.from_numpy(input_data).reshape(1, 1, -1) # convert audio to tensor and add batch and channel dimensions
z = model.encode(x) # encode audio into latent representation
# synthesize audio from latent representation
y = model.decode(z).numpy() # decode latent representation and convert tensor to numpy array
y = y[:,0,:].reshape(-1) # remove batch and channel dimensions


# the generated signal starts with a short silence (a delay) and is slightly longer than the input; this function compensates for both
def align_generated_signal(input_data, y, thresh=0.05):
    # find the first sample whose magnitude reaches thresh and rotate the signal so that it starts there
    mask = np.abs(y) < thresh
    if mask.any() and thresh >= 0:
        idx = np.argmax(~mask)
        y = np.roll(y, -idx)
    # trim (or zero-pad) to match the input length; the output is usually a bit longer than the input
    n = len(input_data)
    y = np.pad(y, (0, max(0, n - len(y))))[:n]
    return y
y = align_generated_signal(input_data, y)

# save output audio
output_file =f'{generated_path}/{input_file.replace(".wav", f"_{pretrained_model}_generated.wav").split("/")[-1]}'
print(output_file)

sf.write(output_file, y, sample_rate)
ipd.Audio(output_file) # display audio widget
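
To see what the alignment step does without loading a model, here is a minimal sketch on a synthetic signal. It assumes the decoded audio is simply the input preceded by some silence and followed by a few extra samples, which is only a rough stand-in for what RAVE produces. `align_to_input` is a hypothetical helper, not part of RAVE; unlike the `np.roll` used in the cell above, it discards the leading silence instead of moving it to the end.

```python
import numpy as np

def align_to_input(reference, y, thresh=0.05):
    """Drop leading samples quieter than `thresh`, then trim or zero-pad to len(reference)."""
    loud = np.flatnonzero(np.abs(y) >= thresh)
    if loud.size > 0:
        y = y[loud[0]:]                      # start at the first sample that reaches the threshold
    n = len(reference)
    if len(y) < n:
        y = np.pad(y, (0, n - len(y)))       # zero-pad when the output is too short
    return y[:n]                             # trim when it is too long

sample_rate = 48000
t = np.arange(sample_rate) / sample_rate
reference = 0.5 * np.sin(2 * np.pi * 220 * t + np.pi / 2)  # 1 s tone that starts at its peak

# hypothetical stand-in for a decoder: 1000 samples of silence before the signal, 500 after
delayed = np.concatenate([np.zeros(1000), reference, np.zeros(500)])

aligned = align_to_input(reference, delayed)
print(len(delayed), len(aligned))
```

A lower `thresh` keeps quieter attacks but is more easily fooled by low-level noise at the start of the generated audio.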

We can compare the waveform and spectrogram of the input and output sounds.

In [None]:
from scipy import signal
f1, t1, Zxx1 = signal.stft(input_data, fs=sample_rate, nperseg=2048, noverlap=512)
f2, t2, Zxx2 = signal.stft(y, fs=sample_rate, nperseg=2048, noverlap=512)

fig, axs = plt.subplots(2, 2,figsize=(10,5), sharex=True)

axs[0,0].plot(time,input_data)
axs[0,0].set_ylabel("Amplitude")
axs[0,0].grid()
axs[0,0].set_title(input_file.split("/")[-1])
axs[1,0].plot(time,y)
axs[1,0].set_ylabel("Amplitude")
axs[1,0].set_xlabel("Time (seconds)")
axs[1,0].grid()
axs[1,0].set_title(output_file.split("/")[-1])

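# take the magnitude of the complex STFT before converting to dB, so louder bins map to higher values
axs[0,1].pcolormesh(t1, f1[:100], li.amplitude_to_db(np.abs(Zxx1[:100,:]),
                                                     ref=np.max))
axs[1,1].pcolormesh(t2, f2[:100], li.amplitude_to_db(np.abs(Zxx2[:100,:]),
                                                     ref=np.max))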
axs[1,1].set_xlabel("Time (seconds)")
axs[0,1].set_title("STFT")
axs[0,1].set_ylabel("Frequency (Hz)")
axs[1,1].set_ylabel("Frequency (Hz)")
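
The colour scale of a spectrogram like this is usually expressed in decibels relative to the loudest bin. As a rough illustration of that conversion, the sketch below computes one magnitude spectrum with NumPy's FFT and converts it to dB by hand. `to_db` is a hypothetical helper that only approximates what `li.amplitude_to_db(..., ref=np.max)` does, and the -80 dB floor is an assumption chosen for this example.

```python
import numpy as np

def to_db(magnitude, floor_db=-80.0):
    """Convert magnitudes to decibels relative to the largest value, clipped at floor_db."""
    ref = np.max(magnitude)
    db = 20.0 * np.log10(np.maximum(magnitude, 1e-10) / ref)
    return np.maximum(db, floor_db)

sample_rate = 48000
n_fft = 2048
t = np.arange(n_fft) / sample_rate
frame = np.sin(2 * np.pi * 440 * t) * np.hanning(n_fft)   # one windowed frame of a 440 Hz tone

magnitude = np.abs(np.fft.rfft(frame))                    # magnitude first, then dB
freqs = np.fft.rfftfreq(n_fft, d=1 / sample_rate)
db = to_db(magnitude)

peak_freq = freqs[np.argmax(db)]
print(f"loudest bin: {peak_freq:.1f} Hz, bin spacing: {sample_rate / n_fft:.1f} Hz")
```

With 2048-sample frames at 48 kHz the bins are about 23 Hz apart, so the first 100 bins shown in the plots cover roughly 0 to 2.3 kHz.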



## 3 - Sound transformations


### Alter latent representation
We can now modify the latent coordinates of the input file to alter the representation. We can start by adding a constant bias (a displacement) to the coordinates in the latent space. Note that each RAVE model has a different number of coordinates for its latent space.

In [None]:
print(z.shape) # the second dimension is the latent dimension; in this case there are 8 latent dimensions

d0 = 2.2  # change in latent dimension 0
d1 = 0.4  # change in latent dimension 1
d2 = 1    # change in latent dimension 2
d3 = 0.5  # change in latent dimension 3
# we leave dimensions 4-7 unchanged

z_modified = torch.clone(z) # copy latent representation
# bias latent dimensions (displace each sample representation by a constant value)
z_modified[:, 0] += torch.linspace(d0,d0, z.shape[-1])
z_modified[:, 1] += torch.linspace(d1,d1, z.shape[-1])
z_modified[:, 2] += torch.linspace(d2,d2, z.shape[-1])
z_modified[:, 3] += torch.linspace(d3,d3, z.shape[-1])

y_latent_1 = model.decode(z_modified).numpy() # decode latent representation and convert tensor to numpy array

y_latent_1 = y_latent_1[:,0,:].reshape(-1) # remove batch and channel dimensions
y_latent_1 = align_generated_signal(input_data, y_latent_1) # align 
output_file = f'{generated_path}/{input_file.replace(".wav", f"_{pretrained_model}_latent_generated_1.wav").split("/")[-1]}'
sf.write(output_file,y_latent_1, sample_rate) # save output audio

ipd.Audio(output_file) # display audio widget
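
The latent representation has shape (batch, latent dimension, time frame), so adding a value to `z[:, k]` shifts dimension `k` by the same amount at every frame. The sketch below shows this on a random NumPy array used as a stand-in for `z`; the shape `(1, 8, 50)` and the helper `bias_latent` are assumptions made for illustration, not part of RAVE.

```python
import numpy as np

def bias_latent(z, offsets):
    """Return a copy of z (batch, dims, frames) with a constant offset added per dimension.

    `offsets` maps a latent dimension index to the value added at every frame.
    """
    z_biased = z.copy()
    for dim, offset in offsets.items():
        z_biased[:, dim, :] += offset
    return z_biased

rng = np.random.default_rng(0)
z_demo = rng.normal(size=(1, 8, 50))   # stand-in for an encoded sound: 8 dimensions, 50 frames

z_demo_biased = bias_latent(z_demo, {0: 2.2, 1: 0.4, 2: 1.0, 3: 0.5})

# the per-dimension mean moves by the offset; dimensions 4-7 are untouched
shift = z_demo_biased.mean(axis=(0, 2)) - z_demo.mean(axis=(0, 2))
print(np.round(shift, 3))
```

The same indexing works on a `torch` tensor, which is why the cell above can add to `z_modified[:, 0]` directly.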

Instead of using a constant (a bias) to displace the representation of every sample in the latent space, we can use a function so that we "navigate" the latent space. For example, we can use a sinusoidal function so that the representation oscillates around the originally encoded one:

In [None]:
z_modified = torch.clone(z) # copy original latent representation

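# bias latent dimensions with a slow sinusoidal function
# the latent sequence has far fewer frames than the audio has samples, so we estimate its frame rate from the two lengths
latent_rate = sample_rate * z.shape[-1] / x.shape[-1] # latent frames per second
freq = 1.0 # oscillation frequency in Hz; it must stay below latent_rate/2 (a 440 Hz oscillation would alias)
t = torch.arange(z.shape[-1]) / latent_rate # time of each latent frame in seconds
for idx in range(0, z.shape[1]): # for each latent dimension
    z_modified[:, idx] += torch.sin(2*np.pi*freq*t)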

y_latent_2 = model.decode(z_modified).numpy() # decode latent representation and convert tensor to numpy array
y_latent_2 = y_latent_2[:,0,:].reshape(-1) # remove batch and channel dimensions
#print(abs(len(input_data) - len(y_latent_2)))
#y_latent_2 = y_latent_2[abs(len(input_data) - len(y_latent_2))] # trim to match input length
y_latent_2 = align_generated_signal(input_data, y_latent_2) # align 
output_file = f'{generated_path}/{input_file.replace(".wav", f"_{pretrained_model}_latent_generated_2.wav").split("/")[-1]}'
sf.write(output_file,y_latent_2, sample_rate) # save output audio

ipd.Audio(output_file) # display audio widget
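
When choosing an oscillation frequency, it helps to remember that the latent sequence is sampled far more slowly than the audio. The sketch below assumes a compression ratio of 2048 audio samples per latent frame, a value chosen purely for illustration (in the notebook you could estimate it from the lengths of `x` and `z`). `latent_sine` is a hypothetical helper that refuses frequencies the latent frame rate cannot represent.

```python
import numpy as np

sample_rate = 48000
compression_ratio = 2048                        # assumed audio samples per latent frame
latent_rate = sample_rate / compression_ratio   # latent frames per second
n_frames = 94                                   # about 4 s of audio at this rate

def latent_sine(freq_hz, n_frames, latent_rate, depth=1.0):
    """Sine wave sampled once per latent frame; freq_hz must be below latent_rate / 2."""
    if freq_hz >= latent_rate / 2:
        raise ValueError(f"{freq_hz} Hz cannot be represented at {latent_rate:.1f} frames/s")
    t = np.arange(n_frames) / latent_rate       # time of each frame in seconds
    return depth * np.sin(2 * np.pi * freq_hz * t)

lfo = latent_sine(0.5, n_frames, latent_rate)   # one full oscillation every 2 seconds
print(f"latent rate: {latent_rate:.2f} frames/s, Nyquist limit: {latent_rate / 2:.2f} Hz")
```

Under this assumption only oscillations slower than about 11.7 Hz are meaningful, so latent modulation behaves like a slow LFO, not like an audio-rate tone.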


### Mix sounds (sum sources)

In [None]:
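mixed_output = y + input_data*0.5 # sum the generated audio and the input (input at half amplitude)
output_file = f'{generated_path}/{input_file.replace(".wav", f"_{pretrained_model}_mixed.wav").split("/")[-1]}' # use a new file name so earlier outputs are not overwritten
sf.write(output_file,mixed_output, sample_rate)
ipd.Audio(output_file)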
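
Summing two signals can push samples outside the [-1, 1] range, which typically clips when the result is written to a PCM WAV file. The sketch below shows one way to guard against that; `mix` is a hypothetical helper, and scaling by the peak is only one of several possible normalisation choices.

```python
import numpy as np

def mix(a, b, gain_a=1.0, gain_b=0.5, peak=1.0):
    """Sum two mono signals with gains, zero-padding the shorter one,
    and scale the result down if its largest sample exceeds `peak`."""
    n = max(len(a), len(b))
    out = np.zeros(n)
    out[:len(a)] += gain_a * np.asarray(a)
    out[:len(b)] += gain_b * np.asarray(b)
    max_abs = np.max(np.abs(out)) if n > 0 else 0.0
    if max_abs > peak:
        out *= peak / max_abs
    return out

sample_rate = 48000
t = np.arange(sample_rate) / sample_rate
tone_a = 0.9 * np.sin(2 * np.pi * 220 * t)
tone_b = 0.9 * np.sin(2 * np.pi * 220 * t[: sample_rate // 2])  # half as long, same phase

mixed = mix(tone_a, tone_b)
print(len(mixed), float(np.max(np.abs(mixed))))
```

Because the shorter signal is zero-padded, this also works when the two sounds have different lengths.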

### Sound collage

To generate your final composition, you should combine various sound extracts. To do this, you can cut excerpts from audio files, pass them through RAVE and combine them into a single collage audio file.

In [None]:
# concatenate several sounds -- example using a for loop

# Segments of audio to concatenate.
# The format is:
#             (filename, amplitude, model,  start_sample, end_sample)
# Make sure that end_sample is larger than start_sample
segments = [("sounds/trumpet.wav", 1.0, "VCTK", sample_rate, 4*sample_rate),
            ("sounds/happypiano.wav", 1.0, "VCTK", 0, 4*sample_rate),
            ("sounds/happypiano.wav", 1.0, "darbouka_onnx", 0, 4*sample_rate),
            ("sounds/violin-scale.wav", 1.0, "VCTK", 0, 4*sample_rate),
            ("sounds/violin-scale.wav", 1.0, "percussion", 0, 4*sample_rate),
            ]

outputs = [] # here we will store the output audio
inputs = []  # here we will store the input audio

for index, segment in enumerate(segments):
    input_file, amp, model_name, start_sample, end_sample = segment
    input_data = li.load(input_file, sr=sample_rate)[0][int(start_sample):int(end_sample)] # load input audio and cut from start to end sample
    input_data *= amp # Set amplitude
    inputs = np.append(inputs,input_data) # add excerpt to inputs array
    
    # load model
    model = torch.jit.load(f"{pt_path}/{model_name}.ts").eval() # load model
    torch.set_grad_enabled(False)
        
    # encode input audio to latent representation
    x = torch.from_numpy(input_data).reshape(1, 1, -1)
    z = model.encode(x)

    # synthesize audio from latent representation
    y = model.decode(z).numpy()
    y = y[:,0,:].reshape(-1) # remove batch and channel dimensions

    #y = y[abs(len(input_data)- len(y)):] # trim to match input length
    y = align_generated_signal(input_data, y) # Use this as an alternative to align also start, you may want to play with the thresh optional parameter
    
    outputs = np.append(outputs,y) # append to output array

input_file = f'{generated_path}/input_collage.wav'
output_file = f'{generated_path}/output_collage.wav'
sf.write(input_file, inputs, sample_rate) # save the concatenated input excerpts
sf.write(output_file,outputs+0.5*inputs, sample_rate) # save output audio (sum of input and output audio, with the input at lower volume)
ipd.Audio(output_file)
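
Concatenating excerpts end to end can produce audible clicks where one sound stops abruptly and the next begins. One common remedy is a short crossfade between neighbouring excerpts. The sketch below is a minimal version using a linear fade; `crossfade_concat` and the 50 ms fade length are assumptions for illustration, and the synthetic tones stand in for the excerpts loaded above.

```python
import numpy as np

def crossfade_concat(clips, fade_samples):
    """Concatenate mono clips, overlapping neighbours by `fade_samples` with a linear crossfade."""
    out = np.asarray(clips[0], dtype=float).copy()
    fade_in = np.linspace(0.0, 1.0, fade_samples)
    fade_out = 1.0 - fade_in
    for clip in clips[1:]:
        clip = np.asarray(clip, dtype=float)
        if fade_samples > len(out) or fade_samples > len(clip):
            raise ValueError("fade is longer than one of the clips")
        split = len(out) - fade_samples
        # blend the tail of the collage so far with the head of the next clip
        overlap = out[split:] * fade_out + clip[:fade_samples] * fade_in
        out = np.concatenate([out[:split], overlap, clip[fade_samples:]])
    return out

sample_rate = 48000
t = np.arange(sample_rate) / sample_rate
clips = [0.5 * np.sin(2 * np.pi * f * t) for f in (220, 330, 440)]  # three 1 s tones

fade = int(0.05 * sample_rate)               # 50 ms crossfade
collage = crossfade_concat(clips, fade)
print(len(collage) / sample_rate, "seconds")
```

Each crossfade shortens the collage by the fade length, which is worth keeping in mind when aiming for a specific total duration.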
    

In [None]:
sample_rate