# Introduction

A recently published method for audio style transfer [1] showed how to extend image style transfer to audio. It synthesizes audio "content" and "style" independently from the magnitudes of a short-time Fourier transform (STFT), using shallow convolutional networks with randomly initialized filters, and then recovers phase iteratively with the Griffin-Lim algorithm. In this work [2], we explore whether a time-domain audio signal can be optimized directly, which removes the phase reconstruction step and opens up possibilities for real-time applications and higher-quality synthesis. We explore a variety of style transfer processes on neural networks that operate directly on time-domain audio signals and demonstrate one such network capable of audio stylization.
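
To make the "style" statistic of the spectrogram-based approach concrete, here is a minimal NumPy sketch, not the code used by either model in this notebook. It assumes the style representation is the Gram matrix of ReLU features produced by one layer of random 1-D convolutions over time, with frequency bins treated as channels. The helper names `random_conv_features` and `gram_matrix`, the filter count, the initialization scale, and the random toy spectrograms are all illustrative assumptions.

```python
import numpy as np


def random_conv_features(mag, n_filters=64, k_w=3, seed=0):
    """Apply one layer of random 1-D convolutions over time, then a ReLU.

    mag: magnitude spectrogram of shape (n_bins, n_frames); bins act as channels.
    Returns features of shape (n_filters, n_frames - k_w + 1).
    """
    rng = np.random.default_rng(seed)
    n_bins, _ = mag.shape
    # Assumed initialization scale; the published method may use a different one.
    std = np.sqrt(2.0 / (n_bins * k_w))
    kernels = rng.normal(0.0, std, size=(n_filters, n_bins, k_w))
    # Sliding windows over the time axis: (n_bins, n_frames - k_w + 1, k_w).
    windows = np.lib.stride_tricks.sliding_window_view(mag, k_w, axis=1)
    # Cross-correlate every filter with every window, summing over bins and taps.
    feats = np.einsum('fck,ctk->ft', kernels, windows)
    return np.maximum(feats, 0.0)


def gram_matrix(feats):
    """Time-averaged correlations between filter responses."""
    return feats @ feats.T / feats.shape[1]


# Toy stand-ins for |STFT| of the style and synthesized signals.
rng = np.random.default_rng(1)
style_mag = np.abs(rng.normal(size=(129, 50)))
synth_mag = np.abs(rng.normal(size=(129, 50)))

# The same seed gives both signals the same fixed random filters.
g_style = gram_matrix(random_conv_features(style_mag))
g_synth = gram_matrix(random_conv_features(synth_mag))

# Squared difference between Gram matrices: the quantity an optimizer would reduce.
style_loss = np.sum((g_synth - g_style) ** 2)
```

Because the Gram matrix averages over time, it records which filters tend to fire together but not when they fire, which is why it behaves as a texture-like "style" descriptor rather than a "content" descriptor.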

In [None]:
import numpy as np
import librosa
import warnings

from librosa.display import specshow
from IPython.display import Audio, display
from audio_style_transfer.models import timedomain, uylanov

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

In [None]:
# List the devices TensorFlow can see, to check whether a GPU is available.
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

In [None]:
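def plot_spec(audio):
    """Plot the log-magnitude (dB) STFT spectrogram of a mono signal."""
    # Take the magnitude first: amplitude_to_db expects real-valued input.
    D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
    specshow(D)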

In [None]:
sr = 44100
content = './wavs/corpus/johntejada-1.wav'
style = './wavs/target/beat-box-2.wav'

#content = "./wavs/songs/imperial.mp3"
#style = "./wavs/songs/usa.mp3"

In [None]:
style_audio, _ = librosa.core.load(style, sr=sr)
plot_spec(style_audio)
display(Audio(style_audio, rate=sr))

In [None]:
content_audio, _ = librosa.core.load(content, sr=sr)
plot_spec(content_audio)
display(Audio(content_audio, rate=sr))

In [None]:
timedomain.run(
    content,
    style,
    'timedomain_out.wav',
    n_fft=2048,          # 512 to sr / 2. Higher is better quality but is slower.
    n_layers=1,          # 1 to 3. Higher is better quality but is slower.
    n_filters=4096,      # 512 - 4096. Higher is better quality but is slower.
    hop_length=256,      # 256 to n_fft / 2. The lower this value, the better the temporal resolution.
    alpha=0.0005,        # 0.0001 to 0.01. The higher this value, the more of the original "content" bleeds through.
    k_w=3,               # 3 to 5. The higher this value, the more complex the patterns it can synthesize.
    iterations=200,      # 100 to 1000. Higher is better quality but is slower.
    stride=1,            # 1 to 3. Lower is better quality but is slower.
    sr=sr,
)
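
The comments on `n_fft` and `hop_length` above describe a trade-off between frequency and temporal resolution. The sketch below puts numbers on that trade-off for the settings used in this notebook. `stft_resolution` is a hypothetical helper, not part of `audio_style_transfer`, and the frame count follows the convention of `librosa.stft` with `center=True`; whether `timedomain.run` frames the signal the same way is an assumption.

```python
def stft_resolution(sr, n_fft, hop_length, n_samples):
    """Summarize the time/frequency grid implied by a set of STFT parameters."""
    return {
        # Spacing between adjacent frequency bins, in Hz.
        'freq_res_hz': sr / n_fft,
        # Duration covered by one analysis window, in milliseconds.
        'window_ms': 1000.0 * n_fft / sr,
        # Time step between successive frames, in milliseconds.
        'hop_ms': 1000.0 * hop_length / sr,
        # Number of non-negative frequency bins of a real-input FFT.
        'n_bins': 1 + n_fft // 2,
        # Frame count for a centered STFT (librosa's default convention).
        'n_frames': 1 + n_samples // hop_length,
    }


# One second of audio with the parameters passed to timedomain.run above.
res = stft_resolution(sr=44100, n_fft=2048, hop_length=256, n_samples=44100)
for name, value in res.items():
    print(name, value)
```

Doubling `n_fft` halves the bin spacing but doubles the window duration, while halving `hop_length` doubles the number of frames to process.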

In [None]:
synth_audio, _ = librosa.core.load('timedomain_out.wav', sr=sr)
plot_spec(synth_audio)
display(Audio(synth_audio, rate=sr))

In [None]:
uylanov.run(
    content,
    style,
    'uly_out.wav',
    alpha=0.001,
    iterations=300,
    phase_iterations=300,
    sr=sr,
)

In [None]:
synth_audio, _ = librosa.core.load('uly_out.wav', sr=sr)
plot_spec(synth_audio)
display(Audio(synth_audio, rate=sr))