<a href="https://colab.research.google.com/github/jameit/Applied-AI-Technologies/blob/StyleChanger/alishdipani_audio_torch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Neural Style Transfer of Audio in PyTorch
=========================================

Neural style transfer is really interesting, and there have been some great applications of it. It aims to take the 'style' from one image and change a 'content' image to match that style. Here's an example: this image has been converted to look like it was painted by Van Gogh.

![photo](https://camo.githubusercontent.com/974884c2fb949b365c3f415b3712d2cac04a35f7/68747470733a2f2f692e696d6775722e636f6d2f575771364931552e6a7067)


But so far it hasn't really been applied to audio. So this week I explored the idea of applying neural style transfer to audio. To be frank, the results were less than stellar but I'm hoping to keep working on this in the future. 

# Install

In [None]:
import torch
import os
from IPython.display import Audio
from PIL import Image
import matplotlib.pyplot as plt
import copy

In [None]:
!pip install youtube-dl

Collecting youtube-dl
  Downloading https://files.pythonhosted.org/packages/32/47/a4442e3bd6f13013c0c38a5b16576e9d69da14d09b1ef00a9c0915e75b3e/youtube_dl-2021.5.16-py2.py3-none-any.whl (1.9MB)
ERROR: Operation cancelled by user

# Build dataset

For this exercise, I'm going to be using clips from the Joe Rogan podcast. I'm trying to make [Joe Rogan](https://en.wikipedia.org/wiki/Joe_Rogan), from the [Joe Rogan podcast](http://podcasts.joerogan.net/), sound like [Joey Diaz](https://en.wikipedia.org/wiki/Joey_Diaz), from the [Church of Whats Happening Now](https://www.youtube.com/channel/UCv695o3i-JmkUB7tPbtwXDA). Joe Rogan already does a pretty good [impression of Joey Diaz](https://www.youtube.com/watch?v=SLolljsbbFs), but I'd like to improve his impression using deep learning.

First I'm going to download the YouTube videos. There's a neat trick mentioned on GitHub that allows you to download small segments of YouTube videos. That's handy because I don't want to download the entire video.

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))

Saving data_chopin.wav to data_chopin.wav
Saving data_futurama.wav to data_futurama.wav
User uploaded file "data_chopin.wav" with length 1764126 bytes
User uploaded file "data_futurama.wav" with length 1765750 bytes


In [None]:
content_audio_name = 'data_chopin.wav'
style_audio_name = 'data_futurama.wav'

In [None]:
# download the first `minute_mark` minutes of a YouTube video's audio using youtube-dl and ffmpeg
# adapted from: https://github.com/ytdl-org/youtube-dl/issues/622#issuecomment-162337869
def download_from_url_ffmpeg(url, output, minute_mark = 1):

  # remove a stale output file so ffmpeg does not stop to ask about overwriting it
  if os.path.exists(output):
    os.remove(output)

  # cmd = 'ffmpeg -loglevel warning -ss 0 -i $(youtube-dl -f 22 --get-url https://www.youtube.com/watch?v=mMZriSvaVP8) -t 11200 -c:v copy -c:a copy react-spot.mp4'
  # youtube-dl resolves the direct audio stream URL; ffmpeg then reads only the first minute_mark*60 seconds
  cmd = 'ffmpeg -loglevel warning -ss 0 -i $(youtube-dl -f bestaudio --get-url '+str(url)+') -t '+str(minute_mark*60)+' '+str(output)
  os.system(cmd)

  return os.path.join(os.getcwd(), output)


url = 'https://www.youtube.com/watch?v=-xY_D8SMNtE'
content_audio_name = download_from_url_ffmpeg(url, 'jre.wav')
url = 'https://www.youtube.com/watch?v=-l88fMJcvWE'
style_audio_name = download_from_url_ffmpeg(url, 'joey_diaz.wav')

In [None]:
Audio(style_audio_name)

In [None]:
Audio(content_audio_name)

# Loss

There are two types of loss for this:

1. Content loss. A lower value means that the output audio sounds more like Joe Rogan.

2. Style loss. A lower value means that the output audio sounds more like Joey Diaz.

Ideally we want both content loss and style loss to be minimised.
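
As a rough sketch of how the two terms trade off, the toy NumPy example below computes a content loss (mean squared error between feature maps) and a style loss (mean squared error between Gram matrices) and combines them with weights. The helper names `gram`, `mse` and `total_loss`, the array shapes and the random data are illustrative assumptions; they are not the modules used in this notebook.

```python
import numpy as np

def gram(features):
    # features has shape (channels, time frames); the Gram matrix holds
    # channel-by-channel correlations, normalised by the number of elements
    return features @ features.T / features.size

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def total_loss(output, content, style, content_weight=2.0, style_weight=1000.0):
    # weighted sum of the two terms; returns the individual terms as well
    content_loss = mse(output, content)
    style_loss = mse(gram(output), gram(style))
    total = content_weight * content_loss + style_weight * style_loss
    return total, content_loss, style_loss

rng = np.random.default_rng(0)
content = rng.random((8, 50))   # hypothetical content features
style = rng.random((8, 50))     # hypothetical style features

# Starting from the content itself: content loss is zero, style loss is not.
_, c_loss, s_loss = total_loss(content, content, style)

# Shuffling the time frames of the style features leaves the Gram matrix
# unchanged, so the style loss stays (numerically) zero while content loss does not.
shuffled = style[:, rng.permutation(style.shape[1])]
_, c_shuf, s_shuf = total_loss(shuffled, content, style)
```

The second case hints at why the Gram matrix works as a 'style' summary: it records which channels fire together but discards when they fire.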



In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
import torch
import torch.nn as nn
from torch.nn import Conv2d, ReLU, AvgPool1d, MaxPool2d, Linear, Conv1d
from torch.autograd import Variable
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np 
import os
import torchvision.transforms as transforms
import copy
import librosa



class GramMatrix(nn.Module):

    def forward(self, input):
        # a = batch size (=1)
        # b = number of feature maps (channels)
        # c = number of time frames in each feature map
        a, b, c = input.size()
        features = input.view(a * b, c)  # reshape F_XL into \hat F_XL
        G = torch.mm(features, features.t())  # compute the Gram product
        # we 'normalize' the values of the Gram matrix
        # by dividing by the total number of elements in the feature maps.
        return G.div(a * b * c)
	

# https://ghamrouni.github.io/stn-tuto/advanced/neural_style_tutorial.html#
class ContentLoss(nn.Module):

    def __init__(self, target, weight):
        super(ContentLoss, self).__init__()
        # we 'detach' the target content from the tree used to dynamically
        # compute the gradient: this is a stated value, not a variable.
        # Otherwise the forward method of the criterion will throw an error.
        self.target = target.detach() * weight
        self.weight = weight
        self.criterion = nn.MSELoss()

    def forward(self, input):
        # record the loss as a side effect and pass the input through unchanged
        self.loss = self.criterion(input * self.weight, self.target)
        self.output = input
        return self.output

    def backward(self, retain_graph=True):
        self.loss.backward(retain_graph=retain_graph)
        return self.loss


class StyleLoss(nn.Module):

	def __init__(self, target, weight):
		super(StyleLoss, self).__init__()
		self.target = target.detach() * weight
		self.weight = weight
		self.gram = GramMatrix()
		self.criterion = nn.MSELoss()

	def forward(self, input):
		self.output = input.clone()
		self.G = self.gram(input)
		self.G.mul_(self.weight)
		self.loss = self.criterion(self.G, self.target)
		return self.output

	def backward(self,retain_graph=True):
		self.loss.backward(retain_graph=retain_graph)
		return self.loss

# Converting Wav to Matrix

To convert the waveform audio to a matrix that we can pass to PyTorch I'll use `librosa`. Most of this code was borrowed from Dmitry Ulyanov's [GitHub repo](https://github.com/DmitryUlyanov/neural-style-audio-tf/blob/master/neural-style-audio-tf.ipynb) and Alish Dipani's [GitHub repo](https://github.com/alishdipani/Neural-Style-Transfer-Audio).

We get the short-time Fourier transform (STFT) of the audio using the `librosa` library. The window size is `2048`, which is also the library's default setting. There is scope here for replacing this code with torchaudio, but this works for now.
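
To make the resulting matrix shape concrete, here is a small sketch of the arithmetic using only NumPy. It assumes librosa's defaults (a hop length of a quarter of the window and a 22050 Hz sampling rate from `librosa.load`) and a hypothetical 10 second clip; the names `HOP`, `SR`, `n_bins` and `n_frames` are illustrative.

```python
import numpy as np

N_FFT = 2048
HOP = N_FFT // 4   # librosa's default hop length when none is given
SR = 22050         # librosa.load's default sampling rate

# A real-valued frame of N_FFT samples yields N_FFT // 2 + 1 frequency bins.
frame = np.random.default_rng(0).standard_normal(N_FFT)
n_bins = np.fft.rfft(frame).shape[0]

# With centred frames, a clip of n samples gives 1 + n // HOP frames.
n_samples = 10 * SR            # hypothetical 10 second clip
n_frames = 1 + n_samples // HOP

print(n_bins, n_frames)        # 1025 431
```

This is where the `1025` in the reshape calls and in the first convolution's `in_channels` comes from: each frequency bin is treated as a channel, and time is the one remaining dimension.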

In [None]:
import gc; gc.collect()

129

In [None]:
# USING LIBROSA
N_FFT=2048
def read_audio_spectum(filename):
  x, fs = librosa.load(filename)
  S = librosa.stft(x, n_fft=N_FFT)
  # keep only the log-magnitude; the phase is discarded and re-estimated at reconstruction time
  S = np.log1p(np.abs(S))
  return S, fs

style_audio, style_sr = read_audio_spectum(style_audio_name)
content_audio, content_sr = read_audio_spectum(content_audio_name)

if content_sr != style_sr:
  raise ValueError('Sampling rates are not the same')

# add a batch dimension: (1, 1025 frequency bins, time frames)
style_audio = style_audio.reshape([1,1025,style_audio.shape[1]])
content_audio = content_audio.reshape([1,1025,content_audio.shape[1]])

# move the spectrograms to the GPU when one is available
if torch.cuda.is_available():
  style_float = torch.from_numpy(style_audio).cuda()
  content_float = torch.from_numpy(content_audio).cuda()
else:
  style_float = torch.from_numpy(style_audio)
  content_float = torch.from_numpy(content_audio)

In [None]:
# !pip install torchaudio

# Create CNN

In [None]:
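# drop any previously created model before building a new one (safe on a first run too)
import gc; globals().pop('cnn', None); gc.collect()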

This CNN is very shallow. It consists of two convolutions with a ReLU between them. I started from the CNN used [here](https://github.com/alishdipani/Neural-Style-Transfer-Audio/blob/master/NeuralStyleTransfer.py) but made a few changes:

 - Added content loss. The original did not have it, and it is obviously very useful: we'd like to know how close (or far away) the output audio is to the original content.

 - Added a ReLU to the model. It's pretty well [established](https://stats.stackexchange.com/questions/275358/why-is-increasing-the-non-linearity-of-neural-networks-desired) that nonlinear activations are desirable in a neural network. Adding a ReLU improved the model significantly.

 - Increased the number of steps from `2500` to `20000`.

 - Slightly deepened the network by adding a `Conv1d` layer. Style loss and content loss are calculated after this layer.
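
To get a feel for the size of this network, the sketch below works out the output length and parameter count of the two `Conv1d` layers with plain arithmetic. The helper names and the example frame count are hypothetical; the formulas are the standard ones for a 1D convolution without dilation.

```python
def conv1d_out_length(length, kernel_size=3, stride=1, padding=1):
    # standard output-length formula for a 1D convolution (dilation of 1)
    return (length + 2 * padding - kernel_size) // stride + 1

def conv1d_params(in_channels, out_channels, kernel_size=3):
    # one weight per (input channel, output channel, kernel tap) plus one bias per output channel
    return in_channels * out_channels * kernel_size + out_channels

frames = 431  # hypothetical number of spectrogram frames
out_len = conv1d_out_length(conv1d_out_length(frames))

params_conv1 = conv1d_params(1025, 4096)
params_conv2 = conv1d_params(4096, 4096)
total_params = params_conv1 + params_conv2

print(out_len, params_conv1, params_conv2, total_params)
# 431 12599296 50335744 62935040
```

With a kernel size of 3, stride 1 and padding 1 the time dimension is preserved, so the features stay aligned frame for frame with the input spectrogram.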

In [None]:
class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()
        self.cnn1 = Conv1d(in_channels=1025, out_channels=4096, kernel_size=3, stride=1, padding=1)
        self.relu = ReLU()
        self.cnn2 = Conv1d(in_channels=4096, out_channels=4096, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # conv -> ReLU -> conv; each layer consumes the previous layer's output
        out = self.cnn1(x)
        out = self.relu(out)
        out = self.cnn2(out)
        return out

In [None]:
cnn = CNNModel()
if torch.cuda.is_available():
  cnn = cnn.cuda()


style_weight=1000
content_weight = 2


def get_style_model_and_losses(cnn, style_float,\
                               content_float=content_float,\
                               style_weight=style_weight):
  
  cnn = copy.deepcopy(cnn)

  style_losses = []
  content_losses = []

  # create model
  model = nn.Sequential()

  # we need a gram module in order to compute style targets
  gram = GramMatrix()

  # load onto gpu  
  if torch.cuda.is_available():
    model = model.cuda()
    gram = gram.cuda()

  # add conv1
  name = 'conv_1'
  model.add_module(name, cnn.cnn1)

  # add relu
  name = 'relu1'
  model.add_module(name, cnn.relu)

  # add conv2
  name = 'conv_2'
  model.add_module(name, cnn.cnn2)

  # add style loss
  target_feature = model(style_float).clone()
  target_feature_gram = gram(target_feature)
  style_loss = StyleLoss(target_feature_gram, style_weight)
  model.add_module("style_loss_1", style_loss)
  style_losses.append(style_loss)

  # add content loss
  target = model(content_float).detach()
  content_loss = ContentLoss(target, content_weight)
  model.add_module("content_loss_1", content_loss)
  content_losses.append(content_loss)

  return model, style_losses, content_losses


get_style_model_and_losses(cnn, style_float, content_float)

(Sequential(
   (conv_1): Conv1d(1025, 4096, kernel_size=(3,), stride=(1,), padding=(1,))
   (relu1): ReLU()
   (conv_2): Conv1d(4096, 4096, kernel_size=(3,), stride=(1,), padding=(1,))
   (style_loss_1): StyleLoss(
     (gram): GramMatrix()
     (criterion): MSELoss()
   )
   (content_loss_1): ContentLoss(
     (criterion): MSELoss()
   )
 ), [StyleLoss(
    (gram): GramMatrix()
    (criterion): MSELoss()
  )], [ContentLoss(
    (criterion): MSELoss()
  )])

# Run style transfer

Now I'll run the style transfer. This uses the `optim.Adam` optimizer. This piece of code was taken from the PyTorch tutorial for [neural style transfer](https://pytorch.org/tutorials/advanced/neural_style_tutorial.html). On each iteration the style loss and content loss are calculated and backpropagated to get gradients with respect to the input. The optimizer scales those gradients by the learning rate and uses them to update the input audio matrix. In PyTorch the optimizer's `step` can take a [closure](https://pytorch.org/tutorials/advanced/neural_style_tutorial.html#gradient-descent) function that re-evaluates the loss, which is the pattern used here.
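
The key idea is that the thing being optimised is the input matrix, not the network weights. A minimal sketch of that idea, using plain gradient descent in NumPy rather than Adam and a simple squared-error loss rather than the style and content losses, might look like the following. Everything here (`target`, `x`, `lr`, and a `closure` that returns the gradient by hand because there is no autograd) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((4, 6))   # stands in for the features we want to match
x = np.zeros_like(target)     # the "input" being optimised
lr = 1.0                      # hypothetical learning rate

def closure():
    # re-evaluate the loss and its gradient with respect to the input
    diff = x - target
    loss = float(np.mean(diff ** 2))
    grad = 2.0 * diff / diff.size
    return loss, grad

initial_loss, _ = closure()
for _ in range(200):
    loss, grad = closure()
    x -= lr * grad            # update the input itself; no weights are involved
final_loss, _ = closure()

print(initial_loss > final_loss)  # True
```

In the notebook the same role is played by `input_param`, which is wrapped in `nn.Parameter` so that the optimizer treats the spectrogram as the quantity to update.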

In [None]:
import gc; gc.collect()

input_float = content_float.clone()
#input_float = Variable(torch.randn(content_float.size())).type(torch.FloatTensor)

learning_rate_initial = 1e-4

def get_input_param_optimizer(input_float):
  input_param = nn.Parameter(input_float.data)
  # optimizer = optim.Adagrad([input_param], lr=learning_rate_initial, lr_decay=0.0001,weight_decay=0)
  optimizer = optim.Adam([input_param], lr=learning_rate_initial)
  # optimizer = optim.LBFGS([input_param], lr=learning_rate_initial)
  # optimizer = optim.SGD([input_param], lr=learning_rate_initial)
  # optimizer = optim.RMSprop([input_param], lr=learning_rate_initial)
  return input_param, optimizer

num_steps= 10000


# from https://pytorch.org/tutorials/advanced/neural_style_tutorial.html
def run_style_transfer(cnn, style_float=style_float,\
                       content_float=content_float,\
                       input_float=input_float,\
                       num_steps=num_steps, style_weight=style_weight): 
  print('Building the style transfer model..')
  # model, style_losses = get_style_model_and_losses(cnn, style_float)
  model, style_losses, content_losses = get_style_model_and_losses(cnn, style_float, content_float)
  input_param, optimizer = get_input_param_optimizer(input_float)
  print('Optimizing..')
  run = [0]

  while run[0] <= num_steps:
    def closure():
      # clamp the values of the updated input spectrogram
      input_param.data.clamp_(0, 1)

      optimizer.zero_grad()
      model(input_param)
      style_score = 0
      content_score = 0

      for sl in style_losses:
        #print('sl is ',sl,' style loss is ',style_score)
        style_score += sl.loss

      for cl in content_losses:
        content_score += cl.loss

      style_score *= style_weight
      content_score *= content_weight

      loss = style_score + content_score
      loss.backward()

      run[0] += 1
      if run[0] % 100 == 0:
        print("run {}:".format(run))
        print('Style Loss : {:4f} Content Loss: {:4f}'.format(
                    style_score.item(), content_score.item()))
        print()

      return style_score + content_score

    optimizer.step(closure)

  # ensure values are between 0 and 1
  input_param.data.clamp_(0, 1)

  return input_param.data


output = run_style_transfer(cnn, style_float=style_float, content_float=content_float, input_float=input_float)

Building the style transfer model..
Optimizing..
run [100]:
Style Loss : 0.037150 Content Loss: 0.078876

run [200]:
Style Loss : 0.036688 Content Loss: 0.077489

run [300]:
Style Loss : 0.036229 Content Loss: 0.076232

run [400]:
Style Loss : 0.035772 Content Loss: 0.075073

run [500]:
Style Loss : 0.035319 Content Loss: 0.073997

run [600]:
Style Loss : 0.034872 Content Loss: 0.072990

run [700]:
Style Loss : 0.034430 Content Loss: 0.072046

run [800]:
Style Loss : 0.033994 Content Loss: 0.071156



# Reconstruct Audio

Finally the audio needs to be reconstructed. To do that, librosa's inverse short-time Fourier transform can be used.

Then we write the result to an audio file and use IPython's `Audio` display to play it in the notebook.
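
Because only the log-magnitude was kept when the spectrogram was built, reconstruction has two parts: undo the `log1p`, then pair the magnitude with some phase to form a complex spectrogram. The NumPy sketch below shows just that step on made-up data; the array names and shapes are hypothetical, and the random phase is only a starting guess.

```python
import numpy as np

rng = np.random.default_rng(0)
magnitude = rng.random((5, 7)) + 0.1   # hypothetical, strictly positive magnitudes
log_spec = np.log1p(magnitude)         # the representation the network works with

recovered = np.expm1(log_spec)         # inverse of log1p
phase = 2 * np.pi * rng.random(recovered.shape) - np.pi   # random phase in [-pi, pi)
S = recovered * np.exp(1j * phase)     # complex spectrogram that an inverse STFT can consume

print(np.allclose(np.abs(S), magnitude), np.allclose(np.angle(S), phase))  # True True
```

A random phase on its own tends to sound poor, which is why the phase is typically refined iteratively by resynthesising the audio and re-estimating the phase from it.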

In [None]:
# taken from: https://github.com/alishdipani/Neural-Style-Transfer-Audio/blob/master/NeuralStyleTransfer.py

if torch.cuda.is_available():
  output = output.cpu()

output = output.squeeze(0)
output = output.numpy()

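N_FFT=2048
# invert the log1p that was applied when the spectrogram was computed
a = np.expm1(output)

# Phase reconstruction (Griffin-Lim style iteration): start from a random phase,
# then repeatedly resynthesise the audio and re-estimate the phase from it
p = 2 * np.pi * np.random.random_sample(a.shape) - np.pi
for i in range(500):
  S = a * np.exp(1j*p)
  x = librosa.istft(S)
  p = np.angle(librosa.stft(x, n_fft=N_FFT))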


In [None]:
import soundfile as sf  # librosa.output.write_wav was removed in librosa 0.8

OUTPUT_FILENAME = 'output.wav'
sf.write(OUTPUT_FILENAME, x, style_sr)
Audio(OUTPUT_FILENAME)

In [None]:
#from google.colab import files
import soundfile as sf

OUTPUT_FILENAME = 'test.wav'
sf.write(OUTPUT_FILENAME, x, style_sr)

files.download('test.wav') 

print('DONE...')

DONE...
