<a href="https://colab.research.google.com/github/olaviinha/NeuralTextToAudio/blob/main/AudioLDM_pub.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#<font face="Trebuchet MS" size="6">AudioLDM<font color="#999" size="4">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</font><font color="#999" size="4">Text-to-audio</font><font color="#999" size="4">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</font><a href="https://github.com/olaviinha/NeuralTextToAudio" target="_blank"><font color="#999" size="4">Github</font></a>

Generate audio from a text prompt using [AudioLDM](https://github.com/haoheliu/AudioLDM).
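
The prompt field accepts several prompts separated by `;`, and `batch` repeats the whole list. A minimal sketch of that expansion, assuming a hypothetical helper `expand_prompts` (the notebook does this inline in the generation cell rather than through a function):

```python
def expand_prompts(prompt, batch=1):
    """Split a semicolon-separated prompt string and repeat it `batch` times.

    Hypothetical helper for illustration; it mirrors the notebook's inline logic.
    """
    # One entry per semicolon-separated prompt, with surrounding whitespace removed.
    prompts = [p.strip() for p in prompt.split(';')]
    # Treat a batch of 0 as 1, as the generation cell does
    # (negative values are also clamped here, which is an assumption of this sketch).
    repeats = batch if batch > 0 else 1
    return prompts * repeats


print(expand_prompts('rain on a tin roof; distant thunder', batch=2))
```

Each returned entry produces one WAV file, so the call above would correspond to four generated files.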

In [None]:
#@markdown ### Display instructions and tips
%%html
<style type="text/css">
div, ul.tips {
  font-size: 17px;
  line-height: 25px;
}
ul.tips { 
  max-width: 70%;
  margin-left: 0;
  padding-left: 20px;
}
ul.tips li {
  margin-bottom: 7px;
}
ul.tips li code { 
  font-size: 16px;
  background: #2c2c2c; 
  padding: 2px 5px; 
}
ul.tips li ul.sublist li {
  line-height: auto;
  margin-bottom: 5px;
}
h2, div {
  margin-top: 30px;
  margin-bottom: 20px;
}
.italic {
  font-style: italic;
}
</style>
<h2>Notebook usage</h2>
<ul class="tips">
  <li>All directory and file paths should be relative to your Google Drive root (My Drive). E.g. <code>output_dir</code> value should be <code>Music/AI-Generated-Sounds</code> if you have a directory called <i>Music</i> in your Drive, containing a subdirectory called <i>AI-Generated-Sounds</i>. All paths are case-sensitive.</li>
  <li>Should you opt not to mount Google Drive, directory <i>faux_drive</i> (<code>/content/faux_drive</code>) found in the Files browser of the Colab runtime works as if it was your <i>My Drive</i>. You may use it to upload/download files via Colab's own Files browser pretending it's your Google Drive.</li>
  <li>Model <code>audioldm-full-l</code> requires Premium GPU. Other models run with standard GPU.</li>
  <li><code>local_models_dir</code> (optional but recommended) saves the downloaded checkpoints to your Google Drive and loads them from there if they are already available. This speeds up setup significantly the next time you use this notebook.</li>
  <li><code>output_dir</code> is where the generated WAV files will be saved.</li>
  <li><code>batch</code> will just repeat whatever you're generating that many times.</li>
  <li>If <code>seed</code> is set to 0 (zero), a random seed will be used.</li>
  <li>You may use <code>;</code> in the <code>prompt</code> field as a separator, in which case a separate audio file will be generated for each semicolon-separated prompt in a single run.</li>
  <li>If you use <code>init_audio_file</code> (path to an existing audio file in your Google Drive), the notebook will try to guess what you want to do according to other parameters you have given, as follows:</li>
  <li><b>Audio-to-Audio generation</b>
    <ul class="sublist">
      <li>Leave <code>prompt</code> field empty.</li>
      <li>Leave <code>style_strength</code> at zero.</li>
      <!-- <li>Leave <code>superresolution</code> unchecked.</li> -->
    </ul>
  </li>
  <li><b>Style Transfer</b>
    <ul class="sublist">
      <li>Fill <code>prompt</code> field.</li>
      <li>Set <code>style_strength</code> (greater than zero).</li>
      <!-- <li>Leave <code>superresolution</code> unchecked.</li> -->
    </ul>
  </li>
  <!-- <li><b>Super-resolution</b>
    <ul class="sublist">
      <li>Check <code>superresolution</code> checkbox.</li>
      <li><code>prompt</code> and <code>style_strength</code> are automatically ignored.
  </li> -->
</ul>

<h2>Prompt tips</h2>
<div>Naturally a <i>good</i> prompt depends on what you're after, but generally:</div>
<ul class="tips">
  <li>Consider adding a more detailed description of the kind of sound you want (add adjectives, etc.).</li>
  <li>For better quality, you may try some additional keywords generally associated with better quality, for example
    <ul class="sublist">
      <li><i>in studio</i></li>
      <li><i>studio recording</i></li>
      <li><i>high quality</i></li>
      <li><i>album</i> (for music)</li>
    </ul>
  </li>
</ul>

In [None]:
#@title #Setup
#@markdown This cell needs to be run only once. It will mount your Google Drive and set up prerequisites.<br>
#@markdown <small>Mounting Drive will enable this notebook to save outputs directly to your Drive. Otherwise you will need to copy/download them manually from this notebook.</small>

force_setup = False
repositories = ['https://github.com/haoheliu/AudioLDM.git']
pip_packages = ''
apt_packages = ''
mount_drive = True #@param {type:"boolean"}
skip_setup = False #@ param {type:"boolean"}
local_models_dir = "" #@param {type:"string"}

use_checkpoint = "audioldm-full-s-v2" #@param ["audioldm-s-full", "audioldm-full-l", "audioldm-full-s-v2"]


if use_checkpoint == 'audioldm-s-full':
  ckpt_url = 'https://zenodo.org/record/7600541/files/'+use_checkpoint+'.ckpt?download=1'
else:
  ckpt_url = 'https://zenodo.org/record/7698295/files/'+use_checkpoint+'.ckpt?download=1'
  
use_ckpt = use_checkpoint+'.ckpt'


import os
from google.colab import output
import warnings
warnings.filterwarnings('ignore')
%cd /content/

if pip_packages != '':
  !pip -q install {pip_packages}
if apt_packages != '':
  !apt-get update && apt-get install {apt_packages}

import sys, time, ntpath, string, random, librosa, librosa.display, IPython, shutil, math, psutil, datetime, requests, pytz
import numpy as np
import soundfile as sf
from datetime import timedelta

# Print colors
class c:
  title = '\033[96m'
  ok = '\033[92m'
  okb = '\033[94m'
  warn = '\033[93m'
  fail = '\033[31m'
  endc = '\033[0m'
  bold = '\033[1m'
  dark = '\33[90m'
  u = '\033[4m'

def op(typex, msg, value='', time=False):
  if time == True:
    stamp = timestamp(human_readable=True)
    typex = c.dark+stamp+' '+typex
  if value != '':
    print(typex+msg+c.endc, end=' ')
    print(value)
  else:
    print(typex+msg+c.endc)

def gen_id(type='short'):
  id = ''
  if type == 'timestamp':
    id = timestamp()
  if type == 'short':
    id = requests.get('https://api.inha.asia/k/?type=short').text
  if type == 'long':
    id = requests.get('https://api.inha.asia/k').text
  return id

def timestamp(no_slash=False, human_readable=False, helsinki_time=True, date_only=False):
  if helsinki_time == True:
    dt = datetime.datetime.now(pytz.timezone('Europe/Helsinki'))
  else:
    dt = datetime.datetime.now()
  if no_slash == True:
    dt = dt.strftime("%Y%m%d%H%M%S")
  else:
    if human_readable == True:
      dt = dt.strftime("%Y-%m-%d %H:%M:%S")
    else:
      if date_only == True:
        dt = dt.strftime("%Y-%m-%d")
      else:
        dt = dt.strftime("%Y-%m-%d_%H%M%S")
  return dt

def fix_path(path, add_slash=False):
  # Ensure a trailing slash
  if not path.endswith('/'):
    path = path+"/"
  # Despite its name, add_slash=True strips a leading slash
  if path.startswith('/') and add_slash == True:
    path = path[1:]
  return path
  
def path_leaf(path):
  head, tail = ntpath.split(path)
  return tail or ntpath.basename(head)

def path_dir(path):
  return path.replace(path_leaf(path), '')

def path_ext(path, only_ext=False):
  filename, extension = os.path.splitext(path)
  if only_ext == True:
    extension = extension[1:]
  return extension

def basename(path):
  filename = os.path.basename(path).strip()#.replace(" ", "_")
  filebase = os.path.splitext(filename)[0]
  return filebase

def slug(s):
  valid_chars = "-_. %s%s" % (string.ascii_letters, string.digits)
  file = ''.join(c for c in s if c in valid_chars)
  file = file.replace(' ','_')
  return file
  
def fetch(url, save_as):
  headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
  try:
    r = requests.get(url, stream=True, headers=headers, timeout=5)
    if r.status_code == 200:
      with open(save_as, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
      resp = r.status_code
    else:
      resp = 0
  except requests.exceptions.ConnectionError as e:
    r = 0
    resp = r
  return resp

def list_audio(path, midi=False):
  # Imported here because the setup cell does not import them elsewhere
  from glob import glob
  from os.path import join
  audiofiles = []
  for ext in ('*.wav', '*.aiff', '*.aif', '*.caf', '*.flac', '*.mp3', '*.m4a', '*.ogg', '*.WAV', '*.AIFF', '*.AIF', '*.CAF', '*.FLAC', '*.MP3', '*.OGG'):
    audiofiles.extend(glob(join(path, ext)))
  if midi is True:
    for ext in ('*.mid', '*.midi', '*.MID', '*.MIDI'):
      audiofiles.extend(glob(join(path, ext)))
  audiofiles.sort()
  return audiofiles

def audio_player(input, sr=44100, limit_duration=2):
  if type(input) != np.ndarray:
    input, sr = librosa.load(input, sr=None, mono=False)
  if limit_duration > 0:
    last_sample = math.floor(limit_duration*60*sr)
    if input.shape[-1] > last_sample:
      # Trim along the last (time) axis only, so mono and multichannel arrays both work
      input = input[..., :last_sample]
      op(c.warn, 'WARN! Playback in the audio player below is limited to the first '+str(limit_duration)+' minutes to prevent Colab from crashing.\n')
  IPython.display.display(IPython.display.Audio(input, rate=sr))

# Mount Drive
if mount_drive == True:
  if not os.path.isdir('/content/drive'):
    from google.colab import drive
    drive.mount('/content/drive')
  if not os.path.isdir('/content/mydrive'):
    os.symlink('/content/drive/My Drive', '/content/mydrive')
  # Always set drive_root, also when the cell is re-run with Drive already mounted
  drive_root = '/content/mydrive/'
  drive_root_set = True
else:
  # exist_ok avoids an error when the cell is re-run
  os.makedirs('/content/faux_drive', exist_ok=True)
  drive_root = '/content/faux_drive/'

if mount_drive == False:
  local_models_dir = ''

if len(repositories) > 0 and skip_setup == False:
  for repo in repositories:
    %cd /content/
    install_dir = fix_path('/content/'+path_leaf(repo).replace('.git', ''))
    repo = repo if '.git' in repo else repo+'.git'
    !git clone {repo}
    if os.path.isfile(install_dir+'setup.py') or os.path.isfile(install_dir+'setup.cfg'):
      !pip install -e {install_dir}
    if os.path.isfile(install_dir+'requirements.txt'):
      !pip install -r {install_dir}/requirements.txt

if len(repositories) == 1:
  %cd {install_dir}

dir_tmp = '/content/tmp/'
if not os.path.isdir(dir_tmp): os.mkdir(dir_tmp)

use_ckpt_path = os.path.expanduser('~')+'/.cache/audioldm/'

if not os.path.isdir(use_ckpt_path):
  os.makedirs(use_ckpt_path)

if local_models_dir != '':
  models_dir = drive_root+fix_path(local_models_dir)
  if not os.path.isdir(models_dir):
    os.makedirs(models_dir)
  # for ckpt_url in ckpt_urls:
  #   use_ckpt = ckpt_url.split('files/')[1].split('?')[0]
  if os.path.isfile(models_dir+use_ckpt):
    op(c.title, 'Fetching local ckpt:', models_dir.replace(drive_root, '')+use_ckpt)
    shutil.copy(models_dir+use_ckpt, use_ckpt_path+use_ckpt)
    op(c.ok, 'Done.')
  else:
    op(c.warn, 'Downloading '+use_ckpt+' to ', models_dir.replace(drive_root, ''))
    !wget {ckpt_url} -O {models_dir}{use_ckpt}
    shutil.copy(models_dir+use_ckpt, use_ckpt_path+use_ckpt)
    op(c.ok, 'Done.')
else:
  # for ckpt_url in ckpt_urls:
  #   use_ckpt = ckpt_url.split('files/')[1].split('?')[0]
  models_dir = use_ckpt_path
  op(c.warn, 'Downloading', use_ckpt)
  # Download straight into the cache directory; copying afterwards would raise
  # shutil.SameFileError because source and destination are the same file
  !wget {ckpt_url} -O {models_dir}{use_ckpt}
  op(c.ok, 'Done.')

ckpt_path = use_ckpt_path+use_ckpt
op(c.title, 'Build model', ckpt_path)
sys.path.append('/content/AudioLDM/audioldm/')
from audioldm import text_to_audio, style_transfer, super_resolution_and_inpainting, build_model, latent_diffusion
audioldm = build_model(ckpt_path=ckpt_path, model_name=use_checkpoint)

def round_to_multiple(number, multiple):
  x = multiple * round(number / multiple)
  if x == 0: x = multiple
  return x

def text2audio(text, duration, audio_path, guidance_scale, random_seed, n_candidates, steps):
  waveform = text_to_audio(
    audioldm,
    text,
    audio_path,
    random_seed,
    duration=duration,
    guidance_scale=guidance_scale,
    ddim_steps=steps,
    n_candidate_gen_per_text=int(n_candidates)
  )
  if(len(waveform) == 1):
    waveform = waveform[0]
  return waveform

def styleaudio(text, duration, audio_path, strength, guidance_scale, random_seed, steps):
  waveform = style_transfer(
    audioldm,
    text,
    audio_path,
    strength,
    random_seed,
    duration=duration,
    guidance_scale=guidance_scale,
    ddim_steps=steps,
  )
  if(len(waveform) == 1):
    waveform = waveform[0]
  return waveform


# time_mask_ratio_start_and_end=(0.10, 0.15), # regenerate the 10% to 15% of the time steps in the spectrogram
# time_mask_ratio_start_and_end=(1.0, 1.0), # no inpainting
# freq_mask_ratio_start_and_end=(0.75, 1.0), # regenerate the higher 75% to 100% mel bins
# freq_mask_ratio_start_and_end=(1.0, 1.0), # no super-resolution
def superres(text, duration, audio_path, guidance_scale, random_seed, n_candidates, steps):
  waveform = super_resolution_and_inpainting(
    audioldm,
    text,
    audio_path,
    random_seed,
    ddim_steps=steps,
    duration=duration,
    guidance_scale=guidance_scale,
    n_candidate_gen_per_text=n_candidates,
    freq_mask_ratio_start_and_end=(0.75, 1.0)
  )
  if(len(waveform) == 1):
    waveform = waveform[0]
  return waveform


prompt_list = []

output.clear()
# !nvidia-smi
print()
op(c.title, 'Using:', use_ckpt, time=True)
op(c.ok, 'Setup finished.', time=True)
print()


# Generate audio
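
When `init_audio_file` is set, the cell below chooses a mode from the other fields. A rough sketch of that decision, assuming a hypothetical `pick_action` helper (not part of the notebook), a made-up file path, and ignoring the disabled super-resolution option:

```python
def pick_action(init_audio_file='', style_strength=0.0):
    """Return the mode the generation cell would run for the given form values.

    Hypothetical helper for illustration only.
    """
    if init_audio_file == '':
        return 'generate'      # plain text-to-audio from the prompt
    if style_strength > 0:
        return 'style'         # style transfer: init audio steered by the prompt
    return 'audio2audio'       # init audio only; the prompt is ignored


# 'Music/loop.wav' is a placeholder path relative to My Drive.
print(pick_action())
print(pick_action('Music/loop.wav'))
print(pick_action('Music/loop.wav', style_strength=0.4))
```

The three calls correspond to text-to-audio, audio-to-audio, and style transfer, in that order.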

In [None]:
prompt = "" #@param {type:"string"}
output_dir = "" #@param {type:"string"}
duration = 5 #@param {type:"slider", min:2.5, max:30, step:2.5}
guidance_scale = 2.5 #@param {type:"slider", min:2, max:5, step:0.5}
seed = 0 #@param {type:"integer"}
candidates = 3 #@param {type:"slider", min:2, max:5, step:1}
batch = 1 #@param {type:"integer"}

#@markdown <br>

#@markdown <b>Style Transfer & Audio-to-Audio</b> settings – Ignore these settings if you just want to generate audio by text prompt.
init_audio_file = "" #@param {type:"string"}

# what_to_do = "Audio-to-audio generation" #@param ["Audio-to-audio generation", "Super-resolution", "Style Transfer"]
what_to_do = None
style_strength = 0 #@param {type:"slider", min:0, max:1, step:0.05}
superresolution = False #@ param {type:"boolean"}

if what_to_do == 'Audio-to-audio generation': action = 'audio2audio'
if what_to_do == 'Super-resolution': action = 'superres'
if what_to_do == 'Style Transfer': action = 'style'
if what_to_do == 'Inpaint': action = 'inpaint'

ddim_steps = 200
og_seed = seed
og_duration = duration
uniq_id = gen_id()
sr = 16000

# Prompt/input
if ';' in prompt:
  inputs = prompt.split(';')
elif prompt == 'prompt_list':
  inputs = prompt_list
else:
  inputs = [prompt]
inputs = [x.strip() for x in inputs]

# Output
if output_dir == '':
  if mount_drive is True:
    dir_out = dir_tmp
  if mount_drive is False:
    # fix_path adds the trailing slash that the file name concatenation below relies on
    dir_out = fix_path(drive_root+'generated-audio')
    os.makedirs(dir_out, exist_ok=True)
else:
  dir_out = drive_root+fix_path(output_dir)
  # makedirs also creates missing parent directories for nested paths
  os.makedirs(dir_out, exist_ok=True)

if batch == 0: batch = 1  
inputs = inputs * batch

timer_start = time.time()
total = len(inputs)
action = 'generate'
init_path = None

for i, input in enumerate(inputs, 1):
  
  ndx_info = str(i)+'/'+str(total)+' '
  print()

  if init_audio_file != '':
    if os.path.isfile(drive_root+init_audio_file):
      init_path = drive_root+init_audio_file
      if superresolution is True:
        action = 'superres'
      elif style_strength > 0:
        init_filename = path_leaf(init_path)
        op(c.title, ndx_info+'Styling audio:', init_path.replace(drive_root, ''), time=True)
        op(c.title, 'With prompt:', input, time=True)
        action = 'style'
      else:
        op(c.title, ndx_info+'Audio-to-audio generation:', init_path.replace(drive_root, ''), time=True)
        # op(c.title, 'With prompt:', input, time=True)
        input = None
        action = 'audio2audio'
      # Trim duration if init duration is shorter than given duration
      init_y, init_sr = librosa.load(init_path, sr=None, mono=True)
      init_duration = librosa.get_duration(y=init_y, sr=init_sr)
      duration = round_to_multiple(init_duration, 2.5) if init_duration < og_duration else duration
      
    else:
      op(c.fail, ndx_info+'Init audio file not found!', time=True)
      sys.exit('Make sure init_audio_file is a valid audio file and a valid file path relative to your My Drive.')
  else:
    op(c.title, ndx_info+'Generating audio:', input, time=True)

  if og_seed == 0: seed = int(time.time())

  if action == 'generate':
    file_out = dir_out+uniq_id+'__'+slug(input)[:60]+'_'+str(i).zfill(3)+'.wav'
    generated_audio = text2audio(input, duration, None, guidance_scale, seed, candidates, ddim_steps)
  elif action == 'audio2audio':
    file_out = dir_out+uniq_id+'__'+basename(init_path)+'_'+str(i).zfill(3)+'.wav'
    generated_audio = text2audio('placeholder', duration, init_path, guidance_scale, seed, candidates, ddim_steps)
  elif action == 'superres':
    file_out = dir_out+uniq_id+'__'+basename(init_path)+'_'+str(i).zfill(3)+'.wav'
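    # Use separate names so the output sample rate `sr` (16000) is not overwritten
    init_y, init_sr = librosa.load(init_path, sr=None)
    duration = librosa.get_duration(y=init_y, sr=init_sr)
    if duration > 30: duration = 30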
    generated_audio = superres(None, duration, init_path, guidance_scale, seed, candidates, ddim_steps)
  elif action == 'style':
    file_out = dir_out+uniq_id+'__'+basename(init_path)+'_'+slug(input)[:60]+'_'+str(i).zfill(3)+'.wav'
    generated_audio = styleaudio(input, duration, init_path, style_strength, guidance_scale, seed, ddim_steps)
  else:
    op(c.fail, 'Something went wrong.')
    sys.exit()

  
  sf.write(file_out, generated_audio.T, sr, subtype='PCM_24')
  if os.path.isfile(file_out):
    audio_player(generated_audio, sr=sr)
    print()
    op(c.ok, 'Saved as', file_out.replace(drive_root, ''), time=True)
  else:
    op(c.fail, 'Error saving', file_out.replace(drive_root, ''), time=True)
  
# -- END THINGS --

timer_end = time.time()

print()
op(c.okb, 'Elapsed', timedelta(seconds=timer_end-timer_start), time=True)
op(c.ok, 'FIN.')