<a href="https://colab.research.google.com/github/olaviinha/NeuralInterviewAudiolizer/blob/main/NeuralInterviewAudiolizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#<font face="Trebuchet MS" size="6">Neural Interview Audiolizer <font color="#999" size="3">Text-to-Audio Dialogue Generator</font><font color="#999" size="4">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</font><a href="https://github.com/olaviinha/NeuralInterviewAudiolizer" target="_blank"><font color="#999" size="4">Github</font></a><font color="#999" size="4">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</font><font size="3" color="#999"><a href="https://inha.se" target="_blank"><font color="#999">O. Inha</font></a></font></font>

This notebook turns a textual dialogue between two individuals (e.g. an interview, a chat) into audio dialogue (.txt to .wav) using either [Google Cloud Text-to-Speech API](https://cloud.google.com/text-to-speech) or [Amazon Polly Text-to-Speech API](https://aws.amazon.com/polly/).

<h3>Please note</h3>

- Using either of the provided APIs **requires access keys**. For instructions on obtaining the necessary credentials and access, see the following links:
  - Google TTS: [Before you begin](https://cloud.google.com/text-to-speech/docs/quickstart-client-libraries#before-you-begin)
  - Amazon Polly: [AWS Account and Access Keys](https://docs.aws.amazon.com/powershell/latest/userguide/pstools-appendix-sign-up.html)
- The Google Cloud TTS cell calls the API from the command line (curl) instead of the Python client library, because some of the packages preinstalled in Google Colaboratory were outdated at the time of writing this notebook. It will be updated to use the Python client later, once that works without extra runtime restarts, etc.
- The returned audio streams are mono and have native sample rates of around 22-24 kHz. This notebook converts all audio to 44.1 kHz stereo.
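
To illustrate what the mono-to-stereo conversion mentioned above involves, here is a minimal pure-Python sketch. It is not the notebook's method (the notebook hands this job to ffmpeg); the function name `mono_to_stereo` is hypothetical, and linear interpolation is assumed only for simplicity, so expect lower quality than a real resampler.

```python
def mono_to_stereo(samples, src_sr, dst_sr=44100):
    """Resample a mono signal by linear interpolation and duplicate it into two channels."""
    n_out = round(len(samples) * dst_sr / src_sr)
    out = []
    for k in range(n_out):
        # Position of output sample k on the input time axis
        pos = k * src_sr / dst_sr
        i = int(pos)
        frac = pos - i
        # Clamp the neighbour index at the end of the signal
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    # Both channels carry the same signal
    return [out, list(out)]

# One second of a constant signal at 24 kHz becomes 44100 samples per channel
stereo = mono_to_stereo([0.5] * 24000, src_sr=24000)
print(len(stereo), len(stereo[0]))  # 2 44100
```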

<h3>Accepted formats</h3>

The .txt file containing the dialogue must be in one of the following formats. If your dialogue is copied and pasted from the web, clean it up first so that it matches one of them.<br><br>

<hr size="1" color="#666"/>

### `dialogue_with_names`

In this format, the `:` character marks a change of speaker, so `:` should not appear anywhere else in the text. Empty lines are ignored, and so are the names (whatever comes before `:` on each line). A line without `:` continues the previous speaker's turn.

<font color="#dd6">_John: Hello Doe!_<br> _Doe: Well, hello there John._<br>_How are you?_<br>_John: I'm good thank you, and you?_<br>_Doe: Fantastic._</font>
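
The rules above can be sketched in a few lines of Python. This is a simplified illustration, not the notebook's own parser; the function name `split_turns_by_colon` is hypothetical.

```python
def split_turns_by_colon(text):
    """Split a 'Name: text' dialogue into a list of turns, dropping the names."""
    turns = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty lines are ignored
        if ':' in line:
            # A colon starts a new turn; keep only what follows the name
            turns.append(line.split(':', 1)[1].strip())
        elif turns:
            # No colon: the same person keeps talking
            turns[-1] += ' ' + line
        else:
            # Text before the first name is treated as the first turn
            turns.append(line)
    return turns

sample = """John: Hello Doe!
Doe: Well, hello there John.
How are you?
John: I'm good thank you, and you?
Doe: Fantastic."""

for turn in split_turns_by_colon(sample):
    print(turn)
```

Turns at even positions in the returned list belong to one speaker and turns at odd positions to the other.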

<hr size="1" color="#666"/>

### `question_and_answer`

In this format, empty lines mark a change of speaker: there must be an empty line in the text wherever one individual stops talking and the other starts.

<font color="#dd6">_Hello Doe!_<br><br> _Hi!_<br><br>_How are you?_<br>_What's new?_<br><br>_Struggling as usual, nothing new._<br><br>_What do you think about politics?_<br><br>_Not interested..._<br>_Like no thanks._</font>
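
As a rough sketch of how this format can be split into turns (an illustration only, not the notebook's own parser; the function name `split_turns_by_blank_lines` is hypothetical):

```python
def split_turns_by_blank_lines(text):
    """Split a dialogue into turns wherever one or more empty lines occur."""
    turns, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            # Consecutive non-empty lines belong to the same speaker
            current.append(line)
        elif current:
            # An empty line ends the current turn
            turns.append(' '.join(current))
            current = []
    if current:
        # The last turn has no empty line after it
        turns.append(' '.join(current))
    return turns

sample = "Hello Doe!\n\nHi!\n\nHow are you?\nWhat's new?\n\nStruggling as usual, nothing new."

for turn in split_turns_by_blank_lines(sample):
    print(turn)
```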

<hr size="1" color="#666"/>

<h3>Tips</h3>

- You can preview what the Google Cloud TTS voices sound like in advance [here](https://cloud.google.com/text-to-speech#section-2).

In [None]:
#@title #Setup
#@markdown This cell needs to be run only once. It mounts your Google Drive and sets up the prerequisites.


import os
from google.colab import output

pip_packages = 'google-api-core google-api-python-client google-auth-httplib2 google-auth-oauthlib google-cloud-texttospeech soundfile boto3'

# inhagcutils
if not os.path.isfile('/content/inhagcutils.ipynb'):
  %cd /content/
  !pip -q install --upgrade import-ipynb {pip_packages}
  #!apt-get install sox
  !curl -s -O https://raw.githubusercontent.com/olaviinha/inhagcutils/master/inhagcutils.ipynb
import import_ipynb
from inhagcutils import *

# Mount Drive
if not os.path.isdir('/content/drive'):
  from google.colab import drive
  drive.mount('/content/drive')

# Drive symlink
if not os.path.isdir('/content/mydrive'):
  os.symlink('/content/drive/My Drive', '/content/mydrive')
  drive_root_set = True
drive_root = '/content/mydrive/'

dir_tmp = '/content/tmp/'
tmp_mono = '/content/tmp/mono/'
create_dirs([dir_tmp, tmp_mono])

tmp = dir_tmp

global_sr = 24000

import json, soundfile

def writeFile(file, content):
  # Overwrite `file` with `content`
  with open(file, 'w') as f:
    f.writelines(content)

def appendTxt(txt_file, content):
  # Append `content` to `txt_file` as a new line
  with open(txt_file, 'a+') as txt:
    txt.writelines(content+'\n')

def generate_silence(duration, sr=global_sr):
  # Stereo silence: two channels of zeros, `duration` seconds long
  n_samples = librosa.time_to_samples(duration, sr=sr)
  return np.zeros((2, n_samples), dtype=np.float32)

def save(audio_data, save_as='frank', sr=global_sr):
  if save_as=='frank':
    global bpm
    timestamp = datetime.datetime.today().strftime('%Y%m%d-%H%M%S')
    save_as = save_as+'_'+rnd_str(4)+'_'+timestamp+'__'+bpm+'bpm.wav'
  soundfile.write(save_as, audio_data.T, sr)

# Parse dialogue
def parseDialogue(format, dialogue_txt, double_backslash=False):
  # Returns a list of turns; the speaker alternates from one item to the next.
  def clean(text):
    text = text.replace('\n', ' ').strip()
    if double_backslash:
      # Escape apostrophes as \' for the single-quoted request body sent via curl
      text = text.replace("'", "\\'")
    return text

  dlg = []
  aline = ''
  with open(dialogue_txt, 'r') as f_in:
    lines = [l.strip() for l in f_in]

  if format == 'dialogue_with_names':
    for line in lines:
      if not line:
        continue
      if ':' in line:
        # A new speaker starts: store the previous turn and drop the name
        if aline:
          dlg.append(clean(aline))
        aline = line.split(':', 1)[1]
      else:
        # No colon: the same person keeps talking
        aline = aline+' '+line

  if format == 'question_and_answer':
    for line in lines:
      if line:
        aline = aline+' '+line
      elif aline:
        # An empty line ends the current turn
        dlg.append(clean(aline))
        aline = ''

  # Store the final turn
  if aline:
    dlg.append(clean(aline))
  return dlg

def concat_audio(dir, output_wav):
  global global_sr, dialogue_txt
  all_audio = []
  for audio_file in list_audio(dir):
    all_audio.append(librosa.load(audio_file, sr=global_sr, mono=False)[0])
  all_audio = np.concatenate(all_audio, axis=1)
  tmp_wav = tmp+path_leaf(output_wav)
  save(all_audio, tmp_wav)
  !ffmpeg {ffmpeg_q} -y -i "{tmp_wav}" {wav_44} "{output_wav}"

output.clear()
op(c.ok, 'FIN.')

# Update voice lists (copy-paste to dropdown)
# for voice in en_voices:
#   print('"'+voice[1]+'_'+voice[2]+'",', end='')

In [None]:
#@title #Google Cloud TTS

#@markdown <small>Path to the Google service account key file (JSON) located in your Google Drive. For instructions on creating one, see [here](https://cloud.google.com/text-to-speech/docs/quickstart-client-libraries#before-you-begin).</small>
credentials_file = "" #@param {type:"string"}
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=drive_root+credentials_file

# Fetch voices
voices_json = !curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) -H "Content-Type: application/json; charset=utf-8" "https://texttospeech.googleapis.com/v1/voices"
voices = json.loads("\n".join(voices_json))
en_voices = []
for voice in voices['voices']:
  if "en-" in voice['languageCodes'][0] and "Wavenet" in voice['name']:
    en_voices.append([voice['languageCodes'], voice['name'], voice['ssmlGender']])


#@markdown <small>Path to .txt file containing the dialogue, located in your Google Drive.</small>
dialogue_txt = "" #@param {type:"string"}
format = "dialogue_with_names" #@param ["dialogue_with_names","question_and_answer"]
#@markdown <small>Amount of random variation added to the pauses, as a fraction of the pause length. The `pX_pause` sliders set the length of the pause (in milliseconds) **after** that person is done talking.</small>
pause_variation = 0.2 #@param {type:"slider", min:0, max:1, step:0.01}

## #@markdown <small>Use cURL instead of Python API.</small>
## use_curl = True #@param {type:"boolean"}
use_curl = True

#@markdown <hr size="1" color="#666"/>

#@markdown ###Individual #1
p1_voice = "en-US-Wavenet-B_MALE" #@param ["en-GB-Wavenet-F_FEMALE","en-IN-Wavenet-D_FEMALE","en-AU-Wavenet-A_FEMALE","en-AU-Wavenet-B_MALE","en-AU-Wavenet-C_FEMALE","en-AU-Wavenet-D_MALE","en-GB-Wavenet-A_FEMALE","en-GB-Wavenet-B_MALE","en-GB-Wavenet-C_FEMALE","en-GB-Wavenet-D_MALE","en-IN-Wavenet-A_FEMALE","en-IN-Wavenet-B_MALE","en-IN-Wavenet-C_MALE","en-US-Wavenet-G_FEMALE","en-US-Wavenet-H_FEMALE","en-US-Wavenet-I_MALE","en-US-Wavenet-J_MALE","en-US-Wavenet-A_MALE","en-US-Wavenet-B_MALE","en-US-Wavenet-C_FEMALE","en-US-Wavenet-D_MALE","en-US-Wavenet-E_FEMALE","en-US-Wavenet-F_FEMALE"]
p1_speaking_rate = 1 #@param {type:"slider", min:0.25, max:4, step:0.05}
p1_pitch = 0 #@param {type:"slider", min:-20, max:20, step:1}
p1_pause = 10 #@param {type:"slider", min:0, max:2000, step:10}

#@markdown <hr size="1" color="#666"/>

#@markdown ###Individual #2
p2_voice = "en-GB-Wavenet-C_FEMALE" #@param ["en-GB-Wavenet-F_FEMALE","en-IN-Wavenet-D_FEMALE","en-AU-Wavenet-A_FEMALE","en-AU-Wavenet-B_MALE","en-AU-Wavenet-C_FEMALE","en-AU-Wavenet-D_MALE","en-GB-Wavenet-A_FEMALE","en-GB-Wavenet-B_MALE","en-GB-Wavenet-C_FEMALE","en-GB-Wavenet-D_MALE","en-IN-Wavenet-A_FEMALE","en-IN-Wavenet-B_MALE","en-IN-Wavenet-C_MALE","en-US-Wavenet-G_FEMALE","en-US-Wavenet-H_FEMALE","en-US-Wavenet-I_MALE","en-US-Wavenet-J_MALE","en-US-Wavenet-A_MALE","en-US-Wavenet-B_MALE","en-US-Wavenet-C_FEMALE","en-US-Wavenet-D_MALE","en-US-Wavenet-E_FEMALE","en-US-Wavenet-F_FEMALE"]
p2_speaking_rate = 1 #@param {type:"slider", min:0.25, max:4, step:0.05}
p2_pitch = 0 #@param {type:"slider", min:-20, max:20, step:1}
p2_pause = 70 #@param {type:"slider", min:0, max:2000, step:10}

#@markdown <hr size="1" color="#666"/>



dialogue_txt = drive_root+dialogue_txt
p1_voice = p1_voice.split('_')[0]
p2_voice = p2_voice.split('_')[0]

# Look up language codes and gender for each selected voice by name
voice_info = {voice[1]: voice for voice in en_voices}
p1_langs, _, p1_gender = voice_info[p1_voice]
p2_langs, _, p2_gender = voice_info[p2_voice]

dlg = parseDialogue(format, dialogue_txt, True)

# Empty tmp
if os.listdir(tmp_mono):
  !rm {tmp_mono}*
if os.listdir(tmp):
  !rm {tmp}*

# Get audio
if use_curl:
  for i, repla in enumerate(dlg):
    if i % 2 == 0:
      # Individual #1 speaks first
      vname = p1_voice
      vlang = p1_langs[0]
      vgend = p1_gender
      vrate = p1_speaking_rate
      vpitc = p1_pitch
      pause = p1_pause
    else:
      # Individual #2
      vname = p2_voice
      vlang = p2_langs[0]
      vgend = p2_gender
      vrate = p2_speaking_rate
      vpitc = p2_pitch
      pause = p2_pause
    input_json = "{'input':{'text':'"+str(repla)+"'},'voice':{'languageCode':'"+str(vlang)+"','name':'"+str(vname)+"','ssmlGender':'"+str(vgend)+"'},'audioConfig':{'speaking_rate': '"+str(vrate)+"', 'pitch': '"+str(vpitc)+"', 'audioEncoding':'LINEAR16'}}"
    print('\n')
    op(c.title, str(i), repla)
    !curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -H "Content-Type: application/json; charset=utf-8" --data "{input_json}" "https://texttospeech.googleapis.com/v1/text:synthesize" > temp.json
    # The response is a JSON object; its audioContent field holds the base64-encoded audio
    with open('temp.json') as fp:
      audio_content = json.load(fp)['audioContent']
    writeFile('base64-temp.txt', audio_content)
    tmp_wav_file = tmp_mono+'google_decoded-'+str(i).zfill(4)+'.wav'
    wav_file = tmp+'google_decoded-'+str(i).zfill(4)+'.wav'
    !base64 base64-temp.txt -d > {tmp_wav_file}
    !ffmpeg {ffmpeg_q} -y -i "{tmp_wav_file}" {wav_44} "{wav_file}"

    # Randomly lengthen or shorten the pause by the chosen variation
    pvar = pause * pause_variation if odds(0.5) else -pause * pause_variation
    pause = pause + pvar
    save(generate_silence(pause/1000, sr=44100), tmp+'google_decoded-'+str(i).zfill(4)+'_pause.wav', sr=44100)
  
  
  output_wav = path_dir(dialogue_txt)+basename(dialogue_txt)+'_google_tts_'+rnd_str(4)+'.wav'
  concat_audio(tmp, output_wav)

  output.clear()
  op(c.ok, 'File saved as', output_wav.replace(drive_root, ''))
  print('\n')
  audio_player(output_wav)
else:
  pass
  ## Update to Python API whenever it works:
  
  # from google.cloud import texttospeech
  # client = texttospeech.TextToSpeechClient()

  # synthesis_input = texttospeech.SynthesisInput(text="Hello, World!")
  # voice = texttospeech.VoiceSelectionParams(
  #     language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL
  # )

  # audio_config = texttospeech.AudioConfig(
  #     audio_encoding=texttospeech.AudioEncoding.LINEAR16
  # )

  # response = client.synthesize_speech(
  #     input=synthesis_input, voice=voice, audio_config=audio_config
  # )

  # with open('out.wav', "wb") as out:
  #     out.write(response.audio_content)


In [None]:
#@title #Amazon Polly TTS

#@markdown <small>You can create access keys [here](https://console.aws.amazon.com/iam/home?#/security_credentials).</small>
aws_access_key_id = "" #@param {type:"string"}
aws_secret_access_key = "" #@param {type:"string"}
#region = "eu-north-1" #@param {type:"string"}
region = "eu-central-1" #@param ["us-east-1", "us-east-2", "us-west-1", "us-west-2", "eu-central-1", "eu-west-2"]



#@markdown <small>Path to .txt file containing the dialogue, located in your Google Drive.</small>
dialogue_txt = "" #@param {type:"string"}
format = "dialogue_with_names" #@param ["dialogue_with_names","question_and_answer"]
#@markdown <small>Amount of random variation added to the pauses, as a fraction of the pause length. The `pX_pause` sliders set the length of the pause (in milliseconds) **after** that person is done talking.</small>
pause_variation = 0.2 #@param {type:"slider", min:0, max:1, step:0.01}

## #@markdown <small>Use cURL instead of Python API.</small>
## use_curl = True #@param {type:"boolean"}
use_curl = True

#@markdown <hr size="1" color="#666"/>

#@markdown ###Individual #1
p1_voice = "Brian" #@param ["Emma","Brian","Ivy","Joanna","Kendra","Kimberly","Salli","Joey","Justin","Kevin","Matthew"]
p1_pause = 10 #@param {type:"slider", min:0, max:2000, step:10}

#@markdown <hr size="1" color="#666"/>

#@markdown ###Individual #2
p2_voice = "Joanna" #@param ["Emma","Brian","Ivy","Joanna","Kendra","Kimberly","Salli","Joey","Justin","Kevin","Matthew"]
p2_pause = 70 #@param {type:"slider", min:0, max:2000, step:10}

#@markdown <hr size="1" color="#666"/>



dialogue_txt = drive_root+dialogue_txt

dlg = parseDialogue(format, dialogue_txt)

# Empty tmp
if os.listdir(tmp):
  !rm {tmp}*

import boto3
polly_client = boto3.Session(aws_access_key_id, aws_secret_access_key, region_name=region).client('polly')

for i, repla in enumerate(dlg):

  if i % 2 == 0:
    # Individual #1 speaks first
    voice = p1_voice
    pause = p1_pause
  else:
    # Individual #2
    voice = p2_voice
    pause = p2_pause

  op(c.title, str(i), repla)

  response = polly_client.synthesize_speech(Engine='neural', VoiceId=voice, OutputFormat='mp3', Text = repla)

  mp3_file = tmp+'polly_decoded-'+str(i).zfill(4)+'.mp3'
  wav_file = tmp+'polly_decoded-'+str(i).zfill(4)+'.wav'

  # Write the returned MP3 stream to a temporary file
  with open(mp3_file, 'wb') as mp3_out:
    mp3_out.write(response['AudioStream'].read())

  # Randomly lengthen or shorten the pause by the chosen variation
  pvar = pause * pause_variation if odds(0.5) else -pause * pause_variation
  pause = pause + pvar
  save(generate_silence(pause/1000), tmp+'polly_decoded-'+str(i).zfill(4)+'_pause.wav')

  !ffmpeg {ffmpeg_q} -y -i "{mp3_file}" {wav_44} -af "pan=stereo|c0=c0|c1=c0" "{wav_file}"
  !rm "{mp3_file}"

output_wav = path_dir(dialogue_txt)+basename(dialogue_txt)+'_polly_'+rnd_str(4)+'.wav'
concat_audio(tmp, output_wav)

output.clear()
op(c.ok, 'File saved as', output_wav.replace(drive_root, ''))
print('\n')
audio_player(output_wav)

