# Generate an audio caption for sonification

First, install the TTS package (Coqui TTS) if you don't already have it.

In [None]:
!pip install TTS

In [1]:
from TTS.api import TTS
from IPython.display import Audio

The default text-to-speech (TTS) model used in STRAUSS is an English-language female voice with an Irish accent: 'tts_models/en/jenny/jenny'. You can hear a sample below:

In [2]:
Audio('tts_jenny.wav', autoplay=True)

You can choose other voices in a range of languages. To see which models are available, list them with the following code:

In [3]:
# List models
TTS.list_models()

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`



['tts_models/multilingual/multi-dataset/your_tts',
 'tts_models/bg/cv/vits',
 'tts_models/cs/cv/vits',
 'tts_models/da/cv/vits',
 'tts_models/et/cv/vits',
 'tts_models/ga/cv/vits',
 'tts_models/en/ek1/tacotron2',
 'tts_models/en/ljspeech/tacotron2-DDC',
 'tts_models/en/ljspeech/tacotron2-DDC_ph',
 'tts_models/en/ljspeech/glow-tts',
 'tts_models/en/ljspeech/speedy-speech',
 'tts_models/en/ljspeech/tacotron2-DCA',
 'tts_models/en/ljspeech/vits',
 'tts_models/en/ljspeech/vits--neon',
 'tts_models/en/ljspeech/fast_pitch',
 'tts_models/en/ljspeech/overflow',
 'tts_models/en/ljspeech/neural_hmm',
 'tts_models/en/vctk/vits',
 'tts_models/en/vctk/fast_pitch',
 'tts_models/en/sam/tacotron-DDC',
 'tts_models/en/blizzard2013/capacitron-t2-c50',
 'tts_models/en/blizzard2013/capacitron-t2-c150_v2',
 'tts_models/en/multi-dataset/tortoise-v2',
 'tts_models/en/jenny/jenny',
 'tts_models/es/mai/tacotron2-DDC',
 'tts_models/es/css10/vits',
 'tts_models/fr/mai/tacotron2-DDC',
 'tts_models/fr/css10/vits',
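
The model names in the list above appear to follow the pattern 'tts_models/<language>/<dataset>/<model>', so the list can be narrowed to one language with plain string handling. The sketch below is an illustration only: `models_for_language` is a hypothetical helper, not part of TTS, and it assumes the model list is a plain list of strings like the output shown above. A short hand-copied sample stands in for the real list.

```python
def models_for_language(model_names, language):
    """Return the model names whose language field equals `language`."""
    matches = []
    for name in model_names:
        parts = name.split('/')
        # Assumed layout: tts_models/<language>/<dataset>/<model>
        if len(parts) == 4 and parts[0] == 'tts_models' and parts[1] == language:
            matches.append(name)
    return matches


# A few names copied from the output above (not the full list).
sample_models = [
    'tts_models/en/ljspeech/tacotron2-DDC',
    'tts_models/en/jenny/jenny',
    'tts_models/es/css10/vits',
    'tts_models/fr/mai/tacotron2-DDC',
    'tts_models/fr/css10/vits',
]

french_models = models_for_language(sample_models, 'fr')
print(french_models)
```

Any name returned this way can be passed as `model_name` when constructing `TTS`, as in the test examples that follow.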

In [4]:
# Test example: the following is a US English female voice.
OUTPUT_PATH = 'tts_english.wav'
tts = TTS(model_name='tts_models/en/ljspeech/tacotron2-DDC', progress_bar=False, gpu=False)
tts.tts_to_file(text='The quick brown fox jumped over the lazy dogs.', file_path=OUTPUT_PATH)
Audio(OUTPUT_PATH, autoplay=True)

 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_no

In [5]:
# Test example: the following is a German male voice.
OUTPUT_PATH = 'tts_german.wav'
tts = TTS(model_name='tts_models/de/thorsten/vits', progress_bar=False, gpu=False)
tts.tts_to_file(text='Der flinke braune Fuchs sprang über die faulen Hunde.', file_path=OUTPUT_PATH)
Audio(OUTPUT_PATH, autoplay=True)


 > tts_models/de/thorsten/vits is already downloaded.
 > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Text splitted to sentences.
['Der flinke braune Fuchs sprang über die fau

Using punctuation helps create pauses and emphasis in the correct places. Below are two samples, identical apart from the addition of a comma in the second.

In [None]:
Audio('tts_sample.wav', autoplay=True)

In [None]:
Audio('tts_sample_with_comma.wav', autoplay=True)

Let's reset the model to our default:

In [None]:
tts = TTS(model_name='tts_models/en/jenny/jenny', progress_bar=False, gpu=False)

TTS ignores anything it doesn't recognise, such as Greek letters and some mathematical symbols, and it can struggle with multi-digit numbers. It's best to write these out in words. In the example below, the first sentence uses symbols and digits, and the second spells them out:

In [None]:
OUTPUT_PATH = 'tts_lya.wav'
tts.tts_to_file(text="The Lyman-α resonance is 1216 Å. The Lyman alpha resonance is twelve hundred and sixteen angstroms.", file_path=OUTPUT_PATH)
Audio(OUTPUT_PATH, autoplay=True)
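
One way to catch these cases before synthesis is to substitute known symbols with words and then flag whatever is left for manual rewriting. The following is a minimal sketch, not a STRAUSS or TTS feature: the `SPOKEN_FORMS` table and both function names are hypothetical, the table covers only the two symbols from the example above, and the check simply treats any non-ASCII character or any run of two or more digits as something to review by hand.

```python
import re

# Hypothetical lookup table: symbol -> how it should be spoken.
SPOKEN_FORMS = {
    'α': 'alpha',
    '\u00c5': 'angstroms',  # Latin capital letter A with ring above
    '\u212b': 'angstroms',  # dedicated angstrom sign (looks identical)
}


def spell_out_symbols(text):
    """Replace every symbol in SPOKEN_FORMS with its spoken form."""
    for symbol, spoken in SPOKEN_FORMS.items():
        text = text.replace(symbol, spoken)
    return text


def find_problem_tokens(text):
    """List remaining non-ASCII characters and multi-digit numbers."""
    non_ascii = [ch for ch in text if ord(ch) > 127]
    long_numbers = re.findall(r'\d{2,}', text)
    return non_ascii + long_numbers


raw = 'The Lyman-α resonance is 1216 Å.'
cleaned = spell_out_symbols(raw)
print(cleaned)
print(find_problem_tokens(cleaned))  # anything listed here still needs rewriting
```

The number is deliberately left for a human to rewrite, since the most natural spoken form ("twelve hundred and sixteen" in the example above) depends on context.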

For any words or names that it struggles with, you can adjust the spelling to improve the pronunciation. For instance, the Italian name "Chierchia" can be spelled "Kyerkia" to get close to the correct pronunciation.
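
Respellings like this can be kept in one place and applied just before synthesis, so that the written caption keeps the correct spelling. This is a minimal sketch, assuming a hypothetical `RESPELLINGS` dictionary and `apply_respellings` helper (neither is part of TTS or STRAUSS); its only entry is the one from the paragraph above.

```python
import re

# Hypothetical table: written form -> phonetic respelling for the TTS model.
RESPELLINGS = {
    'Chierchia': 'Kyerkia',
}


def apply_respellings(text, respellings=RESPELLINGS):
    """Swap each listed word for its respelling, matching whole words only."""
    for word, spoken in respellings.items():
        pattern = r'\b' + re.escape(word) + r'\b'
        # A function replacement avoids any backslash handling in `spoken`.
        text = re.sub(pattern, lambda match, s=spoken: s, text)
    return text


caption_text = 'The name Chierchia appears in this caption.'
spoken_text = apply_respellings(caption_text)
print(spoken_text)
```

The respelled string would then be passed as `text` to `tts.tts_to_file`, while the original string is kept for any written version of the caption.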

Now let's try entering a caption:

In [None]:
caption = input("Please enter your caption: ")

In [None]:
OUTPUT_PATH = 'tts_caption.wav'
tts.tts_to_file(text=caption, file_path=OUTPUT_PATH)
Audio(OUTPUT_PATH, autoplay=True)

## Sandbox