# Generate an audio caption for sonification

First, install the TTS package (Coqui TTS), which provides the `TTS.api` module used below:
pip install TTS

In [30]:
from TTS.api import TTS
from IPython.display import Audio

The default text-to-speech (TTS) model we use is 'tts_models/en/jenny/jenny', an English model with a soft, female, Irish voice. You can hear a sample below:

In [None]:
Audio('tts_jenny.wav', autoplay=True)

You can choose other voices in a range of languages. To try them, use the following code:

In [None]:
# List models
TTS.list_models()
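The full model list is long, so it can help to narrow it down to one language. The sketch below assumes you have the model names as a plain list of strings and that each name follows the `type/language/dataset/model` layout seen in 'tts_models/en/jenny/jenny'. Depending on the installed TTS version, the listing call may return a manager object rather than a list, so you may need to extract the names first. `filter_models` and `sample_names` are hypothetical names used only for illustration.

```python
def filter_models(model_names, language=None, model_type="tts_models"):
    """Return the names matching a model type and, optionally, a language code."""
    selected = []
    for name in model_names:
        parts = name.split("/")
        # Assumed layout: <type>/<language>/<dataset>/<model>
        if len(parts) != 4:
            continue
        kind, lang, _dataset, _model = parts
        if kind != model_type:
            continue
        if language is not None and lang != language:
            continue
        selected.append(name)
    return selected


# Hypothetical sample; in the notebook, pass in the names from the model listing.
sample_names = [
    "tts_models/en/jenny/jenny",
    "tts_models/en/ljspeech/tacotron2-DDC",
    "tts_models/de/thorsten/vits",
    "vocoder_models/en/ljspeech/hifigan_v2",
]

english_models = filter_models(sample_names, language="en")
print(english_models)
```

Any name returned this way can be passed as `model_name` when creating a `TTS` object, as in the cells that follow.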

In [None]:
# Test models: the following is a US English female voice.
OUTPUT_PATH = 'tts_english.wav'
tts = TTS(model_name='tts_models/en/ljspeech/tacotron2-DDC', progress_bar=False, gpu=False)
tts.tts_to_file(text='The quick brown fox jumped over the lazy dogs.', file_path=OUTPUT_PATH)
Audio(OUTPUT_PATH, autoplay=True)

In [None]:
# Test models: the following is a German male voice.
OUTPUT_PATH = 'tts_german.wav'
tts = TTS(model_name='tts_models/de/thorsten/vits', progress_bar=False, gpu=False)
tts.tts_to_file(text='Der flinke braune Fuchs sprang über die faulen Hunde.', file_path=OUTPUT_PATH)
Audio(OUTPUT_PATH, autoplay=True)


Using punctuation helps create pauses and emphasis in the correct places. Below are two samples, identical apart from the addition of a comma in the second.

In [None]:
Audio('tts_sample.wav', autoplay=True)

In [None]:
Audio('tts_sample_with_comma.wav', autoplay=True)

Let's reset the model to our default:

In [31]:
tts = TTS(model_name='tts_models/en/jenny/jenny', progress_bar=False, gpu=False)

 > tts_models/en/jenny/jenny is already downloaded.
 > Model's license - custom - see https://github.com/dioco-group/jenny-tts-dataset#important
 > Check https://opensource.org/licenses for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:48000
 | > resample:False
 | > num_mels:100
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:2048
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:512
 | > win_length:2048


TTS ignores characters it doesn't recognise, such as Greek letters and some mathematical symbols, and it can struggle with multi-digit numbers. It is best to write these out in words. In the example below, the first sentence uses symbols and digits, and the second spells everything out:

In [32]:
OUTPUT_PATH = 'tts_lya.wav'
tts.tts_to_file(text="The Lyman-α resonance is 1216 Å. The Lyman alpha resonance is twelve hundred and sixteen angstroms.", file_path=OUTPUT_PATH)
Audio(OUTPUT_PATH, autoplay=True)

 > Text splitted to sentences.
['The Lyman-α resonance is 1216 Å.', 'The Lyman alpha resonance is twelve hundred and sixteen angstroms.']
 > Processing time: 18.433180570602417
 > Real-time factor: 2.7691307817630073
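Spelling out symbols by hand is easy to forget, so a small pre-processing step can help. The following is a minimal sketch, not part of the TTS library: `SPOKEN_SYMBOLS`, `spell_out_symbols` and `find_unspoken` are hypothetical helpers, and the symbol table is deliberately incomplete. It replaces a few known symbols with words and then reports any digits or non-ASCII characters that remain, so that you can rewrite them yourself.

```python
import re

# Illustrative, incomplete table of symbols and their spoken forms (an assumption;
# extend it with the symbols that appear in your own captions).
SPOKEN_SYMBOLS = {
    "\u03b1": "alpha",      # Greek small letter alpha
    "\u03b2": "beta",       # Greek small letter beta
    "\u03bb": "lambda",     # Greek small letter lambda
    "\u00c5": "angstroms",  # A with ring above (always plural here, for simplicity)
    "\u212b": "angstroms",  # dedicated angstrom sign
}


def spell_out_symbols(text):
    """Replace known symbols with words, dropping a hyphen that joins them to a word."""
    for symbol, spoken in SPOKEN_SYMBOLS.items():
        # An optional hyphen and any whitespace before the symbol become one space.
        text = re.sub(r"-?\s*" + re.escape(symbol), " " + spoken, text)
    # Collapse any repeated whitespace left behind by the replacements.
    return re.sub(r"\s+", " ", text).strip()


def find_unspoken(text):
    """Return digit runs and non-ASCII characters that still need rewriting by hand."""
    return re.findall(r"\d+|[^\x00-\x7F]", text)


cleaned = spell_out_symbols("The Lyman-\u03b1 resonance is 1216 \u00c5.")
print(cleaned)
print(find_unspoken(cleaned))
```

Numbers are only flagged, not converted, because the best spoken form depends on context: the example above reads 1216 as "twelve hundred and sixteen" rather than digit by digit.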


If the model struggles with a word or name, you can adjust the spelling to improve the pronunciation. For instance, the Italian name "Chierchia" can be spelled "Kyerkia" to get close to the correct pronunciation.
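If several names need respelling, you can keep the substitutions in one place and apply them just before synthesis, leaving the written caption untouched. This is a sketch under that assumption: `RESPELLINGS` and `apply_respellings` are hypothetical names, and the only entry is the example given above.

```python
import re

# Hypothetical respelling table: written form -> spelling that the voice pronounces better.
RESPELLINGS = {
    "Chierchia": "Kyerkia",
}


def apply_respellings(text, respellings=RESPELLINGS):
    """Swap each listed word for its phonetic spelling, matching whole words only."""
    for written, spoken in respellings.items():
        # \b restricts the match to whole words, so longer words are left alone.
        text = re.sub(r"\b" + re.escape(written) + r"\b", spoken, text)
    return text


spoken_text = apply_respellings("Dr Chierchia presented the data.")
print(spoken_text)
```

The respelled string is the one to pass as `text` to `tts.tts_to_file`; the original spelling can still be used wherever the caption is displayed.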

Now let's try entering a caption:

In [None]:
caption = input("Please enter your caption: ")

In [None]:
OUTPUT_PATH = 'tts_caption.wav'
tts.tts_to_file(text=caption, file_path=OUTPUT_PATH)
Audio(OUTPUT_PATH, autoplay=True)

## Sandbox