# [TTS] Audio Creation Test
This sample demonstrates how to use the Azure AI Speech API to generate audio from text.

> ✨ ***Note*** <br>
> Before you get started, check the [supported languages](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts) and [region availability](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/regions).

## Prerequisites
Clone the repository to your local machine:

```bash
git clone https://github.com/hyogrin/Azure_OpenAI_samples.git
```

* A subscription key for the Speech service. See [Try the speech service for free](https://docs.microsoft.com/azure/cognitive-services/speech-service/get-started).
* Python 3.10 or later, to match the virtual environment configured below. Downloads are available [here](https://www.python.org/downloads/).
* The Python Speech SDK package is available for Windows (x64 or x86) and Linux (x64; Ubuntu 16.04 or Ubuntu 18.04).
* On Ubuntu 16.04 or 18.04, run the following commands for the installation of required packages:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.0 libasound2
  ```
* On Debian 9, run the following commands for the installation of required packages:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.2 libasound2
  ```
* On Windows you need the [Microsoft Visual C++ Redistributable for Visual Studio 2017](https://support.microsoft.com/help/2977003/the-latest-supported-visual-c-downloads) for your platform.

Configure a Python virtual environment for 3.10 or later in Visual Studio Code:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for Python: Create Environment.
 1. Select Venv or Conda and choose where to create the new environment.
 1. Select the Python interpreter version. Create with version 3.10 or later.

Then install the required packages:

```bash
pip install -r requirements.txt
```

Create a `.env` file based on the `.env-sample` file. Copy the new `.env` file to the folder containing your notebook and set `AZURE_AI_SPEECH_API_KEY` and `AZURE_AI_SPEECH_REGION`.
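
A missing or blank variable tends to surface only later, as a canceled synthesis request, so it can help to check the configuration up front. The following is a minimal sketch: `require_env` is a hypothetical helper (not part of the Speech SDK or python-dotenv), and in the notebook it would be called after `load_dotenv()`. The `westus` value is a placeholder for illustration only.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file.")
    return value


# Placeholder so the example runs on its own; a real value from .env is left untouched.
os.environ.setdefault("AZURE_AI_SPEECH_REGION", "westus")
print(require_env("AZURE_AI_SPEECH_REGION"))
```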

## Speech Synthesis Using the Speech SDK

In [None]:
import azure.cognitiveservices.speech as speechsdk
import os
import time
import json
from openai import AzureOpenAI
from dotenv import load_dotenv
load_dotenv()

speech_key = os.getenv("AZURE_AI_SPEECH_API_KEY")
speech_region = os.getenv("AZURE_AI_SPEECH_REGION")

True

Create a speech config from the subscription key and service region (e.g., "westus") loaded from the `.env` file, and an audio config that plays the result on the default speaker.

In [None]:
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)

In [4]:
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

In [5]:
print("Type some text that you want to speak...")
text = input()

Type some text that you want to speak...


In [6]:
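MAX_RETRIES = 2
result = None
for attempt in range(MAX_RETRIES):
    try:
        result = speech_synthesizer.speak_text_async(text).get()
        break  # stop retrying once a result is returned
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(10)  # wait before the next attempt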

ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib pcm_hw.c:1829:(_snd_pcm_hw_open) Invalid value for card
ALSA lib pcm_hw.c:1829:(_snd_pcm_hw_open) Invalid value for card
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pulse.c:242:(pulse_connect) PulseAudio: Unable to connect: Connection refused

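
The retry cell above waits a fixed 10 seconds between attempts. As a hedged alternative, the sketch below wraps any callable in a retry loop with exponential backoff and re-raises the last error if every attempt fails. `call_with_retries` and `flaky` are hypothetical names introduced for illustration; `flaky` is a stand-in for a network call so the example runs without the Speech service.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts and
    re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a network call: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retries(flaky, base_delay=0.01))
```

In the notebook, `func` could be `lambda: speech_synthesizer.speak_text_async(text).get()`.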

In [7]:
os.makedirs("output", exist_ok=True)  # make sure the output folder exists before saving
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized to speaker for text [{}]".format(text))
    stream = speechsdk.AudioDataStream(result)
    stream.save_to_wav_file("output/result_text.wav")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        if cancellation_details.error_details:
            print("Error details: {}".format(cancellation_details.error_details))
        # An error cancellation is often caused by a wrong key or region
        print("Did you update the subscription info?")

Speech synthesized to speaker for text [test]


In [8]:
import html
default_tts_voice = 'en-US-JennyMultilingualV2Neural' # Default TTS voice

ssml = f"""<speak version='1.0'  xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
                     <voice name='{default_tts_voice}'>
                             {html.escape(text)}
                     </voice>
                   </speak>"""

In [9]:
speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
stream = speechsdk.AudioDataStream(speech_synthesis_result)
stream.save_to_wav_file("output/result_ssml.wav")
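
Because the SSML is assembled with string formatting, a quick well-formedness check before sending it to the service can catch escaping mistakes early. The sketch below is one way to do that with the standard library only. `build_ssml` and `sample_ssml` are hypothetical names, and `SSML_NS` is the W3C SSML namespace URI.

```python
import html
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, voice: str, lang: str = "en-US") -> str:
    """Build a single-voice SSML document, escaping every interpolated value."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="{html.escape(lang)}">'
        f'<voice name="{html.escape(voice)}">{html.escape(text)}</voice>'
        "</speak>"
    )


sample_ssml = build_ssml("Tom & Jerry say <hello>", "en-US-JennyMultilingualV2Neural")
root = ET.fromstring(sample_ssml)  # raises ParseError if the SSML is not well-formed
voice = root.find(f"{{{SSML_NS}}}voice")
print(voice.text)  # the parser restores the original, unescaped text
```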

In [15]:
def get_audio_file_by_speech_synthesis(text, file_name):
    """Synthesize text with the default voice and save the audio to file_name."""
    ssml = f"""<speak version='1.0'  xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
                     <voice name='{default_tts_voice}'>
                             {html.escape(text)}
                     </voice>
                   </speak>"""

    speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_synthesis_result)
    stream.save_to_wav_file(file_name)

## Generate Audio from JSONL

In [None]:
import json


# Ten telephone expressions in Vietnamese (vi), English (en), and Korean (ko).
# Each dictionary becomes one row of the JSONL file written below.
expressions = [
    {"no": 1, "vi": "Xin chào, tôi có thể giúp gì cho bạn?", "en": "Hello, how can I help you?", "ko": "안녕하세요, 무엇을 도와드릴까요?"},
    {"no": 2, "vi": "Bạn có thể cho tôi biết số điện thoại của bạn không?", "en": "Can you tell me your phone number?", "ko": "전화번호를 알려주시겠어요?"},
    {"no": 3, "vi": "Tôi sẽ gọi lại sau.", "en": "I will call back later.", "ko": "나중에 다시 전화할게요."},
    {"no": 4, "vi": "Bạn có thể giữ máy một chút không?", "en": "Can you hold on a moment?", "ko": "잠시만 기다려 주시겠어요?"},
    {"no": 5, "vi": "Tôi không nghe rõ, bạn có thể nói lại không?", "en": "I can't hear you clearly, can you repeat?", "ko": "잘 안 들려요, 다시 말씀해 주시겠어요?"},
    {"no": 6, "vi": "Xin lỗi, số máy bạn gọi hiện không liên lạc được.", "en": "Sorry, the number you called is currently unavailable.", "ko": "죄송합니다, 지금 전화를 받을 수 없습니다."},
    {"no": 7, "vi": "Bạn có thể để lại tin nhắn không?", "en": "Can you leave a message?", "ko": "메시지를 남겨 주시겠어요?"},
    {"no": 8, "vi": "Tôi sẽ chuyển cuộc gọi của bạn.", "en": "I will transfer your call.", "ko": "전화를 연결해 드리겠습니다."},
    {"no": 9, "vi": "Bạn có thể gọi lại sau không?", "en": "Can you call back later?", "ko": "나중에 다시 전화해 주시겠어요?"},
    {"no": 10, "vi": "Cảm ơn bạn đã gọi, chúc bạn một ngày tốt lành!", "en": "Thank you for calling, have a nice day!", "ko": "전화 주셔서 감사합니다, 좋은 하루 되세요!"}
]

with open('telephone_expressions.jsonl', 'w', encoding='utf-8') as f:
    for expression in expressions:
        f.write(json.dumps(expression, ensure_ascii=False) + '\n')
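
Before synthesizing audio from a JSONL file, it can be worth confirming that every row parses and carries the expected keys. The following is a minimal sketch under that assumption. `load_expressions` and `REQUIRED_KEYS` are hypothetical names, and the example round-trips a one-row sample through a temporary file rather than touching `telephone_expressions.jsonl`.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"no", "vi", "en", "ko"}


def load_expressions(path):
    """Read a JSONL file and return its rows, raising ValueError on a malformed row."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)  # JSONDecodeError is a subclass of ValueError
            if not isinstance(row, dict):
                raise ValueError(f"line {line_number}: expected a JSON object")
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
            rows.append(row)
    return rows


# Round-trip a small sample through a temporary file.
sample = [{"no": 1, "vi": "Xin chào", "en": "Hello", "ko": "안녕하세요"}]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for row in sample:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    print(load_expressions(sample_path) == sample)
```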

In [None]:
import datetime
languages = ['vi', 'en', 'ko']
with open('telephone_expressions.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        expression = json.loads(line)
        no = expression['no']
        for lang in languages:
            text = expression[lang]
            timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
            file_name = f"output/{no}_{lang}_{timestamp}.wav"
            get_audio_file_by_speech_synthesis(text, file_name)
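
The loop above names each file `<no>_<lang>_<timestamp>.wav` and assumes the `output` folder already exists. A hedged sketch of the same naming scheme as a helper that also creates the folder is shown below. `build_output_path` is a hypothetical name, and the fixed timestamp and temporary folder are used only to make the example reproducible.

```python
import datetime
import tempfile
from pathlib import Path


def build_output_path(no, lang, output_dir="output", now=None):
    """Return output_dir/<no>_<lang>_<timestamp>.wav, creating output_dir if needed."""
    now = now or datetime.datetime.now()
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{no}_{lang}_{now:%Y%m%d%H%M%S}.wav"


# Use a temporary folder and a fixed timestamp so the result is reproducible.
with tempfile.TemporaryDirectory() as tmp:
    fixed = datetime.datetime(2024, 11, 4, 9, 37, 35)
    path = build_output_path(1, "vi", output_dir=tmp, now=fixed)
    print(path.name)
```

Because the timestamp has one-second resolution, two runs for the same row and language within the same second produce the same name and the later file overwrites the earlier one. When passing the result to `get_audio_file_by_speech_synthesis`, convert it with `str(path)`.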

## List WAV Files in the Output Folder
Use the os library to list all WAV files in the output folder and sort them by expression number.

In [28]:
import os
from IPython.display import Audio, display

output_folder = 'output'
files = os.listdir(output_folder)
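# Keep only the numbered batch files; result_text.wav and result_ssml.wav have no numeric prefix
wav_files = [file for file in files if file.endswith('.wav') and file.split('_')[0].isdigit()]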

# Sort wav_files by 'no' in ascending order
wav_files.sort(key=lambda x: int(x.split('_')[0]))
wav_files

['1_en_20241104093736.wav',
 '1_ko_20241104093737.wav',
 '1_vi_20241104093735.wav',
 '2_en_20241104093738.wav',
 '2_ko_20241104093739.wav',
 '2_vi_20241104093738.wav',
 '3_en_20241104093739.wav',
 '3_ko_20241104093740.wav',
 '3_vi_20241104093739.wav',
 '4_en_20241104093740.wav',
 '4_ko_20241104093741.wav',
 '4_vi_20241104093740.wav',
 '5_en_20241104093741.wav',
 '5_ko_20241104093742.wav',
 '5_vi_20241104093741.wav',
 '6_en_20241104093743.wav',
 '6_ko_20241104093743.wav',
 '6_vi_20241104093742.wav',
 '7_en_20241104093744.wav',
 '7_ko_20241104093745.wav',
 '7_vi_20241104093744.wav',
 '8_en_20241104093745.wav',
 '8_ko_20241104093746.wav',
 '8_vi_20241104093745.wav',
 '9_en_20241104093747.wav',
 '9_ko_20241104093747.wav',
 '9_vi_20241104093746.wav',
 '10_en_20241104093748.wav',
 '10_ko_20241104093748.wav',
 '10_vi_20241104093747.wav']
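
Sorting on the number alone leaves the order of the three languages within each number up to `os.listdir`, which does not guarantee any order. As a sketch of a fully deterministic alternative, the key below sorts by number, then language, then timestamp. `wav_sort_key` is a hypothetical helper and assumes every name follows the `<no>_<lang>_<timestamp>.wav` pattern.

```python
def wav_sort_key(file_name):
    """Split '<no>_<lang>_<timestamp>.wav' into a (no, lang, timestamp) sort key."""
    stem = file_name.rsplit(".", 1)[0]
    no, lang, timestamp = stem.split("_")
    return int(no), lang, timestamp


names = [
    "10_en_20241104093748.wav",
    "2_vi_20241104093738.wav",
    "2_en_20241104093738.wav",
    "1_ko_20241104093737.wav",
]
print(sorted(names, key=wav_sort_key))
```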

## Play WAV Files
Use IPython.display.Audio to play WAV files from the output folder. The cell below plays the first three.

In [29]:
# Play the first three WAV files in the output folder
for wav_file in wav_files[:3]:
    file_path = os.path.join(output_folder, wav_file)
    display(Audio(filename=file_path))
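
Outside a notebook, or when checking many files at once, reading the WAV header can be a quicker sanity check than listening. The sketch below uses the standard-library `wave` module and assumes the files are uncompressed PCM WAV. `wav_duration_seconds` is a hypothetical helper, and the example writes one second of silence to a temporary file so that it runs without any synthesized audio.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Self-contained demonstration: one second of 16 kHz, 16-bit mono silence.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000)
    print(wav_duration_seconds(demo_path))
```

A duration of zero for a file in the output folder would suggest that synthesis for that row did not complete.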