# TUTORIAL AI Endpoints - Mixing voice emotion with TTS

*This tutorial introduces the concept of emotion mixing and explains how to use this TTS feature with [AI Endpoints](https://endpoints.ai.cloud.ovh.net/).*

![TTS](tts_emotions_tutorial.png)

## Concept

To better understand the **emotion mixing** feature, let’s start by introducing the TTS concept.

### AI Endpoints in a few words

**AI Endpoints** is a new serverless platform powered by OVHcloud and designed for developers. The aim of AI Endpoints is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise.

It offers a curated catalog of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.

AI Endpoints provides [access to advanced AI models](https://endpoints.ai.cloud.ovh.net/catalog), including Large Language Models (LLMs), Natural Language Processing, translation, and **Text-To-Speech**.

### Synthesize Speech using TTS

TTS stands for **Text-To-Speech**, which is a type of technology that converts written text into spoken words.

This technology uses Artificial Intelligence algorithms to interpret and generate human-like speech from text input.

It is commonly used in various applications such as voice assistants, audiobooks, language learning platforms, and accessibility tools for individuals with visual or reading impairments.

With AI Endpoints, TTS is easy to use thanks to turnkey inference APIs, available in the following languages:

- `en-US`
- `es-ES`
- `de-DE`
- `it-IT`
- `zh-CN`

**And what about the voice emotion mixing?**

Emotion mixing refers to the ability of a Text-To-Speech system to convey different emotions in its synthetic speech output.
This is typically achieved by adjusting parameters such as **pitch**, **intonation**, **volume**, and **speed**, and by adding pauses and changes in style, to simulate emotional states such as happiness, sadness, anger, fear, or surprise. The result is synthetic speech that sounds more expressive, engaging, and natural.

## Technical Implementation

In this tutorial, the [en-US](https://endpoints.ai.cloud.ovh.net/models/5f607d54-57fa-46ff-b243-87856a242aa0) TTS model is used to show how emotion mixing works.

### Step 1 - Install dependencies

In [None]:
!pip install python-dotenv nvidia-riva-client pydub 

### Step 2 - Set up your environment

- Import Python libraries

In [1]:
import os
import numpy as np
import riva.client
import IPython.display as ipd
from dotenv import load_dotenv  # needed for load_dotenv() below
from pydub import AudioSegment

- Create a `.env` file that defines the AI Endpoints environment variables read below: `TTS_ENDPOINT` and `OVH_AI_ENDPOINTS_ACCESS_TOKEN`

*⚠️ Test AI Endpoints and get your free token <`ai-endpoints-api-token`> [here](https://endpoints.ai.cloud.ovh.net/)*

- Load environment variables

In [None]:
# access the environment variables from the .env file
load_dotenv()
tts_endpoint = os.environ.get('TTS_ENDPOINT') 
ai_endpoint_token = os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")
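
If a variable is missing from the `.env` file, `os.environ.get` and `os.getenv` silently return `None`, and the error only surfaces later when connecting. A minimal sketch of an early check is shown below. The helper name `require_env` and the placeholder values are hypothetical and not part of AI Endpoints; only the two variable names come from this tutorial.

```python
import os


def require_env(names, env=None):
    """Return the requested variables, or raise if any is missing or empty.

    `env` defaults to os.environ; passing a dict makes the check easy to try out.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Demonstration with placeholder values (hypothetical, replace with your own).
example_env = {
    "TTS_ENDPOINT": "<your-tts-endpoint>",
    "OVH_AI_ENDPOINTS_ACCESS_TOKEN": "<ai-endpoints-api-token>",
}
settings = require_env(["TTS_ENDPOINT", "OVH_AI_ENDPOINTS_ACCESS_TOKEN"], example_env)
```

In the notebook, calling `require_env([...])` without the second argument right after `load_dotenv()` would check the real environment instead of the example dictionary.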

### Step 3 - Transform text into spoken words using basic TTS

- Connect with TTS endpoint

In [3]:
# connect with tts server
tts_service = riva.client.SpeechSynthesisService(
                riva.client.Auth(
                    uri=tts_endpoint, 
                    use_ssl=True, 
                    metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]]
                )
            )

- Define TTS model configuration

In [4]:
# set up config
sample_rate_hz = 16000
req = { 
        "language_code"  : "en-US",
        "encoding"       : riva.client.AudioEncoding.LINEAR_PCM ,  
        "sample_rate_hz" : sample_rate_hz,                       
        "voice_name"     : "English-US.Female-Happy"                    
}

- Test speech synthesis **without** emotion mixing

In [5]:
# TTS inference
req["text"] = "I am very happy! Now I am sad..."
response = tts_service.synthesize(**req)

In [6]:
# play the generated audio
print("TTS synthesis WITHOUT Emotion Mixing:\n")
audio_samples = np.frombuffer(response.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))

TTS synthesis WITHOUT Emotion Mixing:



> The emotion stays the same for the whole audio: `happy`, as set by the voice name in the configuration.
> Now let's try **mixing emotions within the same audio**!

### Step 4 - Mix emotions in audio using SSML

***SSML** (Speech Synthesis Markup Language) is an XML-based markup language for controlling how a TTS system renders text, for example its pitch, rate, pauses and, here, emotion.*

- Change TTS configuration

In [7]:
# set up config
sample_rate_hz = 16000
req_emotion = { 
        "language_code"  : "en-US",
        "encoding"       : riva.client.AudioEncoding.LINEAR_PCM , 
        "sample_rate_hz" : sample_rate_hz,                          
        "voice_name"     : "English-US-RadTTSpp.Female.happy"                    
}

- Test speech synthesis **with** emotion mixing

In [8]:
# ssml text
ssml_text = """<speak><prosody emotion="happy:extreme"> I am very happy!</prosody><prosody emotion="sad:very"> Now I am sad...</prosody></speak>"""
print("SSML text:\n", ssml_text)

SSML text:
 <speak><prosody emotion="happy:extreme"> I am very happy!</prosody><prosody emotion="sad:very"> Now I am sad...</prosody></speak>
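
Writing SSML by hand becomes error-prone once the text contains characters such as `&` or `<`, which must be escaped in XML. The sketch below assembles the same markup from a list of segments. It assumes the `emotion="name:intensity"` attribute format shown above; `build_emotion_ssml` is a hypothetical helper name, not part of the Riva client.

```python
from xml.sax.saxutils import escape, quoteattr


def build_emotion_ssml(segments):
    """Build an SSML string from (emotion, intensity, text) tuples.

    Each segment becomes one <prosody emotion="..."> element; the text is
    XML-escaped so that characters like & and < do not break the markup.
    """
    parts = []
    for emotion, intensity, text in segments:
        attr = quoteattr(f"{emotion}:{intensity}")  # adds the surrounding quotes
        parts.append(f"<prosody emotion={attr}>{escape(text)}</prosody>")
    return "<speak>" + "".join(parts) + "</speak>"


# Rebuilds the SSML text used in this tutorial.
generated_ssml = build_emotion_ssml([
    ("happy", "extreme", " I am very happy!"),
    ("sad", "very", " Now I am sad..."),
])
```

The resulting string can be assigned to `req_emotion["text"]` in the same way as the hand-written `ssml_text`.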


In [9]:
# TTS inference
req_emotion["text"] = ssml_text
resp = tts_service.synthesize(**req_emotion)

In [10]:
# play the generated audio
print("TTS synthesis WITH Emotion Mixing:\n")
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))

TTS synthesis WITH Emotion Mixing:
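
To keep the generated audio outside the notebook, the raw samples can be wrapped in a WAV container. The sketch below uses only the Python standard library and assumes the response holds mono 16-bit `LINEAR_PCM` audio, which matches the `np.int16` decoding used above. The helper names are hypothetical, and one second of silence stands in for `resp.audio`.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm_bytes, sample_rate_hz, channels=1):
    """Wrap raw 16-bit PCM samples in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)      # assumption: mono output
        wav_file.setsampwidth(2)             # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


def duration_seconds(pcm_bytes, sample_rate_hz, channels=1):
    """Length of a 16-bit PCM buffer in seconds."""
    return len(pcm_bytes) / (2 * channels * sample_rate_hz)


# Stand-in for resp.audio: one second of silence at 16 kHz.
silence = b"\x00\x00" * 16000
wav_bytes = pcm16_to_wav_bytes(silence, 16000)
clip_length = duration_seconds(silence, 16000)
```

In the notebook, passing `resp.audio` and `sample_rate_hz` to `pcm16_to_wav_bytes` and writing the result to a file opened in `"wb"` mode would produce a playable WAV file.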



To go further with voice customization using SSML, refer to this [tutorial](https://github.com/nvidia-riva/tutorials/blob/main/tts-basics-customize-ssml.ipynb).