# Text to Speech avatar

## From text to speech with a video avatar provided by Azure Speech Services
Custom text to speech avatar allows you to create a customized, one-of-a-kind synthetic talking avatar for your application. With custom text to speech avatar, you can build a unique and natural-looking avatar for your product or brand by providing video recording data of your selected actors. If you also create a custom neural voice for the same actor and use it as the avatar's voice, the avatar will be even more realistic.

<img src="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/media/custom-avatar-workflow.png#lightbox">

> https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/what-is-custom-text-to-speech-avatar 
> https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-speech-announces-public-preview-of-text-to-speech/ba-p/3981448

In [None]:
#pip install moviepy


### 1. `datetime`

-   **Purpose**: To work with dates and times.
-   **Usage**: Timestamping data, calculating time differences, formatting dates.

### 2. `json`

-   **Purpose**: To work with JSON (JavaScript Object Notation) data.
-   **Usage**: Parsing JSON strings, converting Python objects to JSON format, reading from and writing to JSON files.

### 3. `requests`

-   **Purpose**: To make HTTP requests.
-   **Usage**: Sending HTTP requests to interact with web services (APIs), downloading web content, etc.

### 4. `sys`

-   **Purpose**: To interact with the Python runtime environment.
-   **Usage**: Accessing command-line arguments, handling the interpreter, and system-related functions.

### 5. `time`

-   **Purpose**: To handle time-related tasks.
-   **Usage**: Sleeping, measuring execution time, working with timestamps.

### 6. `ipywidgets.Video`

-   **Purpose**: To display video widgets in Jupyter notebooks.
-   **Usage**: Embedding and controlling video playback within a Jupyter notebook environment.

### 7. `IPython.display.display`

-   **Purpose**: To display rich output (like videos, images, HTML) in Jupyter notebooks.
-   **Usage**: Rendering objects like videos, images, and other rich media in notebook cells.

### 8. `moviepy.editor.VideoFileClip`

-   **Purpose**: To edit and manipulate video files.
-   **Usage**: Reading, editing, and processing video files.

### 9. `pathlib.Path`

-   **Purpose**: To handle filesystem paths.
-   **Usage**: Creating, reading, writing, and manipulating filesystem paths in an object-oriented way.

### 10. `logging`

-   **Purpose**: To log messages for tracking events that happen during software execution.
-   **Usage**: Debugging, error reporting, information logging.

In [14]:
import datetime
import json
import requests
import sys
import time

from ipywidgets import Video
from IPython.display import display
from moviepy.editor import VideoFileClip
from pathlib import Path
import logging


The `sys.version` attribute in Python provides information about the version of Python that is currently being used. This includes the version number, build date, and compiler information.

In [15]:
sys.version

'3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)]'

1.  **Import the `datetime` module**: This module supplies classes for manipulating dates and times.
    
2.  **Get the current date and time**: `datetime.datetime.today()` returns the current local date and time as a `datetime` object.
    
3.  **Format the date and time**: The `strftime` method formats the date and time into a string. The format `'%d-%b-%Y %H:%M:%S'` specifies:

-   `%d`: Day of the month as a zero-padded decimal number.
-   `%b`: Abbreviated month name.
-   `%Y`: Year with century as a decimal number.
-   `%H`: Hour (24-hour clock) as a zero-padded decimal number.
-   `%M`: Minute as a zero-padded decimal number.
-   `%S`: Second as a zero-padded decimal number.

In [16]:
dt = datetime.datetime.today().strftime('%d-%b-%Y %H:%M:%S')
print(f"Today is {dt}")

Today is 15-May-2024 10:36:08
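
To make the format codes concrete, the sketch below formats a fixed timestamp instead of the current time, so the result is reproducible, and then parses the string back with `strptime`. The sample date is arbitrary, and the month abbreviation assumes an English locale.

```python
import datetime

FORMAT = "%d-%b-%Y %H:%M:%S"

# Arbitrary fixed timestamp, chosen only for illustration
sample = datetime.datetime(2024, 5, 15, 10, 36, 8)

formatted = sample.strftime(FORMAT)
print(formatted)

# strptime reverses the operation when given the same format string
parsed = datetime.datetime.strptime(formatted, FORMAT)
print(parsed == sample)
```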



The `logging.basicConfig` function configures the root logger with the specified options:

-   **stream=sys.stdout**: Log messages will be sent to standard output (the console).
-   **level=logging.INFO**: Only messages with a severity level of INFO or higher will be logged. The levels of severity in ascending order are DEBUG, INFO, WARNING, ERROR, and CRITICAL.
-   **format="[%(asctime)s] %(message)s"**: Specifies the format of the log messages. `%(asctime)s` will be replaced with the current time, and `%(message)s` will be replaced with the actual log message.
-   **datefmt="%m/%d/%Y %I:%M:%S %p %Z"**: Specifies the format of the date and time in the log messages. This format will display the date and time as MM/DD/YYYY HH:MM:SS AM/PM Timezone.

In [17]:
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="[%(asctime)s] %(message)s",
    datefmt="%m/%d/%Y %I:%M:%S %p %Z",
)
logger = logging.getLogger(__name__)
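
The level threshold matters later in this notebook: `get_synthesis` logs some messages with `logger.debug`, which stay hidden while the level is `INFO`. The following minimal sketch shows that filtering on a separate logger that writes to an in-memory buffer. The logger name `avatar_demo` and the simplified format string are assumptions made for the illustration.

```python
import io
import logging

# Hypothetical logger and in-memory stream, used only for this illustration
buffer = io.StringIO()
demo_logger = logging.getLogger("avatar_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output away from the root logger

handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))
demo_logger.addHandler(handler)

demo_logger.debug("hidden: below the INFO threshold")
demo_logger.info("shown: at the INFO threshold")
demo_logger.error("shown: above the INFO threshold")

captured = buffer.getvalue()
print(captured)
```

Passing `level=logging.DEBUG` to `logging.basicConfig` would make the debug messages visible as well.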


1.  **Set up the Azure Speech service configuration**: Every REST request in this notebook is authenticated with `azure_speech_key` and sent to the region given by `azure_speech_region`.
    
2.  **No SDK configuration object is needed**: The functions below call the REST API directly with `requests`, so the key and region are stored as plain variables rather than in a Speech SDK configuration object.

In [18]:
# Azure Speech service information
azure_speech_key = "YOUR_AZURE_SPEECH_KEY"  # placeholder: never commit a real key
azure_speech_region = "eastus"
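
One way to keep the key out of the notebook is to read it from the environment. The sketch below is a suggestion rather than part of the original notebook: the variable names `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` and the helper `load_speech_settings` are hypothetical.

```python
import os


def load_speech_settings(env=None):
    """Return (key, region) read from an environment-style mapping."""
    env = os.environ if env is None else env
    key = env.get("AZURE_SPEECH_KEY")  # hypothetical variable name
    # Hypothetical variable name; the default matches the region used in this notebook
    region = env.get("AZURE_SPEECH_REGION", "eastus")
    if not key:
        raise RuntimeError("Set AZURE_SPEECH_KEY before running the notebook")
    return key, region


# Usage with an explicit mapping, so the example runs without real credentials
demo_key, demo_region = load_speech_settings({"AZURE_SPEECH_KEY": "placeholder-key"})
print(demo_region)
```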

The `service_host` variable holds the host name suffix shared by all requests in this notebook. Each function prefixes it with `azure_speech_region` to build the URL of the `batchsynthesis/talkingavatar` endpoint.

In [19]:
service_host = "customvoice.api.speech.microsoft.com"  # Do not change
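
The three functions in the next section each build their URL from the region, the host, and the same API path. As a minimal sketch, that construction could be factored into one helper. The name `build_avatar_url` is hypothetical, and the path is copied from the functions in this notebook.

```python
# API path used by the functions in this notebook
API_PATH = "api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar"


def build_avatar_url(region, host, job_id=None):
    """Return the collection URL, or the URL of one job when job_id is given."""
    base = f"https://{region}.{host}/{API_PATH}"
    return base if job_id is None else f"{base}/{job_id}"


print(build_avatar_url("eastus", "customvoice.api.speech.microsoft.com"))
```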

## Functions

1.  **Function Definition**: The function `submit_synthesis` is defined to accept one parameter, `prompt`.
    
2.  **Construct the URL**: The URL for the API endpoint is constructed using formatted string literals (f-strings). This URL combines the `azure_speech_region` and `service_host` to form the base URL for the Azure Speech service, targeting the `batchsynthesis/talkingavatar` endpoint.
    
3.  **Set up the Headers**: A dictionary named `header` is created to set up the headers for the HTTP POST request. It includes the Azure subscription key (`Ocp-Apim-Subscription-Key`) and specifies that the content type is JSON (`Content-Type`).
    
4.  **Create the Payload**: A dictionary named `payload` is created to form the payload for the request. This includes:
    
    -   Metadata about the synthesis job (`displayName` and `description`).
    -   Specifies the type of text as plain text (`textType`).
    -   Configures the voice to be used for synthesis (`synthesisConfig`).
    -   Placeholder for custom voice configurations (commented out in the provided code).
    -   The text to be synthesized (`inputs`).
    -   Additional properties for the synthesis, such as the avatar character, style, video format, codec, subtitle type, and background color.
5.  **Send the POST Request**: The `requests.post` method is used to send an HTTP POST request to the constructed URL with the payload and headers.
    
6.  **Handle the Response**: The response status code is checked:
    
    -   If the status code is less than 400 (indicating success), a success message is logged along with the job ID from the response JSON, and the job ID is returned.
    -   If the status code is 400 or higher (indicating failure), an error message with the response text is logged and the function implicitly returns `None`.

In [20]:
def submit_synthesis(prompt):
    url = f"https://{azure_speech_region}.{service_host}/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar"

    header = {
        "Ocp-Apim-Subscription-Key": azure_speech_key,
        "Content-Type": "application/json",
    }

    payload = {
        "displayName": "Simple avatar synthesis",
        "description": "Simple avatar synthesis description",
        "textType": "PlainText",
        "synthesisConfig": {
            "voice": "en-US-JennyNeural",
        },
        "customVoices": {
            # "YOUR_CUSTOM_VOICE_NAME": "YOUR_CUSTOM_VOICE_ID"
        },
        "inputs": [
            {
                "text": prompt,
            },
        ],
        "properties": {
            "customized": False,  # set to True if you want to use customized avatar
            "talkingAvatarCharacter": "lisa",  # talking avatar character
            "talkingAvatarStyle": "graceful-sitting",  # talking avatar style, required for prebuilt avatar, optional for custom avatar
            "videoFormat": "webm",  # mp4 or webm, webm is required for transparent background
            "videoCodec": "vp9",  # hevc, h264 or vp9, vp9 is required for transparent background; default is hevc
            "subtitleType": "soft_embedded",
            "backgroundColor": "transparent",
        },
    }

    response = requests.post(url, json.dumps(payload), headers=header)

    if response.status_code < 400:
        logger.info("Batch avatar synthesis job submitted successfully")
        logger.info(f'Job ID: {response.json()["id"]}')
        return response.json()["id"]

    else:
        logger.error(f"Failed to submit batch avatar synthesis job: {response.text}")
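
The comments in the `properties` dictionary state a few constraints: a transparent background requires the `webm` format and the `vp9` codec, and a prebuilt avatar requires a style. A minimal sketch of checking these before submitting a job follows. The helper `check_avatar_properties` is hypothetical and encodes only the constraints written in those comments, not the full service rules.

```python
def check_avatar_properties(properties):
    """Return a list of problems found in an avatar `properties` dictionary."""
    problems = []
    if properties.get("backgroundColor") == "transparent":
        if properties.get("videoFormat") != "webm":
            problems.append("transparent background requires videoFormat 'webm'")
        if properties.get("videoCodec") != "vp9":
            problems.append("transparent background requires videoCodec 'vp9'")
    if not properties.get("customized") and not properties.get("talkingAvatarStyle"):
        problems.append("prebuilt avatars require talkingAvatarStyle")
    return problems


# The settings used in submit_synthesis pass the check
good = {
    "customized": False,
    "talkingAvatarCharacter": "lisa",
    "talkingAvatarStyle": "graceful-sitting",
    "videoFormat": "webm",
    "videoCodec": "vp9",
    "backgroundColor": "transparent",
}
print(check_avatar_properties(good))

# Switching to mp4 while keeping the transparent background is flagged
bad = dict(good, videoFormat="mp4")
print(check_avatar_properties(bad))
```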

1.  **Function Definition**: The function `get_synthesis` is defined to accept one parameter, `job_id`.
    
2.  **Global Variable**: The function declares that it will use a global variable `avatar_url`.
    
3.  **Construct the URL**: The URL for the API endpoint is constructed using formatted string literals (f-strings). This URL combines the `azure_speech_region` and `service_host`, and appends the `job_id` to form the complete URL to check the status of the specific batch synthesis job.
    
4.  **Set up the Headers**: A dictionary named `header` is created to set up the headers for the HTTP GET request. It includes the Azure subscription key (`Ocp-Apim-Subscription-Key`).
    
5.  **Send the GET Request**: The `requests.get` method is used to send an HTTP GET request to the constructed URL with the headers.
    
6.  **Handle the Response**: The response status code is checked:
    
    -   If the status code is less than 400 (indicating success), a debug message is logged indicating the successful retrieval of the batch synthesis job. The JSON response from the server is also logged for debugging purposes.
7.  **Check the Job Status**:
    
    -   The function extracts the `status` from the JSON response.
    -   If the `status` is `"Succeeded"`, it updates the global variable `avatar_url` with the download URL from the JSON response and logs an info message indicating that the job succeeded along with the download URL.
8.  **Return the Status**:
    
    -   The function returns the `status` of the batch synthesis job.
9.  **Error Handling**:
    
    -   If the status code is 400 or higher (indicating failure), an error message with the response text is logged and the function implicitly returns `None`.

### Key Points:

-   The `get_synthesis` function is designed to retrieve the status of a batch synthesis job submitted to the Azure Speech service using a custom endpoint.
-   It dynamically constructs the API endpoint URL based on the provided `job_id`.
-   It prepares the necessary headers, sends the GET request, and processes the response.
-   If the job has succeeded, it updates a global variable `avatar_url` with the download URL of the synthesized avatar and logs appropriate messages to indicate the status of the job.
-   The function handles both successful and failed requests by logging relevant messages.

In [21]:
def get_synthesis(job_id):
    global avatar_url
    url = f"https://{azure_speech_region}.{service_host}/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar/{job_id}"

    header = {"Ocp-Apim-Subscription-Key": azure_speech_key}

    response = requests.get(url, headers=header)

    if response.status_code < 400:
        logger.debug("Get batch synthesis job successfully")
        logger.debug(response.json())

        status = response.json()["status"]

        if status == "Succeeded":
            avatar_url = response.json()["outputs"]["result"]
            logger.info(f"Batch synthesis job succeeded, download URL: {avatar_url}")

        return status
    else:
        logger.error(f"Failed to get batch synthesis job: {response.text}")
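
Storing the download URL in a global variable works in a notebook, but the same information can also be returned to the caller. The sketch below shows that alternative on an already parsed job dictionary. The helper `extract_result` and the sample values are hypothetical; only the `status` and `outputs.result` fields come from the code above.

```python
def extract_result(job):
    """Return (status, download_url); the URL is None unless the job succeeded."""
    status = job.get("status")
    url = None
    if status == "Succeeded":
        url = job.get("outputs", {}).get("result")
    return status, url


# Hypothetical responses, shaped like the fields read by get_synthesis
running_job = {"id": "example-job", "status": "Running"}
finished_job = {
    "id": "example-job",
    "status": "Succeeded",
    "outputs": {"result": "https://example.invalid/avatar.webm"},
}

print(extract_result(running_job))
print(extract_result(finished_job))
```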


1.  **Function Definition**: The function `list_synthesis_jobs` is defined with two optional parameters, `skip` and `top`, which default to 0 and 100, respectively. These parameters are used for pagination in the API request.
    
2.  **Construct the URL**: The URL for the API endpoint is constructed using formatted string literals (f-strings). The URL combines the `azure_speech_region` and `service_host` and includes query parameters `skip` and `top` to specify the number of results to skip and the number of results to retrieve, respectively.
    
3.  **Set up the Headers**: A dictionary named `header` is created to set up the headers for the HTTP GET request. It includes the Azure subscription key (`Ocp-Apim-Subscription-Key`).
    
4.  **Send the GET Request**: The `requests.get` method is used to send an HTTP GET request to the constructed URL with the headers.
    
5.  **Handle the Response**: The response status code is checked:
    
    -   If the status code is less than 400 (indicating success), an info message is logged indicating the successful retrieval of the batch synthesis jobs. The number of jobs retrieved (`len(response.json()["values"])`) is also logged, along with the entire JSON response for additional information.
6.  **Error Handling**:
    
    -   If the status code is 400 or higher (indicating failure), an error message with the response text is logged.

### Key Points:

-   The `list_synthesis_jobs` function lists the batch synthesis jobs in the subscription one page at a time: `skip` sets how many jobs to skip and `top` caps how many are returned.
-   It dynamically constructs the API endpoint URL based on the provided `skip` and `top` values.
-   It prepares the necessary headers, sends the GET request, and processes the response.
-   If the request is successful, it logs the number of jobs retrieved and the complete JSON response.
-   The function handles both successful and failed requests by logging relevant messages.

In [22]:
def list_synthesis_jobs(skip: int = 0, top: int = 100):
    """List all batch synthesis jobs in the subscription"""

    url = f"https://{azure_speech_region}.{service_host}/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar?skip={skip}&top={top}"

    header = {"Ocp-Apim-Subscription-Key": azure_speech_key}

    response = requests.get(url, headers=header)

    if response.status_code < 400:
        logger.info(
            f'List batch synthesis jobs successfully, got {len(response.json()["values"])} jobs'
        )
        logger.info(response.json())
    else:
        logger.error(f"Failed to list batch synthesis jobs: {response.text}")
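
Because `skip` and `top` describe a single page, listing every job means requesting pages until one comes back short. The following sketch illustrates that loop with an injected `fetch_page` callable, which is hypothetical: in practice it would wrap the GET request above and return `response.json()["values"]`. Treating a short page as the end of the list is an assumption about the service.

```python
def iter_all_jobs(fetch_page, page_size=100):
    """Yield jobs page by page until a page holds fewer than page_size items."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        yield from page
        if len(page) < page_size:
            break
        skip += len(page)


# Stand-in for the service: five fake jobs served through slicing
fake_jobs = [{"id": f"job-{i}"} for i in range(5)]


def fake_fetch(skip, top):
    return fake_jobs[skip:skip + top]


all_jobs = list(iter_all_jobs(fake_fetch, page_size=2))
print(len(all_jobs))
```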

## Test


The function `submit_synthesis(prompt)` is designed to submit a text-to-speech synthesis job using Azure Speech Services. Here is a step-by-step explanation of what each part of the code does:

1.  **Constructing the URL**: The URL is dynamically generated using the region specified (`azure_speech_region`) and the service host. This URL points to the Azure endpoint for batch synthesis of talking avatars.
    
2.  **Setting up Headers**: The headers include the subscription key (`azure_speech_key`) and specify that the content type is JSON. These headers are required for authentication and to inform the server about the type of data being sent.
    
3.  **Creating the Payload**: The payload is a JSON object containing various settings for the synthesis job:
    
    -   **displayName** and **description**: These provide a name and description for the job.
    -   **textType**: Specifies that the input text is plain text.
    -   **synthesisConfig**: Contains configuration for the voice used in synthesis.
    -   **customVoices**: Placeholder for custom voice settings.
    -   **inputs**: An array with the text to be synthesized.
    -   **properties**: Various properties for the avatar, such as style, video format, codec, and subtitle type.
4.  **Submitting the Request**: The function sends a POST request to the Azure endpoint with the payload and headers.
    
5.  **Handling the Response**: The function checks the response status code. If the request is successful (status code < 400), it logs a success message along with the job ID. If the request fails, it logs an error message with the response text.


The function `get_synthesis(job_id)` is designed to retrieve the status and result of a submitted synthesis job. Here’s the step-by-step explanation:

1.  **Constructing the URL**: The URL is built using the job ID to query the specific synthesis job status.
    
2.  **Setting up Headers**: The headers only include the subscription key for authentication.
    
3.  **Sending the Request**: The function sends a GET request to the Azure endpoint.
    
4.  **Handling the Response**: The function checks the response status code. If the request is successful, it logs the status of the job. If the status is "Succeeded", it retrieves and logs the URL to download the synthesized avatar. If the request fails, it logs an error message with the response text.


The function `list_synthesis_jobs(skip, top)` is designed to list all batch synthesis jobs in the subscription. Here’s the step-by-step explanation:

1.  **Constructing the URL**: The URL includes query parameters for skipping a certain number of jobs and limiting the number of jobs returned (`top`).
    
2.  **Setting up Headers**: The headers include the subscription key for authentication.
    
3.  **Sending the Request**: The function sends a GET request to the Azure endpoint to list the synthesis jobs.
    
4.  **Handling the Response**: The function checks the response status code. If the request is successful, it logs the number of jobs retrieved and the response content. If the request fails, it logs an error message with the response text.


### Prompt for Avatar Synthesis

The prompt text provides a script for the avatar to speak, explaining the Azure OpenAI service. This service provides REST API access to powerful language models like GPT-4 and GPT-3.5-Turbo, which can be used for tasks such as content generation, summarization, semantic search, and natural language to code translation. Microsoft emphasizes responsible AI use, investing in safeguards to prevent abuse and harmful content generation.

In [23]:
prompt = f"""
I am Lisa, your avatar powered by Azure Speech Services.
Today is {dt}.

Let me explain you what is Azure Open AI service.

Azure OpenAI Service provides REST API access to OpenAI's powerful language models including the GPT-4, GPT-3.5-Turbo, and Embeddings model series. In addition, the new GPT-4 and GPT-3.5-Turbo model series have now reached general availability. These models can be easily adapted to your specific task including but not limited to content generation, summarization, semantic search, and natural language to code translation. Users can access the service through REST APIs, Python SDK, or our web-based interface in the Azure OpenAI Studio.

At Microsoft, we're committed to the advancement of AI driven by principles that put people first. Generative models such as the ones available in Azure OpenAI have significant potential benefits, but without careful design and thoughtful mitigations, such models have the potential to generate incorrect or even harmful content. Microsoft has made significant investments to help guard against abuse and unintended harm, which includes requiring applicants to show well-defined use cases, incorporating Microsoft’s principles for responsible AI use, building content filters to support customers, and providing responsible AI implementation guidance to onboarded customers.

To learn more go to https://azure.microsoft.com/en-us/products/ai-services/ai-speech

Thank you and have a good day.
"""

In [12]:
print(prompt)


I am Lisa, your avatar powered by Azure Speech Services.
Today is 15-May-2024 10:32:50.

Let me explain you what is Azure Open AI service.

Azure OpenAI Service provides REST API access to OpenAI's powerful language models including the GPT-4, GPT-3.5-Turbo, and Embeddings model series. In addition, the new GPT-4 and GPT-3.5-Turbo model series have now reached general availability. These models can be easily adapted to your specific task including but not limited to content generation, summarization, semantic search, and natural language to code translation. Users can access the service through REST APIs, Python SDK, or our web-based interface in the Azure OpenAI Studio.

At Microsoft, we're committed to the advancement of AI driven by principles that put people first. Generative models such as the ones available in Azure OpenAI have significant potential benefits, but without careful design and thoughtful mitigations, such models have the potential to generate incorrect or even harmf

## Avatar batch generation


1.  **Start Timer**: The code begins by recording the current time to later calculate how long the process takes.
    
2.  **Submit Synthesis Job**: It calls the `submit_synthesis` function with a prompt containing detailed information about Azure OpenAI services. The function returns a job ID, which is stored for tracking the job status.
    
3.  **Check Job ID**: The script verifies if the job ID is not `None`. If the job submission fails and returns `None`, the process stops here.
    
4.  **Polling Loop**: The code enters an infinite loop to repeatedly check the status of the synthesis job.
    
5.  **Get Job Status**: It calls the `get_synthesis` function with the job ID to retrieve the current status of the job.
    
6.  **Check Status**:
    
    -   **Succeeded**: If the job status is "Succeeded", the code logs a success message, calculates the total elapsed time, and prints it. The loop then breaks to stop further checks.
    -   **Failed**: If the job status is "Failed", an error message is logged, and the loop breaks.
    -   **In Progress**: If the job is still in progress, it logs the current status and waits for 30 seconds before checking again. This prevents excessive requests to the server.

In [24]:
start = time.time()

job_id = submit_synthesis(prompt)

if job_id is not None:
    while True:
        status = get_synthesis(job_id)
        if status == "Succeeded":
            logger.info("Done! Azure batch avatar synthesis job succeeded.")
            elapsed = time.time() - start
            print("Elapsed time: " + time.strftime("%H:%M:%S.{}".format(str(elapsed % 1)[2:])[:15],
                                                   time.gmtime(elapsed)))

            break
        elif status == "Failed":
            logger.error("Failed")
            break
        else:
            logger.info(f"Please wait. Status: [{status}]")
            time.sleep(30)

[05/15/2024 10:36:31 AM India Standard Time] Batch avatar synthesis job submitted successfully
[05/15/2024 10:36:31 AM India Standard Time] Job ID: df6346b1-7533-49f6-a06e-0c631721c11e
[05/15/2024 10:36:33 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:37:05 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:37:36 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:38:08 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:38:39 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:39:11 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:39:42 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:40:14 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:40:46 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:41:17 AM India Standard Time] Please wait. Status: [Running]
[05/15/2024 10:41:49 AM India Standard Time] Please wai
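
The loop above keeps polling for as long as the job reports neither "Succeeded" nor "Failed", which includes the case where `get_synthesis` returns `None` after a failed request. A minimal sketch of a bounded variant follows. The helper `wait_for_job`, its default timeout, and the injectable `sleep` and `clock` parameters are assumptions added for illustration and testing.

```python
import time


def wait_for_job(get_status, job_id, interval=30, timeout=1800,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(job_id) until a terminal status or until timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        status = get_status(job_id)
        if status in ("Succeeded", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still reports {status!r} after {timeout} s")
        sleep(interval)


# Demonstration with a fake status source and no real waiting
fake_statuses = iter(["Running", "Running", "Succeeded"])
final_status = wait_for_job(lambda job_id: next(fake_statuses), "example-job",
                            sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `get_synthesis` could be passed as `get_status`, with the default `sleep` and `clock` left in place.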

## Avatar video file

In [25]:
print(f"\033[1;31;34mThis is the prompt to speak:\n {prompt}")

This is the prompt to speak:
 
I am Lisa, your avatar powered by Azure Speech Services.
Today is 15-May-2024 10:36:08.

Let me explain you what is Azure Open AI service.

Azure OpenAI Service provides REST API access to OpenAI's powerful language models including the GPT-4, GPT-3.5-Turbo, and Embeddings model series. In addition, the new GPT-4 and GPT-3.5-Turbo model series have now reached general availability. These models can be easily adapted to your specific task including but not limited to content generation, summarization, semantic search, and natural language to code translation. Users can access the service through REST APIs, Python SDK, or our web-based interface in the Azure OpenAI Studio.

At Microsoft, we're committed to the advancement of AI driven by principles that put people first. Generative models such as the ones available in Azure OpenAI have significant potential benefits, but without careful design and thoughtful mitigations, such models have the poten

In [26]:
# Save the avatar video: moviepy reads the result URL and re-encodes it to an MP4 file

avatar_file = (
    "azure_avatar_" + datetime.datetime.today().strftime("%d%b%Y_%H%M%S") + ".mp4"
)
VideoFileClip(avatar_url).write_videofile(avatar_file, verbose=False, logger=None)




In [27]:
# Playing the avatar video

Video.from_file(avatar_file)

Video(value=b'\x00\x00\x00 ftypisom\x00\x00\x02\x00isomiso2avc1mp41\x00\x00\x00\x08free...')