# Text to Speech using a Raspberry Pi

This lab shows how to use the [Azure Cognitive Services speech service](https://azure.microsoft.com/services/cognitive-services/speech-services/?WT.mc_id=academic-7372-jabenn) on a Raspberry Pi. You will need a Cognitive Services speech resource to use this lab, and you can find all the instructions to get set up in the [README file](https://github.com/microsoft/iot-curriculum/tree/main/labs/ai-edge/speech).

This lab takes some text and sends it to the Speech service, which converts it to speech and returns the audio. The audio is then saved to a WAV file and played.

There is currently no SDK support for this speech service on ARM32 Linux, so this lab uses the REST APIs.

To use this Notebook, read each documentation cell, then select Run to run each code cell. The output of the code cells will be shown below. You can read more on running Jupyter Notebooks in the [Jupyter Notebooks documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#notebook-user-interface).

First, configure the options for the Speech service:
* Set the `KEY` variable to the key of your speech resource.
* Set the `ENDPOINT` variable to the endpoint of your speech resource.
* Set the `LANGUAGE` variable to the language of the text to convert. You can find details on the supported languages in the [Language and voice support for the Speech service documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support?WT.mc_id=academic-7372-jabenn).
* Set the `TEXT` variable to the text to convert to speech.

In [None]:
KEY = "YOUR_SPEECH_KEY"
ENDPOINT = "YOUR_SPEECH_ENDPOINT"
LANGUAGE = "en-US"
TEXT = "At Microsoft, our mission is to empower every person and every organization on the planet to achieve more."
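As an optional alternative to typing the key into the notebook, the settings can be read from environment variables. The following is a minimal sketch of that idea; the variable names `SPEECH_KEY` and `SPEECH_ENDPOINT` and the helper `load_setting` are assumptions made for this example and are not part of the lab or the Speech service.

```python
import os


def load_setting(name, default):
    """Return the environment variable `name` if it is set and non-empty, else `default`."""
    value = os.environ.get(name, "").strip()
    return value if value else default


# SPEECH_KEY and SPEECH_ENDPOINT are hypothetical names chosen for this sketch
KEY = load_setting("SPEECH_KEY", "YOUR_SPEECH_KEY")
ENDPOINT = load_setting("SPEECH_ENDPOINT", "YOUR_SPEECH_ENDPOINT")
```

If neither variable is set, the placeholders are kept, so the cell behaves the same as the one above.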

Import the Python packages that the rest of the code uses.

In [None]:
import requests
import json
import os
from io import BytesIO

The endpoint that comes from the Cognitive Services resource issues access tokens. You request a token first, then use it to authorize the later API calls.

The REST API is documented in the [Text-to-speech REST API documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/rest-text-to-speech?WT.mc_id=academic-7372-jabenn#authentication).

The header passes the following:

* The API Key of the speech resource

The return value is an access token that lasts for 10 minutes and is used when calling the rest of the API.

In [None]:
# Create the request headers with the API key
headers = {
    "Ocp-Apim-Subscription-Key": KEY
}

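# Make a POST request to the endpoint to get the token
response = requests.post(ENDPOINT, headers=headers)

# Stop here if the request failed, otherwise the error body would be used as the token
response.raise_for_status()
access_token = response.text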
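Because the token only lasts for 10 minutes, a notebook that is left open may need a fresh one before later cells are run. The sketch below shows one way to cache a token and fetch a new one shortly before it expires. `TokenCache`, its parameters, and the 9 minute refresh interval are assumptions made for this example, not part of the Speech service API.

```python
import time


class TokenCache:
    """Cache a token and fetch a new one once it is older than `lifetime_seconds`."""

    def __init__(self, fetch_token, lifetime_seconds=9 * 60, clock=time.monotonic):
        self._fetch_token = fetch_token    # function that returns a fresh token string
        self._lifetime = lifetime_seconds  # refresh a little before the 10 minute expiry
        self._clock = clock                # injectable so the cache can be tested
        self._token = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._token is None or now - self._fetched_at >= self._lifetime:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

In this notebook, `fetch_token` could be a small function that repeats the POST request above with the `Ocp-Apim-Subscription-Key` header and returns `response.text`; later cells would then call `get()` on the cache instead of reusing `access_token` directly.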

All future API calls need to be made to the same location as the token issuer, so extract the location from the endpoint now.

In [None]:
# Get the location from the endpoint by removing the URL scheme (https://) and taking the section before the first .
location = ENDPOINT.split("//")[-1].split(".")[0]
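The string splitting above works for a well-formed endpoint, but it silently returns the whole string if the placeholder value has not been replaced. A slightly more defensive sketch uses `urllib.parse` from the standard library. The function name `location_from_endpoint` is hypothetical, and the example URL is only an illustration of the expected shape of a token endpoint; use the value from your own resource.

```python
from urllib.parse import urlparse


def location_from_endpoint(endpoint):
    """Return the first label of the endpoint's host name, for example 'westus'."""
    hostname = urlparse(endpoint).hostname
    if not hostname:
        # Happens when the value has no scheme and host, such as an unreplaced placeholder
        raise ValueError("Endpoint is not a full URL: " + repr(endpoint))
    return hostname.split(".")[0]


# Illustrative endpoint only
example_endpoint = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issuetoken"
print(location_from_endpoint(example_endpoint))
```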

Next, get the list of voices that are supported by the text to speech service. The list is then filtered by the specified language, and the first matching voice is selected.

The REST API is documented in the [Text-to-speech REST API documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/rest-text-to-speech?WT.mc_id=academic-7372-jabenn#get-a-list-of-voices).

The header passes the following:

* The access token that was retrieved in the last step

The return value is a JSON document listing all the supported voices.


In [None]:
# Build the URL from the location
url = "https://" + location + ".tts.speech.microsoft.com/cognitiveservices/voices/list"

# Set the headers to include the access token as a bearer token
headers = {
    "Authorization": "Bearer " + access_token
}

# Make the GET request; no request body is needed
response = requests.get(url, headers=headers)

# Parse the JSON response into a list of voices
voices_json = json.loads(response.text)

# Pick the first voice that matches the language
voice = next(x for x in voices_json if x["Locale"].lower() == LANGUAGE.lower())
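Note that `next` without a default raises `StopIteration` when no voice matches the language. The sketch below shows a variant that returns `None` instead, so the caller can report a clearer error. `pick_voice` is a hypothetical helper, and the sample list is made-up data that only mirrors the `Locale`, `Gender`, and `ShortName` fields used in this lab; the voice names are not real.

```python
def pick_voice(voices, language):
    """Return the first voice whose Locale matches `language` (case-insensitive), or None."""
    wanted = language.lower()
    return next((v for v in voices if v.get("Locale", "").lower() == wanted), None)


# Made-up sample data with the same fields this lab reads from the voice list
sample_voices = [
    {"Locale": "fr-FR", "Gender": "Female", "ShortName": "sample-voice-a"},
    {"Locale": "en-US", "Gender": "Male", "ShortName": "sample-voice-b"},
    {"Locale": "en-US", "Gender": "Female", "ShortName": "sample-voice-c"},
]

example_voice = pick_voice(sample_voices, "en-us")
if example_voice is None:
    print("No voice found for that language")
else:
    print(example_voice["ShortName"])
```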

The next step is to make the REST API call, sending the text to the speech service. The URL is built from the location that was extracted from the token endpoint, so that it points to the speech service itself.

The REST API is documented in the [Text-to-speech REST API documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/rest-text-to-speech?WT.mc_id=academic-7372-jabenn#convert-text-to-speech).

The header passes the following:

* The bearer token that was retrieved earlier
* The content type as SSML
* A requested output format of a 16 kHz, 16-bit mono WAV file

The body of the request is an SSML document detailing the text to convert. SSML (Speech Synthesis Markup Language) is an XML-based markup language that lets developers specify how input text is converted into synthesized speech by the text-to-speech service. You can read more on SSML in the [Improve synthesis with Speech Synthesis Markup Language (SSML) documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-synthesis-markup?WT.mc_id=academic-7372-jabenn). The SSML needs to include details of the voice used to generate the speech, so these are taken from the voice found earlier.

In [None]:
# Build the URL from the location
url = "https://" + location + ".tts.speech.microsoft.com/cognitiveservices/v1"

# Set the headers to include the access token, the content type and the output format
headers = {
    "Authorization": "Bearer " + access_token,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm"
}

# Build the SSML
ssml =  "<speak version='1.0' xml:lang='" + LANGUAGE + "'>"
ssml += "  <voice  xml:lang='" + LANGUAGE + "' xml:gender='" + voice["Gender"] + "' name='" + voice["ShortName"] + "'>"
ssml += TEXT
ssml += "  </voice>"
ssml += "</speak>"

# Make the request passing the SSML as the body, encoded as UTF-8 to handle all languages
response = requests.post(url, headers=headers, data=ssml.encode("utf-8"))
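Because SSML is XML, characters such as `&` or `<` in `TEXT` would produce an invalid document if they were inserted unchanged. The sketch below builds the same SSML structure but escapes the text and attribute values with `xml.sax.saxutils.escape` from the standard library. `build_ssml` is a hypothetical helper written for this example, not part of the Speech service.

```python
from xml.sax.saxutils import escape


def build_ssml(text, language, voice):
    """Build an SSML document for `text`, escaping characters that are special in XML."""
    # Attribute values are wrapped in single quotes, so quotes must be escaped as well
    quote_entities = {"'": "&apos;", '"': "&quot;"}
    lang = escape(language, quote_entities)
    gender = escape(voice["Gender"], quote_entities)
    name = escape(voice["ShortName"], quote_entities)
    parts = [
        "<speak version='1.0' xml:lang='" + lang + "'>",
        "<voice xml:lang='" + lang + "' xml:gender='" + gender + "' name='" + name + "'>",
        escape(text),  # replaces &, < and > with XML entities
        "</voice>",
        "</speak>",
    ]
    return "".join(parts)
```

In this notebook it could be called as `build_ssml(TEXT, LANGUAGE, voice)`, with the result passed to `requests.post` in place of the hand-built string.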

The `response` contains the result of the text to speech call as binary audio data. This binary data can be saved to a WAV file.

In [None]:
# Get the binary data from the response
response_audio = BytesIO(response.content)

filename = "text_to_speech_output.wav"

with open(filename, "wb") as f:
    f.write(response_audio.getbuffer())
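If the request failed, the response body is an error message and not audio, and the saved file will not play. One quick sanity check is to read the audio header with the standard library `wave` module. This is a sketch that assumes the service returned a complete RIFF WAV file; `describe_wav` is a hypothetical helper, and `wave.Error` is raised if the bytes are not a WAV file.

```python
import io
import wave


def describe_wav(audio_bytes):
    """Return the basic format details of WAV data held in memory."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate,
        }
```

Calling `describe_wav(response.content)` on the requested output format would be expected to report 1 channel, a sample width of 2 bytes, and a sample rate of 16000 Hz.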

Play the speech audio file by using the `aplay` command line utility.

In [None]:
os.system("aplay " + filename)
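`os.system` passes the command through the shell, so a file name containing spaces or shell characters would need quoting. A minimal alternative sketch uses `subprocess.run` with an argument list, which avoids the shell and raises an error if `aplay` exits with a failure code. `play_wav` and its `runner` parameter are hypothetical names used for this example.

```python
import subprocess


def play_wav(filename, runner=subprocess.run):
    """Play a WAV file with aplay, raising CalledProcessError if aplay fails."""
    # Passing a list avoids the shell, so the file name needs no quoting
    command = ["aplay", filename]
    return runner(command, check=True)
```

In this notebook it could be called as `play_wav(filename)`. The `runner` parameter exists only so the function can be exercised without audio hardware.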