# Speech translation using a Raspberry Pi

This lab shows how to use the [Azure Cognitive Services speech service](https://azure.microsoft.com/services/cognitive-services/speech-services/?WT.mc_id=academic-7372-jabenn) and [Translation service](https://azure.microsoft.com/services/cognitive-services/translator/?WT.mc_id=academic-7372-jabenn) on a Raspberry Pi. You will need a Cognitive Services speech resource, as well as a Cognitive Services Translation resource to use this lab, and you can find all the instructions to get set up in the [README file](https://github.com/microsoft/iot-curriculum/tree/main/labs/ai-edge/speech).

This lab records 10 seconds of speech, then sends it to the Speech service to convert it to text. The text is then translated into another language using the Translation service, and the translation is converted back to speech using the Speech service.

There is currently no SDK support for this speech service on ARM32 Linux, so this lab uses the REST APIs.

To use this Notebook, read each documentation cell, then select Run to run each code cell. The output of the code cells will be shown below. You can read more on running Jupyter Notebooks in the [Jupyter Notebooks documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#notebook-user-interface).

First, the options for the Speech and Translator cognitive services need to be configured.
* Set the `SPEECH_KEY` variable to be the key of your speech resource.
* Set the `SPEECH_ENDPOINT` variable to be the endpoint of your speech resource.
* Set the `TRANSLATOR_KEY` variable to be the key of your translator resource.
* Set the `INPUT_LANGUAGE` variable to the language you will be speaking in. You can find details on the supported languages in the [Language and voice support for the Speech service documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support?WT.mc_id=academic-7372-jabenn).
* Set the `OUTPUT_LANGUAGE` variable to the language you want your speech translated to.

In [None]:
SPEECH_KEY = "YOUR_SPEECH_KEY"
SPEECH_ENDPOINT = "YOUR_SPEECH_ENDPOINT"
TRANSLATOR_KEY = "YOUR_TRANSLATOR_KEY"
INPUT_LANGUAGE = "en-US"
OUTPUT_LANGUAGE = "zh-CN"

Import the Python packages that handle the microphone, audio files and REST requests, to make them available to the Python code.

In [None]:
import sounddevice as sd
import requests
import json
import os
from scipy.io.wavfile import write
from io import BytesIO

Before audio can be captured, some configuration needs to be set up. The sample rate needs to be set to 16 kHz, and the sample length needs to be set to 10 seconds.

> If you want to record for longer, change the value of `sample_len` to the time in seconds that you want to record for.

In [None]:
# The Speech to Text Cognitive Service API currently only supports a 16000 Hz sample rate
sample_rate = 16000

# Length of the audio sample in seconds
sample_len = 10
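As a quick worked example, the number of frames requested from the recorder follows directly from these two values. The sketch below also estimates the size of the recording buffer, assuming 4 bytes per sample (32-bit floats) and a single channel; that sample size is an assumption and depends on how the recorder is configured.

```python
# Values taken from the configuration cell above
sample_rate = 16000  # samples per second
sample_len = 10      # seconds

# Total number of frames in the recording
frame_count = int(sample_len * sample_rate)

# Assumption: each sample is a 32-bit float (4 bytes) on one channel
bytes_per_sample = 4
buffer_bytes = frame_count * bytes_per_sample

print("Frames to record:", frame_count)
print("Approximate buffer size in bytes:", buffer_bytes)
```

Doubling `sample_len` doubles both numbers, which is worth keeping in mind on a device with limited memory.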

Now capture the audio. Once you start running this cell, speak into the microphone for 10 seconds.

In [None]:
# Record the speech sample
speech_sample = sd.rec(int(sample_len * sample_rate), samplerate=sample_rate, channels=1)

print("Start speaking now!")

# Wait for the recording to stop after the specified number of seconds
sd.wait()

# Let the user know the recording is done
print("Recorded!")

The speech sample now needs to be saved to disk.

In [None]:
# Name of audio file to save sample
filename = "speech_to_text_rec.wav"

# Save speech sample as a .wav file
write(filename, sample_rate, speech_sample)
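If you want to confirm what was written, the WAV header can be inspected directly. The following is a minimal sketch using only the standard library; `read_wav_format` is a hypothetical helper, not part of this lab, and the demo builds a small in-memory WAV file rather than reading the recording.

```python
import io
import struct
import wave


def read_wav_format(data: bytes) -> dict:
    """Return the fields of the 'fmt ' chunk from RIFF/WAVE bytes."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("Not a RIFF/WAVE file")

    # Walk the chunks that follow the 12-byte RIFF header
    offset = 12
    while offset + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, offset)
        if chunk_id == b"fmt ":
            fmt_tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, offset + 8
            )
            return {
                "format_tag": fmt_tag,  # 1 = integer PCM, 3 = IEEE float
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        # Chunks are padded to an even number of bytes
        offset += 8 + chunk_size + (chunk_size % 2)

    raise ValueError("No fmt chunk found")


# Demo: build one second of silent 16 kHz mono PCM audio in memory
demo_buffer = io.BytesIO()
with wave.open(demo_buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

wav_info = read_wav_format(demo_buffer.getvalue())
print(wav_info)
```

To check the recording itself, read the saved file's bytes and pass them to the same function.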

To verify everything was recorded correctly, play back the audio using the `aplay` command line utility.

In [None]:
os.system("aplay " + filename)

The endpoint of your Speech resource issues access tokens. You need one of these tokens before you can make the speech API calls.

The REST API is documented in the [Speech-to-text REST API documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text?WT.mc_id=academic-7372-jabenn#authentication).

The header passes the following:

* The API Key of the speech resource

The return value is an access token that lasts for 10 minutes and is used when calling the rest of the API.

In [None]:
# Create the request headers with the API key
headers = {
    "Ocp-Apim-Subscription-Key": SPEECH_KEY
}

# Make a POST request to the endpoint to get the token
response = requests.post(SPEECH_ENDPOINT, headers=headers)
speech_access_token = str(response.text)
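Because the token expires after 10 minutes, a longer-running session may need to request a new one. The sketch below shows one possible way to cache a token and refresh it shortly before expiry. `TokenCache`, the 9 minute refresh margin and the fake fetch function are all assumptions for illustration; in this notebook the fetch function would wrap the POST request shown above.

```python
import time


class TokenCache:
    """Cache a token and fetch a new one once it is considered stale."""

    def __init__(self, fetch_token, lifetime_seconds=9 * 60, clock=time.monotonic):
        self._fetch_token = fetch_token    # callable that returns a new token string
        self._lifetime = lifetime_seconds  # refresh a little before the 10 minute limit
        self._clock = clock                # injectable so the logic can be tested
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime
        return self._token


# Demo with a fake clock and a fake fetch function (no network calls)
fake_now = [0.0]
fetch_count = [0]


def fake_fetch():
    fetch_count[0] += 1
    return "token-" + str(fetch_count[0])


cache = TokenCache(fake_fetch, lifetime_seconds=540, clock=lambda: fake_now[0])

first = cache.get()    # fetches a token
fake_now[0] = 100.0
second = cache.get()   # still fresh, so the cached token is reused
fake_now[0] = 600.0
third = cache.get()    # past the lifetime, so a new token is fetched

print(first, second, third)
```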

The Translator service also needs an access token. Translator resources can be global, in which case they don't have a service-specific or region-specific endpoint.

The REST API is documented in the [Translator V3.0 REST API documentation](https://docs.microsoft.com/azure/cognitive-services/translator/reference/v3-0-reference?WT.mc_id=academic-7372-jabenn#authenticating-with-an-access-token).

The header passes the following:

* The API Key of the translator resource

The return value is an access token that lasts for 10 minutes and is used when calling the rest of the API.

In [None]:
# Create the request headers with the API key
headers = {
    "Ocp-Apim-Subscription-Key": TRANSLATOR_KEY
}

# Make a POST request to the endpoint to get the token
response = requests.post("https://api.cognitive.microsoft.com/sts/v1.0/issueToken", headers=headers)
translator_access_token = str(response.text)

All future API calls to the speech service need to go to the same location as the token issuer, so extract the location from the endpoint now.

In [None]:
# Get the location from the endpoint by removing the http protocol and getting the section before the first .
speech_location = SPEECH_ENDPOINT.split("//")[-1].split(".")[0]
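The same extraction can be sketched with `urllib.parse`, which avoids splitting the URL by hand. `location_from_endpoint` is a hypothetical helper, and the example endpoint below is an assumed value in the usual shape of a regional token endpoint, so substitute your own.

```python
from urllib.parse import urlparse


def location_from_endpoint(endpoint: str) -> str:
    """Return the first label of the endpoint's host name."""
    host = urlparse(endpoint).hostname
    if not host:
        raise ValueError("Endpoint must be a full URL, including https://")
    return host.split(".")[0]


# Assumed example endpoint, for illustration only
example_endpoint = "https://westus2.api.cognitive.microsoft.com/sts/v1.0/issuetoken"
example_location = location_from_endpoint(example_endpoint)

print(example_location)
```

Raising an error for a URL without a protocol makes a mistyped `SPEECH_ENDPOINT` easier to spot than a silently wrong location.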

The next step is to make the REST API call, uploading the file with the speech data to a URL. The URL is built from the location that was extracted from the API endpoint, giving a new URL that points to the speech service itself.

The REST API is documented in the [Speech-to-text REST API documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/rest-speech-to-text?WT.mc_id=academic-7372-jabenn#sample-request).

The header passes the following:

* The bearer token that was retrieved earlier
* The content type as a WAV file with a sample rate of 16 kHz

The body of the request is the audio file that was just written.

The return value is a JSON document with details on the detected speech, including the text from the speech.

In [None]:
# Build the URL from the location
url = "https://" + speech_location + ".stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

# Set the headers to include the access token and the audio format
headers = {
    "Authorization": "Bearer " + speech_access_token,
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "Accept": "application/json;text/xml"
}

# Configure the language parameter for the call
params = {
    "language": INPUT_LANGUAGE
}

# Make the request passing the file as the body, closing the file once the upload is done
with open(filename, "rb") as audio_file:
    response = requests.post(url, headers=headers, params=params, data=audio_file)

The `response` contains the result of the speech to text call as JSON. If the call was successful, it will return an object with a `RecognitionStatus` of `Success`, and a `DisplayText` with the speech converted to text.

In [None]:
# Convert the response to JSON
response_json = json.loads(response.text)

# Get the text from the speech
output_text = response_json['DisplayText']

print("Output text:", output_text)
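The code above assumes the call succeeded. A slightly more defensive version checks `RecognitionStatus` before reading `DisplayText`. This is a sketch only: `extract_display_text` is a hypothetical helper, and the two sample response bodies are illustrative rather than captured from the service.

```python
import json


def extract_display_text(response_body: str) -> str:
    """Return DisplayText, or raise if recognition did not succeed."""
    result = json.loads(response_body)
    status = result.get("RecognitionStatus")
    if status != "Success":
        raise ValueError("Speech recognition failed with status: " + str(status))
    return result["DisplayText"]


# Illustrative response bodies (assumed shapes)
success_body = '{"RecognitionStatus": "Success", "DisplayText": "Hello world."}'
failure_body = '{"RecognitionStatus": "NoMatch"}'

recognized_text = extract_display_text(success_body)
print("Output text:", recognized_text)

try:
    extract_display_text(failure_body)
    failure_message = None
except ValueError as error:
    failure_message = str(error)

print(failure_message)
```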

The next step is to translate the text into the output language using another REST API call.

The REST API is documented in the [Translator Translate REST API documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-translate?WT.mc_id=academic-7372-jabenn).

The header passes the following:

* The bearer token that was retrieved earlier
* The content type as JSON
* The content length, set to the length of the body

The body of the request is a JSON document listing the text to translate.

The return value is a JSON document with details on the translated speech.

In [None]:
# Get the Translator URL
url = "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0"

# Build JSON containing the translation request
translation_request_json = [
    { "Text" : output_text }
]

translation_request_body = json.dumps(translation_request_json)

# Set the headers to include the access token, content type and content length
headers = {
    "Authorization": "Bearer " + translator_access_token,
    "Content-Type": "application/json; charset=UTF-8",
    "Content-Length": str(len(translation_request_body))
}

# Configure the language parameter for the call
params = {
    "from": INPUT_LANGUAGE,
    "to": OUTPUT_LANGUAGE
}

# Make the request passing the translation request
response = requests.post(url, headers=headers, params=params, data=translation_request_body)
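A note on the content length: the header counts bytes, not characters. With the default settings of `json.dumps`, non-ASCII characters are escaped, so the two counts match. If the body were built with `ensure_ascii=False`, the length would need to be taken from the encoded bytes. The sketch below illustrates the difference; the sample text is an arbitrary example.

```python
import json

# Arbitrary sample request containing non-ASCII text
request_json = [{"Text": "你好"}]

# Default: non-ASCII characters are written as \uXXXX escapes
ascii_body = json.dumps(request_json)

# Alternative: keep the characters as-is in the JSON string
unicode_body = json.dumps(request_json, ensure_ascii=False)

ascii_chars = len(ascii_body)
ascii_bytes = len(ascii_body.encode("utf-8"))
unicode_chars = len(unicode_body)
unicode_bytes = len(unicode_body.encode("utf-8"))

print("Escaped body:", ascii_chars, "characters,", ascii_bytes, "bytes")
print("Unescaped body:", unicode_chars, "characters,", unicode_bytes, "bytes")
```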

Once the translation response has been received, the translated text can be extracted from it.

In [None]:
# Convert the response body to JSON
response_json = json.loads(response.text)

# The response contains an array of translations. We're only translating to one language, so will only have one item in this array
translation = response_json[0]["translations"][0]

# Get the text of the translation
translated_text = translation["text"]

print("Translated text:", translated_text)
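The extraction can also be written as a small function that reports an error response instead of failing on a missing key. This is a sketch under assumptions: `extract_translations` is a hypothetical helper, and the sample bodies are illustrative shapes, not output captured from the service.

```python
import json


def extract_translations(response_body: str) -> dict:
    """Map each target language code to its translated text."""
    payload = json.loads(response_body)

    # Assumed error shape: a JSON object with an "error" member
    if isinstance(payload, dict) and "error" in payload:
        raise ValueError("Translation failed: " + str(payload["error"]))

    # One item per input text; this lab sends a single text
    return {item["to"]: item["text"] for item in payload[0]["translations"]}


# Illustrative response bodies (assumed shapes)
ok_body = '[{"translations": [{"text": "Bonjour", "to": "fr"}]}]'
error_body = '{"error": {"code": 401000, "message": "Unauthorized"}}'

translations = extract_translations(ok_body)
print(translations)

try:
    extract_translations(error_body)
    translation_error = None
except ValueError as error:
    translation_error = str(error)

print(translation_error)
```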

The translated text can then be converted to speech using the text-to-speech REST API. The first step is to get the list of voices that the text-to-speech service supports. This list is then filtered by the specified output language, and the first matching voice is selected.

The REST API is documented in the [Text-to-speech REST API documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/rest-text-to-speech?WT.mc_id=academic-7372-jabenn#get-a-list-of-voices).

The header passes the following:

* The access token that was retrieved earlier

The return value is a JSON document listing all the supported voices.


In [None]:
# Build the URL from the location
url = "https://" + speech_location + ".tts.speech.microsoft.com/cognitiveservices/voices/list"

# Set the headers to include the access token
headers = {
    "Authorization": "Bearer " + speech_access_token
}

# Make the request to get the list of voices
response = requests.get(url, headers=headers)

voices_json = json.loads(response.text)

# Pick the first voice that matches the language
voice = next(x for x in voices_json if x["Locale"].lower() == OUTPUT_LANGUAGE.lower())
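Note that `next` raises `StopIteration` when no voice matches the output language. One way to handle that case, sketched below, is to pass a default to `next` and check the result. `pick_voice` is a hypothetical helper, and the voice entries are illustrative samples that use only the `Locale`, `Gender` and `ShortName` keys read in this lab.

```python
def pick_voice(voices, locale):
    """Return the first voice whose Locale matches, or None if there is none."""
    wanted = locale.lower()
    return next((v for v in voices if v["Locale"].lower() == wanted), None)


# Illustrative voice entries (assumed sample data)
sample_voices = [
    {"Locale": "en-US", "Gender": "Female", "ShortName": "en-US-ExampleVoiceA"},
    {"Locale": "zh-CN", "Gender": "Female", "ShortName": "zh-CN-ExampleVoiceB"},
    {"Locale": "zh-CN", "Gender": "Male", "ShortName": "zh-CN-ExampleVoiceC"},
]

matched_voice = pick_voice(sample_voices, "zh-cn")
missing_voice = pick_voice(sample_voices, "xx-XX")

if missing_voice is None:
    print("No voice found for xx-XX")

print(matched_voice["ShortName"])
```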

The next step is to make the REST API call, uploading the text to a URL. The URL is built from the location that was extracted from the API endpoint, giving a new URL that points to the speech service itself.

The REST API is documented in the [Text-to-speech REST API documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/rest-text-to-speech?WT.mc_id=academic-7372-jabenn#convert-text-to-speech).

The header passes the following:

* The bearer token that was retrieved earlier
* The content type as SSML
* A requested output format of a 16 kHz mono WAV file

The body of the request is an SSML document detailing the text to convert. SSML is Speech Synthesis Markup Language, and is an XML-based markup language that lets developers specify how input text is converted into synthesized speech using the text-to-speech service. You can read more on SSML in the [Improve synthesis with Speech Synthesis Markup Language (SSML) documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-synthesis-markup?WT.mc_id=academic-7372-jabenn). The SSML needs to include details on the voice to use to generate the speech, so extract these from the voice found earlier.

In [None]:
# Build the URL from the location
url = "https://" + speech_location + ".tts.speech.microsoft.com/cognitiveservices/v1"

# Set the headers to include the access token, content type and output format
headers = {
    "Authorization": "Bearer " + speech_access_token,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm"
}

# Build the SSML
ssml =  "<speak version='1.0' xml:lang='" + OUTPUT_LANGUAGE + "'>"
ssml += "  <voice  xml:lang='" + OUTPUT_LANGUAGE + "' xml:gender='" + voice["Gender"] + "' name='" + voice["ShortName"] + "'>"
ssml += translated_text
ssml += "  </voice>"
ssml += "</speak>"

# Make the request passing the SSML as the body, encoded as UTF-8 to handle all languages
response = requests.post(url, headers=headers, data=ssml.encode("utf-8"))
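Because SSML is XML, translated text that contains characters such as `&` or `<` would produce an invalid document if concatenated directly. The sketch below shows one way to escape the text and attribute values with the standard library. `build_ssml` is a hypothetical helper, the voice name is a placeholder, and the gender attribute is left out for brevity.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, language: str, voice_name: str) -> str:
    """Build a minimal SSML document with the text safely escaped."""
    parts = [
        "<speak version='1.0' xml:lang=" + quoteattr(language) + ">",
        "<voice name=" + quoteattr(voice_name) + ">",
        escape(text),  # replaces &, < and > with XML entities
        "</voice>",
        "</speak>",
    ]
    return "".join(parts)


# Placeholder voice name and sample text, for illustration only
sample_ssml = build_ssml("Tom & Jerry <3", "en-US", "en-US-ExampleVoice")
print(sample_ssml)

# Parsing the result confirms that it is well-formed XML
parsed_text = ET.fromstring(sample_ssml).find("voice").text
print(parsed_text)
```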

The `response` contains the result of the text-to-speech call as binary audio data. This binary data can be saved to a WAV file.

In [None]:
# Get the binary data from the response
response_audio = BytesIO(response.content)

filename = "translation_output.wav"

with open(filename, "wb") as f:
    f.write(response_audio.getbuffer())

Play the speech audio file by using the `aplay` command line utility.

In [None]:
os.system("aplay " + filename)