# Speech to Text using a Raspberry Pi

This lab shows how to use the [Azure Cognitive Services speech service](https://azure.microsoft.com/services/cognitive-services/speech-services/?WT.mc_id=iotcurriculum-github-jabenn) on a Raspberry Pi. You will need a Cognitive Services speech resource to use this lab, and you can find all the instructions to get set up in the [README file](https://github.com/microsoft/iot-curriculum/tree/main/labs/ai-edge/speech-to-text).

This lab records 10 seconds of speech, then sends it to the Speech service to convert to text.

There is currently no SDK support for this speech service on ARM32 Linux, so this lab uses the REST APIs.

To use this Notebook, read each documentation cell, then select Run to run each code cell. The output of the code cells will be shown below. You can read more on running Jupyter Notebooks in the [Jupyter Notebooks documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#notebook-user-interface).

First, configure the options for the Speech cognitive service:
* Set the `KEY` variable to the key of your speech resource.
* Set the `ENDPOINT` variable to the endpoint of your speech resource.
* Set the `LANGUAGE` variable to the language you will be speaking in. You can find details on the supported languages in the [Language and voice support for the Speech service documentation](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support?WT.mc_id=iotcurriculum-github-jabenn).

In [None]:
KEY = "YOUR_SPEECH_KEY"
ENDPOINT = "YOUR_SPEECH_ENDPOINT"
LANGUAGE = "en-US"
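
If you would rather not store the key in the Notebook itself, one option is to read the settings from environment variables and fall back to the values above. The following is a minimal sketch, not part of the original lab: the helper `load_setting` and the variable names `SPEECH_KEY` and `SPEECH_ENDPOINT` are hypothetical. It also rejects the placeholder values, so a forgotten setting fails early rather than at the first API call.

```python
import os

# Placeholder values from the configuration cell above
PLACEHOLDERS = {"YOUR_SPEECH_KEY", "YOUR_SPEECH_ENDPOINT"}


def load_setting(name, default, env=os.environ):
    """Return the environment value for `name` if set, otherwise `default`.

    `name` is a hypothetical environment variable name chosen for this sketch.
    Raises ValueError if the result is empty or still a placeholder.
    """
    value = env.get(name, default).strip()
    if not value or value in PLACEHOLDERS:
        raise ValueError(f"{name} is not configured")
    return value
```

For example, `KEY = load_setting("SPEECH_KEY", KEY)` keeps the value from the cell above unless the environment variable is set.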

To access the microphone, a few Python packages need to be installed:

In [None]:
!pip3 install cffi
!pip3 install pyaudio
!pip3 install sounddevice
!pip3 install soundfile
!pip3 install scipy

Another package is needed to make calls to the REST API:

In [None]:
!pip3 install requests

Once the packages are installed, they need to be imported so that they are available to the Python code.

In [None]:
import sounddevice as sd
import soundfile as sf
import requests
from scipy.io.wavfile import write

Before audio can be captured, some configuration needs to be set up. The sample rate needs to be set to 16 kHz, and the sample length needs to be set to 10 seconds.

> If you want to record for longer, change the value of `sample_len` to the time in seconds that you want to record for.

In [None]:
# The Speech to Text Cognitive Service API currently only supports a 16,000 Hz sample rate
sample_rate = 16000

# Length of the audio sample in seconds
sample_len = 10
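
To get a feel for what these two settings mean for the upload, the arithmetic can be sketched as below. This is an illustration only: the helper `estimate_wav_size` is hypothetical, and it assumes mono 16-bit PCM samples (2 bytes each) plus a canonical 44-byte WAV header. A recording stored as 32-bit floats would be roughly twice as large.

```python
def estimate_wav_size(sample_rate, sample_len, channels=1, bytes_per_sample=2):
    """Return (frame_count, approximate_file_size_in_bytes) for a PCM WAV file.

    Assumes a canonical 44-byte WAV header and uncompressed samples.
    """
    wav_header_bytes = 44
    frames = int(sample_rate * sample_len)
    data_bytes = frames * channels * bytes_per_sample
    return frames, data_bytes + wav_header_bytes


frames, size = estimate_wav_size(16000, 10)
print(frames, size)  # prints "160000 320044"
```

So 10 seconds at 16 kHz is 160,000 frames, or about 320 KB to upload under these assumptions.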

Now capture the audio. Once you start running this cell, speak into the microphone for 10 seconds.

In [None]:
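# Prompt first: sd.rec returns immediately and records in the background
print("Start speaking now!")

# Record the speech sample as 16-bit integers so the saved file is 16-bit PCM
speech_sample = sd.rec(int(sample_len * sample_rate), samplerate=sample_rate, channels=1, dtype="int16")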

# Wait for the recording to stop after the specified number of seconds
sd.wait()

# Let the user know the recording is done
print("Recorded!")

The speech sample now needs to be saved to disk.

In [None]:
# Name of audio file to save sample
filename = "speech_to_text_rec.wav"

# Save speech sample as a .wav file
write(filename, sample_rate, speech_sample)
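
Before uploading, it can be useful to confirm that the saved file has the expected format. The sketch below uses Python's standard-library `wave` module to read the header; the helper name `describe_wav` is hypothetical. Note that `wave` only reads PCM-encoded files, so it raises `wave.Error` if the recording was saved in a floating-point format.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, frame_rate, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        rate = wav.getframerate()
        # Duration is the number of frames divided by frames per second
        duration = wav.getnframes() / rate
    return channels, width, rate, duration
```

For a 10 second mono 16-bit recording at 16 kHz, `describe_wav(filename)` would be expected to return `(1, 2, 16000, 10.0)`.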

The endpoint provided by the Cognitive Services resource issues access tokens; it does not perform speech recognition itself. Call it with the API key in the `Ocp-Apim-Subscription-Key` header to get back a bearer token, which is then used to authorize the actual speech API call.

In [None]:
# Create the request headers with the API key
headers = {
    "Ocp-Apim-Subscription-Key": KEY
}

# Make a POST request to the endpoint to get the token
response = requests.post(ENDPOINT, headers=headers)
access_token = str(response.text)
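
A failed token request (for example, one made with a rejected key) still returns a response body, so it can help to validate the response before treating its text as a token. The following is a minimal sketch under that assumption; `token_from_response` is a hypothetical helper, not part of the `requests` library or the Speech service.

```python
def token_from_response(status_code, body):
    """Return the access token from a token-endpoint response.

    Raises RuntimeError if the status is not 200 or the body is empty.
    """
    if status_code != 200:
        raise RuntimeError(f"Token request failed with status {status_code}: {body}")
    token = body.strip()
    if not token:
        raise RuntimeError("Token request succeeded but returned an empty body")
    return token
```

It could be used as `access_token = token_from_response(response.status_code, response.text)`.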

The next step is to make the REST API call, uploading the file with the speech data. The URL is built by extracting the location from the token endpoint and using it to build a new URL that points to the speech service itself. The bearer token is set in the request header, and the body of the request is the file that was just written.

In [None]:
# Get the location from the endpoint by removing the http protocol and getting the section before the first .
location = ENDPOINT.split("//")[-1].split(".")[0]

# Build the URL from the endpoint
url = "https://" + location + ".stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

# Set the headers to include the bearer token and describe the audio format
headers = {
    "Authorization": "Bearer " + access_token,
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "Accept": "application/json;text/xml"
}

# Configure the language parameter for the call
params = {
    "language": LANGUAGE
}

# Make the request, passing the file contents as the body
with open(filename, "rb") as audio_file:
    response = requests.post(url, headers=headers, params=params, data=audio_file)
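
The URL construction above can also be sketched with the standard-library `urllib.parse` module, which avoids splitting the string by hand. This is an illustrative alternative, not the lab's own method: `build_stt_url` is a hypothetical helper, and the example endpoint assumes a regional token URL whose first host label is the location.

```python
from urllib.parse import urlparse

STT_PATH = "/speech/recognition/conversation/cognitiveservices/v1"


def build_stt_url(endpoint):
    """Build the speech-to-text URL from the location in the token endpoint."""
    host = urlparse(endpoint).hostname
    if host is None:
        raise ValueError("Endpoint must be a full URL, including https://")
    # The location is the first label of the host name, e.g. "westus"
    location = host.split(".")[0]
    return f"https://{location}.stt.speech.microsoft.com{STT_PATH}"


# Example endpoint for illustration only
print(build_stt_url("https://westus.api.cognitive.microsoft.com/sts/v1.0/issuetoken"))
# prints "https://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
```

Unlike the string-splitting version, this sketch raises an error when the endpoint is missing its `https://` prefix.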

The `response` contains the result of the speech to text call as JSON. If the call was successful, it will return an object with a `RecognitionStatus` of `Success`, and a `DisplayText` with the speech converted to text.

In [None]:
# Parse the JSON body of the response
responsejson = response.json()

if responsejson["RecognitionStatus"] == "Success":
    print('Results from Speech to Text API:')
    print(responsejson['DisplayText'])
else:
    print("No speech detected")
    print("The raw response is:")
    print(response.text)
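
The service reports more than one non-success status, so a slightly more detailed handler can tell silence apart from a failed request. The sketch below assumes the documented `RecognitionStatus` values `NoMatch`, `InitialSilenceTimeout`, `BabbleTimeout` and `Error`; the helper `describe_result` and the sample payload are hypothetical and for illustration only.

```python
import json


def describe_result(payload):
    """Return the recognized text, or a short message explaining the status."""
    status = payload.get("RecognitionStatus")
    if status == "Success":
        return payload.get("DisplayText", "")
    if status in ("NoMatch", "InitialSilenceTimeout", "BabbleTimeout"):
        # Audio was received but no usable speech was found in it
        return f"No speech recognized ({status})"
    return f"Recognition failed ({status})"


# Illustrative payload, not real service output
sample = json.loads('{"RecognitionStatus": "Success", "DisplayText": "Hello world."}')
print(describe_result(sample))  # prints "Hello world."
```

With the response from the cell above, this could be called as `describe_result(response.json())`.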