The Speech service provides speech to text and text to speech capabilities with a Speech resource. <br/>You can transcribe speech to text with high accuracy, produce natural-sounding text to speech voices, translate spoken audio, and use speaker recognition during conversations.

In this demo, we will show off the speech to text capabilities.


The speech to text service offers the following core features:

- Real-time transcription: Instant transcription with intermediate results for live audio inputs.
- Fast transcription: Fastest synchronous output for situations with predictable latency.
- Batch transcription: Efficient processing for large volumes of prerecorded audio.
- Custom speech: Models with enhanced accuracy for specific domains and conditions.

### Batch Transcription
You should provide multiple files per request or point to an Azure Blob Storage container with the audio files to transcribe. The batch transcription service can handle a large number of submitted transcriptions. The service transcribes the files concurrently, which reduces the turnaround time. <br/>
**Batch transcription should only be done with the REST API, not the SDK.**

##### How does it work?
With batch transcriptions, you submit the audio data, and then retrieve transcription results asynchronously. The service transcribes the audio data and stores the results in a storage container. You can then retrieve the results from the storage container.

(Batch transcription jobs are scheduled on a best-effort basis. At peak hours, it might take 30 minutes or longer for a transcription job to start processing.)
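As a rough illustration of the "submit, then retrieve later" flow, the sketch below only builds the pieces of a create-transcription request without sending it. It assumes the v3.2 REST route and the `contentUrls` / `locale` / `displayName` body fields; the function name `build_batch_request` and its defaults are hypothetical, and the API version should be checked against the current REST reference before use.

```python
import json


def build_batch_request(region, key, content_urls, locale="en-US",
                        display_name="demo-batch", api_version="v3.2"):
    """Return (url, headers, body) for a 'create transcription' POST.

    Assumption: the v3.2 route and field names; verify them for your resource.
    """
    url = (f"https://{region}.api.cognitive.microsoft.com"
           f"/speechtotext/{api_version}/transcriptions")
    headers = {
        "Ocp-Apim-Subscription-Key": key,      # Speech resource key
        "Content-Type": "application/json",
    }
    body = {
        "contentUrls": list(content_urls),     # publicly reachable or SAS URLs
        "locale": locale,
        "displayName": display_name,
    }
    return url, headers, json.dumps(body)
```

The returned tuple can be passed to any HTTP client. The job itself runs asynchronously, so a real script would then poll the transcription's status and download the result files once it reports success.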

### Fast Transcription
The fast transcription API transcribes audio files synchronously and returns results faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency, such as:

- Quick audio or video transcription and subtitles: Quickly get a transcription of an entire video or audio file in one go.
- Video translation: Immediately get new subtitles for a video if you have audio in different languages.

**Only available via the API**
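For orientation, here is a minimal sketch of what a fast transcription call might be assembled from. It assumes the `transcriptions:transcribe` route with `api-version=2024-11-15` and a JSON `definition` that lists the candidate locales; the helper name `build_fast_transcription_request` is hypothetical, and the version string in particular should be confirmed against the current documentation.

```python
import json


def build_fast_transcription_request(region, key, locales=("en-US",),
                                     api_version="2024-11-15"):
    """Return (url, headers, definition) for a fast transcription POST.

    Assumption: route, api-version, and the 'locales' field; verify before use.
    """
    url = (f"https://{region}.api.cognitive.microsoft.com"
           f"/speechtotext/transcriptions:transcribe?api-version={api_version}")
    headers = {"Ocp-Apim-Subscription-Key": key}
    definition = json.dumps({"locales": list(locales)})
    return url, headers, definition
```

The request is sent as `multipart/form-data` with two parts: `audio` (the file bytes) and `definition` (the JSON string above). Because the call is synchronous, the transcript comes back in the response body rather than through a storage container.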

### Real-Time Transcription
Real-time speech to text transcribes audio as it's recognized from a microphone or file. <br/>
Available via the SDK and REST API.

##### From a File

In [None]:
!pip install azure-cognitiveservices-speech

In [1]:
import os
import azure.cognitiveservices.speech as speechsdk
from dotenv import load_dotenv

In [2]:
load_dotenv()

True

In [3]:
base_url = os.getenv('AI_SPEECH_ENDPOINT')
key = os.getenv('AI_SPEECH_KEY')
region = 'eastus'
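If `load_dotenv()` does not find the `.env` file, `os.getenv` silently returns `None` and the failure only shows up later as a canceled recognition. A small guard like the one below fails early instead. This is a sketch: `load_speech_settings` is a hypothetical helper, and `AI_SPEECH_REGION` is an assumed variable name (the notebook hard-codes the region).

```python
import os


def load_speech_settings(env=os.environ):
    """Read Speech settings from a mapping and fail early if any are missing."""
    required = ("AI_SPEECH_KEY", "AI_SPEECH_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "key": env["AI_SPEECH_KEY"],
        "endpoint": env["AI_SPEECH_ENDPOINT"],
        # AI_SPEECH_REGION is an assumed name; falls back to the notebook's value.
        "region": env.get("AI_SPEECH_REGION", "eastus"),
    }
```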

In [None]:
# Aside: these scripts often live in .py files, so it helps to wrap the logic in functions.
def transcribe(filepath):
    speech_config = speechsdk.SpeechConfig(subscription=key,
                                           #endpoint=base_url,
                                           region=region)
    speech_config.speech_recognition_language = 'en-US'

    audio_config = speechsdk.audio.AudioConfig(filename=filepath)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print(f"Recognized: {speech_recognition_result.text}")
        return speech_recognition_result.text

    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print(f"No speech could be recognized: {speech_recognition_result.no_match_details}")
    
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print(f"Speech Recognition canceled: {cancellation_details.reason}")
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")

In [None]:
transcribe('./Data/FightMilk.wav')
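When a file produces `NoMatch` or only a partial transcript, it can help to look at the audio itself. The SDK's default input format is WAV with 16-bit mono PCM at 16 kHz or 8 kHz, and single-shot recognition only returns the first utterance. The sketch below uses the standard library's `wave` module to report those properties; `describe_wav` is a hypothetical helper name.

```python
import wave


def describe_wav(filepath):
    """Return basic format information for a PCM WAV file."""
    with wave.open(filepath, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": frame_rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_sec": wav.getnframes() / frame_rate,
        }
```

A long `duration_sec` is a hint that continuous recognition is the better fit for that file than `recognize_once_async()`.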

##### From a Microphone

In [None]:
def recognize_from_microphone():
    # Replace with your own subscription key and endpoint. The endpoint looks like "https://YourServiceRegion.api.cognitive.microsoft.com".
    speech_config = speechsdk.SpeechConfig(subscription=key, endpoint=base_url)
    speech_config.speech_recognition_language="en-US"

    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True) #you can also use the device if you need to specify a microphone
    #https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-select-audio-input-devices
    
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    print("Speak into your microphone.")
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    # Alternative: speech_recognition_result = speech_recognizer.recognize_once()
    # Both calls block until a result is available, so they behave the same here.

    # Neither call prints the text by itself; the result is handled below.


    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(speech_recognition_result.text))
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and endpoint values?")

In [None]:
recognize_from_microphone()

**NOTE: This example uses the recognize_once_async operation to transcribe utterances of up to 30 seconds, or until silence is detected.**

#### For continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum amount of audio is processed (the figure is quoted as either 15 or 30 seconds; the note above uses 30). <br/><br/>
In contrast, you use continuous recognition when you want to control when to stop recognizing. It requires you to connect callbacks to the recognizer's `EventSignal` members to get the recognition results. <br/>To stop recognition, you must call `stop_continuous_recognition()` or `stop_continuous_recognition_async()`.
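The connect-a-callback pattern can be tried without a Speech resource. The sketch below uses `FakeSignal`, a hypothetical stand-in that only mimics the `connect()` side of the SDK's `EventSignal`, and a background thread that plays the role of the service. It also waits on a `threading.Event` rather than polling a flag, which is one possible alternative to a sleep loop.

```python
import threading


class FakeSignal:
    """Hypothetical stand-in for EventSignal: stores callbacks, calls them on fire()."""

    def __init__(self):
        self._callbacks = []

    def connect(self, callback):
        self._callbacks.append(callback)

    def fire(self, evt):
        for callback in self._callbacks:
            callback(evt)


def run_demo():
    recognized = FakeSignal()
    session_stopped = FakeSignal()
    finished = threading.Event()
    transcript = []

    # Callbacks run on the worker thread, as they do with the real recognizer.
    recognized.connect(lambda evt: transcript.append(evt))
    session_stopped.connect(lambda evt: finished.set())

    def fake_service():
        for text in ("hello", "world"):
            recognized.fire(text)
        session_stopped.fire("stopped")

    worker = threading.Thread(target=fake_service)
    worker.start()
    finished.wait(timeout=5)   # block until the stop callback has run
    worker.join(timeout=5)
    return transcript
```

Calling `run_demo()` collects the two fake results in order and returns once the stop signal has fired.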

In [4]:
import time

In [None]:
def continuous_recognition(filepath = None):
    speech_config = speechsdk.SpeechConfig(subscription=key, endpoint=base_url)
    speech_config.speech_recognition_language = 'en-US'

    if filepath:
        audio_config = speechsdk.audio.AudioConfig(filename=filepath)
    else:
        audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # Flag that tells the polling loop below when to exit.
    done = False

    # Callback that stops continuous recognition when a stop event arrives:
    # it prints the event, stops the recognizer, and sets done to True.
    def stop_cb(evt):
        '''
        This function is triggered when certain events fire (you connect it later).
        '''
        # done is a local variable of continuous_recognition, so nonlocal (not
        # global) is required. Without it, the assignment below would create a
        # new local variable and the while loop would never exit.
        nonlocal done
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        done = True
    
    # Connecting event handlers
    '''
    - recognizing: fires while the recognizer is receiving audio and producing interim results.
    - recognized: fires when final text is recognized (end of an utterance or pause).
    - session_started: signals the speech service session has started. Good for logging or UI readiness.
    - session_stopped: signals the session has ended normally.
        - You connect stop_cb to this below to shut down cleanly.
    - canceled: fires if the recognition is canceled, e.g., by error or manual stop. Also triggers stop_cb to clean up.
    '''
    speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    #To just get the text, replace above with: speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt.result.text)))
    #or comment out recognizing, and only get the final output
    speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    #To just get the text, replace above with: speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt.result.text)))

    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    #Now that everything is set up, we can call the start_continuous_recognition method to start the recognition process.
    
    speech_recognizer.start_continuous_recognition()
    while not done:
        '''
        - Keeps your Python script or notebook alive so the recognizer thread keeps running.
        - The loop polls the done flag every 0.5 seconds.
        - Once stop_cb sets done = True, this loop exits and your function ends.
        '''
        time.sleep(.5) 


    

In [None]:
continuous_recognition('./Data/FightMilk.wav')

In [None]:
continuous_recognition()
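Because `done` is a local variable of `continuous_recognition`, a callback that assigns to it needs `nonlocal done`; a plain assignment creates a new variable inside the callback, and `global done` would target a module-level name instead. The two toy functions below (hypothetical names, no SDK involved) show the difference.

```python
def flag_without_nonlocal():
    done = False

    def stop_cb():
        done = True  # creates a new local; the outer flag is untouched

    stop_cb()
    return done


def flag_with_nonlocal():
    done = False

    def stop_cb():
        nonlocal done  # rebind the variable of the enclosing function
        done = True

    stop_cb()
    return done
```

The first function still returns `False` after the callback runs, which is exactly the situation that leaves a `while not done:` loop spinning forever.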

### Stop after 3 seconds of silence

In [5]:
import threading

In [8]:
def continuous_recognition_timed(filepath = None, silence_timeout_sec=3):
    speech_config = speechsdk.SpeechConfig(subscription=key, endpoint=base_url)
    speech_config.speech_recognition_language = 'en-US'

    if filepath:
        audio_config = speechsdk.audio.AudioConfig(filename=filepath)
    else:
        audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    #Variable to manage state
    done = False
    silence_deadline = time.time() + silence_timeout_sec #SET THE DEADLINE
    
    def stop_cb(evt):
        nonlocal done
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        done = True

    def recognizing_cb(evt):
        nonlocal silence_deadline
        print('RECOGNIZING: {}'.format(evt))
        silence_deadline = time.time() + silence_timeout_sec
    
    def recognized_cb(evt): #refactored to calculate a new deadline on each event
        nonlocal silence_deadline
        print('RECOGNIZED: {}'.format(evt))
        silence_deadline = time.time() + silence_timeout_sec


    speech_recognizer.recognizing.connect(recognizing_cb)
    #speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    speech_recognizer.recognized.connect(recognized_cb)
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    #Now that everything is set up, we can call the start_continuous_recognition method to start the recognition process.
    
    speech_recognizer.start_continuous_recognition()
    while not done:
        time.sleep(.5)
        if time.time() > silence_deadline:
            print(f"No speech for {silence_timeout_sec} seconds. Stopping recognition.")
            speech_recognizer.stop_continuous_recognition()
            done = True


    

In [9]:
continuous_recognition_timed()

SESSION STARTED: SessionEventArgs(session_id=e49b9129357f4924bf1d2d53d6dd43eb)
RECOGNIZING: SpeechRecognitionEventArgs(session_id=e49b9129357f4924bf1d2d53d6dd43eb, result=SpeechRecognitionResult(result_id=430d862181724946803b41b229f461ec, text="CS go", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=e49b9129357f4924bf1d2d53d6dd43eb, result=SpeechRecognitionResult(result_id=e213302b8d584fb5a9b595990dade669, text="CS go i'm not sure", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=e49b9129357f4924bf1d2d53d6dd43eb, result=SpeechRecognitionResult(result_id=a1cf73e1ab3d4c7a86d0974e7c996aab, text="CS go i'm not sure if that", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=e49b9129357f4924bf1d2d53d6dd43eb, result=SpeechRecognitionResult(result_id=a1d4e66bf094480fab2a072351d60c10, text="CS go i'm not sure if", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: Sp
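The deadline bookkeeping in `continuous_recognition_timed` can be pulled out into a small object, which makes it testable without a microphone. The sketch below is one way to do that; `SilenceTimer` is a hypothetical name, and it uses `time.monotonic()` instead of `time.time()` because a monotonic clock is not affected by system clock adjustments.

```python
import time


class SilenceTimer:
    """Tracks whether more than timeout_sec has passed since the last touch()."""

    def __init__(self, timeout_sec, clock=time.monotonic):
        self._timeout = timeout_sec
        self._clock = clock                      # injectable for tests
        self._deadline = clock() + timeout_sec

    def touch(self):
        """Call from the recognizing/recognized callbacks to push the deadline out."""
        self._deadline = self._clock() + self._timeout

    def expired(self):
        """Call from the polling loop to decide whether to stop recognition."""
        return self._clock() > self._deadline
```

In the recognition function, the callbacks would call `timer.touch()` and the polling loop would check `timer.expired()`, so no `nonlocal` deadline variable is needed.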

### In one Language, Out the other

##### What our function needs to do
- Listen to the microphone continuously.
- Recognize speech in the source language.
- Feed the recognized text to Azure’s TTS.
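- Speak it back out loud with a voice for the target language.
    - This function saves the synthesized speech to a file instead of playing it.
    - Note: the recognized text is passed to TTS unchanged. No translation step is involved, so the target-language voice reads the source-language text.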
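The steps listed here recognize and synthesize speech but do not translate the text in between. If an actual translation is wanted, the Speech SDK's translation classes are one option. The sketch below is a minimal single-utterance version, not part of the original demo: `translate_once` is a hypothetical helper, it needs the SDK installed and a working microphone, and it assumes a key plus region rather than an endpoint. Note that translation targets use language codes such as `fr`, not locales such as `fr-FR`.

```python
def translate_once(key, region, source_language="en-US", target_language="fr"):
    """Recognize one utterance from the default microphone and return its translation."""
    # Imported here so this module still loads where the SDK is not installed.
    import azure.cognitiveservices.speech as speechsdk

    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    translation_config.speech_recognition_language = source_language
    translation_config.add_target_language(target_language)

    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)

    result = recognizer.recognize_once_async().get()
    if result.reason == speechsdk.ResultReason.TranslatedSpeech:
        # translations maps each target language code to the translated text.
        return result.translations[target_language]
    return None  # no match or canceled; inspect result.reason for details
```

The returned string could then be passed to `speak_text_async` so that the French voice reads French text.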

In [None]:
def echo_with_tts(target_language='fr-FR'):
    speech_config = speechsdk.SpeechConfig(subscription=key, endpoint=base_url)
    speech_config.speech_recognition_language = 'en-US'

    audio_config_in = speechsdk.audio.AudioConfig(use_default_microphone=True)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config_in)

    # Synthesis config: pick target voice matching the target language
    speech_config.speech_synthesis_language = target_language

    # Pick a voice name. You can look up the list in Azure docs.
    # For French:
    speech_config.speech_synthesis_voice_name = 'fr-FR-DeniseNeural'
    #you could set up some logic here to choose the voice based on the target language

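    # Synthesis output is configured with AudioOutputConfig; AudioConfig is the input-side class.
    audio_config_out = speechsdk.audio.AudioOutputConfig(filename="output_test.wav")
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config_out)
    # To hear the audio in real time (for example from a .py file), use
    # speechsdk.audio.AudioOutputConfig(use_default_speaker=True) instead.
    # Passing audio_config=None keeps the audio in memory without playing it.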

    done = False

    def recognized_cb(evt):
        text = evt.result.text
        print(f"RECOGNIZED: {text}")

        # Now speak it in the target language
        if text.strip():
            print(f"Speaking in {target_language}...")
            result = speech_synthesizer.speak_text_async(text).get()

            if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
                print("Speech synthesized for text [{}]".format(text))
            elif result.reason == speechsdk.ResultReason.Canceled:
                cancellation_details = result.cancellation_details
                print("Speech synthesis canceled: {}".format(cancellation_details.reason))
                if cancellation_details.reason == speechsdk.CancellationReason.Error:
                    print("Error details: {}".format(cancellation_details.error_details))

    def stop_cb(evt):
        nonlocal done
        print(f"Session ended: {evt}")
        speech_recognizer.stop_continuous_recognition()
        done = True

    speech_recognizer.recognized.connect(recognized_cb)
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    speech_recognizer.start_continuous_recognition()

    while not done:
        time.sleep(0.5)

In [None]:
echo_with_tts()

RECOGNIZED: All right, we're going to see if this works this time.
Speaking in fr-FR...
RECOGNIZED: All right, we're going to see if this works this time.
Speaking in fr-FR...
Speech synthesized for text [All right, we're going to see if this works this time.]
Speech synthesized for text [All right, we're going to see if this works this time.]
RECOGNIZED: Interesting to see what happens.
Speaking in fr-FR...
RECOGNIZED: Interesting to see what happens.
Speaking in fr-FR...
Speech synthesized for text [Interesting to see what happens.]
Speech synthesized for text [Interesting to see what happens.]


KeyboardInterrupt: 

RECOGNIZED: 
RECOGNIZED: 
RECOGNIZED: No way.
Speaking in fr-FR...
RECOGNIZED: No way.
Speaking in fr-FR...
Speech synthesized for text [No way.]
Speech synthesized for text [No way.]
RECOGNIZED: No way.
Speaking in fr-FR...
RECOGNIZED: No way.
Speaking in fr-FR...
Speech synthesized for text [No way.]
Speech synthesized for text [No way.]
RECOGNIZED: 
RECOGNIZED: 
RECOGNIZED: 
RECOGNIZED: 
RECOGNIZED: 
RECOGNIZED: 
RECOGNIZED: 
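`echo_with_tts` accepts a `target_language` argument but always uses the French voice. The "logic to choose the voice" mentioned in its comments could be as simple as a lookup table. This is a sketch with hypothetical names (`VOICE_BY_LOCALE`, `pick_voice`); only `fr-FR-DeniseNeural` comes from the notebook, and the other voice names are assumptions that should be checked against the voice list for your region.

```python
# Locale -> neural voice name. Only the fr-FR entry comes from the notebook;
# the others are assumed examples.
VOICE_BY_LOCALE = {
    "fr-FR": "fr-FR-DeniseNeural",
    "en-US": "en-US-JennyNeural",
    "es-ES": "es-ES-ElviraNeural",
    "de-DE": "de-DE-KatjaNeural",
}


def pick_voice(target_language, default=None):
    """Return the configured voice for a locale, or raise if none is known."""
    voice = VOICE_BY_LOCALE.get(target_language, default)
    if voice is None:
        raise ValueError(f"No voice configured for {target_language!r}")
    return voice
```

Inside the function, `speech_config.speech_synthesis_voice_name = pick_voice(target_language)` would replace the hard-coded assignment.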
