## 📚 Prerequisites

Before running this notebook, configure Azure AI services, set the required configuration parameters, and create a Conda environment for reproducibility. The setup instructions, including how to create the Conda environment, are in the [REQUIREMENTS.md](REQUIREMENTS.md) file.

## 📋 Table of Contents

This notebook guides you through the following sections:

1. [**Transcription Services**](#transcription-services)

    Azure AI's Speech SDK offers robust transcription services that convert spoken language into written text. This capability is useful in various scenarios, such as transcribing meetings, generating subtitles for videos, or enabling voice commands in applications. The following sections explore four different use cases:

    1. [**Local Files**](#from-local-files): Learn how to transcribe audio from files stored on your machine. This is particularly useful when working with small to medium-sized files that can be easily accessed and processed locally.

    2. [**Blob Storage**](#from-blob-storage): Discover how to transcribe audio from files stored in Azure Blob Storage. This approach is ideal for larger files that require the scalability and robustness of cloud storage.

    3. [**Multi-language Auto Recognition**](#multi-language-auto-recognition-transcription): Explore the SDK's ability to automatically recognize and transcribe multiple languages within a single audio file. This feature is beneficial when dealing with multilingual content.

    4. [**Enable Diarization (preview)**](#enable-diarization-preview): Run a speech-to-text transcription with real-time diarization. Diarization distinguishes between the different speakers in a conversation: the Speech service reports which speaker spoke each part of the transcribed speech.
    
2. [**Real-time Speech to Text (Streams)**](#speech-to-text-from-streams-preview): This section demonstrates how to convert speech to text from audio streams, using push streams for real-time processing.

For more details, refer to the following resources:
- [Quickstart: Azure Cognitive Services Speech SDK](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master)

In [1]:
import os

# Define the target directory (change this to your local path)
target_directory = r"C:\Users\pablosal\Desktop\gbbai-azure-ai-speech-services"

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-speech-services


## Transcription Services

### From Local Files

In [2]:
# Import the SpeechTranscriber class from the speech_to_text module in the src.speech package
from src.speech.speech_to_text import SpeechTranscriber

# Create an instance of the SpeechTranscriber class
transcriber_client = SpeechTranscriber()

In this section, we will transcribe speech from a local audio file using Azure AI's Speech SDK. The audio file we will be using is located at `gbbai-azure-ai-speech-services//utils//audio_data//english.wav`.

The expected transcription (ground truth) is: "Oh, he has been away from New York—he has been all round the world. He doesn't know many people here, but he's very sociable, and he wants to know every one."

We will use the `transcribe_speech_from_file_continuous` function, which performs continuous speech recognition with input from an audio file. This function takes several parameters, including the path to the local audio file, language settings, and auto-detection settings, and returns the transcribed text from the audio source.

Here's how we call this function:

```python
transcriber_client.transcribe_speech_from_file_continuous(
    file_path=AUDIO_FILE_PCM_STEREO
)
``` 

In [3]:
AUDIO_FILE_PCM_STEREO = "utils//audio_data//english.wav"

In [4]:
transcriber_client.transcribe_speech_from_file_continuous(
    file_path=AUDIO_FILE_PCM_STEREO,
    auto_detect_source_language=False,
    diarization=False,
)

2024-01-09 23:01:43,452 - micro - MainProcess - INFO     Transcribing with diarization (speech_to_text.py:_transcribe:605)
2024-01-09 23:01:43,456 - micro - MainProcess - INFO     SessionStarted event: SessionEventArgs(session_id=65104312d35940c283d29a83d0dec700) (speech_to_text.py:conversation_transcriber_session_started_cb:31)
2024-01-09 23:01:44,000 - micro - MainProcess - INFO     Transcribing event: ConversationTranscriptionEventArgs(session_id=65104312d35940c283d29a83d0dec700, result=ConversationTranscriptionResult(result_id=5ea9c6cc850840bc9e3af4f6728a3de6, speaker_id=Unknown, text=oh he has been away from, reason=ResultReason.RecognizingSpeech)) (speech_to_text.py:conversation_transcriber_transcribing_started_cb:34)
2024-01-09 23:01:44,091 - micro - MainProcess - INFO     Transcribing event: ConversationTranscriptionEventArgs(session_id=65104312d35940c283d29a83d0dec700, result=ConversationTranscriptionResult(result_id=260b537ab86e4de9badc3bb66ef9a3a4, speaker_id=Unknown, text=o

"Oh, he has been away from New York. He has been all round the world. He doesn't know many people here, but he's very sociable and he wants to know everyone."
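
The returned text differs from the ground truth only in punctuation and in "every one" versus "everyone". One common way to quantify such differences is the word error rate (WER). The sketch below is not part of the `SpeechTranscriber` class; `normalize` and `word_error_rate` are hypothetical helpers, and the normalization (lowercasing, dropping punctuation) is an assumption you may want to adapt.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only word-like tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # previous[j] holds the edit distance between the processed part of ref and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / max(len(ref), 1)


ground_truth = ("Oh, he has been away from New York—he has been all round the world. "
                "He doesn't know many people here, but he's very sociable, "
                "and he wants to know every one.")
transcribed = ("Oh, he has been away from New York. He has been all round the world. "
               "He doesn't know many people here, but he's very sociable "
               "and he wants to know everyone.")

wer = word_error_rate(ground_truth, transcribed)
print(f"WER: {wer:.4f}")
```

A WER of 0 means the two word sequences are identical after normalization; here the only edits come from the split word at the end.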

### From Blob Storage

The audio file we will be using is located at `https://testeastusdev001.blob.core.windows.net/speechapp/d6a35a5e-be01-40cd-b9ef-d61fcda699fa.pcm`. 

The expected transcription (ground truth) is: 'What is the date? May 15th, 1980. Thursday, May 15th, 19180. What is the date? Saturday, July 6th, 2024.'

In [5]:
AUDIO_FROM_BLOB = "https://testeastusdev001.blob.core.windows.net/speechapp/d6a35a5e-be01-40cd-b9ef-d61fcda699fa.pcm"
transcriber_client.transcribe_speech_from_file_continuous(blob_url=AUDIO_FROM_BLOB)

2024-01-09 23:01:50,207 - micro - MainProcess - INFO     Transcribing with diarization (speech_to_text.py:_transcribe:605)
2024-01-09 23:01:50,223 - micro - MainProcess - INFO     SessionStarted event: SessionEventArgs(session_id=d337d17f5d8848b18e1ca6b870affb2f) (speech_to_text.py:conversation_transcriber_session_started_cb:31)
2024-01-09 23:01:50,762 - micro - MainProcess - INFO     Transcribing event: ConversationTranscriptionEventArgs(session_id=d337d17f5d8848b18e1ca6b870affb2f, result=ConversationTranscriptionResult(result_id=db504ade07ad48b68c21fb2eeb78a8a9, speaker_id=Unknown, text=what is the date, reason=ResultReason.RecognizingSpeech)) (speech_to_text.py:conversation_transcriber_transcribing_started_cb:34)
2024-01-09 23:01:51,458 - micro - MainProcess - INFO     Transcribing event: ConversationTranscriptionEventArgs(session_id=d337d17f5d8848b18e1ca6b870affb2f, result=ConversationTranscriptionResult(result_id=b38f7bd99d534551bd721f9782809cab, speaker_id=Unknown, text=what is t

'What is the date? May 15th, 1980. Thursday, May 15th, 19180. What is the date? Saturday, July 6th, 2024.'
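
Blob URLs follow the pattern `https://<account>.blob.core.windows.net/<container>/<blob>`. How `transcribe_speech_from_file_continuous` resolves the `blob_url` internally is not shown in this notebook, so the following is only a sketch of how such a URL can be split into its parts with the standard library; `split_blob_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def split_blob_url(blob_url: str) -> tuple[str, str, str]:
    """Return (storage_account, container, blob_name) for a Blob Storage URL."""
    parsed = urlparse(blob_url)
    # The storage account is the first label of the host name.
    account = parsed.netloc.split(".")[0]
    # The first path segment is the container; the remainder is the blob name.
    container, _, blob_name = parsed.path.lstrip("/").partition("/")
    if not (account and container and blob_name):
        raise ValueError(f"Not a blob URL: {blob_url}")
    return account, container, blob_name


account, container, blob_name = split_blob_url(
    "https://testeastusdev001.blob.core.windows.net/speechapp/"
    "d6a35a5e-be01-40cd-b9ef-d61fcda699fa.pcm"
)
print(account, container, blob_name)
```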

### Multi-language Auto Recognition Transcription

In this section, we will transcribe speech from a local audio file in French using Azure AI's Speech SDK, which features automatic language detection. The audio file we will be using is located at `utils//audio_data//french.wav`. 

The expected transcription (ground truth) is: `En semaine, je me lève à 6h30, je prends une douche et un petit déjeuner et je pars au travail vers 7h15. Pour arriver à mon entreprise à 8 heures, il me faut environ 45 minutes en voiture, mais parfois j’arrive en retard à cause des embouteillages.`

We will use the `transcribe_speech_from_file_continuous` function, setting the `auto_detect_source_language` parameter to `True`. This allows the function to automatically detect the language spoken in the audio file. Currently, the supported languages are English (United States), Spanish (Spain), and French (France). However, you can add new languages using the `transcriber_client.add_supported_language` method or by passing the `auto_detect_source_language` parameter when calling the `transcribe_speech_from_file_continuous` function.

For more information about the available languages, please visit the [Azure AI Services Language Support page](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt).

In [6]:
AUDIO_FILE_FRENCH = "utils/audio_data/french.wav"
transcriber_client.transcribe_speech_from_file_continuous(
    file_path=AUDIO_FILE_FRENCH, auto_detect_source_language=True
)

2024-01-09 23:02:00,960 - micro - MainProcess - INFO     Transcribing with diarization (speech_to_text.py:_transcribe:605)
2024-01-09 23:02:00,968 - micro - MainProcess - INFO     SessionStarted event: SessionEventArgs(session_id=7148d09395d642f9899b92f72a6d38c0) (speech_to_text.py:conversation_transcriber_session_started_cb:31)
2024-01-09 23:02:01,919 - micro - MainProcess - INFO     Transcribing event: ConversationTranscriptionEventArgs(session_id=7148d09395d642f9899b92f72a6d38c0, result=ConversationTranscriptionResult(result_id=af46564a1b7d4886b922f98c3fe7193d, speaker_id=Unknown, text=en sem, reason=ResultReason.RecognizingSpeech)) (speech_to_text.py:conversation_transcriber_transcribing_started_cb:34)
2024-01-09 23:02:02,028 - micro - MainProcess - INFO     Transcribing event: ConversationTranscriptionEventArgs(session_id=7148d09395d642f9899b92f72a6d38c0, result=ConversationTranscriptionResult(result_id=426be43b8e2f4295803a97257445417b, speaker_id=Unknown, text=en semaine je me, r

"En semaine, je me lève à 06h30, Je prends une douche et un petit déjeuner et je pars au travail vers 07h15 pour arriver à mon entreprise à 08h00. Il me faut environ 45 Min en voiture, mais parfois j'arrive en retard à cause des embouteillages."

### Enable Diarization (preview)

Speaker information is included in each result's `speaker_id` field. The value is a generic identifier (for example, `Guest-1`) that the service assigns to each conversation participant during recognition, as it distinguishes the speakers in the provided audio.

In [7]:
result = transcriber_client.transcribe_speech_from_file_continuous(
    blob_url=AUDIO_FROM_BLOB, diarization=True, auto_detect_source_language=True
)

2024-01-09 23:02:10,734 - micro - MainProcess - INFO     Transcribing with diarization (speech_to_text.py:_transcribe:605)
2024-01-09 23:02:10,749 - micro - MainProcess - INFO     SessionStarted event: SessionEventArgs(session_id=b4815576a1294258aa7e78181b6621c4) (speech_to_text.py:conversation_transcriber_session_started_cb:31)
2024-01-09 23:02:11,591 - micro - MainProcess - INFO     Transcribing event: ConversationTranscriptionEventArgs(session_id=b4815576a1294258aa7e78181b6621c4, result=ConversationTranscriptionResult(result_id=5b719096b95e43c1a4eabeb38d106567, speaker_id=Unknown, text=what is the date, reason=ResultReason.RecognizingSpeech)) (speech_to_text.py:conversation_transcriber_transcribing_started_cb:34)
2024-01-09 23:02:12,076 - micro - MainProcess - INFO     Transcribing event: ConversationTranscriptionEventArgs(session_id=b4815576a1294258aa7e78181b6621c4, result=ConversationTranscriptionResult(result_id=e50cd91aff7941e788cb7a0e90943806, speaker_id=Unknown, text=what is t

In [8]:
print(result)

Speaker Guest-1: What is the date?
Speaker Guest-1: May 15th, 1980.
Speaker Guest-2: Thursday, May 15th, 19180.
Speaker Guest-1: What is the date?
Speaker Guest-2: Saturday, July 6th, 2024.
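
If you need the utterances per speaker rather than a single block of text, the printed format can be parsed line by line. This is a minimal sketch that assumes `result` is a string laid out exactly as printed above (`Speaker <label>: <utterance>`); `group_by_speaker` is a hypothetical helper, not part of the `SpeechTranscriber` class.

```python
from collections import defaultdict

# Copied from the printed result above.
diarized_text = """\
Speaker Guest-1: What is the date?
Speaker Guest-1: May 15th, 1980.
Speaker Guest-2: Thursday, May 15th, 19180.
Speaker Guest-1: What is the date?
Speaker Guest-2: Saturday, July 6th, 2024.
"""


def group_by_speaker(text: str) -> dict[str, list[str]]:
    """Map each speaker label to its utterances, in the order they were spoken."""
    utterances = defaultdict(list)
    for line in text.splitlines():
        if not line.startswith("Speaker "):
            continue  # skip blank or unrecognized lines
        label, _, utterance = line[len("Speaker "):].partition(": ")
        utterances[label].append(utterance)
    return dict(utterances)


by_speaker = group_by_speaker(diarized_text)
for label, lines in by_speaker.items():
    print(label, len(lines))
```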



## Speech to Text from Streams (preview)

In [9]:
from src.speech.utils_audio import check_audio_file, log_audio_characteristics

In [10]:
AUDIO_FILE_PCM_MONO = "C://Users//pablosal//Desktop//gbbai-azure-ai-speech-services//utils//audio_data//aboutSpeechSdk.wav"

In [11]:
log_audio_characteristics(AUDIO_FILE_PCM_MONO)

2024-01-09 23:02:21,759 - micro - MainProcess - INFO     Number of Channels: 1 (utils_audio.py:log_audio_characteristics:75)
2024-01-09 23:02:21,761 - micro - MainProcess - INFO     Sample Width: 2 (utils_audio.py:log_audio_characteristics:76)
2024-01-09 23:02:21,764 - micro - MainProcess - INFO     Frame Rate: 16000 (utils_audio.py:log_audio_characteristics:77)
2024-01-09 23:02:21,765 - micro - MainProcess - INFO     Number of Frames: 838880 (utils_audio.py:log_audio_characteristics:78)
2024-01-09 23:02:21,767 - micro - MainProcess - INFO     Compression Type: NONE (utils_audio.py:log_audio_characteristics:79)
2024-01-09 23:02:21,769 - micro - MainProcess - INFO     Compression Name: not compressed (utils_audio.py:log_audio_characteristics:80)
2024-01-09 23:02:21,770 - micro - MainProcess - INFO     Bytes Per Second: 32000 (utils_audio.py:log_audio_characteristics:84)
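
The logged values are related: bytes per second is frame rate × sample width × channels (16000 × 2 × 1 = 32000), and the duration is frames ÷ frame rate (838880 ÷ 16000 ≈ 52.4 seconds). The implementation of `log_audio_characteristics` is not shown here, so the sketch below is only an assumption of how the same fields can be read with Python's standard `wave` module; `describe_wav` is a hypothetical helper, and a short generated file stands in for the real audio.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Read WAV header fields and derive the byte rate and duration."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        frame_rate = wav.getframerate()    # frames per second
        frames = wav.getnframes()
    return {
        "channels": channels,
        "sample_width": sample_width,
        "frame_rate": frame_rate,
        "frames": frames,
        "bytes_per_second": frame_rate * sample_width * channels,
        "duration_seconds": frames / frame_rate,
    }


# Write one second of 16 kHz, 16-bit mono silence so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "silence.wav")
    with wave.open(sample_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000)
    info = describe_wav(sample_path)

print(info)
```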


In [12]:
transcriber_client.speech_recognition_with_push_stream(audio_file=AUDIO_FILE_PCM_MONO)

2024-01-09 23:02:21,792 - micro - MainProcess - INFO     SESSION STARTED: SessionEventArgs(session_id=434de908c8084a1796b580789ba72e65) (speech_to_text.py:<lambda>:313)
2024-01-09 23:02:21,802 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:21,915 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:22,028 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:22,143 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)


2024-01-09 23:02:22,252 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:22,376 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:22,490 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:22,619 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:22,748 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:22,976 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-01-09 23:02:23,292 - micro - MainProcess - INFO     Mono data shape: (1600,) (speech_to_text.py:speech_recognition_with_push_stream:352)
2024-0

'The Speech SDK exposes many features from the Speech Service, but not all of them. The capabilities of the Speech SDK are often associated with scenarios. The Speech SDK is ideal for both real time and non real time scenarios using local devices, files, Azure BLOB storage and even input and output streams. When a scenario is not achievable with a Speech SDK, look for a REST API alternative. Speech to text, also known as speech recognition, transcribes audio streams to text that your applications, tools, or devices can consume or display. Use speech to text with language understanding. Louis to derive user intents from transcribed speech and act on voice commands. Use speech translation to translate speech input to a different language with a single call. For more information, see Speech to Text Basics.'
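
The repeated `Mono data shape: (1600,)` log lines suggest that the audio is pushed in chunks of 1600 samples, which at 16 kHz is 100 ms of audio per chunk. The following sketch illustrates that chunking on raw PCM bytes. It is an assumption about the approach, not the code of `speech_recognition_with_push_stream`; `iter_chunks` is a hypothetical helper, and the SDK call is mentioned only in a comment so that the example runs without credentials.

```python
from typing import Iterator

SAMPLE_RATE = 16000   # frames per second, as logged for the audio file
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM)
CHUNK_SAMPLES = 1600  # samples per chunk, matching "Mono data shape: (1600,)"


def iter_chunks(pcm: bytes, chunk_samples: int = CHUNK_SAMPLES,
                sample_width: int = SAMPLE_WIDTH) -> Iterator[bytes]:
    """Yield consecutive fixed-size chunks of raw mono PCM; the last may be shorter."""
    chunk_bytes = chunk_samples * sample_width
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# 0.25 seconds of silence stands in for real audio.
pcm_data = b"\x00\x00" * 4000
chunks = list(iter_chunks(pcm_data))
chunk_duration_ms = 1000 * CHUNK_SAMPLES / SAMPLE_RATE

# In a real run, each chunk would be written to the SDK's push stream
# (for example with PushAudioInputStream.write(chunk)), and the stream
# would be closed once all chunks have been sent.
print(len(chunks), chunk_duration_ms)
```

Small chunks keep latency low because recognition can start before the whole file has been read, at the cost of more write calls.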