# TUTORIAL AI Endpoints - Speaker Diarization with ASR models

*This tutorial introduces the concept of speaker diarization and explains how to use it with [AI Endpoints](https://endpoints.ai.cloud.ovh.net/).*

![ASR](./asr_diarization_tutorial.png)

## Concept

To better understand the **diarization** feature, let’s start with the concept of ASR.

### AI Endpoints in a few words

**AI Endpoints** is a serverless platform powered by OVHcloud and designed for developers. Its aim is to let developers enhance their applications with AI APIs, whatever their level and without needing AI expertise.

It offers a curated catalog of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.

AI Endpoints provides [access to advanced AI models](https://endpoints.ai.cloud.ovh.net/catalog), including Large Language Models (LLMs), Natural Language Processing, translation, and Speech Recognition.

### Transcribe audio using ASR

**Automatic Speech Recognition** (ASR) technology, also known as **Speech-To-Text**, is the process of converting spoken language into written text.

This process consists of several stages, including preparing the speech signal, extracting features, creating acoustic models, developing language models, and utilizing speech recognition engines.
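To make the "extracting features" stage more concrete, here is a minimal sketch that cuts a signal into short overlapping frames and computes one log-energy value per frame. It is a toy illustration only: the function name `frame_log_energies`, the 25 ms frame and 10 ms hop are assumptions chosen for the example, and production ASR models rely on much richer features.

```python
import math

def frame_log_energies(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a mono signal into overlapping frames and return one log-energy per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        energies.append(math.log(energy + 1e-10))   # small offset avoids log(0) on silence
    return energies

# Synthetic input: 0.5 s of silence followed by 0.5 s of a 440 Hz tone
silence = [0.0] * 8000
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(8000)]
features = frame_log_energies(silence + tone)
```

The silent frames produce very low values and the tone frames produce values close to zero, which is the kind of contrast later stages can build on.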

With AI Endpoints, ASR is easier to use thanks to ready-to-use inference APIs, and it can transcribe approximately 100 languages.

**And what about the speaker diarization?**

Speaker diarization is an ASR process that answers the question **"who spoke when"** in a conversation or an audio recording. It partitions the recording into segments according to who is speaking, which helps to organize and analyze conversations and to improve the accuracy of audio transcriptions.
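As a rough illustration of "who spoke when", a diarized recording can be pictured as a list of segments, each tagged with a speaker index. The sketch below sums the speaking time per speaker. The `start` and `end` fields and their values are assumptions made up for this example; only `speaker` and `text` appear in the API responses used later in this tutorial.

```python
from collections import defaultdict

# Hypothetical diarization result: "start" and "end" (in seconds) are illustrative assumptions
segments = [
    {"speaker": 0, "start": 0.0, "end": 1.2, "text": "Where is Bryan?"},
    {"speaker": 1, "start": 1.4, "end": 3.0, "text": "Bryan is in the kitchen."},
    {"speaker": 0, "start": 3.5, "end": 6.0, "text": "Where is Jenny, the sister of Bryan?"},
]

def speaking_time(segments):
    """Return the total speaking time in seconds for each speaker index."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)

talk_time = speaking_time(segments)
```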

### Step 1 - Install dependencies

In [None]:
!pip install python-dotenv openai pydub

### Step 2 - Set up your environment

- Import Python libraries

In [8]:
import os
import IPython.display as ipd
from pydub import AudioSegment
from dotenv import load_dotenv
import requests

- Create a `.env` file to store the AI Endpoints environment variables (`ASR_ENDPOINT` and `OVH_AI_ENDPOINTS_ACCESS_TOKEN`, which the next cell reads)

*⚠️ Test AI Endpoints and get your free token <`ai-endpoints-api-token`> [here](https://endpoints.ai.cloud.ovh.net/)*

- Load environment variables

In [5]:
# Load the environment variables defined in the .env file
load_dotenv()
asr_endpoint = os.getenv("ASR_ENDPOINT")
ai_endpoint_token = os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")
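Since `os.getenv` silently returns `None` when a variable is missing, it can help to fail early with a clear message. The helper below is a small sketch of that idea; `require_env` is a hypothetical name, not part of `python-dotenv` or AI Endpoints.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}. Check your .env file.")
    return value

# Possible usage after load_dotenv():
# ai_endpoint_token = require_env("OVH_AI_ENDPOINTS_ACCESS_TOKEN")
```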

### Step 3 - Process your input audio

In [6]:
audio_filename = "audio_asr_diarization.wav"

# Load the audio; pydub/ffmpeg detect the input format when none is forced
audio_input = AudioSegment.from_file(f"/workspace/{audio_filename}")
# Convert to mono, 16 kHz
process_audio_to_wav = audio_input.set_channels(1)
process_audio_to_wav = process_audio_to_wav.set_frame_rate(16000)

# Export the processed audio as WAV (the file name already ends in .wav)
audio_processed = f"/workspace/output_{audio_filename}"
process_audio_to_wav.export(audio_processed, format="wav")

<_io.BufferedRandom name='/workspace/output_audio_asr_diarization.wav'>

In [7]:
# open and read audio file
with open(audio_processed, 'rb') as fh:
    audio = fh.read()
ipd.Audio(audio_processed)
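To double-check that the exported file really is mono at 16 kHz, the standard-library `wave` module can read the WAV header. This is an optional sanity check sketched under the assumption that the export is a plain PCM WAV file; `wav_properties` is a hypothetical helper name.

```python
import wave

def wav_properties(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration

# Possible usage with the file exported above:
# channels, sample_rate, duration = wav_properties(audio_processed)
# A correctly processed file should report 1 channel and a 16000 Hz sample rate.
```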

### Step 4 - Transcribe audio into text using basic ASR

- Set up the ASR endpoint

In [13]:
# Whisper API endpoint (OVH deployment example)
url = "https://whisper-large-v3.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1/audio/transcriptions"

# Path to the audio file processed in Step 3
audio_file_path = audio_processed

- Configure request parameters **without diarization**

In [49]:
# Authentication headers
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {ai_endpoint_token}",
}

# Transcription parameters
data = {
    "model": "whisper-large-v3",
    "temperature": "0.0",
    "timestamp_granularities": "segment",
    # Prompt that encourages punctuation and helps the model spell "AI Endpoints" correctly
    "prompt": "Please, use punctuation when translating this conversation about AI Endpoints.",
}

- Test audio recognition

In [50]:
# Open audio file in binary mode
with open(audio_file_path, "rb") as audio_file:
    files = {"file": audio_file}
    
    # Make API request
    response = requests.post(url, headers=headers, files=files, data=data)

# Handle response
if response.status_code == 200:
    result = response.json()
    print("ASR transcript WITHOUT Speaker Diarization:\n\n", result["text"])
else:
    print("Error:", response.status_code, response.text)

ASR transcript WITHOUT Speaker Diarization:

  Where is Bryan? Bryan is in the kitchen. Where is Jenny, the sister of Bryan? Jenny is in the bathroom. Do you know where I can find an ASR model? Yes, of course. You can find it on AI Endpoint website.


### Step 5 - Add Speaker Diarization to config

- Enable diarization in the transcription parameters:

In [63]:
# Transcription parameters
data = {
    "model": "whisper-large-v3",
    "temperature": "0.0",
    "timestamp_granularities": "segment",
    # Prompt that encourages punctuation and helps the model spell "AI Endpoints" correctly
    "prompt": "Please, use punctuation when translating this conversation about AI Endpoints.",
    "diarize": "true",  # Enable speaker diarization
}
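Because the parameters for Step 4 and Step 5 differ only by the `diarize` field, one option is to build them with a small helper instead of duplicating the dictionary. This is a sketch, not part of the AI Endpoints API: `build_transcription_params` is a hypothetical name, and the field values simply mirror the ones used in this tutorial.

```python
def build_transcription_params(diarize=False, prompt=None):
    """Build the form fields sent alongside the audio file."""
    params = {
        "model": "whisper-large-v3",
        "temperature": "0.0",
        "timestamp_granularities": "segment",
    }
    if prompt:
        params["prompt"] = prompt
    if diarize:
        # Form fields are sent as strings, hence "true" rather than True
        params["diarize"] = "true"
    return params

data_plain = build_transcription_params()
data_diarized = build_transcription_params(diarize=True)
```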

- Display transcription

In [71]:
# Open audio file in binary mode
with open(audio_file_path, "rb") as audio_file:
    files = {"file": audio_file}
    
    # Make API request
    response = requests.post(url, headers=headers, files=files, data=data)

# Handle response
if response.status_code == 200:
    result = response.json()
    print("ASR transcript WITH Speaker Diarization:")
    for speaker_segment in result["diarization"]:
        color = '\033['+ str(30 + speaker_segment["speaker"]+1) + 'm'
        print(color, speaker_segment["text"], end="")
else:
    print("Error:", response.status_code, response.text)

[31m Where is Bryan?[32m Bryan is in the kitchen.[31m Where is Jenny, the sister of Bryan?[32m Jenny is in the bathroom.[31m Do you know where I can find an ASR model?[32m Yes, of course.[32m You can find it on AI Endpoint website.
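The colors above come from ANSI escape codes: `\033[31m` switches the text to red, `\033[32m` to green, and so on. The sketch below wraps that logic in a helper that also resets the color after each segment, so later output is not tinted. It assumes speaker indices are integers starting at 0, as the arithmetic in the cell above implies; `colorize` is a hypothetical name.

```python
RESET = "\033[0m"  # restores the terminal's default color

def colorize(speaker, text):
    """Wrap text in an ANSI foreground color chosen from the speaker index."""
    # Codes 31-36 are red, green, yellow, blue, magenta, cyan; cycle if there are more speakers
    code = 31 + (speaker % 6)
    return f"\033[{code}m{text}{RESET}"

line = colorize(0, "Where is Bryan?") + " " + colorize(1, "Bryan is in the kitchen.")
print(line)
```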

### Step 6 - Format text to take diarization even further

- Split the transcript into speaker turns as follows

In [80]:
# Track the current speaker and the text accumulated for their turn
current_speaker = None
current_text = []

for speaker_segment in result["diarization"]:
    speaker = speaker_segment["speaker"]

    if speaker == current_speaker:
        # It's the same speaker so we accumulate text
        current_text.append(speaker_segment["text"])
    else:
        # Different speaker, print previous speaker's text if any
        if current_speaker is not None:
            color = '\033[' + str(30 + current_speaker + 1) + 'm'
            merged_text = " ".join(current_text)
            print(color, merged_text + "\n")

        # New speaker
        current_speaker = speaker
        current_text = [speaker_segment["text"]]

# Print the last speaker's text
if current_speaker is not None:
    color = '\033[' + str(30 + current_speaker + 1) + 'm'
    merged_text = " ".join(current_text)
    print(color, merged_text)

[31m Where is Bryan?

[32m Bryan is in the kitchen.

[31m Where is Jenny, the sister of Bryan?

[32m Jenny is in the bathroom.

[31m Do you know where I can find an ASR model?

[32m Yes, of course. You can find it on AI Endpoint website.
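If you would rather keep the merged turns as data instead of printing them, `itertools.groupby` offers a compact alternative, since it groups consecutive items that share the same key. The sketch below assumes each entry carries `speaker` and `text` fields like the entries of `result["diarization"]` used above; `merge_by_speaker` is a hypothetical name and the sample list is hard-coded for illustration.

```python
from itertools import groupby

def merge_by_speaker(segments):
    """Merge consecutive segments spoken by the same speaker into single turns."""
    merged = []
    for speaker, group in groupby(segments, key=lambda seg: seg["speaker"]):
        text = " ".join(seg["text"].strip() for seg in group)
        merged.append({"speaker": speaker, "text": text})
    return merged

# Sample entries mirroring the end of the conversation above
sample = [
    {"speaker": 0, "text": " Do you know where I can find an ASR model?"},
    {"speaker": 1, "text": " Yes, of course."},
    {"speaker": 1, "text": " You can find it on AI Endpoint website."},
]
merged_segments = merge_by_speaker(sample)
```

With real data, `merge_by_speaker(result["diarization"])` would return one entry per speaker turn, ready to be printed, saved, or post-processed.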
