# TUTORIAL AI Endpoints - Speaker Diarization with ASR models

*This tutorial introduces the concept of speaker diarization and explains how to use it with [AI Endpoints](https://endpoints.ai.cloud.ovh.net/).*

![ASR](./asr_diarization_tutorial.png)

## Concept

To better understand the **diarization** feature, let’s start with the concept of ASR.

### AI Endpoints in a few words

**AI Endpoints** is a new serverless platform powered by OVHcloud and designed for developers. It lets developers of any skill level enhance their applications with AI APIs, with no AI expertise required.

It offers a curated catalog of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.

AI Endpoints provides [access to advanced AI models](https://endpoints.ai.cloud.ovh.net/catalog), including Large Language Models (LLMs), Natural Language Processing, translation, and Speech Recognition.

### Transcribe audio using ASR

**Automatic Speech Recognition** (ASR) technology, also known as **Speech-To-Text**, is the process of converting spoken language into written text.

This process consists of several stages, including preparing the speech signal, extracting features, creating acoustic models, developing language models, and utilizing speech recognition engines.
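To make the first two stages more concrete, here is a deliberately simplified sketch of signal preparation (framing) and feature extraction (one log-energy value per frame). The function name, the 25 ms frame and 10 ms hop are assumptions chosen for illustration; production ASR models rely on richer features, and this is not how the AI Endpoints models are implemented.

```python
import math

def frame_log_energy(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a mono signal into overlapping frames and return one log-energy per frame."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000       # 160 samples at 16 kHz
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        features.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return features

# one second of a 440 Hz tone sampled at 16 kHz (synthetic input, for illustration)
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
features = frame_log_energy(tone)
print(len(features), "frames")
```

Each frame becomes one feature value here; a real front end would produce a vector per frame before passing it to the acoustic model.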

With AI Endpoints, ASR is easier to use thanks to ready-to-use inference APIs, available for the following languages:

- `en-US`
- `en-GB`
- `fr-FR`
- `es-US`
- `es-ES`
- `de-DE`
- `it-IT`
- `zh-CN`

**And what about speaker diarization?**

Speaker diarization is an ASR feature that answers the question **"who spoke when"** in a conversation or an audio recording. It partitions the recording into segments according to who is speaking, which helps to organize and analyze conversations and makes transcriptions easier to read.
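The idea of partitioning a recording into speaker segments can be sketched in a few lines. The example below assumes that each word comes with a speaker tag and start/end times in milliseconds; the `Segment` class, the `merge_into_segments` helper and the timings are hypothetical and only illustrate the concept.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: int
    start_ms: int
    end_ms: int

def merge_into_segments(words):
    """Merge (speaker_tag, start_ms, end_ms) tuples, in time order, into speaker segments."""
    segments = []
    for speaker, start_ms, end_ms in words:
        if segments and segments[-1].speaker == speaker:
            segments[-1].end_ms = end_ms          # same speaker keeps talking
        else:
            segments.append(Segment(speaker, start_ms, end_ms))
    return segments

# hypothetical word timings, for illustration only
words = [(2, 0, 300), (2, 300, 450), (2, 450, 900),
         (1, 1200, 1600), (1, 1600, 1750), (1, 1750, 2400)]
segments = merge_into_segments(words)
for seg in segments:
    print(f"Speaker {seg.speaker}: {seg.start_ms} ms -> {seg.end_ms} ms")
```

Each resulting segment answers "who spoke when" for one uninterrupted speaker turn.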

## Technical Implementation

In this tutorial, the [en-US](https://endpoints.ai.cloud.ovh.net/models/0d492510-e5e6-429b-bb1f-de8add9436ca) ASR model is used to explain how diarization works.

### Step 1 - Install dependencies

In [None]:
!pip install python-dotenv nvidia-riva-client pydub 

### Step 2 - Set up your environment

- Import Python libraries

In [1]:
import os
import riva.client
import IPython.display as ipd
from dotenv import load_dotenv
from pydub import AudioSegment

- Create a `.env` file to store the AI Endpoints environment variables (`ASR_ENDPOINT` and `OVH_AI_ENDPOINTS_ACCESS_TOKEN`)

*⚠️ Test AI Endpoints and get your free token <`ai-endpoints-api-token`> [here](https://endpoints.ai.cloud.ovh.net/)*

- Load environment variables

In [None]:
# load the environment variables defined in the .env file
load_dotenv()
asr_endpoint = os.getenv("ASR_ENDPOINT")
ai_endpoint_token = os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")
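Both lookups return `None` when a variable is missing, which only surfaces later as a connection error. A small check such as the sketch below can catch this early. The helper name `missing_env_vars` and the placeholder endpoint value are assumptions for illustration; only the two variable names come from the cell above.

```python
import os

REQUIRED_VARS = ("ASR_ENDPOINT", "OVH_AI_ENDPOINTS_ACCESS_TOKEN")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the variable names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# example with a fake environment (placeholder value, not a real endpoint)
fake_env = {"ASR_ENDPOINT": "asr.example.invalid:443"}
print(missing_env_vars(env=fake_env))
```

In the notebook, calling `missing_env_vars()` right after `load_dotenv()` and stopping if the list is not empty would give a clearer error message.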

### Step 3 - Process your input audio

In [3]:
audio_filename = "audio_asr_diarization.wav"

# load the audio file and let pydub detect the format
# (forcing "mp3" on a .wav file is misleading)
audio_input = AudioSegment.from_file(f"/workspace/{audio_filename}")

# convert to mono, 16 kHz
process_audio_to_wav = audio_input.set_channels(1)
process_audio_to_wav = process_audio_to_wav.set_frame_rate(16000)

# audio_filename already ends with .wav, so no extra extension is appended
audio_processed = f"/workspace/output_{audio_filename}"
process_audio_to_wav.export(audio_processed, format="wav")

<_io.BufferedRandom name='/workspace/output_audio_asr_diarization.wav'>

In [4]:
# open and read audio file
with open(audio_processed, 'rb') as fh:
    audio = fh.read()
ipd.Audio(audio_processed)
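Before sending the file to the endpoint, it can be useful to confirm that the conversion really produced mono audio at 16 kHz. The sketch below uses only the standard `wave` module; the `wav_properties` helper is hypothetical, and the demo writes its own short silent file so that it runs without the tutorial's audio.

```python
import os
import tempfile
import wave

def wav_properties(path):
    """Return (channels, frame_rate, duration_seconds) of a WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration

# self-contained demo: write one second of 16-bit mono silence, then inspect it
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)              # 2 bytes = 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

props = wav_properties(demo_path)
print(props)
```

In the notebook, the same helper could be called on `audio_processed` to check the exported file.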

### Step 4 - Transcribe audio into text using basic ASR

- Connect to the ASR endpoint

In [5]:
# connect to the ASR server
asr_service = riva.client.ASRService(
    riva.client.Auth(
        uri=asr_endpoint,
        use_ssl=True,
        metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]],
    )
)


- Define the ASR model configuration

In [6]:
# set up config
asr_config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
    audio_channel_count=1,
)

- Test audio recognition **without** diarization

In [7]:
# ASR inference
response = asr_service.offline_recognize(audio, asr_config)
print("ASR transcript WITHOUT Speaker Diarization:\n\n", response.results[0].alternatives[0].transcript)

ASR transcript WITHOUT Speaker Diarization:

 Where is Brian? Brian is in the kitchen. Where is Jenny, the sister of Brian? Jenny is in the bathroom. Do you know where I can find an Asr model? Yes, of course. you can find it on Ai Endpoint website. 
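The cell above prints only `response.results[0]`. For this short recording that is the whole transcript, but a response may contain several results. The sketch below concatenates all of them; it assumes the same `results[].alternatives[].transcript` structure used above and runs on stand-in objects rather than a real API response. The `full_transcript` helper is hypothetical.

```python
from types import SimpleNamespace

def full_transcript(response):
    """Concatenate the top alternative of every result in the response."""
    parts = []
    for result in response.results:
        if result.alternatives:                      # skip empty results
            parts.append(result.alternatives[0].transcript.strip())
    return " ".join(parts)

# stand-in objects mimicking the response shape used above (not a real API call)
def _result(text):
    return SimpleNamespace(alternatives=[SimpleNamespace(transcript=text)])

fake_response = SimpleNamespace(results=[
    _result("Where is Brian? "),
    _result("Brian is in the kitchen. "),
    SimpleNamespace(alternatives=[]),
])
print(full_transcript(fake_response))
```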


### Step 5 - Add Speaker Diarization to config

- Enable diarization as follows

In [8]:
riva.client.asr.add_speaker_diarization_to_config(asr_config, diarization_enable=True)

# ASR inference with diarization
response = asr_service.offline_recognize(audio, asr_config)

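- Display the transcription with one color per speaker (`30 + speaker_tag` gives ANSI color 31, red, for speaker 1 and 32, green, for speaker 2)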

In [9]:
print("ASR transcript WITH Speaker Diarization:\n")

for result in response.results:
    for word in result.alternatives[0].words:
        color = '\033['+ str(30 + word.speaker_tag) + 'm'
        print(color, word.word, end="")
        

ASR transcript WITH Speaker Diarization:

[32m Where[32m is[32m Brian?[31m Brian[31m is[31m in[31m the[31m kitchen.[32m Where[32m is[32m Jenny,[32m the[32m sister[32m of[32m Brian?[31m Jenny[31m is[31m in[31m the[31m bathroom.[32m Do[32m you[32m know[32m where[32m I[32m can[32m find[32m an[32m Asr[32m model?[31m Yes,[31m of[31m course.[31m you[31m can[31m find[31m it[31m on[31m Ai[31m Endpoint[31m website.
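The loop above never resets the terminal color, so any text printed afterwards keeps the last speaker's color. A possible variant, sketched below on stand-in word objects, wraps each word in its color and appends the ANSI reset code. The `colorize` helper is hypothetical, and the mapping only covers the standard foreground colors (codes 30 to 37), so it assumes a small number of speakers.

```python
from types import SimpleNamespace

RESET = "\033[0m"

def colorize(words):
    """Return the words joined by spaces, each wrapped in its speaker's ANSI color."""
    pieces = []
    for w in words:
        color = f"\033[{30 + w.speaker_tag}m"        # 31 = red, 32 = green, ...
        pieces.append(f"{color}{w.word}{RESET}")
    return " ".join(pieces)

# stand-in words mimicking the fields used above (word, speaker_tag)
fake_words = [SimpleNamespace(word="Where", speaker_tag=2),
              SimpleNamespace(word="Brian?", speaker_tag=2),
              SimpleNamespace(word="Brian", speaker_tag=1)]
colored = colorize(fake_words)
print(colored)
```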

### Step 6 - Format text to take diarization even further

- Split the transcript into one line per speaker turn, as follows

In [10]:
for result in response.results:
    words = result.alternatives[0].words
    if not words:
        continue

    current_tag = words[0].speaker_tag
    sentence = []
    for word in words:
        # print the accumulated sentence each time the speaker changes
        if word.speaker_tag != current_tag:
            color = '\033[' + str(30 + current_tag) + 'm'
            print(color, f"\nSpeaker {current_tag}:", " ".join(sentence))
            sentence = []
            current_tag = word.speaker_tag
        sentence.append(word.word)

    # print the last speaker turn, including the final word
    color = '\033[' + str(30 + current_tag) + 'm'
    print(color, f"\nSpeaker {current_tag}:", " ".join(sentence))

Speaker 2: Where is Brian?

Speaker 1: Brian is in the kitchen.

Speaker 2: Where is Jenny, the sister of Brian?

Speaker 1: Jenny is in the bathroom.

Speaker 2: Do you know where I can find an Asr model?

Speaker 1: Yes, of course. you can find it on Ai Endpoint website.
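Once each word carries a speaker tag, simple conversation statistics become easy to compute. The sketch below counts words and speaking turns per speaker. It runs on stand-in objects with the same `word` and `speaker_tag` fields as above; the `speaker_stats` helper is hypothetical and is only one possible way to go further.

```python
from collections import Counter
from itertools import groupby
from types import SimpleNamespace

def speaker_stats(words):
    """Return (words_per_speaker, turns_per_speaker) as two Counters."""
    word_counts = Counter(w.speaker_tag for w in words)
    # groupby yields one group per run of consecutive words with the same tag
    turn_counts = Counter(tag for tag, _ in groupby(words, key=lambda w: w.speaker_tag))
    return word_counts, turn_counts

# stand-in words built from the first three turns of the transcript
def _words(tag, text):
    return [SimpleNamespace(word=t, speaker_tag=tag) for t in text.split()]

fake_words = (_words(2, "Where is Brian?")
              + _words(1, "Brian is in the kitchen.")
              + _words(2, "Where is Jenny, the sister of Brian?"))
word_counts, turn_counts = speaker_stats(fake_words)
print(dict(word_counts), dict(turn_counts))
```

In the notebook, the same function could be applied to `result.alternatives[0].words` for each result.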
