# ASR APIs & Models Testing and Comparison
- This notebook shows <b>how to transcribe audio with a variety of Automatic Speech Recognition (ASR) APIs and models</b>  
- ASR, also called speech-to-text (STT), takes an audio stream or audio buffer as input and returns one or more text transcripts, along with optional metadata   
- We will use the following ASR APIs/models:  
-- OpenAI's [Whisper](https://github.com/openai/whisper)  
-- Nvidia's [Riva Conformer](https://github.com/nvidia-riva)  
-- Python package [SpeechRecognition](https://github.com/Uberi/speech_recognition)      
-- [Speechmatics](https://www.speechmatics.com/)    
-- [Deepgram](https://deepgram.com/)   

In [1]:
# Install jiwer for calculating word error rate (WER) for ASR accuracy comparison
# Comment out the line below if jiwer is already installed
!python3 -m pip install jiwer

Collecting jiwer
  Downloading jiwer-3.0.4-py3-none-any.whl (21 kB)
Collecting rapidfuzz<4,>=3 (from jiwer)
  Downloading rapidfuzz-3.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-3.0.4 rapidfuzz-3.9.2


In [2]:
import io, re
import jiwer
import json
import time
import requests
import pandas as pd
import IPython.display as ipd

## Compare All ASRs Like-for-Like By Using An Audio Sample Clip

In [3]:
!wget https://dl.fbaipublicfiles.com/seamlessM4T/LJ037-0171_sr16k.wav -O /content/LJ_eng.wav
filepath = "/content/LJ_eng.wav" #audio_sample
ipd.Audio(filepath)

--2024-05-30 05:42:07--  https://dl.fbaipublicfiles.com/seamlessM4T/LJ037-0171_sr16k.wav
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.162.163.11, 3.162.163.34, 3.162.163.19, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.162.163.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 485430 (474K) [audio/x-wav]
Saving to: ‘/content/LJ_eng.wav’


2024-05-30 05:42:08 (2.39 MB/s) - ‘/content/LJ_eng.wav’ saved [485430/485430]



## 1. Whisper
- OpenAI's Robust Speech Recognition via Large-Scale Weak Supervision   
- They've open sourced the model code: https://github.com/openai/whisper   
- OpenAI has taken a data-centric approach: the model is a standard transformer architecture, but it was trained on a very large and diverse dataset (including multiple languages, **non-English accounts for about 1/3 of the training data!**), so it is very robust, especially in <b>zero-shot</b> settings.   

### Install Whisper Step-by-Step Instructions (Tested & Verified on MacBook)

1.1 create a virtual env., open a command line and type: `conda create -n vva python=3.9`  
1.2 activate the virtual env: `conda activate vva`  
1.3 create a new folder in your MacBook drive, e.g. `mkdir VVA_AIChatbot`   
1.4 Install the openai whisper:   
As per https://github.com/openai/whisper, the following command will pull and install the latest commit from this repository, along with its Python dependencies:  
`pip install git+https://github.com/openai/whisper.git`     
1.5 install the Jupyter notebook library: `pip install jupyter notebook`  
1.6 In the command line, start the notebook by typing: `jupyter notebook`  
1.7 Click the button `New` -> `Python3` in the upper-right corner of the menu to create a new notebook  
1.8 install/update brew on MacBook:   
`/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`  
Then:   
`brew update-reset`     
  
1.9 change permission:   
You should change the ownership of these directories to your user.  
  `sudo chown -R $(whoami) /usr/local/share/zsh /usr/local/share/zsh/site-functions`  
  
And make sure that your user has write permission.  
  `chmod u+w /usr/local/share/zsh /usr/local/share/zsh/site-functions`  

1.10 install ffmpeg:   
`brew install ffmpeg`  

1.11 install the Rust build helper `setuptools-rust` as well:  
`pip install setuptools-rust`  

1.12 copy & paste the code from the repo into the notebook and run it to transcribe any audio/video file with Whisper  

Done!  

### Install Whisper on CentOS (CentOS is the open-source counterpart of Red Hat Enterprise Linux)
- Follow the GitHub set-up steps:   https://github.com/openai/whisper
- Install ffmpeg on CentOS:   
`sudo yum install epel-release`  
`sudo yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm`  
`sudo yum install ffmpeg ffmpeg-devel`  
`ffmpeg -version`  

### Run whisper in command line

- The default setting (which selects the 'small' model) works well for transcribing English  
- If you see a "CUDA out of memory" error from PyTorch, use the 'tiny' model instead (see https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch):  
`whisper audio_sample.wav --model tiny`  

- Use `--language` and `--model` to specify the language and the model (by default the language is auto-detected and the small model is used)
- Adding `--task translate` will translate the transcribed speech into English (i.e. **Speech Translation**):   
`whisper japanese.wav --language Japanese --model base --task translate`  
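When many files need transcribing, the same command line can be assembled from Python. The sketch below only builds the argument list from the flags shown above (`--model`, `--language`, `--task`). The helper name `build_whisper_command` is hypothetical, and actually running the command assumes the `whisper` CLI and ffmpeg are installed.

```python
import shlex

def build_whisper_command(audio_path, model=None, language=None, task=None):
    """Build the argument list for the `whisper` CLI (hypothetical helper)."""
    cmd = ["whisper", audio_path]
    if model:
        cmd += ["--model", model]
    if language:
        cmd += ["--language", language]
    if task:
        cmd += ["--task", task]
    return cmd

cmd = build_whisper_command("japanese.wav", model="base",
                            language="Japanese", task="translate")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.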

### Run whisper programmatically in Python

In [18]:
filepath="/content/OSR_us_000_0010_8k.wav"
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg

In [75]:

import whisper
model = whisper.load_model("small")
start = time.time()
result = model.transcribe(filepath)
end = time.time()
whisper_latency = end-start
whisper_transcript = result["text"]
print(f'Whisper transcript:\n{whisper_transcript}')
print(f'Whisper latency: {whisper_latency} seconds')



Whisper transcript:
 The birch canoes slid on the smooth planks. Glue the sheet to the dark blue background. It is easy to tell the depth of a well. These days a chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch. The box was thrown beside the parked truck. The hogs were fed chopped corn and garbage. Four hours of study work faced us. A large size in stockings is hard to sell.
Whisper latency: 41.72216200828552 seconds


#### Below is an example usage of `whisper.detect_language()` and `whisper.decode()` which provide lower-level access to the model.

In [22]:
import whisper

model = whisper.load_model("base")
start = time.time()
# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio(filepath)
audio = whisper.pad_or_trim(audio)
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
# print the recognized text
print(result.text)
end = time.time()
whisper_latency = end-start

print(f'Whisper latency: {whisper_latency} seconds')

Detected language: en
The birch canoes lid on the smooth planks. Glue the sheet to the dark blue background. It is easy to tell the depths of a well. These days the chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch. The box was thrown beside the pork truck. The hogs are fed chopped corn and garbage. Four hours of study work faced us.
Whisper latency: 135.59844779968262 seconds
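The lower-level path decodes a single 30-second window, so anything past that point is dropped by `pad_or_trim`. That is a likely reason the final sentence is missing from the output above. The sketch below is a simplified stand-in that works on a plain Python list; it is not Whisper's own implementation, and the clip lengths are made up for illustration.

```python
SAMPLE_RATE = 16000      # Whisper loads audio resampled to 16 kHz
CHUNK_SECONDS = 30       # window handled by a single decode() call
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Simplified stand-in for whisper.pad_or_trim on a list of samples."""
    if len(samples) > length:
        return samples[:length]                        # drop everything past the window
    return samples + [0.0] * (length - len(samples))   # zero-pad short clips

long_clip = [0.1] * (SAMPLE_RATE * 34)   # hypothetical 34-second clip
short_clip = [0.1] * (SAMPLE_RATE * 5)   # hypothetical 5-second clip
print(len(pad_or_trim_sketch(long_clip)) / SAMPLE_RATE)    # seconds kept
print(len(pad_or_trim_sketch(short_clip)) / SAMPLE_RATE)   # seconds after padding
```

For clips longer than 30 seconds, `model.transcribe()` as used in the earlier cell processes the whole file.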


## 2. Riva Models

- Speech recognition in Riva is a <b>GPU-accelerated</b> compute pipeline, with optimized performance and accuracy.  
- Riva provides SOTA (state-of-the-art) and OOTB (out-of-the-box) models and pipelines for multiple languages, like English, Spanish, German, Russian and Mandarin, that can be easily deployed with the [Riva Speech AI Skills Quick Start Guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html). **Note: you need to deploy Riva locally and have a Riva server up and running on a GPU before trying this part of the notebook for ASR.**
- Riva also supports easy customization of the ASR pipeline, in various ways, to meet your specific needs.  
- Refer to the [Riva ASR documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html) for more information.  
- Now, let's generate the transcripts using Riva APIs, for some sample audio clips, with an OOTB pipeline, starting with English.  

### Requirements and setup
1. Deploy & start the Riva Speech AI Skills server.  
Follow the instructions in the [Riva Quick Start Guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html#) to <b>deploy the pretrained ASR models (e.g. Conformer) on a local machine (e.g. a GPU server or a laptop with GPU) and run a sample client before trying this Riva part of the hands-on notebook</b>. By default, only the English models are deployed.  
`bash riva_init.sh`  
`bash riva_start.sh`

2. Install the Riva Client library.   
Follow the steps in the [Requirements and setup for the Riva Client](https://github.com/nvidia-riva/tutorials#running-the-riva-client) to install the Riva Client library.  

### Transcription for English in either streaming mode or offline batch mode

You can use Riva ASR in either streaming mode or offline mode. In streaming mode, a continuous stream of audio is captured and recognized, producing a stream of transcribed text. In offline mode, an audio clip of a set length is transcribed to text. <br>
Let's look at an example showing offline ASR API usage for English:

### Run Riva to transcribe audio in command line
- For data center (x86_64), start a container with sample clients for each service  
`bash riva_start_client.sh`  
- For offline recognition, run:  
`riva_asr_client --audio_file=/opt/riva/wav/en-US_sample.wav`
- For streaming recognition, run:  
`riva_streaming_asr_client --audio_file=/opt/riva/wav/en-US_sample.wav`
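To make the offline/streaming distinction concrete: offline mode sends the whole clip in one request, while a streaming client feeds the audio to the server in small pieces. The sketch below shows only the chunking step on a byte buffer. It does not call the Riva API, and the chunk size of 4096 bytes is an arbitrary assumption.

```python
def iter_chunks(audio_bytes, chunk_size=4096):
    """Yield successive fixed-size chunks of a byte buffer (the last may be shorter)."""
    for offset in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[offset:offset + chunk_size]

audio = bytes(10000)   # placeholder buffer standing in for raw PCM audio
chunks = list(iter_chunks(audio))
print(len(chunks), [len(c) for c in chunks])
```

Smaller chunks reduce the delay before the first partial transcript at the cost of more requests.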


The Riva server could not be set up in Docker, so the Riva cells below were not tested.

In [None]:
pip install nvidia-riva-client

Collecting nvidia-riva-client
  Downloading nvidia_riva_client-2.15.1-py3-none-any.whl (41 kB)
Collecting grpcio-tools (from nvidia-riva-client)
  Downloading grpcio_tools-1.64.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-tools->nvidia-riva-client)
  Downloading protobuf-5.27.0-cp38-abi3-manylinux2014_x86_64.whl (309 kB)
Installing collected packages: protobuf, grpcio-tools, nvidia-riva-client
  Attempting uninstall: protobuf
    Found exist

### Run Riva programmatically in Python

In [None]:
#import riva client libraries
import grpc
import riva.client

#### Create a Riva client and connect to the Riva Speech API server
The following URI assumes a local deployment of the Riva Speech API server is on the default port. In case the server deployment is on a different host or via a Helm chart on Kubernetes, use an appropriate URI.

In [None]:
auth = riva.client.Auth(uri='localhost:50051')

riva_asr = riva.client.ASRService(auth)

#### Make a gRPC request to the Riva Speech API server
The Riva ASR API supports `.wav` files in pulse-code modulation (PCM) format, as well as `.alaw`, `.mulaw`, and `.flac` formats, all with a single audio channel.

Now, let's make a gRPC request to the Riva Speech server for ASR with a sample `.wav` file in offline mode. Start by loading the audio.
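Before sending a file, it can be worth confirming that it is single-channel PCM and checking its sample rate. The sketch below uses only the standard-library `wave` module, which reads PCM `.wav` files. The helper name `describe_wav` is hypothetical, and the demo writes its own one-second silent file so that it runs anywhere.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Return (channels, sample_rate_hz, sample_width_bytes, duration_s) of a PCM .wav file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        width = wf.getsampwidth()
        duration = wf.getnframes() / rate
    return channels, rate, width, duration

# Self-contained demo: write one second of 8 kHz, 16-bit mono silence, then inspect it
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(8000)
    wf.writeframes(b"\x00\x00" * 8000)

info = describe_wav(demo_path)
print(info)
```

Calling `describe_wav(filepath)` on the notebook's sample gives the values to compare against `audio_channel_count` and `sample_rate_hertz` in the recognition config.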

In [None]:
# Set up an offline/batch recognition request
config = riva.client.RecognitionConfig()
#config.encoding = riva.client.AudioEncoding.LINEAR_PCM    # Audio encoding can be detected from wav
#config.sample_rate_hertz = 0                     # Sample rate can be detected from wav and resampled if needed
config.language_code = "en-US"                    # Language code of the audio clip
config.max_alternatives = 1                       # How many top-N hypotheses to return
config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
config.audio_channel_count = 1                    # Mono channel

# Alternatively, 2nd option for creating RecognitionConfig:
#config = riva.client.RecognitionConfig(
#  language_code= "en-US", #"en-GB"
#  max_alternatives=1,
#  enable_automatic_punctuation=True,
#  audio_channel_count = 1
#)

In [None]:
# Use riva to transcribe the given audio sample file
import io
filepath="/content/OSR_us_000_0010_8k.wav"
with io.open(filepath, 'rb') as fh:
    content = fh.read()
start = time.time()
response = riva_asr.offline_recognize(content, config)
riva_transcript = response.results[0].alternatives[0].transcript
end = time.time()
riva_latency = end-start
print(f'Riva transcript:\n{riva_transcript}')
print(f'Riva latency: {riva_latency} seconds')

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051: Failed to connect to remote host: Connection refused", grpc_status:14, created_time:"2024-05-24T05:51:16.439075824+00:00"}"
>

In [None]:
# Install NVIDIA Riva client
!pip install nvidia-riva-client

# Import necessary libraries
import io
from google.colab import files
import riva.client

# Upload the .wav file
uploaded = files.upload()

# Get the first uploaded file name
filename = next(iter(uploaded))

# Read the uploaded file
with open(filename, 'rb') as f:
    audio_data = f.read()

# Configure Riva client
auth = riva.client.Auth(uri='your_riva_server_uri', use_ssl=False)  # Update with your Riva server URI
riva_asr = riva.client.ASRService(auth)

# Configure recognition request
config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,  # Ensure this matches your .wav file's sample rate
    language_code='en-US'
)

# Perform recognition
response = riva_asr.offline_recognize(bytes(audio_data), config)
riva_transcript = response.results[0].alternatives[0].transcript

print("Transcribed text:", riva_transcript)




Saving OSR_us_000_0010_8k.wav to OSR_us_000_0010_8k (1).wav


_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "DNS resolution failed for your_riva_server_uri: C-ares status is not ARES_SUCCESS qtype=A name=your_riva_server_uri is_balancer=0: Domain name not found"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-05-24T05:51:58.719937652+00:00", grpc_status:14, grpc_message:"DNS resolution failed for your_riva_server_uri: C-ares status is not ARES_SUCCESS qtype=A name=your_riva_server_uri is_balancer=0: Domain name not found"}"
>

## 3. Open-Source Python Package SpeechRecognition
- It is an open-source Python library for performing speech recognition, with support for several engines and APIs, online and offline, including:   
-- CMU Sphinx (works offline)  
-- Google Cloud Speech API  
-- Wit.ai  
-- Microsoft Azure Speech  
-- Microsoft Bing Voice Recognition (Deprecated)  
-- Houndify API  
-- IBM Speech to Text  
-- Snowboy Hotword Detection (works offline)  
-- Tensorflow  
-- Vosk API (works offline)  
-- OpenAI Whisper (works offline)  
-- Whisper API  

In [6]:
#Install the library
!pip install SpeechRecognition

import speech_recognition as sr

Collecting SpeechRecognition
  Downloading SpeechRecognition-3.10.4-py2.py3-none-any.whl (32.8 MB)
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.10.4


In [7]:
!pip install pyttsx3


Collecting pyttsx3
  Downloading pyttsx3-2.90-py3-none-any.whl (39 kB)
Installing collected packages: pyttsx3
Successfully installed pyttsx3-2.90


In [8]:
!apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0
!pip install PyAudio

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libasound2-dev is already the newest version (1.2.6.1-1ubuntu1).
Suggested packages:
  portaudio19-doc
The following NEW packages will be installed:
  libportaudio2 libportaudiocpp0 portaudio19-dev
0 upgraded, 3 newly installed, 0 to remove and 53 not upgraded.
Need to get 188 kB of archives.
After this operation, 927 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudio2 amd64 19.6.0-1.1 [65.3 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudiocpp0 amd64 19.6.0-1.1 [16.1 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 portaudio19-dev amd64 19.6.0-1.1 [106 kB]
Fetched 188 kB in 0s (794 kB/s)
Selecting previously unselected package libportaudio2:amd64.
(Reading database ... 121918 files and directories currently installed.)
Preparing to unpack .../libportaudio2_19.6.0-1.1_amd64.deb ...
Unpac

In [35]:
import speech_recognition as sr
from google.colab import files
import time
# Upload an audio file
uploaded = files.upload()

# Use the first uploaded file
filename = next(iter(uploaded))
# Initialize recognizer class (for recognizing the speech)
r = sr.Recognizer()

# Read the audio file as the source and store the captured audio in audio_text
# Note: listen() captures a single phrase and stops at a long enough pause;
# r.record(source) would read the whole file instead
with sr.AudioFile(filename) as source:
    audio_text = r.listen(source)
start = time.time()
# Recognize the speech in the audio
try:
    sr_google_transcript = r.recognize_google(audio_text)
    print("You said:", sr_google_transcript)
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))
end = time.time()
sr_google_latency = end - start
# Print the results
print(f"Latency: {sr_google_latency} seconds")



Saving OSR_us_000_0010_8k.wav to OSR_us_000_0010_8k (1).wav
You said: do birds canoe slid on the smooth planks glue the sheet to the dark blue background it is easy to tell the depth of a well these days a chicken leg is a rare dish rice is often served in round Bowls the juice of lemons makes fine punch the box was thrown beside the park truck the dogs are fed chopped corn and garbage
Latency: 8.920884847640991 seconds


#### Note: see this example script for using the other supported APIs (GCP, MS Azure, IBM, Wit, etc.): https://github.com/Uberi/speech_recognition/blob/master/examples/audio_transcribe.py

## 4. Speechmatics
- Commercial ASR service, not open-source

#### Create functions for using the API

In [23]:
# For Speechmatics API
url = 'https://asr.api.speechmatics.com/v2/jobs/'
SM_API_KEY = 'apikey'
headers = {"Authorization": f"Bearer {SM_API_KEY}"}
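Rather than pasting a key into the notebook, the key can be read from the environment. This is a minimal sketch: the variable name `SPEECHMATICS_API_KEY` is an assumption, not something the Speechmatics API requires, and it must be exported before the notebook starts.

```python
import os

# Hypothetical variable name; set it in the shell before launching the notebook
SM_API_KEY = os.environ.get("SPEECHMATICS_API_KEY", "")
if not SM_API_KEY:
    print("SPEECHMATICS_API_KEY is not set; API requests will likely be rejected")

# Same header shape as in the cell above
headers = {"Authorization": f"Bearer {SM_API_KEY}"}
```

The same pattern applies to any other API key used in this notebook.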

In [24]:
def send_request(file):
    """
    input: file path
    output: response id

    function to send the file to be processed to the Speechmatics API and get the response ID as output
    """
    config = {"type": "transcription", "transcription_config": {"language": "en"}}
    # open the audio in a context manager so the file handle is closed after the upload
    with open(file, 'rb') as audio:
        form = {"config": (None, json.dumps(config)), "data_file": (file, audio)}
        x = requests.post(url, headers=headers, files=form)
    x.raise_for_status()  # fail early on an HTTP error instead of a KeyError below
    response_id = x.json()['id']
    return response_id

In [25]:
def get_transcript(response_id, sleep=10):
    """
    input: response id, seconds to sleep between retries
    output: (lower-cased transcript, number of requests made)

    function to request the transcript with the response ID
    it keeps retrying until the transcript is ready (status code 200)
    - the API returns 404 while the transcript is not ready
    """
    transcript_url = f"https://asr.api.speechmatics.com/v2/jobs/{response_id}/transcript"
    response = requests.get(transcript_url , headers=headers)
    count = 1
    while response.status_code != 200:
        count += 1
        time.sleep(sleep)
        response = requests.get(transcript_url , headers=headers)
    response_test = json.loads(response.text)
    string = ''
    for word in response_test['results']:
        string += f" {word['alternatives'][0]['content']}"
    return string.lower(), count
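The retry loop above never gives up, so a failed job would keep the cell running forever. One way to bound it is a generic polling helper with a deadline, sketched below. The names `poll_until_ready` and `fake_fetch` are hypothetical, and the demo uses a fake job instead of the Speechmatics API.

```python
import time

def poll_until_ready(fetch, interval=10.0, timeout=300.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None value or `timeout` seconds pass.

    fetch: zero-argument callable returning the result, or None while not ready.
    Returns (result, attempts); raises TimeoutError once the deadline would be exceeded.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        result = fetch()
        if result is not None:
            return result, attempts
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"not ready after {attempts} attempts")
        sleep(interval)

# Demo with a fake job that becomes ready on the third check
state = {"calls": 0}

def fake_fetch():
    state["calls"] += 1
    return "transcript text" if state["calls"] >= 3 else None

demo_result = poll_until_ready(fake_fetch, interval=0.01, timeout=5)
print(demo_result)
```

In `get_transcript`, `fetch` could wrap the `requests.get` call and return the parsed JSON on status 200 and `None` on 404.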

In [26]:
start = time.time()
response_id = send_request(filepath)
sm_transcript = get_transcript(response_id)
end = time.time()
sm_latency = end-start
print(f'Speechmatics transcript:\n{sm_transcript}') #a tuple (transcript, retries)
print(f'Speechmatics latency: {sm_latency} seconds')

Speechmatics transcript:
(' the birch canoe slid on the smooth planks . glue the sheet to the dark blue background . it is easy to tell the depth of a well . these days , a chicken leg is a rare dish . rice is often served in round bowls . the juice of lemons makes a fine punch . the box was thrown beside the park truck . the hogs were fed chopped corn and garbage . four hours of steady work faced us . a large size in stockings is hard to sell .', 2)
Speechmatics latency: 12.15500783920288 seconds


## 5. Deepgram
- Another commercial specialist in advanced speech recognition services

In [27]:
# Install the Deepgram SDK
!python3 -m pip install deepgram-sdk

Collecting deepgram-sdk
  Downloading deepgram_sdk-3.2.7-py3-none-any.whl (80 kB)
Collecting httpx>=0.25.2 (from deepgram-sdk)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting websockets>=12.0 (from deepgram-sdk)
  Downloading websockets-12.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (130 kB)
Collecting dataclasses-json>=0.6.3 (from deepgram-sdk)
  Downloading dataclasses_json-0.6.6-py3-none-any.whl (28 kB)
Collecting aiofiles>=23.2.1 (from deepgram-sdk)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting verboselogs>=1.7 (from 

In [28]:
from deepgram import DeepgramClient, PrerecordedOptions
import asyncio, json
from datetime import datetime

In [29]:
DG_API_KEY = 'apikey' # API Key generated from Deepgram
MIMETYPE = 'audio/wav'

In [66]:
import json

deepgram = DeepgramClient(DG_API_KEY)
start = time.time()
with open('/content/OSR_us_000_0010_8k.wav', 'rb') as buffer_data:
    payload = {'buffer': buffer_data}

    options = PrerecordedOptions(
        smart_format=True, model="nova-2", language="en-US"
    )

    response = deepgram.listen.prerecorded.v('1').transcribe_file(payload, options)
    json_response = response.to_json(indent=4)
    # print(json_response)

    # Parse JSON response into a dictionary
    response_dict = json.loads(json_response)

    # Access the "results" section
    results = response_dict.get("results", {})

    # Access the "channels" list
    channels = results.get("channels", [])

    # Assuming there's only one channel, you can access its first element
    if channels:
        channel = channels[0]

        # Access the first alternative
        first_alternative = channel.get("alternatives", [])[0]

        # Access the transcript and confidence
        transcript = first_alternative.get("transcript", "")
        dg_transcript = transcript
        confidence = first_alternative.get("confidence", "")
        end = time.time()
        # Access the latency information
        dg_latency =end-start

        print("Transcription:", dg_transcript)
        print("Confidence:", confidence)
        print("Latency:", dg_latency)


Transcription: The birch canoe slid on the smooth planks. Glue the sheet to the dark blue background. It is easy to tell the depth of a well. These days, a chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch. The box was thrown beside the parked truck. The hogs were fed chopped corn and garbage. 4 hours of steady work faced us. A large size in stockings is hard to sell.
Confidence: 0.9952251
Latency: 1.0340681076049805
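The parsing in the cell above can be folded into a small helper that also copes with an empty `alternatives` list, where indexing `[0]` would raise an `IndexError`. This is a sketch: `first_transcript` is a hypothetical name, and the sample dictionary is hand-written to mirror the nesting the cell walks through (`results` -> `channels` -> `alternatives`), not a real API response.

```python
def first_transcript(response_dict):
    """Return (transcript, confidence) of the first alternative of the first channel.

    Falls back to ("", None) when the expected keys or list entries are missing.
    """
    channels = response_dict.get("results", {}).get("channels", [])
    if not channels:
        return "", None
    alternatives = channels[0].get("alternatives", [])
    if not alternatives:
        return "", None
    best = alternatives[0]
    return best.get("transcript", ""), best.get("confidence")

# Hand-written sample with the same nesting as the parsed JSON response
sample = {"results": {"channels": [{"alternatives": [
    {"transcript": "The birch canoe slid on the smooth planks.",
     "confidence": 0.9952251}
]}]}}
print(first_transcript(sample))
print(first_transcript({}))
```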


### Deepgram Transcription with Enhanced Tier

## A Like-for-like Transcription Accuracy Comparison Using WER (Word Error Rate)

In [33]:
# A utility function to strip punctuation for a like-for-like comparison
def clean_transcript(transcript):
    tmp = re.sub(r'[^a-zA-Z0-9_\']', ' ', transcript) # replace every non-word symbol except ' with a space, e.g. keep don't, i'll
    transcript_ok = re.sub(' +', ' ', tmp.lower().strip()) # lower-case, strip leading/trailing whitespace and collapse repeated spaces
    return transcript_ok
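A few inputs show what this normalization does and does not handle. The function is repeated here so the snippet runs on its own, and the input strings are illustrative. Note that digits are kept as-is, so "4 hours" and "four hours" still count as different words, and a curly apostrophe (as in the golden template's "men’s") is replaced by a space.

```python
import re

def clean_transcript(transcript):
    tmp = re.sub(r'[^a-zA-Z0-9_\']', ' ', transcript)
    return re.sub(' +', ' ', tmp.lower().strip())

print(clean_transcript("These days, a chicken leg is a rare dish."))
print(clean_transcript("Don't  stop, I'll GO!"))
print(clean_transcript("men’s motives"))   # curly apostrophe is not preserved
print(clean_transcript("4 hours of steady work") ==
      clean_transcript("Four hours of steady work"))
```

If number spelling matters for the comparison, an extra normalization step would be needed before computing WER.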

In [59]:
whisper_transcript_ok = clean_transcript(whisper_transcript)
#riva_transcript_ok= clean_transcript(riva_transcript)
sr_google_transcript_ok = clean_transcript(sr_google_transcript)
sm_transcript_ok= clean_transcript(sm_transcript[0])
dg_transcript_ok= clean_transcript(dg_transcript)
#dg_transcript2_ok= clean_transcript(dg_transcript2)

### Golden Template
We have created the golden template for this audio sample to do comparison and assessment.

In [60]:
golden_template = """To Sherlock Holmes she is always THE woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer, excellent for drawing the veil from men’s motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."""
golden_template_ok = clean_transcript(golden_template)

### ASR latency comparison per model/api

In [74]:
latency_dict = {'API':['whisper','speechrecognition','speechmatics', 'deepgram'],
            'latency':[whisper_latency,sr_google_latency, sm_latency, dg_latency]}
df_latency = pd.DataFrame(latency_dict)
df_latency

| | API | latency |
|---|---|---|
| 0 | whisper | 135.598448 |
| 1 | speechrecognition | 8.920885 |
| 2 | speechmatics | 12.155008 |
| 3 | deepgram | 1.034068 |
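Each engine above is timed with its own `start`/`end` bookkeeping, which makes it easy to include different amounts of work by accident. A small helper keeps the measurement uniform. This is a sketch: `timed` and `fake_transcribe` are hypothetical names, and the demo times a stand-in function rather than a real engine.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Demo with a stand-in for a transcription call
def fake_transcribe(path):
    time.sleep(0.05)
    return f"transcript of {path}"

text, seconds = timed(fake_transcribe, "sample.wav")
print(text, f"{seconds:.3f} s")
```

For example, `result, whisper_latency = timed(model.transcribe, filepath)` would time only the transcription call and not the model loading.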


### ASR accuracy comparison - calculate WER per each model/api

In [73]:
#whisper
wer_whisper = jiwer.wer(whisper_transcript_ok, golden_template_ok)
# Riva
#wer_riva_conformer = jiwer.wer(riva_transcript_ok, golden_template_ok)
# SR Google
wer_sr_google = jiwer.wer(sr_google_transcript_ok, golden_template_ok)
# Speechmatics
wer_speechmatics = jiwer.wer(sm_transcript_ok, golden_template_ok)
# Deepgram
wer_deepgram = jiwer.wer(dg_transcript_ok, golden_template_ok)
# Deepgram enhancement
#wer_deepgram2 = jiwer.wer(dg_transcript2_ok, golden_template_ok)

wer_dict = {'API':['whisper','speechrecognition','speechmatics', 'deepgram'],
            'WER':[wer_whisper, wer_sr_google, wer_speechmatics, wer_deepgram]}
df_wer = pd.DataFrame(wer_dict)
df_wer

| | API | WER |
|---|---|---|
| 0 | whisper | 2.382716 |
| 1 | speechrecognition | 3.0 |
| 2 | speechmatics | 2.341463 |
| 3 | deepgram | 2.382716 |
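WER is the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. The sketch below is a plain-Python illustration of that definition, not jiwer's implementation; it follows the same argument order as `jiwer.wer(reference, hypothesis)`, and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "the box was thrown beside the parked truck"
hypothesis = "the box was thrown beside the park truck"
print(word_error_rate(reference, hypothesis))   # 1 substitution over 8 words
print(word_error_rate("four hours", "four hours of steady work faced us"))
```

The second call shows that WER can exceed 1 when the hypothesis is much longer than the reference, which is worth keeping in mind when reading values such as those in the table above.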


### Observation
#### Latency
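- From the table above, `deepgram` is the fastest ASR service in this run (about 1 second), followed by `speechrecognition` (about 9 seconds) and `speechmatics` (about 12 seconds, which includes one 10-second polling sleep in `get_transcript`); `whisper`, which runs the model inside the notebook, is the slowest (about 136 seconds)   

#### Accuracy
- By WER, `Whisper`, `deepgram` and `Speechmatics` score best here, and `speechrecognition` scores worst   
- However, every WER above is greater than 1, which points to a problem with the comparison rather than with the models: the golden template is a different passage from the sentences in the transcripts, and `jiwer.wer` takes the reference as its first argument and the hypothesis as its second, the reverse of the order used above. These values should not be read as real error rates   
- Note also that only a single audio sample was used, and the audio is relatively easy (native English speaker), so this is not a comprehensive comparison and no solid conclusion can be drawn from it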

## 6. SeamlessM4T

In [1]:
!pip install fairseq2
!pip install pydub sentencepiece
!pip install git+https://github.com/facebookresearch/seamless_communication.git

Collecting fairseq2
  Downloading fairseq2-0.2.1-py3-none-any.whl (191 kB)
Collecting fairseq2n==0.2.1 (from fairseq2)
  Downloading fairseq2n-0.2.1-cp310-cp310-manylinux2014_x86_64.whl (2.3 MB)
Collecting jiwer~=3.0 (from fairseq2)
  Downloading jiwer-3.0.4-py3-none-any.whl (21 kB)
Collecting overrides~=7.3 (from fairseq2)
  Downloading overrides-7.7.0-py3-none-any.whl (17 kB)
Collecting packaging~=23.1 (from fairseq2)
  Downloading packaging-23.2-py3-none-any.whl (53 kB)
Collecting sacrebleu~=2.3 (from fairseq2)
  Downloading sacrebleu-2.4.2-py3-none-any.whl (106 kB)

In [2]:
import io
import json
import matplotlib as mpl
import matplotlib.pyplot as plt
import mmap
import numpy
import soundfile
import torch
from collections import defaultdict
from IPython.display import Audio, display
from pathlib import Path
from pydub import AudioSegment
from seamless_communication.inference import Translator
from seamless_communication.streaming.dataloaders.s2tt import SileroVADSilenceRemover

In [3]:
# Initialize a Translator object with a multitask model, vocoder on the GPU.

model_name = "seamlessM4T_v2_large"
vocoder_name = "vocoder_v2" if model_name == "seamlessM4T_v2_large" else "vocoder_36langs"

translator = Translator(
    model_name,
    vocoder_name,
    device=torch.device("cuda:0"),
    dtype=torch.float16,
)

Downloading the checkpoint of seamlessM4T_v2_large...
100%|██████████| 8.45G/8.45G [01:24<00:00, 107MB/s]
Downloading the tokenizer of seamlessM4T_v2_large...
100%|██████████| 360k/360k [00:00<00:00, 36.5MB/s]
Downloading the tokenizer of seamlessM4T_v2_large...
100%|██████████| 4.93M/4.93M [00:00<00:00, 191MB/s]
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Downloading the checkpoint of vocoder_v2...
100%|██████████| 160M/160M [00:00<00:00, 234MB/s]


In [22]:
tgt_langs = ("eng", "hin")
for tgt_lang in tgt_langs:
  in_file = f"/content/OSR_us_000_0010_8k.wav"


  text_output, _ = translator.predict(
        input=in_file,
        task_str="asr",
        tgt_lang=tgt_lang,
    )
  print(f"Transcribed text in {tgt_lang}: {text_output[0]}")
  print()


Transcribed text in eng: the birch canoe is smooth on the smooth planks blue of the sea to a dark blue background it is easy to tell the depth of a well these days a city made is a rare dish rice is often served in round bowls the juice of lemons made fine punch the box was thrown beside the punch chute the hot springs hot corn and garbage four hours of steady work fisters a large size stockings is hard to sell

Transcribed text in hin: इन दिनों सिटी मेड एक दुर्लभ व्यंजन है। चावल अक्सर गोल कटोरे में परोसा जाता है। नींबू का रस ठीक पंच में बनाया जाता है। बॉक्स को पंच चट के बगल में फेंक दिया जाता है।

