# ASR API tutorial

This tutorial demonstrates how to use the Riva Python API for automatic speech recognition (ASR).

## <font color="blue">Server</font>

Before running the Riva client, set up a Riva server. The simplest
way to do this is to follow the
[quick start guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html#local-deployment-using-quick-start-scripts).


## <font color="blue">Authentication</font>

Before using Riva services, you need to establish a connection to the server.

In [1]:
import riva_api

uri = "localhost:50051"  # Default value

auth = riva_api.Auth(uri=uri)
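
If nothing is listening at `uri`, recognition requests cannot succeed, so it can help to check the port first. The sketch below uses only the standard library; `server_is_reachable` is a hypothetical helper, not part of `riva_api`, and it only shows that some process accepts TCP connections on that port, not that the process is a Riva server.

```python
import socket


def server_is_reachable(uri: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to "host:port" can be opened."""
    # Split on the last colon so that the port is always the final part.
    host, _, port = uri.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `server_is_reachable("localhost:50051")` would be expected to return `True` once the quick start scripts have brought the server up.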

## <font color="blue">Setting up service</font>

To instantiate a service, pass a `riva_api.Auth` instance to its constructor.

In [2]:
asr_service = riva_api.ASRService(auth)

For speech recognition, you need to create a recognition config (an instance of `riva_api.RecognitionConfig`).
A detailed description of the config fields is available in the Riva
[documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/riva_asr.proto.html?highlight=max%20alternatives#riva-proto-riva-asr-proto).
If you intend to use streaming recognition, the offline config has to be wrapped in a `riva_api.StreamingRecognitionConfig`.


In [3]:
from copy import deepcopy
offline_config = riva_api.RecognitionConfig(
    encoding=riva_api.AudioEncoding.LINEAR_PCM,
    max_alternatives=1,
    enable_automatic_punctuation=True,
    verbatim_transcripts=False,
)
streaming_config = riva_api.StreamingRecognitionConfig(config=deepcopy(offline_config), interim_results=True)

You also need to set the frame rate and the number of channels of the audio to be processed. For example, to process the file `examples/en-US_AntiBERTa_for_word_boosting_testing.wav`, use the following code:

In [4]:
my_wav_file = '../examples/en-US_AntiBERTa_for_word_boosting_testing.wav'
riva_api.add_audio_file_specs_to_config(offline_config, my_wav_file)
riva_api.add_audio_file_specs_to_config(streaming_config, my_wav_file)
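
The frame rate and channel count are stored in the WAV header. The sketch below reads them with the standard `wave` module from a short silent file that it generates itself, so it runs without the example audio. `read_wav_specs` and `write_silent_wav` are hypothetical helpers for illustration, not the implementation of `riva_api.add_audio_file_specs_to_config()`.

```python
import os
import tempfile
import wave


def write_silent_wav(path, framerate=48000, nchannels=1, sampwidth=2, seconds=0.5):
    """Write a PCM WAV file that contains only silence."""
    nframes = int(framerate * seconds)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(nchannels)
        wav_file.setsampwidth(sampwidth)
        wav_file.setframerate(framerate)
        wav_file.writeframes(b"\x00" * (nframes * nchannels * sampwidth))


def read_wav_specs(path):
    """Read the header values that a recognition config needs."""
    with wave.open(path, "rb") as wav_file:
        return {
            "framerate": wav_file.getframerate(),
            "nchannels": wav_file.getnchannels(),
            "sampwidth": wav_file.getsampwidth(),
            "nframes": wav_file.getnframes(),
        }


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    write_silent_wav(demo_path)
    specs = read_wav_specs(demo_path)

print(specs)
```

In the configs printed later in this tutorial, these values appear as `sample_rate_hertz` and `audio_channel_count`.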

If you intend to use word boosting, use the convenience function `riva_api.add_word_boosting_to_config()` to add boosting parameters to a config.

In [5]:
boosted_lm_words = ['AntiBERTa', 'ABlooper']
boosted_lm_score = 20.0
riva_api.add_word_boosting_to_config(offline_config, boosted_lm_words, boosted_lm_score)
riva_api.add_word_boosting_to_config(streaming_config, boosted_lm_words, boosted_lm_score)

In [6]:
print(offline_config)

encoding: LINEAR_PCM
sample_rate_hertz: 48000
max_alternatives: 1
speech_contexts {
  phrases: "AntiBERTa"
  phrases: "ABlooper"
  boost: 20.0
}
audio_channel_count: 1
enable_automatic_punctuation: true



In [7]:
print(streaming_config)

config {
  encoding: LINEAR_PCM
  sample_rate_hertz: 48000
  max_alternatives: 1
  speech_contexts {
    phrases: "AntiBERTa"
    phrases: "ABlooper"
    boost: 20.0
  }
  audio_channel_count: 1
  enable_automatic_punctuation: true
}
interim_results: true



## <font color="blue">Offline</font>

To run offline speech recognition, read the audio data from a file and pass it to the service.

In [8]:
with open(my_wav_file, 'rb') as fh:
    data = fh.read()

response = asr_service.offline_recognize(data, offline_config)

In [9]:
print(response)

results {
  alternatives {
    transcript: "AntiBERTa and ABlooper, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens. "
    confidence: 1.0
  }
  channel_tag: 1
  audio_processed: 14.762687683105469
}



To extract the transcript of the first alternative, use:

In [10]:
print(response.results[0].alternatives[0].transcript)

AntiBERTa and ABlooper, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens. 


In [11]:
print(response.results[0].alternatives[0].confidence)

1.0
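
Because `results` and `alternatives` are indexed like lists, the same access pattern can be applied to every result instead of only the first one. The sketch below uses `SimpleNamespace` stand-ins in place of a real response so that it runs without a server; `top_transcripts` is a hypothetical helper, and only the attribute names shown in the printed response above are used.

```python
from types import SimpleNamespace


def top_transcripts(response):
    """Return (transcript, confidence) for the first alternative of each result."""
    pairs = []
    for result in response.results:
        # Skip results that carry no alternatives instead of raising IndexError.
        if result.alternatives:
            first = result.alternatives[0]
            pairs.append((first.transcript, first.confidence))
    return pairs


# Stand-in objects that mimic the shape of the printed response.
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(
            alternatives=[SimpleNamespace(transcript="hello world. ", confidence=1.0)]
        ),
        SimpleNamespace(alternatives=[]),
    ]
)

pairs = top_transcripts(fake_response)
print(pairs)
```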


## <font color="blue">Streaming</font>

To imitate audio streaming, use `riva_api.AudioChunkFileIterator`. To imitate real-time audio, provide a delay callback to the iterator.

In [12]:
wav_parameters = riva_api.get_wav_file_parameters(my_wav_file)
# corresponds to 1 second of audio
chunk_size = wav_parameters['framerate']
with riva_api.AudioChunkFileIterator(
    my_wav_file, chunk_size, delay_callback=riva_api.sleep_audio_length,
) as audio_chunk_iterator:
    for i, chunk in enumerate(audio_chunk_iterator):
        print(i, len(chunk))

0 96000
1 96000
2 96000
3 96000
4 96000
5 96000
6 96000
7 96000
8 96000
9 96000
10 96000
11 96000
12 96000
13 96000
14 73216
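
The chunk lengths printed above are in bytes, while the chunk size passed to the iterator is in frames. The arithmetic below reproduces the numbers, assuming 2 bytes per sample (16-bit PCM); that width is inferred from the output, since 96000 bytes for 48000 mono frames leaves 2 bytes per frame.

```python
framerate = 48000  # sample_rate_hertz from the printed config
nchannels = 1      # audio_channel_count from the printed config
sampwidth = 2      # assumption: bytes per sample, inferred from the output

# One second of audio expressed in bytes.
bytes_per_second = framerate * nchannels * sampwidth

# Chunk lengths as printed above: 14 full chunks and a shorter last one.
chunk_lengths = [96000] * 14 + [73216]

total_bytes = sum(chunk_lengths)
duration_seconds = total_bytes / bytes_per_second

print(bytes_per_second, total_bytes, round(duration_seconds, 3))
```

The resulting duration of about 14.76 seconds is in line with the `audio_processed` value printed in the offline response.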


The audio chunks are then passed to `ASRService.streaming_response_generator()`, which creates a response generator.

In [13]:
audio_chunk_iterator = riva_api.AudioChunkFileIterator(my_wav_file, 4800)
response_generator = asr_service.streaming_response_generator(audio_chunk_iterator, streaming_config)

You can find a description of the streaming response (`StreamingRecognizeResponse`) fields in the Riva [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/riva_asr.proto.html?highlight=max%20alternatives#riva-proto-riva-asr-proto).

In [14]:
streaming_response = next(response_generator)
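
A streaming response can be inspected in the same way as an offline one. The sketch below separates final transcripts from interim ones. It assumes that each streaming result exposes `is_final` and `alternatives`, as described in the linked proto reference, and it uses `SimpleNamespace` stand-ins instead of real responses so that it runs without a server. `split_final_and_interim` is a hypothetical helper, not part of `riva_api`.

```python
from types import SimpleNamespace


def split_final_and_interim(responses):
    """Collect first-alternative transcripts, grouped by the is_final flag."""
    final, interim = [], []
    for response in responses:
        for result in response.results:
            if not result.alternatives:
                continue
            text = result.alternatives[0].transcript
            # Assumption: is_final marks results that will not change anymore.
            (final if result.is_final else interim).append(text)
    return final, interim


def fake_result(text, is_final):
    """Build a stand-in that mimics one streaming result."""
    return SimpleNamespace(
        alternatives=[SimpleNamespace(transcript=text)], is_final=is_final
    )


fake_responses = [
    SimpleNamespace(results=[fake_result("ant", False)]),
    SimpleNamespace(results=[fake_result("anti bird", False)]),
    SimpleNamespace(results=[fake_result("AntiBERTa and ABlooper. ", True)]),
]

final, interim = split_final_and_interim(fake_responses)
print(final)
print(interim)
```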

To display streaming results, it is convenient to use the `riva_api.print_streaming()` function.

In [15]:
riva_api.print_streaming(response_generator, additional_info='time')

>>>Time 1654013111.81s: ant
>>>Time 1654013111.82s: anti bird
>>>Time 1654013111.83s: anti berta
>>>Time 1654013111.83s: auntie berta and
>>>Time 1654013111.85s: auntie berta and
>>>Time 1654013111.85s: auntie berta and abe
>>>Time 1654013111.85s: AntiBERTa and abu
>>>Time 1654013111.85s: AntiBERTa and abu po
>>>Time 1654013111.86s: auntie bertha and ABlooper
>>>Time 1654013111.86s: berta and ABlooper
>>>Time 1654013111.87s: berta and aber
>>>Time 1654013111.87s: berta and abu both
>>>Time 1654013111.87s: anti and ABlooper both
>>>Time 1654013112.05s: anti and ABlooper both
>>>Time 1654013112.06s: AntiBERTa and ABlooper both strong
>>>Time 1654013112.06s: AntiBERTa and ABlooper both transform
>>>Time 1654013112.07s: AntiBERTa and a looper both transform
>>>Time 1654013112.09s: AntiBERTa and both transform a basic
>>>Time 1654013112.09s: AntiBERTa and both transform a baseline
>>>Time 1654013112.09s: AntiBERTa and both transform a base language
>>>Time 1654013112.10s: AntiBERTa and ABlo

If you set a delay callback in the audio chunk iterator and pass `show_intermediate=True` to `riva_api.print_streaming()`, you can watch the transcript form in real time.

In [16]:
audio_chunk_iterator = riva_api.AudioChunkFileIterator(my_wav_file, 4800, riva_api.sleep_audio_length)
response_generator = asr_service.streaming_response_generator(audio_chunk_iterator, streaming_config)
riva_api.print_streaming(response_generator, show_intermediate=True)

## AntiBERTa and ABlooper, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens. 
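
Conceptually, a delay callback that imitates real-time audio pauses for as long as the chunk would take to play. The sketch below is a hypothetical stand-in, not the implementation or the signature of `riva_api.sleep_audio_length`. The `sleep` argument is injectable so that the example records the delay instead of actually waiting, and the 2-byte sample width is an assumption.

```python
import time


def make_sleep_callback(framerate, sampwidth, nchannels, sleep=time.sleep):
    """Build a callback that pauses for the playback time of a chunk."""
    bytes_per_second = framerate * sampwidth * nchannels

    def delay(chunk: bytes) -> float:
        seconds = len(chunk) / bytes_per_second
        sleep(seconds)
        return seconds

    return delay


# Record the requested delays in a list instead of sleeping.
recorded = []
callback = make_sleep_callback(48000, 2, 1, sleep=recorded.append)

# 4800 mono frames of 2 bytes each, matching the chunk size used above.
callback(b"\x00" * 9600)
print(recorded)
```

With a chunk of 4800 frames at 48000 Hz, each pause lasts a tenth of a second, which is why the transcript appears to grow at speaking pace.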


It is also possible to print streaming results to several destinations at once, e.g. to STDOUT and a file.

In [17]:
import sys
output_file = "my_results.txt"
audio_chunk_iterator = riva_api.AudioChunkFileIterator(my_wav_file, 4800)
response_generator = asr_service.streaming_response_generator(audio_chunk_iterator, streaming_config)
riva_api.print_streaming(response_generator, additional_info='confidence', output_file=[sys.stdout, output_file])

>> ant
Stability:    0.1000
----
>> anti bird
Stability:    0.1000
----
>> anti berta
Stability:    0.1000
----
>> auntie berta and
Stability:    0.1000
----
>> auntie berta and
Stability:    0.1000
----
>> auntie berta and abe
Stability:    0.1000
----
>> AntiBERTa and abu
Stability:    0.1000
----
>> AntiBERTa and abu po
Stability:    0.1000
----
>> auntie bertha and ABlooper
Stability:    0.1000
----
>> berta and aber
Stability:    0.1000
----
>> berta and aber
Stability:    0.1000
----
>> berta and abu both
Stability:    0.1000
----
>> anti 
Stability:    0.9000
>> and ABlooper both
Stability:    0.1000
----
>> anti 
Stability:    0.9000
>> and ABlooper both
Stability:    0.1000
----
>> AntiBERTa 
Stability:    0.9000
>> and ABlooper both strong
Stability:    0.1000
----
>> AntiBERTa and 
Stability:    0.9000
>> ABlooper both transform
Stability:    0.1000
----
>> AntiBERTa and a 
Stability:    0.9000
>> looper both transform
Stability:    0.1000
----
>> AntiBERTa and 
Stability:  

>> AntiBERTa and ABlooper both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for 
Stability:    0.9000
>> target antigens
Stability:    0.1000
----
>> AntiBERTa and ABlooper both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for 
Stability:    0.9000
>> target antigens
Stability:    0.1000
----
## AntiBERTa and ABlooper, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens. 
Confidence:    1.0000
----


Showing the file and cleaning up in bash (the output below was captured on Windows, where these commands are not available):

In [18]:
!cat $output_file

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [19]:
!rm $output_file

'rm' is not recognized as an internal or external command,
operable program or batch file.


Showing the file and cleaning up in cmd.exe:

In [20]:
!type $output_file

>> ant
Stability:    0.1000
----
>> anti bird
Stability:    0.1000
----
>> anti berta
Stability:    0.1000
----
>> auntie berta and
Stability:    0.1000
----
>> auntie berta and
Stability:    0.1000
----
>> auntie berta and abe
Stability:    0.1000
----
>> AntiBERTa and abu
Stability:    0.1000
----
>> AntiBERTa and abu po
Stability:    0.1000
----
>> auntie bertha and ABlooper
Stability:    0.1000
----
>> berta and aber
Stability:    0.1000
----
>> berta and aber
Stability:    0.1000
----
>> berta and abu both
Stability:    0.1000
----
>> anti 
Stability:    0.9000
>> and ABlooper both
Stability:    0.1000
----
>> anti 
Stability:    0.9000
>> and ABlooper both
Stability:    0.1000
----
>> AntiBERTa 
Stability:    0.9000
>> and ABlooper both strong
Stability:    0.1000
----
>> AntiBERTa and 
Stability:    0.9000
>> ABlooper both transform
Stability:    0.1000
----
>> AntiBERTa and a 
Stability:    0.9000
>> looper both transform
Stability:    0.1000
----
>> AntiBERTa and 
Stability:  

In [21]:
!del $output_file
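
The shell commands above depend on the platform. A portable alternative is to show and delete the file from Python with `pathlib`. The sketch below works on a temporary file with made-up contents so that it runs on its own; in the notebook you would point `results_path` at `output_file` instead.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the results file written by the streaming example.
    results_path = Path(tmp_dir) / "my_results.txt"
    results_path.write_text(">> ant\nStability:    0.1000\n----\n", encoding="utf-8")

    # Show the file contents (replaces `cat` and `type`).
    shown = results_path.read_text(encoding="utf-8")
    print(shown, end="")

    # Delete the file (replaces `rm` and `del`).
    results_path.unlink()
    removed = not results_path.exists()
```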

## <font color="blue">Audio input/output</font>

To use audio input and output, you need to install PyAudio.

```bash
conda install -c anaconda pyaudio
```

### <font color="green">Playing audio during transcribing</font>

To play audio while it is being transcribed, provide an instance of `riva_api.audio_io.SoundCallBack` as the `delay_callback` of `riva_api.AudioChunkFileIterator`.

In [22]:
import riva_api.audio_io

In [23]:
# show available output devices
riva_api.audio_io.list_output_devices()

Output audio devices:
2: Microsoft Sound Mapper - Output
3: Speakers (Synaptics Audio)
4: Output 1 (Synaptics Audio headphone)
5: Output 2 (Synaptics Audio headphone)
13: Output 1 (Synaptics Audio output)
14: Output 2 (Synaptics Audio output)


In [24]:
output_device = None  # use default device
wav_parameters = riva_api.get_wav_file_parameters(my_wav_file)
sound_callback = riva_api.audio_io.SoundCallBack(
    output_device, wav_parameters['sampwidth'], wav_parameters['nchannels'], wav_parameters['framerate'],
)
audio_chunk_iterator = riva_api.AudioChunkFileIterator(my_wav_file, 4800, sound_callback)
response_generator = asr_service.streaming_response_generator(audio_chunk_iterator, streaming_config)
riva_api.print_streaming(response_generator, show_intermediate=True)
sound_callback.close()

## AntiBERTa and ABlooper, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens. 


### <font color="green">Streaming from microphone</font>

In [25]:
riva_api.audio_io.list_input_devices()

Input audio devices:
0: Microsoft Sound Mapper - Input
1: Microphone Array (Synaptics Aud
6: Input (Synaptics Audio headphone)
7: Microphone 1 (Synaptics Audio capture)
8: Microphone 2 (Synaptics Audio capture)
9: Microphone 3 (Synaptics Audio capture)
10: Microphone Array 1 (Synaptics Audio capture)
11: Microphone Array 2 (Synaptics Audio capture)
12: Microphone Array 3 (Synaptics Audio capture)
15: Input (Synaptics Audio output)


Run the code below and then say something in English.

In [27]:
input_device = None  # default device
with riva_api.audio_io.MicrophoneStream(
    rate=streaming_config.config.sample_rate_hertz,
    chunk=streaming_config.config.sample_rate_hertz // 10,
    device=input_device,
) as audio_chunk_iterator:
    riva_api.print_streaming(
        responses=asr_service.streaming_response_generator(
            audio_chunks=audio_chunk_iterator,
            streaming_config=streaming_config,
        ),
        show_intermediate=True,
    )

## Tell me something.  
## No. 


KeyboardInterrupt: 