# Speech Recognition techniques

A walkthrough of the different functions in pyramidman that handle recording, playing, and processing audio. It is mainly intended for didactic and testing purposes.

In [1]:
%load_ext autoreload
%autoreload 2

from pyramidman.audio_parameters import AudioParameters
from pyramidman.basic_audio_IO import play_audio, record_audio
from pyramidman.audio_utils import get_available_microphones, get_sysdefault_microphone_index, get_all_devices_str
from pyramidman.queue_utils import record_with_queue
from pyramidman.unwrapper import unwrap
from pyramidman.speech_recognizing import recognize_speech_from_mic

from pyramidman.hieroglyph import plot_timeseries_range_slider, create_tabs, plot_spectrogram
from pyramidman.hieroglyph import add_word_annotations
from pyramidman.Ihy import get_audio_menu_wav_file
from pyramidman.audio_utils import calibrate_microphone

from pyramidman.deepspeech_tools import transcribe, DeepSpeechArgs

import speech_recognition as sr
import time
import numpy as np
from scipy import signal
from scipy.io import wavfile

import matplotlib.pyplot as plt
%matplotlib qt

import plotly.graph_objs as go
from IPython.display import display
import ipywidgets as widgets

from queue import Queue

# Instantiate and calibrate microphone

Ideally, a background thread would transcribe and plot each sentence as soon as it is finished. As a first step, the following code instantiates and calibrates the microphone.

In [2]:
audio_params = AudioParameters()
audio_params.set_sysdefault_microphone_index()
audio_params.set_default_input_parameters()
# audio_params.sample_rate = 16000

In [3]:
mic = audio_params.get_microphone()
r = sr.Recognizer()
calibrate_microphone(mic, r, duration = 3)

Calibrating microphone for 3 seconds.
Calibrated energy threshold:  2279.443015006792


## Record example message to process with the recognizers

In [4]:
filename_mic = '../audios/temp/hello_world.wav'

with mic as source:
    audio = r.listen(source)

with open(filename_mic, "wb") as f:
    f.write(audio.get_wav_data())

In [5]:
tabs = get_audio_menu_wav_file(filename_mic)
display(tabs)

Tab(children=(FigureWidget({
    'data': [{'line': {'color': 'deepskyblue'},
              'name': 'AAPL High'…

# Speech recognition

This section describes the transcribers that are implemented and how to use them.

## 1. speech_recognition library

The speech_recognition library provides a set of ready-to-use transcribers, including:
- sphinx: requires additional software to be installed locally.
- google: allows only a limited number of calls.


In [6]:
response = recognize_speech_from_mic(audio_params, duration = 5)

In [7]:
print('\nSuccess : {}\nError   : {}\n\nText from Speech\n{}\n\n{}'
      .format(response['success'],
              response['error'],
              '-'*17,
              response['transcription']))


Success : True
Error   : Unable to recognize speech

Text from Speech
-----------------

None


## 2. DeepSpeech recognizer

We installed this decoder and wrote some code on top of it to make it easier to use.

It has the following peculiarities:
- It only works with data sampled at 16000 Hz, so the 48000 Hz recording must be resampled. It is still better to record at 48000 Hz and then low-pass filter before downsampling (not just drop samples), to avoid aliasing.
- It can return metadata with the likelihood of the transcription and the start_time and duration of each word.
- There seems to be an error in the decoder: it always assigns time 0 to the first letter it decodes, regardless of when that letter actually occurs.
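
The resampling step from the first point can be sketched with `scipy.signal.resample_poly`, which applies a low-pass FIR filter before decimating, so content above the new Nyquist frequency (8 kHz) is attenuated instead of aliased. This is only a minimal sketch and not necessarily what pyramidman does internally; the helper name `resample_to_16k` and the assumption of mono 16-bit PCM input are illustrative.

```python
from math import gcd

import numpy as np
from scipy import signal


def resample_to_16k(samples, source_rate=48000, target_rate=16000):
    """Resample PCM samples with polyphase anti-alias filtering.

    Hypothetical helper for illustration; assumes mono int16 input.
    """
    divisor = gcd(source_rate, target_rate)
    up, down = target_rate // divisor, source_rate // divisor
    # resample_poly low-pass filters before decimating, unlike samples[::3].
    resampled = signal.resample_poly(samples.astype(np.float64), up, down)
    # Clip before casting back, since filter ringing can overshoot int16.
    return np.clip(np.round(resampled), -32768, 32767).astype(np.int16)


# One second of a synthetic 440 Hz tone sampled at 48 kHz.
t = np.arange(48000) / 48000
tone = (10000 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
tone_16k = resample_to_16k(tone)
```

Plain slicing (`samples[::3]`) would also produce 16000 samples per second, but any energy between 8 kHz and 24 kHz would fold back into the audible band.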


In [8]:
args = DeepSpeechArgs()
unwrap(args)

<DeepSpeechArgs>	object has children:
    <str>	model:	../models/deepspeech/deepspeech-0.6.0-mo
    <str>	lm:	../models/deepspeech/deepspeech-0.6.0-mo
    <str>	trie:	../models/deepspeech/deepspeech-0.6.0-mo
    <int>	beam_width:	500
    <float>	lm_alpha:	0.75
    <float>	lm_beta:	1.85




In [9]:
metadata = transcribe(args, filename_mic)

In [10]:
metadata["sentence"]

'i'

In [11]:
metadata["words"]

[{'word': 'i', 'start_time ': 0.0, 'duration': 0.0}]

In [12]:
[[item.character, item.start_time, item.timestep] for item in metadata["characters"].items]

[['i', 0.0, 0]]
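
As a rough illustration of how per-word timings can be derived from per-character items like the ones above, the sketch below groups characters into words at spaces. It assumes only that each item exposes `character` and `start_time` attributes; the function name, the demo timings, and the way duration is approximated (time between the first and last character of a word) are assumptions for illustration, not the library's actual procedure.

```python
from types import SimpleNamespace


def group_characters_into_words(items):
    """Group per-character items into word dicts, splitting on spaces."""
    words = []
    current, word_start, last_time = "", 0.0, 0.0

    def flush():
        # Store the word collected so far, if any.
        if current:
            words.append({"word": current,
                          "start_time": word_start,
                          "duration": round(last_time - word_start, 4)})

    for item in items:
        if item.character == " ":
            flush()
            current = ""
            continue
        if not current:
            word_start = item.start_time
        current += item.character
        last_time = item.start_time
    flush()
    return words


# Made-up character timings, standing in for real decoder metadata.
demo_items = [SimpleNamespace(character=c, start_time=t)
              for c, t in [("h", 0.5), ("i", 0.6), (" ", 0.7),
                           ("y", 0.9), ("o", 1.1)]]
words = group_characters_into_words(demo_items)
```

A single-character word, as in the recording above, gets a duration of 0.0 under this approximation.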

In [13]:
tabs = get_audio_menu_wav_file(filename_mic)
add_word_annotations(tabs.children[0], metadata["words"])
display(tabs)

Tab(children=(FigureWidget({
    'data': [{'line': {'color': 'deepskyblue'},
              'name': 'AAPL High'…

In [14]:
play_audio(audio_params, filename_mic)

# Automatic tuning for transcription.

We could record a few sentences of my voice together with their expected transcriptions, then vary the preprocessing parameters and the transcriber settings to find the best setup.
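
One possible way to score such experiments is the word error rate (WER): the word-level edit distance between the expected sentence and the transcription, divided by the number of expected words. The sketch below is an assumption about how the tuning could be organized, not existing pyramidman code; the setup names and all sentences are invented placeholders, not measured results.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1] / max(len(ref), 1)


def best_setup(references, candidate_outputs):
    """Return the setup name with the lowest mean WER.

    `candidate_outputs` maps a setup name to its transcriptions,
    listed in the same order as `references`.
    """
    def mean_wer(outputs):
        scores = [word_error_rate(r, o) for r, o in zip(references, outputs)]
        return sum(scores) / len(scores)

    return min(candidate_outputs, key=lambda name: mean_wer(candidate_outputs[name]))


# Placeholder data: in practice these would come from real recordings.
references = ["hello world", "open the door"]
candidate_outputs = {
    "no_filter": ["hello word", "open door"],
    "low_pass": ["hello world", "open the door"],
}
best = best_setup(references, candidate_outputs)
```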

# Transcriber class
Its main goal is to manage the recording and transcription of the data. It makes it available through a time queue.
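
The general pattern behind this kind of class can be sketched with the standard library alone: a background thread takes recorded audio from one queue, transcribes it, and puts the result on a second queue. The code below is a simplified sketch of that pattern, not the actual `Transcriber` implementation; `start_transcription_worker` and the stand-in transcription function are hypothetical.

```python
import threading
from queue import Empty, Queue


def start_transcription_worker(audios_queue, transcriptions_queue, transcribe_fn):
    """Start a daemon thread that moves items between the two queues.

    Returns a function that stops the worker when called.
    """
    stop_event = threading.Event()

    def worker():
        while not stop_event.is_set():
            try:
                # Short timeout so the stop flag is checked regularly.
                audio = audios_queue.get(timeout=0.1)
            except Empty:
                continue
            transcriptions_queue.put(transcribe_fn(audio))
            audios_queue.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    def stop():
        stop_event.set()
        thread.join()

    return stop


audios_queue, transcriptions_queue = Queue(), Queue()
# str.upper stands in for a real speech-to-text call.
stop = start_transcription_worker(audios_queue, transcriptions_queue, str.upper)
audios_queue.put("hello world")
audios_queue.join()   # wait until the worker has processed the item
stop()
result = transcriptions_queue.get_nowait()
```

Returning a stop function keeps the thread and its stop flag private to the helper, so callers only need to hold on to a single callable.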

In [37]:
from pyramidman.Seshat import Transcriber

In [38]:
transcriber = Transcriber()
unwrap(transcriber, max_level = 1)

<Transcriber>	object has children:
    <bool>	_listening:	False
    <bool>	_transcribing:	False
    <Queue>	_audios_queue
    <Queue>	_transcriptions_queue
    <NoneType>	_stop_listen_in_background_func:	None
    <NoneType>	_stop_transcribing_in_background_func:	None
    <int>	item_index:	0
    <list>	_transcriptions

  <list>	_transcriptions has children:

  <Queue>	_transcriptions_queue has children:

  <Queue>	_audios_queue has children:




## Instantiating and calibrating

In [39]:
# Recordings folder is where the sentences will be stored
transcriber.set_automatic_default_recording_variables(recordings_folder="../audios/temp/")
transcriber.set_automatic_default_transcribing_variables()

Calibrating microphone for 1 seconds.
Calibrated energy threshold:  3641.8891025080434


In [40]:
unwrap(transcriber, max_level = 1)

<Transcriber>	object has children:
    <bool>	_listening:	False
    <bool>	_transcribing:	False
    <Queue>	_audios_queue
    <Queue>	_transcriptions_queue
    <NoneType>	_stop_listen_in_background_func:	None
    <NoneType>	_stop_transcribing_in_background_func:	None
    <int>	item_index:	0
    <list>	_transcriptions
    <str>	recordings_folder:	../audios/temp/
    <AudioParameters>	audio_params
    <Microphone>	microphone
    <Recognizer>	recognizer
    <Recognizer>	recognizer
    <Recognizer>	recognizer

  <Recognizer>	recognizer has children:

  <Recognizer>	recognizer has children:

  <Recognizer>	recognizer has children:

  <Microphone>	microphone has children:

  <AudioParameters>	audio_params has children:

  <list>	_transcriptions has children:

  <Queue>	_transcriptions_queue has children:

  <Queue>	_audios_queue has children:




## Listening 

When we start listening, the queue is reinitialized.
When we stop listening, the queue still exists.
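
The lifecycle described above can be illustrated with a small stand-in class: starting creates a fresh queue, while stopping leaves the existing queue readable. `ListenerSketch` and `drain` are hypothetical names used only for this illustration and do not reproduce the real class.

```python
from queue import Empty, Queue


class ListenerSketch:
    """Hypothetical stand-in illustrating the queue lifecycle only."""

    def __init__(self):
        self._listening = False
        self._audios_queue = Queue()

    def start_listening(self):
        # A fresh queue discards anything left from a previous session.
        self._audios_queue = Queue()
        self._listening = True

    def stop_listening(self):
        # The queue is kept so pending items can still be read.
        self._listening = False

    def get_audios_queue(self):
        return self._audios_queue


def drain(queue):
    """Collect everything currently queued without blocking."""
    items = []
    while True:
        try:
            items.append(queue.get_nowait())
        except Empty:
            return items


listener = ListenerSketch()
listener.start_listening()
listener.get_audios_queue().put("first session audio")
listener.stop_listening()
after_stop = drain(listener.get_audios_queue())

listener.get_audios_queue().put("stale audio")
listener.start_listening()
after_restart = drain(listener.get_audios_queue())
```

Using `get_nowait()` with `queue.Empty` avoids the small race between checking `empty()` and calling `get()` when another thread is consuming the same queue.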

In [51]:
transcriber.start_listening_in_background(phrase_time_limit=5, timeout=5)

In [52]:
transcriber.is_listening()

True

In [70]:
transcriber.stop_listening_in_background()

In [53]:
transcriber.is_listening()

True

In [54]:
audios_queue = transcriber.get_audios_queue()

In [57]:
if not audios_queue.empty():
    data = audios_queue.get()
    print(data)
else:
    print("No data was stored")

No data was stored


## Transcribing 


In [58]:
transcriber.start_transcribing_in_background()

In [59]:
transcriber.is_transcribing()

True

In [69]:
transcriber.stop_transcribing_in_background()

In [60]:
transcriptions_queue = transcriber.get_transcriptions_queue()

In [68]:
if not transcriptions_queue.empty():
    data = transcriptions_queue.get()
    print(data)
else:
    print("No data was stored")

{'sentence': 'i', 'words': [{'word': 'i', 'start_time ': 0.0, 'duration': 0.0}], 'characters': <Swig Object of type 'Metadata *' at 0x7f02fe838630>, 'confidence': -9.85685304694977}
