# Speech-to-Text

Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable computers to recognize spoken language and translate it into text. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech to text.

## Open Speech Repository 

The OSR Project provides freely usable speech files in multiple languages for use in Voice over IP testing and other applications.

http://www.voiptroubleshooter.com/open_speech/index.html

http://www.voiptroubleshooter.com/open_speech/american.html
    
http://www.voiptroubleshooter.com/open_speech/british.html

http://www.cs.columbia.edu/~hgs/audio/harvard.html

## Audio files and original text

#### Audio files
OSR_us_000_0010_8k.wav (American English)

OSR_uk_000_0020_8k.wav (British English)

#### Original text

The birch canoe slid on the smooth planks.
Glue the sheet to the dark blue background.
It's easy to tell the depth of a well.
These days a chicken leg is a rare dish.
Rice is often served in round bowls.
The juice of lemons makes fine punch.
The box was thrown beside the parked truck.
The hogs were fed chopped corn and garbage.
Four hours of steady work faced us.
Large size in stockings is hard to sell.

In [1]:
original_text = "The birch canoe slid on the smooth planks Glue the sheet to the dark blue background It's easy to tell the depth of a well These days a chicken leg is a rare dish Rice is often served in round bowls The juice of lemons makes fine punch The box was thrown beside the parked truck The hogs were fed chopped corn and garbage Four hours of steady work faced us Large size in stockings is hard to sell"

## Google Speech-to-Text API Usage

https://pypi.org/project/SpeechRecognition/
#### Install the SpeechRecognition package
! pip install SpeechRecognition

In [2]:
import speech_recognition as sr

In [3]:
def get_text(local_file_path):
    """Transcribe a local audio file with the Google Web Speech API."""
    recognizer = sr.Recognizer()
    # Use the audio file as the audio source and read it in full.
    with sr.AudioFile(local_file_path) as source:
        audio = recognizer.record(source)
    try:
        # recognize_google defaults to language="en-US"
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return "Google Speech Recognition could not understand audio"
    except sr.RequestError as e:
        return "Could not request results from Google Speech Recognition service; {0}".format(e)

In [4]:
# American English
google_speech_to_text_us = get_text('audio_samples/OSR_us_000_0010_8k.wav')
print(google_speech_to_text_us)

perched new Swift on the smooth bike seat without play background editor with observable Tuesday it came like a river Rises Aasan search in Rampur to choose of lemons makes find the passport on the side of the how to search quilling art for the study workspaces was nice talking is hard stuff


In [5]:
# British English
google_speech_to_text_uk = get_text('audio_samples/OSR_uk_000_0020_8k.wav')
print(google_speech_to_text_uk)

can you slept on the smooth Planck's live the cheat to the dog Cool background it's easy to tell with depth of a well these days it is a reddish wise is conserved in round poles reduce of lemons makes fun fun the box is thrown decide the pot Rock The House was add chopped Kaun and garbage 4 hours study work as a lot size in stockings is hard to sell


## Google Cloud Speech-to-Text API

Cloud Speech-to-Text enables easy integration of Google speech recognition technologies into developer applications. You can send audio data to the Speech-to-Text API, which then returns a text transcription of that audio file. 

**client-libraries-install-python**
https://cloud.google.com/speech-to-text/docs/reference/libraries

**client-libraries-sample-code-python**
https://github.com/googleapis/google-cloud-python/tree/master/speech
    
**client-libraries-quickstart-python**
https://cloud.google.com/speech-to-text/docs/quickstart-client-libraries

**client-libraries-sample-code-python**
https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/speech/cloud-client


## Setting up authentication

To run the client library, we must first set up authentication by creating a service account and setting an environment variable.

https://console.cloud.google.com/apis/credentials/serviceaccountkey

Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key. 

#### Windows PowerShell
$env:GOOGLE_APPLICATION_CREDENTIALS="SpeechAnalytics.json"

#### Windows Command Prompt
set GOOGLE_APPLICATION_CREDENTIALS=SpeechAnalytics.json

#### Linux/MacOS
export GOOGLE_APPLICATION_CREDENTIALS="SpeechAnalytics.json"
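
Before creating a client, it can help to confirm that the variable is actually visible to the Python process (a notebook started before the variable was set will not see it). A minimal sketch, where `check_credentials` is a hypothetical helper and not part of the Google client library:

```python
import os


def check_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the key file path if the variable is set and the file exists.

    Hypothetical helper for illustration; raises a descriptive error otherwise.
    """
    path = os.environ.get(env_var)
    if not path:
        raise EnvironmentError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    return path
```

Calling `check_credentials()` at the top of the notebook fails early with a clear message instead of a later authentication error.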

## Google Cloud Speech-to-Text API Usage

https://pypi.org/project/google-cloud-speech/
#### Install the Google Cloud Speech client library (the `enums` and `types` imports below require a release before 2.0)
! pip install "google-cloud-speech<2.0"

In [6]:
import io
import os

# the Google Cloud client library
from google.cloud import speech
from google.cloud import speech_v1
from google.cloud.speech import enums
from google.cloud.speech import types

In [7]:
def get_cloud_text(language_code,sample_rate_hertz,local_file_path):
    
    client = speech_v1.SpeechClient()

    # local_file_path = 'audio_samples/OSR_us_000_0010_8k.wav'

    # The language of the supplied audio
    # language_code = "en-US"

    # Sample rate in Hertz of the audio data sent
    # sample_rate_hertz = 8000

    # Encoding of audio data sent. This sample sets this explicitly.
    # This field is optional for FLAC and WAV audio formats.
    encoding = enums.RecognitionConfig.AudioEncoding.LINEAR16
    config = {
        "language_code": language_code,
        "sample_rate_hertz": sample_rate_hertz,
        "encoding": encoding,
    }

    # Transcribe long audio file from Cloud Storage using asynchronous speech recognition
    # storage_uri = "gs://cloud-samples-data/speech/brooklyn_bridge.raw"
    # audio = {"uri": storage_uri}

    # Transcribe a long audio file using asynchronous speech recognition
    # local_file_path Path to local audio file, e.g. /path/audio.wav
    with io.open(local_file_path, "rb") as f:
        content = f.read()
    audio = {"content": content}

    # Synchronous recognition
    # response = client.recognize(config, audio)

    # Asynchronous (long-running) recognition
    operation = client.long_running_recognize(config, audio)

    # print(u"Waiting for operation to complete...")
    response = operation.result()
    output_text_list = []
    for result in response.results:
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        output_text_list.append(alternative.transcript)
        # print(u"Transcript: {}".format(alternative.transcript))

    return ''.join(output_text_list)
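
`get_cloud_text` expects the caller to know the sample rate of the file. For WAV input, the rate can be read from the file header instead of being hard-coded. A small sketch using the standard-library `wave` module; `wav_sample_rate` is a hypothetical helper name, and it assumes an uncompressed PCM WAV file such as the OSR samples:

```python
import wave


def wav_sample_rate(local_file_path):
    """Return the sample rate (in Hz) recorded in a WAV file's header."""
    with wave.open(local_file_path, "rb") as wav_file:
        return wav_file.getframerate()


# Possible usage with the function defined above (path is an example):
# path = "audio_samples/OSR_us_000_0010_8k.wav"
# text = get_cloud_text("en-US", wav_sample_rate(path), path)
```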

In [8]:
# American English
google_cloud_speech_to_text_us = get_cloud_text("en-US",8000,'audio_samples/OSR_us_000_0010_8k.wav')
print(google_cloud_speech_to_text_us)

the Birch canoes lid on the smooth planks glue the seat to the dark blue background it is easy to tell the death of a well. These days a chicken leg has a word dish. Rice is often served in round bowls. Did use of lemon snakes find punch. The box was down beside the park truck. the Hogs of the popcorn and garbage 4 hours of study work face to us a large size in stockings is hard to sell.


In [9]:
# British English
google_cloud_speech_to_text_uk = get_cloud_text("en-GB",8000,'audio_samples/OSR_uk_000_0020_8k.wav')
print(google_cloud_speech_to_text_uk)

the birch canoe slid on the smooth planks glue the sheet to the dark blue background it's easy to tell the depth of a well these days a chicken Leg is a reddish rice is often served in round bowls the juice of lemons makes fine punch the box was thrown beside the pot truck the hogs were fed up corn and garbage 4 hours of steady work based s a large size in stockings is hard to sell


### Let's check the accuracy of both the Cloud and non-Cloud speech APIs

Similarity between the original text and the output of each speech-to-text API

In [10]:
# split sentences into a set of lowercase words
def WordGram(text):
    tokens = str.split(text.replace('\n', ' ').lower())
    return set(tokens)

In [11]:
original_set = WordGram(original_text)
google_speech_to_text_us_set = WordGram(google_speech_to_text_us)
google_cloud_speech_to_text_us_set = WordGram(google_cloud_speech_to_text_us)
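
One thing to watch with whitespace-only tokenization: the Cloud transcript of the American sample contains periods ("well.", "dish."), while `original_text` has none, so "well." and "well" are counted as different words. The sketch below strips punctuation before building the set. `normalized_word_set` is a hypothetical alternative to `WordGram`; it was not used to produce the scores reported in this notebook.

```python
import re


def normalized_word_set(text):
    """Lowercase the text, replace punctuation (except apostrophes) with
    spaces, and return the set of remaining words."""
    cleaned = re.sub(r"[^a-z0-9'\s]", " ", text.lower())
    return set(cleaned.split())


print(normalized_word_set("It's easy to tell the depth of a well. These days"))
```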

### jaccard similarity (IoU)

In [12]:
def jaccard(s1,s2,s3, message):
    print('Google non-cloud speech-to-text api accuracy is: %.6f percent' % (100.* len(s1.intersection(s2))/ len(s1.union(s2))))
    print('Google cloud speech-to-text api accuracy is: %.6f percent' % (100.* len(s1.intersection(s3))/ len(s1.union(s3))))

In [13]:
jaccard(original_set,google_speech_to_text_us_set,google_cloud_speech_to_text_us_set, 'Word Gram')

Google non-cloud speech-to-text api accuracy is: 13.829787 percent
Google cloud speech-to-text api accuracy is: 43.023256 percent
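
For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. Because it works on sets, it ignores word order and repeated words, so it is only a rough proxy for transcription accuracy. A minimal sketch of the same computation as a function that returns the value instead of printing it (`jaccard_similarity` and the toy sets are illustrative, not part of the notebook above):

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two sets (defined as 1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


reference = {"the", "birch", "canoe", "slid"}
hypothesis = {"the", "birch", "canoes", "lid"}
# 2 shared words ("the", "birch") out of 6 distinct words overall
print(jaccard_similarity(reference, hypothesis))
```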


## Minhashing
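MinHash estimates the Jaccard similarity of two sets without comparing them element by element. For a random hash function, the probability that both sets have the same minimum hash value equals their Jaccard similarity, so the fraction of matching minima over k independent hash functions approximates it, and the estimate tightens as k grows.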
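
The loops in the following cells rely on Python's built-in `hash()`, which is salted per process for strings unless `PYTHONHASHSEED` is fixed, so their output can vary slightly between runs. As a reproducible illustration of the same idea, here is a sketch that derives the hash family from `hashlib.sha256` with a seed prefix. The names `_seeded_hash` and `minhash_similarity` are hypothetical, and this is not the code that produced the numbers shown below.

```python
import hashlib


def _seeded_hash(seed, word):
    """Deterministic 64-bit hash of a word under a given seed."""
    digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_similarity(a, b, num_hashes=200):
    """Estimate Jaccard similarity as the fraction of seeds for which
    both sets have the same minimum hash value."""
    a, b = set(a), set(b)
    if not a or not b:
        raise ValueError("both sets must be non-empty")
    matches = 0
    for seed in range(num_hashes):
        min_a = min(_seeded_hash(seed, w) for w in a)
        min_b = min(_seeded_hash(seed, w) for w in b)
        if min_a == min_b:
            matches += 1
    return matches / num_hashes
```

With the sets built earlier, `minhash_similarity(original_set, google_cloud_speech_to_text_us_set)` should land near the exact Jaccard value, with the spread shrinking as `num_hashes` increases.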

#### Non-cloud accuracy

In [14]:
stotal_non = list(original_set.union(google_speech_to_text_us_set))

In [15]:
import math
# Implementing the fast MinHash algorithm
for k in [20,60,150,300,600]:    
    successCounter = 0
    for t in range (k):
        minNum = [math.inf, math.inf]
        for i in range (len(stotal_non)):
            current = hash(str(t)+stotal_non[i]+str(t)) % 10000
            if stotal_non[i] in original_set: # this is how we'll emulate the vector representation of this sample 1
                if (current < minNum[0]):
                    minNum[0] = current
            if stotal_non[i] in google_speech_to_text_us_set: # this is how we'll emulate the vector representation of this sample 2
                if (current < minNum[1]):
                    minNum[1] = current
        if minNum[0] == minNum[1]:
            successCounter = successCounter+1
    print("with t = %d"%k, " we get a minhash similarity of ", successCounter/k)

with t = 20  we get a minhash similarity of  0.1
with t = 60  we get a minhash similarity of  0.11666666666666667
with t = 150  we get a minhash similarity of  0.11333333333333333
with t = 300  we get a minhash similarity of  0.12
with t = 600  we get a minhash similarity of  0.13333333333333333


#### Cloud accuracy

In [16]:
stotal_cloud = list(original_set.union(google_cloud_speech_to_text_us_set))

In [17]:
import math
# Implementing the fast MinHash algorithm
for k in [20,60,150,300,600]:    
    successCounter = 0
    for t in range (k):
        minNum = [math.inf, math.inf]
        for i in range (len(stotal_cloud)):
            current = hash(str(t)+stotal_cloud[i]+str(t)) % 10000
            if stotal_cloud[i] in original_set: # this is how we'll emulate the vector representation of this sample 1
                if (current < minNum[0]):
                    minNum[0] = current
            if stotal_cloud[i] in google_cloud_speech_to_text_us_set: # this is how we'll emulate the vector representation of this sample 2
                if (current < minNum[1]):
                    minNum[1] = current
        if minNum[0] == minNum[1]:
            successCounter = successCounter+1
    print("with t = %d"%k, " we get a minhash similarity of ", successCounter/k)

with t = 20  we get a minhash similarity of  0.6
with t = 60  we get a minhash similarity of  0.38333333333333336
with t = 150  we get a minhash similarity of  0.43333333333333335
with t = 300  we get a minhash similarity of  0.44
with t = 600  we get a minhash similarity of  0.43833333333333335


## Conclusion

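Measured as the Jaccard similarity between the word set of the original text and the word set of each transcript of the American English sample:

Google non-cloud speech-to-text API: **13.829787** percent

Google Cloud Speech-to-Text API: **43.023256** percent

 **On this sample, Google Cloud Speech-to-Text recognizes speech more accurately**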