# Using the Python SpeechRecognition library
>  Speech recognition is still far from perfect. But the SpeechRecognition library provides an easy way to interact with many speech-to-text APIs. In this section, you'll learn how to use the SpeechRecognition library to easily start converting the spoken language in your audio files to text.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 2 exercises from the DataCamp course "Spoken Language Processing in Python". <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

## SpeechRecognition Python library

### Pick the wrong speech_recognition API

<div class=""><p>Which of the following is <strong>not</strong> a speech recognition API within the <code>speech_recognition</code> library?</p>
<p>An instance of the <code>Recognizer</code> class has been created and saved to <code>recognizer</code>. You can try calling the API on <code>recognizer</code> to see what happens.</p></div>

<pre>
Possible Answers
recognize_google()
recognize_bing()
recognize_wit()
<b>what_does_this_say()</b>
</pre>

**All of the Recognizer class API calls begin with recognize_.**

### Using the SpeechRecognition library

<div class=""><p>To save typing <code>speech_recognition</code> every time, we'll import it as <code>sr</code>.</p>
<p>We'll also set up an instance of the <code>Recognizer</code> class to use later.</p>
<p>The <code>energy_threshold</code> is a number, typically between 0 and 4000, that sets how loud audio must be before the <code>Recognizer</code> class treats it as speech rather than silence.</p>
<p><code>energy_threshold</code> will dynamically adjust whilst the recognizer class listens to audio.</p></div>

In [None]:
%%capture
! pip install SpeechRecognition

Instructions
<ul>
<li>Import the <code>speech_recognition</code> library as <code>sr</code>.</li>
<li>Set up an instance of the <code>Recognizer</code> class and save it to <code>recognizer</code>.</li>
<li>Set the <code>recognizer.energy_threshold</code> to 300.</li>
</ul>

In [3]:
# Importing the speech_recognition library
import speech_recognition as sr

# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

# Set the energy threshold
recognizer.energy_threshold = 300
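
To make the idea of an energy threshold more concrete, here is a minimal sketch of how a threshold can separate quiet chunks from louder, speech-like ones. It assumes energy is measured as the root-mean-square (RMS) of the samples; the helper names `rms` and `is_speech` and the sample values are hypothetical and are not part of the `speech_recognition` API.

```python
import math

def rms(samples):
    """Root-mean-square energy of a sequence of integer PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(samples, energy_threshold=300):
    """Treat a chunk as speech when its RMS energy exceeds the threshold."""
    return rms(samples) > energy_threshold

# Hypothetical sample values, chosen only for illustration
quiet_chunk = [10, -12, 8, -9] * 100           # low-amplitude background hiss
loud_chunk = [2000, -1800, 2200, -2100] * 100  # louder, speech-like signal

print(is_speech(quiet_chunk))  # False
print(is_speech(loud_chunk))   # True
```

With a threshold of 300, the quiet chunk (RMS of roughly 10) counts as silence and the loud chunk (RMS of roughly 2000) counts as speech. Raising the threshold makes the check less sensitive to background noise.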

### Using the Recognizer class

<div class=""><p>Now that you've created an instance of the <code>Recognizer</code> class, we'll use the <code>recognize_google()</code> method on it to access the Google web speech API and turn spoken language into text.</p>
<p><code>recognize_google()</code> requires an <code>audio_data</code> argument; without it, the call raises an error.</p>
<p>US English is the default language. If your audio file isn't in US English, you can change the language with the <code>language</code> argument. A list of language codes can be seen <a href="https://cloud.google.com/speech-to-text/docs/languages" target="_blank" rel="noopener noreferrer">here</a>.</p>
<p>An audio file containing English speech has been imported as <code>clean_support_call_audio</code>. You can <a href="https://assets.datacamp.com/production/repositories/4637/datasets/393a2f76d057c906de27ec57ea655cb1dc999fce/clean-support-call.wav" target="_blank" rel="noopener noreferrer">listen to the audio file here</a>. SpeechRecognition has also been imported as <code>sr</code>.</p>
<p>To avoid hitting the API request limit of Google's web API, we've mocked the <code>Recognizer</code> class to work with our audio files. This means some functionality will be limited.</p></div>

In [5]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/clean-support-call.wav

In [14]:
clean_support_call = sr.AudioFile("clean-support-call.wav")
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source)

Instructions
<ul>
<li>Call the <code>recognize_google()</code> method on <code>recognizer</code> and pass it <code>clean_support_call_audio</code>.</li>
<li>Set the language argument to <code>"en-US"</code>.</li>
</ul>

In [15]:
# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

# Transcribe the support call audio
text = recognizer.recognize_google(
  audio_data=clean_support_call_audio, 
  language="en-US")

print(text)

hello I'd like to get some help setting up my account please


**You just transcribed your first piece of audio using speech_recognition's Recognizer class! Well, we've set up a mock version of Recognizer so we don't hit the API request limit. Notice how the 'hello' wasn't separate from the rest of the text. As powerful as recognize_google() is, it doesn't have sentence separation.**

## Reading audio files with SpeechRecognition

### From AudioFile to AudioData

<div class=""><p>As you saw earlier, there are some transformation steps we have to take to make our audio data useful. The same goes for SpeechRecognition. </p>
<p>In this exercise, we'll import the <code>clean_support_call.wav</code> <a href="https://assets.datacamp.com/production/repositories/4637/datasets/393a2f76d057c906de27ec57ea655cb1dc999fce/clean-support-call.wav" target="_blank" rel="noopener noreferrer">audio file</a> and get it ready to be recognized.</p>
<p>We first read our audio file using the <code>AudioFile</code> class. But the <code>recognize_google()</code> method requires an input of type <code>AudioData</code>.</p>
<p>To convert our <code>AudioFile</code> to <code>AudioData</code>, we'll use the <code>Recognizer</code> class's method <code>record()</code> along with a context manager. The <code>record()</code> method takes an <code>AudioFile</code> as input and converts it to <code>AudioData</code>, ready to be used with <code>recognize_google()</code>.</p>
<p>SpeechRecognition has already been imported as <code>sr</code>.</p></div>

Instructions
<ul>
<li>Pass <code>clean_support_call.wav</code> to the <code>AudioFile</code> class.</li>
<li>Use the context manager to open and read <code>clean_support_call</code> as <code>source</code>.</li>
<li>Record <code>source</code> and run the code.</li>
</ul>

In [17]:
# Instantiate Recognizer
recognizer = sr.Recognizer()

# Convert audio to AudioFile
clean_support_call = sr.AudioFile("clean-support-call.wav")

# Convert AudioFile to AudioData
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source)

# Transcribe AudioData to text
text = recognizer.recognize_google(clean_support_call_audio,
                                   language="en-US")
print(text)

hello I'd like to get some help setting up my account please


**You've gone end to end with SpeechRecognition: you've imported an audio file, converted it to the right data type, and transcribed it using Google's free web API! Now let's see a few more capabilities of the record() method.**

### Recording the audio we need

<div class=""><p>Sometimes you may not want the entire audio file you're working with. The <code>duration</code> and <code>offset</code> parameters of the <code>record()</code> method can help with this.</p>
<p>After exploring your dataset, you find there's one file, imported as <code>nothing_at_end</code>, which has <a href="https://assets.datacamp.com/production/repositories/4637/datasets/ca799cf2a7b093c06e1a5ae1dd96a49d48d65efa/30-seconds-of-nothing-16k.wav" target="_blank" rel="noopener noreferrer">30 seconds of silence at the end</a>, and a support call file, imported as <code>static_at_start</code>, which has <a href="https://assets.datacamp.com/production/repositories/4637/datasets/dbc47d8210fdf8de42b0da73d1c2ba92e883b2d2/static-out-of-warranty.wav" target="_blank" rel="noopener noreferrer">3 seconds of static at the front</a>.</p>
<p>Setting <code>duration</code> and <code>offset</code> means the <code>record()</code> method will record up to <code>duration</code> audio starting at <code>offset</code>. They're both measured in seconds.</p></div>
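
The relationship between `offset`, `duration`, and the audio that gets read can be sketched with Python's standard `wave` module. This is only an illustration of the arithmetic, not the library's implementation; `make_wav` and `window_seconds` are hypothetical helpers, and the 40-second silent file is synthetic.

```python
import io
import wave

SAMPLE_RATE = 16000  # frames per second for the synthetic file (assumption)

def make_wav(seconds, sample_rate=SAMPLE_RATE):
    """Build an in-memory mono 16-bit WAV of silence, `seconds` long."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer

def window_seconds(wav_file, offset=None, duration=None):
    """Read up to `duration` seconds starting at `offset`; return seconds read."""
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        total = wav.getnframes()
        # Skip to the offset, clamped to the end of the file
        start = min(int((offset or 0) * rate), total)
        wav.setpos(start)
        # With no duration, read everything that is left
        wanted = total - start if duration is None else int(duration * rate)
        frames = wav.readframes(wanted)
        bytes_per_frame = wav.getsampwidth() * wav.getnchannels()
        return len(frames) / bytes_per_frame / rate

print(window_seconds(make_wav(40), duration=10))             # 10.0
print(window_seconds(make_wav(40), offset=3))                # 37.0
print(window_seconds(make_wav(40), offset=35, duration=10))  # 5.0
```

The last call shows why `duration` is an upper bound: only 5 seconds remain after a 35-second offset, so only 5 seconds are read.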

In [19]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/30-seconds-of-nothing-16k.wav
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/static-out-of-warranty.wav

In [20]:
nothing_at_end = sr.AudioFile("30-seconds-of-nothing-16k.wav")
static_at_start = sr.AudioFile("static-out-of-warranty.wav")

Instructions 1/2
<p>Let's get the first 10 seconds of <code>nothing_at_end</code> and save it as <code>nothing_at_end_audio</code>. To do this, you can set <code>duration</code> to 10.</p>

In [21]:
# Convert AudioFile to AudioData
with nothing_at_end as source:
    nothing_at_end_audio = recognizer.record(source,
                                             duration=10,
                                             offset=None)

# Transcribe AudioData to text
text = recognizer.recognize_google(nothing_at_end_audio,
                                   language="en-US")

print(text)

this ODI fall has 30 seconds of nothing at the end of it


Note: "ODI" is a mistranscription of "audio".

Instructions 2/2
<p>Let's remove the first 3 seconds of static from <code>static_at_start</code> by setting <code>offset</code> to 3.</p>

In [22]:
# Convert AudioFile to AudioData
with static_at_start as source:
    static_at_start_audio = recognizer.record(source,
                                              duration=None,
                                              offset=3)

# Transcribe AudioData to text
text = recognizer.recognize_google(static_at_start_audio,
                                   language="en-US")

print(text)

hello I'd like to get some help with my device please I think it's out of warranty I bought it about 2 years ago


**Speech recognition can be resource intensive, so in practice, you'll want to explore your audio files to make sure you're not wasting any compute power trying to transcribe static or silence.**
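
One way to explore a file before transcribing it is to look for where the audible part ends. The sketch below uses only the standard library and assumes mono 16-bit WAV input. The helper names, the peak threshold of 300, and the synthetic tone-plus-silence signal are all hypothetical choices for illustration.

```python
import io
import math
import struct
import wave

RATE = 8000  # frames per second for the synthetic test signal

def tone_then_silence(tone_seconds, silence_seconds, rate=RATE):
    """Build an in-memory mono 16-bit WAV: a 440 Hz tone followed by silence."""
    tone = [int(8000 * math.sin(2 * math.pi * 440 * n / rate))
            for n in range(int(tone_seconds * rate))]
    samples = tone + [0] * int(silence_seconds * rate)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    buffer.seek(0)
    return buffer

def sound_end_seconds(wav_file, peak_threshold=300, chunk_seconds=0.1):
    """Return the time (in seconds) at which the last loud chunk ends."""
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        chunk_frames = int(chunk_seconds * rate)
        frames_read = 0
        last_loud_end = 0.0
        while True:
            frames = wav.readframes(chunk_frames)
            if not frames:
                break
            count = len(frames) // 2  # 2 bytes per mono 16-bit sample
            samples = struct.unpack(f"<{count}h", frames)
            frames_read += count
            # A chunk is "loud" if any sample exceeds the peak threshold
            if max(abs(s) for s in samples) > peak_threshold:
                last_loud_end = frames_read / rate
        return last_loud_end

print(sound_end_seconds(tone_then_silence(2, 3)))  # 2.0
```

A value found this way could then be passed as `duration` to `record()` so the trailing silence is never sent for transcription. Real recordings with background noise would likely need a higher threshold.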

## Dealing with different kinds of audio

### Different kinds of audio

<div class=""><p>Now you've seen an example of how the <code>Recognizer</code> class works. Let's try a few more. How about speech from a different language?</p>
<p>What do you think will happen when we call the <code>recognize_google()</code> function on a <a href="https://assets.datacamp.com/production/repositories/4637/datasets/cd9b801670d0664275cdbd3a24b6b70a8c2e5222/good-morning-japanense.wav" target="_blank" rel="noopener noreferrer">Japanese version of <code>good_morning.wav</code></a> (<code>japanese_audio</code>)? </p>
<p>The default language is <code>"en-US"</code>. Are the results the same with the <code>"ja"</code> tag?</p>
<p>How about non-speech audio? Like this <a href="https://assets.datacamp.com/production/repositories/4637/datasets/5720832b2735089d8e735cac3e0b0ad9b5114864/leopard.wav" target="_blank" rel="noopener noreferrer">leopard roaring</a> (<code>leopard_audio</code>).</p>
<p>Or speech where the sounds may not be real words, such as <a href="https://assets.datacamp.com/production/repositories/4637/datasets/e9fd46a06d74431e3baa942c489e1b119d85a233/charlie-bit-me-5.wav" target="_blank" rel="noopener noreferrer">a baby talking</a> (<code>charlie_audio</code>)?</p>
<p>To get more familiar with the <code>Recognizer</code> class, we'll look at an example of each of these.</p></div>

In [23]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/good-morning-japanense.wav
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/leopard.wav
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/charlie-bit-me-5.wav

In [33]:
japanese_file = sr.AudioFile("good-morning-japanense.wav")
leopard_file = sr.AudioFile("leopard.wav")
charlie_file = sr.AudioFile("charlie-bit-me-5.wav")

# Convert each AudioFile to AudioData in its own context manager
with japanese_file as source:
    japanese_audio = recognizer.record(source)
with leopard_file as source:
    leopard_audio = recognizer.record(source)
with charlie_file as source:
    charlie_audio = recognizer.record(source)

Instructions 1/4
<p>Pass the Japanese version of good morning (<code>japanese_audio</code>) to <code>recognize_google()</code> using <code>"en-US"</code> as the language.</p>

In [31]:
# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

# Pass the Japanese audio to recognize_google
text = recognizer.recognize_google(japanese_audio, language="en-US")

# Print the text
print(text)

ohayo gozaimasu


Instructions 2/4
<p>Pass the same Japanese audio (<code>japanese_audio</code>) using <code>"ja"</code> as the language parameter. Do you see a difference?</p>

In [34]:
# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

# Pass the Japanese audio to recognize_google
text = recognizer.recognize_google(japanese_audio, language="ja")

# Print the text
print(text)

おはようございます


Instructions 3/4
<p>What about non-speech audio? Pass <code>leopard_audio</code> to <code>recognize_google()</code> with <code>show_all</code> as <code>True</code>.</p>

In [35]:
# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

# Pass the leopard roar audio to recognize_google
text = recognizer.recognize_google(leopard_audio, 
                                   language="en-US", 
                                   show_all=True)

# Print the text
print(text)

[]


Instructions 4/4
<p>What if your speech files contain human sounds that aren't words? Pass <code>charlie_audio</code> to <code>recognize_google()</code> to find out.</p>

In [36]:
# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

# Pass charlie_audio to recognize_google
text = recognizer.recognize_google(charlie_audio, 
                                   language="en-US")

# Print the text
print(text)

charlie bit me


**You've seen how recognize_google() deals with different kinds of audio. It's worth noting that recognize_google() only returns words: it didn't return the baby saying 'ahhh!' because it doesn't recognize it as a word. Speech recognition has come a long way but it's far from perfect.**

### Multiple Speakers 1

<div class=""><p>If your goal is to transcribe conversations, there will be more than one speaker. However, as you'll see, the <code>recognize_google()</code> function will only transcribe speech into a single block of text.</p>
<p>You can hear in <a href="https://assets.datacamp.com/production/repositories/4637/datasets/925c8c31d6e4af9c291c692f13e4f41c7b5e86b2/multiple-speakers-16k.wav" target="_blank" rel="noopener noreferrer">this audio file</a> there are three different speakers.</p>
<p>But if you transcribe it on its own, <code>recognize_google()</code> returns a single block of text, which is still useful but doesn't tell you which speaker said what.</p>
<p>We'll see an alternative to this in the next exercise.</p>
<p>The multiple speakers audio file has been imported and converted to <code>AudioData</code> as <code>multiple_speakers</code>.</p></div>

In [37]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/multiple-speakers-16k.wav

In [38]:
multiple_speakers = sr.AudioFile("multiple-speakers-16k.wav")
with multiple_speakers as source:
    multiple_speakers = recognizer.record(source)

Instructions
<ul>
<li>Create an instance of <code>Recognizer</code>.</li>
<li>Recognize the <code>multiple_speakers</code> variable using the <code>recognize_google()</code> function.</li>
<li>Set the language to US English (<code>"en-US"</code>).</li>
</ul>

In [39]:
# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

# Recognize the multiple speaker AudioData
text = recognizer.recognize_google(multiple_speakers, 
                                   language="en-US")

# Print the text
print(text)

is that it doesn't recognize different speakers invoices it will just return it all as one block of text


Note: the transcript above is missing the opening words, "one of the limitations of the speech recognition...".

**But see how all of the speakers' speech came out in one big block of text?**

### Multiple Speakers 2

<div class=""><p>Distinguishing between multiple speakers in one audio file is called speaker diarization. However, as you've seen, the free function we've been using, <code>recognize_google()</code>, can't attribute speech to different speakers.</p>
<p>One way around this, without using one of the paid speech to text services, is to ensure your audio files are single speaker.</p>
<p>This means if you were working with phone call data, you would make sure the caller and receiver are recorded separately. Then you could transcribe each file individually.</p>
<p>In this exercise, we'll transcribe each of the speakers in our <a href="https://assets.datacamp.com/production/repositories/4637/datasets/925c8c31d6e4af9c291c692f13e4f41c7b5e86b2/multiple-speakers-16k.wav" target="_blank" rel="noopener noreferrer">multiple speakers audio file</a> individually.</p></div>

In [40]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/speaker_0.wav
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/speaker_1.wav
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/speaker_2.wav

Instructions
<ul>
<li>Pass <code>speakers</code> to the <code>enumerate()</code> function to loop through the different speakers.</li>
<li>Call <code>record()</code> on <code>recognizer</code> to convert the <code>AudioFile</code>s into <code>AudioData</code>.</li>
<li>Use <code>recognize_google()</code> to transcribe each of the <code>speaker_audio</code> objects.</li>
</ul>

In [41]:
recognizer = sr.Recognizer()

# Multiple speakers on different files
speakers = [sr.AudioFile("speaker_0.wav"), 
            sr.AudioFile("speaker_1.wav"), 
            sr.AudioFile("speaker_2.wav")]

# Transcribe each speaker individually
for i, speaker in enumerate(speakers):
    with speaker as source:
        speaker_audio = recognizer.record(source)
    print(f"Text from speaker {i}:")
    print(recognizer.recognize_google(speaker_audio,
                                      language="en-US"))

Text from speaker 0:
one of the limitations of the speech recognition Lottery
Text from speaker 1:
is that it doesn't recognize different speakers invoices
Text from speaker 2:
I'll just return it all as one block a text


Note: "Lottery" is a mistranscription of "library".

**Something to remember is I had to manually split the audio file into different speakers. You can see this solution still isn't perfect but it's easier to deal with than having a single block of text. You could think about automating this process in the future by having a model split the audio when it detects different speakers.**
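
If you already know roughly when each speaker starts and stops, the manual split could be scripted. The following is a minimal sketch using Python's standard `wave` module. `split_wav` and `duration_seconds` are hypothetical helpers, the boundary times are made up, and the 6-second silent WAV stands in for a real recording.

```python
import io
import wave

def split_wav(wav_file, boundaries):
    """Split a WAV into in-memory segments at the given times (in seconds)."""
    segments = []
    with wave.open(wav_file, "rb") as wav:
        params = wav.getparams()
        rate = wav.getframerate()
        total_seconds = wav.getnframes() / rate
        edges = [0.0, *boundaries, total_seconds]
        for start, end in zip(edges, edges[1:]):
            wav.setpos(int(start * rate))
            frames = wav.readframes(int((end - start) * rate))
            out = io.BytesIO()
            with wave.open(out, "wb") as segment:
                # Copy channel count, sample width and rate from the original;
                # the frame count in the header is corrected on close.
                segment.setparams(params)
                segment.writeframes(frames)
            out.seek(0)
            segments.append(out)
    return segments

def duration_seconds(wav_file):
    """Length of a WAV file in seconds."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

# Synthetic 6-second mono 16-bit WAV at 8 kHz, standing in for a real call
original = io.BytesIO()
with wave.open(original, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000 * 6)
original.seek(0)

# Hypothetical speaker changes at 2.0 s and 4.5 s
parts = split_wav(original, boundaries=[2.0, 4.5])
print([duration_seconds(part) for part in parts])  # [2.0, 2.5, 1.5]
```

Each segment could then be saved to its own file and transcribed like the `speaker_0.wav`, `speaker_1.wav`, and `speaker_2.wav` files above. Finding the boundaries automatically is the harder part and is not covered by this sketch.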

### Working with noisy audio

<div class=""><p>In this exercise, we'll start by transcribing a clean speech sample to text and then see what happens when we add some background noise.</p>
<p>A clean audio sample has been imported as <code>clean_support_call</code>.</p>
<p><a href="https://assets.datacamp.com/production/repositories/4637/datasets/393a2f76d057c906de27ec57ea655cb1dc999fce/clean-support-call.wav" target="_blank" rel="noopener noreferrer">Play clean support call</a>.</p>
<p>We'll then do the same with the noisy audio file saved as <code>noisy_support_call</code>. It has the same speech as <code>clean_support_call</code> but with additional background noise.</p>
<p><a href="https://assets.datacamp.com/production/repositories/4637/datasets/f3edd5024944eac2f424b592840475890c86d405/2-noisy-support-call.wav" target="_blank" rel="noopener noreferrer">Play noisy support call</a>.</p>
<p>To try and negate the background noise, we'll take advantage of <code>Recognizer</code>'s <code>adjust_for_ambient_noise()</code> function.</p></div>

Instructions 1/4
<p>Let's transcribe some clean audio. Read in <code>clean_support_call</code> as the source and call <code>recognize_google()</code> on the file.</p>

In [42]:
recognizer = sr.Recognizer()

# Record the audio from the clean support call
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source)

# Transcribe the speech from the clean support call
text = recognizer.recognize_google(clean_support_call_audio,
                                   language="en-US")

print(text)

hello I'd like to get some help setting up my account please


Instructions 2/4
<p>Let's do the same as before but with a noisy audio file saved as <code>noisy_support_call</code> and the <code>show_all</code> parameter set to <code>True</code>.</p>

In [45]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/spoken-language-processing-in-python/data/2-noisy-support-call.wav
noisy_support_call = sr.AudioFile("2-noisy-support-call.wav")

In [46]:
recognizer = sr.Recognizer()

# Record the audio from the noisy support call
with noisy_support_call as source:
    noisy_support_call_audio = recognizer.record(source)

# Transcribe the speech from the noisy support call
text = recognizer.recognize_google(noisy_support_call_audio,
                                   language="en-US",
                                   show_all=True)

print(text)

{'alternative': [{'transcript': "hello I'd like to get to help setting up my account", 'confidence': 0.89570123}, {'transcript': "hello I'd like to get some help setting up my account"}, {'transcript': "hello I'd like to get to help thinning out my account"}, {'transcript': "hello I'd like to get to help setting up my calendar"}, {'transcript': "hello I'd like to get to help setting up my account."}], 'final': True}
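
When `show_all=True`, the result is a dictionary of alternatives rather than a single string, and, as the leopard example showed, it can be an empty list when nothing is recognized. Here is a small sketch of a helper that handles both shapes. `best_transcript` is a hypothetical name, and the sample response is shortened from the output above.

```python
def best_transcript(response):
    """Return (transcript, confidence) for the top alternative, or (None, None)."""
    if not response:  # an empty list or dict means nothing was recognized
        return None, None
    alternatives = response.get("alternative", [])
    if not alternatives:
        return None, None
    # Only some alternatives carry a confidence score; treat the rest as 0.
    top = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return top["transcript"], top.get("confidence")

noisy_response = {
    "alternative": [
        {"transcript": "hello I'd like to get to help setting up my account",
         "confidence": 0.89570123},
        {"transcript": "hello I'd like to get some help setting up my account"},
    ],
    "final": True,
}

print(best_transcript(noisy_response))
print(best_transcript([]))  # (None, None)
```

Note that the highest-confidence alternative is not always the correct one: here the second alternative ("some help") matches what was actually said in the clean recording.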


Instructions 3/4
<p>Set the <code>duration</code> parameter of <code>adjust_for_ambient_noise()</code> to 1 (second) so <code>recognizer</code> adjusts for background noise.</p>

In [47]:
recognizer = sr.Recognizer()

# Record the audio from the noisy support call
with noisy_support_call as source:
    # Adjust the recognizer energy threshold for ambient noise
    recognizer.adjust_for_ambient_noise(source, duration=1)
    noisy_support_call_audio = recognizer.record(source)
 
# Transcribe the speech from the noisy support call
text = recognizer.recognize_google(noisy_support_call_audio,
                                   language="en-US",
                                   show_all=True)

print(text)

{'alternative': [{'transcript': "I'd like to get to help setting up my account", 'confidence': 0.83045131}, {'transcript': "I'd like to get to help setting up my calendar"}, {'transcript': "I'd like to get to help setting up my account."}, {'transcript': "I'd like to get some help setting up my account"}, {'transcript': "I'd like to get to help thinning out my account"}], 'final': True}


Instructions 4/4
<p>A <code>duration</code> of 1 was too long and it cut off some of the audio. Try setting <code>duration</code> to 0.5.</p>

In [48]:
recognizer = sr.Recognizer()

# Record the audio from the noisy support call
with noisy_support_call as source:
    # Adjust the recognizer energy threshold for ambient noise
    recognizer.adjust_for_ambient_noise(source, duration=0.5)
    noisy_support_call_audio = recognizer.record(source)
 
# Transcribe the speech from the noisy support call
text = recognizer.recognize_google(noisy_support_call_audio,
                                   language="en-US",
                                   show_all=True)

print(text)

{'alternative': [{'transcript': "hello I'd like to get to help setting up my account", 'confidence': 0.90365565}, {'transcript': "hello I'd like to get to help setting up my calendar"}, {'transcript': "hello I'd like to get to help setting up my account."}, {'transcript': "hello I'd like to get to help setting up my calculator"}, {'transcript': "hello I'd like to get to help setting up my account page"}], 'final': True}
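
A plausible reading of the missing "hello" in the `duration=1` result is that the calibration step reads its window from the same source that `record()` then continues from, so whatever is used for calibration never reaches the transcription. The sketch below illustrates that idea with the standard `wave` module. It is not the library's code; `make_wav` and `remaining_after_calibration` are hypothetical helpers working on a synthetic 4-second file.

```python
import io
import wave

RATE = 8000  # frames per second for the synthetic file (assumption)

def make_wav(seconds, rate=RATE):
    """Build an in-memory mono 16-bit WAV of silence, `seconds` long."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer

def remaining_after_calibration(wav_file, calibration_seconds):
    """Seconds of audio left once the calibration window has been consumed."""
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        # Frames read here stand in for the calibration window
        wav.readframes(int(calibration_seconds * rate))
        # Everything after that is what a later read would still see
        rest = wav.readframes(wav.getnframes())
        bytes_per_frame = wav.getsampwidth() * wav.getnchannels()
        return len(rest) / bytes_per_frame / rate

print(remaining_after_calibration(make_wav(4), 1))    # 3.0
print(remaining_after_calibration(make_wav(4), 0.5))  # 3.5
```

Under this reading, a shorter calibration window leaves more of the opening speech intact, which is consistent with "hello" reappearing when `duration` is 0.5.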


**Well, the results still weren't perfect. This should be expected with some audio files, though: sometimes the background noise is too much. If your audio files have a large amount of background noise, you may need to preprocess them with an audio tool such as Audacity before using them with speech_recognition.**