<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Speech Recognition Python Library</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(SPOKEN LANGUAGE PROCESSING IN PYTHON)</span></div>

## Table of Contents

1. [Introduction & Why SpeechRecognition?](#section-1)
2. [Getting Started: Installation](#section-2)
3. [The Recognizer Class](#section-3)
4. [Recognizing Speech (APIs)](#section-4)
5. [Basic Transcription Example](#section-5)
6. [Reading Audio Files](#section-6)
7. [From AudioFile to AudioData](#section-7)
8. [Transcribing AudioData](#section-8)
9. [Duration and Offset](#section-9)
10. [Handling Different Languages](#section-10)
11. [Non-Speech Audio & Errors](#section-11)
12. [Showing All Alternatives](#section-12)
13. [Handling Multiple Speakers](#section-13)
14. [Handling Noisy Audio](#section-14)
15. [Conclusion](#section-15)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Introduction & Why SpeechRecognition?</span><br>

Spoken language processing is a critical component of modern AI applications. While there are several libraries available for Python, the **SpeechRecognition** library stands out for its ease of use and compatibility.

### Existing Python Libraries for Speech
There are various tools available in the ecosystem. Below is a comparison of common libraries mentioned in the course material:

| Library | Developer/Origin | Note |
| :--- | :--- | :--- |
| **CMU Sphinx** | Carnegie Mellon University | Offline, older toolkit. |
| **Kaldi** | Kaldi Project | Powerful, research-focused. |
| **SpeechRecognition** | Anthony Zhang (Uberi) | **Focus of this notebook.** Wrapper for many engines. |
| **Wav2letter++** | Facebook AI Research | Fast, C++ based. |

We choose `SpeechRecognition` because it provides a simple, unified interface to access multiple speech recognition APIs (both online and offline).

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Getting Started: Installation</span><br>

To begin using the library, install it from PyPI. Older releases supported both Python 2 and Python 3, but current releases require Python 3, which is the standard for modern development.

### Installation Command



In [None]:
# Install the library using pip (run this in your terminal or a Jupyter cell with !)
# !pip install SpeechRecognition



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> Ensure you have a working internet connection if you plan to use web-based APIs like Google Speech Recognition. </div>
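
After installing, it can be useful to confirm which release is present, since behaviour can differ between versions. The following is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical, and the sketch assumes the distribution is registered under the name `SpeechRecognition`, as in the pip command above.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("SpeechRecognition")
if version is None:
    print("SpeechRecognition is not installed; run: pip install SpeechRecognition")
else:
    print(f"SpeechRecognition {version} is installed")
```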

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. The Recognizer Class</span><br>

The core component of the library is the `Recognizer` class. This class contains all the necessary functions to process audio and send it to recognition services.

### Initializing the Recognizer



In [None]:
# Import the SpeechRecognition library
import speech_recognition as sr

# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

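# Set the energy threshold
# This value determines how loud audio must be to be considered speech.
# The library's default is 300; raise it in noisy environments, lower it for quiet speakers.
# It matters when listening to a live source such as a microphone;
# record() on an audio file reads the audio regardless of this value.
recognizer.energy_threshold = 300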

print(f"Recognizer created with energy threshold: {recognizer.energy_threshold}")
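
To build intuition for what an energy threshold means, the sketch below computes the root-mean-square (RMS) amplitude of 16-bit PCM samples and compares it with a threshold. It is an illustration only: the helper names `rms_energy` and `exceeds_threshold` are hypothetical, and the library's own computation may differ in detail.

```python
import array
import math


def rms_energy(pcm_bytes):
    """RMS amplitude of 16-bit signed PCM samples (native byte order)."""
    samples = array.array("h", pcm_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def exceeds_threshold(pcm_bytes, energy_threshold=300):
    """True when the chunk is louder than the threshold, i.e. treated as speech."""
    return rms_energy(pcm_bytes) > energy_threshold


# Two synthetic chunks: one very quiet, one loud.
quiet_chunk = array.array("h", [10, -10] * 100).tobytes()
loud_chunk = array.array("h", [5000, -5000] * 100).tobytes()

print(exceeds_threshold(quiet_chunk))  # False: RMS is 10, below 300
print(exceeds_threshold(loud_chunk))   # True: RMS is 5000, above 300
```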



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Recognizing Speech (APIs)</span><br>

The `Recognizer` class has built-in functions that interact with various external Speech APIs. This allows you to switch backends easily without rewriting your audio processing code.

### Available Recognition Methods

| Method | Service | Note |
| :--- | :--- | :--- |
| `recognize_bing()` | Microsoft Bing Speech | Requires API Key |
| `recognize_google()` | Google Web Speech API | **Default / Free (limited)** |
| `recognize_google_cloud()` | Google Cloud Speech | Requires Credentials |
| `recognize_wit()` | Wit.ai | Requires API Key |

*   **Input**: An `AudioData` instance (named `audio_file` in the example in the next section).
*   **Output**: The transcribed text as a string.
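
Because each `recognize_*` method in the table takes the `AudioData` object as its first argument, the backend can be chosen by name at run time. The sketch below shows one way to do that. The helper `transcribe_with` and the stand-in class `FakeRecognizer` are hypothetical and exist only so the example runs without network access or API keys; which `recognize_*` methods a real `Recognizer` offers depends on the installed version.

```python
def transcribe_with(recognizer, audio_data, backend="google", **options):
    """Call recognizer.recognize_<backend>(audio_data, **options)."""
    method = getattr(recognizer, f"recognize_{backend}", None)
    if not callable(method):
        raise ValueError(f"This recognizer has no 'recognize_{backend}' method")
    return method(audio_data, **options)


class FakeRecognizer:
    """Stand-in for sr.Recognizer that returns canned strings."""

    def recognize_google(self, audio_data, language="en-US"):
        return f"[google/{language}] transcript of {audio_data}"

    def recognize_wit(self, audio_data, key):
        return f"[wit] transcript of {audio_data}"


fake = FakeRecognizer()
print(transcribe_with(fake, "clip-1", backend="google", language="en-GB"))
print(transcribe_with(fake, "clip-1", backend="wit", key="HYPOTHETICAL_KEY"))
```

With the real library, the same helper would be called with an `sr.Recognizer()` instance and an `AudioData` object in place of the stand-ins.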

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Basic Transcription Example</span><br>

In this section, we focus on `recognize_google()`, which is convenient for testing because the library ships with a default API key, so low-volume experiments work without registering a key of your own.

### Code Example



In [None]:
import speech_recognition as sr

# Instantiate Recognizer class
recognizer = sr.Recognizer()

# NOTE: In a real scenario, 'audio_file' would be an AudioData object 
# obtained from recording a file (see next sections).
# For this example, we assume the data is ready.

# Transcribe speech using Google Web API
# recognizer.recognize_google(audio_data=audio_file, language="en-US")

# Expected Output Example:
# "Learning speech recognition on DataCamp is awesome!"



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Reading Audio Files</span><br>

To process pre-recorded audio, we use the `AudioFile` class. This class acts as a file handle for `.wav`, `.aiff`, or `.flac` files.

### The AudioFile Class



In [None]:
import speech_recognition as sr

# Setup recognizer instance
recognizer = sr.Recognizer()

# Define the path to the audio file
# Note: Ensure "clean-support-call.wav" exists in your directory
audio_filename = "clean-support-call.wav"

# Read in audio file using the AudioFile class
# This creates a file object, but does not yet load the data into memory
clean_support_call = sr.AudioFile(audio_filename)

# Check type of clean_support_call
print(type(clean_support_call))



**Output:**
`<class 'speech_recognition.AudioFile'>`
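
Before handing a file to `AudioFile`, it can help to check that it really is a readable WAV file and how long it is. The sketch below uses only Python's standard `wave` module, not the SpeechRecognition library. The helper names `write_test_tone` and `describe_wav` and the file name `example_tone.wav` are hypothetical; the tone is generated only so that there is a file to inspect.

```python
import math
import os
import struct
import tempfile
import wave


def write_test_tone(path, seconds=1.0, sample_rate=16000, frequency=440.0):
    """Write a mono, 16-bit sine tone so there is a file to inspect."""
    n_samples = int(seconds * sample_rate)
    frames = b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * frequency * i / sample_rate)))
        for i in range(n_samples)
    )
    with wave.open(path, "wb") as wav_out:
        wav_out.setnchannels(1)
        wav_out.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(frames)


def describe_wav(path):
    """Return basic properties of a WAV file."""
    with wave.open(path, "rb") as wav_in:
        return {
            "channels": wav_in.getnchannels(),
            "sample_rate": wav_in.getframerate(),
            "sample_width_bytes": wav_in.getsampwidth(),
            "duration_seconds": wav_in.getnframes() / wav_in.getframerate(),
        }


demo_path = os.path.join(tempfile.mkdtemp(), "example_tone.wav")
write_test_tone(demo_path, seconds=1.5)
wav_info = describe_wav(demo_path)
print(wav_info)
```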

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. From AudioFile to AudioData</span><br>

You cannot pass the `AudioFile` object directly to the recognizer. You must first "record" the data from the file into an `AudioData` object.

### The `record()` Method

If you try to pass the file directly, you will get an error:
> `AssertionError: audio_data must be audio data`

Here is the correct way to convert the file to data:



In [None]:
# Convert from AudioFile to AudioData using a context manager
with clean_support_call as source:
    # Record the audio from the source file
    clean_support_call_audio = recognizer.record(source)

# Check the type of the resulting object
print(type(clean_support_call_audio))



**Output:**
`<class 'speech_recognition.AudioData'>` (the exact module path shown may differ between library versions)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Transcribing AudioData</span><br>

Now that we have the `AudioData` object, we can pass it to the Google API for transcription.



In [None]:
# Transcribe clean support call
text = recognizer.recognize_google(audio_data=clean_support_call_audio)

print(text)



**Output:**
`hello I'd like to get some help setting up my account please`

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 9. Duration and Offset</span><br>

The `record()` method can read a specific part of an audio file using two parameters, both measured in seconds: `duration` (how much audio to read) and `offset` (how much to skip from the start before reading). By default, both are `None`, meaning the whole file is read.

### Using Duration and Offset



In [None]:
# Example 1: Leave duration and offset as default (reads whole file)
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source, duration=None, offset=None)

# Example 2: Get first 2 seconds of clean support call
with clean_support_call as source:
    # Record only the first 2.0 seconds
    clean_support_call_audio_short = recognizer.record(source, duration=2.0)

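# Transcribe the short segment
text_short = recognizer.recognize_google(clean_support_call_audio_short)
print(text_short)

# Example 3: Skip the first 2 seconds, then record the following 2 seconds
with clean_support_call as source:
    clean_support_call_audio_offset = recognizer.record(source, offset=2.0, duration=2.0)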



**Output:**
`hello I'd like to get`
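
To make the meaning of the two parameters concrete, the sketch below computes which time span of a file a given `offset` and `duration` select. It is plain arithmetic rather than the library's implementation, and the helper name `segment_bounds` is hypothetical. The library reads audio in buffers, so the boundaries it actually uses may be approximate.

```python
def segment_bounds(total_seconds, offset=None, duration=None):
    """Return (start, end) in seconds for the span that would be read."""
    # No offset means reading starts at the beginning of the file.
    start = min(offset or 0.0, total_seconds)
    # No duration means reading continues to the end of the file.
    if duration is None:
        end = total_seconds
    else:
        end = min(start + duration, total_seconds)
    return start, end


print(segment_bounds(10.0))                            # (0.0, 10.0): whole file
print(segment_bounds(10.0, duration=2.0))              # (0.0, 2.0): first 2 seconds
print(segment_bounds(10.0, offset=3.0, duration=2.0))  # (3.0, 5.0): skip 3, read 2
print(segment_bounds(10.0, offset=9.0, duration=5.0))  # (9.0, 10.0): clipped at the end
```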

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 10. Handling Different Languages</span><br>

The `recognize_google()` method accepts a `language` parameter, which defaults to US English (`"en-US"`). If the language code does not match the language being spoken, the API still tries to fit the sounds to words in the requested language, which often results in phonetic nonsense.

### Example: Japanese Audio

**Scenario 1: Wrong Language (English)**



In [None]:
# Create a Recognizer instance
recognizer = sr.Recognizer()

# Assume 'japanese_good_morning' is an AudioData object containing Japanese speech
# Pass the Japanese audio to recognize_google with English setting
# text = recognizer.recognize_google(japanese_good_morning, language="en-US")

# Print the text
# print(text)



**Output (Phonetic English):**
`Ohio gozaimasu` (Sounds like "Ohayo Gozaimasu" but transcribed as English words).

**Scenario 2: Correct Language (Japanese)**



In [None]:
# Pass the Japanese audio to recognize_google with Japanese setting ("ja")
# text = recognizer.recognize_google(japanese_good_morning, language="ja")

# Print the text
# print(text)



**Output:**
`おはようございます` (*ohayō gozaimasu*, "good morning"). With the correct language code, the API returns Japanese characters; if they show up as question marks, the display environment or font lacks support for them.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 11. Non-Speech Audio & Errors</span><br>

Speech recognition APIs are designed for human speech. If you feed them non-speech audio (like animal sounds), they may fail to produce a transcription.

### Example: Leopard Roar



In [None]:
# Import the leopard roar audio file
leopard_roar = sr.AudioFile("leopard_roar.wav")

# Convert the AudioFile to AudioData
with leopard_roar as source:
    leopard_roar_audio = recognizer.record(source)

# Attempt to recognize the AudioData
try:
    recognizer.recognize_google(leopard_roar_audio)
except sr.UnknownValueError:
    print("UnknownValueError: The API could not understand the audio.")



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> The <code>UnknownValueError</code> is raised when the speech recognition engine cannot match the audio to any known words. </div>
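
Web-based recognition can also fail for reasons unrelated to the audio, such as a missing network connection, in which case the library raises `RequestError`. The sketch below wraps both cases in one helper that returns `None` instead of raising. The helper name `transcribe_or_none` and the two stand-in recognizer classes are hypothetical. The fallback exception classes are there only so the sketch runs where the library is not installed.

```python
try:
    from speech_recognition import RequestError, UnknownValueError
except ImportError:
    # Stand-ins so the sketch can run where the library is not installed.
    class UnknownValueError(Exception):
        pass

    class RequestError(Exception):
        pass


def transcribe_or_none(recognizer, audio_data, **options):
    """Return the transcript, or None when no transcript could be produced."""
    try:
        return recognizer.recognize_google(audio_data, **options)
    except UnknownValueError:
        print("No intelligible speech was found in the audio.")
    except RequestError as error:
        print(f"The recognition service could not be reached: {error}")
    return None


class SilentClipRecognizer:
    """Stand-in that behaves like a recognizer given non-speech audio."""

    def recognize_google(self, audio_data, **options):
        raise UnknownValueError()


class OfflineRecognizer:
    """Stand-in that behaves like a recognizer without network access."""

    def recognize_google(self, audio_data, **options):
        raise RequestError("no network connection")


print(transcribe_or_none(SilentClipRecognizer(), "leopard roar audio"))
print(transcribe_or_none(OfflineRecognizer(), "any audio"))
```

With the real library, the first argument would be an `sr.Recognizer()` instance and the second an `AudioData` object such as `leopard_roar_audio`.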

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 12. Showing All Alternatives</span><br>

Sometimes the API is unsure about the transcription. You can use `show_all=True` to see the raw response from the API, which may include alternative transcriptions and confidence scores.

### Using `show_all=True`

**Example 1: Non-speech (Leopard)**



In [None]:
# Recognize the AudioData with show_all turned on
result = recognizer.recognize_google(leopard_roar_audio, show_all=True)
print(result)


**Output:** `[]` (Empty list, meaning no speech detected).

**Example 2: Ambiguous Speech (Japanese)**



In [None]:
# Recognizing Japanese audio with show_all=True
# text = recognizer.recognize_google(japanese_good_morning, language="en-US", show_all=True)
# print(text)



**Output (Dictionary with alternatives):**


In [None]:
{
    'alternative': [
        {'transcript': 'Ohio gozaimasu', 'confidence': 0.89041114},
        {'transcript': 'all hail gozaimasu'},
        {'transcript': 'ohayo gozaimasu'},
        {'transcript': 'olho gozaimasu'},
        {'transcript': 'all Hale gozaimasu'}
    ],
    'final': True
}
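
A response of this shape can be reduced to a single best guess with a few lines of plain Python. The sketch below assumes the structure shown above (an `'alternative'` list in which only some entries carry a `'confidence'`) and also handles the empty-list case from the leopard example. The helper name `best_alternative` is hypothetical.

```python
def best_alternative(response):
    """Return (transcript, confidence) for the most confident alternative.

    Returns (None, None) when the response holds no alternatives, for
    example the empty list returned for non-speech audio.
    """
    if not response or "alternative" not in response:
        return None, None
    alternatives = response["alternative"]
    if not alternatives:
        return None, None
    # Entries without a confidence score are treated as 0.0 for ranking.
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return best["transcript"], best.get("confidence")


sample_response = {
    "alternative": [
        {"transcript": "Ohio gozaimasu", "confidence": 0.89041114},
        {"transcript": "all hail gozaimasu"},
        {"transcript": "ohayo gozaimasu"},
    ],
    "final": True,
}

print(best_alternative(sample_response))  # ('Ohio gozaimasu', 0.89041114)
print(best_alternative([]))               # (None, None)
```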



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 13. Handling Multiple Speakers</span><br>

A major limitation of the standard `SpeechRecognition` library is that it does not perform **diarization** (distinguishing between different speakers). It returns all text as a single block.

### The Limitation



In [None]:
# Import an audio file with multiple speakers
multiple_speakers = sr.AudioFile("multiple-speakers.wav")

# Convert AudioFile to AudioData
with multiple_speakers as source:
    multiple_speakers_audio = recognizer.record(source)

# Recognize the AudioData
text = recognizer.recognize_google(multiple_speakers_audio)
print(text)



**Output:**
`one of the limitations of the speech recognition library is that it doesn't recognise different speakers and voices it will just return it all as one block of text`

### The Workaround: Split Files
To handle multiple speakers, you often need to pre-process the audio into separate files (e.g., `s0.wav`, `s1.wav`, `s2.wav`).



In [None]:
# Import audio files separately
speakers = [sr.AudioFile("s0.wav"), sr.AudioFile("s1.wav"), sr.AudioFile("s2.wav")]

# Transcribe each speaker individually
for i, speaker in enumerate(speakers):
    with speaker as source:
        speaker_audio = recognizer.record(source)
        print(f"Text from speaker {i}: {recognizer.recognize_google(speaker_audio)}")



**Output:**
*   Text from speaker 0: one of the limitations of the speech recognition library
*   Text from speaker 1: is that it doesn't recognise different speakers and voices
*   Text from speaker 2: it will just return it all as one block a text
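
The pre-processing step itself can be sketched with Python's standard `wave` module, assuming the times at which the speaker changes are already known (for example, noted by hand while listening). This is not diarization: the code only cuts the file at the given timestamps. The helper name `split_wav`, the file name `conversation.wav`, and the boundary values are hypothetical, and the demo uses three seconds of generated silence in place of a real recording.

```python
import os
import tempfile
import wave


def split_wav(path, boundaries_seconds, output_pattern):
    """Cut a WAV file at the given times and write one file per piece.

    output_pattern must contain '{}', which is replaced by the piece index.
    Returns the list of paths that were written.
    """
    piece_paths = []
    with wave.open(path, "rb") as source:
        params = source.getparams()
        rate = source.getframerate()
        total_frames = source.getnframes()
        # Convert cut times to frame positions, clamped to the file length.
        cuts = [min(int(b * rate), total_frames) for b in sorted(boundaries_seconds)]
        edges = [0] + cuts + [total_frames]
        for index, (start, end) in enumerate(zip(edges, edges[1:])):
            source.setpos(start)
            frames = source.readframes(end - start)
            piece_path = output_pattern.format(index)
            with wave.open(piece_path, "wb") as piece:
                piece.setparams(params)  # frame count is corrected on close
                piece.writeframes(frames)
            piece_paths.append(piece_path)
    return piece_paths


# Demo input: 3 seconds of silence, mono, 16-bit, 8 kHz.
work_dir = tempfile.mkdtemp()
conversation_path = os.path.join(work_dir, "conversation.wav")
with wave.open(conversation_path, "wb") as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)
    wav_out.setframerate(8000)
    wav_out.writeframes(b"\x00\x00" * 8000 * 3)

pieces = split_wav(conversation_path, [1.0, 2.0], os.path.join(work_dir, "s{}.wav"))
print([os.path.basename(p) for p in pieces])  # ['s0.wav', 's1.wav', 's2.wav']
```

Each resulting file can then be loaded with `sr.AudioFile` and transcribed separately, as in the loop above.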

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 14. Handling Noisy Audio</span><br>

Background noise can severely impact transcription accuracy. The `Recognizer` class includes a utility to adjust for ambient noise levels.

### Using `adjust_for_ambient_noise`



In [None]:
# Import audio file with background noise
noisy_support_call = sr.AudioFile("noisy_support_call.wav")

with noisy_support_call as source:
    # Adjust for ambient noise and record
    # The recognizer listens for 0.5 seconds to calibrate the energy threshold.
    # Those 0.5 seconds are consumed from the source, so record() starts after them.
    recognizer.adjust_for_ambient_noise(source, duration=0.5)
    
    # Now record the actual audio
    noisy_support_call_audio = recognizer.record(source)

# Recognize the audio
text = recognizer.recognize_google(noisy_support_call_audio)
print(text)



**Output:**
`hello ID like to get some help setting up my calories`

*(Note: Even with adjustment, noise can still cause errors. Compared with the clean recording, "I'd" was transcribed as "ID" and "account please" as "calories".)*
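
Since the audio read during calibration is not available to the following `record()` call, the choice of `duration` is a trade-off between better calibration and losing the start of the recording. One way to explore this is to transcribe the same file with several calibration lengths and compare the results. The sketch below does that. The helper name `compare_calibrations` and the two stand-in classes are hypothetical, used only so the example runs without audio files or network access.

```python
def compare_calibrations(recognizer, audio_file, durations=(0.0, 0.5, 1.0)):
    """Transcribe one file once per calibration length; return {seconds: text}."""
    transcripts = {}
    for seconds in durations:
        # Re-entering the context manager starts again from the beginning of the file.
        with audio_file as source:
            if seconds > 0:
                recognizer.adjust_for_ambient_noise(source, duration=seconds)
            audio = recognizer.record(source)
        transcripts[seconds] = recognizer.recognize_google(audio)
    return transcripts


class FakeAudioFile:
    """Stand-in for sr.AudioFile: only supports the 'with' statement."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


class FakeRecognizer:
    """Stand-in for sr.Recognizer that records which calibrations were requested."""

    def __init__(self):
        self.calibrations = []

    def adjust_for_ambient_noise(self, source, duration=1):
        self.calibrations.append(duration)

    def record(self, source):
        return "fake audio"

    def recognize_google(self, audio_data):
        return f"transcript after {len(self.calibrations)} calibration(s)"


fake_recognizer = FakeRecognizer()
results = compare_calibrations(fake_recognizer, FakeAudioFile())
for seconds, text in results.items():
    print(f"{seconds:.1f}s calibration -> {text}")
```

With the real library, the call would look like `compare_calibrations(sr.Recognizer(), sr.AudioFile("noisy_support_call.wav"))`; a clip that yields no transcript would raise `UnknownValueError`, which this sketch does not catch.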

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 15. Conclusion</span><br>

In this notebook, we explored the capabilities of the **SpeechRecognition** library in Python.

**Key Takeaways:**
1.  **Recognizer Class**: The central hub for managing audio and APIs.
2.  **AudioFile & AudioData**: You must convert file handles (`AudioFile`) into raw data (`AudioData`) using `record()` before transcription.
3.  **Flexibility**: The library supports multiple APIs (Google, Bing, Wit) and languages.
4.  **Limitations**: It does not natively handle speaker diarization (multiple speakers) and requires clean audio or noise adjustment for best results.
5.  **Debugging**: Use `show_all=True` to inspect raw API responses and confidence scores.

**Next Steps:**
*   Try recording your own voice using a microphone (using `sr.Microphone()` instead of `sr.AudioFile()`).
*   Experiment with different `energy_threshold` settings for noisy environments.
*   Integrate this into a larger NLP pipeline to analyze the sentiment of the transcribed text.
