**Positioning.** Where SpeechRecognition fits among ASR toolkits.

| Library           | Origin/type         | Goal                     | Why pick it           | Notes                        |
| ----------------- | ------------------- | ------------------------ | --------------------- | ---------------------------- |
| CMU Sphinx        | Academic, classical | On-device ASR            | Offline option        | Older models                 |
| Kaldi             | Research/engine     | Custom ASR pipelines     | Power and flexibility | Steep setup                  |
| SpeechRecognition | Python wrapper      | Uniform API over engines | Low barrier to entry  | One interface, many backends |

**Setup.** How to install and versions mentioned.

| Step    | Command                         | Note                                  |
| ------- | ------------------------------- | ------------------------------------- |
| Install | `pip install SpeechRecognition` | Works with Python 3 (course uses Py3) |

**Recognizer essentials.** Core class and key knobs.

| Item                   | Purpose                          | Default/typical     | When to change                |
| ---------------------- | -------------------------------- | ------------------- | ----------------------------- |
| `sr.Recognizer()`      | Main ASR controller              | New instance        | Per task/audio                |
| `energy_threshold`     | Loudness cutoff for speech       | ~300 recommended    | Noisy/quiet clips             |
| Auto adjustment        | Adapts threshold while listening | Enabled in practice | Stabilize VAD on varied audio |
| `AudioFile` / `record` | Load WAV and capture audio       | Whole file or range | Segment long clips            |

**Engines wrapped.** Built-ins exposed by the same API.

| Method                   | Provider                     | Auth    | Cost/limits (noted)      | Typical use      |
| ------------------------ | ---------------------------- | ------- | ------------------------ | ---------------- |
| `recognize_google`       | Google Web Speech            | No key  | ~50 req/day; short clips | Quick free tests |
| `recognize_google_cloud` | Google Cloud Speech-to-Text  | API key | Paid, higher limits      | Production       |
| `recognize_bing`         | Microsoft Cognitive Services | API key | Paid                     | Production       |
| `recognize_wit`          | wit.ai                       | Token   | Free tier limits         | Prototyping      |

**Free Google notes.** Constraints highlighted in the lesson.

| Constraint  | Effect                  | Guidance                  |
| ----------- | ----------------------- | ------------------------- |
| Daily quota | Requests capped         | Keep trials small         |
| Duration    | Long clips may time out | Prefer clips under ~5 min |

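Because long clips may time out, one option is to split a file into fixed-length windows and transcribe each with `record(source, offset=..., duration=...)`. The sketch below is only an illustration: the helper names (`wav_duration_seconds`, `chunk_windows`, `transcribe_in_chunks`) and the 60-second default are assumptions, not part of the library, and it assumes a WAV input so the standard `wave` module can read its length.

```python
import wave


def wav_duration_seconds(path):
    """Length of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def chunk_windows(total_seconds, chunk_seconds):
    """Return (offset, duration) pairs that cover the whole clip."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    offset = 0.0
    while offset < total_seconds:
        # The last window is shortened so it does not run past the end.
        windows.append((offset, min(chunk_seconds, total_seconds - offset)))
        offset += chunk_seconds
    return windows


def transcribe_in_chunks(path, chunk_seconds=60, language="en-US"):
    """Transcribe a WAV file window by window (hypothetical helper)."""
    # Imported here so the two helpers above work without the package.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    texts = []
    for offset, duration in chunk_windows(wav_duration_seconds(path), chunk_seconds):
        # Reopen the file for each window so offset counts from the start.
        with sr.AudioFile(path) as source:
            audio = recognizer.record(source, offset=offset, duration=duration)
        texts.append(recognizer.recognize_google(audio, language=language))
    return " ".join(texts)
```

Each window is a separate request, so it counts against the daily quota, and a fixed cut can split a word in two; cutting at pauses would likely transcribe better.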

**Pipeline.** The flow you’ll repeat with different backends.

| Step      | Action                             | Output          |
| --------- | ---------------------------------- | --------------- |
| Ingest    | `AudioFile` → `record`             | `AudioData`     |
| Recognize | Call one of `recognize_*`          | Text hypothesis |
| Iterate   | Tune threshold, segmenting, engine | Higher accuracy |

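The three steps in the table could be wrapped in one function so that only the engine name changes between runs. This is a rough sketch, not the library's own API: `transcribe` and its `engine` argument are hypothetical, and any keys or tokens a backend needs are assumed to be passed through as keyword arguments.

```python
def transcribe(path, engine="google", **engine_kwargs):
    """Ingest an audio file and run one recognize_* method on it."""
    # Imported here so the function can be defined without the package.
    import speech_recognition as sr

    recognizer = sr.Recognizer()

    # Ingest: AudioFile -> record -> AudioData
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)

    # Recognize: look up recognize_<engine> on the Recognizer instance
    recognize = getattr(recognizer, f"recognize_{engine}", None)
    if recognize is None:
        raise ValueError(f"unknown engine: {engine!r}")
    return recognize(audio, **engine_kwargs)
```

For example, `transcribe("../data/audio.wav", language="en-US")` would use the free Google Web Speech backend, while other engines would also need their credentials passed as keyword arguments.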

# Recognize the speech in the audio data using Google's Web Speech API

In [None]:
# SpeechRecognition: load WAV, set threshold, transcribe with Google Web Speech
import speech_recognition as sr

In [None]:
r = sr.Recognizer()
r.energy_threshold = 300  # starting point; auto-adjusts while listening

In [32]:
# Load audio from file
file = "../data/audio.wav"  # AudioFile reads WAV, AIFF/AIFF-C, or FLAC; convert other formats (e.g. MP3) first
with sr.AudioFile(file) as source:
    audio = r.record(source)  # whole file; use record(source, offset=..., duration=...) to segment

In [33]:
text = r.recognize_google(audio, language="en-US")  # free, no API key
print(text)

we are looking at how to pronounce this word how do you say it correctly for reference this is the word of French origin and friendship is said as mayonnaise mayonnaise in English whether it is normally pronounced as mayonnaise mayonnaise


In [None]:
# show_all=True returns the raw API response: every alternative transcript with its confidence
text = r.recognize_google(audio, language="en-US", show_all=True)
print(text)

{'alternative': [{'transcript': 'looking at how to pronounce this word how do you say it correctly for reference this is a word of French origin and friendship is said as mayonnaise mayonnaise in English however it is normally pronounced as mayonnaise mayonnaise', 'confidence': 0.84518248}, {'transcript': 'looking at how to pronounce this word how do you say it correctly for reference this is the word of French origin and friendship is said as mayonnaise mayonnaise in English however it is normally pronounced as mayonnaise mayonnaise', 'confidence': 0.8304143}, {'transcript': 'looking at how to pronounce this word how do you say it correctly for reference this is a word of French origin and friendship is said as mayonnaise mayonnaise in englisch however it is normally pronounced as mayonnaise mayonnaise', 'confidence': 0.84518248}, {'transcript': 'looking at how to pronounce this word how do you say it correctly for reference this is the word of French origin and friendship is said as 

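With `show_all=True` the call returns the raw response rather than a string, so choosing a transcript is up to you. A minimal sketch, assuming the response has the shape printed above (a dict with an `alternative` list of `transcript`/`confidence` entries); `best_alternative` is a hypothetical helper, and treating a missing `confidence` as 0.0 is an assumption:

```python
def best_alternative(response):
    """Return (transcript, confidence) of the top-scoring alternative, or None."""
    # Anything that is not a dict with alternatives is treated as "no result".
    if not isinstance(response, dict):
        return None
    alternatives = response.get("alternative") or []
    if not alternatives:
        return None
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return best["transcript"], best.get("confidence")


# Shortened, made-up response in the same shape as the output above
sample = {"alternative": [
    {"transcript": "this is a word of French origin", "confidence": 0.845},
    {"transcript": "this is the word of French origin", "confidence": 0.830},
]}
print(best_alternative(sample))
```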
# Speech recognition

Some common engines and APIs:

- `recognize_bing()`: Microsoft Bing Voice Recognition
- `recognize_google()`: Google Web Speech API *(free, with a daily quota)*
- `recognize_google_cloud()`: Google Cloud Speech *(requires installing the Google Cloud SDK)*
- `recognize_houndify()`: Houndify by SoundHound
- `recognize_ibm()`: IBM Speech to Text

**Note on quotas**  
The free APIs allow only a limited number of requests per day.
If we need more volume or an SLA, it makes sense to pay for one of the provider's plans.


**Note:** you need to install FLAC.

```bash
sudo apt-get install flac
```

Or, on Windows, download the installer from https://xiph.org/flac/download.html.
It can also be installed with the command:
```bash
winget install Xiph.FLAC
```


# Different types of audio

What about the `language` argument? Each language and region has its own tag: US English is "en-US", Spanish (Spain) is "es-ES", and so on. Pass the tag that matches the audio.

What happens with non-speech audio, like music or noise?
If we pass in, for example, a leopard's roar, the recognizer still tries to match it to words. When it can't, it raises an error (`sr.UnknownValueError` in SpeechRecognition) instead of returning text.

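One way to handle this case without crashing is to catch the exception and return `None`. A minimal sketch, where `transcribe_or_none` is a hypothetical helper and `sr.UnknownValueError` is the exception SpeechRecognition raises for audio it cannot understand:

```python
def transcribe_or_none(recognizer, audio, language="en-US"):
    """Return the transcript, or None when no speech could be recognized."""
    # Imported here so the helper can be defined without the package.
    import speech_recognition as sr

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # Raised when the engine cannot match the audio to any words,
        # e.g. an animal sound or pure noise.
        return None
```

`transcribe_or_none(r, audio)` can then stand in for the direct `r.recognize_google(...)` call; other failures, such as network errors, still propagate.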

If the audio is noisy or has a lot of reverb, the recognizer has a harder time understanding it, so it helps to clean the audio before recognizing it. We can also try adjusting the energy threshold so the recognizer separates speech from background noise better.

Multiple speakers? The recognizer will still transcribe the speech, but it won't tell us who said what. For that we need a different approach, such as speaker diarization or speaker recognition. With several speakers in one file, the output could look like this:

```code
one of the limitations of the speech recognition library is that it doesn't recognise different speakers and voices it will just return it all as one block a text
```

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# One audio file per speaker (the split has to be done beforehand)
speakers = [sr.AudioFile("speaker_0.wav"),
            sr.AudioFile("speaker_1.wav"),
            sr.AudioFile("speaker_2.wav")]

# Transcribe each speaker's file individually
for i, speaker in enumerate(speakers):
    with speaker as source:
        speaker_audio = recognizer.record(source)
    print(f"Text from speaker {i}:")
    print(recognizer.recognize_google(speaker_audio, language="en-US"))
```

```code
Text from speaker 0:
one of the limitations of the speech recognition library

Text from speaker 1:
is that it doesn't recognise different speakers and voices

Text from speaker 2:
it will just return it all as one block of text
```
