<a href="https://colab.research.google.com/github/mpfmorawski/pyconpl2023-speech-recognition/blob/main/pyconopl2023_speech_recognition_with_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content
During today's lecture we will test the following solutions:
1. [SpeechRecognition](https://github.com/Uberi/speech_recognition) (Python module supporting several speech-to-text engines and APIs)
2. [AssemblyAI](https://www.assemblyai.com/) (API)
3. [Whisper](https://github.com/openai/whisper) (speech-to-text model)
4. [Transformers](https://github.com/huggingface/transformers) (pretrained speech-to-text models)


# Environment setup

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!pip install contexttimer

In [None]:
import contexttimer
import warnings
warnings.filterwarnings("ignore")

# Audio files

Download the audio files from [this Google Drive folder](https://drive.google.com/drive/folders/1i-F-dVNvvMBG2TJEO2boT-3ihXnIRT4l?usp=sharing) or from the [repository](https://github.com/mpfmorawski/pyconpl2023-speech-recognition), then upload them to your personal Google Drive.

Then update the `PATH` variable below so that it points to the folder with the uploaded files:

In [None]:
PATH = "drive/MyDrive/PUT_YOUR_FOLDER_NAME_HERE"

In [None]:
import IPython

In [None]:
sentence = {"en": f"{PATH}/sentence_en.wav",
            "pl": f"{PATH}/sentence_pl.wav"}

command = {"en": f"{PATH}/command_en.wav",
           "pl": f"{PATH}/command_pl.wav"}

def display_audio_example_in_all_languages(example: dict) -> None:
  for language in example:
    print(language)
    IPython.display.display(IPython.display.Audio(example[language]))

In [None]:
print("Long sentence")
display_audio_example_in_all_languages(sentence)
print("\nShort command")
display_audio_example_in_all_languages(command)

# 1 - SpeechRecognition
[Speech recognition module](https://github.com/Uberi/speech_recognition) *for Python, supporting several engines and APIs, online and offline.*

Speech recognition engine/API support (based on its [README](https://github.com/Uberi/speech_recognition/blob/master/README.rst)):

* [CMU Sphinx](http://cmusphinx.sourceforge.net/wiki/) (works offline)
* Google Speech Recognition
* [Google Cloud Speech API](https://cloud.google.com/speech/)
* [Wit.ai](https://wit.ai/)
* [Microsoft Azure Speech](https://azure.microsoft.com/en-us/services/cognitive-services/speech)
* [Houndify API](https://houndify.com/)
* [IBM Speech to Text](http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html)
* [Snowboy Hotword Detection](https://snowboy.kitt.ai/) (works offline)
* [Tensorflow](https://www.tensorflow.org/)
* [Vosk API](https://github.com/alphacep/vosk-api/) (works offline)
* [OpenAI whisper](https://github.com/openai/whisper) (works offline)
* [Whisper API](https://platform.openai.com/docs/guides/speech-to-text)

Installation and import

In [None]:
!pip install SpeechRecognition

In [None]:
import speech_recognition as sr

## Speech recognition using speech_recognition module with Google Speech Recognition

In [None]:
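def transcribe_with_speech_recognition_module_and_google(audio_path: str) -> str:
  r = sr.Recognizer()
  # Read the whole audio file into memory as AudioData
  with sr.AudioFile(audio_path) as source:
    audio = r.record(source)

  try:
    # Send the audio to Google Speech Recognition (default language: en-US)
    return r.recognize_google(audio)
  except sr.UnknownValueError:
    # The service could not match the speech to any text
    return "Could not understand audio"
  except sr.RequestError as e:
    # The service was unreachable or rejected the request
    return f"Could not request results from service; {e}"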

Code source: https://github.com/Uberi/speech_recognition/blob/master/examples/audio_transcribe.py


## Testing

In [None]:
audio_path = sentence["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

with contexttimer.Timer() as t:
  transcription = transcribe_with_speech_recognition_module_and_google(audio_path)

print(f"\nReceived transcription:\n\n{transcription} \n\nExecution time: {t.elapsed:.2f} s")

In [None]:
audio_path = command["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

with contexttimer.Timer() as t:
  transcription = transcribe_with_speech_recognition_module_and_google(audio_path)

print(f"\nReceived transcription:\n\n{transcription} \n\nExecution time: {t.elapsed:.2f} s")
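
The test cells in this notebook all follow the same pattern: run a transcription function, measure how long it takes, and print both. A small helper could wrap that pattern. The sketch below uses `time.perf_counter` rather than `contexttimer` only to stay free of third-party dependencies; `timed_transcription` and `report` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Tuple


def timed_transcription(transcribe: Callable[[str], str],
                        audio_path: str) -> Tuple[str, float]:
    """Run `transcribe(audio_path)` and return (transcription, seconds elapsed)."""
    start = time.perf_counter()
    transcription = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return transcription, elapsed


def report(transcribe: Callable[[str], str], audio_path: str) -> None:
    """Print the transcription and timing in the same layout as the test cells."""
    transcription, elapsed = timed_transcription(transcribe, audio_path)
    print(f"Audio file: {audio_path}")
    print(f"\nReceived transcription:\n\n{transcription} \n\nExecution time: {elapsed:.2f} s")


# Demo with a dummy transcriber; in the notebook you could pass
# transcribe_with_speech_recognition_module_and_google instead.
report(lambda path: f"(transcription of {path})", "sentence_en.wav")
```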

## Usage of other speech recognition engines and APIs
Examples of using the speech_recognition module with other engines and APIs are available here: https://github.com/Uberi/speech_recognition/blob/master/examples/audio_transcribe.py

# 2 - AssemblyAI
[AssemblyAI](https://www.assemblyai.com/) *API exposes AI models for speech recognition, speaker detection, speech summarization, and more.*

Imports

In [None]:
import requests
import time

## Code template
The code below is the *Try the API* snippet from the main page of https://www.assemblyai.com/

In [None]:
endpoint = "https://api.assemblyai.com/v2/transcript"

json = {
  "audio_url": "https://storage.googleapis.com/bucket/b2c31290d9d8.wav"
}

headers = {
  "Authorization": "c2a41970d9d811ec9d640242ac12",
  "Content-Type": "application/json"
}

response = requests.post(endpoint, json=json, headers=headers)
parse(response)  # placeholder from the template; `parse` is not defined in this notebook

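Code analysis:
1. The template assumes the audio file is already hosted on the web, so local files have to be uploaded first.
2. You need to get your own AssemblyAI API key.
3. You need to analyze what data comes back in the response (the `parse` step).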

## Authorization


1. Go to: https://www.assemblyai.com/dashboard/signup
2. Sign up
3. Go to: https://www.assemblyai.com/app/account
4. Copy your API Key

In [None]:
ASSEMBLY_AI_API_KEY = "PUT_YOUR_ASSEMBLY_AI_API_KEY_HERE"
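
Pasting the key into a cell is fine for a quick test, but it is easy to leak when the notebook is shared. One alternative, sketched here as an assumption rather than a requirement of the API, is to read the key from an environment variable. The variable name `ASSEMBLY_AI_API_KEY` and the `load_api_key` helper are hypothetical.

```python
import os


def load_api_key(variable: str = "ASSEMBLY_AI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(variable, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {variable} is not set")
    return key


# Demo only: a placeholder value is set here so the snippet runs end to end.
os.environ.setdefault("ASSEMBLY_AI_API_KEY", "PUT_YOUR_ASSEMBLY_AI_API_KEY_HERE")
ASSEMBLY_AI_API_KEY = load_api_key()
```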

## Uploading audio files
Uploading local files for transcription, based on https://www.assemblyai.com/docs/walkthroughs#uploading-local-files-for-transcription

In [None]:
filename = sentence["en"]

In [None]:
UPLOAD_ENDPOINT = "https://api.assemblyai.com/v2/upload"
headers = {"authorization": ASSEMBLY_AI_API_KEY}
with open(filename, "rb") as f:
    response = requests.post(UPLOAD_ENDPOINT,
                             headers=headers,
                             data=f)

print(response.json())

In [None]:
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"

json = {
    "audio_url": response.json()["upload_url"]
}
headers = {
    "authorization": ASSEMBLY_AI_API_KEY,
}

response = requests.post(TRANSCRIPT_ENDPOINT,
                         json=json,
                         headers=headers)

print(response.json())

But wait... where is the transcription?

In [None]:
print(f"{response.json()['text']=}")
print(f"{response.json()['status']=}")

## Polling

In [None]:
polling_endpoint = f"{TRANSCRIPT_ENDPOINT}/{response.json()['id']}"

while True:
  response = requests.get(polling_endpoint, headers=headers).json()
  if response["status"] == "completed":
    break
  elif response["status"] == "error":
    raise RuntimeError(f"Transcription failed: {response['error']}")
  else:
    time.sleep(3)

print(response["text"])
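
The loop above keeps asking for the transcript until its status is `completed` or `error`, which means it never stops if the job stays in an intermediate state. A minimal sketch of the same polling idea with an upper time limit is shown below. The `wait_for_transcript` helper, its `fetch_status` callable and the `timeout` parameter are assumptions made for illustration, and the canned responses replace real API calls so the snippet runs offline.

```python
import time
from typing import Callable


def wait_for_transcript(fetch_status: Callable[[], dict],
                        interval: float = 3.0,
                        timeout: float = 300.0) -> dict:
    """Call `fetch_status` until the job completes, fails, or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_status()
        if response["status"] == "completed":
            return response
        if response["status"] == "error":
            raise RuntimeError(f"Transcription failed: {response['error']}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Transcription not ready after {timeout} s")
        time.sleep(interval)


# Demo with canned responses instead of real HTTP requests.
fake_responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "text": "hello world"},
])
result = wait_for_transcript(lambda: next(fake_responses), interval=0.0)
print(result["text"])
```

In the notebook, `fetch_status` could be `lambda: requests.get(polling_endpoint, headers=headers).json()`.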

## All components together


In [None]:
UPLOAD_ENDPOINT = "https://api.assemblyai.com/v2/upload"
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"

headers = {"authorization": ASSEMBLY_AI_API_KEY}

def upload_audio_file(filename: str) -> str:
  with open(filename, "rb") as f:
    response = requests.post(UPLOAD_ENDPOINT,
                             headers=headers,
                             data=f)
  return response.json()["upload_url"]


def make_transcription_request(audio_url: str) -> str:
  json = {"audio_url": audio_url}
  response = requests.post(TRANSCRIPT_ENDPOINT, json=json, headers=headers)
  return response.json()["id"]


def poll(transcript_id: str) -> dict:
  polling_endpoint = f"{TRANSCRIPT_ENDPOINT}/{transcript_id}"
  polling_response = requests.get(polling_endpoint, headers=headers)
  return polling_response.json()


def transcribe_with_assembly_ai(audio_path: str) -> str:
  audio_url = upload_audio_file(audio_path)
  transcription_id = make_transcription_request(audio_url)
  # Ask for the transcript every 3 seconds until it is ready or has failed
  while True:
    response = poll(transcription_id)
    if response["status"] == "completed":
      return response["text"]
    elif response["status"] == "error":
      raise RuntimeError(f"Transcription failed: {response['error']}")
    else:
      time.sleep(3)


Code source: https://www.assemblyai.com/docs/walkthroughs#uploading-local-files-for-transcription

## Testing

In [None]:
audio_path = sentence["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

with contexttimer.Timer() as t:
  transcription = transcribe_with_assembly_ai(audio_path)

print(f"\nReceived transcription:\n\n{transcription} \n\nExecution time: {t.elapsed:.2f} s")

In [None]:
audio_path = command["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

with contexttimer.Timer() as t:
  transcription = transcribe_with_assembly_ai(audio_path)

print(f"\nReceived transcription:\n\n{transcription} \n\nExecution time: {t.elapsed:.2f} s")

# 3 - OpenAI's Whisper
*Robust Speech Recognition via Large-Scale Weak Supervision*

Installation and import

In [None]:
!pip install -U openai-whisper

In [None]:
import whisper

## Importing models

In [None]:
english_only_models_names = ["tiny.en", "base.en", "small.en"]
# Note: You can use the 'medium.en' model too, but it is quite big (1.42G) - Google Colab sometimes crashes because of it
multilingual_models_names = ["tiny", "base", "small"]
# Note: You can use the 'medium' and 'large' models too, but...
# The 'medium' model is quite big (1.42G) - Google Colab sometimes crashes because of it
# The 'large' model is too large to even import in Google Colab

### Importing English-only models

In [None]:
english_only_models = [whisper.load_model(model_name) for model_name in english_only_models_names]

### Importing multilingual models


In [None]:
multilingual_models = [whisper.load_model(model_name) for model_name in multilingual_models_names]

## Speech recognition with OpenAI's Whisper

In [None]:
def transcribe_with_whipser(model, audio_path: str) -> dict:
  return model.transcribe(audio_path)

Code source: https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb
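
Besides `text` and `language`, the dictionary returned by `model.transcribe` also contains a `segments` list whose entries carry `start` and `end` times (in seconds) and the segment `text`. The sketch below prints such segments with timestamps. The `format_segments` helper and the hand-written `sample_result` are illustrative only, so the snippet runs without loading a model.

```python
def format_segments(result: dict) -> str:
    """Render each segment of a Whisper result as '[start - end] text'."""
    lines = []
    for segment in result.get("segments", []):
        lines.append(
            f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text'].strip()}"
        )
    return "\n".join(lines)


# Hand-written sample shaped like a Whisper result (not real model output).
sample_result = {
    "text": " Hello world. How are you?",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello world."},
        {"start": 1.5, "end": 3.0, "text": " How are you?"},
    ],
}
print(format_segments(sample_result))
```

In the notebook you could call `format_segments(transcribe_with_whipser(model, audio_path))` with any of the loaded models.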

## Testing

### Testing English-only models

In [None]:
def test_english_only_models(audio_path: str) -> None:
  for model_name, model in zip(english_only_models_names, english_only_models):
    with contexttimer.Timer() as t:
      result = transcribe_with_whipser(model, audio_path)
    print(f"Model: {model_name}\nReceived transcription: {result['text']} | Detected language: {result['language']} | Execution time: {t.elapsed:.2f} s")

In [None]:
audio_path = sentence["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

test_english_only_models(audio_path)

In [None]:
audio_path = command["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

test_english_only_models(audio_path)

### Testing multilingual base model with different languages

In [None]:
def test_multilingual_base_model(audio_path: str) -> None:
  base_model_index = 1
  with contexttimer.Timer() as t:
    result = transcribe_with_whipser(multilingual_models[base_model_index], audio_path)
  print(f"Model: base | Received transcription: {result['text']} | Detected language: {result['language']} | Execution time: {t.elapsed:.2f} s")

In [None]:
audio_path = sentence["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

test_multilingual_base_model(audio_path)

In [None]:
audio_path = sentence["pl"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

test_multilingual_base_model(audio_path)

In [None]:
audio_path = command["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

test_multilingual_base_model(audio_path)

In [None]:
audio_path = command["pl"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

test_multilingual_base_model(audio_path)

# 4 - Transformers
*Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.*

Installation and import

In [None]:
!pip install transformers

In [None]:
from transformers import pipeline

## Speech recognition with Transformers

In [None]:
def transcribe_with_transformers_pipeline(audio_path: str) -> str:
  transcriber = pipeline("automatic-speech-recognition",
                         model="facebook/wav2vec2-base-960h")
  transcription = transcriber(audio_path)
  return transcription["text"]

Code source: https://huggingface.co/docs/transformers/main/tasks/asr
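
`transcribe_with_transformers_pipeline` builds the pipeline on every call, so each measured time also includes loading the model. One way to separate the two, sketched here under the assumption that a single model is reused, is to cache the pipeline with `functools.lru_cache`. In this snippet `build_transcriber` is a hypothetical stand-in for the real `pipeline(...)` call so the example runs without downloading a model; `get_transcriber`, `transcribe_cached` and `load_calls` are likewise illustrative names.

```python
from functools import lru_cache

DEFAULT_MODEL = "facebook/wav2vec2-base-960h"
load_calls = []  # records how many times a transcriber was actually built


def build_transcriber(model_name: str):
    """Stand-in for pipeline("automatic-speech-recognition", model=model_name)."""
    load_calls.append(model_name)
    return lambda audio_path: {"text": f"<{model_name}: {audio_path}>"}


@lru_cache(maxsize=None)
def get_transcriber(model_name: str):
    """Build the transcriber once per model name and reuse it afterwards."""
    return build_transcriber(model_name)


def transcribe_cached(audio_path: str, model_name: str = DEFAULT_MODEL) -> str:
    return get_transcriber(model_name)(audio_path)["text"]


transcribe_cached("sentence_en.wav")
transcribe_cached("command_en.wav")
print(f"Transcribers built: {len(load_calls)}")
```

With the real library, the body of `build_transcriber` would return `pipeline("automatic-speech-recognition", model=model_name)`, and only the first call would pay the loading cost.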

## Testing

In [None]:
audio_path = sentence["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

with contexttimer.Timer() as t:
  transcription = transcribe_with_transformers_pipeline(audio_path)

print(f"\nReceived transcription:\n\n{transcription} \n\nExecution time: {t.elapsed:.2f} s")

In [None]:
audio_path = command["en"]

print(f"Audio file: {audio_path}")
IPython.display.display(IPython.display.Audio(audio_path))

with contexttimer.Timer() as t:
  transcription = transcribe_with_transformers_pipeline(audio_path)

print(f"\nReceived transcription:\n\n{transcription} \n\nExecution time: {t.elapsed:.2f} s")