# Lesson 4 Project: Speech Recognition and Synthesis

## Introduction

Welcome to Lesson 4 of our course on cloud-based AI applications! Today, you're diving into the exciting world of speech technologies, focusing on speech recognition and speech synthesis.

In this lesson, you'll explore two powerful capabilities provided by OpenAI:
- Speech Recognition using the Whisper model
- Text-to-Speech (TTS) synthesis

By the end of this lesson, you will be able to:
- Implement speech recognition using OpenAI's Whisper model
- Utilize OpenAI's text-to-speech capabilities for audio synthesis
- Design a basic voice interaction feature in an application

You'll start by looking at how to convert spoken language into written text using the Whisper model. Then, you'll flip the process and learn how to generate natural-sounding speech from text. Finally, you'll combine these technologies to create a simple but powerful voice interaction feature.

Get ready to give your applications a voice and ears!

## Setting Up OpenAI Development Environment

Refer to the Python Crash Course lesson to learn how to set up your OpenAI development environment.

In [41]:
# Install the libraries
!pip install openai python-dotenv matplotlib librosa

# Load the OpenAI library
from openai import OpenAI

# Set up relevant environment variables
# Make sure OPENAI_API_KEY=... exists in .env
from dotenv import load_dotenv

load_dotenv()

# Create the OpenAI connection object
client = OpenAI()
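
`OpenAI()` picks up the key from the `OPENAI_API_KEY` environment variable, which `load_dotenv()` fills in from `.env`. If the key is missing, the first API call fails with an authentication error. The sketch below shows one way to catch that earlier; the helper name `require_api_key` is made up for this example and is not part of the OpenAI library.

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from env, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set. Add {name}=... to your .env file.")
    return key
```

Calling `require_api_key()` right after `load_dotenv()` would stop the notebook with a readable message instead of a failed request later on.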


## Implementing Speech Recognition Using OpenAI's Whisper Model

OpenAI's Whisper model is a powerful tool for speech recognition. Before you can use it, you need an audio file to transcribe. You can record one with your computer's microphone directly inside this Jupyter Notebook, or download a free sample from [Pixabay](https://pixabay.com/sound-effects/search/audio-files/). The following cell downloads a sample clip, saves it locally, and plays it.

In [42]:
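import io
import os

import librosa
import requests
from IPython.display import Audio, display

speech_download_link = "https://cdn.pixabay.com/download/audio/2022/03/10/audio_a8e603753c.mp3?filename=self-destruct-sequence-31505.mp3"
save_path = "audio/self-destruct-sequence.mp3"

response = requests.get(speech_download_link)
if response.status_code == 200:
    audio_data = io.BytesIO(response.content)

    # Create the audio/ folder if needed, then keep a local copy for the transcription calls
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    with open(save_path, 'wb') as file:
        file.write(response.content)

    # Decode the in-memory MP3 into samples (y) and a sample rate (sr)
    y, sr = librosa.load(audio_data)

    # Play the clip inline in the notebook
    audio = Audio(data=y, rate=sr, autoplay=True)
    display(audio)
else:
    print(f"Download failed with status code {response.status_code}")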

Next, define a small helper that plays an audio file inside the notebook. You'll then send the downloaded file to the Whisper model for transcription.

In [51]:
def play_speech(file_path):
    y, sr = librosa.load(file_path)

    audio = Audio(data=y, rate=sr, autoplay=True)

    display(audio)

In [43]:
with open(save_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="json"
    )
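# model_dump_json() serializes the response object; .json() is deprecated in Pydantic v2
print(transcription.model_dump_json())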
print(transcription.text)

{"text":"Self-destruct sequence will initiate in 60 seconds. All personnel please exit immediately. Self-destruct sequence has initiated. This facility will self-destruct in 10, 9, 8, 7, 6, 5, 4, 3, 2, 1."}
Self-destruct sequence will initiate in 60 seconds. All personnel please exit immediately. Self-destruct sequence has initiated. This facility will self-destruct in 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.


In [44]:
with open(save_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="verbose_json",
      timestamp_granularities=["word"]
    )

In [45]:
import json

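# model_dump_json() serializes the response object; .json() is deprecated in Pydantic v2
json_result = transcription.model_dump_json()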
print(json_result)

{"text":"Self-destruct sequence will initiate in 60 seconds. All personnel please exit immediately. Self-destruct sequence has initiated. This facility will self-destruct in 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.","task":"transcribe","language":"english","duration":19.959999084472656,"words":[{"word":"Self","start":0.0,"end":0.6000000238418579},{"word":"destruct","start":0.6000000238418579,"end":0.7400000095367432},{"word":"sequence","start":0.7400000095367432,"end":1.340000033378601},{"word":"will","start":1.340000033378601,"end":1.7200000286102295},{"word":"initiate","start":1.7200000286102295,"end":2.0199999809265137},{"word":"in","start":2.0199999809265137,"end":2.3399999141693115},{"word":"60","start":2.3399999141693115,"end":2.799999952316284},{"word":"seconds","start":2.799999952316284,"end":3.799999952316284},{"word":"All","start":4.320000171661377,"end":4.440000057220459},{"word":"personnel","start":4.440000057220459,"end":4.900000095367432},{"word":"please","start":4.900000095367432,

In [46]:
json_object = json.loads(json_result)
print(json_object["text"])

Self-destruct sequence will initiate in 60 seconds. All personnel please exit immediately. Self-destruct sequence has initiated. This facility will self-destruct in 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.


In [47]:
print(transcription.words)

[{'word': 'Self', 'start': 0.0, 'end': 0.6000000238418579}, {'word': 'destruct', 'start': 0.6000000238418579, 'end': 0.7400000095367432}, {'word': 'sequence', 'start': 0.7400000095367432, 'end': 1.340000033378601}, {'word': 'will', 'start': 1.340000033378601, 'end': 1.7200000286102295}, {'word': 'initiate', 'start': 1.7200000286102295, 'end': 2.0199999809265137}, {'word': 'in', 'start': 2.0199999809265137, 'end': 2.3399999141693115}, {'word': '60', 'start': 2.3399999141693115, 'end': 2.799999952316284}, {'word': 'seconds', 'start': 2.799999952316284, 'end': 3.799999952316284}, {'word': 'All', 'start': 4.320000171661377, 'end': 4.440000057220459}, {'word': 'personnel', 'start': 4.440000057220459, 'end': 4.900000095367432}, {'word': 'please', 'start': 4.900000095367432, 'end': 5.420000076293945}, {'word': 'exit', 'start': 5.420000076293945, 'end': 5.78000020980835}, {'word': 'immediately', 'start': 5.78000020980835, 'end': 6.400000095367432}, {'word': 'Self', 'start': 7.519999980926514, 

In [48]:
print(transcription.words[0])
print(transcription.words[1])

{'word': 'Self', 'start': 0.0, 'end': 0.6000000238418579}
{'word': 'destruct', 'start': 0.6000000238418579, 'end': 0.7400000095367432}
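
Word-level timestamps make it possible to reason about timing, for example to find where the speaker pauses. The sketch below is one way to do that. It assumes each word is a dict with `word`, `start`, and `end` keys, as in the output above; newer SDK versions may return objects with attributes instead. The function name `find_pauses` and the `min_gap` threshold are illustrative choices, and the sample values are rounded from the output above.

```python
def find_pauses(words, min_gap=0.5):
    """Return (previous_word, next_word, gap_seconds) for gaps of at least min_gap."""
    pauses = []
    # Compare each word's end time with the next word's start time
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((prev["word"], nxt["word"], round(gap, 2)))
    return pauses

# Sample data rounded from the transcription output above
sample_words = [
    {"word": "in", "start": 2.02, "end": 2.34},
    {"word": "60", "start": 2.34, "end": 2.80},
    {"word": "seconds", "start": 2.80, "end": 3.80},
    {"word": "All", "start": 4.32, "end": 4.44},
]
print(find_pauses(sample_words))
# prints: [('seconds', 'All', 0.52)]
```

With the real response, you could pass `transcription.words` in place of `sample_words`, provided the items support `["start"]` style access.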


In [49]:
with open(save_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="verbose_json",
      timestamp_granularities=["segment"]
    )

In [50]:
print(transcription.segments[0])
print(transcription.segments[1])

{'id': 0, 'seek': 0, 'start': 0.0, 'end': 4.28000020980835, 'text': ' Self-destruct sequence will initiate in 60 seconds.', 'tokens': [50364, 16348, 12, 23748, 1757, 8310, 486, 31574, 294, 4060, 907, 10166, 13, 50578], 'temperature': 0.0, 'avg_logprob': -0.25105419754981995, 'compression_ratio': 1.433823585510254, 'no_speech_prob': 0.0036902977153658867}
{'id': 1, 'seek': 0, 'start': 4.28000020980835, 'end': 7.880000114440918, 'text': ' All personnel please exit immediately.', 'tokens': [50578, 1057, 14988, 1767, 11043, 4258, 13, 50758], 'temperature': 0.0, 'avg_logprob': -0.25105419754981995, 'compression_ratio': 1.433823585510254, 'no_speech_prob': 0.0036902977153658867}
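
Segment timestamps map naturally onto subtitles. As a rough sketch, the code below turns segments into SubRip (SRT) text. It assumes dict-like segments with `start`, `end`, and `text` keys, as printed above; `format_srt_time` and `segments_to_srt` are hypothetical helper names. The transcription endpoint can also return subtitles directly with `response_format="srt"`, so this is mainly useful when you want to post-process the segments first.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments):
    """Build SRT text: a counter, a time range, and the caption for each segment."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

# Sample data rounded from the segment output above
sample_segments = [
    {"start": 0.0, "end": 4.28, "text": " Self-destruct sequence will initiate in 60 seconds."},
    {"start": 4.28, "end": 7.88, "text": " All personnel please exit immediately."},
]
print(segments_to_srt(sample_segments))
```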


In [52]:
ai_programming_audio_path = "audio/kodeco-speech.mp3"
play_speech(ai_programming_audio_path)

In [53]:
with open(ai_programming_audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="text"
    )
print(transcription)

I'm learning AI programming through courses from Codico, which used to be known as Ray Wenderlich.



In [54]:
with open(ai_programming_audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="text",
      prompt="Kodeco,RayWenderlich"
    )
print(transcription)

I'm learning AI programming through courses from Kodeco, which used to be known as RayWenderlich.
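
The two runs above differ only in the `prompt` argument: listing the expected spellings nudged the model from "Codico" to "Kodeco". If you keep such names in a list, a small helper can build the prompt string. This is only a sketch; `build_spelling_prompt` is a made-up name, and joining with a comma and a space is a stylistic choice rather than a requirement of the API.

```python
def build_spelling_prompt(terms):
    """Join unique, non-empty terms into a comma-separated prompt string."""
    seen = []
    for term in terms:
        term = term.strip()
        # Skip blanks and repeated entries while keeping the original order
        if term and term not in seen:
            seen.append(term)
    return ", ".join(seen)

print(build_spelling_prompt(["Kodeco", "RayWenderlich", "Kodeco"]))
# prints: Kodeco, RayWenderlich
```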



In [56]:
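# Spoken Japanese text: いらっしゃいませ。ラーメン屋へようこそ。何をご注文なさいますか？
# Roughly: "Welcome. Welcome to the ramen shop. What would you like to order?"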
japanese_audio_path = "audio/japanese-speech.mp3"
play_speech(japanese_audio_path)

In [57]:
with open(japanese_audio_path, "rb") as audio_file:
    translation = client.audio.translations.create(
      model="whisper-1", 
      file=audio_file,
      response_format="text"
    )
print(translation)

Welcome. Welcome to the ramen shop. What would you like to order?



## Utilizing OpenAI's Text-To-Speech (TTS) Capabilities for Audio Synthesis

Now, let's explore OpenAI's text-to-speech capabilities. You'll call the speech endpoint to generate an audio file from text, then try a different voice and speaking speed.

In [58]:
speech_file_path = "audio/learn-ai.mp3"
response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input="Would you like to learn AI programming? We have many AI programming courses that you can choose."
)

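# write_to_file saves the full response; stream_to_file on this object triggers a deprecation warning
response.write_to_file(speech_file_path)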


In [59]:
play_speech(speech_file_path)

In [60]:
response = client.audio.speech.create(
  model="tts-1",
  voice="echo",
  speed=0.6,
  input="Would you like to learn AI programming? We have many AI programming courses that you can choose."
)

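# write_to_file saves the full response; stream_to_file on this object triggers a deprecation warning
response.write_to_file(speech_file_path)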

play_speech(speech_file_path)
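
The examples above send one short sentence, but the speech endpoint limits the length of `input` (4096 characters at the time of writing, to the best of my knowledge; check the current API reference). For longer text, one option is to split it into chunks and synthesize each chunk separately. The sketch below shows a simple sentence-based splitter; `split_for_tts` is a hypothetical helper, not part of the OpenAI library.

```python
import re

def split_for_tts(text, max_chars=4096):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # the sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed as `input` to `client.audio.speech.create` and saved to its own file.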


## Designing a Basic Voice Interaction Feature in an Application

Now, let's combine speech recognition and synthesis to create a simple language tutor application. The application records the user speaking English, transcribes the recording, asks a chat model to correct the grammar, and reads the feedback aloud using synthesized speech.

In [61]:
!pip install ipyaudioworklet




In [62]:
import ipyaudioworklet as ipyaudio

recorder = ipyaudio.AudioRecorder(filename="my_speech.wav")
recorder

AudioRecorder(filename='my_speech.wav')

In [64]:
import wave, numpy as np

_x = (recorder.audiodata * 32767.5).astype(dtype=np.int16)
with wave.open(recorder.filename, mode='wb') as wb:
     wb.setnchannels(1)
     wb.setsampwidth(_x.itemsize)
     wb.setframerate(recorder.sampleRate)
     wb.writeframes(_x.tobytes())
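
The cell above scales the recorder's samples to 16-bit integers and writes them with the `wave` module. To make that step concrete without a microphone or NumPy, here is a standard-library sketch of the same idea. It assumes the samples are floats in the range -1.0 to 1.0; unlike the cell above, it clips out-of-range values and scales by 32767 rather than 32767.5. The name `floats_to_wav_bytes` is made up for this example.

```python
import io
import struct
import wave

def floats_to_wav_bytes(samples, sample_rate):
    """Encode float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file in memory."""
    # Clip first so out-of-range values cannot overflow the 16-bit range
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        # Pack the integers as little-endian signed 16-bit values
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()
```

The returned bytes can be written to a `.wav` file or wrapped in `io.BytesIO` and opened again with `wave.open` to inspect the header.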

In [65]:
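def receive_audio_input(speech_filename="my_speech.wav"):
    """Create a recorder widget; display it and record before saving."""
    recorder = ipyaudio.AudioRecorder(filename=speech_filename)
    return recorder

def record_audio_input(recorder):
    """Save the recorder's captured samples as a mono 16-bit WAV file."""
    # Scale the recorded samples to the 16-bit integer range
    _x = (recorder.audiodata * 32767.5).astype(dtype=np.int16)
    with wave.open(recorder.filename, mode='wb') as wb:
        wb.setnchannels(1)
        wb.setsampwidth(_x.itemsize)
        wb.setframerate(recorder.sampleRate)
        wb.writeframes(_x.tobytes())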

In [91]:
def transcript_speech(speech_filename="my_speech.wav"):
    with open(speech_filename, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
          model="whisper-1", 
          file=audio_file,
          response_format="json",
          language="en"
        )
    return transcription.text

In [75]:
def check_grammar(english_text):
    response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are an English grammar expert."},
        {"role": "user", "content": f"Fix the grammar: {english_text}"}
      ]
    )
    message = response.choices[0].message.content
    return message

In [72]:
def tell_feedback(grammar_feedback, speech_file_path="feedback_speech.mp3"):
    response = client.audio.speech.create(
      model="tts-1",
      voice="alloy",
      input=grammar_feedback
    )

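    # write_to_file saves the full response; stream_to_file on this object triggers a deprecation warning
    response.write_to_file(speech_file_path)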
    play_speech(speech_file_path)

In [95]:
def grammar_feedback_app(recorder):
    record_audio_input(recorder)
    speech_filename = recorder.filename
    transcription = transcript_speech(speech_filename)
    print(transcription)
    feedback = check_grammar(transcription)
    print(feedback)
    tell_feedback(feedback)

In [70]:
recorder = receive_audio_input()
recorder

AudioRecorder(filename='my_speech.wav')

In [96]:
grammar_feedback_app(recorder)

The children is playing in the yard.
The correct sentence is: "The children are playing in the yard."
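
The application is a three-step chain: transcribe, check, speak. One way to make that chain easier to test without calling the API is to pass each step in as a function. The sketch below is an alternative arrangement, not the notebook's own implementation; `run_voice_pipeline` is a hypothetical name, and skipping the later steps when nothing was recognized is an added assumption.

```python
def run_voice_pipeline(audio_path, transcribe, check, speak):
    """Run transcribe -> check -> speak, skipping the later steps on empty speech."""
    text = transcribe(audio_path).strip()
    if not text:
        return None  # nothing was recognized, so there is nothing to check
    feedback = check(text)
    speak(feedback)
    return text, feedback
```

In this notebook, the call would look like `run_voice_pipeline(recorder.filename, transcript_speech, check_grammar, tell_feedback)` after the recording has been saved with `record_audio_input(recorder)`.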


