# Lesson 4 Project: Speech Recognition and Synthesis

## Introduction

Welcome to Lesson 4 of our course on cloud-based AI applications! Today, you're diving into the exciting world of speech technologies, focusing on speech recognition and speech synthesis.

In this lesson, you'll explore two powerful capabilities provided by OpenAI:
- Speech Recognition using the Whisper model
- Text-to-Speech (TTS) synthesis

By the end of this lesson, you will be able to:
- Implement speech recognition using OpenAI's Whisper model
- Utilize OpenAI's text-to-speech capabilities for audio synthesis
- Design a basic voice interaction feature in an application

You'll start by looking at how to convert spoken language into written text using the Whisper model. Then, you'll flip the process and learn how to generate natural-sounding speech from text. Finally, you'll combine these technologies to create a simple but powerful voice interaction feature.

Get ready to give your applications a voice and ears!

## Setting Up OpenAI Development Environment

Refer to the Python Crash Course lesson to learn how to set up your OpenAI development environment.

In [None]:
# Install the libraries
!pip install openai python-dotenv matplotlib librosa

# Load the OpenAI library
from openai import OpenAI

# Set up relevant environment variables
# Make sure OPENAI_API_KEY=... exists in .env
from dotenv import load_dotenv

load_dotenv()

# Create the OpenAI connection object
client = OpenAI()

## Implementing Speech Recognition Using OpenAI's Whisper Model

OpenAI's Whisper model is a powerful tool for speech recognition. First, you must prepare the audio files. You can record audio with your computer's microphone directly inside this Jupyter Notebook, or you can download free sample audio files from [Pixabay](https://pixabay.com/sound-effects/search/audio-files/).

In [None]:
# Download and load an audio file using librosa

# Import libraries
import io
import os

import librosa
import requests
from IPython.display import Audio, display

# URL of the sample audio file
speech_download_link = "https://cdn.pixabay.com/download/audio/2022/03/10/audio_a8e603753c.mp3?filename=self-destruct-sequence-31505.mp3"

# Local path where the audio file will be saved
save_path = "audio/self-destruct-sequence.mp3"

# Make sure the target folder exists before writing to it
os.makedirs(os.path.dirname(save_path), exist_ok=True)

# Download the audio file and stop with an error if the request fails
response = requests.get(speech_download_link, timeout=30)
response.raise_for_status()

# Save the audio file locally
with open(save_path, "wb") as file:
    file.write(response.content)

# Load the audio from memory using librosa
audio_data = io.BytesIO(response.content)
y, sr = librosa.load(audio_data)

# Display the audio file so it can be played
audio = Audio(data=y, rate=sr, autoplay=True)
display(audio)
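
Before sending a file to the API, it can help to check that it is in an accepted format and is not too large. The following is a minimal sketch, not part of the OpenAI library: the helper names `find_upload_problems` and `check_audio_file` are hypothetical, and the extension list and 25 MB limit are assumptions based on the API documentation at the time of writing, so check the current reference before relying on them.

```python
import os

# Assumption: formats and size limit as documented for the transcription
# endpoint at the time of writing. Verify against the current API reference.
SUPPORTED_EXTENSIONS = {
    ".flac", ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".ogg", ".wav", ".webm",
}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB


def find_upload_problems(file_name, size_bytes):
    """Return a list of problems for a file name and size (empty if none)."""
    problems = []
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"Unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("File is larger than 25 MB")
    return problems


def check_audio_file(file_path):
    """Check a file on disk by looking up its size first."""
    return find_upload_problems(file_path, os.path.getsize(file_path))
```

In the notebook, you could call `check_audio_file(save_path)` after the download and only continue to the transcription step if the returned list is empty.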

Create a function to play the audio file. This will be useful for confirming the content of the audio files before transcribing.

In [None]:
# Function to play the audio file

def play_speech(file_path):
    # Load the audio file using librosa
    y, sr = librosa.load(file_path)

    # Create an Audio object for playback
    audio = Audio(data=y, rate=sr, autoplay=True)

    # Display the audio player
    display(audio)

Now, use the Whisper model to transcribe the audio file to text.

In [None]:
# Transcribe the audio file using the Whisper model

with open(save_path, "rb") as audio_file:
    # Transcribe the audio file using the Whisper model
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="json"
    )
# Print the transcription result in JSON format
# (model_dump_json() avoids the deprecation warning that json() can raise)
print(transcription.model_dump_json())
# Print only the transcribed text
print(transcription.text)

To transcribe audio with the Whisper model, call the `client.audio.transcriptions.create` method. The `file` parameter is required and must be the audio file object itself, in a format such as flac, mp3, or wav, not just the file name. The `model` parameter is also required and specifies the ID of the transcription model. At the time this lesson was written, `whisper-1` was the only available model. The `response_format` parameter sets the format of the transcription output and defaults to `json`. As you can see, the `json` response format returns a concise result.

For more detail, try `verbose_json`, which adds information such as the detected language, the audio duration, and timestamps.

Another useful parameter, `timestamp_granularities`, returns timestamps at the word or segment level. It only takes effect when `response_format` is set to `verbose_json`.

You can experiment with both in the following code:

In [None]:
# Retrieve the detailed information with timestamps

with open(save_path, "rb") as audio_file:
    # Transcribe the audio file with word-level timestamps
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="verbose_json",
      timestamp_granularities=["word"]
    )

Now take a look at the verbose JSON result, which contains much more information than the `json` format.

In [None]:
# Print the full verbose JSON result, then only the transcribed text

import json

# model_dump_json() avoids the deprecation warning that json() can raise
json_result = transcription.model_dump_json()
print(json_result)

json_object = json.loads(json_result)
print(json_object["text"])

You can access detailed information about individual words in the transcription.

In [None]:
# Print the detailed information for words

# Print the detailed information for each word
print(transcription.words)
# Print the detailed information for the first two words
print(transcription.words[0])
print(transcription.words[1])
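
Word-level timestamps are useful for locating where something was said. The sketch below is a hedged illustration: `find_word_times` is a hypothetical helper, the `Word` dataclass is a stand-in that assumes each entry exposes `word`, `start`, and `end` attributes, and the sample values are made up rather than real API output.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Stand-in for an entry of `transcription.words` (assumed attributes)
    word: str
    start: float
    end: float


def find_word_times(words, target):
    """Return (start, end) pairs for every occurrence of `target`."""
    target = target.lower()
    return [
        (w.start, w.end)
        for w in words
        # Ignore case and trailing punctuation when comparing
        if w.word.strip(".,!?").lower() == target
    ]


# Made-up sample data for illustration, not real API output
sample_words = [
    Word("Self", 0.0, 0.4),
    Word("destruct", 0.4, 0.9),
    Word("sequence", 0.9, 1.5),
    Word("initiated.", 1.5, 2.2),
]

print(find_word_times(sample_words, "sequence"))  # [(0.9, 1.5)]
```

With a real result, you would pass `transcription.words` instead of `sample_words`.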

You can also obtain segment-level timestamps for the transcription.

In [None]:
# Retrieve the detailed information with segment-level timestamps

with open(save_path, "rb") as audio_file:
    # Transcribe the audio file with segment-level timestamps
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="verbose_json",
      timestamp_granularities=["segment"]
    )

You can access detailed information about segments in the transcription. A segment is a short stretch of speech, often roughly a sentence or a phrase.

In [None]:
# Print the detailed information for the first two segments
print(transcription.segments[0])
print(transcription.segments[1])
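
One common use of segment timestamps is building subtitles. The following sketch converts segments into SubRip (SRT) text. It assumes each segment exposes `start`, `end`, and `text` attributes; the `Segment` dataclass and the sample values are stand-ins for illustration, and `segments_to_srt` is a hypothetical helper. The API can also return subtitles directly through the `srt` response format, so treat this mainly as an exercise in reading segment data.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Stand-in for an entry of `transcription.segments` (assumed attributes)
    start: float
    end: float
    text: str


def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT text: a counter, a time range, and the text per block."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n"
            f"{segment.text.strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks
    return "\n".join(blocks)


# Made-up sample data for illustration, not real API output
sample_segments = [
    Segment(0.0, 2.5, " Hello there."),
    Segment(2.5, 5.0, " Welcome."),
]

print(segments_to_srt(sample_segments))
```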

Next, test the transcription with another audio file that contains words the model is likely to misspell.

In [None]:
# Load & play kodeco-speech.mp3 audio file

# Path to another audio file
ai_programming_audio_path = "audio/kodeco-speech.mp3"
# Play the audio file
play_speech(ai_programming_audio_path)

You'll hear Kodeco and RayWenderlich mentioned. Transcribe the speech again, this time with the `text` response format, which is simpler than the `json` response format: the result is just the transcription text.

In [None]:
# Transcribe the audio file with `text` response format

with open(ai_programming_audio_path, "rb") as audio_file:
    # Transcribe the audio file to text
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="text"
    )
# Print the transcribed text
print(transcription)

But the transcription is wrong. Kodeco and RayWenderlich are misspelled. Fortunately, you can guide the transcription process with the `prompt` parameter.

In [None]:
# Transcribe the audio file with a prompt to improve accuracy

with open(ai_programming_audio_path, "rb") as audio_file:
    # Transcribe the audio file with a prompt to improve accuracy
    transcription = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file,
      response_format="text",
      prompt="Kodeco,RayWenderlich"
    )
# Print the transcribed text
print(transcription)

Now the names are spelled correctly. The `prompt` parameter lets you guide the transcription with text, which is especially useful for continuing a previous audio segment or matching a specific style. In this case, you used it to correct the spelling of specific words.
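
If you have many special terms, you may want to build the prompt from a glossary and then check whether the terms actually appear in the result. This is a minimal sketch under stated assumptions: `build_vocabulary_prompt` and `missing_terms` are hypothetical helpers, and the sample transcript is made up for illustration.

```python
def build_vocabulary_prompt(terms):
    """Join unique, non-empty terms into a comma-separated prompt string."""
    seen = set()
    unique_terms = []
    for term in terms:
        cleaned = term.strip()
        # Skip blanks and case-insensitive duplicates, keeping the first spelling
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            unique_terms.append(cleaned)
    return ", ".join(unique_terms)


def missing_terms(transcript, terms):
    """Return the terms that do not appear in the transcript (case-insensitive)."""
    lowered = transcript.lower()
    return [term for term in terms if term.lower() not in lowered]


glossary = ["Kodeco", "RayWenderlich"]
prompt = build_vocabulary_prompt(glossary + ["kodeco", " "])
print(prompt)  # Kodeco, RayWenderlich

# Made-up transcript for illustration, not real API output
print(missing_terms("Welcome to Kodeco.", glossary))  # ['RayWenderlich']
```

You could pass `prompt` as the `prompt` argument of the transcription call and run `missing_terms` on the returned text to see which names still need attention.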

There is another parameter, `temperature`, which controls how deterministic or random the transcription is. It takes a value between 0 and 1: lower values (like 0.2) make the transcription more focused, while higher values introduce more randomness. You can experiment with it on a longer audio file if you're curious.

## Translation

Besides transcription, you can also translate an audio file directly into English text. Right now, English is the only supported target language.

To start, listen to the Japanese audio file.

In [None]:
# Load & play japanese-speech.mp3 audio file

# The speech in Japanese: いらっしゃいませ。ラーメン屋へようこそ。何をご注文なさいますか？
# Path to the Japanese audio file
japanese_audio_path = "audio/japanese-speech.mp3"
# Play the Japanese audio file
play_speech(japanese_audio_path)

To translate, use the `client.audio.translations.create` method. The `model`, `file`, and `response_format` parameters work the same way as in the `client.audio.transcriptions.create` method.

In [None]:
# Translate the Japanese audio to English text

with open(japanese_audio_path, "rb") as audio_file:
    # Translate the Japanese audio to English text
    translation = client.audio.translations.create(
      model="whisper-1", 
      file=audio_file,
      response_format="text"
    )
# Print the translated text
print(translation)

## Using OpenAI's Text-To-Speech (TTS) Capabilities for Audio Synthesis

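To synthesize speech from text, use the `client.audio.speech.create` method. The code below calls it through `with_streaming_response` so that the audio is streamed straight to a file.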

In [None]:
# Generate speech from text using OpenAI's TTS model

# Path to save the synthesized speech
speech_file_path = "audio/learn-ai.mp3"

# Generate speech from text using OpenAI's TTS model
with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="alloy",
  input="Would you like to learn AI programming? We have many AI programming courses that you can choose."
) as response:
  # Save the synthesized speech to the specified path
  response.stream_to_file(speech_file_path)

The `model` parameter is set to `tts-1`, a text-to-speech model optimized for speed. If you care more about quality, you can use `tts-1-hd` instead. The `voice` parameter is set to `alloy`; it determines voice characteristics such as tone and accent. Other choices include `echo`, `fable`, `onyx`, `nova`, and `shimmer`. Finally, the `input` parameter contains the text that you want to convert to speech.
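
The `input` text has a length limit. The sketch below assumes the documented maximum of 4096 characters at the time of writing (check the current API reference) and splits longer text into chunks at sentence boundaries. `split_for_tts` is a hypothetical helper, not part of the OpenAI library.

```python
import re


def split_for_tts(text, max_chars=4096):
    """Split text into chunks of at most `max_chars`, preferring sentence ends."""
    # Split after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# A small limit makes the behavior easy to see
print(split_for_tts("One. Two! Three?", max_chars=10))  # ['One. Two!', 'Three?']
```

Each chunk could then be sent in its own speech request and saved to a separate file.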

Play the synthesized speech to hear the output.

In [None]:
# Play the synthesized speech
play_speech(speech_file_path)

Nice! You've created synthesized speech.

You can experiment with other values for the `voice` parameter. Another parameter, `speed`, controls the speaking rate: a value less than 1 slows the speech down, and a value greater than 1 speeds it up. The accepted range is 0.25 to 4.0, and the default is 1.

In [None]:
# Generate speech with a different voice and slower speed

# Use the streaming response so the audio is written straight to the file;
# calling stream_to_file on a non-streaming response is deprecated
with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="echo",
  speed=0.6,
  input="Would you like to learn AI programming? We have many AI programming courses that you can choose."
) as response:
  # Save the synthesized speech to the specified path
  response.stream_to_file(speech_file_path)

# Play the synthesized speech
play_speech(speech_file_path)

## Designing a Basic Voice Interaction Feature in an Application

Now, combine speech recognition and synthesis to create a simple language tutor application. The application transcribes the user's recorded English speech, checks whether the grammar is correct, and provides feedback using synthesized speech.

First, you will define a function to transcribe the recorded speech using the Whisper model.

In [None]:
# Define a function to transcribe the recorded speech

def transcript_speech(speech_filename="my_speech.m4a"):
    with open(speech_filename, "rb") as audio_file:
        # Open the audio file and transcribe using the Whisper model
        transcription = client.audio.transcriptions.create(
          model="whisper-1", 
          file=audio_file,
          response_format="json",
          language="en"
        )
    # Return the transcribed text
    return transcription.text

You will also need a function to check the grammar of the transcribed text using OpenAI's GPT model.

In [None]:
# Check the grammar of the transcribed text

def check_grammar(english_text):
    # Use GPT to check and correct the grammar of the input text
    response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are an English grammar expert."},
        {"role": "user", "content": f"Fix the grammar: {english_text}"}
      ]
    )
    # Extract and return the corrected grammar message
    message = response.choices[0].message.content
    return message

Next, you need a function to generate spoken feedback using the text-to-speech capability.

In [None]:
# Provide spoken feedback using TTS

def tell_feedback(grammar_feedback, speech_file_path="feedback_speech.mp3"):
    # Generate speech from the grammar feedback using TTS and stream it
    # straight to the specified path
    with client.audio.speech.with_streaming_response.create(
      model="tts-1",
      voice="alloy",
      input=grammar_feedback
    ) as response:
        # Save the synthesized speech to the specified path
        response.stream_to_file(speech_file_path)

    # Play the synthesized speech
    play_speech(speech_file_path)

Finally, put everything together in a function that handles the entire process, from transcribing the recorded audio to providing spoken feedback.

In [None]:
# Implement the grammar feedback application

def grammar_feedback_app(speech_filename):
    # Transcribe the recorded speech
    transcription = transcript_speech(speech_filename)
    print(transcription)
    # Check and correct the grammar of the transcription
    feedback = check_grammar(transcription)
    print(feedback)
    # Provide spoken feedback using TTS
    tell_feedback(feedback)

To create an audio file for speech input, you can use the Sound Recorder app on Windows, QuickTime on Mac, or a similar recording application on Linux. Once recorded, place the audio file in the `audio` folder and update the `wrong_grammar_audio` variable accordingly. Alternatively, you can use a provided audio sample containing a grammatically incorrect sentence, "My sister don't like to eat on night," for testing purposes.

In [None]:
# Set the audio file. Use the audio sample or record the audio yourself and place the file here.
wrong_grammar_audio = "audio/grammar-wrong.mp3"

You can listen to the audio input file first.

In [None]:
# Play the grammatically wrong audio file
play_speech(wrong_grammar_audio)

Then run the grammar feedback application.

In [None]:
# Run the grammar feedback application
grammar_feedback_app(wrong_grammar_audio)
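
If you want to test the flow without calling the API, one option is to pass the three steps in as functions. The sketch below is an illustrative variation, not a replacement for the function above: `run_feedback_pipeline` is a hypothetical name, and the lambdas are fake stand-ins that return fixed strings.

```python
def run_feedback_pipeline(speech_filename, transcribe, check, speak):
    """Run transcribe -> check -> speak and return the intermediate texts."""
    transcription = transcribe(speech_filename)
    feedback = check(transcription)
    speak(feedback)
    return transcription, feedback


# Fake steps for an offline dry run; they make no API calls
spoken = []
result = run_feedback_pipeline(
    "audio/grammar-wrong.mp3",
    transcribe=lambda path: "My sister don't like to eat on night.",
    check=lambda text: "My sister doesn't like to eat at night.",
    speak=spoken.append,
)

print(result)
print(spoken)
```

In the notebook, you could pass `transcript_speech`, `check_grammar`, and `tell_feedback` as the three steps to run the real pipeline.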