<a href="https://colab.research.google.com/github/luis-arrieta/Building-Generative-AI-Powered-Applications-with-Python/blob/main/Business_AI_Meeting_Companion_STT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Learning Objectives ###

After finishing this lab, you will be able to:

* Create a Python script to generate text using a model from the Hugging Face Hub, identify some key parameters that influence the model's output, and have a basic understanding of how to switch between different LLM models.
* Use OpenAI's Whisper technology to accurately convert lecture recordings into text.
* Implement IBM Watson's AI to effectively summarize the transcribed lectures and extract key points.
* Create an intuitive and user-friendly interface using Hugging Face Gradio, ensuring ease of use for students and educators.

### Preparing the environment ###

Let's start with setting up the environment by creating a Python virtual environment and installing the required libraries, using the following commands in the terminal:

In [None]:
!pip3 install virtualenv
!virtualenv my_env # create a virtual environment my_env
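!source my_env/bin/activate # activate my_env (each ! command runs in its own shell, so in a notebook the activation does not carry over to later cells)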

Then, install the required libraries in the environment (this will take time ☕️☕️):

In [None]:
# installing required libraries in my_env
!pip install transformers==4.41.0 torch==2.6.0 gradio==5.23.2 langchain==0.3.25 ibm_watson_machine_learning==1.0.335 huggingface-hub==0.28.1

In [None]:
!sudo apt update

In [None]:
!sudo apt install ffmpeg -y

In [2]:
import requests
# URL of the audio file to be downloaded
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/Testing%20speech%20to%20text.mp3"
# Send a GET request to the URL to download the file
response = requests.get(url)
# Define the local file path where the audio file will be saved
audio_file_path = "downloaded_audio.mp3"
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # If successful, write the content to the specified local file path
    with open(audio_file_path, "wb") as file:
        file.write(response.content)
    print("File downloaded successfully")
else:
    # If the request failed, print an error message
    print("Failed to download the file")

File downloaded successfully


In [5]:
import torch
from transformers import pipeline
# Initialize the speech-to-text pipeline from Hugging Face Transformers
# This uses the "openai/whisper-tiny.en" model for automatic speech recognition (ASR)
# The `chunk_length_s` parameter specifies the chunk length in seconds for processing
pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-tiny.en",
  chunk_length_s=30,
)
# Define the path to the audio file that needs to be transcribed
sample = "downloaded_audio.mp3"  # same path the download cell wrote to
# Perform speech recognition on the audio file
# The `batch_size=8` parameter indicates how many chunks are processed at a time
# The result is stored in `prediction` with the key "text" containing the transcribed text
prediction = pipe(sample, batch_size=8)["text"]
# Print the transcribed text to the console
print(prediction)

Device set to use cpu
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


 Hello, I want to speak fast because I want to test four speech-to-text applications. Today, whether it's sunny, with a slight breeze, making it perfect for outdoor activity, later I plan for a busy local part, maybe even a picnic. The test is designed to assess the accuracy and responsiveness of the speech-to-text feature. Thank you for participating in this test.
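
To make the `chunk_length_s` setting more concrete, here is a minimal sketch of how a long recording could be cut into overlapping windows before transcription. It is a simplified model, not the pipeline's actual code: the helper name `chunk_windows` is hypothetical, and the overlap of one sixth of the chunk length on each side is an assumption based on the documented default of the pipeline's `stride_length_s` option.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=None):
    """Return (start, end) windows in seconds for chunked transcription.

    Simplified illustration only; the real pipeline works on audio samples.
    """
    if stride_s is None:
        # Assumption: overlap on each side defaults to chunk_length_s / 6
        stride_s = chunk_length_s / 6
    step = chunk_length_s - 2 * stride_s  # how far each new window advances
    if step <= 0:
        raise ValueError("stride_s is too large for this chunk length")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # the last window reached the end of the audio
            break
        start += step
    return windows


# A 70-second recording with 30-second chunks
for window in chunk_windows(70):
    print(window)
```

With `batch_size=8`, up to eight of these windows would be sent through the model together, which is why a short clip like the sample above is handled in a single batch.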


### Gradio interface - Creating a simple demo ###

Throughout this project, we will build several LLM applications with a Gradio interface. Let's get familiar with Gradio by creating a simple app:

Still in the project directory, create a Python file and name it hello.py.

Open hello.py, paste the following Python code and save the file.

In [None]:
!pip install gradio

In [8]:
import gradio as gr
def greet(name):
    return "Hello " + name + "!"
demo = gr.Interface(fn=greet, inputs="text", outputs="text")
demo.launch(server_name="0.0.0.0", server_port=7860)

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://607c2d84d1ce3db209.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




The above code creates a gradio.Interface called demo. It wraps the greet function with a simple text-to-text user interface that you could interact with.

The gradio.Interface class is initialized with 3 required parameters:

* fn: the function to wrap a UI around
* inputs: which component(s) to use for the input (e.g. “text”, “image” or “audio”)
* outputs: which component(s) to use for the output (e.g. “text”, “image” or “label”)

The last line, demo.launch(), launches a server to serve our demo.
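
Because `inputs` and `outputs` accept one or more components, the same pattern extends to functions with several arguments. The sketch below is an illustrative variation on the greeting demo: the `intensity` argument and the `build_demo` helper are hypothetical additions, and Gradio is imported inside the helper only so that `greet` can be tried on its own.

```python
def greet(name, intensity):
    # One exclamation mark per unit of intensity
    return "Hello, " + name + "!" * int(intensity)


def build_demo():
    # Imported here so greet() can be used without Gradio installed
    import gradio as gr

    # Two input components map, in order, onto the two arguments of greet()
    return gr.Interface(fn=greet, inputs=["text", "slider"], outputs="text")


print(greet("Ada", 3))
# To serve the UI: build_demo().launch()
```

Each entry in `inputs` is passed to the function positionally, so the text box feeds `name` and the slider feeds `intensity`.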

### Step 2: Creating audio transcription app ###

Create a new Python file named speech2text_app.py.

#### Exercise: Complete the transcript_audio function. ####

Using the transcription code from the first step, fill in the missing parts of the transcript_audio function.

In [10]:
import torch
from transformers import pipeline
import gradio as gr
# Function to transcribe audio using the OpenAI Whisper model
def transcript_audio(audio_file):
    # Initialize the speech recognition pipeline
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny.en",
        chunk_length_s=30,
    )
    # Transcribe the audio file and return the result
    result = pipe(audio_file, batch_size=8)["text"]
    return result
# Set up Gradio interface
audio_input = gr.Audio(sources="upload", type="filepath")  # Audio input
output_text = gr.Textbox()  # Text output
# Create the Gradio interface with the function, inputs, and outputs
iface = gr.Interface(fn=transcript_audio,
                     inputs=audio_input, outputs=output_text,
                     title="Audio Transcription App",
                     description="Upload the audio file")
# Launch the Gradio app
iface.launch(server_name="0.0.0.0", server_port=7861)

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://2b17d0ce23ccc71d7d.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




You can download the sample audio file we've provided by right-clicking on it in the file explorer and selecting "Download." Once downloaded, you can upload this file to the app. Alternatively, feel free to choose and upload any MP3 audio file from your local computer.

### Step 3: Integrating LLM: Using Llama 3 in WatsonX as LLM ###

#### Running simple LLM ####

Let's start by generating text with LLMs. Create a Python file and name it simple_llm.py.

Here's how the code works:

* Setting up credentials: The credentials needed to access IBM's services are pre-arranged by the Skills Network team, so you don't have to worry about setting them up yourself.

* Specifying parameters: The code then defines specific parameters for the language model. 'MAX_NEW_TOKENS' sets the limit on the number of tokens (word pieces, not whole words) the model can generate in one run. 'TEMPERATURE' adjusts how creative or predictable the generated text is: lower values make the output more deterministic, higher values more random.

* Setting up Llama 3 model: Next, the LLAMA3 model is set up using a model ID, the provided credentials, chosen parameters, and a project ID.

* Creating an object for Llama 3: A model object, LLAMA3_model, is created with the Model class from the model ID, credentials, parameters, and project ID. It is then passed to WatsonxLLM, which wraps it as the LangChain-compatible object llm used to interact with the Llama 3 model.

* Generating and printing response: Finally, 'llm' is used to generate a response to the question, "How to read a book effectively?" The response is then printed out.
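
To build some intuition for the temperature parameter described above, the following sketch applies the textbook temperature-scaled softmax to a handful of made-up token scores. It is an illustration of the general idea only: the function name and the scores are hypothetical, and nothing here is taken from how the hosted model actually samples.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature nearly all probability lands on the highest-scoring token, which is why a value such as 0.1 gives near-deterministic output; raising the temperature spreads probability across the alternatives.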

In [None]:
!pip install ibm-watson-machine-learning

In [None]:
!pip install -U langchain-community

In [None]:
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
my_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com"
}
params = {
        GenParams.MAX_NEW_TOKENS: 700, # The maximum number of tokens that the model can generate in a single run.
        GenParams.TEMPERATURE: 0.1,   # A parameter that controls the randomness of the token generation. A lower value makes the generation more deterministic, while a higher value introduces more randomness.
    }
LLAMA3_model = Model(
        model_id= 'meta-llama/llama-3-2-11b-vision-instruct',
        credentials=my_credentials,
        params=params,
        project_id="skills-network",
        )
llm = WatsonxLLM(LLAMA3_model)
print(llm.invoke("How to read a book effectively?"))

### Step 4: Put them all together ###

In this exercise, we'll set up a language model (LLM) instance, which could be IBM WatsonxLLM, HuggingFaceHub, or an OpenAI model. Then, we'll establish a prompt template. A prompt template is a reusable text pattern with placeholders that is filled in at run time to produce the prompt sent to the model, which helps keep the output organized (more information is available in the LangChain prompt template documentation).

Next, we'll develop a transcription function that employs the OpenAI Whisper model to convert speech to text. This function takes an audio file uploaded through a Gradio app interface (preferably in .mp3 format). The transcribed text is then fed into an LLMChain, which inserts the text into the prompt template and forwards the result to the chosen LLM. The final output from the LLM is then displayed in the Gradio app's output textbox.
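
The flow described above (transcribe, fill the template, call the LLM) can be sketched without any external services. In this minimal illustration, `fake_transcribe` and `fake_llm` are hypothetical stand-ins for the Whisper pipeline and the hosted model, and plain `str.format` stands in for the prompt template.

```python
TEMPLATE = (
    "List the key points with details from the context:\n"
    "The context : {context}"
)


def fake_transcribe(audio_path):
    """Hypothetical stand-in for the Whisper pipeline call."""
    return "The test is designed to assess the accuracy of the feature."


def fake_llm(prompt):
    """Hypothetical stand-in for the LLM; it only reports what it received."""
    return f"[model reply to a {len(prompt)}-character prompt]"


def summarize_audio(audio_path, transcribe=fake_transcribe, llm=fake_llm):
    transcript = transcribe(audio_path)           # 1. speech to text
    prompt = TEMPLATE.format(context=transcript)  # 2. fill the template
    return llm(prompt)                            # 3. send the prompt to the LLM


print(summarize_audio("downloaded_audio.mp3"))
```

Passing the two steps in as arguments keeps the flow easy to test; in the real app, the function handed to Gradio does the same three steps with the actual pipeline and chain.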

In [None]:
import torch
import os
import gradio as gr
#from langchain.llms import OpenAI
from langchain.llms import HuggingFaceHub
from transformers import pipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
my_credentials = {
    "url"    : "https://us-south.ml.cloud.ibm.com"
}
params = {
        GenParams.MAX_NEW_TOKENS: 800, # The maximum number of tokens that the model can generate in a single run.
        GenParams.TEMPERATURE: 0.1,   # A parameter that controls the randomness of the token generation. A lower value makes the generation more deterministic, while a higher value introduces more randomness.
    }
LLAMA3_model = Model(
        model_id= 'meta-llama/llama-3-2-11b-vision-instruct',
        credentials=my_credentials,
        params=params,
        project_id="skills-network",
        )
llm = WatsonxLLM(LLAMA3_model)
#######------------- Prompt Template-------------####
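# Prompt in the Llama 3 instruct chat format: the instruction goes in the
# system turn and the transcript in the user turn
temp = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

List the key points with details from the context:<|eot_id|><|start_header_id|>user<|end_header_id|>

The context : {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""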
pt = PromptTemplate(
    input_variables=["context"],
    template= temp)
prompt_to_LLAMA3 = LLMChain(llm=llm, prompt=pt)
#######------------- Speech2text-------------####
def transcript_audio(audio_file):
    # Initialize the speech recognition pipeline
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny.en",
        chunk_length_s=30,
    )
    # Transcribe the audio file
    transcript_txt = pipe(audio_file, batch_size=8)["text"]
    # Fill the prompt template with the transcript and send it to the LLM
    result = prompt_to_LLAMA3.run(transcript_txt)
    return result
#######------------- Gradio-------------####
audio_input = gr.Audio(sources="upload", type="filepath")
output_text = gr.Textbox()
iface = gr.Interface(fn= transcript_audio,
                    inputs= audio_input, outputs= output_text,
                    title= "Audio Transcription App",
                    description= "Upload the audio file")
iface.launch(server_name="0.0.0.0", server_port=7860)

### Conclusion ###

Congratulations on completing this project! You have now laid a solid foundation for combining speech-to-text transcription with Large Language Models (LLMs). Here's a quick recap of what you've accomplished:

* Text generation with LLM: You've created a Python script to generate text using a model from the Hugging Face Hub, learned about some key parameters that influence the model's output, and have a basic understanding of how to switch between different LLM models.

* Speech-to-text conversion: You've used OpenAI's Whisper technology to accurately convert lecture recordings into text.

* Content summarization: You've used IBM Watson's AI to summarize the transcribed lectures and extract key points.

* User interface development: You've built an intuitive and user-friendly interface with Hugging Face Gradio, ensuring ease of use for students and educators.