# Creating course content with Whisper

## Goals 

Videos can be full of useful information, but getting hold of that info can be slow, since you need to watch the whole thing or try skipping through it. It can be much faster to use a bot to ask questions about the contents of the transcript.

In this project, you'll download a tutorial video from YouTube, transcribe the audio, and create a simple Q&A bot to ask questions about the content.

Along the way, you'll learn:

- The building blocks of multimodal AI projects
- Some of the fundamental concepts of LangChain
- How to use the Whisper API to transcribe audio to text
- How to combine LangChain and the Whisper API to ask questions about any YouTube video

## Before you begin

You'll need a developer account with [OpenAI](https://auth0.openai.com/u/signup/identifier?state=hKFo2SAyeTZBU1pzbUNWYWs3Wml5OWVvUVh4enZldC1LYU9PMaFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIDFUakNoUGFMLUdNWFpfQkpqdncyZjVDQk9xUTE4U0xDo2NpZNkgRFJpdnNubTJNdTQyVDNLT3BxZHR3QjNOWXZpSFl6d0Q) and an API key. Store the API secret key under 'Environment Variables' in the side menu. See the *getting-started.ipynb* notebook for details on setting this up.

The project requires several packages that need to be installed into Workspace.

- `langchain` is a framework for developing generative AI applications.
- `yt_dlp` lets you download YouTube videos.
- `tiktoken` converts text into tokens.
- `docarray` makes it easier to work with multimodal data (in this case mixing audio and text).

### Instructions

Run the following code to install the packages.

In [1]:
# Install langchain
#!pip install langchain==0.0.292

In [2]:
# Install yt_dlp
# !pip install yt_dlp==2023.7.6

In [3]:
# !pip install tiktoken==0.5.1

In [4]:
# !pip install docarray==0.38.0

## Task 0: Select the YouTube video to transcribe

In [5]:
# An example YouTube tutorial video
# youtube_url = input("Paste the YouTube URL: ")
youtube_url = "https://www.youtube.com/watch?v=tFXm5ijih98"

## Task 1: Import The Required Libraries 

For this project we need the `os` and the `yt_dlp` packages to download the YouTube video of your choosing, convert it to an `.mp3` and save the file. We will also be using the `openai` package to make easy calls to the OpenAI models we will use. 

Import the following packages.

- Import `os` and `glob`
- Import `openai`, and import `OpenAI` from `openai`
- Import `yt_dlp` with the alias `youtube_dl`
- From the `yt_dlp` package, import `DownloadError`
- Import `docarray`
- Assign `OPENAI_API_KEY` to `os.getenv("OPENAI_API_KEY")`

In [6]:
# Import the required packages: os, glob, openai, yt_dlp (as youtube_dl), DownloadError from yt_dlp, and docarray

# Import the os package 
import os

# Import glob
import glob

# Import the openai package 
import openai
from openai import OpenAI

# Import the yt_dlp package as youtube_dl
import yt_dlp as youtube_dl

# Import DownloadError from yt_dlp 
from yt_dlp import DownloadError

# Import Docarray
import docarray

We will also assign the variable `OPENAI_API_KEY` to the value of the environment variable `OPENAI_API_KEY`. This keeps the key secure and removes the need to write it in the code here. The `OpenAI()` client used later reads this environment variable by default.

In [7]:
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
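
Because `os.getenv` returns `None` when a variable is missing, a misspelled name only shows up later as an authentication error. A minimal sketch of failing early instead; `require_env` is a hypothetical helper introduced here for illustration, not part of `os` or `openai`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set.")
    return value


# Hypothetical usage in this notebook (commented out so the sketch runs without a key):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

Raising at this point keeps the error close to its cause rather than inside a later API call.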

## Task 2: Download the YouTube Video

After creating the setup, the first step is to download the video from YouTube and convert it to an audio file (.mp3).

We'll download the tutorial video selected in Task 0.

We will do this using the `youtube_url` set in Task 0 and an `output_dir` variable for the directory where the file will be stored.

The `yt_dlp` package lets us download and convert in a few steps, but it requires some configuration, and the audio conversion relies on `ffmpeg` being installed. This code is provided to you.

Lastly, we will look in the `output_dir` for any .mp3 files and store them in a list called `audio_files`, which will be used later to send the audio to the Whisper model for transcription.

Create the following:
- A variable called `output_dir` for the directory where the audio files will be saved. For this tutorial, set it to `files/audio/`. The video URL is the `youtube_url` from Task 0. In the future, you can change these values.
- Use the `ydl_config` that is provided to you

In [8]:
# Directory to store the downloaded video
output_dir = "files/audio/"

# Config for youtube-dl
ydl_config = {
    "format": "bestaudio/best",
    "postprocessors": [
        {
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }
    ],
    "outtmpl": os.path.join(output_dir, "%(title)s.%(ext)s"),
    "verbose": True
}

# Check if the output directory exists, if not create it
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    
# Print a message indicating which video is being downloaded
print(f"Downloading video from {youtube_url}.")

# Attempt to download the video using the specified configuration
# If a DownloadError occurs, attempt to download the video again
try:
    with youtube_dl.YoutubeDL(ydl_config) as ydl:
        ydl.download([youtube_url])
except DownloadError as DE:
    print(f"An error occurred: {DE}")
    with youtube_dl.YoutubeDL(ydl_config) as ydl:
        ydl.download([youtube_url])

[debug] Encodings: locale cp1252, fs utf-8, pref cp1252, out UTF-8 (No VT), error UTF-8 (No VT), screen UTF-8 (No VT)
[debug] yt-dlp version stable@2023.07.06 [b532a3481] (pip) API
[debug] params: {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'outtmpl': 'files/audio/%(title)s.%(ext)s', 'verbose': True, 'compat_opts': set()}
[debug] Python 3.8.3 (CPython AMD64 64bit) - Windows-10-10.0.19041-SP0 (OpenSSL 1.1.1g  21 Apr 2020)
[debug] exe versions: ffmpeg 2023-01-16-git-01f46f18db-full_build-www.gyan.dev (setts), ffprobe 2023-01-16-git-01f46f18db-full_build-www.gyan.dev
[debug] Optional libraries: Cryptodome-3.19.1, brotli-None, certifi-2022.12.07, mutagen-1.47.0, sqlite3-2.6.0, websockets-12.0
[debug] Proxy map: {}


Downloading video from https://www.youtube.com/watch?v=tFXm5ijih98.


[debug] Loaded 1855 extractors


[youtube] Extracting URL: https://www.youtube.com/watch?v=tFXm5ijih98
[youtube] tFXm5ijih98: Downloading webpage
[youtube] tFXm5ijih98: Downloading ios player API JSON
[youtube] tFXm5ijih98: Downloading android player API JSON
[youtube] tFXm5ijih98: Downloading m3u8 information


[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, size, br, asr, vext, aext, hasaud, id


[info] tFXm5ijih98: Downloading 1 format(s): 251


[debug] Invoking http downloader on "https://rr4---sn-8vq54voxn25po-mgte.googlevideo.com/videoplayback?expire=1706975080&ei=CAu-ZcuTCdnMp-oPmoOFUA&ip=84.125.157.205&id=o-ANJe-MRUDKu3VxcAf6v0_yaaJNTmLxmNcevCnAoNkDfT&itag=251&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=ho&mm=31%2C29&mn=sn-8vq54voxn25po-mgte%2Csn-h5qzen7d&ms=au%2Crdu&mv=m&mvi=4&pcm2cms=yes&pl=21&initcwndbps=2563750&vprv=1&svpuc=1&mime=audio%2Fwebm&gir=yes&clen=31439839&dur=2169.881&lmt=1692885148814198&mt=1706953029&fvip=4&keepalive=yes&fexp=24007246&c=ANDROID&txp=6308224&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cxpc%2Cvprv%2Csvpuc%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=AJfQdSswRQIhAPm6meqFfDhe_DJs2oDQ-evh6If-ClcaMnXnfHoYnf8QAiBXhFZr3yNJaYleomH1bb9Sj1snWTI4O5_R7fRdE-JDfA%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpcm2cms%2Cpl%2Cinitcwndbps&lsig=AAO5W4owRAIgXEJDWr3HhEyI70Y4DIlCoaY-9IMtgvDLZ3PQDtwS5OsCIFiQDzhm_aeiGWyX-ZmwZf-3eRhEbuZ41z5DuyyEHFeH"
[debug] File locking is not supported. Proceeding

[download] Destination: files\audio\LangSmith Tutorial - LLM Evaluation for Beginners.webm
[download] 100% of   29.98MiB in 00:00:25 at 1.20MiB/s   


[debug] ffmpeg command line: ffprobe -show_streams "file:files\audio\LangSmith Tutorial - LLM Evaluation for Beginners.webm"


[ExtractAudio] Destination: files\audio\LangSmith Tutorial - LLM Evaluation for Beginners.mp3


[debug] ffmpeg command line: ffmpeg -y -loglevel "repeat+info" -i "file:files\audio\LangSmith Tutorial - LLM Evaluation for Beginners.webm" -vn -acodec libmp3lame "-b:a" 192.0k -movflags "+faststart" "file:files\audio\LangSmith Tutorial - LLM Evaluation for Beginners.mp3"


Deleting original file files\audio\LangSmith Tutorial - LLM Evaluation for Beginners.webm (pass -k to keep)
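
The download cell above retries exactly once, and a second failure is raised from inside the `except` block. A minimal sketch of a bounded retry with an optional pause between attempts; `retry` is a hypothetical helper written for this example and is not part of `yt_dlp`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller
            print(f"Attempt {attempt} failed: {error}. Retrying...")
            time.sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")


# Hypothetical usage with the names from the download cell:
# def download():
#     with youtube_dl.YoutubeDL(ydl_config) as ydl:
#         ydl.download([youtube_url])
#
# retry(download, (DownloadError,), attempts=3, delay_seconds=5)
```

Passing the exception types explicitly means unrelated errors, such as a typo in the configuration, are not retried.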


To find the audio files, we will use the `glob` module to look in the `output_dir` for any .mp3 files and collect them in a list called `audio_files`. This will be used later to send the audio to the Whisper model for transcription.

Create the following:
- A variable called `audio_files` that uses the `glob` module to find all files with the `.mp3` file extension
- Select the first file in the list and assign it to `audio_filename`
- To verify the filename, print `audio_filename`

In [9]:
# Find the audio file in the output directory

# Find all the audio files in the output directory
audio_files = glob.glob(os.path.join(output_dir, "*.mp3"))

# Select the first audio file in the list
audio_filename = audio_files[0]

# Print the name of the selected audio file
print(audio_filename)

files/audio\LangSmith Tutorial - LLM Evaluation for Beginners.mp3
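
`glob.glob` returns matches in arbitrary order, and indexing an empty list raises `IndexError`, so `audio_files[0]` is only safe while the directory holds exactly one `.mp3`. A sketch of one alternative, assuming you want the most recently modified file; `newest_mp3` is a hypothetical helper.

```python
import glob
import os
from typing import Optional


def newest_mp3(directory: str) -> Optional[str]:
    """Return the most recently modified .mp3 in `directory`, or None if there is none."""
    candidates = glob.glob(os.path.join(directory, "*.mp3"))
    if not candidates:
        return None
    # Pick the file with the latest modification time
    return max(candidates, key=os.path.getmtime)


# Hypothetical usage with the notebook's variable:
# audio_filename = newest_mp3(output_dir)
```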


## Task 3: Transcribe the Video using Whisper

In this step we will take the downloaded and converted YouTube video and send it to the Whisper model to be transcribed. The solution below first splits the audio into six parts with `pydub`, because the Whisper API limits the size of each uploaded file (25 MB at the time of writing), and then sends each part to the model.

In this step we will:
- Create a list to store the transcripts
- Read each audio file
- Send each file to the Whisper model using the `openai` package

To complete this step, create the following:
- A variable named `audio` that loads the `audio_filename` from the last step with `AudioSegment.from_mp3`
- Six files, `part_1.mp3` to `part_6.mp3`, each holding one sixth of the audio
- A list named `audio_segments` with the names of those files
- An empty list called `transcripts`
- A variable named `client` that is assigned `OpenAI()`
- For each file, open it with the `open` function in `"rb"` mode and pass it, together with the model name `"whisper-1"`, to `client.audio.transcriptions.create`
- Append each returned transcript to the `transcripts` list

In [11]:
# Split the audio file into smaller parts before transcription
# !pip install pydub

from pydub import AudioSegment

# Load the MP3 file found in the previous step
mp3_file = audio_filename
audio = AudioSegment.from_mp3(mp3_file)

# Length of the audio in milliseconds and the duration of each part
length = len(audio)
part_duration = length // 6

# Split the audio into 6 parts and save them
for i in range(6):
    start = i * part_duration
    end = start + part_duration
    part = audio[start:end]
    part_name = f"part_{i+1}.mp3"
    part.export(part_name, format="mp3")

print("The file has been split into 6 parts.")

The file has been split into 6 parts.
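
Because `part_duration` comes from integer division, the split above can leave out a few milliseconds at the end when the length is not a multiple of 6. A minimal sketch of computing the boundaries so that the last part absorbs the remainder; `part_bounds` is a hypothetical helper, and the resulting pairs could be used to slice a `pydub` segment as `audio[start:end]`.

```python
from typing import List, Tuple


def part_bounds(length_ms: int, n_parts: int) -> List[Tuple[int, int]]:
    """Split [0, length_ms) into n_parts contiguous (start, end) ranges in milliseconds."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base = length_ms // n_parts
    bounds = []
    for i in range(n_parts):
        start = i * base
        # The last part absorbs the remainder left over by integer division
        end = length_ms if i == n_parts - 1 else start + base
        bounds.append((start, end))
    return bounds


print(part_bounds(1003, 4))  # [(0, 250), (250, 500), (500, 750), (750, 1003)]
```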


In [14]:
audio_segments = ["part_1.mp3", "part_2.mp3", "part_3.mp3", "part_4.mp3", "part_5.mp3", "part_6.mp3"]
transcripts = []

In [15]:
from tqdm import tqdm

client = OpenAI()

for audio_segment in tqdm(audio_segments, desc="Converting audio files"):
    print(f"Converting {audio_segment} to text...\n")
    with open(audio_segment, "rb") as file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", 
            file=file
        )
        transcripts.append(transcript)     

Converting audio files:   0%|                                                                    | 0/6 [00:00<?, ?it/s]

Converting part_1.mp3 to text...



Converting audio files:  17%|██████████                                                  | 1/6 [00:21<01:45, 21.06s/it]

Converting part_2.mp3 to text...



Converting audio files:  33%|████████████████████                                        | 2/6 [00:40<01:20, 20.08s/it]

Converting part_3.mp3 to text...



Converting audio files:  50%|██████████████████████████████                              | 3/6 [01:00<01:00, 20.22s/it]

Converting part_4.mp3 to text...



Converting audio files:  67%|████████████████████████████████████████                    | 4/6 [01:20<00:39, 20.00s/it]

Converting part_5.mp3 to text...



Converting audio files:  83%|██████████████████████████████████████████████████          | 5/6 [01:41<00:20, 20.39s/it]

Converting part_6.mp3 to text...



Converting audio files: 100%|████████████████████████████████████████████████████████████| 6/6 [01:59<00:00, 19.94s/it]


In [16]:
print(transcripts)

[Transcription(text="Now, if you are building applications with large language models, then LangSmith is definitely a platform you can not ignore. So in this video, I'm going to introduce you to LangSmith. And now LangSmith is a platform for building production grade large language models applications. Now, in today's tutorial, we will be going over what LangSmith is really starting from scratch at the beginner level. And then we will be going over some concepts like data sets and evaluation to really understand how to use this platform. Now, as always, I will be walking you through every step of the way within VS Code, showing all the code examples, showing you how to do this. And also all of the code will be made available on the GitHub repository that I will link in the description. And now, quick note, LangSmith is currently still in private beta. So if you don't have access yet already, make sure to sign up and get on that wait list. And now, for those of you that are new here, my

In [17]:
print(type(transcripts[1].text))

<class 'str'>


In [18]:
full_transcripts = " ".join([t.text for t in transcripts])

In [19]:
print(full_transcripts)

Now, if you are building applications with large language models, then LangSmith is definitely a platform you can not ignore. So in this video, I'm going to introduce you to LangSmith. And now LangSmith is a platform for building production grade large language models applications. Now, in today's tutorial, we will be going over what LangSmith is really starting from scratch at the beginner level. And then we will be going over some concepts like data sets and evaluation to really understand how to use this platform. Now, as always, I will be walking you through every step of the way within VS Code, showing all the code examples, showing you how to do this. And also all of the code will be made available on the GitHub repository that I will link in the description. And now, quick note, LangSmith is currently still in private beta. So if you don't have access yet already, make sure to sign up and get on that wait list. And now, for those of you that are new here, my name is Dave Abelard

To save the full transcript to a text file, we will use the code provided below:

In [21]:
# Specify the filename
txt_file_name = 'full_audio_transcription.txt'

# Open the file in write mode ('w') and write the string to it
with open(txt_file_name, 'w') as file:
    file.write(full_transcripts)

print(f"Transcription saved to {txt_file_name}")

Transcription saved to full_audio_transcription.txt
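
If you would rather keep transcripts in a subdirectory, the directory has to exist before the file is opened for writing. A small sketch, assuming a hypothetical helper `save_text` and an explicit UTF-8 encoding so the result does not depend on the platform default.

```python
from pathlib import Path


def save_text(text: str, path: str) -> Path:
    """Write `text` to `path` as UTF-8, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


# Hypothetical usage with the notebook's variable:
# save_text(full_transcripts, "files/transcripts/transcript.txt")
```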


## Task 4: Create a TextLoader using LangChain 

In order to use text or other types of data with LangChain we must first convert that data into Documents. This is done by using loaders. In this tutorial, we will use the `TextLoader` that will take the text from our transcript and load it into a document. 

---
In LangChain, a document is essentially a piece of text along with associated metadata. You can create a document object by importing the `Document` class from the `langchain.schema` module and then passing the text (`page_content`) and metadata to the constructor of this class. The text is the main content that interacts with the language model, and the metadata can include information such as the source of the document.

Document loaders in LangChain are used for importing data from various sources and converting them into document objects. These loaders can handle different types of input, such as a simple text file, the text contents from a web page, or even transcriptions from videos. The loaders provide a method called `load` which is used to import data as a document from a pre-configured source.

Furthermore, LangChain offers Document Chains, which are sets of tools that allow for efficient processing and analysis of large amounts of text data. These chains can be used for a range of purposes, such as summarizing documents, answering questions over documents, and extracting information from them.  

---
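
To make the idea of a loader concrete, here is an illustrative stand-in written in plain Python. `SimpleDocument` and `load_text_file` are hypothetical names and this is not LangChain's implementation; the sketch only mirrors the shape used later in this notebook, where `docs[0].page_content` holds the text and `docs[0].metadata['source']` holds the file path.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDocument:
    """Illustrative stand-in for a document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_text_file(path: str) -> List[SimpleDocument]:
    """Read one text file and wrap it in a single-element list of documents."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    # Record where the text came from so answers can be traced back to a source
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```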

To complete this step, do the following: 
- Import `TextLoader` from `langchain.document_loaders`
- Create a variable called `loader` that uses the `TextLoader` class, which takes the path of the transcript file saved in Task 3 (`txt_file_name`)
- Create a variable called `docs` that is assigned the result of calling the `loader.load()` method. 

In [None]:
# Import the TextLoader class from the langchain.document_loaders module
from langchain.document_loaders import TextLoader


# Create a new instance of the TextLoader class, pointing it at the transcript file saved in Task 3
loader = TextLoader(txt_file_name)


# Load the transcript into a list of documents using the TextLoader instance
docs = loader.load()


In [None]:
# Show the first element metadata of docs to verify it has been loaded 
docs[0].metadata['source']

In [None]:
# Show the first element of docs to verify it has been loaded 
#docs[0].page_content

## Task 5: Creating an In-Memory Vector Store 

Now that we have created a Document from the transcription, we will store that Document in a vector store. A vector store holds an embedding (a numeric vector) for each piece of text, which lets us find the text most similar to a query by comparing the distance between vectors. 

For large amounts of data, it is best to use a dedicated vector database. Since we are only using one transcript for this tutorial, we can create an in-memory vector store using the `docarray` package. 

We will also tokenize our text using the `tiktoken` package. This means that the text will be separated into smaller parts such as words, parts of words, or characters. Each of these parts is assigned a token, which helps the model "understand" the text and its relationships with other tokens. 
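
As a rough illustration of similarity search over embeddings, the sketch below ranks a toy store by cosine similarity, one common distance measure. The three-dimensional vectors are invented for the example (real embeddings have many more dimensions), and `cosine_similarity` and `most_similar` are hypothetical helpers rather than `docarray` functions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 1,
) -> List[str]:
    """Return the texts of the k stored vectors closest to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings
toy_store = [
    ("LangSmith evaluation", [0.9, 0.1, 0.0]),
    ("Football results", [0.0, 0.2, 0.9]),
]
print(most_similar([1.0, 0.0, 0.0], toy_store))  # ['LangSmith evaluation']
```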

### Instructions

- Import the `tiktoken` package. 

In [None]:
# Import the tiktoken package
import tiktoken

## Task 6: Create the Document Search 

We will now use LangChain to complete some important operations to create the question-and-answer experience. Let's import the following: 

- Import `RetrievalQA` from `langchain.chains` - this chain first retrieves documents from an assigned retriever and then runs a QA chain to answer questions over those documents 
- Import `ChatOpenAI` from `langchain.chat_models` - this imports the ChatOpenAI model that we will use to query the data 
- Import `DocArrayInMemorySearch` from `langchain.vectorstores` - this gives the ability to search over the vector store we will create. 
- Import `OpenAIEmbeddings` from `langchain.embeddings` - this will create embeddings for the data stored in the vector store. 
- Import `display` and `Markdown` from `IPython.display` - this will create formatted responses to the queries.

In [None]:
# Import the RetrievalQA class from the langchain.chains module
from langchain.chains import RetrievalQA

# Import the ChatOpenAI class from the langchain.chat_models module
from langchain.chat_models import ChatOpenAI

# Import the DocArrayInMemorySearch class from the langchain.vectorstores module
from langchain.vectorstores import DocArrayInMemorySearch

# Import the OpenAIEmbeddings class from the langchain.embeddings module
from langchain.embeddings import OpenAIEmbeddings

from IPython.display import display, Markdown

Now we will create a vector store using `DocArrayInMemorySearch`, which searches through the embeddings created by the `OpenAIEmbeddings` class. 

To complete this step: 
- Create a variable called `db`
- Assign the `db` variable to store the result of the method `DocArrayInMemorySearch.from_documents`
- In the `DocArrayInMemorySearch.from_documents` call, pass in the `docs` and an `OpenAIEmbeddings()` instance

In [None]:
# Create a new DocArrayInMemorySearch instance from the specified documents and embeddings
db = DocArrayInMemorySearch.from_documents(
    docs,
    OpenAIEmbeddings())

We will now create a retriever from the `db` we created in the last step. This enables the retrieval of the stored documents. Since we are also using the `ChatOpenAI` model, we will assign that as our LLM.

Create the following: 
- A variable called `retriever` that is assigned `db.as_retriever()`
- A variable called `llm` that creates a `ChatOpenAI` instance with a `temperature` of `0.0`, which minimizes the variability in the responses we receive from the LLM, and `max_tokens` of `100`, which caps the length of each response. 
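
To give some intuition for why a lower temperature reduces variability, the sketch below applies the standard temperature-scaled softmax to a few made-up scores. It is an illustration only and makes no claim about how the OpenAI API implements the parameter; `softmax_with_temperature` is a hypothetical helper, and it rejects a temperature of exactly zero because the scaling is undefined there.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert scores to probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]  # Made-up scores for three candidate tokens
print(softmax_with_temperature(scores, 1.0))
print(softmax_with_temperature(scores, 0.1))
```

At the lower temperature almost all of the probability goes to the highest-scoring candidate, so repeated sampling gives nearly the same result each time.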

In [None]:
# Convert the DocArrayInMemorySearch instance to a retriever
retriever = db.as_retriever()

# Create a new ChatOpenAI instance with a temperature of 0.0
llm = ChatOpenAI(temperature=0,
                max_tokens=100)

Our last step before starting to ask questions is to create the `RetrievalQA` chain. This chain takes in:  
- The `llm` we want to use
- The `chain_type`, which is how the retrieved documents are passed to the model; `"stuff"` places them all into a single prompt
- The `retriever` that we have created 
- An option called `verbose` that allows us to see the separate steps of the chain 

Create a variable called `qa_stuff`. This variable will be assigned the result of calling `RetrievalQA.from_chain_type`. 

Use the following settings inside this method: 
- `llm=llm`
- `chain_type="stuff"`
- `retriever=retriever`
- `verbose=True`
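
As a rough picture of what the `"stuff"` chain type does, the sketch below joins every retrieved passage into one prompt and appends the question. The prompt wording and the helper name `build_stuff_prompt` are invented for illustration and are not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, passages: List[str]) -> str:
    """Concatenate all retrieved passages into one prompt followed by the question."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt(
    "What is this tutorial about?",
    ["First retrieved passage.", "Second retrieved passage."],
))
```

Since everything goes into a single prompt, this approach only works while the retrieved text fits within the model's context window.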

In [None]:
# Create a new RetrievalQA instance with the specified parameters
qa_stuff = RetrievalQA.from_chain_type(
    
    # The ChatOpenAI instance to use for generating responses
    llm=llm,
    
    # The type of chain to use for the QA system
    chain_type="stuff",
    
    # The retriever to use for retrieving relevant documents
    retriever=retriever,
    
    # Whether to print verbose output during retrieval and generation
    verbose=True
)

## Task 7: Create the Queries 

Now we are ready to create queries about the YouTube video and read the responses from the LLM. This is done by first creating a query and then passing it to the `RetrievalQA` chain we set up in the last step. 

To create the questions to ask the model, complete the following steps: 
- Create a variable called `query` and assign it the string value `"What is this tutorial about?"`
- Create a `response` variable that will store the result of `qa_stuff.run(query)` 
- Show the `response`

In [None]:
# Set the query to be used for the QA system
query = "What is this tutorial about?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)
response

We can continue creating queries, including ones we know are not answered in this video, to see how the model responds. 

In [None]:
# Set the query to be used for the QA system
query = "What is the difference between a training set and test set?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)
response

In [None]:
# Set the query to be used for the QA system
query = "Who should watch this lesson?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)
response

In [None]:
# Set the query to be used for the QA system
query = "Who is the greatest football/soccer team on earth?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)
response

In [None]:
# Set the query to be used for the QA system
query = "How long is the circumference of the earth?"

# Run the query through the RetrievalQA instance and store the response
response = qa_stuff.run(query)
response
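
The query cells above repeat the same two lines. A minimal sketch of running a batch of questions in one go, assuming a hypothetical helper `ask_all` that accepts any function mapping a question string to an answer string, such as `qa_stuff.run`.

```python
from typing import Callable, Dict, Iterable


def ask_all(ask: Callable[[str], str], queries: Iterable[str]) -> Dict[str, str]:
    """Run each query through `ask` and collect the answers keyed by query."""
    return {query: ask(query) for query in queries}


# Hypothetical usage with the chain built earlier:
# answers = ask_all(qa_stuff.run, [
#     "What is this tutorial about?",
#     "Who should watch this lesson?",
# ])
# for question, answer in answers.items():
#     print(f"{question}\n{answer}\n")
```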