# LangChain: LLM + YouTube Transcriptions

* https://github.com/leegonzales/LangChainExamples

Modified/Checked:

21 Feb 2023: Jon Chun

# Query YouTube video transcripts and return timestamps as sources to support the answers, by [@m_morzywolek](https://twitter.com/m_morzywolek)

## Setup

In [None]:
# First set runtime to GPU

In [1]:
!pip install pytube # For audio downloading

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytube
  Downloading pytube-12.1.2-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-12.1.2


In [2]:
!pip install git+https://github.com/openai/whisper.git -q # Whisper from OpenAI transcription model

  Preparing metadata (setup.py) ... done
  Building wheel for openai-whisper (setup.py) ... done


In [3]:
import whisper 
import pytube 

## Set YouTube URL and Get Transcript

In [4]:
url = "https://www.youtube.com/watch?v=UF8uR6Z6KLc&ab_channel=Stanford"
video = pytube.YouTube(url)

In [5]:
audio = video.streams.get_audio_only()
audio.download(filename='tmp.mp3') # Downloads only the audio track from the YouTube video

'/content/tmp.mp3'

In [6]:
model = whisper.load_model("small")

100%|████████████████████████████████████████| 461M/461M [00:04<00:00, 107MiB/s]


In [7]:
%%time

# NOTE: 01m20s on 20230221 @ 16:32 Tues for 15m YouTube Video: https://www.youtube.com/watch?v=UF8uR6Z6KLc&ab_channel=Stanford

transcription = model.transcribe('/content/tmp.mp3')

In [8]:
res = transcription['segments']

In [9]:
from datetime import datetime, timezone

def store_segments(segments):
  texts = []
  start_times = []

  for segment in segments:
    text = segment['text']
    start = segment['start']  # Offset from the start of the audio, in seconds

    # Interpret the offset as a UTC timestamp so the local time zone cannot shift the hours
    start_datetime = datetime.fromtimestamp(start, tz=timezone.utc)

    # Format the starting time as a string in the format "00:00:00"
    formatted_start_time = start_datetime.strftime('%H:%M:%S')

    texts.append(text)
    start_times.append(formatted_start_time)

  return texts, start_times

In [10]:
store_segments(res)

([' This program is brought to you by Stanford University.',
  ' Please visit us at stanford.edu.',
  ' Thank you.',
  " I'm honored to be with you today for your commencement from one of the finest universities",
  ' in the world.',
  " Truth be told, I never graduated from college and this is the closest I've ever gotten",
  ' to a college graduation.',
  ' Today I want to tell you three stories from my life.',
  " That's it.",
  ' No big deal.',
  ' Just three stories.',
  ' The first story is about connecting the dots.',
  ' I dropped out of Reed College after the first six months but then stayed around as a drop-in',
  ' for another 18 months or so before I really quit.',
  " So why'd I drop out?",
  ' It started before I was born.',
  ' My biological mother was a young unwed graduate student and she decided to put me up for adoption.',
  ' She felt very strongly that I should be adopted by college graduates so everything was all',
  ' set for me to be adopted at birth by a lawyer

In [11]:
texts, start_times = store_segments(res)
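
The `start` value of each segment is an offset in seconds from the beginning of the audio, so it can also be formatted with integer arithmetic alone, without going through `datetime`. The sketch below is one way to do that; `format_timestamp` and the sample segments are hypothetical names and data for illustration, not part of the original notebook.

```python
def format_timestamp(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS (fractions of a second are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Hypothetical segments shaped like the 'segments' list used above
sample_segments = [
    {"start": 0.0, "text": " This program is brought to you by Stanford University."},
    {"start": 347.2, "text": " (placeholder text)"},
]

for seg in sample_segments:
    print(format_timestamp(seg["start"]), seg["text"].strip())
```

Because no clock or time zone is involved, the result is the same on any machine, and offsets longer than 24 hours do not wrap around.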

## Create LangChain: LLM + FAISS Dense Vector Similarity Search

In [12]:
!pip install langchain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.92-py3-none-any.whl (288 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting marshmallow-enum<2.0.0,>=1.5.1
  Downloading marshmallow_enum-1.5.1-py2.py3-none-any.whl (4.2 kB)
Collecting typing-inspect>=0.4.0
  Downloading typing_inspect-0.8.0-py3-none-any.whl (8.7 kB)
Collecting mypy-extensions>=0.3.0
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Installing collected packages: mypy-extensions, typing-inspect, marshmallow-enum, dataclasses-json, langchain
Successfully installed dataclasses-json-0.5.7 langchain-0.0.92 marshmallow-enum-1.5.1 mypy-extensions-1.0.0 typing-inspect-0.8.0


In [13]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.26.5.tar.gz (55 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: openai
  Building wheel for openai (pyproject.toml) ... done
  Created wheel for openai: filename=openai-0.26.5-py3-none-any.whl size=67620 sha256=354042d432c3cfb1c781cba81b88e4c639f1d49948d3b37084128eaeaa350da3
  Stored in directory: /root/.cache/pip/wheels/a7/47/99/8273a59fbd59c303e8ff175416d5c1c9c03a2e83ebf7525a99
Successfully built openai
Installing collected packages: openai
Successfully installed openai-0.26.5


In [14]:
!pip install --upgrade faiss-gpu==1.7.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu==1.7.1
  Downloading faiss_gpu-1.7.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (89.7 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.1


In [17]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from langchain.chains import VectorDBQAWithSourcesChain
from langchain import OpenAI
import openai
import faiss

## Set OpenAI API Key

In [None]:
# Sign up for an OpenAI API key at www.openai.com/api

In [25]:
import os
from getpass import getpass

OPENAI_API_KEY = getpass('Enter your OpenAI key: ')
# print(f'OPENAI_API_KEY is: {OPENAI_API_KEY}')

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

Enter your OpenAI key: ··········


In [27]:
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
docs = []
metadatas = []
for i, d in enumerate(texts):
    splits = text_splitter.split_text(d)
    docs.extend(splits)
    metadatas.extend([{"source": start_times[i]}] * len(splits))
embeddings = OpenAIEmbeddings()
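
Each Whisper segment is usually a sentence or less, so with `chunk_size=1500` most segments will likely pass through the splitter unchanged and become one document each. If larger chunks with more context are wanted, one option is to merge consecutive segments before embedding and keep the first segment's timestamp as the source. The following is a minimal sketch under that assumption; `group_segments` and `max_chars` are hypothetical names, not part of LangChain or the original notebook.

```python
def group_segments(texts, start_times, max_chars=500):
    """Merge consecutive segments into chunks of roughly max_chars characters.

    Each chunk keeps the start time of its first segment as its source.
    A single segment longer than max_chars becomes its own oversized chunk.
    """
    docs, metadatas = [], []
    current, current_start, current_len = [], None, 0

    for text, start in zip(texts, start_times):
        piece = text.strip()
        # Close the current chunk if adding this piece would exceed the budget
        if current and current_len + len(piece) + 1 > max_chars:
            docs.append(" ".join(current))
            metadatas.append({"source": current_start})
            current, current_len = [], 0
        if not current:
            current_start = start
        current.append(piece)
        current_len += len(piece) + 1  # +1 for the joining space

    # Flush whatever is left after the last segment
    if current:
        docs.append(" ".join(current))
        metadatas.append({"source": current_start})

    return docs, metadatas
```

The returned `docs` and `metadatas` have the same shape as the lists built in the cell above. The trade-off is coarser timestamps: a source then points to the start of a chunk rather than to the exact sentence.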

In [28]:
store = FAISS.from_texts(docs, embeddings, metadatas=metadatas)
faiss.write_index(store.index, "docs.index")

In [29]:
chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(temperature=0), vectorstore=store)

In [30]:
# Attach GDrive for permanent storage

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Eight Q&A Test Questions

In [43]:
# The first run downloads the tokenizer files (vocab, merges, tokenizer, config) that LangChain uses

result = chain({"question": "How old was Steve Jobs when started Apple?"})

Token indices sequence length is longer than the specified maximum sequence length for this model (1576 > 1024). Running this sequence through the model will result in indexing errors


In [44]:
# Q&A Test #1

print(f"Answer: {result['answer']}  Sources: {result['sources']}")

Answer:  Steve Jobs was 20 when he started Apple.
  Sources: 00:05:47


In [50]:
# Q&A Test #2

my_question = "Where was Apple started?"
result = chain({"question": my_question})

print(f"\n\nQuestion: {my_question}")
print(f"\n\nAnswer: {result['answer']}")

# NOTE: type(result['sources']) = str in format "timestamp, timestamp"
print(f"\nBased on Time Stamp: {result['sources']}")  

Token indices sequence length is longer than the specified maximum sequence length for this model (1605 > 1024). Running this sequence through the model will result in indexing errors




Question: Where was Apple started?


Answer:  Apple was started in Waz and Steve Jobs' parents' garage.


Based on Time Stamp: 00:05:47, 00:05:51
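
Since `result['sources']` is a comma-separated string of `HH:MM:SS` timestamps, it can be turned into links that open the video at the cited moments. The sketch below assumes the common `t=<seconds>s` query parameter on YouTube watch URLs; `timestamp_to_seconds` and `sources_to_links` are hypothetical helper names introduced for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def timestamp_to_seconds(timestamp: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def sources_to_links(sources: str, video_url: str) -> list:
    """Build one watch URL per timestamp in a 'ts, ts, ...' sources string."""
    # Assumes video_url is a watch URL that carries the video id in 'v'
    video_id = parse_qs(urlparse(video_url).query)["v"][0]
    links = []
    for timestamp in sources.split(","):
        if not timestamp.strip():
            continue  # Skip empty entries, e.g. when no sources were returned
        query = urlencode({"v": video_id, "t": f"{timestamp_to_seconds(timestamp)}s"})
        links.append(f"https://www.youtube.com/watch?{query}")
    return links


url = "https://www.youtube.com/watch?v=UF8uR6Z6KLc&ab_channel=Stanford"
for link in sources_to_links("00:05:47, 00:05:51", url):
    print(link)
```

The parser expects every entry to be a well-formed `HH:MM:SS` value. The model's sources string is free text, so real use may need to handle entries that do not match that pattern.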


In [53]:
# Q&A Test #3

my_question = "Who were the first employees of Apple?"
result = chain({"question": my_question})

print(f"\n\nQuestion: {my_question}")
print(f"\n\nAnswer: {result['answer']}")

# NOTE: type(result['sources']) = str in format "timestamp, timestamp"
print(f"\nBased on Time Stamp: {result['sources']}")  


Token indices sequence length is longer than the specified maximum sequence length for this model (1586 > 1024). Running this sequence through the model will result in indexing errors




Question: Who were the first employees of Apple?


Answer:  The first employees of Apple were Steve Wozniak and Steve Jobs.


Based on Time Stamp: 00:05:47, 00:05:51


In [54]:
# Q&A Test #4

my_question = "What makes a great entrepreneur?"
result = chain({"question": my_question})

print(f"\n\nQuestion: {my_question}")
print(f"\n\nAnswer: {result['answer']}")

# NOTE: type(result['sources']) = str in format "timestamp, timestamp"
print(f"\nBased on Time Stamp: {result['sources']}")  

Token indices sequence length is longer than the specified maximum sequence length for this model (1578 > 1024). Running this sequence through the model will result in indexing errors




Question: What makes a great entrepreneur?


Answer:  A great entrepreneur has the courage to follow their heart and intuition.


Based on Time Stamp: 00:12:46


In [55]:
# Q&A Test #5

my_question = "How do you deal with the fear of failure?"
result = chain({"question": my_question})

print(f"\n\nQuestion: {my_question}")
print(f"\n\nAnswer: {result['answer']}")

# NOTE: type(result['sources']) = str in format "timestamp, timestamp"
print(f"\nBased on Time Stamp: {result['sources']}")  

Token indices sequence length is longer than the specified maximum sequence length for this model (1598 > 1024). Running this sequence through the model will result in indexing errors




Question: How do you deal with the fear of failure?


Answer:  To deal with the fear of failure, have faith and believe that the dots will connect down the road, which will give you the confidence.


Based on Time Stamp: 00:05:21, 00:08:16, 00:09:47


In [56]:
# Q&A Test #6

my_question = "What is the most important goal in life?"
result = chain({"question": my_question})

print(f"\n\nQuestion: {my_question}")
print(f"\n\nAnswer: {result['answer']}")

# NOTE: type(result['sources']) = str in format "timestamp, timestamp"
print(f"\nBased on Time Stamp: {result['sources']}")  

Token indices sequence length is longer than the specified maximum sequence length for this model (1592 > 1024). Running this sequence through the model will result in indexing errors




Question: What is the most important goal in life?


Answer:  The most important goal in life is to have the courage to follow your heart and intuition.


Based on Time Stamp: 00:12:46


In [57]:
# Q&A Test #7

my_question = "Do you fear failure?"
result = chain({"question": my_question})

print(f"\n\nQuestion: {my_question}")
print(f"\n\nAnswer: {result['answer']}")

# NOTE: type(result['sources']) = str in format "timestamp, timestamp"
print(f"\nBased on Time Stamp: {result['sources']}")  

Token indices sequence length is longer than the specified maximum sequence length for this model (1584 > 1024). Running this sequence through the model will result in indexing errors




Question: Do you fear failure?


Answer:  I don't know.


Based on Time Stamp: 00:09:47


In [58]:
# Q&A Test #8

my_question = "How do you deal with people who doubt you?"
result = chain({"question": my_question})

print(f"\n\nQuestion: {my_question}")
print(f"\n\nAnswer: {result['answer']}")

# NOTE: type(result['sources']) = str in format "timestamp, timestamp"
print(f"\nBased on Time Stamp: {result['sources']}")  

Token indices sequence length is longer than the specified maximum sequence length for this model (1592 > 1024). Running this sequence through the model will result in indexing errors




Question: How do you deal with people who doubt you?


Answer:  Don't lose faith and don't let the noise of others' opinions drown out your own inner voice.


Based on Time Stamp: 00:08:16, 00:12:42


## Loop over This Section to Repeatedly Ask Questions based on Transcript

In [39]:
my_question = input("Enter a question to get an answer based upon the YouTube transcript: ")
print(f'\nYour question: {my_question}\n')

Enter a question to get an answer based upon the YouTube transcript: What makes a great entrepreneur?

Your question: What makes a great entrepreneur?



In [40]:
result = chain({"question": my_question})
print(f"Answer: {result['answer']}  Sources: {result['sources']}")

Token indices sequence length is longer than the specified maximum sequence length for this model (1578 > 1024). Running this sequence through the model will result in indexing errors


Answer:  A great entrepreneur has the courage to follow their heart and intuition.
  Sources: 00:12:46
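
Instead of re-running the two cells above by hand, the prompt and the query can be wrapped in a loop that stops on an empty question. This is a minimal sketch; `ask_loop`, `read_question`, and `write` are hypothetical names, and the function only assumes that `chain` can be called with `{"question": ...}` and returns a dict with `answer` and `sources` keys, as in the cells above.

```python
def ask_loop(chain, read_question=input, write=print):
    """Ask questions until a blank line is entered; return how many were answered."""
    answered = 0
    while True:
        question = read_question("Enter a question (blank to quit): ").strip()
        if not question:
            break
        result = chain({"question": question})
        write(f"Answer: {result['answer'].strip()}")
        write(f"Sources: {result['sources']}")
        answered += 1
    return answered
```

In the notebook this would be started with `ask_loop(chain)`. The `read_question` and `write` parameters exist only so the loop can be exercised without a live model or keyboard input.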
