# YouTube Video Summarizer

A tool that extracts key takeaways from YouTube videos can be built in two main stages: 1) transcribing the YouTube audio with Whisper; 2) summarizing the transcript with LangChain's summarization techniques (`stuff`, `refine`, and `map_reduce`).

The **stuff** approach is the simplest one: the text of all documents is placed in a single prompt. It fails if the combined text is longer than the context size of the LLM. The **map-reduce** and **refine** approaches handle longer documents by working on chunks. The **map-reduce** method summarizes each chunk independently and then summarizes the partial summaries; because the per-chunk calls do not depend on each other, they can be parallelized, which usually makes this method faster. The **refine** approach is sequential: it keeps a running summary and updates it with each new chunk, so it is slower than map-reduce, but it can produce more coherent, context-aware results. Choose between them by weighing speed against quality.
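A minimal, LLM-free sketch of the control flow behind the three strategies may make the trade-offs easier to see. The function names, the `max_chars` limit, and the `CountingSummarizer` stand-in are hypothetical illustrations, not LangChain APIs; a real chain would call an LLM where the stand-in simply truncates text.

```python
class CountingSummarizer:
    """Hypothetical stand-in for an LLM call: truncates text and counts calls."""

    def __init__(self, keep=40):
        self.keep = keep
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return text[: self.keep]


def summarize_stuff(chunks, summarize, max_chars):
    # Stuff: everything goes into one prompt, so it can exceed the context size.
    prompt = "\n\n".join(chunks)
    if len(prompt) > max_chars:
        raise ValueError("combined text does not fit into a single prompt")
    return summarize(prompt)


def summarize_map_reduce(chunks, summarize):
    # Map: each chunk is summarized independently (these calls could run in parallel).
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: the partial summaries are combined and summarized once more.
    return summarize("\n\n".join(partial_summaries))


def summarize_refine(chunks, summarize):
    # Refine: a running summary is updated with one chunk at a time, in order.
    if not chunks:
        return ""
    summary = summarize(chunks[0])
    for chunk in chunks[1:]:
        summary = summarize(summary + "\n\n" + chunk)
    return summary
```

With three chunks, this sketch makes one call for stuff, four calls for map-reduce (three map calls plus one reduce call), and three strictly sequential calls for refine.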

**Whisper** is an automatic speech recognition system developed by OpenAI. It was trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

STEPS to implement:
- Download the desired YouTube audio file;
- Transcribe the audio with the help of Whisper;
- Summarize the transcribed text using LangChain (either stuff, or refine, or map_reduce);
- Add the transcripts of multiple videos to a Deep Lake database and retrieve information from it with semantic search.

In [1]:
# SETUP

# !pip install langchain==0.0.208 deeplake openai tiktoken
# !pip install -q yt_dlp
# !pip install -q git+https://github.com/openai/whisper.git

######################################

# MacOS (requires https://brew.sh/)
#brew install ffmpeg

# Ubuntu
#sudo apt install ffmpeg
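Whisper decodes audio by calling the `ffmpeg` command-line tool, so it may be worth checking that the binary is visible from the notebook environment before continuing. A small sketch using only the standard library (the helper name `ffmpeg_available` is hypothetical):

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg was not found on PATH; install it before running the transcription cells.")
```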

In [2]:
import sys, os
sys.path.append('..')

from keys import OPENAI_API_KEY, ACTIVELOOP_TOKEN
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN

In [3]:
import yt_dlp

# Define function to download video from YouTube to a local file
def download_mp4_from_youtube(url, filename):
    # Set the options for the download
    ydl_opts = {
        'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]',
        'outtmpl': filename,
        'quiet': True,
    }
    # Download the video file
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        result = ydl.extract_info(url, download=True)

        
url = "https://www.youtube.com/watch?v=mBjPyte2ZZo"
filename = '../data/lecuninterview.mp4'

download_mp4_from_youtube(url, filename)
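The cell above downloads the full video, while only the audio track is needed for transcription. As a possible alternative, the sketch below builds `yt_dlp` options for an audio-only download. The helper names and the choice of `mp3` are assumptions made for illustration; the option keys (`format`, `outtmpl`, `postprocessors` with `FFmpegExtractAudio`) are standard `yt_dlp` options. Because the post-processor changes the file extension, the output template uses `%(ext)s` instead of a fixed `.mp4` name.

```python
def build_audio_only_opts(out_template: str) -> dict:
    """Build yt_dlp options that fetch the best audio stream and convert it to mp3."""
    return {
        "format": "bestaudio/best",
        "outtmpl": out_template,  # e.g. '../data/lecuninterview.%(ext)s'
        "quiet": True,
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",  # requires ffmpeg to be installed
                "preferredcodec": "mp3",
            }
        ],
    }


def download_audio_from_youtube(url: str, out_template: str) -> None:
    # Imported lazily so that the options helper can be used without yt_dlp.
    import yt_dlp

    with yt_dlp.YoutubeDL(build_audio_only_opts(out_template)) as ydl:
        ydl.download([url])
```

Calling `download_audio_from_youtube(url, '../data/lecuninterview.%(ext)s')` would be expected to leave an `.mp3` file that can be passed to `model.transcribe()` in the same way as the `.mp4` file.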

The Whisper package that we installed provides the `load_model()` function, which downloads and loads a model, and the model's `transcribe()` method, which transcribes an audio or video file. Several model sizes are available: `tiny`, `base`, `small`, `medium`, and `large`. Larger models are generally more accurate but slower.

In [4]:
import whisper

model = whisper.load_model("base")

filename = '../data/lecuninterview.mp4'
result = model.transcribe(filename)



**Note**: If an SSL certificate error is raised while running the code above, have a look at the solution [here](https://stackoverflow.com/questions/68275857/urllib-error-urlerror-urlopen-error-ssl-certificate-verify-failed-certifica).

In [5]:
# Print out a chunk of the result
print(result['text'][:500])

 Hi, I'm Craig Smith and this is I on A On. This week I talked to Jan LeCoon, one of the seminal figures in deep learning development and a long time proponent of self-supervised learning. Jan spoke about what's missing in large language models and about his new joint embedding predictive architecture which may be a step toward filling that gap. He also talked about his theory of consciousness and the potential for AI systems to someday exhibit the features of consciousness. It's a fascinating c
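Besides `result['text']`, the dictionary returned by `transcribe()` also contains a `segments` list whose entries carry `start`, `end`, and `text` fields. The sketch below formats those segments as timestamped lines. The helper name `format_segments` and the `fake_result` dictionary are hypothetical; the stand-in is used only so that the example runs without Whisper.

```python
def format_segments(result, limit=5):
    """Return up to `limit` transcript segments as '[start - end] text' lines."""
    lines = []
    for segment in result.get("segments", [])[:limit]:
        start, end = segment["start"], segment["end"]
        lines.append(f"[{start:.1f}s - {end:.1f}s] {segment['text'].strip()}")
    return lines


# Placeholder shaped like the output of model.transcribe(); not real transcript data.
fake_result = {
    "text": " Hello world.",
    "segments": [{"start": 0.0, "end": 4.5, "text": " Hello world."}],
}
print("\n".join(format_segments(fake_result)))
```

In the notebook, `format_segments(result)` could be called on the real `result` to locate the part of the video a passage comes from.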


In [6]:
# Save result to a text file
with open('../output/text.txt', 'w') as file:
    file.write(result['text'])

In [9]:
# Load the LangChain utilities needed for the summarization step
from langchain import OpenAI, LLMChain  # LLM wrapper and basic prompt-plus-LLM chain
from langchain.chains.mapreduce import MapReduceChain  # map-reduce chain over text chunks
from langchain.prompts import PromptTemplate   # to construct prompts
from langchain.chains.summarize import load_summarize_chain  # to build a summarization chain

# Initialize an instance of OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0)  

In [8]:
# Split input text into smaller chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
)

with open('../output/text.txt') as f:
    text = f.read()

texts = text_splitter.split_text(text)
docs = [Document(page_content=t) for t in texts[:5]]  # only the first 5 chunks are used in this example
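To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-width splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break the text at the given separators; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


print(split_with_overlap("abcdefghij", chunk_size=4))
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as in the cell above, neighbouring chunks share no text, so a sentence cut at a chunk boundary is split between two documents. A non-zero overlap repeats some text in both chunks at the cost of more chunks.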

In [9]:
from langchain.chains.summarize import load_summarize_chain
import textwrap   

chain = load_summarize_chain(llm, chain_type="map_reduce")
output_summary = chain.run(docs)

# Format and print the output
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)

 Jan Le Ka is a professor at New York University and Chief AI Scientist at Fair, a fundamental AI
research lab. He has been researching self-supervised learning, which has revolutionized natural
language processing by using transformer architectures for pre-training. His latest paper is on
joint embedding predictive architecture and how it relates to large language models. Self-supervised
learning is a technique used to train large neural networks to predict missing words in a piece of
text, and has been used to train large language models to predict the next word. However, attempts
to transfer self-supervised learning methods from language processing to images have not been
successful, and the only successful approach has been to generate representations of images instead
of predicting the image itself.


In [10]:
# To see the prompt template that is used with the map_reduce technique
print( chain.llm_chain.prompt.template )

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:


In [11]:
# Experimenting with the prompt
prompt_template = """Write a concise bullet point summary of the following:


{text}


CONCISE SUMMARY IN BULLET POINTS:"""

BULLET_POINT_PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

In [12]:
chain = load_summarize_chain(llm, 
                             chain_type="stuff", 
                             prompt=BULLET_POINT_PROMPT)

output_summary = chain.run(docs)

wrapped_text = textwrap.fill(output_summary, 
                             width=1000,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)


- Jan LeCoon is a seminal figure in deep learning development and a long time proponent of self-supervised learning
- Discussed what's missing in large language models and his new joint embedding predictive architecture
- Theory of consciousness and potential for AI systems to exhibit features of consciousness
- Self-supervised learning revolutionized natural language processing
- Large language models lack a world model and generative models are difficult to represent uncertain predictions
- Successful in audio but not images, so need to predict a representation of the image


In [13]:
# Generating more accurate and context-aware summaries with 'refine'
# It generates the summary of the first chunk; 
# Then, for each successive chunk, the summary is integrated with new info from the new chunk.
chain = load_summarize_chain(llm, chain_type="refine")

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)

  Craig Smith interviews Jan LeCoon, a deep learning developer and proponent of self-supervised
learning, about his new joint embedding predictive architecture and his theory of consciousness. Jan
discusses the gap in large language models and the potential for AI systems to exhibit features of
consciousness. He explains how self-supervised learning has revolutionized natural language
processing through the use of transformer architectures for pre-training, such as taking a piece of
text, removing some of the words, and replacing them with black markers to train a large neural net
to predict the words that are missing. This technique has been used in practical applications such
as contact moderation systems on Facebook, Google, YouTube, and more. Jan also explains how this
technique can be used to represent uncertain predictions in generative models, such as predicting
the missing words in a text, or predicting the missing frames in a video. He further explains that
while this techniqu

**Working with multiple video URLs. Adding Transcripts to DeepLake.**

In [17]:
# Loading video files from multiple URLs
import yt_dlp

def download_mp4_from_youtube(urls, job_id):
    # This will hold the titles and authors of each downloaded video
    video_info = []

    for i, url in enumerate(urls):
        # Set the options for the download
        file_temp = f'../data/{job_id}_{i}.mp4'
        ydl_opts = {
            'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]',
            'outtmpl': file_temp,
            'quiet': True,
        }

        # Download the video file
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            result = ydl.extract_info(url, download=True)
            title = result.get('title', "")
            author = result.get('uploader', "")

        # Add the title and author to our list
        video_info.append((file_temp, title, author))

    return video_info


urls=["https://www.youtube.com/watch?v=mBjPyte2ZZo&t=78s",
    "https://www.youtube.com/watch?v=cjs7QKJNVYM",]

videos_details = download_mp4_from_youtube(urls, 1)

In [19]:
import whisper

# Load the transcription model 
model = whisper.load_model("base")

# Iterate through each video and transcribe
results = []

for video in videos_details:
    print(f"Transcribing {video[0]}")
    result = model.transcribe(video[0])
    results.append( result['text'] )
    # print(f"Transcription for {video[0]}:\n{result['text']}\n")

Transcribing ./data/1_0.mp4




Transcribing ./data/1_1.mp4




In [20]:
# Save obtained transcriptions to .txt file
with open('../output/mult_text.txt', 'w') as file:
    for r in results:
        file.write(r)

In [21]:
# Load the texts from the file and split them into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the texts
with open('../output/mult_text.txt') as f:
    text = f.read()

# Split 
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, 
                                               chunk_overlap=0, 
                                               separators=[" ", ",", "\n"])

texts = text_splitter.split_text(text)

In [22]:
# Wrap the chunks in Document objects
from langchain.docstore.document import Document

docs = [Document(page_content=t) for t in texts[:10]] # only the first 10 chunks will be saved to the DB

In [23]:
# Build a DeepLake database with embedded documents
from langchain.vectorstores import DeepLake
from langchain.embeddings.openai import OpenAIEmbeddings


my_activeloop_org_id = "iryna"
my_activeloop_dataset_name = "youtube_summarizer_db"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"


embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
db.add_documents(docs)

Using embedding function is deprecated and will be removed in the future. Please use embedding instead.


Your Deep Lake dataset has been successfully created!



Dataset(path='hub://iryna/youtube_summarizer_db', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
 embedding  embedding  (10, 1536)  float32   None   
    id        text      (10, 1)      str     None   
 metadata     json      (10, 1)      str     None   
   text       text      (10, 1)      str     None   


 

['493d4494-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d457a-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d45c0-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d45f2-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d4624-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d4656-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d467e-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d46b0-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d46e2-3b9d-11ee-8b7c-12ee7aa5dbdc',
 '493d470a-3b9d-11ee-8b7c-12ee7aa5dbdc']

In [38]:
# Construct a retriever object
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['k'] = 4  # retrieve the 4 most relevant documents
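Conceptually, the retriever embeds the query and returns the `k` stored chunks whose embeddings are closest to it under the chosen metric (here cosine). The sketch below shows that idea with plain Python lists. It is an illustration only, not Deep Lake's implementation, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings"; real ada-002 embeddings have 1536 dimensions.
doc_vecs = [[1, 0], [0, 1], [1, 1], [-1, 0]]
print(top_k([1, 0], doc_vecs, k=2))
```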

In [39]:
# Construct the prompt template for the QA chain
from langchain.prompts import PromptTemplate

prompt_template = """Use the following pieces of transcripts from a video to answer the question in a summarized manner. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Summarized answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
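With `chain_type="stuff"`, the retrieved chunks are concatenated and substituted into `{context}`, and the user's question goes into `{question}`. The sketch below mimics that assembly with plain string formatting. The helper `build_stuff_prompt` is hypothetical, and the blank-line separator is an assumption about the chain's default behaviour rather than a call into LangChain.

```python
QA_TEMPLATE = """Use the following pieces of transcripts from a video to answer the question in a summarized manner.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Summarized answer:"""


def build_stuff_prompt(chunks, question, template=QA_TEMPLATE, separator="\n\n"):
    """Join the retrieved chunks and fill the QA template with them and the question."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Placeholder chunks standing in for retrieved transcript pieces.
print(build_stuff_prompt(["first chunk", "second chunk"], "What is discussed?"))
```

Printing the assembled prompt like this can help when checking whether `k` retrieved chunks of 1000 characters each still fit into the model's context size.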

In [40]:
from langchain.chains import RetrievalQA

chain_type_kwargs = {"prompt": PROMPT}

qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=retriever,
                                 chain_type_kwargs=chain_type_kwargs)


print(qa.run("What is said about Google company?"))

 Google company is mentioned as an example of a practical application of self-supervised learning, which is used for contact moderation systems.


<hr>
<a class="anchor" id="resources">
    
## Additional Resources
    
</a>

- [Textwrap Package](https://docs.python.org/3/library/textwrap.html)
- [Introducing Whisper](https://openai.com/research/whisper)
- [Deep Lake Vector Store in LangChain](https://docs.activeloop.ai/tutorials/vector-store/deep-lake-vector-store-in-langchain)